Biostat Compiled

This document discusses key concepts in biostatistics and data collection. It defines biostatistics as the study of statistics related to biological, social, and environmental factors impacting health. Data refers to a set of numerical observations, while information is the organized form of data. Primary data is originally collected, while secondary data has already undergone statistical analysis. The sources of health data include experiments, surveys, records, clinical practice, and external sources. Direct observation, questionnaires, and electronic media are some methods for collecting data.

Uploaded by

Majgsjq

BY

DR ABDUL RAUF
ASSISTANT PROFESSOR
• The word statistics is defined in three different ways.
Firstly:
• "Statistics are numerical facts systematically collected with some definite object".
• For example, the figures 60, 62, 64, 65, 68 on their own are not statistics, but they become STATISTICS once they are identified as the heights of the students in a class, e.g.
a) Statistics of deaths & births,
b) Statistics of educational institutions in Sargodha, etc.
Secondly:
• Statistics is defined as "the science of systematic collection, classification, tabulation, presentation, analysis and interpretation of numerical data".
Thirdly:
• "Any numerical quantity (such as mean, median, mode & standard deviation) computed from a SAMPLE is known as a statistic" (singular).
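As a minimal sketch of the third definition, the snippet below computes a few statistics from a sample using Python's standard `statistics` module. The sample reuses the five height figures quoted earlier; the variable names are illustrative only.

```python
import statistics

# The five height figures quoted in the first definition, treated as a sample.
heights = [60, 62, 64, 65, 68]

sample_mean = statistics.mean(heights)      # arithmetic mean
sample_median = statistics.median(heights)  # middle value of the sorted sample
sample_sd = statistics.stdev(heights)       # sample standard deviation (n - 1 denominator)

print(sample_mean, sample_median, round(sample_sd, 2))
```

Each printed number is a "statistic" in the singular sense: one quantity computed from the sample.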
Study of statistics in relation to biological science (such as biological, social and environmental factors) is known as "BIO-STATISTICS".
Study of statistics in relation to health and disease of human populations, and the different factors related to them, is known as "Health statistics".
Study of statistics in relation to the vital events of life, such as births, marriages, deaths, divorces, etc., is known as "vital statistics", which in turn is a branch of "Demography", which deals with the study of human populations.
1. Helps in effective comparisons between two groups or two countries.
2. Helps in measurement of the health status of a community in terms of rates, ratios, proportions, etc., which in turn helps in comparison with other countries and helps to study the influencing factors.
For example, the prevalence of typhoid fever is higher among people of poor socio-economic status, living in unhygienic areas with unsafe water supply, not protected through immunization, and so on. Thus, by a systematic analysis of the factors related to the disease, a health worker or a health administrator can define the problems in terms of contrast.
3. Helps in estimating the magnitude of a health problem.
4. Helps public health personnel in analyzing the causes of public health problems, including epidemics.
5. Helps in monitoring & evaluation of control measures and also in introducing midcourse correction measures, wherever necessary.
6. Helps in health planning and management.
7. Helps in research.
Thus biostatistics, if properly recorded, constitutes the "Eyes and Ears" of a health worker; otherwise it would be like "sailing in a ship without a compass".
An inherent feature of all biological observations is their variability, e.g. every individual differs from every other. Similarly, each group of individuals is different from other groups. For example, the pulse rate, hemoglobin level and number of white cells vary from person to person. They also vary from one group to another, e.g. the pulse rate among infants differs from that of the old age group.
VARIABLE:
• Any numerical (N) value which varies from one individual to another is called a variable
OR
• It is a characteristic or attribute that varies from person to person, from place to place & from time to time.
e.g. Height of students in the class.
Weight of school boys.
• Variables are usually represented by the last letters of the English alphabet: X, Y, Z.
Other examples: Prices of goods.
Number of children in a family.
CONSTANT:
• Any numerical quantity which is fixed
OR
• "A constant is any fixed quantity that has a single value"
OR
• "A quantity which can assume only one value is called a constant"
e.g. π ≈ 22/7 (≈ 3.14), 'g' = 9.8 m/second²
Variables may be Qualitative, Quantitative, Continuous, Discrete, Dichotomous (Binary) & Polytomous.
QUALITATIVE VARIABLE
A characteristic which varies only in quality from one individual to another is called a "qualitative variable", e.g. beauty, intelligence, severity of disease, color, ABO blood group, gender. It is also called an attribute or categorical variable.
QUANTITATIVE VARIABLE
A characteristic which can be measured numerically & varies from one individual to another, e.g. height, weight, B.P, temperature of patients, hemoglobin level, blood sugar level, mid-arm circumference, body mass index (BMI), serum cholesterol level.
DISCRETE VARIABLE
A variable is called discrete variable if it can take some selected
values in a given interval
OR
A variable whose value is taken from some counting process
e.g. number of patients in a ward, rooms in a house , trees in
a row
CONTINUOUS VARIABLE
If the variable takes any value within an interval that variable is
called continuous variable.
OR
The variable whose value is taken from some measuring
process e.g. B.P, temperature, height & weight of patients
DICHOTOMOUS (BINARY) VARIABLE
It is a variable that has only two possible values.
Examples
Gender
Weight more than 80 kg
Obesity
Rh blood group
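The example "weight more than 80 kg" shows how a dichotomous variable can be derived from a quantitative one. A minimal sketch, assuming a hypothetical list of weights (the values and names below are illustrative, not taken from the lecture):

```python
# Hypothetical weights in kg; only the 80 kg cut-off comes from the example above.
weights_kg = [62.5, 81.0, 79.9, 95.2, 80.0]

CUTOFF_KG = 80

# A dichotomous variable takes exactly two values: here True or False.
over_80 = [w > CUTOFF_KG for w in weights_kg]

print(over_80)
```

A weight of exactly 80 kg is not "more than 80 kg", so it falls in the False group.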
POLYTOMOUS VARIABLE
It is a variable that has more than two possible values.
Examples
ABO blood group
Weight
Height
CATEGORICAL SCALES
NOMINAL SCALE
Based on NOM (names), no specific order, e.g. race/ethnicity, religion, sex of child/gender, ABO blood group, country of birth, type of anemia.
ORDINAL SCALE
Based on ORD (order), grading into categories, e.g. severity of disease, social classes, socioeconomic status.

DIMENSIONAL SCALES
METRIC SCALE
Based on ME (measurement), in terms of quantities, e.g. blood glucose, mid-arm circumference, hemoglobin level, weight, height, blood pressure, pulse rate.
Metric scale is of 2 types:
Interval scale (absence of absolute 0): no ratios are possible, e.g. Centigrade/Fahrenheit temperature scale.
Ratio scale (presence of absolute 0): ratios are possible, e.g. weight, height.

• Statistically, the most preferable scale of measurement is the "metric scale".
• Statistically, the least preferable scale of measurement is the "nominal scale".
EXERCISE
• Severity of anemia
• Type of anemia
• Hemoglobin level & serum ferritin level
EXERCISE
• Weight
• Height
• BP
• Pulse rate
• Temperature
• ABO blood group
• Rhesus blood group
• Gender
BIO-STATISTICS - 2

DATA
BY
DR.ABDUL RAUF
Data is the plural of the word datum. A set of numerical observations is called "data", like height of students, temperature of patients, blood pressure of patients, number of paramedical staff.
DATA MATRIX: It is a kind of platform on which we present primary data.
INFORMATION
The organized or optimized form of data is called "information".
DATA REDUCTION
The process of converting raw data into a manageable form so that some statistical analysis can be done.
TYPES OF DATA
Primary data: Data collected for the first time, OR data which has not yet gone through statistical processing, is called primary data.
Secondary data: Data which has already been collected previously, OR data which has already gone through statistical processing, is called secondary data.
Bi-variate data have exactly 2 pieces of information for each item, i.e. only two pieces of information are taken simultaneously from each individual, e.g. salt intake & BP of patients, height & weight of students.
Multivariate data have 3 or more pieces of information for each item, i.e. more than two pieces of information are taken simultaneously from each individual, e.g. age, weight & height of patients.
Nominal data = data based on a nominal scale or variable.
Ordinal data = data based on an ordinal scale.
Ratio / interval data = data based on a ratio/interval scale.
Time series data = A set of ordered data values observed at successive points in time, e.g. hourly measurement of a patient's temperature.
Cross-sectional data = A set of data values observed at a fixed point in time, e.g. measurement of temperature in the morning for a group of patients.
Ungrouped data = data in the original form (without frequencies) are referred to as ungrouped data.
Grouped data = data presented in the form of a frequency distribution are called grouped data.
SOURCES OF DATA
The main sources of medical or health related data are:
1. Experiments/trials
2. Surveys/observations
3. Records/registration
4. Clinical practice
5. External sources
COLLECTION OF DATA
1. Direct personal observations
2. Indirect personal observations
3. Questionnaire method
4. Data collection through enumerators
5. Data collection through local sources
6. Electronic media
CLASS = an interval of values within the given data is called a class.
CLASS INTERVALS = when the upper limit of a class does not coincide with the lower limit of the next class, the classes are called class intervals, e.g. 0-10, 11-20, 21-30, 31-40.
CLASS BOUNDARIES = when the upper limit of a class coincides with the lower limit of the next class, they are called class boundaries, e.g. 0-10, 10-20, 20-30.
CLASS LIMITS = the smallest & largest values of any given class of a frequency distribution are called class limits, e.g. 20 and 40 for the class 20-40.
CLASS MAGNITUDE = the difference between the two limits of a class is called class magnitude, e.g. 40 − 20 = 20.
FREQUENCY (class frequency) = the number of observations in any class, denoted by "f".
FREQUENCY DISTRIBUTION
The arrangement of statistical data according to size or magnitude is called a "frequency distribution". There are 3 types of frequency distribution.
1. Individual series
2. Discrete series
3. Continuous series
INDIVIDUAL SERIES = In an individual series items are arranged singly according to their size, e.g.
Marks (x): 14 16 13 17 19 15
DISCRETE SERIES = In a discrete series items are capable of exact measurement. They are separate, complete and are arranged according to their size, e.g.
Marks (x):            10  20  30  40  50
No. of students (f):   2   5  17   6   4
CONTINUOUS SERIES
In a continuous series items are not capable of exact measurement but vary from one point to another, and are arranged according to their size, e.g.
Marks (C.B):          5-10  10-15  15-20  20-25
No. of students (f):  3     9      10     7
Individual series: ungrouped data
Discrete series + continuous series: grouped data
A frequency distribution is also described as the arrangement of statistical data against their respective frequencies, e.g.
Age groups:       0-10  10-20  20-30  30-40  40-50
No. of patients:  5     8      15     18     13
STEPS OF CONSTRUCTION
1. Determine the number of classes.
2. Determine the magnitude of classes.
3. Construct the class limit.
4. Locate values in each limit.
1. Determine the number of classes.
The number of classes depends upon the number of items in the data; however, it should not be less than 5 & not more than 20. The number of classes can also be calculated by the following formula:
k = 1 + 3.322 log n
If n = 100 observations, then k = 1 + 3.322 log 100
k = 1 + 3.322(2) = 7.644 ≈ 8 classes
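The rule for the number of classes can be sketched in a few lines. This is a minimal illustration, assuming the usual Sturges constant 3.322 (1/log10 2), a base-10 logarithm, and rounding up to the next whole class; the function name is hypothetical.

```python
import math

def sturges_classes(n: int) -> int:
    """Number of classes suggested by k = 1 + 3.322 * log10(n), rounded up."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(1 + 3.322 * math.log10(n))

print(sturges_classes(100))  # 1 + 3.322 * 2 = 7.644, rounded up
```

The result is only a starting point; the 5-to-20 guideline still applies.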
2. Determine the magnitude of classes.
To find the magnitude of a class, first find the range (difference between the maximum & minimum values), then divide the range by the number of classes required.
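A minimal sketch of this step, assuming the result is rounded up to a whole number so that the classes cover the full range (the rounding choice and the function name are assumptions, not part of the lecture):

```python
import math

def class_magnitude(minimum: float, maximum: float, k: int) -> int:
    """Range divided by the number of classes, rounded up to a whole number."""
    value_range = maximum - minimum
    return math.ceil(value_range / k)

# Illustrative figures: smallest value 53, largest value 97, 10 classes wanted.
print(class_magnitude(53, 97, 10))  # range 44 / 10 = 4.4, rounded up
```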
3. Construct the class limits.
The class limits should be as close to the minimum & maximum values as possible.
4. Locate values in each class limit.
Counting of items against each class can be done in 2 ways:
= By actual listing of elements
= By tally sheet method
The marks of 80 MBBS students in Bio-statistics are as under.
Make a frequency distribution, grouping in intervals of 5 marks, e.g. 50-54, 55-59, etc.

68 84 75 82 68 90 62 88 76 93
73 79 88 73 60 93 71 59 85 75
61 65 75 87 74 62 95 78 63 72
66 78 82 75 94 77 69 74 68 60
96 78 89 61 75 95 60 79 83 71
79 62 67 97 78 85 76 65 71 75
65 80 73 57 88 78 62 76 53 74
86 67 73 81 72 63 76 75 85 77
KEY: MAXIMUM VALUE = 97, MINIMUM VALUE = 53
C-I     ITEMS                                                                                   f
50-54   53                                                                                      1
55-59   59, 57                                                                                  2
60-64   62, 60, 61, 62, 63, 60, 61, 60, 62, 62, 63                                              11
65-69   68, 68, 65, 66, 69, 68, 67, 65, 65, 67                                                  10
70-74   73, 73, 71, 74, 72, 74, 71, 71, 73, 74, 73, 72                                          12
75-79   75, 76, 79, 75, 75, 78, 78, 75, 77, 78, 75, 79, 79, 78, 76, 76, 78, 76, 76, 75, 77      21
80-84   84, 82, 82, 83, 80, 81                                                                  6
85-89   88, 88, 85, 87, 89, 85, 88, 86, 85                                                      9
90-94   90, 93, 93, 94                                                                          4
95-99   95, 96, 95, 97                                                                          4
Summation                                                                                       80
C-I     Tally sheet                  f
(IIII/ denotes a completed group of five)
50-54   I                            1
55-59   II                           2
60-64   IIII/ IIII/ I                11
65-69   IIII/ IIII/                  10
70-74   IIII/ IIII/ II               12
75-79   IIII/ IIII/ IIII/ IIII/ I    21
80-84   IIII/ I                      6
85-89   IIII/ IIII                   9
90-94   IIII                         4
95-99   IIII                         4
∑                                    80
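The listing and tally methods can be cross-checked mechanically. The sketch below is one possible way to do it with `collections.Counter`; the marks follow the key above (which lists an 81 in the 80-84 class), and the constant and function names are illustrative assumptions.

```python
from collections import Counter

marks = [
    68, 84, 75, 82, 68, 90, 62, 88, 76, 93,
    73, 79, 88, 73, 60, 93, 71, 59, 85, 75,
    61, 65, 75, 87, 74, 62, 95, 78, 63, 72,
    66, 78, 82, 75, 94, 77, 69, 74, 68, 60,
    96, 78, 89, 61, 75, 95, 60, 79, 83, 71,
    79, 62, 67, 97, 78, 85, 76, 65, 71, 75,
    65, 80, 73, 57, 88, 78, 62, 76, 53, 74,
    86, 67, 73, 81, 72, 63, 76, 75, 85, 77,
]

WIDTH = 5    # class magnitude asked for in the exercise
START = 50   # lower limit of the first class

def class_label(value: int) -> str:
    """Return the class interval (e.g. '50-54') that a mark falls into."""
    lower = START + ((value - START) // WIDTH) * WIDTH
    return f"{lower}-{lower + WIDTH - 1}"

frequency = Counter(class_label(m) for m in marks)

for lower in range(START, 100, WIDTH):
    label = f"{lower}-{lower + WIDTH - 1}"
    print(label, frequency[label])
```

The printed frequencies should add up to the number of students, which is a quick check that no mark was missed or counted twice.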
CLASSIFICATION
The process or method of arranging the heterogeneous data into
homogenous classes or groups is called classification.
By homogenous we mean that like should go with like and unlike
should go with unlike.
Generally there are two types of classification.
(a)- Classification according to attribute.
(b)- Classification according to class interval.
(a) Classification according to attribute.
When statistical data is classified on basis of descriptive
characteristics , such classification is called classification
according to attribute.
This can be done in two ways.
1.Simple classification.
2.Manifold classification.
Simple classification
In simple classification only one attribute is taken into account, which is further subdivided into two classes and no more, e.g. if we are required to study the poverty of a place, then the single attribute can be subdivided into two classes:
1. The people who are poor.
2. The people who are not poor.
Manifold classification
If we are required to study more than one attribute, then we make use of manifold classification, in which each attribute is further subdivided into two classes and no more, e.g. if we are required to study the poverty of a place sex-wise, then the two attributes poverty and sex can be classified as:
1. Males who are poor.
2. Females who are poor.
3. Males who are not poor.
4. Females who are not poor.
(b) CLASSIFICATION ACCORDING TO CLASS INTERVAL
When the data is classified on the basis of numerical characteristics, it is known as classification according to class interval, e.g. if we are required to study the weight of blood sugar patients, then the patients who weigh from 110 lbs to 120 lbs are placed in one group, those who weigh from 120 lbs to 130 lbs are placed in a 2nd group, & so on.
TABULATION
The process or method of arranging statistical data into rows & columns is called tabulation.
There are 4 types of tabulation.
1. One-way table
2. Two-way table
3. Three-way table
4. Higher-order table
ONE-WAY TABLE
The table which provides only one piece of information is called a one-way table.

DHQs HOSPITALS   NO. OF PATIENTS ADMITTED
Sargodha         1950
Khushab          1200
Mianwali         1600
TWO-WAY TABLE
The table which provides two inter-related pieces of information in one table is called a two-way table.

DHQs HOSPITALS   NO. OF PATIENTS (RURAL)   NO. OF PATIENTS (URBAN)   TOTAL
Sargodha         1200                      750                       1950
Khushab          850                       350                       1200
Mianwali         1000                      600                       1600
THREE-WAY TABLE
The table which provides 3 inter-related pieces of information in one table is called a three-way table.

NO. OF PATIENTS ADMITTED
DHQs HOSPITALS   RURAL MALE   RURAL FEMALE   URBAN MALE   URBAN FEMALE   TOTAL
Sargodha         700          500            350          400            1950
Khushab          500          300            250          150            1200
Mianwali         650          450            300          200            1600

HIGHER ORDER TABLE
The table which provides more than 3 inter related information
in a table, is called higher order table.
DIAGRAMMATIC REPRESENTATION OF DATA
It is the visual representation of statistical data by means of diagrams. It is a very easy method of presenting large amounts of numerical data as pictures, without going into the figures themselves. There are 3 types of diagrammatic representation of data.
1. DIMENSIONAL DIAGRAMS
2. PICTOGRAMS/PICTURE DIAGRAMS
3. CARTOGRAMS/MAP DIAGRAMS/SPOT MAPS
DIMENSIONAL DIAGRAMS
In dimensional diagrams, 3 dimensions, i.e. length, breadth & thickness, are considered.
Following are the dimensional diagrams:
1. ONE DIMENSIONAL DIAGRAMS
2. TWO DIMENSIONAL DIAGRAMS
3. THREE DIMENSIONAL DIAGRAMS
ONE DIMENSIONAL DIAGRAM
In a one dimensional diagram only one dimension, i.e. length, is taken into account, which is represented by a thin line or a bar. The bars are of equal width with equal space in between. One dimensional diagrams are of the following forms:
1. Simple bar diagram
2. Multiple bar diagram
3. Sub-divided (component) bar diagram
SIMPLE BAR DIAGRAM
In a simple bar diagram only one dimension, i.e. length, is taken into account, which is represented by a bar. Thus the number of bars depends upon the number of figures in the data. For drawing a simple bar diagram, the following steps are taken:
1) Arrange the data in ascending or descending order of magnitude.
2) Select a suitable scale to present the length of the bars.
3) Fix the width of the bars according to the space available on the graph.
4) Keep equal intervals between the bars.
5) Make the bars attractive with colors.
[Figure: simple bar diagram comparing ITALY, UK, USSR and USA]
MULTIPLE BAR DIAGRAM
If more than one inter-related piece of information is required in one diagram, then we make use of the multiple bar diagram.
In multiple bar diagrams the simple bars are placed side by side to provide more than one piece of information in the same diagram. For the sake of distinction the bars should be colored.
[Figure: multiple bar diagram, Series 1 to Series 3 across Category 1 to Category 4]
SUB-DIVIDED BAR DIAGRAM
If more than one piece of information is required in the same diagram, we make use of the sub-divided bar diagram.
Sub-divided bar diagrams are those in which each bar represents the total of the components and is then divided according to the size of each item. The various sub-divisions are then colored for the purpose of distinction.
[Figure: sub-divided bar diagram, Series 1 to Series 3 stacked across Category 1 to Category 4]
TWO DIMENSIONAL DIAGRAM
In a two dimensional diagram two dimensions, i.e. length and breadth, are taken into account, which are represented by a square, rectangle or circle.
As length and breadth are both taken into account, two dimensional diagrams are also called area diagrams.
Area diagrams can be further sub-divided into 2 categories:
a) Simple area diagram
b) Sub-divided area diagram
Simple area diagrams are,
a)Square diagram
b)Rectangle diagram
c)Circle diagram
Sub-divided diagrams are,
a) Sub-divided/component square diagram
b) Sub-divided/component rectangle diagram
c) Sub-divided circle diagram(pie diagram)
SIMPLE RECTANGLE DIAGRAM
In a simple rectangle diagram two dimensions, i.e. length and breadth, are taken into account because the area of a rectangle is equal to the product of its length and breadth. There are two methods of drawing simple rectangles:
a) Keeping the width equal, i.e. constant, and the height, i.e. length, proportional to the size of the figures.
b) Keeping the length equal, i.e. constant, and the width proportional to the size of the figures.
SUB-DIVIDED RECTANGLE DIAGRAM
In order to compare two or more quantities as well as their components, we make use of the sub-divided rectangle diagram. These diagrams are generally drawn to compare the budgets of various families.
For construction of sub-divided rectangle diagrams, the following steps are involved:
1. Convert each component into a percentage of the corresponding total.
2. Take the length of each rectangle equal to 100 and its width proportional to the given total.
3. Divide each length according to the computed percentages.
4. Use colour to distinguish the various subdivisions of each rectangle.
PIE DIAGRAM
It is an improvement over a bar diagram. It is a circular diagram in which the frequencies of observations are shown as sectors or wedges of a circle, the size of each sector being proportional to the frequency. The angle in degrees denotes the frequency, and the area of each sector gives the comparative difference at a glance.
"Pie means a piece or a sector"
To draw a pie diagram, first a circle is drawn and a radius is marked. A second radius is drawn clockwise at an angle to the first radius, the angle depending on the sector, which can be calculated by the following formula:
Angle of any sector = (no. of observations in the sector / total no. of observations) × 360°
The sectors should be arranged clockwise in either ascending or descending order of magnitude. It is often necessary to indicate the percentages for easy comparison.
The pie diagram can be made more attractive by giving it a 3 dimensional effect. Each sector can be sliced out from the main diagram to highlight it.
Sectors are then outlined by coloring or shading.
Thus a pie diagram is more attractive.
[Figure: pie diagrams of Sales by quarter: 1st Qtr 58%, 2nd Qtr 23%, 3rd Qtr 10%, 4th Qtr 9%]
The results of a study of domestic accidents were as follows. Out of 180 domestic accidents, 60 occurred in the kitchen, 50 in the bed room, 40 under the staircase & 30 in the drawing room.
Degree
Angle for accidents in kitchen         = 60/180 × 360 = 120°
Angle for accidents in bed room        = 50/180 × 360 = 100°
Angle for accidents under staircase    = 40/180 × 360 = 80°
Angle for accidents in drawing room    = 30/180 × 360 = 60°
Percentage
% of accidents in kitchen              = 60/180 × 100 = 33.3%
% of accidents in bed room             = 50/180 × 100 = 27.8%
% of accidents under staircase         = 40/180 × 100 = 22.2%
% of accidents in drawing room         = 30/180 × 100 = 16.7%
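The same sector angles and percentages can be reproduced with a short script. This is only a sketch of the arithmetic; the dictionary names are illustrative, and the fourth location is taken as the drawing room, as in the calculation above.

```python
# Domestic accidents by location (180 in total).
accidents = {
    "kitchen": 60,
    "bed room": 50,
    "under staircase": 40,
    "drawing room": 30,
}

total = sum(accidents.values())

# Angle of a sector = (count / total) * 360 degrees.
angles = {place: count / total * 360 for place, count in accidents.items()}
# Percentage of a sector = (count / total) * 100.
percents = {place: count / total * 100 for place, count in accidents.items()}

for place in accidents:
    print(f"{place}: {angles[place]:.0f} degrees, {percents[place]:.1f}%")
```

The angles must add up to 360° and the percentages to 100%, which is a useful check before drawing.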
PICTOGRAM/PICTURE DIAGRAM
It is a visual representation of statistical data by means of pictures. This method is used to impress a lay man, who cannot understand the orthodox charts. The pictures are drawn in horizontal lines, each picture indicating a unit of 10, 20, 30, etc. happenings. The number of pictures in each row gives an idea of the frequency of the attribute.
CARTOGRAM/MAP DIAGRAM
It is a visual representation of statistical data by means of signs & symbols on a map, and is prepared to show the geographical distribution of the frequencies of a characteristic. It is commonly used to represent the geographic distribution of diseases and deaths of public health importance. These are of two types:
a) spot maps, b) shaded maps.
SPOT MAP
In this type the distribution of disease frequency is represented in the form of dots or spots on the area map, each dot representing a unit number of 10, 20, 30, etc. Such maps show at a glance areas of high frequency (clustering of spots) or low frequency. Clustering of spots may indicate a common source of infection or a common risk factor shared by all cases.
Spot maps help epidemiologists to study the place distribution, source/reservoir of infection and behavior of a disease.
Two different colored dots may be marked on the map to show attacks and deaths in the area. Maps prepared on a weekly or monthly basis help in monitoring changes in the magnitude of epidemics over a period of time and also the direction of their spread.
SHADED MAP
These are used to indicate variability in the incidence and
prevalence of diseases in different parts of the world / country
or from time to time. These maps also help in evaluating
progress achieved in reducing the burden of diseases over a
period of time.
GRAPHIC REPRESENTATION OF DATA
A graph is a device used for representing statistical data in a simple, clear and effective manner. A graph consists of curves or straight lines and is used to study the relationship between two variables. Graphs of frequency distributions are:
1) Histogram
2) Frequency polygon
3) Frequency curve
4) Cumulative frequency curve (OGIVE)
5) Scatter/dot diagram
6) Line chart/graph
HISTOGRAM
It is a graphic representation of a frequency distribution table in which the vertical axis represents the frequency & the horizontal axis the class interval.
It consists of a series of bars adjoining each other, the length of each bar being proportional to the frequency and the width to the class interval.
Histograms are ideally suited to represent the distribution of anthropometric values like height, weight, mid-arm circumference, etc.
They can also represent other types of continuous data series such as blood pressure, pulse rate, hemoglobin level, etc.
Histograms provide a better understanding of quantitative data of continuous type than frequency tables.
[Figure: histogram]
FREQUENCY POLYGON
It is a line diagram that represents a frequency distribution table. It can be obtained by joining the mid points (dots) of the heads (heights) of the bars of a histogram. Each dot represents two characteristics: the class interval as indicated on the horizontal axis and the class frequency on the vertical axis. Joining the dots gives a curve with many angles, hence the name "frequency polygon".
This type of diagram is especially useful when it is necessary to compare two or more frequency distributions. The curves for different distributions should be drawn with different types of lines on the same graph paper for easy comparison, which is not possible with histograms because the overlapping rectangles result in confusion.
[Figure: frequency polygon, frequency plotted against Class 1 to Class 6]
FREQUENCY CURVE
When no. of observations is very large and group interval is
reduced, the frequency polygon tends to lose its angulations,
resulting in a smooth curve, known as “Frequency curve”
[Figure: frequency curve]
CUMULATIVE FREQUENCY CURVE
Cumulative frequency
C-I      f     Cumulative frequency
10-20    7     7
21-30    5     12
31-40    10    22
41-50    7     29
51-60    3     32
∑ =      32
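A cumulative frequency column is simply a running total of the frequency column. The sketch below uses `itertools.accumulate` for this, with the frequency of the 41-50 class taken as 7, the value implied by the cumulative column (29 and 32); the variable names are illustrative.

```python
from itertools import accumulate

class_intervals = ["10-20", "21-30", "31-40", "41-50", "51-60"]
frequencies = [7, 5, 10, 7, 3]

# Running total of the frequencies, class by class.
cumulative = list(accumulate(frequencies))

for interval, f, cf in zip(class_intervals, frequencies, cumulative):
    print(interval, f, cf)
```

The last cumulative frequency always equals the total number of observations.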
It is a line diagram representing the cumulative frequency distribution of quantitative data.
To draw it, an ordinary frequency table of quantitative data first has to be converted into a cumulative frequency table.
The curve is plotted by taking the variable on the x-axis and the cumulative frequency on the y-axis.
From the ogive, the median value of the characteristic (variable) can also be calculated.
LINE DIAGRAM
In this diagram the vertical axis represents the magnitude and the horizontal axis represents time.
Thus this diagram provides a simple, easily understandable and highly effective means of understanding the trend or behavior of an event over a period of time (e.g. rising, falling or fluctuating), such as birth rate, death rate, population rate, etc.
The class interval may be one month, one year or one decade.
Since line diagrams do not occupy any space, several lines may be drawn on the same graph for comparing the trends of inter-related events.
Multiple line diagrams can coexist only if they share the scales given on the two axes of the graph.
[Figure: line diagram across Category 1 to Category 4]
SCATTER DIAGRAM OR DOT DIAGRAM
When observations of two variables (e.g. weight & mid-arm circumference, or weight & height) are made on each of the individuals in a group, the scatter diagram helps to study the relationship between the two variables. One variable is represented on the x-axis and the other on the y-axis. Perpendiculars drawn from the two readings meet to give one scatter point.
There will be as many points as there are individuals in the observation. When all the points are plotted, the diagram gives the picture of scatter, hence the name "scatter diagram" (dot diagram). The direction of scatter helps to determine the presence or absence of an association.
[Figure: scatter diagram]
BIO-STATISTICS

MEASURES OF CENTRAL TENDENCY


BY
DR. ABDUL RAUF
When a series of observations of a continuous series is made, it is found that a large number of them concentrate at the center of the series and a small number of them lie at the periphery. This tendency of the values to aggregate in the center of the distribution is called "central tendency", and its measure is also called a "statistical average".
In other words, it is a measure of the central value of the distribution.
There are 3 measures of central tendency:
MEAN
MEDIAN
MODE
MEAN. Mean is the arithmetic mean unless otherwise specified. Other means are the geometric mean & harmonic mean. These are uncommon.
MEAN
The mean is the value obtained by summing up all the observations & dividing the total by the number of observations.
Mean = (sum of all observations made) / (no. of observations made)
x̄ = ∑x/n
x̄ is the mean value (read as x-bar)
Example of ungrouped data
The systolic blood pressures in mm of Hg of 10 students are as follows:
116, 118, 122, 120, 120, 124, 122, 116, 118, 124.
Calculate the mean.
Calculation
• Add the individual observations (∑x).
• Divide by the number of observations (n).
x̄ = ∑x/n
So mean = 1200/10 = 120 mm of Hg
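The calculation can be checked with a few lines of code. This is a minimal sketch using the ten blood pressure readings from the example; the variable names are illustrative.

```python
# Systolic blood pressure (mm of Hg) of the 10 students in the example.
systolic_bp = [116, 118, 122, 120, 120, 124, 122, 116, 118, 124]

total = sum(systolic_bp)   # sum of all observations
n = len(systolic_bp)       # number of observations
mean_bp = total / n        # x-bar = sum(x) / n

print(total, n, mean_bp)
```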
For grouped data
x̄ = ∑fx / ∑f
Example
Missing teeth   No. of patients
0               5
1               3
2               4
3               6
4               2
5               1
Missing teeth (x)   Patients (f)   fx
0                   5              0
1                   3              3
2                   4              8
3                   6              18
4                   2              8
5                   1              5
∑ =                 21             42
x̄ = ∑fx / ∑f
By putting in the values:
= 42/21 = 2
So the mean is 2 missing teeth.
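The grouped-data formula can be sketched the same way: multiply each value by its frequency, add the products, and divide by the total frequency. The list names below are illustrative.

```python
# Missing teeth (x) and number of patients (f) from the example table.
missing_teeth = [0, 1, 2, 3, 4, 5]
patients = [5, 3, 4, 6, 2, 1]

sum_f = sum(patients)                                           # total frequency
sum_fx = sum(f * x for x, f in zip(missing_teeth, patients))    # sum of f * x
grouped_mean = sum_fx / sum_f                                   # x-bar = sum(fx) / sum(f)

print(sum_f, sum_fx, grouped_mean)
```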
Advantages of the mean:
• It is defined by a mathematical formula.
• It is easy to calculate & understand.
• It is the most widely used "average".
• It depends on all values of the data, and a change in any value changes the value of the mean.
• It is the average most commonly used in research.
Disadvantages of the mean:
1. It is influenced by extreme values.
2. It is "not an appropriate average for a highly skewed distribution".
3. It may sometimes not convey a proper sense, e.g. the mean number of children may be 4.5, which is not a possible value for any family.
MEASURES OF CENTRAL TENDENCY

Median
Definition: The median is the central value of a data set or
distribution when the values are arranged in a definite order (ascending or
descending). It divides the data or distribution into two
equal halves, one half comprising values greater than it and the
other half values smaller than it. It is easy to locate when there is
an odd number of values. When there is an even number of values,
the median is taken as the average of the two central values of the
data.
Median

Formula for ungrouped Data


If the total no. of items (n) is odd:
Median = ((n + 1)/2)th value
If the total no. of items (n) is even:
Median = average of the middle 2 observations
Median

Example---ungrouped data
In a hospital ward, the numbers of days of stay of the patients are as
follows:
13, 42, 8, 9, 7, 3, 6, 52, 8, 2, 11, 11, 10, 9

Calculate the median.


Median

First arrange the values in ascending order:
2, 3, 6, 7, 8, 8, 9, 9, 10, 11, 11, 13, 42, 52
As there are 14 patients, the median is the average of the periods of stay
of the 7th and 8th patients.
Median = (9 + 9)/2 = 9 days
Median
Formula for grouped data
Median = l + (h/f) × (n/2 – c)
l = Lower class boundary of the median group
h = Length of the class interval (C-I)
f = Frequency of the median group
n = Total no. of observations (sum of the frequencies)
c = Cumulative frequency preceding the median group
Example
The distribution of marks obtained by 50 students in Biostatistics
is shown below.
Calculate the “median”.
Marks      10-19  20-29  30-39  40-49  50-59  60-69  70-79
Students     7      9      4      1     16      9      4
marks C–B f c.f
10 – 19 9.5 – 19.5 7 7
20 – 29 19.5 – 29.5 9 16
30 – 39 29.5 - 39.5 4 20
40 – 49 39.5 – 49.5 1 21
50 – 59 49.5 – 59.5 16 37
60 – 69 59.5 – 69.5 9 46
70 – 79 69.5 – 79.5 4 50
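The median group is the one containing the (n/2)th = 25th value, i.e. 49.5 – 59.5
(c.f. 37), so l = 49.5, h = 10, f = 16 and c = 21.
Median = l + (h/f) × (n/2 – c)
= 49.5 + (10/16) × (25 – 21)
= 49.5 + (10/16) × 4
= 49.5 + 10/4
= 49.5 + 2.5
Median = 52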
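The grouped-median formula can be sketched as a small Python function. The function name and argument layout are assumptions made for illustration; the data are the class boundaries and frequencies from the table above.

```python
def grouped_median(boundaries, freqs):
    """Median = l + (h / f) * (n/2 - c) for a grouped frequency table.

    boundaries: list of (lower, upper) class boundaries, ascending
    freqs:      frequency of each class
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("empty frequency table")
    cumulative = 0  # c: cumulative frequency before the current class
    for (lower, upper), f in zip(boundaries, freqs):
        if cumulative + f >= n / 2:       # first class whose c.f. reaches n/2
            h = upper - lower             # class width
            return lower + (h / f) * (n / 2 - cumulative)
        cumulative += f

boundaries = [(9.5, 19.5), (19.5, 29.5), (29.5, 39.5), (39.5, 49.5),
              (49.5, 59.5), (59.5, 69.5), (69.5, 79.5)]
freqs = [7, 9, 4, 1, 16, 9, 4]
median_marks = grouped_median(boundaries, freqs)
print(median_marks)  # 52.0
```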
Median

Advantages:-
• It eliminates the effect of extreme values
• It is easy to calculate & understand
• Only the values of the middle item need to be known
Median

Disadvantages:-
• It does not respond to changes in the extreme values, so it does
not use all the information in the data.
• It cannot be calculated unless the data are arranged in some
order (ascending or descending)
Median
The median has an advantage over the mean, as the following
example shows. The daily salaries of 7 workers are
Rs. 5, 5, 5, 7, 10, 20, 102 (total = 154), so the mean is 22 but the median is 7. The
income of the 7th man (Rs. 102) has seriously affected the
mean but has not affected the median, so in this case the median
is nearer to the truth and therefore more representative than
the mean.
Median
GRAPHIC METHOD FOR LOCATING THE MEDIAN
The median can also be read from the cumulative frequency curve (“Ogive”)
by the following procedure.
On the y-axis the (n/2)th cumulative frequency is located, and from this a line parallel to
the x-axis is drawn to meet the curve. From the point of
intersection, a line parallel to the y-axis is drawn to meet the
x-axis. The point of intersection on the x-axis is the median value of the
observations.
MODE
MODE is the most frequently occurring value in
a set of observations. If the data is presented
in a curve form, then the peak of the curve
will represent the mode.
MODE
The following are the weights (in kg) of newborn babies:

3, 3.1, 3.2, 3, 2.9, 2.8, 2.6, 3, 2.5, 2.7

What is the “mode”?

Mode = 3 kg
MODE
Example
The following are the ages of 10 medical students:
18, 18, 19, 19, 20, 20, 20, 21, 22, 23
What is the “mode”?
mode = 20 years of age
MODE
Example
To check the accuracy of the clinical diagnosis of malaria, blood
slides of 33 patients were examined for malaria parasites.
There were three possible results:
Negative
P. falciparum
P. vivax
MODE
Example
The results are presented in the following frequency distribution.
Negative 19
P. falciparum 13
P. vivax 1
Total 33
What is the mode?
The mode is “Negative.”
MODE
Example
Health personnel from 148 different rural health institutions were
asked the following question.
“How often have you run out of drugs for the treatment of malaria in
the past two years?”
This was a closed question with the following possible answers.
Never
1 to 2 times (rarely),
3 to 5 times (occasionally),
more than 5 times (frequently)
MODE
The numbers of responses in each category were totaled to give the
following frequency distribution.
• Never 47
• Rarely 71
• Occasionally 24
• Frequently 6
Total 148
What is the mode?
The mode is “rarely.”
MODE
Example
82 clinics in one district were asked to submit the number of
patients treated for malaria in one month. The researchers
presented both the frequency distribution and percentages (or
relative frequencies) as follows
MODE
NUMBER OF PATIENTS   NUMBER OF CLINICS   RELATIVE FREQUENCY
0 to 19 25 31%
20 to 39 3 4%
40 to 59 5 6%
60 to 79 11 14%
80 to 99 19 24%
100 to 119 10 12%
120 to 139 4 5%
140 to 159 3 4%
Total 80* 100%
MODE
* Data from two clinics are missing.
Note: Usually you do not include missing data in the calculation
of percentages.
However, the number of missing data (e.g., people who did not
respond to a question) is a useful indication of the
adequacy of your data collection. Therefore, this number
should be mentioned as a note to your table.
MODE
What is the mode?
The mode is “0 to 19”, as this outcome is recorded most
frequently (25 times out of 80).
MODE
There can be more than one mode for a series of data. In a
distribution with two most frequent values, there will be 2
modes: Bimodal distribution
Mode= average of 2 modes
MODE
Grouped data
Mode = l + [(fm – f1) / ((fm – f1) + (fm – f2))] × h
l = lower class boundary of modal group
fm = frequency of modal group
f1 = frequency of the group preceding the modal group
f2 = frequency of the group following the modal group
h = class interval of modal group
MODE
Example
Following are the number of men in various age groups with
some form of paid employment in a village. The age recorded
for each man is the number of completed years lived.
Calculate “mode”
Age = 14-20, 21-30, 31-40, 41-50, 51-60, 61-70
men = 12 14 26 35 23 5
MODE
age f Class boundaries
14 – 20 12 13.5 – 20.5
21 – 30 14 20.5 – 30.5
31 – 40 26 30.5 - 40.5
41 – 50 35 40.5 -50.5
51 – 60 23 50.5- 60.5
61 – 70 5 60.5 – 70.5
71 - 90 1 70.5 – 90.5
MODE
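The modal group is 41 – 50 (highest frequency, 35), so l = 40.5, fm = 35,
f1 = 26, f2 = 23 and h = 10.
Mode = l + [(fm – f1) / ((fm – f1) + (fm – f2))] × h
= 40.5 + [(35 – 26) / ((35 – 26) + (35 – 23))] × 10
= 40.5 + [9 / (9 + 12)] × 10
= 40.5 + (9/21) × 10
= 40.5 + 90/21
Mode = 44.8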
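The grouped-mode formula is easy to evaluate in code. The sketch below simply plugs in the values read from the table (the function name is illustrative, not from the lecture).

```python
def grouped_mode(l, fm, f1, f2, h):
    """Mode = l + (fm - f1) / ((fm - f1) + (fm - f2)) * h."""
    return l + (fm - f1) / ((fm - f1) + (fm - f2)) * h

# Modal group 41-50: lower boundary 40.5, frequency 35,
# preceding frequency 26, following frequency 23, class interval 10
mode_age = grouped_mode(l=40.5, fm=35, f1=26, f2=23, h=10)
print(round(mode_age, 1))  # 44.8
```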
MODE
In a distribution with extreme values:
Most affected measure of central tendency: MEAN
Least affected measure of central tendency: MODE
Most preferable measure of central tendency: MEDIAN
MODE
Example
The incidence of malaria in an area is
20,20,50,56,60,5000,678,898,345,456
Incidence in ascending order is
20,20,50,56,60,345,456,678,898,5000
Mean= ∑ x/n= 7583/10= 758.3
Median= average of 5th & 6th value=(60+345)/2
Median= 202.5
Mode= 20
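The effect of the single extreme value (5000) can be seen by computing the three averages with Python's standard `statistics` module. This is a sketch for checking the figures above; the `trimmed` list is an illustrative addition, not part of the original example.

```python
import statistics

incidence = [20, 20, 50, 56, 60, 5000, 678, 898, 345, 456]
print(statistics.mean(incidence))    # 758.3
print(statistics.median(incidence))  # 202.5
print(statistics.mode(incidence))    # 20

# Dropping the extreme value moves the mean far more than the median or mode
trimmed = [x for x in incidence if x != 5000]
print(sum(trimmed) / len(trimmed))   # 287.0
print(statistics.median(trimmed))    # 60
```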
MODE
ADVANTAGES
It is easy to calculate
It is least influenced by extreme values
It is the only average that can be applied to qualitative data
MODE
DISADVANTAGES
It may not exist in a small group of values
It cannot be subjected to mathematical treatment
CENTRAL TENDENCY IN VARIOUS DISTRIBUTIONS

DISTRIBUTION                          CENTRAL TENDENCY
NORMAL (GAUSSIAN) DISTRIBUTION        MEAN = MEDIAN = MODE
RIGHT (POSITIVE) SKEW DISTRIBUTION    MEAN > MEDIAN > MODE
LEFT (NEGATIVE) SKEW DISTRIBUTION     MEAN < MEDIAN < MODE
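For a moderately skewed distribution the three measures are approximately related by
Mode = 3 median – 2 mean
If the median is 5 and the mean is 4, what is the mode?
Mode = 3(5) – 2(4)
= 15 – 8
= 7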
BIO-STATISTICS

MEASURES OF DISPERSION
MEASURES OF DISPERSION
The measures of central tendency are not sufficient to describe
all the characteristics of the data or distribution.
It is quite possible that two or more distributions may have the
same average, but the observations may differ from each
other.
DISPERSION:
By dispersion, we mean how far the values are scattered from
each other or from the average.
MEASURES OF DISPERSION
For example, the diastolic pressures (in mm Hg) of the players of two
cricket teams are
Team A = 92, 90, 88, 88, 88, 86, 84, 84, 84, 82, 80
Team B = 100, 98, 96, 94, 90, 86, 82, 78, 76, 74, 72
Both teams have the same mean, 86 mm Hg. At
the same time, the range and the individual diastolic
pressures of the two teams are different.
MEASURES OF DISPERSION

(i) 45, 45, 45, 45      x̄ = 45
(ii) 44, 45, 45, 46     x̄ = 45
(iii) 42, 45, 50, 43    x̄ = 45
(iv) 35, 40, 85, 20     x̄ = 45
MEASURES OF DISPERSION
(i) 5, 5: x̄ = 10/2 = 5 (Zero Dispersion)
(ii) 4, 6: x̄ = 10/2 = 5 (Small Dispersion)
(iii) 1, 9: x̄ = 10/2 = 5 (Very High Dispersion)
The means of all three distributions are the same but the dispersion
varies.
MEASURES OF DISPERSION
CONCLUSION
This means that the average (mean) does not give us full information about
the data, so we need some additional information which
tells us about variation, dispersion, or scatter.
The measures used for this purpose are called “Measures of
Dispersion or Variation”.
MEASURES OF DISPERSION
MEASURES OF DISPERSION ARE

1. Range (R)
2. Mean Deviation (M.D)
3. Standard Deviation (S.D)
4. Coefficient of Variation (CV )
RANGE
The range is defined as the difference between the maximum value
(Xm) and the minimum value (X0) in the data or distribution.
R = Xm – X0
Where R = Range
Xm = Maximum value
X0 = Minimum value in the data
Example: 60, 69, 70, 71, 72.
So, R = 72 – 60 = 12
RANGE
Thus the range gives the values of the extremes but does not give any
information about the values in between the extreme values. It
is often used to define the limits of normality, e.g.
Blood sugar (random) = 110 to 160 mg
Cholesterol = 120 to 250 mg
RANGE
I. The Range is simple to understand & easy to calculate
II. It is useful as a rough measure of dispersion.
III. It is dependent upon the extreme values so it gives no
indication how the values within the two extremes are
distributed.
IV. It is a highly unstable measure of dispersion.
Mean deviation (MD)

It is the mean (average) of the deviations. Each deviation
is obtained by subtracting the arithmetic mean from an
observation. The average of all the deviations is called the
“Mean deviation”. It is calculated by the following procedure.
Mean deviation (MD)

The “mean” of the observations is calculated.
Then the mean is subtracted from each observation to
calculate its deviation.
The mean (average) of these deviations is then calculated by
totaling the differences from the mean, without considering the
sign of the deviation, and dividing by the
number of observations, which gives the mean deviation.
Mean deviation (MD)

Formula for mean deviation
MD = ∑|xi – x̄| / n
∑ = Summation
n = No. of observations
| | = Absolute value, ignoring the + or – sign
xi = Individual value of observation
x̄ = Mean of observations
Mean deviation (MD)

The systolic blood pressures (in mm Hg) of 10 students are as
follows:
115, 117, 121, 120, 118, 122, 123, 116, 118, 120
Calculate the “MEAN DEVIATION”
xi xi - x̄ Deviation
115 115 – 119 -4
117 117 – 119 -2
121 121 – 119 +2
120 120 – 119 +1
118 118 – 119 -1
122 122 – 119 +3
123 123 – 119 +4
116 116 – 119 -3
118 118 – 119 -1
120 120 – 119 +1
Mean deviation (MD)

x̄ = 1190/10 = 119
∑|xi – x̄| = 22
MD = ∑|xi – x̄| / n = 22/10 = 2.2
So the mean deviation is 2.2 mm Hg
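The procedure for the mean deviation can be sketched in Python as follows, using the same ten blood pressure readings (variable names are illustrative).

```python
bp = [115, 117, 121, 120, 118, 122, 123, 116, 118, 120]

mean_bp = sum(bp) / len(bp)                  # x-bar
abs_dev = [abs(x - mean_bp) for x in bp]     # |xi - x-bar|, sign ignored
mean_deviation = sum(abs_dev) / len(bp)      # MD = sum|xi - x-bar| / n

print(mean_bp, sum(abs_dev), mean_deviation)  # 119.0 22.0 2.2
```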
Important characteristics of Mean Deviation.
i. It is simple to understand and easy to calculate
ii. It is not capable of further mathematical treatment.
iii. Though simple and easy, Mean Deviation is not used in
Statistical Analysis, being of less mathematical value
particularly in drawing Inferences (results).
MEASURES OF DISPERSION

STANDARD DEVIATION(SD)
STANDARD DEVIATION
Two classes took part in a recent quiz. There were
10 students in each class, and each class had an
average score of 81.5. Since
the averages are the same, can we assume that
the students in both classes performed the same on
the quiz?
STANDARD DEVIATION
The answer is… No.
The average (mean) does not tell us anything
about the distribution or variation in the grades.
[Figure: dot plots of the grades in each class, with the mean marked]
STANDARD DEVIATION
So, we need to come up with some way of measuring not just
the average, but also the spread of the distribution of our
data.
Why not just give an average and the range of data (the highest
and lowest values) to describe the distribution of the data?
Well, for example, let's say that for a set of data the average is
17.95 and the range is 23.
But what if most of the numbers lie in one small area of the range
and are not evenly distributed throughout the range?
[Figure: data set with the average, the range, and the area containing most of the numbers marked]
STANDARD DEVIATION
The Standard Deviation is a number that
measures how far the numbers in a set
of data are, on average, from their mean.
• The Standard Deviation is a measure of how
spread out numbers are.
• Its symbol is σ (the Greek letter sigma)
• The formula is easy: it is the square root of
the Variance.
STANDARD DEVIATION
It is an improvement on the mean deviation. In the calculation
of MD the signs of the deviations (+ or –) are ignored.
To avoid this, the squares of the deviations are used
instead of the actual values of the deviations, and
the average of the squares is taken, which is
known as the “VARIANCE”.
STANDARD DEVIATION
• What is the Variance?
• The Variance is defined as
the average of the squared differences from the
Mean.
• To calculate the variance follow these steps:
• Work out the Mean (the simple average of the
numbers)
• Then for each number: subtract the Mean and
square the result (the squared difference).
• Then work out the average of those squared
differences.
• For a sample, the sum of the squared differences is divided by
n – 1 instead of n.
STANDARD DEVIATION
How to Calculate Standard Deviation
Six straightforward steps to Calculate Standard Deviation ;-
• 1. Get the Mean
• 2. Get the deviations
• 3. Square these
• 4. Add the squares
• 5. Divide by total numbers less one
• 6. Square root of result is Standard Deviation
STANDARD DEVIATION
• SO STEP BY STEP:
• 1. Get the Mean
• To begin you need the mean (average).
For example, add 23, 92, 46, 55, 63, 94, 77, 38,
84, 26 = 598 and divide by 10 (the
number of values): 598/10 = 59.8
• So the mean of 23, 92, 46, 55, 63,
94, 77, 38, 84, 26 is
• 59.8
STANDARD DEVIATION
• 2. get the deviations
• subtract the mean from each of the numbers, the answers are;-
• -36.8, 32.2, -13.8, -4.8, 3.2, 34.2, 17.2, -21.8, 24.2, -33.8
• 3. square these
• to square means multiply them by themselves, the answers are;-
• 1354.24, 1036.84, 190.44, 23.04, 10.24, 1169.64, 295.84, 475.24,
585.64, 1142.44
STANDARD DEVIATION
• 4. add the squares
• total of these numbers is 6,283.60
• 5. divide by total number of numbers less one;-
• you had 10 numbers less 1 is 9 numbers
• so 6283.60 divided by 9 = 698.18
• 6. square root of result is Standard Deviation
• The square root of 698.18 is the number which, multiplied by itself, gives 698.18:
• 26.4, so 26.4 is the Standard Deviation.
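The six steps can be followed one by one in Python. This is a sketch using the same ten numbers; it divides by n – 1 as in step 5, and the variable names are illustrative.

```python
import math

data = [23, 92, 46, 55, 63, 94, 77, 38, 84, 26]

mean = sum(data) / len(data)             # 1. get the mean
deviations = [x - mean for x in data]    # 2. get the deviations
squares = [d ** 2 for d in deviations]   # 3. square these
sum_sq = sum(squares)                    # 4. add the squares
variance = sum_sq / (len(data) - 1)      # 5. divide by total numbers less one
sd = math.sqrt(variance)                 # 6. square root of the result

print(round(mean, 1), round(sum_sq, 2), round(variance, 2), round(sd, 1))
# 59.8 6283.6 698.18 26.4
```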
Standard Deviation
• The standard deviation of a sample is calculated from:
s = √[ ∑(X – X̄)² / (n – 1) ]
• The standard deviation is represented by the symbol s for a
sample and σ (sigma) for a population.
Example
You and your friends have just measured the heights of
your dogs (in millimetres):
The heights (at the shoulders) are: 600 mm,
470 mm, 170 mm, 430 mm and 300 mm.
Find the Mean, the Variance, and the Standard Deviation.
Your first step is to find the Mean:
Answer:
Mean = (600 + 470 + 170 + 430 + 300)/5 = 1970/5 = 394
So the mean (average) height is 394 mm.
Now we calculate each dog's difference from the Mean:
206, 76, –224, 36, –94
To calculate the Variance, take each difference, square
it, and then average the result:
Variance = (206² + 76² + (–224)² + 36² + (–94)²)/5 = 108,520/5 = 21,704
So, the Variance is 21,704.
And the Standard Deviation is just the square root of the
Variance, so:
Standard Deviation: σ = √21,704 = 147.32... = 147 (to
the nearest mm)
And the good thing about the Standard Deviation
is that it is useful. Now we can show which
heights are within one Standard Deviation
(147 mm) of the Mean, i.e. between 247 mm and 541 mm.
So, using the Standard Deviation we have a
"standard" way of knowing what is normal, and
what is extra large or extra small.
Why square the differences?
If we just added up the differences ... the negatives would
cancel the positives:
(4 + 4 – 4 – 4)/4 = 0
So that won't work. How about we use absolute values?
(|4| + |4| + |–4| + |–4|)/4 = (4 + 4 + 4 + 4)/4 = 4
That looks good, but what about this case:
(|7| + |1| + |–6| + |–2|)/4 = (7 + 1 + 6 + 2)/4 = 4
Oh no! It also gives a value of 4,
even though the differences are more spread out!
So let us try squaring each difference
(and taking the square root at the end):
√[(4² + 4² + 4² + 4²)/4] = √(64/4) = 4
√[(7² + 1² + 6² + 2²)/4] = √(90/4) = 4.74...
That is nice! The Standard Deviation is bigger
when the differences are more spread out ... just what we want!
STANDARD DEVIATION
If the Standard Deviation is bigger it means the
numbers are spread out from their mean.
If the Standard Deviation is smaller it means the
numbers are close to their mean.
Here are the scores on the math quiz for Team A:
72, 76, 80, 80, 81, 83, 84, 85, 85, 89
Average: 81.5
The Standard Deviation measures how far away each number
in a set of data is from their mean.
For example, start with the lowest score, 72. How far away is 72 from the mean
of 81.5?
72 – 81.5 = –9.5
Or, start with the highest score, 89. How far away is 89 from the mean of 81.5?
89 – 81.5 = 7.5
So, the first step to finding the Standard Deviation is to find all the
distances from the mean. Next, you need to square each of the distances
to turn them all into positive numbers.

Score   Distance from Mean   Distance Squared
72      –9.5                 90.25
76      –5.5                 30.25
80      –1.5                 2.25
80      –1.5                 2.25
81      –0.5                 0.25
83      1.5                  2.25
84      2.5                  6.25
85      3.5                  12.25
85      3.5                  12.25
89      7.5                  56.25

Add up all of the squared distances: Sum = 214.5
Divide by (n – 1), where n represents the amount of numbers you have:
214.5/(10 – 1) = 23.8
Finally, take the square root of the average squared distance: √23.8 = 4.88
This is the Standard Deviation.
Now find the Standard Deviation for the other class grades (Team B):

Score   Distance from Mean   Distance Squared
57      –24.5                600.25
65      –16.5                272.25
83      1.5                  2.25
94      12.5                 156.25
95      13.5                 182.25
96      14.5                 210.25
98      16.5                 272.25
93      11.5                 132.25
71      –10.5                110.25
63      –18.5                342.25

Sum = 2280.5
2280.5/(10 – 1) = 253.4
Standard Deviation = 15.91
Now, let's compare the two classes again:

                      Team A   Team B
Average on the Quiz   81.5     81.5
Standard Deviation    4.88     15.91
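The comparison of the two classes can be reproduced with `statistics.stdev`, which divides by n – 1 as in the tables above. This is only a sketch for checking the slides; the list names are illustrative.

```python
import statistics

team_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
team_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]

for name, scores in (("Team A", team_a), ("Team B", team_b)):
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)   # sample SD: sqrt(sum of squares / (n - 1))
    print(name, mean, round(sd, 2))
```

Both classes have the same mean, but Team B's standard deviation is more than three times Team A's. Rounded to two decimals the code reports about 15.92 for Team B; the 15.91 quoted above is the same figure truncated.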
APPLICATIONS OF STANDARD DEVIATION
1). The standard deviation (SD) is the universally accepted unit of
dispersion of values from the mean value.
2). SD summarizes the variation of a large distribution in one
figure.
3). SD measures the position or distance of an observation from the
mean.
4). SD indicates whether the difference of an individual
from the mean is due to chance (natural) or is real, due to some
special reason.
5). SD helps in finding the size of the sample.
6). SD is used to calculate the Standard Error (SE) of a mean and the SE of the
difference between two means.
7). SD is used for calculation of the “relative deviate” or “Z score”.
8). SD is used in the calculation of the “Coefficient of Variation” (CV).
In a normal distribution, an interval of x̄ ± 1 SD
encloses 68.27% of values, an interval of x̄ ± 2 SD encloses 95.45% of
values, and an interval of x̄ ± 3 SD encloses 99.73% of values. For
simplicity, the limit of x̄ ± 2 SD is treated as including
95% of values.
In other words, six standard deviations, 3 on either side of the
mean, cover almost the entire range of a quantitative
series (which will be explained under “normal curve”).
The Z-score is the difference of a specific observation from the mean,
expressed in units of SD.
Formula:
Z = (x – x̄) / SD
Where Z = relative deviate
x = the observation in question
x̄ = the mean
EXERCISE
The mean height of 4th year students is 150 cm with an SD of 10 cm.
Ahsan’s height is 165 cm.
Calculate the Z-score.
Z = (165 – 150)/10 = 15/10
Z = 1.5 (the Z-score has no units)
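As a quick check of the exercise, the Z-score formula can be written as a one-line function (the function name is illustrative). Because the centimetres cancel in the division, the result is a pure number.

```python
def z_score(x, mean, sd):
    """Relative deviate: distance of x from the mean in units of SD."""
    return (x - mean) / sd

z = z_score(x=165, mean=150, sd=10)
print(z)  # 1.5
```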
COEFFICIENT OF VARIATION
(CV)
Can we compare the standard deviations of two
quantitative series from the same group if
the attributes are different, like the SD of height and the SD of
weight?
Answer is
“NO”
Can we compare the standard deviations of the same attribute, like
height only, if the units of measurement are different in the
two groups, for example, cm and inch?
Answer is
“NO”
This limitation of SD is removed by converting SD into the
“COEFFICIENT OF VARIATION” (CV).
The CV is the standard deviation expressed as a
“percentage of the mean”.
CV is a unitless number; therefore CV is well suited for
comparing all types of dissimilar measurements such as height
and weight, or haemoglobin and weight, or pulse rate
and mid-arm circumference.
COEFFICIENT OF VARIATION
(CV)
CV = (SD / Mean) × 100%
Measure of Relative Variation
Always a %
Shows Variation Relative to Mean
Used to Compare 2 or More Groups
Comparing Coefficient of Variation
• Stock A: Average Price last year = $50
• Standard Deviation = $5
• Stock B: Average Price last year = $100
• Standard Deviation = $5
• Coefficient of Variation:
• Stock A: CV = (5/50) × 100 = 10%
• Stock B: CV = (5/100) × 100 = 5%
COEFFICIENT OF VARIATION
(CV)
EXERCISE
The mean and SD of the Hb level of a group are 12.6 gm% and
1.5 gm% respectively, while the mean and SD of the body
weight of the same group are 50 kg and 2.2 kg respectively.
Compare the variation of these two sets of
observations.
ANSWER
CV of Hb level = (1.5/12.6) × 100 = 11.9%
CV of body Wt = (2.2/50) × 100 = 4.4%
Variation is greater for Hb level than for body Wt
COEFFICIENT OF VARIATION
(CV)
EXERCISE
In two series of boys and girls of the same age
(20 years), the following values of height
were obtained. Find which sex shows
greater variation.
Sex     Mean (Ht) cm   SD (cm)
Boys    163.25         6.25
Girls   150.35         5.25
ANSWER
CV of boys = (6.25/163.25) × 100 = 3.83%
CV of girls = (5.25/150.35) × 100 = 3.49%
Heights in boys show slightly greater variation
than in girls, in the ratio of 3.83 : 3.49 = 1.1 : 1.0
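The CV exercises can be checked with a short Python sketch. The function name is illustrative; the inputs are the SD and mean values given in the exercises above.

```python
def coefficient_of_variation(sd, mean):
    """CV = (SD / mean) * 100, expressed as a percentage."""
    return sd / mean * 100

cv_hb = coefficient_of_variation(1.5, 12.6)        # haemoglobin level
cv_weight = coefficient_of_variation(2.2, 50)      # body weight
cv_boys = coefficient_of_variation(6.25, 163.25)   # height of boys
cv_girls = coefficient_of_variation(5.25, 150.35)  # height of girls

print(round(cv_hb, 1), round(cv_weight, 1))   # 11.9 4.4
print(round(cv_boys, 2), round(cv_girls, 2))  # 3.83 3.49
```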
‘χ’ is a Greek letter, not equivalent
to the English letter ‘X’; it is written as
“chi”, pronounced “kye”,
and typed as ‘χ’.
1. First, a table is prepared from the qualitative data. Actual observed
frequencies of 2 sets of events are entered in a two-way table,
which is also known as a “Contingency table”
(Latin, con = together; tangere = to touch).
Since this table also helps to show the association between two
sets of events, it is also called an “Association table”.
2. The Null Hypothesis is set up, stating that there is no
association between the events.
The χ²-test can also be applied when there are more
than two classes or groups, such as social classes
I, II, III and IV among smokers and non-smokers.
3. The expected frequency for each cell is
calculated on the assumption of no
association, using the formula
E = (Row total × Column total) / Grand total
4. Then the difference between the observed and the
expected frequencies for each cell is found, i.e., O – E
5. The χ²-value for each cell is calculated by using the formula
χ² = (O – E)² / E
6. Then the total χ² for all four cells is calculated by
summing the χ²-values of the 4 cells:
Total χ² = ∑ (O – E)² / E
Alternative formula for a 2×2 table: χ² = (ad – bc)² × G / [(a+b)(c+d)(b+d)(a+c)]
where a, b, c and d are the four cell frequencies and G is the grand total.
7. The degree of freedom (D.F) is calculated by
using the formula
D.F = (c – 1)(r – 1)
Where
c = no. of columns
r = no. of rows
8. Lastly, to know whether the calculated χ²-value is
significant or not, we refer to “Fisher’s χ²-table”.
If the calculated value is higher than the table value, it is
concluded that it is significant and the Null hypothesis is
rejected.
If the calculated value is lower than the table value, the
Null hypothesis is accepted (not rejected).
Table 1
Chi-Square Distribution
Degrees of Freedom (df)    Area in Upper Tail
     0.10    0.05    0.01    0.001
1 2.706 3.841 6.635 10.828
2 4.605 5.991 9.210 13.816
3 6.251 7.815 11.345 16.267
4 7.779 9.488 13.277 18.467
5 9.236 11.071 15.086 20.515
6 10.645 12.592 16.812 22.458
7 12.017 14.067 18.475 24.322
8 13.362 15.507 20.090 26.125
9 14.684 16.919 21.666 27.877
10 15.987 18.307 23.209 29.588
EXAMPLE
Apply the χ²-test to find the efficacy of a drug from the data given below.
Outcome (result) of treatment with drug and placebo

Group                        Died           Survived       Total
A. Control (on placebo)      O = 10 (a)     O = 25 (b)     35 (a + b)
                             E = 5.25       E = 29.75
B. Experimental (on drug)    O = 5 (c)      O = 60 (d)     65 (c + d)
                             E = 9.75       E = 55.25
Total                        15 (a + c)     85 (b + d)     G = 100
The null hypothesis is that the drug has no effect
(drug and placebo are the same), i.e. there is no difference between the
proportions dying in the two groups.
The expected (E) value and χ²-value are calculated for each cell as
follows.
(a) Expected number and χ²-value of “died” in control group
E = (Row total × Column total) / Grand total
= (35 × 15)/100 = 5.25
χ² = (O – E)²/E = (10 – 5.25)²/5.25
= (4.75)²/5.25 = 22.56/5.25 = 4.29
(b) Expected number and χ²-value of “survived” in control group
E = (85 × 35)/100 = 29.75
χ² = (O – E)²/E = (25 – 29.75)²/29.75
= (–4.75)²/29.75 = 22.56/29.75
χ² = 0.76
(c) Expected number and χ²-value of “died” in
experimental group
E = (15 × 65)/100 = 39/4 = 9.75
χ² = (O – E)²/E = (5 – 9.75)²/9.75
= (–4.75)²/9.75 = 22.56/9.75 = 2.31
(d) Expected number and χ²-value of “survived” in
experimental group
E = (85 × 65)/100 = 85 × 0.65 = 55.25
χ² = (O – E)²/E = (60 – 55.25)²/55.25
= (4.75)²/55.25 = 22.56/55.25 = 0.408
∑χ² = Total χ²-value of all 4 cells
= 4.29 + 0.76 + 2.31 + 0.41
= 7.77
DF = (c – 1)(r – 1) = (2 – 1)(2 – 1) = 1 × 1 = 1
Where
DF = Degree of freedom
c = no. of columns
r = no. of rows
On referring to Fisher’s χ²-table with 1 df, the tabulated χ²-
value corresponding to a probability of 0.05 (the 5%
significance level) is 3.84.
Since the calculated value (7.77) is more than the table
value (3.84), the null hypothesis is rejected and the
alternative hypothesis is accepted.
The assumption that the drug is not
efficacious (no difference between
drug and placebo) is ruled out, and it is
accepted that the drug is efficacious.
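The whole calculation for the drug and placebo table can be sketched in plain Python. The function below is an illustrative helper, not part of the lecture; it computes the expected frequencies from the marginal totals, sums (O – E)²/E over the cells, and also evaluates the alternative 2×2 formula as a cross-check.

```python
def chi_square(observed):
    """Pearson chi-square for an r x c table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand  # expected frequency
            chi2 += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

table = [[10, 25],   # control (placebo): died, survived
         [5, 60]]    # experimental (drug): died, survived
chi2, df = chi_square(table)

# Alternative formula: (ad - bc)^2 * G / ((a+b)(c+d)(b+d)(a+c))
(a, b), (c, d) = table
shortcut = (a * d - b * c) ** 2 * (a + b + c + d) / (
    (a + b) * (c + d) * (b + d) * (a + c))

critical_value = 3.841  # table value for 1 df, upper-tail area 0.05
print(round(chi2, 2), round(shortcut, 2), df, chi2 > critical_value)
```

The code keeps full precision and gives about 7.78; the hand calculation differs in the last decimal place only because the per-cell values were shortened before adding. The conclusion (reject the null hypothesis) is the same.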
Example
A researcher in the Human Resource
Department listed five items and asked
each teacher to mark the one most
important to her or him. The items and
corresponding percentages of favorable
responses are shown in the table. The HRD
researcher would like to determine whether the
distribution of responses now fits last
year's distribution or is different.
Distribution of Teachers’ Present Response
on the Items Perceived Important to Them
Last Year.
Items                                       Frequency   Percentage
1. Vacation leave                           6           6.00
2. Salary increase                          58          58.00
3. Professional growth                      14          14.00
4. Health and retirement benefits           14          14.00
5. Honorarium, incentives, overtime pay     8           8.00
Total                                       100         100.00
Problem: Is the present distribution of responses the same as last year’s?
Variable: The teachers’ response on the listed items.
Instrument: Survey form
Null hypothesis: The present distribution of responses is the same as last
year’s.
Alternative hypothesis: The present distribution of responses is different.
Critical value: Referring to the critical values of chi-square, at the 0.05 level of
significance and 4 degrees of freedom, the critical value is 9.49.
Computation:
Item                                      O (now)   E (last year)   O – E   (O – E)²   (O – E)²/E
1. Vacation leave                         6         4               2       4          1.00
2. Salary increase                        58        65              –7      49         0.75
3. Professional growth                    14        13              1       1          0.08
4. Health and retirement benefits         14        12              2       4          0.33
5. Honorarium, incentives, overtime pay   8         6               2       4          0.67
χ² = 2.83
Since the computed value of 2.83 is
less than the tabular value of 9.49,
the null hypothesis is
accepted. Therefore, at the 5 percent
significance level and 4 degrees of
freedom, the present distribution of
responses does not differ significantly from last year’s.
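This goodness-of-fit calculation can be sketched in a few lines of Python. The observed and expected counts are taken from the computation table; the variable names are illustrative.

```python
observed = [6, 58, 14, 14, 8]   # this year's responses (O)
expected = [4, 65, 13, 12, 6]   # last year's distribution (E)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1          # categories minus one
critical_value = 9.488          # table value for 4 df, upper-tail area 0.05

print(round(chi2, 2), df, chi2 > critical_value)  # 2.83 4 False
```

Since the computed value does not exceed the table value, the null hypothesis is not rejected, in agreement with the hand calculation.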
Example:
The director of the Personnel Office was interested in
knowing whether the voluntary absence behavior of the
school’s employees was independent of marital status. The
employee files contained data on marital status (married,
separated, widower, and single) and on voluntary absence
behavior (often absent, seldom absent,
and never absent). The table gives, for a random
sample of 500, the number of employees in each cell of a
two-way contingency table. Test the hypothesis that
voluntary absence behavior is independent of
marital status for this school. Use α = 0.05.
Marital Status
Absence Behavior   Married   Separated   Widower   Single   Total
Often Absent       36        16          14        34       100
Seldom Absent      64        34          20        82       200
Never Absent       50        50          16        84       200
Total              150       100         50        200      500
Problem: Is the voluntary absence behavior of the
school’s employees independent of their marital status?
Variables: The independent variable is the employees’
marital status and the dependent variable is the
employees’ voluntary absence behavior.
Instrument: Employee files
Null hypothesis: The voluntary absence behavior and
marital status of the employees are independent.
Alternative hypothesis: The voluntary absence
behavior and marital status of the employees are
dependent.
Critical value:
df= (r-1)(c-1)= (3-1)(4-1)=6
The critical value at 5 percent significance level and
6 degree of freedom is 12.59.
Computation:
O     E                      O – E   (O – E)²   (O – E)²/E
36    150(100)/500 = 30      6       36         1.2
16    100(100)/500 = 20      –4      16         0.8
14    50(100)/500 = 10       4       16         1.6
34    200(100)/500 = 40      –6      36         0.9
64    150(200)/500 = 60      4       16         0.27
34    100(200)/500 = 40      –6      36         0.9
20    50(200)/500 = 20       0       0          0.0
82    200(200)/500 = 80      2       4          0.05
50    150(200)/500 = 60      –10     100        1.67
50    100(200)/500 = 40      10      100        2.5
16    50(200)/500 = 20       –4      16         0.8
84    200(200)/500 = 80      4       16         0.2
χ² = 10.89
The Employees’ Voluntary Absence Behavior and Marital Status:
Variables                                       Degrees of freedom   Computed χ² value   Tabular χ² value (0.05)   Decision    Interpretation
Voluntary Absence Behavior and Marital Status   6                    10.89               12.59                     Accept Ho   Not significant

Since the computed value of 10.89 is less than the tabular value of
12.59 at the 5 percent significance level and 6 degrees of freedom,
accept the null hypothesis. Hence, the voluntary absence
behavior and marital status of the school’s employees are
independent.
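For a larger table the same steps apply: expected frequency = row total × column total / grand total for every cell. The following Python sketch (names are illustrative) repeats the test of independence for the 3×4 absence table.

```python
observed = [
    [36, 16, 14, 34],   # often absent: married, separated, widower, single
    [64, 34, 20, 82],   # seldom absent
    [50, 50, 16, 84],   # never absent
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell under independence
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
df = (len(row_totals) - 1) * (len(col_totals) - 1)
critical_value = 12.592   # table value for 6 df, upper-tail area 0.05

print(expected[0], df, round(chi2, 2), chi2 > critical_value)
```

At full precision the statistic comes out at about 10.88; the small difference from the hand-computed total is due to rounding of the individual cell values, and the decision is unchanged.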
Fisher’s exact test or Fisher-Irwin Exact Test

The chi-square test is not appropriate for a situation in which the sample size is small, yielding small expected
frequencies. There should be no expected frequencies less than 1, and not more than 20% of the
expected frequencies should be less than 5. For a situation with a small sample size, we should
consider using Fisher’s Exact Test, which computes directly the probability of observing a
particular set of frequencies in a 2x2 table. The formula is

P = (a+b)!(c+d)!(a+c)!(b+d)! / (a! b! c! d! n!)
where a, b, c, and d = the frequencies of the 2x2 table
n = sample size
Shortcut formula for chi-square for 2x2 tables:

χ² = n(ad – bc)² / [(a+c)(b+d)(a+b)(c+d)]
Example:

Consider the following 2x2 table showing the rating of successful or unsuccessful on a
job and pass or fail on an ability test:

               Test Item
               Fail     Pass     Total
Successful     a = 4    b = 1    5
Unsuccessful   c = 1    d = 3    4
Total          5        4        9

Computation:

P = 5! 4! 5! 4! / (9! 4! 1! 1! 3!) = 5! 4! (5∙4∙3!) / (9! 3!) = 5! 4! (5∙4) / (9∙8∙7∙6∙5!) = 4∙3∙2∙1 (20) / (9∙8∙7∙6)

P = 20/126 = 0.159
However, to compute the P value, we still need the probability of
obtaining this or a more extreme result while keeping the marginal totals in the
table fixed. To do this, reduce by 1 the smallest frequency that is greater than
zero while holding the marginal totals constant. Hence, the table will be:

5    0    5
0    4    4
5    4    9

The probability of obtaining this set of frequencies is

$$P = \frac{5!\,4!\,4!\,5!}{9!\,5!\,0!\,0!\,4!} = \frac{5!\,4!}{9!} = \frac{5!\cdot 4\cdot 3\cdot 2\cdot 1}{9\cdot 8\cdot 7\cdot 6\cdot 5!} = 0.008$$
Thus the probability of observing this particular frequency of job success,
or a more extreme one, is 0.159 + 0.008 = 0.167. This P value is for a
one-tailed test. An estimate of the P value for a two-tailed test is obtained by
multiplying the value by 2: 2 x 0.167 = 0.334. Based on this value, the null
hypothesis that there is no difference in job success with or without
passing the ability test cannot be rejected.
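The one-tailed calculation above can be checked with a short Python sketch. The function names are illustrative, and the loop assumes that "more extreme" means lowering cells b and c while the marginal totals stay fixed, as in the worked example.

```python
from math import comb


def table_probability(a, b, c, d):
    """Probability of one 2x2 table when all marginal totals are fixed."""
    n = a + b + c + d
    # Algebraically equal to (a+b)!(c+d)!(a+c)!(b+d)! / (a! b! c! d! n!)
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_one_tailed(a, b, c, d):
    """Add the observed table and every more extreme table in one direction."""
    total = 0.0
    # Assumption: b and c are the small cells, so each step lowers them by 1
    while b >= 0 and c >= 0:
        total += table_probability(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return total


p_observed = table_probability(4, 1, 1, 3)   # 20/126
p_one_tailed = fisher_one_tailed(4, 1, 1, 3)  # 20/126 + 1/126
```

Doubling the one-tailed value, as the text does, is only an estimate of the two-tailed P value.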
Yates’ Correction for Continuity

The statistic on which we base our decision has a distribution that is only approximated by the
chi-square distribution. The computed X² values depend on the cell frequencies and consequently
are discrete. The continuous chi-square distribution approximates the discrete sampling
distribution of X² very well, provided that the number of degrees of freedom is greater than 1. In
a 2x2 contingency table, where we have only 1 degree of freedom, Yates’ correction for
continuity may be applied. It consists of subtracting 0.5 from the absolute difference │O-E│ in each
term of the chi-square statistic for 2x2 tables prior to squaring the term.

$$X^2_{\text{corrected}} = \sum \frac{(|O-E| - 0.5)^2}{E}$$

If the expected cell frequencies are large, the corrected and uncorrected results are almost the same.
When the expected frequencies are between 5 and 10, Yates’ correction should be applied. For
expected frequencies less than 5, Fisher’s exact test should be used.
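A small Python sketch can show how much the correction changes the statistic. The 2x2 counts used here (20, 30, 30, 20) are hypothetical and chosen only for illustration; the function names are likewise assumptions.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2x2 table, optionally with Yates' correction."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    correction = 0.5 if yates else 0.0
    # Follows the slide formula: subtract 0.5 from |O - E| before squaring
    return sum((abs(o - e) - correction) ** 2 / e
               for o, e in zip(observed, expected))


def chi_square_shortcut(a, b, c, d):
    """Shortcut formula n(ad - bc)^2 / ((a+c)(b+d)(a+b)(c+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


uncorrected = chi_square_2x2(20, 30, 30, 20)            # hypothetical counts
corrected = chi_square_2x2(20, 30, 30, 20, yates=True)
shortcut = chi_square_shortcut(20, 30, 30, 20)
```

For these counts the uncorrected statistic and the shortcut formula agree, and the corrected value is smaller, which makes the test more conservative.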
Chi –square Guidelines
When testing for “goodness of fit”, at least two
categories must be used so that the statistic has at
least 1 degree of freedom. The general rule in
setting up the chi-square test is to use as many
categories as possible, since the test will then be more
sensitive. The limitations are that no more than 20
percent of cells may have an expected frequency less
than 5.0, and no cell may have an expected frequency
smaller than 1.0. If too many small expected
frequencies exist, the categories should be
combined, unless such combinations are not
possible.

If categories are combined to the point where
there are only two categories and an
expected frequency of less than 5.0 still exists, X²
should not be used. Instead, the binomial test
may be used to treat the data.
The chi-square test is the statistical test used to test
the null hypothesis that proportions are equal or,
equivalently, that factors or
characteristics are independent or not
associated.
It is used to analyze data that are presented
in categories. The test applies only to
discrete data, counted rather than measured.
In this test, the expected frequencies and the
actual or obtained frequencies are compared.
The logic of the chi-square test follows:
•The total number of observations in each column (treatment or
control) and the total number of observations in each row (positive or
negative) are considered to be given or fixed. (These column and row
totals are also called marginal frequencies.)

•If we assume that columns and rows are independent, we can
calculate the number of observations expected to occur by chance:
the expected frequencies. We find the expected frequencies by
multiplying the column total by the row total and dividing by the grand
total.

$$\text{Expected frequency} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$
•The chi-square test compares the observed frequency in each cell with the expected
frequency.
If no relationship exists between the column and row variables, the observed frequencies will
be very close to the expected frequencies; they will differ only by small amounts. In this
instance, the value of the chi-square statistic will be small. On the other hand, if a relationship
(or dependency) does occur, the observed frequencies will vary quite a bit from the
expected frequencies, and the value of the chi-square statistic will be large.

$$X^2_{(df)} = \sum \frac{(\text{Observed frequency} - \text{Expected frequency})^2}{\text{Expected frequency}}$$

The Chi-square goodness of fit test is
used to test whether the distribution of a set
of data follows a particular pattern. For
example, the goodness-of-fit Chi-square
may be used to test whether a set of values
follow the normal distribution or whether the
proportions of Democrats, Republicans, and
other parties are equal to a certain set of
values, say 0.4, 0.4, and 0.2.
The Chi-square test for independence in a
contingency table is the most common Chi-square
test. Here individuals (people, animals, or things)
are classified by two (nominal or ordinal)
classification variables into a two-way, contingency
table. This table contains the counts of the number
of individuals in each combination of the row
categories and column categories. The Chi-square
test determines if there is dependence
(association) between the two classification
variables. Hence, many surveys are analyzed with
Chi-square tests.
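The test of independence can be sketched in Python as follows. The counts are the observed frequencies from the voluntary absence and marital status example worked earlier in these notes; the function name is illustrative.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency = row total x column total / grand total
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


observed = [
    [36, 16, 14, 34],
    [64, 34, 20, 82],
    [50, 50, 16, 84],
]
statistic, df = chi_square_independence(observed)
critical_value = 12.59  # tabular value at the 0.05 level for 6 degrees of freedom
reject_null = statistic > critical_value
```

With these counts the statistic is about 10.88 on 6 degrees of freedom, below the tabular value, so the null hypothesis of independence is not rejected.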
In the name of God, the Most Gracious, the Most Merciful
Correlation &
Regression
BY
DR ABDUL RAUF
Correlation

Finding the relationship between two quantitative
variables without being able to infer causal
relationships

Correlation is a statistical technique used to determine
the degree to which two variables are related
Scatter diagram
• Rectangular coordinate
• Two quantitative variables
• One variable is called independent (X)
and the second is called dependent (Y)
• Points are not joined
• No frequency table
Example

Wt. (kg)       67    69    85    83    74    81    97    92   114    85
SBP (mmHg)    120   125   140   160   130   180   150   140   200   130

[Figure: Scatter diagram of weight and systolic blood pressure; x-axis Wt (kg), y-axis SBP (mmHg)]


Scatter plots

The pattern of data is indicative of the type of relationship
between your two variables:
• positive relationship
• negative relationship
• no relationship
Positive relationship
[Figure: scatter plot of Height in CM against Age in Weeks]
Negative relationship
[Figure: scatter plot of Reliability against Age of Car]
No relation
Correlation Coefficient

Statistic showing the degree of relation between two
variables
Simple Correlation coefficient (r)

• It is also called Pearson's
correlation or product moment
correlation coefficient.
• It measures the nature and
strength of the association between
two variables of the quantitative type.
The sign of r denotes the nature of association
while the value of r denotes the strength of
association.
• If the sign is +ve this means the
relation is direct (an increase in one
variable is associated with an increase
in the other variable and a decrease in
one variable is associated with a
decrease in the other variable).
• While if the sign is -ve this means an
inverse or indirect relationship (which
means an increase in one variable is
associated with a decrease in the
other).
 The value of r ranges between ( -1) and ( +1)
 The value of r denotes the strength of the
association as illustrated
by the following diagram.

    strong    intermediate    weak        weak    intermediate    strong
-1         -0.75          -0.25     0     0.25           0.75         1
perfect indirect               no relation               perfect direct
correlation                                              correlation

If r = 0 this means no association or
correlation between the two variables.

If 0 < |r| < 0.25 = weak correlation.

If 0.25 ≤ |r| < 0.75 = intermediate correlation.

If 0.75 ≤ |r| < 1 = strong correlation.

If |r| = 1 = perfect correlation.
How to compute the simple correlation
coefficient (r)

$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}$$
Example:
A sample of 6 children was selected, and data about
their age in years and weight in kilograms were
recorded as shown in the following table. It is
required to find the correlation between age and
weight.
Serial No   Age (years)   Weight (Kg)
1           7             12
2           6             8
3           8             12
4           5             10
5           6             11
6           9             13
These 2 variables are of the quantitative type.
One variable (age) is called the independent
variable and denoted as (X), and the other
(weight) is called the dependent variable and denoted
as (Y). To find the relation between
age and weight, compute the simple
correlation coefficient using the following
formula:

$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}$$
Serial n.   Age (years) (x)   Weight (Kg) (y)   xy     x²    y²
1           7                 12                84     49    144
2           6                 8                 48     36    64
3           8                 12                96     64    144
4           5                 10                50     25    100
5           6                 11                66     36    121
6           9                 13                117    81    169
Total       ∑x = 41           ∑y = 66           ∑xy = 461   ∑x² = 291   ∑y² = 742

$$r = \frac{461 - \dfrac{41 \times 66}{6}}{\sqrt{\left(291 - \dfrac{(41)^2}{6}\right)\left(742 - \dfrac{(66)^2}{6}\right)}} = \frac{10}{\sqrt{(10.83)(16)}}$$

r = 0.76
strong direct correlation
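The same computation can be written as a short Python sketch. The function name is illustrative, and the data are the six age and weight pairs from the table above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Simple (Pearson) correlation coefficient from the raw-sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


age = [7, 6, 8, 5, 6, 9]
weight = [12, 8, 12, 10, 11, 13]
r = pearson_r(age, weight)
```

The result is close to 0.76, which falls in the strong direct range of the scale given earlier.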
EXAMPLE: Relationship between Anxiety and Test Scores

Anxiety (X)   Test score (Y)   X²     Y²    XY
10            2                100    4     20
8             3                64     9     24
2             9                4      81    18
1             7                1      49    7
5             6                25     36    30
6             5                36     25    30
∑X = 32       ∑Y = 32          ∑X² = 230   ∑Y² = 204   ∑XY = 129
Calculating Correlation Coefficient

$$r = \frac{(6)(129) - (32)(32)}{\sqrt{\left(6(230) - 32^2\right)\left(6(204) - 32^2\right)}} = \frac{774 - 1024}{\sqrt{(356)(200)}} = -0.94$$

r = - 0.94

Indirect strong correlation


Spearman Rank Correlation Coefficient (rs)

It is a non-parametric measure of correlation.
This procedure makes use of the two sets of
ranks that may be assigned to the sample
values of X and Y.
Spearman Rank correlation coefficient could
be computed in the following cases:
Both variables are quantitative.
Both variables are qualitative ordinal.
One variable is quantitative and the other is
qualitative ordinal.
Procedure:
1. Rank the values of X from 1 to n where
n is the number of pairs of values of
X and Y in the sample.
2. Rank the values of Y from 1 to n.
3. Compute the value of di for each pair
of observations by subtracting the rank
of Yi from the rank of Xi.
4. Square each di and compute ∑di²,
which is the sum of the squared
values.
5. Apply the following formula

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$

The value of rs denotes the magnitude
and nature of association giving the
same interpretation as simple r.
Example
In a study of the relationship between level of
education and income the following data were
obtained. Find the relationship between them
and comment.
Sample    Level of education (X)   Income (Y)
A         Preparatory              25
B         Primary                  10
C         University               8
D         Secondary                10
E         Secondary                15
F         Illiterate               50
G         University               60
Answer:
Sample    (X)            (Y)    Rank X   Rank Y   di      di²
A         Preparatory    25     5        3        2       4
B         Primary        10     6        5.5      0.5     0.25
C         University     8      1.5      7        -5.5    30.25
D         Secondary      10     3.5      5.5      -2      4
E         Secondary      15     3.5      4        -0.5    0.25
F         Illiterate     50     7        2        5       25
G         University     60     1.5      1        0.5     0.25

∑di² = 64

$$r_s = 1 - \frac{6 \times 64}{7(48)} = -0.14$$

Comment:
There is an indirect weak correlation
between level of education and income.
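The ranking procedure and formula can be sketched in Python. The numeric coding of the education levels (0 for illiterate up to 4 for university) is an assumption made only so the levels can be ordered, and the function names are illustrative. Tied values share the mean of their ranks, as in the table above.

```python
def average_ranks(values):
    """Rank from largest (rank 1) to smallest; ties share the mean rank."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # first 1-based position of v
        ties = ordered.count(v)                 # how many values equal v
        ranks.append(first + (ties - 1) / 2)    # mean of the tied positions
    return ranks


def spearman_rs(xs, ys):
    """Spearman rank correlation using 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2
             for rx, ry in zip(average_ranks(xs), average_ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Assumed ordinal coding of the education levels
LEVELS = {"illiterate": 0, "primary": 1, "preparatory": 2,
          "secondary": 3, "university": 4}
education = ["preparatory", "primary", "university", "secondary",
             "secondary", "illiterate", "university"]
income = [25, 10, 8, 10, 15, 50, 60]
rs = spearman_rs([LEVELS[e] for e in education], income)
```

With tied ranks this simple formula is only approximate, but it reproduces the hand calculation, giving a value of about -0.14.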
exercise
Regression Analyses

Regression: technique concerned with predicting some
variables by knowing others

The process of predicting variable Y using variable X

Regression
• Uses a variable (x) to predict some outcome variable (y)
• Tells you how values in y change as a function of changes in
values of x
Correlation and Regression

• Correlation describes the strength of a linear relationship
between two variables
• Linear means “straight line”
• Regression tells us how to draw the straight line described
by the correlation
Regression
• Calculates the “best-fit” line for a certain set of data
The regression line makes the sum of the squares of
the residuals smaller than for any other line
Regression minimizes residuals
[Figure: scatter diagram of SBP (mmHg) against Wt (kg)]
By using the least squares method (a procedure
that minimizes the sum of the squared vertical
deviations of the plotted points from a straight line) we are
able to construct a best fitting straight line to the
scatter diagram points and then formulate a
regression equation in the form of:

$$\hat{y} = a + bX$$

$$\hat{y} = \bar{y} + b(x - \bar{x}), \qquad b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$
Regression Equation
• Regression equation describes the regression line mathematically
• Intercept
• Slope
[Figure: SBP (mmHg) plotted against Wt (kg)]
Linear Equations
ŷ = a + bX
b = Slope = Change in Y / Change in X
a = Y-intercept
Hours studying and grades
Regressing grades on hours


[Figure: linear regression of final grade in course on number of hours spent studying; fitted line: Final grade in course = 59.95 + 3.17 * study; R-Square = 0.88]

Predicted final grade in class =
59.95 + 3.17*(number of hours you study per week)

Predict the final grade of…

 Someone who studies for 12 hours


 Final grade = 59.95 + (3.17*12)
 Final grade = 97.99

 Someone who studies for 1 hour:


 Final grade = 59.95 + (3.17*1)
 Final grade = 63.12
Exercise
A sample of 6 persons was selected. Their
age (x variable) and weight (y variable) are
shown in the following table. Find the
regression equation and the predicted
weight when age is 8.5 years.
Serial no. Age (x) Weight (y)
1 7 12
2 6 8
3 8 12
4 5 10
5 6 11
6 9 13
Answer

Serial no. Age (x) Weight (y) xy X2 Y2


1 7 12 84 49 144
2 6 8 48 36 64
3 8 12 96 64 144
4 5 10 50 25 100
5 6 11 66 36 121
6 9 13 117 81 169

Total 41 66 461 291 742


$$\bar{x} = \frac{41}{6} = 6.83, \qquad \bar{y} = \frac{66}{6} = 11$$

$$b = \frac{461 - \dfrac{41 \times 66}{6}}{291 - \dfrac{(41)^2}{6}} = 0.923$$

Regression equation

$$\hat{y}(x) = 11 + 0.923\,(x - 6.83)$$

$$\hat{y}(x) = 4.69 + 0.923x$$

$$\hat{y}(8.5) = 4.69 + 0.923 \times 8.5 \approx 12.5 \text{ kg}$$

$$\hat{y}(7.5) = 4.69 + 0.923 \times 7.5 \approx 11.6 \text{ kg}$$

[Figure: regression line through the two estimated points; x-axis Age (in years), y-axis Weight (in Kg)]

We create a regression line by plotting two
estimated values for y against their X component,
then extending the line right and left.
Exercise 2
The following are the age (in years) and systolic blood
pressure of 20 apparently healthy adults.

Age (x)   B.P (y)     Age (x)   B.P (y)
20        120         46        128
43        128         53        136
63        141         60        146
26        126         20        124
53        134         63        143
31        128         43        130
58        136         26        124
46        132         19        121
58        140         31        126
70        144         23        123

Find the correlation between age
and blood pressure using simple
and Spearman's correlation
coefficients, and comment.
Find the regression equation.
What is the predicted blood
pressure for a man aged 25 years?
Serial x y xy x2
1 20 120 2400 400
2 43 128 5504 1849
3 63 141 8883 3969
4 26 126 3276 676
5 53 134 7102 2809
6 31 128 3968 961
7 58 136 7888 3364
8 46 132 6072 2116
9 58 140 8120 3364
10 70 144 10080 4900
11 46 128 5888 2116
12 53 136 7208 2809
13 60 146 8760 3600
14 20 124 2480 400
15 63 143 9009 3969
16 43 130 5590 1849
17 26 124 3224 676
18 19 121 2299 361
19 31 126 3906 961
20 23 123 2829 529
Total 852 2630 114486 41678

$$b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = \frac{114486 - \dfrac{852 \times 2630}{20}}{41678 - \dfrac{852^2}{20}} = 0.4547$$

ŷ = 112.13 + 0.4547 x

for age 25
B.P = 112.13 + 0.4547 * 25 = 123.5 mmHg
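The slope, intercept and prediction above can be reproduced with a minimal Python sketch. The function name is illustrative, and the data are the 20 age and blood pressure pairs from the exercise.

```python
def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n   # the line passes through (mean x, mean y)
    return a, b


age = [20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
       46, 53, 60, 20, 63, 43, 26, 19, 31, 23]
bp = [120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
      128, 136, 146, 124, 143, 130, 124, 121, 126, 123]

a, b = least_squares(age, bp)
predicted_bp_at_25 = a + b * 25
```

Working with unrounded values gives a slope near 0.455 and an intercept near 112.13, so the predicted pressure at age 25 is about 123.5 mmHg.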
Multiple Regression
Multiple regression analysis is a
straightforward extension of simple
regression analysis which allows more
than one independent variable.
GEOMETRIC MEAN
BY
DR ABDUL RAUF
Geometric mean

• Geometric mean is a mathematical
concept that is related to, but easily
confused with, the more commonly used
arithmetic mean. To calculate the
geometric mean, use one of the methods
below.
Two Numbers: Simple Method

• 1- Find the numbers you wish to average.
Ex. 2 and 32.
• 2- Multiply them together.
Ex. 2 x 32 = 64.
• 3 - Calculate the square root of said number.
Ex. √64 = 8.
Two Numbers: Detailed Method
 1 - Plug your numbers into the equation below.
If your numbers are 10 and 15, for example, plug in
10 for “first #” and 15 for “second #.”
 2-Solve for X.
Start by cross-multiplying, which means multiplying
the pairs of numbers diagonal to one another and
then setting the results on opposite sides of an =
sign. Since X*X is X^2, your equation should look
like: X^2 = (product of your other numbers).
The Geometric Mean Between
2 Numbers (A and B) Is The
Number That When Substituted
For X Will Make This Proportion
True

$$\frac{A}{X} = \frac{X}{B}$$

Find the geometric mean
between 4 and 9
$$\frac{4}{X} = \frac{X}{9}$$
X² = 36
X = √36
X = 6
Find the geometric mean
between 3 and 15
$$\frac{3}{X} = \frac{X}{15}$$
X² = 45
X = √45
X = 3√5
X = 6.7
8 is the geometric mean between
2 and what number?

$$\frac{2}{8} = \frac{8}{X}$$
2X = 64
X = 32
The Altitude Drawn To The
Hypotenuse In A Right Triangle
Divides The Hypotenuse Into 2
Parts. The Altitude Is The Geometric
Mean Between The 2 Parts.

$$\frac{A}{\text{Alt}} = \frac{\text{Alt}}{B}$$

Find X
$$\frac{8}{X} = \frac{X}{5}$$
X² = 40
X = √40
X = 2√10
X = 6.3
Each Leg Is The Geometric
Mean Between The Part Of The
Hypotenuse Adjacent To The Leg And
The Whole Hypotenuse

$$\frac{Y}{W} = \frac{W}{Y+Z}, \qquad \frac{Z}{X} = \frac{X}{Y+Z}$$

Find T
$$\frac{8}{T} = \frac{T}{8+5}$$
T² = 8 (13)
T² = 104
T = √104
T = 2√26
T = 10.2
Find R
$$\frac{5}{R} = \frac{R}{5+8}$$
R² = 5 (5+8)
R² = 5 (13)
R² = 65
R = √65
R = 8.1
Two Numbers: Detailed Method

To solve for X, find the square root of
your product. If you’re lucky, the
result will be a whole number. If not,
you can provide a decimal answer or
leave your answer in square root form.
Two Numbers: Detailed Method
The example below is in
simplified square root form.
Three or More Numbers: Simple Method
• 1-Plug your numbers into the equation
below.
$$\text{Mean} = (a_1 \times a_2 \times \dots \times a_n)^{1/n}$$
a1 is your first number, a2 is your second
number, and so forth
n is the number of entries
• 2-Multiply the numbers (a1, a2, etc.)
together.
• 3-Calculate the nth root of this number. This
is the geometric mean.
Three or More Numbers: Detailed Method

• 1-Find the log of each number and add the logarithmic
values together.
Find the LOG button on your calculator. When you’re
ready, type: (first number) LOG + (second number) LOG +
(third number) LOG [+ log of additional numbers as
necessary] =. Do not neglect to type = or the number you
see will be the log of the most recent number, not the total.
• Ex. log 7 + log 9 + log 12 = 2.878521796…
• 2-Divide the sum of the logarithmic values by the
number of values you added. If you added the logs of
three numbers, divide by three.
Ex. 2.878521796 / 3 = .959507265…
• 3-Find the antilog of your result. On your calculator,
press the 2nd function (usually yellow) and then LOG
to activate the secondary function of the log button, or
the antilog. This resulting value is the geometric mean.
Ex. antilog .959507265 = 9.109766916. Therefore, the
geometric mean of 7, 9, and 12 is 9.11.
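The log method above translates directly into a few lines of Python. This is a minimal sketch with an illustrative function name; it assumes strictly positive inputs because the logarithm of zero or a negative number is undefined.

```python
from math import log10


def geometric_mean(values):
    """Geometric mean by the log method: average the logs, then take the antilog."""
    if any(v <= 0 for v in values):
        raise ValueError("this log-based sketch needs strictly positive values")
    mean_log = sum(log10(v) for v in values) / len(values)
    return 10 ** mean_log   # antilog of the mean log


gm_three = geometric_mean([7, 9, 12])   # about 9.11
gm_two = geometric_mean([2, 32])        # about 8, matching the simple method
```

The two-number result agrees with the square root method, since the nth root of a product equals the antilog of the mean log.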
Difference between arithmetic and
geometric mean:
If you wanted the arithmetic mean of 3, 4 and
18, for example, you would add 3 + 4 + 18,
then divide by 3 because there are three
numbers. The result would be 25/3 or about
8.333..., which shows that if you had three
values of 8.3333..., it would give the same
total as the individual values of 3, 4, and 18.
The arithmetic mean answers the question,
"If all the quantities had the same value,
what would that value have to be in order to
add up to the same total?"
Difference between arithmetic and
geometric mean:
By contrast, the geometric mean answers the
question, "If all the quantities had the same
value, what would that value have to be in
order to have the same product when
multiplied?" So to find the geometric mean
of 3, 4 and 18, we would multiply 3 x 4 x 18.
This would give us 216. We would then take
the cube root (cube root because there
were three original numbers). The answer
would be 6. In other words, since 6 x 6 x 6 =
3 x 4 x 18, 6 is the geometric mean of 3, 4
and 18.
IMPORTANT
 The geometric mean only applies to non-negative
numbers. In word problems where using a geometric mean
is appropriate, the scenario will usually not make sense
with negative numbers.
 The geometric mean of any set of numbers is always less
than or equal to the arithmetic mean of that set.
Normal Distribution
BY
DR ABDUL RAUF
The Normal Distribution is also called the Gaussian
distribution.
It is defined by two parameters: the mean ('average', m) and
the standard deviation (σ).
It is a theoretical frequency distribution for a set of variable
data, usually represented by a bell-shaped curve
symmetrical about the mean.
Types of Distribution
• Frequency Distribution
• Normal (Gaussian) Distribution
• Probability Distribution
• Poisson Distribution
• Binomial Distribution
• Sampling Distribution
• t distribution
• F distribution
What is Normal (Gaussian) Distribution?
It is a continuous frequency
distribution, in which a large no. of
observations of any variable such as
Hb, Ht, Wt, BP etc. are made with a
small class interval.
This is the most important probability
distribution in statistics and
important tool in analysis of
epidemiological data and
management science.
The Normal Distribution

•‘Bell Shaped’
• Symmetrical
• Mean, Median and Mode
are Equal
[Figure: Normal distribution curve; x-axis from 36.31 to 36.39]


Distribution Curves - shape
• Shape
• ‘A’ is a Normal Distribution Curve.
• ‘B’ is a Skewed Distribution Curve.
Distribution Curves - spread
• Spread
• Spread of curve ‘B’ is greater than
spread of curve ‘A’.
Distribution Curves - location
• Location
• Line XX is location of curve ‘A’.
• Line YY is location of curve ‘B’.
Percentages of the Normal Distribution
Within ± 1s: 68.26%
Within ± 2s: 95.44%
Within ± 3s: 99.73%
Within ± 4s: 99.994%
Characteristics of Normal Distribution
1)Has a Bell Shape Curve and is Symmetric
2)The rim of the bell does not rest on the abscissa
but is separated from it by a gap.
3) It is Symmetric around the mean:
Two halves of the curve are the same (mirror
images)
Characteristics of Normal Distribution Cont’d
4)The total area under the curve is 1 (or 100%)
5)Normal Distribution has the same shape as Standard Normal
Distribution
6)All the three measures of central tendency i.e. mean, median,
mode coincide i.e. a perpendicular drawn from the peak of
curve to abscissa, that point on the abscissa is the mean,
median and the mode
Characteristics of Normal Distribution Cont’d
7) In a Standard Normal Distribution:
The mean (μ ) = 0 and
Standard deviation (σ) =1
8)Maximum no. of observations are at the value of
variable corresponding to the mean and the no. of
observation on both sides of this value gradually
decrease and there are few observations at the
extreme points
Characteristics of Normal Distribution Cont’d
9)The area under the curve (no. of observations) can be
represented in terms of the relationship between the mean and
the standard deviation. The relationship is expressed as
follows:
Mean ± 1SD includes 68.3 % (roughly 2/3rd) of all observations
Mean ± 2SD includes 95.4 % of all observations
Mean ± 3SD includes 99.7 % of all observations
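These percentages can be checked with Python's standard library. The sketch below uses the error function from the `math` module; the function name is illustrative.

```python
from math import erf, sqrt


def coverage_within(k):
    """Fraction of a normal distribution within mean ± k standard deviations."""
    return erf(k / sqrt(2))


within_1sd = coverage_within(1)   # about 0.683
within_2sd = coverage_within(2)   # about 0.954
within_3sd = coverage_within(3)   # about 0.997
```

The values do not depend on the particular mean or standard deviation, which is why the same three percentages apply to every normal distribution.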
Percent of Values Within One
Standard Deviation

68.26% of Cases

Percent of Values Within Two
Standard Deviations

95.44% of Cases

Percent of Values Within Three
Standard Deviations

99.73% of Cases
Characteristics of Normal Distribution Cont’d
10)Thus it is seen that almost all the values of observation will
be within the range mean ± 3SD, and most of the values are
within the range mean ± 2SD. This relationship is useful for
fixing the confidence intervals of the variates.
11)The properties of a normal distribution and a normal curve
form the basis of various tests of significance.
Characteristics of Normal Distribution Cont’d
12)Values larger and smaller than mean ± 3 SD will be rare (less
than 1%) in nature and those larger and smaller than mean ±
2 SD will occur less than 5% of the time. In other words, suppose we say
that the confidence limit is 99%; that means about 99% of the
values are distributed within the range of x̄ ± 3 SD and the
probability of any value falling outside this
range is only about 1% (p=0.01).
Normal Distribution
Similarly, suppose we say that the confidence limit
is 95%; that means about 95% of the values are
distributed within the range of x̄ ± 2 SD and the
probability of any value falling
outside or beyond this range is only about 5% (p=0.05).
Formula
X < mean = 0.5-Z
X > mean = 0.5+Z
X = mean = 0.5
Z = (X-m) / σ
where,
m = Mean.
σ = Standard Deviation.
X = Normal Random Variable
Z score is the difference of a specific observation from the mean
in terms of SD.
Formula is
$$Z = \frac{x - \bar{x}}{SD}$$
Where Z = relative deviate
x = the observation in question
x̄ = the mean
SD = the standard deviation
The Normal Distribution: an example.

• Suppose you must establish regulations concerning the
maximum number of people who can occupy a lift.
• You know that the total weight of 8 people chosen at random
follows a normal distribution with a mean of 550kg and a
standard deviation of 150kg.
• What’s the probability that the total weight of 8 people
exceeds 600kg?
First sketch a diagram
• The mean is 550kg and we are interested in the area that is
greater than 600kg.
• z = (x - m) / s
• Here x = 600kg,
m, the mean = 550kg
s, the standard deviation = 150kg
• z = (600 - 550) / 150
z = 50 / 150
z = 0.33
• Look in the table down the left hand column for z = 0.3,
and across under 0.03.
• The number in the table is the tail area for z = 0.33, which is
0.3707.
• This is the probability that the weight will exceed 600kg.
• Our answer is:
"The probability that the total weight of 8 people exceeds
600kg is 0.37 correct to 2 figures."
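The table lookup can be replaced by a direct calculation, sketched here with the complementary error function from Python's `math` module. The function name is illustrative.

```python
from math import erfc, sqrt


def upper_tail(x, mean, sd):
    """P(X > x) for a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return 0.5 * erfc(z / sqrt(2))


p_over_600 = upper_tail(600, 550, 150)   # z is exactly 1/3 here
p_table = upper_tail(0.33, 0, 1)         # z rounded to 0.33, as in the table
```

Using the unrounded z gives about 0.369 and the rounded z gives the table value of about 0.371; both round to 0.37.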
SKEWED DISTRIBUTION
When frequency distribution or frequency curve is
not symmetrical about the peak, it is said to be
“skewed”(asymmetrical). In other words one tail
of the curve will be longer than the other. This
skewness can be either to the right or to the left
of the peak
Positive Skewness (Tail to Right)
Negative Skewness (Tail to Left)
Skewness

• Positive Skewness: Mean ≥ Median
• Negative Skewness: Median ≥ Mean
• Pearson’s Coefficient of Skewness:

$$\text{Skewness} = \frac{3\,(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$
Application/Uses of Normal Distribution
• Its application goes beyond describing distributions.
• It is used by researchers and modelers.
• The major use of normal distribution is the role it plays in
statistical inference.
• The z score, along with the t-score, chi-square and F-statistics, is
important in hypothesis testing.
• It helps managers/management make decisions.


Exercise # 1

Then:

1) What area under the curve is above 80 beats/min?

Modified from Dawson-Saunders, B & Trapp, RG. Basic and Clinical Biostatistics,
2nd edition, 1994.

Tripthi M. Mathew, MD, MPH


Diagram of Exercise # 1

[Figure: normal curve with axis marked -3, -2, -1, μ, 1, 2, 3; shaded area 0.159]


Exercise # 2

Then:

2) What area of the curve is above 90 beats/min?

Diagram of Exercise # 2

[Figure: normal curve with axis marked -3, -2, -1, μ, 1, 2, 3; shaded area 0.023]


Exercise # 3
Then:

3) What area of the curve is between 50-90 beats/min?

Diagram of Exercise # 3

[Figure: normal curve with axis marked -3, -2, -1, μ, 1, 2, 3; shaded area 0.954]


Exercise # 4
Then:

4) What area of the curve is above 100 beats/min?

Diagram of Exercise # 4

[Figure: normal curve with axis marked -3, -2, -1, μ, 1, 2, 3; shaded area 0.0015]


Exercise # 5

5) What area of the curve is below 40 beats per min or above 100 beats per
min?

Diagram of Exercise # 5

[Figure: normal curve with axis marked -3, -2, -1, μ, 1, 2, 3; shaded area 0.0015 in each tail]


Solution/Answers

1) 15.9% or 0.159

2) 2.3% or 0.023

3) 95.4% or 0.954

4) 0.15 % or 0.0015

5) 0.3 % or 0.003 (0.0015 for each tail)
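The five answers can be cross-checked in Python. The mean of 70 beats/min and standard deviation of 10 beats/min used below are assumptions inferred from the answers (80 lies one SD above the mean, 100 lies three SD above); they are not stated in the text above.

```python
from math import erf, sqrt

MEAN, SD = 70.0, 10.0   # assumed values, inferred from the answers


def cdf(x, mean=MEAN, sd=SD):
    """P(X <= x) for a normal distribution."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))


above_80 = 1 - cdf(80)               # exercise 1
above_90 = 1 - cdf(90)               # exercise 2
between_50_90 = cdf(90) - cdf(50)    # exercise 3
above_100 = 1 - cdf(100)             # exercise 4
outside_40_100 = cdf(40) + above_100 # exercise 5, both tails
```

The exact tail beyond three SD is about 0.00135, which the rule-of-thumb percentages round to 0.15% per tail.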
PROBABILITY

BY
DR ABDUL RAUF
PROBABILITY
Probability means a chance factor for the
occurrence of a specific event, e.g.
chances of winning a lottery
chances of being selected
chances of getting a male child in the 1st
pregnancy etc.
PROBABILITY
This chance factor is associated with uncertainty,
because information about what will happen is not
available.
This uncertainty, which depends upon the
occurrence of favorable or unfavorable events,
is numerically expressed as probability.
PROBABILITY
PROBABILITY of a particular event can be defined as “the ratio
of the no. of favorable cases for the particular event to the total
no. of cases, both favorable & unfavorable to the particular
event”

$$P = \frac{n}{N} = \frac{\text{No. of favorable cases}}{\text{Total no. of both favorable and unfavorable cases}}$$
The Probability of an Event

$$P(\text{Event}) = \frac{\text{the number of ways it can happen}}{\text{the number of possible outcomes}}$$

Two coins: HH, TT, HT, TH
P(H+T) = 2/4 = 1/2
Try these…
                     a)      b)
P(red)               1/5     1/4
P(blue)              2/5     1/4
P(yellow or blue)    3/5     1/2
CARDS

What is the
probability of getting
one of the 4 fives?
4/52 = 1/13
DICE

What is the probability of
getting an even number?
P(even) = 3/6 = 1/2
COINS

What is the probability of tossing two
coins and getting H first and then T?
P(H & then T) = 1/4
The outcomes of an
experiment are the ways it can
happen.

The event is the
particular outcome you
are looking for.
PROBABILITY
• The experiment
e.g., tossing a coin, picking 4 cards, weather conditions, etc.
• Outcome: What could happen in the experiment
e.g., getting a head or a tail, JJQ2 or A357, rain, snow, clouds,
sun, etc.
• Event: What we want in an experiment
e.g., getting a head, picking all hearts, no precipitation.
PROBABILITY
MUTUALLY EXCLUSIVE EVENTS
Two events A & B are said to be mutually exclusive if they
cannot occur together, e.g.
tossing a fair coin will result in either H or T
EQUALLY LIKELY EVENTS
Events that have equal chances of
occurring, e.g. tossing a fair coin
P(H)= ½ P(T)= ½
PROBABILITY
EXHAUSTIVE EVENTS
Events that together cover all the possible
outcomes, e.g.
one coin= 2 possibilities (H,T)
two coins= 4 possibilities (H,H),(H,T),(T,H),(T,T)
INDEPENDENT EVENTS
Two events A & B are said to be independent events if the
occurrence of A does not affect the occurrence of B & vice versa,
e.g. tossing 2 fair coins at a time
PROBABILITY

DEPENDENT EVENTS
Two events A & B are said to be dependent
events if the occurrence of A affects the
occurrence of B & vice versa, e.g. drawing two
cards from a deck without replacement
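The two-coin case can be checked by listing the sample space in Python. This is a minimal sketch with illustrative names; it assumes fair coins, so all four outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing two fair coins: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))


def probability(event):
    """Favorable cases divided by total cases for equally likely outcomes."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))


p_first_h = probability(lambda o: o[0] == "H")
p_second_t = probability(lambda o: o[1] == "T")
p_h_then_t = probability(lambda o: o == ("H", "T"))

# For independent events the joint probability equals the product
is_independent = p_h_then_t == p_first_h * p_second_t
```

The joint probability of H first and then T is 1/4, the product of the two separate probabilities of 1/2, which is what independence means.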
PROBABILITY
• We write probabilities as ratios; these ratios
can then be written as fractions or percents.
• 0 means that the event is impossible.
• 1 means that the event is certain.
Probability

• Probability is a measure of how likely it is for
an event to happen.
• We name a probability with a number from 0
to 1.
• If an event is certain to happen, then the
probability of the event is 1.
• If an event is certain not to happen, then the
probability of the event is 0.
Probability

• If it is uncertain whether or not an event will
happen, then its probability is some fraction
between 0 and 1 (or a fraction converted to a
decimal number).
CHANCE

• Chance is how likely it is that something will


happen. To state a chance, we use a percent.

Probability 0: certain not to happen (chance 0%)
Probability ½: equally likely to happen or not to happen (chance 50%)
Probability 1: certain to happen (chance 100%)
Chance

• When a meteorologist states that the chance


of rain is 50%, the meteorologist is saying
that it is equally likely to rain or not to rain.
If the chance of rain rises to 80%, it is more
likely to rain. If the chance drops to 20%,
then it may rain, but it probably will not rain.
LAWS OF PROBABILITY
A) ADDITION LAWS
i. Addition law of probability for mutually exclusive events
ii. Addition law of probability for non mutually exclusive
events
B) MULTIPLICATION LAWS
i. Multiplication law of probability for independent events
ii. Multiplication law of probability for dependent events
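For reference, the four laws listed above are commonly written as follows. The notation (union, intersection, and the conditional probability P(B | A)) is the standard textbook one and is an assumption here, since the slides do not define it.

```latex
\begin{align*}
P(A \cup B) &= P(A) + P(B) && \text{(A, B mutually exclusive)}\\
P(A \cup B) &= P(A) + P(B) - P(A \cap B) && \text{(A, B not mutually exclusive)}\\
P(A \cap B) &= P(A)\,P(B) && \text{(A, B independent)}\\
P(A \cap B) &= P(A)\,P(B \mid A) && \text{(B dependent on A)}
\end{align*}
```

For example, drawing one card from a deck, P(ace or heart) = 4/52 + 13/52 - 1/52 = 16/52, because the ace of hearts would otherwise be counted twice.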
Question

What is the probability


of rolling a 2 when a fair
die is rolled once?
Answer

1/6
Question

What is the probability of


rolling an even number
when a fair die is rolled
once?
Answer

½
Question

What is the probability that


a card drawn at random
from a deck of cards will be
an ace ?
Answer

4/52 = 1/13
Question

What is the probability


that when a pair of six-
sided dice are thrown, the
sum of the numbers
equals 5?
Answer

4/36 = 1/9
Question

A book contains 32 pages numbered 1, 2,


..., 32. If a student randomly opens the
book, what is the probability that the
page number contains digit 1?
Answer

13/32 = .40625
(pages 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 31)
Question

Waseem wants a sandwich and a


drink for lunch. If a restaurant has 4
choices of sandwiches and 3 choices
of drinks, how many different ways
can he order his lunch?
Answer

12
Question

In how many arrangements


can a teacher seat 3 girls and 3
boys in a row of 6 if the boys
are to have the first, third, and
fifth seats?
Answer

36 = (3)(3)(2)(2)(1)(1)
Question
If a customer makes exactly 1 selection from each of the 5
categories listed below, what is the greatest number of different
ice cream sundaes that a customer can create?
12 ice cream flavors;
10 kinds of candy;
8 liquid toppings;
5 kinds of nuts;
With or without whip cream;
Answer

9600=(12)(10)(8)(5)(2)
Question
You are at your school cafeteria that allows you to
choose a lunch meal from a set menu. You have
two choices for the Main course (a hot chicken or a
big burger), Two choices of a drink (orange juice,
apple juice) and Three choices of dessert(pie, ice
cream, jello). How many different meal combos can
you select?
Answer

12
Question
Licence plates for cars are labelled with 3
letters followed by 3 digits (here, digits
means 0 - 9).
How many possible plates are there? You can
use the same letter or digit more than once.
Answer

(26)(26)(26)(10)(10)(10) = 17,576,000
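The counting questions above all use the same multiplication rule: multiply the number of options at each step. A minimal sketch that re-checks the four answers; the helper name `count_choices` is hypothetical and not part of the slides.

```python
from math import prod


def count_choices(*options_per_step):
    """Multiply the number of options available at each step."""
    return prod(options_per_step)


lunch = count_choices(4, 3)                      # sandwiches x drinks
seating = count_choices(3, 3, 2, 2, 1, 1)        # boys and girls in alternate seats
sundaes = count_choices(12, 10, 8, 5, 2)         # one pick from each category
plates = count_choices(26, 26, 26, 10, 10, 10)   # 3 letters then 3 digits

print(lunch, seating, sundaes, plates)
```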
Question

Convert 23/25 to a percent


Answer

92%
Question

Convert 4% to a
simplified fraction
Answer

4/100 = 1/25
Answer

1/4 = .25
Question

What is the fraction representation of A?

A
B
C
Answer

1/8
Question
What is the
probability of
the following
spinner
landing on
Longitude?
Answer

1/3
Question
What is the probability of the
following spinner landing on blue
or red?
Answer

5/8 = .625
Question
What is the probability of the
following spinner landing on
yellow and red?
Answer

0
Question
What is the probability of landing on a
multiple of 3 if you spin the following
spinner?
Answer

2/8 = 1/4 = .25
Question
What is the probability of
landing on 2 if you spin the
following spinner?
Answer

3/8 = .375
Question

You have a container filled with


eight pieces of green paper and
six pieces of blue paper. What is
the probability of choosing,
without peeking, a green piece of
paper?
Answer

8/14 = 4/7 ≈ .5714
Question

Aamir flipped a fair coin 100


times and the coin landed on tails
55 times. What is the
experimental probability that
this coin will land on heads the
next flip?
Answer

45/100 = 9/20 = .45
Question
What fractional portion of the
following picture is shaded?
Answer

3/8
Question
Which representation of
fractions below is the largest
number?
Answer

3/4
THANK YOU
Question from Permutations or
Combinations

How many ways can


10 CD’s be selected
from 45?
Answer from Permutations or
Combinations

Combinations
Question from Permutations or
Combinations

How many ways can


10 songs be arranged
on a CD-R?
Answer from Permutations or
Combinations

Permutation
Question from Permutations or
Combinations

How many 5-digit


numbers are there with
no repeating digits?
Answer from Permutations or
Combinations

Permutations
Question from Permutations or
Combinations

How many ways are


there to vote for four
people from a group of
nine?
Answer from Permutations or
Combinations

Combination
Question from Permutations or
Combinations

How many sets of Secretary,


Treasurer, and Historian can
be selected from a group of
10?
Answer from Permutations or
Combinations

Permutations
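The questions above only ask whether each count is a permutation (order matters) or a combination (order does not matter). As a sketch, the counts themselves can be computed with Python's standard `math` module; the variable names are illustrative, and the 5-digit case assumes the leading digit cannot be 0.

```python
from math import comb, factorial, perm

cd_selections = comb(45, 10)      # choose 10 CDs from 45, order irrelevant
song_orders = factorial(10)       # arrange 10 songs in order
five_digit = 9 * perm(9, 4)       # leading digit 1-9, then 4 of the other 9 digits in order
vote_for_four = comb(9, 4)        # choose 4 people from 9
officer_sets = perm(10, 3)        # Secretary, Treasurer, Historian from 10

print(cd_selections, song_orders, five_digit, vote_for_four, officer_sets)
```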
THANK YOU
SAMPLING
BY
DR ABDUL RAUF
SAMPLING
It is not possible for any scientific study to cover the whole
population because of the
COST
TIME
PRACTICABILITY
So a representative portion of the universe is taken for the
study. It is called a SAMPLE
SAMPLING
Sampling is the process
of selecting a small number of elements
from a larger defined target group
of elements such that
the information gathered
from the small group will allow judgments
to be made about the larger group
SAMPLING
Universe: the theoretical aggregation of all
possible elements—unspecified to time and
space (e.g., University of Sargodha).
SAMPLING
Population: the theoretical aggregation of
specified elements as defined for a given
survey defined by time and space (e.g.,
medical students and staff in 2008).
SAMPLING
Population
The word “population” or “universe” means
an aggregate of all “elementary units”, animate or
inanimate, about
which information is required
SAMPLING
Universe or whole population may be finite,
e.g. 100 kg of rice in a sack,
all inhabitants of a city
Universe or whole population may be
infinite, e.g. stars in the sky
SAMPLING
Universe may be “HOMOGENEOUS” (made up of a uniform
class), e.g. polished rice in a sack,
all Muslim women of reproductive age in a city
It may be “HETEROGENEOUS” (made up of dissimilar classes
of persons or animals or objects)
• If all members of a population are
identical, the population is considered to
be homogeneous. That is, the
characteristics of any individual in the
population would be the same as the
characteristics of any other individual
(little or no variation among individuals).
So, if the human population on
Earth was homogeneous in
characteristics, how many
people would an alien need to
abduct in order to understand
what humans were like?
• When individual members of a population are
different from each other, the population is
considered to be heterogeneous (having significant
variation among individuals).
• How does this change an alien’s abduction scheme
to find out more about humans?
• In order to describe a heterogeneous population,
observations of multiple individuals are needed to
account for all possible characteristics that may
exist.
Defining Population of Interest
• Population of interest is entirely dependent on
Management Problem, Research Problems, and Research
Design.
• Some Bases for Defining Population:
– Geographic Area
– Demographics
– Usage/Lifestyle
– Awareness
SAMPLING
Sample or Target population: the
aggregation of the population from
which the sample is actually drawn (e.g.,
medical students and faculty in 2008-09
academic year)
SAMPLING
Sample element: a case or a single unit that is
selected from a population and measured in
some way—the basis of analysis (e.g., a
person, thing, specific time, etc.)
SAMPLING
Sample frame: a specific list that closely
approximates all elements in the
population, and from this, the researcher
selects units to create the study sample
Sampling Frame
• A list of population elements, (people,
companies, houses, cities, etc.) from which, units
to be sampled can be selected.
• It is difficult to get an accurate list.
• Sample frame error occurs when certain
elements of the population are accidentally
omitted or not included on the list.
SAMPLING
Sample: a set of cases that is drawn from a
larger pool and used to make
generalizations about the population
Conceptual Model
Universe

Population

Sample Population

Sample Frame

Elements
What is Sampling?
Population: what you want to talk about
Sample: what you actually observe in the data
Sampling process: from the population, through the sampling frame, to the sample
Inference: from the sample back to the population

Using data to say something (make an inference) with confidence about
a whole (population) based on the study of only a few (sample).
The Sampling Design Process

Define the Population

Determine the Sampling Frame

Select Sampling Technique(s)

Determine the Sample Size

Execute the Sampling Process


Developing a Sampling Plan
1. Define the Population of Interest
2. Identify a Sampling Frame (if possible)
3. Select a Sampling Method
4. Determine Sample Size
5. Execute the Sampling Plan
Sampling Methods

Probability Nonprobability
sampling sampling
Classification of Sampling
Techniques
Sampling Techniques
• Nonprobability Sampling Techniques
– Convenience Sampling
– Judgmental Sampling
– Quota Sampling
– Snowball Sampling
• Probability Sampling Techniques
– Simple Random Sampling
– Systematic Sampling
– Stratified Sampling
– Cluster Sampling
– Other Sampling Techniques
Probability Sampling
• A sample must be representative of
the population with respect to the
variables of interest.
• A sample will be representative of
the population from which it is
selected if each member of the
population has an equal chance
(probability) of being selected.
Probability Sampling
• Probability samples are more accurate than non-
probability samples
–They remove conscious and unconscious sampling
bias.
• Probability samples allow us to estimate the accuracy
of the sample.
• Probability samples permit the estimation of
population parameters.
Simple Random Sampling

Simple random sampling is a method of


probability sampling in which
every unit has an equal non zero
chance of being selected
Simple
Random
Sampling

1. Select a suitable sampling frame


2. Each element is assigned a number from 1 to N
(pop. size)
3. Generate n (sample size) different random
numbers
between 1 and N
4. The numbers generated denote the elements that
should be included in the sample
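The four steps above can be sketched in a few lines. This is a minimal illustration, assuming the sampling frame is available as a Python list; the function name, the `seed` parameter and the student labels are hypothetical.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct elements from the frame, without replacement."""
    rng = random.Random(seed)
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the frame size N")
    # Steps 2-3: elements are numbered 1..N and n different numbers are drawn
    chosen_numbers = rng.sample(range(1, N + 1), n)
    # Step 4: the drawn numbers identify the sampled elements
    return [frame[k - 1] for k in sorted(chosen_numbers)]


frame = [f"student_{k}" for k in range(1, 76)]   # hypothetical frame, N = 75
sample = simple_random_sample(frame, 25, seed=1)
print(sample[:5])
```

Passing a `seed` only makes the draw repeatable for checking; a real survey would normally omit it.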
Simple Random Sampling (SRS)
• Method:
–A sample of size ‘n’ is drawn from a
population of size ‘N’ in such a way that every
element in the population has
the same chance of being selected.
Typically conducted “without
replacement”
• What are some ways for conducting an SRS?
–Random numbers table,
– drawing out of a hat,
–random timer, etc.
Simple Random Sampling

• Advantage
–Most representative group
• Disadvantage
–Difficult to identify every member of a
population
Systematic Random Sampling
Systematic random sampling
is a
method of
probability sampling
in which the defined
target population is ordered
and the sample is selected
according to position using a skip interval
Systematic Random Sampling
To select a systematic random sample of 25 students (n) from
a class of 75 students (N)
“N” is divided by “n” to get a quotient “r”
(75/25=3)
Then one unit among the first “r” units is selected at random and the other units are
subsequently selected by the addition of this quotient “r” to
the previously selected number.
Systematic Random Sampling
Since “r” is 3, any one number (unit) is drawn between 1 and
3, say 2; then the sample will be made up of students with
numbers
2,
2+3=5,
5+3=8,
8+3=11 and so on
This quotient “r” is known as the “SAMPLING INTERVAL”
EXERCISE
Suppose there are 210 villages in a
community development block
and 40 villages are desired. How
will you select ?
Systematic Random Sampling
Quotient “r” = N / n
= 210 / 40
= 5, remainder is 10
A random number is selected from 1 to 10, say 6,
which becomes the first number
“r” = 5 is added to 6 and so on
Hence, the serial numbers of the villages to be
selected will be
6, 11, 16, 21, 26, 31, 36, 41, … and so on up to 40
numbers (the last being 6 + 39 × 5 = 201)
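The two selections worked above (25 of 75 students, 40 of 210 villages) follow the same procedure, sketched here. The function name is hypothetical, and the random start is drawn from 1 to r as in the student example; drawing it from 1 to 10, as the village exercise does, is a variant that also stays within range.

```python
import random


def systematic_sample(N, n, seed=None):
    """Return the serial numbers of n units chosen with sampling interval r = N // n."""
    rng = random.Random(seed)
    r = N // n                     # sampling interval (quotient, remainder ignored)
    start = rng.randint(1, r)      # random start among the first r units
    return [start + k * r for k in range(n)]


students = systematic_sample(75, 25, seed=0)
villages = systematic_sample(210, 40, seed=0)
print(students[:4], villages[:4])
```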
Systematic Sampling
• Advantage
– Quick, efficient, saves time and energy
• Disadvantage
– Not entirely bias free; each item does not
have equal chance to be selected
– System for selecting subjects may
introduce systematic error
– Cannot generalize beyond population
actually sampled
Stratified Random Sampling
Stratified random sampling is a
method of
probability sampling
in which the population is divided
into different subgroups and samples
are selected from each
Steps in Drawing a Stratified Random
Sample
Divide the target population into
homogeneous subgroups or strata
depending upon the characteristics to be
studied (the basis being age group, sex,
area, socio-economic
status, or level of study: PhD students, Masters students,
Bachelors students). Elements within
each stratum are homogeneous, but are
heterogeneous across strata
Steps in Drawing a Stratified Random
Sample
A simple random or a systematic sample is
taken from each stratum relative to the
proportion of that stratum to each of the
others
Combine the samples from each stratum into a
single sample of the target population
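A minimal sketch of the steps above with proportional allocation: each stratum contributes a simple random sample in proportion to its share of the population. The stratum names and sizes are hypothetical, and rounding may make the total differ slightly from n for other sizes.

```python
import random


def proportional_stratified_sample(strata, n, seed=None):
    """Take a simple random sample from each stratum, proportional to its size."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        k = round(n * len(units) / total)    # this stratum's share of the sample
        sample[name] = rng.sample(units, k)  # SRS within the stratum
    return sample


# Hypothetical population of 1000 split into three age strata
strata = {
    "children": [f"child_{i}" for i in range(300)],
    "adults": [f"adult_{i}" for i in range(500)],
    "old persons": [f"old_{i}" for i in range(200)],
}
sample = proportional_stratified_sample(strata, 100, seed=42)
print({name: len(units) for name, units in sample.items()})
```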
Researchers use stratified sampling
–When a stratum of interest is a small
percentage of a population and
random processes could miss the
stratum by chance.
–When enough is known about the
population that it can be easily
broken into subgroups or strata.
This type of sampling is used when population is
heterogeneous. For example prevalence of a disease is
different in different age groups
The population is stratified into different sub groups as
Children
Adults
Old persons
ADVANTAGES
Precision of the estimate of the characteristic
under study is increased
Estimates of the characteristic under study can be
made for each stratum separately
Ensures that all strata are adequately represented
POPULATION
n = 1000; SE = 10%
Equal intensity:
STRATA 1: n = 500; SE = 7.5%    STRATA 2: n = 500; SE = 7.5%
Proportional to size:
STRATA 1: n = 400; SE = 7.5%    STRATA 2: n = 600; SE = 5.0%
Sample with equal intensity or proportional to size?
What do you want to do? Describe the population,
or describe each stratum?
CLUSTER SAMPLING
In this case enumeration (sampling) units are not individuals but
clusters such as families in a village, villages in a district,
schools and wards in a city
A sample of clusters proportionate to their size is randomly
drawn
Either everyone in the selected clusters is studied or only a certain
number of subjects of a specified age or age group are
examined
CLUSTER SAMPLING
Cluster sampling is employed for carrying
out evaluation survey of immunization
coverage.
It is used when list of sampling unit is not
available
Cluster
Sampling

1. Assign a number from 1 to N to each element in the population


2. Divide the population into C clusters of which c will be included
in
the sample
3. Calculate the sampling interval i, i=N/c (round to nearest integer)
4. Select a random number r between 1 and i, as explained in
simple
random sampling
5. Identify elements with the following numbers:
r,r+i,r+2i,... r+(c-1)i
6. Select the clusters that contain the identified elements
7. Select sampling units within each selected cluster based on SRS
or systematic sampling
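Steps 1 to 6 above can be sketched as follows. This is an illustration under stated assumptions: the cluster names and sizes are hypothetical, elements are numbered consecutively cluster by cluster, and the interval is rounded down (rather than to the nearest integer) so that every identified element number stays within 1 to N. Step 7, sampling within each chosen cluster, is left out.

```python
import random


def select_clusters(cluster_sizes, c, seed=None):
    """Select clusters by systematic selection over the numbered elements."""
    rng = random.Random(seed)
    N = sum(cluster_sizes.values())        # elements numbered 1..N
    i = max(1, N // c)                     # sampling interval, rounded down
    r = rng.randint(1, i)                  # random start between 1 and i
    targets = [r + k * i for k in range(c)]
    selected = []
    upper = 0
    for name, size in cluster_sizes.items():
        lower, upper = upper + 1, upper + size   # element numbers in this cluster
        if any(lower <= t <= upper for t in targets) and name not in selected:
            selected.append(name)
    return selected


villages = {"A": 100, "B": 50, "C": 150, "D": 200}   # hypothetical cluster sizes
chosen = select_clusters(villages, c=2, seed=3)
print(chosen)
```

Because larger clusters hold more element numbers, they are more likely to contain an identified element, which is how the selection ends up proportionate to size.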
Cluster sampling
Some populations are spread out (over
a state or country).
Elements occur in clumps (towns,
districts)—Primary sampling units
(PSU).
Elements are hard to reach and
identify
You cannot assume that any one clump is
better or worse than another clump.
• Advantage
–More practical, less costly
• Conclusions should be stated in terms of
cluster (sample unit – school)
• Sample size is # of clusters
POPULATION
“CLUMP” = primary sampling unit
Randomly select PRIMARY SAMPLING UNITS.
Elements: sample ALL in the
selected primary sampling unit.
Cluster sampling
Used when:
– Researchers lack a good sampling frame for
a dispersed population.
– The cost to reach an element to sample is
very high.
Usually less expensive than SRS but not as
accurate
– Each stage in cluster sampling introduces
sampling error—the more stages there are,
the more error there tends to be.
Multistage Sampling
As the name implies,
This method consists of sampling
procedure carried out in several
stages,
Using random sampling
techniques
Multistage Sampling
This is convenient when the
population of entire district(or state
or country) is to be studied, within
limited resources
First a random sample of districts is
chosen from the province. Then
a random sample of tehsils is
chosen, followed successively by
villages and houses
Example
For a hookworm survey in a district, 10% of
tehsils are chosen, followed by 10% of
villages.
Then all persons in every 10th house are subjected
to stool examination.
Non probability
sampling
Classification of Sampling
Techniques

Non probability Sampling Techniques
– Convenience Sampling
– Judgmental Sampling
– Quota Sampling
– Snowball Sampling
Non probability Sampling
The difference between non probability
and probability sampling is that non
probability sampling does not
involve random selection and probability
sampling does. Does that mean that non
probability samples aren't representative of
the population? Not necessarily. But it does
mean that non probability samples cannot
depend upon the rationale of probability
theory. At least with a probabilistic sample, we
know the odds or probability that we have
represented the population well.
Non probability Sampling
With non probability samples, we may or may not
represent the population well, and it will often be hard
for us to know how well we've done so. In general,
researchers prefer probabilistic or random sampling
methods over non probabilistic ones, and consider
them to be more accurate and rigorous.
• Accidental, Haphazard or Convenience Sampling
• One of the most common methods of sampling goes under the
various titles listed here. I would include in this category the
traditional "man on the street" (of course, now it's probably
the "person on the street") interviews conducted frequently by
television news programs to get a quick (although non
representative) reading of public opinion. I would also argue
that the typical use of college students in much psychological
research is primarily a matter of convenience.
In clinical practice, we might use clients who are available to us
as our sample. In many research contexts, we sample simply by
asking for volunteers. Clearly, the problem with all of these
types of samples is that we have no evidence that they are
representative of the populations we're interested in
generalizing to -- and in many cases we would clearly suspect
that they are not.
Convenience Sampling
Convenience sampling attempts to
obtain a sample of convenient
elements. Often, respondents are
selected because they happen to be in
the right place at the right time.
use of students, and members of social
organizations
department stores using charge account
lists
“people on the street” interviews
Purposive Sampling
In purposive sampling, we sample with a purpose in mind. We
usually would have one or more specific predefined groups we
are seeking. For instance, have you ever run into people in a
mall or on the street who are carrying a clipboard and who are
stopping various people and asking if they could interview
them? Most likely they are conducting a purposive sample
(and most likely they are engaged in market research)
Judgmental Sampling
Judgmental sampling is a form of
convenience sampling in which the
population elements are selected
based on the judgment of the
researcher.
test markets
purchase engineers selected in
industrial marketing research
expert witnesses used in court
Quota Sampling
In quota sampling, you select people non randomly according to
some fixed quota. There are two types of quota
sampling: proportional and non proportional. In proportional
quota sampling you want to represent the major
characteristics of the population by sampling a proportional
amount of each. For instance, if you know the population has
40% women and 60% men, and that you want a total sample
size of 100, you will continue sampling until you get those
percentages and then you will stop.
Quota Sampling
So, if you've already got the 40 women for
your sample, but not the sixty men, you
will continue to sample men, you will not
sample women because you have already
"met your quota." The problem here (as
in much purposive sampling) is that you
have to decide the specific characteristics
on which you will base the quota. Will it
be by gender, age, education race,
Quota Sampling
Non proportional quota sampling is a bit less restrictive. In this
method, you specify the minimum number of sampled units
you want in each category. Here, you're not concerned with
having numbers that match the proportions in the population.
Instead, you simply want to have enough to assure that you
will be able to talk about even small groups in the population.
This method is the non probabilistic analogue of stratified
random sampling in that it is typically used to assure that
smaller groups are adequately represented in your sample.
SNOWBALL SAMPLING
In snowball sampling, you begin by identifying someone who
meets the criteria for inclusion in your study. You then ask
them to recommend others who they may know who also
meet the criteria. Although this method would hardly lead to
representative samples, there are times when it may be the
best method available
SNOWBALL SAMPLING
Snowball sampling is especially useful when you are trying to
reach populations that are inaccessible or hard to find. For
instance, if you are studying the homeless, you are not likely to
be able to find good lists of homeless people within a specific
geographical area. However, if you go to that area and identify
one or two, you may find that they know very well who the
other homeless people in their vicinity are and how you can
find them.
SNOWBALL SAMPLING
In snowball sampling, an initial
group of respondents is selected,
usually at random. After being
interviewed, these respondents
are asked to identify others who
belong to the target population of
interest. Subsequent respondents
are selected based on the
referrals.
SNOWBALL SAMPLING

• One sample element leads on to more of the
same kind of sample.
THANK YOU
ERRORS IN SAMPLING &
SIZE OF SAMPLE
BY
ABDUL RAUF
ERRORS IN SAMPLING
Two types

1.Sampling errors
2.Non- sampling errors
SAMPLING ERRORS
These are due to:
Faulty sampling method
Small size of sample
These errors can be minimized through proper sampling
method
NON- SAMPLING ERRORS
A. COVERAGE ERROR
This occurs when not all the units in the sample are covered, either due to
non-cooperation or due to loss to follow-up.
This can be reduced by an intensive effort to get complete
coverage
NON- SAMPLING ERRORS

B. OBSERVATIONAL(OR
EXPERIMENTAL ) ERROR
This is due to interviewer’s bias or due to lack of training.
This can be reduced by setting up standards of interview or
proper training of the workers
NON- SAMPLING ERRORS

C. PROCESSING ERROR
This is due to clerical mistake or computational
error.
This can be reduced by administrative control
SIZE OF THE SAMPLE
Optimum size of the sample, has to be considered , keeping in
view the time, cost and the feasibility of the study.
FACTORS OF ESTIMATION OF SAMPLE SIZE
Characteristic
Permissible error
Probability level
Resources
SAMPLE SIZE FOR QUALITATIVE DATA
n = 4pq / L²

Where
n = required sample size,
p = approximate prevalence rate of the disease (obtained from
previous studies or from a pilot study)
q = 1 - p
L = permissible error in the estimate of “p”
The above formula has been worked out for a probability level
of 0.05 (i.e. the estimated prevalence rate will lie within the permissible
error of the true value in 95 percent of samples of this size).
Example
To estimate the prevalence rate of ascariasis in a
community, where it is approximately known to be
40 percent, the required sample size to
estimate the morbidity (ascariasis) with 5% error
at a probability level of 0.05 is calculated as follows:
Where p = 40%
q = 100 - p = 100 - 40 = 60%
L = 5% of 40 = 5/100 × 40 = 2
n = 4pq / L² = (4 × 40 × 60) / (2)² = 2400
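The ascariasis calculation above can be re-checked with a short sketch. The function and parameter names are hypothetical; p is taken in percent and the permissible error as a percentage of p, as in the example.

```python
from math import ceil


def sample_size_qualitative(p_percent, relative_error_percent):
    """n = 4pq / L^2, with p and q in percent and L a percentage of p."""
    q_percent = 100 - p_percent
    L = relative_error_percent * p_percent / 100   # permissible error, in percent
    return ceil(4 * p_percent * q_percent / L ** 2)


n = sample_size_qualitative(40, 5)
print(n)
```

Rounding up with `ceil` is a design choice: a fractional result means the next whole person is needed to reach the stated precision.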
FOR QUANTITATIVE DATA

n = (t²α × s²) / e²

Where
n = desired sample size
s = standard deviation of the observations
e = permissible error
tα = the value of ‘t’ at 5% level from the ‘t’ table
EXAMPLE
In a community survey to estimate the Hb level,
suppose it is known from the data already available that
the mean Hb level is about 12 gm % with an S.D.
of 1.5 gm %. Then the sample size required to
estimate the Hb level with a permissible error of
0.5 gm % is obtained as follows
n = (t²α × s²) / e²
s = 1.5 gm
e = 0.5 gm
t₀.₀₅ = 1.96 (can be taken as 2)
n = (2² × (1.5)²) / (0.5)² = (4 × 2.25) / 0.25
n = 36 persons
In clinical trials there will be two groups, one
experimental and the other a control group. In order to
estimate the size of the sample for each group,
the difference in the response rates of the two groups
is to be taken into consideration
n = (2 × t²α × s²) / d²

Where
n = required sample size for each group
s = pooled SD of the observations of the two groups
d = anticipated smallest difference
tα = usually taken as ‘t’ at 5% level
EXAMPLE
An investigator wants to investigate the increase in
Hb % level in anemia cases by administration of a
particular drug compared against a known drug.
The minimum no. of cases in each group to be
investigated is calculated as follows
Suppose
d = 2 gm %
s = 3 gm %
t at 5% level is taken as 2
n = (2 × (2)² × (3)²) / (2)² = (8 × 9) / 4 = 18 persons
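Both quantitative-data formulas above (one group, and two groups in a trial) can be re-checked with a small sketch. The function names are hypothetical, and t defaults to 2, the rounded value the examples use in place of 1.96.

```python
from math import ceil


def sample_size_mean(s, e, t=2.0):
    """One group: n = t^2 * s^2 / e^2."""
    return ceil(t ** 2 * s ** 2 / e ** 2)


def sample_size_two_groups(s, d, t=2.0):
    """Each of two groups: n = 2 * t^2 * s^2 / d^2."""
    return ceil(2 * t ** 2 * s ** 2 / d ** 2)


n_survey = sample_size_mean(s=1.5, e=0.5)       # Hb survey example
n_trial = sample_size_two_groups(s=3, d=2)      # drug comparison example
print(n_survey, n_trial)
```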
SAMPLING VARIATION
If two or more samples are drawn from the same
population, their means (m₁, m₂, m₃ ……) may not be
equal but will show variation, even though they are
from the same population.
Such a difference between the means of the samples is
known as “sampling variation”
SAMPLING VARIATION
The sampling variation, from one sample to
another, may be by chance, when it is called
“Natural or Biological variability” or
Due to play of certain factors, when it is called
“Real variability”
e. g effect of nutrition, vaccine, smoking etc
SAMPLING VARIATION
The means of the samples(m₁, m₂, m₃ etc) show
dispersion around the population mean(M)
symmetrically as in Normal distribution with a
central tendency and with a definite standard
deviation.
The variation in the sample means is measured in
“Standard error”
STANDARD ERROR
Standard error is, in fact, not an error but the SD of the
sample means around the population mean (M).
This standard deviation of the sample means about the
population mean is called the “Standard error of the
mean”, denoted as
SE(x̄) or simply the standard error (SE)
STANDARD ERROR
SE = SD / √n
Since the distribution of the means follows the pattern of a
normal distribution, it is not difficult to visualize that 95% of
the sample means will lie within the limits of two standard
errors (M ± 2SE), or M ± 2SD/√n
Therefore, the chance that the population
mean
(M) lies between the limits defined by the sample
mean ± 2SE is also 95%.
These are referred to as the 95% “Confidence limits”
The confidence level is increased to about 99% (more precisely 99.7%) by
increasing the no. of standard errors to ± 3SE
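A short sketch of the standard error and the ± 2SE confidence limits described above. The figures (mean 12, SD 1.5, n = 36) are borrowed from the earlier Hb example purely for illustration; the function names are hypothetical.

```python
from math import sqrt


def standard_error(sd, n):
    """SE = SD / sqrt(n)."""
    return sd / sqrt(n)


def confidence_limits(mean, sd, n, k=2):
    """Limits at mean +/- k standard errors (k = 2 for roughly 95%)."""
    se = standard_error(sd, n)
    return mean - k * se, mean + k * se


se = standard_error(1.5, 36)
low, high = confidence_limits(12, 1.5, 36)
print(se, low, high)
```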
“t” TEST

By
DR ABDUL RAUF
“t” TEST
W. S. Gosset observed that with small samples the sampling
variations will be large
He demonstrated that the ratio of the observed difference between
two values to the standard error (SE) of the difference follows a
distribution called the “t” distribution, and such a ratio is denoted
as “t” (for small samples)
“t” TEST
‘ t value’ was derived by “Student” in 1908
“t” is calculated as
the ratio of the difference between two means or proportions to
the standard error of the difference
t = (x̄₁ - x̄₂) / SE = mean difference / standard error of mean difference
Unpaired t- test
If the observations are made on two independent groups, like
Control group
Experimental (treated) group
and their means are compared for their significant difference
It is known as “unpaired comparisons” and the test applied is
Unpaired t- test
Paired t- test
If the observations are made on a single sample and the
values of a certain characteristic is noted before and
after the treatment with a particular drug, such
comparison of values of observations is known as
“paired comparison”

Test applied is “paired t- test”


“t” TEST
For unpaired samples,
t = (x̄₁ - x̄₂) / SE of difference between means
t = (x̄₁ - x̄₂) / SE(x̄₁ - x̄₂)
“t” TEST
FOR PAIRED SAMPLES
t = d̄ / SE of d
Where
d = difference in the two values for each pair
(total no. of pairs being n)
d̄ = mean of the n values of d
Here SE of d = SD of d / √n
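The paired formula above can be sketched as follows. The before and after readings are made-up illustrative values, not data from the slides, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """t = mean(d) / (SD(d) / sqrt(n)), with n - 1 degrees of freedom."""
    d = [a - b for b, a in zip(before, after)]   # difference for each pair
    n = len(d)
    se_d = stdev(d) / sqrt(n)                    # SE of d = SD of d / sqrt(n)
    return mean(d) / se_d, n - 1


# Hypothetical measurements on the same 5 subjects before and after treatment
before = [10.0, 11.0, 9.5, 10.5, 10.0]
after = [11.0, 12.5, 10.0, 11.5, 11.5]
t_value, df = paired_t(before, after)
print(round(t_value, 2), df)
```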
• Two basic formulas for calculating an uncorrelated t
test.
Equal sample size
t = (x̄1 - x̄2) / √((δ1² + δ2²) / n)
Unequal sample size
t = (x̄1 - x̄2) / √( [((n1 - 1)δ1² + (n2 - 1)δ2²) / (n1 + n2 - 2)] ∙ (1/n1 + 1/n2) )
where δ1² and δ2² are the variances of the two samples
DEGREES OF FREEDOM (DF)
• Represents the number of
independent observations in a
sample.
• Is a measure that states the
number of values that are free to
vary within a statistical test.
• Calculated as follows
In unpaired ‘t’ test of difference
between means,
DF = n₁ + n₂ - 2
• Where n₁ & n₂ are no. of
observations in each series
In paired ‘t’ test
DF = n - 1
• Where n is the no. of pairs
• A probability table is used
• First determine degrees of freedom
• Decide the level of significance
• Example: degrees of freedom = 4
α = .05

The critical value of t = 2.776 (two-tailed)


• If the calculated value of t is less than the
critical value of t obtained from the table, the
null hypothesis is not rejected.
• If the calculated value of t is greater than the
critical value of t from the table, the null
hypothesis is rejected.
• The following information is needed in a summary
table
Descriptive statistics
Mean
Variance
Standard deviation
1SD (68% Band)
2 SD (95% Band)
3 SD (99% Band)
Number

Results of t test
• Example: Data obtained from an experiment
comparing the number of un-popped seeds in
popcorn brand A and popcorn brand B.
A B
26 32
22 35
30 20
34 33
Is the difference significant?
• Determine mean, variance and standard deviation
of samples.
Mean x̄A = Σx / n = (26 + 22 + 30 + 34) / 4 = 112 / 4 = 28
Mean x̄B = Σx / n = (32 + 35 + 20 + 33) / 4 = 120 / 4 = 30
Variance δ² = Σ(x - x̄)² / (n - 1)
Popcorn A = [(26-28)² + (22-28)² + (30-28)² + (34-28)²] / 3
= (4 + 36 + 4 + 36) / 3
= 80 / 3 = 26.67
Popcorn B = [(32-30)² + (35-30)² + (20-30)² + (33-30)²] / 3
= (4 + 25 + 100 + 9) / 3
= 138 / 3 = 46
Standard deviation: δ = √δ²
Popcorn A
√26.67 = 5.16
Popcorn B
√46 = 6.78
Finding Calculated t
t = (x̄1 - x̄2) / √((δ1² + δ2²) / n)
t = (28 - 30) / √((26.67 + 46) / 4)
= -2 / √18.17
= -2 / 4.26 = -0.47
Determine critical value of t
• Select level of significance α = .01
• Determine degrees of freedom
degrees of freedom of A = 3
degrees of freedom of B = 3
total degrees of freedom = 6
• Critical value of t = 3.707
The magnitude of the calculated t, 0.47, is less than the critical value of
t from the table, 3.707.
The null hypothesis is not rejected.
Descriptive statistics    popcorn A        popcorn B
Mean                      28               30
Variance                  26.67            46
Standard deviation        5.16             6.78
1 SD (68% Band)           22.84 - 33.16    23.22 - 36.78
2 SD (95% Band)           17.67 - 38.33    16.44 - 43.56
3 SD (99% Band)           12.51 - 43.49    9.65 - 50.35
Number                    4                4
Results of t test         t = -0.47, df = 6
|t| of 0.47 < 3.707, α = .01
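As a cross-check, the popcorn statistics can be recomputed directly from the four observations listed for each brand. This is a sketch using Python's standard `statistics` module and the equal-sample-size formula; the function name is hypothetical, and the critical value 3.707 (df = 6, α = .01) is taken from the example rather than computed.

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t_equal_n(a, b):
    """t = (mean_a - mean_b) / sqrt((var_a + var_b) / n) for equal group sizes."""
    if len(a) != len(b):
        raise ValueError("this formula assumes equal sample sizes")
    n = len(a)
    se = sqrt((variance(a) + variance(b)) / n)   # variance() divides by n - 1
    return (mean(a) - mean(b)) / se, 2 * (n - 1)


brand_a = [26, 22, 30, 34]
brand_b = [32, 35, 20, 33]
t_value, df = unpaired_t_equal_n(brand_a, brand_b)

critical_t = 3.707                               # from the t table, as in the example
reject_null = abs(t_value) > critical_t
print(mean(brand_a), mean(brand_b), round(t_value, 2), df, reject_null)
```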
Error Types
• Type I Error: Reject H0 when it is true
• Type II Error: Do not reject H0 when it is false

Reality      Test result: Reject H0      Test result: Don't reject H0
H0 True      Type I Error                Correct
H0 False     Correct                     Type II Error

Types of errors in hypothesis tests
• α (alpha) is called the probability of a
type I error.
–a type I error occurs when we reject
the null hypothesis for a population
where the null hypothesis is true.
–type I error is like “crying wolf” when
there is no wolf

• β (beta) is called the probability of a type II error.
– type II error is the error of not rejecting the null
hypothesis, when the null is in fact false.
– type II error is like not noticing a wolf that is really there
– NOTE: β is not equal to 1 - α, although β is often larger
than α.
Errors and correct conclusions in a hypothesis test

Possible types of error depend on your sample statistics and on the true state of reality.

| Your conclusion | State of reality: H0 true | State of reality: H0 not true |
|---|---|---|
| do not reject H0 | correct inference, negative result | type II error |
| reject H0 | type I error | correct inference, positive result |
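The four cells of the table can be written as a small lookup function. This is only a sketch for checking one's reading of the table. The name `classify_outcome` is hypothetical.

```python
def classify_outcome(h0_is_true: bool, rejected_h0: bool) -> str:
    """Name the cell of the decision table for one test outcome."""
    if rejected_h0:
        # Rejecting a true H0 is a false alarm; rejecting a false H0 is right.
        return "type I error" if h0_is_true else "correct inference, positive result"
    # Keeping a true H0 is right; keeping a false H0 misses a real effect.
    return "correct inference, negative result" if h0_is_true else "type II error"


for h0_is_true in (True, False):
    for rejected_h0 in (True, False):
        label = classify_outcome(h0_is_true, rejected_h0)
        print(f"H0 true={h0_is_true}, rejected={rejected_h0}: {label}")
```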
Consequences of errors in hypothesis tests
• Consequences of type I (alpha) error:
– misleads other researchers
– social costs of erroneous information
– damages your reputation as a careful researcher
• Consequences of type II (beta) error:
–no publication for you (probably)
–no damage to your reputation as a careful
researcher
–the truth stays hidden, with possible social
consequences
• hopefully, the truth will come out later
THANK YOU
ABSOLUTE & RELATIVE MEASURES

BY
DR ABDUL RAUF
Absolute measures of dispersion are expressed in the same units as the
original data, so they cannot be used to compare variation between two
series that are measured in different units or have very different averages.
Relative measures are pure numbers with no units. Each is the ratio of an
absolute measure of dispersion to an appropriate average, for example the
coefficient of standard deviation or the coefficient of mean deviation.
Absolute Measures
Range
Quartile Deviation
Mean Deviation
Standard Deviation
Lorenz Curve
Relative Measures
Coefficient of Range
Coefficient of Quartile Deviation
Coefficient of Mean Deviation
Coefficient of Variation
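The difference between absolute and relative measures shows up clearly in a short calculation. The sketch below uses invented data and two standard textbook definitions that the slides name but do not spell out: coefficient of variation = (standard deviation / mean) × 100, and coefficient of range = (largest − smallest) / (largest + smallest).

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean (unit-free)."""
    return statistics.stdev(values) / statistics.mean(values) * 100


def coefficient_of_range(values):
    """(largest - smallest) / (largest + smallest), also unit-free."""
    return (max(values) - min(values)) / (max(values) + min(values))


# Hypothetical data, invented for illustration only.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

for name, series in (("height (cm)", heights_cm), ("weight (kg)", weights_kg)):
    print(
        f"{name}: SD = {statistics.stdev(series):.2f}, "
        f"CV = {coefficient_of_variation(series):.2f}%, "
        f"coefficient of range = {coefficient_of_range(series):.3f}"
    )
```

Both invented series have the same standard deviation in their own units, yet the weights vary far more relative to their mean. Only the unit-free coefficients allow that comparison.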
The semi-interquartile range is a
measure of spread or dispersion. It is
computed as one half the difference
between the 75th percentile (often
called Q3) and the 25th percentile
(Q1). The formula for the semi-interquartile
range is therefore: (Q3-Q1)/2.
Since half the scores in a distribution lie between Q3 and Q1,
the semi-interquartile range is 1/2 the distance needed to
cover 1/2 the scores. In a symmetric distribution, an interval
stretching from one semi-interquartile range below the
median to one semi-interquartile range above the median will
contain 1/2 of the scores. This will not be true for a skewed
distribution, however.
The semi-interquartile range is little affected by extreme
scores, so it is a good measure of spread for skewed
distributions. However, it is more subject to sampling
fluctuation in normal distributions than is the standard
deviation and is therefore not often used for data that are
approximately normally distributed.
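The formula (Q3-Q1)/2 and its resistance to extreme scores can be illustrated as follows. This is a sketch with invented scores. It relies on `statistics.quantiles` from the Python standard library, whose default quartile convention is only one of several in use, so other software may give slightly different quartiles for the same data.

```python
import statistics


def semi_interquartile_range(values):
    """Return (Q3 - Q1) / 2 using the quartiles from statistics.quantiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / 2


# Hypothetical scores; the second list swaps the largest score for an extreme one.
scores = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 13]
scores_with_outlier = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 40]

for name, data in (("original", scores), ("with outlier", scores_with_outlier)):
    print(
        f"{name}: semi-IQR = {semi_interquartile_range(data):.2f}, "
        f"SD = {statistics.stdev(data):.2f}"
    )
```

The extreme score leaves the quartiles untouched, so the semi-interquartile range does not move, while the standard deviation grows severalfold.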
Understand measures of association and
difference

DR ABDUL RAUF
Outcome Measures
• Compare the incidence of disease among
people who have some characteristic with
those who do not
• The ratio of the incidence rate in one group
to that in another is called a rate ratio or
relative risk (RR)
• The difference in incidence rates between
the groups is called a risk difference or
attributable risk (AR)
Calculating Outcome Measures

| Exposure | Disease (cases) | No Disease (controls) | Incidence |
|---|---|---|---|
| Exposed | A | B | IE = A / (A+B) |
| Not Exposed | C | D | IN = C / (C+D) |

Relative Risk = IE / IN
Attributable Risk = IE - IN
Lung Cancer

| Exposure | Yes | No | Total | Incidence |
|---|---|---|---|---|
| Smoker | 70 | 300 | 370 | 70/370 = 189 per 1000 |
| Non-smoker | 30 | 700 | 730 | 30/730 = 41 per 1000 |
| Total | 100 | 1,000 | 1,100 | |

Relative Risk = IE / IN = 189 / 41 = 4.61
Attributable Risk = IE - IN = 189 - 41 = 148 per 1000

• Smokers are 4.61 times as likely as non-smokers to develop lung cancer
• An excess of 148 lung cancer cases per 1000 smokers is attributable to smoking
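The relative risk and attributable risk formulas translate directly into code. The sketch below uses the counts from the lung cancer table. The function names are illustrative, not part of any library.

```python
def incidence(cases, total):
    """Incidence as a proportion: cases divided by everyone in the group."""
    return cases / total


def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """IE / IN: incidence in the exposed over incidence in the unexposed."""
    return incidence(exposed_cases, exposed_total) / incidence(
        unexposed_cases, unexposed_total
    )


def attributable_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """IE - IN: excess incidence in the exposed group."""
    return incidence(exposed_cases, exposed_total) - incidence(
        unexposed_cases, unexposed_total
    )


# Counts from the lung cancer table: 70 of 370 smokers, 30 of 730 non-smokers.
rr = relative_risk(70, 370, 30, 730)
ar_per_1000 = attributable_risk(70, 370, 30, 730) * 1000

print(f"RR = {rr:.2f}, AR = {ar_per_1000:.0f} per 1000")
```

Working from unrounded incidences gives a relative risk of about 4.60. The slide's 4.61 comes from rounding the incidences to 189 and 41 per 1000 before dividing.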
| | RR < 1 | RR = 1 | RR > 1 |
|---|---|---|---|
| Risk comparison between exposed and unexposed | Risk for disease is lower in the exposed than in the unexposed | Risk of disease is equal for exposed and unexposed | Risk for disease is higher in the exposed than in the unexposed |
| Exposure as a risk factor for the disease? | Exposure reduces disease risk (Protective factor) | Particular exposure is not a risk factor | Exposure increases disease risk (Risk factor) |
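The three columns of the table amount to a simple rule on the value of RR. A minimal sketch, with a hypothetical function name and a small tolerance (an assumption) for treating a computed RR as equal to 1:

```python
import math


def interpret_relative_risk(rr, tolerance=1e-9):
    """Map a relative risk to the interpretation in the table."""
    if math.isclose(rr, 1.0, abs_tol=tolerance):
        return "not a risk factor"
    if rr > 1:
        return "risk factor"  # exposure increases disease risk
    return "protective factor"  # exposure reduces disease risk


for value in (0.5, 1.0, 4.61):
    print(f"RR = {value}: {interpret_relative_risk(value)}")
```

In practice an estimated RR is never exactly 1, so the decision is usually based on whether a confidence interval for RR includes 1. That step is outside the scope of this sketch.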
Annual Death Rates for Lung Cancer
and Coronary Heart Disease
by Smoking Status, Males

Annual Death Rate / 100,000

| Exposure | Lung Cancer | Coronary Heart Disease |
|---|---|---|
| Smoker | 127.2 | 1,000 |
| Non-smoker | 12.8 | 500 |
| RR | 127.2 / 12.8 = 9.9 | 1000 / 500 = 2 |
| AR | 127.2 – 12.8 = 114.4 per 100,000 | 1000 – 500 = 500 per 100,000 |
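The contrast between the two diseases can be reproduced from the rates in the table. The sketch below is illustrative. The dictionary layout and the helper name `compare` are arbitrary choices.

```python
# Annual death rates per 100,000 males, taken from the table.
death_rates = {
    "lung cancer": {"smoker": 127.2, "non_smoker": 12.8},
    "coronary heart disease": {"smoker": 1000.0, "non_smoker": 500.0},
}


def compare(rates):
    """Return (relative risk, attributable risk per 100,000) for one disease."""
    rr = rates["smoker"] / rates["non_smoker"]
    ar = rates["smoker"] - rates["non_smoker"]
    return rr, ar


results = {disease: compare(rates) for disease, rates in death_rates.items()}

for disease, (rr, ar) in results.items():
    print(f"{disease}: RR = {rr:.1f}, AR = {ar:.1f} per 100,000")
```

The ratio measures how strongly smoking is associated with each disease, while the difference measures how many deaths per 100,000 smokers are in excess of the non-smoker rate. A disease can rank high on one and low on the other.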
Summary
• The relative risk associated with smoking is
lower for CHD (RR = 2) than for lung
cancer (RR = 9.9)
• The attributable risk for CHD (AR = 500 per 100,000) is
much higher than for lung cancer
(AR = 114.4 per 100,000)
In conclusion: CHD deaths are much more common
in the population, so the actual number of lives saved (or
deaths averted) by eliminating smoking would be greater for CHD
than for lung cancer
Thank You
