BIOSTATISTICS
UNIT-II
BY
DR MASHROOR AHMAD KHAN
ASSISTANT PROFESSOR
DEPARTMENT OF TOXICOLOGY
SCLS
Types of data
Primary data
first time, afresh
Secondary data
already collected
Variable
An item of data
Value varies from one observation
to another
Examples:
gender
test scores
weight
TYPES OF VARIABLES
QUALITATIVE
DISCRETE QUANTITATIVE
CONTINUOUS QUANTITATIVE
Qualitative Data
Describes the quality
Non-numerical format
Can be counted (aggregate level)
Cannot order or measure
Examples
gender
marital status
geographical region
job title….
Quantitative Data
Frequencies
Measurements
QUALITATIVE
Nominal
Example: Sex (M, F)
Marital Status (single, married, widowed or divorced)
Blood Group (A, B, O or AB)
Color of Eyes (blue, green, brown, black)
For instance, if we record marital status as 1, 2, 3, or 4 as stated above,
we cannot write
4 > 2 or 3 < 4 and we cannot write 3 – 1 = 4 – 2, 1 + 3 = 4 or 4 ÷ 2 = 2
ORDINAL: In those situations when we cannot do
anything except set up inequalities, we refer to the
data as ordinal data
Example:
Response to treatment
(poor, fair, good)
Severity of disease
(mild, moderate, severe)
Income status (low, middle, high)
QUANTITATIVE (DISCRETE)
Example: The no. of family members
The no. of heart beats
The no. of admissions in a day
QUANTITATIVE (CONTINUOUS)
Example: Height, Weight, Age, BP, Serum
Cholesterol and BMI
Discrete data -- Gaps between possible values
Number of Children
Continuous data -- Theoretically,
no gaps between possible values
Hb
CONTINUOUS DATA
DISCRETE DATA
Wt. (in kg): underweight, normal and overweight
Ht. (in cm): short, medium and tall
BMI (kg/m²): underweight, normal, overweight, obese
Scale of measurement
Qualitative variable:
A categorical variable
Nominal (classificatory) scale
- gender, marital status, race
Ordinal (ranking) scale
- severity scale, good/better/best
Quantitative variable:
A numerical variable: discrete; continuous
Interval scale :
Data is placed in meaningful intervals and order. The units of measurement are
arbitrary.
Suppose we are given the following temperature readings (in degrees
Fahrenheit): 58°, 63°, 70°, 95°, 110°, 126° and 135°. In this case, we can write
110° > 70° or 95° < 135° which simply means that 110° is warmer than 70° and
that 95° is cooler than 135°. We can also write for example 95° – 70° = 135° –
110°
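On the other hand, it would not mean much if we said that 126° is twice as hot
as 63°, even though 126 ÷ 63 = 2. To see why, change to the Centigrade scale: the
first temperature becomes 5/9 (126 – 32) ≈ 52°, the
second temperature becomes 5/9 (63 – 32) ≈ 17°, and the first figure is now
more than three times the second.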
This difficulty arises from the fact that Fahrenheit and
Centigrade scales both have artificial origins (zeros)
i.e., the number 0 of neither scale is indicative of the
absence of whatever quantity we are trying to
measure
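The interval-scale argument above can be checked numerically. The sketch below is only an illustration: the helper name fahrenheit_to_celsius is an assumption, and the temperatures are the ones quoted in the text. It shows that equal differences stay equal after changing scale, while the ratio does not.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Centigrade."""
    return 5 / 9 * (f - 32)

# Differences keep their meaning on an interval scale:
# 95 - 70 and 135 - 110 are equal in Fahrenheit and remain equal in Centigrade.
diff_f = (95 - 70, 135 - 110)
diff_c = (
    fahrenheit_to_celsius(95) - fahrenheit_to_celsius(70),
    fahrenheit_to_celsius(135) - fahrenheit_to_celsius(110),
)

# Ratios do not keep their meaning: 126/63 is 2 in Fahrenheit,
# but the same two temperatures give a ratio of about 3 in Centigrade.
ratio_f = 126 / 63
ratio_c = fahrenheit_to_celsius(126) / fahrenheit_to_celsius(63)

print("differences (F):", diff_f)
print("differences (C):", tuple(round(d, 2) for d in diff_c))
print("ratio (F):", ratio_f, " ratio (C):", round(ratio_c, 2))
```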
Ratio scale:
Data is presented in frequency distribution in
logical order. A meaningful ratio exists because the scale has a true zero.
- Age, weight, height, pulse rate
- pulse rate of 120 is twice as fast as 60
- person with weight of 80kg is twice as heavy
as the one with weight of 40 kg.
Scales of Measure
Nominal – qualitative classification of equal value: gender, race,
color, city
Ordinal - qualitative classification which can be rank ordered:
socioeconomic status of families
Interval - Numerical or quantitative data: can be rank ordered and
sizes compared : temperature
Ratio - Quantitative interval data along with ratio: time, age.
Qualitative or Quantitative?
Discrete or Continuous?
Score on a placement exam
Preferred restaurant
Dollar amount of a loan
Height
Salary
Length of time to complete a task
Number of applicants
Ethnic origin
Analysis
Qualitative Data
Frequency tables
Modes - most frequently occurring
Graphs: Bar Charts and Pie Charts
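A frequency table and the mode of a qualitative variable can be sketched in a few lines of Python. The blood-group sample below is hypothetical and serves only to illustrate the counting; the same counts would feed a bar chart or pie chart.

```python
from collections import Counter

# Hypothetical sample of a nominal variable (illustrative values only)
blood_groups = ["A", "O", "B", "O", "AB", "O", "A", "B", "O", "A"]

freq = Counter(blood_groups)   # frequency of each category
n = len(blood_groups)

# Mode = most frequently occurring category
mode_category, mode_count = freq.most_common(1)[0]

# Frequency table with relative frequencies (percent)
for category, count in sorted(freq.items()):
    print(f"{category:<4}{count:>3}{100 * count / n:>7.1f}%")
print("Mode:", mode_category)
```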
Analysis
Quantitative Data
Various types
Create groups or categories and generate frequency
tables
All descriptive and inferential statistics
Identify type of variable and measurement scale
The baby weighs 20 pounds
My friend is very happy
The sky is greyish-blue
Joe is 6 foot 2 inches
Diana has $100
Representing data
Frequency
Distribution
Sturges' Rule
$i = \dfrac{L - S}{1 + 3.322 \log_{10} n}$
Where,
i = class interval
L = Largest observation
S = Smallest observation
n = total number of observations
Also, $k = 1 + 3.322 \log_{10} n$ = the number of classes
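As a rough illustration of Sturges' rule, the sketch below computes the number of classes k = 1 + 3.322 log10(n) and the class interval i = (L - S) / k. The function name sturges and the data set are assumptions made for the example.

```python
import math

def sturges(values):
    """Return (k, i): number of classes and class interval by Sturges' rule."""
    n = len(values)
    k = 1 + 3.322 * math.log10(n)          # number of classes
    i = (max(values) - min(values)) / k    # class interval = (L - S) / k
    return k, i

# Hypothetical data: 100 observations taking the values 1, 2, ..., 100
values = list(range(1, 101))
k, i = sturges(values)
print(f"k = {k:.3f}, i = {i:.3f}")
```

In practice k is rounded to a whole number of classes and i to a convenient width.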
Graphical representation of data
“Last week were you working full-time, part-time, going to school, keeping house, or
what?”
1. Working full-time
2. Working part-time
3. Temporarily not working
4. Unemployed, laid off
5. Retired
6. School
7. Keeping house
8. Other
Bar chart
Pie chart
Relationship between two nominal
variables
A major North American city has four competing newspapers:
the Globe and Mail (G&M), Post, Star, and Sun.
A sample of newspaper readers was asked to report which
newspaper they read—Globe and Mail (1), Post (2), Star (3),
Sun (4)—and indicate whether they were blue-collar workers
(1), white-collar workers (2), or professionals (3).
Graphical representation of quantitative
variables
If d is the gap between the upper limit of any class
and the lower limit of the succeeding class, the
class boundaries for any class are given by:
Upper class boundary = upper class limit + $\frac{d}{2}$
Lower class boundary = lower class limit - $\frac{d}{2}$
Inclusive and exclusive class intervals
(i) In exclusive class intervals, the upper limit of a class is the
lower limit of the next class. Also the upper limit of a class is
not included in that class,
(ii) In inclusive class intervals, the upper limit of a class
is not the lower limit of the next class. The lower limit of the next class is
generally greater by one unit of measurement
(iii) In inclusive method, both the limits of a class are included
(iv) To simplify the calculation procedure, inclusive classes are
converted into exclusive classes
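Converting inclusive classes into exclusive ones amounts to moving each limit by half the gap d between consecutive classes: lower boundary = lower limit - d/2, upper boundary = upper limit + d/2. The sketch below is a minimal illustration; the function name and the class limits are hypothetical.

```python
def to_class_boundaries(class_limits):
    """Convert inclusive class limits to class boundaries (exclusive form)."""
    # d = gap between the upper limit of one class and the lower limit of the next
    d = class_limits[1][0] - class_limits[0][1]
    return [(lower - d / 2, upper + d / 2) for lower, upper in class_limits]

# Hypothetical inclusive classes 10-19, 20-29, 30-39 (so d = 1)
inclusive = [(10, 19), (20, 29), (30, 39)]
boundaries = to_class_boundaries(inclusive)
print(boundaries)
```

After the conversion the upper boundary of each class equals the lower boundary of the next, as in exclusive class intervals.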
Ogive
DESCRIBING THE RELATIONSHIP BETWEEN TWO
INTERVAL VARIABLES
A real estate agent wanted to know to what extent the
selling price of a home is related to its size. To acquire
this information, he took a sample of 12 homes that
had recently sold
Measures of
central tendency
Measures of central tendency (or statistical averages) tell us
the point about which items have a tendency to cluster. Such
a measure is considered the most representative figure for
the entire mass of data. Mean, median and mode are the
most popular averages.
It is a single value within the range of data which represents a
group of individual values in a simple and concise manner so
that the mind can get a quick understanding of the general
size of the individuals in the group.
Arithmetic mean
Mean, also known as arithmetic average, is the
most common measure of central tendency
and may be defined as the value which we get
by dividing the total of the values of various
given items in a series by the total number of
items.
In case of a frequency distribution, we can work out
mean in this way
$\bar{X} = \dfrac{f_1 X_1 + f_2 X_2 + \dots + f_n X_n}{f_1 + f_2 + \dots + f_n} = \dfrac{\sum f_i X_i}{\sum f_i}$
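A short sketch of the frequency-distribution mean, i.e. the sum of f_i X_i divided by the sum of f_i. The values and frequencies are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical frequency distribution: value X_i occurs f_i times
values = [1, 2, 3, 4]   # X_i
freqs = [2, 3, 4, 1]    # f_i

total_fx = sum(f * x for f, x in zip(freqs, values))  # sum of f_i * X_i
total_f = sum(freqs)                                  # sum of f_i

mean = total_fx / total_f
print("mean =", mean)
```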
Example 1
Average birthweight = (3265+3260+...
+ 2834)/20 = 3166.9 g
Example 2: Calculate the arithmetic mean from the
following data
Merits and demerits of arithmetic mean
Merits
(i) It is rigidly defined
(ii) It is easy to understand and easy to calculate.
(iii) It is based upon all the observations.
(iv) It is amenable to algebraic treatment.
Demerits
(i) It is too much affected by extreme values.
(ii) Mostly it does not correspond to any value
of the set of observations.
Median
It is the preferred measure of location for asymmetric distributions.
Median is the value of the variable that divides the ordered set of values
into two equal halves.
50 percent values are to the left of the median and 50 percent are to the
right of the median.
For ungrouped data
For grouped data (exclusive)
Find the median of the data
Find the median wage of the following distribution
Wages (in Rs)      20-30   30-40   40-50   50-60   60-70
No. of labourers   3       5       20      10      5
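One way to work the wage example is sketched below. It assumes the usual interpolation formula for grouped data with exclusive classes, Median = l + ((N/2 - cf) / f) × h, where l is the lower limit of the median class, cf the cumulative frequency before it, f its frequency and h its width. The function name grouped_median is an assumption.

```python
def grouped_median(classes, freqs):
    """Median of a grouped frequency distribution with exclusive classes."""
    n = sum(freqs)
    cumulative = 0  # cumulative frequency before the current class
    for (lower, upper), f in zip(classes, freqs):
        # The median class is the first one whose cumulative frequency reaches N/2
        if cumulative + f >= n / 2:
            return lower + (n / 2 - cumulative) / f * (upper - lower)
        cumulative += f

# Wage distribution from the example above
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
freqs = [3, 5, 20, 10, 5]

median_wage = grouped_median(classes, freqs)
print("median wage = Rs", median_wage)
```

Here N = 43, so N/2 = 21.5 falls in the class 40-50, which has cf = 8 before it and f = 20.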
Find median for the following data.
Merits
i) Median is not influenced by extreme values because it is a
positional average.
ii) Median can be calculated in case of distribution with open
end intervals
iii) Median can be located even if the data are incomplete.
iv) Median can be located even for qualitative factors such as
ability, honesty, etc.
Demerits
(i) A slight change in the series may bring drastic change in
median value.
(ii) In case of even number of observations or continuous series,
median is an estimated value other than any value in the series.
(iii) It is not suitable for further mathematical treatment except
its use in mean deviation.
Mode
Mode is the value which occurs most frequently in a set of
observations and around which the other items of the set
cluster densely.
Mode is 4 (corresponding to highest frequency)
For grouped data
$\text{Mode} = l + \dfrac{h (f_1 - f_0)}{2 f_1 - f_0 - f_2}$
l = lower limit of the modal class
h = width of the modal class
f1 = frequency of the modal class
f0 and f2 are the frequencies of the classes preceding and
succeeding the modal class, respectively
Example
Find the mode of the following frequency distribution
Mode = $l + \dfrac{h (f_1 - f_0)}{2 f_1 - f_0 - f_2}$, with l = 40, gives Mode = 46.67
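The grouped-data mode can be sketched in code with the formula Mode = l + h (f1 - f0) / (2 f1 - f0 - f2). As an illustration the sketch reuses the wage distribution from the median section, not the data of the example above; the function name grouped_mode is an assumption.

```python
def grouped_mode(classes, freqs):
    """Mode of a grouped frequency distribution with equal class widths."""
    m = max(range(len(freqs)), key=lambda j: freqs[j])   # index of the modal class
    lower, upper = classes[m]
    f1 = freqs[m]                                        # frequency of the modal class
    f0 = freqs[m - 1] if m > 0 else 0                    # preceding class
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0       # succeeding class
    h = upper - lower                                    # width of the modal class
    return lower + h * (f1 - f0) / (2 * f1 - f0 - f2)

# Wage distribution (in Rs) used in the median section
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
freqs = [3, 5, 20, 10, 5]

modal_wage = grouped_mode(classes, freqs)
print("modal wage = Rs", modal_wage)
```

The modal class is 40-50 with f1 = 20, f0 = 5 and f2 = 10, so the mode is 40 + 10 × 15 / 25.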
Merits
(i) Mode is readily comprehensible and easy to calculate. Like median, mode can be
located in some cases merely by inspection
(ii) It is not at all affected by extreme values
(iii) Mode can be conveniently located even if the frequency distribution has class-intervals
of unequal magnitude provided the modal class and the classes preceding and succeeding it
are of the same magnitude. Open end classes also do not pose any problem in locating the
mode
Demerits
i) Mode is ill defined. It is not always possible to find a clearly defined mode. Example-
Bimodal distributions
ii) It is not based upon all the observations
iii) It is not capable of further mathematical treatment.
iv) As compared with mean, mode is affected to a greater extent by fluctuations of sampling
Consider the series
(i) 7, 8, 10, 11
(ii) 3, 6, 9, 12, 15
(iii) 1, 5, 9, 13, 17
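The three series just listed share the same mean but are scattered very differently around it. The sketch below uses Python's statistics module to make that visible; the range and the population standard deviation are used here simply as two convenient summaries of spread.

```python
from statistics import mean, pstdev

series = {
    "(i)": [7, 8, 10, 11],
    "(ii)": [3, 6, 9, 12, 15],
    "(iii)": [1, 5, 9, 13, 17],
}

summary = {}
for label, values in series.items():
    summary[label] = {
        "mean": mean(values),
        "range": max(values) - min(values),
        "sd": pstdev(values),   # population standard deviation
    }
    s = summary[label]
    print(label, "mean =", s["mean"], "range =", s["range"], "sd =", round(s["sd"], 2))
```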
Measures of Dispersion
An average can represent a series only as well as a single
figure can, but it certainly cannot reveal the entire story of any
phenomenon under study. In particular, it fails to give any idea
about the scatter of the values of items of a variable in the
series around the true value of the average. In order to measure
this scatter, statistical devices called measures of dispersion
are calculated.
i) It tells about the reliability of a measure of central value.
ii) It makes it possible to compare two series of data in respect of their
variability.
iii) Measure of dispersion provides the basis for the control of variability.
iv) It has a wide application in almost all fields of statistics.
Mean Deviation (MD)
Standard Deviation (Population)
Standard Deviation (Sample)
Standard deviation of a data set is the positive square root of the arithmetic mean
of the squares of deviations of the various items from the arithmetic mean of the
series. It is also called root mean square deviation.
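$\sigma = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n}}$
For grouped data
$\sigma = \sqrt{\dfrac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$
where $f_i$ denotes an interval frequency and $x_i$ the
interval midpoint (sometimes written m)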
Consider the dataset
{ 8, 5, 4, 12, 15, 5, 7 }
Estimate variance and standard deviation
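A possible worked solution for this data set, sketched with Python's statistics module. Both conventions are shown: the population versions divide the sum of squared deviations by n, the sample versions by n - 1.

```python
from statistics import mean, pvariance, pstdev, variance, stdev

data = [8, 5, 4, 12, 15, 5, 7]

x_bar = mean(data)                                  # arithmetic mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)    # sum of squared deviations

pop_var = pvariance(data)    # sum_sq_dev / n
pop_sd = pstdev(data)
samp_var = variance(data)    # sum_sq_dev / (n - 1)
samp_sd = stdev(data)

print("mean =", x_bar, " sum of squared deviations =", sum_sq_dev)
print("population: variance =", round(pop_var, 2), " sd =", round(pop_sd, 2))
print("sample:     variance =", round(samp_var, 2), " sd =", round(samp_sd, 2))
```

The mean is 8 and the squared deviations add up to 100, so the population variance is 100/7 and the sample variance is 100/6.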
Coefficient of Variation (CV)
The standard deviation is useful as a measure of variation within
a given set of data. When one desires to compare the dispersion
in two sets of data, however, comparing the two standard
deviations may lead to fallacious results.
Compare the variation
                     Sample 1     Sample 2
Age                  25 years     11 years
Mean weight          145 pounds   80 pounds
Standard deviation   10 pounds    10 pounds
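The comparison can be sketched with the usual definition CV = (standard deviation / mean) × 100. The function name below is an assumption; the figures are the ones in the table.

```python
def coefficient_of_variation(sd, mean):
    """Coefficient of variation in percent: (sd / mean) * 100."""
    return 100 * sd / mean

cv_sample1 = coefficient_of_variation(sd=10, mean=145)   # 25-year-olds
cv_sample2 = coefficient_of_variation(sd=10, mean=80)    # 11-year-olds

print(f"CV sample 1 = {cv_sample1:.1f}%")
print(f"CV sample 2 = {cv_sample2:.1f}%")
```

Although the two standard deviations are equal, the relative variation is considerably larger in Sample 2.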
Estimate mean and median for the frequency distribution
Prepare a frequency distribution and estimate mean, variance and
standard deviation
Percentiles and Quartiles
Inter quartile range
Quartile Deviation
Quartiles for grouped
data
Coefficient of Quartile deviation
Continuous probability distributions
Figure 3.1 shows a distribution based on a total of 57
children; the frequency distribution consists of intervals with
a width of 10 lb. Now imagine that we increase the number
of children to 50,000 and decrease the width of the intervals
to 0.01 lb. The histogram would now look more like the one
in Figure 3.2, where the step to go from one rectangular bar
to the next is very small.
Finally, suppose that we increase the number of children to 10
million and decrease the width of the interval to 0.00001 lb.
You can now imagine a histogram with bars having practically
no widths and thus the steps have all but disappeared. If we
continue to increase the size of the data set and decrease the
interval width, we eventually arrive at a smooth curve
superimposed on the histogram of Figure 3.2 called a density
curve.
As was noted, subareas of the histogram correspond to the
frequencies of occurrence of values of the variable between
the horizontal scale boundaries of these subareas. This
provides a way whereby the relative frequency of occurrence
of values between any two specified points can be calculated:
merely determine the proportion of the histogram's total area
falling between the specified points.
Probability Density Function: A nonnegative function f(x) is
called a probability distribution (sometimes called a probability
density function) of the continuous random variable X if the total
area bounded by its curve and the x-axis is equal to 1 and if the
subarea under the curve bounded by the curve, the x-axis, and
perpendiculars erected at any two points a and b gives the probability
that X is between the points a and b.
Normal Distribution
Properties of Normal Distribution
1. It is symmetrical about its mean, μ. The curve on either side
of μ is a mirror image of the other side.
2. The mean, the median, and the mode are all equal.
3. The total area under the curve above the x-axis is one square
unit. This characteristic follows from the fact that the normal
distribution is a probability distribution. Because of the
symmetry already mentioned, 50 percent of the area is to the
right of a perpendicular erected at the mean, and 50 percent
is to the left.
4. If we erect perpendiculars a distance of 1 standard
deviation from the mean in both directions, the area
enclosed by these perpendiculars, the x-axis, and the curve
will be approximately 68 percent of the total area. If we
extend these lateral boundaries a distance of 2 standard
deviations on either side of the mean, approximately 95
percent of the area will be enclosed, and extending them a
distance of 3 standard deviations will cause approximately
99.7 percent of the total area to be enclosed.
5. The normal distribution is completely determined by the
parameters μ and σ. In other words, a different normal
distribution is specified for each different value of μ and σ.
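The areas quoted in property 4 can be checked numerically. For any normal distribution the area within k standard deviations of the mean equals erf(k / √2), where erf is the error function from Python's math module. The helper name area_within is an assumption.

```python
import math

def area_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

areas = {k: area_within(k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} standard deviation(s): {100 * area:.2f}% of the total area")
```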
Standard Normal distribution
With a continuous variable, we are concerned with finding the probability that the variable assumes
any value in an interval between two specific points a and b. The
probability that a continuous variable assumes a value between
two points a and b is the area under the graph of the density
curve between a and b; the vertical axis of the graph represents
the densities.
How to Read the Table: The entries in the table give the
area under the standard normal curve between zero and
a positive value of z. Suppose that we are interested in
the area between z = 0 and z = 1.35 (numbers are first
rounded off to two decimal places). To do this, first find
the row marked with 1.3 in the left-hand column of the
table, and then find the column marked with .05 in the
top row of the table (1.35 = 1.30 + 0.05). Then looking in
the body of the table, we find that the ‘‘1.30 row’’ and the
‘‘.05 column’’ intersect at the value .4115. This number,
0.4115, is the desired area between z = 0 and z = 1.35.
Another example: The area between z = 0 and z = 1.23 is
0.3907; this value is found at the intersection of the ‘‘1.2
row’’ and the ‘‘.03 column’’ of the table.
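The two table look-ups above can be reproduced without the table. The area under the standard normal curve between 0 and a positive z equals 0.5 × erf(z / √2); the sketch below rounds the result to four decimal places, as the table does. The function name area_zero_to_z is an assumption.

```python
import math

def area_zero_to_z(z):
    """Area under the standard normal curve between 0 and z (z >= 0)."""
    return 0.5 * math.erf(z / math.sqrt(2))

area_135 = round(area_zero_to_z(1.35), 4)
area_123 = round(area_zero_to_z(1.23), 4)

print("area between z = 0 and z = 1.35:", area_135)
print("area between z = 0 and z = 1.23:", area_123)
```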