Basics of Statistics Unit-I SCLS

Uploaded by

Mehwish Shahzad
BIOSTATISTICS

UNIT-II

BY
DR MASHROOR AHMAD KHAN
ASSISTANT PROFESSOR
DEPARTMENT OF TOXICOLOGY
SCLS
Types of data
- Primary data: collected for the first time, afresh
- Secondary data: already collected
Variable
An item of data
Value varies from one observation to another
Examples:
gender
test scores
weight
TYPES OF VARIABLES

QUALITATIVE
DISCRETE QUANTITATIVE
CONTINUOUS QUANTITATIVE
Qualitative Data
- Describes the quality
- Non-numerical format
- Can be counted (aggregate level)
- Cannot order or measure
- Examples
  - gender
  - marital status
  - geographical region
  - job title, ...
Quantitative Data

- Frequencies
- Measurements
QUALITATIVE

Nominal
Example: Sex ( M, F)
Marital Status (single, married, widowed or divorced)
Blood Group (A,B, O or AB)
Color of Eyes (blue, green, brown, black)
For instance, if we record marital status as 1, 2, 3, or 4 for the categories
listed above, the numbers are only labels: we cannot write
4 > 2 or 3 < 4, and we cannot write 3 – 1 = 4 – 2, 1 + 3 = 4 or 4 ÷ 2 = 2
ORDINAL: In those situations when we cannot do
anything except set up inequalities (rank the categories), we refer to the
data as ordinal data
Example:
Response to treatment
(poor, fair, good)
Severity of disease
(mild, moderate, severe)
Income status (low, middle, high)
QUANTITATIVE (DISCRETE)

Example: The no. of family members
The no. of heart beats
The no. of admissions in a day

QUANTITATIVE (CONTINUOUS)

Example: Height, Weight, Age, BP, Serum Cholesterol and BMI
Discrete data -- Gaps between possible values
Example: Number of Children

Continuous data -- Theoretically, no gaps between possible values
Example: Hb
CONTINUOUS DATA

DISCRETE DATA

wt. (in Kg.) : under wt, normal & over wt.


Ht. (in cm.): short, medium & tall
BMI (kg/m2): underweight, normal, overweight, obese
Scale of measurement
Qualitative variable:
A categorical variable

Nominal (classificatory) scale


- gender, marital status, race

Ordinal (ranking) scale


- severity scale, good/better/best
Quantitative variable:
A numerical variable: discrete; continuous

Interval scale:
Data are placed in meaningful intervals and order. The units of measurement are
arbitrary.

Suppose we are given the following temperature readings (in degrees
Fahrenheit): 58°, 63°, 70°, 95°, 110°, 126° and 135°. In this case, we can write
110° > 70° or 95° < 135°, which simply means that 110° is warmer than 70° and
that 95° is cooler than 135°. We can also write, for example, 95° – 70° = 135° –
110°.
On the other hand, it would not mean much if we said that 126° is twice as hot
as 63°. Converted to the Centigrade scale, the
first temperature becomes 5/9 (126 – 32) = 52° and the
second temperature becomes 5/9 (63 – 32) = 17°, so the first figure is now
about three times the second rather than twice.

This difficulty arises from the fact that the Fahrenheit and
Centigrade scales both have artificial origins (zeros),
i.e., the number 0 of neither scale is indicative of the
absence of whatever quantity we are trying to
measure.
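The point about interval scales can be checked numerically. The short sketch below converts the slide's Fahrenheit readings to Centigrade; the function name `fahrenheit_to_celsius` is only an illustrative choice. It suggests that ratios change with the scale while equal differences stay equal.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Centigrade."""
    return 5 / 9 * (f - 32)

hot_f, cool_f = 126, 63
hot_c = fahrenheit_to_celsius(hot_f)
cool_c = fahrenheit_to_celsius(cool_f)

# The ratio depends on where the (artificial) zero sits.
ratio_f = hot_f / cool_f      # exactly 2 on the Fahrenheit scale
ratio_c = hot_c / cool_c      # roughly 3 on the Centigrade scale

# Differences that are equal in Fahrenheit remain equal in Centigrade.
diff_a_c = fahrenheit_to_celsius(95) - fahrenheit_to_celsius(70)
diff_b_c = fahrenheit_to_celsius(135) - fahrenheit_to_celsius(110)

print(f"ratio in F: {ratio_f:.2f}, ratio in C: {ratio_c:.2f}")
print(f"95-70 in C: {diff_a_c:.2f}, 135-110 in C: {diff_b_c:.2f}")
```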
Ratio scale:
Data is presented in frequency distribution in
logical order. A meaningful ratio exists.

- Age, weight, height, pulse rate


- pulse rate of 120 is twice as fast as 60
- person with weight of 80kg is twice as heavy
as the one with weight of 40 kg.
Scales of Measure

- Nominal - qualitative classification of equal value: gender, race,
  color, city
- Ordinal - qualitative classification which can be rank ordered:
  socioeconomic status of families
- Interval - Numerical or quantitative data: can be rank ordered and
  sizes compared: temperature
- Ratio - Quantitative interval data along with ratio: time, age.
Qualitative or Quantitative?
Discrete or Continuous?

- Score on a placement exam
- Preferred restaurant
- Dollar amount of a loan
- Height
- Salary
- Length of time to complete a task
- Number of applicants
- Ethnic origin
Analysis
Qualitative Data
- Frequency tables
- Modes - most frequently occurring
- Graphs: Bar Charts and Pie Charts
Analysis
Quantitative Data
- Various types
- Create groups or categories and generate frequency
  tables
- All descriptive and inferential statistics
Identify type of variable and measurement scale

- The baby weighs 20 pounds
- My friend is very happy
- The sky is greyish-blue
- Joe is 6 foot 2 inches
- Diana has $100
Representing data
Frequency
Distribution
Sturges' Rule

$i = \dfrac{L - S}{1 + 3.322 \log_{10} n}$

Where,

i = class interval
L = Largest observation
S = Smallest observation
n = total number of observations

Also, $1 + 3.322 \log_{10} n$ = the number of classes
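As a rough illustration of Sturges' rule, the sketch below computes the suggested number of classes and the class interval. It assumes the usual textbook form of the rule, with 1 + 3.322 log10(n) classes. The values for L, S and n are hypothetical, not taken from the slides.

```python
import math

def sturges_classes(n):
    """Suggested number of classes for n observations (usual textbook form)."""
    return 1 + 3.322 * math.log10(n)

def sturges_interval(largest, smallest, n):
    """Class interval i = (L - S) / number of classes."""
    return (largest - smallest) / sturges_classes(n)

# Hypothetical data summary: 57 observations ranging from 10 to 80.
n, largest, smallest = 57, 80, 10
k = sturges_classes(n)
width = sturges_interval(largest, smallest, n)

print(f"suggested classes: {k:.2f}, class interval: {width:.2f}")
```

In practice both figures are rounded to convenient values, for example 7 classes of width 10 here.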


Graphical
representation of
data
Graphical representation of data

- "Last week were you working full-time, part-time, going to school, keeping house, or
  what?"

1. Working full-time
2. Working part-time
3. Temporarily not working
4. Unemployed, laid off
5. Retired
6. School
7. Keeping house
8. Other
Bar chart
Pie chart
Relationship between two nominal
variables
A major North American city has four competing newspapers:
the Globe and Mail (G&M), Post, Star, and Sun.

A sample of newspaper readers was asked to report which


newspaper they read—Globe and Mail (1), Post (2), Star (3),
Sun (4)—and indicate whether they were blue-collar workers
(1), white-collar workers (2), or professionals (3).
Graphical representation of quantitative
variables
If d is the gap between the upper limit of any class
and the lower limit of the succeeding class, the
class boundaries for any class are given by:

Upper class boundary = upper class limit + $\frac{d}{2}$

Lower class boundary = lower class limit - $\frac{d}{2}$


Inclusive and exclusive class intervals
(i) In exclusive class intervals, the upper limit of a class is the
lower limit of the next class. Also, the upper limit of a class is
not included in that class.

(ii) In inclusive class intervals, the upper limit of a class
is not the lower limit of the next class. The lower limit of the next class is
generally greater by one unit of measurement.

(iii) In the inclusive method, both limits of a class are included in that class.

(iv) To simplify the calculation procedure, inclusive classes are
converted into exclusive classes.
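The conversion from inclusive to exclusive classes can be sketched as follows, using the half-gap adjustment of the class boundaries. The helper name `class_boundaries` and the class limits are hypothetical, and the sketch assumes the same gap d between all consecutive classes.

```python
def class_boundaries(limits):
    """Turn inclusive class limits into exclusive class boundaries.

    limits: list of (lower, upper) inclusive limits in ascending order.
    Assumes the same gap d between every pair of consecutive classes.
    """
    if len(limits) < 2:
        raise ValueError("need at least two classes to measure the gap d")
    d = limits[1][0] - limits[0][1]   # gap between upper limit and next lower limit
    return [(lower - d / 2, upper + d / 2) for lower, upper in limits]

# Hypothetical inclusive classes 10-19, 20-29, 30-39 (gap d = 1).
inclusive = [(10, 19), (20, 29), (30, 39)]
exclusive = class_boundaries(inclusive)
print(exclusive)
```

After the adjustment, the upper boundary of each class equals the lower boundary of the next, which is the exclusive form.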
Ogive
DESCRIBING THE RELATIONSHIP BETWEEN TWO
INTERVAL VARIABLES

A real estate agent wanted to know to what extent the


selling price of a home is related to its size. To acquire
this information, he took a sample of 12 homes that
had recently sold
Measures of
central tendency
Measures of central tendency (or statistical averages) tell us
the point about which items have a tendency to cluster. Such
a measure is considered as the most representative figure for
the entire mass of data. Measure of central tendency is also
known as statistical average. Mean, median and mode are the
most popular averages.

It is a single value within the range of data which represents a
group of individual values in a simple and concise manner so
that the mind can get a quick understanding of the general
size of the individuals in the group.
Arithmetic mean

Mean, also known as arithmetic average, is the
most common measure of central tendency
and may be defined as the value which we get
by dividing the total of the values of various
given items in a series by the total number of
items.
In case of a frequency distribution, we can work out
mean in this way

$\bar{X} = \dfrac{f_1 X_1 + f_2 X_2 + \dots + f_n X_n}{f_1 + f_2 + \dots + f_n} = \dfrac{\sum f_i X_i}{\sum f_i}$
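A minimal sketch of the frequency-weighted mean follows. It borrows the wage distribution that appears later in these slides (classes 20-30 to 60-70) and, as an assumption, takes each class midpoint as the value X_i. The function name `grouped_mean` is illustrative.

```python
def grouped_mean(classes, freqs):
    """Mean of a frequency distribution, using class midpoints as X_i."""
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    total_fx = sum(f * x for f, x in zip(freqs, midpoints))
    return total_fx / sum(freqs)

# Wage classes (in Rs) and number of labourers from the median example.
wage_classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
labourers = [3, 5, 20, 10, 5]

mean_wage = grouped_mean(wage_classes, labourers)
print(f"mean wage: {mean_wage:.2f}")   # sum(f*x) = 2025, sum(f) = 43
```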
Example 1
Average birthweight = (3265+3260+...
+ 2834)/20 = 3166.9 g
Example 2: Calculate the arithmetic mean from the
following data
Merits and demerits of arithmetic mean

Merits
(i) It is rigidly defined
(ii) It is easy to understand and easy to calculate.
(iii) It is based upon all the observations.
(iv) It is amenable to algebraic treatment.

Demerits
(i) It is too much affected by extreme values.
(ii) Mostly it does not correspond to any value
of the set of observations.
Median
It is the preferred measure of location for asymmetric distributions.
Median is the value of the variable that divides the ordered set of values
into two equal halves.

50 percent values are to the left of the median and 50 percent are to the
right of the median.

For ungrouped data


For grouped data (exclusive)
Find the median of the data
Find the median wage of the following distribution

Wages (in Rs)      20-30  30-40  40-50  50-60  60-70
No. of labourers     3      5     20     10      5
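One way to work this example is with the usual interpolation formula for grouped (exclusive) data, Median = l + ((N/2 - cf) / f) × h. Here l is the lower limit of the median class, cf the cumulative frequency before it, f its frequency and h its width. The formula on the slide did not survive extraction, so this form and the function name `grouped_median` are assumptions.

```python
def grouped_median(classes, freqs):
    """Median of an exclusive-class frequency distribution by interpolation."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("empty distribution")
    half = n / 2
    cumulative = 0                      # cumulative frequency before the current class
    for (lower, upper), f in zip(classes, freqs):
        if cumulative + f >= half:      # this is the median class
            return lower + (half - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("classes and frequencies do not match")

wage_classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
labourers = [3, 5, 20, 10, 5]

median_wage = grouped_median(wage_classes, labourers)
print(f"median wage: Rs {median_wage:.2f}")   # 40 + (21.5 - 8) / 20 * 10
```

With N = 43 the median class is 40-50, giving a median wage of Rs 46.75.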
Find median for the following data.
Merits
i) Median is not influenced by extreme values because it is a
positional average.
ii) Median can be calculated in case of distribution with open
end intervals
iii) Median can be located even if the data are incomplete.
iv) Median can be located even for qualitative factors such as
ability, honesty, etc.
Demerits
(i) A slight change in the series may bring drastic change in
median value.
(ii) In case of even number of observations or continuous series,
median is an estimated value other than any value in the series.
(iii) It is not suitable for further mathematical treatment except
its use in mean deviation.
Mode
Mode is the value which occurs most frequently in a set of
observations and around which the other items of the set
cluster densely.

Mode is 4 (corresponding to highest frequency)


For grouped data

$\text{Mode} = l + \dfrac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$

l = lower limit of the modal class
h = width of the modal class
f1 = frequency of the modal class
f0 and f2 are frequencies of the classes preceding and
succeeding the modal class
Example

Find the mode of the following frequency distribution


Mode = $l + \dfrac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$ = 40 + … = 46.67
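The frequency table for this example did not survive, so the sketch below applies the grouped-mode formula to the wage distribution from the median section instead. It uses the standard form l + (f1 - f0) / (2 f1 - f0 - f2) × h. The function name `grouped_mode` is illustrative, and treating a missing neighbouring class as frequency 0 is an assumption.

```python
def grouped_mode(classes, freqs):
    """Mode of a grouped distribution: l + (f1 - f0) / (2*f1 - f0 - f2) * h."""
    k = max(range(len(freqs)), key=lambda i: freqs[i])   # index of the modal class
    lower, upper = classes[k]
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                    # assumption: 0 if no class before
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0       # assumption: 0 if no class after
    # Note: the denominator is zero when f0 == f1 == f2 (no clear mode).
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * (upper - lower)

wage_classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
labourers = [3, 5, 20, 10, 5]

modal_wage = grouped_mode(wage_classes, labourers)
print(f"modal wage: Rs {modal_wage:.2f}")   # 40 + (20 - 5) / (40 - 5 - 10) * 10
```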
Merits
(i) Mode is readily comprehensible and easy to calculate. Like median, mode can be
located in some cases merely by inspection
(ii) It is not at all affected by extreme values
(iii) Mode can be conveniently located even if the frequency distribution has class-intervals
of unequal magnitude provided the modal class and the classes preceding and succeeding it
are of the same magnitude. Open end classes also do not pose any problem in locating the
mode

Demerits
i) Mode is ill defined. It is not always possible to find a clearly defined mode. Example-
Bimodal distributions
ii) It is not based upon all the observations
iii) It is not capable of further mathematical treatment.
iv) As compared with mean, mode is affected to a greater extent by fluctuations of sampling
Consider the series
(i) 7, 8, 10, 11
(ii) 3, 6, 9, 12, 15
(iii) 1, 5, 9, 13, 17
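A quick check of these three series, sketched with Python's standard `statistics` module, suggests why an average alone is not enough. All three have the same mean but their values are spread very differently. The range is used here only as a simple stand-in for spread.

```python
from statistics import mean

series = {
    "(i)": [7, 8, 10, 11],
    "(ii)": [3, 6, 9, 12, 15],
    "(iii)": [1, 5, 9, 13, 17],
}

# Same average, different scatter around it.
summary = {
    label: (mean(values), max(values) - min(values))
    for label, values in series.items()
}
for label, (avg, value_range) in summary.items():
    print(f"{label}: mean = {avg}, range = {value_range}")
```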
Measures of Dispersion
An average can represent a series only as well as a single
figure can, but it certainly cannot reveal the entire story of any
phenomenon under study. In particular, it fails to give any idea
about the scatter of the values of items of a variable in the
series around the true value of the average. In order to measure
this scatter, statistical devices called measures of dispersion
are calculated.

i) It tells about the reliability of a measure of central value.


ii) It makes possible to compare two series of data in respect of their
variability.
iii) Measure of dispersion provides the basis for the control of variability.
iv) It has a wide application in almost all fields of statistics.
Mean Deviation (MD)
Standard Deviation

Standard deviation of a data set is the positive square root of
the arithmetic mean of the squares of deviations of the
various items from the arithmetic mean of the series. It is also
called root mean square deviation.

$\sigma = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n}}$

For grouped data

$\sigma = \sqrt{\dfrac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$

where $f_i$ denotes an interval frequency and $x_i$ the
interval midpoint
Consider the dataset

{ 8, 5, 4, 12, 15, 5, 7 }

Estimate variance and standard deviation
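This exercise can be sketched with Python's standard `statistics` module. The definition given in these slides divides by n, which corresponds to `pvariance` and `pstdev`. The n - 1 versions (`variance`, `stdev`) are shown only for comparison and are not part of the slide's definition.

```python
from statistics import mean, pvariance, pstdev, variance, stdev

data = [8, 5, 4, 12, 15, 5, 7]

x_bar = mean(data)                  # 56 / 7 = 8
pop_var = pvariance(data)           # sum of squared deviations (100) divided by n = 7
pop_sd = pstdev(data)               # positive square root of pop_var

# For comparison only: the sample versions divide by n - 1 = 6.
sample_var = variance(data)
sample_sd = stdev(data)

print(f"mean = {x_bar}")
print(f"variance (divide by n)   = {pop_var:.4f}, SD = {pop_sd:.4f}")
print(f"variance (divide by n-1) = {sample_var:.4f}, SD = {sample_sd:.4f}")
```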


Find standard deviation for the following data
mid-value (xi)  freq (fi)  fi*xi  xi-xbar  (xi-xbar)^2  fi(xi-xbar)^2  |xi-xbar|  fi|xi-xbar|
15               3           45   -30.53     931.86       2795.57        30.53       91.58
25               8          200   -20.53     421.33       3370.64        20.53      164.21
35               8          280   -10.53     110.80        886.43        10.53       84.21
45              16          720    -0.53       0.28          4.43         0.53        8.42
55              12          660     9.47      89.75       1077.01         9.47      113.68
65               6          390    19.47     379.22       2275.35        19.47      116.84
75               4          300    29.47     868.70       3474.79        29.47      117.89
Total           57         2595                           13884.21                   696.84

mean  45.52632
SD    15.60713
MD    12.2253
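The totals in this table can be reproduced with a few lines of Python. The sketch uses the mid-values and frequencies from the table, divides by the total frequency as in the grouped-data formula, and takes the mean deviation (MD) about the mean, which is what the last two columns of the table compute.

```python
import math

mid_values = [15, 25, 35, 45, 55, 65, 75]
freqs = [3, 8, 8, 16, 12, 6, 4]

n = sum(freqs)                                                     # 57
grouped_mean = sum(f * x for f, x in zip(freqs, mid_values)) / n   # 2595 / 57

# Sum of f * (x - mean)^2 and sum of f * |x - mean|, as in the table columns.
sum_sq = sum(f * (x - grouped_mean) ** 2 for f, x in zip(freqs, mid_values))
sum_abs = sum(f * abs(x - grouped_mean) for f, x in zip(freqs, mid_values))

sd = math.sqrt(sum_sq / n)    # standard deviation
md = sum_abs / n              # mean deviation about the mean

print(f"mean = {grouped_mean:.5f}, SD = {sd:.5f}, MD = {md:.4f}")
```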
Coefficient of Variation (CV)

The standard deviation is useful as a measure of variation within
a given set of data. When one desires to compare the dispersion
in two sets of data, however, comparing the two standard
deviations may lead to fallacious results.
Compare the variation

                    Sample 1     Sample 2
Age                 25 years     11 years
Mean weight         145 pounds   80 pounds
Standard deviation  10 pounds    10 pounds
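The comparison can be sketched using the usual definition of the coefficient of variation, CV = (standard deviation / mean) × 100. That formula is assumed here because the slide's own formula did not survive extraction. The function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_sample_1 = coefficient_of_variation(sd=10, mean=145)   # 25-year-olds
cv_sample_2 = coefficient_of_variation(sd=10, mean=80)    # 11-year-olds

print(f"Sample 1 CV: {cv_sample_1:.1f}%")
print(f"Sample 2 CV: {cv_sample_2:.1f}%")
```

Although both samples have the same standard deviation, the relative variation is larger in Sample 2 (12.5% against roughly 6.9%).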
Measures of Skewness and Kurtosis

Skewness:
Skewness means lack of symmetry.

In Statistics, a distribution is called symmetric if mean, median and
mode coincide. Otherwise, the distribution becomes asymmetric.

If the right tail is longer, we get a positively skewed distribution for
which mean > median > mode.

If the left tail is longer, we get a negatively skewed distribution for
which mean < median < mode.
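The ordering of mean, median and mode can be checked on the small dataset from the standard deviation section, {8, 5, 4, 12, 15, 5, 7}. The sketch also computes Pearson's second coefficient of skewness, 3(mean - median) / SD. That is one common measure and is assumed here, since the formulas on the following slides did not survive extraction.

```python
from statistics import mean, median, mode, pstdev

data = [8, 5, 4, 12, 15, 5, 7]

avg = mean(data)        # 8
mid = median(data)      # 7
most = mode(data)       # 5

# Pearson's second coefficient of skewness (an assumed choice of measure).
pearson_skew = 3 * (avg - mid) / pstdev(data)

print(f"mean = {avg}, median = {mid}, mode = {most}")
print(f"Pearson skewness = {pearson_skew:.3f}")
```

Here mean > median > mode and the coefficient is positive, consistent with a positively skewed (longer right tail) distribution.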
Measures of Skewness:
Kurtosis
Kurtosis is another measure of the shape of a frequency curve.

The measures of kurtosis describe the degree of concentration of frequencies
(observations) in a given distribution. That is, whether the observed values
are concentrated more around the mode (a peaked curve) or away from the
mode towards both tails of the frequency curve.

The measure of kurtosis is very helpful in the selection of an appropriate
average. For example, for normal distribution, mean is most appropriate; for
a leptokurtic distribution, median is most appropriate; and for platykurtic
distribution, the quartile range is most appropriate.
Kurtosis
In statistics, it refers to the degree of flatness or peakedness in the region about the mode of the
frequency curve. There are three types of frequency curve.
(i) Leptokurtic: Distribution is longer, tails are fatter. The peak is higher and sharper than
mesokurtic, which means that data are heavy-tailed or have a large number of outliers.
(ii) Mesokurtic: This distribution has a kurtosis statistic similar to that of the normal
distribution.
(iii) Platykurtic: Distribution is shorter, tails are thinner than the normal
distribution. The peak is lower and broader than mesokurtic, which
means that data are light-tailed or lack outliers.
Estimate mean and median for the frequency distribution
Prepare a frequency distribution and estimate mean, variance and
standard deviation
Percentiles and Quartiles
Inter quartile range
Quartile Deviation
Quartiles for grouped
data
Coefficient of Quartile deviation
Continuous probability distributions
Figure 3.1 shows a distribution based on a total of 57
children; the frequency distribution consists of intervals with
a width of 10 lb. Now imagine that we increase the number
of children to 50,000 and decrease the width of the intervals
to 0.01 lb. The histogram would now look more like the one
in Figure 3.2, where the step to go from one rectangular bar
to the next is very small.
Finally, suppose that we increase the number of children to 10
million and decrease the width of the interval to 0.00001 lb.
You can now imagine a histogram with bars having practically
no widths and thus the steps have all but disappeared. If we
continue to increase the size of the data set and decrease the
interval width, we eventually arrive at a smooth curve
superimposed on the histogram of Figure 3.2 called a density
curve.
As was noted, subareas of the histogram correspond to the
frequencies of occurrence of values of the variable between
the horizontal scale boundaries of these subareas. This
provides a way whereby the relative frequency of occurrence
of values between any two specified points can be calculated:
merely determine the proportion of the histogram's total area
falling between the specified points.
Probability Density Function: A nonnegative function f(x) is
called a probability distribution (sometimes called a probability
density function) of the continuous random variable X if the total
area bounded by its curve and the x-axis is equal to 1 and if the
subarea under the curve bounded by the curve, the x-axis, and
perpendiculars erected at any two points a and b gives the probability
that X is between the points a and b.
Normal Distribution
Properties of Normal Distribution
1. It is symmetrical about its mean, μ. The curve on either side
of μ is a mirror image of the other side.

2. The mean, the median, and the mode are all equal.

3. The total area under the curve above the x-axis is one square
unit. This characteristic follows from the fact that the normal
distribution is a probability distribution. Because of the
symmetry already mentioned, 50 percent of the area is to the
right of a perpendicular erected at the mean, and 50 percent
is to the left.
4. If we erect perpendiculars a distance of 1 standard
deviation from the mean in both directions, the area
enclosed by these perpendiculars, the x-axis, and the curve
will be approximately 68 percent of the total area. If we
extend these lateral boundaries a distance of 2 standard
deviations on either side of the mean, approximately 95
percent of the area will be enclosed, and extending them a
distance of 3 standard deviations will cause approximately
99.7 percent of the total area to be enclosed.

5. The normal distribution is completely determined by the
parameters μ and σ. In other words, a different normal
distribution is specified for each different value of μ and σ.
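Property 4 can be checked with `statistics.NormalDist` from the Python standard library. The sketch computes the area within 1, 2 and 3 standard deviations of the mean, first for the standard normal curve and then for a curve with hypothetical parameters μ = 100 and σ = 15, to suggest that the proportions do not depend on μ and σ.

```python
from statistics import NormalDist

def area_within(dist, k):
    """Area under dist's density curve within k standard deviations of its mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

standard = NormalDist()                  # mu = 0, sigma = 1
other = NormalDist(mu=100, sigma=15)     # hypothetical parameters

for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(standard, k):.4f} "
          f"(standard), {area_within(other, k):.4f} (mu=100, sigma=15)")
```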
Standard Normal distribution
We are concerned with finding the probability that the variable assumes
any value in an interval between two specific points a and b. The
probability that a continuous variable assumes a value between
two points a and b is the area under the graph of the density
curve between a and b; the vertical axis of the graph represents
the densities.
How to Read the Table: The entries in the table give the
area under the standard normal curve between zero and
a positive value of z. Suppose that we are interested in
the area between z = 0 and z = 1.35 (numbers are first
rounded off to two decimal places). To do this, first find
the row marked with 1.3 in the left-hand column of the
table, and then find the column marked with .05 in the
top row of the table (1.35 = 1.30 + 0.05). Then looking in
the body of the table, we find that the ‘‘1.30 row’’ and the
‘‘.05 column’’ intersect at the value .4115. This number,
0.4115, is the desired area between z = 0 and z = 1.35.
Another example: The area between z = 0 and z = 1.23 is
0.3907; this value is found at the intersection of the ‘‘1.2
row’’ and the ‘‘.03 column’’ of the table.
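The table look-ups above can be reproduced in code. This sketch uses `statistics.NormalDist` from the Python standard library. The area between 0 and a positive z is the cumulative probability up to z minus 0.5, the half of the curve to the left of zero. The function name `area_zero_to` is illustrative.

```python
from statistics import NormalDist

def area_zero_to(z):
    """Area under the standard normal curve between 0 and a positive z."""
    return NormalDist().cdf(z) - 0.5

area_135 = area_zero_to(1.35)
area_123 = area_zero_to(1.23)

print(f"area between z = 0 and z = 1.35: {area_135:.4f}")
print(f"area between z = 0 and z = 1.23: {area_123:.4f}")
```

Rounded to four decimal places, the results agree with the table entries 0.4115 and 0.3907 quoted above.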
