Textbook of Computer Applications and Biostatistics
All content following this page was uploaded by Kailas K Mali on 06 October 2014.
Dr. S. B. Bhise
M. Pharm PhD
Principal,
Singhad Institute of Pharmaceutical Sciences, Lonavala
[email protected]
Dr. R. J. Dias
M. Pharm PhD MBA
Professor,
Singhad Institute of Pharmaceutical Sciences, Lonavala
[email protected]
K. K. Mali
M. Pharm (Biopharm)
Associate Professor,
Satara College of Pharmacy, Satara
[email protected]
P. H. Ghanwat
DIE, MCA
Visiting Lecturer,
Satara College of Pharmacy, Satara
[email protected]
INNOVATE
P
TRINITY PUBLISHING HOUSE
PUBLISH
Serving Pharmacy Profession
Textbook of Computer Applications and Biostatistics
Published by
ISBN 978-81-920565-1-7
Rs. 350/-
Printed at
Vikram Printers Pvt. Ltd.
31 & 34, Parvati Industrial Estate,
Pune-Satara Road,
Pune- 411 009. India.
Phone: (020)24220890, 24228905.
www.vikram-printers.com
Distributed by
Amit Book Company Pvt. Ltd.
B-3/16, Darja Ganj,
Near The Time of India,
New Delhi - 110 002.
Phone: 011 - 43538989.
[email protected].
Designed by
Srushti Computers, Satara.
G-2, ‘Venna’, Adarshnagar, Khed, Satara.
[email protected].
PREFACE
We are very pleased to put forth the first edition of the book, ‘Textbook of Computer
Applications and Biostatistics’. This book is intended to introduce pharmacy students to the
applications of computers and biostatistics in pharmacy. The basic knowledge of computers
and their applications is covered in detail, as it is essential to students in every walk of
their lives. The procedures for operating MS-Office 2003 are discussed here, as many colleges
still use this version. However, our second edition will cover the procedures for operating the
latest versions of Windows. We regret any inconvenience this may cause to some readers.
The concepts of biostatistics are discussed here with a minimum of mathematics, so as to drive
away the maths phobia in pharmacy students. Moreover, most statistics can be handled on
computers using Excel, and in every chapter we have emphasised how to use computers for
statistical needs. This will help students to handle data and draw inferences from their
experiments easily.
This book is a sincere effort to present statistical concepts in a simple, understandable form so
that every student will enjoy learning them with ease. The learning objectives, summary,
multiple choice questions and exercises in all twenty-two chapters make the book more
interesting.
We acknowledge the help and co-operation extended by various persons in bringing out this
book. We are highly indebted to the authors of the various books and articles mentioned in the
bibliography, which became a major source of information for writing this book. We also
thank the publishers and designers, who graciously worked hard to publish this book in time.
We request all users of this book to provide constructive criticism for improving further
editions of the book. We sincerely hope that readers will welcome the book.
Satara
January 15, 2011.
SB Bhise
RJ Dias
KK Mali
PH Ghanwat
Chapter 8
APPLICATIONS OF COMPUTERS TO PHARMACY
Learning objectives
When we have finished this chapter, we should be able to:
1. Understand applications of computers to pharmacy.
2. Know various computer programmes used in different areas of pharmacy field.
Introduction
The usefulness of computers in the collection, evaluation, organisation and dissemination of
information has made them present in virtually every walk of life. Their potential in every field of
pharmacy has led to their extensive use, from drug research and manufacturing through to the
proper use of drugs. The following properties of computers have enabled them to bring about the
biggest revolution of the twenty-first century, known as information technology. The properties are:
1. Large storage capacity
2. Speed and accuracy
3. Flexibility
4. Ease of dissemination and transmission
5. Multiple user capacity
6. Can do repetitive tasks.
Let us see the applications of computers to various fields of pharmacy as given below:
1. Use of computers for manufacturing of drugs
The manufacturing of drugs in various dosage forms requires sophisticated instruments and
machinery, which are nowadays controlled by computers. The touch screen panels provided on these
machines can be used to control various manufacturing variables, thereby producing quality
medicines. Automation in the manufacturing area has increased efficiency, quality and safety
manifold.
Computer-aided manufacturing (CAM) is the use of computers to plan and control the
manufacturing process. A well designed CAM system allows manufacturers to become much more
productive. Not only is a greater number of products produced, but speed and quality are also
increased.
Software available: Marg pharmaceutical software for manufacturers, DMC Medical
Manufacturing, Taylor Pharmaceutical Manufacturing, TGI Process manufacturing, MISys
Manufacturing software, etc.
MetaCore, etc.
20. Literature storage and retrieval system
Computers have been utilized to offer bibliographic, indexing and abstracting services. The
articles can be retrieved through keywords, titles, authors or journals. Automated on-line literature
retrieval systems like Medline and Chemline are offered by the National Library of Medicine, USA.
International Pharmaceutical Abstracts (IPA) and Martindale’s Extra Pharmacopoeia are also
available on compact disks.
Databases available: Excerpta Medica, LIMS, AMA/NET Information base, etc.
Summary
The use of computers in various fields of pharmacy is given in the following table:
Sr. No.  Use of Computers in Pharmacy        Software Available
1.       Manufacturing of drugs              Marg, DMC, Taylor, TGI, MISys
2.       Quality control                     Darwin LIMS, DMC-QC, Maonark, MasterControl
3.       Quality assurance                   Qtor, EtQ, Darwin LIMS, Marg, DMC-QA
4.       Pharmaceutical analysis             DeWinter, Drugpak, IPACore, Assistant Pro, DTSMS
5.       Inventory control                   mSupply, IMS Leon, Meditab IMS, CASI, DMC
6.       Clinical research                   OpenClinica, Clinplus, Cytel, Metadata, TrackWise
7.       Retail drug store and wholesalers   PharmaSoft, Apothesoft, PEPID, Medivision, Abacus
8.       Drug information services           MicroMedex, Lexicomp Platinum, Davis's Drug Guide, Tarascon’s Pharmacopoeia, DIT Drug Risk Navigator
9.       Marketing and sales                 Marg Ethical Marketing, Pharma CRM, Metastorm
10.      Hospital management and pharmacy    WinPharm, WorkPath, MediNous, HMS-Leon
11.      Clinical services                   PharmacyPlus, VirtualCare, TEICTDM, NAMAH, MEDIPHOR, MW/Pharm
12.      Bioequivalence testing centres      WinNonlin, DDSolver, EquivTest, MONOLIX
13.      Research and development            Prochemist, Tripos, AMBER, DRAGON, RECKON, Spartan, CHARMM, GROMACS, MATLAB
14.      Biostatistics                       SAS, SPSS, Minitab, SigmaStat
15.      Medical diagnosis and imaging       DICOM, 3D-DOCTOR, Easy Diagnosis, MIM, LDRA
16.      Patent search                       www.uspto.gov, www.wipo.int, www.epo.org, www.ipo.gov.uk, www.ipindia.nic.in, etc.
17.      Pharmacology simulations            LabTutor, X-Cology, Ex-Pharm, Biosoft, Neurosim
18.      Pharmacokinetic simulation          NONLIN, KINETICS, KINPAK, ESTRIP
19.      Bioinformatics                      MetaCore, BioSpice, ISYS, 3Dslicer, Bioconductor
20.      Literature survey                   Pubmed, Medline, Google, IPA, LIMS
Exercise:
1. Give an account of the applications of computers to pharmacy.
2. Discuss the use of computers in basic research.
3. List the various software available for pharmaceutical manufacturing.
4. How is literature storage and retrieval possible with computers?
5. Describe the use of computers in pharmacokinetics.
Answers:
Multiple Choice Questions
1. a 2. b 3. d 4. b 5. d 6. c 7. c 8. b 9. d 10. a
Chapter 9
INTRODUCTION TO BIOSTATISTICS
Learning objectives
When we have finished this chapter, we should be able to:
1. Explain types of statistics with their components.
2. Explain the difference between nominal, ordinal, metric discrete and metric continuous variables.
3. Identify the type of a variable.
What is Biostatistics?
As defined by Daniel in 1978, “Biostatistics is a field of study concerned with the
organisation and summarisation of data from health sciences and drawing of inferences about a body
of data when only a part of the data are observed”.
In simpler terms, biostatistics can be defined as the branch of statistics applied to biological
sciences, in which data are collected, classified, summarised, analysed and interpreted.
Types of Statistics
All statistical procedures can be divided into two general categories: descriptive and inferential.
Descriptive statistics
As the name implies, descriptive statistics describe the data that we collect or observe
(empirical data). They represent all of the procedures that can be used to organise, summarise,
display, and categorise data collected for a certain experiment or event. They include tabulation,
graphical presentation, measures of central tendency, etc.
Inferential statistics
Inferential statistics represent a wide range of procedures (tests) that are used to infer or
make predictions about a large body of information based on sample observations. Inferential
statistics include the z test, t test, analysis of variance, etc.
Statistical samples and population
Statistical data usually involve a relatively small portion of an entire population, and
decisions and interpretations (inferences) are made about that population through numerical
manipulation. The population may be defined as the entire number of observations that constitute a
particular group. Samples are generally a relatively small group of observations that have been taken
from a defined population. Parameters are characteristics of populations, while statistics are
characteristics of samples, representing summary measures computed on observed sample values.
Parameters or population values are usually represented by Greek symbols (e.g. μ, σ, ψ) and sample
statistics are denoted by letters (e.g. X̄, S², r).
Samples, as given in above examples, are only a small subset of a much larger population and
are used for nearly all statistical tests. By using various formulas, these descriptive sample results are
manipulated to make predictions (inferences) about the population from which they are sampled.
[Figure: sampling from the population (parameter, unknown) gives a sample whose descriptive statistic is known; mathematical manipulation of the descriptive statistic gives the inferential statistic, the best estimate of the population parameter]
Figure 9.1 Descriptive statistics and inferential statistics
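As a minimal sketch of the difference between a statistic and a parameter, the Python snippet below computes the sample mean (X̄) and sample variance (S²) of a small sample. The values in `sample` are hypothetical and are used only for illustration; the population parameters μ and σ² remain unknown and are only estimated by these statistics.

```python
import statistics

# Hypothetical sample of 5 tablet weights (mg); illustration only, not data from the text
sample = [248, 250, 251, 249, 252]

# Sample mean (X-bar): an estimate of the unknown population mean (mu)
x_bar = statistics.mean(sample)

# Sample variance (S squared, n - 1 divisor): an estimate of the population variance
s_squared = statistics.variance(sample)

print("Sample mean:", x_bar)
print("Sample variance:", s_squared)
```

The same two lines of computation apply to any sample; only the list of observations changes.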
Types of Variables
A variable is something whose value can vary. For example age, sex, and blood type are
variables. Data are the values we get when we measure or observe a variable. There are two major
types of variables- categorical variables and metric variables. Each of these can be further divided
into two sub-types as shown in table 9.2.
Table 9.2 Types of Variables
Categorical variables
1. Nominal categorical variables
Consider the variable blood type, with the categories O, A, B and AB. The variable ‘blood type’ is a
nominal categorical variable. Typical characteristics of such variables are that they do not have any
units of measurement and that the ordering of the categories is completely arbitrary. In other words,
the categories cannot be ordered in any meaningful way. Therefore, we can just as easily write the
blood type categories as AB, B, O, A or B, O, A, AB or B, A, AB, O, or whatever.
2. Ordinal categorical variables
The Child Pugh Score is an ordinal categorical variable. Like nominal data, these data do not
have any units of measurement, but the ordering of the categories is not arbitrary as
it was with nominal variables. It is now possible to order the categories in a meaningful way.
Ordinal data are not real numbers. They cannot be placed on the number line. The reason is
that the Child Pugh Score data, and the data of most other clinical scales, are not properly measured
but assessed in some way by the clinician working with the patient. This is a characteristic of all
ordinal data.
As ordinal data are not real numbers, it is not appropriate to apply any of the rules of basic
arithmetic to them. We cannot add, subtract, multiply or divide ordinal values. This limitation
has marked implications for the analysis of such data.
Metric Variables
1. Continuous metric variables
The variable ‘weight’ is a metric continuous variable. With metric variables, proper
measurement is possible, and therefore these variables produce data that are real numbers and can be
placed on the number line. Some common examples of metric continuous variables include birth
weight (g), blood pressure (mmHg), blood cholesterol (mg/ml), waiting time (minutes), body mass
index (kg/m²), peak expiratory flow (l per min), and so on. These variables have units of measurement
attached to them.
In contrast to ordinal values, the difference between any pair of adjacent values of a
continuous metric variable is exactly the same. The difference between birth weights of 3000 g and
3001 g is the same as the difference between 3001 g and 3002 g, and so on.
Metric continuous variables can be properly measured and have units of measurement.
2. Discrete metric variables
Continuous metric data usually come from measuring, while discrete metric data usually
come from counting. For example, number of deaths, number of pressure sores, number of angina
attacks, and so on, are all discrete metric variables. The data produced are real numbers, and are
invariably integers (i.e. whole numbers). They can be placed on the number line, and have the same
interval and ratio properties as continuous metric data. Metric discrete variables can be properly
counted and have units of measurement: ‘numbers of things’.
Summary
Biostatistics
It can be defined as the branch of statistics applied to biological sciences whereby collection,
classification, summarising, analysis and interpretation of data is done.
Types of Statistics
1. Descriptive Statistics: It describes collected data.
2. Inferential Statistics: It infers about a large body of information based on sample.
Types of Variables
1. Categorical Nominal: No units of measurement; values fall in arbitrary categories.
2. Categorical Ordinal: No units of measurement; values fall in ordered categories.
3. Metric Continuous: It can be properly measured and has units of measurement.
4. Metric Discrete: It can be properly counted and has units of measurement.
Exercise
1. Define biostatistics and enumerate applications of it.
2. Give various types of variables with their characteristics.
3. Give the importance of identifying type of variable in biostatistics.
4. Identify the type of each variable given below, associated with clinical trials of a drug.
a. Sex
b. Age
c. Height
d. Weight
e. Blood type (A, B, AB, O)
f. Blood pressure (Mild, Moderate, Severe)
g. Blood glucose level
h. Fed vs fasted state
i. Manufacturer (generic vs brand)
j. Smoking history (no. of cigarettes per day)
5. Identify the type of each variable given below, associated with manufacturing a batch of tablets of
Ciprofloxacin.
a. Impurities- present or absent
b. Amount of active ingredient (content uniformity)
c. Disintegration time
d. Dissolution test- pass or fail criteria
e. Friability- pass or fail criteria
f. Hardness
g. Appearance (good, better, best)
h. Machine efficiency score (-5 to +5)
i. Weight variation test (pass or fail)
j. Human resources employed (no. of persons)
Answers:
Multiple Choice Questions
1. c 2. c 3. a 4. d 5. c 6. b 7. c 8. c 9. a 10. d
Exercise
4. a. Sex- categorical nominal
b. Age- metric continuous
c. Height- metric continuous
d. Weight- metric continuous
e. Blood type (A, B, AB, O)- categorical nominal
f. Blood pressure (Mild, Moderate, Severe)- categorical ordinal
g. Blood glucose level- metric continuous
h. Fed vs fasted state- categorical nominal
i. Manufacturer (generic vs brand)- categorical nominal
j. Smoking history (no. of cigarettes per day)- metric discrete
Chapter 10
PRESENTATION OF DATA
Learning objectives
When we have finished this chapter, we should be able to:
1. Construct the tables of frequency, relative frequency, cumulative frequency and relative cumulative
frequency.
2. Construct grouped frequency table and a cross-tabulation table.
3. Choose the most appropriate graph for the given data type.
4. Draw pie charts, bar charts, histograms, frequency polygons and ogives.
5. Interpret and explain what a table or graph reveals.
Introduction
Whenever data are collected for some project, they are usually in ‘raw’ form and not
organised. Descriptive statistics deals with sorting these raw data by putting them into a table,
presenting them in an appropriate chart or summarising them numerically.
An important consideration in sorting the raw data is the type of variable concerned. The data
from some variables are best described with a table, some with a chart, and some with both. However,
a numeric summary is more appropriate for some types of variable.
Tabulation of Data
Tabulation is the first step before the data are used for analysis or interpretation. Frequency
distribution tables present data in a relatively compact form, ready to use, but certain information
may be lost. The data can be reduced to a manageable form using frequency tables.
The frequency table
The frequency table can have one or all the following parameters, depending on the type of
data.
1. Frequency:
Frequency is the repetition of observations or actual number of subjects in each category.
2. Relative frequency:
Relative frequency is the frequency converted into percentage of the total number of
observations.
$$\text{Relative frequency} = \frac{\text{Number of observations in category}}{\text{Total number of observations}} \times 100 \quad \ldots(1)$$
3. Cumulative frequency:
It is the cumulative total of frequencies and is obtained by adding the frequency of
observations at each level to the frequencies of the preceding level(s).
4. Cumulative relative frequency:
It is cumulative frequency converted into the percentage of the total number of observations.
Let us take the examples of various types of data and construct the frequency table.
Solution
As we know, the ordering of nominal categories is arbitrary, and in this example they are
ordered by the number of students in each, largest first. The total frequency (n = 95) is shown at the
top of the frequency column. This is helpful for the reader.
1. Frequency
Table 10.1 Frequency table showing the distribution of blood group of 95 pharmacy students
Category of blood group    Tally marks                                          Frequency (number of students), n=95
A                          |||| |||| |||| |||| |||| |||| |||| |||| |||| ||||    49
B                          |||| |||| |||| |||| |||| ||                          27
AB                         |||| |||| ||||                                       15
O                          ||||                                                 04
2. Relative frequency
Table 10.2 Relative frequency table showing the percentage of students in each blood group
Category of blood group    Frequency (number of students), n=95    Relative frequency (% of students in each category)
A                          49                                      (49/95)*100 = 51.6
B                          27                                      (27/95)*100 = 28.4
AB                         15                                      (15/95)*100 = 15.8
O                          04                                      (04/95)*100 = 04.2
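The relative frequencies in Table 10.2 can be reproduced with a few lines of Python. This is a minimal sketch rather than a prescribed method; the dictionary name `frequency` and the rounding to one decimal place are our own choices, while the counts are those of Table 10.1.

```python
# Frequencies of blood groups from Table 10.1 (n = 95)
frequency = {"A": 49, "B": 27, "AB": 15, "O": 4}

# Total number of observations
total = sum(frequency.values())

# Relative frequency (%) = (observations in category / total observations) * 100
relative_frequency = {
    group: round(count / total * 100, 1) for group, count in frequency.items()
}

# Print one row per category: group, frequency, relative frequency
for group, count in frequency.items():
    print(f"{group:<3}{count:>4}{relative_frequency[group]:>7.1f}")
```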
Presentation of Data 159
3. Cumulative frequency
It makes no sense to calculate cumulative frequency for nominal data, because of the
arbitrary category order. Hence, cumulative frequency is not calculated.
2. Frequency table for Ordinal Data
When the variable in question is ordinal, we can allocate the data into ordered categories.
Example 10.2
Let us take an example of ‘level of satisfaction’ of 60 final year students regarding
infrastructure available in the college. The following data is given in numbered form for easy
understanding: (4- very satisfied, 3- satisfied, 2- neutral, 1-dissatisfied, 0- very dissatisfied).
Data: 3, 0, 2, 1, 3, 4, 0, 3, 4, 0, 2, 3, 4, 1, 3, 2, 3, 4, 0, 1, 3, 4, 3, 4, 0, 3, 2, 1, 3, 4, 2, 1, 3, 3, 1, 4,
3, 1, 3, 4, 1, 3, 4, 3, 4, 0, 3, 2, 3, 4, 1, 3, 1, 0, 3, 4, 3, 2, 1, 0
Solution:
Level of satisfaction is clearly an ordinal variable. ‘Satisfaction’ cannot be properly
measured, and has no units. But the categories can be meaningfully ordered, as they have been given
here.
The frequency values indicate that more than half of the students were happy with their
infrastructure facilities: 34 students (13+21) out of 60. Much smaller numbers expressed dissatisfaction.
Table 10.3 The frequency distributions for the ordinal variable ‘level of satisfaction’ with
infrastructure available in college
Satisfaction with infrastructure    Tally marks             Frequency (number of students), n=60
Very satisfied (4)                  |||| |||| |||           13
Satisfied (3)                       |||| |||| |||| |||| |   21
Neutral (2)                         |||| ||                 07
Dissatisfied (1)                    |||| |||| |             11
Very dissatisfied (0)               |||| |||                08
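The tallying in Table 10.3 can also be done by computer. The sketch below counts the scores of Example 10.2 and then builds cumulative and relative cumulative frequencies, which are meaningful here because the categories are ordered. Accumulating from ‘very satisfied’ downwards and rounding to one decimal place are assumptions made for this illustration.

```python
from collections import Counter
from itertools import accumulate

# Satisfaction scores of 60 students from Example 10.2
# (4 = very satisfied, 3 = satisfied, 2 = neutral, 1 = dissatisfied, 0 = very dissatisfied)
data = [3, 0, 2, 1, 3, 4, 0, 3, 4, 0, 2, 3, 4, 1, 3, 2, 3, 4, 0, 1,
        3, 4, 3, 4, 0, 3, 2, 1, 3, 4, 2, 1, 3, 3, 1, 4,
        3, 1, 3, 4, 1, 3, 4, 3, 4, 0, 3, 2, 3, 4, 1, 3, 1, 0, 3, 4,
        3, 2, 1, 0]

counts = Counter(data)       # tally of each score
levels = [4, 3, 2, 1, 0]     # ordered categories, as listed in Table 10.3
n = len(data)

frequency = [counts[level] for level in levels]
cumulative = list(accumulate(frequency))  # running total of frequencies
relative_cumulative = [round(c / n * 100, 1) for c in cumulative]

for level, f, c, rc in zip(levels, frequency, cumulative, relative_cumulative):
    print(level, f, c, rc)
```

The second cumulative value is the 34 students (13+21) who were satisfied or very satisfied.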
Table 10.5 The Cumulative and relative cumulative frequency distributions for data
23-45 6
46-90 7
91-181 8
182-363 9
364-726 10
727-1454 11
1455-2909 12
Table 10.7 The frequency distribution table for metric continuous data
Weight (kg)    Tally marks       Frequency, n=60
45-49          |||               03
50-54          |||| ||||         10
55-59          |||| |||| ||      12
60-64          |||| |||| ||||    15
65-69          |||| |            06
70-74          |||| ||           07
75-79          ||||              05
80-84          ||                02
1. Class:
A group of observations is called a class. In this example there are 8 classes (45-49,
50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84).
2. Class limits:
The minimum value that can be included in the class is the lower class limit, while the maximum
value that can be included in the class is the upper class limit. In this example, for class 45-49, the
lower class limit is 45 while the upper class limit is 49.
3. Class boundaries:
Consider the classes 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84 in this example.
49 is the upper class limit of the 45-49 class while 50 is the lower class limit of the 50-54 class. Here
the class limits are not continuous, and therefore we have to subtract 0.5 from each lower limit and add
0.5 to each upper limit. Thus the classes become continuous, as shown in the table below, and the
resulting values are called class boundaries. In the case of classes 45-50, 50-55, 55-60, 60-65, 65-70,
70-75, 75-80, 80-85, the class limits are already continuous and hence the class limits themselves are
the class boundaries.
Class mark:
Class marks are simply the midpoints of the classes, and they are found by adding the lower and
upper limits of a class (or its lower and upper boundaries) and dividing by two.
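The rules for class boundaries and class marks can be sketched in Python as follows. The class limits are those of Table 10.7; the variable names and the way the gap between adjacent classes is derived are assumptions for this illustration.

```python
# Class limits for the weight data in Table 10.7 (kg)
class_limits = [(45, 49), (50, 54), (55, 59), (60, 64),
                (65, 69), (70, 74), (75, 79), (80, 84)]

# Gap between the upper limit of one class and the lower limit of the next (here 1 kg)
gap = class_limits[1][0] - class_limits[0][1]

# Class boundaries: subtract half the gap from the lower limit, add it to the upper limit
class_boundaries = [(lo - gap / 2, hi + gap / 2) for lo, hi in class_limits]

# Class mark: midpoint of the class, (lower limit + upper limit) / 2
class_marks = [(lo + hi) / 2 for lo, hi in class_limits]

for limits, bounds, mark in zip(class_limits, class_boundaries, class_marks):
    print(limits, bounds, mark)
```

Because each boundary is moved by the same amount, the midpoint is the same whether it is computed from the limits or from the boundaries.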
Table 10.11 The relative, cumulative and relative cumulative frequency distribution table
Cross- tabulation
Each of the frequency tables above provides us with a description of the frequency
distribution of a single variable. However, sometimes, we need to examine the association between
two variables, within a single group of individuals. We can do this by putting the data into a table of
cross-tabulations, where the rows represent the categories of one variable, and the columns represent
the categories of a second variable.
Example 10.5
A study was carried out on the degree of job satisfaction among doctors and nurses in rural
and urban areas. To describe the sample a cross-tabulation was constructed which included the sex
and the residence (rural or urban) of the doctors and nurses interviewed. This was useful because in
the analysis the opinions of male and female staff had to be compared separately for rural and urban
areas.
Table 10.12 Type of health worker by residence
Type of Health Worker
Residence Doctors Nurses Total
Table 10.12 a shows that a higher percentage of nurses than of doctors work in rural areas, but
that, overall, a greater proportion of staff works in urban areas (67%).
[Figure 10.2: pie chart of blood group; slices: A 52%, B 28%, AB 16%, O 4%]
For example, Figure 10.2 is a pie chart of the blood groups of the 95 pharmacy students
shown in Table 10.2. A disadvantage of a pie chart is that it can only represent one variable (in Figure
10.2, blood group). We will therefore need a separate pie chart for each variable we want to chart.
Another disadvantage is that a pie chart can lose clarity if it is used to represent more than four or
five categories.
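As a small sketch of how the slices of such a pie chart are sized, the Python snippet below converts the blood group frequencies of Table 10.1 into whole-number percentages and into slice angles. Expressing each slice as a share of 360 degrees is the usual convention for pie charts and is our addition; the text itself gives only the percentages.

```python
# Frequencies of blood groups from Table 10.1 (n = 95)
frequency = {"A": 49, "B": 27, "AB": 15, "O": 4}
total = sum(frequency.values())

# Percentage label of each slice, rounded to a whole number
percent = {group: round(f / total * 100) for group, f in frequency.items()}

# Angle of each slice in degrees: the slice's share of the full circle (360 degrees)
angle = {group: f / total * 360 for group, f in frequency.items()}

for group in frequency:
    print(f"{group:<3}{percent[group]:>4}%{angle[group]:>8.1f} degrees")
```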
[Figure: bar chart; x-axis: Blood Group (A, B, AB, O); y-axis: Frequency (0 to 50)]
Figure 10.3 Simple bar chart of blood group of pharmacy students
A 04 11
B 29 20
AB 01 03
O 14 13
Alternatively, the chart can also be drawn with the categories boys and girls on the
horizontal axis. This format is more useful if we want to compare category sizes within each group.
Example 10.6
Girls with blood group A, B, AB and O can be compared. Which chart is more appropriate
depends on what aspect of the data we want to examine.
[Figure: two clustered bar charts; left panel: x-axis Blood Group (A, B, AB, O) with bars for Boys and Girls; right panel: x-axis Sex (Boys, Girls) with bars for A, B, AB and O; y-axis: Frequency (0 to 35)]
Figure 10.4 Clustered bar chart of blood group of 95 pharmacy students by sex
[Figure: stacked bar chart; x-axis: Sex (Boys, Girls); y-axis: Frequency (0 to 60); segments: A, B, AB, O]
Figure 10.5 A stacked bar chart of blood group of 95 pharmacy students by sex
5. Pictograms
Pictograms are similar to bar charts. They present the same type of information, but the bars
are replaced with a proportional number of icons. This type of presentation for descriptive statistics
dates back to the beginning of civilization, when pictorial images were used to record numbers of
people, animals or objects.
Example 10.8
Population of different districts of Western Maharashtra. Each diagram indicates one lakh
population.
Pune Satara Kolhapur Sangli
[Figure: bar chart; x-axis: No. of times inhaler used in past 24 h (0 to 7); vertical axis scaled 0 to 10]
Figure 10.7 The frequency distribution table for metric discrete data
2. Line Plot
A line chart is similar to a bar chart except that thin lines, instead of thicker bars, are used to
represent the frequency associated with each level of the discrete variable.
Example 10.10
The assay results of 30 amoxicillin capsules are as follows:
Data: 251, 250, 253, 249, 250, 252, 247, 248, 254, 245, 250, 253, 251, 250, 249, 252, 249,
251, 246, 250, 250, 254, 248, 252, 251, 248, 250, 247, 251, 249.
Table 10.14 Frequency distribution table for assay result of 30 amoxicillin tablets
Assay Result (mg)    Tally marks    Frequency, n=30
245                  |              1
246                  |              1
247                  ||             2
248                  |||            3
249                  ||||           4
250                  |||| ||        7
251                  ||||           5
252                  |||            3
253                  ||             2
254                  ||             2
Using the data presented in Table 10.14, a corresponding line chart is presented in Figure
10.8.
[Figure 10.8: line chart; x-axis: Assay Result (245 to 254); y-axis: Frequency (0 to 8)]
3. Point Plot
Point plots are identical to line charts; however, instead of a line, a number of points or dots
equivalent to the frequency is stacked vertically for each value on the horizontal axis. Also referred
to as dot diagrams, point plots are useful for small data sets.
Using the data presented in Table 10.14, a corresponding point diagram is presented in
Figure 10.9.
[Figure 10.9: point plot; x-axis: Assay Result (245 to 254); y-axis: Frequency (0 to 8)]
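The idea of a point plot can be sketched as plain text in Python: the assay results of Example 10.10 are counted, and one `*` is stacked for each observation above its value. The text layout (column width, `*` as the dot symbol) is an assumption made for this illustration and is not how the figure in the book was drawn.

```python
from collections import Counter

# Assay results (mg) of 30 amoxicillin capsules from Example 10.10
assay = [251, 250, 253, 249, 250, 252, 247, 248, 254, 245, 250, 253, 251,
         250, 249, 252, 249, 251, 246, 250, 250, 254, 248, 252, 251, 248,
         250, 247, 251, 249]

frequency = Counter(assay)            # frequency of each assay value
values = sorted(frequency)            # values for the horizontal axis
highest = max(frequency.values())     # tallest stack of dots

rows = []
for level in range(highest, 0, -1):
    # Place a dot in every column whose frequency reaches this level
    row = "".join("  *  " if frequency[v] >= level else "     " for v in values)
    rows.append(f"{level:>2} |{row}")
rows.append("   +" + "-" * (5 * len(values)))
rows.append("    " + "".join(f" {v} " for v in values))

point_plot = "\n".join(rows)
print(point_plot)
```

Replacing each stack of dots with a thin vertical line of the same height would give the corresponding line chart.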
1. The Histogram
A continuous metric variable can take a very large number of values, so it is usually
impractical to plot them without first grouping the values. The grouped data are plotted using a
frequency histogram, which has frequency plotted on the vertical axis and the groups (class
intervals) on the horizontal axis.
A histogram looks like a bar chart but without any gaps between adjacent bars. This
emphasises the continuous nature of the underlying variable. If the groups in the frequency table are
all of the same width, then the bars in the histogram will also be of the same width. One limitation of
the histogram is that it can represent only one variable at a time (like the pie chart), and this can make
comparisons between two histograms difficult.
Figure 10.10 shows a histogram of the grouped weight in kg of 60 final year pharmacy
students as per the data in Table 10.8.
[Figure: histogram; x-axis: Weight in kg (classes 45-49 to 80-84); y-axis: Frequency (0 to 16)]
Figure 10.10 A histogram of the grouped weight in kg of 60 final year pharmacy students
2. Frequency Polygon
A frequency polygon can be constructed by placing a dot at the midpoint (class mark) of
each class interval in the histogram and then connecting these dots by straight lines. The frequency
polygon gives a better conception of the shape of the distribution. The class interval midpoint for a
section in a histogram is calculated as follows:
$$\text{Midpoint (class mark)} = \frac{\text{Lower class limit} + \text{Upper class limit}}{2}$$
The frequency polygon is then created by listing the midpoints (class marks) on the x-axis,
frequencies on the y-axis, and drawing lines to connect the midpoints for each interval.
Table 10.15
Class limits    Class boundaries    Midpoint (class mark)    Frequency
45-49           44.5-49.5           47                       03
50-54           49.5-54.5           52                       10
55-59           54.5-59.5           57                       12
60-64           59.5-64.5           62                       15
65-69           64.5-69.5           67                       06
70-74           69.5-74.5           72                       07
75-79           74.5-79.5           77                       05
80-84           79.5-84.5           82                       02
[Figure: frequency polygon; x-axis: Class Marks (40 to 90); y-axis: Frequency (0 to 16)]
Figure 10.11 A Frequency Polygon of the grouped weight in kg of 60 final year pharmacy students
Table 10.16
[Figure: ogive; x-axis: Class Marks (40 to 90); y-axis: % Relative cumulative frequency (0 to 100); the 25th, 50th and 75th percentiles mark Q1, Q2 and Q3]
Applications of Ogive
1. By using an ogive we can locate any percentile that will divide the series into two parts.
2. Quartiles: There are three points located on the entire range of the variable (here it is weight in
kg). These are Q1, Q2 and Q3.
Q1, or the lower quartile, will have 25% of observations falling on its left and 75% of observations
on its right side.
Q2 is the median, i.e., 50% of values lie on either side.
Q3, the upper quartile, will have 75% of observations falling on its left side and 25% on its right side.
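Reading a quartile off an ogive amounts to finding where the cumulative frequency curve crosses 25%, 50% or 75%. The sketch below does this numerically for the weight data of Table 10.15. It assumes that cumulative frequencies are plotted against the upper class boundaries and that the curve is a straight line within each class; the function name `percentile` is hypothetical. Values read by eye from a drawn ogive may differ slightly.

```python
from itertools import accumulate

# Class boundaries and frequencies from Table 10.15 (weight in kg, n = 60)
boundaries = [44.5, 49.5, 54.5, 59.5, 64.5, 69.5, 74.5, 79.5, 84.5]
frequency = [3, 10, 12, 15, 6, 7, 5, 2]

n = sum(frequency)
cumulative = list(accumulate(frequency))               # cumulative frequency
relative_cumulative = [c / n * 100 for c in cumulative]  # as % of n


def percentile(p):
    """Weight below which p% of observations fall (linear interpolation within a class)."""
    target = p / 100 * n          # number of observations lying below the percentile
    below = 0                     # cumulative frequency before the current class
    for i, f in enumerate(frequency):
        if below + f >= target:
            width = boundaries[i + 1] - boundaries[i]
            return boundaries[i] + (target - below) / f * width
        below += f
    return boundaries[-1]


q1, q2, q3 = percentile(25), percentile(50), percentile(75)
print(f"Q1 = {q1:.1f} kg, Q2 (median) = {q2:.1f} kg, Q3 = {q3:.1f} kg")
```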
Example 10. 11
Plot the Stem and Leaf graph for following data of age in years of 27 patients.
8, 13, 16, 25, 26, 29, 30, 32, 37, 38, 40, 41, 44, 47,
49, 51, 54, 55, 58, 61, 63, 67, 75, 78, 82, 86, 95
Solution
1. We can group 27 patients (8-95 yrs) according to age groups. Here we will take groups of 10 yrs
each.
0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99
2. We can now write these groups on left side of the vertical line and place the last digit of the age on
the right of the vertical line to the corresponding group (see figure 10.13).
3. Now, groups can be considered as stem and corresponding digits on the right as leaves. This can be
rewritten in figure 10.14.
4. From stem and leaves, we can read the exact ages of the patients and the data is not lost.
top and bottom lines of the box to an upper and lower adjacent value. The details of drawing box and
whiskers plot is discussed in next chapter (Example 12.5).
Stem    | Leaves
0-9     | 8
10-19   | 3 6
20-29   | 5 6 9
30-39   | 0 2 7 8
40-49   | 0 1 4 7 9
50-59   | 1 4 5 8
60-69   | 1 3 7
70-79   | 5 8
80-89   | 2 6
90-99   | 5
Figure 10.13 Stem and Leaf plot of grouped data
Stem | Leaves
0    | 8
1    | 3 6
2    | 5 6 9
3    | 0 2 7 8
4    | 0 1 4 7 9
5    | 1 4 5 8
6    | 1 3 7
7    | 5 8
8    | 2 6
9    | 5
Figure 10.14 Stem and Leaf plot of ungrouped data
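The steps of Example 10.11 can be sketched in Python as follows. Each age is split into a stem (the tens digit) and a leaf (the last digit); the use of `divmod` and the printed layout are our own choices for this illustration.

```python
from collections import defaultdict

# Ages (years) of 27 patients from Example 10.11
ages = [8, 13, 16, 25, 26, 29, 30, 32, 37, 38, 40, 41, 44, 47,
        49, 51, 54, 55, 58, 61, 63, 67, 75, 78, 82, 86, 95]

stems = defaultdict(list)
for age in sorted(ages):
    stem, leaf = divmod(age, 10)   # stem = tens digit, leaf = last digit
    stems[stem].append(leaf)

# Print the stem on the left of the vertical line and its leaves on the right
for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Because every leaf is kept, the exact ages can be read back from the plot, as noted in step 4.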
Example 10.12:
The table shows the age and the weight of five children. Plot the scatter graph.
Table 10.17 Age and weight of five children
Age (years)    Weight (pounds)
2              20
3              25
4              28
5              30
6              35
Solution
1. Take the weight of children on y-axis while age in years on x-axis.
2. Take appropriate scale on y-axis covering the weight from 20-35.
3. Take appropriate scale on x-axis covering the age from 1-6.
4. Now put the marks of weight to the corresponding ages of children.
5. Draw the line connecting the marks (points) to get the scatter plot.
[Figure: scatter plot; x-axis: Age (years), 0 to 8; y-axis: Weight (Pound), 15 to 40]
Figure 10.19 Window of Chart Source Data
Figure 10.20 Window of selecting Series option button.
2. Now click on the picture of a graph in the toolbar at the top. Alternatively, we can go to “Insert
Chart” from the drop-down menus.
3. Step 1 of the Chart Wizard will appear. A bar graph is called a “Column Chart” in Excel, so select
“Clustered Column Chart” and click “Next”.
4. Step 2 of the Chart Wizard has 2 tabs, the “Data Range” & “Series” tabs.
a. The “Data Range” tab asks us to pick data range. If data is not highlighted previously we
can use the selection tool to do so from this screen.
b. Click on the “Series” tab.
In the space labeled “Name” we can type the title of the graph.
In the space labeled “Category (x) axis labels”, we can highlight a data range that contains
the words or numbers that we would like to appear under each bar. Click on the picture of the
spreadsheet beside the blank to allow us to highlight the range on the spreadsheet. Hit “enter”, then
click “Next”.
5. Step 3 of the Chart Wizard has several different tabs.
a. Titles tab –
Chart Title: This is the graph title that will appear above the graph by default. We have
already entered title under step 2.
Category X axis : Enter the X axis label here.
Value Y axis: Enter the Y axis label here.
b. Axes tab – We can change the type of scaling on X axis & Y axis.
c. Gridlines tab – We can check the boxes on this tab to make gridlines appear or disappear.
d. Legend tab – We can check the boxes to make series legend appear or disappear.
e. Data Labels tab – We can check boxes to show specific data labels and values.
f. Data Table tab – We can use this tab to show the data table on the graph.
g. Hit “Next” after making choices on step 3.
6. Step 4 of the Chart Wizard simply asks where we want the graph to appear. We can choose for the
graph to appear by itself on a separate sheet in the workbook, or to appear as an object embedded in the
current sheet of the workbook, next to the data.
Summary
1. Tabulation of data
Frequency distribution table
1. Frequency: Repetition of observations
2. Relative frequency: Frequency converted into percentage
3. Cumulative frequency: Cumulative total of frequencies
4. Cumulative relative frequency: Cumulative frequency converted into percentage
2. Graphical presentation of nominal and ordinal data
1. Pie Chart 2. Simple Bar Chart 3. Clustered Bar Chart
4. Stacked Bar Chart 5. Pictogram
Presentation of Data 181
Exercise
1. Based on 1210 patients involved in clinical trials of an anticancer drug, it was observed that 841
experienced no adverse effects, while 256, 91 and 22 subjects suffered mild, moderate and severe
adverse effects respectively. Prepare tabular and graphical presentation for this data with proper
reasoning for choosing particular graphs.
2. The following assay results (percentage of label claim) were observed in 50 random samples during
a production run.
102, 100, 96, 99, 101, 102, 100, 105, 97, 100, 92, 103, 101, 100, 99, 102, 96, 100,
101, 98, 107, 95, 98, 100, 100, 99, 97, 104, 101, 103, 98, 101, 100, 105, 99, 101, 102, 100,
87, 98, 101, 103, 93, 99, 101, 97, 100, 102, 99, 104.
Tabulate the data and report results as a stemplot.
3. Construct table of frequency, relative frequency, cumulative frequency and relative cumulative
frequency for following data of weights in kg of 50 pharmacy students and present data graphically.
51, 53, 52, 39, 58, 48, 45, 56, 62, 64, 66, 67, 42, 48, 43, 44, 45, 47, 52, 54, 50, 49, 38, 39, 38,
32, 34, 36, 33, 34, 36, 38, 42, 43, 48, 49, 50, 51, 52, 30, 31, 30, 32, 45, 45, 47, 46, 48, 49, 58.
4. Prepare Ogive curve for the following data.
Height of groups (cm) 72-76 76-80 80-84 84-88 88-92 92-96 96-100
Frequency 5 7 10 15 11 9 7
5. Construct scatter plot of the responses given by a drug in 15 patients.
Concentration of drug (mg/l) 4 6 8 10 12 14 16 18 20 22 24 26
Response 1 2 3 4 5 6 7 8 9 10 11 12
6. Following data provides the number of deaths for several leading causes for the year 2010. Display
Answers:
Multiple Choice Questions
1. a 2. d 3. b 4. d 5. d 6. b 7. c 8. b 9. a 10. b
Exercise
1. Tabular presentation of data
Grade of         Frequency   Relative        Cumulative   Relative cumulative
adverse effect   (n=1210)    frequency (%)   frequency    frequency (%)
0                841         69.50           841          69.50
1                256         21.16           1097         90.66
2                91          7.52            1188         98.18
3                22          1.82            1210         100.00
No adverse effect (0); Mild adverse effect (1); Moderate adverse effect (2); Severe adverse effect (3)
Graphical presentation of data: The given data is ordinal categorical type and hence can be
represented as Bar chart.
[Figure: bar chart of number of patients for no, mild, moderate and severe adverse effects]
2. Stem plot: The given data vary from 87 to 107 and therefore the stems should be 8, 9 and 10.
Stems Leaves
8 7
9 235667778888999999
10 0000000000111111112222233344557
Histogram is to be given.
[Figure: histogram of frequency against weight in kg, class intervals 30-35 to 65-70]
4. Ogive
[Figure: ogive of % cumulative frequency against mid point of height of groups (cm)]
5. Scatter plot
[Figure: scatter plot of response against concentration of drug (mg/l)]
6. Pie Chart
Other causes 34%
Accidents 4%
Cerebrovascular disease 8%
[Figure: histogram of frequency against weight in kg]
Learning objectives
When we have finished this chapter, we should be able to:
1. Explain what is meant by the ‘shape’ of a frequency distribution.
2. Sketch and explain: negatively skewed, symmetric and positively skewed distributions.
3. Sketch and explain a bimodal distribution.
4. Describe the approximate shape of a frequency distribution from a frequency table or chart.
5. Sketch and describe a Normal distribution.
Figure 11.1 Chart showing uniform distribution of data
2. Positively skewed distribution: The values are concentrated towards the bottom of the range, with
progressively fewer values towards the top of the range. This is a right or positively skewed
distribution. See figure 11.2 for positively skewed distribution.
3. Negatively skewed distribution: The values are concentrated towards the top of the range, with
progressively fewer values towards the bottom of the range. This is a left or negatively skewed
distribution. See figure 11.3 for negatively skewed distribution.
Figure 11.3 Chart showing negatively skewed distribution of data
4. Symmetric or Mound-shaped distribution: The values are clumped together around one
particular value, with progressively fewer values both below and above this value. This is a
symmetric or mound-shaped distribution.
5. Bimodal or Multimodal distribution: The values are clumped around two or more particular
values. This is a bimodal or multimodal distribution.
Figure 11.5 Chart showing bimodal distribution of data
One simple way to assess the shape of a frequency distribution is to plot a bar chart, or a
histogram.
Figure 11.6 Frequency curve superimposed on a histogram
Figure 11.7 Frequency curve superimposed on a histogram
Skewness
Skewness measures asymmetry around the mean. The parameter is best interpreted
relative to the normal distribution (whose skewness equals zero). The interpretation of the
skewness is
- Skewness > 0: asymmetric, with the longer tail extending towards values above the mean (positively skewed)
- Skewness < 0: asymmetric, with the longer tail extending towards values below the mean (negatively skewed)
Skewed data generally need to be analysed using non-parametric tests, while normally distributed data are
analysed using parametric tests.
Kurtosis
Kurtosis is a property associated with a frequency distribution and refers to the shape of the
distribution of values regarding its relative flatness and peakedness. When kurtosis is measured relative
to the normal distribution (so that the normal distribution has a value of zero), its interpretation is:
Kurtosis > 0: peaked relative to the Normal distribution
Kurtosis < 0: flat relative to the Normal distribution
Here are some examples of the shapes described above.
Shape of Distribution of Data 191
Figure 11.8 Chart showing kurtosis of distribution of data, with peaked (leptokurtic), normal (mesokurtic) and flatter (platykurtic) curves
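The text gives no formula for skewness or kurtosis, so the following Python sketch assumes the common moment-based coefficients: skewness = m3 / m2^1.5 and kurtosis relative to normal = m4 / m2^2 - 3, where m2, m3 and m4 are the second, third and fourth central moments. The function name `moment_shape` and both small data sets are hypothetical and serve only to illustrate the sign conventions above.

```python
def moment_shape(values):
    """Return (skewness, kurtosis relative to normal) from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3   # zero for a normal distribution
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]          # hypothetical, evenly spread values
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical, one large value in the upper tail

sym_skew, sym_kurt = moment_shape(symmetric)
pos_skew, pos_kurt = moment_shape(right_skewed)
print(sym_skew, sym_kurt)   # skewness 0; negative kurtosis (flat)
print(pos_skew, pos_kurt)   # positive skewness
```

Statistical packages often apply small-sample corrections to these coefficients, so their output may differ slightly from this sketch.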
Summary
Shapes of data:
1. Uniform distribution
2. Positively skewed distribution
3. Negatively skewed distribution
4. Symmetric or Mound-shaped distribution
5. Bimodal or Multimodal distribution
If the data is skewed it is required to be treated using non parametric tests while normal data
is treated using parametric tests.
Normal distribution:
It is a bell shaped curve, symmetric and usually has Mean = Mode = Median
Skewness
Skewness measures asymmetry around the mean.
Kurtosis
Kurtosis measures relative flatness and peakedness.
Exercise:
1. Describe shapes of data.
2. Write characteristics of normal distribution.
3. Why is knowing the shape of the data important?
4. Explain skewness with the help of diagram.
5. Explain kurtosis with the help of diagram.
Answers:
Multiple Choice Questions
1. a 2. b 3. c 4. a 5. b
Chapter 12
MEASURES OF CENTRAL TENDENCY
Learning objectives
When we have finished this chapter, we should be able to:
1. Explain what a summary measure of location is, and calculate the mode, median and mean for a set
of values.
2. Demonstrate the role of data type and distributional shape in choosing the most appropriate
measure of central tendency.
3. Explain what a percentile is, and calculate any given percentile value.
4. Explain what a summary measure of spread is, and calculate, the range, the interquartile range and
the standard deviation.
5. Estimate percentile values from an ogive.
8. Demonstrate the role of data type and distributional shape in choosing the most appropriate
measure of spread.
9. Draw a boxplot and explain how it works.
10. Explain the use of Excel in estimating all measures of central tendency.
Introduction
As we have seen in the previous two chapters, we can ‘describe’ raw data by charting it,
arranging it in table form or examining its shape. These procedures help us to see patterns in the
data. However, it is often more useful to summarise the data numerically. There are two principal
features of a set of data that can be summarised with a single numeric value:
1. Measure of Location: A value around which the data have a tendency to congregate or cluster is
called a summary measure of location.
2. Measure of Dispersion: A value which measures the degree to which the data are spread out is
called a summary measure of spread or dispersion.
With these two summary values we can compare different sets of data quantitatively.
1. The Mode
The mode is that category or value in the data that has the highest frequency (i.e. it occurs
most often). The mode is not particularly useful with metric continuous data, where no two values may be
the same. The other shortcoming of this measure is that there may be more than one mode in a set of data.
Example 12.2
Calculate mode for following grouped data
Table 12.1 Frequency distribution of Sales Per day
Sales volume (Class interval) 53-56 57-60 61-64 65-68 69-72 72 and above
Number of days (Frequency) 2 4 5 4 4 1
Solution
Since the largest frequency corresponds to the class interval 61-64, hence it is the modal
class.
l1 = Lower limit of modal class = 61; f1 = Frequency of modal class = 5;
f0 = Frequency before the modal class = 4; f2 = Frequency after the modal class = 4
i = Class interval of modal class = 3
Formula
$$\text{Mode} = l_1 + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times i$$
$$\text{Mode} = 61 + \left(\frac{5 - 4}{10 - 4 - 4}\right) \times 3 = 62.5$$
Hence, the modal sale is of 62.5 units.
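The grouped-data mode formula can be checked with a short Python sketch. The function name `grouped_mode` is hypothetical; its arguments are the quantities l1, f1, f0, f2 and i exactly as defined in the solution above.

```python
def grouped_mode(l1, f1, f0, f2, i):
    """Mode = l1 + (f1 - f0) / (2*f1 - f0 - f2) * i for grouped data."""
    return l1 + (f1 - f0) / (2 * f1 - f0 - f2) * i

# Values from Example 12.2 (modal class 61-64)
modal_sales = grouped_mode(l1=61, f1=5, f0=4, f2=4, i=3)
print(modal_sales)  # 62.5
```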
Example 12.3
Calculate mode for following grouped data
Table 12.2 Frequency distribution for the heights of the Pharmacy students
Solution
Since the largest frequency corresponds to the class interval 61-63, hence it is the modal
class.
l1 = Lower limit of modal class = 61; f1 = Frequency of modal class = 65;
f0 = Frequency before the modal class = 23; f2 = Frequency after the modal class = 51
i = Class interval of modal class = 2
Formula
$$\text{Mode} = l_1 + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times i$$
$$\text{Mode} = 61 + \left(\frac{65 - 23}{130 - 23 - 51}\right) \times 2 = 61 + \frac{42}{56} \times 2 = 62.5$$
Hence, the modal height is 62.5 inches.
2. The Median
If we arrange the data in ascending order of size, the median is the middle value. Thus, half of
the values will be equal to or less than the median value, and half will be equal to or above it. The
median is thus a measure of central-ness. If we have an even number of values, the median is the
average of the two values either side of the ‘middle’.
An advantage of the median is that it is not much affected by skewness in the distribution, or
by the presence of outliers. However, it discards a lot of information, because it ignores most of the
values, apart from those in the centre of the distribution.
$$\text{Median} = \text{value of the } \left(\frac{n+1}{2}\right)\text{th observation} \tag{2}$$
If the number of observations (n) is an even number, then the median is defined as the
arithmetic mean of the numerical values of the (n/2)th and (n/2 + 1)th observations in the data array.
$$\text{Median} = \frac{\left(\frac{n}{2}\right)\text{th observation} + \left(\frac{n}{2} + 1\right)\text{th observation}}{2} \tag{3}$$
Example 12.4
Calculate the median of the following data that relates to the number of patients per day in the
outpatient ward in a Civil Hospital.
100, 200, 120, 170,130, 150, 180.
Solution:
1. First arrange the data in an ascending order
100, 120, 130, 150, 170, 180, 200.
2. Since the number of observations in the data array is odd (n = 7), the median for this data is
$$\text{Median} = \text{value of the } \left(\frac{n+1}{2}\right)\text{th observation} = \left(\frac{7+1}{2}\right)\text{th observation} = 4\text{th observation}$$
4th observation in the data array = 150.
Thus the median number of patients examined per day in the OPD of the Civil Hospital is 150.
Example 12.5
Calculate the median of the following data that relates to the sale in lakh per month of a
company in last one year. 12, 18, 15, 14, 13, 12, 20, 10, 11, 18, 19, 16
Solution:
1. First arrange the data in ascending order
10, 11, 12, 12, 13, 14, 15, 16, 18, 18, 19, 20.
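2. Since the number of observations in the data array is even (n = 12), the median for this data is
$$\text{Median} = \frac{\left(\frac{n}{2}\right)\text{th} + \left(\frac{n}{2} + 1\right)\text{th}}{2} = \frac{\left(\frac{12}{2}\right)\text{th} + \left(\frac{12}{2} + 1\right)\text{th}}{2} = \frac{6\text{th value} + 7\text{th value}}{2} = \frac{14 + 15}{2} = 14.5$$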
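The two rules for the median of ungrouped data (odd and even n) can be sketched in Python as follows. The function name `median_of` is hypothetical; the data are those of Examples 12.4 and 12.5.

```python
def median_of(values):
    """Median of ungrouped data using the (n+1)/2 rule (odd n) or the
    mean of the (n/2)th and (n/2 + 1)th observations (even n)."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:
        return data[(n + 1) // 2 - 1]             # positions are 1-based in the text
    return (data[n // 2 - 1] + data[n // 2]) / 2

patients = [100, 200, 120, 170, 130, 150, 180]               # Example 12.4
sales = [12, 18, 15, 14, 13, 12, 20, 10, 11, 18, 19, 16]     # Example 12.5

print(median_of(patients))  # 150
print(median_of(sales))     # 14.5
```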
Example 12.6
A survey was conducted to determine the age (in years) of 130 pharmacists. The result of
such a survey is as follows:
Table 12.3 Frequency distribution of age of Pharmacist
Solution:
First, we should find the cumulative frequencies to locate the median class ( see table 12.4).
Here the total number of observations are n = 130. Median is the size of (n/2)th = 130/2 =
65th observation in the data set. This observation lies in the class interval 35-40.
l1 = Lower limit of median class = 35
c.f.= Cumulative frequency of the class prior to the median class interval = 48
f = Frequency of median class = 48
i= Class interval of median class = 5
Formula
$$\text{Median} = l_1 + \frac{(n/2) - c.f.}{f} \times i$$
$$\text{Median} = 35 + \frac{(130/2) - 48}{48} \times 5 = 35 + \frac{17}{48} \times 5 = 35 + 1.77 = 36.77$$
Hence the median age of pharmacist is 36.77 years.
Example 12.7
A survey was conducted to determine the height (in inches) of 50 pharmacists. The result of
such a survey is as follows:
Table 12.5 Frequency distribution of height of Pharmacists
Solution:
First, we should find the cumulative frequencies to locate the median class ( see table 12.6).
c.f.= Cumulative frequency of the class prior to the median class interval = 18
f= Frequency of median class = 36
i= Class interval of median class = 4
$$\text{Median} = l_1 + \frac{(n/2) - c.f.}{f} \times i$$
$$\text{Median} = 62 + \frac{(50/2) - 18}{36} \times 4 = 62.78$$
Hence the median height of pharmacists is 62.78 inches.
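The grouped-data median formula used in Examples 12.6 and 12.7 can be sketched in Python. The function name `grouped_median` is hypothetical, and since the frequency tables are not reproduced here, the sketch takes the already-identified median class quantities (l1, c.f., f, i) as inputs.

```python
def grouped_median(l1, n, cf, f, i):
    """Median = l1 + ((n/2) - c.f.) / f * i for grouped data."""
    return l1 + ((n / 2) - cf) / f * i

median_age = grouped_median(l1=35, n=130, cf=48, f=48, i=5)     # Example 12.6
median_height = grouped_median(l1=62, n=50, cf=18, f=36, i=4)   # Example 12.7

print(round(median_age, 2))     # 36.77
print(round(median_height, 2))  # 62.78
```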
In cumulative frequency column 500th observation comes after 331 and before 733.
Hence median is 2 defective components.
3. The Mean
The mean, or the arithmetic mean, is more commonly known as the average. One advantage
of the mean over the median is that it uses all of the information in the data set. However, it is affected
by skewness in the distribution, and by the presence of outliers in the data. This may, sometimes,
produce a mean that is not very representative of the general mass of the data. Moreover, it cannot be
used with ordinal data (as ordinal data are not real numbers, so they cannot be added or divided).
Example 12.9
The following data gives weight of 20 paracetamol tablets in mg. Calculate average weight
of a paracetamol tablet.
625, 617, 633, 630, 620, 631, 618, 620, 619, 632,
625, 628, 626, 624, 622, 625, 627, 631, 619, 624.
Solution:
$$\text{Mean } (\bar{X}) = \frac{\text{Sum of all observations } (\sum X)}{\text{Total number of observations } (N)} = \frac{12496}{20} = 624.8$$
Example 12.10
The blood serum cholesterol levels of 10 subjects are given as:
245 262 292 247 253 286 274 265 279 252
Calculate mean.
Solution:
$$\text{Mean } (\bar{X}) = \frac{\text{Sum of all observations } (\sum X)}{\text{Total number of observations } (N)} = \frac{2655}{10} = 265.5$$
$$\text{Mean } (\bar{X}) = A + \frac{\sum fd}{N} \times i \tag{6}$$
Where
A=Assumed mean
f = frequency of ith class interval
N = Summation of all frequencies
d = (mi -A)/ i, deviation from assumed mean
m = mid value of ith class interval
i = width of the class interval
Example 12.11
A company is planning to improve plant safety. For this, accident data for the last 50 weeks
was compiled. These data are grouped into the frequency distribution as shown below. Calculate
mean of the number of accidents per week.
Table 12.9 Number of accidents in last 50 weeks
No. of accidents 0-4 5-9 10-14 15-19 20-24
Number of weeks 5 22 13 8 2
Solution
1. Construct frequency distribution table as per given in table 12.10.
Table 12.10 Calculations for mean accidents per week
2. Take the mid point of any class interval as assumed mean (A). Here we have taken the mid point of
class interval of 10-14, which is 12, as assumed mean.
3. Set third column as d. That gives deviation from assumed mean, which is calculated by using
formula:
(mid point of ith class interval - assumed mean)/ width of class interval.
Here class interval is 5.
$$\text{Mean } (\bar{X}) = A + \frac{\sum fd}{N} \times i$$
$$\text{Mean } (\bar{X}) = 12 + \frac{-20}{50} \times 5 = 10$$
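The assumed-mean calculation for Example 12.11 can be sketched in Python. The function name `assumed_mean_method` is hypothetical; the mid point of each class is taken as (lower limit + upper limit) / 2, and the class width of 5 and assumed mean of 12 follow the solution steps above.

```python
def assumed_mean_method(classes, freqs, assumed_mean, width):
    """Mean = A + (sum of f*d / N) * i, where d = (mid point - A) / i."""
    mids = [(lo + hi) / 2 for lo, hi in classes]
    d = [(m - assumed_mean) / width for m in mids]
    n = sum(freqs)
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    return assumed_mean + sum_fd / n * width

# Table 12.9: number of accidents (class limits) and number of weeks
classes = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 24)]
weeks = [5, 22, 13, 8, 2]

mean_accidents = assumed_mean_method(classes, weeks, assumed_mean=12, width=5)

# Cross-check with the direct formula: sum of f * mid point / N
direct_mean = sum(f * (lo + hi) / 2 for f, (lo, hi) in zip(weeks, classes)) / sum(weeks)
print(mean_accidents, direct_mean)
```

Any class mid point may be chosen as the assumed mean; the result agrees with the direct calculation because the assumed mean only shifts the arithmetic.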
Example 12.12
The following are the weights in kg of 60 final year pharmacy students. Calculate mean
weight.
Table 12.11 Weights of 60 final year students
Weight (kg) 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84
Frequency 3 10 12 15 6 7 5 2
Solution
1. Construct frequency distribution table as per given in table 12.12.
Table 12.12 Calculations for mean weight of final year students
2. Take the mid point of any class interval as assumed mean (A). Here we have taken the mid point of
class interval of 60-64, which is 62, as assumed mean.
3. Set third column as d. That gives deviation from assumed mean, which is calculated by using
formula:
(mid point of ith class interval - assumed mean)/ width of class interval.
Here class interval is 5.
4. Finally set fourth column as fd and calculate $\sum fd$.
5. Now put the values in formula for calculation of mean for grouped data
Data:
A = Assumed mean = 62
N = Summation of all frequencies = 60
$\sum fd$ = 2
i = width of the class interval = 5
Formula:
$$\text{Mean } (\bar{X}) = A + \frac{\sum fd}{N} \times i$$
$$\text{Mean } (\bar{X}) = 62 + \frac{2}{60} \times 5 = 62.17$$
Hence mean weight of final year pharmacy students is 62.17 kg.
$$\text{Mean } (\bar{X}) = \frac{\sum f_i X_i}{N} \tag{7}$$
Where
$N = \sum f_i$
$f_i$ = frequency with which variable $X_i$ occurs
Example 12.13
Following data gives number of times that inhaler used in past 24h by 53 children with
asthma. Find mean number of times inhaler used by children with asthma.
Table 12.13 Number of times inhaler used by children
Solution:
1. Construct frequency distribution table as per given in table 12.14.
2. Set third column as fiXi.
3. Determine summation of fi and fiXi.
4. Now put the values in formula for calculation of mean for grouped discrete data
$$\text{Mean } (\bar{X}) = \frac{\sum f_i X_i}{N} = \frac{118}{53} = 2.23$$
Hence mean number of times inhaler used by children with asthma in past 24 h was 2.23.
Example 12.14
Calculate mean number of living children per woman from the following table.
Table 12.15 Number of living children per woman
Solution:
1. Construct frequency distribution table as per given in table 12.16.
2. Set third column as fiXi.
3. Determine summation of fi and fiXi.
4. Now put the values in formula for calculation of mean for grouped discrete data.
$$\text{Mean } (\bar{X}) = \frac{\sum f_i X_i}{N} = \frac{517}{241} = 2.145$$
Type of variable    Mode   Median           Mean
Nominal             yes    no               no
Ordinal             yes    yes              no
Metric continuous   yes    yes, if skewed   yes
Metric discrete     yes    yes, if skewed   yes
4. Percentiles
The measures of central values discussed so far are averages. They locate the centre or mid
point of a distribution. It may also be of interest to locate other points in the range like percentiles.
They are values of a variable which divide the total observations by an imaginary line into two parts,
expressed in percentages such as 10% and 90% or 25% and 75%, etc. In all, there are 99 percentiles.
Percentiles are values in a series of observations arranged in ascending order of magnitude which
divide the distribution into 100 equal parts. Thus, the median is the 50th percentile. The 50th percentile
will have 50% observations on either side. Accordingly, the 10th percentile will have 10% observations
to the left and 90% to the right.
Quartiles are three different points located on the entire range of a variable- Q1, Q2 and Q3.
Q1 or lower quartile will have 25% observations falling on its left and 75% on its right: Q2 or median
will have 50% observations on either side and Q3 or upper quartile will have 75 % observations on its
left and 25% on its right.
Quintiles are four in number and divide the distribution into 5 equal parts. So 20th
percentile or first quintile will have 20% observations falling to its left and 80% to its right.
Deciles are nine in number and divide the distribution into 10 equal parts, first decile or 10th
percentile will divide the distribution into 10% and 90% while 9th decile will divide into 90% and
10% .
Example 12.15
Following table records the percentage mortality in 26 ICUs. Calculate the 25th and 75th
percentiles for the ICU percent mortality values.
Table 12.18 Percent mortality in 26 ICUs
ICU 1 2 3 4 5 6 7 8 9 10 11 12 13
% mortality 15.2 31.3 14.9 16.3 19.3 18.2 20.2 12.8 14.7 29.4 21.1 20.4 13.6
ICU 14 15 16 17 18 19 20 21 22 23 24 25 26
% mortality 22.4 14 14.3 22.8 26.7 18.9 13.7 17.7 27.2 19.3 16.1 13.5 11.2
Solution:
1. First, arrange values in ascending order for 26 ICUs as shown below
Rank 1 2 3 4 5 6 7 8 9 10 11 12 13
% mortality 11.2 12.8 13.5 13.6 13.7 14 14.3 14.7 14.9 15.2 16.1 16.3 17.7
Rank 14 15 16 17 18 19 20 21 22 23 24 25 26
% mortality 18.2 18.9 19.3 19.3 20.2 20.4 21.1 22.4 22.8 26.7 27.2 29.4 31.3
2. Now, the 25th percentile can be calculated by using the formula
$$P\text{th percentile} = \frac{P}{100}(N + 1)\text{th value}$$
P = 25 (for the 25th percentile)
N = no. of ICUs = 26
$$25\text{th percentile} = \frac{25}{100}(26 + 1)\text{th value} = 0.25 \times 27\text{th value} = 6.75\text{th value}$$
The 6th value is 14% while the 7th value is 14.3%, a difference of 0.3.
The 25th percentile is 14% + 0.75 of 0.3 = 14% + 0.75 x 0.3% = 14% + 0.225% = 14.225%.
So, the 25th percentile for the ICU percent mortality is 14.225.
3. Now, the 75th percentile can also be calculated using the same formula, with P = 75 and N = 26.
$$75\text{th percentile} = \frac{75}{100}(26 + 1)\text{th value} = 0.75 \times 27\text{th value} = 20.25\text{th value}$$
The 20th value is 21.1% while the 21st value is 22.4%, a difference of 1.3.
The 75th percentile is 21.1% + 0.25 of 1.3 = 21.1% + 0.25 x 1.3% = 21.1% + 0.325% = 21.425%.
So, the 75th percentile for the ICU percent mortality is 21.425.
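The percentile calculation can be sketched in Python using the same P/100 x (N + 1) position rule with interpolation between neighbouring values. The function name `percentile` is hypothetical, and returning the smallest or largest value when the position falls outside the data is an assumption of this sketch, not something stated in the text.

```python
def percentile(values, p):
    """Pth percentile as the (P/100)*(N+1)th value, interpolating between ranks."""
    data = sorted(values)
    n = len(data)
    position = p / 100 * (n + 1)   # 1-based position, e.g. 6.75
    k = int(position)              # whole part: rank of the lower neighbour
    fraction = position - k        # fractional part used for interpolation
    if k < 1:                      # assumption: clamp to the smallest value
        return data[0]
    if k >= n:                     # assumption: clamp to the largest value
        return data[-1]
    return data[k - 1] + fraction * (data[k] - data[k - 1])

# Table 12.18: percent mortality in 26 ICUs
mortality = [15.2, 31.3, 14.9, 16.3, 19.3, 18.2, 20.2, 12.8, 14.7, 29.4, 21.1, 20.4, 13.6,
             22.4, 14, 14.3, 22.8, 26.7, 18.9, 13.7, 17.7, 27.2, 19.3, 16.1, 13.5, 11.2]

q1 = percentile(mortality, 25)
q2 = percentile(mortality, 50)
q3 = percentile(mortality, 75)
print(round(q1, 3), round(q2, 3), round(q3, 3))  # 14.225 17.95 21.425
```

Spreadsheet and statistics packages use several different percentile conventions, so their results may differ slightly from the (N + 1) rule used here.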
1. The Range
The range is the distance from the smallest value to the largest. The range is not affected by
skewness, but is sensitive to the addition or removal of an outlier value.
Range = Lowest value to Highest value. ...9
Example 12.16
Weight of six paracetamol tablets in mg are given below. Find weight range for
paracetamol tablet.
Data: 625, 612, 615, 632, 628, 618
Solution:
1.Arrange data in ascending order
612, 615, 618, 625, 628, 632
2. Lowest value = 612 mg
3. Highest value = 632 mg
4. Range = 612 to 632 mg
Example 12.17
Calculate the interquartile range (IQR) for the ICU percentage mortality values in table 12.19.
Table 12.19 Percent mortality in 26 ICUs
ICU 1 2 3 4 5 6 7 8 9 10 11 12 13
% mortality 15.2 31.3 14.9 16.3 19.3 18.2 20.2 12.8 14.7 29.4 21.1 20.4 13.6
ICU 14 15 16 17 18 19 20 21 22 23 24 25 26
% mortality 22.4 14 14.3 22.8 26.7 18.9 13.7 17.7 27.2 19.3 16.1 13.5 11.2
Solution
1.Arrange the data in ascending order
2. Calculate 25th percentile as a Q1
3. Calculate 75th percentile as a Q3
4. We have already calculated the 25th and 75th percentiles before in example 12.15
Q1 = 14.225%,
Q3 = 21.425%
Therefore interquartile range = (14.225 to 21.425) %.
Example 12.18
Estimate the interquartile range (IQR) for the data of weight in kg of 60 final year pharmacy students given in Table
10.6.
Table 12.20 Weights of 60 final year students
Weight (kg) 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84
Frequency 3 10 12 15 6 7 5 2
Solution
We have already drawn ogive of this example.
Ogive is drawn by plotting % cumulative frequency on Y axis against midpoint of class
interval on X axis.
Now, draw perpendicular to ogive at 25th percentile (Q1), 50th percentile (Q2) and 75th
percentile (Q3) as shown in figure below.
Figure 12.1 % cumulative frequency curve of weight to estimate median and IQR (Y axis: % relative cumulative frequency; X axis: mid point of class interval; Q1 = 53, Q2 = 58.5, Q3 = 66)
The weight values corresponding to these perpendiculars are taken as Q1, Q2 and Q3 and here
they are Q1=53, Q2 = 58.5 and Q3 =66
Interquartile range (IQR)= Q1 to Q3
Interquartile range (IQR)= 53 to 66
The interquartile range for the data of weights of final year students is 53 to 66 kg.
The Boxplot
Now that we have discussed the median and interquartile range, we can now better
understand the boxplot which we had seen in chapter 11. Boxplots provide a graphical summary of
the three quartile values, the minimum and maximum values, and any outliers. They are usually
plotted with value on the vertical axis. Like the pie chart, the boxplot can only represent one variable
at a time, but a number of boxplots can be set alongside each other.
Figure 12.1 Boxplot (vertical axis: percentage of label claim; n = 50)
Example 12.21
Sketch the box plot for the percentage mortality in ICUs shown in table 12.18. We have
already calculated Q1 and Q3 for the same data.
Solution:
Calculation of Q2, Median
The 50th percentile can be calculated using the formula
$$P\text{th percentile} = \frac{P}{100}(N + 1)\text{th value}$$
P = 50 (for the 50th percentile); N = 26
$$50\text{th percentile} = \frac{50}{100}(26 + 1)\text{th value} = 0.5 \times 27\text{th value} = 13.5\text{th value}$$
The 13th value is 17.7% while the 14th value is 18.2%, a difference of 0.5.
The 50th percentile is 17.7% + 0.5 of 0.5 = 17.7% + 0.5 x 0.5% = 17.7% + 0.25% = 17.95%.
So, the 25th percentile for the ICU percent mortality is 14.225.
So, the 50th percentile for the ICU percent mortality is 17.95.
So, the 75th percentile for the ICU percent mortality is 21.425.
Now, the box plot can be sketched as under
[Figure: box plot marking the minimum, 25th percentile, 50th percentile, 75th percentile and maximum]
deviation, $s$, and variance, $s^2$.
Solution
1. Let us calculate the mean
$$\text{Mean, } \bar{X} = \frac{8 + 10 + 19 + 12 + 6 + 5}{6} = 10$$
2. Now, construct the table as shown below:
x     $(x - \bar{x})$   $(x - \bar{x})^2$
08    -2                4
10    0                 0
19    9                 81
12    2                 4
06    -4                16
05    -5                25
$\sum (x - \bar{x})^2 = 130$
3. Let us use the $\sum (x - \bar{x})^2$ value in the formula
Formula
$$\text{Standard deviation } (s) = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}} \tag{10}$$
$$\text{Standard deviation } (s) = \sqrt{\frac{130}{6 - 1}} = \sqrt{26} = 5.1$$
4. Thus, standard deviation, $s$, was found to be 5.1 while variance, $s^2$, was found to be 26.
However, the calculations using this formula can be tedious when the numbers and mean are
in decimals. So, the computing formula used is as follows
Formula
$$\text{Standard deviation } (s) = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}} \tag{11}$$
Let us calculate standard deviation using this formula
x     $x^2$
08    64
10    100
19    361
12    144
06    36
05    25
$\sum x = 60$   $\sum x^2 = 730$
$$\text{Standard deviation } (s) = \sqrt{\frac{730 - \frac{(60)^2}{6}}{6 - 1}} = \sqrt{\frac{130}{5}} = 5.1$$
The standard deviation, $s$, by this method is also 5.1 and variance, $s^2$, is 26.
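Both forms of the standard deviation formula can be compared in a short Python sketch. The function names `sd_definition` and `sd_computing` are hypothetical; they implement the deviation-from-mean form (equation 10) and the computing form (equation 11) for the six observations used above.

```python
import math

def sd_definition(xs):
    """Equation 10: sqrt( sum (x - mean)^2 / (N - 1) )."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def sd_computing(xs):
    """Equation 11: sqrt( (sum x^2 - (sum x)^2 / N) / (N - 1) )."""
    n = len(xs)
    return math.sqrt((sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1))

observations = [8, 10, 19, 12, 6, 5]

s1 = sd_definition(observations)
s2 = sd_computing(observations)
print(round(s1, 1), round(s2, 1))   # 5.1 5.1
print(round(s1 ** 2, 2))            # variance: 26.0
```

The computing form avoids subtracting the mean from every observation, which is convenient when the mean is not a whole number.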
$$\text{Standard deviation } (s) = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}} \tag{12}$$
Where,
N = number of observations
X = deviation from assumed mean = x-A.
A=Assumed mean
x = given observations
Example 12.23
Find standard deviation of incubation period of smallpox in 9 patients where it was found to
be 14, 13, 11, 15, 10, 7, 9, 12, and 10.
Solution
1. Construct calculation table as per given in table no 12.21.
2. Set any assumed mean from set of observations (x).
3. Set second column as deviation from assumed mean i.e. X=x-A.
4. Set third column as square of deviation i.e. $X^2$.
5. Determine summation of deviation from assumed mean ($\sum X$) and square of deviation ($\sum X^2$).
6. Finally put the values in equation 12 so as to get standard deviation for ungrouped data.
N = 9
Summation of deviation from assumed mean, $\sum X$ = 2
Summation of square of deviation, $\sum X^2$ = 52
Formula
$$\text{Standard deviation } (s) = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}}$$
$$\text{Standard deviation } (s) = \sqrt{\frac{52 - \frac{2^2}{9}}{9 - 1}} = \sqrt{\frac{51.56}{8}} = \sqrt{6.44} = 2.54$$
Example 12.24
The following data gives weight of 10 paracetamol tablets in mg. Find standard deviation.
625 617 633 630 620 631 618 620 619 632
Solution
1. Construct calculation table as per given in table no 12.22.
2. Set any assumed mean from set of observations (x).
3. Set second column as deviation from assumed mean i.e. X=x-A.
4. Set third column as square of deviation i.e. $X^2$.
5. Determine summation of deviation from assumed mean ($\sum X$) and square of deviation ($\sum X^2$).
6. Finally put the values in equation 12 so as to get standard deviation for ungrouped data.
Data
N = 10
Summation of deviation from assumed mean, $\sum X$ = -5
Summation of square of deviation, $\sum X^2$ = 373
Formula
$$\text{Standard deviation } (s) = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}}$$
$$\text{Standard deviation } (s) = \sqrt{\frac{373 - \frac{(-5)^2}{10}}{10 - 1}} = 6.416$$
$$\text{Standard deviation } (s) = \sqrt{\frac{\sum fd^2 - \frac{(\sum fd)^2}{N}}{N - 1}} \times i \tag{13}$$
Where,
f = frequency
N = summation of frequencies = $\sum f$
d = (m.p. - A)/ i
m.p. = mid point
A = Assumed mean
i = class interval
Example 12.25
Calculate SD of following data
Table 12.23 IQ of 50 students
Answer:
1. Construct calculation table as per given in table no 12.24.
2. Construct third column as mid value of class interval.
3. Set any assumed mean (A) from set of mid points (m.p.).
4. Set fourth column as deviation from assumed mean i.e. d = (m.p. - A)/ i.
5. Determine $fd$ and $fd^2$.
6. Determine summation of $f$, $fd$ and $fd^2$.
7. Finally put the values in equation 13 so as to get standard deviation for grouped data.
Classes    Frequency (f)   Mid value (m.p.)   d = (m.p. - A)/i   $d^2$   $fd$   $fd^2$
0-20       3               10                 -4                 16      -12    48
20-40      4               30                 -3                 9       -12    36
40-60      3               50                 -2                 4       -6     12
60-80      4               70                 -1                 1       -4     4
80-100     13              90 (A)             0                  0       0      0
100-120    12              110                1                  1       12     12
120-140    8               130                2                  4       16     32
140-160    3               150                3                  9       9      27
$N = \sum f = 50$   $\sum fd = 3$   $\sum fd^2 = 171$
Data:
Class interval, i = 20;
Assumed mean, A = 90;
$N = \sum f = 50$;
$\sum fd = 3$;
$\sum fd^2 = 171$
Formula:
$$\text{Standard deviation } (s) = \sqrt{\frac{\sum fd^2 - \frac{(\sum fd)^2}{N}}{N - 1}} \times i$$
Example 12.26
Calculate SD of following data of intelligence quotient (IQ) of 100 students.
Table 12.25 Given data
Classes 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90
Frequency 5 10 15 20 25 12 8 4 1
Answer:
1. Construct calculation table as per given in table no 12.26.
2. Construct third column as mid value of class interval.
3. Set any assumed mean (A) from set of mid points (m.p.).
4. Set fourth column as deviation from assumed mean i.e. d = (m.p. - A)/ i.
5. Determine $fd$ and $fd^2$.
6. Determine summation of $f$, $fd$ and $fd^2$.
7. Finally put the values in equation 13 so as to get standard deviation for grouped data.
Table 12.26 Calculation of SD in grouped data
Classes   Frequency (f)   Mid value of class interval (m.p.)   d = (m.p. - A)/i   $d^2$   $fd$   $fd^2$
0-10      5               5                                    -4                 16      -20    80
10-20     10              15                                   -3                 9       -30    90
20-30     15              25                                   -2                 4       -30    60
30-40     20              35                                   -1                 1       -20    20
40-50     25              45 (A)                               0                  0       0      0
50-60     12              55                                   1                  1       12     12
60-70     8               65                                   2                  4       16     32
70-80     4               75                                   3                  9       12     36
80-90     1               85                                   4                  16      4      16
$N = \sum f = 100$   $\sum fd = -56$   $\sum fd^2 = 346$
Data:
Class interval, i = 10;
Assumed mean, A = 45;
N = Σf = 100;
Σfd = -56;
Σfd² = 346
Formula:
$$ \text{Standard deviation } (s) = \sqrt{\frac{\sum fd^2 - \frac{(\sum fd)^2}{N}}{N-1}} \times i $$
Solution:
Now, putting the values in above formula, we get
$$ \text{Standard deviation } (s) = \sqrt{\frac{346 - \frac{(-56)^2}{100}}{100 - 1}} \times 10 = 17.8 $$
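The seven steps of the assumed-mean (step-deviation) method can also be checked by computer. The following is a minimal Python sketch, not part of the original text; the function name `grouped_sd` and its arguments are our own, and it assumes all classes have the same width.

```python
from math import sqrt

def grouped_sd(class_limits, frequencies, assumed_mean):
    """Standard deviation of grouped data by the assumed-mean method."""
    # Class interval i (assumption: every class has the same width)
    width = class_limits[0][1] - class_limits[0][0]
    mid_points = [(low + high) / 2 for low, high in class_limits]
    # d = (m.p. - A) / i for each class
    d = [(mp - assumed_mean) / width for mp in mid_points]
    n = sum(frequencies)
    sum_fd = sum(f * di for f, di in zip(frequencies, d))
    sum_fd2 = sum(f * di ** 2 for f, di in zip(frequencies, d))
    return sqrt((sum_fd2 - sum_fd ** 2 / n) / (n - 1)) * width

# Data of Example 12.26: classes 0-10, 10-20, ..., 80-90
classes = [(low, low + 10) for low in range(0, 90, 10)]
freq = [5, 10, 15, 20, 25, 12, 8, 4, 1]
sd_12_26 = grouped_sd(classes, freq, assumed_mean=45)
print(round(sd_12_26, 1))
```

Because d is measured in units of the class interval, the choice of assumed mean does not change the result; any mid point may be passed as `assumed_mean`.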
$$ \text{Standard deviation } (s) = \sqrt{\frac{\sum f_i X_i^2 - \frac{(\sum f_i X_i)^2}{N}}{N-1}} $$
Where,
fi = frequency with which variable Xi occurs
N = Σfi
Example 12.27
Calculate standard deviation of Example 12.13
Solution
1. Construct the frequency table as shown below:
Number of times inhaler used in past 24 h (Xi)   Number of children (fi)   fiXi   Xi²   fiXi²
0                                                6                         0      0     0
1                                                16                        16     1     16
2                                                12                        24     4     48
3                                                8                         24     9     72
4                                                5                         20     16    80
5                                                3                         15     25    75
6                                                2                         12     36    72
7                                                1                         7      49    49
N = Σfi = 53      ΣfiXi = 118      ΣfiXi² = 412
Data
N = Σfi = 53
ΣfiXi = 118
ΣfiXi² = 412
Formula
$$ \text{Standard deviation } (s) = \sqrt{\frac{\sum f_i X_i^2 - \frac{(\sum f_i X_i)^2}{N}}{N-1}} = \sqrt{\frac{412 - \frac{(118)^2}{53}}{53 - 1}} = \sqrt{\frac{412 - 262.72}{53 - 1}} = 1.69 $$
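The same frequency-table calculation can be sketched in Python as a cross-check. The function name `frequency_sd` is ours; the data are the inhaler counts of this example, and the final lines compare the result with the standard library's `statistics.stdev` applied to the expanded list of 53 observations.

```python
import statistics
from math import sqrt

def frequency_sd(values, frequencies):
    """Sample SD of a discrete variable given as (value, frequency) pairs."""
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(values, frequencies))
    sum_fx2 = sum(f * x ** 2 for x, f in zip(values, frequencies))
    return sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))

uses = [0, 1, 2, 3, 4, 5, 6, 7]          # Xi: inhaler uses in past 24 h
children = [6, 16, 12, 8, 5, 3, 2, 1]    # fi: number of children
sd_inhaler = frequency_sd(uses, children)

# Cross-check: expand the table into individual observations
expanded = [x for x, f in zip(uses, children) for _ in range(f)]
print(f"{sd_inhaler:.2f}", f"{statistics.stdev(expanded):.2f}")
```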
Example 12.28
In two series of adults aged 21 years and children 3 months old, following values were
obtained for the height. Find which series shows greater variation?
Data Persons Mean height SD
Adults 160 cm 10 cm
Children 60 cm 5 cm
Formula:
CV = SD/mean
RSD = CV x 100
Solution:
Put the values in equation
Adults
CV of adults = 10/160 = 0.0625
RSD = 0.0625 x 100 = 6.25 %
Children
CV of children = 5/60 = 0.0833
RSD = 0.0833 x 100 = 8.33 %
Thus, the heights of children show greater variation than those of adults.
Example 12.29
Chest circumference in cm of 10 malnourished children and normal children aged one year,
are given below. Find which series shows greater variation?
Data Children Mean chest circumference (cm) SD (cm)
Malnourished 48.5 4.53
Normal 34.8 4.57
Formula:
CV = SD/mean
RSD = CV x 100
Solution:
Put the values in equation
Malnourished
CV = 4.53/48.5 = 0.093
RSD = 0.093 x 100 = 9.3 %
Normal
CV = 4.57/34.8 = 0.131
RSD = 0.131 x 100 = 13.1 %
Thus, normal children show greater variation than malnourished children.
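The comparison of two series by CV and RSD can be sketched in Python. The helper names below are ours, and the figures are the means and SDs of Example 12.28.

```python
def coefficient_of_variation(sd, mean):
    """CV = SD / mean (dimensionless)."""
    return sd / mean

def relative_sd(sd, mean):
    """RSD = CV x 100, expressed in per cent."""
    return coefficient_of_variation(sd, mean) * 100

# name: (mean height in cm, SD in cm), taken from Example 12.28
series = {"Adults": (160, 10), "Children": (60, 5)}
rsd = {name: relative_sd(sd, mean) for name, (mean, sd) in series.items()}
most_variable = max(rsd, key=rsd.get)

for name, value in rsd.items():
    print(f"{name}: RSD = {value:.2f} %")
print("Greater variation:", most_variable)
```

Because the CV is a ratio, it allows series with different means, or even different units, to be compared directly.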
Summary
The mean, median and mode, are common measures of location which represent either a
typical or representative score and/or a value about which the data tend to center.
Mode
The mode (denoted by M) represents the most frequently occurring score. When more than
two scores occur with the greatest frequency, the data set is said to be multimodal.
Mode for grouped data
$$ \text{Mode} = l_1 + \left( \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \right) \times i $$
Median
The median of a set of scores represents the middle value (50th percentile) when the scores are
arranged as an array in order of increasing (or decreasing) magnitude.
Median for ungrouped data
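Odd number of observations
$$ \text{Median} = \text{size or value of the } \left( \frac{n+1}{2} \right)\text{th observation} $$
Even number of observations
$$ \text{Median} = \frac{\text{value of the } \left( \frac{n}{2} \right)\text{th observation} + \text{value of the } \left( \frac{n}{2} + 1 \right)\text{th observation}}{2} $$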
Mean
Mean represents the most appropriate measure of central tendency for continuous-type data.
It is obtained by adding all of the scores and dividing this sum by the number of scores.
Mean for ungrouped data
$$ \text{Mean } (\bar{X}) = \frac{\text{Sum of all observations } (\sum X)}{\text{Total number of observations } (N)} $$
Percentile
Percentiles are values in a series of observations arranged in ascending order of magnitude
which divide the distribution into 100 equal parts.
$$ P\text{th percentile} = \text{value of the } \frac{P}{100}(N + 1)\text{th observation} $$
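The position rule for percentiles can be sketched in Python as follows. The function name and the data are hypothetical, and linear interpolation between neighbouring observations (for positions that are not whole numbers) is our assumption, since the summary gives only the position formula. The last line compares the quartiles with `statistics.quantiles`, whose "exclusive" method is based on the same (N + 1) rule.

```python
import statistics

def percentile(data, p):
    """p-th percentile using the (P/100) x (N + 1) position rule."""
    ordered = sorted(data)
    n = len(ordered)
    position = p / 100 * (n + 1)      # 1-based position in the ordered array
    whole = int(position)             # rank at or just below the position
    fraction = position - whole
    if whole < 1:                     # position falls before the first value
        return ordered[0]
    if whole >= n:                    # position falls at or beyond the last value
        return ordered[-1]
    # Assumption: interpolate linearly between neighbouring ranks
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])

scores = [12, 15, 11, 18, 14, 20, 16]   # hypothetical data, N = 7
q1, median, q3 = (percentile(scores, p) for p in (25, 50, 75))
print(q1, median, q3)
print(statistics.quantiles(scores, n=4, method="exclusive"))
```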
The range, interquartile range and standard deviation are common measures of spread and
represent the extent to which the data are spread out.
Range
The range is the distance from the smallest value to the largest.
Range = Lowest value to Highest value.
The interquartile range
The interquartile range runs from Q1 to Q3.
Q1 is the value which cuts off the bottom 25 percent of values and is known as the first
quartile (25th percentile), while Q3 cuts off the top 25 percent of values and is known as the third quartile (75th
percentile).
Standard Deviation (SD)
The standard deviation measures the variation of scores about the mean (average) score, and
can be defined as the “root mean squared deviation.” Variance is square of standard deviation.
Standard deviation of ungrouped data
$$ \text{Standard deviation } (s) = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N-1}} $$
Standard deviation of grouped data
$$ \text{Standard deviation } (s) = \sqrt{\frac{\sum fd^2 - \frac{(\sum fd)^2}{N}}{N-1}} \times i $$
$$ \text{Standard deviation } (s) = \sqrt{\frac{\sum f_i X_i^2 - \frac{(\sum f_i X_i)^2}{N}}{N-1}} $$
Type of data   Mode   Median            Mean   Range   Interquartile range   Standard deviation
Nominal        yes    no                no     no      no                    no
Ordinal        yes    yes               no     yes     yes                   no
Metric         yes    yes, if skewed    yes    yes     yes, if skewed        yes
Exercise
1. The amounts of atenolol in 10 tablets were determined for quality control. Calculate the mode of the
observed amounts.
Data: 20, 22, 18, 24, 24, 24, 18, 19, 18, 18.
3. A novel antidiabetic drug is developed by a researcher. The reduction in blood glucose in 10 patients 2
hours after administration of the drug is recorded. Calculate the mean reduction in blood glucose levels in the
10 patients.
Data: 20, 15, 22, 18, 25, 17, 23, 27, 19, 21.
4. The weights of 20 tablets, removed from a batch for quality control purpose are given below.
Calculate the mean and median values of the tablet weights.
Data: 250, 252, 255, 245, 251, 260, 258, 256, 248, 275,
268, 240, 257, 262, 270, 280, 266, 279, 258, 265.
5. Serum bilirubin levels are measured for 10 cirrhosis patients as given below. Determine the range.
Data: 1, 8, 5, 7, 10, 2, 15, 12, 3, 9.
6. The drug content of an injection was determined using ultraviolet spectroscopy. Results are reported in
mg/ml. Calculate the mean and standard deviation of the drug content of the injection.
Data: 50.3, 48.1, 48.9, 51.5, 52.5, 50.6, 49.5, 53, 47.5, 50.
7. Time taken for dissolution of 50% of the original mass of drug from 10 tablets are summarized
below. Determine the mean and standard deviation of the time required for the release of 50% of the
original mass of drug.
t50% (h): 24.7, 20.1, 25.3, 22, 22.5, 28.6, 21.6, 20.4, 23.5, 25.1.
8. Calculate mean and standard deviation of 15 readings in the preliminary study of urinary lead
concentrations, micro mol/24h.
Data: 0.1, 0.4, 0.6, 0.8, 1.1, 1.2, 1.3, 1.5, 1.7, 1.9, 1.9, 2.0, 2.2, 2.6, 3.2
9. Results of new therapeutic agent with 50 mg/tablet administered to six healthy volunteers are listed
below. Report median and mean.
Volunteer number 1 2 3 4 5 6
Cmax mg/l 60 71 111 46 81 96
10. Calculate median and mean of following assay data of 30 tetracycline capsules.
Assay result 245 246 247 248 249 250 251 252 253 254
Frequency 1 1 2 3 4 7 5 3 2 2
11. Calculate mode, median, mean and standard deviation for the following data.
Age in years 0-10 10-20 20-30 30-40 40-50 50-60 60-70
No of persons 4 6 10 20 10 6 4
12. In addition to studying the lead concentration in the urine of 140 children, the paediatrician asked
how often each of them had been examined by a doctor during the year 2010. Calculate mean, median
and standard deviation.
No of Visits to doctor 0 1 2 3 4 5 6 7
No of Children 2 8 27 45 38 15 4 1
13. Age at Death (in days) of 78 cases of sudden infant death syndrome (SIDS) are given in following
table.
Age in days 1-30 31-60 61-90 91-120 121-150 151-180 181-210 211-240 241-270
No. of deaths 6 13 23 18 7 5 3 2 1
14. Ages of Patients Diagnosed with Multiple Sclerosis are given in following table.
Age in years 20-29 30-39 40-49 50-59 60-69 70-79 80-89
No. of patients 4 44 124 124 48 25 4
16. From the 140 children whose urinary concentration of lead were investigated, 40 were chosen
who were aged at least 1 year but under 5 years. The following concentrations of copper were found.
Find the median, range and quartiles.
Data 0.70, 0.45, 0.72, 0.30, 1.16, 0.69, 0.83, 0.74, 1.24, 0.77,
0.65, 0.76, 0.42, 0.94, 0.36, 0.98, 0.64, 0.90, 0.63, 0.55,
0.78, 0.10, 0.52, 0.42, 0.58, 0.62, 1.12, 0.86, 0.74, 1.04,
0.65, 0.66, 0.81, 0.48, 0.85, 0.75, 0.73, 0.50, 0.34, 0.88
17. The erythromycin contents of 10 tablets from an Alpha and a Bravo tabletting machine were
determined and are given below. Determine the coefficient of variation and relative standard deviation. Give
your comment on it.
Mean SD
Alpha machine 248.7 8.72
Bravo machine 251.1 3.78
18. Fifteen patients were provided with their drugs in a child-proof container of a design that they had
not previously experienced. A note was taken of the time it took each patient to get the container open
for the first time. Determine median and quartile. The results are shown below
Times taken to open a child-proof container (s)
2.2 3 3.1 3.2 3.4 3.9 4 4.1 4.2 4.5 5.1 10.7 12.2 17.9 24.8
19. The percentage of ideal body weight was determined for 18 randomly selected insulin-dependent
diabetics. The outcomes (%) are
107 119 99 114 120 104 124 88 114
116 101 121 152 125 100 114 95 117
Determine mean, CV and RSD.
20. Calculate all the following measures of central tendency for data given below:
9.4 7 7.6 6.3 6.7 8.6 6.8 10.6 8.9 9.4
a. Mean b. Median c. 25th percentile
d. 75th percentile e. 10th percentile f. Range
g. IQR h. Standard deviation i. Variance
j. Coefficient of variation k. Relative standard deviation
Answers:
Multiple Choice Questions
1. c 2. d 3. c 4. d 5. a 6. c 7. a 8. b 9. a 10. b
11. c 12. a 13. a 14. c 15. c
Exercise
1. Mode 18.
2. Mode 10.
3. Mean 20.7.
4. Median 258, mean 259.75.
5. Range 1 to 15.
6. Mean 50.2, standard deviation 1.8.
7. Mean 23.4, standard deviation 2.6.
8. Mean 1.5, standard deviation 0.84.
9. Mean 77.5, median 76.
10. Mean 250, median 250.
11. Median 35, Mean 35 , Mode 35, standard deviation 15.40 .
12. Mean 3.25, median 3, standard deviation 1.25.
13. Median 86, Mean 56.68 , standard deviation 51.85.
Learning objectives
When we have finished this chapter, we should be able to:
1. Define probability and calculate simple probabilities.
2. Explain the proportional frequency approach to calculate probability.
3. Explain probabilities of simple and composite outcomes, probability involving two variables and
conditional probability.
4. Explain how probability can be used with the area properties of the normal distribution.
5. Explain probability distribution, binomial distribution and Poisson distribution.
Classic Probability
Statistical concepts are essentially derived from probability theory. Thus, it would be only
logical to review some of the fundamentals of probability.
Probability is a measure of the chance of getting some outcome of interest from some event.
The event might be rolling a die and the outcome of interest might be getting a six. Some basic ideas
about probability are given below:
1. The probability of a particular outcome from an event will lie between zero and one.
2. The probability of an event that is certain to happen is equal to one. For example, the probability
that everybody dies eventually is one.
3. The probability of an event that is impossible is zero. For example, throwing a seven with a normal
die.
4. If an event has as much chance of happening as of not happening (like tossing a coin and getting a
head), then it has a probability of 1/2 or 0.5.
5. If the probability of an event happening is p, then the probability of the event not happening is 1 – p.
Calculating Probability
We can calculate the probability of a particular outcome from an event by using the
following formula:
$$ \text{Probability } P(E) = \frac{\text{Number of outcomes that favour the event } (m)}{\text{Total number of outcomes } (N)} \qquad \ldots 1 $$
The probability of a particular outcome from an event is equal to the number of outcomes that favour
that event, divided by the total number of possible outcomes.
Example 13.1
What is the probability of getting an odd number when we roll a die?
Solution
Total number of possible outcomes = 6 (1 or 2 or 3 or 4 or 5 or 6)
Total number of outcomes favouring the event ‘an odd number’ = 3 (i.e. 1 or 3 or 5)
So probability of getting an odd number = 3/6 = 1/2 = 0.5
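Equation 1 amounts to counting the favourable outcomes and dividing by the total. A minimal Python sketch of this counting approach is given below; the function name `classical_probability` is ours, and it assumes all outcomes in the sample space are equally likely.

```python
from fractions import Fraction

def classical_probability(sample_space, is_favourable):
    """P(E) = m / N for equally likely outcomes."""
    outcomes = list(sample_space)
    favourable = [o for o in outcomes if is_favourable(o)]
    return Fraction(len(favourable), len(outcomes))

die_faces = range(1, 7)                                   # 1, 2, ..., 6
p_odd = classical_probability(die_faces, lambda face: face % 2 == 1)
p_seven = classical_probability(die_faces, lambda face: face == 7)
print(p_odd, float(p_odd))
print(p_seven)        # an impossible event
print(1 - p_odd)      # complement rule: 1 - p
```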
The above method for determining probability works well with experiments where all of the
outcomes have the same probability, e.g. rolling dice, tossing a coin, etc. In the real world, however,
we often need to use the proportional frequency approach, which uses existing frequency data as the
basis for probability calculations.
Example 13.2
Each box contains 10 strips of paracetamol. As a check on quality, 50 such boxes were tested
and the number of defective strips in each box was recorded (Table 13.1). The proportional frequency is calculated as the
category frequency divided by the total frequency.
Table 13.1 Frequency table showing number of defective strips in 50 boxes
No. of defectives Frequency Proportional
(n=50) frequency
0 30 30/50 = 0.6
1 10 10/50 = 0.2
2 5 5/50 = 0.1
3 3 3/50 = 0.06
4 1 1/50 = 0.02
5 1 1/50 = 0.02
Notice that the proportional frequencies sum to one, similar to probability.
Now if we ask the question, ‘What is the probability that, if we choose one of these 50 boxes at
random, we will find exactly one defective strip in it?’, the answer is the proportional frequency for the
‘one’ defective strip category, i.e. 0.2. In other words, we can interpret proportions as equivalent to
probabilities.
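The proportional frequency approach is easy to reproduce by computer. The sketch below uses the frequencies of Table 13.1; the variable names are ours, and the "at most one defective" probability is an extra illustration that is not asked in the example.

```python
# Number of defective strips per box -> number of boxes (Table 13.1)
defectives_frequency = {0: 30, 1: 10, 2: 5, 3: 3, 4: 1, 5: 1}

total = sum(defectives_frequency.values())
proportional = {k: f / total for k, f in defectives_frequency.items()}

for k, p in proportional.items():
    print(k, p)

# Proportions behave like probabilities: they sum to one and can be added
# for mutually exclusive categories.
p_at_most_one = proportional[0] + proportional[1]
print(sum(proportional.values()), p_at_most_one)
```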
Figure 13.1 Probability of drawing a card which is both a queen (Q) and a heart (H)
Example 13.7
Consider that in a survey of 407 villages in Satara district, 219 villages had medical shops
while 305 had doctors in their villages; and 192 villages had both doctor and medical shops.
Assuming that this sample is representative of villages nationally,
1. What is the probability of selecting a village at random and finding that this village has a medical
shop (MS)?
2. What is the probability of selecting a village at random and finding that this village has doctors?
3. What is the probability of selecting a village at random and finding that this village does not have
a doctor?
4. What is the probability of selecting a village at random that has both a medical shop and a doctor
(intersect)?
5. What is the probability of selecting a village at random that has either a medical shop or a doctor or
both (conjoint)?
Solution:
1. Probability of medical shop, p(MS)
m(MS) = No of outcomes favoring event (here medical shop) = 219
N= No of possible outcomes (here villages) = 407
p(MS)= m(MS)/N= 219/407=0.538
The probability of selecting a village having medical shops at random is 0.538.
2. Probability of Doctor, p(D)
m(D) = 305; N= 407
p(D) = m(D) /N= 305/407= 0.749
The probability of selecting a village having doctor at random is 0.749.
3. Probability of no Doctor, p(nD)
m(nD) = (407-305); N= 407; p(D) =0.749
p(n D) = m(n D)/N = (407 -305)/407 = 0.251
or
p(n D) = 1- p(D) = 1- 0.749 = 0.251
The probability of selecting a village having no doctor at random is 0.251.
4. Probability of medical shop and doctor
m(MS∩D) = 192; N = 407
p(medical shop and doctor) = p(MS∩D) = m(MS∩D)/N = 192/407 = 0.472
The probability of selecting a village having both medical shops and doctor at random is
0.472.
5. Probability of medical shop or doctor p(MS or D)
p(MS) = 0.538, p(D) = 0.749, p(MS∩D) = 0.472
p(medical shop or doctor) = p(MS∪D) = p(MS) + p(D) - p(MS∩D)
= 0.538 + 0.749 - 0.472 = 0.815
The probability of selecting a village having medical shop or doctor at random is 0.815.
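The five probabilities of this example can be recomputed directly from the survey counts. The following Python sketch uses our own variable names. It works with unrounded values, so a result may differ in the third decimal place from a sum of probabilities that were rounded first.

```python
villages = 407        # N, total villages surveyed
with_shop = 219       # villages with a medical shop
with_doctor = 305     # villages with a doctor
with_both = 192       # villages with both

p_ms = with_shop / villages
p_d = with_doctor / villages
p_no_doctor = 1 - p_d                 # complement rule
p_both = with_both / villages         # intersect
p_either = p_ms + p_d - p_both        # conjoint (addition theorem)

for label, p in [("p(MS)", p_ms), ("p(D)", p_d), ("p(nD)", p_no_doctor),
                 ("p(MS and D)", p_both), ("p(MS or D)", p_either)]:
    print(f"{label} = {p:.3f}")
```

Subtracting the intersect avoids counting the 192 villages that have both facilities twice.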
Conditional Probability
Many times it is necessary to calculate the probability of an outcome, given that a certain
value is already known for a second variable. For example, what is the probability of event A
occurring given the fact that only a certain level (or outcome) of a second variable (B) is considered.
p(A) given B = p(A|B) = p(A∩B)/p(B) ... 4
Example 13.8
What is the probability of drawing a queen from a stack of cards containing all the hearts
from a single deck?
Solution
p(Q∩H) = 1/52, p(H) = 13/52
Probability of (Queen | Heart), p(Q|H) = p(Q∩H)/p(H) = (1/52)/(13/52) = 1/13
In this example, if all the hearts are removed from a deck of cards, 1/13 is the probability of
selecting a queen from the extracted hearts.
Example 13.9
1. From the previous example 13.7, if a selected village has a medical shop, what is the probability
that this same village also has doctor?
2. If the selected village has doctor, what is the probability that this same village also has access to a
medical shop?
Solution:
1. If the selected village has a medical shop, the probability that this same village will have a doctor is:
p(D|MS) = p(MS∩D)/p(MS) = 0.472/0.538 = 0.877
2. If the selected village has a doctor, the probability that this same village will have a medical shop is:
p(MS|D) = p(MS∩D)/p(D) = 0.472/0.749 = 0.630
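Equation 4 can be sketched as a small Python function. The name `conditional` is ours; exact fractions are used so that no rounding enters before the final step, and the inputs are the counts of Examples 13.7 and 13.8.

```python
from fractions import Fraction

def conditional(p_a_and_b, p_b):
    """p(A|B) = p(A and B) / p(B); undefined when p(B) = 0."""
    if p_b == 0:
        raise ValueError("p(B) must be greater than zero")
    return p_a_and_b / p_b

# Example 13.8: queen given heart
p_queen_given_heart = conditional(Fraction(1, 52), Fraction(13, 52))

# Example 13.9: village survey (407 villages)
p_doctor_given_shop = conditional(Fraction(192, 407), Fraction(219, 407))
p_shop_given_doctor = conditional(Fraction(192, 407), Fraction(305, 407))

print(p_queen_given_heart)
print(round(float(p_doctor_given_shop), 3))
print(round(float(p_shop_given_doctor), 3))
```

Note that p(D|MS) and p(MS|D) differ because the conditioning group differs: the 219 villages with shops in one case, the 305 villages with doctors in the other.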
Probability Distribution
Inferential statistics employs probability theory to make inferences about the properties of
populations on the basis of data recorded from smaller samples taken from those populations. An
instrumental component of such estimations is the use of probability distributions, i.e. the
relationships between the values of a particular variable and their probability of occurrence.
1. Discrete probability distribution
Discrete probability distributions are those in which the probability of the occurrence of
discrete events is calculated and graphically portrayed. To illustrate these, consider firstly the
numerical outcome following the rolling of one die and then two dice (Table 13.2).
The plot of theoretical probabilities against defined variables is referred to as a probability
distribution. Frequency distributions represent the distribution of data derived from analysis of a
sample taken from a population, whereas probability distributions reflect the distribution of a
variable in a population.
In a probability distribution, the sum of all the individual probabilities is always 1 (the area
under the plotted distribution is 1).
Table 13.2 Probabilities associated with defined numerical values obtained following the
rolling of one or two dice
One die
Variable (numerical outcome)   Probability
1                              0.167
2                              0.167
3                              0.167
4                              0.167
5                              0.167
6                              0.167
Two dice
Variable (numerical outcome)   Probability
2                              0.028
3                              0.056
4                              0.083
5                              0.111
6                              0.139
7                              0.167
8                              0.139
9                              0.111
10                             0.083
11                             0.056
12                             0.028
Figure 13.2 Probability distributions for the numerical values shown in table 13.2 after one dice (a)
and two dice (b) (x-axis: numerical outcome; y-axis: probability)
The distribution shown in figure 13.2 is a discrete probability distribution because the
variable can adopt a countable number of values.
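The theoretical probabilities for the sum of the faces can be obtained by listing every equally likely combination of faces and counting how often each total occurs. The Python sketch below does this; the function name `sum_distribution` is ours.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sum_distribution(n_dice, faces=6):
    """Probability of each total when n_dice fair dice are rolled."""
    rolls = product(range(1, faces + 1), repeat=n_dice)   # all equally likely rolls
    counts = Counter(sum(roll) for roll in rolls)
    total = faces ** n_dice
    return {value: Fraction(count, total) for value, count in sorted(counts.items())}

one_die = sum_distribution(1)
two_dice = sum_distribution(2)

for value, p in two_dice.items():
    print(value, p, round(float(p), 3))

# The individual probabilities of a probability distribution sum to 1
print(sum(two_dice.values()))
```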
2. Binomial distribution
One of the distributions most commonly employed in the pharmaceutical and life sciences is
the binomial distribution. This distribution is used whenever the outcome of an event consists of only
two categories. An example of a binomial event has been described previously, namely tossing of a
coin. Other examples of binomial data include:
a. the outcome of a quality control assessment, which is either a pass or a fail.
b. a new formulation may produce side effects: the outcome is either positive or negative (no effects).
c. the gender distribution in a population: each individual is either male or female; similarly, a disease is either present or absent.
d. a new pharmaceutical agent is either clinically efficacious or non-efficacious.
In addition to the requirement for only two possible outcomes, each binomial trial must be
independent, i.e. the occurrence of one event must not influence subsequent events. In the generation
of the binomial distribution, it is assumed that the proportion of observations (or individuals) in one
category is p and, consequently, the proportion of observations in the other category is 1-p, i.e. q.
The probability of events using binomial data can be calculated by expansion of the binomial
term, (p+q)^n, in which n denotes the sample size, p is the probability of the occurrence of the first event and q
is the probability of the occurrence of the second event.
The probability of X observations in a sample of size n that has been removed from a
binomial distribution may be mathematically described as follows:
$$ P(X) = \binom{n}{X} p^X q^{n-X} \qquad \ldots 6 $$
Where p^X denotes the probability of a sample composed of X observations possessing a
probability p, and q^(n-X) denotes the probability of a sample composed of n-X observations possessing a
probability q.
The above equation can be rewritten by substituting
$$ \binom{n}{X} = \frac{n!}{X!\,(n-X)!} \qquad \ldots 7 $$
$$ P(X) = \frac{n!}{X!\,(n-X)!}\, p^X q^{n-X} \qquad \ldots 8 $$
Example 13.3
In pharmaceutical manufacturing, tablets from a batch may be categorized into those that
pass and those that fail quality control. As part of the QC process for a batch of tablets, the probability
associated with the pass category was 0.95 whereas the probability associated with the fail category was
0.05. Using the binomial expansion, calculate the probability of selecting three defective tablets in a
sample of three.
Solution:
The probabilities were p (defective) = 0.05,
q (non-defective) = 0.95,
The number of observations (X) = 3,
Sample size (n) = 3
$$ P(X) = \binom{n}{X} p^X q^{n-X} = \binom{3}{3} (0.05)^3 (0.95)^{3-3} = \frac{3!}{3!\,0!} \times 0.000125 = 0.000125 $$
Thus, the probability of selecting three defective tablets in a sample of three is 0.000125.
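Equation 8 translates almost directly into code. The following Python sketch (function name ours) uses `math.comb` for the binomial coefficient and evaluates every term of the expansion of (p+q)^3 for the tablet example, so that the terms can be seen to sum to one.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = n! / (x! (n - x)!) * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

n, p_defective = 3, 0.05
distribution = [binomial_pmf(x, n, p_defective) for x in range(n + 1)]

for x, prob in enumerate(distribution):
    print(f"P({x} defective in {n}) = {prob:.6f}")

p_three_defective = distribution[3]
print(sum(distribution))
```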
3. Poisson distribution
The Poisson distribution is another discrete data distribution that is commonly employed to
describe random occurrences when the probability of observing an event is small. The Poisson
distribution approximates to the binomial distribution when the sample size is large and the
probability of a specified event is small. Mathematically, the Poisson distribution is described by
$$ P(X) = \frac{e^{-m} m^X}{X!} = \frac{m^X}{e^m X!} \qquad \ldots 9 $$
Where
P(X) = the probability of observing X occurrences of the event
m = the mean number of occurrences, i.e. the product of the number of observations, N, and the probability, p (m = Np).
The above equation may be expanded to enable calculation of the probabilities of occurrence
of an event or events (Table 13.3).
Table 13.3 Expansion of the equation that defines the Poisson distribution
Variable (numerical outcome)   Probability
0                              P(0) = e^(-m)
1                              P(1) = e^(-m) m
2                              P(2) = e^(-m) m^2/2!
3                              P(3) = e^(-m) m^3/3!
4                              P(4) = e^(-m) m^4/4!
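The terms of Table 13.3 can be evaluated numerically and set beside the corresponding binomial probabilities. In the Python sketch below, the function names and the values N = 1000 and p = 0.002 are hypothetical, chosen only to illustrate a large sample with a rare event (m = Np = 2).

```python
from math import comb, exp, factorial

def poisson_pmf(x, m):
    """P(X = x) = e^(-m) * m^x / x!"""
    return exp(-m) * m ** x / factorial(x)

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with event probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n_obs, p_event = 1000, 0.002      # hypothetical: large sample, rare event
m = n_obs * p_event               # mean number of occurrences, m = Np

for x in range(5):
    print(x, round(poisson_pmf(x, m), 4), round(binomial_pmf(x, n_obs, p_event), 4))
```

With these inputs the two columns should agree closely, which is the sense in which the Poisson distribution approximates the binomial for large samples and small probabilities.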
Summary
Probability is a measure of the chance of getting some outcome of interest from some
event.
$$ \text{Probability } P(E) = \frac{\text{Number of outcomes that favour the event } (m)}{\text{Total number of outcomes } (N)} $$
$$ P(X) = \binom{n}{X} p^X q^{n-X}, \qquad \binom{n}{X} = \frac{n!}{X!\,(n-X)!} $$
$$ P(X) = \frac{n!}{X!\,(n-X)!}\, p^X q^{n-X} $$
Poisson distribution
$$ P(X) = \frac{e^{-m} m^X}{X!} = \frac{m^X}{e^m X!} $$
3. The distribution used whenever the outcome of an event consists of only two categories
is________.
a. Normal distribution b. Poisson distribution
c. Binomial distribution d. Uniform distribution
4. The distribution used when the probability of observing an event is small is________.
a. Normal distribution b. Poisson distribution
c. Binomial distribution d. Uniform distribution
5. The rule that the likelihood of two or more mutually exclusive outcomes equals the sum of their individual
probabilities is given by the
a. Conjoint b. Intersect
c. Multiplication theorem d. Addition theorem
6. The probability of an intersect p(A∩B) is easily determined by using ______.
a. Addition theorem b. Multiplication theorem
c. Bayes theorem d. Addition and subtracting theorem
7. The probability of a conjoint p(A∪B) is easily determined by using ______.
a. Addition theorem b. Multiplication theorem
c. Bayes theorem d. Addition and subtracting theorem
8. The outcomes that are mutually exclusive and exhaustive are _________ outcomes.
a. simple b. composite
c. complementary d. conjoint
9. The probability of an event that is certain to happen is equal to _______ .
a. 100 b. 1
c. 10 d. 0
10. About 99% of the observations lie within _______ standard deviations on either side of the mean.
a. 1 b. 2
c. 3 d. 4
Exercise
1. A newly designed shipping container for ampules was compared to the existing one to determine if
the number of broken units could be reduced. One hundred shipping containers of each design (old
and new) were subjected to identical rigorous abuse. The containers were evaluated and failures were
defined as containers with more than 1% of the ampules broken. A total of 15 failures were observed
and 12 of those failures were with the old container. If one container was selected at random:
a. What is the probability that the container will be of the new design?
b. What is the probability that the container will be a "failure"?
c. What is the probability that the container will be a "success"?
d. What is the probability that the container will be both an old container design and a
"failure"?
e. What is the probability that the container will be either of the old design or a "failure"?
f. If one container is selected at random from only the new containers, what is the probability
that the container will be a "failure"?
g. If one container is selected at random from only the old container design, what is the
probability that the container will be a "success"?
2. A total of 150 healthy females volunteered to take part in a multi-center study of a new urine testing
kit to determine pregnancy. One-half of the volunteers were pregnant, in their first trimester. Urinary
pHs were recorded and 62 of the volunteers were found to have a urine pH less than 7.0 (acidic) at the
time of the study. Thirty-six of these women with acidic urine were also pregnant.
If one volunteer is selected at random:
a. What is the probability that the volunteer is pregnant?
b. What is the probability that the volunteer has urine that is acidic, or less than a pH 7?
c. What is the probability that the volunteer has urine which is basic, i.e. a pH equal to or
greater than 7?
d. What is the probability that the volunteer is both pregnant and has urine which is acidic
or less than pH 7?
e. What is the probability that the volunteer is either pregnant or has urine which is acidic
or less than pH 7?
f. If one volunteer is selected at random from only those women with acidic urinary pHs,
what is the probability that the volunteer is also pregnant?
g. If one volunteer is selected at random from only the pregnant women, what is the
probability that the volunteer has a urine pH of 7.0 or greater?
3. If a medicine cures 80% of the people who take it, what is the probability that out of the eight people
who take the medicine, 5 will be cured?
4. If a microchip manufacturer claims that only 4% of his chips are defective, what is the probability
that among the 60 chips chosen, exactly three are defective?
5. If a telemarketing executive has determined that 15% of the people contacted will purchase the
product, what is the probability that among the 12 people who are contacted, 2 will buy the product?
6. A department store buys 50% of its appliances from Manufacturer A, 30% from Manufacturer B,
and 20% from Manufacturer C. It is estimated that 6% of Manufacturer A's appliances, 5% of
Manufacturer B's appliances, and 4% of Manufacturer C's appliances need repair before the warranty
expires. An appliance is chosen at random. If the appliance chosen needed repair before the warranty
expired, what is the probability that the appliance was manufactured by Manufacturer A?
Manufacturer B? Manufacturer C?
Answers:
Multiple Choice Questions
1. d 2. a 3. c 4. b 5. d 6. b 7. d 8. a 9. b 10. c
Exercise
1.
a. 0.5, b. 0.075,
c. 0.925, d. 0.06,
e. 0.515, f. 0.03,
g. 0.88
2.
a. 0.5, b. 0.413,
c. 0.587, d. 0.24,
e. 0.673, f. 0.581,
g. 0.52
3. 0.1468
4. 0.2137
5. 0.292.
6. A: 0.566, B: 0.283, C: 0.151
Chapter 14
SAMPLING TECHNIQUES
Learning objectives
When we have finished this chapter, we should be able to:
1. Explain various methods of sampling techniques
2. Explain characteristics of good sample
Samples are considered to be the true representatives of the population from which they are drawn. It is not possible to
include all the members of the population in an experimental study because of constraints of cost,
time and labour involved. Hence, appropriate sampling techniques are utilised to ensure that the
samples are randomly selected and that every observation is independently measured.
Various sampling techniques used are given below:
1. Random sampling
2. Systematic sampling
3. Stratified sampling
4. Cluster sampling
5. Multistage sampling
6. Multiphase sampling
7. Sequential sampling
1. Random Sampling
In this method, every unit of the population has an equal chance of being selected. This
method is applicable when population is small, homogeneous and readily available. This method is
also called as ‘Unrestricted Random Sampling’.
Randomization may be accomplished for a smaller number of units by using a random
numbers table or by numbers generated at random using a calculator or computer.
Lottery Method
This is the simplest and most popular method of obtaining a random sample. Under this
method, the various units of the population to be studied are numbered on small and identical slips of
paper, which are folded and put into a box or a bag. They are then thoroughly mixed and the required
number of slips for the sample is picked one after the other without replacement. While doing this, it
has to be ensured that in successive drawings each of the remaining slips has an equal
chance of being chosen.
Use of Random Number
A common method of drawing a sample is to make use of published tables of random
numbers. First give serial numbers to all the people of the study population. Then select at random
any page of the random number table and pick up the numbers in any row or column at random. The
population units corresponding to the numbers are selected.
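A computer can play the role of the lottery box or the random number table. The sketch below uses `random.sample` from Python's standard library, which draws without replacement. The function name, the population of 100 serial numbers and the seed value are hypothetical.

```python
import random

def simple_random_sample(units, sample_size, seed=None):
    """Draw sample_size units without replacement; every unit is equally likely."""
    rng = random.Random(seed)     # a fixed seed makes the draw reproducible
    return rng.sample(units, sample_size)

population = list(range(1, 101))  # hypothetical serial numbers 1 to 100
sample = simple_random_sample(population, 10, seed=42)
print(sorted(sample))
```

As with the lottery method, a complete numbered list of the study population is needed before the draw can be made.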
Advantages of random sampling
1. Scientific method
2. More representative
3. More economical
Disadvantages of random sampling
1. It needs complete list of study population which is often difficult to get.
2. If the sample size is small, the sample will not be a true representative of the universe.
3. Cases selected by random sampling tend to be widely dispersed geographically, and the cost
of collecting data becomes too large.
2. Systematic Sampling
This method is popularly used in those cases when a complete list of population from which
sample is to be drawn, is available. It is more often applied to field studies when the population is
large, scattered and homogeneous.
The most practical way of sampling is to select every ith item on a list. Sampling of this type
is known as systematic sampling. An element of randomness is introduced into this kind of sampling
by using random numbers to pick up the unit with which to start. Sample interval for systematic
sampling is calculated using following formula:
Sample interval = Total population / Sample size desired
For instance, if a 4 per cent sample is desired, the first item would be selected randomly from
the first twenty-five and thereafter every 25th item would automatically be included in the sample.
Thus, in systematic sampling only the first unit is selected randomly and the remaining units of the
sample are selected at fixed intervals. Although a systematic sample is not a random sample in the
strict sense of the term, it is often considered reasonable to treat a systematic sample as if it were a
random sample.
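The procedure of a random start followed by fixed intervals can be sketched in Python. The function name and the population of 500 numbered units are hypothetical. A 4 per cent sample of 500 units means 20 units and a sample interval of 500/20 = 25.

```python
import random

def systematic_sample(units, sample_size, seed=None):
    """Random start within the first interval, then every interval-th unit."""
    interval = len(units) // sample_size              # sample interval
    start = random.Random(seed).randrange(interval)   # random index 0 .. interval-1
    return units[start::interval][:sample_size]

population = list(range(1, 501))                      # hypothetical list of 500 units
sample = systematic_sample(population, 20, seed=1)    # 4 per cent sample
print(sample)
```

Only the starting unit is random; once it is fixed, the rest of the sample is fully determined, which is why hidden periodicity in the list can bias the result.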
Advantages
It can be taken as an improvement over a simple random sample inasmuch as the systematic
sample is spread more evenly over the entire population. It is an easier, accurate and less costly
method of sampling and can be conveniently used even in case of large populations.
Disadvantages
If there is a hidden periodicity in the population, systematic sampling will prove to be an
inefficient method of sampling. In practice, systematic sampling is used when lists of population are
available and they are of considerable length.
3. Stratified Sampling
If a population from which a sample is to be drawn does not constitute a homogeneous group,
stratified sampling technique is generally applied in order to obtain a representative sample. Under
stratified sampling the population is divided into several sub-populations that are individually more
homogeneous than the total population (the different sub-populations are called ‘strata’) and then we
select items from each stratum to constitute a sample. Since each stratum is more homogeneous than
the total population, we are able to get more precise estimates for each stratum and by estimating more
accurately each of the component parts, we get a better estimate of the whole. In brief, stratified
sampling results in more reliable and detailed information.
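As an illustration, stratified sampling can be sketched as follows. The sketch assumes proportional allocation (the same sampling fraction in every stratum), which is only one possible way of choosing how many items to take from each stratum; the stratum names and sizes are hypothetical.

```python
import random

def stratified_sample(strata, fraction, rng=None):
    """Draw a simple random sample of the same fraction from every stratum."""
    rng = rng or random.Random()
    sample = {}
    for name, units in strata.items():
        n = max(1, round(len(units) * fraction))  # at least one unit per stratum
        sample[name] = rng.sample(units, n)       # sampling without replacement
    return sample

# Hypothetical population split into two strata of different sizes
strata = {
    "urban": [f"U{i}" for i in range(60)],
    "rural": [f"R{i}" for i in range(40)],
}
chosen = stratified_sample(strata, 0.10, random.Random(7))
print({name: len(units) for name, units in chosen.items()})
```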
Advantages
1. More representative
2. Greater accuracy
3. Administrative convenience
4. More advantageous when the distribution of the population is skewed.
Disadvantages
1. It is a very difficult task to divide the population into homogeneous strata. This may require
considerable time, money and statistical expertise.
2. The supplementary information needed to set up strata is sometimes not available.
3. Sometimes the different strata may overlap, and the sample would then not be representative.
4. Multistage Sampling
As the name implies, this method refers to sampling procedures carried out in several
stages using random sampling techniques. Ordinarily, multistage sampling is applied in large
inquiries extending over a considerably large geographical area, say, the entire country. In the
first stage, a random sample of districts is chosen in each state, followed by random samples of
talukas, villages and units, respectively.
Advantages
1. It is easier to administer than most single stage designs mainly because of the fact that
sampling frame under multi-stage sampling is developed in partial units.
2. A large number of units can be sampled for a given cost under multistage sampling.
3. It introduces flexibility in sampling.
4. It enables the use of existing divisions and subdivisions which saves extra labour.
5. Cluster Sampling
If the total area of interest happens to be a big one, a convenient way in which a sample can be
taken is to divide the area into a number of smaller non-overlapping areas and then to randomly select
a number of these smaller areas called clusters, with the ultimate sample consisting of all units in these
small areas or clusters.
Thus in cluster sampling the total population is divided into a number of relatively small
subdivisions which are themselves clusters of still smaller units and then some of these clusters are
randomly selected for inclusion in the overall sample.
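A minimal sketch of cluster sampling is given below. The four named areas and their units are hypothetical; the point illustrated is that whole clusters are selected at random and every unit inside a selected cluster enters the sample.

```python
import random

def cluster_sample(clusters, n_clusters, rng=None):
    """Randomly pick whole clusters and keep every unit inside them."""
    rng = rng or random.Random()
    picked = rng.sample(sorted(clusters), n_clusters)
    units = [unit for name in picked for unit in clusters[name]]
    return picked, units

# Hypothetical area divided into four non-overlapping clusters
clusters = {
    "North": ["N1", "N2", "N3"],
    "South": ["S1", "S2"],
    "East": ["E1", "E2", "E3", "E4"],
    "West": ["W1", "W2", "W3"],
}
picked, units = cluster_sample(clusters, 2, random.Random(3))
print(picked, len(units))
```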
Advantages
1. Data collection method is simple and economic.
2. Cluster sampling reduces cost by concentrating surveys in selected clusters.
3. Involves less time and cost.
4. Estimates based on cluster samples are usually more reliable per unit cost.
Disadvantages
1. Gives higher standard error.
2. It is less precise than random sampling.
3. There is not as much information in observations within a cluster.
6. Multiphase Sampling
In this method, part of the information is collected from the whole sample and part of the
information is collected from a sub-sample.
Advantages
1. Less cost
2. Less laborious
3. More purposeful.
7. Sequential Sampling
This is a somewhat complex sampling design. The ultimate size of the sample
under this technique is not fixed in advance, but is determined according to mathematical decision
rules on the basis of information yielded as survey progresses. This is usually adopted in case of
acceptance sampling plan in context of statistical quality control. When a particular lot is to be
accepted or rejected on the basis of a single sample, it is known as single sampling; when the decision
is to be taken on the basis of two samples, it is known as double sampling and in case the decision rests
on the basis of more than two samples but the number of samples is certain and decided in advance,
the sampling is known as multiple sampling. But when the number of samples is more than two but it
is neither certain nor decided in advance, this type of system is often referred to as sequential
sampling. Thus, in brief, we can say that in sequential sampling, one can go on taking samples one
after another as long as one desires to do so.
Characteristics of Sample
The main characteristics of a representative sample are as follows:
1. Precision and Accuracy
2. Unbiased character
1. Precision and Accuracy
Precision refers to how closely data are grouped together or the compactness of the
sample data. Precision measures the variability of a group of measurements. A precise set of
Summary
Sample
A sample is a relatively small group of observations taken from a defined population
and considered to be truly representative of that population.
Sampling techniques
1. Random sampling
2. Systematic sampling
3. Stratified sampling
4. Cluster sampling
5. Multistage sampling
6. Multiphase sampling
7. Sequential sampling
Characteristics of Sample
1. Precision and accuracy
2. Unbiased character
Exercise
1. Explain the various sampling techniques.
2. What are the characteristics of a good sample?
3. From a batch of 50 tablets, a sample of 10 tablets is to be drawn. Explain how the systematic
sampling and random sampling methods can be used.
4. Write a note on stratified sampling.
5. Explain the terms: a) Precision b) Accuracy c) Bias
Answers:
Multiple Choice Questions
1. c 2. d 3. c 4. a 5. a 6. d 7. a 8. a 9. c 10. a
Chapter 15
ESTIMATION OF CONFIDENCE INTERVAL
Learning objectives
When we have finished this chapter, we should be able to:
1. Explain what the standard error of the sample mean is and calculate its value.
2. Explain how we can use the probability properties of the Normal distribution to measure the
preciseness of the sample mean as an estimator of the population mean.
3. Estimate the confidence interval of the population mean.
4. Calculate and interpret a 95 per cent confidence interval for a population mean.
Where
SEM = Standard error of mean
SD = Standard deviation
N = number of observations
Standard error of the mean can be considered as a measure of precision. Obviously, the
smaller the SEM, the more confident we can be that our sample mean is closer to the true population
mean. A general rule of thumb is that with samples of 30 or more observations, it is safe to use the
sample standard deviation as an estimate of population standard deviation.
Example 15.1
Diastolic blood pressure of 322 males was taken. Mean BP was found to be 95 and SD 12
mm. Determine SEM.
Answer:
Data: N =322,
s = 12 mm
Formula:
$$\mathrm{SEM} = \frac{s}{\sqrt{N}}$$
Solution:
$$\mathrm{SEM} = \frac{12}{\sqrt{322}} = 0.67$$
SEM was found to be 0.67.
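The same calculation can be checked with a short Python sketch; the function name `standard_error` is chosen here only for illustration.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of N."""
    return sd / math.sqrt(n)

# Example 15.1: SD = 12 mm, N = 322
sem = standard_error(12, 322)
print(round(sem, 2))  # 0.67
```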
Example 15.2
Hardness of tablets from same batch was determined using Pfizer type hardness tester. The
data is as follows:
5.9 6.4 5.7 5.4 4.9 6.2 5.8 5.5 5.3 5.2
Calculate SEM.
Solution
1. First, calculate standard deviation of the given data
Formula:
$$s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N-1}}$$
Where
ΣX = 56.3
ΣX² = 318.89

X        X²
5.9      34.81
6.4      40.96
5.7      32.49
5.4      29.16
4.9      24.01
6.2      38.44
5.8      33.64
5.5      30.25
5.3      28.09
5.2      27.04
Total    56.3 (ΣX)    318.89 (ΣX²)

Solution:
$$s = \sqrt{\frac{318.89 - \frac{(56.3)^2}{10}}{10-1}} = \sqrt{\frac{1.921}{9}} = \sqrt{0.2134} = 0.462$$
Standard deviation (s) = 0.462
2. Now determine SEM
N = 10; s = 0.462
Formula:
$$\mathrm{SEM} = \frac{s}{\sqrt{N}}$$
Solution:
$$\mathrm{SEM} = \frac{0.462}{\sqrt{10}} = 0.146$$
SEM was found to be 0.146.
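As a cross-check, the standard deviation and SEM of the hardness data can be computed directly with Python's standard `statistics` module. The sketch carries full precision throughout, so the result may differ in the third decimal place from a hand calculation that rounds the variance before taking the square root.

```python
import math
import statistics

# Tablet hardness data from Example 15.2
hardness = [5.9, 6.4, 5.7, 5.4, 4.9, 6.2, 5.8, 5.5, 5.3, 5.2]

s = statistics.stdev(hardness)          # sample SD, divisor N - 1
sem = s / math.sqrt(len(hardness))      # standard error of the mean

print(round(s, 3), round(sem, 3))
```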
Figure 15.1 Confidence limits of sample mean (X̄) from the population mean (μ) are (μ ± 1.96 SE
and μ ± 2.58 SE)
$$p\% = \bar{X} \pm \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{N}} \tag{3}$$
Where,
p % = selected confidence interval (90%, 95% or 99%)
X̄ = observed mean
z(1−α/2) = z value corresponding to the percentage confidence interval (1.65 for 90%), (1.96 for
95%) and (2.58 for 99%)
σ = standard deviation of population
N = number of observations
Example 15.3
A clinical study of a new anticoagulant drug was performed in 30 patients, in which the
volume of distribution was calculated as 10.2 ± 1.9 L. Calculate the 95% confidence interval of
the mean value.
Answer:
Data:
X̄ = observed mean = 10.2
z(1−α/2) = z value corresponding to the 95% confidence interval = 1.96
σ = standard deviation = 1.91
N = number of observations = 30
Formula:
$$p_{95\%} = \bar{X} \pm \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{N}}$$
Solution
$$p_{95\%} = 10.2 \pm \frac{1.96 \times 1.91}{\sqrt{30}} = 10.2 \pm 0.68$$
The 95% confidence interval was found to be 10.2 ± 0.68 L, i.e. 9.52 to 10.88 L.
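The calculation can be reproduced with the short sketch below. It obtains the z value from the standard normal distribution (`statistics.NormalDist`) instead of a printed table, and uses the standard deviation of 1.91 taken in the worked data; the function name `confidence_interval` is an assumption made for this example.

```python
import math
from statistics import NormalDist

def confidence_interval(mean, sd, n, level=0.95):
    """Two-sided z-based confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Example 15.3: mean 10.2 L, SD 1.91 L, N = 30
low, high = confidence_interval(10.2, 1.91, 30)
print(round(low, 2), round(high, 2))  # 9.52 10.88
```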
Example 15.4
The following are the results of an assay:
Tablet No    01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
Assay (mg)   75 74 72 78 78 74 75 77 76 78 73 77 75 74 72
Tablet No    16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Assay (mg)   75 76 75 73 76 79 73 76 75 80 77 74 76 77 74
Assume that three assays are selected at random to give a sample of tablets: 4, 18 and 26.
Estimate the 90%, 95% and 99% confidence intervals, assuming that the population standard
deviation is 2.04.
Answer:
Taking the assay results of sampled tablets 4, 18 and 26, let us calculate the mean.
Data:
78 75 77
Formula
$$\text{Mean } (\bar{X}) = \frac{\text{Sum of all observations } (\sum X)}{\text{Total number of observations } (N)}$$
Solution
$$\bar{X} = \frac{78 + 75 + 77}{3} = \frac{230}{3} = 76.67$$
Mean = 76.67 mg
1. 90 % confidence interval
Data:
X̄ = observed mean = 76.67
z(1−α/2) = z value corresponding to the 90% confidence interval = 1.65
σ = standard deviation = 2.04
N = number of observations = 3
Formula
$$p_{90\%} = \bar{X} \pm \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{N}}$$
Solution
$$p_{90\%} = 76.67 \pm \frac{1.65 \times 2.04}{\sqrt{3}} = 76.67 \pm 1.94$$
$$p_{95\%} = 76.67 \pm \frac{1.96 \times 2.04}{\sqrt{3}} = 76.67 \pm 2.3$$
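The three intervals asked for in this example differ only in the z value, so they can be computed together. The sketch below uses the tabulated z values quoted earlier in the chapter (1.65, 1.96 and 2.58); the variable names are illustrative.

```python
import math

# Example 15.4: assays of the sampled tablets 4, 18 and 26
assays = [78, 75, 77]
sigma = 2.04                              # population SD given in the example
z_table = {90: 1.65, 95: 1.96, 99: 2.58}  # z values used in this chapter

n = len(assays)
mean = sum(assays) / n
half_widths = {level: z * sigma / math.sqrt(n) for level, z in z_table.items()}

for level, hw in half_widths.items():
    print(f"{level}% CI: {mean:.2f} ± {hw:.2f} mg")
```

The interval widens as the confidence level increases, because a larger z value multiplies the same standard error.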
Summary
Standard error of mean
$$\mathrm{SEM} = \frac{SD}{\sqrt{N}} = \frac{s}{\sqrt{N}}$$
At 95% confidence, Population mean = Sample mean ± 1.96 × SE
Confidence interval
$$p\% = \bar{X} \pm \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{N}}$$
Exercise:
1. 100 aspirin tablets were removed from a batch and the stability of the drug was determined at
40 °C. The degradation rate constant for the drug was 0.09 ± 0.01 per hour. Calculate the 95%
confidence interval.
2. Six bottles from batch of Roxithromycin suspension were assayed by spectrophotometric method.
The results are given below. Estimate confidence interval.
52 57 49 51 53 52
3. Estimate 95% confidence interval, if the mean assay of 20 Paracetamol tablets is 524.3 mg and
standard deviation is 3.5 mg.
4. Estimate 99% confidence interval for the following data:
Mean = 18.85; N = 80, standard deviation = 5.55.
5. Estimate 90% confidence interval for the hardness of tablets:
Mean = 5.62; N = 40, standard deviation = 0.68.
Answers
Multiple Choice Questions
1. b 2. d 3.d 4.c 5.b 6. a 7. b 8. a 9. b 10. c
Exercise:
1. 95 % confidence interval = 0.09 ± 0.00196.
2. Mean = 52.33, SD = 2.658, 95 % confidence interval = 52.33 ± 2.127.
3. 95 % confidence interval = 524.3 ± 1.533.
4. 99 % confidence interval = 18.85 ± 1.60.
5. 90 % confidence interval = 5.62 ± 0.177.
Chapter 16
HYPOTHESIS TESTING
Learning objectives
When we have finished this chapter, we should be able to:
1. Explain how a research question can be expressed in the form of a testable hypothesis.
2. Explain what a null hypothesis is.
3. Summarise the hypothesis test procedure.
4. Explain what a p-value is.
5. Use the p-value to appropriately reject or not reject a null hypothesis.
6. Describe type I and type II errors, and their probabilities.
Introduction
Hypothesis testing is the process of inferring from a sample whether or not to accept a certain
statement about a population or populations. The sample is assumed to be a small representative
proportion of the total population. Two errors can occur: rejection of a true hypothesis, or
failure to reject a false hypothesis.
In inferential statistics, two hypotheses are stated:
H0: Hypothesis under test (null hypothesis)
Ha: Alternative hypothesis (research hypothesis)
By convention, the null hypothesis states that there are no real differences in the outcomes.
For example, if we are comparing three levels of a discrete independent variable (μ1, μ2, μ3), the
null hypothesis would be stated as μ1 = μ2 = μ3. The evaluation then attempts to nullify the
hypothesis of no significant difference in favor of an alternative research hypothesis.
Null hypotheses reflect the conservative position of no difference, no risk, no effect, etc.,
hence the name, ‘null’ hypothesis. To test this null hypothesis, researchers will have to take samples
and measure outcomes, and decide whether the data from the sample provides strong enough
evidence to be able to refute or reject the null hypothesis or not. If evidence against the null hypothesis
is strong enough for us to be able to reject it, then we are implicitly accepting that some specified
alternative hypothesis, usually labeled Ha, is probably true.
outcome variable.
3. Collect the appropriate sample data and determine the relevant sample statistic, e.g. sample mean,
sample proportion, sample median, (or their difference or ratio), etc.
4. Use a decision rule that will enable us to judge whether the sample evidence supports or does not
support the null hypothesis.
5. Thus, on the strength of this evidence, either reject or do not reject the null hypothesis.
Let’s take a simple example. Suppose we want to test whether a coin is fair, i.e. not weighted
to produce more heads or more tails than it should. The null hypothesis is that the coin is fair, i.e. it will
produce as many heads as tails, so that the population proportion π, equals 0.5. The outcome variable
is the sample proportion of heads, p. If we toss the coin 100 times, and get 41 heads, then p = 0.41. Is
this outcome compatible with our hypothesised value of 0.5? Is the difference between 0.5 and 0.41
statistically significant or could it be due to chance? We decide what proportion of heads we might
expect to get if the coin is fair by using one of two approaches:
1. creation of a confidence interval or
2. a comparison with a critical value.
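Both approaches can be sketched for the coin example. The sketch assumes the normal approximation to the sampling distribution of a proportion, with standard error √(π(1 − π)/N) under the null hypothesis, and the usual two-tailed 5% critical value of 1.96; these choices are assumptions made for illustration.

```python
import math

pi0 = 0.5          # hypothesised population proportion (fair coin)
n = 100            # number of tosses
p_hat = 41 / n     # observed sample proportion of heads

se = math.sqrt(pi0 * (1 - pi0) / n)   # standard error under H0 = 0.05
z_crit = 1.96                         # two-tailed critical value at 5%

# Approach 1: range of sample proportions expected 95% of the time if H0 is true
lower, upper = pi0 - z_crit * se, pi0 + z_crit * se
inside_interval = lower <= p_hat <= upper

# Approach 2: test statistic compared with the critical value
z = (p_hat - pi0) / se
reject_h0 = abs(z) > z_crit

print(round(lower, 3), round(upper, 3), inside_interval)
print(round(z, 2), reject_h0)
```

Under these assumptions the observed proportion of 0.41 falls inside the expected range and the test statistic does not exceed the critical value, so both approaches lead to the same decision.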
The estimation of the confidence interval has already been studied in the previous chapter. It
can be estimated by using the formula
$$p\% = \bar{X} \pm z_{\%} \times \frac{\sigma}{\sqrt{N}} \tag{1}$$
Population mean = Estimated sample mean ± Reliability coefficient × Standard error.
In the second method we calculate a "test statistic", a value based on the manipulation
of sample data. This value is compared to a preset "critical" value (usually given in a special
table) based on a specific acceptable error rate (e.g. 5%). If the test statistic is extremely rare,
it will lie beyond our critical value and we will reject the null hypothesis under test in favor of
the research hypothesis.
Decision rule for hypothesis testing
1. Determine the p-value for the output we have obtained. A p-value is the probability of getting the
outcome observed (or one more extreme), assuming the null hypothesis to be true.
2. Compare it with the critical value, usually 0.05.
3. If the p-value is less than the critical value, reject the null hypothesis; otherwise do not reject it.
It’s important to stress that the p-value is not the probability that the null hypothesis is true or
not true. It’s a measure of the strength of the evidence against the null hypothesis. The smaller the
p-value, the stronger the evidence. This means it is less likely that the outcome we have obtained
occurred by chance. Note that the critical value, usually 0.05 or 0.01, is called the significance
level of the hypothesis test and is denoted as α (alpha).
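The decision rule can be written as a small sketch. It assumes a two-tailed test based on a z statistic, for which the p-value is the area in both tails of the standard normal distribution beyond the observed value; the function names are hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a z statistic at least this extreme if H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Apply the decision rule at significance level alpha."""
    return "reject H0" if p_value < alpha else "do not reject H0"

# z = -1.8 corresponds to 41 heads in 100 tosses of a coin assumed fair
p = two_sided_p_value(-1.8)
print(round(p, 3), decide(p))
```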
Types of Error
Whenever we decide either to reject or not to reject a null hypothesis, we may be making a
mistake. After all, we are basing the decision on sample evidence. Even if we have done everything
right, the sample could still, by chance, not be very representative of the population. Moreover, the
test might not be powerful enough to detect an effect if there is one.
There are two possible errors associated with hypothesis testing. A Type I error is the
rejection of a true null hypothesis (H0) and a Type II error is the acceptance of a false H0.
The probability of a Type I error is also called the level of significance and is denoted by the
symbol α. Alternatively, the level of confidence is given as (1 − α). The probability of a Type II
error is symbolized using the Greek letter β. The probability of rejecting a false H0 is called
power (1 − β). A Type I error is also called a false positive, while a Type II error is called a
false negative.
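Table 16.1 Summary of the relationships between statistical outcome and statistical errors
Statistical outcome (decision) | Real result: null hypothesis is true | Real result: null hypothesis is false
Reject H0 | Type I error (α) | Correct decision (power, 1 − β)
Do not reject H0 | Correct decision (1 − α) | Type II error (β)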
Figure 16.1 Standardised normal distribution showing the (shaded) region of the z statistic for a
two-tailed test (horizontal axis: z statistic, from -3 to +3).
If the alternative hypothesis states that ‘the mean of a treatment group is greater than the
mean of another group’, then the critical value of z is ≥ +1.65 and the test is referred to as a
positive one tailed test.
When the alternative hypothesis states that the mean of a treatment group is lower than the
mean of another group, the rejection region resides at the left-hand (negative) side of the
standardised normal distribution and is defined by z values that are ≤ −1.65.
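The critical z values quoted for one-tailed and two-tailed tests can be recovered from the standard normal distribution. The sketch below is illustrative only; `critical_z` is a hypothetical helper name.

```python
from statistics import NormalDist

def critical_z(alpha=0.05, tails=2):
    """Positive critical z value for a one- or two-tailed test."""
    return NormalDist().inv_cdf(1 - alpha / tails)

print(round(critical_z(0.05, tails=2), 2))   # 1.96
print(round(critical_z(0.05, tails=1), 3))   # 1.645
```

For a one-tailed test the whole of α lies in one tail, so the critical value (1.645, tabulated as 1.65) is smaller in magnitude than the two-tailed value of 1.96.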
Figure 16.2 Standardised normal distribution showing the (shaded) region of the z statistic for a
one-tailed test (panels: positive one tailed test and negative one tailed test; horizontal axis:
z statistic, from -3 to +3).
Summary
Hypothesis testing
It is the process of inferring from a sample whether or not to accept a certain statement about
a population (s).
Key stages in hypothesis testing
1. State the null hypothesis
2. State the alternative hypothesis
3. Select the level of significance
4. Select the number of tails
5. Calculate the test statistic
6. Compare the table value and the observed value
7. Decide whether or not to reject the null hypothesis
Decision rule
1. Determine p value.
2. Compare it with the critical value, usually 0.05.
3. If the obtained p value is less than the critical value, reject the null hypothesis; otherwise
do not reject it.
Types of error
Type I error is the rejection of a true null hypothesis.
Type II error is the acceptance of a false H0.
Exercise
1. Write the alternative hypothesis for each of the following null hypotheses:
a. μA = μB
b. μH ≥ μI
c. μ1 = μ2 = μ3 = μ4 = μ5 = μ6
d. μA ≤ μB
e. μ = 115
f. populations C, D, E, F and G are the same
g. both samples come from the same population
2. What is hypothesis testing?
3. Give key stages in hypothesis testing.
4. Give various types of errors in hypothesis testing.
5. What is one tailed and two tailed test?
Answers:
Multiple Choice Questions
1. a 2. c 3. a 4. b 5. b 6. c 7. a 8. b 9. a 10. b
Chapter 17
CHOICE OF STATISTICAL TESTS
Learning objectives
When we have finished this chapter, we should be able to:
1. Choose appropriate statistical test for given type of data.
2. Understand parametric and non parametric statistical tests.
One of the major steps in the process of statistical hypothesis testing involves the choice of
statistical test. The most appropriate statistical test is chosen according to the desired power of the
study, the nature of the population from which the observations are taken and the nature of
measurement of the variable.
a binomial-based test. Characterization of the association within nominal data is performed using the
contingency coefficient.
2. Ordinal data
Many non-parametric tests are referred to as ranking tests and may be employed for the
analysis of ordinal data. Characterization of the association within ordinal data is commonly
performed using the Spearman correlation coefficient.
3. Metric discrete data
Metric discrete data may be analysed using parametric statistical tests. Indeed, if all the
assumptions of parametric statistics are valid, statistical comparisons of groups of interval data
should be performed using parametric tests, e.g. t test, analysis of variance, etc.
4. Metric continuous data
Metric continuous data may be manipulated using conventional arithmetic and may
therefore be conveniently analysed using either parametric or non-parametric methods.
Characterisation of the association within normally distributed interval or ratio data is commonly
performed using the correlation coefficient.
More specifically, nominal and ordinal data are analysed exclusively with non-parametric
methods, whereas both parametric and non-parametric methods may be used to examine
continuous metric data. If metric data are skewed, non-parametric tests are used; otherwise,
parametric tests are the choice for metric data.
Chart Showing Choice of Statistical Tests
Columns: Type of data | No. of samples to be compared | Matched/Unmatched | Statistical test (Parametric test, Non-parametric test)
Metric data, 1 sample, N/A: One sample binomial test
Metric data, 2 samples, Matched: Paired t test (parametric), Wilcoxon signed rank test (non-parametric)
Summary
Choice of statistical test
Samples | Parametric (Data type: Ratio, Interval) | Non-Parametric (Data type: Ordinal) | Frequency (Data type: Nominal)
Single sample | z-test, t-test | Sign test, K-S test | χ² Goodness of fit
Two independent samples | z-test, t-test | Mann-Whitney U | > 20, Chi-squared; < 20, Fisher’s Exact
Two paired (dependent) samples | Paired t-test | Wilcoxon Signed ranks | McNemar’s test
More than two independent samples | One-way ANOVA | Kruskal-Wallis One-way ANOVA | χ² Test of
Exercise
1. Give the choice of statistical tests for the following:
a. If data is metric, continuous and three groups are to be compared.
b. If data is nominal and paired and two groups are to be compared.
c. If data is ordinal and more than three groups are to be compared.
d. If data is ordinal, non-parametric and paired.
e. If data is metric, non-parametric and two groups are to be compared.
2. How will you choose a statistical test according to the type of data?
3. Give the assumptions of parametric tests.
4. Give the non-parametric equivalents of the following parametric tests.
a. Paired t-test b. One-way ANOVA
c. t-test d. Pearson correlation
5. If more than 2 groups are to be compared, which tests can be used? Explain.
Answers:
Multiple Choice Questions
1. b 2. c 3. d 4. a 5. a 6. a 7. b 8. b 9. c 10. b
Chapter 18
HYPOTHESIS TESTING FOR ONE SAMPLE MEAN
Learning objectives
When we have finished this chapter, we should be able to:
1. Understand and estimate z statistic for one sample z test.
2. Understand and estimate t statistic for one sample t test.
One sample testing involves the estimation of whether sample data, generated from an
experimental procedure, are derived from a defined population or not. More specifically, one sample
tests evaluate whether the mean of sample and the mean of the population are different.
Two parametric one sample tests are described here: the one sample z test and the one sample t
test. The non-parametric chi-square one sample test and Kolmogorov-Smirnov one sample test are
also used but are outside the scope of this book.
$$z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{N}} \tag{1}$$
Where
X̄ = observed sample mean
μ0 = hypothesised mean
σ = standard deviation of the population
N = sample size
Example 18.1
A pharmaceutical company has purchased a new liquid filling machine for use in its liquid
oral division. The fill volume of the existing machine under the current manufacturing
process is 5.04 ± 0.27 ml per bottle. A pilot study has been undertaken to evaluate whether (or not)
the filling performance of the new machine is similar to that of the previous machine using
identical process conditions. Therefore, 100 vials of the product were removed from the pilot study
and the mean fill volume was determined gravimetrically to be 5.13 ml. Is there a difference between
the performance of the new and existing filling machines?
Solution: As the mean filling volume of one machine is given and the number of samples is more
than 30, the one sample z test is employed here. The following steps give the systematic approach
to hypothesis testing.
1. State the null hypothesis
The null hypothesis (H0) states that there is no significant difference between the sample
mean and the population mean.
H0: μ = 5.04 ml
2. State the alternative hypothesis
The alternative hypothesis states that there is a difference between the sample mean and the
population mean.
Ha: μ ≠ 5.04 ml
3. State the level of significance
The choice of the level of significance (α) is an extremely important consideration in the
establishment of the experimental design. Traditionally, α, the probability of rejecting the null
hypothesis when it is in fact true, is chosen to be 0.05.
4. State the number of tails
The number of tails associated with the alternative hypothesis is determined before data
collection. As the alternative hypothesis states μ ≠ 5.04 ml, the mean may be less or more than
5.04 ml, and the test is therefore two tailed.
5. Select the most appropriate statistical test
A two tailed parametric, one sample test may be applied to the statistical question. As the
sample size is large, the z test is the most suitable statistical method.
6. Perform the statistical analysis
Step 1 Calculate the test statistic
Formula:
The z statistic is calculated by using the following formula
$$z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{N}}$$
Data:
X̄ = observed sample mean = 5.13
μ0 = hypothesised mean = 5.04
σ = standard deviation of the population = 0.27
N = sample size = 100
Solution:
$$z = \frac{5.13 - 5.04}{0.27/\sqrt{100}} = \frac{0.09}{0.027} = 3.33$$
Step 2 Define the critical z statistic
Identification of the critical statistic requires knowledge of the chosen level of significance (α)
and the number of tails associated with the experimental design. The critical z statistic
associated with the specified level of significance (0.05) and a two tailed design may be derived
from the standardized normal distribution as 1.96. The regions of the z distribution associated
with the null and alternative hypotheses may now be defined in terms of the z statistic as:
H0: −1.96 < z < +1.96
Ha: z ≥ +1.96 or z ≤ −1.96
7. Decision
As the calculated z value (+3.33) is greater than +1.96 (i.e. calculated z value does not lie in
the region -1.96 < z < +1.96), the null hypothesis is rejected and the alternative hypothesis is accepted.
It is therefore concluded that there is a significant difference in the performance of the new and
existing filling machine, in terms of fill volumes.
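The steps of this example can be condensed into a short sketch. The function name `one_sample_z` is an assumption made for illustration; the critical value of 1.96 is the two-tailed value at the 0.05 level used above.

```python
import math

def one_sample_z(sample_mean, mu0, sigma, n):
    """z statistic for a one sample z test with known population SD."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

# Example 18.1: sample mean 5.13 ml, hypothesised mean 5.04 ml, SD 0.27 ml, N = 100
z = one_sample_z(5.13, 5.04, 0.27, 100)
z_critical = 1.96                     # two tailed, alpha = 0.05
reject_h0 = abs(z) >= z_critical

print(round(z, 2), reject_h0)  # 3.33 True
```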
Example 18.2
A pharmaceutical company has purchased a new tablet punching machine for use in its
compression division. The existing tableting machine produces a mean of 1000 tablets/min with a
standard deviation of 150. A pilot study has been undertaken to evaluate whether (or not) the
tableting performance of the new machine is similar to that of the previous machine using identical
process conditions. The mean tableting rate for 50 samples was determined and found to be 1050
tablets/min. Is there a difference between the performance of the new and existing tableting
machines?
Solution:
As the mean tableting rate of one machine is given and the number of samples is more than
30, the one sample z test is employed here. The following steps give the systematic approach to
hypothesis testing.
1. State the null hypothesis
The null hypothesis (H0) states that there is no significant difference between the sample
mean and the population mean.
H0: μ = 1000 tablets/min
2. State the alternative hypothesis
The alternative hypothesis states that there is a difference between the sample mean and the
population mean.
Ha: μ ≠ 1000 tablets/min
3. State the level of significance
The choice of the level of significance (α) is an extremely important consideration in the
establishment of the experimental design. Traditionally, α, the probability of rejecting the null
hypothesis when it is in fact true, is chosen to be 0.05.
4. State the number of tails
As the alternative hypothesis states μ ≠ 1000 tablets/min, the mean may be less or more than 1000
tablets/min, and the test is therefore two tailed.
5. Select the most appropriate statistical test
A two tailed parametric, one sample test may be applied to the statistical question. As the
sample size is large, the z test is the most suitable statistical method.
6. Perform the statistical analysis
Step 1. Calculate the test statistic
Formula:
The z statistic is calculated by using the following formula
$$z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{N}}$$
Data:
X̄ = observed sample mean = 1050
μ0 = hypothesised mean = 1000
σ = standard deviation of the population = 150
N = sample size = 50
Solution:
$$z = \frac{1050 - 1000}{150/\sqrt{50}} = \frac{50}{21.2} = 2.36$$
has been assumed that the variance of the population is known. However, in many experiments it is
impossible to obtain knowledge of the relevant population statistics and, accordingly, the only
available estimates of the central tendency and variability of the population are the mean and variance
that are associated with the sample data. Under these circumstances, one sample t test is employed to
calculate the t statistic. The process for calculation of the one sample t test is similar to that for the one
sample z test.
The t statistic is calculated using the following mathematical formula:
$$t_{(N-1)\,df} = \frac{\bar{X} - \mu_0}{s/\sqrt{N}} \tag{2}$$
Example 18. 3
To test a manufacturer's claim that his fruit juice contains 60 mg of vitamin C per 100 ml, a
quality control analyst takes six randomly selected samples with the following results: 65, 58, 62,
57, 62, 65. Is the manufacturer’s claim justified (sample mean = 61.5; sample standard deviation
(s) = 3.39)?
Answer:
Hypothesis testing of this problem is done by using following steps:
1. State the null hypothesis
The null hypothesis states that there is no difference between the sample mean and
hypothetical mean.
H0: μ = 60 mg
2. State the alternative hypothesis
The alternative hypothesis states that there is a difference between the manufacturer’s claim
and the observed mean.
Ha: μ ≠ 60 mg
3. State the level of significance
It is assumed that the level of significance is 0.05.
4. State the number of tails
The number of tails associated with the alternative hypothesis is determined before data
collection. As the sample mean may be either lower or higher than the manufacturer's claim, we have
to use a two tailed test.
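The t statistic for these data can be computed with the sketch below. The critical value of 2.571 (two tailed, α = 0.05, 5 degrees of freedom) is taken from a standard t table and is stated here as an assumption; the variable names are illustrative.

```python
import math
import statistics

# Example 18.3: vitamin C content (mg per 100 ml) of six samples
vitamin_c = [65, 58, 62, 57, 62, 65]
mu0 = 60                                   # manufacturer's claim

n = len(vitamin_c)
mean = statistics.mean(vitamin_c)          # 61.5
s = statistics.stdev(vitamin_c)            # sample SD, about 3.39
t = (mean - mu0) / (s / math.sqrt(n))      # t with N - 1 = 5 degrees of freedom

t_critical = 2.571   # assumed: standard t table, two tailed, alpha 0.05, 5 df
reject_h0 = abs(t) > t_critical

print(round(t, 2), reject_h0)
```

With these data the calculated t is well inside the range defined by the critical value, so the sample gives no grounds to reject the null hypothesis at the 0.05 level.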
Example 18. 4
To test a manufacturer's claim that his tablet contains 60 mg of drug, 10 tablets were selected
randomly and analysed spectrophotometrically. Is the manufacturer’s claim justified? The analysis
results are given below (sample mean = 62.7; sample standard deviation (s) = 6.255):
53, 56, 57, 60, 61, 64, 67, 68, 70, 71
Solution:
Hypothesis testing of this problem is done by using following steps:
1. State the null hypothesis
The null hypothesis states that there is no difference between the sample mean and
hypothetical mean.
H0: μ = 60 mg
2. State the alternative hypothesis
The alternative hypothesis states that there is a difference between the manufacturer’s claim
and the observed mean.
Ha: μ ≠ 60 mg
3. State the level of significance
It is assumed that the level of significance is 0.05.
4. State the number of tails
The number of tails associated with the alternative hypothesis is determined before data
collection. As the sample mean may be either lower or higher than the manufacturer's claim, we have
to use a two tailed test.
5. Select the appropriate statistical test
Whenever the population variance is unknown, the one sample t test and not the one sample z
test is used. A two tailed parametric, one sample t test may be applied to the statistical question.
6. Perform the statistical analysis
Step 1. Calculate the t statistic
The t statistic is calculated by using the following formula
Formula:
$$t_{(N-1)\,df} = \frac{\bar{X} - \mu_0}{s/\sqrt{N}}$$
Data:
X̄ = observed sample mean = 62.7
μ0 = hypothesised population mean = 60
s = sample standard deviation = 6.255
N = sample size = 10
Degrees of freedom associated with the t value = N − 1 = 9
Solution:
$$t_{9} = \frac{62.7 - 60}{6.255/\sqrt{10}} = \frac{2.7}{1.978} = 1.365$$
Step 2. Define the critical ‘t’ statistic.
The critical value of the ‘t’ statistic for 9 degrees of freedom, two tailed and α = 0.05 is given as
±2.26 (Appendix II). Therefore, the null hypothesis is retained if the calculated t value lies in the
range -2.26 < t < +2.26.
7. Decision
As the calculated t value lies within the region -2.26 < t < +2.26, the null hypothesis is
accepted. So the manufacturer’s claim that his tablet contains 60 mg of drug is justified.
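The calculation in Example 18.4 can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the variable names are illustrative, and the critical value 2.26 is simply the table value quoted in the text rather than something the script derives.

```python
import math
import statistics

# Assay results for the 10 tablets in Example 18.4 (mg of drug per tablet).
assay = [53, 56, 57, 60, 61, 64, 67, 68, 70, 71]
mu0 = 60            # hypothesised population mean (manufacturer's claim)
t_critical = 2.26   # two-tailed, alpha = 0.05, 9 df (table value quoted in the text)

n = len(assay)
mean = statistics.mean(assay)     # sample mean
sd = statistics.stdev(assay)      # sample SD, N - 1 in the denominator
se = sd / math.sqrt(n)            # standard error of the mean
t_value = (mean - mu0) / se       # one sample t statistic

# Two-tailed decision rule: reject H0 only if |t| exceeds the critical value.
reject_h0 = abs(t_value) > t_critical
print(f"mean={mean:.1f} sd={sd:.3f} se={se:.3f} t={t_value:.3f} reject H0: {reject_h0}")
```

Because the script keeps all decimal places, its t value may differ from the hand calculation in the last digit.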
Summary
One sample z test
Used to test a single sample mean (X̄) when the population mean (μ) and variance (σ²) are
known.
$$ z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{N}} $$
One sample t test
Used to test a single sample mean (X̄) when the population variance (σ²) is unknown or n <
30.
$$ t_{(N-1)\,df} = \frac{\bar{X} - \mu_0}{s/\sqrt{N}} $$
Exercise
1. During the manufacture of a 1 L infusion, 50 bottles were drawn at random and the volume of each
container measured to check whether the full volume met the 1 liter claim. The mean volume was
found to be 1008 ml and standard deviation was 65 ml. Is there sufficient evidence to suggest that the
average volume of the infusion was not 1 liter at 5% level of significance?
2. A random sample of 400 items gives the mean 4.45 with a standard deviation of 2. Can it be
regarded as drawn from a normal population with mean 4 at 5% level of significance?
3. A random sample of 200 medical shops was taken from Satara district and the average sale per shop was
found to be 520. Population mean sale was 510 per shop with a standard deviation of 40. Is the
difference between sample mean and population mean statistically significant at 5% level of
significance?
4. A random sample of 45 items gives the mean 73.2 with a standard deviation of 8.6. Can it be
regarded as drawn from a normal population with mean 76.7 at 1% level of significance?
5. A random sample of 900 children was found to have a mean fatfold thickness at triceps of 3.4 mm
with SD of 2.3mm. Can it be reasonably regarded as a representative sample of population having a
mean thickness of 3.2mm at 5% level of significance?
6. A batch of 40 L of paracetamol suspension (30 mg/ml) was filled into containers. 25 containers of
product have been removed for analysis of their drug content. Does the mean concentration of drug in
the batch conform to the 30 mg/ml claim at the 0.05 level of significance?
Concentration of paracetamol (mg/ml) in 25 containers:
31.5 30.5 30.5 29.8 30.1 30.2 30.2 30.6 29.8 31.1
30.1 30.1 30.1 30.2 29.7 28.9 30 30.4 30.3 29.9
32.1 30.1 28.9 30.8 29.5
7. A pharmaceutical manufacturer does a chemical analysis to check the potency of products. The
standard release potency for cephalothin crystals is 910 and the manufacturer believes this claim may
be too high. An assay of 16 lots gives the following potency data:
Data: 897 918 914 906 913 895 906 893 916 908
918 906 905 907 921 901
Test the manufacturer's claim at the 0.01 level of significance.
8. Ten individuals are chosen at random from a normal population and their weights in kg are found to
be
68 63 66 69 63 67 70 70 71 71
Does this sample adequately represent a population in which the mean weight was found to be 66 kgs?
9. Mean Hb % of the population is 14.3. Can a sample of 15 individuals with a mean of 13.5 and SD
1.5 be from same population?
10. Ten tablets are chosen at random from batch and their content in mg are found to be
47 51 48 49 48 52 47 49 46 47
The manufacturer claims that each tablet contains 50 mg of drug. Test the manufacturer's claim at the
0.05 level of significance.
Answers:
Multiple Choice Questions
1. a 2. a 3. c 4. a 5. c 6. b 7. a 8. b 9. a 10. d
Exercise
1. z = 0.87; Accept null hypothesis because observed z value is between -1.96 and +1.96 for two tailed
test at 5% level of significance. It is therefore concluded that the average volume of container was
1 liter.
2. z = 4.5; Accept alternative hypothesis because observed z value is more than +1.96 for two tailed
test at 5% level of significance. It is therefore concluded that the sample has not been drawn from a
population with mean 4.
3. z = 3.54; Accept alternative hypothesis because observed z value is more than +1.96 for two tailed
test at 5% level of significance. It is therefore concluded that the sample mean and population
mean differ significantly.
4. z = -2.73; Accept alternative hypothesis because observed z value is less than -2.58 for two tailed
test at 1% level of significance. It is therefore concluded that the sample has not been drawn from a
population with mean 76.7.
5. z = 2.61; Accept alternative hypothesis because observed z value is more than +1.96 for two tailed
test at 5% level of significance. It is therefore concluded that the sample is not representative of
population having mean thickness of 3.2 mm at 5% level of significance.
6. Initially determine sample mean (30.216) and standard deviation (0.691). As the population
variance is unknown and also sample size is less than 30 use one sample t test. The critical value of ‘t’
statistics for 24 degrees of freedom at 5% level of significance is given as 2.064. Observed t value is
1.56, i.e. (30.216 - 30)/(0.691/√25), which is less than t critical. So it can be concluded that mean
concentration of drug in the batch conforms to the 30 mg/ml at the 0.05 level of significance.
7. Initially determine sample mean (907.75) and standard deviation (8.48). As the population
variance is unknown and also sample size is less than 30 use one sample t test. The critical value of ‘t’
statistics for 15 degrees of freedom and two tail at 1% level of significance is given as 2.95. Observed t
value is -1.06 which is in between -2.95 and +2.95. It can be concluded that sample mean for potency
of cephalothin crystals does not differ significantly from the standard release potency of 910. So, the
manufacturer's belief that the claimed potency is too high is not supported.
8. Initially determine sample mean (67.8) and standard deviation (3.01). As the population variance is
unknown and also sample size is less than 30 use one sample t test. The critical value of ‘t’ statistics for
9 degrees of freedom and two tail at 0.05 level of significance is given as 2.26. Observed t value is
1.89 which is in between -2.26 and +2.26. It can be concluded that the sample adequately represents a
population in which the mean weight was found to be 66 kg.
9. As the population variance is unknown and also sample size is less than 30 use one sample t test.
The critical value of ‘t’ statistics for 14 degrees of freedom and two tail at 0.05 level of significance is
given as 2.14. Observed t value is -2.07, i.e. (13.5 - 14.3)/(1.5/√15), which lies between -2.14 and
+2.14. It can be concluded that the sample of 15 individuals with a mean of 13.5 and SD 1.5 may be
from the same population.
10. Initially determine sample mean (48.4) and standard deviation (1.9). As the population variance is
unknown and also sample size is less than 30, use one sample t test. The critical value of ‘t’ statistics
for 9 degrees of freedom and two tail at 0.05 level of significance is given as 2.26. Observed t value is
-2.67 which is not in the range of -2.26 and +2.26. Reject null hypothesis and accept alternative
hypothesis.
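The z values quoted in answers 1 to 5 can be checked from the summary figures given in the exercises. The following is a minimal sketch; the function name one_sample_z and the dictionary layout are illustrative choices, not part of the text.

```python
import math

def one_sample_z(sample_mean, mu0, sd, n):
    """z = (sample mean - hypothesised mean) / (SD / sqrt(N))."""
    return (sample_mean - mu0) / (sd / math.sqrt(n))

# (sample mean, hypothesised mean, SD, N) as stated in Exercises 1 to 5.
exercises = {
    1: (1008, 1000, 65, 50),
    2: (4.45, 4, 2, 400),
    3: (520, 510, 40, 200),
    4: (73.2, 76.7, 8.6, 45),
    5: (3.4, 3.2, 2.3, 900),
}

z_values = {number: one_sample_z(*args) for number, args in exercises.items()}
for number, z in z_values.items():
    print(f"Exercise {number}: z = {z:.2f}")
```

Each z is then compared with ±1.96 (5% level) or ±2.58 (1% level) for a two-tailed test, as in the answers.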
Chapter 19
HYPOTHESIS TESTING FOR TWO SAMPLE MEANS
Learning objectives
When we have finished this chapter, we should be able to:
1. Understand and estimate the z statistics for two independent data sets.
2. Understand and perform the t statistics for two independent data sets.
3. Understand and perform paired t test for matched data sets.
One of the most common statistical tests used in pharmaceutical work is the examination of
differences between two sets of sample data, i.e. whether or not the two populations whose properties
are estimated by the sample statistics differ from one another.
There are two types of two sample statistical tests, independent and paired, as shown in the
following table:
Type                 Parametric test        Non-parametric test
Independent samples  z test (sample > 30)   Chi-square test (sample > 20)
                     t test (sample < 30)   Fisher exact test (sample < 20)
                                            Mann Whitney U test (Metric)
Paired samples       Paired t test          McNemar’s test (Categorical)
                                            Wilcoxon signed rank test (Metric)
In the two sample test, the null hypothesis specifically states that there is no difference
between the mean values of each set of data.
$$ z = \frac{\bar{X}_1 - \bar{X}_2}{SE(\bar{X}_1 - \bar{X}_2)} \tag{1} $$
Where
X̄1 & X̄2 = sample means of two independent data sets
SE(X̄1 - X̄2) is the standard error of difference between two means
The standard error of difference of two means can be calculated as
$$ SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}} \tag{2} $$
Where
σ1, σ2 = population standard deviations of samples N1 and N2 respectively
The following steps are used for calculating z statistic for two sample means:
1. Calculate the means X̄1 and X̄2 and standard deviations (σ1) and (σ2) of each population.
2. Calculate the standard error of difference between two means SE(X̄1 - X̄2).
3. Calculate z statistics using formula given above.
4. If the observed difference between the two means is greater than 1.96 times the standard error of
difference, it is significant at 5% level of significance.
5. If the observed difference is greater than 3 times the SE, it reflects real variability in more than 99%
of cases, and is biological or due to chance in less than 1% of cases.
Example 19.1
To determine whether two drugs affected human mental concentration equally, 50 students
were given one drug and 50 others the second drug. All the students were then given an examination to
measure their mental concentration index. The mean scores for the two groups were 65 and 70, and
the respective standard deviations were 15 and 18. Is there sufficient evidence to suggest that the
drugs affected mental concentration differently?
Solution: Let us test the difference between two groups systematically.
1. State the null hypothesis
The null hypothesis states that there is no difference between the effect of two drugs on
mental concentration.
H0: μ1 = μ2
2. State the alternative hypothesis
The alternative hypothesis states that there is a difference between the effect of two drugs on
mental concentration.
Ha: μ1 ≠ μ2
3. State the level of significance
It is assumed that the level of significance is 0.05.
4. State the number of tails
The number of tails associated with the alternative hypothesis is determined before data
collection. As the effects produced by the two drugs on mental concentration may differ in either
direction, we have to use a two-tailed test.
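$$ SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}} = \sqrt{\frac{15^2}{50} + \frac{18^2}{50}} = \sqrt{10.98} = 3.31 $$
3. Calculate z statistics using formula given above.
$$ z = \frac{\bar{X}_1 - \bar{X}_2}{SE(\bar{X}_1 - \bar{X}_2)} = \frac{65 - 70}{3.31} = \frac{-5}{3.31} = -1.51 $$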
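The calculation in Example 19.1 can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function name two_sample_z is illustrative, and the two-sided p-value is an extra that the worked example does not report.

```python
import math
from statistics import NormalDist

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Return (SE of the difference, z) for two independent large samples."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return se, (mean1 - mean2) / se

# Example 19.1: two groups of 50 students, means 65 and 70, SDs 15 and 18.
se, z = two_sample_z(65, 15, 50, 70, 18, 50)

alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha / 2)     # about 1.96 for a two-tailed test
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided p-value

print(f"SE={se:.2f} z={z:.2f} p={p_two_sided:.3f} reject H0: {abs(z) > z_critical}")
```

With these figures the absolute z value falls below 1.96, so under the two-tailed rule stated in the steps above the null hypothesis would not be rejected at the 5% level.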
Example 19.2
In a group of 196 adults the mean serum cholesterol was 180 mg/dl with a standard deviation
of 42 mg/dl. In a comparable group of 144 adults, the mean serum cholesterol was 150 mg/dl with a
standard deviation of 48 mg/dl. Is the difference in cholesterol level of the two groups statistically
significant?
Solution: Let us test the difference between two groups systematically.
1. State the null hypothesis
The null hypothesis states that there is no difference between the cholesterol level of the two
groups.
H0: μ1 = μ2
2. State the alternative hypothesis
The alternative hypothesis states that there is a difference between the cholesterol levels of
two groups.
Ha: μ1 ≠ μ2
3. State the level of significance
It is assumed that the level of significance is 0.05.
4. State the number of tails
The number of tails associated with the alternative hypothesis is determined before data
collection. As the cholesterol levels may differ in either direction, we have to use a two-tailed test.
5. Select the appropriate statistical test
The most relevant statistical method to assess the validity of the null hypothesis, is a two-
tailed z test for two independent samples. As the sample size is more than 30, the z test is the most
suitable statistical method.
6. Perform the statistical analysis
Step 1. Calculate z statistic
The following steps are used for calculating z statistic
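1. Calculate the mean X̄1 and X̄2 and standard deviation (σ1) and (σ2) for each population.
The data is already provided.
X̄1 = 180 σ1 = 42 X̄2 = 150 σ2 = 48
2. Calculate the standard error of difference between two means SE(X̄1 - X̄2).
$$ SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}} = \sqrt{\frac{42^2}{196} + \frac{48^2}{144}} = \sqrt{9 + 16} = 5 $$
3. Calculate z statistics using formula given above.
$$ z = \frac{\bar{X}_1 - \bar{X}_2}{SE(\bar{X}_1 - \bar{X}_2)} = \frac{180 - 150}{5} = 6 $$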
7. Decision
As observed z value (6) is more than z critical (1.96), reject null hypothesis and accept
alternative hypothesis. The evidence would suggest that difference in cholesterol level of the two
groups is statistically significant.
2. The t test for two independent samples
In the equation for the calculation of the z statistic the population variance is used, indicating
that there is prior knowledge of this parameter. Conversely, in the calculation of the t statistic the
variances of the populations from which the samples were drawn are estimated from the variances of
the samples.
$$ SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{S_p^2}{N_1} + \frac{S_p^2}{N_2}} \tag{4} $$
Where,
S_p^2 is the pooled variance, calculated as
$$ S_p^2 = \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2} \tag{5} $$
The following steps are used for calculating t statistics:
1. Calculate mean and standard deviation of each group.
2. Calculate the pooled variance from sample size and sample variances using following formula.
$$ S_p^2 = \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2} $$
Where,
s1 and s2 are standard deviations of two sample groups.
3. Calculate the standard error of difference between two means by using formula.
$$ SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{S_p^2}{N_1} + \frac{S_p^2}{N_2}} $$
4. Calculate t statistic using formula.
$$ t = \frac{\bar{X}_1 - \bar{X}_2}{SE(\bar{X}_1 - \bar{X}_2)} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_p^2}{N_1} + \frac{S_p^2}{N_2}}} $$
5. Find the degrees of freedom, which is N1+N2 - 2.
6. For the given degrees of freedom, find the critical value from the t table. If the calculated value is
less than the critical value, the null hypothesis is accepted; if the calculated t statistic exceeds the
critical value, the null hypothesis is rejected.
Example 19.3
Two formulations of the same drug were placed for stability testing under controlled storage
conditions at 37 °C. After 3 months, samples of each formulation were removed and assayed
individually. The analytical results are given in following table. Determine whether there is a
difference in the mean assay of drug in each formulation following the period of storage.
Data:
Formulation 1 104.1, 108.2, 108.6, 100.8, 106.5, 101.0, 102.6, 99.2, 95.2, 100.8
Formulation 2 102.9, 99.6, 98.1, 104.2, 90.2, 101.0, 99.9, 89.5, 95.5, 98.6
Solution: This problem can be solved in following steps:
1. State the null hypothesis
The null hypothesis states that there is no difference between the mean of assay of two
formulations.
H0: μ1 = μ2
2. State the alternative hypothesis
The alternative hypothesis states that there is a difference between the mean of assay of two
formulations.
Ha: μ1 ≠ μ2
3. State the level of significance
It is assumed that the level of significance is 0.05.
4. State the number of tails
The number of tails associated with the alternative hypothesis is determined before data
collection. As there are two possible outcomes to the study, this is a two tailed experimental design.
5. Select the appropriate statistical test
As the population variances are not known and the sample size is less than 30, two sample t
test is the appropriate choice.
6. Perform the statistical analysis
1. Calculate mean and standard deviation of each group.
Calculation of mean
$$ \bar{X} = \frac{\sum X}{N} $$
Mean of formulation 1 = X̄1 = 102.7
Mean of formulation 2 = X̄2 = 97.95
Calculation of standard deviation
$$ SD = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}} $$
Standard deviation of formulation 1 = s1 = 4.21
3. Calculate the standard error of difference between two means by using formula
$$ SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{S_p^2}{N_1} + \frac{S_p^2}{N_2}} = S_p\sqrt{\frac{1}{N_1} + \frac{1}{N_2}} = 4.56\sqrt{\frac{1}{10} + \frac{1}{10}} = 2.04 $$
4. Calculate t statistic using formula
Example 19.4
Weight in kg of 10 boys and 10 girls aged between 15 to 20 are given below. Determine
whether there is a significant difference in the mean weight of two groups.
Data:
Boys (Wt in kg) 42 46 50 48 50 52 41 49 51 56
Girls (Wt in kg) 38 41 36 35 30 42 31 29 31 35
groups.
Ha: μ1 ≠ μ2
3. State the level of significance
It is assumed that the level of significance is 0.05.
4. State the number of tails
The number of tails associated with the alternative hypothesis is determined before data
collection. As there are two possible outcomes to the study, this is a two tailed experimental design.
5. Select the appropriate statistical test
As the population variances are not known and the sample size is less than 30, two sample t
test is the appropriate choice.
6. Perform the statistical analysis
1. Calculate mean and standard deviation of each group.
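Calculation of mean
$$ \bar{X} = \frac{\sum X}{N} $$
Mean of weight of Boys = X̄1 = 48.5
Mean of weight of Girls = X̄2 = 34.8
Calculation of standard deviation
$$ SD = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}} $$
$$ SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{S_p^2}{N_1} + \frac{S_p^2}{N_2}} = S_p\sqrt{\frac{1}{N_1} + \frac{1}{N_2}} = 4.54\sqrt{\frac{1}{10} + \frac{1}{10}} = 2.03 $$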
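Carrying the Example 19.4 calculation through to the t statistic can be sketched as follows. The helper pooled_t is a hypothetical name; it follows the pooled-variance steps listed earlier in this section, and the critical value 2.10 is the standard two-tailed table value for 18 degrees of freedom at α = 0.05, entered by hand rather than computed.

```python
import math
import statistics

# Weights in kg from Example 19.4.
boys = [42, 46, 50, 48, 50, 52, 41, 49, 51, 56]
girls = [38, 41, 36, 35, 30, 42, 31, 29, 31, 35]

def pooled_t(sample1, sample2):
    """Return (pooled SD, SE of difference, t, degrees of freedom)."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)   # sample variance, N - 1 denominator
    var2 = statistics.variance(sample2)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var / n1 + pooled_var / n2)
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / se
    return math.sqrt(pooled_var), se, t, n1 + n2 - 2

sp, se, t_value, df = pooled_t(boys, girls)
t_critical = 2.10   # two-tailed, alpha = 0.05, 18 df (table value, entered by hand)

print(f"Sp={sp:.2f} SE={se:.2f} t={t_value:.2f} df={df} reject H0: {abs(t_value) > t_critical}")
```

The pooled SD and SE agree with the figures in the example up to rounding.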
Example 19.5
Systolic blood pressure of 9 normal individuals was taken. Then a known hypotensive drug
was given and blood pressure was again recorded. Did the hypotensive drug lower the systolic blood
pressure?
Data: Blood pressure of nine healthy volunteers before and after injection of hypotensive drug.
BP Before (X1) 122 121 120 115 126 130 120 125 128
BP After (X2)  120 118 115 110 122 130 116 124 125
Solution:
Let us test the hypothesis using following steps:
1. State the null hypothesis
The null hypothesis states that there is no difference between the mean of systolic BP of
healthy volunteers before and after injection of hypotensive drug treatment.
H0: μ1 before = μ2 after
2. State the alternative hypothesis
The alternative hypothesis states that the mean of systolic BP of healthy volunteers before
injection of hypotensive drug treatment is greater than mean of systolic BP after injection of drug.
Ha: μ1 before > μ2 after
3. State the level of significance
It is assumed that the level of significance is 0.05.
4. State the number of tails
The number of tails associated with the alternative hypothesis is determined before data
collection. As there is only one possible outcome to the study i.e. injection of drug lowers the blood
pressure, this is a one tailed experimental design.
5. Select the appropriate statistical test
As the given experiment consists of pairs of observations made on the same subjects before
and after treatment, paired t test is applied.
6. Perform the statistical analysis
1. Find the difference in each set of paired observations before and after (X1 - X2 = x) and its squares
BP Before (X1)  122 121 120 115 126 130 120 125 128
BP After (X2)   120 118 115 110 122 130 116 124 125
Difference (x)    2   3   5   5   4   0   4   1   3   Σx = 27
Squares (x²)      4   9  25  25  16   0  16   1   9   Σx² = 105
2. Calculate the mean of the difference (x̄).
Formula:
$$ \bar{x} = \frac{\sum x}{N} $$
Data: Σx = 27; N = 9
Mean of the difference (x̄) = 27/9 = 3
3. Calculate SD of differences by using formula
Formula:
$$ SD = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N - 1}} $$
Data: Σx² = 105; Σx = 27; N = 9
$$ SD = \sqrt{\frac{105 - \frac{(27)^2}{9}}{9 - 1}} = \sqrt{\frac{24}{8}} = 1.73 $$
SD of differences = 1.73
4. Determine ‘t’ value by using following formula:
$$ t = \frac{\bar{x}}{SE} = \frac{\bar{x}}{SD/\sqrt{N}} = \frac{3}{1.73/\sqrt{9}} = \frac{3}{0.58} = 5.17 $$
7. Decision
The calculated t value (5.17) is more than the critical value (1.86 for 8 degrees of freedom, one
tailed, α = 0.05) and therefore it may be concluded that there was a significant difference in blood
pressure before and after administration of hypotensive drug.
Reject the null hypothesis and accept the alternative hypothesis i.e. the mean of systolic BP of healthy
volunteers before injection of hypotensive drug treatment is greater than mean of systolic BP after
injection of drug.
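The paired t calculation of Example 19.5 can be sketched in a few lines. This is an illustrative script using only the Python standard library; the variable names are not from the text, and the one-tailed critical value 1.86 for 8 degrees of freedom is entered by hand from the t table.

```python
import math
import statistics

# Systolic BP of the nine volunteers in Example 19.5.
before = [122, 121, 120, 115, 126, 130, 120, 125, 128]
after = [120, 118, 115, 110, 122, 130, 116, 124, 125]

# Paired test: work on the within-subject differences x = X1 - X2.
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)            # N - 1 in the denominator
t_value = mean_diff / (sd_diff / math.sqrt(n))

t_critical = 1.86   # one-tailed, alpha = 0.05, 8 df (table value, entered by hand)
print(f"sum x={sum(diffs)} mean={mean_diff} SD={sd_diff:.2f} t={t_value:.2f} "
      f"reject H0: {t_value > t_critical}")
```

Keeping full precision gives a t value slightly above the hand result, because the hand calculation rounds the standard error to 0.58.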
Example 19.6
Serum digoxin levels were determined for nine healthy males following rapid intravenous
injection of the drug. The measurements were made 4h after the injection and again at the end of an 8h
period. Is the difference in the serum digoxin concentration at the end of 4h and at the end of 8h
statistically significant?
Data: Serum digoxin concentration mg/ml after 4h and 8h.
After 4h (X1) 1.0 1.3 0.9 1.0 1.0 0.9 1.3 1.1 1.0
After 8h (X2) 1.0 1.3 0.7 1.0 0.9 0.8 1.2 1.0 1.0
H0: μ1 4h = μ2 8h
2. State the alternative hypothesis
The alternative hypothesis states that the mean serum concentration of digoxin after 4h is
greater than mean serum concentration of digoxin after 8h.
Ha: μ1 4h > μ2 8h
3. State the level of significance
It is assumed that the level of significance is 0.05.
4. State the number of tails
The number of tails associated with the alternative hypothesis is determined before data
collection. As there is only one possible outcome to the study, this is a one tailed experiment.
5. Select the appropriate statistical test
As the given experiment consists of pairs of observations made on the same subjects at two
time points, paired t test is applied.
6. Perform the statistical analysis
1. Find the difference in each set of paired observations after 4h and after 8h (X1 - X2 = x) and its squares
After 4h (X1)   1.0  1.3  0.9  1.0  1.0  0.9  1.3  1.1  1.0
After 8h (X2)   1.0  1.3  0.7  1.0  0.9  0.8  1.2  1.0  1.0
Difference (x)  0    0    0.2  0    0.1  0.1  0.1  0.1  0     Σx = 0.6
Squares (x²)    0    0    0.04 0    0.01 0.01 0.01 0.01 0     Σx² = 0.08
2. Calculate the mean of the difference (x̄).
Formula:
$$ \bar{x} = \frac{\sum x}{N} $$
Data: Σx = 0.6; N = 9
Mean of the difference (x̄) = 0.6/9 = 0.067
Data: Σx² = 0.08; Σx = 0.6; N = 9
$$ SD = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N - 1}} = \sqrt{\frac{0.08 - \frac{(0.6)^2}{9}}{9 - 1}} = \sqrt{\frac{0.04}{8}} = 0.07 $$
SD of differences = 0.07
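Assuming the remaining step follows the same pattern as Example 19.5, the t value for the digoxin data would work out roughly as shown below. Intermediate values are kept to three significant figures here, so the result can differ slightly from one obtained with the rounded figures 0.067 and 0.07.

$$ t = \frac{\bar{x}}{SD/\sqrt{N}} = \frac{0.0667}{0.0707/\sqrt{9}} = \frac{0.0667}{0.0236} \approx 2.83 $$

Under this sketch the calculated t exceeds the one-tailed critical value of 1.86 for 8 degrees of freedom at the 0.05 level, which would point towards rejecting the null hypothesis.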
2. Variable 1 Range - Enter the cell reference for the first range of data we want to analyze.
The range must consist of a single column or row of data.
3. Variable 2 Range - Enter the cell reference for the second range of data we want to analyze.
The range must consist of a single column or row of data.
4. Hypothesized Mean Difference - Enter the number we want for the shift in sample means.
A value of 0 (zero) indicates that the sample means are hypothesized to be equal.
5. Variable 1 Variance (known) - Enter the known population variance for the Variable 1
input range.
6. Variable 2 Variance (known) - Enter the known population variance for the Variable 2
input range.
7. Labels - Select if the first row or column of our input ranges contains labels. Clear this
check box if our input ranges have no labels; Microsoft Excel generates appropriate data
labels for the output table.
9. Alpha - Enter the significance level for the test. Generally it is 0.05.
10. Leave the other items at their default selections. Click OK.
This will give the result of z test for two independent samples.
Data:
Formulation 1 104.1, 108.2, 108.6, 100.8, 106.5, 101.0, 102.6, 99.2, 95.2, 100.8
Formulation 2 102.9, 99.6, 98.1, 104.2, 90.2, 101.0, 99.9, 89.5, 95.5, 98.6
Excel Solution:
Step I
1. Open new MS-Excel file from MS-office.
2. Enter data into Sheet1.
3. In the first row put the labels for the variables (i.e. in cells A1 and B1).
4. Enter data for variable 1 i.e formulation 1 in column ‘A’
5. Enter data for variable 2 i.e. formulation 2 into column ‘B’.
Step II
1. After entering data into the Sheet 1 Select: Tools/ Data Analysis/ t-Test: Two Sample
assuming Unequal Variances.
2. Then click on OK button.
Step III
1. For the Input Range for Variable 1, Select A1:A11 (Label along with values).
2. For the input range for Variable 2, Select B1:B11 (Label along with values).
3. Click on Labels. Leave the other items at their default selections.
This dialog box is shown below. Click on OK button.
Step IV
1. The following output is created in Sheet 4
                              Formulation 1  Formulation 2
Mean                          102.7          97.95
Variance                      17.78666667    24.145
Observations                  10             10
Hypothesized Mean Difference  0
df                            18
t Stat                        2.319650459
P(T<=t) one-tail              0.016157567
t Critical one-tail           1.734063592
P(T<=t) two-tail              0.032315134
t Critical two-tail           2.100922037
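The Mean, Variance and t Stat rows of this output can be cross-checked outside Excel. The sketch below assumes the unequal-variance (Welch) form of the statistic, matching the tool selected in Step II, and uses only the Python standard library; the variable names are illustrative. The Welch degrees of freedom come out fractional, and it is assumed here that the tool reports them rounded to the nearest whole number.

```python
import math
import statistics

# Assay data for the two formulations, as entered in columns A and B.
f1 = [104.1, 108.2, 108.6, 100.8, 106.5, 101.0, 102.6, 99.2, 95.2, 100.8]
f2 = [102.9, 99.6, 98.1, 104.2, 90.2, 101.0, 99.9, 89.5, 95.5, 98.6]

n1, n2 = len(f1), len(f2)
v1, v2 = statistics.variance(f1), statistics.variance(f2)   # sample variances

# Each sample's contribution to the squared standard error of the difference.
a, b = v1 / n1, v2 / n2
t_stat = (statistics.mean(f1) - statistics.mean(f2)) / math.sqrt(a + b)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

print(f"variances: {v1:.4f}, {v2:.4f}")
print(f"t Stat = {t_stat:.4f}, df = {df_welch:.2f} (rounded: {round(df_welch)})")
```

Because both samples have 10 observations, the pooled-variance t statistic has the same numerical value as the Welch statistic; only the degrees of freedom are computed differently.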
Excel Solution:
Step I
1. Open new MS-Excel file from MS-office.
2. Enter data into Sheet1.
3. In the first row put the labels for the variables (i.e. in cells A1 and B1).
4. Enter data for variable 1 i.e. ‘BP Before’ into column ‘A’.
5. Enter data for variable 2 i.e. ‘BP After’ into column ‘B’.
                              BP Before  BP After
Mean                          123        120
Variance                      21.75      36.25
Observations                  9          9
Pearson Correlation           0.979375
Hypothesized Mean Difference  0
df                            8
t Stat                        5.196152
P(T<=t) one-tail              0.000413
t Critical one-tail           1.859548
P(T<=t) two-tail              0.000826
t Critical two-tail           2.306004
Interpretation of Results
Depending on the hypothesis, level of significance and number of tails associated with
design, one can interpret above obtained result for paired t test.
At (N-1) = 8 degrees of freedom, 0.05 level of significance critical t value for one tail is 1.86.
The calculated t value (5.19) is more than critical value (1.86) and therefore it may be concluded that
there was significant difference in blood pressure before and after administration of hypotensive
drug. Therefore, reject the null hypothesis and accept the alternative hypothesis that the mean systolic
BP of healthy volunteers decreases after hypotensive drug injection.
Summary
Statistical tests for two samples
Type                 Parametric test        Non-parametric test
Independent samples  z test (sample > 30)   Chi-square test (sample > 20)
                     t test (sample < 30)   Fisher exact test (sample < 20)
                                            Mann Whitney U test (Metric)
Paired samples       Paired t test          McNemar’s test (Categorical)
                                            Wilcoxon signed rank test (Metric)
$$ t = \frac{\bar{x} - 0}{SE} = \frac{\bar{x} - 0}{SD/\sqrt{N}}, \qquad SD = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N - 1}} $$
Exercise
1. In a group of 169 boys in the age group of 12-20 years the mean height was 168 cm with a standard
deviation of 14 cm. In a comparable group of 54 girls the mean height was 153 cm with a standard
deviation of 8 cm. Does height differ with sex?
2. In a group of 100 infants of 8 month age, the mean weight was 6.9 kg with a standard deviation of
1.1 kg. In a comparable group of 169 infants the mean weight was 7.3 kg with a standard deviation of
0.91 kg. Test whether mean weights are significantly different.
3. The mean plasma potassium level for 50 adult males with a disease was found to be 3.35 mEq/l with
a standard deviation of 0.5 mEq/l. The normal 50 adult male value for plasma potassium is 4.6 mEq/l
with a standard deviation of 0.01 mEq/l. Based on above data, can it be concluded that males with
disease have lower plasma potassium levels than normal males?
4. Mean systolic blood pressure of 54 normal adults was 75 with a standard deviation of 6. In 31
diseased adults, mean systolic blood pressure was 69 with a standard deviation of 5. Test whether
mean systolic blood pressure of two groups differ significantly?
5. In a nutritional study, 13 children (group A) were given a usual diet plus vitamin A and D tablets
while the second group (B) of 12 children was taking the usual diet. After 24 months, the gain in
weight in kg was noted as given in table below. Can we say that vitamins A and D were responsible for
this difference?
Group A 5 3 4 3 2 6 3 2 3 6 7 5 3
Group B 1 3 2 4 2 1 3 4 3 2 2 3
6. In the experiment, there are two groups of 15 subjects. For the first group, each individual was pre-
treated with oral rifampicin (600 mg daily for 10 days). The other group acted as a control, receiving
only placebo pre-treatment. All subjects then received an intravenous injection of theophylline (3
mg/kg). A series of blood samples was obtained after the theophylline injections, and analysed for
drug content. The efficiency of removal of theophylline was reported as a clearance value. Is the
increase in clearance of theophylline due to rifampicin increasing the ability of the liver to eliminate
this drug? Clearance of theophylline (ml/min/kg) for control subjects and for those pre-treated with
rifampicin are given below.
Control 0.81 1.06 0.43 0.54 0.68 0.56 0.45 0.88 0.73 0.43 0.46 0.43 0.37 0.73 0.93
Treated 1.15 1.28 1.00 0.95 1.06 1.15 0.72 0.79 0.67 1.21 0.92 0.67 0.76 0.82 0.82
7. Two groups of hypertensive patients are subjected to two different treatment regimens and systolic
blood pressure was recorded after specific time period. Do the results of systolic BP listed below
indicate a significant difference between the two therapies at 95% confidence level?
Group I 78 87 75 88 91 82 87 65 80
Group II 75 88 93 86 84 71 91 79 81 86 89
8. A company's new natural product is reported to promote weight loss. To test the claim, a clinical
study of the developed tablet was conducted on ten obese volunteers. Initial weight of volunteers before
the clinical trial and weight after receiving therapy for 2 months were recorded. Was the natural
product clinically successful?
Weight before (kg) 141 124 153 120 116 155 151 155 132 116
Weight after (kg) 130 120 149 120 109 150 151 154 125 110
9. The systolic blood pressures of 12 women between the ages of 20 and 35 were measured before and
after administration of a newly developed oral contraceptive. Is this increase in systolic blood
pressure due to oral contraceptive?
Before 122 126 132 120 142 130 142 137 128 132 128 129
After 127 128 140 119 145 130 148 135 129 137 128 133
10. The extent to which an infant’s health is affected by parental smoking is an important public health
concern. The following data are the urinary concentrations of cotinine (a metabolite of nicotine);
measurements were taken both from a sample of infants who had been exposed to household smoke
and from a sample of unexposed infants. Comment.
Unexposed 8 11 12 14 20 43 111
Exposed 35 56 83 92 128 150 176 208
Answers:
Multiple Choice Questions
1. c 2. b 3. a 4. d 5. d 6. c 7. a 8. a 9. c 10. b
Exercise
1. Yes, height differs with sex. (SE= 1.53; z=9.8)
2. Yes, mean weights are significantly different. (SE=0.13; z=-3.07)
3. It can be concluded that males with disease have significantly lower plasma potassium levels than
normal males. (SE=0.07; z=-17.67)
4. Yes, mean systolic blood pressure of two groups differs significantly. (SE=1.21; z= 4.94)
5. The calculated t value (2.74) is more than critical value (At 23 df, t critical for two tailed test at 5%
level of significance is 2.07) and therefore it can be concluded that vitamins A and D were responsible
Learning objectives
When we have finished this chapter, we should be able to:
1. Understand meaning of ANOVA.
2. Calculate ANOVA by definitional formula.
3. Calculate ANOVA by computational formula.
What is ANOVA?
The ANOVA is used to identify and measure sources of variation within a collection of
observations, hence the name analysis of variance. Analysis of variance is a parametric statistical
technique that has found extensive applications in scientific research, mainly because of its
flexibility. This method may be employed to analyse both paired and independent data and also is
used to simultaneously compare large number of variables.
The one-way ANOVA is nothing more than an expansion of the t-test to more than two
groups of sample. The analysis of variance involves determining if the observed values belong to the
same population, regardless of the group, or whether the observations in at least one of these groups
come from a different population.
H0 : μ1 = μ2 = μ3 .... = μk = μ
To obtain an F value we need two estimates of the population variance. It is necessary to
examine the variability (analysis of variance) of observations within groups as well as between
groups. The F statistic is computed using a simplified ratio similar to the t-test.
$$ F = \frac{\text{Mean squared between (MSB)}}{\text{Mean squared within (MSW)}} \tag{1} $$
To calculate the F-statistic for the decision rule either the definitional or computational
formulas may be used. With the exception of rounding errors, both methods will produce the same
results. In the former case the sample means and standard deviations are used.
Where,
n1, n2, n3 =number of observations in group 1, 2 and 3.
k = number of groups.
s1, s2, s3 = standard deviation of group 1, 2 and 3.
N = total number of observations in all groups.
2. Determine the grand mean or pooled mean by using the following formula,
\[ \bar{X}_G = \frac{(n_1 \bar{X}_1) + (n_2 \bar{X}_2) + (n_3 \bar{X}_3) + \ldots + (n_k \bar{X}_k)}{N} \tag{3} \]
where \(\bar{X}\) = mean.
3. Now, the mean squared between (MSB) is calculated similar to a sample variance by squaring the
difference between each sample mean and the grand mean, and multiplying by the number of
observations associated with each sample mean
\[ MSB = \frac{n_1(\bar{X}_1 - \bar{X}_G)^2 + n_2(\bar{X}_2 - \bar{X}_G)^2 + n_3(\bar{X}_3 - \bar{X}_G)^2 + \ldots + n_k(\bar{X}_k - \bar{X}_G)^2}{k - 1} \tag{4} \]
4. Finally, the F statistic is calculated by the formula
\[ F = \frac{MSB}{MSW} \]
5. Compare the calculated F value with critical value from F table (Appendix III ) and take decision.
6. The greater the spread of the sample observations, the larger the denominator and the smaller the
calculated statistic, and thus the lesser the likelihood of rejecting H0. The greater the differences
between the means, the larger the numerator, the larger the calculated statistic, and the greater the
likelihood of rejecting H0 in favor of Ha.
Example 20.1
The rates of release (%) of diclofenac sodium from three controlled release formulations after 2 hr
are given in the following table. Is there a difference between the release kinetics of these three
formulations?
Formulation 1 4.21 5.23 4.01 6.00 5.25 6.41 4.52 4.18 6.05 4.66
Formulation 2 3.69 3.99 4.25 4.08 5.22 2.99 5.66 4.25
Formulation 3 5.00 6.55 6.02 6.11 4.88 4.29 6.00 5.45 5.04
Answer: Let us perform ANOVA following the steps given below
1. State the null hypothesis
In this experiment the null hypothesis states that the mean rate of release of diclofenac sodium
from the three formulations is identical.
\[ H_0 : \mu_1 = \mu_2 = \mu_3 \]
2. State the alternative hypothesis
The mean rate of release of diclofenac sodium from the three formulations is not identical.
One way analysis of Variance (ANOVA) 307
\[ H_a : \mu_1 \neq \mu_2 \neq \mu_3 \]
3. State the level of significance
It is assumed that the level of significance is 0.05.
4. State the number of tails
As the experiment involves a multiple hypothesis test, the outcome is two tailed.
5. Select the appropriate statistical test
This is a case where more than two means will be simultaneously compared and hence a
multiple hypothesis test is required.
There is one factor, i.e. formulation, of which there are three sub categories F1, F2 and F3, so
one way ANOVA is required.
6. Perform the statistical test
1. Calculate the denominator of the F statistic, the mean squared within (MSW), in the same way as
the pooled variance is calculated for the t test. The divisor is the within-groups degrees of freedom, N - k.
Formula:
\[ MSW = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2 + \ldots + (n_k - 1)s_k^2}{N - k} \]
Data: n1 = 10, n2 = 8, n3 = 9; k = number of groups = 3; N = 27
Calculation for standard deviation
Formula:
\[ SD = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} \]
Data: s1 = standard deviation of group 1 = 0.87
s2 = standard deviation of group 2 = 0.84
s3 = standard deviation of group 3 = 0.73
\[ MSW = \frac{(10 - 1)0.87^2 + (8 - 1)0.84^2 + (9 - 1)0.73^2}{27 - 3} = \frac{16.01}{24} = 0.67 \]
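2. Determine the grand mean or pooled mean by using the following formula
Formula:
\[ \bar{X}_G = \frac{(n_1 \bar{X}_1) + (n_2 \bar{X}_2) + (n_3 \bar{X}_3) + \ldots + (n_k \bar{X}_k)}{N} \]
Calculation of mean
\[ \bar{X} = \frac{\sum X}{n} \]
Data: \(\bar{X}_1\) = mean = 5.05; \(\bar{X}_2\) = mean = 4.27; \(\bar{X}_3\) = mean = 5.48
\[ \bar{X}_G = \frac{(10 \times 5.05) + (8 \times 4.27) + (9 \times 5.48)}{27} = \frac{133.98}{27} = 4.96 \]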
3. Now, the mean squared between (MSB) is calculated similar to a sample variance by squaring the
difference between each sample mean and the grand mean, and multiplying by the number of
observations associated with each sample mean
Formula:
\[ MSB = \frac{n_1(\bar{X}_1 - \bar{X}_G)^2 + n_2(\bar{X}_2 - \bar{X}_G)^2 + n_3(\bar{X}_3 - \bar{X}_G)^2 + \ldots + n_k(\bar{X}_k - \bar{X}_G)^2}{k - 1} \]
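The definitional steps of this example can be sketched in Python as a cross-check. This is a minimal illustration, not part of the original text: the function name `anova_definitional` is hypothetical, and the within-groups divisor is assumed to be N - k (the within-groups degrees of freedom). Only the standard library is used.

```python
from statistics import mean, variance

# Release data (% after 2 hr) from Example 20.1
groups = {
    "Formulation 1": [4.21, 5.23, 4.01, 6.00, 5.25, 6.41, 4.52, 4.18, 6.05, 4.66],
    "Formulation 2": [3.69, 3.99, 4.25, 4.08, 5.22, 2.99, 5.66, 4.25],
    "Formulation 3": [5.00, 6.55, 6.02, 6.11, 4.88, 4.29, 6.00, 5.45, 5.04],
}


def anova_definitional(samples):
    """One-way ANOVA from group sizes, means and variances (hypothetical helper)."""
    k = len(samples)
    sizes = [len(s) for s in samples]
    means = [mean(s) for s in samples]
    variances = [variance(s) for s in samples]  # sample variance, divisor n - 1
    n_total = sum(sizes)
    # MSW: pooled variance; the divisor N - k is an assumption (within-groups df)
    msw = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    # Grand mean: group means weighted by group size
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    # MSB: weighted squared deviations of group means from the grand mean
    msb = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means)) / (k - 1)
    return msw, grand_mean, msb, msb / msw


msw, grand_mean, msb, f_value = anova_definitional(list(groups.values()))
print(f"MSW = {msw:.2f}, grand mean = {grand_mean:.2f}, MSB = {msb:.2f}, F = {f_value:.2f}")
```

Because the code works from unrounded means and variances, small differences from hand calculations with rounded values are to be expected.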
Example 20.2
Four brands of cereal are compared to see if they produce significant weight gain in rats. Four
groups of seven rats each were given a diet of the respective cereal brand. At the end of the
experimental period, the rats were weighed and the weight was compared to the weight just prior to
the start of the cereal diet. Determine whether each brand has a statistically significant effect on the
amount of weight gain. The data are provided in the table below.
Brand A 9 7 8 8 7 8 8
Brand B 5 4 6 4 5 7 3
Brand C 2 1 1 2 2 3 2
Brand D 3 8 5 9 2 7 8
Answer:
Let us perform ANOVA following the steps given below
1. State the null hypothesis
In this experiment the null hypothesis is that the mean weight gain of rats from the four brands of cereal
is identical.
\[ H_0 : \mu_1 = \mu_2 = \mu_3 = \mu_4 \]
2. State the alternative hypothesis
The mean weight gain of rats from the four brands of cereal is not identical.
\[ H_a : \mu_1 \neq \mu_2 \neq \mu_3 \neq \mu_4 \]
3. State the level of significance
It is assumed that the level of significance is 0.05.
4. State the number of tails
As the experiment involves a multiple hypothesis test, the outcome is two tailed.
5. Select the appropriate statistical test
This is a case where more than two means will be simultaneously compared and hence a
multiple hypothesis test is required.
There is one factor, i.e. brand of cereal, of which there are four sub categories A, B, C and
D, so one way ANOVA is required.
6. Perform the statistical test
1. Calculate the denominator of the F statistic, the mean squared within (MSW), in the same way as
the pooled variance is calculated for the t test. The divisor is the within-groups degrees of freedom, N - k.
Formula:
\[ MSW = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2 + \ldots + (n_k - 1)s_k^2}{N - k} \]
Data: n1 = 7, n2 = 7, n3 = 7, n4 = 7
k = number of groups = 4
Calculation for standard deviation
Formula:
\[ SD = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} \]
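Formula:
\[ \bar{X}_G = \frac{(n_1 \bar{X}_1) + (n_2 \bar{X}_2) + (n_3 \bar{X}_3) + \ldots + (n_k \bar{X}_k)}{N} \]
Calculation of mean
\[ \bar{X} = \frac{\sum X}{n} \]
Data: \(\bar{X}_1\) = mean = 7.86; \(\bar{X}_2\) = mean = 4.86; \(\bar{X}_3\) = mean = 1.86; \(\bar{X}_4\) = mean = 6.0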
3. Now, the mean squared between (MSB) is calculated similar to a sample variance by squaring the
difference between each sample mean and the grand mean, and multiplying by the number of
observations associated with each sample mean
Formula:
\[ MSB = \frac{n_1(\bar{X}_1 - \bar{X}_G)^2 + n_2(\bar{X}_2 - \bar{X}_G)^2 + n_3(\bar{X}_3 - \bar{X}_G)^2 + \ldots + n_k(\bar{X}_k - \bar{X}_G)^2}{k - 1} \]
\[ SST = \sum X^2 - \frac{(\sum X)^2}{N} \tag{5} \]
Where,
\(\sum X^2\) = sum of squares of all observations.
\[ SSB = \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \frac{(\sum X_3)^2}{n_3} + \ldots - \frac{(\sum X_T)^2}{N} \tag{6} \]
6. Prepare ANOVA Table
Example 20.3
The rates of release (%) of diclofenac sodium from three controlled release formulations after 2 hr
are given in the following table. Is there a difference between the release kinetics of the three
formulations?
Formulation 1 4.21 5.23 4.01 6.00 5.25 6.41 4.52 4.18 6.05 4.66
Formulation 2 3.69 3.99 4.25 4.08 5.22 2.99 5.66 4.25
Formulation 3 5.00 6.55 6.02 6.11 4.88 4.29 6.00 5.45 5.04
Answer: Let us perform ANOVA following the steps given below:
1. State the null hypothesis
In this experiment the null hypothesis is that the mean rate of release of diclofenac sodium from the three
formulations is identical.
\[ H_0 : \mu_1 = \mu_2 = \mu_3 \]
2. State the alternative hypothesis
The mean rate of release of diclofenac sodium from the three formulations is not identical.
\[ H_a : \mu_1 \neq \mu_2 \neq \mu_3 \]
3. State the level of significance
It is assumed that the level of significance is 0.05.
4. State the number of tails
As the experiment involves a multiple hypothesis test, the outcome is two tailed.
5. Select the appropriate statistical test
This is a case where more than two means will be simultaneously compared and hence a
multiple hypothesis test is required.
There is one factor, i.e. formulation, of which there are three sub categories F1, F2 and F3,
so one way ANOVA is required.
6. Perform the statistical test
1. Calculate total sum of squares (SST)
SST can be calculated using the formula
Formula:
\[ SST = \sum X^2 - \frac{(\sum X)^2}{N} \]
\(\sum X^2\) = sum of squares of all observations
\((\sum X)^2\) = square of summation of observations
Data:
\((\sum X)^2 = (50.52 + 34.13 + 49.34)^2\)
\(\sum X^2 = (262.03 + 150.53 + 274.8)\)
\((\sum X_T)^2 = 17953.3\)
4. Calculate between mean sum of squares (MSB) and within mean sum of squares (MSW)
Mean sum of squares between groups,
MSB = SSB/Df
Df = degree of freedom =No. of groups-1 = k-1 =3-1 =2
MSB= SSB/Df = 6.39/2 = 3.2
Mean sum of squares within group,
MSW= SSW/Df
Df = degree of freedom = No. of observations - k= 27 -3=24
MSW= SSW/Df = 16.03/24= 0.67
5. Calculate F statistics
F statistics can be calculated by using formula
F = Mean sum of squares between groups/ Mean sum of squares within group
F=MSB/MSW
F = 3.2/0.67 = 4.78
6. Prepare ANOVA Table
Source            Degree of freedom   Sum of squares   Mean squares   F
Between groups    2                   6.39             3.2            4.78
Within groups     24                  16.03            0.67
Total             26                  22.42
7. Decision
If calculated F value is less than the critical F value the null hypothesis is accepted, whereas if
the calculated F value is more than or equal to critical F value, the null hypothesis is rejected in favour
of alternative hypothesis. The observed F value is greater than the critical F value (3.40). It is
concluded that there is a significant difference between the rates of release of diclofenac sodium from
the three controlled release formulations.
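The computational route used in this example can be sketched in Python. This is an illustrative sketch only: the function name `anova_computational` and the dictionary layout are assumptions, and the within sum of squares is taken as SST minus SSB.

```python
# Release data (% after 2 hr) from Example 20.3
samples = [
    [4.21, 5.23, 4.01, 6.00, 5.25, 6.41, 4.52, 4.18, 6.05, 4.66],  # Formulation 1
    [3.69, 3.99, 4.25, 4.08, 5.22, 2.99, 5.66, 4.25],              # Formulation 2
    [5.00, 6.55, 6.02, 6.11, 4.88, 4.29, 6.00, 5.45, 5.04],        # Formulation 3
]


def anova_computational(groups):
    """One-way ANOVA quantities from sums and sums of squares (hypothetical helper)."""
    values = [x for g in groups for x in g]
    n_total, k = len(values), len(groups)
    correction = sum(values) ** 2 / n_total                 # (sum of all X)^2 / N
    sst = sum(x * x for x in values) - correction           # total sum of squares
    ssb = sum(sum(g) ** 2 / len(g) for g in groups) - correction  # between groups
    ssw = sst - ssb                                         # within groups, by subtraction
    msb = ssb / (k - 1)                                     # df between = k - 1
    msw = ssw / (n_total - k)                               # df within = N - k
    return {"SST": sst, "SSB": ssb, "SSW": ssw, "MSB": msb, "MSW": msw, "F": msb / msw}


table = anova_computational(samples)
for name, value in table.items():
    print(f"{name}: {value:.2f}")
```

The printed F value still has to be compared with the critical value from the F table (3.40 for 2 and 24 degrees of freedom at the 5% level); the sketch does not look up critical values.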
Example 20.4
Four brands of cereal are compared to see if they produce significant weight gain in rats. Four
groups of seven rats each were given a diet of the respective cereal brand. At the end of the
experimental period, the rats were weighed and the weight was compared to the weight just prior to
the start of the cereal diet. Determine whether each brand has a statistically significant effect on the
amount of weight gain. The data are provided in the table below.
Brand A 9 7 8 8 7 8 8
Brand B 5 4 6 4 5 7 3
Brand C 2 1 1 2 2 3 2
Brand D 3 8 5 9 2 7 8
Answer:
Let us perform ANOVA following the steps given below
1. State the null hypothesis
In this experiment the null hypothesis is that the mean weight gain of rats from the four brands of cereal
is identical.
\[ H_0 : \mu_1 = \mu_2 = \mu_3 = \mu_4 \]
2. State the alternative hypothesis
The mean weight gain of rats from the four brands of cereal is not identical.
\[ H_a : \mu_1 \neq \mu_2 \neq \mu_3 \neq \mu_4 \]
3. State the level of significance
It is assumed that the level of significance is 0.05.
4. State the number of tails
As the experiment involves a multiple hypothesis test, the outcome is two tailed.
5. Select the appropriate statistical test
This is a case where more than two means will be simultaneously compared and hence a
multiple hypothesis test is required.
There is one factor, i.e. brand of cereal, of which there are four sub categories A, B, C and
D, so one way ANOVA is required.
6. Perform the statistical test
1. Calculate total sum of squares (SST)
SST can be calculated using the formula
Formula:
\[ SST = \sum X^2 - \frac{(\sum X)^2}{N} \]
\(\sum X^2\) = sum of squares of all observations
\((\sum X)^2\) = square of summation of observations
Data:
\((\sum X)^2 = (55 + 34 + 13 + 42)^2\)
\(\sum X^2 = (435 + 176 + 27 + 296)\)
\((\sum X_T)^2 = 20736\)
N = 28
Therefore, the between sum of squares is
4. Calculate between mean sum of squares (MSB) and within mean sum of squares (MSW)
Mean sum of squares between groups,
MSB = SSB/Df
Df = degree of freedom =No. of groups-1 = k-1 =4-1 =3
MSB= SSB/Df = 132.86/3 = 44.28
Mean sum of squares within group,
MSW= SSW/Df
Df = degree of freedom = No. of observations - k= 28 -4=24
MSW= SSW/Df = 60.57/24= 2.52
5. Calculate F statistics
F statistics can be calculated by using formula
F = Mean sum of squares between groups/ Mean sum of squares within group
F=MSB/MSW
F =44.28/2.52 = 17.57
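6. Prepare ANOVA Table
Source            Degree of freedom   Sum of squares   Mean squares   F
Between groups    3                   132.86           44.28          17.57
Within groups     24                  60.57            2.52
Total             27                  193.43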
7. Decision
If calculated F value is less than the critical F value the null hypothesis is accepted, whereas if
the calculated F value is more than or equal to critical F value, the null hypothesis is rejected in favour
of alternative hypothesis. The observed F value is greater than the critical F value (3.01). Therefore, it
is concluded that there is a significant difference between the mean weight gain of rats fed the four
brands of cereal.
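The intermediate sums of squares for this example can be checked from the group totals listed in the Data step. The lines below are a worked sketch, assuming SST = SSB + SSW so that the within sum of squares follows by subtraction; all values are rounded to two decimals.

\[
\begin{aligned}
SST &= \sum X^2 - \frac{(\sum X_T)^2}{N} = 934 - \frac{20736}{28} = 934 - 740.57 = 193.43 \\
SSB &= \frac{55^2}{7} + \frac{34^2}{7} + \frac{13^2}{7} + \frac{42^2}{7} - \frac{20736}{28} = 873.43 - 740.57 = 132.86 \\
SSW &= SST - SSB = 193.43 - 132.86 = 60.57
\end{aligned}
\]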
Step 4:
In the following dialog box, enter the input range that corresponds to the data columns
($A$1:$C$11), check the option "Labels in First Row" and click OK.
SUMMARY
Groups Count Sum Average Variance
Formulation 1 10 50.52 5.052 0.755729
Formulation 2 8 34.13 4.26625 0.703513
Formulation 3 9 49.34 5.482222 0.538094
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 6.38922 2 3.194608 4.782674 0.01786 3.40283
Within Groups 16.0309 24 0.667954
Total 22.4201 26
In this output, the test statistic, F, is reported in the analysis of variance table, F(2, 24) = 4.78.
The p-value for this statistic is reported in the table as 0.018. As the observed F value is greater than the
critical F value at the 5% level of significance, it can be concluded that there is a significant difference
between the rates of release of diclofenac sodium from the three controlled release formulations.
Summary
An independent group ANOVA is an extension of the independent group t-test where we
have more than two groups. This test is used to compare the means of more than two independent
groups and is also called a One Way Analysis of Variance.
ANOVA using definitional formula
\[ MSW = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2 + \ldots + (n_k - 1)s_k^2}{N - k} \]
\[ \bar{X}_G = \frac{(n_1 \bar{X}_1) + (n_2 \bar{X}_2) + (n_3 \bar{X}_3) + \ldots + (n_k \bar{X}_k)}{N} \]
\[ MSB = \frac{n_1(\bar{X}_1 - \bar{X}_G)^2 + n_2(\bar{X}_2 - \bar{X}_G)^2 + n_3(\bar{X}_3 - \bar{X}_G)^2 + \ldots + n_k(\bar{X}_k - \bar{X}_G)^2}{k - 1} \]
\[ F = \frac{MSB}{MSW} \]
Exercise
1. During the manufacture of a diclofenac coated tablet, samples were periodically selected from
production lines at three different facilities. Weights were taken for 10 tablets and their average
weights listed in table. Is there any significant difference in weights of the tablets between the three
facilities? F Critical at 2 degrees of freedom in numerator and 27 degrees of freedom in denominator
is 3.35 at 5% level of significance.
Facility A 77.3 80.3 79.1 75.2 73.6 76.7 81.7 78.7 78.4 72.9
Facility B 71.6 74.8 71.2 77.6 74.5 75.7 76.1 75.9 75.5 74.0
Facility C 75.5 74.2 67.5 74.2 70.5 84.4 75.6 77.1 72.2 73.4
2. Four brands of diclofenac sodium were selected and assayed according to IP. Results of assay in
terms of percent labeled amount of drug are listed in table. Was there any significant difference based
on brands in terms of labeled claim? F Critical at 3 degrees of freedom in numerator and 36 degrees of
freedom in denominator is 2.86 at 5% level of significance.
Brand A 100 99.8 99.5 100.1 99.7 99.9 100.4 100 99.7 99.9
Brand B 99.5 100 99.3 99.9 100.3 99.5 99.6 98.9 99.8 100.1
Brand C 99.6 99.3 99.5 99.1 99.7 99.6 99.4 99.5 99.5 99.9
Brand D 99.8 100.5 100 100.1 99.4 99.6 100.2 99.9 100.4 100.1
3. Four brands of Atenolol were under investigation for their antihypertensive efficacy. Four groups
were selected comprising of 10 volunteers in each group. According to designed protocol drug was
given to each group and mean systolic blood pressure was listed in table. Determine if there is
significant difference in mean systolic blood pressure in order to assess the role of different brands of
Atenolol. F Critical at 3 degrees of freedom in numerator and 36 degrees of freedom in denominator is
2.86 at 5% level of significance.
Brand A 125 130 135 120 115 120 130 135 140 135
Brand B 120 122 115 110 125 122 120 120 126 120
Brand C 120 115 115 130 120 125 122 115 126 118
Brand D 118 120 118 120 120 115 125 125 120 115
4. Four treatments are given to four groups of patients with anaemia. Increase in Hb% level was noted
after one month . Find whether the difference in improvement in four groups is significant or not.
Group A 3 1 2 0 1 2 2
Group B 3 2 2 3 1 3 2
Group C 3 4 5 4 2 2 4
Group D 4 2 3 1 4 5 1
F Critical at 3 degrees of freedom in numerator and 24 degrees of freedom in denominator is 3.01 at
5% level of significance.
5. Pharmaceutical company suspected that four filling machines were not filling the bottles in a
uniform way. An experiment on four machines were performed and recorded in table. Find whether
there is a significant difference in the filling performance of four machines? F Critical at 3 degrees of
freedom in numerator and 19 degrees of freedom in denominator is 3.24 at 5% level of significance.
6. Dissolution test was performed for different brands of paracetamol available in India. Dissolution
efficiencies were calculated and reported in table. Find whether there is a significant difference in the
dissolution efficiencies of six brands of paracetamol? F Critical at 5 degrees of freedom in numerator
and 30 degrees of freedom in denominator is 2.53 at 5% level of significance.
Brand A 63.33 56.67 56.67 60.00 56.67 53.33
Brand B 63.33 63.33 60.00 66.67 60.00 73.33
Brand C 50.00 46.67 53.33 50.00 46.67 53.33
Brand D 53.33 56.67 46.67 50.00 45.00 46.67
Brand E 45.00 46.67 48.33 45.00 46.67 48.33
Brand F 50.00 53.33 50.00 46.67 56.67 60.00
7. The students are doing titrimetric estimation of certain compound in the laboratory. Three groups
have reported their analysis results as given below.
Group A 18.0 16.4 15.7 19.6 16.5 18.2
Group B 21.1 17.8 18.6 20.8 17.9 19.0
Group C 16.1 17.8 16.5
Perform ANOVA at 0.05 level of significance to test whether the differences among the
sample means are significant.
8. Three chemicals A, B and C show the cleaning efficiency as given below. Find whether the
differences among them are significant at 5% significance level.
Chemical A 80 77 76 81 71
Chemical B 70 58 72 66 74
Chemical C 77 80 82 85 76
9. If five different liquid filling machines fill 30 ml medicines into a container as shown below, find
whether the differences amongst the five machines is significant at 0.01 significance level by
performing ANOVA.
Machine A 30 25 27 26
Machine B 29 28 26 29
Machine C 37 32 32 35
Machine D 32 33 34 29
Machine E 31 26 27 32
10. In a pharmaceutical company, three different brands of lubricants were used during tableting. The
quantity required for three brands is given below. Test at the 0.01 level of significance whether the
differences among the three sample means are significant.
Lubricant A 8 14 10 10 13 12 13 12
Lubricant B 11 7 6 8 9 11 8 9 12 8 9
Lubricant C 7 7 5 6 10 4 8 9
Answers:
Multiple Choice Questions
1. d 2. a 3. b 4. a 5. a 6. b 7. b 8. b 9. a 10. d
Exercise
1. The observed F value is less than the critical F value (3.35). It is concluded that there is no
significant difference in weights of the tablets between the three facilities.
2. The observed F value is more than the critical F value (2.86). It is concluded that there is a
significant difference in terms of labeled claim of four brands.
3. The observed F value is more than the critical F value (2.86). It is concluded that there is a
significant difference in mean systolic blood pressure produced by different brands of Atenolol.
4. The observed F value is more than the critical F value (3.01). It is concluded that there is a
significant difference in mean %Hb in four groups.
Source Degree of Sum of Mean F
freedom squares squares
Between groups 3 13.25 4.41 3.34
Within groups 24 31.71 1.32
Total 27 44.96
5. The observed F value is less than the critical F value (3.24). It is concluded that there is no
significant difference in the filling performance of four machines.
6. The observed F value is more than the critical F value (2.53). It is concluded that there is a
significant difference in the dissolution efficiencies of six brands of paracetamol.
7. The observed F value is more than the critical F value (3.89). It is concluded that there is a
significant difference among sample means of three groups.
8. The observed F value is more than the critical F value (3.89). It is concluded that there is a significant
difference in the cleaning efficiency of chemicals A, B and C.
9. The observed F value is more than the critical F value (6.42). It is concluded that there is a
significant difference in filling performance of five liquid filling machines at 0.01 level of
significance.
10. The observed F value is more than the critical F value (5.61). It is concluded that there is a
significant difference in three brands of lubricants at 0.01 level of significance.
Learning objectives
When we have finished this chapter, we should be able to:
1. Understand meaning of correlation.
2. Understand types of correlation.
3. Calculate the correlation coefficient.
What is correlation?
The relationship between two metric continuous variables is called correlation. The easiest
way to visualise the relationship between two continuous variables is graphically, using a scatter plot.
One variable is labeled X and plotted on the x-axis of the graph or the abscissa. The second variable, Y
is plotted on the vertical y-axis or the ordinate. The first role of correlation is to determine the strength
of the relationship between the two variables represented on the x-axis and the y-axis. The measure of
this magnitude is called the correlation coefficient. This index measures both the magnitude and the
direction of the relationships:
+ 1.0 perfect positive correlation
0.0 no correlation
- 1.0 perfect negative correlation
Types of correlation
1. Perfectly Positive Correlation
2. Perfectly Negative Correlation
3. Moderately Positive Correlation
4. Moderately Negative Correlation
5. Absolutely no correlation
5. Absolutely No Correlation
If the points fall within a circle there is no correlation. Here the two variables are not
related to one another.
Correlation Coefficient
The strength of the relationship (correlation coefficient) can be calculated by using Pearson
correlation coefficient for parametric data while Spearman rank correlation is used for non parametric
procedures.
Significance of a correlation coefficient
A positive or negative correlation between two variables shows that a relationship exists.
Whether it is strong or weak correlation, it can be roughly interpreted as given in following
table.
Value of r      Interpretation
< 0.20          Slight correlation
0.20 - 0.40     Low correlation
0.40 - 0.70     Moderate correlation
0.70 - 0.90     High correlation
> 0.90          Very high correlation
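The rough interpretation table can be expressed as a small lookup function. This is a sketch under stated assumptions: the function name `describe_correlation` is hypothetical, the label is based on the magnitude of r only, and the treatment of the boundary values (0.20, 0.40, 0.70, 0.90) is a choice made here because the table does not say which band they belong to.

```python
def describe_correlation(r):
    """Return the verbal label from the interpretation table (hypothetical helper)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    size = abs(r)  # the label depends on magnitude; the sign only gives direction
    if size < 0.20:
        return "slight"
    if size < 0.40:   # assumption: 0.20 itself counts as low
        return "low"
    if size < 0.70:   # assumption: 0.40 itself counts as moderate
        return "moderate"
    if size <= 0.90:  # assumption: 0.90 itself counts as high
        return "high"
    return "very high"


for r in (0.10, -0.35, 0.56, 0.90, -0.949):
    direction = "negative" if r < 0 else "positive"
    print(f"r = {r:+.3f}: {describe_correlation(r)} {direction} correlation")
```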
\[ r = \frac{\sum (x - \bar{X})(y - \bar{Y})}{\sqrt{\sum (x - \bar{X})^2 \sum (y - \bar{Y})^2}} \tag{1} \]
Where
x = value of each measurement on x axis
y = value of each measurement on y axis
\(\bar{X}\) = Mean for variables on x axis
\(\bar{Y}\) = Mean for variables on y axis
1. Calculate the mean for both the x variable (\(\bar{X}\)) and the y variable (\(\bar{Y}\)).
\[ \bar{X} = \frac{\sum x}{N} \tag{2} \]
\[ \bar{Y} = \frac{\sum y}{N} \tag{3} \]
Where
x = value of each measurement on x axis
y = value of each measurement on y axis
N = number of observations
2. Now, calculate the deviations of individual scores x and y about their means, \((x - \bar{X})\) and \((y - \bar{Y})\).
3. Then, multiply these deviations, \((x - \bar{X})(y - \bar{Y})\), and find the sum of the products of these deviations,
\(\sum (x - \bar{X})(y - \bar{Y})\). This is used in the numerator of the equation.
4. The deviations \((x - \bar{X})\) of x variables and deviations \((y - \bar{Y})\) of y variables are squared, which
gives \((x - \bar{X})^2\) and \((y - \bar{Y})^2\); their sums are used in the denominator of the equation.
\[ r = \frac{\sum (x - \bar{X})(y - \bar{Y})}{\sqrt{\sum (x - \bar{X})^2 \sum (y - \bar{Y})^2}} \]
Example 21.1
Samples of drug products are stored in their original containers under normal conditions and
sampled periodically to analyse the content of the medication. Determine correlation coefficient.
Data:
Time (months) 6 12 18 24 36 48
Content (mg) 995 984 973 960 952 948
Solution:
1. Construct the following table
x      y      \((x-\bar{X})\)   \((y-\bar{Y})\)   \((x-\bar{X})(y-\bar{Y})\)   \((x-\bar{X})^2\)   \((y-\bar{Y})^2\)
6      995    -18     26.33     -474     324     693.44
12     984    -12     15.33     -184     144     235.11
18     973    -6      4.33      -26      36      18.78
24     960    0       -8.67     0        0       75.11
36     952    12      -16.67    -200     144     277.78
48     948    24      -20.67    -496     576     427.11
Sum    144    5812    0    0    -1380    1224    1727.33
(the sums are \(\sum x\), \(\sum y\), \(\sum (x-\bar{X})\), \(\sum (y-\bar{Y})\), \(\sum (x-\bar{X})(y-\bar{Y})\), \(\sum (x-\bar{X})^2\) and \(\sum (y-\bar{Y})^2\) respectively)
2. Calculate the mean for both the x variable (\(\bar{X}\)) and the y variable (\(\bar{Y}\)).
\(\bar{X}\) = 144/6 = 24
Correlation 331
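\(\bar{Y}\) = 5812/6 = 968.67
\(\sum (x - \bar{X})(y - \bar{Y})\) = -1380
\(\sum (x - \bar{X})^2\) = 1224
\(\sum (y - \bar{Y})^2\) = 1727.33
3. Calculate the correlation coefficient, r, by using the following formula
\[ r = \frac{\sum (x - \bar{X})(y - \bar{Y})}{\sqrt{\sum (x - \bar{X})^2 \sum (y - \bar{Y})^2}} \]
\[ r = \frac{-1380}{\sqrt{1224 \times 1727.33}} = -0.9490 \]
Correlation coefficient is -0.949.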
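The definitional calculation of this example can be reproduced with a few lines of Python. The sketch below is illustrative only; the function name `pearson_definitional` and the variable names are assumptions, and only the standard library is used.

```python
from math import sqrt

# Stability data from Example 21.1
time_months = [6, 12, 18, 24, 36, 48]
content_mg = [995, 984, 973, 960, 952, 948]


def pearson_definitional(x, y):
    """Pearson r from deviations about the means (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of the deviations
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


r = pearson_definitional(time_months, content_mg)
print(f"r = {r:.3f}")
```

The coefficient should agree with the hand calculation to three decimals; the negative sign reflects the fall in drug content with storage time.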
Where
\(\sum x\) = Summation of x
\(\sum y\) = Summation of y
\(\sum xy\) = Summation of product of x and y
\((\sum x)^2\) = Square of summation of x
\(\sum x^2\) = Summation of square of x
\((\sum y)^2\) = Square of summation of y
\(\sum y^2\) = Summation of square of y
x      y      \(x^2\)    \(y^2\)    xy
x1     y1     ...    ...    ...
x2     y2     ...    ...    ...
x3     y3     ...    ...    ...
...    ...    ...    ...    ...
xn     yn     ...    ...    ...
\(\sum x\)   \(\sum y\)   \(\sum x^2\)   \(\sum y^2\)   \(\sum xy\)
2. Write the values of the x variable in column 1 of the table and the values of the y variable in column 2 of
the table.
3. Write the individual x values squared (\(x^2\)) in column 3 and the individual y values squared (\(y^2\)) in
column 4 of the table.
4. Write the product of x and y for each data point in column 5 of the table.
5. Now using the formula for Pearson correlation, find the value of r.
\[ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - (\sum x)^2}\,\sqrt{n \sum y^2 - (\sum y)^2}} \]
Example 21.2
Samples of drug products are stored in their original containers under normal conditions and
sampled periodically to analyse the content of the medication. Determine correlation coefficient.
Data:
Time (months) 6 12 18 24 36 48
Content (mg) 995 984 973 960 952 948
Solution:
1. Develop the table as shown below
x      y      \(x^2\)    \(y^2\)    xy
6 995 36 990025 5970
12 984 144 968256 11808
18 973 324 946729 17514
24 960 576 921600 23040
36 952 1296 906304 34272
48 948 2304 898704 45504
r = - 0.9490
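The result can be checked by completing the column totals of the table and substituting them into the computational formula. The following is a worked sketch; the totals are computed here from the tabulated rows and are not printed in the table itself.

\[
\begin{aligned}
\sum x &= 144,\quad \sum y = 5812,\quad \sum x^2 = 4680,\quad \sum y^2 = 5631618,\quad \sum xy = 138108 \\
r &= \frac{(6 \times 138108) - (144 \times 5812)}{\sqrt{(6 \times 4680) - (144)^2}\,\sqrt{(6 \times 5631618) - (5812)^2}} \\
  &= \frac{-8280}{\sqrt{7344}\,\sqrt{10364}} \approx -0.949
\end{aligned}
\]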
Example 21.3
The following data were obtained for plotting the calibration curve of a drug. Determine the correlation
coefficient by both methods.
Data:
Concentration (mg/l) 5 10 15 20 25 30
Absorbance 0.11 0.2 0.31 0.4 0.51 0.62
1. Definitional formula
Solution:
1. Construct the following table
x      y      \((x-\bar{X})\)   \((y-\bar{Y})\)   \((x-\bar{X})(y-\bar{Y})\)   \((x-\bar{X})^2\)   \((y-\bar{Y})^2\)
5      0.11   -12.5   -0.248    3.10     156.25   0.0616
10     0.20   -7.5    -0.158    1.18     56.25    0.025
15     0.31   -2.5    -0.048    0.12     6.25     0.0023
20     0.40   2.5     0.042     0.10     6.25     0.0017
25     0.51   7.5     0.152     1.14     56.25    0.0231
30     0.62   12.5    0.262     3.27     156.25   0.0684
Sum    105    2.15    0    0    8.925    437.5    0.1822
(the sums are \(\sum x\), \(\sum y\), \(\sum (x-\bar{X})\), \(\sum (y-\bar{Y})\), \(\sum (x-\bar{X})(y-\bar{Y})\), \(\sum (x-\bar{X})^2\) and \(\sum (y-\bar{Y})^2\) respectively)
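2. Calculate the mean for both the x variable (\(\bar{X}\)) and the y variable (\(\bar{Y}\)).
\(\bar{X}\) = 105/6 = 17.5
\(\bar{Y}\) = 2.15/6 = 0.358
\(\sum (x - \bar{X})(y - \bar{Y})\) = 8.925
\(\sum (x - \bar{X})^2\) = 437.5
\(\sum (y - \bar{Y})^2\) = 0.1822
\[ r = \frac{8.925}{\sqrt{437.5 \times 0.1822}} = 0.999 \]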
Computational Formula
Solution:
1. Develop the table as shown below
x      y      \(x^2\)    \(y^2\)    xy
5      0.11   25     0.0121   0.55
10     0.2    100    0.04     2
15     0.31   225    0.0961   4.65
20     0.4    400    0.16     8
25     0.51   625    0.2601   12.75
30     0.62   900    0.3844   18.6
\(\sum x\) = 105, \(\sum y\) = 2.15, \(\sum x^2\) = 2275, \(\sum y^2\) = 0.9527, \(\sum xy\) = 46.55
\[ r = \frac{(6 \times 46.55) - (105 \times 2.15)}{\sqrt{(6 \times 2275) - (105)^2}\,\sqrt{(6 \times 0.9527) - (2.15)^2}} \]
r = 0.999
Coefficient of correlation r is 0.999
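That the definitional and computational formulas agree for the calibration data can be confirmed with a short Python sketch. The function names `r_definitional` and `r_computational` are hypothetical and the code is an illustration, not part of the original worked example.

```python
from math import sqrt

# Calibration data from Example 21.3
concentration = [5, 10, 15, 20, 25, 30]              # mg/l
absorbance = [0.11, 0.20, 0.31, 0.40, 0.51, 0.62]


def r_definitional(x, y):
    """Pearson r from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_computational(x, y):
    """Pearson r from raw sums, sums of squares and sum of products."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


r_def = r_definitional(concentration, absorbance)
r_comp = r_computational(concentration, absorbance)
print(f"definitional r = {r_def:.3f}, computational r = {r_comp:.3f}")
```

Any difference between the two results should be limited to floating-point rounding, which mirrors the remark that the two hand methods differ only by rounding error.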
Solution:
Step 1:
Open new file in MS-Excel as Book 1.
Enter the data into an Excel datasheet (Sheet 1).
Worksheet will appear as follows:
Summary
Correlation
The relationship between two metric continuous variables is called correlation.
Types of correlation
+ 1.0 perfect positive correlation
0.0 no correlation
- 1.0 perfect negative correlation
Significance of correlation
< 0.2 Slight correlation
0.20 - 0.40 Low correlation
0.40-0.70 Moderate correlation
0.70-0.90 High correlation
> 0.9 Very high correlation
Computation of correlation coefficient
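Using definitional formula
\[ r = \frac{\sum (x - \bar{X})(y - \bar{Y})}{\sqrt{\sum (x - \bar{X})^2 \sum (y - \bar{Y})^2}} \]
Using computational formula
\[ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - (\sum x)^2}\,\sqrt{n \sum y^2 - (\sum y)^2}} \]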
4. Correlation involves
a. Independent variable b. Response variable
c. a and b d. None of above
5. A statistical test used to determine whether a correlation coefficient is statistically significant is
called the ___________.
a. One-way analysis of variance b. t-test for independent samples
c. Chi-square test for contingency tables d. t-test for correlation coefficients
6. In this, the variables are directly proportional to each other.
a. perfect positive correlation b. low correlation
c. moderate correlation d. no correlation
7. The correlation coefficient value of 0.70-0.90 means ______.
a. low correlation b. high correlation
c. moderate correlation d. very high correlation
8. If one variable increases, other variable decreases but the fall is directly proportional to each other.
This is _______ correlation.
a. perfect positive b. moderate
c. perfect negative d. no
9. For non parametric data, correlation is measured by _________.
a. Spearman rank b. Pearson
c. a and b d. none of above
10. The value of correlation coefficient < 0.2 suggests_________.
a. low correlation b. moderate correlation
c. slight correlation d. high correlation
Exercise
1. Following area under curves were observed after clinical trials of different formulations of same
drug. Calculate correlation coefficient between dose and AUC.
Dosage 100 300 600 900 1200
AUC 1.07 5.82 15.85 25.18 33.12
2. Following data were obtained after diffusion experiment of drug. Calculate correlation coefficient.
Time (h) 1 2 3 4 5 6 7
Amount in donor compartment (mg) 248.7 246.6 244.9 243.4 242.1 241.9 240.1
3. Data given are the values for age (X, in years) and systolic blood pressure (Y, in mm Hg) for 15
women. Calculate correlation coefficient between age and systolic BP.
X 42 46 42 71 80 74 70 80 85 72 64 81 41 61 75
Y 130 115 148 100 156 162 151 156 162 158 155 160 125 150 165
4. The following are the height (cm) and the weight ( kilogram) of 10 men. Calculate correlation
coefficient between height and weight.
Height 162 168 174 176 180 180 182 184 186 186
Weight 65 65 84 63 75 76 82 65 80 81
5. Following data were obtained during laboratory experiment of muscular contraction of a rabbit
intestine. The height of the curve was considered as the response to the drug. Calculate correlation
coefficient between dose and response.
Dose of drug (mcg) 0.3 0.4 0.6 0.8 0.9 1.2
Response (mm) 54 59 60 65 70 75
6. In a study on the elimination of a drug in man, the following data were recorded. Calculate
correlation coefficient between time and drug concentration.
Time (h) 0.5 1 2 3 4 5
Drug Concentration (g/ml) 0.39 0.34 0.27 0.2 0.16 0.1
7. An experiment was conducted to study the effect on sleeping time of increasing the dosage of a
barbiturate. Three readings were made at each of three dose levels. Calculate correlation coefficient
between dose and sleeping time.
Dosage (mM/kg) 3 3 3 10 10 10 15 15 15
Sleeping Time (Hrs) 4 6 5 9 8 7 13 11 9
Answers:
Multiple Choice Questions
1. b 2. b 3. d 4. b 5. d 6. a 7. b 8. c 9. a 10. c
Exercise
1. r = 0.998 2. r = -0.984
3. r = 0.564 4. r = 0.514
5. r = 0.981 6. r = -0.993
7. r = 0.900 8. r= 0.680
9. r = -0.01 10. r = 0.948
Chapter 22
LINEAR REGRESSION
Learning objectives
When we have finished this chapter, we should be able to
1. Differentiate between correlation and regression.
2. Understand meaning of regression.
3. Determine regression coefficient (b).
Linear Regression
Linear regression is a statistical method to evaluate how one or more independent variables
influence outcomes for one continuous dependent variable through a linear relationship. For
example, we can predict the weight of children (up to 10 yrs), based on their age. Here age is known as
predictor and also known as independent variable. That which is predicted (weight) is referred to as
dependent variable. By the use of a regression equation, we can predict scores on the dependent
variable from those of independent variable. This is done by finding another constant called
‘Regression Coefficient’.
Let us understand difference between correlation and regression as both describe the
strength of the relationship between two or more continuous variables.
Correlation
1. Relationship between two variables is established but response for dependent variable (y) cannot be
predicted based on independent variable (x).
4. Correlation coefficient gives the idea of relationship between two or more continuous variables.
Linear Regression
1. Relationship between two variables is established but response for dependent variable (y) can be
measured based on independent variable (x).
4. Regression coefficient is useful for predicting the value of y from the value of x, or vice versa.
Meaning of Regression
Regression is based on the relationship or association between two (or more) variables. In
this analysis we have known variables which are used to predict the other variable. The known
variable (or variables) is called the independent variable (or variables). The variable we are trying to
predict is the dependent variable. In regression, we have only one dependent variable in our
regression equation. However, we can use more than one independent variable.
Regression equation can be expressed as,
Y = a + bX ...1
Where,
Y = Dependent variable
X = Independent variable
a = Y intercept
b = Slope of line
The above equation is linear in X and Y, and graphically it represents a straight line. This
straight line is called the regression line.
Regression analysis confined to the study of only one independent variable is called
simple regression. If we are interested in studying the relationship of a dependent variable with
more than one independent variable, it is called multiple regression.
\[
b = \frac{N\sum xy - (\sum x)(\sum y)}{N\sum x^{2} - (\sum x)^{2}} \qquad \ldots 2
\]
Where
Σxy = Summation of product of x and y
Σy = Summation of y
Σx = Summation of x
Σx² = Summation of x²
N = Number of observations
b = Regression coefficient (Slope of line)
2. Write the corresponding values of x and y in columns 1 and 2 respectively and calculate their sums
(Σx and Σy).
3. Write the squared values of x in column 3 and calculate their sum (Σx²).
4. Write the squared values of y in column 4 and calculate their sum (Σy²).
5. Write the products of the values of x and y in column 5 and calculate their sum (Σxy).
6. Using the formula for the regression coefficient, find the value of ‘b’.
\[
b = \frac{N\sum xy - (\sum x)(\sum y)}{N\sum x^{2} - (\sum x)^{2}}
\]
Where
Σxy = Summation of product of x and y
Σy = Summation of y
Σx = Summation of x
Σx² = Summation of x²
N = Number of observations
b = Regression coefficient (Slope of line)
7. Now the y intercept (a) can be calculated using the formula
\[
a = \frac{\sum y - b\sum x}{N} \qquad \ldots 3
\]
8. Now put the values of a and b into the regression equation y = a + bx and find the value of y
corresponding to a given value of x.
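The steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original procedure: the function names `least_squares_line` and `predict` are our own, and the sample data are made up so that the points lie exactly on y = 1 + 2x.

```python
def least_squares_line(x, y):
    """Return (a, b) for the line y = a + b*x using the sum formulas above."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    # Column totals, as in steps 2 to 5 of the procedure.
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x2 = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b = (n * sum_xy - sum_x * sum_y) / denominator  # regression coefficient
    a = (sum_y - b * sum_x) / n                     # y intercept
    return a, b


def predict(a, b, x_new):
    """Step 8: value of y on the fitted line for a given x."""
    return a + b * x_new


# Hypothetical data (not from the text): points exactly on y = 1 + 2x.
a, b = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
print(f"a = {a:.3f}, b = {b:.3f}")        # a = 1.000, b = 2.000
print(f"y at x = 5: {predict(a, b, 5):.3f}")
```

The y² column of the table is not needed for a and b; it is used when the correlation coefficient is also required.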
Example 22.1
Samples of drug products are stored in their original containers under normal conditions and
sampled periodically to analyse the content of the medication. Determine the slope and intercept by the least
squares method.
Data:
Time (months) 6 12 18 24 36 48
Content (mg) 995 984 973 960 952 948
Solution:
1. Develop a table as shown below
x      y       x²      y²        xy
6      995     36      990025    5970
12     984     144     968256    11808
18     973     324     946729    17514
24     960     576     921600    23040
36     952     1296    906304    34272
48     948     2304    898704    45504
Σx = 144   Σy = 5812   Σx² = 4680   Σy² = 5631618   Σxy = 138108
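Substituting the column totals (N = 6) into the formulas for b and a completes the calculation. The arithmetic below is a worked sketch; the number of decimal places shown is our choice.

```latex
\[
b = \frac{6(138108) - (144)(5812)}{6(4680) - (144)^{2}}
  = \frac{828648 - 836928}{28080 - 20736}
  = \frac{-8280}{7344} \approx -1.127
\]
\[
a = \frac{5812 - (-1.127451)(144)}{6}
  = \frac{5812 + 162.353}{6} \approx 995.725
\]
```

The fitted line is therefore approximately y = 995.725 - 1.127x, that is, the content falls by about 1.13 mg per month of storage.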
Example 22.2
The birth weight (X, in kgs) and the increase in weight between 70 and 100
days of life, expressed as a percentage of the birth weight (Y), are given below for 9 infants. Perform linear regression
analysis and determine the slope and intercept. Report a graphical display of the data.
Data:
X 112 111 107 119 80 81 84 106 94
Y 63 66 72 52 118 120 114 72 91
Solution:
1. Develop a table as shown below
x      y      x²       y²       xy
112    63     12544    3969     7056
111    66     12321    4356     7326
107    72     11449    5184     7704
119    52     14161    2704     6188
80     118    6400     13924    9440
81     120    6561     14400    9720
84     114    7056     12996    9576
106    72     11236    5184     7632
94     91     8836     8281     8554
Σx = 894   Σy = 768   Σx² = 90564   Σy² = 70998   Σxy = 73196
[Figure: increase in weight (vertical axis) plotted against birth weight (horizontal axis)]
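With N = 9 and the column totals Σx = 894, Σy = 768, Σx² = 90564 and Σxy = 73196, the slope and intercept follow from the same two formulas. This is a worked sketch of the arithmetic, with rounding chosen by us.

```latex
\[
b = \frac{9(73196) - (894)(768)}{9(90564) - (894)^{2}}
  = \frac{658764 - 686592}{815076 - 799236}
  = \frac{-27828}{15840} \approx -1.757
\]
\[
a = \frac{768 - (-1.756818)(894)}{9}
  = \frac{768 + 1570.595}{9} \approx 259.84
\]
```

The negative slope agrees with the downward trend in the data: infants with a higher birth weight show a smaller percentage increase in weight.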
Estimation of Linear Regression using Excel
Example 22.1
Samples of drug products are stored in their original containers under normal conditions and
sampled periodically to analyse the content of the medication. Determine the slope and intercept by the least
squares method. Determine the content after 20 months.
Data:
Time (months) 6 12 18 24 36 48
Content (mg) 995 984 973 960 952 948
Excel Solution
Step 1:
Open new file in MS-Excel as Book 1. Enter the data into an Excel datasheet (Sheet 1).
Worksheet will appear as follows:
Step 2:
In MS-Excel, select the Tools menu from the Menu bar to display its pull-down menu. From
the pull-down menu, select the Data Analysis option. The Data Analysis dialog box will appear.
Step 3:
Select the Regression option from Analysis Tools and then click on OK.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9490745
R Square             0.9007424
Adjusted R Square    0.875928
Standard Error       6.5469646
Observations         6

ANOVA
              df    SS          MS          F           Significance F
Regression    1     1555.882    1555.882    36.29918    0.003824076
Residual      4     171.451     42.86275
Total         5     1727.333
Regression equation
y = a + bx
y = 995.72 -1.127x
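The Excel figures can be cross-checked without a spreadsheet. The sketch below is our own illustration in Python (variable names are assumptions, not Excel terms); it uses the mean-centred form of the least squares formulas, which is algebraically equivalent to the sum formulas for b and a, and then evaluates the fitted line at 20 months.

```python
import math

months = [6, 12, 18, 24, 36, 48]
content = [995, 984, 973, 960, 952, 948]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(content) / n

# Corrected sums of squares and products.
sxx = sum((x - mean_x) ** 2 for x in months)
syy = sum((y - mean_y) ** 2 for y in content)  # "Total" SS in the ANOVA table
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, content))

b = sxy / sxx            # slope
a = mean_y - b * mean_x  # intercept

ss_regression = b * sxy
ss_residual = syy - ss_regression
r_square = ss_regression / syy
multiple_r = math.sqrt(r_square)
ms_residual = ss_residual / (n - 2)  # residual df = N - 2
f_value = ss_regression / ms_residual
standard_error = math.sqrt(ms_residual)

content_at_20 = a + b * 20  # predicted content (mg) after 20 months

print(f"intercept a = {a:.3f}, slope b = {b:.4f}")
print(f"Multiple R = {multiple_r:.7f}, R Square = {r_square:.7f}")
print(f"Standard Error = {standard_error:.7f}, F = {f_value:.5f}")
print(f"Predicted content at 20 months = {content_at_20:.2f} mg")
```

Using the rounded equation from the Excel output gives the same prediction to two decimals: y = 995.72 - 1.127 × 20 = 973.18 mg.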
Summary
Regression
Regression is based on the relationship or association between two (or more) variables.
Regression equation, Y = a + bX
Regression coefficient (b)
Regression coefficient is a measure of the change in one dependent character (Y) with one
unit change in the independent character (X)
\[
b = \frac{N\sum xy - (\sum x)(\sum y)}{N\sum x^{2} - (\sum x)^{2}}
\]
\[
a = \frac{\sum y - b\sum x}{N}
\]
Exercise
1. Data from an experiment investigating the breakdown of aspirin in a pharmaceutical product stored at 25 °C
are given below. Perform linear regression analysis.
Time 18 35 51 68 85 101 118 136 166
Aspirin remaining 603.6 601.1 597.9 594.3 591.4 587.3 581.8 580 578.4
Answers:
Multiple Choice Questions
1. b 2. b 3. a 4. c 5. a 6. c 7. b 8. a 9. a 10. d
Exercise
1. Intercept 606.98, Slope -0.19.
2. Intercept -2.3, Slope 0.03.
3. Intercept 249.38 , Slope -1.36 .
[Figure: amount diffused (mg) plotted against time (h)]
4. Intercept 47.96, Slope 22.68.
5. Intercept 0.0013, Slope 0.02 .
[Figure: absorbance plotted against concentration]
6. Intercept 99.96 , Slope 0.7.
7. Intercept 3.37 , Slope 0.495.
8. Intercept 0.406 , Slope -0.063.
9. Intercept 79.03, Slope 1.114
10. y = a + bx; y = -28.62 + 0.171x
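Mode
Mode = most frequently occurring score
\[
\text{Mode} = l_1 + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times i
\]
Median
Odd number of observations:
\[
\text{Median} = \text{size or value of the } \left(\frac{n+1}{2}\right)\text{th observation}
\]
Even number of observations:
\[
\text{Median} = \frac{\text{size or value of the } \frac{n}{2}\text{th} + \left(\frac{n}{2}+1\right)\text{th observation}}{2}
\]
From class frequencies, where the median is the value of the (n/2)th observation:
\[
\text{Median} = l_1 + \frac{(n/2) - \text{c.f.}}{f} \times i
\]
Percentile
\[
P\text{th percentile} = \frac{P}{100}(N+1)\text{th value}
\]
SD
\[
s = \sqrt{\frac{\sum X^{2} - \frac{(\sum X)^{2}}{N}}{N-1}}
\qquad
s = \sqrt{\frac{\sum fd^{2} - \frac{(\sum fd)^{2}}{N}}{N-1}} \times i
\qquad
s = \sqrt{\frac{\sum f_i X_i^{2} - \frac{(\sum f_i X_i)^{2}}{N}}{N-1}}
\]
Coefficient of variation
\[
CV = \frac{SD}{\text{mean}} \qquad \text{Relative Standard Deviation} = CV \times 100
\]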
Probability
\[
P(E) = \frac{\text{Number of outcomes that favour the event } (m)}{\text{Total number of outcomes } (N)}
\]
Probability of composite outcomes: P(Ei or Ej) = P(Ei) + P(Ej)
Probability of complementary event: \( p(\bar{E}) = 1 - p(E) \)
Probability of intersect: p(A and B) = p(A ∩ B) = p(A) × p(B)
Probability of conjoint: p(A or B) = p(A ∪ B) = p(A) + p(B) - p(A ∩ B)
Conditional probability: p(A given B) = p(A|B) = p(A ∩ B) / p(B)
Probability of binomial distribution:
\[
P(X) = \binom{n}{X} p^{X} q^{\,n-X}, \qquad
\binom{n}{X} = \frac{n!}{X!\,(n-X)!}, \qquad
P(X) = \frac{n!}{X!\,(n-X)!}\, p^{X} q^{\,n-X}
\]
Poisson distribution:
\[
P(X) = \frac{e^{-m}\, m^{X}}{X!} = \frac{m^{X}}{e^{m}\, X!}
\]
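As a quick numerical illustration of the binomial and Poisson formulas, the following Python sketch evaluates both. The function names and the example values (n = 5, p = 0.5, m = 2) are our own assumptions and do not come from the text; q is taken as 1 - p.

```python
import math


def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p (q = 1 - p)."""
    q = 1.0 - p
    return math.comb(n, x) * p ** x * q ** (n - x)


def poisson_pmf(x, m):
    """P(X = x) for a Poisson distribution with mean m."""
    return math.exp(-m) * m ** x / math.factorial(x)


# Hypothetical examples: 2 successes in 5 trials with p = 0.5,
# and no events when the mean number of events is 2.
print(binomial_pmf(2, 5, 0.5))  # 10 * 0.5**5 = 0.3125
print(poisson_pmf(0, 2.0))      # equals e**-2
```

Summing `binomial_pmf(x, n, p)` over x = 0 to n should give 1, which is a convenient check on hand calculations.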
Estimation of Confidence Interval
Standard error of mean
\[
SEM = \frac{SD}{\sqrt{N}}
\]
At 95% confidence, Population mean = Sample mean ± 1.96 × SE
Confidence interval
\[
p\,\% = \bar{X} \pm z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{N}}
\]
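A short Python sketch of the 95% interval follows. It assumes the z value of 1.96 given above and uses the sample SD (N - 1 divisor) in place of σ; for small samples a t value would normally replace 1.96, which this sketch does not attempt. The function name and the data are hypothetical.

```python
import math


def confidence_interval_95(values):
    """Return (lower, upper) limits: sample mean +/- 1.96 x SEM."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    sem = sd / math.sqrt(n)  # standard error of the mean
    margin = 1.96 * sem
    return mean - margin, mean + margin


# Hypothetical readings (not from the text).
lower, upper = confidence_interval_95([10, 12, 14])
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```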
Key stages in hypothesis testing
1. State the null hypothesis 2. State the alternative hypothesis
3. Select the level of significance 4. Select the number of tails
5. Calculate the test statistic 6. Compare the table value and the observed value
7. Decision
Decision rule
1. Determine the p value 2. Compare it with the level of significance, usually 0.05.
3. If the obtained p value is less than the level of significance, reject the null hypothesis; otherwise do not reject it.
Types of error
Type I error: rejecting a true null hypothesis (H0).
Type II error: accepting a false null hypothesis (H0).
Important Points & Formulae At A Glance
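z-test and t-test
Single sample
\[
z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{N}}
\qquad
t_{(N-1)\,df} = \frac{\bar{X} - \mu_0}{S/\sqrt{N}}
\]
Two independent samples
z-test:
\[
z = \frac{\bar{X}_1 - \bar{X}_2}{SE(\bar{X}_1 - \bar{X}_2)},
\qquad
SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{\sigma_1^{2}}{N_1} + \frac{\sigma_2^{2}}{N_2}}
\]
t-test:
\[
t = \frac{\bar{X}_1 - \bar{X}_2}{SE(\bar{X}_1 - \bar{X}_2)},
\qquad
SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{S_p^{2}}{N_1} + \frac{S_p^{2}}{N_2}},
\qquad
S_p^{2} = \frac{(N_1 - 1)s_1^{2} + (N_2 - 1)s_2^{2}}{N_1 + N_2 - 2}
\]
Two paired samples
\[
t = \frac{\bar{x} - 0}{SE} = \frac{\bar{x} - 0}{SD/\sqrt{N}},
\qquad
SD = \sqrt{\frac{\sum X^{2} - \frac{(\sum X)^{2}}{N}}{N-1}}
\]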
ANOVA
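Definitional formula
\[
MS_W = \frac{(n_1 - 1)s_1^{2} + (n_2 - 1)s_2^{2} + (n_3 - 1)s_3^{2} + \ldots + (n_k - 1)s_k^{2}}{N - K}
\]
\[
\bar{X}_G = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + n_3\bar{X}_3 + \ldots + n_k\bar{X}_k}{N}
\]
\[
MS_B = \frac{n_1(\bar{X}_1 - \bar{X}_G)^{2} + n_2(\bar{X}_2 - \bar{X}_G)^{2} + n_3(\bar{X}_3 - \bar{X}_G)^{2} + \ldots + n_k(\bar{X}_k - \bar{X}_G)^{2}}{K - 1}
\]
\[
F = \frac{MS_B}{MS_W}
\]
Computational formula
\[
SS_T = \sum X^{2} - \frac{(\sum X)^{2}}{N}
\]
\[
SS_B = \frac{(\sum X_1)^{2}}{n_1} + \frac{(\sum X_2)^{2}}{n_2} + \frac{(\sum X_3)^{2}}{n_3} - \frac{(\sum X_T)^{2}}{N}
\]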
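The definitional ANOVA formulas can be illustrated with a small Python sketch. It is an illustration under stated assumptions, not the book's worked method: the function name and the three groups are hypothetical, and the within-groups mean square is divided by its degrees of freedom, N - K, where N is the total number of observations and K the number of groups.

```python
def one_way_anova_f(groups):
    """Return the F ratio (MSB / MSW) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        n_i = len(g)
        mean_i = sum(g) / n_i
        # n_i * (group mean - grand mean)^2 contributes to the between-groups SS.
        ss_between += n_i * (mean_i - grand_mean) ** 2
        # Sum of squared deviations within the group, equal to (n_i - 1) * s_i^2.
        ss_within += sum((v - mean_i) ** 2 for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical data: three groups of three observations each.
f_ratio = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F = {f_ratio:.2f}")  # SSB = 26, SSW = 6, so F = 13.00
```

The calculated F is then compared with the tabulated F value for K - 1 and N - K degrees of freedom.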
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.0 .0000 .0040 .0080 .0120 .0160 .0199 .0239 .0279 .0319 .0359
0.1 .0398 .0438 .0478 .0517 .0557 .0596 .0636 .0675 .0714 .0753
0.2 .0793 .0832 .0871 .0910 .0948 .0987 .1026 .1064 .1103 .1141
0.3 .1179 .1217 .1255 .1293 .1331 .1368 .1406 .1443 .1480 .1517
0.4 .1554 .1591 .1628 .1664 .1700 .1736 .1772 .1808 .1844 .1879
0.5 .1915 .1950 .1985 .2019 .2054 .2088 .2123 .2157 .2190 .2224
0.6 .2257 .2291 .2324 .2357 .2389 .2422 .2454 .2486 .2517 .2549
0.7 .2580 .2611 .2642 .2673 .2704 .2734 .2764 .2794 .2823 .2852
0.8 .2881 .2910 .2939 .2967 .2995 .3023 .3051 .3078 .3106 .3133
0.9 .3159 .3186 .3212 .3238 .3264 .3289 .3315 .3340 .3365 .3389
1.0 .3413 .3438 .3461 .3485 .3508 .3531 .3554 .3577 .3599 .3621
1.1 .3643 .3665 .3686 .3708 .3729 .3749 .3770 .3790 .3810 .3830
1.2 .3849 .3869 .3888 .3907 .3925 .3944 .3962 .3980 .3997 .4015
1.3 .4032 .4049 .4066 .4082 .4099 .4115 .4131 .4147 .4162 .4177
1.4 .4192 .4207 .4222 .4236 .4251 .4265 .4279 .4292 .4306 .4319
1.5 .4332 .4345 .4357 .4370 .4382 .4394 .4406 .4418 .4429 .4441
1.6 .4452 .4463 .4474 .4484 .4495 .4505 .4515 .4525 .4535 .4545
1.7 .4554 .4564 .4573 .4582 .4591 .4599 .4608 .4616 .4625 .4633
1.8 .4641 .4649 .4656 .4664 .4671 .4678 .4686 .4693 .4699 .4706
1.9 .4713 .4719 .4726 .4732 .4738 .4744 .4750 .4756 .4761 .4767
2.0 .4772 .4778 .4783 .4788 .4793 .4798 .4803 .4808 .4812 .4817
2.1 .4821 .4826 .4830 .4834 .4838 .4842 .4846 .4850 .4854 .4857
2.2 .4861 .4864 .4868 .4871 .4875 .4878 .4881 .4884 .4887 .4890
2.3 .4893 .4896 .4898 .4901 .4904 .4906 .4909 .4911 .4913 .4916
2.4 .4918 .4920 .4922 .4925 .4927 .4929 .4931 .4932 .4934 .4936
2.5 .4938 .4940 .4941 .4943 .4945 .4946 .4948 .4949 .4951 .4952
2.6 .4953 .4955 .4956 .4957 .4959 .4960 .4961 .4962 .4963 .4964
2.7 .4965 .4966 .4967 .4968 .4969 .4970 .4971 .4972 .4973 .4974
2.8 .4974 .4975 .4976 .4977 .4977 .4978 .4979 .4979 .4980 .4981
2.9 .4981 .4982 .4982 .4983 .4984 .4984 .4985 .4985 .4986 .4986
3.0 .4987 .4987 .4987 .4988 .4988 .4989 .4989 .4989 .4990 .4990
Also, for z = 4.0, 5.0 and 6.0, the areas are 0.49997, 0.4999997, and 0.499999999.
Appendix 2
Critical values of the t distribution
Appendix 3
Critical Values of F Distribution
Values of F 0.05
Columns: degrees of freedom for numerator. Rows: degrees of freedom for denominator.
1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
1 161 200 216 225 230 234 237 239 241 242 244 246 248 249 250 251 252 253 254
2 18.5 19.0 19.2 19.2 19.3 19.3 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.5 19.5 19.5 19.5 19.5 19.5
3 10.10 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66 8.64 8.62 8.59 8.57 8.55 8.53
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.66 5.63
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56 4.53 4.50 4.46 4.43 4.40 4.37
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87 3.84 3.81 3.77 3.74 3.70 3.67
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44 3.41 3.38 3.34 3.30 3.27 3.23
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15 3.12 3.08 3.04 3.01 2.97 2.93
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94 2.90 2.86 2.83 2.79 2.75 2.71
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77 2.74 2.70 2.66 2.62 2.58 2.54
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65 2.61 2.57 2.53 2.49 2.45 2.40
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.30
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46 2.42 2.38 2.34 2.30 2.25 2.21
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39 2.35 2.31 2.27 2.22 2.18 2.13
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28 2.24 2.19 2.15 2.11 2.06 2.01
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23 2.19 2.15 2.10 2.06 2.01 1.96
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19 2.15 2.11 2.06 2.02 1.97 1.92
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.93 1.88
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.87 1.81
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07 2.03 1.98 1.94 1.89 1.84 1.78
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05 2.01 1.96 1.91 1.86 1.81 1.76
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.77 1.71
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00
Appendix 3
Critical Values of F Distribution
Values of F 0.01
Columns: degrees of freedom for numerator. Rows: degrees of freedom for denominator.
1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
1 4052 5000 5403 5625 5764 5859 5928 5982 6023 6056 6106 6157 6209 6235 6261 6287 6313 6339 6366
2 98.5 99.0 99.2 99.2 99.3 99.3 99.4 99.4 99.4 99.4 99.4 99.4 99.4 99.5 99.5 99.5 99.5 99.5 99.5
3 34.1 30.8 29.5 28.7 28.2 27.9 27.7 27.5 27.3 27.2 27.1 26.9 26.7 26.6 26.5 26.4 26.3 26.2 26.1
4 21.2 18.0 16.7 16.0 15.5 15.2 15.0 14.8 14.7 14.5 14.4 14.2 14.0 13.9 13.8 13.7 13.7 13.6 13.5
5 16.3 13.3 12.1 11.4 11.0 10.7 10.5 10.3 10.2 10.1 9.89 9.72 9.55 9.47 9.38 9.29 9.20 9.11 9.02
6 13.7 10.9 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.56 7.40 7.31 7.23 7.14 7.05 6.97 6.88
7 12.2 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.31 6.16 6.07 5.99 5.91 5.82 5.74 5.65
8 11.3 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.52 5.36 5.28 5.20 5.12 5.03 4.95 4.86
9 10.6 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 4.96 4.81 4.73 4.65 4.57 4.48 4.40 4.31
10 10.0 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.56 4.41 4.33 4.25 4.17 4.08 4.00 3.91
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.40 4.25 4.10 4.02 3.94 3.86 3.78 3.69 3.60
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.01 3.86 3.78 3.70 3.62 3.54 3.45 3.36
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 3.96 3.82 3.66 3.59 3.51 3.43 3.34 3.25 3.17
14 8.86 6.51 5.56 5.04 4.70 4.46 4.28 4.14 4.03 3.94 3.80 3.66 3.51 3.43 3.35 3.27 3.18 3.09 3.00
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.52 3.37 3.29 3.21 3.13 3.05 2.96 2.87
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 3.55 3.41 3.26 3.18 3.10 3.02 2.93 2.84 2.75
17 8.40 6.11 5.19 4.67 4.34 4.10 3.93 3.79 3.68 3.59 3.46 3.31 3.16 3.08 3.00 2.92 2.83 2.75 2.65
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 3.37 3.23 3.08 3.00 2.92 2.84 2.75 2.66 2.57
19 8.19 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 3.30 3.15 3.00 2.92 2.84 2.76 2.67 2.58 2.49
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.09 2.94 2.86 2.78 2.69 2.61 2.52 2.42
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 3.17 3.03 2.88 2.80 2.72 2.64 2.55 2.46 2.36
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.12 2.98 2.83 2.75 2.67 2.58 2.50 2.40 2.31
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 3.07 2.93 2.78 2.70 2.62 2.54 2.45 2.35 2.26
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.89 2.74 2.66 2.58 2.49 2.40 2.31 2.21
25 7.77 5.57 4.68 4.18 3.86 3.63 3.46 3.32 3.22 3.13 2.99 2.85 2.70 2.62 2.53 2.45 2.36 2.27 2.17
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.70 2.55 2.47 2.39 2.30 2.21 2.11 2.01
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.52 2.37 2.29 2.20 2.11 2.02 1.92 1.80
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.35 2.20 2.12 2.03 1.94 1.84 1.73 1.60
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.19 2.03 1.95 1.86 1.76 1.66 1.53 1.38
∞ 6.63 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32 2.18 2.04 1.88 1.79 1.70 1.59 1.47 1.32 1.00