Data Collection and Analysis in Obstetrics and Gynecology
BY
S. M. Ogbonmwan
Department of Mathematics
University of Benin
Benin City, Nigeria
10/14/08 1
What is Data?
Data are measurable characteristics of a sampling unit (or
subject) of a population that yield information about the
population.
Types of Data:
Broadly, data are either categorical or numerical.
Categorical Data:
The simplest type of observation that is made on
a subject that comes to the clinic is the allocation
(the classification) of the subject to one of only
two categories that relate to the presence or
absence of some attributes.
Examples:
Pregnant/Not Pregnant
Married/Single
Hypertensive/Normotensive
Diabetic/Non-Diabetic.
More than two categories:
Marital Status: Married/Single/Divorced/Separated
Blood group: A/B/AB/O
Degree of pain:
Minimal/Moderate/Severe/Unbearable
Numerical Data:
There are two main types, viz: discrete and continuous.
Discrete Data:
Arise when observations take certain numerical
values through counting.
Examples:
Number of children, number of visits to ANC in a year,
number of ectopic heart beats in 24 hours, number of
threatened abortions in the last two years, etc.
Continuous Data:
Usually obtained by some form of measurements.
Examples:
Height, weight, age, body temperature, blood
pressure, serum cholesterol, etc.
Other types of Data:
Censored Data:
In many studies of life (time-to-event) data, not all of the
subjects in the sample will have failed by the end of
observation. That is, in some cases the event of interest may
not be observed, or the exact times-to-failure of some of the
subjects may not be known. Such data are commonly called
censored data, and they are of three types, viz: right
censored (or suspended), interval censored and left censored
data.
Right Censored (Suspended) Data: These are life data on
subjects that had not failed by the end of the study.
Example: If 5 of 8 breast cancer cases had failed by the end
of the experiment, the remaining 3 would be regarded as
suspended (right censored) data.
Interval Censored Data: Interval censored data result when
there is uncertainty about the exact times at which units
failed within an interval.
Example: Suppose units are inspected every 6 hours, say at
6:00 am, 12:00 noon, 6:00 pm and so on. Suppose 8 were
surviving at 6:00 am and only 7 were surviving when inspected
at 12:00 noon. Then one can only say that one unit failed
between 6:00 am and 12:00 noon. The exact time at which it
failed is not known.
Left Censored Data: In this case, the failure time is only
known to be before a certain time.
Example: Suppose a unit scheduled for inspection after 12
hours is found to have failed before the inspection. Thus,
what is known is that the unit failed sometime before 12 hours
(i.e. between 0 and 12 hours) but not exactly when.
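One way to hold the three kinds of censoring in a single record is to store, for each subject, the interval in which the failure time is known to lie. The sketch below is illustrative only: the class name `Observation`, its field names and the sample times are assumptions and not part of the original material.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Failure time known only to lie between `lower` and `upper` (hours)."""
    lower: float  # last time the subject was known to be surviving
    upper: float  # first time the subject was known to have failed

    @property
    def kind(self) -> str:
        if self.lower == self.upper:
            return "exact"
        if math.isinf(self.upper):
            return "right censored"     # never seen to fail
        if self.lower == 0:
            return "left censored"      # failed before the first inspection
        return "interval censored"      # failed between two inspections


# Hypothetical records, mirroring the examples above.
records = [
    Observation(9, 9),          # failure observed exactly at 9 h
    Observation(6, 12),         # surviving at 6 h, found failed at 12 h
    Observation(0, 12),         # already failed at the first inspection
    Observation(24, math.inf),  # still surviving when the study ended
]
for record in records:
    print(record, "->", record.kind)
```

Storing the interval rather than a single number keeps the uncertainty explicit, so that later analysis does not treat a censored time as an observed failure.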
Variable:
A variable is any attribute, phenomenon or event
that can have different values.
A variable can be either quantitative or qualitative.
A quantitative variable describes a
characteristic in terms of a numerical value. The
value may vary from subject to subject or from
time to time in the same subject. The value is
expressed in units of measurement.
Examples: Height in meters, blood pressure in
mm Hg, weight in kilograms, etc.
A qualitative variable describes the attribute of
a characteristic (by classifying it into categories
to which the subject either belongs or does not
belong).
Examples: State of origin, Tribe or Ethnic
group, etc.
Types of Variables: There are two types: continuous and
discrete.
Continuous Variable:
A variable with a potentially infinite number of
possible values in any interval. It can assume
either integral or fractional values and can be
measured to different levels of accuracy.
A continuous variable is realized through actual
measurement.
Examples: Weights of babies delivered in a
health facility could be 3.14, 2.98, 2.94, 3.10 kg.
Discrete Variable:
A variable that can take only a finite number of values
in any interval. The values are invariably whole
numbers (integers). A discrete variable is usually
realized through counting.
Examples: Number of children in a family,
number of clinics in a community, number of
children delivered within a given period in a
Teaching Hospital, etc.
Collection of Data (In O & G)
Sources of Data:
There are two main sources of data in healthcare
delivery, including O & G. These are the regular (or
routine) system and the ad hoc system.
Regular or Routine Data Collection Systems:
A regular or routine data collection system
usually consists of established procedures for
collecting data (in the clinics) as they become
available. This could be at national, sub-national
or institutional levels. This system provides a
rough indication of the frequency of occurrence of
diseases and their descriptive epidemiology,
which serves as leads concerning disease
etiology. The sources of data in this system
include information from: hospital (medical)
records, autopsy reports, physician records, etc.
Example: (Part of Patient’s Form)
Patient’s Name: -----------------------------------------
Patient’s Number:
Date of Registration:
Date of Birth:
Sex (1 = male, 2 = female):
Marital status:
Religion:
Ethnic group/Tribe:
Height (m):
Weight (kg):
Blood pressure (mm Hg): Systolic:        Diastolic:
Number of Pregnancies:
Number of Deliveries:
Number of Children Alive:
Number of Children Dead:
Number of Abortions:
The advantage of this system of data collection is that it
guarantees availability of data in every specific area of
healthcare delivery.
Ad Hoc Data Collection Systems:
Ad hoc data collection is usually in the form of a (Research)
survey to gather information that may not be available on a
regular basis. This at times may include special
investigative studies or it could just be the collection of
additional information as part of the routine data collection.
This system gives a large coverage of the population.
Examples:
An investigation of the effects of FGM on complications
during delivery
An investigation of breastfeeding practices among women
who registered a birth in the previous year.
A study to investigate whether the use of hormonal
contraceptives affects the fertility status of the users.
The ad hoc data collection system can be extensive,
intensive and expensive. However, an advantage of the ad
hoc system is that it provides accurate and reliable data (when
well conducted) in response to the specific needs of the users.
An important tool for the ad hoc data collection system is a
well-designed questionnaire.
Good Questionnaire Design.
Guidelines for Designing a Questionnaire
Use simple language
Avoid long complicated questions (avoid double negatives)
Be unambiguous – be clear and simple
Do not ask general questions if you want specific answers.
Ask only valid questions.
Do not ask leading questions
Avoid hypothetical questions about situations outside the
people’s direct experience
Be careful with embarrassing questions. Do not make it
too difficult for the respondents.
Use the minimum number of questions.
Pre-coded questions enable you to analyse your replies
easily by the computer, but they may force people to give
wrong answers.
People tend to choose the first response.
Ask easy questions first and difficult questions last.
Pre-test your questionnaires.
Steps in the Planning of a Survey
Step 1 – Preparation of a detailed written statement of the
objectives of the survey.
Step 2 – Determination of the items of information required
and methods of collection.
Step 3 – Definition of the reference population on which
information is to be sought.
Step 4 – Decision on whether the reference population is to
be studied as a whole or in part (sample).
Step 5 – Determination of the number of units in the
population to be selected for study during the survey
(sample size).
Step 6 – Decision on how respondents will be selected from
the population (sampling method).
Step 7 – Design, testing and validation of the
questionnaires on which observations will be recorded.
Step 8 – Selection and training of enumerators
(interviewers).
Step 9 – Collection of data.
Step 10 – Preparation for data analysis.
Analysis of Data:
The general methodology for the analysis of data (in O & G)
is of two types; viz: Descriptive and Inferential.
Descriptive Statistics Approach for Data
Analysis:
Descriptive Statistics:
Descriptive statistics are the statistical tools for
organizing and summarizing data. They describe a set of data,
which in turn provides a basis for generalizing about a
population when only a sample is observed. Descriptive
statistics highlight a characteristic of the population being
studied and reduce a mass of data to a few simple ideas. In
data analysis, descriptive statistics are presented in tables
that provide summary statistics for continuous, numeric
variables. The summary statistics include:
measures of central tendency such as the mean, median and
mode
measures of dispersion (spread of the distribution) such as
the range and standard deviation (including the variance of
the distribution)
measures of distribution shape such as skewness and kurtosis,
which indicate how much a distribution departs from a normal
distribution
In summary, descriptive statistics
describe a set of data, which
provides a basis for generalizing
about a population when only a sample is
observed. Thus, descriptive
statistics highlight a characteristic of
the population being studied and
summarize a mass of data into a few
simple ideas.
Organization and Presentation of data
Useful information is usually not immediately evident from
a mass of raw data. Collected data need to be organized in
such a way that the patterns of variation in the distribution
are clearly revealed. Organizing the data aids understanding
of their structure and characteristics. Data are usually
presented in either tabular or diagrammatic form.
Tabular Presentation
This is the presentation of data in tables so as to organize
them into a compact and readily comprehensible form. For
example, a frequency distribution table gives the number of
observations at different values or classes of the variable.
Tabular presentation could be handled as:
(a) Single variable frequencies:
For a qualitative variable (such as the distribution of the
state of origin of 100 women who visited the ANC in the
last year).
For a large data set of a quantitative variable requiring
grouping of the data into classes (such as the distribution
of the weight of new born babies in a Teaching Hospital)
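A single-variable frequency table for a qualitative variable can be built by counting each category. The following is a minimal sketch using only the Python standard library; the blood group values and the function name `frequency_table` are hypothetical and chosen for illustration.

```python
from collections import Counter

# Hypothetical blood groups recorded for 12 clinic attendees.
blood_groups = ["O", "A", "B", "O", "AB", "O", "A", "O", "B", "A", "O", "A"]


def frequency_table(values):
    """Return (category, frequency, relative frequency), most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]


table = frequency_table(blood_groups)
for category, freq, rel in table:
    print(f"{category:<4}{freq:>4}{rel:>9.1%}")
```

The relative frequency column is what a pie chart or bar chart of the same variable would display.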
(b) Cross-tabulation:
Two dimensional tables, in which two variables are cross-tabulated (such
as the cross-classification of weight of babies at birth and economic status
of their parents).
Three-dimensional tables, in which three variables are cross-classified
(such as outcome of treatment by sex and by age group).
Diagrammatic presentation
Diagrammatic presentation is the use of a diagram to show the distribution
of data. The methods of diagrammatic presentation of data are:
(a) Qualitative or Categorical Data
Pie Charts
A circle is divided into sectors with areas proportional to the frequencies or
the relative frequencies of the categories of the variable.
Bar Charts
The bars are constructed to show the frequency or relative frequency for
each category of the attribute. The bars are usually equal in width. It is
important that the vertical scale should start at zero; otherwise the heights
of the bars will not be proportional to the frequencies.
(b) Quantitative data
Frequency Histograms
The chosen class intervals should not overlap and should
cover the full range of the data. The area of each bar (not
just its height) should be proportional to the frequency.
Unequal class intervals are taken into account by the areas
of the bars.
Frequency Polygons (Line Charts)
This is constructed by joining the midpoints of the top of
each bar of a histogram. This chart provides ease of visual
comparison between two or more distributions drawn on
the same chart.
Cumulative frequency polygons and cumulative
frequency charts (Ogives)
This is the chart in which the cumulative frequencies are
plotted against the upper tabulated limit of each class. In
principle, the ogive can be used to estimate, by
interpolation, the frequency of occurrence of values of the
variable less than or equal to a specified value.
Measures of Location:
One of the first statistics usually computed for a set of data is a
measure of central tendency such as the Mean, Median and
the Mode.
The Mean:
Most frequently used in data analysis. The Mean may be
considered as the center of gravity of the distribution.
Raw data:
\[ \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Grouped data:
\[ \bar{X} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} \]
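The two forms of the mean can be sketched in a few lines of Python. The data values and function names here are hypothetical, and for grouped data the sketch assumes that each x_i is taken as the midpoint of its class.

```python
def mean_raw(values):
    """Arithmetic mean of ungrouped observations: sum(x_i) / n."""
    return sum(values) / len(values)


def mean_grouped(midpoints, frequencies):
    """Mean of grouped data: sum(f_i * x_i) / sum(f_i)."""
    total = sum(f * x for f, x in zip(frequencies, midpoints))
    return total / sum(frequencies)


# Hypothetical birth weights in kg.
weights = [3.14, 2.98, 2.94, 3.10]
print(round(mean_raw(weights), 2))

# Hypothetical grouped table with classes 2.0-2.5, 2.5-3.0, 3.0-3.5, 3.5-4.0 kg.
midpoints = [2.25, 2.75, 3.25, 3.75]
frequencies = [2, 5, 10, 3]
print(round(mean_grouped(midpoints, frequencies), 2))
```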
The Median:
It is the point in the distribution with 50% of the observations on each
side of it; that is, it is the midpoint of the distribution. For an even
number of observations, the median lies midway between the (n/2)th and
((n+2)/2)th positions when the values of the observations are arranged in
order of magnitude. When the number of observations is odd, the median
occupies the ((n+1)/2)th position in the ordered arrangement. For grouped
data, the median is estimated by using the expression:
\[ \text{Median} = L_1 + \left( \frac{\frac{n}{2} - C_f}{f_i} \right) C_i \]
Where
L_1 = lower class boundary of the median class
n = number of observations
C_f = cumulative frequency of the class just before the median class
C_i = median class interval (width)
f_i = frequency of the median class
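The grouped-data expression can be checked with a short routine that locates the median class and interpolates within it. This is a sketch under stated assumptions: the frequency table is hypothetical, all classes are assumed to have the same width, and the function name is illustrative.

```python
def grouped_median(lower_boundaries, width, frequencies):
    """Median = L1 + ((n/2 - Cf) / f) * C for an equal-width frequency table."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("frequency table is empty")
    cumulative = 0  # Cf: cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, frequencies):
        if cumulative + f >= n / 2:  # first class that reaches n/2
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f


# Hypothetical birth weights (kg): classes 2.0-2.5, 2.5-3.0, 3.0-3.5, 3.5-4.0.
lower_boundaries = [2.0, 2.5, 3.0, 3.5]
frequencies = [2, 5, 10, 3]
median = grouped_median(lower_boundaries, 0.5, frequencies)
print(round(median, 3))
```

Here n/2 = 10 falls in the third class, so the estimate is interpolated between 3.0 and 3.5 kg.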
The Mode:
This is simply the value that occurs most frequently in the distribution. For
grouped data, the mode is estimated by using the expression:
\[ \text{Mode} = L_1 + \frac{(f - f_b)}{(f - f_b) + (f - f_a)} \times C \]
Where
L_1 = lower class boundary of the modal class
f = modal frequency
f_b = frequency of the class before the modal class
f_a = frequency of the class after the modal class
C = modal class interval (width)
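A sketch of the grouped mode follows. It uses the standard interpolation in which the numerator is the difference between the modal frequency and the frequency of the class before it, so the estimate is pulled toward the neighbouring class with the higher frequency. The table is hypothetical, a single modal class and equal class widths are assumed, and a missing neighbour is treated as having frequency zero.

```python
def grouped_mode(lower_boundaries, width, frequencies):
    """Mode = L1 + (f - f_b) / ((f - f_b) + (f - f_a)) * C."""
    m = max(range(len(frequencies)), key=frequencies.__getitem__)  # modal class
    f = frequencies[m]
    f_before = frequencies[m - 1] if m > 0 else 0
    f_after = frequencies[m + 1] if m + 1 < len(frequencies) else 0
    return lower_boundaries[m] + (f - f_before) / ((f - f_before) + (f - f_after)) * width


# Hypothetical birth weights (kg): classes 2.0-2.5, 2.5-3.0, 3.0-3.5, 3.5-4.0.
lower_boundaries = [2.0, 2.5, 3.0, 3.5]
frequencies = [2, 5, 10, 3]
mode = grouped_mode(lower_boundaries, 0.5, frequencies)
print(round(mode, 3))
```

Because the class before the modal class (frequency 5) is fuller than the class after it (frequency 3), the estimate lies in the lower half of the modal class.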
Measure of Variability (Measure of Spread)
The Range:
The simplest way to describe the spread of a set of data is to quote the lowest
and highest values. The difference between the highest and lowest values gives
the range of the distribution. However, because it depends only on the two
extreme values, it is not a satisfactory measure and is therefore not widely used.
Variance:
This is the mean of the squared differences (deviations) between the mean and
each observed value. It is mathematically expressed as:
\[ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{X})^2}{\sum_{i=1}^{k} f_i - 1} \]
Standard Deviation:
The square root of the variance:
\[ S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}} \]
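The range, variance and standard deviation of raw data can be computed directly from the definitions. In this sketch the data are hypothetical, and the result is cross-checked against `statistics.variance` from the Python standard library, which also uses the n − 1 divisor.

```python
import math
import statistics


def sample_variance(values):
    """S^2 = sum((x_i - mean)^2) / (n - 1)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


# Hypothetical birth weights in kg.
weights = [3.14, 2.98, 2.94, 3.10]

data_range = max(weights) - min(weights)
variance = sample_variance(weights)
std_dev = math.sqrt(variance)

print(round(data_range, 2), round(variance, 5), round(std_dev, 4))
# Cross-check against the standard library implementation.
print(math.isclose(variance, statistics.variance(weights)))
```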
Inferential Statistics:
Usually when samples are studied, the investigator will be
interested in going beyond the sample and will want to
make inferences about the population from which the
sample was drawn. Thus, from knowledge of descriptive
statistics such as the sample mean and variance, inferences
about the same traits in the population are made. The use of
inferential statistics is basic to medical research. The
tools of inferential statistics include: confidence intervals,
tests of hypothesis, contingency tables, nonparametric tests,
regression and correlation analysis, ANOVA, etc.
Confidence Interval:
A confidence interval combines the features of an estimate
from a sample with known properties of the normal
distribution to give an idea of the uncertainty associated
with a single sample estimate of a population parameter.
It gives a range of values that one can be confident, at a
stated level, includes the true value.
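C I for a Single Mean (µ)
The 100(1 − α)% C I is
\[ \bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \quad \text{or} \quad \bar{X} \pm t_{n-1}(\alpha/2) \frac{s}{\sqrt{n}} \]
C I for the Difference of Two Means (µ1 − µ2)
The 100(1 − α)% C I is
\[ \bar{X}_1 - \bar{X}_2 \pm Z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
or
\[ \bar{X}_1 - \bar{X}_2 \pm t_{n_1+n_2-2}(\alpha/2) \, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} , \]
where
\[ S_p = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}} \]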
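The interval formulas can be sketched with the Python standard library. All numbers below are hypothetical. The normal quantile comes from `statistics.NormalDist`; because the standard library has no t distribution, the t critical value is passed in by the caller (2.101 is the tabulated two-sided 5% point for 18 degrees of freedom).

```python
import math
from statistics import NormalDist


def ci_mean_z(xbar, sigma, n, level=0.95):
    """X-bar +/- Z(alpha/2) * sigma / sqrt(n), with sigma treated as known."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * sigma / math.sqrt(n)
    return xbar - half, xbar + half


def pooled_sd(s1, n1, s2, n2):
    """S_p = sqrt(((n1-1)S1^2 + (n2-1)S2^2) / (n1 + n2 - 2))."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def ci_diff_pooled(m1, s1, n1, m2, s2, n2, t_crit):
    """(X1 - X2) +/- t * S_p * sqrt(1/n1 + 1/n2); t_crit is read from tables."""
    half = t_crit * pooled_sd(s1, n1, s2, n2) * math.sqrt(1 / n1 + 1 / n2)
    return (m1 - m2) - half, (m1 - m2) + half


# Hypothetical: mean birth weight 3.1 kg, sigma 0.5 kg, n = 100.
low, high = ci_mean_z(3.1, 0.5, 100)
print(round(low, 3), round(high, 3))

# Hypothetical small samples: n1 = n2 = 10, so df = 18 and t = 2.101.
d_low, d_high = ci_diff_pooled(10.0, 2.0, 10, 8.0, 2.0, 10, t_crit=2.101)
print(round(d_low, 3), round(d_high, 3))
```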
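C I for the Single Proportion (P)
The 100(1 − α)% C I is
\[ P \pm Z_{\alpha/2} \sqrt{\frac{p_0 q_0}{n}} \]
C I for the Difference of Two Proportions (P1 − P2)
The 100(1 − α)% C I is
\[ P_1 - P_2 \pm Z_{\alpha/2} \sqrt{\frac{P_1 (1 - P_1)}{n_1} + \frac{P_2 (1 - P_2)}{n_2}} \]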
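A minimal sketch of the large-sample interval for a single proportion is given below. It assumes that the sample proportion is used in the standard error, which is one common choice, and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def ci_proportion(successes, n, level=0.95):
    """p +/- Z(alpha/2) * sqrt(p * (1 - p) / n), with p = successes / n."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


# Hypothetical: 40 hypertensive women among 200 antenatal attendees.
low, high = ci_proportion(40, 200)
print(round(low, 4), round(high, 4))
```

The normal approximation is reasonable only when both the number of successes and the number of failures are fairly large.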
Test of Statistical Significance
Tests of significance are standard statistical procedures for
drawing inferences from sample estimates about unknown
population parameters.
In medical research, tests of significance allow us to decide
whether the sample estimates, or differences between
estimates are within their normal biological variation,
commonly called variability due to chance.
Procedure for testing statistical hypothesis
– State the null hypothesis
– State the alternative hypothesis (indicate 1 – tail or 2 – tail)
– State the level of significance (explain type 2 errors)
– Choose the test statistic (explain parametric and non-
parametric tests)
– Compute the numerical value of the statistic from the
observed data
– Compare the calculated value of test statistic with tabulated
values in appropriate standard distribution tables at a specified
probability level of significance
– Decide whether or not to reject the null hypothesis according
to the p-value
Test for Single Mean:
Contingency Tables:
The test for association between two categorical variables uses the χ²
(chi-square) distribution.
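As an illustration, the Pearson chi-square statistic for a contingency table compares each observed count with the count expected if the two variables were independent. The sketch below uses a hypothetical 2 × 2 table; the function name is illustrative, and the statistic would be compared with the tabulated χ² value for the returned degrees of freedom (3.841 at the 5% level for 1 degree of freedom).

```python
def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # under independence
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: rows = hypertensive yes/no, columns = diabetic yes/no.
observed = [[20, 30],
            [30, 20]]
statistic, df = chi_square_statistic(observed)
print(statistic, df, statistic > 3.841)
```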
Nonparametric Tests:
In the tests for means, proportions and association, there is a fundamental
assumption that the distribution of the test statistic is known and,
indeed, that the functional form of the distribution of the variables under
consideration is known. When the functional form of the underlying density
function of the variables is not known, it is usually advisable to resort
to nonparametric tests such as:
The Wilcoxon (Rank sum) test
The Mann-Whitney U – test
The Median test
The Sign test
For the Wilcoxon rank sum test with rank sum S_W, reject H0 when
\[ Z = \frac{S_W - \frac{m(N+1)}{2}}{\sqrt{\frac{mn(N+1)}{12}}} > Z_{\alpha/2} \]
The Mann-Whitney U – Test (Two Samples)
Test statistic:
\[ U = S_W - \frac{m(m+1)}{2} \]
where S_W is as in the Wilcoxon test.
Reject H0 when
\[ Z = \frac{U - \frac{mn}{2}}{\sqrt{\frac{mn(N+1)}{12}}} > Z_{\alpha/2} \]
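The Wilcoxon rank sum and the Mann-Whitney U lead to the same standardized value Z, which the following sketch illustrates. It assumes that m and n are the two sample sizes, that N = m + n, and that S_W is the sum of the ranks of the first sample in the pooled ordering, with tied values given averaged ranks. The data and function names are hypothetical.

```python
import math


def midranks(pooled):
    """Map each value to its rank in the pooled sample, averaging tied ranks."""
    ordered = sorted(pooled)
    ranks = {}
    for value in set(ordered):
        first = ordered.index(value) + 1  # 1-based rank of the first copy
        ranks[value] = first + (ordered.count(value) - 1) / 2
    return ranks


def rank_sum_tests(sample_x, sample_y):
    m, n = len(sample_x), len(sample_y)
    big_n = m + n  # N, assumed to be the pooled sample size
    ranks = midranks(list(sample_x) + list(sample_y))
    s_w = sum(ranks[v] for v in sample_x)  # Wilcoxon rank sum
    sd = math.sqrt(m * n * (big_n + 1) / 12)
    z_wilcoxon = (s_w - m * (big_n + 1) / 2) / sd
    u = s_w - m * (m + 1) / 2  # Mann-Whitney U
    z_mann_whitney = (u - m * n / 2) / sd
    return s_w, u, z_wilcoxon, z_mann_whitney


# Hypothetical measurements for two small groups.
group_x = [1.1, 2.3, 3.0]
group_y = [4.2, 5.5, 6.1, 7.0]
s_w, u, z_w, z_u = rank_sum_tests(group_x, group_y)
print(s_w, u, round(z_w, 4), round(z_u, 4))
```

The two Z values agree because U differs from S_W only by the constant m(m + 1)/2.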
Regression and Correlation:
A high proportion of data analyses are carried out
to study the relationship between two variables.
The purposes of such analysis are:
To assess whether the two variables are
associated.
To enable the value of one variable to be
predicted from any known value of the other
variable
To assess the amount of agreement between the
values of the two variables.
Correlation:
Correlation is the method of analysis used to measure the strength of the
relationship (association) between two continuous variables, e.g. percentage
of body fat and age of normal adults. The association is measured by
calculating the correlation coefficient r, which can take any value between
–1 and +1.
\[ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \]
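The coefficient r can be computed directly from the sums of squares and cross-products in its definition. The paired values below are hypothetical and the function name is illustrative.

```python
import math


def correlation(xs, ys):
    """Pearson r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical paired measurements on five subjects.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))
```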
Regression:
Linear regression describes the linear relationship between variables and can be
used to predict the value of one variable for an individual when we know only
the other variable. Consider a simple case: fetal weight (kg) and non-pregnant
maternal weight. Here we consider the fetal weight as the response
(or outcome) variable while the maternal weight is the predictor variable.
These are also called the dependent and independent variables respectively.
The linear relationship between the dependent (Y) and the independent (X)
variables is given as:
\[ Y = \alpha + \beta X \]
The estimates of the coefficients are
\[ \hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \]
\[ \hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X} \]
Hence,
\[ \hat{Y} = \hat{\alpha} + \hat{\beta} X \]
which is used for prediction.
Multiple Regression:
\[ Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]
e.g. obesity, smoking and snoring:
\[ Y_{\text{Snoring}} = \alpha + \beta_1 X_{\text{Smoking}} + \beta_2 X_{\text{Obesity}} \]
Logistic Regression:
Suitable for predicting a dichotomous (two-category) outcome variable.
Simple Experimental Design
One Way ANOVA
In research work or in the handling of patients, comparisons are often
made between several sets of data collected from basically similar
populations, such as groups of patients having the same ailment where a
different drug was used for each group.
Generally, any experiment designed to compare several treatments (sources
of variation) must embody two important principles of experimental design,
viz: (i) Replication and (ii) Randomization. The simplest experimental
design that incorporates these two principles is the completely
randomized design, also called the one-way classification or
the one-way analysis of variance, which involves one factor appearing at
different levels.
The null hypothesis we wish to test is:
H0: µ 1 = µ 2 = ... = µ k = µ versus
H1: At least one of the µ i (i = 1, ..., k) differs from µ.
Test for One-Way Classification
1. State H0 and H1
H0: µ 1 = µ 2 = ... = µ k
ANOVA TABLE
S.V.      d.f.      SS      MS      F-Ratio
Total     kn – 1    SST
5. Under H0, and with the assumptions in (3) being correct, Fcal under
F-Ratio in the ANOVA table has the F distribution with k – 1 and k(n – 1)
degrees of freedom. Hence, we find the critical point by reading off
F_{k-1, k(n-1)}(α) from the F distribution table for the appropriate level
of significance.
6. Compare the value of Fcal from the ANOVA table with F_{k-1, k(n-1)}(α)
from the statistical table.
If Fcal > F_{k-1, k(n-1)}(α), then reject the null hypothesis.
7. Draw a conclusion.
Remark
When the sample sizes (i.e. the number of observations in each
treatment) are not all equal, necessary adjustment must be made in the
computation of sums of squares.
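The F ratio for a one-way classification can be sketched as follows. The function is written for groups of any size, so it also covers the unequal-sample-size case mentioned in the remark; the data and the function name are hypothetical. The computed ratio would then be compared with the tabulated F value for the returned degrees of freedom.

```python
def one_way_anova(groups):
    """Return (F ratio, treatment d.f., error d.f.) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-treatment and within-treatment (error) sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical reaction times for three treatments, three subjects each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio, df_between, df_within = one_way_anova(groups)
print(round(f_ratio, 2), df_between, df_within)
```

With k equal groups of size n, the error degrees of freedom n_total − k reduce to k(n − 1).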
Example
Six patients each were tested on four types of oral contraceptive to
investigate the average reaction time.
Risk Estimation:
                      Disease
                  Yes      No       Total
Exposure   Yes    a        b        a + b
           No     c        d        c + d
Total             a + c    b + d    n = a + b + c + d
Thus,
\[ RR = \frac{a/(a+b)}{c/(c+d)} = \frac{a}{a+b} \cdot \frac{c+d}{c} = \frac{a(c+d)}{c(a+b)} \]
Remarks:
1. RR of 1.0 indicates that the incidence rates of
disease in the exposed and non-exposed groups
are identical and thus indicates that there is no
association observed between the exposure and
the disease.
2. A value of RR greater than 1.0 indicates a
positive association or an increased risk among
those exposed (to a factor).
3. Analogously, an RR less than 1.0 means that
there is an inverse association or a decreased risk
among those exposed.
4. RR may change (in some cases) with time,
e.g. RR for 1 year of exposure might be different
from RR for 10 years of exposure.
Odds Ratio (for case-control studies)
Studies in which participants are selected on the basis of their disease
status.
OR ≡ ratio of the odds of exposure among the cases to that
among the controls.
\[ OR = \frac{a/c}{b/d} = \frac{ad}{bc} \]
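Both measures follow directly from the four cells of the 2 × 2 table, with a, b in the exposed row and c, d in the unexposed row. The counts below are hypothetical and the function names are illustrative.

```python
def relative_risk(a, b, c, d):
    """RR = [a / (a + b)] / [c / (c + d)]: risk in exposed over risk in unexposed."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """OR = (a / c) / (b / d) = ad / bc."""
    return (a * d) / (b * c)


# Hypothetical table: exposed 30 diseased / 70 not; unexposed 10 diseased / 90 not.
a, b, c, d = 30, 70, 10, 90
rr = relative_risk(a, b, c, d)
oratio = odds_ratio(a, b, c, d)
print(round(rr, 3), round(oratio, 3))
```

When the disease is rare in both groups, the odds ratio is close to the relative risk; in this hypothetical table the disease is common, so the two differ noticeably.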
Worked Examples
Example 1:
Blood pressure levels were measured in 100 diabetic and
100 non-diabetic women aged 40 – 49 years. Mean
systolic blood pressures were 146.4 mm Hg (with standard
deviation of 18.5) among the diabetics and 140.4 mm Hg
(with standard deviation of 16.8) among the non-diabetics.
By making the necessary assumptions, calculate the 95%
confidence interval for the difference of means of the blood
pressures of the two groups of women.
Solution:
Assume that the blood pressures of each of the two groups
of women are normally distributed. Hence, assume that
the difference of means of the blood pressures is also
normally distributed.
Given: 100(1 − α)% = 95%
⇒ 1 − α = 0.95
⇒ α = 0.05
⇒ α/2 = 0.025
The formula for the 100(1 − α)% CI for the difference of two means is:
\[ \bar{X}_1 - \bar{X}_2 \pm Z_{\alpha/2} \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \]
This is true since n1 = n2 = 100 are considered to be large values.
\[ 146.4 - 140.4 \pm 1.96 \sqrt{\frac{18.5^2}{100} + \frac{16.8^2}{100}} \]
i.e. 6 ± 1.96 × 2.498979792
i.e. 6 ± 4.898
(1.102, 10.898)
∴ The 95% confidence interval for the difference of means is: 1.1 to 10.9
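The arithmetic of this example can be reproduced with a few lines of Python. The sketch takes the normal quantile from `statistics.NormalDist` instead of the rounded table value 1.96, so the limits agree with the hand calculation to about three decimal places. The function name is illustrative.

```python
import math
from statistics import NormalDist


def ci_diff_means(m1, s1, n1, m2, s2, n2, level=0.95):
    """Large-sample CI: (m1 - m2) +/- Z(alpha/2) * sqrt(s1^2/n1 + s2^2/n2)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    std_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    diff = m1 - m2
    return diff - z * std_error, diff + z * std_error


# Values from Example 1: diabetics versus non-diabetics.
low, high = ci_diff_means(146.4, 18.5, 100, 140.4, 16.8, 100)
print(round(low, 1), round(high, 1))
```

Since the interval excludes zero, the data suggest a higher mean systolic blood pressure among the diabetic women.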
Example 2:
A team of medical researchers wished to measure
the weight gained by users of oral
contraceptives. The weights of 12 women were
taken before and after one year's use of the
contraceptive. Unfortunately, one of the women
died before the end of the year, and therefore
there was no result for her (this is indicated by *
in the data set). Estimate what the weight of the
woman who died would have been at the end of
the experiment.
Weights of Women
Before (X)    After (Y)
50            61
55            61
60            59
65            71
70            80
75            76
79.5          *
80            90
85            106
90            98
95            100
100           114
Complete the table:
x       y       x²       y²       xy
50      61      2500     3721     3050
55      61      3025     3721     3355
60      59      3600     3481     3540
65      71      4225     5041     4615
70      80      4900     6400     5600
75      76      5625     5776     5700
80      90      6400     8100     7200
85      106     7225     11236    9010
90      98      8100     9604     8820
95      100     9025     10000    9500
100     114     10000    12996    11400
Total
825     916     64625    80076    71790
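Using the result of the table we get
\[ \hat{\beta} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2} = 1.1236 \]
\[ \hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X} = -0.9973 \]
\[ \therefore \hat{Y} = \hat{\alpha} + \hat{\beta} X = -0.9973 + 1.1236 X \]
Hence, when X = 79.5 we have
\[ \hat{Y} = -0.9973 + 1.1236 \times 79.5 = 88.3289 \]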
That is, the estimated weight of the woman that died (after one year) would
have been 88.33kg.
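The fitted line can be checked with a short script that uses the eleven complete pairs from the table. Variable names are illustrative. Carried at full precision, the slope is about 1.12364 and the intercept comes out at about −1.0000; the value −0.9973 arises when the slope is first rounded to 1.1236. The prediction at X = 79.5 rounds to 88.33 kg either way.

```python
# Complete (before, after) pairs from the table; the woman with X = 79.5 is excluded.
x = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
y = [61, 61, 59, 71, 80, 76, 90, 106, 98, 100, 114]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(a * a for a in x)
sum_xy = sum(a * b for a, b in zip(x, y))

# Least-squares slope and intercept from the computational formulas.
beta = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
alpha = sum_y / n - beta * sum_x / n

predicted = alpha + beta * 79.5
print(round(beta, 4), round(alpha, 4), round(predicted, 2))
```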
Conclusion: Based on the given data we shall conclude that the mean of
the population from which the sample came is not 120.
Exercise:
At admission two groups of women on two different family planning methods in
clinical trials show the following characteristics.
Height (cm)
Cycloprovera 155.86 5.17 42
HRP 102 155.83 6.39 48
Age (years)
Cycloprovera 27.71 4.10 42
HRP 102 28.46 4.66 48