Biostatistics in Public Health Using STATA-2016
Striking a balance between theory, application, and programming, Biostatistics in
Public Health Using STATA is a user-friendly guide to applied statistical analysis in
public health using STATA version 14. The book supplies public health practitioners
and students with the opportunity to gain expertise in the application of statistics in
epidemiologic studies.
The book shares the authors’ insights gathered through decades of collective experience
teaching in the academic programs of biostatistics and epidemiology. Maintaining a
focus on the application of statistics in public health, it facilitates a clear understanding
of the basic commands of STATA for reading and saving databases.
Each chapter is based on one or more research problems linked to public health.
Additionally, every chapter includes exercise sets for practicing concepts and exercise
solutions for self or group study. Several examples are presented that illustrate the
applications of the statistical method in the health sciences using epidemiologic study
designs.
For readers new to STATA, the first three chapters should be read sequentially, as
they form the basis of an introductory course to this software.
Erick L. Suárez
Cynthia M. Pérez
Graciela M. Nogueras
Camille Moreno-Gorrín
ISBN: 978-1-4987-2199-8
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2016 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that
provides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To our loved ones
Contents
Preface
Acknowledgments
Authors
1 Basic Commands
1.1 Introduction
1.2 Entering Stata
1.3 Taskbar
1.4 Help
1.5 Stata Working Directories
1.6 Reading a Data File
1.7 insheet Procedure
1.8 Types of Files
1.9 Data Editor
2 Data Description
2.1 Most Useful Commands
2.2 list Command
2.3 Mathematical and Logical Operators
2.4 generate Command
2.5 recode Command
2.6 drop Command
2.7 replace Command
2.8 label Command
2.9 summarize Command
2.10 do-file Editor
2.11 Descriptive Statistics and Graphs
2.12 tabulate Command
3 Graph Construction
3.1 Introduction
3.2 Box Plot
3.3 Histogram
3.4 Bar Chart
Erick L. Suárez
University of Puerto Rico
Cynthia M. Pérez
University of Puerto Rico
Graciela M. Nogueras
MD Anderson Cancer Center
Camille Moreno-Gorrín
University of Puerto Rico
Acknowledgments
Authors
Chapter 1
Basic Commands
1.1 Introduction
Stata is a computer program designed to perform various statistical procedures.
Among the basic statistical procedures that can be performed are the following:
calculation of summary measures, construction of graphs, and description of frequency
distributions using contingency tables. Furthermore, using Stata, you can perform
parameter estimation in generalized linear models and survival analysis models using
uncorrelated and correlated data. The program can also perform arithmetic
operations on matrices. Its ability to export and import databases in the Excel
format gives Stata great versatility. This program is regularly used in biostatistics
courses in public health schools in different countries. It is also often cited as one
of the main programs used for statistical analysis in scientific publications related
to public health research.
This chapter will provide an introduction to the Stata program, version 14.0.
We assume that readers of this book have a basic knowledge of both biostatistics
and epidemiology.
1.3 Taskbar
The taskbar provides common access to all window-based program commands, such
as File, Edit, Data, Graphics, and Statistics; these options can be found at the upper part
of the main window. The most frequently used icon is the Data Editor icon, with which
it is possible to enter values and identify the variables in a given project. The Graphics
button provides access to the window used to generate different types of graphs. The
Statistics option allows the user to perform statistical procedures through
the execution of the commands. Below the taskbar are icons that allow the user to open,
save, and print, along with icons that facilitate the observation of graphics (Figure 1.2).
1.4 Help
One of the most useful attributes of Stata is its support system, which allows the
user to find the commands and their ways of execution, according to that user’s
specific needs. The help menu can be accessed by clicking on the “New Viewer”
icon on the toolbar or by typing either help or the letter h in the command area
and following that with a keyword that represents the topic about which the user
requires more information (see Figure 1.3).
For example, to obtain help on the analysis of variance (ANOVA), type either
help anova
or
h anova
Upon entering those commands, a specific window for ANOVA will appear (see
Figure 1.4).
It is important to keep the working files in a directory that is different from the
default directory that Stata assigns, because during the regular program updates
files located in the default directory may be removed.
To create a particular directory, the mkdir command must be used; the cd command
is then used to navigate to that directory. The sequence of commands to create a
directory is as follows:
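A minimal sketch of such a sequence is given here. The directory name students is borrowed from the example discussed in this section; the capture prefix (which lets the sequence be re-run without an error if the directory already exists) and the pwd check are illustrative additions rather than part of the original sequence.

```stata
* Create a working directory below the current one (the name is illustrative)
capture mkdir "students"

* Move into the new directory
cd "students"

* Display the current working directory to confirm the change
pwd
```

If the directory should be created elsewhere, a full path can be supplied to both mkdir and cd.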
To use Stata in the new working directory, you need to restart the program
and immediately move to the desired directory. For example, assuming that the
name of the working directory is “students” and assuming, as well, that this
directory is located in your computer’s Documents folder, the following will take
you to that folder:
cd "/Users/Documents/students"
For the latter, on the other hand, it is necessary to click the Open icon and
browse to the folder that contains the working file. The describe command can be
used to view the information contained in the data file, which might include the
number of observations, the number of variables, and the file size, among others, as
shown below (assuming that the active database being used contains the anthropometric
measurements of 10 subjects):
describe
Output
. describe
Contains data
obs: 10
vars: 5
size: 200
-----------------------------------------------------------------------------
storage display value
variable name type format label variable label
-----------------------------------------------------------------------------
var1 float %9.0g
var2 float %9.0g
var3 float %9.0g
var4 float %9.0g
var5 float %9.0g
-----------------------------------------------------------------------------
The clear option that has been placed after the comma (above) is used to
clear the memory if another database was being used. Stata does not open
a database while another one with unsaved changes is still open. The clear command
can also be used in Stata to remove a database from memory, therefore clearing the
way to use a new one.
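The behavior described above can be sketched as follows. The auto dataset that is shipped with Stata is used here only as a stand-in, because the data file referred to in this section is not available; with your own file, the first line would take the form use "filename.dta", clear, where filename.dta is a hypothetical name.

```stata
* Load a dataset shipped with Stata, discarding any data already in memory
sysuse auto, clear

* Show a brief description of the data now in memory
describe, short

* Remove the dataset from memory, clearing the way for a new one
clear
```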
.ado programs
.gph graphs
To access the Data Editor window (Figure 1.5), click the “Edit” icon on
the taskbar located in the main window.
At the beginning of the data entry process, the program automatically assigns a
name to the column that defines each variable (var1, var2, …, vark). This name can
be changed in the Variables Manager window after clicking the Data Editor icon,
using the box “Name” (Figure 1.6). To return to the main window of Stata, you
close or minimize the Data Editor window.
Constructing a user-friendly database requires that each variable be named in
such a way as to be easy to identify. This can be done using the “Label” box in the
properties window. When building a database, it is possible for the values assigned to
the variables to be represented by codes. The coding of the variables can be done using
the “Value Label” option. With this option you can assign numerical values to
alphanumeric variables, thereby allowing better management of the database. This coding
can be done in the Variables Manager window. The steps to do this are as follows:
1. Click “Manage” in the Variables Manager window, and a new window appears
(Figure 1.7). Then click “Create Label” to assign each code a label.
2. After creating the value labels, return to the Variables Manager window, in
which you will be able to assign labels to each variable in the “Label” box (if they
were not assigned previously in the Properties window) (Figure 1.8).
To continue working in Stata after having created a database, the user needs to
ensure that the data have been saved. To that end, the user will need to assign a
name to the file to continue working on the database. Clicking on “File” (on the
toolbar) followed by “Save As” (on the subsequent dropdown menu) begins this
process. After that, select the working folder or directory and assign a name to the
database. The default file extension is .dta.
Chapter 2
Data Description
Output
. list in 5/10
+----------------------------------+
| var1 var2 var3 var4 var5 |
|----------------------------------|
5. | 5 45 56 1.52 1 |
6. | 6 36 87 1.46 1 |
7. | 7 30 78 1.44 1 |
8. | 8 29 77 1.56 1 |
9. | 9 27 67 1.52 0 |
|----------------------------------|
10. | 10 29 63 1.52 1 |
+----------------------------------+
Symbol Definition
Usually, these operators are associated with the conditional qualifier if for specific
variables. For example, to display only those observations in which the age is below
30, the command line is as follows:
Output
+--------------------------+
| id age weikg heimt |
|--------------------------|
1. | 1 28 59 1.55 |
3. | 3 25 76 1.6 |
4. | 4 26 65 1.78 |
8. | 8 29 77 1.56 |
9. | 9 27 67 1.52 |
|--------------------------|
10. | 10 29 63 1.52 |
+--------------------------+
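A minimal sketch of this kind of conditional listing is shown here. The data are typed in from the ten-subject listing that appears in this chapter; entering them with the input command is an assumption made so that the example is self-contained.

```stata
* Enter the ten-subject anthropometric data used in this chapter
clear
input id age weikg heimt sex
 1 28 59 1.55 0
 2 32 35 1.35 0
 3 25 76 1.60 0
 4 26 65 1.78 0
 5 45 56 1.52 1
 6 36 87 1.46 1
 7 30 78 1.44 1
 8 29 77 1.56 1
 9 27 67 1.52 0
10 29 63 1.52 1
end

* List only the observations in which age is below 30
list id age weikg heimt if age < 30

* Count how many observations satisfy the condition
count if age < 30
```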
The asterisk symbol (*) is also used to insert comments in Stata programs;
for example:
+--------------------------------+
| id age weikg heimt sex |
|--------------------------------|
1. | 1 28 59 1.55 0 |
2. | 2 32 35 1.35 0 |
3. | 3 25 76 1.6 0 |
4. | 4 26 65 1.78 0 |
5. | 5 45 56 1.52 1 |
|--------------------------------|
6. | 6 36 87 1.46 1 |
7. | 7 30 78 1.44 1 |
8. | 8 29 77 1.56 1 |
9. | 9 27 67 1.52 0 |
10. | 10 29 63 1.52 1 |
+--------------------------------+
To compute and display the bmi of each participant, the following commands are
executed:
You can see that a new variable, named bmi, has been created as a result of using
the list command:
Output
+---------------+
| id bmi |
|---------------|
1. | 1 24.55775 |
2. | 2 19.20439 |
3. | 3 29.6875 |
4. | 4 20.51509 |
5. | 5 24.23823 |
|---------------|
6. | 6 40.81441 |
7. | 7 37.61574 |
8. | 8 31.64037 |
9. | 9 28.99931 |
10. | 10 27.26801 |
+---------------+
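The computation behind this output can be sketched as follows, assuming that bmi is defined as weight in kilograms divided by the square of height in meters (which reproduces the values listed above) and that the data are entered with the input command.

```stata
* Enter the weight (kg) and height (m) of the ten subjects
clear
input id weikg heimt
 1 59 1.55
 2 35 1.35
 3 76 1.60
 4 65 1.78
 5 56 1.52
 6 87 1.46
 7 78 1.44
 8 77 1.56
 9 67 1.52
10 63 1.52
end

* Body mass index: weight divided by height squared
generate bmi = weikg / heimt^2

* Display the new variable
list id bmi
```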
gen bmig=bmi
recode bmig 18.5/24.9=1 25/29.9=2 30/max=3
list id bmig
Output
+-----------+
| id bmig |
|-----------|
1. | 1 1 |
2. | 2 1 |
3. | 3 2 |
4. | 4 1 |
5. | 5 1 |
|-----------|
6. | 6 3 |
7. | 7 3 |
8. | 8 3 |
9. | 9 2 |
10. | 10 2 |
+-----------+
drop bmi
After the list command, the results will be the same as that reported with the replace
command.
In addition, the label command describes the categories of the variables by combining
the label define and label values commands. The label define command is used to create a
set of labels for the different codes of a variable. Then, the label values command
is used to attach the labels created with the label define command to the categories
of one variable. For example, the command lines that are used to label the codes of the
variables sex and bmig are as follows:
After using the list command, the following output will be displayed:
+----------------------------+
| id sex bmig |
|----------------------------|
1. | 1 Male Overweight |
2. | 2 Male Normal |
3. | 3 Male Overweight |
4. | 4 Male Normal |
5. | 5 Female Normal |
|----------------------------|
6. | 6 Female Obese |
7. | 7 Female Obese |
8. | 8 Female Obese |
9. | 9 Male Overweight |
10. | 10 Female Overweight |
+----------------------------+
If you want to eliminate a label that was previously defined, the label drop
command must be used, as follows:
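The labeling steps discussed in this section can be sketched as follows. The label names sexlbl and bmiglbl are hypothetical, and the coding (0 = Male, 1 = Female; 1 = Normal, 2 = Overweight, 3 = Obese) is inferred from the listings in this chapter.

```stata
* Enter the data and build the bmi categories as in the previous sections
clear
input id weikg heimt sex
 1 59 1.55 0
 2 35 1.35 0
 3 76 1.60 0
 4 65 1.78 0
 5 56 1.52 1
 6 87 1.46 1
 7 78 1.44 1
 8 77 1.56 1
 9 67 1.52 0
10 63 1.52 1
end
generate bmi = weikg / heimt^2
generate bmig = bmi
recode bmig 18.5/24.9=1 25/29.9=2 30/max=3

* Define the labels (label names are illustrative)
label define sexlbl 0 "Male" 1 "Female"
label define bmiglbl 1 "Normal" 2 "Overweight" 3 "Obese"

* Attach the labels to the variables and list the result
label values sex sexlbl
label values bmig bmiglbl
list id sex bmig

* Detach the label from bmig, then delete its definition
label values bmig .
label drop bmiglbl
```

Detaching the label first (label values bmig .) leaves the variable with its numeric codes, and label drop then removes the definition itself.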
The detail option can be written at the end of the command line (after a comma) to
obtain more detailed information about quantitative variables in the database.
For example, assuming we want the detailed information of the distribution of the
variable bmi, the following command line can be used:
sum bmi, detail
Output
bmi
---------------------------------------------------------------
Percentiles Smallest
1% 19.20439 19.20439
5% 19.20439 20.51509
10% 19.85974 24.23823 Obs 10
25% 24.23823 24.55775 Sum of Wgt. 10
distribution. Based on the iqr (interquartile range), the output indicates that 50% of
the bmi around the median value is not greater than 7.4.
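As a hedged illustration, the interquartile range can be recovered from the results that summarize stores when the detail option is used. The data entry and the local macro name iqr are assumptions made for this sketch.

```stata
* Enter the data and compute bmi
clear
input id weikg heimt
 1 59 1.55
 2 35 1.35
 3 76 1.60
 4 65 1.78
 5 56 1.52
 6 87 1.46
 7 78 1.44
 8 77 1.56
 9 67 1.52
10 63 1.52
end
generate bmi = weikg / heimt^2

* Detailed summary; percentiles are stored in r(p25), r(p50), r(p75), ...
summarize bmi, detail

* Interquartile range: 75th percentile minus 25th percentile
local iqr = r(p75) - r(p25)
display "IQR of bmi = " %5.2f `iqr'
```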
tab bmig
Output
In this example, 30% of the study group was categorized as being obese and 40%
as being normal.
The tab command can be used to report contingency tables that, in turn, can be
used to report the frequency distribution, with the option of including percentages
by column and row. For example, to describe the association between the variables
bmig and sex (see the previous database), use the tab command, as follows:
tab bmig sex, co
Output
+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+
| sex
bmig | Male Female | Total
----------+----------------------+----------
Normal | 2 1 | 3
| 40.00 20.00 | 30.00
----------+----------------------+----------
Overweight| 3 1 | 4
| 60.00 20.00 | 40.00
----------+----------------------+----------
Obese | 0 3 | 3
| 0.00 60.00 | 30.00
----------+----------------------+----------
Total | 5 5 | 10
| 100.00 100.00 | 100.00
The results show that 80% of women are categorized as being either overweight
or obese, while 60% of men are categorized as being overweight, with none being
categorized as being obese. Only 30% of the subjects (both sexes) are categorized as
being of normal weight.
Chapter 3
Graph Construction
Aim: Upon completing the chapter, the learner should be able to create
the graphs that are most commonly used for data description.
3.1 Introduction
To create a graph, we click on the Graphics option on the taskbar (Figure 3.1).
After we do this, the following dropdown menu appears, listing a series of possible
graphs that can be constructed.
Afterward, the user clicks the type of graph or plot needed; a new window
with the different specifications available for this type of graph will be displayed.
Once the specifications are provided, the user must choose one of the following two
options to obtain the desired graph: Submit or OK. If Submit is
chosen, the requested graph will be displayed and the dialog window will remain
open (enabling the user to explore other specifications); choosing OK also brings up the
requested graph, but the dialog window closes.
[Figure: box plot of bmi (vertical axis: bmi, from 20 to 40)]
3.3 Histogram
Another commonly used chart is the histogram, which shows the frequency
distribution of the variable of interest using abutting rectangles, and in which the
height of each rectangle corresponds to the frequency of subjects within certain
limits of the variable (these limits are the base of each rectangle). For example, to
create a histogram of the variable with four rectangles using the Graphics window,
the user needs to click the Histogram option and write the name of the variable,
bmi (Figure 3.4). At this point, the user has the option of specifying the number of
rectangles in the space labeled Number of bins and, in addition, has the option to
include the normal density plot (see Figure 3.5).
The normal option will show a curve of the normal probability distribution over
the histogram. This helps the user judge how far the distribution of the variable of
interest departs from the normal distribution.
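The same histogram can also be requested from the command line. The following is a sketch under the assumption that the ten-subject data of Chapter 2 are used; the percent option is chosen because the vertical axis of the figure is expressed in percent.

```stata
* Enter the data and compute bmi
clear
input id weikg heimt
 1 59 1.55
 2 35 1.35
 3 76 1.60
 4 65 1.78
 5 56 1.52
 6 87 1.46
 7 78 1.44
 8 77 1.56
 9 67 1.52
10 63 1.52
end
generate bmi = weikg / heimt^2

* Histogram with four bins, percentages on the vertical axis,
* and an overlaid normal density curve
histogram bmi, bin(4) percent normal
```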
[Figure: histogram of bmi (vertical axis: Percent, from 0 to 40; horizontal axis: bmi, from 20 to 40)]
The results show that the mean of the bmi in women is higher than it is in men.
The next chapter will demonstrate the procedure that is used to determine whether
this sort of difference is statistically significant.
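A bar chart of this kind can be sketched from the command line as follows. The use of the over() option and the label name sexlbl are assumptions; the figures in this section may have been produced with the by() option or through the Graphics menu instead.

```stata
* Enter the data and compute bmi
clear
input id weikg heimt sex
 1 59 1.55 0
 2 35 1.35 0
 3 76 1.60 0
 4 65 1.78 0
 5 56 1.52 1
 6 87 1.46 1
 7 78 1.44 1
 8 77 1.56 1
 9 67 1.52 0
10 63 1.52 1
end
generate bmi = weikg / heimt^2
label define sexlbl 0 "Male" 1 "Female"
label values sex sexlbl

* One bar per sex; the height of each bar is the mean of bmi
graph bar (mean) bmi, over(sex)
```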
[Figures: bar charts of the mean of bmi by sex (panels: Male, Female; vertical axis from 0 to 40)]
Another type of bar chart that the user might want to create is one in which the
standard deviation is added (see Figure 3.8). The next sequence of commands can
be used for this purpose:
sort sex
gen mbmi=bmi
gen sbmi=bmi
collapse (mean) mbmi (sd) sbmi, by(sex)
gen hbmi = mbmi + sbmi
twoway (bar mbmi sex) (rcap hbmi mbmi sex), yscale(range(0 40)) ///
    xlabel(none) by(sex, noxrescale) by(, legend(off))
The collapse command is used to summarize a set of data using statistics, such as
mean, sum, median, and percentiles. These statistics can be computed overall or for
each category of specific variables, previously sorted. In the last sequence of com-
mands, we computed the mean and standard deviation of the variable bmi for each
category of variable sex. The twoway command is used to create different plots in
the same graph. In the previous example, we used bar for graph bars and rcap for
capped spikes in the same graph.
Chapter 4
Significance Tests
Aim: Upon completing the chapter, the learner should be able to perform
significance tests that are concerned with the expected values of
continuous random variables.
4.1 Introduction
Classical statistical tests are performed to compare the expected values of a random
variable, under the assumption that these values are constant parameters of the tar-
get population. The Bayesian approach assumes that these parameters are another
random variable. In this chapter we will concentrate our analysis using classical
statistical tests for comparing the expected values of a continuous random variable
in two independent groups.
The classical statistical tests are based on the initial formulation of two comple-
mentary hypotheses that are related to the parameters of the target population;
these hypotheses are the null and the alternative hypotheses. The null hypothesis,
denoted by H0, is the hypothesis that is to be tested. The alternative hypothesis, usu-
ally denoted by Ha, is the hypothesis that contradicts the null hypothesis (Rosner,
2010); usually, the alternative hypothesis will be related to a research hypothesis.
To assess the null hypothesis, a sample of data is collected to compute a test statis-
tic for supporting a decision in favor of or against the H0; there are four possible
outcomes:
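Decision based on the sample data | H0 true | H0 false
Do not reject H0 | Correct decision (probability 1 − α) | Type II error (probability β)
Reject H0 | Type I error (probability α) | Correct decision (probability 1 − β, the statistical power)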
The general aim in hypothesis testing is to use statistical tests that make α and β
as small as possible. Typically, the evidence against H0 is determined with a
significance level less than or equal to 5%, while a statistical power of 80% or higher is
considered adequate.
The significance level can be defined prior to performing the test; when this is
done, two regions for the test statistics are defined: the acceptance region (evidence
for accepting H0) and the rejection region (evidence against the null hypothesis).
However, the output of the statistical programs usually shows the probability (called
P-value) for each statistical test. P-value is defined as the probability of obtaining a
test statistic as extreme as or more extreme than the test statistic actually obtained,
given that the null hypothesis is true. As a consequence, the P-value is interpreted
as α level at which the given value of the test statistic is on the borderline between
the acceptance and rejection regions (Rosner, 2010). In Stata, the P-value will be
presented according to the test statistic used and the probability distribution assumed
for this statistic; for example, assuming the Student’s t-test statistic with t-probability
distribution, the output for identifying the P-value will be expressed as Pr(T > t).
To interpret P-values, we can use one of the following statements (Rosner, 2010):
If the P-value ≥ .05, then the results are considered not statistically significant.
If .01 < P-value < .05, then the results are significant.
If .001 < P-value ≤ .01, then the results are highly significant.
If the P-value ≤ .001, then the results are very highly significant.
However, if .05 ≤ P-value < .10, then a trend toward statistical significance is
sometimes noted.
Output
The results above do not provide evidence against the null hypothesis for any of the
variables (P-value > .05), with the exception of the variable age (P-value = .0259).
Output
. sdtest bmi, by(sex)
For each variable (in the above case, sex), a table displays a description of the sum-
mary measures in each category of that variable: Obs (number of observations),
Mean, Std. Err. (standard error), Std. Dev. (standard deviation), and 95% Conf.
Interval (the 95% confidence interval is used to estimate the expected value of the
random variables). In the above table, the user can see that the standard deviation
of the variable bmi among males is 4.77 (variance = 22.75), while among females
it is 6.92 (variance = 47.74); so the estimated ratio of the variances is 0.4754 (male/
female). If the variances are equal in these two groups, the expected value of this
ratio must be 1 (Rosner, 2010). Near the bottom of the table, the user can see that
the null hypothesis is “H0: ratio = 1.” The alternative hypothesis is expressed in
three ways: Ha: ratio < 1, Ha: ratio != 1 (different than 1), and Ha: ratio > 1; it is
recommended that only the second alternative hypothesis (ratio is different from 1)
be considered, if the purpose is assessing the variance homogeneity. Below each
alternative hypothesis, the corresponding P-values are presented. Only for the vari-
able height does the statistical evidence not support the assumption of variance
homogeneity (P-value = .0468).
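As a sketch of how the variance ratio test can be run and its stored results inspected, the following uses the ten-subject data listed in Chapter 2 (an assumption; the values printed in this chapter may come from a slightly different version of the data, so small differences in the last decimals are possible).

```stata
* Enter the data and compute bmi
clear
input id weikg heimt sex
 1 59 1.55 0
 2 35 1.35 0
 3 76 1.60 0
 4 65 1.78 0
 5 56 1.52 1
 6 87 1.46 1
 7 78 1.44 1
 8 77 1.56 1
 9 67 1.52 0
10 63 1.52 1
end
generate bmi = weikg / heimt^2

* Variance ratio test of bmi between the two sex groups
sdtest bmi, by(sex)

* Ratio of the variances (first group / second group) and two-sided P-value
display "F = " %6.4f r(F) "   P-value (Ha: ratio != 1) = " %6.4f r(p)
```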
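$$t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\mathrm{Var}(\bar{Y}_1 - \bar{Y}_2)}} \sim t_k \mid H_0$$
where:
$\bar{Y}_i$ indicates the sample mean of the variable bmi for the ith group
$t_k$ is the t-probability distribution with k degrees of freedom
$\mathrm{Var}(\bar{Y}_1 - \bar{Y}_2)$ is the variance of $(\bar{Y}_1 - \bar{Y}_2)$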
To compute the P-value, it is assumed that this expression follows the t-probability
distribution under the null hypothesis assumption. To perform this kind of t-test,
the user can utilize the ttest command. The specifications for this command can
change depending on the structure of the database. For example, assuming that the
previous database is being used and that the aim of the user is to compare the expected
value of the variable bmi by sex group under the assumption of variance homogeneity,
the command line for performing Student’s t-test is as follows:
Output
The above table is the same as the one described by the sdtest command. However,
the null hypothesis formulated below this table is different. The null hypothesis
states that the expected bmi value is the same for both sexes (μMale = μFemale).
In Stata notation, this hypothesis is formulated as follows: diff =
mean(Male) − mean(Fem) = 0. The alternative hypotheses that can be assessed
are Ha: diff < 0, Ha: diff != 0 (different from zero), and Ha: diff > 0. Assuming
that the research hypothesis is that males have a lower mean body mass index than
females do, the user has to assess the P-value below the first alternative hypothesis
(one-tailed alternative hypothesis), with the result indicating that there is
statistical evidence against the null hypothesis (P-value = .037); this finding suggests
that the expected bmi in males is lower than the expected bmi in females. If the
research hypothesis is that males have a different mean body mass index than females
do, then the user has to assess the P-value below the second alternative hypothesis
(two-tailed alternative hypothesis), with this result indicating that there is
not enough statistical evidence against the null hypothesis at the 5% level
(P-value = .074); this finding does not allow the user to conclude that the expected
bmi in males differs from the expected bmi in females.
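The test discussed above can be sketched as follows, again assuming the ten-subject data of Chapter 2 with sex coded 0 for males and 1 for females. The stored results r(p_l) and r(p) correspond to the first (diff < 0) and second (diff != 0) alternative hypotheses.

```stata
* Enter the data and compute bmi
clear
input id weikg heimt sex
 1 59 1.55 0
 2 35 1.35 0
 3 76 1.60 0
 4 65 1.78 0
 5 56 1.52 1
 6 87 1.46 1
 7 78 1.44 1
 8 77 1.56 1
 9 67 1.52 0
10 63 1.52 1
end
generate bmi = weikg / heimt^2

* Two-sample Student's t-test assuming equal variances
ttest bmi, by(sex)

* One-tailed (diff < 0) and two-tailed (diff != 0) P-values
display "Pr(T < t) = " %6.4f r(p_l) "   Pr(|T| > |t|) = " %6.4f r(p)
```

If variance homogeneity is not tenable, the unequal option of ttest requests the version of the test that does not assume equal variances.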
Output
As can be seen above, the results of this test do not provide statistical evidence that
the frequency distributions of the bmi differ between the sexes (P-value = .1172). This
interpretation is consistent with that of Student’s t-test for the two-tailed alternative
hypothesis. An extensive review of the parametric and nonparametric statistical
procedures can be found in the book of Sheskin (2007).
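The name of the nonparametric test does not appear in the text above. The sketch below assumes that it is the Wilcoxon rank-sum (Mann-Whitney) test, which Stata provides through the ranksum command, and computes the two-sided P-value from the stored z statistic.

```stata
* Enter the data and compute bmi
clear
input id weikg heimt sex
 1 59 1.55 0
 2 35 1.35 0
 3 76 1.60 0
 4 65 1.78 0
 5 56 1.52 1
 6 87 1.46 1
 7 78 1.44 1
 8 77 1.56 1
 9 67 1.52 0
10 63 1.52 1
end
generate bmi = weikg / heimt^2

* Wilcoxon rank-sum test of bmi between the two sex groups
ranksum bmi, by(sex)

* Two-sided P-value from the normal approximation
local p = 2 * normal(-abs(r(z)))
display "P-value = " %6.4f `p'
```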
After clicking the Submit option, the output will be as seen below:
Study parameters:
alpha = 0.0500
power = 0.8000
delta = 7.7000
m1 = 24.6000
m2 = 32.3000
sd1 = 4.7700
sd2 = 6.9100
N = 22
N per group = 11
Figure 4.4 Sample size for two-sample means tests with several power levels and
allocation ratios.
[Figure: total sample size versus power (1-β) of 0.8, 0.85, and 0.90, for allocation ratios (N2/N1) of 1, 2, and 3. Parameters: α = .05, δ = 7.6, μ1 = 25, μ2 = 32, σ = 5]
Figure 4.5 Sample size for two-sample means tests using graph option.
When the graph option is used, the output will be as seen in Figure 4.5.
The total sample size requested will increase when the statistical power is
increased; however, the changes in the sample size will depend on the alloca-
tion ratio.
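The dialog-based procedure has a command-line counterpart in the power twomeans command. The sketch below reuses the study parameters shown in the output above (means of 24.6 and 32.3, standard deviations of 4.77 and 6.91); the particular power levels and allocation ratios in the second command are taken from the figure and are otherwise illustrative.

```stata
* Sample size for comparing two means with 80% power and alpha = 0.05
power twomeans 24.6 32.3, sd1(4.77) sd2(6.91) power(0.8)

* Several power levels and allocation ratios, displayed as a graph
power twomeans 24.6 32.3, sd1(4.77) sd2(6.91) ///
    power(0.8 0.85 0.9) nratio(1 2 3) graph
```

Replacing the graph option with table displays the same results in tabular form.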
Chapter 5
Linear Regression Models
Aim: Upon completing the chapter, the learner should be able to use
simple, multiple, and polynomial linear regression models for estimating
the expected values of a continuous random variable.
5.1 Introduction
A simple linear regression model (SLRM) is a statistical technique that attempts to
model the relationship between two variables. One of these variables is the main
outcome of interest and is a quantitative random variable, usually denoted with
the letter Y and called the response or dependent variable. The second one can also
be quantitative and is used to explain the behavior of the expected values of Y; it
is usually denoted with the letter X and is called the predictor, explanatory, or
independent variable. The relationship between these variables, when X is a quantitative
variable, is established using the following expression:
$$\mu_{y_i \mid x_i} = \beta_0 + \beta_1 x_i$$
where:
$\mu_{y_i \mid x_i}$ defines the expected value of the random variable Y given the predictor
variable X for the ith subject
$\beta_1$ is a constant parameter associated with the predictor variable X; it is known
as the slope of the regression line and indicates the change in the expected
value of Y per unit of change in X
$\beta_0$ is a constant parameter that indicates the expected value of Y when $x_i = 0$;
it is known as the intercept of the regression line
A simple regression model can also be expressed with the following formula:
$$y_i = \beta_0 + \beta_1 x_i + e_i$$
where:
$y_i$ indicates the response or dependent variable for the ith subject
$e_i$ denotes the residual, which is the difference between the observed value $y_i$ and
the expected value under the model, $\beta_0 + \beta_1 x_i$, for the ith subject, as follows:
$$e_i = y_i - (\beta_0 + \beta_1 x_i)$$
$$Y \sim N(\beta_0 + \beta_1 x_i, \sigma^2_{Y|X})$$
The expected value of the random variable Y is a straight-line function of X.
2. There is independence between the response variable values.
3. The independent or predictor variable is a quantitative variable, not necessar-
ily a random one.
4. The βi coefficients should not be affected by any power, other than the unit,
or by any trigonometric function.
5. The expected value of the residuals is zero, that is,
$E(e_i) = 0$
6. The variance of the residuals is constant and is equal to the variance of the
response variable under the SLRM, that is,
$\operatorname{var}(e_i) = \sigma^2_{Y|X}$
$E(e_i e_j) = 0$ for all $i \neq j$
The residual associated with a subject does not affect the residual of another
subject.
Linear Regression Models ◾ 43
$e_i \sim N(0, \sigma^2_{Y|X})$
where:
yi indicates the observed value of the y variable for the ith subject
$\hat{y}_i$ indicates the estimated expected value of Y under the model for the ith subject
using a specific combination of the estimated beta coefficients, as follows:
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
$t = \dfrac{\hat{\beta}_1}{\sqrt{\operatorname{Var}(\hat{\beta}_1)}} \sim t_{n-2} \mid H_0$
in which $\hat{\beta}_1$ indicates the estimated $\beta_1$, $t_{n-2}$ is the t-probability distribution with n − 2
degrees of freedom under the null hypothesis, and $\operatorname{Var}(\hat{\beta}_1)$ is the variance of $\hat{\beta}_1$. The
P-value is computed assuming that this statistic follows the t-probability distribution
under the null hypothesis.
SS divided by their degrees of freedom are the mean squares (MS). The follow-
ing ANOVA table summarizes the sources of variation in the data, SS due to the
source, degrees of freedom in the source, MS due to the source, and the expected
value of the MS (Draper and Smith, 1998):
Source of Variation   SS    df   MS      E[MS]
Regression            SSR   1    SSR/1   $\sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$
$R^2 = \left(1 - \dfrac{SSE}{TSS}\right) \times 100\%$
This coefficient is used as a criterion to compare two or more models; the higher the
$R^2$, the better the model fits the data.
$r = \dfrac{SSXY}{\sqrt{SSX \cdot TSS}}$
where:
$SSXY = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$
and
$SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2$
$r = \operatorname{sign}(\hat{\beta}_1)\sqrt{R^2}$
To assess whether the Pearson correlation coefficient is different from zero (H0: ρ = 0),
with data from a random sample of size n, the following formula is used (Kleinbaum
et al., 2008):
$T = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}$
To obtain the P-value, the t-distribution with n − 2 degrees of freedom is used. This
test is equivalent to the t-test for assessing H0: β1 = 0, described previously.
Output
The option lfit in the twoway command is used to draw a line to describe the linear
relationship between two variables using an SLRM: in this case between height and
weight given the observed data.
[Figure: scatterplot of weight (kg) against height with the fitted regression line]
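As a minimal sketch of the lfit option described above, assuming the variable names weikg (weight in kg) and heimt (height in m) that are used later in this chapter, the plot could be requested as follows; the axis titles are illustrative choices:

```stata
* Scatterplot of weight against height with the fitted SLRM line overlaid
twoway (scatter weikg heimt) (lfit weikg heimt), ///
    ytitle("Weight (kg)") xtitle("Height (m)")
```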
Output
The results of this command show two tables. The first table shows that the
ratio of the MS due to the model to the residual MS is less than 1
(125.6/225.1 = 0.56). This result indicates that there is no evidence to reject the
null hypothesis (P-value = .4765 > .05). The second table shows the estimated
coefficients of the model for the predictor height and for the intercept (_cons); so, the
linear trend is estimated using the following equation:
$\widehat{\text{Weight}} = 15.6 + 33.1 \times \text{height}$
Thus, the estimated expected weight in kilograms will increase 33.1 (95% CI: −69.2,
135.4) for every additional meter of height. However, this increasing trend was not
significant (P-value > .05). The percentage of total variation in Y explained by
the model is 6.5% (R-squared). In this case, Student’s t-test reported below the
ANOVA table shows a nonsignificant result for the predictor height with exactly
the same P-value as the F-test in the ANOVA, because in an SLRM, $t^2 = F$.
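The equality between the squared t statistic and F can be checked directly from the results that Stata stores after the fit. The following is a sketch, assuming the model was fitted with weikg as the response and heimt as the predictor:

```stata
* Fit the SLRM and compare the squared t statistic of the slope with the overall F
reg weikg heimt
display (_b[heimt]/_se[heimt])^2
display e(F)
```

Both display commands should return the same value, which corresponds to the F of 0.56 reported in the ANOVA table.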
5.9 Centering
To facilitate the interpretation of the intercept in a linear regression model, it is
advisable to transform the values of $X_i$ to the difference of each value from its
mean, $(X_i - \bar{X})$. This transformation is known as centering. As a result of
centering, the estimator of the intercept is equal
to the mean of the dependent variable, that is,
$\hat{\beta}_0 = \bar{Y}$
This process does not affect the estimates of the coefficients associated with the
independent variable. Assuming the previous database, the process of center-
ing height to explain weight in STATA can be achieved by typing the following
commands:
sum weikg
sum heimt
*Centering heimt using the result of the previous sum command
gen heimtc=heimt-r(mean)
reg weikg heimtc
Output
sum weikg
Variable | Obs Mean Std. Dev. Min Max
------------+----------------------------------------------
weikg | 10 66.3 14.62912 35 87
. sum heimt
Variable | Obs Mean Std. Dev. Min Max
------------+----------------------------------------------
heimt | 10 1.53 .1127435 1.35 1.78
*Centering heimt using the result of the previous sum command
gen heimtc=heimt-r(mean)
reg weikg heimtc
Source | SS df MS Number of obs = 10
------------+------------------------ F(1, 8) = 0.56
Model | 125.56044 1 125.56044 Prob > F = 0.4765
Residual |1800.53956 8 225.067445 R-squared = 0.0652
------------+--------------- Adj R-squared = -0.0517
Total | 1926.1 9 214.011111 Root MSE = 15.002
------------------------------------------------------------------------------
weikg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------+----------------------------------------------------------------
heimtc | 33.12939 44.35508 0.75 0.476 -69.15361 135.4124
_cons | 66.3 4.744127 13.98 0.000 55.36002 77.23998
------------------------------------------------------------------------------
5.10 Bootstrapping
Bootstrapping is a robust alternative to classical statistical methods when the
assumptions of those methods are not met; it can provide more accurate inferences,
particularly when the sample size is small. It relies on resampling the observed data
to estimate standard errors and compute confidence intervals (Good, 2006).
Bootstrapping in Stata can be done using the option
vce(boot) in the command reg. For example, assuming the previous database, the
command for estimating weight from centered height with bootstrap standard
errors is as follows:
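A minimal sketch of that command, assuming the centered variable heimtc created in the previous section; the number of replications and the seed are illustrative choices, not values taken from the text:

```stata
* Least-squares coefficients with bootstrap standard errors
reg weikg heimtc, vce(bootstrap, reps(1000) seed(12345))
```

Because the bootstrap resamples at random, the standard errors will vary with the seed and the number of replications.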
Output
The results show that the estimates of the regression coefficients are the same as
those obtained using the least-squares method, but the standard errors for the coef-
ficient of heimtc are different, being 44.4 versus 68.3. It is likely that these dif-
ferences are due to the small sample size that was used in this example. For more
information on this topic, we recommend checking out the book by Draper and
Smith (1998).
$\mu_{y|x} = \beta_0 + \beta_1 X_1 + \cdots + \beta_m X_m$
where:
$\mu_{y|x}$ indicates the expected value of Y explained by the X variables for the ith
subject
$X_j$ indicates the predictor variables (j = 1,…, m)
$\beta_j$ indicates the coefficient (constant) associated with $X_j$
The assessment of the overall significance of its regression coefficients (βis, i > 0) can
be performed using an ANOVA table, as follows:
Source of Variation   SS    df          MS                      F-Ratio
Regression            SSR   m           MSR = SSR/m             MSR/MSE
Residual              SSE   n − m − 1   MSE = SSE/(n − m − 1)
Total                 TSS   n − 1
$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$
$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
where:
$\hat{y}_i$ indicates the estimated expected value of Y given a set of specific values of the
predictors Xs for the ith subject, as follows: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_m X_m$
$\hat{\beta}_j$ indicates the estimated value of the coefficient $\beta_j$
$\bar{y}$ indicates the overall mean of Y
For a multiple regression model, the ANOVA table assumes that H0:
β1 = β2 = … = βm = 0. If the calculated value of the test statistic Fc is greater than
F(1−α; m, n−m−1) for a given significance level α, we conclude that there is evidence
against H0.
In MLRMs, the Stata command reg can be used in the same way as for simple
linear regression, except that the model now includes more than one
independent variable. For example, assuming the
previous database, to explain the expected bmi by the predictors age and sex, the
specifications of the reg command are as follows:
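A minimal sketch, assuming the variables are named bmi, age, and sex in the database:

```stata
* MLRM: expected bmi explained by age and sex
reg bmi age sex
```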
Output
Age + sex (complete model): residual df = 7, SSE = 239.3
In both incomplete models, the additional sum of squares increases. To assess if this
increment is statistically significant, a partial F-statistic is used. For example, let us
assume the following notation for the complete and incomplete models:
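Complete Model: $\mu_{y|X} = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \beta_{k+1} X_{k+1} + \cdots + \beta_m X_m$
Incomplete Model: $\mu_{y|X} = \beta_0^* + \beta_1^* X_1 + \cdots + \beta_k^* X_k$
The user has to be aware that the coefficients from both models do not necessarily
have the same value. Based on these models, a partial hypothesis can be defined
with the following equation:
$H_0: \beta_{k+1} = \beta_{k+2} = \cdots = \beta_m = 0 \mid X_1, X_2, \ldots, X_k$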
Then, the following steps are performed to evaluate this type of partial
hypothesis:
1. Calculate the sum of squares of the residuals in the complete model (SSEcom)
with n − m − 1 degrees of freedom.
2. Calculate the sum of squares of the residuals in the incomplete model (SSEinc)
with n − k − 1 degrees of freedom.
3. Compute the difference between the residual sums of squares of the incomplete
and complete models (SSEinc − SSEcom), which is called the additional sum of squares, with
m − k degrees of freedom.
4. Compute the following formula (partial F):
$F(X_{k+1},\ldots,X_m \mid X_1,\ldots,X_k) = \dfrac{(SSE_{inc} - SSE_{com})/(m-k)}{SSE_{com}/(n-m-1)}$
5. Calculate the P-value using Fisher’s F-distribution with m − k and n − m − 1
degrees of freedom.
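The steps above can be sketched in Stata using the results that regress stores after each fit. The example assumes the variables bmi, age, and sex, and tests sex given age; the scalar names are hypothetical:

```stata
* Step 1: complete model (age and sex)
reg bmi age sex
scalar sse_com = e(rss)
scalar df_com  = e(df_r)
* Step 2: incomplete model (age only)
reg bmi age
scalar sse_inc = e(rss)
scalar df_inc  = e(df_r)
* Steps 3-4: additional sum of squares and partial F
scalar df_add = df_inc - df_com
scalar Fpart  = ((sse_inc - sse_com)/df_add) / (sse_com/df_com)
* Step 5: P-value from the upper tail of the F-distribution
display Fpart
display Ftail(df_add, df_com, Fpart)
```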
Considering the previous data of the sum of squares, the partial F, discarding sex
from the complete model, will be:
$F(\text{sex} \mid \text{age}) = \dfrac{190.8/1}{239.3/7} = 5.58$
And the partial F, again discarding age from the complete model, will be:
$F(\text{age} \mid \text{sex}) = \dfrac{43.2/1}{239.3/7} = 1.26$
$H_0: \beta_{sex|age} = 0$
and
$H_0: \beta_{age|sex} = 0$
The respective P-values are computed with Fisher's F probability distribution
with 1 and 7 degrees of freedom. In Stata, these P-values can be obtained using the
Ftail function with the display command, as is illustrated in the following:
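A sketch of these calculations, using the two partial F values computed above and the Ftail() function inside display:

```stata
* Upper-tail probabilities of the F-distribution with 1 and 7 degrees of freedom
display Ftail(1, 7, 5.58)
display Ftail(1, 7, 1.26)
```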
An alternative procedure is to use the test command after the reg command for the
complete model, as in the following:
Output
. test sex
( 1) sex = 0
F( 1, 7) = 5.58
Prob > F = 0.0502
. test age
( 1) age = 0
F( 1, 7) = 1.26
Prob > F = 0.2978
Thus, we conclude that there is marginal evidence against the null hypothesis,
H 0: βsex|age = 0 (P-value = .05), suggesting that the variable sex could be part of the
model when the variable age is already one of the predictors.
5.13 Prediction
If the user wishes to use the model to predict the expected value under
specific conditions of the predictors, the adjust command is available in Stata
for this purpose. For example, to estimate
the expected bmi for females (sex = 1) aged 30 years, after the reg command, the
specifications for the adjust command are as follows:
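A minimal sketch of this step, assuming the complete model with age and sex has just been fitted. The margins line is an alternative that is not part of the original text; it is included because adjust is an older command that was superseded by margins in later Stata releases:

```stata
reg bmi age sex
* Expected bmi for sex = 1 and age = 30, with its 95% confidence interval
adjust age=30 sex=1, ci
* Alternative with margins
margins, at(age=30 sex=1)
```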
Output
-----------------------------------------------------------------------------
Dependent variable: bmi Command: regress
Covariates set to value: age = 30, sex = 1
-----------------------------------------------------------------------------
----------------------------------------------
All | xb lb ub
---------+------------------------------------
| 33.9999 [26.8742 41.1257]
----------------------------------------------
Key: xb = Linear Prediction
[lb , ub] = [95% Confidence Interval]
The option ci in the adjust command is used to display the 95% confidence interval
of the prediction. The results displayed in the above table indicate that for 30-year-
old females, the estimated expected bmi is 34 (95% CI: 26.9, 41.1).
$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + e_i$
Output
. predict bmiesp1
(option xb assumed; fitted values)
Once the expected value for each model is estimated with the command predict
(bmiesp1 and bmiesp2), a plot with these estimates can be displayed with the fol-
lowing command (Figure 5.2):
twoway (scatter bmi age, sort) (line bmiesp1 age, sort) (line
bmiesp2 age, sort), ytitle(bmi)
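The fitted values plotted by this command could be generated along the following lines. This is a sketch: age2 is a hypothetical name for the squared term, while bmiesp1 and bmiesp2 are the names used in the text:

```stata
* Linear trend
reg bmi age
predict bmiesp1
* Quadratic trend
gen age2 = age^2
reg bmi age age2
predict bmiesp2
```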
[Figure 5.2: bmi plotted against age, with the fitted linear and quadratic trends]
Output
In this example, the quadratic curve appears to be better than the linear trend in
terms of its ability to explain the expected value of bmi by age.
$C(\rho) = \dfrac{1}{2}\ln\left(\dfrac{1+\rho}{1-\rho}\right)$
In the above, ρ can be estimated with the square root of the expected coefficient of
determination ($R^2$) for the model under consideration.
In Stata we can estimate sample size using the correlation option in the
Power and sample size analysis window in the Statistics menu, providing the significance
level, the power value, and the linear correlation coefficient value under the alternative
hypothesis. For example, assuming that we want to determine the minimum sample
size needed to estimate the expected bmi value using an SLRM with sex as predictor
and an approximate $R^2$ of 0.3454 ($\rho \approx \sqrt{0.3454} = 0.59$), the dialog box should
be filled out as described in Figure 5.3.
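The same calculation can be sketched from the command line. The power onecorrelation command below assumes a null correlation of 0, an alternative of 0.59, 90% power, and a 5% two-sided significance level. The display line is a hand check based on the usual Fisher z approximation, n = ((z_{1−α/2} + z_{1−β})/C(ρ))² + 3, which is an assumption here because the sample size formula itself is not reproduced above:

```stata
* Sample size for testing H0: rho = 0 against rho = 0.59
power onecorrelation 0 0.59, power(0.9) alpha(0.05)
* Hand check; atanh(rho) equals C(rho) as defined above
display ceil(((invnormal(0.975) + invnormal(0.90))/atanh(0.59))^2 + 3)
```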
Output
Therefore, the minimum sample size for performing an SLRM between BMI and sex,
assuming 90% statistical power and a 5% significance level, is 26 subjects. Should the
user want to determine the minimum sample size for an MLRM (nm), when X1 is the
main predictor, the following expression is recommended (Kleinbaum et al., 2008):
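$n_m = \dfrac{n_s}{1 - \rho^2_{X_1(X_2,\ldots,X_k)}}$
where $n_s$ is the minimum sample size for the SLRM using $X_1$ as predictor and $\rho^2_{X_1(X_2,\ldots,X_k)}$
is the squared multiple correlation between $X_1$ and the remaining predictors $X_2, \ldots, X_k$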
$\hat{e}_i = y_i - \hat{y}_i$
where
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p$
At first it is assumed that the residuals are independent; however, the ei obtained
from the study data depend on the expected values of Y under the model, which
in turn depend on the values of the predictors.
[Figure 5.4: residuals plotted against age]
Moreover, the model assumes that
the variances are constant, but the variance of each residual depends on the distance
of the predictor values from their central values. To check the constant-variance
assumption, it is recommended that the standardized residuals be plotted.
In Stata, this type of graph can be obtained using the rvpplot command after the
reg command. For example, to describe the distribution of the residuals from the
linear regression model between heimt and age, the Stata commands are:
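A minimal sketch of these commands, assuming heimt is the response and age the predictor; the yline(0) option is an illustrative addition that draws a reference line at zero:

```stata
reg heimt age
* Residuals plotted against the predictor age
rvpplot age, yline(0)
```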
The output of these commands can be seen in Figure 5.4. Because of the small
sample size in this example, it is difficult to visualize a particular pattern around 0,
although some symmetric distribution is observed. For more discussion on this
topic, we recommend checking out the book by Draper and Smith (1998).
Chapter 6
Analysis of Variance
Aim: Upon completing the chapter, the learner should be able to perform
an analysis of variance to compare the expected values of a continuous
random variable between different groups.
6.1 Introduction
An analysis of variance (ANOVA) can be performed to compare two or more
parameters (expected values and variances). The possible objectives of this analysis
might be to:
$H_0: \mu_1 = \mu_2 = \cdots = \mu_m$
2. Determine which expected values are different among comparison groups to
evaluate any of the following potential null hypotheses:
$H_0: \mu_i = \mu_j$; $H_0: \mu_i = \dfrac{\mu_j + \mu_k}{2}$; $H_0: \dfrac{\mu_1 + \mu_2 + \mu_3}{3} = \dfrac{\mu_4 + \mu_5}{2}$
3. Determine if the variability of a random continuous variable is the same
between different groups to evaluate the following hypothesis:
$H_0: \sigma^2_\alpha = 0$
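                     Group 1                      Group 2                      Group 3                      Overall
                     :                            :                            :
                     :                            :                            :
Total                $\sum_{j=1}^{n_1} Y_{1,j}$   $\sum_{j=1}^{n_2} Y_{2,j}$   $\sum_{j=1}^{n_3} Y_{3,j}$   $\sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{i,j}$
Mean                 $\bar{Y}_1$                  $\bar{Y}_2$                  $\bar{Y}_3$                  $\bar{Y}$
Number of subjects   $n_1$                        $n_2$                        $n_3$                        $n$
Expected value       $\mu_1$                      $\mu_2$                      $\mu_3$                      $\mu$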
1. Assuming that the information available is from all possible groups, the
research question can be stated as follows: Does the expected value vary by
group (μ1= μ2= μ3)? (Fixed effects model)
2. Assuming that the information available is from a random sample of all
possible groups, the research question can be stated as follows: Is there any
variation among all the groups ($\sigma^2_\alpha \neq 0$)? (Random effects model)
The basic statistics from the database, using the table command, are in the fol-
lowing output:
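The command that produced this output is not shown. As a sketch, summaries of age by bmig category could be requested in either of the following ways; the contents() syntax of table is the one available up to Stata 16, and tabstat is offered as an alternative:

```stata
* Number of subjects, mean, and standard deviation of age by bmig category
table bmig, contents(n age mean age sd age)
* Equivalent summary with tabstat
tabstat age, by(bmig) statistics(n mean sd)
```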
The results from the above table show that subjects with a normal bmi are, on
average, the oldest, and those categorized as being overweight are the youngest. The
variability within these groups also seems to be very different, based on the comparison of
standard deviations. To assess whether these differences are statistically significant,
either a linear model or an analysis of variance can be used (both of which are
described in the following sections).
and
where:
yij indicates the value of the continuous random variable Y in the ith subject that
belongs to the jth bmig category
μj indicates the expected value of Y in the jth bmig category, E ( yij ) = µ j
eij indicates the difference between the observed value of yij and the expected
value of the random variable Y in the jth bmig category (µj) (it is assumed
that the errors, eij, are independent and follow an $N(0, \sigma^2)$ distribution)
α j indicates the effect of the jth bmig category with respect to the first bmig
category (normal), subject to the restriction α1 = 0
BMIGj is a dummy variable whose value is 1 if the subject belongs to the jth bmig
category; its value is 0 if the subject belongs to another bmig group
When the groups being compared correspond to all of the possible groups or when
they represent a select group of interest, the αi s are constants and are defined as
fixed effects. If the effects are fixed, then it is initially assumed that variances within
groups are equal, $\operatorname{Var}(Y_{ij}) = \sigma^2$.
$\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2 = \sum_{i=1}^{k} n_i(\bar{Y}_i - \bar{Y})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$
where:
$\sum_{i=1}^{k} n_i(\bar{Y}_i - \bar{Y})^2$ indicates the variation between groups (between sum of squares)
$\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$ indicates the overall variation within each group (within sum
of squares)
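The null hypothesis in ANOVA with fixed effects states that the expected values
of the random variable of interest, Y, in all groups are the same, $H_0: \mu_1 = \cdots = \mu_k$;
thus, $\alpha_i = 0$ for all groups. To assess the null hypothesis, the estimated expected
values of SS between and SS within are compared, considering their respective
degrees of freedom, as follows:
Source of Variation   SS                                                          df      MS                       E[MS]
Between groups        $\sum_{i=1}^{k} n_i(\bar{Y}_i - \bar{Y})^2$                 k − 1   SS between/(k − 1)       $\sigma^2 + \phi$
Within groups         $\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$     n − k   SS within/(n − k)        $\sigma^2$
Note: $\phi = \dfrac{1}{k-1}\sum_{i=1}^{k} n_i \alpha_i^2$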
Under the null hypothesis, the ratio $\left(\sigma^2 + \frac{1}{k-1}\sum_{i=1}^{k} n_i\alpha_i^2\right)\big/\sigma^2 = 1$. To determine how
far the estimated ratio is from 1, once a dataset is collected and the parameters
of the linear model ($\sigma^2$, $\alpha_i$) are estimated, a P-value is computed using Fisher's F
probability distribution with k − 1 and n − k degrees of freedom.
------------------------------------------------------------------------------
age | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bmig |
Overweight | -5.75 4.613537 -1.25 0.253 -16.65928 5.159281
Obese | -1.083333 4.613537 -0.23 0.821 -11.99261 9.825948
|
_cons | 32.75 3.020269 10.84 0.000 25.6082 39.8918
------------------------------------------------------------------------------
The results show that there is no evidence of significant differences in the mean age
across bmig categories using normal subjects as the reference group (P-value > .1).
If the user wants to change the reference group (e.g., use the second category of
bmig), the following command syntax should be used:
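A sketch of both specifications, assuming bmig is coded 1 = normal, 2 = overweight, and 3 = obese:

```stata
* Reference group: first category (normal)
reg age i.bmig
* Reference group: second category (overweight)
reg age ib2.bmig
```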
Output
------------------------------------------------------------------------
age | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
bmig |
Normal | 5.75 4.613537 1.25 0.253 -5.159281 16.65928
Obese | 4.666667 4.932078 0.95 0.376 -6.995845 16.32918
|
_cons | 27 3.487506 7.74 0.000 18.75336 35.24664
------------------------------------------------------------------------------
The commands oneway and anova can be used in the assessment of the null hypothesis,
$H_0: \mu_1 = \cdots = \mu_k$ (using fixed-effect ANOVA). The oneway command includes
Bartlett’s test for equal variances, a condition needed in the F-test for comparing
expected values. The anova command expands the sum of squares if more than one
source of variation is used. For example, to compare age between the bmig catego-
ries, the following command is used:
oneway age bmig
Output
. oneway age bmig
Analysis of Variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 60.6833333 2 30.3416667 0.83 0.4742
Within groups 255.416667 7 36.4880952
------------------------------------------------------------------------
Total 316.1 9 35.1222222
The output of the oneway command provides no evidence against the null hypothesis,
$H_0: \mu_{normal} = \mu_{overweight} = \mu_{obese}$ (P-value = .4742), and no evidence against
equal variances, via Bartlett’s test (P-value > 0.1).
If the user includes the variable sex as a second source of variation and the
interaction of bmig and sex to explore how the mean of age changes by bmig
category and sex, the command line (making use of the anova command) is as
follows:
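A minimal sketch of such a command, assuming sex is a categorical variable in the same database:

```stata
* Two-way ANOVA of age with main effects and the bmig-by-sex interaction
anova age bmig sex bmig#sex
```

The shorthand anova age bmig##sex specifies the same model.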
Output
The output suggests that the mean of age changes according to the bmig and sex cat-
egories due to the fact that the interaction term bmig#sex is marginally significant
(P-value = .053); however, caution should be taken with this interpretation because
the sample size is very small.
$H_0: \mu_1 = \mu_2$ or $H_0: \mu_1 - \mu_2 = 0$
which is equivalent to
$H_0: \alpha_2 = 0$
The test statistic will be
$F = t^2 = \dfrac{(\bar{Y}_i - \bar{Y}_j)^2}{s^2\left[(1/n_i) + (1/n_j)\right]} \sim F_{1,\,n-k} \mid H_0$
where $s^2$ is MSE (within MS).
To evaluate the null hypothesis in Stata, using this statistic, we can use the test
command after the anova command, as can be seen in the following:
anova age bmig
test 1.bmig=2.bmig
Output
. anova age bmig
. test 1.bmig=2.bmig
( 1) 1b.bmig - 2.bmig = 0
F( 1, 7) = 1.55
Prob > F = 0.2527
The results suggest that there is no difference in the expected age between normal
bmi subjects and those whose bmi indicates that they are overweight (P-value > .1).
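The F statistic reported by test can be reproduced by hand from the formula above, using the group means (32.75 and 27), the group sizes (4 and 3), and the within MS (36.4880952) shown in the oneway output of this chapter:

```stata
* F = (difference of means)^2 / (s^2 * (1/n_i + 1/n_j))
display (32.75 - 27)^2 / (36.4880952*(1/4 + 1/3))
* P-value with 1 and n - k = 7 degrees of freedom
display Ftail(1, 7, 1.55)
```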
where:
$\sum_{i=1}^{k} c_i = 0$
The definition for a linear contrast depends on the null hypothesis under evalua-
tion. For example,
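1. When $H_0: \mu_1 - \mu_2 = 0$ then,
$L = (1)\mu_1 + (-1)\mu_2$
where $c_1 = 1$ and $c_2 = -1$
2. When $H_0: \mu_1 = (\mu_2 + \mu_3)/2 \Rightarrow H_0: (1)\mu_1 + (-0.5)\mu_2 + (-0.5)\mu_3 = 0$ then,
$L = (1)\mu_1 + (-0.5)\mu_2 + (-0.5)\mu_3$
where $c_1 = 1$, $c_2 = -0.5$, and $c_3 = -0.5$
$t = \dfrac{\hat{L}}{\sqrt{s^2 \sum_{i=1}^{k} c_i^2/n_i}} \sim t_{n-k,\,1-\alpha/2}$
where $s^2$ = MSE (within MS) and $\hat{L}$ is determined by the sample means.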
To evaluate linear contrasts in Stata, we can use the test command after the
anova command. For example, assuming the user wants to compare the expected age
of the subjects categorized as having a normal bmi against the average of the
expected age in subjects categorized as being overweight or obese, the null hypothesis is
formulated as $H_0: \mu_1 = (\mu_2 + \mu_3)/2$. To assess this hypothesis with the test command,
the specified command line, after running the anova command, is as follows:
test 1.bmig=(2.bmig+3.bmig)/2
Output
. test 1.bmig=(2.bmig+3.bmig)/2
( 1) 1b.bmig - .5*2.bmig - .5*3.bmig = 0
F( 1, 7) = 0.77
Prob > F = 0.4099
The results suggest that the expected age does not change between the groups under
comparison (P-value > .1).
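As a hand check of this result, the contrast can be evaluated with the sample means (32.75, 27, and 31.67), the group sizes (4, 3, and 3), and the within MS (36.49) reported in the oneway output of this chapter:

```latex
\hat{L} = 32.75 - 0.5(27) - 0.5(31.67) = 3.42 \\
s^2 \sum_{i=1}^{3} \frac{c_i^2}{n_i} = 36.49\left(\frac{1}{4} + \frac{0.25}{3} + \frac{0.25}{3}\right) = 15.20 \\
F = t^2 = \frac{3.42^2}{15.20} = 0.77
```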
Output
| Summary of age
bmig | Mean Std. Dev. Freq.
-----------+------------------------------------
Normal | 32.75 8.5391256 4
Overweigh | 27 2 3
Obese | 31.666667 3.7859389 3
-----------+------------------------------------
Total | 30.7 5.9264004 10
Analysis of Variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 60.6833333 2 30.3416667 0.83 0.4742
Within groups 255.416667 7 36.4880952
------------------------------------------------------------------------
Total 316.1 9 35.1222222
Note: The option bon displays the Bonferroni multiple-comparison test. The option
tab produces a summary of age at each category of bmig.
At the bottom of the output table, all the pairwise comparisons between the sample
means can be seen; for example, the difference between the mean age of the obese group
and that of the overweight group is 4.67, which can be confirmed with the first table
requested in this output, in which $\bar{Y}_{normal} = 32.8$, $\bar{Y}_{overweight} = 27.0$, and $\bar{Y}_{obese} = 31.7$.
Below each pairwise difference is the Bonferroni-adjusted P-value, which is
computed by multiplying the unadjusted P-value for that pair by the total number of
possible mean pairs to be compared.
The results show that there is no evidence of significant differences (P-values > .1) in
any of the pairwise comparisons.
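The command that produces this output is not reproduced above; a sketch consistent with the options named in the note would be:

```stata
* One-way ANOVA with group summaries and Bonferroni pairwise comparisons
oneway age bmig, bonferroni tabulate
```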
Scheffé’s method performs multiple comparisons through linear contrasts, but
the significance level of each comparison is not adjusted. In Scheffé’s method, the
test statistic is calculated using the following formula:
$t^2/(k-1) = \dfrac{\hat{L}^2}{s^2 \sum_{i=1}^{k} c_i^2/n_i} \Big/ (k-1)$
The P-value is calculated with the F-Fisher probability distribution with k − 1 and
n − k degrees of freedom. The Stata command line for performing multiple com-
parisons by Scheffé’s method is as follows:
oneway age bmig, sch
Output
The results displayed in the above output table show that the P-values that result
when using Scheffé’s method are different from those that result when Bonferroni’s
method is used; however, the statistical evidence confirms that there are no signifi-
cant differences (P-values > .1) in any of the pairwise comparisons.
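where:
$Y_{ij} \mid \mu_i \sim \text{i.i.d. } N(\mu_i, \sigma^2)$
$\mu_i \sim \text{i.i.d. } N(\mu, \sigma^2_\mu)$
i.i.d. = independently and identically distributed
$E(Y_{ij} \mid \alpha_i) = \mu + \alpha_i$
$Y_{ij} \mid \alpha_i \sim \text{i.i.d. } N(\mu_i, \sigma^2)$
$\alpha_i \sim \text{i.i.d. } N(0, \sigma^2_\alpha)$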
The procedure to estimate the parameters of this model is similar to the process
used in Bayesian data analysis; however, in ANOVA with random effects, we are
only assuming randomness in μi. The null hypothesis of the ANOVA with random
effects is formulated as follows:
$H_0: \sigma^2_\alpha = 0$
Having $\sigma^2_\alpha = 0$ implies that the expected values of Y for all groups, including the
groups in the study sample and the groups of subjects who were not included in the
study sample, are equal ($\mu_1 = \mu_2 = \cdots = \mu_m$).
Under the assumption of random effects, the expected values of the mean
squares in the ANOVA are as follows:
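Source of Variation   SS                                                          df      MS                     E[MS]
Between groups        $\sum_{i=1}^{k} n_i(\bar{Y}_i - \bar{Y})^2$                 k − 1   SS between/(k − 1)     $\sigma^2 + n_0\sigma^2_\alpha$
Within groups         $\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$     n − k   SS within/(n − k)      $\sigma^2$
Total                 $\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2$       n − 1
where:
$n_0 = \dfrac{\sum_{i=1}^{k} n_i - \left(\sum_{i=1}^{k} n_i^2 \big/ \sum_{i=1}^{k} n_i\right)}{k-1}$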
To evaluate the null hypothesis with the data collected in the sample, the F-statistic
is obtained using the following equation:
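$F = \dfrac{\sum_{i=1}^{k} n_i(\bar{Y}_i - \bar{Y})^2/(k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2/(n-k)} \sim F_{(k-1,\,n-k,\,1-\alpha)}$
$\hat{\sigma}^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \dfrac{(Y_{ij} - \bar{Y}_i)^2}{n-k}$
$\operatorname{Cov}(Y_{im}, Y_{in}) = \operatorname{Cov}\{E(Y_{im} \mid \alpha_i), E(Y_{in} \mid \alpha_i)\} + E\{\operatorname{Cov}(Y_{im}, Y_{in} \mid \alpha_i)\}$
$\operatorname{Var}(Y_{ij}) = \operatorname{Var}\{E(Y_{ij} \mid \alpha_i)\} + E\{\operatorname{Var}(Y_{ij} \mid \alpha_i)\} = \operatorname{Var}(\mu + \alpha_i) + E(\sigma^2) = \sigma^2_\alpha + \sigma^2$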
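Under the random effects model, these two decompositions lead to the intraclass correlation coefficient. The following derivation is a sketch that assumes observations within a group are independent given $\alpha_i$, so that the conditional covariance is zero:

```latex
\operatorname{Cov}(Y_{im}, Y_{in}) = \operatorname{Var}(\mu + \alpha_i) + E(0) = \sigma^2_\alpha \\
\rho_I = \frac{\operatorname{Cov}(Y_{im}, Y_{in})}{\sqrt{\operatorname{Var}(Y_{im})\operatorname{Var}(Y_{in})}} = \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2}
```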
readings from a subject twice a day, for 10 consecutive days, as a random sample
of days in 1 month. Finally, let us assume that the data in Stata conform to the
following format:
+-----------------+
| day sb1 sb2 |
|-----------------|
1. | 1 98 99 |
2. | 2 102 93 |
3. | 3 100 98 |
4. | 4 99 100 |
5. | 5 96 100 |
6. | 6 95 100 |
7. | 7 90 98 |
8. | 8 102 93 |
9. | 9 91 92 |
10. | 10 90 94 |
+-----------------+
sb1 indicates the first measure of systolic blood pressure.
sb2 indicates the second measure of systolic blood pressure.
To perform an ANOVA, the user has to modify the previous database structure.
The current format is called wide format, where every row in the dataset contains all
the measurements of one unit (here, one day). To run an ANOVA the database structure must be
in the long format, where every row contains a single measurement.
The reshape command can be used to change the database structure from wide to
long format, as follows:
Output
. reshape long sb, i(day)
(note: j = 1 2)
After the reshape command, use the list command to see the current data structure,
as is demonstrated in the following table:
list
+----------------+
    | day _j sb |
|----------------|
1. | 1 1 98 |
2. | 1 2 99 |
3. | 2 1 102 |
4. | 2 2 93 |
5. | 3 1 100 |
6. | 3 2 98 |
7. | 4 1 99 |
8. | 4 2 100 |
9. | 5 1 96 |
10. | 5 2 100 |
11. | 6 1 95 |
12. | 6 2 100 |
13. | 7 1 90 |
14. | 7 2 98 |
15. | 8 1 102 |
16. | 8 2 93 |
17. | 9 1 91 |
18. | 9 2 92 |
19. | 10 1 90 |
20. | 10 2 94 |
+----------------+
To run the linear model with random effects, use the loneway command, as follows:
loneway sb day
Output
One-way Analysis of Variance for sb:
Number of obs = 20
R-squared = 0.5118
Intraclass Asy.
correlation S.E. [95% Conf. Interval]
------------------------------------------------
0.07611 0.32301 0.00000 0.70920
The results indicate that there is no evidence against the null hypothesis
($H_0: \sigma^2_\alpha = 0$). Therefore, the expected values of the systolic blood pressure
readings from the portable machine can be considered equal across days for the
subject under study during 1 month
(P-value > .1). The estimated intraclass correlation coefficient is as follows:
$\rho_I = \dfrac{1.19}{1.19 + 14.5} = 0.076$
Study parameters:
alpha = 0.0500
power = 0.8000
delta = 0.4163
N_g = 3
m1 = 32.8000
m2 = 27.0000
m3 = 31.7000
Var_m = 6.3267
Var_e = 36.5000
N = 60
N per group = 20
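Output of this form could be obtained with the power oneway command. The following sketch assumes the group means and error variance listed under the study parameters; the two display lines reproduce Var_m and delta by hand:

```stata
* Sample size for a one-way ANOVA with three groups, 80% power, alpha = .05
power oneway 32.8 27 31.7, varerror(36.5) power(0.8) alpha(0.05)
* Var_m: mean squared deviation of the group means from the grand mean (30.5)
display ((32.8 - 30.5)^2 + (27 - 30.5)^2 + (31.7 - 30.5)^2)/3
* delta: square root of Var_m/Var_e
display sqrt(6.3267/36.5)
```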
Both delta (effect size, $\sqrt{\text{Var\_m}/\text{Var\_e}}$) and Var_m (variance between groups,
based on the means of each group and the grand mean, $\sum_{i=1}^{3} (\bar{Y}_i - \bar{Y})^2/3$) are com-
Chapter 7
Categorical Data Analysis
Aim: Upon completing the chapter, the learner should be able to perform
a stratified analysis in an epidemiological study, using cohort and case-
control study designs.
7.1 Introduction
So far, we have discussed the Stata commands for estimating the conditional expectations
of continuous variables. There are, however, numerous occasions in the public health
field in which we are interested in exploring the association between a categorical
outcome (e.g., disease status) and one or more predictor variables (e.g., exposure status,
confounding variables, and effect-modifier variables) collected in epidemiologic
studies. Epidemiology is “the study of the occurrence and distribution of health-related
events in specified populations and the application of this knowledge to control rel-
evant health problems” (Porta, 2008; Rothman, 2002). Epidemiological studies are
commonly categorized as descriptive or analytical studies. These studies are defined
immediately below:
In the next sections, we will show the application of the Mantel–Haenszel method
for the analysis of data derived from cohort and case-control studies (Jewell, 2004;
Rothman, 2002). This method is based on the stratification of potential confound-
ing variables to estimate a weighted average of the magnitude of the exposure–disease
association. Confounding factors are variables that are related to both the exposure
and the outcome but do not lie in the causal pathway between them (Rothman,
2002; Woodward, 2004). As we shall see in the next chapters, regression models are
efficient techniques that can be employed to assess the exposure–disease association
while controlling for the confounding variables.
$RR = \dfrac{I_{exposure}}{I_{nonexposure}}$
where $I_j$ indicates the incidence of the disease in the jth group. When stratified
analysis is performed, the RR is assessed under different strata (to be combined into
one single RR) or reported in each stratum. In the Mantel–Haenszel method, the
combined RR is computed using the weighted mean of the RRs, as follows:
$RR_{M-H} = \dfrac{\sum_k w_k RR_k}{\sum_k w_k}$
where $w_k$ is the weight of the kth stratum, which is the
product of the number of unexposed cases and the proportion of
exposed subjects in this stratum.
For example, let us say that we want to evaluate the association between alcohol
intake (exposure) and a diagnosis of myocardial infarction (MI) over a period of
5 years, controlling for the effect of cigarette smoking (potential confounding
variable). To analyze this type of study in Stata, we can use the following data:
Smoker (0)
Between the parentheses are the codes for each category (1 indicates presence and 0
indicates absence).
To program these data, the database can be entered in Stata as follows:
+----------------------------------+
| smoker alcohol mi subjects |
|----------------------------------|
1. | 0 1 1 8 |
2. | 0 1 0 16 |
3. | 0 0 1 22 |
4. | 0 0 0 44 |
5. | 1 1 1 63 |
6. | 1 1 0 36 |
7. | 1 0 1 7 |
8. | 1 0 0 4 |
+----------------------------------+
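The command that produced the listing above and the output below is not reproduced in this excerpt. A minimal sketch, assuming the data are typed in with input and analyzed with Stata's cs command for cohort studies (the layout of the output is consistent with cs, but the choice of command is an assumption):

```stata
* Enter the collapsed cohort data: one row per smoker-alcohol-MI combination
clear
input smoker alcohol mi subjects
0 1 1  8
0 1 0 16
0 0 1 22
0 0 0 44
1 1 1 63
1 1 0 36
1 0 1  7
1 0 0  4
end
list

* Relative risk of MI by alcohol intake, stratified by smoking.
* [fweight=subjects] tells Stata that each row stands for that many subjects.
cs mi alcohol [fweight=subjects], by(smoker)
```

The first variable after cs is the outcome (mi) and the second is the exposure (alcohol); by(smoker) requests the stratum-specific and Mantel–Haenszel combined estimates.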
Output
      smoker |       RR      [95% Conf. Interval]    M-H Weight
-------------+--------------------------------------------------
           0 |        1     0.5164877   1.936154      5.866667
           1 |        1     0.6244517   1.601405           6.3
-------------+--------------------------------------------------
       Crude |  1.53266       1.10769   2.120674
M-H combined |        1     0.6695272   1.493591
----------------------------------------------------------------
Test of homogeneity (M-H)    chi2(1) = 0.000    Pr>chi2 = 1.0000
Note: When the database is collapsed and contains a variable that gives the
frequency of each combination of values, the fw (frequency weight) specification
is used. It identifies the variable that contains the number of times each
observation was actually observed.
The output reports the point estimate of the relative risk (RR) for each stratum, as
well as the crude RR and the weighted RR (RR M–H), with their respective 95% confidence
intervals. In addition, the weight for each stratum is reported (M–H weight),
as is the test of homogeneity (H0: RR1 = RR2 = … = RRk), which assesses
whether the RRs in all strata are equal.
The results indicate that the stratum-specific RRs do not differ significantly
(P-value > .10); therefore, it is recommended that the RR M–H be used.
When we compare the point estimates of the crude RR = 1.53 and the RR M–H = 1,
we are able to conclude that the data show a strong confounding effect, as the crude
RR overestimates the magnitude of the association between MI and alcohol
intake. Finally, the estimated magnitude of the association of interest, controlling
for smoking, is 1 (95% CI: 0.67, 1.49), which is not statistically significant
(P-value > .05).
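The reported weights and the combined RR can be checked by hand from the counts in the data listing. The sketch below is only an arithmetic illustration of the weighted-mean formula; the scalar names are made up for this example.

```stata
* Stratum-specific relative risks: risk in exposed / risk in unexposed
scalar rr0 = (8/24) / (22/66)
scalar rr1 = (63/99) / (7/11)

* M-H weights: unexposed cases x exposed subjects / stratum total
scalar w0 = 22*24/90
scalar w1 = 7*99/110

* Weighted mean of the stratum-specific RRs
scalar rr_mh = (w0*rr0 + w1*rr1) / (w0 + w1)

* Crude RR, ignoring smoking
scalar rr_crude = ((8+63)/(24+99)) / ((22+7)/(66+11))

display "M-H weights: " w0 " and " w1
display "RR M-H = " rr_mh "   crude RR = " rr_crude
```

The weights reproduce the M-H Weight column (5.866667 and 6.3), and the crude RR of about 1.53 against a combined RR of 1 is the contrast that the text interprets as confounding.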
$$OR = \frac{\text{Odds}_{\text{exposure}}}{\text{Odds}_{\text{nonexposure}}}$$
where Oddsj indicates the expected number of cases per control in the jth group
(exposed or nonexposed) and can be defined as follows:
$$\text{Odds} = \frac{p}{1 - p}$$
where p is the probability of having a diagnosis of the disease of interest under the
study design.
In the Mantel–Haenszel method, the combined OR is computed using the
weighted mean of the ORs, as follows:
$$OR_{M-H} = \frac{\sum_k w_k \, OR_k}{\sum_k w_k}$$
where wk is the weight for the kth stratum, computed as the product of the number
of unexposed cases and the number of exposed controls, divided by the number of
subjects in this stratum.
For example, let us assume that the user wants to evaluate the association
between HPV (human papilloma virus) infection status and oropharyngeal cancer
(OC), stratified by smoking (smokers vs. nonsmokers), using a case-control design
with the following data:
                          OC
HPV              Cases (1)   Controls (0)   Total
Smoker (0)
  Present (1)        75           20           95
  Absent (0)          5           80           85
  Total              80          100          180
Nonsmoker (1)
  Present (1)         5           18           23
  Absent (0)         10           72           82
  Total              15           90          105
To perform the stratified analysis of these data, the database in Stata is prepared as
is seen here:
. list
+------------------------------+
| smoker hpv oc subjects |
|------------------------------|
1. | 0 1 1 75 |
2. | 0 1 0 20 |
3. | 0 0 1 5 |
4. | 0 0 0 80 |
5. | 1 1 1 5 |
|------------------------------|
6. | 1 1 0 18 |
7. | 1 0 1 10 |
8. | 1 0 0 72 |
+------------------------------+
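The command that generated the output below is not shown in this excerpt. A minimal sketch, assuming the data are entered with input and analyzed with Stata's cc command for case-control studies (the "(exact)" confidence intervals in the output are consistent with cc, but the choice of command is an assumption):

```stata
* Enter the collapsed case-control data: one row per smoker-HPV-OC combination
clear
input smoker hpv oc subjects
0 1 1 75
0 1 0 20
0 0 1  5
0 0 0 80
1 1 1  5
1 1 0 18
1 0 1 10
1 0 0 72
end

* Odds ratio of OC by HPV status, stratified by smoking.
* The case variable (oc) comes first, then the exposure (hpv).
cc oc hpv [fweight=subjects], by(smoker)
```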
Output
smoker | OR [95% Conf. Interval] M-H Weight
---------------+-----------------------------------------------
0 | 60 20.21104 207.9978 .5555556 (exact)
1 | 2 .4721323 7.399327 1.714286 (exact)
---------------+-----------------------------------------------
Crude | 21.33333 10.63312 43.91911 (exact)
M-H combined | 16.1958 8.529819 30.75142
---------------------------------------------------------------
Test of homogeneity (M-H) chi2(1) = 18.06 Pr>chi2 = 0.0000
The output reports the point estimate of the OR for each stratum, as well as the crude
OR and the weighted OR (M–H combined), with their respective 95% confidence
intervals. In addition, the weight for each stratum is reported (M–H weight),
as well as two significance tests. The purposes of these tests are as follows:
1. To assess whether the ORs in all strata are equal: test of homogeneity
(H0: OR1 = OR2 = … = ORk).
2. To assess whether the weighted OR is equal to 1: test of combined OR = 1
(H0: OR M–H = 1).
The results indicate that there is a significant difference in the ORs across strata
(P-value < .05); therefore, it is recommended that the OR be analyzed per stratum.
When we compare the point estimate of OR0 (60) in the stratum coded 0 (smokers,
according to the table) with that of OR1 (2) in the stratum coded 1 (nonsmokers),
we can see that the smoking habit modifies the magnitude of the association
between HPV and OC. Finally, the estimated magnitude of the association of
interest among smokers is 60 (95% CI: 20.2, 208.0), which is statistically
significant (P-value < .05).
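As a check on the output, the stratum-specific ORs, the M–H weights, and the combined OR can be recomputed from the counts in the data listing. The scalar names below are illustrative only.

```stata
* Stratum-specific odds ratios: (exposed cases x unexposed controls) /
* (exposed controls x unexposed cases)
scalar or0 = (75*80) / (20*5)
scalar or1 = (5*72) / (18*10)

* M-H weights: unexposed cases x exposed controls / stratum total
scalar w0 = 5*20/180
scalar w1 = 10*18/105

* Weighted mean of the stratum-specific ORs
scalar or_mh = (w0*or0 + w1*or1) / (w0 + w1)

* Crude OR from the table collapsed over smoking
scalar or_crude = ((75+5)*(80+72)) / ((20+18)*(5+10))

display "OR0 = " or0 "   OR1 = " or1
display "OR M-H = " or_mh "   crude OR = " or_crude
```

The combined value (about 16.2) lies between the two stratum-specific ORs, which is why it is a poor summary when the strata differ as much as 60 versus 2.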
For example (using the data of nonsmokers), to assess the magnitude of the
association between HPV status and OC with ORs = 2, 2.5, and 3, assuming
that the prevalence estimates of OC in HPV-negatives are 0.10, 0.15, and 0.2, the
table of sample size should be filled in for a one-sided test and equal allocation, as
illustrated in Figure 7.2.
Figure 7.2 Sample size specifications for comparing two independent proportions
under different conditions.
Figure 7.3 Alternatives of sample size for comparing two independent proportions
under different conditions (y-axis: sample size; x-axis: control-group
proportion, p1). Parameters: α = 0.05, 1 − β = 0.8.
Once the window for the previously requested sample size is submitted with
the graph option, a plot is displayed (Figure 7.3). The results show that the lower
the OR, the larger the total sample size; however, as the proportion of OC in the
HPV-negative group increases, the differences in sample size become smaller.
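The dialog-box specification described above has a command-line counterpart. The sketch below assumes that the dialog in Figure 7.2 corresponds to the power twoproportions command (the axis label in Figure 7.3 suggests this, but the excerpt does not show the command itself).

```stata
* Control-group proportions 0.10, 0.15, 0.20 crossed with ORs 2, 2.5, 3;
* one-sided test, alpha = 0.05, power = 0.80, equal allocation (the default).
* table prints the sample sizes; graph plots them against p1.
power twoproportions (0.10 0.15 0.20), oratio(2 2.5 3) alpha(0.05) power(0.80) onesided table graph
```

Listing several values in parentheses makes Stata compute one sample size per combination, which is what produces one curve per OR in the plot.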
To determine the sample size for the overall OR while controlling for potential
confounders, an adjustment has to be made, as explained in Chapter 8 (Hosmer
and Lemeshow, 2000).
Chapter 8
Aim: Upon completing the chapter, the learner should be able to use a
logistic regression model to estimate the magnitude of the association
between exposure and disease, controlling for potential confounders.
$$\Pr(Y_i = 1) = p_i = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$
where:
pi indicates the probability of having the diagnosis of interest in the ith subject,
that is, the probability of the ith subject being a case
Y indicates a dichotomous variable, coded as Y = 1 for a case and Y = 0 for a
control
X indicates a predictor variable
β0 indicates the intercept, that is, the coefficient that is not attached to any predictor
β1 indicates the coefficient associated with the predictor X
$$g(x) = \text{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x$$
where:
β0 indicates the value of the logit(pi) when X = 0
β1 indicates changes in the logit(pi) per unit of change in X
where n is the number of observations or the sample size. The likelihood function
provides support for a particular value of the parameter βi, given the observed data.
If the observed data provide more support for one value of the parameter than
for another value, then the likelihood is higher for the former parameter value
(Marschener, 2015).
Under the assumption that the observed data are independent, the likelihood
function can be expressed as follows:
$$L(\beta) = \prod_{i} C_{y_i}^{n_i} \, p_i^{y_i} \, (1 - p_i)^{n_i - y_i}$$
where:
C_{y_i}^{n_i} is the total number of combinations with y_i cases given n_i subjects in the ith
group
pi is the probability of the disease in the ith group
$$L(\beta) = \prod_{i=1}^{n} p_i^{y_i} \, (1 - p_i)^{1 - y_i}$$
where:
pi is the probability of having the disease in the ith subject
n is the total number of subjects
In the case of the logistic regression model, we use the values of the coefficients (βs)
that produce the highest value of the likelihood function. The βs obtained in this
maximization process are identified as the maximum-likelihood estimates (Hardin and
Hilbe, 2001).
Cancer
The parameters of the simple logistic regression model for the above
grouped data can be estimated with different commands. These commands differ
in their default output and in the method used to maximize the
likelihood function for parameter estimation. Some of these commands and their
outputs are shown below.
Output
AIC = 1.364153
Log likelihood = -1041.577122 BIC = -9121.705
------------------------------------------------------------------------------
| OIM
cancer | Coef. Std. Err. Z P>|z| [95% Conf. Interval]
--------+----------------------------------------------------------------
Smoker | 0.4107339 0.1038812 3.95 0.000 0.2071305 0.6143372
_cons | -0.428227 0.0704269 -6.08 0.000 -0.5662612 -0.2901928
------------------------------------------------------------------------------
The option fam(bin) specifies that the dependent variable (cancer) follows a
binomial distribution.
Output
------------------------------------------------------------------------------
cancer | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------+----------------------------------------------------------------
smoker | 0.4107339 0.1038812 3.95 0.000 0.2071306 0.6143373
_cons |-0.4282271 0.0704269 -6.08 0.000 -0.5662613 -0.2901929
------------------------------------------------------------------------------
Output
Logistic regression Number of obs = 1530
LR chi2(1) = 15.69
Prob > chi2 = 0.0001
Log likelihood = -1041.5771 Pseudo R2 = 0.0075
------------------------------------------------------------------------------
cancer | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------+----------------------------------------------------------------
smoker | 0.4107339 0.1038812 3.95 0.000 0.2071306 0.6143373
_cons |-0.4282271 0.0704269 -6.08 0.000 -0.5662613 -0.2901929
------------------------------------------------------------------------------
To display the estimates of the beta parameters, the coef option is used.
Output
The coef option is used to display the estimates of the beta parameters. The option
ml is for obtaining the maximum-likelihood estimates.
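The commands behind the outputs above are not reproduced in this excerpt. The sketch below shows one plausible set of commands. Two things in it are assumptions: the counts were back-calculated from the reported coefficients and the sample size (1530) rather than taken from the book's table, and the name subjects for the frequency-weight variable is hypothetical.

```stata
* Collapsed data; counts are back-calculated from the reported output (assumption)
clear
input smoker cancer subjects
0 1 333
0 0 511
1 1 340
1 0 346
end

* Generalized linear model, binomial family (logit is the default link)
glm cancer smoker [fweight=subjects], fam(bin) nolog

* Maximum-likelihood logit; reports coefficients by default
logit cancer smoker [fweight=subjects], nolog

* logistic reports odds ratios by default; coef switches to coefficients
logistic cancer smoker [fweight=subjects], coef

* binreg with a logit link (or), fitted by maximum likelihood (ml),
* displaying coefficients instead of odds ratios (coef)
binreg cancer smoker [fweight=subjects], or coef ml
```

All four commands fit the same model, so the coefficient for smoker should agree across them up to rounding.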
Therefore, the fitted model for all the commands of the simple logistic regres-
sion can be determined with the following equation:
logit ( p ) = −0.43 + 0.41∗ smoker
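To see what the fitted equation implies on the probability scale, the logit can be inverted with Stata's invlogit() function. This is a small illustration using the rounded coefficients from the equation above.

```stata
* Predicted probability of cancer from the fitted logit (rounded coefficients)
display "nonsmokers (smoker = 0): " invlogit(-0.43)
display "smokers    (smoker = 1): " invlogit(-0.43 + 0.41)
```

These evaluate to roughly 0.39 for nonsmokers and 0.50 for smokers.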
Output
Generalized linear models No. of obs = 2
Optimization : ML Residual df = 0
Scale parameter = 1
Deviance = 7.90479e-14 (1/df) Deviance = .
Pearson = 1.60265e-29 (1/df) Pearson = .
Variance function: V(u) = u*(1-u/total) [Binomial]
Link function : g(u) = ln(u/(total-u)) [Logit]
AIC = 9.063989
Log likelihood = −7.063989303 BIC = 7.90e-14
------------------------------------------------------------------------------
| OIM
cases | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------+----------------------------------------------------------------
smoker | 0.4107339 0.1038812 3.95 0.000 0.2071306 0.6143373
_cons | -0.4282271 0.0704269 -6.08 0.000 -0.5662613 -0.2901929
------------------------------------------------------------------------------
The fam(bin) option is modified to fam(bin total), where total holds the number of
subjects in each group, because the dependent variable denotes the number of cases
and not the presence or absence of disease (dichotomous scenario).
Similar results can be obtained with the binreg command if the following speci-
fications are used:
binreg cases smoker, or n(total) ml
The or and ml options are added to estimate the odds ratio (OR) using the maxi-
mum likelihood method.
$$\text{odds}_i = \frac{p_i}{1 - p_i} = \frac{1 / \left(1 + e^{-(\beta_0 + \beta_1 X)}\right)}{1 - 1 / \left(1 + e^{-(\beta_0 + \beta_1 X)}\right)} = e^{\beta_0 + \beta_1 X}$$
Using this expression, the user can obtain the OR. For example, if we assume that X
takes 0 for unexposed subjects and 1 for exposed subjects, the resulting OR will be
$$OR_{\text{exp vs. unexp}} = \frac{\text{odds}_{\text{exposed}}}{\text{odds}_{\text{unexposed}}} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$$
In this case, the OR is the exponential of the regression coefficient associated with
the exposure. The syntax in Stata to estimate the OR of the previous example (using
the glm command) is as follows:
glm cases smoker, fam(bin total) ef nolog noheader
Output
------------------------------------------------------------------------------
| OIM
cases | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
----------+----------------------------------------------------------------
smoker | 1.507924 0.1566449 3.95 0.000 1.230143 1.848431
_cons | 0.6516634 0.0458946 -6.08 0.000 0.5676437 0.7481193
---------------------------------------------------------------------------
The ef option is added to obtain the estimated OR. The nolog and noheader options
are used to display only the parameters of the model.
The result indicates that the odds of having oral cavity cancer among smokers
is 1.51 (95% CI: 1.23, 1.85) times the odds of having oral cavity cancer among
nonsmokers. This OR is known as the crude OR, because the model includes only
the exposure variable.
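The link between the coefficient table and the odds-ratio table can be verified directly: exponentiating the coefficient of smoker and its confidence limits from the earlier output gives the OR and its interval. The scalar names are illustrative.

```stata
* Exponentiate the coefficient and its 95% confidence limits
scalar or_hat = exp(0.4107339)
scalar or_lo  = exp(0.2071306)
scalar or_hi  = exp(0.6143373)
display "OR = " or_hat "   95% CI: " or_lo ", " or_hi
```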
This comparison is known as the likelihood ratio test. The Stata commands to
perform this test (with the previous database) to assess the effect of the predictor
smoker are as follows:
. quietly: glm cases smoker, fam(bin total)
. estimates store model1
. quietly: glm cases, fam(bin total)
. lrtest model1 .
Likelihood-ratio test LR chi2(1) = 15.69
(Assumption: . nested in model1) Prob > chi2 = 0.0001
The results show that removing the predictor smoker from the model has a signifi-
cant effect (P-value = .0001). Therefore, it is suggested that it not be removed from
the model.
$$Z_0 = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$$
where SE(β̂j) is the asymptotic (i.e., large-sample) standard error of β̂j. The test
statistic Z0 follows an asymptotic standard-normal distribution, N(0,1), under the
null hypothesis.
An equivalent process is to calculate the square of Z0 and use the chi-squared
distribution (χ2) with 1 degree of freedom to assess the null hypothesis, H0: βj = 0.
The use of χ2 is recommended for two-sided alternatives (Ha: βj ≠ 0). For one-sided
alternatives (Ha: βj < 0, Ha: βj > 0), the normal distribution is recommended.
The output of the glm command for the logistic model shows the Wald test for
each predictor. Another Stata command that can be used to perform the Wald test
is test. For example, using the previous database to assess the effect of the predictor
smoker, the syntaxes are as follows:
glm cases smoker, fam(bin total)
test smoker
Output
The likelihood ratio test and the Wald test showed the significant effect of the pre-
dictor smoker in the logistic regression model (P-value = .0001); however, the test
statistics differ (15.69 vs. 15.63).
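The Wald statistic quoted above can be reproduced from the coefficient and standard error in the earlier glm output. A minimal sketch, with an illustrative scalar name:

```stata
* Wald statistic for smoker: coefficient divided by its standard error
scalar z0 = 0.4107339 / 0.1038812
display "Z0   = " z0
display "Z0^2 = " z0^2

* Two equivalent two-sided P-values
display "P (chi-squared, 1 df)  = " chi2tail(1, z0^2)
display "P (standard normal)    = " 2*normal(-abs(z0))
```

The squared statistic is about 15.63, slightly below the likelihood ratio statistic of 15.69; both P-values round to .0001.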
$$\Pr(Y_i = 1) = p_i = \frac{1}{1 + e^{-\left(\beta_0 + \beta_E E + \sum_i \beta_i C_i + \sum_j \gamma_j (E \cdot C)_j\right)}}$$
where:
pi indicates the probability of the ith subject’s having the disease of interest
E indicates the exposure
Ci indicates the ith potential confounding variable
γj indicates the coefficient of the jth interaction term, formed by the product
of the exposure and a potential confounding variable (E*C)
These interaction terms are useful to estimate the magnitude of the association in
different strata.
By way of illustration, let us continue to use the previous example, in which the
predictor sex was included as a potential confounding variable, with the following
data distribution:
To analyze these data, the following database is created in the Stata data editor:
+------------------------------+
| smoker sex cases total |
|------------------------------|
1. | 0 0 218 477 |
2. | 0 1 115 367 |
3. | 1 0 20 42 |
4. | 1 1 370 694 |
+------------------------------+
In Stata, the glm command can be used in conjunction with the previous database
to fit a logistic regression model with interaction terms. The syntax for doing so is
as follows:
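The command itself is missing from this excerpt. A sketch that is consistent with the variable names explained below (_Ismoker_1, _Isex_1, _IsmoXsex_1_1) and with the data listing above would be:

```stata
* Grouped data from the listing above
clear
input smoker sex cases total
0 0 218 477
0 1 115 367
1 0  20  42
1 1 370 694
end

* xi: expands i.smoker*i.sex into the two main effects and their product
xi: glm cases i.smoker*i.sex, fam(bin total) nolog
```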
Output
where:
_Ismoker_1 is a dummy variable with value 1 if the subject smokes and 0 otherwise
_Isex_1 is a dummy variable with value 1 for males and 0 otherwise
_IsmoXsex_1_1 is a dummy variable with value 1 if the subject smokes and is a
male and 0 otherwise
The prefix xi: is placed before the glm command to indicate that some of the
predictors are defined as categorical. This prefix also enables us to define the
model with interaction terms. The instruction i.smoker*i.sex indicates that the
logistic regression model will use as predictors smoker, sex, and the interaction
term formed by the product of these predictors. This form is useful when there are
more than two categories in the predictor variables that are defined as being
categorical.
In the previous output table, the Wald test shows that there is evidence that the
interaction term _IsmoXsex_1_1 affects the logit(p) estimate (P-value = .016). An
alternative procedure for making a statistical assessment of the interaction term is
the likelihood ratio test (lrtest), which is recommended when the user is interested
in assessing several interaction terms simultaneously. The following sequence of
commands performs the lrtest with the previous database:
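The command sequence is not reproduced in this excerpt. A sketch that follows the same pattern as the earlier lrtest example (the stored-estimates name full is arbitrary) is:

```stata
* Grouped data by smoking status and sex
clear
input smoker sex cases total
0 0 218 477
0 1 115 367
1 0  20  42
1 1 370 694
end

* Model with the interaction term
quietly xi: glm cases i.smoker*i.sex, fam(bin total)
estimates store full

* Model without the interaction term (main effects only)
quietly xi: glm cases i.smoker i.sex, fam(bin total)

* Compare the two nested models
lrtest full .
```

The test has 1 degree of freedom because the two models differ only by the single interaction coefficient.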
The results indicate that the interaction term composed of smoker and sex is
statistically significant (P-value = .0001), which is similar to what was found using
the Wald test. Therefore, the variable sex modifies the relationship between smoking
status and cancer. As a consequence, it is recommended that sex-specific ORs be
estimated using the lincom command, as follows:
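The lincom commands are not shown in this excerpt. Based on the constraint printed in the output for males, a plausible sketch is the following (the data entry repeats the earlier listing so that the block runs on its own):

```stata
clear
input smoker sex cases total
0 0 218 477
0 1 115 367
1 0  20  42
1 1 370 694
end
quietly xi: glm cases i.smoker*i.sex, fam(bin total)

* In females (sex = 0): only the main effect of smoking applies
lincom _Ismoker_1, or

* In males (sex = 1): main effect plus the interaction coefficient
lincom _Ismoker_1 + _IsmoXsex_1_1, or
```

The or option only exponentiates the linear combination and labels the result as an odds ratio.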
In females the result indicates that the odds of having oral cavity cancer among smok-
ers is 1.08 (95% CI: 0.57, 2.03) times the odds of having oral cavity cancer among
nonsmokers. However, this excess was not statistically significant (P-value > .1).
*In males
( 1) [cases]_Ismoker_1 + [cases]_IsmoXsex_1_1 = 0
------------------------------------------------------------------------------
cases | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------+----------------------------------------------------------------
(1) | 2.502415 .3399329 6.75 0.000 1.917479 3.26579
------------------------------------------------------------------------------
In males the result indicates that the odds of having oral cavity cancer among
smokers is 2.5 (95% CI: 1.92, 3.27) times the odds of having oral cavity cancer
among nonsmokers. This excess was statistically significant (P-value < .05).
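Because the interaction model has one parameter for each of the four smoker-by-sex groups, the sex-specific ORs coincide with the ORs computed directly from the grouped counts. A quick check with illustrative scalar names:

```stata
* Odds = cases / noncases within each group; OR = odds in smokers / odds in nonsmokers
scalar or_female = (20/(42-20)) / (218/(477-218))
scalar or_male   = (370/(694-370)) / (115/(367-115))
display "OR in females = " or_female
display "OR in males   = " or_male
```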
where:
C′ is used to distinguish the value of the potential confounders
2. Calculate the odds when the exposure is present (E1 = 1):
$$\text{Odds}_1 = \frac{p_1}{1 - p_1} = e^{\beta_0 + \beta_E + \sum_i \beta_i C_i}$$
3. Calculate the ratio of the odds obtained in steps (1) and (2):
When (Ci − Ci′ ) = 0, that is, when we assume that the values of the potential con-
founding variables are equal in exposed and nonexposed subjects, we can obtain
the adjusted odds ratio (OR adjusted), as follows:
$$OR_{\text{adjusted}} = e^{\beta_E}$$
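The expressions for steps (1) and (3) are not reproduced in this excerpt. Under the model without interaction terms, and writing C′ for the confounder values of the unexposed subject as defined above, they would presumably read as follows (a reconstruction from the surrounding definitions, not a quotation):

```latex
% Step 1: odds when the exposure is absent (E_0 = 0)
\text{Odds}_0 = \frac{p_0}{1 - p_0} = e^{\beta_0 + \sum_i \beta_i C_i'}

% Step 3: ratio of the odds from steps (2) and (1)
\frac{\text{Odds}_1}{\text{Odds}_0}
  = \frac{e^{\beta_0 + \beta_E + \sum_i \beta_i C_i}}{e^{\beta_0 + \sum_i \beta_i C_i'}}
  = e^{\beta_E + \sum_i \beta_i (C_i - C_i')}
```

Setting every difference (Ci − Ci′) to zero removes the sum from the exponent and leaves the adjusted OR as the exponential of βE.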
The syntax in Stata for obtaining the adjusted odds ratio using the previous data is
as follows:
xi: glm cases i.smoker i.sex, fam(bin total) ef nolog noheader
Output
------------------------------------------------------------------------------
| OIM
cases | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
--------––––+----------------------------------------------------------------
_Ismoker_1 | 2.211517 0.2757527 6.37 0.000 1.732026 2.823749
_Isex_1 | 0.6247632 0.0825355 -3.56 0.000 0.482243 0.8094034
_cons | 0.7947862 0.0708192 -2.58 0.010 0.6674276 0.9464473
------------------------------------------------------------------------------
The results indicate that the odds of having oral cavity cancer in smokers is 2.21
(95% CI: 1.73, 2.82) times the odds of having oral cavity cancer in nonsmokers,
after adjusting for sex. The difference between the point estimate of the adjusted OR
(OR adjusted = 2.21) and the point estimate of the crude OR (OR crude = 1.51)
indicates that the crude OR underestimates the magnitude of the association.
Therefore, the variable sex confounds the relationship between the smoking habit
and oral cavity cancer.
. xi: glm cases i.smoker if sex==1, fam(bin total) ef nolog noheader
------------------------------------------------------------------------------
| OIM
cases | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
–------------+----------------------------------------------------------------
_Ismoker_1 | 2.502415 0.3399329 6.75 0.000 1.917479 3.26579
_cons | 0.4563492 0.0513548 -6.97 0.000 0.3660228 0.5689662
------------------------------------------------------------------------------
. xi: glm cases i.smoker if sex==0, fam(bin total) ef nolog noheader
------------------------------------------------------------------------------
| OIM
cases | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+---------------------------------------------------------------
_Ismoker_1 | 1.080067 0.3481481 0.24 0.811 0.5742153 2.031545
_cons | 0.8416988 0.0773638 -1.87 0.061 0.702942 1.007845
------------------------------------------------------------------------------
This result indicates that the predictor sex has a modifying effect on the relationship
between the smoking habit and oral cavity cancer, as was expected (because of the
significant results in the likelihood ratio test).
Output
------------------------------------------------------------------------------
| OIM
cases | Risk Ratio Std. Err. z P>|z| [95% Conf. Interval]
----------+------------------------------------------------------------------
1.smoker | 1.701416 0.1446968 6.25 0.000 1.440191 2.010022
_cons | 0.3133515 0.0242131 -15.02 0.000 0.2693136 0.3645904
------------------------------------------------------------------------------
glm cases i.smoker if sex==0, fam(bin total) ef link(log)
Output
------------------------------------------------------------------------------
| OIM
cases | Risk Ratio Std. Err. z P>|z| [95% Conf. Interval]
--------- +----------------------------------------------------------------
1.smoker | 1.04194 0.1764579 0.24 0.808 0.7476307 1.452105
_cons | 0.4570231 0.0228087 -15.69 0.000 0.4144356 0.5039868
------------------------------------------------------------------------------
Among males we can see that there is a substantial difference between the ORs and
the PRs; the estimated OR is 2.50 and the estimated PR is 1.70.
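The command for males is not shown in this excerpt; by analogy with the command shown for females, it would presumably differ only in the if condition. The sketch below runs both models and also recomputes the PRs directly from the grouped counts (the scalar names are illustrative).

```stata
clear
input smoker sex cases total
0 0 218 477
0 1 115 367
1 0  20  42
1 1 370 694
end

* Log-binomial models: binomial family with a log link, exponentiated coefficients
glm cases i.smoker if sex==1, fam(bin total) link(log) ef nolog
glm cases i.smoker if sex==0, fam(bin total) link(log) ef nolog

* Direct check: proportion of cases in smokers / proportion in nonsmokers
scalar pr_male   = (370/694) / (115/367)
scalar pr_female = (20/42) / (218/477)
display "PR in males = " pr_male "   PR in females = " pr_female
```

With the proportion of cases near or above 30% in every group, the OR (2.50 in males) is noticeably further from 1 than the PR (1.70), which is the difference the text points out.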
Another command that can be used to obtain the PR by sex in a logistic regres-
sion model is binreg, as follows:
binreg cases smoker if sex==1, n(total) rr nolog
Output
------------------------------------------------------------------------------
| EIM
cases | Risk Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
smoker | 1.701416 0.1446966 6.25 0.00 1.440191 2.010022
_cons | 0.3133515 0.0242131 -15.02 0.000 0.2693137 0.3645904
------------------------------------------------------------------------------
binreg cases smoker if sex==0, n(total) rr nolog
Output
------------------------------------------------------------------------------
| EIM
cases | Risk Ratio Std. Err. z P>|z| [95% Conf. Interval]
--------- +----------------------------------------------------------------
smoker | 1.04194 0.1764578 0.24 0.808 0.7476309 1.452105
_cons | 0.4570231 0.0228087 -15.69 0.000 0.4144356 0.5039868
------------------------------------------------------------------------------
The observed results (using binreg and glm), by sex, show that the estimates of the
PRs are the same; only slight differences are observed in the standard errors, and
these are due to the default methods used to estimate the variance; the glm com-
mand uses the maximum-likelihood method, and binreg uses Fisher’s scoring method
(Hardin and Hilbe, 2001; Collett, 2002).
in the outcome Y. For example, assuming Y has k + 1 categories (0, 1, 2, …, k), then
the simplest expression of the multinomial regression model is the following:
$$\ln\left(\frac{\Pr[Y = k \mid E]}{\Pr[Y = 0 \mid E]}\right) = \beta_0 + \beta_E E_i$$
The exponential of the estimated exposure coefficient (β̂E) will provide the estimated
OR between the category with code k and the category with code 0, as follows:
$$\widehat{OR}^{(k \text{ vs. } 0)}_{E+ \text{ vs. } E-} = e^{\hat{\beta}_E}$$
The interpretation of this OR is as follows: in reference to category 0, the probability
of being in category k among the exposed group is exp(β̂E) times the corresponding
probability among the nonexposed group. For the following example, let us assume
that we are working with a case-control study to assess the relationship between hepa-
titis C and receiving a blood transfusion (before 1992), using two types of controls
(subjects with hepatitis B and healthy subjects), and using, as well, the following data:
Blood transfusion   Hepatitis C (3)   Hepatitis B (2)   Healthy (1)
Yes (1)                   19                11               14
No (0)                    85                63              220
Note: Codes are in parentheses.
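The command that produced the output below is not included in this excerpt. A minimal sketch with mlogit follows; the outcome coding (1 = healthy, 2 = hepatitis B, 3 = hepatitis C) is inferred from the output and its interpretation, and the name subjects for the frequency-weight variable is hypothetical.

```stata
* hep: 1 = healthy, 2 = hepatitis B, 3 = hepatitis C (coding inferred from the output)
* trans: 1 = blood transfusion before 1992, 0 = none
clear
input hep trans subjects
3 1  19
3 0  85
2 1  11
2 0  63
1 1  14
1 0 220
end

* Multinomial logistic regression with healthy subjects as the base outcome;
* rrr reports exponentiated coefficients
mlogit hep trans [fweight=subjects], rrr baseoutcome(1) nolog
```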
Output
--------------––----------------------------------------------
hep | RRR Std. Err. z P>|z| [95% Conf. Interval]
-----------+----------------------------------------------------------------
1 | (base outcome)
-----------+----------------------------------------------------------------
2 |
trans | 2.743767 1.17296 2.36 0.018 1.187022 6.342139
_cons | .2863636 .0409194 -8.75 0.000 .2164148 .3789211
-----------+----------------------------------------------------------------
3 |
trans | 3.512603 1.316033 3.35 0.001 1.685457 7.320497
_cons | .3863636 .049343 -7.45 0.000 .3008072 .4962543
------------------------------------------------------------------------------
The above table shows that in reference to the healthy subjects: the likelihood of
hepatitis C among the subjects who had had blood transfusion experience before
1992 is 3.51 (95% CI: 1.69, 7.32) times the likelihood of hepatitis C among the
subjects who had not had blood transfusion experience before 1992. This excess was
statistically significant (P-value = .001).
To use as the reference the subjects with hepatitis B instead of the healthy sub-
jects, you need to use the option baseoutcome, as is done in the following:
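The command is not shown in this excerpt. Under the same assumed coding (1 = healthy, 2 = hepatitis B, 3 = hepatitis C) and the hypothetical weight variable subjects, a sketch is given below, together with a direct check of the hepatitis C versus hepatitis B estimate from the table counts.

```stata
clear
input hep trans subjects
3 1  19
3 0  85
2 1  11
2 0  63
1 1  14
1 0 220
end

* Same model, now with hepatitis B (code 2) as the base outcome
mlogit hep trans [fweight=subjects], rrr baseoutcome(2) nolog

* Direct check: (hep C / hep B) among transfused over (hep C / hep B) among nontransfused
scalar rrr_c_vs_b = (19/11) / (85/63)
display "Hepatitis C vs. hepatitis B: " rrr_c_vs_b
```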
Output
------------------------------------------------------------------------------
hep | RRR Std. Err. z P>|z| [95% Conf. Interval]
---------+----------------------------------------------------------------
1 |
trans | 0.3644624 0.1558076 -2.36 0.018 0.1576755 0.8424443
_cons | 3.492063 0.4989922 8.75 0.000 2.639072 4.620756
----- –---+----------------------------------------------------------------
2 | (base outcome)
---------+----------------------------------------------------------------
3 |
trans | 1.280212 0.529671 0.60 0.550 0.5689947 2.880417
_cons | 1.349206 0.2243001 1.80 0.072 0.9740238 1.868905
------------------------------------------------------------------------------
The table above indicates that in reference to the subjects with hepatitis B, the like-
lihood of hepatitis C among the subjects who had had blood transfusion experience
before 1992 is 1.28 (95% CI: 0.57, 2.88) times the likelihood of hepatitis C among
the subjects who had not had blood transfusion experience before 1992. However,
this excess was not statistically significant (P-value > .1).
For the ordinal logistic regression model, there are different expressions, the
use of each depending on the manner in which the categories are compared. When
these categories are grouped and the ORs do not depend on the grouping procedure,
it is said that the proportional odds assumption is met. The most common expression
of this model, under this assumption, is as follows:
$$\ln\left(\frac{\Pr[Y \le k \mid E]}{\Pr[Y > k \mid E]}\right) = \beta_0 - \beta_E E_i$$
This model combines the categories of the outcome into two groups, as follows: those
subjects with categories that are less than or equal to k and those with categories that
are greater than k. The negative sign in the coefficient of the exposure occurs because
of the way Stata parameterizes this model; therefore, caution has to be taken in
interpreting the output and in defining the codes of the outcome categories. The
exponential of the estimated exposure coefficient, β̂E, will provide the estimated OR
between categories with code >k and categories with code ≤k, due to the following
relationship:
$$e^{-\hat{\beta}_E} = \widehat{OR}^{(\le k \text{ vs. } > k)}_{E+ \text{ vs. } E-}, \quad \text{then,} \quad \widehat{OR}^{(> k \text{ vs. } \le k)}_{E+ \text{ vs. } E-} = \frac{1}{\widehat{OR}^{(\le k \text{ vs. } > k)}_{E+ \text{ vs. } E-}} = e^{\hat{\beta}_E}$$
The interpretation of this OR is as follows: the likelihood of being in a category
greater than k among the members of the exposure group is exp(β̂E) times the likelihood
of being in a category greater than k among the nonexposure group. To improve
the interpretation, use high values of the outcome codes for those subjects with
the worst outcomes. For example, assuming a case-control study to assess the
relationship between glycohemoglobin and age, let us suppose that glycohemoglobin is
categorized into three groups, using tertiles as the cutoff points, as follows:
In addition, let us assume that age was categorized into two groups (above and at
or below the mean value of the study sample). Using the available data, then, the
following table results:
                  Glycohemoglobin group
Age (years)      1       2       3      Total
≤45 (0)         14       7       3        24
>45 (1)          9      16      14        39
Total           23      23      17        63
+---------------------------+
| glycon3 age subjects |
|---------------------------|
1. | 1 0 14 |
2. | 1 1 9 |
3. | 2 0 7 |
4. | 2 1 16 |
5. | 3 0 3 |
|---------------------------|
6. | 3 1 14 |
+---------------------------+
The ordinal logistic model can be run with the assumption that the OR depends
on the cutoff point of the outcome. Therefore, for every cutoff point in the
outcome, one OR is estimated. If we assume that the proportional odds assump-
tion is fulfilled, then we would expect all ORs to be equal. The syntax in Stata
to run the ordinal logistic model without the proportional odds assumption is
as follows:
Output
The syntax in Stata to run the ordinal logistic model, assessing the proportional
odds assumption, is as follows:
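Neither command is reproduced in this excerpt. The sketch below assumes the user-written gologit2 command (named in the output further down), installed from SSC, and uses the data listing above; the exact options used by the authors are an assumption, apart from autofit and lrf, which the text mentions.

```stata
* gologit2 is user-written; install it once with:  ssc install gologit2
clear
input glycon3 age subjects
1 0 14
1 1  9
2 0  7
2 1 16
3 0  3
3 1 14
end

* Without the proportional odds assumption: one OR for age at each cutoff point
gologit2 glycon3 age [fweight=subjects], or

* Assessing the assumption: autofit tests the parallel-lines constraint
* and imposes it on age if the test is not significant
gologit2 glycon3 age [fweight=subjects], autofit lrforce or
```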
Output
------------------------------------------------------------------------------
Testing parallel lines assumption using the .05 level of
significance...
Step 1: Constraints for parallel lines imposed for age
(P Value = 0.8004)
Step 2: All explanatory variables meet the pl assumption
Wald test of parallel lines assumption for the final model:
( 1) [1]age - [2]age = 0
chi2( 1) = 0.06
Prob > chi2 = 0.8004
An insignificant test statistic indicates that the final model
does not violate the proportional odds/ parallel lines
assumption
If you re-estimate this exact same model with gologit2,
instead of autofit you can save time by using the parameter
pl(age)
------------------------------------------------------------------------------
Generalized Ordered Logit Estimates Number of obs = 63
LR chi2(1) = 8.77
Prob > chi2 = 0.0031
Log likelihood = -64.236022 Pseudo R2 = 0.0639
( 1) [1]age - [2]age = 0
------------------------------------------------------------------------------
glycon3 | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
----------+----------------------------------------------------------------
1 |
age | 4.427933 2.303347 2.86 0.004 1.597417 12.27394
_cons | .7293552 .2970438 -0.77 0.438 .3283002 1.620343
-----------+----------------------------------------------------------------
2 |
age | 4.427933 2.303347 2.86 0.004 1.597417 12.27394
_cons | .1290829 .0626706 -4.22 0.000 .0498431 .334297
------------------------------------------------------------------------------
The first assessment in gologit2 with the autofit and lrf options is used to determine
whether there is statistical evidence against the proportional odds assumption; the
output above reports this as a Wald test of the parallel-lines assumption. If this
assumption has been fulfilled, the same OR is estimated for all cutoff points of the
outcome. In this example, the output indicates that the model does not violate the
proportional odds assumption (P-value = .8004). As a consequence we interpret only
one OR, as follows:
Using as the reference category the participants with the low levels of glycohe-
moglobin, the likelihood of having high levels of glycohemoglobin among subjects
older than 45 years is 4.43 (95% CI: 1.60, 12.27) times the likelihood of having
high levels of glycohemoglobin among subjects 45 years old or younger.
8.12 Overdispersion
When the logistic regression model is run with grouped data (binomial proportion),
the relationship between the deviance and the degrees of freedom can be useful in
determining the model’s goodness of fit (McCullagh and Nelder, 1999; Hardin and
Hilbe, 2001). Overdispersion occurs when data exhibit more variation than expected.
Underdispersion occurs when data exhibit less variation than expected. Because devi-
ance is a random variable with chi-squared distribution, if the model is adequate
to explain the binomial proportion, then it is expected that the observed deviance
would be close to the degrees of freedom of the model (equidispersion). For example,
if we run the model with only the predictor smoker in the example of cancer explained
by smoker and sex, we discover that the deviance is 18.62, with 2 degrees of freedom;
therefore, overdispersion is observed. To assess the departure between the deviance
and the degrees of freedom, we can use the P-value to determine the statistical signifi-
cance of this difference. The syntax to perform this in Stata is as follows:
dis chi2tail(2,18.62)
.00009051
The results show that there is a significant difference between the deviance and
its degrees of freedom (P-value = 0.00009051). Therefore, the logistic regression
model using only the predictor smoker is not adequate. Either including more pre-
dictors or exploring other models would be another option to consider at this point.
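The comparison between the deviance and its degrees of freedom can also be written with scalars, which makes the dispersion ratio explicit. This is only a minimal sketch; the scalar names dev and resdf are illustrative and do not come from the text.

```stata
* Sketch: compare a deviance with its degrees of freedom
* (values taken from the smoker-only model discussed above)
scalar dev   = 18.62      // observed deviance
scalar resdf = 2          // degrees of freedom
* A ratio close to 1 suggests equidispersion; well above 1, overdispersion
display "Deviance/df = " dev/resdf
* Upper-tail chi-squared probability, as in dis chi2tail(2,18.62)
display "P-value     = " chi2tail(resdf, dev)
```

A ratio of 9.31 together with a P-value of about .00009 points in the same direction as the text: the one-predictor model leaves more variation than a binomial model would be expected to produce.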
To determine the minimum sample size for a logistic regression model with one exposure and different covariates, the following expression (Hosmer and Lemeshow, 2000) can be used:
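\[
n = \frac{(1 + 2P_0)}{1 - \rho^2} \times \frac{\left[ Z_{1-\alpha}\sqrt{\dfrac{1}{1-\pi} + \dfrac{1}{\pi}} + Z_{1-\beta}\sqrt{\dfrac{1}{1-\pi} + \dfrac{1}{\pi e^{\beta_E}}} \right]^2}{P_0 \times \beta_E^2}
\]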
where:
Z1−α and Z1−β denote the upper α and β percentage points, respectively, of the standard normal distribution
π denotes the fraction of subjects in the study who are not exposed
P0 denotes the probability of being a case among those who are not exposed
ρ2 denotes the squared correlation between the observed and fitted values of the exposure (dichotomous variable) using a logistic regression model, as follows: logit(Pr[E = 1]) = β0 + ∑ βi ∗ Xi
\[
\text{pseudo } R^2 = 1 - \frac{L_p}{L_0}
\]
where:
L0 and Lp denote the log likelihoods for models containing only the intercept
and the model containing the intercept plus the p-covariates, respectively
βE denotes the coefficient of the exposure in the multivariate logistic regression
model for the outcome of interest under the alternative hypothesis. If we
assume that OR = 2, then βE is approximately 0.69314 {ln(2) = 0.69314}
Based on the data presented previously, the purpose of which was to assess the magnitude of the association between cancer and the smoking habit adjusted by sex (ORadj = 2.21), the parameters needed to obtain the minimum sample size are as follows:
Therefore,
\[
n = \frac{(1 + 2 \times 0.3945)}{1 - 0.5679} \times \frac{\left[ 1.96\sqrt{\dfrac{1}{1-0.4658} + \dfrac{1}{0.4658}} + 1.28\sqrt{\dfrac{1}{1-0.4658} + \dfrac{1}{0.4658\, e^{0.7929}}} \right]^2}{0.3945 \times 0.7929^2} = 618.64
\]
Because it is desirable for the total sample size to be divisible by 2, the minimum required is about 620 subjects, or 310 per group. Unfortunately, Stata's power and sample size calculation tool does not provide an option for this formula. Therefore, a do-file has to be programmed with the following sequence of commands (assuming the data of the previous example):
gen a=(1+2*.3945)/(1-.5679)
gen b=1.96*sqrt((1/(1-.4658))+(1/.4658))
gen c=1.28*sqrt( (1/(1-.4658)) + (1/(.4658*exp(.7929))))
gen d=.3945*.7929^2
gen n=a*((b+c)^2)/d
Before running these commands, open the Data Editor and create a dataset with one variable, such as id, and a single row, because gen needs at least one observation in which to store its result. The other option is to work interactively with Stata by invoking the mata command, as follows:
. mata
–––––––––––––––––––– mata (type end to exit) ----------------––
: a=(1+2*.3945)/(1-.5679)
: b=1.96*sqrt((1/(1-.4658))+(1/.4658))
: c=1.28*sqrt( (1/(1-.4658)) + (1/(.4658*exp(.7929))))
: d=.3945*.7929^2
: n=a*((b+c)^2)/d
: n
618.6378426
: end
---------------------------------------------------------------
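A third possibility, sketched here under the assumption that no dataset is wanted at all, is to hold each quantity in a Stata scalar. The scalar names (p0, rho2, pi_u, bE, za, zb) are illustrative; the numeric values are those used in the calculation above.

```stata
* Sketch: minimum sample size for logistic regression using scalars
scalar p0   = .3945      // P0: probability of being a case among the unexposed
scalar rho2 = .5679      // squared correlation (rho^2)
scalar pi_u = .4658      // pi, as defined for the formula
scalar bE   = .7929      // beta_E = ln(2.21)
scalar za   = 1.96       // Z(1-alpha) used in the worked example
scalar zb   = 1.28       // Z(1-beta) used in the worked example
scalar a = (1 + 2*p0)/(1 - rho2)
scalar b = za*sqrt(1/(1 - pi_u) + 1/pi_u)
scalar c = zb*sqrt(1/(1 - pi_u) + 1/(pi_u*exp(bE)))
scalar n = a*(b + c)^2/(p0*bE^2)
display "Minimum sample size         = " %9.2f n
display "Rounded up to an even total = " 2*ceil(n/2)
```

Scalars need no observations, so this version runs in an empty Stata session; changing a single scalar (for example bE for a different target OR) updates the whole calculation.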
Chapter 9
Poisson Regression Model
Aim: Upon completing the chapter, the learner should be able to estimate the magnitude of the association between disease and exposure, controlling for potential confounders, using a Poisson regression model.
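[i] \( \mu_i = T_i \times I_i = T_i \times e^{\beta_0 + \beta_E E + \beta_C C + \beta_{EC}(E \times C)} \)
[ii] \( I_i = \mu_i / T_i = e^{\beta_0 + \beta_E E + \beta_C C + \beta_{EC}(E \times C)} \)
[iii] \( \ln(\mu_i) = \ln(T_i) + \beta_0 + \beta_E E + \beta_C C + \beta_{EC}(E \times C) \)
[iv] \( \ln(I_i) = \beta_0 + \beta_E E + \beta_C C + \beta_{EC}(E \times C) \)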
where:
µi indicates the expected value for the outcome variable
Ii represents the incidence (the expected cases by time unit or population under
the ith condition)
Ti represents the sum of the times in the study under the ith condition
E denotes the exposure variable
C denotes the effect of the confounder variable
E*C denotes the interaction between the exposure and confounder
βj denotes the coefficients (parameters) associated with the jth predictor vari-
ables (j = E, C, or E * C); this value represents the expected changes in the
natural logarithm of μi in the expression [iii]
β0 represents the constant term (intercept) in the model
The expression Ln(Ti) denotes the natural logarithm of Ti under the expression [iii]
of the Poisson regression model, which is included as a predictor variable with a
coefficient or parameter equal to 1. This type of predictor is identified as an offset
and has a fixed parameter.
Exposure (E = 1)
\[
I_{exp} = \frac{\mu_{exp}}{T_{exp}} = e^{\beta_0 + \beta_E + \beta_C C + \beta_{EC} C}
\]
Nonexposure (E = 0)
\[
I_{non\text{-}exp} = \frac{\mu_{non\text{-}exp}}{T_{non\text{-}exp}} = e^{\beta_0 + \beta_C C'}
\]
If C = C′, then
\[
RR = \frac{I_{exp}}{I_{non\text{-}exp}} = e^{\beta_E + \beta_{EC} C}
\]
In the case of nonsignificant interaction terms (H0: βEC = 0), we can obtain the adjusted RR using the following:
\[
RR_{adjusted} = \frac{I_{exp}}{I_{non\text{-}exp}} = e^{\beta_E^{*}}
\]
where β*E is obtained from the model that excludes the interaction term. If the inter-
action term is significant, it is necessary to estimate the RR in different population
subgroups defined by the levels of C.
where µi is the expected value of Y under the ith condition in the Poisson regression model. The values of the coefficients β that produce the highest value of this likelihood function are the maximum-likelihood estimates (MLEs) for this model. Based on the MLEs, we can also estimate the RRs and test statistical hypotheses, using an approach similar to that used for the logistic regression model.
9.4 Example
Suppose we are interested in assessing the difference in the incidence of car-
diovascular disease by sex, controlling for age. Available data for this pur-
pose can be extracted from the epidemiological cohort study of Framingham
(Massachusetts), which started in 1948 with a sample of 5,127 subjects, aged
30–62 years old. The following table summarizes the incidence of cardiovascu-
lar disease by age and sex:
The last column of the table shows the RRs between males and females by age group.
The observed trend in these RRs indicates that, in the older age groups, the RRs are
getting close to 1; therefore, the incidences of cardiovascular disease by sex are quite
different for the younger age groups and quite similar for the older age groups. This
trend suggests that age has a modifying effect on the relationship between sex and
cardiovascular disease (Szklo and Nieto, 2004).
+-------------------------+
| age sex py cvd |
|-------------------------|
1. | 1 1 7370 43 |
2. | 1 0 9205 9 |
3. | 2 1 12649 163 |
4. | 2 0 16708 71 |
5. | 3 1 7184 155 |
|-------------------------|
6. | 3 0 10139 105 |
7. | 4 1 15015 443 |
8. | 4 0 24338 415 |
9. | 5 1 470 19 |
10. | 5 0 1383 50 |
+-------------------------+
Note:
◾ age indicates the code of the age group (1: <46 years; 2: 46–55 years;
3: 56–60 years; 4: 61–80 years; 5: >80 years)
◾ sex indicates the code of the sex (0 = female, 1 = male)
◾ py indicates person-years
◾ cvd indicates the number of cardiovascular disease cases
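The ten grouped records listed above are small enough to type in directly. The following sketch enters them with input and computes the crude male-to-female rate ratio within each age group; the variable names rate and rr are illustrative additions.

```stata
* Sketch: enter the grouped data and compute crude rate ratios by age group
clear
input age sex py cvd
1 1  7370  43
1 0  9205   9
2 1 12649 163
2 0 16708  71
3 1  7184 155
3 0 10139 105
4 1 15015 443
4 0 24338 415
5 1   470  19
5 0  1383  50
end
generate rate = cvd/py                  // incidence per person-year
* Within each age group, sort by sex so that row 1 = female, row 2 = male
bysort age (sex): generate rr = rate[2]/rate[1]
list age rr if sex == 1, noobs
```

For the first two age groups the ratios work out to about 5.97 and 3.03, which gives a simple benchmark for checking estimates produced by a regression model fitted to the same records.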
. lrtest model1 .
Note: lnoff(py) indicates the inclusion of the natural logarithm of the py variable
as an offset variable.
The results indicate a significant age–sex interaction term (P-value < .0001), confirming that age has a modifying effect on the relationship that exists between sex and cardiovascular disease. Therefore, it is necessary to estimate the RR (male vs. female) by age group. To carry out this evaluation, we fit the Poisson regression model with the interaction terms included. For these data, the model can be programmed in Stata with the following command:
Output
Generalized linear models No. of obs = 10
Optimization : ML Residual df = 0
Scale parameter = 1
Deviance = 5.31308e-13 (1/df) Deviance = .
Pearson = 4.76108e-13 (1/df) Pearson = .
AIC = 8.241062
Log likelihood = -31.20531135 BIC = 5.31e-13
-------------------------------------------------------------------------------
| OIM
cvd | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------+---------------------------------------------------------------
_Isex_1 | 1.786304 .3665609 4.87 0.000 1.067858 2.504751
_Iage_2 | 1.469314 .3538299 4.15 0.000 .7758204 2.162808
_Iage_3 | 2.360093 .3473253 6.80 0.000 1.679348 3.040838
_Iage_4 | 2.858762 .3369284 8.48 0.000 2.198394 3.519129
_Iage_5 | 3.61029 .3620926 9.97 0.000 2.900601 4.319979
_IsexXage_1_2 |-.6769246 .3931747 -1.72 0.085 -1.447533 .0936837
_IsexXage_1_3 |-1.052307 .38774 -2.71 0.007 -1.812263 -.2923501
_IsexXage_1_4 |-1.238024 .3728725 -3.32 0.001 -1.968841 -.5072074
_IsexXage_1_5 |-1.674611 .4549709 -3.68 0.000 -2.566337 -.7828843
_cons |-6.930277 .3333333 -20.79 0.000 -7.583599 -6.276956
ln(py) | 1 (exposure)
-------------------------------------------------------------------------------
The resulting equation of the Poisson regression model, with the estimated coefficients of this output rounded to one decimal place, is as follows:
Ln(µ̂i / PYi) = −6.9 + 1.8 * _Isex_1 + 1.5 * _Iage_2 + 2.4 * _Iage_3 + 2.9 * _Iage_4 + 3.6 * _Iage_5 − 0.7 * _IsexXage_1_2 − 1.1 * _IsexXage_1_3 − 1.2 * _IsexXage_1_4 − 1.7 * _IsexXage_1_5
where:
PYi = person-years
_Isex_1 = 1, if sex = M; _Isex_1 = 0 if sex = F
_Iage_2 = 1, if group of age “2”; _Iage_2 = 0 other groups of age
_Iage_3 = 1, if group of age “3”; _Iage_3 = 0 other groups of age
_Iage_4 = 1, if group of age “4”; _Iage_4 = 0 other groups of age
_Iage_5 = 1, if group of age “5”; _Iage_5 = 0 other groups of age
_IsexXage_1_2 = 1, if _Isex_1 = 1 and _Iage_2 = 1, otherwise 0
_IsexXage_1_3 = 1, if _Isex_1 = 1 and _Iage_3 = 1, otherwise 0
_IsexXage_1_4 = 1, if _Isex_1 = 1 and _Iage_4 = 1, otherwise 0
_IsexXage_1_5 = 1, if _Isex_1 = 1 and _Iage_5 = 1, otherwise 0
Considering the previous estimated coefficients, and using the expression [ii] of the Poisson model, we can determine the age-specific incidences as follows:
1. Incidence for the <46 years age group (age2 = 0, age3 = 0, age4 = 0, age5 = 0):
I<46 = e^(−6.9 + 1.8 * _Isex_1)
2. Incidence for the 46–55 years age group (age2 = 1, age3 = 0, age4 = 0, age5 = 0):
I46–55 = e^(−6.9 + 1.5 + (1.8 − 0.7) * _Isex_1)
3. Incidence for the 56–60 years age group (age2 = 0, age3 = 1, age4 = 0, age5 = 0):
I56–60 = e^(−6.9 + 2.4 + (1.8 − 1.1) * _Isex_1)
4. Incidence for the 61–80 years age group (age2 = 0, age3 = 0, age4 = 1, age5 = 0):
I61–80 = e^(−6.9 + 2.9 + (1.8 − 1.2) * _Isex_1)
5. Incidence for the >80 years age group (age2 = 0, age3 = 0, age4 = 0, age5 = 1):
I>80 = e^(−6.9 + 3.6 + (1.8 − 1.7) * _Isex_1)
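Therefore, to estimate the relative risk (males vs. females) for the first two age groups, we estimate the incidence by sex and then divide these incidences within each age group as follows:
For the <46 years age group:
Imale = e^(−6.9 + 1.8)
Ifemale = e^(−6.9)
Then,
RR = Imale / Ifemale = e^(1.8) ≈ 5.97
For the 46–55 years age group:
Imale = e^(−6.9 + 1.8 + 1.5 − 0.7)
Ifemale = e^(−6.9 + 1.5)
Then,
RR = Imale / Ifemale = e^(1.8 − 0.7) ≈ 3.03
The RR values shown are computed from the unrounded coefficients of the output (1.786304 and −0.6769246); the exponents are rounded to one decimal place only for readability.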
To facilitate the estimation of these RRs with 95% confidence intervals in Stata, we can use the lincom command after fitting the model with interaction terms, instead of fitting one model for each age group. After running the model with interaction terms, we enter lincom followed by the name that Stata assigned to the predictor, then the plus sign (+) followed by the corresponding interaction term, and finally the option irr. The syntax for the first two age groups is as follows:
For the <46 years age group:
Output
(1) [cvd]_Isex_1 = 0
------------------------------------------------------------------------------
cvd | IRR Std. Err. z P>|z| [95% Conf. Interval]
---------+-----------------------------------------------------------
(1) | 5.967359 2.1874 4.87 0.000 2.909142 12.24051
------------------------------------------------------------------------------
Output
The incidence of cardiovascular disease in males aged 46–55 years old is 3.03 (95%
CI: 2.29, 4.01) times the incidence of cardiovascular disease in females aged 46–55
years old. This greater level of risk is highly significant (P-value < .001).
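The whole sequence (additive model, model with interaction, likelihood ratio test, and age-specific rate ratios) might look like the sketch below. It assumes the xi: prefix, so that the indicator names match those in the output shown earlier, and it uses the exposure(py) option of glm to enter ln(py) as the offset; treat both choices as assumptions rather than as the exact commands used in the text.

```stata
* Sketch: Poisson model with sex-by-age interaction and age-specific IRRs
clear
input age sex py cvd
1 1  7370  43
1 0  9205   9
2 1 12649 163
2 0 16708  71
3 1  7184 155
3 0 10139 105
4 1 15015 443
4 0 24338 415
5 1   470  19
5 0  1383  50
end
* Additive model, stored for the likelihood ratio test
xi: glm cvd i.sex i.age, family(poisson) link(log) exposure(py)
estimates store model1
* Model including the sex-by-age interaction terms
xi: glm cvd i.sex*i.age, family(poisson) link(log) exposure(py)
lrtest model1 .
* Male vs. female, <46 years (reference age group)
lincom _Isex_1, irr
* Male vs. female, 46-55 years
lincom _Isex_1 + _IsexXage_1_2, irr
```

The same pattern extends to the remaining age groups by swapping in _IsexXage_1_3, _IsexXage_1_4, or _IsexXage_1_5 as the interaction term.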
9.7 Overdispersion
Using the Poisson regression model, we can also assess the goodness of fit of the
model, as was shown in the previous chapter. For example, if we run the model
using only the predictor sex (in the previous database) as follows:
AIC = 62.15084
Log likelihood = -308.7542 BIC = 536.6771
------------------------------------------------------------------------------
| OIM
cvd | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+----------------------------------------------------------------
_Isex_1 | .6055324 .0524741 11.54 0.000 .5026851 .7083797
cons |-4.554249 .0392232 -116.11 0.000 -4.631125 -4.477373
ln(py) | 1 (exposure)
------------------------------------------------------------------------------
1.05e-114
The results show a highly significant difference between the deviance and its degrees of freedom (P-value < .001). Therefore, the Poisson regression model using only the predictor sex is not adequate; including more predictors, exploring another type of model, or assessing the potential correlation between adjacent age groups is called for.
For more discussion on this topic, we recommend consulting the books by Cameron and Trivedi (1998), Hilbe (2007), Hoffmann (2004), and Kleinbaum et al. (2008).
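One way to carry out this check without copying numbers by hand is to read the deviance and residual degrees of freedom from the results that glm leaves behind in e(deviance) and e(df). The sketch below assumes the grouped data from the example and the exposure(py) option; it is illustrative rather than a reproduction of the commands used in the text.

```stata
* Sketch: goodness of fit of the sex-only Poisson model
clear
input age sex py cvd
1 1  7370  43
1 0  9205   9
2 1 12649 163
2 0 16708  71
3 1  7184 155
3 0 10139 105
4 1 15015 443
4 0 24338 415
5 1   470  19
5 0  1383  50
end
xi: glm cvd i.sex, family(poisson) link(log) exposure(py)
* Deviance and residual df stored by glm
display "Deviance = " e(deviance) "   residual df = " e(df)
* Upper-tail chi-squared probability for the deviance
display "P-value  = " chi2tail(e(df), e(deviance))
```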
Chapter 10
Survival Analysis
Aim: Upon completing the chapter, the learner should be able to use
the Cox proportional hazards model to estimate the magnitude of the
association between the risk of the occurrence of a given clinical event
(e.g., disease, death, remission) after a certain period of time and a fac-
tor of exposure, controlling for potential confounders.
10.1 Introduction
In this chapter we present the use of a regression model to analyze the occurrence
of an event after a certain time. This analysis is regularly identified as a survival
analysis or a time to event analysis (Kleinbaum and Klein, 2005). The objective in
survival analysis is to assess the time it takes for an event of interest to occur when
there is the possibility that this event will not occur in all subjects under study.
Take the following examples:
For the analysis of survival times, it is necessary to identify a start date for partici-
pation in the study. Some possible start dates are date of birth, date of diagnosis,
therapy start date, and date or time of an exposure to a toxin. In addition, it is also
necessary to identify the date on which the event of interest occurs or the date of
study completion.
The study time in survival analysis is determined by the difference between the
date of the occurrence of the event of interest and the start date of the study:
T = date of occurrence of event under study – start date
where T can be measured in days, months, years, or some other time unit.
The use of survival analysis is justified when there is a possibility that an event
of interest will not occur in a high number of subjects during a given study period,
meaning that there will be a high number of individuals with incomplete infor-
mation. The time of occurrence of the event of interest (T ) cannot be exactly
determined when the event does not happen; only the minimum survival time (t)
(in which the event of interest does not occur in the individual) can be determined.
Therefore, the formulation of a study problem in survival analysis with the event’s
date or time of occurrence being unknown is given by the following expression:
T ≥t
Censoring information may arise in the following situations:
When the study time of a subject has not been determined, we use the term censored. If the date of the event's occurrence is unknown, the incomplete data in the survival analysis are said to be right censored. Censored observations can introduce selection bias unless it can be assured that censored individuals are representative of the study population. Therefore, censoring has to be independent of t.
A survival analysis involves a longitudinal design in which there is a recruitment
period and a maximum date of observation, as illustrated in the following:
[Illustration: time of study, showing the recruitment period followed by the last date for observation]
Recruitment time comprises a fixed period of time during which the initial mea-
surement of the study subjects for survival analysis is performed. The maximum
date of observation indicates the last day or the specific time to observe the occur-
rence of an event. The possible situations that may occur while observing the event
of interest are illustrated in Figure 10.1.
Situation A. The occurrence of the event after the completion of the study (censored).
Situation B. The occurrence of the event before the completion of the study.
Known time
S ( t ) = Pr [T > t ]
S(t) is defined as the survival function and indicates the probability of remaining free of the event of interest beyond time t; that is, the probability of the event occurring after t.
[Illustration: follow-up of Subjects 1 to 7 on a time axis from 0 to 12 months]
In the above illustration, the following patterns occur: (1) the recruitment date is different for every subject; (2) the date of the occurrence of the study event is also different for each subject; (3) the last date of observation is the same for all subjects, 12 months; however, not every subject has experienced the event by this point. Usually, this type of information can be summarized in the following table:
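Subject   Start (month)   End (month)   Status   Time in study (months)
1         0               6             D        6
2         1               12            C        11
3         2               3             C        1
4         1               6             C        5
5         0               4             D        4
6         4               6             D        2
7         0               11            D        11
Status: D = event (death) observed; C = censored.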
When there are censored observations, there are several nonparametric methods for
estimating S(t). The life-table estimate of the survivor function, also known as the
actuarial estimate of survivor function, assumes that the censoring process is such
that the censored survival times occur uniformly within different series of time
intervals. Another method for estimating the survival function, S(t), is through the
Kaplan‒Meier (KM) method. This method determines the probability of surviving
at least to time t(j). Time t(j) indicates the times at which one or more events have
occurred and is sorted in ascending order:
\[
t_{(1)} < t_{(2)} < t_{(3)} < \cdots
\]
where t(1) is the earliest time at which an event occurred. The KM estimate at t(j) is obtained as
\[
S(t_{(j)}) = S(t_{(j-1)}) \times \Pr[T > t_{(j)} \mid T \ge t_{(j)}]
\]
where Pr[T > t(j)|T ≥ t(j)] indicates the probability, among those persons who reached t(j) alive, of remaining alive after that specific time,
◾ t(j) indicates the jth ordered time at which at least one event occurs, after the data are ordered from least to greatest
◾ S(t(j−1)) is the survival function evaluated at time t(j−1)
The development of the previous expression with the data from the previous exam-
ple is presented in the following table:
t(j)     rj (a)   fj (b)   Pr[T > t(j)|T ≥ t(j)] (c)   S(t(j))
0        7        0        1 − (0/7) = 1                1.0
1 (d)    –        –        –                            –
2        6        1        1 − (1/6) = 0.833            0.833
4        5        1        1 − (1/5) = 0.8              0.666
5 (d)    –        –        –                            –
6        3        1        1 − (1/3) = 0.667            0.44
11 (d)   –        –        –                            –
11       1        1        1 − (1/1) = 0                0
(a) rj indicates the subjects that are at risk an instant before t(j).
(b) fj indicates the number of deaths at time t(j).
(c) Pr[T > t(j)|T ≥ t(j)] = 1 − (fj/rj).
(d) Censored cases.
S(t(j)) is usually represented graphically as a step function; that is, the estimated probability S(t(j)) remains constant until the time when the next event of interest occurs.
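The seven-subject example can be reproduced in Stata as a check on the hand calculation. In this sketch the variable names id, time, and died are illustrative, with died = 1 for an observed event and 0 for a censored subject.

```stata
* Sketch: Kaplan-Meier estimate for the seven-subject example
clear
input id time died
1  6 1
2 11 0
3  1 0
4  5 0
5  4 1
6  2 1
7 11 1
end
stset time, failure(died==1)
sts list                      // at-risk counts, failures, survivor function
sts generate km = s           // KM estimate attached to each observation
sort time
list id time died km, noobs
```

One point to keep in mind when comparing with a hand table: when a censoring time is tied with an event time, Stata treats the censored subject as still at risk at that time, so at t = 11 it counts two subjects at risk rather than one.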
id death sex stage time id death sex stage time
1 1 2 2 34 46 0 2 1 28
2 1 2 2 61 47 1 1 2 39
3 0 2 1 78 48 1 1 2 8
4 0 1 2 95 49 1 2 1 35
5 1 2 2 49 50 0 1 1 5
6 1 2 2 59 51 0 1 1 45
7 0 2 2 2 52 1 2 2 0
8 0 2 2 1 53 0 2 2 17
9 1 1 3 6 54 0 2 1 0
10 0 1 1 53 55 0 1 2 1
11 1 2 2 8 56 0 2 1 39
12 0 1 2 21 57 1 2 1 9
13 1 2 3 71 58 1 1 2 5
14 1 1 3 47 59 1 2 2 2
15 0 1 2 35 60 0 1 2 38
16 0 1 1 1 61 1 2 1 41
17 1 2 3 10 62 0 2 2 6
18 0 1 2 7 63 1 2 1 28
19 0 2 2 1 64 0 2 1 60
20 1 1 2 27 65 0 2 2 1
21 0 1 2 34 66 1 2 1 81
22 1 2 3 10 67 0 2 1 2
23 1 2 2 43 68 0 1 1 5
24 0 1 2 84 69 0 1 2 2
25 1 2 2 89 70 0 2 2 44
26 1 2 2 6 71 1 2 3 8
27 0 2 2 4 72 1 1 3 7
28 0 2 2 0 73 0 1 2 5
29 0 2 1 22 74 0 2 2 3
30 1 2 3 1 75 1 2 1 3
31 1 2 1 23 76 1 2 2 2
32 1 2 3 37 77 1 1 2 55
33 1 2 1 25 78 1 2 1 46
34 0 2 1 0 79 0 2 1 70
35 0 2 3 1 80 1 2 1 39
36 1 2 2 39 81 1 1 1 99
37 1 2 1 20 82 0 1 2 5
38 0 1 2 48 83 1 1 2 52
39 1 1 2 20 84 1 2 3 12
40 1 2 2 6 85 0 2 1 39
41 0 2 1 44 86 1 1 1 40
42 1 2 3 13 87 0 1 2 73
43 0 1 3 1
44 0 2 2 50
45 0 2 1 0
To run a survival analysis in Stata, we have to specify the name of the variable that
defines the time and the variable that defines the event with the code to be used for
the occurrence of the event, as follows:
stset time, fa(death==1)
Output
-----------------------------------------------------------------------------
87 total observations
5 observations end on or before enter()
-----------------------------------------------------------------------------
82 observations remaining, representing
43 failures in single-record/single-failure data
2385 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 99
The time-of-survival variable is specified first; in this case, it is named time. After the comma, the occurrence of the event of interest is indicated with the failure() option (abbreviated fa), whose parentheses contain the variable for the event of interest and the code that indicates when the event occurs.
An estimation of survival probability is obtained in Stata with the ltable com-
mand, as is demonstrated in the following:
ltable time
Output
Beg. Std.
Interval Total Deaths Lost Survival Error [95% Conf. Int.]
-------------------------------------------------------------------------------
0 1 87 5 0 0.9425 0.0250 0.8674 0.9757
1 2 82 8 0 0.8506 0.0382 0.7566 0.9104
2 3 74 5 0 0.7931 0.0434 0.6919 0.8642
3 4 69 2 0 0.7701 0.0451 0.6667 0.8451
4 5 67 1 0 0.7586 0.0459 0.6542 0.8354
5 6 66 5 0 0.7011 0.0491 0.5930 0.7856
6 7 61 4 0 0.6552 0.0510 0.5453 0.7446
7 8 57 2 0 0.6322 0.0517 0.5218 0.7238
8 9 55 3 0 0.5977 0.0526 0.4870 0.6920
9 10 52 1 0 0.5862 0.0528 0.4755 0.6813
10 11 51 2 0 0.5632 0.0532 0.4527 0.6597
12 13 49 1 0 0.5517 0.0533 0.4414 0.6489
13 14 48 1 0 0.5402 0.0534 0.4302 0.6380
17 18 47 1 0 0.5287 0.0535 0.4190 0.6270
20 21 46 2 0 0.5057 0.0536 0.3967 0.6049
21 22 44 1 0 0.4943 0.0536 0.3857 0.5938
22 23 43 1 0 0.4828 0.0536 0.3747 0.5826
23 24 42 1 0 0.4713 0.0535 0.3637 0.5714
25 26 41 1 0 0.4598 0.0534 0.3529 0.5601
27 28 40 1 0 0.4483 0.0533 0.3420 0.5488
28 29 39 2 0 0.4253 0.0530 0.3205 0.5260
34 35 37 2 0 0.4023 0.0526 0.2993 0.5029
35 36 35 2 0 0.3793 0.0520 0.2783 0.4797
37 38 33 1 0 0.3678 0.0517 0.2678 0.4680
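Note: The command ltable is followed by the variable that indicates the observation time. When no failure variable is given after the time variable, as in this example, ltable treats every observation as a death, which is why the Lost column contains only zeros; adding the event indicator (e.g., ltable time death) makes ltable count censored observations as lost.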
After running the stset command, the sts graph command can be used in Stata
to construct the S(t) graph using the KM method, as illustrated in Figure 10.3.
where τj indicates the size of the interval (t(j), t(j+1)), that is, τj = t(j+1) − t(j). According to
the time unit that is used, the product rj *τj indicates person-time (i.e., person-years,
[Figure 10.3: KM estimate of S(t) versus time in months]
Output
To graphically represent h(t) through the KM method, the same command for S(t)
is used, but we add the hazard option (sts graph, hazard ). The output of this com-
mand is illustrated in Figure 10.4.
[Figure 10.4: estimate of h(t) versus time in months]
It has been shown that there is a mathematical relationship between these survival
and hazard functions, the specifics of which are as follows (Collett, 2003):
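a. \( f(t) = \dfrac{dF(t)}{dt} = \dfrac{d\,(1 - S(t))}{dt} = -S'(t) \)
b. \( h(t) = \dfrac{f(t)}{S(t)} = -\dfrac{S'(t)}{S(t)} \)
c. \( H(t) = \displaystyle\int_0^t h(u)\,du = \int_0^t \frac{f(u)}{S(u)}\,du = -\int_0^t \frac{S'(u)}{S(u)}\,du = -\ln(S(t)) \)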
where S′(t) indicates the derivative of S(t) and H(t), the cumulative hazard function.
1. The KM method:
\[
\hat{H}(t) = -\ln \hat{S}(t) = -\ln \prod_{t_{(i)} \le t} \frac{r_i - f_i}{r_i} = -\sum_{t_{(i)} \le t} \ln\left(1 - \frac{f_i}{r_i}\right)
\]
2. The Nelson–Aalen method:
\[
\hat{H}(t) = \sum_{t_{(i)} \le t} \frac{f_i}{r_i}
\]
[Figure 10.5: cumulative hazard H(t) versus time in months]
The estimate of H(t) using the Nelson–Aalen method will always be less than or equal to that which is generated using the KM method, because fi/ri ≤ −ln(1 − fi/ri) for every term of the sums. When the number of subjects at risk at any given time is large, the two estimates are basically equal. Based on the Nelson–Aalen method, we can obtain the survival function with the following expression:
\[
\hat{S}(t) = e^{-\hat{H}(t)}
\]
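To see how the two cumulative hazard estimates relate in practice, both can be generated with sts generate and listed side by side. The sketch below reuses the seven-subject example with illustrative variable names (died, km, na, h_km, s_na); it is meant as a small numerical illustration, not as the command sequence used in the text.

```stata
* Sketch: KM-based and Nelson-Aalen cumulative hazards side by side
clear
input id time died
1  6 1
2 11 0
3  1 0
4  5 0
5  4 1
6  2 1
7 11 1
end
stset time, failure(died==1)
sts generate km = s          // Kaplan-Meier survivor function
sts generate na = na         // Nelson-Aalen cumulative hazard
generate h_km = -ln(km)      // cumulative hazard implied by KM
generate s_na = exp(-na)     // survivor function implied by Nelson-Aalen
sort time
* Since f/r <= -ln(1 - f/r), na should not exceed h_km
list time died h_km na km s_na, noobs
sts graph, cumhaz            // plot of the Nelson-Aalen cumulative hazard
```

With only seven subjects the two columns differ visibly; with large risk sets each fi/ri is small and the difference becomes negligible.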
The graphic representation of the cumulative hazard [H(t)] using the Nelson–Aalen
method is programmed in Stata using the following command to create Figure 10.5:
stci, median
Output
failure _d: death == 1
analysis time _t: time
| no. of
| subjects 50% Std. Err. [95% Conf. Interval]
-----------+---------------------------------------------------------------
total | 82 41 3.705979 34 55
stci, p(25)
Output
| no. of
| subjects 25% Std. Err. [95% Conf. Interval]
-----------+--------------------------------------------------------------
total | 82 13 6.687969 8 28
stci, p(75)
Output
| no. of
| subjects 75% Std. Err. [95% Conf. Interval]
-----------+---------------------------------------------------------------
total | 82 71 12.37697 52 .
Note: In the previous case, the 25th percentile indicates the minimum time at which the survival probability is at or below 75%. The 75th percentile indicates the minimum time at which the survival probability is at or below 25%.
[Figure: KM estimates of S(t) by sex (Sex = 1, Sex = 2) versus time in months]
To obtain the median survival time in different subgroups, the option by can also be used after the stci command. For example, the command line for finding the median time by sex would be
stci, by(sex)
Output
| no. of
sex | subjects 50% Std. Err. [95% Conf. Interval]
----------+----------------------------------------------------------------
1 | 31 52 5.187957 27 .
2 | 51 39 3.999146 23 49
----------+----------------------------------------------------------------
total | 82 41 3.705979 34 55
[Figure: curves by sex (Sex = 1, Sex = 2) versus time in months; vertical axis labeled S(t), with values from 0.01 to 0.05]
stphplot, by(sex)
[Figure: −ln[−ln(survival probability)] versus ln(analysis time), by sex (Sex = 1, Sex = 2)]
H 0 : S1 ( t ) = S2 ( t )
To evaluate the survival curves with the log-rank test, the following contingency table is constructed at each time t(j):
Group    Events   Nonevents      At risk
I        fIj      rIj − fIj      rIj
II       fIIj     rIIj − fIIj    rIIj
Total    fj       rj − fj        rj
Under the null hypothesis of no association between the type of group and the occurrence of the event, we can determine the expected events in each group and compare them with the observed events using the following statistic:
\[
\frac{U_L^2}{V_L} \sim \chi^2_{(1)}
\]
where:
\( U_L = \sum_j w_j (f_{Ij} - e_{Ij}) \) defines the weighted difference (observed events minus expected events, \( E(f_{Ij}) = e_{Ij} \) under H0) with equal weights (\( w_j = 1 \)) over time
\( V_L = \sum_j v_{Ij} \) determines the sum of the variances under the hypergeometric distribution, \( v_{Ij} = Var(f_{Ij}) = \dfrac{r_{Ij}\, r_{IIj}\, f_j\, (r_j - f_j)}{r_j^2\, (r_j - 1)} \)
Output
H0: S1(t) = S2(t)
\[
\frac{U_{WGB}^2}{V_{WGB}} \sim \chi^2_{(1)}
\]
where:
\( U_{WGB} = \sum_j w_j (f_{Ij} - e_{Ij}) \), with \( w_j = r_j \)
\( V_{WGB} = \sum_j w_j^2\, v_{Ij} \)
The difference between the UWGB statistic and the statistic of the log-rank test is that (fIj − eIj) is weighted by rj. As time increases, rj decreases; therefore, events occurring at very long observation times will have less weight. The Stata command for this test is
Output
According to the Wilcoxon test, there is insufficient evidence to reject H0: S1(t) = S2(t) (P-value > .10).
Output
According to the Tarone‒Ware test, there is insufficient evidence to reject H0: S1(t) = S2(t) (P-value > .10).
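The three tests differ only in the weights given to each event time, and in Stata they are requested through options of sts test. The sketch below is an assumption-laden illustration: it uses cancer.dta, an example dataset shipped with Stata (not the data analyzed in the text), and creates an illustrative two-group variable called treated.

```stata
* Sketch: log-rank, Wilcoxon, and Tarone-Ware tests for two groups
sysuse cancer, clear                  // example data shipped with Stata
generate byte treated = drug > 1      // 0 = placebo, 1 = active drug
stset studytime, failure(died==1)
sts test treated                      // log-rank test (weights w_j = 1)
sts test treated, wilcoxon            // Wilcoxon (Breslow-Gehan), w_j = r_j
sts test treated, tware               // Tarone-Ware, w_j = sqrt(r_j)
```

Because the Wilcoxon and Tarone–Ware weights shrink as the risk set shrinks, these tests are more sensitive to early differences between the curves than the log-rank test.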
\[
h(t, X) = h_0(t) \times e^{\beta_E E + \sum \beta_i X_i}
\]
where:
E defines the exposure variable of interest
Xi defines the group of independent or predictor variables (it includes potential
confounding variables and interaction terms)
βi defines the group of coefficients of the predictor variables
h0(t) defines the instantaneous risk at time t (the baseline hazard). This function depends on time and indicates the risk at initial conditions (E = 0, Xi = 0) or at average conditions when the predictor variables are centered (E = 0, Xi − X̄i = 0)
The predictor variables can be time dependent (e.g., age, blood pressure), but in this
book, we are analyzing only those variables that are not time dependent. One of the
most important uses of this model in epidemiologic studies is to estimate the hazard
ratio (HR) adjusted for potential confounding variables. For example, assuming that
the hazard between two persons having different exposure levels is obtained by the
Cox model without interaction terms, the HR is estimated as follows:
\[
h_1(t; x) = h_0(t) \times e^{\beta_E + \beta_1 X_1 + \cdots + \beta_k X_k}
\]
\[
h_2(t; x') = h_0(t) \times e^{\beta_1 X_1' + \cdots + \beta_k X_k'}
\]
as a consequence,
\[
HR = \frac{h_1(t; x)}{h_2(t; x')}
\]
If we assume that the difference between both persons is only the exposure, then:
\[
(X_1 = X_1'),\ (X_2 = X_2'),\ \ldots,\ (X_k = X_k')
\]
\[
HR_{adjusted} = e^{\beta_E}
\]
During the HR estimation, the h0(t) function is canceled. We assume that the HR
remains constant over time; therefore, the HR varies only according to the value of
the predictor variables. This process of obtaining the adjusted HR is similar to that
used in evaluating the adjusted OR and RR of the logistic and Poisson regression
models, respectively.
The Stata commands to evaluate the interaction terms in the Cox regression model are as follows:
Output
The results show that there is no evidence of any significant interaction terms in
the Cox model (P-value > .10). Now, to assess the effect of potential confounding
variables, we need to compare the crude and adjusted HRs. To estimate the crude
HR, stcox is used, as can be seen in the following:
stcox b1.sex
Output
------------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------+---------------------------------------------------------------------
2.sex | 1.906899 0.6747282 1.82 0.068 0.9531086 3.815161
------------------------------------------------------------------------------
Note: The use of b1 before the predictor is to indicate that the category with a code
equal to 1 is the reference category.
The HR between sexes adjusted for stage is estimated with the following Stata
command:
stcox b1.sex b1.stage
Output
------------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
2.sex | 1.894716 0.690035 1.75 0.079 0.9279952 3.868499
|
stage |
2 | 1.472089 0.5575111 1.02 0.307 0.7007552 3.092446
3 | 3.766573 1.596584 3.13 0.002 1.641108 8.644817
------------------------------------------------------------------------------
This example indicates that tumor stage is not a confounding variable in the
association of sex and cancer mortality because the difference between the estimated
$HR_{crude}$ (1.91) and the estimated $HR_{adjusted}$ (1.89) is very small.
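The size of this difference can be expressed as a relative change. The arithmetic below uses the two estimates from the outputs above; the 10% threshold mentioned afterward is a common rule of thumb and an assumption here, not a criterion stated in the text.

```latex
\[
\frac{\lvert HR_{crude} - HR_{adjusted} \rvert}{HR_{adjusted}} \times 100
  = \frac{\lvert 1.907 - 1.895 \rvert}{1.895} \times 100
  \approx 0.6\%
\]
```

A change this far below 10% supports the conclusion that adjusting for stage barely alters the estimated HR for sex.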
$h(t; x_1) = h_0(t)\, e^{\beta_E E + \beta_X X}$
where:
$X = E \cdot \ln(\text{time})$
If the PH condition is met, then $\beta_X = 0$. Therefore, when the PH assumption
holds, the interaction variable X is expected not to be statistically significant.
There are several methods for assessing the PH assumption, most of which are based
on quantities known as residuals. A description of these methods can be found in
Collett (2003). Stata uses Schoenfeld residuals, or partial residuals, as its method
for assessing PH. This test is requested with the postestimation command
estat phtest after running stcox, as is demonstrated in the following:
Output
Time: Time
------------------------------------------------------------
| rho chi2 df Prob>chi2
------------+-----------------------------------------------
1b.sex | . . 1 .
2.sex | 0.10298 0.46 1 0.4954
1b.stage | . . 1 .
2.stage | –0.03693 0.06 1 0.8142
3.stage | –0.09269 0.33 1 0.5671
------------+-----------------------------------------------
global test | 0.80 3 0.8483
------------------------------------------------------------
The data suggest that the condition of PH is fulfilled for both predictors simultane-
ously (P-value > .10).
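Both checks described in this section can be sketched as follows: a time-by-exposure interaction fitted through the tvc() and texp() options of stcox, and the Schoenfeld-residual test from estat phtest. The data are simulated, and the variable names (including the indicator sex2) are hypothetical.

```stata
* Simulated data for illustration only; no value here comes from the book
clear
set seed 2016
set obs 300
generate sex   = 1 + (runiform() < 0.5)       // coded 1, 2
generate stage = 1 + floor(3 * runiform())    // coded 1, 2, 3
generate time  = -ln(runiform()) * 40 / exp(0.6 * (sex == 2) + 0.5 * (stage - 1))
generate died  = runiform() < 0.8             // 1 = death, 0 = censored
stset time, failure(died)

* Check 1: interaction of the exposure with ln(time)
generate byte sex2 = (sex == 2)               // exposure indicator E
stcox sex2 ib1.stage, tvc(sex2) texp(ln(_t))

* Check 2: test based on Schoenfeld residuals, one row per coefficient
stcox ib1.sex ib1.stage
estat phtest, detail
```

In the first model, the coefficient reported in the tvc equation plays the role of the interaction coefficient in the expression above; in the second, the global test has one degree of freedom per estimated coefficient.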
where:
$g(E, x, \beta) = e^{\beta_E E + \sum \beta_i X_i}$
$S_0(t) = e^{-H_0(t)}$
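The two definitions above combine into the standard expression for the survival function under the Cox model. The short derivation below is a sketch that assumes the cumulative hazard is proportional in the same way as the hazard, $H(t \mid E, x) = H_0(t)\, g(E, x, \beta)$.

```latex
\[
S(t \mid E, x) = e^{-H(t \mid E, x)}
              = e^{-H_0(t)\, g(E, x, \beta)}
              = \left[ S_0(t) \right]^{g(E, x, \beta)}
\]
```

Each covariate pattern therefore raises the baseline survival curve to a different power, which is what the plotted curves by sex display.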
Therefore, to visualize the survival curves by sex at stage 3 (stage = 3) after running
the Cox model, the following sequence of commands is used to create Figure 10.9:
[Figure 10.9: Estimated survival curves by sex at stage 3 from the Cox model. y-axis: Survival (0.0 to 1.0); x-axis: Analysis time (0 to 100); curves: Sex = 1, Stage = 3 and Sex = 2, Stage = 3]
$h_g(t, X) = h_{0g}(t)\, e^{\beta_E E + \sum \beta_i X_i}$
Output
A slight variation is observed between the HR stratified by stage ($HR_{stratified}$: 1.82;
95% CI: 0.89, 3.74) and the adjusted HR ($HR_{adjusted}$: 1.89; 95% CI: 0.93, 3.87).
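A stratified Cox model of this kind can be requested with the strata() option of stcox, which gives each stage its own baseline hazard while estimating a common HR for sex. The sketch below uses simulated data with hypothetical variable names, so its estimates will not match the values quoted above.

```stata
* Simulated data for illustration only; no value here comes from the book
clear
set seed 2016
set obs 300
generate sex   = 1 + (runiform() < 0.5)       // coded 1, 2
generate stage = 1 + floor(3 * runiform())    // coded 1, 2, 3
generate time  = -ln(runiform()) * 40 / exp(0.6 * (sex == 2) + 0.5 * (stage - 1))
generate died  = runiform() < 0.8             // 1 = death, 0 = censored
stset time, failure(died)

* HR for sex with a separate baseline hazard in each stage
stcox ib1.sex, strata(stage)

* HR for sex adjusted for stage as a covariate, for comparison
stcox ib1.sex ib1.stage
```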
There are other applications of survival analysis that can be explored in Stata,
including time-dependent predictors, competing risks regression, parametric sur-
vival models, and multilevel parametric regression. These topics are beyond the
scope of this book, but an extensive review of survival analysis can be found in
Collett (2003), Peace (2009), Royston and Lambert (2011), and Wienke (2011).
Chapter 11
Analysis of Correlated Data
Aim: Upon completing the chapter, the learner should be able to fit a
linear regression with correlated data.
+------------------------------------------+
| id weight1 weight2 weight3 sport |
|------------------------------------------|
1. | 1 66 67 68 0 |
2. | 2 71 71 65 0 |
3. | 3 70 66 62 0 |
4. | 4 64 62 66 1 |
5. | 5 67 66 68 1 |
|------------------------------------------|
6. | 6 65 64 65 1 |
7. | 7 67 67 63 0 |
8. | 8 65 66 66 1 |
9. | 9 69 70 68 0 |
10. | 10 63 62 63 1 |
|------------------------------------------|
11. | 11 61 60 60 1 |
12. | 12 66 68 68 1 |
13. | 13 68 68 70 0 |
14. | 14 67 69 65 0 |
15. | 15 65 67 63 1 |
|------------------------------------------|
16. | 16 64 62 64 1 |
17. | 17 65 64 55 0 |
18. | 18 65 65 66 0 |
19. | 19 64 63 62 1 |
20. | 20 67 66 63 0 |
+------------------------------------------+
To perform the analysis with independent measurements, assuming that the objective
is to compare the average weight by type of sport, we could carry out a simple linear
regression analysis of the weight of each child, using the following command lines at
each visit:
For the first visit:
Output
------------------------------------------------------------------------------
weight1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+-----------------------------------------------------------
sport | -3.1 .8225975 -3.77 0.001 -4.828213 -1.371787
_cons | 67.5 .5816643 116.05 0.000 66.27797 68.72203
------------------------------------------------------------------------------
The results show a significant effect of the predictor sport on mean weight at visit 1
(P-value = .001). The estimated regression coefficient for the predictor sport is the
difference between the mean weights by sport in visit 1. The following command
line can be used to compute the observed mean weight by sport:
Output
------------------------
sport | mean(weight1)
---------+--------------
0 | 67.5
1 | 64.4
------------------------
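The visit 1 analysis can be reproduced from the listing at the start of this chapter. The sketch below enters the 20 children with input, fits the regression, and tabulates the group means; tabstat is used here as a version-independent alternative for the table of means, which is a substitution and not the command shown in the text.

```stata
clear
input id weight1 weight2 weight3 sport
 1 66 67 68 0
 2 71 71 65 0
 3 70 66 62 0
 4 64 62 66 1
 5 67 66 68 1
 6 65 64 65 1
 7 67 67 63 0
 8 65 66 66 1
 9 69 70 68 0
10 63 62 63 1
11 61 60 60 1
12 66 68 68 1
13 68 68 70 0
14 67 69 65 0
15 65 67 63 1
16 64 62 64 1
17 65 64 55 0
18 65 65 66 0
19 64 63 62 1
20 67 66 63 0
end

* Simple linear regression of weight at visit 1 on sport
regress weight1 sport

* Observed mean weight at visit 1 in each sport group
tabstat weight1, by(sport) statistics(mean)
```

The coefficient for sport equals the difference between the two group means, and the constant equals the mean of the group coded 0.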
The difference in the mean weights at visit 1 is −3.1, so the children who
regularly practice a sport weigh less, on average, than those who do not.
To explore the differences in mean weight at each visit, a line can be drawn
between the estimated weights from a linear regression model by sport. For
example, for the first visit the following Stata commands can be used to create
Figure 11.2:
predict weight1exp
twoway (line weight1exp sport, sort), ytitle(Mean weight) ///
    xtitle(Sport) xlabel(0(1)1) legend(off)
Output
------------------------------------------------------------------------------
weight2 |Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+-----------------------------------------------------------
sport |-3.3 1.085766 -3.04 0.007 -5.581111 -1.018889
_cons |67.3 .7677529 87.66 0.000 65.68701 68.91299
------------------------------------------------------------------------------
[Figure 11.2: Estimated mean weight by sport at visit 1. x-axis: Sport (0, 1)]
Output
-------------------------
sport | mean(weight2)
---------+---------------
0 | 67.3
1 | 64
-------------------------
The results also show a significant effect of the predictor sport on mean weight at
visit 2 (P-value = .007). The difference in the mean weights at the second visit
is −3.3; children who regularly practice a sport weigh less, on average, than those
who do not. To draw the estimated weight by sport at visit 2, the following Stata
commands are used to create Figure 11.3:
predict weight2exp
twoway (line weight2exp sport, sort), ytitle(Mean weight) ///
    xtitle(Sport) xlabel(0(1)1) legend(off)
[Figure 11.3: Estimated mean weight by sport at visit 2. y-axis: Mean weight (64 to 68); x-axis: Sport (0, 1)]
Output
Source | SS df MS Number of obs = 20
------------+---------------------- F(1, 18) = 0.00
Model | 0 1 0 Prob > F = 1.0000
Residual | 219 18 12.1666667 R-squared = 0.0000
------------+---------------------- Adj R-squared = -0.0556
Total | 219 19 11.5263158 Root MSE = 3.4881
------------------------------------------------------------------------------
weight3 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+---------------------------------------------------------
sport | 0 1.559915 0.00 1.000 -3.277259 3.277259
_cons | 64.5 1.103026 58.48 0.000 62.18263 66.81737
------------------------------------------------------------------------------
table sport, c(mean weight3)
Output
-------------------------
sport | mean(weight3)
---------+---------------
0 | 64.5
1 | 64.5
-------------------------
The results do not show a significant effect of the predictor sport on mean weight at
visit 3 (P-value > .1). There is no difference in the mean weights at the third visit.
To draw the estimated weight by sport on the third visit, the following Stata com-
mands are used to create Figure 11.4:
predict weight3exp
twoway (line weight3exp sport, sort), ytitle(Mean weight) ///
    xtitle(Sport) xlabel(0(1)1) legend(off)
[Figure 11.4: Estimated mean weight by sport at visit 3. y-axis: Mean weight (63.5 to 65.5); x-axis: Sport (0, 1)]
Level 2 (subjects)
Level 1 (visit)
$y_{ij}$ indicates the value of Y in the jth visit for the ith subject.
Level 1 refers to the set of weights of each subject on different visits.
Level 2 refers to the set of subjects.
Mixed models can be expressed in various forms to explain the expected value
of the main outcome (Y ) of the study. The construction of these models depends on
the following variations:
Intercept    Slope
Fixed        Fixed
Fixed        Random
Random       Fixed
Random       Random
Based on the previous example of the estimated weights by sport, we need to
identify the possible patterns of the linear relationships. Considering only the
first two visits, the lines in the previous graphs have similar slopes but different
intercepts, which suggests a model with a random intercept; that is, the difference
in mean weight between those who regularly practice a sport and those who do
not is independent of the visit. However, if all three visits are considered, a model
with a random intercept and a random slope is suggested, because the difference in
mean weight between those who practice a sport and those who do not is not
independent of the visit.
where:
$\beta_{0i} = \gamma_{00} + U_{0i}$ (random coefficient)
$\beta_1$ indicates the (fixed) coefficient associated with the predictor X
$U_{0i} \sim N(0, \sigma_{U_0}^2)$
$\gamma_{00}$ indicates the average intercept in a design with two levels
$\sigma_{U_0}^2$ indicates the variance of the intercept between subjects
The variance of $Y_{ij}$ conditional on the value of $X_{ij}$ is given by the following expression:
where $\sigma_{\varepsilon}^2$ indicates the variance of Y within the subjects (variance of the residuals).
Using the data of the previous example, the correlation between two different
visits ($j \neq j'$) of the ith subject is calculated through the covariance
expression, as follows:
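For the random-intercept model, these quantities take a standard form. The expressions below are a sketch reconstructed from the definitions in this section, under the usual assumption (stated here, not in the text) that $U_{0i}$ and the residual $\varepsilon_{ij}$ are independent.

```latex
\[
\operatorname{Var}(Y_{ij} \mid X_{ij}) = \sigma_{U_0}^2 + \sigma_{\varepsilon}^2
\]
\[
\operatorname{Cov}(Y_{ij}, Y_{ij'} \mid X_{ij}, X_{ij'}) = \sigma_{U_0}^2 ,
\qquad j \neq j'
\]
\[
\rho = \frac{\sigma_{U_0}^2}{\sigma_{U_0}^2 + \sigma_{\varepsilon}^2}
\]
```

The ratio $\rho$ is the intraclass correlation coefficient: the share of the total variance that is due to differences between subjects.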
+-----------------------------+
| id visit weight sport |
|-----------------------------|
1. | 1 1 66 0 |
2. | 1 2 67 0 |
3. | 1 3 68 0 |
4. | 2 1 71 0 |
5. | 2 2 71 0 |
|-----------------------------|
6. | 2 3 65 0 |
7. | 3 1 70 0 |
8. | 3 2 66 0 |
9. | 3 3 62 0 |
10. | 4 1 64 1 |
|-----------------------------|
11. | 4 2 62 1 |
12. | 4 3 66 1 |
13. | 5 1 67 1 |
14. | 5 2 66 1 |
15. | 5 3 68 1 |
|-----------------------------|
16. | 6 1 65 1 |
17. | 6 2 64 1 |
18. | 6 3 65 1 |
19. | 7 1 67 0 |
20. | 7 2 67 0 |
|-----------------------------|
21. | 7 3 63 0 |
22. | 8 1 65 1 |
23. | 8 2 66 1 |
24. | 8 3 66 1 |
25. | 9 1 69 0 |
|-----------------------------|
26. | 9 2 70 0 |
27. | 9 3 68 0 |
28. | 10 1 63 1 |
29. | 10 2 62 1 |
30. | 10 3 63 1 |
|-----------------------------|
31. | 11 1 61 1 |
32. | 11 2 60 1 |
33. | 11 3 60 1 |
34. | 12 1 66 1 |
35. | 12 2 68 1 |
|-----------------------------|
36. | 12 3 68 1 |
37. | 13 1 68 0 |
38. | 13 2 68 0 |
39. | 13 3 70 0 |
40. | 14 1 67 0 |
|-----------------------------|
41. | 14 2 69 0 |
42. | 14 3 65 0 |
43. | 15 1 65 1 |
44. | 15 2 67 1 |
45. | 15 3 63 1 |
|-----------------------------|
46. | 16 1 64 1 |
47. | 16 2 62 1 |
48. | 16 3 64 1 |
49. | 17 1 65 0 |
50. | 17 2 64 0 |
|-----------------------------|
51. | 17 3 55 0 |
52. | 18 1 65 0 |
53. | 18 2 65 0 |
54. | 18 3 66 0 |
55. | 19 1 64 1 |
|-----------------------------|
56. | 19 2 63 1 |
57. | 19 3 62 1 |
58. | 20 1 67 0 |
59. | 20 2 66 0 |
60. | 20 3 63 0 |
+-----------------------------+
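A long layout like the one above can be produced from the wide layout with reshape long. The sketch below uses only the first three children from the chapter's listing to keep the example short.

```stata
clear
input id weight1 weight2 weight3 sport
1 66 67 68 0
2 71 71 65 0
3 70 66 62 0
end

* weight1-weight3 become a single variable, weight, indexed by visit
reshape long weight, i(id) j(visit)

list id visit weight sport, sepby(id)
```

After the reshape there is one row per child and visit, and sport is repeated on every row because it does not change between visits.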
Subsequently, the mixed command is used for the first two visits as can be seen
below:
mixed weight sport if visit < 3, || id:, stddev
Output
------------------------------------------------------------------------------
Random-effects Parameters |Estimate Std. Err. [95% Conf. Interval]
------------------------------+----------------------------------------------
id: Identity |
sd(_cons) |1.746425 .3322952 1.202792 2.535768
------------------------------+----------------------------------------------
sd(Residual) | 1.07238 .1695582 .7866148 1.46196
------------------------------------------------------------------------------
LR test vs. linear model: chibar2(01) = 14.99 Prob >= chibar2 = 0.0001
The results indicate that there is a significant change in the expected weight by sport
(P-value < .001), even after controlling for the effect between subjects ($\hat{\beta}_1 = -3.2$;
95% CI: −4.87, −1.53). The intraclass correlation coefficient is determined with the
following estimates:
$\hat{\sigma}_{U_0} = 1.74$
$\hat{\sigma}_{\varepsilon} = 1.07$
$\rho = \dfrac{1.74^2}{1.74^2 + 1.07^2} = .73$
We can see that the average intercept is $\gamma_{00} = 67.4$ and that the subject-specific
intercepts vary around it with a standard deviation of 1.74.
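The same intraclass correlation can be requested directly with the postestimation command estat icc after mixed. The sketch below rebuilds the long dataset from the chapter's listing and refits the random-intercept model for the first two visits.

```stata
clear
input id weight1 weight2 weight3 sport
 1 66 67 68 0
 2 71 71 65 0
 3 70 66 62 0
 4 64 62 66 1
 5 67 66 68 1
 6 65 64 65 1
 7 67 67 63 0
 8 65 66 66 1
 9 69 70 68 0
10 63 62 63 1
11 61 60 60 1
12 66 68 68 1
13 68 68 70 0
14 67 69 65 0
15 65 67 63 1
16 64 62 64 1
17 65 64 55 0
18 65 65 66 0
19 64 63 62 1
20 67 66 63 0
end
reshape long weight, i(id) j(visit)

* Random intercept for each child, first two visits only
mixed weight sport if visit < 3 || id:, stddev

* Intraclass correlation computed from the fitted variance components
estat icc
```

The value reported by estat icc should agree with the hand calculation above, up to rounding of the two standard deviations.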
Another option when running the mixed model with a random intercept is to
use the gllamm command, which may be downloaded from www.gllamm.org. The
result is the following, assuming that the data are in the long format:
gllamm weight sport if visit < 3, i(id) nip(20)
Output
gllamm model
----------------------------------------------------------------------------
weight | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+------------------------------------------------------------
sport |-3.194099 .8346379 -3.83 0.000 -4.829959 -1.558239
_cons | 67.38501 .570114 118.20 0.000 66.26761 68.50241
----------------------------------------------------------------------------
Variance at level 1
------------------------------------------------------------------------------
1.1368944 (.37024967)
***level 2 (id)
The results of both mixed and gllamm are similar, with a slight difference in the
log-likelihood estimate and in the variance of random effects.
where:
$\beta_{0i} = \gamma_{00} + U_{0i}$ (random coefficient)
$\beta_{1i}$ indicates a random coefficient associated with the predictor X, which is
defined as follows: $\beta_{1i} = \gamma_{01} + U_{1i}$
$U_{0i} \sim N(0, \sigma_{U_0}^2)$
$U_{1i} \sim N(0, \sigma_{U_1}^2)$
$\gamma_{00}$ indicates the average intercept in a design with two levels
$\gamma_{01}$ indicates the average slope in a design with two levels
$\sigma_{U_0}^2$ indicates the variance of the intercept between subjects
$\sigma_{U_1}^2$ indicates the variance of the slope between subjects
In this mixed model, the variance of $Y_{ij}$, given $X_{ij}$, is given by the following
expression:
where $\sigma_{01}$ is the covariance between the intercept and the slope.
In addition, two different visits ($j \neq j'$) by the same subject may be correlated;
this correlation is calculated from the covariance, as follows:
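For a model with a random intercept and a random slope, $Y_{ij} = \beta_{0i} + \beta_{1i} X_{ij} + \varepsilon_{ij}$, these quantities have a standard form. The expressions below are a sketch reconstructed from the definitions in this section, assuming that the residual $\varepsilon_{ij}$ has variance $\sigma_{\varepsilon}^2$ and is independent of $U_{0i}$ and $U_{1i}$.

```latex
\[
\operatorname{Var}(Y_{ij} \mid X_{ij})
  = \sigma_{U_0}^2 + 2\,\sigma_{01} X_{ij} + \sigma_{U_1}^2 X_{ij}^2 + \sigma_{\varepsilon}^2
\]
\[
\operatorname{Cov}(Y_{ij}, Y_{ij'} \mid X_{ij}, X_{ij'})
  = \sigma_{U_0}^2 + \sigma_{01}\,(X_{ij} + X_{ij'}) + \sigma_{U_1}^2 X_{ij} X_{ij'} ,
\qquad j \neq j'
\]
```

Unlike the random-intercept case, both quantities now depend on the value of the predictor, so the correlation between two visits is no longer a single constant.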
Output
Mixed-effects ML regression Number of obs = 60
---------------------------------------------------
| No. of Observations per Group
Group Variable | Groups Minimum Average Maximum
---------------+-----------------------------------
id | 20 3 3.0 3
sport | 20 3 3.0 3
---------------------------------------------------
------------------------------------------------------------------------------
weight | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+-------------------------------------------------------------
sport |-2.133333 .9387534 -2.27 0.023 -3.973256 -.2934105
_cons | 66.43333 .6637989 100.08 0.000 65.13231 67.73436
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters |Estimate Std. Err. [95% Conf. Interval]
-----------------------------+---------------------------------------------
id: Identity |
sd(_cons) |1.206639 18.1424 1.92e-13 7.58e+12
-----------------------------+---------------------------------------------
sport: Identity |
sd(_cons) |1.206645 18.14232 1.92e-13 7.58e+12
-----------------------------+---------------------------------------------
sd(Residual) |2.1173 .2367064 1.700675 2.635989
------------------------------------------------------------------------------
LR test vs. linear model: chi2(2) = 8.40 Prob > chi2 = 0.0150
The results indicate that there is a significant change in the average expected weight
by sport (P-value = .023), even after controlling for the effect between subjects
($\hat{\gamma}_{01} = -2.13$; 95% CI: −3.97, −0.29). However, the average of the slopes associated
with the sport predictor will vary ±1.21, so the pattern of change in the average
weight will depend on each visit.
+-------------------------+
| block co hcv age2 |
|-------------------------|
1. | 3 0 0 2 |
2. | 4 1 1 1 |
3. | 4 1 0 2 |
4. | 1 0 0 2 |
5. | 1 0 0 1 |
|-------------------------|
6. | 1 1 1 1 |
7. | 4 1 0 1 |
8. | 4 0 0 1 |
9. | 2 0 0 1 |
10. | 2 0 0 1 |
|-------------------------|
11. | 3 0 1 2 |
12. | 2 0 0 1 |
13. | 2 1 0 2 |
14. | 4 0 0 1 |
15. | 3 0 0 1 |
|-------------------------|
16. | 1 1 0 2 |
17. | 1 0 0 1 |
18. | 4 1 0 1 |
19. | 2 1 1 1 |
20. | 2 1 0 1 |
|-------------------------|
21. | 2 0 0 1 |
22. | 2 1 0 1 |
23. | 3 0 0 1 |
24. | 3 1 0 1 |
25. | 3 0 0 1 |
|-------------------------|
26. | 3 0 0 2 |
27. | 1 1 0 1 |
28. | 1 1 0 1 |
29. | 4 0 0 2 |
30. | 4 0 0 1 |
|-------------------------|
31. | 4 1 1 2 |
32. | 4 1 1 1 |
33. | 1 0 0 1 |
34. | 2 0 1 1 |
35. | 2 0 0 1 |
|-------------------------|
36. | 3 0 0 2 |
37. | 3 0 1 1 |
38. | 2 1 0 2 |
39. | 4 0 0 1 |
40. | 4 0 1 1 |
|-------------------------|
41. | 1 0 0 1 |
42. | 4 0 0 2 |
43. | 4 0 0 1 |
44. | 1 0 0 1 |
45. | 2 1 0 2 |
|-------------------------|
46. | 2 0 0 1 |
47. | 1 1 1 1 |
48. | 1 1 0 1 |
49. | 3 0 0 1 |
50. | 3 0 0 2 |
|-------------------------|
51. | 2 0 1 1 |
52. | 2 1 0 1 |
53. | 2 0 0 1 |
54. | 4 0 0 1 |
55. | 4 0 0 1 |
|-------------------------|
56. | 4 1 1 1 |
57. | 4 0 0 1 |
58. | 4 0 0 1 |
59. | 3 1 0 1 |
60. | 3 1 0 2 |
|-------------------------|
61. | 3 1 0 1 |
62. | 4 1 1 2 |
63. | 2 0 0 1 |
64. | 2 0 0 1 |
65. | 2 0 0 2 |
|-------------------------|
66. | 1 0 0 2 |
67. | 2 0 0 1 |
68. | 2 0 0 2 |
69. | 4 0 0 1 |
70. | 2 0 0 2 |
|-------------------------|
71. | 2 1 0 1 |
72. | 2 1 0 2 |
73. | 2 0 0 2 |
74. | 2 0 0 1 |
75. | 2 0 0 1 |
|-------------------------|
76. | 3 1 0 2 |
77. | 3 0 0 2 |
78. | 3 0 0 1 |
79. | 1 0 0 1 |
80. | 1 0 0 2 |
|-------------------------|
81. | 1 0 0 2 |
82. | 3 1 0 2 |
83. | 1 1 0 2 |
84. | 1 0 0 2 |
85. | 2 1 0 1 |
|-------------------------|
86. | 2 0 0 1 |
87. | 2 1 0 2 |
88. | 1 1 0 1 |
+-------------------------+
If the data are analyzed under the assumption that there is a possible correlation
between the subjects who reside in the same block (random intercept), the
prevalence ratio can be estimated with a binomial regression model and the option
link(log), which is called the log-binomial regression model. The syntax of the
gllamm command line is as follows:
xi:gllamm hcv i.co i.age2,fam(bin) i(block) eform link(log)
Output
gllamm model
------------------------------------------------------------------------------
hcv | exp(b) Std. Err. z P>|z| [95% Conf. Interval]
------------+-------------------------------------------------------------
_Ico_1 |2.834653 1.482238 1.99 0.046 1.017202 7.899373
_Iage2_2 |.5385232 .3316394 -1.01 0.315 .1610675 1.800532
_cons |.1036169 .0489457 -4.80 0.000 .0410533 .2615254
------------------------------------------------------------------------------
Variances and covariances of random effects
------------------------------------------------------------------------------
***level 2 (block)
The results show that the prevalence of HCV infection among cocaine users is 2.83
(95% CI: 1.02, 7.90) times the prevalence of HCV infection among cocaine nonus-
ers, adjusting for age and block of residence. This excess was statistically significant
(P-value = .046).
There are other applications of multilevel modeling in health sciences that can
be explored in Stata, including ordinal outcomes, count outcomes, and censored
outcomes. These topics are beyond the scope of this book, but an extensive review
of multilevel modeling can be found in Snijders and Bosker (2003), Leyland and
Goldstein (2001), Twisk (2003), and Rabe-Hesketh and Skrondal (2005).
Chapter 12
Introduction to Advanced Programming in STATA
12.1 Introduction
Stata provides an editor window to save Stata commands and user-defined commands.
These files can be executed within this editor or they can be called for execution
within another do-file. In this chapter, we will present an introduction about how to
prepare do-files and the structure to define program commands (Juul, 2014).
12.2 do-files
The do-file editor tool can be used for data management and to create programs.
There are four ways to open a new do-file: (1) the Windows menu (Window → do-file
editor); (2) the keyboard (press Ctrl+9); (3) the Windows icon (using the new
do-file editor); and (4) using the command line doedit.
Example 1
Open a new do-file editor using the command window by typing “doedit.” A do-file
editor page will open. Create the following do-file:
cd "\Users\Documents\students"
do "example1.do"
run example1
The commands noisily and quietly are special commands that turn the output on
and off. The first, noisily, performs the command subsequently written and ensures
terminal output. The second, quietly, performs the command subsequently writ-
ten but suppresses terminal output. As you can see in the example above, if you
type “do” before typing “example1,” all the information in the do-file will be dis-
played in the Stata results window. If you type “run,” only the information after the
noisily display commands will be shown in the output window.
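The effect of the two prefixes can be seen in a short sketch. It uses the auto dataset that ships with Stata, which is an assumption made for illustration and not the data used in this chapter.

```stata
sysuse auto, clear

* quietly runs the command but prints nothing; results are still stored in r()
quietly summarize price
display "Mean price: " r(mean)

* Inside a quietly block, noisily switches the output back on for one command
quietly {
    noisily display "This line is shown"
    summarize mpg
}
display "Observations used: " r(N)
```

Only the three display lines appear in the results window; both summarize tables are suppressed.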
When you are writing a program, the program name has to be unique and cannot
be the same as any other command name. For example, you cannot use the name
“ttest” because it is already a command in Stata. To find out whether a
name is already in use by Stata, you can use the which command. In the command
window, type which followed by the program name. Doing so will result in the following:
. which ttest
C:\Program Files (x86)\Stata\ado\base\t\ttest.ado
*! version 4.1.1 30dec2004
. which example
command example not found as either built-in or ado-file
r(111);
As you can see in the previous example, a program named “ttest” is being used by
Stata, and the program name “example” is not being used. Example2.do illustrates
the use of the program command using the command window. Type the following
lines in the command field:
program example2
display "Example 2: How to use the command program"
display "STATA commands"
display "End of the Example"
end
To execute the program, type “example2” in the command line; the output will be:
. example2
Example 2: How to use the command program
STATA commands
End of the Example
If you want to change or edit a program, you will need to first delete that program
from the memory. If you try to use the name “example2” again, you will get the
following error:
program example2
example2 already defined
r(110);
To avoid this error, you can use the commands program drop and capture. The
program drop command deletes the program from memory, and the capture
prefix suppresses any error raised by the command that follows it (for example,
when the program does not exist yet). Type the following example in the command
line, or create a do-file with these commands:
capture program drop example3
program example3
display "Example 3: How to use the commands drop and capture"
display "STATA commands"
display "End of the Example"
end
After the log using command, all commands and their output are saved in
example4.txt until the log close command is issued.
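A minimal sketch of this pattern follows. The file name example4.txt comes from the text; the options text and replace and the use of the auto dataset are assumptions made for illustration.

```stata
* Close any log that may already be open; capture ignores the error if none is
capture log close

* Start a plain-text log, overwriting an earlier file of the same name
log using "example4.txt", text replace

sysuse auto, clear
summarize price mpg

* Everything between log using and log close is written to example4.txt
log close
```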
If you run the do-file named “example5,” you will get the following error:
. example5
------------------------------------------- begin example5 ---
- display "Example #5"
Example #5
- display "Runs the error when you run the program"
Runs the error when you run the program
- display "Stops and displays the error when executing the program"
Stops and displays the error when executing the program
- display "End of the Example"
End of the Example
End of the Example
- ERRORRRRRR
unrecognized command: ERRORRRRRR
--------------------------------------------- end example5 ---
r(199);
end of do-file
12.6 Delimiters
Stata reads each line as a complete command line, but sometimes the com-
mands are long. To be able to use more than one line as your command line
you can use delimiters. There are two types of delimiters, one you can use in
each line (///), and one you set up before running your Stata commands (#d ;).
The following examples show how each of the two delimiters is used to create
a two-way graph:
And
#d ;
twoway (scatter bmi age if sex==0, sort mcolor(navy) msymbol(circle_hollow))
(scatter bmi age if sex==1, sort mcolor(maroon) msymbol(circle))
(line bmi age if sex==0, sort lcolor(navy) lwidth(thick))
(line bmi age if sex==1, sort lcolor(maroon) lwidth(thick)),
legend(position(10) ring(0) col(1) order(1 "Males" 2 "Females")
region(fcolor(none) lcolor(none))) ylab( , angle(horizontal))
ytitle("BMI") xtitle("Age") graphregion(fcolor(white));
#d cr
As you can see above, you need to open with #d ; and then close with #d cr for the
next command lines. If you do not close the delimiter, Stata will continue to read
all the lines continuously.
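The two approaches can be compared side by side on a shorter graph command. The sketch below is meant to be run from a do-file, because /// is not recognized in the command window; it uses the auto dataset that ships with Stata, which is an assumption made for illustration.

```stata
sysuse auto, clear

* Form 1: /// at the end of a line continues the command on the next line
twoway (scatter mpg weight if foreign == 0) ///
       (scatter mpg weight if foreign == 1), ///
       legend(order(1 "Domestic" 2 "Foreign"))

* Form 2: after #delimit ; a command ends only at the next semicolon
#delimit ;
twoway (scatter mpg weight if foreign == 0)
       (scatter mpg weight if foreign == 1),
       legend(order(1 "Domestic" 2 "Foreign"));
#delimit cr
```

Both forms draw the same graph; /// is convenient for an occasional long command, while #delimit ; suits a block of several long commands.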
12.7 Indexing
When you execute a Stata command, the command loops across each observation
(row) of the dataset. For example, if you generate a new variable, Stata works on
observation 1, then observation 2, and so on. Indexing lets the user run Stata
commands only on certain observations. The following are examples of indexing:
1. Generate a new variable, x, that contains the number of the current observation:
gen x=_n
Output
. list x
+---+
| x |
|---|
1. | 1 |
2. | 2 |
3. | 3 |
4. | 4 |
5. | 5 |
|---|
6. | 6 |
7. | 7 |
+---+
2. Generate a new variable, y, that contains the total number of observations in
the dataset:
gen y=_N
Output
. list x y
+-------+
| x y |
|-------|
1. | 1 7 |
2. | 2 7 |
3. | 3 7 |
4. | 4 7 |
5. | 5 7 |
|-------|
6. | 6 7 |
7. | 7 7 |
+-------+
3. To check for duplicates in your dataset, assuming every subject is identified
by the variable ID (a sequential set of numbers starting with 1), use the
following command line; any value of duplicates greater than 1 flags a repeated ID:
bysort ID: gen duplicates = _n
4. To create a variable with the total number of subjects in a group, where these
groups are identified by groupid, use the following command line:
bysort groupid: gen subjects = _N
5. Generate two new variables, z and w. Variable z contains the value of x in the
previous observation, so its first observation will be missing. Variable w contains
the value of x in the next observation, so its last observation will be missing. Observe:
gen z=x[_n-1]
gen w=x[_n+1]
list x y z w
+---------------+
| x y z w |
|---------------|
1. | 1 7 . 2 |
2. | 2 7 1 3 |
3. | 3 7 2 4 |
4. | 4 7 3 5 |
5. | 5 7 4 6 |
6. | 6 7 5 7 |
7. | 7 7 6 . |
+---------------+
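The indexing examples above can be run together on a small artificial dataset. In the sketch below, the seven observations and the ID values are hypothetical and are chosen so that some IDs repeat.

```stata
clear
set obs 7

generate x = _n            // number of the current observation
generate y = _N            // total number of observations
generate z = x[_n-1]       // value of x in the previous observation
generate w = x[_n+1]       // value of x in the next observation

* Hypothetical IDs 1,1,2,2,3,3,4 so that three IDs appear twice
generate ID = ceil(_n/2)

* Within each ID (ordered by x): running count and group size
bysort ID (x): generate duplicates = _n
bysort ID (x): generate subjects = _N

list, sepby(ID)
```

Within a by group, _n and _N refer to the position and the size of the group rather than of the whole dataset, which is why duplicates restarts at 1 for each ID.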
If we are interested in using age and hemoglobin (hgb) levels as predictors of bmi,
we could define the list of predictors and then run a multiple linear regression
model, as follows:
local list = "age hgb"
reg bmi `list'
Output
Source | SS df MS Number of obs = 14
---------+-------------------------------- F(2, 11) = 43.77
Model | 221.078339 2 110.539169 Prob > F = 0.0000
Residual | 27.7788042 11 2.52534584 R-squared = 0.8884
---------+-------------------------------- Adj R-squared = 0.8681
Total | 248.857143 13 19.1428571 Root MSE = 1.5891
-------------------------------------------------------------------------
bmi | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+---------------------------------------------------------------
age | .2179857 .0239641 9.10 0.000 .165241 .2707304
hgb | 2.241635 .3550485 6.31 0.000 1.460178 3.023091
_cons | -12.84275 5.193205 -2.47 0.031 -24.27292 -1.412584
-------------------------------------------------------------------------
12.9 Scalars
Scalars are temporary results that are saved in the memory after a command is
run. After you run a command, you can review which scalars were saved using
the return list command. For example, let’s assume that we have the variable smoke
from the previous database, and we want to run a Student’s t-test to compare the
expected bmi by smoke. The following is what that would look like:
Output
Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
0 | 8 27.25 1.497021 4.234214 23.71011 30.78989
1 | 6 22.66667 1.308094 3.204164 19.3041 26.02923
---------+--------------------------------------------------------------------
combined | 14 25.28571 1.169336 4.375255 22.75952 27.81191
---------+--------------------------------------------------------------------
diff | 4.583333 2.07317 .0662847 9.100382
------------------------------------------------------------------------------
diff = mean(0) - mean(1) t = 2.2108
Ho: diff = 0 degrees of freedom = 12
return list
scalars:
r(level) = 95
r(sd) = 4.375255094603872
r(sd_2) = 3.204163957519444
r(sd_1) = 4.234214381508266
r(se) = 2.073169652345752
r(p_u) = .0236068853559555
r(p_l) = .9763931146440445
r(p) = .047213770711911
r(t) = 2.210785464733855
r(df_t) = 12
r(mu_2) = 22.66666666666667
r(N_2) = 6
r(mu_1) = 27.25
r(N_1) = 8
Scalars are useful for displaying only the results you want, instead of displaying all
the results. Here is an example:
In addition, you can create new scalars to calculate results not included in the saved
results. In the following example, using the previous database, the mean difference
between two groups is calculated:
. return list
scalars:
r(diff) = 4.583333333333332
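One way to obtain a saved result such as r(diff) is to wrap the calculation in an rclass program. The sketch below is an assumption about how this could be done, not the book's listing: the program name meandiff is hypothetical, and the auto dataset that ships with Stata stands in for the bmi data.

```stata
capture program drop meandiff
program meandiff, rclass
    * Usage: meandiff outcome, by(groupvar)
    syntax varname, by(varname)
    quietly ttest `varlist', by(`by')
    * Mean of the first group minus mean of the second group
    return scalar diff = r(mu_1) - r(mu_2)
end

sysuse auto, clear
meandiff mpg, by(foreign)
return list
display "Mean difference: " %6.3f r(diff)
```

Declaring the program as rclass is what allows return scalar to place diff among the r() results shown by return list.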
Here is an example that uses the previous database with the following do-file:
Output
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
------------+-------------------------------------------------
age | 40.71429 5.614374 28.58517 52.8434
--------------------------------------------------------------
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
------------+-------------------------------------------------
hgb | 13.05 .3789444 12.23134 13.86866
--------------------------------------------------------------
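Output of this kind, one mean table per variable, is typical of a foreach loop. The sketch below shows the loop in three common forms, using the auto dataset that ships with Stata as a stand-in for the bmi data; the names vars, var, and u1 to u3 are hypothetical.

```stata
sysuse auto, clear

* foreach over an explicit list of variable names
foreach var in price mpg {
    mean `var'
}

* The same list held in a local macro
local vars "price mpg"
foreach var of local vars {
    quietly summarize `var'
    display "`var': mean = " %9.2f r(mean)
}

* forvalues over the consecutive integers 1, 2, 3
forvalues i = 1/3 {
    generate u`i' = runiform()
}
describe u1-u3
```

In each loop the macro is referenced with a left single quote and an apostrophe, as in `var', and Stata substitutes its current value before running the line.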
In addition, you can use the local command in the do-file, as can be seen in the following:
The command forvalues loops over consecutive values, using the following structure:
For example, assuming we want to generate two random variables with uniform
distribution between the numbers 1 and 14, and assuming we are using the previ-
ous bmi database, the do-file will be composed of the following commands:
forvalues i = 1(1)2 {
generate x`i' = 1 + int(runiform()*14)
}
Once the above forvalues command is run, the variables x1 and x2 are generated. To
explore the values of these variables, we use list, as is demonstrated in the following:
. list x1 x2
+---------+
| x1 x2 |
|---------|
1. | 7 13 |
2. | 11 13 |
3. | 13 11 |
4. | 2 13 |
5. | 7 10 |
|---------|
6. | 13 4 |
7. | 11 12 |
8. | 4 1 |
9. | 3 13 |
10. | 11 5 |
|---------|
11. | 14 11 |
12. | 11 9 |
13. | 13 5 |
14. | 4 3 |
+---------+
Assuming we would like to select those persons for whom x1 is greater than x2 for
further assessment, we would use the following commands:
gen id=_n
gen selec=(x1 > x2)
list id age bmi hgb smoke if selec==1
Output
+-------------------------------+
| id age bmi hgb smoke |
|-------------------------------|
3. | 3 23 27 14.5 1 |
6. | 6 56 29 13 0 |
8. | 8 52 23 11 1 |
10. | 10 45 20 11.5 1 |
11. | 11 25 24 14.7 0 |
|-------------------------------|
12. | 12 34 20 12 0 |
13. | 13 59 29 13 0 |
14. | 14 78 32 12 0 |
+-------------------------------+
Output
The prevalence estimate of a hemoglobin level below 12 is 21.4% (95% CI: 5.9,
54.0%).
If we want to use the glm command for this estimation, we will use the logistic
regression model with no predictor variables, as follows:
$\text{Prevalence} = \dfrac{1}{1 + e^{-\beta_0}}$
+--------------------------------+
| prev previnf prevsup |
|--------------------------------|
1. | 21.42857 7.070517 49.43356 |
+--------------------------------+
The point estimates of this prevalence are the same, but the confidence limits are
different, probably because of the small sample size for the normal approach used
in the proportion command.
The other option for prevalence estimation is to use the adjust command after
the logit command, as is demonstrated in the following:
logit nhgb
adjust ,pr ci
Output
------------------------------------------------------------------------------
nhgb | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------+----------------------------------------------------------------
_cons | -1.299283 .6513389 -1.99 0.046 -2.575884 -.0226821
------------------------------------------------------------------------------
. adjust ,pr ci
--------------------------------------------------------------------------------------
Dependent variable: nhgb Equation: nhgb Command: logit
--------------------------------------------------------------------------------------
----------------------------------------------
All | pr lb ub
-------+--------------------------------------
| .214286 [.070707 .49433]
----------------------------------------------
Key: pr = Probability
[lb , ub] = [95% Confidence Interval]
The results are the same as those obtained with the glm command. That is, the
prevalence estimate of hemoglobin levels below 12 is 21.4% (95% CI: 7.07%,
49.43%).
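These figures can be checked by hand by applying the inverse logit to the intercept and to its confidence limits, as reported in the logit output above. The following is only a sketch of that arithmetic, using Stata's invlogit() function; the coefficient values are copied from the output rather than taken from stored results.

```stata
* Point estimate: inverse logit of the intercept (about 0.214)
display invlogit(-1.299283)

* Lower and upper 95% limits: inverse logit of the limits of the intercept
* (about 0.071 and 0.494)
display invlogit(-2.575884)
display invlogit(-.0226821)

* The same point estimate written with the formula 1/(1 + exp(-b0))
display 1/(1 + exp(1.299283))
```

Because the limits are computed on the logit scale and then transformed, the interval is not symmetric around the point estimate and cannot fall outside the range from 0 to 1.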
When the logistic regression model includes predictors, the prevalence can be
estimated at a specified value of a predictor. For example, if we run the previous
logistic model with bmi as the predictor, the prevalence can be estimated at mean
bmi and at bmi equal to 20, as follows:
Output
------------------------------------------------------------------------------
nhgb | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------+----------------------------------------------------------------
bmi | -.7007166 .4025497 -1.74 0.082 -1.489699 .0882662
_cons | 14.81156 8.899236 1.66 0.096 -2.630626 32.25374
------------------------------------------------------------------------------
. adjust , pr ci
--------------------------------------------------------------------------------------
Dependent variable: nhgb Equation: nhgb Command: logit
Variable left as is: bmi
--------------------------------------------------------------------------------------
----------------------------------------------
All | pr lb ub
----------+-----------------------------------
| .05183 [.00225 .569917]
----------------------------------------------
Key: pr = Probability
[lb , ub] = [95% Confidence Interval]
. adjust bmi=20, pr ci
--------------------------------------------------------------------------------------
Dependent variable: nhgb Equation: nhgb Command: logit
Covariate set to value: bmi = 20
--------------------------------------------------------------------------------------
----------------------------------------------
All | pr lb ub
----------+-----------------------------------
| .68938 [.165612 .961265]
----------------------------------------------
Key: pr = Probability
[lb , ub] = [95% Confidence Interval]
The prevalence estimate of hemoglobin levels below 12 at the mean bmi is 5.2%
(95% CI: 0.22%, 57.0%). The prevalence estimate of hemoglobin levels below
12 for those subjects with bmi equal to 20 is 68.9% (95% CI: 16.56%, 96.13%).
Although the bmi predictor in the model is only marginally significant
(P-value = .082), the prevalence estimates at different bmi values are quite different.
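The estimate at bmi equal to 20 can also be verified directly from the fitted coefficients, since the predicted prevalence is the inverse logit of the linear predictor. The sketch below copies the coefficients from the logit output above, so it runs without the data set.

```stata
* Linear predictor at bmi = 20: _cons + coefficient of bmi * 20 (about 0.797)
display 14.81156 - .7007166*20

* Predicted prevalence at bmi = 20 (about 0.689)
display invlogit(14.81156 - .7007166*20)

* Lower and upper 95% limits reported by adjust, expressed as percentages
display 100*.165612
display 100*.961265
```

In Stata 11 and later, the margins command supersedes adjust. As an alternative that is not used in this chapter, typing margins, at(bmi=20) after the logit command should report the same predicted probability; by default its confidence interval is obtained with the delta method on the probability scale, so its limits will generally differ from those shown by adjust.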
Other programming options can be explored in Stata, including procedures for
matrix operations using Mata functions. These topics are beyond the scope of this
book; for further detail, we recommend the books by Acock
(A Gentle Introduction to Stata, 4th edition, 2014) and by Baum (An Introduction to
Stata Programming, 2009).
References
Acock A. A Gentle Introduction to Stata. 4th ed. College Station, TX: Stata Press, 2014.
Baum C. An Introduction to Stata Programming. College Station, TX: Stata Press, 2009.
Bingham N, Fry J. Regression Linear Models in Statistics. London, UK: Springer-Verlag, 2010.
Cameron A, Trivedi P. Regression Analysis of Count Data. London, UK: Cambridge University
Press, 1998.
Collett D. Modelling Binary Data. 2nd ed. London: Chapman & Hall, 2002.
Collett D. Modelling Survival Data in Medical Research. 2nd ed. London, UK: Chapman &
Hall, 2003.
Draper NR, Smith H. Applied Regression Analysis. 3rd ed. Hoboken, NJ: John Wiley &
Sons, 1998.
Fox J. Applied Regression Analysis and Generalized Linear Models. 2nd ed. Thousand Oaks,
CA: Sage Publications, 2008.
Fu J, Gao J, Zhang Z, Zheng J, Luo JF, Zhong LP, Xiang YB. Tea consumption and
the risk of oral cancer incidence: A case-control study from China. Oral Oncol. 2013;
49:918–922.
Good PI. Resampling Methods: A Practical Guide to Data Analysis. 3rd ed. Boston, MA:
Birkhäuser Basel, 2006.
Hardin J, Hilbe J. Generalized Linear Models and Extensions. 1st ed. College Station, TX:
Stata Press, 2001.
Hilbe J. Negative Binomial Regression. New York: Cambridge University Press, 2007.
Hoffmann J. Generalized Linear Models: An Applied Approach. Boston, MA: Pearson/Allyn &
Bacon, 2004.
Hosmer D, Lemeshow S. Applied Logistic Regression. 2nd ed. Hoboken, NJ: John Wiley &
Sons, 2000.
Jewell N. Statistics for Epidemiology. Boca Raton, FL: Chapman & Hall, 2004.
Juul S, Frydenberg M. An Introduction to STATA for Health Researchers. 4th ed. College
Station, TX: Stata Press, 2014.
Kleinbaum D, Klein M. Logistic Regression: A Self-Learning Text. 2nd ed. New York:
Springer-Verlag, 2002.
Kleinbaum D, Klein M. Survival Analysis: A Self-Learning Text. 2nd ed. New York: Springer-
Verlag, 2005.
Kleinbaum D, Kupper L, Nizam A, Muller K. Applied Regression Analysis and Other
Multivariable Methods. 4th ed. Belmont, CA: Thomson Brooks, 2008.
Leyland A, Goldstein H. Multilevel Modelling of Health Statistics. Chichester: John Wiley &
Sons, 2001.
Marschner I. Inference Principles for Biostatisticians. Boca Raton, FL: CRC Press, 2015.
McCullagh P, Nelder J. Generalized Linear Models. 2nd ed. Boca Raton, FL: Chapman & Hall,
1999.
Peace K (ed). Design and Analysis of Clinical Trials with Time-to-Event Endpoints. Boca
Raton, FL: CRC Press, 2009.
Porta M (ed). A Dictionary of Epidemiology. 5th ed. New York: Oxford University Press,
2008.
Rabe-Hesketh S, Everitt B. A Handbook of Statistical Analyses Using STATA. Boca Raton,
FL: Chapman & Hall, 1999.
Rabe-Hesketh S, Skrondal A. Multilevel and Longitudinal Modeling Using STATA. College
Station, TX: Stata Press, 2005.
Rosner B. Fundamentals of Biostatistics. 7th ed. Boston, MA: Cengage Learning, 2010.
Rothman K. Epidemiology: An Introduction. New York: Oxford University Press, 2002.
Royston P, Lambert P. Flexible Parametric Survival Analysis Using STATA: Beyond the Cox
Model. College Station, TX: Stata Press, 2011.
Sheskin D. Handbook of Parametric and Nonparametric Statistical Procedures. 4th ed. Boca
Raton, FL: Chapman & Hall, 2007.
Snijders T, Bosker R. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel
Modeling. Thousand Oaks, CA: Sage Publications, 1999, reprinted 2003.
Szklo M, Nieto J. Epidemiology: Beyond the Basics. Sudbury, MA: Jones and Bartlett, 2004.
Twisk J. Applied Longitudinal Data Analysis for Epidemiology: A Practical Guide. London,
UK: Cambridge University Press, 2003.
Wienke A. Frailty Models in Survival Analysis. Boca Raton, FL: CRC Press, 2011.
Woodward M. Epidemiology: Study Design and Data Analysis. 2nd ed. Boca Raton, FL:
Chapman & Hall, 2004.