Using MonetDB and dplyr to Work with Large HCUP NIS Data Files
Charles DiMaggio
Bellevue-NYU Trauma Service
November 7, 2016
1 Introduction
This is an addendum to notes related to Nationwide Inpatient Sample analyses and Nationwide Emergency Department Sample analyses found elsewhere on this site. It presents some approaches to dealing with the very large data files that result from combining multiple years of administratively collected discharge data like HCUP. This issue is particularly relevant when conducting multi-year analyses of large complex survey data. To most appropriately assess year-to-year trends, it is necessary to combine the data and adjust for survey sampling, weighting, and clustering using the full multi-year data set. This, though, results in a very large data file that taxes even the most robust systems.
As a rule of thumb, an R file with 100 million rows and 5 variables takes up about 4GB. I recently completed an analysis of a 12-year HCUP NIS file, which was nearly 100 million rows long with approximately 25 variables, for an estimated 250GB. A 7-year NEDS file I worked with had nearly 200 million entries and 45 variables of interest, for about double that size. Since R loads everything into RAM, this presents computational memory difficulties on most any machine. One workaround is to use an out-of-memory database. The data reside outside of the computer's RAM, i.e. in a file somewhere on your system, and R only brings the results of data manipulations or analyses into physical memory. This kind of approach has great promise for utilizing the robust statistical powers of R in a large-data setting.
Previously, using MonetDB with R required setting up and maintaining a separate MonetDB server and connecting to it. MonetDBLite has superseded that and made using MonetDB with R pretty painless. But, although MonetDBLite seems to require and depend on MonetDB.R, databases created with MonetDB are not compatible with MonetDBLite. (Ouch.) Bottom line: just use MonetDBLite.
Install MonetDB.R and MonetDBLite. MonetDB.R installs the client version of MonetDB (client-side means it is on the user's machine, as opposed to server-side or out of memory). This also installs the DBI package. DBI is a general interface to DBMSs. You have to load DBI separately and explicitly each time you use MonetDBLite. (At one point MonetDBLite loaded DBI as a dependency, but that doesn't seem to happen anymore.) Each particular DBMS needs an additional package to handle the specifics of that DBMS, e.g. SQLite. The MonetDBLite package serves that purpose and makes it no longer necessary to set up and maintain a separate MonetDB server for each session, as was previously required when working with MonetDB.R.
The default MonetDB.R and MonetDBLite installation directs to CRAN, but when I tried installing, the packages were not there. This may have since been corrected, but the code below sets the repository directly to the MonetDB site and worked the last time I tried, in June 2016.
Then install dplyr.
# from https://www.monetdb.org/blog/monetdblite-r
install.packages("MonetDB.R", repos="http://dev.monetdb.org/Assets/R/")
install.packages("MonetDBLite", repos="http://dev.monetdb.org/Assets/R/")
# NB: MonetDBLite site set the repository as cran.rstudio,
# but the file was not there. I found it at the monetdb site.
install.packages("dplyr")
The following code takes MonetDB and dplyr for a quick ride. (This code is taken most shamelessly from the MonetDBLite site itself.) We begin by creating a directory in which to hold the MonetDB database we will be writing. In this case, for demonstration purposes, we create a temporary directory, but in the real world you will want to create a permanent folder somewhere on your system.
The next step is to specify a database connection within R to the directory or folder that will hold the MonetDB database. This is done with the DBI::dbConnect function and is saved to the object "con". Note that the first argument to dbConnect specifies that this connection is a MonetDB database. The following step actually writes the MonetDB database. Here, we take the venerable mtcars data set that comes with R and use DBI::dbWriteTable to write it to the MonetDB folder using the MonetDB connection. Within the MonetDB directory, we have specified that the table is called "mtcars", but we could have named it anything. The dbListTables function confirms that the MonetDB database contains a table called "mtcars"; dbListFields returns the fields (or variables) in that table.
The DBI::dbGetQuery function allows us to pass SQL statements to the database. (Note that the MonetDB dialect of SQL differs in some respects from standard SQL.) Depending on your familiarity and comfort with MonetDB and SQL, you could happily use this approach for all your data manipulation needs.
Alternatively, you could use dplyr to connect to the MonetDB table and utilize the increasingly popular plyr-like syntax to work with the database. When working with databases, dplyr uses an approach called "lazy evaluation": it tries, as much as possible, not to bring anything into R's memory unless and until absolutely necessary, which it accomplishes by translating R commands into SQL and sending a single statement to the out-of-memory database. Practically, this means you can work with very large databases without taxing your memory (see https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html). The dplyr code below uses src_monetdb() to establish a connection to the MonetDB database, which is saved as the object "ms", then creates a dplyr table of the "mtcars" MonetDB table in the directory. We can then use dplyr (and plyr) functions on that table.
Begin by loading "dplyr". (If you plan to use both "plyr" and "dplyr", load "plyr" first, else you will run into problems.)
library(MonetDB.R)
library(MonetDBLite)
library(DBI)
dbdir <- tempdir() # creates a temporary directory
con <- dbConnect(MonetDB.R(), embedded=dbdir) #DBI fx to connect to MonetDB
dbWriteTable(con, "mtcars", mtcars) # write the mtcars dataframe to a table named "mtcars"
dbListTables(con)
dbListFields(con,"mtcars")
dbGetQuery(con, "SELECT MAX(mpg) FROM mtcars WHERE cyl=8;") # sql query
# library(plyr)
library(dplyr) # run similar query in R using dplyr
ms <- src_monetdb(embedded=dbdir) # connect to a MonetDB database
mt <- tbl(ms, "mtcars") # creates a dplyr version of a dataframe
mt %>% filter(cyl == 8) %>% summarise(max(mpg))
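To see the lazy evaluation at work, you can ask dplyr to display the SQL it generates for a query, rather than fetching the results into R (a quick illustration using the table created above):

# show the SQL translation (and query plan) dplyr sends to MonetDB
explain(mt %>% filter(cyl == 8) %>% summarise(max(mpg)))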
3.2.2 Reading .csv files by writing a schema
The following vignette is stolen from Bob Rudis and is an example of the much more likely scenario of reading a .csv file into MonetDB. CSV files are by definition comma separated and usually come with headers. MonetDB by default expects no headers and uses the pipe character ("|") as a separator. There is a MonetDB.R function to read in .csv files (monetdb.read.csv), but the following general approach creates a MonetDB table with a schema into which to copy the .csv data, and can be used for more complex files.
We will again work with the venerable mtcars data set. For demonstration purposes, we will create a .csv version of it. We again create a directory to hold our MonetDB table. This time, rather than create a temporary directory, we first use our operating system (not shown) to create a folder called "testMonet" on our desktop. As before, we then use dbConnect to establish a connection to it.
The next couple of lines of code create a schema, or container, for the MonetDB table. First read 1,000 rows from the full .csv file into a smaller, representative .csv. Then run sprintf() on that representative file to return "... a character vector containing a formatted combination of text and variable values."
We then create the table with dbSendQuery, and populate the table with data from the mtcars.csv file using a second dbSendQuery. (The "OFFSET 2" option starts adding data on the second line, i.e. below the header.)
We then connect to the out-of-memory directory and create a dplyr table version of the MonetDB table.
library(MonetDBLite)
library(MonetDB.R)
library(dplyr)

# (the schema-building lines below are a minimal reconstruction of the
# steps described above)
write.csv(mtcars, "~/Desktop/mtcars.csv", row.names = FALSE) # demo .csv file
mdb <- dbConnect(MonetDB.R(), embedded = "~/Desktop/testMonet")
samp <- read.csv("~/Desktop/mtcars.csv", nrows = 1000) # representative sample
# build and send the CREATE TABLE schema; every mtcars column is numeric,
# so a single type suffices here
dbSendQuery(mdb, sprintf("CREATE TABLE mtcars (%s)",
    paste(names(samp), "DOUBLE PRECISION", collapse = ", ")))
# populate the table; OFFSET 2 starts adding data below the header row
dbSendQuery(mdb, sprintf("COPY OFFSET 2 INTO mtcars FROM '%s' USING DELIMITERS ',' NULL AS ''",
    normalizePath("~/Desktop/mtcars.csv")))
mdb_mtcars <- tbl(src_monetdb(embedded = "~/Desktop/testMonet"), "mtcars")
dbListTables(mdb)
count(mdb_mtcars, cyl)
I have found creating a schema to be difficult for more complex files like HCUP NIS and NEDS. The following code presents an example using the monetdb.read.csv() function. As you can see, this is much more straightforward. It has in general worked well for me, and is my preferred approach. I'm using a teaching file of .csv digitalis study data. First, create a connection to the .csv file. Then, read it into a MonetDB database using the MonetDB.R::monetdb.read.csv() function.
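A minimal sketch of that step, assuming a local dig.csv teaching file (the file name and paths are illustrative):

library(MonetDBLite)
library(MonetDB.R)
dig.con <- dbConnect(MonetDB.R(), embedded = "~/Desktop/testMonet")
# read the .csv directly into a MonetDB table called "dig"
monetdb.read.csv(dig.con, "~/dig.csv", "dig")
dbListTables(dig.con)
dbListFields(dig.con, "dig")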
Now, connect to the MonetDBLite table with dplyr and run some commands. Note that
you must use dplyr commands on the connection object rather than base commands.
library(dplyr)
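A minimal sketch of such commands, assuming the "dig" table and connection created above:

dig_tbl <- tbl(src_monetdb(embedded = "~/Desktop/testMonet"), "dig")
glimpse(dig_tbl) # a dplyr verb, translated to SQL
dig_tbl %>% summarise(n = n()) # row count computed in MonetDB, not in R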
The key here is the "append=TRUE, overwrite=FALSE" option of the DBI function dbWriteTable. This example assumes you have already read the HCUP NEDS .csv files into a separate R dataframe for each year. Caution: Don't use dots to name your database tables. MonetDB SQL queries won't work if the SQL table name includes a dot. Use underscores.
Begin by creating a folder to hold your large multi-year file, then connect to it using DBI::dbConnect(). Read the first year of data (2006) into R. Then, because MonetDB is very picky about variable names in tables, set all the names to lower case, and change the "year" variable to "yr". ("year" is a restricted name for MonetDB.) Then read the file into a MonetDB table called "neds". The gc() (for "garbage collection") command cleans and frees up R memory after memory-intensive operations.
The process is repeated with the R files for 2007 and 2008. This should go smoothly because those files have the same names and variable types as the 2006 file. After each year of data is added, check and note the number of entries in the table.
library(DBI)
library(MonetDB.R)
library(MonetDBLite)
library(plyr)
library(dplyr)

# connect to the folder that will hold the multi-year database
# (the path here is illustrative)
mdb <- dbConnect(MonetDBLite(), "~/neds/monet")

# read each yearly NEDS R file, fix variable names for SQL, write the
# initial table, then append the additional years

# 2006
neds.2006.core <- readRDS("~/NEDS_2006_Core.rds")
names(neds.2006.core) <- tolower(names(neds.2006.core)) # fix names for sql
names(neds.2006.core)[names(neds.2006.core) == "year"] <- "yr"
dbWriteTable(mdb, "neds", neds.2006.core)
rm(neds.2006.core)
gc()
# 2007
neds.2007.core <- readRDS("~/NEDS_2007_Core.rds")
names(neds.2007.core) <- tolower(names(neds.2007.core))
names(neds.2007.core)[names(neds.2007.core) == "year"] <- "yr"
dbWriteTable(mdb, "neds", neds.2007.core, append = TRUE, overwrite = FALSE)
rm(neds.2007.core)
gc()

# 2008
neds.2008.core <- readRDS("~/NEDS_2008_Core.rds")
names(neds.2008.core) <- tolower(names(neds.2008.core))
names(neds.2008.core)[names(neds.2008.core) == "year"] <- "yr"
dbWriteTable(mdb, "neds", neds.2008.core, append = TRUE, overwrite = FALSE)
rm(neds.2008.core)
gc()
We hit our first wrinkle at the 2009 data. Some of the variables have changed. At this point you have a decision to make: do you need to create a new variable? Drop an older one? Rename something? It will vary based on your analysis needs. In my case, it was sufficient to restrict to the intersection of variable names for the existing and the additional tables.
# 2009, variables change, need to restrict to intersection of names
neds.2009.core <- readRDS("~/NEDS_2009_Core.rds")
names(neds.2009.core) <- tolower(names(neds.2009.core))
names(neds.2009.core)[names(neds.2009.core) == "year"] <- "yr"
commonNames <- intersect(dbListFields(mdb, "neds"), names(neds.2009.core))
neds.2009.core <- neds.2009.core[commonNames] # restrict to common variables
dbWriteTable(mdb, "neds", neds.2009.core, append = TRUE, overwrite = FALSE)
rm(neds.2009.core)
gc()
# 2010
neds.2010.core <- readRDS("~/NEDS_2010_Core.rds")
names(neds.2010.core) <- tolower(names(neds.2010.core))
names(neds.2010.core)[names(neds.2010.core) == "year"] <- "yr"
neds.2010.core <- neds.2010.core[commonNames]
dbWriteTable(mdb, "neds", neds.2010.core, append = TRUE, overwrite = FALSE)
rm(neds.2010.core)
gc()
We hit another bump in the road at 2011. This time, the variable names match, but the variable types differ. This was much more problematic, as MonetDB is very finicky about data types. After much trial and error (mostly error, and very time consuming), I found the only thing that worked was reading the problematic additional year of data into the database as a separate table, then concatenating this new table onto the existing multi-year table using SQL UNION. This required dropping an additional field that didn't match up. After ensuring the union operation worked, I removed the single year of data from the database.
The process is repeated for the final year of data, along with a couple of checks to ensure things worked.
################## 2011 ##################
# differing variable types caused errors and crashes; the following
# approach was the only thing that seemed to work
# read 2011 file into monet as a separate table
neds.2011.core <- readRDS("~/NEDS_2011_Core.rds")
names(neds.2011.core) <- tolower(names(neds.2011.core))
names(neds.2011.core)[names(neds.2011.core) == "year"] <- "yr"
dbWriteTable(mdb, "neds_11", neds.2011.core)
# concatenate with existing neds table using sql union (first try returned
# error about non-matching fields that had to be fixed)
setdiff(dbListFields(mdb, "neds"), dbListFields(mdb, "neds_11"))
dbSendQuery(mdb, "alter table neds drop intent_self_harm;")
dbListTables(mdb)
rm(neds.2011.core)
gc()
# append 2012
neds.2012.core <- readRDS("~/NEDS_2012_Core.rds")
names(neds.2012.core) <- tolower(names(neds.2012.core))
names(neds.2012.core)[names(neds.2012.core) == "year"] <- "yr"
neds.2012.core <- neds.2012.core[intersect(dbListFields(mdb, "neds_06_11"),
names(neds.2012.core))]
dbWriteTable(mdb, "neds_06_11", neds.2012.core, append = TRUE, overwrite = FALSE)
dbListTables(mdb)
dbListFields(mdb, "neds_06_11")
dbGetQuery(mdb, "SELECT COUNT(*) FROM neds_06_11") # 198102435
dbGetQuery(mdb, "SELECT AVG(yr) AS AverageYear FROM neds_06_11;") # 2009.104
rm(neds.2012.core)
gc()
dbGetQuery(mdb, "CREATE TABLE neds_06_12 AS SELECT * FROM neds_06_11 WITH DATA; DROP TABLE neds_06_11;")
dbListTables(mdb)
dbListFields(mdb, "neds_06_12")
dbGetQuery(mdb, "SELECT COUNT(*) FROM neds_06_12") # 198102435
dbGetQuery(mdb, "SELECT AVG(yr) AS AverageYear FROM neds_06_12;") # 2009.104
Finally, here is some code that illustrates working directly with the out-of-memory file, using SQL syntax to create new variables. First, I add a sequential variable that will serve as the identifier required by the sqlsurvey package. Next, I add a variable set to 1, which can be used for count operations in the survey package. Lastly, I create an indicator variable for acute alcohol intoxication based on ICD diagnoses.
# need to create an id variable (required by sqlsurvey), and a count
# variable
dbSendQuery(mdb, "alter table neds_06_12 add id serial;")
dbListFields(mdb, "neds_06_12")
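The count and alcohol-intoxication variables can be added the same way. A minimal sketch, assuming a first-listed diagnosis field named dx1 and using ICD-9-CM 305.0x as an illustrative code set (the column names "counter" and "etoh" are also illustrative):

dbSendQuery(mdb, "alter table neds_06_12 add counter integer;")
dbSendQuery(mdb, "update neds_06_12 set counter = 1;")
dbSendQuery(mdb, "alter table neds_06_12 add etoh integer;")
dbSendQuery(mdb, "update neds_06_12
    set etoh = case when dx1 like '3050%' then 1 else 0 end;")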
This code illustrates how you can use dplyr to manipulate and combine 12 large files via a MonetDBLite connection. Unlike the previous example, where all the data manipulations are conducted outside of R's physical memory constraints, this brings the files into R, and requires an incremental approach, as well as paring down as many variables as possible. In contrast to the previous approach, which results in a single multi-year file with most all the variables, this process results in 4 tables in a MonetDB directory:
• "nis_0003" - single table, all rows and all variables, 2000 to 2003
• "nis_0407" - single table, all rows and all variables, 2004 to 2007
• "nis_0811" - single table, all rows and all variables, 2008 to 2011
• "nis_0011" - single table, all rows (94,646,462) and 52 variables, 2000 to 2011 (estimated about 40GB)
After establishing a connection to a folder that will hold the database, the first part of the code writes each individual-year NIS R file to the MonetDB database as a table by reading in each dataframe, using dbWriteTable() to write it to the out-of-memory database folder, then removing the R dataframe. Again, Caution: Don't use dots to name your database tables. MonetDB SQL queries won't work if the SQL table name includes a dot. Use underscores.
Next, create dplyr table versions of the individual-year MonetDB tables. Then, use DBI::dbWriteTable and dplyr::rbind_list to combine the individual-year tables two at a time. Attempting to combine all the files at one time failed. Note that the "n = -1" option is necessary to bring in all the data in a table, else dplyr restricts to just the first 100K rows. Use dbRemoveTable(mdb, tablename) to clean up and remove the individual-year files afterward.
The process is repeated to combine the 2-year files into 4-year files.
The 4-year files are too large to reasonably combine on most machines. It is necessary to pare down the number of variables. This involves first finding the intersection of the variable names between files (they tend to change across years), then using a select statement to retain only the variables you need for analysis.
library(DBI)
library(MonetDBLite)
library(MonetDB.R)
# library(plyr)
library(dplyr)
dbWriteTable(mdb, "nis_00", nis.2000.core)
rm(nis.2000.core)
nis.2001.core<-readRDS("~/nis.2001.core.rds")
dbWriteTable(mdb, "nis_01", nis.2001.core)
rm(nis.2001.core)
nis.2002.core<-readRDS("~/nis_2002_core.rds")
dbWriteTable(mdb, "nis_02", nis.2002.core)
rm(nis.2002.core)
nis.2003.core<-readRDS("~/nis_2003_core.rds")
dbWriteTable(mdb, "nis_03", nis.2003.core)
rm(nis.2003.core)
nis.2004.core<-readRDS("~/nis_2004_core.rds")
dbWriteTable(mdb, "nis_04", nis.2004.core)
rm(nis.2004.core)
nis.2005.core<-readRDS("~/nis_2005_core.rds")
dbWriteTable(mdb, "nis_05", nis.2005.core)
rm(nis.2005.core)
nis.2006.core<-readRDS("~/nis_2006_core_2.rds")
dbWriteTable(mdb, "nis_06", nis.2006.core)
rm(nis.2006.core)
nis.2007.core<-readRDS("~/nis_2007_core_2.rds")
dbWriteTable(mdb, "nis_07", nis.2007.core)
rm(nis.2007.core)
nis.2008.core<-readRDS("~/nis_2008_core_2.rds")
dbWriteTable(mdb, "nis_08", nis.2008.core)
rm(nis.2008.core)
nis.2009.core<-readRDS("~/nis_2009_core_2.rds")
dbWriteTable(mdb, "nis_09", nis.2009.core)
rm(nis.2009.core)
nis.2010.core<-readRDS("~/nis_2010_core_2.rds")
dbWriteTable(mdb, "nis_10", nis.2010.core)
rm(nis.2010.core)
nis.2011.core<-readRDS("~/nis_2011_core_2.rds")
dbWriteTable(mdb, "nis_11", nis.2011.core)
rm(nis.2011.core)
# create dplyr connection to the database and create dplyr table for each year of data
mdb_src <- src_monetdb(embedded="~/2000_2011/monet")
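# (a sketch of the pairwise combine described above; the lines for this
# step are assumed to follow this pattern)
nis_00 <- tbl(mdb_src, "nis_00")
nis_01 <- tbl(mdb_src, "nis_01")
dbWriteTable(mdb, "nis_0001", rbind_list(as.data.frame(nis_00, n = -1),
    as.data.frame(nis_01, n = -1)))
# ...repeat for "nis_0203", "nis_0405", "nis_0607", "nis_0809", "nis_1011"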
dbRemoveTable(mdb,"nis_00")
dbRemoveTable(mdb,"nis_01")
dbRemoveTable(mdb,"nis_02")
dbRemoveTable(mdb,"nis_03")
dbRemoveTable(mdb,"nis_04")
dbRemoveTable(mdb,"nis_05")
dbRemoveTable(mdb,"nis_06")
dbRemoveTable(mdb,"nis_07")
dbRemoveTable(mdb,"nis_08")
dbRemoveTable(mdb,"nis_09")
dbRemoveTable(mdb,"nis_10")
dbRemoveTable(mdb,"nis_11")
#clean up
dbRemoveTable(mdb,"nis_0001")
dbRemoveTable(mdb,"nis_0203")
dbRemoveTable(mdb,"nis_0405")
dbRemoveTable(mdb,"nis_0607")
dbRemoveTable(mdb,"nis_0809")
dbRemoveTable(mdb,"nis_1011")
# combine the 4 year-files
# create dplyr tables from the 4-year tables
mdb_nis_0003 <- tbl(mdb_src,"nis_0003")
mdb_nis_0407 <- tbl(mdb_src,"nis_0407")
mdb_nis_0811 <- tbl(mdb_src,"nis_0811")
dbListTables(mdb)
dbGetQuery(mdb, "SELECT COUNT(*) FROM nis_0411")
# pare the tables down to the analysis variables; nis_0411, the 8-year
# 2004-2011 table, was assembled from nis_0407 and nis_0811 with the
# same rbind_list pattern (the beginning of the variable list in this
# select() call is elided)
nis_0003b <- select(mdb_nis_0003, # ...,
    PR3, PR4, PR5, PR6, PR7, PR8, PR9, PR10, PR11, PR12, PR13, PR14, PR15,
    RACE, TOTCHG, YEAR)
# write 12-year database table from rbind of 8-year and 4-year table
dbWriteTable(mdb, "nis_0011", rbind_list(as.data.frame(nis_0411, n = -1),
as.data.frame(nis_0003b, n = -1)))
dbListTables(mdb)
dbGetQuery(mdb, "SELECT COUNT(*) FROM nis_0011") # 94,646,462
dbListTables(mdb)
The sqlsurvey package is an adaptation of the well-known and wonderful survey package written by Dr. Thomas Lumley that allows for SQL-based analysis of large survey data sets, and thus allows the kind of out-of-memory analyses, in this case for complex surveys, we are aiming for. It is listed as experimental. In my experience, when it works it returns results extremely close to the standard survey package. When it doesn't work, it simply fails. The biggest drawback is that it has only a limited number of available functions.
Install and load sqlsurvey. (Note that it's not a standard or default installation.) Establish a connection to the out-of-memory MonetDB database containing the NEDS injury file. Check the file and fields. Note the speed with which the survey object is created: it should take some matter of minutes. Try creating a standard survey object from a database that large, assuming you could read the file into R in the first place.
Create a sqlsurvey object from the MonetDB injury table. See http://rpackages.ianhowson.com/rforge/sqlsurvey/ and http://rpackages.ianhowson.com/rforge/sqlsurvey/man/sqlsurvey.html for details.
install.packages("sqlsurvey", repos = "http://R-Forge.R-project.org")
library(sqlsurvey)
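A sketch of the object creation, with heavy caveats: the argument names follow my reading of the sqlsurvey() help page of this era, and the NEDS design variables (hosp_ed as the cluster id, neds_stratum, discwt) and database path are assumptions to check before use.

# all variable names and the path below are assumptions
neds.svy <- sqlsurvey(id = "hosp_ed", strata = "neds_stratum",
    weights = "discwt", key = "id", table.name = "neds_06_12",
    database = "~/neds/monet", driver = MonetDBLite())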
The following code illustrates the use of dplyr and the standard survey package to account for year-to-year trends in a large multi-year out-of-memory MonetDB table. The methods are based on work by Bieler, et al. (Bieler GS, Brown GG, Williams RL, Brogan DJ. Estimating model-adjusted risks, risk differences, and risk ratios from complex survey data. American Journal of Epidemiology 2010;171(5):618-23) and the US CDC (http://www.cdc.gov/healthyyouth/yrbs/pdf/yrbs_conducting_trend_analyses.pdf). The code for the actual trend analyses was taken whole-cloth from Anthony D'Amico's very informative site (http://www.asdfree.com/2015/11/statistically-significant-trends-with.html). I used this approach to report results in Traumatic injury in the United States: In-patient epidemiology 2000-2011.
Connect to the MonetDB database. Connect and create dplyr tables of both the full data set and a subset of injury discharges.
# create connection to folder that holds the database
mdb <- dbConnect(MonetDBLite(), "~/2000_2011/monet")
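# (a sketch: the table-connection lines were not shown; the injury table
# name "inj_0011" and the selected variables are assumptions based on
# the code in the next section)
library(survey)
mdb_src <- src_monetdb(embedded = "~/2000_2011/monet")
inj_0011 <- tbl(mdb_src, "inj_0011") # injury-discharge subset
nis_0011 <- tbl(mdb_src, "nis_0011") # full 12-year table
inj <- select(inj_0011, KEY, FEMALE, AGE, count)
nis <- select(nis_0011, KEY, DISCWT, NIS_STRATUM, HOSPID, YEAR)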
glimpse(inj)
glimpse(nis)
# merge files
inj.long<-full_join(as.data.frame(inj, n=-1), as.data.frame(nis, n=-1),
by="KEY")
class(inj.long) # dataframe
nrow(inj.long) #95049907
table(inj.long$YEAR)
svydes<- svydesign(
id = ~HOSPID ,
strata = ~interaction(NIS_STRATUM , YEAR), # note YEAR interaction
weights = ~DISCWT ,
nest = TRUE,
data = inj.long
)
injsvy<-subset(svydes, count==1)
# determine how many, if any, joinpoints are necessary for trend analysis
# by running regressions testing the contrasts
# test each contrast in sequence
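# (a sketch: the contrast-building code was not shown; following the
# asdfree approach, attach orthogonal polynomial contrasts across the 12
# years; object names match the summaries below, details are assumptions)
c12 <- contr.poly(12) # one column per polynomial term, years 2000-2011
injsvy <- update(injsvy,
    linearContr = c12[YEAR - 1999, 1],
    quadContr = c12[YEAR - 1999, 2],
    cubeContr = c12[YEAR - 1999, 3])
linyear <- svyglm(AGE ~ FEMALE + linearContr, design = injsvy)
quadyear <- svyglm(AGE ~ FEMALE + linearContr + quadContr, design = injsvy)
cubeyear <- svyglm(AGE ~ FEMALE + linearContr + quadContr + cubeContr,
    design = injsvy)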
summary(linyear)
# Call:
# svyglm(formula = AGE ~ FEMALE + linearContr, design = injsvy)
#
# Survey design:
# subset(svydes, count == 1)
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 46.42117 0.19437 238.829 <2e-16 ***
# FEMALE 19.59507 0.05964 328.578 <2e-16 ***
# linearContr 5.25335 0.63692 8.248 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for gaussian family taken to be 606.9774)
#
# Number of Fisher Scoring iterations: 2
summary(quadyear)
# Call:
# svyglm(formula = AGE ~ FEMALE + linearContr + quadContr, design = injsvy)
#
# Survey design:
# subset(svydes, count == 1)
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 46.42221 0.19448 238.704 <2e-16 ***
# FEMALE 19.59129 0.05899 332.088 <2e-16 ***
# linearContr 5.25985 0.63851 8.238 <2e-16 ***
# quadContr 0.97005 0.66038 1.469 0.142
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for gaussian family taken to be 606.8987)
#
# Number of Fisher Scoring iterations: 2
summary(cubeyear)
# Call:
# svyglm(formula = AGE ~ FEMALE + linearContr + quadContr + cubeContr,
# design = injsvy)
#
# Survey design:
# subset(svydes, count == 1)
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 46.41798 0.19403 239.228 <2e-16 ***
# FEMALE 19.59091 0.05921 330.867 <2e-16 ***
# linearContr 5.26371 0.63792 8.251 <2e-16 ***
# quadContr 0.96476 0.66040 1.461 0.144
# cubeContr -0.68867 0.63917 -1.077 0.281
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for gaussian family taken to be 606.8593)
#
# Number of Fisher Scoring iterations: 2
Since there is a single linear trend in these data, we don’t have to run a joinpoint analysis.
We can plot the data and present the results of the regression.
meanAGE<-svymean(~AGE, injsvy, na.rm=T)
meanAGE # mean 56.215, SE 0.2309
confint(meanAGE)
# 2.5 % 97.5 %
# AGE 55.76277 56.66788
# trend model with YEAR as a continuous linear term
# (fit line reconstructed from the Call below)
linyear <- svyglm(AGE ~ FEMALE + YEAR, design = injsvy)
summary(linyear)
# Call:
# svyglm(formula = AGE ~ FEMALE + YEAR, design = injsvy)
#
# Survey design:
# subset(svydes, count == 1)
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -834.60967 106.81186 -7.814 6.04e-15 ***
# FEMALE 19.59507 0.05964 328.578 < 2e-16 ***
# YEAR 0.43931 0.05326 8.248 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for gaussian family taken to be 606.9774)
#
# Number of Fisher Scoring iterations: 2
confint(linyear)
# 2.5 % 97.5 %
# (Intercept) -1043.957076 -625.2622596
# FEMALE 19.478185 19.7119536
# YEAR 0.334916 0.5436986
Figure 1: Mean Age by Year, Injury Discharges 2000-2011
5.3.1 Trauma Centers Adjusted for Year-to-year Variation
We now repeat this kind of approach to look at the year-to-year number of trauma centers
in the United States based on HCUP NIS survey estimates.
# reduce the number of variables from the full 12-year injury and nis
# tables using dplyr syntax on the tbl connection objects
inj<-select(inj_0011, KEY, FEMALE, DIED, AGE, count, region,
ageGrp, level.one, HOSP_TEACH, AWEEKEND, severe, Charlson)
nis<-select(nis_0011, KEY, DISCWT, NIS_STRATUM, HOSPID, YEAR)
# merge files
inj.long<-full_join(as.data.frame(inj, n=-1), as.data.frame(nis, n=-1), by="KEY")
svydes<- svydesign(
id = ~HOSPID ,
strata = ~interaction(NIS_STRATUM , YEAR),
weights = ~DISCWT ,
nest = TRUE,
data = inj.long
)
injsvy<-subset(svydes, count==1)
# estimated total of trauma-center discharges
# (call reconstructed from the output)
tot.center <- svytotal(~level.one, injsvy)
tot.center
#             total     SE
# level.one 2399216 186904
confint(tot.center)
# 2.5 % 97.5 %
# level.one 2032891 2765541
# proportion severe by trauma-center status
# (svyby call reconstructed; exact arguments are an assumption)
centerSevere1 <- svyby(~severe, ~level.one, injsvy, svymean)
confint(centerSevere1)
# 2.5 % 97.5 %
# 0 0.2358120 0.2462456
# 1 0.3802098 0.4084600
# case fatality by trauma-center status
# (svyby call reconstructed; exact arguments are an assumption)
centerCFR <- svyby(~DIED, ~level.one, injsvy, svymean)
confint(centerCFR)
# 2.5 % 97.5 %
# 0 0.02228137 0.02310309
# 1 0.03307262 0.03581808
# logistic model of death on trauma-center status
# (fit line reconstructed from the Call below)
centerCFR.glm <- svyglm(DIED ~ level.one, injsvy, family = binomial(logit))
summary(centerCFR.glm)
# Call:
# svyglm(formula = DIED ~ level.one, injsvy, family = binomial(logit))
#
# Survey design:
# subset(svydes, count == 1)
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -3.762779 0.009452 -398.08 <2e-16 ***
# level.one 0.429450 0.023225 18.49 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1.000053)
#
# Number of Fisher Scoring iterations: 6
exp(coef(centerCFR.glm))
# (Intercept) level.one
# 0.02321912 1.53641261
#
exp(confint(centerCFR.glm))
# 2.5 % 97.5 %
# (Intercept) 0.02279292 0.0236533
# level.one 1.46804142 1.6079680
# (fit line reconstructed from the Call in the output below)
centerCFR.glmLinContr <- svyglm(DIED ~ level.one + AGE + FEMALE + region +
    HOSP_TEACH + AWEEKEND + severe + Charlson + linearContr, injsvy,
    family = binomial(logit))
summary(centerCFR.glmLinContr)
# Call:
# svyglm(formula = DIED ~ level.one + AGE + FEMALE + region + HOSP_TEACH +
# AWEEKEND + severe + Charlson + linearContr, injsvy, family = binomial(logit))
#
# Survey design:
# subset(svydes, count == 1)
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -6.4326594 0.0262064 -245.461 < 2e-16 ***
# level.one 0.2198277 0.0223437 9.838 < 2e-16 ***
# AGE 0.0239986 0.0002884 83.217 < 2e-16 ***
# FEMALE -0.3935988 0.0089987 -43.740 < 2e-16 ***
# regionNortheast 0.0548135 0.0198443 2.762 0.00575 **
# regionSouth 0.1588829 0.0195357 8.133 4.63e-16 ***
# regionWest 0.0708024 0.0219623 3.224 0.00127 **
# HOSP_TEACH 0.3580994 0.0159756 22.415 < 2e-16 ***
# AWEEKEND 0.0458233 0.0091747 4.995 5.99e-07 ***
# severe 1.9051679 0.0123445 154.333 < 2e-16 ***
# Charlson 0.2620184 0.0028396 92.273 < 2e-16 ***
# linearContr -0.6185967 0.0260916 -23.709 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 0.7989355)
#
# Number of Fisher Scoring iterations: 7
exp(coef(centerCFR.glmLinContr))
# (fit line reconstructed from the Call in the output below)
centerCFR.glmAdj <- svyglm(DIED ~ level.one + AGE + FEMALE + region +
    HOSP_TEACH + AWEEKEND + severe + Charlson + YEAR, injsvy,
    family = binomial(logit))
summary(centerCFR.glmAdj)
# Call:
# svyglm(formula = DIED ~ level.one + AGE + FEMALE + region + HOSP_TEACH +
# AWEEKEND + severe + Charlson + YEAR, injsvy, family = binomial(logit))
#
# Survey design:
# subset(svydes, count == 1)
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 97.3111639 4.3729642 22.253 < 2e-16 ***
# level.one 0.2198277 0.0223437 9.838 < 2e-16 ***
# AGE 0.0239986 0.0002884 83.217 < 2e-16 ***
# FEMALE -0.3935988 0.0089987 -43.740 < 2e-16 ***
# regionNortheast 0.0548135 0.0198443 2.762 0.00575 **
# regionSouth 0.1588829 0.0195357 8.133 4.63e-16 ***
# regionWest 0.0708024 0.0219623 3.224 0.00127 **
# HOSP_TEACH 0.3580994 0.0159756 22.415 < 2e-16 ***
# AWEEKEND 0.0458233 0.0091747 4.995 5.99e-07 ***
# severe 1.9051679 0.0123445 154.333 < 2e-16 ***
# Charlson 0.2620184 0.0028396 92.273 < 2e-16 ***
# YEAR -0.0517297 0.0021819 -23.709 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 0.7989355)
#
# Number of Fisher Scoring iterations: 7
exp(coef(centerCFR.glmAdj))[-1]
exp(confint(centerCFR.glmAdj))[-1,]
# 2.5 % 97.5 %
# level.one 1.1924797 1.3016341
# AGE 1.0237101 1.0248680
# FEMALE 0.6628305 0.6866286
# regionNortheast 1.0160468 1.0982387
# regionSouth 1.1281664 1.2179536
# regionWest 1.0281458 1.1205814
# HOSP_TEACH 1.3865073 1.4761111
# AWEEKEND 1.0282325 1.0658849
# severe 6.5598849 6.8851210
# Charlson 1.2923379 1.3068033
# YEAR 0.9455334 0.9536551
# overall case-fatality ratio (call reconstructed; an assumption)
tot.CFR <- svyratio(~DIED, ~count, injsvy)
confint(tot.CFR)
# 2.5 % 97.5 %
# DIED/count 0.0236305 0.02448734
# (fragments: the statement stored in the string "create", a COPY
# INTO ... NULL AS '' query, was truncated)
# dbSendQuery(mdb, create)
# dbRemoveTable(mdb, "nis_1998_core")
Working with dplyr was also not always a bed of roses. Using plyr::rbind.fill seemed ideal because the data sets differ in variables from year to year. But rbind.fill via DBI::dbWriteTable returns a truncated table, and dplyr::rbind_all does not seem to work at all. A less useful (but still acceptable) approach involves removing columns to make the tables identical and using a straight-ahead UNION ALL SQL operation. But this requires knowing SQL well enough to parse and address the inevitable error messages.
# combining dataframes in R memory and writing table failed,
# insufficient memory
nis_core_98_12 <- rbind.fill(nis.1998.core, nis.1999.core,
nis.2000.core, nis.2001.core, nis.2002.core, nis.2003.core,
nis.2004.core, nis.2005.core, nis.2006.core, nis.2007.core,
nis.2008.core, nis.2009.core, nis.2010.core,
nis.2011.core, nis.2012.core)
# SQL CREATE TABLE ... UNION ALL failed: differing column names
dbGetQuery(mdb,
"CREATE TABLE nis_0411 AS
SELECT *
FROM nis_0407
UNION ALL
SELECT *
FROM nis_0811
WITH DATA")
# NB: need to include the WITH DATA statement at the end
# to actually populate the table
# even restricting to the intersection of the field names failed
intersect(dbListFields(mdb, "nis_0407"), dbListFields(mdb, "nis_0811"))