Some Comments On The Reliability of NOAA's Storm Events Database
Renato P. dos Santos
ULBRA - Lutheran University of Brazil
PPGECIM - Doctoral Program in Science and Mathematics Education
June 22, 2016
Abstract
Storms and other severe weather events can result in fatalities, injuries, and property
damage. Therefore, preventing such outcomes to the extent possible is a key concern,
and the scientific community faces an increasing demand for regularly updated
appraisals of evolving climate conditions and extreme weather. NOAA's Storm Events
Database is undoubtedly an invaluable resource to the general public, to the
professional, and to the researcher. Given this importance, the primary objective of
this study was to explore the database and look for clues about its reliability. A complete
investigation of the damage estimates and of the injury and fatality figures is unfeasible
owing to the size of the database. However, an exploratory data analysis with the
resources of the R statistical data analysis language found that damage reports are
missing in more than half of the records, that some of the damage values are incorrect,
and that, despite all standardization efforts, non-standard event type names are
still finding their way into the database. These few results are enough to demonstrate
that the database suffers from incompleteness and inconsistencies and should not be
used without reservations and appropriate precautions before drawing any
inferences from the data.
Introduction
Storms and other severe weather events can cause both public health and economic
problems for communities and municipalities. Many serious events can result in
fatalities, injuries, and property damage, and preventing such outcomes to the extent
possible is a key concern.
According to Trenberth, Fasullo, & Shepherd (2015), "the main way in which climate
change is likely to affect societies around the world is through changes in extremes. As
a result, the scientific community faces an increasing demand for regularly updated
appraisals of evolving climate conditions and extreme weather. Such information
would be immensely beneficial for adaptation planning."
This study evolved from an assessment in the Exploratory Data Analysis course,
which is part of the Johns Hopkins Data Science Specialization offered on Coursera. It
involved exploring version 1.0 of the U.S. National Oceanic and Atmospheric
Administration's (NOAA) Storm Events Database, comprising data from 1950-01-03 to
2011-11-30. This database tracks characteristics of major storms and weather events
in the United States, including when and where they occur, as well as estimates of any
fatalities, injuries, and property damage. It is fed with information from a variety of
sources, which include but are not limited to county, state and federal emergency
management officials, local law enforcement officials, Skywarn spotters, NWS damage
surveys, newspaper clipping services, the insurance industry and the general public
(NCDC, 2008).
Storm Data is an official monthly publication of the National Oceanic and Atmospheric
Administration (NOAA) which documents (Murphy, 2016, p. 4):
• The occurrence of storms and other significant weather phenomena having
sufficient intensity to cause public health and/or economic problems
This database is updated monthly and generally lags 90-120 days behind the current
month, which makes it a very accessible data source. It constitutes an invaluable
resource that is heavily used by the general public, insurance adjusters, litigators, and
severe weather climatologists. Thousands of scientific papers have been written using
this database. For recent examples, see Miller, Black, Williams, & Knox (2016) and
Schroeder et al. (2016).
The primary goal of this study was to explore NOAA's Storm Events Database and look
for clues about its reliability.
Data
NOAA's Storm Events Database is available in the form of comma-separated-value
(.csv) files compressed via the bzip2 algorithm to reduce their size. They can be
downloaded from the NOAA web site. The files are named in the form
StormEvents_details-ftp_v1.0_dYYYY_c20160223.csv,
where YYYY indicates the year of the event records.
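As a minimal sketch of the naming pattern above, the file name for a given year can be assembled with sprintf(). The helper name details_file_name is hypothetical, and the creation stamp 20160223 is copied from the pattern quoted in the text; that every yearly file carries the same stamp is an assumption.

```r
# Hypothetical helper: build the details file name for one or more years.
# The default stamp is the one quoted in the text; other files may differ.
details_file_name <- function(year, stamp = "20160223") {
  sprintf("StormEvents_details-ftp_v1.0_d%d_c%s.csv", as.integer(year), stamp)
}

file_names <- details_file_name(c(1950, 2011))
print(file_names)
```

Keeping the stamp as an argument makes it easy to adjust if a file was regenerated on a different date.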
Inspection shows that the database has the following 51 variables:
BEGIN_YEARMONTH, BEGIN_DAY, BEGIN_TIME, END_YEARMONTH, END_DAY,
END_TIME, EPISODE_ID, EVENT_ID, STATE, STATE_FIPS, YEAR, MONTH_NAME,
EVENT_TYPE, CZ_TYPE, CZ_FIPS, CZ_NAME, WFO, BEGIN_DATE_TIME, CZ_TIMEZONE,
END_DATE_TIME, INJURIES_DIRECT, INJURIES_INDIRECT, DEATHS_DIRECT,
DEATHS_INDIRECT, DAMAGE_PROPERTY, DAMAGE_CROPS, SOURCE, MAGNITUDE,
MAGNITUDE_TYPE, FLOOD_CAUSE, CATEGORY, TOR_F_SCALE, TOR_LENGTH,
TOR_WIDTH, TOR_OTHER_WFO, TOR_OTHER_CZ_STATE, TOR_OTHER_CZ_FIPS,
TOR_OTHER_CZ_NAME, BEGIN_RANGE, BEGIN_AZIMUTH, BEGIN_LOCATION,
END_RANGE, END_AZIMUTH, END_LOCATION, BEGIN_LAT, BEGIN_LON, END_LAT,
END_LON, EPISODE_NARRATIVE, EVENT_NARRATIVE, and DATA_SOURCE. They are
described in NOAA (2014).
Data Preparation
This work was done with the resources of the R statistical data analysis language (R
Core Team, 2016), R version 3.3.0 (2016-05-03), on Windows 8.1 x64 (build 9600). All
efforts were made to conform to the best practices of Reproducible Research (Peng,
2011, 2016a).
After installing the needed packages ggplot2 (Wickham, 2009), gridExtra (Auguie, 2016),
scales (Wickham, 2016), readr (Wickham & Francois, 2015a), dplyr (Wickham &
Francois, 2015b), knitr (Xie, 2015), lubridate (Grolemund & Wickham, 2011), XML
(Lang & The CRAN Team, 2016), and pander (Daróczi & Tsegelskyi, 2015), we begin
by downloading the storm database event details .csv files from the NOAA website.
Afterwards, they are read into R.
The only variables relevant to this study are EPISODE_ID, YEAR, MONTH_NAME,
EVENT_TYPE, DAMAGE_PROPERTY, DAMAGE_CROPS, and EPISODE_NARRATIVE;
therefore, only these were read into the dataset.
For convenience, we rename the variables to the more readable EpisodeID, Year,
Month, EventType, PropertyDamage, CropDamage, and Narrative.
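The selection and renaming step can be sketched in base R as follows. This is only an illustration under stated assumptions: the study itself lists readr among its packages, whereas this sketch uses read.csv() to stay self-contained, and the two data rows in csv_text are invented toy values, not records from the database.

```r
# Toy CSV text standing in for one yearly details file (rows are invented).
csv_text <- "EPISODE_ID,EVENT_ID,STATE,YEAR,MONTH_NAME,EVENT_TYPE,DAMAGE_PROPERTY,DAMAGE_CROPS,EPISODE_NARRATIVE
1,10,TEXAS,2010,May,Hail,10K,,Toy narrative A
2,11,OHIO,2010,June,Flood,1.5M,2K,Toy narrative B"

# Mapping from the original column names to the readable names used in the text.
keep <- c(EPISODE_ID = "EpisodeID", YEAR = "Year", MONTH_NAME = "Month",
          EVENT_TYPE = "EventType", DAMAGE_PROPERTY = "PropertyDamage",
          DAMAGE_CROPS = "CropDamage", EPISODE_NARRATIVE = "Narrative")

# Read everything as character so damage strings such as "10K" stay intact.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                na.strings = "", colClasses = "character")

# Keep only the relevant columns and rename them.
storm <- raw[, names(keep)]
names(storm) <- unname(keep)
storm$Year <- as.integer(storm$Year)
```

Reading the damage columns as character is deliberate here, since their magnitude suffixes have to be decoded before any numeric conversion.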
Analysis
Before any further processing, it is advisable to check whether the dataset needs
data cleaning, which is considered an essential part of statistical analysis (de Jonge &
van der Loo, 2013). We check whether the dataset lacks headers or contains wrong data
types (e.g. numbers stored as strings), bad category labels, unknown or unexpected
character encoding, and so on.
Figure 3 shows that, fortunately, the presence of non-standard event type names in
the dataset is negligible, amounting to no more than one-fifth of a percent of the
total number of events each year.
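A check of this kind could be sketched as below. The vector official_types is only an illustrative subset of the permitted event type names, the events data frame holds invented toy rows, and the case-insensitive comparison is an assumption about how names should be matched; the study's own testTypes() function is not reproduced here.

```r
# Illustrative subset of permitted event type names (not the full official list).
official_types <- c("Flood", "Hail", "Thunderstorm Wind", "Tornado")

# Toy events (invented); one 2010 entry uses a non-standard name.
events <- data.frame(
  Year = c(2010, 2010, 2010, 2011, 2011),
  EventType = c("Hail", "HAIL/ICY ROADS", "Tornado", "Flood", "Thunderstorm Wind"),
  stringsAsFactors = FALSE
)

# Flag names that do not match the permitted list, ignoring letter case.
is_nonstandard <- !(tolower(events$EventType) %in% tolower(official_types))

# Share of non-standard names per year.
share <- tapply(is_nonstandard, events$Year, mean)
print(round(100 * share, 1))
```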
Damage values
Property damage refers to damage inflicted on private property (structures, objects,
vegetation) as well as on public infrastructure and facilities.
Tornadoes may contain multiple segments and are reported in Storm Data in separate
segments (NCDC, 2008). This fact may affect the attribution of harmful effects on
population health and the economy to individual tornadoes.
According to NWS Instruction 10-1605 (Murphy, 2016), estimates should be in the
form of US Dollar values and rounded to three significant digits, followed by the
magnitude of the value (e.g., 1.55B for $1,550,000,000). Values used to signify
magnitude include: K for thousand $USD, M for million $USD, and B for billion $USD.
Inspection shows, however, that many other codes were used as magnitudes; the full
set found is K, M, 0, 3, B, 4, 2, 6, h, 5, 1, H, 9, 7, 8. It is, of course, impossible to be
completely sure what the preparer had in mind when entering these codes, but we can
make educated guesses about their meaning, such as h for hundreds and 6 for millions.
We then recalculate the PropertyDamage and CropDamage variables taking these
magnitude codes into account.
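The conversion can be illustrated with a small sketch. The function name parse_damage is hypothetical; the multipliers for K, M, and B follow the instruction quoted above, while H (and h) for hundreds follows the educated guess mentioned in the text. Digit codes are deliberately left unmapped here, so any value without one of these letter codes yields NA.

```r
# Hypothetical helper: turn strings such as "1.55B" into dollar amounts.
parse_damage <- function(x) {
  x <- trimws(as.character(x))
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x) & nzchar(x)

  # Last character is the magnitude code; the rest is the numeric part.
  code <- toupper(substr(x[ok], nchar(x[ok]), nchar(x[ok])))
  value <- suppressWarnings(as.numeric(substr(x[ok], 1, nchar(x[ok]) - 1)))

  # K, M, B per NWS Instruction 10-1605; H for hundreds is a guess.
  multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  out[ok] <- value * unname(multipliers[code])  # unknown codes give NA
  out
}

damage <- parse_damage(c("1.55B", "10K", "2.5M", "3h", NA, ""))
print(damage)
```

Returning NA for unrecognized codes keeps guessed interpretations out of the totals unless they are added to the multiplier table explicitly.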
With the damage values in proper form, it is worth checking for missing
values (coded as NA). Missing values are a problem that plagues any
data analysis, since their presence introduces bias into some calculations or
summaries of the data (Peng, 2015, p. 134).
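A minimal sketch of such a check is given below, using an invented toy data frame rather than the actual dataset; the column names follow the renamed variables used in this study.

```r
# Toy damage columns (values invented) with some missing entries.
damages <- data.frame(
  PropertyDamage = c(1e4, NA, 5e3, NA),
  CropDamage     = c(NA, NA, NA, 2e3)
)

# Percentage of missing values in each column.
missing_pct <- 100 * colMeans(is.na(damages))
print(missing_pct)
```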
Figure 4 shows that this version of NOAA's Storm Events Database is highly incomplete
in terms of property damage and crop damage values, with the percentage of missing
values in the CropDamage variable reaching 62%. This incompleteness is surprising,
as the study of its version 1.0 mentioned above found no missing values in these
variables, but many in the magnitude character code. A reasonable hypothesis is
that the complete rebuild of the database that happened in 2012 discarded all damage
values without definite magnitudes.
Figure 5 summarizes their values.
The violin plots (Hintze & Nelson, 1998) of Figure 5 show that property and crop
damage values concentrate around 10^4, or US $10 thousand. They also exhibit an
outlier property damage value of 1.15 × 10^11, that is, US $115 billion, which is higher
than the damage attributed to Hurricane Katrina, estimated at US $108 billion and
considered the most destructive and costliest natural disaster in the history of the
United States (Knabb, Rhome, & Brown, 2011).
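One simple way to surface such implausible entries is to compare each value against a known benchmark. The sketch below uses the Katrina estimate quoted above as the threshold; the data frame is a toy example in which only the first row reflects a value reported in this study, and the other rows are invented.

```r
# Benchmark from the text: Hurricane Katrina, about US $108 billion.
katrina_total <- 108e9

# Toy records; the first row mirrors the outlier discussed in the text.
records <- data.frame(
  EpisodeID      = c(1203478, 1, 2),
  PropertyDamage = c(1.15e11, 1e4, 7e7)
)

# Flag entries that exceed the benchmark and therefore deserve manual review.
suspect <- records[!is.na(records$PropertyDamage) &
                     records$PropertyDamage > katrina_total, ]
print(suspect)
```

A flag of this kind only marks candidates for inspection; whether a value is actually wrong still has to be judged from the narrative and external sources.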
Inspection shows that this entry corresponds to an event of type FLOOD that
reportedly occurred on January 1, 2006, in Napa, CA. The event narrative states: "Major
flooding continued into the early hours of January 1st, before the Napa River finally
fell below flood stage, and the water receded. Flooding was severe in Downtown Napa
from the Napa Creek, and the City and Parks Department was hit with $6 million in
damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged
businesses with costs of at least $70 million."
Further investigation shows that a flood did take place in Napa on that date, but was
"not as bad at the devastating 1986 storm that caused $100 million in damage"
(Courtney, 2005). As a matter of fact, this event does not show up in NOAA's Billion-
Dollar Weather and Climate Disasters: Table of Events webpage.
A possible interpretation is that the preparer entered the magnitude character
B for billions instead of M for millions, which seems more reasonable given the costs
"of at least $70 million" mentioned above.
Now, let us introduce a new TotDamage variable accounting for the total damage.
Table 1 lists the 10 costliest events recorded in NOAA's Storm Events Database.
Table 1. The top 10 costliest events (in Billion $USD)

EpisodeID   Year   Month       TotDamage
---------   ----   ---------   ---------
  1203478   2006   January           115
    68471   2012   October            25
  1198432   2005   August              7
  1181034   2004   September           5
    50455   2011   April               3
    49972   2011   May                 3
    64742   2012   July                2
    53127   2011   May                 2
  2084790   1998   September           2
    89491   2014   August              2
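A ranking of this kind can be sketched as follows. The data frame is an invented toy example, and two choices are assumptions rather than the study's documented method: a missing property or crop value is treated as zero when forming TotDamage, and event-level totals are summed per EpisodeID.

```r
# Toy event records (invented); several events can share one episode.
events <- data.frame(
  EpisodeID      = c(1, 1, 2, 3),
  PropertyDamage = c(2e9, 1e9, 5e8, NA),
  CropDamage     = c(NA, 5e8, 1e8, 2e8)
)

# Total damage per event; a missing component counts as zero (assumption).
events$TotDamage <- rowSums(events[, c("PropertyDamage", "CropDamage")],
                            na.rm = TRUE)

# Sum per episode, sort in decreasing order, keep the top 10.
by_episode <- aggregate(TotDamage ~ EpisodeID, data = events, FUN = sum)
top <- head(by_episode[order(by_episode$TotDamage, decreasing = TRUE), ], 10)

# Express the totals in billion US dollars.
top$TotDamage <- top$TotDamage / 1e9
print(top)
```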
It should be noted, however, that the amounts shown in Table 2 refer to episode
damages, and an episode may be part of a bigger, more extensive storm system, such as
Hurricane Katrina, that can include many different types of events; they therefore
differ from the total damage estimates shown in the last column of this table.
Besides, a few Billion-Dollar events are missing in comparison with the Billion-Dollar
Weather and Climate Disasters: Table of Events webpage, such as: the 2012 U.S.
Drought/Heatwave ($33.3 billion), the 2004 Hurricane Ivan ($25.8 billion), the 2005
Hurricane Wilma ($23.2 billion), and the 2005 Hurricane Rita ($22.6 billion).
On the other hand, except for the Napa River Flood of 2005/2006, the other episode
estimates are plausibly lower than the corresponding total estimates.
This observation is comforting but does not by itself ensure that the remaining
estimates in NOAA's Storm Events Database are all trustworthy.
Figure 6 lists the 5 most frequent event types in the database.
It may be illustrative to visualize the distribution of damage estimates for the most
frequent event, namely 'thunderstorm wind'.
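Counting event types is straightforward with table(). The sketch below uses an invented toy vector of event type names, not the actual EventType column.

```r
# Toy event type names (invented).
event_type <- c("Thunderstorm Wind", "Hail", "Thunderstorm Wind",
                "Flood", "Hail", "Thunderstorm Wind")

# Frequency of each type, most frequent first; keep at most five.
type_counts <- sort(table(event_type), decreasing = TRUE)
top_types <- head(type_counts, 5)
print(top_types)
```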
The graph in Figure 7 shows that most damage estimates for 'thunderstorm wind'
events are concentrated at the $0 value. Inspection shows that 48,461 'thunderstorm
wind' events have a $0 value, which, on top of the high share of 62% of missing values in
the CropDamage variable in Table 2, suggests a lack of information, as it is difficult to
believe that so many events of this kind could result in no damage at all.
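Because a recorded $0 and a missing value mean different things, it may help to count them separately. The following sketch does so on an invented toy vector standing in for the damage values of one event type.

```r
# Toy damage values for one event type (invented); NA marks a missing report.
tw_damage <- c(0, 0, NA, 5000, 0, NA, 12000)

# Count explicit zeros, missing values, and positive estimates separately.
summary_counts <- c(
  zero     = sum(tw_damage == 0, na.rm = TRUE),
  missing  = sum(is.na(tw_damage)),
  positive = sum(tw_damage > 0, na.rm = TRUE)
)
print(summary_counts)
```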
For instance, Curran, Holle, & López (2000) concluded that "damage reports appear
to be poorly represented" in the Storm Events Database, as a 1989 review of 106
entries with damage values of over $500 million showed them to be erroneously coded
events, which were then changed to the 'unknown' category. With regard to fatalities,
Dixon et al. (2005) concluded that "depending on the database used and the compiling
U.S. agency, completely different results can be obtained." As an example, these
authors affirm that "there are several instances in Storm Data where traffic-related
deaths were classified as directly caused by weather in contrast to the official
guidelines."
Conclusion
NOAA's Storm Events Database is undoubtedly an invaluable resource to the general
public, to the professional, and to the researcher.
A complete investigation of the damage estimates and of the injury and fatality figures
is unfeasible owing to the size of the database.
The few results obtained here, however, are enough to show that the database suffers
from incompleteness and inconsistencies and should not be used without
reservations and appropriate precautions before drawing any inferences from the
data.
References
• Auguie, B. (2016). gridExtra: Miscellaneous Functions for "Grid" Graphics. R
package version 2.2.1. https://CRAN.R-project.org/package=gridExtra
• Courtney, K. (Dec. 31, 2005). Severe flooding hits Napa Valley. Napa Valley
Register. Retrieved on May 15, 2016 from
http://napavalleyregister.com/news/local/article_ebee8598-a512-572a-baea-ca254758614f.html.
• Curran, E. B., Holle, R. L., & López, R. E. (2000). Lightning Casualties and Damages
in the United States from 1959 to 1994. Journal of Climate, 13(19), 3448-3464.
http://doi.org/10.1175/1520-0442(2000)013<3448:LCADIT>2.0.CO;2
• Daróczi, G. & Tsegelskyi, R. (2015). pander: An R Pandoc Writer. R package
version 0.6.0. https://CRAN.R-project.org/package=pander
• de Jonge, E. & van der Loo, M. (2013). An introduction to data cleaning with R. The
Hague: Statistics Netherlands.
• Dixon, P. G., Brommer, D. M., Hedquist, B. C., Kalkstein, A. J., Goodrich, G. B.,
Walter, J. C., ... Cerveny, R. S. (2005). Heat Mortality Versus Cold Mortality: A Study
of Conflicting Databases in the United States. Bulletin of the American
Meteorological Society, 86(7), 937-943. http://doi.org/10.1175/BAMS-86-7-937
• Grolemund, G. & Wickham, H. (2011). Dates and Times Made Easy with lubridate.
Journal of Statistical Software, 40(3), 1-25.
• Hintze, J. L., & Nelson, R. D. (1998). Violin Plots: A Box Plot-Density Trace
Synergism. The American Statistician, 52(2), 181-184.
http://doi.org/10.1080/00031305.1998.10480559
• Knabb, R. D., Rhome, J. R., & Brown, D. P. (14 September, 2011). Hurricane Katrina:
August 23 - 30, 2005. Tropical Cyclone Report. National Hurricane Center. United
States National Oceanic and Atmospheric Administration's National Weather
Service. Retrieved on May 15, 2016 from
http://www.nhc.noaa.gov/data/tcr/AL122005_Katrina.pdf.
• Lang, D. T. & The CRAN Team (2016). XML: Tools for Parsing and Generating XML
Within R and S-Plus. R package version 3.98-1.4.
https://CRAN.R-project.org/package=XML
• Miller, P. W., Black, A. W., Williams, C. A., & Knox, J. A. (2016). Maximum Wind
Gusts Associated with Human-Reported Nonconvective Wind Events and a
Comparison to Current Warning Issuance Criteria. Weather and Forecasting,
31(2), 451-465. http://doi.org/10.1175/WAF-D-15-0112.1
• Murphy, J. D. (Mar 9, 2016). Storm Data Preparation (National Weather Service
Instruction 10-1605). Retrieved on May 24, 2016 from
http://www.nws.noaa.gov/directives/sym/pd01016005curr.pdf.
• Schroeder, A. J., Gourley, J. J., Hardy, J., Henderson, J. J., Parhi, P., Rahmani, V., ...
Taraldsen, M. J. (2016). The development of a flash flood severity index. Journal of
Hydrology. http://doi.org/10.1016/j.jhydrol.2016.04.005
• NOAA (n.d.). The History of the NCDC Storm Events Database. Available at NOAA
site:
http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/The-History-of-the-Storm-Events-Database.docx
• NOAA (Oct 15, 2014). Storm Data Export Format, Field names. Retrieved on May
24, 2016 from
http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/Storm-Data-Export-Format.docx.
• Peng, R. D. (2011). Reproducible Research in Computational Science. Victoria, CA-
BC: Leanpub.
• Peng, R. D. (2016a). Report Writing for Data Science in R. Victoria, CA-BC:
Leanpub.
• Peng, R. D. (2016b). R Programming for Data Science. Victoria, CA-BC: Leanpub.
• R Core Team (2016). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. Available at R site:
https://www.R-project.org/.
• Trenberth, K. E., Fasullo, J. T., & Shepherd, T. G. (2015). Attribution of climate
extreme events. Nature Climate Change, 5(8), 725-730.
• Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag
New York.
• Wickham, H. (2016). scales: Scale Functions for Visualization. R package version
0.4.0. https://CRAN.R-project.org/package=scales
• Wickham, H. & Francois, R. (2015a). readr: Read Tabular Data. R package version
0.2.2. https://CRAN.R-project.org/package=readr
• Wickham, H. & Francois, R. (2015b). dplyr: A Grammar of Data Manipulation. R
package version 0.4.3. https://CRAN.R-project.org/package=dplyr
• Wilbanks, T. J., Fernandez, S. J., & Allen, M. R. (2015). Extreme Weather Events
and Interconnected Infrastructures: Toward More Comprehensive Climate
Change Planning. Environment: Science and Policy for Sustainable Development,
57(4), 4-15.
• Xie, Y. (2015). Dynamic Documents with R and knitr. 2nd edition. Boca Raton:
Chapman and Hall/CRC.
Appendix: Code
# Explore NOAA's Storm Database webpage to extract the file URLs
library(XML)  # provides htmlTreeParse(), xpathApply(), and xmlGetAttr()
NOAAURL <- "http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
NOAAWPage <- htmlTreeParse(NOAAURL, useInternalNodes = TRUE)
# Extract all URLs (href attributes of the <a> elements) in the webpage
URLList <- data.frame(
    fileURL = matrix(unlist(xpathApply(NOAAWPage, "//a",
                                       function(x) xmlGetAttr(x, "href"))),
                     byrow = TRUE),
    row.names = NULL, stringsAsFactors = FALSE)
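# Read each details file and append its rows to stormData
# (URLEDetailsList and NOAARead() are defined in parts of the script not shown)
for (i in seq_len(nrow(URLEDetailsList))) {
    stormData <- rbind(stormData,
                       NOAARead(URLEDetailsList$fileURL[i]))
}
# Keep only the unique event type names in each row
for (i in seq_len(nrow(uniEventType))) {
    uniEventType$EventType[i] <-
        list(as.vector(unique(unlist(totEventType$EventType[i]))))
}
# Apply testTypes() (defined elsewhere) to the unique names and count the results
for (i in seq_len(nrow(difuEventType))) {
    difuEventType$EventType[i] <-
        testTypes(difuEventType$EventType[i])
    difuEventType$EventNumber[i] <-
        length(unlist(difuEventType$EventType[i]))
}
# Accumulate, for each row, the event type names of all previous rows
for (i in 2:nrow(cumEventType)) {
    # Parentheses are needed: 1:i-1 would be parsed as (1:i) - 1
    for (j in 1:(i - 1)) {
        cumEventType$EventType[i] <-
            list(unique(c(unlist(cumEventType$EventType[i]),
                          unlist(cumEventType$EventType[j]))))
    }
}
# Apply testTypes() to all names and compute their share of the total
for (i in seq_len(nrow(diftEventType))) {
    diftEventType$EventType[i] <-
        testTypes(diftEventType$EventType[i])
    diftEventType$EventNumber[i] <-
        length(unlist(diftEventType$EventType[i]))
    diftEventType$EventPercent[i] <-
        diftEventType$EventNumber[i] /
        totEventType$EventNumber[i]
}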
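# Recalculate the damage variables from their magnitude codes.
# ASSUMPTION: the opening of this call is missing from the source; the
# mutate() wrapper and the first lines of the PropertyDamage branch are
# reconstructed by symmetry with the CropDamage branch.
stormData <- mutate(stormData,
    PropertyDamage =
        ifelse(is.na(PropertyDamage), NA,
               as.numeric(substr(PropertyDamage,
                                 1,
                                 nchar(PropertyDamage) - 1)) *
               as.numeric(multipliers[match(
                   substr(PropertyDamage,
                          nchar(PropertyDamage),
                          nchar(PropertyDamage)),
                   multipliers[, 1]), 2])),
    CropDamage =
        ifelse(is.na(CropDamage), NA,
               as.numeric(substr(CropDamage,
                                 1,
                                 nchar(CropDamage) - 1)) *
               as.numeric(multipliers[match(
                   substr(CropDamage,
                          nchar(CropDamage),
                          nchar(CropDamage)),
                   multipliers[, 1]), 2])))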
decreasing = TRUE)])
missingValues <- missingValues[order(missingValues$MissingValues,
                                     decreasing = TRUE), ]
pandoc.table(topDamEpisodes,
             caption = "The top 10 costliest events (in Billion $USD)",
             digits = 1,
             round = 1,
             keep.line.breaks = TRUE,
             justify = c("right", "right", "center", "right"))
decreasing = TRUE)])
Affiliation
Renato P. dos Santos
ULBRA - Lutheran University of Brazil
PPGECIM - Doctoral Program in Science and Mathematics Education
Av. Farroupilha, 8001 - 92425-900 Canoas/RS - Brazil
E-mail: [email protected]
URL: http://www.linkedin.com/in/RenatoPdosSantos