7.8 THE QUALITY CONTROL OF THE INTEGRATED SURFACE HOURLY DATABASE

J. Neal Lott *
National Climatic Data Center, Asheville, North Carolina

* Corresponding author address: J. Neal Lott, National Climatic Data Center, 151 Patton Avenue, Asheville, NC 28801; email [email protected].

ABSTRACT

The National Climatic Data Center (NCDC), in conjunction with the Federal Climate Complex (FCC), developed the global Integrated Surface Hourly (ISH) database to address a pressing need for an integrated global database of hourly land surface climatological data. The database of approximately 20,000 stations has data from as early as 1900 (many stations beginning in the 1948-1973 timeframe), is operationally updated with the latest data, and is now being used by numerous customers in many varied applications. ISH is being quality-controlled in several phases, with two phases now completed. This paper addresses: a) the challenges and lessons learned in ISH development, b) the quality control (QC) applied during the initial development, c) the more extensive QC applied after the initial development, d) the current shortcomings and needs of the database, and e) the future plans for QC and for partnerships.

1. INTRODUCTION

The FCC is comprised of the Department of Commerce's NCDC and two components of the Department of Defense -- the Air Force Combat Climatology Center (AFCCC) and the US Navy's Fleet Numerical Meteorological and Oceanographic Command Detachment (FNMOC Det). The FCC provides much of the Nation's climatological support. Its purpose is to provide a single location for the long-term stewardship of the nation's climatological data, and to give customers the opportunity to request any climatological data product from a single location.

As a result of Environmental Services Data and Information Management funding, Office of Global Programs funding, and extensive contributions from member agencies in the FCC, the NCDC has completed two phases of the ISH database project:

a) The "database build" phase, producing ISH Version 1 -- The new database collects all of the NCDC and Navy surface hourly data (DSI 3280), NCDC hourly precipitation data (DSI 3240), and Air Force Datsav3 surface hourly data (DSI 9956) into one global database. The database totals approximately 350 gigabytes for nearly 20,000 stations, with data from as early as 1900 to the present. Building the database involved extensive research, data format conversions, time-of-observation conversions, and the development of extensive metadata to drive the processing and merging. This included the complex handling of input data stored in three different station-numbering/ID systems. See Figure 1 for a high-level flow diagram of the process.

b) The first two phases of QC, resulting in ISH Version 2 -- Phase one involved quality assurance of the Version 1 database build, to detect and correct any errors identified during this phase (e.g., due to input data file problems). Phase two involved the research, development, and programming of algorithms to correct random and systematic errors in the data, to improve the overall quality of the database, and the processing of the full period-of-record archive through these QC algorithms.

The database has been archived on NCDC's Hierarchical Data Storage System (HDSS, a tape-robotic system). All surface hourly climatic elements are now stored in one consistent format for the full period of record, and the database is operationally updated with the latest data on a routine basis.

Surface hourly data are among the most-used types of climatic data for NOAA customer servicing and research, involving requests both for the hourly data themselves and for applications/products produced from the data. ISH greatly simplifies servicing and use of the data: users no longer have to acquire portions of three datasets with differing formats, and no longer have to deal with, and program around, the inconsistencies and overlaps between the three input datasets.
Also, this has resulted in an end-to-end process for routine database updates; the database is being placed on-line for Web access; and the more recent data have been collected into a CD-ROM product with a map interface.

2. PROJECT CHALLENGES AND ISSUES

Though surface-based meteorological data are the most-used, most-requested type of climatological data, a single integrated database of global surface hourly meteorological observations over land did not previously exist. Researchers requiring surface climatic data often acquired the data from several sources in a variety of formats, greatly complicating the design of their applications and increasing the cost of using the data. For example, when someone needed all available surface hourly data for a selected region (U.S. or worldwide) and time period, they often would receive data from three datasets which differed in format, units of storage, and levels of QC. Alternately, the user would simply choose whichever one of the datasets might meet their requirements, which often resulted in incomplete or inaccurate results.

Many users complained about the problems this created in data usage, and in getting complete, accurate results. Therefore, this project was undertaken to produce a single, integrated, quality-controlled, complete global climatic database of hourly data, for use in numerous applications by private and public researchers, corporations, universities, and government agencies. However, the integration of disparate data sources presented many challenges. The three input datasets were the most logical starting point for ISH, as they were the most-used hourly datasets available, had already been subjected to considerable QC, and had adequate station history information available.

All necessary metadata within the FCC were collected, coordinated, and loaded into a set of relational database tables. The metadata include important information about the data: station histories, dataset documentation, inventories, and other critical information to control the process of merging the data. Since the three input data sources are archived in dissimilar station numbering/identification systems, the metadata had to provide a cross-reference to identify data for the same location (i.e., the same station with data in each of the three input datasets). This station history then controls the overall process flow and data merging, and also must account for station number changes over time.

The station number changes over time were an added challenge, as some stations had from three to as many as six different station numbers identifying the same location. It was critical to merge these data into a single "entity" over time, so that users would have a consistent set of data. Also, in looking ahead (at that time) to having the data on-line in a web-based interface, it was important to consider what would be presented to the user through the interface (i.e., a pop-up list of stations by country, state, etc.).
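As a concrete illustration, the cross-reference amounts to a per-entity table of dataset-specific identifiers with effective date ranges. The sketch below uses invented station numbers and dates; the actual ISH metadata reside in relational tables.

```python
# Illustrative sketch only: a cross-reference mapping one ISH "entity"
# (a single physical location) to the identifiers it has carried in the
# three input station-numbering systems. All IDs and dates are invented.
CROSS_REFERENCE = {
    "ISH-000001": [
        {"dataset": "DSI-9956", "id": "722280", "from": "1948-01-01", "to": "9999-12-31"},
        {"dataset": "DSI-3280", "id": "13876",  "from": "1948-01-01", "to": "1972-12-31"},
        {"dataset": "DSI-3280", "id": "93801",  "from": "1973-01-01", "to": "9999-12-31"},  # renumbered
        {"dataset": "DSI-3240", "id": "015478", "from": "1948-07-01", "to": "9999-12-31"},
    ],
}

def resolve_entity(dataset, station_id, date):
    """Map a (dataset, station number, ISO date) triple to its ISH entity,
    honoring the date ranges so renumbered stations merge correctly."""
    for entity, aliases in CROSS_REFERENCE.items():
        for alias in aliases:
            if (alias["dataset"] == dataset and alias["id"] == station_id
                    and alias["from"] <= date <= alias["to"]):
                return entity
    return None
```

Keeping effective date ranges in the table is what lets a station that changed numbers mid-record still resolve to one consistent entity.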
A time conversion control file was used to convert the NCDC and Navy hourly data (DSI 3280 and 3240) to Greenwich Mean Time (GMT), so that all input data were then in the GMT time convention. (The Air Force surface hourly data (DSI 9956) were already in GMT.) The ISH data are therefore in the same time convention as global upper-air data and many other global databases, model output, satellite data, etc., which is quite important for potential GIS applications. The creation of this time conversion control file was very cumbersome, involving research of several sources of information concerning time zones, historical time zone changes, etc. Fully accounting for time zone changes was required to properly merge the data.
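In outline, the control file reduces to a per-station table of historical UTC offsets. A minimal sketch, with an invented station number, dates, and offsets, might look like this:

```python
from datetime import datetime, timedelta

# Illustrative sketch only: one entry of a time conversion control file,
# giving the UTC offset (hours east of Greenwich) in effect at a station
# for each historical period. The station number, dates, and offsets are
# invented; real entries came from extensive time-zone research.
TIME_CONTROL = {
    "13876": [
        ("1900-01-01", "1954-06-30", -6),  # zone reassignment mid-record
        ("1954-07-01", "9999-12-31", -5),
    ],
}

def to_gmt(station, local_time):
    """Shift a local-standard-time observation to GMT via the control file."""
    day = local_time.strftime("%Y-%m-%d")
    for start, end, offset_hours in TIME_CONTROL[station]:
        if start <= day <= end:
            return local_time - timedelta(hours=offset_hours)
    raise ValueError(f"no time-zone entry for station {station} on {day}")

# Example: 1950-07-01 09:00 local at offset -6 becomes 15:00 GMT.
print(to_gmt("13876", datetime(1950, 7, 1, 9, 0)))
```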
Finally, toward the end of the development phase, the original workstation for development and testing was replaced with a newer 64-bit workstation. Although the change should have been transparent to the ISH system, it was not: many problems began to appear in code that had been working before the change. After much research, no cause could be identified, although the evidence seemed to point to a system memory-utilization problem. Several workarounds were put into place in order for all of the components to function again.

3. QUALITY CONTROL APPLIED INITIALLY

Procedures, algorithms, and then computer programs were written to merge the surface hourly datasets into one common database. More than one billion surface weather observations (covering 1900 to the present) for approximately 20,000 global locations were accessed and merged during this process.
Examples of input data types were: Automated Surface Observing System (ASOS), Automated Weather Observing System (AWOS), Synoptic, Airways, METAR, Coastal-Marine (C-MAN), buoy, and various others. Both military (e.g., USAF, Navy) and civilian stations, and both automated and manual observations, are included in the database.

As part of the "Version 1" building of the database, we included QC checks to ensure that the input data were actually for the same location at the same time before performing the intra-observational merge, which creates a composite observation for each date-time. The QC check was conducted on a daily basis (i.e., on each day's data) to determine whether the data for that day should actually be merged into composite observations. Temperature, dew point, and wind direction were compared for each data value (e.g., temperature at a given station-date-time in DSI 9956 vs. temperature at the same station-date-time in DSI 3280) to obtain a percent score for the day for coincident data. Criteria of 1 degree Celsius for temperature and dew point, and 10 degrees for wind direction, were used as the pass/fail limits for each element, with an overall 70% score for the day required to perform the intra-observational merge for that day. In other words, 70% of the data values compared would have to meet the limit checks to "pass" for the day. Failure of these checks sometimes pointed to time conversion problems, prompting updates to the control file and re-processing of those stations. Subjective analysis of the data before and after processing proved this QC check and the limits applied to be quite effective.
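For illustration, the daily score could be computed as in the following sketch. The record layout is invented; the 1-degree, 10-degree, and 70% limits are those given above, and the wind-direction wraparound handling is an assumed detail.

```python
# A rough illustration of the daily coincidence score described above.
LIMITS = {"temperature": 1.0, "dew_point": 1.0, "wind_direction": 10.0}

def day_merge_ok(pairs, threshold=0.70):
    """pairs: list of (obs_a, obs_b) dicts holding values reported for the
    same station-date-time in two input datasets (e.g., DSI 9956 vs. 3280)."""
    passed = compared = 0
    for obs_a, obs_b in pairs:
        for element, limit in LIMITS.items():
            if element in obs_a and element in obs_b:
                compared += 1
                diff = abs(obs_a[element] - obs_b[element])
                if element == "wind_direction":
                    diff = min(diff, 360.0 - diff)  # 350 vs. 10 deg differ by 20
                if diff <= limit:
                    passed += 1
    # Merge the day's data into composites only if enough values agree.
    return compared > 0 and passed / compared >= threshold
```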
A complete inventory system was included to fully verify that no data were "lost" during the processing. This involved an inventory (i.e., the number of observations by station-month) for each of the input datasets, with the inventories stored in Oracle relational tables. The final "output" ISH database then produced a similar inventory, stored in a database table, and the inventory tables were compared to check for any loss of data. This proved to be a critical component of the process, and revealed a number of problems that would otherwise have gone undetected. The database processing was not considered complete until the inventory verification process was complete. The final inventory also provided a very useful inventory of the ISH archive for use in placing the data online and servicing customers. See Table 1 for an inventory, by WMO block, of the number of stations in the ISH database.
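In outline, the comparison reduces to checking observation counts per station-month. The sketch below assumes a composite observation is kept for every input date-time, so the merged output should never hold fewer observations for a station-month than any single input; the data structures are invented stand-ins for the Oracle tables.

```python
def find_inventory_gaps(input_counts, output_counts):
    """input_counts: {dataset_name: {(station, "YYYY-MM"): n_obs}};
    output_counts: {(station, "YYYY-MM"): n_obs} for the merged ISH data."""
    gaps = []
    for dataset, counts in input_counts.items():
        for key, n_in in counts.items():
            n_out = output_counts.get(key, 0)
            if n_out < n_in:  # possible data loss; investigate this station-month
                gaps.append((dataset, key, n_in, n_out))
    return gaps
```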
Another critical component of the Version 1 database build was the development, processing, and verification of "test data" designed to exercise as many paths through the process as the data might follow. The painstaking work of creating, processing, and checking the test data, though very time-consuming, was critical to the success of the project. This, in conjunction with the checking of actual data from the archive, proved to be very valuable and more than worth the time invested in this component of QC. An added benefit of test data is the periodic re-validation of the system, such as when a source code change or operating system upgrade is required: an automated comparison (e.g., Unix diff) of the baseline output test data vs. the new output test data quickly reveals whether any problems are present.
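This re-validation step is essentially a regression test. A minimal sketch, with placeholder file paths:

```python
import subprocess

# Re-run the system on fixed test input, then diff the new output against
# the archived baseline. A zero exit status from diff means no change.
def outputs_match(baseline_path, new_path):
    """True when the regenerated test output is identical to the baseline."""
    return subprocess.run(["diff", "-q", baseline_path, new_path]).returncode == 0

if not outputs_match("baseline/ish_test.out", "new/ish_test.out"):
    print("Test output differs from baseline -- investigate before release.")
```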
Needless to say, it was also very important to randomly check the results of the final process -- i.e., selected output files -- for any unforeseen problems. This component of the process revealed very few problems, due to the intensive nature of the QC described above.

4. PHASE 2 OF QUALITY CONTROL

To develop the Version 2 database, we researched, developed, programmed, and processed the data through 57 QC algorithms. This phase of QC subjects each observation to a series of validity checks, extreme value checks, internal (within-observation) consistency checks, and temporal (versus another observation for the same station) continuity checks. It may therefore be referred to as an inter/intra-observational QC, and it is entirely automated during the processing stage. However, it does not include any spatial QC ("buddy" checks against nearby stations), which is planned for later development.
An example of one of the algorithms is the continuity check for temperature, which performs a "two-sided" continuity validation on each temperature value for periods ranging from 1 hour to 24 hours. An increase in temperature of 8 degrees Celsius in one hour (e.g., from 10°C to 18°C) prompts a check on the next available (i.e., next reported) temperature; if that value then decreases by at least 8 degrees in an hour (e.g., 18°C to 9°C), that indicates a very improbable "spike" in the data, and the erroneous value (e.g., 18°C) is changed to "missing" for that observation. However, the original value is saved in a separate section of the data record for future use if needed. The same applies to a downward "spike" in the data, and similar checks are performed for periods out to 24 hours, to allow for missing data and for part-time stations which do not report hourly or three-hourly data. The validation always checks the closest values available temporally (i.e., before and after the data point being checked), and the limit is automatically adjusted based on the elapsed time between values. Temporal continuity checks are performed for continuous elements such as temperature, dew point, wind speed, and pressure (station, sea-level, and altimeter setting).
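A simplified sketch of the upward-spike case follows. The 8-degree-per-hour limit is from the text; the record layout and the linear scaling of the limit with elapsed time are assumptions.

```python
SPIKE_LIMIT_PER_HOUR = 8.0  # deg C per hour, from the text

def flag_spikes(series):
    """series: chronological list of [hour, temp_c] pairs; temp_c may be
    None (missing). Flagged values are set to None in place, and the
    originals are returned so they can be saved elsewhere in the record."""
    saved = []
    # Work from the closest reported neighbors on each side of every value.
    reported = [(i, h, t) for i, (h, t) in enumerate(series) if t is not None]
    for (_, h0, v0), (i1, h1, v1), (_, h2, v2) in zip(reported, reported[1:], reported[2:]):
        rise_limit = SPIKE_LIMIT_PER_HOUR * (h1 - h0)  # widen limit with gap
        fall_limit = SPIKE_LIMIT_PER_HOUR * (h2 - h1)
        if (v1 - v0) >= rise_limit and (v1 - v2) >= fall_limit:
            saved.append((h1, v1))   # preserve the original value
            series[i1][1] = None     # mark the improbable spike as missing
    return series, saved

# Example: a 10 -> 18 -> 9 deg C sequence in consecutive hours flags the 18.
obs, originals = flag_spikes([[0, 10.0], [1, 18.0], [2, 9.0]])
```

The actual operational check also handles downward spikes and gaps out to 24 hours, as described above.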
Another example of the algorithms is the consistency check for present weather vs. temperature, which ensures, for example, that frozen precipitation is not reported at unrealistic temperatures. There are a number of these QC checks for various types of present weather reports, and similar checks are performed for various other elements such as cloud data, precipitation amount, snow depth, etc.
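In sketch form, such a check is a simple predicate over the decoded observation; the weather codes and the temperature ceiling below are illustrative assumptions rather than the actual operational limits.

```python
# Present-weather vs. temperature consistency check (illustrative limits).
FROZEN_PRECIP = {"SN", "SG", "PL", "IC"}  # snow, snow grains, ice pellets, ice crystals
MAX_FROZEN_TEMP_C = 10.0                  # assumed physical ceiling

def weather_temp_consistent(present_weather, temperature_c):
    """Return False for observations reporting frozen precipitation at a
    temperature where it is physically implausible (e.g., snow at 15 C)."""
    return not (present_weather in FROZEN_PRECIP
                and temperature_c > MAX_FROZEN_TEMP_C)
```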
Though all climatic elements are checked to some extent, the elements validated most extensively are: wind data, temperature and dew point data, pressure data, cloud data, visibility and present weather data, precipitation amounts, and snowfall and snow depth. In addition, a selected number of systematic deficiencies are addressed with specific algorithms to correct those problems. As mentioned above, the input datasets had already been subjected to a great deal of QC, so this phase of QC was designed to address problems which were less likely to have already been corrected.

The creation and verification of test data for each algorithm was just as critical in this phase as in the creation of ISH Version 1. As mentioned above, an automated comparison of baseline output test data vs. the new output test data quickly reveals whether any problems are present in the system.

We do not consider this to be an "end-all" QC process, but merely the next step in producing a better-quality database for NOAA customers. Detailed documentation on each of these QC algorithms is available (Lott, 2003).

5. FUTURE PLANS

One of the goals for ISH is to have the entire dataset available for query via the NNDC Climate Data Online (CDO) system (Lott and Anders, 2000). With the difficult and tedious task of blending the data from the three sources completed, end-users may then extract what is needed with relative ease and can focus on their research or studies, rather than getting bogged down in the merging process. A second goal is to continue to add to and improve this global baseline database for research and applications requiring data of this type, by merging additional datasets into ISH and by developing and applying more extensive QC checks.

As is the case with most software systems, ISH is designed to evolve, within the limitations of current funding and technology, to reach these goals. Some of the future plans within the overall effort are:

a) Store the entire database in relational tables, with access provided via the NNDC CDO system. This process is well underway.

b) Continue the routine updating of the database using the established procedures and software, and revise the NCDC processing software for U.S. surface data to perform a near real-time (daily) ingest and QC of all surface data into ISH format, thereby providing users with near real-time access to quality-controlled data. This process is also well underway, and is being referred to as the Integrated Surface Data Processing System.

c) As funding permits, add additional datasets to ISH via the merging process. This is now planned for selected U.S. mesonet data and several non-U.S. datasets.

d) As funding permits, add additional station history/metadata to the system, including as much instrumentation information as possible, thereby making the data more useful for climate change research.

e) As funding permits, research, develop, and apply more sophisticated time series and spatial QC checks to ISH, thereby making the data more robust and useful for all applications.

f) Develop partnerships with other government agencies and groups, such as the Regional Climate Centers. This includes partnerships for additional data sources, enhanced QC techniques, and online applications.

6. SUMMARY AND CONCLUSIONS
The development and QC of ISH has been a rather long and arduous process, but well worth the effort. The QC has included the following phases/components (as described in more detail above):

Phase 1:

a) Validation of the merging process through element value comparisons, such as temperature.

b) A complete inventory of all input and output data, to ensure no data loss during the processing.

c) Thorough checking of test data and archive data, to fully test the software before full database processing began.

d) Random checks of the final output database (ISH).

Phase 2:

a) Extreme-value/validity checks, to ensure no obviously erroneous values are present in the data.

b) Temporal continuity checks, to look for "spikes" in continuous elements such as temperature, dew point, wind speed, and pressure (station, sea-level, and altimeter setting).

c) Consistency checks of one element vs. another within a given data record/observation, such as temperature vs. present weather (e.g., no snow at 10°C).

The lessons learned include:

a) Thorough test data are critical to any process such as this. Though proven true in many of the author's previous projects, it certainly proved critical in this one.

b) Peer review is very important, to ensure that the overall process and the individual QC checks are not merely the ideas of one individual, but are consistent with good science and good data-processing standards.

c) The concept of "phases" in a project of this magnitude is critical to success. There is a tendency to "bite off more than we can chew" with any project, and the phased approach was one of the keys to success with ISH.

d) Finally, expect to reprocess. No matter how many checks and balances are in place, further improvements and some reprocessing will be necessary. The key is to limit its frequency while maintaining a willingness to do it when needed.

7. ACKNOWLEDGMENTS

Many people contributed to this project's success. The key members of the team were: Rich Baldwin, NCDC; Vickie Wright, NCDC; Dee Dee Anders, NCDC; Danny Brinegar, NCDC; Neal Lott, NCDC; Pete Jones, TMC Corporation; and Fred Smith, TMC Corporation. As mentioned previously, the Air Force (AFCCC) and Navy (FNMOC Det) also contributed to the effort. Bob Boreman (NCDC), who devoted a great deal of time and effort to ISH, especially in the research and development of the time conversion control file, passed away in July 2001. He is greatly missed, and his contributions to this effort are gratefully acknowledged.

8. REFERENCES

AFCCC. Documentation for Datsav3 Surface Hourly Data. [Asheville, N.C.]: Air Force Combat Climatology Center, 1998.

Lott, Neal. Quality Control of USAF Datsav3 Surface Hourly Data, Versions 7 and 8. [Asheville, N.C.]: USAF Environmental Technical Applications Center, 1991.

Lott, Neal. Data Documentation for Federal Climate Complex Integrated Surface Hourly Data. [Asheville, N.C.]: National Climatic Data Center, 2000.

Lott, Neal. Quality Control Documentation for Federal Climate Complex Integrated Surface Hourly Data. [Asheville, N.C.]: National Climatic Data Center, 2003.

Lott, Neal and Dee Dee Anders. NNDC Climate Data Online for Use in Research, Applications, and Education. Twelfth Conference on Applied Climatology, pages 36-39. American Meteorological Society, May 8-11, 2000, Asheville, NC.

Lott, Neal, Rich Baldwin, and Pete Jones. NCDC Technical Report 2001-01, The FCC Integrated Surface Hourly Database, A New Resource of Global Climate Data. [Asheville, N.C.]: National Climatic Data Center, 2001.
Plantico, Marc and J. Neal Lott. Foreign Weather Data Servicing at NCDC. ASHRAE Transactions 1995, V. 101, Pt. 1.

Steurer, Pete and Matt Bodosky. Data Documentation for Surface Hourly Airways Data, TD3280. [Asheville, N.C.]: National Climatic Data Center, 2000.

Steurer, Pete and Greg Hammer. Data Documentation for Hourly Precipitation Data, TD3240. [Asheville, N.C.]: National Climatic Data Center, 2000.

Table 1 - Database Inventory

The following table provides an inventory by WMO block of the number of stations in the ISH database
having at least one year of data.

WMO Block  Stations    WMO Block  Stations    WMO Block  Stations    WMO Block  Stations


01 297 30 230 58 189 88 36
02 413 31 227 59 149 89 124
03 593 32 102 60 193 91 234
04 105 33 190 61 128 92 12
06 275 34 170 62 125 93 82
07 297 35 143 63 143 94 622
08 165 36 130 64 155 95 163
09 74 37 208 65 123 96 92
10 491 38 202 66 30 97 67
11 226 40 356 67 186 98 85
12 219 41 202 68 270 99 334
13 153 42 225 69 548
14 71 43 151 70 223
15 322 44 100 71 1085
16 280 45 13 72 2113
17 122 46 60 74 700
20 59 47 521 76 261
21 58 48 350 78 244
22 180 50 52 80 130
23 177 51 68 81 27
24 116 52 107 82 144
25 99 53 118 83 290
26 220 54 178 84 101
27 147 55 22 85 163
28 237 56 163 86 57
29 178 57 243 87 183
Figure 1 - ISH Database Build Process Flow
