mcs308_data_format_lecture
• Data formats commonly encountered in climate research fall into 3 generic
categories: GRIB, netCDF and HDF.
• All of these formats are machine independent and self-describing. Self-describing
files can be examined and read by the appropriate software without the user
knowing the file's structural details.
• Metadata describing the file's contents are always included.
• Typical metadata may include textual information about each variable's contents
and units (e.g., "specific humidity" and "g/kg") or numerical information describing the
coordinates (e.g., time, level, latitude, longitude) that apply to the variables in the
file.
DIFFERENT FILES AND VARIATIONS
• GRIB1: GRIdded Binary (Edition 1), World Meteorological Organization
• GRIB2: GRIdded Binary (Edition 2), World Meteorological Organization
• netCDF3: Network Common Data Form, (Version 3.x), Unidata (UCAR/NCAR)
• netCDF4: Network Common Data Form, (Version 4.x), Unidata (UCAR/NCAR)
• HDF4: Hierarchical Data Format, (Version 4.x), NCSA/NASA
• HDF4-EOS2: HDF4-Earth Observing System, (Version 2; georeferenced data)
• HDF5: Hierarchical Data Format, (Version 5.x), NCSA/NASA
• HDF5-EOS5: HDF5-Earth Observing System, (Version 5; georeferenced data)
• GeoTIFF: Georeferenced raster imagery
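In practice, these formats can often be told apart by the first few bytes of the file. The following is a minimal sketch, not a complete detector: the function names `guess_format` and `guess_file_format` are made up for this example, only offset 0 is checked (some files carry the signature further in), and a netCDF-4 file is reported as HDF5 because it uses HDF5 as its container.

```python
# Leading bytes ("magic numbers") that identify each container format.
SIGNATURES = [
    (b"GRIB", "GRIB"),
    (b"CDF\x01", "netCDF classic"),
    (b"CDF\x02", "netCDF 64-bit offset"),
    (b"\x89HDF\r\n\x1a\n", "HDF5 (also the container for netCDF-4)"),
    (b"\x0e\x03\x13\x01", "HDF4"),
    (b"II*\x00", "TIFF/GeoTIFF (little-endian)"),
    (b"MM\x00*", "TIFF/GeoTIFF (big-endian)"),
]


def guess_format(first_bytes: bytes) -> str:
    """Return a format name based on the leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if first_bytes.startswith(magic):
            return name
    return "unknown"


def guess_file_format(path: str) -> str:
    """Read the first 8 bytes of a file and guess its format."""
    with open(path, "rb") as f:
        return guess_format(f.read(8))
```

The GRIB edition (1 or 2) cannot be read from the signature alone; it is stored a few bytes later in the record header.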
GRIB-GRIDDED BINARY
• GRIB is a file format for the storage and transport of gridded meteorological data, such as
Numerical Weather Prediction model output. It is designed to be self-describing, compact and
portable across computer architectures. The GRIB standard was designed and is maintained by the
World Meteorological Organization.
• Over the years, the WMO issued three editions of the GRIB standard:
• GRIB Edition 0: now obsolete, unsupported, and rarely used.
• GRIB 1: no longer the most current WMO GRIB edition; the format of Edition 1 has been frozen
and will receive no further enhancements. However, due to its usage in the World Area Forecast System
of the ICAO, it is still recognized by the WMO. In the medium term, the Canadian Meteorological
Centre (CMC) will no longer produce data in this format.
• GRIB 2 (GRIB2): a significant modernization and broadening of the GRIB standard. It is being
phased in by the ECMWF and some national Numerical Weather Prediction institutions, notably in
the US and Europe. Edition 2 is not backward-compatible with Edition 1.
• A GRIB file contains one or more data records, arranged as a
sequential bit stream. Each record begins with a header, followed
by packed binary data.
• The header contains information about:
• the qualitative nature of the data (field, level, date of production, forecast
valid time, etc),
• the header itself (meta-information on header length, header byte usage,
presence of optional sub-headers),
• the method and parameters to be used to decode the packed data,
• the layout and geographical characteristics of the grid the data is to be
plotted on.
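To make the idea of a record header concrete, the sketch below decodes only the very first part of a GRIB record (the indicator section). It assumes the standard layout: the characters "GRIB", the edition number in the 8th byte, a 3-byte total length in Edition 1, and a discipline byte plus an 8-byte total length in Edition 2. The function name `parse_indicator` is made up for this example; real work should use a GRIB library rather than hand-written parsing.

```python
import struct


def parse_indicator(section0: bytes) -> dict:
    """Decode the indicator section at the start of one GRIB record."""
    if len(section0) < 8 or section0[:4] != b"GRIB":
        raise ValueError("not the start of a GRIB record")

    edition = section0[7]  # 8th byte holds the edition in both GRIB1 and GRIB2
    if edition == 1:
        # Bytes 5-7: total record length as a 3-byte big-endian integer.
        total_length = int.from_bytes(section0[4:7], "big")
        return {"edition": 1, "total_length": total_length}
    if edition == 2:
        if len(section0) < 16:
            raise ValueError("GRIB2 indicator section needs 16 bytes")
        # Byte 7: discipline; bytes 9-16: total length as an 8-byte integer.
        (total_length,) = struct.unpack(">Q", section0[8:16])
        return {"edition": 2, "discipline": section0[6], "total_length": total_length}
    raise ValueError(f"unsupported GRIB edition: {edition}")
```

Because every record states its own total length, a reader can hop from record to record through the bit stream without decoding the packed data.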
HDF- HIERARCHICAL DATA FORMAT
• Hierarchical Data Format (HDF) is a data file format designed by the National Center
for Supercomputing Applications (NCSA) to assist users in the storage and manipulation
of scientific data across diverse operating systems and machines.
• There are two distinct varieties of HDF, known as HDF (version 4 and earlier) and the
newer HDF5.
• HDF files are also self-describing. For each data object in an HDF file, there are
predefined tags that identify such information as the type of data, the amount of data,
its dimensions, and its location in the file.
• The self-describing capability of HDF files has important implications for
processing scientific data. It makes it possible to fully understand the
structure and contents of a file just from the information stored in the file
itself.
• A program that has been written to interpret certain tag types can scan a
file containing those tag types and process the corresponding data.
• Self-description also means that many types of data can be bundled in an
HDF file. For example, it is possible to accommodate symbolic, numerical,
and graphical data in one HDF file.
GEOTIFF
• GeoTIFF is a public domain metadata standard that enables
georeferencing information to be embedded within an image file.
• The GeoTIFF format embeds geospatial metadata into image files
such as aerial photography, satellite imagery, and digitized maps
so that they can be used in GIS applications.
• A GeoTIFF file contains geographic metadata that
describes the actual location in space that each pixel in an image
represents.
• In creating a GeoTIFF file, spatial information is included in the .tif file as
embedded tags, which can include raster image metadata such as:
• horizontal and vertical datums
• spatial extent, i.e. the area that the dataset covers
• the coordinate reference system (CRS) used to store the data
• spatial resolution, measured in the number of independent pixel values per unit
length
• the number of layers in the .tif file
• ellipsoids and geoids - estimated models of the Earth’s shape
• mathematical rules for map projection to transform data from a three-dimensional
space into a two-dimensional display
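The embedded tags listed above live in the ordinary TIFF tag directory. The sketch below looks for the GeoTIFF-specific tag numbers in the first image directory of a classic TIFF file. It is an illustration only: the function name `geotiff_tags` is made up, BigTIFF is not handled, and the tag values themselves are not decoded. GIS libraries do all of this for you.

```python
import struct

# Tag numbers reserved for GeoTIFF georeferencing information.
GEOTIFF_TAGS = {
    33550: "ModelPixelScaleTag",
    33922: "ModelTiepointTag",
    34264: "ModelTransformationTag",
    34735: "GeoKeyDirectoryTag",
    34736: "GeoDoubleParamsTag",
    34737: "GeoAsciiParamsTag",
}


def geotiff_tags(data: bytes) -> list:
    """Return the names of GeoTIFF tags found in the first image directory."""
    # The first two bytes give the byte order used in the rest of the file.
    byte_order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if byte_order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(byte_order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    # The directory starts with an entry count, followed by 12-byte entries.
    (n_entries,) = struct.unpack(byte_order + "H", data[ifd_offset:ifd_offset + 2])
    found = []
    for i in range(n_entries):
        start = ifd_offset + 2 + 12 * i
        (tag,) = struct.unpack(byte_order + "H", data[start:start + 2])
        if tag in GEOTIFF_TAGS:
            found.append(GEOTIFF_TAGS[tag])
    return found
```

For a file on disk, the bytes could be obtained with `open("scene.tif", "rb").read()`, where `scene.tif` is a placeholder name. A plain TIFF without georeferencing returns an empty list.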
NETCDF- NETWORK COMMON DATA FORM
• NetCDF (Network Common Data Form) is a set of software libraries and machine-independent
data formats that support the creation, access, and sharing of array-oriented scientific data. It
is also a community standard for sharing scientific data.
• NetCDF data is:
• Self-Describing. A netCDF file includes information about the data it contains.
• Portable. A netCDF file can be accessed by computers with different ways of storing integers, characters,
and floating-point numbers.
• Scalable. A small subset of a large dataset may be accessed efficiently.
• Appendable. Data may be appended to a properly structured netCDF file without copying the dataset or
redefining its structure.
• Shareable. One writer and multiple readers may simultaneously access the same netCDF file.
• Archivable. Access to all earlier forms of netCDF data will be supported by current and future versions of
the software.
DATA PROCESSING AND MANIPULATIONS
CLIMATE DATA OPERATORS (CDO)
• CDO is a collection of command line operators to manipulate and
analyze climate and Numerical Weather Prediction (NWP) model
outputs.
• To add two datasets:
• cdo add ifile1 ifile2 ofile
PIPING IN CDO
• The use of pipes reduces unnecessary disk usage:
• e.g., calculation of the 1961-1990 October-November-December (OND) mean
• Step by step:
• Piping:
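A sketch of what the two variants might look like for the OND example. The commands are assembled in Python so that they can be printed and compared without CDO installed. `selyear`, `selmon` and `timmean` are standard CDO operators; the file names (`ifile`, `ofile`, `tmp_years.nc`, `tmp_ond.nc`) and the helper functions are placeholders chosen for this illustration, and averaging all OND months of 1961-1990 with `timmean` is one assumed reading of "OND mean".

```python
import shlex
import subprocess

YEARS = "selyear,1961/1990"   # keep the years 1961-1990
MONTHS = "selmon,10,11,12"    # keep October, November, December


def step_by_step(ifile, ofile):
    """Three separate cdo calls; each intermediate result is written to disk."""
    return [
        ["cdo", YEARS, ifile, "tmp_years.nc"],
        ["cdo", MONTHS, "tmp_years.nc", "tmp_ond.nc"],
        ["cdo", "timmean", "tmp_ond.nc", ofile],
    ]


def piped(ifile, ofile):
    """One cdo call; chained operators get a leading '-' and run right to left."""
    return [["cdo", "timmean", "-" + MONTHS, "-" + YEARS, ifile, ofile]]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run(step_by_step("ifile", "ofile"))
run(piped("ifile", "ofile"))
```

The piped form produces the same result with a single command and no temporary files, which is where the saving in disk usage comes from.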