netCDF and Xarray

This document discusses the netCDF and xarray libraries for working with climate data. netCDF is a data format and library for storing scientific data, including multidimensional climate data arrays with metadata. Data can be stored in chunks for efficient access. Xarray builds on NumPy and provides labeled multidimensional arrays called DataArrays, which are grouped into Datasets along shared dimensions, making operations and selection more intuitive. It supports remote data access through its netcdf4-python backend and larger-than-memory arrays through Dask.

netCDF and xarray

Ezequiel Cimadevilla Álvarez



Santander Meteorology Group


ETSI Caminos, Department of Applied Mathematics and Computer Sciences
University of Cantabria
Avenida de los Castros s/n
39005 Santander, Spain
http://www.meteo.unican.es

Máster Data Science/Ciencia de Datos - 2020/2021


Agenda
● Multidimensional arrays in climate science
● netCDF
○ Library
○ Data model
● xarray
○ Labelled multidimensional arrays
Multidimensional arrays in climate science
● Climate data is the output of climate models and/or observations
● Climate data is multidimensional (e.g. tas.npy)
● IPCC Sixth Assessment Report (AR6)
● X (longitude), Y (latitude), T (time) - 3D
○ precipitation or mean sea level pressure
● X (longitude), Y (latitude), Z (level), T (time) - 4D
○ temperature at isobaric levels or wind speed profile
● X (longitude), Y (latitude), Z (level), T (time), E (realization) - 5D
○ ensembles of multiple models, initializations and parameters
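As a rough illustration of these layouts, the sketch below builds NumPy arrays with the 3D, 4D and 5D shapes listed above. The grid sizes and the axis order (realization, time, level, latitude, longitude) are assumptions made for the example; the slides do not prescribe them.

```python
import numpy as np

# Hypothetical grid sizes, chosen only for illustration.
n_time, n_level, n_lat, n_lon, n_member = 12, 5, 18, 36, 3

# 3D: (T, Y, X), e.g. precipitation
pr = np.zeros((n_time, n_lat, n_lon))

# 4D: (T, Z, Y, X), e.g. temperature at isobaric levels
ta = np.zeros((n_time, n_level, n_lat, n_lon))

# 5D: (E, T, Z, Y, X), e.g. an ensemble of realizations
ens = np.zeros((n_member, n_time, n_level, n_lat, n_lon))

print(pr.ndim, ta.ndim, ens.ndim)

# With plain NumPy, the meaning of each axis lives only in the
# programmer's head: averaging over time means remembering that
# time is axis 0 in this particular array.
pr_time_mean = pr.mean(axis=0)
print(pr_time_mean.shape)
```

Nothing in the arrays themselves records which axis is time or latitude, which is the gap the rest of the slides address.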
● Climate data is distributed by modelling institutions
○ ESGF (Earth System Grid Federation)
○ University of Cantabria owns a tier 2 data node ;)
● To avoid huge file sizes, datasets are often split by time and by variable

[Figure: a Dataset and the Files in the Dataset]
● Metadata is a crucial requirement for scientific data
● What can you tell from the tas.npy dataset?
○ Global metadata - Institution? Model? Experiment?
○ Time - Units? Calendar (365 or 360 days)? Leap years?
○ Temperature - Units? Missing values? Mean, median, max/min? Point or cell?
● “Information about a document rather than document content” (Raggett, Le Hors & Jacobs, 1999)
○ Units, coordinate systems, model and institution information...
● File name can provide metadata
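A minimal sketch of how a name can carry metadata, using the dataset identifier that appears later in these slides. The field names are an assumption based on the CMIP6 naming vocabulary and are not given in the slides themselves.

```python
# Dataset identifier taken from the slides.
dataset_id = "CMIP6_ScenarioMIP_CSIRO_ACCESS-ESM1-5_ssp585_r1i1p1f1_Amon"

# Assumed field names, following the CMIP6 naming vocabulary.
fields = [
    "mip_era",
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "member_id",
    "table_id",
]

# Underscores separate the fields; hyphens stay inside a field
# (for example the model name ACCESS-ESM1-5).
metadata = dict(zip(fields, dataset_id.split("_")))

for key, value in metadata.items():
    print(f"{key}: {value}")
```

A name like this answers the "Institution? Model? Experiment?" questions, but it says nothing about units, calendars or missing values, so it can only complement metadata stored with the data.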
● Open tas.npy. What can you tell from it?
● We can use NumPy's np.save and np.load to share our multidimensional arrays, but...
○ no metadata is available
○ no coordinates are available
○ the on-disk representation is not optimal
■ only C-contiguous or Fortran-contiguous layouts
■ no chunked storage
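The limitations above can be seen in a short round trip. This is a sketch with a small stand-in array (the real tas.npy is not reproduced here, and the file names are made up for the example).

```python
import os
import tempfile

import numpy as np

# Small stand-in for tas.npy; the values and shape are arbitrary.
tas = np.arange(24, dtype="float32").reshape(2, 3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path_c = os.path.join(tmp, "tas_c.npy")
    path_f = os.path.join(tmp, "tas_f.npy")

    # Default: C-contiguous (row-major) layout on disk.
    np.save(path_c, tas)
    loaded_c = np.load(path_c)

    # Fortran-contiguous (column-major) layout is the only alternative.
    np.save(path_f, np.asfortranarray(tas))
    loaded_f = np.load(path_f)

# Shape and dtype survive the round trip...
print(loaded_c.shape, loaded_c.dtype)

# ...but there are no units, no coordinates and no dimension names:
# the loaded object is a bare ndarray.
print(type(loaded_c).__name__)
print(loaded_c.flags["C_CONTIGUOUS"], loaded_f.flags["F_CONTIGUOUS"])
```

The .npy format records only the shape, the dtype and whether the data is in C or Fortran order, so anything else about the data has to travel separately.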
We want two different things

● Data model and disk storage format for multiple chunked multidimensional
arrays with metadata (netCDF)
● Library for semantic data analysis (xarray)
netCDF
Network Common Data Form

● Developed by Unidata
● “a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data”
● C, Python and Java APIs
● Data Model
○ Groups, Variables, Dimensions, Attributes, Data types
● Classic vs version 4 data models
○ Files written in the netCDF-4 format are valid HDF5 files
○ Alternative backends can be implemented (NCZarr)
● netCDF files contain multiple multidimensional arrays (variables) and metadata
○ See the source of tas.npy
● netCDF variables can be stored on disk using a contiguous or a chunked layout
● netCDF files can be opened using the netCDF4-python library
● Well integrated within the Python ecosystem
● Format used for international climate research projects (CMIP6, CORDEX)
○ ESGF (Earth System Grid Federation)
CF Conventions

● Metadata that provide a definitive description of what the data in each variable
represents, and the spatial and temporal properties of the data.
● Interoperability between applications that are “CF compliant”
● Standard name table for variables
● See CMIP6_ScenarioMIP_CSIRO_ACCESS-ESM1-5_ssp585_r1i1p1f1_Amon again
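To make the role of CF metadata concrete, the sketch below attaches a few CF-style attributes to a tiny in-memory xarray Dataset and lets xarray interpret the time axis with xr.decode_cf. The data values and grid are invented for the example; the attribute names (standard_name, units, calendar) come from the CF Conventions.

```python
import numpy as np
import xarray as xr

# Raw dataset as it might be stored: time is just integers, and its
# meaning is carried by the "units" and "calendar" attributes.
raw = xr.Dataset(
    data_vars={
        "tas": (
            ("time", "lat"),
            np.full((3, 2), 280.0),
            {"standard_name": "air_temperature", "units": "K"},
        ),
    },
    coords={
        "time": (
            "time",
            [0, 1, 2],
            {"units": "days since 2000-01-01", "calendar": "standard"},
        ),
        "lat": (
            "lat",
            [-45.0, 45.0],
            {"standard_name": "latitude", "units": "degrees_north"},
        ),
    },
)

# A CF-aware application can turn the encoded integers into dates.
decoded = xr.decode_cf(raw)

print(raw["time"].values)
print(decoded["time"].values)
print(decoded["tas"].attrs["standard_name"], decoded["tas"].attrs["units"])
```

Because the interpretation relies only on agreed attribute names, any CF-compliant tool can read the same file without knowing who produced it.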
xarray
● Xarray introduces labels in the form of dimensions, coordinates and attributes
on top of raw NumPy-like arrays
● More intuitive, more concise, and less error-prone developer experience
● Real-world datasets are usually more than just raw numbers
○ Labels which encode information about how the array values map to locations in
space, time, etc.
● Apply operations over dimensions by name: x.sum('time')
● Select values by label (or logical location) instead of integer location:
x.loc['2014-01-01'] or x.sel(time='2014-01-01')
● Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array
broadcasting) based on dimension names, not shape.
● Easily use the split-apply-combine paradigm with groupby:
x.groupby('time.dayofyear').mean()
● Database-like alignment based on coordinate labels that smoothly handles
missing values: x, y = xr.align(x, y, join='outer')
● Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs
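The operations listed above can be tried on a small synthetic array. This is a sketch: the array x, its values, its coordinates and the helper arrays are invented for the example, while the method calls are the ones named in the list.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2014-01-01", periods=4, freq="D")
x = xr.DataArray(
    np.arange(8.0).reshape(4, 2),
    dims=("time", "lat"),
    coords={"time": time, "lat": [10.0, 20.0]},
    attrs={"units": "K"},
)

# Operation over a dimension by name instead of by axis number.
total = x.sum("time")

# Selection by label instead of integer position.
first_day = x.sel(time="2014-01-01")

# Broadcasting by dimension name: "level" does not exist in x,
# so the result gains it as a new dimension.
by_level = xr.DataArray([1.0, 10.0, 100.0], dims="level")
outer = x * by_level

# Split-apply-combine with groupby.
daily = x.groupby("time.dayofyear").mean()

# Database-like alignment: y covers only the last two days, so the
# outer join pads it with missing values for the first two.
y = x.isel(time=slice(2, 4))
x_aligned, y_aligned = xr.align(x, y, join="outer")

print(total.values)
print(outer.dims)
print(y_aligned.isnull().sum().item())
print(x.attrs)
```

None of these calls refers to an axis number, so the same code keeps working if the array is stored with its dimensions in a different order.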

Two core data structures, which build upon and extend the core strengths of NumPy
and pandas. Both data structures are fundamentally N-dimensional:

● DataArray is the implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez’s datarray project, which prototyped a similar data structure.
● Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.
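A minimal sketch of the two structures and their pandas counterparts. The variable names, values and the dataset title are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2020-01-01", periods=3, freq="D")
lat = [-45.0, 45.0]
coords = {"time": time, "lat": lat}

# DataArray: one labeled N-dimensional array.
tas = xr.DataArray(
    np.full((3, 2), 280.0),
    dims=("time", "lat"),
    coords=coords,
    name="tas",
    attrs={"units": "K"},
)
pr = xr.DataArray(
    np.zeros((3, 2)),
    dims=("time", "lat"),
    coords=coords,
    name="pr",
    attrs={"units": "kg m-2 s-1"},
)

# Dataset: a dict-like container of DataArrays sharing dimensions.
ds = xr.Dataset({"tas": tas, "pr": pr}, attrs={"title": "Hypothetical example"})

# A 1-D DataArray maps onto a pandas.Series...
series = tas.sel(lat=45.0).to_series()

# ...and a Dataset maps onto a pandas.DataFrame indexed by (time, lat).
frame = ds.to_dataframe()

print(list(ds.data_vars))
print(dict(ds.sizes))
print(series.shape, frame.shape)
```

Indexing the Dataset by name, as in ds["tas"], returns the corresponding DataArray, which is what makes it behave like a dictionary.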
● Heavily inspired by the netCDF data model
○ However, it does not model groups, just Dataset and DataArray
○ Multiple backends: netcdf4-python, zarr
● Support for DataArrays that do not fit into memory via Dask
● Support for remote data analysis via DAP (Data Access Protocol)
○ Because it uses netcdf4-python as a backend
○ Only the requested data is sent over the network