A Comparison of HDF5, Zarr, and netCDF4 in Performing Common I/O Operations

Sriniket Ambatipudi
Dulles High School
Houston, TX, USA
Email: [email protected]

Suren Byna
Lawrence Berkeley National Laboratory
Berkeley, CA, USA
Email: [email protected]

arXiv:2207.09503v2 [cs.DC] 5 Feb 2023

Abstract—Scientific data is often stored in files because of the simplicity they provide in managing, transferring, and sharing data. These files are typically structured in a specific arrangement and contain metadata to understand the structure the data is stored in. There are numerous file formats in use in various scientific domains that provide abstractions for storing and retrieving data. With the abundance of file formats aiming to store large amounts of scientific data quickly and easily, a question that arises is, "Which scientific file format is best suited for a general use case?" In this study, we compiled a set of benchmarks for common file operations, i.e., create, open, read, write, and close, and used the results of these benchmarks to compare three popular formats: HDF5, netCDF4, and Zarr.

Note: This paper is currently a work in progress, and our results are representative of a general-purpose use case in which datasets are small in size and do not use optimizations such as HDF5 or netCDF4 chunking, asynchronous I/O, or subfiling. We welcome any comments or suggestions regarding the benchmark located at https://github.com/asriniket/File-Format-Testing.

GENERAL TERMS / KEYWORDS

Scientific File Formats, HDF5, netCDF4, Zarr

I. INTRODUCTION

With the rapid advancement of experiments, observations, and simulations in recent years, various domains of science are producing enormous amounts of data. For example, the Large Hadron Collider (LHC) experiments at CERN produce 90 petabytes of data per year [1]. NASA's Climate Data Services (CDS) simulate our planet's weather and climate models from hours to millennia and produce datasets up to petabytes in volume [2].

Much of the data produced in scientific experiments, observations, and simulations is stored in files with various formats. Scientific file formats offer a medium to store scientific data for long-term processing, which is of great importance to researchers. Each file format has specific characteristics that make it suited for a particular use case, and this makes choosing the appropriate file format an important task. The typical content of scientific files includes data with a structure and metadata describing the structure and the data within. Data in these files are often structured as arrays [3]. The metadata that describes the data often contains the origins of the data, the configurations used in generating or collecting the data, and the location of the data in the file for easy access. As such, scientific file formats often provide functionality for storing and retrieving data and metadata in files.

There are numerous file formats in existence, such as HDF5 [4], netCDF4 [5], ROOT [6], Zarr [7], and many more [8]. Each of these file formats exhibits different performance characteristics and was designed to accomplish a specific task. For example, the High Energy Physics (HEP) community developed the ROOT framework to meet the high-performance requirements for multithreaded read and write operations and to support object-oriented programming. As a result of these design specifications, the ROOT framework is used by CERN in its research with the Large Hadron Collider [9]. On the other hand, file formats such as netCDF4 or HDF5 are often used in more general use cases because of their self-describing capabilities, which allow the storing of metadata to describe the data within a file. Such self-describing capabilities allow these file formats to be used in a multitude of applications; HDF5, for example, is used in astronomy, medicine, physics, and many other fields [4].

Because there is a multitude of file formats available to store scientific data, the question arises of which file format is best suited for a general use case. Previous research [10, 11] has mainly focused on testing the performance characteristics of individual file formats; for example, HDF5 was tested for its performance in reading a subset of a large array [10], and netCDF4's performance characteristics were analyzed [11]. The first test revealed that when working with an HDF5 file in Python, the fastest way to read data is to memory-map the file with NumPy, bypassing the HDF5 Python API (h5py). Memory mapping involves mapping a file's contents into memory, which means that data within a file can be accessed if the location of the data in terms of an offset is known. The hierarchical structure of file formats such as HDF5 means that accessing the data within is relatively simple, as the file's metadata stores the location of individual datasets. This test was useful in analyzing the shortcomings of HDF5 in a particular use case, allowing potential users to reconsider whether the HDF5 file format would be best suited for their needs. In the second experiment, conducted by the HDF Group, the netCDF4 file format was tested for its performance characteristics and compared to its predecessor, netCDF3. The results of this experiment showed that netCDF4 generally had slower write speeds than netCDF3, but it had faster read speeds due to netCDF4's use of the HDF5 library internally.
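The memory-mapping approach from [10] can be illustrated with a short sketch; the file name and values here are our own, not taken from that study. Once a value's byte offset within the file is known, it can be read without going through h5py.

```python
import numpy as np

# Write ten 32-bit floats to a raw binary file (illustrative file name).
data = np.arange(10, dtype=np.float32)
data.tofile("raw_values.bin")

# Map the file into memory; elements are paged in only when accessed,
# so a value at a known offset can be read without parsing the file.
view = np.memmap("raw_values.bin", dtype=np.float32, mode="r")
third = float(view[3])  # → 3.0
```

The same idea applies to the contiguous payload of an HDF5 dataset, provided its offset within the file is obtained from the file's metadata first.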
This type of testing is useful for analyzing the performance characteristics of one file format and its shortcomings in specific use cases, but when the performance characteristics of one file format must be compared with those of other file formats, a benchmark offers itself as a viable option because it allows for the objective measurement of the speed at which each file format is able to perform a specific task. Such a benchmark has the potential to be a valuable asset to researchers, as it allows them to choose the file format that is not only suited for their particular use case but also performs the best in comparison to its alternatives. This gives researchers fast, convenient access when storing and modifying data, allowing more time and effort to be put elsewhere in their project. In this work, we developed a benchmark to compare the read and write speeds of three multipurpose scientific file formats (HDF5, netCDF4, and Zarr). This benchmark writes randomized data to a specified number of datasets within a file and measures the time taken to write the data to each dataset and the time taken to read the contents of each dataset, allowing objective comparisons to be drawn between the three file formats' performance in different operations.

In the remainder of the paper, we first provide, in §II, a brief background on the three file formats we used in our evaluation. In §III, we describe the read and write benchmarks we used in the evaluation. In §IV, we provide details of the system we used for comparing and evaluating the performance of the three file formats under different workloads.

II. BACKGROUND

A. HDF5

HDF5, or Hierarchical Data Format 5, is a file format designed to store a large amount of data in an organized manner. Typically characterized by a .hdf5 or .h5 file extension, this file format stores data in a manner very similar to that of a file system. Its primary data models are groups and datasets. Groups are the overarching structure, and they can hold other groups or datasets. Datasets store raw data values of a specified data type and are usually stored within groups [12]. A feature of HDF5 is that it is able to store data consisting of different data types within the same file [13]. As mentioned earlier, this file format is self-describing, meaning that all the groups and datasets within the file contain metadata describing their contents. This allows for the data within the file to be mapped in memory, provided the API supports it. Generally speaking, users use the HDF5 API to issue commands to a lower-level driver, which is in charge of accessing the file and performing the requested operations [14]. Because the file format is open source, there is widespread API support across most modern languages (Python, C++, and Java).

B. netCDF4

netCDF4 is a file format that is designed to store array-oriented data and is characterized by a .netc file extension. It stores data in a manner similar to HDF5, with groups serving as the overarching data structure. Within a group, there can be other groups or variables. Variables are akin to HDF5 datasets. Unlike HDF5 datasets, netCDF4 variables cannot be resized once they are created [15]. To circumvent this, variables can be declared with an unlimited size in a specified dimension. Similar to HDF5, netCDF4 is a self-describing file format, meaning that groups and variables both contain metadata describing their contents. Unlike its predecessor, netCDF3, netCDF4 uses HDF5 as its backend, allowing it to achieve faster read times [11].

C. Zarr

Zarr is a file format that is designed to store large arrays of data and is characterized by a .zarr file extension. Because it is based on NumPy, it is geared mainly towards Python users. Similar to both HDF5 and netCDF4, Zarr is a hierarchical, self-describing file format that has groups as the overarching file structure. Each group contains datasets, which represent multidimensional arrays of a homogeneous data type. Furthermore, the API for this file format was designed to be similar to h5py (HDF5's Python API), and as a result, it includes functions based on h5py's functions, namely the group creation function [16]. One advantage of using Zarr is that it provides multiple options to store data, allowing a user to store a file in memory, in the file system, or in other storage systems with a similar interface to the first two options [16].
III. BENCHMARKS

As a benchmark is being used to compare the performance of the file formats, it must only test features that are supported by all the file formats being tested. To accomplish this task, we programmed our benchmark in Python, which means we rely on the Python APIs for each file format being tested to perform the requested operations. Our benchmark compares the time taken to create a dataset, write data to a dataset, and finally open that dataset at a later time and read its contents. This can be categorized into two main types of operations: the writing operation and the reading operation. Both are very important features to test in a file format, as the end goal of a file format is to store data for long-term processing. Faster write and read times are not only indicative of better performance characteristics but also have a tangible effect on an end-user's workflow, as less time is spent performing operations that are not directly relevant to the task at hand.

To allow for greater flexibility when benchmarking the file formats, we added a configuration system in which the user is able to specify testing parameters, such as the number of datasets to create within the file and the dimensions of the array that will be written to each dataset, by editing a .yaml configuration file. After the benchmark is done, the program stores the times taken across multiple trials in a .csv file and plots the data in the .csv file with matplotlib.pyplot to allow a user to make a definitive comparison between the file formats being tested. Below, the main operations, the write operation and the read operation, are discussed in depth.
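Based on the parameters listed in §IV, such a configuration file might look like the following sketch; the key names here are inferred from the parameter tables and may differ from those in the actual benchmark.

```yaml
# Hypothetical benchmark configuration; key names inferred from §IV.
# (Test Name is generated automatically, so it does not appear here.)
File Name: 2048-Vector
Number Datasets: 2048
Number Elements: [128]
```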
A. Write Benchmark

The write operation is the first operation to be tested in the benchmark. It creates files with the filename as specified in the configuration file and the extensions .hdf5 for HDF5 files, .netc for netCDF4 files, and .zarr for Zarr files. Each file is placed inside a folder named Files/ to help reduce clutter in the working directory. Taking information from the configuration file, a sample data array is generated with the specified dimensions and length. This sample data array consists of randomly generated 32-bit floats. Then, the program creates a dataset within the file and writes the sample data array to the dataset. This process of generating a sample data array, creating a dataset, and populating it with the values from the sample data array is repeated until the benchmark has created the number of datasets specified by the configuration file. After the file is populated with data, the benchmark copies the file to a directory named Files Read/ and renames it to avoid any caching effects that may interfere with the read times. There are numerous ways to mitigate such caching effects, such as waiting for an extended period of time, but simply moving the file to another directory and renaming it is the quickest and easiest way to keep caching from interfering with the times taken to read from the file. The time taken to create all the datasets and the time taken to populate them with data are each divided by the number of datasets to find the average time taken to create and populate one dataset. Both of these times are then returned to the main program, where they are written to the .csv output file.
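The create and write measurements can be sketched as follows for the HDF5 case. This is an illustrative reimplementation using h5py and time.perf_counter, not the paper's actual code; the file and dataset names, the dataset count, and the shape are our own.

```python
import time
import numpy as np
import h5py

num_datasets = 8   # illustrative; the paper uses 2048 to 8192
shape = (128,)     # "Number Elements" from the configuration

create_time = 0.0
write_time = 0.0

with h5py.File("benchmark.hdf5", "w") as f:
    for i in range(num_datasets):
        # Sample data: randomly generated 32-bit floats.
        data = np.random.rand(*shape).astype(np.float32)

        # Time dataset creation separately from the actual write.
        t0 = time.perf_counter()
        dset = f.create_dataset(f"dataset_{i}", shape=shape, dtype=np.float32)
        create_time += time.perf_counter() - t0

        t0 = time.perf_counter()
        dset[...] = data
        write_time += time.perf_counter() - t0

# Average per-dataset times, as reported in the benchmark's .csv output.
avg_create = create_time / num_datasets
avg_write = write_time / num_datasets
```

The netCDF4 and Zarr variants follow the same pattern with their respective APIs.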
B. Read Benchmark

The benchmark now opens the copied file in the Files Read/ directory and begins testing the read operations of the three file formats. This operation consists of opening each dataset within the file and printing its contents to the standard output. The time taken to open all the datasets and the time taken to read from all the datasets are once again divided by the number of datasets within the file to find the average time taken to open and read one dataset. Both of these times are then returned to the main program, where they are also written to the .csv output file.

This process of running the write operation benchmark and the read operation benchmark is then repeated multiple times in order to ensure the consistency of the data gathered. To avoid filling up the disk with generated test files, the Files/ and Files Read/ directories are deleted between trials. Finally, the data from the .csv file are averaged out with pandas and plotted with matplotlib.pyplot to allow for visualizing a comparison between the tested file formats in a given operation.
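The open and read measurements mirror the write phase. Again, this is an illustrative sketch for the HDF5 case rather than the paper's code; the file is populated inline here so the example is self-contained, and all names are our own.

```python
import time
import numpy as np
import h5py

# Illustrative setup: a small file to read back.
with h5py.File("benchmark_read.hdf5", "w") as f:
    for i in range(8):
        f[f"dataset_{i}"] = np.random.rand(128).astype(np.float32)

open_time = 0.0
read_time = 0.0

with h5py.File("benchmark_read.hdf5", "r") as f:
    names = list(f.keys())
    for name in names:
        # Time the dataset lookup ("open") separately from the read.
        t0 = time.perf_counter()
        dset = f[name]
        open_time += time.perf_counter() - t0

        t0 = time.perf_counter()
        values = dset[...]  # materialize the dataset contents in memory
        read_time += time.perf_counter() - t0

avg_open = open_time / len(names)
avg_read = read_time / len(names)
```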
IV. PERFORMANCE EVALUATION

A. Experimental setup

The three file formats were tested on a computer running Ubuntu 18.04.5 with an Intel(R) Xeon(R) Silver 4215R CPU, 196 gigabytes of RAM, and 960 gigabytes of solid-state storage provided by a Micron 5200 Series SSD. The version of h5py used to test the HDF5 file format was 3.6.0. The version of netCDF4 used to test the netCDF4 file format was 1.5.8. The version of zarr used to test the Zarr file format was 2.11.0.

The benchmark parameters used in each run of the test are listed below. Note that the Test Name parameter is automatically generated by the benchmark and is used to create the generated plot's title.

Test Name                                   File Name      Number Datasets   Number Elements
2048 Datasets of [128] Elements             2048-Vector    2048              [128]
2048 Datasets of [128, 128] Elements        2048-Matrix    2048              [128, 128]
2048 Datasets of [128, 128, 128] Elements   2048-Tensor    2048              [128, 128, 128]
2048 Datasets of [256] Elements             2048-Datasets  2048              [256]
4096 Datasets of [256] Elements             4096-Datasets  4096              [256]
8192 Datasets of [256] Elements             8192-Datasets  8192              [256]

B. Data

Fig. 1: 2048 Datasets of [128] Elements. (a) Create / Open Times; (b) Read / Write Times.
Fig. 2: 2048 Datasets of [128, 128] Elements. (a) Create / Open Times; (b) Read / Write Times.
Fig. 3: 2048 Datasets of [128, 128, 128] Elements. (a) Create / Open Times; (b) Read / Write Times.
Fig. 4: 2048 Datasets of [256] Elements. (a) Create / Open Times; (b) Read / Write Times.
Fig. 5: 4096 Datasets of [256] Elements. (a) Create / Open Times; (b) Read / Write Times.
Fig. 6: 8192 Datasets of [256] Elements. (a) Create / Open Times; (b) Read / Write Times.

C. Discussion

Figure 1 shows the results when 2,048 datasets are created and populated with a one-dimensional array containing 128 32-bit floats. This graph shows that the time taken to create and open a dataset in netCDF4 is much lower than in HDF5 or Zarr. In comparison to Zarr, HDF5 takes less time to create a dataset but slightly more time to open one. When it comes to writing to datasets or reading from datasets, HDF5 and Zarr share very similar times in both operations, with netCDF4 trailing by a large margin.

Figure 2 shows the results when 2,048 datasets are created and populated with a two-dimensional array containing 128 elements in each dimension. The results from this test are almost identical to the results from the previous test, both in terms of the trend and the time taken to complete each operation.

Figure 3 shows the results when 2,048 datasets are created and populated with a three-dimensional array containing 128 elements in each dimension. The results from this test follow the same trend as the past two tests, but the times taken to complete each operation are almost double the times taken in the past two tests.

These past three bar graphs show the tests in which the number of datasets is held constant while increasing the number of dimensions in the data array; the next three bar graphs involve increasing the number of datasets while holding the size of the data array constant, in order to measure the effect of increasing the number of datasets on file-format performance.

Figure 4 shows the results when 2,048 datasets are created and populated with a one-dimensional array containing 256 elements. The results from this test mirror those from Figure 1, and this is to be expected, as the number of datasets in both tests is the same, with the size of each dataset varying slightly.

Figure 5 shows the results when 4,096 datasets are created and populated with a one-dimensional array containing 256 elements, and Figure 6 shows the results when 8,192 datasets are created and populated with a one-dimensional array containing 256 elements. Both graphs are almost identical to Figure 4, meaning that the number of datasets most likely has no impact on the average time taken to perform the requested operations.

D. Write Benchmark Discussion

The results of this benchmark show a general trend: when creating a dataset, netCDF4 takes the least time, followed by HDF5, which is followed by Zarr. When actually writing data to a file, HDF5 takes the least time to write data to a dataset and is followed by Zarr, which is followed by netCDF4, the latter taking on average more than double the time of HDF5.

E. Read Benchmark Discussion

The read benchmarks show results similar to those from the write benchmark. netCDF4 takes the least time to open a dataset and is followed by Zarr, which is followed by HDF5. When reading the data by printing the dataset values to the standard output, HDF5 takes the least time to read a dataset and is followed by Zarr, which is followed by netCDF4.

V. CONCLUSIONS

In this paper, we demonstrated a method by which the performance of a file format can be compared to that of another file format through the running of a benchmark that tests performance in operations like create, open, read, write, and close. This paper focused specifically on benchmarking three file formats: HDF5, netCDF4, and Zarr, as these three file formats are considered general-purpose scientific file formats due to their storing of various types of data in a hierarchical manner, similar to a file system.

The benchmark was conducted in Python due to the language's widespread use in numerous scientific applications, and as such, the Python API for each file format was tested. To determine the performance of a file format, the time taken to create a dataset, write data to the dataset, open the dataset once the file is closed, and read data from the dataset was measured and plotted in a bar graph. The results of the benchmark show that HDF5 is fastest in reading from or writing to a dataset, netCDF4 is fastest in creating or opening a dataset, and Zarr generally trails right behind HDF5 in performance.

Future work for this benchmark would include: expanding support to other programming languages, as this would reveal any potential bottlenecks within the language-specific API for a file format; testing more file formats in order to better determine which file format is the fastest; and testing more aspects of a file format, which may include testing performance in specific scenarios (e.g., reading a small subset of a dataset, or overwriting a dataset). The code for the benchmark can be found here: https://github.com/asriniket/File-Format-Testing. We are further evaluating the overheads of the Python APIs on the observed performance, as well as the impact of caching. Considering the small size of the data, the observed results may have been impacted by caching. This caching effect will typically be reduced when the data sizes are in gigabytes (GB). We also note that many applications work with smaller amounts of data than we used in this study. We encourage readers to try out the benchmarks provided in the GitHub repository and contribute any optimizations.

VI. ACKNOWLEDGMENTS

This effort was supported in part by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research (ASCR) under contract number DE-AC02-05CH11231 with LBNL.

REFERENCES

[1] CERN. CERN - Storage. URL: https://home.cern/science/computing/storage.
[2] NASA. NASA's Climate Data Services (CDS). URL: https://www.nccs.nasa.gov/services/climate-data-services.
[3] Arie Shoshani and Doron Rotem. "Scientific data management. Challenges, technology, and development". In: Scientific Data Management: Challenges, Technology, and Deployment (Dec. 2009). DOI: 10.1201/9781420069815.
[4] HDFGroup. The HDF5® Library & File Format - The HDF Group. URL: https://www.hdfgroup.org/solutions/hdf5/.
[5] UCAR/Unidata. Unidata — NetCDF. URL: https://www.unidata.ucar.edu/software/netcdf/.
[6] CERN. ROOT: analyzing petabytes of data, scientifically. URL: https://root.cern/.
[7] Zarr Developers. Zarr. URL: https://zarr.readthedocs.io/en/stable/.
[8] Wikipedia. List of file formats. URL: https://en.wikipedia.org/wiki/List_of_file_formats#Scientific_data_(data_exchange).
[9] Barbara Warmbein. Big data takes ROOT. URL: https://home.cern/news/news/computing/big-data-takes-root.
[10] Cyrille Rossant. Moving away from HDF5. URL: https://cyrille.rossant.net/moving-away-hdf5/.
[11] Choonghwan Lee, MuQun Yang, and Ruth Aydt. NetCDF-4 Performance Report. URL: https://support.hdfgroup.org/pubs/papers/2008-06_netcdf4_perf_report.pdf.
[12] Leah A. Wasser. Hierarchical Data Formats - What is HDF5? URL: https://www.neonscience.org/resources/learning-hub/tutorials/about-hdf5.
[13] HDFGroup. Introduction to HDF5. URL: https://portal.hdfgroup.org/display/HDF5/Introduction+to+HDF5.
[14] HDFGroup. Chapter 3: The HDF5 File. URL: https://support.hdfgroup.org/HDF5/doc/UG/FmSource/08_TheFile_favicon_test.html.
[15] UCAR/Unidata. NetCDF4 API Documentation. URL: https://unidata.github.io/netcdf4-python/.
[16] Zarr Developers. Zarr. URL: https://zarr.readthedocs.io/en/stable/tutorial.html#groups/.