Earth Observation Using Python
EARTH OBSERVATION
USING PYTHON
A Practical Programming Guide
Rebekah B. Esmaili
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted,
in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as
permitted by law. Advice on how to obtain permission to reuse material from this title is available at
http://www.wiley.com/go/permissions.
The right of Rebekah B. Esmaili to be identified as the author of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at
www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears
in standard print versions of this book may not be available in other formats.
While the publisher and authors have used their best efforts in preparing this work, they make no representations or
warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all
warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose.
No warranty may be created or extended by sales representatives, written sales materials or promotional statements
for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or
potential source of further information does not mean that the publisher and authors endorse the information or
services the organization, website, or product may provide or recommendations it may make. This work is sold with
the understanding that the publisher is not engaged in rendering professional services. The advice and strategies
contained herein may not be suitable for your situation. You should consult with a specialist where appropriate.
Further, readers should be aware that websites listed in this work may have changed or disappeared between when this
work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any
other commercial damages, including but not limited to special, incidental, consequential, or other damages.
10 9 8 7 6 5 4 3 2 1
Introduction
Part I: Overview of Satellite Datasets
1 A Tour of Current Satellite Missions and Products
2 Overview of Python
3 A Deep Dive into Scientific Data Sets
Conclusion
Index
When I first met the author a few years ago, she was eager to become more
involved in the Joint Polar Satellite System’s Proving Ground. The Proving
Ground by definition assesses the impact of a product in the user’s environment;
this intrigued Rebekah because as a product developer, she wanted to understand
the user’s perspective. Rebekah worked with the National Weather Service to
demonstrate how satellite-derived atmospheric temperature and water vapor
soundings can be used to describe the atmosphere’s instability to support severe
weather warnings. Rebekah spent considerable time with users at the Storm Pre-
diction Center in Norman, Oklahoma, to understand their needs, and she found
their thirst for data and the need for data to be easily visualized and understand-
able. This is where Rebekah leveraged her expert skills in Python to provide NWS
with the information they found to be most useful. Little did I know at the time she
was writing a book.
As noted in this book, a myriad of Earth-observing satellites collect critical
information of the Earth’s complex and ever-changing environment and land-
scape. However, today, unfortunately, all that information is not effectively being
used for various reasons: issues with data access, different data formats, and the
need for better tools for data fusion and visualization. If we were able to solve these
problems, then suddenly there would be vast improvements in providing societies
with the information needed to support decisions related to weather and climate
and their impacts, including high-impact weather events, droughts, flooding, wild-
fires, ocean/coastal ecosystems, air quality, and more. Python is becoming the uni-
versal language to bridge these various data sources and translate them into useful
information. Open and free attributes, and the data and code sharing mindset of
the Python communities, make Python very appealing.
Being involved in a number of international collaborations to improve the
integration of Earth observations, I can certainly emphasize the importance
of working together, data sharing, and demonstrating the value of data
fusion. I am very honored to write this Foreword, since this book focuses on these
issues and provides an excellent guide with relevant examples for the reader to
follow and relate to.
This book evolved from a series of Python workshops that I developed with
the help of Eviatar Bach and Kriti Bhargava from the Department of Atmospheric
and Oceanic Science at the University of Maryland. I am very grateful for their
assistance providing feedback for the examples in this book and for leading several
of these workshops with me.
This book would not exist without their support and contributions from
others, including:
The many reviewers who took the time to read versions of this book, several of
whom I have never met in person. Thanks to modern communication systems,
I was able to draw from their expertise. Their constructive feedback and insights
not only helped to improve the quality and breadth of the book but also helped me
hone my technical writing skills.
Rituparna Bose, Jenny Lunn, Layla Harden, and the rest of the team at AGU
and Wiley for keeping me informed, organized, and on track throughout this
process. They were truly a pleasure to work with.
Nadia Smith and Chris Barnet, and my other colleagues at Science and Tech-
nology Corp., who provided both feedback and conversations that helped shape
some of the ideas and content in this book.
Catherine Thomas, Clare Flynn, Erin Lynch, and Amy Ho for their endless
encouragement and support.
Tracie and Farid Esmaili, my parents, who encouraged me to aim high even if
they were initially confused when their atmospheric scientist daughter became
interested in “snakes.”
Earth Observation Using Python: A Practical Programming Guide, Special Publications 75,
First Edition. Rebekah B. Esmaili.
© 2021 American Geophysical Union. Published 2021 by John Wiley & Sons, Inc.
DOI: 10.1002/9781119606925.introduction
Fortran and C/C++. A drawback of printed media is that it tends to be static and
Python is evolving more rapidly than the typical production schedule of a book.
To mitigate this, this book intends to teach fluency in a few well-established packages by detailing the steps and thought processes that a user needs to carry out more advanced studies. The text focuses on discipline-agnostic packages
that are widely used, such as NumPy, Pandas, and xarray, as well as plotting
packages such as Matplotlib and Cartopy.
I have chosen to highlight Python primarily because it is a general-purpose
language, rather than being discipline or task-specific. Python programmers
can script, process, analyze, and visualize data. Python’s popularity does not
diminish the usefulness and value of other languages and techniques. As with
all interpreted programming languages, Python may run more slowly compared
to compiled languages like Fortran and C++, the traditional tools of the trade.
For instance, some steps in data analysis could be done more succinctly and with
greater computational efficiency in other languages. Also, underlying packages in
Python often rely on compiled languages, so an advanced Python programmer can
develop very computationally efficient programs with popular packages that are
built with speed-optimized algorithms. While not explicitly covered in this book,
emerging packages such as Dask can be helpful to process data in parallel, so more
advanced scientific programmers can learn to optimize the speed performance of
their code. Python interfaces with a variety of languages, so advanced scientific
programmers can compile computationally expensive processing components
and run them using Python. Then, simpler parts of the code can be written in
Python, which is easier to use and debug.
This book encourages readers to share their final code online with the broader
community, a practice more common among software developers than scientists.
However, it is also good practice to write code and software in a thoughtful and
carefully documented manner so that it is usable for others. For instance, well-
written code is general purpose, lacks redundancy, and is intuitively organized
so that it may be revised or updated if necessary. Many scientific programmers
are self-learners with a background in procedural programming, and thus their
Python code will tend to resemble the flow of a Fortran or IDL program. This
text uses Jupyter Notebook, which is designed to promote good programming
habits in establishing a “digestible code” mindset; this approach organizes code
into short chunks. This book focuses on clear documentation in science algorithms
and code. Topics include version control, the use of virtual environments, how to structure a usable README file, and what to include in inline commenting.
For most environmental science endeavors, data and code sharing are part of
the research-to-operations feedback loop. “Operations” refers to continuous data
collection for scientific research and hazard monitoring. By sharing these tools
with other researchers, datasets are more fully and effectively utilized. Satellite
data providers can upgrade existing datasets if there is a demand. Globally,
satellite data are provided through data portals by NASA, NOAA, EUMETSAT,
ESA, JAXA, and other international agencies. However, the value of these data-
sets is often only visible through scientific journal articles, which only represent a
small subset of potential users. For instance, if the applications of satellite obser-
vations used for routine disaster mitigation and planning in a disadvantaged
nation are not published in a scientific journal, improvements for disaster-
mitigation specific needs may never be met.
Further, there may be unexpected or novel uses of datasets that can drive sci-
entific inquiry, but if the code that brings those uses to life is hastily written and not
easily understood, it is effectively a waste of time for colleagues to attempt to
employ such applications. By sharing clearly written code and corresponding doc-
umentation for satellite data applications, users can alert colleagues in their com-
munity of the existence of scientific breakthrough efforts and expand the potential
value of satellite datasets within and beyond their community. Moreover, public
knowledge of those efforts can help justify the versatility and value of satellite mis-
sions and provide a return on investment for organizations that fund them. In the
end, the dissemination of code and data analysis tools will only benefit the scien-
tific community as a whole.
Overview of Satellite
Datasets
At present, there are over 13,000 satellite-based Earth observations freely and
openly listed on www.data.gov. Not only is the quantity of available data notable,
its quality is equally impressive; for example, infrared sounders can estimate
brightness temperatures within 0.1 K from surface observations (Tobin et al.,
2013), imagers can detect ocean currents with an accuracy of 1.0 km/hr
(NOAA, 2020), and satellite-based lidar can measure the ice-sheet elevation
change with a 10 cm sensitivity (Garner, 2015). Previously remote parts of our
planet are now observable, including the open oceans and sparsely populated
areas. Furthermore, many datasets are available in near real time with image
latencies ranging from less than an hour down to minutes – the latter being crit-
ically important for natural disaster prediction. Having data rapidly available
enables science applications and weather prediction, as well as emergency management and disaster relief. Research-grade data take longer to process (hours to months) but have higher accuracy and precision, making them suitable for long-term consistency. Thus, we live in the “golden age” of satellite Earth observation. While
the data are accessible, the tools and skills necessary to display and analyze this
information require practice and training.
Scientific data visualization used to be a very tedious process. Prior to the 1970s,
data points were plotted by hand using devices such as slide rules, French curves,
and graph paper. During the 1970s, IBM mainframes became increasingly avail-
able at universities and facilitated data analysis on the computer. For analysis,
IBM mainframes required that a researcher write Fortran-IV code, which was
then printed to cards using a keypunch machine (Figure 1.1). The punch cards
then were manually fed into a shared university computer to perform calculations.
Each card is roughly one line of code. To make plots, the researcher could create a
Fortran program to make an ASCII plot, which creates a plot by combining lines,
text, and symbols. The plot could then be printed to a line-printer or a teleprinter.
Some institutions had computerized graphic devices, such as Calcomp plotters.
Rather than create ASCII plots, the researcher could use a Calcomp plotting
Figure 1.1 (a) An example of a Fortran punch card. Each vertical column represents a
character and one card roughly one line of Fortran code. (b) 1979 photo of an IMSAI
8080 computer that could store up to 32 kB of data, which could then be
transferred to a keypunch machine to create punch cards. (c) an image created from the
Hubble Space Telescope using a Calcomp printer, which was made from running
punch cards and plotting commands through a card reader.
command library to control how data were visualized and store the code on com-
puter tape. The scientist would then take the tape to a plotter, which was not nec-
essarily (or usually) in the same area as the computer or keypunch machine. Any
errors – such as bugs in the code, damaged punch cards, or damaged tape – meant
the whole process would have to be repeated from scratch.
In the mid-1980s, universities provided remote terminals that would eventu-
ally replace the keypunch and card reader machine system. This substantially
improved data visualization processes, as scientists no longer had to share limited
resources such as keypunch machines, card readers, or terminals. By the late
1980s, personal computers became more affordable for scientists. A typical PC,
such as the IBM XT 286, had 640 KB of random access memory, a 32 MB hard
drive, and 5.25 inch floppy disks with 1.2 MB of disk storage (IBM, 1989). At this
time, pen plotters became increasingly common for scientific visualization, fol-
lowed later by the prevalence of ink-jet printers in the 1990s. These technologies
allowed researchers to process and visualize data conveniently from their offices.
With the proliferation of user-friendly personal computers, printers eventually made
their way into all homes and offices.
Now with advances in computing and internet access, researchers no longer
need to print their visualizations at all, but often keep data in digital form only. Plots
can be created in various data formats that easily embed into digital presentations
and documents. Scientists often do not ever print visualizations because computers
and cloud storage can store many gigabytes of data. Information is created and con-
sumed entirely in digital form. Programming languages, such as Python, can tap
into high-level plotting programs and can minimize the axis calculation and labeling
requirements within a plot. Thus, the expanded access to computing tools and sim-
plified processes have advanced scientific data visualization opportunities.
In Figure 1.2, you can see that the international community has developed
and launched a plethora of Earth-observing satellites, each with several onboard
sensors that have a range of capabilities. I am not able to discuss every sensor,
dataset, and mission (a term coined by NASA to describe projects involving spacecraft). However, I will describe some that are relevant to this text, organized by subject area.
Figure 1.2 Illustration of current Earth, space weather, and environmental monitoring satellites from the World Meteorological Organization (WMO). Source: U.S. Department of Commerce / NOAA / Public Domain.
Figure 1.3 Equatorial crossing times for various LEO satellites displayed using Python. The plot shows local Equator crossing time (hour) against date (1980–2015) for TIROS-N, NOAA-06 through NOAA-20, SNPP, Aqua, Terra, and Metop-A and -B.
1.2.2. Hydrology
Both GEO and LEO satellites can provide sea surface temperature (SST)
observations. The GOES series of GEO satellites provides continuous sampling
of SSTs over the Atlantic and Pacific Ocean basins. The MODIS instrument on
the Aqua satellite has been providing daily, global SST observations continuously
since the year 2000. Visible wavelengths are useful for detecting ocean color, par-
ticularly from LEO satellites, which are often observed at very high resolutions.
1.2.4. Cryosphere
ICESat-2 (Ice, Cloud, and land Elevation Satellite 2) is a LEO satellite mis-
sion designed to measure ice sheet elevation and sea ice thickness. The GRACE-
FO satellite mission can also monitor changes in glaciers and ice sheets.
The missions mentioned in the previous section provide open and free data to
all users. However, data delivery, the process of downloading sensor data from the
satellite and converting it into a usable form, is not trivial. Raw sensor data are
first acquired on the satellite, then the data must be relayed to the Earth’s ground
system, often at speeds around 30 Mbits/second. For example, GOES satellite
data are acquired by NASA’s Wallops Flight Facility in Virginia; data from
the Suomi NPP satellite is downloaded to the ground receiving station in Svalbard,
Norway (Figure 1.4). Once downloaded, the observations are calibrated and sev-
eral corrections are applied, such as an atmospheric correction to reduce haze in
the image or topographical corrections to adjust changes in pixel brightness on
complex terrain. The corrected data are then incorporated into physical products
using satellite retrieval algorithms. Altogether, the speed of data download and
processing can impact the data latency, or the difference between the time the
physical observation is made and the time it becomes available to the data user.
Data can be accessed in several ways. The timeliest data can be downloaded
using a direct broadcast (DB) antenna, which can immediately receive data when
the satellite is in range. This equipment is expensive to purchase and maintain, so
usually only weather and hazard forecasting offices install them. Most users will
access data via the internet. FTP websites post data in near real time, providing the
data within a few hours of the observation. Not all data must be timely – research-
grade data can take months to calibrate to ensure accuracy. In this case, ordering
through an online data portal will grant users access to long records of data.
While data can be easily accessed online, they are rarely analysis ready. Soft-
ware and web-based tools allow for quick visualization, but to create custom ana-
lyses and visualizations, coding is necessary. To combine multiple datasets, each
must be gridded to the same resolution for an apples-to-apples comparison. Fur-
ther, data providers use quality flags to indicate the likelihood of a suitable
retrieval. However, the meaning and appropriateness of these flags are not always well communicated to data users. Moreover, understanding how such datasets are organized can be cumbersome to new users. This text thus aims to identify specific Python routines that enable custom preparation, analysis, and visualization of satellite datasets.
Figure 1.4 Data routes for NOAA-20: stored mission data are downlinked at 300 Mbps to antennas at Svalbard, Norway, and McMurdo, Antarctica, while a 15 Mbps high-rate data link feeds the Direct Broadcast Network.
I have structured this book so that you can learn Python through a series of
examples featuring real phenomena and public datasets. Some of the datasets and
visualizations are useful for studying wildfires and smoke, dust plumes, and hur-
ricanes. I will not cover all scenarios encountered in Earth science, but the skills
you learn should be transferrable to your field. Some of these case studies include:
• California Camp Fire (2018). The Camp Fire was a forest fire that began on November 8, 2018, and burned for 17 days over a 621 km2 area. It spread rapidly because of very low regional humidity, strong gusting wind events, and a very dry surface. The smoke from the fire also affected
regional air quality. In this case study, I will examine satellite observations
to show the location and intensity as well as the impact that the smoke had
on regional CO, ozone, and aerosol optical depth (AOD). Combined satellite
channels also provide useful imagery for tracking smoke, such as the dust
1.5. Summary
I have provided a brief overview of the many satellite missions and datasets
that are available. This book has two main objectives: (1) to make satellite data
and analysis accessible to the Earth science community through practical Python
examples using real-world datasets; and (2) to promote a reproducible and trans-
parent scientific code philosophy. In the following chapters, I will focus on
describing data conventions, common methods, and problem-solving skills
required to work with satellite datasets.
References
Tobin, D., Revercomb, H., Knuteson, R., Taylor, J., Best, F., Borg, L., et al. (2013).
Suomi-NPP CrIS radiometric calibration uncertainty. Journal of Geophysical Research:
Atmospheres, 118(18), 10,589–10,600. https://doi.org/10.1002/jgrd.50809
Garner, R. (2015, July 10). ICESat-2 Technical Requirements. www.nasa.gov/content/god-
dard/icesat-2-technical-requirements
IBM (1989, January). Personal Computer Family Service Information Manual. IBM doc-
ument SA38-0037-00. http://bitsavers.trailing-edge.com/pdf/ibm/pc/SA38-0037-00_Per-
sonal_Computer_Family_Service_Information_Manual_Jul89.pdf
National Oceanic and Atmospheric Administration (2020, June 12). GOES-R Series Level
I Requirements (LIRD). www.goes-r.gov/syseng/docs/LIRD.pdf.
OVERVIEW OF PYTHON
In this chapter, I discuss some reasons why Python is a valuable tool for Earth
scientists. I will also provide an overview of some of the commonly used Python
packages for remote sensing applications that I will use later in this book. Python
evolves rapidly, so I expect these tools to improve and new ones to become avail-
able. However, these will provide a solid foundation for you to begin your
learning.
Chances are, you may already know a little about what Python is and have
some motivation to learn it. Below, I outline common reasons to use Python rel-
evant to the Earth sciences:
• Python is open-source and free. Some of the legacy languages are for profit,
and licenses can be prohibitively expensive for individuals. If your career
plans include remaining at your current institution or company that supplies
you with language licenses, then open source may not be of concern to you.
But often, career growth means working for different organizations. Python is
portable, which frees up your skillset from being exclusively reliant on propri-
etary software.
• Python can increase productivity. There are thousands of supported libraries
to download and install. For instance, if you want to open multiple netCDF
files at once, the package called xarray can do that. If you want to re-grid an
irregular dataset, there is a package called pyresample that will do this quickly.
Even more subject-specific plots, like Skew-T diagrams, have a prebuilt pack-
age called MetPy. For some datasets, you can download data directly into
Python using OPeNDAP. Overall, you spend less time developing routines
and more time analyzing results.
• Python is easy to learn, upgrade, and share. Python code is very “readable” and
easy to modularize, so that functions can be easily expanded or improved.
Further, low- or no-cost languages like Python increase the shareability of
the code. When the code works, it should be distributed online for other users’
benefit. In fact, some grants and journals require online dissemination.
You may already have knowledge of other computer languages such as IDL,
MATLAB, Fortran, C++, or R. Learning Python does not mean you will stop
using other languages or rewrite all your existing code. In many cases, other lan-
guages will continue to be a valuable part of your daily work. For example, there
are a few drawbacks to using Python:
• Python may be slower than compiled languages. While many of the core scien-
tific packages compile code on the back end, Python itself is not a compiled
language. For a novice user, Python will run more slowly, especially if loops
are present in the code. For a typical user, this speed penalty may not be
noticeable, and advanced users can tap into other runtime frameworks like
Dask or Cython or even run compiled Fortran subroutines to enhance perfor-
mance. However, new users might not feel comfortable learning these work-
arounds, and even with runtime frameworks and subroutines, performance
might not improve. If speed is a concern, then Python could be used as a pro-
totype code tool prior to converting into a compiled language.
• New users often run packages “as-is” and the contents are not inspected. There
are thousands of libraries available, but many are open-source, community
projects and bugs and errors will exist. For example, irregular syntax can
result whenever there is a large community of developers. Thus, scientists
and researchers should be extra vigilant and only use vetted packages.
• Python packages may change function syntax or be discontinued. Python changes
rapidly. While most developers refrain from abruptly changing syntax, this
practice is not always followed. In contrast, because much of the work in
developing these packages is on a volunteer basis, the communities supporting
them could move on to other projects and those who take over could begin a
completely new syntax structure, although this is unlikely to be the case for heavily used packages.
Unlike legacy languages such as Fortran and C++, there is no guarantee that
code written in Python will remain stable for 30+ years. However, the packages
presented in this book are “mature” and are likely to continue to be supported
for many years. Additionally, you can reproduce the exact packages and versions
using virtual environments (Section 11.3). This text highlights newer packages that
save significant amounts of development time and streamline certain processes,
including how to open and read netCDF files and gridding operations.
Python contains intrinsic structure and mathematical commands, but its cap-
abilities can be dramatically increased using modules. Modules are written in Python
or a compiled language like C to help simplify common, general, or redundant tasks.
For instance, the datetime module helps programmers manipulate calendar dates
and times using a variety of units. Packages contain one or more modules, which
are often designed to facilitate tasks that follow a central theme. Some other terms
used interchangeably for packages are libraries and distributions.
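As a small illustration of what a module provides, the standard-library datetime module handles calendar arithmetic directly; the timestamp below is arbitrary and only meant to show the idea.

from datetime import datetime, timedelta

# Parse a timestamp string into a datetime object
start = datetime.strptime("2018-06-25 11:21", "%Y-%m-%d %H:%M")

# Shift it by 90 minutes (roughly one LEO orbital period) and take a difference
later = start + timedelta(minutes=90)
print(later.isoformat())
print((later - start).total_seconds())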
At the time of writing, there are over 200,000 Python packages registered on
pypi.org and more that live on the internet in code repositories such as GitHub (https://github.com/). Many of the most popular packages are often developed
and maintained by large online communities. This massive effort benefits you
as a scientist because many common tasks have already been developed in Python
by someone else. This can also create a dilemma for scientists and researchers – the
trade-off between using existing code to save time against time spent researching
and vetting so many code options. Additionally, because many of these packages
do not have full-time staff support, the projects can be abandoned by their devel-
opment teams, and your code could eventually become obsolete.
In your research, I suggest you use three rules when choosing packages to
learn and work with:
1. Use established packages.
2. Use packages that have a large community of support.
3. Use code that is efficient with respect to reduced coding time and increased
speed of performance.
Following is a list of the main Python packages that I will cover in this text.
2.2.1. NumPy
2.2.2. Pandas
Pandas is a library that permits using data frame (stylized DataFrame) struc-
tures and includes a suite of I/O and data manipulation tools. Unlike NumPy,
Pandas allows you to reference named columns instead of using indices. With
Pandas, you can perform the same kinds of essential tasks that are available in
spreadsheet programs (but now automated and with fewer mouse clicks!). For
those who are familiar with R programming language, Pandas mimics the
R data.frame function.
A limitation of Pandas is that it can only operate with 2D data structures.
More recently, the xarray package has been developed to handle higher-
dimensional datasets. In addition, Pandas can be somewhat inefficient because
the library is technically a wrapper for NumPy, so it can consume up to three times
as much memory, particularly in Jupyter Notebook. For larger row operations (500K rows or greater), the differences can even out (Goutham, 2017).
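As a minimal sketch of the named-column access described above, assume a hypothetical comma-separated file observations.csv with columns named lat, lon, and albedo (these names are placeholders, not from a particular dataset):

import pandas as pd

# Read the file into a DataFrame; columns keep their header names
df = pd.read_csv("observations.csv")

# Reference columns by name rather than by numeric index
mean_albedo = df["albedo"].mean()

# Spreadsheet-like operations: filter rows and add a derived column
northern = df[df["lat"] > 0]
df["albedo_pct"] = 100 * df["albedo"]
print(mean_albedo, len(northern))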
2.2.3. Matplotlib
I will discuss two common self-describing data formats, netCDF and HDF, in
Section 3.2.3. Two major packages for importing these formats are the netCDF4
and h5py packages. These tools are advantageous because the user does not have
to have any knowledge of how to parse the input files, so long as the files follow
standard formatting. These two packages import the data, which can then be con-
verted to NumPy to perform more rigorous data operations.
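A minimal sketch of that workflow with the netCDF4 package, assuming a hypothetical file granule.nc that contains a variable named surface_albedo (both names are placeholders):

import numpy as np
from netCDF4 import Dataset

# Open the self-describing file; no knowledge of the byte layout is needed
with Dataset("granule.nc", "r") as nc:
    # Slicing returns a NumPy masked array with scale/offset and fill values applied
    albedo = nc.variables["surface_albedo"][:]

# Convert to a plain NumPy array, turning masked (fill) points into NaN
albedo = np.ma.filled(albedo.astype(float), np.nan)
print(albedo.shape, np.nanmean(albedo))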
2.2.5. Cartopy
The packages detailed in this section are worth mentioning because they may
apply to your specific project. Further, some features are too good to ignore, so
they are highlighted below. However, if your code requires a long-term shelf life,
it may be best to find alternative solutions, as the following packages may change more rapidly than those listed in Section 2.2.
2.3.1. xarray
2.3.2. Dask
2.3.3. Iris
2.3.4. MetPy
Cfgrib is a useful package for reading GRIB1 and GRIB2 data, which is a
common format for reanalysis and model data, particularly for the ECMWF.
Cfgrib decodes GRIB data so that it mimics the structure of netCDF files, using the ecCodes Python package. ecCodes was developed by ECMWF to decode and encode standard WMO GRIB and BUFR files.
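A minimal sketch of reading GRIB2 data through xarray's cfgrib engine, assuming a hypothetical file forecast.grib2; cfgrib and ecCodes must be installed, and the filter keys depend on how the producing center organizes its messages:

import xarray as xr

# Open a GRIB2 file as an xarray Dataset; filtering to one level type avoids
# the multiple-coordinate conflicts mentioned for some American model files
ds = xr.open_dataset(
    "forecast.grib2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "isobaricInhPa"}},
)
print(ds)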
2.4. Summary
I hope you are excited to begin your Python journey. Since it is free and open-
source, Python is a valuable tool that you can carry with you for the rest of your
career. Furthermore, there are many existing packages to perform common tasks
in the Earth Sciences, such as importing common datasets, organizing data, per-
forming mathematical analysis, and displaying results. In the next chapter, I will
describe some of the common satellite data formats you may encounter.
References
Dask: Scalable analytics in Python. (n.d.). Retrieved November 25, 2020, from https://
dask.org/
ecmwf/cfgrib. (2020). Python, European Centre for Medium-Range Weather Forecasts.
Retrieved from https://github.com/ecmwf/cfgrib (Original work published July 16, 2018).
Matplotlib: Python plotting — Matplotlib 3.3.3 documentation. (n.d.). Retrieved November
25, 2020, from https://matplotlib.org/.
MetPy — MetPy 0.12. (n.d.). Retrieved November 25, 2020, from https://unidata.github.io/
MetPy/latest/index.html.
NASA’s Earth Observing System Data and Information System (EOSDIS) has
accumulated 27 PB of data since the 1990s with the purpose of furthering scientific
research. EOSDIS continues to add data from missions prior to the 1990s, which are
stored as hard disk media (Figure 3.1). Many of these older datasets need to be
“rescued,” which is challenging because such legacy media are often disorganized,
lack documentation, are physically damaged, or lack appropriate readers (Meier et
al., 2013; James et al., 2018). As the satellite era began in the 1960s, it is unlikely that
the planners considered how voluminous the data would become and how rapidly it
would be produced.
Nowadays, data archiving at agencies is carefully planned and organized, and
it uses scientific data formats. This infrastructure allows EOSDIS to freely distrib-
ute hundreds of millions of data files a year to the public (I describe how to obtain
data in Appendix E). In addition to improving access and storage, scientific data
formats provide consistency between files, which reduces the burden on research-
ers who can more easily write code using tools like Python to read, combine, and
analyze data from different sources. This philosophy is encompassed by the term
analysis-ready data (Dwyer et al., 2018). Common formats and labeling also ensure that data are well understood and used appropriately. In this chapter, I go over some of the ways data are stored, provide an overview of major satellite data formats, as well as common ways to classify and label satellite data.
Figure 3.1 (a) Canisters of 35mm film that contain imagery recovered from Nimbus 1 of polar sea ice extent collected in 1964. From roughly 40,000 recovered images, scientists were able to reconstruct scenes such as (b) the ice edge north of Russia (78 N, 54 E) and composites of the (c) north and (d) south poles. Photos courtesy of the National Snow and Ice Data Center, University of Colorado, Boulder.
3.1. Storage
Table 3.1 Data Types, Typical Ranges, and Decimal Precision and Size in Computer
Memory
Note: Numbers in parentheses are the unsigned integers (positive only). Unsigned floats are
not supported in Python.
Table 3.1 lists some computer number formats that satellite data are commonly stored in,
along with their respective numeric ranges and storage requirements. Integers
are numbers that have no decimal precision (e.g., 4, 8, 15, 16, –5,250, 8,642,
…). Floats are real numbers that have decimal precision (e.g., 3.14, 5.0,
1.2E23). Integers are typically advantageous for storage because they are smaller
and will not have rounding errors like float values can. To keep data small, values
are stored in the smallest numerical type necessary.
Even if an observed value is large, values can be linearly scaled (using an offset
and a scale factor) to fit within integer ranges to keep file sizes small. This is
because many Earth observations naturally only scale across a small range of
numbers. The scale and offset can be related to the measured value as follows:
Measured Value = Offset + Stored_Value ∗ Scale_Factor
For instance, the coldest and hottest surface temperatures on Earth are on the
order of 185 K (–88 C) and 331 K (58 C), respectively. The difference between the
two extremes is 146 K, so only 146 numbers are needed if you are not using any
decimal precision. For example, if I observe a surface temperature of 300 K,
I would store this in a two-byte unsigned integer if I do not rescale the data.
I can offset the data by 185 K, which is the lowest realistic temperature. Then, I can store this measurement as 115, which fits in the one-byte integer range.
If later I want to read this value, I would then add the 185 K offset back to the
value. While reading and writing the data is more complex, the file is now 50%
smaller.
I may want further precision, but I can also account for this. For example, to keep one decimal place (e.g., 300.1 K), the offset value can be multiplied by 10 (the scale factor) and still saved as an integer (Table 3.2). However, the stored value is now 1,151, so these data would be stored as a two-byte integer, which can contain unsigned values up to 65,535. This time, when reading the data from the file, the value will need to be divided by 10 and the offset added back. Again, this conversion saves two bytes of memory over storing 300.1 K as a floating-point value, which is four bytes.
Table 3.2 Examples of How Data Can Be Rescaled to Fit in Integer Ranges
Note: Storing data in integers can provide significant storage savings over real values.
If a dataset does rescale the data to improve data storage (and many do not),
linear scaling is the most common. Conveniently, some Python packages can auto-
matically detect and apply scale and offset factors when importing certain data
types, such as netCDF and HDF formats, which we will discuss in Section 3.2.3.
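The surface temperature example above can be written out in a few lines of NumPy. This is only a sketch of the bookkeeping; the offset (185 K) and a scale factor of 0.1 follow the unpacking equation given earlier, and packages such as netCDF4 normally apply stored scale_factor and add_offset attributes automatically on read.

import numpy as np

offset = 185.0       # lowest realistic surface temperature (K)
scale_factor = 0.1   # one decimal place of precision

# Pack a measured value into an unsigned 16-bit integer for storage
measured = 300.1
stored = np.uint16(round((measured - offset) / scale_factor))
print(stored)        # 1151, which fits in two bytes

# Unpack the stored value back into physical units
recovered = offset + stored * scale_factor
print(recovered)     # 300.1 (to floating-point precision)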
3.1.2. Arrays
Nearly all satellite data are in three dimensions (3D), which are a value and
two corresponding spatial coordinates. The value can, for example, be a physical
observation, a numeric code that tells us something about the data, or some ancil-
lary characteristics. The spatial coordinate can be an x, y or a latitude, longitude
pair. Some datasets may also have a vertical coordinate or a time element, further
increasing the dimensionality of the data. As an example, I will discuss some dif-
ferent ways you can structure (or, how one organizes and manages) three-
dimensional data: latitude, longitude, and surface albedo (which shows the reflec-
tivity of the Earth to solar radiation). A simple way to structure the data is in a list,
an ordered sequence of values. So, I could store each observation as three separate values for longitude, latitude, and surface albedo, where longitude ranges from –180 to 180 (East is positive), latitude from –
90 to 90 (North is positive), and surface albedo from 0 to 1 (which shows low to
high sunlight reflection, respectively).
Instead of three separate values, I could organize the data into a table with columns for longitude, latitude, and albedo, which stores the data in rows and columns.
Within Python, this organization is called a meshgrid, which stores the spatial and value coordinates in a 2D rectangular grid. Coordinates stored in a meshgrid are either regularly spaced or irregularly spaced (Figure 3.2). If every coordinate inside the meshgrid has a consistent distance from its neighboring coordinates, then the data are regularly spaced. On the other hand, irregularly spaced data will have varying distances between the x coordinates, the y coordinates, or both. In general, if grid coordinates are regularly spaced, the coordinates may be stored in datasets as lists (e.g., GPM IMERG L3). If the data are irregularly spaced, the coordinates will likely be stored as a 2D meshgrid; in both cases, the values themselves are very commonly stored in a 2D grid.
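A regularly spaced meshgrid is straightforward to build with NumPy; this sketch uses an arbitrary 1-degree global grid.

import numpy as np

# 1D coordinate lists for a regularly spaced 1-degree global grid
lons = np.arange(-180.0, 180.0, 1.0)
lats = np.arange(-90.0, 91.0, 1.0)

# Expand them into 2D coordinate arrays (the meshgrid)
lon2d, lat2d = np.meshgrid(lons, lats)
print(lon2d.shape, lat2d.shape)   # (181, 360) for both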
Figure 3.2 Spacing and distance between (x, y) points for an example regularly and irregularly spaced rectangular grid.
Figure 3.3 An illustration of how a granule of satellite data (768 rows by 3,200 columns) taken from a polar-orbiting satellite would be stored in a meshgrid. Regions outside of where the satellite can see or that are not stored in the file are indicated using fill values. These values are excluded from analysis.
In Figure 3.3, I show a plot of a single granule that contains several swaths of surface albedo from 25 June 2018 at 11:21 UTC. Swaths are what a satellite can see
in longitude and latitude during a scan of the Earth (also called a scanline).
A granule is a collection of swaths, usually taken over a time period or for a fixed
number of scanlines. Data are often stored in chunks because global scans of the
Earth are voluminous. Additionally, the smaller data files improve the latency or
timeliness of the data because we do not have to wait for the full global image to
be assembled.
On the right of Figure 3.3, I illustrate what happens when the data are flat-
tened and projected into a 2D array. Due to the curvature of the Earth and the
satellite viewing angle, the coordinates are spatially irregular. At the scan edges,
the footprints are larger than they are when the satellite is directly overhead. So,
when the data are stored in a rectangular grid, there will be places there is no data.
These empty values are called missing, gaps, or fill values. They contain a special number that is not used elsewhere in the data; for continuous data, common fill values are NaN (not a number), –9999.9, –999.9, or –999. These numbers are not particularly common in nature and thus won't usually be confused with meaningful observations. Zero is usually not used as a fill number because it becomes difficult to distinguish low values of observed data from values that are outside of the satellite scan. Data can also contain integer data flags, which are codes that categorize the data, for example to indicate the data quality or to identify whether a scene is over land or ocean. For integer data, some common fill
values may include 0, 128, –127, 255, and 32,766. These numbers are endpoint
values for the numeric data types shown in Table 3.1.
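When fill values are stored as sentinel numbers rather than NaN, they should be masked before computing statistics; a minimal sketch assuming –9999.9 is the fill value for a continuous variable.

import numpy as np

# A small array of continuous data with -9999.9 used as the fill value
albedo = np.array([0.25, 0.55, -9999.9, 0.82, -9999.9])

# Replace fill values with NaN so they are excluded from the analysis
albedo = np.where(albedo == -9999.9, np.nan, albedo)
print(np.nanmean(albedo))   # mean of the valid points only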
3.2.1. Binary
In our daily lives, we use the base-10 number system. Our counting numbers are 0–9, and then digits increase in length every tenth number. Alternatively, computer data are natively stored in base-2, or binary, which takes on two discrete states, 1 or 0, and can be thought of as “on” or “off.” Each state is called a bit. Each additional bit doubles the previous amount of information: one bit allows you to count 0 or 1 (two possible integers), two bits allow you to count to 3 (four integers), three bits allow you to count to 7 (eight possible integers), and so on (Table 3.3).
Note that 8 bits = 1 byte.
Binary data are aggregated into a dataset by structuring bits into sequential
arrays. Reading a single-dimension sequence of numbers is relatively simple. For
instance, if you know the data type, you can calculate the length of the data by dividing the file size by the size of that type. For example, if you have a 320-bit file and you know it contains 32-bit floats, then it is an array of length 10. However, if you have a multidimensional array, the read order of the programming language will matter. If the 320-bit file in the previous example is a 5 × 2 array, then it’s not obvious whether you are reading across the rows or down the columns. Row-major order
(used by Python’s NumPy, C/C++, and most object-oriented languages) and col-
umn-major order (Fortran and many procedural programming languages) are
methods for storing multidimensional arrays in memory (Figure 3.4). By conven-
tion, most languages that are zero indexed tend to be row major, while languages
where the index starts with 1 are column major. If you use the wrong index, the
program will read the data in the incorrect order.
The default byte read order (the endianness) varies from one operating system
or software tool to another; little endian data are read from right to left (most sig-
nificant data byte last) and big endian is read from left to right (most significant
data byte first). You can often tell if you are reading data using the wrong system if
you have unexpectedly large or small values.
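These choices show up directly when reading a flat binary file with NumPy; a minimal sketch assuming a hypothetical 40-byte file data.bin that holds ten little-endian 32-bit floats.

import numpy as np

# '<f4' is a little-endian 32-bit float; use '>f4' for big-endian data
values = np.fromfile("data.bin", dtype="<f4")
print(values.size)   # 10 elements for a 320-bit (40-byte) file

# Reshape to 5 x 2: order="C" is row-major (the NumPy default),
# order="F" follows the column-major (Fortran) convention
row_major = values.reshape((5, 2), order="C")
col_major = values.reshape((5, 2), order="F")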
Base-10 Binary
0 0000
1 0001
2 0010
3 0011
4 0100
5 0101
6 0110
7 0111
8 1000
Figure 3.4 Row-major order versus column-major order storage of a multidimensional array in memory.
3.2.2. Text
Most of us have probably already seen or used text data. Some advantages of
text-based data are that they are easily readable, so you can visually check if the
imported data matches the input file formatting. However, text data also has dif-
ferent encoding methods and organization. American Standard Code for Informa-
tion Interchange (ASCII) was released in 1963 and is one of the earliest text
encodings for computers. When viewed, text files look like alphanumeric charac-
ters, but under the hood ASCII character encoding consists of 128 characters
which are represented by 7-bit integers.
Originally, ASCII could only display Latin characters and limited punctua-
tion. ASCII was followed by the Unicode Transformation Format – 8-bit (UTF-8), which encodes 1,112,064 valid Unicode code points using one to four 8-bit bytes. Since 2009, UTF-8 has become the primary encoding of the World Wide Web, with over 60% of webpages encoded in UTF-8 in 2012. Under the hood, UTF-8 is backward compatible with ASCII data; the first 128 characters in Unicode map to the same integers as their ASCII predecessors.
While all languages can be displayed in UTF-8, the UTF-16 and UTF-32 encodings are becoming more frequently used for languages with more complex alphabets, such as Chinese and Arabic. Many scientists will usually not notice these distinctions
and may use the terms interchangeably. However, knowing the distinction is
important, as using the wrong format can introduce unwanted character artifacts
that can cause code to erroneously read files.
Of plain text data formats, comma-separated values (.csv files) and tab-
separated values (.tsv files) are highly useful for storing tabular data. Another example of plain text data is a METAR surface weather report, a coded string that describes the current conditions at an observing station.
Note that just because text data are “human readable,” that does not mean they are “human understandable” without additional explanation. However, such a report is very “machine readable” because it follows consistent formatting and uses a common set of codes to describe the current weather conditions. For instance, a METAR can be translated from “machine readable” into a “human-readable” description using a text-parsing algorithm written in Python:
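The original report and parser are not reproduced here; the following is only a sketch of the idea, decoding a few groups of a hypothetical METAR string with the standard-library re module.

import re

# Hypothetical METAR report: station, time, wind, visibility, clouds,
# temperature/dew point, and altimeter setting
metar = "KDCA 251151Z 18010KT 10SM SCT045 27/18 A3002"
fields = metar.split()

station, timestamp = fields[0], fields[1]

# Wind group, e.g., 18010KT -> direction 180 degrees at 10 knots
wind = re.match(r"(\d{3})(\d{2,3})KT", fields[2])
wind_dir, wind_speed = int(wind.group(1)), int(wind.group(2))

# Temperature/dew point group, e.g., 27/18 -> 27 C / 18 C
temp_c, dewpt_c = (int(x) for x in fields[5].split("/"))

print(f"{station} at {timestamp}: wind {wind_dir} deg at {wind_speed} kt, "
      f"temperature {temp_c} C, dew point {dewpt_c} C")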
Like binary data, text files can store empirical and abstract data, and some
simple metadata. However, there are some disadvantages of character code.
For one, the data require much more storage space than binary data formats,
which do not require as many bytes to store the data. Second, formatting csv
and tsv files needs to be carefully organized to be machine readable: column head-
ers may not match the number of columns below, the separating or delimiting
character may be inconsistent, or there may be missing values. Third, sometimes
multiple variables are stacked into the same file, rather than organized by rows,
and a special reader must be designed to import the data.
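Many of these issues can be handled at read time; a minimal sketch with pandas, assuming a hypothetical semicolon-delimited file stations.csv that uses –9999 for missing values and has a column named temperature.

import pandas as pd

# Declare the delimiter and the sentinel used for missing values up front
df = pd.read_csv("stations.csv", sep=";", na_values=[-9999])

# Rows with missing observations can then be dropped (or filled) explicitly
df = df.dropna(subset=["temperature"])
print(df.head())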
In the next section, I will discuss self-describing datasets, which in more recent
years have become the primary method of storing large satellite datasets. Self-
describing datasets combine the compactness of binary data with the human-
readable descriptions of text files so that users can understand the structure, met-
adata, and formatting of the stored empirical and abstract data.
Self-describing formats store values in binary, so they take up less space on disk than text data. Each byte of binary data can hold 256 possible values whereas
unextended text (text that is exclusively machine readable) can only store 128 and
human readable text is even less efficient. Computationally, it is faster to read in
binary-based datasets than text, which needs to be parsed before being stored into
a computer’s memory. Because the files are more compact, binary formats are com-
monly used to store large, long-term satellite data.
The most common self-describing format is the Hierarchical Data Format
(HDF5). HDF files are often used outside of the Earth sciences, as they are useful
for storing other big datasets. The next most common format is the Network Common Data Form (netCDF4), which is derived from HDF. Standards for netCDF are hosted
by the Unidata program at the University Corporation for Atmospheric Research
(UCAR). As a result, these files are very often found in the Earth sciences. NOAA
remote sensing datasets are almost exclusively netCDF, and more recently, NASA-produced datasets have been more frequently stored in netCDF4. Please note that
older versions of the datasets (HDF4, netCDF3) do exist but are not necessarily
backward compatible with the Python techniques presented in this book.
Before getting started with any data-based coding project, it is a good practice
to inspect the data you are about to work with. Self-describing datasets are a type
of structured binary format. While the variable data itself is binary, it is organized
into groups. I recommend that after opening the dataset, you examine which spe-
cific elements within it are of interest to you. Additionally, self-describing datasets
can have both local and global attributes, such as a text description of the variable,
the valid range of values, or the number that is used to indicate missing or fill,
values (Figure 3.5). In self-describing formats you do have to know where in
the large array your variable of interest is stored. You can extract it by using
the variable name, which points to the address inside of the file. Because the data
can be complex, it is worthwhile to inspect the contents of a new dataset in
advance. You can utilize free data viewers to inspect the dataset contents, such
as Panoply (NASA 2021). Python or command line tools also allow you to exam-
ine what is inside the files. Planning your coding project with knowledge of the
data will allow you to work efficiently and overall will save you time.
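Beyond graphical viewers, a few lines of Python will list what a file contains; a minimal sketch with the netCDF4 package and a hypothetical file granule.nc.

from netCDF4 import Dataset

with Dataset("granule.nc", "r") as nc:
    # Global attributes describe the file as a whole
    print(nc.ncattrs())

    # Each variable reports its dimensions and local attributes
    for name, var in nc.variables.items():
        print(name, var.dimensions, var.ncattrs())

    # HDF-style groups, if present, can be listed the same way
    print(list(nc.groups))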
The variables can be one-, two-, three-dimensional variables, or more. HDF
files may organize the variables into groups and subgroups. While variables can be
organized into groups in netCDF files, Unidata compliance standards do not recommend grouping. Most often, data can be separated into other variables.
If the data have the same grid and there is a temporal element, it is more efficient to
increase the dimensionality of the variables and stack the matrices in time.
Figure 3.5 Example of how netCDF data are organized. Each variable has metadata attributes such as its name, coordinates, fill value, and valid range, along with units and a more detailed description of the contents.
Binary data that take on table-driven code form require external tables to decode the data type. Thus, they are not self-describing. These files follow a methodology of encoding binary data and not a distinct file type. Binary Universal Form for the Representation of meteorological data (BUFR) and GRIdded Binary (GRIB) are two common table-driven formats in Earth Sciences, but they are specific to certain subject areas. I will not cover these formats in significant detail in this text but will mention them here for your awareness.
• BUFR. BUFR was developed by the World Meteorological Organization
(WMO). Many assimilated datasets are stored in this format. BUFR files
are organized into 16-bit chunks called messages, with the first bits represent-
ing the numeric data type and format codes followed by the bit-stream of the
data. The software that parses these files uses external tables to look up the
meaning of the numeric codes, with the WMO tables as the field standard.
BUFR Table A is known as the Data Category. If the Data Category bit
stream contains the number 21, the Python parser can use the tables to see
that this code corresponds to “Radiances (satellite measured).” An advantage of BUFR is that the descriptions of the data are harmonized and that it has superior compression compared to text-based formats. When BUFR files follow standards, they
are more easily decoded than plain binary files. However, if the stored data
does not conform to the codes in the WMO tables, then the data must be dis-
tributed with a decoding table (Caron, 2011).
• GRIB2. American NWS models (e.g., GFS, NAM, and HRRR) and the Euro-
pean (e.g., ECMWF) models are stored in GRIB2. While they share the same
format, there are some differences in how each organization stores its data. Like
BUFR, GRIB2 are stored as binary variables with a header describing the data
stored followed by the variable values. GRIB2 replaced the legacy GRIB1 for-
mat as of 2001 and has a different encoding structure (WMO, 2001).
Models merit discussion in this text because researchers often compare models
and satellite data. For example, models are useful for validating satellite datasets or
supplementing observations with variables that are not easily retrieved (e.g.,
wind speed).
At the time this text was published, many Python readers had been developed
and tested with ECMWF because historically, most Python developers have been
in Europe. For instance, some of the GRIB2 decoders have problems parsing the
American datasets because the American models have multiple pressure dimen-
sions (depending on the variable) while the European models have one. Still, there
are ways the data can be inspected by using the pygrib and cfgrib packages, which were described earlier in Chapter 2.
3.2.5. GeoTIFF
GeoTIFF is essentially a geolocated image file. Like other image files, the data are organized into regularly spaced grids (also known as raster data). In addition, GeoTIFF files contain metadata that describe the location, coordinate system, image projection, and data values. An advantage of this data format is that satellite imagery is stored as an image, which can be easily previewed using standard image software. However, like text data, the files are not compact in terms of storage. While gaining popularity in several fields, GeoTIFF is most commonly used for GIS applications.
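As a quick illustration of the georeferencing metadata that sets GeoTIFF apart from an ordinary image, the sketch below uses the rasterio package (one common reader); the file name and band number are hypothetical.

    import rasterio

    with rasterio.open("satellite_scene.tif") as src:
        band1 = src.read(1)        # first band as a NumPy array
        print(src.crs)             # coordinate reference system
        print(src.transform)       # affine transform from pixel to map coordinates
        print(src.bounds, band1.shape)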
There is a lot of technical jargon associated with satellite data products. However, having a working knowledge of the terminology will enable you to choose appropriate data for your application. Below, I discuss how several data producers (e.g., NASA, NOAA, ESA) define the processing levels, the level of data maturity, and quality control. The timeliness of the data may also be of importance, so I describe what is meant by data latency. Finally, the algorithms used to calibrate and retrieve data change over time, so it is important to be aware of what version you are using for your research.
3.3.1. Processing Levels
Data originating from NASA, NOAA, and ESA are often assigned a processing level, numbered from 0 to 4, to help differentiate how processed the data are (Earthdata, 2019; ESA, 2007). Specific definitions can vary between agencies, but broadly: Level 0 data are raw instrument counts; Level 1 data are calibrated, geolocated radiances; Level 2 data are geophysical variables retrieved at the measurement's native resolution; Level 3 data are gridded or temporally aggregated products; and Level 4 data combine observations with model analyses or other lower-level products.
Generally, Level 3 and Level 4 data may be more appropriate to study long-
term trends or teleconnections. For real-time environmental monitoring or region-
specific analysis, Level 2 datasets may be more useful. Modelers may be interested
in assimilating radiances from Level 1 data, and on occasion Level 2 data, such as
for trace gas assimilation.
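One practical habit is to inspect a file's global attributes before using it, since self-describing products often record their processing level and version there. In the sketch below, the file name and attribute names are assumptions, because conventions vary by product.

    import xarray as xr

    ds = xr.open_dataset("satellite_product.nc")
    # Attribute names differ between products; these are common but not universal.
    for key in ("processing_level", "title", "product_version"):
        print(key, "->", ds.attrs.get(key, "not present"))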
When new NASA or NOAA satellites are launched, there is a test period
where the products are evaluated, which is sometimes referred to as the “alpha”
or “post-launch” phase. After this test period, so long as the data meet minimum
standards, the data are in the beta stage. Beta products are sometimes restricted from the public, although users may request access from the agencies under specific circumstances. The next two stages are provisional and validated. At these maturity stages, you can use the datasets with more confidence. Both provisional and validated datasets are appropriate for scientific research.
In terms of appropriate use, data at beta maturity or below are not scientifically rigorous. These data may still contain significant errors in the values and even in the geolocation. There are certain applications where using beta-stage data is appropriate, such as performing a feasibility study, simply becoming familiar with the data, or evaluating the data's accuracy and precision. If these data are used in published scientific research, you should clearly state the maturity level in your manuscript or presentation.
In terms of time scales, advancement through the maturity stages is slower for a brand-new sensor, typically taking one to two years from launch. Level 1 products are always the first to become validated because they have fewer moving parts and represent the simplest measurements. Level 2 products can at times rely on other Level 2 products as inputs. For instance, many of the GOES-16 products rely on the Level 2 Clear Sky Mask to determine if a scene is clear or cloudy, so the Clear Sky Mask had to reach provisional maturity before other products could. Logistically, this can complicate the review process for downstream products. For a series of satellites, the review process is expedited. For instance, GOES-17 is a twin satellite to GOES-16, so the review process for GOES-17 products was faster than for GOES-16 products.
Figure 3.6 Direct broadcast (DB) antenna sites, which can provide data in real-time.
Source: Joint Polar Satellite System/Public Domain.
Data from multiple receivers can be combined into a larger, regional image. However, a global view of the data is only possible from two satellite ground stations, located in Svalbard, Norway, and Troll, Antarctica.
IMERG, a global Level 3 precipitation product that combines many passive
microwave sensors from various platforms, is critical for flood monitoring in
remote regions without access to radar. However, the logistics of combining observations and running the retrieval make timely distribution challenging. The adopted solution was to produce three versions of the dataset:
• Early run: Produced with a latency of 4 hours from the time all data are
acquired.
• Late run: Produced with a latency of 12 hours.
• Final run: Research grade product that is calibrated with rain gauge data to
improve the estimate. The latency is 3.5 months.
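As a concrete example, the half-hourly IMERG granules are distributed as HDF5 files that can be opened with h5py; in the sketch below the file name is illustrative only, and the "Grid/precipitationCal" dataset path is an assumption based on the Version 06 product cited in the references.

    import h5py

    # Open an IMERG Early run granule (file name is illustrative only).
    with h5py.File("3B-HHR-E.MS.MRG.3IMERG.example.HDF5", "r") as f:
        precip = f["Grid/precipitationCal"][:]   # gridded precipitation estimates
        print(precip.shape)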
The required timeliness of datasets varies between different agencies, missions, and priorities. NASA datasets tend to focus on long-term availability and consistency of the data to create climate records. In contrast, NOAA datasets tend to focus on capturing the state of the atmosphere and oceans in the present to serve operational forecasting. So, you will likely find more real-time datasets generated by forecast agencies such as NOAA.
3.3.5. Reprocessing
3.4. Summary
In this chapter, you learned some of the main scientific data types and common terminology. In the past, satellite data were distributed as text or binary datasets, but now netCDF and HDF files are more common. These self-describing data are advantageous because they contain descriptive metadata and important ancillary information such as georeferencing or multiple variables on the same grid. There is a significant amount of remote sensing data, so it can be overwhelming for new users to know which data are appropriate to use for research or environmental monitoring. Major international agencies often classify scientific data according to processing level to convey information on the degree of calibration, aggregation, or combination. Furthermore, quality control flags within the datasets can help discriminate which measurements should and should not be used.
Because no two satellite datasets are identical, universally understood formats
such as netCDF or HDF make it easier for scientists to compare, combine, and
analyze different datasets. Since there is ongoing research to improve retrieval
algorithms, self-describing formats can also include production information on
the version number and maturity level of the data from new satellite missions.
Overall, unification of datasets, language, and production promotes analysis-
ready data (ARD) so that scientists can more easily use tools like Python for
research and monitoring.
References
Dwyer, J. L., Roy, D. P., Sauer, B., Jenkerson, C. B., Zhang, H. K., & Lymburner, L.
(2018). Analysis ready data: Enabling analysis of the Landsat archive. Remote Sensing,
10(9), 1363. https://doi.org/10.3390/rs10091363
Earthdata (2019). Data processing levels. Retrieved April 18, 2020, from https://earthdata.
nasa.gov/collaborate/open-data-services-and-software/data-information-policy/data-
levels/
ESA (2007, January 30). GMES Sentinel-2 mission requirements. European Space Agency.
Retrieved from https://www.eumetsat.int/website/home/Data/TechnicalDocuments/
index.html
James, N., Behnke, J., Johnson, J. E., Zamkoff, E. B., Esfandiari, A. E., Al-Jazrawi, A. F.,
et al. (2018). NASA Earth Science Data Rescue Efforts. Presented at the AGU Fall
Meeting 2018, AGU. Retrieved from https://agu.confex.com/agu/fm18/meetingapp.
cgi/Paper/381031
Meier, W. N., Gallaher, D., & Campbell, G. C. (2013). New estimates of Arctic and Antarctic sea ice extent during September 1964 from recovered Nimbus I satellite imagery. The Cryosphere, 7, 699–705. https://doi.org/10.5194/tc-7-699-2013
Murphy, K. (2020). NASA Earth Science Data: Yours to use, fully and without restrictions
| Earthdata. Retrieved April 17, 2020, from https://earthdata.nasa.gov/learn/articles/
tools-and-technology-articles/nasa-data-policy/
NASA (2021). Panoply, netCDF, HDF, and GRIB data viewer. https://www.giss.nasa.gov/
tools/panoply/
Precipitation Processing System (PPS) at NASA GSFC (2019). GPM IMERG early precipitation L3 half hourly 0.1 degree x 0.1 degree V06 [Data set]. NASA Goddard Earth Sciences Data and Information Services Center. https://doi.org/10.5067/GPM/IMERG/3B-HH-E/06
WMO (2003, June). Introduction to GRIB Edition 1 and GRIB Edition 2. Retrieved from https://www.wmo.int/pages/prog/www/WMOCodes/Guides/GRIB/Introduction_-GRIB1-GRIB2.pdf
Special thanks to Christopher Barnet for his description of historic plotting
routines.