
Data Analysis and Visualisation in

Climate Science
A Programmer’s Guide

Sebastian Engelstaedter
This book is for sale at
http://leanpub.com/data-analysis-and-visualisation-in-climate-sciences

This version was published on 2020-05-12

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean
Publishing process. Lean Publishing is the act of publishing an in-progress ebook
using lightweight tools and many iterations to get reader feedback, pivot until you
have the right book and build traction once you do.

© 2019 - 2020 Sebastian Engelstaedter


This book is dedicated to the frustrated student trying to find their way through the
obstacle course of climate computing and coding.
Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview and Objective . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Concept of Local and Remote Machines . . . . . . . . . . . . . . . . 2
1.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2. Climate Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Climate Data Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Data Use Licences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Data Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Accessing Climate Data . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Types of Climate Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5.1 Analyses and Reanalyses Products . . . . . . . . . . . . . 7
2.5.2 Climate and NWP Model Output . . . . . . . . . . . . . . 8
2.5.3 Point observations . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 Data File Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6.1 Plain Text and ASCII . . . . . . . . . . . . . . . . . . . . . 9
2.6.2 Binary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6.3 GRIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6.4 netCDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6.5 PP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3. Unix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Introduction to Unix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Linux Distributions . . . . . . . . . . . . . . . . . . . . . . 12
3.1.2 Desktop versus Server . . . . . . . . . . . . . . . . . . . . . 13

3.1.3 High Performance Computing on a Server . . . . . . . . 14


3.2 Accessing a Remote Server . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.1 Remote Server Login Details . . . . . . . . . . . . . . . . . 18
3.2.2 Virtual Private Network (VPN) . . . . . . . . . . . . . . . 19
3.2.3 X Window System (X11 forwarding) . . . . . . . . . . . . 20
3.2.4 Connecting to a Remote Server . . . . . . . . . . . . . . . 21
3.3 First Steps on the Unix server . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 The Terminal Window . . . . . . . . . . . . . . . . . . . . 24
3.3.2 The Shell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.3 Linux Directory Structure and Home Directory . . . . . 26
3.3.4 Quota . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.5 File Transfer to and from the Server . . . . . . . . . . . . 29
3.3.6 Mapping the Linux Home Directory as a Remote Network Drive . . . . . 31
3.4 Some More Unix Server Basics . . . . . . . . . . . . . . . . . . . . . . 33
3.4.1 Manual Pages . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Editing Text Files . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 Full versus Relative Paths . . . . . . . . . . . . . . . . . . 38
3.4.4 Special Characters . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Working with Files and Directories . . . . . . . . . . . . . . . . . . . 40
3.5.1 Creating Text Files and Directories . . . . . . . . . . . . . 40
3.5.2 Listing Files and Directories . . . . . . . . . . . . . . . . . 41
3.5.3 Moving Around in the Directory Tree . . . . . . . . . . . 42
3.5.4 Copying, Moving, Renaming and Deleting Files and Directories . . . . . 43
3.6 Advanced Unix Commands . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6.1 Examining Text Files . . . . . . . . . . . . . . . . . . . . . 44
3.6.2 File and Directory Properties . . . . . . . . . . . . . . . . 45
3.6.3 File Permissions . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6.4 Changing File Permissions and Ownership . . . . . . . . 47
3.6.5 Changing the Unix Account Password . . . . . . . . . . 49
3.6.6 Redirecting Command Output . . . . . . . . . . . . . . . 50
3.6.7 Finding Files . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.6.8 File Compression and Archives . . . . . . . . . . . . . . . 51
3.6.9 Download Files from the Command Line . . . . . . . . . 52
3.7 Long-running Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.7.1 GNU Screen (recommended) . . . . . . . . . . . . . . . . 53

4. Multi-dimensional Gridded Datasets . . . . . . . . . . . . . . . . . . . . . 56


4.1 The Earth’s Coordinate System and Realms . . . . . . . . . . . . . . 56
4.2 The Model Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Grid Indexing and Geographical Referencing of Data Points . . . . 60
4.4 The Time Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5 Horizontal Resolutions and Grid Types . . . . . . . . . . . . . . . . . 67
4.5.1 Spectral Resolution . . . . . . . . . . . . . . . . . . . . . . 67
4.5.2 Full and Reduced Gaussian Grid . . . . . . . . . . . . . . 67
4.5.3 Regular latitude-longitude grid . . . . . . . . . . . . . . . 71
4.6 Vertical Level Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.6.1 Pressure, Potential Temperature and Potential Vorticity Levels . . . . . 72
4.6.2 Sigma (Model) Levels . . . . . . . . . . . . . . . . . . . . . 72
4.6.3 Sigma-Hybrid Levels . . . . . . . . . . . . . . . . . . . . . 73

5. The netCDF File Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


5.1 Introduction to the netCDF File Format . . . . . . . . . . . . . . . . 76
5.2 netCDF File Headers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.1 Exploring netCDF File Headers with ncdump . . . . . . 77
5.2.2 Exploring netCDF File Headers with CDO . . . . . . . . 81
5.2.3 Exploring netCDF File Headers with ncview . . . . . . . 84
5.3 Packed netCDF Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4 netCDF File Format Conventions . . . . . . . . . . . . . . . . . . . . 86

6. Python - Concepts and Work Environment . . . . . . . . . . . . . . . . . 87


6.1 Python Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Python Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2.1 Python Modules and Packages . . . . . . . . . . . . . . . 87
6.2.2 Package Dependencies . . . . . . . . . . . . . . . . . . . . 90
6.2.3 Package Managers, Repositories and Channels . . . . . . 90
6.2.4 Python Virtual Environments . . . . . . . . . . . . . . . . 91
6.3 Conda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3.1 Creating a Conda Environment . . . . . . . . . . . . . . . 92
6.3.2 Activating and Deactivating Conda Environments . . . 96
6.3.3 Installing Python Packages . . . . . . . . . . . . . . . . . . 97

6.3.4 Listing and Deleting Conda Environments . . . . . . . . 98


6.4 Python Code Development . . . . . . . . . . . . . . . . . . . . . . . . 99
6.4.1 Python Code Editors . . . . . . . . . . . . . . . . . . . . . 99
6.4.2 Python IDEs . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4.3 Browser-based Python Code Editing . . . . . . . . . . . . 103

7. Python - Programming Basics . . . . . . . . . . . . . . . . . . . . . . . . . . 105


7.1 Basic Python Programming Building Blocks . . . . . . . . . . . . . . 105
7.1.1 Declaring Variables . . . . . . . . . . . . . . . . . . . . . . 105
7.1.2 Variable Types and Conversion Between them . . . . . . 106
7.1.3 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.1.4 Methods and Attributes . . . . . . . . . . . . . . . . . . . . 117
7.1.5 Controlling the Code Flow . . . . . . . . . . . . . . . . . . 118
7.2 Applying Python in Climate Data Analysis . . . . . . . . . . . . . . 124
7.2.1 Error Messages when Running Code . . . . . . . . . . . . 124
7.2.2 Looping Through Input Files . . . . . . . . . . . . . . . . 125
7.2.3 Reading Data Files Into NumPy Variables . . . . . . . . . 131
7.2.4 Executing Unix System Commands From Within Python 136
7.3 Introduction to NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.3.1 Creating NumPy Arrays . . . . . . . . . . . . . . . . . . 138
7.3.2 Indexing NumPy Arrays . . . . . . . . . . . . . . . . . . . 138
7.3.3 Saving and Loading NumPy Variables . . . . . . . . . . . 141
7.4 Tips and Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.4.1 String Formatting of Numbers . . . . . . . . . . . . . . . . 142
7.4.2 Zero-padding Integer Values in Filenames . . . . . . . . 143
7.4.3 Calculate Height From Geopotential with MetPy . . . . 145

8. Python - Creating Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149


8.1 Matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8.1.1 Setting up Plotting Page (Figure and Axes) . . . . . . . . 149
8.1.2 Main Plotting Commands . . . . . . . . . . . . . . . . . . 151
8.1.3 Colour Names and Colour Maps . . . . . . . . . . . . . . 152
8.2 Line Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2.1 Line Plot with Labels . . . . . . . . . . . . . . . . . . . . . 156
8.2.2 Line Plot with Arrows . . . . . . . . . . . . . . . . . . . . 160
8.2.3 Multiple Lines Plot with Markers and Legend . . . . . . 164

8.2.4 Multiple Lines Plot with two Scales . . . . . . . . . . . . 168


8.2.5 Multiple Lines Plot with Standard Deviation . . . . . . . 172
8.3 Scatter Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.3.1 Scatter Plot with a Legend . . . . . . . . . . . . . . . . . . 177
8.3.2 Scatter Plot with Divergent Colour Bar . . . . . . . . . . 181
8.3.3 Scatter Plot on a Map with Colour Bar and Legend . . . 185
8.4 Map Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
8.4.1 Cartopy Map Projections and Data Transformation . . . 191
8.4.2 Simple Map of SST Anomalies . . . . . . . . . . . . . . . 197
8.4.3 Map with Stipples for Statistical Significance . . . . . . 201
8.5 Bar Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
8.5.1 Anomalies Bar Graph . . . . . . . . . . . . . . . . . . . . . 207
8.6 Hovmöller Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.6.1 Hovmöller Plot with Time Axis Formatting . . . . . . . 210
8.7 Vertical Cross-Section Plots . . . . . . . . . . . . . . . . . . . . . . . . 213
8.7.1 Meridional Cross-Section . . . . . . . . . . . . . . . . . . . 213
8.7.2 Vertical Cross-Section Between two Points . . . . . . . . 219
8.8 Multiple Panels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
8.8.1 Multiple Line Plots (axes.flat method) . . . . . . . . . . 227
8.8.2 Multiple Line Plots (pop() function) . . . . . . . . . . . . 228
8.8.3 Multiple Map Plots (axes.flat method) . . . . . . . . . . 230

9. Data Analysis with CDO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236


9.1 What is CDO? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
9.2 Useful CDO Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
9.3 Basic Syntax of CDO Commands . . . . . . . . . . . . . . . . . . . . 237
9.3.1 CDO Options . . . . . . . . . . . . . . . . . . . . . . . . . . 238
9.3.2 CDO Operator Categories . . . . . . . . . . . . . . . . . . 239
9.3.3 Using Multiple CDO Operators . . . . . . . . . . . . . . . 240
9.3.4 CDO Operator Parameters . . . . . . . . . . . . . . . . . . 240
9.3.5 CDO Command Input and Output Files . . . . . . . . . . 241
9.4 Merging Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
9.5 Selections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
9.5.1 Selecting Variables . . . . . . . . . . . . . . . . . . . . . . . 244
9.5.2 Selecting Spatial Subsets (Geographical Regions) . . . . 245
9.5.3 Selecting Vertical Levels . . . . . . . . . . . . . . . . . . . 245

9.5.4 Selecting Time Subsets . . . . . . . . . . . . . . . . . . . . 246


9.6 Basic Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
9.6.1 Statistics over the Time Domain . . . . . . . . . . . . . . 249
9.6.2 Statistics over the Spatial Domain . . . . . . . . . . . . . 253
9.6.3 Statistics over the Vertical Domain . . . . . . . . . . . . . 256
9.6.4 Statistics over the Zonal Domain . . . . . . . . . . . . . . 260
9.6.5 Statistics over the Meridional Domain . . . . . . . . . . . 263
9.6.6 Statistics over Ensembles . . . . . . . . . . . . . . . . . . . 266
9.7 Interpolations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
9.7.1 Interpolation to a new horizontal grid (remapping) . . . 267
9.7.2 Interpolation in the Vertical Domain . . . . . . . . . . . . 269
9.7.3 Interpolation in the Time Domain . . . . . . . . . . . . . 270
9.8 Basic Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
9.8.1 Arithmetic Between Two Files . . . . . . . . . . . . . . . 271
9.8.2 Arithmetic Using a Constant Value . . . . . . . . . . . . . 271
9.9 Applying CDO in Climate Computations . . . . . . . . . . . . . . . 272
9.9.1 Indian Ocean Dipole Example . . . . . . . . . . . . . . . . 273
9.9.2 Sahel Rainfall Variability Example . . . . . . . . . . . . . 276
9.9.3 Creating a Land-Sea Mask File . . . . . . . . . . . . . . . 279
9.10 Using CDO with Python . . . . . . . . . . . . . . . . . . . . . . . . . . 281

Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
List of Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287

Preface
I first ventured into the world of environmental data analysis in the early 2000s, at
the end of my undergraduate degree, when I was working as a part-time research
assistant at the Max Planck Institute for Biogeochemistry (MPI-BGC) in my then
home town of Jena in Germany. I was fortunate enough to have attended an IDL
programming course as part of my geophysics studies programme. While this gave
me a head start, I still had to overcome some unforeseen challenges.
The data I was supposed to work with were saved on one of the MPI-BGC Unix
servers, on which the number crunching was also supposed to be done, meaning
I had to deal with the technical challenges of setting up and configuring the software
needed to log on to a remote server. Once I managed to log on I was greeted
by the next challenge - the Unix command line. Coming from a Microsoft Windows
desktop background I was used to clicking my way through icons, menus and buttons to
get things done. There were no buttons on the Unix command line; I had to learn Unix
commands to move around within the directory tree, access files and start programs.
While I first thought this way of working was quite time-consuming and tedious, I
later learned to appreciate the many advantages of the command line. In addition,
I had to learn how to automate tasks using shell scripts (yet another programming
language), how to use command line tools such as Climate Data Operators (CDO),
and how to manipulate and analyse climate data saved in file formats unknown to
me, which required learning a few more software tools.
Overcoming those obstacles was frustratingly time consuming. Often I felt lost, and
I most certainly came close to giving up on many occasions. The only books available at
the time were written on specialised subjects and were difficult to follow at a beginner
level, as they went far beyond what I needed to know, often using terminology I
was not familiar with. The only way to learn the skills I needed was either to
find an unfortunate colleague who was willing to spend some of their own precious
research time explaining to me how to do things, or to spend endless frustrating hours
searching internet forums for solutions.
Some progress has been made in recent years. Many software packages have
improved in terms of robustness and functionality, and they tend to be better
documented these days. There has also been a significant shift in the use of code-based
data analysis software from commercial products such as IDL or Matlab to open
source products such as Python or R. In addition, interest in the field of climate and
environmental sciences has increased significantly in recent years as a result of climate
change having become a permanent point on the political agenda, being passionately
debated on social media and being featured in high-profile documentaries (e.g., An
Inconvenient Truth) and Hollywood blockbusters (The Day After Tomorrow). From
personal experience I can also say that the number of students who want to learn
climate computing skills has been slowly but steadily increasing over the last few
years.
The motivation for writing this book was to make it easier for the next generation of
students and young researchers to get started with the analysis and visualisation of
climate and environmental data. They often work on time-limited projects or short-term
contracts which do not allow them to spend much time learning new skills.
The book will hopefully allow them to focus on the science questions they are trying
to answer instead of spending weeks and months learning how to code and how to
use the tools needed. The book guides the student through all steps, from setting
up a work environment and understanding the data they are working with, to doing the
number crunching and creating graphical output in high-quality, publishable format.

Acknowledgements
This book would not have seen the light of day without the help and support of
many people and institutions. First, I would like to thank all the scientists, colleagues
and friends from whom I learned so much over the years. There are too many to
name, but I would like to specifically thank Ina Tegen, Arnaud Desitter, Gil
Lizcano and Mark New, who all contributed significantly to improvements in my
coding abilities. I would also like to thank my long-term supporter, mentor, colleague
and friend Richard Washington, without whom I would not be where I am today. And
last, I would like to thank everyone who helped get this book out by proof-reading
chapters and helping with graphical design, including Thomas Caton Harrison and
Esther-Miriam Wagner.
In addition, I would like to thank the staff of the IT Office of the School of Geography
and the Environment, University of Oxford, for sharing their knowledge and for
building and maintaining a high-performance computational research cluster as well
as Jesus College, Oxford, for their financial support. I would also like to thank The
Woolf Institute in Cambridge for their hospitality during challenging times.
Also, a special thanks to Emma Heyn¹ and Marcus Wachsmuth for their contributions
to the cover art and design.

¹https://www.emmaheyn.com
1. Introduction
The field of climate and environmental sciences has been constantly growing in
importance over the last few decades, mainly due to an increased need to understand
how the climate system works and how it may change in the future as a result of
man-made climate change. Just as a craftsman, as part of his apprenticeship,
needs to learn what tools are available to work wood and hone his skills in using
them, young researchers need to learn what software is available, what the software
can do and how to use it.
Climate and environmental data come in a variety of formats depending on the
nature of the data and the preferences of the scientists or organisations who compiled
or generated them. The analysis and subsequent visualisation of such data require
a good understanding of the file formats and data structures, as well as the tools that
can be used to manipulate these data, to calculate statistics and to visualise the
output.
The material presented in this book is based on more than 15 years of experience
working in the field of climate sciences and teaching climate data analysis and
visualisation courses to students at all levels at Oxford University. The aim of
this book is to introduce students to the technical background, set of tools and
programming skills required to successfully analyse climate datasets and produce
scientific output in publishable format.

1.1 Overview and Objective


In most institutions where work on climate and environmental data is carried
out, the storage, analysis and visualisation of climate data is done on Unix servers.
This is because Unix servers tend to have large disk arrays attached to them which
provide the storage capacity needed to store large datasets (terabytes range). In
addition, they allow fast read and write operations and have substantially more
processing power than standard desktop or laptop machines. If storage capacity
and processing power is not essential (e.g., for exploratory research) then all of the
software packages discussed in this book can also be installed on local computers or
laptops as long as they are running a Linux operating system. All software packages
introduced in this book are freely available for research purposes.
This book provides an introduction to different types of climate data and the main
data formats in which they are being made available with a specific focus on the
most commonly used formats such as comma-separated values (CSV) and Network
Common Data Form (netCDF) files. The nature of gridded data will be discussed as
well as ways to explore the content of netCDF files. Students will learn how to work
on the Unix command line, how to use Climate Data Operators (CDO) to manipulate
climate data saved in netCDF file format, how to calculate climate statistics and
how to visualise the output using the Python programming language. Many code
examples will help in the learning process. Additional tools and techniques will be
discussed which will help with the data analysis and visualisation tasks including
how to deal with long-running processing jobs and which graphical output formats
should be used (bitmap vs. vector graphics).
For many of the subjects and software packages covered in this book (e.g., Unix,
CDO, Python) detailed in-depth user guides, tutorials and books exist. The focus
of this book is to integrate the different tools that have been shown to work well
in climate computing into a single framework, creating a seamless workflow from
understanding and analysing data through to computational data analysis
and the publication of high-quality graphical output in an efficient way.
While the collection of tools and techniques presented here has been shown to
work well for most climate computing tasks, it should be noted that there are many
roads from A to B, and there is no doubt that scientists around the world have
created data analysis environments and programming solutions that differ from what
is presented here. Where appropriate, references to additional tools or solutions are
given.

1.2 Concept of Local and Remote Machines


It is assumed that students are familiar with working on a laptop or desktop computer
running either a Microsoft Windows, macOS or Linux operating system. These
computers are normally owned by the user or provided by the work place and are
generally referred to as local computers or local machines.
It is also assumed that the climate data analysis and visualisation tasks will be
performed on a server or server cluster running Unix/Linux. A server can be thought
of as a more powerful computer with extended disk arrays attached for storage. A
server may also be referred to as a remote server or remote machine because they
tend to be located physically in a different place from your local machine such as
in a different building or in a research centre somewhere else in the world. When
multiple servers are combined to create a more powerful setup, this is referred
to as a server cluster or a computational research cluster.
In general, the local machine is used to connect to the remote server. This means that
it is possible to work from anywhere in the world as long as a reasonably fast and
stable internet connection is available.

1.3 Software
The software required on the local machine to connect to the
remote server differs between operating systems. The software will be introduced
in Section 3.2 and will be discussed separately for each operating system. Every
aspect of climate computing discussed in this book can be achieved using open source
software.
The administrator rights for the installation of software on local machines will very
likely lie with either the user or with the IT office of the institution that provides the
computer (e.g., department or research centre). In the latter case, it may be necessary
to contact the IT administrator to install the required software.
With regards to software on the remote server, users will have no or only very
limited control over the software installed and have to rely on the remote server
system administrator. However, it is very likely that most, if not all, software is
already installed on the remote server if that server is frequently used for climate
data analysis.
In exceptional circumstances, where a remote server is not available or accessible
and the climate data to be analysed are small enough then a local machine (PC or
laptop) may be used for data analysis. While Python can be installed on any operating
system (Windows, OS X or Linux), some of the other software such as CDO, ncdump
or ncview works best on a Linux system.
2. Climate Data
2.1 Climate Data Overview
The term climate data as used throughout this book may refer to any observational
or numerically simulated dataset within the field of climate and environmental
sciences. Climate data may come from any realm of the Earth’s system including
the atmosphere, oceans, land ecosystems and cryosphere. In general, climate data
show how a variable changes in space or time or both. Space here may refer to
the spatial (horizontal) domain, the vertical domain (e.g., atmospheric or oceanic
vertical profiles) or the 3-dimensional (3D) space encompassing both the horizontal
and vertical domain.
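The structure just described can be sketched with a small NumPy example. The array sizes and index values below are illustrative assumptions, but the (time, latitude, longitude) layout is the one used by most gridded climate products covered later in the book:

```python
import numpy as np

# Illustrative sizes: 12 monthly time steps on a 2.5-degree global grid.
ntime, nlat, nlon = 12, 73, 144
temperature = np.zeros((ntime, nlat, nlon))  # a variable varying in time and 2-D space

# How the variable changes over time at one grid point (a point-like view):
series_at_point = temperature[:, 36, 72]

# How the variable varies in space at one moment (a map-like view):
field_at_first_step = temperature[0, :, :]

print(series_at_point.shape)      # (12,)
print(field_at_first_step.shape)  # (73, 144)
```

A vertical dimension, where present, simply adds a fourth axis, e.g. (time, level, latitude, longitude).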
Weather stations or fixed scientific instruments tend to provide measurements from
a single point location. Their general purpose is to see how a variable changes over
time at that specific location. In contrast, climate model output and many satellite-
derived data products are available on a time-varying spatial grid, allowing analysis
of both spatial and temporal variability. Some observations may also include both
space and time-varying dimensions, one example being aircraft measurements.
There are many reasons why one might prefer a gridded spatially consistent
dataset over point observations. One such reason may be the need to fill in gaps
between observational sites to better study spatial patterns. For instance, gridded data
products generated by the Climate Research Unit (CRU¹) at the University of East
Anglia are derived by filling in the gaps between point observations using statistical
methods.
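The statistical methods actually used by CRU are more sophisticated, but the basic idea of filling gaps between stations can be illustrated with a simple inverse-distance-weighting sketch (a toy example with hypothetical names and values, treating longitude and latitude as planar coordinates):

```python
import numpy as np

def idw_grid(st_lon, st_lat, st_val, grid_lon, grid_lat, power=2.0):
    """Interpolate scattered station values onto a regular lon/lat grid
    using inverse-distance weighting (toy planar distances)."""
    glon, glat = np.meshgrid(grid_lon, grid_lat)
    out = np.empty_like(glon, dtype=float)
    for j in range(glon.shape[0]):
        for i in range(glon.shape[1]):
            d = np.hypot(st_lon - glon[j, i], st_lat - glat[j, i])
            if np.any(d < 1e-9):
                # Grid point coincides with a station: take its value directly.
                out[j, i] = st_val[np.argmin(d)]
            else:
                w = 1.0 / d**power
                out[j, i] = np.sum(w * st_val) / np.sum(w)
    return out

# Two hypothetical stations, both reporting 5.0, gridded onto a 3x3 grid:
grid = idw_grid(np.array([0.0, 10.0]), np.array([0.0, 10.0]),
                np.array([5.0, 5.0]),
                np.array([0.0, 5.0, 10.0]), np.array([0.0, 5.0, 10.0]))
```

Every interpolated value here lies between the surrounding station values, which is the defining property of this family of gap-filling schemes.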
Another reason for needing gridded data products is that weather prediction
models require observations on a regular grid as input. The generation of such
gridded observational datasets is done routinely at some of the large data centres
and meteorological agencies in the world (e.g., ECMWF, NCEP, NASA) which use
data assimilation techniques, statistical methods and modelling as part of the gridded
data generation.
¹www.cru.uea.ac.uk

2.2 Data Use Licences


The licences under which climate data are being made available vary but tend to
be one of the following. Some data are available free of charge to the public with
the restriction that they are for non-commercial use only. Other datasets may be
available free of charge for research purposes only. Those data generally require
a registration process including user verification by means of an academic email
address. Other datasets may be made available by commercial companies and will
have to be paid for.
Regardless of the data provider, it is important to always check the terms and
conditions of data use carefully before working with them and to make sure that
data sources are referenced appropriately in any publications.

2.3 Data Quality


A good understanding of the nature of the data products being analysed is essential
to assess uncertainties associated with instrumental design and setup, assumptions
being made as part of the raw data post-processing or observational problems such as
human error or gaps in the data records. It is advisable to always check the available
documentation which may come in the form of user guides, published technical
notes, descriptions on webpages or peer-reviewed scientific publications.

Never blindly trust data, regardless of where they were sourced from. Errors
creep in quite easily for a multitude of reasons. Check for errors by visually
inspecting raw data and by creating simple test plots. The brain is quite
good at processing visual information and spotting things that are not quite
right. Use common sense.
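Part of this checking can be automated. The sketch below (the variable names and thresholds are illustrative assumptions, not prescribed limits) flags missing values and physically implausible entries in a raw record; the same data would then typically also be examined with a quick test plot:

```python
import numpy as np

def sanity_report(values, valid_min, valid_max):
    """Count missing (NaN) and physically implausible entries in a record."""
    arr = np.asarray(values, dtype=float)
    n_missing = int(np.isnan(arr).sum())
    in_range = (arr >= valid_min) & (arr <= valid_max)
    n_suspect = int((~in_range & ~np.isnan(arr)).sum())
    return {"missing": n_missing, "suspect": n_suspect}

# Example: daily 2 m temperatures in degrees Celsius; 99.9 is a common
# missing-data placeholder that would silently corrupt any statistics.
temps = [12.3, 14.1, np.nan, 99.9, 13.0]
print(sanity_report(temps, valid_min=-90.0, valid_max=60.0))
# {'missing': 1, 'suspect': 1}
```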

2.4 Accessing Climate Data


Institutions differ in how they make their data available. One of the simplest and most
direct forms of data access is via the transmission of Excel spreadsheets, scanned data
records or formatted text files (e.g., from a weather station).

However, more often than not, data files are too large to be sent by email which
means other methods for retrieving the data need to be provided. Nowadays, most
national and international data centres have web portals for browsing data products,
many of which come with the functionality to do temporal and spatial sub-setting
in order to give users an easier way to handle large datasets.
Some institutions make data available on a FTP (file transfer protocol) server. FTP
servers may be accessible through a web browser allowing manual file downloads or
via other computational resources allowing data to be downloaded programmatically
(e.g., Bash or Python scripts).
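A sketch of such a programmatic download using Python's standard ftplib module is shown below. The server name, directory and file-naming template are placeholders; every data centre uses its own naming conventions, which the relevant documentation will describe.

```python
from datetime import date, timedelta
from ftplib import FTP

def monthly_filenames(start, end, template="precip_{:%Y%m}.nc"):
    """List monthly file names between two dates (template is a placeholder)."""
    names = []
    current = date(start.year, start.month, 1)
    while current <= end:
        names.append(template.format(current))
        # Advance to the first day of the next month.
        current = (current.replace(day=28) + timedelta(days=4)).replace(day=1)
    return names

def download_all(host, remote_dir, filenames):
    """Fetch each file via anonymous FTP (host and path are hypothetical)."""
    with FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(remote_dir)
        for name in filenames:
            with open(name, "wb") as fh:
                ftp.retrbinary("RETR " + name, fh.write)

# Build the file list for January to March 2010; calling, for example,
# download_all("ftp.example.org", "/pub/precip", names) would then fetch them.
names = monthly_filenames(date(2010, 1, 1), date(2010, 3, 31))
```

Generating the file list separately from the download step makes it easy to check exactly which files will be requested before starting a long transfer.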
In some cases, especially when a large download would be impractical, data may
be sent physically on storage devices such as USB (universal serial bus) flash drives,
external hard drives, CD-ROM (compact disc read-only memory) or DVD (digital
versatile disc).
With rapid advances in the technology used to accommodate large datasets such as
these, it is likely that new data acquisition methods will be developed in the future.

2.5 Types of Climate Data

2.5.1 Analyses and Reanalyses Products


In order to predict future states of the atmosphere, numerical weather prediction
(NWP) models require as input spatially varying meteorological fields that describe
the state of the atmosphere at the time of model initialisation. A complete and
spatiotemporally consistent observational dataset without gaps is needed here. This
is achieved by assimilating point observations from weather stations, buoys and
ships as well as satellite retrievals into gridded observational products by means of
modelling and statistical methods. The output from this process can then be used to
initialise the forecast simulation. These gridded observation-based data products are
called analyses.
A gridded observational dataset without gaps has considerable value for the study
of climate. However, due to model updates and continuous improvements in the
assimilation process, analysis products can develop inconsistencies through time.
For this reason initiatives have been set up to regenerate observational input
fields over extended periods (usually decades) using a ‘frozen’ (fixed) version of
the assimilation and model code. These products are called reanalyses. Reanalysis
products also contain model-generated fields that are not based on observations.
Because reanalyses are internally consistent over time they are often used to study
climate processes and variability from the recent past. Some of the latest reanalysis
products and their properties are listed in Table 2.5.1.1.

Table 2.5.1.1: Some of the reanalyses commonly used in climate science and modelling.

Institution  Name          Period          Resolution                Reference
ECMWF        ERA-40        1957 - 2002     1.125 x 1.125 x 60        Uppala et al., 2005
ECMWF        ERA-Interim   1979 - 2018     0.75 x 0.75 x 60          Dee et al., 2011
ECMWF        ERA-5         1950 - present  0.28125 x 0.28125 x 137   ECMWF²
ECMWF        ERA-20CM      1900 - 2010     0.75 x 0.75 x 137         Hersbach et al., 2015
NCEP/DOE     Reanalysis 2  1979 - present  2.5 x 2.5 x 28            Kanamitsu et al., 2002
NCEP         CFSR          1979 - present  0.5 x 0.5                 Saha et al., 2010
NASA         MERRA         1979 - 2016     0.667 x 0.5 x 42          Rienecker et al., 2011
NASA         MERRA-2       1980 - present  0.625 x 0.5 x 42          Gelaro et al., 2017
JMA          JRA-55        1958 - present  ? x ? x 60                Kobayashi et al., 2015

2.5.2 Climate and NWP Model Output


Climate and NWP models can generate large amounts of output. These output
files will primarily contain gridded data. How the model output is structured and
organised varies between models. However, climate model output is likely to be made
available in netCDF, GRIB or PP (Met Office) format.

2.5.3 Point Observations


Point observations are valid only for a particular location, unlike gridded datasets
which represent regions or the globe. One may differentiate between static point
observations and moving point observations.
²https://confluence.ecmwf.int/display/CKB/ERA5+data+documentation
Static point observations refer to observations made at a specific geographical
location by an instrument or weather station whereby measurements are mostly
taken at or close to the surface.
Moving point observations come, for instance, from instruments installed on moving
carriers such as aircraft, ships or radiosondes attached to weather balloons.
Analysing moving point observations is more difficult because the measurements
vary in both time and space simultaneously. For comparisons with gridded model
or satellite data, uncertainties regarding temporal and spatial colocation have to be
considered.
While weather balloons move vertically and geographically with time driven by the
prevailing wind systems at different altitudes, the resulting atmospheric profiles are
often taken to be representative of a single geographical point in subsequent data
analysis.

2.6 Data File Formats


Climate data come in a variety of file formats which can be broadly divided into text-
based, binary or a combination of both. Self-describing data formats such as GRIB
and netCDF are formats where a description of the data the file contains is saved
as part of the file itself. Such data descriptions may also be referred to as metadata.
The main climate and environmental data formats will be briefly described in the
following sub-sections. For the remainder of this book, the focus will be primarily
on the analysis of data in netCDF format.

2.6.1 Plain Text and ASCII


Plain text is a rather loosely used term for files containing only readable characters
(no objects, images or binary data). One may refer to a plain text file when it
is unknown how the characters in the file are encoded. If the popular American
Standard Code for Information Interchange (ASCII) character-encoding scheme is
used then the file may be referred to as an ASCII file.
Most ASCII data files are formatted in such a way that the data may be organised
into rows and columns. Data values within a row may be separated by spaces or a
specific delimiter. If the delimiter is a comma then the file is referred to as a
comma-separated values (CSV) file. ASCII files often have a header at the beginning of the
file that provides information about where the data files were generated, what the
data units are and how the data that follow are organised (e.g., column headers).
File extensions vary but include .txt for general text files, .asc for ASCII files and
.csv for comma-separated values files. In some cases no file extension is included
in the filename.
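Reading such a file in Python is straightforward with the standard csv module, which uses the header row as column names. A small sketch; the station record below is invented for illustration:

```python
import csv
import io

# A small station record as it might arrive as a CSV file; the header
# line and the values are invented for illustration.
text = """date,temperature_c,rainfall_mm
2020-01-01,12.4,0.0
2020-01-02,11.8,3.2
"""

# csv.DictReader takes the first row as column names; for a file on disk,
# pass an open file object instead of the io.StringIO wrapper used here.
rows = list(csv.DictReader(io.StringIO(text)))
temps = [float(row["temperature_c"]) for row in rows]
print(temps)  # -> [12.4, 11.8]
```

Note that all values arrive as strings and have to be converted explicitly (here with float), which is also a natural place to handle missing-value flags.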

2.6.2 Binary
Binary files often have the file extension .bin or .dat or do not have a file extension at
all. A binary file can contain any type of data encoded in binary form. When a binary
file is opened with a text editor a typically unintelligible mess of characters and
symbols is displayed. Binary files can be viewed using a hex editor. Many software
packages have libraries or packages that provide functionality to read binary files.
Pure binary files have become less common in climate sciences over the last two
decades.
The Met Office provides some of its model data in a specifically developed binary
format known as PP files. These are discussed in more detail in Section 2.6.5.
Binary data may be stored in big-endian (most significant byte first) or little-endian
(least significant byte first) byte order; the byte order must be known in order to
read a binary file correctly.
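Python's standard struct module illustrates the difference between the two byte orders, and what happens when a file is read with the wrong one:

```python
import struct

# Pack the 32-bit integer 1 in big-endian ('>') and little-endian ('<') order.
big = struct.pack(">i", 1)     # b'\x00\x00\x00\x01', most significant byte first
little = struct.pack("<i", 1)  # b'\x01\x00\x00\x00', least significant byte first

# Reading big-endian bytes with the little-endian format silently
# produces a wrong (but syntactically valid) value.
wrong = struct.unpack("<i", big)[0]  # -> 16777216
```

Silently wrong values like this are one reason why the visual sanity checks recommended in Section 2.3 matter: a byte-order mistake rarely raises an error, it just produces implausible numbers.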

2.6.3 GRIB
The Gridded Binary (GRIB) format is a self-describing data format that is widely used
to store model and satellite data in climate sciences. The GRIB format is standardised
by the World Meteorological Organization’s Commission for Basic Systems. There
are two GRIB Editions: GRIB1 and GRIB2. GRIB files usually have the file extension
.grb or .grb2. The main advantage of the GRIB data format is that file sizes are smaller
relative to normal binary files.
In some cases it is useful to convert GRIB files to netCDF for more convenient file
manipulation. Table 2.6.3.1 lists some of the tools that may be useful for converting
files in GRIB format to files to netCDF format. The output should be checked
carefully as the conversion process may have some unexpected results especially
when it comes to the internal organisation of the netCDF file.

Table 2.6.3.1: Tools for generic conversion from GRIB to netCDF file format.

Tool Description
NCL NCL script ncl_convert2nc
Xconv/Convsh Can read GRIB1/2, can write netCDF
CDO cdo -f nc copy ifile.grb ofile.nc
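When many files need converting, the CDO command from Table 2.6.3.1 can be wrapped in a short Python script. A sketch, assuming cdo is installed and on the search path; the input file name and the helper function are examples, not part of CDO itself:

```python
import subprocess
from pathlib import Path

def grib_to_netcdf(grib_file, run=False):
    """Build (and optionally run) the CDO command 'cdo -f nc copy in out'."""
    out_file = Path(grib_file).with_suffix(".nc")
    cmd = ["cdo", "-f", "nc", "copy", str(grib_file), str(out_file)]
    if run:
        # check=True raises CalledProcessError if the conversion fails.
        subprocess.run(cmd, check=True)
    return cmd

cmd = grib_to_netcdf("era5_2010.grb")  # build the command only, do not execute
```

Building the command list separately from executing it makes the script easy to test, and a loop over a directory of GRIB files then converts a whole archive in one go.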

2.6.4 netCDF
The netCDF file format is a self-describing data format developed at Unidata³. The
netCDF file format is one of the most common file formats in climate science
and most analytical software packages for climate data analysis have functions for
reading files in netCDF format. Many climate and numerical weather prediction
models will make their data available in netCDF format. To work with netCDF files
it is essential that the netCDF libraries are installed on the operating system
that is used to analyse the data. The file extension for netCDF files is .nc.
Different versions of the netCDF file format exist. Version 3 (netCDF-3) is known
as the classic format. Version 4 (netCDF-4) uses the HDF5⁴ format with some
restrictions.
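One way to tell the two versions apart without any netCDF libraries installed is to inspect the file's first bytes: netCDF-3 files begin with the ASCII characters CDF, while netCDF-4 files carry the HDF5 signature. A small sketch; the helper function and the demo file names are ours, and real data files would of course be much larger:

```python
def netcdf_flavour(path):
    """Guess the netCDF variant of a file from its magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(8)
    if magic[:3] == b"CDF":
        return "netCDF-3 (classic)"
    if magic == b"\x89HDF\r\n\x1a\n":
        return "netCDF-4 (HDF5-based)"
    return "unknown"

# Demonstration on two tiny files carrying the respective signatures.
with open("demo3.nc", "wb") as fh:
    fh.write(b"CDF\x01" + bytes(28))
with open("demo4.nc", "wb") as fh:
    fh.write(b"\x89HDF\r\n\x1a\n" + bytes(24))

flav3 = netcdf_flavour("demo3.nc")  # -> netCDF-3 (classic)
flav4 = netcdf_flavour("demo4.nc")  # -> netCDF-4 (HDF5-based)
```

This kind of check is handy on a server where it is unclear which tools a given file will open with.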

2.6.5 PP
The PP format is the Met Office's own binary model-data format; a technical description
is available from CEDA (https://artefacts.ceda.ac.uk/badc_datadocs/um/umdp_F3-UMDPF3.pdf).
PP files are stored in big-endian byte order.
³http://www.unidata.ucar.edu/software/netcdf
⁴https://en.wikipedia.org/wiki/Hierarchical_Data_Format
3. Unix
3.1 Introduction to Unix
Like Windows or Mac OS X, Unix/Linux is a computer operating system. The
difference between Unix and Linux can be a bit confusing as both terms are often
used interchangeably. Unix operating systems are mainly commercial in nature
whereas Linux is an open source freely available clone of Unix.
Linux operating systems are also found on some mobile devices such as mobile
phones (e.g., Android) and tablets.

3.1.1 Linux Distributions


There are various Linux distributions (also referred to as distros or flavours). The
most popular and most widely supported Linux distribution for personal use is
Ubuntu. But there are many more distributions or sub-versions available that are
worth exploring (e.g., Linux Mint, Fedora or elementary OS). Table 3.1.1.1 lists some
of the more popular Linux distributions.

Table 3.1.1.1: Most popular Linux distributions.

Distribution Based on
CentOS¹ Red Hat Enterprise Linux
Elementary OS² Ubuntu
Fedora³ Red Hat
Linux Mint⁴ Ubuntu
openSUSE⁵ Slackware Linux
Ubuntu⁶ Debian
¹www.centos.org
²www.elementary.io
³www.getfedora.org
⁴www.linuxmint.com
⁵www.opensuse.org
⁶ubuntu.com
Unix 13

The development of some Linux distributions is sponsored by commercial companies
such as Red Hat⁷ (Fedora), SUSE⁸ (openSUSE) or Canonical Ltd.⁹ (Ubuntu) whereas
other Linux distributions are developed and maintained entirely by volunteers of
the Linux community.
The development history of many Linux distributions can often be traced back to
some of the past and present big players such as Debian, Red Hat or Slackware. A
complete timeline of the development history of Linux distributions can be found on
the Linux Distribution Wikipedia webpage¹⁰.

3.1.2 Desktop versus Server


A desktop environment refers to the graphical user interface (GUI) that allows
interaction with the underlying operating system. To facilitate this, peripherals such
as a keyboard, mouse and a monitor are used to interact with buttons, menus, icons
and folders, as well as to enter text. Almost all personal computers (PCs) or laptops
have a desktop environment.
A Linux server may be viewed as a big computer that has much more processor
power, more RAM and more storage capacity than a personal computer or laptop.
A server does not normally have any peripherals (keyboard, mouse or monitor)
attached. The way a user interacts with the server is via the command line interface
(CLI) (introduced in Section 3.3.1), which is accessed via a remote login (Section 3.2.4).
Sometimes a server is also referred to as a headless server because no display (head)
is attached. This concept is important when running software on a remote server that
uses GUIs (see Section 3.2.3).
Linux distributions usually come in two versions. The first is a desktop version that
can be installed on a laptop or PC and that includes a graphical desktop environment.
The second is a server version that is designed for the setup and running of a server
and where the graphical desktop environment has been stripped away leaving just
the operating system itself.
⁷www.redhat.com
⁸www.suse.com
⁹canonical.com
¹⁰https://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg

3.1.3 High Performance Computing on a Server


Most climate centres and many academic institutions around the world operate
servers running Linux to facilitate large computational processing jobs such as
climate model runs and data processing. This is often referred to as high performance
computing (HPC). These servers tend to have large disk arrays attached that allow for
the storage of large datasets such as output from climate models or satellite-derived
observations (Figure 3.3.1).
A server may be physically located quite close to the user (e.g., in the same building
or department) or far away in a different country. The physical server location does
not matter as the login process via the internet is the same. However, a stable internet
connection is helpful. The process of logging on to a server is also referred to as
remote login and the server may be called a remote server. All the technical steps
involved in logging in to a remote server are introduced in the following Section 3.2.
Some of the terminology associated with a remote server can be confusing. While the
term server is used quite freely it is more likely that the computational infrastructure
takes the form of a cluster. A cluster is created by combining individual nodes
whereby each node can be seen as an isolated computer set up to run as a server.
The individual nodes are controlled by a master server (node) and communication
between them is facilitated via a local area network (LAN). Each node will have a
specific number of processors. Understanding the concept of processors, nodes and
clusters is important when parallelising processes.

Figure 3.3.1: HPC cluster at the School of Geography and the Environment, Oxford University.

3.2 Accessing a Remote Server


A very common scenario is shown in Figure 3.2.1 where it is assumed that a user
works from home and aims to connect from a laptop or PC (local machine) to
a remote server (remote machine). The remote server is physically located in an
academic institution such as a university or research institute and is connected to
the institution’s internal network.
In order to access a remote server and start working on it the following four points
need to be considered. First, a Virtual Private Network (VPN) connection needs to be
established that creates a secure connection to the institution’s network (also referred
to as a VPN tunnel). Second, a Secure Shell (SSH) connection needs to be established
between the local machine and the remote server enabling end-to-end encryption
of network traffic. Third, if software on the server uses GUIs then an X Window
Manager running on the local machine needs to handle the display of graphical
windows on the local screen. Fourth, some software may be needed to enable file
transfer between the local machine and the remote server.

If the local machine is connected directly to the academic institution’s
network (e.g., department’s internal wired or wireless network) then a VPN
connection is not normally required.

Figure 3.2.1: Example of a common networking scenario for remote server access. Commands are
sent from the laptop via SSH connection to the server through the VPN tunnel (green arrow). X11
Forwarding facilitates graphical information being sent back to the laptop display if required (purple
arrow).

The software required to facilitate the four steps outlined above depends on the
operating system installed on the local machine. The list of software shown in Table
3.2.1 is just a recommendation but includes software that has been shown to work
well.

Table 3.2.1: Software recommendations for accessing a remote server (¹Free, ²Commercial, ³Cisco-
compatible VPN client).

#  Function                                   Windows           Mac OS X                Linux
1  Set up a secure VPN connection             AnyConnect² ³     AnyConnect² ³,          AnyConnect² ³,
   using a VPN client                                           native VPN client¹ ³    vpn¹, vpnc¹ ³
2  Start X Window Manager to enable           VcXsrv¹, Xming¹,  XQuartz¹                Terminal command:
   X11 Forwarding for displaying graphics     Exceed²                                   ssh -Y¹
3  Log on to a remote server via              PuTTY¹            iTerm¹, iTerm2¹,        Terminal¹
   SSH connection                                               Terminal¹, xterm¹
4  Transfer files between remote              FileZilla¹,       FileZilla¹,             FileZilla¹,
   server and local machine                   WinSCP¹           scp¹ command            scp¹ command

An X Window Manager is only required if a graphical window is to be
opened on the local display. When working only on the command line an
X Window Manager is not required.

The following four subsections will explain in more detail how to

• use login details (Section 3.2.1)
• connect to a Virtual Private Network (VPN) (Section 3.2.2)
• start an X Window System (X11 forwarding) (Section 3.2.3)
• access the remote server (Section 3.2.4)

File transfer between the local and remote machine is covered in Section 3.3.5.

3.2.1 Remote Server Login Details


Before it is possible to log on to a remote server a Unix account needs to be set
up on the server by the system administrator. In order to log on three pieces of
information are required: the username, password and server name. These details
should be obtained from the system administrator.
An initial password will have been set by the system administrator but it is highly
recommended to change that password once logged on to the server (see Section
3.6.5 for instructions). Depending on the server
setup, the user may be asked automatically to change the password at the first login.
The server name can take the form of an IP address (e.g., 163.1.38.93) or the associated
resolved name of the server (e.g., linux.ouce.ox.ac.uk). Both representations can be
used to log on to the server.

3.2.2 Virtual Private Network (VPN)


If an institution is running a VPN then it is necessary to connect to it in order to
access the institution’s network services (e.g., when working from home). A VPN
connection allows secure internet traffic between computers and an institution’s
network by creating a ‘tunnel’ through the internet (Figure 3.2.1). Using this ‘tunnel’
to send information securely is sometimes referred to as tunneling.
To establish a VPN connection a VPN client needs to be installed on the local
machine. Check with the system administrator which VPN client is recommended
(e.g., Cisco-compatible or not). Some commonly used commercial and free VPN
clients are listed in Table 3.2.1.
In some cases the institution may provide a pre-configured VPN client or a profile
configuration file (file extension .pcf) that contains the connection details. If not then
the VPN client needs to be configured manually. Ask the system administrator for the
VPN login details and configuration advice. The VPN connection details will differ
from the Unix account login details and should not be confused. Example details
required for setting up a Cisco-compatible VPN client are given in Table 3.2.2.1.

Table 3.2.2.1: Input details required by Cisco-compatible VPN clients.

Required Notes Example


IPSec gateway Gateway address vpn.ox.ac.uk
IPSec ID VPN group authentication profile oxford
IPSec secret VPN group password *****
Xauth username VPN username smith
Xauth password VPN password *****

3.2.3 X Window System (X11 forwarding)


Some Unix commands executed on the remote server will attempt to open a graphical
window or GUI. Examples are starting Matlab, RStudio or a graphical text editor. The
server does not have any peripherals such as a mouse, keyboard or monitor attached
to it (‘headless’ server). Therefore, an executed command to open a GUI is going to
fail.
This is because a Unix environment variable named DISPLAY is not set and therefore
the Unix system does not know on which display to open the GUI. To make the
server find the monitor connected to the local machine (e.g., laptop display) an X
Window System (aka X11 forwarding) needs to be running on the local machine and
X11 forwarding needs to be enabled. Without the X Window System running only
work on the command line can be carried out (this is often enough). How to start
the X Window System and enable X11 forwarding depends on the operating system
running on the local machine and will be discussed in the following subsections.
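A script can test for this situation before attempting to open a window simply by checking whether the DISPLAY variable is set, and fall back to non-graphical behaviour instead of failing. A minimal sketch; the function name is ours:

```python
import os

def display_available(environ=None):
    """Return True if an X display appears reachable (DISPLAY is set).

    On a headless server without X11 forwarding DISPLAY is unset, so a
    script can skip any plotting that would open a window.
    """
    if environ is None:
        environ = os.environ
    return bool(environ.get("DISPLAY"))

# With X11 forwarding enabled, DISPLAY typically looks like 'localhost:10.0'.
ok = display_available({"DISPLAY": "localhost:10.0"})  # -> True
```

Accepting the environment as an argument (defaulting to os.environ) makes the check easy to exercise without changing the real environment.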

3.2.3.1 X Window System on Windows OS

To get an X Window System running on Windows OS an X Window Manager needs
to be installed. A popular X Window Manager for Windows OS is the open source
software VcXsrv¹¹ (recommended). For more options see Table 3.2.1. No configuration
of the software is required. Once installed start VcXsrv and it will run in the
background. The VcXsrv symbol should be visible in the Windows taskbar while
VcXsrv is running.

X11 forwarding needs to be enabled in PuTTY as well for it to work
correctly (see Section 3.2.4.1 for details).

3.2.3.2 X Window System on Mac OS X

To get an X Window System running on Mac OS X an X Window Manager needs to
be installed. On Mac OS X 10.6.3 or later (including El Capitan) the free X Window
¹¹https://sourceforge.net/projects/vcxsrv

Manager XQuartz¹² should be installed. After the installation it is recommended to
reboot the operating system. There is no need to start XQuartz manually, it will be
running in the background automatically.

In order to enable X11 forwarding on Mac OS X the -Y option should be
added to the ssh command (see Section 3.2.4.2 for details).

3.2.3.3 X Window System on Linux

There is no need to install an X Window Manager on Linux operating systems, as
they communicate quite smoothly with Unix servers.

In order to enable X11 forwarding on Linux the -Y option should be added
to the ssh command (see Section 3.2.4.3 for details).

3.2.4 Connecting to a Remote Server


At this stage a VPN connection (if required) and the X Window System (if needed)
should be running. How to connect to the remote server depends on the operating
system running on the local machine and will be discussed in the following
subsections. In order to connect to the remote server the login details username, password
and server name are required (Section 3.2.1). In the examples in the following
subsections jsmith is used as a username and linux.ouce.ox.ac.uk is used as a server
name.

3.2.4.1 Connecting to a Remote Server From Windows OS

In order to connect to a remote server from a Windows OS the Secure Shell (SSH)
client PuTTY¹³ needs to be installed. PuTTY is a freely available and widely used SSH
client. Download the appropriate Windows installer or binary executable file from
the webpage and install PuTTY on the local machine. Configure PuTTY following
the steps outlined below.
¹²http://www.xquartz.org
¹³https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html

1. Start PuTTY.
2. Go to Session and enter the server name under Host Name.
3. Select Connection Type SSH.
4. Go to Connection –> SSH –> X11 and check the box Enable X11 forwarding.
5. Go back to Session, give this configuration a name in the field Saved Sessions
and click the Save button.
6. The name under which this configuration was saved should now always be
listed on the right-hand side of the main Session window every time PuTTY is
opened.

Once PuTTY configuration is completed do the following to connect to the remote
server.

1. Open PuTTY.
2. Select the name the configuration was saved under on the right-hand side of
the main PuTTY Session window.
3. Click the Load button.
4. Click the Open button (as a short-cut just double-click on the configuration
name).
5. A terminal window will open. Enter username and password when prompted.

Upon first login to a remote server the following prompt may appear.

The authenticity of host 'servername (163.1.38.97)' can't
be established. RSA key fingerprint is
4d:fa:ab:36:c0:c4:5f:c2:e6:a6:0f:2a:d4:48:af:24. Are you
sure you want to continue connecting (yes/no)?

Confirm this by typing yes. This message will not appear on subsequent logins.

3.2.4.2 Connecting to a Remote Server From Mac OS X

Mac OS X is an operating system that is very similar to Linux under the hood.
This makes the communication between the local machine and the remote server
straightforward. Access the remote server from Mac OS X following the steps
outlined below.

1. Open a command line interface (CLI) by starting the program iTerm, iTerm2,
Terminal or xterm.
2. Use the command below to connect to the remote server. ssh starts a Secure
Shell connection. -Y enables X11 forwarding.

ssh -Y jsmith@linux.ouce.ox.ac.uk

3. username is the username of the user’s Unix server account. servername is the
name of the Unix server.
4. The user will be prompted to enter a password.

Note that in some Mac OS X terminal windows the cursor does not change
while the password is entered. This often leads to students thinking that
they cannot enter their password.

Upon first login to a remote server the following prompt may appear.

The authenticity of host 'servername (163.1.38.97)' can't
be established. RSA key fingerprint is
4d:fa:ab:36:c0:c4:5f:c2:e6:a6:0f:2a:d4:48:af:24. Are you
sure you want to continue connecting (yes/no)?

Confirm this by typing yes. This message will not appear on subsequent logins.

3.2.4.3 Connecting to a Remote Server From Linux OS

For users working on a Linux OS such as Ubuntu or Mint the communication
between the local machine and remote server is relatively straightforward. Access
the remote server from a Linux machine by following the steps outlined below.

1. Open a command line interface (CLI) by starting the program Terminal. It
should be available in all Linux distributions by default.
2. Use the command below to connect to the remote server. ssh starts a Secure
Shell connection. -Y enables X11 forwarding.

ssh -Y jsmith@linux.ouce.ox.ac.uk

3. username is the username of the user’s Unix server account. servername is the
name of the Unix server.
4. The user will be prompted to enter a password.

Upon first login to a remote server the following prompt may appear.

The authenticity of host 'servername (163.1.38.97)' can't
be established. RSA key fingerprint is
4d:fa:ab:36:c0:c4:5f:c2:e6:a6:0f:2a:d4:48:af:24. Are you
sure you want to continue connecting (yes/no)?

Confirm this by typing yes. This message will not appear on subsequent logins.

3.3 First Steps on the Unix server

3.3.1 The Terminal Window


No matter whether PuTTY (Windows OS) is used or one of the terminal applications
for Mac OS X and Linux a terminal window should now be open. A blinking cursor
indicates that the command line interface (CLI) of the terminal is available. The CLI
allows the user to enter text-based commands. The Shell (a Unix program that runs in
the background) interprets commands and passes them on to the Unix system which
executes them.
The dollar sign ($) on the left-hand side of the cursor indicates the command prompt.

The command prompt may look slightly different depending on which ter-
minal application was used to connect to the remote server. The appearance
of the command prompt may also be modified by the user.

In the remainder of the book a dollar sign at the beginning of code examples will be
used to represent the Unix command prompt.

Once a command has been entered hitting the Enter key will execute the command.
Some commands will produce output in text form that is displayed in the terminal
window. Other commands may open graphical windows. Some commands will
complete very quickly, while others will take longer to complete. Once Unix has
finished processing the command the command prompt will be available again.
The up and down arrow keys can be used to scroll backwards and forwards through
the command history. This can be very useful when commands including long paths
need to be (re-)entered.
The tab key can be used in most Unix systems to auto-complete commands, file
names and paths. Hitting the tab button twice in a row first auto-completes and then
presents possible options for completion if a full auto-complete of the command, file
name or path was not possible as multiple options are available.
An SSH session can be terminated by executing the command ‘exit’ or by closing
the terminal window by using the mouse. In both cases the SSH connection to the
remote server will be terminated.

How to copy and paste text inside the terminal depends on the
terminal application used. If the standard way of using Ctrl + C (copy)
and Ctrl + V (paste) does not work then try the following. Highlight
the text to copy with the mouse (double-click to highlight a whole word or
line), then use a single click of the right mouse button (or sometimes the
mouse wheel) to paste the highlighted text onto the command line.

3.3.2 The Shell


The Shell is a Unix program that starts automatically in the background when logging
on to the remote server. The shell interprets Unix commands and passes them on to
the Unix system. There are different shells that can be used. Although their basics are
very similar differences exist in the syntax and some commands. The most common
shell is the Bash shell. To test which shell is currently being used type the following
command on the Unix command prompt.

$ echo $SHELL

As a beginner it is suggested to stick with the Bash shell. However, to try a different
shell type csh for the C shell, tcsh for the TENEX C shell, or ksh for the Korn shell.
More information about shells can be found on Wikipedia’s Unix Shell¹⁴ page.

3.3.3 Linux Directory Structure and Home Directory


What is called a folder in Windows OS is called a directory on a Linux system.
Both are in principle the same. Files or sub-directories can be placed inside them. An
example of a common Linux server directory tree is shown in Figure 3.3.3.1 including
short descriptions about each of the main directories.
¹⁴http://en.wikipedia.org/wiki/Unix_shell

Figure 3.3.3.1: Example of a Unix server directory structure.

A home directory will have been created by the system administrator as part of the
Unix account setup. The home directory (also referred to as user area) is where files
can be saved and sub-directories can be created by the user. For example, in Figure
3.3.3.1 two home directories were created as /home/rjones and /home/jking. When
logging on to the server users will be directed automatically to the root of their home
directory.

The directory tree structure may vary slightly from the example shown in Figure
3.3.3.1 depending on the server setup. To show the full path to the root of the home
directory the pwd (present working directory) command can be used. For the user
rjones the pwd command would return the following (see Figure 3.3.3.1).

/home/rjones

3.3.4 Quota
There are limits as to how much data can be stored in the user area. The user’s initial
quota will have been set by the system administrator when their user account was
created. The size of this quota can be found by using the quota command or by asking
the system administrator. The quota command returns details about the user’s quota
in tabular form. Table 3.3.4.1 provides details about the information presented in each
column.
Headers 2 to 5 show the block quota. This refers to the actual amount of space used
on the system in blocks wherein one block equals 1 KB. Headers 6 to 9 show the file
quota. This refers to the number of files and directories on the system.

Table 3.3.4.1: Table headers returned by the quota command.

#  Header      Explanation
1  Filesystem  Name of the file system for which quota information is displayed.
2  blocks      The actual number of blocks used.
3  quota       The soft quota for blocks; warnings will be displayed when this amount is exceeded.
4  limit       The hard quota for blocks; this limit cannot be exceeded.
5  grace       The amount of time left to get back below the block soft quota.
6  files       The number of files used.
7  quota       The soft quota for files; warnings will be displayed when this amount is exceeded.
8  limit       The hard quota for files; this limit cannot be exceeded.
9  grace       The amount of time left to get back below the file soft quota.
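On systems where the quota command is not set up, the general-purpose du (disk usage) command gives at least a rough idea of how much space a directory occupies. Note that du reports actual usage only and knows nothing about quota limits.

```shell
# Summarise the size of the home directory:
# -s prints a single summary line instead of every sub-directory,
# -h prints human-readable sizes (K, M, G).
du -sh "$HOME"
```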

3.3.5 File Transfer to and from the Server


There are two ways to copy or move files between the server and the local machine.
First, if the home directory on the Unix server is mapped on the local machine, then
it is as simple as opening a file manager such as File Explorer (Windows 10), Finder
(Mac OS X) or Nautilus (Linux) and copying, pasting, deleting or moving directories
and files between the local and remote machine as normal. How to map the home
directory is introduced in Section 3.3.6.
Secondly, files can be moved between the local machine and remote server by using
specific software (Windows OS) or the terminal CLI (Mac OS X and Linux). In both
cases the user name, password and server name of the Unix account are required
(Section 3.2.1). The file transfer options are discussed for Windows OS users and
Mac OS X/Linux users in the following two sub-sections.

3.3.5.1 File Transfer between Windows OS and Remote Server

A copy of the freely available WinSCP¹⁵ software can be obtained and installed on
the Windows machine. It is recommended to use the Norton Commander Interface
which provides a split window with the local drive on the left-hand side and the home
directory on the server on the right-hand side. Files and directories can be dragged
and dropped between your local machine and the server. If the Norton Commander
¹⁵http://sourceforge.net/projects/winscp

Interface is not the default setting then this can be changed through the main menu:
Options –> Preferences –> Environment –> Interface –> Commander.
The Linux account login details can be entered in the WinSCP start-up window.
Similar to PuTTY the login details can be saved for easier access using the Save
button. After successful login the Norton Commander Interface will show the local
drive on the left-hand side and the home directory on the remote server on the right-
hand side. Files and directories can be copied, moved or deleted on both sides. The
mouse can be used to drag and drop files and directories from the local drive to the
remote server and vice versa.

3.3.5.2 File Transfer on the Command Line for Mac OS X and Linux

If a client with a GUI such as FileZilla is not available on Mac OS X and Linux
operating systems then files can be copied between the local machine and the remote
server by using the scp (secure copy) command. No matter which direction a file or
directory is copied the general syntax of the scp command is always the same and
looks like the following.

scp <options> <source> <destination>

The source and destination can be either the local drive or the remote server
depending on which direction a file or directory is being transferred. The general
syntax for the remote server is as follows (note the colon at the end).

username@servername:

The following are working examples of copying files from the local machine to the
remote server (Example 1) as well as from the remote server to the local machine
(Example 2).
Example 1

1. Open a terminal window.


2. Go to the directory where the file to be copied is located.
3. Copy the local file myplot.jpg to the remote server into the directory research/
located at the root of the user area using the following command.

scp myplot.jpg jking@linux.ouce.ox.ac.uk:research/

Example 2

1. Open a terminal window.


2. Go to the directory where the copied file is supposed to be saved.
3. Copy the file data.csv from the remote server directory data/ (located at the
root of the user area) into the current directory on the local machine using the
following command.

scp jking@linux.ouce.ox.ac.uk:data/data.csv ./

The construction and use of relative and full paths is discussed in more
detail in Section 3.4.3.

3.3.6 Mapping the Linux Home Directory as a Remote


Network Drive
If the remote home directory is available as a Server Message Block (SMB) shared
drive (ask administrator) then it should be possible to map (known as mount in
Linux) the remote server home directory on the local machine as a remote drive.
This has the advantage of having direct access to the files via the local file browser
or command line. Locally installed software such as text editors or image software
can then be used to open files located on the server.
How to map (mount) the remote drive depends on the operating system installed
on the local machine and is discussed in the following subsections. Note that the
instructions may vary and are likely to only work if the VPN connection to the
institution’s network is up and running.

3.3.6.1 Mapping Home Directory on Windows

The following instructions for mapping a remote network drive are based on Windows
10.

1. Open the File Explorer.


2. Select This PC from the pane on the left-hand side.
3. At the top select the Computer tab.
4. Click on Map network drive.
5. Select an available capital letter from the Drive drop-down menu.
6. Enter the path to the folder on the remote server in the Folder field (e.g., \\ouce-
smb.ouce.ox.ac.uk\rjones).
7. Optionally check the Reconnect at sign-in checkbox, but remember that the VPN
connection will likely need to be running; if this causes problems then uncheck it.
8. Click Finish.

During the process of mapping the remote drive the username and password are
likely to be requested. The mapping process may take a few seconds to complete.

3.3.6.2 Mapping Home Directory on OS X

The following instructions should work on most OS X versions.

1. Open the Mac OS X Finder.


2. Press Command+K to open the Connect to Server pop-up window.
3. Enter the path to the folder on the remote server in the Server Address field (e.g.,
smb://ouce-smb.ouce.ox.ac.uk/rjones).
4. Click Connect.
5. Enter the username and password when prompted.
6. Click OK.

Once the above steps have completed successfully the remote drive will be accessible
from the desktop (folder icon) or via the Finder pane on the left-hand side.

3.3.6.3 Mounting Home Directory on Linux

First, a mount point has to be created on the local machine. The mount point is an
empty directory, usually placed in the /media directory, and it has to be created only
once. To create a mount point named ldrive (short for Linux drive) in the /media
directory the following command can be used (writing to /media typically requires
administrator privileges, e.g. via sudo).

mkdir /media/ldrive

Once a mount point exists the sshfs (Secure SHell FileSystem) command can be used
to mount (synonymous here for to map) the home directory located on the remote
server on the local machine. Assuming the username is jsky, the server name is
linux.ox.ac.uk, the full path to the home directory on the remote server is /home/jsky/
and the local mount point is /media/ldrive then the following command can be used
to mount the remote home directory on the local machine.

sshfs -o idmap=user jsky@linux.ox.ac.uk:/home/jsky/ /media/ldrive -o nonempty

The -o idmap=user option sets the user/group ID mapping to user. The
jsky@linux.ox.ac.uk:/home/jsky/ part sets the full path to the home directory on the
remote server and /media/ldrive is the local mount point. The -o nonempty option
avoids a mountpoint is not empty error message.
Files and directories saved in the home directory on the remote server should now
be accessible via the local mount point directory.
To unmount the mounted remote server home directory the following command can
be used.

fusermount -u /media/ldrive

3.4 Some More Unix Server Basics


Unix Command Syntax

The general syntax of Unix commands is in the form of the main command (indicated
in the following code example as cmd) potentially followed by some options, a source
and a destination. The main command, option, source and destination need to be
separated by at least one single space.

cmd <options> <source> <destination>

The most basic form of a Unix command is just the main command on its own as a
single word. For instance, the ls command on its own generates a simple list of the
content of the current directory.

ls

Most commands allow options to be added that modify how the command behaves.
The options start either with one hyphen (-) for the short version of the option (single
letter) or with two hyphens (--) for the long version of the option (single word).
Multiple options can be passed to the main command by combining the letters of the
short version options preceded by a single hyphen. For example, adding the options
-ltr to the ls command generates an extended (long) list (-l) of the directory content
sorted by time (-t) and the list being displayed in reversed order (-r) so that the
newest file is at the bottom of the list.

ls -ltr
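Combined short options behave exactly like the same options given separately, and single-letter options often have long-form equivalents (e.g., -a and --all in GNU ls). The following throw-away example, with made-up file names and timestamps, can be used to see this.

```shell
# Set up a scratch directory with two files of different age.
mkdir -p /tmp/ls_demo && cd /tmp/ls_demo
touch -t 202001010000 older.txt
touch -t 202001020000 newer.txt

ls -ltr        # combined short options: long list, by time, reversed
ls -l -t -r    # identical behaviour with the options given separately
ls --all       # long version of -a (GNU ls)
```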

Some commands expect an input file or directory as a source. In the following


example, the cat command dumps the contents of the text file test.txt inside the
terminal window.

cat test.txt

Some commands also expect an output file name or directory as a destination, for
instance, when copying or renaming files or directories. The following command
copies the file test.txt into a directory called output whereby ‘cp’ is the main
command, test.txt is the source and output is the destination.

cp test.txt output/
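The destination of cp may also be a new file name, in which case the file is copied under that name; the file and directory names below are illustrative only.

```shell
# Create a small file and a target directory to copy into.
printf 'some text\n' > test.txt
mkdir -p output

cp test.txt output/            # copy into the directory output/
cp test.txt test_backup.txt    # copy to a new name in the same directory

ls output                      # the copy output/test.txt now exists
```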

3.4.1 Manual Pages


The options available for a specific command can be found in the manual pages (short
man pages). On most servers the man pages are available from the command line. For
instance, to get information including all options for the ls command execute the
following command on the Unix command prompt.

man ls

The output from the above command may look similar to the following.
Example output from the man ls command.
1 LS(1) User Commands \
2 LS(1)
3
4 NAME
5 ls - list directory contents
6
7 SYNOPSIS
8 ls [OPTION]... [FILE]...
9
10 DESCRIPTION
11 List information about the FILEs (the current directory by default). \
12 Sort entries alphabetically if none of
13 -cftuvSUX nor --sort is specified.
14
15 Mandatory arguments to long options are mandatory for short options too.
16
17 -a, --all
18 do not ignore entries starting with .
19
20 -A, --almost-all
21 do not list implied . and ..
22
23 --author
24 with -l, print the author of each file
25 ...

Use the up and down arrow keys to scroll through the man pages. Type q to terminate
the man pages and get back to the Unix command prompt.
Line 1 shows the name of the main command followed by the section number in
brackets. The section number identifies the section of the manual the pages come from.
Each section corresponds to a specific set of commands. For instance, Section 1 is for
User Commands (see the manual pages for the man command itself for more details).

Some commands (e.g., printf) have manual pages in different sections


of the manual. In these cases the section number can be passed to the
man command to specify which manual pages are being requested. The
command man 3 printf requests the manual pages for the printf command
from Section 3 (C Library Functions) of the manual.

The remainder of the manual pages provides information about the command in
several sub-sections. The NAME section (line 4) provides a short description of
the command. The SYNOPSIS section (line 7) provides the general syntax of the
command. The DESCRIPTION section (line 10) provides a more detailed description
of the command including a list of the available options. Further information can be
found at the end of the man pages in the AUTHOR, REPORTING BUGS and
COPYRIGHT sections.
The manual pages can also be found in the form of webpages. A simple web search
for something like man pages ls will return many results including this page¹⁶.

3.4.2 Editing Text Files


A large part of data analysis requires writing programming language source code
(e.g., Python or Shell scripts) using text editors. A text editor can be used to create
and edit plain text files including files that contain programming language source
code (code files). Code files are written in a specific programming language using
programming language specific commands and syntax. A code file can be executed
which means that software compiles or interprets the content of the file.
¹⁶http://man7.org/linux/man-pages/man1/ls.1.html

A text editor is used for writing plain text files. Text editors should not
be confused with word processing software such as Microsoft Word or
LibreOffice which are used for writing rich text files whereby text is
saved in the software’s own specific binary format (e.g., .doc or .odt file
extensions).

For coding purposes a text editor should, as a minimum, feature syntax highlighting
and be customisable in terms of background and font colour (bright versus dark) as
well as font and font size.
Some editors such as Atom, gedit or jEdit open a GUI whereas other editors such as
nano, vi or vim open inside the terminal window (also referred to as screen-based
or screen-orientated editors). Some editors such as Emacs feature both options. A
non-exhaustive list of frequently used text editors is shown in Table 3.4.3.1.

To start the Emacs editor from the Unix command line in screen-based
mode use the -nw (no window) option followed by the filename (emacs -nw
myfile.txt).

Table 3.4.3.1: Free code editors (¹Windows OS, ²Mac OS X, ³Linux).

Editor       Description                                               GUI  terminal
Atom¹ ² ³    Highly customisable popular code editor                    x
gedit³       Basic text editor                                          x
jEdit¹ ² ³   Good but basic code editor                                 x
nano³        Basic GNU text editor                                           x
Emacs¹ ² ³   Mature, extensible, customisable GNU text editor           x    x
vi/vim³      More advanced clone of the vi editor, highly configurable       x

To edit a text file located in the home directory on the server either a text editor
installed on the local machine or one that is installed on the server can be used. A
text editor installed on the local machine can only be used if the home directory
located on the server is mapped on the local machine. If that is not the case then
a text editor installed on the server should be used. Either way, a GUI-based text
editor should only be used if a fast and stable internet connection is available. For
slow internet connections a server-side screen-based text editor allows for a more

uninterrupted workflow.
Text editors vary in terms of their functionality and ease of use. While basic editors
such as gedit, jEdit or nano may be easy and intuitive to use, more feature-rich
and customisable editors such as Atom, Emacs or vim tend to be preferred by
analytical programmers (see Editor War¹⁷ for Emacs/vim pros and cons and historical
background).

For the use of Emacs or vim it is advisable to take the time to go through
one of the many online tutorials, as these screen-based text editors require
the use of keyboard shortcuts. While this may sound complicated, it pays
off in the long run.

To create or open a file from the server command line use the text editor’s name
followed by the filename as shown in the following example which shows the use of
jEdit to open a file named myfile.csv.

jedit myfile.csv

After opening a GUI-based editor from the command line the command
line will not be available as the command that opened the editor only
terminates once the GUI is closed. One solution is to add a space and
the ampersand symbol (&) at the end of the command which instructs the
system that the command should run in the background. Another solution
is to just open a second terminal window.
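The effect of the ampersand can be tried with any long-running command; in the following sketch sleep stands in for a GUI editor.

```shell
# Without the &, the command line would be blocked for two seconds.
sleep 2 &

# $! holds the process ID of the most recent background job.
echo "background job started with PID $!"

# In a script, wait blocks until the background job has finished;
# interactively one would simply continue working.
wait
```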

Code editors can also be found as part of Integrated Development Environments


(IDEs). Python language specific IDEs are covered in Section 6.x.x.

3.4.3 Full versus Relative Paths


A path is the text representation of the location of a file or directory within the
server directory structure. There are two types of paths, a full path and a relative
¹⁷https://en.wikipedia.org/wiki/Editor_war
