Programming With Python: Contents
Contents:
1. Useful tools for data analysis
2. Case study: How warm was Europe in the past?
Jupyter Notebooks contain both Python code and text, which is formatted in Markdown. Here are a few Markdown formats:
JupyterLab
Similar to Jupyter Notebook, but with many additional features focused on interactive, exploratory computing. The
JupyterLab interface consists of a main work area containing tabs of documents and activities, a collapsible left sidebar, and
a menu bar. The left sidebar contains a file browser, the list of running kernels and terminals, the command palette, the
notebook cell tools inspector, and the tabs list. It is also good for viewing large CSV files.
Here we just show a quick example of pandas, using the pandas_datareader package (which must be installed separately) to read
and plot the daily low and high prices of the Amazon stock.
In [1]: import pandas_datareader.data as web
import matplotlib.pyplot as plt
import pandas as pd
Out[1]:
High Low Open Close Volume Adj Close
Date
Apart from being proficient in Python and the pandas package, a data analyst knows about various data analysis and
machine learning techniques such as
Many of these techniques are implemented in the Python package scikit-learn (https://scikit-learn.org). Check out the
extensive example collection (https://scikit-learn.org/stable/auto_examples).
A good way to get familiar with the fundamental data science tools and algorithms is to code them from scratch, using only
the basic Python language. I strongly recommend the book Data Science from Scratch by Joel Grus (O'Reilly 2015).
Indeed, a great deal of work can be done without using any of the above-mentioned libraries. We will best demonstrate this
with a concrete data analysis problem.
1. What were the extreme average temperatures in the past 500 years in Europe?
2. How did the temperature change?
3. What did it look like at a certain point in time (a date or some approximation of it)?
Loading the data
Obviously, Python itself does not provide the needed data. This is where searching the internet comes in handy, leading us
to the historical paleoclimatological data (http://www.ncdc.noaa.gov/data-access/paleoclimatology-data/datasets/historical)
of the NCDC (National Climatic Data Center) (http://www.ncdc.noaa.gov/). From their FTP site
(ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/) we can download various data.
Some of the data, along with other info and the copyright notice, is available in the file europe-seasonal.txt
(10a-temps/eu-data/europe-seasonal.txt) (ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe-seasonal.txt).
The data itself looks like this:
The column "year" holds the year of each row's data, "DJF" stands for Winter (December, January, February), "MAM"
stands for Spring (March, April, May), "JJA" stands for Summer (June, July, August), "SON" stands for Autumn
(September, October, November), and "Annual" is the average temperature for the given year.
We can copy just this part into a new file and save it under some name, for example "europe-seasonal.dat" .
Notice that this is not exactly a CSV file like we saw before, as the separators are runs of whitespace of varying length.
The columns of this file are defined by their width (the year takes 4 characters and each of the rest takes 12 characters).
Luckily, this is not a problem: the split function that we used earlier uses exactly such runs of whitespace
as separators if it is not given a different one. So, one line from the above file can be split like this:
year,djf,mam,jja,son,annual = line.strip().split()
The additional strip() call removes leading and trailing whitespace (each line ends with a new-line character that we
want to remove). However, don't forget: each of year , djf , mam , jja , son , and annual is a string now and
needs to be converted to either int or float if we are to use it as such.
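As a small illustration of the conversion step, here is a sketch that splits one row and converts its fields. The sample line is made up for the example (it only mimics the layout described above) and is not taken from the actual file.

```python
# A made-up row in the described layout: a year followed by five temperatures.
line = "1500      -0.945       7.157      17.483       8.990       8.171\n"

fields = line.strip().split()   # split on runs of whitespace
year = int(fields[0])           # the year is a whole number
djf, mam, jja, son, annual = (float(x) for x in fields[1:])

print(year, djf, annual)
```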
A note on organizing our code. Given that we want to write several programs dealing with the same data, creating a
module with some common functionality is a reasonable way to go.
The first function to write would be the one fetching the data from the above file. There are two things to consider here:
1. What to return? We can write it to return either an iterator or a list of all values. Since this data set is not very big, the
two approaches don't differ much. Still, an iterator is usually the better option and we'll use one here.
2. How to store each year's (row's) data? Obvious choices are a tuple or a dictionary. The latter is a tad more
descriptive, but a tuple is a bit easier to create, so we'll work with tuples. This is really just a matter of personal
choice.
So, what we need to do is read the file line by line, split each line, convert the elements to int / float , put them in a tuple
and yield them.
Since we have created our own input file, we can choose the format. It will be as described above, but minus the header
row, since we have no use for it. In other words, our file "europe-seasonal.dat" looks like this:
We can now create a function that will read the data from this file:
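A minimal sketch of such a function is given here. It assumes the file layout described above (six whitespace-separated fields per row, no header); the default file name and the choice to skip malformed lines are assumptions made for this sketch, not necessarily what the course module does.

```python
def seasonal_data(fname="europe-seasonal.dat"):
    """Yield one (year, djf, mam, jja, son, annual) tuple per row of the file."""
    with open(fname) as f:
        for line in f:
            fields = line.split()
            if len(fields) != 6:
                continue  # skip blank or malformed lines (an assumption of this sketch)
            # The year is an int, the five temperatures are floats
            yield (int(fields[0]),) + tuple(float(x) for x in fields[1:])
```

Because the function is a generator, the file is read lazily: a row is parsed only when the caller asks for the next tuple.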
Let us now get some basic info about the temperature in Europe in the past 500 years:
In [4]: # Load the data into a list, so as to avoid rereading the file several times.
# We can afford this because the file is fairly small.
data = list(seasonal_data())
As for the remaining two lines, they establish the size of the plot and can be used in Python as well. However, there are
usually better ways to do it and this is used merely to set the default values for all the plots produced by the program.
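# Split the rows of `data` into one list per column
years = list()
djfs = list()
mams = list()
jjas = list()
sons = list()
annuals = list()
for year, djf, mam, jja, son, annual in data:
    years.append(year)
    djfs.append(djf)
    mams.append(mam)
    jjas.append(jja)
    sons.append(son)
    annuals.append(annual)
# Plot each season (and the annual average) in its own colour
plt.plot(
    years, djfs, "blue",
    years, mams, "green",
    years, jjas, "red",
    years, sons, "orange",
    years, annuals, "gray"
)
plt.show()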
So, how does this work?
1. Winters are always colder than Springs, which are usually (a bit) colder than Autumns, which are always colder than
Summers. Average, as its name suggests, is in the middle. None of this is really surprising.
2. The average temperature varies more for the Winters than for other seasons.
3. Spring temperatures have varied more since around the beginning of the 19th century.
What we cannot see are trends. For example, is the temperature rising?
The above numbers suggest that Europe is warming up, because the maximum temperatures in Winter, Summer, and on
average have all occurred in recent years. However, these are just extremes that may or may not correlate with the general
behaviour of the temperature. To observe that, we use smoothing (http://en.wikipedia.org/wiki/Smoothing).
There are many different smoothing algorithms. Here, we shall use the moving average
(http://en.wikipedia.org/wiki/Moving_average) in its simplest form. If the temperature for a certain season in year $y$ is
given by the variable $L_y$, we create new variables:
$$L'_y := \frac{1}{2r+1} \sum_{k=y-r}^{y+r} L_k,$$
i.e., $L'_y$ is the average value of the temperatures from the year $y - r$ up to (and including) the year $y + r$, where $r$ (the
radius) is some given number. The bigger the $r$, the smoother the result.
Let us smooth only one element first, the $i$-th one. This means computing the sum $\sum_{k=i-r}^{i+r} L_k$ and dividing it by $2r + 1$.
This means we need to:
get a part of the list: L[i-r:i+r+1] (the +1 part is here because the right limit is not included as a part of the new
list),
find its sum: $\sum_{k=i-r}^{i+r} L_k$ = sum(L[i-r:i+r+1]) ,
divide it by 2*r+1 : $\frac{1}{2r+1} \sum_{k=i-r}^{i+r} L_k$ = sum(L[i-r:i+r+1]) / (2*r+1) .
Repeating the above for all viable indices i can be easily done as a list comprehension:
Finally, since we want to do this for a whole list, it is wise to compute 2*r+1 ahead and just store it in some variable.
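Putting these steps together, one possible version of the smoothing function is sketched here. The name smooth matches the calls used later in the text; the body is a reconstruction from the description above, and the variable name width is chosen only for this sketch.

```python
def smooth(L, r):
    """Return the moving average of L with radius r (a list of len(L) - 2*r values)."""
    width = 2 * r + 1  # number of values in each window, computed only once
    # Only indices with r neighbours on both sides can be smoothed
    return [sum(L[i - r:i + r + 1]) / width for i in range(r, len(L) - r)]

print(smooth([1, 2, 3, 4, 5], 1))  # [2.0, 3.0, 4.0]
```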
Smoothed Winter temperatures: -0.917, -1.04, -1.16, ..., 0.338, 0.197, 0.169
Recall that the smoothed arrays are shorter than the original ones. This means that the years list is no longer appropriate
for the x-axis and we need to create a new one, with the first and the last r elements removed:
In [9]: r = 5
sdjfs = smooth(djfs, r)
print("len(smooth_djfs) = ", len(sdjfs))
print("len(years) = ", len(years))
syears = years[r:-r]
print("len(smooth_years) =", len(syears))
len(smooth_djfs) = 495
len(years) = 505
len(smooth_years) = 495
However, there are various improvements that can be done to our plot.
First, to make later improvements easier, we store the references to the figure and to the subplot in two variables:
fig = plt.figure()
ax = plt.subplot(111)
This allows us to do the customisations that are related to them, and not just the plots themselves. For example:
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
is used to reduce the width of the plotting area by 20% (to 0.8 of its original width), leaving some space on the right side for
the legend.
The description of the arguments used can be found in the function's reference
(http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend).
So, how does the legend get the names of the plots?
This can be done in several different ways, the easiest one being the plot command itself. To do that, we draw all the
plots one by one:
The value of the label argument is used as a description of the plot in the legend.
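The following sketch shows the idea on two curves. The data values are made up, the Agg backend is selected only so that the sketch runs without a display, and the legend placement (loc and bbox_to_anchor) is one plausible choice, not necessarily the one used in the course.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Made-up values, used only to have something to draw
years = [1500, 1501, 1502, 1503]
winters = [-1.0, -0.5, -1.2, -0.8]
summers = [17.0, 17.5, 16.8, 17.2]

fig = plt.figure()
ax = plt.subplot(111)

# Narrow the plotting area to leave room for the legend on the right
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])

# One plot call per curve; `label` is the text shown in the legend
plt.plot(years, winters, color="blue", label="Winter")
plt.plot(years, summers, color="red", label="Summer")

# Place the legend just outside the right edge of the plotting area
legend = ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
```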
plt.grid()
Notice how our plot has a big empty space on the right side. This is because Matplotlib's automatic scaling decided that 2100
is a good right limit for the x-axis. However, we might want to use a different value, maybe 2015. We set this by calling the
axis function (http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.axis):
This sets the x-axis to display the values from 1500 to 2015, and the y-axis to display the values from -10 to 35.
Of course, it would be better to derive these limits from the data. Luckily, we know that all the elements of djfs (the Winter
temperatures) are smaller than all the elements of the remaining lists; also, all the elements of jjas (the Summer
temperatures) are bigger than all the elements of the remaining lists. This simplifies finding the minimum and the maximum, so our
limits can be:
Finally, nothing bad will happen if we go a bit wider with the temperatures, i.e., if instead of the interval [−4.152, 19.615]
we plot [−5, 20]. This can be done by some rounding magic, for example to the next value divisible by 5:
This will add only a minor extra empty space to the top and to the bottom of our plot, but nothing big like the year 2100
added to the right. At the same time, our y-axis labels will turn out nicer.
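The rounding can be sketched with floor and ceil from the math module. The helper name rounded_limits and its step parameter are hypothetical, introduced only for this illustration.

```python
from math import floor, ceil

def rounded_limits(low, high, step=5):
    """Widen [low, high] outwards to the nearest multiples of `step`."""
    return floor(low / step) * step, ceil(high / step) * step

print(rounded_limits(-4.152, 19.615))  # (-5, 20)
```

Rounding the lower limit down and the upper limit up guarantees that no data point falls outside the plotted interval.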
Instead of just showing it on the screen, we can also save the created plot:
The bbox_inches argument defines which part of the figure is saved (the value "tight" trims the surrounding whitespace), while
the dpi argument stands for "Dots Per Inch". The bigger the dpi value, the bigger the produced image (in pixels). You can find
these and other parameters in the documentation of the savefig
function (http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.savefig).
Using what we've seen so far, we can produce the following plot:
In [10]: import matplotlib.pyplot as plt
from math import floor, ceil
r = 17
years = list()
djfs = list()
mams = list()
jjas = list()
sons = list()
annuals = list()
# Fill the lists from `data`, the list of rows loaded earlier
for year, djf, mam, jja, son, annual in data:
    years.append(year)
    djfs.append(djf)
    mams.append(mam)
    jjas.append(jja)
    sons.append(son)
    annuals.append(annual)
fig = plt.figure()
ax = plt.subplot(111)
# Remove the first and the last `r` years as they cannot be properly smoothed
syears = years[r:-r]
# Compute the smoothed values
plt.plot(syears, smooth(djfs, r), color="blue", label="Winter")
plt.plot(syears, smooth(mams, r), color="green", label="Spring")
plt.plot(syears, smooth(jjas, r), color="red", label="Summer")
plt.plot(syears, smooth(sons, r), color="orange", label="Autumn")
plt.plot(syears, smooth(annuals, r), color="gray", label="Average")
# Display grid
plt.grid()
It would be nice to have this plot and some form of the previous one together, overlapping. Or, even better, to have several
smoothed versions (for different values of r ), in a way that the less smoothed ones are less visible, yet still present.
We can do this by plotting as we did above, for several different values of r . The only question is how to achieve "less
visibility" of certain plots.
Those familiar with image processing probably know what an alpha channel is. It holds additional per-pixel information, not
unlike colour, that defines the transparency of the pixel. The value can be any real number between 0 and 1 , where 0 means
invisible and 1 means completely visible.
We shall define our alpha according to r , with some tweaking to make the final image look better:
In [11]: import matplotlib.pyplot as plt
from math import floor, ceil
years = list()
djfs = list()
mams = list()
jjas = list()
sons = list()
annuals = list()
# Fill the lists from `data`, the list of rows loaded earlier
for year, djf, mam, jja, son, annual in data:
    years.append(year)
    djfs.append(djf)
    mams.append(mam)
    jjas.append(jja)
    sons.append(son)
    annuals.append(annual)
fig = plt.figure()
ax = plt.subplot(111)
# Get the smoothed amounts (for each point take the average of
# `r` values to the left and to the right).
# `rs` is the list of smoothing radii; its last element is the largest one.
for r in rs:
    # Remove the first and the last `r` years as they cannot be properly smoothed
    syears = years[r:-r] if r else years
    # Compute the smoothed values
    alpha = 0.1 + 0.9*(r/rs[-1])**2 if r else 0.1 # 0 = invisible, 1 = fully visible
    plt.plot(syears, smooth(djfs, r), color="blue", alpha=alpha, label="Winter")
    plt.plot(syears, smooth(mams, r), color="green", alpha=alpha, label="Spring")
    plt.plot(syears, smooth(jjas, r), color="red", alpha=alpha, label="Summer")
    plt.plot(syears, smooth(sons, r), color="orange", alpha=alpha, label="Autumn")
    plt.plot(syears, smooth(annuals, r), color="gray", alpha=alpha, label="Average")
# Display grid
plt.grid()
To avoid this, we can define the label to be None for all but the last r :
In [12]: import matplotlib.pyplot as plt
from math import floor, ceil
years = list()
djfs = list()
mams = list()
jjas = list()
sons = list()
annuals = list()
# Fill the lists from `data`, the list of rows loaded earlier
for year, djf, mam, jja, son, annual in data:
    years.append(year)
    djfs.append(djf)
    mams.append(mam)
    jjas.append(jja)
    sons.append(son)
    annuals.append(annual)
fig = plt.figure()
ax = plt.subplot(111)
# Get the smoothed amounts (for each point take the average of
# `r` values to the left and to the right).
# `rs` is the list of smoothing radii; its last element is the largest one.
for r in rs:
    # Remove the first and the last `r` years as they cannot be properly smoothed
    syears = years[r:-r] if r else years
    # Compute the smoothed values
    alpha = 0.1 + 0.9*(r/rs[-1])**2 if r else 0.1 # 0 = invisible, 1 = fully visible
    # Only the plots for the last radius get a label (and thus a legend entry)
    last = (r == rs[-1])
    plt.plot(syears, smooth(djfs, r), color="blue", alpha=alpha, label="Winter" if last else None)
    plt.plot(syears, smooth(mams, r), color="green", alpha=alpha, label="Spring" if last else None)
    plt.plot(syears, smooth(jjas, r), color="red", alpha=alpha, label="Summer" if last else None)
    plt.plot(syears, smooth(sons, r), color="orange", alpha=alpha, label="Autumn" if last else None)
    plt.plot(syears, smooth(annuals, r), color="gray", alpha=alpha, label="Average" if last else None)
# Display grid
plt.grid()
And here is our (overly large) saved image, loaded by the IPython-specific function Image :
In [13]: from IPython.display import Image
Image("europe-temps-smooth.png")
Out[13]:
Creating a temperature map
What we did above, we did by using just one of the available tables of data (the one from the file
"europe-seasonal.txt" ). Looking at the other files there (and their descriptions), we can extract more data and we can make more
plots reflecting different information.
Despite its weird file extension .GDX , it is (almost) an ordinary CSV file.
1. The file is quite big (143 MB), so loading all of it into memory is not a good idea.
However, to draw the map for just one year and season, we only need the corresponding line. So, our program shall
read the local copy of the file (downloaded from the internet) line by line until it reaches the data that we need. Then we
shall collect that data and be done with the file.
2. What type of plot should we use?
Unless we already have some idea, it is best to check the Matplotlib gallery (http://matplotlib.org/gallery.html). What
would be a good plot type to use? These
(http://matplotlib.org/examples/images_contours_and_fields/interpolation_methods.html) certainly seem nice:
3. Obviously, Matplotlib has no idea what our coloured smudges represent (Europe), so we need an appropriate image
to combine with the plot.
This is somewhat tricky, as our data represents square parts of the map, so we must use a map that was created by
the appropriate projection (called equirectangular projection (http://en.wikipedia.org/wiki/Equirectangular_projection))
and we need to crop the map so that it fits the data (note: the official description of the data is wrong; the covered area
is between 25W-40E and 35N-70N, not 30N-70N).
This part is beyond the scope of this course. We shall use this image:
We are now ready to begin. Let us first define some useful variables and then grab the data:
Once the first field in the line corresponds to our filter (a "yyyyss" string, where "yyyy" is a four-digit year and "ss" is
a two-digit season identifier), we stop reading.
The last thing we do is split the rest of the line, which contains only temperatures, and immediately convert them to floats.
Now, our data_list contains all the temperatures for the given year and season.
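The reading step described above could be sketched as follows. The function name read_temperatures and its parameter names are hypothetical, and the sketch assumes that the fields on each line are separated by whitespace; if the real file uses a different separator, the split call has to be adjusted accordingly.

```python
def read_temperatures(fname, our_filter):
    """Return the temperatures from the first line whose first field equals `our_filter`."""
    with open(fname) as f:
        for line in f:  # one line at a time, never the whole file in memory
            fields = line.split()  # assumption: whitespace-separated fields
            if fields and fields[0] == our_filter:
                # Everything after the "yyyyss" field is a temperature
                return [float(x) for x in fields[1:]]
    return []  # no matching line was found
```

Returning as soon as the matching line is found means that the rest of the (large) file is never read.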
Notice how our data is in a list, and our map requires a grid (a table, a matrix,... some rectangular shape).
Instead of carefully creating a list of lists, we can get some help from NumPy, which is -- in essence -- a system for handling
multidimensional arrays. Its basic data structure is an ndarray (which stands for n-dimensional array): it is created by the
array function (http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html), and it has a neat little function
called reshape (http://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html) that does exactly what we
need:
In [15]: import numpy as np
Note: NumPy's ndarray allows double indexing, i.e., the element with the indices (53, 31) is referenced as
data[53,31] . If this was an ordinary Pythonic list of lists, we would have to use data[53][31] .
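As a toy illustration of array, reshape, and double indexing, here is a sketch on a made-up list of six values. The shape (2, 3) is chosen only for this example; the real grid has to use the number of rows and columns of the temperature data.

```python
import numpy as np

data_list = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # made-up flat list of values
data = np.array(data_list).reshape(2, 3)     # 2 rows, 3 columns, filled row by row

print(data[1, 0])      # double indexing on an ndarray
print(data.tolist())   # the same values as an ordinary list of lists
```

Note that reshape requires the number of elements to match exactly: six values fit a 2-by-3 grid, but not, say, a 2-by-4 one.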
While somewhat interesting, this is far from what we've seen above. What happened?
The colours are assigned to the values automatically, with the lowest ones being blue, the highest ones being red, and those
in between having other colours.
The readme file says that only the continental temperatures are available. But our data needs to be "matrix-like", so what is
there in the locations describing the sea?
Opening the file reveals the secret: those temperatures are given as -999.99. So, our automatic colouring works fine, but all
the "interesting" temperatures (from approx. -25 °C to approx. 40 °C) are squeezed at the top of the scale, thus all getting
coloured red.
After a bit of Googling, it is easy to find that the colour scale can be restricted to a chosen range with the class
matplotlib.colors.Normalize (http://matplotlib.org/api/colors_api.html#matplotlib.colors.Normalize), which takes the minimum
and maximum values. These are easy to find while avoiding all the values that are not between −100 and +100:
In [17]: import matplotlib.colors
norm = matplotlib.colors.Normalize(
vmin=min(fld for fld in data_list if fld > -100),
vmax=max(fld for fld in data_list if fld < +100)
)
plt.imshow(data, interpolation="bicubic", norm=norm, cmap='jet');
To combine it with the above image (the map of Europe), we look at the gallery again and find this
(http://matplotlib.org/examples/pylab_examples/layer_images.html):
We don't want a checkerboard as the background but an image; still, the principle is the same.
Luckily, both the checkerboard and the image are dealt with using the function imshow
(http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.imshow), so merging these examples is easy:
In [18]: import pylab
img = matplotlib.image.imread(iname)
im_europe = plt.imshow(img)
#pylab.hold(True)
im_temps = plt.imshow(data,
interpolation="bicubic",
norm=norm,
alpha=0.43,
extent=(0,img.shape[1],img.shape[0],0),
cmap='jet'
)
plt.show()
fig = plt.figure()
ax = plt.subplot(111)
img = matplotlib.image.imread(iname)
im_europe = plt.imshow(img)
#pylab.hold(True)
im_temps = plt.imshow(data,
interpolation="bicubic",
norm=norm,
alpha=0.43,
extent=(0,img.shape[1],img.shape[0],0),
cmap='jet'
)
plt.title("European temperatures for the {} of {}.".format(seasons[str(season)], year))
ax.set_xticks([])
ax.set_yticks([])
plt.show()
Last, but not least, it would be nice to explain what those colours actually mean. Like the legend in the previous example, a
map like this can use a colorbar, with the shrink argument that makes the colorbar a bit smaller than it would be
otherwise. This is trivial to add:
In [20]: import pylab
fig = plt.figure()
ax = plt.subplot(111)
img = matplotlib.image.imread(iname)
im_europe = plt.imshow(img)
#pylab.hold(True)
im_temps = plt.imshow(data,
interpolation="bicubic",
norm=norm,
alpha=0.43,
extent=(0,img.shape[1],img.shape[0],0),
cmap='jet'
)
plt.title("European temperatures for the {} of {}.".format(seasons[str(season)], year))
ax.set_xticks([])
ax.set_yticks([])
plt.colorbar(shrink=0.85)
plt.show()
So, with the program's docstring and import statements omitted, here is our program, but this time displaying the
Summer of '69 (https://www.youtube.com/watch?v=eFjjO_lhf9c):
In [21]: # Data file name
fname = os.path.join("10a-temps", "eu-data", "TT_Europe_1500_2002_New.GDX")
# Image file name
iname = os.path.join("10a-temps", "images", "europe.png")
# Seasons and their codes
seasons = { "13": "Winter", "14": "Spring", "15": "Summer", "16": "Autumn" }
# Prepare a plot
fig = plt.figure()
ax = plt.subplot(111)
While these are just some examples of what can be done with data in Python, there are specialized modules and packages
for dealing with large data and for doing far more advanced data analysis. To learn more, feel free to check Pandas
(http://pandas.pydata.org/), the statistics module (https://docs.python.org/3/library/statistics.html), the Statsmodels module
(http://statsmodels.sourceforge.net/), ...
References
Xoplaki, E., Luterbacher, J., Paeth, H., Dietrich, D., Steiner N., Grosjean, M., and Wanner, H., 2005:
European spring and autumn temperature variability and change of extremes over the last half millennium,
Geophys. Res. Lett., 32, L15713 (DOI:10.1029/2005GL023424 (http://doi.org/10.1029/2005GL023424)).