Chap05 - Jupyter Notebook
In [1]:
Reading data
Pandas is a library that provides tools for reading and processing data. read_html reads a web page from a
file or the Internet and creates one DataFrame for each table on the page.
In [2]:
The arguments of read_html specify the file to read and how to interpret the tables in the file. The result,
tables , is a sequence of DataFrame objects; len(tables) reports the length of the sequence.
In [3]:
filename = 'data/World_population_estimates.html'
tables = read_html(filename, header=0, index_col=0, decimal='M')
len(tables)
Out[3]:
We can select the DataFrame we want using the bracket operator. The tables are numbered from 0, so
tables[2] is actually the third table on the page.
localhost:8888/notebooks/unphu/simulacion/exercises/ModSimPy/soln/chap05soln.ipynb 1/12
7/4/2020 chap05soln - Jupyter Notebook
In [4]:
table2 = tables[2]
table2.head()
Out[4]:
[Table output. Index: Year. Columns: United States Census Bureau (2017)[28]; Population Reference Bureau (1973–2016)[15]; United Nations Department of Economic and Social Affairs (2015)[16]; Maddison (2008)[17]; HYDE (2007)[24]; Tanton (1994)[18]; further columns truncated.]
In [5]:
table2.tail()
Out[5]:
[Table output. Index: Year. Columns: United States Census Bureau (2017)[28]; Population Reference Bureau (1973–2016)[15]; United Nations Department of Economic and Social Affairs (2015)[16]; Maddison (2008)[17]; HYDE (2007)[24]; Tanton (1994)[18]; Biraben (1980)[19]; McEvedy & Jones (1978)[20]; further columns truncated.]
Long column names are awkward to work with, but we can replace them with abbreviated names.
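As a minimal sketch of this step, assuming a small stand-in DataFrame rather than the real table: assigning a list to `columns` replaces all the labels at once. The short names `census` and `un` match the ones used later in the notebook; the sample rows and the exact long labels are for illustration only.

```python
import pandas as pd

# Stand-in table with two long column labels (illustrative values).
table = pd.DataFrame(
    {
        "United States Census Bureau (2017)": [2557628654, 2594939877],
        "United Nations Department of Economic and Social Affairs (2015)": [2525149000, 2572851000],
    },
    index=pd.Index([1950, 1951], name="Year"),
)

# Assigning a list to .columns renames every column; the list must have
# one entry per column, in the same order as the existing columns.
table.columns = ["census", "un"]

print(table.columns.tolist())
```

The data and the index are unchanged; only the labels are replaced.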
In [6]:
In [7]:
table2.head()
Out[7]:
Year
The first column, which is labeled Year , is special. It is the index for this DataFrame , which means it
contains the labels for the rows.
Some of the values use scientific notation; for example, 2.544000e+09 is shorthand for 2.544 ⋅ 10^9 or
2.544 billion.
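A quick way to check this reading of the notation, using only built-in Python:

```python
# The 'e' in the literal means "times ten to the power of".
value = float("2.544000e+09")

print(value == 2_544_000_000)   # the same number written out in full
print(format(value, "e"))       # the 'e' format shows six decimal places by default
```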
Series
We can use dot notation to select a column from a DataFrame . The result is a Series , which is like a
DataFrame with a single column.
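Here is a small sketch of column selection, assuming a stand-in DataFrame with one column named `census`. It also shows that arithmetic on a Series applies to every element, which is how a column of raw counts can be converted to billions.

```python
import pandas as pd

# Stand-in DataFrame; the two rows reuse values shown in this notebook.
table = pd.DataFrame(
    {"census": [2557628654, 2594939877]},
    index=pd.Index([1950, 1951], name="Year"),
)

by_dot = table.census          # dot notation
by_bracket = table["census"]   # bracket notation selects the same column

# Dividing a Series by a number divides every element, so this converts
# the whole column from people to billions of people.
census_billions = by_dot / 1e9
print(census_billions)
```

Dot notation only works when the column name is a valid Python identifier, which is one reason short names are convenient.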
In [8]:
census = table2.census
census.head()
Out[8]:
Year
1950 2557628654
1951 2594939877
1952 2636772306
1953 2682053389
1954 2730228104
Name: census, dtype: int64
In [9]:
census.tail()
Out[9]:
Year
2012 7013871313
2013 7092128094
2014 7169968185
2015 7247892788
2016 7325996709
Name: census, dtype: int64
In [10]:
un = table2.un / 1e9
un.head()
Out[10]:
Year
1950 2.525149
1951 2.572851
1952 2.619292
1953 2.665865
1954 2.713172
Name: un, dtype: float64
In [11]:
Out[11]:
Year
1950 2.557629
1951 2.594940
1952 2.636772
1953 2.682053
1954 2.730228
Name: census, dtype: float64
In [12]:
decorate(xlabel='Year',
ylabel='World population (billion)')
savefig('figs/chap05-fig01.pdf')
The following expression computes the elementwise differences between the two series, then divides through
by the UN value to produce relative errors (https://en.wikipedia.org/wiki/Approximation_error), then finds the
largest element.
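The computation described here can be sketched as follows, using only the first five values of each series shown above. Multiplying by 100 to express the result as a percentage is an assumption, and because this sample covers only 1950 to 1954, the result differs from the value computed over the full series.

```python
import pandas as pd

years = pd.Index([1950, 1951, 1952, 1953, 1954], name="Year")
# First five values of each series, in billions, copied from the outputs above.
census = pd.Series([2.557629, 2.594940, 2.636772, 2.682053, 2.730228], index=years)
un = pd.Series([2.525149, 2.572851, 2.619292, 2.665865, 2.713172], index=years)

# Subtraction and division are elementwise and aligned on the Year index.
relative_error = abs(census - un) / un

# Largest relative error in the sample, as a percentage.
max_percent_error = relative_error.max() * 100
print(max_percent_error)
```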
In [13]:
Out[13]:
1.3821293828998855
Exercise: Break down that expression into smaller steps and display the intermediate results, to make sure
you understand how it works.
In [14]:
# Solution
census - un
Out[14]:
Year
1950 0.032480
1951 0.022089
1952 0.017480
1953 0.016188
1954 0.017056
1955 0.020448
1956 0.023728
1957 0.028307
1958 0.032107
1959 0.030321
1960 0.016999
1961 0.001137
1962 -0.000978
1963 0.008650
1964 0.017462
1965 0.021303
1966 0.023203
In [15]:
# Solution
abs(census - un)
Out[15]:
Year
1950 0.032480
1951 0.022089
1952 0.017480
1953 0.016188
1954 0.017056
1955 0.020448
1956 0.023728
1957 0.028307
1958 0.032107
1959 0.030321
1960 0.016999
1961 0.001137
1962 0.000978
1963 0.008650
1964 0.017462
1965 0.021303
1966 0.023203
In [16]:
# Solution
abs(census - un) / un
Out[16]:
Year
1950 0.012862
1951 0.008585
1952 0.006674
1953 0.006072
1954 0.006286
1955 0.007404
1956 0.008439
1957 0.009887
1958 0.011011
1959 0.010208
1960 0.005617
1961 0.000369
1962 0.000311
1963 0.002702
1964 0.005350
1965 0.006399
1966 0.006829
In [17]:
# Solution
Out[17]:
1.4014999251669376
max and abs are built-in functions provided by Python, but NumPy also provides versions that are a little
more general. When you import modsim , you get the NumPy versions of these functions.
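One way to see the difference, as a sketch that calls NumPy directly rather than going through modsim:

```python
import numpy as np

nested = [[-1, 5], [3, -7]]

# np.abs accepts a nested list and works elementwise; the built-in abs
# raises TypeError when given a list.
print(np.abs(nested))

# np.max returns the largest element overall; the built-in max compares
# the inner lists with each other and returns one of them.
print(np.max(nested))
print(max(nested))
```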
Constant growth
We can select a value from a Series using bracket notation. Here's the first element:
In [18]:
census[1950]
Out[18]:
2.557628654
In [19]:
census[2016]
Out[19]:
7.325996709
But rather than "hard code" those dates, we can get the first and last labels from the Series :
In [20]:
t_0 = get_first_label(census)
Out[20]:
1950
In [21]:
t_end = get_last_label(census)
Out[21]:
2016
In [22]:
Out[22]:
66
In [23]:
p_0 = get_first_value(census)
Out[23]:
2.557628654
In [24]:
p_end = get_last_value(census)
Out[24]:
7.325996709
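The four helper functions above come from modsim. As a sketch of how the same labels and values can be read with plain pandas, assuming a short stand-in Series with only three of the years:

```python
import pandas as pd

# Stand-in Series, in billions; the real series has one row per year.
census = pd.Series(
    [2.557628654, 2.594939877, 7.325996709],
    index=pd.Index([1950, 1951, 2016], name="Year"),
)

t_0 = census.index[0]      # first label
t_end = census.index[-1]   # last label
p_0 = census.iloc[0]       # first value, selected by position
p_end = census.iloc[-1]    # last value, selected by position

print(t_0, t_end, p_0, p_end)
```

Selecting by position with `iloc` avoids hard-coding the years, just as the helper functions do.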
Then we can compute the average annual growth in billions of people per year.
In [25]:
Out[25]:
4.768368055
In [26]:
Out[26]:
0.07224800083333333
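The arithmetic behind these outputs can be sketched with the values shown above. The names `elapsed_time`, `total_growth`, and `annual_growth` are assumptions chosen for this example.

```python
# Values copied from the outputs above.
t_0, t_end = 1950, 2016
p_0, p_end = 2.557628654, 7.325996709

elapsed_time = t_end - t_0                    # years
total_growth = p_end - p_0                    # billions of people
annual_growth = total_growth / elapsed_time   # billions of people per year

print(elapsed_time, total_growth, annual_growth)
```

An average of about 0.072 billion per year means the population grew by roughly 72 million people per year over this period.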
TimeSeries
Now let's create a TimeSeries to contain values generated by a linear growth model.
In [27]:
results = TimeSeries()
Out[27]:
values
Initially the TimeSeries is empty, but we can initialize it so the starting value, in 1950, is the 1950 population
estimated by the US Census.
In [28]:
results[t_0] = census[t_0]
results
Out[28]:
values
1950 2.557629
After that, the population in the model grows by a constant amount each year.
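A minimal sketch of this update rule, using a plain dict as a stand-in for the TimeSeries and the rounded growth value from the output above; the variable names are assumptions.

```python
t_0, t_end = 1950, 2016
p_0 = 2.557628654               # 1950 census estimate, in billions
annual_growth = 0.07224800083   # billions per year, rounded from the output above

results = {t_0: p_0}            # a dict stands in for the TimeSeries
for t in range(t_0, t_end):
    # Each year's value is the previous year's value plus a constant.
    results[t + 1] = results[t] + annual_growth

print(results[t_end])
```

Because the growth rate was computed from the first and last data points, the model should end very close to the 2016 census value.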
In [29]:
Here's what the results look like, compared to the actual data.
In [30]:
decorate(xlabel='Year',
ylabel='World population (billion)',
title='Constant growth')
savefig('figs/chap05-fig02.pdf')
The model fits the data pretty well after 1990, but not so well before.
Exercises
Optional Exercise: Try fitting the model using data from 1970 to the present, and see if that does a better job.
Hint:
1. Copy the code from above and make a few changes. Test your code after each small change.
2. Make sure your TimeSeries starts in 1950, even though the estimated annual growth is based on later
data.
3. You might want to add a constant to the starting value to match the data better.
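One possible approach to these hints, as a sketch rather than the intended solution: estimate the annual growth from the 1970 and 2016 census values that appear in this notebook, then choose the 1950 starting value so that the line passes through the 1970 data point. That choice of starting value is an assumption; any constant offset that improves the fit would do. A plain dict stands in for the TimeSeries.

```python
# Census values in billions, copied from the outputs in this notebook.
p_1970 = 3.712698
p_2016 = 7.325996709

# Average growth per year over 1970-2016.
annual_growth = (p_2016 - p_1970) / (2016 - 1970)

# Assumption: pick the 1950 starting value so the line passes through 1970.
p_start = p_1970 - annual_growth * (1970 - 1950)

results = {1950: p_start}
for t in range(1950, 2016):
    results[t + 1] = results[t] + annual_growth

print(annual_growth, p_start)
```

The growth rate here is higher than the 1950 to 2016 average, so the starting value ends up below the 1950 census estimate.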
In [31]:
# Solution
decorate(xlabel='Year',
ylabel='World population (billion)',
title='Constant growth')
In [32]:
census.loc[1960:1970]
Out[32]:
Year
1960 3.043002
1961 3.083967
1962 3.140093
1963 3.209828
1964 3.281201
1965 3.350426
1966 3.420678
1967 3.490334
1968 3.562314
1969 3.637159
1970 3.712698
Name: census, dtype: float64
In [ ]: