Teach Yourself Data Analytics in 30 Days
Learn to use Python and Jupyter
Notebooks by exploring fun,
real-world data projects
David Clinton
This book is for sale at http://leanpub.com/thedataproject
© 2021 Bootstrap IT
Installing Python
The good news is that most operating systems come with
Python pre-installed. Not sure your particular OS got the
memo? Open a command prompt/terminal and type: python
--version (or python3 --version).
If Python’s installed, you’ll probably see something like this:
$ python --version
Python 3.7.6
or this:
$ python3 --version
Python 3.8.5
At this point, by the way, the version number you get had
better not begin with a 2 (as in 2.7) - that branch is no longer
supported, and running such an old release will expose your
software to some serious vulnerabilities.
If you do need to install Python manually, you're best off using
Python's official documentation (python.org/downloads)⁴ - that'll
be the most complete and up-to-date source available.
¹https://www.python.org/about/gettingstarted/
²https://pythonbasics.org/
³https://jobtensor.com/Tutorial/Python/en/Introduction
⁴https://www.python.org/downloads/
import pandas as pd
You can read more about using pip for the proper
care and feeding of your Python system here:
docs.python.org/3/installing⁶.
All the software you’ll need to run the projects in this book
should be noted where appropriate. But do pay attention to
any unexpected error messages you might encounter in case
your environment is somehow missing something.
suffix. When you wanted to run your code to see how things
went, you’d do it from the command line (or a powerful and
complicated source-code editor like Visual Studio Code).
Fun. But it meant that, for anything to work, it all had to work.
That made it harder to troubleshoot when something didn't go
according to spec. It also made it a lot harder to play around
with specific details just to see what happens - which is where
a lot of our most innovative ideas come from. And it made it
tough to share live versions of our code across the internet.
Jupyter Notebooks are JSON-based files (using .ipynb extensions)
that, along with their various hosting environments, have gone a
long way to solving all those problems. Notebooks move the
processing environment for just about any data-oriented
programming code using Python (or other languages) from your
workstation to your web browser.
For me, a notebook’s most powerful feature is the way you can
run subsets of your code within individual cells. This makes
it easy to break down long and complex programs into easily
readable - and executable - snippets. Whatever values were
created will remain in the kernel's memory until the output for
a particular cell, or for the entire kernel, is cleared.
This lets you re-run previous or subsequent cells to see what
impact a change might have. It also means resetting your
environment for a complete start-over is as easy as selecting
Restart Kernel and Clear All Outputs.
Azure Notebook.
If you do decide to make an old sysadmin happy and host
your own notebooks, you’ll need to choose between classic
notebooks and the newer JupyterLab. Both run nicely within
your browser, but JupyterLab comes with more extensions
and lets you work with multiple notebook files (and terminal
access) within a single browser tab.
JupyterHub, just to be complete, is a server version built to
provide authenticated notebook access to multiple users. You
can serve up to 100 or so users from a single cloud server
using The Littlest JupyterHub (tljh.jupyter.org)⁷. For larger
deployments involving clusters of servers, you’d probably
be better off with a Kubernetes version known as Zero to
JupyterHub with Kubernetes⁸.
And always watch closely for error messages that can tell you
important things about your environment.
Getting help
The internet is host to more knowledge than any one human
being could possibly remember, or even organize. More often
than not, we use search to access the tiny fragments of that
knowledge that we need at a given time. Effective search,
however, is far more than just typing a few related words
into the search field and hitting Enter. There’s method to the
madness. Here are some powerful tips that will work on any
major search engine. My own favorite is DuckDuckGo.
Be precise
The internet has billions of pages, so vague search results are
bound to include a whole lot of false positives. That’s why
you want to be as precise as possible. One powerful trick
is to enclose your error message in quotation marks, telling
the search engine that you’re looking for an exact phrase,
rather than a single result containing all or most of the words
somewhere on the page. However, you don’t want to be so
specific that you end up narrowing your results down to zero.
Therefore, for an entry from the Apache error log like this:
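Here's a made-up example of the sort of thing I mean (your own entry will contain different details):

[Fri Dec 16 02:25:55 2005] [error] [client 1.2.3.4] Client sent malformed Host header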
…you should leave out the date and client IP address, because
there’s no way anyone else got those exact details. Instead,
include only the “Client sent…” part in quotations:
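Searching for something along these lines would do it (the wording here is just an illustration):

"Client sent malformed Host header"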
If that’s still too broad, consider adding the strings Apache and
[error] outside the quotation marks.
Finally, if you see that many or all of the false positives you’re
getting seem to include a single word that is very unlikely
to occur in the pages you’re looking for, exclude it with a
dash. In this example you, of course, were looking for help
learning how to write Bash scripts, but you kept seeing links
with advice for aspiring Hollywood screenwriters. Here’s how
to solve it:
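One illustrative version of such a search might be:

how to write scripts -movie -screenplay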
Good luck!
David Clinton¹⁰
The Data Project¹¹
¹⁰https://bootstrap-it.com/davidclinton/
¹¹https://thedataproject.net
Comparing Wages With Consumer Price Index Data
How many people do you know who have a favorite US
government department? I’ve got a favorite, and I’m not even
American. It's the Bureau of Labor Statistics (BLS), and I've been
enjoying the torrents of employment and economics-related
data they produce for decades now.
But my admiration for BLS jumped to a new level when
I discovered their well-supported application programming
interface (API). This opens up all that delicious data to smart
retrieval through our Python scripts. And it gives us some rich
resources for discovery.
Let me illustrate with a relatively simple example. I’m going
to use the API to request US consumer price index (CPI) and
wage and salary statistics between 2002 and 2020.
The CPI is a measure of the price of a basket of essential
consumer goods. It’s an important proxy for changes in the
cost of living which, in turn, is an indicator of the general
health of the economy.
Our wages data will come from the BLS Employment Cost
Index covering “wages and salaries for private industry work-
ers in all industries and occupations.” A growing employment
index would, at first glance, suggest that things are getting
better for most people.
You can also search for data sets on this page¹³. Searching
for “computer,” for instance, will take you to a list that
includes the deeply tempting “Average hourly wage for level
11 computer and mathematical occupations in Austin-Round
Rock, TX.” The information you’ll discover by expanding
that selection will include its series ID (endpoint) of
"WMU00124201020000001500000011".
Because I know you can barely contain your curiosity, I’ll tell
you that it turns out that level 11 computer and mathematical
professionals in Austin-Round Rock, Texas could expect to
earn $51.76/hour in 2019.
How do you turn series IDs into Python-friendly data? Well
that’s what we’ll learn next.
Getting the GET and PUT requests exactly right can be
complicated. But because I enjoy a simple life, I decided to
go with one of the available third-party Python libraries. The
one I use is called, simply, bls and is available through Oliver
Sherouse’s GitHub repo¹⁴. You install the library on your host
¹³https://beta.bls.gov/dataQuery/search
¹⁴https://github.com/OliverSherouse/bls
machine using:
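pip install bls

(I'm assuming here that the package name on PyPI matches the repo name.) You'll also need a free API key - you can register for one through the BLS site¹⁵ - which the library can read from an environment variable: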
export BLS_API_KEY=lk88af0f0d5fd1iw290s52a01b8q
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import bls
Now I pass the BLS endpoint for the wages and salaries data
series to the bls.get_series command from the bls library.
I copied the endpoint from the popular data sets page on the
BLS website. I’ll assign the data series that comes back to the
variable wages and then take a look at a sample from the data
set.
¹⁵https://data.bls.gov/registrationEngine/
wages = bls.get_series('CIU2020000000000A')
wages
0 2002Q1 3.5
1 2002Q2 3.6
2 2002Q3 3.1
3 2002Q4 2.6
4 2003Q1 2.9
cpi = bls.get_series('CUUR0000SA0')
cpi.to_csv('cpi_data.csv')
cpi_data = pd.read_csv('cpi_data.csv')
cpi_data.columns = 'Date','CPI'
cpi_data
Date CPI
0 2002-01 177.100
1 2002-02 177.800
2 2002-03 178.800
3 2002-04 179.800
4 2002-05 179.800
... ... ...
222 2020-07 259.101
223 2020-08 259.918
224 2020-09 260.280
225 2020-10 260.388
226 2020-11 260.229
227 rows × 2 columns
edit each row. But that would be boring and time consuming
for the hundreds of rows we’re working with. And it would
be impossible for the millions of rows of data used by many
other analytics projects.
So we’re going to automate our data clean up.
We’ll focus first on the CPI series. I’m going to use the Python
str.replace method to search for any occurrence of -03 (i.e.,
“March”) in the Date column, and replace it with the string Q1.
This will match the Q1 records in my wages data set. I’ll then
do the same for the June, September, and December rows.
cpi_data['Date'] = cpi_data['Date'].str.replace('-03', 'Q1')
cpi_data['Date'] = cpi_data['Date'].str.replace('-06', 'Q2')
cpi_data['Date'] = cpi_data['Date'].str.replace('-09', 'Q3')
cpi_data['Date'] = cpi_data['Date'].str.replace('-12', 'Q4')
cpi_data['Date']
0 2002-01
1 2002-02
2 2002Q1
3 2002-04
4 2002-05
...
222 2020-07
223 2020-08
224 2020Q3
225 2020-10
226 2020-11
Name: Date, Length: 227, dtype: object
The quarterly records are exactly the way we want them, but
the rest are obviously still a problem. We should look for a
characteristic that’s unique to all the records we don’t want
to keep and use that as a filter. Our best (and perhaps only)
choice is the dash (“-“). The str.contains method when set
to False will, when run against the Date column as it is here,
drop all the contents of all rows that contain a dash.
newcpi_data = cpi_data[(cpi_data.Date.str.contains("-") == False)]
newcpi_data
Date CPI
2 2002Q1 178.800
5 2002Q2 179.900
8 2002Q3 181.000
11 2002Q4 180.900
14 2003Q1 184.200
... ... ...
212 2019Q3 256.759
215 2019Q4 256.974
218 2020Q1 258.115
221 2020Q2 257.797
224 2020Q3 260.280
75 rows × 2 columns
If you like, you can save the data series in its final state to a
CSV file:
newcpi_data.to_csv('cpi-clean.csv')
Once again, I’ll save the data to a CSV file (which, again,
isn’t necessary), push it to a dataframe I’ll call df, and give
it column headers. Naming the date column Date to match
the CPI set will make things easier.
wages = bls.get_series('CIU2020000000000A')
wages.to_csv('bls_wages_data.csv')
df = pd.read_csv('bls_wages_data.csv')
df.columns = 'Date','Wages'
df.head()
Date Wages
0 2002Q1 3.5
1 2002Q2 3.6
2 2002Q3 3.1
3 2002Q4 2.6
4 2003Q1 2.9
newdf = df
say, the three months of 2002 Q1 wasn’t 3.5%, but only one
quarter of that (or 0.875%). If I don’t make this adjustment,
but continue to map quarterly growth numbers to quarterly
CPI prices, then our calculated output will lead us to think
that wages are growing so fast that they’ve become detached
from reality.
And here's where part of the fake math is going to rear its
ugly head. I'm going to divide each quarterly growth rate by
four. Or, in other words, I'll pretend that the real changes to
wages during those three months were exactly one quarter of
the reported year-over-year rate. That's almost certainly not
true, and it's a gross simplification. However, for the big
historical picture I'm trying to draw here, it's probably
close enough.
Now that will still leave us with a number that’s a percentage.
But the corresponding CPI number we’re comparing it to is,
again, a point figure. To “solve” this problem I’ll apply one
more piece of fakery.
To convert those percentages to match the CPI values, I’m
going to create a function. I’ll feed the function the starting
(2002 Q1) CPI value of 177.10. That’ll be my baseline. I’ll give
that variable the name newnum.
For each iteration the function makes through the rows of
my wages data series, I'll divide the current wage value (x) by
400. The 100 simply converts the percentage (3.5, etc.) to a
decimal (0.035), and the four reduces the annual rate (12 months)
to a quarterly rate (3 months).
To convert that to a usable number, I’ll multiply it by the
current value of newnum and then add newnum to the product.
That should give us an approximation of the original CPI
value adjusted by the related wage-growth percentage.
newnum = 177.1

def process_wages(x):
    global newnum
    if type(x) is str:
        return x
    elif x:
        newnum = (x / 400) * newnum + newnum
        return newnum
    else:
        return

newwages = newdf.applymap(process_wages)
newwages
Date Wages
0 2002Q1 178.649625
1 2002Q2 180.257472
2 2002Q3 181.654467
3 2002Q4 182.835221
4 2003Q1 184.160776
... ... ...
70 2019Q3 273.092663
71 2019Q4 275.140858
72 2020Q1 277.410770
73 2020Q2 279.421999
74 2020Q3 281.308097
75 rows × 2 columns
Looks great.
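The merge step itself is simple enough, since both series share a Date column. Something along these lines would do it (assuming the newcpi_data and newwages names from above):

merged_data = pd.merge(newcpi_data, newwages, on='Date')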
merged_data
Date CPI Wages
0 2002Q1 178.800 178.649625
1 2002Q2 179.900 180.257472
2 2002Q3 181.000 181.654467
3 2002Q4 180.900 182.835221
4 2003Q1 184.200 184.160776
... ... ... ...
70 2019Q3 256.759 273.092663
71 2019Q4 256.974 275.140858
72 2020Q1 258.115 277.410770
73 2020Q2 257.797 279.421999
74 2020Q3 260.280 281.308097
75 rows × 3 columns
Our data is all there. We could visually scan through the CPI
and Wages columns and look for any unusual relationships,
but we didn’t come this far to just look at numbers. Let’s plot
the thing.
Here we'll tell plot to take our merged dataframe (merged_data)
and create a bar chart. Because there's an awful lot of data here,
I'll extend the size of the chart with a manual figsize value.
I'll set the x-axis labels to use the values in the Date column
and, again because there are so many of them, I'll rotate the
labels by 45 degrees to make them more readable. Finally, I'll
set the label for the y-axis.
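Here's roughly what that call might look like (the exact figsize numbers are a matter of taste):

ax = merged_data.plot(kind='bar', x='Date', figsize=(20, 10), rot=45)
ax.set_ylabel('Index value')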
It's certainly a busy graph, but you can clearly see the gentle
upward slope, punctuated by a handful of sudden jumps. Next,
we'll see that same data after removing three out of every four
months' data points. The same ups and downs are still visible.
Given our overall goals, I’d categorize our transformation as
a success.
The gentle curve you see does make sense - it’s about real
growth, after all, not growth rates. But it’s also possible to
recognize a few spots where the curve steepens, and others
where it smooths out. But why are the slopes so smooth in
comparison with the percentage-based data? Look at the Y-
axis labels: the index graph is measured in points between
180 and 280, while the percentage graph goes from 0-3.5. It’s
the scale that’s different.
All in all, I believe we’re safe concluding that what we’ve
produced is a good match with our source data.
housing = bls.get_series('SUUR0000SAH')
housing.to_csv('housing_index.csv')
Date Index
2004-01 112.5
2004-07 115.0
2005-01 115.6
2005-07 118.1
2006-01 119.6
2006-07 122.0
2007-01 122.767
2007-07 125.416
2008-01 125.966
2008-07 130.131
2008-08 129.985
2008-09 129.584
2008-10 129.189
2008-11 128.667
2008-12 128.495
¹⁶https://www.wsj.com/market-data/quotes/index/SPX/historical-prices
sp = pd.read_csv('new_s_p_500.csv')
sp['Date'] = sp['Date'].astype('datetime64[ns]')
sp['Date'] = sp['Date'].dt.strftime('%Y-%m-%d')
Now I'll use str.replace in much the same way I did for the
CPI data, reformatting each quarter-end quote's date as the year
plus, say, "Q1".
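Something like this would do the trick (the exact day-of-month suffixes will depend on how your quotes are dated):

sp['Date'] = sp['Date'].str.replace('-03-31', 'Q1')
sp['Date'] = sp['Date'].str.replace('-06-30', 'Q2')
sp['Date'] = sp['Date'].str.replace('-09-30', 'Q3')
sp['Date'] = sp['Date'].str.replace('-12-31', 'Q4')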
sp = sp[(sp.Date.str.contains("Q") == True)]
Date Close
187 2002Q3 815.28
251 2002Q4 879.82
375 2003Q2 974.50
439 2003Q3 995.97
503 2003Q4 1111.92
Looking through the data will show you that some quarters
are actually missing. I’ll let you figure out why that is (hint:
you’ll need some very basic general domain knowledge to
figure it out).
Now that we’ve got our S&P quotes all nice and comfy in a
dataframe, what should we do with them? My first impulse
was to throw all three of our data sets into a single plot so
we can compare them. That was a terrible idea, but let’s see
it through anyway so we can learn why.
The way we merged our first two data sets into a single
dataframe earlier won't work here. But getting it done the
right way isn't at all difficult. I'll have to import a new tool
called reduce that'll help. First, though, I'll create a new
list (dfs) made up of each of our three existing sets.
Now I can use reduce to merge the three sets within dfs with
the common Date column acting as a single index.
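Here's one way that could look (assuming the newcpi_data, newwages, and sp dataframes from earlier):

from functools import reduce

dfs = [newcpi_data, newwages, sp]
merged_three = reduce(lambda left, right: pd.merge(left, right, on='Date'), dfs)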
And with that, we’re all set to plot our triple-set monster:
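Producing the doomed plot itself takes only a line (again assuming the merged_three name from above):

merged_three.plot(figsize=(20, 10))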
Ouch. The first - and smaller - problem is that the x-axis labels
are out of sync with the actual data. That’s because line plots
are normally used to represent sequential, numeric data that
won’t require step-by-step labelling. Our 2002Q4 labels just
weren’t what Python was expecting.
But the bigger issue is that, as a way to visually compare our
data sets, this is pretty much unusable. That’s because the
S&P data is on a hugely different scale (ranging between 800
and 3400), making differences between the CPI and wages sets
nearly invisible.
I suppose we could play with the scale of the S&P data,
perhaps dividing all the numbers by, say, 10. But why bother?
first = 815.28
last = 3756.07
periods = 20
first = 178.8
last = 260.28
periods = 20
first = 178.64
last = 281.31
periods = 20
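Each of those first/last/periods triplets presumably feeds the same growth-rate arithmetic - something like a per-period compound rate:

growth_rate = (last / first) ** (1 / periods) - 1
print(round(growth_rate * 100, 2))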
You’ll get the whole story, including a nice explanation for the
data manipulation choices they made, by reading the study
itself. In fact, I encourage you to read that study, because
it’s a great example of how the professionals approach data
problems.
From here on in, however, you’ll be stuck with my amateur
and simplified attempts to visualize the raw, unadjusted data
record.
import pandas as pd
import matplotlib as plt
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('all-us-hurricanes-noaa.csv')
Let's look at the data types for each column. We can ignore the
strings in the States and Name columns - we're not interested
in those anyway. But we will need to do something with the
date and Max Wind columns - they won't do us any good as
plain object types.
df.dtypes
Year object
Month object
States Affected and Category by States object
Highest\nSaffir-\nSimpson\nU.S. Category float64
Central Pressure\n(mb) float64
Max Wind\n(kt) object
Name object
dtype: object
So I’ll filter all rows in the Year column for the letter s and
simply drop them (== False). That will take care of all the
decade headers (i.e., those rows containing an s as part of
something like 1850s).
I’ll similarly drop rows containing the string None in the
Month column to eliminate years without storm events. While
quiet years could have some impact on our visualizations, I
suspect that including them with some kind of null value
would probably skew things even more the other way. They’d
also greatly complicate our visualizations. Finally, I’ll replace
those two multi-month rows.
df = df[(df.Year.str.contains("s")) == False]
df = df[(df.Month.str.contains("None")) == False]
df = df.replace('Sp-Oc','Sep')
df = df.replace('Jl-Au','Jul')
df = df.astype({'Year': 'int'})
df = df.replace('-----',np.NaN)
df = df.astype({'Max Wind': 'float'})
²¹https://www.w3schools.com/python/gloss_python_date_format_codes.asp
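Two other steps happen along the way that aren't spelled out above: the columns get shorter names, and the three-letter month abbreviations get converted to numbers. My own rough reconstruction of those steps would look like this:

df.columns = ['Year', 'Month', 'States', 'Category', 'Pressure', 'Max Wind', 'Name']
df['Month'] = pd.to_datetime(df['Month'], format='%b').dt.month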
df.dtypes
Year int64
Month int64
States object
Category float64
Pressure float64
Max Wind float64
Name object
dtype: object
Much better.
df['Category'].value_counts()
1.0 121
2.0 83
3.0 62
4.0 25
5.0 4
Name: Category, dtype: int64
df.hist(column='Year', bins=25)
df_category = df[['Year','Category']]
df_wind = df[['Year','Max Wind']]
df_pressure = df[['Year','Pressure']]
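One way to generate a separate histogram for each hurricane category would be a simple loop over df_category (this is my own sketch rather than the code that produced the originals):

for category in [1, 2, 3, 4, 5]:
    df_category[df_category.Category == category].hist(column='Year', bins=25)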
Histograms for Category 1, Category 2, Category 3, Category 4, and Category 5 hurricanes
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('all-us-tropical-storms-noaa.csv')
df = df[(df.Date.str.contains("None")) == False]
df.dtypes
Storm# object
Date object
Time object
Lat object
Lon object
MaxWinds float64
LandfallState object
StormName object
dtype: object
I’m actually not sure what those Storm # values are all about,
but they’re not hurting anyone. The dates are formatted much
better than they were for the hurricane data. But I will need
to convert them to a new format. Let’s do it right and go with
datetime.
df.Date = pd.to_datetime(df.Date)
df1 = df[['Date','MaxWinds']]
df1['Date'].hist()
But we really should drill down a bit deeper here. After all,
this data just mixes together 30 knot with 75 knot storms.
We’ll definitely want to know whether or not they’re happen-
ing at similar rates.
Let’s find out how many rows of data we’ve got. shape tells
us that we’ve got 362 events altogether.
print(df1.shape)
(362, 2)
df1
Date MaxWinds
1 1851-10-19 50.0
6 1856-08-19 50.0
7 1857-09-30 50.0
8 1858-09-14 60.0
9 1858-09-16 50.0
... ... ...
391 2017-09-27 45.0
392 2018-05-28 40.0
393 2018-09-03 45.0
394 2018-09-03 45.0
395 2019-09-17 40.0
362 rows × 2 columns
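The wind-speed bins themselves can be built with straightforward boolean filters (the bin edges here are inferred from the print statements that follow):

df_30 = df1[(df1.MaxWinds >= 30) & (df1.MaxWinds < 40)]
df_40 = df1[(df1.MaxWinds >= 40) & (df1.MaxWinds < 50)]
df_50 = df1[(df1.MaxWinds >= 50) & (df1.MaxWinds < 60)]
df_60 = df1[(df1.MaxWinds >= 60) & (df1.MaxWinds < 80)]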
Let's confirm that the cut-off points we've chosen make sense.
This code will neatly print the number of rows in the
index of each of our four dataframes.
st1 = len(df_30.index)
print('The number of storms between 30 and 39: ', st1)
st2 = len(df_40.index)
print('The number of storms between 40 and 49: ', st2)
st3 = len(df_50.index)
print('The number of storms between 50 and 59: ', st3)
st4 = len(df_60.index)
print('The number of storms between 60 and 79: ', st4)
df_40['MaxWinds'].value_counts()
40.0 71
45.0 42
Name: MaxWinds, dtype: int64
df_30['Date'].hist(bins=20)
df_40['Date'].hist(bins=20)
df_50['Date'].hist(bins=20)
df_60['Date'].hist(bins=20)
I got my GDP data from the World Bank site, on this page²⁶.
My Index of Economic Freedom data I took from here²⁷.
Here’s how all the basic setup works in a Jupyter notebook:
import pandas as pd
import numpy as np
import matplotlib as plt
import matplotlib.pyplot as plt
gdp = pd.read_csv('WorldBank_per_capita_GDP.csv')
ec_index = pd.read_csv('heritage-all-data.csv')
Running gdp shows us that the GDP data set has 248 rows
that include plenty of stuff we don't want, like some null
values (NaN) and at least a few rows at the end with general
(non-country) values that would get in the way of our focus on
countries. We'll need to clean all that up, although our code will
simply ignore those general values, because there will be no
corresponding rows in the ec_index dataframe.
gdp
dtypes shows us that the Year and Value columns are already
formatted as float64, which is perfect for us.
gdp.dtypes
Country object
Year float64
Value float64
dtype: object
We should also check the column data types used in our Index
of Economic Freedom dataframe. The only three columns
we'll be interested in here are Name (which holds country
names), Index Year (because there are multiple years of
data included), and Overall Score.
ec_index.dtypes
Name object
Index Year int64
Overall Score float64
Property Rights float64
Judicial Effectiveness float64
Government Integrity float64
Tax Burden float64
Government Spending float64
I’ll select just the columns from both dataframes that we’ll
need:
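Something like this would do it (the renamed columns and the string conversion are my assumptions, based on the code that follows):

gdp = gdp[['Country', 'Year', 'Value']]
ec_index = ec_index[['Name', 'Index Year', 'Overall Score']]
ec_index.columns = ['Country', 'Year', 'Score']
# Converting Year to a string lets the isin(["2019"]) filter below match:
ec_index['Year'] = ec_index['Year'].astype(str)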
Next, I’ll limit the data we take from ec_index to only those
rows whose Year column covers 2019.
ec_index = ec_index[ec_index.Year.isin(["2019"])]
Now I'll merge the two dataframes, telling Pandas to use the
values of Country to align the data. When that's done, I'll
select only those columns that we still need (which excludes
the Year column that came along from ec_index), and then remove
rows with NaN values. That'll be all the data cleaning we'll
need here.
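A sketch of that sequence (again assuming the column names used above):

merged_data = pd.merge(gdp, ec_index, on='Country')
merged_data = merged_data[['Country', 'Value', 'Score']]
merged_data = merged_data.dropna()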
plt.scatter(merged_data.Score, merged_data.Value)
A simple scatter plot showing countries’ index scores on the x-axis and
GDP on the y-axis
What that did was take the two data points (GDP and index score)
for each country and plot a dot at their intersection point. The
higher a country's index score, the further along to the right
its dot will appear. And the higher its per capita GDP, the
higher up the y-axis it'll be.
If there were absolutely no correlation between a country’s
GDP and its index score (or, in other words, the economic
freedoms had no impact on production) then you would
expect to see the dots spread randomly across both axes. The
fact that we can easily see a pattern - the dots are clearly
trending towards the top-right of the chart - tells us that
higher index scores tend to predict higher GDP.
Of course there are anomalies in our data. There are countries
whose position appears way out of range of all the others. It
would be nice if we could somehow see which countries those
are. And it would also be nice if we could quantify the precise
statistical relationship between our two values, rather than
having to visually guess. I’ll show you how those work in just
a moment.
But first, one very small detour. Like everything else in the
technology world, there are many ways to get a task done
in Python. Here’s a second code snippet that’ll generate the
exact same output:
x = merged_data.Score
y = merged_data.Value
plt.scatter(x,y)
There will be times when using that second style will make it
easier to add features to your output. But the choice is yours.
import plotly.graph_objs as go
import plotly.express as px
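With Plotly Express, the hover-enabled version of our scatter plot can be as short as this (column names taken from the merged_data dataframe built earlier):

fig = px.scatter(merged_data, x='Score', y='Value', hover_name='Country')
fig.show()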
This time, when you run the code, you get the same nice plot.
But if you hover your mouse over any dot, you’ll also see its
data values. In this example, we can see that the tiny - but rich
- country of Luxembourg has an economic freedom score of
75.9 and a per capita GDP of more than 121 thousand dollars.
# The function wrapper here is assumed (the original signature isn't shown),
# but the body is a standard least-squares fit of y = m*x + c:
def best_fit(x, y):
    A = np.vstack([x, np.ones(len(x))]).T
    m, c = np.linalg.lstsq(A, y, rcond=None)[0]
    return m, c

fig.show()
When I hover over the regression line, I’m shown an R^2 value
of 0.550451. Or, in other words, around 55%. For our purposes,
I’d consider that a pretty good correlation.
Other Considerations
When interpreting our plots, we should always seek to vali-
date what we’re seeing in the context of the real world. If, for
instance, the data is too good to be true, then it probably isn’t
true.
For example, the residuals - the points that don't fall right
next to the regression line - should normally contain a visually
random element. They should, in other words, present no
visible pattern. A visible pattern would suggest a bias in the data.
We should also be careful not to mix up correlation with
causation. Just because, for instance, there does seem to
be a demonstrable relationship between economic freedoms
and productivity, we can’t be absolutely sure which way
that relationship works: do greater freedoms lead to more
serve. If, on the other hand, they often cast votes indepen-
dently of their parties, then they might be thinking more
about their own constituents.
This won’t definitively prove anything one way or the other
but, if we can access a large enough data set, we should be
able to draw some interesting insights.
We’ll begin on a webpage managed by the House
of Commons itself: Parliament’s Open Data project -
ourcommons.ca/en/open-data²⁹. The page explains how we
can make use of a freely-available application programming
interface (API) to download and manipulate detailed data
representing the core operations of Canada’s legislature.
Like many APIs in use these days, the precise syntax you need
to get the data you’re after can be a bit of a puzzle. But since
most programmers enjoy puzzles, this isn’t a big deal. The
OurCommons API expects you to play around with URLs
based on the base address, ourcommons.ca/Members/. Adding
en tells the server that you want service in English. Adding
a forward slash and then the word votes means that you’re
looking for voting records.
Some resources are available in XML formatting, while others
can be downloaded in the spreadsheet-friendly CSV format.
But we’re going to stick with plain old HTML for our purposes.
That means any URL you see here can be loaded and enjoyed
in your browser just like any other webpage.
We’ll begin with the Votes page - ourcom-
mons.ca/Members/en/votes³⁰. The main data on this
page consists of a table listing all the bills associated with a
particular parliamentary session.
²⁹https://www.ourcommons.ca/en/open-data
³⁰https://www.ourcommons.ca/Members/en/votes
https://www.ourcommons.ca/Members/en/votes?parlSession=41-2
That URL would present you with links to all the votes from
that session. If you preferred to see only private members’
bills from that session, you could add the bill document
argument: TypeId=4. Substituting TypeId=3 for that, as with the
next example, would return all house government bills. This
example points to house government bills from the current
session (the second session of the 43rd parliament):
https://www.ourcommons.ca/Members/en/votes?parlSession=43-2&billDocumentTypeId=3
import pandas as pd
dfs = pd.read_html('https://www.ourcommons.ca/Members/en/votes/43/2/17', header=0)
The specific data we're after exists in the dataframe identified
as dfs[0] (rather than just dfs). That's because read_html returns
a list containing one dataframe for each HTML table it finds on
the page, and the table we want is the first one. So for
convenience, I'll push that to the new dataframe, df:
df.shape
(319, 4)
There are 319 rows and four columns, which means that 319
members cast votes on this bill.
To keep things clean, I’ll change the names of the column
headers:
df.columns = ['Member','Party','Vote','Paired']
All the votes of these first five members went against the bill
(“Nay”). An affirmative vote would be identified as “Yea.”
We can easily see how the party numbers broke down using
the .value_counts() method:
df['Party'].value_counts().to_frame()
Party
Liberal 146
Conservative 115
Bloc Québécois 31
NDP 22
Green Party 3
Independent 2
I’m sure you’re impatiently waiting to hear how the vote went.
Once again, it’s .value_counts() to the rescue:
df['Vote'].value_counts().to_frame()
Vote
Nay 263
Yea 56
Not a happy end, I’m afraid. The bill was shot down in flames.
dfs_vote_list = pd.read_html('https://www.ourcommons.ca/Members/en/votes?parlSession=42-1&billDocumentTypeId=4', header=0)
vote_list = dfs_vote_list[0]
total_votes = 0
partyLineVotesConservative = 0
non_partyLineVotesConservative = 0
partyLineVotesLiberal = 0
non_partyLineVotesLiberal = 0
partyLineVotesNDP = 0
non_partyLineVotesNDP = 0
partyLineVotesBloc = 0
non_partyLineVotesBloc = 0
def liberal_votes():
    global partyLineVotesLiberal
    global non_partyLineVotesLiberal
    df_party = df[df['Party'].str.contains('Liberal')]
    vote_output_yea = df_party['Vote'].str.contains('Yea')
    total_votes_yea = vote_output_yea.sum()
    vote_output_nay = df_party['Vote'].str.contains('Nay')
    total_votes_nay = vote_output_nay.sum()
    if total_votes_yea > 0 and total_votes_nay > 0:
        non_partyLineVotesLiberal += 1
    else:
        partyLineVotesLiberal += 1
dfs_vote_list = pd.read_html('https://www.ourcommons.ca/Members/en/votes?parlSession=42-1&billDocumentTypeId=3', header=0)
vote_list = dfs_vote_list[0]
vote_list.columns = ['Number','Type','Subject','Votes','Result','Date']
vote_list['Number'] = vote_list['Number'].str.extract(r'(\d+)', expand=False)
base_url = "https://www.ourcommons.ca/Members/en/votes/42/1/"
url_data = pd.DataFrame(columns=["Vote"])
Vote = []

for name in vote_list['Number']:
    newUrl = base_url + name
    Vote.append(newUrl)

url_data["Vote"] = Vote
url_data.head()
Vote
0 https://www.ourcommons.ca/Members/en/votes/42/...
1 https://www.ourcommons.ca/Members/en/votes/42/...
2 https://www.ourcommons.ca/Members/en/votes/42/...
3 https://www.ourcommons.ca/Members/en/votes/42/...
4 https://www.ourcommons.ca/Members/en/votes/42/...
url_data.to_csv(r'url-text-42-1-privatemembers', header=None, index=None, sep=' ', mode='a')
URLS = open("url-text-42-1-privatemembers", "r")

for url in URLS:
    # Read next HTML page in set:
    dfs = pd.read_html(url, header=0)
    df = dfs[0]
    df.rename(columns={'Member Voted':'Vote'}, inplace=True)
    df.rename(columns={'Political Affiliation':'Party'}, inplace=True)
    # Ignore unanimous votes:
    vote_output_nay = df[df['Vote'].str.contains('Nay', na=False)]
    total_votes_nay = vote_output_nay['Vote'].str.contains('Nay', na=False)
    filtered_votes = total_votes_nay.sum()
    if filtered_votes == 0:
        continue
    # Call functions to tabulate votes:
    else:
        liberal_votes()
        conservative_votes()
        ndp_votes()
        bloc_votes()
        total_votes += 1
…And so on.
Government Bills
To this:
…And run the code again. When I did that, here’s what came
back:
The financial data I’m using here comes from the World
Bank data.worldbank.org site³¹ and presents per capita gross
domestic product numbers from 2019.
import pandas as pd
import numpy as np
import matplotlib as plt
import matplotlib.pyplot as plt
gdp = pd.read_csv('WorldBank_PerCapita_GDP.csv')
gdp.head()
Country Value
0 Afghanistan 2156.4
1 Albania 14496.1
2 Algeria 12019.9
3 American Samoa NaN
4 Andorra NaN
³¹https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD
The health data is from the GBD Results Tool on the Global
Health Data Exchange site³². I used their drop-down configurations
to select data by country that focused on the prevalence of
individual illnesses measured as a rate per 100,000. These
records also represent 2019.
After selecting just the two columns that interest me from
each of the health CSV files (“Country” and “Val” - which is
the average prevalence rate of the illness per 100,000 over the
reporting period), here’s what my data will look like:
anx = pd.read_csv('anxiety.csv')
anx = anx[['Country','Val']]
anx.head()
Country Val
0 Afghanistan 485.96
1 Albania 258.95
2 Algeria 817.41
3 American Samoa 73.16
4 Andorra 860.77
My assumptions
I’m only trying to measure prevalence, not outcomes or life ex-
pectancy. In that context, I would expect that medical condi-
tions which seem related more to organic than environmental
conditions should appear more or less evenly across societies
at all stages of economic development. The prevalence of
conditions that are more likely the result of environmental
influences should, on the other hand, vary between economic
strata.
³²http://ghdx.healthdata.org/gbd-results-tool
import plotly.graph_objs as go
import plotly.express as px
merged_data_anx = pd.merge(gdp, anx, on='Country')
merged_data_anx = merged_data_anx[['Country', 'Value', 'Val']]
merged_data_anx.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)
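From there, a hover-enabled scatter plot comparing per capita GDP (Value) against anxiety prevalence (Val) might look like this:

fig = px.scatter(merged_data_anx, x='Value', y='Val', hover_name='Country')
fig.show()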
https://statsapi.web.nhl.com/api/v1/teams/15/roster
import pandas as pd
import requests
import json
import matplotlib.pyplot as plt
import numpy as np
roster page that the next lines of code will scrape. As we did
before for teams, we’ll insert each person.id record into the
new endpoint and save it to the variable url. I’ll then scrape
the birthDate field for each player, read it to the birthday
variable and, after first stripping unnecessary characters, read
that to newmonth. Finally, I’ll pull the birthCountry status
from the page and, using if, drop the player if the value is
anything besides CAN.
All that will then be plotted in a histogram using
df3.months.hist(). Take a few minutes to look over
this code to make sure it all makes sense.
df3 = pd.DataFrame(columns=['months'])

for team_id in range(1, 11, 1):
    url = 'https://statsapi.web.nhl.com/api/v1/teams/{}/roster'.format(team_id)
    r = requests.get(url)
    roster_data = r.json()
    df = pd.json_normalize(roster_data['roster'])
    for index, row in df.iterrows():
        newrow = row['person.id']
        url = 'https://statsapi.web.nhl.com/api/v1/people/{}'.format(newrow)
        newerdata = requests.get(url)
        player_stats = newerdata.json()
        birthday = (player_stats['people'][0]['birthDate'])
        newmonth = int(birthday.split('-')[1])
        country = (player_stats['people'][0]['birthCountry'])
        if country == 'CAN':
            df3 = df3.append({'months': newmonth}, ignore_index=True)
        else:
            continue

df3.months.hist()
• Be careful how and how often you use this code. There
are nested for/loops that mean running the script even
once will hit the NHL’s API with more than a thousand
queries. And that’s assuming everything goes the way
it should. If you make a mistake, you could end up
annoying people you don’t want to annoy.
• This code (for team_id in range(1, 11, 1):)
actually only scrapes data from ten teams. For
some reason, certain API roster endpoints failed
to respond to my queries and actually crashed the
script. So, to get as much data as I could, I ran the
script multiple times. This one was the first of those
runs. If you want to try this yourself, remove the
df3 = pd.DataFrame(columns=['months']) line from
subsequent iterations so you don’t inadvertently reset
the value of your DataFrame to zero.
• Once you’ve successfully scraped your data, use
something like df3.to_csv('player_data.csv') to
copy your data to a CSV file, allowing you to further
analyze the contents even if the original dataframe is
lost. It’s always good to avoid placing an unnecessary
load on the API origin.
import pandas as pd
df = pd.read_csv('player_data.csv')
df['months'].value_counts()
Month Frequency
5 35
2 29
1 26
8 25
3 23
7 21
4 20
6 18
10 17
12 13
11 10
9 10
Looks like there were nearly double the births in the first four
months of the year compared with the final four. Now that's exactly
import pandas as pd
df = pd.read_csv('player_data.csv')
df.hist(column='months', bins=12);
df2 = df['months'].value_counts()
df2.plot(kind='bar')
need to paste the Wikipedia URL, click Convert, and the site
does all the rest.
With that, we’re ready to begin. After importing the libraries
we’ll need, we can read our new CSV file into a data frame.
import pandas as pd
import matplotlib as plt
import matplotlib.pyplot as plt

# https://en.wikipedia.org/wiki/Overview_of_gun_laws_by_nation
df = pd.read_csv('wikipedia_gun_law_table.csv')
df.rename(columns={'Concealed carry[8]':'Carry'}, inplace=True)
df = df[['Region', 'Carry']]
df.head()
Region Carry
0 Afghanistan[9][law 1] Restricted
1 Albania[law 2] Self-defense permits
2 Algeria[10] No[N 2]
3 Andorra[law 3] Justification required
4 Angola[11] Restricted
df['Region'] = df['Region'].str.replace(r"\[.*\]", "", regex=True)
df['Carry'] = df['Carry'].str.replace(r"\[.*\]", "", regex=True)
df['Carry'] = df['Carry'].str.replace(r"\(.*\)", "", regex=True)
df['Carry'].value_counts().to_frame()
import re
df['Carry'].replace(re.compile('.*Yes.*'), 'Yes', inplace=True)
df['Carry'].replace(re.compile('.*Rarely.*'), 'Rarely', inplace=True)
df['Carry'].replace(re.compile('.*rarely.*'), 'Rarely', inplace=True)
df['Carry'].replace(re.compile('.*No.*'), 'No', inplace=True)
df['Carry'].replace(re.compile('.*Restrict.*'), 'Restrict', inplace=True)
df['Carry'].replace(re.compile('.*restrict.*'), 'Restrict', inplace=True)
df = df[~df.Carry.str.contains("Justification", na=False)]
df = df[~df.Carry.str.contains("legal", na=False)]
df = df[~df.Carry.str.contains("states", na=False)]
df = df[~df.Carry.str.contains("Moratorium", na=False)]
df = df[~df.Carry.str.contains("specific", na=False)]
dfv = pd.read_csv('Gun-Related-Deaths_WPR.csv')
³⁵https://worldpopulationreview.com/country-rankings/gun-deaths-by-country
dfv.rename(columns={'country':'Region'}, inplace=True)
dfv.rename(columns={'homicide':'Homicides'}, inplace=True)
dfv_data = dfv[['Region','Homicides']]
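Merging the homicide numbers with the carry-law data on their shared Region column gives us a single frame to work with (a quick sketch using the names above):

merged_data = pd.merge(dfv_data, df, on='Region')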
merged_data.head()
Region Homicides Carry
0 El Salvador 26.49 Yes
1 Jamaica 30.38 Yes
2 Panama 14.36 Yes
3 Uruguay 4.78 Yes
4 Montenegro 2.42 Yes
import plotly.graph_objs as go
import plotly.express as px
And this code will build us a scatter plot, complete with the
information that’ll appear when we hover the mouse over a
dot:
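A version of that code might look like this (the log_x setting matches the axis behavior described below):

fig = px.scatter(merged_data, x='Homicides', y='Carry', hover_name='Region', log_x=True)
fig.show()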
Note how Plotly spaced our homicide data across the x-axis.
Because of the numbers we're getting, the tick points (0.01, 0.1,
1, etc.) don't increase by a fixed amount, but by factors of
ten. You can see that in action by hovering over the "No" point
at the top right. Honduras, unfortunately, had more than 66
homicides for every 100,000 people in that year.
prohibitive gun carry legislation - and two of the top five rates
are from “No” jurisdictions. The actual numbers are seven
“No” and ten “Yes.”
On the other end of the spectrum, countries with very low
murder rates skew heavily (14 to 1) towards restrictive laws.
On the other hand, suppose we were to group Restrict
together with Yes. Perhaps “restrict” makes more sense as “yes,
but with some restrictions.” In that case, the top 17 would be
shifted to 13 (Yes) to 4 (No). But the bottom 15 would now be
a bit more evenly balanced: 11 to 4.
Are firmer conclusions possible? Perhaps we need
more data, or a better way to interpret and “translate” legal
standards. But, either way, let’s turn our attention to data
covering US states.
df = pd.read_csv('US-carry-laws-by-state-WPR.csv')
df.rename(columns={'permitReqToCarry':'PermitRequired'}, inplace=True)
df = df[['State','PermitRequired']]
As you can see from the Wyoming record in this head output,
the table represents a negative value (i.e., no permit is
required) as NaN. To make things easier for us, I'll use fillna
to replace those values with False.
df.head(10)
State PermitRequired
0 Washington True
1 New York True
2 New Jersey True
3 Michigan True
4 Maryland True
5 Hawaii True
6 Connecticut True
7 California True
8 Wyoming NaN
9 Wisconsin True
df.PermitRequired.fillna('False', inplace=True)
That’ll be all we’ll need for the legal side. I’ll use data from
another Wikipedia page³⁷ to provide us with information
about gun violence. I manually removed the columns we’re
not interested in before importing the CSV file.
³⁷https://en.wikipedia.org/wiki/Gun_violence_in_the_United_States_by_state
dfv = pd.read_csv('gun_violence_by_US_State_Wikipedia.csv')
Running head against the data frame shows us we’ve got some
cleaning up to do: there isn’t data for every state (as you can
see from Alabama). So I’ll run str.contains to remove those
rows altogether.
dfv.head()
State GunMurderRate
0 Alabama — [a]
1 Alaska 5.3
2 Arizona 2.5
3 Arkansas 3.7
4 California 3.3
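The cleanup itself might go something like this (the filter string is my own guess - the placeholder rows carry an "[a]" footnote marker - and converting the rate to a float will make plotting easier):

dfv = dfv[~dfv.GunMurderRate.str.contains("a", na=False)]
dfv['GunMurderRate'] = dfv['GunMurderRate'].astype('float')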
Just as we did with the world data earlier, I'll create a merged_data
data frame, this time referenced on the State column.
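That merge would look much the same as it did before:

merged_data = pd.merge(df, dfv, on='State')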
import plotly.graph_objs as go
import plotly.express as px
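And the plot itself - again, just a sketch along the lines of the earlier Plotly charts:

fig = px.scatter(merged_data, x='GunMurderRate', y='PermitRequired', hover_name='State')
fig.show()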
What’s next?
Well that’s entirely up to you. As for me, I wish you great
success as you now embark on your own data journey.
There is one thing I would request of you. Writing an honest
review of this book on Amazon can go a long way to help
expose the content to as many eyes as possible. And being
able to see your insights would also make it easier for people
to assess whether the book is a good match for their needs.
Either way, be in touch,
David Clinton³⁸
The Data Project³⁹
³⁸https://bootstrap-it.com/davidclinton/
³⁹https://thedataproject.net