GDS Course
Contents
Information
Overview
Infrastructure
Assessment
Datasets
Bibliography
A - Introduction
Concepts
Hands-on
Do-It-Yourself
B - Open Science
Concepts
Hands-on
Do-It-Yourself
C - Spatial Data
Concepts
Hands-on
Do-It-Yourself
D - Mapping Data
Concepts
Hands-on
Do-It-Yourself
E - Spatial Weights
Concepts
Hands-on
Do-It-Yourself
F - ESDA
Concepts
Hands-on
Do-It-Yourself
G - Clustering
Concepts
Hands-on
Do-It-Yourself
H - Points
Concepts
Hands-on
Do-It-Yourself
Contact
Citation
JOSE DOI: 10.21105/jose.00042
@article{darribas_gds_course,
  author = {Dani Arribas-Bel},
  title = {A course on Geographic Data Science},
  year = 2019,
  journal = {The Journal of Open Source Education},
  volume = 2,
  number = 14,
  doi = {https://doi.org/10.21105/jose.00042}
}
Overview
Aims
The module provides students who have little or no prior knowledge with core competences in Geographic Data Science (GDS). These
include the following:
Learning outcomes
By the end of the course, students will be able to:
Demonstrate advanced GIS/GDS concepts and be able to use the tools programmatically to import, manipulate and
analyse spatial data in different formats.
Understand the motivation and inner workings of the main methodological approaches of GDS, both analytical and
visual.
Critically evaluate the suitability of a specific technique, what it can offer and how it can help answer questions of
interest.
Apply a number of spatial analysis techniques and explain how to interpret the results, in a process of turning data into
information.
When faced with a new dataset, work independently using GIS/GDS tools programmatically to extract valuable insight.
Feedback strategy
The student will receive feedback through the following channels:
Formal assessment of three summative assignments: two tests and a computational essay. This will be in the form of
reasoning of the mark assigned, as well as comments specifying how the mark could be improved. This will be provided
no later than three working weeks after the deadline of the assignment submission.
Direct interaction with Module Leader and demonstrators in the computer labs. This will take place in each of the
scheduled lab sessions of the course.
Online forum maintained by the Module Leader where students can contribute by asking and answering questions related
to the module.
https://darribas.org/gds_course
Specific videos, (computational) notebooks, and other resources, as well as academic references are provided for each
learning block.
In addition, the currently-in-progress book “Geographic Data Science with PySAL and the PyData stack” provides an
additional resource for more in-depth coverage of similar content.
Infrastructure
This page covers a few technical aspects on how the course is built, kept up to date, and how you can create a computational
environment to run all the code it includes.
Software stack
This course is best followed if you can not only read its content but also interact with its code and even branch out to write
your own code and play on your own. For that, you will need to have installed on your computer a series of interconnected
software packages; this is what we call a stack. You can learn more about modern stacks for scientific computing on Block B.
Instructions on how to install a software stack that allows you to run the materials of this course depend on the operating
system you are using. Detailed guides are available for the main systems on the following resource, provided by the
Geographic Data Science Lab:
https://gdsl-ul.github.io/soft_install/
@gdsl-ul/soft_install
Github repository
All the materials for this course and this website are available on the following Github repository:
darribas/gds_course
@darribas/gds_course
If you are interested, you can download a compressed .zip file with the most up-to-date version of all the materials,
including the HTML for this website at:
darribas/gds_course
@darribas/gds_course_zip
Continuous Integration
Following modern software engineering principles, this course is continuously tested and the website is built in an automated
way. This means that, every time a new commit is pushed, the following two actions are triggered (click on the badges for
detailed logs):
1. Building of the HTML version of the website
2. Testing of the Python code that makes up the practical sections (“Hands-on” and DIY) of the course
Containerised backend
The course is developed, built and tested using the gds_env , a containerised platform for Geographic Data Science. You
can read more about the gds_env project at:
https://darribas.org/gds_env/
launch binder
⚠ Warning
It is important to note that Binder instances are ephemeral in the sense that the data and content created in a session are
NOT saved anywhere and are deleted as soon as the browser tab is closed.
Binder is also the backend this website relies on when you click on the rocket icon on a page with code. Remember, you
can play with the code interactively but, once you close the tab, all the changes are lost.
Assessment
This course is assessed through four components, each with different weight.
Students are encouraged to contribute to the online discussion forum set up for the module. The contribution to the discussion
forum is assessed as an all-or-nothing 5% of the mark that can be obtained by contributing meaningfully to the online
discussion board set up for the course before the end of the first month of the course. Meaningful contributions include
both questions and answers that demonstrate the student is committed to make the forum a more useful resource for the rest
of the group.
Test I (20%)
Information provided on labs.
Test II (25%)
Information provided on labs.
Computational essay (50%)
Here’s the premise. You will take the role of a real-world data scientist tasked to explore a dataset on the city of Toronto
(Canada) and find useful insights for a variety of decision-makers. It does not matter if you have never been to Toronto. In
fact, this will help you focus on what you can learn about the city through the data, without the influence of prior knowledge.
Furthermore, the assessment will not be marked based on how much you know about Toronto but instead on how much
you can show you have learned through analysing data.
A computational essay is an essay whose narrative is supported by code and computational results that are included in the
essay itself. This piece of assessment is equivalent to 2,500 words. However, this is the overall weight. Since you will need to
create not only English narrative but also code and figures, here are the requirements:
Maximum of 750 words (bibliography, if included, does not contribute to the word count)
Up to three maps or figures (a figure may include more than one map and will only count as one but needs to be
integrated in the same matplotlib figure)
Up to one table
The assignment relies on two datasets provided below, and has two parts. Each of these pieces is explained in more detail
below.
Data
To complete the assignment, the following two datasets are provided. Below we show how you can download them and what
they contain.
This dataset contains a set of polygons representing the official neighbourhoods, as well as socio-economic information
attached to each neighbourhood.
neis = geopandas.read_file("https://darribas.org/gds_course/_downloads/a2bdb4c2a088e602c3bd649
neis.info()
You can find more information on each of the socio-economic variables in the variable list file:
pandas.read_csv("https://darribas.org/gds_course/_downloads/8944151f1b7df7b1f38b79b7a73eb2d0/t
All variables below are drawn from the Census Profile (catalogue 98-316-X2016001).

    ID    Variable              Topic       Characteristic                          Description
0   3     population2016        Population  Population and dwellings                Population, 2016
1   8     population_sqkm       Population  Population and dwellings                Population density per square kilometre
2   10    pop_0-14_yearsold     Population  Age characteristics                     Children (0-14 years)
3   11    pop_15-24_yearsold    Population  Age characteristics                     Youth (15-24 years)
4   12    pop_25-54_yearsold    Population  Age characteristics                     Working Age (25-54 years)
5   13    pop_55-64_yearsold    Population  Age characteristics                     Pre-retirement (55-64 years)
6   14    pop_65+_yearsold      Population  Age characteristics                     Seniors (65+ years)
7   15    pop_85+_yearsold      Population  Age characteristics                     Older Seniors (85+ years)
8   1018  hh_median_income2015  Income      Income of households in 2015            Total - Income statistics in 2015 for private ...
10  1711  deg_bachelor          Education   Highest certificate, diploma or degree  Bachelor's degree
11  1713  deg_medics            Education   Highest certificate, diploma or degree  Degree in medicine, dentistry, veterinar...
12  1714  deg_phd               Education   Highest certificate, diploma or degree  Earned doctorate
13  1887  employed              Labour      Labour force status                     Employed
14  1636  bedrooms_0            Housing     Household characteristics               No bedrooms
15  1637  bedrooms_1            Housing     Household characteristics               1 bedroom
16  1638  bedrooms_2            Housing     Household characteristics               2 bedrooms
17  1639  bedrooms_3            Housing     Household characteristics               3 bedrooms
18  1641  bedrooms_4+           Housing     Household characteristics               4 or more bedrooms
This is a similar dataset to the Tokyo photographs we use in Block H but for the city of Toronto. It is a subsample of the 100
million Yahoo dataset that contains the location of photographs contributed to the Flickr service by its users. You can read it
with:
photos = pandas.read_csv("https://darribas.org/gds_course/_downloads/fc771c3b1b9e0ee00e875bb2d
photos.info()
IMPORTANT - Students of ENVS563 will need to source, at least, two additional datasets relating to Toronto. You can use
any dataset that will help you complete the tasks below but, if you need some inspiration, have a look at the Toronto Open
Data Portal:
https://open.toronto.ca/
Part I - Common
This is the one everyone has to do in the same way. Complete the following tasks:
1. Create a geodemographic classification and interpret the results. In the process, answer the following questions:
What are the main types of neighborhoods you identify?
Which characteristics help you delineate this typology?
If you had to use this classification to target areas in most need, how would you use it? Why?
2. Create a regionalisation and interpret the results. In the process, answer at least the following questions:
How is the city partitioned by your data?
What do you learn about the geography of the city from the regionalisation?
What would be one useful application of this regionalisation in the context of urban policy?
3. Using the photographs, complete the following tasks:
Visualise the dataset appropriately and discuss why you have taken your specific approach
Use DBSCAN to identify areas of the city with high density of photographs, which we will call areas of interest
(AOI). In completing this, answer the following questions:
What parameters have you used to run DBSCAN? Why?
What do the clusters help you learn about areas of interest in the city?
Name one example of how these AOIs can be of use for the city. You can take the perspective of an urban
planner, a policy maker, an operational practitioner (e.g. police, trash collection), an urban entrepreneur, or any
other role you envision.
Marking criteria
This course follows the standard marking criteria (the general ones and those relating to GIS assignments in particular) set by
the School of Environmental Sciences. In addition to these generic criteria, the following specific criteria relating to the code
provided will be used:
0-15: the code does not run and there is no documentation to follow it.
16-39: the code does not run, or runs but it does not produce the expected outcome. There is some documentation
explaining its logic.
40-49: the code runs and produces the expected output. There is some documentation explaining its logic.
50-59: the code runs and produces the expected output. There is extensive documentation explaining its logic.
60-69: the code runs and produces the expected output. There is extensive documentation, properly formatted, explaining
its logic.
70-79: all as above, plus the code design includes clear evidence of skills presented in advanced sections of the course
(e.g. custom methods, list comprehensions, etc.).
80-100: all as above, plus the code contains novel contributions that extend/improve the functionality the student was
provided with (e.g. algorithm optimizations, novel methods to perform the task, etc.).
Datasets
This course uses a wide range of datasets. Many are sourced from different projects (such as the GDS Book); but some have
been pre-processed and are stored as part of the materials for this course. For the latter group, the code used for processing,
from the original state and source of the files to the final product used in this course, is made available.
These pages are NOT part of the course syllabus. They are provided for transparency and to facilitate reproducibility
of the project. Consequently, each notebook does not have the depth, detail and pedagogy of the core materials in the
course.
The degree of sophistication of the code in these notebooks is at times above what is expected for a student of this
course. Consequently, if you peek into these notebooks, some parts might not make much sense or seem too
complicated. Do not worry, it is not part of the course content.
Brexit vote
Dar Es Salaam
Liverpool LSOAs
London AirBnb
Tokyo administrative boundaries
Bibliography
Concepts
Here’s where it all starts. In this section we introduce the course and what Geographic Data Science is. We top it up with a
few (optional) further readings for the interested and curious mind.
This course
Let us start from the beginning, here is a snapshot of what this course is about! In the following clip, you will find out about
the philosophy behind the course, how the content is structured, and why this is all designed like this. And, also, a little bit
about the assessment…
Slides
[HTML]
[PDF]
The video also mentions a clip about the digitalisation of our activities. If you want to watch it outside the slides, expand
below.
Slides
[HTML]
[PDF]
Further readings
To get a better picture, the following readings complement the overview provided above very well:
1. The introductory chapter to “Doing Data Science” schutt2013doing, by Cathy O’Neil and Rachel Schutt, is a general
overview of why we needed Data Science and where it came from.
2. A slightly more technical historical perspective on where Data Science came from and where it might go can be found in
David Donoho’s recent overview donoho201750.
3. A geographic take on Data Science, proposing more interaction between Geography and Data Science
singleton2019geographic.
Hands-on
Let’s get our hands to the keyboard! In this first “hands-on” session, we will learn about the tools and technologies we will
use to complete this course.
Software infrastructure
The main tool we will use for this course is JupyterLab so, to be able to follow this course interactively and successfully
complete it, you will need to be able to run it.
Tip
Jupyter Lab
Once you have access to an installation of Jupyter Lab, watch the following video, which provides an overview of the tool
and its main components:
Jupyter Notebooks
The main vehicle we will use to access, edit and create content in this course are Jupyter Notebooks. Watch this video for a
tour into their main features and how to master their use:
As mentioned in the clip, notebook cells can contain code or text. You will write plenty of code cells along the way but, to
get familiar with Markdown and how you can style up your text cells, start by reading this resource:
https://guides.github.com/features/mastering-markdown/
Do-It-Yourself
https://en.wikipedia.org/wiki/Chocolate_chip_cookie_dough_ice_cream
History
Tip
Do not overthink it. Focus on getting the bold, italics, links, headlines and lists correctly formatted, but don’t worry
too much about the overall layout. Bonus if you manage to insert the image as well (it does not need to be properly
placed as in the original page)!
Concepts
The ideas behind this block are better communicated through narrative than video or lectures. Hence, the concepts section is
delivered through a few references you are expected to read. These will take up about one and a half hours of your focused
time.
Open Science
The first part of this block is about setting the philosophical background. Why do we care about the processes and tools we
use when we do computational work? Where does the current paradigm come from? Are we on the verge of a new model? For
all of this, we have two reads to set the tone. Make sure to get those in first thing before moving on to the next bits.
First half of Chapter 1 in “Geographic Data Science with PySAL and the PyData stack” reyABwolf.
The 2018 Atlantic piece “The scientific paper is obsolete” on computational notebooks, by James Somers
somers2018scientific.
Second half of Chapter 1 in “Geographic Data Science with PySAL and the PyData stack” reyABwolf.
The chapter in the GIS&T Book of Knowledge on computational notebooks, by Geoff Boeing and Dani Arribas-Bel.
Hands-on
Once we know a bit about what computational notebooks are and why we should care about them, let’s jump to using them!
This section introduces you to using Python for manipulating tabular data. Please read through it carefully and pay attention
to how ideas about manipulating data are translated into Python code that “does stuff”. For this part, you can read directly
from the course website, although it is recommended you follow the section interactively by running code on your own.
Once you have read through and have a bit of a sense of how things work, jump on the Do-It-Yourself section, which will
provide you with a challenge to complete on your own, and will allow you to put what you have already learnt to good use.
Happy hacking!
Data munging
Real world datasets are messy. There is no way around it: datasets have “holes” (missing data), the amount of formats in
which data can be stored is endless, and the best structure to share data is not always the optimum to analyze them, hence the
need to munge them. As has been correctly pointed out in many outlets (e.g.), much of the time spent in what is called
(Geo-)Data Science is related not only to sophisticated modeling and insight, but has to do with much more basic and less
exotic tasks such as obtaining data, processing them, turning them into a shape that makes analysis possible, and exploring
them to get to know their basic properties.
For how labor intensive and relevant this aspect is, there is surprisingly very little published on patterns, techniques, and best
practices for quick and efficient data cleaning, manipulation, and transformation. In this session, you will use a few real
world datasets and learn how to process them in Python so they can be transformed and manipulated, if necessary, and
analyzed. For this, we will introduce some of the bread and butter of data analysis and scientific computing in Python. These
are fundamental tools that are constantly used in almost any task relating to data analysis.
This notebook covers the basics and the content that is expected to be learnt by every student. We use a prepared dataset that
saves us much of the more intricate processing that goes beyond the introductory level the session is aimed at. As a
companion to this introduction, there is an additional notebook (see link on the website page for Lab 01) that covers how the
dataset used here was prepared from raw data downloaded from the internet, and includes some additional exercises you can
do if you want to dig deeper into the content of this lab.
In this notebook, we discuss several patterns to clean and structure data properly, including tidying, subsetting, and
aggregating; and we finish with some basic visualization. An additional extension presents more advanced tricks to
manipulate tabular data.
Before we get our hands data-dirty, let us import all the additional libraries we will need, so we can get that out of the way
and focus on the task at hand:
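The import cell itself is not shown on this page; a minimal version, assuming the aliases used throughout this session ( pd for pandas and sns for seaborn ), would be:

# Import the libraries used in this session with their usual aliases
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt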
Dataset
We will be exploring some demographic characteristics in Liverpool. To do that, we will use a dataset that contains
population counts, split by ethnic origin. These counts are aggregated at the Lower Layer Super Output Area (LSOA from
now on). LSOAs are an official Census geography defined by the Office for National Statistics. You can think of them, more
or less, as neighbourhoods. Many data products (Census, deprivation indices, etc.) use LSOAs as one of their main
geographies.
To make things easier, we will read data from a file posted online so, for now, you do not need to download any dataset:
# Read table
db = pd.read_csv("https://darribas.org/gds_course/content/data/liv_pop.csv",
index_col='GeographyCode')
Let us stop for a minute to learn how we have read the file. Here are the main aspects to keep in mind:
We are using the method read_csv from the pandas library, which we have imported with the alias pd .
In this form, all that is required is to pass the path to the file we want to read, which in this case is a web address.
Alternative
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and
read it locally. To do that, you can follow these steps:
1. Download the file by right-clicking on this link and saving the file
2. Place the file on the same folder as the notebook where you intend to read it
3. Replace the code in the cell above by:
db = pd.read_csv("liv_pop.csv", index_col="GeographyCode")
Structure
The first aspect worth spending a bit of time is the structure of a DataFrame . We can print it by simply typing its name:
db
               Europe  Africa  Middle East and Asia  The Americas and the Caribbean  Antarctica and Oceania
GeographyCode
E01006518        1531      69                    73                              19                       4
E01033764        2106      32                    49                              15                       0
E01033765        1277      21                    33                              17                       3
E01033766        1028      12                    20                               8                       7
E01033767        1003      29                    29                               5                       1
Note the printing is cut to keep a nice and compact view, but enough to see its structure. Since they represent a table of data,
DataFrame objects have two dimensions: rows and columns. Each of these is automatically assigned a name in what we
will call its index. When printing, the index of each dimension is rendered in bold, as opposed to the standard rendering for
the content. In the example above, we can see how the column index is automatically picked up from the .csv file’s column
names. For rows, we have specified when reading the file that we wanted the column GeographyCode , so that is used. If we
hadn’t specified any, pandas would automatically generate a sequence starting at 0 and going all the way to the number of
rows minus one. This is the standard structure of a DataFrame object, so we will come to it over and over. Importantly,
even when we move to spatial data, our datasets will have a similar structure.
One final feature that is worth mentioning about these tables is that they can hold columns with different types of data. In our
example, this is not used as we have counts (or int , for integer, types) for each column. But it is useful to keep in mind we
can combine this with columns that hold other types of data such as categories, text ( str , for string), dates or, as we will see
later in the course, geographic features.
Inspect
Inspecting what it looks like. We can check the top (bottom) X lines of the table by passing X to the method head ( tail ).
For example, for the top/bottom five lines:
db.head()

               Europe  Africa  Middle East and Asia  The Americas and the Caribbean  Antarctica and Oceania
GeographyCode
E01006518        1531      69                    73                              19                       4

db.tail()

               Europe  Africa  Middle East and Asia  The Americas and the Caribbean  Antarctica and Oceania
GeographyCode
E01033764        2106      32                    49                              15                       0
E01033765        1277      21                    33                              17                       3
E01033766        1028      12                    20                               8                       7
E01033767        1003      29                    29                               5                       1
db.info()
<class 'pandas.core.frame.DataFrame'>
Index: 298 entries, E01006512 to E01033768
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Europe 298 non-null int64
1 Africa 298 non-null int64
2 Middle East and Asia 298 non-null int64
3 The Americas and the Caribbean 298 non-null int64
4 Antarctica and Oceania 298 non-null int64
dtypes: int64(5)
memory usage: 14.0+ KB
Summarise
Or we can get a summary of the values in the table:
db.describe()
Note how the output is also a DataFrame object, so you can do with it the same things you would with the original table
(e.g. writing it to a file).
In this case, the summary might be better presented if the table is “transposed”:
db.describe().T
                                count       mean         std  min    25%   50%    75%    max
Middle East and Asia            298.0  62.909396  102.519614  1.0  16.00  33.5  62.75  840.0
The Americas and the Caribbean  298.0   8.087248    9.397638  0.0   2.00   5.0  10.00   61.0
Antarctica and Oceania          298.0   1.949664    2.168216  0.0   0.00   1.0   3.00   11.0
731
Note here how we have restricted the calculation of the maximum value to one column only.
457.8842648530303
# Longer, hardcoded
total = db['Europe'] + db['Africa'] + db['Middle East and Asia'] + \
db['The Americas and the Caribbean'] + db['Antarctica and Oceania']
# Print the top of the variable
total.head()
GeographyCode
E01006512 1880
E01006513 2941
E01006514 2108
E01006515 1208
E01006518 1696
dtype: int64
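The shorter version of the same calculation, referenced in the note further below, is not shown on this page; a minimal sketch is:

# Shorter: sum across columns (axis=1) to get the total per row (area)
total = db.sum(axis=1)
total.head()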
GeographyCode
E01006512 1880
E01006513 2941
E01006514 2108
E01006515 1208
E01006518 1696
dtype: int64
Note how we are using the command sum , just like we did with max or min before but, in this case, we are not applying it
over columns (e.g. the max of each column), but over rows, so we get the total sum of populations by areas.
Once we have created the variable, we can make it part of the table:
db['Total'] = total
db.head()
GeographyCode
A different spin on this is assigning new values: we can generate new variables with scalars, and modify those:
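The cell that creates the new column is not shown on this page; a minimal sketch, assuming the column is called ones and filled with the scalar 1, is:

# Create a new column filled with a scalar value
db['ones'] = 1
db.head()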
GeographyCode
db.loc['E01006512', 'ones'] = 3
db.head()
GeographyCode
Delete columns
Permanently deleting variables is also within reach of one command:
del db['ones']
db.head()
GeographyCode
Index-based queries
Here we explore how we can subset parts of a DataFrame if we know exactly which bits we want. For example, if we want
to extract the total and European population of the first four areas in the table, we use loc with lists:
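The query itself is not shown on this page; a sketch using loc with two lists (the row IDs below are the first four areas in the table) is:

# Subset rows and columns by passing lists of index labels to `loc`
db.loc[
    ['E01006512', 'E01006513', 'E01006514', 'E01006515'],
    ['Total', 'Europe']
]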
Total Europe
GeographyCode
Note that we use square brackets ( [] ) to delineate the index of the items we want to subset. In Python, this sequence of
items is called a list. Hence we can see how we can create a list with the names (index IDs) along each of the two dimensions
of a DataFrame (rows and columns), and loc will return a subset of the original table only with the elements queried for.
An alternative to list-based queries is what is called “range-based” queries. These work on the indices of the table but, instead
of requiring the ID of each item we want to retrieve, they operate by requiring only two IDs: the first and last element in a
range of items. Range queries are expressed with a colon ( : ). For example:
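The call is not shown on this page; a sketch (the exact IDs in the original example may differ) is:

# Range-based query: every row between the two labels, inclusive
db.loc['E01006512':'E01006518', :]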
GeographyCode
E01006518 1531 69 73 19 4
We see how the range query picks up all the elements in between the two IDs specified. Note that, for this to work, the first
ID in the range needs to be placed before the second one in the table’s index.
Once we know about list and range based queries, we can combine them both! For example, we can specify a range of rows
and a list of columns:
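A sketch of such a combined query (the IDs are chosen as an example):

# Rows selected by range, columns selected by list
db.loc['E01006514':'E01006518', ['Europe', 'Total']]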
Europe Total
GeographyCode
Condition-based queries
However, sometimes, we do not know exactly which observations we want, but we do know what conditions they need to
satisfy (e.g. areas with more than 2,000 inhabitants). For these cases, DataFrames support selection based on conditions.
Let us see a few examples. Suppose we want to select…
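The examples themselves are not shown on this page; a minimal sketch, using the 2,000-inhabitant threshold mentioned above, could be:

# Condition-based query with `loc`: areas with more than 2,000 inhabitants
db.loc[db['Total'] > 2000, :]

# The same selection expressed with the `query` method
db.query('Total > 2000')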
Pro-tip: these queries can grow in sophistication with almost no limits. For example, here is a case where we want to find out
the areas where European population is less than half the population:
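A sketch of such a query, comparing two columns inside the condition:

# Areas where the European population is less than half of the total
db.loc[db['Europe'] < (db['Total'] / 2), :]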
Note that, in these cases, using query results in code that is much more streamlined and easier to read. However, query is
not perfect and, particularly for more sophisticated queries, it does not afford the same degree of flexibility. For example, the
last query we had using loc would not be possible using query .
Combining queries
Now all of these queries can be combined with each other, for further flexibility. For example, imagine we want areas with
more than 25 people from the Americas and Caribbean, but less than 1,500 in total:
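A sketch of the combined query (the two conditions are joined with & , the element-wise AND):

# More than 25 people from the Americas and Caribbean AND fewer than 1,500 in total
db.loc[
    (db['The Americas and the Caribbean'] > 25) & (db['Total'] < 1500), :
]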
If you inspect the help of db.sort_values , you will find that you can pass more than one column to sort the table by. This
allows you to do so-called hierarchical sorting: sort first based on one column, if equal then based on another column, etc.
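As an illustration (the columns chosen here are an example):

# Sort by Europe first and break ties with Total, both in descending order
db.sort_values(by=['Europe', 'Total'], ascending=False)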
Visual exploration
The next step to continue exploring a dataset is to get a feel for what it looks like, visually. We have already learnt how to
uncover and inspect specific parts of the data, to check for particular cases we might be interested in. Now we will see how to
plot the data to get a sense of the overall distribution of values. For that, we will be using the Python library seaborn .
Histograms.
One of the most common graphical devices to display the distribution of values in a variable is a histogram. Values are
assigned into groups of equal intervals, and the groups are plotted as bars rising as high as the number of values in the
group.
A histogram is easily created with the following command. In this case, let us have a look at the shape of the overall
population:
_ = sns.distplot(db['Total'], kde=False)
We can quickly see most of the areas contain somewhere between 1,200 and 1,700 people, approx. However, there are a few
areas that have many more, even up to 3,500 people.
An additional feature to visualize the density of values is called rug , and adds a little tick for each value on the horizontal
axis:
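The call is not shown on this page; a sketch adding a rug to the histogram above is:

# Histogram with a rug of individual observations along the horizontal axis
_ = sns.distplot(db['Total'], kde=False, rug=True)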
Histograms are useful, but they are artificial in the sense that a continuous variable is made discrete by turning the values into
discrete groups. An alternative is kernel density estimation (KDE), which produces an empirical density function:
_ = sns.kdeplot(db['Total'], shade=True)
Another very common way of visually displaying a variable is with a line or a bar chart. For example, if we want to generate
a line plot of the (sorted) total population by area:
_ = db['Total'].sort_values(ascending=False).plot()
For a bar plot all we need to do is to change from plot to plot.bar . Since there are many neighbourhoods, let us plot
only the ten largest ones (which we can retrieve with head ):
_ = db['Total'].sort_values(ascending=False)\
.head(10)\
.plot.bar()
_ = db['Total'].sort_values()\
.head(50)\
.plot.barh(figsize=(6, 20))
This section is a bit more advanced and hence considered optional. Feel free to skip it, move to the next, and return
later when you feel more confident.
Happy families are all alike; every unhappy family is unhappy in its own way.
Leo Tolstoy.
Once you can read your data in, explore specific cases, and have a first visual approach to the entire set, the next step can be
preparing it for more sophisticated analysis. Maybe you are thinking of modeling it through regression, or on creating
subgroups in the dataset with particular characteristics, or maybe you simply need to present summary measures that relate to
a slightly different arrangement of the data than you have been presented with.
For all these cases, you first need what statistician, and general R wizard, Hadley Wickham calls “tidy data”. The general
idea to “tidy” your data is to convert them from whatever structure they were handed in to you into one that allows
convenient and standardized manipulation, and that supports directly inputting the data into what he calls “tidy” analysis
tools. But, at a more practical level, what exactly is “tidy data”? In Wickham’s own words:
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending
on how rows, columns and tables are matched up with observations, variables and types.
If you are further interested in the concept of “tidy data”, I recommend you check out the original paper (open access) and
the public repository associated with it.
Let us bring in the concept of “tidy data” to our own Liverpool dataset. First, remember its structure:
db.head()
GeographyCode
Thinking through tidy lenses, this is not a tidy dataset. It is not so for each of the three conditions:
Starting with the last one (each type of observational unit forms a table), this dataset actually contains not one but two
observational units: the different areas of Liverpool, captured by GeographyCode ; and subgroups of an area. To tidy up
this aspect, we can create two different tables:
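The code that builds the first of the two tables is not shown on this page; a minimal sketch is:

# Table with a single observational unit: the total population per area
db_totals = db[['Total']]
db_totals.head()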
Total
GeographyCode
E01006512 1880
E01006513 2941
E01006514 2108
E01006515 1208
E01006518 1696
# Create a table `db_subgroups` that contains every column in `db` without `Total`
db_subgroups = db.drop('Total', axis=1)
db_subgroups.head()
GeographyCode
E01006518 1531 69 73 19 4
Note we use drop to exclude “Total”, but we could also use a list with the names of all the columns to keep. Additionally,
notice how, in this case, the use of drop (which leaves db untouched) is preferred to that of del (which permanently
removes the column from db ).
At this point, the table db_totals is tidy: every row is an observation, every column is a variable, and there is only one
observational unit in the table.
The other table ( db_subgroups ), however, is not entirely tidied up yet: there is only one observational unit in the table,
true; but every row is not an observation, and there are variable values as the names of columns (in other words, every
column is not a variable). To obtain a fully tidy version of the table, we need to re-arrange it in a way that every row is a
population subgroup in an area, and there are three variables: GeographyCode , population subgroup, and population count
(or frequency).
Because this is actually a fairly common pattern, there is a direct way to solve it in pandas :
tidy_subgroups = db_subgroups.stack()
tidy_subgroups.head()
GeographyCode
E01006512 Europe 910
Africa 106
Middle East and Asia 840
The Americas and the Caribbean 24
Antarctica and Oceania 0
dtype: int64
The method stack , well, “stacks” the different columns into rows. This fixes our “tidiness” problems but the type of object
it returns is not a DataFrame :
type(tidy_subgroups)
It is a Series , which really is like a DataFrame , but with only one column. The additional information
( GeographyCode and population group) is stored in what is called a multi-index. We will skip these for now, as we
would really just want to get a DataFrame as we know it out of the Series . This is also one line of code away:
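A minimal sketch of that line is:

# Move the multi-index back into regular columns
tidy_subgroupsDF = tidy_subgroups.reset_index()
tidy_subgroupsDF.head()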
The result is a table with three columns: GeographyCode , the population subgroup (automatically named level_1 ), and the population count (automatically named 0 ).
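Before grouping, it helps to give those columns meaningful names; a minimal sketch, assuming the Subgroup and Freq names used in the examples below, is:

# Rename the automatically generated columns
tidy_subgroupsDF = tidy_subgroupsDF.rename(
    columns={'level_1': 'Subgroup', 0: 'Freq'}
)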
To do this in pandas , meet one of its workhorses, and also one of the reasons why the library has become so popular: the
groupby operator.
pop_grouped = tidy_subgroupsDF.groupby('Subgroup')
pop_grouped
The object pop_grouped still hasn’t computed anything, it is only a convenient way of specifying the grouping. But this
allows us then to perform a multitude of operations on it. For our example, the sum is calculated as follows:
pop_grouped.sum()
Freq
Subgroup
Africa 8886
Europe 435790
pop_grouped.describe()
                                count       mean         std  min    25%   50%    75%    max
Subgroup
Antarctica and Oceania          298.0   1.949664    2.168216  0.0   0.00   1.0   3.00   11.0
Middle East and Asia            298.0  62.909396  102.519614  1.0  16.00  33.5  62.75  840.0
The Americas and the Caribbean  298.0   8.087248    9.397638  0.0   2.00   5.0  10.00   61.0
We will not get into it today as it goes beyond the basics we want to cover, but keep in mind that groupby allows you to
not only call generic functions (like sum or describe ), but also your own functions. This opens the door for virtually any
kind of transformation and aggregation possible.
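As an illustration of passing your own function (this example is an assumption, not part of the original lab):

# Share of each subgroup's population concentrated in its single largest area
pop_grouped.apply(lambda g: g['Freq'].max() / g['Freq'].sum())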
This NY Times article does a good job at conveying the relevance of data “cleaning” and munging.
A good introduction to data manipulation in Python is Wes McKinney’s “Python for Data Analysis”
mckinney2012python.
To explore further some of the visualization capabilities at your fingertips, the Python library seaborn is an excellent
choice. Its online tutorial is a fantastic place to start.
A good extension is Hadley Wickham’s “Tidy data” paper Wickham:2014:JSSOBK:v59i10, which presents a very popular
way of organising tabular data for efficient manipulation.
Do-It-Yourself
import pandas
This section is all about you taking charge of the steering wheel and choosing your own adventure. For this block, we are
going to use what we’ve learnt before to take a look at a dataset of casualties in the war in Afghanistan. The data was
originally released by Wikileaks, and the version we will use is published by The Guardian.
The data are published on a Google Sheet you can check out at:
https://docs.google.com/spreadsheets/d/1EAx8_ksSCmoWW_SlhFyq2QrRn0FNNhcg1TtDFJzZRgc/edit?hl=en#gid=1
As you will see, each row includes casualties recorded month by month, split by Taliban, Civilians, Afghan forces, and
NATO.
To read it into a Python session, we need to slightly modify the URL so that it points to a CSV export of the sheet:
url = ("https://docs.google.com/spreadsheets/d/"\
"1EAx8_ksSCmoWW_SlhFyq2QrRn0FNNhcg1TtDFJzZRgc/"\
"export?format=csv&gid=1")
url
'https://docs.google.com/spreadsheets/d/1EAx8_ksSCmoWW_SlhFyq2QrRn0FNNhcg1TtDFJzZRgc/export?fo
Note how we split the url into three lines so it is more readable in narrow screens. The result however, stored in url , is the
same as one long string.
This allows us to read the data straight into a DataFrame, as we have done in the previous session:
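The reading call itself is not shown on this page; a sketch reproducing the arguments described in the note below is:

# Read the Google Sheet export, skipping the rows described below
db = pandas.read_csv(url, skiprows=[0, -1])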
Note also we use the skiprows=[0, -1] to avoid reading the top ( 0 ) and bottom ( -1 ) rows which, if you check on the
Google Sheet, involves the title of the table.
db.head()
Year Month Taliban Civilians Afghan forces Nato (detailed in spreadsheet) Nato - official figures
Obtain the minimum number of civilian casualties (in what month was that?)
How many NATO casualties were registered in August 2008?
Concepts
This block explores spatial data, old and new. We start with an overview of traditional datasets, discussing their benefits and
challenges for social scientists; then we move on to new forms of data, and how they pose different challenges, but also
exciting opportunities. These two areas are covered with clips and slides that can be complemented with readings. Once
conceptual areas are covered, we jump into working with spatial data in Python, which will prepare you for your own
adventure in exploring spatial data.
Before you jump on the clip, please watch the following video by the US Census Bureau, which will be discussed:
Slides
[HTML]
[PDF]
[HTML]
[PDF]
Although both papers are discussed in the clip, if you are interested in the ideas mentioned, do go to the original sources as
they provide much more detail and nuance.
Hands-on
Mapping in Python
import os
os.environ['USE_PYGEOS'] = '0'
import geopandas
In this lab, we will learn how to load, manipulate and visualize spatial data. In some senses, spatial data are usually included
simply as “one more column” in a table. However, spatial is special sometimes and there are a few aspects in which
geographic data differ from standard numerical tables. In this session, we will extend the skills developed in the previous one
about non-spatial data, and combine them. In the process, we will discover that, although with some particularities, dealing
with spatial data in Python largely resembles dealing with non-spatial data.
Datasets
To learn these concepts, we will be playing with three main datasets. Same as in the previous block, these datasets can be
loaded dynamically from the web, or you can download them manually, keep a copy on your computer, and load them from
there.
Important
Make sure you are connected to the internet when you run these cells as they need to access data hosted online
Cities
First we will use a polygon geography. We will use an open dataset that contains the boundaries of Spanish cities. We can
read it into an object named cities by:
cities = geopandas.read_file("https://ndownloader.figshare.com/files/20232174")
Note the code cell above requires internet connectivity. If you are not online but have a full copy of the GDS course in your
computer (downloaded as suggested in the infrastructure page), you can read the data with the following line of code:
cities = geopandas.read_file("../data/web_cache/cities.gpkg")
Streets
In addition to polygons, we will play with a line layer. For that, we are going to use a subset of street network from the
Spanish city of Madrid.
url = (
"https://github.com/geochicasosm/lascallesdelasmujeres"
"/raw/master/data/madrid/final_tile.geojson"
)
url
'https://github.com/geochicasosm/lascallesdelasmujeres/raw/master/data/madrid/final_tile.geojso
streets = geopandas.read_file(url)
Note the code cell above requires internet connectivity. If you are not online but have a full copy of the GDS course in your
computer (downloaded as suggested in the infrastructure page), you can read the data with the following line of code:
streets = geopandas.read_file("../data/web_cache/streets.gpkg")
Bars
The final dataset we will rely on is a set of points demarcating the location of bars in Madrid. To obtain it, we will use
osmnx , a Python library that allows us to query OpenStreetMap. Note that we use the method geometries_from_place , which
queries for points of interest (POIs, or pois ) in a particular place (Madrid in this case). In addition, we can specify a set of
tags to delimit the query. We use this to ask only for amenities of the type “bar”:
import osmnx

pois = osmnx.geometries_from_place(
    "Madrid, Spain", tags={"amenity": "bar"}
)
If you are using an old version of osmnx (<1.0), replace the code in the cell above with:
pois = osmnx.pois.pois_from_place(
    "Madrid, Spain", tags={"amenity": "bar"}
)
You can check the version you are using with the following snippet:
osmnx.__version__
You do not need to know at this point what happens behind the scenes when we run geometries_from_place but, if you
are curious, we are making a query to OpenStreetMap (almost as if you typed “bars in Madrid, Spain” within Google Maps)
and getting the response as a table of data, instead of as a website with an interactive map. Pretty cool, huh?
Note the code cell above requires internet connectivity. If you are not online but have a full copy of the GDS course in your
computer (downloaded as suggested in the infrastructure page), you can read the data with the following line of code:
pois = geopandas.read_parquet("../data/web_cache/pois_bars_madrid.parquet")
In two lines of code, we will obtain a graphical representation of the spatial data contained in a file that can be in many
formats; actually, since it uses the same drivers under the hood, you can load pretty much the same kind of vector files that
Desktop GIS packages like QGIS permit. Let us start by plotting single layers in a crude but quick form, and we will build
style and sophistication into our plots later on.
Polygons
Now cities is a GeoDataFrame . Very similar to a traditional, non-spatial DataFrame , but with an additional column
called geometry :
cities.plot()
<Axes: >
This might not be the most aesthetically pleasing visual representation of the cities geography, but it is hard to argue it is
not quick to produce. We will work on styling and customizing spatial plots later on.
Pro-tip: if you call a single row of the geometry column, it’ll return a small plot with the shape:
cities.loc[0, 'geometry']
streets.loc[0, 'geometry']
streets.plot()
Again, this is not the prettiest way to display the streets maybe, and you might want to change a few parameters such as
colors, etc. All of this is possible, as we will see below, but this gives us a quick check of what lines look like.
Points
Points take a similar approach for quick plotting:
pois.plot()
<Axes: >
NOTE: some of these variations are very straightforward while others are more intricate and require tinkering with the
internal parts of a plot. They are not necessarily organized by increasing level of complexity.
Changing transparency
The intensity of color of a polygon can be easily changed through the alpha attribute in plot. This is specified as a value
between zero and one, where the former is entirely transparent while the latter is fully opaque (maximum intensity):
pois.plot(alpha=0.1)
<Axes: >
Removing axes
Although in some cases the axes can be useful to obtain context, most of the time maps look and feel better without them.
Removing the axes involves wrapping the plot into a figure, which takes a few more lines of apparently useless code but,
in time, will allow you to tweak the map further and to create much more flexible designs (see the sketch after the list below):
1. We have first created a figure named f with one axis named ax by using the command plt.subplots (part of the
library matplotlib , which we have imported at the top of the notebook). Note how the method is returning two
elements and we can assign each of them to objects with different name ( f and ax ) by simply listing them at the front
of the line, separated by commas.
2. Second, we plot the geographies as before, but this time we tell the function that we want it to draw the polygons on the
axis we are passing, ax . This method returns the axis with the geographies in them, so we make sure to store it on an
object with the same name, ax .
3. On the third line, we effectively remove the box with coordinates.
4. Finally, we draw the entire plot by calling plt.show() .
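A minimal sketch of that pattern (assuming matplotlib has been imported as plt ):

# 1. Create a figure and an axis
f, ax = plt.subplots(1)
# 2. Draw the polygons on that axis
ax = cities.plot(ax=ax)
# 3. Remove the box with coordinates
ax.set_axis_off()
# 4. Render the entire plot
plt.show()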
Adding a title
Adding a title is an extra line, if we are creating the plot within a figure, as we just did. To include text on top of the figure:
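A sketch, building on the pattern above (the title text is an example):

# Same pattern as before, plus a title on the figure
f, ax = plt.subplots(1)
ax = cities.plot(ax=ax)
ax.set_axis_off()
f.suptitle('Spanish cities')
plt.show()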
Note how the lines are thicker. In addition, all the polygons are colored in the same (default) color, light red. However,
because the lines are thicker, we can only see the polygon filling for those cities with an area large enough.
Let us examine line by line what we are doing in the code snippet:
We begin by creating the figure ( f ) object and one axis inside it ( ax ) where we will plot the map.
Then, we call plot as usual, but pass in three new arguments: linewidth , for the width of the line; facecolor , to
control the color each polygon is filled with; and edgecolor , to control the color of the boundary. A sketch of this call is shown below.
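The specific values below are examples, not necessarily the ones used to produce the original figure:

# Thicker boundaries, a fill colour and a boundary colour
f, ax = plt.subplots(1)
cities.plot(
    ax=ax,
    linewidth=1,
    facecolor='red',
    edgecolor='black'
)
plt.show()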
This approach works very similarly with other geometries, such as lines. For example, if we wanted to plot the streets in red,
we would simply:
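A sketch of that call:

# Lines take a single `color` argument rather than face/edge colours
streets.plot(color='red')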
<Axes: >
Important, note that in the case of lines the parameter to control the color is simply color . This is because lines do not have
an area, so there is no need to distinguish between the main area ( facecolor ) and the border lines ( edgecolor ).
Transforming CRS
The coordinate reference system (CRS) is the way geographers and cartographers have to represent a three-dimensional
object, such as the round earth, on a two-dimensional plane, such as a piece of paper or a computer screen. If the source data
contain information on the CRS of the data, we can modify this in a GeoDataFrame . First let us check if we have the
information stored properly:
cities.crs
As we can see, there is information stored about the reference system: it is using the standard Spanish projection, which is
expressed in meters. There are also other less decipherable parameters but we do not need to worry about them right now.
If we want to modify this and “reproject” the polygons into a different CRS, the quickest way is to find the EPSG code online
(epsg.io is a good one, although there are others too). For example, if we wanted to transform the dataset into lat/lon
coordinates, we would use its EPSG code, 4326:
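A sketch of the reprojection (the name of the new object is an assumption):

# Reproject the polygons to lat/lon (WGS84)
cities_wgs84 = cities.to_crs(epsg=4326)
cities_wgs84.plot()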
The shape of the polygons is slightly different. Furthermore, note how the scale in which they are plotted differs.
For this section, let’s select only Madrid from the cities table and convert it to lat/lon so it’s aligned with the streets and
POIs layers:
Combining different layers on a single map boils down to adding each of them to the same axis in a sequential way, as if we
were literally overlaying one on top of the previous one. For example, let’s plot the boundary of Madrid and its bars:
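A sketch of the overlay (the madrid object stands for the boundary selected above, and the file name is an example):

# Draw each layer on the same axis, in sequence, and save the result
f, ax = plt.subplots(1)
madrid.plot(ax=ax, facecolor='none', edgecolor='black')
pois.plot(ax=ax, markersize=1, color='red')
plt.savefig('madrid_bars.png')
plt.show()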
<Axes: >
If you now check on the folder, you’ll find a png (image) file with the map.
The command plt.savefig contains a large number of options and additional parameters to tweak. Given the size of the
figure created is not very large, we can increase this with the argument dpi , which stands for “dots per inch” and it’s a
standard measure of resolution in images. For example, for a high quality image, we could use 500:
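For example, replacing the saving line in the sketch above with:

# Save the same figure at a higher resolution
plt.savefig('madrid_bars.png', dpi=500)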
GeoDataFrame s come with a whole range of traditional GIS operations built-in. Here we will run through a small subset of
them that contains some of the most commonly used ones.
Area calculation
One of the spatial aspects we often need from polygons is their area. “How big is it?” is a question that always haunts us
when we think of countries, regions, or cities. To obtain area measurements, first make sure your GeoDataFrame is
projected. If that is the case, you can calculate areas as follows:
city_areas = cities.area
city_areas.head()
This indicates that the area of the first city in our table takes up about 8,450,000 square metres. If we wanted to convert into
square kilometres, we can divide by 1,000,000:
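A sketch of the conversion:

# Convert square metres into square kilometres
areas_km2 = city_areas / 1000000
areas_km2.head()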
0 8.449666
1 9.121270
2 13.226528
3 68.081212
4 10.722843
dtype: float64
Length
Similarly, an equally common question with lines is their length. Also similarly, their computation is relatively
straightforward in Python, provided that our data are projected. Here we will perform the projection ( to_crs ) and the
calculation of the length at the same time:
street_length = streets.to_crs(epsg=25830).length
street_length.head()
0 120.776840
1 120.902920
2 396.494357
3 152.442895
4 101.392357
dtype: float64
Since the CRS we use ( EPSG:25830 ) is expressed in metres, we can tell the first street segment is about 121m.
Centroid calculation
Sometimes it is useful to summarize a polygon into a single point and, for that, a good candidate is its centroid (almost like a
spatial analogue of the average). The following command will return a GeoSeries (a single column with spatial data) with
the centroids of a polygon GeoDataFrame :
cents = cities.centroid
cents.head()
Note how cents is not an entire table but a single column, or a GeoSeries object. This means you can plot it directly, just
like a table:
cents.plot()
<Axes: >
But you don’t need to call a geometry column to inspect the spatial objects. In fact, if you do, it will return an error because
there is no geometry column: the object cents itself is the geometry.
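The next two calls check whether a point falls within a polygon (a point-in-polygon query). The objects they use are not defined on this page; a minimal, hypothetical setup (names and coordinates are assumptions) could be:

from shapely.geometry import Point

# One of the city polygons and two arbitrary points (in the same CRS, metres)
poly = cities.loc[0, 'geometry']
pt1 = Point(0, 0)
pt2 = Point(100000, 100000)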
poly.contains(pt1)
False
poly.contains(pt2)
False
Performing point-in-polygon in this way is instructive and useful for pedagogical reasons, but for cases with many points and
polygons, it is not particularly efficient. In these situations, it is much more advisable to perform them as a “spatial join”. If
you are interested in these, see the link provided below to learn more about them.
Buffers
Buffers are one of the classical GIS operations in which an area is drawn around a particular geometry, given a specific
radius. These are very useful, for instance, in combination with point-in-polygon operations to calculate accessibility,
catchment areas, etc.
For this example, we will use the bars table, but will project it to the same CRS as cities , so it is expressed in metres:
pois_projected = pois.to_crs(cities.crs)
pois_projected.crs
To create a buffer using geopandas , simply call the buffer method, passing in the radius. For example, to draw a 500m
buffer around every bar in Madrid:
buf = pois_projected.buffer(500)
buf.head()
f, ax = plt.subplots(1)
# Plot buffer
buf.plot(ax=ax, linewidth=0)
# Plot named places on top for reference
# [NOTE how we modify the dot size (`markersize`)
# and the color (`color`)]
pois_projected.plot(ax=ax, markersize=1, color='yellow')
We can begin by creating a map in the same way we would do normally, and then use the add_basemap command to, er,
add a basemap:
import contextily as cx

ax = cities.plot(color="black")
cx.add_basemap(ax, crs=cities.crs);
cities_wm = cities.to_crs(epsg=3857)
ax = cities_wm.plot(color="black")
cx.add_basemap(ax);
Note how the coordinates are different but, if we set it right, either approach aligns tiles and data nicely.
Now, contextily offers a lot of options in terms of the sources and providers you can use to create your basemaps. For
example, we can use satellite imagery instead:
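A sketch, using one of the imagery providers available through contextily (the provider choice is an example):

# Add satellite imagery as the basemap
ax = cities_wm.plot(color="black")
cx.add_basemap(ax, source=cx.providers.Esri.WorldImagery)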
Terrain maps
Interactive maps
Everything we have seen so far relates to static maps. These are useful for publication, to include in reports or to print.
However, modern web technologies afford much more flexibility to explore spatial data interactively.
We will use the state-of-the-art Leaflet integration into geopandas . This integration connects GeoDataFrame objects with
the popular web mapping library Leaflet.js. In this context, we will only show how you can take a GeoDataFrame into an
interactive map in one line of code:
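A minimal sketch of that one-liner:

# Explore the layer on an interactive Leaflet map
cities.explore()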
(An interactive Leaflet map of the layer is rendered here. Map data © OpenStreetMap contributors, under ODbL.)
Further resources
More advanced GIS operations are possible in geopandas and, in most cases, they are extensions of the same logic we have
used in this document. If you are thinking about taking the next step from here, the following two operations (and the
documentation provided) will give you the biggest “bang for the buck”:
Spatial joins
https://geopandas.org/mergingdata.html#spatial-joins
Spatial overlays
https://geopandas.org/set_operations.html
Do-It-Yourself
In this session, we will practice your skills in mapping with Python. Fire up a notebook you can edit interactively, and let’s do
this!
Data preparation
Polygons
For this section, you will have to push yourself out of the comfort zone when it comes to sourcing the data. As nice as it is to
be able to pull a dataset directly from the web at the stroke of a url address, most real-world cases are not that
straightforward. Instead, you usually have to download a dataset manually and store it locally on your computer before you can get
to work.
We are going to use data from the Consumer Data Research Centre (CDRC) about Liverpool, in particular an extract from the
Census. You can download a copy of the data at:
Important
You will need a username and password to download the data. Create it for free at:
https://data.cdrc.ac.uk/user/register
import geopandas
liv = geopandas.read_file("Census_Residential_Data_Pack_2011_E08000012/data/Census_Residential_
Lines
For a line layer, we are going to use a different bit of osmnx functionality that will allow us to extract all the highways:
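The query itself is not shown on this page; a sketch of the kind of call involved (the network type is an assumption) is:

# Download the cycleable street network for Liverpool as a graph
import osmnx

bikepaths = osmnx.graph_from_place("Liverpool, UK", network_type="bike")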
Note the code cell above requires internet connectivity. If you are not online but have a full copy of the GDS course in your
computer (downloaded as suggested in the infrastructure page), you can read the data with the following line of code:
bikepaths = osmnx.load_graphml("../data/web_cache/bikepaths_liverpool.graphml")
len(bikepaths)
23481
Points
For points, we will use an analogue of the POI layer we have used in the Lab: pubs in Liverpool, as recorded by
OpenStreetMap. We can make a similar query to retrieve the table:
pubs = osmnx.geometries_from_place(
"Liverpool, UK", tags={"amenity": "bar"}
)
If you are using an old version of osmnx (<1.0), replace the code in the cell above for:
pubs = osmnx.pois.pois_from_place(
"Liverpool, UK", tags={"amenity": "bar"}
)
You can check the version you are using with the following snippet:
osmnx.__version__
Note the code cell above requires internet connectivity. If you are not online but have a full copy of the GDS course in your
computer (downloaded as suggested in the infrastructure page), you can read the data with the following line of code:
pubs = geopandas.read_parquet("../data/web_cache/pois_bars_liverpool.parquet")
Tasks
Make a map of the Liverpool neighborhoods that includes the following characteristics:
Features a title
Does not include axes frame
It has a figure size of 10 by 11
Polygons are all in color "#525252" and 50% transparent
Boundary lines (“edges”) have a width of 0.3 and are of color "#B9EBE3"
Includes a basemap with the Stamen watercolor theme
Not all of the requirements above are equally hard to achieve. If you can get some but not all of them, that’s also
great! The point is you learn something every time you try.
Which group accounts for longer total street length in Zaragoza: men or women? By how much?
The suggestion is that you get to work right away. However, if this task seems too daunting, you can expand the tip below for
a bit of help.
1. You will need your spatial data projected, so they are expressed in metres, and the length calculation makes
sense. Check out the section on transforming the CRS, and use, for example EPSG:25830 as the target CRS.
2. Separate streets named after men from those named after women, perhaps in two objects ( men , women ) that
contain the streets for each group. This is a non-spatial query at its heart, so make sure to revisit that section on
the previous block.
3. Calculate the length of each street in each group. Refresh your memory of this in this section.
4. Create a total length by group by adding the lengths of each street. This is again a non-spatial operation (sum), so
make sure to re-read this part of Block B.
5. Compare the two and answer the questions.
Surprised by the solution? Perhaps not, but remember data analysis is not only about discovering the unexpected, but
about providing evidence of the things we “know” so we can build better arguments about actions.
Concepts
This block is all about Geovisualisation and displaying statistical information on maps. We start with an introduction on what
geovisualisation is; then we follow with the modifiable areal unit problem, a key concept to keep in mind when displaying
statistical information spatially; and we wrap up with tips to make awesome choropleths, thematic maps. Each section
contains a short clip and a set of slides, plus a few (optional) readings.
Geovisualisation
Geovisualisation is an area that underpins much of what we will discuss in this course. Often, we will be presenting the results
of more sophisticated analyses as maps. So getting the principles behind mapping right is critical. In this clip, we cover what
(geo)visualisation is and why it is important.
Slides
[HTML]
[PDF]
The Modifiable Areal Unit Problem (MAUP)
Slides
[HTML]
[PDF]
Choropleths
Choropleths are thematic maps and, these days, are everywhere. From elections, to economic inequality, to the distribution of
population density, there’s a choropleth for everyone. Although technically, it is easy to create choropleths, it is even easier to
make bad choropleths. Fortunately, there are a few principles that we can follow to create effective choropleths. Get them all
delivered right to the comfort of your happy place in the following clip and slides!
Slides
[HTML]
[PDF]
Further readings
The clip above contains a compressed version of the key principles behind successful choropleths. For a more comprehensive
coverage, please refer to:
Hands-on
Choropleths in Python
Important
This is an adapted version, with a bit less content and detail, of the chapter on choropleth mapping by Rey, Arribas-
Bel and Wolf (in progress). Check out the full chapter, available for free at:
https://geographicdata.science/book/notebooks/05_choropleth.html
In this session, we will build on all we have learnt so far about loading and manipulating (spatial) data and apply it to one of
the most commonly used forms of spatial analysis: choropleths. Remember these are maps that display the spatial distribution
of a variable encoded in a color scheme, also called palette. Although there are many ways in which you can convert the
values of a variable into a specific color, we will focus in this context only on a handful of them, in particular:
Unique values
Equal interval
Quantiles
Fisher-Jenks
Before all this mapping fun, let us get the importing of libraries and data loading out of the way:
%matplotlib inline
import geopandas
from pysal.lib import examples
import seaborn as sns
import pandas as pd
from pysal.viz import mapclassify
import numpy as np
import matplotlib.pyplot as plt
Data
To mirror the original chapter this section is based on, we will use the same dataset: the Mexico GDP per capita dataset,
which we can access as a PySAL example dataset.
Note
We can get a short explanation of the dataset through the explain method:
examples.explain("mexico")
mexico
======
Data used in Rey, S.J. and M.L. Sastre Gutierrez. (2010) "Interregional inequality dynamics in
mx = examples.load_example("mexico")
This will download the data and place it in your home directory. We can inspect the directory where it is stored:
mx.get_file_list()
['/opt/conda/lib/python3.8/site-packages/libpysal/examples/mexico/mexicojoin.shx',
'/opt/conda/lib/python3.8/site-packages/libpysal/examples/mexico/README.md',
'/opt/conda/lib/python3.8/site-packages/libpysal/examples/mexico/mexico.gal',
'/opt/conda/lib/python3.8/site-packages/libpysal/examples/mexico/mexico.csv',
'/opt/conda/lib/python3.8/site-packages/libpysal/examples/mexico/mexicojoin.shp',
'/opt/conda/lib/python3.8/site-packages/libpysal/examples/mexico/mexicojoin.prj',
'/opt/conda/lib/python3.8/site-packages/libpysal/examples/mexico/mexicojoin.dbf']
For this section, we will read the ESRI shapefile, which we can do by pointing geopandas.read_file to the .shp file.
The utility function get_path makes it a bit easier for us:
db = geopandas.read_file(examples.get_path("mexicojoin.shp"))
And, from now on, db is a table as we are used to so far in this course:
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 POLY_ID 32 non-null int64
1 AREA 32 non-null float64
2 CODE 32 non-null object
3 NAME 32 non-null object
4 PERIMETER 32 non-null float64
5 ACRES 32 non-null float64
6 HECTARES 32 non-null float64
7 PCGDP1940 32 non-null float64
8 PCGDP1950 32 non-null float64
9 PCGDP1960 32 non-null float64
10 PCGDP1970 32 non-null float64
11 PCGDP1980 32 non-null float64
12 PCGDP1990 32 non-null float64
13 PCGDP2000 32 non-null float64
14 HANSON03 32 non-null float64
15 HANSON98 32 non-null float64
16 ESQUIVEL99 32 non-null float64
17 INEGI 32 non-null float64
18 INEGI2 32 non-null float64
19 MAXP 32 non-null float64
20 GR4000 32 non-null float64
21 GR5000 32 non-null float64
22 GR6000 32 non-null float64
23 GR7000 32 non-null float64
24 GR8000 32 non-null float64
25 GR9000 32 non-null float64
26 LPCGDP40 32 non-null float64
27 LPCGDP50 32 non-null float64
28 LPCGDP60 32 non-null float64
29 LPCGDP70 32 non-null float64
30 LPCGDP80 32 non-null float64
31 LPCGDP90 32 non-null float64
32 LPCGDP00 32 non-null float64
33 TEST 32 non-null float64
34 geometry 32 non-null geometry
dtypes: float64(31), geometry(1), int64(1), object(2)
memory usage: 8.9+ KB
db.crs
To be able to add baselayers, we need to specify one. Looking at the details and the original reference, we find the data are
expressed in longitude and latitude, so the CRS we can use is EPSG:4326 . Let’s add it to db :
db.crs = "EPSG:4326"
db.crs
Choropleths
Unique values
A choropleth for categorical variables simply assigns a different color to every potential value in the series. The main
requirement in this case is then for the color scheme to reflect the fact that different values are not ordered or follow a
particular scale.
In Python, creating categorical choropleths is possible with one line of code. To demonstrate this, we can plot the Mexican
states and the region each belongs to based on the Mexican Statistics Institute (coded in our table as the INEGI variable):
db.plot(
column="INEGI",
categorical=True,
legend=True
)
<AxesSubplot:>
Note how we are using the same approach as for basic maps, the command plot , but we now need to add the argument
column to specify which column in particular is to be represented.
Since the variable is categorical we need to make that explicit by setting the argument categorical to True .
As an optional argument, we can set legend to True and the resulting figure will include a legend with the names of
all the values in the map.
Unless we specify a different colormap, the selected one respects the categorical nature of the data by not implying a
gradient or scale but a qualitative structure.
Equal interval
If, instead of categorical variables, we want to display the geographical distribution of a continuous phenomenon, we need to
select a way to encode each value into a color. One potential solution is applying what is usually called “equal intervals”. The
intuition of this method is to split the range of the distribution, the difference between the minimum and maximum value,
into equally large segments and to assign a different color to each of them according to a palette that reflects the fact that
values are ordered.
Creating the choropleth is relatively straightforward in Python. For example, to create an equal interval on the GDP per
capita in 2000 ( PCGDP2000 ), we can run a similar command as above:
db.plot(
column="PCGDP2000",
scheme="equal_interval",
k=7,
cmap="YlGn",
legend=True
)
<AxesSubplot:>
Now, let’s dig a bit deeper into the classification, and how exactly we are encoding values into colors. Each segment, also
called bins or buckets, can also be calculated using the library mapclassify from the PySAL family:
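The classification cell is not reproduced above; with mapclassify (imported earlier), it presumably resembled:
# Classify GDP per capita in 2000 into seven equal-width bins
classi = mapclassify.EqualInterval(db["PCGDP2000"], k=7)
classi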
EqualInterval
Interval Count
----------------------------
[ 8684.00, 15207.57] | 10
(15207.57, 21731.14] | 10
(21731.14, 28254.71] | 5
(28254.71, 34778.29] | 4
(34778.29, 41301.86] | 2
(41301.86, 47825.43] | 0
(47825.43, 54349.00] | 1
The only additional argument to pass to EqualInterval , other than the actual variable we would like to classify, is the
number of segments we want to create, k , which we are arbitrarily setting to seven in this case. This will be the number of
colors plotted on the map so, although having several can give more detail, at some point the marginal value of an
additional one is fairly limited, given the limited ability of the human brain to tell them apart.
Once we have classified the variable, we can check the actual break points where values stop being in one class and become
part of the next one:
classi.bins
The array of breaking points above implies that any value in the variable below 15,207.57 will get the first color in the
gradient when mapped, values between 15,207.57 and 21,731.14 the next one, and so on.
The key characteristic of equal interval maps is that the bins are allocated based on the magnitude of the values, irrespective
of how many observations fall into each bin as a result. In highly skewed distributions, this can result in bins with a large
number of observations, while others only contain a handful of outliers. This can be seen in the summary table printed out
above, where ten states are in the first group, but only one belongs to the group with the highest values. This can also be
represented visually with a kernel density plot where the break points are included as well:
Technically speaking, the figure is created by overlaying a KDE plot with vertical bars for each of the break points. This
makes much more explicit the issue highlighted above, by which the first bin contains a large amount of observations while
the one with top values only encompasses a handful of them.
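The figure itself is not included here; a minimal sketch of how it could be reproduced with the objects above (seaborn and classi are assumed from earlier cells) is:
# Kernel density estimate of the variable
ax = sns.kdeplot(db["PCGDP2000"])
# Add one vertical bar per break point of the classification
for cut in classi.bins:
    ax.axvline(cut, color="red", linewidth=0.75)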
Quantiles
One solution to obtain a more balanced classification scheme is using quantiles. This, by definition, assigns the same amount
of values to each bin: the entire series is laid out in order and break points are assigned in a way that leaves exactly the same
amount of observations between each of them. This “observation-based” approach contrasts with the “value-based” method
of equal intervals and, although it can obscure the magnitude of extreme values, it can be more informative in cases with
skewed distributions.
The code required to create the choropleth mirrors that needed above for equal intervals:
db.plot(
column="PCGDP2000",
scheme="quantiles",
k=7,
cmap="YlGn",
legend=True
)
Note how, in this case, the amount of polygons in each color is by definition much more balanced (almost equal in fact,
except for rounding differences). This obscures outlier values, which get blurred by significantly smaller values in the same
group, but allows us to get more detail in the “most populated” part of the distribution where, instead of only white polygons,
we can now discern more variability.
To get further insight into the quantile classification, let’s calculate it with mapclassify :
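As above, the original cell is not shown; it presumably looked like:
# Quantile classification with seven groups
classi = mapclassify.Quantiles(db["PCGDP2000"], k=7)
classi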
Quantiles
Interval Count
----------------------------
[ 8684.00, 11752.00] | 5
(11752.00, 13215.43] | 4
(13215.43, 15996.29] | 5
(15996.29, 20447.14] | 4
(20447.14, 26109.57] | 5
(26109.57, 30357.86] | 4
(30357.86, 54349.00] | 5
classi.bins
Fisher-Jenks
db.plot(
column="PCGDP2000",
scheme="fisher_jenks",
k=7,
cmap="YlGn",
legend=True
)
<AxesSubplot:>
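The table below comes from a classification cell that is not reproduced; it presumably resembled:
# Fisher-Jenks (natural breaks) classification with seven groups
classi = mapclassify.FisherJenks(db["PCGDP2000"], k=7)
classi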
Interval Count
----------------------------
[ 8684.00, 13360.00] | 10
(13360.00, 18170.00] | 8
(18170.00, 24068.00] | 4
(24068.00, 28460.00] | 4
(28460.00, 33442.00] | 3
(33442.00, 38672.00] | 2
(38672.00, 54349.00] | 1
This methodology aims at minimizing the variance within each bin while maximizing that between different classes.
classi.bins
Graphically, we can see how the break points are not equally spaced but are adapting to obtain an optimal grouping of
observations:
For example, the bin with the highest values covers a much wider span than the one with the lowest, because there are fewer
states in that value range.
db.plot(
column="PCGDP2000",
scheme="quantiles",
k=7,
cmap="YlGn",
legend=False
)
<AxesSubplot:>
If we want to focus around the capital, Mexico DF, the first step involves realising that such area of the map (the little dark
green polygon in the SE centre of the map), falls within the coordinates of -102W/-97W, and 18N/21N, roughly speaking. To
display a zoom map into that area, we can do as follows:
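The zoom cell is not reproduced above; a sketch of the approach described (plot everything, then clip the view with set_xlim / set_ylim) might be:
# Plot the full national choropleth
ax = db.plot(
    column="PCGDP2000",
    scheme="quantiles",
    k=7,
    cmap="YlGn"
)
# Zoom the view into the area around Mexico DF
ax.set_xlim(-102, -97)
ax.set_ylim(18, 21)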
Partial map
The approach above is straightforward, but not necessarily the most efficient one: note that, to generate a map of a potentially
very small area, we effectively draw the entire (potentially very large) map, and discard everything except the section we
want. This is not straightforward to notice at first sight, but what Python is doing in the code cell above is plotting the entire
db object, and only then focusing the figure on the X and Y ranges specified in set_xlim / set_ylim .
Sometimes, this is required. For example, if we want to retain the same coloring used for the national map, but focus on the
region around Mexico DF, this approach is the easiest one.
However, sometimes, we only need to plot the geographical features within a given range, and we either don’t need to keep
the national coloring (e.g. we are using a single color), or we want a classification performed only with the features in the
region.
For these cases, it is computationally more efficient to select the data we want to plot first, and then display them through
plot . For this, we can rely on the cx operator:
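The original cell is not shown; a sketch using the cx indexer with the same bounding box as above would be:
# Spatially slice the table to the area around Mexico DF and plot it
subset = db.cx[-102:-97, 18:21]
subset.plot(color="black")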
<AxesSubplot:>
This approach is a “spatial slice”. If you remember when we saw non-spatial slices (enabled by the loc operator), this is a
similar approach but our selection criteria, instead of subsetting by indices of the table, are based on the spatial coordinates of
the data represented in the table.
Since the result is a GeoDataFrame itself, we can create a choropleth that is based only on the data in the subset:
subset.plot(
column="PCGDP2000",
scheme="quantiles",
k=7,
cmap="YlGn",
legend=False
)
<AxesSubplot:>
Do-It-Yourself
Let’s make a bunch of choropleths! In this section, you will practice the concepts and code we have learnt in this block.
Happy hacking!
Data preparation
Note
The AHAH dataset was created by a University of Liverpool team. If you want to find out more about the
background and details of the project, have a look at the information page at the CDRC website.
Important
You will need a username and password to download the data. Create it for free at:
https://data.cdrc.ac.uk/user/register
Once you have the .zip file on your computer, right-click and “Extract all”. The resulting folder will contain all you need.
For the sake of the example, let’s assume you place the resulting folder in the same location as the notebook you are using. If
that is the case, you can load up a GeoDataFrame of Liverpool neighborhoods with:
import geopandas
lsoas = geopandas.read_file("Access_to_Healthy_Assets_and_Hazards_AHAH_E08000012/data/Access_to
Now, this gets us the geometries of the LSOAs, but not the AHAH data. For that, we need to read in the data and join it onto
lsoas . Assuming the same location of the data as above, we can do as follows:
import pandas
ahah_data = pandas.read_csv("Access_to_Healthy_Assets_and_Hazards_AHAH_E08000012/data/Access_to
Tasks
Zoom of the city centre of Liverpool with the same color for every LSOA
Quantile map of AHAH for all of Liverpool, zoomed into north of the city centre
Zoom to north of the city centre with a quantile map of AHAH for the section only
Concepts
This block is about how we pull off the trick to turn geography into numbers statistics can understand. At this point, we dive
right into the more methodological part of the course; so you can expect a bit of a ramp up in the conceptual sections. Take a
deep breath and jump in, it’s well worth the effort! At the same time, the coding side of each block will start looking more
and more familiar because we are starting to repeat concepts, and we will introduce fewer new building blocks, relying
more and more on what we have seen and just adding small bits here and there.
Space, formally
How do you express geographical relations between objects (e.g. areas, points) in a way that can be used in statistical
analysis? This is exactly the core of what we get into in here. There are several ways, of course, but one of the most
widespread approaches is what is termed spatial weights matrices. We motivate their role and define them in the following
clip.
Slides
Once you have watched the clip above, here’s a quiz for you!
Imagine a geography of square regions (i.e. a grid) with the following structure:
Each region is assigned an ID; so the most north-west region is 1, while the most south-east is 9. Here’s a question:
What is the dimension of the Spatial Weights Matrix for the region above?
Tip
1. 3 × 3
2. 9 × 3
3. 3 × 9
4. 9 × 9
It is nine rows by nine columns. To see why, remember that spatial weights matrices are N × N , as they need to
record the spatial relationship between every single observation (regions in the example above) and every other
single observation in the dataset.
Types of Weights
Once we know what spatial weights are generally, in this clip we dive into some of the specific types we can build for our
analyses.
Slides
[HTML]
[PDF]
Here is a second question for you once you have watched the clip above:
What does the rook contiguity spatial weights matrix look like for the region above? Can you write it down by hand?
Here it is:
Slides
[HTML]
[PDF]
More materials
If you want a similar but slightly different take on spatial weights by Luc Anselin, one of the biggest minds in the field of
spatial analysis, I strongly recommend you watch the following two clips, part of the course offered by the Spatial Data
Center at the University of Chicago:
Lecture on “Spatial Weights”
Lecture on “Spatial Lag”, you can ignore the last five minutes as they are a bit more advanced
Further readings
If you liked what you saw in this section and would like to dig deeper into spatial weights, the following readings are good
next steps:
Hands-on
Spatial weights
In this session we will be learning the ins and outs of one of the key pieces in spatial analysis: spatial weights matrices. These
are structured sets of numbers that formalize geographical relationships between the observations in a dataset. Essentially, a
spatial weights matrix of a given geography is an N × N matrix of non-negative values, where N is the total number
of observations:
$$
W = \begin{pmatrix}
0 & w_{12} & \dots & w_{1N} \\
w_{21} & \ddots & w_{ij} & \vdots \\
\vdots & w_{ji} & 0 & \vdots \\
w_{N1} & \dots & \dots & 0
\end{pmatrix}
$$
where each cell $w_{ij}$ contains a value that represents the degree of spatial contact or interaction between observations $i$ and $j$.
A fundamental concept in this context is that of neighbor and neighborhood. By convention, elements in the diagonal ($w_{ii}$)
are set to zero. A neighbor of a given observation $i$ is another observation with which $i$ has some degree of connection. In
terms of $W$, $i$ and $j$ are thus neighbors if $w_{ij} > 0$. Following this logic, the neighborhood of $i$ will be the set of observations
in the system with which it has a certain connection, or those observations with a weight greater than zero.
There are several ways to create such matrices, and many more to transform them so they contain an accurate representation
that aligns with the way we understand spatial interactions between the elements of a system. In this session, we will
introduce the most commonly used ones and will show how to compute them with PySAL .
%matplotlib inline
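The import cell for this section is not reproduced above; a minimal set that covers the code used below (the aliases gpd, pd, sns, weights and psopen are assumptions based on how they are used later) might be:
import geopandas as gpd
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pysal.lib import weights
from pysal.lib import io
psopen = io.open  # assumed alias, used later to read and write weights files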
Data
For this tutorial, we will use a dataset of small areas (or Lower layer Super Output Areas, LSOAs) for Liverpool.
The table is available as part of this course, so can be accessed remotely through the web. If you want to see how the table
was created, a notebook is available here.
To make things easier, we will read data from a file posted online so, for now, you do not need to download any dataset:
<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 298 entries, E01006512 to E01033768
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LSOA11CD 298 non-null object
1 MSOA11CD 298 non-null object
2 geometry 298 non-null geometry
dtypes: geometry(1), object(2)
memory usage: 9.3+ KB
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and
read it locally. To do that, you can follow these steps:
1. Download the file by right-clicking on this link and saving the file
2. Place the file on the same folder as the notebook where you intend to read it
3. Replace the code in the cell above by:
db = gpd.read_file("liv_lsoas.gpkg")
Contiguity
Contiguity weights matrices define spatial connections through the existence of common boundaries. This makes them directly
suitable for use with polygons: if two polygons share boundaries to some degree, they will be labeled as neighbors under these
kinds of weights. Exactly how much they need to share is what differentiates the two approaches we will learn: queen and
rook.
Queen
Under the queen criterion, two observations only need to share a vertex (a single point) of their boundaries to be considered
neighbors. Constructing a weights matrix under these principles can be done by running:
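The construction cell is not reproduced above; given that db is indexed on the LSOA codes, it presumably looked like:
# Build queen contiguity weights from the polygon table
w_queen = weights.Queen.from_dataframe(db)
w_queen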
<libpysal.weights.contiguity.Queen at 0x7fba3879f910>
The command above creates an object w_queen of the class W . This is the format in which spatial weights matrices are
stored in PySAL . By default, the weights builder ( Queen.from_dataframe ) will use the index of the table, which is useful
so we can keep everything in line easily.
A W object can be queried to find out about the contiguity relations it contains. For example, if we would like to know who
is a neighbor of observation E01006690 :
w_queen['E01006690']
This returns a Python dictionary that contains the ID codes of each neighbor as keys, and the weights they are assigned as
values. Since we are looking at a raw queen contiguity matrix, every neighbor gets a weight of one. If we want to access the
weight of a specific neighbor, E01006691 for example, we can do recursive querying:
w_queen['E01006690']['E01006691']
1.0
W objects also have a direct way to provide a list of all the neighbors or their weights for a given observation. This is thanks
to the neighbors and weights attributes:
w_queen.neighbors['E01006690']
['E01006697',
'E01006692',
'E01033763',
'E01006759',
'E01006695',
'E01006720',
'E01006691']
w_queen.weights['E01006690']
Once created, W objects can provide much information about the matrix, beyond the basic attributes one would expect. We
have direct access to the number of neighbors each observation has via the attribute cardinalities . For example, to find
out how many neighbors observation E01006524 has:
w_queen.cardinalities['E01006524']
6
Since cardinalities is a dictionary, it is straightforward to convert it into a Series object:
queen_card = pd.Series(w_queen.cardinalities)
queen_card.head()
E01006512 6
E01006513 9
E01006514 5
E01006515 8
E01006518 5
dtype: int64
This allows, for example, to access quick plotting, which comes in very handy to get an overview of the size of
neighborhoods in general:
sns.distplot(queen_card, bins=10)
<AxesSubplot:ylabel='Density'>
The figure above shows how most observations have around five neighbors, but there is some variation around it. The
distribution also seems to follow a symmetric form, where deviations from the average occur both in higher and lower values
almost evenly.
Some additional information about the spatial relationships contained in the matrix are also easily available from a W object.
Let us tour some of them:
# Number of observations
w_queen.n
# Average number of neighbors
w_queen.mean_neighbors
5.617449664429531
# Maximum number of neighbors
w_queen.max_neighbors
11
# Islands (observations disconnected from all others)
w_queen.islands
[]
Spatial weight matrices can be explored visually in other ways. For example, we can pick an observation and visualize it in
the context of its neighborhood. The following plot does exactly that by zooming into the surroundings of LSOA
E01006690 and displaying its polygon as well as those of its neighbors:
Note how the figure is built gradually, from the base map, to the focal point, to its neighborhood. Once the entire figure is
plotted, we zoom into the area of interest.
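The plotting cell itself is not included above; a sketch of the gradual build-up it describes (names and styling are assumptions) could be:
f, ax = plt.subplots(figsize=(6, 6))
# Base map with every LSOA
db.plot(ax=ax, facecolor="black", linewidth=0.1)
# Focal observation
focus = db.loc[["E01006690"], :]
focus.plot(ax=ax, facecolor="red", linewidth=0)
# Neighbors of the focal observation under queen contiguity
neis = db.loc[w_queen.neighbors["E01006690"], :]
neis.plot(ax=ax, facecolor="lime", linewidth=0)
# Zoom into the area of interest
minx, miny, maxx, maxy = focus.total_bounds
ax.set_xlim(minx - 2000, maxx + 2000)
ax.set_ylim(miny - 2000, maxy + 2000)
plt.show()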
Rook
Rook contiguity is similar to and, in many ways, superseded by queen contiguity. However, since it sometimes comes up in
the literature, it is useful to know about it. The main idea is the same: two observations are neighbors if they share some of
their boundary lines. However, in the rook case, sharing only one point is not enough: they need to share at least a segment
of their boundary. In most applied cases, these differences usually boil down to how the geocoding was done, but in some
cases, such as when we use raster data or grids, this approach can differ more substantively and it thus makes more sense.
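The corresponding cell is not shown; it presumably mirrored the queen case:
# Build rook contiguity weights from the polygon table
w_rook = weights.Rook.from_dataframe(db)
w_rook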
<libpysal.weights.contiguity.Rook at 0x7fba38676df0>
The output is of the same type as before, a W object that can be queried and used in very much the same way as any other
one.
Distance
Distance based matrices assign the weight to each pair of observations as a function of how far from each other they are.
How this is translated into an actual weight varies across types and variants, but they all share that the ultimate reason why
two observations are assigned some weight is due to the distance between them.
K-Nearest Neighbors
One approach to define weights is to take the distances between a given observation and the rest of the set, rank them, and
consider as neighbors the k closest ones. That is exactly what the k-nearest neighbors (KNN) criterion does.
To calculate KNN weights, we can use a similar function as before and derive them from a shapefile:
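The cell is not reproduced above; a sketch, assuming the weights are built straight from the GeoDataFrame loaded earlier, is:
# Five nearest neighbors, based on polygon centroids
knn5 = weights.KNN.from_dataframe(db, k=5)
knn5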
<libpysal.weights.distance.KNN at 0x7fba34e3dfd0>
Note how we need to specify the number of nearest neighbors we want to consider with the argument k . Since it is a
polygon shapefile that we are passing, the function will automatically compute the centroids to derive distances between
observations. Alternatively, we can provide the points in the form of an array, thus skipping the dependency on a file on
disk:
# Extract centroids
cents = db.centroid
# Extract coordinates into an array
pts = pd.DataFrame(
{"X": cents.x, "Y": cents.y}
).values
# Compute KNN weights
knn5_from_pts = weights.KNN.from_array(pts, k=5)
knn5_from_pts
<libpysal.weights.distance.KNN at 0x7fba386707f0>
Distance band
Another approach to build distance-based spatial weights matrices is to draw a circle of a certain radius and consider as a
neighbor every observation that falls within the circle. The technique has two main variations: binary and continuous. In the
former, every neighbor is given a weight of one, while in the latter, the weights can be further tweaked by the distance to
the observation of interest.
To compute binary distance matrices in PySAL , we can use the following command:
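The command itself is not reproduced; it presumably resembled:
# Binary distance band: neighbors are all observations within 1,000 metres
w_dist1kmB = weights.DistanceBand.from_dataframe(db, threshold=1000, binary=True)
w_dist1kmB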
This creates a binary matrix that considers as neighbors of an observation every polygon whose centroid is closer than 1,000
metres (1 km) to the centroid of that observation. Check, for example, the neighborhood of polygon E01006690 :
w_dist1kmB['E01006690']
{'E01006691': 1.0,
'E01006692': 1.0,
'E01006695': 1.0,
'E01006697': 1.0,
'E01006720': 1.0,
'E01006725': 1.0,
'E01006726': 1.0,
'E01033763': 1.0}
Note that the units in which you specify the distance directly depend on the CRS in which the spatial data are projected, and
this has nothing to do with the weights building but it can affect it significantly. Recall how you can check the CRS of a
GeoDataFrame :
db.crs
In this case, you can see the unit is expressed in metres ( m ), hence we set the threshold to 1,000 for a circle of 1 km of
radius.
An extension of the weights above is to introduce further detail by assigning different weights to different neighbors within
the radius circle based on how far they are from the observation of interest. For example, we could think of assigning the
inverse of the distance between observations $i$ and $j$ as $w_{ij}$. This can be computed with the following command:
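The original cell is not shown; a sketch would be:
# Continuous distance band: inverse-distance weights within 1,000 metres
w_dist1kmC = weights.DistanceBand.from_dataframe(db, threshold=1000, binary=False)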
In w_dist1kmC , every observation within the 1 km circle is assigned a weight equal to the inverse distance between the two:

$$ w_{ij} = \frac{1}{d_{ij}} $$

This way, the further apart $i$ and $j$ are from each other, the smaller the weight $w_{ij}$ will be.
Contrast the binary neighborhood with the continuous one for E01006690 :
w_dist1kmC['E01006690']
{'E01006691': 0.001320115452290246,
'E01006692': 0.0016898106255168294,
'E01006695': 0.001120923796462639,
'E01006697': 0.001403469553911711,
'E01006720': 0.0013390451319917913,
'E01006725': 0.001009044334260805,
'E01006726': 0.0010528395831202145,
'E01033763': 0.0012983249272553688}
Following this logic of more detailed weights through distance, there is a temptation to take it further and consider everyone
else in the dataset as a neighbor whose weight will then get modulated by the distance effect shown above. However,
although conceptually correct, this approach is not always the most computationally efficient or practical one. Because of the
nature of spatial weights matrices, particularly the fact that their size is N by N , they can grow substantially large. A way to
cope with this problem is by making sure they remain fairly sparse (with many zeros). Sparsity is typically ensured in the
case of contiguity or KNN by construction but, with inverse distance, it needs to be imposed as, otherwise, the matrix could
be potentially entirely dense (no zero values other than the diagonal). In practical terms, what is usually done is to impose a
distance threshold beyond which no weight is assigned and interaction is assumed to be non-existent. Beyond being
computationally feasible and scalable, results from this approach usually do not differ much from a fully “dense” one as the
additional information that is included from further observations is almost ignored due to the small weight they receive. In
this context, a commonly used threshold, although not always the best, is the one that makes every observation have at least one
neighbor.
min_thr = weights.min_threshold_distance(pts)
min_thr
939.7373992113434
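To then build, for example, an inverse-distance band that guarantees at least one neighbor per observation, one could reuse this threshold directly (a sketch, not the original code):
# Distance band using the minimum threshold computed above
w_min_dist = weights.DistanceBand.from_dataframe(db, threshold=min_thr, binary=False)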
Block weights
Block weights connect every observation in a dataset that belongs to the same category in a list provided ex-ante. Usually,
this list will have some relation to geography and the location of the observations but, technically speaking, all one needs to
create block weights is a list of memberships. In this class of weights, neighboring observations, those in the same group, are
assigned a weight of one, and the rest receive a weight of zero.
In this example, we will build a spatial weights matrix that connects every LSOA with all the other ones in the same MSOA.
See how the MSOA code is expressed for every LSOA:
db.head()
To build a block spatial weights matrix that connects as neighbors all the LSOAs in the same MSOA, we only require the
mapping of codes. Using PySAL , this is a one-line task:
w_block = weights.block_weights(db['MSOA11CD'])
In this case, PySAL does not allow passing the argument idVariable as above. As a result, observations are named after
the order they occupy in the list:
w_block[0]
The first element is a neighbor of observations 218, 129, 220, and 292, all of them with an assigned weight of 1. However, it is
possible to correct this by using the additional method remap_ids :
w_block.remap_ids(db.index)
Now, if you try w_block[0] , it will return an error; but if you query for the neighbors of an observation by its LSOA id, it
will work:
w_block['E01006512']
Standardizing W matrices
w_queen['E01006690']
{'E01006697': 1.0,
'E01006692': 1.0,
'E01033763': 1.0,
'E01006759': 1.0,
'E01006695': 1.0,
'E01006720': 1.0,
'E01006691': 1.0}
Since it is contiguity, every neighbor gets one, the rest zero weight. We can check if the object w_queen has been
transformed or not by calling the argument transform :
w_queen.transform
'O'
where O stands for “original”, so no transformations have been applied yet. If we want to apply a row-based transformation,
so every row of the matrix sums up to one, we modify the transform attribute as follows:
w_queen.transform = 'R'
Now we can check the weights of the same observation as above and find they have been modified:
w_queen['E01006690']
{'E01006697': 0.14285714285714285,
'E01006692': 0.14285714285714285,
'E01033763': 0.14285714285714285,
'E01006759': 0.14285714285714285,
'E01006695': 0.14285714285714285,
'E01006720': 0.14285714285714285,
'E01006691': 0.14285714285714285}
Save for precision issues, the sum of weights for all the neighbors is one:
pd.Series(w_queen['E01006690']).sum()
0.9999999999999998
Returning the object back to its original state involves assigning transform back to original:
w_queen.transform = 'O'
w_queen['E01006690']
{'E01006697': 1.0,
'E01006692': 1.0,
'E01033763': 1.0,
'E01006759': 1.0,
'E01006695': 1.0,
'E01006720': 1.0,
'E01006691': 1.0}
PySAL supports other transformations as well; for example, V stands for variance stabilizing, with the sum of all the weights
being constrained to the number of observations.
PySAL has a common way to write any kind of W object into a file using the command open . The only element we need to
decide for ourselves beforehand is the format of the file. Although there are several formats in which spatial weight matrices
can be stored, we will focus on the two most commonly used ones:
.gal files, typically used for contiguity weights.
.gwt files, typically used for distance-based weights.
Contiguity spatial weights can be saved into a .gal file with the following commands:
# Open file to write into
fo = psopen('imd_queen.gal', 'w')
# Write the matrix into the file
fo.write(w_queen)
# Close the file
fo.close()
1. Open a target file for writing the matrix, hence the w argument. In this case, if a file imd_queen.gal already exists, it
will be overwritten, so be careful.
2. Write the W object into the file.
3. Close the file. This is important as some additional information is written into the file at this stage, so failing to close the
file might have unintended consequences.
Once we have the file written, it is possible to read it back into memory with the following command:
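The reading cell is not reproduced above; given the description that follows, it presumably looked like:
# Open the .gal file in read mode and read the matrix in one go
w_queen2 = psopen('imd_queen.gal', 'r').read()
w_queen2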
<libpysal.weights.weights.W at 0x7fba351d76a0>
Note how we now use r instead of w because we are reading the file, and also notice how we open the file and, in the
same line, call read() directly.
A very similar process to the one above can be used to read and write distance-based weights. The only difference is
specifying the right file format, .gwt in this case. So, if we want to write w_dist1kmC into a file, we will run:
# Open file
fo = psopen('imd_dist1km.gwt', 'w')
# Write matrix into the file
fo.write(w_dist1kmC)
# Close file
fo.close()
And if we want to read the file back in, all we need to do is:
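The cell is not shown; following the same pattern as for the .gal file, it presumably was:
# Read the distance-based weights back from the .gwt file
w_dist1km2 = psopen('imd_dist1km.gwt', 'r').read()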
Note how, in this case, you will probably receive a warning alerting you that there was no DBF relating to the file. This is
because, by default, PySAL takes the order of the observations in a .gwt from a shapefile. If this is not provided, PySAL
cannot entirely determine all the elements and hence the resulting W might not be complete (islands, for example, can be
missing). To fully complete the reading of the file, we can remap the ids as we have seen above:
w_dist1km2.remap_ids(db.index)
Spatial Lag
One of the most direct applications of spatial weight matrices is the so-called spatial lag. The spatial lag of a given variable is
the product of a spatial weight matrix and the variable itself:
$$ Y_{sl} = W Y $$

where $Y$ is an $N \times 1$ vector with the values of the variable. Recall that the product of a matrix and a vector equals the sum of a
row by column element multiplication for the resulting value of a given row. In terms of the spatial lag:

$$ y_{sl-i} = \sum_j w_{ij} \, y_j $$

If we are using row-standardized weights, $w_{ij}$ becomes a proportion between zero and one, and $y_{sl-i}$ can be seen as the
average value of $Y$ in the neighborhood of $i$.
For this illustration, we will use the area of each polygon as the variable of interest. And to make things a bit nicer later on,
we will keep the log of the area instead of the raw measurement. Hence, let’s create a column for it:
db["area"] = np.log(db.area)
The spatial lag is a key element of many spatial analysis techniques, as we will see later on and, as such, it is fully supported
in PySAL . To compute the spatial lag of a given variable, area for example:
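The computation cell is not reproduced above; with the objects defined earlier, a minimal version is:
# Spatial lag of the (log) area under queen contiguity
w_queen_score = weights.lag_spatial(w_queen, db["area"])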
The actual computation of the lag is highly optimized in PySAL . Note that, despite passing in a pd.Series
object, the output is a numpy array. This, however, can be added directly to the table db :
db['w_area'] = w_queen_score
Moran Plot
The Moran Plot is a graphical way to start exploring the concept of spatial autocorrelation, and it is a good application of
spatial weight matrices and the spatial lag. In essence, it is a standard scatter plot in which a given variable ( area , for
example) is plotted against its own spatial lag. Usually, a fitted line is added to include more information:
$$ z_i = \frac{y_i - \bar{y}}{\sigma_y} $$

where $z_i$ is the standardized version of $y_i$, $\bar{y}$ is the average of the variable, and $\sigma_y$ its standard deviation.
Creating a standardized Moran Plot implies that average values are centered in the plot (as they are zero when standardized)
and dispersion is expressed in standard deviations, with the rule of thumb of values greater or smaller than two standard
deviations being outliers. A standardized Moran Plot also partitions the space into four quadrants that represent different
situations:
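No figure is included here, but a sketch of a standardized Moran Plot for the (log) area variable, using the objects above (column names and styling are assumptions), could be:
# Standardize the variable and compute its spatial lag
db["area_std"] = (db["area"] - db["area"].mean()) / db["area"].std()
w_queen.transform = "R"  # row-standardize so the lag is a neighborhood average
db["w_area_std"] = weights.lag_spatial(w_queen, db["area_std"])
# Scatter plot with a fitted line and reference lines at zero
f, ax = plt.subplots(figsize=(6, 6))
sns.regplot(x="area_std", y="w_area_std", data=db, ci=None, ax=ax)
ax.axvline(0, color="k", alpha=0.5)
ax.axhline(0, color="k", alpha=0.5)
plt.show()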
These will be further explored once spatial autocorrelation has been properly introduced in subsequent blocks.
Do-It-Yourself
In this section, we are going to put into practice what we have learned about spatial weights.
import os
os.environ['USE_PYGEOS'] = '0'
import geopandas
import contextily
from pysal.lib import examples
examples.explain("NYC Socio-Demographics")
To check out the location of the files that make up the dataset, we can load it with load_example and inspect with
get_file_list :
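The cell is not reproduced above; based on the description, it presumably was:
# Load (and, if needed, download) the example and list its files
nyc_data = examples.load_example("NYC Socio-Demographics")
nyc_data.get_file_list()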
nyc = geopandas.read_file(nyc_data.get_path("NYC_Tract_ACS2008_12.shp"))
nyc.plot(figsize=(9, 9))
<Axes: >
Now with the nyc object ready to go, here are a few tasks for you to complete:
The data is available over the web on this address but it is not accessible programmatically. For that reason, we have cached
it on the data folder, and we can read it directly from there into a GeoDataFrame :
jp_cities = geopandas.read_file('../data/Japan.zip')
jp_cities.head()
If we make a quick plot, we can see these are polygons covering the part of the Japanese geography that is considered urban
by their analysis:
jp_cities.crs
jp = jp_cities.to_crs(epsg=2459)
jp.crs
Now, distances are easier to calculate between points than between polygons. Hence, we will convert the urban areas into
their centroids:
jp.geometry = jp.geometry.centroid
jp.plot()
⚠ Warning
The final task below is a bit more involved, so do not despair if you cannot get it to work completely!
Focus on Tokyo (find the row in the table through a query search as we saw when considering Index-based queries) and the
100km spatial weights generated above. Try to create a figure similar to the one in the lecture. Here’s a recipe:
Concepts
In this block we delve into a few statistical methods designed to characterise spatial patterns of data. How phenomena are
distributed over space is at the centre of many important questions. From economic inequality, to the management of disease
outbreaks, being able to statistically characterise the spatial pattern is the first step into understanding causes and thinking
about solutions in the form of policy.
This section is split into a few more chunks than usual, each of them more bite-sized and covering a single concept at a time.
Each chunk builds on the previous one sequentially, so watch them in the order presented here. They are all non-trivial ideas,
so focus all your brain power to understand them while tackling each of the sections!
ESDA
ESDA stands for Exploratory Spatial Data Analysis, and it is a family of techniques to explore and characterise spatial
patterns in data. This clip introduces ESDA conceptually.
Slides
[HTML]
[PDF]
Spatial autocorrelation
In this clip, we define and explain spatial autocorrelation, a core concept to understand what ESDA is about. We also go over
the different types and scales at which spatial autocorrelation can be relevant.
Slides
[HTML]
[PDF]
Global spatial autocorrelation
Slides
[PDF]
Once you have seen the clip, check out this interactive app online:
launch binder
The app is an interactive document that allows you to play with a hypothetical geography made up of a regular grid. A
synthetic variable (one created artificially) is distributed across the grid following a varying degree of global spatial
autocorrelation, which is also visualised using a Moran Plot. Through a slider, you can change the sign (positive/negative)
and strength of global spatial autocorrelation and see how that translates on the appearance of the map and how it is also
reflected in the shape of the Moran Plot.
Local spatial autocorrelation
Slides
[HTML]
[PDF]
If you like this clip and would like to know a bit more about local spatial autocorrelation, the chapter on local spatial
autocorrelation in the GDS book (in progress) is a good “next step”.
Further readings
If this section was of interest to you, there is plenty more you can read and explore. The following are good “next steps” to
delve a bit deeper into exploratory spatial data analysis:
Hands-on
A key idea in this context is that of spatial randomness: a situation in which the location of an observation gives no
information whatsoever about its value. In other words, a variable is spatially random if it is distributed following no
discernible pattern over space. Spatial autocorrelation can thus be formally defined as the “absence of spatial randomness”,
which gives room for two main classes of autocorrelation, similar to the traditional case: positive spatial autocorrelation,
when similar values tend to group together in similar locations; and negative spatial autocorrelation, in cases where similar
values tend to be dispersed and further apart from each other.
In this session we will learn how to explore spatial autocorrelation in a given dataset, interrogating the data about its
presence, nature, and strength. To do this, we will use a set of tools collectively known as Exploratory Spatial Data Analysis
(ESDA), specifically designed for this purpose. The range of ESDA methods is very wide and spans from less sophisticated
approaches like choropleths and general table querying, to more advanced and robust methodologies that include statistical
inference and an explicit recognition of the geographical dimension of the data. The purpose of this session is to dip our toes
into the latter group.
ESDA techniques are usually divided into two main groups: tools to analyze global, and local spatial autocorrelation. The
former consider the overall trend that the location of values follows, and make possible statements about the degree of
clustering in the dataset. Do values generally follow a particular pattern in their geographical distribution? Are similar
values closer to other similar values than we would expect from pure chance? These are some of the questions that tools for
global spatial autocorrelation allow us to answer. We will practice with global spatial autocorrelation by using Moran’s I
statistic.
Tools for local spatial autocorrelation instead focus on spatial instability: the departure of parts of a map from the general
trend. The idea here is that, even though there is a given trend for the data in terms of the nature and strength of spatial
association, some particular areas can diverge quite substantially from the general pattern. Regardless of the overall degree
of concentration in the values, we can observe pockets of unusually high (low) values close to other high (low) values, in
what we will call hot(cold)spots. Additionally, it is also possible to observe some high (low) values surrounded by low (high)
values, and we will name these “spatial outliers”. The main technique we will review in this session to explore local spatial
autocorrelation is the Local Indicators of Spatial Association (LISA).
Data
For this session, we will use the results of the 2016 referendum vote to leave the EU, at the local authority level. In particular,
we will focus on the spatial distribution of the vote to Leave, which ended up winning. From a technical point of view, you
will be working with polygons which have a value (the percentage of the electorate that voted to Leave the EU) attached to
them.
All the necessary data have been assembled for convenience in a single file that contains geographic information about each
local authority in England, Wales and Scotland, as well as the vote attributes. The file is in the geospatial format GeoPackage,
which presents several advantages over the more traditional shapefile (chief among them, needing a single file instead of
several). The file is available as a download from the course website.
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and
read it locally. To do that, you can follow these steps:
1. Download the file by right-clicking on this link and saving the file
2. Place the file on the same folder as the notebook where you intend to read it
3. Replace the code in the cell above by:
br = gpd.read_file("brexit.gpkg")
Now let’s index it on the local authority IDs, while keeping those as a column too:
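The indexing cell is not shown; a sketch consistent with the output below (the index becomes lad16cd while the column is kept) is:
# Index on the local authority code, keeping it as a column too
br = br.set_index("lad16cd", drop=False)
br.info()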
<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 380 entries, E06000001 to W06000024
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 objectid 380 non-null int64
1 lad16cd 380 non-null object
2 lad16nm 380 non-null object
3 Pct_Leave 380 non-null float64
4 geometry 380 non-null geometry
dtypes: float64(1), geometry(1), int64(1), object(2)
memory usage: 17.8+ KB
# Plot polygons
ax = br.plot(alpha=0.5, color='red');
# Add background map, expressing target CRS so the basemap can be
# reprojected (warped)
ctx.add_basemap(ax, crs=br.crs)
For this example, we will show how to build a queen contiguity matrix, which considers two observations as neighbors if
they share at least one point of their boundary. In other words, for a pair of local authorities in the dataset to be considered
neighbours under this W , they will need to be sharing a border or, in other words, “touching” each other to some degree.
Technically speaking, we will approach building the contiguity matrix in the same way we did in Lab 5. We will begin with a
GeoDataFrame and pass it on to the queen contiguity weights builder in PySAL
( ps.weights.Queen.from_dataframe ). We will also make sure our table of data is previously indexed on the local
authority code, so the W is also indexed on that form.
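That cell is not reproduced here but, as the re-computation further below shows, it took this form:
# Queen contiguity weights for the local authorities, indexed on their code
w = weights.Queen.from_dataframe(br, idVariable="lad16cd")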
Now, the w object we have just created is of the same type as any other one we have created in the past. As such, we can inspect it in
the same way. For example, we can check who is a neighbor of observation E08000012 :
w['E08000012']
However, the cell where we computed W returned a warning on “islands”. Remember these are islands not necessarily in the
geographic sense (although some of them will be), but in the mathematical sense of the term: local authorities that are not
sharing border with any other one and thus do not have any neighbors. We can inspect and map them to get a better sense of
what we are dealing with:
In this case, all the islands are indeed “real” islands. These cases can create issues in the analysis and distort the results. There
are several solutions to this situation, such as connecting the islands to other observations through a different criterion (e.g.
nearest neighbor), and then combining both spatial weights matrices. For convenience, we will remove them from the dataset
because they are a small sample and their removal is likely not to have a large impact on the calculations.
Technically, this amounts to a subsetting, very much like we saw in the first weeks of the course, although in this case we
will use the drop command, which comes in very handy in these cases:
br = br.drop(w.islands)
Once we have the set of local authorities that are not an island, we need to re-calculate the weights matrix:
# Create the spatial weights matrix
# NOTE: this might take a few minutes as the geometries are
# are very detailed
%time w = weights.Queen.from_dataframe(br, idVariable="lad16cd")
And, finally, let us row-standardize it to make sure every row of the matrix sums up to one:
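The cell is not shown; following the approach introduced earlier, it amounts to:
# Row-standardize so each row of the matrix sums to one
w.transform = 'R'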
Now, because we have row-standardized the weights, the weight given to each of the three neighbors is 0.33 which, all together,
sum up to one.
w['E08000012']
{'E08000011': 0.3333333333333333,
'E08000014': 0.3333333333333333,
'E06000006': 0.3333333333333333}
Spatial lag
Once we have the data and the spatial weights matrix ready, we can start by computing the spatial lag of the percentage of
votes that went to leave the EU. Remember the spatial lag is the product of the spatial weights matrix and a given variable
and that, if W is row-standardized, the result amounts to the average value of the variable in the neighborhood of each
observation.
We can calculate the spatial lag for the variable Pct_Leave and store it directly in the main table with the following line of
code:
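The cell itself is not reproduced; given the column name used below ( w_Pct_Leave ), it was of this form:
# Spatial lag of the Leave vote share
br["w_Pct_Leave"] = weights.lag_spatial(w, br["Pct_Leave"])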
Let us have a quick look at the resulting variable, as compared to the original one:
The way to interpret the spatial lag ( w_Pct_Leave ) for, say, the first observation is as follows: Hartlepool, where 69.6% of
the electorate voted to leave, is surrounded by neighbouring local authorities where, on average, almost 60% of the electorate
also voted to leave the EU. For the purpose of illustration, we can in fact check this is correct by querying the spatial weights
matrix to find out Hartlepool’s neighbors:
w.neighbors['E06000001']
['E06000004', 'E06000047']
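The cell pulling the Leave share of those neighbors is not shown; a sketch consistent with the output below is:
# Leave vote share of Hartlepool's neighbors
neis = br.loc[w.neighbors['E06000001'], 'Pct_Leave']
neis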
lad16cd
E06000004 61.73
E06000047 57.55
Name: Pct_Leave, dtype: float64
And the average value, which we saw in the spatial lag is 59.64, can be calculated as follows:
neis.mean()
59.64
For some of the techniques we will be seeing below, it makes more sense to operate with the standardized version of a
variable, rather than with the raw one. Standardizing means subtracting the average value from each observation of the
column and dividing the result by the standard deviation. This can be done easily with a bit of basic algebra in Python:
br['Pct_Leave_std'] = (
br['Pct_Leave'] - br['Pct_Leave'].mean()
) / br['Pct_Leave'].std()
Finally, to be able to explore the spatial patterns of the standardized values, sometimes also called z values, we need to create
their spatial lag:
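The cell is not reproduced; mirroring the raw variable, and assuming the column name w_Pct_Leave_std, it would be:
# Spatial lag of the standardized Leave vote share
br["w_Pct_Leave_std"] = weights.lag_spatial(w, br["Pct_Leave_std"])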
Moran Plot
The Moran Plot is a way of visualizing a spatial dataset to explore the nature and strength of spatial autocorrelation. It is
essentially a traditional scatter plot in which the variable of interest is displayed against its spatial lag. In order to be able to
interpret values as above or below the mean, and their quantities in terms of standard deviations, the variable of interest is
usually standardized by subtracting its mean and dividing it by its standard deviation.
Technically speaking, creating a Moran Plot is very similar to creating any other scatter plot in Python, provided we have
standardized the variable and calculated its spatial lag beforehand:
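The plotting cell is not included above; a minimal sketch (imports and styling are assumptions) could be:
import matplotlib.pyplot as plt
import seaborn as sns
# Standardized variable against its spatial lag, with a fitted line
f, ax = plt.subplots(figsize=(6, 6))
sns.regplot(x="Pct_Leave_std", y="w_Pct_Leave_std", data=br, ci=None, ax=ax)
ax.axvline(0, color="k", alpha=0.5)
ax.axhline(0, color="k", alpha=0.5)
plt.show()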
The plot displays a positive relationship between both variables. This is associated with the presence of positive spatial
autocorrelation: similar values tend to be located close to each other. This means that the overall trend is for high values to be
close to other high values, and for low values to be surrounded by other low values. This, however, does not mean that this is
the only situation in the dataset: there can of course be particular cases where high values are surrounded by low ones, and
vice versa. But it means that, if we had to summarize the main pattern of the data in terms of how clustered similar values are,
the best way would be to say they are positively correlated and, hence, clustered over space.
In the context of the example, this can be interpreted along the lines of: local authorities display positive spatial
autocorrelation in the way they voted in the EU referendum. This means that local authorities with high percentage of Leave
voters tend to be located nearby other local authorities where a significant share of the electorate also voted to Leave, and
vice versa.
Moran’s I
The Moran Plot is an excellent tool to explore the data and get a good sense of how much values are clustered over space.
However, because it is a graphical device, it is sometimes hard to condense its insights into a more concise way. For these
cases, a good approach is to come up with a statistical measure that summarizes the figure. This is exactly what Moran’s I is
meant to do.
Very much in the same way the mean summarizes a crucial element of the distribution of values in a non-spatial setting, so
does Moran’s I for a spatial dataset. Continuing the comparison, we can think of the mean as a single numerical value
summarizing a histogram or a kernel density plot. Similarly, Moran’s I captures much of the essence of the Moran Plot. In
fact, there is an even closer connection between the two: the value of Moran’s I corresponds with the slope of the linear fit
overlaid on top of the Moran Plot.
In order to calculate Moran’s I in our dataset, we can call a specific function in PySAL directly:
mi = esda.Moran(br['Pct_Leave'], w)
Note how we do not need to use the standardized version in this context as we will not represent it visually.
The method esda.Moran creates an object that contains much more information than the actual statistic. If we want to retrieve
the value of the statistic, we can do it this way:
mi.I
0.6228641407137806
The other bit of information we will extract from Moran's I relates to statistical inference: how likely is it that the pattern we observe in the map, and which Moran's I captures in its value, was generated by an entirely random process? If we considered the same variable but shuffled its locations randomly, would we obtain a map with similar characteristics?
The specific details of the mechanism to calculate this are beyond the scope of the session, but it is important to know that a small enough p-value associated with the Moran's I of a map allows us to reject the hypothesis that the map is random. In other words, we can conclude that the map displays more spatial pattern than we would expect if the values had been randomly allocated to particular locations.
The most reliable p-value for Moran’s I can be found in the attribute p_sim :
mi.p_sim
0.001
That is just 0.1% and, by standard terms, it would be considered statistically significant. We can quickly elaborate on its intuition. What that 0.001 (or 0.1%) means is that, if we generated a large number of maps with the same values but randomly allocated over space, and calculated the Moran's I statistic for each of those maps, only 0.1% of them would display a larger (absolute) value than the one we obtain from the real data, and the other 99.9% of the random maps would receive a smaller (absolute) value of Moran's I. If we remember again that the value of Moran's I can also be interpreted as the slope of the Moran Plot, what we have is that, in this case, the particular spatial arrangement of values for the Leave votes is more concentrated than if the values had been allocated following a completely spatially random process, hence the statistical significance.
Once we have calculated Moran’s I and created an object like mi , we can use some of the functionality in splot to
replicate the plot above more easily (remember, D.R.Y.):
moran_scatterplot(mi);
As a first step, the global autocorrelation analysis teaches us that observations do seem to be positively correlated over space. In terms of our initial goal to find spatial structure in the attitude towards Brexit, this view seems to align: if the vote had no such structure, it should not show a pattern over space; technically, it would show a random one.
So far we have classified each observation in the dataset depending on its value and that of its neighbors. This is only half way into identifying areas of unusual concentration of values. To know whether each of the locations is a statistically significant cluster of a given kind, we again need to compare it with what we would expect if the data were allocated in a completely random way. After all, by definition, every observation will be of one kind or another, based on the comparison above. However, what we are interested in is whether the strength with which the values are concentrated is unusually high.
This is exactly what LISAs are designed to do. As before, a more detailed description of their statistical underpinnings is beyond the scope of this context, but we will try to shed some light on the intuition of how they go about it. The core idea is to identify cases in which the comparison between the value of an observation and the average of its neighbors is either more similar (HH, LL) or more dissimilar (HL, LH) than we would expect from pure chance. The mechanism to do this is similar to the one in the global Moran's I, but applied in this case to each observation, resulting in as many statistics as original observations.
LISAs are widely used in many fields to identify clusters of values in space. They are a very useful tool that can quickly
return areas in which values are concentrated and provide suggestive evidence about the processes that might be at work. For
that, they have a prime place in the exploratory toolbox. Examples of contexts where LISAs can be useful include:
identification of spatial clusters of poverty in regions, detection of ethnic enclaves, delineation of areas of particularly
high/low activity of any phenomenon, etc.
lisa = esda.Moran_Local(br['Pct_Leave'], w)
All we need to pass is the variable of interest (percentage of Leave votes) and the spatial weights that describe the neighborhood relations between the different observations that make up the dataset.
Because of their very nature, looking at the numerical result of LISAs is not always the most useful way to exploit all the information they can provide. Remember that we are calculating a statistic for every single observation in the data so, if we have many of them, it will be difficult to extract any meaningful pattern. Instead, what is typically done is to create a map, a cluster map as it is usually called, that extracts the significant observations (those that are highly unlikely to have come from pure chance) and plots them with a specific color depending on their quadrant category.
All of the needed pieces are contained inside the lisa object we have created above. But, to make the map making more
straightforward, it is convenient to pull them out and insert them in the main data table, br :
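The extraction cell itself is not reproduced here; a minimal sketch of what it could look like, using the 5% significance threshold discussed below:
# Flag observations whose pseudo p-value falls below 5%
br['significant'] = lisa.p_sim < 0.05
# Record the Moran Plot quadrant (1 HH, 2 LH, 3 LL, 4 HL) of each observation
br['quadrant'] = lisa.q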
Let us stop for a second on these two steps. First, the significant column. Similarly to the global Moran's I, PySAL automatically computes a p-value for each LISA. Because not every observation represents a statistically significant one, we want to identify those with a p-value small enough to rule out the possibility of obtaining a similar situation from pure chance. Following a similar reasoning as with the global Moran's I, we select 5% as the threshold for statistical significance. To identify these values, we create a variable, significant , that contains True if the p-value of the observation satisfies the condition, and False otherwise. We can check this is the case:
br['significant'].head()
lad16cd
E06000001 False
E06000002 False
E06000003 False
E06000004 True
E06000005 False
Name: significant, dtype: bool
lisa.p_sim[:5]
Note how only the fourth value is smaller than 0.05, as the variable significant correctly identified above.
Second, the quadrant each observation belongs to. This one is easier as it comes built into the lisa object directly:
br['quadrant'].head()
lad16cd
E06000001 1
E06000002 1
E06000003 1
E06000004 1
E06000005 1
Name: quadrant, dtype: int64
The correspondence between the numbers in the variable and the actual quadrants is as follows:
1: HH
2: LH
3: LL
4: HL
With these two elements, significant and quadrant , we can build a typical LISA cluster map combining the mapping
skills with what we have learned about subsetting and querying tables:
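A hedged sketch of such a map, built by hand from the two columns created above (colour choices and figure size are illustrative):
# Background: all local authorities in light grey
f, ax = plt.subplots(1, figsize=(9, 9))
br.plot(color='lightgrey', linewidth=0.1, edgecolor='white', ax=ax)
# Overlay significant observations, coloured by quadrant
colors = {1: 'red', 2: 'lightblue', 3: 'blue', 4: 'orange'}
for quad, colour in colors.items():
    sub = br[br['significant'] & (br['quadrant'] == quad)]
    if len(sub) > 0:
        sub.plot(color=colour, ax=ax)
ax.set_axis_off()
plt.show()
Alternatively, the lisa_cluster function in splot.esda can produce a similar map in a single call.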
Or, if we want to have more control over what is being displayed, and how each component is presented, we can “cook” the
plot ourselves:
The substantive interpretation of a LISA map needs to relate its output to the original intention of the analyst who created the map. In this case, our original idea was to explore the spatial structure of support for leaving the EU. The LISA proves a fairly useful tool in this context. Comparing the LISA map above with the choropleth we started with, we can interpret the LISA as a "simplification" of the detailed but perhaps too complicated picture in the choropleth, one that focuses the reader's attention on the areas that display a particularly high concentration of (dis)similar values, helping the spatial structure of the vote emerge in a more explicit way. The result of this highlights the relevance that the East of England and the Midlands had in voting to Leave, as well as the regions of the map where there was a lot less excitement about Leaving.
The results from the LISA statistics can be connected to the Moran plot to visualise where in the scatter plot each type of
polygon falls:
Data preparation
For this section, we are going to revisit the AHAH dataset we saw in the DIY section of Block D. Please head over to the
section to refresh your mind about how to load up the required data. Once you have successfully created the ahah object,
move on to Task I.
Create cluster maps for significance levels 1% and 10%; compare them with the one we obtained. What are the main
changes? Why?
Concepts
This block is all about grouping: grouping of similar observations, areas, records… We start by discussing why grouping, or clustering in statistical parlance, is important and what it can do for us. Then we move on to different types of clustering. We focus on two: one is traditional non-spatial clustering, or unsupervised learning, for which we cover the most popular technique; the other one is explicitly spatial clustering, or regionalisation, which imposes additional (geographic) constraints when grouping observations.
Slides
[HTML]
[PDF]
Non-spatial clustering
Non-spatial clustering is the most common form of data grouping. In this section, we cover the basics and mention a few
approaches. We wrap it up with an example of clustering very dear to human geography: geodemographics.
Slides
The slides used in the clip are available at:
[HTML]
[PDF]
K-Means
In the clip above, we talk about K-Means, by far the most common clustering algorithm. Watch the video in the expandable below to get the intuition behind the algorithm and better understand how it does its "magic".
For a striking visual comparison of how K-Means compares to other clustering algorithms, check out this figure produced by
the scikit-learn project, a Python package for machine learning (more on this later):
https://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_0011.png
Fig. 1 Clustering algorithms comparison [Source]
Geodemographics
If you are interested in Geodemographics, a very good reference to get a broader perspective on the idea, origins and history
of the field is “The Predictive Postcode” webber2018predictive, by Richard Webber and Roger Burrows. In particular, the
first four chapters provide an excellent overview.
Furthermore, the clip mentions the Output Area Classification (OAC), which you can access, for example, through the CDRC
Maps platform:
Slides
[HTML]
[PDF]
If you are interested in the idea of regionalisation, a very good place to continue reading is Duque et al. (2007)
duque2007supervised, which was an important inspiration in structuring the clip.
Hands-on
The basic idea of statistical clustering is to summarize the information contained in several variables by creating a relatively
small number of categories. Each observation in the dataset is then assigned to one, and only one, category depending on its
values for the variables originally considered in the classification. If done correctly, the exercise reduces the complexity of a
multi-dimensional problem while retaining all the meaningful information contained in the original dataset. This is because,
once classified, the analyst only needs to look at which category every observation falls into, instead of considering the
multiple values associated with each of the variables and trying to figure out how to put them together in a coherent sense.
When the clustering is performed on observations that represent areas, the technique is often called geodemographic analysis.
Although there exist many techniques to statistically group observations in a dataset, all of them are based on the premise of
using a set of attributes to define classes or categories of observations that are similar within each of them, but differ between
groups. How similarity within groups and dissimilarity between them is defined and how the classification algorithm is
operationalized is what makes techniques differ and also what makes each of them particularly well suited for specific
problems or types of data. As an illustration, we will only dip our toes into one of these methods, K-means, which is probably
the most commonly used technique for statistical clustering.
In the case of analysing spatial data, there is a subset of methods that are of particular interest for many common cases in Geographic Data Science. These are the so-called regionalization techniques. Regionalization methods can take many forms but, at their core, they all involve statistical clustering of observations with the additional constraint that observations need to be geographical neighbors to be in the same category. Because of this, rather than category, we will use the term area for each observation and region for each category; hence regionalization, the construction of regions from smaller areas.
Data
The dataset we will use in this occasion is an extract from the online website AirBnb. AirBnb is a company that provides a
meeting point for people looking for an alternative to a hotel when they visit a city, and locals who want to rent (part of) their
house to make some extra money. The website has a continuously updated listing of all the available properties in a given
city that customers can check and book through. In addition, the website also provides a feedback mechanism by which both
ends, hosts and guests, can rate their experience. Aggregating ratings from guests about the properties where they have
stayed, AirBnb provides additional information for every property, such as an overall cleanliness score or an index of how
good the host is at communicating with the guests.
The original data are provided at the property level and for the whole of London. However, since the total number of properties is very large for the purposes of this notebook, they have been aggregated at the Middle Layer Super Output Area (MSOA), a geographical unit created by the Office for National Statistics. Although the original source contains information for Greater London, the vast majority of properties are located in Inner London, so the data we will use are restricted to that extent. Even in this case, not every polygon has at least one property. To avoid cases of missing values, the final dataset only contains those MSOAs with at least one property, so there can be average ratings associated with them.
Our goal in this notebook is to create a classification of areas (MSOAs) in Inner London based on the ratings of the AirBnb
locations. This will allow us to create a typology for the geography of AirBnb in London and, to the extent the AirBnb
locations can say something about the areas where they are located, the classification will help us understand the geography
of residential London a bit better. One general caveat about the conclusions we can draw from an analysis like this one derives from the nature of AirBnb data. On the one hand, this dataset is a good example of the kind of analyses that the data revolution is making possible: only a few years ago, it would have been very hard to obtain a similarly large survey of properties with ratings like this one. On the other hand, it is important to keep in mind the kinds of biases that these data are subject to, and thus the limitations in terms of generalizing findings to the general population. At any rate, this dataset is a great example to learn about statistical clustering of spatial observations, both in a geodemographic analysis as well as in a regionalization exercise.
Alternative
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and
read it locally. To do that, you can follow these steps:
1. Download the file by right-clicking on this link and saving the file
2. Place the file on the same folder as the notebook where you intend to read it
3. Replace the code in the cell above by:
abb = gpd.read_file("london_abb.gpkg")
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 353 entries, 0 to 352
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MSOA_CODE 353 non-null object
1 accommodates 353 non-null float64
2 bathrooms 353 non-null float64
3 bedrooms 353 non-null float64
4 beds 353 non-null float64
5 number_of_reviews 353 non-null float64
6 reviews_per_month 353 non-null float64
7 review_scores_rating 353 non-null float64
8 review_scores_accuracy 353 non-null float64
9 review_scores_cleanliness 353 non-null float64
10 review_scores_checkin 353 non-null float64
11 review_scores_communication 353 non-null float64
12 review_scores_location 353 non-null float64
13 review_scores_value 353 non-null float64
14 property_count 353 non-null int64
15 BOROUGH 353 non-null object
16 GSS_CODE 353 non-null object
17 geometry 353 non-null geometry
dtypes: float64(13), geometry(1), int64(1), object(3)
memory usage: 49.8+ KB
Before we jump into exploring the data, there is one additional step that will come in handy down the line. Not every variable in the table is an attribute that we will want for the clustering. In particular, we are interested in review ratings, so we will only consider those. Hence, let us first manually write them out so they are easier to subset:
ratings = [
'review_scores_rating',
'review_scores_accuracy',
'review_scores_cleanliness',
'review_scores_checkin',
'review_scores_communication',
'review_scores_location',
'review_scores_value'
]
Later in the section, we will also use what AirBnb calls neighborhoods. Let’s load them in so they are ready when we need
them.
boroughs = gpd.read_file(
"https://darribas.org/gds_course/content/data/london_inner_boroughs.geojson"
)
Note that, in comparison to previous datasets, this one is provided in a new format, .geojson . A GeoJSON file is a plain text file (you can open it in any text editor and see its contents) that follows the structure of the JSON format, widely used to exchange information over the web, adapted for geographic data, hence the geo at the front. GeoJSON files have gained much popularity with the rise of web mapping and are quickly becoming a de facto standard for small datasets because they are readable by humans and by many different platforms. As you can see above, reading them in Python is exactly the same as reading a shapefile, for example.
Since we have many columns to plot, we will create a loop that generates each map for us and places it on a "subplot" of the main figure (a sketch of what such a cell could look like follows the list below):
First (L. 2) we set the number of rows and columns we want for the grid of subplots.
The resulting object, axs , is not a single axis but a grid (or array) of axes. Because of this, we can't plot directly on axs ; instead, we need to access each individual axis.
To make that step easier, we unpack the grid into a flat list (array) of axes with flatten (L. 4).
At this point, we set up a for loop (L. 6) to plot a map in each of the subplots.
Within the loop (L. 6-14), we extract the axis (L. 8), plot the choropleth on it (L. 10) and style the map (L. 11-14).
Display the figure (L. 16).
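The code cell whose line numbers the list above refers to is not reproduced in this extract; a minimal sketch following the same logic, and assuming the abb table and the ratings list defined earlier, could look like this:
import matplotlib.pyplot as plt
# Set up a grid of subplots with enough slots for the seven ratings
f, axs = plt.subplots(nrows=3, ncols=3, figsize=(12, 12))
# Unpack the grid into a flat array of axes
axs = axs.flatten()
# Plot one choropleth per rating variable
for i, col in enumerate(ratings):
    ax = axs[i]
    abb.plot(column=col, ax=ax, linewidth=0.05)
    ax.set_axis_off()
    ax.set_title(col)
# Hide the unused subplots and display the figure
for ax in axs[len(ratings):]:
    ax.set_axis_off()
plt.show()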
As we can see, there is substantial variation in how the ratings for different aspects are distributed over space. While variables like the overall value ( review_scores_value ) or the communication ( review_scores_communication ) tend to be higher in peripheral areas, others like the location score ( review_scores_location ) are heavily concentrated in the city centre.
Even though we only have seven variables, it is very hard to “mentally overlay” all of them to come up with an overall
assessment of the nature of each part of London. For bivariate correlations, a useful tool is the correlation matrix plot,
available in seaborn :
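A hedged sketch of what such a plot could look like, using seaborn's pairplot on the ratings columns (the kind and diag_kind options are illustrative choices):
import seaborn as sns
# Pairwise scatter plots with linear fits, and kernel densities on the diagonal
sns.pairplot(abb[ratings], kind='reg', diag_kind='kde')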
Although the underlying algorithm is not trivial, running K-means in Python is streamlined thanks to scikit-learn . As with the rest of the extensive set of algorithms available in the library, its computation is a matter of two lines of code. First, we need to specify the parameters in the KMeans method (which is part of scikit-learn 's cluster submodule). Note that, at this point, we do not even need to pass the data:
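The setup cell is not shown in this extract; a minimal sketch, assuming five clusters and an arbitrary random seed:
from sklearn.cluster import KMeans
# Set up K-Means with five clusters and a fixed random state for reproducibility
kmeans5 = KMeans(n_clusters=5, random_state=12345)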
This sets up an object that holds all the parameters required to run the algorithm. In our case, we only passed the number of clusters ( n_clusters ) and the random state, a number that ensures every run of K-Means, which remember relies on random initialisations, is the same and thus reproducible.
To actually run the algorithm on the attributes, we need to call the fit method in kmeans5 :
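A hedged sketch of that call, restricting the attributes to the ratings columns selected earlier:
# Run K-Means on the ratings columns only
k5cls = kmeans5.fit(abb[ratings])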
The k5cls object we have just created contains several components that can be useful for an analysis. For now, we will use
the labels, which represent the different categories in which we have grouped the data. Remember, in Python, life starts at
zero, so the group labels go from zero to four. Labels can be extracted as follows:
k5cls.labels_
Each number represents a different category, so two observations with the same number belong to same group. The labels are
returned in the same order as the input attributes were passed in, which means we can append them to the original table of
data as an additional column:
abb['k5cls'] = k5cls.labels_
k5sizes = abb.groupby('k5cls').size()
k5sizes
k5cls
0 56
1 104
2 98
3 72
4 23
dtype: int64
The groupby operator groups a table ( DataFrame ) using the values in the column provided ( k5cls ) and passes them onto the function provided afterwards, which in this case is size . Effectively, what this does is to group the observations by the categories created and count how many of them each contains. For a more visual representation of the output, a bar plot is a good alternative:
As we suspected from the map, groups have varying sizes, with groups one, two and three containing over 70 observations each, and group four having under 25.
In order to describe the nature of each category, we can look at the values of each of the attributes we have used to create
them in the first place. Remember we used the average ratings on many aspects (cleanliness, communication of the host, etc.)
to create the classification, so we can begin by checking the average value of each. To do that in Python, we will rely on the groupby operator, which we will combine with the function mean :
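A hedged sketch of that step (the resulting table, transposed so that each cluster appears as a column, is not reproduced here):
# Mean of each rating variable within each cluster
k5means = abb.groupby('k5cls')[ratings].mean()
k5means.T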
This concludes the section on geodemographics. As we have seen, the essence of this approach is to group areas on a purely statistical basis: where each area is located is irrelevant for the label it receives from the clustering algorithm. In many contexts, this is not only permissible but even desirable, as the interest is to see if particular combinations of values are distributed over space in any discernible way. However, in other contexts, we may be interested in creating groups of observations that follow certain spatial constraints. For that, we now turn to regionalization techniques.
Regionalization algorithms
Regionalization is the subset of clustering techniques that impose a spatial constraint on the classification. In other words, the result of a regionalization algorithm contains areas that are spatially contiguous. Effectively, what this means is that these techniques aggregate areas into a smaller set of larger ones, called regions. In this context, then, areas are nested within regions. Real-world examples of this phenomenon include counties within states or, in the UK, lower layer super output areas (LSOAs) within middle layer super output areas (MSOAs). The difference between those examples and the output of a regionalization algorithm is that while the former are aggregated based on administrative principles, the latter follows a statistical technique that, very much in the same way as standard statistical clustering, groups together areas that are similar on the basis of a set of attributes. Only that now, such statistical clustering is spatially constrained.
As in the non-spatial case, there are many different algorithms to perform regionalization, and they all differ on details
relating to the way they measure (dis)similarity, the process to regionalize, etc. However, same as above too, they all share a
few common aspects. In particular, they all take a set of input attributes and a representation of space in the form of a binary
spatial weights matrix. Depending on the algorithm, they also require the desired number of output regions into which the
areas are aggregated.
To illustrate these concepts, we will run a regionalization algorithm on the AirBnb data we have been using. In this case, the goal will be to re-delineate the boundary lines of the Inner London boroughs following a rationale based on the different average ratings of AirBnb properties, instead of the administrative reasons behind the existing boundary lines. In this way, the resulting regions will represent a consistent set of areas that are similar to each other in terms of the ratings received.
Technically speaking, this is the same process as we have seen before, thanks to PySAL . The difference in this case is that
we did not begin with a shapefile, but with a GeoJSON. Fortunately, PySAL supports the construction of spatial weights
matrices “on-the-fly”, that is from a table. This is a one-liner:
w = weights.Queen.from_dataframe(abb)
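The regionalisation step itself is not reproduced in this extract. A minimal sketch of how an object such as sagg13cls could be created, assuming 13 target regions and scikit-learn's agglomerative clustering, with the queen weights acting as a connectivity (contiguity) constraint:
from sklearn.cluster import AgglomerativeClustering
# Spatially constrained hierarchical clustering: the sparse representation
# of the queen weights restricts merges to contiguous areas
model = AgglomerativeClustering(
    linkage='ward',
    connectivity=w.sparse,
    n_clusters=13,
)
sagg13cls = model.fit(abb[ratings])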
abb['sagg13cls'] = sagg13cls.labels_
def dissolve(gs):
'''
Take a series of polygons and dissolve them into a single one
Arguments
---------
gs : GeoSeries
Sequence of polygons to be dissolved
Returns
-------
dissolved : Polygon
Single polygon containing all the polygons in `gs`
'''
return gs.unary_union
The boundaries for the AirBnb boroughs can then be obtained as follows:
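The cell that builds them is not shown here; a hedged sketch using geopandas' built-in dissolve, which merges all polygons sharing the same sagg13cls label (the same result could be obtained by applying the dissolve helper above group by group; abb_regions is an illustrative name):
# Merge the MSOA polygons of each region into a single boundary
abb_regions = abb[['sagg13cls', 'geometry']].dissolve(by='sagg13cls')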
The delineation above provides a view into the geography of AirBnb properties. Each region delineated contains houses that, according to our regionalisation algorithm, are more similar to each other than to those in neighboring areas. Now let's compare this geography, which we have organically drawn from our data, with the official set of administrative boundaries. For example, with the London boroughs.
boroughs.head()
0  Lambeth     E09000022  2724.940   43.927  T  None  None  POLYGON ((...
1  Southwark   E09000028  2991.340  105.139  T  None  None  POLYGON ((...
2  Lewisham    E09000023  3531.706   16.795  T  None  None  POLYGON ((...
3  Greenwich   E09000011  5044.190  310.785  F  None  None  MULTIPOLYGON ((...
4  Wandsworth  E09000032  3522.022   95.600  T  None  None  POLYGON ((...
Looking at the figure, there are several differences between the two maps. The clearest one is that, while the administrative boundaries have a very balanced size (with the exception of the City of London), the regions created with the spatial agglomerative algorithm vary widely in size. This is a consequence of both the nature of the underlying data and the algorithm itself. Substantively, this shows how, based on AirBnb, we can observe large areas that are similar and hence are grouped into the same region, while there also exist pockets with characteristics different enough to be assigned to a different region.
For this, make sure you standardise the table by the size of each tract. That is, compute a column with the total population as
the sum of all the ethnic groups and divide each of them by that column. This way, the values will range between 0 (no
population of a given ethnic group) and 1 (all the population in the tract is of that group).
Alternative
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and
read it locally. To do that, you can follow these steps:
1. Download the file by right-clicking on this link and saving the file
2. Place the file on the same folder as the notebook where you intend to read it
3. Replace the code in the cell above by:
db = geopandas.read_file("dar_es_salaam.geojson")
db.info()
Two main aspects of the built environment are considered: the street network and buildings. To capture those, the following variables are calculated for the H3 hexagonal grid system, zoom level 8:
Develop a regionalisation that partitions Dar Es Salaam based on its built environment
Concepts
In this block, we focus on a particular type of geometry: points. As we will see, points can represent a very particular type of spatial entity. We explore how that is the case and what its implications are, and then wrap up with a particular machine learning technique that allows us to identify clusters of points in space.
Point patterns
Collections of points referencing geographical locations are sometimes called point patterns. In this section, we talk about
what’s special about point patterns and how they differ from other collections of geographical features such as polygons.
Slides
[HTML]
[PDF]
Once you have gone over the clip above, watch the one below, featuring Luc Anselin from the University of Chicago providing an overview of point patterns. This will provide a wider perspective on the particular nature of points, but also on their relevance for many disciplines, from ecology to economic geography.
If you want to delve deeper into point patterns, watch the video on the expandable below, which features Luc Anselin
delivering a longer (and slightly more advanced) lecture on point patterns.
Visualising Points
Once we have a better sense of what makes points special, we turn to visualising point patterns. Here we cover three main
strategies: one to one mapping, aggregation, and smoothing.
[HTML]
[PDF]
We will put all of these ideas for visualising points into practice in the Hands-on section.
Clustering Points
As we have seen in this course, “cluster” is a hard-to-define term. In Block G we used it as the outcome of an unsupervised
learning algorithm. In this context, we will use the following definition:
Concentrations/agglomerations of points over space, significantly more so than in the rest of the space considered
Spatial/Geographic clustering has a wide literature going back to spatial mathematics and statistics and, more recently,
machine learning. For this section, we will cover one algorithm from the latter discipline which has become very popular in
the geographic context in the last few years: Density-Based Spatial Clustering of Applications with Noise, or DBSCAN
ester1996density.
Watch the clip below to get the intuition of the algorithm first:
Let's complement and unpack the clip above in the context of this course. The video does a very good job of explaining how the algorithm works, and what general benefits that entails. Here are two additional advantages that are not picked up in the clip:
1. It is not necessarily spatial. In fact, the original design was for the area of “data mining” and “knowledge discovery in
databases”, which historically does not work with spatial data. Instead, think of purchase histories of consumers, or
warehouse stocks: DBSCAN was designed to pick up patterns of similar behaviour in those contexts. Note also that this
means you can use DBSCAN not only with two dimensions (e.g. longitude and latitude), but with many more (e.g.
product variety) and its mechanics will work in the same way.
2. Fast and scalable. For similar reasons, DBSCAN is very fast and can be run in relatively large databases without
problem. This contrasts with much of the traditional point pattern methods, that rely heavily on simulation and thus are
trickier to scale feasibly. This is one of the reasons why DBSCAN has been widely adopted in Geographic Data Science:
it is relatively straightforward to apply and will run fast, even on large datasets, meaning you can iterate over ideas
quickly to learn more about your data.
DBSCAN also has a few drawbacks when compared to some of the techniques we have seen earlier in this course. Here are
two prominent ones:
1. It is not based on a probabilistic model. Unlike the LISAs, for example, there is no underlying model that helps us characterise the pattern the algorithm returns. There is no "null hypothesis" to reject, no inferential model and thus no statistical significance. In some cases, this is an important drawback if we want to ensure that what we are observing (and the algorithm is picking up) is not a random pattern.
2. Agnostic about the underlying process. Because there is no inferential model and the algorithm imposes very little
prior structure to identify clusters, it is also hard to learn anything about the underlying process that gave rise to the
pattern picked up by the algorithm. This is by no means a unique feature of DBSCAN, but one that is always good to
keep in mind as we are moving from exploratory analysis to more confirmatory approaches.
Further readings
If this section was of your interest, there is plenty more you can read and explore. A good “next step” is the Points chapter on
the GDS book (in progress) reyABwolf.
Hands-on
Points
This is an adapted version, with a bit less content and detail, of the chapter on points by Rey, Arribas-Bel and Wolf
(in progress) reyABwolf. Check out the full chapter, available for free at:
https://geographicdata.science/book/notebooks/08_point_pattern_analysis.html
Points are spatial entities that can be understood in two fundamentally different ways. On the one hand, points can be seen as fixed objects in space, which is to say their location is taken as given (exogenous). In this case, analysis of points is very similar to that of other types of spatial data such as polygons and lines. On the other hand, points can be seen as the occurrence of an event that could theoretically take place anywhere but only manifests in certain locations. This is the approach we will adopt in the rest of the notebook.
When points are seen as events that could take place in several locations but only happen in a few of them, a collection of such events is called a point pattern. In this case, the location of points is one of the key aspects of interest for analysis. A good example of a point pattern is crime events in a city: they could technically happen in many locations, but we usually find crimes are committed only in a handful of them. Point patterns can be marked, if more attributes are provided with the location, or unmarked, if only the coordinates of where the event occurred are provided. Continuing the crime example, an unmarked pattern would result if only the location where crimes were committed was used for analysis, while we would be speaking of a marked point pattern if other attributes, such as the type of crime, the extent of the damage, etc., were provided with the location.
Point pattern analysis is thus concerned with the description, statistical characterization, and modeling of point patterns, focusing especially on the generating process that gives rise to and explains the observed data. What's the nature of the distribution of points? Is there any structure we can statistically discern in the way locations are arranged over space? Why do events occur in those places and not in others? These are all questions that point pattern analysis is concerned with.
This notebook aims to be a gentle introduction to working with point patterns in Python. As such, it covers how to read,
process and transform point data, as well as several common ways to visualize point patterns.
import numpy as np
import pandas as pd
import geopandas as gpd
import contextily as cx
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from ipywidgets import interact, fixed
import os
# Opt out of PyGEOS before importing GeoPandas, to silence the deprecation
# warning about the future switch to Shapely
os.environ['USE_PYGEOS'] = '0'
import geopandas as gpd
Data
Photographs
We are going to dip our toes in the lake of point data by looking at a sample of geo-referenced photographs in Tokyo. The
dataset comes from the GDS Book reyABwolf and contains photographs voluntarily uploaded to the Flickr service.
Alternative
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and
read it locally. To do that, you can follow these steps:
1. Download the file by right-clicking on this link and saving the file
2. Place the file on the same folder as the notebook where you intend to read it
3. Replace the code in the cell above by:
tokyo = pd.read_csv("tokyo_clean.csv")
Administrative areas
We will later use administrative areas for aggregation. Let's load them up front. These are provided with the course and
available online:
Alternative
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and
read it locally. To do that, you can follow these steps:
1. Download the file by right-clicking on this link and saving the file
2. Place the file on the same folder as the notebook where you intend to read it
3. Replace the code in the cell above by:
areas = gpd.read_file("tokyo_admin_boundaries.geojson")
The final bit we need to get out of the way is attaching to each photo the code of the administrative area where it is located. This can be done with a GIS operation called a "spatial join".
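The join itself is not reproduced in this extract; a minimal sketch, assuming both tables are expressed in longitude and latitude (EPSG:4326) and that admin_area is the name of the area code column used later on:
# Turn the photo table into a GeoDataFrame of points
tokyo_pts = gpd.GeoDataFrame(
    tokyo,
    geometry=gpd.points_from_xy(tokyo["longitude"], tokyo["latitude"]),
    crs="EPSG:4326",
)
# Spatial join: attach to each photo the ID of the area it falls in
tokyo = gpd.sjoin(
    tokyo_pts, areas[["GID_2", "geometry"]], how="left"
).rename(columns={"GID_2": "admin_area"})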
One-to-one
The first approach we review here is the one-to-one approach, where we place a dot on the screen for every point to visualise.
In Python, one way to do this is with the scatter method in the Pandas visualisation layer:
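The plotting cell is not shown in this extract; a minimal sketch:
# One dot per photograph, using the raw longitude/latitude columns
tokyo.plot.scatter("longitude", "latitude", s=0.5)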
However this does not give us much geographical context and, since there are many points, it is hard to see any pattern in
areas of high density. Let’s tweak the dot display and add a basemap:
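Again as a hedged sketch, smaller and semi-transparent dots over a web basemap added with contextily (tile provider and styling are illustrative):
f, ax = plt.subplots(1, figsize=(9, 9))
# Smaller, semi-transparent dots reveal dense areas better
ax.scatter(tokyo["longitude"], tokyo["latitude"], s=0.75, alpha=0.25, c="purple")
# Web basemap in the same (longitude/latitude) coordinate system
cx.add_basemap(ax, crs="EPSG:4326", source=cx.providers.CartoDB.Positron)
plt.show()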
This approach is intuitive but of course raises the following question: what polygons do we use to aggregate the points? Ideally, we want a boundary delineation that matches as closely as possible the point-generating process and partitions the space into areas with a similar internal intensity of points. However, that is usually not the case, not least because one of the main reasons we typically want to visualize the point pattern is to learn about such a generating process, so we would typically not know a priori whether a set of polygons matches it. If we cannot count on the ideal set of polygons to begin with, we can adopt two more realistic approaches: using a set of pre-existing irregular areas or creating an artificial set of regular polygons. Let's explore both.
Irregular lattices
To exemplify this approach, we will use the administrative areas we have loaded above. Let's add them to the figure above to get better context (unfold the code if you are interested in seeing exactly how we do this):
Show code cell source
Now we need to know how many photographs each area contains. Our photograph table already contains the area ID, so all we need to do here is count by area and attach the count to the areas table. We rely here on the groupby operator, which takes all the photos in the table and "groups" them "by" their administrative ID. Once grouped, we apply the method size , which counts how many elements each group has and returns a column indexed on the area code with all the counts as its values. We end by assigning the counts to a newly created column in the areas table.
# Create counts
photos_by_area = tokyo.groupby("admin_area").size()
# Assign counts into a table of its own
# and joins it to the areas table
areas = areas.join(
pd.DataFrame({"photo_count": photos_by_area}),
on="GID_2"
)
The lines above have created a new column in our areas table that contains the number of photos that have been taken within each of the polygons in the table.
At this point, we are ready to map the counts. Technically speaking, this is a choropleth just as we have seen many times before:
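A hedged sketch of that choropleth (the quantile classification requires the mapclassify package and is an illustrative choice):
f, ax = plt.subplots(1, figsize=(9, 9))
# Choropleth of photo counts by administrative area
areas.plot(column="photo_count", scheme="quantiles", legend=True, ax=ax)
ax.set_axis_off()
plt.show()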
Let’s first calculate the area in Sq. metres of each administrative delineation:
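The calculation is not reproduced here; a minimal sketch, assuming a projected CRS appropriate for Tokyo (EPSG:32654, UTM zone 54N, is used purely as an illustration) and illustrative column names:
# Area in square metres requires projected coordinates
areas["area_sqm"] = areas.to_crs(epsg=32654).area
# Photographs per square metre
areas["photo_density"] = areas["photo_count"] / areas["area_sqm"]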
The pattern in the raw counts is similar to that of density, but we can see how some peripheral, large areas are “downgraded”
when correcting for their size, while some smaller polygons in the centre display a higher value.
Sometimes we either do not have any polygon layer to use or the ones we have are not particularly well suited to aggregating points into them. In these cases, a sensible alternative is to create an artificial topology of polygons that we can use to aggregate points. There are several ways to do this, but the most common one is to create a grid of hexagons. This provides a regular topology (every polygon is of the same size and shape) that, unlike circles, cleanly exhausts all the space without overlaps and has more edges than squares, which alleviates edge problems.
Python has a simplified way to create this hexagon layer and aggregate points into it in one shot thanks to the method hexbin , which is available in every axis object (e.g. ax ). Let us first see how you could create a map of the hexagon layer alone:
# Setup figure and axis
f, ax = plt.subplots(1, figsize=(9, 9))
# Add hexagon layer that displays count of points in each polygon
hb = ax.hexbin(
    tokyo["longitude"],
    tokyo["latitude"],
    gridsize=50,
    alpha=0.8,  # transparency mentioned in the text below (value assumed)
)
# Add a colorbar (optional)
plt.colorbar(hb)
<matplotlib.colorbar.Colorbar at 0x7fc4590dec50>
See how all it takes is to set up the figure and call hexbin directly using the set of coordinate columns ( tokyo["longitude"] and tokyo["latitude"] ). Additional arguments we include are the number of hexagons by axis ( gridsize , 50 for a 50 by 50 layer) and the transparency we want (80%). Additionally, we include a colorbar to get a sense of how counts are mapped to colors. Note that we need to pass the name of the object that includes the hexbin ( hb in our case), but keep in mind this is optional; you do not always need to create one.
Once we know the basics, we can dress it up a bit more for better results (expand to see code):
The actual algorithm to estimate a kernel density is not trivial, but its application in Python is rather simplified by the use of Seaborn. KDEs, however, are fairly computationally intensive. When you have a large point pattern like we do in the Tokyo example (10,000 points), its computation can take a while. To get around this issue, we create a random subset, which retains the overall structure of the pattern but with many fewer points. Let's take a subset of 1,000 random points from our original table:
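A hedged sketch of the subsampling (the random_state value is illustrative):
# Random subsample of 1,000 photographs, reproducible via random_state
tokyo_sub = tokyo.sample(1000, random_state=12345)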
Note we need to specify the size of the resulting subset (1,000), and we also add a value for random_state ; this ensures
that the sample is always the same and results are thus reproducible.
Same as above, let us first see how to create a quick KDE. For this we rely on Seaborn’s kdeplot :
sns.kdeplot(
x="longitude",
y="latitude",
data=tokyo_sub,
levels=50,
fill=True,
cmap='BuPu'
);
Once we know how the basic logic works, we can insert it into the usual mapping machinery to create a more complete plot.
The main difference here is that we now have to tell sns.kdeplot where we want the surface to be added ( ax in this
case). Toggle the expandable to find out the code that produces the figure below:
Both parameters, the maximum radius to search within ( eps , sometimes noted r) and the minimum number of points needed to form a cluster ( min_samples , sometimes noted m), need to be prespecified by the user before running DBSCAN . This is a critical point, as their values can influence the final result significantly. Before exploring this in greater depth, let us get a first run at computing DBSCAN in Python.
Basics
The heavy lifting is done by the method DBSCAN , part of the excellent machine learning library scikit-learn . Running
the algorithm is similar to how we ran K-Means when clustering. We first set up the details:
# Set up algorithm
algo = DBSCAN(eps=100, min_samples=50)
We decide to consider clusters to form around photos that have at least 50 photos within 100 metres of them, hence we set the two parameters accordingly. Once ready, we "fit" it to our data, but note that we first need to express the longitude and latitude of our points in metres (a sketch of that conversion is included below).
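The conversion cell is not part of this extract; a minimal sketch, assuming the points are in longitude/latitude (EPSG:4326) and using UTM zone 54N (EPSG:32654) as an illustrative projected CRS for Tokyo:
# Project the photo locations to metres and store the coordinates
pts = gpd.GeoSeries(
    gpd.points_from_xy(tokyo["longitude"], tokyo["latitude"]),
    index=tokyo.index,
    crs="EPSG:4326",
).to_crs(epsg=32654)
tokyo["X_metres"] = pts.x
tokyo["Y_metres"] = pts.y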
algo.fit(tokyo[["X_metres", "Y_metres"]])
DBSCAN
DBSCAN(eps=100, min_samples=50)
algo.labels_
The labels_ object always has the same length as the number of points used to run DBSCAN . Each value represents the index of the cluster a point belongs to. If the point is classified as noise, it receives a -1. Above, we can see that the first five points are effectively not part of any cluster. To make things easier later on, let us turn the labels into a Series object that we can index in the same way as our collection of points:
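A hedged sketch of that step (the same line appears again later when we re-run the algorithm with different parameters):
# Store cluster labels aligned with the index of the photo table
lbls = pd.Series(algo.labels_, index=tokyo.index)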
Now that we have the clusters, we can proceed to visualize them. There are many ways in which this can be done. We will start just by coloring points in a cluster in red and noise in grey:
# Setup figure and axis
f, ax = plt.subplots(1, figsize=(6, 6))
# Assign labels to tokyo table dynamically and
# subset points that are not part of any cluster (noise)
noise = tokyo.assign(
lbls=lbls
).query("lbls == -1")
# Plot noise in grey
ax.scatter(
noise["X_metres"],
noise["Y_metres"],
c='grey',
s=5,
linewidth=0
)
# Plot all points that are not noise in red
# NOTE how this is done through some fancy indexing, where
# we take the index of all points (tokyo) and subtract from
# it the index of those that are noise
ax.scatter(
tokyo.loc[tokyo.index.difference(noise.index), "X_metres"],
tokyo.loc[tokyo.index.difference(noise.index), "Y_metres"],
c="red",
linewidth=0
)
# Display the figure
plt.show()
This is a first good pass. The algorithm is able to identify a few clusters with a high density of photos. However, as we mentioned when discussing DBSCAN, this is all contingent on the parameters we arbitrarily set. Depending on the maximum radius ( eps ) we set, we will pick one type of cluster or another: a higher (lower) radius will translate into fewer (more) local clusters. Equally, the minimum number of points required for a cluster ( min_samples ) will affect the implicit size of the cluster. Both parameters need to be set before running the algorithm, so our decision will affect the final outcome quite significantly.
For an illustration of this, let's run through a case with very different parameter values. For example, let's pick a larger radius (e.g. 500m) and a smaller number of points (e.g. 10):
# Set up algorithm
algo = DBSCAN(eps=500, min_samples=10)
# Fit to Tokyo projected points
algo.fit(tokyo[["X_metres", "Y_metres"]])
# Store labels
lbls = pd.Series(algo.labels_, index=tokyo.index)
And let’s now visualise the result (toggle the expandable to see the code):
The output is now very different, isn’t it? This exemplifies how different parameters can give rise to substantially different
outcomes, even if the same data and algorithm are applied.
Advanced plotting
Please keep in mind this final section of the tutorial is OPTIONAL, so do not feel forced to complete it. This will
not be covered in the assignment and you will still be able to get a good mark without completing it (also, including
any of the following in the assignment does NOT guarantee a better mark).
As we have seen, the choice of parameters plays a crucial role in the number, shape and type of clusters found in a dataset. To allow an easier exploration of these effects, in this section we will turn the computation and visualization of DBSCAN outputs into a single function. This in turn will allow us to build an interactive tool later on.
def clusters(db, eps, min_samples):
    '''
    Compute DBSCAN clusters and plot them (noise in grey, clusters in red)
    Arguments
    ---------
    db          : (Geo)DataFrame
                  Table with at least columns `X_metres` and `Y_metres`
    eps         : float
                  Maximum radius to search for points within a cluster
    min_samples : int
                  Minimum number of points in a cluster
    '''
    algo = DBSCAN(eps=eps, min_samples=min_samples)
    algo.fit(db[['X_metres', 'Y_metres']])
    lbls = pd.Series(algo.labels_, index=db.index)
    # Plot (reconstructed here to mirror the earlier red/grey scatter)
    f, ax = plt.subplots(1, figsize=(6, 6))
    noise = db.loc[lbls == -1]
    ax.scatter(noise['X_metres'], noise['Y_metres'], c='grey', s=5, linewidth=0)
    ax.scatter(db.loc[lbls != -1, 'X_metres'], db.loc[lbls != -1, 'Y_metres'],
               c='red', s=5, linewidth=0)
    plt.show()
1. db : a (Geo)DataFrame containing the points on which we will try to find the clusters.
2. eps : a number (maybe with decimals, hence the float label in the documentation of the function) specifying the
maximum distance to look for neighbors that will be part of a cluster.
3. min_samples : a count of the minimum number of points required to form a cluster.
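A hedged example call (parameter values are purely illustrative):
# One line: compute and map DBSCAN clusters for the Tokyo photos
clusters(tokyo, eps=200, min_samples=30)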
Voila! With just one line of code, we can create a map of DBSCAN clusters. How cool is that?
However, this could be even more interesting if we didn’t have to write each time the parameters we want to explore. To
change that, we can create a quick interactive tool that will allow us to modify both parameters with sliders. To do this, we
will use the library ipywidgets . Let us first do it and then we will analyse it bit by bit:
interact(
clusters, # Method to make interactive
db=fixed(tokyo), # Data to pass on db (does not change)
eps=(50, 500, 50), # Range start/end/step of eps
min_samples=(50, 300, 50) # Range start/end/step of min_samples
);
Phew! That is cool, isn't it? Once past the first excitement, let us have a look at how we built it, and how you can modify it further on. A few points on this:
First, interact is a method that allows us to pass an arbitrary function (like clusters ) and turn it into an interactive
widget where we modify the values of its parameters through sliders, drop-down menus, etc.
The above results in a little interactive tool that allows us to play easily and quickly with different values for the parameters
and to explore how they affect the final outcome.
Do-It-Yourself
import os
# Opt out of PyGEOS before importing GeoPandas, to silence the deprecation
# warning about the future switch to Shapely
os.environ['USE_PYGEOS'] = '0'
import pandas, geopandas, contextily
url = (
"http://data.insideairbnb.com/china/beijing/beijing/"
"2023-03-29/data/listings.csv.gz"
)
url
abb = pandas.read_csv(url)
Alternative
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and
read it locally. To do that, you can follow these steps:
1. Download the file by right-clicking on this link and saving the file
2. Place the file on the same folder as the notebook where you intend to read it
3. Replace the code in the cell above by:
abb = pandas.read_csv("listings.csv")
Note the code cell above requires internet connectivity. If you are not online but have a full copy of the GDS course in your
computer (downloaded as suggested in the infrastructure page), you can read the data with the following line of code:
abb = pandas.read_csv("../data/web_cache/abb_listings.csv.zip")
abb.info()
Also, for an ancillary geography, we will use the neighbourhoods provided by the same source:
url = (
"http://data.insideairbnb.com/china/beijing/beijing/"
"2023-03-29/visualisations/neighbourhoods.geojson"
)
url
'http://data.insideairbnb.com/china/beijing/beijing/2023-03-29/visualisations/neighbourhoods.ge
neis = geopandas.read_file(url)
Alternative
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and
read it locally. To do that, you can follow these steps:
1. Download the file by right-clicking on this link and saving the file
2. Place the file on the same folder as the notebook where you intend to read it
3. Replace the code in the cell above by:
neis = geopandas.read_file("neighbourhoods.geojson")
Note the code cell above requires internet connectivity. If you are not online but have a full copy of the GDS course in your
computer (downloaded as suggested in the infrastructure page), you can read the data with the following line of code:
neis = geopandas.read_file("../data/web_cache/abb_neis.gpkg")
neis.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 neighbourhood 16 non-null object
1 neighbourhood_group 0 non-null object
2 geometry 16 non-null geometry
dtypes: geometry(1), object(2)
memory usage: 512.0+ bytes
url = (
"https://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/"
"ne_50m_populated_places_simple.geojson"
)
url
'https://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_50m_populated_places_simple.geojso
Let’s read the file in and keep only places from India:
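The reading cell is not shown in this extract; a minimal sketch consistent with the offline alternative below:
# Read the populated places and keep only those in India
places = geopandas.read_file(url).query("adm0name == 'India'")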
Alternative
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and
read it locally. To do that, you can follow these steps:
1. Download the file by right-clicking on this link and saving the file
2. Place the file on the same folder as the notebook where you intend to read it
3. Replace the code in the cell above by:
places = geopandas.read_file("ne_50m_populated_places_simple.geojson")
Note the code cell above requires internet connectivity. If you are not online but have a full copy of the GDS course in your
computer (downloaded as suggested in the infrastructure page), you can read the data with the following line of code:
places = geopandas.read_file(
"../data/web_cache/places.gpkg"
).query("adm0name == 'India'")
By default, place locations come expressed in longitude and latitude. Because you will be working with distances, it makes sense to convert the table into a system expressed in metres. For India, this can be the "Kalianpur 1975 / India zone I" ( EPSG:24378 ) projection.
places_m = places.to_crs(epsg=24378)
ax = places_m.plot(
color="xkcd:bright yellow", figsize=(9, 9)
)
contextily.add_basemap(
ax,
crs=places_m.crs,
source=contextily.providers.CartoDB.DarkMatter
)