Interactive Mapping in Python With UK Census Data
Introduction
In this article I will describe the process I followed to create this dashboard displaying maps of
London showing UK Census Data:
I am new to GIS and mapping, so this was a voyage of discovery – I will describe what I learned, and
some of the issues I encountered along the way, with their resolutions.
I assume familiarity with Python. The most common Python data visualization libraries are
compared in the article The Top 6 Python Data Visualization Libraries: How to choose. Their
summary was:
Matplotlib - foundation and comprehensive library, highly customizable, not very easy to
learn and use
Seaborn - easy to learn and use, good for exploratory data analysis, relatively customizable,
can integrate with Matplotlib if need more customizations
Plotly (Dash) - easy to use and customizable, create interactive graphs and web applications
Bokeh - create interactive graphs and web applications
Folium - good for nice looking map graphs
Plotnine - grammar of graphics, based on the R ggplot2 library
My approach was:
1. Download UK Census Data, and Census Ward and Local Authority geographic information (as
Shapefiles).
2. Plot a simple fixed map using GeoPandas and Matplotlib.
3. Change to use Plotly with GeoJSON geographic data.
4. Add interactive functionality using Dash.
My code is available in my GitHub repository. When I show code fragments below I will refer to the
file in the repository where it appears.
UK Census Data
The UK Office for National Statistics is responsible for the Census. The 2021 Census results are still
being prepared; the schedule is discussed here. The 2011 Census data is described and can be
downloaded here. For this exercise, I downloaded Bulk Data, which is described as:
I downloaded the Detailed Characteristics on demography and families for merged wards, local
authorities and regions. This is a 79MB Zip file, containing data in a structured CSV format, with an
Excel catalogue. I extracted the files into a subdirectory of my project data directory. Let’s look at
the code to read the data.
Tables
The different tables are described in the Excel catalogue Cell Numbered DC Tables
3.3.xlsx. The Index sheet of this file lists the tables, and I read it as follows. (This code appears
in the file census_read_data.py in the repository.)
import pandas as pd

CENSUS_DATA = 'data/BulkdatadetailedcharacteristicsmergedwardspluslaandregE&Wandinfo3.3'

f = None

def read_index():
    # Read Index sheet
    global f
    excel_file = CENSUS_DATA + '/Cell Numbered DC Tables 3.3.xlsx'
    f = pd.ExcelFile(excel_file)
    index = f.parse(sheet_name='Index')
    return index
There are 25 tables of different statistics; each has its own sheet in the catalogue. Looking at the
sheet for the first table DC1104EW, we see:
The table has categories for Residence Type, Sex and Age, with values for each combination of
category values, including All categories. The values in the table are the index of the column in the
data file for the statistic. The categories vary for each table, with one or two categories in columns,
and one or two in the rows. I could not find a way to automatically cope with these varied column
and row headings using pandas. Instead, I read the sheet into a DataFrame and then process the
headings in my own code, part of which is shown below.
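The two repetition patterns used in that code (row values repeated "in order", column values repeated "in turn") can be illustrated with a small stand-alone sketch. The category values here are invented, purely for illustration:

```python
# Hypothetical category values: two row values and three column values
rows = ['Male', 'Female']
cols = ['0-15', '16-64', '65+']

# Row values repeated in order: each value len(cols) times in a row
row_rep = [r for r in rows for _ in range(len(cols))]

# Column values repeated in turn: the whole list len(rows) times
col_rep = cols * len(rows)

print(row_rep)  # ['Male', 'Male', 'Male', 'Female', 'Female', 'Female']
print(col_rep)  # ['0-15', '16-64', '65+', '0-15', '16-64', '65+']
```

Zipping the two expanded lists together yields one entry per combination of category values, which is exactly the shape the resulting DataFrame needs.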
def read_table(table_name):
    # Read Table sheet, transforming it into a DataFrame with columns
    # for each category and the Dataset index
    if f is None:
        exit("Call read_index() to open table list")
    table = f.parse(sheet_name=table_name, header=None)
    # ... code to extract the row/column level names and values elided ...
    # Construct DataFrame from column and row level names and values
    num_cols = len(col_level_values[0])
    num_rows = len(row_level_values[0])
    for l in range(row_levels):
        # Repeat row values in order
        row_level_values[l] = [x for x in row_level_values[l]
                               for n in range(num_cols)]
    for l in range(col_levels):
        # Repeat col values in turn
        col_level_values[l] = col_level_values[l] * num_rows
    values = []
    for r in range(0, num_rows):
        row = data_row_indexes[r]
        values.extend(table.iloc[row, 1:])
    values = [str(v).zfill(4) for v in values]
    data = [row_level_values[l] for l in range(row_levels)] + \
        [col_level_values[l] for l in range(col_levels)] + \
        [values]
    index = row_level_names + col_level_names + ['Dataset']
    df = pd.DataFrame(
        data=data,
        index=index
    )
    df = df.transpose()
    return df
Each row in the DataFrame identifies the Dataset for a combination of the category values. The
first row, with the value All for the three categories identifies Dataset 0001, which corresponds
to the column DC1104EW0001 in the data file DC1104EWDATA.CSV.
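That mapping from a Dataset value to a data-file column name is simple string concatenation. As an illustration (dataset_column is a hypothetical helper, not a function from the repository):

```python
def dataset_column(table_name, dataset):
    # Data-file column names are the table name followed by the
    # 4-digit, zero-padded Dataset index
    return table_name + dataset

# read_table() produces Dataset values as zero-padded strings
dataset = str(1).zfill(4)
print(dataset_column('DC1104EW', dataset))  # DC1104EW0001
```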
Data Files
Each table has many CSV files; the file we load for DC1104EW is DC1104EWDATA.CSV. As
explained above, this data has columns for the different combinations of the category values. The
rows have the counts for different geographical areas. We read the data as follows:
def read_data(table_name):
    datafile = CENSUS_DATA + '/' + table_name + 'DATA.CSV'
    df = pd.read_csv(datafile)
    return df
Geography
The census data is summarised by Merged Ward and by Local Authority District. Merged Wards refer
to Electoral Wards, where a few small wards are merged to protect privacy. Merged Wards are
assigned to Local Authorities, which are themselves assigned to Regions. The geography data is
published on the Open Geography Portal in a number of formats:
Shapefile – A geospatial vector data format for geographic information system (GIS) software. It
is developed and regulated by Esri.
GeoJSON – An open standard format designed for representing simple geographical features,
along with their non-spatial attributes, based on JSON.
KML - Keyhole Markup Language is an XML format developed for use with Google Earth.
The Shapefile format is much more compact than GeoJSON, and is supported by GeoPandas, (see
below), so this is what I chose to download. (Plotly requires GeoJSON, which I created from the
Shapefiles later.)
The geography data for Wards and Local Authority Districts (LADs) that I used is on the Open
Geography Portal under the menu options Boundaries | Census Boundaries | Census Merged Wards
and Boundaries | Administrative Boundaries | Local Authority Districts. The files I downloaded were:
The Shapefiles I downloaded are high resolution, so large: about 120MB and 40MB respectively. The
portal has lower resolution versions that are a tenth of the size if you prefer to use those.
(Alternatively, you could use a site like mapshaper.org to compress the files to your preferred
resolution.)
These Shapefiles have the Coordinate Reference System OSGB36 / British National Grid. In order to
map the data we need to convert it to EPSG:4326 (aka WGS84); we will see this in the code below. (I
am afraid that it took me a long, frustrating time, during which no maps were displayed by Plotly, to
find this out!)
I used the lookup CSV file to create a geography lookup DataFrame, with rows for Merged
Wards and Local Authority Districts, and columns GeographyCode and Name:
def read_geography():
    # Get Census Merged Ward and Local Authority Data
    lookupfile = 'data/Ward_to_Census_Merged_Ward_to_Local_Authority_District_(December_2011)_Lookup_in_England_and_Wales.csv'
    cmwd = pd.read_csv(lookupfile, usecols=[
        'CMWD11CD', 'CMWD11NM', 'LAD11CD', 'LAD11NM'])
    cmwd.drop_duplicates(inplace=True)
    locationcol = "GeographyCode"
    cmwd[locationcol] = cmwd['CMWD11CD']
    namecol = 'Name'
    cmwd[namecol] = cmwd['CMWD11NM']
    lad = pd.read_csv(lookupfile, usecols=['LAD11CD', 'LAD11NM'])
    lad = lad.drop_duplicates()
    lad[locationcol] = lad['LAD11CD']
    lad[namecol] = lad['LAD11NM']
    lad['CMWD11CD'] = ''
    lad['CMWD11NM'] = ''
    geography = pd.concat([cmwd, lad])
    return geography
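To see how this lookup DataFrame gets used, here is a minimal sketch with invented codes and counts: a data table is merged onto the geography names on GeographyCode, as the mapping scripts later in this article do.

```python
import pandas as pd

# Invented geography rows: one Merged Ward and one Local Authority
geography = pd.DataFrame({
    'GeographyCode': ['E36000001', 'E09000099'],
    'Name': ['Some Ward', 'Some LAD'],
})

# Invented census counts keyed by GeographyCode
data = pd.DataFrame({
    'GeographyCode': ['E36000001', 'E09000099'],
    'DC1104EW0001': [1200, 7350],
})

# Merge attaches the human-readable Name to each data row
merged = pd.merge(data, geography, on='GeographyCode')
print(merged['Name'].tolist())  # ['Some Ward', 'Some LAD']
```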
GeoPandas and Matplotlib
Having downloaded the data, we are ready to produce our first map! While I downloaded data for
the whole of England and Wales, I will restrict the mapping to London for simplicity.
GeoPandas is an open source project to make working with geospatial data in Python easier.
GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types.
Geometric operations are performed by shapely. Geopandas further depends on Fiona for file access
and Matplotlib for plotting.
My first attempt to install GeoPandas on Windows failed. Because it depends on packages that are
implemented in C/C++, special procedures are required to install it on Windows; these are described
in the Appendix below. (Apparently the install is straightforward on Linux and Mac.)
import geopandas as gpd

def read_london_lad_geopandas():
    # Get Local Authority Boundaries as GeoPandas
    shapefile = 'data/Local_Authority_Districts_(December_2011)_Boundaries_EW_BFC/Local_Authority_Districts_(December_2011)_Boundaries_EW_BFC.shp'
    gdf = gpd.read_file(shapefile)
    # Convert coordinates
    gdf.to_crs(epsg=4326, inplace=True)
    # London
    ladgdf = gdf[gdf['lad11cd'].str.startswith('E090000')]
    return ladgdf
GeoPandas loads the Shapefile; we then convert the co-ordinates as discussed above, and filter the
rows to the LAD codes for London.
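That filter relies on the fact that the London borough codes share a common prefix. The same pattern can be shown with plain pandas and invented rows (the names and codes below are illustrative; only the prefix convention matters):

```python
import pandas as pd

# Invented LAD rows: two London boroughs and one authority outside London
gdf = pd.DataFrame({
    'lad11cd': ['E09000001', 'E09000002', 'E06000001'],
    'lad11nm': ['City of London', 'Barking and Dagenham', 'Hartlepool'],
})

# Keep only rows whose code has the London prefix
london = gdf[gdf['lad11cd'].str.startswith('E090000')]
print(london['lad11nm'].tolist())  # ['City of London', 'Barking and Dagenham']
```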
# Read the data table (all data items) and merge with the LAD geo data
df = crd.read_data(table_name)
gdf = london_lads_gdf.merge(df, left_on='lad11cd', right_on='GeographyCode')
# Create a Matplotlib plot, turn off axes, position color bar, and plot
fig, ax = plt.subplots(1, 1)
ax.set_axis_off()
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.0)
gdf.plot(column=datacol, ax=ax, legend=True, cax=cax)
plt.show()
This displays a window with a choropleth map, colour coded according to the first dataset for the
table DC1104EW, i.e. the column DC1104EW0001:
def read_london_lad_geojson():
    # Get LAD GeoJSON
    london_jsonfile = "data/json_files/London_LAD_Boundaries.json"
    if not os.path.exists(london_jsonfile):
        lad_jsonfile = "data/json_files/Local_Authority_Districts_(December_2011)_Boundaries_EW_BFC.json"
        if not os.path.exists(lad_jsonfile):
            # Get Census LAD Boundaries as GeoPandas
            shapefile = 'data/Local_Authority_Districts_(December_2011)_Boundaries_EW_BFC/Local_Authority_Districts_(December_2011)_Boundaries_EW_BFC.shp'
            ladgdf = gpd.read_file(shapefile)
            # Convert coordinates
            ladgdf.to_crs(epsg=4326, inplace=True)
            # Write GeoJSON
            ladgdf.to_file(lad_jsonfile, driver='GeoJSON')
        with open(lad_jsonfile) as f:
            census_lads = json.load(f)
        london_lads = census_lads
        london_lads['features'] = list(filter(
            lambda f: f['properties']['lad11cd'].startswith('E090000'),
            london_lads['features']))
        with open(london_jsonfile, 'w') as f:
            json.dump(london_lads, f)
    else:
        with open(london_jsonfile) as f:
            london_lads = json.load(f)
    return london_lads
There are two cached JSON files: the London LAD Boundaries, and the complete LAD Boundaries.
The JSON files are simply converted from the Shapefile data via GeoPandas.
The first Plotly map uses the GeoJSON files. (Code in census_plotly_script.py.)
# Read the data table (all data items) and merge with the geography names
df = crd.read_data(table_name)
df = pd.merge(df, geography, on=locationcol)

fig = px.choropleth(london_lad_df,
                    geojson=london_lads,
                    locations=locationcol,
                    color=datacol,
                    color_continuous_scale="Viridis",
                    range_color=(0, max_value),
                    featureidkey=key,
                    scope='europe',
                    hover_data=[namecol],
                    title=title
                    )
fig.update_geos(
    fitbounds="locations",
    visible=False,
)
fig.update_layout(margin=dict(l=0, r=0, b=0, t=30),
                  title_x=0.5,
                  width=1200, height=600)
fig.show()
The px.choropleth function call maps the london_lads GeoJSON, colouring the map
according to the datacol (“DC1104EW0001”) column of the london_lad_df, matching
the GeoJSON feature property key (“properties.lad11cd”) with the locationcol
("GeographyCode") column.
The update_geos function call sets the bounds of the map from the displayed locations, and
hides the underlying map.
The update_layout function call reduces the margin around the map, and specifies the
width and height.
The code specifies the width and height because the default size is smaller and the default aspect
ratio is inappropriate. However, the map is still surrounded by a large amount of white space. I
improved this by manually specifying the bounds for the map:
from turfpy.measurement import bbox
from functools import reduce

def compute_bbox(gj):
    # Compute bounding box for GeoJSON
    gj_bbox_list = list(
        map(lambda f: bbox(f['geometry']), gj['features']))
    gj_bbox = reduce(
        lambda b1, b2: [min(b1[0], b2[0]), min(b1[1], b2[1]),
                        max(b1[2], b2[2]), max(b1[3], b2[3])],
        gj_bbox_list)
    return gj_bbox

gj_bbox = compute_bbox(london_lads)

fig.update_geos(
    # fitbounds="locations",
    center_lon=(gj_bbox[0]+gj_bbox[2])/2.0,
    center_lat=(gj_bbox[1]+gj_bbox[3])/2.0,
    lonaxis_range=[gj_bbox[0], gj_bbox[2]],
    lataxis_range=[gj_bbox[1], gj_bbox[3]],
    visible=False,
)
The package turfpy.measurement provides a function bbox to compute the bounding box
for a GeoJSON feature geometry.
The function compute_bbox computes the bounding box for each feature and reduces the list
of bounding boxes to compute the combined bounding box.
Then update_geos uses the box to specify the center and longitude and latitude ranges for
the map.
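The combining step does not depend on turfpy; it is a plain reduce over [min_lon, min_lat, max_lon, max_lat] boxes. A stand-alone sketch with invented per-feature bounding boxes:

```python
from functools import reduce

# Invented per-feature bounding boxes: [min_lon, min_lat, max_lon, max_lat]
gj_bbox_list = [
    [-0.5, 51.3, 0.0, 51.6],
    [-0.2, 51.4, 0.3, 51.7],
]

# Combine: take the outermost extent of all boxes
gj_bbox = reduce(
    lambda b1, b2: [min(b1[0], b2[0]), min(b1[1], b2[1]),
                    max(b1[2], b2[2]), max(b1[3], b2[3])],
    gj_bbox_list)

print(gj_bbox)  # [-0.5, 51.3, 0.3, 51.7]
```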
Dash
Dash is built on top of Plotly and “abstracts away all of the technologies and protocols that are
required to build a full-stack web app with interactive data visualization”. I used Dash to allow
selection of the table, dataset and granularity for the map.
In my first attempt I used the Dash Core Components to add the selection controls. While functional,
the appearance was not great, so I switched to using Dash Bootstrap Components, which give the
controls a consistent Bootstrap look and feel without needing CSS expertise.
The first version just allowed selection of the map granularity, adding these controls:
The Granularity radio items change the granularity of the map between Local Authority, which we
have seen so far, and Ward, for a more detailed map. The dropdown selection optionally specifies
the Local Authority for a Ward map.
# Read the data table (all data items) and merge with the geography names
df = crd.read_data(table_name)
df = pd.merge(df, geography, on=locationcol)

# Dash

def blank_fig():
    # Blank figure for initial Dash display
    fig = go.Figure(go.Scatter(x=[], y=[]))
    fig.update_layout(template=None)
    fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False)
    fig.update_yaxes(showgrid=False, showticklabels=False, zeroline=False)
    return fig

def compute_bbox(gj):
    # Compute bounding box for GeoJSON
    gj_bbox_list = list(
        map(lambda f: bbox(f['geometry']), gj['features']))
    gj_bbox = reduce(
        lambda b1, b2: [min(b1[0], b2[0]), min(b1[1], b2[1]),
                        max(b1[2], b2[2]), max(b1[3], b2[3])],
        gj_bbox_list)
    return gj_bbox

local_authorities = london_lad_ids
all_local_authorities = ['All'] + local_authorities

map_controls = dbc.Card(
    [
        dbc.Row([
            dbc.Label("Granularity", html_for="granularity", width=2),
            dbc.Col(
                [
                    dbc.RadioItems(
                        id='granularity',
                        options=[{'label': i, 'value': i}
                                 for i in ['Local Authorities', 'Wards']],
                        value='Local Authorities',
                        inline=True
                    ),
                ],
                width=8
            )
        ]),
        dbc.Row([
            dbc.Label('Wards for Local Authority',
                      html_for="local-authority", width=2),
            dbc.Col(
                [
                    dbc.Select(
                        id='local-authority',
                        options=[
                            {'label': 'All' if i == 'All'
                             else geography[geography[locationcol] == i][namecol].iat[0],
                             'value': i}
                            for i in all_local_authorities],
                        value='All'
                    )
                ],
                width=8
            )
        ]),
    ]
)

app.layout = dbc.Container(
    [
        html.H1("Census Data"),
        html.Hr(),
        dbc.Col(
            [
                dbc.Row(map_controls),
                dbc.Row(dcc.Graph(id='map', figure=blank_fig()),
                        class_name='mt-3'),
            ],
            align="center",
        ),
    ],
    fluid=True,
)

@app.callback(
    Output('map', 'figure'),
    Input('local-authority', 'value'),
    Input('granularity', 'value'),
)
def update_graph(local_authority, granularity):
    # ... selection of the GeoJSON (gj) and data (fdf) for the chosen
    # granularity and local authority elided ...
    gj_bbox = compute_bbox(gj)
    fig = px.choropleth(fdf,
                        geojson=gj,
                        locations=locationcol,
                        color=datacol,
                        color_continuous_scale="Viridis",
                        range_color=(0, max_value),
                        featureidkey=key,
                        scope='europe',
                        hover_data=[namecol, 'LAD11NM'],
                        title=title
                        )
    fig.update_geos(
        center_lon=(gj_bbox[0]+gj_bbox[2])/2.0,
        center_lat=(gj_bbox[1]+gj_bbox[3])/2.0,
        lonaxis_range=[gj_bbox[0], gj_bbox[2]],
        lataxis_range=[gj_bbox[1], gj_bbox[3]],
        visible=False,
    )
    fig.update_layout(margin=dict(l=0, r=0, b=0, t=30),
                      title_x=0.5,
                      width=1200, height=600)
    return fig

if __name__ == '__main__':
    app.run_server(debug=True)
This script reads the Ward data, in addition to the LAD data, using
crd.read_london_ward_geojson().
The Dash functionality starts after the comment # Dash.
The initial figure displayed by Dash is a chart, which I did not want to see, so the function
blank_fig() creates a blank figure to use as the initial display. (Credit to this Stack Overflow
answer.)
The call to dash.Dash() creates the Dash application, using the standard Bootstrap
stylesheet.
The assignment to map_controls creates the selection controls.
The assignment to app.layout creates a container for the page with heading,
map_controls and placeholder for the map initially showing the blank figure.
The @app.callback decorator defines a callback function that is called when either of the
selection controls is updated, returning the updated figure.
The callback builds the map, using Plotly as before, using the data appropriate to the controls.
One minor change is the hover_data=[namecol, 'LAD11NM'], which adds the LAD
name to the hover display.
app.run_server() starts the Dash server. In my environment it is accessed on
http://127.0.0.1:8050/ and displays this page:
If I change the granularity to Ward and select the Local Authority Ealing I get this map; the image
shows the hover text with the LAD name:
Next I added controls to select the table and dataset. This required many more inputs and outputs
on the callback. Ideally, I would like to have multiple callbacks chained together as described in Dash
Basic Callbacks. However, callbacks must be stateless, so it is not possible for the table selection to
update global state with the table data to be displayed on the dataset selection. So, I ended up with
one large callback that does everything.
The code is in census_dash_script_full.py. I will not reproduce it here, but summarise the
changes:
Appendix: Installing GeoPandas on Windows
The article Using geopandas on Windows by Geoff Boeing is referenced as the definitive explanation
of how to install GeoPandas on Windows. The comments on the article have many suggestions on
how best to proceed. The essential advice is to use pipwin, which installs unofficial Python package
binaries for Windows provided by Christoph Gohlke here. These are the steps I followed: