Spatial SQL
A Practical Approach to Modern GIS Using SQL
Matthew Forrest
Credits & Copyright
Spatial SQL
A Practical Approach to Modern GIS Using SQL
by Matthew Forrest
Contents

1 Introduction
  1.1 Goals for this book
  1.2 Approach
  1.3 Skills you will need
  1.4 What we will cover and outcomes for you
  1.5 Data and files for exercises

Part 1: Getting Started

1 Why SQL?
  1.1 Evolution to modern GIS
  1.2 Why learn spatial SQL?
  1.3 To use SQL, Python, or both?
  1.4 Spatial SQL landscape
  1.5 The landscape today
  1.6 Expert Voices: Uchenna Osia

2 Setting Up
  2.1 Setting up PostGIS
  2.2 Why Docker?
  2.3 Installing PostGIS with Docker
  2.4 Installing docker-postgis
  2.5 Expert Voices: Getu Abdissa

3 Thinking in SQL
  3.1 Moving from desktop GIS to SQL
  3.2 Importing data
  3.3 Database organization and design
  3.4 Using PostGIS indexes
  3.5 Projections
  3.6 Thinking in SQL
  3.7 Optimizing our queries and other tips
  3.8 Using pseudo-code and "rubber ducking"
  3.9 Expert Voices: Giulia Carella, PhD

4 SQL Basics
  4.1 Importing Data to PostGIS
  4.2 ogr2ogr
  4.3 SQL Data Types
  4.4 Characters
  4.5 Numeric
  4.6 Dates and Times
  4.7 Other data types
  4.8 Basic SQL Operators
  4.9 Aggregates and GROUP BY

Part 2: Learning Spatial SQL

1 Advanced SQL Topics for Spatial SQL
  1.1 CASE/WHEN Conditionals
  1.2 Common Table Expressions (CTEs) and Subqueries
To my two amazing children and my more amazing wife,
I couldn’t do anything without you.
And to all the individuals who helped me along the way or
helped answer one of my questions, thank you.
Foreword
Many say that about 80% of our data includes a location aspect. This idea has been important in the
geospatial industry for a long time. It shows that knowing where something happens is key. In the
past, working with this kind of data was mostly done in special GIS software. But now, things are
changing. The world of geography is stepping out of its silo, becoming a regular part of everyone’s
database.
A big part of this change comes from SQL, the common language of data analysis. SQL helps differ-
ent technologies and products work together. Adding spatial features to modern data warehouses is
making it even more popular.
I remember when Matt Forrest started working with us. His growth from a beginner in SQL to a leader
in the field reflects how SQL itself has changed. It used to be a tool for specialists, but now it’s a key
language in data analysis. This change is important for three reasons:
Making Spatial Analysis Available to More People: With SQL, more people can use spatial data. It’s not
just for experts anymore. This means more people can understand and use information about locations.
Bringing Geography into Everyday Analysis: As SQL becomes more common, spatial data is no longer
unusual. It’s part of regular data analysis, making it easier to understand all kinds of data together.
Staying Current with New Technology: Big data platforms like BigQuery, Snowflake, and Databricks
are using SQL for spatial data and adding new tech like AI. This keeps Spatial SQL relevant now and
in the future.
Looking ahead, as understanding locations becomes more important in different areas, ’Spatial SQL’ is
the perfect guide. This book invites you to learn and be part of this exciting change.
Javier de la Torre
Founder, CARTO
Preface
Welcome to the book you have been waiting for: one that bridges the gap between status quo GIS and the
next level of data interaction with enterprise tooling.
To truly push the limits of geospatial technology today, you really need to know how to do some coding
and some geospatial data processing. Just knowing one or the other will reduce your opportunities
greatly.
Likewise, if you do GIS but do not know SQL (or coding) then you will be limited to using a GUI for
everything. That might not sound too bad but errors in GUI-based workflows are incredibly hard to
fix when using mouse clicks on the screen. Consulting businesses incur significant losses due to this
limitation.
Adopting SQL into your workflow is a simple way of scripting without having to learn a full program-
ming language, making it a good fit for slowly moving toward more advanced spatial data handling.
Matt’s experience is both broad and deep with regards to GIS and SQL. Many leading geospatial users
and companies adopted databases as part of their overall infrastructure. As open source databases
became popular and spatial addons were made available, many system architects wanted to see all
types of data move into the database.
Getting data into a database is only the beginning. As Matt shows in this book, working with spatial
and geometry data in a system like PostGIS gives you access to more advanced capabilities in a structured
way. After getting PostGIS set up, you’ll get a well-grounded education in the real power that comes
with understanding spatial relationships and the flexibility of applying spatial analysis to tabular, vec-
tor, and raster data.
If you have always thought you should be better at using SQL for spatial analysis (and not just for data
warehousing), this is the book for you. If you have lamented having so many GIS files sitting around on
disks, this book will help you get that loaded and start extracting value. If your SQL skills are already
strong but you need to access this new "spatial" stuff, read on.
As always, thank you for supporting our work at Locate Press. Our goal is to help raise awareness
and education around open source geospatial topics. Keep in touch if you have a book idea, or want to
discuss how to support training using our material.
Tyler Mitchell
Publisher
1. Introduction
The year was 2012, and I was sitting in an office on the border of Soho and Little Italy on Lafayette
Street at the then CartoDB offices, looking at code on a screen that I had never seen before.
This was my first introduction to spatial SQL. Up until this point my experience with GIS was all
within a desktop environment. I had graduated two years earlier with an undergraduate degree in
Geography and had started to use QGIS to do some different spatial analysis and mapping projects.
My programming experience was limited to writing a few lines of not-so-great JavaScript and, believe
it or not, ActionScript, a scripting language used with Flash (which no longer exists).
Sitting there, watching the speed and ease of data being changed with a few edits to queries was
mesmerizing, especially coming from a mostly desktop "point and click" background. Immediately I
could see the power of spatial SQL and how useful it could be for so many types of spatial analysis and
data manipulation.
At the same time, I had a ton of questions. How do you load a Shapefile into this thing? What does the
star mean? How do you save the results? And how is it so fast?
From that point I slowly (very slowly at times) started my journey to learn spatial SQL. I didn’t learn
spatial SQL at any specific time or using any particular method, but rather through various projects. With
each one I tried to see if I could accomplish the task with SQL, and the answer was almost always yes. Big spatial
joins, find the three nearest neighbors, find the average median income of block groups in zip codes
that have an overlap greater than half the block group area, find the nearest point on a line from another
point, aggregate data by state while preserving aggregated data by state and by category in JSON, and
the list goes on. As time went on I felt more and more free: if I could think it, I could do it in a few
minutes with spatial SQL. I no longer needed to rely on the availability of an analysis in a desktop tool
or accept the performance limits that came along with it.
Most of my SQL knowledge was learned by trial and error, help from colleagues, and lots of Google and
Stack Overflow searches. While there are a number of books and courses on spatial SQL, oftentimes
they focus on the implementation and management of spatial databases rather than the benefits and
use cases for using spatial SQL to power spatial analysis.
The goal of this book is to fill that gap, to help you use spatial SQL for geospatial analysis, and get you
using it proficiently, as fast as possible. I also want to give you ideas for the wide range of things you
can do with spatial SQL. We will cover a lot of ground in the book, but there are nearly endless ways
to use spatial SQL.
My singular goal for this book is to help you learn spatial SQL for spatial analysis so that you can start
using it in your daily work. With that in mind, the book will focus on the functional use of spatial SQL
for you as an end user rather than the owner or manager of the database. Given that, we will not be
focusing on topics like database administration and resource optimization, and will only lightly touch
on topics like query optimization and plans.
The first chapters will focus on helping you understand why spatial SQL is important and growing
in importance within geospatial careers and education. They will also help you lay the groundwork for
thinking in SQL, so you can write better queries and translate your current GIS knowledge into a database
language.
Next, we will start to learn basic SQL, which will help you learn all the ways you can use it to query,
structure, and manipulate non-spatial data. We won’t focus on every topic in great detail, but this will
give you the knowledge to understand the foundations of SQL and the skills to learn and expand your
knowledge on your own as needed.
Then we will start our work with spatial SQL, beginning with the GEOMETRY and GEOGRAPHY data types
that make spatial SQL, spatial. We will cover the various functions and building blocks that you need to
use within a GIS toolkit: manipulating geometries, measurements, relationships, aggregates, and more.
From there we will focus on a set of use cases for some real world problems to show you how to do
common GIS operations like nearest neighbors or spatial joins within SQL, as well as more advanced
analysis use cases from clustering to spatial auto-correlation. We will break down each of the steps
along the way to show you the building blocks, so you can develop a functional knowledge and toolkit
to design and build your own analyses. During the course of the book we
will also use PostGIS extensions such as pgRouting for routing analysis using road network data and
h3-postgis, an extension to use the H3 spatial index library.2
1.2 Approach
All the examples and tools we will use in this book will work on any computer, any operating system,
and with or without internet (you will need an internet connection to download the data from the
repo and the tools to set up your database, but after that it will work without internet access). As I
said earlier, the goal is to help you learn practical analysis skills in using spatial SQL, so with that in
mind we will be setting up a PostGIS database, the leading open source spatial extension of the highly
popular, also open source, database PostgreSQL.
To run PostGIS we will use Docker to ensure compatibility across operating systems and other installed
libraries. We will use several tools to visualize, import, and work with our data. The primary tools will
be pgAdmin to query and visualize data and GDAL to import/export data. We will also explore other
tools such as KeplerGL and QGIS to explore our results.
1.3 Skills you will need
Even if you have never written a line of code in your life, you can use this book and start to learn
spatial SQL. All you need is a willingness to learn, practice, and occasionally make mistakes (mistakes
are good, they help you practice and learn along the way). It is written to start from a blank slate, with
all the relevant commands and code you will need to run. With that said, this book cannot possibly
cover everything you will need to know, but it will build the foundation for you to continue learning
past these pages.
I will also share some tools to help you become proficient in thinking through programming problems
such as learning how to debug, using pseudo-code to think through what you want to accomplish and
how to do it efficiently in plain text, working through practical exercises, and how to use tools to find
2 See pgRouting: A Practical Guide by Regina Obe and Leo Hsu at https://locatepress.com/book/pgr
answers to common problems (and yes this includes Google and Stack Overflow, as everyone uses
these tools). Some of our exercises will also have issues built into them by design (with answers of
course) to help you think through common problems and build proficiency and problem-solving skills
in writing spatial SQL.
1.4 What we will cover and outcomes for you
The book is divided into three sections:
1. Getting Started: Why SQL and learning basic and advanced SQL
2. Learning Spatial SQL: The GEOMETRY, Spatial functions, Spatial relationships, and Spatial anal-
ysis
3. Use Cases & Using Spatial SQL: Suitability analysis, using raster data, spatial auto-correlation
and optimization, advanced spatial analytics, and routing
Our first section is foundational and important for setting up success later on. This section focuses on
why spatial SQL is important and the wide range of use cases you can use spatial SQL for. In my experi-
ence these are the areas that I have seen many individuals struggle with: setting up a database, moving
data into the database, and getting beyond basic queries. Understanding the spatial SQL landscape,
actually setting up a database, understanding the different terms and jargon, seeing the advantages to
doing things in spatial SQL compared to a desktop GIS, importing data, and thinking SQL-ly are all
challenges one faces in the early stages of learning SQL. These topics will help you overcome those
common hurdles and build the base for success later on in the book. This section covers using func-
tions, managing and manipulating data including CRUD operations and table management. Then we
will cover some advanced SQL topics that are particularly useful for spatial SQL such as joins, common
table expressions (CTEs), window functions, using arrays and JSON, and more.
Next, we will start to learn spatial SQL. It is at this point that many other learning resources begin,
assuming that you have the knowledge and setup already established. We will only jump into actual
spatial SQL after we have built the foundations. All of these foundational elements up to this point
will set us up to use our already existing geospatial knowledge to quickly use spatial SQL basics. All
the things you know and use today will be easily translated into SQL: managing geometries, buffers,
centroids, measurements, spatial relationships, and more. We will also look at advanced topics like
clustering, spatial indexes, indexing tables/geometries to improve performance, etc.
Our last section focuses on putting this into practice. You have learned the tools, SQL, and spatial SQL
- now comes the most important part which is applying this to your work. We will focus on exercises
to do this starting with some simple patterns all the way to advanced topics like isovists and bulk
nearest neighbor joins. We will also investigate wider uses for spatial SQL like using pgRouting to
analyze network data and H3 spatial indexes to increase performance. We will also showcase different
projects where these things have been implemented to help you think about how this can be used in
your day-to-day work.
The outcomes you should walk away with from this book are:
1. An understanding of spatial SQL, how it has come to be, and applicability in the GIS and geospa-
tial fields.
2. The ability to create and set up a basic PostGIS instance.
3. The ability to connect to and query data with pgAdmin and QGIS.
Learning spatial SQL has been one of the most valuable skills that I have learned in my career, and I am
passionate about helping others learn it too, because I know how much it can not only help your career but also
accelerate work on the important problems being solved with GIS and geospatial tools today. And with that,
let’s begin!
1.5 Data and files for exercises
We have several resources for you to access to keep in touch and follow along with the book. Our
dedicated book website is spatial-sql.com.3
Data files for the exercises and examples in this book are available for download from the Locate Press
website at loc8.cc/sql/data.4
Code samples can be browsed and copied from our book’s code website at spatialsqlbook.com.5
Access the Locate Press bookstore page at: locatepress.com/book/spatial-sql6 and send any ques-
tions through the contact page at locatepress.com/contact.
3 https://spatial-sql.com/
4 https://loc8.cc/sql/data
5 https://spatialsqlbook.com
6 https://locatepress.com/book/spatial-sql
Matt has dozens of videos on YouTube teaching
contemporary GIS topics. Subscribe today to follow
along and stay up to date.
youtube.com/@MattForrest/
Part 1
Getting Started
1. Why SQL?
If you are just taking your first steps into learning more technical, open-source, or modern GIS, then
you have likely faced the question of what tools or languages to start with. This is a question I receive
all the time: "what should I learn and in what order?" While there isn't a right or wrong answer, this
chapter will cover the reasons why I believe spatial SQL is well worth the investment of your time.
1.1 Evolution to modern GIS
To begin, we need to start with an assumption: GIS is changing and moving from a traditional GIS
framework to a more modern GIS approach. My definition of modern GIS is as follows:
Modern GIS is the process, systems, and technology used to derive insights from geospatial data. Modern GIS
uses open, interoperable, and standards-based technology. It can be run locally or in the cloud and can scale to
work with different types, velocities, and scales of data.
We are now entering a new phase of geospatial technology predicated on a more open and scalable
paradigm, compared to traditional GIS, which depends on closed, proprietary technology with limited
scalability. This chart compares the differences between the two approaches:
              Traditional                                  Modern
Standards     Platform and software-based                  Open and standards-based
Cloud Access  Cloud-hosted or on-premises                  Cloud-native
Deployment    Local software package up to enterprise      Open-source local use up to full
              software packages                            enterprise
Collaboration Siloed                                       Interoperable
Scalability   Single-threaded                              Serverless
Data          Limited data scale                           Scalable, even further in the cloud
This shift is driven by three main changes: two within geospatial and one within the technology space
in general:
There are numerous reports that show exactly how the geospatial market is expected to grow and
expand in the years ahead. The most recent report from MarketsAndMarkets released in 2022 states
that the global geospatial analytics market will grow from $59.5 billion in 2022 to $107.8 billion in 2026,
with a Compound Annual Growth Rate (CAGR) of 12.6%.7
While these numbers indicate massive growth in our industry, the report has some other points that
showcase the specific ways the market is growing. First, is the increase in the volume of geospatial
data:
"The advent of new technologies, such as cloud services, embedded sensors, and social media, as well as the quick
uptake of geospatial technologies by many industries, are making the mapping and analysis of data highly compli-
cated. The introduction of Big Data analytics along with GIS had resulted in increasing growth opportunities for
the emerging geospatial analytics vendors, as big data analytics can process large massive amounts of collected
data in the quickest amount of time possible, thereby facilitating business intelligence."8
Various changes in technology and data collection have only increased the amount of data available
to those using GIS. Growth in data generated from mobile devices, more frequent and detailed imagery,
more data providers, and larger publicly available datasets such as OpenStreetMap are all contributing
factors. This continued growth in available data includes the frequency with which data is updated (or
the time-series factor), the total number of rows of data, and the complexity and size of the geometry
attached to the data.
The report also points to the increase in data as one of the key drivers behind the growing need for
Extract, Transform, Load (ETL) services in geospatial.
"Among the solutions, the data integration and ETL segment is estimated to grow with the highest CAGR during
the forecast period. . . The data obtained can be linked to location data and analyzed with the help of location
analytics. ETL helps in extracting geographic data from any source system, transforming it into a format based
on users’ needs, and loading it in target systems."9
As we will see during the course of this book, spatial SQL can be used to manage, transform, aggregate,
and analyze data so it can be more easily consumed by end users. The combination of increased data
sizes and the critical role that geospatial data is playing in modern analytics points to a crucial need for
skills in spatial SQL.
In the past decade, open source geospatial tools have increased in popularity and usage, both within
and outside geospatial circles. While open source has been integral to the development and advancement
of GIS for years, its popularity has increased largely due to technology shifts outside of the geospatial
industry.
In years prior, if you wanted to use an open source toolkit (let’s use PostGIS for example), you or
your organization would be in charge of the maintenance, security, deployments, updates, and overall
management of that service. Not only that, you, or someone on your team, would have to have the
skills to manage that service. For organizations that are concerned about consistency and security, this
presented a risk that generally pushed users towards a more commercial or closed-source solution.
Over time, however, some of these roadblocks have changed or have been alleviated in some way. With
containerized services like Docker and orchestration tools like Kubernetes, deploying an instance of
PostGIS has become easier and far more consistent, requiring less time and fewer specialized skills.
Additionally, cloud providers have made it far easier to use a PostGIS database in their cloud
7 https://loc8.cc/sql/bloomberg-markets
8 https://loc8.cc/sql/marketsandmarkets-geospatial
9 https://loc8.cc/sql/prnewsite-markets
with services dedicated to PostgreSQL with PostGIS while also investing in features like security and
data replication, which makes the proposition for using open source in the cloud more attractive.
This quote from Eddie Pickle from Maxar describes how open source geospatial tools also enable more
viewpoints and collaboration compared to closed source technology:
"In the proprietary world, software intellectual property (IP) is tightly controlled and only available to the devel-
opers the IP owner allows (usually company employees). This creates a pernicious situation where the company
has a strictly limited developer roster, and developers cannot work on the software anywhere else—so developers
and owners are both isolated and stuck with each other! Open source’s inherent ability to improve collaboration
in turn stimulates and accelerates innovation. The open level of interaction among developers and organizations
means open source software is available to people with different points of view—diverse backgrounds, goals, per-
spectives, expertise—spurring creativity. It also means the openness of the software creates the conditions for a
rich ecosystem for development."10
Open source allows you to take and use exactly what you need, nothing more and nothing less. If you
need to add or remove components from your architecture you are free to do so. There is no better
example of this than data science. Pandas, a core data science library for reading and analyzing data,
has a geospatial extension called GeoPandas that integrates seamlessly, allowing any data scientist to
install and start using it with ease.
New technology trends and cloud services are changing traditional IT norms
This was alluded to in our first section: new ways to deploy and maintain software, such as cloud
services, are changing traditional IT norms. Specifically, the ability to parallelize processes and query or
process large amounts of data has greatly expanded the scope and speed of analytics, both in geospatial
and in the broader analytics space. This includes tools like Spark as well as data warehouses such
as Google BigQuery, Amazon Web Services (AWS) Redshift, Databricks, and Snowflake. While, apart
from Spark, these are proprietary tools, they all use spatial SQL as a foundational language.
The same report describes why cloud adoption will be critical to the growth within the geospatial
industry:
"Increased internet connectivity due to advancements in communication technology and increased flexibility
due to cloud computing is changing the way companies are delivering software and services to their customers.
Geospatial analytics solution providers are also taking advantage of this improved business ecosystem to pro-
vide their customers with easy access to geospatial data. With cloud computing, several users can easily access
geospatial data and leverage cloud computing resources to perform analysis and mapping."11
More and more, geospatial will move toward open and interoperable tools, and can leverage the
cloud to increase scale and access when the time is right. With that said, what does any of this have to
do with me learning spatial SQL? Let’s find out.
1.2 Why learn spatial SQL?
Given the changes we discussed earlier, spatial SQL can and likely will play a critical role in the new,
modern GIS paradigm. While traditional systems used local files or shared file systems, or even enterprise
servers to store data, they often used any number of languages to access that data. Today, SQL
has become the lingua franca of modern analytics practices, and it appears that geospatial is headed in
10 https://loc8.cc/sql/maxar-osgis
11 https://loc8.cc/sql/bloomberg-markets
the same direction. Spatial SQL already has a robust set of GIS and geospatial functionality, and as the
ecosystem grows, more will be added to fold in other functionality over time.
Before we go any further, I also want to take a moment to define the term spatial SQL. Spatial SQL is
not a separate language and also does not refer to a single tool or system. In the simplest terms, spatial
SQL is a commonly accepted set of functions, naming conventions, and data types that can be used in
any number of tools that support SQL.
"Spatial SQL is an interoperable language for working with spatial data enabling geospatial analysis, spatial data
science, application development.
It provides GIS and geospatial users the ability to work in the same location as other data in databases and data
warehouses, removing the traditional silos between GIS and other areas of an organization.
Spatial SQL supports advanced operations such as spatial modeling and machine learning in the same location
that the data resides."
Let’s focus on the key points from this definition.
First, and quite possibly most important, is interoperability. Because you can use spatial SQL
across any number of databases and tools, you can move to a new tool or combine tools as needed,
which means you are never locked in to any single solution. While
there may be some steps required to migrate, overall your analytical code and structure can be easily
replicated between databases.
Next is that it enables different use cases. Only a few are listed here, but there are a multitude that
spatial SQL can serve. Let's take a look at some problems you can solve with spatial SQL (a short sketch
of a few of these follows the list):
• Query and manage spatial datasets
• Manage projections and re-project data
• Store spatial data in multiple formats (WKT, GeoJSON, etc.)
• Analyze and understand many types of spatial relationships
• Perform spatial clustering and nearest neighbor analysis
• Transform geometries using buffers, Voronoi polygons, and convex/concave hulls
• Calculate distances – straight line and using the curve of the earth
• Query and manage 3D data
• Create triggers to change and update data based on different events such as an INSERT into a
specific table
• Connect to APIs and other external services
• Create user-defined functions in other languages such as Python, Javascript, and SQL
• Perform statistical analysis and machine learning using functions and tools like BigQuery ML
• Store and manage unstructured data like JSON or arrays
• Return data in many formats such as GeoJSON or Shapefiles
• Produce and generate map tiles for frontend apps
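To make a few of these concrete, here is a minimal sketch of what some of these operations look like in
PostGIS. The table and column names (parks, geom, name) are hypothetical placeholders rather than data
used later in this book.

    -- Re-project a geometry, buffer it by 500 meters, and return it as GeoJSON,
    -- keeping only rows that intersect a bounding box. "parks" is illustrative only.
    SELECT
        name,
        ST_AsGeoJSON(
            ST_Buffer(
                ST_Transform(geom, 3857), -- re-project to a meter-based CRS
                500                       -- 500 meter buffer
            )
        ) AS buffered_geojson
    FROM parks
    WHERE ST_Intersects(
        geom,
        ST_MakeEnvelope(-74.05, 40.68, -73.85, 40.88, 4326) -- bounding box filter
    );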
The next section focuses on removing silos and working in the same location as other data within your
organization. For years, the saying "spatial is special" could be heard by many within the geospatial
industry, and indeed we kept business data and geospatial data separate: in separate databases, servers,
software, etc. But the need to treat geospatial data differently than normal data is decreasing rapidly.
Most databases support spatial data and functions, and most data warehouses are now fully adopting
geospatial support. This means that geospatial data and the expertise of those who know how to use it
can now pair with other business users and use cases, in the same place and with the same data.
Imagine the power of being able to quickly combine data from a complex query a colleague wrote,
join it to your spatial data, and run spatial analysis - in the same language and in the same tool. The
language and tool barriers that have existed for so long are no longer necessary, and SQL is what is
driving that change.
Not only that, SQL took the top spot in IEEE Spectrum's interactive rankings of top programming
languages in 2022.12 This displaced other mainstays such as the C-based languages and the ever popular
Python. Here is a quote from their article:
"SQL dominated the jobs ranking in IEEE Spectrum’s interactive rankings of the top programming languages
this year. Normally, the top position is occupied by Python or other mainstays, such as C, C++, Java, and
JavaScript, but the sheer number of times employers said they wanted developers with SQL skills, albeit in
addition to a more general-purpose language, boosted it to No. 1.
So what’s behind SQL’s soar to the top? The ever-increasing use of databases, for one. SQL has become the pri-
mary query language for accessing and managing data stored in such databases—specifically relational databases,
which represent data in table form with rows and columns. Databases serve as the foundation of many enterprise
applications and are increasingly found in other places as well, for example taking the place of traditional file
systems in smartphones."13
And the final component of spatial SQL speaks to its extensibility. More and more, you are starting to
see different data warehouses and even databases adding machine learning capabilities. Everything
from linear regression, to logistic regression, classification, and even popular models like XGBoost.
Most data warehouses like Redshift, BigQuery, Snowflake, and Databricks do this, but even Post-
greSQL has an extension called postgresml14 that adds machine learning capabilities into PostgreSQL,
and by extension, PostGIS.
You can do the same with more pure spatial models as well. PostGIS has implementations of the
popular clustering methods KMeans and Density-based spatial clustering of applications with noise
(DBSCAN). Other spatial functions like Moran’s I and Getis-Ord Gi* have been implemented in data
warehouses by CARTO in their Analytics Toolbox. You can also perform raster analysis and complex
routing queries and analysis using popular tools like pgRouting for PostGIS.
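As a small taste of what this looks like in practice, here is a minimal sketch using PostGIS's built-in
clustering window functions. The stores table and its columns are hypothetical placeholders.

    -- Assign each point to one of 5 K-Means clusters, and separately to a DBSCAN
    -- cluster (points that do not fall in any dense cluster come back as NULL).
    SELECT
        id,
        ST_ClusterKMeans(geom, 5) OVER () AS kmeans_cluster,
        ST_ClusterDBSCAN(geom, 0.01, 5) OVER () AS dbscan_cluster
    FROM stores;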
But why would you bother doing this in SQL when all these tools exist in Python? The reason is that
the closer you run your analysis to the data, the faster the analysis will run. Sure, for small problems or
small volumes of data Python can work great, but as you scale up you will inevitably need to find
efficiencies to make things run faster. When using Python, that either means adding more compute
resources or some type of parallelization. In SQL, however, you already have the added bonus of running
your models right where the data resides, and in a lower-level language (for example, PostgreSQL is
written in C).
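As an illustration of the "run the analysis close to the data" idea, here is a minimal sketch of a spatial
join and aggregation pushed entirely into the database, so only the summary rows come back instead of
millions of raw records. The counties and stores tables are hypothetical.

    -- Let the database do the spatial join and aggregation, returning one row per
    -- county rather than every raw store record.
    SELECT
        counties.name,
        COUNT(*) AS store_count,
        AVG(stores.annual_sales) AS avg_annual_sales
    FROM counties
    JOIN stores
        ON ST_Contains(counties.geom, stores.geom)
    GROUP BY counties.name;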
I don't think SQL will ever replace Python, and the same IEEE Spectrum report points to a major
reason for the rise of SQL: employers looking for "SQL+ roles", meaning SQL plus something else.
Now let’s take a quick look at when or how you might use these two popular languages together.
1.3 To use SQL, Python, or both?
Python has had a meteoric rise in the past few years, driven in part by the rise of data science and
machine learning. Most of the first data scientists were using either R or Python in their research, and
over time Python has increased in popularity since it is easy for others to learn and easy to develop
12 https://spectrum.ieee.org/top-programming-languages-2022
13 https://spectrum.ieee.org/the-rise-of-sql
14 https://github.com/postgresml/postgresml
with.
But as the need to use more data started to grow, those same data scientists needed new tools to increase
their capacity. Some used larger servers to add more computational resources on the machines running
their code. Others started to use GPU accelerated tools to greatly increase their performance.
Yet the method that seems to be winning out is using SQL-based technologies, either databases or data
warehouses. One reason is that SQL is also an easy language to use, and you can offload much of your
data prep into the database or data warehouse. You can also do aggregations and reporting with time
series data, so the database can reduce the time needed to prepare your data while also making it ready
for analysis in Python or other tools. The field of Data Engineering has taken off due
to the need for more and cleaner data in a common location, and other technologies like Spark have
helped to expand the usage of SQL too.
For me, these are recommendations, not rules, that I use when deciding between the two.
There are no right or wrong answers, but this is usually what I follow:
I use Python when I need to:
• Load and explore some data really quickly from a flat file
• Quickly visualize some data
• Translate between data formats (or I use GDAL on the command line)
• Perform exploratory spatial data analysis (always with PySAL)
• Analyze territory design problems
• Perform location allocation problems (although sometimes it is more efficient to create an origin
destination matrix in SQL)
• Call APIs programmatically via Python to collect data
I use spatial SQL when I need to:
• Store larger spatial datasets that I access frequently
• Perform joins across tables – spatial or otherwise
• Analyze spatial relationships
• Perform spatial feature engineering (covers almost all use cases)
• Aggregate data, spatial or otherwise
• Create tile sets (although Python is still used in the API service)
• Route between lots of points
I flip flop between spatial SQL and Python when I need to:
• Write custom functions to manipulate my data
• Perform geocoding (Python generally has more options)
• Re-project data (spatial SQL has the edge in this case)
• Manipulate geometries
• Make spatial index-based aggregations like H3 or Quadkey
• Perform spatial clustering
• Machine learning using tools like BigQuery ML or postgresml
We will cover how you can combine spatial SQL with some of these emerging technologies later in the
book so that you can bring your expertise into these exciting fields.
1.4 Spatial SQL landscape
During the course of this book, we will primarily be using PostGIS to run spatial SQL. It has the most
spatial functions and is generally the standard that most others look to follow when they add spatial
SQL functionality. With that said, the number of tools that have adopted spatial SQL has grown
significantly in recent years, so we will first take a look back to see how spatial SQL evolved, and then what
the spatial SQL landscape looks like today.
There is no central source of truth when it comes to the history of spatial SQL. This section is my
attempt to document the different developments of spatial databases over time and how spatial SQL
came to be today. SQL itself dates back to the 1970s:
"The SQL programming language was developed in the 1970s by IBM researchers Raymond Boyce
and Donald Chamberlin. The programming language, known then as SEQUEL, was created following
Edgar Frank Codd's paper, 'A Relational Model of Data for Large Shared Data Banks' in 1970."15
SQL as a language was made widely available in 1979 by the company that eventually became Oracle,
and they were the driving force behind SQL and databases for many years.
When it comes to the history of spatial SQL, the first few years have many different developments, but
in a strange way they all built on top of each other, as shown in Figure 1.1. Apart from Oracle which
introduced the Spatial Data Option (SDO) in 1995, there appear to be three other companies that drive
the adoption of spatial SQL: Illustra, Informix, and the open source option PostGIS. Strangely enough
Illustra was later acquired by Informix, and the key person that founded Illustra was actually Michael
Stonebraker, the eventual creator of PostgreSQL.
Things really started to take off around 2003 when the Open Geospatial Consortium adopted the
SQL/MM Spatial Standard, which standardized function naming and data types within SQL databases16 .
15 https://www.businessnewsdaily.com/5804-what-is-sql.html
A few others entered the space including SpatiaLite, a spatial extension of SQLite, Microsoft SQL Server,
and MySQL.
From that point on, most of the development in the database space focused on "big data". While I
feel that term is somewhat outdated, the core developments in the space centered on technologies like
Spark, data warehouses, Hive, and other parallelized processes for managing and querying large
datasets.
This is where you see the next big burst of activity, starting with GeoSpark, which eventually turned
into Apache Sedona. Often there is a gap between the original development of the technology and the
spatial functionality being added. With PostgreSQL it was 7 years (1994 was the release of PostgreSQL,
PostGIS was released in 2001). Spark and GeoSpark were only one year. Then came the migration
to data warehouses. Amazon Web Services (AWS) Redshift launched in 2012, geospatial functionality
launched in 2019. Google Cloud Platform (GCP) BigQuery launched in 2010, geospatial functionality
launched in 2018. Snowflake launched in 2014, geospatial support launched in 2020.
We have yet to see how geospatial will play out in the next big push with SQL, which appears to be
within the data engineering space with ETL and ELT tools as well as tools like dbt (data build tool) or
within distributed query engines like Trino or PrestoDB (both of which support spatial functions).
• 1994 - Illustra Spatial launches, which as far as I can tell is the first SQL database featuring a
spatial data type and spatial functions. Michael Stonebraker was developing what was called
POSTGRES (or post-INGRES) at the University of California, Berkeley at the time, and Illustra
was the effort to commercialize that work.17
• 1995 - Shortly after, Oracle launched the Spatial Data Option or SDO in Oracle 4. This was devel-
oped in conjunction with a research team at the Canadian Hydrographic Service (CHS).18
• 1996 - Illustra was acquired by Informix and was renamed the Informix Spatial Datablade. Map-
Info was heavily integrated with the Spatial Datablade.19
• 1998 - Oracle releases a new upgraded version of Oracle Spatial in Oracle 8i which added "native
internet protocols".20
• 2001 - PostGIS launches the first candidate version. PostGIS was born out of the need to add su-
perior support for the geometric type object in PostgreSQL. Refractions Research and Paul Ramsey
led much of the efforts around PostGIS from Victoria, British Columbia, Canada.21
• 2002 - IBM releases the DB2 Spatial Extender. This provided compatibility with the Esri product
suite.
• 2003 - Informix, which was acquired by IBM, releases the Spatial Extender which also provides
compatibility with the Esri product suite.
• 2003 - One of the more consequential moments in the history of spatial SQL was when the Open
Geospatial Consortium (OGC) adopted the standards for Simple Feature Access, or ISO 19125.
This provided the blueprint for spatial data and functions within a SQL database environment.22
16 https://www.ogc.org/standards/sfs
17 https://en.wikipedia.org/wiki/Michael_Stonebraker
18 https://en.wikipedia.org/wiki/Oracle_Spatial_and_Graph
19 https://www.bizjournals.com/albany/stories/1997/11/10/daily6.html
20 https://en.wikipedia.org/wiki/Oracle_Database
21 http://www.refractions.net/products/postgis/history/
22 https://www.ogc.org/standards/sfs
1.5 The landscape today
The landscape of spatial SQL databases, data warehouses, and supporting tools is expanding greatly.
More tools are adopting spatial SQL than ever before, increasing the opportunity to learn one language
and apply it to a multitude of tools. In this section we will review the current set of systems that support
spatial SQL as of the writing of this book.
Relational databases
Relational databases are generally the most common type of database used today. These databases are
designed to perform transactions with data, meaning that records are stored, created, updated, and
deleted. These databases provide powerful tools for geospatial analytics as well, but many organizations
will separate their databases, keeping an analytical database in a different location than the database
where the real data is being processed.
PostGIS
By and large, PostGIS is the most popular option for spatial SQL. It is completely open source and
it extends an already popular database, PostgreSQL. It has the largest number of spatial functions
available for a wide range of use cases and supports 2D vector data, 3D/4D vector data, and raster
data. PostGIS also has other extensions like pgRouting that allow for routing options, mobilitydb for
spatial movement data, postgis_tiger_geocoder for geocoding data in the US, address_standardizer for
standardizing address data, pgpointcloud to use point cloud data, and a range of foreign data wrappers
to query external databases.
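As a quick sketch of how this extensibility works in practice, PostGIS and its companion extensions are
enabled per database with CREATE EXTENSION, assuming the packages are already installed on the
server.

    -- Enable PostGIS and a couple of the extensions mentioned above in an
    -- existing PostgreSQL database.
    CREATE EXTENSION postgis;
    CREATE EXTENSION postgis_raster;
    CREATE EXTENSION pgrouting;

    -- Confirm the installed version and build details.
    SELECT postgis_full_version();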
PostGIS is the database we will be using for most of the activities in this book due to its popularity
and breadth of spatial functions. While it is near impossible to track how many installations of PostGIS
there are, Docker Hub shows that it has over 100 million pulls as of December 2022 for this installation
method alone.29
PostGIS will serve almost any general purpose need for a spatial database, so it is helpful across a wide
range of use cases. We will discuss the advantages and disadvantages of the remaining databases in
this list as well. There are some logical limits with PostGIS in terms of data size and query time,
but much of that can be managed with different methods and spatial indexes, except for the very largest
spatial data. It is a larger install compared to SpatiaLite and also requires the user to be in charge of
deployment, updates, and security, although many cloud-managed solutions exist for PostGIS.
SpatiaLite
If you need a very small and lightweight system to manage spatial data, SpatiaLite, the spatial extension
for SQLite, is a great choice. It provides spatial functions in a lightweight package suitable for a range
of use cases, and since it is such a small installation it is especially useful for individual use cases.
SQLite describes why it is advantageous:
"SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-
featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built
into all mobile phones and most computers and comes bundled inside countless other applications that
people use every day."
SQLite is deployed in every Android and iPhone, so it claims to be the most deployed database in
the world. SpatiaLite is also used in GDAL, specifically ogr2ogr, to add a spatial SQL component to
transform data within a data transformation command. SpatiaLite only lists one use case where it is
not suitable:
"Conditions where a SQLite/SpatiaLite solution may not be the best choice. . . for support for multiple
concurrent access, a client-server DBMS, such as PostgreSQL/PostGIS, is required"
This basically means situations where multiple clients are connecting to a common server.
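To illustrate the ogr2ogr integration mentioned above, here is a hypothetical command that applies
SpatiaLite-flavored spatial SQL while converting a Shapefile to a GeoPackage. The file, layer, and column
names are placeholders, and the SQLite dialect requires GDAL to be built with SpatiaLite support.

    # Buffer each feature while translating the data, using the SQLite dialect.
    ogr2ogr -f GPKG buffered.gpkg input.shp \
      -dialect SQLITE \
      -sql "SELECT ST_Buffer(geometry, 0.01) AS geometry, name FROM input"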
MySQL
MySQL is a very common database, most comparable to PostgreSQL. The primary differences between
PostgreSQL and MySQL are that MySQL is open source but maintained by Oracle (after MySQL was
acquired by Oracle, the founder forked it and created a fully open source version called MariaDB) and
that PostgreSQL is object-relational whereas MySQL is just relational. This means that PostgreSQL
works with more complex data types, but also adds complexity into the queries and operations you
can do.
MySQL is a very popular database which is used by WordPress, Facebook, YouTube, Drupal, Twitter,
and more. It doesn’t appear that geospatial use cases are as common with MySQL as they are in PostGIS
or other spatial databases, but the simple fact that spatial SQL functionality exists in an incredibly
popular database means that spatial use cases and data can be leveraged from these database systems.
29 https://registry.hub.docker.com/r/postgis/postgis/
Enterprise databases
In effect there are four main "enterprise" (meaning commercial only) databases that support spatial
data: Oracle Database, Microsoft SQL Server, Informix Spatial DataBlade (owned by IBM), and IBM
Db2. These are all proprietary databases, some of which include the earliest spatial databases (Informix
and Oracle). Each of these supports spatial data types and functions in addition to their standard
core features. While some offer free trials or limited free versions, most of the full functionality will
only be accessible if you or your organization has a full license to them. On inspection of their public
documentation it does appear that these tools have a wide set of functions for spatial analysis and data
management.
Data warehouses
While the concept of a data warehouse has been around for quite some time, in the past few years the
adoption of cloud data warehouses has greatly increased, driven in part by the massive increase in
available data that organizations are producing. There are two main factors that distinguish data ware-
houses from databases: the separation of compute resources and storage, and an analytics-first focus.
The separation of compute resources and storage has several implications for the performance and
purpose of a data warehouse. The first is that data is being created in more formats, at faster velocity,
and in greater volumes. Because of this the concept of a data lake has become increasingly popular. In
simple terms a data lake is a place to store lots of data efficiently and commonly takes the form of cloud
storage tools like GCP Cloud Storage or AWS Simple Storage Service (S3). This means you can dump
and store data in a semi-organized manner and then use the data warehouse as the query interface for
that data. You may still need to ingest the data into the warehouse, but the data lives in one location
and the query engine, or the engine using the compute resources to process and orchestrate the query,
lives elsewhere.
Why does this matter? This means that you can store the data efficiently, and at a low cost, in a tool
optimized for general data storage (sometimes called blob storage) and the data warehouse query en-
gine can use the computing power it needs. The data warehouse also parallelizes the query, or splits
the job into smaller parts then assembles the results, which results in far faster query times for specific
operations. Some data warehouses also offer serverless options. This means that instead of having
compute services running around the clock even when you are not using them, the data warehouse
will look at the query and delegate an appropriate amount of compute resources for that specific job.
Once the job is complete those resources shut down. It’s basically the difference between owning a car
versus taking a cab or ride-share.
The other difference is that data warehouses are purpose built for analytics. Databases, while often
used for analytics purposes, are built for online transaction processing (OLTP). On the other hand, data
warehouses are generally defined as online analytical processing (OLAP) services which means they
are designed specifically to manage analytical queries. This means that things like aggregations, joins,
WHERE queries, and other operations are well suited for these tools. This is not to say that using
databases to perform analytics is wrong, but certainly data warehouses can provide more performance
when moving to very large volumes of data.
There are many data warehouses available but the most common ones that are used for geospatial
analysis currently are AWS Redshift, GCP BigQuery, and Snowflake. All of these tools support the
geometry data type natively and have built-in spatial functionality as well as the ability to extend that
functionality. CARTO has done so with its Analytics Toolbox, which has open-source functions for
a number of operations, including spatial indexes like H3 or Quadbin, the former of which we will
explore in this book.
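To show how portable the language is, here is a minimal sketch of the same style of spatial SQL running
in a cloud data warehouse (BigQuery syntax shown; the project, dataset, and table are hypothetical). The
GEOGRAPHY type and ST_* functions closely mirror what we will use in PostGIS.

    -- Find places within 5 km of a point and order them by distance in meters.
    SELECT
        name,
        ST_DISTANCE(geog, ST_GEOGPOINT(-73.9857, 40.7484)) AS meters_away
    FROM `my_project.my_dataset.places`
    WHERE ST_DWITHIN(geog, ST_GEOGPOINT(-73.9857, 40.7484), 5000)
    ORDER BY meters_away;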
In the realm of big data processing, Spark has become the clear leader. A number of systems have been used to store and query big data, and for a period of time Apache Hadoop, and its counterpart Apache Hive which provides a SQL interface to Hadoop, were the clear category leaders. Apache Spark, released in 2014 by the University of California, Berkeley's AMPLab, has quickly become the category leader for processing large amounts of analytical data.
Comparing Hadoop to Spark is not an apples-to-apples comparison, as the two have slightly different focus areas, but Spark can outperform Hadoop in processing time by anywhere from 10 to 100 times30. It improves upon the MapReduce framework used by Hadoop by using a Directed Acyclic Graph, or DAG, to map out how tasks should be performed. Spark uses "drivers" which turn the code you write into multiple tasks that can be run across a network of nodes. The "executors" do exactly what the name sounds like: execute the tasks assigned to the nodes. Apart from speed, Spark provides a far easier API for running code and commands and, important for our purposes, a SQL-compliant toolkit for querying data known as Spark SQL.
Each of the major clouds supports Databricks, a platform that provides a complete data analytics environment on top of Spark. The founders of Spark founded Databricks and helped develop this ecosystem while also continuing to develop Spark as an open-source project.
As it relates to geospatial analytics, Spark currently does not support geospatial data as a native data type. That said, there are two tools that provide geospatial support on top of Spark. These provide a scalable analytics platform using Spark-native programming in addition to Spark SQL, as well as tools for visualizing data and producing map layers.
Apache Sedona
Formerly known as GeoSpark, Apache Sedona is an extension of Apache Spark (and of Apache Flink) that effectively extends the Resilient Distributed Dataset (RDD), the core data structure of Spark, to support geometries. This description from a post on Medium by Mo Sarwat, a maintainer of Apache Sedona, describes it in more detail:
"A SpatialRDD consists of data partitions that are distributed across the Spark cluster. A Spatial RDD can be
created by RDD transformation or be loaded from a file that is stored on permanent storage. This layer provides
a number of APIs which allow users to read heterogeneous spatial object from various data format.
GeoSpark allows users to issue queries using the out-of-box Spatial SQL API and RDD API. The RDD API
provides a set of interfaces written in operational programming languages including Scala, Java, Python and R.
The Spatial SQL interfaces offers a declarative language interface to the users so they can enjoy more flexibility
when creating their own applications. These SQL API implements the SQL/MM Part 3 standard which is widely
used in many existing spatial databases such as PostGIS (on top of PostgreSQL)."
In effect, it provides a geospatial analytics platform on top of Spark using spatial SQL. It can also be run with other languages and used within the context of a Python notebook. The project graduated from incubating to a top-level Apache project in January 2023, so there will likely be more features released as it expands.
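To make this concrete, here is a hedged sketch of what a Sedona-style Spark SQL query could look like, assuming the spatial functions have been registered and that two tables named trips and zones exist with the columns shown (all of these names are hypothetical):

select
    z.zone_name,
    count(*) as pickups
from trips t
join zones z
    on st_contains(z.geom, st_point(t.pickup_lon, t.pickup_lat))
group by z.zone_name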
30 https://loc8.cc/sql/infoworld-spark
GeoMesa
While GeoMesa is not a Spark-specific tool, it is worth mentioning here because it now includes a connector to SparkSQL. At its core, GeoMesa is described as:
"...an open source suite of tools that enables large-scale geospatial querying and analytics on distributed com-
puting systems. GeoMesa provides spatio-temporal indexing on top of the Accumulo, HBase, Google Bigtable
and Cassandra databases for massive storage of point, line, and polygon data. GeoMesa also provides near real
time stream processing of spatio-temporal data by layering spatial semantics on top of Apache Kafka. Through
GeoServer, GeoMesa facilitates integration with a wide range of existing mapping clients over standard OGC
(Open Geospatial Consortium) APIs and protocols such as WFS and WMS. GeoMesa supports Apache Spark for
custom distributed geospatial analytics."
It primarily focuses on the Hadoop ecosystem, but it also incorporates many other database connections, and it allows for the creation of map tile layers. For the scope of this book, SparkSQL is the primary place you will find spatial SQL within GeoMesa, although the main interface is Java or Scala.
GeoMesa could certainly fall under the later section on distributed query engines, as it positions well alongside other tools that support a wide range of non-geospatial-specific distributed querying. But since its only spatial SQL component, apart from a connection to PostGIS (as of version 3.5.0), is through SparkSQL, it felt more appropriate to cover it here for the time being.
We mentioned the term OLAP in the data warehouses section, and there are many more OLAP systems gaining popularity, such as Apache Druid and ClickHouse, which currently have limited geospatial support and do not support spatial SQL.
DuckDB
DuckDB was originally listed as one of the OLAP tools that did not have geospatial support when I first wrote this chapter, but in the course of writing this book it added geospatial support with its spatial extension, which provides support for reading, writing, and querying geospatial data and files31. DuckDB has a few unique attributes that make it an increasingly appealing choice for spatial analytics.
First, it has no dependencies: all you need to run it is the DuckDB package installed on your computer. It is an OLAP engine, which means it is made to query and process large amounts of data, but it is unique in the sense that it can do this from your laptop using your computer's built-in processing power. It is also a columnar-vectorized query execution engine, which is explained on the DuckDB website32:
"DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large
batch of values (a “vector") are processed in one operation. This greatly reduces overhead present in traditional
systems such as PostgreSQL, MySQL or SQLite which process each row sequentially. Vectorized query execution
leads to far better performance in OLAP queries."
It does not require you to import any files either. You can query files on your computer or from cloud storage services like AWS S3 or Google Cloud Storage: you simply reference the filename as the table and you can start using it.
31 https://duckdb.org/2023/04/28/spatial.html
32 https://duckdb.org/why_duckdb
If you need more computing power, MotherDuck, a serverless cloud platform that supports DuckDB, can provide it and allows you to intermingle data on your computer with data in the cloud.
Additionally, you can create a database that is contained in a single file, which you can then share with any other DuckDB user. You can also install and access DuckDB directly from Python and pass the returned data immediately into a Pandas dataframe. DuckDB has been referred to as the SQLite for analytics, and its ease of use, portability, and simple installation will, I believe, make it an increasingly popular choice for spatial analytics in both the short and long term.
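As a hedged sketch, querying a local file with DuckDB's spatial extension could look like the following; the file name counties.geojson and its name column are hypothetical:

-- Install and load the spatial extension
install spatial;
load spatial;

-- ST_Read queries the file in place; no import step is required
select
    name,
    st_centroid(geom) as centroid
from st_read('counties.geojson')
limit 5;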
Apache Pinot
The notable exception is Apache Pinot. Released in 2019, this OLAP datastore supports real-time (streaming) analytics from Apache Kafka topics as well as stored file data, and it is currently in use by LinkedIn, Uber, Slack, Stripe, DoorDash, Target, Walmart, Amazon, and Microsoft33. I learned about geospatial support for Apache Pinot at ApacheCon North America 2022 in a presentation by Yupeng Fu34 of Uber, where Pinot is used to provide real-time analytics within the Uber Eats app.
What makes Pinot unique is that it is effectively a database you can query directly with SQL, and in turn spatial SQL, against not only static data but also event streams. In the case of Uber, static data (restaurant partner locations) and real-time data (streams of orders) are combined to produce end-user analytics. This differs from internal analytics such as reports and dashboards: it is analytics for the end user, potentially hundreds of thousands or millions of users. Pinot is able to scale horizontally to support this volume of requests while showing analytics like the image in Figure 1.2, on the next page.
The query would look something like this:
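What follows is a hedged sketch rather than Uber's actual query; the table and column names and the reference point are hypothetical:

select
    order_id,
    restaurant_name
from orders
where st_distance(location, st_point(-122.41, 37.77, 1)) < 16093  -- roughly ten miles, in meters
    and items_ordered > 0
    and created_at_ts > 1672531200000                             -- a specified time, in epoch milliseconds
limit 50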
This would show you recent orders within ten miles of your location where the number of items ordered is greater than 0 and the order-created timestamp is greater than a specified time. This opens a whole new range of possibilities for end-user applications built with the same spatial SQL you are learning here.
Another area that continues to gain popularity is distributed query engines; the two most prominent are Presto, released in 2013, and Trino, released in 2019. Presto was originally developed at Meta (then Facebook), and while Presto is still open source, the original founders created a fork in 2019 which came to be known as Trino.
33 https://pinot.apache.org/who_uses/
34 https://www.apachecon.com/acna2022/slides/02_Fu_Geospatial_support_pinot.pdf
Figure 1.2: Uber Eats Near Me feature powered by Apache Pinot ("Geospatial support in Apache Pinot" -
presentation by Yupeng Fu at Apachecon North America 2022)
These tools leverage distributed computing similar to Spark or Hadoop, known as massively parallel processing: a coordinator splits each query into tasks that are distributed to workers that query the data. The main differences are that these engines can connect to a multitude of sources and that you only need to know SQL to use them. The SQL you write is processed and turned into different execution steps, similar to SparkSQL.
Some example connections you can use (list taken from Trino):
• BigQuery
• Cassandra
• ClickHouse
• Cloud Storage (GCP Cloud Storage, AWS S3) via Hive Connector
• Delta Lake
• Elasticsearch
• Google Sheets
• Hadoop
• Hive
• Kafka
• Local File
• MariaDB
• MongoDB
• MySQL
• Oracle
• Pinot
• PostgreSQL
• Redshift
• SQL Server
What this means is that you can query one or many of these data sources together, in a distributed manner, all with SQL. Trino and Presto also support a wide range of spatial SQL functions that are SQL/MM compliant and based on the OGC standards, which lets you do most of the things you might need with spatial SQL. The downside is that they don't support the GEOMETRY or GEOGRAPHY data types natively, so you have to create geometries from Well-Known Text, Well-Known Binary, or another storage format at query time, which inevitably lengthens query processing. As this is an evolving space, I am interested to see how these systems will be used for larger-scale geospatial analysis in the future, especially as they now provide the full power of big data systems all in SQL.
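As a hedged illustration of both points, the sketch below joins a hypothetical PostgreSQL table to a hypothetical Hive table in Trino, building a geometry from Well-Known Text at query time; the catalog, schema, table, and column names are all assumptions:

select c.city_name
from postgresql.public.cities as c
join hive.default.regions as r
    on st_contains(
        st_geometryfromtext(r.wkt_geom),
        st_point(c.longitude, c.latitude)
    )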
The last group that I will cover is GPU-accelerated databases, meaning databases that run on graphics processing units rather than central processing units, or CPUs. Databases that use GPUs can see a significant performance boost simply because of the number of cores available to process operations: GPUs are optimized to handle many operations concurrently, like rendering complex graphics in video games. This means that databases running on GPUs can run more tasks concurrently, sometimes thousands more, which results in much faster operations. This is also why GPU processing has become popular for large-scale machine learning and deep learning.
There are three main GPU-accelerated databases that provide geospatial functionality: HEAVY.AI (formerly known as MapD and OmniSci), Kinetica, and Brytlyt. While these solutions are mainly commercially focused, HEAVY.AI and Kinetica have free versions you can use, although the developer versions don't allow you to take advantage of the GPU tooling. HEAVY.AI also has open-source components that you can download or contribute to, and it is the only one of the three with an open-source version available.
This wraps up our tour of the spatial SQL landscape. In the next chapter we start to get our hands dirty: setting up the spatial SQL environment we will use for the rest of the book, learning how to think in SQL, and seeing how to migrate (or mix and match) from a desktop GIS setup to spatial SQL.
One of my goals for this book was to include perspectives apart from my own, from people using spatial SQL in innovative ways around the globe. To that end, several professional colleagues and connections were gracious enough to share their experiences and knowledge. The first of these comes from Uchenna Osia.
Name: Uchenna Osia
Title: PhD Student in Geospatial Analytics at North Carolina State University
I learned spatial SQL during the Geospatial Data Management course at NCSU.
I enjoy using spatial SQL because it allows me to represent space in a way that is most appropriate
within the context of a given problem. It provides clarity in executing my end goal and enables the
capture of nuanced relationships formed within data. Spatial SQL offers a structured approach to
handling difficult problems; its capabilities make it easier to derive valuable insights that can be used to drive decision-making.
Can you share an interesting way or use case that you are using spatial SQL for today?
Presently, I am designing a spatial database that will model a network of disaster relief organizations
operating in North Carolina. I will use spatial SQL to investigate how organizations can adjust their
reach to be more equitable in their service and connect those that share aligned missions.
2. Setting Up
Now that we have reviewed the spatial SQL landscape and its history, it is time to build the spatial SQL setup we will be using for the rest of this book. As I mentioned in the previous chapter, we will primarily be using a combination of PostGIS as our database and pgAdmin as our method to access and query our data. In this section we will also cover some other tools you can use to access, import, and query data, such as ogr2ogr (part of the GDAL package) and QGIS.
The first step in the process is to set up PostGIS on your computer. This instance of PostGIS will live
only on your computer or, as I will refer to it going forward, locally. You won’t be able to use it with a
web app or any other computer. So how is this database able to run on your computer without living
somewhere else, such as the cloud?
In short, the cloud is more or less just a collection of servers, similar in principle to your own computer: each has computing resources like CPU, RAM, and hard disk storage. Cloud providers simply have far more of these, and on top of the actual hardware they offer specific services, everything from pure compute access via the command line, to databases like PostgreSQL, to services for messaging, machine learning, artificial intelligence, and more. In effect there is not much difference between installing PostGIS on your local machine and installing it on a web server or in the cloud, apart from the fact that a cloud database can be accessed from the internet.
There are many methods to install PostgreSQL on your computer. Below are some of the options we will not be covering in this book (more on that later). The recommended path for installing PostGIS on Windows is to download the PostGIS for Windows package from the PostGIS website and then install it using the StackBuilder application that comes with the download. For macOS users, it is recommended that you use Postgres.app, which is a simple download followed by a simple command from the command line to install PostGIS. Linux users will have a few more steps, but this Medium post from Joe T. Santhanavanich provides a great guide in 5 steps35 (see footnote for URL). I recommend checking the PostGIS website for updated instructions as these may change in the future, but these three paths are the best and most supported methods to install PostGIS if you choose to go this route.
The instructions I will provide will be using the PostGIS Docker image which will work cross-platform,
and is easy to set up with minimal command line instructions.
In recent years the software deployment method known as containerization has become increasingly
popular for a number of reasons, and Docker is the most popular tool used to build and manage various
containers. A container is effectively a self-contained, portable, and standardized method of deploying
software.
The best way to think about a container is that it is a completely blank slate before we provide it any instructions.
35 https://loc8.cc/sql/joets-install-postgresql
It has no operating system, no information about what it should run, no data, nothing. The
Dockerfile tells the container what to install, including an operating system. It then tells the container
the steps it should run, such as other things to install and commands on the command line. As a user,
you don’t actually have to do any of this, apart from installing and starting the Docker container itself.
Even better, Docker provides a desktop application to view and manage your container(s).
So why use this approach over the options described above? First, the Docker container will run the same code on every computer used by readers of this book. Since the container is "contained" on your computer and follows the same set of installation instructions every time, the experience will be consistent no matter what computer or operating system you are using.
The second reason addresses a consistent problem when writing repeatable installation instructions: the versions of other tools already running on each computer. There is no way to anticipate the myriad issues that could arise from the different versions of operating systems or languages installed on the computers of every person using this book. The great part about Docker is that there is no need to worry about that: because the container is built from the same Dockerfile for everyone, the steps will be the same.
Third is updating your database to new versions to take advantage of newly released features. Updating is as simple as installing the new version, creating a dump of the data from your existing database, restoring that dump into the new version, and deleting the old one. The container will also persist the data you have in it even if you shut it down or restart it.
Download Docker
To set up PostGIS our first step is to download and install Docker. You can simply go to the Docker
download page36 and download the correct version for your computer. Once it downloads, go ahead
and install Docker and open it. This will also install the Docker Engine, Docker CLI client, Docker
Compose, Docker Content Trust, Kubernetes, and Credential Helper.
Next we are going to open our command line terminal. We will be running a few commands from the
terminal so if you have not used the terminal before, here are a few tips to get started.
The one universal command that you will want to know is cd, which is short for change directory. In
short, you are basically navigating the folder structure of your computer, but instead of using the user
interface, you are using the command line to do so. For example let’s imagine that you are in a folder
called ’Home’. There is a folder within that folder called ’Documents’ and within that folder there is a
folder called ’SpatialSQL’. To navigate to the Documents folder we can use the command
cd Documents
And from here we can navigate to the SpatialSQL folder using this command:
36 https://docs.docker.com/get-docker/
cd SpatialSQL
To move back up one level to the Documents folder, use:
cd ..
Or if you want to go from the SpatialSQL folder all the way back to Home:
cd ../..
You can also list the items in the folders using ls on Mac or Linux, and dir on Windows.
At least for this book you won’t need any more knowledge than this, but you can find plenty of tutorials
online on more advanced command line prompts as needed.
From here, create a folder in the location where you want to store your Docker container and PostGIS data files, and navigate to that folder in the command line. Once there, our first step is to make sure Docker is up and running. You can check with this command:
docker info
You should see an output that looks something like Figure 2.1, on the following page.
Troubleshooting
Installing docker-postgis
In addition to installing PostGIS, we will also add the following PostGIS extensions to help us perform
more advanced analysis throughout the book. In a normal PostgreSQL installation, these can be in-
stalled via any number of methods such as compiling the code yourself or using an extension manager.
The extensions we will be using are:
• postgis_raster: An extension that is already bundled within PostGIS that allows you to work
with raster data inside the database37
• pgRouting: An external extension that allows you to import network data such as roads, bike
routes, even shipping lanes, and perform common routing analysis with them38
• h3-pg: An external extension that allows you to work with the H3 global discrete grid system in PostGIS39
37 https://postgis.net/docs/RT_FAQ.html#idm34630
38 https://pgrouting.org/
Installing each of these extensions on your own would involve varying levels of difficulty, ranging from simple with postgis_raster to complex with pgRouting. Fortunately for us, there is a Docker container that is set up to handle all of this out of the box. Another wonderful feature of Docker is that there are community-maintained containers that make it very easy to get started with new tools. The container we will use in this case is very similar to the main PostGIS container. It is called docker-postgis40 and is maintained by Kartoza41, an IT services company focused on geospatial. In fact, they maintain a number of containers for a variety of geospatial services, so I want to make a special acknowledgment of their efforts here, which save you (and me) time in using these features.
Setting this up is quite simple since we already have Docker up and running on our computer.
39 https://github.com/zachasme/h3-pg
40 https://github.com/kartoza/docker-postgis
41 https://kartoza.com/
There are several methods to do this, but the method I prefer is docker-compose, since it greatly decreases the complexity of the commands we need to run and also lets us benefit from another container that backs up our database. In short, the docker-compose format embeds all of your options into a file that is passed to Docker to build or stand up one or more containers. The docker-compose option for docker-postgis includes PostGIS with all of the extensions except H3 ready to use, plus a second container that creates database backups at regular intervals.
1. First, make sure Docker itself is up to date. You can do this using Docker Desktop or from your command line interface. Note that this may not work if you are using Windows; see this documentation for more details.42
2. Our next step will be to download a file from the docker-postgis GitHub repo. Note that this file
is included in the course files so you do not have to download it yourself, but I will include the
instructions here so you can see how to do it on your own. First, go to the Kartoza GitHub repo.43
Once you are there, click on the file named docker-compose.yml (Figure 2.2), or you can find it directly at the link in the footnote.44
3. From here, you can download the file by clicking the download button, as shown in Figure 2.3,
on the following page.
This will download the docker-compose.yml to your computer. Move the file to a location that you will
remember as we will need to navigate to that location in the command line in the coming steps.
42 https://loc8.cc/sql/docker-windows
43 https://github.com/kartoza/docker-postgis
44 https://github.com/kartoza/docker-postgis/blob/develop/docker-compose.yml
4. Before we run the containers, we have to make a small modification to the docker-compose.yml file. Open this file with a text editor of your choosing. This is a YAML file, a configuration format whose name originally stood for "Yet Another Markup Language." As of the publishing of this book the relevant code is on line 21 of the YAML file, which looks like this:
- POSTGRES_MULTIPLE_EXTENSIONS=postgis,hstore,postgis_topology,postgis_raster,pgrouting
This argument in the file defines the extensions that will be installed in your PostGIS database. The Dockerfile for the PostGIS image already installs everything we need to use the H3 functions, but we still need to enable the H3 extensions by adding them to this line of code. The text we want to add is:

,h3,h3_postgis

So that the full line now reads:

- POSTGRES_MULTIPLE_EXTENSIONS=postgis,hstore,postgis_topology,postgis_raster,pgrouting,h3,h3_postgis
Ensure that commas separate all the values and that there are no spaces. If you downloaded the file from the GitHub repo, you will also need to replace this line on line 10:

image: kartoza/postgis:${POSTGRES_MAJOR_VERSION}-${POSTGIS_MAJOR_VERSION}.${POSTGIS_MINOR_RELEASE}

With a pinned version, like this:

image: kartoza/postgis:15-3.3

We need to do the same on line 31 for our backups Docker container, changing this:
image: kartoza/pg-backup:${POSTGRES_MAJOR_VERSION}-${POSTGIS_MAJOR_VERSION}.${POSTGIS_MINOR_RELEASE}
To this:
image: kartoza/pg-backup:15-3.3
Special note for macOS users on Apple silicon chips (currently the M1 or M2 chips)
If you are using a Mac laptop or computer that was released after 2019, then it is likely that your
computer falls into this category. You can find out if you are using an M1 or M2 chip by clicking on the
"apple" icon on your computer, then clicking on About This Mac. This will open a window that will
tell you which chip your computer is using (Figure 2.5, on the following page).
If you do fall into this category, then you need to add one more line to the docker-compose.yml file immediately after line 10, which contains this code:

image: kartoza/postgis:15-3.3

The line to add is the platform override:

platform: linux/amd64
This is because the Apple silicon chips use the ARM architecture, whereas most Docker containers and
other computers use the AMD64 architecture. Without going into too much detail these are different
microchip architectures that are used by various chip manufacturers. Once you have added the code, make sure the indentation is consistent with the line above it, as indentation matters in YAML.
The final file will look like Figure 2.6, on the next page.
5. Now we need to use our terminal to navigate to the location where our docker-compose.yml file is located. For illustration purposes, let's imagine that my file is located inside my Documents folder, within a folder named spatial-sql.
To get to that folder, open your terminal, which starts in your home directory, and use the commands we just learned. Our first step is to navigate into the Documents folder:

cd Documents

And then into the spatial-sql folder:

cd spatial-sql
To confirm you are in the correct location, you can run this command which will list the files in that
folder, where you should see the docker-compose.yml file:
ls
If you see it there then you are ready to move to the next step! If not, there are some shortcuts you can use to find the path of the file and navigate directly to its folder:
• MacOS: Right click (Control + Click) on the docker-compose.yml and you should see a menu pop
up. While that menu is open, click and hold the Option key and you should see an option named
Copy docker-compose.yml as path name (Figure 2.7, on the following page). Click on that then paste
the result into your terminal, and press Enter.
• Windows: While holding Shift, right-click on the docker-compose.yml file, then select Copy as Path. Then paste the result into your terminal, and press Enter.
• Linux: When the docker-compose.yml file is highlighted, press Control + L. This will make the
path editable and then you can copy that text, and then paste the result into your terminal, and
press Enter.
6. Now we can run the last step of the process which is running the command to start our containers:
docker-compose up -d
This command will download the Docker containers for the PostGIS database and the PostGIS backups container based on the specifications in the docker-compose.yml file, then build and start both containers. The -d flag tells Docker to run in detached mode, which means you will not have to keep your terminal open; once everything is set up for the first time, you can manage your containers via the Docker Desktop app.
Once you have run the command to start the Docker containers you should see them live in your
Docker desktop app (Figure 2.8, on the next page):
From here you can turn your Docker container on and off by pressing the stop button (see the arrow in
Figure 2.8, on the facing page).
Troubleshooting
First, you will need to find the logs for your new container using the Docker Desktop app. These will give you information about what is happening within the container. Your terminal will report any issues with the docker-compose command itself.
• docker: Error response from daemon: Conflict. The container name "/spatialsql" is already in
use by container "RANDOM_ALPHA_NUMERIC_CODE". You have to remove (or rename) that
container to be able to reuse that name.
– You already have a container running that is called "spatialsql". You can either
delete that container (if you are doing this for the first time I recommend that) or
change your container name.
• WARNING: The requested image’s platform (linux/amd64) does not match the detected host
platform (linux/arm64/v8) and no specific platform was requested.
– This will likely happen if you are using a Mac with an M1 chip (or Apple Chip).
You can add this flag to the end of the command, right before "postgis/postgis:15-
3.3": --platform linux/amd64
– Change the part after the --platform flag to match whatever name is in this section: The requested image's platform (linux/amd64)
• Bind for 0.0.0.0:5432 failed: port is already allocated.
– You have another tool on your machine or computer using port 5432, likely another
PostgreSQL or PostGIS database. Either shut down or remove that database, or
change both port values to something new like 5433
Now we can connect our new PostGIS database to QGIS. If you have not downloaded QGIS you can
do so on the QGIS website45 .
45 https://qgis.org/
We are using QGIS since it is a free and open-source project and allows us to quickly and easily view the data in our PostGIS database. The only limitation is that, to visualize the results of a query, you have to create what is known as a VIEW or a new TABLE from that query in your PostGIS database, which is why we will also show you how to connect pgAdmin to view data directly from queries. QGIS also has tools to easily import data in formats like Shapefile, GeoJSON, etc.
First, open QGIS and in the Browser panel on the left, right-click on the PostGIS label and click New
Connection (Figure 2.9).
Next we need to fill in the details for our connection. You can follow along here if you used the exact
commands as above, but if you changed any variables please adjust accordingly (Figure 2.10, on the
facing page).
• Name: any name you choose for your connection
• Host: localhost
• Port: 25432
• Database: gis
• Authentication
– Click the "Basic" Tab and enter
* User: docker
* Password: docker
From here you can test the connection to make sure everything is working appropriately, and if it is you should see the blue success message shown below (Figure 2.11, on the next page).
Troubleshooting
• Make sure your PostGIS Docker container is active and running by checking that it shows green in the Containers section of the Docker Desktop app. This is the most likely error at this stage.
• Ensure that all the parameters you entered match the ones in the Docker command you used.
Let's upload our first dataset into PostGIS. At this stage we can keep it simple by using QGIS to handle this for us. The dataset we will be using is a Shapefile of United States counties, which can be found in the resource downloads for this book. You can also use any other dataset of US counties that you want, but ours was downloaded from the US Census website.
First, download the dataset or open it from the book resources you downloaded. You
will need to unzip the .zip file named cb_2018_us_county_500k.zip. Once it is unzipped, you can either
go into that folder and double-click the file named cb_2018_us_county_500k.shp or in QGIS Layer →
Add Vector Layer. In the dialog, click the three dots in the section labeled Source → Vector Dataset(s)
and navigate to the folder to select the file cb_2018_us_county_500k.shp (Figure 2.12).
This will open up the counties layer in QGIS. To add it to PostGIS, in QGIS select the Database item in
the menu bar and click on DB Manager where you should see this dialog pop up (Figure 2.13, on the
next page):
From here, click on PostGIS → Spatial SQL (if that is the name you used for your connection) → Public.
Once Public is selected, click on Import Layer/File (Figure 2.14, on the facing page).
QGIS allows you to import layers that you have already added into QGIS as well as files directly. In
this case, we can just import the layer we already added to the map. You do not need to adjust any of the options unless you want to. One option that is helpful here is creating a spatial index, which we will cover later in the book.
Hit OK, and once the import is complete you should see the following dialogue (Figure 2.15):
Now it is time to run your first query in QGIS! In the same DB Manager window you can click on the
icon that looks like a piece of paper with a wrench (Figure 2.16):
This will open a Spatial SQL window for you to write SQL. There are a lot of different features within
this window for you to inspect tables and other features. If you click on the table you have created
which should be named cb_2018_us_county_500k (find it by going to PostGIS → Spatial SQL → Public →
cb_2018_us_county_500k) you will see several tabs.
• Info: This will tell you some general information about the table such as the geometry column
and type, projection, field names and data types (which will be important going forward)
• Table: shows a preview of a limited set of rows of the table
• Preview: shows a preview of the data in a map view
Now, navigate back to the tab that says Query (Spatial SQL) where we will run our first query:
select
    *
from
    cb_2018_us_county_500k
where
    statefp = '55'
Let's break down what this query is doing (note that we are not using any spatial SQL yet):
• SELECT - designates that this is a select statement which means we are going to query data from
the database.
• ’*’ - select all columns in the table.
• FROM - this is telling PostGIS that we are about to designate a table to pull the data from.
• cb_2018_us_county_500k - the table name.
• WHERE - this tells PostGIS that we are about to add a conditional statement to filter our data.
• statefp - this column contains the two-digit FIPS (Federal Information Processing Standards) code for each state. This column's data type is VARCHAR(2), which means it is a string with a length of 2.
• = - operator that tells PostGIS we are looking for values that match what will follow the "="
symbol.
• ’55’ - This is the FIPS code for Wisconsin. It is contained within single quotes since it is a string.
If we did not include the quotes, this would be a number and there would be no results returned.
Go ahead and click on ’Execute’. You should see a dialog that looks like this when complete (Figure
2.17):
We can see that the query returned 72 rows, which is the correct number of counties in Wisconsin. You
can see the rows that have been returned from the query as well.
To see the results on the map, we can actually load the layer on the map using the Load as new layer
toggle at the bottom. You can leave the options the same for our purposes or change them as you see
fit. Once it loads you should see the data on the map (Figure 2.18):
Before we move forward let's write some basic spatial SQL. First let's turn our counties into centroids. To do so we will use the ST_Centroid function. This is the definition from the PostGIS documentation:
Synopsis

geometry ST_Centroid(geometry g1);
geography ST_Centroid(geography g1, boolean use_spheroid = true);

Description

Computes a point which is the geometric center of mass of a geometry.

Let's first focus on the first line of the synopsis:

geometry ST_Centroid(geometry g1);
We can see that it starts with the word GEOMETRY. This means that the function will return a geometry. Functions return many different data types, so it is important to read the docs to know what is going into a function and what is coming out. The next part of the line says ST_Centroid(geometry g1), which tells us that the function called ST_Centroid takes one argument, a geometry.
You can also use this function with a GEOGRAPHY data type, which has an additional option called use_spheroid; we will discuss the difference between GEOMETRY and GEOGRAPHY later in the book.
As we go through this book I will try to reference back to the arguments that go into a function and the
return values, but I always recommend using the documentation and using it as a resource as you go
forward. Being able to read documentation independently is one of the core skills that will help you
become independent, not just in spatial SQL, but in any programming venture.
With that said, let's write our second SQL query:

select
    id,
    st_centroid(geom) as geom
from
    cb_2018_us_county_500k
where
    statefp = '55'
We separate the id column with a comma, then add this statement after it, which creates a new temporary column in the scope of our query. As stated above, ST_Centroid will return a geometry, and we have given it an alias, or temporary name, using the AS clause: geom. Note that in almost all cases you will need a unique identifier column in your data. If you can reuse one that you already have, that is usually the fastest route, but you can always create one if needed.
First, let’s query our data with this new function. This should return the results in the tabular format
so we can see that we have our id column and our new GEOMETRY column (Figure 2.19).
The values you are seeing in each row of the geom column are the geometry represented in Well-Known Binary. This is a common format used to represent geometries; for now, just know that the geometry data is there. Let's go ahead and try to add this query to the map (Figure 2.20, on the following page):
You will likely see the error shown above (note that this may vary depending on your version of QGIS). While you can visualize this data through a query alone in other tools such as pgAdmin, QGIS requires you to create something known as a view within your database. A view is sort of a "virtual table" that you can query just like a table, but it is based on the results of a query. This means that should the results of that query change, the view would change as well. You can click the button that says "Create a view" to do this (Figure 2.21) and add a new name for your view (Figure 2.22, on the facing page):
create or replace view wi_centroids as
select
    id,
    st_centroid(geom) as geom
from
    cb_2018_us_county_500k
where
    statefp = '55'
With that complete, we can now go to the left-hand panel in QGIS to find our newly created view, which we can then add to the map (Figure 2.23), resulting in the map seen in Figure 2.24, on the following page.
You have just run your first few spatial SQL queries! While there is a lot to learn, you have cleared one
of the first major hurdles by installing a database, loading data, and connecting to a tool to query and
view your data.
Name: Getu Abdissa
Title: GIS Specialist, UN Integrated Electoral Support Group (IESG)
I took a course called Mastering Geospatial SQL from Kuba Konczyk (https://kubakonczyk.com/). This was the best personal investment I made, as I paid for it out of my own pocket. It can be tricky not to fall for a
training ad that claims so much value but delivers a non-existent course module, so I was glad to have taken this course, as its modules helped me take my GIS and SQL skills to a whole new level. I had prior knowledge of basic SQL from working as an intern at the United Nations Economic Commission for Africa.
Although I am a fan and an ardent user of Excel spreadsheets, they're often annoying. They work slowly, when you make a lot of entries they get stuck, and it is difficult to manage them when working in a team. When working for GeoMark systems I had to deal with XYZ data points with more rows than Excel allows. Managing that data in SQL and relational databases is a much better solution. SQL is very powerful and far better than Excel for this; I enjoy it as it makes my job easier.
Can you share an interesting way or use case that you are using spatial SQL for today?
I have worked on an election project promoting a "one person, one vote" election in Somalia. My job as a GIS specialist included identifying and locating suitable voter registration/polling sites in a security-constrained environment. The best use case was using spatial SQL to answer questions such as which sites are securable (closer to security infrastructure) and which ones can serve internally displaced populations (IDPs) to ensure inclusiveness. Spatial SQL helped management make key decisions, such as whittling down the number of voter registration sites to optimize security resources while ensuring inclusiveness.
3. Thinking in SQL
Another major hurdle in moving from a traditional GIS setup is understanding how to transition from a desktop-oriented workflow, with files that live either on your desktop or in an enterprise GIS server, to a database setup. There are a few topics that you need to think about when making that move, and we will cover them in this chapter.
The main difference between a database and a desktop GIS system is that with a database or data warehouse you will generally need to import your data into the database to make use of it. In a traditional GIS environment, simply opening a file loads it into the program you are using.
While there are many file types that exist for both raster and vector data, there is not a single "file
format" that exists within a database. Since we are using PostGIS for the purposes of this book, I
will limit the examples to PostGIS since it would be too difficult to cover the nuances of every single
technology covered in the previous section.
The best way to think about data within a PostGIS database is to conceptualize files as tables. Tables
are tabular data that exist within the database, but also contain some other information about them.
If you open QGIS and click on the table cb_2018_us_county_500k that we already created in the DB
Manager dialogue you will see the information in Figure 3.1, on the next page.
You can see details about what the table type is, the owner, rows, and more. You can also see de-
tails related specifically to PostGIS such as the geometry column, geometry type, dimensions, spatial
reference system or projection, and an estimated extent or bounding box of the data.
While there is not a direct correlation between tables and files, this is probably the easiest way to think
about this when first migrating from a desktop to database set up.
While the focus of this book is not on database design and management, there are a few concepts and
tips that can help make your databases performant and organized.
When you look at the view in the DB Manager in QGIS, you will see that there is a tab called public (Figure 3.2). This is known as the schema; it exists within the database we are using and was created by the Docker setup when we created our database.
A schema stores lots of information such as triggers, sequences, tables, views, functions, stored proce-
dures, data types, and more. See Figure 3.3, on the next page for the comparable view of our public
schema in pgAdmin.
What is important for the scope of this book is using schemas effectively to save and separate data.
While this may not be the best example, you can use schemas similarly to how you might use folders
in a traditional GIS system. Each schema can hold data, grant access to certain users for specific opera-
tions, and be managed as needed. In this sense, schemas can provide an effective way to manage and
organize your data, especially coming from a desktop system.
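As a minimal sketch of that idea, the statements below create a schema, move our counties table into it, and grant read access to a hypothetical analyst role; the schema and role names are assumptions:

-- Create a schema to group related tables, much like a folder
create schema census;

-- Move an existing table into that schema
alter table cb_2018_us_county_500k set schema census;

-- Grant read-only access on the schema to a specific role
grant usage on schema census to analyst;
grant select on all tables in schema census to analyst;

Note that if you did move the table this way, later queries would need to reference it as census.cb_2018_us_county_500k, or you would add the schema to your search path.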
The other key point is using the GEOMETRY and GEOGRAPHY data types in your tables effectively. Geometries and geographies are likely to be the largest values within your tables. Let's take a look at an example using the data in our database.
Go ahead and run this query in QGIS:
select
    -- Select the county name and the storage size of that string in bytes
    name,
    pg_column_size(name) as name_size,

    -- Calculate the size of the geometry in bytes using ST_MemSize()
    st_memsize(geom) as size_bytes
from
    cb_2018_us_county_500k

-- Using order by to order the results of ST_MemSize()
-- from largest to smallest, or descending using desc
order by
    st_memsize(geom) desc
You will notice that there are now comments in the query, which can be identified since they start with "--". These comments won't affect the query and are there to better annotate what is happening in the query itself. You should see results that look like Figure 3.4.
Note that longer strings have a base overhead of 4 bytes, but for this data the total size will not exceed 126 bytes46.
Now, let's take a look at the byte storage required for geometric data, documented at the link in the footnotes and shown in Figure 3.5, on the next page47. We can see that the number of bytes in a polygon can be calculated by starting with a 40-byte base and adding 16 bytes for every point. We can try to confirm this with the following query:
46 https://www.postgresql.org/docs/current/datatype-character.html
47 https://www.postgresql.org/docs/current/datatype-geometric.html
select
    name,

    -- Count the number of points in each geometry using ST_NPoints()
    st_npoints(geom) as n_points,

    -- Calculate the size of the geometry using ST_MemSize()
    st_memsize(geom) as size_bytes,

    -- Using the formula we saw, calculate the size of the geometry as: 40 + (no. of points * 16)
    40 + (16 * st_npoints(geom)) as calculated_size
from
    cb_2018_us_county_500k
order by
    st_memsize(geom) desc
This time our results are too high! To see why this is the case, you need to know what the base bytes are actually storing. I pulled this information from a blog post by Dan Baston48.
48 http://www.danbaston.com/posts/2016/11/28/what-is-the-maximum-size-of-a-postgis-geometry.html
Why did we spend all this time looking at the data size and number of points in each geometry? Because the size of your geometries affects both how you organize your tables and how your queries perform, as the following scenarios show.
Imagine that you have two tables, one with postal code geometries and the other with a list of cus-
tomers. Each row of your customer data contains the following columns:
• First name
• Last name
• Address
• City
• State/Province
• Postal Code
• Customer Joined Date
• Customer ID
And you have a second table of postal codes that includes the postal code geometry and the postal code itself. You can easily join these two tables to perform queries as needed, such as:
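A hedged sketch of such a join, using hypothetical table names (customers and postal_codes) and the columns listed above:

select
    c.customer_id,
    c.first_name,
    c.last_name,
    p.geom
from customers c
join postal_codes p
    on c.postal_code = p.postal_code
where c.customer_joined_date >= '2023-01-01'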
This gives you the flexibility to perform any number of analytical queries and to join to the geometries only when you need them. You can also update the geometries table should the boundaries change, without impacting your customers table or any other data.
Imagine you are working for a municipality, and you use two datasets on a regular basis: land/property records and road centerlines. Both datasets have geometric data that is updated yearly or sooner, depending on changes to the records, along with other tabular data associated with the geometries.
As we saw in the first scenario, you can easily update your tabular data with changes to road names or property owners without impacting the geometry data in your other tables. That said, we also know that our geometries may change on a regular basis, with new roads being added or properties being built or even combined. The first advantage is that you can easily update the geometries table by adding, removing, or modifying geometry values as new geometries come into the database.
The other key practice is using a common table naming convention to store your data vintages. This means that you can create a table as a snapshot of your data in order to look at specific time periods, or a vintage of everything that has changed. You could even store a vintage containing just the changes if you wanted to. A sample naming scheme could look like this:
• street_centerlines_012023_032023
• property_geoms_2023_1
In our first example, our table has two dates with the MMYYYY formatting. This represents the months
the data falls between, in this case January 1, 2023 and March 31, 2023. Our second example has the
year, 2023, and the number 1, which stands for the version number for that year.
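As a minimal sketch of that idea, a vintage table could be created as a snapshot of a live table; the street_centerlines table and its updated_at change-tracking column are hypothetical names:

create table street_centerlines_012023_032023 as
select *
from street_centerlines
where updated_at between '2023-01-01' and '2023-03-31'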
While you could look for distinct values between two tables, you can also simply add a date column to the geometries table to show when each geometry was updated, since we know that other data types in PostgreSQL are smaller than geometries. This would allow you to write queries like this:
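A hedged sketch of such a query, assuming a street_centerlines table with road_id, geom, and updated_at columns (all hypothetical names):

select
    road_id,
    geom
from street_centerlines
where updated_at between '2022-01-01' and '2022-06-30'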
This would return all the roads that have been added between January 2022 and June 2022.
This applies to some use cases but not all, and in the end you need to decide how you want to manage your data. But in cases where data changes often, or where you have non-geospatial data that you need to join to a geospatial component, this practice has proven helpful for organizing and separating your data.
This is a topic we will cover later in this book, but indexing in PostGIS is another method that can improve performance by making it easier for the database to identify the specific features it is looking for, especially in the context of spatial relationship queries. It is fairly simple to add an index to your data, but it is another area where you need to understand when and how to apply it. In short, I have two rules for indexing:
• If you are going to use the table to perform spatial relationship analysis such as intersections,
overlaps, nearest, etc., use an index
• If you are going to query the data within a bounding box on a regular basis, use an index
There are a few different index options, for example, R-Tree shown in Figure 3.9, on the next page, but
we will share some rules of thumb when to use which one later in the book.49
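As a minimal sketch, a spatial index on our counties table can be created with a GiST index, the most common spatial index type in PostGIS; the index name is arbitrary:

create index cb_2018_us_county_500k_geom_idx
    on cb_2018_us_county_500k
    using gist (geom);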
Automatic transformations
One great tool in PostgreSQL that can provide some really great functionality, especially for regularly
updated data, are triggers. A trigger is basically a tool that, upon a set of conditions being met, runs
some code to complete a process. Imagine that you have a table that is being updated with data col-
lected from the field. You can create a trigger that will automatically take some data, say a latitude
and longitude, and turn that into a geometry using PostGIS functions to do so. You can create triggers
49 https://www.crunchydata.com/blog/the-many-spatial-indexes-of-postgis
Figure 3.9: R-tree indexing visualized from PostGIS Workshops (postgis.net) - Introduction to PostGIS - 15. Spatial
Indexing
for any number of operations in your database, and the functions that are run can be as simple or as
complex as you desire.
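As a hedged sketch of the latitude and longitude example above, assuming a table named field_observations with longitude, latitude, and geom columns (hypothetical names):

-- Function that builds a point geometry from the latitude/longitude columns
create or replace function set_geom_from_lat_lon()
returns trigger as $$
begin
    new.geom := st_setsrid(st_makepoint(new.longitude, new.latitude), 4326);
    return new;
end;
$$ language plpgsql;

-- Trigger that runs the function whenever a row is inserted or updated
create trigger field_observations_set_geom
    before insert or update on field_observations
    for each row
    execute function set_geom_from_lat_lon();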
I recommend thinking of triggers as a way to annotate your data or add new columns automatically as you need them, instead of creating new columns manually every time. With SQL you can automate a lot of the data annotation and other processes that were more manual in a desktop environment.
Exporting data
Just as data has to go into a database, it has to come back out as files for others to download and use as
needed. Unless you are providing database access to many users, you will need to export your data to
share with those who do not have access to your database.
There are a few different ways to do this but once you have added a layer to your map in QGIS you
can right-click on that layer to export the data as the file type of your choice. Similarly, you can use the
ogr2ogr tools in the Geospatial Data Abstraction Library, or GDAL, to export your data as the format of
your choice.
The advantage over a desktop GIS system is that you can simply export the results of a query, rather
than creating a new table. Once you have run a query you can just provide that query and export the
results.
3.5 Projections
As with all geospatial data, projections are critical to managing and using your data. While projections are embedded in most geospatial files you might work with in a traditional GIS setup, projections can be used in a few different ways in PostGIS:
• Inherited when the data is ingested
• Assigned by the user for geometries created in the database
• Transformed to a new projection using ST_Transform
Each of these will require you to use a spatial reference ID, or SRID, with your data. All the spatial references that are installed by default are stored in a table in PostGIS named spatial_ref_sys. We can query this table to see specific values:
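For example, a simple lookup like this (EPSG 4326 is just an illustrative choice) shows what is stored for a given SRID:

select
    srid,
    auth_name,
    auth_srid,
    srtext,
    proj4text
from
    spatial_ref_sys
where
    srid = 4326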
You can also add projections using the epsg.io website50. You can look up the projection you want to add and then use the code provided to add that projection to your PostGIS database. Let's imagine we want to add a new projection for Hennepin County, Minnesota, or EPSG 104726.
You can go to epsg.io/10472651 to find the projection and other details, as in Figure 3.11, on the next
page.
If you scroll down you can then find this section (Figure 3.12, on the following page) which has details
of the projection for various languages.
You can then copy the PostGIS code, which is all SQL, to add that projection to your PostGIS database:
If you try to run this function, you will see that it is already in your database! You can run this query to create a new view from this data and then load it in QGIS:
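A sketch of what that could look like, assuming a hypothetical table named mn_parcels whose geometry we want to reproject with ST_Transform into the Hennepin County SRID:

create view mn_parcels_hennepin as
select
    id,
    st_transform(geom, 104726) as geom
from
    mn_parcels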
50 https://epsg.io/
51 https://epsg.io/104726
One thing to note is that almost all web-based mapping services use the projection EPSG 3857, or WGS 84/Pseudo-Mercator, and for analytical purposes many data warehouses are limited to EPSG 4326, or WGS 84.
Finally, understanding the difference between tables, views, and materialized views is key.
• Tables are stored on disk, meaning the data itself lives within the database.
• Views are saved queries over one or more tables. You can take a complex query, give it a name as a view, and when you query that view you are basically querying the results of that query.
• Materialized views are sort of in between. They do the same thing as a view, but the results are written to disk in a cache, or temporary storage, so they can be read back faster.
I recommend using views for slices of data that you want to query or share, and materialized views for
the same but for data that will be frequently queried or accessed. This can save you some time and help
performance in certain cases. This is a far different concept compared to a desktop GIS infrastructure.
The closest thing that I can imagine would be a map file from a desktop service that references a data
file (or table) and has some filters or transformations applied to it (the view).
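As a small illustration of the syntax (the view names and the filter are hypothetical):

-- a view: the underlying query runs every time you select from it
create view brooklyn_311 as
select * from nyc_311 where borough = 'BROOKLYN';

-- a materialized view: the results are stored and refreshed on demand
create materialized view brooklyn_311_cached as
select * from nyc_311 where borough = 'BROOKLYN';

refresh materialized view brooklyn_311_cached;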
The other area that can be a big hurdle for many is translating your knowledge of working in a point-and-click GIS environment into independently structuring a SQL query. This was a major leap for me when I was learning SQL. Previously I thought about each step individually: first perform a join to other data, then perform some spatial operation, then apply a filter, and so on. On top of that, there are a few areas in SQL that can lead you to run very large or complex queries that stall out.
Knowing a few general concepts and some tips that I have used in the past will help set you up for
success. These tips are applicable for anyone using spatial SQL and even SQL in general. In this section
we will cover a few concepts that will help you understand how a SQL query works under the hood,
and how to take a look at where there may be bottlenecks or issues in your query using pgAdmin. We
will also talk about using pseudo-code to script out what you want to accomplish in your SQL query
and how to translate that into a query that you will write. Finally, we will talk about a few specific
things that can cause roadblocks with geospatial data in SQL.
To better understand how to write a SQL query, it is important to understand what happens after you run your query and how the database, in this case PostGIS, handles it once you send it off for execution.
1. Row Filtering: This is where the query will figure out what rows to include or exclude based on
parameters in your query. This includes FROM / JOIN, WHERE, GROUP BY, HAVING.
2. Column Filtering: This stage is where the query will pick which columns of data it needs to
return and is limited to SELECT.
3. Row Filtering (Part 2): This last phase is where the query orders the remaining rows and decides which to keep. This includes ORDER BY and LIMIT.
You are likely wondering why you need to know these different steps and how this can apply to how
you think about structuring your queries. Here is a simple example that can help you imagine why this
matters.
Imagine that we have two tables of data (we will be using this data and importing it in the next chapter):
• A table of 311 Call Requests in New York City from 2021
• A table of postal codes (known as ZIP Codes in the United States) containing the postal code
geometry for New York City
• Each table contains a 5 digit value for the postal code (i.e. 10001)
Our job is to create a map of postal codes containing all the incidents where the "complaint_type" is
equal to "Illegal Parking". I will share the queries here which we will review in detail later, but the idea
is to show how different approaches to writing a query can impact query time and performance.
First, let’s install and use pgAdmin so we can run some ad-hoc queries and use the built-in tools to
see the performance plans of the query. First, go to pgadmin.org/download/52 and download the
appropriate version for your computer. Once you have it installed, you can navigate to Object →
Register → Server. You should see this window pop up where you can enter the same details as we did
when establishing the PostGIS connection in QGIS, see Figures 3.13, on the next page and 3.14, on the
facing page.
You will only need to enter details into the General and Connection tabs. Once you add your connection,
you will see a lot more information about your database. Since pgAdmin, as the name implies, can be
used for any number of administrative tasks, it shows every aspect of the database that is available. For
52 https://www.pgadmin.org/download/
our purposes we will primarily use the information in our database which can be found by navigating
to (assuming you used the same names as we have in the book so far):
Spatial SQL → Databases → geo → Schemas → Public
Once there you can find two sections, Tables and Views where your tables and views will be listed
respectively (Figure 3.15).
While we have not imported this data yet, we will be importing this data in the next chapter. This
query will use several elements we have not yet covered so the query itself will look confusing and I
will only annotate it with the high level elements for this exercise.
First, let’s join the two tables in one single query:
16 where
17 nyc_311.complaint_type = 'Illegal Parking'
18 group by
19 zips.zipcode,
20 zips.geom
This query took just under 40 seconds for me to run (I also applied an index to the complaint type
column which makes it a bit faster). First, there are a few things we can fix.
Since we know that the query will first look for the rows to include using FROM/JOIN, WHERE, GROUP BY, and HAVING (this last one is not in our query), we know that we want to optimize these parts to be as efficient as possible. We can actually see how the query is executing in pgAdmin. Instead of using the "Play" button to run our query, we can press the small button that looks like a bar chart to run an EXPLAIN ANALYZE on our query. This basically shows us how the query is executing (Figure 3.16).
This will return a graphical view of our query. Now, I don't expect you to fully understand the EXPLAIN ANALYZE results, but we will refer to them a few times during the course of the book so we can look at the query and understand some high-level concepts about query improvements. Once you learn a few key concepts, you should be able to optimize most of your queries and use EXPLAIN ANALYZE to find any major bottlenecks in your code.
With that said let’s take a look at the results of our EXPLAIN ANALYZE call, in Figure 3.17.
What you can see here is a visual representation of how our query is being run. Basically the steps are:
This is where we can make a few optimizations to improve our query. First, we are counting all the columns in each row of our nyc_311 table, which is not necessary. We can count across one column, in this case the unique id column. Our new query will look like this:
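A hedged sketch of that revised query. I am assuming the unique id column is named unique_key, that the zip code table is named nyc_zips, and that the two tables are joined on their shared 5-digit postal code columns (incident_zip and zipcode), so treat the names as illustrative:

select
    zips.zipcode,
    zips.geom,
    count(nyc_311.unique_key) as total
from
    nyc_zips zips
    join nyc_311 on nyc_311.incident_zip = zips.zipcode
where
    nyc_311.complaint_type = 'Illegal Parking'
group by
    zips.zipcode,
    zips.geom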
A bit of an improvement, down to 30 seconds total! Next is one major lesson for writing your queries.
One strategy is that you should think of your queries as a funnel. You want to get rid of as much data
as you can in earlier stages of your query. To do this we can use a Common Table Expression or CTE
that essentially acts as a temporary table in the scope of the query.
Since our nyc_311 table has the most records (over 3.2 million rows), let's get rid of as many of those as we can using a CTE:
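A sketch of that attempt, moving the complaint filter into a CTE (same assumed table and column names as above):

with illegal_parking as (
    select
        unique_key,
        incident_zip
    from
        nyc_311
    where
        complaint_type = 'Illegal Parking'
)
select
    zips.zipcode,
    zips.geom,
    count(illegal_parking.unique_key) as total
from
    nyc_zips zips
    join illegal_parking on illegal_parking.incident_zip = zips.zipcode
group by
    zips.zipcode,
    zips.geom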
This query takes about 39 seconds to run, an increase in time. What did we do wrong? When we check
out our analyzed query we can see that we didn’t really modify the query execution that much at all
(Figure 3.18, on the facing page).
We still have to scan all the 3+ million rows and join to the roughly 100k results that are returned
from the CTE. Let’s move more work into the CTE, namely the aggregation of the data using the COUNT
function.
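Pushing the aggregation into the CTE as well could look roughly like this:

with illegal_parking as (
    select
        incident_zip,
        count(unique_key) as total
    from
        nyc_311
    where
        complaint_type = 'Illegal Parking'
    group by
        incident_zip
)
select
    zips.zipcode,
    zips.geom,
    illegal_parking.total
from
    nyc_zips zips
    join illegal_parking on illegal_parking.incident_zip = zips.zipcode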
Now our query only takes about 21 seconds to complete. When we look at the plan in another tool,
called Dalibo, we can see (in Figures 3.19, on the next page and 3.20, on the following page) that
the majority of our query cost is tied up in the part of the query that finds all the rows that match the
"Illegal Parking" category.53
The only things we might be able to do are to create a view of our data and query that view, try some
different indexing methods, or cluster our table to group the categories together. Most of these things
fall in the category of database administration which is out of the scope of this book. Figure 3.21, on
page 71 shows the final results of that query.
What we can apply here is our first lesson about structuring our queries: the more you can limit the amount of data you are working with in the early stages of a query, much like a funnel, the better your query performance will be. Try as much as you can to reduce the data you are querying in the first steps of your query, or limit it to only what you need, to get a performance improvement.
The other concepts I like to use when trying to improve or test a query are:
• Use EXPLAIN ANALYZE to see where there are roadblocks and whether they persist throughout the query. pgAdmin has great built-in tools, but some of the web viewers are also good to help here too. Basically, you should be looking out for issues where lots of data is making it through to the later stages of your query.
53 https://explain.dalibo.com/
This is pulled from a PostGIS workshop on the PostGIS website54. What appear to be equal polygons in the image are actually not, since the ordering of the points is different in each polygon. Equality is a strict concept in any programming language, and spatial SQL is no different. You can read more about the details of this in the workshop link, but for our purposes, avoiding grouping by a GEOMETRY is a good general practice. I would also reference this post by Simon Wrigley that has more details on the topic55.
As we go through the course of the book we will use a few challenges that will help build your skills to become independent in your ability to write spatial SQL. This will take time and practice, but two techniques I have leaned on are pseudo-code, the process of writing the code you want to build in human-readable text, and rubber ducking, the process of talking through your code out loud.
54 https://postgis.net/workshops/postgis-intro/equality.html
55 https://loc8.cc/sql/linkedin-simon-wrigley
The name comes from an excerpt in the book The Pragmatic Programmer, where a programmer carried around a rubber duck and talked to it out loud to work through tough coding problems.
While these strategies may seem a bit strange, using them in the early stages to think through the queries you want to write will help you become self-sufficient in the long run. In this section I'll share some examples of how to do this, and I will also reference back to these concepts later in the book.
Pseudo-coding
Pseudo-code is essentially the process of writing out how you want an application, function, or, in our case, a query to run. There are no hard and fast rules on how to write pseudo-code, and it can be as structured or unstructured as you like. Of all the tutorials I took a look at to reference for this book, strangely enough the one from WikiHow provided the best walkthrough of how I use pseudo-code56.
My recommended process is three steps:
You can see I added in the SQL keywords where I could and then used the relevant column names
from the data. From here we are left with one challenge: how to find all the dates that are in July. To
do this using the date column, we want to identify a function that can help us extract the month from
the date. There are two different functions to do this but in this example we will use the DATE_PART
function to accomplish what we need. The function takes two arguments: the element you want to
extract which can be anything from the century of the date to the timezone, but in our case the month,
56 https://www.wikihow.com/Write-Pseudocode
and the column containing the data57. We know that the return value will be the number representing the month, in this case 7.
Let’s edit our pseudo-code to construct our query:
SELECT all columns
select *
from nyc_311
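Continuing the pattern of pseudo-code followed by the SQL it becomes, the remaining step could look something like this (created_date is an assumed column name, and DATE_PART is the function we identified above):

FILTER to only the rows where the month of the created date is July
where
    date_part('month', created_date) = 7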
And the result in QGIS is shown in Figure 3.23, on the next page.
This is a relatively simple example of what we want to accomplish. Of course, it assumes a perfect world with perfect data, which I am sure you all know is never the case. Let's walk through a real example we will address in the next chapter.
Rubber ducking
Rubber ducking is a way to work through a problem by verbally talking it out. There are plenty of occasions where I have used this process and actually found an answer. More times than I can count, I have tried to Google my way through a problem, trying out any number of random solutions I found online, only to find that actually talking the problem through gave me much better results.
Now let’s review the query from our last step:
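Written out against the real table, the query we are reviewing would be along these lines (created_date is still an assumed column name):

select
    *
from
    nyc_311
where
    date_part('month', created_date) = 7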
Now, what actually happened is that I got an error on my query telling me that the function DATE_PART did not match the arguments I was giving it, with the following hint:
HINT: No function matches the given name and argument types. You might need to add explicit type
casts.
57 https://loc8.cc/sql/postgresqltutorial-date
So, what could be the issue? Take a minute to think about it. I will write out my rubber ducking in the next paragraph to show a real example:
We know that the DATE_PART function takes two arguments: first is the date part you want to extract, in this case 'month'. The second is a date or timestamp. Is there an issue with the date created field?
And in fact there is an issue with that field: when we imported our data, it came in as a character field, not an actual date. When we import this data into our database in the next chapter we will handle this issue, but for now we can simply cast, or transform, our data on the fly like so:
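A sketch of the fix, casting the text column to a timestamp inline:

select
    *
from
    nyc_311
where
    date_part('month', created_date :: timestamp) = 7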
Problem solved. We will use this technique a few times and then provide some prompts for you to use it yourself, since some of the queries will have issues built into them that you will need to solve (the answers will be provided, of course). These two techniques will help anyone of any skill level start to become fully independent.
3.9 Expert Voices: Giulia Carella, PhD
As a data scientist working with spatio-temporal data, I enjoy using spatial SQL since it allows processing geospatial data efficiently, directly within data warehouses, reducing the need for complex data transfers between different systems.
Can you share an interesting way or use case that you are using spatial SQL for today?
Today, I have been working on creating a composite score to identify high-priority areas for improving network accessibility for senior citizens in the US. The analysis consists of estimating average download/upload speeds everywhere in the US, creating the composite score by taking into account the presence of the senior population, the network speed, and the level of urbanity of an area, and then leveraging an LLM to interpret the results (Figure 3.24, on the following page).
4. SQL Basics
In this chapter we will start to learn the fundamental building blocks of SQL, or "vanilla" SQL (since this is a book focused on spatial SQL), to lay a solid foundation before we move into spatial SQL. I believe this is the chapter where you will start to see the true power of SQL and why it is such a popular language, not just in the geospatial industry, but for all aspects of data storage and analysis.
While many books teach SQL in the order of SQL operations, I feel it is easier to learn the fundamentals first: data types, functions, and some simple processes like aggregations and WHERE operators. From there we will address more advanced topics in the next chapter. It is important to note that these next two chapters cannot possibly cover all the nuances of SQL. The goal of this chapter is to give you the fundamental elements that we will use in all SQL queries, and that will be of particular importance in spatial SQL.
But before we get there, we need to start importing some data into our PostGIS database!
4.1 Importing Data to PostGIS
We already covered the process to import data into PostGIS with QGIS, and this will work great with
smaller datasets that are of the size that can easily be viewed/loaded into QGIS. But for larger data we
need a few other strategies to do this. There are three methods we will cover in this chapter for vector
data: shp2pgsql, ogr2ogr, and the COPY command. These three commands will allow you to expand the
scale of data that you can import into your database and with various data formats.
My two favorite methods to use are ogr2ogr and the COPY command. I use ogr2ogr when I have different formats of data like GeoJSON or KMLs. ogr2ogr is also very useful when you want to do some transformations of your data before it lands in PostGIS (using - you guessed it - spatial SQL). I use the COPY command when I have larger datasets, but it only works with CSV files.
4.2 ogr2ogr
If you have not used the GDAL library before, now is as good a time as any to start. GDAL is one of the most fundamental libraries in the modern GIS ecosystem, and it is used in some of the most popular geospatial tools today, including GeoPandas, QGIS, GeoServer, GRASS GIS, Rasterio, and even parts of PostGIS (for example, raster import).58
The ogr family of functions was a separate set of tools meant to work with vector data, but was merged into the GDAL project. It originally stood for OpenGIS Simple Features Reference Implementation and, while the original reference has changed, the name stuck59. ogr2ogr (often pronounced "ogre to ogre", which has to be one of my favorite tool names in all of geospatial) is basically a utility that converts one vector file type to another. As of the writing of this book, there are 79 different vector file formats supported by GDAL.
As it relates to our use case in the scope of spatial SQL, there are two core elements that you can use
58 https://gdal.org/software_using_gdal.html#software-using-gdal
59 https://gdal.org/faq.html#what-is-this-ogr-stuff
it for: importing data into PostGIS, and transforming data within the ogr2ogr command using SpatiaLite.
Installing GDAL
Before we do anything, we need to install GDAL on your computer. GDAL can be installed a few ways, but for our needs we will be using the command line version, although you can certainly use Python or other versions too. With that said, the installation instructions vary based on your operating system if you want to install GDAL directly on your machine. For our needs, and to ensure interoperability between operating systems, we will again be using Docker. If you do want to install GDAL on your system directly, below are some quick guides to help you do so.
Windows
The GDAL website recommends using Anaconda and Conda Forge to install GDAL. Anaconda is a
package management system and platform for Python packages.
First, go to the Anaconda website at anaconda.com60 and find the "Download" tab61. The website should detect your operating system and direct you to download the correct version. When the installation package finishes downloading, go ahead and install it on your computer.
Once it is done installing you can open up the Anaconda program. You can also follow the instructions
from the USGS62 to download and install the Windows binaries directly.
Linux/Ubuntu
In Linux, you can use the Advanced Packaging Tool, or "apt" package manager, to download and install
GDAL. Make sure you perform the correct updates of apt prior to installing. Below is the command
you can use to install GDAL:
sudo apt install libgdal-dev
Of course, you can always use the direct website download as well.
MacOS
For macOS, you can use the Homebrew package manager. You can install Homebrew using the commands listed on its homepage63. Once you have that installed you can use this command:
brew install gdal
Of course, you can always use the direct website download as well.
As stated earlier, I recommend using Docker to install GDAL for all the same reasons we discussed when we installed PostGIS. The only downside is that you need to add a few extra arguments to the command to run it, but overall using Docker ensures compatibility across all systems. To install GDAL via Docker you can follow these steps:
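The core step is pulling the image from your terminal, which should look something like this (the image name matches the one used in the rest of this chapter):

docker pull osgeo/gdal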
60 https://www.anaconda.com/
61 https://www.anaconda.com/products/distribution
62 https://loc8.cc/sql/nationalmap-gdal-install
63 https://brew.sh/
Or you can open the Docker application and, in the search bar in the blue header area, search for the image name, in this case "osgeo/gdal".
Once you have found it you can go ahead and hit "Pull" to install GDAL. Once it is completed we can
run our first command to make sure it has installed correctly.
The command we will be running to start the container looks like this:
docker run --rm -v //Users:/Users osgeo/gdal gdalinfo --version
64 https://github.com/OSGeo/gdal/pkgs/container/gdal
/Users/matt/Documents/spatial-sql-book/file.csv
While this has some extra steps, Docker provides an easy solution to make sure the same commands work on any computer that can run Docker. For this chapter I will include the full commands with the Docker arguments, but going forward I will only include the specific GDAL command to shorten the code, so please keep this in mind as we proceed.
With that, let's import one of our first files. We are going to import some data from the New York City Taxi Open Dataset65.
The NYC Taxi and Limousine Commission (TLC) now publishes their data with only the taxi pickup zone as the location identifier. However, I have a copy of the original data from the first 15 days of June 2016, which has the latitude and longitude pick-up and drop-off locations. The data has 5,651,686 total rows and is stored in a file type called Parquet.
Parquet is a file format that stores tabular data in what is known as a columnar storage format. Instead of storing data in rows, as many systems including PostgreSQL do, Parquet stores data in columns with a common index. This allows for more efficient compression, resulting in faster loading and smaller file sizes. As of 2022, there is an effort underway to create the GeoParquet file type, which will provide the same advantages for geometry data.
The good news for us is that GDAL already supports Parquet and GeoParquet, so we can import this
file with one command right into PostGIS.
These files are available in the book repo for download. To load them we can use the ogr2ogr command from GDAL, which will translate the file from Parquet directly into PostGIS. Below is the command we are going to run, without the Docker commands attached to it.
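A rough sketch of the shape of that command. The connection details and the file name are placeholders for your own setup; -f sets the output format, PG: holds the PostGIS connection string, and -nln names the new table:

ogr2ogr \
    -f PostgreSQL \
    PG:"host=localhost user=your_user password=your_password dbname=geo" \
    /Users/matt/Documents/spatial-sql-book/taxi_trips.parquet \
    -nln nyc_yellow_taxi_0601_0615_2016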
This will take some time to run, but once it completes you should see your data show up in your PostGIS database. Let's take a look at the data in pgAdmin to see how it all looks.
The COPY command allows you to copy data to and from your PostGIS database, but it only works with the CSV file format. It does exactly what it sounds like: it simply copies the data directly from the CSV file into the PostgreSQL database. The data we are going to import in this case is the NYC 311 data66 that you saw in the previous chapter.
The data is available in the book repo as well; the data I downloaded were calls from 2021, which contains 3,214,361 rows of data. The only caveat with the COPY command is that you have to create a table matching the schema of your CSV file before you can copy the data into it. I created the command for this table, which you can see here. We will cover more about this in the coming sections, but effectively this command defines the names of the columns followed by their data types. This means that we are defining the data type for each column we are pulling into the database based on the data in our CSV. Since we have the data dictionary in the NYC Open Data Portal, we can use that to know what data type is contained within each column.
66 https://loc8.cc/sql/cityofnewyork-311
16 intersection_street_2 text,
17 address_type text,
18 city text,
19 landmark text,
20 facility_type text,
21 status text,
22 due_date timestamp,
23 resolution_description text,
24 resolution_action_updated_date timestamp,
25 community_board text,
26 bbl text,
27 borough text,
28 x_coordinate_planar numeric,
29 y_coordinate_planar numeric,
30 open_data_channel_type text,
31 park_facility_name text,
32 park_borough text,
33 vehicle_type text,
34 taxi_company_borough text,
35 taxi_pickup_location text,
36 bridge_highway_name text,
37 bridge_highway_description text,
38 road_ramp text,
39 bridge_highway_segment text,
40 latitude numeric,
41 longitude numeric,
42 location text
43 )
Once you run this command we can then COPY the data into our target table. To do so, we need to open a psql shell within pgAdmin. psql is a terminal-based command line tool to query and access your database. You can install it yourself, but since it is already included in pgAdmin we can go ahead and use it there.
To open the psql terminal you can click this button in pgAdmin (Figure 4.4):
And you should see this window when it is ready (Figure 4.5, on the following page):
The command we are going to run looks like this:
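The general shape of the command is something like this, with a placeholder path to wherever you saved the CSV:

\copy nyc_311 from '/path/to/311_2021.csv' with (format csv, header true)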
4.3 SQL data types
Now we are ready to take a look at data types and the functions to manipulate those data types in
PostgreSQL. Even if you have no experience in programming, you have likely come across data types
and functions to change or manipulate data types in your other GIS work or in tools like Excel. For
this part of the chapter we will only be using simple SELECT queries that take a subset of rows from our
table using the LIMIT argument to show the different data types and functions for transforming them.
Since our goal with this chapter, and the next, is to get you up to speed writing spatial SQL as fast as possible, we will start with the fundamentals and progress in complexity. This will also give you a good understanding of working with GEOMETRY or GEOGRAPHY data when we get there, since it is, in fact, just another data type in SQL.
One final note. Since we are using PostGIS for our examples it is worth noting that PostgreSQL has its
own syntax, or flavor, of SQL. While SQL is highly standardized, there are some variations between
each database or data warehouse.
Okay, so let’s take a look at the different data types that are available in PostgreSQL:
• Boolean: true or false values
• Characters: text or string data
• Numeric: integers and floats (data with decimal values)
• Date/Time values: dates, times, time stamps, etc.
• UUID: Universally Unique Identifiers or an alpha numeric code that is completely unique
• Array: Data that exists in square brackets, [], that is equivalent to a list in Python
• JSON: Data that has key value pairs that exists between curly brackets (ex. {’key’: ’value’}) that
can contain nested data
• hstore: Key value pairs only, but requires an extension
• Geometry/Geography: The stuff you are all here to learn about, geometric and geospatial data
• Network addresses or IP addresses
Now while NULL is not a data type per se, it can be in a column of any data type you have in PostgreSQL.
NULL is effectively the absence of data. There are some specific ways you can use NULL data and some
things you need to be aware of when your data has NULL values or is lacking NULL values.
For example let’s consider that you have some data, and it contains a list of first names and last names,
however, the values that don’t have a name are just blank. Below is a table example of this.
firstname lastname
Jim Smith
Nancy Drew
NULL NULL
Now let's try to query to find the values where there are no last names. Keep in mind that since NULL is the absence of data (even a cell with just a space, or with no characters at all, still counts as data), we can't use operators like = or != because there is nothing to compare to. In this case, after our WHERE operator we have to add the statement IS NULL to find our NULL values.
As we can see, we still have all our values, including the ones where the firstname seems to be empty. This is because the value in that row came in as an empty string, or text data without any characters. This is not a NULL value because, while there are no characters there, the data is still a string.
This is something to be aware of when you are working with your data and you see some odd results.
As we go forward keep in mind that any data type can contain NULL values so in addition to all the
options available, NULL is also an option.
Boolean
The boolean data type represents a true or false value, and these are the only possible options for a
Boolean column. However, there are some different ways you can represent a Boolean column but
keep in mind that the column must be a Boolean to use these values.
True False
true false
t f
y n
yes no
1 0
The Boolean type is very simple, and there are not really any functions used to transform or update it. With that said, to query or explore the data you will need to use operators, which are effectively functions themselves and are most often used in the WHERE section of your query.
Another thing to know about Boolean values is that many functions have a Boolean as a return value,
and many with spatial SQL as well. Let’s walk through an example.
Let’s query a sample of our data from our NYC 311 data:
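Something along these lines, with the selected columns chosen just for illustration:

select
    complaint_type,
    descriptor,
    city
from
    nyc_311
limit
    5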
As you can see, we have three columns and five rows. We used the LIMIT statement to limit our data to 5 rows. Now let's filter our results to Brooklyn using a WHERE statement on the city column:
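A sketch of that filter, reusing the same illustrative columns:

select
    complaint_type,
    descriptor,
    city
from
    nyc_311
where
    city = 'Brooklyn'
limit
    5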
What is happening here is that the WHERE clause is telling the SQL execution engine to only include
the results that match true. In different terms, the statement is going row by row and evaluating this
equation:
city = ’Brooklyn’
If the value in city matches 'Brooklyn', the expression (in this case using =) evaluates to true and the row is included. To show this, we can add a calculated column to our query.
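For example, something like this adds the evaluated condition as its own column (the is_brooklyn alias is just a name I made up):

select
    complaint_type,
    city,
    city = 'Brooklyn' as is_brooklyn
from
    nyc_311
limit
    5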
As you can see we now have a column of Boolean values that match our condition. This is important
to understand and can help us better understand the WHERE clause and conditionals that go with it. We
will go into this in greater detail but effectively every condition after a WHERE clause is a condition that
will evaluate to true or false in one way or another.
4.4 Characters
This data type represents any type of text, also known as a string. This includes all characters, numbers
(that are inside of a string), and symbols in any language. Strings will (almost) always start and end
with single quotes. So in effect this can be any text that you want to store in your database.
There are three different data types you might see within PostgreSQL:
• VARCHAR(n): A string of varying lengths with a limit of length n
• CHAR(n) or CHARACTER(n): A string of exactly n length, padded for blanks
• TEXT or VARCHAR: A string with no fixed length
For the most part you will likely be using TEXT as your primary data type for strings unless you want
to enforce a strict limit on the number of characters in a string, in which case you can use VARCHAR(n)
or CHAR(n) depending on your need.
Let’s take a look at some string data in our database and learn a bit more about the character data type
and some functions we can use to work with characters.
First let’s run this query in pgAdmin which should return these results (Figure 4.6, on the next page):
6 nyc_2015_tree_census
7 limit
8 5
Let’s review the following functions that we can use to manipulate the text in these results:
• CONCAT
• ||
• LEFT
• RIGHT
• INITCAP
• REPLACE
• REVERSE
• LENGTH
• LOWER
• SPLIT_PART
• UPPER
CONCAT
In the case that you want to concatenate, or join, various text fields together, we can actually do this in two different ways. The first is using the || operator between two or more strings. Below is the query that we will run to concatenate the tree common name column to the health column in the NYC trees dataset, with a " - " between:
3 from
4 nyc_2015_tree_census
5 limit
6 5
You can accomplish the same results with the following query:
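A sketch of that equivalent query using CONCAT (spc_common and health are the assumed column names for the tree common name and health):

select
    concat(spc_common, ' - ', health)
from
    nyc_2015_tree_census
limit
    5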
Before we proceed I want to call out one important point about using and reading documentation. Both
PostgreSQL and PostGIS have excellent documentation, but at first glance the documentation, or docs
for short, can be hard to read if you have never spent much time using or reading docs. They can be
very helpful once you are able to navigate them and in the end, they can save you a ton of time from
random Googling and reading StackOverflow articles. With that let’s take a look at the documentation
for the two functions we used today which you can find at the link in the footnotes67 .
First let’s look at the docs for the || function (Figure 4.7):
As you can see here, the function structure is a bit different, as well as the example and description. We can see in the example, concat('abcde', 2, NULL, 22), that there are other values here too: a string, two integers (2 and 22), and a NULL value.
We can also take a look at the function definition: concat(str "any" [, str "any" [, ...] ]). Certainly this looks a lot different from the other definition. What this shows us is that each argument can be a string or "any", which means any data type. We see this repeated, and then the three dots (...), which means that you can pass as many arguments as you choose to this function.
If we take a look back at our original approach, we can see that there is also another entry for the "||" operator (Figure 4.9):
We can see that || can accept any number of arguments as long as at least one is a string. So what is the difference between these functions? Well, not much really. But we can see here that by checking the docs and reading everything, we have all the information we need. Let's proceed by looking at some other functions.
LEFT, RIGHT
These two functions allow you to trim your string to a specific number of characters from either the left
or the right position on the string. You can use this query to see that in action:
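A sketch, again assuming the common name column is spc_common:

select
    spc_common,
    left(spc_common, 3) as first_three,
    right(spc_common, 3) as last_three
from
    nyc_2015_tree_census
limit
    5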
This can be useful if you need to retrieve the first few characters of a string, such as the first two characters of a US Census code, which represent the state code.
INITCAP, UPPER, LOWER
These functions change the capitalization and case of the text. INITCAP capitalizes the first letter of each word, UPPER changes the entire string to uppercase, and LOWER changes it to lowercase.
REPLACE
REPLACE is a function that takes three arguments: the source, the text to replace, and what to replace
it with. Let’s go ahead and try this out with our sample data:
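A sketch of REPLACE in action on the tree common names:

select
    spc_common,
    replace(spc_common, 'honeylocust', 'honey locust')
from
    nyc_2015_tree_census
limit
    5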
As you can see we can change the ’honeylocust’ value to ’honey locust’ using this function.
REVERSE
You might be able to guess, but this function reverses our string:
LENGTH
LENGTH returns the number of characters in a string; here it is applied to the tree common name column:
red maple 9
pin oak 7
honeylocust 11
honeylocust 11
American linden 15
SPLIT_PART
SPLIT_PART takes three arguments: the source data, a delimiter to split by, like a comma or space (but this can be any character), and which part to take from the split, specified as an integer. Let's say we want to get the tree type from the common name column; we can do that with this query:
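A sketch, splitting on a space and taking the second part (so 'red maple' becomes 'maple'):

select
    split_part(spc_common, ' ', 2)
from
    nyc_2015_tree_census
limit
    5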
Since the ’honeylocust’ is only one word we can actually combine two functions to do this:
There are many other string functions which you can find in the PostgreSQL documentation, but these
are the functions that I have used with spatial SQL.
4.5 Numeric
Numeric data covers any data that contains numbers. Generally you can use the NUMERIC data type, which will accommodate any type of numeric value you may need. That said, there are a variety of numeric data types that you may encounter at some point. Below is a table showing the different options and constraints from the PostgreSQL 15 documentation:
The functions with numeric data allow you to perform different mathematical operations. First, you
can use any combination of mathematical operators with numeric data. This includes:
• Addition: +
• Division: /
• Subtraction: -
• Multiplication: *
• Modulo (remainder from division): %
Here is a quick example from our Taxis data set:
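A sketch of that calculation; tip_amount and total_amount are the assumed column names, and the WHERE clause simply guards against dividing by zero:

select
    tip_amount / (total_amount - tip_amount) as tip_percentage
from
    nyc_yellow_taxi_0601_0615_2016
where
    total_amount - tip_amount > 0
limit
    5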
As you can see we can calculate the tip percentage by dividing the tip amount by the base amount, or
the total minus the tip.
There are a few other handy functions to use as well:
• ceil ( numeric ): Rounds up a decimal to the nearest integer
• floor ( numeric ): Rounds down a decimal to the nearest integer
• round ( numeric ): Rounds to the nearest integer up or down
• round ( v numeric, s integer ): Same function but will round v to s decimal places. Ex. round (42.451534, 2) → 42.45
• log ( numeric ): Base 10 logarithm
• sqrt ( numeric ): Square root
• random ( ): Generates a random number between 0 and 1
My tips and advice are to use the NUMERIC data type, unless you are positive you need or want a specific
data type. While there are plenty of other numeric functions, these are the functions that I have used
the most.
4.6 Dates and times
Dates and times can be one of the more complex data types, mainly because of the multiple ways you can format them and the number of formats your data may arrive in. In actuality, there are only four data types for dates and times:
• date: A date with no time of day
• time: A time of day with no date (with or without timezone)
• timestamp: A day and time (with or without timezone)
• interval: Time interval
To manage and deal with dates and times there are really two groups of functions. The first are functions to turn strings into dates and times; the other set deals with performing calculations, finding current times, and more.
While dates and times are a big topic that we can't completely cover here, below are a few operations that I think are most important to understand how to accomplish in SQL:
First, let’s take a quick look at the functions to translate strings into dates and times, and vice versa,
borrowed directly from the PostgreSQL 15 docs:
As you can see there are a few different functions for moving data from strings to date/times. But the
second argument is the one to pay attention to here. Let’s take a look at this example:
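For instance, a call like this (the literal date is just an example) turns a string into a proper date using a format string:

select to_date('05 Dec 2023', 'DD Mon YYYY')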
We can see that the second argument is the text 'DD Mon YYYY', which is what we call a date format string. There are a lot of date formatting options, so I will refer you to the full list in the PostgreSQL docs in the footnotes68.
Let’s test this out on our taxi data to show a few different things you can do. Below is a query that has
the start and end timestamp of some taxi trips:
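Something like this, assuming the timestamp columns are named pickup_datetime and dropoff_datetime:

select
    pickup_datetime,
    dropoff_datetime
from
    nyc_yellow_taxi_0601_0615_2016
limit
    5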
68 https://loc8.cc/sql/postgresql-datetime
We can see the '+00' at the end of each timestamp, which is the timezone offset: Greenwich Mean Time, not Eastern Standard Time. Let's first turn our data into strings using the same formatting approach as above.
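A sketch using TO_CHAR with the same format string as before:

select
    to_char(pickup_datetime, 'DD Mon YYYY') as pickup_text
from
    nyc_yellow_taxi_0601_0615_2016
limit
    5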
Now there are many ways you can format the text of your dates, and you can do the inverse formatting
using the functions above to turn strings into dates should your DATE data not import correctly.
The other functions for dates and times provide a lot of different functionality, such as extracting parts like a month or hour from a date/timestamp, reading the current time or date, working with intervals, and more. You can see the full set of functions at the link in the footnote69.
The operations I use the most are performing calculations on dates/timestamps and extracting parts of them. Let's look at some examples. First, let's find the duration between the start and end times of our trips.
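Subtracting the two timestamps returns an interval; a sketch with the same assumed column names:

select
    dropoff_datetime - pickup_datetime as trip_duration
from
    nyc_yellow_taxi_0601_0615_2016
limit
    5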
00:07:04
00:05:06
00:06:31
00:06:31
00:06:20
Now, let’s see if we can find the hour of the day and the day of the week from the start time of each
timestamp.
69 https://www.postgresql.org/docs/current/functions-datetime.html
3 dow
4 from
5 pickup_datetime
6 ) as day_of_week,
7 extract(
8 hour
9 from
10 pickup_datetime
11 ) as hour_of_day
12 from
13 nyc_yellow_taxi_0601_0615_2016
14 where
15 tip_amount > 0
16 and trip_distance > 2
17 limit
18 5
4 21
1 6
2 23
5 5
6 4
We can see here that we have the day of the week using numbers from 0 to 6, 0 being Sunday and 6
being Saturday, and the hour of the day from 0 to 23.
We will be revisiting dates several times during this book but for now this is a good starting point.
4.7 Other data types
UUID
A Universally Unique Identifier, or UUID, is a unique identifier made up of a 32-character, alpha-numeric string. These are unique across space and time, which may seem difficult to believe given how widely UUIDs are used across tools and languages. Even if you generated one UUID per second, it would take roughly a billion years to create a duplicate70.
This data type is pretty simple: there is one function to create a unique ID, gen_random_uuid(), which returns a UUID.
Array
If you are familiar with lists in Python or arrays in Javascript, then an ARRAY should be familiar. If not,
then an array is a group of data contained within square brackets. Data can be ordered or unordered,
but data is accessed by using an index to grab a specific item, and in PostgreSQL there are a number of
functions to access, edit, and query arrays71 . For example imagine you have an array in a row of data
that looks like this:
70 https://towardsdatascience.com/are-uuids-really-unique-57eb80fc2a87
71 https://www.postgresql.org/docs/current/functions-array.html
If you wanted to extract Nigeria from the list you could use a query that looks like this:
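A sketch, assuming the countries are stored in an array where 'Nigeria' sits in the second position:

with country_data as (
    select
        array['Kenya', 'Nigeria', 'Ghana'] as countries
)
select
    countries[2]
from
    country_data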
PostgreSQL uses a base index of 1 compared to 0 which is used by other programming languages so
keep this in mind. Arrays can be helpful for data that you want to keep that may have multiple or
different values, such as tags, categories or other data.
JSON
JSON, or Javascript Object Notation, is another popular and flexible data structure which is similar to
a dictionary in Python. JSON can store nested data and can vary in structure. Let’s take a look at
an example. In this query we are using a common table expression, or CTE, which in effect creates a
temporary table that exists only in the scope of our query. You will see these many times as we progress
through the book, and we will cover them in detail in the next chapter, but for now we are basically
creating a temporary table with one column and one row of data that contains the JSON we are writing.
As expected, we get back the exact data we put into our CTE query above. Say we want to access the value in the 'last' key; we can use this query:
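A sketch, building the JSON with a cast here (the longer listing below uses the JSON() constructor) and reading the 'last' key with the -> accessor:

with json_table as (
    select
        '{"first": "Matt", "last": "Forrest"}' :: json as data
)
select
    data -> 'last'
from
    json_table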
Which will return the value in the last name field. Let’s add a more complex JSON value:
1 with json_table as (
2 select
3 JSON(
4 '{
5 "first": "Matt",
6 "last": "Forrest",
7 "age": 35,
8 "cities": ["Minneapolis", "Madison", "New York City"],
9 "skills": {"SQL": true, "Python": true, "Java": false}
10 }'
11 ) as data
12 )
13 select
14 data
15 from
16 json_table
{
"first": "Matt",
"last": "Forrest",
"age": 35,
"cities": ["Minneapolis", "Madison", "New York City"],
"skills": {
"SQL": true,
"Python": true,
"Java": false
}
}
As you can see we can store values of any data type, and even arrays and other JSON, in this case nested
JSON. This is why JSON is a really flexible data structure. Let’s say we want to query and find the 2nd
city in the ’cities’ value and then find the value of ’SQL’ in ’skills’:
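A sketch of those two accessors chained together; JSON arrays are 0-indexed, so the 2nd city is at index 1 (the CTE repeats the same JSON value as above, built with a cast):

with json_table as (
    select
        '{"first": "Matt", "last": "Forrest", "age": 35,
          "cities": ["Minneapolis", "Madison", "New York City"],
          "skills": {"SQL": true, "Python": true, "Java": false}}' :: json as data
)
select
    data -> 'cities' -> 1 as second_city,
    data -> 'skills' -> 'SQL' as knows_sql
from
    json_table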
’"Madison"’,’true’
As you can see we got both the results we wanted, but the returned values are still JSON. We can avoid this by changing the accessor operator from -> to ->>, which returns text:
You can find the full set of JSON functions in the documentation link in this footnote72 . Overall, JSON
gives you a very flexible data structure to work with and is worth understanding. The most interesting
use case I have seen for JSON data is with data from OpenStreetMap. Since the tags and data for any
feature can vary, JSON provides a great data structure to store that data.
The other data types of note in PostgreSQL include:
• Geometric Data: This differs a bit from the PostGIS geometry type, as it represents geometric data that is not tied to a spatial reference, but some of its operators can be used within PostGIS.
• hstore: A key/value pair data structure similar to a tuple in Python (ex. (key, value))
• Network Address: Things like IP addresses or MAC address
• XML: A data type for storing XML data
• Text Search Data: Data type and functions for storing data for text searches
We won’t be covering these data types in this book, but they exist and showcase the breadth of data
PostgreSQL can support and data that can live alongside your geospatial data.
Casting Data
Imagine you want to change your data between data types. Maybe you want to change a number to a
string or vice versa. You can do this with casting, and there are two methods to do so. First let’s change
a number, the zip code in our NYC Zip Codes data:
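A sketch using CAST; I am assuming, for the sake of the example, a table named nyc_zips with a numeric zipcode column:

select
    cast(zipcode as text)
from
    nyc_zips
limit
    3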
11436
11213
10002
As you can see, we can use the CAST function and pass it our column name and the type we want to cast to. We can also use a shorthand version of this with the :: operator, which achieves the same result. The double colon method is older and unique to PostgreSQL, so it will not work in other databases.
72 https://www.postgresql.org/docs/current/functions-json.html
4.8 Basic SQL operators
By now, you have seen some of the SQL operators we will cover in this section. In short, SQL operators allow you to filter your data in various ways to retrieve the data that you want. With the SELECT clause alone, we just take the table or columns of data as they currently exist in the table. With operators, we can give a clearer set of directions to grab just the data we are interested in. Let's jump in and walk through some operators.
WHERE
The WHERE operator is one of the most essential operators in SQL. It allows you to define a condition
that will filter your table based on the conditions in the clause. There are several operators you can use
with the WHERE clause, as well as the AND or OR conditionals that allow you to string several conditionals
together. The best way to think about the WHERE clause is that it will return the rows that equate to
"true" based on the conditions you provide. This will be important later on as we start to use spatial
SQL operators in the WHERE clause.
The operators you can use with the WHERE clause are:
Operator Description
= Equal
> Greater than
< Less than
>= Greater than or equal
<= Less than or equal
<> or != Not equal
AND Logical operator AND
OR Logical operator OR
IN Return true if a value matches any value in a list
BETWEEN Return true if a value is between a range of values
LIKE and ILIKE Return true if a value matches a pattern
IS NULL Return true if a value is NULL
NOT Negate the result of other operators
These are the most fundamental conditionals that you can use in a query. First we can look at our trees
dataset and see which trees are in fair condition by following the pattern of:
4 nyc_2015_tree_census
5 where
6 health = 'Fair'
We can do the same with the greater/less than or greater/less than or equal to operators. Let's find all the trees that are stumps and have a diameter greater than 0 (note that we can achieve the same thing by finding the trees that have a status of 'Stump'):
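A sketch of the query that triggers the error shown below:

select
    *
from
    nyc_2015_tree_census
where
    stump_diam > 0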
It seems that we have hit an error! Don't worry, you will see plenty of errors when you write SQL, so while this may be the first, it won't be the last. This is also our first challenge, where you will have a chance to think about the issue and see what might be the problem. So first let's take a look at our error code:
ERROR: operator does not exist: character varying > integer
LINE 3: where stump_diam > 0
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
SQL state: 42883
Character: 62
Take a moment to think about it, and we will walk through the problem after the page break.
First we can see that the operator does not exist; in this case our greater than sign cannot evaluate the relationship between 'character varying' and 'integer'. And we can see which line this is taking place on.
We also get a hint from PostgreSQL too.
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
So we can see from our two clues that there seems to be a data issue. Since we know 0 is an integer,
reflected by the position of the two data types in the error statement, let’s take a look at the data type
of stump_diam (Figure 4.11, on the following page):
And there we go. We can see that even though the data in stump_diam should be a numeric type, it has
in fact imported as a text column. We can fix this later but for now let’s just recast that column to a
numeric value and re-run our query:
select
  *
from
  nyc_2015_tree_census
where
  stump_diam :: numeric > 0
Voila! The only other comment on these operators is that you can use them for multiple data types,
including dates, and even text.
AND/OR
The AND and OR logical operators allow you to string together multiple conditions in your query. You can
use one or many together, and you can use AND and OR together as well. Let's look at some quick
examples using these operators:
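For instance, a query using AND might look like this sketch (the specific filter values are only illustrative):

select
  *
from
  nyc_2015_tree_census
where
  spc_common = 'red maple'
  and health = 'Fair'
limit
  25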
And now using an OR operator. This will include any rows that meet either condition.
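A sketch of the same idea with OR (again, the values in the conditions are illustrative):

select
  *
from
  nyc_2015_tree_census
where
  health = 'Fair'
  or health = 'Good'
limit
  25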
IN
Let's imagine that you want to filter to more than one value on the same column. For two values you
can just have two conditionals and join them together with an OR. But what about three, or five, or
more?
This is where IN comes into play. You can match against more than one value on the same column, and you can
use the AND/OR conditionals to string two or more IN operators together. The structure looks like this:
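In sketch form, with placeholder table and column names to replace with your own:

select
  *
from
  table_name
where
  column_name in ('value_1', 'value_2', 'value_3')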
The IN operator accepts any value separated by commas. It can also accept an array of values or values
returned from a subquery (which we will cover in the next chapter). Let’s take a look at a few examples:
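For example, using the 311 complaints table from our datasets:

select
  *
from
  nyc_311
where
  complaint_type in ('Illegal Fireworks', 'Noise - Residential')
limit
  25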
NOT
If you want to find values that are not in a specific group of values, you can use the appropriately
named NOT operator. Here is an example using the most recent query we ran:
select
  *
from
  nyc_311
where
  complaint_type in ('Illegal Fireworks', 'Noise - Residential')
  and descriptor not in ('Loud Music/Party', 'Banging/Pounding')
limit
  25
You can use NOT with the next few operators we are covering here too.
BETWEEN
Another appropriately named operator is BETWEEN, which allows you to select values that fall between
two other values. BETWEEN is inclusive of the values you add into the statement, so it is synonymous with
greater than or equal to the lowest value and less than or equal to the highest value. Let's take a look
at how we can use BETWEEN with dates. Note that you can use BETWEEN with numeric and string data as
well (i.e. find all the states between California and Texas in alphabetical order).
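A query along these lines, using the taxi trips table, might look like this:

select
  *
from
  nyc_yellow_taxi_0601_0615_2016
where
  pickup_datetime between '2016-06-10 15:00:00'
  and '2016-06-10 15:05:00'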
Here we can see all the yellow taxi trips that were started on June 10th, 2016 between 3:00pm and
3:05pm.
LIKE/ILIKE
LIKE and ILIKE are two different methods of searching string or text data using pattern matching. The
only difference between LIKE and ILIKE is that ILIKE is case insensitive (meaning it will ignore the
case of the letters) and is unique to PostgreSQL. These operators use the following methods to perform
pattern matching in your data:
• The percent sign (%) will match any sequence of zero or more characters
• The underscore sign (_) will match any single character
So what does this look like in practice? Let's find all the trees in the NYC Tree Census that have
the name "maple" somewhere in them; this will be another coding challenge. First, consider the
following scenarios to see which pattern will return which values (note that we are just using
a query to evaluate arbitrary text, or any other value you want, which is something SQL can also do
and is common to see in documentation):
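A few scenarios of this kind could be sketched like the following (the text values are made up for illustration):

select
  'Norway maple' like '%maple',
  'Norway maple' like 'maple%',
  'Norway maple' like '%maple%',
  'Norway maple' like '_maple'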
Given that we know there are likely to be maple trees that end with maple, only contain the word
maple, and maybe start with maple, which would be the correct operator to select here? Answer after
the page break:
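The pattern that covers all three cases is the double wildcard, for example:

select
  *
from
  nyc_2015_tree_census
where
  spc_common like '%maple%'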
Why is this? Because the % operator matches zero or more characters, which means the value could start
with, end with, or simply contain the word maple. The underscore operator is much more
explicit:
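For example, in this sketch the underscore stands in for exactly one character:

select
  'maples' like 'maple_', -- true: the _ matches the single trailing character
  'maple' like 'maple_'   -- false: there is no extra character to match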
And if you are not sure of the character capitalization, feel free to use ILIKE as well.
IS NULL
If you want to find the values that are or are not NULL you can use the IS NULL operator. This can help
in queries to include, exclude, or update rows that contain no data. Simply add it after a WHERE
clause such as:
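For example (using spc_common here, though any column works the same way):

select
  *
from
  nyc_2015_tree_census
where
  spc_common is null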
DISTINCT
Now as you saw in the query where we found the names of all the trees that have maple in them, we
got a full list of every single row that contained the name maple. Instead, if we wanted to get the name
of each maple tree only one time, we can use the DISTINCT keyword in our query to select a single
instance of each result in a column. DISTINCT is also subject to all the other parts of the query, such as
WHERE clauses. We can add DISTINCT to our query to see the new results:
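One way to write this might be:

select distinct
  spc_common
from
  nyc_2015_tree_census
where
  spc_common like '%maple%'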
spc_common
Amur maple
black maple
crimson king maple
hedge maple
Japanese maple
These are the first five results but we will only have 1 instance of each tree species.
4.9 Aggregates and GROUP BY

Up until this point we have simply been selecting rows of data from our tables. Let's say that we want
to ask a question such as: how many maple trees are in each neighborhood in New York City? To do
this we can use aggregations and the GROUP BY clause. This allows us to define an
aggregation such as a COUNT or a SUM and then group the results by other data in our table. Let's take
a look at an aggregation in action based on the scenario above, and then discuss the different available
aggregation options.
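A query along these lines might look like the following (the count expression is a sketch; the book's exact query may differ slightly):

select
  nta_name,
  count(*)
from
  nyc_2015_tree_census
where
  spc_common like '%maple%'
group by
  nta_name
limit
  5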
• where spc_common like ’%maple%’: Find all the trees that have the name "maple" in them
• group by nta_name: Here we use our new operator GROUP BY to group our results by the "nta_name" column
• limit 5: This will limit to the first 5 grouped results, not the first 5 rows of the source table.
And we can see our results here:
nta_name count
Allerton-Pelham Gardens 801
Annadale-Huguenot-Prince's Bay-Eltingville 2935
Arden Heights 1729
Astoria 318
Auburndale 1016
You can also combine your GROUP BY statements to group by more than one column. Let's run the same
query but this time grouping by neighborhood and then the "problem" column to see if there are any
recorded problems with the trees:
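That could look something like this sketch (assuming the column is named problems in the imported data; adjust if your copy uses a different name):

select
  nta_name,
  problems,
  count(*)
from
  nyc_2015_tree_census
where
  spc_common like '%maple%'
group by
  nta_name,
  problems
limit
  5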
We can see that the first column in our GROUP BY statement, neighborhood, is the first group and then
the problem column is second.
For me personally, this is where I started to see the power of SQL. The ability to very quickly explore
and modify queries to see your data in new ways shows what the language can do. Of course, there are many
more reasons, but this is one of the first places that got me to stop and say, "whoa". With that, let's take
a quick look at some different aggregation options:
String Aggregates
• ARRAY_AGG: Aggregate string results in an array
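A sketch of the kind of query described below, aggregating the distinct maple species per neighborhood, might be:

select
  nta_name,
  array_agg(distinct spc_common) as species
from
  nyc_2015_tree_census
where
  spc_common like '%maple%'
group by
  nta_name
limit
  5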
Notice that I used the DISTINCT operator inside the ARRAY_AGG function since if I did not, I would
get every instance of the values in the array, so for our first row, 801 individual values.
Numeric Aggregates
• AVG: Take the average (mean) of numeric values
• COUNT: Count of rows that match the conditions in your query
• MAX: Finds the maximum value in your numeric values
• MIN: Finds the minimum value in your numeric values
• SUM: Finds the sum of your numeric values
Numeric aggregates operate on numeric columns to find different numeric summaries. There are more
statistical aggregation methods that require some other tools which we will cover in the next chapter.
For now, let’s find the average stump diameter grouped by neighborhood:
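One way to write this, reusing the numeric cast from earlier, might be:

select
  nta_name,
  avg(stump_diam :: numeric)
from
  nyc_2015_tree_census
where
  stump_diam :: numeric > 0
group by
  nta_name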
nta_name avg
Allerton-Pelham Gardens 17.7638888888888889
Annadale-Huguenot-Prince's Bay-Eltingville 12.9958677685950413
Arden Heights 10.7619047619047619
Note that there is no SQL function for median, although there are some methods to compute this using
other functions73 . We will showcase that specific function in the next chapter.
73 https://ubiq.co/database-blog/calculate-median-postgresql/
FILTER
Now if you want to aggregate your data with a condition, in PostgreSQL you can use the FILTER operator
to aggregate based on a WHERE condition. This is specific to PostgreSQL, and there are other methods
to do this in different databases. For example, if we wanted to find the average distance of all taxi trips
where the tip was greater than $5 between 3:00pm and 3:05pm on June 12th, grouped by passenger
count, we can run this query:
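A sketch of that query (trip_distance is assumed to be the distance column in the imported taxi data):

select
  passenger_count,
  avg(trip_distance) filter (
    where
      tip_amount > 5
  )
from
  nyc_yellow_taxi_0601_0615_2016
where
  pickup_datetime between '2016-06-12 15:00:00'
  and '2016-06-12 15:05:00'
group by
  passenger_count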
In the FILTER function we can add any type of argument we can write into a WHERE operator.
HAVING
What if you want to write a query like this, where you use the WHERE operator but on the results of an
aggregated column:
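For example, something like this sketch, which PostgreSQL will reject because aggregate functions are not allowed in a WHERE clause:

select
  passenger_count,
  count(ogc_fid)
from
  nyc_yellow_taxi_0601_0615_2016
where
  count(ogc_fid) > 50 -- this raises an error
group by
  passenger_count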
Instead of using WHERE we can use HAVING to achieve the same effect but with aggregate values:
select
  passenger_count,
  avg(tip_amount) filter (
    where
      tip_amount > 5
  ),
  count(ogc_fid)
from
  nyc_yellow_taxi_0601_0615_2016
where
  pickup_datetime between '2016-06-10 15:00:00'
  and '2016-06-10 15:05:00'
group by
  passenger_count
having
  count(ogc_fid) > 50
ORDER BY
In these queries we can see that the data is not organized any specific way. We can change that by using
ORDER BY, which uses this simple syntax:
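In sketch form, with placeholder names:

select
  *
from
  table_name
order by
  column_name desc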
Ascending order is the default setting for ORDER BY, and it works for all data types. First we can see
what the largest five tips were in our taxi dataset in the same time window we have been using:
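One way to write this might be:

select
  passenger_count,
  tip_amount
from
  nyc_yellow_taxi_0601_0615_2016
where
  pickup_datetime between '2016-06-10 15:00:00'
  and '2016-06-10 15:05:00'
order by
  tip_amount desc
limit
  5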
passenger_count tip_amount
1 60
1 32.8
3 27.35
1 22.57
1 21.58
And now let’s take a look at this with our tree census dataset again, ordering our data by the COUNT.
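A sketch of that query:

select
  nta_name,
  count(*)
from
  nyc_2015_tree_census
where
  spc_common like '%maple%'
group by
  nta_name
order by
  count(*) desc
limit
  5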
nta_name count
Annadale-Huguenot-Prince's Bay-Eltingville 2935
Great Kills 2815
Rossville-Woodrow 2503
Glen Oaks-Floral Park-New Hyde Park 2088
Bayside-Bayside Hills 1988
LIMIT/OFFSET
You have already seen LIMIT plenty of times; it limits the number of rows that a query returns. You
can also add an operator called OFFSET, which allows you to skip a specific number of rows before the
results start. In this case we can find the New York neighborhoods with the 6th through 10th most maple
trees in them:
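Building on the previous query, that might look like:

select
  nta_name,
  count(*)
from
  nyc_2015_tree_census
where
  spc_common like '%maple%'
group by
  nta_name
order by
  count(*) desc
limit
  5 offset 5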
nta_name count
New Springville-Bloomfield-Travis 1945
Arden Heights 1729
Charleston-Richmond Valley-Tottenville 1548
Douglas Manor-Douglaston-Little Neck 1503
Murray Hill 1479
With that we now have our most basic SQL operations covered! In the next chapter we will take a look
at advanced SQL topics to round out our SQL training.
Part 2
1. Advanced SQL Topics for Spatial SQL
In this chapter we will take a deeper dive into some of the more complex parts of SQL. These new
elements will allow you to write more complex queries while also making your SQL easier, more
legible, and often more performant. So, let's jump in!
1.1 CASE/WHEN conditionals

The first is the CASE / WHEN conditional, which acts like if/else statements would in other languages. This
allows us to work with any type of data to process it into a new column of data using
conditions that we define. The basic structure of the CASE / WHEN statement looks like this:
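In rough sketch form, with placeholder names:

select
  case
    when condition_1 then result_1
    when condition_2 then result_2
    else other_result
  end as new_column
from
  table_name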
Effectively you can define a condition that you would define in a WHERE conditional, but without the
WHERE. If that condition is met, then the data returned will be the result. You can string as many of
these together as you want. Finally, you end the statement with an ELSE which will be returned if the
condition is not met. For example let’s say we have some weather data and there is a column with the
temperature in Celsius in it. If we wanted to create a column that described the temperature in text we
could write this CASE / WHEN statement:
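A sketch of that idea, assuming a hypothetical weather table with a temperature column in Celsius (the table, column, and labels are illustrative):

select
  temperature,
  case
    when temperature < 0 then 'Freezing'
    when temperature < 15 then 'Cold'
    when temperature < 25 then 'Warm'
    else 'Hot'
  end as temperature_description
from
  weather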
Let’s take a look at a real example with our tree census dataset:
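One way to produce the result table below might be:

select
  spc_common,
  case
    when spc_common like '%maple%' then 'Maple'
    else 'Not a Maple'
  end as is_maple
from
  nyc_2015_tree_census
limit
  5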
This will create a new temporary column in our results that will say 'Maple' or 'Not a Maple' depending
on whether our data meets our condition.
spc_common is_maple
red maple Maple
pin oak Not a Maple
honeylocust Not a Maple
honeylocust Not a Maple
American linden Not a Maple
Now as a coding challenge, see if you can create a CASE / WHEN statement that will classify the tip
percentage as Good (between 15 and 20%), Great (between 20 and 25%), Amazing (25 to 30%), and
Awesome (Over 30%). For other results you can call those Not Great and limit it to the first 10 rows.
To do this we will need to calculate the tip percentage by dividing the tip amount by the total amount,
and then classifying that amount in our case when statement:
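A sketch of one possible solution (total_amount is filtered to be greater than 0 so the division is valid):

select
  tip_amount / total_amount as tip_percentage,
  case
    when tip_amount / total_amount between 0.15 and 0.20 then 'Good'
    when tip_amount / total_amount between 0.20 and 0.25 then 'Great'
    when tip_amount / total_amount between 0.25 and 0.30 then 'Amazing'
    when tip_amount / total_amount > 0.30 then 'Awesome'
    else 'Not Great'
  end as tip_class
from
  nyc_yellow_taxi_0601_0615_2016
where
  total_amount > 0
limit
  10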
As you can see, instead of adding another explicit category for "Not Great" we can just use the ELSE at the
end of the statement to classify all other results, since everything else should be covered there.
In SQL tools other than PostgreSQL you will generally have to use CASE / WHEN inside your aggregations
instead of FILTER, since FILTER is not widely supported elsewhere. Take a look at this example:
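For instance, a sketch counting maples per neighborhood with a conditional sum:

select
  nta_name,
  sum(
    case
      when spc_common like '%maple%' then 1
      else 0
    end
  ) as maple_count
from
  nyc_2015_tree_census
group by
  nta_name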
As you can see, we take the sum of the values returned by the conditional, which returns 1 if the tree is a
maple and 0 if it is not.
1.2 Common Table Expressions (CTEs) and subqueries

You have seen CTEs earlier in this book, but subqueries are a new concept. I won't get too deep into
subqueries vs. CTEs; there are several specific reasons why a subquery may be more advantageous
than a CTE, but in many cases a CTE is more legible, is reusable, and acts like a table
within the context of your query74. This is not a hard rule, but in general I prefer CTEs over subqueries.
Let's first dive into subqueries and see where they can be useful, and then talk more about CTEs.
74 https://loc8.cc/sql/towardsdatascience-sql-subquery
Subqueries
A subquery looks more like a query contained within the query you are writing. Most commonly they
are used within a WHERE clause to grab data from another table, but they can be used as a column of
data or within a FROM operator as well. We will actually see this later in the chapter when we use a
lateral join which is particularly useful for some specific spatial analysis.
For now let's take a look at examples of subqueries in the WHERE clause. First let's take a look at the
count of maple trees in zip codes with populations greater than 100,000 people. We have a column for
zip codes in our Tree Census data and a dataset of NYC ZIP codes which has a population
column. Let's practice our pseudo coding here to write out what we want to accomplish:
Count all the maple trees in each zip code with a population over 100,000
And let’s format this in a more SQL friendly structure:
Looking at this you can see that we need some way to relate the data from the tree census table to
the ZIP code table. Joining them is the logical way to think about this, and it is actually a great use for a
subquery:
• The subquery would be used in a WHERE clause
• Return value can be something used in a WHERE conditional, such as a single value to find values
equal to it, or a set of values for an IN conditional
In this case we can write a query that looks like this:
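A sketch of that query, assuming the population column in nyc_zips is named population:

select
  zipcode,
  count(*)
from
  nyc_2015_tree_census
where
  spc_common like '%maple%'
  and zipcode in (
    select
      zipcode
    from
      nyc_zips
    where
      population > 100000
  )
group by
  zipcode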
zipcode count
11226 276
11368 272
. . . continued on next page
zipcode count
11373 578
As you can see our subquery returns three zip codes which have populations over 100,000. Now in this
case we cannot add the population column to the query. If we wanted to do that we would need to do
a join. Another way we can look at the query from above is like this.
The subquery can also return a single value; for example, if we want to find how many trees
are in the zip code with the lowest population:
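One way to write that might be:

select
  zipcode,
  count(*)
from
  nyc_2015_tree_census
where
  zipcode = (
    select
      zipcode
    from
      nyc_zips
    order by
      population asc
    limit
      1
  )
group by
  zipcode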
zipcode count
10048 12
Sub-queries can get a lot more complicated, but the CTE can accomplish many of the same things in a
more legible way.
CTEs
The best way to think about a CTE is that it is the same as any other query, but the results that are
returned act just like a table in the context of that query. The CTE structure looks like this:
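In sketch form, using a hypothetical roads table (the table name here is a placeholder, not one of the book's datasets):

with lanes as (
  select
    ogc_fid,
    lanecount
  from
    some_roads_table
)
select
  *
from
  lanes
limit
  3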
ogc_fid lanecount
609 3
2061 3
2062 3
As you can see above, we defined a query in our CTE, and when we select from that CTE we get the
results of that query. What is great about this is that we can use the CTE like any normal table
within our query. We can also create more than one CTE within a query:
31 select
32 *
33 from
34 lanes_2
ogc_fid lanecount
2062 3
3456 3
3459 3
3454 3
2061 3
609 3
We will be using CTEs a lot going forward so you will see these plenty of times in this book, but there
are a few rules to keep in mind when using them:
• When you are using CTEs make sure to only return the rows that you need and try and filter as
much data out as you can in each step to make the query more performant
• If you plan on reading the results of a CTE more than one time or on a regular basis, consider
using a VIEW or creating a new table
• Using a CTE won’t make operations more performant, so you will always need to consider query
performance, but they generally increase query legibility
1.3 CRUD: Create, read, update, and delete

Now once your tables are in your database you may want to make some changes to your data, or
remove it altogether. That is where the CRUD operations come in: create, read, update, and delete.
We have actually already been doing the read step, which is just using the SELECT statement in
SQL, so there is no need to go into that further. Let's take a look at each of the other operations in more
detail.
The good part about this is that all the foundational elements you have learned already apply here so
we will show a few of those in action but won’t go into too much detail.
Create
• Import data into your database (we have done this already)
• Use a CREATE TABLE table_name AS statement (we have also done this)
• CREATE TABLE directly within the database
First we can create a new table or view using this syntax:
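For example, a sketch (the new table name maple_trees is just illustrative):

create table maple_trees as
select
  *
from
  nyc_2015_tree_census
where
  spc_common like '%maple%'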
This statement creates a brand new table with only the columns and rows returned from that query.
You can also swap a definition in place with OR REPLACE; in PostgreSQL this works for views (CREATE OR
REPLACE VIEW), while for tables you would typically drop the existing table first and re-create it:
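Two common patterns, sketched with the same illustrative names as above:

-- views can be replaced in place
create or replace view maple_trees_view as
select
  *
from
  nyc_2015_tree_census
where
  spc_common like '%maple%';

-- tables are usually dropped and re-created
drop table if exists maple_trees;
create table maple_trees as
select
  *
from
  nyc_2015_tree_census
where
  spc_common like '%maple%';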
Alternatively if you want to do this in the database itself you can do this with the standard CREATE
TABLE statement:
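For example, a small test table (the column list here is an assumption; the book's exact test table may differ):

create table cities (
  id serial primary key,
  city text,
  country text,
  created timestamp
)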
This will create a new, empty table in our database. We can add data into the database using this
statement:
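A sketch of an insert into that test table:

insert into
  cities (city, country, created)
values
  ('Tokyo', 'Japan', current_timestamp),
  ('Mexico City', 'Mexico', current_timestamp)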
You can control the columns you want to insert the data into, but the names will need to match those in
the table. Each row of values that you list will then be inserted. You can also use values like
CURRENT_TIMESTAMP to add the current date or time, as well as other computed values.
Update
Once you have created your data there are several ways to update or modify your tables. These include
the ALTER TABLE statement for changing the structure of a table and the UPDATE statement for changing
values in rows, and you can combine them with all the same tools you use in your standard SQL syntax.
Fair warning: these statements are permanent, and unless you have a database backup (we do in our
case) or another tool set up, you cannot undo or recover these changes.
First let’s add a new column to our test table:
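For example, in line with the trigger example later in this chapter, we might add a time_zone column (the column name is an assumption):

alter table cities
add column time_zone text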
And when we query our new table we can see the new column has been added:
What is great about this is that you can also run these statements selectively using the other tools we have
already learned. Say we want to change the name Mexico City to its Spanish name, we can do this:
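A sketch of that update:

update
  cities
set
  city = 'Ciudad de México'
where
  city = 'Mexico City'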
Delete
Now if we want to delete or drop part or all of a table or column we can use these statements.
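For example, sketches of the main options, using the illustrative test table from above:

-- delete specific rows
delete from cities
where
  city = 'Tokyo';

-- drop a column
alter table cities
drop column time_zone;

-- drop the entire table
drop table cities;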
As it relates to spatial SQL, I tend to use the CREATE TABLE AS or CREATE VIEW AS statements all the time
to save results of a query, as well as creating new columns and populating them with new data.
1.4 Statistical functions

Earlier we saw some basic statistical functions such as SUM and AVG, but PostgreSQL provides many
more functions to perform statistical analysis, including:
• Correlations
• Standard deviations
• Variances
• Mode
• Discrete and Continuous Percentiles
• Rankings
Let's test out some of these functions with our New York City MapPLUTO dataset, or property assessment
data, first taking a look at the correlation between lot area and the assessed value of the lot:
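A sketch of that query (lotarea and assesstot are assumed to be the relevant column names in the imported MapPLUTO data):

select
  corr(lotarea, assesstot)
from
  nyc_mappluto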
0.5561503500548179
As we can see there is a moderate positive correlation between these two variables. Next, let's find
the standard deviation and variance of lot areas in Brooklyn.
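Sketches of those two queries (the borough column name and 'BK' value are assumptions about the imported data):

select
  stddev(lotarea)
from
  nyc_mappluto
where
  borough = 'BK';

select
  variance(lotarea)
from
  nyc_mappluto
where
  borough = 'BK';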
430938.03381969
185707588992.37929885
Let's also find the mode of lot areas in Brooklyn. The MODE function does not take any arguments;
instead, we specify the column we want inside the parentheses of a WITHIN GROUP clause as an ORDER BY.
WITHIN GROUP looks a bit like the FILTER clause we used with aggregates, but instead of filtering it
specifies how the values that this family of ordered-set aggregates works on are ordered.
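For example:

select
  mode() within group (
    order by
      lotarea
  )
from
  nyc_mappluto
where
  borough = 'BK'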
2000
Next we can analyze data as percentiles and rankings. Let's take a small sample of data
from our taxis dataset and find the percentiles for the tips in that data.
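A sketch of that query:

select
  tip_amount,
  percent_rank() over (
    order by
      tip_amount
  )
from
  nyc_yellow_taxi_0601_0615_2016
where
  pickup_datetime between '2016-06-10 15:00:00'
  and '2016-06-10 15:00:05'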
Here we are finding the trips that started within 5 seconds of 3:00 PM on June 10th, 2016. The PERCENT_
RANK function uses a window syntax with the OVER operator, which we will learn more about later in this
chapter. Basically, it tells the database to perform the calculation within the context of that specific set,
or window, of data and how to order that data; in this case that is the tip_amount ordered from
smallest to largest. If we take a look at the last values in our data we can see this:
This shows us the percentile of each tip in this dataset. We can also rank these results too:
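For example, swapping PERCENT_RANK for RANK:

select
  tip_amount,
  rank() over (
    order by
      tip_amount
  )
from
  nyc_yellow_taxi_0601_0615_2016
where
  pickup_datetime between '2016-06-10 15:00:00'
  and '2016-06-10 15:00:05'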
We can also find percentile values within our dataset too using PERCENTILE_DISC:
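For example, the median value actually present in the data:

select
  percentile_disc(0.5) within group (
    order by
      tip_amount
  )
from
  nyc_yellow_taxi_0601_0615_2016
where
  pickup_datetime between '2016-06-10 15:00:00'
  and '2016-06-10 15:00:05'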
You can also find interpolated values with PERCENTILE_CONT, meaning that if the value at the 50th per-
centile fell between 9 and 10, this would return 9.5 rather than one of the values that is actually in the
data:
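The interpolated version looks nearly identical:

select
  percentile_cont(0.5) within group (
    order by
      tip_amount
  )
from
  nyc_yellow_taxi_0601_0615_2016
where
  pickup_datetime between '2016-06-10 15:00:00'
  and '2016-06-10 15:00:05'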
1.5 Windows
I mentioned Window functions in the previous section, and while they can be difficult to initially un-
derstand, they have a lot of interesting utility for really complex calculations. The PostgreSQL docs do
a great job of explaining what a Window function can do:
A window function performs a calculation across a set of table rows that are somehow related to the current row.
This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular
aggregate functions, use of a window function does not cause rows to become grouped into a single output row
— the rows retain their separate identities. Behind the scenes, the window function is able to access more than
just the current row of the query result.
So what can you do with a Window function? Here are a few examples
• Rolling averages such as a 30 day or 30 row average (think of GPS points or weather data)
• Averages across groups of data
• Running totals
• Rankings (as we saw above)
• In spatial SQL, calculate clusters using KMeans or DBSCAN
Window functions are actually a great fit for our NYC Taxi data, so let's test this out on that
dataset. The simplest type of window function works on data with consistent intervals or non-
aggregated data. Here we can use a CTE to show what this looks like with the total amount:
we will find the total fares for each day of our data, then find a rolling 3 day average of that
data:
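A sketch of that query:

with daily as (
  select
    pickup_datetime :: date as date,
    sum(total_amount) as total
  from
    nyc_yellow_taxi_0601_0615_2016
  group by
    pickup_datetime :: date
)
select
  date,
  total,
  avg(total) over (
    order by
      date rows between 2 preceding
      and current row
  ) as three_day_avg
from
  daily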
As you can see, in our query we are using the window function to find an average OVER the current row
and the preceding two rows which makes up our 3 day average, and we are ordering our window by
the date.
There is also an argument called PARTITION BY which we can use. In this case let’s run the same query
but also add in the passenger count and create a running total:
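One way to write that might be:

with daily as (
  select
    pickup_datetime :: date as date,
    passenger_count,
    sum(total_amount) as total
  from
    nyc_yellow_taxi_0601_0615_2016
  group by
    pickup_datetime :: date,
    passenger_count
)
select
  date,
  passenger_count,
  sum(total) over (
    partition by passenger_count
    order by
      date
  ) as running_total
from
  daily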
Next, let’s find a rolling 6-hour average every hour of the tip percentage on June 15th, 2016. Let’s plan
this out using pseudocode.
Select the average tip percentage (tip divided by total amount)
Over a 6 hour window
Where the date is June 15th 2016
Grouped by the hour of the day

The only new piece here is the 6 hour window. So let's write the parts of the query that we know right
now:
1 select
2 avg(tip_amount / total_amount),
3 extract(
4 hour
5 from
6 pickup_datetime
7 ) as hour_of_day
8 from
9 nyc_yellow_taxi_0601_0615_2016
10 where
11 pickup_datetime :: date = '2016-06-15'
12
13 -- since we can't divide by 0 we will remove all amounts
14 -- that equal 0
15
16 and total_amount > 0
17 group by
18 extract(
19 hour
20 from
21 pickup_datetime
22 )
23 order by
24 extract(
25 hour
26 from
27 pickup_datetime
28 ) asc
avg hour_of_day
0.10381201153091137 0
0.09603191451808368 1
0.09137175846790613 2
0.08759743269474266 3
0.06674264082545112 4
0.09629157470475837 5
So the only thing we need to add in is our window function. Our window functions will roughly follow
this signature:
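Roughly, a sketch of the shape (all names here are placeholders):

window_function(column) over (
  partition by grouping_column
  order by
    ordering_column rows between N preceding
    and current row
)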
To accomplish this we want to use another CTE to group our tip percentages by day and hour. Let’s
take a look at the query and talk through what it is doing:
10 date_trunc('hour', pickup_datetime)
11 )
12 select
13 day_hour,
14 tip_percentage,
15 avg(tip_percentage) over (
16 order by
17 day_hour asc rows between 5 preceding
18 and current row
19 ) as moving_average
20 from
21 taxis
First let’s look at line 4 in the CTE. This is a great example of "if there is something you need to do there
is probably a SQL function for it". In this case, we want to truncate our pickup timestamp to the nearest
hour. We can use the DATE_TRUNC function to do this.
With that added into our query we can select the newly created day_hour column and the tip_percentage
and add in our window function. Since we are not grouping the window function we do not need to
use the partition by clause. Another great feature of pgAdmin is the ability to quickly make some sim-
ple chart visualizations. We can see the results of our average and moving average on a chart (Figure
1.1):
Up until this point all our window functions have used a frame of a set number of rows to look back
over, which is not necessary in all window queries. For example, below we can simply
query a running sum of the total fares by day and partition, or group, by the passenger count.
18 passenger_count,
19 sum(total) over (
20 partition by passenger_count
21 order by
22 date
23 )
24 from
25 taxis
There are several other window functions that are available, and you can see the complete set using the
link in the footnote here75 . There are three more functions that I have used most often when writing
window functions. The first is ROW_NUMBER.
In the case that you need to number a row you can use the ROW_NUMBER window function. This returns
a row number starting at 1 for each row in the partition. For example:
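A sketch of a simple example (the taxi table is assumed here, since the ogc_fid values shown below come from one of the larger imported tables):

select
  row_number() over (
    order by
      ogc_fid
  ) as row_no,
  ogc_fid
from
  nyc_yellow_taxi_0601_0615_2016
limit
  5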
row_no ogc_fid
1 5243761
2 5243762
3 5243763
4 5243764
5 5243765
We can also use partition here to number the rows based on a grouped value such as:
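For example, a sketch numbering rows within each passenger count:

select
  row_number() over (
    partition by passenger_count
    order by
      ogc_fid
  ) as row_no,
  passenger_count,
  ogc_fid
from
  nyc_yellow_taxi_0601_0615_2016
limit
  5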
75 https://www.postgresql.org/docs/current/functions-window.html
Another useful pair of functions is LAG and LEAD. These functions take two arguments: the column, and
the number of rows you want to lag or lead by. In this case we will use LAG to look at the day-over-day
change in the total amount of fares, grouped by the number of passengers.
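One way to write that might be:

with daily as (
  select
    pickup_datetime :: date as date,
    passenger_count,
    sum(total_amount) as total
  from
    nyc_yellow_taxi_0601_0615_2016
  group by
    pickup_datetime :: date,
    passenger_count
)
select
  date,
  passenger_count,
  total - lag(total, 1) over (
    partition by passenger_count
    order by
      date
  ) as change_from_previous_day
from
  daily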
Using LEAD we could find the same values in the other direction for the total of the current row minus
the total from the next day in the future.
1.6 Joins
Up until now we have only been working with queries on one table. Some of the real utility of SQL
comes when you start to join tables. Joins can seem like a complex topic, especially if you have only
worked on joining data in a GIS toolkit, but when planned out correctly and with a few simple
rules, they will quickly make your analysis far more useful and fast.
Joins in SQL allow you to join two or more datasets based on one or more conditions. You can join the
data with multiple different methods: keeping only matching rows, keeping matching rows from one
table plus all rows from another, keeping all matches and non-matches, or joining every row to
every other row.
For me, the best way to think about a join was always visually. Below is a quick illustration of the
different types of joins (Figure 1.2):
We will be using two tables, our tree census and our zip code dataset to practice a few joins. We can
then structure the same illustration with our tables and the joins that we will be doing.
With that there are a few tips and tricks I like to use when setting up a join:
• As much as possible, use inner joins, represented by the keyword JOIN
• Instead of switching between LEFT and RIGHT joins, just use one and move the table order
• In most cases you will want an INNER join rather than an OUTER join, and for an inner join you don't
need to write the INNER keyword explicitly.
• CROSS JOIN can be used in very specific use cases as we will see later
With that let’s do a very simple join between our two tables:
select
  nyc_311.complaint_type,
  nyc_zips.population
from
  nyc_311
  join nyc_zips on nyc_311.incident_zip = nyc_zips.zipcode
limit
  5
There are a few things to notice here. First is that before every column we are using the table name,
followed by a period, followed by the column name. Since we are using two tables we need to designate
which table each column is coming from.
Next is the join syntax itself. First we designate our left table in the FROM clause, followed by the right
table, which has JOIN listed before it. We can then use the ON keyword to tell the query how to join our
two tables based on a condition. That condition must evaluate to true between two columns. In most
cases this will be an equality comparison using = since that joins rows based on matching values.
Keep in mind that, just like any other SQL query, you can also use a WHERE conditional as needed in addition
to the join. So if we wanted to join only a specific type of 311 call we could do so with the WHERE clause;
the JOIN only defines the key on which we want to join.
So when we run our join, you can find the results here:
complaint_type population
Food Poisoning 69255
Noise - Commercial 69255
Illegal Fireworks 80857
Illegal Fireworks 80857
Noise - Residential 56670
As stated earlier, we have to reference the table names prior to the columns we want to reference. Now
when you have longer table names this adds a ton of typing. The good news is that we can actually use
a table alias by adding a short name of one or more characters (it must start with a letter, not a number)
after each table, then use those aliases in place of the full table names. For example, we can rewrite our
query above to:
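The aliased version might look like this:

select
  a.complaint_type,
  b.population
from
  nyc_311 a
  join nyc_zips b on a.incident_zip = b.zipcode
limit
  5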
I usually use one letter in alphabetical order, or a shorthand name for each table, so that I will recognize
what each alias is referencing.
The LEFT and RIGHT joins will perform the same operation as the (INNER) JOIN above, but will also
include non-matching values. To illustrate this, we can write a subquery that grabs 30 rows from our
zip codes table.
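A sketch of that query:

select
  a.complaint_type,
  a.incident_zip,
  b.population
from
  nyc_311 a
  left join (
    select
      zipcode,
      population
    from
      nyc_zips
    limit
      30
  ) b on a.incident_zip = b.zipcode
limit
  100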
As you can see here the rows that don’t match from the left table, in this case our 311 table, are still
shown but there are null values in the rows from the zip code table. The RIGHT join will accomplish the
exact opposite of this.
A FULL OUTER join will join all possible rows, matches and non-matches. Very few times have I actually
had to use a FULL OUTER join, but we can take a look at it here just in case it comes up for you.
11 population,
12 zipcode
13 from
14 nyc_zips
15 limit
16 30
17 )
18 select
19 a.complaint_type,
20 a.incident_zip,
21 b.population
22 from
23 a full
24 outer join b on a.incident_zip = b.zipcode
25 limit
26 100
I won’t show the full results here but you should see three categories of results:
• Results from the left table (311 data in the CTE) that have no match
• Results from the right table (zip code data in the CTE) that have no match
• Results that match in both tables
CROSS Joins
Cross joins are joins between two tables that pair each row in the left table with every row in
the right table. In effect, if you have a table of 10 rows and another table of 10 rows, your resulting
join will have 100 rows. While this may not seem useful, I have actually used a cross join many times
for creating origin-destination matrices or distance tables. Since you are joining every single row, you
don't actually need a join condition. There are two ways to accomplish this, and we will show both
using a simple subquery with 2 rows in each table.
In the first method we can explicitly call out the CROSS JOIN:
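For example, a sketch using the zip codes and a neighborhoods table (the neighborhoods table name is an assumption about the imported data):

select
  a.zipcode,
  b.ntaname
from
  (
    select
      zipcode
    from
      nyc_zips
    limit
      2
  ) a
  cross join (
    select
      ntaname
    from
      nyc_neighborhoods
    limit
      2
  ) b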
And in the next we can just separate the tables using a comma for a more shorthand method.
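The shorthand version of the same sketch:

select
  a.zipcode,
  b.ntaname
from
  (
    select
      zipcode
    from
      nyc_zips
    limit
      2
  ) a,
  (
    select
      ntaname
    from
      nyc_neighborhoods
    limit
      2
  ) b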
In most cases you will want to compute something between the two tables. Here we can divide the
population of the zip code by the shape area of the neighborhood boundary. While this particular number
has no analytical value, you can start to see some use cases for a cross join:
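Continuing the sketch above (population and shape_area are assumed column names):

select
  a.zipcode,
  b.ntaname,
  a.population / b.shape_area as pop_per_area
from
  (
    select
      zipcode,
      population
    from
      nyc_zips
    limit
      2
  ) a
  cross join (
    select
      ntaname,
      shape_area
    from
      nyc_neighborhoods
    limit
      2
  ) b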
While joins are great on their own, you can also perform aggregations alongside joins, and this is where
things start to get really powerful. Of course you can perform basic calculations, such as math and other
functions, between tables with a join, but aggregations allow you to aggregate a longer table and join the
result to other data. This will come up later in spatial joins as well.
For a quick example let's look at how many buildings fall within each zip code:
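A sketch of that query (as noted below, the zipcode column in the buildings data may need to be cast so that both zipcode columns share a type):

select
  count(buildings.ogc_fid) as count,
  zipcode
from
  nyc_zips
  join nyc_mappluto buildings using (zipcode)
group by
  zipcode
order by
  count(buildings.ogc_fid) desc
limit
  5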
count zipcode
21000 10314
19436 10312
19389 11234
17021 10306
14682 11236
As you can see we also used a new keyword, USING, to join our data. If the columns that you
want to join on have the same name and the same data type (I had to cast the zipcode column from the
buildings data to achieve this), then you can simply designate that single shared column name inside
the USING join parameter.
Now you can join more than one table, but there are some caveats here that you want to be careful of.
Let's take a quick scenario using our aggregation from above and add another aggregation to our
query with two other datasets:
5 from
6 nyc_mappluto
7 order by
8 ogc_fid desc
9 limit
10 5000
11 ), c as (
12 select
13 ogc_fid,
14 zipcode
15 from
16 nyc_2015_tree_census
17 order by
18 ogc_fid desc
19 limit
20 5000
21 )
22 select
23 count(a.ogc_fid) as buildings,
24 count(c.ogc_fid) as trees,
25 b.zipcode
26 from
27 nyc_zips b
28 join a using(zipcode)
29 join c using(zipcode)
30 group by
31 b.zipcode
32 order by
33 count(a.ogc_fid) desc
Full warning: I do not recommend running this query without the limits, as it will take a very long time to
complete, and I will explain why below. Let's take a look at the results:
As you can see both our counts are actually the same. We can check and see if that is accurate by
running each join separately. First for our buildings layer:
17 order by
18 ogc_fid desc
19 limit
20 5000
21 )
22 select
23 count(a.ogc_fid) as buildings,
24 -- count(c.ogc_fid) as trees,
25 b.zipcode
26 from
27 nyc_zips b
28 join a using(zipcode)
29 -- join c
30 -- using(zipcode)
31 group by
32 b.zipcode
33 order by
34 count(a.ogc_fid) desc
buildings zipcode
3480 10309
1187 10307
323 10312
3 10304
trees zipcode
211 11375
190 11230
177 11105
168 10457
146 11370
So we can see that neither count matches what we saw in the first query. Instead, we can
single out one specific zip code, 10312, and see what the results for that one zip code look like. You can
add this as a WHERE clause to the queries above, but we can skip ahead to the results, first for
buildings:
buildings zipcode
323 10312
trees zipcode
135 10312
If you haven’t guessed it yet, our first return value for the zip code 10312 was 43,605, which is the
result of multiplying 135 by 323. So then why did our query do this? In short, each table join will
return a single, intermediate table, also known as a derived table. The multiple joins will read each
step sequentially, so in the case above when we reference our second join to the first table, we are
inadvertently returning a Cartesian join, or cross join, because one table may match 50 rows in the first
join, while the second could match 1000, for example.
If you take a look at the visual analysis of the query you can see that the aggregation only happens
at the final step of the query, which is why the rows are in effect multiplied. You can see the
visualized query plan in Figure 1.3.
So to accomplish this query we can simply use CTEs to aggregate our data twice and then join the results in one final
query. We can then get rid of the temporary limits and write our full query:
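A sketch of that corrected query:

with a as (
  select
    count(ogc_fid) as buildings,
    zipcode
  from
    nyc_mappluto
  group by
    zipcode
), c as (
  select
    count(ogc_fid) as trees,
    zipcode
  from
    nyc_2015_tree_census
  group by
    zipcode
)
select
  a.buildings,
  c.trees,
  b.zipcode
from
  nyc_zips b
  join a using (zipcode)
  join c using (zipcode)
order by
  a.buildings desc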
So this gives us the correct counts and runs quite quickly given the total amount of data we are analyzing:
we are joining 683,788 rows in our trees table and 856,997 rows in our buildings
table.
1.7 Lateral joins

Now there is one other specific type of join that requires a section of its own because of its applicability
for advanced analysis with spatial SQL. What makes a LATERAL JOIN special is that the join:
• Follows the FROM statement as a separate subquery
• Can access data (i.e. columns) from the left table, or the table preceding the LATERAL JOIN
We will cover why this has some very special applications for spatial data later in the book, but for now
let's show what this can look like in practice. To do so we will actually use a CROSS
JOIN that takes every combination of records, even though we are just returning one result per row. Using
this syntax we can divide the total number of trees in each neighborhood by the
shape_area column, which is the size of the neighborhood in square yards.
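A sketch of that query (the neighborhoods table name and the join column between the two tables are assumptions about the imported data):

select
  n.ntaname,
  t.tree_count / n.shape_area as trees_per_sq_yrd
from
  nyc_neighborhoods n
  cross join lateral (
    select
      count(*) as tree_count
    from
      nyc_2015_tree_census trees
    where
      trees.nta_name = n.ntaname
  ) t
order by
  trees_per_sq_yrd desc
limit
  5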
ntaname trees_per_sq_yrd
Upper East Side-Carnegie Hill 0.00023288887617650641
Central Harlem South 0.00018488732748556470
Brooklyn Heights-Cobble Hill 0.00017696398193253380
Upper West Side 0.00017105904160212209
Fordham South 0.00016801727715830502
As we can see, Upper East Side-Carnegie Hill has the highest tree density.
What is great about the LATERAL JOIN is that you can calculate across many rows within the lateral join
and use aggregates like we did above.
1.8 Triggers
TRIGGER functions allow you to write a SQL action that will "trigger" based on one of four event types on
your table: INSERT, UPDATE, DELETE, or TRUNCATE. Within the TRIGGER we will be using a variation of SQL
known as PL/pgSQL76, or SQL Procedural Language. This allows you to do more scripting-like
operations similar to Python; in fact, you can use Python as a scripting language
with PostgreSQL as well using PL/Python77.
To do this we can use a temporary table that we used in our CRUD operations earlier. If you already
have that table you can delete it and start from scratch or make a second table with another table name.
76 https://www.postgresql.org/docs/current/plpgsql.html
77 https://www.postgresql.org/docs/current/plpython.html
Next we will create our trigger that will look up the city’s time zone using the city name. The first step
is to create a new function (which we will cover in more detail in the next section). For now we can use
this function and run it in pgAdmin:
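A sketch of what such a function can look like; the function name, the time_zone column, and the exact lookup logic are assumptions here, and the book's version (see the notes below) is structured slightly differently:

create or replace function add_time_zone()
returns trigger
language plpgsql
as $$
begin
  if new.time_zone is null then
    new.time_zone = (
      select
        name
      from
        pg_timezone_names
      where
        name like '%' || new.city || '%'
      limit
        1
    );
  end if;
  return new;
end;
$$;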
A few notes:
• We are looking up the time zone against a built-in view in PostgreSQL named pg_timezone_names,
which you can query in pgAdmin
• The expression '%' || city || '%' wraps the city name from our cities table in "%" wildcards
to do the pattern search
• We can check whether that column value is null using the IF statement
– Later we end the IF statement and return the new row, then finally END the function
procedure
Next we can create our TRIGGER:
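A sketch of the trigger definition (the trigger and function names are illustrative):

create trigger set_time_zone
before insert on cities
for each row
execute function add_time_zone();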
So if we did everything correctly we can run our insert statement and then our trigger should take
action once this statement is run.
So it looks like only Shanghai and Tokyo updated. What might be the issue? Here is another coding
challenge for you: take a look and think about why the data is not updating. A few hints to think about
as we work backwards to where the issue might be:
• Our rows inserted correctly
• We know the trigger ran since the data was updated after an insert like we specified
• We know the function matched the cities correctly for two cities
The likely remaining issue is the matching condition between our two tables. In that case, let's
take a look at the time zones table:
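For example:

select
  name
from
  pg_timezone_names
where
  name like '%Mexico%'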
So for cities with a space in their name we can see that it uses an underscore.
So we can update our function to use the REPLACE function, and then run the process over again.
13 from
14 pg_timezone_names
15 where
16 name like '%' || replace(city, ' ', '_') || '%'
17 ) as data;
18
19 return new;
20
21 end;
22
23 $$
Now as we can see we were able to update Mexico City, but none of the others, and we would have to
cover a ton of edge cases to get them all. First off we would have to address accent marks, the fact that
India has a single time zone for the whole country that is not tied to major city names, and that Brazil
has regionally defined time zones.
1.9 UDFs
As mentioned above, user defined functions, or UDFs can greatly extend the capabilities of PostgreSQL
by building a function out of multiple parts of a statement or even adding in other languages like
Python. Once again these use PL/pgSQL or SQL Procedural Language and PL/Python. There are also
procedural languages for Java, Perl, Javascript, R, and more.
For our purposes we will stick to SQL and also implement some functions in Python later in the book.
The advantage of using SQL is that you can take a potentially lengthy chunk of code, or something that
requires multiple functions or steps that you run on a regular or recurring basis, and wrap it into a
single reusable function. User defined functions can be programmed to do any number of things, including:
• Storing code that can be used like any other function
• Performing things like if, case, and loops
• Defining parameters and constant variables
• Creating stored procedures, which have the added capability of running table modifications
PL/pgSQL has a ton of options, and you can create very complex functions and procedures to manage
analytical workflows, as well as complex data management processes. Let's create two functions
that we can use: the first will calculate the tip percentage in the NYC Taxi dataset, and the second will
search for specific values in our NYC 311 data.
As we know, we can calculate our tip percentage in our taxis data with an expression such as:
tip_amount/total_amount
So to do so we can create a function that will return a new column of data from that table. Our function
structure will look like this:
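A sketch of such a function, matching the structure described in the notes below (the function and variable names are illustrative):

create or replace function get_tip_percentage(tip float8, total float8)
returns numeric
language plpgsql
as $$
declare
  tip_percentage numeric;
begin
  if total = 0 or total is null then
    tip_percentage = 0;
  else
    tip_percentage = tip / total;
  end if;
  return tip_percentage;
end;
$$;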
• We created a function that defines the inputs and column data types which match the data types
which are float8 or numbers with decimal places
• We declare a return variable which is tip_percentage and the data type is numeric
• We use an if/else statement to handle edge cases:
– If the total_column value is 0 or null then we can just return 0, since we can't divide by
0 or null, rather than running the calculation (this is a very small gain, but in more complex
functions and queries this can save a lot of time)
– In each step we end it with a semicolon and set our declared value of tip_percentage
to the value we want
• If the total_column is greater than 0 then we can run our calculation
• We then end the if statement and return the tip_percentage variable
Let’s give this a test and see if it works, using our new function just like any other function:
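For example (using the sketched function name from above):

select
  tip_amount,
  total_amount,
  get_tip_percentage(tip_amount, total_amount)
from
  nyc_yellow_taxi_0601_0615_2016
limit
  5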
Great! Okay, let's try this again. This time we can use a function to return a subset of results just like
a query, but abstract the logic away from the query into one compact function. In this case we want to:
• Query the NYC 311 table and return text matches of terms in the columns
– complaint_type
– descriptor
– location_type
This function will look a bit different:
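A sketch of a set-returning function along those lines (the function name and the use of ILIKE are assumptions):

create or replace function search_311(search_term text)
returns setof nyc_311
language sql
as $$
  select
    *
  from
    nyc_311
  where
    complaint_type ilike '%' || search_term || '%'
    or descriptor ilike '%' || search_term || '%'
    or location_type ilike '%' || search_term || '%';
$$;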
There are so many more use cases for user defined functions, but these should give you some tools and
starting points for basic functions to help you speed up your workflows.
As of this chapter you should have all the building blocks in SQL that you need to start writing spatial
SQL. So now we can finally move onto the core topic we want to cover: spatial SQL!
1.10 Expert Voices: Fawad Qureshi

Started as a hobby playing with geo data and then started using it in different projects.
I always think Spatial SQL allows you to discover patterns in data that otherwise you might miss. The
new patterns unlocked using Spatial SQL are always fascinating.
Can you share an interesting way or use case that you are using spatial SQL for today?
At the beginning of the pandemic, I combined data between multiple telecoms and used complex
Spatial SQL (using tessellation indexing) to perform COVID contact tracing.
2. Using the GEOMETRY
Now that we have learned the basic and advanced use cases, functions, and tools for using SQL, we
are finally ready to move onto the core topic of this book: spatial SQL. The rest of this book will be
dedicated to the spatial side of SQL, but before we move on I want to briefly pause to reflect and also
address one point.
First off, you have, in the past few chapters, built a very solid foundation for understanding SQL and
put yourself in a position to be able to create a database, load data, query data, and write a full range of
queries. This alone is an achievement and one that you could cover in one or many books or courses.
So give yourself a high five, pat on the back, or a well deserved reward. You earned it.
Next, I want to reflect on why spatial SQL has been somewhat of an underutilized tool for the analytics
side of geospatial. If you come from more of a traditional GIS background you can start to see why the
barrier to entry is so high for a tool like PostGIS or spatial SQL as a language. To fully make use of it
you need a foundation in the language before you can proceed with using the spatial aspects of that
language. SQL alone is a big skill to learn and something that people work for years to perfect and
master, so learning it alone is a challenge.
The other side of this is that traditional education generally treats SQL as a more advanced skill. Most
GIS programs start at the desktop tool level since they are purpose built for GIS and spatial data, and
only when you need to move on to more advanced use cases or to use more data should you introduce
spatial SQL and databases because of the additional complexity they come with. But I hope to prove to
you through the rest of the book, with ideas and practical examples, why spatial SQL is very valuable
and critical to modern GIS and expanding the use of spatial analytics beyond those in GIS.
The single thing that differentiates SQL from spatial SQL is the pair of data types that contain spatial
data: the GEOMETRY and the GEOGRAPHY. In effect these two data types are mostly the same, as they both
store things like points, lines, and polygons, with one key difference.
The GEOMETRY uses a flat, projected, or Cartesian surface, while the GEOGRAPHY uses the curved earth for
functions that need to account for the curvature of the earth. For distances from one side
of a country to another, you will get a more accurate measure when you use the GEOGRAPHY. However,
at a city or local level, the GEOMETRY will work just fine. A few other key points to highlight here:
• The GEOMETRY can be transformed into other projection systems just as you would in other tools
using a function called ST_Transform. The GEOGRAPHY always uses a geodetic (longitude and latitude)
reference, so you do not reproject it the way you would a GEOMETRY.
• There are performance considerations for the two as well. There can be some delay in performance
when using the GEOGRAPHY, which is well documented in this StackOverflow post78, which
runs two tests and finds a 2.6 to 4.5 times increase in function time when using geographies. You
can also see other testing benchmarks in this post79.
• The very rough best practice is that if you have data at a large continental scale you may want
to consider using GEOGRAPHY if you are using operations that require it; otherwise the GEOMETRY will
more likely be sufficient.
78 https://loc8.cc/sql/gis.stackexchange
79 https://loc8.cc/sql/medium-postgis-performance
In addition there are only a handful of functions that will support the GEOGRAPHY type in PostGIS. Those
are listed here and are borrowed from the PostGIS documentation80 :
• Parser functions
– ST_GeographyFromText(text) returns geography
– ST_AsBinary(geography) returns bytea
– ST_GeogFromWKB(bytea) returns geography
– ST_AsSVG(geography) returns text
– ST_AsGML(geography) returns text
– ST_AsText(geography) returns text
– ST_AsKML(geography) returns text
– ST_AsGeoJson(geography) returns text
• Transformation functions
– ST_Buffer(geography, float8) returns geography
– ST_Intersection(geography, geography) returns geography
• Measurement functions
– ST_Distance(geography, geography) returns double
– ST_Area(geography) returns double
– ST_Length(geography) returns double
• Spatial relationship functions
– ST_Covers(geography, geography) returns boolean
– ST_DWithin(geography, geography, float8) returns boolean
– ST_CoveredBy(geography, geography) returns boolean
– ST_Intersects(geography, geography) returns boolean
But with all that said it is very easy to move from one type to the other. All you have to do is cast the
data just like you would with any other data type. A sample query would look like this:
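For example, a sketch casting a geometry column to geography (assuming the geometry column is named geom, as it is elsewhere in this chapter):

select
  geom :: geography
from
  nyc_2015_tree_census
limit
  1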
So you likely won't have to worry much about this, as you can change between the two as needed.
If you want to find what data type your data is using, you can simply use the pg_typeof() function. No
special spatial functions, just normal SQL.
Now not all geometries are created the same, and there are many types of geometry data that can be
stored and queried in PostGIS. So let’s start to take a look at some of the data we loaded to understand
this data type. We can query our NYC Building Footprint data to get started:
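A sketch of that query (the table name for the building footprints is an assumption about how the data was imported):

select
  geom
from
  nyc_building_footprints
limit
  1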
80 https://loc8.cc/sql/postgis-intro
geom
0106000020E61000...
Confused yet? Not to worry, we can explain this. What you are looking at is called Well Known Binary,
one of two accepted geometric representations defined by OpenGIS81 . The other is Well Known Text
which is far more readable to the human eye. We can see that by using the following query:
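For example (using the same assumed table name as above):

select
  st_astext(geom) as wkt
from
  nyc_building_footprints
limit
  1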
wkt
MULTIPOLYGON(((-73.96664570466969 40.62599676998366...
As you can see here we have what is known as a MULTIPOLYGON, or a polygon of multiple parts. Each sub-
POLYGON is contained within parentheses and is defined by points separated by commas. Within each
point, the X and Y coordinates are separated not by a comma but by a space. This makes up the basics
of a geometry in PostGIS. While the full geometry text is not shown in the example above, there is
actually only one polygon, not multiple, in the above result. We will cover that in the next section.
With that let’s cover the different types of geometries that PostGIS, and most of the other spatial
databases support. First we have our base geometries of which all others are made up of (all will
be represented in Well Known Text, or WKT which we will use henceforth):
• Point: ’POINT(0 0)’
• Line: ’LINESTRING(0 0,1 1,1 2)’
• Polygon ’POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))’
We can actually see these on a map by creating a table and inserting that data into it:
create table geometries (name text, geom geometry)
81 https://postgis.net/docs/manual-2.3/using_postgis_dbmanagement.html#RefObject
insert into
  geometries
values
  ('point', st_geomfromtext('POINT(0 0)')),
  (
    'line',
    st_geomfromtext('LINESTRING(0 0,1 1,1 2)')
  ),
  (
    'polygon',
    st_geomfromtext(
      'POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))'
    )
  )
Then query it and see the geometries on a map using the geometry viewer in pgAdmin:
Simply click the little map icon in the query results (Figure 2.1):
And you should see something like this (Figure 2.2, on the next page):
Note that we have not assigned a coordinate system to our data, but we can do that later. Let’s continue
on to look at some other geometry types, this time the "multi" geometry family.
• Multi-point: ’MULTIPOINT((0 0),(1 2))’
• Multi-line: ’MULTILINESTRING((0 0,1 1,1 2),(2 3,3 2,5 4))’
• Multi-polygon: ’MULTIPOLYGON(((0 0,4 0,4 4,0 4,0 0),(1 1,2 1,2 2,1 2,1 1)), ((-1 -1,-1 -2,-2 -2,-2 -1,-1 -1)))’
• Geometry collection: ’GEOMETRYCOLLECTION(POINT(2 3),LINESTRING(2 3,3 4))’
Now most of these are the same as they are a group of one or more geometries of the same type, points,
lines, or polygons, apart from the Geometry collection which can hold one or more of any geometry
type, including multi-geometries. Let’s add these to our table to see them on the map too (Figure 2.3,
on page 162).
insert into
  geometries
values
  (
    'multipoint',
    st_geomfromtext('MULTIPOINT((0 0),(1 2))')
  ),
  (
    'multiline',
    st_geomfromtext('MULTILINESTRING((0 0,1 1,1 2),(2 3,3 2,5 4))')
  ),
  (
    'multipolygon',
    st_geomfromtext(
      'MULTIPOLYGON(((0 0,4 0,4 4,0 4,0 0),(1 1,2 1,2 2,1 2,1 1)), ((-1 -1,-1 -2,-2 -2,-2 -1,-1 -1)))'
    )
  ),
  (
    'geometry collection',
    st_geomfromtext(
      'GEOMETRYCOLLECTION(POINT(2 3),LINESTRING(2 3,3 4))'
    )
  )
The next set of geometry types also uses Well Known Binary (WKB) and Well Known Text (WKT), but in a format known as Extended Well Known Binary and Extended Well Known Text (EWKB and EWKT respectively). This allows for:
• 3-dimensional data with an X and Y dimension and either a Z (height) or M (another dimension, most commonly time, but it can also be a distance marker or upstream distance)82
• 4 dimensional data with X, Y, Z, M
• Embedded Spatial Reference ID or SRID
These new data types include but are not limited to:
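A few illustrative examples of what these look like in EWKT form (not an exhaustive list):
• 3D point (Z): ’POINT Z (0 0 0)’
• Point with a measure (M): ’POINT M (0 0 0)’
• 4D point (ZM): ’POINT ZM (0 0 0 0)’
• Point with an embedded SRID: ’SRID=4326;POINT(0 0)’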
82 https://postgis.net/workshops/postgis-intro/3d.html
We will do some work with 3D data later in the book, but for now we don't need to add these to our table. There are also several other types of data that PostGIS supports that we will not be using in this book. I have not used any of these types myself, but they are available should you need or want to use them. These are generally curved lines or polygons, triangular data, and POLYHEDRALSURFACE, which is effectively a 3D polygon/shape. Generally, in a 2D map this data cannot be rendered unless you use a function to turn it into one of the primary geometry types. For example, running this query:
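A sketch of that idea, converting a circular arc into a regular LINESTRING with ST_CurveToLine (the arc coordinates here are purely illustrative):

select
  st_astext(
    st_curvetoline(
      st_geomfromtext('CIRCULARSTRING(0 0, 1 1, 2 0)')
    )
  )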
Outside of PostGIS, most other databases or data warehouses support the first set of geometries we discussed, sometimes only support a GEOGRAPHY, and many of the data warehouses will only support EPSG 4326. PostGIS also supports raster data, which we will cover in a later chapter.
2.3 Size of Geometry Data

It is important to understand that in almost all cases the size of spatial data, meaning the amount of disk space (or bytes) it takes to store a geometry, is much larger than any other data type in your database. There is an easy function to measure this, which we can use on our NYC Zip Code polygons.
Before we do that, let's understand why this matters. There are three main reasons why you want to know the sizing of your geometry data. The first is storage. Taking the example of polygon data, as the number of points in a polygon increases, each of those points will increase the disk size of that polygon. Let's test some examples out with our data. First let's check out the size of a single point:
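One way to do that, building the point inline from WKT:

select
  st_memsize(st_geomfromtext('POINT(0 0)'))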
83 https://postgis.net/docs/using_postgis_dbmanagement.html
Note that this query builds this single piece of data on the fly. It is not stored anywhere, but it allows us to test simple geometries quickly.
The query returns 32, which represents 32 bytes of data. So how exactly does it get to that number? We can take a look at the PostgreSQL documentation for the size of the primitive geometric data types84 to see that a point requires 16 bytes of storage. But what are the other 16 bytes that make up the total of 32? This post85 from Dan Baston explains where the other 16 bytes for a point come from:
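Applying the same check to a simple square polygon, four corners plus the repeated closing point:

select
  st_memsize(
    st_geomfromtext('POLYGON((0 0,4 0,4 4,0 4,0 0))')
  )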
This returns 120 bytes. A bit more than you may have thought, but here is the explanation:
• 4 points in the polygon, which means 16 * 4, or 64 bytes
– But we also have to account for an additional point for the repeated first point to
close the polygon, which adds 16 bytes one more time
• The POLYGON has a base of 40 bytes on its own, which is defined in the PostgreSQL docs86 .
So 16 * 5 = 80 and when we add 40 we get 120.
Let’s try this with one geometry from our New York City Neighborhoods dataset.
select
  st_memsize(geom)
from
  nyc_neighborhoods
where
  neighborhood = 'College Point'
This is a particularly complex geometry in north Queens, and it returns 17,488 bytes or 17.4 KB. So let’s
first figure out how to get to this number. To do so we can see how many points are in the polygon, the
geometry type, and the number of geometries that are in the polygon in case it is a multi-polygon.
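A sketch of that inspection for the same geometry:

select
  st_geometrytype(geom),
  st_numgeometries(geom),
  st_npoints(geom)
from
  nyc_neighborhoods
where
  neighborhood = 'College Point'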
It is a multi-polygon with 1 geometry and 1,090 points. So based on what we know from above:
• 40 bytes for the base polygon
• 16 bytes * 1,090 points or 17,440 bytes
• 16 bytes for the closing point
17,440 + 40 + 16 = 17,496, which is 8 bytes more than the measured size, so our calculation somehow has 8 extra bytes. So why did our calculation end up with extra data? To break this down we can revisit our base polygon, which has 120 bytes, but this time as a multi-polygon:
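For example, wrapping the same square in a single-part MULTIPOLYGON:

select
  st_memsize(
    st_geomfromtext('MULTIPOLYGON(((0 0,4 0,4 4,0 4,0 0)))')
  )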
Now we get 128 bytes. Let’s try again with another polygon in the multi-polygon:
This time we get 224 bytes, when we expected 256 (128 * 2). So what exactly is going on? Let’s try and
break this down for our first multi-polygon:
• First we have 5 points here (4 + 1 closing point) which gives us 80 bytes, 48 bytes remain
• Then we have the base 40 to store our base polygon, 8 remain
Let’s look back at the elements needed for our 16 base bytes from before:
"4 bytes to tell Postgres how large the PostGIS object is
4 bytes to store the SRID and various flags
4 bytes to store the geometry type (Point)
Storage and operation time are important to consider at the base data level, which cannot be changed in all cases, but this will help you understand where some scalability issues will originate.
2.4 A Note on PostGIS Documentation

One key skill that I have always reiterated as critical to becoming truly self-sufficient in spatial SQL, or any SQL programming, is the ability to read and use documentation. For example, if you hit an issue, the first place you might look is by searching online, looking through issues on StackOverflow, or even using ChatGPT to see if it can help solve the issue. But one sure way is to go to the source to see what inputs the function requires, what the outputs are, and what you can expect the function to do.
You can find the PostGIS documentation at postgis.net/documentation87, which has several versions depending on the release of PostGIS that you are using. It is important to make sure you have the right version so that it matches the set of functions you are using, as there may be minor differences between versions. The docs have a built-in search function, but the documentation pages are well indexed by search engines like Google, so you can certainly use Google to navigate to what you are looking for. In addition, you can sometimes even search generally for what you want to do if you do not know the exact function you need, and Google will often point you in the right direction.
On a side note, this is one area where a tool like ChatGPT can be helpful, as it does a fairly decent job of indexing PostGIS and other documentation pages. Keep in mind that while ChatGPT can help you assemble some code, it is not perfect, so I like to use it when I want to look something up by phrasing the question in a more human way.
Now there are two notes that I think are important to call out when you are reading the docs. The first is reading the function signature, which we have covered a bit in the book already. This is the first thing I always start with when looking at any documentation. Let's take a look at a function we will cover later in this chapter as it is described in the PostGIS docs.
ST_Length
First we see the word float and this tells us the data type that is being returned from the function, in
this case a numeric float value or a number with decimals. Next we can see that there are two different
function signatures with one and two arguments respectively. The first is:
ST_Length(geometry a_2dlinestring);
This first part tells us that the function will take in a GEOMETRY data type, specifically a 2D linestring,
so not a 3D value and not a point or a polygon. The second is:
ST_Length(geography geog, boolean use_spheroid = true);
This tells us that this time the function takes a GEOGRAPHY data type, geog, meaning any geography value, plus another BOOLEAN parameter that will allow you to use a spheroid or sphere
87 https://postgis.net/documentation/
depending on the boolean value. The signature also shows that the default for this argument is true. This means that you only need to pass false if you explicitly do not want to use a spheroid. A spheroid is a more accurate representation of the globe but may be a bit slower than using a sphere.
The next portion is to read the description. I like to read this to make sure that my interpretation of the
functions was correct and to make sure I am not missing any "gotchas" or potential parts of the function
that may return an odd value. From the PostGIS docs:
For geometry types: returns the 2D Cartesian length of the geometry if it is a LineString, MultiLineString, ST_Curve, ST_MultiCurve. For areal geometries 0 is returned; use ST_Perimeter instead. The units of length are determined by the spatial reference system of the geometry.
For geography types: computation is performed using the inverse geodesic calculation. Units of length are in
meters. If PostGIS is compiled with PROJ version 4.8.0 or later, the spheroid is specified by the SRID, otherwise
it is exclusive to WGS84. If use_spheroid=false, then the calculation is based on a sphere instead of a spheroid.
Currently for geometry this is an alias for ST_Length2D, but this may change to support higher dimensions.
Based on the description it looks like we have a good understanding of ST_Length. We can also see that this function acts as an alias for ST_Length2D, but in the future it may add support for 3D or 4D lines.
Next we can see a warning which is called out by a red stop sign (Figure 2.5):
Changed: 2.0.0 Breaking change -- in prior versions applying this to a MULTI/POLYGON of type geography
would give you the perimeter of the POLYGON/MULTIPOLYGON. In 2.0.0 this was changed to return 0 to be
in line with geometry behavior. Please use ST_Perimeter if you want the perimeter of a polygon
And this validates that we can only use this function with linestrings, but that in a previous version (below 2.0.0) it did support other geometry types.
There is also a note called out in a yellow box with a sticky note icon (Figure 2.6):
For geography the calculation defaults to using a spheroidal model. To use the faster but less accurate spherical
calculation use ST_Length(gg,false);
And this validates that we can use the sphere for a faster calculation. Everything contained in the PostGIS docs is there for a reason; there is no erroneous or extra information, so it can all be considered important.
Every example in the PostGIS docs is written so that you do not need any data in your database to run it, meaning that the examples use WKT strings to show the functionality and cast them to a GEOMETRY, as we have done in previous examples using ST_GeomFromText or ST_GeogFromText. This means that you can copy/paste these examples into pgAdmin and run them right from the docs.
Let's look at the first example, whose note in the docs reads:
Return length in feet for line string. Note this is in feet because EPSG:2249 is Massachusetts State Plane Feet
We can run this example, or the GEOGRAPHY-based examples further down the page, and get the same responses we see in the docs. We will refer to the docs multiple times in this book, so it is good practice to start using them and become fluent with them.
Another great feature of the PostGIS documentation is the Workshops. These are hands-on, guided tutorials that you can work through to go deeper into specific topics. You can find them at postgis.net/workshops88, and they often add extra detail if you are looking for specific workflows.
In this section we are going to dive into working with the GEOMETRY itself. These functions allow you to do everything from creating geometries or geographies to manipulating and changing them, and more. While they are a component of spatial analysis, they are also very important to the data engineering process. We will use them sporadically, such as when finding the nearest point on a line to another point, but overall these are functions you can generally keep in your back pocket to use as needed for analysis purposes.
This section also groups the functions by their overall functional purpose as described in the PostGIS
documentation here89 . I will also add an opinionated view of the functions that I believe are the most
important to know based on my usage of them over the years. This may very well be different for you
depending on your use case so please keep that in mind. The function names and descriptions are from
88 https://postgis.net/workshops/postgis-intro/
the PostGIS documentation directly so as not to add any extra flavor to the descriptions. With that let’s
dive in.
2.6 Constructors
My top functions
• ST_Collect — Creates a GeometryCollection or Multi* geometry from a set of geometries.
• ST_MakeEnvelope — Creates a rectangular Polygon from minimum and maximum coordinates.
• ST_MakePoint — Creates a 2D, 3DZ or 4D Point.
• ST_MakeLine — Creates a LineString from Point, MultiPoint, or LineString geometries.
The rest
• Line creation functions
– ST_LineFromMultiPoint — Creates a LineString from a MultiPoint geometry.
• Point creation functions
– ST_MakePointM — Creates a Point from X, Y and M values.
– ST_Point — Creates a Point with X, Y and SRID values.
– ST_PointZ — Creates a Point with X, Y, Z and SRID values.
– ST_PointM — Creates a Point with X, Y, M and SRID values.
– ST_PointZM — Creates a Point with X, Y, Z, M and SRID values.
• Polygon creation functions
– ST_MakePolygon — Creates a Polygon from a shell and optional list of holes.
– ST_Polygon — Creates a Polygon from a LineString with a specified SRID.
• ST_TileEnvelope — Creates a rectangular Polygon in Web Mercator (SRID:3857) using the XYZ
tile system.
• ST_HexagonGrid — Returns a set of hexagons and cell indices that completely cover the bounds
of the geometry argument.
• ST_Hexagon — Returns a single hexagon, using the provided edge size and cell coordinate
within the hexagon grid space.
• ST_SquareGrid — Returns a set of grid squares and cell indices that completely cover the bounds
of the geometry argument.
• ST_Square — Returns a single square, using the provided edge size and cell coordinate within
the square grid space.
• ST_Letters — Returns the input letters rendered as geometry with a default start position at the
origin and default text height of 100.
There are three functions I think you definitely need to know out of this list.
ST_Collect
This function allows you to take several geometries and turn them into a geometry collection. This can
be useful for aggregations and other use cases. Let’s test this out using this query:
89 https://postgis.net/docs/reference.html
with a as (
  select
    geom
  from
    nyc_zips
  limit
    5
)
select
  st_collect(geom)
from
  a
This function will not union your geometries into a single geometry; it just turns them into a collection of one or many geometries.
ST_MakeEnvelope
This creates a geometry from two points: an X min/Y min pair and an X max/Y max pair. I use a web-based bounding box tool to generate the values that go into the function; it also has a search capability integrated with OpenStreetMap, so you can search for anything in OSM90.
In the image below I searched for San Juan, Puerto Rico and selected the CSV option from the dropdown to create this query (Figure 2.7, on the following page):
You can also add in the projection EPSG code to see it on the map, in this case that is 4326 (Figure 2.8,
on page 173).
90 https://boundingbox.klokantech.com/
  4326
) as geom
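Putting it together, the full call looks roughly like this (the coordinates below are an approximate bounding box for San Juan, not the exact values from the screenshot):

select
  st_makeenvelope(
    -66.13,
    18.38,
    -66.00,
    18.48,
    4326
  ) as geom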
ST_MakePoint
This function allows you to create points from data in your table, such as latitude and longitude columns. I have used this many times, as we will for data that does not have a geometry in it, such as our NYC Taxi data. Let's try it with a few rows now.
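A first pass, creating the points without an SRID:

select
  st_makepoint(pickup_longitude, pickup_latitude) as geom
from
  nyc_yellow_taxi_0601_0615_2016
limit
  100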
Now if you try to see this data on the map you will notice that it does not render, since you have to manually set the SRID, which we can do with this query (Figure 2.9, on page 174):
select
  st_setsrid(
    st_makepoint(pickup_longitude, pickup_latitude),
    4326
  ) as geom
from
  nyc_yellow_taxi_0601_0615_2016
limit
  100
Now this is something we will be doing several more times during the course of this book. Is there a
way we can make this more efficient? If you guessed user defined function then you would be correct!
Try to write a UDF called BuildPoint to do so. If you need some hints:
• You will need three arguments:
– X coordinate
– Y coordinate
– SRID integer
Okay so here is how to build our UDF to save us some typing later on:
-- x and y are the coordinates, srid is the spatial reference ID
create or replace function buildpoint(x float, y float, srid integer)
returns geometry
language plpgsql
as $$
begin
  return st_setsrid(st_makepoint(x, y), srid);
end;
$$;
Now just for fun you may have noticed a function called ST_Letters. Go ahead and give this a try:
SELECT
  st_setsrid(
    ST_Translate(ST_Scale(ST_Letters('Spatial SQL'), 1, 1), 0, 0),
    4326
  );
I have personally never used this function until now but hey, why not! See the results in Figure 2.10.
2.7 Accessors
There are many different accessor functions, so I am only going to list my top functions here, but you
can see the full list using the link in the footnote91 . There are only a handful that I have found helpful
to keep in memory on a regular basis, but keep in mind that if there is something you want to do there
is almost always a function for it in PostGIS.
My top functions
• ST_Dump — Returns a set of geometry_dump rows for the components of a geometry.
• ST_GeometryType — Returns the SQL-MM type of a geometry as text.
• ST_MemSize — Returns the amount of memory space a geometry takes.
• ST_NPoints — Returns the number of points (vertices) in a geometry.
• ST_PointN — Returns the Nth point in the first LineString or circular LineString in a geometry.
• ST_X — Returns the X coordinate of a Point.
91 https://postgis.net/docs/reference.html#Geometry_Accessors
ST_Dump
The first function is call ST_Dump which basically dumps out the individual geometries from a com-
pound geometry like a MULTIPOLYGON or GEOMETRYCOLLECTION as rows called a geometry_dump
(more on this later). As we can see in our NYC Neighborhoods data the City Island neighborhood is
made up of several shapes or islands. We can use this to see what the ST_Dump will do (Figure 2.11).
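A sketch of that query:

select
  st_dump(geom)
from
  nyc_neighborhoods
where
  neighborhood = 'City Island'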
As you can see this returns a column with the data type geometry_dump which is basically what is
known as a composite, or combined, data type. From the PostGIS docs:
path[] - an integer array that defines the navigation path within the dumped geometry to the geom
component. The path array is 1-based (i.e. path[1] is the first element.)
So how can we access the actual geometries and get that data back? We basically need to treat it as we would JSON. We can access the geometry by adding parentheses around the function and .geom onto the end of the parentheses, like so (Figure 2.12):
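For example:

select
  (st_dump(geom)).geom
from
  nyc_neighborhoods
where
  neighborhood = 'City Island'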
From here you can join this back to any original data and access these geometries as individual geome-
tries.
ST_GeometryType
A simple function that tells you the geometry type of your data:
select
  st_geometrytype(geom)
from
  nyc_neighborhoods
where
  neighborhood = 'City Island'
st_geometrytype
ST_MultiPolygon
ST_MemSize
We used this function earlier; it tells you the disk space needed to store your geometry data:
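Running it against the same City Island geometry used above:

select
  st_memsize(geom)
from
  nyc_neighborhoods
where
  neighborhood = 'City Island'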
st_memsize
21088
ST_NPoints
Find the number of points in a geometry (we have also used this before):
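Again for City Island:

select
  st_npoints(geom)
from
  nyc_neighborhoods
where
  neighborhood = 'City Island'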
st_npoints
1314
ST_PointN
Function to find a specific point at a specific position in a LINESTRING geometry. We can try this with
our NYC Bike Routes:
select
  st_pointn(geom, 1)
from
  nyc_bike_routes
where
  segmentid = '331385'
Well, this returned null. So what happened? Here is a debugging challenge for you to figure out: take a look at the docs and try to see what the issue might be. This is a common one, so no hints here.
Okay so if you took a look at the docs you can see that the function signature is:
geometry ST_PointN(geometry a_linestring, integer n);
It very clearly calls for a LINESTRING and an integer representing the point position. We know that 1
is in fact an integer, so the issue must be with the geometry data. Let’s take a look and see by checking
our geometry type:
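For example:

select
  st_geometrytype(geom)
from
  nyc_bike_routes
where
  segmentid = '331385'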
geom
ST_MultiLineString
Aha! We have a multi-linestring, so it was an issue with the geometry. So how can we make this one single linestring? We know we can dump the parts out of the linestring and then select them individually, but how can we turn them into one? If you checked the docs, or better yet Googled what you could do, then you may have come across the function ST_LineMerge, which has this signature and description:
geometry ST_LineMerge(geometry amultilinestring);
Returns a LineString or MultiLineString formed by joining together the line elements of a MultiLineString.
Lines are joined at their endpoints at 2-way intersections. Lines are not joined across intersections of 3-way or
greater degree.
So this looks like it would work for us. Our final function (Figure 2.13, on the following page):
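A sketch of that final query, merging the multi-linestring before grabbing the first point (assuming the parts join into a single LINESTRING):

select
  st_pointn(st_linemerge(geom), 1)
from
  nyc_bike_routes
where
  segmentid = '331385'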
Why go through all this trouble? If you ever need to get the starting/ending points of all your line
data, now you know how!
2.8 Editors
This set of functions allows you to edit and modify your geometries on a very granular level like
adding or removing points, snapping, forcing 2/3D geometries, and more. I won't cover these in detail here since there isn't a ton of analytical value in them, but it is good to know of their existence. For those who are
maintaining geometries only in SQL this is a core set of functions to understand. You can see the full
set of functions in the PostGIS reference92 .
2.9 Validators
https://postgis.net/docs/reference.html#Geometry_Validation
Out of all the different spatial formats that are available there are bound to be some issues with our
geometries. The geometry validators provide the best place to start to check if you have invalid geome-
tries. Let’s see if we have any in our data!
ST_IsValid
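One way to find the rows that fail validation in the building footprints table:

select
  mpluto_bbl
from
  nyc_building_footprints
where
  st_isvalid(geom) is false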
mpluto_bbl
1022430261
1016710039
4039760001
Instead of looking for all the invalid geometries each time we can just query these three IDs.
ST_IsValidDetail
So let’s see the valid detail from our three geometries above:
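A sketch of that query, dumping each multi-polygon into its parts so we can see which part fails:

select
  mpluto_bbl,
  st_isvaliddetail((st_dump(geom)).geom)
from
  nyc_building_footprints
where
  mpluto_bbl in ('1022430261', '1016710039', '4039760001')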
mpluto_bbl | st_isvaliddetail
4039760001 | (t)
1016710039 | (f, "Ring Self-intersection", 01010000007F4EC1DA6E7C52C0DCF0906B9F644440)
4039760001 | (f, "Ring Self-intersection", 01010000009C8C8F29BB7552C0323B56A12C654440)
1022430261 | (t)
4039760001 | (t)
1022430261 | (f, "Ring Self-intersection", 0101000000D9494DB2957A52C01C07C27A6F6F4440)
1016710039 | (t)
What this tells us is that these are MULTIPOLYGON geometries and that one part of each has a self-intersection error.
ST_IsValidReason
Better yet we can use this function to get a more readable detail:
select
  mpluto_bbl,
  st_isvalidreason(geom)
from
  nyc_building_footprints
where
  mpluto_bbl in ('1022430261', '1016710039', '4039760001')
mpluto_bbl | st_isvalidreason
4039760001 | Valid Geometry
1016710039 | Valid Geometry
1016710039 | Ring Self-intersection[-73.9442660224686 40.7861151178092]
4039760001 | Ring Self-intersection[-73.8395484830712 40.7904245062877]
1022430261 | Valid Geometry
4039760001 | Valid Geometry
1022430261 | Ring Self-intersection[-73.9153867487688 40.8705895850564]
ST_MakeValid
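This function attempts to build a valid representation of an invalid geometry without losing any of the input vertices. A minimal sketch, assuming we want to repair the invalid footprints found above in place:

update
  nyc_building_footprints
set
  geom = st_makevalid(geom)
where
  st_isvalid(geom) is false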
Spatial Reference
https://postgis.net/docs/reference.html#SRS_Functions
Of course no geospatial system would be complete without projection support, and PostGIS has three simple functions to manage your projections.
• ST_SetSRID — Set the SRID on a geometry.
• ST_SRID — Returns the spatial reference identifier for a geometry.
• ST_Transform — Return a new geometry with coordinates transformed to a different spatial reference system.
ST_SetSRID
If you have a geometry without an SRID you can add one using this function, which we already used in an earlier example. Keep in mind that some other functions may let you set the SRID during the creation of a geometry.
ST_SRID
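For example, checking a few rows (the neighborhoods table is assumed here):

select
  st_srid(geom)
from
  nyc_neighborhoods
limit
  3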
st_srid
4326
4326
4326
ST_Transform
And finally this function is the one you might use the most if you need to transform your SRID:
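For example, a sketch transforming the neighborhoods from 4326 to 2263 (New York Long Island State Plane):

select
  st_transform(geom, 2263)
from
  nyc_neighborhoods
limit
  5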
Creators
Sometimes you need to create, or export, geometries from other types of data that you have imported into your database. For that we have the creator functions, which can turn other data into a geometry and vice-versa. We don't need to look at each one of these in detail, but we will see one example from each group. Overall the WKT and WKB formats are the most common, but there are plenty of others you can use such as GeoJSON, KML, and more.
2.10 Inputs
This set of functions takes an argument and then returns a geometry or geography:
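For example, building a point from WKT (the coordinates are an approximate location, used only for illustration):

select
  st_geomfromtext('POINT(-73.9772 40.7527)', 4326)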
This will return a point at Grand Central Station in New York City.
2.11 Outputs
We have seen the ST_AsText function earlier but you can also output geometries as different formats:
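For example, GeoJSON output (the building footprints are assumed as the source here):

select
  st_asgeojson(geom)
from
  nyc_building_footprints
limit
  3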
st_asgeojson
{"type":"MultiPolygon", "coordinates":[[[[-73.925006409, 40.623351698]...
{"type":"MultiPolygon", "coordinates":[[[[-73.753816555, 40.674630114]...
{"type":"MultiPolygon", "coordinates":[[[[-73.721306282, 40.734020454]...
• ST_AsEWKT — Return the Well-Known Text (WKT) representation of the geometry with SRID
meta data.
• ST_AsText — Return the Well-Known Text (WKT) representation of the geometry/geography
without SRID metadata.
• ST_AsBinary — Return the OGC/ISO Well-Known Binary (WKB) representation of the geome-
try/geography without SRID meta data.
• ST_AsEWKB — Return the Extended Well-Known Binary (EWKB) representation of the geome-
try with SRID meta data.
Other Formats
• ST_AsEncodedPolyline — Returns an Encoded Polyline from a LineString geometry.
This set of functions allows you to turn your geometry or geometries into a new geometry. This
includes many common operations like buffers, centroids, concave/convex hulls, simplification and
more.
My top functions
• ST_Buffer — Computes a geometry covering all points within a given distance from a geometry.
• ST_Centroid — Returns the geometric center of a geometry.
• ST_ChaikinSmoothing — Returns a smoothed version of a geometry, using the Chaikin algorithm.
• ST_ConcaveHull — Computes a possibly concave geometry that encloses all input geometry
vertices
• ST_ConvexHull — Computes the convex hull of a geometry.
• ST_DelaunayTriangles — Returns the Delaunay triangulation of the vertices of a geometry.
• ST_GeneratePoints — Generates random points contained in a Polygon or MultiPolygon.
• ST_LineMerge — Return the lines formed by sewing together a MultiLineString.
• ST_Simplify — Returns a simplified version of a geometry, using the Douglas-Peucker algorithm.
• ST_SimplifyPreserveTopology — Returns a simplified and valid version of a geometry, using
the Douglas-Peucker algorithm.
Others
• ST_BuildArea — Creates a polygonal geometry formed by the linework of a geometry.
• ST_FilterByM — Removes vertices based on their M value
• ST_GeometricMedian — Returns the geometric median of a MultiPoint.
• ST_MaximumInscribedCircle — Computes the largest circle contained within a geometry.
• ST_MinimumBoundingCircle — Returns the smallest circle polygon that contains a geometry.
• ST_MinimumBoundingRadius — Returns the center point and radius of the smallest circle that
contains a geometry.
• ST_OrientedEnvelope — Returns a minimum-area rectangle containing a geometry.
• ST_OffsetCurve — Returns an offset line at a given distance and side from an input line.
• ST_PointOnSurface — Computes a point guaranteed to lie in a polygon, or on a geometry.
• ST_Polygonize — Computes a collection of polygons formed from the linework of a set of geometries.
ST_Buffer
This function will perform a very popular analysis: creating a buffer around a geometry. A few important notes:
• There are many stylistic options, such as different end-caps (round, mitre, etc.), and you can choose to have the buffer on one side of the polygon or line
• For geometries the measurement uses the unit of measurement of the projection (keep this in
mind as we proceed)
• You can also control the number of points per quarter circle
For our example we will create some points with the user defined function, then turn those into geographies and create a ½ kilometer buffer (Figure 2.14, on the next page):
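A sketch of that query, using the BuildPoint UDF and a 500 meter buffer:

select
  st_buffer(
    buildpoint(pickup_longitude, pickup_latitude, 4326) :: geography,
    500
  ) :: geometry as geom
from
  nyc_yellow_taxi_0601_0615_2016
limit
  10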
For polygons you can also create buffers, as well as negative buffers (Figure 2.15, on page 188):
ST_Centroid
This creates a centroid at the geographic center of mass (Figure 2.16, on page 188):
select
  st_centroid(geom)
from
  nyc_zips
limit
  10
ST_ChaikinSmoothing
Here we can check each of the different results of the Chaikin algorithm, which doubles the number of
vertices to create a smoothed polygon. Below are the original polygons (Figure 2.17, on page 189):
Then with the base number of iterations, 1 (Figure 2.18, on page 189):
Then with the maximum number of iterations, 5 (Figure 2.19, on page 190):
ST_ConcaveHull
You can also create concave hull polygons around geometries. You need to add a float as a second
argument to define the amount of "concave-ness". 0 is the most concave and 1 will produce a convex
hull so something around 0.3 or 0.5 will likely work for most use cases. To demonstrate this we will
grab 10 points from our NYC 311 dataset and create a geometry collection from them to build a single
concave hull (Figure 2.20, on the next page):
ST_ConvexHull
And you can do the same with a convex hull (Figure 2.21, on the facing page):
ST_DelaunayTriangles
This function allows you to triangulate polygons using the Delaunay triangulation method95. The result is a GEOMETRYCOLLECTION of multiple polygons (using the 0 flag), which can be used as needed.
There is a great use case for this which we will review in a later chapter but for now here is the query:
You can include a parameter for tolerance (float) in the second argument and type flag in the third
argument but if you just want polygons you don’t need the extra arguments (Figure 2.22, on page 194).
95 https://en.wikipedia.org/wiki/Delaunay_triangulation
ST_GeneratePoints
This function allows you to generate a random set of points within a polygon. The second argument is
the number of points you want to generate (Figure 2.23, on page 195):
ST_LineMerge
This is a quick and easy way to merge several linestrings together, which is perfect for our NYC Bike
Routes which are all individual linestrings. Let’s join one path into a complete line using this query:
ST_Simplify
This is the first of two functions to simplify geometries using the Douglas-Peucker algorithm. This one
focuses on just simplification with two arguments, one for the geometry and one for the tolerance of
simplification. We will run the same query 4 times with different tolerance levels to see the changes to
the geometry on one map, and union them using the UNION operator since they have one column with the same name (Figure 2.24, on page 196).
ST_SimplifyPreserveTopology
This is the same function as above, but in this case it preserves the topology, or the touching parts of the geometry. We can test this out using a new operator we will see in the next chapter on spatial relationships, called ST_Touches. We will get all the zip codes that touch 11434 (where JFK Airport is located), plus 11434 itself (Figure 2.25, on page 198).
union
select
  st_transform(st_simplifypreservetopology(geom, 50), 4326) as geom
from
  nyc_zips
where
  zipcode = '11434'
  or st_touches(
    geom,
    (
      select
        geom
      from
        nyc_zips
      where
        zipcode = '11434'
    )
  )
2.12 Measurements in Spatial SQL

There are many different ways to measure a geometry, and PostGIS comes with a complete set of functions to measure all types of geometries.
My top functions
ST_Area
This allows us to find the area of a polygon. It returns the area in the units of measure of the SRID for geometries, and in square meters for geographies:
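The output below compares the two for the same five rows. A sketch of the kind of query behind it, assuming a table of polygons stored in EPSG:4326 (the building footprints are used as a stand-in here):

select
  st_area(geom),
  st_area(geom :: geography)
from
  nyc_building_footprints
limit
  5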
st_area (geometry) | st_area (geography)
1.4428577997126009e-09 | 13.556961725698784
7.891702476768148e-09 | 74.09368502616417
6.695094119757017e-09 | 62.80374527350068
1.2662729366160693e-08 | 119.01674082968384
2.4262124438870062e-09 | 22.811112018302083
ST_ClosestPoint
This function takes two geometries and finds the closest point from the first geometry to the second
geometry:
ST_Distance
Similarly, we can find the distance between two geometries, once again in the spatial reference units for geometries and in meters for geographies.
dist
3855.56160052085
Since this returns a number, in this case the distance in feet, we can easily turn it into a different unit of measure using simple math, in this case miles:
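That is just a division by 5,280 feet per mile, applied here to the value returned above:

select
  3855.56160052085 / 5280 as dist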
dist
0.7302200000986458
So almost three-quarters of a mile. We can also compare the difference between geometry and geography here too:
Whoops. That seems odd. Here is the response that we got back from the query we just ran:
ERROR: Only lon/lat coordinate systems are supported in geography.
SQL state: 22023
So it appears there is some issue with our geometry? Enter another coding challenge. See if you can
figure out how to solve this problem using some functions we just explored in this chapter.
So if we read our error closely it appears that there may be an issue with our coordinate system. Let's first check which projection our data is currently using:
st_srid
2263
So it looks like our SRID is 2263, which is the North American Datum 1983 (NAD 83) projection for New York Long Island96, and in this case the unit of measurement is feet. So we need a lon/lat coordinate system
such as WGS 84 or 4326. Once we perform the transformation we can cast this to a GEOGRAPHY and we
can get our measurement in meters.
dist
0.7303799424922313
And with that we can see that we have a very similar, but not totally exact answer.
ST_Length
To measure the length of a line we can use ST_Length. In this case we can cast the geometry, in SRID 4326, to a geography to get a measurement in meters.
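For example, for a single bike route segment (the segment id from earlier is reused here purely as an example):

select
  st_length(geom :: geography) as st_length
from
  nyc_bike_routes
where
  segmentid = '331385'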
st_length
40.163645298747284
ST_Perimeter
96 https://epsg.io/2263
st_perimeter
16247.124894682644
ST_ShortestLine
This function returns a line that represents the shortest line between two geometries. We can reuse our
query and then transform the return value to SRID 4326 to make sure it renders on the map.
3. Spatial Relationships
Now that we have the fundamental spatial SQL elements in our toolkit, we can move into some of the more advanced topics around spatial relationships. These include spatial joins and aggregations, overlay functions that can do things like spatial unions and clipping, clustering, and some special spatial operators you can use in your spatial SQL.
Spatial relationship functions97 provide capabilities to understand how different geometries interact in space, and they include a widely used function that is likely my most used spatial SQL function, ST_Intersects. Some characteristics of these functions are that:
• Almost all of these functions return a boolean, so either a true or false condition
• Most functions take two geometries that will be evaluated
– These can come from a single table, two tables, a table and a static geometry, etc.
• You can use these functions in a few different areas in your query such as
– As a new column in your data
– As a condition in a WHERE clause
– As a condition for a table join (more on this later)
Below is a list of all the different spatial relationship functions. We will review the top list as well as the relationships between the functions, since many have very similar functionality with slight differences, so there are some that overlap in functionality (no pun intended):
My top functions
• ST_Contains — Tests if no points of B lie in the exterior of A, and A and B have at least one
interior point in common.
• ST_Disjoint — Tests if two geometries are disjoint (they have no point in common).
• ST_Intersects — Tests if two geometries intersect (they have at least one point in common).
• ST_Overlaps — Tests if two geometries intersect and have the same dimension, but are not completely contained by each other.
• ST_Touches — Tests if two geometries have at least one point in common, but their interiors do
not intersect.
• ST_Within — Tests if no points of A lie in the exterior of B, and A and B have at least one interior
point in common
Other functions
• ST_3DIntersects — Tests if two geometries spatially intersect in 3D - only for points, linestrings,
polygons, polyhedral surface (area).
• ST_ContainsProperly — Tests if B intersects the interior of A but not the boundary or exterior.
• ST_CoveredBy — Tests if no point in A is outside B
• ST_Covers — Tests if no point in B is outside A
97 https://postgis.net/docs/reference.html#Spatial_Relationships
• ST_Crosses — Tests if two geometries have some, but not all, interior points in common.
• ST_Equals — Tests if two geometries include the same set of points.
• ST_LineCrossingDirection — Returns a number indicating the crossing behavior of two LineStrings.
• ST_OrderingEquals — Tests if two geometries represent the same geometry and have points in
the same directional order.
• ST_Relate — Tests if two geometries have a topological relationship matching an Intersection
Matrix pattern, or computes their Intersection Matrix
• ST_RelateMatch — Tests if a DE-9IM Intersection Matrix matches an Intersection Matrix pattern.
3.2 Ways to Use Spatial Relationship Functions

As I mentioned before, there are several ways to use spatial relationship functions, all of which produce different yet equally useful results. We will use ST_Intersects as the function here, as it covers the most common spatial relationship question, which is whether two polygons touch or overlap in some way. Below are the most common ways that I have used these functions, but as with all things in spatial SQL, you have the full control of SQL to build and write queries as you want.
As a column
The first way of using spatial relationship functions is using the returned result of the spatial relation-
ship, or the boolean value, as a new column in your query. For this we can use our NYC Neighborhoods
and our NYC Taxis data from the previous chapters:
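A sketch of that query, pulling the West Village polygon out with a subquery and building the pickup points with our BuildPoint UDF:

select
  ogc_fid,
  trip_distance,
  total_amount,
  st_intersects(
    buildpoint(pickup_longitude, pickup_latitude, 4326),
    (
      select
        geom
      from
        nyc_neighborhoods
      where
        neighborhood = 'West Village'
    )
  )
from
  nyc_yellow_taxi_0601_0615_2016
order by
  pickup_datetime asc
limit
  100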
Now this query is a bit busy because we are using a subquery to extract one specific geometry from
the Neighborhoods table, and we have to turn our pickup locations to geometries, but this returns data
that looks like this:
You can use this to add a new column to your dataset and then update that column with the results returned from the function, if you want to store that data.
In a WHERE clause
Let’s say we want to filter results rather than add them as a column:
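A sketch of that filter:

select
  ogc_fid,
  trip_distance,
  total_amount
from
  nyc_yellow_taxi_0601_0615_2016
where
  st_intersects(
    buildpoint(pickup_longitude, pickup_latitude, 4326),
    (
      select
        geom
      from
        nyc_neighborhoods
      where
        neighborhood = 'West Village'
    )
  )
limit
  10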
We can find only the trips that start in the West Village and limit that to 10 results.
Hold on a minute. If you take a look at our query you can see that, where we would normally have some sort of equality evaluation after our WHERE operator, we simply have the function and nothing more. Take a minute to think about why this works; that is our next challenge.
Did you figure it out? If you recall from our earlier chapters our WHERE condition allows us to filter our
rows based on a condition of equality, or in short true or false (it can also evaluate to unknown but for
now let’s focus on true and false). So for a normal WHERE conditional we might have some condition
like:
neighborhood = 'West Village'
And if the row matches the condition, neighborhood = 'West Village', it will evaluate to true or
false. If you read the above section about spatial relationship functions closely you can see that the
return value of ST_Intersects is in fact a boolean, or true/false, in which case the return value of the
function is enough for us to evaluate the equality of our spatial relationship.
Before we proceed you may have noticed that the last query took some time to run. That is because we
are still creating our geometries on the fly using our BuildPoint user defined function. A good rule of
thumb is that a GEOMETRY or GEOGRAPHY stored on the database will be far more efficient to query than
those generated on the fly. This is because when we create it on the fly it takes time to run that function
for every point being evaluated. In the case of our first query we return the first 100 rows no matter
what they are but in our second we may have to query through 500, 1,000, or 10,000 rows to reach the
100 desired results that match our condition. If you recall from our section on CRUD tasks, to fix this
we have to:
• Add new columns for our geometry
• UPDATE our new columns with the geometry values created from the latitude and longitudes
Since we have a set of lat/longs for pickup and drop-off we can create two different geometries. This
is a nice advantage of PostGIS where we can have multiple geometries in the same table.
First we can add the columns to our table:
alter table
  nyc_yellow_taxi_0601_0615_2016
add
  column pickup geometry,
add
  column dropoff geometry
Then UPDATE the table with the new values. This process took 6 minutes and 26 seconds on my computer
so don’t be surprised if this takes a long time.
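A sketch of that update; the pickup coordinate columns appear earlier in the book, and the drop-off column names are assumed to follow the same pattern:

update
  nyc_yellow_taxi_0601_0615_2016
set
  pickup = st_setsrid(
    st_makepoint(pickup_longitude, pickup_latitude),
    4326
  ),
  -- drop-off coordinate column names assumed to mirror the pickup columns
  dropoff = st_setsrid(
    st_makepoint(dropoff_longitude, dropoff_latitude),
    4326
  )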
To test our new columns out let’s see if our last query using ST_Intersects runs any faster this time
around:
On my computer, using the GEOMETRY stored as a column, my query finished in 14 seconds compared to 2 minutes and 7 seconds when creating the geometries on the fly. That is roughly a 9x improvement in performance, so it is definitely worth the effort to create the geometries.
Within a JOIN
Within a JOIN we can perform our previous queries with some cleaner code, as well as perform spatial
aggregations with joins. First let’s rewrite our first query but this time as a join:
select
  a.ogc_fid,
  a.trip_distance,
  a.total_amount,
  st_intersects(a.pickup, b.geom)
from
  nyc_yellow_taxi_0601_0615_2016 a,
  nyc_neighborhoods b
where
  b.neighborhood = 'West Village'
order by
  a.pickup_datetime asc
limit
  100
Since we want to check all the values, we can perform a cross join, as we have no join condition apart from the WHERE clause and our Neighborhoods table contains only one row after the filter.
We can do the same with our second query to move the intersection parameter from the WHERE operator
into the JOIN:
Once again this results in a cleaner query. You can also use this same structure to add columns from
your second data source as well now that we have that table joined to our other data:
Finally, we can use a spatial join to perform aggregate joins. Now I will demonstrate this with the
total and average trip values for a few neighborhoods in New York, but you can use any aggregation
method that we have covered in previous chapters:
In a CTE
Another way to use a spatial relationship function is in the context of a CTE that can then be joined to
other data. This may make the query a bit cleaner and potentially help with performance depending
on the scenario. In particular, in cloud data warehouses it is likely that you will not be able to GROUP BY the geometry to visualize your data, and in general this is a frowned-upon practice, since even if the polygons match, a single point stored in a different order means they will not match in the GROUP BY clause. Let's modify our query to join it back to the geometry data so we can map it:
This allows us to join the aggregate back to the original data and view the results for the neighborhood
tabulation areas.
This is one of my favorite uses for a spatial relationship function, and we will cover it in more detail later in the book, but for now you can see a quick preview of how to use it to analyze, row by row, the values of the other table. In this case we are using the same table. Note that we don't need to group by, since the SQL in the CROSS JOIN LATERAL is run once for each row, so keep that in mind.
Two things should stand out to you here. First is that you can perform these lateral queries very effectively with this method and also move between the main table and the joined table to grab different data components from each. Second is that this is a very fast operation and a highly efficient approach for this type of query.
3.3 Spatial Relationship Functions

Next, we can take a look at the various spatial relationship functions that are available. Each one does a slightly different task, but all serve the purpose of analyzing the relationships between two or more geometries. We will review each of these in detail, but they can also be explained visually, which the PostGIS Documentation Exercises section does a great job of. You can find it at the link in the footnotes98, but below is the example for ST_Intersects (Figure 3.1, on the next page):
ST_Contains
This function will return true if all vertices overlap or fall within the other geometry. If any point falls
outside, it will return false. We will showcase this with some sample data from our NYC Buildings
dataset and see how many buildings are contained by a 200-meter buffer of Madison Square Garden:
98 https://loc8.cc/sql/postgis-spatial-relationships
select
  name,
  geom
from
  nyc_building_footprints
where
  st_contains(
    (
      select
        st_buffer(
          buildpoint(-73.993584, 40.750580, 4326) :: geography,
          200
        ) :: geometry
    ),
    geom
  )
You may notice the part of the query that creates the point and buffer around the centroid of Madison Square Garden.
Since our source data is in a 4326 projection, we create our point in that projection; however, the ST_Buffer function asks for the buffer distance to be in the units of the source projection, which in this case is degrees. So to use meters we cast this to a geography, but since our ST_Contains function, and most
of the spatial relationship functions, accept only geometries as their arguments, we then need to turn
it back into a geometry. A lot of work indeed, but we will see a different way of doing this later in the
chapter.
We also want to wrap it in a subquery since this will generate the geometry one time rather than every
single time the function needs to evaluate (Figure 3.2, on the following page).
Since Madison Square Garden is in the middle of the Midtown neighborhood, which is home to many large buildings, let's use a different location with smaller buildings to test all the edge cases of the spatial relationship functions, in this case the Stonewall Inn National Monument (Figure 3.3, on the next page).
We can see that our query has returned 349 buildings that are totally contained by the buffer. It is also important to note the function signature:
Returns TRUE if geometry B is completely inside geometry A. A contains B if and only if no points of B lie in
the exterior of A, and at least one point of the interior of B lies in the interior of A.
Since we have our 200 meter buffer first, this makes sense. If we flipped this around we would have no rows returned, since the buffer is not completely contained by any single building.
Figure 3.3: Buildings within 200 meters of the Stonewall Inn National Monument
As we will see later, ST_Within is the inverse of this, meaning we would reverse the order of our geometries to lead with our buildings followed by our buffer. The names of the functions are helpful for remembering which is which:
• Does our buffer contain this building?
• Does this building fall within our buffer?
Keep this in mind as you use these functions as there are two with very similar properties!
ST_Crosses
This function will return true if one part of the geometry crosses the other, but does not simply overlap with it. The illustrations of ST_Crosses in the PostGIS documentation are the best visual representations of this function, so I will defer to those, which are referenced in the footnotes99.
We can test this out with our NYC Bike Routes and Buildings to see if there are any bike routes that
cross over or under a building in New York.
Here we can see that there are several bike paths that cross totally through buildings, mostly on Roosevelt Island (Figure 3.4, on the facing page).
ST_Disjoint
This function is the opposite of ST_Intersects: it returns true if the geometries do not intersect, and in this case the order of the geometries is irrelevant.
In this query let’s find the first 200 buildings that do not intersect our 200 meter buffer that are closest
to the Stonewall Inn using ST_Distance:
99 https://postgis.net/docs/ST_Crosses.html
order by
  st_distance(
    (
      select
        st_buffer(
          buildpoint(-74.002222, 40.733889, 4326) :: geography,
          200
        ) :: geometry
    ),
    geom
  ) asc
limit
  200
Here we used ST_Distance to calculate the distance between the buildings and the centroid of the
Stonewall Inn, then ordered that in ascending order from smallest to largest until we get our target of
200 buildings (Figure 3.5, on the next page).
ST_Intersects
This function will return all the buildings that intersect the buffer including ones that overlap the buffer.
An interesting side note about ST_Intersects is that it will return true even if your geometries do not
touch in the case that they fall within 0.00001 meters (or 0.00039 inches/0.01 millimeters). This only
really applies in a few scenarios but is good to know should your analysis fall into that category.
This is also the function you will likely use the most if you want to evaluate if two geometries overlap.
It has had a lot of work and resources put into it over the years that have increased the speed some 5
times100 . So it is a great function to use as you know it has great engineering power behind it. But it
is also important to understand how this function works, so you can make it work better for you and
your needs.
This post101 by Paul Ramsey does a good job of explaining this and how, depending on the size and
complexity of your geometries, you can use some techniques (some of which are now baked into Post-
GIS) such as subdividing the geometries to make them smaller, but baked into that explanation is a
subtle detail you may have missed about how bounding boxes are used within this function.
What this means is that if you have a spatial index on your dataset then the function will first run a
bounding box comparison to see if the bounding box touches the other geometry, then it will perform
the intersection. The bounding box step efficiently removes any non-matches and then will help the
join run faster. The next step is to evaluate if any part of the geometries you are comparing touch. This
effectively includes making a comparison across all the various vertices in your geometries, thus the
more vertices it needs to compare the longer it will take.
So what does this mean for you?
1. First, is that you should use spatial indexes on tables that will commonly have spatial relationship
analysis performed on them.
2. Next, for larger or complex geometries make sure to try and decrease the number of vertices in
your polygons or use less complex geometries if possible
100 https://blog.cleverelephant.ca/2020/12/waiting-postgis-31-1.html
101 https://blog.cleverelephant.ca/2019/11/subdivide.html
Figure 3.5: ST_Disjoint or an anti-intersection with the 200 buildings nearest to the Stonewall Inn National
Monument
3. Also, if you need, you can use the subdivide method in the blog post in the previous footnote
especially for large geometries. In PostGIS 3.1 there are native improvements that have boosted
this even further.
Apart from that, we will explore two other approaches: spatial indexes (different from a database spatial index, which is how we have used the term so far) such as H3 hexagons, as well as triangulated polygons, which is similar to the subdivide method.
ST_Overlaps
The next function returns TRUE when the two geometries overlap, meaning they share some, but not all, of their space, and neither is totally contained within or crossing the other. It also includes the concept of dimension, meaning that the two geometries should have the same dimension, such as Polygon to Polygon. We can validate this by using the following query:
Here even though the two geometries overlap, they do not share the same dimension. The next two
queries will return TRUE since they share the same dimension:
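The original queries aren't reproduced on this page, but a small, self-contained illustration of the dimension rule (the geometries below are made up purely for demonstration) looks like this:

-- FALSE: a polygon and a linestring have different dimensions
select st_overlaps(
    st_buffer(st_makepoint(0, 0), 1),
    st_makeline(st_makepoint(-2, 0), st_makepoint(2, 0))
);

-- TRUE: two partially overlapping polygons share the same dimension
select st_overlaps(
    st_buffer(st_makepoint(0, 0), 1),
    st_buffer(st_makepoint(1, 0), 1)
);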
ST_Touches
This function will return TRUE if the two geometries share one or more points in common but do not intersect in their interiors. Basically, they share a border but do not overlap. This means that the shared boundaries have to match exactly, and even a slight overlap will return false. We can check and see if this is the case with our NYC Zip Codes:
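The query isn't reproduced here; a minimal sketch of it, using the nyc_zips table and zipcode column that appear later in the book, would be:

select a.zipcode, a.geom
from nyc_zips a
where st_touches(
    a.geom,
    (
        select geom
        from nyc_zips
        where zipcode = '10009'
    )
)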
And we can see that the three zip codes that border 10009 are returned (Figure 3.6):
ST_Within
Finally closing out this section is ST_Within, which has two arguments and asks if geometry A is com-
pletely inside of geometry B. We can test this by modifying our Stonewall National Monument query
(Figure 3.7, on the next page):
3 geom
4 from
5 nyc_building_footprints
6 where
7 st_within(
8 geom,
9 (
10 select
11 st_buffer(
12 buildpoint(-74.002222, 40.733889, 4326) :: geography,
13 200
14 ) :: geometry
15 )
16 )
For distance relationships there are five total functions that you can use to analyze the relationship of features based on distance. The two we will focus on are ST_DFullyWithin and ST_DWithin.
These functions take three arguments: two geometries and a distance in the units of the geometries' coordinate system (meters when using geographies or a meter-based projection). Now in a classic GIS sense we think of this as a buffer analysis, and this is a very common point of confusion. Let's take a look at our most recent query, but using the ST_Intersects relationship:
On my computer this took 14.81 seconds and returned 224,392 rows. Let’s see how this performs using
the same query, but this time with ST_DWithin. Note that we will want to use a projection that measures
in meters. To do so we will use ST_Transform to turn the geometries into the 3857 projection, and we
can test this with our original 200 meter radius first:
Surprised? On my computer I stopped the query at 3 minutes since it was taking so long. Why is this
query, with a more efficient function and our original 200 meter radius, taking far longer?
Here is another coding challenge for you. We will cover one way to speed this up, but there is actually an error in this code that is causing the issue. Think back to your GIS basics and projections, and if you need a hint, take a look at the information about the 3857 projection.
So if you remember when we created our buildpoint() function we used this code:
So in theory this should work, but the devil is in the details. Our buildpoint() function expects longitude and latitude in degrees, which is what 4326 and most other WGS84 and NAD83 geographic coordinate systems use, but 3857 does not use degrees at all; its coordinates are in meters. So we need to first create our point in the 4326 projection, which uses degrees, and then transform that to 3857.
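A sketch of the corrected point creation, which builds the point in degrees first and then transforms it:

select st_transform(
    -- create the point in 4326 (degrees), then transform it to 3857 (meters)
    st_setsrid(st_makepoint(-74.002222, 40.733889), 4326),
    3857
)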
Now that is a lot of code to generate one point. I tested this query on its own, and it took 0.218 seconds to run. Running that a handful of times isn't too costly, but the more times you have to run it, the more expensive it becomes. Let's try it out and see how long the 200 meter query takes when we transform each point.
This took 28 seconds for the 200 meter radius, compared to under 2 seconds using our buffer and intersection query. To speed this up, let's add a new geometry column to our buildings dataset using the 3857 projection so we don't have to create one each time, first altering our table and then updating it.
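Those statements are not shown on this page; they would look something like this:

alter table nyc_building_footprints
add column geom_3857 geometry;

update nyc_building_footprints
set geom_3857 = st_transform(geom, 3857);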
Okay this went down to 12 seconds, still better but not perfect. Another approach that might work is
creating our point that represents the Stonewall National Monument in a subquery, and then running
a cross join since it is only joining to the one row:
This brings us down to 9 seconds, still too slow. Now there is a bit of a secret here that is important to note. When we originally brought this data in using our ogr2ogr method, the command added a very helpful database component known as a database index. This is, in simple terms, a data structure that lets the database organize the table on a specific column ahead of time so that it is more searchable.
Many spatial relationship functions take advantage of indexes, and they are quite easy to create. Let's add one for our new geom_3857 column:
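Based on the index name and method described just below, the statement looks like this:

create index geom_3857_idx
on nyc_building_footprints
using gist (geom_3857);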
This will create an index with the name geom_3857_idx on our buildings table using the GiST method, which stands for Generalized Search Tree102. Once you have run this you will get a message that it has been created. Now give the previous query a try again. Using the subquery method my time went down to 0.15 seconds, which is far faster than our 1 second time. Trying this again with a 10 kilometer radius:
This brings it down to just a hair under 5 seconds total, down from 14.81 seconds. Indexes are a major boost, but thinking through your queries logically and trying to reduce the number of operations your query needs to run are both good practices.
As one of the operations I have used most frequently over my time with spatial SQL, I think it is important to dedicate a bit of space to the spatial join. For the most part there are two common spatial joins:
• A true spatial join where one joins some data from one table to another based on a spatial rela-
tionship (ex. Adding the name of a county to a set of points) which may also be known as "Join
points by location"
• A "point in polygon" join, although this could be any type of geometry to polygon, where the
results are aggregated into a count, sum, or any other aggregation.
We reviewed some of these approaches at the beginning of the chapter but let’s take a look at the
102 https://www.postgresql.org/docs/9.1/textsearch-indexes.html
performance of these to compare. Both are important and there are a number of different ways that
you can achieve each. First let’s take a look at the spatial join, and then let’s review the aggregation
spatial join.
For these use cases we will use the same data for each: a tree census and our neighborhoods. Ignoring the fact that we could perform a more efficient string join here, this will give us a good idea of how you can perform a variety of joins. We will also limit our query to trees that include the name Maple.
First let's test out a query with a WHERE clause, which represents one of the most common mistakes when performing a spatial join.
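That listing isn't reproduced here; a sketch of the anti-pattern, where spc_common is a placeholder for whatever your species column is called, looks like this:

select a.ogc_fid, b.neighborhood
from nyc_2015_tree_census a, nyc_neighborhoods b
where st_intersects(a.geom, b.geom)
and a.spc_common ilike '%maple%'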
This took ~41 seconds on my computer, and the reason is that, as you can see, we are performing a cross join, comparing every row to every row. This is not the most efficient way to accomplish this, and we can take a look at what is happening in the database using the Explain Analyze tool within pgAdmin. You can do this by clicking the small graph button in pgAdmin next to the button with the "E" on it, with these settings in the dropdown (Figure 3.8).
That big red box (Figure 3.9, on the next page) indicates that there is one operation that is taking
particularly long which is the join we are performing. Now let’s try this again and move this into a
table join with the condition of the join being our spatial intersection:
Figure 3.9: Explain analyze showing a high cost nested loop inner join
This query came in at ~35 seconds on my computer, and we can check to see if there was a difference
in our query plan again using the same Explain Analyze feature (Figure 3.10):
So, no, not really.
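The next improvement (the listing is not shown here) is to subdivide the neighborhood polygons into smaller pieces with ST_Subdivide, which is how the nyc_neighborhoods_subdivide table used below is created. A sketch, with an illustrative vertex threshold:

create table nyc_neighborhoods_subdivide as
select
    neighborhood,
    st_subdivide(geom, 100) as geom  -- the threshold used in the book may differ
from nyc_neighborhoods;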
This improved our query time to 19 seconds! The other strategy you can use is to create an index on the columns you need to use in your query. Remember that when we imported our data using ogr2ogr, an index was created on the GEOMETRY column automatically. However, we can also create an index on our new subdivided geometries:
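That statement isn't shown here; it follows the same pattern as before (the index name is just illustrative):

create index nyc_neighborhoods_subdivide_geom_idx
on nyc_neighborhoods_subdivide
using gist (geom);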
From here we can also cluster our geometry data using the index, which physically reorders the table based on the index and can also improve query time.
This improved the query to 16 seconds. While you don’t need to optimize every table and every query,
using these tools strategically can help you make some significant gains in your query time.
Adding aggregations to your query only takes a few extra steps. We can run our same query as above, but this time return a count of maples per neighborhood and a percentage of all trees in the neighborhood (note that we cast the counts to numeric to make sure the division is accurate, since COUNT returns an integer and integer division would always give 0 or 1):
10 b.neighborhood
11 from
12 nyc_2015_tree_census a
13 join nyc_neighborhoods_subdivide b on st_intersects(a.geom, b.geom)
14 group by
15 b.neighborhood
Now, as you notice, we did not add a geometry here. While we could do so, we would then need to GROUP BY our geometry. There are two issues with that. One is that the GEOMETRY column is quite large, as we know, and leaving it out makes for a cleaner operation, since we can always quickly join the GEOMETRY to our results using the common neighborhood name column. The other is that if we group by a geometry, even though in this case we know we have one unique geometry per neighborhood, we could run into issues if there were more than one geometry per neighborhood, or if our geometries were slightly different, even just by having their points (or start/end points) out of order. Two geometries that we perceive to be the same are not necessarily the same to the database, and it will group them separately.
The next group of functions includes overlay functions, which allow us to analyze and create new geometries from our existing geometries. These functions perform common GIS tasks generally known as clipping, subdividing, unioning, and dissolving geometries. It's important to know that these functions do not take advantage of database indexes and will be quite a bit slower than the previous set of spatial relationship examples that we just took a look at.
ST_Difference
The ST_Difference function allows you to take two geometries and return the difference of geometry A minus geometry B. Order is important in this function: what is returned is the remainder of the first geometry (that is, with the overlap subtracted), and none of the remainder of geometry B is returned.
We can test this out by subtracting the geometry that represents the West Village from its accompanying
zip code, 10014 (Figure 3.11, on the next page):
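The listing isn't reproduced on this page; a sketch of it, assuming the zip codes need to be transformed to 4326 to match the neighborhoods as in other examples, looks like this:

select st_difference(
    (
        select st_transform(geom, 4326)
        from nyc_zips
        where zipcode = '10014'
    ),
    (
        select geom
        from nyc_neighborhoods
        where neighborhood = 'West Village'
    )
) as geom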
As you can see, the arrows point out some discrepancies between the geometries beyond just the docks
which were not included in the neighborhood file.
ST_Intersection
ST_Intersection returns just the area that intersects the two geometries. We can see this by running a
query to find the intersection between the Gramercy neighborhood and zip code 10003.
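A sketch of that query, under the same projection assumptions as above:

select st_intersection(
    (
        select geom
        from nyc_neighborhoods
        where neighborhood = 'Gramercy'
    ),
    (
        select st_transform(geom, 4326)
        from nyc_zips
        where zipcode = '10003'
    )
) as geom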
And you can use the following query to perform a table union to combine the three different geometries
together to see how they overlay on the map (Figure 3.12, on the following page):
ST_Split
This function does exactly what it sounds like, it splits a geometry using a linestring or multi-linestring.
We can do this by joining some of the various linestrings from our bike routes dataset using ST_Union
which we will learn more about shortly, then split the West Village Neighborhood using the bike path
that runs along Hudson St/8th Avenue. This path is split into two parts, the part that runs from W
Houston St to Bank St, and from Bank St to W 39th St.
24 (
25 select
26 geom
27 from
28 a
29 )
30 )
This will return a split geometry right along the bike path (Figure 3.13):
ST_Subdivide
We have seen this function already, as we used it to try and improve our spatial join times. As you may recall, it splits geometries such as lines and polygons into different parts based on a threshold of maximum vertices passed into the function, resulting in smaller geometries.
We can test this out with the same West Village neighborhood using 50 vertices (Figure 3.14, on the following page):
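A minimal sketch of that query:

select st_subdivide(geom, 50) as geom
from nyc_neighborhoods
where neighborhood = 'West Village'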
ST_Union, as we saw earlier, merges two or more geometries together to create a single geometry. The
variant ST_UnaryUnion applies this to a single, multi-part geometry, creating one geometry from the
input. We can test this with all neighborhoods that intersect the zip code 10001 (Figure 3.17, on the
next page):
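The listing isn't shown here; one way to write it, as a sketch, is to union every neighborhood that intersects the zip code:

select st_union(a.geom) as geom
from nyc_neighborhoods a
where st_intersects(
    a.geom,
    (
        select st_transform(geom, 4326)
        from nyc_zips
        where zipcode = '10001'
    )
)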
The next set of functions allow us to create spatial clusters from our data. These functions, which can
also be used for clustering non-spatial data, allow us to see how the data is clustered spatially. It is also
worth noting that these are some of the only spatial specific window functions that are available.
ST_ClusterDBSCAN
The DBSCAN method will build an appropriate number of clusters based on the data it is provided, so we do not know how many clusters we will end up with when our query is complete. We can test this out by making a view so we can see our data in QGIS and style it. We will create clusters with a minimum of 30 points that fall within 30 meters of each other (we will have to transform the projection again so we can measure the distance in meters).
First, we need to add a geometry using the latitude and longitude columns in our table:
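That statement isn't shown on this page; assuming the 311 table stores its coordinates in columns named longitude and latitude, it would look something like this:

alter table nyc_311
add column geom geometry(point, 4326);

update nyc_311
set geom = st_setsrid(st_makepoint(longitude, latitude), 4326);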
6 id,
7 ST_ClusterDBSCAN(st_transform(geom, 3857), 30, 30) over () AS cid,
8 geom
9 from
10 nyc_311
11 where
12
13 -- Find just the complaints with "noise" in the description
14 -- and that are in zip code 10009
15 st_intersects(
16 geom,
17 (
18 select
19 st_transform(geom, 4326)
20 from
21 nyc_zips
22 where
23 zipcode = '10009'
24 )
25 )
26 and complaint_type ilike '%noise%'
From here we can add our view and style it in QGIS, which results in this map, where I removed the points that were null, or not a part of a specific cluster (Figure 3.18).
There are 71 different compact clusters using these parameters. As you can see, most of the data is very dense, so depending on your data you may need to play with the parameters a bit to build appropriate clusters.
ST_ClusterKMeans
KMeans differs from DBSCAN in that you provide KMeans with a desired number of clusters and all
features are assigned to a cluster. Let’s use the same data and choose 7 clusters, once again creating a
view, so we can see and style the data in QGIS.
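The listing isn't reproduced here; a sketch of the view, reusing the same filter as the DBSCAN example (the view name is just illustrative):

create or replace view nyc_311_noise_kmeans as
select
    id,
    st_clusterkmeans(geom, 7) over () as cid,
    geom
from nyc_311
where st_intersects(
    geom,
    (
        select st_transform(geom, 4326)
        from nyc_zips
        where zipcode = '10009'
    )
)
and complaint_type ilike '%noise%';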
ST_ClusterWithin
The final clustering method is one that clusters points within a specific distance of separation. For this
we will test this at 25 meters:
20 )
21 and complaint_type ilike '%noise%'
22 ),
23
24 -- In CTE "b", we have to unnest the results since it returns
25 -- geometries in an array
26 b as (
27 select
28 st_transform(
29 unnest(ST_ClusterWithin(st_transform(geom, 3857), 25)),
30 4326
31 ) AS geom
32 from
33 a
34 )
35
36 -- row_number() over() creates an id for each row starting at 1
37 select
38 row_number() over() as id,
39 geom
40 from
41 b
In my results I ended up with 256 clusters. Since our data is so dense and since we used such a low
distance, this actually makes sense, but other datasets will likely be different so make sure to test the
distance parameter. And here is our resulting map (Figure 3.20, on the following page):
Now from time to time you may see some of these special operators within a PostGIS query. Post-
greSQL has these special operators such as <> which is the equivalent of not equal. In the case of
PostGIS there are a handful that we will review that can come in handy and save some keystrokes.
&&
This operator works much like ST_Intersects, but instead of testing the geometries themselves for overlap, it simply checks whether the bounding boxes of the two geometries overlap. For example:
15 UNION
16 select
17 'None' as zipcode,
18 st_envelope(
19 (
20 select
21 geom
22 from
23 nyc_neighborhoods
24 where
25 neighborhood = 'East Village'
26 )
27 )
Note that we are drawing the bounding box of the East Village neighborhood as reference using the
UNION operator to join the data together (Figure 3.21).
The &< and &> operators check whether the bounding box of the first geometry overlaps or is to the left (&<) or to the right (&>) of the bounding box of the second geometry.
Here we can query all the neighborhood bounding boxes that fall to the left of the East Village (Figure
3.22, on the following page):
11 where
12 neighborhood = 'East Village'
13 )
Figure 3.22: Using &< to show all the polygons to the left of the East Village
<->
Finally, this operator returns the 2D Distance between two geometries. We can compare the results of
this and the ST_Distance function below:
8 ),
9 ues as (
10 select
11 geom
12 from
13 nyc_neighborhoods
14 where
15 neighborhood = 'Upper East Side'
16 )
17 select
18 ev.geom <-> ues.geom,
19 st_distance(ev.geom, ues.geom)
20 from
21 ev,
22 ues
These return exactly the same result. We will see a difference if we cast the geometries to geographies, since ST_Distance uses a spheroid measurement by default for geographies:
new_operator st_distance
3581.510728297563 3580.0905042
And with that we have wrapped up our chapters on the foundational elements of spatial SQL. Obviously we did not touch on everything, but in almost every circumstance where I needed a function to do something, I found it, and even when I didn't, it was a matter of using SQL to create the right query to do so. In our next chapters, we will be going over some fundamental spatial analysis problems in spatial SQL, some various use cases and examples, and advanced use cases.
Expert Voices: Justin Chang
I like how fast it is and how instantly I can acquire analysis using SQL. Although it requires some knowledge of GIS and the logic behind it, it is very good at getting results quickly. PostGIS comes with handy spatial functions such as ST_Intersects, ST_DWithin and ST_Length, which enable users to apply their spatial knowledge with SQL. Furthermore, spatial SQL allows users to do GIS without the visuals in a quick way, and the queries can be repeated and shared among users, which makes it quite useful.
Can you share an interesting way or use case that you are using spatial SQL for today?
In telecommunication design we need to do a lot of analysis within an FSA (Fibre Serving Area), like counting how many items there are and how much length of certain cables falls within the boundary. Spatial SQL enables users to get analysis quickly and reliably. In essence, it's a modification of "select by location" in a GIS desktop program like QGIS, and it allows users to customize their queries to find the data they need.
4. Spatial Analysis
Now that we have all the building blocks in place, our next step is to start applying spatial SQL to some
common GIS analyses. While I think that many things you need to do in GIS or geospatial analytics
can be performed with spatial SQL, most people employ some sort of hybrid approach, meaning that
their toolkit consists of things like QGIS, Python, visualization libraries, and more. Spatial SQL is one
part of the modern GIS ecosystem and knowing what to use when is another skill that is incredibly
important.
This section is built to help you connect spatial SQL to functions that you may find in QGIS or ArcGIS’
toolboxes to help you see how you can apply these analyses in spatial SQL. Many times you will use
two or more of these analyses together and in the next chapter we will take a look at some common
problems in spatial SQL that use several techniques in combination.
Some of these are analyses we have seen throughout the course of the book already since they are
generally implemented with one or two functions from spatial SQL. Given that we will start with these
since we know what they are and can easily draw upon what we have already learned:
• KMeans Clustering
• DBScan Clustering
• Merge or Union geometries
• Buffer
• Centroids
• Delaunay triangulation
• Collect geometries
• Simplify
• Split with lines
• Voronoi polygons
• Count points in polygons
• Concave hull
• Convex hull
• Extract X/Y coordinates
• Smoothing lines
This one actually has nothing to do with spatial SQL, as we can just use a CASE WHEN statement, which we can show with our tree census dataset:
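The listing isn't reproduced here; a sketch of the idea, where spc_common again stands in for whatever the species column is called in your import:

select
    spc_common,
    case
        when spc_common ilike '%maple%' then 'Maple'
        when spc_common ilike '%oak%' then 'Oak'
        else 'Other'
    end as tree_group
from nyc_2015_tree_census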
Anything that you can write in a WHERE clause can be used after the WHEN, including dates, numbers, and text data. This is also great for generating labels for your data and visualizations.
Distance Matrix
We reviewed this briefly when discussing cross joins, which can be used to calculate the distances
from one set of points to another. We can do this by simulating a table with points located at Yankee
Stadium and Citi Field, the stadiums of the New York Yankees and New York Mets of Major League
Baseball, respectively. Using the centroids of the neighborhoods file, we will find which neighborhoods
are closest to each, first calculating the distance of each neighborhood to each stadium.
Additionally, we will use a new syntax called a temporary table which will only persist the table until
the session is terminated. Since we don’t need to keep this in memory long term this is a great solution.
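The listing that builds the temporary table isn't reproduced here; a sketch of it, using the buildpoint() helper and approximate stadium coordinates (the exact values in the book may differ):

create temporary table stadiums_matrix as
with stadiums as (
    select 'Yankees Stadium' as stadium,
        buildpoint(-73.926186, 40.829659, 4326) as geom
    union all
    select 'Citi Field' as stadium,
        buildpoint(-73.845821, 40.757088, 4326) as geom
)
select
    b.neighborhood,
    a.stadium,
    st_distance(a.geom :: geography, st_centroid(b.geom) :: geography) as st_distance
from stadiums a
cross join nyc_neighborhoods b;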
Now, we can write a query to find the distance for each neighborhood to see which neighborhood is
closer to which stadium.
16 stadiums_matrix
17 where
18 stadium = 'Yankees Stadium'
19 )
20
21 select
22 a.neighborhood,
23 b.st_distance as mets,
24 c.st_distance as yankees
25 from
26 nyc_neighborhoods a
27 join mets b using (neighborhood)
28 join yankees c using (neighborhood)
To see all the neighborhoods closer to Yankees Stadium we can use this query (Figure 4.1, on the next
page):
14 where
15 stadium = 'Yankees Stadium'
16 )
17 select
18 a.ntaname,
19 a.geom,
20 b.st_distance as mets,
21 c.st_distance as yankees
22 from
23 nyc_neighborhoods a
24 join mets b using (neighborhood)
25 join yankees c using (neighborhood)
26 where
27 b.st_distance < c.st_distance
Bounding Box
There are two different methods of creating a bounding box in spatial SQL. The first is to create a
bounding box from a set of coordinates, the minimum X and Y values and the maximum X and Y
values. To do that we can use the function ST_MakeEnvelope to create a bounding box. If you need to
find the values of a specific area I recommend using a free online tool from Klokan Technologies103
(which I have used hundreds of times so thank you if you are reading this!) which you can access at
the URL in the footnotes.
First we can make a quick bounding box over Manhattan:
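For example, with approximate corner coordinates for Manhattan (the arguments are xmin, ymin, xmax, ymax, and the SRID):

select st_makeenvelope(-74.0479, 40.6829, -73.9067, 40.8820, 4326) as geom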
The other method is to extract a bounding box from a specific geometry. We can do so using this query:
103 https://boundingbox.klokantech.com/
As you can see we have not one but two bounding boxes here. For some reason, the city decided to
store these geometries as two separate records. As a coding challenge how could you combine them
into one?
If you guessed ST_Collect then you are correct! We can validate that here (Figure 4.4, on the facing
page):
1 select
2 st_collect(st_transform(geom, 4326)) as geom
3 from
4 nyc_zips
5 where
6 zipcode = '11231'
Merge/split lines
If you recall, we can use ST_Union to merge lines or other geometries together and ST_Split to split a
line by another geometry, or use a line to split other geometries. You can refer back to the previous
chapter for some examples on this using the NYC Bike Route data.
Make valid
To make an invalid geometry valid in the database, you can use ST_MakeValid. An invalid geometry, whether from an unclosed polygon, overlapping points on a polygon, or any number of other reasons, will generally cause errors when you use it. Depending on the import method that you use, you may or may not actually get to this point. For example, GDAL will generally throw an error if there is an invalid geometry, and you can use a flag within the ogr2ogr command, in this case -makevalid, to validate the geometries before they land in PostGIS.
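If invalid geometries do make it into a table, a typical fix, sketched here against the buildings table purely as an example, looks like this:

update nyc_building_footprints
set geom = st_makevalid(geom)
where not st_isvalid(geom);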
To accomplish the task of interpolating a point or points along a line, we can use one of two functions. First, let's grab a specific line from our dataset, in this case the longest, which can be found using a query you are going to write as another coding challenge! Think of a query you can use to find the longest segment in the bike paths dataset.
If you landed on using ST_Length as the function, you are correct. The question asks us to find the longest linestring, so we know we only need one result. We can order our table by the length of the lines in descending order and simply LIMIT the query to return one row:
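One way to write it, as a sketch:

select ogc_fid, geom
from nyc_bike_routes
order by st_length(st_transform(geom, 3857)) desc
limit 1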
And that will return a bike route in south Brooklyn (Figure 4.5, on the next page):
Now let’s first start by interpolating a single point. We can use a function called ST_LineInterpolatePoint
which takes two arguments, a geometry and a number between 0 and 1 representing a fraction of how
far you want to place the point. For example 0 will be at the beginning of the line, 0.5 will be at the
halfway point, and 1 will be at the end. Let’s try this using 0.5 (note that since it is a multi-linestring
we will merge it using ST_LineMerge) (Figure 4.6, on page 256):
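The listing isn't shown here; a sketch using the route found above:

select st_lineinterpolatepoint(st_linemerge(geom), 0.5) as geom
from nyc_bike_routes
where ogc_fid = 20667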
And this is great if you know the exact fraction of the length of the line where you want to place your
point, but in most cases you will want to place a point at a specified distance along the line. Let’s first
find the length of our line in meters:
Figure 4.5: The longest bike path linestring in New York City
1 select
2 st_length(st_transform(geom, 3857))
3 from
4 nyc_bike_routes
5 where
6 ogc_fid = 20667
This shows that the line is about 2,133 meters long. So let's say we want to find a point along the line that is 500 meters from the beginning: we need to find the fraction that 500 meters represents of the total 2,133 meters, which we can accomplish like so:
Figure 4.6: One point 500 meters from the start of the linestring
12 nyc_bike_routes
13 where
14 ogc_fid = 20667
15 )
16 )
17 ) as geom
18 from
19 nyc_bike_routes
20 where
21 ogc_fid = 20667
Now instead imagine that we want to add a point every 75 meters along the line. A second function exists called ST_LineInterpolatePoints, which we can use with our same query, just swapping out the function and changing 500 to 75 (Figure 4.7, on the next page):
8 (
9 75 / (
10 select
11 st_length(st_transform(geom, 3857))
12 from
13 nyc_bike_routes
14 where
15 ogc_fid = 20667
16 )
17 )
18 ) as geom
19 from
20 nyc_bike_routes
21 where
22 ogc_fid = 20667
Clip/Intersection
While there is an ST_Clip function in PostGIS it applies to raster data, so we use ST_Intersection in
this case. We can clip multiple geometries using this function across all geometries in the table. We can
see this in this example where we clip all the zip codes in New York City with a bounding box around
Central Park (Figure 4.8, on the following page):
Difference
The function to perform a difference does carry the same name as the analysis in this case, ST_Difference.
We can see this using the same query as above (Figure 4.9, on the next page):
We will see this function again when we generate some random points for another analysis, but this task can be solved using a single function, ST_GeneratePoints. It returns n number of points and accepts two arguments: the geometry and an integer with the number of points to generate, in that order.
Let's test this with our zip codes and generate 50 points in the 10001 zip code (Figure 4.10, on page 260):
In the output you will see that it returns one row, which means that it returns a geometry collection. To extract each point we can use ST_Dump, which we will see in action in the next example.
Earlier in the book we saw how you can use ST_Subdivide to divide polygons based on the number of vertices they contain. However, another common analysis that we may want to perform is to split a polygon into a number of equal areas. To do that we need to perform a few different steps, following an approach described by Paul Ramsey104.
To accomplish the task first we will generate some random points inside this polygon using ST_GeneratePoints.
Since this returns a geometry collection we will need to dump the individual points using ST_Dump and
extract the geometry using the .geom accessor (Figure 4.12, on the following page):
104 https://blog.cleverelephant.ca/2018/06/polygon-splitting.html
1 select
2 (
3 st_dump(
4 st_generatepoints(st_transform(geom, 4326), 5000)
5 )
6 ).geom as geom
7 from
8 nyc_zips
9 where
10 zipcode = '11101'
Next we will need to create several evenly sized clusters, which calls for ST_ClusterKMeans to group
our random points. In this case let’s create 6 clusters (Figure 4.13, on the next page):
7 )
8 ).geom as geom
9 from
10 nyc_zips
11 where
12 zipcode = '11101'
13 ),
14
15 -- Create 6 even clusters using ST_ClusterKMeans
16 cluster as (
17 select
18 geom,
19 st_clusterkmeans(geom, 6) over () AS cluster
20 from
21 points
22 )
23
24 -- Group or collect each cluster and find the centroid
25 select
26 st_centroid(
27 st_collect(geom)
28 ) as geom,
29 cluster
30 from
31 cluster
32 group by
33 cluster
In our next-to-last step, we now need to create Voronoi polygons around our centroids (Figure 4.14, on page 265):
And in our final step we will then clip our original polygon with our resulting Voronoi polygons (Figure
4.15, on page 266):
9 nyc_zips
10 where
11 zipcode = '11101'
12 ),
13 cluster as (
14 select
15 geom,
16 st_clusterkmeans(geom, 6) over () AS cluster
17 from
18 points
19 ),
20 centroid as (
21 select
22 st_centroid(st_collect(geom)) as geom,
23 cluster
24 from
25 cluster
26 group by
27 cluster
28 ),
29 voronoi as (
30 select
31 (
32 st_dump(st_voronoipolygons(st_collect(geom)))
33 ).geom AS geom
34 from
35 centroid
36 )
37
38 -- In the last step, we find the intersection (or clip)
39 -- the 11101 zip code and the Voronoi polygons
40 select
41 st_intersection(
42 (
43 select
44 st_transform(geom, 4326)
45 from
46 nyc_zips
47 where
48 zipcode = '11101'
49 ),
50 geom
51 )
52 from
53 voronoi
As you can see this is quite a chunky bit of SQL. Even if we combined this into one single query it
would be difficult to read and decipher. This is a great use case for a user defined function. We can do
this by adding placeholder variables in our SQL, so it can be used across multiple data types. Let’s take
a look at the function creation statement and break it down:
Figure 4.15: Final result with the Voronoi polygons clipped by 11101
30 -- These values will replace each instance of %s in the order they appear
31 unique_id,
32 seed_points,
33 tablename,
34 polygons,
35 tablename,
36 unique_id
37 );
38
39 end;
40 $$
This is the create function syntax that contains the different input variables. Note that tablename and
unique_id are stored as text variables in this case. More on that later.
The thing we want to return from the function in this case is a table with two columns, an ID that is a
VARCHAR format and a geometry, which will be our equally divided areas.
The first four lines are standard for defining a stored procedure in PostgreSQL. The final line tells us that the function will return a query, which will in turn produce the table described in the previous section.
Okay so this is where things get a bit funky. We will discuss the few changes to our query next but for
now there are a few things to focus on:
• The execute format allows us to construct our query as a string and insert data from our function
arguments into the query
• At the end of the query you will see this ...unique_id, seed_points, tablename, polygons,
tablename, unique_id)
– This is the order that the elements will be added to the query contained in the
string. unique_id is first, followed by seed_points, so on and so forth
• Within the query you will see several instances of %s. Each time you see this that placeholder will
be replaced by one of the values from above, in the order that they appear.
– The first %s is replaced by unique_id, the second by seed_points, etc.
Finally, in our query we made a few modifications. Since we are running this query across an entire
table, we will need to keep the unique ID for each original polygon intact, in this case that is the zip
code. You will see that this is included in the first CTE and given the alias ID which carries through the
rest of the query:
As you can see there are three instances of %s here, which are replaced by unique_id, seed_points, and
tablename respectively.
The only other change comes in the section that utilizes ST_ClusterKMeans. In our original query that
line looks like this:
Remember that this is one of only a few window functions that exist in PostGIS. Since we were only dealing with one geometry at a time, we knew that we would get 6 total clusters back. Now that we are passing many points across many polygons, if we used the same structure of the function we would in fact be getting just 6 clusters across all the points of every zip code combined. But since this is a window function, we can easily fix this with one quick change.
By adding PARTITION BY, the window acts kind of like a GROUP BY, meaning it is applied separately for each ID, in this case each zip code. So we will, in this example, end up with 6 clusters for each zip code.
Now that we have done this let’s try our new function with 6 even areas per zip code:
Keep in mind we have not re-projected our data, so you should see this when you open the geome-
try viewer, and we are missing a large portion of Queens and Brooklyn since it will only show 1,000
geometries (Figure 4.16):
However, we can add a geometry that will show on the base map like this:
Now, do you think we could modify our function to subdivide our polygons into areas of a fixed size globally, such as 50 acres, rather than a set number of equal areas per polygon? If you feel a coding challenge coming up, then you are correct! Here are some hints to get you started:
• You only need to modify one line of code and add one extra CTE (this calculates the area of the
original zip code polygons)
• Since you know that the small building size polygons are under this limit, make sure you account
for that edge case
• The code you will be replacing takes a whole integer, so you will need to round the result
• Make the function flexible so you can divide the area of the polygon by any number
So if you focused in on the number of clusters in ST_ClusterKMeans then you went in the right direction.
This was previously static based on the number of polygons we provided in the input. In this case we
want to:
Figure 4.18: Equal areas for even the smallest zip codes
• Add a new CTE that contains the zip code and area of the polygon which we will use to join in
the step where the KMeans clusters are created
• Use that number and a modified input to change the cluster size based on the area divided by the
input parameter
The full code is here with the modified lines in bold:
As you can see we added this CTE to calculate the area one time rather than each time for each point:
The first change is that instead of the static integer we used as an input we changed it to this:
ceil((a.area/(%s)))::int)
This takes the total area of the zip code from our preceding CTE, then divides it by the input number in the function. In this case we are creating areas of 50 acres, which, as we will see below, is 43,560 square feet (our source projection in this case measures in feet) multiplied by 50. We then use CEIL, which rounds up to the nearest integer, and cast the result to an integer to match the input type required by ST_ClusterKMeans.
You will also see that there are additional placeholders at the end of the function to replace one addi-
tional %s placeholder added as seen above.
With that we can run our new query (Figure 4.19, on the next page)!
Since we can only see 1,000 geometries let’s create a new table to view in QGIS (Figure 4.20, on
page 274):
Overall it seems to work great apart from areas with islands which may have something to do with the
random point assignment but for now this is a workable solution (Figure 4.21, on page 275)!
Another common analysis is connecting points to other points with lines using a common attribute. To
do this we can use our taxis dataset to find all the trips that originated within 50 meters of a specific
intersection, in this case the intersection at 5th Avenue and East 59th Street at the southeast corner of
Central Park next to the Plaza Hotel.
To do so we will first create a CTE with a single point, or the location of the intersection. Next we will
find all the pickups that fall within 50 meters of that point using ST_DWithin and focus on pickups on
June 1st, 2016. Finally, we will build a line between the first point at the intersection and all the drop-off
geometries (Figure 4.22, on page 276).
22 ogc_fid,
23 trip_distance,
24 total_amount,
25
26 -- Create a line from the Plaza (in the subquery in the second argument)
27 -- and all the dropoff locations
28 st_makeline(
29 dropoff,
30 (
31 select
32 geom
33 from
34 point
35 )
36 ) as geom
37 from
38 start
You can do the same thing with multiple origin points by joining your data using a common column
between your datasets. For example:
2 a.id,
3 st_makeline(a.geom, b.geom) as geom
4 from
5 destinations a
6 left join origins b using (id)
In this case we have two tables: destinations with the table alias a, and origins with the alias b. Both tables have a column called id. By joining on this column we can use the geometries from a and b to create the line. We use a left join because we want to preserve all the points from our destinations table and join each one to its source origin in b.
Nearest neighbor(s)
Finding the nearest neighbor to a feature is actually quite easy. In spatial SQL, this means that we want
to compare a feature to another set of features and find the feature(s) that are nearest to it. So we can
simply order those features by distance and limit the values that are returned, in this case the three
nearest trees to an entrance to Grand Central Station:
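The listing isn't reproduced on this page; a sketch of it, using the buildpoint() helper and approximate coordinates for a Grand Central entrance (the column names are illustrative):

select ogc_fid, geom
from nyc_2015_tree_census
order by st_distance(
    geom,
    buildpoint(-73.977229, 40.752726, 4326)
)
limit 3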
However, we can use one of the shorthand operators to make this even shorter, in this case <->, which is the equivalent of ST_Distance here.
Performing a bulk nearest neighbor join, from all values in one table to another, requires some extra steps.
For this analysis, I must first credit Paul Ramsey who wrote a blog post on this method in 2016105
which introduced me to the bulk nearest neighbor join as well as lateral joins. I have used the lateral
join method to do a number of different analyses as we will see moving forward. But for now we
will limit this to nearest neighbors. The post itself finds the distance from each property parcel to the
nearest fire hydrant which we can repeat as well. To do so we can import another new dataset, NYC
fire hydrants106 :
ogr2ogr \
-f PostgreSQL PG:"host=localhost user=docker password=docker \
dbname=gis port=25432" \
NYCDEPCitywideHydrants.geojson \
-nln nyc_fire_hydrants -lco GEOMETRY_NAME=geom
With this we can sample buildings within 1 kilometer of John's of Bleecker Street, a famous New York City pizzeria in the West Village of Manhattan.
105 https://carto.com/blog/lateral-joins
106 https://data.cityofnewyork.us/Environment/NYCDEP-Citywide-Hydrants/6pui-xhxz
46 nyc_fire_hydrants
47 order by
48 geom <-> buildings.geom
49 limit
50 1
51 ) nearest
First we create a CTE to make a point at the center of John's and turn it into a geography, and a second CTE to grab all the buildings within 1 km of John's, making sure to cast our geometries to geographies so we have an accurate distance. From here we can write our query that runs the cross lateral join. Breaking this down:
We select the geometry, name, and ID from our buildings table, and then the distance from "nearest", which is an alias for the result of our cross join.
This begins the cross lateral join, and everything between the parentheses will be evaluated for and joined to each row, sort of like the for/each loop we described earlier:
select unitid,
st_distance(geom::geography, buildings.geom::geography) as distance
from nyc_fire_hydrants
order by geom <-> buildings.geom
limit 1
Here we are running the query that will return values that can be joined to each row. First we select the ID of the hydrant and the distance between the building and the hydrant, using the geography from our fire hydrants table. Next we use the shorthand distance ordering (<->) and limit the result to 1: the first result, which is the hydrant closest to the building.
) nearest
Closes the lateral join and gives the result an alias of "nearest". We can create a table from this query to
visualize the results:
13 st_dwithin(
14 geom :: geography,
15 (
16 select
17 geog
18 from
19 point
20 ),
21 1000
22 )
23 )
24 select
25 geom,
26 name,
27 mpluto_bbl,
28 nearest.distance
29 from
30 buildings
31 cross join lateral (
32 select
33 unitid,
34 st_distance(geom :: geography, buildings.geom :: geography) as distance
35 from
36 nyc_fire_hydrants
37 order by
38 geom <-> buildings.geom
39 limit
40 1
41 ) nearest
And here we can see the results with both the table we created and the fire hydrants layer (Figure 4.23).
As we will see with the NYC Taxi Data, GPS-generated data is often not ready to use in spatial analysis as-is. New York City, and in fact any area with large buildings, is notorious for decreasing the accuracy of GPS data. Now of course roads are in fact areas, not linestrings, but in many cases we do want to snap a set of points to road centerlines to prepare our data for analysis.
We will do that with the NYC Taxi Data and road centerlines in NYC. To do that we need to import another file into our database:
ogr2ogr \
-f PostgreSQL PG:"host=localhost user=docker password=docker \
dbname=gis port=25432" \
street_centerlines.geojson \
-nln nyc_street_centerlines -lco GEOMETRY_NAME=geom
2 *
3 from
4 nyc_yellow_taxi_0601_0615_2016
5 where
6 st_dwithin(
7 pickup :: geography,
8 buildpoint(-73.987224, 40.733342, 4326) :: geography,
9 300
10 )
11 and pickup_datetime between '2016-06-02 9:00:00+00'
12 and '2016-06-02 18:00:00+00'
Our next task is to determine the closest road to each point. We can do this by using a bulk nearest
neighbor join as we have seen above. We want to include the ID and street name from the centerlines
table.
11 st_dwithin(
12 pickup :: geography,
13 buildpoint(-73.987224, 40.733342, 4326) :: geography,
14 300
15 )
16 and pickup_datetime between '2016-06-02 9:00:00+00'
17 and '2016-06-02 18:00:00+00'
18 )
19
20 -- Find the nearest road to each point using a
21 -- cross join lateral
22 select
23 a.*,
24 street.name,
25 street.ogc_fid,
26 street.geom
27 from
28 pickups a
29 cross join lateral (
30 select
31 ogc_fid,
32 full_stree as name,
33 geom
34 from
35 nyc_street_centerlines
36 order by
37 a.pickup <-> geom
38 limit
39 1
40 ) street
This will give us a table with our original pickup geometry, tip amount, total trip amount, the street
name, and street ID.
In our final step we need to find the nearest point on the centerline to the pickup point, and then move (or interpolate) that point along the line. The function ST_LineInterpolatePoint should ring a bell here, which allows us to add a point along a linestring (not a multi-linestring) based on a number from 0 to 1, 0 being the starting point and 1 being the end point of the line.
To find the closest location on the linestring to our pickup point we can use another similarly named function: ST_LineLocatePoint. This takes two arguments, a linestring and a point geometry in that order, and returns the input we need for ST_LineInterpolatePoint: a number from 0 to 1 which indicates the location of the nearest point on the line.
In summary, we first calculate the location of the point with ST_LineLocatePoint, then pass the return value of that function to ST_LineInterpolatePoint, which will return the final snapped point. In the query we will also make another geometry, a line connecting the location of the original point to the snapped point:
36 ) street
37 )
38 select
39 a.*,
40
41 -- Create a line between the original point and the new snapped point
42 st_makeline(
43 pickup,
44 st_lineinterpolatepoint(
45 st_linemerge(b.geom),
46 st_linelocatepoint(st_linemerge(b.geom), pickup)
47 )
48 ) as line,
49
50 -- Add a column for the snapped point
51 st_lineinterpolatepoint(
52 st_linemerge(b.geom),
53 st_linelocatepoint(st_linemerge(b.geom), pickup)
54 ) as snapped
55 from
56 nearest a
57 join nyc_street_centerlines b using (ogc_fid)
This results in our snapped points and our lines (Figures 4.25 and 4.26, on the facing page):
Now we can also create table with each of the individual geometries to show in QGIS. First let’s make
a table of our full results:
35 limit
36 1
37 ) street
38 )
39 select
40 a.*,
41 st_makeline(
42 pickup,
43 st_lineinterpolatepoint(
44 st_linemerge(b.geom),
45 st_linelocatepoint(st_linemerge(b.geom), pickup)
46 )
47 ) as line,
48 st_lineinterpolatepoint(
49 st_linemerge(b.geom),
50 st_linelocatepoint(st_linemerge(b.geom), pickup)
51 ) as snapped
52 from
53 nearest a
Figure 4.26: Visualizing the lines between the original point location and snapped point location
14 from
15 nyc_taxi_union_square;
16
17 create table nyc_taxi_union_square_points as
18 select
19 pickup,
20 name,
21 ogc_fid
22 from
23 nyc_taxi_union_square;
24
25 create table nyc_taxi_union_square_lines as
26 select
27 line,
28 name,
29 ogc_fid
30 from
31 nyc_taxi_union_square;
So in our output table from above we have the street segment unique ID, the tip amount, and the total trip amount. This is enough to group the point values by each unique ID. Keep in mind you will want to remove any trips where the total amount is 0 to avoid any division by 0 errors.
And that is it! You can create a table using this query so we can view it in QGIS.
12 b.geom,
13 b.ogc_fid
14 from
15 a
16 join nyc_street_centerlines b using (ogc_fid)
17 group by
18 b.geom,
19 b.ogc_fid
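For reference, here is a sketch of what that aggregation could look like; the tip_amount and total_amount column names are assumptions based on the description above, while nyc_taxi_union_square and nyc_street_centerlines are the tables used in this section:

select
    -- average tip as a percentage of the total, per street segment
    avg(a.tip_amount / a.total_amount) * 100 as avg_tip_percent,
    b.geom,
    b.ogc_fid
from
    nyc_taxi_union_square a
    join nyc_street_centerlines b using (ogc_fid)
where
    -- avoid division by zero
    a.total_amount > 0
group by
    b.geom,
    b.ogc_fid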
Basic statistics
Gathering basic statistics for a single polygon layer or for aggregated results for a points layer is as easy
as using the aggregate functions in PostgreSQL that we have seen earlier such as COUNT, SUM, MIN, MAX,
and AVG. We can also use other aggregate functions such as PERCENTILE_DISC, PERCENTILE_CONT, STDDEV_SAMP, and others.
One that we haven’t seen yet is MODE which allows us to find the most repeated value in a set of data.
To do this we can find the most common tree in a specific neighborhood:
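A minimal sketch of that query; nyc_2015_tree_census and its spc_common column are used later in this book, while the nyc_hoods table, its neighborhood column, and the 'East Village' filter are assumptions for illustration:

select
    -- MODE is an ordered-set aggregate: most repeated species name
    mode() within group (order by spc_common) as most_common_tree
from
    nyc_2015_tree_census t
    join nyc_hoods n on st_intersects(n.geom, t.geom)
where
    -- hypothetical neighborhood filter
    n.neighborhood = 'East Village'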
In this case it is the Honey Locust tree. We can validate this by running this query:
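Again as a sketch, under the same assumptions as above, the validation is just a count per species:

select
    count(*),
    spc_common
from
    nyc_2015_tree_census t
    join nyc_hoods n on st_intersects(n.geom, t.geom)
where
    n.neighborhood = 'East Village'
group by
    spc_common
order by
    count(*) desc
limit
    5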
546 honeylocust
333 Callery pear
252 Sophora
203 ginkgo
139 London planetree
Using the same data as above we can illustrate how to find the shortest path between two features. PostGIS has a function to accomplish this, ST_ShortestLine, so it is as easy as calling it with two geometries as arguments; a single line will be returned. In this case we can
illustrate this by showing the paths from the nearest 100 fire hydrants to the building footprint of the
Empire State Building.
As you can see the lines will connect at any point around the building, so it will connect to a point that
provides the shortest possible path (Figure 4.28).
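A sketch of what such a query could look like; the nyc_fire_hydrants table and the name filter on the building footprints are assumptions for illustration:

select
    h.ogc_fid,
    -- line from each hydrant to the nearest point on the building footprint
    st_shortestline(h.geom, esb.geom) as geom
from
    nyc_fire_hydrants h
    cross join (
        select
            geom
        from
            building_footprints
        where
            -- hypothetical name column and value
            name = 'EMPIRE STATE BUILDING'
    ) esb
order by
    h.geom <-> esb.geom
limit
    100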
Points to path
To turn a series of points into a line, we can use the function ST_MakeLine. If you only have two geometries you can pass those as two individual arguments; however, if you have more than two, you will have to create an array. A quick way to do this is to wrap a select query in an array constructor, which will create an array of the points.
4 geom
5 from
6 nyc_311
7 where
8 geom is not null
9 limit
10 10
11 )
This also gives you the flexibility to order the points or, in our case, limit the results and filter out any
null geometries. Let’s try this with 10 rows (Figure 4.29, on the facing page):
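Putting the pieces together, a minimal sketch of the full pattern looks like this:

select
    st_makeline(
        -- the array() constructor collects the point geometries
        array(
            select
                geom
            from
                nyc_311
            where
                geom is not null
            limit
                10
        )
    ) as geom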
Delete holes
If you need to delete holes from a polygon, there are a few steps to complete that process. First we need
a polygon with a hole, which we can accomplish with this query by creating a 100 meter buffer around
Penn Station and cutting that out of the zip code 10001 (Figure 4.31, on page 293):
The best way to walk through this is by going through each step to eventually come to the result. So
let’s start by running our first step, which is to find the exterior ring of our geometry (Figure 4.32, on
page 294):
And next we can make it back into a polygon using ST_MakePolygon (Figure 4.33, on page 295):
2 select
3 10001 as id,
4 st_difference(
5 st_transform(geom, 4326),
6 st_buffer(
7 buildpoint(-73.9936596, 40.7505483, 4326) :: geography,
8 100
9 ) :: geometry
10 ) as geom
11 from
12 nyc_zips
13 where
14 zipcode = '10001'
15 )
16 select
17 id,
18
19 -- Creates a new polygon just from the exterior ring
20 -- which removes all holes
21 st_makepolygon(
22 st_exteriorring(geom)
23 ) as geom
24 from
25 a
Now this works for normal polygons but many times you will be dealing with multipolygons. We
can test this out by first creating a multipolygon. We do this by creating a second query that does the
same task as our first query, but this time in zip code 10017 and with a buffer around Grand Central
Station. And in a third CTE, we will union the two geometries together using ST_Union (Figure 4.34,
on page 296).
16
17 -- Creates a polygon with a hole in zip code 10017
18 b as (
19 select
20 st_difference(
21 st_transform(geom, 4326),
22 st_buffer(
23 buildpoint(-73.9773136, 40.7526559, 4326) :: geography,
24 100
25 ) :: geometry
26 ) as geom
27 from
28 nyc_zips
29 where
30 zipcode = '10017'
31 ),
32
33 -- Unions both polygons into a multi-polygon
34 c as (
35 select
36 1 as id,
37 st_union(a.geom, b.geom)
38 from
39 a,
40 b
41 )
42 select
43 *
44 from
45 c
We cannot use the ST_ExteriorRing function directly since it only works with a polygon and we now have a multi-polygon. To get back to individual polygons we first have to dump the multi-polygon into its component polygons using ST_Dump (Figure 4.35, on page 297):
23 ) as geom
24 from
25 nyc_zips
26 where
27 zipcode = '10017'
28 ),
29 c as (
30 select
31 1 as id,
32 st_union(a.geom, b.geom) as geom
33 from
34 a,
35 b
36 )
37 select
38 id,
39 (st_dump(geom)).geom
40 from
41 c
This will return two rows with two individual polygons, which we can then use to capture the exterior rings:
5 st_buffer(
6 buildpoint(-73.9936596, 40.7505483, 4326) :: geography,
7 100
8 ) :: geometry
9 ) as geom
10 from
11 nyc_zips
12 where
13 zipcode = '10001'
14 ),
15 b as (
16 select
17 st_difference(
18 st_transform(geom, 4326),
19 st_buffer(
20 buildpoint(-73.9773136, 40.7526559, 4326) :: geography,
21 100
22 ) :: geometry
23 ) as geom
24 from
25 nyc_zips
26 where
27 zipcode = '10017'
28 ),
29 c as (
30 select
31 1 as id,
32 st_union(a.geom, b.geom) as geom
33 from
34 a,
35 b
36 )
37 select
38 id,
39
40 -- Takes the exterior ring from the geometry dump
41 st_exteriorring(
42 (st_dump(geom)).geom
43 )
44 from
45 c
Now we have two final steps. First we want to make those geometries into polygons, and then collect
them using ST_Collect back into the original multi-polygon. The issue is that since we want to aggre-
gate our geometries back by using the id column, we need to use a specific variant of ST_Collect. You
can see this at the bottom of the documentation for ST_Collect (Figure 4.36):
To do this we can use the same structure, but add our exterior rings in the subquery following the FROM and then make our polygons inside the collection function. In short, the subquery actually contains multiple rows of geometries resulting from ST_Dump, which is known as a set result; this is why we have to add the .geom to extract the actual geometries. ST_Collect can then group those rows based on the outer GROUP BY condition, and for our purposes we can create the polygons from the exterior rings at this point. There is a great post on this topic107 at the link in the footnotes.
And then we have our filled multi-polygon geometry (Figure 4.37, on the following page):
107 https://loc8.cc/sql/cuny-postgis-holes
Pole of inaccessibility
To calculate a pole of inaccessibility, or the most isolated point in a polygon, PostGIS has a function
named ST_MaximumInscribedCircle, which acts a bit differently than many of the other functions we
have used so far. The function signature is below:
Description
Finds the largest circle that is contained within a (multi)polygon, or which does not overlap any lines
and points. Returns a record with fields:
• center - center point of the circle
• nearest - a point on the geometry nearest to the center
• radius - radius of the circle
This actually returns a table structure, or a record, with three columns: center, nearest, and radius. You also need to call the function in the FROM clause of the query. Let's take a look at an example below:
If we want to see the actual pole of inaccessibility we can return one column:
And if we want to see the inscribed circle itself we can use the radius in combination with the center to
create a buffer around it too:
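As a sketch, assuming the nyc_zips table and zip code 10001 used elsewhere in this chapter, both of those queries can be combined like this:

select
    -- the pole of inaccessibility itself
    circ.center as pole_of_inaccessibility,
    -- the inscribed circle, built by buffering the center by the radius
    st_buffer(circ.center, circ.radius) as inscribed_circle
from
    nyc_zips,
    lateral st_maximuminscribedcircle(st_transform(geom, 4326)) as circ
where
    zipcode = '10001'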
We can also compare the pole of inaccessibility with the centroid to see how those compare by creating
a union with the inscribed circle query as well as the centroid (Figure 4.38, on the next page):
Turning polygons into lines and vice versa is quite easy. To turn polygons into lines you can use ST_
Boundary to turn the outside of the polygon to linestrings:
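A minimal sketch using the nyc_zips table from the earlier examples:

select
    -- returns the outer ring(s) of the polygon as linestrings
    st_boundary(st_transform(geom, 4326)) as geom
from
    nyc_zips
where
    zipcode = '10001'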
Let’s say you want to create individual linestrings from the full line. We can do this by using the cross
lateral join and generating a series of numbers from 1 to the number of points in the geometry with ST_
NPoints, minus 1. We will need to use the result of the ST_Boundary function since the ST_PointN only
works with linestrings. Once we have that we can create a line for each number in the series using ST_
PointN and the next point in the series by adding 1 to the number in the series. We subtract 1 in the
cross lateral join because if we add 1 to the max value of the total number of points we will see an error
since there won’t be a point in the geometry greater than the max number of points.
8 st_boundary(st_transform(z.geom, 4326)),
9 numbers.num + 1
10 )
11 ),
12 numbers.num
13 from
14 nyc_zips z
15 cross join lateral generate_series(1, st_npoints(z.geom) - 1) as numbers(num)
16 where
17 z.zipcode = '10001'
This will return a table of individual lines. To turn it back into a polygon we can use ST_Polygonize.
It accepts a set of linestrings or an array of linestrings. The set is simply the rows in the column that
contains the linestrings so we can accomplish this by wrapping the set of lines in ST_Polygonize:
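A minimal sketch, assuming the individual lines from the previous query are available in a table or CTE named lines with a geom column:

select
    -- ST_Polygonize is an aggregate over the set of linestrings
    st_polygonize(geom) as geom
from
    lines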
To snap points to a grid we can use the function ST_SnapToGrid, which snaps points to an arbitrary grid defined in the units of the projection. In this case we can use our original points from the NYC 311 data. We first transform them to a projection that uses meters, in this case 3857, and then create grids of 500 meters and 1,000 meters. Then, to show the points, we project them back to 4326. We will use a random sample of 100,000 points (Figure 4.39, on the following page):
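A sketch of the 500 meter case (swap 500 for 1000 to build the coarser grid):

select
    st_transform(
        -- snap to a 500 x 500 meter grid in the 3857 projection
        st_snaptogrid(st_transform(geom, 3857), 500, 500),
        4326
    ) as geom
from
    nyc_311
where
    geom is not null
order by
    random()
limit
    100000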
To tessellate a polygon into triangles we can use the ST_DelaunayTriangles function, which we have seen before, as shown in the query below (Figure 4.40, on page 305):
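As a minimal sketch, using the same zip code as the query that follows:

select
    -- returns a geometry collection of triangles
    st_delaunaytriangles(st_transform(geom, 4326)) as geom
from
    nyc_zips
where
    zipcode = '10009'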
Of course this will create a geometry collection, and to turn this into rows we can use ST_Dump. We can
also get the area of the individual geometries and return the top 10 largest triangles:
2 select
3 (
4 -- Create a dump to return the individual triangles
5 st_dump(
6
7 -- Create the triangles like above
8 st_delaunaytriangles(st_transform(geom, 4326))
9 )
10 ).geom
11 from
12 nyc_zips
13 where
14 zipcode = '10009'
15 )
16
17 -- Select and order by areas
18 select
19 a.geom,
20 st_area(geom) as area
21 from
22 a
23 order by
24 st_area(geom) desc
25 limit
26 10
Creating a variable width buffer around a linestring requires several steps. First we have to create a single linestring geometry to use with our query. To do this we can grab the bike path along Hudson Street in lower Manhattan and turn it from a MultiLineString into a LineString (Figure 4.41, on the facing page).
Next, we will add a second CTE that dumps the points from the linestring and also returns the length of the geometry and the original geometry.
You will recall that the geometry dump returns a set (produced by what is known as a set-returning function, or SRF), meaning each element of the set holds two values: the path (index) of the point and the geometry.
Next, we will perform an operation over those rows. There are a few things to break down here, so let's take a look at the full query first, with the new CTE starting on line 19:
6 nyc_bike_routes
7 where
8 street = 'HUDSON ST'
9 ),
10 first as (
11 select
12 id,
13 st_dumppoints(geom) as dump,
14 st_length(geom) as len,
15 geom
16 from
17 lines
18 ),
19
20 -- For each path, select the id, path, and a buffer
21 -- around the path point. Using ST_LineLocatePoint
22 -- we use the line geometry and the point to find
23 -- the distance along the line, then make it smaller
24 -- using the log of the length
25 second as (
26 select
27 id,
28 (dump).path [1],
29 st_buffer(
30 (dump).geom,
31 ST_LineLocatePoint(geom, (dump).geom) * log(len)
32 ) as geom
33 from
34 first
35 )
36 select
37 *
38 from
39 second
(dump).path[1]
If you recall from the geometry dump functions, you need to wrap the dump value in parentheses, and you can access the geometry with the .geom operator. The other value, an integer contained between square brackets, is the index of the point on the linestring. It is accessed with the .path operator, since it is an array data type. To access the first value we use the array access method, a number in square brackets, in this case [1]. That array can contain more than one value in the case of multi-geometries; you can see this in the examples in the PostGIS documentation108.
Next we have this, formatted for readability:
st_buffer(
(dump).geom,
st_linelocatepoint(
geom,
(dump).geom
) * len / 10
) as geom
108 https://postgis.net/docs/ST_DumpPoints.html
So we start with our buffer, which takes two arguments: first the point to build the buffer from, and then a numeric value representing the radius of the buffer, in the units of measurement of the projection. The point geometry is pulled from the geometry dump using:
(dump).geom
To get the value for the radius, since we are creating a tapered buffer, we first want to compute a scaled value. Here are the steps taking place in the code we are about to review:
1. We use the ST_LineLocatePoint function, which takes two arguments, the line geometry and the point geometry from the dump, and returns a fractional value representing the position of the point along the line (i.e. a value of 0.5 represents a point at the midway point of the line).
2. Then we multiply that fraction by the length value, which is the same for every row since it all comes from one geometry. For example, if the length is 100 meters and the fraction returned is 0.5, we will get a value of 50.
3. Finally, we divide that value by 10 to decrease the buffer size, since otherwise it will be too large given the 4326 projection.
Finally, we union the buffer geometries together, which alone won’t produce the result we want, which
is why we will wrap it in the ST_ConvexHull function. Let’s take a look.
third as (
select
id,
st_convexhull(
st_union(geom, lead(geom) over(partition by id order by id, path)
)
) as geom from second)
First, we select the ID, then we start to construct our geometry. Keep in mind this query is set up to work with multiple IDs, which we will get to in a minute. First we call ST_ConvexHull, then ST_Union, as we stated above. Next is where we address the ordering of the geometries and the grouping by ID. We use the LEAD function from PostgreSQL, a window function that lets us access values from rows after the current row given a specific order and offset. In this case we use the OVER component of the window function to partition, or group, the rows by ID, and then order them first by ID and then by the path order from our geometry dump.
So this is our final query (Figure 4.42, on page 311):
5 from
6 nyc_bike_routes
7 where
8 street = 'HUDSON ST'
9 ),
10 first as (
11 select
12 id,
13 st_dumppoints(geom) as dump,
14 st_length(geom) as len,
15 geom
16 from
17 lines
18 ),
19 second as (
20 select
21 id,
22 (dump).path [1],
23 st_buffer(
24 (dump).geom,
25 ST_LineLocatePoint(geom, (dump).geom) * len / 10
26 ) as geom
27 from
28 first
29 ),
30
31 -- Create a convex hull around the buffers by union-ing
32 -- all the buffers together. These are ordered using the
33 -- LEAD window function and partition
34 third as (
35 select
36 id,
37 st_convexhull(
38 st_union(
39 geom,
40 lead(geom) over(
41 partition by id
42 order by
43 id,
44 path
45 )
46 )
47 ) as geom
48 from
49 second
50 )
51 select
52 id,
53 st_union(geom) as geom
54 from
55 third
56 group by
57 id
In a similar case to the above, we can also use any numeric value to create a variable width buffer. We can use the same final query as above but swap out the length of the line for a numeric value. In this case I used a random number generated with the RANDOM() function in PostgreSQL, which returns a random number between 0 and 1, and then multiplied that number by 100, which produces a buffer distance between 0 and 100. Of course, you need to transform the geometry to a coordinate system that uses meters and then back to 4326 to visualize it (Figure 4.43, on page 313).
Your result will be different with the above query since the RANDOM() function will, true to its name,
return random values.
Symmetrical difference
PostGIS has a built-in function for symmetrical difference, aptly named ST_SymDifference. It takes two geometries and returns the portions of the two geometries that do not intersect. We can test this with zip codes and neighborhoods (Figure 4.44, on page 314):
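A sketch of such a test; it assumes the neighborhoods table is named nyc_hoods with a neighborhood column and geometries stored in 4326, and uses zip code 10001 for illustration:

select
    -- everything covered by the zip code or the neighborhood, but not both
    st_symdifference(
        st_transform(z.geom, 4326),
        n.geom
    ) as geom
from
    nyc_zips z
    join nyc_hoods n on st_intersects(st_transform(z.geom, 4326), n.geom)
where
    z.zipcode = '10001'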
5. Advanced Spatial Analytics
In the next chapters, we will start to take our spatial SQL skills and add complexity to the analyses we
are performing. We will look at more advanced analysis patterns in this chapter, and topics such as
raster data, spatial data science with Python (yes Python in a database), suitability analysis, H3 spatial
indexes, and using pgRouting to perform analysis with routing. Let’s begin with some more advanced
analysis topics!
5.1 Spatial data enrichment or area weighted interpolation
A very common operation in spatial analytics is to perform spatial interpolation between two polygon layers that do not share the same boundaries. A common example of this would be analyzing population data from one set of boundaries, such as Census Block Groups, and estimating how many people live in another set of boundaries that does not match those boundaries, such as zip codes. This is known as enrichment or area weighted interpolation.
The easiest way to think about this is to imagine you have a polygon you want to enrich with a dataset that contains polygons overlapping the target polygon. For simplicity, we can think of each underlying polygon as having 100 people living in it. For each polygon that is entirely contained by the target polygon, we can count all 100 people as living in that polygon. However, for the polygons that only partially overlap, we need to adjust that count based on the overlap.
If we had underlying data such as the distribution of the population, land cover, buildings, or roads,
we could use a method known as dasymetric interpolation109 , which is described in the documentation
for the tobler Python library, which is a part of the Python Spatial Analysis Library or PySAL.
For argument’s sake, let’s assume that we have several polygons that are overlapping the target poly-
gon by exactly 50%, with the remaining 50% outside it. In this case we would count 50% of the popu-
lation, or 50 people as living in the target polygon. So to describe this analysis in pseudocode:
• Select target polygons to be enriched
• Find the polygons that contain the data to enrich with that intersect the target polygons
• Calculate the percentage overlap of the data containing polygons to the target polygon
• Multiply the data by the overlap percentage
Our first task is to find the area of the overlap between the target polygons (our NYC neighborhoods data) and the polygons we are using to enrich them, in this case the US Census Block Groups that we are going to import shortly, as well as the total area of each block group. So first let's import our new block group data.
This data has the Census Block Groups for all of New York State and three data points from the 2021 American Community Survey, a dataset that samples the US population to make estimates about the population over a 5-year rolling time window, meaning this data was collected between 2017 and 2021. The data was downloaded from the US Census website110 and comes as a geodatabase. The problem, however, is that the data contains 8,319 columns. While PostgreSQL supports up to 1,600 columns per table, in most cases this is highly inefficient and it makes more sense to extract the data you need rather than
pull it all into one table. If you were to try (like I did before reviewing the metadata) you will cause some errors in the logs that will consume the memory of your database.
109 https://github.com/pysal/tobler#dasymetric
For this we can actually use our ogr2ogr import to filter the data we want to import. Using the flag for
SQL we can write a query that selects and renames the columns we want to use, in this case they are
columns representing income, median age, and total population. You can find the full metadata at this
link111 , and it is also saved in the book files in case the link breaks. Below are the columns of the data
we want to import:
• B01001e1 - Total population (SEX BY AGE: Total: Total population -- (Estimate))
• B19001e1 - Median income (HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-
ADJUSTED DOLLARS): Total: Households -- (Estimate))
• B01002e1 - Median age (MEDIAN AGE BY SEX: Total: Total population -- (Estimate))
To do this we just need to modify our ogr2ogr command line to add the SQL commands in Spatialite
and SQLite.
First we need to find the column names within the layers of the geodatabase using the ogrinfo com-
mand:
This will show us a list of column names within each layer. We will be using three ogr2ogr statements to import our data into our database. That may seem odd: since we are already running a SQL statement in the command to filter the data, why couldn't we just perform a join between the tables there? Well, I thought the same thing, but the results were not great. I tried a number of options and settings to see if this would work, but to no avail. I couldn't find great documentation on this specific issue, but overall I believe that instead of spending time on that, an ELT (extract, load, transform) workflow is more valuable here than trying to make the ETL (extract, transform, load) process work.
In an ELT scenario the goal is to simply extract the data, load it into the target (in this case our Post-
GIS database), and then transform it in the target location. While we are not performing a pure ELT
workflow since we are selecting a few columns of data to cut down on what we load into our database,
the concept is roughly the same since we still have to join our data in the database. So with that let’s
finish up the data imports. As you will see each layer inside the GeoDataBase acts as its own layer in
the query:
1 ogr2ogr \
2 -f PostgreSQL PG:"host=localhost user=docker password=docker
3 dbname=gis port=25432" ACS_2021_5YR_BG_36_NEW_YORK.gdb \
4 -dialect sqlite -sql "select GEOID as geoid, b01001e1 as population, B01002e1 as age from X01_AGE_AND_SEX" \
5 -nln nys_2021_census_block_groups_pop -lco GEOMETRY_NAME=geom
1 ogr2ogr \
2 -f PostgreSQL PG:"host=localhost user=docker password=docker \
3 dbname=gis port=25432" ACS_2021_5YR_BG_36_NEW_YORK.gdb \
4 -dialect sqlite -sql "select GEOID as geoid, B19001e1 as income from X19_INCOME" \
5 -nln nys_2021_census_block_groups_income -lco GEOMETRY_NAME=geom
1 ogr2ogr \
2 -f PostgreSQL PG:"host=localhost user=docker password=docker \
3 dbname=gis port=25432" ACS_2021_5YR_BG_36_NEW_YORK.gdb \
110 https://loc8.cc/sql/census-tiger-time-series
111 https://loc8.cc/sql/census-tiger-metadata
4 -dialect sqlite -sql "select SHAPE as geom, GEOID as geoid from ACS_2021_5YR_BG_36_NEW_YORK" \
5 -nln nys_2021_census_block_groups_geom -lco GEOMETRY_NAME=geom
Now that we have our tables in our database we can simply join them together:
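A sketch of that join; it assumes the geoid values start with the plain state and county FIPS codes (if the ACS geodatabase prefixes them, adjust the filter accordingly):

select
    geoid,
    g.geom,
    p.population,
    p.age,
    i.income
from
    nys_2021_census_block_groups_geom g
    join nys_2021_census_block_groups_pop p using (geoid)
    join nys_2021_census_block_groups_income i using (geoid)
where
    -- the five NYC county FIPS codes
    left(geoid, 5) in ('36005', '36047', '36061', '36081', '36085')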
Here we select the block groups that fall within the New York City counties, which can be found using the first five characters of the geoid column. The first two characters correspond to the state (36 for New York) and the next three characters correspond to the county. In the case of New York City those are Bronx, Kings (Brooklyn), New York (Manhattan), Queens, and Richmond (Staten Island) (Figure 1.3, on page 372).
Once we have the Block Group data, we need to find the overlapping area and divide that by the total
area. The equation will look like this:
population *
(st_area(
st_intersection(
block_groups.geom, nyc_hoods.geom)
) / st_area(block_groups.geom)
)
The best way to accomplish this is, once again, with a cross lateral join, but before that you may need
to update your neighborhoods table. I encountered an issue with an invalid geometry which you can
fix with this query:
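A minimal sketch of that fix, assuming the neighborhoods table is named nyc_hoods as in the expression above:

update
    nyc_hoods
set
    -- repair any invalid geometries in place
    geom = st_makevalid(geom)
where
    not st_isvalid(geom);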
This will show us the enriched population, number of polygons intersecting each neighborhood, and
average overlap of each block group intersecting the neighborhood.
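For reference, a sketch of what that enrichment query might look like; the table and column names (nyc_hoods with a neighborhood column, and a block groups table with population and geom in a matching projection) are assumptions:

select
    n.neighborhood,
    -- population weighted by the share of each block group that overlaps
    sum(
        bg.population * (
            st_area(st_intersection(bg.geom, n.geom)) / st_area(bg.geom)
        )
    ) as population,
    count(bg.geoid) as block_groups,
    avg(
        st_area(st_intersection(bg.geom, n.geom)) / st_area(bg.geom)
    ) as avg_overlap
from
    nyc_hoods n
    cross join lateral (
        select
            geoid,
            population,
            geom
        from
            nyc_2021_census_block_groups
        where
            st_intersects(geom, n.geom)
    ) bg
group by
    n.neighborhood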
5.2 Nearest neighbor in report format
Imagine a scenario where the City of New York wants us to find the 5 nearest trees to every subway entrance in the city. The catch is that for each subway station they want each tree listed along with the station, meaning that the station will be repeated five times, once for each of the five nearest trees, for example:
entrance tree
14th St Tree 1
14th St Tree 2
14th St Tree 3
14th St Tree 4
14th St Tree 5
So how can we define this? Let’s use pseudo-code to define what we want to do.
Given two tables:
• Join the first table (subway entrances)
• To five nearest locations from the other table
If this sounds familiar, you are right. We did this in our bulk nearest neighbor join before but only with
one row. And luckily we only need to make one small change. We just need to change our limit to the
number of rows we want to return. First, let’s import the subway entrances data into our new database:
And then as we said, all we have to do is just rewrite our bulk nearest neighbor query:
As an extra challenge, can you add a ranking column to the data? ROW_NUMBER() OVER() will do the trick here:
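A sketch of that version; the objectid column on the subway entrances table is an assumption, and everything else mirrors the query below:

select
    sw.objectid,
    near.spc_common,
    near.distance,
    -- numbers the rows in output order; since the lateral join returns the
    -- five nearest trees per entrance in distance order, this acts as a rank
    row_number() over () as ranking
from
    nyc_subway_enterances sw
    cross join lateral (
        select
            spc_common,
            st_distance(sw.geom :: geography, geom :: geography) as distance
        from
            nyc_2015_tree_census
        order by
            sw.geom <-> geom
        limit
            5
    ) near
limit
    50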
However, in the rare event that there are two trees at the exact same distance, we can use RANK, partitioning by the station ID and ordering by the distance in our outer query:
15 order by
16 distance desc
17 )
18 from
19 nyc_subway_enterances sw
20 cross join lateral (
21 select
22 tree_id,
23 spc_common,
24 st_distance(sw.geom :: geography, geom :: geography) as distance
25 from
26 nyc_2015_tree_census
27 order by
28 sw.geom <-> geom
29 limit
30 5
31 ) near
32 limit
33 50
And that will cover any odd edge cases that may occur:
Before we proceed I want to point out that you can use the CROSS JOIN LATERAL using pretty much any
spatial function you want to help scale it to go row by row. Just some ideas:
• Objects within a distance
• Objects by category
• Distance to cluster
• Average distance to neighbors
• Polygons that share a border
• Raster data within a distance
• Overlap percentage
And pretty much anything else you can think up. The goal of this book is to lay out as many potential use cases as possible, and this pattern is a perfect example of one you can take forward and combine in any number of ways!
5.3 Flat line of sight
A common analysis used in many fields such as planning, real estate, construction, and telecommunications is understanding direct line of sight: whether you can directly view a location from another location, and what obstacles might be in the way. There are a few methods of accomplishing this task:
• If we have elevation or obstruction data in a vector file, we can simply draw a line between our two points and see if any of the values intersecting that line are greater than either of the two end points. However, this loses accuracy, since the line between the points may rise or fall in elevation along its length if the heights of the two points differ, which in most cases they are likely to do. Using geometry (the mathematical field in this case, not the GEOMETRY type we have been discussing so far), you could calculate the two legs of the triangle formed between the two points to find the angle of the base and the hypotenuse, then use those measurements to calculate the height along the path at the distance you want to measure from the start or end point.
• Another approach is using 3D geometries. Earlier in the book we covered that you can use a
third dimension on your geometries to store a Z-value, or height, in meters. PostGIS and any
other database or data warehouse that supports 3D geometries will then allow you to do a 3D
intersection of a 3D line between your two points.
• Finally, in a combination approach you can use an elevation file to add to the height of your
building 3D polygons to get an accurate height, assuming that your building polygon heights are
measured from the base of the structure.
Before we get started we are going to add some new data from the Denver Regional Council of Governments Regional Data Catalog, in this case their 2018 roofprints data, which contains building roofprints as well as their height and the ground elevation each building sits at.
So our first step when creating a flat line of sight analysis is to find the line between the two points.
In this case we will select two random points from the dataset by ordering our data using the RANDOM
function:
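A sketch of that selection, using the column names that appear in the later queries:

select
    gid,
    geom,
    -- total height above sea level: structure height plus ground elevation
    bldg_heigh + ground_ele as height
from
    denver_buildings
order by
    random()
limit
    2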
Next we will aggregate our data into arrays, so we can save keystrokes by using one column name and referencing each of the two buildings by its array position:
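And a sketch of the aggregation step, building on the random selection above:

with a as (
    select
        gid,
        geom,
        bldg_heigh + ground_ele as height
    from
        denver_buildings
    order by
        random()
    limit
        2
)
select
    -- each array holds two values, one per selected building
    array_agg(geom) as geom,
    array_agg(height) as height,
    array_agg(gid) as id
from
    a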
Now in this step we will simply create the line we will use to calculate our line of sight, so we can see it
on the map. To do this we will use the ST_MakeLine function to build a line between our two geometries.
We can access the two points using subqueries and the array position, [1] for the first geometry and
[2] for the second (Figure 5.2, on the next page).
From here, we are going to run a simple query that finds all the buildings that intersect the line we just created and have a height greater than either of the two buildings creating the line. We can use the array from the previous query to access the heights of the two buildings. Below is the complete query:
17 array_agg(gid) as id
18 from
19 a
20 ),
21 line as (
22
23 -- Here we use a simple select statement rather than sub-queries
24 -- like the previous step to grab the height column too
25 select
26 st_makeline(geom [1], geom [2]) as geom,
27 height
28 from
29 bldgs
30 )
31 select
32
33 -- This will return all the buildings higher than the line
34 b.gid,
35 b.bldg_heigh + b.ground_ele as height,
36 st_transform(b.geom, 4326),
37 line.height
38 from
39 denver_buildings b
40 join line on st_intersects(line.geom, st_transform(b.geom, 4326))
41 where
42
43 -- This finds any building taller than either of our two buildings
44 b.bldg_heigh + b.ground_ele > line.height [1]
45 or b.bldg_heigh + b.ground_ele > line.height [2]
Below are the results with one image of the complete results and one zoomed into one area (Figures
5.3, on the following page and 5.4, on page 327):
5.4 3D line of sight
Now, to create a much more accurate line of sight analysis, we need to turn our buildings into 3D geometries. This is still not completely accurate since these are only building roofprints; some datasets include building parts, which form a much more accurate 3-dimensional picture of a building. For our purposes we are simply going to calculate the height of the building plus the elevation where the building is located. First we need to add a new geometry column to our table:
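A minimal sketch of that change; the generic geometry type here is an assumption:

alter table
    denver_buildings
add
    column geom_z geometry;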
Once we have done that we can update our table and use the 2-dimensional geometry and a function
called ST_Force3D to force a third dimension onto the geometry, in this case the building height plus
the ground elevation.
1 update
2 denver_buildings
3 set
4 geom_z = st_force3d(
5
6 -- Our original 2D geometry
7 geom,
8
9 -- This is our Z or height value, the building height
10 -- plus the ground elevation
11 bldg_heigh + ground_ele
12 )
Now we have a 3D geometry that we can use. The process from this point on is quite similar to the previous query, with a few differences (results in Figure 5.5).
• First we select two buildings, which is the same as before, however we will search for one building
then search for a second within 2 kilometers of the first building to reduce a potentially long-
running query.
• Next we union those two tables together, so the first three steps equate to the first step from the
previous query
• After that we select the centroids of the two source buildings. In the following step we create a
single geometry consisting of the two points, scale the geometries back to 2-dimensional using
ST_Force2D, and find all the buildings within 3 kilometers of those two points
• Finally, we find all the buildings that intersect a line we create in the final query using ST_MakeLine
and then find the crossing buildings using ST_3DIntersects, which is the same as ST_Intersects
but in a 3-dimensional space (Figure 5.6, on page 331).
3 select
4 geom_z,
5 gid
6 from
7 denver_buildings
8 limit
9 1 offset 100
10 ),
11
12 -- Find a building with its ID and GeometryZ
13 -- within 2 kilometers
14 b as (
15 select
16 geom_z,
17 gid
18 from
19 denver_buildings
20 where
21 st_dwithin(
22 st_transform(geom, 3857),
23 (
24 select
25 st_transform(geom, 3857)
26 from
27 a
28 ),
29 2000
30 )
31 limit
32 1
33 ),
34
35 -- Use UNION to create a single table with matching columns
36 c as (
37 select
38 *
39 from
40 a
41 union
42 select
43 *
44 from
45 b
46 ),
47
48 -- Store the geometries and IDs in arrays
49 bldgs as (
50 select
51 array_agg(st_centroid(geom_z)) as geom,
52 array_agg(gid) as id
53 from
54 c
55 ),
56
57 -- This query finds all the buildings within 3 kilometers of each building
58 denver as (
59 select
60 st_transform(geom, 3857) as geom,
61 geom_z,
62 gid
63 from
64 denver_buildings
65 where
66 st_dwithin(
67
5.5 Calculate polygons that share a border
In all the spatial relationships we have explored so far, there is one we haven't used yet because it doesn't exist out of the box in PostGIS: finding polygons that share portions of their borders with other polygons, defined either by sharing more than one point, by sharing a percentage of the border, or by sharing a specific length. This can be useful when you want to find more direct relationships between polygons than features that simply share a single point. Let's review each of the scenarios for a single polygon. If you want to expand this to every row you can still do so with a join or cross join, aggregating the IDs into an array.
Before we start, we will create a table that contains the census block groups just for New York City that
also excludes block groups with no population:
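A sketch of that table, built from the joined block group tables imported earlier; as before, the geoid prefix handling is an assumption:

create table nyc_2021_census_block_groups as
select
    geoid,
    population,
    income,
    age,
    g.geom
from
    nys_2021_census_block_groups_geom g
    join nys_2021_census_block_groups_pop p using (geoid)
    join nys_2021_census_block_groups_income i using (geoid)
where
    -- NYC counties only
    left(geoid, 5) in ('36005', '36047', '36061', '36081', '36085')
    -- drop block groups with no population
    and population > 0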
Next, let’s find all polygons that intersect with our source block group (Figure 5.7, on the facing page):
As you can see, there are three polygons that only share one point. Let's get rid of those first by using ST_Intersection to find the shared geometry, then ST_NPoints to count the points in that shared geometry, and finally filtering out those that have only 1 point (Figure 5.8, on page 335):
9 select
10 bgs.*,
11 st_npoints(
12 st_intersection(
13 bgs.geom,
14 (
15 select
16 geom
17 from
18 a
19 )
20 )
21 ) as intersected_points,
22 st_length(
23 st_intersection(
24 bgs.geom,
25 (
26 select
27 geom
28 from
29 a
30 )
31 ) :: geography
32 ) as length,
33 geom
34 from
35 nyc_2021_census_block_groups bgs
36 where
37
38 -- Only select polygons that have a border overlap
39 -- of 2 points or more
40 st_npoints(
41 st_intersection(
42 bgs.geom,
43 (
44 select
45 geom
46 from
47 a
48 )
49 )
50 ) > 1
51 and bgs.geoid != (
52 select
53 geoid
54 from
55 a
56 )
Now let’s find the polygons that share a border of at least 100 meters (Figure 5.9, on page 337):
Figure 5.8: All polygons touching Block Group 360470201002 with 2 or more points in common
1 with a as (
2 select
3 *
4 from
5 nyc_2021_census_block_groups
6 where
7 geoid = '360470201002'
8 )
9 select
10 bgs.*,
11 st_npoints(
12 st_intersection(
13 bgs.geom,
14 (
15 select
16 geom
17 from
18 a
19 )
20 )
21 ) as intersected_points,
22 st_length(
23 st_intersection(
24 bgs.geom,
25 (
26 select
27 geom
28 from
29 a
30 )
31 ) :: geography
32 ) as length,
33 geom
34 from
35 nyc_2021_census_block_groups bgs
36 where
37
38 -- Only select polygons that have a border overlap
39 -- of 100 meters or more
40 st_length(
41 st_intersection(
42 bgs.geom,
43 (
44 select
45 geom
46 from
47 a
48 )
49 ) :: geography
50 ) > 100
51 and bgs.geoid != (
52 select
53 geoid
54 from
55 a
56 )
Finally, what about polygons that share a percentage of the total perimeter of the source polygon? We can use ST_Perimeter to find that perimeter after casting our source polygon to the GEOGRAPHY data type, then divide the length of the shared border by the perimeter value (Figure 5.10, on page 339):
Listing 5.21: Polygons that share more than 25 percent of the perimeter of the source polygon
1 with a as (
2 select
3 *
4 from
5 nyc_2021_census_block_groups
6 where
7 geoid = '360470201002'
8 )
9 select
10 bgs.geoid,
11 geom,
12 st_npoints(
13 st_intersection(
14 bgs.geom,
15 (
16 select
17 geom
18 from
19 a
20 )
21 )
Figure 5.9: All polygons touching Block Group 360470201002 that share a border of 100 meters
22 ) as intersected_points,
23 st_length(
24 st_intersection(
25 bgs.geom,
26 (
27 select
28 geom
29 from
30 a
31 )
32 ) :: geography
33 ) as length,
34 (
35 -- Finding the length of the shared border
36 st_length(
37 st_intersection(
38 bgs.geom,
39 (
40 select
41 geom
42 from
43 a
44 )
45 ) :: geography
46
47 -- Dividing that by the perimeter of the source polygon
48 ) / st_perimeter(
49 (
50 select
51 geom :: geography
52 from
53 a
54 )
55 )
56 ) as percent_of_source
57 from
58 nyc_2021_census_block_groups bgs
59 where
60 -- Finding touching polygons that share more than 25% of
61 -- the source polygon's border
62 (
63 st_length(
64 st_intersection(
65 bgs.geom,
66 (
67 select
68 geom
69 from
70 a
71 )
72 ) :: geography
73 ) / st_perimeter(
74 (
75 select
76 geom :: geography
77 from
78 a
79 )
80 )
81 ) >.25
82 and bgs.geoid != (
83 select
84 geoid
85 from
86 a
87 )
And then there was one. With that let’s move on to our next exercise.
5.6 Finding the most isolated feature
As you have seen so far, there are a lot of ways we can perform spatial analysis with spatial SQL, but not all analyses are built the same, especially as the amount of data needed in our analysis increases. One of the most important skills you can learn as your data scale increases is finding different ways to minimize the problem at hand in order to increase the analysis speed. One way we can look at this is by trying to find the most isolated building in New York City.
The brute force method would be to measure the distance from each building to its nearest building. But with just over 1.2 million buildings in our dataset this would be quite inefficient. Below is an example of the code we could run, but I do not recommend running it; when I ran it on my computer, I ended up stopping the query after 10 minutes.
Figure 5.10: All polygons touching Block Group 360470201002 that share at least 25% of a border
3 ogc_fid,
4 st_transform(geom, 4326) as geom,
5 closest.distance
6 from
7 building_footprints b
8 cross join lateral (
9 select
10 st_distance(geom, b.geom) as distance
11 from
12 building_footprints
13 order by
14 geom <-> b.geom
15 limit
16 1
17 ) closest
18 order by
19 closest.distance desc
This is where the creativity comes in, and by creativity I mean finding methods to limit the amount of
data you actually need to analyze. Some other potential examples of this are:
• Imagine you have millions of points that you want to create drive time isochrones around. Instead
of running for each point, approximate the drive times by using a grid and creating the isochrones
from each.
• In our raster use case you can use H3 cells to aggregate and approximate your raster data for
faster analysis (an analysis we will perform later in the book).
For our use case, we need to limit the number of buildings we are evaluating in our distance measure-
ments, and we can do this using H3 cells. However, this is not an original idea. Simon Wrigley put
together a great tutorial using a similar process in PostGIS to find the most remote building in Great
Britain112 .
Since we have already installed the two H3 libraries we need in our Docker container, we can simply install them in our database using two queries:
create extension h3
create extension h3_postgis
Our first step is to add a column in our building footprints dataset to store our H3 index:
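A minimal sketch, assuming the table is named nyc_building_footprints as in the later queries, and using the h3index type provided by the extension:

alter table
    nyc_building_footprints
add
    column h3 h3index;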
Next we update the table to add an H3 index created from the centroid of each building footprint polygon at H3 level 10. The H3 indexes run from level 0 to 15, with lower levels being larger (level 1 cells cover roughly 600 to 700 billion square meters each) and 15 being the smallest (at just under 1 square meter). You can see this in this web app113 that allows you to view H3 cells at various levels. To create the cells we will first turn the buildings into centroids, then use the function h3_lat_lng_to_cell to create the cell ID:
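A sketch of that update; it assumes the h3_postgis overload of h3_lat_lng_to_cell that accepts a geometry directly:

update
    nyc_building_footprints
set
    h3 = h3_lat_lng_to_cell(
        -- centroid of the footprint, in lat/lng
        st_centroid(st_transform(geom, 4326)),
        -- H3 resolution
        10
    );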
This only takes about 29 seconds to complete (in case you were keeping track) on my computer. From
here the next logical step is to look at all the H3 cells that have the lowest number of buildings in them.
In our case there are several H3 cells with only one building in them. If you replicate this analysis and
you use a higher resolution cell you may need to select the cells under a certain threshold or go to a
lower threshold until you get cells that only have one building.
The next logical step would be to group all the cells by the count of buildings within them and then
visualize the H3 cells, but instead of adding another step we can actually select the building geometries
that fall in a cell with a count of 1. This will be a refresher from one of our earlier chapters but what
clause would allow us to filter by an aggregate?
___
This is one we haven’t used much, but the correct answer is HAVING which acts just like a WHERE clause
but with aggregated data. What we can do is select everything from our main buildings table, and
using a subquery we can query all the H3 IDs that have a count of 1 using HAVING to filter the data
(Figure 5.11, on page 342).
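A minimal sketch of that pattern:

select
    *
from
    nyc_building_footprints
where
    h3 in (
        -- all H3 cells that contain exactly one building
        select
            h3
        from
            nyc_building_footprints
        group by
            h3
        having
            count(*) = 1
    )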
112 https://loc8.cc/sql/simon-wrigley-geo-big-data
113 https://wolf-h3-viewer.glitch.me/
At this point we know the buildings that are potential candidates for the most remote, and using this method we have narrowed our candidates down to 2,187 from over 1.2 million. With this significantly smaller set we can find the distance from each of these buildings to its nearest building by combining the subquery with HAVING from our previous step with parts of the brute force query from the beginning of the section:
31 nyc_building_footprints
32 group by
33 h3
34 having
35 count(*) = 1
36 )
37 order by
38 closest.distance desc
bin distance
2127308 1005.9038869674532
3397806 745.4296782710237
2128487 671.5739992682301
2124071 490.094313204366
4540195 455.3490996545422
With that we can see that there is one building that is about 1,000 meters from its nearest neighbor, which makes it the most remote building in New York City (Figure 5.12). As you can see, the most remote building in New York City, by a margin of about 300 meters, sits near the 9th hole of the Van Cortlandt Park Golf Course in the northern part of the Bronx.
5.7 Kernel density estimation (KDE)
While seeing points on a map can give you some sense of how dense a layer of data might be, quantifying how densely compacted those features are is another task altogether. One popular approach is kernel density estimation (KDE), which creates an estimated, or smoothed, view of the distribution of the data given the weighted distances between the data points. As I am not a statistician, I will refer you to this link114, which provides a clear definition and visualization of how a KDE works with non-spatial data, as well as to this quote from the University Consortium for Geographic Information Science.
"The kernel density estimate at a location will be the sum of the fractions of all observations at that location. In a
GIS environment, kernel density estimation usually results in a density surface where each cell is rendered based
on the kernel density estimated at the cell center."
Fortunately for us, a former colleague of mine, Abel Vázquez Montoro, implemented this as a function
in PostGIS which we can use since it is publicly available as a Gist115 on his GitHub. With that let’s
take a look at the function and break apart what is happening.
First the function is declared with the arguments:
• ids - an array of integers
• geoms - an array of geometries
We can see that the function returns a table with the following structure:
• id - integer or bigint
• geom - geometry
• kdensity - integer
In this next section we declare some variables using DECLARE, which only needs to appear once. Then the function body starts after the BEGIN keyword. The first three steps assign values to the variables. The := operator is the same as an equals sign, just the PL/pgSQL-compliant version116. The variable assignments include (in order):
• The centroid of all the geometries (mc)
• The length of the ids array (c)
• The square root of the factorial of the natural logarithm of 2
114 https://mathisonian.github.io/kde/
115 https://gist.github.com/AbelVM/dc86f01fbda7ba24b5091a7f9b48d2ee
116 https://www.postgresql.org/docs/current/plpgsql-statements.html#PLPGSQL-STATEMENTS-ASSIGNMENT
Now we construct the query which will be run by the function. In our first CTE the query contains the point ID, the geometry, and the distance to the centroid of all geometries. In the FROM statement, the query un-nests the arrays which contain the IDs and geometries and gives them the shorthand aliases gid for the ID and g for the geom.
Our next CTE finds the median value using the PERCENTILE_CONT function, an ordered-set aggregate used here to find the value at the median (0.5):
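As a minimal illustration of the ordered-set aggregate syntax (the dist relation and its distance column here are placeholders):

select
    -- the value at the 50th percentile, i.e. the median
    percentile_cont(0.5) within group (order by distance) as median_distance
from
    dist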
The next CTE performs the KDE equation, which finds the square root of:
• the sum of each X coordinate minus the X coordinate of the centroid, to the 2nd power, divided by the array length,
• plus the sum of each Y coordinate minus the Y coordinate of the centroid, to the 2nd power, also divided by the array length.
The next CTE creates a standard search radius based on the documentation here117. This takes a constant of 0.9 multiplied by the smaller of two values, found using LEAST (which returns the smallest value from a set of arguments): the value from the previous CTE, and the constant "k" multiplied by the median. That result is then multiplied by the length of the array raised to the power of -0.2. That is a lot, but effectively it is just an expression of the equation listed in the documentation link in the footnotes.
Next we assemble the final query. Here we are performing a cross join between the data in the CTE
dist and the search radius CTE, sr. To find the number of features in the cluster for each point, we
perform a pure LATERAL join that finds the count of points within the length of the search radius from
the center of mass to the specific location under evaluation.
117 https://loc8.cc/sql/arcgis-kernel-density
Now we can run our query and create a new table for all the trees in the East Village:
And here is our final map (Figure 5.13, on the facing page):
5.8 Isovist
We have already covered line of sight analyses, and later we will introduce isochrones, or drive times. An isovist is a unique analysis that computes the area visible from a specific point, and it is another analysis we can perform with spatial SQL. This specific analysis and function was once again developed by Abel Vázquez Montoro, and you can see the original code here118.
For this example we will find the isovist for the center of Times Square, one of the most well-known areas in New York City, and also the one that likely has the highest density of billboards in the city, if not the world. First we will show the analysis results step by step, and then show the entire function. To run the example queries in the next steps, you can add each section by replacing the outermost query from the previous code snippet (such as select * from step_1) with the code from the next code snippet.
In the first three steps we will select our buildings from our NYC Buildings dataset. The function itself
takes in an array of geometries, so our first step will be to unnest that array. From there we find all the
buildings within 630 meters, which is roughly the distance that I was able to see from the selected point
to another point on Google Street View in Times Square. The last step is to create a buffer and add that
as a geometry to the data (Figure 5.14, on page 351):
118 https://abelvm.github.io/sql/isovists/
30 where
31 st_dwithin(
32 st_setsrid(st_makepoint(-73.985136, 40.758786), 4326) :: geography,
33 geom :: geography,
34 630
35 )
36 ),
37 buildings as(
38
39 -- Union these two tables together
40 select
41 geom
42 from
43 buildings_crop
44 union
45 all
46 select
47 st_buffer(
48 st_setsrid(st_makepoint(-73.985136, 40.758786), 4326) :: geography,
49 630
50 ) :: geometry as geom
51 )
52 select
53 *
54 from
55 buildings
The next step creates a set of rays around the center point. This uses ST_MakeLine to create lines between the center point and points generated with a function called ST_Project119, which projects a point a given distance and azimuth (angle) from another point, in this case our center point. What we end up with is a shape that looks like a ray burst around our center point (Figure 5.15, on page 352):
Note that instead of adding the same code we will just show the next parts that are added in. Make
sure to combine the CTEs using commas.
119 https://postgis.net/docs/en/ST_Project.html
        4326
    ) as geom

-- Builds the series of integers used in ST_Project
from
    generate_series(0, 100) as t(n)
)
select
    *
from
    rays
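Pieced together with the literal values used in this example, the rays CTE might look roughly like this (a sketch modeled on the complete function shown later in this section; the 3.6 degree step is simply 360 degrees divided by the 100 rays from generate_series, and 630 meters matches the buffer distance):

with rays as (
    select
        t.n as id,
        st_setsrid(
            st_makeline(
                st_setsrid(st_makepoint(-73.985136, 40.758786), 4326),
                st_project(
                    st_setsrid(st_makepoint(-73.985136, 40.758786), 4326) :: geography,
                    630,
                    radians(t.n :: numeric * 3.6)
                ) :: geometry
            ),
            4326
        ) as geom
    from
        generate_series(0, 100) as t(n)
)
select * from rays;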
The steps in between intersect each ray with the building boundaries and keep only the closest intersection point per ray, stored in a CTE named intersection_closest (these CTEs appear in full in the complete function below):

select
    *
from
    intersection_closest
Finally, we have a set of queries that build a polygon around those points:
-- Creates a line from the geometries and then a polygon from the line
-- of the closest intersecting points
isovist_0 as(
    select
        st_makepolygon(st_makeline(geom)) as geom
    from
        intersection_closest
),

-- Selects the polygons that actually intersect the building
isovist_buildings as(
    select
        st_collectionextract(st_union(b.geom), 3) as geom
    from
        isovist_0 i,
        buildings_crop b
    where
        st_intersects(b.geom, i.geom)
)

-- This clips the building footprints from the isovist
-- created above to give the most accurate view.
-- We use the COALESCE function to return the first non
-- NULL result so we can ignore any non-intersecting buildings
select
    coalesce(st_difference(i.geom, b.geom), i.geom)
from
    isovist_0 i,
    isovist_buildings b;
And this returns our final isovist (Figure 5.17, on the following page):
The complete function is as follows. It also has some additional arguments with default values, which
means you don't have to include them unless you want to modify the defaults:
rays as(
    select
        t.n as id,
        st_setsrid(
            st_makeline(
                center,
                st_project(
                    center :: geography,
                    radius + 1,
                    radians(angle_0 + t.n :: numeric * arc)
                ) :: geometry
            ),
            4326
        ) as geom
    from
        generate_series(0, rays) as t(n)
),
intersections as(
    select
        r.id,
        (
            st_dump(st_intersection(st_boundary(b.geom), r.geom))
        ).geom as point
    from
        rays r
        left join buildings b on st_intersects(b.geom, r.geom)
),
intersections_distances as(
    select
        id,
        point as geom,
        row_number() over(
            partition by id
            order by
                center <-> point
        ) as ranking
    from
        intersections
),
intersection_closest as(
    select
        -1 as id,
        case
            when fov = 360 then null :: geometry
            else center
        end as geom
    union all (
        select
            id,
            geom
        from
            intersections_distances
        where
            ranking = 1
        order by
            id
    )
    union all
    select
        999999 as id,
        case
            when fov = 360 then null :: geometry
            else center
And to use the function with some different values for our radius and number of rays (Figure 5.18, on
the next page):
5.9 Expert Voices: Danny Sheehan
I first learned spatial SQL in the CartoDB Editor web mapping tool.
Can you share an interesting way or use case that you are using spatial SQL for today?
For a lot of use cases I use Amazon Athena and store data as Parquet; this is an easy and low-cost way
to run spatial queries on cloud native file formats. I'm eager to see developments and support for
GeoParquet along with other formats like netCDF, etc.
Part 3
1. Suitability Analysis
One area that I have done a good amount of work in during the course of my career is suitability analysis,
or analyzing which areas might be best suited to locate a specific feature using many spatial attributes.
For example, this could be a new retail store that wants to be in an area that reaches its target market.
Using demographic data, we can combine many variables to find suitable locations for this new store.
This is a simple example, but we will see more as we dive into this chapter.
1.1 Market Expansion Potential
A common suitability analysis is evaluating the potential fit of areas for expansion. This could be for
any type of site, such as a retail location, government offices or service centers, distribution centers, or
any other site that requires a certain set of conditions to thrive. In this case we will be using pharmacies
in New York City, which can be located close together in some areas and farther apart in others, but
overall there are many pharmacies in the city competing for space and business.
The data we will be using was extracted from OpenStreetMap via Overpass Turbo120, a user interface
wrapped around the Overpass API. To import this data we can use this command:
For this analysis we are going to analyze the potential within New York City neighborhoods for a
theoretical competing pharmacy that wants to co-locate near existing Duane Reade locations (a specific
pharmacy franchise) and compete directly for their business.
Once again this will have many CTEs to highlight each step of the analysis. First we will select all the
pharmacies that match the name Duane Reade:
120 https://overpass-turbo.eu/
Next we are going to add the name of each neighborhood that each store resides in to the stores data
using ST_Intersects:
In the third step, we will create a buffer of 800 meters around each store which will require us to
transform the geometry to EPSG 3857 to make sure our geometry uses meters rather than decimal
degrees:
In the fourth step, we will transform our geometry back to EPSG 4326 and union those geometries,
grouped by neighborhood which will result in one or more buffers as single geometries for each neigh-
borhood (Figure 1.1, on the facing page):
In our fifth step, we will enrich each of the grouped buffers with total population while also calculating
the area of the combined buffers. This will allow us to calculate the population density for each of the
grouped buffers. In this case we are simply analyzing the population density for these areas, but you can
continue to add variables into the analysis as you see fit, including points of interest, transportation
stops/stations, other types of stores, and more.
Figure 1.1: Buffers around store locations grouped and unioned by neighborhood
        )
    ) as pop,
    st_area(st_transform(d.geom, 3857)) as area
from
    d
    join nys_2021_census_block_groups bgs on st_intersects(bgs.geom, d.geom)
group by
    d.geom,
    d.neighborhood
)
In our final step we will simply calculate the population density for each of these areas:
with a as (
    select
        id,
        amenity,
        brand,
        name,
        geom
    from
        nyc_pharmacies
    where
        name ilike '%duane%'
),

-- Spatially join the Duane Reade stores to neighborhoods and adds neighborhood names
b as (
    select
        a.*,
        b.neighborhood
    from
        a
        join nyc_neighborhoods b on st_intersects(a.geom, b.geom)
),

-- Creates a buffer for each store
c as (
    select
        id,
        st_buffer(st_transform(geom, 3857), 800) as buffer,
        neighborhood
    from
        b
),

-- Union the buffers together by neighborhood
d as (
    select
        st_transform(st_union(buffer), 4326) as geom,
        neighborhood
    from
        c
    group by
        neighborhood
),

-- Calculates the proportional population for each group of buffers
-- and also the area of the grouped buffers
e as (
    select
        d.*,
        sum(
            bgs.population * (
                st_area(st_intersection(d.geom, bgs.geom)) / st_area(bgs.geom)
            )
        ) as pop,
        st_area(st_transform(d.geom, 3857)) as area
    from
        d
        join nys_2021_census_block_groups bgs on st_intersects(bgs.geom, d.geom)
    group by
        d.geom,
        d.neighborhood
)

-- Calculates the population density
select
    neighborhood,
    pop / area as potential
from
    e
order by
    pop / area desc
And we can see the top five neighborhoods for the competitor to focus their search:
neighborhood potential
Kips Bay 0.02320599976630974
Inwood 0.023181512204697885
Fordham 0.022357258309427155
Gramercy 0.021964586699969726
East Village 0.021819711504376737
While this is a simple example, you can take this base and add more variables, expand the areas, and
analyze market potential in multiple different ways. Another great example of this is siting a new
distribution center for a company that focuses on small towns. Once the company has located a candidate
site, using a radius of 500 miles it is possible to count the towns that already have a store and the towns
that do not, representing the potential footprint expansion for that company.
1.2 Similarity Search or Twin Areas
Calculating twin areas, or performing a similarity search, is another analysis we can perform to find the
similarity between one target area and multiple areas that contain the same data points. We can do this
with our neighborhoods and trees to find the neighborhood that has the most similar tree makeup to
Harlem in Manhattan. To do this we first need to create a new table with columns that contain various
tree types as well as their percentage of the total trees in the neighborhood:
    (
        count(t.*) filter (
            where
                t.spc_common ilike '%maple%'
        ) :: numeric / a.total_trees
    ) as maple,
    (
        count(t.*) filter (
            where
                t.spc_common ilike '%linden%'
        ) :: numeric / a.total_trees
    ) as linden,
    (
        count(t.*) filter (
            where
                t.spc_common ilike '%honeylocust%'
        ) :: numeric / a.total_trees
    ) as honeylocust,
    (
        count(t.*) filter (
            where
                t.spc_common ilike '%oak%'
        ) :: numeric / a.total_trees
    ) as oak

-- Joins the above data with data from the original CTE
from
    nyc_2015_tree_census t
    join nyc_neighborhoods n on st_intersects(t.geom, n.geom)
    join a using (neighborhood)
group by
    n.neighborhood,
    a.total_trees
First we calculate the count of the total trees in our subquery, so we don't have to do that each time
going forward. Then we use FILTER to do a conditional count and cast it to the numeric data type.
Then we perform a JOIN with our normal ST_Intersects pattern. From here we can turn the result into
a table, so we can query the table later on as needed rather than re-running the query each time.
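The opening CTE that the fragment above references (aliased a, holding the total tree count per neighborhood) might look roughly like this; the exact form is an assumption based on the columns used above:

with a as (
    select
        n.neighborhood,
        count(t.*) :: numeric as total_trees
    from
        nyc_2015_tree_census t
        join nyc_neighborhoods n on st_intersects(t.geom, n.geom)
    group by
        n.neighborhood
)
select * from a;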
Now, while there are far more structured ways to find similarity from one area to others, such as the
principal component analysis outlined in this blog post121, for our purposes we will use a simple yet
effective technique to create a composite score that measures how different the tree count by species is
from Harlem. You could change this to the density of trees by area or any other measure that you see fit.
But for the first step let's calculate the difference from the original dataset.
121 https://carto.com/blog/spatial-data-science-site-planning
Here we end up with data that shows us the difference in each tree category from Harlem, which will
be either a negative or positive number; the closer to 0, the more similar the neighborhood is to Harlem
for that specific tree species. Of course the numbers can vary quite a bit between the neighborhoods, so
in this case we will normalize our data from 0 to 1. The way to do this is to take the value in the row,
subtract the minimum value of that column of data, and then divide by the difference between the
maximum and minimum values in the column. In equation form:
zi = (xi - min(x)) / (max(x) - min(x))
As you will see below this will become quite a long query, so it will be annotated with comments that
correspond to the detailed notes in the list below:
• A: This is the first step of finding the difference between the tree species counts in Harlem com-
pared to every other neighborhood
• B: Here we will find the minimum and maximum values for every column and store those values
in an array, so we can save some keystrokes in this query and the next
• C: Here we are doing three things
– First we find the absolute value of the difference, which will make all the results
positive numbers. This means that the closer to 0, the more similar to the tree
makeup in Harlem
– Then we will calculate the 0 to 1 scaled index. This may seem backwards, since most
would interpret 1 as a better score, but in our case 0 is in fact a score closer to the
tree composition in Harlem
– This means we will subtract the 0 to 1 value from 1 to reverse the index, so now
the closer to 1, the closer to the tree counts in Harlem
• D: Here we add the values for each tree species for each row and then divide by 5 so we can
return the index from 1 to 0
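A condensed sketch of that query follows. It assumes the per-neighborhood species shares were saved to a table named nyc_neighborhood_trees (a hypothetical name) and, to stay short, uses only the four species visible in the earlier fragment, so the composite divides by 4 rather than 5:

with diffs as (
    -- A (plus the first part of C): absolute difference of each species share from Harlem's share
    select
        t.neighborhood,
        abs(t.maple - h.maple) as maple,
        abs(t.linden - h.linden) as linden,
        abs(t.honeylocust - h.honeylocust) as honeylocust,
        abs(t.oak - h.oak) as oak
    from
        nyc_neighborhood_trees t,
        nyc_neighborhood_trees h
    where
        h.neighborhood = 'Harlem'
),
minmax as (
    -- B: min and max of each column stored in arrays
    select
        array [min(maple), max(maple)] as maple_s,
        array [min(linden), max(linden)] as linden_s,
        array [min(honeylocust), max(honeylocust)] as honeylocust_s,
        array [min(oak), max(oak)] as oak_s
    from
        diffs
),
scaled as (
    -- C: scale each absolute difference from 0 to 1, then reverse it so 1 means most similar
    select
        d.neighborhood,
        1 - ((d.maple - m.maple_s [1]) / (m.maple_s [2] - m.maple_s [1])) as maple_i,
        1 - ((d.linden - m.linden_s [1]) / (m.linden_s [2] - m.linden_s [1])) as linden_i,
        1 - ((d.honeylocust - m.honeylocust_s [1]) / (m.honeylocust_s [2] - m.honeylocust_s [1])) as honeylocust_i,
        1 - ((d.oak - m.oak_s [1]) / (m.oak_s [2] - m.oak_s [1])) as oak_i
    from
        diffs d,
        minmax m
)
-- D: add the per-species indexes and divide to get the composite score
select
    neighborhood,
    (maple_i + linden_i + honeylocust_i + oak_i) / 4 as similarity
from
    scaled
order by
    similarity desc;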
After we create a new table joining these results to the neighborhood geometries, we can check the
results and style them in QGIS to see which neighborhoods are the most similar (Figure 1.2, on the next
page):
1.3 Suitability or Composite Score
Composite scoring is an analysis that has proved highly effective and useful for combining datasets
and creating outputs that can be easily understood by many audiences. Once again, there are more
advanced methods for performing this analysis, such as using residuals from a machine learning model;
this version only uses the data in a normalized form alongside the other variables. You can also weight
variables by multiplying them by a fraction; for example, if you want to weight a variable half as much
as the other variables you can multiply its index by 0.5.
While you can use geometries to do this analysis, I recommend using a spatial index since it creates
cells of equal area and makes it easier to spot trends that might otherwise be hidden inside polygons
that vary in size and extent. For this example we will use four datasets to find a suitable area in New
York City, using three demographic variables and tree counts in each H3 cell, with the H3 functions in
PostGIS. We will try to find areas with the highest number of trees, areas with median annual household
incomes closest to $70,000, areas with a median age closest to 35, and areas with the highest population.
Again we will use our US Census Block Group data for New York City to create our H3 layer. We need
to perform two additional steps before writing our query. The first is to assign an H3 index to our trees,
which we can do using the h3_lat_lng_to_cell function from the H3 library. To do that we need to
create a new column on our dataset:
alter table nyc_2015_tree_census add column h3 text;
And then update that column with our new H3 index, at level 10:
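A sketch of that update, assuming the PostGIS-geometry overload of h3_lat_lng_to_cell that accepts a point geometry and a resolution, with the result cast to text to match the new column:

update
    nyc_2015_tree_census
set
    -- level 10 cell for each tree point (h3index cast to text for the text column)
    h3 = h3_lat_lng_to_cell(geom, 10) :: text;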
Next we need to add H3 cells to fill in the polygons of the census block groups, and for this we can use
a function called h3_polygon_to_cells. This function will not produce any overlapping H3 cells, since
it uses the centroid of each cell to determine whether it falls in or out of the target polygon. As described in
the source H3 documentation122:
Containment is determined by the cells’ centroids. A partitioning using the GeoJSON-like data structure, where
polygons cover an area without overlap, will result in a partitioning in the H3 grid, where cells cover the same
area without overlap.
To do this we will create a new table with the resulting H3 cells by selecting the block groups that fall
within the New York City counties (Figure 1.3).
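A sketch of that table creation (the output table name and the county filter are assumptions; the five FIPS prefixes are the Bronx, Kings, New York, Queens, and Richmond counties):

create table nyc_bg_h3 as
select
    geoid,
    -- set-returning call: one row per level 10 H3 cell whose centroid falls in the block group
    h3_polygon_to_cells(geom, 10)
from
    nys_2021_census_block_groups
where
    -- keep only block groups in the five NYC counties (geoid cast to text in case it is numeric)
    left(geoid :: text, 5) in ('36005', '36047', '36061', '36081', '36085');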
We can now see the data from our new table; here are the first 5 rows:
122 https://h3geo.org/docs/api/regions#polygontocells
geoid h3_polygon_to_cells
360050001000 8a2a100f406ffff
360050001000 8a2a100f46a7fff
360050001000 8a2a100f4047fff
360050001000 8a2a100f5db7fff
360050001000 8a2a100f478ffff
As you can see there are multiple indexes for each block group ID, and we can also see how many H3
cells exist for each block group:
geoid count
360819901000 8663
360859901000 5419
360810716001 1270
360470702030 1200
360479901000 1177
So now we can proceed to creating our suitability scores. While many of the steps in this analysis could
be combined for potential performance gains, this walkthrough keeps each step in its own CTE to show it
in more detail. We will go step by step and then show the complete query at the end. For our first step
we simply need to count the number of H3 cells within each block group, since we will use that count to
apportion the population in a later step:
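A sketch of that first CTE, assuming the nyc_bg_h3 table from the previous step (both the table and column names are assumptions):

with a as (
    select
        geoid,
        count(*) as cells
    from
        nyc_bg_h3
    group by
        geoid
)
select * from a;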
Next we will join our table of H3 cells to the total population of each block group. This will be inaccurate
at first, since we are joining the total population for each block group to every occurrence of an H3 cell for
that block group. We will divide the total population across the cells in the next step.
Next we will divide the population evenly among each H3 cell within each block group using the count
from our first CTE table with the alias a:
In our fourth step we will build our calculations to find areas that are closest to our target values of
$70,000 of median household income and median age of 35. To do this we will use the same process as
the Harlem tree similarity analysis:
• Subtract the value for each row from the target values
• Turn the negative values into positive values resulting in a value of 0 being the best match
Our fifth step will simply join the H3 index of the previous CTE to the count of trees in each H3 cell
from our trees dataset:
e as (
    select
        d.h3,
        count(t.ogc_fid) as trees
    from
        d
        join nyc_2015_tree_census t on d.h3 :: text = t.h3
    group by
        d.h3
)
The sixth step will look similar to the code we wrote in our similarity search query to calculate the
minimum and maximum values across each column, so we can use those to normalize our data in the
next step. Once again we will use arrays to cut down on keystrokes:
The seventh step is to simply join the results we need from the data stored in the CTEs with the aliases
d and e:
And our eighth and final CTE calculates the normalized values for each feature from 1 to 0. For the
income and age values we will reverse these by subtracting the normalized value from 1, since in the
base data 0 represents the best fit, just as we did in the previous analysis.
And the last step is to query all our results and create one final index column by adding each individual
index together. We can divide this by 4 if we want a 0 to 1 score; otherwise we can leave it as is, which
means a perfect fit will have a value of 4.
from
    d
    join nyc_2015_tree_census t on d.h3 :: text = t.h3
group by
    d.h3
),

-- Add the min and max values for each data point to an array
f as (
    select
        array [min(trees), max(trees)] as trees_s,
        array [min(pop), max(pop)] as pop_s,
        array [min(income), max(income)] as income_s,
        array [min(age), max(age)] as age_s
    from
        e
        join d on d.h3 = e.h3
),

-- Join the two previous CTEs together
g as (
    select
        e.trees,
        d.age,
        d.income,
        d.pop,
        d.h3
    from
        d
        join e on d.h3 = e.h3
),

-- Calculate the 0 to 1 index
h as (
    select
        g.h3,
        (
            (g.trees :: numeric - f.trees_s [1]) / (f.trees_s [2] - f.trees_s [1])
        ) as trees_i,
        1 - ((g.age - f.age_s [1]) / (f.age_s [2] - f.age_s [1])) as age_i,
        1 - (
            (g.income - f.income_s [1]) / (f.income_s [2] - f.income_s [1])
        ) as income_i,
        ((g.pop - f.pop_s [1]) / (f.pop_s [2] - f.pop_s [1])) as pop_i
    from
        g,
        f
)

-- Add up to find the final index value
select
    *,
    trees_i + age_i + income_i + pop_i
from
    h
If you want to visualize this data in QGIS or another tool you will need to use the function
h3_cell_to_boundary to turn the H3 cell into a geometry. However, KeplerGL, an open source tool
originally developed by Uber (which is also where H3 originated), can read the H3 index as a string
without a geometry.
To do this we first need to download our results as a CSV file, which we can do with a click inside of
pgAdmin. Once the query has completed, you should see a download button (Figure 1.4, on the next
page).
You will then be prompted to give the file a name (Figure 1.5).
Once you have completed these steps you can navigate to the Kepler website123 and click the "GET
STARTED" button on the homepage (Figure 1.6, on the next page).
The first thing you should see after clicking this button is a prompt to upload data, in this case the
CSV which we just created. Go ahead and either navigate to the file on your computer by clicking the
"browse your files" link or drag it onto the upload area (Figure 1.7, on the following page):
The map will open, and it should automatically recognize that you have an H3 index in your data and
style it for you too (Figure 1.8, on page 381):
Personally, I think it is awesome that we just created a map visualization without a geometry. It is one
of the things I greatly appreciate about spatial indexes: they are visually easy to approach and also create
compelling visualizations. We can also add a filter to answer our question of what the most suitable
areas for our study are. To do so, first navigate to the filter icon at the top of the left-hand menu (Figure 1.9,
on page 381):
Once you have clicked that button, then click "+ Add Filter" (Figure 1.10, on page 382):
In the next menu click on the dropdown menu and then select the value for "final_index" (Figure 1.11,
on page 382):
Now you can select the filter to see the areas with the best results. First the most suitable areas (Figure
1.12, on page 383):
123 https://kepler.gl/
It looks like there are some suitable areas in Brooklyn along two well-known tree-lined parkways (the
pattern of straight lines): Eastern Parkway in the north and Ocean Parkway in the south. There are also
several pockets in Queens and upper Manhattan. We can also see the least suitable areas (Figure 1.13,
on page 384):
Several areas stand out here in Manhattan in SoHo and very clearly in the Upper East Side, as well as
neighborhoods on the far eastern edge of Queens bordering Nassau County: Douglaston, Little Neck,
and Glen Oaks.
1.4 Mergers and Acquisitions
In our final suitability analysis we will analyze a common scenario where two entities, be it retail stores,
bank branches, grocery stores, or any type of physical location with a spatial market overlap, want
to merge in the most effective way to combine their footprints. The goal is to do this in a way that finds
the locations that make the most sense to stop operating or convert to a new model, without reducing
trade area coverage.
There are several ways to address this, including weighting the locations by their trade area coverage,
proximity to competitor locations, or proximity to access points like public transportation, highways,
etc. All of those features can be easily added to the trade areas once they are created using some of
the techniques we have already discussed so far in this book, so this query will simply show how to
accomplish the spatial aspects of the problem. In this example we will use two pharmacy chains, Duane
Reade and CVS, and find the most optimal solution to consolidate their physical footprint. Our first
step is to calculate the population within 800 meters of each store location. We will use the overlap
method to only count the population that falls inside the buffer and turn this into a new table:
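A sketch of that table creation (the table name pharmacy_stats is referenced in a later query; the buffer approach and the census table name are assumptions consistent with the rest of this chapter):

create table pharmacy_stats as
with buffers as (
    select
        id,
        name,
        -- 800 meter buffer built in EPSG 3857, then returned to 4326
        st_transform(st_buffer(st_transform(geom, 3857), 800), 4326) as buffer
    from
        nyc_pharmacies
)
select
    b.id,
    b.name,
    b.buffer,
    -- proportional overlap: only count the share of each block group that falls inside the buffer
    sum(
        bgs.population * (
            st_area(st_intersection(b.buffer, bgs.geom)) / st_area(bgs.geom)
        )
    ) as pop
from
    buffers b
    join nys_2021_census_block_groups bgs on st_intersects(b.buffer, bgs.geom)
group by
    b.id,
    b.name,
    b.buffer;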
Now we will start to find the stores we want to exclude from the final count. For that, let’s see how
many stores there are to begin (Figure 1.14):
At the beginning we have 674 total stores between the two brands in the greater New York City region,
although we will limit the results to those just in New York City. Our first step will be to exclude any
Duane Reade store that is within 200 meters of a CVS store. Our query begins with two CTEs to get
our store data for each brand while also creating a new column transforming the geometry to 3857. In
the final CTE we use that new column to find the stores that fall within 200 meters, then see how many
stores fall into that category, in this case 72 locations (keep in mind this is the number for the entire
New York City metro area and in the final analysis we will only focus on the stores inside the New
York City boundaries):
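A sketch of that query (the CTE names dr and cvs match the aliases used in the final query later in this section; the rest is an assumption):

with dr as (
    select
        id,
        name,
        st_transform(geom, 3857) as geom_3857
    from
        nyc_pharmacies
    where
        name ilike '%duane%'
),
cvs as (
    select
        id,
        name,
        st_transform(geom, 3857) as geom_3857
    from
        nyc_pharmacies
    where
        name ilike '%cvs%'
)
-- count the Duane Reade stores within 200 meters of any CVS
select
    count(distinct dr.id)
from
    dr
    join cvs on st_dwithin(dr.geom_3857, cvs.geom_3857, 200);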
We can also find the number of locations where 75% of the buffer area overlaps. To do so we calculate the
overlapping area between each Duane Reade buffer and the CVS buffers, divide it by the area of the
Duane Reade buffer in the WHERE clause, and filter out the results where that ratio is greater than 0.75,
which results in 31 total locations:
    where
        name ilike '%duane%'
)
select
    dr.*,

    -- Find the area of overlap between the two buffer groups
    st_area(
        st_intersection(pharmacy_stats.buffer, dr.buffer)
    ) / st_area(dr.buffer)
from
    dr
    join pharmacy_stats on st_intersects(dr.buffer, pharmacy_stats.buffer)
where

    -- Find the number of pharmacies that have an overlap greater than 75%
    st_area(
        st_intersection(pharmacy_stats.buffer, dr.buffer)
    ) / st_area(dr.buffer) > .75
    and pharmacy_stats.name ilike '%cvs%'
Since we did these analyses independently, we have a few more steps to finish our analysis. First we
want to see what our "before" scenario looks like, which is the total combined trade area of the two
brands' stores and the total population covered by that combined trade area. This means we want to
union the 800 meter buffers of both brands' locations so there is no double counting. In the CTE, we will
create the unioned buffers followed by the area of all the buffers. Then we will intersect the unioned
buffers with the block groups in the final query; if we intersected each buffer individually instead,
overlapping buffers would count the same population multiple times (Figure 1.15, on the following
page):
area pop
224669913.8932964 2527400.3703658446
223115112.9693205 2512136.967491225
Here you can see that the combined trade area covers about 224.7 square kilometers and about
2,527,400 people. To find out how our before and after scenarios compare, we first need to find the IDs of
the stores we need to remove. The steps of the query are outlined below:
• Then we perform our proximity analysis, with our first CTE named dr that finds all the Duane
Reade locations while also creating a geometry in EPSG 3857
• We do the same for a CTE named cvs
• Then we have a CTE named remove_nearest that removes all the Duane Reade locations within
200 meters of a CVS
• We then UNION the two tables together to create a table with one column that contains the IDs of
the Duane Reade locations that are to be removed
remove_nearest as (
    select
        dr.id
    from
        dr,
        cvs
    where
        st_dwithin(dr.geom_3857, cvs.geom_3857, 200)
)
select
    id
from
    remove_nearest
union
select
    id
from
    overlap
From this we get 31 total Duane Reade locations that are to be removed. With that we can use the same
query as before to calculate the combined area and population, with one extra CTE at the beginning
that removes the stores matching the IDs we plan to drop from consideration, and with the
nyc_pharmacies table name swapped for the new CTE alias p; otherwise the query remains unchanged
(Figure 1.16, on the next page):
        )
    ) as pop
from
    a
    join nyc_2021_census_block_groups bgs on st_intersects(a.buffer, st_transform(bgs.geom, 4326))
group by
    1,
    2
area pop
223115112.9693205 2512136.967491225
As you can see our new area is about 223.1 square kilometers and the population covered is 2,512,136, a
difference of only 1.6 square kilometers and 15,264 people. Overall this allows the combined operation to
continue to function efficiently!
2. Working With Raster Data in PostGIS
Up until this point we have only worked with vector data and for most of the history of spatial SQL, the
databases that utilized it only supported vector data. Of course, raster data plays an important role in
spatial analytics. At some point, from what I can gather (it was not exactly clear from paging through
16 pages of PostGIS release notes), PostGIS added support for raster data somewhere between PostGIS
1.5 and 2.1. As of PostGIS 3.0 all the raster functionality has been rolled into its own extension.
2.1 Raster Ingest
We will use the raster extension to perform a variety of different analyses, but first we need to import
some data into our database. To do so we will use a tool called raster2pgsql, which generates an
SQL file to ingest our raster data into PostGIS. To access this tool we need to run a second Docker
container of PostGIS. We cannot use our existing image since it won't work, and
generally running those commands in the same location as our data is not a best practice.
Luckily for us, the main PostGIS Docker instance has a version of PostGIS that we can use just for this
purpose. To get started we can pull the image using the docker pull command in our terminal. Open a
new terminal and run this command:
docker pull postgis/postgis:15-master
Alternatively you can also do this in the Docker desktop app. Go to the top search bar and search
for postgis. Look for the container named "postgis/postgis". If you don’t see it, you can search for
postgis/postgis directly (Figure 2.1, on the following page).
Once you have found it you can hover over that item and open the menu labeled "Tag". You can select
any image you want, but to match the instructions and version we are using in the book search for the
version with the tag 15-master and click on it and then click "Pull".
You can validate that the image was pulled via either process by clicking on the "Images" tab on the
left-hand side of Docker desktop, and you should see the image listed there (Figure 2.2, on the next
page):
To run the image we can also use the terminal or Docker desktop. To use the terminal you can enter
this command. Note there is one change we need to make which is highlighted in the code:
The -v flag creates a volume within the Docker container that connects to a folder on your computer.
In this case we will create a folder named "raster" which will store all the data that we want to import
into our main PostGIS database. You will need to replace this section with the folder location on your
computer where that is stored. On my computer my data is stored at this location:
/Users/mattforrest/Documents/spatial-sql-book/raster
Once you have done that you can go ahead and run the command in the terminal (Figure 2.3, on
page 395):
Note that you may or may not see a second warning about an image platform issue (linux/arm64/v8).
This is because my computer has an Apple Silicon CPU, not an Intel CPU, but this will work all the
same. To validate that the container is running after using either method, click on "Containers" and
you should see something like this (Figure 2.4, on the next page):
Now that we have our container running we are going to enable the postgis_raster extension in our
main database. To do so we can run this command inside pgAdmin:
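The command is the standard extension call:

CREATE EXTENSION postgis_raster;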
Once complete we are going to open a new terminal session to run the command to import raster data
into our main database. Once again you can open the terminal session by command line or use the
terminal tied to the container within Docker desktop. In this case we will use our local terminal.
Figure 2.4: You will see the new instance in your Docker Desktop
The first command will connect our terminal to our Docker container. This will work if you also used
the name "mini-postgis" otherwise you will need to change it to the name you used (Figure 2.5, on
page 396):
Next we can confirm that we have access to the raster2pgsql library by simply running that command
in the terminal (Figure 2.6, on page 397):
Make sure to double check that you placed the file denver-elevation.tif into the /raster folder.
And finally we will run our command to ingest our data. It is actually two commands strung together
that will complete the process in one step rather than two:
Go ahead and run the command in your terminal. It will stay open during the import process. Once
the command ends you should see an output like this in your terminal (Figure 2.7, on page 398):
This process may take a long time as the raster is large, but once it is complete you can confirm that the
raster has imported by selecting it from pgAdmin:
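For example, assuming the raster was loaded into a table named denver_elevation (the name used later in this chapter):

select * from denver_elevation limit 10;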
You will see that there are three columns: rid, rast, and filename. Each row represents a raster tile of
128 by 128 pixels. We can also confirm that this imported correctly by opening the raster with QGIS
(Figure 2.8, on page 399):
And that is it, we have imported our raster data into PostGIS. There are many other ways to pull your
data in, including from storage sources like Amazon Web Services S3, so you have plenty of options. But
for now we can take a look at some ways to use the raster functions in PostGIS.
Raster contours
Seeing as we just imported elevation data into our database, a logical next step would be
to create contour lines from our elevation raster. PostGIS has several functions124 that allow you to
perform analysis on rasters, such as DEM analysis, map algebra, raster outputs, and more.
The specific function we will be using is ST_Contour which has the following signature:
1. First is that this function will return a set of records which means we will need to select the records
out of the return value similar to when we have used ST_Dump.
2. Next is that since we imported our raster file as tiles, we will need to join those into one massive
raster otherwise it would only run on one raster tile.
124 https://postgis.net/docs/en/RT_reference.html
3. The geometry that we are used to using is now replaced with the column rast and we need to
designate a band (as there is only one in our raster it is 1).
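A usage sketch that puts those three points together, using a LATERAL join as one way to unpack the set of returned records (the 50-unit contour interval, the output column names, and the new table name are assumptions):

create table denver_contours as
select
    c.geom,
    c.value
from
    (
        -- union the 128x128 tiles into a single raster first
        select st_union(rast) as rast
        from denver_elevation
    ) merged,
    -- set-returning function: one row per contour line, run on band 1 with a 50-unit interval
    lateral st_contour(merged.rast, 1, 50.0) as c;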
If you create a table you can check it out in QGIS too and style it by the value (Figures 2.10, on page 401
and 2.11, on page 402):
2.2 Interpolation
Spatial interpolation, or the process of estimating values across space at locations where observations
are unknown, based on known observation points, is a common analysis supported by tools like QGIS
and workflows in Python. There are two primary methods for performing this analysis: inverse distance
weighted (IDW) and kriging. Kriging would require some extra work to implement, although it is
possible. PostGIS has already implemented a function for IDW which returns a raster with the
interpolation. To do this we will import weather station data from the New York State Mesonet at the
University at Albany, which collects weather observations at stations across the state.
Figure 2.10: Contour lines overlaid with the original raster data
Next there are three queries we need to run to generate our raster which we will walk through in detail.
These examples are from a blog post by Paul Ramsey from Crunchy Data125 which walks through the
process with a similar dataset. Let’s take a look at our first query:
125 https://www.crunchydata.com/blog/waiting-for-postgis-3.2-st_interpolateraster
            case
                when "apparent_temperature_max [degc]" = '' then null
                else "apparent_temperature_max [degc]" :: numeric
            end
        )
    ) as geom,

    -- Expands the grid to add room around the edges
    st_expand(st_collect(geom), 0.5) as ext
from
    nys_mesonet_202307
where
    time_end = '2023-07-23 12:00:00 EDT'
),
First we are creating an input table from our Mesonet data. Below are the parameters we are passing
into the query:
• 0.01::float8 AS pixelsize
– This defines the pixel size of the output raster in the unit of measurement of the
projection (4326 in this case)
• ’invdist:power:5.5:smoothing:2.0’ AS algorithm
– These are the options of the algorithm used to perform the interpolation which are
pulled from the GDAL in the gdal_grid tool126
126 https://gdal.org/programs/gdal_grid.html#interpolation-algorithms
First we select 1 as our ID as there is only one raster being created. Then we pass in the geom column
and algorithm column from the inputs. Then we have a long series of functions:
    -- Sets 1 as the raster ID since we only have one raster
    1 AS rid,
    ST_InterpolateRaster(

        -- The geometry collection that will be used to interpolate to the raster
        geom,

        -- The algorithm we defined
        algorithm,
        ST_SetSRID(

            -- This creates the band
            ST_AddBand(

                -- Creates the empty raster with our arguments from the previous CTEs
                ST_MakeEmptyRaster(width, height, upperleftx, upperlefty, pixelsize),

                -- Sets the default values as a 32 bit float
                '32BF'
            ),

            -- The SRID the raster will use
            ST_SRID(geom)
        )
    ) AS rast
FROM
    sizes,
    inputs;
With that we have created a new raster that we can visualize in QGIS. Simply find the new table that
you have created in the DB Manager section of QGIS and double click it to add it to the map (Figure
2.12):
Below are two examples of the outputs, one using an integer and one using a float in that order (Figures
2.13, on the following page and 2.14, on page 407).
Of course there are a lot more variables that go into temperature interpolation, so this is a very over-
simplified example, but it shows us two things. First is the ability to perform fast analysis on raster
data with spatial SQL. Second is the ability to structure input parameters using only SQL to make your
analysis highly repeatable.
2.3 Raster to H3
Another way that I have seen spatial indexes being used in the field is to aggregate raster data to an
H3 grid. While you can certainly perform map algebra on the raw raster data, H3 data has a few
advantages in a database setting. You can:
• Map H3 cells to parents or children, or higher or lower levels of cells
• Find distances between cells
• Find neighbors of cells
• Calculate rings of cells around a cell
Since the H3 cells are strings they are also highly performant for joins to other data and for visualization
in tools that support visualizing based on the strings. As for joining the H3 cells to raster data, the query
is quite simple, one of the things I love about using H3 and raster data together. Below is our complete
query:
As you can see there are only a few lines here. In our SELECT statement, we get the H3 cell from a POINT
geometry at level 12, and the average elevation in the cell from all the centroids that fall in the H3 cell.
In our FROM statement we select from a subquery which contains this query:
Here we are selecting the values from a function called ST_PixelAsCentroids, which is a set returning
function. We use a lateral join (note, not a cross lateral join) to join the values across the denver_elevation
table.
It returns four values: the x and y values for the raster, the band value, and the geometry
which represents the pixel centroid. With that we can run our query and create our table (this
took about 3 ½ minutes on my computer). After that you can export your table to a CSV and import it
into Kepler GL to see the visualization (Figure 2.15, on the following page):
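Assembled from that description, the whole statement might look roughly like this (the output table name and aliases are assumptions; the level 12 cell and the average of the pixel values follow the text above):

create table denver_elevation_h3 as
select
    -- level 12 H3 cell for each pixel centroid, kept as text for Kepler GL
    h3_lat_lng_to_cell(p.geom, 12) :: text as h3,
    -- average elevation of all pixel centroids that fall in the cell
    avg(p.val) as elevation
from
    denver_elevation d
    -- set-returning function: one row per pixel centroid of band 1
    join lateral st_pixelascentroids(d.rast, 1) as p on true
group by
    1;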
While the resolution is not as granular as it would be with the pure raster it is still quite small and
appropriate for approximate analysis with other vector or H3 datasets (Figure 2.16, on the next page):
3. Routing and Networks With pgRouting
There are a few challenges in geospatial that have created complete industries around them. Broadly
labeled as location data services, these are tools that can perform geocoding, routing, and other complex
network problems. These services are generally exposed via API, and this works great if you are in need
of a few routes or trade areas, but what if you want to create custom routes or run large batches of
routes? These location data services are not meant for this scale of batch analysis, but rather for one-time
calls.
If you have ever had this issue then I have good news for you. PostGIS has a great extension called
pgRouting that allows you to import your own network data, which in most cases can be any dataset with
connected lines that can be traversed in some way, such as hiking/biking paths, roads and highways,
boat routes, public transportation, and more. It also has tools that we will be using to import data from
OpenStreetMap, which gives you access to all the network data in OSM. We will use several tools and
queries inside pgRouting to create simple routes, create routes for car and bike travel, build a custom
cost solution for bike travel to prioritize safe travel for cyclists, build isochrones or trade areas, and
solve a simple traveling salesman problem. Our first step however is to import the data we are going
to use for our routes.
3.1 Prepare Data to Use in pgRouting
One of the first tasks we need to perform is to create a separate database to load our data into. Up to
this point we have used the default database which in our case is "gis". You can do this in the same
database if you choose, but I prefer to keep it separate just to keep things nice and tidy. To add the new
database, in pgAdmin right-click on the connection, in this case the connection called "Spatial SQL" if
you used the steps outlined in the book. Once the menu opens, select "Create" then click on "Database"
(Figure 3.1, on the following page).
This will open a new window where you can give your database a new name, in this case we will call
it "routing" (Figure 3.2, on page 411).
You can also see the code that you would need to do the comparable operation in SQL (Figure 3.3, on
page 412):
Once you add the database name, you can go ahead and click save, and your new database should be
created. Go ahead and click on the new database and open a new connection window to that database
where we can run the CREATE EXTENSION command to add pgRouting:
CREATE EXTENSION pgrouting;
If this for some reason doesn’t work (you may get an error that you need to install PostGIS as well),
simply add the word CASCADE at the end and run it again. Next we are going to use our extra PostGIS
instance we used in the raster section to run additional commands. If you do not have it running make
sure to restart it either in the Docker Desktop application or using this command:
Once again we will start a new terminal session in our local terminal using this command:
Our first step is to install a library called osm2pgrouting, which will allow us to import OSM data that
we have downloaded into our routing database. The process will generate the tables that pgRouting
needs to perform routing analysis. Go ahead and run this command in the Docker terminal
session:
You will also need to enter Y for yes when prompted. We also need to install a tool called osmctools
that helps us work with and manipulate OSM data. We can install that with this command, also
entering Y when prompted.
Now that we have our tools installed, we need to get some data. Fortunately we can extract batches of
OSM data using a tool named BBBike. You can open up this URL to open the interface where we can
extract our data - https://extract.bbbike.org/.
Once you have opened the website you can see there are several options, including one to find areas
to search for. While the data is also included in the book files, if you want to download the data for
yourself with a more recent version of the data, you can select the option at the top of the page to show
the lat/long options (Figure 3.4, on the next page):
Next enter the following values:
• Left lower corner
– Lng: -74.459
– Lat: 40.488
• Right top corner
– Lng: -73.385
– Lat: 41.055
Once all values are entered you should see a bounding box around the New York City area that looks
like this (Figure 3.5, on page 413):
Next select the option for "OSM XML gzip’d" from the "Format" dropdown. Finally, enter your email
address and a name for your dataset and hit "extract". Once you have downloaded your data you will
need to extract it from the gzip file, which will open the enclosed .osm file. Our next step is to use
osmfilter to filter out the data that we need. Below is the command we will use to do so:
There are three positional commands. The first is the path to our data within our Docker container.
Next is a filter starting with the -keep argument to keep only the data with the tag "highway". The last
command starting with -o is the path and name for the filtered OSM data. For the tag structure in the
-keep argument you can use a variety of combinations of OSM tags. Below are some examples from
the osmfilter documentation on the OSM Wiki127 :
./osmfilter norway.osm --keep="highway=primary =secondary waterway=river" >streets.osm
Once the process has completed your terminal will look something like this (Figure 3.6, on the next
page):
Next we will use osm2pgrouting to load the filtered data into our new database. Below is the full command:
127 https://wiki.openstreetmap.org/wiki/Osmfilter
    -p 25432 \
    -U docker \
    -W docker \
    --clean
The arguments should look similar to some of our other commands that we use to import data via the
command line:
The --clean argument will drop any previously existing tables if they have already been created. This
process will create several tables which we will look at after the import is completed. Go ahead and run
the above command (note that this is a long-running process and could take upwards of 10 minutes),
and once completed your terminal should look something like this (Figure 3.7, on the facing page):
Once the process is complete you will have all the data loaded into several tables that you will use
with pgRouting. Below is the table structure that you should see when you open the list of tables in
pgAdmin (Figure 3.8, on the next page):
There should be a total of 5 tables. The spatial_ref_sys table contains the spatial reference
information, or projection data, that comes along with PostGIS, and the pointsofinterest table will be
empty since we only chose to import the road data. The other three tables are:
• configuration: This contains information about the paths or "ways", such as the tag (i.e. highway,
bike, etc.), max speed, and more
• ways_vertices_pgr: This contains the points where ways intersect
• ways: These are the actual path segments, which also contain information about the distance, speed,
and more. They also contain columns that represent the cost and the cost in seconds, which will be
used to decide which ways to include or exclude in a route
3.2 Create a Simple Route in pgRouting
Now that we have our data imported to use in pgRouting we can create our first route. For this first
query we will use the pure distance cost method, which will find the most efficient route by distance
only. This route will look very different than what you might expect coming from something like Google
Maps or any other routing service. Following that query we will modify the configuration table to
create a weighted cost that will emphasize, or reward, the algorithm for choosing routes that are more
efficient for cars, focusing on using primary roads over side streets. First let's take a look at how we
can construct a query to calculate a route.
The function we will be using is the pgr_dijkstra128 function which will calculate the shortest path
using Dijkstra’s algorithm129 . This is only one of many algorithms that you can use, but in general it
is reliable and usually the one you will see in examples, so we will start here. This algorithm can also
calculate one-to-one routes (which is what we are doing in this section), one-to-many, many-to-one,
and many-to-many. The function signature looks like this:
pgr_dijkstra(Edges SQL, start_vid, end_vid [, directed])
Example:
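For instance, a one-to-one call might look like this (the two vertex IDs are placeholders):

SELECT *
FROM pgr_dijkstra(
    -- the edges SQL, passed as a string
    'SELECT gid AS id, source, target, cost, reverse_cost FROM ways',
    6,     -- start vertex id (placeholder)
    17,    -- end vertex id (placeholder)
    true   -- directed
);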
This may be a bit confusing for now but in short, you provide the function with:
• Edges SQL - The SQL (where the query is written as a string) that contains the data about the
edges, or in our case data from the ways table
• start_vid - The starting way ID(s)
• end_vid - The ending way ID(s)
• [, directed] - If the path is directed, meaning that it is specifically going from the start point to the
end point, which will take things like one way streets into consideration
Our first task is to find the way ID for our start and end points, which will be from the southernmost
point in New York City (Ward’s Point in Staten Island) and the northernmost point (at the College
of Mount Saint Vincent in the Bronx). To find the closest way, or line, to each point, we can use our
shorthand expression for ST_Distance, <->, and query for both locations in two separate subqueries:
128 https://docs.pgrouting.org/3.1/en/pgr_dijkstra.html
129 https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
with start AS (
    SELECT
        source
    FROM
        ways
    ORDER BY
        the_geom <-> ST_SetSRID(
            ST_GeomFromText('POINT (-74.244391 40.498995)'),
            4326
        )
    LIMIT
        1
)
We select the source column from the ways table which contains the unique IDs for each road segment.
Next, we order the results by the distance from the starting point and limit the results to 1. We can
repeat the same process with our destination in a second subquery:
The first thing you will notice is that we use the pgr_dijkstra function. This is followed by a query
contained in single quotes as the query in pgRouting is passed as a string, which is our first argument
in the function. The second and third select the source IDs from our start and destination table. The
fourth argument tells the function that we want the route to be directed, meaning that it will utilize one
way streets as if we are going from the southernmost point in Staten Island to the northernmost point
in the Bronx.
On the outer parts of the query, you can see at the beginning that we are performing a union with ST_Union on the geometry column, the_geom, from the ways table, since the query returns individual rows, each with a line segment ID. To get the geometry, we join the ways table using the edge column from the pgr_dijkstra query and the gid column from the ways table. With that you can go ahead and run the complete query:
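A sketch of the complete query, following the structure just described (the inner edges query here simply uses the distance-based cost and reverse_cost columns from the ways table; the book's exact listing may differ slightly):

with start as (
    select source
    from ways
    order by the_geom <-> st_setsrid(
        st_geomfromtext('POINT (-74.244391 40.498995)'), 4326)
    limit 1
),
destination as (
    select source
    from ways
    order by the_geom <-> st_setsrid(
        st_geomfromtext('POINT (-73.904 40.912)'), 4326)
    limit 1
)
select
    st_union(pt.the_geom) as route
from
    pgr_dijkstra(
        'select gid as id, source, target, cost, reverse_cost from ways',
        (select source from start),
        (select source from destination),
        true
    ) as di
    join ways as pt on di.edge = pt.gid;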
Congrats! You have created your first route in spatial SQL. The only problem is that no one could
actually follow that route.
As stated above, the algorithm is only focused on finding the shortest route using all the ways within the dataset. This includes everything: highways, streets, side streets, alleys, bike paths, walking paths, and anything in between. For example, this part goes from the street onto a sidewalk (Figure 3.10, on the next page):
Here we go through a walking path in a park (Figure 3.12, on the next page):
And finally we take the bike path over the George Washington Bridge and then get back on the West
Side Highway (Figure 3.13, on page 423):
So what is the solution? This query only took the cost parameter into consideration, which as of now is just the distance. The ways table also includes a column called cost_s, which is the time it will take to traverse the segment in seconds. Our import process produces this column for us using the distance and the speed along that route. However, to demonstrate how changes to the cost will impact the route, we will create a modified configuration table that exaggerates the cost advantage of roads cars can travel on. While this sounds like it could potentially be complicated, it only takes a few steps. We will need
to add a new column to our configuration table and then weight the different costs to incentivize the
algorithm to use specific road types. Finally, we will pass those new values into the same query we just
wrote and see the results.
Our first step is to take a look at the tags currently in the configuration table:
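A simple way to do that, assuming the default osm2pgrouting layout where each row of the configuration table has a tag_id, tag_key, and tag_value:

select
    tag_id,
    tag_key,
    tag_value
from
    configuration
order by
    tag_id;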
Figure 3.13: Bike path on the George Washington Bridge, back to a highway exit, then on to the West Side Highway
As you can see, there are road types that you would generally prioritize if you were driving a car and others that you would not want to use at all. From here we can pick some of the tags we want to prioritize, but before that we will create a copy of the configuration table so we can keep the two tables separate:
Next we want to update the penalty column, which will be used to multiply the cost column when we run the routing query. You can take a look at all the different tags used in OSM for highways here130. The closer the penalty is to 0, the more strongly those roads are favored. First, we want to penalize roads we do not want cars driving on:
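A sketch of what those updates can look like; the values are illustrative, and note that in pgRouting a negative cost removes an edge from consideration entirely:

update car_config
set penalty = -1.0
where tag_value in ('steps', 'footway', 'pedestrian');

update car_config
set penalty = 5
where tag_value in ('unclassified');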
This will make sure that the routing will not route on any paths that are steps, footways, or used by
pedestrians and will make it very costly to go on unclassified roads.
Next we will incentivize different highway types that we want cars to travel on. These are extreme
weightings for demonstration purposes, so keep that in mind:
130 https://wiki.openstreetmap.org/wiki/Key:highway
update
  car_config
set
  penalty = 0.3
where
  tag_value in (
    'primary',
    'primary_link',
    'trunk',
    'trunk_link',
    'motorway',
    'motorway_junction',
    'motorway_link',
    'secondary'
  );
Next we will re-run our query with a few modifications to the query we send to pgRouting:

Here you can see that we are using the cost_s column, which measures cost in seconds rather than distance. We also multiply it by our penalty, meaning that if a road has a travel time of 60 seconds and the penalty is 0.5, then the weighted cost will be 30 seconds. The last step is to swap in our new car_config table for the original configuration table:
28 from
29 pgr_dijkstra(
30 'select
31 gid as id,
32 source,
33 target,
34 cost_s * penalty as cost,
35 reverse_cost_s * penalty as reverse_cost,
36 st_length(st_transform(the_geom, 3857)) as length
37 from
38 ways
39 join car_config using (tag_id)',
40 (
41 select
42 source
43 from
44 start
45 ),
46 (
47 select
48 source
49 from
50 destination
51 ),
52 true
53 ) as di
54 join ways as pt on di.edge = pt.gid;
And here are our results (Figures 3.14 and 3.15, on page 429):
And here you can see that this looks much closer to a route we would see when driving. You can do this weighting with any type of path to prioritize routing for walkers, cyclists, wheelchairs, or any other mode you choose.
3.3 Building an origin-destination matrix

An origin-destination matrix is effectively a cross join between all the locations in one dataset to all the
locations in another. This is a common input to problems such as the Vehicle Routing Problem (VRP)
that finds the best solution for deliveries leaving a single distribution center with multiple vehicles.
There are other versions of the VRP too, such as the VRP with capacity constraints, VRP with pickup
and delivery constraints, VRP with time window constraints, and VRP with start and end locations.
One of the most complicated VRP problems I have seen comes from Instacart. In a post on their public Medium page, they describe the SCVRPTWMT: a stochastic (uncertainty of driver shifts, weather, etc.), capacitated (amount of space or capacity of the driver) vehicle routing problem with time windows (when an order must be fulfilled) and multiple trips (multiple orders from one driver)131.
While there are many methods to perform the analyses we described above (pgRouting even has experimental functions for the VRP), the one thing that all these tools depend on is data from an origin-destination matrix. This is another great example of using the right tool for the job. Certainly you can perform the analysis using straight line distances, but that does not account for the road network, and unless you are operating a fleet of drones, you will likely be using the road network to make deliveries.
Many tools have ways to calculate an origin-destination matrix, yet as the scale increases this becomes challenging. APIs and routing services like Google Maps are meant for one-off calls, not large scale batch operations. Other tools exist, such as the Open Source Routing Machine, or OSRM132, an open source routing engine built on top of OpenStreetMap.
131 https://tech.instacart.com/space-time-and-groceries-a315925acf3a
However, this requires that you stand up an additional backend service. pgRouting provides the same level of service and, if your data is already in a database, that makes it a logical choice.

As we saw in the last section, you can calculate cost in terms of distance or time, and for this example we will be using time. We will also be creating a new database and configuration to use for bicycles, as many deliveries in New York City are made by cyclists. Finally, we will re-import our pharmacy data into the new database.
As a first step we will need to create a new database. Once again, right-click on the database connection, hover over "Create", then click "Database". Then add bike as the new database name and click "Save" (Figures 3.16, on the next page, and 3.17, on page 431):
Next we want to reprocess our OSM data extract to include all bike-specific routes. First, create another terminal session into our mini PostGIS Docker container:
docker container exec -it mini-postgis bash
132 https://github.com/Project-OSRM/osrm-backend
Next we need to reimport two datasets into our new database: the pharmacy locations and the NYC
Building Footprints:
8 Building_Footprints.geojson \
9 -nln nyc_building_footprints -lco GEOMETRY_NAME=geom
Next, we need to modify our configuration table to favor bicycle routes. With this query below, we
first favor routes that are meant for bicycles, then streets that are generally residential, set a negative
penalty for highways, and slightly higher penalties for primary streets. First you need to add the
penalty column to the configuration table:
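A sketch of those steps with illustrative penalty values; the tag values come from the OSM highway key, and you can tune the numbers to your own preferences:

alter table configuration
add column penalty double precision default 1;

-- Favor ways meant for bicycles
update configuration
set penalty = 0.3
where tag_value in ('cycleway', 'track', 'path');

-- Residential and similar streets are generally fine for cyclists
update configuration
set penalty = 0.8
where tag_value in ('residential', 'living_street', 'tertiary');

-- Remove highways from consideration entirely
update configuration
set penalty = -1.0
where tag_value in ('motorway', 'motorway_link', 'trunk', 'trunk_link');

-- Slightly discourage primary streets
update configuration
set penalty = 1.5
where tag_value in ('primary', 'primary_link');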
Now that we have our data ready we can start to construct our query. First we need to select the locations we will use. To get started, we will get 5 random pharmacy locations and 20 random building locations in a constrained area in Brooklyn (within 3 kilometers of Grand Army Plaza), which gives us 100 total routes. To do so we can create two CTEs:
1 with pharm as (
2
3 -- Find 5 random pharmacies within 3 kilometers of Grand Army Plaza
4 select
5 name,
6 st_centroid(geom) as geom,
7 id
8 from
9 nyc_pharmacies
10 where
11 st_dwithin(
12 st_centroid(geom) :: geography,
13 st_setsrid(st_makepoint(-73.9700743, 40.6738928), 4326) :: geography,
14 3000
15 )
16 order by
17 random()
18 limit
19 5
20 ),
21
22 -- Select 20 random buildings within 3 kilometers of Grand Army Plaza
23 bldgs as (
24 select
25 st_centroid(geom) as bldg_geom,
26 bin
27 from
28 nyc_building_footprints
29 where
30 st_dwithin(
31 st_centroid(geom) :: geography,
32 st_setsrid(st_makepoint(-73.9700743, 40.6738928), 4326) :: geography,
33 3000
34 )
35 order by
36 random()
37 limit
38 20
39 )
Since the pgr_dijkstra function accepts "many-to-many" routes, we can actually just use the query we used previously with a few modifications. First we need two arrays of data: one with our start way IDs and one with our end way IDs. Then we will also need to aggregate and group our results by the start and end IDs, which pgr_dijkstra returns as columns named start_vid and end_vid. The complete query looks like this:
17 3
18 ), bldgs as (
19 select
20 st_centroid(geom) as bldg_geom,
21 bin
22 from
23 nyc_building_footprints
24 where
25 st_dwithin(
26 st_centroid(geom) :: geography,
27 st_setsrid(st_makepoint(-73.9700743, 40.6738928), 4326) :: geography,
28 3000
29 )
30 order by
31 random()
32 limit
33 10
34 ),
35
36 c as (
37
38 -- First we select all the columns from the pharm, bldgs, and wid CTEs and sub-queries
39 select
40 pharm.*,
41 bldgs.*,
42 wid.*
43 from
44
45 -- We perform a cross join to find all possible matches between
46 -- the 5 pharmacies and the 20 buildings
47 pharm,
48 bldgs
49 cross join lateral (
50
51 -- For each row find the start and end way IDs
52 with start as (
53 select
54 source
55 from
56 ways
57 order by
58 the_geom <-> pharm.geom
59 limit
60 1
61 ), destination as (
62 select
63 source
64 from
65 ways
66 order by
67 ways.the_geom <-> st_centroid(bldgs.bldg_geom)
68 limit
69 1
70 )
71 select
72 start.source as start_id,
73 destination.source as end_id
74 from
75 start,
76 destination
77 ) wid
78 )
79 select
80 -- For each combination we get the sum of the cost in distance, seconds, and route length
81 -- and we repeat this for every row, or possible combination
82 sum(di.cost) as cost,
83 sum(length) as length,
84 sum(pt.cost_s) as seconds,
85 st_union(st_transform(the_geom, 4326)) as route
86 from
87 pgr_dijkstra(
88 'select
89 gid as id,
90 source,
91 target,
92 cost_s * penalty as cost,
93 reverse_cost_s * penalty as reverse_cost,
94 st_length(st_transform(the_geom, 3857)) as length
95 from ways
96 join configuration using (tag_id)',
97 array(
98 select
99 distinct start_id
100 from
101 c
102 ),
103 array(
104 select
105 distinct end_id
106 from
107 c
108 ),
109 true
110 ) as di
111 join ways as pt on di.edge = pt.gid
112 group by
113 start_vid,
114 end_vid
Our final results look like this (Figure 3.18, on the next page):
As you can see the paths make use of bike routes in Prospect Park (the large park in the middle of the
image) and other streets that are designated bike routes. It is optional to include the routes themselves
as you generally only need the cost (be it distance or seconds), but for our purposes it makes sense to
include them to see what the results look like. The nice part is that this scales quite well. The query above, which computes 100 total routes, took about 8 seconds on my computer. When I ran 1,000 routes it took about 11 seconds, 10,000 took about 17 seconds, and 100,000 took about 48 seconds.
As you scale up there are some different methods you can use to constrain the query. First, you can create batches of queries so you can process them individually, such as one query per depot. Second, for delivery locations that sit close together in compact groups, such as several houses in a row, you can use a clustering method like ST_ClusterDBSCAN to group them, since they would likely be served by the same route. Finally, you can also limit the locations assigned to each depot by distance. For example, if our pharmacy in Brooklyn only delivers within 2 kilometers or won't deliver into Manhattan, you can add that spatial constraint.
3.4 Traveling salesman problem

One of the most common problems in logistics or network analysis is solving the traveling salesman133 problem. This means finding the most efficient route between three or more locations on a network that all need to be visited. It is classified as an NP-hard problem in network science, it shows up in a wide range of fields, and pgRouting has tools to address it.
133 https://en.wikipedia.org/wiki/Travelling_salesman_problem
The function structure is a bit different from the other functions we have explored so far, but it uses a
string query input as well:
16 (
17 select
18 array_agg(id)
19 from
20 edge_table_vertices_pgr
21 where
22 id < 14
23 ),
24 directed := false
25 ) $$,
26 randomize := false
27 );
This time it accepts one of its own functions, pgr_dijkstraCostMatrix, within the double dollar sign
string notation since the pgr_dijkstraCostMatrix also accepts a string. So in reality we need to take a
look at the function signature of pgr_dijkstraCostMatrix to see how we need to structure our data:
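Its general form, per the pgRouting documentation, takes the same style of edges query plus an array of vertex IDs:

pgr_dijkstraCostMatrix(Edges SQL, vids [, directed])

It returns one row per pair of vertex IDs with the columns start_vid, end_vid, and agg_cost, which is exactly the shape pgr_TSP expects as its matrix input.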
What this will do is calculate all the possible distances and costs between all of these points, which is a bit different from our previous cost matrix analysis, where we compared all the points in one table to another. First we can create a static table with a sample of ten delivery locations per Rite-Aid pharmacy in Brooklyn, with a total of 10 pharmacies to analyze. I created the table using this query, if you want to create it on your own:
21 a
22 cross join lateral (
23 select
24 bin as building_id,
25 geom
26 from
27 nyc_building_footprints
28 where
29 st_dwithin(
30 st_centroid(geom) :: geography,
31 st_centroid(a.geom) :: geography,
32 800
33 )
34 order by
35 random()
36 limit
37 10
38 ) b
However, if you want to use the exact same data as you will see in the example, you can import and
transform the CSV from the book files using the commands below:
Once we have that table set up we want to create a table that we can use with a specific structure which
will have three columns:
1. Rite-Aid store ID
2. Way ID for each Rite-Aid store
3. Array of way IDs for the buildings
In our first two steps we need to find the way IDs for the stores and the buildings. Note that we find
the distinct IDs for the Rite-Aid locations since each one is repeated ten times:
19 ) s
20 ), b as (
21 select
22 b.pharm_id as id,
23 s.source
24 from
25
26 -- For each building find the closest way ID
27 rite_aid_odm b
28 cross join lateral(
29 SELECT
30 source
31 FROM
32 ways
33 ORDER BY
34 the_geom <-> b.geom
35 limit
36 1
37 ) s
38 ),
Next we group the destinations into an array and prepend the way ID of the store to that array using the array_prepend function. Our final query will look like this, with an added CREATE TABLE call to store the results and make them easier to work with:
38
39 -- Constructs an array with the way ID of the pharmacy as the first item
40 array_prepend(a.source, array_agg(distinct b.source)) as destinations
41 from
42 a
43 join b using (id)
44
45 -- Will return one row per pharmacy ID
46 group by
47 a.id,
48 a.source
49 )
50 select
51 *
52 from
53 c
As another step, we will also create the cost matrix using pgr_dijkstraCostMatrix ahead of time using
this query:
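A sketch of that query; rite_aid_stops stands in for whatever name you gave the table created above (one row per pharmacy with its id, its way ID as source, and the destinations array), and the results are stored in a new table:

create table rite_aid_cost_matrix as
select
    a.id,
    a.source,
    matrix.*
from
    rite_aid_stops a
    cross join lateral (
        select
            *
        from
            pgr_dijkstraCostMatrix(
                'select
                    gid as id,
                    source,
                    target,
                    cost_s * penalty as cost,
                    reverse_cost_s * penalty as reverse_cost
                from ways
                join configuration using (tag_id)',
                a.destinations,
                true
            )
    ) matrix;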
As you can see we select all the values from the table we just created, and for each row in the table the
query will return the costs between every point in the array using our weighted costs for bike routes.
Since we don’t want to do this for every single point, we use a CROSS JOIN LATERAL to create just the
cost matrix for each Rite-Aid location and the specific buildings it needs to deliver to.
Now we are able to construct our second query. This one has a few key components to it, so I will break
it down step by step:
In this first section we select columns that ultimately come from the table we just created (aliased as s) and from a CROSS JOIN LATERAL that we will build in the next step. We also add a window function to find the next node (way) that the route will use, which we will need in order to build the actual route geometries. The LEAD function lets us look at the next row, and our partition limits it to the specific source ID, or Rite-Aid location. In the next part of the query we structure our CROSS JOIN for the pgr_TSP function:
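A sketch of that part of the query, assuming the cost matrix was stored in a table named rite_aid_cost_matrix (as in the sketch above, with the pharmacy way ID in its source column) and that the outer table is aliased as s:

cross join lateral (
    select
        *
    from
        pgr_TSP(
            $$select start_vid, end_vid, agg_cost
              from rite_aid_cost_matrix
              where source = $$ || s.source || $$$$,
            s.source,
            s.source
        )
) tsp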
This one is a bit funky, but let me explain. As you can see, we are creating a CROSS JOIN LATERAL as we discussed. Within it we create a subquery that actually calls our pgr_TSP function, which accepts the data from the cost matrix we created. Instead of calling that function again we can simply query our table. But since this is a CROSS JOIN LATERAL, we want to make sure we only pull the matrix rows whose source ID matches the Rite-Aid location in the outer table. Using the shorthand for concatenation, we can join the column into the string: we insert the pharmacy way ID, or source, after the "=" by wrapping the column with || on both sides (|| acts as a shorthand for CONCAT), and then finish the query with "$$" to close it out.
You can see what this looks like using this query:
Finally, since we want the cyclist to start and end at the store we pass the s.source column twice to have
the route start and end at the same location. This is the data we will end up with:
In our last query we will create the routes. This query involves three sources: the solved_tsp table, the ways table, and the return values from pgr_dijkstra. First, in our CTE, we select several columns from our solved_tsp table: id, source, seq, node, next_node. Then we select all the return values from the pgr_dijkstra function. This will result in one optimized route between each consecutive pair of stops, meaning that if we have to visit 10 points we will have ten rows of data, each containing a route. In the final step, we aggregate the data from the CTE to create a combined geometry and find the sum of the time in seconds the route will take (of course not accounting for stops, traffic lights, etc.), and the length of the route. Note that this is a long-running query, so I would recommend making a table from it.
40 sum(cost) as cost,
41 sum(length) as length
42 from
43 a
44 group by
45 source
And here are our results (Figures 3.19 and 3.20, on the next page):

Figure 3.20: Output of the traveling salesman problem for bikes zoomed out
3.5 Creating travel time polygons or isochrones

In our last routing-specific exercise we will use pgRouting to create isochrone polygons. These are polygons that show which areas can be reached within a certain distance or time from a specific point. For example, you can create an isochrone around your home to see the areas you can reach within a ten-minute drive. With pgRouting you can build various network types, so you could apply this to cyclists, pedestrians, and more. Once again, isochrones are a common service provided by location data providers via APIs, which are great for one-time calls, but batch processes can take some time or not work at all. Even though service speeds have greatly improved in recent years, having the flexibility to do this in the same place as your data can be highly beneficial.
Isochrones can be used for a wide range of analyses, as they often provide a more accurate picture of the population that can reach a certain location. This can be used to analyze retail trade areas, identify how many people can reach a health care facility, measure accessibility to a government office, identify cannibalization and overlap between locations, and much more. The function we will be using is pgr_drivingDistance, which, despite its name, uses any cost metric to find all the nodes that are less than or equal to the cost we input into the function. From there we will construct a polygon around those nodes using ST_ConcaveHull.
Our first task will be to identify the way ID closest to our starting point, in this case the entry to Katz’s
Deli, a famous New York City restaurant. Similar to before we want to make sure to exclude sidewalks
and match the location to the closest driveable road:
Next we call the pgr_drivingDistance function to find all the nodes that are equal to or less than the cost of our third argument, in this case 600 seconds or 10 minutes. Here we are using the column cost_s rather than distance, since we are creating isochrones based on time. We also only select the road tags that are driveable:
Using the query below we can see the results up to this point (Figure 3.21, on the facing page):
27 from ways
28 join car_config
29 using (tag_id)
30 where tag_id in (110, 100, 124, 115, 112, 125, 109, 101,
31 103, 102, 106, 107, 108, 104, 105)',
32 (
33 select
34 source
35 from
36 start
37 ),
38 600
39 )
40 )
41
42 -- Union the geometries into a single geometry
43 select
44 st_union(ways.the_geom)
45 from
46 ways
47 where
48 gid in (select edge from b)
In our final step we create a concave hull around the combined geometries to create our drive time area:
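A sketch of that final step: it replaces the closing SELECT of the previous query, wrapping the unioned geometry in ST_ConcaveHull (the 0.9 target percentage is illustrative):

select
    st_concavehull(st_union(ways.the_geom), 0.9) as geom
from
    ways
where
    gid in (
        select
            edge
        from
            b
    )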
This method is great for intersecting with other data, as it ultimately has fewer vertices, but if you want a more "accurate" looking drive time you can substitute buffers for the concave hull. After we create the buffers, in this example using a 20 meter buffer, we find the exterior ring of the polygon using ST_ExteriorRing to remove any holes, then turn it back into a polygon using ST_MakePolygon (Figure 3.23, on page 451):
22 source,
23 target,
24 cost_s * 2.5 as cost,
25 reverse_cost_s * 2.5 as reverse_cost
26 from ways
27 join car_config
28 using (tag_id)
29 where tag_id in (110, 100, 124, 115, 112,
30 125, 109, 101, 103, 102, 106, 107, 108, 104, 105)',
31 (
32 select
33 source
34 from
35 start
36 ),
37 600
38 )
39 )
40 select
41
42 -- Turn it all into a polygon
43 st_makepolygon(
44
45 -- Find the exterior ring which will eliminate islands
46 st_exteriorring(
47
48 -- Create a 20 meter buffer
49 st_buffer(
50 st_transform(
51 st_union(the_geom), 4326) :: geography,
52 20
53 ) :: geometry
54 )
55 ) as geom
56 from
57 ways
58 where
59 gid in (
60 select
61 distinct edge
62 from
63 b
64 )
We can also do this for cycling using our bike database by changing out our configuration table name
and then selecting only roads that we don’t penalize (Figure 3.24, on page 452):
16 ), b as (
17 select
18 *
19 from
20 pgr_drivingdistance(
21 'select
22 gid as id,
23 source,
24 target,
25 cost_s * 2.5 as cost,
26 reverse_cost_s * 2.5 as reverse_cost
27 from ways
28 join configuration
29 using (tag_id)
30 where penalty <= 1',
31 (
32 select
33 source
34 from
35 start
36 ),
37 600
38 )
39 )
40 select
41 st_makepolygon(
42 st_exteriorring(
43 st_buffer(
44 st_transform(st_union(the_geom), 4326) :: geography,
45 20
46 ) :: geometry
47 )
48 ) as geom
49 from
50 ways
51 where
52 gid in (
53 select
54 distinct edge
55 from
56 b
57 )
3.6 Expert Voices: Aaron Fraint

Name: Aaron Fraint
Title: Solutions Engineering Team Lead, North America @ CARTO
The first time I encountered spatial SQL was on the job while working for Azavea. I was embedded
within the Stormwater Billing unit of the Philadelphia Water Department, and I quickly learned how
important SQL and Python were. The job required the ability to query a vast billing database and
intersect it with a land coverage dataset to determine the stormwater charge for every account. The
more impervious land cover on a property, the larger the bill. I also had to answer the phone and speak
to people about potential issues with their bills, sometimes writing SQL queries in real time! Using
photographic and other evidence, we were able to update our source data and fix any errors that may
have resulted in customers being charged too much. When this happened, we’d also use spatial SQL
to calculate the amount of the refund owed to the customer.
I enjoy using spatial SQL as it allows me to express simple or complex data transformations in a
straightforward manner. Instead of running numerous geoprocessing tools in a manual sequence, I
can run an entire analysis end to end as a single query, organizing my logical steps into common table
expressions. I find this to be an excellent environment that encourages experimentation and iteration.
Can you share an interesting way or use case that you are using spatial SQL for today?
While working as the Associate Manager of the Delaware Valley Regional Planning Commission’s
(DVRPC) Office of Mobility Analysis & Design, I leveraged spatial SQL within an API to solve end-to-
end routing requests with the least-stressful path for bicyclists.
To do this, I performed an ELT process with the agency’s Level of Traffic Stress dataset, where I ingested
the raw data into Postgres, and then reformatted it to work with pgRouting, the routing engine for the
project. I then used the FastAPI library in Python to write an API where requests with a start and
endpoint are then passed into a query that leverages pgRouting functions against a network with a
custom impedance variable that considers not just the length of the trip, but the stress levels as well.
This API is currently being used in an app called Ruti, which is a trip planning tool for finding the least
stressful bike routes developed by Corey Acri at AG Strategic in partnership with DVRPC. When you
visit https://ruti.bike/go and plan a trip between two places, a spatial SQL query using pgRouting
will return the least stressful path.
134 https://github.com/dvrpc/low-stress-bike-routing
4. Spatial Autocorrelation and Optimization with Python and PySAL
Spatial data science is an area that has grown tremendously within the last several years. This took place alongside broader growth in roles and jobs in data science and has also expanded usage of open source GIS tools, specifically within the Python ecosystem. There are many types of analysis you can perform in this space, and these slides135 from Luc Anselin, PhD, of the Center for Spatial Data Science at the University of Chicago, summarize the questions they help answer (Figure 4.1):
Figure 4.1: Spatial analytics questions from the YouTube Video "Spatial Data Science Overview" by Dr. Luc Anselin
In this section we will work to answer some of these questions, such as clustering using spatial autocorrelation, optimization by combining areas into regions or territories, and locating new facilities optimally using location allocation. The interesting part is that we will actually be using Python to accomplish this within our PostGIS database. I will explain why and with what tools in the next section, so let's jump in!
135 https://www.youtube.com/watch?v=lawWM6jQYEE&t=986s
4.1 Spatial autocorrelation

The tutorials in this section will feature tools and analyses from the Python Spatial Analysis Library, or PySAL. In my opinion it is the best library for performing analyses that you might categorize as spatial data science, spatial econometrics, or spatial statistics. The challenge for us is that this library, as its name implies, is written in Python.
Before we get into the examples I want to call out a few important points. First, this is a book about spatial SQL, not Python. So why are we talking about it here? As you have likely experienced, it is nearly impossible to grow your geospatial toolkit without encountering Python. Python is a critical language for modern GIS and one that I enjoy using for different problems and analyses, as discussed earlier in the book. Specifically as it relates to PySAL, what I value most is that I can use tools developed by the best leaders, thinkers, and teachers in this field, who have focused much of their time, research, and energy on building it. This in turn makes complex analysis accessible to a user like myself who does not have a background in these types of problems, but wants to solve them nonetheless.
The same goes for PostGIS and any other tool under the spatial SQL umbrella. I am not a database or C/C++/C# developer, and it would be a difficult task for me to implement a new function or even some basic fixes in something like PostGIS. But as a user I can benefit from all this amazing work and use PostGIS to work with data in a flexible and scalable environment.
Now to answer the question of why we are using both spatial SQL and Python in one function. Both spatial SQL and Python have their strengths and weaknesses. SQL is fantastic at working with and manipulating data, as we have seen throughout the course of this book. Python is great at creating complex functions and scripting analytical processes. While you can perform data manipulation with Python and libraries like Pandas and GeoPandas, at a certain point the speed at which you can manipulate data starts to decrease. Specific to PySAL, some of its internal functions, such as creating spatial weights, can also be slow and even fail at times with large data. As for SQL, while you can program analyses like those in PySAL, it isn't necessarily a straightforward process, and Python is generally more efficient at performing statistical analysis on data held in native Python data structures. But above all, the analysis we need is already built and maintained in PySAL, and if we don't use this library we would not be able to take advantage of all the work that already exists.
The use cases we will explore meet all of these conditions, and by creating a solution that merges the best of Python and SQL we get the best possible outcome in terms of scale and efficiency. The solution we will explore uses PL/Python, or the Python Procedural Language, which is part of PostgreSQL. It allows us to create our own functions that use Python or Python libraries that we have installed on the server (in this case, the Docker container) where we are running PostgreSQL.
I do want to also state that this is not the only solution for performing this analysis. Certainly we can use the same query process to pull data into Python code using SQLAlchemy or other Python libraries with similar functionality (most notably Pandas136 and GeoPandas137) to connect to your PostGIS instance.
The main problem we will be addressing is using SQL to quickly analyze our data using spatial rela-
tionships and to structure the data into the format the functions in PySAL need to use. How and where
you choose to execute this is up to you, but we will focus on the methods within PostgreSQL here.
A final note is that many, but not all, databases and data warehouses include functionality for developing functions in other languages. Snowflake, for example, supports Python UDFs like the one we will build, as well as its Snowpark product, which allows you to scale the computational
136 https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html
137 https://loc8.cc/sql/geopandas-geodataframe
support for that Python code. Amazon Web Services Redshift also allows you to create Python UDFs
like we will see shortly, and BigQuery allows you to use JavaScript to create UDFs. The differentiat-
ing factor here is that these tools allow you to query data and run Python all in the same compute
infrastructure, reducing data transfer and complexity. So with that we can get started.
Before we can create our function we need to run a few steps to set up our instance to work with the
Python libraries we need. One of the first ones is the PL/Python extension which can be activated by
running this command:
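The functions later in this chapter use the plpython3u language, so that is the extension to enable:

create extension if not exists plpython3u;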
This will allow us to query our database from inside the SQL function using Python. In the next steps we are going to open a terminal connection into our Docker container to install the Python libraries we need. To do so, open a new terminal window on your computer and run this command:
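This is the same command we used earlier in the book to open a shell inside the container:

docker container exec -it mini-postgis bash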
Once the command is run you should see this screen (Figure 4.2):
Now we have opened a new terminal session in our Docker container via the terminal on our computer, so all commands from this point forward will be run inside the Docker container. Our PostGIS container is based on Debian, which means that we will be using the package manager apt, Debian's Advanced Package Tool. Our first command will be to update APT (Figure 4.3):
apt update
Next we need to install Python 3 and pip3, the Python package installer. Below is the output for installing Python (Figure 4.4, on the next page):
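Assuming Python 3 is not already present in the image, the install command is:

apt install python3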
When you install pip3 you will also need to enter "Y" for yes when prompted to complete the installa-
tion. First run this command and then enter “Y” when you are prompted (Figure 4.5, on page 460):
apt install python3-pip
We can check that the installations ran correctly by running the following commands (Figure 4.6, on
page 461):
python3 --version
pip3 --version
Next we are going to install GDAL which requires two commands, the second of which will require
the "Y" for yes prompt:
apt-get install gdal-bin
apt-get install libgdal-dev
Now finally we can run our command to install our Python libraries using pip3 (Figure 4.7, on
page 462):
pip3 install esda matplotlib libpysal spopt geopandas scikit-learn --break-system-packages
Once you see this message you have successfully installed all the required packages. Before we start
to build our function we should first take a look at the new tools we will use. For the most part the
Python UDFs follow the same structure as the other UDFs we have created so far over the course of the
book. The first difference is that the majority of the function is actually written in Python as the core language. The second is that, to access data in our database, we use the procedural PL/Python language with a
built-in library called plpy138 . This allows us to write queries to our database and return the data in a
dictionary format in Python. The queries are passed to the database as strings which means that they
can be formatted the same way you would a normal Python string.
Let’s take a look at a sample function to understand how this works. First is a simple function that will
return the greater of two integers from the PostgreSQL documentation:
138 https://www.postgresql.org/docs/current/plpython-database.html
3 AS $$
4 if a > b:
5 return a
6 return b
7 $$
8 LANGUAGE plpython3u;
This function runs a simple if statement in Python to return the larger of two integers which you can
test with this query:
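Assuming the function was created with the name pymax, as in the PostgreSQL documentation example, the test looks like this:

select pymax(24, 8);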
To see how the database calls with plpy will work we can run another test function:
This function will return five rows of our data as text from within the function, which will allow us to see the data structure that is returned from a query call to plpy. To do this we can create a table without a geometry column, since the geometry would show as a very long string:
1 create table nyc_neighborhoods_no_geom as
2 select
3 ogc_fid,
4 neighborhood,
5 boroughcode,
6 borough
7 from
8 nyc_neighborhoods
13 {
14 'ogc_fid': 3,
15 'neighborhood': 'Arden Heights',
16 'boroughcode': '5',
17 'borough': 'Staten Island'
18 },
19 {
20 'ogc_fid': 4,
21 'neighborhood': 'Arlington',
22 'boroughcode': '5',
23 'borough': 'Staten Island'
24 },
25 {
26 'ogc_fid': 5,
27 'neighborhood': 'Arrochar',
28 'boroughcode': '5',
29 'borough': 'Staten Island'
30 }]>
As you can see, we have an object with our target data in a dictionary structure in Python. We can extract the data itself by accessing each row as we would an array, such as below:
5 return data[0]
6 $$
7 LANGUAGE plpython3u;
To access a specific column in that row we can access it using the column name and dictionary notation:
Which will return ’Allerton’. So with that, we are ready to start developing our new function.
For our first step, we can build out the structure of our function along with commented steps of the
code we will fill in with equivalent Python code:
As we go along we will simply return text for our development function so we can see the results of
each step, and then we can copy the successful code and create a function that returns a table. In our
first step, we need to import our libraries, create our dictionaries for our neighboring polygons data,
and our spatial weights data. Below is the full code which we will break down step by step:
25 select
26 array_agg(z.{ id_col } :: text) as neighbors
27 from
28 { tablename } z
29 where
30 st_intersects(z.{ geom }, b.{ geom })
31 and z.{ id_col } != b.{ id_col }
32 ) a
33 where
34 a.neighbors is not null
35 '''
36 )
37
38 weights = plpy.execute(
39 f '''
40 select
41 json_object_agg(b.{ id_col }, a.weights) as weights
42 from
43 { tablename } b
44 cross join lateral (
45 select
46 array_agg(z.{ id_col }) as neighbors,
47 array_fill(
48 (
49 case
50 when count(z.{ id_col }) = 0 then 0
51 else 1 / count(z.{ id_col }) :: numeric
52 end
53 ),
54 array [count(z.{id_col})::int]
55 ) as weights
56 from
57 { tablename } z
58 where
59 st_intersects(z.{ geom }, b.{ geom })
60 and z.{ id_col } != b.{ id_col }
61 ) a
62 where
63 a.neighbors is not null
64 '''
65 )
66 return neighbors
67 $$
68 LANGUAGE 'plpython3u';
In the first part of the query you can see that we imported all of our various libraries we need using the
Python import process:
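The import block at the top of the function body needs at least the libraries the rest of the code relies on: libpysal's W class, esda for the Moran's I statistics, numpy, pandas, and json. The original function may import more, but a minimal set looks like this:

import json
import esda
import numpy as np
import pandas as pd
from libpysal.weights import W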
Next we calculate our neighbors and weights data that will be used to calculate our spatial weights in
the next step using a call to our database:
3 neighbors = plpy.execute(f'''
4 select
5
6 -- This will create a JSON object with the ID column and an
7 -- array of the neighbor IDs
8
9 json_object_agg(b.{ id_col }, a.neighbors) as neighbors
10 from
11 { tablename } b
12
13 -- To accomplish this we use a cross join lateral
14 -- to find all the neighboring IDs and aggregate
15 -- them into an array to match the PySAL formatting
16
17 cross join lateral (
18 select
19 array_agg(z.{ id_col } :: text) as neighbors
20 from
21 { tablename } z
22 where
23
24 -- A neighbor is considered anything that intersects or
25 -- touches the target ID and that isn't the target polygon
26
27 st_intersects(z.{ geom }, b.{ geom })
28 and z.{ id_col } != b.{ id_col }
29 ) a
30 where
31
32 -- This removes any islands or non-connected polygons
33
34 a.neighbors is not null
35 ''')
36
37 weights = plpy.execute(f'''
38 select
39 json_object_agg(b.{ id_col }, a.weights) as weights
40 from
41 { tablename } b
42 cross join lateral (
43 select
44 array_agg(z.{ id_col }) as neighbors,
45 array_fill(
46 (
47 case
48 when count(z.{ id_col }) = 0 then 0
49 else 1 / count(z.{ id_col }) :: numeric
50 end
51 ),
52 array [count(z.{id_col})::int]
53 ) as weights
54 from
55 { tablename } z
56 where
57 st_intersects(z.{ geom }, b.{ geom })
58 and z.{ id_col } != b.{ id_col }
59 ) a
60 where
61 a.neighbors is not null
62 ''')
As you can see, we store our results in a variable for each query. In each query you will notice that we have some values within braces, such as {id_col}. We are using f-string formatting in Python, which means that those values will be replaced by the values passed into our function. So if our ID column is geoid, every {id_col} is replaced with geoid when the query runs. The neighbors query returns a JSON object that looks like this:
{ "360050001001" : ["360050001000","360810331000","360810929000","360810107010"]}
The key of the object is the census block ID of the target block group, in this case 360050001001, and the
values in the array are the block groups that touch it with at least one point.
The weights are similar, except that each value will be replaced with one over the number of items in the array. In our example, since there are four neighbors, each ID will be replaced by ¼, or 0.25. These are the data structures that are required to calculate our spatial weights, as defined in the PySAL documentation139.
Let’s take a look at our query and review it step by step, with the values filled in:
1. Our target output is a JSON object with a key and an array value which will be read as a dictionary
within Python, so we create a JSON object using json_object_agg
2. Next we use a cross join lateral to find all the polygons that touch the source polygon
3. We take all the IDs of the polygons that touch the source polygon and put them into an array
4. We use ST_Intersects to find the matches that touch and then exclude the source polygon in the
second part of the where clause
5. PySAL will give an error for empty relationships so we exclude those
139 https://pysal.org/libpysal/generated/libpysal.weights.W.html#libpysal.weights.W
We do the same thing for the weights, but we add this instead of the neighbors array. Here we use the array_fill function, which takes two arguments. The first is the value to fill the array with, which in the case of our previous example is 0.25 since we have 4 entries. The second argument creates the empty array to fill, with a length equal to the total number of entries.
Moving on to the next steps, we will create our spatial weights using the W function from the main
PySAL library, libpysal140 . We will also create a matching data structure for our variable data, in this
case our income column. If you want to learn more about spatial weights I would recommend this
chapter from the Geospatial Data Science Book141 .
Our spatial weights are created with these two lines of code:
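These are the same two lines that appear in the complete listing at the end of this section:

w = W(json.loads(neighbors[0]['neighbors']), json.loads(weights[0]['weights']), silence_warnings = True)
w.transform = 'r'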
We can access the data we want from our neighbors and weights query results using the array notation we saw earlier, wrapping them in json.loads, which turns our string data into proper Python dictionaries.
Next we will grab our income data using a similar query structure to our first two queries. As you can see in the last line of the query, we only return the values, since all we need is an ordered array of them in Python. This means that we have to turn them into a list and account for null values by turning them into 0. To accomplish this we will loop over the items returned from our query and put them into an empty list using a for loop with a conditional if/else inside it:
140 https://pysal.org/libpysal/generated/libpysal.weights.W.html#libpysal.weights.W
141 https://geographicdata.science/book/notebooks/04_spatial_weights.html
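The loop, as it appears in the complete listing, looks like this (var_data holds the rows returned by the income query):

var_list = []
for i in var_data:
    if i['data_col'] == None:
        var_list.append(float(0.0))
    else:
        var_list.append(float(i['data_col']))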
We will put this all into a Pandas DataFrame to use in our next step which is to calculate the Moran’s I
values with our list data that is turned into a numpy array:
99 var_list.append(float(0.0))
100 else:
101 var_list.append(float(i ['data_col']))
102
103 li = esda.moran.Moran_Local(np.array(var_list), w)
104
105 return var_list
106 $$
107 LANGUAGE 'plpython3u';
In our last set of steps we will query our data to find the original data, which includes the geometry, create two lists of data which we can use to structure the final data output, and then join our Pandas DataFrames and structure the data.
The first step is to create a query to retrieve the original data column (in our case income), id column,
and geometry:
Next we will loop over that data to create two new lists. One will be turned into a DataFrame to use to
join to the other DataFrame we created in the earlier steps. The other will be a list to lookup the index
position of a specific Census Block Group ID to extract values. The index position is a number that
represents the position in a list starting at 0, so if I wanted to get the first position in a list called my_list, I would use this code to access it:
my_list[0]
We will need the index to extract values from our Moran’s I results. We do that by finding
the specific data point we want (you can see all the return results in the documentation142 ) along with
the index position. So with that you can see how we create those lists using the following code:
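From the complete listing, the two lists are built by looping over the rows returned by the query above:

original_data = []
lookup_table = []

for i in original:
    original_data.append(i)
    lookup_table.append(i[f'{id_col}'])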
Next we format our data using a similar process, and it is here where we extract the data from the
Moran’s I results, including the Moran’s I values, p-value, and quadrant.
142 https://pysal.org/esda/generated/esda.Moran_Local.html#esda.Moran_Local
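Again from the complete listing, this step extracts the Moran's I value, p-value, and quadrant for each row by its index position and builds a second DataFrame from the results:

df = pd.DataFrame.from_dict(original_data)

formatted_data = []

for i in original_data:
    dict = i
    res = lookup_table.index(i[f'{id_col}'])
    dict['local_morani_values'] = li.Is[res]
    dict['p_values'] = li.p_sim[res]
    dict['quadrant'] = li.q[res]
    formatted_data.append(dict)

original_data_df = pd.DataFrame.from_dict(formatted_data)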
4
5 # Merge or join it with the original data
6
7 final = df.merge(original_data_df, how='inner', on=f'{id_col}')
8
9 # Drop duplicate columns
10
11 final_df = final.drop([f'{col}_x', f'{geom}_x'], axis=1)
12
13 # Rename the outputs
14
15 final_df = final_df.rename(columns={f'{col}_y': 'col', f'{geom}_y': f'{geom}', f'{id_col}': 'id' })
16
17 # Return the final formatted DataFrame as a dictionary using the .to_dict() function from Pandas
18
19 return final_df.to_dict(orient='records')
And that is it! You can see the complete code here:
45 f'''
46 select
47 json_object_agg(b.{id_col}, a.weights) as weights
48 from
49 {tablename} b
50 cross join lateral (
51 select
52 array_agg(z.{id_col}) as neighbors,
53 array_fill(
54 (
55 case
56 when count(z.{id_col}) = 0 then 0
57 else 1 / count(z.{id_col}) :: numeric
58 end
59 ),
60 array [count(z.{id_col})::int]
61 ) as weights
62 from
63 {tablename} z
64 where
65 st_intersects(z.{geom}, b.{geom})
66 and z.{id_col} != b.{id_col}
67 ) a
68 where
69 a.neighbors is not null
70 '''
71 )
72
73 w = W(json.loads(neighbors[0]['neighbors']), json.loads(weights[0]['weights']), silence_warnings = True)
74 w.transform = 'r'
75
76 var_data = plpy.execute(
77 f'''
78 with a as (
79 select
80 distinct b.{col},
81 b.{id_col}
82 from
83 {tablename} b
84 cross join lateral (
85 select
86 array_agg(z.{id_col}) as neighbors
87 from
88 {tablename} z
89 where
90 st_intersects(z.{geom}, b.{geom})
91 and z.{id_col} != b.{id_col}
92 ) a
93 where
94 a.neighbors is not null
95 )
96 select
97 {col} as data_col
98 from
99 a
100 ''')
101
102 var_list = []
103 for i in var_data:
104 if i['data_col'] == None:
105 var_list.append(float(0.0))
106 else:
107 var_list.append(float(i['data_col']))
108
109 li = esda.moran.Moran_Local(np.array(var_list), w)
110
111 original = plpy.execute(f'''
112 with a as (
113 select
114 distinct b.{col},
115 b.{id_col},
116 b.{geom}
117 from
118 {tablename} b
119 cross join lateral (
120 select
121 array_agg(z.{id_col}) as neighbors
122 from
123 {tablename} z
124 where
125 st_intersects(z.{geom}, b.{geom})
126 and z.{id_col} != b.{id_col}
127 ) a
128 where
129 a.neighbors is not null
130 )
131 select
132 {id_col},
133 {col},
134 {geom}
135 from
136 a'''
137 )
138
139 original_data = []
140 lookup_table = []
141
142 for i in original:
143 original_data.append(i)
144 lookup_table.append(i[f'{id_col}'])
145
146 df = pd.DataFrame.from_dict(original_data)
147
148 formatted_data = []
149
150 for i in original_data:
151 dict = i
152 res = lookup_table.index(i[f'{id_col}'])
153 dict['local_morani_values'] = li.Is[res]
154 dict['p_values'] = li.p_sim[res]
155 dict['quadrant'] = li.q[res]
156 formatted_data.append(dict)
157
158 original_data_df = pd.DataFrame.from_dict(formatted_data)
159 final = df.merge(original_data_df, how='inner', on=f'{id_col}')
160 final_df = final.drop([f'{col}_x', f'{geom}_x'], axis=1)
161 final_df = final_df.rename(columns = {f'{col}_y': 'col', f'{geom}_y': f'{geom}', f'{id_col}': 'id'})
162
163 return final_df.to_dict(orient = 'records')
164 $$
165 LANGUAGE 'plpython3u';
Next, run the complete function code to create our function. First we will create a table of Census Block
Groups that have a population greater than 0:
3 *
4 from
5 nys_2021_census_block_groups
6 where
7 left(geoid, 5) in ('36061', '36005', '36047', '36081', '36085')
8 and population > 0
Then we will call the function and create a new table so we can see our results in QGIS:
After loading the data in QGIS we will want to style the map categorically using the quadrant field. While you can certainly use the statistics in the map to filter for the most statistically significant areas, the quadrants tell you how the data is clustered on the map. The column will contain five different numbers ranging from 1 to 5, where 5 means there is no data or the result is not significant. The other four numbers represent labeled clusters: the first part of the label tells you whether the block group itself has a high or low value (in this case of income), and the second part tells you whether it sits near a high or low cluster. The labels map to this schema:
• High near High or HH - 1
• Low near High or LH - 2
• Low near Low or LL - 3
• High near Low or HL - 4
You can read more about local autocorrelation in the Geospatial Data Science Book143. Once you
have opened your data in QGIS, you can double-click on the layer in the bottom left-hand layer list
to open the dialog to change the styling. First, click on the left-hand menu item called "Symbology",
then click the top menu dropdown and select “Categorized” (Figures 4.8, on the next page and 4.9, on
page 477):
Once the menu opens click the dropdown next to the word "Value" then select quadrant from the list.
After that click "Classify" near the bottom followed by "OK" which will return you to the map (Figures
4.10, on page 477 and 4.11, on page 478).
Once you do this you should see your final map, and we can see that there are many block groups in the HH and LL categories and how they are spatially distributed (Figure 4.12, on page 479):
And for this section that is it! We implemented a complete function for performing spatial autocorrelation using PySAL, all within PostGIS.
143 https://geographicdata.science/book/notebooks/07_local_autocorrelation.html
4.2 Location Allocation
Location allocation is the process of finding the most suitable location (facility) to serve a set of points (demand points), or clients, with specific constraints in place. In more real-world terms, an example of this is finding the best location for a new fire station. This would take into consideration the current fire stations and constraints such as a maximum service time, for example that the service area cannot extend beyond 7 minutes of driving time. The same optimizations can be made for distance, coverage of demand points, facility capacity, accounting for backup coverage, distance between facilities, and more.
You need to have sites in mind before you run the analysis, but when combined with a suitability
analysis, this can be a final step in the process for validating a location choice. This is another example
of using the best tool for the job, meaning that again we will be using Python for one part of the analysis
and SQL for the other.
The network analysis component of the process relies on a network, typically a road network optimized for a specific type of travel such as car, bike, etc. As we know, this is a perfect fit for using pgRouting to calculate our routes efficiently and then structure the results in the format that Python requires. For the Python side, PySAL has an incredibly robust package named spopt that performs spatial optimization
for location allocation problems and regionalization (more commonly known as building territories), which we will use in the next section. As of the writing of this book, spopt supports the following analyses:
• Location Set Covering Problem (LSCP)
– "...the LSCP model was proposed whereby the minimum number of facilities determined and located so that every demand area is covered within a predefined maximal service distance or time" (Church and Murray, 2018)144
• Capacitated Location Set Covering Problem–System Optimal (CLSCP-SO)
– "Locate just enough facilities and associated capacity such that all demand is served
within the capacity limits of each facility, given the coverage capabilities of each
facility." Church L. & Murray, A. (2018)
• The Backup Coverage Location Problem (LSCP-B)
– "Find the minimum number of facilities and their locations such that each demand
is covered, while maximizing the number of backup coverage instances among
demand areas."145
• Maximal Coverage Location Problem
144 https://pysal.org/spopt/notebooks/lscp.html
145 https://pysal.org/spopt/notebooks/lscpb.html
• P-Median Problem
– "Here the objective is to locate a fixed number of facilities such that the resulting
sum of travel distances is minimized."148
• P-Dispersion (max-min-min) Problem
– "...Kuby (1987) described the following problem: Locate p facilities so that the min-
imum distance between any pair of facilities is maximized."149
For this example we will use the P-Median problem. This is one that is fairly easy to approach conceptually because it looks to place N facilities so that the total distance traveled between demand points and their assigned facilities is the lowest. In short, if you provide the model with 10 candidate facilities and you need 6, it will pick the 6 that minimize total travel distance. Since we already installed spopt in our original Python installs we can go ahead and build out our function, but first let's import the NYC Fire Stations data to our routing database so we can use it with pgRouting:
146 https://pysal.org/spopt/notebooks/mclp.html
147 https://pysal.org/spopt/notebooks/p-center.html
148 https://pysal.org/spopt/notebooks/p-median.html
149 https://pysal.org/spopt/notebooks/p-dispersion.html
# The connection parameters below are placeholders; use the same routing database you set up for pgRouting
ogr2ogr -f PostgreSQL PG:"host=localhost user=postgres dbname=routing" \
  nyc_hoods.geojson \
  -nln nyc_neighborhoods -lco GEOMETRY_NAME=geom -nlt PROMOTE_TO_MULTI
For this analysis we are going to find the best location for a new fire station in the Williamsburg, Greenpoint, and East Williamsburg neighborhoods by combining the existing fire stations with three new locations. To do this we want to create a single table of inputs for all the locations. For our new locations we will need to create a new table:
Before we calculate our origin-destination matrix, we also want to create a table of only the buildings we will evaluate to cut down on how much we need to calculate (Figure 4.13, on page 483):
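The query for that table is not shown in this excerpt; a sketch under assumed names (building_footprints and nyc_neighborhoods are both referenced later in this section, and the output table name is hypothetical) could look like this:

-- Sketch only: keep the buildings that fall within the three target neighborhoods
create table buildings_evaluate as
select
    b.*
from
    building_footprints b
join
    nyc_neighborhoods n
    on st_intersects(st_transform(b.geom, 4326), n.geom)
where
    n.neighborhood in ('williamsburg', 'east williamsburg', 'greenpoint');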
Now that we have the stations, we need to create an origin-destination matrix to use in the function. Since we only need the costs, we can use a different pgRouting function named pgr_bdDijkstraCost, which calculates only the cost between the start and end points:
                'track',
                'grade1',
                'grade2',
                'grade3',
                'grade4',
                'grade5',
                'unclassified',
                'footway',
                'pedestrian',
                'steps'
            )
        order by
            the_geom <-> all_stations.geom
        limit
            1
    ) z
),

        target,
        cost_s as cost,
        reverse_cost_s as reverse_cost
    from
        ways
        join car_config using (tag_id)
    where
        st_intersects(
            st_transform(the_geom, 4326),
            (
                select
                    st_buffer(st_union(geom) :: geography, 100) :: geometry
                from
                    nyc_neighborhoods
                where
                    neighborhood in (
                        'williamsburg',
                        'east williamsburg',
                        'greenpoint'
                    )
            )
        ) $$,

    -- We pass the arrays from above as the two arguments
    (
        select
            *
        from
            starts
    ),
    (
        select
            *
        from
            destinations
    ),
    true
);
Next we will import the libraries that we will be using and also query our origin-destination matrix table using the built-in plpy library:
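That listing is not reproduced here; a minimal sketch of what it covers, with the origin-destination matrix table name odm_fire_stations assumed for illustration (plpy is available automatically inside a plpython3u function):

# Sketch only: imports plus reading the origin-destination matrix table
import numpy
import pandas as pd
import pulp
from spopt.locate import PMedian

# pgr_bdDijkstraCost produces start_vid, end_vid, and agg_cost columns
odm = plpy.execute('select start_vid, end_vid, agg_cost from odm_fire_stations')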
Here we iterate over the rows returned from our origin-destination matrix table and, using a loop in Python, add those values to an empty list named odm_new:
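A sketch of that loop, continuing from the assumed variables above:

# Sketch: copy each returned row (a dictionary-like object) into a plain list
odm_new = []
for row in odm:
    odm_new.append(row)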
This next step sets up our solver from the pulp library, which contains many optimization solver tools
for linear programming problems. The library is installed with spopt as a dependency, so there is no
need to install it separately.
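A minimal sketch of that setup; PULP_CBC_CMD is the CBC solver bundled with pulp, and msg=False simply silences its log output:

# Sketch: create the linear programming solver used by spopt
solver = pulp.PULP_CBC_CMD(msg=False)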
From here we will do some data prep to set up our data in the required formats to run the optimization. In this case we will need to create a new Pandas DataFrame from the list which we created in the previous step. The required data structure for the optimization is a true matrix, meaning that instead of the data structure in our DataFrame, which has a start and end ID for each row and the associated cost, we need to create a matrix that has all the start IDs as columns and end IDs as rows. Pandas has a built-in DataFrame method named pivot_table that allows us to do this. We use four arguments:
• index: The value we will use as the rows, in this case the end ID
• columns: Values we will use as the columns, in this case the start ID
• values: These are the values we will use which is our aggregate cost value or agg_cost
• fill_value: The functions we are going to use need data with no empty values, so we fill any missing cells with a very large number, making that cost too high for the pair ever to be chosen
We will use this pivot twice. First, for the data variable, we will add the .values attribute on the end to get the underlying array of values, which is what the P-Median functions expect. The second we will leave as a DataFrame, which we will call vals, to map the individual buildings to the road segments they have been assigned to.
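A sketch of those two steps, assuming the rows collected above are first loaded into a DataFrame:

# Sketch: pivot the origin-destination rows into a full cost matrix
odm_df = pd.DataFrame.from_dict(odm_new)
vals = odm_df.pivot_table(index='end_vid', columns='start_vid', values='agg_cost', fill_value=999999)

# .values gives the raw matrix passed to the P-Median functions
data = vals.values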
In this step we will assign weights for each of the ending street segment IDs. In this case we will keep
them all equal at 1, but you could weight these by the number of buildings on the road segment or
something similar depending on your needs (more on this in the example notebook150 ). This will create
a list with each item being "1" that is the same length as the total number of ending road segments.
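This mirrors the two lines you will see in the complete function later in this section:

# One weight of 1 per ending road segment (one row per client in the cost matrix)
clients = data.shape[0]
weights = [1] * clients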
Now we are ready to run our P-Median calculation. The first step is to use the function we imported
from the spopt library, and we will use the from_cost_matrix option with the following positional
arguments:
• numpy.array(data) - Our origin destination matrix as a numpy array
• numpy.array(weights) - The weights list as a numpy array
• p_facilities=int(optimal_facilities) - The ideal number of facilities using the argument from our
original SQL function arguments. We make sure that it is an integer by wrapping it inside the int
function in Python
• name="p-median-network-distance" - The name of the analysis we want to run as specified in
the example notebook151
If you so choose you can add an argument for predefined facilities which is also a numpy array of the
facility ids (in our case the road start IDs) that must be in the solution. This would account for cases
where you want to evaluate adding in new locations without removing any of the existing locations.
This would follow the p_facilities argument.
The last step is to run the solver using the pulp solver we established earlier. We take the variable pmedian_from_cm that stored our data in the previous step and run the .solve() function on that variable, passing the solver variable as the only argument. This will reassign the results to the pmedian_from_cm variable.
150 https://pysal.org/spopt/notebooks/p-median.html#Simulate-points-in-a-network
151 https://pysal.org/spopt/notebooks/p-median.html#Calculating-the-(network-distance)-cost-matrix
pmedian_from_cm = PMedian.from_cost_matrix(
    numpy.array(data),
    numpy.array(weights),
    p_facilities=int(optimal_facilities),
    name="p-median-network-distance" )

pmedian_from_cm = pmedian_from_cm.solve(solver)
Now that we have our results we need to create a data structure that contains the assigned facility, or fire station, for each end road ID. To do this we can access a data structure from our pmedian_from_cm value using the pmedian_from_cm.fac2cli attribute, which returns one entry per facility (in this case the starting road segment ID, or start_vid) with a list of the clients it serves (there will be one end_vid for multiple buildings). Our goal is to create a data structure that we can turn into a DataFrame to later join to our original data. Before we do this, we first need to create a table that contains the end segment ID for each of our stations being evaluated. First we will create our query that will find the nearest road segment to each station and then turn that into a DataFrame we can use later on.
for i in station_ids:
    stations.append(i)

stations_df = pd.DataFrame.from_dict(stations)

cleaned_points = []
Next we will create two for loops. The first will go through each item in the pmedian_from_cm.fac2cli object, and if it has more than 0 values (meaning that facility is included in the solution) it moves to the next step. The inner loop then goes through each client, or end road segment ID (end_vid), and adds it to a dictionary. That dictionary has two values. The first is the facility name from our table of locations under evaluation, which is represented by this line of code:
Listing 4.42: Getting station names and region IDs from the list

# First we filter the DataFrame to find
# the end way ID we are evaluating
z = stations_df[stations_df['end_vid']

# Then we look in the columns in the vals pivot table
== vals.columns[

# Here we use the index position from the pmedian_from_cm.fac2cli
# object to find the value by row index, or i
pmedian_from_cm.fac2cli.index(i)]

# Finally from that list we access the name attribute and get
# the value from index position 0 which represents our facility
# assignment for each building
]['name'].values[0]
Here we are filtering our stations_df DataFrame to find the end_vid that is at the index position of the pmedian_from_cm.fac2cli values. More plainly, we have a DataFrame that has the name of each station and the end_vid for that station. We know that the pmedian_from_cm.fac2cli values have an index position instead of the end_vid. That index position, for example 0, represents the first column of the vals DataFrame, or pivot table. To get the actual value, we use this code:

vals.columns[pmedian_from_cm.fac2cli.index(i)]

vals.columns returns the list of road segment IDs used as columns in the pivot table. To get the right one we use the index value returned by pmedian_from_cm.fac2cli.index(i); .index() allows us to find the index by the row value, in this case represented by "i".
Next we will get the actual value for the end ID, the only issue being that the data structure contains
the index location of that value within the original dataset. This means that if the end road segment ID
"12345" was assigned, we would only get the index which could be "0" if this were the first value in the
dataset. To find the actual ID value, we can use the index location of that value represented by "j" in
our for loop, and find the exact value from the DataFrame we created called vals. We can retrieve the
index values from that DataFrame (remember that this is a pivot table, so the index values are actually
our end IDs), then find the exact value by accessing that specific value using "j". Then we append this
to our empty list, and we are ready to move on.
for i in pmedian_from_cm.fac2cli:
    if len(i) > 0:
        z = stations_df[stations_df['end_vid'] == vals.columns[pmedian_from_cm.fac2cli.index(i)]]['name'].values[0]
        for j in i:

            # Here we can find the end ID by using the index value in the array to
            # find the end ID, here represented by "j"
            struct = {'facility': z, 'end_vid': list(vals.index)[j]}
            cleaned_points.append(struct)
Finally, we create a new DataFrame from this dictionary we created for use later on.
df_startids = pd.DataFrame.from_dict(cleaned_points)
Our next few steps will include recreating our original origin-destination matrix table and turning it
into a DataFrame that contains each original building ID and the end road segment ID it is closest to.
This will allow us to assign the facility or fire station for each individual building. It is important to
note that this code is built for this specific example, meaning that it has the road tag IDs that we want
to exclude hard coded into the query. If you want to make this function fully repeatable you will have
to either limit your pgRouting ways data to only those values or provide another argument in your
function to pass those to the query.
With that said, the below query selects the source as the end road segment ID, the ID of the building,
and the centroid of the building to use as the geometry. It then performs a cross lateral join to find the
nearest road ID from the data in pgRouting. Once that is complete we use another for loop to format
the data into a dictionary which can then be turned into a Pandas DataFrame.
            'pedestrian',
            'steps'
        )
    ORDER BY
        ways.the_geom <-> st_transform(st_centroid(building_footprints.geom), 4326)
    limit
        1
) z
''')

orig_formatted = []
for i in orig_data:
    orig_formatted.append(i)

orig_df = pd.DataFrame.from_dict(orig_formatted)
The last steps are to:
• Merge our original data DataFrame with the results from the P-Median analysis that have been turned into a DataFrame
• Replace any NULL values in numpy with the None value in Python
• Return the new DataFrame, turned into a dictionary, as the return value
With that, we can run the complete code to create our new function in the database:
clients = data.shape[0]
weights = [1] * clients

pmedian_from_cm = PMedian.from_cost_matrix(
    numpy.array(data),
    numpy.array(weights),
    p_facilities=int(optimal_facilities),
    name="p-median-network-distance" )

pmedian_from_cm = pmedian_from_cm.solve(solver)

station_ids = plpy.execute(f'''
    select
        z.source as end_vid,
        {facilities_table}.name
    from
        {facilities_table}
        cross join lateral (
            SELECT
                source
            FROM
                ways
                join configuration c using (tag_id)
            where
                c.tag_value not in (
                    'track',
                    'bridleway',
                    'bus_guideway',
                    'byway',
                    'cycleway',
                    'path',
                    'track',
                    'grade1',
                    'grade2',
                    'grade3',
                    'grade4',
                    'grade5',
                    'unclassified',
                    'footway',
                    'pedestrian',
                    'steps'
                )
            ORDER BY
                ways.the_geom <-> st_transform(st_centroid({facilities_table}.geom), 4326)
            limit
                1
        ) z
    ''')

stations = []

for i in station_ids:
    stations.append(i)

stations_df = pd.DataFrame.from_dict(stations)

cleaned_points = []

for i in pmedian_from_cm.fac2cli:
    if len(i) > 0:
        z = stations_df[stations_df['end_vid'] == vals.columns[pmedian_from_cm.fac2cli.index(i)]]['name'].values[0]
        for j in i:
            struct = {'facility': z, 'end_vid': list(vals.index)[j]}
            cleaned_points.append(struct)

df_startids = pd.DataFrame.from_dict(cleaned_points)

orig_data = plpy.execute(f'''
    select
        z.source as end_vid,
        {clients_table}.ogc_fid,
        st_transform(st_centroid({clients_table}.geom), 4326) as geom
    from
        {clients_table}
        cross join lateral (
            SELECT
                source
            FROM
                ways
                join car_config c using (tag_id)
            where
                c.tag_value not in (
                    'track',
                    'bridleway',
                    'bus_guideway',
                    'byway',
                    'cycleway',
                    'path',
                    'track',
                    'grade1',
                    'grade2',
                    'grade3',
                    'grade4',
                    'grade5',
                    'unclassified',
                    'footway',
                    'pedestrian',
                    'steps'
                )
            ORDER BY
                ways.the_geom <-> st_transform(st_centroid({clients_table}.geom), 4326)
            limit
                1
        ) z
    ''')

orig_formatted = []
for i in orig_data:
    orig_formatted.append(i)

orig_df = pd.DataFrame.from_dict(orig_formatted)

final_df = orig_df.merge(df_startids, how='left', on='end_vid')
final_df = final_df.replace(numpy.nan, None)

return final_df.to_dict(orient='records')
$$
LANGUAGE 'plpython3u';
We can then load this into QGIS to see what the results look like (Figure 4.14):
As you can see there are a few odd results, likely due to the specific road types we selected, but overall we see that almost all the buildings are covered, all using existing locations!
4.3 Build Balanced Territories
In our final optimization exercise, we will be creating balanced territories from polygon data that contains some sort of numeric value, with the goal of creating a target number of territories with equally balanced values between them. In real-world terms, this is a similar process to creating balanced sales territories based on customer locations or balanced political districts based on constituents. This problem is also known as regionalization and, once again, the spopt library from PySAL has tools
for us to do this.
There are several methods that are available to use to create balanced territories:
• Max-P Regionalization: "The max-p problem involves the clustering of a set of geographic areas
into the maximum number of homogeneous regions such that the value of a spatially extensive
regional attribute is above a predefined threshold value"152
• Automatic Zoning Procedure (AZP) algorithm: "AZP can work with different types of objective functions, which are very sensitive to aggregating data from a large number of zones into a pre-designated smaller number of regions."153
152 https://pysal.org/spopt/notebooks/maxp.html
153 https://pysal.org/spopt/notebooks/azp.html
154 https://pysal.org/spopt/notebooks/skater.html
155 https://pysal.org/spopt/notebooks/ward.html
156 https://pysal.org/spopt/notebooks/reg-k-means.html
This table will have an ID, the data column from the original table, an integer that will identify the
region the polygon is a part of, and the geometry.
Next we perform our imports of our Python libraries:
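The import listing itself is not reproduced here; a sketch of the libraries the rest of the function relies on, with names inferred from the complete function at the end of this section:

# Sketch: imports used by the territory-building function
import json
import numpy
import pandas as pd
import geopandas as gpd
import spopt
from shapely import wkt
from libpysal.weights import W
import sklearn.metrics.pairwise as skm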
Here you will notice that we will be using GeoPandas to create a GeoDataFrame, Shapely to manage Well Known Text data, and scikit-learn for the pairwise functions used to calculate distances. Our next step will be to calculate the data structures for the neighbors and weights. The only difference here is that we want contiguous territories, meaning that two polygons that share only one point should not be considered neighbors and therefore not be eligible to be part of the same territory. You can modify this if you want to include those, but in most cases it will lead to odd connections in your territories.
The good news is that we can reuse the code that we wrote to find polygons that share more than one
point in common in this situation:
As you can see we are creating a JSON object containing the ID column and neighboring polygon IDs.
We retrieve those from a CROSS JOIN LATERAL and within that query we use the WHERE clause to exclude
those polygons that only have 1 point in common. We repeat the same process for the weights as well:
Once again, just like our spatial autocorrelation analysis, we will compute the spatial weights object using the libpysal library. However, we will use an additional argument named id_order, which takes a list of IDs specifying the order of the IDs provided in the neighbors and weights objects. To do this we can create a new list and then add the IDs for each entry to that list:
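A sketch of that step, assuming the neighbor and weight query results are held in variables named neighbors and weights, as in the spatial autocorrelation function:

# Sketch: parse the JSON objects, build the id_order list, and create the weights object
neighbors_obj = json.loads(neighbors[0]['neighbors'])
weights_obj = json.loads(weights[0]['weights'])

id_order = []
for key in neighbors_obj:
    id_order.append(key)

w = W(neighbors_obj, weights_obj, id_order=id_order, silence_warnings=True)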
Next we will set up our GeoDataFrame to pass into the SKATER function. First we will run a query that matches the same conditions as our previous query to ensure our data lines up given our WHERE clauses. While this is likely unnecessary, it will cover any edge cases that may cause errors. We only need the geometry column as WKT, the ID column, and the numeric value.
Once that is complete we will set up an empty list to add our values to, which will allow us to create a GeoDataFrame from the new dictionary:
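This mirrors the loop in the complete function below; a sketch, assuming the query results are stored in a variable named to_gdf:

# Sketch: collect the query rows and build a GeoDataFrame from them
gdf_data = []
for i in to_gdf:
    gdf_data.append(i)

gdf = gpd.GeoDataFrame(gdf_data)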
And we will create a true geometry from our WKT geometry using the .apply() method in Pandas to
run the wkt.loads function on each entry.
gdf['geometry'] = gdf['geometry'].apply(wkt.loads)
And finally set the geometry on the GeoDataFrame and assign it the EPSG of 4326:
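A sketch of that step; set_geometry and set_crs are standard GeoPandas methods:

# Sketch: declare the geometry column and assign it EPSG:4326
gdf = gdf.set_geometry('geometry')
gdf = gdf.set_crs(epsg=4326)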
Next we need to create a new variable that contains information about our minimum spanning tree.
This is effectively a dictionary that contains several arguments, which are described below using the
definitions from the PySAL SKATER notebook157 .
• dissimilarity
– A callable distance metric, with the default as sklearn.metrics.pairwise.manhattan_distances158.
• affinity
– A callable affinity metric between 0 and 1, which is inverted to provide a dissim-
ilarity metric. No metric is provided as a default (None). If affinity is desired,
dissimilarity must explicitly be set to None.
• reduction
– The reduction applied over all clusters to provide the map score, with the default
as numpy.sum()159 .
• center
– The method for computing the center of each region in attribute space with the
default as numpy.mean()160 .
• verbose
– A flag for how much output to provide to the user in terms of print statements and
progress bars. Set to 1 for minimal output and 2 for full output. The default is
False, which provides no output
Next we will create a model using the SKATER function from the spopt package. This function has the
following options for arguments:
• gdf: The GeoDataFrame containing our polygon data which was created earlier
• w: The spatial weights object created earlier
• ['col']: This is the numeric value that we will be balancing our regions on, which we have renamed to col in our SQL query to build out the GeoDataFrame
• n_clusters: The number of clusters or regions we want to create which is defined from the input
argument of the function
• floor: The minimum number of polygons any region can have which is defined from the input
argument of the function
• trace: "trace is a bool denoting whether to store intermediate labelings as the tree gets pruned."161
157 https://pysal.org/spopt/notebooks/skater.html
158 https://loc8.cc/sql/scitkit-learn
159 https://numpy.org/doc/stable/reference/generated/numpy.sum.html
160 https://numpy.org/doc/stable/reference/generated/numpy.mean.html
161 https://pysal.org/spopt/notebooks/skater.html
• islands: "The islands keyword argument describes what is to be done with islands. It can be set to
either ’ignore’, which will treat each island as its own region when solving for n_clusters regions,
or ’increase’, which will consider each island as its own region and add to n_clusters regions."162
• spanning_forest_kwds: The object containing the spanning tree which we just created
And to complete the process we can solve our model using this line of code.
model.solve()
Once the model has run it is actually a simple process. We can add a new column to our GeoDataFrame
that contains the labels of the territories using this code:
gdf['regions'] = model.labels_
The model.labels_ object contains the labels for our territories, and by assigning it to gdf['regions'] we can add this as a new column. Finally, we transform our data to a dictionary so that it can be rendered as a table in SQL:
return gdf.to_dict(orient='records')
Below is the complete function which you can add to your database:
162 https://pysal.org/spopt/notebooks/skater.html
        from
            {tablename} b
            cross join lateral (
                select
                    array_agg(z.{id_col}) as neighbors,
                    array_fill(
                        (
                            case
                                when count(z.{id_col}) = 0 then 0
                                else 1 / count(z.{id_col}) :: numeric
                            end
                        ),
                        array [count(z.{id_col})::int]
                    ) as weights
                from
                    {tablename} z
                where
                    st_intersects(z.{geometry}, b.{geometry})
                    and st_npoints(st_intersection(b.{geometry}, z.{geometry})) > 1
                    and z.{id_col} != b.{id_col}
            ) a
        where
            a.neighbors is not null
        ''')

gdf_data = []

for i in to_gdf:
    gdf_data.append(i)

gdf = gpd.GeoDataFrame(gdf_data)

spanning_forest_kwds = dict(
    dissimilarity=skm.manhattan_distances,
    affinity=None,
    reduction=numpy.sum,
    center=numpy.mean,
    verbose=False
)

model = spopt.region.Skater(
    gdf,
    w,
    ['col'],
    n_clusters=n_clusters,
    floor=floor,
    trace=False,
    islands='increase',
    spanning_forest_kwds=spanning_forest_kwds
)

model.solve()

gdf['regions'] = model.labels_

return gdf.to_dict(orient = 'records')
$$
LANGUAGE 'plpython3u';
To test our new function we will create 9 territories with equally balanced population in Brooklyn. To create the table of block groups in Brooklyn we can first run a query that finds the county code in the geoid column: the first two digits of the Census Block Group ID match the state ID of New York State, and the three digits that follow them are the county code we want to use:
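The query itself is not reproduced here; a sketch, using the bklyn_bgs table name that we pass to the function below and the Kings County (Brooklyn) code 047:

-- Sketch only: the county code sits in positions 3-5 of the geoid
create table bklyn_bgs as
select
    *
from
    nys_2021_census_block_groups
where
    substr(geoid, 3, 3) = '047';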
From here we can run our function and create a new table. The arguments are as follows:
• 'bklyn_bgs': The table we just created
• 'geom': Our geometry column
• 'population': The column we want to balance on
• 'geoid': The unique identifier in our dataset
• 9: The total number of regions we want to create
• 150: The minimum number of geometries in any given region, or the floor
As you will see, since we are returning the geometry as Well Known Text, we will also create a new geometry column by using ST_GeomFromText.
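The call itself is not shown in this excerpt; a sketch, with the function name create_territories and the output column names assumed for illustration:

-- Sketch only: the function name and output column names are assumptions
create table bklyn_territories as
select
    id,
    col,
    regions,
    st_geomfromtext(geometry, 4326) as geom
from
    create_territories('bklyn_bgs', 'geom', 'population', 'geoid', 9, 150);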
And once that is complete we can see our newly created regions in QGIS (Figure 4.15, on the following
page):
It is important to test some different options for the number of regions and floor values to see the results
that come back. You can see some of the options in the example notebooks on the PySAL website. In
addition, you can also include multiple variables, although it will require that you modify the code of
the function.
5. Conclusion
Before going further, I first want to thank you, the reader. Without you this book would not be possible, and my only hope is that you found it useful for your work, career, research, or however you are using spatial SQL. I believe that SQL is a critical skill in modern GIS, and within data science and analytics in general, and that it can help you grow your skills and find new opportunities in your career.
SQL continues to be the third or fourth most popular language in Stack Overflow’s annual Developer
Survey. While this doesn’t pertain directly to spatial SQL, the fact that it remains near the top annually
shows the staying power of SQL (Figure 5.1).
Figure 5.1: Top languages from Stack Overflow’s 2023 Developers Survey
I also found that SQL is a critical skill for careers in geospatial and GIS. Using data collected from
Google Jobs focused on "GIS" and "Geospatial" as keywords and data from the wider data space for
roles such as data scientist, data analyst, data engineer, etc. that contain the keyword "geospatial" or
"GIS" in the description, it was clear that SQL is a top skill for these positions (note that these are only
postings in the United States).
SQL is the second top skill across the entire dataset, but for salaried positions it is actually the top skill, as seen below (Figure 5.2, on the following page):
Additionally, it is the second most important skill in both the GIS specific roles and the roles in the
wider "data" role category (Figures 5.3, on the next page and 5.4, on page 507):
Figure 5.4: Top skills for data analyst positions with a geospatial focus
Spatial SQL has the ability to help you accelerate not only your analysis but potentially your career prospects as well. Keep in mind that this reflects the job market as of 2023, and
as spatial SQL continues to grow in adoption and usage, this may very well continue to change and
grow. The question becomes where is spatial SQL going to go in the future? While any prediction like
this can be difficult, I would like to offer my opinions based on the current trends taking place.
We only explored using spatial SQL in one other tool outside a database in this book, which in this
case is GDAL. However, SQL is showing up in more locations as a general purpose language for data
outside a traditional database. Two great examples of this are dbt or Data Build Tool and DuckDB with
WebAssembly or WASM.
dbt is a tool that allows you to create data transformation pipelines using SQL. It follows an ELT, or extract, load, transform, process, which means that raw data is loaded into the database or data warehouse of choice and is then transformed once it has landed in the destination. This differs from ETL, or extract, transform, load, where data is transformed before it is loaded into the destination. The definition from the dbt website adds more detail:
dbt™ is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code
following software engineering best practices like modularity, portability, CI/CD, and documentation. Now
anyone on the data team can safely contribute to production-grade data pipelines.163
This allows you to use the same SQL you do in your database, but with dbt you can create complex
pipelines to transform raw data into reports, aggregated views, clean data, turn latitude and longitude
points to geometries, and much more. This gives you far more control and predictability around your
data and allows you to orchestrate common and regular tasks in a much cleaner way (Figure 5.5, on
the next page).
163 https://www.getdbt.com/product/what-is-dbt
The second tool is DuckDB, which we mentioned earlier in this book, but in this case it is DuckDB with
WebAssembly. WebAssembly or WASM is more or less a virtual environment inside your browser.
Here is the definition from the WebAssembly website:
WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Wasm is
designed as a portable compilation target for programming languages, enabling deployment on the web for client
and server applications.164
For our purposes, this means that you can run DuckDB inside a JavaScript application, meaning that your backend or database is bundled and runs right inside your client-side application. Every time someone opens your application they are running the database, which means that there is no database you have to maintain, build, connect to, or read from.
The best example of this is an application developed by Youssef Harby that allows you to query data
from Overture Maps that is stored in GeoParquet files and show it on the map.165 The app also converts
the data to GeoJSON and allows you to export the results. The impressive part is that there is no
database, server, or any backend infrastructure. Just files that are hosted on the web and queried via
DuckDB in the browser. This removes a major obstacle for many developers and gives them much more freedom in how they build and share spatial applications.
164 https://webassembly.org/
Another major hurdle has been that the maintenance, setup, and development of spatial databases have historically been difficult and required programming skills beyond just spatial SQL. While some steps in this book required knowledge of Docker, basic shell commands, and other languages, the process is overall far easier thanks to the help and hard work of developers who create and maintain Docker containers that abstract away much of this difficult work and make it easy to run a container on any computer or cloud service.
Even more so, DuckDB makes it incredibly simple to start a SQL environment and read data directly from files without importing them into a database like we did throughout this book. I believe that DuckDB will be a core technology for geospatial in the coming years and will make spatial SQL far more accessible to new and current users. Dr. Qiusheng Wu, Associate Professor in the Department of Geography & Sustainability at the University of Tennessee, Knoxville, has already started teaching DuckDB for geospatial in his courses, which are also published on YouTube. In one of his first videos he states the following:
Another trend is that more core geospatial services are starting to integrate with spatial SQL or will
actually run inside the infrastructure dedicated to spatial SQL. There are two practical examples of
this. The first is map tiling, or the creation of web map tiles for use in spatial applications and geoportals. PostGIS already supports the creation of tiles from tables or queries with functions such as ST_AsMVT, and Crunchy Data, a company that supports and creates cloud solutions and tools for PostGIS and PostgreSQL, also maintains a library called pg_tileserv which provides an API to request and retrieve tiles from your database. This means that not only can your data and analysis live in PostGIS,
but your visualization engine can as well.
The other is the CARTO Analytics Toolbox which provides tools and functions that run within the data
warehouse or database infrastructure for PostGIS and data warehouses such as Snowflake, BigQuery,
Redshift, and Databricks. These functions act as stored procedures within the data warehouse that allow the user to leverage the same computing infrastructure as the data warehouse to perform different
spatial functionality. This includes using spatial indexes like H3, creating tiles, connecting to services
like geocoding and routing, and more complex spatial analysis similar to those that we developed, and
more (Figure 5.7).
These trends indicate that more spatial functionality can actually move into a database or a data ware-
house. This is advantageous because the more work you can move closer to your data, the faster the
analysis will be.
The final area of growth is around data. While I believe the term big data will start to fade as tools increasingly support larger data, as evidenced by this post from Jordan Tigani, Founder and CEO of MotherDuck, handling large data has still been a challenge in geospatial due to the complex nature of the geometry column.167
I do believe that this is changing. Within PostGIS there are methods to architect your data so that it scales and remains performant with millions of records; see this post from Paul Ramsey, Executive Geospatial Engineer at Crunchy Data and core member of the PostGIS development team.168 Tools like DuckDB allow you to take this further, into hundreds of millions and potentially billions of records. Data warehouses like those mentioned above take this even further by allocating processing power to queries at scale, and spatial SQL is at the core of each of these solutions.
The final question to explore is what this will look like in the field and in the day-to-day work for
a GIS or geospatial professional and within the academic programs training the next generation of
professionals. First let’s explore what this might look like in the field.
In my opinion, there are a few areas that will see much change in the years to come from a technology
shift:
• We stop thinking about layers and start thinking about tables
• Adding the layer of Geospatial Data Engineer and Analytics Engineer
• Spatial SQL as an analytical engine within a larger GIS
The concept of layers is deeply embedded within the core understanding of GIS, going all the way back to the beginning when GIS meant literally overlaying layers of printed data. Thinking in this way is limiting. It limits the understanding of how data can be transformed, joined, and manipulated to show more in-depth analysis. Layers also imply that the analysis should be visual, in that the reader should interpret the map layer to make the analysis, and the data is only there to be put on the map.
When we start to think of data as a more fluid concept, it can open up our thinking around spatial analysis, but one has to experience this first to see how data can be used more effectively together rather than in separate layers.
I also anticipate that new roles will start to be a part of geospatial teams, namely the Geospatial Data Engineer and the Geospatial Analytics Engineer. Some of these are already starting to show up in job listings and I anticipate that will grow over time. These roles are focused on making data usable and useful within the organization and, as the name implies, with a specialty in spatial data. The Data Engineer focuses on working with raw data and bringing it into a usable environment. Think of reading recurring location pings from a ship or ingesting a new batch of census data. The Analytics Engineer takes that data to create data products such as reports, usable tables, or data ready for upstream analysis. For example, the Analytics Engineer might take the pings and create a report of distance traveled in the last 12 hours that runs every hour. With more data, the teams around this data will naturally need to grow.
Finally, I believe that spatial SQL will become a more integral analytics engine for geospatial and GIS
teams. I don’t think spatial SQL will replace desktop GIS, data science notebooks, web maps, or data
167 https://motherduck.com/blog/big-data-is-dead/
168 https://www.crunchydata.com/blog/performance-and-spatial-joins
visualization. I do however believe that much of the analysis and data prep that takes place in these
tools today will move into the database and spatial SQL world. This means that users of the tools
aforementioned can still continue to use them, but they will use spatial SQL to retrieve prepared data
or create queries on data in the database or data warehouse to bring into those tools. Then they can
use those tools to create more analysis or an end product from that data. This means that spatial SQL
becomes the core of data capture, manipulation, transformation, and analysis, which in turn allows those same users to work with much larger scales of data with superior performance.
In short, I would expect that professional teams will move more of the data-intensive work into tools supported by spatial SQL, new roles will begin to emerge to support that work, and users of data will be able to interact with those new tables or query data as needed.
If we are to reach that point, we also need to ensure that the next generation of GIS and geospatial professionals, in addition to those already in the field, have the opportunity to learn these skills. In short, spatial SQL is generally taught as a database language today, one that allows users to manage, store, and update data in a database system. That is the first concept that needs to change. As you have seen, spatial SQL can support a range of analytical use cases that allow you to scale your analysis and ultimately move faster, so providing more courses and resources that focus on this is the first step.
Second, the courses that do teach this tend to be at a higher level, either in a postgraduate program or a professional certificate program. Sometimes this is appropriate, of course, but creating courses or bringing spatial SQL elements into courses at an undergraduate level will be important in helping it reach a larger audience.
The last is creating more resources to help the wider geospatial community use and understand the
power of spatial SQL. That could mean anything from talks, projects, blog posts, seminars, webinars,
documentation, etc. Once you can see how something is done in a real world use case, it generally
makes it easier for others to see how it can apply to them. Right now we are only scratching the surface,
but great events like PostGIS Day, the Spatial Data Science Conference, the FOSS4G event series, and
others are a great starting point for this.
I think that spatial SQL has a bright future ahead, and I am honored that this book could be a part of
that journey for you. My hope is that it helps you apply these ideas and concepts to your geospatial
work and career, and that you carry it forward as well. Whether that is helping other colleagues, sharing your work, speaking at events, or any other way you see fit, I hope that you continue to be a part of that bright future.
Books from Locate Press
Be sure to visit http://locatepress.com for information on new and upcoming titles.
Introduction to QGIS
GET STARTED WITH QGIS WITH THIS INTRODUCTION COVERING EVERYTHING NEEDED TO GET YOU GOING USING FREE AND OPEN SOURCE GIS SOFTWARE.
This QGIS tutorial, based on the 3.16 LTR version, introduces you to major concepts and techniques to get you started with viewing data, analysis, and creating
maps and reports.
Building on the first edition, the authors take you step-by-step through the process of using the latest map design tools and techniques in QGIS 3. With numerous
new map designs and completely overhauled workflows, this second edition brings
you up to speed with current cartographic technology and trends.
With this book you’ll learn about the QGIS interface, creating, analyzing, and
editing vector data, working with raster (image) data, using plugins and the processing toolbox, and more.
Resources for further help and study and all the data you’ll need to follow along with each chapter are included.
Leaflet Cookbook
COOK UP DYNAMIC WEB MAPS USING THE RECIPES IN THE LEAFLET COOKBOOK.
Leaflet Cookbook will guide you in getting started with Leaflet, the leading open-
source JavaScript library for creating interactive maps. You’ll move swiftly along
from the basics to creating interesting and dynamic web maps.
Even if you aren’t an HTML/CSS wizard, this book will get you up to speed in
creating dynamic and sophisticated web maps. With sample code and complete
examples, you’ll find it easy to create your own maps in no time.
A download package containing all the code and data used in the book is avail-
able so you can follow along as well as use the code as a starting point for your own
web maps.
pgRouting: A Practical Guide
WHAT IS PGROUTING?
It’s a PostgreSQL extension for developing network routing applications and doing graph analysis.
Interested in pgRouting? If so, chances are you already use PostGIS, the spatial
extender for the PostgreSQL database management system.
So when you’ve got PostGIS, why do you need pgRouting? PostGIS is a great tool
for molding geometries and doing proximity analysis, however it falls short when
your proximity analysis involves constrained paths such as driving along a road or
biking along defined paths.
This book will both get you started with pgRouting and guide you into routing, data fixing and costs, as well as
using with QGIS and web applications.