Machine Learning on Geographical Data Using Python: Introduction into Geodata with Applications and Use Cases
Joos Korstanje
VIELS MAISONS, France
About the Author
Joos Korstanje is a data scientist, with over five years
of industry experience in developing machine learning
tools. He has a double MSc in Applied Data Science and
in Environmental Science and has extensive experience
working with geodata use cases. He has worked at a
number of large companies in the Netherlands and France,
developing a variety of machine learning tools. His
experience in writing and teaching has motivated him
to write this book on machine learning for geodata with
Python.
About the Technical Reviewer
Xiaochi Liu is a PhD researcher and data scientist at
Macquarie University, specializing in machine learning,
explainable artificial intelligence, spatial analysis, and their
novel application in environmental and public health. He is
a programming enthusiast using Python and R to conduct
end-to-end data analysis. His current research applies
cutting-edge AI technologies to untangle the causal nexus
between trace metal contamination and human health
to develop evidence-based intervention strategies for
mitigating environmental exposure.
Introduction
Spatial data has long been an ignored data type in general data science and statistics
courses. At the same time, the field of spatial analysis is strongly developed. Due to
differences in tools and approaches, the two fields have long evolved in separate
environments.
As data becomes ever more prominent in business environments, the importance of
treating spatial data correctly is also increasing. The goal of the current book is to bridge the gap
between data science and spatial analysis by covering tools of both worlds and showing
how to use tools from both to answer use cases.
The book starts with a general introduction to geographical data, including data
storage formats, data types, common tools and libraries in Python, and the like. Strong
attention is paid to the specificities of spatial data, including coordinate systems
and more.
The second part of the book covers a number of methods of the field of spatial
analysis. All of this is done in Python. Even though Python is not the most common
tool in spatial analysis, the ecosystem has taken large steps in user-friendliness and has
great interoperability with machine learning libraries. Python with its rich ecosystem of
libraries will be an important tool for spatial analysis in the near future.
The third part of the book covers multiple machine learning use cases on spatial
data. In this part of the book, you see that tools from spatial analysis are combined with
tools from machine learning and data science to realize more advanced use cases than
would be possible in many spatial analysis tools. Specific considerations are needed for
applying machine learning to spatial data, due to the specific nature of coordinates and
other specific data formats of spatial data.
Source Code
All source code used in the book can be downloaded from github.com/apress/
machine-learning-geographic-data-python.
PART I
General Introduction
CHAPTER 1
Introduction to Geodata
Mapmaking and analysis of the geographical environment around us have been present
in nature and human society for a long time. Human maps are well known to all of us:
they are a great way to share information about our environment with others.
Yet communicating geographical information is not an invention of the human
species alone. Bees, for example, are well known to communicate about food sources with
their fellow hive mates. Bees do not make maps, but, just like us, they use a clearly defined
communication system.
As geodata is the topic of this book, I find it interesting to share this out-of-the-box
geodata system used by honeybees. Geodata in the bee world has two components:
distance and direction.
Honeybee distance metrics
–– The round dance: A food source is present less than 50 meters from
the hive.
–– The sickle dance: Food sources are present between 50 and 150
meters from the hive.
–– The waggle (a.k.a. wag-tail) dance: Food sources are over 150 meters
from the hive. In addition, the duration of the waggle dance is an
indicator of how far over 150 meters the source is located.
–– As the sun changes location throughout the day, bees will update
each other by adapting their communication dances accordingly.
Geodata Definitions
To get started, I want to cover the basics of coordinate systems in the simplest
mathematical situation: the Euclidean space. Although the world does not respect
the hypothesis made by Euclidean geometry, it is a great entry into the deeper
understanding of coordinate systems.
Cartesian Coordinates
To locate points in the Euclidean space, we can use the Cartesian coordinate system.
This coordinate system specifies each point uniquely by a pair of numerical coordinates.
For example, look at the coordinate system in Figure 1-2, in which two points are located:
a square and a triangle.
The square is located at x = 2 (horizontal axis) and y = 1 (vertical axis). The triangle
is located at x = -2 and y = -1.
The point where the x and y axes meet is called the origin, and distances are
measured from there. Cartesian coordinates are among the most well-known coordinate
systems and work easily and intuitively in the Euclidean space.
In this schematic drawing, the star is designated as the pole, and the thick black line
to the right is chosen as the polar axis. This system is quite different from the Cartesian
system but still allows us to identify the exact same points: just in a different way.
The points are identified by two components: an angle with respect to the polar axis
and a distance. The square that used to be referred to as Cartesian coordinate (2,1) can
be referred to by an angle from the polar axis and a distance.
At this point, you can measure the distance and the angle and obtain the coordinate
in the polar system. Judged by the eye alone, we could say that the angle is probably
more or less 30° and the distance is slightly above 2. We would need more precise
measurement tools and a more precise drawing to get an exact answer.
There are trigonometric computations that we can use to convert between polar and
Cartesian coordinates. The first set of formulas allows you to go from polar to Cartesian:
The letter r signifies the distance and the letter φ is the angle. You can go the other
way as well, using the following formulas:
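The formulas themselves are the standard ones: to go from polar to Cartesian, x = r·cos(φ) and y = r·sin(φ); to go back from Cartesian to polar, r = √(x² + y²) and φ = atan2(y, x). A minimal Python sketch of both conversions (the function names are only for illustration):

import math

def polar_to_cartesian(r, phi):
    # phi is the angle in radians
    return r * math.cos(phi), r * math.sin(phi)

def cartesian_to_polar(x, y):
    return math.hypot(x, y), math.atan2(y, x)

# the square at (2, 1) sits at roughly distance 2.24 and angle 26.6 degrees
print(cartesian_to_polar(2, 1))
print(polar_to_cartesian(2.24, math.radians(26.6)))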
As a last part to cover about degrees, I want to mention the equivalence between
measuring angles in degrees and in radians. The radian system may seem scary if you
are not used to it, but just remember that for every possible angle that you can measure
(from 0 to 360) there is a corresponding notation in the radian system. Figure 1-5
shows this.
ArcGIS
ArcGIS, made by ESRI, is arguably the most famous software package for working with
Geographic Information Systems. It has a very large number of functionalities that can
be accessed through a user-friendly click-button system, but visual programming of
geodata processing pipelines is also allowed. Python integration is even possible for
those who have specific tasks for which there are no preexisting tools in ArcGIS. Among
its tools are also AI and data science options.
ArcGIS is a great software package for working with geodata. Yet there is one big
disadvantage: it is paid, proprietary software. It is therefore accessible
only to companies or individuals that have no difficulty paying the considerably high
price. Even though it may be worth its price, you’ll need to be able to pay or convince
your company to pay for such software. Unfortunately, this is often not the case.
This approach can be a good fit for your needs if you are not afraid to commit to a
system like QGIS and fill the gaps that you may eventually encounter.
Python/R Programming
Finally, you can use Python or R programming for working with geodata as well.
Programming, especially in Python or R, is a very common skill among data
professionals nowadays.
Programming skills were less widespread a few years back, but the boom in data
science, machine learning, and artificial intelligence has made languages like Python
very common throughout the workforce.
Now that many are able to code or have access to courses to learn how to code, the
need for full-fledged GUI software is decreasing. The availability of a number of
well-functioning geodata packages is enough for many to get started.
Python or R programming is a great tool for treating geodata with common or more
modern methods. By using these programming languages, you can easily apply tools
from other libraries to your geodata, without having to convert this to QGIS modules, for
example.
The only problem that is not very well solved by programming languages is long-term
geodata storage. For this, you will need a database. Fortunately, cloud-based databases
are nowadays relatively easy to set up and manage, so this problem is easily solved.
Shapefile
The shapefile is a very commonly used file format for geodata because it is the standard
format for ArcGIS. The shapefile is not very friendly for being used outside of ArcGIS, but
due to the popularity of ArcGIS, you will likely encounter shapefiles at some point.
The shapefile is not really a single file. It is actually a collection of files that are
stored together in one and the same directory, all having the same name. The following
files make up a shapefile:
–– The .shp file: the main file containing the geometries
–– The .shx file: the index file
–– The .dbf file: the attribute data
–– The .prj file: an optional file describing the coordinate system
As an example, let’s look at an open data dataset containing the municipalities of the
Paris region that is provided by the French government. This dataset is freely available
at https://geo.data.gouv.fr/en/datasets/8fadd7040c4b94f2c318a0971e8faedb7b5675d6.
On this website, you can download the data in SHP/L93 format, which will give you
a zipped directory. Figure 1-6 shows what this contains.
As you can see, there are the .shp file (the main file), the .shx file (the index file), the
.dbf file containing the attributes, and finally the optional .prj file.
For this exercise, if you want to follow along, you can use your local environment or a
Google Colab Notebook at https://colab.research.google.com/.
You have to make sure that geopandas is installed in your environment, for example, by running pip install geopandas.
Then, make sure that in your environment you have a directory called Communes_MGP.shp in which you have the four files:
–– Communes_MGP.shp
–– Communes_MGP.dbf
–– Communes_MGP.prj
–– Communes_MGP.shx
In a local environment, you need to put the "sample_data" folder in the same directory
as the notebook, but when you are working on Colab, you will need to upload the whole
folder to your working environment by clicking the folder icon and then dragging and
dropping the folder there. You can then execute the Python code in Code Block 1-1 to
have a peek inside the data.
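A minimal version of that step, assuming geopandas is installed and using the folder layout described above (adjust the path to wherever you placed the data):

import geopandas as gpd

# geopandas reads the .shp file and picks up the .shx, .dbf, and .prj automatically
shapefile = gpd.read_file('sample_data/Communes_MGP.shp/Communes_MGP.shp')
print(shapefile.head())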
To make something more visual, you can use the code in Code Block 1-2.
shapefile.plot()
You will obtain the map corresponding to this dataset as in Figure 1-8.
Figure 1-8. The map resulting from Code Block 1-2. Image by author
Data source: Ministry of DINSIC. Original data downloaded from https://geo.data.gouv.fr/en/datasets/8fadd7040c4b94f2c318a0971e8faedb7b5675d6, updated on 1 July 2016. Open Licence 2.0 (www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf)
If you try opening the file with a text editor, you'll find that it is an XML file (in short,
XML is a data storage format that can be recognized by its many < and > signs).
Compared to the shapefile, you can see that KML is much easier to understand and
to parse. A part of the file contents is shown in Figure 1-9.
To get a KML file into Python, we can again use geopandas. This time, however, it is
a bit less straightforward. You’ll also need the Fiona package to obtain a KML driver. The
total code is shown in Code Block 1-3.
import fiona
import geopandas as gpd
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
kmlfile = gpd.read_file('Communes_MGP.kml')  # path to the downloaded KML file
You’ll then see the exact same geodataframe as before, which is shown in
Figure 1-10.
As before, you can plot this geodataframe to obtain a basic map containing the
municipalities of the area of Paris and around. This is done in Code Block 1-4.
kmlfile.plot()
Figure 1-11. The plot resulting from Code Block 1-4. Screenshot by author
Data source: Ministry of DINSIC. Original data downloaded from https://geo.data.gouv.fr/en/datasets/8fadd7040c4b94f2c318a0971e8faedb7b5675d6, updated on 1 July 2016. Open Licence 2.0 (www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf)
An interesting point here is that the coordinates do not correspond with the map that
was generated from the shapefile. If you've read the first part of this chapter, you may
have a hunch that this is caused by coordinate systems. We'll get into this in much
more detail in Chapter 2.
GeoJSON
The json format is a data format that is well known and loved by developers. Json is
widely used in communication between different information systems, for example, in
website and Internet communication.
The json format is loved because it is very easy to parse, and this makes it a perfect
storage format for open source and other developer-oriented tools.
Json is a key-value format, much like the dictionary in Python. The whole is
surrounded by curly braces. As an example, I could write myself as a json object
like this:
{ "first_name": "joos",
  "last_name": "korstanje",
  "job": "data scientist" }
As you can see, this is a very flexible format, and it is very easy to adapt to all kinds of
circumstances. You might easily add GPS coordinates like this:
{ "first_name": "joos",
  "last_name": "korstanje",
  "job": "data scientist",
  "latitude": "48.8566° N",
  "longitude": "2.3522° E" }
You can get a GeoJSON file easily into the geopandas library using the code in Code
Block 1-5.
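A minimal sketch of that step, assuming the dataset has been downloaded in GeoJSON format (the file name here is hypothetical; adjust it to your download):

import geopandas as gpd

geojsonfile = gpd.read_file('Communes_MGP.geojson')
print(geojsonfile.head())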
As expected, the data looks exactly like before (Figure 1-13). This is because it is
transformed into a geodataframe, and therefore the original representation as json is not
maintained anymore.
You can make the plot of this geodataframe to obtain a map, using the code in Code
Block 1-6.
geojsonfile.plot()
Figure 1-14. The plot resulting from Code Block 1-6. Image by author
Data source: Ministry of DINSIC. Original data downloaded from https://geo.data.gouv.fr/en/datasets/8fadd7040c4b94f2c318a0971e8faedb7b5675d6, updated on 1 July 2016. Open Licence 2.0 (www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf)
TIFF/JPEG/PNG
Image file types can also be used to store geodata. After all, many maps are 2D images
that lend themselves well to being stored as an image. Some of the standard formats to
store images are TIFF, JPEG, and PNG.
–– The PNG format is another well-known image file format. You can
georeference a PNG as well by using it together with a PGW
(world file).
Image file types are generally used to store raster data. For now, consider that raster
data is image-like (one value per pixel), whereas vector data contains objects like lines,
points, and polygons. We'll get to the differences between raster and vector data in a
later chapter.
On the following website, you can download a GeoTIFF file that contains an
interpolated terrain model of Kerbernez in France:
https://geo.data.gouv.fr/en/datasets/b0a420b9e003d45aaf0670446f0d600df14430cb
You can use the code in Code Block 1-7 to read and show the raster file in Python.
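A minimal version of that code, using the rasterio library (the file name is the one from the downloaded dataset; see also the note below about the extension):

import rasterio
from rasterio.plot import show

img = rasterio.open('ore-kbz-mnt-litto3d-5m.tif')
show(img)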
Note Depending on your OS, you may obtain a .tiff file format rather than a
.tif when downloading the data. In this case, you can simply change the path to
become .tiff, and the result should be the same. In both cases, you will obtain the
image shown in Figure 1-15.
Figure 1-15. The plot resulting from Code Block 1-7. Image by author
Data source: Ministry of DINSIC. Original data downloaded from https://geo.data.gouv.fr/en/datasets/b0a420b9e003d45aaf0670446f0d600df14430cb, updated on "unknown." Open Licence 2.0 (www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf)
It is interesting to look at the coordinates and observe that this file's coordinate
values are relatively close to those of the first file.
CSV/TXT/Excel
The same file as used in the first three examples is also available in CSV. When
downloading it and opening it with a text viewer, you will observe something like
Figure 1-16.
The important thing to take away from this part of the chapter is that geodata is "just
data," but with geographic references. These can be stored in different formats and in
different coordinate systems, which can make things complicated, but in the end you must
simply make sure that you understand what you have in your data.
You can use many different tools for working with geodata. The goal of those tools
is generally to make your life easier. As a last step for this introduction, let’s have a short
introduction to the different Python tools that you may encounter on your geodata
journey.
Key Takeaways
1. Cartesian coordinates and polar coordinates are two alternative
coordinate systems that can indicate points in a two-dimensional
Euclidean space.
CHAPTER 2
Coordinate Systems
and Projections
In the previous chapter, you have seen an introduction to coordinate systems. You saw
an example of how you can use Cartesian coordinates as well as polar coordinates to
identify points on a flat, two-dimensional Euclidean space. It was already mentioned at
that point that the real-world scenario is much more complex.
When you are making maps, you are showing things (objects, images, etc.) that are
located on earth. Earth does not respect the rules that were shown in the Euclidean
example because Earth is an ellipsoid: a ball form that is not perfectly round. This makes
map and coordinate system calculations much more complex than what high-school
mathematics teaches us about coordinates.
To make the problem clearer, let’s look at an example of airplane navigation.
Airplane flights are a great example to illustrate the problem, as they generally cover
long distances. Taking into account the curvature of the earth really doesn’t matter much
when measuring the size of your terrace, but it does make a big impact when moving
across continents.
Imagine you are flying from Paris to New York using this basic sketch of the world's
geography. You are probably well aware of such an organization of the world's map on a
two-dimensional image.
A logical first impression would be that to go from Paris to New York in the
quickest way, we should follow a line parallel to the latitude lines. Yet (maybe
surprisingly at first) this is not the shortest path. An airplane would do better to curve via
the north!
The reason for this is that the more you move to the north, the shorter the latitude
lines actually are. Latitude lines go around the earth, so at the North Pole you have a
length of zero, and at the equator, the middle line is the longest possible. The closer to
the poles, the shorter the distance to go around the earth.
As this example takes place in the northern hemisphere, the closest pole is the North
Pole. By curving north on the northern hemisphere (toward the pole), an airplane can
get to its destination with fewer kilometers. Figure 2-1 illustrates this.
Let’s now consider an example where you are holding a round soccer ball. When
going from one point to another on a ball, you will intuitively be able to say which path is
the fastest. If you are looking straight at the ball, when following your finger going from
one point to another, you will see your hand making a shape like in Figure 2-2.
Figure 2-2. The shortest path on a ball is not a straight line in two-dimensional
view. Image by author
When making maps, we cannot plot in three dimensions, and we therefore need to
find some way to put a three-dimensional path onto a two-dimensional image.
Many map makers have proposed all sorts of ways to solve this unsolvable problem,
and the goal of this chapter is to help you understand how to deal effectively with the
3D-to-2D mapping distortions that will continuously complicate your work on geodata.
Coordinate Systems
While the former discussion was merely intuitive, it is now time to slowly get to more
official definitions of the concepts that you have seen. As we are ignoring the height of
a point (e.g., with respect to sea level) for the moment, we can identify three types of
coordinate systems:
There are more specific definitions that define the WGS 84, yet at this point, the
information becomes very technical. To quote from the Wikipedia page of the WGS 84:
The WGS 84 datum surface is an oblate spheroid with equatorial radius a
= 6378137 m at the equator and flattening f = 1/298.257223563. The refined
value of the WGS 84 gravitational constant (mass of Earth's atmosphere
included) is GM = 3986004.418 × 10⁸ m³/s². The angular velocity of the
Earth is defined to be ω = 72.92115 × 10⁻⁶ rad/s.
You are absolutely not required to memorize any of those details. I do hope that it
gives you an insight into how detailed a definition of Geographic Coordinate Systems
has to be. This explains how it is possible that other people and organizations have
identified alternate definitions. This is why there are many coordinate systems out there
and also one of the reasons why working with geospatial data can be hard to grasp in the
beginning.
X and Y Coordinates
When working with Projected Coordinate Systems, we do not talk about latitude and
longitude anymore. Latitude and longitude are relevant only for measurements on the
globe (ellipsoid); on a flat surface, we can drop this complexity. Once the
three-dimensional lat/long coordinates have been converted to the coordinates of their
projection, we simply talk about x and y coordinates.
X is generally the distance to the east starting from the origin and y the distance to
the north starting from the origin. The location of the origin depends on the projection
that you are using. The measurement unit also changes from one Projected Coordinate
System to another.
The Albers equal area conic projection takes a very different approach, as it is conic.
Making conic maps is often done to make some zones better represented. The Albers
equal area conic projection, also called the Albers projection, projects the world onto a
two-dimensional map while respecting areas, as shown in Figure 2-4.
Figure 2-4. The world seen in an Albers equal area conic projection
Source: https://commons.wikimedia.org/wiki/File:World_borders_albers.png. Public Domain
Conformal Projections
If shapes are important for your use case, you may want to use a conformal projection.
Conformal projections are designed to preserve shapes. At the cost of distorting the areas
on your map, this category of projections guarantees that all of the angles are preserved,
and this makes sure that you see the “real” shapes on the map.
Mercator
The Mercator map is very well known, and it is the standard map projection for many
projects. Its advantage is that it has north on top and south on the bottom while
preserving local directions and shapes.
Unfortunately, locations far away from the equator are strongly inflated, for example,
Greenland and Antarctica, while zones on the equator look too small in comparison
(e.g., Africa).
The Lambert conformal conic projection is another conformal projection, meaning that
it also respects local shapes. This projection is less widespread because it is a conic map,
and conic maps have never become as popular as rectangular ones. However, it does just
as well at plotting the earth while preserving shapes, and it has fewer problems with size
distortion. It looks as shown in Figure 2-6.
Equidistant Projections
As the name indicates, you should use equidistant projections if you want a map
that respects distances. In the two previously discussed projection types, there is no
guarantee that distance between two points is respected. As you can imagine, this will be
a problem for many use cases. Equidistant projections are there to save you if distances
are key to your solution.
The equidistant conic projection is another conic projection, but this time it preserves
distance. It is also known as the simple conic projection, and it looks as shown in
Figure 2-8.
One example of an azimuthal projection is the Lambert equal area azimuthal. As the
name indicates, it is not just azimuthal but also equal area. The world according to this
projection looks as shown in Figure 2-9.
Figure 2-9. The world seen in a Lambert equal area azimuthal projection
Source: https://commons.wikimedia.org/wiki/File:Lambert_azimuthal_equal-area_projection_of_world_with_grid.png. Public Domain
In the two-point equidistant projection, distances from two chosen reference points
are guaranteed to be at the same distance as the scale of the map. As an example, you
can see in Figure 2-10 the two-point equidistant projection of Eurasia with two points
from which all distances are respected. It is also azimuthal.
One key takeaway here is that metadata is a crucial part of geodata. Sending datasets
with coordinates while failing to mention details on the coordinate system used is very
problematic. At the same time, if you are on the receiving end, stay critical of the data
you receive, and pay close attention to whether or not you are on the right coordinate
system. Mistakes are easily made and can be very impactful.
www.google.com/maps/d/edit?mid=1phChS9aNUukXKk2MwOQyXvksRk-HTOdZ&usp=sharing
Then you can use the code in Code Block 2-2 to import your map and show the data
that is contained within it.
import fiona
import geopandas as gpd
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
kmlfile = gpd.read_file("the/path/to/the/exported/file.kml")
print(kmlfile)
You’ll find that there is just one line in this dataframe and that it contains a polygon
called France. Figure 2-11 shows this.
We can inspect that polygon in more detail by extracting it from the dataframe using
the code in Code Block 2-3.
print(kmlfile.loc[0,'geometry'])
You will see that the data of this polygon is a sequence of coordinates indicating the
contours. This looks like Code Block 2-4.
Code Block 2-4. The resulting geometry, output of Code Block 2-3
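Plotting the polygon works the same way as in the earlier examples; a minimal version of that plotting step:

kmlfile.plot()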
Figure 2-12. The map resulting from Code Block 2-5. Image by author
You will obtain a map of the polygon. You should recognize the exact shape of the
polygon, as it was defined in your map or in the example map, depending on which one
you used.
kmlfile.crs
You’ll see the result like in Figure 2-13 being shown in your notebook.
It may be interesting to see what happens when we plot the map into a very different
coordinate system. Let’s try to convert this map into a different coordinate system using
the geopandas library. Let’s change from the geographic WGS 84 into the projected
Europe Lambert conformal conic map projection, which is also known as ESRI:102014.
The code in Code Block 2-7 makes the transformation from the source coordinate
system to the target coordinate system.
proj_kml = kmlfile.to_crs('ESRI:102014')
proj_kml
Figure 2-14. The resulting dataframe from Code Block 2-7. Image by author
import matplotlib.pyplot as plt

proj_kml.plot()
plt.title('ESRI:102014 map')
Figure 2-15. The plot resulting from Code Block 2-8. Image by author
The coordinate systems have very different x and y values. To see differences in
shape and size, you will have to look very closely. You can observe a slight difference in
the way the angle on the left is made. The pointy bit on the left is pointing more toward
the bottom in the left map, whereas it is pointing a bit more to the top in the right map.
This is shown in Figure 2-16.
Figure 2-16. Showing the two coordinate systems side by side. Image by author
Although differences here are small, they can have a serious effect on your
application. It is important to understand here that none of the maps are “wrong.” They
just use a different mathematical formula for projecting a 3D curved piece of land onto a
2D image.
Key Takeaways
1. Coordinate systems are mathematical descriptions of the earth
that allow us to communicate precisely about locations.
2. Many coordinate systems exist, and each has its own advantages
and imperfections. You must choose a coordinate system
depending on your use case.
CHAPTER 3
Geodata Data Types
For raster data, the storage is generally image-like. As explained before, each pixel
has a value. It is therefore common to store the data as a two-dimensional table in which
each row represents a row of pixels, and each column represents a column of pixels.
The values in your data table represent the values of one and only one variable. Working
with raster data can be a bit harder to get into, as this image-like data format is not very
accommodating to adding additional data. Figure 3-4 shows an example of this.
We will now get to an in-depth description of each of the data types that you are
likely to encounter, and you will see how to work with them in Python.
Points
The simplest data type is probably the point. You have seen some examples of point data
throughout the earlier chapters, and you have seen before that the point is one of the
subtypes of vector data.
Points are part of vector data, as each point is an object on the map that has its own
coordinates and that can have any number of attributes necessary. Point datasets are
great for identifying locations of specific landmarks or other types of locations. Points
cannot store anything like the shape or the size of landmarks, so it is important that you
use points only if you do not need such information.
Definition of a Point
In mathematics, a point is generally said to be an exact location that has no length,
width, or thickness. This is an interesting and important concept to understand about
point data, as in geodata, the same is true.
A point consists only of one exact location, indicated by one coordinate pair (be it x
and y, or latitude and longitude). Coordinates are numerical values, meaning that they
can take an infinite number of decimals. The number 2.0, for example, is different than
2.1. Yet 2.01 is also different, 2.001 is a different location again, and 2.0001 is another,
different, location.
Even if two points are very close to each other, it would theoretically not be correct
to say that they are touching each other: as long as they are not in the same location,
there will always be a small distance between the points.
Another consideration is that if you have a point object, you cannot tell anything
about its size. Although you could make points larger and smaller on the map, your point
still stays at size 0. It is really just a location.
Of course, this is an extract, and the real list of variables about the squirrels is much
longer. What is interesting to see is how the KML data format has stored point data just
by having coordinates with it. Python (or any other geodata tool) will recognize the
format and will be able to automatically import this the right way.
To import the data into Python, we can use the same code that was used in the
previous chapter. It uses Fiona and geopandas to import the KML file into a geopandas
dataframe. The code is shown in Code Block 3-1.
import fiona
import geopandas as gpd
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
kmlfile = gpd.read_file("2018 Central Park Squirrel Census - Squirrel Data.kml")
print(kmlfile)
You will see the dataframe, containing geometry, being printed as shown in
Figure 3-6.
You can clearly see that each line is noted as follows: POINT (coordinate coordinate).
The coordinate system should be located in the geodataframe’s attributes, and you can
look at it using the code in Code Block 3-2.
kmlfile.crs
You’ll see the info about the coordinate system being printed, as shown in Figure 3-7.
Figure 3-7. The output from Code Block 3-2. Image by author
Data source: NYC OpenData. 2018 Central Park Squirrel Census
You can plot the map to see the squirrel sightings on the map using the code in Code
Block 3-3. It is not very pretty for now, but additional visualization techniques will be
discussed in Chapter 4. For now, let’s focus on the data formats using Code Block 3-3.
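A minimal version of that plotting step:

kmlfile.plot()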
kmlfile.columns
You’ll see that only the data shown in Figure 3-9 has been successfully imported.
Figure 3-9. The output from Code Block 3-4. Image by author
Now, this would be a great setback in any point-and-click geodata program, but as we
are using Python, we have full autonomy to find a way around this problem. I am
not saying that it is great that we have to parse the XML ourselves, but at least we are not
blocked at this point.
XML parsing can be done using the xml library. XML is a tree-based data format, and
using the xml element tree, you can loop through the different levels of the tree and drill
down into it. Code Block 3-5 shows how to do this.
import xml.etree.ElementTree as ET
tree = ET.parse("2018 Central Park Squirrel Census - Squirrel Data.kml")
root = tree.getroot()
ns = '{http://www.opengis.net/kml/2.2}'  # KML namespace; the structure below is assumed
df = []
for placemark in root.iter(ns + 'Placemark'):
    elementdata = placemark.find(ns + 'ExtendedData')
    df_row = []
    for x in elementdata:
        df_row.append(x[0].text)  # first child of each data element holds the value
    df.append(df_row)
We can now (finally) apply our filter on the shift column, using the code in Code
Block 3-6.
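Assuming the parsed rows have been loaded into a pandas DataFrame called squirrel_df that contains (among others) the columns x, y, and shift with the values "AM" and "PM" (these names are hypothetical and depend on how you named the parsed columns), the filter could look like this:

AM_data = squirrel_df[squirrel_df['shift'] == 'AM']
PM_data = squirrel_df[squirrel_df['shift'] == 'PM']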
To make the plots, we have to go back to a geodataframe again. This can be done by
combining the variables x and y into a point geometry as shown in Code Block 3-7.
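Continuing with the hypothetical AM_data and PM_data frames from the sketch above, the x and y columns can be combined into point geometries along these lines (the float conversions are there because values parsed from XML are strings):

import geopandas as gpd
from shapely.geometry import Point

AM_geodata = gpd.GeoDataFrame(
    AM_data, geometry=[Point(float(x), float(y)) for x, y in zip(AM_data['x'], AM_data['y'])])
PM_geodata = gpd.GeoDataFrame(
    PM_data, geometry=[Point(float(x), float(y)) for x, y in zip(PM_data['x'], PM_data['y'])])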
AM_geodata.plot()
plt.title('AM squirrels')
PM_geodata.plot()
plt.title('PM squirrels')
The result is shown in Figure 3-11. You now have the maps necessary to investigate
differences in AM and PM squirrels. Again, visual parameters can be improved here,
but that will be covered in Chapter 4. For now, we focus on the data types and their
possibilities.
Figure 3-11. The maps resulting from Code Block 3-8 Image by author
Data source: NYC OpenData. 2018 Central Park Squirrel Census
Lines
Line data is the second category of vector data in the world of geospatial data. Lines are
the logical next step after points. Let's get into the definitions straight away.
Definition of a Line
Lines are also well-known mathematical objects. In mathematics, we generally consider
straight lines that go from one point to a second point. Lines have no width, but they do
have a length.
In geodata, line datasets contain not just one line, but many lines. Line segments are
straight, and therefore they only need a from point and a to point. This means that a line
segment needs two coordinate pairs (one for the first point and one for the second point).
Lines consist of multiple line segments, and they can therefore take different forms,
consisting of straight line segments and multiple points. Lines in geodata can therefore
represent the shape of features in addition to length.
import pandas as pd
flights_data = pd.read_csv('flights.csv')
flights_data
geolookup = pd.read_csv('airports.csv')
geolookup
As you can see inside the data, the airports.csv is a file with geolocation information,
as it contains the latitude and longitude of all the referenced airports. The flights.
csv contains a large number of airplane routes in the USA, identified by origin and
destination airport. Our goal is to convert the routes into georeferenced line data: a line
with a from and to coordinate for each airplane route.
Let’s start by converting the latitude and longitude variables into a point, so that the
geometry can be recognized in further operations. The following code loops through the
rows of the dataframe to generate a new variable. The whole operation is done twice, so
as to generate a "to/destination" lookup dataframe and a "from/source" lookup dataframe.
This is shown in Code Block 3-11.
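A sketch of what that looks like, using the column names as they appear in airports.csv (IATA_CODE, LATITUDE, LONGITUDE); the lookup variable names geolookup_from and geolookup_to are just illustrative:

from shapely.geometry import Point

# build one Point geometry per airport
points = []
for _, row in geolookup.iterrows():
    points.append(Point(row['LONGITUDE'], row['LATITUDE']))
geolookup['geometry'] = points

# one lookup for the origin ("from/source") and one for the destination ("to")
geolookup_from = geolookup[['IATA_CODE', 'geometry']].rename(
    columns={'geometry': 'geometry_from'})
geolookup_to = geolookup[['IATA_CODE', 'geometry']].rename(
    columns={'geometry': 'geometry_to'})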
As the data types are not aligned, the easiest hack here is to convert all the airport
codes to strings. There are some missing codes, which would be better addressed by
inspecting the data quality issues, but for this introductory example, the string conversion
does the job for us. You can also see that some columns are dropped here. This is done in
Code Block 3-12.
flights_data['ORIGIN_AIRPORT'] = flights_data['ORIGIN_AIRPORT'].map(str)
flights_data['DESTINATION_AIRPORT'] = flights_data['DESTINATION_AIRPORT'].map(str)
flights_data = flights_data[['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']]
We now get to the step to merge the dataframes of the flights together with the from
and to geographical lookups that we just created. The code in Code Block 3-13 merges
two times (once with the from coordinates and once with the to coordinates).
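Using the hypothetical lookup frames from the earlier sketch, the two merges could look like this:

flights_data = flights_data.merge(geolookup_from, how='left',
                                  left_on='ORIGIN_AIRPORT', right_on='IATA_CODE')
flights_data = flights_data.merge(geolookup_to, how='left',
                                  left_on='DESTINATION_AIRPORT', right_on='IATA_CODE')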
After running this code, you will end up with a dataframe that still contains one row
per route, but it has now got two georeference columns: the from coordinate and the to
coordinate. This result is shown in Figure 3-14.
Figure 3-14. The dataframe resulting from Code Block 3-13. Image by author
Data source: www.kaggle.com/usdot/flight-delays, Public Domain
The final step of the conversion process is to make lines out of this to and from
points. This can be done using the LineString function as shown in Code Block 3-14.
from shapely.geometry import LineString

lines = []
for i, row in flights_data.iterrows():
    try:
        point_from = row['geometry_from']
        point_to = row['geometry_to']
        lines.append(LineString([point_from, point_to]))
    except:
        # some data lines are faulty so we ignore them
        pass
You will end up with a new geometry variable that contains only
LINESTRINGS. Inside each LINESTRING, you see the four values for the two coordinates
(x and y from, and x and y to). This is shown in Figure 3-15.
Now that you have created your own line dataset, let’s make a quick visualization as a
final step. As before, you can simply use the plot functionality to generate a basic plot of
your lines. This is shown in Code Block 3-15.
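A minimal way to do that is to wrap the list of LineStrings in a geodataframe and plot it (the variable name is only for illustration):

import geopandas as gpd

lines_geodata = gpd.GeoDataFrame(geometry=lines)
lines_geodata.plot()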
You should now obtain the map of the USA given in Figure 3-16. You clearly see
all the airplane trajectories expressed as straight lines. Clearly, not all of it is correct
as flights do not take a straight line (as seen in a previous chapter). However, it gives a
good overview of how to work with line data, and it is interesting to see that we can even
recognize the USA map by just using flight lines (with some imagination).
Figure 3-16. Plot resulting from Code Block 3-15. Image by author
Data source: www.kaggle.com/usdot/flight-delays, Public Domain
Polygons
Polygons are the next step in complexity after points and lines. They are the third and
last category of vector geodata.
Definition of a Polygon
In mathematics, polygons are defined as two-dimensional shapes, made up of lines
that connect to make a closed shape. Examples are triangles, rectangles, pentagons,
etc. A circle is not officially a polygon as it is not made up of straight lines, but you could
imagine a lot of very small straight lines being able to approximate a circle relatively well.
In geodata, the definition of the polygon is not much different. It is simply a list
of points that together make up a closed shape. Polygons are generally a much more
realistic representation of the real world. Landmarks are often identified by points, but as
you get to a very close-up map, you would need to represent the landmark as a polygon
(the contour) to be useful. Roads could be well represented by lines (remember that
lines have no width) but would have to be replaced by polygons once the map is zoomed
in far enough to see individual houses, roads, etc.
Polygons are the data type that has the most information as they are able to store
location (just like points and lines), length (just like lines), and also area and perimeter.
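As example data, the world countries dataset that ships with geopandas can be used; a minimal way to load it (the variable name matches the plot call further down):

import geopandas as gpd

geojsonfile = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
print(geojsonfile.head())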
You'll see the content of the polygon dataset in Figure 3-17. It contains some
polygons and some multipolygons (polygons that consist of multiple polygons; e.g., the
USA includes Alaska, which is not connected to the rest of its land, so multiple polygons
are needed to describe its territory).
You can easily create a map, as we did before, using the plot function. This is
demonstrated in Code Block 3-17, and this time, it will automatically plot the polygons.
geojsonfile.plot()
Figure 3-18. The plot of polygons as created in Code Block 3-17. Image by author
Source: geopandas, BSD 3 Clause Licence
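Computing the area of each polygon and sorting by it gives the smallest countries; a minimal sketch (note that areas computed directly on an unprojected lat/long system are only indicative):

geojsonfile['area'] = geojsonfile.geometry.area
geojsonfile.sort_values('area').head(10)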
In Figure 3-19, you’ll see the first ten rows of this data, which are the world’s smallest
countries in terms of surface area.
Figure 3-19. The first ten rows of the data. Image by author
Source: geopandas, BSD 3 Clause Licence
We can also compute the length of the borders by calculating the length of each
polygon's contour. The length attribute allows us to do so. You can use the code in Code
Block 3-19 to identify the ten countries with the longest contours.
Code Block 3-19. Identify the ten countries with longest contours
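A minimal sketch of that computation, using the length attribute of the geometries:

geojsonfile['perimeter'] = geojsonfile.geometry.length
geojsonfile.sort_values('perimeter', ascending=False).head(10)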
You'll see the result in Figure 3-20, with Antarctica being the winner. Be careful
though, as this may be distorted by coordinate system choice. You may remember that
some commonly used coordinate systems have strong distortions toward the poles and
make more central locations smaller. This could influence the types of computations that
are being done here. If a very precise result is needed, you’d need to tackle this question,
but for a general idea of the countries with the longest borders, the current approach
will do.
Figure 3-20. Dataset resulting from Code Block 3-19. Image by author
Rasters/Grids
Raster data, also called grid data, is the counterpart of vector data. If you're used to
working with digital images in Python, you might find raster data quite similar. If you're
used to working with dataframes, it may be a bit more abstract and take a moment to get
used to.
https://geo.data.gouv.fr/en/datasets/b0a420b9e003d45aaf0670446f0d600df14430cb
You can use the code in Code Block 3-20 to read and show the raster file in Python.
import rasterio
griddata = r'ore-kbz-mnt-litto3d-5m.tif'
img = rasterio.open(griddata)
matrix = img.read()
matrix
As you can see in Figure 3-21, this data looks nothing like a geodataframe
whatsoever. Rather, it is just a matrix full of the values of the one (and only one) variable
that is contained in this data.
You can plot this data using the default color scale, and you will see what this
numerical representation actually contains. As humans, we are particularly bad at
reading and interpreting something from a large matrix like the one earlier, but when we
see it color-coded into a map, we can get a much better feeling of what we are looking at.
The code in Code Block 3-21 does exactly that.
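A minimal version of that plot, reusing the matrix that was read in Code Block 3-20:

import matplotlib.pyplot as plt

plt.imshow(matrix[0])  # first band, shown with the default color scale
plt.colorbar()
plt.show()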
Raster data is a bit more limited than vector data in terms of adding data to it. Adding
more variables would be quite complex, except by making the array three-dimensional,
where the third dimension contains additional data. However, for plotting, this would not be of
any help, as the plot color would still be one color per pixel, and you could never show
multiple variables for each pixel with this approach.
Raster data is still a very important data type that you will often need and use.
Any value that needs to be measured across a large area is better suited to a raster.
Examples like height maps, pollution maps, density maps, and much more are all only
solvable with rasters. Raster use cases are generally a bit more mathematically complex,
as they often use a lot of matrix computations. You’ll see examples of these mathematical
operations throughout the later chapters of the book.
Key Takeaways
1. There are two main categories of geodata: vector and raster. They
have fundamentally different ways of storing data.
2. Vector data stores objects and stores the geospatial references for
those objects.
3. Raster data cuts an area into equal-sized squares and stores a data
value for each of those squares.
4. There are three main types of vector data: point, line, and polygon.
5. Points are zero-dimensional, and they have no size. They are only
indicated by a single x,y coordinate. Points are great for indicating
the location of objects.
CHAPTER 4
Creating Maps
Mapmaking is one of the earliest and most obvious use cases of the field of geodata.
Maps are a special form of data visualization: they have a lot of standards and are
therefore easily recognizable and interpretable for almost anyone.
Just like other data visualization methods, maps are a powerful tool to share
a message about a dataset. Visualization tools are often wrongly interpreted as an
objective depiction of the truth, whereas in reality, map makers and visualization
builders have great power over what to put on the map and what to leave out.
An example is color scale picking on maps. People are so familiar with some
visualization techniques that when they see them, they automatically believe them.
Imagine a map showing pollution levels in a specific region. If you wanted
people to believe that pollution is not a big problem in the area, you could build and
share a map that shows areas with low pollution as dark green and very strongly polluted
areas as light green. Add to that a small, unreadable legend, and people will easily
conclude that there is no big pollution problem.
If you want to argue the other side, you could publish an alternative map that shows
the exact same values but depicts strong pollution as dark red and slight pollution
as light red. When people see this map, they will immediately be tempted to conclude that
pollution is a huge problem in your area and that it needs immediate action.
It is important to understand that there is no single truth in choosing a visualization.
There are, however, a number of levers in mapmaking that you should master well in
order to create maps for your specific purpose. Whether your purpose is making objective
maps, beautiful maps, or communicating a message, there are a number of tools and best
practices that you will discover in this chapter. Those are important to remember when
making maps and will come in handy when interpreting maps as well.
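The examples below again use the world countries dataset that ships with geopandas; a minimal way to load it (the variable name world matches the later code blocks):

import geopandas as gpd
import matplotlib.pyplot as plt

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head()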
Once you execute this code, you’ll see the first five lines of the geodataframe
containing the world’s countries, as displayed in Figure 4-1.
For this example, we’ll make a map that is color-coded: colors will be based on the
area of the countries. To get there, we need to add a column to the geodataframe that
contains the countries’ areas. This can be obtained using Code Block 4-2.
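A minimal sketch of that step (computing the area directly on the unprojected geometries is approximate, but good enough for a color-coding example):

world['area'] = world.geometry.area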
If you now look at the dataframe again, you’ll see that an additional column is indeed
present, as shown in Figure 4-2. It contains the area of each country and will help us in
the mapmaking process.
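The most basic starting point is to call the plot method without any arguments:

world.plot()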
If you do this, you’ll obtain a plot that just contains the polygons, just like in
Figure 4-3. There is no additional color-coding going on.
As the goal of our exercise is to color-code countries based on their total area, we’ll
need to start improving on this map with additional plotting parameters.
Adding color-coding to a plot is fairly simple using geopandas and matplotlib. The
plot method can take an argument column, and when specifying a column name there,
the map will automatically be color-coded based on this column.
In our example, we want to color-code with the newly generated variable called
area, so we’ll need to specify column=’area’ in the plot arguments. This is done in Code
Block 4-4.
world.plot(column='area', cmap='Greys')
You will see the black and white coded map as shown in Figure 4-4.
Figure 4-4. The grayscale map resulting from Code Block 4-4. Image by author
Data source: geopandas, BSD 3 Clause Licence
Plot Title
Let’s continue working on this map a bit more. One important thing to add to any
visualization, including maps, is a title. A title will allow readers to easily understand
what the goal of your map is.
When making maps with geopandas and matplotlib, you can use the matplotlib
command plt.title to easily add a title on top of your map. The example in Code Block 4-5
shows you how it’s done.
world.plot(column='area', cmap='Greys')
plt.title('Area per country')
You will obtain the map in Figure 4-5. It is still the same map as before, but now has a
title on top of it.
Plot Legend
Another essential part of maps (and other visualizations) is to add a legend whenever
you use color or shape encodings. In our map, we are using color-coding to show the
area of the countries in a quick visual manner, but we have not yet added a legend. It can
therefore be confusing for readers of the map to understand which values are high areas
and which indicate low areas.
In the code in Code Block 4-6, the plot method takes two additional arguments.
Legend is set to True to generate a legend. The legend_kwds takes a dictionary with
some additional parameters for the legend. The label will be the label of the legend, and
the orientation is set to horizontal to make the legend appear on the bottom rather than
on the side. A title is added at the end of the code, just like you saw in the previous part.
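A sketch of what that call looks like, based on the description above (the exact legend label text is an assumption):

world.plot(column='area', cmap='Greys', legend=True,
           legend_kwds={'label': 'Area per country',
                        'orientation': 'horizontal'})
plt.title('Area per country')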
This is the final version of this map for the current example. The map does a fairly
good job at representing a numerical value for different countries. This type of use
case is easily solvable with geopandas and matplotlib. Although it may not be the most
aesthetically pleasing map, it is perfect for analytical purposes and the like.
cities = gpd.read_file(gpd.datasets.get_path('naturalearth_cities'))
cities.head()
When executing this code, you’ll see the first five lines of the dataframe, just like
shown in Figure 4-7. The column geometry shows the points, which are two coordinates
just like you have seen in earlier chapters.
You can easily plot this dataset with the plot command, as we have done many times
before. This is shown in Code Block 4-8.
cities.plot()
You will obtain a map with only points on it, as shown in Figure 4-8.
This plot is really not very readable. We need to add a background to this for more
context. We can use the world's countries for this, using only the borders of the countries
and leaving the interior white.
The code in Code Block 4-9 does exactly that. It starts with creating the fig and ax
and then sets the aspect to "equal" to make sure that the overlay will not cause any
mismatching. The world (country polygons) is then plotted using the color white to
make it seem see-through, followed by the cities with marker='x' for cross-shaped
markers and color='black' for black color.
fig, ax = plt.subplots()
ax.set_aspect('equal')
world.plot(ax=ax, color='white', edgecolor='grey')
cities.plot(ax=ax, marker='x', color='black', markersize=15)
plt.title('Cities plotted on a country border base map')
plt.show()
import matplotlib.pyplot as plt
from matplotlib.offsetbox import AnchoredText
import cartopy.crs as ccrs
import cartopy.feature as cfeature

plt.rcParams["figure.figsize"] = [16, 9]
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree())
ax.set_extent([-10, 40, 30, 70], crs=ccrs.PlateCarree())

# state/province border lines from Natural Earth (assumed definition)
states_provinces = cfeature.NaturalEarthFeature(
    category='cultural', name='admin_1_states_provinces_lines',
    scale='50m', facecolor='none')

# background image
ax.stock_img()
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(states_provinces, edgecolor='gray')

# Add a copyright
text = AnchoredText('\u00A9 Natural Earth; license: public domain', loc=4,
                    prop={'size': 12}, frameon=True)
ax.add_artist(text)
plt.show()
The map resulting from this introductory Cartopy example is shown in Figure 4-10.
When looking at the result in terms of aesthetics, I would argue that there is no clear
winner. Similar results can probably be obtained with both methods. It is hard work to
get an aesthetically pleasing map with both of these, but for obtaining informative maps,
they work great.
Among the map types that Plotly supports out of the box are the following:
–– Choropleth maps
–– Bubble maps
–– Heat maps
–– Scatter plots
To get a good grasp of the Plotly syntax, let’s do a walk-through of a short Plotly
example, based on their famous graphs in the graph gallery. In this example, you’ll see
how Plotly can easily make an aesthetically pleasing map, just by adding a few additional
functionalities.
You can use the code in Code Block 4-11 to obtain a Plotly Express example dataset
that contains some data about countries.
import plotly.express as px
data = px.data.gapminder().query("year==2002")
data.head()
The content of the first five lines of the dataframe tells us what type of variables we
have. For example, you have life expectancy, population, and gdp per country per year.
The filter on 2002 that was applied in the preceding query ensures that we have only one
data point per country; otherwise, plotting would be more difficult.
Let’s create a new variable called gdp to make the plot with. This variable can be
computed using Code Block 4-12.
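Using the gapminder column names gdpPercap and pop, the computation is simply:

data['gdp'] = data['gdpPercap'] * data['pop']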
Let’s now make a bubble map in which the icon for each country is larger or smaller
based on the newly created variable gdp using Code Block 4-13.
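A sketch based on the well-known Plotly Express scatter_geo example that this walk-through follows (the exact parameter choices are assumptions matching the description of the resulting figure):

fig = px.scatter_geo(data, locations='iso_alpha', color='continent',
                     hover_name='country', size='gdp',
                     projection='natural earth')
fig.show()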
Even with this fairly simple code, you'll obtain a quite interesting-looking graph, as
shown in Figure 4-12.
Each country has a bubble whose size reflects its gdp, the continents each have a
different color, and you have a background that is the well-known natural earth
projection. You can hover over the data points to see more info about each of them.
Folium allows you to create interactive maps, and the results will almost give you the
feeling that you are working in Google Maps or comparable software. All this is obtained
using a few lines of code, and all the complex work for generating those interactive maps
is hidden behind Folium and Leaflet.js.
Folium has extensive documentation with loads of examples and quick-start tutorials
(https://python-visualization.github.io/folium/quickstart.html#Getting-Started). To give
you a feel for the type of results that can be obtained with Folium, we'll
do a short walk-through of some great examples. Let’s start by simply creating a map and
then slowly adding specific parameters to create even better results.
Using the syntax in Code Block 4-14, you will automatically create a map of the
Paris region. This is a single line of code, which just contains the coordinates of Paris, and it
directly shows the power of the Folium library.
import folium
m = folium.Map(location=[48.8545, 2.2464])
m
You will obtain an interactive map in your notebook. It looks as shown in Figure 4-13.
Now, the interesting thing here is that this map does not contain any of your data. It
seems like it could be a map filled with complex points, polygons, labels, and more, and
deep down somewhere in the software it is. The strong point of Folium as a visualization
layer is that you do not at all need to worry about this. All your “background data” will
stay cleanly hidden from the user. You can imagine that this would be very complex to
create using the actual polygons, lines, and points about the Paris region.
Let’s go a step further and add some data to this basemap. We’ll add two markers
(point data in Folium terminology): one for the Eiffel Tower and one for the Arc de
Triomphe.
The code in Code Block 4-15 shows a number of additions to the previous code. First,
it adds a zoom_start. This basically tells you how much zoom you want to show when
initializing the map. If you have played around with the first example, you’ll see that you
can zoom out so far as to see the whole world on your map and that you can zoom in
to see a very detailed map as well. It really is very complete. However, for a specific use
case, you would probably want to focus on a specific region or zone, and setting a zoom_
start will help your users identify what they need to look at.
Second, there are two markers added to the map. They are very intuitively added to
the map using the .add_to method. Once added to the map, you simply show the map
like before, and they will appear. You can specify a popup so that you see additional
information when clicking your markers. Using HTML markup, you can create
whole paragraphs of information here, in case you’d want to.
As the markers are point geometry data, they just need x and y coordinates to be
located on the map. Of course, these coordinates have to be in the correct coordinate
system, but that is nothing different from anything you’ve seen before.
import folium
m = folium.Map(location=[48.8545, 2.2464], zoom_start=11)
folium.Marker(
    [48.8584, 2.2945], popup="Eiffel Tower").add_to(m)
folium.Marker(
    [48.8738, 2.2950], popup="Arc de Triomphe").add_to(m)
# display the map in the notebook
m
If you are working in a notebook, you will then be able to see the interactive map
appear as shown in Figure 4-14. It has the two markers for showing the Eiffel Tower and
the Arc de Triomphe, just like we started out to do.
For more details on plotting maps with Folium, I strongly recommend you to read
the documentation. There is much more documentation out there, as well as sample
maps and examples with different data types.
Key Takeaways
1. There are many mapping libraries in Python, each with its specific
advantages and disadvantages.
PART II
GIS Operations
CHAPTER 5
Clipping and Intersecting
• Buffering
• Erase
The first of the standard operations, clipping and intersecting, is covered in this
chapter. Let’s start by giving a general definition of the clipping operation and do an
example in Python. We will then do the same for the intersecting operation.
What Is Clipping?
Clipping, in geoprocessing, takes one layer, an input layer, and uses a specified boundary
layer to cut out a part of the input layer. The part that is cut out is retained for future use,
and the rest is generally discarded.
The clipping operation is like a cookie cutter: your cookie dough is the input layer,
from which a cookie-shaped part is cut out.
When clipping a line dataset, things become more complicated, as lines may start
inside the clip boundaries and end outside of them. In this case, the part of the line that
is inside the boundary has to be kept, but the part that is outside of the boundary has
to be removed. The result is that some rows of data will be entirely removed (lines that
are completely out of scope) and some of them will be altered (lines that are partly out
of scope). This will be clearer with a more detailed schematic drawing that is shown in
Figure 5-3. In this schematic drawing, the data is line data; imagine, for example, a road
network. The clip is a polygon.
In the coming part, you will see a more practical application of this theory by
applying the clipping operation in Python.
Clipping in Python
In this example, you will see how to apply a clipping operation in Python. The dataset is
one that I have generated specifically for this exercise. It contains two features:
• A line that covers a part of the Seine River (a famous river that flows
through Paris and through a large part of France)
• A polygon that covers the Paris center region
The goal of the exercise is to clip the Seine River to the Paris center region. This is a
very realistic use of the clipping operation. After all, rivers are often multicountry objects
and are often displayed in maps. When working on a more local map, you will likely
encounter the case where you will have to clip rivers (or other lines like highways, train
lines, etc.) to a more local extent.
Let’s start with importing the dataset and opening it. You can find the data in
the GitHub repository. For the execution of this code, I’d recommend using a Kaggle
notebook or a local environment, as Colab has an issue with the clipping function at the
time of writing.
You can import the data using geopandas, as you have learned in previous chapters.
The code for doing this is shown in Code Block 5-1.
import geopandas as gpd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
data = gpd.read_file('ParisSeineData.kml')
print(data)
We can quickly use the geopandas built-in plot function to get a plot of this data. Of
course, you have already seen more advanced mapping options in the previous chapters,
but the goal here is just to get a quick feel of the data we have. This is done in Code
Block 5-2.
data.plot()
When using this plot method, you will observe the map in Figure 5-5, which clearly
contains the two features: the Seine River as a line and the Paris center as a polygon.
Figure 5-5. The plot resulting from Code Block 5-2. Image by author
Now, as stated in the introduction of this example, the goal is to have only the Seine
River line object, but to clip it to the extent of the Paris polygon. The first step is to split our data
object into two separate objects. This way, we will have one geodataframe with the Seine
River and a second geodataframe with the Paris polygon. This will be easier to work with.
You can extract the Seine River using the code in Code Block 5-3.
seine = data.iloc[0:1,:]
seine.plot()
You can verify in the resulting plot (Figure 5-6) that this has been successful.
Figure 5-6. The plot resulting from Code Block 5-3. Image by author
Now, we do the same for the Paris polygon using the code in Code Block 5-4.
paris = data.iloc[1:2,:]
paris.plot()
You will obtain a plot with the Paris polygon to verify that everything went well. This
is shown in Figure 5-7.
Figure 5-7. The plot resulting from Code Block 5-4. Image by author
Now comes the more interesting part: using the Paris polygon as a clip to the Seine
River. The code to do this using geopandas is shown in Code Block 5-5.
paris_seine = seine.clip(paris)
paris_seine
You will obtain a new version of the Seine dataset, as shown in Figure 5-8.
You can use the code in Code Block 5-6 to plot this version to see that it contains only
those parts of the Seine River that are inside the Paris center region.
paris_seine.plot()
Figure 5-9. The Seine River clipped to the Paris polygon. Image by author
This result shows that the goal of the exercise is met. We have successfully imported
the Seine River and Paris polygon, and we have reduced the size of the Seine River line
data to fit inside Paris.
You can imagine that this can be applied for highways, train lines, other rivers, and
other line data that you’d want to use in a map for Paris, but that is available only for a
much larger extent. The clipping operation is fairly simple but very useful for this, and it
allows you to remove useless data from your working environment.
What Is Intersecting?
The second operation that we will be looking at is the intersection. For those of you who
are aware of set theory, this part will be relatively straightforward. For those who are not,
let’s do an introduction of set theory first.
Sets, in mathematics, are collections of unique objects. A number of standard
operations are defined for sets, and these are helpful in many different kinds of problems,
geodata problems being one of them.
As an example, we could imagine two sets, A and B:
–– Set A contains three cities: New York, Las Vegas, and Mexico City.
–– Set B contains three cities as well: Amsterdam, New York, and Paris.
There are a number of standard operations that are generally applied to sets:
–– Union: Elements that are in either of the two sets (or in both)
–– Intersection: Elements that are in both sets
–– Difference: Elements that are in one but not in the other (not symmetrical)
With the example sets given earlier, we would observe the following:
–– The union of A and B: New York, Las Vegas, Mexico City, Amsterdam, Paris
–– The intersection of A and B: New York
–– The difference of A with B: Las Vegas, Mexico City
In the first part of this chapter, you have seen that filtering is an important basic
operation in geodata. Set theory is useful for geodata, as it allows you to have a common
language for all these filter operations.
The reason that we are presenting the intersection in the same chapter as the clip is
that they are relatively similar and are often confused. This will allow us to see what the
exact similarities and differences are.
This basically just filters out some points, and the resulting shapes are still points.
Let’s now see what happens when applying this to two line datasets.
Line datasets will work differently. When two lines have a part at the exact same
location, the resulting intersection of two lines could be a line. In general, it is more likely
that two lines intersect at a crossing or that they are touching at some point. In this case,
the intersection of two lines is a point. The result is therefore generally a different shape
than the input. This is shown in the schematic drawing in Figure 5-12.
The lines intersect at three points, and the resulting dataset just shows these three
points. Let’s now see what happens when intersecting polygons.
Conceptually, as polygons have a surface, we consider that the intersection of two
polygons is the surface that they have in common. The result would therefore be the
surface that they share, which is a surface and therefore needs to be a polygon as well.
The schematic drawing in Figure 5-13 shows how this works.
The result is basically just one or multiple smaller polygons. In the following section,
you will see how to apply this in Python.
Intersecting in Python
Let’s now start working on the example that was described earlier in this chapter. We
take a dataset with the Boulevard Périphérique and the Seine River, and we use the
intersection of those two to identify the locations where the Seine River crosses the
Boulevard Périphérique.
You can use the code in Code Block 5-7 to import the data and print the dataset.
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
data = gpd.read_file('ParisSeineData_example2_v2.kml')
data.head()
There are two polygons, one called Seine and one called Boulevard Périphérique.
Let’s use Code Block 5-8 to create a plot to see what this data practically looks like. We
can use the cmap to specify a colormap and obtain different colors. You can check
out the matplotlib documentation for an overview of colormaps; there are many to
choose from.
data.plot(cmap='tab10')
Figure 5-15. The plot resulting from Code Block 5-8. Image by author
Compared to the previous example, the data has been converted to polygons here.
You will see in a later chapter how to do this automatically using buffering, but for now it
has been done for you, and the polygon data is directly available in the dataset.
We can clearly see two intersections, so we can expect two bridges (or tunnels) to be
identified. Let’s now use the intersection function to find these automatically for us.
The code in Code Block 5-9 shows how to use the overlay function in geopandas to
create an intersection.
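A minimal sketch of such an overlay call, assuming the Seine polygon is the first row of the data and the Boulevard Périphérique polygon the second, could look like this:
# Split the two polygon layers and keep only their shared surface
seine_poly = data.iloc[0:1, :]
peripherique = data.iloc[1:2, :]
intersection = gpd.overlay(seine_poly, peripherique, how='intersection')
intersection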
The result is a dataset with only the intersection of the two polygons, as shown in
Figure 5-16.
Figure 5-16. The plot resulting from Code Block 5-9. Image by author
The resulting object is a multipolygon, as it contains two polygons: one for each
bridge (or tunnel). You can see this more easily when creating the plot of this dataset
using Code Block 5-10.
intersection.plot()
The result may look a bit weird without context, but it basically just shows the two
bridges/tunnels of the Parisian Boulevard Périphérique. This is shown in Figure 5-17.
Figure 5-17. The two crossings of the Seine and the Boulevard Périphérique.
Image by author
The goal of the exercise is now achieved. We have successfully created an automated
method for extracting locations where roads cross rivers. If we now wanted to do this
for the whole city, we could simply find datasets with all Paris’s roads and rivers and use
the same method to find all the bridges in Paris.
Of course, this was just one example, and you can generalize this to many situations
where creating intersections is useful, for example, to create new features in your dataset
if you want to do machine learning or for adding new features on maps.
It will be useful to keep in mind that there are other options in the overlay function,
of which we will see some in coming chapters. These are all related to other operations
in set theory, which is a very practical way to think about these basic geodata operations.
In the intersection example that was done earlier in this chapter, you have seen that the output
contains data from both input datasets and that it is therefore different from the clipping
operation.
Key Takeaways
1. There are numerous basic geodata operations that are standardly
implemented in most geodata tools. They may seem simple at
first sight, but applying them to geodata can come with some
difficulties.
2. The clipping operation takes an input dataset and reduces its size
to an extent given by a boundary dataset. This can be done for all
geodata data types.
4. Using clipping for lines or polygons will delete those lines and
polygons that are out of scope entirely, but will create a new
reduced form for those features that are partly inside and partly
outside of the boundaries.
9. You have seen how to use geopandas as an easy tool for both
clipping and intersecting operations.
CHAPTER 6
Buffers
In the previous chapter, we have started looking at a number of common geospatial
operations: data operations that are not possible, or at least not common, on regular
data, but that are very common on geospatial data.
The standard operations that will be covered are
• Buffering
• Erase
In the previous chapter, you have already seen two major operations. You have first
seen how to clip data to a specific extent, mainly for the use of dropping data based
on a spatial range. You have also seen how to use intersecting to create data based
on applying set theory on geospatial datasets. It was mentioned that other set theory
operations can be found in that scope as well.
In this chapter, we will look at a very different geospatial operation. You will discover
the geospatial operation of buffering or creating buffers. They are among the standard
operations of geospatial operations, and it is useful to master this tool.
Just like intersecting, the buffer is a tool that can be used either as a stand-alone or
as a tool for further analysis. It was not mentioned, but in the example of intersections, a
buffer operation was used to create the polygon data for the road and the river.
This clearly shows how those spatial operations should all be seen as tools in a
toolkit, and when you want to achieve a specific goal, you need to select the different
tools that you need to get there. This often means using different combinations of tools
in an intelligent manner. The more tools you know, the more you will be able to achieve.
We will start this chapter with an introduction into the theory behind the buffer, and
then do a number of examples in Python, in which you will see how to create buffers
using different geodata types.
–– When you create a buffer around a point, you will end up with a new
buffer polygon that contains the surrounding area around that point.
–– By adding a buffer to a line, you have a new feature that contains the
area around that line.
–– Buffers around polygons will also contain the area just outside the
polygon.
Buffers are newly created objects. After computing the buffer polygon, you still have
your original data. You will simply have created a new feature which is the buffer.
Figure 6-1. Showing the road first as a polygon and then as a line. Image by author
Representing the road as a line will allow you to do many things, like compute the
length of a road, find crossings with other roads and make a road network, etc. However,
in real life, a road has a width as well. For things like urban planning around the road,
building a bridge, etc., you will always need to know the width of the road at each
location.
As you have seen when covering data types in Chapter 3, lines have a length but no
width. It would not be possible to represent the width of a road with a line. You could however create
a buffer around the line and give the buffer a specified width. This would result in a
polygon that encompasses the road, and you would then be able to work with the road
as polygon data.
In this schematic drawing, you see the left image containing points, which are
depicted here as stars. In the right image, you see how the buffers are circular polygons
that are formed exactly around the point.
Although it may seem difficult to find use cases for this, there are cases where this may
be useful. Imagine that your point data are sources of sound pollution and that it is known
that the sound can be heard a given number of meters from the source. Creating buffers
around the point would help to determine regions in which the sound problems occur.
Another, very different use case could be where you collect data points that are not
very reliable. Imagine, for example, that they are GPS locations given by a mobile phone.
If you know how much uncertainty there is in your data points, you could create buffers
around your data points that state that all locations that are inside the buffer may have
been visited by the specific mobile phone user. This can be useful for marketing or ad
recommendations and the like.
We could imagine building different buffers around the railroad (one for very
strongly impacted locations and one for moderately impacted houses). The schematic
drawing in Figure 6-3 shows how building different buffers around the railroad could
help you in solving this problem.
You see that the left image contains one line (the planned railroad) and a number
of houses (depicted as stars). On the top right, you see a narrow buffer around the line,
which shows the heavy impact. You could filter out the points that are inside this heavy
impact buffer to identify them in more detail. The bottom-left graph contains houses
with a moderate impact. You could think of using set operations from the previous
chapter to select all moderate impact houses that are not inside the heavy impact buffer
(e.g., using a difference operation on the buffer, but other approaches are possible
as well).
In the left part of the schematic drawing, you see the lake polygon, an oval. On the
right, you see that a gray buffer has been created around the lake – maybe not the best
way to estimate the exact location of your path, but definitely an easy way to create the
new feature quickly in your dataset.
Now that you have seen how buffers work in theory, it is time to move on to some
practice. In the following section, we will start applying these operations in Python.
• You will see how to use buffers around points to investigate whether
each of the houses is within walking distance of a subway station (a real
added value).
• You will then see how to use buffers around a subway line to make
sure that the house is not disturbed by the noises of the subway line
(which can be a real problem).
• You will then see how to use buffers around parks to see whether
each of the houses is within walking distance of a park.
Let's start with the first criterion by creating buffers around some subway stations.
You will see that the data contains eight subway stations. They do not have names, as
that does not really add value for this example. They are all point data, having a
latitude and longitude. They also have a z-coordinate (height), but it is not used and is
therefore set to zero for all stations.
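The station data can be loaded just like the KML files in the previous chapters; a minimal sketch, with a hypothetical filename standing in for the station file from the book's repository, is:
import geopandas as gpd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
# Hypothetical filename: replace with the station file from the book's repository
data = gpd.read_file('subway_stations.kml')
data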
Let’s make a quick and easy visualization to get a better feeling for the data that we
are working with. You can use the code in Code Block 6-2 to do so.
data.plot()
This plot will show the plot of the data. This is shown in Figure 6-6.
Figure 6-6. The plot resulting from Code Block 6-2. Image by author
We can clearly imagine the points being stations of a subway line. This plot is not very
visually appealing. If you want to work on visuals, feel free to add some code from Chapter 4 to create
background visuals. You can also use the contextily library, which is a Python package that
can very easily create background maps. The code in Code Block 6-3 shows how it is done.
It uses the example data from Chapter 5 to create a background map with a larger extent.
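A minimal sketch of contextily usage with just the station points (the marker size is an assumption, and the Chapter 5 data that enlarges the extent is left out here) could look like this:
import contextily as cx

# Plot the stations and add a web-tile basemap in the same coordinate reference system
ax = data.plot(markersize=64, figsize=(15, 15))
cx.add_basemap(ax, crs=data.crs)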
Figure 6-7. The map with a background. Image by author using contextily source
data and image as referenced in the image
As you can see, the points are displayed on the map, on the subway line that goes
east-west. When we add houses to this data, we could compute distances from each
house to each subway station. However, we could not use these points in a set operation
or overlay. The overlay method would be much easier to compute than the distance
operation, which shows why it is useful to master the buffer operation.
We can use it to combine with other features as specified in the definition of the
example. Let’s now add a buffer on those points to start creating a house selection
polygon.
Creating the buffer is quite easy. It is enough to use ".buffer" and specify the buffer distance, as
is done in Code Block 6-4.
data.buffer(0.01)
This buffer operation will generate a GeoSeries of polygons that contains all the
buffer polygons, as shown in Figure 6-8.
import contextily as cx
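# A minimal sketch of the rest of this plot; the color and figure size are assumptions
# Plot the station buffers in black and add a contextily basemap below them
ax = data.buffer(0.01).plot(color='black', figsize=(15, 15))
cx.add_basemap(ax, crs=data.crs)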
The buffers are shown as black circles on the background map. You can check out
the result in Figure 6-9.
Figure 6-9. The plot resulting from Code Block 6-5. Image by author using
contextily source data and image as referenced in the image
With this result, we have successfully created a spatial layer to help us in filtering
houses to select. Let’s now move on to implementing the following two criteria using
buffers as well.
When calling this in a notebook, you will see how a line is automatically printed, as
shown in Figure 6-10.
This visualization isn’t particularly useful, so we’d better try to add this to our
existing plot. The code in Code Block 6-7 does exactly that, by storing the LineString as a
geopandas dataframe and then plotting it.
import pandas as pd
df = pd.DataFrame(
{
'Name': ['metro'],
'geometry': [LineString(data.loc[[7,6,5,4,0,1,2,3], 'geometry'].reset_
index(drop=True))]
}
)
gdf = gpd.GeoDataFrame(df)
gdf
You will see that the resulting geodataframe has exactly one line, which is the line
representing our subway, as shown in Figure 6-11.
To plot the line, let’s add this data into the plot with the background map directly,
using the code in Code Block 6-8.
import contextily as cx
You now obtain a map that has the subway station buffers and the subway rails as a
line. The result is shown in Figure 6-12.
Figure 6-12. The plot with the two data types. Image by author using contextily
source data and image as referenced in the image
The next step is to compute a buffer around this line to indicate an area that
you want to deselect for your search for a house, to respect the criteria given in the
introduction. This can be done using the same operation as used before, but now we will
choose a slightly smaller buffer size to avoid deselecting too much area. This is done in
Code Block 6-9.
gdf.buffer(0.001)
By creating the buffer in this way, you end up with a geodataframe that contains a
polygon of the buffer, rather than with the initial line data. This is shown in Figure 6-13.
Now, you can simply add this polygon in the plot, and you’ll obtain a polygon that
shows areas that you should try to find a house in and some subareas that should be
avoided. Setting the transparency using the alpha parameter can help a lot to make more
readable maps. This is done in Code Block 6-10.
import contextily as cx
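# A minimal sketch of the rest of this plot; the colors and figure size are assumptions
# Station buffers in green, rail buffer in red, with alpha controlling the transparency
ax = data.buffer(0.01).plot(color='green', alpha=0.4, figsize=(15, 15))
gdf.buffer(0.001).plot(ax=ax, color='red', alpha=0.5)
cx.add_basemap(ax, crs=data.crs)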
Figure 6-14. The plot resulting from Code Block 6-10. Image by author using
contextily source data and image as referenced in the image
This shows the map of Paris in which the best circles for use are marked in green, but
in which the red polygon should be avoided as it is too close to the subway line. In the
following section, we will add a third criterion on the map: proximity to a park. This will
be done by creating buffers on polygons.
In Figure 6-15, you will see that there are 18 parks in this dataset, all identified as
polygons.
Figure 6-15. The data from Code Block 6-11. Image by author
You can visualize this data directly inside our map, by adding it as done in Code
Block 6-12.
import contextily as cx
The parks are shown in the map as black contour lines. No buffers have yet been
created. This intermediate result looks as shown in Figure 6-16.
Figure 6-16. The map with the parks added to it. Image by author using contextily
source data and image as referenced in the image
Of course, it is unlikely that you will find a house inside a park, so we need to make
our search area such that it takes into account a border around those parks. This, again,
can be done by adding a buffer to our polygon data. The buffer operation works just like
it did before, by calling buffer with a distance. This is done in Code Block 6-13.
parks.buffer(0.01)
After the buffer, you have polygon data, just like you had before. Yet the size of the
polygon is now larger as it also has the buffers around the original polygons. Let’s now
add this into our plot, to see how this affects the places in which we want to find a house.
This is done in Code Block 6-14.
import contextily as cx
The colors and “zorder” (order of overlay) have been adjusted a bit to make the map
more readable. After all, it starts to contain a large number of features. You will see the
result shown in Figure 6-18.
Figure 6-18. The plot resulting from Code Block 6-14. Image by author using
contextily source data and image as referenced in the image
This map is a first result that you could use. Of course, you could go even further
and combine this with the theory from Chapter 5, in which you have learned how to use
operations from set theory to combine different shapes. Let’s see how to do this, with a
final goal to obtain a dataframe that only contains the areas in which we do want to find
a house, based on all three criteria from the introduction.
station_buffer = data.buffer(0.01)
rails_buffer = gdf.buffer(0.001)
park_buffer = parks.buffer(0.01)
A = gpd.GeoDataFrame({'geometry': station_buffer})
B = gpd.GeoDataFrame({'geometry': park_buffer})
C = gpd.GeoDataFrame({'geometry': rails_buffer})
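The station and park buffers can then be combined with an overlay; a minimal sketch, using the layers A and B defined above and a hypothetical result name, is:
# Keep only the area that is both close to a station (A) and close to a park (B)
stations_and_parks = gpd.overlay(A, B, how='intersection')
stations_and_parks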
You will obtain a dataset that looks like the data shown in Figure 6-19.
Figure 6-19. The data resulting from Code Block 6-16. Image by author
With this intersection of stations and parks, we now need to remove all locations
that are too close to a subway line, as this is expected to be too noisy, as specified in the
introduction.
To do this, we can also use an overlay, but this time we do not need the intersection
from set theory because an intersection would leave us with all places that have station,
park, and railway proximity. However, what we want is station and park proximity, but
not railway proximity. For this, we need to use the difference operation from set theory.
The code in Code Block 6-17 shows how this can be done.
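A minimal sketch of this difference, assuming the intersection result from the previous step is stored in a variable called stations_and_parks, is:
# Remove everything that is too close to the subway rails (C) from the remaining area
final_selection = gpd.overlay(stations_and_parks, C, how='difference')
final_selection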
The data still looks like a dataframe from before. The only difference that occurs is that the
data becomes much more complex with every step, as the shapes of our acceptable locations
become less and less regular. Let’s do a map of our final object using Code Block 6-18.
import contextily as cx
Figure 6-20. The final map of the exercise. Image by author using contextily source
data and image as referenced in the image
As you can see in Figure 6-20, the green areas are now a filter that we could use to
select houses based on coordinates. This answers the question posed in the exercise
and results in an interesting map as well. If you want to go further with this exercise,
you could create a small dataset containing point data for houses. Then, for looking up
whether a house (point data coordinate) is inside a polygon, you can use the operation
that is called “contains” or “within.” Documentation can be found here:
–– https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.within.html
–– https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.contains.html
This operation is left as an exercise, as it goes beyond the demonstration of the buffer
operation, which is the focus of this chapter.
Key Takeaways
1. There are numerous basic geodata operations that are standardly
implemented in most geodata tools. They may seem simple at
first sight, but applying them to geodata can come with some
difficulties.
CHAPTER 7
Merge and Dissolve
• Buffering
• Erase
In the previous chapters, you have already seen clipping, intersecting, and buffering.
You have even seen a use case that combines those methods in a logical manner.
In this chapter, we will look at merging and dissolving, which are again very
different operations than those previously presented. They are also among the standard
geospatial operations, and it is useful to master these tools.
As explained in previous chapters, those spatial operations should all be seen as
tools in a toolkit, and when you want to achieve a specific goal, you need to select the
different tools that you need to get there. This often means using different combinations
of tools in an intelligent manner. The more tools you know, the more you will be able to
achieve.
We will start this chapter with the merge operation, covering both theory and
implementation in Python, and we will then move on to the dissolve operation. At the
end, you will see a comparison of both.
What Is a Merge?
Merging geodata, just like with regular data, consists of taking multiple input datasets
and making them into a single new output feature. In the previous chapter, you already
saw a possible use case for a merge. If you remember, multiple “suitability” polygons
were created based on multiple criteria. At the end, all of these polygons were combined
into a single spatial layer. Although another solution was used in that example, a merge
could have been used to get all those layers together in one and the same layer.
Attribute joins are probably the type of join that you are already aware of. This is an
SQL-like join in which your geodata has some sort of key that is used to add information
from a second table that has this key as well. An example is shown in the schematic
drawing in Figure 7-2.
As you can see, this is a simple SQL-like join that uses a common identifier between
the two datasets to add the columns of the attribute table into the columns of the
geodata dataset.
An alternative is the spatial join, which is a bit more complex. The spatial join also
combines columns of two datasets, but rather than using a common identifier, it uses the
geographic coordinates of the two datasets. The schematic drawing in Figure 7-3 shows
how this can be imagined.
In this example, the spatial join is relatively easy, as the objects are exactly the same
in both input datasets. In reality, you may well see slight differences in the features, but
you may also have different features that you want to join. You can specify all types of
spatial join parameters to make the right combination:
–– Joining all objects that are near each other (specify a distance)
This gives you a lot of tools to work with for combining datasets together, both
row-wise (merge) and column-wise (join). Let’s now see some examples of the merge,
attribute join, and spatial join in Python.
Merging in Python
In the coming examples, we will be looking at some easy-to-understand data. There
are multiple small datasets, and throughout the exercise, we will do all the three types
of merges.
–– Combine the three city files into a single layer using a row-wise merge
–– Add a new variable to the combined city file using an attribute lookup
–– Find the country of each of the cities using a spatial lookup with
the polygon file
Let’s now start by combining the three city files into a single layer with all the cities
combined.
import geopandas as gpd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
us_cities = gpd.read_file('/kaggle/input/chapter7/USCities.kml')
us_cities
canada_cities = gpd.read_file('/kaggle/input/chapter7/CanadaCities.kml')
canada_cities
mexico_cities = gpd.read_file('/kaggle/input/chapter7/MexicoCities.kml')
mexico_cities
We can create a map of all three of those datasets using the syntax that you have seen
earlier in this book. This is done in Code Block 7-4.
import contextily as cx
# us cities
ax = us_cities.plot(markersize=128, figsize=(15,15))
# canada cities
canada_cities.plot(ax=ax, markersize=128)
# mexico cities
mexico_cities.plot(ax = ax, markersize=128)
# contextily basemap
cx.add_basemap(ax, crs=us_cities.crs)
Figure 7-7. The map created using Code Block 7-4. Image by author using
contextily source data and image as referenced in the image
Now, this is not too bad already, but we actually want to have all this data in just
one layer, so that it is easier to work with. To do so, we are going to do a row-wise merge
operation. This can be done in Python using the pandas concat method. It is shown in
Code Block 7-5.
import pandas as pd
cities = pd.concat([us_cities, canada_cities, mexico_cities])
cities
You will obtain a dataset in which all the points are now combined. The cities object now
contains the rows of all the cities of the three input geodataframes, as can be seen in
Figure 7-8.
If we now plot this data, we just have to plot one layer, rather than having to plot
three times. This is done in Code Block 7-6. You can see that it has all been successfully
merged into a single layer.
ax = cities.plot(markersize=128,figsize=(15,15))
cx.add_basemap(ax, crs=us_cities.crs)
Figure 7-9. The map resulting from Code Block 7-6. Image by author using
contextily source data and image as referenced in the image
You can also see that all points now have the same color, because they are now all on
one single dataset. This fairly simple operation of row-wise merging will prove to be very
useful in your daily GIS operations.
Now that we have combined all data into one layer, let’s add some features using an
attribute join.
lookup = pd.DataFrame({
'city': [
'Las Vegas',
'New York',
'Washington',
'Toronto',
'Quebec',
'Montreal',
'Vancouver',
'Guadalajara',
'Mexico City'
],
'population': [
1234,
2345,
3456,
4567,
4321,
5432,
6543,
1357,
2468
]
})
lookup
Now, to add this data into the geodataframe, we want to do an SQL-like join, in
which the column “population” from the lookup table is added onto the geodataframe
based on the column “Name” in the geodataframe and the column “city” in the lookup
table. This can be accomplished using Code Block 7-8.
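A minimal sketch of this attribute join, using the pandas merge method on the combined cities layer and the result name cities_new that is used below, is:
# SQL-like join: add the population column based on the city name
cities_new = cities.merge(lookup, left_on='Name', right_on='city')
cities_new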
You can see in the dataframe that the population column has been added, as is
shown in Figure 7-11.
Figure 7-11. The data resulting from Code Block 7-8. Image by author
You can now access this data easily, for example, if you want to do filters,
computations, etc. Another example is to use this attribute data to adjust the size of each
point on the map, depending on the (simulated) population size (of course, this is toy
data so the result is not correct, but feel free to improve on this if you want to). The code
is shown in Code Block 7-9.
ax = cities_new.plot(markersize=cities_new['population'] // 10,
figsize=(15,15))
cx.add_basemap(ax, crs=us_cities.crs)
The result in Figure 7-12 shows the cities’ sizes being adapted to the value in the
column population, which was added to the dataset through an attribute join.
Figure 7-12. The map resulting from Code Block 7-9. Image by author using
contextily source data and image as referenced in the image
countries = gpd.read_file('NorthMiddleAmerciaCountries.kml')
countries
If we plot the data against the background map, you can see that the polygons are
quick approximations of the countries’ borders, just for the purpose of this exercise. This
is done in Code Block 7-11.
Figure 7-14. The plot resulting from Code Block 7-11. Image by author using
contextily source data and image as referenced in the image
You can see some distortion on this map. If you have followed along with the theory
on coordinate systems in Chapter 2, you should be able to understand where that is
coming from and have the tools to rework this map’s coordinate system if you’d want to.
For the current exercise, those distortions are not a problem. Now, let’s add our cities
onto this map, using Code Block 7-12.
Figure 7-15. The combined map. Image by author using contextily source data
and image as referenced in the image
This brings us to the topic of the spatial join. In this map, you see that there are two
datasets:
–– The cities only contain information about the name of the city and the
population.
–– The countries are just polygons.
You can see in Code Block 7-13 how a spatial join is done between the cities and
countries datasets, based on a “within” spatial join: the city needs to be inside the
polygon to receive its attributes.
Code Block 7-13. Spatial join between the cities and the countries
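A minimal sketch of such a spatial join, assuming a recent geopandas version where sjoin takes a predicate parameter and using the result name cities_3 that appears below, is:
# Each city receives the attributes of the country polygon it falls within
cities_3 = gpd.sjoin(cities_new, countries, predicate='within')
cities_3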
Figure 7-16. The data resulting from Code Block 7-13. Image by author
You see that the name of the country has been added to the dataset of the cities. We
can now use this attribute for whatever we want to in the cities dataset. As an example,
we could give the points a color based on their country, using Code Block 7-14.
ax = cities_3.plot(markersize=cities_3['population'] // 10,
c=cities_3['color'], figsize=(15,15))
cx.add_basemap(ax, crs=cities_3.crs)
Figure 7-17. The map resulting from Code Block 7-14. Image by author using
contextily source data and image as referenced in the image
With this final result, you have now seen multiple ways to combine datasets into a
single dataset:
–– The row-wise merge, which stacks datasets with the same columns into one single layer
–– The attribute join, which adds columns from another table based on a common identifier
–– The spatial join, which is a join that bases itself on spatial attributes
rather than on any common identifier
In the last part of this chapter, you’ll discover the dissolve operation, which is often
useful in case of joining many datasets.
The polygons A and B both have the value 1, so grouping by value would combine
those two polygons into one polygon. This operation can be useful when your data is too
granular, which may be because you have done a lot of geospatial operations or may be
because you have merged a large number of data files.
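To prepare this grouping, a new column called Area can be added to the countries layer. A minimal sketch, with a hypothetical mapping from country name to area, is:
# Hypothetical mapping; adjust the keys to the country names actually present in the data
area_lookup = {'United States': 'North America', 'Canada': 'North America',
'Mexico': 'Middle America'}
countries['Area'] = countries['Name'].map(area_lookup)
countries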
Once you execute this, you’ll see that a new column has been added to the dataset,
as shown in Figure 7-19.
Figure 7-19. The data resulting from Code Block 7-15. Image by author
Now the goal is to create two polygons: one for North America and one for Middle
America. We are going to use the dissolve method for this, as shown in Code Block 7-16.
areas = countries.dissolve(by='Area')[['geometry']]
areas
The newly created dataset is the grouped (dissolved) result of the previous dataset, as
shown in Figure 7-20.
We can now plot this data to see what it looks like, using Code Block 7-17.
The map in Figure 7-21 shows the result of the dissolve operation.
Figure 7-21. The result of the dissolve operation. Image by author using contextily
source data and image as referenced in the image
This combined result has been grouped by the feature area, and it is a generalized
version of the input data. The dissolve operation is therefore much like a groupby
operation, which is a very useful tool to master when working with geodata.
Key Takeaways
1. There are numerous basic geodata operations that are standardly
implemented in most geodata tools. They may seem simple at
first sight, but applying them to geodata can come with some
difficulties.
CHAPTER 8
Erase
In this chapter, you will learn about the erase operation. The previous three chapters
have presented a number of standard GIS operations. Clipping and intersecting were
covered in Chapter 5, buffering in Chapter 6, and merge and dissolve were covered in
Chapter 7.
This chapter will be the last of those four chapters covering common tools for
geospatial analysis. Even though there are much more tools available in the standard GIS
toolbox, the goal here is to give you a good mastering of the basics and allowing you to
be autonomous in learning the other GIS operations in Python.
The chapter will start with a theoretical introduction of the erase operation and then
follow through with a number of example implementations in Python for applying the
erase on different geodata types.
In this schematic drawing, you can see that there are three polygons on the left
(numbered 1, 2, and 3). The delete operation has deleted polygon 2, which means that
there are only two polygons remaining in the output on the right. Polygon 2 was deleted,
or with a synonym, erased. The table containing the data would be affected as shown in
Figure 8-2.
This operation is relatively easy to understand and would not need to be covered in
much depth. However, this operation is not what is generally meant when talking about
an erase operation in GIS.
The erase function in GIS is actually much closer to the other spatial operations that
we have covered before. The erase operation takes two inputs: an input layer, which is
the one that we are applying an erase operation on, and the erase features.
The input layer can be any type of vector layer: polygon, line, points, or even mixed.
The erase feature generally has to be a polygon, although some implementations may
allow you to use other types as well.
What the operation does is erase all the data in the input layer that falls inside the
eraser polygon. This will delete a part of your input data, generally because you don't
need it.
Let’s look at some schematic overviews of erasing on the different input data types in
the next sections.
You can see how the data table would change before and after the operation in the
schematic drawing in Figure 8-4.
Figure 8-4. The table view behind the spatial erase. Image by author
You can see that the features 2 and 5 have simply been removed by the erase
operation. This could have been done also using a drop of the features with IDs 2 and 5.
Although using a spatial erase rather than an erase by ID for deleting a number of
points gives the same functional result, it can be very useful and even necessary to use a
spatial erase here.
When you have an erase feature, you do not yet have the exact IDs of the points
that you want to drop. In this case, the only way to get the list of IDs automatically is to do
a spatial join, or an overlay, which is what happens in the spatial erase.
When using more complex features like lines and polygons, the importance of the
spatial erase is even larger, as you will see now.
What happens here is quite different from what happened in the point example.
Rather than deleting or keeping entire features, the spatial erase has now made an
alteration to the features. Before and after, the data still consists of two lines, yet they are
not exactly the same lines anymore. Only a part of each individual feature was erased,
thereby not changing the number of features but only the geometry. In the data table,
this would look something like shown in Figure 8-6.
In the next section, you’ll see how this works for polygons.
Figure 8-7. The spatial erase operation applied to polygons. Image by author
In the drawing, you see that there are three polygons in the input layer on the top
left. The erase feature is a rectangular polygon. Using a spatial erase, the output contains
altered versions of polygons 2 and 3, since the parts of them that overlaid the erase
feature have been cut off.
The impact of this operation in terms of data table would also be similar to the one
on the line data. The tables corresponding to this example can be seen in Figure 8-8.
Figure 8-8. The table view of spatially erasing on polygons. Image by author
You should now have a relatively good intuition about the spatial erase operation.
To perfect your understanding, the next section will make an in-depth comparison
between the spatial eraser and some comparable operations, before moving on to the
implementation in Python.
For line and polygon features, the result of a spatial erase will often
be an alteration rather than a deletion. Indeed, the ID of the input feature is not deleted
from the dataset, yet its geometry will be altered: reduced by the suppression of the part
of the feature that overlapped with the erase feature.
Spatial erase should be used when you want to delete features or part of features
based on geometry. Deleting features otherwise is useful when you want to delete based
on ID or based on any other column feature of your data.
Another operation from set theory is the difference operation. You can use the same
spatial overlay functionality as for the intersection to compute this difference operation.
Rather than retaining parts that intersect in two input layers, the difference operation
will keep all features and parts of features from input layer A that are not present in input
layer B. Therefore, it will delete from layer A all that is in feature B, similar to an erase.
Depending on the software that you are using, you may or may not encounter
implementations of the erase operation. For example, in some versions of the paid
ArcGIS software, there is a function to erase. In geopandas, however, there is not, so we
might as well use the overlay with the difference operation to obtain the same result as
the erase operation. This is what we will be doing in the following section.
Erasing in Python
In this exercise, you will be working with a small sample map that was created
specifically for this exercise. The data should not be used for any other purpose than the
exercise as it isn’t very precise, but that is not a problem for now, as the goal here is to
master the geospatial analysis tools.
During the coming exercises, you will be working with a mixed dataset of Iberia,
which is the peninsula containing Spain and Portugal. The goal of the exercise is to
create a map of Spain out of this data, although there is no polygon that indicates the
exact region of Spain: this must be created by removing Portugal from Iberia.
I recommend running this code in Kaggle notebooks or in a local environment, as
there are some problems in Google Colab for creating overlays. To get started with the
exercise, you can import the data using the code in Code Block 8-1.
import geopandas as gpd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
all_data = gpd.read_file('chapter_08_data.kml')
all_data
Within this dataframe, you can see that there is a mix of data types or geometries.
The first two rows contain the polygons of Iberia (which is the contour of Spain plus
Portugal) and of Portugal. Then you have a number of roads, which are lines, followed by a number of
cities, which are points.
Let’s create a quick map to see what we are working with exactly using Code
Block 8-2. You can use the code hereafter to do so. If you are not yet familiar with
these methods for plotting, I recommend going back into earlier chapters to get more
familiar with this. From here on, we will go a bit faster over the basics of data imports,
file formats, data types, and mapping as they have all been extensively covered in
earlier parts of this book.
import contextily as cx
As a result of this code, you will see the map shown in Figure 8-10.
Figure 8-10. The map resulting from Code Block 8-2. Image by author
This map is not very clear, as it still contains multiple types of data that are also
overlapping. As the goal here is to filter out some data, let’s do that first, before working
on improved visualizations.
The resulting dataframe contains only the polygon corresponding to Iberia. This
dataframe is shown in Figure 8-11.
Figure 8-11. The data resulting from Code Block 8-3. Image by author
Now, let’s do the same for Portugal, using the code in Code Block 8-4.
The resulting dataframe contains only the polygon corresponding to Portugal. This
dataframe looks as shown in Figure 8-12.
Figure 8-12. The data resulting from Code Block 8-4. Image by author
Now, we have two separate objects: one geodataframe containing just the polygon
for Iberia and a second geodataframe containing only one polygon for Portugal. What we
want to obtain is the difference between the two, as Spain is the part of Iberia that is not
Portugal. The code in Code Block 8-5 does exactly this.
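A minimal sketch of this difference overlay, assuming the Iberia and Portugal polygons were extracted into geodataframes called iberia and portugal, is:
# Spain is the part of Iberia that is not Portugal
spain = gpd.overlay(iberia, portugal, how='difference')
spain.plot()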
Figure 8-13. The map resulting from Code Block 8-5. Image by author
Although the shape is correct, when we look at the spain dataframe object, we can
see that the name is still set to the input layer, which is Iberia. You can print the data
easily with Code Block 8-6.
spain
We can reset the name to Spain using the code in Code Block 8-7.
spain.Name = 'Spain'
spain
The resulting dataframe now has the correct value in the column Name, as can be
seen in Figure 8-15.
Figure 8-15. The Spain data with the correct name. Image by author
Let’s plot all that we have done until here using a background map, so that we can
keep on adding to this map of Spain in the following exercises. The code to create this
plot is shown in Code Block 8-8.
Figure 8-16. The plot resulting from Code Block 8-8. Image by author
If you are familiar with the shape of Spain, you will see that it corresponds quite well
on this map. We have successfully created a polygon for the country of Spain, just using
a spatial operation with two other polygons. You can imagine that such work can occur
regularly when working with spatial data, whether it is for spatial analysis, mapping and
visualizations, or even for feature engineering in machine learning.
In the following section, you will continue this exercise by also removing the
Portuguese cities from our data, so that we only retain relevant cities for our Spanish
dataset.
In previous chapters, you have seen some techniques that could be useful for this.
One could imagine, for example, doing a join with an external table. This external table
could be a map from city to country, so that after joining the geodataframe to this lookup
table, you could simply do a filter based on country.
In the current exercise, we are taking a different approach, namely, using a spatial
overlay with a difference operation, which is the same as an erase. This way, we will
erase from the cities all those that have a spatial overlay with the polygon of Portugal.
The first step of this exercise is to create a separate geodataframe that contains only
the cities. It is always easier to work with datasets that have one and only one data type.
Let’s use the code in Code Block 8-9 to filter out all point data, which are the cities in
our case.
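A minimal sketch of such a geometry-type filter could look like this:
# Keep only the rows whose geometry is a point (the cities)
cities = all_data[all_data.geometry.geom_type == 'Point']
cities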
You will obtain the dataset as shown in Figure 8-17, which contains only cities.
Figure 8-17. The dataset resulting from Code Block 8-9. Image by author
Now that we have a dataset with only cities, we still need to filter out the cities of
Spain and remove the cities of Portugal. As you can see, there is no other column that we
could use to apply this filter, and it would be quite cumbersome to make a manual list
of all the cities that are Spanish vs. Portuguese. Even if it would be doable for the current
exercise, it would be much more work if we had a larger dataset, so it is not a good
practice.
Code Block 8-10 shows how to remove all the cities that have an overlay with the Portugal polygon. Setting the how parameter to “difference” ensures that they are removed rather than retained. As a reminder, you have seen other values like intersection and union being used in previous chapters; if you don’t remember what those versions do, this is a good point to have a quick look back.
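A sketch of this operation, under the same naming assumptions as before, could be:

# Erase all cities that fall inside the Portugal polygon
spanish_cities = gpd.overlay(cities, portugal, how='difference')
spanish_cities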
Figure 8-18. The dataset resulting from Code Block 8-10. Image by author
When comparing this with the previous dataset, you can see that indeed a number of
cities have been removed. The Spanish cities that are kept are Bilbao, Barcelona, Madrid,
Seville, Malaga, and Santiago de Compostela. The cities that are Portuguese have been
removed: Porto, Lisbon, and Faro. This was the goal of the exercise, so we can consider it
successful.
As a last step, it would be good to add this all to the map that we started to make in
the previous section. Let’s add the Spanish cities onto the map of the Spanish polygon
using the code in Code Block 8-11.
This code will result in the map shown in Figure 8-19, which contains the polygon of
the country Spain, the cities of Spain, and a contextily basemap for nicer visualization.
Figure 8-19. The map resulting from Code Block 8-11. Image by author
We have now done two parts of the exercise. We have seen how to cut the polygon,
and we have filtered out the cities of Spain. The only thing that remains to be done is to treat the roads and make sure we keep only those parts of the roads that are inside the Spain polygon. This will be the goal of the next section.
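The first step is again to isolate one geometry type, this time the LineStrings. A sketch of what this filter could look like, under the same naming assumptions as before, is:

# Keep only the LineString geometries, which are the roads
roads = full_data[full_data.geometry.type == 'LineString']
roads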
You now obtain a dataset that contains only the roads, just like shown in Figure 8-20.
Figure 8-20. The dataset resulting from Code Block 8-12. Image by author
The problem is not really clear from the data, so let’s make a plot to see what is wrong
about those LineStrings using the code in Code Block 8-13.
This code will generate the map in Figure 8-21, which shows that there are a lot of
parts of road that are still inside Portugal, which we do not want for our map of Spain.
Figure 8-21. The map resulting from Code Block 8-13. Image by author
Indeed, you can see here that there is one road (from Porto to Lisbon) that needs
to be removed entirely. There are also three roads that start in Madrid and end up in
Portugal, so we need to cut off the Portuguese part of those roads.
This is all easy to execute using a difference operation within an overlay again, as is
done in the code in Code Block 8-14.
Figure 8-22. The data resulting from Code Block 8-14. Image by author
You can clearly see that some roads are entirely removed from the dataset, because
they were entirely inside of Portugal. Roads that were partly in Portugal and partly in
Spain were merely altered, whereas roads that were entirely in Spain are kept entirely.
Let’s now add the roads to the overall map with the country polygon and the cities, to
finish our final map of only Spanish features. This is done in Code Block 8-15.
Figure 8-23. The map resulting from Code Block 8-15. Image by author
This map shows all of the features reduced to Spain, whereas we started from a
dataset in which we did not even have a Spain polygon. These examples show the type of
work that is very common in spatial analysis or feature engineering for spatial machine
learning. After all, data is not always clean and perfect and often needs some work to
be usable.
In this chapter, and the previous chapters, you should have found the basics for
working with geospatial data and have enough background to find out how to do some
of the other geospatial operations using the documentation and other sources. In the next
chapters, the focus will shift to more mathematics and statistics, as we will be moving
into the chapters on machine learning.
Key Takeaways
1. The erase operation has multiple interpretations. In spatial
analysis, its definition is erasing features or parts of features based
on a spatial overlay with a specified erase feature.
3. You can use the difference overlay to erase data from vector
datasets (points, lines, or polygons).
PART III
CHAPTER 9
Interpolation
After having covered the fundamentals of spatial data in the first four chapters of this
book, and a number of basic GIS operations in the past four chapters, it is now time
to move on to the last four chapters in which you will see a number of statistics and
machine learning techniques being applied to spatial data.
This chapter will cover interpolation, which is a good entry into machine learning.
The chapter will start by covering definitions and intuitive explanations of interpolation
and then move on to some example use cases in Python.
What Is Interpolation?
Interpolation is a task that is relatively intuitive for most people. From a high-level
perspective, interpolation means to fill in missing values in a sequence of numbers. For
example, let’s take the list of numbers:
1, 2, 3, 4, ???, 6, 7, 8, 9, 10
Many would easily be able to find that the number 5 should be at the place where the
??? is written. Let’s try to understand why this is so easy. If we want to represent this list
graphically, we could plot the value against the position (index) in the list, as shown in
Figure 9-1.
When seeing this, we would very easily be inclined to think that this data follows a
straight line, as can be seen in Figure 9-2.
As we have no idea where these numbers came from, it is hard to say whether this is
true or not, but it seems logical to assume that they came from a straight line. Now, let’s
try another example. To give you a more complex example, try it with the following:
1, ???, 4, ???, 16
If you are able to find it, your most likely guess would be the doubling function,
which could be presented graphically as shown in Figure 9-3.
When doing interpolation, we try to find the best estimate for a value in between
other values based on a mathematical formula that seems to fit our data. Although
interpolation is not necessarily a method in the family of machine learning methods, it
is a great way to start discovering the field of machine learning. After all, interpolation amounts to finding the formula that best represents the data, which is fundamentally what machine learning is about as well. But more on that in the next chapter. Let’s first
deep dive into a bit of the technical details of how interpolation works and how it can be
applied on spatial data.
Linear Interpolation
The most straightforward method for interpolation is linear interpolation. Linear
interpolation comes down to drawing a straight line from each point to the next and
estimating the in-between values to be on that line. The graph in Figure 9-4 shows an
example.
Although it seems not such a bad idea, it is not really precise either. The advantage
of linear interpolation is that generally it is not very wrong: you do not risk estimating
values that are way out of bounds, so it is a good first method to try.
The mathematical function for linear interpolation is the following:
y = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
If you input the value for x at which you want to compute a new y, and the values of x and y of the point before (x0, y0) and after (x1, y1) your new point, you obtain the new y value of your point.
Polynomial Interpolation
Polynomial interpolation is a bit better for estimating such functions, as polynomial
functions can actually be curved. As long as you can find an appropriate polynomial
function, you can generally find a relatively good approximation. This could be
something like Figure 9-5.
Many, many other forms of polynomials exist. If you are not familiar with polynomials, it would be worth checking out some online resources on the topic.
Figure 9-6. Adding nearest neighbor interpolation to the graph. Image by author
This nearest neighbor interpolation will assign the value that is the same value as
the closest point. The line shape is therefore a piecewise function: when arriving closer
(on the x axis) to the next point, the interpolated value (y axis) makes a jump to the next
value on the y axis. As you can see, this really isn’t the best idea for the curve at hand, but
in other situations, it can be a good and easy-to-use interpolation method.
As a small exercise, imagine a country for which we only have two temperature measurements: in the north of the country, it is 10 degrees Celsius; in the south, it is 0 degrees Celsius. Depending on where we live, we want to obtain the most appropriate value for our own location. Let’s use a linear interpolation for this, with the result shown in Figure 9-8.
Figure 9-8. The result of the exercise using linear interpolation. Image by author
This linear approach does not look too bad, and it is easy to compute by hand for this
data. Let’s also see what would have happened with a nearest neighbor interpolation,
which is also easy to do by hand. It is shown in Figure 9-9.
Figure 9-9. The result of the exercise using nearest neighbor interpolation. Image
by author
The middle part has been left out, as defining ties is not that simple, yet you can get
the idea of what would have happened with a nearest neighbor interpolation in this
example.
For the moment, we will not go deeper into the mathematical definitions, but if
you want to go deeper, you will find many resources online. For example, you could get
started here: https://towardsdatascience.com/polynomial-interpolation-3463ea4b63dd. For now, we will focus on applications to geodata in Python.
data = {
    'point1': {'lat': 0, 'long': 0, 'temp': 0},
    'point2': {'lat': 10, 'long': 10, 'temp': 20},
    'point3': {'lat': 0, 'long': 10, 'temp': 10},
    'point4': {'lat': 10, 'long': 0, 'temp': 30},
}
Now, the first thing that we can do is make a dataframe from this dictionary and get this data into a geodataframe. For this, the easiest approach is to make a regular pandas dataframe first, using the code in Code Block 9-2.
import pandas as pd
df = pd.DataFrame.from_dict(data, orient='index')
df
As a next step, let’s convert this dataframe into a geopandas geodataframe, while
specifying the geometry to be point data, with latitude and longitude. This is done in
Code Block 9-3.
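A sketch of what this conversion and plot could look like, assuming the marker sizes are simply scaled from the temperature values, is:

import geopandas as gpd

# Build point geometries from the long/lat columns
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['long'], df['lat']))

# Plot the four points, scaling marker size with temperature
gdf.plot(markersize=gdf['temp'] * 10 + 10)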
The plot that results from this code is shown in Figure 9-11.
Figure 9-11. The plot resulting from Code Block 9-3. Image by author
As you can see in this plot, there are four points with different sizes. From a high-
level perspective, it seems quite doable to find intermediate values to fill in between the
points. What is needed to do so, however, is to find a mathematical formula in Python
that represents this interpolation and then use it to predict the interpolated values.
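One way to build such a linear interpolation function is with SciPy. The sketch below is an assumption about how this could be set up; the wrapper name my_interpolation_function is chosen to match the calls used further down, and the wrapper returns a one-element array so that indexing with [0] works.

import numpy as np
from scipy.interpolate import LinearNDInterpolator

# Fit a piecewise-linear interpolator on the four known points
_linear_interpolator = LinearNDInterpolator(
    df[['lat', 'long']].values, df['temp'].values
)

def my_interpolation_function(lat, long):
    # Return a one-element array so that indexing with [0] works later on
    return _linear_interpolator([[lat, long]])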
Now that we have this function, we can call it on new points. However, we first need
to define which points we are going to use for the interpolation. As we have four points in
a square organization, let’s interpolate at the point exactly in the middle and the points
that are in the middle along the sides. We can create this new df using the code in Code
Block 9-5.
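A sketch of what this new dataframe could look like, assuming the five new locations are the center of the square and the midpoints of its four sides, is:

# Five new locations: the center of the square and the midpoints of its sides
new_df = pd.DataFrame({
    'lat':  [5, 0, 10, 5, 5],
    'long': [5, 5, 5, 0, 10],
})
new_df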
The data here only has latitude and longitude, but it does not yet have the estimated
temperature. After all, the goal is to use our interpolation function to obtain these
estimated temperatures.
In Code Block 9-6, you can see how to loop through the new points and call the
interpolation function to estimate the temperature on this location. Keep in mind that
this interpolation function is the mathematical definition of a linear interpolation, based
on the input data that we have given.
interpolated_temps = []
for i, row in new_df.iterrows():
    interpolated_temps.append(my_interpolation_function(row['lat'], row['long'])[0])

new_df['temp'] = interpolated_temps
new_df
You can see the numerical estimations of these results in Figure 9-13.
Figure 9-13. The estimations resulting from Code Block 9-6. Image by author
Linear interpolation is the most straightforward method, and these predictions look solid at first sight. It is hard to say whether they are good or not, as we do not have any ground truth values in interpolation use cases, yet we can at least say that nothing looks out of place.
Now that we have estimated the new values, let's do a quick visual analysis. Code Block 9-7 combines the known and interpolated points into one dataframe so that we can rebuild the plot.
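A sketch of what this combination and plot could look like, under the same assumptions as above, is:

# Stack the original and interpolated points and plot them together
combined = pd.concat([df[['lat', 'long', 'temp']], new_df], ignore_index=True)
combined_gdf = gpd.GeoDataFrame(
    combined, geometry=gpd.points_from_xy(combined['long'], combined['lat'])
)
combined_gdf.plot(markersize=combined_gdf['temp'] * 10 + 10)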
Even though we do not have an exact metric to say whether this interpolation is
good or bad, we can at least say that the interpolation seems more or less logical to
the eye, which is comforting. This first try appears rather successful. Let’s try out some
more advanced methods in the next section, to see how results may differ with different
methods.
Kriging
In the first part of this chapter, you have discovered some basic, fundamental approaches
to interpolation. The thing about interpolation is that you can make it as simple, or as
complex, as you want.
Although the fundamental approaches discussed earlier are often satisfactory in
practical results and use cases, there are some much more advanced techniques that we
need to cover as well.
In this second part of the chapter, we will look at Kriging as an interpolation method. Kriging is a much more advanced mathematical approach to interpolation. It would surpass the level of this book to go into too much mathematical detail here, but readers who are at ease with the mathematics are encouraged to explore more specialized resources on the topic.
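A sketch of what an Ordinary Kriging interpolation could look like, using the pykrige package on the same four known points and five new locations as before (the variogram_model argument can be swapped for 'gaussian' or 'exponential' to reproduce the variants discussed below), is:

from pykrige.ok import OrdinaryKriging

# Fit an Ordinary Kriging model with a linear variogram on the known points
ok_model = OrdinaryKriging(
    df['long'].values, df['lat'].values, df['temp'].values,
    variogram_model='linear',
)

# Interpolate at the five new locations
kriged_temp, kriging_variance = ok_model.execute(
    'points', new_df['long'].values, new_df['lat'].values
)
new_df['kriging_temp'] = kriged_temp
new_df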
Figure 9-15. The interpolated values with Linear Ordinary Kriging. Image
by author
Interestingly, some of these estimated values are not the same at all. Let’s plot them
to see whether there is anything weird or different going on in the plot, using the code in
Code Block 9-9.
Figure 9-16. The resulting plot using Linear Ordinary Kriging. Image by author
Nothing seems wrongly estimated if we judge by the plot, so there is no reason to discount these results. As we have no metric for good or bad interpolation, this must be seen as just an alternative estimation. Let’s see what happens when using other settings for Kriging in the next section.
Figure 9-17. The result with Gaussian Ordinary Kriging. Image by author
Interestingly, the estimates for point5 and point9 change quite drastically again! Let’s
make a plot again to see if anything weird is occurring during this interpolation. This is
done in Code Block 9-11.
Figure 9-18. The result from Gaussian Ordinary Kriging. Image by author
Again, when looking at this plot, it cannot be said that this interpolation is wrong in
any way. It is different from the others, but just as valid.
Figure 9-19. The result from Exponential Ordinary Kriging. Image by author
Interestingly, again point5 and point9 are the ones that change a lot, while the others
stay the same. For coherence, let’s make the plot of this interpolation as well, using Code
Block 9-13.
Figure 9-20. The plot from Exponential Ordinary Kriging. Image by author
Again, there is nothing obviously wrong with this plot, yet its results are again different from before. It would only make sense to wonder which of them is right. Let’s conclude on this
in the next section.
Even for such a simple interpolation example, we see spectacularly large differences
in the estimations of points 5 (middle bottom in the graph) and 9 (right middle in
the graph).
Now the big question here is of course whether we can say that any of those are
better than the others. Unfortunately, when applying mathematical models to data
where there is no ground truth, you just don’t know. You can build models that are useful
to your use case, you can use human and business logic to assess different estimates, and
you can use rules of thumb like Occam’s razor (keep the simplest possible model) for
your decision to retain one model over the other.
Alternatively, you can also turn to supervised machine learning for this.
Classification and regression will be covered in the coming two chapters, and they are
also methods for estimating data points that we don’t know, yet they are focused much
more on performance metrics to evaluate the fit of our data to reality, which is often
missing in interpolation use cases.
In conclusion, although there is not necessarily only one good answer, it is always
useful to have a basic working knowledge of interpolation. Especially in spatial use
cases, it is often necessary to convert data measured at specific points (like temperature
stations and much more) into a more continuous view over a larger two-dimensional
surface (like countries, regions, and the like). You have seen in this chapter that relatively
simple interpolations are already quite efficient in some use cases and that there is a vast
complexity to be discovered for those who want to go into more depth.
Key Takeaways
1. Interpolation is the task of estimating unknown values in between
a number of known values, which comes down to estimating
values on unmeasured locations.
3. There are many mathematical “base” formulas that you can apply
to your points, and depending on the formula you chose, you may
end up with quite different results.
CHAPTER 10
Classification
With the current chapter, you are now arriving at one of the main parts of the book
about machine learning, namely, classification. Classification is, next to regression and
clustering, one of the three main tasks in machine learning, and they will all be covered
in this book.
Machine learning is a very large topic, and it would be impossible to cover all
of machine learning in just these three chapters. The choice has been made to focus on applying machine learning models to spatial data. The emphasis is therefore on presenting interesting and realistic use cases for machine learning on spatial data while showing how spatial data can be used as an added value with respect to regular data.
There will not be very advanced mathematical, statistical, or algorithmic discussions in these chapters. There are many standard resources out there for those
readers who want to gain a deep and thorough mathematical understanding of machine
learning in general.
The chapter will start with a general introduction of what classification is, what we
can use it for, and some models and tools that you’ll need for doing classification, and
then we’ll dive into a deep spatial classification use case for the remainder of the chapter.
Let’s now start with some definitions and introductions first.
It was communicated to the participants that they would receive a gift coupon for a 50% discount at a new restaurant. This was done not just to incentivize participants to take part in the study, but also with an ulterior motive: there is tracking in place to measure whether the user has used the coupon, so that we can study the link between movement patterns and the potential interest in this new restaurant.
Our classification case will be executed as a marketing study for the new restaurant.
The goal of the study is to use a tracked GPS path to predict whether a person is
interested in this restaurant. The model, if successful, will be used to implement push
notification ads to incentivize clients to find out about this new restaurant.
Sending an ad to a user costs money, so there is a real interest in finding out which clients we want to send this ad to. We need to build a model that is as good as possible at predicting interest in using the coupon based only on the sequence of GPS points.
When looking at this data, you will see something like Figure 10-1.
Figure 10-1. The data resulting from Code Block 10-1. Image by author
The dataset is a bit more complex than what we have worked with in previous
chapters, so let’s make sure to have a good understanding of what we are working with.
The first row of the geodataframe contains an object called the mall. This polygon is
the one that covers the entire area of the mall, which is the extent of our study. It is here
just for informative purposes, and we won’t need it during the exercise.
The following features from rows 1 to 7 present areas of the mall. They are also
polygons. Each area can either be one shop, a group of shops, a whole wing, or whatnot,
but they generally regroup a certain type of store. We will be able to use this information
for our model.
The remaining data are 20 itineraries. Each itinerary is represented as a LineString,
that is, a line, which is just a sequence of points that has been followed by each of the
20 participants in the study. The name of each of the LineStrings is either Bought Yes,
meaning that they have used the coupon after the study (indicating the product interests
them), or Bought No, indicating that the coupon was not used and therefore that the
client is probably not interested in the product.
Let’s now move on to make a combined plot of all this data to get an even better feel
of what we are working with. This can be done using Code Block 10-2.
all_data.plot(figsize=(15,15),alpha=0.1)
When executing this code, you will end up with the map in Figure 10-2. Of course,
it is not the most visual map, but the goal is here to put everything together in a quick
image to see what is going on in the data.
In this map, the most outer light-gray contours are the contours of the large polygon
that sets the total mall area. Within this, you see a number of smaller polygons, which
indicate the areas of interest for our study, which all have a specific group of store
types inside them. Finally, you also see the lines criss-crossing, which represents the 20
participants of the study making their movements throughout the mall during their visit.
What we want to do now is to use the information of the store segment polygons to
annotate the trips of each participant. It would be great to end up with a percentage of
time that each participant has spent in each type of store, so that we can build a model
that learns a relationship between the types of stores that were visited in the mall and the
potential interests in the new restaurant.
As a first step toward this model, let’s separate the data to obtain datasets with only
one data type. For this, we will need to separate the information polygons from the
participant itineraries. Using all that you have seen earlier in the book, that should not
be too hard. The code in Code Block 10-3 shows how to get the info polygons in a new
dataset.
Code Block 10-3. Select the info polygons into a separate dataset
info_polygons = all_data.loc[1:7,:]
info_polygons
Figure 10-3. The data resulting from Code Block 10-3. Image by author
Let’s extract the itineraries as well, using the code in Code Block 10-4.
itineraries = all_data.loc[8:,:]
itineraries
The itineraries dataset is a bit longer, but it looks as shown in Figure 10-4 (truncated
version).
Figure 10-4. Truncated version of the data from Code Block 10-4
The next step is to relate each participant's itinerary to the store-category polygons. For this, we are going to need some sort of spatial overlay operation, as you have seen in the earlier parts of this book.
To do the overlay part, it is easier to do this point by point, as we have lines that are
passing through many of the interest areas. If we were to do the operation using lines, we
would end up needing to cut the lines according to the boundaries of the polygons and
use line length for estimating time spent in the store section.
If we cut the lines into points, we can do a spatial operation to find the presence of
each point and then simply count, for each participant, the number of points in each
store section. If the points are collected on the basis of equal frequency, the number of
points is an exact representation of time.
The reason that we can do without the line is that we do not care about direction or
order here. After all, if we wanted to study the order of visits to each store section, we
would need to keep information about the sequence. The line is able to do this, whereas
the point data type is not.
In this section, you can see what a great advantage it is to master Python for working
on geodata use cases. Indeed, Python gives us the liberty to convert geometry objects to strings, or to loop through their coordinates directly, which would potentially be much more complex in click-button, precoded GIS tools.
The disadvantage may be that it is a little hard to get your head around sometimes,
but the code in Code Block 10-5 walks you through an approach to get the data from
a wide data format (one line per client) to a long data format (one row per data point/
coordinate).
Code Block 10-5. Get the data from a wide data format to a long data format
import pandas as pd
from shapely.geometry.point import Point

results = []
# The loop below is a reconstruction of the original approach: it stores one row
# per coordinate of each itinerary, using the coords attribute rather than string parsing
for client_id, row in itineraries.iterrows():
    for coord in row['geometry'].coords:
        results.append([client_id, row['Name'], Point(coord[0], coord[1])])

results_df = pd.DataFrame(results)
results_df.columns = ['client_id', 'target', 'point']
results_df
The result of this code is the data in a long format: one row per point instead of one
row per participant. A part of the data is shown in Figure 10-5.
Figure 10-5. A part of the data resulting from Code Block 10-5. Image by author
This result here is a pandas dataframe. For doing spatial operations, as you know by
now, it is best to convert this into a geodataframe. This can be done using the code in
Code Block 10-6.
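A sketch of what this conversion could look like, reusing the point column created above and assuming the points share the CRS of the info polygons, is:

import geopandas as gpd

# Use the shapely Point objects as the geometry column
gdf = gpd.GeoDataFrame(results_df, geometry='point', crs=info_polygons.crs)
gdf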
Your object gdf is now georeferenced. We can move on to joining this point dataset
with the store information dataset, using a spatial join. This spatial join is executed using
the code in Code Block 10-7.
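A sketch of what this spatial join could look like with geopandas is:

# Spatially join each point to the store-category polygon it falls in
joined_data = gpd.sjoin(gdf, info_polygons, how='left')
joined_data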
Figure 10-6. The data resulting from Code Block 10-7. Image by author
You can see that for most points the operation has been successful. For a number
of points, however, it seems that NA, or missing values, has been introduced. This
is explained by the presence of points that are not overlapping with any of the store
information polygons and therefore having no lookup information. It would be good
to do something about this. Before deciding what to do with the NAs, let’s use the code
in Code Block 10-8 to count, for each client, the number of points that have no reference information.
# inspect NA
joined_data['na'] = joined_data.Name.isna()
joined_data.groupby('client_id').na.sum()
The number of nonreferenced points differs for each participant, but it never goes
above three. We can therefore conclude that there is really no problem in just discarding
the data that has no reference. After all, there is very little of it, and there is no added
information in this data.
The code in Code Block 10-9 shows how to remove the rows of data that have
missing values.
# drop na
joined_data = joined_data.dropna()
joined_data
The resulting dataframe is a bit shorter: 333 rows instead of 359, as you can see in
Figure 10-8.
We now have a dataframe with explanatory information that will help us to predict
coupon usage.
In this section, we will take the spatially joined dataframe and return to a wide
format, in which we again obtain a dataframe with one row per participant. Instead
of having a coordinate LineString, we now want to have the participant’s presence in
each of the categories. We will obtain this simply by counting, for each participant, the
number of points in each of the store categories. This is obtained using a pivot table, as shown in Code Block 10-10.
location_behavior = joined_data.pivot_table(
    index='client_id', columns='Name', values='target', aggfunc='count'
).fillna(0)
location_behavior
This grouped data already seems very usable information for understanding
something about each of the clients. For example, we can see that the participant with
client_id 21 has spent a huge amount of time in the High Fashion section. Another
example is client_id 9, who seemed to have only gone to the supermarket.
Although this dataset is very informative, there is still one problem before moving
on to the classification model. When looking at the category electronics, we can see, as an example, that client_id 17 has the largest value inside electronics. If we look further into
client_id 17, however, we see that electronics is not actually the largest category for this
participant.
There is a bias in the data that is due to the fact that not all participants have the
same number of points in their LineString. To solve this, we need to standardize for the
number of points, which can be done using the code in Code Block 10-11.
# standardize
location_behavior = location_behavior.div(location_behavior.sum(axis=1), axis=0)
location_behavior
Figure 10-10. The data resulting from Code Block 10-11. Image by author
Modeling
Let’s now keep the data this way for inputting into the model.
Let’s move away from the dataframe format and use the code in Code Block 10-12 to
convert the data into numpy arrays.
X = location_behavior.values
X
Figure 10-11. The array resulting from Code Block 10-12. Image by author
You can do the same to obtain an array for the target, also called y. This is done in
Code Block 10-13.
y = itineraries.Name.values
y
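Before fitting a model, the data is split into a training part and a test part. A standard scikit-learn sketch of such a split (the test size and random seed shown here are assumptions) is:

from sklearn.model_selection import train_test_split

# Keep a portion of the participants apart for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)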
After this step, you end up with four datasets: X_train and y_train are the parts of
X and y that we will use for training, and X_test and y_test will be used for evaluation.
We now have all the elements to start building a model. The first model that we
are going to build here is the logistic regression. As we do not have tons of data, we can
exclude the use of complex models like random forests, xgboost, and the like, although
they could definitely replace the logistic regression if we had more data in this use case.
Thanks to the easy-to-use modeling interface of scikit-learn, it is really easy to replace
one model by another, as you’ll see throughout the remainder of the example.
The code in Code Block 10-15 first initiates a logistic regression and then fits the
model using the training data.
# logistic regression
from sklearn.linear_model import LogisticRegression
my_lr = LogisticRegression()
my_lr.fit(X_train, y_train)
The object my_lr is now a fitted logistic regression which basically means that its
coefficients have been estimated based on the training data and they have been stored
inside the object. We can now use the my_lr object to make predictions on any external data that contains the same variables as those present in X_train.
Luckily, we have kept apart X_test so that we can easily do a model evaluation. The
first step in this is to make the predictions using the code in Code Block 10-16.
preds = my_lr.predict(X_test)
preds
The array contains the predictions for each of the rows in X_test, as shown in
Figure 10-13.
We do have the actual truth for these participants as well. After all, they are not really new participants, but rather a subset of participants, kept apart for evaluation, for whom we know whether they used the coupon. We can compare the predictions to the actual ground truth, using the code in Code Block 10-17.
Code Block 10-17. Convert the errors and ground truth to a dataframe
The test set is rather small in this case, and we can manually conclude that the
model is actually predicting quite well. In use cases with more data, it would be
better to summarize this performance using other methods. One great way to analyze
classification models is the confusion matrix. It shows in one graph how much of the data is correctly predicted, but also which observations are wrongly predicted and, in that case, which errors were made and how many times. The code in Code Block 10-18 shows how to create such a
confusion matrix for this use case.
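A sketch of what this could look like, using scikit-learn's built-in confusion matrix plotting (available in recent scikit-learn versions), is:

from sklearn.metrics import ConfusionMatrixDisplay

# Plot predicted vs. actual classes for the test set
ConfusionMatrixDisplay.from_predictions(y_test, preds)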
Figure 10-15. The plot resulting from Code Block 10-18. Image by author
In this graph, you see that most predictions were correct and only one mistake was
made. This mistake was a participant that did not buy, whereas the model predicted that
he was a buyer with the coupon.
Model Benchmarking
The model made one mistake, so we can conclude that it is quite a good model.
However, for completeness, it would be good to try out another model. Feel free to test
out any classification model from scikit-learn, but due to the relatively small amount of
data, let’s try out a decision tree model here. The code in Code Block 10-19 goes through
the exact same steps as before but simply with a different model.
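A sketch of what these steps could look like with a decision tree is:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Same workflow as before, only the model changes
my_dt = DecisionTreeClassifier()
my_dt.fit(X_train, y_train)
dt_preds = my_dt.predict(X_test)

# Compare predictions against the ground truth
pd.DataFrame({'predicted': dt_preds, 'actual': y_test})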
Figure 10-16. The resulting dataframe from Code Block 10-19. Image by author
Again, thanks to the small dataset size, it is easy to interpret and conclude that the
model is worse, as it has made two mistakes, whereas the logistic regression only made
one mistake. For coherence, let’s complete this with a confusion matrix analysis as well.
This is done in Code Block 10-20.
The result is shown in Figure 10-17 and indeed shows two errors.
There are two errors, both cases of participants that did not buy in reality. It seems
that people that did not buy are a little bit harder to detect than the opposite, even
though more evidence would be needed to further investigate this.
Key Takeaways
1. Classification is an area in supervised machine learning that
deals with models that learn how to use independent variables to
predict a categorical target variable.
CHAPTER 11
Regression
In the previous two chapters, you have learned about the fundamentals of machine
learning use cases using spatial data. You have first seen several methods of
interpolation. Interpolation was presented as an introduction to machine learning, in
which a theory-based interpolation function is defined to fill in unknown values of the
target variable.
The next step moved from this unsupervised approach to a supervised
approach, in which we build models to predict values of which we have ground truth
values. By applying a train-test-split, this ground truth is then used to compute a
performance metric.
The previous chapter showed how to use supervised models for classification.
In classification models, unlike with interpolation, the target variable is a categorical
variable. The shown example used a binary target variable, which classified people into
two categories: buyers and nonbuyers.
In this chapter, you will see how to build supervised models for target variables
that are numeric. This is called regression. Although regression, just like interpolation,
is used to estimate a numeric target, the methods are actually generally closer to the
supervised classification methods.
In regression, the use of metrics and building models with the best performance on
this metric will be essential as it was in classification. The models are adapted for taking
into account a numeric target variable, and the metrics need to be chosen differently to
take into account the fact that targets are numeric.
The chapter will start with a general introduction of what regression models are and
what we can use them for. The rest of the chapter will present an in-depth analysis of a
regression model with spatial data, during which numerous theoretical concepts will be
presented.
Introduction to Regression
Although the goal of this book is not to present a deep mathematical content on machine
learning, let’s start by exploring the general idea behind regression models anyway.
Keep in mind that there are many resources that will be able to fill in this theory and that
the goal of the current book is to present how regression models can be combined with
spatial data analysis and modeling.
Let’s start this section by considering one of the simplest cases of regression
modeling: the simple linear regression. In simple linear regression, we have one numeric
target variable (y variable) and one numeric predictor variable (X variable).
In this example, let’s consider a dataset in which we want to predict a person’s
weekly weight loss based on the number of hours that a person has worked out in that
same week. We expect to see a positive relationship between the two. Figure 11-1 shows
the weekly weight loss plotted against the weekly workout hours.
This graph shows a clear positive relationship between workout and weight loss.
We could find the mathematical definition of the straight line going through those
points and then use this mathematical formula as a model to estimate weekly weight
loss as a function of the number of hours worked out. This can be shown graphically in
Figure 11-2.
Figure 11-2. The simple linear regression added to the graph. Image by author
y=a*x+b
Weight_Loss = a * Workout + b
Mathematical procedures to determine the best-fitting values for a and b exist and
can be used to estimate this model. The exact mathematics behind this will be left for
further reading as to not go out of scope for the current book. However, it is important to
understand the general idea behind estimating such a model.
It is also important to consider which next steps are possible, so let’s spend some
time to consider those. Firstly, the current model is using only a single explanatory
variable (workout), which is not really representative of how one would go about
losing weight.
In reality, one could consider that the food quantity is also a very important factor
in losing weight. This would need an extension of the mathematical formula to become
something like the following:
Weight_Loss = a * Workout + b * Food_Quantity + c
In this case, the mathematics behind the linear regression would need to find the best
values for three coefficients: a, b, and c. This can go on by adding more variables and
more coefficients.
Until here, we have firstly discussed the simple linear regression, followed by linear
regression more generally. Although linear regression is often great for fitting regression
models, there are many other mathematical and algorithmic functions that can be used.
Examples of other models are Decision Trees, Random Forest, and Boosting models.
Deep down, each of them has their own definition of a basic model, in which the
data can be used to fit the model (the alternative of estimating the coefficients in the
linear model). Many reference works exist for those readers wanting to gain in-depth
mathematical insights into the exact workings of those models.
For now, let’s move on to a more applied vision by working through a regression use
case using spatial data.
The geodataset can be imported using geopandas and Fiona, just as you have seen in
earlier chapters of this book. This is done in Code Block 11-1.
import geopandas as gpd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
geodata = gpd.read_file('chapter 11 data.kml')
geodata.head()
Figure 11-3. The data resulting from Code Block 11-1. Image by author
The other variables are in the Excel file, which you can import using the code in
Code Block 11-2.
import pandas as pd
apartment_data = pd.read_excel('house_data.xlsx')
apartment_data.head()
As you can see from this image, the data contains an apartment identifier (Apt ID), the maximum number of guests (MaxGuests), whether breakfast is included (IncludesBreakfast), and the price (Price).
The Apt ID is not in the same format as the identifier in the geodata. It is necessary to
convert the values in order to make them correspond. This will allow us to join the two
datasets together in a later step. This is done using the code in Code Block 11-3.
After this operation, the dataset now looks as shown in Figure 11-5.
Figure 11-5. The data resulting from Code Block 11-3. Image by author
Now that the two datasets have an identifier that corresponds, it is time to start the
merge operation. This merge will bring all columns into the same dataset, which will be
easier for working with the data. This merge is done using the code in Code Block 11-4.
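A sketch of what this merge could look like, assuming the geodata's Name column now matches the converted apartment identifier column (both column names are assumptions), is:

# Join the apartment attributes onto the geodata using the shared identifier
merged_data = geodata.merge(apartment_data, left_on='Name', right_on='Apt ID')
merged_data.head()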
We now have all the columns inside the same dataframe. This concludes the data
preparation phase. As a last step, let’s do a visualization of the apartment locations
within Amsterdam, to get a better feeling for the data. This is done using the code in
Code Block 11-5.
import contextily as cx
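A plausible completion of this plot, following the same pattern used elsewhere in the book (the figure size and marker size are assumptions), is:

# Plot the apartment locations and add a background map of Amsterdam
ax = merged_data.plot(figsize=(15, 15), markersize=64)
cx.add_basemap(ax, crs=merged_data.crs)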
Figure 11-7. The map resulting from Code Block 11-5. Image by author using
contextily source data and image as referenced in the image
You can see that the apartments used in this study are pretty well spread throughout
the center of Amsterdam. In the next section, we will do more in-depth exploration of
the dataset.
This histogram shows us that the prices are all between 90 and 170, with the majority
being at 130. The data does not seem to follow a perfectly normal distribution, although
we do see more data points being closer to the center than further away.
If we needed to give a very quick-and-dirty estimation of the most appropriate price for our Airbnb, we could simply use the average price of Airbnbs in the center of Amsterdam. The code in Code Block 11-7 computes this mean.
The result is 133.75, which tells us that setting this price would probably be a more or less usable estimate if we had nothing more precise. Of course, as prices range from 90 to 170, we could still end up asking considerably too much or too little for our particular apartment.
Although the trend is less clear than the one observed in the theoretical example
in the beginning of this chapter, we can clearly see that higher values on the x axis
(MaxGuests) generally have higher values on the y axis (Price). Figure 11-9 shows this.
Figure 11-9. The scatter plot of Price against MaxGuests. Image by author
The quality of a linear relationship can also be measured using a more quantitative
approach. The Pearson correlation coefficient is a sort of score between –1 and 1 that
gives this indication. A value of 0 means no correlation, a value close to –1 means a
negative correlation between the two, and a value close to 1 means a positive correlation
between the variables.
The correlation coefficient can be computed using the code in Code Block 11-9.
import numpy as np
np.corrcoef(merged_data['MaxGuests'], merged_data['Price'])
This will give you the correlation matrix as shown in Figure 11-10.
The resulting correlation coefficient between MaxGuests and Price is 0.453. This
is a fairly strong positive correlation, indicating that the number of guests has a strong
positive impact on the price that we can ask for an Airbnb. In short, Airbnbs for more people can ask a higher price, whereas Airbnbs for a small number of guests should be priced lower.
As a next step, let’s see whether we can also use the variable IncludesBreakfast for
setting the price of our Airbnb. As the breakfast variable is categorical (yes or no), it is
better to use a different technique for investigating this relationship. The code in Code
Block 11-10 creates a boxplot to answer this question.
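A sketch of what this boxplot could look like with pandas is:

# Compare the price distribution with and without breakfast included
merged_data.boxplot(column='Price', by='IncludesBreakfast')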
This boxplot shows us that Airbnbs that offer a breakfast are generally able to ask a higher price than Airbnbs that do not. Depending on whether you offer a breakfast, you should price your apartment accordingly.
X = merged_data[['IncludesBreakfast', 'MaxGuests']]
y = merged_data['Price']
We will use a linear model for this phase of modeling. The scikit-learn
implementation of the linear model can be estimated using the code in Code
Block 11-12.
# first version lets just do a quick and dirty non geo model
from sklearn.linear_model import LinearRegression
lin_reg_1 = LinearRegression()
lin_reg_1.fit(X, y)
Now that the model has been fitted, we have the mathematical definition (with the
estimated coefficients) inside our linear regression object.
print('When no breakfast and 0 Max Guests then price is estimated at: ',
lin_reg_1.intercept_)
The R2 score can be computed on all data, but this is not the preferred way for
estimating model performance. Machine learning models tend to learn very well on the
data that was seen during training (fitting), without necessarily generalizing very well.
A solution for this is to split the data in a training set (observations that are used for the
training/estimation) and a test set that is used for model evaluation.
The code in Code Block 11-14 splits that initial data into a training and a test set.
Let’s now fit the model again, but this time only on the training data. This is done in
Code Block 11-15.
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_train, y_train)
To estimate the performance, we use the estimated model (in this case, the
coefficients and the linear model formula) to predict prices on the test data.
This is done in Code Block 11-16.
pred_reg_2 = lin_reg_2.predict(X_test)
We can use these predicted values together with the real, known prices of the test set
to compute the R2 scores. This is done in Code Block 11-17.
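A sketch of what this could look like with scikit-learn is:

from sklearn.metrics import r2_score

# Compare the predictions against the true prices of the test set
r2_score(y_test, pred_reg_2)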
The resulting R2 score is 0.1007. Although not a great result, the score shows that the
model has some predictive value and would be a better segmentation than using the
mean for pricing.
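The next iteration adds the apartment coordinates as explicit columns. Assuming the geometries are points, a sketch of what this could look like is:

# Extract longitude and latitude from the point geometries
merged_data['long'] = merged_data.geometry.x
merged_data['lat'] = merged_data.geometry.y
merged_data.head()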
Figure 11-12. The dataset resulting from Code Block 11-18. Image by author
Let’s see how latitude and longitude are related to the price by making scatter plots
of price vs. latitude and price vs. longitude. The first scatter plot is created in Code
Block 11-19.
plt.scatter(merged_data['lat'], merged_data['Price'])
Figure 11-13. The scatter plot resulting from Code Block 11-19. Image by author
There does not seem to be too much of a trend in this scatter plot. It seems that
prices are ranging between 90 and 170, and that is not different for any other latitude.
Let’s use the code in Code Block 11-20 to check whether this is true for longitude as well.
plt.scatter(merged_data['long'], merged_data['Price'])
Figure 11-14. The scatter plot of price vs. longitude. Image by author
These two scatter plots do not really capture location. After all, we could easily
imagine that relationships with latitude and longitude are not necessarily linear. It would be strange to expect a rule like “the further east, the lower the price” to always hold. It is more likely that there are specific high-value and low-value areas
within the overall area. In the following code, we create a visualization that plots prices
based on latitude and longitude at the same time.
The first step in creating this visualization is to convert price into a variable that can
be used to set point size for the apartments. This requires our data to be scaled into a
range that is more appropriate for plotting, for example, setting the cheapest apartments
to a point size of 16 and the most expensive ones to a point size of 512. This is done in
Code Block 11-21.
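A sketch of what this scaling could look like, using a min-max rescaling into the range 16 to 512 and a hypothetical column name plot_size, is:

# Rescale Price to the range [16, 512] for use as a marker size
price = merged_data['Price']
merged_data['plot_size'] = 16 + (price - price.min()) / (price.max() - price.min()) * (512 - 16)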
We can now use this size setting when creating the scatter plot. This is done in Code
Block 11-22.
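A sketch of this scatter plot, reusing the hypothetical plot_size column from the sketch above, is:

import matplotlib.pyplot as plt

# Plot apartment locations, with marker size proportional to price
plt.scatter(merged_data['long'], merged_data['lat'], s=merged_data['plot_size'], alpha=0.5)
plt.show()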
This graph shows that there are no linear relationships, but that we could expect
some areas to be learned that have generally high or generally low prices. This would mean that we may need to switch to a nonlinear model to fit this reality better.
The R2 score that is obtained by this model is 0.498. This is actually quite an
improvement compared to the first iteration (which was at R2 of 0.10). This gives
confidence in moving on to the tests of a nonlinear model in the next section.
The score that this model obtains is –0.04. Unexpectedly, we have a much worse
result than in the previous step. Be careful here, as the DecisionTree results will be
different for each execution due to randomness in the model building phase. You will
probably have a different result than the one presented here, but if you try out different
runs, you will see that the average performance is worse than the previous iteration.
The DecisionTreeRegressor, just like many other models, can be tuned using a large
number of hyperparameters. In this iteration, no hyperparameters were specified, which
means that only default values were used.
As we have a strong intuition that nonlinear models should be able to obtain better
results than a linear model, let’s play around with hyperparameters in the next iteration.
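A sketch of what a simple tuning loop over max_depth could look like is shown below; the names X2_train, X2_test, y2_train, and y2_test are assumptions for the spatially enriched train/test split (only X2_train appears explicitly in the later code).

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Try several tree depths and report the test-set R2 for each
for max_depth in range(1, 11):
    dt_reg = DecisionTreeRegressor(max_depth=max_depth)
    dt_reg.fit(X2_train, y2_train)
    preds = dt_reg.predict(X2_test)
    print(max_depth, r2_score(y2_test, preds))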
Figure 11-16. The result of the model tuning loop. Image by author
In this output, you can see that the max_depth of 3 has resulted in an R2 score of
0.54, much better than the result of –0.04. Tuning on max_depth has clearly had an
important impact on the model’s performance. Many other trials and iterations would
be possible, but that is left as an exercise. For now, the DecisionTreeRegressor with max_
depth = 3 is retained as the final regression model.
import matplotlib.pyplot as plt
from sklearn import tree

plt.figure(figsize=(15, 15))
tree.plot_tree(dt_reg_5, feature_names=X2_train.columns)
plt.show()
Figure 11-17. The tree plot resulting from Code Block 11-26. Image by author
We can clearly see which nodes have learned which trends. We can see that latitude
and longitude are used multiple times by the model, which allows the model to split out
specific areas on the map that are to be priced lower or higher.
As this is the final model for the current use case, and we know that the R2 score tells
us that the model is a much better estimation than using just an average price, we can
be confident that pricing our Airbnb using the decision tree model will result in a more
appropriate price for our apartment.
The goal of the use case has therefore been reached: we have created a regression
model to use both spatial data and apartment data to make the best possible price
estimation for an Airbnb in Amsterdam.
Key Takeaways
1. Regression is an area in supervised machine learning that deals
with models that learn how to use independent variables to
predict a numeric target variable.
CHAPTER 12
Clustering
In this fourth and last chapter on machine learning, we will cover clustering. To get this
technique in perspective, let’s do a small recap of what we have gone through in terms of
machine learning until now.
The machine learning topics started after the introduction of interpolation. In
interpolation, we tried to estimate a target variable for locations at which the value of this
target variable is unknown. Interpolation uses a mathematical formula to decide on the
best possible theoretical way to interpolate these values.
After interpolation, we covered classification and regression, which are the two main
categories in supervised modeling. In supervised modeling, we build a model that uses X
variables to predict a target (y) variable. The great thing about supervised models is that
we have a large number of performance metrics available that can help us in tuning and
improving the model.
The third main category, unsupervised modeling, works without a target variable. Two important families of unsupervised models are:
1. Feature reduction
2. Clustering
In feature reduction, the goal is to take a dataset with a large number of variables
and then redefine these variables into a smaller, more efficient set. Especially when
many of the variables are strongly correlated, you can reduce the number of variables in
the dataset in such a way that the new variables are not correlated.
Feature reduction will be a great first step for machine learning data preprocessing
and can also be used for data analysis. Examples of methods are PCA, Factor Analysis,
and more. Feature reduction is not much different on geospatial data than on regular
data, which is why we will not dedicate more space to this technique.
A second family of models within unsupervised models is clustering. Clustering
is very different from feature reduction, except for the fact that the notion of target
variable is absent in both types of models. Clustering on spatial data is quite different
from clustering on regular data, which is why this chapter will present clustering on
geodata in depth.
Introduction to Clustering
In clustering, the goal is to identify clusters, or groups, of observations based on some
measure of similarity or distance. As mentioned before, there is no target variable here:
we simply use all of the available variables about each observation to create groups of
similar observations.
Let’s consider a simple and often used example. In the graph in Figure 12-1, you’ll
see a number of people (each person is an observation) for whom we have collected the spending on two product groups at a supermarket: snacks and fast food as the first category and healthy products as the second.
As this data has only two variables, it is relatively easy to identify three groups of clients in
this database. A subjective proposal for boundaries is presented in the graph in Figure 12-2.
In this graph, you see that the clients have been divided into three groups.
A number of well-known clustering algorithms exist, including the following:
–– Hierarchical clustering
–– DBSCAN
–– OPTICS
–– Gaussian mixture
Many of those models, however, are unfortunately not usable for spatial data. The
problem with most models is that they compute Euclidean distances between two data
points or other distance and similarity metrics.
In spatial data, as covered extensively in this book, we work with latitude and
longitude coordinates, and there are specific things to take into account when
computing distances from one coordinate to the other. Although we could use basic
clustering models as a proxy, this would be wrong, and we’d need to hope that the impact is not too large. It is better to choose clustering methods that are
specifically designed for spatial data and that can take into account correct measures of
distance.
The main thing that you need to consider very strongly in clustering on spatial data
is the distance metric that you are going to use. There is no one-size-fits-all method here,
but we’ll discover such approaches and considerations throughout the spatial clustering
use case that is presented in this chapter.
import geopandas as gpd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
geodata = gpd.read_file('chapter_12_data.kml')
geodata.head()
The data are stored as one LineString for each person. There are no additional
variables available. Let’s now make a simple plot to have a better idea of the type
of trajectories that we are working with. This can be done using the code in Code
Block 12-2.
geodata.plot()
To add a bit of context to these trajectories, we can add a background map to this
graph using the code in Code Block 12-3.
import contextily as cx
ax = geodata.plot(figsize=(15,15), markersize=64)
cx.add_basemap(ax, crs = geodata.crs)
Figure 12-5. The map resulting from Code Block 12-3. Image by author using
contextily source data and image as referenced in the image
The three trajectories are based in the city of Brussels. For each of the three
trajectories, you can visually identify a similar pattern: there are clustered parts where
there are multiple points in the same neighborhood, indicating points of interest.
Then there are also some parts where there is a real line-like pattern which indicates
movements from one point of interest to another.
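Let's now zoom in on a single participant. A sketch of how the first person could be selected, assuming the participant is identified by the Name column, is:

# Keep only the trajectory of Person 1 and reset the index so that .loc[0] works later
one_person = geodata[geodata['Name'] == 'Person 1'].reset_index(drop=True)
one_person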
Figure 12-6. The result from Code Block 12-4. Image by author
Let’s plot the trajectory of this person in order to have a more detailed vision of the
behavior of this person. This map can be made using the code in Code Block 12-5.
ax = one_person.plot(figsize=(15,15), markersize=64)
cx.add_basemap(ax, crs = one_person.crs)
Figure 12-7. The map resulting from Code Block 12-5. Image by author using
contextily source data and image as referenced in the image
You can see from this visualization that the person has been at two locations for a
longer period: one location on the top left of the map and a second point of interest on
the bottom right. We want to build a clustering model that is capable of capturing
these two locations.
To start building a clustering model for Person 1, we need to convert the LineString
into points. After all, we are going to cluster individual points to identify clusters of
points. This is done using the code in Code Block 12-6.
import pandas as pd

# Parse the WKT string of the LineString into separate long/lat columns
one_person_points_df = pd.DataFrame(
    [x.strip('(').strip(')').strip('0').strip(' ').split(' ')
     for x in str(one_person.loc[0, 'geometry'])[13:].split(',')],
    columns=['long', 'lat']
)
one_person_points_df = one_person_points_df.astype(float)
one_person_points_df.head()
The data format that results from this code is shown in Figure 12-8.
Figure 12-8. The new data format of latitude and longitude as separate columns.
Image by author
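As a side note, a possibly more robust alternative to parsing the string representation is to read the coordinates directly from the shapely geometry object; this is a sketch, not the book's code, and it assumes the geometry is a (possibly 3D) LineString.
# Read coordinates straight from the shapely LineString instead of parsing text
line = one_person.loc[0, 'geometry']
one_person_points_df = pd.DataFrame(
    [(coord[0], coord[1]) for coord in line.coords],  # keep only x (long) and y (lat)
    columns=['long', 'lat']
)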
Now that we have the right data format, it is time to apply a clustering method. As
our data is in latitude and longitude, the distance between two points should be defined
using haversine distance. We choose to use the OPTICS clustering method, as it applies
well to spatial data. Its behavior is the following:
–– OPTICS is able to discard points: when points are far away from all
identified clusters, they can be coded as –1, meaning an outlier data
point. This will be important in the case of spatial clustering, as there
will be many data points that are on a transportation part of the
trajectory that will be quite far away from the cluster centers. This
option is not available in all clustering methods, but it is there in
OPTICS and some other methods like DBSCAN.
Let’s start with an OPTICS clustering that uses the default settings. This is done in the
code in Code Block 12-7.
from sklearn.cluster import OPTICS
import numpy as np
# OPTICS with haversine distance, which expects coordinates in radians (lat, long)
clustering = OPTICS(metric='haversine')
one_person_points_df.loc[:, 'cluster'] = clustering.fit_predict(
    np.radians(one_person_points_df[['lat', 'long']]))
The previous code has created a column called cluster in the dataset, which now
contains the cluster that the model has found for each row, that is, for each data point. The
code in Code Block 12-8 shows how to get an idea of how the clusters are distributed.
one_person_points_df['cluster'].value_counts()
Now, as said before, cluster –1 identifies the outliers. Let's delete them from the data
with the code in Code Block 12-9.
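Code Block 12-9 itself is not reproduced on this page; the same filtering step appears again later in this chapter, so a minimal version looks as follows.
# Drop the outlier points, which OPTICS labels with cluster -1
one_person_points_df = one_person_points_df[one_person_points_df['cluster'] != -1]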
We can now compute the central points of each cluster by computing the median
point with a groupby operation. This is done in Code Block 12-10.
medians_of_POI = one_person_points_df.groupby(
    ['cluster'])[['lat', 'long']].median().reset_index(drop=False)
medians_of_POI
Let’s plot those central coordinates on a map using the code in Code Block 12-11.
The basic plot with the three central points is shown in Figure 12-11.
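Code Block 12-11 is not reproduced on this page; a sketch consistent with the GeoDataFrame construction used later in Code Block 12-13 could look like this.
from shapely.geometry import Point

# Turn the cluster medians into a GeoDataFrame of points and plot them
medians_of_POI_gdf = gpd.GeoDataFrame(
    medians_of_POI,
    geometry=[Point(x) for x in zip(list(medians_of_POI['long']),
                                    list(medians_of_POI['lat']))]
)
medians_of_POI_gdf.plot(figsize=(15, 15), markersize=128)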
Figure 12-11. The plot with the three central points of Person 1. Image by author
Let’s use the code in Code Block 12-12 to add more context to this map.
ax = one_person.plot(figsize=(15,15))
medians_of_POI_gdf.plot(ax=ax,markersize=128)
cx.add_basemap(ax, crs = one_person.crs)
Figure 12-12. Plotting the central points on a background map. Image by author
using contextily source data and image as referenced in the image
This map shows that the clustering was not totally successful. The cluster centroid
on the top left did correctly identify a point of interest, and so did the one on the bottom
right. However, there is one additional centroid in the middle that should not have been
identified as a point of interest. In the next section, we will tune the model to improve
this result.
In the following code, a number of different settings have been used in the OPTICS model:
–– Max_eps is set to 2.
–– Min_cluster_size is set to 8.
–– Xi is set to 0.05.
These hyperparameter values have been obtained by trying out different settings and
then checking whether the identified centroids coincided with the points of interest on the
map. The code is shown in Code Block 12-13.
from shapely.geometry import Point

# OPTICS with the tuned hyperparameters
clustering = OPTICS(
    min_samples=10,
    max_eps=2.,
    min_cluster_size=8,
    xi=0.05,
    metric='haversine')

one_person_points_df.loc[:, 'cluster'] = clustering.fit_predict(
    np.radians(one_person_points_df[['lat', 'long']]))

# Remove the outliers and compute the median point of each cluster
one_person_points_df = one_person_points_df[one_person_points_df['cluster'] != -1]
medians_of_POI = one_person_points_df.groupby(
    ['cluster'])[['lat', 'long']].median().reset_index(drop=False)
print(medians_of_POI)

# Turn the medians into a GeoDataFrame of points and plot them on a basemap
medians_of_POI_gdf = gpd.GeoDataFrame(
    medians_of_POI,
    geometry=[Point(x) for x in zip(
        list(medians_of_POI['long']),
        list(medians_of_POI['lat'])
    )]
)
ax = one_person.plot(figsize=(15, 15))
medians_of_POI_gdf.plot(ax=ax, markersize=128)
cx.add_basemap(ax, crs=one_person.crs)
Figure 12-13. The map resulting from Code Block 12-13. Image by author using
contextily data and image as referenced in the map
As you can see, the model has correctly identified the two points (top left and bottom
right) and no other points. The model is therefore successful, at least for this person. In
the next section, we will apply this approach to the other data and see whether the new
cluster settings give correct results for the other persons as well.
To do so, the same clustering and plotting steps are applied to each person in a loop; the per-person point extraction below mirrors the one used earlier for Person 1.
import matplotlib.pyplot as plt

for idx, row in geodata.iterrows():

    # Convert this person's LineString into a dataframe of points
    one_person_points_df = pd.DataFrame(
        [x.strip('(').strip(')').strip('0').strip(' ').split(' ')
         for x in str(row['geometry'])[13:].split(',')],
        columns=['long', 'lat']
    ).astype(float)

    clustering = OPTICS(
        min_samples=10,
        max_eps=2.,
        min_cluster_size=8,
        xi=0.05,
        metric='haversine')

    one_person_points_df.loc[:, 'cluster'] = clustering.fit_predict(
        np.radians(one_person_points_df[['lat', 'long']]))

    # Remove outliers and compute the median point of each cluster
    one_person_points_df = one_person_points_df[one_person_points_df['cluster'] != -1]
    medians_of_POI = one_person_points_df.groupby(
        ['cluster'])[['lat', 'long']].median().reset_index(drop=False)
    print(medians_of_POI)

    medians_of_POI_gdf = gpd.GeoDataFrame(
        medians_of_POI,
        geometry=[Point(x) for x in zip(
            list(medians_of_POI['long']),
            list(medians_of_POI['lat'])
        )]
    )

    # Plot this person's trajectory together with the identified points of interest
    ax = gpd.GeoDataFrame([row],
                          geometry=[row['geometry']]).plot(figsize=(15, 15))
    medians_of_POI_gdf.plot(ax=ax, markersize=128)
    plt.show()
The resulting output and graphs are shown hereafter in Figures 12-14, 12-15,
and 12-16.
This first map shows the result that we have already seen before: for Person 1,
the OPTICS model has correctly identified the two points of interest. Figure 12-15 shows
the results for Person 2.
Figure 12-15. The three central points of Person 2 against their trajectory. Image
by author
For Person 2, we can see that there are three points of interest, and the OPTICS
model has correctly identified those three centroids. The model is therefore considered
successful on this person. Let’s now check the output for the third person in
Figure 12-16.
This result for Person 3 is also successful. There were two points of interest in the
trajectory of Person 3, and the OPTICS model has correctly identified those two.
Key Takeaways
1. Unsupervised machine learning is a counterpart to supervised
machine learning. In supervised machine learning, there is a
ground truth with a target variable. In unsupervised machine
learning, there is no target variable.
2. Feature reduction is a family of methods in unsupervised machine
learning, in which the goal is to redefine variables. Applying feature
reduction in spatial use cases is not very different from applying it in
standard use cases.
CHAPTER 13
Conclusion
Throughout the 12 chapters of this book, you have been thoroughly introduced to three
main themes. The book started with an introduction to spatial data in general, spatial
data tools, and specific knowledge needed to work efficiently with spatial data.
After that, a number of common tools from Geographic Information Systems (GIS)
and the general domain of spatial analysis were presented. The final chapters of this
book were dedicated to machine learning on spatial data. The focus there was on those
decisions and considerations in machine learning that are different when working on
machine learning with spatial data.
In this chapter, we will do a recap of the main learnings of each of the chapters. At
the end, we will come back to some next steps for continuing learning in the domain of
machine learning, data science, and spatial data.
Chapter 1 started by introducing fundamental concepts of geodata, including coordinate systems such as:
–– Cartesian coordinates
The chapter then moved on to introduce a number of standard tools for GIS analysis
including ArcGIS, QGIS, and other open source software. Python was used as a tool in
this book, and there are numerous convincing reasons to use it. If you want to become a
GIS analysis expert, it may be useful to learn other tools as well, but we will come back to
options for future learning paths later in this chapter.
Chapter 1 moved on to introduce multiple data storage types for spatial data:
–– Shapefiles
–– KML files
–– GeoJSON
When working with Python, we generally have much more freedom in data types, as
we are able to program any processing operation that we could need. This is not always
the case in click-button environments. In any case, it is important to be able to interoperate
with any data storage format that you may encounter.
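As a brief illustration, geopandas can read all three of these formats through the same interface (the KML driver may need to be enabled first, as was done in Chapter 12); the file names below are hypothetical.
import geopandas as gpd

# Hypothetical file names; read_file infers the driver from the file
shapefile_data = gpd.read_file('my_data.shp')
geojson_data = gpd.read_file('my_data.geojson')

# KML needs its driver enabled first
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
kml_data = gpd.read_file('my_data.kml')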
The chapter concluded by presenting a number of important Python libraries for
working with spatial data in Python, some of which were used extensively throughout
the other chapters of the book.
There are many coordinate systems out there, and finding the one that corresponds
best to your need can be a challenge. In case of doubt, it may be best to stick to the more
common projections rather than the more advanced, as your end users may be surprised
if the map doesn’t correspond to something they are familiar with.
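As a small illustration (not from the book), reprojecting a GeoDataFrame to a widely used coordinate system is a single call in geopandas; the variable name and EPSG code here are only examples.
# Reproject to WGS 84 (EPSG:4326), one of the most commonly used systems
my_geodataframe = my_geodataframe.to_crs(epsg=4326)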
–– Point data has only a single coordinate pair. A point has no size and no shape,
just a location.
Choosing the data type for your project will really depend on your use case. Although
conversions between data types are sometimes not straightforward to define, cases where
conversion is necessary do occur, and you can generally use some of the GIS spatial tools
for this.
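As a brief sketch (not the book's code) of such conversions in geopandas, two common ones are turning polygons into representative points and turning points into small polygons by buffering; the GeoDataFrames below are hypothetical, assumed to exist already.
# Polygon -> point: take the centroid of each polygon
polygon_gdf['geometry'] = polygon_gdf.geometry.centroid

# Point -> polygon: buffer each point (the distance is expressed in the units of the CRS)
point_gdf['geometry'] = point_gdf.geometry.buffer(100)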
The chapter on interpolation presented a number of methods, including the following:
–– Linear interpolation
–– Polynomial interpolation
Other Specialties
There are also other fields of study related to the topics that have been touched
on in this book. As an example, we have talked extensively about different data storage
types, but we have not had room to discuss things like dedicated GIS databases
and long-term storage. If you are interested in data engineering or databases, there is
more to learn on specific data storage for spatial data, together with everything that goes
with it (data architectures, security, accessibility, etc.).
Many other domains also have GIS-intensive workloads. Consider the fields of
meteorology, hydrology, and some domains of ecology and earth sciences in which
many professionals are GIS experts just because of the heavy impact of spatial data in
those fields.
When mastering spatial data operations, you will be surprised by how many fields
can actually benefit from spatial operations. Some domains already know this and are very
GIS heavy in their daily work, while in other domains, everything is yet to be invented.
Key Takeaways
1. Throughout this book, you have seen three main topics: an introduction to
spatial data and its specificities, common tools and methods from GIS and
spatial analysis, and machine learning on spatial data.
2. There are several ways to continue your learning path, including:
–– Specializing in GIS by going into more detail of different GIS tools and
mapmaking.
–– Going into advanced earth observation use cases and combining this with
the study of the field of computer vision.