1 What Is Geospatial Data? - Geospatial Data Science With Julia
In this chapter, we define geospatial data and introduce a universal representation for it which
is ideal for geostatistical analysis. Unlike other representations in the literature (e.g., “raster”,
“vector” data), the proposed representation is suitable for encoding geospatial data over 3D
unstructured meshes, 2D images embedded in 3D space, and other types of complex geospatial
domains.
1.1 Definition
Definition
(Discrete) geospatial data is the combination of a table of attributes (or features) with a discretization
of a geospatial domain. Each row (or measurement) in the table corresponds to an element (or
geometry) in the discretization of the geospatial domain.
1.1.1 Table
In data science the most natural data structure for working with data is the table. Generally
speaking, a table is any object that can be structured into rows containing measurements and
columns representing variables. For example, Table 1.1 has 5 measurements of 4 variables:
1.1.2 Domain
The second definition that we need is that of a geospatial domain. In geosciences, questions are
often formulated within a physical region of interest. This physical region can cover a small area
of the surface of the Earth, the entire Earth surface, or any region of finite measure that can be
discretized into smaller geometries (a.k.a. elements):
(a) Coast line in Islay Province, Peru. View on Google Maps
(b) Synthetic carbonate reservoir model by Correia et al. (2015). See UNISIM-II for more details
Figure 1.1 illustrates two very different examples of geospatial domains. The domain in
Figure 1.1 (a) is widely studied in GIS books. It is a 2D domain that covers a small area
near Islay Province, Peru. The domain in Figure 1.1 (b), on the other hand, is not considered in
traditional GIS literature. It is a 3D domain that has been discretized into hexahedron
geometries.
1.1.3 Remarks
Images like the one depicted in Figure 1.1 (a) are often implemented in terms of the array
data structure. GIS books call it “raster data”, but we will avoid this term in our framework in
order to obtain a more general set of tools.
According to our definition of geospatial data, “raster data” is simply a table with colors as
variables (e.g., RGB values) combined with a grid of quadrangle geometries. We illustrate
this concept in Figure 1.2 by zooming in on a satellite image of the Lena delta:
There are no constraints on the geometries used in the discretization of the geospatial
domain. In Figure 1.3, Brazil is discretized into complex polygonal geometries that
represent country states:
Figure 1.3: Brazil’s states represented with complex polygonal geometries. View on Google Maps.
GIS books call it “vector data” because the geometries are stored as vectors of coordinates
in memory. We will also avoid this term in our framework given that it only highlights an
implementation detail.
Before we start discussing machine representation with actual Julia code, let’s make a final
(pedantic) distinction between the words geospatial and spatial. These words mean different
things in different communities:
Given that geospatial data science deals with both concepts, we must use these words carefully.
Note
In Geostatistical Learning, models can exploit both spaces to improve prediction performance, but that
is beyond the scope of this book.
1.2 Representation
Based on the definition of geospatial data given in the previous section, we are now ready to
proceed and discuss an efficient machine representation for it with actual Julia code.
1.2.1 Table
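The Julia language comes with two built-in table representations: a named tuple of vectors and
a vector of named tuples. The first representation focuses on the columns of the table: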
coltable = (
  NAME=["John", "Mary", "Paul", "Anne", "Kate"],
  AGE=[34, 12, 23, 39, 28],
  HEIGHT=[1.78, 1.56, 1.70, 1.80, 1.72],
  GENDER=["male", "female", "male", "female", "female"]
)
(NAME = ["John", "Mary", "Paul", "Anne", "Kate"], AGE = [34, 12, 23, 39, 28],
HEIGHT = [1.78, 1.56, 1.7, 1.8, 1.72], GENDER = ["male", "female", "male",
"female", "female"])
Given that data science is often performed with entire columns, this column-major
representation of a table is very convenient. The second representation focuses on the rows of
the table:
rowtable = [
  (NAME="John", AGE=34, HEIGHT=1.78, GENDER="male"),
  (NAME="Mary", AGE=12, HEIGHT=1.56, GENDER="female"),
  (NAME="Paul", AGE=23, HEIGHT=1.70, GENDER="male"),
  (NAME="Anne", AGE=39, HEIGHT=1.80, GENDER="female"),
  (NAME="Kate", AGE=28, HEIGHT=1.72, GENDER="female")
]
5-element Vector{@NamedTuple{NAME::String, AGE::Int64, HEIGHT::Float64,
GENDER::String}}:
(NAME = "John", AGE = 34, HEIGHT = 1.78, GENDER = "male")
(NAME = "Mary", AGE = 12, HEIGHT = 1.56, GENDER = "female")
(NAME = "Paul", AGE = 23, HEIGHT = 1.7, GENDER = "male")
(NAME = "Anne", AGE = 39, HEIGHT = 1.8, GENDER = "female")
(NAME = "Kate", AGE = 28, HEIGHT = 1.72, GENDER = "female")
The row-major representation can be useful to process data that is potentially larger than the
available computer memory, or infinite streams of data.
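To make the difference between the two layouts concrete, the following sketch uses only Base Julia and a shortened version of the table above. It computes the same statistic from each layout and converts one layout into the other. The helper names `torows` and `tocolumns` are hypothetical and are not part of any library.

```julia
# Column-major layout: a named tuple of vectors
people_cols = (
  NAME=["John", "Mary", "Paul"],
  AGE=[34, 12, 23],
  HEIGHT=[1.78, 1.56, 1.70]
)

# Hypothetical helper: build one named tuple per row
torows(cols) = [NamedTuple{keys(cols)}(vals) for vals in zip(values(cols)...)]

# Hypothetical helper: build one vector per column
tocolumns(rows) = NamedTuple{keys(first(rows))}(
  Tuple([row[name] for row in rows] for name in keys(first(rows)))
)

# Row-major layout: a vector of named tuples
people_rows = torows(people_cols)

# Whole-column operation: direct access to the AGE vector
meanage_cols = sum(people_cols.AGE) / length(people_cols.AGE)

# Row-by-row operation: the generator visits one row at a time
meanage_rows = sum(row.AGE for row in people_rows) / length(people_rows)
```

Both computations give the same mean age. The column version touches a single vector, whereas the row version only needs one row in memory at a time, which is why it suits streams of data.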
Although these two representations come built-in with Julia, they lack basic functionality for
data science. The most widely used table representation for data science in Julia is available in
DataFrames.jl by Bouchet-Valat and Kamiński (2023).
using DataFrames
df = DataFrame(
  NAME=["John", "Mary", "Paul", "Anne", "Kate"],
  AGE=[34, 12, 23, 39, 28],
  HEIGHT=[1.78, 1.56, 1.70, 1.80, 1.72],
  GENDER=["male", "female", "male", "female", "female"]
)
5×4 DataFrame
This representation provides additional syntax for accessing rows and columns of the table:
df[1,:]
DataFrameRow (4 columns)
df[:,"NAME"]
5-element Vector{String}:
"John"
"Mary"
"Paul"
"Anne"
"Kate"
df[1:3,["NAME","AGE"]]
3×2 DataFrame
String Int64
1 John 34
2 Mary 12
3 Paul 23
df.HEIGHT
5-element Vector{Float64}:
1.78
1.56
1.7
1.8
1.72
df."HEIGHT"
5-element Vector{Float64}:
1.78
1.56
1.7
1.8
1.72
Note
Unlike other languages, Julia makes a distinction between the symbol :HEIGHT and the string
"HEIGHT". The DataFrame representation supports both types for column names, but that is not
always the case with other table representations.
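The distinction can be checked directly in Base Julia. The sketch below uses a small named tuple (the variable names are illustrative) to show one table-like representation that accepts symbols but not strings as column names.

```julia
# A symbol and a string with the same characters are different objects
:HEIGHT == "HEIGHT"            # false
Symbol("HEIGHT") === :HEIGHT   # true
String(:HEIGHT) == "HEIGHT"    # true

# Named tuples are indexed by symbols
nt = (HEIGHT=[1.78, 1.56],)
bysymbol = nt[:HEIGHT]

# Indexing the same named tuple with a string throws an error
stringfails = try
  nt["HEIGHT"]
  false
catch
  true
end
```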
Other popular table representations in Julia are associated with specific file formats:
1.2.2 Domain
All available domain representations come from the Meshes.jl module.
using GeoStats
p = Point(1, 2)
s = Segment((0, 2), (1, 3))
t = Triangle((0, 0), (1, 0), (1, 1))
b = Ball((2, 2), 1)
geoms = [p, s, t, b]
Because these geometries are unaware of each other, we place them into a GeometrySet,
informally known in computational geometry as the “soup of geometries” data structure:
gset = GeometrySet(geoms)
4 GeometrySet
├─ Point(x: 1.0 m, y: 2.0 m)
├─ Segment((x: 0.0 m, y: 2.0 m), (x: 1.0 m, y: 3.0 m))
├─ Triangle((x: 0.0 m, y: 0.0 m), (x: 1.0 m, y: 0.0 m), (x: 1.0 m, y: 1.0 m))
└─ Ball(center: (x: 2.0 m, y: 2.0 m), radius: 1.0 m)
No advanced knowledge is required to start working with these geometries. For example, we
can compute the length of the Segment, the area of the Triangle and the area of the Ball
with:
length(s), area(t), area(b)
More generally, we can compute the measure of the geometries in the domain:
[measure(g) for g in gset]
4-element Vector{Quantity{Float64}}:
0.0 m
1.4142135623730951 m
0.5 m^2
3.141592653589793 m^2
In the example above, we iterated over the domain to apply the function of interest, but we
could have used Julia’s dot syntax for broadcasting the function over the geometries:
measure.(gset)
4-element Vector{Quantity{Float64}}:
0.0 m
1.4142135623730951 m
0.5 m^2
3.141592653589793 m^2
The list of supported geometries is very comprehensive. It encompasses all geometries from the
simple features standard and more. We will see more examples in the following chapters.
One of the main limitations of GIS software today is the lack of explicit representation of
topology. A GeometrySet does not provide efficient topological relations (Floriani and Hui
2007), yet advanced geospatial data science requires the definition of geospatial domains where
geometries are aware of their neighbors. Let’s illustrate this concept with the CartesianGrid
domain:
grid = CartesianGrid(10, 10)
10×10 CartesianGrid
├─ minimum: Point(x: 0.0 m, y: 0.0 m)
├─ maximum: Point(x: 10.0 m, y: 10.0 m)
└─ spacing: (1.0 m, 1.0 m)
grid[1]
Quadrangle
├─ Point(x: 0.0 m, y: 0.0 m)
├─ Point(x: 1.0 m, y: 0.0 m)
├─ Point(x: 1.0 m, y: 1.0 m)
└─ Point(x: 0.0 m, y: 1.0 m)
And even though we can manipulate this domain as if it were a “soup of geometries”, the major
advantage of this abstraction is the underlying topology:
topo = topology(grid)
This data structure can be used by advanced users who wish to design algorithms with
neighborhood information. We will cover this topic in a separate chapter. For now, keep in mind
that working with the entire domain as opposed to with a vector or “soup of geometries” has
major benefits.
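To give a rough idea of what neighborhood information means on a grid, the sketch below uses only Base Julia and does not rely on the topology object of the framework. The function `neighbors` is a hypothetical helper. It assumes that elements are numbered along the first axis first, which is consistent with the grid outputs shown in this chapter.

```julia
# Hypothetical helper: linear indices of the elements that share an edge
# with element `i` in an nx × ny grid (elements numbered along x first)
function neighbors(i, nx, ny)
  cinds = CartesianIndices((nx, ny))
  linds = LinearIndices(cinds)
  c = cinds[i]
  steps = (
    CartesianIndex(-1, 0), CartesianIndex(1, 0),
    CartesianIndex(0, -1), CartesianIndex(0, 1)
  )
  # keep only the steps that stay inside the grid
  [linds[c + s] for s in steps if (c + s) in cinds]
end

corner = neighbors(1, 10, 10)     # element at the corner of the grid
interior = neighbors(12, 10, 10)  # element in the interior of the grid
```

With a plain vector of geometries, the same question would require comparing every pair of geometries, whereas the grid structure answers it with index arithmetic.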
Note
The CartesianGrid domain is lazy, meaning it only stores the start and end points of the grid together
with the spacing between the elements. Therefore, we can easily create large 3D grids of Hexahedron
geometries without consuming all available memory:
grid = CartesianGrid(10000, 10000, 10000)
10000×10000×10000 CartesianGrid
├─ minimum: Point(x: 0.0 m, y: 0.0 m, z: 0.0 m)
├─ maximum: Point(x: 10000.0 m, y: 10000.0 m, z: 10000.0 m)
└─ spacing: (1.0 m, 1.0 m, 1.0 m)
grid[1]
Hexahedron
├─ Point(x: 0.0 m, y: 0.0 m, z: 0.0 m)
├─ Point(x: 1.0 m, y: 0.0 m, z: 0.0 m)
├─ Point(x: 1.0 m, y: 1.0 m, z: 0.0 m)
├─ Point(x: 0.0 m, y: 1.0 m, z: 0.0 m)
├─ Point(x: 0.0 m, y: 0.0 m, z: 1.0 m)
├─ Point(x: 1.0 m, y: 0.0 m, z: 1.0 m)
├─ Point(x: 1.0 m, y: 1.0 m, z: 1.0 m)
└─ Point(x: 0.0 m, y: 1.0 m, z: 1.0 m)
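The idea behind a lazy grid can be sketched in a few lines of Base Julia. The `LazyGrid` type below is a hypothetical simplification written for illustration and is not the implementation used by the framework. It stores only the origin, the spacing and the number of cells, and computes the corners of a cell when the cell is requested.

```julia
# Hypothetical lazy grid: stores only origin, spacing and number of cells
struct LazyGrid
  origin::NTuple{3,Float64}
  spacing::NTuple{3,Float64}
  dims::NTuple{3,Int}
end

# Number of elements, computed without allocating them
Base.length(g::LazyGrid) = prod(g.dims)

# Compute the two extreme corners of the i-th cell only when requested
function Base.getindex(g::LazyGrid, i::Int)
  ijk = Tuple(CartesianIndices(g.dims)[i])
  lo = g.origin .+ (ijk .- 1) .* g.spacing
  hi = lo .+ g.spacing
  (min=lo, max=hi)
end

big = LazyGrid((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (10000, 10000, 10000))

length(big)  # 10^12 elements, none of them stored
big[1]       # corners of the first cell, built on demand
```

The memory footprint of `big` is a handful of numbers regardless of how many cells the grid has.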
4 SimpleMesh
6 vertices
├─ Point(x: 0.0 m, y: 0.0 m)
├─ Point(x: 1.0 m, y: 0.0 m)
├─ Point(x: 0.0 m, y: 1.0 m)
├─ Point(x: 1.0 m, y: 1.0 m)
├─ Point(x: 0.25 m, y: 0.5 m)
└─ Point(x: 0.75 m, y: 0.5 m)
4 elements
├─ Quadrangle(1, 2, 6, 5)
├─ Triangle(2, 4, 6)
├─ Quadrangle(4, 3, 5, 6)
└─ Triangle(3, 1, 5)
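The code that produced the mesh above is not shown. A plausible sketch, assuming the vertex coordinates and connectivities printed in the output and the `SimpleMesh(points, connectivities)` constructor from Meshes.jl, is the following:

```julia
using GeoStats

# Vertices, in the order listed in the output above
points = [
  Point(0.0, 0.0), Point(1.0, 0.0), Point(0.0, 1.0),
  Point(1.0, 1.0), Point(0.25, 0.5), Point(0.75, 0.5)
]

# Each tuple lists the vertex indices of one element
connec = connect.([(1, 2, 6, 5), (2, 4, 6), (4, 3, 5, 6), (3, 1, 5)])

# Unstructured mesh made of two quadrangles and two triangles
mesh = SimpleMesh(points, connec)
```

The elements only store indices into the shared list of vertices, so neighboring elements refer to the same points instead of duplicating coordinates.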
The connect function takes a tuple of indices and a geometry type, and produces a connectivity
object. The geometry type can be omitted, in which case it is assumed to be an Ngon, i.e., a
polygon with N sides:
c = connect((1, 2, 3))
Triangle(1, 2, 3)
This connectivity object can be materialized into an actual geometry with a vector of points:
materialize(c, [Point(0, 0), Point(1, 0), Point(1, 1)])
Triangle
├─ Point(x: 0.0 m, y: 0.0 m)
├─ Point(x: 1.0 m, y: 0.0 m)
└─ Point(x: 1.0 m, y: 1.0 m)
The SimpleMesh uses the materialize function above to construct geometries on the fly,
similar to what we have seen with the CartesianGrid:
mesh[1]
Quadrangle
├─ Point(x: 0.0 m, y: 0.0 m)
├─ Point(x: 1.0 m, y: 0.0 m)
├─ Point(x: 0.75 m, y: 0.5 m)
└─ Point(x: 0.25 m, y: 0.5 m)
Don’t worry if you feel overwhelmed by these concepts. We are only sharing them here to give
you an idea of how complex 3D domains are represented in the framework. You can do
geospatial data science without ever having to operate with these concepts explicitly.
The last missing piece of the puzzle is the combination of tables with domains into geospatial
data, which we discuss next.
1.2.3 Data
Wouldn’t it be nice if we had a representation of geospatial data that behaved like a table as
discussed in the Tables section, but preserved topological information as discussed in the
Domains section? In the GeoStats.jl framework, this is precisely what we get with the georef
function:
using GeoStats
df = DataFrame(
  NAME=["John", "Mary", "Paul", "Anne"],
  AGE=[34.0, 12.0, 23.0, 39.0]u"yr",
  HEIGHT=[1.78, 1.56, 1.70, 1.80]u"m",
  GENDER=["male", "female", "male", "female"]
)
grid = CartesianGrid(2, 2)
geotable = georef(df, grid)
John 34.0 yr 1.78 m male Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m))
Mary 12.0 yr 1.56 m female Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m))
Paul 23.0 yr 1.7 m male Quadrangle((x: 0.0 m, y: 1.0 m), ..., (x: 0.0 m, y: 2.0 m))
Anne 39.0 yr 1.8 m female Quadrangle((x: 1.0 m, y: 1.0 m), ..., (x: 1.0 m, y: 2.0 m))
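The units in the table above come from the Unitful.jl module, which also handles arithmetic
between quantities with different units:
using Unitful: m, ft
1.0m + 2.0ft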
The georef function combines any table with any domain into a geospatial data representation that
adheres to the Tables.jl interface. We call this representation a GeoTable to distinguish it from a
standard table. Besides the original columns, the GeoTable has a special geometry column with
the underlying domain:
names(geotable)
5-element Vector{String}:
"NAME"
"AGE"
"HEIGHT"
"GENDER"
"geometry"
Unlike a standard table, the GeoTable creates geometries on the fly depending on the data
access pattern. For example, we can request the first measurement of the GeoTable and it will
automatically construct the corresponding Quadrangle:
geotable[1,:]
(NAME = "John", AGE = 34.0 yr, HEIGHT = 1.78 m, GENDER = "male", geometry =
Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m)))
geotable[1:3,["NAME","AGE"]]
John 34.0 yr Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m))
Mary 12.0 yr Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m))
Paul 23.0 yr Quadrangle((x: 0.0 m, y: 1.0 m), ..., (x: 0.0 m, y: 2.0 m))
Finally, if we request the entire geometry column, we get back the original domain:
geotable[:,"geometry"]
2×2 CartesianGrid
├─ minimum: Point(x: 0.0 m, y: 0.0 m)
├─ maximum: Point(x: 2.0 m, y: 2.0 m)
└─ spacing: (1.0 m, 1.0 m)
Besides the data access patterns of the DataFrame, the GeoTable also provides an advanced
method for retrieving all rows that intersect with a given geometry:
John 34.0 yr 1.78 m male Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m))
Mary 12.0 yr 1.56 m female Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m))
This method is very useful to narrow the region of interest and quickly discard all measurements
that are outside of it. For instance, it is common to discard all “pixels” outside of a polygon
before exporting the geotable to a file on disk.
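The geometry-based access described above might look like the following sketch. The data, the variable names and the choice of a segment along the bottom edge of the grid are illustrative assumptions. The sketch also assumes that a GeoTable can be indexed with a geometry in the row position.

```julia
using GeoStats

# Small geotable with one illustrative column over a 2×2 grid
gtb = georef((Z=[1, 2, 3, 4],), CartesianGrid(2, 2))

# Hypothetical region of interest: a segment along the bottom edge of the grid
roi = Segment((0, 0), (2, 0))

# Keep all columns of the rows whose geometries intersect the region
sub = gtb[roi, :]
```

Only the two quadrangles that touch the bottom edge are expected to remain in `sub`. The other two start at `y = 1` and do not intersect the segment.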
Notice that the GeoTable representation is general enough to accommodate both “raster data”
and “vector data” in traditional GIS. We can create very large rasters because the
CartesianGrid is lazy:
georef(
  (
    R=rand(1000000),
    G=rand(1000000),
    B=rand(1000000)
  ),
  CartesianGrid(1000, 1000)
)
R G B geometry
0.95311 0.540418 0.766193 Quadrangle((x: 0.0 m, y: 0.0 m), ..., (x: 0.0 m, y: 1.0 m))
0.207471 0.626717 0.537383 Quadrangle((x: 1.0 m, y: 0.0 m), ..., (x: 1.0 m, y: 1.0 m))
0.80495 0.372314 0.667564 Quadrangle((x: 2.0 m, y: 0.0 m), ..., (x: 2.0 m, y: 1.0 m))
0.211939 0.279282 0.278076 Quadrangle((x: 3.0 m, y: 0.0 m), ..., (x: 3.0 m, y: 1.0 m))
0.786862 0.348092 0.970068 Quadrangle((x: 4.0 m, y: 0.0 m), ..., (x: 4.0 m, y: 1.0 m))
0.88911 0.118394 0.805253 Quadrangle((x: 5.0 m, y: 0.0 m), ..., (x: 5.0 m, y: 1.0 m))
0.521314 0.475458 0.463105 Quadrangle((x: 6.0 m, y: 0.0 m), ..., (x: 6.0 m, y: 1.0 m))
0.595226 0.108076 0.944947 Quadrangle((x: 7.0 m, y: 0.0 m), ..., (x: 7.0 m, y: 1.0 m))
0.639352 0.895026 0.585009 Quadrangle((x: 8.0 m, y: 0.0 m), ..., (x: 8.0 m, y: 1.0 m))
0.930454 0.786526 0.440901 Quadrangle((x: 9.0 m, y: 0.0 m), ..., (x: 9.0 m, y: 1.0 m))
⋮ ⋮ ⋮ ⋮
We can also load vector geometries from files that store simple features using the GeoIO.jl module:
using GeoIO
GeoIO.load("data/countries.geojson")
⋮ ⋮ ⋮
Note
The “data” folder is stored on GitHub. Check the Preface for download instructions.
We will see more examples of “vector data” in the chapter Interfacing with GIS, and will explain
why file formats like Shapefile.jl and GeoJSON.jl are not enough for advanced geospatial data
science.
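The table of attributes and the underlying domain of a GeoTable can be retrieved separately with
the values and domain functions:
using GeoStats
values(geotable)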
4×4 DataFrame
domain(geotable)
2×2 CartesianGrid
├─ minimum: Point(x: 0.0 m, y: 0.0 m)
├─ maximum: Point(x: 2.0 m, y: 2.0 m)
└─ spacing: (1.0 m, 1.0 m)
1.3 Remarks