PGIS Unit 4
Chapter 06
Spatial data analysis
There are many ways to classify the analytical functions of a GIS. This chapter makes the
following distinctions:
• Retrieval functions allow the selective search of data. We might thus retrieve all
agricultural fields where potato is grown.
• Overlay functions belong to the most frequently used functions in a GIS
application. They allow the combination of two (or more) spatial data layers,
comparing them position by position and treating areas of overlap—and of non-
overlap—in distinct ways. Many GISs support overlays through an algebraic language,
expressing an overlay function as a formula in which the data layers are the arguments.
In this way, we can find:
• The potato fields on clay soils (select the ‘potato’ cover in the crop data layer and the
‘clay’ cover in the soil data layer and perform an intersection of the two areas found),
• The fields where potato or maize is the crop (select both areas of potato and ‘maize’
cover in the crop data layer and take their union),
• The potato fields not on clay soils (take the difference of the areas with
‘potato’ cover and the areas having clay soil),
• The fields that do not have potato as crop (take the complement of the potato areas).
• Search functions allow the retrieval of features that fall within a given search window.
This window may be a rectangle, circle, or polygon.
• Buffer zone generation (or buffering) is one of the best known neighbourhood
functions. It determines a spatial envelope (buffer) around (a) given feature(s). The
created buffer may have a fixed width, or a variable width that depends on
characteristics of the area.
• Interpolation functions predict unknown values using the known values at nearby
locations. This typically occurs for continuous fields, like elevation, when the data
stored does not provide the direct answer for the location(s) of interest.
• Visibility functions also fit in this list as they are used to compute the points
visible from a given location (view shed modelling or view shed mapping) using
a digital terrain model.
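The overlay examples listed above (intersection, union, difference, complement) follow standard set logic. A minimal sketch in Python, using sets of field identifiers to stand in for polygon geometry (the field IDs are hypothetical; a real GIS computes geometric intersections):

```python
# Field IDs stand in for polygons; the set logic mirrors the overlay operators.
potato = {"f1", "f2", "f5"}            # fields with 'potato' cover
maize = {"f3", "f5"}                   # fields with 'maize' cover
clay = {"f2", "f4", "f5"}              # fields on clay soils
all_fields = {"f1", "f2", "f3", "f4", "f5"}

potato_on_clay = potato & clay         # intersection
potato_or_maize = potato | maize       # union
potato_not_clay = potato - clay        # difference
not_potato = all_fields - potato       # complement
print(sorted(potato_on_clay))          # ['f2', 'f5']
```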
Generalization, also called map dissolve, is the process of making a classification less
detailed by combining classes. Generalization is often used to reduce the level of
classification detail to make an underlying pattern more apparent.
6.2.1 Measurement
Geometric measurement on spatial features includes counting, distance and area size
computations. In general, measurements on vector data are more advanced, and thus also
more complex, than those on raster data.
The primitives of vector data sets are point, (poly)line and polygon. Related geometric
measurements are location, length, distance and area size. Some of these are
geometric properties of a feature in isolation (location, length, area size); others
(distance) require two features to be identified. The location property of a vector
feature is always stored by the GIS: a single coordinate pair for a point, or a list of pairs
for a polyline or polygon boundary. Occasionally, there is a need to obtain the location
of the centroid of a polygon; some GISs store these also, others compute them ‘on-
the-fly’.
Length is a geometric property associated with polylines, by themselves, or in their
function as polygon boundary. It can obviously be computed by the GIS—as the sum
of lengths of the constituent line segments—but it quite often is also stored with the
polyline.
Area size is associated with polygon features. Again, it can be computed, but usually
is stored with the polygon as an extra attribute value. This speeds up the computation
of other functions that require area size values.
Another geometric measurement used by the GIS is the minimal bounding box
computation. It applies to polylines and polygons, and determines the minimal
rectangle—with sides parallel to the axes of the spatial reference system—that covers
the feature. This is illustrated in Figure.
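A minimal sketch of these vector measurements, assuming coordinates are given as (x, y) pairs in a projected reference system (the boundary coordinates below are invented):

```python
import math

def polyline_length(coords):
    # Length: sum of the lengths of the constituent line segments.
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(coords, coords[1:]))

def min_bounding_box(coords):
    # Minimal rectangle, with sides parallel to the axes, that covers the feature.
    xs, ys = zip(*coords)
    return (min(xs), min(ys)), (max(xs), max(ys))

boundary = [(2, 1), (6, 3), (4, 7), (1, 5)]
print(min_bounding_box(boundary))   # ((1, 1), (6, 7))
```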
A common use of area size measurements is when one wants to sum up the area sizes
of all polygons belonging to some class. This class could be crop type: What is the size
of the area covered by potatoes? If our crop classification is in a stored data layer, the
computation would include (a) selecting the potato areas, and (b) summing up their
(stored) area sizes. Clearly, little geometric computation is required in the case of
stored features.
Spatial resolution refers to the dimension of the cell size representing the area covered
on the ground. Therefore, if the area covered by a cell is 5 x 5 meters, the resolution is
5 meters. The higher the resolution of a raster, the smaller the cell size and, thus, the
greater the detail.
Measurements on raster data layers are simpler because of the regularity of the cells.
The area size of a cell is constant, and is determined by the cell resolution. Horizontal
and vertical resolution may differ, but typically do not.
Location of an individual cell derives from the raster’s anchor point, the cell resolution,
and the position of the cell in the raster. Again, there are two conventions: the cell’s
location can be its lower left corner, or the cell’s midpoint. These conventions are set
by the software in use; in the case of low-resolution data it becomes more important to
be aware of them.
The area size of a selected part of the raster (a group of cells) is calculated as the
number of cells multiplied by the cell area size.
The distance between two raster cells is the standard distance function applied to the
locations of their respective mid-points, obviously considering the cell resolution.
Where a raster is used to represent line features as strings of cells through the raster,
the length of a line feature is computed as the sum of distances between consecutive
cells.
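These raster measurements can be sketched as follows, taking the cell midpoint as the location convention; the 5 m resolution, the anchor point, and the selected cells are assumptions for illustration:

```python
import math

resolution = 5.0                    # cell size: 5 m x 5 m
anchor = (0.0, 0.0)                 # lower-left corner of the raster

def cell_midpoint(row, col):
    # Location of a cell derived from the anchor point, resolution and position.
    return (anchor[0] + (col + 0.5) * resolution,
            anchor[1] + (row + 0.5) * resolution)

def cell_distance(c1, c2):
    # Standard distance function applied to the cells' midpoints.
    (x1, y1), (x2, y2) = cell_midpoint(*c1), cell_midpoint(*c2)
    return math.hypot(x2 - x1, y2 - y1)

# Area of a selected group of cells: number of cells times the cell area.
selected = [(0, 0), (0, 1), (1, 1)]
area = len(selected) * resolution ** 2

# Length of a line feature stored as a string of cells.
line_cells = [(0, 0), (0, 1), (1, 2)]
length = sum(cell_distance(a, b) for a, b in zip(line_cells, line_cells[1:]))
print(area, round(length, 3))   # 75.0 12.071
```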
When exploring a spatial data set, the first thing one usually wants is to select certain
features, to (temporarily) restrict the exploration.
When multiple criteria have to be used for selection, we need to carefully express all
of these in a single composite condition. The tools for this come from a field of
mathematical logic, known as propositional calculus. Above, we have seen simple,
atomic conditions such as Area < 400000, and LandUse = 80. Atomic conditions use a
predicate symbol, such as < (less than) or = (equals). Other possibilities are <= (less than
or equal), > (greater than), >= (greater than or equal) and <> (does not equal). Any of
these symbols is combined with an expression on the left and one on the right.
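As a sketch, the atomic conditions above can be combined into one composite condition; the parcel records below are invented for illustration:

```python
parcels = [
    {"id": 1, "Area": 250000, "LandUse": 80},
    {"id": 2, "Area": 500000, "LandUse": 80},
    {"id": 3, "Area": 300000, "LandUse": 30},
]

# Composite condition: Area < 400000 AND LandUse = 80.
selected = [p["id"] for p in parcels if p["Area"] < 400000 and p["LandUse"] == 80]
print(selected)   # [1]
```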
Selecting features that are inside selection objects: This type of query uses the
containment relationship between spatial objects. Obviously, polygons can contain
polygons, lines or points, and lines can contain lines or points, but no other
containment relationships are possible.
Figure illustrates a containment query. Here, we are interested in finding the location
of medical clinics in the area of Ilala District. We first selected all areas of Ilala District,
using the technique of selection by attribute condition District=“Ilala”. Then, these
selected areas were used as selection objects to determine which medical clinics (as
point objects) were within them.
Adjacency is the meet relationship. It expresses that features share boundaries, and
therefore it applies only to line and polygon features. We want to select all parcels
adjacent to an industrial area. The first step is to select that area (in dark green) and
then apply the adjacency function to select all land use areas (in red) that are adjacent
to it.
Selecting features based on their distance: One may also want to use the distance
function of the GIS as a tool in selecting features. Such selections can be searches within
a given distance from the selection objects, at a given distance, or even beyond a
given distance. There is a whole range of applications to this type of selection, e.g.:
• Which clinics are within 2 kilometers of a selected school? (Information needed for
the school emergency plan.)
•Which roads are within 200 meters of a medical clinic? (These roads must have a high
road maintenance priority.)
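A sketch of the first of these selections, with made-up coordinates in metres: which clinics lie within 2 km of a selected school?

```python
import math

school = (500.0, 500.0)
clinics = {"c1": (700.0, 600.0), "c2": (3000.0, 500.0), "c3": (500.0, 2400.0)}

# Select the clinics whose distance to the school is at most 2000 m.
within_2km = sorted(name for name, (x, y) in clinics.items()
                    if math.hypot(x - school[0], y - school[1]) <= 2000.0)
print(within_2km)   # ['c1', 'c3']
```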
Afterthought on selecting features: Any set of selected features can be used as the
input for a subsequent selection procedure. This means, for instance, that we can
select all medical clinics first, then identify the roads within 200 meters, then select
from them only the major roads, then select the nearest clinics to these remaining
roads, as the ones that should receive our financial support. In this way, we are
combining various techniques of selection.
6.2.3 Classification
The input data set may have itself been the result of a classification, and in such a case
we call it a reclassification. For example, we may have a soil map that shows different
soil type units and we would like to show the suitability of units for a specific crop. In
this case, it is better to assign to the soil units an attribute of suitability for the crop. A
second type of output is obtained when adjacent features with the same category are
merged into one bigger feature. Such post-processing functions are called spatial
merging, aggregation or dissolving.
User-controlled classification:
In user-controlled classification, a user selects the attribute(s) that will be used as the
classification parameter(s) and defines the classification method. The latter involves
declaring the number of classes as well as the correspondence between the old
attribute values and the new classes. This is usually done via a classification table.
Automatic classification:
1. Equal interval technique: The minimum and maximum values vmin and vmax of the
classification parameter are determined and the (constant) interval size for each
category is calculated as (vmax−vmin)/n, where n is the number of classes chosen by the
user. This classification is useful in revealing the distribution patterns as it determines
the number of features in each category.
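The equal interval technique can be sketched as follows; the attribute values and the number of classes are arbitrary:

```python
def equal_interval(values, n):
    # Interval size is (vmax - vmin) / n; each value maps to a class 0..n-1.
    vmin, vmax = min(values), max(values)
    size = (vmax - vmin) / n
    return [min(int((v - vmin) / size), n - 1) for v in values]  # vmax -> last class

values = [0, 15, 30, 45, 60, 90, 100]
print(equal_interval(values, 4))   # [0, 0, 1, 1, 2, 3, 3]
```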
Standard overlay operators take two input data layers, and assume they are
georeferenced in the same system, and overlap in study area. If either of these
requirements is not met, the use of an overlay operator is senseless. The principle of
spatial overlay is to compare the characteristics of the same location in both data
layers, and to produce a result for each location in the output data layer. The specific
result to produce is determined by the user. It might involve a calculation, or some
other logical function to be applied to every area or location. In raster data, these
comparisons are carried out between pairs of cells, one from each input raster. In
vector data, the same principle of comparing locations applies, but the underlying
computations rely on determining the spatial intersections of features from each input
layer.
In the vector domain, overlay is computationally more demanding than in the raster
domain. Here we will only discuss overlays from polygon data layers, but we note that
most of the ideas also apply to overlay operations with point or line data layers.
The standard overlay operator for two layers of polygons is the polygon intersection
operator. It is fundamental, as many other overlay operators proposed in the literature
or implemented in systems can be defined in terms of it. The principles are illustrated
in above Figure. The result of this operator is the collection of all possible polygon
intersections; the attribute table result is a join.
A second overlay operator is polygon overwrite. The result of this binary operator is
defined as a polygon layer with the polygons of the first layer, except where polygons
existed in the second layer, as these take priority. The principle is illustrated in Figure.
Most GISs do not force the user to apply overlay operators to the full polygon data
set. One is allowed to first select relevant polygons in the data layer, and then use the
selected set of polygons as an operator argument. The fundamental operator of all
these is polygon intersection. The others can be defined in terms of it, usually in
combination with polygon selection and/or classification. For instance, the polygon
overwrite of A by B can be defined as polygon intersection between A and B, followed
by a classification that prioritizes polygons in B, followed by a merge.
6.3.2 Raster overlay operators: GISs that support raster processing usually have a
language to express operations on rasters. These languages are generally referred to
as map algebra, or sometimes raster calculus. They allow a GIS to compute new rasters
from existing ones, using a range of functions and operators. Unfortunately, not
all implementations of map algebra offer the same functionality. When producing a
new raster we must provide a name for it, and define how it is computed. This is done
in an assignment statement of the following format:
Output_raster_name := Map_algebra_expression
The expression on the right is evaluated by the GIS, and the raster in which it results is
then stored under the name on the left. The expression may contain references to
existing rasters, operators and functions; the format is made clear below. The raster
names and constants that are used in the expression are called its operands.
Arithmetic operators:
Various arithmetic operators are supported. The standard ones are multiplication (×),
division (/), subtraction (−) and addition (+). Obviously, these arithmetic operators
should only be used on appropriate data values, and for instance, not on
classification values. Other arithmetic operators may include modulo division (MOD)
and integer division (DIV). Modulo division returns the remainder of division: for
instance, 10 MOD 3 will return 1, since 10 − 3 × 3 = 1. Similarly, 10 DIV 3 will return 3. More
operators are goniometric: sine (sin), cosine (cos), tangent (tan), and their inverse
functions asin, acos, and atan, which return radian angles as real values.
Map algebra also allows the comparison of rasters cell by cell. To this end, we may
use the standard comparison operators (<, <=, =, >=, > and <>) that we introduced
before. A simple raster comparison assignment is: C:=A <> B.
It will store truth values—either true or false—in the output raster C. A cell value in C
will be true if the cell’s value in A differs from that cell’s value in B. It will be false if they
are the same. Logical connectives are also supported in most implementations of map
algebra.
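The comparison assignment C := A <> B can be sketched cell by cell, with rasters held as nested lists:

```python
A = [[1, 2],
     [3, 4]]
B = [[1, 5],
     [3, 0]]

# True where the cell value in A differs from that cell's value in B.
C = [[a != b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]
print(C)   # [[False, True], [False, True]]
```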
Conditional expressions:
The above comparison and logical operators produce rasters with the truth values true
and false. In practice, we often need a conditional expression that allows us to test
whether a condition is fulfilled, and to produce one output value where it does and
another where it does not.
Conditional expressions are powerful tools in cases where multiple criteria must be
taken into account. A small size example may illustrate this. Consider a suitability
study in which a land use classification and a geological classification must be used.
The respective rasters are illustrated in Figure. Domain expertise dictates that some
combinations of land use and geology result in suitable areas, whereas other
combinations do not. In our example, forests on alluvial terrain and grassland on shale
are considered suitable combinations, while the others are not. We could produce the
output raster of Figure with a map algebra conditional expression.
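A sketch of this suitability overlay, evaluated cell by cell; the category codes and the small rasters are invented for illustration:

```python
FOREST, GRASS = 1, 2           # land use codes (hypothetical)
ALLUVIAL, SHALE = 10, 20       # geology codes (hypothetical)

landuse = [[FOREST, GRASS],
           [GRASS,  FOREST]]
geology = [[ALLUVIAL, SHALE],
           [ALLUVIAL, SHALE]]

def suitable(lu, geo):
    # Forest on alluvial terrain or grassland on shale is considered suitable.
    return (lu == FOREST and geo == ALLUVIAL) or (lu == GRASS and geo == SHALE)

result = [[suitable(lu, geo) for lu, geo in zip(lu_row, geo_row)]
          for lu_row, geo_row in zip(landuse, geology)]
print(result)   # [[True, True], [False, False]]
```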
There is another guiding principle in spatial analysis that can be equally useful. The
principle here is to find out the characteristics of the vicinity, here called
neighbourhood, of a location. After all, many suitability questions, for instance,
depend not only on what is at the location, but also on what is near the location. Thus,
the GIS must allow us ‘to look around locally’. To perform neighbourhood analysis, we
must:
1. State which target locations are of interest to us, and define their spatial extent,
2. Define how to determine the neighbourhood of each target location, and
3. Indicate what it is we want to discover about the phenomena that exist or occur
in the neighbourhood. This might simply be its spatial extent, but it might also be
statistical information.
To select target locations, one can use the selection techniques. To obtain
characteristics from an eventually identified neighbourhood, the same techniques
apply. So what remains to be discussed here is the proper determination of a
neighbourhood.
The principle of buffer zone generation is simple: we select one or more target
locations, and then determine the area around them, within a certain distance. In
Figure (a), a number of main and minor roads were selected as targets, and a 75 m
(resp., 25 m) buffer was computed from them. In some case studies, zonated buffers
must be determined, for instance in assessments of traffic noise effects. Most GISs
support this type of zonated buffer computation. An illustration is provided in Figure
(b). In vector-based buffer generation, the buffers themselves become polygon
features, usually in a separate data layer, that can be used in further spatial analysis.
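In raster-based buffer generation the same idea can be sketched by brute force: a cell belongs to the buffer when its midpoint lies within the buffer distance of a target cell. The grid size, target cell, and distance (here in cell units) are assumptions:

```python
import math

rows, cols = 5, 5
targets = [(2, 2)]          # hypothetical target cell(s)
buffer_dist = 1.5           # buffer width, in cell units

def in_buffer(r, c):
    # A cell is in the buffer if it is within buffer_dist of any target cell.
    return any(math.hypot(r - tr, c - tc) <= buffer_dist for tr, tc in targets)

buffer_raster = [[in_buffer(r, c) for c in range(cols)] for r in range(rows)]
print(sum(map(sum, buffer_raster)))   # 9 cells fall inside the buffer
```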
The determination of neighbourhood of one or more target locations may depend not
only on distance—but also on direction and differences in the terrain in different
directions. This typically is the case when the target location contains a ‘source
material’ that spreads over time, referred to as diffusion. This ‘source material’ may be
air, water or soil pollution, commuters exiting a train station,
people from an opened-up refugee camp, a water spring uphill, or the radio waves
emitted from a radio relay station. In all these cases, one will not expect the spread to
occur evenly in all directions.
Diffusion computation involves one or more target locations, which are better called
source locations in this context. They are the locations of the source of whatever
spreads. The computation also involves a local resistance raster, which for each cell
provides a value that indicates how difficult it is for the ‘source material’ to pass by
that cell.
Since ‘source material’ has the habit of taking the easiest route to spread, we must
determine at what minimal cost (i.e. at what minimal resistance) it may have
arrived in a cell. Therefore, we are interested in the minimal cost path. To determine
the minimal total resistance along a path from the source location csrc to an arbitrary
cell cx, the GIS determines all possible paths from csrc to cx, and then determines which
one has the lowest total resistance.
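This minimal-cost computation can be sketched with Dijkstra's algorithm on a 4-connected grid; taking the cost of a step as the mean resistance of the two cells involved is a simplifying assumption, and the resistance raster is invented:

```python
import heapq

resistance = [[1, 1, 5],
              [5, 1, 5],
              [5, 1, 1]]
rows, cols = 3, 3
source = (0, 0)

INF = float("inf")
cost = [[INF] * cols for _ in range(rows)]
cost[source[0]][source[1]] = 0.0
heap = [(0.0, source)]
while heap:
    c, (r, k) = heapq.heappop(heap)
    if c > cost[r][k]:
        continue                      # stale heap entry
    for dr, dk in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nk = r + dr, k + dk
        if 0 <= nr < rows and 0 <= nk < cols:
            # Step cost: mean resistance of the two cells involved (assumption).
            step = (resistance[r][k] + resistance[nr][nk]) / 2
            if c + step < cost[nr][nk]:
                cost[nr][nk] = c + step
                heapq.heappush(heap, (cost[nr][nk], (nr, nk)))

print(cost[2][2])   # 4.0: the cheapest spread follows the low-resistance column
```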
Continuous fields have a number of characteristics not shared by discrete fields. Since
the field changes continuously, we can talk about slope angle, slope aspect and
concavity/convexity of the slope. These notions are not applicable to discrete fields.
Applications:
•Slope aspect calculation: The calculation of the aspect (or orientation) of the slope
in degrees (between 0 and 360 degrees), for any or all locations.
•Hillshading is used to portray relief difference and terrain morphology in hilly and
mountainous areas.
•Dynamic modeling: Apart from the applications mentioned above, DEMs are
increasingly used in GIS-based dynamic modelling, such as the computation of surface
run-off and erosion.
•Visibility analysis: A view shed is the area that can be ‘seen’—i.e. is in the direct line-
of-sight—from a specified target location. Visibility analysis determines the area visible
from a scenic lookout.
Computation of slope angle and slope aspect: A different choice of weight factors
may provide other information. Special filters exist to perform computations on the
slope of the terrain. Slope angle, which is also known as slope gradient, is the angle α,
illustrated in Figure, between a path p in the horizontal plane and the sloping terrain.
The path p must be chosen such that the angle α is maximal. A slope angle can be
expressed as elevation gain in a percentage or as a geometric angle, in degrees or
radians.
The path p must be chosen to provide the highest slope angle value, and thus it can
lie in any direction. The compass direction, converted to an angle with the North, of
this maximal down-slope path p is what we call the slope aspect.
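Both computations can be sketched for one interior DEM cell using simple finite differences over the four edge neighbours; this is one of several published methods, row 0 is taken as north, and the DEM values are invented:

```python
import math

def slope_and_aspect(dem, r, c, cellsize):
    # Elevation gradients from the east-west and north-south neighbours.
    dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cellsize)   # + towards east
    dz_dy = (dem[r - 1][c] - dem[r + 1][c]) / (2 * cellsize)   # + towards north
    slope = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    # Aspect: compass direction (clockwise from north) of the steepest descent.
    aspect = math.degrees(math.atan2(-dz_dx, -dz_dy)) % 360
    return slope, aspect

dem = [[10, 9, 8],        # terrain dropping 1 m per cell towards the east
       [10, 9, 8],
       [10, 9, 8]]
slope, aspect = slope_and_aspect(dem, 1, 1, cellsize=1.0)
print(round(slope, 6), round(aspect, 6))
```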
Various classical spatial analysis functions on networks are supported by GIS software
packages. The most important ones are:
Optimal path finding: Optimal path finding techniques are used when a least-cost
path between two nodes in a network must be found. The two nodes are called origin
and destination, respectively. The aim is to find a sequence of connected lines to
traverse from the origin to the destination at the lowest possible cost. The cost function
can be simple: for instance, it can be defined as the total length of all lines on the path.
The cost function can also be more elaborate and take into account not only length of
the lines, but also their capacity, maximum transmission (travel) rate and other line
characteristics, for instance to obtain a reasonable approximation of travel time. There
can even be cases in which the nodes visited add to the cost of the path as well. These
may be called turning costs, which are defined in a separate turning cost table for each
node, indicating the cost of turning at the node when entering from one line and
continuing on another. This is illustrated in Figure.
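Least-cost path finding can be sketched on a small network, with the cost function taken simply as total line length; the graph and its line lengths are made up, and turning costs are left out:

```python
import heapq

graph = {  # node -> list of (neighbour, line length)
    "A": [("B", 4), ("C", 2)],
    "B": [("A", 4), ("D", 5)],
    "C": [("A", 2), ("B", 1), ("D", 8)],
    "D": [("B", 5), ("C", 8)],
}

def optimal_path(origin, dest):
    # Dijkstra's algorithm, keeping the path alongside the accumulated cost.
    heap = [(0, origin, [origin])]
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dest:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, length in graph[node]:
            if nbr not in seen:
                heapq.heappush(heap, (cost + length, nbr, path + [nbr]))
    return None

print(optimal_path("A", "D"))   # (8, ['A', 'C', 'B', 'D'])
```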
Problems related to optimal path finding are ordered optimal path finding and
unordered optimal path finding. Both have an extra requirement that a number of
additional nodes needs to be visited along the path. In ordered optimal path finding,
the sequence in which these extra nodes are visited matters; in unordered optimal
path finding it does not. An illustration of both types is provided in Figure. Here, a path
is found from node A to node D, visiting nodes B and C.
Network partitioning:
In network partitioning, the purpose is to assign lines and/or nodes of the network,
in a mutually exclusive way, to a number of target locations. Typically, the target
locations play the role of service center for the network. This may be any type of
service: medical treatment, education, water supply. This type of network partitioning
is known as a network allocation problem.
Network allocation:
Network allocation may take into account factors such as:
• The capacity with which a centre can produce the resources (whether they are
medical operations, school pupil positions, kilowatts, or bottles of milk), and
• The consumption of the resources, which may vary amongst lines or line segments.
Trace analysis: Trace analysis is performed when we want to understand which part
of a network is ‘conditionally connected’ to a chosen node on the network, known as
the trace origin. For a node or line to be conditionally connected, it means that a
path exists from the node/line to the trace origin, and that the connecting path
fulfills the conditions set. What these conditions are depends on the application, and
they may involve direction of the path, capacity,
length, or resource consumption along it. The condition typically is a logical
expression, as we have seen before, for instance:
• The path must be directed from the node/line to the trace origin,
• Its capacity (defined as the minimum capacity of the lines that constitute the path)
must be above a given threshold.
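Trace analysis with one such condition can be sketched as a graph traversal: starting from the trace origin, follow only lines whose capacity is above a threshold. The network data below are invented:

```python
from collections import deque

lines = {  # node -> list of (neighbour, line capacity)
    "origin": [("n1", 10), ("n2", 3)],
    "n1": [("n3", 8)],
    "n2": [("n4", 9)],
    "n3": [],
    "n4": [],
}

def trace(start, min_capacity):
    # Breadth-first search that only crosses lines meeting the condition.
    reached, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nbr, cap in lines.get(node, []):
            if cap >= min_capacity and nbr not in reached:
                reached.add(nbr)
                queue.append(nbr)
    return reached

print(sorted(trace("origin", 5)))   # ['n1', 'n3', 'origin']
```

Note that n4 is unreachable even though its own line has enough capacity: the connecting path to the origin runs through the low-capacity line to n2, so it is not conditionally connected.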
Here we define application models to include any kind of GIS based model (including
so-called analytical and process models) for a specific real-world application. Such a
model, in one way or another, describes as faithfully as possible how the relevant
geographic phenomena behave, and it does so in terms of the parameters. The nature
of application models varies enormously. GIS applications for famine relief programs,
for instance, are very different from earthquake risk assessment applications, though
both can make use of GIS to derive a solution. Many kinds of application models exist,
and they can be classified in many different ways. Here we identify five characteristics
of GIS-based application models:
4. Its dimensionality, i.e. whether the model includes spatial, temporal or spatial and
temporal dimensions, and
5. Its implementation logic, i.e. the extent to which the model uses existing knowledge
about the implementation context.
Rule-based models attempt to model processes by using local (spatial) rules. Cellular
Automata (CA) are examples of models in this category. These are often used to
understand systems which are generally not well understood, but for which their local
processes are well known.
Agent-based models (ABM) attempt to model movement and development of
multiple interacting agents (which might represent individuals), often using sets of
decision-rules about what the agent can and cannot do.
Scale refers to whether the components of the model are individual or aggregate in
nature. Essentially this refers to the ‘level’ at which the model operates. Individual-
based models are based on individual entities, such as the agent-based models.
Implementation logic refers to how the model uses existing theory or knowledge to
create new knowledge. Deductive approaches use knowledge of the overall situation
in order to predict outcome conditions. This includes models that have some kind of
formalized set of criteria, often with known weightings for the inputs, and existing
algorithms are used to derive outcomes. Inductive approaches, on the other hand, are
less straightforward, in that they try to generalize in order to derive more general
models.
Consider another example. A land use planning agency is faced with the problem of
identifying areas of agricultural land that are highly susceptible to erosion. Such areas
occur on steep slopes in areas of high rainfall. The spatial data used in a GIS to obtain
this information might include:
•A land use map produced five years previously from 1 : 25,000 scale aerial
photographs,
Various perspectives, motives and approaches to dealing with uncertainty have given
rise to a wide range of conceptual models and indices for the description and
measurement of error in spatial data. All these approaches have their origins in
academic research and have strong theoretical bases in mathematics and statistics.
Here we identify two main approaches for assessing the nature and amount of error
propagation:
1. Testing the accuracy of each state by measurement against the real world, and
Questions
1. What are the Classification, retrieval, and measurement functions of GIS?
2. Write a note on Overlay functions.
3. Explain the Neighbourhood and connectivity functions in GIS.
4. What do you mean by Measurement? How will it be done for Vector and Raster
data?
5. What do you mean by Spatial selection using topological relationships?
6. Explain Classification. What is User Controlled Classification?
7. What is Automatic Classification? What are its different techniques?
8. Differentiate between Vector and Raster Overlays.
9. Write a note on Neighbourhood function.
10. Explain Computation of diffusion and Flow computation.
11. Explain Raster based surface analysis with suitable example.
12. How will you compute slope angle and slope aspect?
13. Explain Network analysis.
14. What are GIS application models? Explain with an example.
15. How errors get propagated in Spatial Data Analysis?
9. “Select all the land use areas of which the size is less than 400,000” is an
example of ______________________.
A. Spatial selection by attribute conditions
B. Combining attribute conditions
C. Spatial selection using topological relationships
D. None of these
ANSWER: A