Learner Guide
NQF Level 5
Credits 8
Management of Data in Geospatial Databases
Learner Information
Organization:
Unit/Dept:
Facilitator Name:
Date Started:
Date of Completion:
Copyright
All rights reserved. The copyright of this learner guide and any annexures thereto is protected and expressly reserved. No part of this document may be copied, reproduced, stored in a retrieval system, or transmitted in any form or by any means, including photocopying and recording, without prior permission.
Icons
Books – This icon means that other books are available for further information on a particular topic/subject.
Activities – This icon helps you to prepare for the learning to follow or assists you to demonstrate understanding of module content. It shows transference of knowledge and skill.
Workplace Activities – An important aspect of learning is through workplace experience. Activities with this icon can only be completed once a learner is in the workplace.
Tips – This icon indicates practical tips you can adopt in the future.
Notes – This icon represents important notes you must remember as part of the learning process.
Many people recognize that each person prefers different learning styles and techniques.
Learning styles group common ways that people learn. Everyone has a mix of learning styles.
Some people may find that they have a dominant style of learning, with far less use of the
other styles. Others may find that they use distinctive styles in different circumstances. There is
no right mix. Nor are your styles fixed. You can develop ability in less dominant styles, as well
as further develop styles that you already use well.
By recognizing and understanding your own learning styles, you can use techniques better
suited to you. This improves the speed and quality of your learning.
1. Visual (spatial): You prefer using pictures, images, and spatial understanding.
2. Aural (auditory-musical): You prefer using sound and music.
3. Verbal (linguistic): You prefer using words, both in speech and writing.
4. Physical (kinesthetic): You prefer using your body, hands and sense of touch.
5. Logical (mathematical): You prefer using logic, reasoning and systems.
6. Social (interpersonal): You prefer to learn in groups or with other people.
7. Solitary (intrapersonal): You prefer to work alone and use self-study.
Purpose
The purpose of this Learner Guide is to provide learners with the necessary knowledge related to data management in geospatial databases. Data science encompasses principles from various fields such as statistics, machine learning, databases, data warehousing, distributed systems, and data visualization. Data scientists must understand the fundamental principles of data processing and management, and how to use various tools for data cleaning, handling, and storage. After completing this module, learners will be able to understand and demonstrate practical knowledge of the systems and techniques used for data ingestion, processing, and storage. The module covers the fundamentals of data handling in order to analyze and visualize data for location analytics.
Table of Contents
Course Introduction
Course Goals
Chapter 1: Introduction to Data Management
   Topics to cover
   Learning Objectives
   Assessment
   1.1. Understanding the history of data management
   1.2. Areas of Data Management
      1.2.2. Data cleaning
      1.2.3. Data storage
      1.2.4. Data integration
      1.2.5. Data security
      1.2.6. Data governance
      1.2.7. Data analysis
   FA. Class Exercise 1.1: Data Management Fundamentals (10 min)
   FA. Activity 1.1: Integrating Data in ArcGIS Pro (35 min)
Chapter 2: Introduction to Spatial Data Science
   Topics to cover
   Learning Objectives
   Assessment
   2.1. What is Data Science
   2.2. What is Spatial Data Science
   2.3. Industry usage of Spatial Data Science
      2.3.1. Spatial Planning
      2.3.2. Agriculture
      2.3.3. Disaster Management
      2.3.4. Healthcare
      2.3.5. Telecommunication
      2.3.6. Transportation
      2.3.7. Natural resource management
   2.4. Spatial Data Science Workflow
      2.4.1. Data Cleaning
   FA. Activity 2.1: Editing Data in ArcGIS Pro (20 min)
   FA. Activity 2.2: Introduction to ArcGIS Notebooks (35 min)
      2.4.2. Data Exploring and Discovery
         2.4.2.1. Spatial Visualization
   FA. Activity 2.3: Data Exploration and Visualization (45 min)
      2.4.3. Data Processing
      2.4.4. Modelling
   FA. Class Exercise 2.4: Cartographic model
   2.5. Extract, Transform, Load (ETL)
      2.5.1. The ETL Workflow
Chapter 3: Understanding Databases
   Topics to cover
   Learning Objectives
   Assessments
   3.1. What is a DBMS?
      i. Minimize redundancy
      ii. Maintain consistency
      iii. Guarantee security
      iv. Provide frequent back-ups and Recovery
      v. Manage concurrency
Course Introduction
Data Management includes principles that cut across different fields such as statistics, databases, data warehousing, distributed systems, and data visualization. This course will help you develop the skills to organize, manage, analyze and enhance data, and to use various tools for data cleaning, handling, and storage. You will be able to understand and demonstrate practical knowledge of systems and techniques used for data ingestion, processing, and storage. The module covers the fundamentals of data handling in order to analyze and visualize data for location analytics. This course builds on the basic understanding obtained in the Introduction to Data Acquisition for Spatial Intelligence course. The skills you will acquire in this course include:
Data Management
Introduction to Data Science
Spatial Data Science
Data Manipulation Methods
Structured vs Unstructured Data
Types of databases
Relational Data Model
Data Model (Schema)
Understanding Schema Database
Course Goals
At the end of this course, you will be able to:
Chapter 1: Introduction to Data Management
The student will learn the basic concepts of Data Management, including the history of data management and its workflows.
Topics to cover
Data management
Evolution of Data management
Data management Areas or subcomponents
Learning Objectives
After completing this section, you will be able to:
Assessment
FA. Class Exercise 1.1: Data Management Fundamentals
FA. Activity 1.1: Integrating Data in ArcGIS Pro
1.1. Understanding the history of data management
The volume of spatial data has increased rapidly over the years due to the evolution of sensing devices and telecommunication technology. Additionally, the evolution of geographic information systems has made it possible for everyday users to routinely utilise location-based datasets. Nowadays, a significant portion of data has a spatial aspect. Thus, the management of the location and geometric characteristics of entities has become a crucial element for every organization.
In the 1950s and 1960s, electronic computers began to replace manual systems for data
processing. This led to the development of databases and file systems, which allowed
for the efficient storage and retrieval of large amounts of data. The first commercially
available database management system (DBMS) was the Integrated Data Store (IDS),
released by General Electric in 1965.
In the 1970s, the relational database model was developed, which allowed for more
flexible and powerful querying of data. The Structured Query Language (SQL) was also
developed during this time and became the standard language for accessing and
managing data in relational databases.
In the 1980s and 1990s, the use of personal computers and the growth of the internet
led to a massive increase in the amount of data being generated and stored. This led
to the development of new data management technologies, such as data
warehousing, data mining, and online analytical processing (OLAP).
In the 2000s, the rise of big data and cloud computing led to the development of new
data management technologies, such as NoSQL databases, Hadoop, and
MapReduce. These technologies allowed for the efficient processing and analysis of
massive amounts of data.
In the 2010s, data management focused on the increasing amount of data being
generated and the need to store and manage it effectively. The use of big data
technologies, such as Hadoop, emerged as a popular solution for managing large data
sets. Data warehouses and data lakes were also commonly used to store and manage
structured and unstructured data. In addition, the rise of cloud computing made it
easier for organizations to store and manage their data in the cloud.
In addition, the use of artificial intelligence (AI) and machine learning (ML) in data management became more prevalent in the 2020s. AI and ML algorithms were used to
analyze and make sense of large data sets, allowing organizations to gain valuable
insights and improve decision-making. The use of cloud computing also continued to
grow, with many organizations using a hybrid cloud approach to store and manage
their data across multiple clouds and on-premises systems.
Figure: The evolution of data management, from files to relational database management systems (RDBMS) to cloud computing.
Data acquisition involves obtaining data from various sources; this topic was introduced in chapter 5 of the “Introduction to Data Acquisition for Spatial Intelligence” course. Chapter 5 also outlined the steps and processes involved in identifying the goals and requirements for the spatial data to be acquired (i.e. what type of data is required and what the end goal of the acquired data is). Review the various data acquisition methods introduced in the previous module.
Data cleaning will be extensively examined in the upcoming chapter, which will delve
into numerous data cleaning tools and techniques. These tools and methods will be
explored to transform the data into a format suitable for analysis, whether it pertains to
spatial or non-spatial data.
i. Data tampering, where attackers alter or delete spatial data to cause harm or
mislead users or decision-makers.
ii. Data inference, where adversaries use spatial data to infer sensitive information
about individuals or groups.
iii. Data linkage, where intruders combine spatial data with other sources to create
a more detailed profile of the data subjects.
ii. Map Overlay: Combining multiple spatial datasets to create new datasets that
represent intersections, unions, or differences between the original datasets.
iii. Buffer Analysis: Creating proximity zones or buffers around geographic features
to assess their spatial relationships with other features.
iii. Hot Spot Analysis: Identifying statistically significant clusters or spatial patterns of
high or low values in the data.
iii. Point Pattern Analysis: Studying the distribution and clustering of point data, often
used in fields like epidemiology and criminology.
iv. Inverse Distance Weighting (IDW): Interpolating values by assigning more weight
to nearby points and less weight to distant points.
v. Kriging: A geostatistical technique for estimating values at unobserved locations
based on values at nearby observed locations.
vi. Spatial Regression Models: Applying econometric techniques to analyze
spatially dependent economic and social data, such as housing prices or crime
rates.
ii. Service Area Analysis: Identifying areas that can be reached within a specified
travel time or distance from a given location.
iii. Network Flow Analysis: Analysing the flow of resources or goods through a
network.
ii. Change Detection: Identifying changes in land cover or features over time by
comparing satellite or aerial imagery.
iii. NDVI Analysis: Using the Normalized Difference Vegetation Index to assess
vegetation health and density.
iii. Viewshed Analysis: Determining areas visible from specific vantage points in a 3D
environment.
Data Acquisition
Data Positioning
Data Analysis
Data Integration
b) Which of the following is the correct order of the evolution of data
management?
Stones -> Files -> DBMS -> RDBMS -> Cloud Computing -> No SQL
Files -> Files -> RDBMS -> DBMS -> No SQL -> Cloud Computing
No SQL -> Files -> DBMS -> RDBMS -> Cloud Computing -> Stones
Stones -> Cloud Computing -> DBMS -> Files -> DBMS -> No SQL
c) You are a data scientist assigned to clean data collected in the field, which of
the following methods is best suited to clean the dataset displayed below?
e) Data governance ensures that individuals know what the organisation's data is about, where it comes from, and the context of a dataset's purpose.
True
False
Open ArcGIS Pro, then click on “Open Another Project” on the top right corner.
On the Catalog pane navigate to the project Databases and expand by clicking the triangle on the left. This is the Default Geodatabase for the project. Once fully expanded it is visible that the “Erven” dataset is stored here. Right click and import this layer to the current map.
Connect to the “FA Activity 2.2” folder and import the “Jabulani_Image” tile to the
current map. Using your knowledge, provide the image layer with the coordinate
system “GCS_Hartebeesthoek_1994”.
To access the other datasets, connect to the “FA Activity 2.2” folder and import
the “Roads” data to the current map. Using your knowledge, provide the roads
layer with the coordinate system “GCS_Hartebeesthoek_1994”.
To import the Excel spreadsheet, the file first needs to be converted to a table. On the ribbon, click on the Analysis Tab then select Tools.
The Geoprocessing pane will pop up. Search for the “Excel To Table” tool.
Navigate to the “XY Spreadsheet” located in the “FA Activity 2.2” folder. Click “Run” at the bottom of the pane. The result is a Standalone Table that can be accessed from the Contents pane.
On the Contents pane right click the Table and select “Display XY Data” as
displayed below:
On the “Display XY Data” window under the Output Feature Class Navigate to the
“FA Activity 2.2” folder and name the feature class as “XY_Points”.
For the Coordinate System click on the earth icon on the “Display XY Data”
window. Add a new coordinate system and save as displayed below.
On the Catalog select the Portal tab and select the “My Organisation” icon.
Search “Gauteng” and double click “GAUTENG_ECON_DEV_SMME” and add the
layer “GAUTENG_ECON_DEV_SMME_MASTER_SCHEMA”
On the ribbon select the Analysis tab and click on Clip as displayed below.
On the Catalog Pane connect to the “FA Activity 2.2” folder and click the triangle
next to the “Jabulani Region B_FINAL.dwg” file to expand the CAD drawing. Add
the “Polyline” to the Map.
On the ribbon, click on the Analysis Tab then select Tools. Search for the “Copy
Features” tool. This tool will convert the CAD Feature Class to a shapefile.
Under Input features, navigate to “FA Activity 2.2” folder and double click the CAD
On the Geoprocessing pane search for the tool “Define Projection”. Select the
“Jabulani_CAD_Polyline” as your Input Dataset.
Assign the layer with a Coordinate System using the steps shown below:
The Define Projection window should look like the image below. Click “Run”.
Repeat the last 3 steps for all the CAD Feature Datasets.
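Rather than repeating these steps by hand, the projection definitions can also be scripted in Python with ArcPy. The sketch below is illustrative only; the feature class names and the WKID are assumptions, not part of the activity data.

    import arcpy

    # Assumption: GCS Hartebeesthoek 1994 corresponds to WKID 4148
    sr = arcpy.SpatialReference(4148)

    # Hypothetical names for the feature classes extracted from the CAD file
    for fc in ["Jabulani_CAD_Polyline", "Jabulani_CAD_Point", "Jabulani_CAD_Polygon"]:
        arcpy.management.DefineProjection(fc, sr)  # assigns the CRS, does not reproject

Note that Define Projection only records the coordinate system on the dataset; it does not transform any coordinates.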
You have now prepared all the datasets and will now save them to a new
geodatabase.
On the Catalog pane, connect to the “FA Activity 2.2” folder. Right click the folder, select “New”, then select “File Geodatabase”. Rename the new geodatabase to “Jabulani_Ex1”.
To add data to the new geodatabase, navigate to the Contents pane and select
the layer that you would like to Export to the geodatabase. Right click this layer
and select “Data” then click “Export Features”. The “Export Features” window will
pop-up.
Under the Parameters Tab select the Output Location as the Geodatabase that
you have just created. Enter the Output Name as “Erven”.
Under the Environments tab specify the Output Coordinate System as
Show a screenshot of the layers within the geodatabase as well as the layers on the map.
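For reference, the whole preparation workflow in this activity can also be scripted with ArcPy. The sketch below is a minimal outline under stated assumptions: the folder path, spreadsheet name, X/Y column names and layer names are placeholders, and ExportFeatures requires ArcGIS Pro 3.x (older releases use FeatureClassToFeatureClass instead).

    import arcpy

    folder = r"C:\Data\FA Activity 2.2"      # assumed location of the activity data
    gdb = arcpy.management.CreateFileGDB(folder, "Jabulani_Ex1").getOutput(0)
    sr = arcpy.SpatialReference(4148)        # assumption: GCS Hartebeesthoek 1994

    # Excel -> standalone table -> point feature class built from the X/Y columns
    table = arcpy.conversion.ExcelToTable(folder + r"\XY Spreadsheet.xlsx",
                                          gdb + r"\xy_table").getOutput(0)
    arcpy.management.XYTableToPoint(table, gdb + r"\XY_Points", "X", "Y",
                                    coordinate_system=sr)

    # Copy a prepared layer (here the Erven layer in the open map) into the geodatabase
    arcpy.conversion.ExportFeatures("Erven", gdb + r"\Erven")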
The student will be introduced to the basic concepts of Data Science and Spatial Data
Science.
Topics to cover
Data Science and Spatial Data Science
Spatial Data Science workflows
Learning Objectives
After completing this section, you will be able to:
Assessment
FA. Activity 2.1: Editing Data in ArcGIS Pro
FA. Activity 2.2: Introduction to ArcGIS Notebooks
FA. Activity 2.3: Data Exploration and Visualization
Data Science is an interdisciplinary field that involves extracting valuable insights and
knowledge from data. It combines techniques from statistics, computer science,
mathematics, and domain expertise to make data-driven decisions and predictions.
In short, data science is about how to extract data and utilize it to acquire knowledge.
Figure: Data science at the intersection of mathematics, computer science, and domain knowledge.
2.3.2. Agriculture
Spatial Data Science is currently used for precision farming, soil analysis, and crop yield
prediction for efficient agricultural practices. It also helps farmers create more efficient
harvesting practices. Food production has soared, and environmental standards have
improved with the help of geospatial analysis.
2.3.4. Healthcare
Spatial data science supports tracking disease spread, identifying outbreak hotspots, and optimizing healthcare facility locations.
2.3.5. Telecommunication
Telecommunications companies use spatial data science to build and improve
networks and track consumer requests and maintenance schedules. 5G mobile
internet connectivity is being expanded using geospatial data analysis.
2.3.6. Transportation
GIS can help with several transportation problems, such as identifying dangerous
intersections, improving road optimization, and choosing the optimal location for a
new road or rail network.
Wide Data Sources: Spatial data science often incorporates a wide range of
data sources, including remote sensing (satellite imagery, LiDAR), social media
data, sensor data, crowd-sourced data, and more.
Big Data: Spatial data science deals with big data, which may require
advanced technologies like distributed computing and machine learning for
processing and analysis.
Real-time Data: It often involves real-time or near-real-time data acquisition,
especially in applications like traffic monitoring, weather forecasting, and
disaster management.
Unstructured Data: Spatial data science may deal with unstructured data, such
as natural language text from social media or web sources, which requires text
mining and NLP techniques.
Machine Learning: Machine learning algorithms are commonly used for pattern
recognition, classification, and prediction in spatial data science.
Data Fusion: Combining data from multiple sources and formats is a common
practice to derive meaningful insights.
In this chapter, we will primarily concentrate on the four middle phases of the workflow.
2.4.1.4. Deduplication
Identifying and removing duplicate records or entries in the dataset.
Data cleaning doesn't always follow a linear path, and in some cases, it may be a
combination of various methods described above to effectively cleanse the data.
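As a small illustration, deduplication in particular often reduces to a single pandas call; the toy data frame below is invented purely to show the behaviour.

    import pandas as pd

    # Toy records: erf 101 appears twice with identical values
    df = pd.DataFrame({"erf": [101, 101, 102],
                       "status": ["Registered", "Registered", "Pending"]})
    df = df.drop_duplicates()  # the repeated erf 101 row is removed
    print(df)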
You have been tasked with preparing a dataset for analysis by another department,
and one of the common tasks involved is the creation and refinement of building
polygons. This process involves digitizing, which essentially means using clicks on the
map to generate features in the form of points, lines, or polygons. When digitizing, it's
typical to overlay the features on top of an image or a basemap layer, as digitization
involves approximations of locations.
You will notice that some buildings are missing, some need to be modified and some need to be moved.
You will add a new building using “map templates” in ArcGIS Pro
The following pop up box will appear; click on the “buildings feature template” and select the rectangle to capture the guard house.
Use the “Rectangle” capture tool to trace over the guard house
After completing the trace, the guard house will be selected and a
pop up message will appear to show that the capture is complete.
Use the polygon tool to trace over the house, click on “Finish” when
the trace is complete
After completing the trace, the house will be selected and a pop up
message will appear to show that the capture is complete
To move a feature, “select and drag” it with the pointer. On the Edit toolbar select the “move” tool.
The following “pop up” window will appear. You will need to ensure that the feature you wish to move is the only one selected.
Right click on the Erven layer and click on “Unselect” to deselect the
erven layer.
Now that the building layer is the only feature selected, click on the
“move” tool as displayed below and start moving the building.
Drag the building until it is properly located in the correct position.
Once the building has been moved, a pop up message will appear to
confirm.
Deselect the selected building, the move has been completed and a
pop up message will appear.
The last task is to delete a feature that is incorrectly captured. The building
selected below has been incorrectly captured.
Note: Ensure that the building to be deleted is the only building selected.
Upon closer inspection, you've identified that you can establish a common link using the Description column in the Valuation Roll dataset. Your supervisor has instructed you to create SG codes, the unique 21-character codes that identify land parcels, from the information in the Description column. To accomplish this task, you will be using ArcGIS Notebooks as part of your spatial data analysis and manipulation process.
Open the “MyProject.aprx” project. The following window will appear.
NB: This is the total number of Erven in the area; the unit wishes to select the properties in the attached spreadsheet to be evaluated for the acquisition process.
Locate the save button on the top left corner of the screen and then click on
“Save”.
In this step, you will import the necessary Python modules to execute the cells in
the notebook. A Python module is a file that contains Python definitions and
statements. A module can define functions, classes, and variables, and it can
include runnable code. You will use the import statement to import the modules.
Click on the “Run” button. The code will show [*] while it is running and a number, e.g. [1], once it has executed, as displayed below.
Run the second code snippet, “Load data from Excel into Data Frame”. The code uses the pandas “read_excel” function to load the “Valuation Roll.xlsx” dataset into a data frame in the ArcGIS Notebook, along the lines of the sketch below.
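A minimal sketch of what such a cell typically contains, assuming the spreadsheet sits in the same folder as the notebook:

    import pandas as pd

    # Load the valuation roll spreadsheet into a pandas data frame
    df = pd.read_excel("Valuation Roll.xlsx")
    df.head()  # in a notebook, displays the first rows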
The following table will appear after you run the “df” command.
a) How many properties or records are located within the Valuation Roll dataset? ____907____
b) How many columns are located within the Valuation Roll dataset? ____19____
Study the “Description” attribute and you will notice that each row contains sufficient information to build the SG 21 code.
The following code snippet extracts the Erf and Portion values into a single column separated by a comma.
The following code snippet splits the Erf and Portion values into two columns (column 1 Portion and column 2 Erf).
NB: notice that “Column1” contains a mixture of “Erf” and “Portion” values; we will use more functions to clean up the errors.
Select the properties that do not contain portion values by using the statements
below.
Now that the “Erf” and “Portion” columns are “clean”, run the code to fill zeros before the “Erf” and “Ptn” values, e.g. Erf 228 becomes 00000228. A sketch of this step follows below.
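The cleaning steps above usually reduce to pandas string operations. The sketch below assumes hypothetical column names and a Description pattern such as “PORTION 1 OF ERF 228”; the zero-fill width for the Portion values is likewise an assumption.

    # Pull the numbers out of the Description text (the pattern is an assumption)
    df["Ptn"] = df["Description"].str.extract(r"PORTION\s+(\d+)", expand=False).fillna("0")
    df["Erf"] = df["Description"].str.extract(r"ERF\s+(\d+)", expand=False)

    # Zero-fill: Erf 228 becomes 00000228; Ptn 1 becomes 0001
    df["Erf"] = df["Erf"].str.zfill(8)
    df["Ptn"] = df["Ptn"].str.zfill(4)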
Run all the code snippets under the “Create SG Code” subheading
Now that all the “SG Codes” are created, run the code snippets to join the “Valuation Roll” to the “Erf_Boundaries” layer; the sketch below shows the general idea.
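A join like this is essentially a pandas merge on the shared code. The sketch below uses the ArcGIS API for Python's spatially enabled data frame; the layer path, output path and the “SG_CODE” column name are assumptions, and df is the cleaned valuation roll from the previous steps.

    import pandas as pd
    from arcgis.features import GeoAccessor  # registers the .spatial accessor

    # Read the parcel boundaries as a spatially enabled data frame
    erf_sdf = pd.DataFrame.spatial.from_featureclass("Erf_Boundaries")

    # Attribute join on the shared SG code, keeping all parcels
    joined = erf_sdf.merge(df, on="SG_CODE", how="left")
    joined.spatial.to_featureclass(r"C:\Data\Project.gdb\Erf_Boundaries_Valuation")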
The following layer will be added to the map and overlaid to the
“Erf_Boundaries”.
Right click on the newly added layer and confirm the number of rows.
a) Are “Valuation Roll” attributes appended to the “Erf_Boundaries” layer?
_____________________________________________________________________
Create a “Map Layout” with all the elements of a map and send the map to your Facilitator.
2.4.2. Data Exploring and Discovery
Data exploration is a pivotal stage in spatial data science that plays a fundamental role
in comprehending the underlying geographic phenomena. It paves the way for
subsequent processes such as spatial modeling, decision-making, and policy
development. This crucial process empowers analysts to unearth valuable insights and
formulate well-informed spatial recommendations. Data exploration in spatial data
science encompasses a comprehensive approach involving the examination,
visualization, and analysis of geographic or spatial datasets. This approach is designed
to gain insights into and an understanding of spatial patterns, relationships, and characteristics
within the data. The following are key aspects of data exploration in spatial data
science.
At this stage of the workflow, one can select one or multiple methods to visualise and
explore data.
Capture the name of the project as “Data Exploration and Cleaning” on the
dialog box.
Click on the “Create a new folder for this project” check box.
In the Windows group, click “Catalog Pane”. The following window will be shown
on the right side of the main window.
Expand the mini triangle in front of the “FA. Activity 1.1” folder until you see the
silver container titled “MyProject.gdb” and layers as indicated in the image
below.
Right click on the “Erf_Boundaries” layer and click “add to current map”.
Click the “Map” tab. In the Navigate group, make sure that “Explore” is
selected.
Click the “Data” tab. In the Table group, click on “Attribute Table”.
Scroll towards the end while carefully studying the column header as well as the
data in the rows.
Notice that the last four fields are useful for creating the required report; however, the data requires cleaning to produce meaningful results suitable for decision making.
Step 1: The number of Registered Parcels
Place the cursor on the “Status” field, right click, then select “Statistics”.
Click on the “Axes” tab and increase the “Label character limit” to “15”.
Click on the “General” Tab.
On the “Chart Title” type “REGISTRATION STATUS” and “X axis” title type
“Registration Status”.
Step 2: The total extent of land for the registered parcels and the unregistered parcels
Click on the "Calculate” button, this will display statistical and data quality
metrics of each field in the data as columns in a table.
Data Type
Feature Count
Number of Unique Values
Frequently Appearing Value
b) Study the Extent Units chart and note your observations below
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
Use the “Select by Attributes” tool to select the “sqm” and “SQM” values and use the “Field Calculator” to calculate “Sqm” in the extent units field.
Select the “Ha” units and convert the area to “Sqm” using the “Field Calculator”, i.e. Extent field * 10000.
Select the rows with Null extent values and calculate the area using the Calculate Geometry tool.
Right click on the “Extent” row and the “Extent Unit” row and select “Remove field”. (The selections and calculations above can also be scripted, as sketched below.)
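A hedged ArcPy sketch of the unit clean-up, assuming a layer named “Erf_Boundaries” (already added to the map, or wrapped with MakeFeatureLayer) and field names matching those above:

    import arcpy

    lyr = "Erf_Boundaries"  # assumption: a feature layer with Extent/Extent_Units fields

    # Harmonise the spelling of square-metre units
    arcpy.management.SelectLayerByAttribute(lyr, "NEW_SELECTION",
                                            "Extent_Units IN ('sqm', 'SQM')")
    arcpy.management.CalculateField(lyr, "Extent_Units", "'Sqm'", "PYTHON3")

    # Convert hectares to square metres (1 ha = 10 000 sqm)
    arcpy.management.SelectLayerByAttribute(lyr, "NEW_SELECTION", "Extent_Units = 'Ha'")
    arcpy.management.CalculateField(lyr, "Extent", "!Extent! * 10000", "PYTHON3")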
Step 3: Calculate the total amount of the registered parcels and prepare a map of the prices.
Select the “Amount” and “Extent Units” columns and drag the fields to the statistics panel.
Click on the "Calculate” button, this will display statistical and data quality
metrics of each field in the data as columns in a table.
Right click on the “Chart Preview” cell, hover over the “Create Chart” button, then select “Bar Chart”.
The “Bar Chart” is displayed below; notice that the field contains a mixture of numeric values and text.
The data needs to be populated into a new numeric field that will store numerical values only. Click on Data, then Field, then create a “New” field with the following specifications: Field Name “PROPERTYAMOUNT”, Field Alias “PROPERTY AMOUNT” and Field Type “Long”, then click “Save”.
Populate the “AMOUNT” field contents into the newly created field.
Ensure that all the data has been populated by calculating statistics for the new “PROPERTY AMOUNT” field.
Right click on the “Chart Preview” cell, hover over the “Create Chart” button, then select “Bar Chart”.
Compare the original “AMOUNT” and the newly created “PROPERTY AMOUNT”
charts to ensure that all the values have been populated.
You will notice that “R1.00” was not populated; select the row that contains “R1.00” and populate “1” into the “PROPERTY AMOUNT” field.
Select the rows that contain “Null” values and populate “0” into the “PROPERTY AMOUNT” field.
Symbolise the map using the “PROPERTY AMOUNT” field. Set the Method to “Quantile” and select “5 Classes”.
c) Study the Property Amounts map and write your observations below.
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
Traditional spatial data analysis and spatial data science differ in various ways. The table below summarises some key differences between the two.
Data Volume and Variety
Traditional spatial data analysis: Works with smaller datasets, often limited to geographic data types like points, lines, and polygons.
Spatial data science: Handles large and diverse datasets, including unstructured data like satellite imagery and social media data, and incorporates various data types and sources.
Big Data and Scalability
Traditional spatial data analysis: May struggle with the scalability of large datasets and big data analytics.
Spatial data science: Utilizes big data technologies and distributed computing to process and analyze massive spatial datasets efficiently.
2.4.4. Modelling
Spatial data science modelling involves various techniques for optimization, simulation,
and prediction. These techniques help analyse and understand spatial phenomena,
make informed decisions, and forecast future events or trends in various fields including
urban planning, transportation, public health, environmental management, and more.
The subsections below provide an overview of these concepts within spatial data
science:
c) An analyst wants to identify the most suitable location for a wind farm. One of the
criteria is slope. The analyst will use a digital elevation model to create a dataset that
represents slope. This example demonstrates which step in suitability modeling?
Transforming criteria values
Deriving the criteria
Weighting and combining criteria
2.4.4.2. Simulation
Spatial simulation, often referred to as spatial modeling and simulation, is a
computational technique used to model and analyze complex spatial phenomena
and processes in various fields, including geography, urban planning, ecology,
epidemiology, and transportation. Spatial simulation involves creating computer-based
models that simulate the behavior, interactions, and dynamics of entities or phenomena
within a geographic or spatial context.
2.4.4.3. Prediction
Spatial data science often involves the prediction of spatial phenomena or events using various statistical and machine learning techniques. Predictive modeling in spatial data science aims to forecast future values, patterns, or occurrences at specific locations within a geographic or spatial context.
Extract, transform and load (ETL) is the process of extracting data from one system, transforming it into a format that is consumable by another system, and loading it into the final system where it will be used for business analysis. In this course we will refer to GIS software as the final system, where data is loaded for analysis, visualization, and other geospatial tasks.
In this subtopic, we'll explore a practical example in urban planning that showcases
the ETL (Extract, Transform, Load) spatial data workflow. Specifically, focusing on a
scenario where an urban planning department must evaluate traffic congestion in a
rapidly expanding city to pinpoint zones in need of infrastructure enhancements.
The urban planning department, armed with the results of the ETL (Extract, Transform,
Load) process, can now conduct spatial analysis to pinpoint congested areas,
strategize infrastructure improvements, and base their decisions on data-driven insights
to effectively address traffic-related challenges.
It's important to note that ETL processes are versatile and can be applied across a
broad spectrum of scenarios, spanning urban planning, environmental management,
and beyond. These processes seamlessly facilitate the integration and analysis of
diverse spatial data sources, ultimately empowering data-driven decision-making
across a wide range of spatial applications.
In ArcGIS, ETL tasks can be executed through various methods, including Model Builder
for visual workflows, Python scripting for customization and automation, and the Data
Interoperability extension for handling a diverse array of data formats. Model Builder is
a graphical interface that enables the creation of workflows by visually connecting
geoprocessing tools. It's valuable for automating intricate ETL (Extract, Transform, Load)
processes.
Model Builder is a visual interface within ArcGIS that allows you to create, edit,
and manage geoprocessing workflows. It is particularly useful for building
complex ETL processes by connecting geoprocessing tools in a visual manner.
Python Scripting is supported in ArcGIS Pro, allowing users to develop custom
scripts for performing data ETL tasks using ArcPy, the Python site package
designed for ArcGIS.
ArcGIS Pro also incorporates the Data Interoperability extension, offering
advanced ETL capabilities for handling a diverse array of data formats and
sources.
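As a concrete illustration of the Python route, the sketch below walks the traffic scenario through a minimal extract-transform-load cycle. Every path, file and field name here is an assumption made for the example, not a prescribed dataset.

    import arcpy
    import pandas as pd

    # Extract: raw sensor counts delivered as a CSV file
    raw = pd.read_csv(r"C:\Data\traffic_counts.csv")

    # Transform: keep congested segments and rename columns for loading
    congested = raw[raw["vehicles_per_hour"] > 1500].rename(
        columns={"lon": "X", "lat": "Y"})
    congested.to_csv(r"C:\Data\congested.csv", index=False)

    # Load: bring the result into the project geodatabase as a point feature class
    arcpy.management.XYTableToPoint(
        r"C:\Data\congested.csv",
        r"C:\Data\Project.gdb\congested_points",
        "X", "Y", coordinate_system=arcpy.SpatialReference(4326))  # WGS84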
In this section you will be introduced to the relational database and other types of
databases.
Topics to cover
Introduction to Relational databases (SQL)
Non-relational databases (NoSQL) and flat-file data management
Structured and unstructured databases
Relational Data Model
Learning Objectives
After completing this section, you will be able to:
Assessments
FA. Activity 3.1: Use of Spatial Data Models in ArcGIS Pro
In the previous course, the user was introduced to both the raster and vector spatial data structures and to data acquisition methods. After identifying and acquiring data from multiple sources in various formats, GIS users need to identify how best to store and manage this data. In this section, closer attention will be paid to how this data is managed using the recommended method for this task, i.e. geographic databases. In the past, data was organized using files; nowadays data is managed in a Database Management System (DBMS). Database Management Systems make data quickly available to a multitude of users while still maintaining its integrity. The DBMS protects data against deletion and corruption while facilitating the addition, removal and updating of data by multiple users.
Database management refers to the management of tabular data in row and column
format and is frequently used for personal, business, government, and scientific
endeavors. Geospatial database management systems, alternatively, include the
functionality of a DBMS but also contain spatial characteristics about each data point
such as identity, location, shape, and orientation. Integrating this geographic
information with the tabular attribute data of a classical DBMS provides users with
powerful tools to visualize and answer the spatially explicit questions that arise in an
increasingly technological society. The DBMS has the following advantages over
traditional file data management:
i. Minimize redundancy
Data redundancy implies repetition of data: the same data being present in multiple formats or tables. Redundant data leads to inconsistencies, which make analysis unreliable and undermine data-driven decision-making.
v. Manage concurrency
DBMS provides concurrency control to allow multiple users to access and modify data
simultaneously without interfering with each other's work. It ensures that data remains
consistent even when multiple users are modifying it concurrently.
Geographic Information Systems integrate data from various resources into a single
homogeneous system, which require powerful and flexible data models to serve
multiple tasks.
In this module, our emphasis will be on the final two types of databases.
In the relational model, each table is linked to each other table via predetermined keys.
The primary key represents the attribute (column) whose value uniquely identifies a
particular record (row) in the relation (table). The primary key may not contain missing
values as multiple missing values would represent nonunique entities that violate the
basic rule of the primary key. The primary key corresponds to an identical attribute in a
secondary table (and possibly third, fourth, fifth, etc.) called a foreign key. This results in
all the information in the first table being directly related to the information in the second
table via the primary and foreign keys, hence the term “relational” DBMS. With these
links in place, tables within the database can be kept very simple, resulting in minimal
computation time and file complexity. This process can be repeated over many tables
as long as each contains a foreign key that corresponds to another table’s primary key.
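A minimal sketch of this primary/foreign key link, using Python's built-in SQLite module and hypothetical owner/parcel tables (the table design and values are invented for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

    con.execute("""CREATE TABLE owner (
        owner_id INTEGER PRIMARY KEY,   -- uniquely identifies each row
        name     TEXT NOT NULL)""")

    con.execute("""CREATE TABLE parcel (
        sg_code  TEXT PRIMARY KEY,
        owner_id INTEGER REFERENCES owner(owner_id))""")  # owner_id is the foreign key

    con.execute("INSERT INTO owner VALUES (1, 'City of Johannesburg')")
    con.execute("INSERT INTO parcel VALUES ('EXAMPLE-SG-CODE', 1)")

    # The "relation": parcel rows link to owner rows through the key pair
    for row in con.execute("""SELECT p.sg_code, o.name FROM parcel p
                              JOIN owner o ON p.owner_id = o.owner_id"""):
        print(row)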
There is great potential for redundancy in this model as each table must contain an
attribute that corresponds to an attribute in every other related table. Therefore,
redundancy must actively be monitored and managed in a RDBMS. To accomplish this,
a set of rules called normal forms have been developed. There are three basic normal
forms.
The first normal form refers to five conditions that must be met. They are as follows:
The second normal form states that any column that is not a primary key must be
dependent on the primary key. This reduces redundancy by eliminating the potential
for multiple primary keys throughout multiple tables. This step often involves the
creation of new tables to maintain normalization.
The third normal form states that all nonprimary keys must depend on the primary key,
while the primary key remains independent of all nonprimary keys.
The relational data model has several advantages that make it a popular choice for
organizing and managing data in a database. The relational data model provides a
powerful and flexible way to organize and manage data in a database, making it an
ideal choice for a wide range of applications. It has been widely adopted in the industry
and is supported by many popular database management systems, making it a
valuable skill for data professionals to have.
The biggest significance of relational databases lies in their ability to provide structured,
efficient, and reliable storage and management of data. Here are some of the key
significances of relational databases:
Unstructured data must often be transformed before it is ready for meaningful analysis. In this chapter, we will delve into the preprocessing phase, particularly focusing on the intricate art of transforming unstructured data into a structured format, setting the stage for robust data analysis.
A graph database is a type of NoSQL database designed to represent and store data
in a graph structure. In a graph database, data is organized into nodes, relationships,
and properties, allowing for the efficient representation of complex relationships
between data elements. This makes graph databases particularly well-suited for
applications that involve highly interconnected data. The graph database winds up looking like a cloud of concepts and connections, a structure that is nearly impossible to represent in relational databases.
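To make the node/relationship/property vocabulary concrete, here is a toy, in-memory illustration in Python (not a real graph database; all names and values are invented):

    # Nodes carry properties; relationships connect nodes by type
    nodes = {
        "parcel_228": {"label": "Parcel", "suburb": "Jabulani"},
        "owner_1":    {"label": "Owner",  "name": "J. Dlamini"},
    }
    relationships = [
        ("owner_1", "OWNS", "parcel_228"),  # (from, relationship type, to)
    ]

    # Traversal means following edges rather than joining tables
    for src, rel, dst in relationships:
        print(f"{nodes[src]['label']} {src} -{rel}-> {nodes[dst]['label']} {dst}")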
Non-relational databases and flat files have some similarities in that they do not rely on
a predefined schema, but they differ in terms of their structure and functionality. Non-
relational databases can be highly scalable, efficient, and flexible, allowing for the
storage and processing of large volumes of unstructured or semi-structured data. Flat
files, on the other hand, are often used for smaller data sets or simple data storage
needs.
In summary, non-relational databases and flat files are two different types of data
storage and management systems, each with its own strengths and weaknesses. Non-
relational databases are more suitable for large-scale distributed systems and big data
processing, while flat files are often used for smaller, simpler data storage needs.
i. Data storage and organization: Flat-file databases can be used to store and
organize data in a simple, structured format. Data can be stored in a flat-file
database in a tabular format with a specific delimiter, such as a comma or tab,
allowing for easy access and retrieval.
ii. Data processing and analysis: Flat-file databases can be used to process and
analyze data using a variety of tools and techniques. For example, data stored in
a flat-file database can be imported into a spreadsheet or statistical analysis
software for further analysis.
iii. Data sharing and collaboration: Flat-file databases can be easily shared and
collaborated on with others. Data can be shared as a file, emailed, or stored in a
shared folder or cloud storage service.
iv. Data migration and integration: Flat-file databases can be used to migrate and
integrate data between different systems. For example, data stored in a flat-file
database can be imported into a relational database or other data storage
system.
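Because a flat file is just delimited text, reading one takes only the standard library; the file and column names below are assumptions made for the example:

    import csv

    # One dictionary per data row, keyed by the header line
    with open("parcels.csv", newline="") as f:
        for row in csv.DictReader(f):
            print(row["sg_code"], row["extent_sqm"])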
ESRI has developed a Geodatabase to leverage the two data models. The
geodatabase is implemented using the same multitier application architecture found
in other advanced DBMS applications. The geodatabase objects persist as rows in DBMS
tables that have identity, and the behaviour is supplied through the geodatabase
application logic.
Manipulating attribute data is more complex than manipulating spatial data: it is much harder to identify errors in attribute data when the values are syntactically valid but incorrect, and fewer tools exist for attribute data. The most common operations used to manipulate attribute data, sketched in code after this list, include:
• Insert – refers to capturing attribute information for each geometry being added
in a feature class or adding a new row of information on a standalone table
• Update - Updating attribute data implies entering new information though the
geometry of the feature remains unmodified. The updating function is of great
importance during any project.
• Delete – refers to removing features from the GIS as they are removed from the
real world
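In ArcPy these three operations map onto the arcpy.da cursor classes. The sketch below is illustrative only; the geodatabase path and field names are assumptions.

    import arcpy

    fc = r"C:\Data\Jabulani_Ex1.gdb\Erven"  # hypothetical feature class

    # Insert: add a new row (attribute fields only; geometry omitted here)
    with arcpy.da.InsertCursor(fc, ["Erf_No", "Status"]) as cur:
        cur.insertRow([228, "Registered"])

    # Update: change attribute values while leaving the geometry unmodified
    with arcpy.da.UpdateCursor(fc, ["Erf_No", "Status"]) as cur:
        for row in cur:
            if row[0] == 228:
                row[1] = "Pending"
                cur.updateRow(row)

    # Delete: remove rows for features that no longer exist in the real world
    with arcpy.da.UpdateCursor(fc, ["Status"]) as cur:
        for row in cur:
            if row[0] == "Demolished":
                cur.deleteRow()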
iv. Domains: Domains are sets of valid values that can be assigned to specific fields
within feature classes or tables. They help enforce data integrity by restricting the
values that can be entered into attribute fields. For instance, a domain might
define a list of valid land use codes for a land parcel feature class.
v. Subtypes: Subtypes allow you to categorize features within a feature class into
subgroups based on common characteristics. For example, within a water utility
network feature class, subtypes might be used to distinguish between different
types of water pipes (e.g., main lines, service lines) with their own specific
attributes.
vi. Topology Rules: Topology rules define spatial relationships and constraints
between features in different feature classes. These rules help maintain the
integrity of spatial data by ensuring that features conform to specific geometric
criteria (e.g., no overlap between polygons).
vii. Geometric Networks: Geometric networks are specialized graph structures used
to model and analyze the connectivity and flow of resources in spatial data. They
are often used in utility and transportation network datasets.
A schema defines the physical structure of the geodatabase as well as the rules,
relationships, and properties of each dataset in the geodatabase. Thus, one of the most
important aspects of a geodatabase is its structure and the way the geodatabase is
organized into database tables, column types, indexes, and other database objects.
1. Tables: A table is a basic unit of storage in a database that contains data in rows
and columns.
2. Relationships: A relationship defines the connection between two or more tables in
a database. The relationship is based on a common key field that is used to link the
tables.
3. Constraints: Constraints are rules that are used to ensure the integrity of the data
stored in a database. Constraints can be used to enforce data type, uniqueness,
and referential integrity rules.
4. Views: Views are virtual tables that are derived from one or more tables in a
database. Views are used to simplify the complexity of data and to provide a
customized view of data to different users.
5. Indexes: An index is a data structure that is used to improve the performance of
database queries. Indexes are created on one or more columns in a table, and
they allow the database to quickly locate specific data in the table.
6. Stored Procedures: A stored procedure is a group of SQL statements that are stored
in the database and can be executed repeatedly by a user or an application.
Stored procedures can be used to improve performance, simplify complex
operations, and enforce business rules.
2. Data integration: Schemas can be used to integrate data from different sources
into a single database system. By defining common fields and relationships,
schemas enable the data to be combined and analyzed, which can lead to new
insights and discoveries.
3. Performance optimization: Schemas can be used to optimize the performance of a
database system. By creating indexes, views, and stored procedures, schemas can
improve the speed of data retrieval and processing, which is critical for large
databases or high-demand applications.
4. Security: Schemas can be used to define access control rules, which can prevent
unauthorized access to sensitive data. By defining user roles and permissions,
schemas ensure that data is protected and only accessible to authorized users.
Overall, schemas are an essential tool for solving data problems because they provide
a structured way of organizing, integrating, optimizing, and securing data. By defining
the logical structure of a database system, schemas ensure that data is consistent,
accurate, and easy to manage, which is critical for any data-driven organization.
Schemas are also used in GIS to solve data problems. Here are some examples:
1. Data modeling: GIS data modeling involves defining the logical structure of
geospatial data. By defining tables, relationships, and constraints, GIS schemas
ensure that spatial data is organized in a structured and logical manner, which
improves data quality and reduces the risk of errors.
2. Spatial analysis: GIS schemas can be used to define common fields and
relationships between geospatial data layers, enabling spatial analysis and
visualization. By creating indexes, views, and stored procedures, GIS schemas can
also improve the speed of spatial queries and data processing, which is critical for
large geospatial databases or high-demand GIS applications.
3. Data integration: GIS schemas can be used to integrate geospatial data from
different sources, enabling the data to be combined and analyzed. By defining
common fields and relationships, GIS schemas enable spatial data to be integrated
with non-spatial data, which can lead to new insights and discoveries.
4. Data sharing: GIS schemas can be used to define access control rules, which can
prevent unauthorized access to sensitive geospatial data. By defining user roles
and permissions, GIS schemas ensure that geospatial data is protected and only
accessible to authorized users.
Overall, GIS schemas are an essential tool for solving data problems because they
provide a structured way of organizing, integrating, optimizing, and securing geospatial
data. By defining the logical structure of geospatial databases, GIS schemas ensure that
spatial data is consistent, accurate, and easy to manage, which is critical for any GIS-
driven organization.
i. Identify data requirements: This involves identifying the types of data that need
to be stored and managed, as well as the relationships between different data
elements. For example, in a customer relationship management system, data
requirements may include customer information, order details, and payment
history.
ii. Choose a Geodatabase Type: Decide on the type of geodatabase you want to
create. Common options include file geodatabases, personal geodatabases,
and enterprise geodatabases.
iii. Define entities: Once data requirements have been identified, the next step is to
define entities or tables that will be used to store the data. Each entity
represents a distinct type of data, such as customers, orders, or products.
iv. Define attributes: Each entity must be further defined by specifying the attributes
or columns that will be used to store data. For example, the customer entity may
have attributes such as customer ID, name, address, and phone number.
v. Define relationships: Relationships are defined between entities to show how
they are related to each other. For example, a relationship may exist between
the customer entity and the order entity, indicating that each customer may
have multiple orders.
vi. Organize Data into Feature Datasets: Feature datasets are containers for
organizing related feature classes. Group feature classes that share common
characteristics or themes into feature datasets.
vii. Set Up Domains: Define attribute domains to enforce data integrity by restricting
values allowed for specific attributes.
viii. Design Topology: If your data requires maintaining topological relationships
(e.g., connectivity, adjacency), set up and configure topology rules.
ix. Plan Spatial Indexes: Determine the fields and geometry types for spatial indexes
to optimize spatial queries.
x. Implement Geodatabase Rules: Establish data constraints, rules, and behaviors for maintaining data quality and consistency.
xi. Implement the schema: The final step is to implement the schema by creating the necessary database tables and defining data types, constraints, and other properties. A scripted sketch of these steps follows below.
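For a file geodatabase, several of the steps above can be scripted with ArcPy. The sketch below is a minimal outline under stated assumptions: every name, the folder path and the WKID are placeholders, not prescribed values.

    import arcpy

    folder = r"C:\Data"  # assumed output folder
    gdb = arcpy.management.CreateFileGDB(folder, "ParcelSchema").getOutput(0)
    sr = arcpy.SpatialReference(4148)  # assumption: GCS Hartebeesthoek 1994

    # Step vi: a feature dataset to group related feature classes
    arcpy.management.CreateFeatureDataset(gdb, "Cadastre", sr)

    # Step vii: a coded-value domain restricting a status field
    arcpy.management.CreateDomain(gdb, "RegStatus", "Registration status",
                                  "TEXT", "CODED")
    for code in ("Registered", "Pending"):
        arcpy.management.AddCodedValueToDomain(gdb, "RegStatus", code, code)

    # Step xi: a feature class (inherits the dataset's coordinate system)
    # whose Status field is constrained by the domain
    fc = arcpy.management.CreateFeatureclass(gdb + r"\Cadastre", "Erven",
                                             "POLYGON").getOutput(0)
    arcpy.management.AddField(fc, "Status", "TEXT", field_length=20,
                              field_domain="RegStatus")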
Overall, creating a data schema is a complex process that requires careful planning,
analysis, and design. By following these steps, organizations can create a data
schema that is well-structured, efficient, and optimized for their specific data
management needs.
These are some of the most common types of schemas used in data management.
Choosing the right type of schema for a particular data set or database is important
to ensure data integrity, security, and optimal performance.
These are some of the most common types of schemas used in GIS. Choosing the right
type of schema for a GIS database is important to ensure that the data is organized in
a consistent and efficient manner and can be easily shared and used by others.
Create a schema for a dataset of your choice in ArcGIS Pro that will be stored in a File Geodatabase.