
Learner Guide_Management of Data in Geospatial Database

The document is a learner guide for managing data in geospatial databases, part of a National Certificate program in Spatial Intelligence Data Science. It covers various aspects of data management, including data processing, storage, and the significance of different database types. The guide aims to equip learners with practical knowledge and skills necessary for effective data handling and analysis in spatial contexts.


Management of Data in Geospatial Databases

Learner Guide
NQF Level 5

Credits 8


Learner Information

Details Please Complete this Section

Name & Surname:

Organization:

Unit/Dept:

Facilitator Name:

Date Started:

Date of Completion:

Copyright
All rights reserved. The copyright of this learner guide, and of any annexures thereto, is
protected and expressly reserved. No part of this document may be copied, reproduced,
stored in a system, or transmitted by photocopying, recording or other means without prior
permission.

MDGD NTG Solutions Academy 2




Icons

Books: This icon means that other books are available for further information on a
particular topic/subject.

References: This icon refers to any examples, handouts, checklists, etc.

Important: This icon represents important information related to a specific topic or section
of the guide.

Activities: This icon helps you to be prepared for the learning to follow or assist you to
demonstrate understanding of module content. Shows transference of knowledge and skill.

Exercises: This icon represents any exercise to be completed on a specific topic during
class, at home by yourself or in a group.

Tasks/Projects: An important aspect of the assessment process is proof of competence. This
can be achieved by observation, or a portfolio of evidence should be submitted in this regard.

Workplace Activities: An important aspect of learning is through workplace experience.
Activities with this icon can only be completed once a learner is in the workplace.

Tips: This icon indicates practical tips you can adopt in the future.

Notes: This icon represents important notes you must remember as part of the learning
process.

Video: This icon indicates that there is an important video to watch.



Different Learning styles


Many people recognize that each person prefers different learning styles and techniques.
Learning styles group common ways that people learn. Everyone has a mix of learning styles.
Some people may find that they have a dominant style of learning, with far less use of the
other styles. Others may find that they use distinctive styles in different circumstances. There is
no right mix. Nor are your styles fixed. You can develop ability in less dominant styles, as well
as further develop styles that you already use well.

By recognizing and understanding your own learning styles, you can use techniques better
suited to you. This improves the speed and quality of your learning.

The Seven Learning Styles

1. Visual (spatial): You prefer using pictures, images, and spatial understanding.
2. Aural (auditory-musical): You prefer using sound and music.
3. Verbal (linguistic): You prefer using words, both in speech and writing.
4. Physical (kinesthetic): You prefer using your body, hands and sense of touch.
5. Logical (mathematical): You prefer using logic, reasoning and systems.
6. Social (interpersonal): You prefer to learn in groups or with other people.
7. Solitary (intrapersonal): You prefer to work alone and use self-study.



Learner guide Introduction

About the Learner Guide: This Learner Guide provides a comprehensive overview of data
management, which involves data processing and storage. It forms part of a series of Learner
Guides developed for the National Certificate: Spatial Intelligence Data Scientist –
SP-210604 LEVEL 5 – 56 CREDITS. The guides in the series are organized in a modular format.
They are designed to improve the skills and knowledge of learners, enabling them to complete
specific tasks effectively and efficiently. Learners are required to attend training
workshops as a group or as specified by their organization. These workshops are presented in
modules and conducted by a qualified facilitator.

Purpose: The purpose of this Learner Guide is to provide learners with the necessary
knowledge related to Data Management in Geospatial Databases. Data Science encompasses
principles from various fields such as statistics, machine learning, databases, data
warehousing, distributed systems, and data visualization. Data scientists must understand the
fundamental principles of data processing and management, and how to use various tools for
data cleaning, handling, and storage. After completing this module, learners will be able to
understand and demonstrate practical knowledge of systems and techniques used for data
ingestion, processing, and storage. The module covers the fundamentals of data handling in
order to analyze and visualize data for location analytics.

Outcomes: After completing this module, the learner will be able to demonstrate theoretical
and practical knowledge of:
• Introduction to Data Management for Data Science
• Data Manipulation Methods
• Structured vs Unstructured Data
• Types of databases
• Relational Data Model
• Data Model (Schema)
• Understanding Schema Database
Assessment Criteria: The only way to establish whether a learner is competent and has
accomplished the specific outcomes is through an assessment process. Assessment involves
collecting and interpreting evidence about the learner’s ability to perform a task. This
guide may include assessments in the form of activities, assignments, tasks or projects, as
well as workplace practical tasks. Learners are required to perform tasks on the job to
collect enough appropriate evidence for their portfolio of evidence.
• The fundamental concepts, principles, and workflows of data science are explained and
understood.
• Knowledge of data cleaning, transformation, sorting and aggregation is demonstrated.
• Discuss the significance of data management and common data handling workflows in data
science.
• The use of relational databases and their significance can be explained.
• Contrast relational and non-relational databases.
• Explain how flat-file databases can be used to solve data-related problems.
• Examine the differences between structured and unstructured databases.
• Explain the significance of structured databases.
• Discuss the value of the relational data model.
• Describe how a data schema is created.
• Discuss the various types of schemas.
To qualify: To qualify and receive credits towards the learning program, a registered
assessor will conduct an evaluation and assessment of the learner’s portfolio of evidence
and competency.
Range of Learning: This describes the situation and circumstance in which competence must be
demonstrated and the parameters in which learners operate.
Responsibility: The responsibility for learning rests with the learner, so:
• Be proactive and ask questions.
• Seek assistance and help from your facilitators, if required.



Table of Contents
Course Introduction
Course Goals
Chapter 1: Introduction to Data Management
  Topics to cover
  Learning Objectives
  Assessment
  1.1. Understanding the history of data management
  1.2. Areas of Data Management
    1.2.2. Data cleaning
    1.2.3. Data storage
    1.2.4. Data integration
    1.2.5. Data security
    1.2.6. Data governance
    1.2.7. Data analysis
  FA. Class Exercise 1.1: Data Management Fundamentals (10 min)
  FA. Activity 1.1: Integrating Data in ArcGIS Pro (35 min)
Chapter 2: Introduction to Spatial Data Science
  Topics to cover
  Learning Objectives
  Assessment
  2.1. What is Data Science
  2.2. What is Spatial Data Science
  2.3. Industry usage of Spatial Data Science
    2.3.1. Spatial Planning
    2.3.2. Agriculture
    2.3.3. Disaster Management
    2.3.4. Healthcare
    2.3.5. Telecommunication
    2.3.5. Transportation
    2.3.7. Natural resource management
  2.4. Spatial Data Science Workflow
    2.4.1. Data Cleaning
  FA. Activity 2.1: Editing Data in ArcGIS Pro (20 min)
  FA. Activity 2.2: Introduction to ArcGIS Notebooks (35 min)
    2.3.2. Data Exploring and Discovery
      2.3.2.1. Spatial Visualization
  FA. Activity 2.3: Data Exploration and Visualization (45 min)
    2.3.3. Data Processing
    2.3.4. Modelling
  FA. Class Exercise 2.4: Cartographic model
  2.5. Extract, Transform, Load (ETL)
    2.5.1. The ETL Workflow
Chapter 3: Understanding Databases
  Topics to cover
  Learning Objectives
  Assessments
  3.1. What is a DBMS?
    i. Minimize redundancy
    ii. Maintain consistency
    iii. Guarantee security
    iv. Provide frequent back-ups and Recovery
    v. Manage concurrency
  3.2. Review of Data Models
  3.3. Types of Databases
    3.3.1. Centralized database
    3.3.2. Cloud database
    3.3.3. Commercial database
    3.3.4. Distributed database
    3.3.5. End-user database
    3.3.6. Graph database
    2.3.7. NoSQL database
    3.3.8. Object-oriented database
    3.3.9. Open-source database
    3.3.10. Operational database
    3.3.11. Personal database
    3.3.12. Relational database
    3.3.13. Irrational database/ non-relational database (Non-SQL) flat files
  3.4. Relational Database Management Systems
    3.4.1. Significance of relational databases
  FA. Activity 3.1: Relational Diagram
  3.5. Non-relational Databases
    3.5.1. Characteristics of unstructured data
    3.5.2. Types of No SQL Databases
    3.5.2. Column Family Databases
    3.5.3. Document Databases
    3.5.4. Graph Databases
  3.6. Flat files vs Irrigational Data bases
    3.5.1. Uses of flat-file databases
  3.7. The Geographical Database
  3.8. Data Manipulation
  3.9. Understanding the Geodatabase schema
    3.9.1. Application of schemas to solve data problems
    3.9.2. Creating data schema
    3.9.3. Various types of schemas
  FA. Activity 3.2: Schema creation in ArcGIS Pro
References



Course Introduction
Data Management includes principles that cut across different fields such as statistics,
databases, data warehousing, distributed systems, and data visualization. This course
will help you develop the skills to organize, manage, analyze and enhance data, and to
use various tools for data cleaning, handling, and storage. You will be able to
understand and demonstrate practical knowledge of systems and techniques used for
data ingestion, processing, and storage. The module covers the fundamentals of data
handling in order to analyze and visualize data for location analytics. This course
builds on the basic understanding obtained in the Introduction to Data Acquisition for
Spatial Intelligence course. The skills you will acquire in this course include:

• Data Management
• Introduction to Data Science
• Spatial Data Science
• Data Manipulation Methods
• Structured vs Unstructured Data
• Types of databases
• Relational Data Model
• Data Model (Schema)
• Understanding Schema Database

Course Goals
At the end of this course, you will be able to:

• Utilize Data Science to clean data
• Apply spatial data science techniques to analyze data
• Manage data by using different data manipulation methods
• Differentiate between structured and unstructured Data
• Understand and use the Relational Data Model to store and manage data
• Understand the Database Schema



Chapter 1: Introduction to Data Management

The student will learn the basic concepts of data management, including its history and
workflows.

Topics to cover
• Data management
• Evolution of data management
• Data management areas or subcomponents

Learning Objectives
After completing this section, you will be able to:

• Explain what data management is and its significance.
• Understand the history and evolution of data management.
• List and discuss areas of data management.

Assessment
• FA. Class Exercise 1.1: Data Management Fundamentals
• FA. Activity 1.1: Integrating Data in ArcGIS Pro



The volume of spatial data has increased rapidly over the years due to the evolution of
sensing devices and telecommunication technology. Additionally, the evolution of
geographic information provided an opportunity for common users to routinely utilise
location-based datasets. Nowadays, a significant portion of data has a spatial aspect.
Thus, the management of the location and geometric characteristics of entities has
become a crucial element for every organization.

Data management is the practice of collecting, organizing, protecting, and storing an
organization’s data so it can be analyzed for business decisions. The goal of data
management is to ensure that the data is accurate, reliable, and accessible for analysis.
Data management involves several key components, including data acquisition, data
cleaning, data storage, data integration, data security, data governance and data
analysis.

1.1. Understanding the history of data management


The history of data management can be traced back to the earliest forms of human
communication, where information was recorded on stone tablets and other durable
materials. However, the modern era of data management can be traced back to the
invention of computers and the development of electronic data processing.

In the 1950s and 1960s, electronic computers began to replace manual systems for data
processing. This led to the development of databases and file systems, which allowed
for the efficient storage and retrieval of large amounts of data. The first commercially
available database management system (DBMS) was the Integrated Data Store (IDS),
released by General Electric in 1965.

In the 1970s, the relational database model was developed, which allowed for more
flexible and powerful querying of data. The Structured Query Language (SQL) was also
developed during this time and became the standard language for accessing and
managing data in relational databases.

In the 1980s and 1990s, the use of personal computers and the growth of the internet
led to a massive increase in the amount of data being generated and stored. This led
to the development of new data management technologies, such as data
warehousing, data mining, and online analytical processing (OLAP).

In the 2000s, the rise of big data and cloud computing led to the development of new
data management technologies, such as NoSQL databases, Hadoop, and MapReduce. These
technologies allowed for the efficient processing and analysis of massive amounts of data.

Around 2010, data management focused on the increasing amount of data being
generated and the need to store and manage it effectively. Big data technologies such as
Hadoop emerged as a popular solution for managing large data sets. Data warehouses and
data lakes were also commonly used to store and manage structured and unstructured
data. In addition, the rise of cloud computing made it easier for organizations to store
and manage their data in the cloud.

By 2020, data management had evolved to become more focused on data governance, data
privacy, and data security. With the increasing importance of data-driven decision-making,
organizations began to place a greater emphasis on ensuring the quality and accuracy of
their data. The General Data Protection Regulation (GDPR) also came into effect in 2018,
which set strict standards for data privacy and protection. As a result, organizations
began to implement more robust data governance and security measures to protect their
data from breaches and other security threats.

In addition, the use of artificial intelligence (AI) and machine learning (ML) in data
management became more prevalent in 2020. AI and ML algorithms were used to
analyze and make sense of large data sets, allowing organizations to gain valuable
insights and improve decision-making. The use of cloud computing also continued to
grow, with many organizations using a hybrid cloud approach to store and manage
their data across multiple clouds and on-premises systems.

Today, data management is a critical aspect of many industries, including healthcare,
finance, and e-commerce. The increasing use of artificial intelligence and machine
learning is driving further innovation in data management, as businesses seek to extract
insights and knowledge from their data to gain a competitive advantage.



[Figure: Timeline of data management. Before 1950: stones; 1950–1960: files; 1965: DBMS;
1970: RDBMS; 1980–1990: data warehousing; 2000: cloud computing; 2010: NoSQL.]

1.2. Areas of Data Management


Data is the most important part of any GIS project. Proficiency in handling and
manipulating spatial data is of utmost importance. Data management principles
encompass the set of guidelines and best practices employed to guarantee the
effective and efficient management of data throughout its entire lifecycle. The sections
below highlight some key data management principles:

1.2.1. Data Acquisition

Data acquisition involves obtaining data from various sources. This topic was introduced
in Chapter 5 of the “Introduction to Data Acquisition for Spatial Intelligence” course,
which outlined the steps and processes involved in identifying the goals and requirements
for the spatial data to be acquired (i.e., what type of data is required and what the end
goal of the acquired data is). Review the various data acquisition methods introduced in
the previous module.

1.2.2. Data cleaning


Data is collected in various forms and from various sources, and often arrives unorganized
or incomplete. Even when data comes from an authoritative source, it may still need to be
prepared before it can be used in a project. Data cleaning is the process of implementing
quality control procedures to ensure that the acquired data is accurate and complete and
does not contain errors or discrepancies. Depending on the dataset, the process might
involve detecting missing values, correcting spelling errors, populating empty fields,
identifying duplicate records and standardizing data values.



Data cleaning will be extensively examined in the upcoming chapter, which will delve
into numerous data cleaning tools and techniques. These tools and methods will be
explored to transform the data into a format suitable for analysis, whether it pertains to
spatial or non-spatial data.
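
The cleaning steps listed in this section can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the method used in the ArcGIS Pro
activities: the records, the field names (town, province, population) and the
PROVINCE_FIXES lookup table are all hypothetical.

```python
# Hypothetical sample records; names and numbers are illustrative only.
RAW_RECORDS = [
    {"town": " Pretoria ", "province": "Guateng", "population": "1200"},
    {"town": "pretoria", "province": "gauteng", "population": "1200"},
    {"town": "Polokwane", "province": "Limpopo", "population": ""},
]

# Assumed lookup of known spellings (lowercase key -> standard value).
PROVINCE_FIXES = {"guateng": "Gauteng", "gauteng": "Gauteng", "limpopo": "Limpopo"}


def standardize(record):
    """Return a cleaned copy: trimmed text, consistent case, known misspellings fixed."""
    town = (record.get("town") or "").strip().title()
    province_key = (record.get("province") or "").strip().lower()
    province = PROVINCE_FIXES.get(province_key, province_key.title())
    raw_population = record.get("population")
    population = int(raw_population) if raw_population not in (None, "") else None
    return {"town": town, "province": province, "population": population}


def find_missing(records):
    """List (row index, field name) pairs whose value is empty."""
    return [
        (index, field)
        for index, record in enumerate(records)
        for field, value in record.items()
        if value in (None, "")
    ]


def drop_duplicates(records):
    """Keep the first occurrence of each (town, province) pair."""
    seen, unique = set(), []
    for record in records:
        key = (record["town"], record["province"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


cleaned = [standardize(r) for r in RAW_RECORDS]   # standardize values, fix spelling
missing = find_missing(cleaned)                   # detect missing values
deduplicated = drop_duplicates(cleaned)           # remove duplicate records
```

Standardizing before de-duplicating matters here: the two Pretoria rows only become
recognisable as duplicates once spacing, case and spelling have been made consistent.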

1.2.3. Data storage


Data storage involves choosing a suitable spatial data storage method and ensuring
that data is organized efficiently (i.e., logical data structure, file naming conventions,
and metadata). Proper storage and organization ensure data accessibility, reliability,
and usability. In Chapter 3 of the “Introduction to Data Acquisition for Spatial
Intelligence” course, we studied spatial data storage frameworks such as file systems and
geodatabases for storing geospatial data. This subtopic is explored in detail in Chapter 3
of this course.
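
As a small illustration of file naming conventions and metadata, the sketch below builds a
predictable file name and a matching metadata record. The naming pattern
(theme_area_YYYYMMDD.ext), the metadata fields and the example values are assumptions made
for this example; an organisation would define its own standard.

```python
import json
from datetime import date


def build_file_name(theme, area, captured, extension):
    """Compose a lowercase, underscore-separated name: theme_area_YYYYMMDD.ext"""
    parts = [theme, area, captured.strftime("%Y%m%d")]
    slug = "_".join(part.strip().lower().replace(" ", "_") for part in parts)
    return f"{slug}.{extension}"


def build_metadata(file_name, source, crs, captured):
    """Describe a dataset so that others can find, trust and reuse it."""
    return {
        "file": file_name,
        "source": source,
        "crs": crs,                       # coordinate reference system
        "captured": captured.isoformat(),
    }


# Hypothetical dataset: a land-use layer captured on 1 May 2023.
name = build_file_name("Land Use", "Tshwane", date(2023, 5, 1), "gpkg")
metadata_json = json.dumps(
    build_metadata(name, "field survey", "EPSG:4326", date(2023, 5, 1)), indent=2
)
```

Placing the date in year-month-day order means that files sort chronologically when listed
by name.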

1.2.4. Data integration


Data integration is a complex but valuable task. It is the process of combining data from
different sources into a single, unified view. Integration begins with the ingestion process,
and includes steps such as cleansing, ETL mapping, and transformation. This subtopic
will be explored in Chapter 2 of this course.
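
To make the idea of a single, unified view concrete, the sketch below joins two hypothetical
sources on a shared identifier. The datasets, field names and the key-cleansing rule are
assumptions for illustration; in practice the same pattern is usually carried out with GIS
join tools or an ETL product.

```python
# Source A: attribute table, e.g. exported from a spreadsheet (hypothetical values).
clinics = [
    {"clinic_id": "C01", "name": "North Clinic", "beds": 12},
    {"clinic_id": "C02", "name": "River Clinic", "beds": 8},
]

# Source B: coordinates from a separate survey, with a differently named and
# inconsistently typed key column (hypothetical values).
locations = [
    {"Clinic ID": "c01 ", "lat": -25.70, "lon": 28.20},
    {"Clinic ID": "C03", "lat": -26.10, "lon": 27.90},
]


def normalize_key(value):
    """Cleansing step: make keys comparable across sources."""
    return value.strip().upper()


def integrate(attributes, coords):
    """Left-join coordinates onto the attribute rows using the cleaned key."""
    coords_by_id = {normalize_key(c["Clinic ID"]): c for c in coords}
    unified = []
    for row in attributes:
        match = coords_by_id.get(normalize_key(row["clinic_id"]))
        unified.append({
            **row,
            "lat": match["lat"] if match else None,   # None marks "no location found"
            "lon": match["lon"] if match else None,
        })
    return unified


unified_view = integrate(clinics, locations)
```

A left join keeps every attribute row, so records without a matching location remain
visible and can be followed up instead of silently disappearing.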

1.2.5. Data security


Spatial data can reveal sensitive information about people, places, and activities.
Organisations implement various measures to protect spatial data from unauthorized
access, misuse, or disclosure. Data security involves safeguarding data against
corruption, theft, or unauthorized access by following key steps and best practices for
spatial data security and privacy. How spatial data is safeguarded differs per
organisation. In this module we will explore the following steps and security measures to
combat spatial data theft:

1.2.5.1. Understanding the risks


Spatial data can be compromised by malicious actors, insiders, or third parties.
Common risks include data breaches, where hackers exploit vulnerabilities in the spatial
database system or the network to access and exfiltrate the data. Hackers usually
perform the following actions on the compromised data:

i. Data tampering, where attackers alter or delete spatial data to cause harm or
mislead users or decision-makers.
ii. Data inference, where adversaries use spatial data to infer sensitive information
about individuals or groups.
iii. Data linkage, where intruders combine spatial data with other sources to create
a more detailed profile of the data subjects.

1.2.5.2. Implementing access control


The second step to secure spatial data is to implement access control mechanisms that
limit who can access and use the data and for what purposes. Access control can be
enforced at different levels, such as the database level, application level and data
level. For example, a spatial database system can provide authentication and authorization
features to verify users' identities and credentials, as well as log and monitor their
activities.
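
The authorization and logging ideas above can be sketched as a simple role-based check. This
is a minimal illustration, assuming the user has already been authenticated; the role names,
user names and permitted actions are hypothetical and are not taken from any particular
database product.

```python
from datetime import datetime, timezone

# Hypothetical roles and the actions each role may perform on a spatial layer.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "edit"},
    "admin": {"read", "edit", "delete"},
}

# Hypothetical users and their assigned roles.
USER_ROLES = {"thandi": "editor", "sam": "viewer"}

# Every access decision is recorded so that activity can be monitored later.
AUDIT_LOG = []


def is_allowed(user, action):
    """Return True if the user's role permits the action, and log the decision."""
    role = USER_ROLES.get(user)                      # unknown users have no role
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "allowed": allowed,
    })
    return allowed
```

With these assumed roles, is_allowed("thandi", "edit") is permitted, while
is_allowed("sam", "edit") and any request from an unknown user are refused. Both outcomes
are written to the audit log, so refused attempts can be reviewed as well.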

1.2.5.3. Adopting privacy-preserving techniques


The third step to secure spatial data is to adopt privacy-preserving techniques that can
protect the data from unwanted or unnecessary disclosure or analysis. Privacy-preserving
techniques can be applied at different stages of the data lifecycle, such as
data collection, storage, and processing. For instance, during data collection, a
geo-spatial data collector can use differential privacy to add random noise to the data,
which helps prevent individuals from being identified. When storing the data, an
organisation can use secure cloud storage to encrypt the data and distribute it across
multiple servers and locations. Lastly, during data processing, a geo-spatial data
processor can employ homomorphic encryption to perform computations on encrypted
data without decrypting it. All of these techniques help ensure the privacy and rights of
the data subjects while largely preserving the accuracy and quality of the data.
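As a rough illustration of the differential privacy idea, the sketch below adds Laplace noise to a count before it is released. The household count and the epsilon value are hypothetical, and the function names are our own; production systems should rely on a vetted privacy library rather than this simplified version.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with the
    same mean is Laplace-distributed.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise; a count query has sensitivity 1."""
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(42)    # fixed seed so the sketch is repeatable
households_in_block = 37   # hypothetical true value
print(noisy_count(households_in_block, epsilon=0.5, rng=rng))
```

A smaller epsilon adds more noise, giving stronger privacy at the cost of accuracy.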

1.2.5.4. Educating and training the users


The fourth step to secure spatial data is to educate and train the users who are involved
in the creation, management, or consumption of the data, as users may lack the
awareness, knowledge, or skills to handle spatial data properly and responsibly. Users
should also be provided with tools and resources to detect and respond to any incidents
or issues related to geo-spatial data security and privacy, and to report and resolve any
problems or concerns that they encounter or observe.

1.2.5.5. Reviewing and updating regularly


The fifth step to secure spatial data is to review and update the security and privacy
measures and practices regularly. Spatial data security and privacy must be monitored
and improved constantly, so it is essential that organisations assess the effectiveness and
efficiency of the security and privacy measures, identify any gaps or weaknesses, adapt
the measures according to changes in the environment, and audit for compliance and
performance. Regular audits and inspections should also be conducted to ensure that
the security and privacy measures meet user and stakeholder expectations.

1.2.6. Data governance


Spatial data governance provides a structured framework for managing geospatial
data across an organization or multiple organizations. Its policies cover the ongoing
processes and activities involved in keeping geospatial data accurate, up to date, and
reliable. Data may be acquired at the beginning of the year; data governance ensures
that it remains relevant and useful over time, for example by regularly updating spatial
datasets to reflect changes in the real world, such as new roads, buildings, or land-use
patterns. The steps to data governance typically involve the following:

 Develop a spatial data inventory
 Identify spatial data custodians
 Establish a spatial data custodians stakeholder network
 Develop geospatial data policies, standards and guidelines

1.2.7. Data analysis
Spatial data analysis is a versatile and powerful tool for gaining insights into the spatial
aspects of data, which leads to informed decision-making, better resource allocation,
and improved planning and management across various domains.

Spatial analysis is a process in which problems are modelled geographically, results are
derived by computer processing, and those results are then explored and examined.
Geographic information systems use spatial analysis to answer geographic questions.
Spatial analysis has proven to be highly effective for evaluating the geographic
suitability of certain locations for specific purposes, estimating and predicting outcomes,
interpreting and understanding change, detecting important patterns hidden in
information, and much more.

In the previous course "Introduction to Data Acquisition for Spatial Intelligence," we
delved into a specific segment of spatial analysis. In this module, our objective is to
comprehensively examine every facet of data analysis. The choice of data analysis
method depends on the specific goals of the analysis and the nature of the spatial data
being examined.

1.2.7.1. Descriptive Spatial Analysis

i. Spatial Query: Selecting and retrieving geographic features or attributes that
meet specific criteria within a spatial dataset.

ii. Map Overlay: Combining multiple spatial datasets to create new datasets that
represent intersections, unions, or differences between the original datasets.

iii. Buffer Analysis: Creating proximity zones or buffers around geographic features
to assess their spatial relationships with other features.

Review the methods of map overlaying in both raster and vector
formats covered in the previous course.
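A spatial query combined with a buffer can be sketched in a few lines of plain Python. The school and clinic coordinates are hypothetical and are assumed to be in a projected coordinate system with units in metres; in practice a GIS tool would perform this selection.

```python
import math

# Hypothetical point features: (name, x, y) in a projected CRS, units in metres.
schools = [
    ("School A", 100.0, 200.0),
    ("School B", 900.0, 950.0),
    ("School C", 450.0, 300.0),
]
clinic = (400.0, 250.0)


def within_buffer(features, centre, radius_m):
    """Return the names of features that fall inside a circular buffer."""
    cx, cy = centre
    return [
        name
        for name, x, y in features
        if math.hypot(x - cx, y - cy) <= radius_m
    ]


# Which schools lie within 350 m of the clinic?
print(within_buffer(schools, clinic, 350.0))
```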

1.2.7.2. Exploratory Spatial Data Analysis (ESDA):


i. Spatial Visualization: Creating maps and visualizations to explore the distribution
and patterns of spatial data.

ii. Spatial Autocorrelation: Analyzing the degree of spatial similarity or dissimilarity
between neighboring locations.

iii. Hot Spot Analysis: Identifying statistically significant clusters or spatial patterns of
high or low values in the data.
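Spatial autocorrelation is often summarised with the global Moran's I statistic. The sketch below is a simplified version that assumes binary neighbour weights and four hypothetical zones arranged in a row; it is meant to show the calculation, not to replace a statistical package.

```python
def morans_i(values, neighbours):
    """Global Moran's I with binary weights (1 for each listed neighbour pair)."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    cross_products = 0.0
    weight_total = 0
    for i, linked in neighbours.items():
        for j in linked:
            cross_products += deviations[i] * deviations[j]
            weight_total += 1
    variance_sum = sum(d * d for d in deviations)
    return (n / weight_total) * (cross_products / variance_sum)


# Four hypothetical zones in a row; each zone neighbours the zones next to it.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(morans_i([1, 2, 3, 4], neighbours))  # similar values sit together
print(morans_i([1, 4, 1, 4], neighbours))  # high and low values alternate
```

A positive result indicates that similar values cluster together, while a negative result indicates that neighbouring values tend to differ.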

1.2.7.3. Spatial Statistics:


i. Geostatistics: Applying statistical techniques to analyze spatial data, including
variograms for modeling spatial variability and kriging for spatial interpolation.

ii. Spatial Regression: Modeling relationships between dependent and
independent variables while considering spatial dependencies.

iii. Point Pattern Analysis: Studying the distribution and clustering of point data, often
used in fields like epidemiology and criminology.

iv. Inverse Distance Weighting (IDW): Interpolating values by assigning more weight
to nearby points and less weight to distant points.
v. Kriging: A geostatistical technique for estimating values at unobserved locations
based on values at nearby observed locations.
vi. Spatial Regression Models: Applying econometric techniques to analyze
spatially dependent economic and social data, such as housing prices or crime
rates.
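Of the techniques listed above, Inverse Distance Weighting is the simplest to show in code. The sketch below assumes hypothetical rainfall readings on a planar grid and a power of 2, which is a common default; the function name is our own.

```python
import math


def idw(samples, x, y, power=2.0):
    """Estimate a value at (x, y) from (x, y, value) samples using IDW."""
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0:
            return value  # the location coincides with a sample point
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical rainfall readings (x, y, mm) on a simple planar grid.
rain_gauges = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0), (0.0, 10.0, 30.0)]
print(idw(rain_gauges, 5.0, 0.0))
```

The estimate is pulled towards the two nearest gauges because their weights are larger than that of the distant gauge.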

1.2.7.4. Network Analysis:


i. Route Analysis: Finding the shortest path, optimal route, or travel time between
locations on a network (e.g., road network or utility network).

ii. Service Area Analysis: Identifying areas that can be reached within a specified
travel time or distance from a given location.

iii. Network Flow Analysis: Analysing the flow of resources or goods through a
network.
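Route analysis is commonly based on a shortest-path algorithm. The sketch below uses Dijkstra's algorithm on a small hypothetical road network, where the junction names and travel times are invented for illustration.

```python
import heapq


def shortest_path(graph, start, goal):
    """Return (total_cost, path) for the cheapest route, or (inf, []) if none."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return float("inf"), []


# Hypothetical road network: travel time in minutes between junctions.
roads = {
    "A": {"B": 4, "C": 2},
    "B": {"A": 4, "C": 1, "D": 5},
    "C": {"A": 2, "B": 1, "D": 8},
    "D": {"B": 5, "C": 8},
}
print(shortest_path(roads, "A", "D"))
```

The same idea underlies service area analysis, which keeps every junction whose total cost from the start is within a chosen limit.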

1.2.7.5. Remote Sensing Analysis:


i. Image Classification: Categorizing pixels in remote sensing imagery into land
cover or land use classes.

ii. Change Detection: Identifying changes in land cover or features over time by
comparing satellite or aerial imagery.

iii. NDVI Analysis: Using the Normalized Difference Vegetation Index to assess
vegetation health and density.
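NDVI is calculated per pixel as (NIR - Red) / (NIR + Red), where NIR and Red are the near-infrared and red reflectance values. The sketch below applies the formula to three hypothetical pixels; the reflectance values are illustrative only.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    if nir + red == 0:
        return 0.0  # avoid division by zero on empty pixels
    return (nir - red) / (nir + red)


# Hypothetical reflectance values (NIR, Red) for three pixels.
pixels = {
    "dense vegetation": (0.50, 0.08),
    "bare soil": (0.30, 0.25),
    "water": (0.02, 0.05),
}
for label, (nir, red) in pixels.items():
    print(label, round(ndvi(nir, red), 2))
```

Values close to 1 suggest healthy, dense vegetation, values near 0 suggest bare ground, and negative values are typical of water.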

1.2.7.6. 3D Spatial Analysis:


i. 3D Visualization: Creating and analysing 3D representations of geographic
features and terrain.

ii. 3D Modelling: Generating 3D models of buildings, landscapes, or infrastructure.

iii. Viewshed Analysis: Determining areas visible from specific vantage points in a 3D
environment.

1.2.7.7. Spatial Decision Support Systems (SDSS):


i. Location-Allocation: Identifying optimal locations for facilities (e.g., hospitals,
schools) to serve a population efficiently.

ii. Multi-Criteria Evaluation: Combining and analysing multiple spatial criteria to
make decisions about suitable locations or routes.

FA Class Exercise 1.1. Data Management Fundamentals (10 min)


a) Which of the following is not a key component of Data Management?

 Data Acquisition
 Data Positioning
 Data Analysis
 Data Integration
b) Which of the following is the correct order of the evolution of data
management?
 Stones -> Files -> DBMS -> RDBMS -> Cloud Computing -> No SQL
 Files -> Files -> RDBMS -> DBMS -> No SQL -> Cloud Computing
 No SQL -> Files -> DBMS -> RDBMS -> Cloud Computing -> Stones
 Stones -> Cloud Computing -> DBMS -> Files -> DBMS -> No SQL
c) You are a data scientist assigned to clean data collected in the field. Which of
the following methods is best suited to clean the dataset displayed below?

 Detecting missing values
 Correcting spelling errors
 Populating empty fields
 All of the above
d) Which category of analysis is typically used to calculate the most efficient or
cost-effective route for constructing a road between two towns?
 Spatial
 Integer
 Network
 Attribute

e) Data governance ensures that individuals know what the organisation’s data is
about, where it comes from, and the context and purpose of each dataset.
 True
 False

FA. Activity 1.1: Integrating Data in ArcGIS Pro (35 min)


You have been provided with various types of data. Create a single geodatabase that
will contain all the data provided. Ensure that the data has a single coordinate system.
The data that you have been provided with is as follows:

1. Cadastre data – Shapefile; .shp
2. Jabulani Area Image – Tagged Image File; .tif
3. Roads – Shapefile; .shp
4. XY Spreadsheet data – Excel spreadsheet; .xlsx
5. “GAUTENG_ECON_DEV_SMME” – Web hosted layer
6. CAD data – Drawing; .dwg

 Open ArcGIS Pro, then click on “Open Another Project” on the top right corner.

 The Open Project window will pop up. Navigate to “C:\Users\Your
name\Documents\GIS Training\IDA \FA Activity 2.2\Management of Data in Geospatial
Databases_Ex1” and select the ArcGIS Pro project titled “Management of Data in
Geospatial Databases_Ex1”, then click “Ok” on the window.

 The following window will pop open:

 On the Catalog pane, navigate to the project Databases and expand by clicking
the triangle on the left. This is the Default Geodatabase for the project. Once fully
expanded, you can see that the “Erven” dataset is stored here. Right click and import
this layer to the current map.

 Connect to the “FA Activity 2.2” folder and import the “Jabulani_Image” tile to the
current map. Using your knowledge, provide the image layer with the coordinate
system “GCS_Hartebeesthoek_1994”.

 To access the other datasets, connect to the “FA Activity 2.2” folder and import
the “Roads” data to the current map. Using your knowledge, provide the roads
layer with the coordinate system “GCS_Hartebeesthoek_1994”.

 To import the Excel spreadsheet, the file needs to be converted to a table. On the
ribbon, click on the Analysis tab then select Tools.

 The Geoprocessing pane will pop up. Search for the “Excel To Table” tool.
Navigate to the “XY Spreadsheet” located in the “FA Activity 2.2” folder. Click “Run”
at the bottom of the pane. The result is a Standalone Table that can be accessed
under the Contents pane.

 On the Contents pane right click the Table and select “Display XY Data” as
displayed below:

 On the “Display XY Data” window, under Output Feature Class, navigate to the
“FA Activity 2.2” folder and name the feature class “XY_Points”.

 For the Coordinate System click on the earth icon on the “Display XY Data”
window. Add a new coordinate system and save as displayed below.

 At the bottom of the “Display XY Data” window click “OK”.

 On the Catalog pane, select the Portal tab and select the “My Organisation” icon.
Search “Gauteng”, double click “GAUTENG_ECON_DEV_SMME” and add the
layer “GAUTENG_ECON_DEV_SMME_MASTER_SCHEMA”.

 On the ribbon select the Analysis tab and click on Clip as displayed

 Select “GAUTENG_ECON_DEV_SMME_MASTER_SCHEMA” as your Input Feature.
Select the “Polygons” shapefile that is located in the “FA Activity 2.2” folder as the
Clip Features. Choose the “FA Activity 2.2” folder as your location for the Output
Feature and name the layer “ECON_DEV_Jabulani”. Click “Run”.
 On the Contents pane, right click the
“GAUTENG_ECON_DEV_SMME_MASTER_SCHEMA” layer and remove it from the map.

 On the Catalog Pane connect to the “FA Activity 2.2” folder and click the triangle
next to the “Jabulani Region B_FINAL.dwg” file to expand the CAD drawing. Add
the “Polyline” to the Map.

 On the ribbon, click on the Analysis Tab then select Tools. Search for the “Copy
Features” tool. This tool will convert the CAD Feature Class to a shapefile.
 Under Input Features, navigate to the “FA Activity 2.2” folder and double click the CAD
Feature Dataset “Jabulani Region B_FINAL.dwg”, then select “Polyline”. Name your
Output dataset “Jabulani_CAD_Polyline”. Click “Run”.

 On the Geoprocessing pane search for the tool “Define Projection”. Select the
“Jabulani_CAD_Polyline” as your Input Dataset.

 Assign the layer with a Coordinate System using the steps shown below:

 The Define Projection window should look like the image below. Click “Run”.

 Repeat the last 3 steps for all the CAD Feature Datasets.
 You have now prepared all the datasets and will now save them to a new
geodatabase.
 On the Catalog pane, connect to the “FA Activity 2.2” folder. Right click the folder,
select “New” then select “File Geodatabase”. Rename the new geodatabase to
“Jabulani_Ex1”.

 To Add data to the new geodatabase, navigate to the Contents pane and select
the layer that you would like to Export to the geodatabase. Right click this layer
and select “Data” then click “Export Features”. The “Export Features” window will
pop-up.

 Under the Parameters Tab select the Output Location as the Geodatabase that
you have just created. Enter the Output Name as “Erven”.
 Under the Environments tab specify the Output Coordinate System as
“GCS_Hartebeesthoek_1994”. Click “Ok” at the bottom of the window.

 Repeat the last 2 steps for all 6 of the datasets provided.

Show a screenshot of the layers within the geodatabase as well as the layers on the
map.

Chapter 2: Introduction to Spatial Data Science

The student will be introduced to the basic concepts of Data Science and Spatial Data
Science.

Topics to cover
 Data Science and Spatial Data Science
 Spatial Data Science workflows

Learning Objectives
After completing this section, you will be able to:

 Understand and explain the fundamental concepts and principles of data
science and spatial data science
 Understand the common data handling workflows in spatial data science

Assessment
 FA Activity 2.1: Editing Data in ArcGIS Pro
 FA. Activity 2.1: Introduction to ArcGIS Notebooks
 FA. Activity 2.2: Data Exploration and Visualization

2.1. What is Data Science


The advancement of technology has produced large amounts of data. Devices in the
palms of our hands generate rich data assets ranging from detailed human movement
data to data on purchasing behaviour. At the same time, remote sensing devices
generate detailed images of the Earth at a 0.5-meter resolution. This excess amount of
data makes it impossible for a human to parse, clean and analyse it in a reasonable
time.

Data Science is an interdisciplinary field that involves extracting valuable insights and
knowledge from data. It combines techniques from statistics, computer science,
mathematics, and domain expertise to make data-driven decisions and predictions.
Data science is all about how to extract data and utilize it to acquire knowledge.

[Figure: Data Science at the intersection of Mathematics, Domain Knowledge, and
Computer Science]

In order to identify patterns and make predictions, applications of calculus, linear
algebra, and statistics are required. However, one doesn’t have to be an expert
mathematician or statistician to understand how to identify and test the most suitable
analytic method for the problem or case study at hand. At the initial stages of
the data science process, one should have a foundation in programming and
proficiency in languages like Python, R, or Structured Query Language (SQL) and other
scripting languages. However, possessing the technical abilities to retrieve and analyse
data and construct models is of limited value without a background in the specific
industry or domain to which the data science problem is related. Each of these
components alone can be tricky to master. It's crucial to acknowledge that the majority
of data scientists may not be specialists in all of these domains but typically possess a
fundamental understanding of them.

2.2. What is Spatial Data Science


Spatial data science is a subclass of data science that concentrates on geospatial
data, its unique properties, and specialized techniques and computation methods
necessary for deriving insights from this data. Instead of treating spatial data as another
feature in a tabular dataset, spatial data science goes deeper into understanding why
things are happening in a particular place and how they are related, or unrelated, to
the things going on around it. Spatial data science focuses on identifying spatial
relationships based on location, distance, and intersections between objects. It allows
researchers and analysts to uncover hidden spatial patterns, make informed decisions,
and develop models to predict future spatial trends.

2.3. Industry usage of Spatial Data Science


The following are just a few of the many widespread examples of spatial data science in
today's world:

2.3.1. Spatial Planning


Spatial Data Science, when appropriately applied, can help analyze urban growth
and expansion patterns, identifying untapped areas suitable for future development
while considering various factors essential for successful construction.

2.3.2. Agriculture
Spatial Data Science is currently used for precision farming, soil analysis, and crop yield
prediction for efficient agricultural practices. It also helps farmers create more efficient
harvesting practices. Food production has soared, and environmental standards have
improved with the help of geospatial analysis.

2.3.3. Disaster Management


Predicting and mitigating the impact of natural disasters like floods, earthquakes, and
wildfires.

2.3.4. Healthcare
Tracking disease spread, identifying outbreak hotspots, and optimizing healthcare
facility locations

2.3.5. Telecommunication
Telecommunications companies use spatial data science to build and improve
networks and track consumer requests and maintenance schedules. 5G mobile
internet connectivity is being expanded using geospatial data analysis.

2.3.6. Transportation
GIS can help with several transportation problems, such as identifying dangerous
intersections, improving road optimization, and choosing the optimal location for a
new road or rail network.

2.3.7. Natural resource management


Managing forests, water bodies, and wildlife habitats for sustainable resource
utilization.

2.4. Spatial Data Science Workflow


The spatial data science workflow is flexible and can be adapted to address a wide
range of spatial problems and applications, from urban planning and natural resource
management to location-based services and disaster response. The spatial data
science workflow involves applying data science techniques and methodologies to
geographic data. The workflow does not follow a linear path and may instead resemble
the process displayed below.

In Chapter 5 of the "Introduction to Data Acquisition for Geospatial Intelligence," our
emphasis was on conventional data acquisition methods. However, spatial data
science transcends the boundaries of traditional GIS by embracing a more extensive
array of data sources, encompassing big data and unstructured data, and harnessing
advanced technologies like machine learning and real-time data processing. This
expansion enhances the data collection process in spatial data science.

 Wide Data Sources: Spatial data science often incorporates a wide range of
data sources, including remote sensing (satellite imagery, LiDAR), social media
data, sensor data, crowd-sourced data, and more.
 Big Data: Spatial data science deals with big data, which may require
advanced technologies like distributed computing and machine learning for
processing and analysis.
 Real-time Data: It often involves real-time or near-real-time data acquisition,
especially in applications like traffic monitoring, weather forecasting, and
disaster management.
 Unstructured Data: Spatial data science may deal with unstructured data, such
as natural language text from social media or web sources, which requires text
mining and NLP techniques.
 Machine Learning: Machine learning algorithms are commonly used for pattern
recognition, classification, and prediction in spatial data science.
 Data Fusion: Combining data from multiple sources and formats is a common
practice to derive meaningful insights.

In this chapter, we will primarily concentrate on the four middle phases of the workflow.

2.4.1. Data Cleaning


The Data Cleaning stage plays a pivotal role in the spatial data science process,
primarily focusing on verifying the accuracy and completeness of datasets to answer
spatial questions. Data cleaning, often referred to as data cleansing or data scrubbing,
encompasses the identification and correction of errors, inconsistencies, and
inaccuracies within a dataset. Its primary aim is to ensure that the dataset is not only
accurate and reliable but also well-suited for analysis. Data cleaning is an iterative
procedure that may entail multiple rounds of scrutiny and adjustments. Its ultimate
objective is to shape the data in a manner that faithfully represents real-world
phenomena, thus guaranteeing that subsequent data analysis and modelling efforts
yield dependable and meaningful results.

The following are some key aspects of data cleaning:

2.4.1.1. Handling Missing Data:


This process includes identifying and dealing with missing values in a dataset. Users may
utilise multiple options which include removing rows with missing data, imputing missing
values with statistical methods, or using domain knowledge to fill in missing information.
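The first two options can be sketched in plain Python. The erf numbers and areas below are hypothetical, and mean imputation is only one of several possible statistical methods.

```python
from statistics import mean

# Hypothetical attribute records; None marks a missing value.
records = [
    {"erf": 101, "area_m2": 450.0},
    {"erf": 102, "area_m2": None},
    {"erf": 103, "area_m2": 550.0},
]

# Option 1: drop rows with missing values.
complete = [r for r in records if r["area_m2"] is not None]

# Option 2: impute missing values with the mean of the known values.
known_mean = mean(r["area_m2"] for r in complete)
imputed = [
    dict(r, area_m2=known_mean if r["area_m2"] is None else r["area_m2"])
    for r in records
]

print(len(complete), imputed[1]["area_m2"])
```

Dropping rows reduces the dataset, while imputation keeps every record but introduces estimated values, so the choice depends on how the data will be used.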

2.4.1.2. Dealing with Outliers


This procedure involves detecting and handling outliers, which are data points that
significantly deviate from the rest of the dataset. Users may choose to remove,
transform, or replace outliers depending on the context. Outliers may occur either as
attribute values within the data or as spatial anomalies.
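One common way to flag attribute outliers is the interquartile range (IQR) rule, sketched below. The property values are hypothetical and the multiplier of 1.5 is a conventional choice rather than a fixed requirement.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical property values in thousands of rand; one entry looks suspicious.
prices = [850, 900, 920, 880, 910, 870, 9500]
print(iqr_outliers(prices))
```

A flagged value is not automatically wrong; it should be checked against the source before it is removed, transformed, or replaced.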

2.4.1.3. Data Type Conversion:


This method ensures that data types are suitable for analysis and, if needed, performs
data type conversions, for example converting categorical variables into numerical
formats when it's necessary for the analysis.

2.4.1.4. Deduplication
Identifying and removing duplicate records or entries in the dataset.

2.4.1.5. Standardization and Normalization


This process entails standardizing or normalizing data to bring variables to a consistent
scale or distribution. This standardization enables easier comparison and analysis of the
data.
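Two widely used rescaling methods are min-max normalization and z-score standardization. The sketch below assumes a hypothetical list of distances and that the values are not all identical, since both formulas divide by the spread of the data.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values to the range 0 to 1 (normalization)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1 (standardization)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


distances_km = [2.0, 4.0, 6.0, 8.0]  # hypothetical distances to the nearest clinic
print(min_max(distances_km))
print(z_scores(distances_km))
```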

2.4.1.6. Handling Inconsistent Data


This approach deals with discrepancies in data entries, such as variations in spelling for
the same category or inconsistencies in date formats.

2.4.1.7. Addressing Data Integrity Issues


Verify data integrity by checking for referential integrity constraints, ensuring that
relationships between different features or datasets are maintained.

2.4.1.8. Feature Engineering


This technique emphasizes the creation of new features or variables derived from
existing ones, with the aim of enhancing data quality and making it more suitable for
modelling purposes.

2.4.1.9. Data Transformation


This process involves converting data from one format or structure into another to
make it more suitable for analysis or to meet specific requirements.

2.4.1.10. Data Validation:


Cross-check data against external sources or domain knowledge to validate its
accuracy. This is particularly important when dealing with critical or sensitive data.

2.4.1.11. Coordinate Validation


This method checks for valid geographic coordinates within the dataset to ensure they
fall within expected ranges (e.g., latitude between -90 and 90 degrees, longitude
between -180 and 180 degrees). Any outliers or invalid coordinates are then identified
and corrected.
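The range check described above can be written as a short function. The point names and coordinates below are hypothetical and are assumed to be in decimal degrees.

```python
def is_valid_coordinate(lat, lon):
    """Check that a latitude/longitude pair lies within the valid ranges."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


# Hypothetical records: (name, latitude, longitude) in decimal degrees.
points = [
    ("P1", -26.25, 27.86),
    ("P2", -126.25, 27.86),  # latitude out of range
    ("P3", -29.86, 310.02),  # longitude out of range
]
invalid = [name for name, lat, lon in points if not is_valid_coordinate(lat, lon)]
print(invalid)
```

This check only catches impossible values; a coordinate can be within range and still be in the wrong place, for example when latitude and longitude have been swapped.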

2.4.1.12. Topology and Geometry Checks


This involves confirming that spatial entities, such as polygons and lines, conform to
topological principles, free from self-intersections or gaps. It also entails verifying the
accuracy of spatial connections, such as ensuring polygons don't overlap
inappropriately or that line features are correctly interconnected.

2.4.1.13. Geocoding and Address Standardization


Geocoding and address standardization refer to the processes of converting addresses
into geographic coordinates (latitude and longitude) and ensuring that addresses are
formatted consistently according to established standards.

2.4.1.14. Projection Consistency


Ensuring that all spatial data layers in the analysis have consistent coordinate reference
systems (CRS) to avoid errors in spatial calculations, for example by re-projecting
datasets as needed to align them with a common CRS.

2.4.1.15. Data Integration and Joining


This method involves merging and integrating spatial datasets from different sources,
ensuring that the joining criteria and spatial relationships are correctly defined.

2.4.1.16. Addressing Data Completeness


Checking for missing spatial data, including missing geometries or attributes, and
addressing these gaps through imputation or by obtaining additional data sources.

2.4.1.17. Validation against Reference Data


Validate spatial data against authoritative or reference datasets to ensure alignment
with ground truth and improve data accuracy.

Data cleaning doesn't always follow a linear path, and in some cases, it may be a
combination of various methods described above to effectively cleanse the data.

FA Activity 2.1: Editing Data in ArcGIS Pro (20 min)

You have been tasked with preparing a dataset for analysis by another department,
and one of the common tasks involved is the creation and refinement of building
polygons. This process involves digitizing, which essentially means using clicks on the
map to generate features in the form of points, lines, or polygons. When digitizing, it's
typical to overlay the features on top of an image or a basemap layer, as digitization
involves approximations of locations.

 Open the “Exercise 1.aprx” project in ArcGIS Pro.

 You will notice that some buildings are missing, some need to be modified, and
some need to be moved.

 You will add a new building using “map templates” in ArcGIS Pro

 The following pop up box will appear, click on the “buildings feature
template” and select the rectangle to capture the guard house.

 Use the “Rectangle” capture tool to trace over the guard house

 After completing the trace, the guard house will be selected and a
pop up message will appear to show that the capture is complete.

 To capture another feature, Select the “Polygon” capture tool

 Use the polygon tool to trace over the house, click on “Finish” when
the trace is complete

 After completing the trace, the house will be selected and a pop up
message will appear to show that the capture is complete

 To move a feature, “select and drag” it with the pointer. On the Edit
toolbar select the “move” tool.

 The following “pop up” window will appear. You will need to ensure that
the feature you wish to move is the only one selected.

 Right click on the Erven layer and click on “Unselect” to deselect the
erven layer.

 Now that the building layer is the only feature selected, click on the
“move” tool as displayed below and start moving the building.

 Drag and position the building to the correct position until it is properly
located in the correct place.

 Once the building has been moved, a pop up message will appear to
confirm.

 Deselect the selected building. The move has been completed and a
pop up message will appear.

 The last task is to delete a feature that is incorrectly captured. The building
selected below has been incorrectly captured.

 Use the “selection” tool adjacent to the “delete” button as displayed
below, select the building to be deleted and press the delete button.

 Note: Ensure that the building to be deleted is the only building selected.

FA Activity 2.2: Introduction to ArcGIS Notebooks (35 min)


You have assumed the role of a GIS intern at the eThekwini Metropolitan Municipality.
Your supervisor has assigned you a task involving 907 properties of interest, which are
potential vacant properties for Human Settlements development. To facilitate this task,
your supervisor has provided you with a Valuation Roll for the area, but there's a
challenge: there is no direct link between the Valuation Roll and Property Boundaries.

Upon closer inspection, you've identified that you can establish a common link using the
Description column in the Valuation Roll dataset. Your supervisor has instructed you to
create SG codes, each a unique 21-character code, from the information in the
Description column. To accomplish this task, you will be using ArcGIS Notebooks as part
of your spatial data analysis and manipulation process.

Property         Township      Erf         Portion
Code             N0FT0045      00004482    00000
No. of digits    8 digits      8 digits    5 digits
E.g.,            Cato Manor    4482        0
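The structure in the table can be expressed as a small Python function that zero-pads the erf and portion numbers and joins them to the township code. This is a sketch based only on the table above and is not the code from the activity notebook; the function name build_sg_code is our own.

```python
def build_sg_code(township_code, erf, portion):
    """Join an 8-character township code, 8-digit erf and 5-digit portion."""
    sg_code = f"{township_code}{int(erf):08d}{int(portion):05d}"
    if len(sg_code) != 21:
        raise ValueError(f"expected 21 characters, got {len(sg_code)}")
    return sg_code


# Cato Manor (township code N0FT0045), erf 4482, portion 0, as in the table.
print(build_sg_code("N0FT0045", 4482, 0))
```

The length check guards against township codes that are not 8 characters long or numbers that are too large for their field.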

 Navigate to “C:\Users\Your name\Documents\GIS Training\MDGD” from your
virtual desktop machine and “Select” the “FA Activity 1.2” folder then click “OK”.

 Open the “MyProject.aprx” on the map. The following window will appear.



NB: This is the total number of erven in the area. The unit wishes to select the properties
listed in the attached spreadsheet to be evaluated for the acquisition process.

 Expand “Notebooks”, right click on “New Notebook.ipynb” and click on
“Open Notebook”.

 The following page will appear.



 Locate the save button on the top left corner of the screen and then click on
“Save”.

 In this step, you will import the necessary Python modules to execute the cells in
the notebook. A Python module is a file that contains Python definitions and
statements. A module can define functions, classes, and variables, and it can
include runnable code. You will use the import statement to import the modules.
 Click on the “Run” button. The cell will show [*] while it is running and a
number such as [1] once it has executed, as displayed below.



 Run the second code snippet, “Load data from Excel into Data Frame”. The code
uses the pandas Excel “read” function to load the “Valuation Roll.xlsx” dataset
into a data frame in the ArcGIS Notebook.

 The following table will appear after you run the “df” command.



a) How many properties or records are in the Valuation Roll
dataset? ____907_____________________________________________
b) How many columns are in the Valuation Roll dataset?
____19_______________________________________________________
 Study the “Description” attribute. You will notice that each row contains sufficient
information to build the 21-character SG code.
 The following code snippet extracts the Erf and Portion values into a single
column, separated by a comma.



 The following code snippet splits the Erf and Portion values into two columns
(Column1 for Portion and Column2 for Erf).
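
The notebook's own code is not reproduced in this guide. As a rough sketch of the extract and split steps above, in plain Python rather than pandas, the example below assumes descriptions shaped like “PTN 3 OF ERF 4482 CATO MANOR”. That format and the helper names extract_numbers and split_columns are assumptions for illustration, so check them against your own Valuation Roll.

```python
import re


def extract_numbers(description):
    """Return every run of digits in the description, joined by commas."""
    return ",".join(re.findall(r"\d+", description))


def split_columns(extracted):
    """Split 'a,b' into (Column1, Column2); Column2 is None when absent."""
    parts = extracted.split(",", 1)
    column1 = parts[0] if parts[0] else None
    column2 = parts[1] if len(parts) > 1 else None
    return column1, column2


# Hypothetical descriptions: one with a portion, one without.
descriptions = ["PTN 3 OF ERF 4482 CATO MANOR", "ERF 228 CATO MANOR"]
rows = [split_columns(extract_numbers(d)) for d in descriptions]
print(rows)
```

The second row shows why a clean-up step is needed: when a description has no portion, the Erf number lands in Column1 and Column2 is left empty.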



 NB: Notice that “Column1” contains a mixture of Erf and Portion values. We will
use more functions to clean up these errors.

 Select the properties that do not contain portion values by using the statements
below.

c) How many properties do not contain portion values?


____________________________________________________
 Use the following code to replace the blank Erf numbers (“NaN”) stored in Column2
with the correct Erf numbers stored in Column1. Then replace the Erf numbers
stored in Column1 with a zero (“0”) for all properties with no portion values.
 Run all the code snippets under the “Clean Erf and Portion Columns”
subheading.
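
A minimal sketch of this clean-up rule, again in plain Python with a hypothetical helper name (clean_erf_portion) and made-up sample rows. It assumes the raw split produced (Column1, Column2) pairs in which Column2 is empty whenever the description had no portion.

```python
def clean_erf_portion(column1, column2):
    """Return (erf, portion) from the raw split columns.

    When Column2 is empty, Column1 actually holds the Erf number and the
    property has no portion, so the portion is set to "0".
    """
    if column2 is None:
        return column1, "0"
    return column2, column1


# Made-up raw rows in (Column1, Column2) order.
raw_rows = [("3", "4482"), ("228", None), ("1", "907")]

# Count the properties without a portion value (compare with question c).
no_portion_count = sum(1 for _, column2 in raw_rows if column2 is None)

cleaned = [clean_erf_portion(c1, c2) for c1, c2 in raw_rows]
print(no_portion_count, cleaned)
```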



 Now that the “Erf” and “Portion” columns are clean, run the code to pad the
“Erf” and “Ptn” columns with leading zeros, e.g. Erf 228 becomes 00000228.

 Run all the code snippets under the “Create SG Code” subheading
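
The zero-padding and concatenation can be sketched as follows. The function name build_sg_code is hypothetical; the 8 + 8 + 5 character layout and the township code N0FT0045 come from the table at the start of this activity.

```python
def build_sg_code(township_code, erf, portion):
    """Build a 21-character SG code: township (8) + Erf (8) + portion (5)."""
    erf_part = str(erf).zfill(8)          # e.g. 228 -> "00000228"
    portion_part = str(portion).zfill(5)  # e.g. 0 -> "00000"
    code = township_code + erf_part + portion_part
    if len(code) != 21:
        raise ValueError(f"SG code must be 21 characters, got {len(code)}: {code}")
    return code


print(build_sg_code("N0FT0045", 4482, 0))
```

The length check is a cheap safeguard: an Erf number longer than eight digits or a mistyped township code would otherwise produce a code that silently fails to join.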
 Now that all the “SG Codes” are created, run the code snippets to join the
“Valuation Roll” to the “Erf_Boundaries” layer.

 The following layer will be added to the map and overlaid on
“Erf_Boundaries”.
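
Conceptually, the join matches each parcel to the Valuation Roll record that shares its SG code. The sketch below illustrates that idea with plain Python dictionaries; it is not the ArcGIS join tool, and the field names (SG_CODE, OWNER_TYPE, AREA_SQM) and values are invented for illustration.

```python
# Made-up Valuation Roll records keyed by SG code.
valuation_roll = [
    {"SG_CODE": "N0FT00450000448200000", "OWNER_TYPE": "Municipal"},
    {"SG_CODE": "N0FT00450000022800000", "OWNER_TYPE": "Private"},
]

# Made-up parcel attributes; the third parcel is not on the roll.
erf_boundaries = [
    {"SG_CODE": "N0FT00450000448200000", "AREA_SQM": 512.0},
    {"SG_CODE": "N0FT00450000022800000", "AREA_SQM": 930.5},
    {"SG_CODE": "N0FT00450000099900000", "AREA_SQM": 400.0},
]

# Index the roll by SG code so each lookup is a single dictionary access.
roll_by_code = {row["SG_CODE"]: row for row in valuation_roll}

# Keep only parcels that have a matching roll record and merge the attributes.
joined = [
    {**parcel, **roll_by_code[parcel["SG_CODE"]]}
    for parcel in erf_boundaries
    if parcel["SG_CODE"] in roll_by_code
]

# Parcels with no match are worth listing for quality control.
unmatched = [p["SG_CODE"] for p in erf_boundaries if p["SG_CODE"] not in roll_by_code]
print(len(joined), unmatched)
```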



 Right click on the newly added layer and confirm the number of rows.
a) Are “Valuation Roll” attributes appended to the “Erf_Boundaries” layer?
_____________________________________________________________________
 Create a “Map Layout” with all elements of a map and send the map to your
Facilitator
2.3.2. Data Exploring and Discovery
Data exploration is a pivotal stage in spatial data science that plays a fundamental role
in comprehending the underlying geographic phenomena. It paves the way for
subsequent processes such as spatial modeling, decision-making, and policy
development. This crucial process empowers analysts to unearth valuable insights and
formulate well-informed spatial recommendations. Data exploration in spatial data
science encompasses a comprehensive approach involving the examination,
visualization, and analysis of geographic or spatial datasets. This approach is designed
to provide insight into the spatial patterns, relationships, and characteristics
within the data. The following are key aspects of data exploration in spatial data
science.



2.3.2.1. Spatial Visualization


Spatial visualization means generating maps to visually represent spatial data, employing
various map types such as choropleth maps, heatmaps, or scatterplots to illustrate spatial
distributions and patterns effectively. This subtopic will be addressed in the upcoming modules.

2.3.2.2. Spatial Relationships


Explore spatial relationships between different geographic features. Identify adjacency,
proximity, and connectivity to understand how spatial elements interact.

2.3.2.3. Spatial Autocorrelation


Spatial autocorrelation is the degree to which the values of a variable at one location
are correlated with the values of the same variable at nearby locations. In simpler terms,
it assesses whether similar values tend to cluster together in space. Statistical
techniques such as Moran's I measure spatial autocorrelation, which helps identify whether
values are clustered or dispersed across space.
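
As an illustration, the sketch below computes global Moran's I from first principles for four locations arranged in a line, where each location neighbours the one next to it. The function name, the binary neighbour weights and the sample values are assumptions chosen to keep the arithmetic easy to follow; GIS software offers ready-made tools that also report statistical significance.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of products of deviations for every pair of locations.
    cross_products = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    squared_deviations = sum(d * d for d in deviations)
    return (n / total_weight) * (cross_products / squared_deviations)


# Four locations in a row: each one is a neighbour of the next (weight 1).
line_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

gradual = morans_i([1, 2, 3, 4], line_weights)      # similar values sit together
alternating = morans_i([1, 5, 1, 5], line_weights)  # neighbours are dissimilar
print(round(gradual, 3), round(alternating, 3))
```

A positive result indicates that similar values sit next to each other (clustering), while a negative result indicates that neighbouring values tend to differ (dispersion).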

2.3.2.4. Clustering Analysis


Clustering is the task of grouping a set of objects in such a way that observations in the
same group are more similar to each other than to those in other groups.

2.3.2.5. Hotspot Analysis


Hotspot analysis identifies spatial hotspots or coldspots by analyzing areas with
statistically significant high or low values.

At this stage of the workflow, one can select one or multiple methods to visualise and
explore data.

FA Activity 2.3: Data Exploration and Visualization (45 min)


You have been appointed at Joe Gqabi District Municipality as a GIS intern. Your
supervisor assigned you a task to prepare a Property Registration report for the Municipal
Manager to be discussed in the upcoming Executive Council Meeting. Your supervisor
asked you to report on the following.

I. The number of registered parcels.
II. The total extent of land for the registered parcels and the unregistered parcels.
III. The total amount (price) of the registered parcels, with a map of the prices.



 Start ArcGIS Pro


 In ArcGIS Pro, under “New Project”, click “Map”.

 Capture the name of the project as “Data Exploration and Cleaning” on the
dialog box.

 Click on the “Create a new folder for this project” check box.

 Click “OK” to create the new project.

 The following window will appear.



 Click the “View” tab on the ribbon.

 In the Windows group, click “Catalog Pane”. The following window will be shown
on the right side of the main window.



 Right click on the “Folders” tab to connect to the downloaded data

 Navigate to “C:\Users\Your name\Documents\GIS Training\MDGD” from your
virtual desktop machine and “Select” the “FA Activity 1.1” folder then click “OK”.

 Expand the mini triangle in front of the “FA. Activity 1.1” folder until you see the
silver container titled “MyProject.gdb” and layers as indicated in the image
below.



 Right click on the “Erf_Boundaries” layer and click “add to current map”.

 Click the “Map” tab. In the Navigate group, make sure that “Explore” is
selected.

 Click on an Erf/stand/parcel on the map. In the Pop-up pane, scroll down, if
necessary, to see all the attributes.



 Click the “Data” tab. In the Table group, click on “Attribute Table”.

 Scroll towards the end while carefully studying the column header as well as the
data in the rows.
 Notice that the last four fields are useful to create the report that is required,
however the data require cleaning to be able to produce meaningful results
suitable for decision making.
Step 1: The number of Registered Parcels
 Place the cursor on the “Status” field, right click, then select “Statistics”.

 The following “Chart” will pop up next to the attribute table.



 Click on the “Chart Properties”.


 Click on the “Data” tab and click on the “Label bars” check box.

 Click on the “Axes” tab and increase the “Label character limit” to “15”.
 Click on the “General” Tab.



 In the “Chart Title” box type “REGISTRATION STATUS”, and in the “X axis” title box
type “Registration Status”.

 Click on the “Export” button and select “Export as Graphics”, type
“Registration Status Chart” in the blank space provided, accept the default
path for storing the image, change the format to JPEG, then click “Save”.
The exported chart can be used to answer question I above.

Step 2: The total extent of land for the registered parcels and the unregistered parcels

 Notice that there is an “Extent” and an “Extent Unit” column.


 Start by understanding the different units in the “Extent Unit” column.
 Click on the “Data Engineering” tab under “View”. The fields panel displays the field
aliases of all the visible fields of the selected layer. Select the “Extent” and “Extent
Units” columns and drag the fields to the statistics panel.



 Click on the “Calculate” button. This will display statistical and data quality
metrics of each field in the data as columns in a table.

a) Study the fields’ statistics and populate the following table

Properties                    Extent        Extent Units

Data Type

Feature Count

Number of Null Values

Number of Unique Values

Frequently Appearing Value

Least Common Value

b) Study the Extent Units chart and note your observations below

___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
 Use the “Select by Attributes” tool to select the “sqm” and “SQM” values, then use
the “Field Calculator” to set them to “Sqm” in the “Extent Units” field.



 Select the records with “Ha” units and convert the area to square metres using the
“Field Calculator”, i.e. Extent field * 10000. Then change the “Ha” extent units to “Sqm”.
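
The two standardisation steps above (one spelling for square metres, and hectares converted at 10 000 square metres per hectare) can be expressed as a single rule. The sketch below is plain Python with a hypothetical function name; in ArcGIS Pro a similar expression could be adapted for the Field Calculator.

```python
SQM_PER_HECTARE = 10_000


def extent_to_sqm(extent, unit):
    """Return (extent in square metres, "Sqm"), or (None, None) if a value is missing."""
    if extent is None or unit is None:
        # Missing extents are filled later from the parcel geometry.
        return None, None
    unit_key = unit.strip().lower()
    if unit_key == "sqm":  # covers "sqm", "SQM" and "Sqm"
        return float(extent), "Sqm"
    if unit_key == "ha":
        return float(extent) * SQM_PER_HECTARE, "Sqm"
    raise ValueError(f"Unrecognised extent unit: {unit!r}")


print(extent_to_sqm(2.5, "Ha"), extent_to_sqm(450, "SQM"))
```

Raising an error for unknown units, rather than guessing, makes any unit that was not spotted in the statistics panel visible immediately.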



 Select the records with Null values in the “Extent” field and calculate their area using the “Calculate Geometry” tool.



 Select the “Registered” field, click on selected then click Statistics.



 Right click on the “Extent” Row and “Extent Unit” and select “Remove field”.

Step 3: The total amount of the registered parcels and prepare a map for the prices.

 Select the “Amount” and “Extent Units” columns and drag the fields to the
statistics panel.
 Click on the “Calculate” button. This will display statistical and data quality
metrics of each field in the data as columns in a table.



 Right click on the “Chart Preview” cell, hover on the “Create Chart” button, then
select “Bar Chart”.

 The “Bar Chart” is displayed below, notice that the field contains a mixture of
numeric values and text.

 The data needs to be populated into a new numeric field that will store
numerical values only. Click on Data, Field then Create a “New” field with the
following specifications: Field Name “PROPERTYAMOUNT” and Field Alias
“PROPERTY AMOUNT” and Field Type “Long” then click “Save”.



 Populate the “AMOUNT” field contents into the newly created field.

 Ensure that all the data has been populated by calculating statistics for the new
“PROPERTY AMOUNT” field.
 Right click on the “Chart Preview” cell, hover on the “Create Chart” button, then
select “Bar Chart”.
 Compare the original “AMOUNT” and the newly created “PROPERTY AMOUNT”
charts to ensure that all the values have been populated.



 You will notice that “R1.00” was not populated. Select the row that contains
“R1.00” and populate “1” in the “PROPERTY AMOUNT” field.



 Select the rows that contain “Null” values and populate “0” in the “PROPERTY
AMOUNT” field.
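
The manual fixes above follow one rule: strip any currency text, keep the number, and treat missing values as zero. A minimal sketch of that rule in plain Python is shown below; the function name parse_amount and the sample inputs are assumptions, and whether a missing price should really be stored as 0 is a decision the activity makes for mapping purposes.

```python
import re


def parse_amount(raw):
    """Convert a text amount such as 'R1.00' to a whole number of rand.

    Null, blank or non-numeric values become 0, matching the activity's rule.
    """
    if raw is None or not str(raw).strip():
        return 0
    digits = re.sub(r"[^\d.]", "", str(raw))  # drop 'R', spaces and commas
    if not digits:
        return 0
    return int(round(float(digits)))


samples = ["R1.00", "250000", "R 1,250,000", None, "N/A"]
print([parse_amount(s) for s in samples])
```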

 Symbolise the map using the “PROPERTY AMOUNT” field, set the method to
“Quantile” and select “5 Classes”.
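
Quantile classification places roughly the same number of features in each class. The sketch below shows the idea with Python's standard statistics module and made-up amounts; the helper names are hypothetical, and the exact break values that ArcGIS Pro computes may differ because software packages interpolate quantiles in slightly different ways.

```python
import statistics


def quantile_breaks(values, classes=5):
    """Return the cut points that divide the values into equal-count classes."""
    return statistics.quantiles(sorted(values), n=classes, method="inclusive")


def classify(value, breaks):
    """Return the 1-based class number of a value for the given breaks."""
    return 1 + sum(1 for b in breaks if value > b)


# Made-up property amounts in rand.
amounts = [0, 1, 50_000, 120_000, 250_000, 300_000, 450_000, 600_000, 900_000, 2_000_000]
breaks = quantile_breaks(amounts)
classes = [classify(a, breaks) for a in amounts]
print(breaks)
print(classes)
```

Because the classes are based on counts rather than value ranges, a few very expensive properties do not stretch the legend; the trade-off is that properties with quite different prices can fall into the same class.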

c) Study the Property Amounts map and write your observations below.
___________________________________________________________________________

___________________________________________________________________________

___________________________________________________________________________



2.3.3. Data Processing


In the previous chapter, we delved into spatial data processing and analysis. In this
chapter, our focus will shift towards exploring data processing within the realm of spatial
data science. Spatial data science is an advanced version of traditional spatial data
processing. It expands its capabilities to handle complex spatial problems by drawing on
several fields of knowledge, including advanced analysis, big data tools, and
machine learning. Spatial data science deals with modern challenges that need
deeper insights and predictive abilities in various areas.

Traditional spatial data analysis and spatial data science differ in various ways. The table
below summarises some key differences between the two.

Component: Scope and Complexity
Traditional GIS spatial analysis methods: Typically focuses on basic geographic operations
like querying, mapping, and basic statistical analysis. It often involves simpler spatial tasks.
Spatial data science: Encompasses a broader range of advanced analytical techniques,
machine learning, big data, and geospatial modelling. It tackles complex spatial problems.

Component: Data Volume and Variety
Traditional GIS spatial analysis methods: Works with smaller datasets, often limited to
geographic data types like points, lines, and polygons.
Spatial data science: Handles large and diverse datasets, including unstructured data like
satellite imagery and social media data, and incorporates various data types and sources.

Component: Interdisciplinary Nature
Traditional GIS spatial analysis methods: Primarily conducted within the domain of GIS
(Geographic Information Systems) and may not require extensive knowledge of statistics or
data science.
Spatial data science: Combines expertise in GIS, statistics, data science, and computer
science. It often requires interdisciplinary skills.

Component: Complex Modeling
Traditional GIS spatial analysis methods: Primarily uses simple descriptive statistics and
basic spatial operations.
Spatial data science: Employs advanced statistical and machine learning models to make
predictions, identify patterns, and solve complex spatial problems.

Component: Big Data and Scalability
Traditional GIS spatial analysis methods: May struggle with the scalability of large
datasets and big data analytics.
Spatial data science: Utilizes big data technologies and distributed computing to process
and analyze massive spatial datasets efficiently.

Component: Spatial Algorithms
Traditional GIS spatial analysis methods: Utilizes basic spatial algorithms for operations
like buffering, distance calculations, and overlays.
Spatial data science: Implements advanced spatial algorithms for tasks such as
geostatistics, spatial regression, and network analysis.

Component: Spatial Visualization
Traditional GIS spatial analysis methods: Focuses on basic cartographic representation and
map production.
Spatial data science: Uses advanced geospatial visualization techniques and interactive
dashboards for data exploration and communication.

Component: Predictive Modeling
Traditional GIS spatial analysis methods: Less emphasis on predictive modeling and
forecasting.
Spatial data science: Incorporates predictive modeling to forecast spatial trends and make
data-driven decisions.

Component: Machine Learning
Traditional GIS spatial analysis methods: Rarely incorporates machine learning techniques.
Spatial data science: Integrates machine learning for tasks such as image classification,
land cover mapping, and spatial pattern recognition.

Component: Data Engineering
Traditional GIS spatial analysis methods: Primarily focuses on data preparation for mapping
and basic analysis.
Spatial data science: Requires extensive data engineering to process, clean, and prepare
diverse and large-scale geospatial datasets.

2.3.4. Modelling
Spatial data science modelling involves various techniques for optimization, simulation,
and prediction. These techniques help analyse and understand spatial phenomena,
make informed decisions, and forecast future events or trends in various fields including
urban planning, transportation, public health, environmental management, and more.
The subsections below provide an overview of these concepts within spatial data
science:

2.3.4.1. Suitability modelling


Spatial optimization is a specialized branch of optimization that focuses on finding the
best solutions to problems where spatial or geographic considerations play a significant
role. It involves determining the optimal arrangement or allocation of resources,
facilities, or entities within a spatial context to achieve specific objectives while
considering spatial constraints and factors.

 Location-Allocation: Determining the optimal locations for facilities (e.g.,
warehouses, clinics) to minimize transportation costs or maximize coverage.
 Routing and Network Optimization: Finding the most efficient routes or paths
through a network, considering factors like travel time, distance, or cost.
 Land Use Planning: Optimizing land use allocation to maximize resource
utilization or minimize environmental impact.
Suitability Modelling Workflow
In this course we discuss five main steps for creating a suitability model.

 Step 1: Define suitability goal


The first step in suitability modelling is to clearly state what the goal is, e.g. finding
the most suitable location for a new school, park, landfill, or wildlife habitat restoration
project.

 Step 2: Determine and prepare the criteria data


The next task is to figure out which factors matter for this goal. E.g. criteria may include
how close the location is to roads, how steep the land is, whether the soil is good, whether
there is enough water, and what the rules say about how the land can be used.



 Step 3: Criteria Transformation


The following step is to transform the data that represent the criteria identified in the
previous step. The values within each criterion must be transformed to a common
suitability scale. E.g. each dataset is assigned values between 1 and 10. Where proximity
matters, closer locations are assigned a higher suitability value of 9 or 10, and distant
locations receive a lower suitability value of 1 or 2.

 Step 4: Criteria weighting


Before adding the criteria together, it may be that one criterion is more important
than the others. The weighting in this step tells us how important each criterion is
compared to the others. E.g. each criterion is assigned a weight between 1 and 10.
If land use is more important than the distance to schools, land use might be assigned
a weight of 7 and the distance to schools a weight of 3.

 Step 5: Locate the suitable sites


The final step combines the normalized criteria layers using weighted overlay or
overlay analysis techniques. This results in a suitability map, where each location is
assigned a suitability score or value based on the weighted criteria.
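
Steps 3 to 5 can be sketched numerically as follows. The example is a simplified illustration in plain Python, not the weighted overlay tool itself: the function names, the two candidate sites and their scores are invented, the weights of 7 and 3 are taken from the example in Step 4, and the linear rescaling is only one of several possible transformation functions.

```python
def rescale_closer_is_better(distance, min_distance, max_distance, low=1, high=10):
    """Linearly map a distance onto the suitability scale; the nearest scores highest."""
    if max_distance == min_distance:
        return float(high)
    share = (distance - min_distance) / (max_distance - min_distance)
    return high - share * (high - low)


def weighted_suitability(scores, weights):
    """Weighted average of the criterion scores, so the result stays on the 1-10 scale."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Step 4: land use matters more than distance to schools.
weights = {"land_use": 7, "distance_to_schools": 3}

# Step 3: each criterion is already on, or is transformed to, the 1-10 scale.
sites = {
    "Site A": {"land_use": 9, "distance_to_schools": rescale_closer_is_better(2000, 0, 2000)},
    "Site B": {"land_use": 4, "distance_to_schools": rescale_closer_is_better(0, 0, 2000)},
}

# Step 5: combine the criteria and locate the most suitable site.
suitability = {name: weighted_suitability(scores, weights) for name, scores in sites.items()}
best_site = max(suitability, key=suitability.get)
print(suitability, best_site)
```

Site A wins even though it is the farthest from a school, because the heavier land use weight outweighs its low distance score. Changing the weights is a quick way to test how sensitive the result is to that judgement.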

F.A. Class Exercise 1.1: Data Management Fundamentals

FA Class Exercise: 2.4. Cartographic model


a) Which workflow lists the correctly ordered steps of a suitability analysis?
 Identify the criteria, weight and combine criteria, transform criteria values,
derive the criteria, locate the site
 Identify the criteria, derive the criteria, transform criteria values, weight and
combine criteria, locate the site
 Identify the criteria, derive the criteria, weight and combine criteria,
transform criteria values, locate the site
b) Which definition describes the process of transforming criteria values?
 Converting criteria values to a common scale
 Deriving the criteria values from source datasets
 Assigning weights to the criteria values



c) An analyst wants to identify the most suitable location for a wind farm. One of the
criteria is slope. The analyst will use a digital elevation model to create a dataset that
represents slope. This example demonstrates which step in suitability modeling?
 Transforming criteria values
 Deriving the criteria
 Weighting and combining criteria

2.4.1.2. Simulation
Spatial simulation, often referred to as spatial modeling and simulation, is a
computational technique used to model and analyze complex spatial phenomena
and processes in various fields, including geography, urban planning, ecology,
epidemiology, and transportation. Spatial simulation involves creating computer-based
models that simulate the behavior, interactions, and dynamics of entities or phenomena
within a geographic or spatial context.

 Urban Growth Modeling: Simulating the expansion of urban areas to forecast
land use changes.
 Epidemiological Modeling: Simulating the spread of diseases within geographic
regions to inform public health interventions.
 Environmental Modeling: Simulating natural processes like climate, water flow, or
ecological dynamics for environmental impact assessment.

2.4.1.3. Prediction
Spatial data science often involves the prediction of spatial phenomena or events
using various statistical and machine learning techniques. Predictive modeling in
spatial data science aims to forecast future values, patterns, or occurrences at
specific locations within a geographic or spatial context.

 Spatial Interpolation: Predicting values at unobserved locations within a spatial
dataset. Common methods include kriging, IDW (Inverse Distance Weighting),
and spline interpolation.
 Time Series Forecasting: Predicting future values of a spatial variable over time,
such as weather data, air quality, or traffic flow.
 Machine Learning Regression: Using machine learning algorithms to build
predictive models for spatial data. Algorithms like decision trees, random forests,
support vector machines, and neural networks are applied.
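
To make spatial interpolation concrete, the sketch below implements a basic form of Inverse Distance Weighting in plain Python. The function name, the power of 2 and the two sample points are assumptions for illustration; GIS packages offer IDW tools with further options such as search radius and neighbour limits.

```python
import math


def idw_predict(samples, x, y, power=2):
    """Estimate the value at (x, y) from (x, y, value) samples using IDW."""
    numerator = 0.0
    denominator = 0.0
    for sample_x, sample_y, value in samples:
        distance = math.hypot(x - sample_x, y - sample_y)
        if distance == 0:
            # The location coincides with a sample, so use its value directly.
            return float(value)
        weight = 1.0 / distance ** power  # nearer samples receive larger weights
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Two made-up measurements 10 units apart.
samples = [(0, 0, 10.0), (10, 0, 20.0)]
midpoint = idw_predict(samples, 5, 0)    # equally far from both samples
near_first = idw_predict(samples, 1, 0)  # dominated by the first sample
print(midpoint, round(near_first, 2))
```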
Spatial prediction has numerous applications, including:



 Predicting environmental variables like temperature, precipitation, and air
quality.
 Forecasting disease spread in epidemiology.
 Predicting real estate property values.
 Estimating crop yields in precision agriculture.
 Predicting traffic congestion and transportation patterns.
 Forecasting land use changes in urban planning.

2.5. Extract, Transform, Load (ETL)

Extract, transform and load (ETL) is the process of extracting data from one system,
transforming it into a format that is consumable by another system, and loading it into
the final system where it will be used for business analysis. In this course we will refer to
GIS software as the final system, where data will be loaded for project analysis,
visualization, and other geospatial tasks.

2.5.1. The ETL Workflow

In this subtopic, we'll explore a practical example in urban planning that showcases
the ETL (Extract, Transform, Load) spatial data workflow. Specifically, we focus on a
scenario where an urban planning department must evaluate traffic congestion in a
rapidly expanding city to pinpoint zones in need of infrastructure enhancements.

Process: Extract

Description: Data Sources: This step includes locating and gaining access to data sources
needed for the project. These sources may encompass databases, spreadsheets, web services,
shapefiles, GeoJSON files, and various other formats.
Case study reference: The department identifies various data sources, including traffic
sensor data, road network data, land use data, and population data.

Description: Extract spatial data from these sources using appropriate tools or methods.
This may involve querying databases, downloading files, or accessing web APIs.
Case study reference: The town planners extract traffic sensor data from a real-time web
service, purchase road network data from a data vendor (Tom Tom) and retrieve land use and
population data from a municipal database.

Process: Transform

Description: Clean and preprocess the extracted data to ensure accuracy and consistency.
This step includes handling missing values, resolving data format issues, and addressing
outliers.
Case study reference: The department cleans the traffic sensor data to remove duplicate
records and outliers caused by sensor malfunctions.

Description: Convert data types and units as necessary to ensure uniformity and
compatibility.
Case study reference: The team integrates the road network data with the land use data using
a spatial join to associate land use attributes with road segments.

Description: Perform spatial transformations, such as reprojection, to ensure that all data
layers are in a consistent coordinate system.
Case study reference: Spatial Transformation: All datasets are transformed into a consistent
coordinate system.

Description: Aggregate, filter, or join datasets to create new derived datasets, if needed.
Case study reference: Traffic sensor data is aggregated by time intervals (e.g., hourly) and
spatial zones (e.g., neighborhoods).

Process: Load

Description: Organize the transformed data into a suitable format for storage and analysis.
This may involve creating geodatabases, file geodatabases, or spatial database tables. Load
the cleaned and transformed spatial data into a spatial database management system (DBMS)
or GIS software. Index and optimize the data for efficient querying and analysis.
Case study reference: The transformed and integrated data is organized into a geodatabase to
maintain spatial relationships. The data is loaded into a spatial database management
system (e.g., PostgreSQL with PostGIS) for efficient querying. Appropriate spatial and
attribute indexing is applied to optimize data access.

The urban planning department, armed with the results of the ETL (Extract, Transform,
Load) process, can now conduct spatial analysis to pinpoint congested areas,
strategize infrastructure improvements, and base their decisions on data-driven insights
to effectively address traffic-related challenges.
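
The three stages of the case study can be sketched end to end with Python's standard library. Everything specific in the example is an assumption made for illustration: the sensor readings are invented, the outlier rule is a simple threshold, and an in-memory SQLite database stands in for the spatial DBMS (the case study names PostgreSQL with PostGIS), so no geometry or spatial index is involved.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Invented sensor readings: one duplicate row and one implausible count.
RAW_SENSOR_CSV = """sensor_id,zone,timestamp,vehicles
S1,CBD,2024-03-01T08:05,120
S1,CBD,2024-03-01T08:05,120
S1,CBD,2024-03-01T08:35,140
S2,Harbour,2024-03-01T08:10,60
S2,Harbour,2024-03-01T08:40,99999
"""


def extract(text):
    """Extract: read the raw readings (here from text, in practice a file or API)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, max_plausible=5000):
    """Transform: drop duplicates and outliers, then total vehicles per zone and hour."""
    seen = set()
    totals = defaultdict(int)
    for row in rows:
        key = (row["sensor_id"], row["timestamp"])
        vehicles = int(row["vehicles"])  # convert text to a number
        if key in seen or vehicles > max_plausible:
            continue
        seen.add(key)
        hour = row["timestamp"][:13]  # e.g. "2024-03-01T08"
        totals[(row["zone"], hour)] += vehicles
    return [(zone, hour, total) for (zone, hour), total in sorted(totals.items())]


def load(records):
    """Load: store the result in a database table and index it for querying."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE hourly_traffic (zone TEXT, hour TEXT, vehicles INTEGER)")
    connection.executemany("INSERT INTO hourly_traffic VALUES (?, ?, ?)", records)
    connection.execute("CREATE INDEX idx_hourly_zone ON hourly_traffic (zone)")
    connection.commit()
    return connection


records = transform(extract(RAW_SENSOR_CSV))
connection = load(records)
busiest = connection.execute(
    "SELECT zone, vehicles FROM hourly_traffic ORDER BY vehicles DESC LIMIT 1"
).fetchone()
print(records, busiest)
```

Keeping extract, transform and load as separate functions mirrors the table above and makes each stage easy to test or replace on its own.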

It's important to note that ETL processes are versatile and can be applied across a
broad spectrum of scenarios, spanning urban planning, environmental management,
and beyond. These processes seamlessly facilitate the integration and analysis of
diverse spatial data sources, ultimately empowering data-driven decision-making
across a wide range of spatial applications.

In ArcGIS, ETL tasks can be executed through the following methods:

 ModelBuilder is a visual interface within ArcGIS that allows you to create, edit,
and manage geoprocessing workflows. It is particularly useful for building and
automating complex ETL processes by connecting geoprocessing tools visually.
 Python Scripting is supported in ArcGIS Pro, allowing users to develop custom
scripts for performing data ETL tasks using ArcPy, the Python site package
designed for ArcGIS.
 ArcGIS Pro also incorporates the Data Interoperability extension, offering
advanced ETL capabilities for handling a diverse array of data formats and
sources.



Chapter 3: Understanding Databases

In this section you will be introduced to the relational database and other types of
databases.

Topics to cover
 Introduction to the relational database (SQL)
 Non-relational databases (NoSQL) and flat-file data management
 Structured and unstructured databases.
 Relational Data Model

Learning Objectives
After completing this section, you will be able to:

 Explain the use and significance of the relational database.
 Contrast relational and non-relational databases.
 Explain how flat-file databases can be used to solve data-related problems
 Discuss the value of the relational data model. Give an illustration of a relational
dataset.
 Examine the differences between structured and unstructured databases.
 Explain the significance of structured databases.

Assessments
 FA. Activity 2.1 Use of Spatial Data Models in ArcGIS Pro



In the previous course, the user was introduced to both the raster and vector spatial
data structures and data acquisition methods. After identifying and acquiring data from
multiple sources in various formats, GIS users need to identify how to best store and
manage this data. In this section, closer attention will be paid to how this data is
managed using the highly recommended method to accomplish this task i.e. using
geographic databases. In the past, data was organized using files; nowadays data is
managed in a Database Management System (DBMS). Database Management
Systems make data quickly available to a multitude of users while still maintaining its
integrity. The DBMS protects data against accidental deletion and corruption by
controlling how multiple users add, remove and update data.

3.1. What is a DBMS?


A database is a structured collection of data files. A database management system
(DBMS) is a software package that allows for the creation, storage, maintenance,
manipulation, and retrieval of large datasets that are distributed over one or more files.
A DBMS and its associated functions are usually accessed through commercial software
packages such as Microsoft Access, Oracle, FileMaker Pro, or Avanquest MyDataBase.

Database management refers to the management of tabular data in row and column
format and is frequently used for personal, business, government, and scientific
endeavors. Geospatial database management systems, alternatively, include the
functionality of a DBMS but also contain spatial characteristics about each data point
such as identity, location, shape, and orientation. Integrating this geographic
information with the tabular attribute data of a classical DBMS provides users with
powerful tools to visualize and answer the spatially explicit questions that arise in an
increasingly technological society. The DBMS has the following advantages over
traditional file data management:

i. Minimize redundancy
Data redundancy is the repetition of data, including the same data being present in
multiple formats or tables. Redundant data can bias analysis and make the data
unreliable for data-driven decisions.

ii. Maintain consistency


Data inconsistency occurs when different versions of matching data exist in different
places in an organization. For example, one group has a client’s correct email, while
another has the correct phone number. By using a proper database management
system and data quality tools, you can help ensure that an accurate view of data is
shared throughout your organization.

iii. Guarantee security


Database management systems help users share data quickly, effectively, and securely
across an organization. By providing quick solutions to database queries, a data
management system enables faster access to more accurate data.

iv. Provide frequent back-ups and Recovery


DBMS provides backup and recovery mechanisms to protect against data loss due to
system failures, natural disasters, or other unforeseen events.

v. Manage concurrency
DBMS provides concurrency control to allow multiple users to access and modify data
simultaneously without interfering with each other's work. It ensures that data remains
consistent even when multiple users are modifying it concurrently.

Geographic Information Systems integrate data from various resources into a single
homogeneous system, which requires powerful and flexible data models to serve
multiple tasks.

3.2. Review of Data Models


Several types of database models exist and were introduced in Chapter 3 of
Introduction to Data Acquisition for Spatial Intelligence. This section will review the
database models previously highlighted. A flat database is essentially a spreadsheet
whereby all data are stored in a single, large table. A hierarchical database is also a
fairly simple model that organizes data into a “one-to-many” association across levels.
Common examples of this model include phylogenetic trees for classification of plants
and animals and familial genealogical trees showing parent-child relationships. Network
databases are similar to hierarchical databases; however, they also support
“many-to-many” relationships. This expanded capability allows greater search flexibility
within the dataset and reduces potential redundancy of information. On the other
hand, both the hierarchical and network models can become incredibly complex
depending on the size of the databases and the number of interactions between the
data points. Modern geographic information system (GIS) software typically employs a
fourth model referred to as a relational database.

3.3. Types of Databases


There are several types of databases, each designed to serve specific needs and
applications. Selecting the right database type is crucial for efficient data management
and application performance. Here are some common types of databases:

3.3.1. Centralized database


A centralized database is one that operates entirely within a single location. Centralized
databases are typically used by bigger organizations, such as a business or university.
The database itself is located on a central computer or database system. Users can
access the database through a computer network, but it is the central computer that
runs and maintains the database.

3.3.2. Cloud database


A cloud database is one that is accessed over the Internet. The data is stored on
remote servers rather than on the user's own machine, and the information is available
online. This makes it easy to access your files from anywhere, as long as you have an
Internet connection. To use a cloud database, users can either build one themselves or
pay for a service to store their data for them. Encryption is an essential part of any cloud
database, as all information needs to be protected as it is transmitted online. This type
of database will be reviewed in detail in the upcoming module “Spatial Intelligence
Cloud Computing”.

3.3.3. Commercial database


A commercial database is any that is designed by a commercial business. Businesses
develop feature-rich databases, which they then sell to their customers. Commercial
databases can vary in terms of composition or the technology they use. The defining
trait of commercial databases is having users pay to use them, unlike open source
databases.

3.3.4. Distributed database


A distributed database is one that is spread out over multiple devices. Rather than
having all information stored on a single device, like other databases on this list,
distributed databases will operate across multiple machines, such as different
computers within the same location or across a network. The benefits of a distributed
database include increased speed, better reliability and ease of expansion.

3.3.5. End-user database


End-user is a term used in product development that refers to the person who uses the
product. An end-user database is, therefore, a database that is primarily used by a
single person. A good example of this type of database is a spreadsheet stored on your
local computer.

3.3.6. Graph database


Graph databases are databases that focus equally on the data and the connections
between them. In this database, data is not constricted to predefined models. Most
other databases can find connections between data when you run a search. With a
graph database, these connections are stored inside the database right alongside the
original data. This makes for a more efficient and faster database when your primary
goal is to manage the connections between your data.

3.3.7. NoSQL database


A NoSQL database has a hierarchy similar to a file folder system and the data within it is
unstructured, or non-relational. This lack of structure allows them to process larger
amounts of data at speed and makes it easier to expand in the future. Cloud computing
regularly makes use of NoSQL databases.

3.3.8. Object-oriented database


Object-oriented databases are ones in which data is represented as objects and
classes. An object is an individual item that bundles data, such as a name or phone
number, while a class defines a group of similar objects. Object-oriented databases
differ from relational databases, although object-relational systems combine features
of both. Consider using an object-oriented database when you have a large amount
of complex data that you want to process quickly.

3.3.9. Open-source database


An open-source database is designed for the public to use for free. Unlike commercial
databases, users can download or sign up for open source databases without paying
a fee. The term "open source" refers to a program in which users can see how it was
written and constructed and are free to make their own changes to the program.
Open-source databases are typically much cheaper than commercial databases, but
they can also lack some of the more advanced features found in commercial
databases.

3.3.10. Operational database


The purpose of an operational database is to allow users to modify data in real time.
Operational databases are critical in business analytics and data warehousing. They
can be set up either as relational databases or NoSQL, depending on needs.
Conventional databases rely on batch processing, where commands are carried out in
groups. Operational databases, on the other hand, allow you to add, edit and remove
data at any moment.

3.3.11. Personal database


A personal database is one that is designed for a single person. It is typically stored on
a personal computer and has a very simple design, consisting of only a few tables.
Personal databases are not typically suitable for complex operations, large amounts of
data or business operations.

3.3.12. Relational database


Relational databases are the other major type of database, in contrast to NoSQL. With
a relational database, information is stored in structured tables that are related to
other data. A good representation of a relational database would be the connection
between a person shopping online and their shopping cart. Relational databases are
often preferred when you are concerned about the integrity of your data, or when
you're not particularly focused on scalability.

3.3.13. Non-relational database (NoSQL) and flat files


Non-relational databases, also known as NoSQL databases, are a type of database that
do not use the traditional tabular structure of relational databases. Non-relational
databases use a variety of data models, such as document-oriented, key-value, graph,
or column-family, to store and organize data. Non-relational databases are often used
for large-scale distributed systems, web applications, and big data processing.

In this module, our emphasis will be on the final two types of databases.

3.4. Relational Database Management Systems


A relational database management system (RDBMS) manages a collection of tables
that are connected in such a way that data can be accessed without reorganization
of the tables. The tables are created such that each column represents a particular
attribute (e.g., soil type, PIN number, last name, area) and each row contains a value
for each column attribute (e.g., Delhi Sands Soils, 5555, Smith, 412.3 sqm).

In the relational model, each table is linked to each other table via predetermined keys.
The primary key represents the attribute (column) whose value uniquely identifies a
particular record (row) in the relation (table). The primary key may not contain missing
values as multiple missing values would represent nonunique entities that violate the
basic rule of the primary key. The primary key corresponds to an identical attribute in a
secondary table (and possibly third, fourth, fifth, etc.) called a foreign key. This results in
all the information in the first table being directly related to the information in the second
table via the primary and foreign keys, hence the term “relational” DBMS. With these
links in place, tables within the database can be kept very simple, resulting in minimal
computation time and file complexity. This process can be repeated over many tables
as long as each contains a foreign key that corresponds to another table’s primary key.
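
The primary key and foreign key link described above can be sketched with Python's
built-in sqlite3 module. This is a minimal illustration only: the table names (owners,
parcels) and column names are assumptions made for the example, not part of this
guide's datasets.

```python
import sqlite3

# In-memory database; "owners" and "parcels" are hypothetical example tables.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled

conn.execute("""
    CREATE TABLE owners (
        owner_id  INTEGER PRIMARY KEY,   -- primary key: uniquely identifies each owner
        last_name TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE parcels (
        pin       INTEGER PRIMARY KEY,
        soil_type TEXT,
        area_sqm  REAL,
        owner_id  INTEGER REFERENCES owners(owner_id)  -- foreign key to owners
    )
""")
conn.execute("INSERT INTO owners VALUES (1, 'Smith')")
conn.execute("INSERT INTO parcels VALUES (5555, 'Delhi Sands Soils', 412.3, 1)")

# The join follows the foreign key back to the matching primary key.
rows = conn.execute("""
    SELECT p.pin, p.soil_type, p.area_sqm, o.last_name
    FROM parcels AS p
    JOIN owners AS o ON o.owner_id = p.owner_id
""").fetchall()

# A parcel that points at a non-existent owner is rejected.
try:
    conn.execute("INSERT INTO parcels VALUES (6666, 'Clay', 100.0, 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

Because each table stays small and the link is made only through the keys, the owner's
name is stored once and reused by every parcel that references it.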

There is great potential for redundancy in this model as each table must contain an
attribute that corresponds to an attribute in every other related table. Therefore,
redundancy must actively be monitored and managed in an RDBMS. To accomplish this,
a set of rules called normal forms have been developed. There are three basic normal
forms.

The first normal form refers to five conditions that must be met. They are as follows:

1. There is no sequence to the ordering of the rows.
2. There is no sequence to the ordering of the columns.
3. Each row is unique.
4. Every cell contains one and only one value.
5. All values in a column pertain to the same subject.

The second normal form states that any column that is not part of the primary key must
be dependent on the whole primary key, not on just a part of it. This reduces
redundancy by moving data that depends on only part of a key into its own table.
This step often involves the creation of new tables to maintain normalization.

The third normal form states that all non-key columns must depend directly on the
primary key and not on other non-key columns, while the primary key remains
independent of all non-key columns.
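
As a rough illustration of why normalization creates new tables, the sketch below splits
a small un-normalized table in plain Python. The records and field names (pin, owner_id,
owner_phone) are hypothetical. The owner's phone number depends on the owner, not on
the parcel, so it is moved into a separate owners table.

```python
# Hypothetical un-normalized rows: the owner's phone number is repeated on every parcel.
flat_rows = [
    {"pin": 5555, "soil_type": "Delhi Sands Soils", "owner_id": 1, "owner_phone": "555-0100"},
    {"pin": 5556, "soil_type": "Clay Loam", "owner_id": 1, "owner_phone": "555-0100"},
    {"pin": 5557, "soil_type": "Clay Loam", "owner_id": 2, "owner_phone": "555-0199"},
]

owners = {}   # keyed by owner_id (primary key of the new table)
parcels = []  # keeps owner_id only as a foreign key

for row in flat_rows:
    # owner_phone depends on owner_id, not on pin, so it is stored once per owner.
    owners[row["owner_id"]] = {"owner_phone": row["owner_phone"]}
    parcels.append(
        {"pin": row["pin"], "soil_type": row["soil_type"], "owner_id": row["owner_id"]}
    )
```

After the split, changing an owner's phone number means updating one record instead of
every parcel row that mentions that owner.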

The relational data model provides a powerful and flexible way to organize and
manage data in a database, making it a popular choice for a wide range of
applications. It has been widely adopted in the industry and is supported by many
popular database management systems, making it a valuable skill for data
professionals to have.

3.4.1. Significance of relational databases


The most significant thing about relational databases is that they help organize and
store information in an easy-to-understand way, like a spreadsheet with rows and
columns. This makes it simple to find and work with data. They also ensure data is
accurate, secure, and consistent, which is crucial in many applications. In short,
relational databases are the backbone of many systems, ensuring that data is well-
organized and trustworthy, which is essential in our digital world.

The biggest significance of relational databases lies in their ability to provide structured,
efficient, and reliable storage and management of data. Here are some of the key
significances of relational databases:

i. Data Structure and Organization: Relational databases excel in structuring and
organizing data. They store data in tables with rows and columns, providing a
clear and organized framework for managing information. This tabular structure
simplifies data storage and retrieval.
ii. Data Integrity: Relational databases enforce data integrity through various
constraints like primary keys, foreign keys, unique constraints, and check
constraints. This ensures that data remains accurate and consistent, preventing
the entry of invalid or inconsistent information.
iii. Data Relationships: Relational databases allow the establishment of relationships
between tables, facilitating the modeling of complex data structures. This makes
it possible to represent real-world relationships in data, which is crucial for many
applications.
iv. Efficient Querying: Relational databases offer powerful query capabilities. Users
can retrieve data using SQL (Structured Query Language) and perform a wide
range of operations like filtering, sorting, aggregation, and joining, making it easy
to obtain the desired information from large datasets.
v. Scalability: Relational databases can scale vertically by adding more resources
to a single server or horizontally by sharding data across multiple servers. This
scalability ensures that the database can handle growing data volumes and user
loads.
vi. Data Security: Relational databases support access control and authentication
mechanisms, allowing administrators to define who can access and modify data.
This is essential for maintaining data privacy and security.

vii. ACID Properties: Relational databases adhere to the ACID (Atomicity,
Consistency, Isolation, Durability) properties, ensuring that database transactions
are processed reliably. This is vital in applications where data consistency and
reliability are paramount, such as financial systems.
viii. Data Independence: Relational databases provide a level of data
independence, separating the logical structure of the database from its physical
implementation. This allows changes to the physical storage without affecting
applications that interact with the data.
ix. Data Modeling: Relational databases encourage a well-defined data modeling
process, which helps in understanding the data requirements of an organization
and designing an efficient database schema.
x. Standardization: Relational databases are widely used and benefit from well-
established standards. The SQL language, for example, is a standardized way to
interact with relational databases, making it easier for developers and
organizations to work with these systems.
xi. Data Analytics: Relational databases are used as the foundation for data
analytics and reporting. Data can be stored, processed, and queried efficiently
to derive insights and support decision-making.
xii. Historical Data: Relational databases are effective for maintaining historical data
records. Changes to data over time can be tracked using techniques like
effective dating and auditing.
The significance of relational databases lies in their versatility and reliability, making them
a fundamental tool for managing and organizing data across various industries, from
finance and healthcare to e-commerce and beyond. Their structured approach to
data management is critical for ensuring data accuracy, consistency, and accessibility,
making them a cornerstone of modern information systems.
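
Two of the points above, data integrity through constraints and the atomicity part of
ACID, can be sketched with Python's built-in sqlite3 module. The accounts table and the
transfer function are hypothetical names chosen for this illustration, and the example
is a simplified sketch rather than a production pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        balance    REAL NOT NULL CHECK (balance >= 0)  -- check constraint
    )
""")
conn.execute("INSERT INTO accounts VALUES (1, 100.0), (2, 50.0)")
conn.commit()

def transfer(conn, source, target, amount):
    """Move amount between accounts; undo both updates if either one fails."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, 1, 2, 30.0)         # valid transfer
rejected = transfer(conn, 1, 2, 500.0)  # would make account 1 negative
balances = dict(conn.execute("SELECT account_id, balance FROM accounts"))
```

In the rejected transfer the credit to account 2 is applied first and then rolled back
when the debit violates the check constraint, so neither account changes.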

FA Activity 3.1: Relational Diagram


Use ArcGIS Pro to visualize the data located in “C:\Users\Your name\Documents\GIS
Training\IDA\FA Activity 2.1”. Use Microsoft Visio to draw a relational illustration of
each dataset in the geodatabase, showing how the datasets relate to each other.

Visualizing Geospatial Data with ArcGIS Pro:

• Open ArcGIS Pro: Launch ArcGIS Pro on your computer.
• Create a New Project: Start a new project or open an existing one if you have it.
• Add Data: In the "Catalog" pane, navigate to the location of your geodatabase,
  which is "C:\Users\Your name\Documents\GIS Training\IMDGD\FA Activity 3.1" in this
  case. Right-click on the geodatabase file (usually with a .gdb extension) and select
  "Add to Current Map." Your datasets within the geodatabase should now be visible
  in the "Contents" pane.
• Visualize Data: Double-click on the datasets you want to visualize to add them to
  your map.

3.5. Non-relational Databases


In today's era of big data, unstructured data is incredibly prevalent and abundant. This
ubiquity is primarily due to the fact that unstructured data can encompass a wide array
of data types, including media files, images, audio recordings, sensor data, textual
information, and much more. What defines unstructured data is the absence of a
predefined, structured database format. While unstructured data possesses its own
internal structure, it isn't constrained by predefined data models. It can originate from
human-generated sources or be machine-generated, whether in a textual or non-
textual format.

3.5.1. Characteristics of unstructured data


• More challenging to process and analyse compared to structured data due to its lack
of predefined structure.
• Requires specialized tools and techniques, such as Natural Language Processing (NLP)
and Machine Learning, to extract insights and patterns.
• Often contains valuable information, but accessing and utilizing it effectively can be
complex.

In the realm of data science, practitioners frequently encounter a diverse landscape of
data, encompassing both structured and unstructured forms. Proficiency in handling
and effectively integrating both data types is paramount for unearthing comprehensive
insights and practical solutions to real-world challenges. Structured data, with its well-
defined organization, tends to be more approachable and amenable to analysis when
compared to unstructured data. The majority of statistical and machine learning models
have been traditionally constructed with structured data in mind and may struggle with
the unstructured variety due to its inherent ambiguity. Remarkably, estimates indicate
that unstructured data constitutes a substantial portion, ranging from 80% to 90%, of the
world's data repositories. This unstructured data assumes various manifestations,
spanning from social media posts, emails, literature, to server logs. While data scientists
may naturally gravitate toward the orderliness of structured data, they must equip
themselves to navigate the colossal realm of unstructured data. To harness its potential,
a crucial preliminary step is preprocessing, a technique that readies unstructured data
for meaningful analysis. In this chapter, we will delve into the preprocessing phase,
particularly focusing on the intricate art of transforming unstructured data into a
structured format, setting the stage for robust data analysis.
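
A very small preprocessing step of this kind can be sketched in Python. The log line
format below is an assumption made for illustration (real server logs vary widely). The
regular expression pulls named fields out of each free-text line and skips lines that do
not match.

```python
import re

# Hypothetical raw log lines; the format is assumed for this example only.
raw_lines = [
    "2024-01-15 08:30:01 ERROR sensor-12 reading failed",
    "2024-01-15 08:30:05 INFO sensor-07 reading ok",
    "not a log line",
]

# Named groups become the column names of the structured records.
pattern = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.+)$"
)

# Keep only lines that match, converting each one to a dictionary (one row per line).
records = [m.groupdict() for line in raw_lines if (m := pattern.match(line))]
```

Each dictionary in records now has the same four fields, so the result can be loaded
into a table and analysed like any other structured dataset.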

3.5.2. Types of NoSQL Databases


NoSQL brings higher performance, availability, and scalability. It can often handle
unstructured, and often chaotic, data much faster than traditional models. There are
four broad categories of NoSQL databases: key-value, column family, document, and
graph.

3.5.2.1. Key-Value Databases


A key-value database serves as the foundational pillar of NoSQL databases, functioning
by associating a unique key with a specific value, without complex relationships or
predefined structures. For example, the key might be a customer ID, serving as a
gateway to access all information related to that customer, ranging from their address
to the products they've ordered. A key-value database is simple to integrate, as
clients can retrieve, update, or delete key-value pairs without the database needing
insight into the contents of the data stored under each key. Unlike traditional relational
models, where adding an ad hoc data element (such as a preferred name) to an
existing structure is discouraged, NoSQL databases embrace such flexibility.
Furthermore, the retrieval speed of key-value database management systems
generally remains rapid even as the database scales up in size.
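
The key-value idea can be sketched with an ordinary Python dictionary standing in for
the database. The keys and customer details below are invented for the example; a real
key-value system adds persistence and distribution on top of this basic lookup.

```python
# A plain dictionary acts as a stand-in for a key-value store.
store = {}

# The store does not inspect the values; each key can hold a differently shaped value.
store["customer:1001"] = {"address": "12 Main Rd", "orders": ["map", "gps unit"]}
store["customer:1002"] = {"address": "3 Hill St", "orders": [], "preferred_name": "Sam"}

# Retrieve, update and delete by key only.
value = store.get("customer:1001")
store["customer:1001"]["orders"].append("tripod")
del store["customer:1002"]
```

Note that the second customer carries a preferred_name field that the first does not,
which a fixed relational table would only allow after a schema change.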

3.5.2.2. Column Family Databases


In contrast to the rigid key-to-row relationship seen in relational databases, column-
family databases organize data in clusters of rows, with the ability to associate any
number of columns to a particular row key. This approach is particularly useful for
accessing related data that is frequently queried together, such as capturing a user's
profile when they log into a web application. This design effectively addresses the issue
of column bloating often encountered in relational databases. In column-family
databases, data is organized in clusters and nodes, offering a more flexible and
dynamic structure as opposed to the strict and predefined structures of relational
databases.
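
The row key and column family layout can be pictured with nested Python dictionaries.
This is only a conceptual sketch with invented names (users, profile, activity); it does
not reproduce how any particular column-family product stores data.

```python
# Layout: row key -> column family -> column -> value (all names are hypothetical).
users = {
    "user:42": {
        "profile": {"name": "Thandi", "city": "Durban"},
        "activity": {"last_login": "2024-01-15"},
    },
    "user:43": {
        "profile": {"name": "Sipho"},  # rows do not need the same set of columns
    },
}

def read_family(table, row_key, family):
    """Return all columns of one family for one row, or an empty dict if absent."""
    return table.get(row_key, {}).get(family, {})

# Data that is queried together (the profile) is fetched in one lookup.
profile = read_family(users, "user:42", "profile")
missing = read_family(users, "user:43", "activity")
```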

3.5.2.3. Document Databases


A document database, also known as a document store or document-oriented
database, is a type of NoSQL database that is designed for storing, retrieving, and
managing semi-structured or unstructured data, typically in the form of documents.
These documents are self-contained data units that can store a wide variety of
information, such as text, JSON, XML, and binary data. MongoDB is a well-known example.
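
The sketch below imitates a document collection with a Python list of dictionaries. The
documents, field names and the find helper are assumptions made for this illustration
and are not the API of MongoDB or any other product.

```python
import json

# Two self-contained documents with different fields in the same collection.
documents = [
    {"_id": 1, "type": "survey_point",
     "location": {"lat": -29.85, "lon": 31.02}, "tags": ["benchmark"]},
    {"_id": 2, "type": "photo", "file": "site.jpg", "caption": "North boundary"},
]

def find(docs, **criteria):
    """Return documents whose top-level fields match every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

photos = find(documents, type="photo")

# Documents are commonly exchanged as JSON text.
serialized = json.dumps(documents[0])
restored = json.loads(serialized)
```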

3.5.2.4. Graph Databases

A graph database is a type of NoSQL database designed to represent and store data
in a graph structure. In a graph database, data is organized into nodes, relationships,
and properties, allowing for the efficient representation of complex relationships
between data elements. This makes graph databases particularly well-suited for
applications that involve highly interconnected data. The graph database winds up
looking like a web of concepts and connections, a structure that is difficult to
represent efficiently in relational databases.
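
Nodes, relationships and properties can be sketched in Python as follows. The small
water network (reservoir, pump station, suburb) is invented for the example, and the
breadth-first traversal is one simple way to follow stored connections, not the query
language of any specific graph database.

```python
from collections import deque

# Nodes carry properties; relationships are (source, target, properties) triples.
nodes = {
    "A": {"label": "Reservoir"},
    "B": {"label": "Pump station"},
    "C": {"label": "Suburb"},
    "D": {"label": "Unconnected tank"},
}
relationships = [
    ("A", "B", {"type": "FEEDS"}),
    ("B", "C", {"type": "FEEDS"}),
]

def reachable(start):
    """Return every node that can be reached by following relationships from start."""
    adjacency = {}
    for source, target, _props in relationships:
        adjacency.setdefault(source, []).append(target)
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}

downstream = reachable("A")
```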

3.6. Flat Files vs Non-relational Databases


Flat files are a type of data storage format that stores data in a simple manner,
usually in a plain text file with a specific delimiter (e.g., comma-separated values or
tab-separated values). Flat files do not have an enforced schema or relationships
between records, and data is often stored in a single table or file.

Non-relational databases and flat files have some similarities in that they do not rely on
a predefined schema, but they differ in terms of their structure and functionality. Non-
relational databases can be highly scalable, efficient, and flexible, allowing for the
storage and processing of large volumes of unstructured or semi-structured data. Flat
files, on the other hand, are often used for smaller data sets or simple data storage
needs.

In summary, non-relational databases and flat files are two different types of data
storage and management systems, each with its own strengths and weaknesses. Non-
relational databases are more suitable for large-scale distributed systems and big data
processing, while flat files are often used for smaller, simpler data storage needs.

3.6.1. Uses of flat-file databases


Flat-file databases can be used to solve data-related problems in a variety of ways,
including:

i. Data storage and organization: Flat-file databases can be used to store and
organize data in a simple, structured format. Data can be stored in a flat-file
database in a tabular format with a specific delimiter, such as a comma or tab,
allowing for easy access and retrieval.

ii. Data processing and analysis: Flat-file databases can be used to process and
analyze data using a variety of tools and techniques. For example, data stored in
a flat-file database can be imported into a spreadsheet or statistical analysis
software for further analysis.

iii. Data sharing and collaboration: Flat-file databases can be easily shared and
collaborated on with others. Data can be shared as a file, emailed, or stored in a
shared folder or cloud storage service.

iv. Data migration and integration: Flat-file databases can be used to migrate and
integrate data between different systems. For example, data stored in a flat-file
database can be imported into a relational database or other data storage
system.
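
The migration use case in point iv can be sketched with Python's built-in csv and
sqlite3 modules. The file contents and the parcels table are hypothetical, and the
delimited text is held in a string here so that the example is self-contained.

```python
import csv
import io
import sqlite3

# Hypothetical comma-delimited flat file, held in memory for the example.
flat_file = (
    "pin,soil_type,area_sqm\n"
    "5555,Delhi Sands Soils,412.3\n"
    "5556,Clay Loam,250.0\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parcels (pin INTEGER PRIMARY KEY, soil_type TEXT, area_sqm REAL)"
)

# A flat file stores everything as text, so types are applied during the import.
reader = csv.DictReader(io.StringIO(flat_file))
typed_rows = [
    (int(row["pin"]), row["soil_type"], float(row["area_sqm"])) for row in reader
]
conn.executemany("INSERT INTO parcels VALUES (?, ?, ?)", typed_rows)
conn.commit()

imported = conn.execute("SELECT pin, soil_type, area_sqm FROM parcels ORDER BY pin").fetchall()
```

Once the rows are in the relational table, the database can enforce the primary key and
data types that the flat file could not.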

3.7. The Geographical Database

A Geographical Database is built from the Object-Relational Model, which is a
combination of two data models. The responsibility for management of geographic
datasets is shared between the GIS software and generic DBMS. Certain aspects of
geographic dataset management, such as disk-based storage, definition of attribute
types, associative query processing, and multiuser transaction processing, are
delegated to the DBMS. The GIS software retains responsibility for defining the specific
DBMS schema used to represent various geographic datasets and for domain-specific
logic, which maintains the integrity and utility of the underlying records.

ESRI has developed a Geodatabase to leverage the two data models. The
geodatabase is implemented using the same multitier application architecture found
in other advanced DBMS applications. The geodatabase objects persist as rows in DBMS
tables that have identity, and the behaviour is supplied through the geodatabase
application logic.

Within a geodatabase, the primary purpose of a feature dataset is to organize related
feature classes into a common dataset for building a topology, a network dataset, or
a terrain dataset. Feature datasets can also be used to integrate related feature
classes spatially or thematically.

Using a Geodatabase to store data has the following advantages:

| Advantage | Characteristic | Description |
|---|---|---|
| Centralized repository | Feature classes; Table | All data is stored in the same database, as opposed to many separate files. |
| Scalable data model | File geodatabase; Enterprise geodatabase | As your GIS needs change, you can migrate data from one geodatabase to an upgraded geodatabase format that allows for more users and editors. |
| Shareable data models | Feature datasets; Geodatabase schema template | These industry-standard models from Esri can be used for your organization. You can also share your custom models within your organization through exporting the schema. |
| Increased data integrity | Subtypes; Domains; Topology | You can create spatial and attribute behaviors to facilitate editing, eliminate data entry errors, and maintain spatial and attribute relationships between your data. |
| Support for imagery | Mosaic dataset | Mosaic datasets in the geodatabase allow you to manage multiple rasters as one, create custom derived products, and increase performance. The mosaic dataset can also be published as an image service. |

Geodatabases can be modified in many ways to accommodate any data storage
and functionality required. A well-designed geodatabase contains a manageable set
of feature classes along with the means to finely control feature behavior.

3.8. Data Manipulation


Data manipulation language (DML) is not a language in its own right but the set of
commands (in SQL: INSERT, UPDATE and DELETE) used to modify data already included
in a database, but not the structure of the database itself.

Manipulating attribute data is much more complex than manipulating spatial data. It is
much more difficult to identify errors in attribute data when the values are syntactically
valid but incorrect, and fewer tools exist for checking attribute data. The most
common ways to manipulate attribute data include:

• Insert – refers to capturing attribute information for each geometry being added
in a feature class or adding a new row of information on a standalone table

• Update - Updating attribute data implies entering new information though the
geometry of the feature remains unmodified. The updating function is of great
importance during any project.

• Delete – refers to removing features from the GIS as they are removed from the
real world
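
The three operations above map onto the SQL statements INSERT, UPDATE and DELETE. The
sketch below runs them on a standalone attribute table using Python's built-in sqlite3
module. The wells table and its columns are hypothetical, and editing a real feature
class would normally be done through the GIS software rather than directly in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical standalone attribute table.
conn.execute("CREATE TABLE wells (well_id INTEGER PRIMARY KEY, status TEXT, depth_m REAL)")

# Insert: add new rows of attribute information.
conn.execute("INSERT INTO wells (well_id, status, depth_m) VALUES (?, ?, ?)", (1, "active", 35.0))
conn.execute("INSERT INTO wells (well_id, status, depth_m) VALUES (?, ?, ?)", (2, "active", 50.0))

# Update: change an attribute value; the row's identity stays the same.
conn.execute("UPDATE wells SET status = ? WHERE well_id = ?", ("capped", 1))

# Delete: remove a record that no longer exists in the real world.
conn.execute("DELETE FROM wells WHERE well_id = ?", (2,))
conn.commit()

remaining = conn.execute("SELECT well_id, status, depth_m FROM wells").fetchall()
```

None of these statements changes the table's columns or types, which is the distinction
between manipulating data and changing the structure of the database.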

3.9. Understanding the Geodatabase schema


The database schema is a blueprint that outlines the structure and details of the
database, guiding the database engine on how data is structured and how different
elements are interconnected. Likewise, a geodatabase schema is a structured
framework for organizing and managing geographic or spatial data within a
geodatabase, which is a specialized type of database designed for storing, managing,
and querying spatial data. Geodatabase schemas are essential for maintaining the
integrity and consistency of spatial data and ensuring that it can be effectively used in
Geographic Information System (GIS) applications. Here's an overview of key
components of a geodatabase schema:

i. Feature Classes: Feature classes are fundamental components of a geodatabase
schema. They represent different types of geographic features, such as points,
lines, and polygons. Each feature class has a defined geometry type (e.g., point,
line, polygon) and can contain attributes (fields) that describe the characteristics
of the features.
ii. Tables: In addition to feature classes, geodatabase schemas can include tables
that store non-spatial data related to geographic features. These tables are
typically used for storing attribute data that complements the spatial information
in feature classes.
iii. Relationships: Geodatabase schemas can define relationships between feature
classes and tables. Relationships represent associations between different
geographic features and non-spatial attributes. For example, a relationship might
link a feature class representing parcels of land with a table containing ownership
information.

iv. Domains: Domains are sets of valid values that can be assigned to specific fields
within feature classes or tables. They help enforce data integrity by restricting the
values that can be entered into attribute fields. For instance, a domain might
define a list of valid land use codes for a land parcel feature class.

v. Subtypes: Subtypes allow you to categorize features within a feature class into
subgroups based on common characteristics. For example, within a water utility
network feature class, subtypes might be used to distinguish between different
types of water pipes (e.g., main lines, service lines) with their own specific
attributes.

vi. Topology Rules: Topology rules define spatial relationships and constraints
between features, either within a single feature class or between feature
classes. These rules help maintain the integrity of spatial data by ensuring
that features conform to specific geometric criteria (e.g., no overlap between
polygons).

vii. Geometric Networks: Geometric networks are specialized graph structures used
to model and analyze the connectivity and flow of resources in spatial data. They
are often used in utility and transportation network datasets. In ArcGIS Pro,
utility networks and trace networks take over this role.

viii. Raster Datasets: In addition to vector-based feature classes and tables,
geodatabase schemas can include raster datasets for storing and analyzing
continuous spatial data, such as satellite imagery, elevation models, and land
cover data.

ix. Versioning: Enterprise geodatabases support versioning, which allows
multiple users to work on different versions of the data simultaneously while
maintaining data consistency and tracking changes.

x. Security and Access Control: Geodatabase schemas often include security
features and access controls to manage user permissions and protect sensitive
spatial data.
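
To make the role of feature classes and domains more concrete, the following is a minimal sketch in plain Python. It does not use any GIS library; the class names `CodedValueDomain` and `FeatureClass`, the land use codes, and the field names are all hypothetical and chosen only for illustration. It shows how a domain can restrict the values accepted by one attribute field.

```python
from dataclasses import dataclass, field


@dataclass
class CodedValueDomain:
    """A named set of valid codes for an attribute field (hypothetical model)."""
    name: str
    codes: dict  # code -> human-readable description

    def is_valid(self, value):
        return value in self.codes


@dataclass
class FeatureClass:
    """A feature class with a geometry type, fields, and optional domains."""
    name: str
    geometry_type: str  # "point", "line", or "polygon"
    fields: list
    domains: dict = field(default_factory=dict)  # field name -> domain

    def validate(self, attributes):
        """Return a list of problems found in one feature's attributes."""
        problems = []
        for field_name in self.fields:
            if field_name not in attributes:
                problems.append(f"missing field: {field_name}")
        for field_name, domain in self.domains.items():
            value = attributes.get(field_name)
            if value is not None and not domain.is_valid(value):
                problems.append(
                    f"{field_name}={value!r} not in domain {domain.name}"
                )
        return problems


# Illustrative land use codes; a real project would define its own list.
land_use = CodedValueDomain(
    "LandUseCodes",
    {"RES": "Residential", "COM": "Commercial", "AGR": "Agricultural"},
)
parcels = FeatureClass(
    "Parcels", "polygon", ["parcel_id", "land_use"], {"land_use": land_use}
)

ok = parcels.validate({"parcel_id": 1, "land_use": "RES"})
bad = parcels.validate({"parcel_id": 2, "land_use": "XYZ"})
print(ok)   # no problems for a valid code
print(bad)  # one problem reported for the unknown code
```

In a real geodatabase the same restriction is enforced by the software when a domain is assigned to a field, so no custom validation code is needed.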

A schema defines the physical structure of the geodatabase as well as the rules,
relationships, and properties of each dataset in the geodatabase. Thus, one of the most
important aspects of a geodatabase is its structure and the way the geodatabase is
organized into database tables, column types, indexes, and other database objects.

Here are some basic concepts related to a schema:

1. Tables: A table is a basic unit of storage in a database that contains data in rows
and columns.
2. Relationships: A relationship defines the connection between two or more tables in
a database. The relationship is based on a common key field that is used to link the
tables.
3. Constraints: Constraints are rules that are used to ensure the integrity of the data
stored in a database. Constraints can be used to enforce data type, uniqueness,
and referential integrity rules.
4. Views: Views are virtual tables that are derived from one or more tables in a
database. Views are used to simplify the complexity of data and to provide a
customized view of data to different users.
5. Indexes: An index is a data structure that is used to improve the performance of
database queries. Indexes are created on one or more columns in a table, and
they allow the database to quickly locate specific data in the table.
6. Stored Procedures: A stored procedure is a group of SQL statements that are stored
in the database and can be executed repeatedly by a user or an application.
Stored procedures can be used to improve performance, simplify complex
operations, and enforce business rules.
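
The concepts above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table names, column names, and sample rows are invented, and SQLite is used because it needs no server. SQLite does not support stored procedures, so that concept is left out of the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
-- Tables: rows and columns
CREATE TABLE owners (
    owner_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
-- Constraints: NOT NULL, CHECK, and a foreign key (the relationship)
CREATE TABLE parcels (
    parcel_id INTEGER PRIMARY KEY,
    land_use  TEXT NOT NULL CHECK (land_use IN ('RES', 'COM', 'AGR')),
    owner_id  INTEGER NOT NULL REFERENCES owners(owner_id)
);
-- Index: speeds up lookups on the key used to link the tables
CREATE INDEX idx_parcels_owner ON parcels(owner_id);
-- View: a virtual table derived from the two tables
CREATE VIEW parcel_owners AS
    SELECT p.parcel_id, p.land_use, o.name AS owner_name
    FROM parcels AS p JOIN owners AS o ON o.owner_id = p.owner_id;
""")

conn.execute("INSERT INTO owners VALUES (1, 'Owner A')")
conn.execute("INSERT INTO parcels VALUES (100, 'RES', 1)")

# Referential integrity: owner 99 does not exist, so this insert is rejected.
try:
    conn.execute("INSERT INTO parcels VALUES (101, 'RES', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

rows = conn.execute("SELECT * FROM parcel_owners").fetchall()
print(rows)
print(fk_rejected)
```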

Overall, a schema is a way of organizing data in a database, and it includes tables,
relationships, constraints, views, indexes, and stored procedures. It is a fundamental
concept in database management and plays a critical role in defining and managing
data.

3.9.1. Application of schemas to solve data problems


Schemas are essential for solving data problems in several ways. Here are some
examples:

1. Data organization: Schemas help to organize data in a structured and logical
manner. By defining tables, relationships, and constraints, schemas ensure that
data is consistent, accurate, and easy to manage. This improves data quality and
reduces the risk of errors.
2. Data integration: Schemas can be used to integrate data from different sources
into a single database system. By defining common fields and relationships,
schemas enable the data to be combined and analyzed, which can lead to new
insights and discoveries.
3. Performance optimization: Schemas can be used to optimize the performance of a
database system. By creating indexes, views, and stored procedures, schemas can
improve the speed of data retrieval and processing, which is critical for large
databases or high-demand applications.
4. Security: Schemas can be used to define access control rules, which can prevent
unauthorized access to sensitive data. By defining user roles and permissions,
schemas ensure that data is protected and only accessible to authorized users.

Overall, schemas are an essential tool for solving data problems because they provide
a structured way of organizing, integrating, optimizing, and securing data. By defining
the logical structure of a database system, schemas ensure that data is consistent,
accurate, and easy to manage, which is critical for any data-driven organization.

Schemas are also used in GIS to solve data problems. Here are some examples:

1. Data modeling: GIS data modeling involves defining the logical structure of
geospatial data. By defining tables, relationships, and constraints, GIS schemas
ensure that spatial data is organized in a structured and logical manner, which
improves data quality and reduces the risk of errors.
2. Spatial analysis: GIS schemas can be used to define common fields and
relationships between geospatial data layers, enabling spatial analysis and
visualization. By creating indexes, views, and stored procedures, GIS schemas can
also improve the speed of spatial queries and data processing, which is critical for
large geospatial databases or high-demand GIS applications.
3. Data integration: GIS schemas can be used to integrate geospatial data from
different sources, enabling the data to be combined and analyzed. By defining
common fields and relationships, GIS schemas enable spatial data to be integrated
with non-spatial data, which can lead to new insights and discoveries.
4. Data sharing: GIS schemas can be used to define access control rules, which can
prevent unauthorized access to sensitive geospatial data. By defining user roles
and permissions, GIS schemas ensure that geospatial data is protected and only
accessible to authorized users.
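
As an illustration of the data integration point, the sketch below joins spatial features to a non-spatial table through a common field. It uses plain Python dictionaries rather than a GIS library; the `parcel_id` key, the coordinates, and the owner names are made-up sample values, and `join_on_key` is a hypothetical helper written for this example.

```python
# Spatial layer: each feature has a key field and a geometry.
features = [
    {"parcel_id": 100, "geometry": {"type": "Point", "coordinates": [28.05, -26.20]}},
    {"parcel_id": 101, "geometry": {"type": "Point", "coordinates": [28.06, -26.21]}},
    {"parcel_id": 102, "geometry": {"type": "Point", "coordinates": [28.07, -26.22]}},
]

# Non-spatial table: ownership records that share the same key field.
ownership = [
    {"parcel_id": 100, "owner": "Owner A"},
    {"parcel_id": 101, "owner": "Owner B"},
]


def join_on_key(features, table, key):
    """Attach matching table rows to features (features without a match are kept)."""
    lookup = {row[key]: row for row in table}
    joined = []
    for feature in features:
        match = lookup.get(feature[key], {})
        joined.append({**match, **feature})
    return joined


joined = join_on_key(features, ownership, "parcel_id")
unmatched = [f["parcel_id"] for f in joined if "owner" not in f]

print(joined[0]["owner"])
print(unmatched)  # parcels that have no ownership record
```

The join only works because both data sets define `parcel_id` in the same way, which is exactly what a shared schema is meant to guarantee.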

Overall, GIS schemas are an essential tool for solving data problems because they
provide a structured way of organizing, integrating, optimizing, and securing geospatial
data. By defining the logical structure of geospatial databases, GIS schemas ensure that
spatial data is consistent, accurate, and easy to manage, which is critical for any GIS-
driven organization.

3.9.2. Creating data schema


Creating a data schema typically involves the following steps:

i. Identify data requirements: This involves identifying the types of data that need
to be stored and managed, as well as the relationships between different data
elements. For example, in a customer relationship management system, data
requirements may include customer information, order details, and payment
history.
ii. Choose a Geodatabase Type: Decide on the type of geodatabase you want to
create. In ArcGIS Pro, the options are file geodatabases, mobile geodatabases,
and enterprise geodatabases. Personal geodatabases are an older ArcMap format
that ArcGIS Pro does not support.
iii. Define entities: Once data requirements have been identified, the next step is to
define entities or tables that will be used to store the data. Each entity
represents a distinct type of data, such as customers, orders, or products.
iv. Define attributes: Each entity must be further defined by specifying the attributes
or columns that will be used to store data. For example, the customer entity may
have attributes such as customer ID, name, address, and phone number.
v. Define relationships: Relationships are defined between entities to show how
they are related to each other. For example, a relationship may exist between
the customer entity and the order entity, indicating that each customer may
have multiple orders.
vi. Organize Data into Feature Datasets: Feature datasets are containers for
organizing related feature classes. Group feature classes that share common
characteristics or themes into feature datasets.
vii. Set Up Domains: Define attribute domains to enforce data integrity by restricting
values allowed for specific attributes.
viii. Design Topology: If your data requires maintaining topological relationships
(e.g., connectivity, adjacency), set up and configure topology rules.
ix. Plan Spatial Indexes: Decide which feature classes need spatial indexes on
their geometry, and which attribute fields need attribute indexes, to optimize
spatial and attribute queries.
x. Implement Geodatabase Rules: Establish data constraints, rules, and behaviors
for maintaining data quality and consistency.
xi. Implement the schema: The final step is to implement the schema by creating
the necessary database tables and defining data types, constraints, and other
properties.
xii. Documentation: Create comprehensive documentation for your geodatabase
schema, including data dictionaries, entity-relationship diagrams, and
metadata.

Overall, creating a data schema is a complex process that requires careful planning,
analysis, and design. By following these steps, organizations can create a data
schema that is well-structured, efficient, and optimized for their specific data
management needs.
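
The entity, attribute, relationship, implementation, and documentation steps above can be sketched as follows. This is a simplified illustration using Python's built-in `sqlite3` module and the customer and order example from the steps. The `SCHEMA` layout and the helper names `build_ddl` and `data_dictionary` are assumptions made for this example, not part of any GIS product.

```python
import sqlite3

# Hypothetical schema specification: entity -> list of (column, type, constraint).
SCHEMA = {
    "customers": [
        ("customer_id", "INTEGER", "PRIMARY KEY"),
        ("name", "TEXT", "NOT NULL"),
        ("address", "TEXT", ""),
        ("phone", "TEXT", ""),
    ],
    "orders": [
        ("order_id", "INTEGER", "PRIMARY KEY"),
        # Relationship: each order belongs to one customer.
        ("customer_id", "INTEGER", "NOT NULL REFERENCES customers(customer_id)"),
        ("order_date", "TEXT", "NOT NULL"),
    ],
}


def build_ddl(schema):
    """Turn the specification into CREATE TABLE statements."""
    statements = []
    for table, columns in schema.items():
        cols = ", ".join(" ".join(part for part in col if part) for col in columns)
        statements.append(f"CREATE TABLE {table} ({cols})")
    return statements


def data_dictionary(conn, schema):
    """Read the implemented structure back as (table, column, type, not_null, is_key)."""
    rows = []
    for table in schema:
        info = conn.execute(f"PRAGMA table_info({table})")
        for _cid, name, col_type, notnull, _default, pk in info:
            rows.append((table, name, col_type, bool(notnull), bool(pk)))
    return rows


# Implement the schema, then document it from the database itself.
conn = sqlite3.connect(":memory:")
for statement in build_ddl(SCHEMA):
    conn.execute(statement)

dictionary = data_dictionary(conn, SCHEMA)
for row in dictionary:
    print(row)
```

Generating the data dictionary from the implemented database, rather than writing it by hand, keeps the documentation in step with the schema as it changes.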

3.9.3. Various types of schemas


In data management, a schema is a blueprint that defines the structure of a database
or data set. There are different types of schemas used for different types of databases,
including:

1. Physical schema: A physical schema describes how data is stored on a physical
storage device such as a hard drive. It specifies details such as file organization,
indexing, and partitioning.
2. Logical schema: A logical schema describes the structure of data in a database
independent of any physical storage considerations. It defines the relationships
between data elements, entities, and attributes.
3. Conceptual schema: A conceptual schema is a high-level view of the data in a
database. It provides a general description of the data and how it is organized. It is
often used as a starting point for developing a logical schema.
4. Database schema: A database schema is a description of the structure of a
database. It includes information about tables, columns, relationships, constraints,
and other database objects.
5. XML schema: An XML schema defines the structure of an XML document. It
describes the elements, attributes, and data types used in the document.
6. JSON schema: A JSON schema defines the structure of a JSON document. It
describes the properties, data types, and relationships used in the document.
7. Avro schema: An Avro schema describes the structure of data serialized with
Apache Avro. Avro schemas are themselves written in JSON.

These are some of the most common types of schemas used in data management.
Choosing the right type of schema for a particular data set or database is important
to ensure data integrity, security, and optimal performance.
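
To show what two of these schema types look like in practice, the sketch below describes the same record once as a JSON schema and once as an Avro schema. The record name `Parcel` and its fields are invented for illustration. The `check_record` function is a deliberately tiny stand-in that checks only required keys and two primitive types; it is not a full JSON Schema validator, and real projects would normally use a dedicated validation library.

```python
import json

# JSON schema describing a parcel record (illustrative field names).
json_schema = {
    "type": "object",
    "properties": {
        "parcel_id": {"type": "integer"},
        "land_use": {"type": "string"},
    },
    "required": ["parcel_id", "land_use"],
}

# The same record described as an Avro schema, which is itself a JSON document.
avro_schema = json.loads("""
{
  "type": "record",
  "name": "Parcel",
  "fields": [
    {"name": "parcel_id", "type": "long"},
    {"name": "land_use", "type": "string"}
  ]
}
""")

# Mapping from the two JSON schema type names used here to Python types.
PY_TYPES = {"integer": int, "string": str}


def check_record(record, schema):
    """Minimal check: required keys present and primitive types correct."""
    errors = [f"missing: {key}" for key in schema["required"] if key not in record]
    for key, rule in schema["properties"].items():
        if key in record and not isinstance(record[key], PY_TYPES[rule["type"]]):
            errors.append(f"wrong type: {key}")
    return errors


good = check_record({"parcel_id": 100, "land_use": "RES"}, json_schema)
bad = check_record({"parcel_id": "100"}, json_schema)
avro_fields = [f["name"] for f in avro_schema["fields"]]

print(good)
print(bad)
print(avro_fields)
```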

In Geographic Information Systems (GIS), a schema is a way of organizing and
defining the structure of spatial and attribute data in a GIS database. Here are some
common types of schemas used in GIS:

1. Geodatabase schema: A geodatabase schema is a set of rules that defines the
structure of a geodatabase, including the feature classes, tables, and relationships
between them. It is used to ensure that the data is organized in a consistent and
efficient manner.
2. Shapefile schema: A shapefile schema defines the structure of a shapefile, which is
a popular file format used for storing spatial data in a GIS. It includes information
about the features, attributes, and spatial reference of the data.
3. Metadata schema: A metadata schema is a set of rules that defines how
metadata is organized and stored for GIS data. It includes information about the
data source, quality, accuracy, and other important information that helps users
understand and use the data effectively.
4. KML schema: A KML schema defines the structure of a KML file, which is used for
displaying geographic data in Google Earth and other mapping applications. It
includes information about placemarks, polygons, lines, and other types of
geographic features.
5. GeoJSON schema: A GeoJSON schema defines the structure of a GeoJSON file,
which is a popular format for encoding geographic data in JSON format. It includes
information about the features, their geometries, and their properties. The current
specification (RFC 7946) fixes the coordinate reference system to WGS 84.
6. GML schema: A GML schema defines the structure of a GML file, which is a
standard format for encoding geographic data in XML format. It includes
information about features, attributes, and spatial reference of the data.

These are some of the most common types of schemas used in GIS. Choosing the right
type of schema for a GIS database is important to ensure that the data is organized in
a consistent and efficient manner and can be easily shared and used by others.
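
As a small illustration of the GeoJSON structure described above, the sketch below builds a feature collection with Python's built-in `json` module and checks that each feature has the expected members. The helper names `make_feature` and `looks_like_feature`, the sample coordinates, and the property values are assumptions for this example. The check covers only the top-level members of a feature and is not a complete validator.

```python
import json


def make_feature(lon, lat, properties):
    """Build a GeoJSON point feature; coordinates are in longitude, latitude order."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


def looks_like_feature(obj):
    """Check only the top-level members that every GeoJSON feature carries."""
    return (
        isinstance(obj, dict)
        and obj.get("type") == "Feature"
        and "geometry" in obj
        and "properties" in obj
    )


collection = {
    "type": "FeatureCollection",
    "features": [make_feature(28.0473, -26.2041, {"name": "Sample point"})],
}

# Write the collection to text and read it back, as a file exchange would.
text = json.dumps(collection)
parsed = json.loads(text)
all_valid = all(looks_like_feature(f) for f in parsed["features"])

print(all_valid)
print(parsed["features"][0]["geometry"]["coordinates"])
```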

FA. Activity 3.1: Schema creation in ArcGIS Pro

FA Activity 3.2. Schema creation in ArcGIS Pro

Create a schema for a dataset of your choice in ArcGIS Pro that will be stored in a File
Geodatabase.
