Database Design and Modeling With PostgreSQL

This document provides an overview of database design and modeling with PostgreSQL. It discusses relational databases and their components like tables, columns, and rows. It also covers normalization, which is the process of structuring data to avoid duplication and anomalies to ensure data integrity and efficiency. The document outlines steps for conceptual, logical, and physical data modeling including identifying entities, attributes, relationships and designing database schemas. It aims to help readers understand how to properly structure relational databases and plan database designs.

Database Design & Modeling with PostgreSQL

Dinesh Asanka
Twitter: @dineshasanka74
[email protected]
Table of Contents
Chapter 1: Overview of PostgreSQL Relational Databases 1
Understanding relational databases 2
File-based databases 2
Hierarchical database 3
Document database 4
Relational database 6
Introduction to PostgreSQL 7
Installation and configuration 7
Limitations in PostgreSQL 9
Understanding tables, columns and rows 10
Introduction to constraints 16
NOT NULL 17
PRIMARY KEY 18
UNIQUE 19
CHECK 20
FOREIGN KEY 22
Exclusion Constraints 24
Deferrable Constraints 25
Fieldnotes 25
Summary 26
Questions 27
Further Reading 28
Chapter 2: Building Blocks Simplified 29
Identifying Entities and Entity-Types 30
Entity sets 30
Strong and Weak Entities 36
Introduction to Attributes 37
Attribute Types 38
Representation of Attributes and Entities 40
Identifying Entities and Attributes 41
Identifying main Entity Types 43
Identifying Attributes of Entities 44
Generalization and differentiation of Entities 49
Naming Entities and Attributes 49
Assembling the building blocks 50
Building block rules 50
Each table should represent one and only one entity-type 51
All columns must be atomic 51
Columns cannot be multi-valued 51
Summary 52
Exercise 52
Questions 53
Chapter 3: Planning a Database Design 56
Understanding Database Design Principles 57
Usability 58
Extensibility 58
Integrity 59
Performance 60
Availability 61
Read-only and deferred operations 61
Partial, transient, or impending failures 62
Partial end-to-end failure 62
Security 62
Introduction to the database schema 63
Online Transaction Processing 64
Online Analysis Processing 64
OLTP versus OLAP 64
Choosing Database design approaches 65
Bottom-up design approaches 65
Top-down design approaches 66
Centralized design approach 66
De-centralized design 67
Importance of data modeling 67
Quality 67
Cost 68
Time to market 68
Scope 68
Performance 69
Documentation 69
Risk 69
Role of databases 69
Storing Data 70
Access Data 72
Secure Data 75
Common database design challenges 75
Data security 76
Performance 76
Data accuracy 77
High availability 77
Summary 78
Questions 79
Chapter 4: Representation of Data Models 82
Introduction to Database Design 83

Critical Success Factors in Database Design 83


Creating Conceptual Database Design 85
Local Conceptual Design 85
Identifying Entity Types 86
From Nouns or Noun-Phrases 86
Objects 86
Challenges in Identifying Entity Types 86
Documentation 87
Identification of Relationship Types 88
Binary Relationships 88
Complex Relationships 89
Recursive Relationships 90
Identify and Associate Attributes with Entity or Relationship Types 90
Troubles when Identifying Attributes 92
Documentation 93
Determining Attribute Domains 93
Determine Candidate and Primary Keys 93
Validate the Model for Redundancy 94
Validate Local Conceptual Model Against User Transactions 95
Review the Local Conceptual Model with the Users 95
Define Logical Data Model 95
Derive Relations for Local Logical Data Model 96
Validate Relations using Normalization 96
Validate Relations Against User Transactions 96
Define Integrity Constraints 97
Review Local Logical Data Models with Business Users 98
Merge Local Logical Data Models into a Global Logical Model 99
Validate Global Logical Data Models 99
Review the Global Logical Data Model with Business Users 100
Define the Semantic Data Model 100
E-R Modeling 101
Problems with E-R Model 105
Fan Traps 105
Chasm Traps 107
Defining Physical Data Model 108
Translate Global Logical Data Model 109
Design Base Relations 111
Design Representation of Derived Data 113
Design Enterprise Constraints 113
Design Physical Representation 113
Define User Views 114
Define Security Mechanism 114
Summary 115
Exercise 116
Questions 116
Further Reading 118


Chapter 5: Applying Normalization 119


History and Purpose of Normalization 120
Data Redundancy 120
Operation Anomalies 121
Insertion Anomalies 121
Modification Anomalies 122
Deletion Anomalies 122
Determining Functional Dependencies 123
Steps for Normalization 125
First Normal Form 125
Second Normal Form 127
Third Normal Form 129
Transitive Dependency 130
Applying 3NF 130
Boyce-Codd Normal Form 132
Identifying Candidate Key 133
Identifying Functional Dependencies 133
Fourth Normal Form 135
Multi-Valued Dependency 135
Applying 4NF 136
Exceptions for 4NF 137
Fifth Normal Form 138
JOIN Dependency 138
Applying 5NF 139
Exceptions for 5NF 140
Domain-Key Normal Form 140
De-Normalization of Data Structures 141
Normalization Cheat Sheet 142
Summary 143
Questions 144
Exercise 145
Further Reading 146
Chapter 6: Table Structures 147
Choosing Database Technology 148
Selecting Data Types 148
Character Data Types 149
Collation Support 151
Numerical Data Types 151
Serial Data Types 152
Monetary Data Types 153
Date/Time Data Types 154
Boolean Data Types 155
Designing Master Tables 156
Additional Serial Column as Primary Key 156
Introducing Create and Update Columns 157


Overwriting Master Tables 158


Designing Historical Master Tables 159
Row Based Historical Master Tables 160
Column Based Historical Master Tables 163
Hybrid Historical Master Tables 164
Designing Transaction Tables 166
High Data Volume 166
Partitioning High Data Volume Transaction Tables 166
High Transaction Velocity 167
Data Integrity 167
Sample Transaction Tables 167
Calculated Columns versus Processed Columns 169
Updatable Transaction Tables 169
Designing Reporting Tables 170
Designing Audit Tables 170
Maintaining Integrity between different tables 173
Foreign Key Constraints 173
Summary 175
Questions 176
Exercise 179
Further Reading 179
Chapter 7: Working with Design Patterns 180
Understanding Design Patterns 181
Identifying Issues in Software Patterns 181
Selection of Wrong Design Pattern 181
Implementing a Design Pattern without any Modifications 181
Outdated Design Patterns 182
Common Database Design Patterns 182
Data Mapper 182
Insert Mapper 184
Update Mapper 188
Delete Mapper 189
Unit of Work 190
Lazy Loading 191
Avoiding Many-to-Many Relationships 192
OLAP versus OLTP 196
Common Issues in Design Patterns 200
Database Technology 201
Data Volume 201
Data Growth 201
Avoiding Database Design Anti-Patterns 201
Implementing Business Logic in Database Systems 201
Replace NULL values with Empty Values 202
Triggers 203
Eager Loading 203


Recursive Views 203


Summary 204
Exercise 204
Questions 205
Further Reading 206
Chapter 8: Working with Indexes 208
Non-Functional requirements of Databases 210
Performance 210
Security 211
Scalability 212
Availability 212
Recoverability 214
Interoperability 214
Indexes In Detail 215
B-Tree 217
Hash 219
GiST 219
Generalized Inverted Index 219
SP-GiST 220
Block Range Index (BRIN) 220
Implementing Indexes in PostgreSQL 220
Clustered Index 221
Clustered Index Best Practices 221
Clustered Index Scenarios 222
With and Without Index 223
Different WHERE Clause 226
Interchanging Where Clause 227
Interchanging SELECT Clause 228
Order Clause 229
AND OR in Where Clause 230
Additional Columns in SELECT Where Clause 231
Non-Clustered Index 232
Complex Queries 235
Multiple Joins 236
GROUP BY Condition 238
INCLUDE Indexes 238
What is Fill Factor 241
Disadvantages of Indexes 242
Performance of Insert Queries 243
Storage 243
Maintaining Indexes 243
Different types of Indexes 245
Filtered Index 245
Column Store Index 245
Field Notes 245


Summary 247
Questions 248
Exercise 250
Further Reading 250
Chapter 9: Designing a Database with Transactions 251
Understanding the Transaction 252
Definition of Transaction 252
ACID Theory 256
Atomicity 256
Consistency 257
Isolation 257
Durability 258
CAP Theory 258
Consistency 259
Availability 259
Partition Tolerance 260
Transaction Controls 260
Transactions in PostgreSQL 262
Isolation Levels 265
Read Committed 265
Read Uncommitted 266
Repeatable Read 266
Serializable 267
Designing a database using transactions 268
Running Totals 269
Summary Table 270
Indexes 271
Field Notes 272
Summary 272
Questions 273
Exercise 274
Further Reading 274
Chapter 10: Maintaining a Database 276
Role of a designer in Database Maintenance 277
Implementing Views for better maintenance 277
Sample Views in PostgreSQL 278
Creating Views in PostgreSQL 281
Performance aspects in Views 283
Using Triggers for design changes 285
Introduction to Trigger 285
Triggers as an Auditing Option 286
Triggers to Support Table Partition 289
Modification of Tables 291


Adding a Column 292


Adding a Table 292
Maintaining Indexes 294
Duplicate Indexes 295
Unused Indexes 295
Handling of other maintenance tasks 297
Backup 297
Reindex 297
Summary 297
Questions 298
Further Reading 300
Chapter 11: Designing Scalable Databases 301
Introduction to Database Scalability 302
Data 302
Requests 303
Selection of Scalability Methods 304
Introduction to Vertical Scalability 305
Table Partitioning 307
Trigger Method 307
Partition By Method 308
Vertical Partition 311
Advantages in Vertical Scalability 312
Disadvantages in Vertical Scalability 313
Understanding Horizontal Scalability 313
Advantages in Horizontal Scalability 315
Disadvantages in Horizontal Scalability 315
In-Memory Databases for High Scale 316
Design Patterns in Database Scalability 318
Load Balancing 318
Master-Master or Multi-Master 318
Connection Pooling 320
Modern Scalability in Cloud Infrastructure 320
Field Notes 321
Summary 322
Questions 323
Further Reading 324
Chapter 12: Securing a Database 325
Why security is important in database design 326
How Data Breaches have Occurred 327
Implementing Authentication 329
Password Policies 330
Creating a User in PostgreSQL 331
Implementing Authorization 335


Roles 335
Conflicting of Privileges 336
Row Level Security 337
Providing Authorization in PostgreSQL 338
GRANT Command in PostgreSQL 340
Using Views as a Security Option 341
Avoiding SQL Injection 342
What SQL Injection can do 343
Preventing SQL Injection Attacks 343
Encryption for Better Security 344
Data Auditing 346
Best Practices in Data Auditing 346
Best Practices for Security 347
Field Notes 348
Summary 349
Questions 349
Further Reading 350
Chapter 13: Distributed Databases 352
Why Distributed Databases? 353
Properties of Distributed Databases 354
Advantages in Distributed Databases 355
Disadvantages in Distributed Databases 357
Designing Distributed Databases 359
Full Replication Distribution 359
Partial Replication Distribution 362
Full Replication 363
Horizontal Fragmentation 363
Vertical Fragmentation 364
Hybrid Fragmentation 365
Implementing Sharding 367
Transparency in Distributed Database Systems 370
Distribution Transparency 370
Transaction Transparency 371
Performance Transparency 372
Twelve Rules for Distributed Databases 372
Challenges in Distributed System 373
Field Notes 374
Summary 376
Questions 377
Further Reading 378
Chapter 14: Case Study - LMS using PostgreSQL 379
Business Case 380
Educational Institute 380


Business Actors in LMS 381


Course in LMS 382
Student Discussions 383
Auditing Requirements 383
Planning a Database Design 383
Building the Conceptual Model 384
Identification of Entity Types 384
Identification of Relationship Types 385
Identify and Associate attributes with entity or relationship types 390
Documentation 392
Applying Normalization 392
Applying First Normal Form 393
Applying Second Normal Form 394
Breaking Multi-Value Attributes 396
Table Structures 397
Distributed Databases 397
Proposed Tables & Relations 401
Including Composite Data Types 401
Associated Tables for Lecturer Table 407
Database Design Enrollment Process 411
Important points in the Database Design 417
Indexing for Better performance 418
Other Considerations 422
Security 422
Auditing 422
High Availability 422
Scalability 423
Data Archival 423
Summary 423
Exercise 424
Further Reading 424
Chapter 15: Overcoming Database Design Mistakes 425
Why Mistakes are Important? 426
Poor or No Planning 426
Ignoring normalization 427
Neglecting the Performance Aspect 427
Poor naming standards 428
Less Priority for Security 430
Usage of Improper Data Types 430
Lack of documentation 431
Not using SQL facilities to protect data integrity 433
Not using Procedures to access data 433
Trying to build Generic Objects 434


Lack of Testing 434


Other Common Mistakes 435
Summary 435
Questions 436
Index 438

1
Overview of PostgreSQL
Relational Databases
Relational databases are the most common type of database among the available database technologies. Among the many products that implement relational databases, PostgreSQL is one of the most popular choices. Before using any technology, it is important to understand its capabilities, the scale at which it is used in the industry, and the relevant best practices. Since this book is based on PostgreSQL, we need to understand the capabilities of the tool. Apart from the tool itself, there are a few important aspects of database design, such as Primary Keys, Unique Keys, Check Constraints, and Foreign Key constraints, which need to be discussed.

This chapter provides an overview of PostgreSQL with respect to relational databases. Further, it discusses the basic elements of tables, such as columns and rows.

The following topics will be covered in this chapter:

Understanding relational databases


Introduction to PostgreSQL
Understanding tables, columns, and rows
Introduction to constraints
Fieldnotes

The reader requires a basic understanding of IT systems and of how databases fit into those systems. Also, the reader should have an understanding of the different processes of development methodologies.

Understanding relational databases


A relational database is a collection of multiple tables. These tables are related to other tables through relationships. A database stores a collection of data; typically, it is essential to store data as well as read it back when necessary.

Storage is an essential element of any system. However, there are multiple options for system designers to choose from, depending on the system requirements. Historically, there have been a few popular options, such as file-based, hierarchical, document, and, most importantly, relational databases, which we will learn about in this section.

Let us look at a contact list of a mobile phone which is a good example of a database. In the
contact list, you want a contact to be stored and retrieved when necessary. When accessing
the phone book, you don't have to traverse through the entire contact list. Instead, by
typing the first couple of letters of the contact, you will be able to retrieve the required data.

Relational algebra and relational calculus are some advanced topics that
are related to relational queries. Since relational queries are out of scope
for this book, relational algebra and relational calculus topics will not be
discussed in this book.

Though we will be focusing more on relational databases in this book, it is better to understand the different types of databases that are around. Since relational databases may not be suited to all business scenarios, it is better to understand the other types of databases and their usages.

File-based databases
You don't always need fancy technologies to save your data; you can use simple text-based files instead. This is equivalent to a manual system where all data is kept in separate files. Though this type of database is ideal for a small set of data, it may not be scalable when the data volume increases. Furthermore, with a growing number of objects, file-based databases do not remain a feasible option; therefore, this type of database is not commonly used.

Common Business-Oriented Language (COBOL) databases are an example of a file-based database, and the following shows a sample COBOL data file definition:
FD  CUSTOMER-FILE.
01  CUSTOMER-RECORD.
    03  CUSTOMER-KEY.
        05  CUSTOMER-CODE        PIC X(08).
    03  CUSTOMER-NAME            PIC X(36).
    03  CUSTOMER-ADDRESS.
        05  CUSTOMER-ADDRESS-I   PIC X(25).
        05  CUSTOMER-ADDRESS-II  PIC X(25).
        05  CUSTOMER-ADDRESS-III PIC X(25).
    03  CUSTOMER-OPBAL           PIC S9(14)V99.
    03  CUSTOMER-STATUS          PIC X(1).

The customer file definition in COBOL is displayed in the preceding code. CUSTOMER-CODE has a length of 8 and can be filled with any alphanumeric characters. The customer address has three parts, where each part has a length of 25 characters. In this file mechanism, the user has the option of calling CUSTOMER-ADDRESS directly. When CUSTOMER-ADDRESS is called, all of its sub-elements are returned automatically. CUSTOMER-OPBAL is a numeric field with 14 digits and two decimals. Only numeric fields can be used to perform mathematical calculations in the COBOL file system.

The popular CSV, TSV, and XML file formats are some other examples of file-based databases. These types of databases provide limited indexing capabilities. In addition to these file types, different vendors have developed their own file-based formats, hence there are a lot of incompatible file formats. Another disadvantage of this database type is that it runs into a lot of data integrity issues, since relationships cannot be created within the files. Instead, relationships have to be implemented in the applications.

There are also a few challenges in implementing different security levels for file-based databases. Due to those issues, the file-based database has become a legacy technology and is not used much in the industry today. In spite of these limitations, file databases are still used internally by some computer applications to store data such as configurations.

Hierarchical database
The hierarchical database model is a model where data is organized in a tree structure. In
the hierarchical databases, the parent-child relationship is maintained.

The following screenshot is an example that shows how a hierarchical database is structured:


In the above data model, it can be observed that Invoice is a child of the parent Customer. In this example, both the IN001 and IN002 invoices are children of customer C001.

Hierarchical databases are used to store organizational structures and folder structures, as those have a hierarchy built into them. In engineering, applications such as electricity flow and water pipe flow can also be considered hierarchical structures, hence they are other examples of hierarchical databases. In the case of electrical power flow, power flows from one point to other points, and there can be multiple roots. In order to model power and water flows, hierarchical databases need to support multiple roots.

The major disadvantage of the hierarchical data model is its limited flexibility. For example, in the hierarchical model, it is difficult to change the structure after a particular structure has been defined. In the example mentioned in this section, if one payment is used to settle multiple invoices, the above model will not be suitable. In the above model, mostly pre-defined queries are supported. However, when there is a need for ad-hoc queries, or when structural changes are required, this model may not be well suited.

Document database
Though file-based databases and hierarchical databases are not much favoured in the industry today, document databases are becoming popular due to the limitations of relational databases. Document databases are part of the NoSQL family of databases.

In the following table, we can see the categories of NoSQL databases:


Database Type    Databases
Key-value        Amazon DynamoDB, Redis, Oracle Berkeley DB 11g
Graph            Neo4j, InfiniteGraph, sones
Column           HBase, Cassandra, Riak
Document         MongoDB, CouchDB

The preceding table shows the popular databases for each category of NoSQL databases. Out of document, graph, column, and key-value databases, document databases are the most common in the industry. CouchDB and MongoDB are the most common document databases.

Relational databases are well suited for transactional or structured data. However, in the industry, there is more unstructured data than structured data. It is believed that in 2015, 88% of the data in the industry was in the form of unstructured data, accounting for 300 Exabytes in volume. Further, unstructured data follows an exponential growth that is far ahead of structured data. Therefore, document databases are used in the industry today to cater to the growing need to handle non-structured data.

The central idea of the document-oriented database is the notion of a document. Documents in a document database are similar to the general programming concept of an object. Rather than converting a document to the traditional relational format, it can be much easier to store the document as it is, which makes it easier for developers to save this data without distorting or destroying it. Documents are not required to adhere to a standard schema, nor will they all have the same sections, slots, parts, or keys.

In the following code block, we can see a sample document, which follows the JSON format:
{
    "CustomerCode": "CUS001",
    "Name": "Greg Wilsons",
    "Address": "15 Scarborough Street",
    "Interest": "riding"
}

In the following screenshot, we can see the implementation of a document database in MongoDB:


In the above figure, it can be seen that multiple attributes are stored in the same document. For the product, both types (accessory and case) are stored within the same document. In the case of a relational database, you might need to have a different table. Since everything is stored with the document, there are no relationships to traverse and performance is improved.

Relational database
In 1970, the computer scientist E. F. Codd proposed the relational model for data in a seminal paper, A Relational Model of Data for Large Shared Data Banks (https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf), which has become the foundation of relational databases. In the proposed relational model, data is modelled and logically structured into tables, which consist of columns and rows. Tables can have multiple constraints, such as primary keys and check constraints. These tables are related to other tables with foreign keys. Apart from those basic features, relational database management systems support features such as scalability, transactions, security, and so on.

Although there are many types of databases to choose from for a given solution, the relational database has a lot of advantages over the other database types, such as file-based, document, and hierarchical databases. The main capability of relational databases is the option of creating relationships between tables. Every table consists of columns and rows. When choosing a database technology, it is essential to understand what each of the different database technologies provides and how suitable they are for your problem.

There are a few tools that support relational databases. Oracle, SQL Server, and DB2 are the most common proprietary relational database tools, whereas MySQL and PostgreSQL are the most common open-source tools.

Since we will be explaining the design concepts using PostgreSQL, let us look at the basic


features of PostgreSQL.

Introduction to PostgreSQL
PostgreSQL, also known as Postgres, is an open-source relational database management system (RDBMS). In 1982, Michael Stonebraker at the University of California was the leader of the Ingres project, which led to PostgreSQL. Michael left the project, returned in 1985, and started a new project called Post-Ingres. In 1996, the project was renamed to PostgreSQL in order to reflect the support of SQL queries in the tool. The first PostgreSQL release was version 6.0 in 1997.

After this bit of history on PostgreSQL, it is important to know which organizations are using the tool in order to understand its capabilities. Here are a few organizations that use PostgreSQL: Apple, BioPharm, Etsy, IMDB, Macworld, Debian, Fujitsu, Red Hat, Sun Microsystems, Cisco, and Skype. This list indicates that PostgreSQL has rich capabilities to support scalable, large, and versatile data.

PostgreSQL is a free, open-source, cross-platform tool. This means that PostgreSQL can be installed on Linux, Microsoft Windows, Solaris, FreeBSD, OpenBSD, and Mac OS X. PostgreSQL follows the SQL:2011 or ISO/IEC 9075:2011 standard. PostgreSQL follows the ACID transaction properties like most popular relational databases; ACID stands for Atomicity, Consistency, Isolation, and Durability. Apart from typical tables, PostgreSQL supports Views, Functions, and Triggers as well. When designing a database, a designer can look at these different available options. Indexes are available in PostgreSQL to improve the performance of data retrieval. PostgreSQL uses Multiversion Concurrency Control (MVCC) as its concurrency method. Concurrency control is important when multiple users are accessing the same objects and resources. Data types are an important feature of any database; by using different data types, users can select the required data type for their use case. PostgreSQL supports the data types defined by SQL:2008.

Installation and configuration


For this book, we will be using PostgreSQL version 11.2 for Microsoft Windows. Installers for different versions of PostgreSQL and for different operating systems can be found at https://www.postgresql.org/download/. Microsoft Windows Server 2019, 2016, and 2012 R2 are compatible with PostgreSQL 11.2. To find out which Windows versions support which PostgreSQL versions, you can visit the following link: https://www.postgresql.org/download/windows/. When installing PostgreSQL, it prompts you to install pgAdmin as the client tool. However, you can download the latest pgAdmin, which will be used as the main client tool to connect to PostgreSQL, from the following link: https://www.pgadmin.org/download/.

When installing PostgreSQL, the superuser (user name: postgres) password needs to be provided, as shown in the following screenshot:

Always remember to set a complex password, as the postgres user has administrative privileges and can do anything on the database instance. Apart from the superuser password configuration, there is additional configuration for the PostgreSQL data directory and the port. As always, the data directory should be on a different drive, and it is recommended to change the port number. The default port number for PostgreSQL is 5432, and it is recommended to change it to a five-digit number so that it is difficult to guess. After installing PostgreSQL, there will be a default database named postgres with no tables in it.

After installing PostgreSQL, the next step is to install the client tool, which can be used to connect to PostgreSQL. As mentioned earlier, pgAdmin4 is installed.

When working with databases, it is better to have a database that has some meaningful data. Like AdventureWorks for SQL Server and sakila for MySQL, PostgreSQL also has a sample database, which can be downloaded from http://www.postgresqltutorial.com/postgresql-sample-database/. This sample database is a DVD rental database that consists of 15 tables. The Entity Relation (ER) model and the table descriptions can be found at the same link. The sample database can be restored by using the restore option in pgAdmin4 or from the command prompt. All the steps to restore the database can be found at http://www.postgresqltutorial.com/load-postgresql-sample-database/.


In the following example, the sample database is restored as SampleDB. Once the
SampleDB is restored, the database can be viewed in the pgAdmin4 client as follows:

Now the basic configuration of PostgreSQL is done, the sample database is in place, and you are ready to go ahead.

Let us look at the limitations in PostgreSQL in the following section.

Limitations in PostgreSQL
Before starting to design a database in PostgreSQL, it is important to understand the limitations of the tool, so that the necessary workarounds can be planned in the early stages of the design.

The following are the different limits in the PostgreSQL database:

Database Size (No Limit): There is no limitation on the database size; according to reports, there are PostgreSQL databases larger than 6 TB.
Table Size (16 TB): By default, PostgreSQL stores data in 8 KB chunks. A table can have a 32-bit signed integer number of chunks, which is around two billion. However, the chunk size can be modified to 32 KB, which means the table size can be 64 TB. In practice, it is rare to see tables larger than 4 TB, so 16 TB is more than enough.
Number of Rows in a Table (No Limit): You can have any number of rows per table in PostgreSQL.
Number of Indexes (No Limit): Though there is no limit on the maximum number of indexes per table, it is essential to keep a balanced number of indexes, as more indexes will decrease the performance of insert operations.
Field Size (1 GB): Though the limit for the field size is 1 GB, in practice the server memory will be the limit.
Number of Fields (250 - 1600): The exact limit depends on the data types used. In any case, this is a very large number, and it is difficult to imagine a scenario where you need this number of fields in a table.
Row Size (1.6 TB): Again, a very large number.
Looking at the limits above, it can be concluded that PostgreSQL has the capability to robustly hold a large volume of data.

Let us look at basic objects in the database in the following section.

Understanding tables, columns and rows


Tables, columns, and rows are the basic units of a relational database. In simple terms, a
database can be considered a collection of multiple tables. A table is a relation with columns
and rows. In the following screenshot we can see the Entity-Relation (ER) Diagram of the
DVD Rental Sample database which consists of fifteen tables:


Each table is represented as a two-dimensional object. For example, the film table has information about films and the customer table has information about customers.

In the following screenshot, we can see the sample data set for the film table. In this film table, the film title, description, and release year are stored:


Similarly, the following screenshot shows the sample data set for the customer table. In this
table, basic customer details such as first name, last name and email are stored:

Both the customer and film tables are master tables, which means these are the base records for day-to-day transactions. Similarly, the staff and country tables also fall into the category of master tables.

When a customer rents a film, the rental table is updated with the necessary customer and inventory data. The sample data set for the rental table is shown in the following figure.


The rental table gets data for each rental. For each transaction, or DVD rental, this table is updated. Since a transaction table receives data whenever a transaction occurs, these tables tend to grow rapidly. The rental table has relations to master tables such as staff, inventory, and customer, as well as some transaction-relevant dates.

Each table has multiple attributes, which are named columns, and in the following screenshot we can see the attributes/columns:

Each attribute has its own data type, which will be discussed in detail in Chapter 6, Table Structures. For the discussion on columns, let us examine the data types for each column from a script.
CREATE TABLE public.customer
(
    customer_id integer NOT NULL DEFAULT nextval('customer_customer_id_seq'::regclass),
    store_id smallint NOT NULL,
    first_name character varying(45) COLLATE pg_catalog."default" NOT NULL,
    last_name character varying(45) COLLATE pg_catalog."default" NOT NULL,
    email character varying(50) COLLATE pg_catalog."default",
    address_id smallint NOT NULL,
    activebool boolean NOT NULL DEFAULT true,
    create_date date NOT NULL DEFAULT ('now'::text)::date,
    last_update timestamp without time zone DEFAULT now(),
    active integer,
    CONSTRAINT customer_pkey PRIMARY KEY (customer_id),
    CONSTRAINT customer_address_id_fkey FOREIGN KEY (address_id)
        REFERENCES public.address (address_id) MATCH SIMPLE
        ON UPDATE CASCADE
        ON DELETE RESTRICT
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

Let us only look at the column data types for the moment, as the other details will be discussed later in this chapter and in other chapters. The first_name column, which is created to store the first name of the customer, has a length of 45 characters. This means that the first_name column can have a maximum length of 45 characters. create_date has the date data type. By using the date data type, you get all the features of dates for free. As you are aware, the date type has a lot of behaviour attached to it. For example, some months end on the 30th and some months end on the 31st. To complicate this, in leap years February ends on the 29th, while in non-leap years it ends on the 28th. To complicate the scenario further, the leap year definition itself is not very simple. Imagine if you had to implement all of this yourself. To avoid this complication, you simply need to select the correct data types. Apart from this, important date calculations, such as adding an interval to an existing date or finding the interval between two dates, can be done easily only if you use the date data type. If not, you will run into major performance and implementation issues.
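
As a quick illustration, the following queries sketch this kind of date arithmetic against the sample customer table (the alias names are only for illustration):
SELECT customer_id,
       create_date,
       create_date + INTERVAL '30 days' AS follow_up_date,    -- adding an interval to a date
       CURRENT_DATE - create_date       AS days_since_created -- number of days between two dates
FROM public.customer
LIMIT 5;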

The activebool column specifies whether the customer is an active customer or not. It can hold only two values, true or false; therefore, the boolean data type is selected. An important configuration here is the setting of a default value. This means that, in case no value is specified, the value true is stored for that customer by default. If you want to store the value false in a record, then you need to specify the value false explicitly.

Also, there are NOT NULL constraints on some columns, which means they are compulsory columns. In this table, address_id is NOT NULL, which means that address_id is compulsory and cannot be left as NULL.


A row, or a record, is an instance of a relation or a table. This means that a table is a collection of one or many rows. In the following screenshot, we can see the set of rows in the city table of the same example database:

In this table, city_id 3 is the instance for the city of Abu Dhabi. This instance is linked to another table's record: in the Abu Dhabi record, country_id is 101. The following screenshot shows the country table with the relevant record:


The relevant record for the city of Abu Dhabi is the country_id of 101, which is the United Arab Emirates. This means that the data of a record can be split across multiple tables. We will look at this design aspect in Chapter 6, Table Structures.
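
To make the link concrete, the following query joins the two tables to resolve the country for Abu Dhabi (the column names are those of the sample database):
SELECT ci.city_id,
       ci.city,
       co.country
FROM public.city AS ci
JOIN public.country AS co ON co.country_id = ci.country_id  -- follow the foreign key
WHERE ci.city_id = 3;                                       -- Abu Dhabi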

Depending on the constraints defined on the attributes, which will be discussed in the next section, the rows will have values according to the rules defined in those constraints.

Introduction to constraints
Table constraints are implemented to enforce limits on the data that can be inserted into, updated in, or deleted from a table. By introducing table constraints, data integrity can be achieved.

The following are the common constraints used in tables:

NOT NULL
PRIMARY KEY
UNIQUE
CHECK
FOREIGN KEY
EXCLUDE

In file-based databases, constraints cannot be implemented at the database object level. Hence, constraints are implemented at the application level in a file-based database system. This makes for a fat client, which leads to higher implementation complexity and maintenance difficulties. In addition, when the system is integrated with other sources and systems, it is much better if the table constraints are implemented at the database level. If not, data integration issues will occur when data is written to file-based systems via third-party applications and systems.

NOT NULL
NULL is a special indicator in the database world to indicate that a value does not exist. It is important to note that NULL is not equivalent to empty. Further, NULL is not equal to NULL, as NULL indicates the state of an attribute and not a value.

When defining a column in a table, you have the option of setting the NOT NULL property in PostgreSQL, as shown in the following screenshot:

In the above example, CustomerName is a NOT NULL column while the Remarks column is a nullable column. This means that for every row, CustomerName should have a value, whereas the Remarks column is optional.
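
For reference, a minimal sketch of how such a table could be defined in SQL is shown below; the table and column names are assumptions based on the error messages used in this chapter:
CREATE TABLE "SampleTableConstraints"
(
    "CustomerID"   integer,
    "CustomerName" character varying(50) NOT NULL,  -- compulsory column
    "Remarks"      character varying(50)            -- nullable (optional) column
);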

If you try to insert a record without a value for CustomerName, that insertion will fail with the following error:
ERROR: null value in column "CustomerName" violates not-null constraint
DETAIL: Failing row contains (2, null, A).
SQL state: 23502

Therefore, the NOT NULL constraint ensures that compulsory columns are given a value when a record is inserted.


The NOT NULL constraint is applicable not only at the time of table creation or when adding a new column, but also at the time of column modification. If a column needs to be modified to a NOT NULL column, before modifying it you need to make sure that this column has a value for all records. If not, you will not be allowed to add the NOT NULL constraint.

PRIMARY KEY
Selecting a proper primary key is the most critical step in a table design. The primary key is
a column or combination of multiple columns that can be used to uniquely identify the
row. Though the Primary Key is an optional constraint in a table, the table needs a
primary key to ensure the row-level accessibility to the table. You can only have one
Primary Key for a table.

The Primary Key can be defined and configured in PostgreSQL as follows:

In the above example, CustomerID is the Primary Key, which is the column you can use to distinguish the records.
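
The equivalent SQL for this configuration is sketched below, using the same assumed table; the constraint name matches the error message that follows:
ALTER TABLE "SampleTableConstraints"
    ADD CONSTRAINT "SampleTableConstraints_pkey" PRIMARY KEY ("CustomerID");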

A Primary Key has two properties:

Values in the column or columns which are defined as the Primary Key should be unique.
The Primary Key column or columns cannot be nullable.

When defining a Primary Key, you do not have to set it as a NOT NULL column. As soon as the Primary Key is defined, those columns automatically become NOT NULL columns.

If a user tries to insert a duplicate value into the primary key column, it will fail with the following error:
ERROR: duplicate key value violates unique constraint
"SampleTableConstraints_pkey"
DETAIL: Key ("CustomerID")=(2) already exists.
SQL state: 23505


There can be multiple candidates for a Primary Key. In the above example, either CustomerID or CustomerName could be considered as the Primary Key, as both can be used to identify the customer. However, when selecting a Primary Key, it is recommended for performance reasons to select a shorter column. Typically, we do not choose lengthy string columns or date columns for the primary key.

UNIQUE
The UNIQUE constraint ensures that all values in a column are different. A PRIMARY KEY constraint automatically has a UNIQUE constraint behind it. Unlike the Primary Key, you can have any number of UNIQUE constraints per table, and a UNIQUE constraint is nullable.

Let us look at the same example of the customer table in the following screenshot. In the customer table, CustomerID is the PRIMARY KEY. However, we know that the customer name should also be unique. In organizations, most of the time, there are duplicate values, which cause a lot of maintenance issues. If the customer name is not set to unique, there can be multiple rows for the same customer. Ultimately, you will end up dividing transactions across multiple customer records when, in reality, they belong to the same customer.

To avoid this issue, we can set the customer name as a unique key, which can be defined in PostgreSQL as shown in the following screenshot:


In this UNIQUE constraint, CustomerName is selected as the unique column. The deferrable concept will be explained later in this chapter.
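
The equivalent SQL is sketched below; the constraint name matches the error message that follows:
ALTER TABLE "SampleTableConstraints"
    ADD CONSTRAINT "UNQ_Customer_Name" UNIQUE ("CustomerName");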

When a duplicate entry is inserted to a table, the insertion will fail with the following error
message:
ERROR: duplicate key value violates unique constraint "UNQ_Customer_Name"
DETAIL: Key ("CustomerName")=(John Muller) already exists.
SQL state: 23505

Similar to Primary Key constraints, when enabling a UNIQUE constraint on a column of a table that already has some data, the column should not contain any duplicate data. If there are duplicates, the unique constraint creation will fail. In addition to being a data integrity feature, UNIQUE constraints also provide a performance benefit.


CHECK
CHECK constraints are defined to ensure the validity of data in a database table and to provide data integrity in the database. In any business, there is business logic that is specific to that business. For example, the age of an organization's employees should be in the range of 18 to 55. As another rule, the sales commission should not exceed some percentage of the salary. These rules can be implemented in the application, but that leads to a complicated client and, in turn, to data integrity issues. Therefore, it is much better if they can be implemented in the database by means of CHECK constraints.

In PostgreSQL, check constraints can be implemented for a table as shown in the following screenshot. In this example, the Age column should be in the range of 18 to 55:

There can also be check constraint rules that combine multiple columns. For example, the commission should be less than 25% of the salary, as shown in the following screenshot:
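
In SQL, the two rules could be sketched as follows on a hypothetical "SampleTable" with Age, Salary, and Commission columns; the money columns are cast to numeric for the calculation, and the second constraint name is taken from the error message below:
ALTER TABLE "SampleTable"
    ADD CONSTRAINT "CHK_Age" CHECK ("Age" BETWEEN 18 AND 55);

ALTER TABLE "SampleTable"
    ADD CONSTRAINT "CHK_Salary_Commion"
    CHECK ("Commission"::numeric < "Salary"::numeric * 0.25);  -- commission below 25% of salary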


CHECK constraints are validated when data is inserted or updated in the table. In case the data violates the check constraints, the following error is generated:
ERROR: new row for relation "SampleTable" violates check constraint "CHK_Salary_Commion"
DETAIL: Failing row contains (3, John Muller, 19, $2,500.00, $5,000.00).
SQL state: 23514

Similar to other constraints, if a CHECK constraint is added to a table that already contains data, the existing data has to satisfy the rule.

FOREIGN KEY
In a database, there are multiple tables and there are relationships between them. These relationships should be maintained in order to preserve data quality. FOREIGN KEY constraints are used to maintain this data quality. If FOREIGN KEYs are not present, DELETE and UPDATE commands can break the relationships between the database tables. Also, the FOREIGN KEY constraint will prevent users from entering invalid data that would violate the rules of the table relationship. For example, country_id in the city table is linked with country_id in the country table, and that relationship is built using PostgreSQL as shown in the following screenshot:
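
As a sketch, the same relationship could be declared in SQL as follows; the constraint name is taken from the error message below:
ALTER TABLE city
    ADD CONSTRAINT fk_country FOREIGN KEY (country_id)
    REFERENCES country (country_id);  -- every city.country_id must exist in country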


When this FOREIGN KEY constraint is violated by an insert or update, the data will not be inserted and an error will be generated. Let us look at it in the following code block:
ERROR: insert or update on table "city" violates foreign key constraint
"fk_country"
DETAIL: Key (country_id)=(1222) is not present in table "country".
SQL state: 23503

As said before, FOREIGN KEYs are introduced in order to maintain the relationships between tables. However, we know that this data is not static, as there will be updates and deletes. So, when data is updated or deleted, actions need to be taken in order to maintain data integrity.

In PostgreSQL, there are five actions that define what should happen on delete and update, as shown in the following screenshot:


In the following list, we can see the actions and their details:

NO ACTION: Generates an error indicating that the deletion(s) or update(s) would result in a violation of the foreign key constraint. This is the default action.
RESTRICT: Generates an error indicating that the deletion(s) or update(s) would result in a violation of the foreign key constraint. This is the same as NO ACTION, with the only exception being that this check is not deferrable.
CASCADE: When the CASCADE option is set, the referencing rows in the child table are deleted when the referenced row is deleted in the parent table.
SET NULL: The referencing column value is set to NULL.
SET DEFAULT: The referencing column value is set to its default value.
When setting the CASCADE option, it should be done with great care, as it may end up with many deletes, including recursive deletes. That also has a negative performance impact.

Exclusion Constraints
Exclusion constraints are a somewhat uncommon constraint in databases. All the constraints that we have discussed up to now are applied to a single row. For example, the check constraint on the Age column restricts the data being inserted into a row to the correct range. Now let us assume the following scenarios. A student can register for only four courses in one semester. In another example, in a banking scenario, you do not want your clients to withdraw money from two different ATMs within a given time period. Both these scenarios deal with multiple rows, and exclusion constraints are used to implement them.
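
As a minimal sketch (the table and column names are hypothetical), the banking scenario could be expressed with an exclusion constraint that prevents overlapping withdrawal sessions for the same client:
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed for equality on scalar columns in GiST

CREATE TABLE atm_withdrawal
(
    client_id       integer NOT NULL,
    withdrawal_time tsrange NOT NULL,
    -- no two rows may have the same client_id and overlapping time ranges
    EXCLUDE USING gist (client_id WITH =, withdrawal_time WITH &&)
);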

Deferrable Constraints
Deferrable and deferred are important settings for constraints. By default, when a row is inserted, constraints take effect immediately. If the data satisfies the rule, the data is inserted or updated; if not, the data is rejected with an error. However, sometimes you need to defer this check for a batch of transactions. Not all constraints are deferrable: only UNIQUE, PRIMARY KEY, FOREIGN KEY, and EXCLUDE constraints can be deferrable, while NOT NULL and CHECK constraints are always not deferrable. When creating any UNIQUE, PRIMARY KEY, FOREIGN KEY, or EXCLUDE constraint, you can set it to not deferrable, which means that data will be rejected whenever the relevant rules are violated. Even for a deferrable constraint, violating data can only be inserted while the constraint is deferred.
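
A sketch of how this could look, using hypothetical table names that mirror the field notes below: the foreign key is declared as deferrable and then deferred for the duration of a clean-up transaction:
ALTER TABLE purchase_order
    ADD CONSTRAINT fk_product_item FOREIGN KEY (product_item_id)
    REFERENCES product_item (product_item_id)
    DEFERRABLE INITIALLY IMMEDIATE;

BEGIN;
SET CONSTRAINTS fk_product_item DEFERRED;
-- re-point purchase orders and remove duplicate product items here
COMMIT;  -- the deferred constraint is checked only at commit time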

Let us look at a real world scenario regarding database constraints in the following section.

Fieldnotes
An organization has three branches around the globe. The branches raise purchase orders and, at the head office, these purchase orders are approved. Due to the nature of the business, there are more than a thousand product items. Since these are isolated branches, when ordering, if users find that a product item is not available, they create a new product item. After a while, it was found that, due to the ignorance of the users, there are a lot of duplicate records for the same item. To make things worse, purchase orders are raised for the same item against many different records in the system. This has resulted in errors in most of the reports, and the system has become unusable.

The following steps are taken to correct the data duplication issue in the system:

1. De-duplicate data: Though it is easy enough to de-duplicate the product item master table, a problem arises when deleting the duplicates, as those records are referred to by transaction tables such as purchase orders, invoices, and so on. Since there are FOREIGN KEY constraints to the product master table, all the relevant foreign key constraints were deferred. After deferring those foreign key constraints, the duplicates were removed in the product item master. The next step was to modify the transaction tables to point to the cleaned-up product item master.
2. Enforcing a UNIQUE Constraint: A simple UNIQUE constraint was implemented on the product item master, defining that the product item description must be unique.
3. FULL-TEXT Search Facility: Enforcing a UNIQUE constraint only solves half the problem. Many users misspell item descriptions, and the same item name may exist with a hyphen instead of a space; Item-A and Item A might be the same item, but the UNIQUE constraint will not catch that. Therefore, users are provided with a FULL-TEXT search facility (out of scope for this book), and before creating a product master record they should do a search and check whether the product item record already exists.
4. Periodic Audit: Whatever you do, there can be errors in the data. Therefore, periodic auditing and correction should be done.

After the above implementations, the data duplication issue was fixed to some extent. As the UNIQUE constraint is not an ultimate solution for data duplication, it needs to be complemented with a FULL-TEXT search facility and data auditing.

Summary
In this chapter, we started with understanding relational databases, comparing them with the other types of databases and their drawbacks. Then we were introduced to PostgreSQL; we learned how to install and configure it, and also learned about its limitations. Next, we briefly learned about tables, rows, and columns.

PostgreSQL, which is an open-source database tool, has many advantages over its counterparts. Notably, PostgreSQL's limits are very high, which means that even the most robust implementations can be done using PostgreSQL. Importantly, PostgreSQL is compatible with the most common operating systems. Different data types can be used when designing tables so that a proper design can be achieved. To facilitate data integrity, as well as performance in some instances, many constraints such as NOT NULL, PRIMARY KEY, UNIQUE, CHECK, FOREIGN KEY, and EXCLUDE can be implemented in PostgreSQL.

After setting up the environment and gaining an understanding of the basic rules, objects, and limitations of PostgreSQL, the next step is to understand the basic representation of database entities.

The next chapter will cover the basic building blocks of a database and how these building blocks can be represented.


Questions
What is the default port for PostgreSQL? This question verifies whether you have
a basic understanding of PostgreSQL databases.

The default port number is 5432.

What is the importance of specifying the default configuration for columns?

The default is specified to provide a default value for a column in a table. The default value will be added to all new records in the table if no other value is specified explicitly. For obvious values, like customer_active, true can be set as the default. In case you have a large number of boolean columns, it is much easier to specify default values rather than setting them explicitly. Another benefit comes at the time of adding a new column with a value: if no default is specified, you need to add the column and then update it with the required value; however, if you have a default configured, existing rows are given that value when the column is added.
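
For example, a hypothetical column added with a default (the column name is only for illustration):
ALTER TABLE public.customer
    ADD COLUMN newsletter_opt_in boolean NOT NULL DEFAULT true;  -- existing rows get true automatically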

When data is inserted into a table where a column is defined as a Primary Key, what will happen when a duplicate record is found in the middle of the data source? How can you overcome this issue in PostgreSQL?

When a bulk data set is being inserted into a table where a primary key is defined, and there is a duplicate in the middle of that data set, the insertion will fail immediately and the remaining records are not inserted. To avoid this, you can catch the unique_violation error using the EXCEPTION WHEN unique_violation THEN construct. The error can still be reported but, importantly, the remaining records are also inserted.
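
A minimal sketch of this approach, inserting row by row from a hypothetical staging table so that duplicates are skipped rather than aborting the whole load:
DO $$
DECLARE
    rec RECORD;
BEGIN
    FOR rec IN SELECT * FROM staging_customer LOOP
        BEGIN
            INSERT INTO "SampleTableConstraints" ("CustomerID", "CustomerName")
            VALUES (rec.customer_id, rec.customer_name);
        EXCEPTION WHEN unique_violation THEN
            -- report the duplicate and carry on with the remaining rows
            RAISE NOTICE 'Skipping duplicate CustomerID %', rec.customer_id;
        END;
    END LOOP;
END
$$;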

What is the difference between unique and primary key constraints?

The difference between a UNIQUE constraint and a Primary Key is that per
table you can have only one Primary Key however you can define several
UNIQUE constraints. Also, Primary Key constraints are not nullable while
UNIQUE constraints may be nullable.

What is the difference between CHECK and EXCLUDE constraints?

CHECK constraint evaluates a rule based on a single row of the table


whereas an EXCLUDE constraint evaluates a comparison of two rows in the
table.


How can we automatically delete records in the referencing table when the primary data is deleted?

When tables are created, FOREIGN KEY constraints can be enabled. By default, deletion is not possible when there is referencing data. However, if we set the constraint to CASCADE delete, all the referencing records will be deleted.

How to solve a data duplication issue in a real-world scenario?

This will be a very common database interview question as data duplication


is a huge challenge in the industry. Please refer to the field notes section for
the complete scenario and the answer. In summary, you need to implement
UNIQUE constraints and the FULL-TEXT search option.

Further Reading
You can check the following links to learn more about the topics we have just covered:

SQL:2011: https://en.wikipedia.org/wiki/SQL:2011

SQL:2008: https://en.wikipedia.org/wiki/SQL:2008

Full list of companies using PostgreSQL: https://stackshare.io/postgresql/in-stacks

Constraints: https://www.postgresql.org/docs/9.4/ddl-constraints.html

Document databases: https://www.mongodb.com/document-databases

2
Building Blocks Simplified
This chapter will help you gather the basic knowledge of database entities and attributes. It covers all the building blocks of a database, which are the basis for designing a database. We will also cover how these building blocks are identified, how they can be represented in a database, the best practices for naming them, and the rules that can help us work more efficiently. You will need an understanding of information system requirements and the role of the database in an information system.

In this chapter we will cover the following topics:

Identifying Entities and Entity-Types


Introduction to Attributes
Identifying Entities and Attributes
Naming Your Entities
Assembling the Building Blocks
Building Block Rules

Identifying Entities and Entity-Types


An Entity is an object in the real world that is separately and uniquely identified from all
the other objects. For example, each employee in an organization is an entity, and each
customer who is doing business with the organization is an Entity. An entity can be a
physical object such as a person, customer, student, or lecturer, or it can be a
conceptual object such as a project, group, company, organization, or university.

An Entity has a set of attributes or values which describe the Entity. Some of those
attributes can be used to uniquely identify the Entity. For example, a person will have a
name, mobile number, address(es), and gender as attributes, but the person ID attribute
will be used to uniquely identify the person.

The Entity Type will contain similar entities. In other words, the Entity is an object of an
Entity Type. As an example, EmployeeID E0001 is an Entity of Employee Entity Type.

The following screenshot contains a data set of Employees:

In the above screenshot, each instance is called an Entity; E0004 is an Entity. The Entity
Type is the entire collection of all the instances. In the above example, the Employee
Entity Type contains five Entities.

Let us look at Entity sets which are an important factor when building a data model.

Entity sets
A sub-collection of Entities, or an Entity Set, is a set of entities of the same type that share
the same properties. Entity Sets can be defined for logical reasons where the user needs a
subset of the entities. Entity Sets can be disjoint or joint Entity Sets.

Let us look at Entity Sets in the following screenshot:

In the above Entity Sets, employees of HR and employees of Accounts are disjoint Entity
Sets, where there is no intersection between the two entity sets. With this type of Entity Set,
you do not have to worry about the intersection portion, as there is no intersection. Another
property of these Entity Sets is that there are no entities outside the given Entity Sets.

The following Entity Sets are more complicated than the above example, as shown in the
following screenshot:

In the Entity Sets seen in the preceding screenshot, the E0002 instance is common to both
Entity Sets. The E0005 instance is not part of either of those entity sets. In this example,
both Entity Sets are drawn from the same Entity Type. In some cases, there can be common instances for different
Entity Types. For example, there can be common instances such as Mango in both Fruit and
Vegetable Entity Types.

If you are retrieving instances from intersecting Entity Sets, UNION should be used to
prevent intersecting instances from being displayed multiple times. UNION will provide
instances of multiple entities or entity sets without any duplicates. In case there are
duplicates, UNION will return distinct values in the output.

This means that UNION should be used for Entities and Entity Sets where there are
intersecting instances, as shown in the below screenshot.


As you can see, only the distinct values are listed in the UNION output.

However, since UNION has to check for common instances, it can cause a performance
issue. To avoid this overhead, there are cases where you can use UNION ALL. If you are
sure that the entities are disjoint, you can use UNION ALL so that retrieval performance
is improved. In other words, if you are sure that the Entities or Entity Sets do not overlap,
UNION ALL should be used instead of UNION. UNION ALL does not validate for
intersecting values; instead, it simply combines the multiple entities or entity sets without
removing duplicate values. If you use UNION ALL for overlapping Entities or Entity Sets,
the overlapping instances will be duplicated. However, UNION ALL has a performance
advantage over UNION as it does not need to check for duplicated values.

This scenario can be explained with the following screenshot:


As indicated in the above screenshot, labels such as India, UK, and Canada are duplicated
since we have used UNION ALL.

The UNION operator is used to combine the result-set of two or more SELECT statements.


When performing a UNION, each SELECT statement within the UNION must have the same
number of columns, and those columns should have the same data types. Also, the columns
in each SELECT statement must be in the same order.
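
The following is a minimal sketch of the difference, assuming hypothetical hr_employees and accounts_employees tables that hold the two entity sets:

-- UNION removes the intersecting instances.
SELECT employee_id FROM hr_employees
UNION
SELECT employee_id FROM accounts_employees;

-- UNION ALL keeps the duplicates but avoids the duplicate check, so it is faster.
SELECT employee_id FROM hr_employees
UNION ALL
SELECT employee_id FROM accounts_employees;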

Another type of Entity Set is a sub Entity Set of another Entity Set, as shown in the following
screenshot:

As shown in the screenshot, all instances belong to the Entity Set Employees of Age Under
35. Employees of Age Under 30 is a sub Entity Set of Employees of Age Under 35. As
given in the above example, E001 is an employee whose age is under 26. That means his age
is under 30 and under 35 as well; thus, he fits into the Entity Set of Employees of Age Under
35 as well as the Entity Set of Employees of Age Under 30.

Let us differentiate strong and weak entities in the next section.

Strong and Weak Entities


A Strong Entity is an Entity that can exist without depending on another Entity or
Entities. In contrast, a weak entity is one that depends on another entity and cannot
exist without it. For example, in a banking system, a loan does not exist on its own; it
has to be linked to a customer. This means that the Loan entity is a weak entity.


While a strong entity is denoted by a solid rectangle, a weak entity is denoted by a double
rectangle. Another major difference is that a weak entity does not have a primary key of its
own; it has a partial key that uniquely discriminates the weak entities. The primary key of a
weak entity is a composite key formed from the primary key of the strong entity and the
partial key of the weak entity.

The following screenshot shows the relationship between the Customer and Loan entities:

To obtain a loan, the customer should exist. From the screenshot above, it is clear that the
Loan Entity cannot exist without a Customer entity.

After the Entity Types are identified, the next important step is identifying the attributes for
each entity.

Introduction to Attributes
As indicated before in Identifying Entities and Entity-Types, an Entity consists of a set of
attributes. Attributes are descriptive properties of each member in the Entity Type. In other
words, entities are represented by attributes. For example, the Employee Entity Type will
have attributes such as Employee ID, Employee Name, Address, Qualification, and so on.

Each attribute has a range of permitted values. The Employee ID should be a number from
1-100000, whereas the Employee Name should be a character column where numbers are
not permitted. Sometimes, depending on the business, you might decide your own range.
For example, for the Employee Type entity type, you can define Permanent, Casual, and
Temporary as the permitted values. Also, you might need to implement a domain for Age,
saying that you only employ people whose ages are between 18-55. This range is called the
Domain of the Attributes. As discussed in Chapter 1, Overview of PostgreSQL Relational
Databases, CHECK constraints are used to limit the users from entering values that are out
of the domain.

Attributes can be classified into different types for analysis purpose, as shown in the next
section.

Attribute Types
Attributes can be categorized into five main types. Let us view them in the following
screenshot:

There are five types of Attributes: Simple, Composite, Single-Valued, Multi-Valued, and
Derived. We will now understand these types of attributes in brief:


Simple: Simple attributes are attributes that cannot be divided further; they are also called
atomic attributes. The Mobile Number attribute in the Student Entity Type can be
considered as a simple attribute. As you can imagine it is not practical to divide the Mobile
Number into further attributes. Most of the attributes fall into Simple attribute Type.

Composite: Composite attributes are attributes that can be divided into sub-attributes. If
you look at the Full Name attribute of an employee, it comprises the Title, First Name,
Middle Name, and Last Name attributes. The Address attribute of an employee comprises
the Address I, Address II, City, Postal Code, and Country attributes.

In PostgreSQL, the CONCAT() function is used to combine two or more attributes. If one of
the attributes is NULL, it will ignore the NULL attribute and combine the other attributes
to form the result.

Single-Valued: Single-Valued attributes have a single value per instance. The Employee
ID attribute of employee entities is considered a Single-Valued attribute. A birthday is a
Single-Valued attribute, as there can be only a single Birthday value for a given instance.

Multi-Valued: Multi-Valued attributes have multiple values per instance. Some employees
will have multiple mobile phone numbers; therefore, the Mobile Number attribute will be
a Multi-Valued attribute. Similarly, one employee may have more than one child, which is
again a Multi-Valued attribute. Typically, these attributes will be moved from the main
entity to another table, and a relationship will be created to it from the main entity. Refer
to Rule 3, under the section Building block rules, for more.

Derived: Some attributes are derived from other attributes. Age of Employees and Number
of Years of Employee Experience are examples of Derived attributes. Age of Employees is
derived from the difference between the Employee's Birth Date attribute value and the
current date. Employee Experience is derived from the difference between the Joined Date
attribute and the current date. Age and Experience are very simple and standard derived
attributes; there are also customized derived attributes which differ from one organization
to another. For example, the calculation of a grade for a subject will depend upon the
institute and the subject, and sometimes on the semester too.

In the following table, you can see how Age and Experience are derived for sample
Employee entities from the simple attributes of those entities:

Employee ID ... Birth Date Joined Date Age Experience
12345 ... 1974-Nov-28 2010-Jan-16 45 9
12356 ... 1976-Aug-05 2012-Feb-15 43 7
13467 ... 1978-Jan-01 2016-Dec-01 41 2
13478 ... 1972-Sep-13 2017-Aug-01 47 2


Some database designers prefer not to use derived attributes; instead, they prefer to
calculate at the run-time. For example, if the Amount is equal to Unit Price * Quantity,
most of the designers prefer to calculate the Amount at the run time to save some disk
space.

In the derived attribute option, storage space is the concern. The advantage of this option is
that less computation power is needed. However, in the modern era of computing, storage
is not a major concern due to the rapid advance in storage technology. Therefore, derived
attributes are often the better choice when it comes to database design, as they improve
read performance.

Derived attributes are updated at the time of inserting data and of updating the related
attributes. An important aspect of derived attributes is that they should not be directly
changed by users, as they are derived from the other attributes. If the derived attributes
can be modified directly, there will be a data cleansing issue.
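
For designers who prefer to compute derived values at run time instead, the following is a minimal sketch, assuming a hypothetical employee table with birth_date and joined_date columns:

SELECT employee_id,
       date_part('year', age(current_date, birth_date))  AS age,
       date_part('year', age(current_date, joined_date)) AS experience
FROM   employee;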

Let us see how we can represent attributes and entities in the next section.

Representation of Attributes and Entities


As attributes describe the properties of entities, it is essential to represent them in a
diagram. Also, the different types of attributes discussed before should be represented,
so that it can be easily identified which type of attribute is attached to the
Entity.

In the following screenshot, an Entity Type with all the types of attributes is displayed:


Employee is the Entity Type which consists of instances of Employees. An Entity Type
should be represented as a solid rectangle. Employee ID is the Primary Key
(discussed in Chapter 1, Overview of PostgreSQL Relational Databases), which is used to
identify the entity; Employee ID is underlined in the image. Since Employee ID is also a
simple attribute, it is represented as a solid oval. Employee Name is a composite
attribute whose source attributes are Title, First Name, Middle Name, and Last Name.
The Employee Name attribute is linked to the Employee Entity, and the source attributes
are linked to the composite attribute, Employee Name. Mobile Number is a multi-valued
attribute, and it is shown as a double solid oval. The Age attribute is derived from Date of
Birth; therefore, the Age attribute is shown as a dotted oval connected to the source
attribute, Date of Birth.

Identifying Entities and Attributes is an important aspect in data modeling which is


discussed in the next section.

Identifying Entities and Attributes


Identifying Entities and Attributes is a key step in database design. Identifying Entities
and Attributes will be an iterative process; thus, it might not be easy to capture all the
Entities and their attributes at once.

Let us look at the basic steps of identifying entities and attributes:

Identifying Main Entities


Identifying Attributes of Identified Entities
Generalization and Differentiation of Entities

The following screenshot is a sample of a hospital pharmacy invoice. When a patient is
prescribed drugs by the doctor, the patient has to get those drugs from a pharmacy. The
invoice will contain one or more drugs, as shown in the following sample. Let us identify
the Entities and the relevant attributes.

Please note that the following screenshot is a sample and not an original invoice:


In the following sections, we will identify Entity Types and relevant attributes to satisfy the
above invoice.

Identifying main Entity Types


The first stage is to identify the main Entity Types. To identify the main entities, the
important objects should be selected. Since identifying the Entity Types is a challenging
task, as a database designer you can get the help of the end-users. Since end-users have
been working with the data for many years, their experience will be valuable in identifying
the main entity types. In this sense, the preceding pharmacy invoice is issued by the
hospital cashier to the patient. For a transaction in this context, there needs to be a cashier
and a patient. The cashier and the patient perform the transaction through the invoice.


So at first sight, three entities are identified, namely:

Invoice
Patient
Cashier

The next important aspect of this invoice is the relationship between the patient
and the cashier. Basically, the patient purchases drugs from the cashier and an invoice is
issued. This means that Drugs or Items is also an Entity Type:

Drugs or Item

By further analyzing this invoice, you can see that there is an Order Number. This means
that there can be Orders coming through different departments or from different wards of
the Hospital. This means that Order also can be considered as an Entity Type:

Order

A doctor is also an important Entity as it is important to understand which doctors are


prescribing which drugs:

Doctor

The company or the hospital is also a conceptual entity to which all these entities belong:

Hospital

Now we have identified the main entities. This list of entities becomes the initial
version of your data model.

Therefore after the first analysis, the following table shows the Main Entity Types that are
discovered in the analysis:

Hospital Invoice Order


Patient Cashier Drugs
Doctor
Now, let us discuss the techniques of identifying the Entity Attributes.

Identifying Attributes of Entities


After identifying Entities, the next step is to identify attributes. Though the invoice might
be enough to identify entities and entity types, identifying attributes may need more
thinking. When identifying attributes, it is essential to look at not only the values in the
invoice but also the possible values which may be needed in the future. By including those
attributes at this stage itself, you are avoiding unnecessary design changes and
maintenance issues which may arise later. Let us identify the attributes for the Invoice
Entity with the standard symbols.

Since the main Entity is Invoice, let us identify attributes of the Invoice Entity as shown in
the following screenshot:

In the Entity - Attribute screenshot above, all the attributes are linked to the Invoice Entity.
However, during the exploration of Entities, we identified that Patient, Drug, Cashier, and
Order are not attributes of Invoice but separate entities. The next stage is to
separate those attributes and move them to different entities.

The following tables list the Entities, their attributes, and the attribute type.

The following table has attributes of Invoice entity:

Attribute Attribute Type


Invoice Number Primary Key
Invoice Date Simple


Amount Simple
Quantity Simple
Number of Items Simple
Payment Type Simple
The following table has attributes of the Hospital entity:
Attribute Attribute Type Remarks

Hospital Name Primary Key
Hospital Postal Address Composite Comprises of (House Number, Street, District, Province, Country, Postal Code) attributes


The following table shows the attributes of the Patient entity:

Attribute Attribute Type Remarks


Patient Code Primary Key
Patient Name Composite Comprises of (Title, First Name, Middle Name, Last Name) attributes
Patient Postal Address Composite Comprises of (House Number, Street, District, Province, Country, Postal Code) attributes
Mobile Number Multi-Valued
Date of Birth Simple
Age Derived Derived from the Date of Birth attribute
Height Simple
Weight Simple
Body Mass Index Derived Derived from the Height and Weight attributes
Next, let us see the attributes of the important Doctor entity:

Attribute Attribute Type Remarks


Doctor Code Primary Key
Doctor Name Composite Composite of (Title, First Name, Middle Name, Last Name) attributes
Specialty Multi-Valued
Qualification Multi-Valued
Contact Hours Multi-Valued
Charges Multi-Valued
The drug is the next entity which is shown in the following table:

Attribute Attribute Type


Drug Code Primary Key
Drug Name Simple
Drug Rate Simple
Supplier Multi-Valued
Drug Specification Multi-Valued
Let us see the attributes of Order Entity:

Attribute Attribute Type


Order Number Primary Key
Order Date Simple
Drugs Multi-Valued
Cashier Name Composite
Patient Name Composite
Doctor Name Composite
Let us look at the Cashier entity:

Attribute Attribute Type Remarks


Cashier Code Primary Key


Cashier Name Composite Composite of (Title, First Name, Middle Name, Last Name) attributes
Cashier Postal Address Composite Composite of (House Number, Street, District, Province, Country, Postal Code) attributes
Mobile Number Multi-Valued
Designation Simple
As a designer, you should think a little ahead, rather than simply capturing the
requirements that exist today. For example, for the Patient entity type, it is better to include
more health-specific attributes such as Body Mass Index (BMI). This type of proactive
measure will give the organization a competitive advantage over its competitors.

After distributing the attributes among the entities, the next step is to create relationships between these entities. For
example, there can be only one patient per invoice, but there can be one or more drugs in an
invoice. We will look at these relationships in the Entity-Relationship (ER) diagram in
Chapter 3, Planning a Database Design.

In the next section, the generalization of identified entities is discussed in detail.

Generalization and differentiation of Entities


Generalization and differentiation involve removing entities from, or introducing entities
to, the model. In the generalization and differentiation phase, you will see entities and
attributes being renamed, and attributes being assigned and reassigned to different entities.

With generalization, two entities which represent different types of the same entity can be
combined into one entity. On the other hand, at this phase, one entity can be divided into
two separate ones if it is identified that the entity represents two different alternatives and
should be split. For example, you might decide to
break Payment Type of the Invoice Entity into a new Entity, rather than having it as an
attribute.

After identifying Entities and their attributes, the next important step is to name them in a
methodical way so that they can be identified later.

Naming Entities and Attributes


When naming entities and attributes, there are general rules you need to follow. Though
the following rules are general practices, you can have your own conventions to suit your
environment:


Entities/Attributes should be given precise and concrete names. Entity Name or


Attribute name should describe the role of those entities/attributes in the system.
For example, use the word Customer instead of Person. The term person appears
to be a very common entity; which can be Employee, Supplier or Customer.
Though there can be common attributes between these entities, there are specific
attributes for each entity. The Salary attribute will only be relevant to the
Employee, while the Credit limit will be relevant to the Customers and Suppliers.
Also, in a hospital, the customer is named as Patient. Therefore, it is better to use
terms for the entities/attributes which are relevant to the use case.
Use singular instead of the plural form, such as Patient, not Patients. However, if
you are not in favor of singular naming, you can follow plural naming. But it is
essential to maintain consistency, which means that you should not use both
singular and plural naming.
Use CamelCase for naming entities/attributes. For example, the User Accounts
entity should be named UserAccount.
Avoid using short names and acronyms. For example, instead of using Tel it is
always better to use Telephone, and instead of CRM, use
CustomerRelationManagement. There can be limitations in the database
technologies that you use when it comes to physical implementation. However,
you should not limit your entities and attribute naming with short names and
acronyms when it comes to conceptual implementation.

The purpose of a naming convention is to maintain better readability, which makes your
database easier to manage.

After identifying Entity Types and Attributes, the next step is to assemble them by
identifying the relationships between them.

Assembling the building blocks


After identifying Entities, Entity Types, and Attributes, the next step is to assemble them in
the physical database. Though we will be discussing this in detail at a later stage, it is
always better to establish some understanding of the physical implementation. In a database
table, the individual rows are the unique entities and the columns are referred to as attributes.

Let us discuss the basic rules for defining building blocks in a database.


Building block rules


It is important to understand the rules of building blocks. These rules will make it easier to
implement the building blocks. Let us look at the rules in the following sections.

Each table should represent one and only one entity-type
Rule 1: If two entity-types share the same attributes, and an entity is a member of both
entities, then the two entity-types can be represented in one table.

The Employee and Manager entity-types can both be represented in the same table as
employee and manager entities have the same attributes and manager entities are also
employee entities. In the business context, a manager is an employee. In the database
tables, self-joins are used to join these entities.
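
A minimal sketch of such a self-join, assuming a hypothetical employee table with a manager_id column that points back to employee_id:

SELECT e.employee_name AS employee,
       m.employee_name AS manager
FROM   employee e
LEFT JOIN employee m ON m.employee_id = e.manager_id;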

All columns must be atomic


Rule 2: When implementing columns, it is always better to implement data at the atomic
level. Atomic means that the column cannot be divided further, or into another level. When
we define attributes, there are composite attributes such as Full Name and Address. Full
Name is composed of the Title, First Name, Middle Name, and Last Name attributes. At the
database level, we implement the atomic level, such as Title, First Name, and so on, as
atomic attributes. This is done mainly because it is easier to maintain these attributes, for
example when updating them.

Composite attributes are used for display purposes. Therefore, at the table level, sub-
attributes are used, and those are composed into one attribute at the view level. For example,
at the table level, the atomic columns are Title, FirstName, MiddleName, and LastName,
and at the view level, there will be a concatenated derived column called Full Name,
including all four columns.
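
The following is a minimal sketch of this pattern; the table, view, and column names are illustrative:

CREATE TABLE employee (
    employee_id serial PRIMARY KEY,
    title       text,
    first_name  text NOT NULL,
    middle_name text,
    last_name   text NOT NULL
);

-- The composite attribute is exposed only through the view.
CREATE VIEW employee_v AS
SELECT employee_id,
       CONCAT(title, ' ', first_name, ' ', middle_name, ' ', last_name) AS full_name
FROM   employee;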

Columns cannot be multi-valued


Rule 3: Some entity types will have multi-valued attributes. In the example that we discussed
in the section Identifying Attributes of Entities, we identified the Cashier Entity Type.
Since a cashier can own multiple mobile numbers, for the Cashier Entity, Mobile Number
will be a multi-valued attribute. Though comma-separated values or some other separated
values can be stored in a single-valued attribute, there will be a lot of issues when
extracting information from such columns. To properly represent a multi-valued
attribute, another table should be created and related to the original table using a


one-to-many relationship, as in the sketch below. Relationships are discussed in detail in Chapter 1, Overview of


PostgreSQL Relational Databases.
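
A minimal sketch of this rule, using illustrative cashier tables:

CREATE TABLE cashier (
    cashier_code serial PRIMARY KEY,
    first_name   text NOT NULL,
    last_name    text NOT NULL
);

-- The multi-valued Mobile Number attribute becomes its own table,
-- related to cashier through a one-to-many relationship.
CREATE TABLE cashier_mobile_number (
    cashier_code  integer NOT NULL REFERENCES cashier (cashier_code),
    mobile_number text    NOT NULL,
    PRIMARY KEY (cashier_code, mobile_number)
);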

Summary
In this chapter, we have looked at Entities, Entity Types, and the process of identifying them.
We looked at different types of Entities and their relationships with other entities. Entities
have their attributes, and we looked at how to identify those attributes. We discussed the
naming conventions for Entities and Attributes. After identifying those entities and
attributes, the next important task is to assemble them into the building blocks, which we
discussed with three rules.

After identifying the Entities and their Attributes, the next step is to plan the design of the
database. In the next chapter, planning the database design is discussed, along with the
challenges in planning.

Exercise
The following screenshot is a bill from a mobile company. Identify the Entities and
Attributes which can be used to implement a database design. You will learn more about
Relations in the upcoming chapters.

Let us look at the following screenshot as a reference to the exercise:


Let us look at the common questions that you will mostly come across in interviews.

Questions
What is the difference between UNION and UNION ALL?

UNION will provide instances of multiple entities or entity sets without any
duplicates. In case there are duplicates, UNION will return distinct values
in the output. This means that UNION should be used for Entities and Entity
Sets where there are intersecting instances. However, since UNION has to
check for common instances, it can cause a performance issue. If you are
sure that the Entities or Entity Sets do not overlap, UNION ALL should be
used instead of UNION. UNION ALL does not validate for intersecting values;
instead, it simply combines the multiple entities or entity sets without
removing duplicate values. If you use UNION ALL for overlapping Entities or
Entity Sets, the overlapping instances will be duplicated. However, UNION ALL
has a performance advantage over UNION as it does not need to check for
duplicated values.

Why are views used to define composite attributes of entities instead of defining
them at the table level?

Composite attributes are used for display purposes, but they are much easier
to update when they are maintained at the sub-attribute level. Therefore, at
the table level, sub-attributes are used, and those are composed into one
attribute at the view level. For example, at the table level, the atomic
columns are Title, FirstName, MiddleName, and LastName, and at the view level,
there will be a concatenated derived column called Full Name, including all
four columns.

When forming composite attributes, what action should you take if one of the
source attributes is NULL?

We can simply concatenate the source columns with the || operator, as shown
in the following code snippet:

SELECT Title || ' ' || First_Name || ' ' || Middle_Name || ' ' || Last_Name;

However, if one column is NULL, this will result in a NULL output, which is
not what the users expect. In PostgreSQL, you can instead use the CONCAT
function, as in many other database management systems, because it ignores
NULL values. The following is an example of how to use CONCAT:

SELECT CONCAT(Title, ' ', First_Name, ' ', Middle_Name, ' ', Last_Name);

Why are derived attributes used instead of calculating the values at run time?

In the derived attribute option, storage space is the concern, while the
advantage of this option is that less computation power is needed. However, in
the modern era of computing, storage is not a major concern due to the rapid
advances in storage technology. Therefore, derived attributes are much better
when it comes to database design.

3
Planning a Database Design
Database design is not something that you can start straight away. Like any design work
such as building a house or manufacturing a car, you need thorough planning for a
database as well. In the case of building a house, you need to plan for the land where you
are building your house. Then you need to plan the house depending on your requirements
and your budget. It is the same for the database as well. You need to understand your
environment and your users' requirements. Also, just as you plan a house for the future,
when it comes to database planning, you need to plan for the future as well.

It is important to understand what you are designing for, who your intended users are, and
what your limitations are. Also, one database is different from another, as every database
is unique. Therefore, proper planning is required for database design.

The following topics will be covered in this chapter:

Understanding Database design principles


Database schema
Database design approaches
Importance of data modelling
Role of databases
Common database design challenges

Understanding Database Design Principles


The database is a key component in the information system. Therefore, when designing a
database, key aspects need to be addressed. These key aspects need to be addressed from
the very beginning of the project, rather than trying to cover them after the project starts.
When requirements are gathered from the users, it is essential to capture the relevant
information from the users.

The following screenshot shows the different aspects of database design:

Depending on the system, the weight of each parameter may be different. If you intend
to design a database for a core-banking system, almost every parameter should be given
serious consideration. However, if you are designing a database for your inter-company
DVD club, then performance and availability may not be significant factors. Similarly,
for a data warehousing system, you might not be greatly concerned about availability,
but integrity will be an essential parameter. This means that, depending on the database
that you design, you need to evaluate the importance of the parameters carefully. Let us
look at these parameters and how they should be addressed at the design
planning stage of a database.

Usability
Concepts of relational databases are introduced in order to improve the usability of data.
We covered relational databases in Chapter 1: Overview of PostgreSQL Relational
Databases. That doesn't imply that just because you have a relational database, you have
usability. However, when you are designing the database, you need to ensure that
your database objects are storing meaningful data that can be used by the end-users or
applications. To guarantee that the database objects are storing meaningful data, it is
essential that all required data is gathered during the process of requirement gathering.

Another aspect of data usability is converting the stored data into meaningful data
without overly complex queries. In relational databases, data is stored in different objects.
When a user requests a data set, it might be necessary to combine one or many tables. In the
case of multiple tables, it has to be simple to combine them. If usability is maintained in the
database design, combining them will not become a hard task.

In PostgreSQL, Views and Stored procedures are used for improving the usability of the
database. Though you can define views and stored procedures later, it is essential to
identify the requirement at the planning stage so that you don't run into last-minute issues.

Extensibility
Business processes are not static but dynamic. When designing a database, it should be
considered that the designed database should withstand future changes. It is highly
unlikely that you don't have to do any modifications to your database. However, it is not
viable if you have to change your database on a daily basis or very frequently. If the
original database is much complex and not properly organized, then you need to change
the database design frequently. At the planning stage, you need to predict future changes
in the databases.

There are a few options that should be considered to ensure the extensibility of your
database. The generalization of entities is one way to attain extensibility. For example, if
you have employees in different departments, rather than creating tables department-wise
(Employee_Accounts, Employee_HR, Employee_IT), it is advisable to create a single
Employee table with a department attribute, as in the sketch below. Otherwise, whenever a
new department is introduced, you would need to create a new table; instead, it becomes
only a matter of adding a record to the department table. On the other hand, if there are
different types of employees, such as permanent, temporary, executive, and so on, it may
be better to define them in separate tables simply because each type has different parameters.
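
A minimal sketch of the generalized design; the table and column names are illustrative:

CREATE TABLE department (
    department_id   serial PRIMARY KEY,
    department_name text NOT NULL UNIQUE
);

CREATE TABLE employee (
    employee_id   serial PRIMARY KEY,
    employee_name text NOT NULL,
    department_id integer NOT NULL REFERENCES department (department_id)
);

-- Introducing a new department is now a data change, not a schema change.
INSERT INTO department (department_name) VALUES ('Logistics');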

Normalization is another very popular technique for achieving extensibility; a detailed
discussion appears in Chapter 5, Applying Normalization. A simple design is the easiest way
to achieve extensibility, as a simple solution is easy to understand and easy to adjust when
necessary.

Integrity
It doesn't matter how large your database is: if your data is garbage, it is unusable. Also, no
one will trust your data, and your database will become obsolete. It is essential to identify
what rules you need to implement in the database, and those rules should be identified at
the requirement stage itself.

There are different types of integrity we need to identify during the requirement analysis
phase, as shown in the following screenshot:


The following are the basic implementations of the different integrity types, and it is
essential to identify them at the planning stage:

Entity Integrity: This discusses the structure of the tables. Typically, Primary
Keys and Unique keys are used to achieve Entity Integrity.
Domain Integrity: This explains the attributes of the entities or columns in the
table. Nullability, Data Type, Data formats and data length are the mechanisms
which are used to achieve the Domain Integrity. When Name of a Person is
defined, it should be a string data type with typically 50 in length. For a phone
number depending on the environment, you can define the format for the phone
number such as ###-######-##.
Referential Integrity: (or Foreign Key Constraint) This was discussed in Chapter
1, Overview of PostgreSQL Relational Databases, where entity dependency is
maintained. For example, the Employee entity has a dependency on the
Department Entity. This means that there cannot be an employee without a
department, or with an invalid department. Also, you are unable to delete
departments where there are relations in the employee table.
Transactional Integrity: A transaction is one set of operations that is
considered a single unit. It might involve inserting, updating, or deleting multiple
tables or records. However, from the transaction point of view, it should be one
unit. The Atomicity, Consistency, Isolation, and Durability (ACID) theory is
introduced to maintain transaction integrity. A detailed discussion will be done
on Transaction Integrity in Chapter 9, Designing a Database with Transactions.
Custom Integrity: Every organization has its own integrity rules. Salary should
not be higher than 50,000, or age should not be less than 18, and so on, are some
of those custom rules. This type of integrity is implemented using CHECK
constraints which were discussed in Chapter 1, Overview of PostgreSQL Relational
Databases.
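
The following is a minimal sketch combining several of the integrity types listed above; the table and column names are illustrative:

CREATE TABLE department (
    department_id serial PRIMARY KEY                     -- entity integrity
);

CREATE TABLE employee (
    employee_id   serial PRIMARY KEY,                    -- entity integrity
    employee_name varchar(50) NOT NULL,                  -- domain integrity
    age           integer CHECK (age >= 18),             -- custom integrity
    salary        numeric CHECK (salary <= 50000),       -- custom integrity
    department_id integer NOT NULL
                  REFERENCES department (department_id)  -- referential integrity
);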

Performance
A database tends to hold a large volume of data over time. Sometimes this may be in the
order of multiple terabytes. When dealing with this much data, it is essential to retrieve it
in an efficient way. When designing a database, it should be done in a way that improves
data retrieval and data insertion.

Apart from table designing, there are indexes that can be introduced in order to improve
data selection from the databases.

Following are the indexes which can be included in tables:


Clustered Indexes
Non-Clustered Indexes
Bitmap Indexes
Include Indexes
Filtered Indexes
Column Store Indexes

These indexes are used in different ways and in different combinations. We will look at the
details of indexes in Chapter 8: Working with Indexes.

Availability
Availability means keeping the data available to the end-users for the maximum possible
time. Many of us would think that availability is achieved via hardware and only hardware.

With hardware, when a system is not available, a different set of hardware at a different
geolocation can be used so that users can keep operating. However, there can be cases where
a costly hardware solution is not feasible due to budget constraints. Hence, rather than
using a costly hardware solution, you can design your database to achieve some level of high
availability.

This might not be a 100% availability option, but it will provide you with breathing space
so that clients can perform some of their tasks, to some extent, until you bring the system
back to normal.

Depending on your needs, and by considering the cost factor, there are different types of
high availability options available to you.

Read-only and deferred operations


In the event of a disaster, a read-only version of the database is available. This means
that users can query the previous data but cannot enter new values into the system. Typically,
this is achieved via database functionality such as data replication, database mirroring,
and so on. Replication cannot be done unless there is a primary key in every table. Therefore,
during the design of databases, it is essential to have a primary key for each table.

Also, in some database systems, large columns that are more than 1 GB cannot be
replicated. This means that you need to understand what you are going to replicate before
you start the database design.


Partial, transient, or impending failures


During a disaster, you can design your databases and applications in such a way
that the important functionalities are available, but not all of them. For example, for a
supermarket, the essential functionality will be issuing invoices from the point-of-sale (POS)
machine. Getting the cashier's day balance or updating the prices of products may not be
required during a disaster; if it is not working today, you can wait for another couple of
days. Therefore, you can design your database in such a way that the important functionalities
are in one database, and that database is enabled for high availability. In this type of scenario,
Master Data Management (MDM) is used heavily, and it will be discussed in Chapter 7:
Working with Design Patterns. By following this approach, you will not be able to implement
some foreign key constraints. For example, if items are in a separate system, it is not
possible to implement a foreign key constraint between the point-of-sale system and the
product table, as those are isolated tables.

Partial end-to-end failure


You might have multiple clients, and some of your clients are very important to you. In
many businesses, a few clients give the organization the major share of its revenue. In
such situations, you would prefer to provide a high availability option for selected clients
rather than providing it to all, mainly due to the cost. In this type of scenario, during a
disaster, some clients will get all the system functionalities while other clients
will not get any functionality. To satisfy this type of scenario, techniques like database
sharding can be used. This means you can define one shard which holds one set of
clients and another shard with another set of clients. A detailed discussion on sharding will be
done in Chapter 12: Distributed Databases.

There are two challenges with the sharding design. Since you do not have all the clients'
data in one database, producing combined reports will be challenging, as there is no single
database that holds all the data. Sometimes, there are cases where clients need to move
between shards. In that situation, there should be an automated mechanism to transfer data
between shards.

When designing databases for high availability, it is understood that the chosen approach
will be difficult to change later. Because of this, it is essential to identify what
availability the customer needs. If you have to change the approach at a later stage, it will
require unnecessary cost and effort.


Security
Security is a key concept in any information system. In a database system, there are three
ways to achieve security, as follows:

Authentication: Authentication is identifying who the user is. There are multiple
ways to identify the user, and the most popular technique is a user name and password.

Authorization: When a user is identified via authentication, the next step is to
look at what that user can do. In a database system, there can be cases where one user
can only read and another user can only write to some set of data. These different levels
of object access are provided by the mechanism of Authorization.

Encryption: Data is an asset to any organization. If your data is compromised, it will
lead to financial and reputational loss for the organization. Encryption is a concept
that is not limited to databases. By encrypting, you are securing your
data against unauthorized access.
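
A minimal sketch of the first two concepts, using illustrative role and table names:

-- Authentication: a login role identified by a user name and password.
CREATE ROLE reporting_user LOGIN PASSWORD 'change_me';

-- Authorization: the role may only read the invoice table, not write to it.
GRANT SELECT ON invoice TO reporting_user;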

When requirement gathering is done, it is essential to capture the requirements from the
perspective of these parameters. Sometimes, your client might not have enough
information and knowledge to provide this. However, it is always better to get this
information during the planning phase, as changing these decisions at a later stage will
incur a huge cost for both you and your client. More importantly, it will heavily impact the
project timeline as well. For example, changing the database to adapt to a different high
availability mode would be very difficult, as the database requires different design methods
for different high availability approaches.

After understanding the database design principles, we will next look at the types of
database schemas you need to work with.

Introduction to the database schema


Though you have the basic design concepts, you are not yet ready to start designing. Before
you start designing you need to understand what type of database you are designing.
Databases have their own mandate to support the different needs of the users. Depending
on their business and technical needs, the database designer has to choose the correct
database schema. In the database world, there are two types of systems: Online
Transaction Processing (OLTP) and Online Analytical Processing (OLAP).

Let us see the difference between the two of them.


Online Transaction Processing


OLTP is a method or process for managing transaction-oriented applications in information
systems. ATMs in banking, point-of-sale invoices, purchasing over the Internet, and money
transfers from one account to another are some examples of OLTP. OLTP queries are simple
and run within a short period of time, but there will be a large number of executions.

Online Analysis Processing


OLAP is the analysis of data. OLAP is used to analyze data in order to make strategic
decisions and gain a competitive advantage over competitors. OLAP holds historical data
that was input by the OLTP systems, and the data is summarized and multi-dimensional.
Data warehouses, reporting systems, and analytical systems are examples of OLAP systems.
Unlike OLTP systems, OLAP has complex queries that may execute for a long time, even
hours, but they run far less frequently.

OLTP versus OLAP


As a database designer, it is important to understand why there is a difference between
OLTP and OLAP. Depending on the type of system that you are designing, whether it is
OLAP or OLTP, you need to take design decisions.

The following table shows the major differences between OLTP and OLAP:

OLTP OLAP
Stores current Data Stores historical Data
Stores Details Data Mainly Summarized Data
Dynamic data, lot of updates and deletes Mostly Static data
Transaction-Driven Analytical Driven
Application Oriented Subjected Oriented
Supports Day-to-Day operations Supports strategical decisions
Defined queries Ad-hoc and unstructured queries
Highly Normalized data structures Denormalized data structures
More relationships with tables Fewer relationships with tables
High number of users Fewer users
If you are designing a database for OLTP, it is essential to identify the above properties.
For example, OLTP schemas should typically be highly normalized. Further, OLTP should
support a high number of users.


This means that you need to understand whether you are designing a database for OLTP
purposes or OLAP purposes. Sometimes, there can be a mix of OLTP and OLAP in the
database. For example, there can be some limited analytical reports in an OLTP system. In
that type of scenario, to decide whether the system is OLTP or OLAP, the system mandate
should be identified. If it is mainly towards OLTP, then the design should be done by
considering the OLTP and vice-versa.

For any design, there are multiple approaches, and database design is no different.

Choosing Database design approaches


There are two possible approaches to database design; they are referred to as bottom-up
and top-down. There are also centralized and de-centralized design types to choose from
when deciding on a database design. We will look at these approaches in the following sections.

Bottom-up design approaches


The bottom-up approach begins at the fundamental level of attributes or entity properties.
In simple terms, it moves from the specific to the general. As shown in the following screenshot,
in this method the Attributes are identified first. Those attributes are then grouped to
form Entities, and finally the Model is the combination of all the entities:

In the bottom-up method, database designers will inspect all the system user interfaces,
such as reports, entry screens, and forms. The designers will then work backward through
the system to determine what attributes should be stored in the database. The bottom-up
design method is most suitable for less complex and small database systems. This approach
becomes extremely difficult when you have a large number of attributes and entities.

Top-down design approaches


The top-down approach begins at the model level. In simple terms, it moves from the general
to the specific. As shown in the following screenshot, in this method the Entity Types in the
Model are identified first. Then the relevant Attributes of the entity types are identified:

In the top-down method, the designer starts with a basic idea of what is required for the
system, and the requirements are gathered from the end-users, in collaboration with them,
about what data they need to store in the database. A detailed understanding of the system is
required when the top-down method is used. For a large and complex database system, the
top-down method is more suitable, as it can be used to identify the attributes at an early
stage.

Centralized design approach


In the centralized approach, all the users' requirements are gathered and merged into a
single set of requirements. In this approach, the global data model is composed and views
are created for each user. A centralized design is typically suited for a smaller database and
can be successfully done by a single database administrator. The centralized design is not
limited to small organizations, whereas large organizations can operate within the
centralized small database environment. The downside of this approach is that databases
will not be easy to maintain where they are growing.


De-centralized design
In the de-centralized design approach, different users' perspectives are maintained
separately. Different user requirements are gathered separately into local data models,
which are then merged into a global data model. This approach is well suited to complex
systems and to systems which operate in different geographical locations. The de-
centralized database design should be used when there are a large number of database
objects and the database has complex requirements. When there is a lot of disagreement
between users about the requirements, the decentralized design approach is preferred.

Though there is no hard rule about which approach to use, it always depends on the use case.
Every use case has its own unique properties and environment. Also, it is
possible that, rather than using a single approach, a combination of approaches or a
hybrid approach will be used.

Now that we have looked into the various approaches, let us look into data modeling as it
is an important factor in database design in various aspects such as Quality, Cost, and Time
to Market, and so on.

Importance of data modeling


The database model illustrates how the database objects are put together: which tables
relate to which, the columns within them, and so on. Data modeling is an important part of
database design. In many cases, management tries to skip the database modeling part,
citing the timeline impact as the reason. However, by doing data modeling at the start of
the project, you are anticipating and solving problems that you would otherwise have to
confront in the future. When a database model is available, it does not matter that an
application programmer's view of the data is different from that of the manager and/or the
end-user. In other words, when a good database model is not available, it is more likely that
you will come across many issues, not only during the development of the system but also
during its maintenance. In this section, we will be looking at various aspects of data
modeling, as they are the important parameters to consider.

Quality
The data model enables you to define the problem and provides you with multiple avenues
for the next stage. On average, about seventy percent of software development efforts fail,
and a major reason for failure is premature coding without building a data model. When you
have the model in your hand, you are clear about what to do, and the next stage becomes
much easier, which helps to build quality software.

Cost
The data model promotes clarity and provides the groundwork for generating much of the
needed database and programming code. Typically, data modeling is estimated to cost ten
to fifteen percent of the entire project cost; however, it has the potential to reduce the
programming cost by sixty to seventy percent. Data modeling catches errors at an early
stage, where they are easy to fix. It is always better to fix them at an early stage than at a
later stage, and most importantly before the software is in the customer's hands.

The data model is also reusable during future developments and maintenance. This helps to
reduce the project cost during the future development and maintenance phases of the
project.

Time to market
Having a sound data model at the start of the project can avoid unnecessary trouble which
you would otherwise face during and after the project. When unforeseen issues occur during
the development phase, it is more likely that the project will run into chaos. This, in turn,
will definitely impact project timelines.

Also, no system operates in isolation. For example, one database system might be
linked with other systems to extract data, as in the case of data warehousing. If the data
model is present at that time, it is much easier to integrate with other systems and much
easier to deploy the solution to production. This means that having a proper data
model not only reduces the time to market of the current database but also of the other
systems which depend on this database system.

Scope
There is always a gap between the customer and the technical team: the gap between
what you want and what you will get. The data model document provides a mechanism
to bridge this gap. The data model defines the scope of the data, which helps to
start the dialogue between the customer and the developers. Business staff can visualize
what the developers are building and compare it with their own understanding from the
data model.

A data model also promotes agreement on vocabulary and technical jargon. The data model
highlights the chosen terms so that they can be driven forward. If any changes are required
at this stage, both parties can agree on them at this level.

Performance
A proper data model makes database tuning much easier. As a rule of thumb, a
properly designed database typically performs better. Database performance plays a key
role in achieving optimal system performance. With a data model present, it is
much easier to translate the model into a database design. This means that, by using a
proper model, database performance will be improved.

On the other hand, data modeling provides a means to understand a database. This means
that a database developer is able to tune the database for fast performance by
understanding the data model.

Documentation
No matter how small or large your system is, documentation is very important. Even if it is
a small system, you never know how large it will become. Therefore, even though you
are starting at a tiny size, it is important to document. Also, there is technical jargon that
needs to be communicated to the customers. Both parties can agree on naming
conventions with respect to database objects. For this, the database model is used as a
communication tool, as the data model standardizes the names and properties of data
elements and their entities.

Risk
Risk analysis is a key task in any project, and it is far better to identify risks at an early stage in order to mitigate them. The size of the data model is a good measure of project risk. Since the data model gives a basic idea of the programming concepts and project effort, it can be used as a risk-measuring tool.

It is essential to understand the role of the database in a complete user system, which is discussed in the following section.

Role of databases
The database has a mandate to maintain data in storage and provide it to end users and end-user application interfaces when the need arises. Depending on the need, there can be one or multiple databases for the system, which can span multiple database servers to suit the business needs.

The following screenshot shows how databases are listed in the PostgreSQL server when it is accessed from pgAdmin:

As shown in the above screenshot, there are HR, Invoice, SampleDB, and postgres databases in the system, and you can have many databases in PostgreSQL. There are instances where a server contains over 10,000 databases per PostgreSQL instance. Also, there is no limit on database size, and there are PostgreSQL databases larger than 6 TB.
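
In addition to browsing them in pgAdmin, the databases on a server can be listed with a simple catalog query; the names returned will, of course, depend on your server:

-- List all non-template databases on the current PostgreSQL server
SELECT datname
FROM pg_database
WHERE datistemplate = false
ORDER BY datname;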

Storing Data
The main role of the database is to store data, and the data has relationships between database objects. In a database, data is stored by means of tables. The list of tables in the sample database is shown in the following screenshot:
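
In addition to browsing them in pgAdmin, the tables can also be listed with a catalog query; this is a minimal sketch assuming the sample tables live in the public schema:

-- List the user tables in the public schema
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_type = 'BASE TABLE'
ORDER BY table_name;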

There is no limit on the number of tables in a database, and a table can hold any number of records. A single table can store up to 32 TB of data with the default 8 KB block size.

The following table contains a sample data set for the actor table in the sample database, shown in the row-column format:

Every organization has a large volume of data, and databases are mandated to store it. By using tables, columns, and rows, a large volume of data can be stored methodically in a database. INSERT, UPDATE, and DELETE are the common SQL commands used to store and maintain data in relational databases.
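
As a hedged illustration against the sample actor table (the exact column list may differ between versions of the sample database), the three commands look as follows:

-- Add a new actor
INSERT INTO actor (first_name, last_name) VALUES ('JOHN', 'DOE');

-- Correct the last name of that actor
UPDATE actor SET last_name = 'SMITH'
WHERE first_name = 'JOHN' AND last_name = 'DOE';

-- Remove the actor again
DELETE FROM actor
WHERE first_name = 'JOHN' AND last_name = 'SMITH';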

The database also plays a key role when multiple users access the same table. The database maintains the ACID properties so that transactions are handled correctly. Without a database, handling such concurrent transactions in a user-friendly way would not be possible.
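
A minimal sketch of how a transaction groups such statements so that they either all succeed or all fail (the table and values are only illustrative):

BEGIN;
UPDATE actor SET last_name = 'CRUISE' WHERE actor_id = 1;
UPDATE actor SET last_name = 'HANKS'  WHERE actor_id = 2;
COMMIT;   -- both updates become visible together; ROLLBACK would discard both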

Access Data
Accessing data is an important capability of the database. The database is useless if it cannot provide access to the large volume of data it stores. Also, when accessing data, it should be presented to the user in a readable manner and without being time-consuming. As we discussed before, data is stored in multiple tables. For the end users, these tables should be joined via their relationships to present data meaningfully. Though there are a large number of columns and rows, users may not need all of them, so there should be a way to retrieve data with filtering. Users may also need aggregations of data such as SUM, AVERAGE, MINIMUM, MAXIMUM, and many more, among various other functionalities.

An important factor in a relational database is fast access to data. It does not matter how large the data volume is, users should be able to access the desired data in a timely manner. Providing data to users efficiently is another key role of a database.

In PostgreSQL databases, there are three main ways of accessing data: functions, stored procedures, and views.

The following screenshot shows the list of views in the sample database:

In a database view, multiple tables can be joined to form meaningful data.

The following code block shows the definition of the actor_info view, in which you can see that multiple tables are joined together with a few functions:
SELECT a.actor_id,
    a.first_name,
    a.last_name,
    group_concat(DISTINCT (c.name::text || ': '::text) || (( SELECT
            group_concat(f.title::text) AS group_concat
        FROM film f
            JOIN film_category fc_1 ON f.film_id = fc_1.film_id
            JOIN film_actor fa_1 ON f.film_id = fa_1.film_id
        WHERE fc_1.category_id = c.category_id AND fa_1.actor_id = a.actor_id
        GROUP BY fa_1.actor_id))) AS film_info
FROM actor a
    LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
    LEFT JOIN film_category fc ON fa.film_id = fc.film_id
    LEFT JOIN category c ON fc.category_id = c.category_id
GROUP BY a.actor_id, a.first_name, a.last_name;
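
Once such a view exists, end users can query it like an ordinary table. As a simple, hedged usage sketch (the name used in the filter is only an example):

SELECT first_name, last_name, film_info
FROM actor_info
WHERE last_name = 'GUINESS';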

Apart from views and functions, stored procedures can also be used as a mechanism for data access in a PostgreSQL database. Both functions and procedures accept parameters. In the following screenshot, we can see the list of functions in the sample PostgreSQL database:

In the following screenshot, we can see the definition of the film_in_stock function, which has three parameters.
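
For reference, the definition in the sample database is roughly as follows; this is reproduced from a typical copy of the sample schema, so the exact text in your installation may differ slightly. It takes a film and a store and returns the matching in-stock inventory rows:

CREATE OR REPLACE FUNCTION film_in_stock(p_film_id integer, p_store_id integer,
                                         OUT p_film_count integer)
RETURNS SETOF integer AS $$
    -- Return every inventory_id of the given film at the given store
    -- that is currently in stock
    SELECT inventory_id
    FROM inventory
    WHERE film_id = $1
      AND store_id = $2
      AND inventory_in_stock(inventory_id);
$$ LANGUAGE sql;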

Similarly, stored procedures can be used in PostgreSQL to facilitate complex calculations.
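
Stored procedures are available from PostgreSQL 11 onwards; they are created with CREATE PROCEDURE and invoked with CALL. The following is a minimal, hypothetical sketch (the procedure name and logic are illustrative and not part of the sample database):

-- Delete rentals older than a given date in a single maintenance step
CREATE PROCEDURE archive_old_rentals(p_before date)
LANGUAGE plpgsql
AS $$
BEGIN
    DELETE FROM rental WHERE rental_date < p_before;
    COMMIT;  -- unlike functions, procedures may issue transaction control commands
END;
$$;

CALL archive_old_rentals(DATE '2005-01-01');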

Secure Data
A database will be accessed by multiple users who have different roles in the organization. Therefore, the database has the role of maintaining the data securely. Security can be achieved via authentication, authorization, and encryption in the database technology.

Every database poses different challenges when it comes to database design. However, it is worth considering the common challenges that database designers will come across, as discussed in the following section.

Common database design challenges

It is needless to stress that designing a database is a challenging task. Most of these challenges are characteristic of the environment and its users, which makes it very difficult to generalize them. However, there are common database design challenges that you will come across during database design. As a database designer, it is essential to understand what the common challenges are that designers face during database design. Identifying them at an early stage gives the database designer adequate time to prepare for them. In the following sections, we will look at the common challenges in database design.

Data security
Within the space of a couple of years, more than 100,000 systems were compromised simply because their databases had been completely exposed to the public internet. In many instances, the database is the main or the first culprit in a compromise. The challenge in database design with respect to security is that database security can be implemented with various options.

Data security comes with three aspects in databases. These are:

Authentication
Authorization
Encryption

Authentication is relatively less challenging than authorization. Authorization means defining which users can access which objects. Database objects can be schemas, tables, views, functions, and so on. Authorization can even go down to details such as columns or rows. Row-level access means that, in a table, one person can access one set of records but does not have access to another set of records in the same table. It can also be more complex, where a user can only read some tables but cannot update or insert into them.
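
As a hedged sketch of these levels of authorization in PostgreSQL (the table, column, and role names here are hypothetical), object-level, column-level, and row-level access can be expressed as follows:

-- Object level: read-only access to a whole table
GRANT SELECT ON employee TO reporting_user;

-- Column level: updates allowed only on selected columns
GRANT UPDATE (phone, address) ON employee TO hr_clerk;

-- Row level: each session sees only the rows of its own department
ALTER TABLE employee ENABLE ROW LEVEL SECURITY;
CREATE POLICY employee_dept_policy ON employee
    USING (department = current_setting('app.current_department'));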

Also, at the time of database design, the designer has to plan for security in the database. There can also be a need to plan for user groups, where multiple users are placed in groups and these groups are given authorization options. Having groups eases configuration at a later stage. However, it is important to note that introducing user groups can create conflicts. For example, a single user can be in multiple groups where those two groups have conflicting permissions. All of these complexities raise challenges for database designers.
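
In PostgreSQL, user groups are modeled as roles that other roles become members of. A minimal sketch with hypothetical names:

-- A group role that holds the permissions
CREATE ROLE sales_team;
GRANT SELECT, INSERT ON orders TO sales_team;

-- A login role that inherits the group's permissions
CREATE ROLE alice LOGIN PASSWORD 'secret';
GRANT sales_team TO alice;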

Apart from authorization, encryption is also challenging. Though you can enable encryption for an entire database, it will cause performance and maintenance issues. Therefore, at design time, the level of encryption and the extent to which it should be applied need to be clearly identified. Deciding what should be encrypted during design planning is a challenging task.
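
Column-level encryption is one way of limiting what is encrypted. The following is a hedged sketch using the pgcrypto extension; the table and the hard-coded key are purely illustrative, and a real deployment needs proper key management:

CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Store an encrypted value (card_number is assumed to be a bytea column)
INSERT INTO customer_secret (customer_id, card_number)
VALUES (1, pgp_sym_encrypt('4111-1111-1111-1111', 'encryption-key'));

-- Decrypt the value when reading
SELECT pgp_sym_decrypt(card_number, 'encryption-key')
FROM customer_secret;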

To the relief of database designers, most Relational Database Management Systems (RDBMS) support different types of security implementation. Since the database designer has to decide what database technology will be used, it is essential to understand what security level the user requires and what capabilities the chosen tool has.

Performance
Most database designers do not consider the performance aspect of the database. Apart from that, database designers have the challenge of visualizing or predicting the performance of the system in the future. Often there is miscommunication, or no communication, between the development team and the database team, so the database team does not know which queries will be run frequently from the application end. This makes it a challenging task for database designers to apply the correct indexes to the database. Database designers cannot apply indexes to all the columns, as having a lot of indexes will reduce performance.

At least indexes can still be applied at a later stage in production. However, there are table design changes, such as denormalization, that improve select performance. These need to be made as design changes, and to do this the database designer needs to understand the performance needs, which is a challenging task.
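
Once the frequently executed queries are known, matching indexes can be created. A minimal sketch, assuming a query that filters rentals by customer and date (the table and columns follow the sample database, but the index itself is hypothetical):

-- Intended to support queries such as:
--   SELECT * FROM rental WHERE customer_id = 123 AND rental_date >= '2006-01-01';
CREATE INDEX idx_rental_customer_date
    ON rental (customer_id, rental_date);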

Another challenge with performance is that designers are unable to predict the data volume in the future. With an increase in data volume, database performance is impacted negatively. Since database designers have difficulty predicting data growth, planning the performance of the database becomes challenging.

Data accuracy
The database will not be useful unless the data in it is clean. It is the database designer's challenge to visualize or predict what users will enter. The database designer also has to understand what level of transaction handling is needed in order to maintain the ACID properties of a transaction. A further challenge is implementing transactions in distributed environments.

The database also has other constraints, such as foreign key, check, and unique constraints. As a database designer, it is important to identify what level of constraint should be implemented, and to identify these constraints at design time. It becomes challenging to enable constraints later, once invalid data has already been saved in the database. Therefore, the database designer has the challenge of identifying the constraints at a very early stage so that the purity of the data is achieved, which leads to users trusting the data in the database.
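
A hedged sketch of how such constraints can be declared at design time (the tables and columns are hypothetical):

CREATE TABLE employee (
    employee_no   varchar(8)   PRIMARY KEY,
    email         varchar(100) UNIQUE,
    gender        varchar(6)   CHECK (gender IN ('Male', 'Female')),
    department_id integer      NOT NULL REFERENCES department (department_id)
);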

High availability
As stressed in several places, high availability is something that is often neglected by database designers. The main reason for this negligence is that many are of the view that high availability can be achieved once a database is deployed to production, and that it is the duty of the database administrators rather than the designer's responsibility. Though there won't be any difficulty for a database designer when there is a plan to implement full availability for the database via infrastructure, there will be a great challenge when it comes to implementing partial availability.

The challenge of designing availability for a database comes with defining the boundaries of high availability. The designer has the option of defining high availability by function or by other business objects, such as customer, project, and so on. The database designer has the challenge of identifying the partition object at the start of the database design phase, because it is very difficult to change the partition object at a later stage.

Another challenge with high availability is the existence of multiple databases. When multiple databases exist, there is the additional challenge of integrating data between databases and maintaining data quality. The database designer has to implement a mechanism for data integration, as default database integration mechanisms such as foreign key constraints cannot be implemented across databases.
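
One possible integration mechanism is the postgres_fdw foreign data wrapper, which lets one database query tables that physically live in another database. This is a rough sketch with hypothetical server, database, and table names:

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER customer_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'customer-db-host', dbname 'customerdb', port '5432');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER customer_server
    OPTIONS (user 'app_user', password 'secret');

CREATE FOREIGN TABLE customer_remote (
    customer_id integer,
    name        text
)
SERVER customer_server
OPTIONS (schema_name 'public', table_name 'customer');

-- The orders database can now validate customers with a query,
-- even though a real foreign key constraint across databases is not possible
SELECT count(*) FROM customer_remote WHERE customer_id = 42;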

Another challenge is moving data between these physical partitions. For example, let us say you have decided to design a database where customers are the object of high availability. This means that your high-importance customers' data will be in one database and other customers' data will be in one or more other databases. Over time, customers might need to move between these partitions. As a database designer, you need to define a mechanism to move data between these partitions without impacting the current business, or at least with minimum impact.

Summary
In this chapter, the planning of a database was discussed from different perspectives. We saw that database design principles should be aligned with high availability, integrity, security, extensibility, and performance.

We identified two popular types of database schema, OLTP and OLAP. OLTP is mainly used for transactional systems, while OLAP is used for analytical and reporting systems. With this analysis, we know what type of database we will be designing. During the planning stage of a database system, you need to identify whether it is OLAP or OLTP so that you can make decisions on the database design. There are database design approaches, namely bottom-up and top-down approaches. Further, we looked at a couple of other approaches to designing databases, namely the centralized and decentralized approaches. The database has a role in every system. Mainly, databases are used for storing and retrieving data in a secure way. In the case of PostgreSQL databases, tables, views, functions, and stored procedures are used for these actions. There are major database design challenges, such as security, performance, high availability, and data accuracy.

As a database designer, you need to plan for these challenges. We have identified common challenges that database designers will encounter. It is essential to plan the database design rather than start the design straight away, to avoid unnecessary chaos during design and development, and also when the system is in the hands of the customer. Data security, data accuracy, and high availability are the main challenges we identified.

In the next chapter, we will discuss different data models to provide an understanding of
these models and how they can be built.

Questions
Why is it important to understand the high availability options for the database at the database planning stage?

Believing that high availability can be achieved only via hardware infrastructure is a myth. This has led database designers to ignore the importance of considering high availability at the time of database design. High availability can be introduced to the database in different ways, such as partial availability. Partial availability means that you keep some of your operations available, horizontally or vertically. This means that either some functionalities continue to work for all end users, or some of your customers have the luxury of using the system without feeling any difference while other customers do not have the system. If you are designing a database that needs to cater for selected functionalities, you need to identify the key features that must remain available to the end users. This means you will be designing a separate database per functionality. When you are designing in that manner, it is important to provide a mechanism for cross-functionality. We also need to consider distributed transactions, as a single transaction will span multiple databases. Also, default built-in integrity will not work, as related data will be in multiple databases, so different mechanisms need to be introduced. For example, if Orders and Customers are in separate databases, there should not be customers in the Orders database who do not exist in the Customer database. This rule has to be implemented at the application end, and scheduled reports should be executed and sent to the relevant authorities when there are mismatches.

When you want to implement high availability from the customer point of view, it is essential to differentiate between customer activities and master data activities. When customers are separated, there should be a mechanism to identify where each customer resides. When a query hits the system, it needs to be understood which customer executed it and which database that customer belongs to. Apart from this, the same concerns of transactions and integrity exist for this form of high availability as well. Since it is difficult to modify this design later, it is always better to adopt these high availability options at the planning time of the database rather than at deployment time.

Why is it important to decide whether the database that you are designing is for OLTP or OLAP?

The mandates for OLTP and OLAP are different. OLTP designs are mainly for systems that support day-to-day transactions. In these types of systems, reads and writes are almost equal and the transaction times are very short. Also, in OLTP most of the transactions are pre-defined. For example, in customer order creation, you will enter a date, a customer code, the product items, and finally the discount and taxes. This shows that OLTP transactions are predominantly pre-defined. However, when it comes to OLAP, one user might start with a product-wise analysis while another user starts with a monthly, customer-wise analysis. You cannot set rules for analysis, so OLAP usage is more ad hoc in nature. Since OLAP predominantly caters for the analysis of data, most, if not 95%, of OLAP transactions are retrievals. For a database that requires more reads, it is better to have a smaller number of tables. If you have a large number of tables, you need to join multiple tables to retrieve information, and joining multiple tables requires a high cost in resources, whereas with fewer tables the cost is lower. This means that, though normalized structures are suited to OLTP systems, for OLAP de-normalized structures are preferred.

How do you plan for performance improvement during the database design stage?

Performance is difficult to plan for during the design stage, as performance issues are mostly identified only after the database has been deployed to the production environment. As the database designer, you need to understand the growth of the database so that the correct indexes can be planned. Also, rather than applying indexes for all possible column combinations, it is always better to implement indexes in a scientific way. To achieve this, as a database designer, work closely with the application development team to identify the important and frequently running queries. By identifying them, you can implement the proper indexes in the database, and by doing so you make the life of a DBA easier.

4
Representation of Data Models
After the planning for the database was completed in the previous chapter, we now need to look at how database models are represented. Most database designs do not start in a database technology itself. If you start a database design directly in the database technology, you will run into a lot of rework issues. Therefore, a database should be designed with a proper process, following a scientific method.

The database design will have multiple processes and stages. Initially, we will look at building the conceptual model and how the conceptual model is verified. Next, we will examine how the semantic data model is designed. Then we will look briefly at the physical data model, as there is a separate chapter on defining the physical data model.

Apart from building these models, we will look at how to verify them. Further, we will look at how the necessary documentation is done for each model.

In this chapter, we will cover the following topics:

Introduction to Database Design
Creating Conceptual Database Design
Define the Logical Database Design
Define the Semantic Data Model
Defining the Physical Data Model

Introduction to Database Design

Even a less complex database design should not start directly in the technology itself. Before starting to design the database in the selected technology, it is important to understand what the exact problems are and whether you have captured them correctly. This does not depend on the database technology that you will choose at a later stage. Therefore, database design is a methodical approach that needs to be carried out with specific procedures and techniques.

Database design methodology needs experienced resources. During database design, it is important to ask the right questions of the customers. Customers may have very high-level requirements. As a senior, experienced database designer, it is important to ask quality questions that will lead to a sustainable design. Therefore, it is important to employ an experienced database designer at this stage.

The database design methodology contains multiple phases, and each phase contains several steps and milestones to achieve. Because of this phase-defined methodology, project managers can plan the activities, and milestones can be tracked effectively. With this approach, the database can be designed in a standardized and well-defined manner.

The three main phases of database design are as follows:

Conceptual Database Design
Logical Database Design
Physical Database Design

Let us see what the critical factors in database design are, in the following section.

Critical Success Factors in Database Design

To verify the success of the database design, it is always better to follow guidelines so that verification can be done throughout the database design phases.

There are several critical success factors in database design, which are listed below.

1. Continuous interaction with all stakeholders: You need to understand that the ultimate beneficiaries of the database are the users of the system. Therefore, it is always important to keep interacting with them. There are also cases where stakeholders such as database administrators and network administrators are ignored, as many are of the view that those stakeholders do not have any impact on the database design. As we discussed in the previous chapters, network and infrastructure engineers need to be involved in the high availability aspects of databases. As a rule of thumb, it is always better to make changes as early as possible rather than later; therefore, it is better to have discussions with the users from day zero.
2. Follow a Structured Methodology: The database design should be done in a methodical way from start to end, and even after the deployment of the database to production. During the maintenance cycle, it is also important to follow the structured methodology for modifications.
3. Data-Driven Approach: The data-driven approach is a process that is driven by data rather than by intuition or personal experience. Naturally, database designers tend to design databases from their experience. Though experience is an important factor in database design, designers might get carried away by experience alone and tend to forget the data-driven approach.
4. Incorporate integrity considerations into the data models: Systems do not operate in isolation. Today or some other day, multiple systems will integrate with the system that you are designing. Therefore, as a database designer, you need to keep in mind that the database will be integrated in the future and your design should be compatible with that.
5. Combine conceptualization, normalization, and transaction validation techniques: Normalization is an important concept in databases, and during the design of databases it is important to follow normalization. However, in reporting and data warehousing projects, de-normalization is also followed.
6. Use diagrams to represent the data models: Diagrams are the best representation of data models. Therefore, always use diagrams to represent data models.
7. Use a Database Design Language.
8. Build a data dictionary to supplement the data model diagrams: A data dictionary is a collection of descriptions of the data objects or items in a data model. The data dictionary is built for the benefit of programmers and others who need to refer to them.
9. Repeat Steps: Humans are always hesitant to repeat work, even when they find they are on the wrong path. You need to understand that the database is the heart of the system. Therefore, whenever there is even a minor error, do not hesitate to correct it. Sometimes you might have to redo the entire model again. As a rule, it is always better to correct errors at an early stage than to face a lot of issues when the project is in the customer's hands.

Let us consider the three main types of database design: conceptual database design, logical database design, and physical database design.

Let us look into how conceptual database design is done in the next section.

Creating Conceptual Database Design

The database design method starts with conceptual database modeling. The conceptual database model is independent of implementation details such as the database technology, application programs, and operating system.

Let us look at the steps in Conceptual Database Design below:

1. Identification of entity types
2. Identification of relationship types
3. Identify and associate attributes with entity or relationship types
4. Determine attribute domains
5. Determine candidate and primary keys
6. Validate the model for redundancy
7. Validate the local conceptual model against user transactions
8. Review the local conceptual model with the user(s)

In the Conceptual Design phase, we will start with Local Conceptual Design in the next
section.

Local Conceptual Design

In this phase, a conceptual data model is built for each view of the organization. During the analysis, multiple user views need to be considered; each of these is called a local conceptual data model. During the conceptual model building stage, there can be an overlap of requirements. If so, those local conceptual designs can be merged to form another local conceptual design.

Let us assume that you are assigned to build a database for supermarket billing. There will be three user views for this scenario.

Let us see the differences between the different sub-user views, as shown in the following table:

View          Sub-user Views
Staff         Consists of Point of Sale (POS) operators, Supervisors, Line Managers, and Branch Managers
Customer      Retail Customers, Wholesale Customers, Priority Customers
Supplier      Individual Suppliers, Internal Suppliers, Organization, Value-added Suppliers

During this chapter, we will build the conceptual model for the Customer view.

Let us see how we are going to identify the entity types.

Identifying Entity Types

As discussed in Chapter 2, Building Blocks Simplified, identifying the entity types is an important step in the database design process. There are two methods of identifying entity types:

1. From nouns and noun-phrases
2. From objects

Let us see how entity types are identified from nouns or noun-phrases in the following section.

From Nouns or Noun-Phrases

From the requirement analysis document, you should be able to pick out nouns and noun phrases. For example, when it says that a customer purchases products, a customer returns items, a customer earns points, and a customer redeems points, it is very easy to note that customer, item, and point are entity types.

The next way of identifying the entities is by checking for existing objects, which is discussed in the following section.

Objects
An alternative way of identifying entity types is to look for the objects that exist in the scenario. If you look at the previous example, we know for a fact that in a supermarket there are objects such as Staff, Customer, and so on. The database designer's experience plays a key role when identifying entity types with this method.

Challenges in Identifying Entity Types

Though it is easy to say that the first step in conceptual design is identifying the entity types, it is by no means an easy task.

Let us look at some of the challenges below.

The requirement document may not be very clear. As you know, the requirement document will be written as free-form text; therefore, it can be vague and it will not be that easy to find the relevant entity types. The requirement document contains views from various users, and it will be challenging to identify the entity types from it.

It becomes more challenging when there are synonyms and homonyms in the requirement document. Synonyms are different words that have the same meaning. Employee and Staff are often both used, and it needs to be identified that they have the same meaning. Therefore, Employee and Staff should not be two different entity types but a single entity type. Client and Customer is another example of a synonym, as are Item and Product.

A homonym is the same word used with different meanings. In this case, it has to become two different entity types with different names.

The user requirement document will also contain acronyms that are often used by users. GRN and PO are generally used by end users to refer to Goods Received Notes and Purchase Orders, respectively.

When identifying the entity types using the object method, it mostly depends on personal judgment and personal experience.

Documentation
Documentation is an important factor in any phase of the database design process. When entity types are identified, it is important to document them.

The following table is a sample document for the identified entity types:

Entity Type: Employee
Description: The staff of the organization; there are Managers, Supervisors, and Cashiers. There are permanent, trainee, and temporary staff in the organization. Each employee works at an outlet or at the head office.
Similar Names: Staff

Entity Type: Customer
Description: Retail and wholesale customers who buy and return items.
Similar Names: Client, Consumer, Buyer

Entity Type: Product
Description: Products that are offered at the supermarket. Some products are supplied by the supermarket's registered suppliers and retail suppliers.
Similar Names: Item

Entity Type: Supplier
Description: Suppliers are the ones who supply products and services to the supermarket. There are retail and wholesale suppliers registered with the supermarket.
Similar Names: Contractor, Dealer, Provider

Entity Type: Point
Description: When a customer buys products, points are earned which can be redeemed. Also, special points are received for selected products during promotions.
Similar Names: (none listed)

Entity Type: Sales
Description: Customers get products and services from the supermarket through orders and invoices.
Similar Names: Orders, Invoices

Entity Type: Purchases
Description: Getting products and services from the suppliers to offer to customers.
Similar Names: Purchase Orders (PO), Goods Received Notes (GRN)

From the table above, we can see that all the entity types are listed along with similar terms and details of the entity types. This should be a living document from which any stakeholder can get information at any time.

Let us see the process of identifying the relationship types in the following section.

Identification of Relationship Types

Entity types do not stay in isolation; they have relationships with other entity types. As per the previous example, the customer buys products and services through invoices from the staff who are employed at the supermarket. Here, the Customer, Product, Service, Invoice, and Staff entity types have relationships, and it is important to identify the relationship types.

Just as we used nouns and noun phrases to identify the entity types, verbs can be used to identify relationship types. However, in most cases the requirement document does not explicitly mention the verb, and it is the designer's job to identify the verbs that are implied and derive the relationships from them.

Let us look at different relationship types in the following sections.

Binary Relationships
Most of the time, entity types have a binary relationship, which means that the relationship involves two entity types, or the degree of the relationship type is two. Let us look at the requirement Customer Buys Product.

This can be represented by the following Entity-Relationship (ER) model:

There can be scenarios where multiple relationships exist between two entity types. For example, a customer will earn points when he or she buys a product or a service, and also has the option of redeeming earned points under certain rules.

This scenario can be represented as below:

In the above model, the Customer and Product entity types have a relationship named Buys.

Complex Relationships
Though there are simple binary relationships, there can be cases where a relationship has a degree of three or more.

When a customer buys a product, it will be bought through an invoice. This can be represented in the following figure:

As shown in the above figure, the Customer, Product, and Invoice entities are related by means of the relationship called Buys.

Recursive Relationships
Recursive relationships are relationships in which the same entity type participates more than once. For example, in the same Staff entity there are managers and subordinates who report to the managers.

However, both of those employees are in the same entity type, as shown below:
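
In addition to the diagram, a hedged sketch of how such a recursive relationship could eventually be implemented in SQL is a staff table whose manager column refers back to the same table (the names are illustrative):

CREATE TABLE staff (
    staff_id   integer PRIMARY KEY,
    name       text NOT NULL,
    manager_id integer REFERENCES staff (staff_id)  -- the manager is also a member of staff
);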

In the next section, we will identify and associate the relevant attributes with entities or relationship types.

Identify and Associate Attributes with Entity or Relationship Types

The next step is to identify the attributes of entity and relationship types. In the same way that we identified the entity types, nouns or noun phrases can be used to identify the relevant attributes of entity types and relationship types. However, you might not be able to get all the attributes from the requirement document explicitly. Therefore, it is the database designer's duty to verify the requirements and gather the necessary information to capture the attributes. In some cases, the designer's experience plays a vital role in capturing the attributes of entity types and relationship types.

There are different types of attributes, which were discussed in Chapter 2. Simple/composite attributes, single/multi-valued attributes, and derived attributes are the different types of attributes. It is essential to understand that attributes should be defined for entity types as well as for relationships. For example, while we define attributes such as Name, Date of Birth, and Address for the Customer entity type, at the time of customer registration the registration date and the employee who performed the registration are attributes of the relationship, not of the entity type.

Let us see the possible attributes for the Employee entity type below:

EmployeeNo
Name (composite: Title, First Name, Middle Name, Last Name)
Address (composite: Address I, Address II, City, PostCode)
Designation
Previous Designations (multi-valued)
Manager (composite: Title, First Name, Middle Name, Last Name)
Date of Birth
Age (derived from Date of Birth)
Gender

The above are the basic attributes for the Employee entity type.

Now let us look at the possible attributes of the relevant relationships with respect to the Employee entity type.

For example, let us say a staff member joins the supermarket and is later promoted to a new designation. When you examine these as relationships, there are attributes that belong to the relationships: for the join relationship there will be a Join Date, and for the promotion relationship there will be a Promotion Date. For previous promotions, there will be promotion dates as well.

When identifying attributes, it is necessary to identify a few other details of those attributes as well:

Attribute Name and Description: The attribute name is identified, and it is essential to note the description of the attribute. The description will be helpful when the attribute is referred to later, to resolve any confusion.
Different Names or Synonyms: Different users may refer to an attribute by a different name. Gender and Sex are both used for the gender attribute of Staff. Recording synonyms helps avoid confusion that may arise later.
Data Type: Since the database technology is not yet selected at this point, data types should be described generically. For example, a customer's date of birth is a date, but in a specific database it could be stored in any of the date or datetime data types; here we simply identify it as a date. A customer name will be identified as a string, and an address will also be a string. It is important not to use different names for the same generic type; for example, for string attributes it should consistently be either string or text, not both.
Data Length: At the requirement gathering stage, the length of each attribute can easily be identified.
Attribute Type: Whether it is a simple, composite, derived, or multi-valued attribute.
Default Values: Some attributes have default values. For example, when an employee joins an organization, his or her status will be Active. This will be a default value.

Identifying attributes will face a few obstacles, which are discussed in the next section.

Troubles when Identifying Attributes

You might think that identifying attributes is a trivial task. Though it appears easy, identifying the attributes may not be, and you will face obstacles such as missing attributes and duplicate attributes. Those issues are discussed in this section.

Missing Attributes: As you are aware, there are a large number of attributes associated with entities and relationships. It is not a crime to miss an attribute, but as a database designer it is your duty to identify the essential attributes, and it is essential to allow for attributes to be added later. Since attributes can be identified at different stages, you need to make sure that all the relevant documents are revisited and updated by going backward.

Duplicate Attributes: Sometimes there can be situations where the same attribute is associated with multiple objects. For example, the employee joined date is associated with the Employee entity and is also an attribute of the relationship.

Documentation
It is not necessary to emphasize the importance of documenting. It is essential to document the attribute information as well.

Let us see a simple documentation of the Employee entity in the following table.

Entity Type / Relationship: Employee

Attribute Name: EmployeeNo
Description: The attribute which will identify the employee uniquely.
Data Type: String, Length: 8, Is Required: Yes, Multi-Valued: No, Composite: No

Attribute Name: Title
Description: Title of the Employee. Mr, Mrs, Ms are the possible values.
Data Type: String, Length: 4, Is Required: Yes, Multi-Valued: No, Composite: No

Attribute Name: First Name
Description: First name of the Employee.
Data Type: String, Length: 30, Is Required: Yes, Multi-Valued: No, Composite: No

Attribute Name: Middle Name
Description: Middle name of the Employee.
Data Type: String, Length: 30, Is Required: No, Multi-Valued: No, Composite: No

Attribute Name: Last Name
Description: Surname or the last name of the Employee.
Data Type: String, Length: 30, Is Required: Yes, Multi-Valued: No, Composite: No

Attribute Name: Name
Description: Composite attribute of Title, First Name, Middle Name, and Last Name.
Composite: Yes

Attribute Name: Mobile Number
Description: Mobile number(s) of the Employee.
Data Type: String, Length: 10, Is Required: No, Multi-Valued: Yes

Determining attribute domains is an important task during the representation of models, and it is discussed in the following section.

Determining Attribute Domains

This step is to identify the range of values for each attribute. For example, the date of birth and the date joined each have a range. Gender will have a value of Male or Female. Marital status will have Single, Married, or Other. By specifying these domains, it is easy to implement check constraints during the physical database design phase. It is also essential to document these ranges.
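
When the physical design stage is reached, such documented domains translate naturally into CREATE DOMAIN definitions or check constraints in PostgreSQL. A minimal, hypothetical sketch:

CREATE DOMAIN gender_type AS varchar(6)
    CHECK (VALUE IN ('Male', 'Female'));

CREATE DOMAIN marital_status_type AS varchar(10)
    CHECK (VALUE IN ('Single', 'Married', 'Other'));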

Let us see how we can determine candidate and primary keys, in the next section.

Determine Candidate and Primary Keys

A candidate key is an attribute or a combination of multiple attributes that can qualify as a unique key in the database. There can be multiple candidate keys in one table, and each candidate key can qualify as a primary key. A primary key is a column or a combination of columns that uniquely identifies a record; however, you can have only one primary key per entity type. Once the primary key is selected from the available candidate keys, the remaining candidate keys are called alternate keys.

When selecting a primary key, the following considerations should be taken into account (a short example follows the list):

1. Unique: Needless to say, the primary key has to be unique, as it is the column that will be used to identify the record.
2. Definite: The primary key should not have NULL or empty values; there has to be a value for the primary key.
3. Stable: The primary key should not change with time. When you are selecting a primary key, you need to select attribute(s) that do not change over time.
4. Minimal: The primary key should have few attributes and a short length. For example, Customer Name and Customer Address are not recommended as primary keys.
5. Factless: The primary key should not carry any facts; it can simply be an internal value that does not have a business meaning.
6. Accessible: The primary key should be available at the time of record creation; it should not be filled in later.
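
A common way of satisfying these considerations is a system-generated surrogate key. A hedged sketch (the table name is illustrative):

CREATE TABLE customer (
    -- unique, definite, stable, minimal, factless, and available at insert time
    customer_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name        text NOT NULL
);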

After the model is built, the next step is to validate the model. Model validation is discussed in the following section.

Validate the Model for Redundancy

Validating for redundancy is the process of capturing redundant objects in the model. The main way to detect model redundancy is by examining one-to-one relationships. If there are entity types that have a one-to-one relationship, those entity types can often be consolidated into a single entity type, for example, if there is a one-to-one relationship between the Employee and Employee Profile entity types.

Another type of redundancy is redundant relationships. A redundant relationship means that the same information can be obtained via a different relationship. However, this does not necessarily mean that one of the relationships is redundant, since they may represent different associations between the entities.

One of the validation techniques is to validate the local conceptual model against user
transactions. Let us discuss this technique in the following section.

Validate the Local Conceptual Model Against User Transactions

After defining the conceptual model, another way of validating it is to validate it against user transactions. This is a very common method used in the industry. Since the conceptual model is built using the user requirements, with this technique the conceptual model can be verified to check whether it is aligned with the user requirements.

Let us see how to review the local conceptual model with the users in the following section.

Review the Local Conceptual Model with the Users

The local conceptual model is built by identifying user requirements. Therefore, it is essential to review the local conceptual model with the users. After the review is done by the users, it is essential to update the conceptual model and verify it with the above steps again, so that you can make sure that the conceptual model is valid for the next step.

The next step is to define the logical data model.

Define Logical Data Model

After the conceptual model is finalized, the next step is to build the logical data model. When defining the logical data model, the following steps are carried out:

1. Derive relations for the local logical data model
2. Validate relations using normalization
3. Validate relations against user transactions
4. Define integrity constraints
5. Review local logical data models with users
6. Merge local logical data models into a global logical model
7. Validate the global logical data model
8. Review the global logical data model with users

Let us have a detailed analysis of these steps.

Derive Relations for the Local Logical Data Model

In this step, relations for the entities and attributes are identified. The relationship between entities is defined by primary key to foreign key relationships. To achieve this, we need to identify the parent and child entities. For example, the Department relation will be the parent entity whereas the Employee entity will be a child entity.

These relations can be one-to-one relationships or one-to-many relationships. We will discuss why it is important to remove many-to-many relationships in Chapter 7, Database Design Patterns.

Validate Relations Using Normalization

The purpose of normalization is to achieve an effective and efficient database model by separating the identified entities. Though there are arguments that normalization does not provide maximum performance in scenarios such as reporting, it is advisable to follow the process of normalization. Normalization is a solution for modification anomalies.

It is recommended to normalize at least up to the Third Normal Form (3NF). In the First Normal Form (1NF), repeating groups are removed. Partial dependencies on the primary key are removed in the Second Normal Form (2NF). Transitive dependencies on the primary key are removed in the Third Normal Form. Though there are other normal forms, such as Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF), it is recommended to provide at least 3NF for transactional databases.

Normalization will be discussed in detail in Chapter 5, Applying Normalization.

Validate Relations Against User Transactions

This is a similar task to the one we did during the design of the conceptual model. It is a mandatory task so that the defined relations can be verified; in particular, the primary key / foreign key relations have to be verified.

In case you are unable to perform user transactions against the model, you need to rework the model. Most likely you have made a mistake, or there might have been an error in the user requirements. This has to be corrected before moving to the next stage. After finding and correcting the error, the previous steps have to be revisited again.

Define Integrity Constraints

Integrity constraints are implemented in order to keep the data consistent. For example, there should not be an employee whose department is invalid.

Further, there can be some attributes whose values must exist. For example, every employee should have a department, and empty or null values should not be permitted for it.

Also, it is important to specify what the actions are when there is a data delete or update; otherwise a delete or update can bring the data into an inconsistent state.

RESTRICT - This will not allow referenced data to be deleted.
CASCADE - When the parent record is deleted, all the child rows will be deleted. This is a very dangerous setting and should be used very carefully.
SET NULL - When the parent record is deleted, the referenced column of the child records will be set to NULL.
SET DEFAULT - When the parent record is deleted, the referenced column of the child records will be set to a default value.

These are the available options for referential actions in PostgreSQL; an example of how they are attached to a foreign key definition follows the note below.

In PostgreSQL, there is an additional option called NO ACTION, which is the default. NO ACTION means that if any referencing child rows still exist when the constraint is checked, an error is raised. The key difference between NO ACTION and RESTRICT is that NO ACTION allows the check to be deferred until later in the transaction, whereas RESTRICT does not.
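
A hedged sketch of how these actions are attached to a foreign key definition (the tables are illustrative):

CREATE TABLE department (
    department_id integer PRIMARY KEY,
    name          text NOT NULL
);

CREATE TABLE employee (
    employee_id   integer PRIMARY KEY,
    name          text NOT NULL,
    department_id integer REFERENCES department (department_id)
                  ON DELETE RESTRICT   -- alternatives: CASCADE, SET NULL, SET DEFAULT, NO ACTION
                  ON UPDATE CASCADE
);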

Another scenario is implementing domain constraints. The previously discussed constraints are more technical, whereas a domain constraint is business-domain specific. With respect to the Employee domain, Gender should be either Male or Female, and marital status should be one of Single, Married, or Divorced. You should not have any values other than the defined ones.

Review Local Logical Data Models with Business Users

After you have completed the local logical data models, the next step is to verify them with the relevant business users. It is important to validate the logical model against all the business user scenarios. This step is an important task, as it helps to eliminate any gaps between the business users and the technical designers.

In the case of any issues, you have to go back to the initial step and follow the steps again. However, it is always better to correct those errors than to find them when the system is deployed to production. When users are using the system, it will be very difficult to do error corrections.

There can be cases where you have only one local model. If only one local logical model is present, then you are done with the logical model and it is not necessary to proceed to the other steps. In most scenarios, you will have only one local logical model.

Merge Local Logical Data Models into a Global Logical Model

If you have multiple local logical views, you need to merge them into a single global logical model. In the case of large enterprise systems, there is a greater chance that there are multiple local logical models.

In this step, it is essential to review the entity names and their relationships, and then these entities should be merged. At the conceptual modeling level, you won't have many attributes, but at the logical data modeling level you have more attributes. Therefore, merging is more challenging at the logical design level.

There can be four types of merging:

1. Merging entities / relations with the same names and the same primary keys.
2. Merging entities / relations with the same names and different primary keys.
3. Merging entities / relations with different names and the same primary keys.
4. Merging entities / relations with different names and different primary keys.

After the merging of entities, the foreign keys and other constraints should be merged.

Validate the Global Logical Data Model

This is the same as validating the local logical data models. However, this step will be time-consuming and thus is an important step. In this step, all the individual user transactions are validated against the global logical data model.

If this step fails even after success in validating the local logical data models, it means that there are shortcomings at the merging step. In that case, the database designers have to review the merging of the logical models.

Now you are ready to go to the business users with the logical model.

Review the Global Logical Data Model with Business Users

The final step is to validate the global logical data model with the business users. If you have done great work at the previous stages, it is more likely that you are taking an accurate design to the business users, and it is likely that this step will not take much time.

Though these are presented as sequential steps, in practice database designers will follow incremental design at different scales.

Now we will discuss defining the semantic data model in the next section.

Define the Semantic Data Model

The semantic data model is a newer trend in database modeling. A semantic data model is a conceptual data model with semantics, or meaning, built into it. This means the model adds meaning to its instances. The semantic model helps parties interpret the objects stored in the database.

In 1981, Hammer and McLeod presented the Semantic Data Model, which was extended by Shipman by means of the Functional Data Model. Later, in 1983, the Semantic Association Model was introduced by Su. All of these attempts aim to build semantics into the data model.

If we know that the Nile River flows through countries such as Ethiopia, Sudan, and Egypt, and since we know that these countries are in Africa, from the semantics we can conclude that the Nile River flows through Africa. In similar terms, we can build semantics from databases as well.

Let us examine how we can implement a semantic data model in a database from the following screenshot:

Let us see how to define the E-R data model in the following section.

E-R Modeling
We discussed identifying entities and their attributes in detail in Chapter 2, Building Blocks Simplified. The E-R model is mainly used as a tool for communication between technical and non-technical users. The E-R model is identified as a top-down approach to database design. E-R modeling starts with identifying the entities, their attributes, and their relationships.

The following screenshot indicates the standard symbols for E-R modeling:

In Chapter 2, we identified different types of attributes for a given Invoice entity, as shown in the following screenshot.

In the Invoice entity type, there are different types of attributes, such as Invoice ID (primary key), Patient Name (composite attribute), Drug (multi-valued attribute), Invoice Date (normal attribute), and Patient Age (derived attribute).

In a business scenario, there are multiple entities, and those were identified during conceptual modeling. Use case scenarios, as shown in the following screenshot, can be used to identify the entities in the business case.

In the above use case diagram, there are three simple actors: Patient, Cashier, and Doctor. In the use case diagram, we have identified simple use cases. By analyzing these use case diagrams, we can identify the entities. In this business scenario, the Cashier, Patient, Prescription, Doctor, and Invoice entities can be identified.

These entities do not live in isolation; one or more entities are related to each other with different relationships. Entities and their relationships are shown in the E-R diagram. A sample E-R diagram for the above scenario is shown in the following screenshot.

It is essential to identify the relationships, as they will be a key factor when defining required and multi-valued attributes. Further, the above E-R diagram indicates the relations of entities to other entities. For example, the Cashier entity is related to Prescription, Invoice, and Patient. In addition, Cashier has a relationship to the Doctor entity via the Prescription entity.

It is important to be able to read this E-R diagram, as it becomes the communication language between technical and non-technical users. Relationships can be read as shown in the following screenshot.

This means that by reading the E-R diagram, business users and technical users are able to see a pictorial view of the business case.

Though the E-R model is widely used in the industry, let us look at the problems with the E-R model.

Problems with the E-R Model

Problems in the E-R model, referred to as connection traps, typically occur due to the incorrect interpretation of certain relationships. There are two types of connection traps, which are named fan traps and chasm traps.

Fan Traps
Let us discuss this connection trap by means of an example.

The above E-R model explains that each employee belongs to a company and each
company has one or many Employees. On the other hand, each company has one or more
divisions and each division belongs to one company. With the above E-R diagram, it is not
possible to find the division of an employee. If you want to know which Division the
employee Simon belongs to, you will not be able to answer. The inability to answer this
question is a result of the fan trap associated with the E-R diagram.

We can resolve the fan trap issue by changing the relationship. If you closely analyze the
above relations, you will understand that it is a hierarchical structure as shown in the below
screenshot.


Now, you can modify the E-R diagram to suit the above hierarchical structure.

From the above E-R diagram, it is now possible to find which Division each employee
belongs to. Since each division belongs to one company, it is possible to find the company
of the employee as well.

When identifying the business requirements, it is essential to identify the
hierarchical structures. Those hierarchical structures should be modeled to
maintain their hierarchical nature. By doing so, the Fan Trap can be avoided.

Next, let us identify Chasm Traps, which are discussed below.

Chasm Traps
A Chasm Trap occurs when it is not possible to find a pathway between entities for some
occurrences of those entities. Let us illustrate this by means of an E-R diagram.

In the above E-R diagram, the Division and Employee entity relations can be understood
easily, as they are trivial. Let us look at the Project to Employee relationship. For each
project, there can be zero or one employee, and for each employee, there can be zero or
more Projects. This means there can be projects where no employees are assigned. Because
of this optional behavior of the Employee, for the Projects where no employees are
assigned, it is not possible to find out the Division that each project belongs to. This
behavior is called the Chasm Trap. To avoid the Chasm Trap, it is essential to include a
direct relationship between the Division and Project without going through the Employee,
as shown in the below screenshot.


The Chasm Trap shows how important it is to find the correct relationship
type, whether it is zero-or-more or one-or-more. This issue will not occur if
the relationship is one-to-many instead of zero-to-many.

When an additional relationship is introduced between Project and Division, there should
not be a conflict when the project is allocated to an Employee. For example, let us assume
that Project A is assigned to Division A and later it should be attached to an employee who
is a member of Division A. If an Employee who is a member of a different division is
assigned, it is important to revert the previously assigned Division to the correct value.

After the conceptual and logical models are completed, next is to implement your design
into a physical data model which is discussed in the next section.

Defining Physical Data Model


The physical data model is the model with which you will implement the database. Until
this stage, you do not have to worry about the database technology. Depending on the
logical design, you can choose the database technology. However, in some cases, you may
not have a choice. In some organizations, due to various reasons, you have to do your
design for a given database technology instead of choosing one you like.

At this stage, you need to have experienced resources on the selected database technology.
In addition, the resource person needs to know the features of the database technology that
you want to use. Further, it is recommended to have an understanding of the technology
road map for the database technology.

This is what you see in the database as tables. As there are multiple steps in Conceptual
and Logical Data Modeling, there are multiple steps to define the physical data model as
well:

1. Translate Global Logical Data model to selected database technology model.


2. Design Physical representation
3. Define User Views
4. Define Security mechanisms

Let us look at each step in detail.

Translate Global Logical Data Model


In this step, there are three main activities: designing the base relations, designing the
representation of derived data, and defining the enterprise constraints.

Let us design the physical data model for the following E-R model.

The following are the identified attributes.


Now we need to define tables for identified entities. The following are the defined tables
for the above E-R diagrams.

Cashier {CashierID, FirstName, MiddleName, LastName}

Doctor {DoctorID, FirstName, MiddleName, LastName, Specialty, ContactNumber, MainHospital}

Patient {PatientID, FirstName, MiddleName, LastName, Gender, ContactNumber, Age, DateofBirth}

PrescriptionHeader {PrescriptionID, Date, DoctorID, PatientID}

PrescriptionDrugs {PrescriptionID, LineNo, Drug}

InvoiceHeader {InvoiceID, PrescriptionID, Date, DoctorID, PatientID, CashierID}

InvoiceDrugs {InvoiceID, LineNo, Drug}


Since Drug is a multi-valued attribute, it is necessary to move drugs into a
separate table, as shown. In the above table structures, PrescriptionDrugs
and InvoiceDrugs are defined for this purpose.

Now let us define Base Relations in the next section.

Design Base Relations


In this step, relations, their attributes, primary keys, and foreign keys are defined.

This is the Cashier table defined in PostgreSQL.

In the above relation, or table, CashierID is the Primary Key. FirstName and LastName are
required, whereas the MiddleName column is optional. This is set by the NOT NULL option
as shown in the above screenshot.

Following is the SQL Script to create the Cashier table.


CREATE TABLE public."Cashier"
(
"CashierID" serial NOT NULL ,
"FirstName" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"MiddleName" character varying(50) COLLATE pg_catalog."default",

[ 111 ]
Representation of Data Models Chapter 4

"LastName" character varying(50) COLLATE pg_catalog."default" NOT NULL,


CONSTRAINT "Cashier_pkey" PRIMARY KEY ("CashierID")
)

Now let us design the Doctor table, which is much the same as the Cashier table; the only
exception is the addition of the Specialty, ContactNumber, and MainHospital attributes.

The Doctor table can be created from the following script in PostgreSQL.
CREATE TABLE public."Doctor"
(
"DoctorID" serial NOT NULL ,
"FirstName" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"MiddleName" character varying(50) COLLATE pg_catalog."default",
"LastName" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"Specialty" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"ContactNumber" character varying(15) COLLATE pg_catalog."default" NOT
NULL,
"MainHospital" character varying(50) COLLATE pg_catalog."default",

[ 112 ]
Representation of Data Models Chapter 4

CONSTRAINT "Doctor_pkey" PRIMARY KEY ("DoctorID")


)

Similarly, Patient, PrescriptionHeader, PrescriptionDrugs, InvoiceHeader,


and InvoiceDrugs tables can be defined.
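As a hedged illustration of how two of the remaining tables could look, the following sketch creates PrescriptionHeader and PrescriptionDrugs with primary and foreign keys. The constraint names and column sizes are assumptions, and it is assumed that the Patient table has already been created in the same style as Cashier and Doctor, with PatientID as its primary key.

-- Header table, referencing the Doctor and Patient base relations
CREATE TABLE public."PrescriptionHeader"
(
    "PrescriptionID" serial NOT NULL,
    "Date" date NOT NULL,
    "DoctorID" integer NOT NULL,
    "PatientID" integer NOT NULL,
    CONSTRAINT "PrescriptionHeader_pkey" PRIMARY KEY ("PrescriptionID"),
    CONSTRAINT "PrescriptionHeader_Doctor_fkey" FOREIGN KEY ("DoctorID")
        REFERENCES public."Doctor" ("DoctorID"),
    CONSTRAINT "PrescriptionHeader_Patient_fkey" FOREIGN KEY ("PatientID")
        REFERENCES public."Patient" ("PatientID")
);

-- Detail table for the multi-valued Drug attribute, keyed by prescription and line
CREATE TABLE public."PrescriptionDrugs"
(
    "PrescriptionID" integer NOT NULL,
    "LineNo" integer NOT NULL,
    "Drug" character varying(100) NOT NULL,
    CONSTRAINT "PrescriptionDrugs_pkey" PRIMARY KEY ("PrescriptionID", "LineNo"),
    CONSTRAINT "PrescriptionDrugs_Header_fkey" FOREIGN KEY ("PrescriptionID")
        REFERENCES public."PrescriptionHeader" ("PrescriptionID")
);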

Design Representation of Derived Data


From this design, it is important to design the derived data. This means business users need
to derive information from the above design. The following are samples of derived data
that will be requested by the business users using the database:

Number of Patients for each doctor.


Total number of Sales
Cashier wise sales
Most popular drugs

As a database designer, you need to make a choice whether you provide additional tables
for the above queries or calculate them when needed.

If you provide them as additional tables, there can be a data gap or a storage cost. With the
second option, the processing cost is the question you need to answer.
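As a minimal sketch of the two options, assuming the PrescriptionHeader table sketched earlier, the first statement computes the number of patients per doctor on demand through a view, while the second stores the pre-calculated figure in a summary table that would have to be refreshed; the object names are illustrative.

-- Option 1: calculate the derived data on demand (no extra storage)
CREATE VIEW public."PatientsPerDoctor" AS
SELECT "DoctorID", COUNT(DISTINCT "PatientID") AS "NumberOfPatients"
FROM public."PrescriptionHeader"
GROUP BY "DoctorID";

-- Option 2: store the derived data in an additional table (faster reads, must be refreshed)
CREATE TABLE public."DoctorPatientSummary"
(
    "DoctorID" integer NOT NULL PRIMARY KEY,
    "NumberOfPatients" integer NOT NULL
);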

Design Enterprise Constraints


After the physical table structures are completed, you need to implement enterprise
constraints. Though you can define these constraints in the application, it is better to
implement them in the database layer as well. The advantage of implementing constraints
in the database is that they are also enforced during bulk data loads.

The Gender of the Patient can be considered for an enterprise constraint. You can define
Male and Female as the allowed values for this column.
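A minimal sketch of such a constraint, assuming the Patient table has a character Gender column, is shown below; the constraint name is illustrative.

-- Enterprise constraint: restrict Gender to the agreed values
ALTER TABLE public."Patient"
    ADD CONSTRAINT "Patient_Gender_check"
    CHECK ("Gender" IN ('Male', 'Female'));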

Design Physical Representation


The database can be distributed into multiple databases depending on the user
requirements and other needs. For example, in a supermarket, you will have your Human
Resource related data in one database and Sales data in another database. Further, Sales data
might be distributed among geographical locations. This distribution of databases has to be
done according to a thought process, not in an ad-hoc manner.


In addition to distributed databases, tables can be partitioned too. This has to be designed
in the physical data model and will be discussed in Chapter 12, Distributed Databases.

Indexes are key concepts in a database that allow users to access data efficiently. Though
indexes will improve user access queries such as SELECT queries, having a large number of
indexes will slow down INSERT queries. This means that we need to find the optimal
number of indexes for a system, balancing the INSERT and SELECT workloads. A detailed
discussion on indexes will take place in Chapter 8, Working with Indexes.
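A minimal sketch of one such index, assuming invoices are frequently looked up by patient, could be the following; the index name and indexed column are illustrative assumptions.

-- Speeds up SELECT queries that filter invoices by patient
CREATE INDEX "idx_InvoiceHeader_PatientID"
    ON public."InvoiceHeader" ("PatientID");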

Database growth is an important factor to be considered by database designers or database
administrators. There are different ways to calculate the database growth. One of them is
listed in the Further Reading section in this chapter.

Define User Views


If you recall, during the conceptual as well as the logical design stage, we had local models.
Basically, we captured the requirements from each individual user's perspective. Later, we
merged them into a single global model for easy implementation purposes. However, it
should be noted that users should still have their local perspective in the database. To
achieve the local perspective in the database, we can design views.

In the mentioned example, the Invoice is relevant for the Patient and the Cashier as well as
for the Doctor. Therefore, it is essential to create multiple views on the same invoice to cater
to the different requirements of different users or actors. These views can be extended to the
security model as well.
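As a hedged sketch, a doctor-oriented view over the invoice might expose only the columns that actor needs; the view name and the selected columns are assumptions, and access could then be granted on the view rather than on the base table.

-- A user view exposing only the invoice columns relevant to a doctor
CREATE VIEW public."DoctorInvoiceView" AS
SELECT "InvoiceID", "Date", "PrescriptionID", "PatientID", "DoctorID"
FROM public."InvoiceHeader";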

Define Security Mechanism


Security is an important aspect to consider during the physical database design. There are
multiple types of security segregation in the physical data model. Depending on the place
of implementation, there are two types of security:

1. System security
2. Data Security

System security covers the access and usage of the database at the system level. This covers
usernames, passwords, and so on. Data security covers the database objects, such as tables,
views, and procedures.

Depending on the mode of implementation, there are three types of security
implementation:

1. Authentication
2. Authorization
3. Encryption

Authentication is identifying the user, and Authorization is providing the necessary access
to the authenticated user. Encryption means modifying the data so that not everyone is able
to read it. We will have a detailed discussion on implementing a security model
in Chapter 11, Securing a Database.
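A minimal sketch of these ideas in PostgreSQL could look as follows; the role name, password, and granted objects are illustrative assumptions.

-- Authentication: a login role identifies the user at the system level
CREATE ROLE cashier_user LOGIN PASSWORD 'change_me';

-- Authorization: grant only the access the role needs on specific objects
GRANT SELECT, INSERT ON public."InvoiceHeader" TO cashier_user;
GRANT SELECT ON public."Patient" TO cashier_user;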

Summary
In this chapter, we looked at how data is modeled. We looked at how the database is
physically designed through stages such as the conceptual and logical data models.

At the conceptual model, the main entity types are identified along with their relationships.
Further, it is extremely important to validate the conceptual model against the user
requirements and data so that problems can be solved at an earlier stage.

At the Logical level, we defined the E-R diagram which was considered as the
communication tool between the technical and non-technical people. We identified two
issues that can occur at the E-R diagram which are Fan trap and Chasm Trap. Further, we
discussed it is important to identify the relation type between the entities. Like in the
Conceptual model, we discussed that it is necessary to build a logical model for local users
and then generate a global model. However, at multiple places, it is necessary to validate
this model so that no issues will occur at the production state. Normalization techniques
should be followed at this stage but will be discussed in the next chapter.

Finally, we discussed the implementation of the physical model. We identified that,
theoretically, we need to know the database technology we are using only at this stage, not
prior to it. In case we have multi-valued attributes, they have to be stored using a different
table. We discussed the need for data distribution depending on user needs at the physical
data model stage. In addition, the necessary indexes are to be applied in the physical data
model, and they should be applied with a clear thought process. Further, we identified that
the security model is applied at this stage with different techniques such as Authentication,
Authorization, and Encryption. The requirements were gathered by listening to individual
users; to align with this requirement gathering, we discussed that local design models
should be developed and later combined into a global model at the conceptual and logical
stages. However, in the end, the database should cater to individual user requirements, and
we discussed that, to facilitate this, we need to define user views at the physical model.

The semantic model is the model that introduces meaning to the database model so that
end-users or application developers can utilize the model much more easily. The physical
model is the model which will finally be implemented, in which you have the normalized
data models, data types, and so on.

Applying Normalization is an important concept in database design. We typically perform
Normalization at all levels: the Conceptual, Logical, and Physical data models. In the next
chapter, we will discuss the significance of performing normalization with suitable real-
world examples.

Exercise
Since you have done the exercise given in Chapter 2, Building Blocks Simplified. In that
exercise, it was asked to identify Entity Types and their attributes. In this, you can extend
that exercise to develop a:

Conceptual Data Model


Logical Data Model
E-R Model
Physical Data Model
How can you extend this model to a Semantic Data Model

Questions
How do you make sure that all the attributes are identified at the conceptual
database design stage?

There are a lot of attributes associated with entities and relationships. As a
database designer, it is necessary to identify the essential attributes. It is also
essential to provide an allowance to add attributes later. Since attributes can
be identified at different stages, you need to make sure all the relevant
documents are verified by going backward through them.

How do you select a Primary Key for an Entity Type?

First, you need to choose the candidate columns, out of which one should be
selected as the Primary key. To choose the Primary key, you need to choose
unique candidate columns. The Primary key should be definite, meaning that
it cannot be either NULL or empty. The Primary key should be very small in
size; you cannot have lengthy attributes as the Primary Key. The selected
Primary key should not change over time, and it should be accessible to the
users.

How do you avoid redundancy in data modeling? What are the differences
between the conceptual and semantic data model?

The semantic data model is adding meaning to the conceptual data model.

Before which modeling you need to choose the database technology for the
database?

You need to choose the database technology before the defining of the
physical model. At the conceptual model or at the Semantic model building,
you do not need to know what the technology is. The conceptual model and
semantic model are independent of the database technology whereas the
physical model is directly dependent on the database technology.

What is the purpose of integrating constraints and what are the main types of
constraints?

Constraints are used to maintain consistency in the database so that the data
is clean. Clean data will help to make correct timely decisions. Therefore, to
make correct decisions, it is essential to implement constraints. The main
types of constraints are UNIQUE, FOREIGN KEY, PRIMARY KEY, CHECK,
and NOT NULL (required).

Why is it not recommended to set the Parent-Child relationship to CASCADE?

When the CASCADE option is set, child records will be automatically
deleted as parent records are deleted. If there are large numbers of related
records, deleting a single parent record will cause large delays in the
database. Apart from the performance degradation, there can also be cases
where a record is deleted by mistake. Since all the child records get deleted,
recovering them would be a tedious task. Considering both of these
scenarios, it is recommended not to use the CASCADE option.
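As a hedged sketch, the parent-child link could be declared with ON DELETE RESTRICT so that deleting a referenced parent is rejected instead of cascading; the table and constraint names are illustrative and assume the Patient and InvoiceHeader tables discussed earlier.

-- A delete of a referenced Patient is rejected instead of cascading
ALTER TABLE public."InvoiceHeader"
    ADD CONSTRAINT "InvoiceHeader_Patient_fkey"
    FOREIGN KEY ("PatientID")
    REFERENCES public."Patient" ("PatientID")
    ON DELETE RESTRICT;
-- ON DELETE CASCADE, by contrast, would silently remove all of the
-- patient's invoices, which is the behaviour discussed above.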

What is the importance of drawing the E-R diagram for the business case?

The E-R diagram is mainly used as a communication tool between the
technical and non-technical people. Further, the E-R diagram will give a clear
pictorial view of the database model that will be built. For Database
designers, the E-R model will be helpful in defining the physical data model.

What are the issues that can occur when you try to fix the Chasm Trap and how
to fix them?

A Chasm Trap is fixed by introducing a direct relationship between Entities,
avoiding the optional entities. However, since there can then be multiple
relationships between those entities, there can be data conflicts. When the
optional entity is updated, it is essential to update the additional direct
relationship, and vice versa.

At which stage do you need to select the database technology?

Until the conceptual and logical model design stage, you are more worried
about the data model rather than the technology. Though in some practical
situations, during the conceptual and logical model design you might have
the database technology in mind. However, you need to decide on the
database technology at the physical data model designing phase only.

Why is it recommended to define user views at the physical data model?

Requirements were gathered from individual users. Later at the conceptual


design model as well as at the logical design model, we built local models
and later by amalgamating all these views, we created a global model.
However, users still have local requirements. Therefore, to facilitate these
local requirements, different user views should be created at the physical
data model.

Further Reading
Constraints: https://www.postgresql.org/docs/9.5/ddl-constraints.html
Database Growth Rate: https://www.silota.com/docs/recipes/sql-mom-growth-rate.html

Chapter 5: Applying Normalization
Until now, we have discussed the different stages of modeling used to design databases. In
this modeling, we need to identify a relevant set of relations in the database. The technique
used to identify these relations is called Normalization. In this chapter, we will discuss the
different levels of normalization in detail, with suitable examples. This chapter will also
help you understand at what stage you should stop normalization, and in what instances
you do not have to perform normalization.

The following topics will be covered in this chapter:

History of Normalization
Purpose of Normalization
Determine Functional Dependencies
Steps for Normalization
First Normal Form
Second Normal Form
Third Normal Form
Boyce-Codd Normal Form
Fourth Normal Form
Fifth Normal Form
Domain-Key Normal Form
De-Normalization Data Structures
Normalization Cheat Sheet

History and Purpose of Normalization


Normalization is a standard process of identifying relations based on primary keys or
candidate keys on those relations. Normalization was first introduced by E.F. Codd in 1972.
Codd identified three forms of normalization called First, Second and Third normalization
forms. In 1974, R. Boyce and E.F. Codd extended the third normalization form which is
referred to as Boyce-Codd Normal Form (BCNF). Later fourth (4NF) and fifth (5NF)
normalization forms were introduced. However, these two types of normalization forms
are rarely used due to the fact that most of the requirements can be achieved with the first
three normalization forms. We will be discussing these normalization forms in detail in this
chapter.

The main purpose of the database normalization is to have a proper relation between the
entity types. Database normalization will design the attributes into natural groups. By
performing database normalization, data redundancy, and operation anomalies are the
main issues which will be fixed.

Data redundancy is one of the main things that will be addressed by the Normalization
process.

Data Redundancy
The major outcome of the data normalization is reducing the data redundancy. Let us look
at data redundancy through an example.

Let us see EmployeeDepartment relationship in the following table:


Employee ID Employee Name Age Designation Department ID Department Name Department Location
1 Simmon Rusule 32 Senior Manager 1 Administration Building 1 / 2 nd Floor
2 Anthony Howard 34 HR Manager 2 Human Resources Main Building / 1 st Floor
3 Eli Thomas 23 Assistant Manager 1 Administration Building 1 / 2 nd Floor
4 Patrick Jones 45 Assistant HR Manager 2 Human Resources Main Building / 1 st Floor
5 John Young 56 Supervisor 1 Administration Building 1 / 2 nd Floor
If you closely analyzed the above table you would see that department data is repeated for
employees. For example, Employees who have employee IDs of 1, 3, and 5 are attached to
the Administration department. The department name and its location are the attributes
that are relevant to the department. However, those are also duplicated among the
employees. This means that this department data is redundant.

By performing database normalization, we can come up with the following structures to
avoid data redundancy:
Employee {EmployeeID, Name, Age, Designation}
Department {DepartmentID, Department Name, Location}


EmployeeDepartment {EmployeeID, DepartmentID}

By having relations as shown above, redundancy is avoided. Also, by removing the
redundancy you are saving space too. For example, now you do not need to replicate the
department name and its location for every employee. Just imagine if you have a large
number of records for employees and a large number of attributes for the department.
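A minimal sketch of these normalized structures as PostgreSQL tables could be the following; the column sizes are assumptions.

CREATE TABLE "Department"
(
    "DepartmentID"   serial PRIMARY KEY,
    "DepartmentName" character varying(50) NOT NULL,
    "Location"       character varying(50)
);

CREATE TABLE "Employee"
(
    "EmployeeID"  serial PRIMARY KEY,
    "Name"        character varying(100) NOT NULL,
    "Age"         integer,
    "Designation" character varying(50)
);

-- Each employee is linked to a department only by its ID, so department
-- details are stored exactly once
CREATE TABLE "EmployeeDepartment"
(
    "EmployeeID"   integer NOT NULL REFERENCES "Employee" ("EmployeeID"),
    "DepartmentID" integer NOT NULL REFERENCES "Department" ("DepartmentID"),
    PRIMARY KEY ("EmployeeID", "DepartmentID")
);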

Operation Anomalies are another aspect that will be addressed by the process of
Normalization.

Operation Anomalies
In databases, there are three main operations, which we call Data Manipulation Language
(DML) operations. These operations are Insert, Update, and Delete. By introducing database
normalization, you can avoid Insertion, Modification, and Deletion anomalies.

Let us look at these three anomalies with examples.

Insertion Anomalies
Insertion Anomalies can occur in multiple ways. Let us look at the previous
EmployeeDepartment example that we discussed.

In the above example, an insert can occur in multiple ways: there can be a new employee as
well as a new department. They are discussed below:

Insert to Employee: When you want to insert a new employee into the
EmployeeDepartment relation, you need to insert the details of the new
employee along with all the details of the employee's department. For
example, if there is a new employee who is attached to the HR department,
along with the new employee data, you need to insert the HR department
data. This means that you need to ensure that all the correct department data
is inserted; if not, there will be a chance of data anomalies. With database
normalization, these types of data anomalies will not occur, as you are not
required to repeat all the attributes of the relevant department. Instead, you
only need to insert a record into the Employee relation and establish the
relation to the department by including the new employee ID and the
relevant department ID in the EmployeeDepartment relation.
Insert a New Department: When a new department is inserted, there can be
cases where there are no employees attached to it. For example, we want to
insert a new department called the IT department. If you do not have any
employees, how can you maintain these types of records in the
EmployeeDepartment relation? The only option is to insert the department
data into the EmployeeDepartment relation while keeping the employee
attributes empty. When the first employee of that department is entered, it is
important to make sure that the empty records are updated. This makes the
model more complex, and there can be a lot of anomalies. In the proposed
normalization model, two relations, Employee and Department, are
introduced. Those two relations are related by a DepartmentID. When a new
department needs to be stored, it is a matter of inserting a record into the
Department relation. Since Employee is a different relation, no records need
to be updated in the Employee relation. When the first employee of the new
department needs to be inserted, you need to update the Employee relation
with the relevant DepartmentID. This reduces the complexity of having both
employee and department in one relation.

Modification Anomalies
Modification Anomalies will occur when there are data updates. Similar to the Insert
Anomalies, there are two scenarios where Modification Anomalies can occur, such as:

Employee Modification: If the department of an employee needs to be modified,


all the department relevant attributes have to be modified. For example, let us
assume that the Employee ID 2 is moved from the HR department to the Admin
department, not only the department ID but also all the other parameters such as
Department name and location has to be changed which can cause data
anomalies. In the case of the database normalization model, it is only a matter of
updating the relevant department id of the EmployeeDepartment relation.
Department Modification: As we observed in the EmployeeDepartment relation,
department attributes are repeated among their employees. If a department
attribute is changed, for example, if the location of the Admin department is
changed, you need to change all records of employees who are attached to this
department. In the database normalization model, it is a matter of updating the
relevant record in the Department relation as Department is a separate relation.
Since EmployeeDepartment relation is referred by only the department ID there
is no need to modify any records in the EmployeeDepartment relation.

Deletion Anomalies
Deletion Anolamies also can occur in three ways. They are as discussed below:

Employee Delete: If you are deleting an Employee, it would not normally be
a major worry. However, there can be instances where only one employee
exists for a department. If you are deleting the last employee of a department,
you will lose the department details, since the department data is stored
within the Employee relation and there is no separate Department relation.
If so, on every delete, you need to verify whether the employee being deleted
is the last employee of the department. If it is the last employee of the
department, the department details should be kept by making the employee
data empty. Also, there can be aggregate data for department details.
To overcome this, one solution would be to store the number of employees in the
department within the employee relation. As shown in the below relation, an attribute
called Number Of Employees is included:

Employee ID Employee Name Age Designation Department ID Department Name Department Location Number Of Employees
1 Simmon Rusule 32 Senior Manager 1 Administration Building 1 / 2nd Floor 3
2 Anthony Howard 34 HR Manager 2 Human Resources Main Building / 1st Floor 2
3 Eli Thomas 23 Assistant Manager 1 Administration Building 1 / 2nd Floor 3
4 Patrick Jones 45 Assistant HR Manager 2 Human Resources Main Building / 1st Floor 2
5 John Young 56 Supervisor 1 Administration Building 1 / 2nd Floor 3
If you are deleting an employee record, you need to update the Number Of Employees
attribute. If you are deleting Employee ID 2, the Number Of Employees for the HR
department becomes 1, which has to be updated in every row of that department in the
EmployeeDepartment relation above. This is another anomaly that can occur during
deletion. In the case of the normalization model, this will be one attribute in the Department
relation, as shown below:
Department {DepartmentID, Department Name, Location, NumberofEmployees}

Department Deletion: If you are deleting a department, you need to empty all
the relevant employees' details before deleting the department. In the case of
the normalized data model, you only have to empty the DepartmentID
attribute in the EmployeeDepartment relation.

Determining functional dependencies between the attributes is an important task for
database normalization. Let us discuss why it is important to determine functional
dependencies and how it can be done.

Determining Functional Dependencies


Determining functional dependencies between the relation attributes is a key process to
determine the database normalization.


If x and y are attributes of relation R and each value of x is associated with
exactly one value of y, attribute y is functionally dependent on attribute x.
This is denoted x -> y.

Let us assume Employee ID and Designation attributes and the functional relation as
shown below:

Every employee ID has a designation and one designation can be related to multiple
Employee IDs. When identifying the functional dependencies, it is important to identify all
possible values.

The relation between the Employee ID to Designation is one-to-one as one employee will
have only one designation at a time as shown below:

However, the Designation to Employee ID relationship is one-to-many, as one designation
can be assigned to many employees, as shown below:

By identifying the functional dependencies, the integrity constraints in a relation can be
identified. The Primary Key is an important constraint in a relation.

The normalization process is achieved in multiple steps. We will look at the steps of the
Normalization process in the following section.

Steps for Normalization


Normalization is a technique for analyzing relations based on their primary keys and
functional dependencies, as specified by Codd. Database normalization is not a one-step
process; instead, it involves multiple steps, as shown in the below screenshot:

As you go further into the process of database normalization, the relations become stronger
than before. It is important to note that the first normal form
(1NF) is a must and all the other normalization forms are optional. However, to avoid
operational anomalies which were discussed before in this chapter, it is essential to at least
proceed till the third normal form (3NF).

Let us look at different normalization forms in detail in the following sections.

First Normal Form


For any data set, it is essential to perform at least 1NF, as 1NF is a compulsory
normalization form on any data set, while the other normalization processes are optional. In
1NF, repeating attributes are identified and moved to a different relation. To relate the
moved relation, a relationship is placed between the existing relation and the new relation.

Let us look at an example of students and the courses that they have enrolled in. This is the
StudentCourse relation:
Student Number Student Name Gender Age Course Number Course Name Course Credit Course Start Date Course End Date
1 Andy Oliver Male 18 C001 Introduction to Databases 2 2020/01/01 2020/05/31
2 Harry Steward Male 19 C001 Introduction to Databases 2 2020/01/01 2020/05/31
3 Alex Robert Male 18 C004 Advanced Database Management 3 2020/06/01 2020/09/30
4 Rose Elliot Female 21 C001 Introduction to Databases 2 2020/01/01 2020/05/31
5 Michel Jacob Male 20 C004 Advanced Database Management 3 2020/06/01 2020/09/30
6 Emily Jones Female 21 C004 Advanced Database Management 3 2020/06/01 2020/09/30

In the above relation, it was assumed that one student is enrolled in one
course. This is done in order to simplify the demonstration of the First
level of normalization form.

Following are the attributes of the above-mentioned unnormalized StudentCourse relation:

StudentCourse {Student Number, Student Name, Gender, Age, Course Number, Course Name, Course Credits, Course Start Date, Course End Date}

The above relation will be created in PostgreSQL as follows:

The above design is done by using pgModeler. You can download the
demo version from https://pgmodeler.io/. This is an open source tool
that is compatible with Windows, macOS, and Linux. More details of the
tool can be viewed at https://pgmodeler.io/support/docs


It is very clear that Student No is the Primary Key in this relation.
If you closely analyze it, you will be able to identify the repeated attributes:

Repeated Attributes {Course Number, Course Name, Course Credits, Course Start Date, Course End Date}

This means the repeated attributes can be separated as follows:

StudentCourse {Student No, Student Name, Gender, Age, Course No}
Course {Course No, Course Name, Course Credits, Course Start Date, Course End Date}

Let us see this 1NF database design in the PostgreSQL databases:

In the above model, the StudentCourse relation is created with the Primary Key of Student
No, along with the other student data. For the course information, the Course relation is
created with Course No as the Primary Key. These two relations are related by the
Course No attribute.
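As a hedged SQL sketch of this 1NF design, assuming simple data types, the two relations could be created as follows.

CREATE TABLE "Course"
(
    "CourseNumber"    character varying(10) PRIMARY KEY,
    "CourseName"      character varying(100) NOT NULL,
    "CourseCredits"   integer NOT NULL,
    "CourseStartDate" date,
    "CourseEndDate"   date
);

-- Each student row carries a single course reference (the 1NF assumption)
CREATE TABLE "StudentCourse"
(
    "StudentNumber" integer PRIMARY KEY,
    "StudentName"   character varying(100) NOT NULL,
    "Gender"        character varying(10),
    "Age"           integer,
    "CourseNumber"  character varying(10) REFERENCES "Course" ("CourseNumber")
);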

The next level of Normalization is the Second Normal Form of 2NF.

Second Normal Form


Second Normal Form (2NF) is applied to the relations which have primary keys that are
composed of two or more attributes.

A 1NF relation with a single attribute primary key is considered to be in


2NF. This means that there is no need to do any other operations for the
1NF relation which already has a single attribute primary key.

Let us look at the same example which was discussed in the 1NF section, without the
assumption that a student can be enrolled in only one course. This means that in the
following example, a student can be enrolled in multiple (one or more) courses:
Student Number Student Name Gender Age Course Number Course Name Course Credit Course Start Date Course End Date
1 Andy Oliver Male 18 C001 Introduction to Databases 2 2020/01/01 2020/05/31
2 Harry Steward Male 19 C001 Introduction to Databases 2 2020/01/01 2020/05/31
2 Harry Steward Male 19 C004 Advanced Database Management 3 2020/05/01 2020/07/01
3 Alex Robert Male 18 C004 Advanced Database Management 3 2020/06/01 2020/09/30
4 Rose Elliot Female 21 C001 Introduction to Databases 2 2020/01/01 2020/05/31
4 Rose Elliot Female 21 C004 Advanced Database Management 3 2020/06/15 2020/09/15
5 Michel Jacob Male 20 C004 Advanced Database Management 3 2020/06/01 2020/09/30
6 Emily Jones Female 21 C004 Advanced Database Management 3 2020/06/01 2020/09/30
From the table above, we see that Student Number and Course Number together form the
composite Primary Key:

{Student Number, Course Number} -> {Course Start Date, Course End Date} (Primary Key)
{Student Number} -> {Student Name, Gender, Age} (Partial Dependency)
{Course Number} -> {Course Name, Course Credit} (Partial Dependency)

These will be converted into three relations.


Student {Student Number, Student Name, Gender, Age}
Course {Course Number, Course Name, Course Credit}
Enrollment {Student Number, Course Number, Course Start Date, Course End
Date}

Let us see these three relations in an ER diagram.


The above screenshot shows the relationships between the Student, Course, and Enrollment
relations, along with the possible data types.
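A minimal SQL sketch of the 2NF design could be the following; it assumes a Course relation keyed by Course Number (as in the 1NF sketch, with the start and end dates now held in Enrollment), and the composite primary key of Enrollment captures the {Student Number, Course Number} dependency.

CREATE TABLE "Student"
(
    "StudentNumber" integer PRIMARY KEY,
    "StudentName"   character varying(100) NOT NULL,
    "Gender"        character varying(10),
    "Age"           integer
);

-- Enrollment holds the attributes that depend on the full composite key
CREATE TABLE "Enrollment"
(
    "StudentNumber"   integer NOT NULL REFERENCES "Student" ("StudentNumber"),
    "CourseNumber"    character varying(10) NOT NULL REFERENCES "Course" ("CourseNumber"),
    "CourseStartDate" date,
    "CourseEndDate"   date,
    PRIMARY KEY ("StudentNumber", "CourseNumber")
);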

Let us look at the next level of Normalization, which is the third normal form (3NF).

Third Normal Form


In the Third Normal Form (3NF), all attributes should be determined by only the key. Let
us expand the above example with more data and more attributes. Let us introduce a
lecturer who will be delivering each course. For this, the Lecturer Number and Lecturer
Name will be introduced, as shown below:
Student Number Student Name Gender Age Course Number Course Name Course Credit Lecturer Number Lecturer Name Course Start Date Course End Date
1 Andy Oliver Male 18 C001 Introduction to Databases 2 L001 Rose Taylor 2020/01/01 2020/05/31
2 Harry Steward Male 19 C001 Introduction to Databases 2 L001 Rose Taylor 2020/01/01 2020/05/31
2 Harry Steward Male 19 C004 Advanced Database Management 3 L002 Ryan Thoms 2020/05/01 2020/07/01
3 Alex Robert Male 18 C004 Advanced Database Management 3 L002 Ryan Thoms 2020/06/01 2020/09/30
4 Rose Elliot Female 21 C001 Introduction to Databases 2 L001 Rose Taylor 2020/01/01 2020/05/31
4 Rose Elliot Female 21 C004 Advanced Database Management 3 L002 Ryan Thoms 2020/06/15 2020/09/15
5 Michel Jacob Male 20 C004 Advanced Database Management 3 L002 Ryan Thoms 2020/06/01 2020/09/30
6 Emily Jones Female 21 C004 Advanced Database Management 3 L002 Ryan Thoms 2020/06/01 2020/09/30
7 James Dixon Male 24 C004 Advanced Database Management 3 L002 Ryan Thoms 2020/07/01 2020/12/31
7 James Dixon Male 24 C005 Database Administration 4 L002 Ryan Thoms 2020/09/30 2021/01/30
Transitive dependency is an important concept that will be discussed under the third
normal form.

Transitive Dependency
Let us consider three attributes X, Y and Z in relation R. These three attributes are related in
such a way that X-> Y and Y-> Z. This means that Z is transitively dependent on X through
Y.

Let us consider the above example, with Transitive Dependency:


Student Number -> Course Number
Course Number -> Lecturer Number

This means that transitive dependency exists, Student Number -> Lecturer Number
through the attribute Course Number.

Let us see how the third level of normalization forms are applied.

Applying 3NF
With the additional attributes, Lecturer Number and Lecturer Name, the Course relation
will be impacted as follows:

{Course Number} -> {Course Name, Course Credit, Lecturer Number, Lecturer Name} (Partial Dependency)

When normalization is carried out until 2NF, the following relations will be met. The
highlighted attributes are the newly introduced attributes:
Student {Student Number, Student Name, Gender, Age}
Course {Course Number, Course Name, Course Credit, Lecturer Number,
Lecturer Name}
Enrollment {Student Number, Course Number, Course Start Date, Course End
Date}

Let us examine the data for the Course relation:

Course Number Course Name Course Credit Lecturer Number Lecturer Name
C001 Introduction to Databases 2 L001 Rose Taylor
C004 Advanced Database Management 3 L002 Ryan Thoms
C005 Database Administration 4 L002 Ryan Thoms
When the third normal form (3NF) is introduced, the course lecturer is moved to another
relation, and the following are the finalized relations:
Student {Student Number, Student Name, Gender, Age}
Course {Course Number, Course Name, Course Credit, Lecturer Number}
Enrollment {Student Number, Course Number, Course Start Date, Course End
Date}
Lecturer {Lecturer Number, Lecturer Name}

The following table consists of data for the Course and Lecturer relations respectively:

Course Number Course Name Course Credit Lecturer Number
C001 Introduction to Databases 2 L001
C004 Advanced Database Management 3 L002
C005 Database Administration 4 L002
The following is the sample data set for the Lecturer relation:

Lecturer Number Lecturer Name
L001 Rose Taylor
L002 Ryan Thoms
The above relations are implemented in PostgreSQL. We can see them in the following
screenshot:


Normalizing from First Normal Form to Third Normal Form is the minimum for database
design. The next levels of normalization form are optional. Let's discuss the Boyce-Codd
Normal Form (BCNF).

Boyce-Codd Normal Form


Though you can stop at 3NF, there are situations where you can extend the data modeling
to the next level of normalization, BCNF. In BCNF, further functional dependencies are
identified. As with the other normalization forms, to perform BCNF, the database has to be
at least in the third normal form. This means that a BCNF data model is in 3NF, but a 3NF
data model is not necessarily in BCNF.

In some literature, the Boyce-Codd Normal Form is referenced as 3.5NF,


as BCNF is a normalization between 3NF and 4NF.

For a database in 3NF, BCNF can be applied if you have a super key dependency in the
data. A super key dependency means there is a dependency such as X -> Y where X is a
non-prime attribute, Y is a prime attribute, and X acts as a super key.

Let us look at this through an example and extend the previous student and lecturer
example. As we know, students need to meet lecturers to discuss their study matters. Since
lecturers are dealing with many students, you need to schedule an appointment. The
appointment details can be defined as follows. Let us name this relation
StudentMeetingSchdule:

Student Number Meeting Date Meeting Time Lecturer Number Room Number
1 2010-Jan-02 11:00 L001 R001
2 2010-Jan-02 13:00 L001 R001
3 2010-Jan-02 13:00 L002 R002
2 2010-Jan-05 11:00 L001 R002
4 2010-Jan-05 11:00 L001 R001
4 2010-Jan-05 13:00 L002 R002
Following is the StudentMeetingSchedule relation along with its attributes shown below.
StudentMeetingSchedule {Student Number, Meeting Date, Meeting Time,
Lecturer Number, Room Number}

Let us identify the candidate keys for the above example.

Identifying Candidate Key


There can be three candidate keys, such as {Student Number, Meeting Date, Meeting
Time}, {Lecturer Number, Meeting Date, Meeting Time} and {Room Number, Meeting Date,
Meeting Time}

A Student can have one meeting at a given date and time, therefore, {Student Number,
Meeting Date, Meeting Time} is a candidate key. A Lecturer can have one meeting at a
given date and time, therefore, {Lecturer Number, Meeting Date, Meeting Time} is a
candidate key. On the other hand, a room can have one meeting at a given date and time;
therefore, {Room Number, Meeting Date, Meeting Time} is a candidate key.

Out of these three candidates, {Student Number, Meeting Date, Meeting Time} is chosen as
the primary key for the StudentMeetingSchdule.

The StudentMeetingSchdule has the following relation,


StudentMeetingSchdule {Student Number, Meeting Date, Meeting Time, Lecturer
Number, Room Number}

Let us identify the functional dependencies in the StudentMeetingSchdule relation.


Identifying Functional Dependencies


We discussed Functional dependency in the section Determining Functional Dependencies.
The following are the identified functional dependencies for the
StudentMeetingSchdule relation:
FD1: {Student Number, Meeting Date, Meeting Time} -> {Lecturer Number, Room Number} (Primary Key)
FD2: {Lecturer Number, Meeting Date, Meeting Time} -> {Student Number} (Candidate Key)
FD3: {Room Number, Meeting Date, Meeting Time} -> {Student Number, Lecturer Number} (Candidate Key)
Super Key: {Lecturer Number, Meeting Date} -> {Room Number}

FD1, FD2, and FD3 satisfy BCNF, but the super key relationship does not. To achieve
BCNF, the LecturerRoom relation was introduced, and the following are the modified relations:
StudentMeetingSchdule {Student Number, Meeting Date, Meeting Time, Lecturer
Number}
LecturerRoom {Lecturer Number, Meeting Date, Meeting Time, Room Number}

Following is the modified data set and the following is the StudentMeetingSchdule relation:

Student Number Meeting Date Meeting Time Lecturer Number


1 2010-Jan-02 11:00 L001
2 2010-Jan-02 13:00 L001
3 2010-Jan-02 13:00 L002
2 2010-Jan-05 11:00 L001
4 2010-Jan-05 11:00 L001
4 2010-Jan-05 13:00 L002
Following is the LecturerRoom relation:

Lecturer Number Meeting Date Meeting Time Room Number


L001 2010-Jan-02 11:00 R001
L001 2010-Jan-02 13:00 R001
L002 2010-Jan-02 13:00 R002
L001 2010-Jan-05 11:00 R002
L001 2010-Jan-05 11:00 R001
L002 2010-Jan-05 13:00 R002
Let us see how these models are implemented in PostgreSQL, as shown in the following
screenshot:


If you look at the above tables, it is essential to note that these two tables could be joined
back into one table. So the decision to stop the database modeling at 3NF or progress to the
next level of normalization, BCNF, depends on the significance of the super key relationship.

Next, we will discuss the Fourth Normal Form (4NF)

Fourth Normal Form


In 2NF, we removed partial dependencies, and in 3NF we removed transitive dependencies.
In BCNF, any remaining anomalies from functional dependencies that were not resolved by
3NF are removed. In the Fourth Normal Form (4NF), another dependency, called a
multi-valued dependency (MVD), is removed. 4NF was introduced in 1977 by Ronald Fagin.

Let us see what is multi-valued dependency.

Multi-Valued Dependency
To have a multi-valued dependency (MVD) in a relation, that relation should have at least
three attributes, which means that an MVD is not possible for a relation with two attributes.
There are two more conditions to satisfy an MVD. Let us assume that there are three
attributes X, Y, and Z, where X is the primary key of a relation R. For a single X value there
will be multiple values of Y, and the Y and Z attributes are independent of each other.

This is denoted as X ->> Y and X ->> Z.

Let us examine this with an example of StudentHobbySport relation:


StudentHobbySport {Student Number, Hobby, Sport}

Following is the sample data set for StudentHobbySport relation.

Student Number Hobby Sport


1 Watching TV Cricket
1 Riding Cycle Football
2 Watching TV Carrom
2 Movies Cricket
2 Playing Cricket
3 Growing Flowers Netball
3 Watching TV Chess

In the above table, for one record the Sport attribute is empty. This is because
Student Number 2 has three hobbies but only two sports, so the hobby
Playing Cricket has no corresponding sport. In this design, if the numbers of
values are not the same, there can be empty values.

In this relation following multi-valued dependencies are identified.


Student Number ->> Hobby
Student Number ->> Sport

Let us apply 4NF to the above data sets.

Applying 4NF
To apply 4NF to a relation, the relation should satisfy BCNF. Apart from that requirement,
the relation should have multi-valued dependencies. If both of these requirements are met,
then 4NF can be applied.

Let us apply the 4NF to the above relation StudentHobbySport by introducing two relations
as follows:

StudentHobby {Student Number, Hobby}

StudentSport {Student Number, Sport}


The following table shows the dataset for StudentHobby:

Student Number Hobby

1 Watching TV

1 Riding Cycle

2 Watching TV

2 Movies

2 Playing Cricket

3 Growing Flowers

3 Watching TV
The following table shows the dataset for StudentSport:

Student Number Sport


1 Cricket
1 Football
2 Carrom
2 Cricket
3 Netball
3 Chess
Though it is recommended to apply 4NF, there are exceptions where you can choose not to
implement it, as we will see in the following section.

Exceptions for 4NF


As you know, in organizations, there are tiny relations that will have only a few records.
For example, a Payment Type relation will have Cash and Credit. An Invoice Type relation
has In-Bound and Out-Bound records. A Commission Indicator relation has Commissionable
and Non-Commissionable. An Employment Type relation has Permanent and Contract.
These relations do not have relationships between them. These types of tiny relations are
unmanageable, as there can be a large number of them.

To avoid a large number of tiny relations, common relations can be built with the
combinations of all the values. The following table shows the relation after combining all
the tiny relations:


ID Payment Type Invoice Type Commission Indicator Employment Type


1 Cash In-Bound Commission Permanent
2 Cash In-Bound Commission Contract
3 Cash In-Bound Non-Commission Permanent
4 Cash In-Bound Non-Commission Contract
5 Cash Out-Bound Commission Permanent
6 Cash Out-Bound Commission Contract
7 Cash Out-Bound Non-Commission Permanent
8 Cash Out-Bound Non-Commission Contract
9 Credit In-Bound Commission Permanent
10 Credit In-Bound Commission Contract
11 Credit In-Bound Non-Commission Permanent
12 Credit In-Bound Non-Commission Contract
13 Credit Out-Bound Commission Permanent
14 Credit Out-Bound Commission Contract
15 Credit Out-Bound Non-Commission Permanent
16 Credit Out-Bound Non-Commission Contract

In the data warehouse, Dimensions are built by combining multiple


dimensions. These are called JUNK dimensions.

The next level of normalization form is the Fifth Normal Form (5NF).

Fifth Normal Form


After applying 1NF, 2NF, 3NF, BCNF, and 4NF, the next step is to apply the Fifth Normal
Form (5NF) to the data. 5NF, or Project-Join Normal Form (PJNF), was introduced by
Ronald Fagin in 1979.

To apply 5NF, the data should have achieved 4NF and a JOIN dependency should exist.
Next, let us see what a JOIN dependency is.

JOIN Dependency
If a relation can be regenerated by joining multiple relations, and each of these relations has
a subset of the attributes of the original relation, then the relation has a Join Dependency. It
is a generalization of the Multi-Valued Dependency, which was described in 4NF.

If relation R has attributes of X, Y, Z, then this R relation can be divided into relations, R1(X,
Y), R2 (X, Z) and R3(Y, Z).

Let us see how 5NF can be implemented with an example.

Applying 5NF
Let us look at the following scenario.
If Student (Andy) enrolled in the Subject ( Java)
Subject (Java) is conducted by Lecturer (Smith)
Student (Andy) learning from Lecturer (Smith)
Then Student (Andy) Enrolled in Subject (Java) conducted by Lecturer
(Smith).

Let us look at StudentCourseLecturer relation as shown below:

Student Course Lecturer


Andy Java Smith
Andy C# Joel
Simon Java Joel
StudentCourseLecturer will be divided into StudentCourse, CourseLecturer,
and StudentLecturer.

The StudentCourse relation is as follows:

Student Course
Andy Java
Andy C#
Simon Java
The CourseLecturer relation is as follows:

Course Lecturer
Java Smith
C# Joel
Java Joel
The StudentLecturer relation is as follows.

Student Lecturer
Andy Smith
Andy Joel
Simon Joel
In 5NF, we have eliminated the JOIN dependency by dividing the relation into multiple
relations.

Now let us discuss the exceptions to 5NF below.

Exceptions for 5NF


Similar to the discussion we had in the section Exceptions for 4NF, in data warehouses or
analytical systems we tend to prefer a single relation in order to reduce table joins and
improve performance. Therefore, in analytical systems, or systems which prefer high data
reads, we tend to ignore 5NF.

Domain-Key Normal Form


The basic concept of the Domain-Key Normal Form (DKNF) is to specify the normal form
that takes into consideration all the possible dependencies and constraints. This
normalization form is also referred to as the Sixth Normal Form; however, it is mostly
referred to as the Domain-Key Normal Form.

Let us look at this by means of an example. Let us see StudentCourseGrading relation


below:

Student ID Course ID Marks Grade


1 C0001 65 B
1 C0002 72 A
2 C0001 35 S
3 C0002 50 C
In this domain, there are defined ranges for each grade. An additional relation called
RankRange can be included, and the Grade attribute is removed from StudentCourseGrading.

Following is the StudentCourseGrading relation.

Student ID Course ID Marks


1 C0001 65
1 C0002 72
2 C0001 35
3 C0002 50
The RankRange relation is introduced as follows:


Grade Starting Marks Ending Marks


A 71 100
B 60 70
C 50 59
S 31 49
F 0 30
Also, there can be multiple range relations. If the above relation is a major range relation,
there can be a minor range relation, as shown below:

Grade Starting Marks Ending Marks


A+ 90 100
A- 71 89
B+ 65 70
B- 60 64
C+ 56 59
C- 50 55
S 31 49
F 0 30

When this type of range relation is introduced, equality joins are not
possible; instead, inequality joins should be used. Though inequality joins
will have a negative impact on the performance of queries, range relations
are normally extremely small relations. This means that performance is not a
major consideration for small range relations. In data warehousing and data
analytics systems, a similar type of Range Dimension is used for a detailed
analysis of data.
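A minimal sketch of such an inequality (range) join, assuming the relations above exist as tables with the listed columns, could be the following.

-- Derive each student's grade by joining marks to the matching range
SELECT s."StudentID", s."CourseID", s."Marks", r."Grade"
FROM "StudentCourseGrading" s
JOIN "RankRange" r
  ON s."Marks" BETWEEN r."StartingMarks" AND r."EndingMarks";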

Though there are a lot of cases for the normalization of data, there are cases where
de-normalized structures are still preferred. We will learn about this in the next section.

De-Normalization of Data Structures


As discussed in the section The Purpose of Normalization, normalization is done to remove
data redundancy and to avoid the operation anomalies of Insert, Update, and Delete.
However, there are systems which have fewer Data Manipulation Language (DML)
operations. For example, in OLAP or analytics systems, the predominant operation is
reading data, or data retrieval. These systems have fewer writes and a lot of reads.


The data warehouse system can be considered an analytical system.
Most data warehouse systems will have a daily or twice-daily ETL
(Extract-Transform-Load) process. This means that DML operations on data
warehouse systems are very few. However, mostly ad-hoc as well as
pre-defined queries are executed against data warehouse systems. This
means that data warehouse systems have fewer writes and a large number
of reads. On top of these reads, the data warehouse will deliver a large
volume of records.

Data normalization is great for OLTP systems, where transactions are an equally well-spread
mix of read and write operations. However, data analytics systems have fewer writes,
which means that there is no question of operation anomalies.

However, what are the issues of data normalization as far as analytics systems are
concerned? If data is normalized, then to retrieve data, end-users have to join multiple
tables or relations.

There are two issues with this in analytical systems, namely:

Practical Usage: Most of the analytical systems are self-service which will be
used by business users, not technical users. Therefore, it is essential to keep the
model simple. If the data model is normalized, then the business user has to join
tables. This may not be a happy option for the business user. He would love to
see de-normalized data in a single entity so that he can pick and choose the
necessary attributes when necessary for his analysis.
Performance: If there are many tables, which is unavoidable in data
normalization, you need to join multiple tables, when the need arises. However,
joining tables will impact performance negatively when it comes to data
retrieval. This means that data normalization will have a negative performance
impact on a large number of data retrieving systems.

The above two reasons emphasize that data normalization is not something that you can
apply blindly. When it comes to data retrieval systems, it is better to keep the data in a
denormalized form. It also needs to be noted that there are exceptions to 4NF and 5NF,
which were discussed in Exceptions for 4NF and Exceptions for 5NF.

Since there are a lot of conceptual points in Normalization, it is better to maintain a
cheat sheet for data normalization.

Normalization Cheat Sheet



A lot of concepts were discussed in this chapter. Therefore, it is useful to have a cheat sheet that is easy to refer back to.

Following is the cheat sheet for all the dependency types:

Dependency Type            Relationship
Functional Dependency      Prime Attribute -> Non-Prime Attribute
Partial Dependency         Part of Prime Attribute -> Non-Prime Attribute
Transitive Dependency      Non-Prime Attribute -> Non-Prime Attribute
Multi-Valued Dependency    Primary Attribute ->> Non-Prime Attribute1, Primary Attribute ->> Non-Prime Attribute2
JOIN Dependency            Attribute1 -> Attribute2, Attribute2 -> Attribute3, Attribute3 -> Attribute1
This cheat sheet can be used as the guide for the normalization process.

Summary
In this chapter, we looked at an important concept in database modeling. We looked at the history of normalization first. Then we defined normalization as the methodical and formal process used to identify relations in the data model based on their keys and different dependencies. Normalization solves two issues in data models: repeated data and modification anomalies.

Then we discussed the several levels of normalization that can be applied to a data model. In the first normalization form, each row is identified by means of a single attribute, which we called the primary key. The next level is the second normalization form, in which repeated values are removed. In the third normalization form, transitive dependencies are removed. As database designers, we determined that at least the third normalization form should be achieved. In BCNF, or the 3.5 normalization form, every determinant must be a super key. After the third level of normalization, we discussed the fourth normalization form, in which multi-valued dependencies are removed. In the fifth normalization form, JOIN dependencies are removed. Including domain-specific range relations is what is achieved in the sixth normalization form.

We understood that though normalization is an important process in data modeling, there are instances where you purposefully keep data in a de-normalized format. Mainly in data warehousing and analytics systems, de-normalized structures are maintained in order to improve read performance, considering the fact that data warehouses and data analytics systems have many reads and very few writes.

In the next chapter, we will discuss how to implement the designed data models in the
physical table structures.

Questions
Describe the requirements of database normalization.

The main purpose of database normalization is to establish proper relations between the entity types. Database normalization organizes the attributes into natural groups. By performing database normalization, data redundancy and operation anomalies, the main issues, are fixed.

What are the anomalies which will be resolved by database normalization?

The normalization process solves three operation anomalies, namely insert anomalies, modification anomalies, and deletion anomalies. For more examples, refer to the Operation Anomalies section.

What is the minimal normalization form that a relation must satisfy?

The third normalization form is the minimal normalization form that should
be applied to a relation.

What are the advantages and disadvantages of Normalization?

Normalization helps to identify the relations. Also, it helps to resolve the anomalies that we discussed before. The major disadvantage of the normalization process is that it creates a large number of relations, which can be difficult to maintain. When a large number of relations are present, you have to join those relations whenever you need to get any information, which can lead to performance issues.

What are the differences between BCNF and 3NF?

In the third normalization form, transitive dependencies are removed, whereas in the Boyce–Codd normal form every determinant must be a super key.

What are the situations in which normalization is not required in databases? Why is it not required?

You can avoid the normalization process for relations that have few or no data writes. Relations that do not have many rows are another place where you do not need the normalization process.

Why are de-normalized structures preferred in data warehousing?

Data warehousing is an analytical tool for organizations. Analytical means that it has many reads and very few writes. When table structures are normalized, you need to join a large number of tables in order to read data. A large number of joins means high CPU consumption, which reduces the performance of the data warehouse. Therefore, in data warehousing, de-normalized data models are preferred over normalized data models.

What are the situations in which you can avoid 4NF?

When applying it would create a large number of tiny relations, you can avoid the fourth normalization form. Otherwise, you will end up with unmanageable relations, which will lead to security and maintenance issues.

How are JUNK dimensions implemented in the data warehouse?

JUNK dimensions are created by cross joining tiny dimensions. This creates every combination of the instances in all of the small dimensions. However, some combinations may never occur; you can delete these combinations to maintain data quality in the database. Keeping those combinations, however, will not introduce any performance issues.

What are the scenarios in which you can ignore the implementation of 5NF?

We can ignore the fifth normalization form in analytical systems, data warehousing systems, and reporting systems.

Exercise
For the data model defined in Chapter 4, Representation Models, apply normalization to the highest possible normalization level. State the challenges that you encountered during the normalization process.


You may have to insert sample data into the model so that you can visualize the customer requirement.

Further Reading
Microsoft, https://support.microsoft.com/en-gb/help/283878/description-of-the-database-normalization-basics
Introduction to normalization, https://web.archive.org/web/20071104115206/http://www.utexas.edu/its-archive/windows/database/datamodeling/rm/rm7.html

Chapter 6: Table Structures
In chapters 2-5, we discussed how to identify the relevant entities and their attributes in data models. We also had a detailed discussion of the different types of database models used to design databases, and we extensively discussed the different normalization forms with relevant examples; hence, the database entities, their attributes, and their relationships have been identified scientifically.

In this chapter, it is time to define these entities physically in the chosen database technology. Up to now, we have discussed the conceptual aspects of database design and did not care about the database technology that will be used. Now we will move on to the physical design of databases. Therefore, we need to choose a database technology, and since this book covers the PostgreSQL database, we will be looking at how this physical database modeling can be done in PostgreSQL. Apart from the database technology, we need to understand the available data types so that we know their strengths and limitations. Also, there are different types of relations or tables whose design differs: there are Master tables, Transaction tables, Reporting tables, and Audit tables. In this important chapter, we will discuss the options for Master tables, look at Transaction tables, which can be considered the core tables of a database system, and discuss the important aspects of Reporting and Audit tables as well.

In this chapter we will cover the following topics:

Choosing a Database Technology
Selecting Data Types
Designing Master Tables
Designing Transaction Tables
Designing Reporting Tables
Designing Audit Tables
Maintaining Integrity between different tables

Choosing Database Technology


As we discussed in Chapter 1, Overview of PostgreSQL Relational Databases, the database designer has a lot of options for database technology. There are different types of databases, such as hierarchical databases, file-based databases, document databases, graph databases, relational databases, and so on, to choose from to suit the client's requirements. If the database designer decides to choose a relational database as the technology, he has another important choice to make: as there are many relational database technologies from different vendors, the designer has to choose among them.

The following are the most common and popular database technologies that are widely used by many database designers:

Oracle
Microsoft SQL Server
PostgreSQL
MariaDB
Teradata
IBM DB2
MySQL
Sybase
MS Access

When choosing a relational database technology, sometimes the database designer does not have a choice, as the organization may have already decided to use an already licensed, known technology. Even then, the database designer has to verify that the features of the database technology are sufficient to support the client's requirements.

If the database designer does have a choice, then the cost of the database, the support of the database vendor, and the available resources, such as people and drivers, will be the key parameters.

Even after the selection of a database technology, there are different versions of the relational database to choose from. Let us assume that the designer has chosen PostgreSQL 12 as the relational database technology.

It is important to understand the data types available in PostgreSQL so that the database designer knows his options. Let us see the limitations and properties of the PostgreSQL data types.


Selecting Data Types


The data type is the basic constraint of an attribute or a table column. It defines what type of values a column can hold. For example, the student name column should be a string data type, the age column an integer data type, and the date of birth column a date data type. This means the age column will not accommodate string values, whereas the date of birth column needs a valid date: it will not accept 2020-11-31, as there is no 31st in the month of November. However, if someone enters 2025-11-30 as the date of birth, it will be accepted because it is a valid date; that type of constraint cannot be enforced by data types alone.
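As a minimal sketch of this behaviour:

-- An impossible calendar date is rejected by the date data type:
SELECT '2020-11-31'::date;   -- ERROR: date/time field value out of range
-- A future but valid date is accepted; catching it would need a CHECK
-- constraint or application logic, not a data type:
SELECT '2025-11-30'::date;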

In PostgreSQL, there are many data types, but we will look at only the most important ones so that database design can be done. There are different data type categories in PostgreSQL: Character Types, Numeric Types, Monetary Types, Date/Time Types, and Boolean Types. Apart from those, there are data types such as Binary Types, JSON, Geometric Types, and XML Types, which are not used as much.

Let us see what are the character data types available in PostgreSQL.

Character Data Types


The most common data types are character data types. In PostgreSQL, there are three
character data types that are listed in the following table:

Data Type               Description
character varying(n)    variable-length with limit
character(n)            fixed-length, blank padded
text                    variable unlimited length
The character varying(n) and character(n) data types can store strings up to length n, where n is a positive integer. If a user tries to insert a value that is longer than the column length, the following error will occur:
ERROR: value too long for type character varying(2) SQL state: 22001

The character varying(n) data type consumes only the actual length of the value, whereas character(n) has a fixed length and pads the remainder with blanks.

In addition to these two data types, PostgreSQL provides another character data type called text. The text data type stores strings of any length and is a non-SQL-standard type.

Apart from these three string data types, there are two other fixed-length character data types: char and name, which have storage of 1 byte and 64 bytes respectively. The name data type should not be used by users, as it is intended for internal use.

Following is an example of creating columns with character data types; similarly, these data types can be used in a SQL script as shown below:
CREATE TABLE public."Customer"
(
"Title" character(4) ,
"FirstName" character varying(25) ,
"MiddleName" character varying(25) ,
"LastName" character varying(25) C,
"Status" "char",
"Profile" text
)

Though in some other database systems there is a performance difference between character(n) and character varying(n), in PostgreSQL there is no performance difference. The only difference is that character(n) consumes additional, unnecessary storage and a few extra CPU cycles to check the length when storing values. Therefore, in most situations, text or character varying should be used.

It is important to note that the maximum length a character data type can have is 10,485,760. ERROR: length for type varchar cannot exceed 10485760 will occur if you try to create a column longer than that. However, it is unlikely that you will ever need a character column of that length.

Let us see what Collation Support is in character columns.

Collation Support
For character data types, collation is an important concept that allows users to work with multiple languages. As you know, different languages have their own character sets and sorting rules. If you are designing a database for multi-lingual requirements, the collation has to be selected accordingly. By default, all character data types take the collation of the database.
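As a minimal sketch (the "C" collation always exists; locale-specific collations such as "en_US" depend on the operating system), the Customer table created earlier can be sorted with an explicit collation:

-- Sort a column using an explicit collation instead of the database default.
SELECT "FirstName"
FROM public."Customer"
ORDER BY "FirstName" COLLATE "C";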

Let us see what numerical data types are available in PostgreSQL, along with their limitations and properties.

Numerical Data Types


In most database technologies, there are few different types of numeric data types and
PostgreSQL is no different. The following are the different numeric data types PostgreSQL:

Numeric Data Type   Storage Size   Range
smallint            2 bytes        -32,768 to +32,767
integer             4 bytes        -2,147,483,648 to +2,147,483,647 (approx. -2 billion to +2 billion)
bigint              8 bytes        -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807 (approx. -9 quintillion to +9 quintillion)
numeric             variable       up to 131,072 digits before the decimal point; up to 16,383 digits after the decimal point
real                4 bytes        6 decimal digits precision
double precision    8 bytes        15 decimal digits precision
smallserial         2 bytes        1 to 32,767
serial              4 bytes        1 to 2,147,483,647
bigserial           8 bytes        1 to 9,223,372,036,854,775,807
Typically, whenever a whole number is needed, users tend to select the integer data type without looking into the requirements. Let us say you want to identify the data type for a department ID where you will never have 50,000 departments. If you select the integer data type, you are using 4 bytes, whereas you could use the smallint data type, which consumes only 2 bytes; the bigint data type consumes 8 bytes. This choice impacts not only the department table but also every table that refers to it. For example, if the employee table refers to the department table, it will also be impacted. Also, when indexes are created, they will require more storage; this is discussed in detail in Chapter 8, Working with Indexes.

Also, unnecessary use of a larger data type needs additional storage, more time for database backups to execute, and additional time to restore the database if needed.

When you need to search for data, you may have to scan additional, unnecessary data pages if the wrong data type is selected. This means it is essential to select the correct data type in order to avoid a negative storage and performance impact.

When an integer data type is divided by another integer, the result is an integer. If you want the result as a numeric value, make sure you cast either the denominator or the numerator, as shown in the below screenshot:

The above screenshot shows how the output differs when the data types are different. This means you need to select data types carefully.
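A minimal sketch of the same behaviour in SQL:

SELECT 7 / 2;                    -- integer division: returns 3
SELECT 7::numeric / 2;           -- cast the numerator: returns 3.5 (as numeric)
SELECT 7 / CAST(2 AS numeric);   -- cast the denominator: returns 3.5 (as numeric)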

Let us see how Serial data types can be used in PostgreSQL.

Serial Data Types


The serial data types (smallserial, serial, and bigserial) are not true data types but a convenient way to create unique, auto-incrementing numbers. The following query creates an auto-increment column called columnname in the SampleTable table:

[ 152 ]
Table Structures Chapter 6

CREATE TABLE SampleTable (
    columnname SERIAL
);

When a serial column is created, a SEQUENCE object is created, as shown in the following screenshot:

This sequence will be dropped once the serial column is dropped. You can also drop the sequence; the column will not be dropped, but its values will no longer be auto-incremented.
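As a minimal sketch, you can look up the sequence that backs a serial column (table and column names taken from the example above; unquoted identifiers are folded to lowercase):

-- Returns the name of the sequence owned by the serial column.
SELECT pg_get_serial_sequence('sampletable', 'columnname');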

Monetary data types are another important category of data types in PostgreSQL, and they are discussed in the following section.

Monetary Data Types


The money data type is the only monetary data type available in PostgreSQL. It is an 8-byte data type that ranges from -92,233,720,368,547,758.08 to +92,233,720,368,547,758.07. The money data type can hold only two decimal places. Therefore, if you need three decimal places, for example for an interest calculation, it is essential to use the numeric data type instead of the money data type.

When the money data type is divided by values of different data types, the results can differ, as shown in the below screenshot:

The above screenshot shows the different outputs when different data types are used. If you want to obtain the money data type as the output, the correct inputs have to be used.
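A minimal sketch of this behaviour (assumed operator results; verify against your PostgreSQL version and locale):

SELECT 100.00::money / 3;          -- money / integer returns money
SELECT 100.00::money / 3::money;   -- money / money returns double precision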

Date/time data types are another important category of data types in PostgreSQL.

Date/Time Data Types


Date/time data types are among the most error-prone data types in any database technology, and PostgreSQL is no different. There are several date/time data types in PostgreSQL. As with the numeric data types, it is essential to pick the most suitable date/time data type from the available options.

The following table shows the different date/time data types along with their properties and limitations.

Data Type                      Storage    Range                                      Accuracy
timestamp without time zone    8 bytes    4713 BC to 294276 AD                       1 microsecond
timestamp with time zone       8 bytes    4713 BC to 294276 AD                       1 microsecond
date                           4 bytes    4713 BC to 5874897 AD                      1 day
time without time zone         8 bytes    00:00:00 to 23:59:59                       1 microsecond
time with time zone            12 bytes   00:00:00 to 23:59:59, with time zone       1 microsecond
interval                       16 bytes   -178,000,000 years to 178,000,000 years    1 microsecond
The time, timestamp, and interval data types accept an optional precision value p, which specifies the number of fractional digits retained in the seconds field of the value. The value of p ranges from 0 to 6.

The interval data type accepts the following field specifications to indicate the interval type:

YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, YEAR TO MONTH, DAY TO HOUR, DAY TO MINUTE, DAY TO SECOND, HOUR TO MINUTE, HOUR TO SECOND, MINUTE TO SECOND
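A minimal sketch of declaring and using interval values (hypothetical table and column names):

-- Restrict an interval column to day-to-second granularity.
CREATE TABLE public."ExamSession" (
    "Duration" interval DAY TO SECOND
);
-- An interval literal:
SELECT INTERVAL '2 days 3 hours';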
Let us discuss the properties of boolean data types in PostgreSQL.

Boolean Data Types


The boolean data type, which requires 1 byte of storage, is the only Boolean data type. The values true, yes, on, and 1 are treated as TRUE, whereas false, no, off, and 0 are treated as FALSE.

As indicated before, we have discussed the most important data types in PostgreSQL. However, PostgreSQL has a much richer set of data types to suit your various needs; you can go through the PostgreSQL documentation listed in the Further Reading section.

When it comes to designing physical database structures, there are mainly six data layers, as shown in the following screenshot:


Though there are six data layers, there are four table types: Master Tables, Transaction Tables, Reporting Data, and Transaction Audit Data, which we will look at in the following sections.

Designing Master Tables


Master tables are the tables that hold reference data. For example, if you look at a university system, entities such as Students, Courses, and Lecturers are master data. Though creating master tables might seem straightforward, there are two common design decisions to make.

Let us look at the first one: introducing an additional serial column as the primary key.

Additional Serial Column as Primary Key


Typically, there will be a business key, and most database designers unintentionally pick it as the primary key. For example, in the Course entity, the course code (ENG1234, IT5467, MAT1976) will be the primary key in most cases. This means that the related tables will use the course code; for example, the enrollment entity will hold the course code. There can be two issues with this design.

1. When the business key is used as the primary key, the transaction tables reference the business key, as said above. Since these codes are governed by the business, the business has the right to change them. If the business keys are changed, changing the master tables is not a great issue, as master tables typically have a small number of records. However, you also need to change all the records in the transaction table, which may have records in the order of millions. Modifying the transaction table records may result in a table lock, making the table unavailable until the update is over.
2. Typically, a business key contains alpha-numeric values, as said earlier in this section. For most queries, you need to join the transaction and master tables, and joining on alpha-numeric columns has a performance impact.

Considering these two issues, we typically introduce a new serial column to be used as the primary key, while the existing business key gets a UNIQUE index.
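As a minimal sketch of this pattern (a simplified, hypothetical version of the Course table; the full script appears later in this chapter):

CREATE TABLE public."CourseSketch"
(
    "CourseID"   serial PRIMARY KEY,                -- surrogate key used by related tables
    "CourseCode" character varying(10) NOT NULL,    -- business key, kept unique
    CONSTRAINT "UQ_CourseSketch_CourseCode" UNIQUE ("CourseCode")
);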

Let us see why it is important to introduce create and update columns in master tables.

Introducing Create and Update Columns


It is important to record when master records are entered and updated, mainly for integration purposes with other systems. Typically, other systems need to know when records are inserted and updated so that they can integrate them into their own systems.

The following screenshot shows the sample for the course table, which includes the
primary key, and created and updated columns:


"The following code block shows the script for the table above:
CREATE TABLE public."Course"
(
"CourseID" integer NOT NULL DEFAULT
nextval('"Course_CourseID_seq"'::regclass),
"CourseCode" character varying(10) COLLATE pg_catalog."default" NOT NULL,
"Description" character varying(50) COLLATE pg_catalog."default",
"CreatedDate" timestamp without time zone,
"ModifiedDate" timestamp without time zone,
CONSTRAINT "PK_Course" PRIMARY KEY ("CourseID")
)

The following screenshot shows the sample data set for the Course table:

However, there are different types of master tables, depending on how the history of the master data is maintained. Depending on the type, the table design will differ.

Let us see how overwriting can be done in master tables.


Overwriting Master Tables


In Online Transaction Processing (OLTP) systems, it is typically not essential to keep historical data. This means that, most of the time, when you need to modify a column in an OLTP system, it will simply be overwritten. With this method, it is important to note that there is no way to retrieve historical data. This is a very simple method, as you can see from the following flow chart:

Although updating the existing record in place is the most common master table design, there can be cases where you need to keep the historical aspect of the master tables.
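A minimal sketch of the overwriting approach, assuming the Course table shown earlier in this chapter:

-- The old description is simply lost; only the latest value is kept.
UPDATE public."Course"
SET "Description" = 'Advanced JAVA',
    "ModifiedDate" = NOW()
WHERE "CourseCode" = 'JAVA';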

In the world of data warehousing, or OLAP systems, keeping historical values is referred to as Type 2, Type 3, or Type 6 Slowly Changing Dimensions (SCD). As this is not covered in this book, we will not discuss them in detail.

Let us examine the different design approaches for the Master tables.

Designing Historical Master Tables


There are three types of design available for historical master tables: Row-Based Historical Master Tables, Column-Based Historical Master Tables, and Hybrid Historical Master Tables. Let us learn about these options for designing historical master tables in the following sections.

Row Based Historical Master Tables


In Row-Based Historical Master Tables, when data is updated, an additional row is inserted instead of overwriting the modified values. This means that different versions of a record are maintained in the master table.

To facilitate this, we need ValidStartDate, ValidEndDate, and IsCurrent columns, as shown in the following screenshot:

The above-mentioned Course table can be created with the script shown below:
CREATE TABLE public."Course"
(
"CourseID" integer NOT NULL DEFAULT
nextval('"Course_CourseID_seq"'::regclass),
"CourseCode" character varying(10) COLLATE pg_catalog."default" NOT NULL,
"Description" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"CourseOwner" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"ValidStartDate" date NOT NULL,
"ValidEndDate" date,
"IsCurrent" bit(1) NOT NULL,
"CreatedDate" timestamp without time zone,
"ModifedDate" timestamp without time zone,
CONSTRAINT "PK_Course" PRIMARY KEY ("CourseID")
)


Please note that there are two modifications to the previous Course table apart from the addition of the ValidStartDate, ValidEndDate, and IsCurrent columns. When we created the Course table before, the data type of CourseID was set to serial; you will now see that column's data type as integer. As we said in the Serial Data Types section, the serial data type is not a true data type but a variant of the integer data type. You will also observe that there is another column in this table named CourseOwner, which will be used to demonstrate how to keep historical data in a master table.

Let us assume that the course called JAVA is inserted into the table using the following
script:
INSERT INTO PUBLIC."Course" (
"CourseCode"
,"Description"
,"CourseOwner"
,"ValidStartDate"
,"IsCurrent"
,"CreatedDate"
,"ModifedDate"
)
VALUES (
'JAVA'
,'Introduction to JAVA'
,'Phil Jason'
,'2012-01-01'
,B'1'
,NOW()
,NOW()
);

After the insertion of the record, the Course table will look as shown in the following
screenshot:

Now let us assume that on 2017-05-01, course ownership was changed to 'David Baker'. A new record is inserted and the existing record is updated using the following script:
UPDATE PUBLIC."Course"
SET "ValidEndDate" = '2017-04-30'
   ,"IsCurrent" = B'0'
   ,"ModifedDate" = NOW()
WHERE "CourseCode" = 'JAVA' AND "IsCurrent" = B'1'


INSERT INTO PUBLIC."Course"
( "CourseCode" ,"Description" ,"CourseOwner" ,"ValidStartDate" ,"IsCurrent"
 ,"CreatedDate" ,"ModifedDate" )
VALUES
( 'JAVA' ,'Introduction to JAVA' ,'David Baker' ,'2017-05-01' ,B'1' ,NOW()
 ,NOW() );

After this script, the Course table will be updated as follows:

If you want to implement row-based historical master tables, the business key (CourseCode in this example) cannot be used as the primary key, since the business key will be duplicated. Therefore, if you wish to implement row-based historical master tables, you need to include an additional serial column.

The following screenshot shows the flow chart for the row-based historical master table implementation:

However, if there are columns that change frequently, this type of design will not be effective, as the master table tends to grow rapidly. Also, not all columns need to be treated as historical data. For example, in the above example, if the course name is modified, it will simply be overwritten rather than adding a new row.


Column Based Historical Master Tables


Another way of keeping history in master tables is by using an additional column. In the previous example, a new row was added whenever there was a modification. In a Column-Based Historical Master Table, the historical value is kept in another column rather than in another row.

Let us look at the same example of the Course table with the CourseOwner column. As shown in the following screenshot, the previous owner is stored in a column called PreviousCourseOwner in the same record:

The PreviousCourseOwner column of the Course table should be a NULLABLE column, because there won't be any value for PreviousCourseOwner for the first record.

Following is the script that was used to create the above table:
CREATE TABLE public."Course"
(
"CourseID" serial NOT NULL ,
"CourseCode" character varying(10) COLLATE pg_catalog."default" NOT NULL,
"Description" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"CourseOwner" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"PreviousCourseOwner" character varying(50) COLLATE pg_catalog."default",
"CreatedDate" timestamp without time zone,
"ModifedDate" timestamp without time zone,
CONSTRAINT "PK_Course" PRIMARY KEY ("CourseID")


When the first record is inserted, PreviousCourseOwner is null as shown in the below
screenshot:

Now let us assume that course ownership changed from 'Phil Jason' to 'David Baker'. 'Phil Jason' is moved to the PreviousCourseOwner column, and the new value, 'David Baker', is written to the CourseOwner column of the Course table. This is done using the following script:
UPDATE public."Course"
SET "PreviousCourseOwner" ="CourseOwner"
,"CourseOwner" = 'David Baker'
,"ModifedDate" = NOW()
WHERE "CourseCode" = 'JAVA'

This record update can be seen in the following screenshot:

Unlike the row-based master table, the column-based master table will not accumulate a large number of rows. However, it is important to note that it can keep only one previous value. Also, this type of master table does not scale well if you want to keep history for a large number of columns.

Hybrid Historical Master Tables


Having discussed Row-Based Master Tables and Column-Based Master Tables separately, there are certain cases where you need to combine both types. The following screenshot shows the Course table as a Hybrid Historical Master Table:


The above table can be created with the following script:
CREATE TABLE public."Course"
(
"CourseID" serial NOT NULL ,
"CourseCode" character varying(10) COLLATE pg_catalog."default" NOT NULL,
"Description" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"CourseOwner" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"PreviousCourseOwner" character varying(50) COLLATE pg_catalog."default",
"ValidStartDate" date NOT NULL,
"ValidEndDate" date,
"IsCurrent" bit(1) NOT NULL,
"CreatedDate" timestamp without time zone,
"ModifedDate" timestamp without time zone,
CONSTRAINT "PK_Course" PRIMARY KEY ("CourseID")
)

The following screenshot shows the sample data set for the above implementation:

Sometimes, designers prefer to have the current value repeated in all the records as well, as shown in the sample data set in the following screenshot:


Though the design of master tables may seem a trivial problem to solve, designers have multiple options to choose from. The designer's task is to carefully select the approach that best caters to the client's requirements.

In the next section, we will look at the design of the transaction tables, which is a core part of database design.

Designing Transaction Tables


Transaction tables hold the most important data in the system; they are the main reason the database exists. If you are designing a database for a university, the main purpose is to store exam results; if you are designing an ordering system, orders are the main entities to store. There are a few challenges when it comes to the design of transaction tables.

Data volume, transaction velocity, and data integrity are the major challenges that database designers face when designing transaction tables.

Let us first look at the challenges posed by high volumes in transaction tables.

High Data Volume


In transaction tables, we are looking at records in the order of millions or billions. This means transaction tables have to cater to large volumes, and transactions should therefore be very short. To reduce row sizes, it is essential to normalize to at least the third normal form. Refer to Chapter 5, Applying Normalization, for details on data normalization.

Partitioning High Data Volume Transaction Tables


Partitioning is a key aspect of transaction tables when catering to high data volumes. By using partitioning, a large table is divided into smaller, manageable partitions. Partitioning can be used to improve scalability, reduce transaction contention, and optimize the performance of user queries. The partitioning process can also provide a mechanism for dividing data by usage pattern; for example, you can archive older data in cheaper storage, as partitions can be mapped to different physical locations.
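A minimal sketch of declarative range partitioning (available in PostgreSQL 10 and later), using a hypothetical Orders table partitioned by order date:

CREATE TABLE public."Orders" (
    "OrderID"   bigserial,
    "OrderDate" date NOT NULL,
    "Amount"    numeric(12,2)
) PARTITION BY RANGE ("OrderDate");

-- One partition per year; older partitions can be placed on cheaper storage.
CREATE TABLE public."Orders_2019" PARTITION OF public."Orders"
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');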

However, the partitioning strategy of a table must be selected with great care so that the benefits are maximized while adverse effects on transactions are minimized. The partitioning process and partition strategies are covered in detail in Chapter 12, Distributed Databases.

Another challenge that database designers come across is high velocity in transaction tables, which will be discussed in the next section.

High Transaction Velocity


Velocity is the rate at which records are inserted into the transaction tables. Typically, transaction tables have high velocity; if you consider orders in a supermarket, we are looking at more than a hundred thousand records per day. This means that tables should be designed to accommodate a high transaction rate, which is discussed in detail in Chapter 09, Designing a Database with Transactions.

To facilitate high transaction velocity in transaction tables, normalization can be used so that data duplication is removed and the performance of data writing is improved.

Data Integrity
A transaction has to update multiple tables and refer to multiple master tables. If you consider the order table, to create an order you need to refer to the customer table, product table, cashier table, and so on. When an order is raised, the correct references must be in place; this integrity can be achieved by referential data integrity, which is discussed later in this chapter.

When an order is converted to an invoice, the transaction has to update the order transaction table, customer balance, inventory tables, and so on. For a transaction to complete, it has to make sure that all relevant tables are updated. If one table fails, the entire transaction should be rolled back to achieve the Atomicity property of transactions. This integrity is discussed in Chapter 09, Designing a Database with Transactions.
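A minimal sketch of this all-or-nothing behaviour, using hypothetical table and column names:

BEGIN;
    UPDATE public."Inventory"
    SET "Quantity" = "Quantity" - 1
    WHERE "ProductID" = 100;

    UPDATE public."CustomerBalance"
    SET "Balance" = "Balance" + 25.00
    WHERE "CustomerID" = 7;
-- If any statement fails, issue ROLLBACK instead; otherwise:
COMMIT;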


Sample Transaction Tables


Let us examine the design of transaction tables by means of a very common ordering system design, as shown in the below screenshot:

In the above system design, the Order entity is divided into three physical tables in order to address the challenges that we discussed for transaction table design. For a given order, all three tables have to be updated to complete the order. All tables have CreatedDate and ModifiedDate columns in order to support integration with third-party systems if needed.

It is always better to include CreatedDate and ModifiedDate columns in every table, as we never know when third-party systems will need these columns. If you don't have these columns and need to include them later, it will take some effort to change the database as well as the application code.

Typically, transaction tables have a lot of relationships with master tables. These relationships should be FOREIGN KEY constraints so that integrity is maintained. Also, these columns can be included in indexes so that the performance of table joins is improved.

Calculated Columns versus Processed Columns


In the OrderDetail table in the Sample Transaction Tables section, you will observe that there is an Amount column, which is equivalent to Quantity * UnitPrice:

Amount = Quantity x UnitPrice

Now the question is whether Amount should be stored as a calculated column or processed when it is required. There are positive and negative aspects to both. If you consider storage and write time, you should choose to process Amount on demand. However, if you are looking at improving read performance rather than write performance, it is better to store it as a calculated column.
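As one way to implement a calculated column in PostgreSQL 12 and later, a minimal sketch using a generated column on a simplified, hypothetical OrderDetail table:

CREATE TABLE public."OrderDetailSketch" (
    "OrderDetailID" bigserial PRIMARY KEY,
    "Quantity"      integer NOT NULL,
    "UnitPrice"     numeric(12,2) NOT NULL,
    -- Computed and stored at write time, so reads pay no extra cost.
    "Amount"        numeric(12,2) GENERATED ALWAYS AS ("Quantity" * "UnitPrice") STORED
);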

Let us see how updatable transaction tables are designed.

Updatable Transaction Tables


Most transaction tables are write-only or write-heavy tables. Most likely, their records will be updated only on a few occasions, such as cancellations, goods returns, and so on. However, there are cases where you do need to update transaction tables.

Let us assume a student submits an assignment; the lecturer then marks the assignment as received or rejected, then marks it for grading, and finally the grades are released to the students.

The following table shows how a single record changes over time as it moves through the different stages:

Steps  StudentID  Assignment ID  Submission Date  Is Accept  Accept Date  Accept LecturerID  Marked Date  Marked By  Grading
1      S001       A0001          2018-05-01
2      S001       A0001          2018-05-01       Yes        2018-05-27   L001
3      S001       A0001          2018-05-01       Yes        2018-05-27   L001               2018-06-30   L002       A+
As shown in the above table, it is the same record that is updated over time. Since the record has to be updated later, it is essential to be able to find it in a timely manner; therefore, indexes make an important contribution to finding the relevant record.

In the next section, we will discuss reporting tables, which mainly support reports in the database system.

Designing Reporting Tables


In reporting tables, read performance should be high. Reports contain many aggregations and summaries, and these tables are customized to support specific reporting needs, so when designing them the requirements of the report should be very clear, along with the data volume.

As we discussed in Chapter 5, Applying Normalization, there are cases where de-normalized structures are needed instead of normalized structures. Reporting tables are one such case: de-normalized structures are preferred in order to provide higher read performance. With normalized structures, reading data requires joining multiple tables, which costs CPU cycles. To avoid these unnecessary CPU cycles, reporting tables are designed as de-normalized structures.

When there is no requirement for real-time data in the reporting tables, these tables are updated by a process separate from the transaction itself. If the reports are needed in real time, there is no option but to update them within the transaction itself.

Auditing is an important concept in database design, and audit tables differ in nature from other tables. Therefore, special attention is needed when designing them, which is discussed in the next section.

Designing Audit Tables


There are several purposes for auditing. Mainly, auditing supports various standards such as data protection laws, the Sarbanes-Oxley Act, and so on. Detecting misuse and preventing it from happening again is another use of audit tables. Auditing can also be used as evidence and, technically, as a recovery option, since the audit trail tells you what happened.

For auditing it is essential to store the following data:


WHO     Author of the change             It can be the operating system user, the database user, or the application user.
WHAT    What was changed                 The data that was changed; the operation (INSERT / DELETE / UPDATE / TRUNCATE / permissions granted); the executed command.
WHEN    When the change was done         The timestamp of the change. If users access the system from multiple time zones, it is essential to store the time zone information as well.
WHERE   From where the change was done   IP address, client machine name, and so on.
The following are the typical attributes that can be captured for auditing. Please note that, due to practical and technical issues, you may not be able to capture all of these details. Such a general audit table can be created with the below script:
CREATE TABLE public."GeneralAudit"
(
"AuditID" bigserial NOT NULL ,
"StartTimeStamp" timestamp with time zone NOT NULL,
"EndTimeStamp" timestamp with time zone,
"OperatingUserName" character varying(50) COLLATE pg_catalog."default",
"SystemUserName" character varying(50) COLLATE pg_catalog."default",
"ApplicationName" character varying(50) COLLATE pg_catalog."default",
"Command" character varying(2000) COLLATE pg_catalog."default",
"HostName" character varying(15) COLLATE pg_catalog."default",
"DatabaseName" character varying(15) COLLATE pg_catalog."default",

[ 171 ]
Table Structures Chapter 6

"SessionID" character varying(10) COLLATE pg_catalog."default",


"Duration" smallint,
"IsSuccess" bit(1),
CONSTRAINT "GeneralAudit_pkey" PRIMARY KEY ("AuditID")
)

Audit tables are write-heavy tables and are very rarely used for retrieval; audit data is read only when there is an absolute need. Therefore, indexes are typically not created on audit tables. Refer to Chapter 8, Working with Indexes.

When there is a need to retrieve data from audit tables, you need indexes. For example, if you want to analyze audit data for a given application, you need an index on the ApplicationName column. You can copy the audit database to a different server and apply the necessary indexes there to facilitate the audit queries.

Also, no foreign key constraints or check constraints are implemented on the audit table, in order to improve its data writing performance. Only a primary key is added to the audit table.

There are several ways to capture audit data, as listed below, but they are not in the scope of this book:

Triggers (DDL / DML)
Change Data Capture (CDC)
Change Tracking
Audit Replication
Database Logs (Redo Logs)
Custom Implementation

In an enterprise system, it is better to have a separate audit database so that it can store the audit records of all the databases. This has two major advantages. First, normal users should not have access to the auditing data, for security reasons; if the auditing data is in the same transaction database, there is a chance that normal users will get access to the auditing tables. By separating the auditing tables into a separate database, the chance of unnecessary users getting permission on the auditing tables is minimal. Second, by keeping the auditing database separate, other maintenance tasks such as indexing, archiving, and database backups can be managed separately.

Archiving is another important aspect of audit tables. Some laws govern the period of audit data retention. Typically, table partitioning is used for audit tables so that data archival can be done more easily. The partitioning process is looked at in detail in Chapter 12, Distributed Databases.

After designing the different tables, it is important to maintain integrity between them, which is covered in the following section.

Maintaining Integrity between different tables
Maintaining integrity between multiple tables is an important factor in database design. Integrity can be achieved in a couple of ways: constraints and transactions are the most common. There are other custom techniques that can be employed from the application end, which will not be discussed in this book.

Foreign keys are one of the main ways to maintain data integrity between multiple tables.

Foreign Key Constraints


A foreign key constraint specifies that a column can only contain values that exist in the referenced primary key of another table. This referential integrity ensures the consistency of the data between the two tables.

If we look at an example of course enrollment, lecturers and courses are referred to in the course enrollment table. Let us look at the basic columns of these three entities to describe foreign key constraints, using the following screenshot:


Please note that in the above E-R diagram, only the essential columns are taken into consideration.

Foreign key constraints are created in PostgreSQL as shown in the following screenshot:


As shown in the above screen, the CourseEnrollment table's CourseID references the CourseID in the Course table.
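A minimal sketch of the same constraint in SQL, assuming a CourseEnrollment table with a CourseID column:

ALTER TABLE public."CourseEnrollment"
    ADD CONSTRAINT "FK_CourseEnrollment_Course"
    FOREIGN KEY ("CourseID") REFERENCES public."Course" ("CourseID");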

A detailed discussion of constraints (CHECK / UNIQUE / FOREIGN KEY) was given in Chapter 1, Overview of PostgreSQL Relational Databases. Transactions are discussed in detail in Chapter 09, Designing a Database with Transactions.

Summary
After discussing the identification of entities, modeling with E-R modeling, and the different levels of normalization forms in the previous chapters, this chapter dedicated its discussion to the physical implementation of these entities as database tables. In this chapter, we identified four main types of table structures in a physical database implementation: Master Tables, Transaction Tables, Reporting Tables, and Audit Tables. We saw that there are several ways to design master tables, depending mainly on how historical data is maintained. Most of the time, history is not maintained in master tables; however, when it is required, three designs were discussed depending on how the historical data is stored: Row-Based, Column-Based, and Hybrid master table implementations.

In the transaction tables, it is essential to overcome challenges such as high data volume and high data velocity. Maintaining integrity is also essential; otherwise, transactions will be invalid. Reporting tables mainly support heavy reads, and therefore it was suggested to use de-normalized structures for them. Audit tables are used for many reasons, mainly to support various domain laws. For audit tables, we discussed that it is essential not to have indexes, to support these write-heavy tables. When database designers design a database, there are common problems that can be solved by database design patterns.

In the next chapter, we will discuss database design patterns, which are used as templates for common problems in database design.

Questions
What are the parameters you will consider when choosing a database technology to match the client's requirements?

An important parameter to consider is the feature set of the candidate database technologies: you need to choose a technology that caters to the user requirements. There are several types of database technologies, such as relational, document, and so on. Apart from the database type, you need to choose a technology that supports the required features; for example, if you are looking at replicating data for different purposes, you need a database technology that supports the required type of replication. Apart from the functionality, the next important thing to look at is the cost: depending on the budget specified for the project, you may choose an open source database technology such as PostgreSQL. The final important parameter is the availability of resources.

Why should the character varying(n) data type be used over the character(n) data type?

When storing data in a character varying(n) column, only the storage needed for the value being inserted is consumed, not the entire allocated length. For example, assume there is a column of character varying(50); storing the value "MESSAGE" consumes 7 bytes, the length of the text. On the other hand, with the character(50) data type, regardless of the size of the value, the entire 50 bytes are consumed. This means it is always recommended to use the character varying(n) data type for columns such as names, addresses, and so on.

Why is collation an important factor when it comes to database design?

For character data types, collation is an important concept that allows users to work with multiple languages. Different languages have their own character sets and sorting rules. If you are designing a database for multi-lingual requirements, the collation has to be selected accordingly. By default, all character data types take the collation of the database.

Why is it essential to choose the correct data type among smallint, integer, and bigint?

The smallint data type requires 2 bytes and the integer data type requires 4 bytes, whereas the bigint data type requires 8 bytes. If you do not use the correct numeric data types, you will be spending data storage unnecessarily. Additional storage means larger row sizes, which reduce both read and write performance significantly. Apart from query performance, the database becomes unnecessarily larger than it should be, which means additional storage is required for the database and its backups, and database backup and restoration times are higher than they should be.

What is the strategy you will use if you are dealing with users from multiple geographical locations?

If a system caters to users in multiple geographical locations, it is essential to keep time zone information. For example, if the system is used by a user in Paris and another user in New York, the relevant data should be stored with time zone information. PostgreSQL has date/time data types that can store time zone information; by using data types such as timestamp with time zone and time with time zone, you can deal with users from multiple geographical locations.

What is the importance of including an additional serial column in a master table?

If no serial column is used, the business key has to be used as the primary key. By introducing an additional serial column, the primary key can be this new serial column, which removes the dependency of the primary key on the business. Additionally, with a serial column as the primary key, table joins are performed over numeric columns rather than alpha-numeric columns, which results in higher performance during joins.

An organization has identified that some of its female employees have changed their names after marriage. What would be the best type of historical master table to support this requirement, and why?

Column-Based Historical Master Tables should be used to store the employee's previous name. Since these changes are very rare, maintaining the previous value in a different column is adequate. This means there will be two columns in the Employee table, Name and PreviousName. When the name of an employee changes, the new name is written to the Name column and the previous name is written to the PreviousName column.

Why is at least the third normalization level required for transaction tables?

Since transaction tables deal with high-volume and high-velocity transactions, it is essential to keep their rows at a minimum length. When the third normalization level is applied to transaction tables, the row size is reduced.

Why are de-normalized database structures most suitable for reporting tables?

Reporting tables have heavy read operations over high volumes. If reporting tables are fully normalized, the data is scattered and you need to perform a large number of table joins; large joins over a large volume of data require a lot of CPU. When the tables are de-normalized, you don't need as many joins, hence the read performance of the reporting tables improves.

Why should indexes not be incorporated into auditing tables?

Auditing tables have write-heavy operations; reading from them is done only when required, for example when you need to find information about a specific incident, and such events are very rare. If there are many indexes on the auditing tables, all of them have to be updated during inserts, which reduces write performance and, in turn, the performance of the transaction as well. Therefore, it is recommended not to have any indexes on the auditing tables. When the need arises to read data from them, you can apply the required indexes temporarily and get the work done.

Why is it recommended to have a separate database for audit data?

Auditing data should be stored securely; it should not be available to normal users. This can be achieved with standard authorization techniques even if the data is stored in the same transaction database, but the risk of something going wrong is higher. By providing a separate database for auditing data, the data can be easily secured. Apart from the security aspect, backups and archiving can be done much more easily when the auditing data is maintained in a separate database.

Exercise
Let us implement table structures for the designed database in the previous chapters.

1. Identify the different types of tables.
2. Identify what types of master tables are required.
3. What is the strategy for Audit databases?
4. What are the levels of auditing you can achieve?

Further Reading
Full list of data types in PostgreSQL: https://www.postgresql.org/docs/12/datatype.html

Collation: https://www.postgresql.org/docs/9.1/collation.html

SOX Compliance: https://digitalguardian.com/blog/what-sox-compliance

Chapter 7: Working with Design Patterns
During chapters 1-4, we discussed the modeling aspect of the database. In those chapters,
we discussed how to capture the users' requirements with respect to the database. During
these chapters, we identified different data models such as conceptual models, semantic
modeling, and E-R modeling. Then we discussed how to optimize identified entities and
their attributes, by using different levels of normalization models. We also discussed where
we need to ignore the database normalization.

In the previous chapter, Table Structures, we discussed how physical table structures are implemented for various types of tables such as Master tables, Transaction tables, Reporting tables, and Auditing tables.

In this chapter, we will discuss design patterns. A software design pattern is a general, reusable solution to a commonly occurring problem. However, it is important to know that a pattern should be applied depending on the environment. When it comes to database design, there are common problems that can be addressed by database design patterns.

We will also look at anti-patterns, which describe what should not be done during the design of databases.

The following topics will be covered in this chapter:

Understanding Design Patterns
What are Common Database Patterns
Avoiding Many-to-Many Relationships
OLAP versus OLTP
Common Issues in Design Patterns
Avoiding Database Design Anti-Patterns

Patterns are most commonly used in application development. Let us see why software patterns are used and what their benefits are.

Understanding Design Patterns


Design patterns are basic guidelines for solving repetitive problems. A design pattern can be considered a template for a common, known database design issue. It is important to note that design patterns are not completed designs that can be used as they are; an existing design pattern cannot be converted directly into an implementation.

Before we discuss database patterns, it is important to understand the basic criticisms of design patterns, so let us see what the issues in software patterns are in the next section.

Identifying Issues in Software Patterns


A software pattern should not be considered a solution for all design problems.

Let us see what the issue of choosing the wrong design pattern is in the following section.

Selection of Wrong Design Pattern


There are different design patterns available for designers to choose from, so the designer has to choose the correct design pattern for the problem at hand. For example, if you choose an Online Analysis Processing (OLAP) design pattern for an Online Transaction Processing (OLTP) system, you will obviously run into issues, not because the design pattern is wrong, but because you have chosen a design pattern intended for a different purpose.

Let us discuss the issue of implementing the Design pattern without any modifications.

Implementing a Design Pattern without any Modifications
Design patterns are basic guidelines for a common problem, not a specific solution to your particular problem. They can be used as a starting point. However, most of the time problems differ from environment to environment and from domain to domain. This means that you need to adapt a given design pattern to match your environment.

Let us look at another issue with design patterns: outdated design patterns.

Outdated Design Patterns


Some design patterns were introduced many decades ago, but requirements have since changed along with environmental parameters and tools. This means that legacy design patterns may not be suited to today's world. For example, a decade or two ago we were dealing with high-cost memory, high-cost CPU, and slow storage. Those constraints have changed, and modern designs need to reflect the changes rather than simply reusing old patterns. For instance, some years ago we were very particular about the length of attributes because storage was a major concern, and we avoided storing calculated attributes simply to save precious storage. Since the cost of storage has come down in favor of database designers, we now do store such calculated attributes in tables.

We had a detailed explanation with a real-world example in Chapter 6, Table Structures. In that chapter, under the Sample Transaction table section, we discussed this topic with more specific examples.

Let us see what are the common database design patterns in the next section.

Common Database Design Patterns


Though design patterns are mostly discussed with respect to application development, in this section we will look at the most common database design patterns used by database designers.

First, let us look at the Data Mapper database design pattern.

Data Mapper
As we know, the prime purpose of a database is to store data. Typically, this data comes to the database through an application rather than by direct user intervention in the database. There are three main user actions on data: INSERT, UPDATE, and DELETE. The Data Mapper design pattern maps these actions between the application and the database. By doing so, end users don't need to know the schema of the relevant table.

Let us revisit the Course table which is shown in the screenshot below:

If you examine this table, apart from CourseID, CreatedDate, and ModifiedDate, all the other columns, CourseCode, Description, CourseOwner, CurrentCourseOwner, PreviousCourseOwner, ValidStartDate, ValidEndDate, and IsCurrent, need to be filled by a user or by an application.

The following screenshot shows the block diagram for the data mapper design pattern:


In the above diagram, three layers are defined. From the application end, the user will call the Course Mapper objects.

Let us see how this mapper layer can be implemented in PostgreSQL.

In PostgreSQL, a set of SQL scripts can be executed via Procedures.

Database Procedures, also referred to as Stored Procedures or Procs, are sub-routines that can contain a set of SQL statements performing one or more tasks. Procedures can be used for data validation, access control, or to reduce network traffic between clients and the database servers.

Let us see what the Insert Mapper is in the following section.

Insert Mapper
An insert stored procedure can be created; let us see how this is done in PostgreSQL.

The following screenshot defines the signature of the procedure:


The procedure name, variables, and variable types define the signature of the procedure. There can be multiple procedures with the same name, but they must differ in the number of arguments or the argument types. In the argument list, the in_ prefix is used to indicate that it is an input argument. in_is_current has a default value, which means that even if no value is passed for that argument, the default value 1 will be set.

The following screenshot shows the code for the procedure:


The procedure above inserts all the passed values, and the current_date function is used to populate the CreatedDate and ModifiedDate columns.

Let us look at the script of the above procedure in the following code block:
-- PROCEDURE: public.insert_course(character varying, character varying, character varying,
--   character varying, character varying, date, date, bit)

-- DROP PROCEDURE public.insert_course(character varying, character varying, character varying,
--   character varying, character varying, date, date, bit);

-- Inserts a new Course row; CreatedDate and ModifiedDate are stamped with current_date.
CREATE OR REPLACE PROCEDURE public.insert_course(
    in_course_code character varying,
    in_description character varying,
    in_course_owner character varying,
    in_current_course_owner character varying,
    in_previous_course_owner character varying,
    in_valid_start_date date,
    in_valid_end_date date,
    in_is_current bit DEFAULT '1'::"bit")
LANGUAGE 'sql'
AS $BODY$
INSERT INTO public."Course"
    ("CourseCode",
     "Description",
     "CourseOwner",
     "CurrentCourseOwner",
     "PreviousCourseOwner",
     "ValidStartDate",
     "ValidEndDate",
     "IsCurrent",
     "CreatedDate",
     "ModifiedDate")
VALUES
    (in_course_code,
     in_description,
     in_course_owner,
     in_current_course_owner,
     in_previous_course_owner,
     in_valid_start_date,
     in_valid_end_date,
     in_is_current,
     current_date,
     current_date)
$BODY$;

At the application end, users don't have to worry about the table structures and their names:
CALL public.insert_course(
    'C001',
    'Business Analytics',
    'Shane Kimpsons',
    'Shane Kimpsons',
    'Paul Wilson',
    '2020-01-01',
    '9999-12-31',
    B'1'
);

Further, users do not have to worry about internal columns such as the CreatedDate and ModifiedDate columns; the above call is all that is needed to insert data into the course table.

Update Mapper
Let us see how users can use the Update mapper to update the course table. The Update mapper has the same parameters as the Insert mapper.

The following script shows the SQL code for the Update Mapper:
-- PROCEDURE: public.update_course(character varying, character varying, character varying,
--   character varying, character varying, date, date, bit)

-- DROP PROCEDURE public.update_course(character varying, character varying, character varying,
--   character varying, character varying, date, date, bit);

CREATE OR REPLACE PROCEDURE public.update_course(


in_course_code character varying,
in_description character varying,
in_course_owner character varying,
in_current_course_owner character varying,
in_previous_course_owner character varying,
in_valid_start_date date,
in_valid_end_date date,
in_is_current bit DEFAULT '1'::"bit")
LANGUAGE 'sql'
AS $BODY$
UPDATE public."Course"
SET "Description" = in_description,
"CourseOwner"=in_course_owner,
"CurrentCourseOwner"=in_current_course_owner,
"PreviousCourseOwner"=in_previous_course_owner,
"ValidStartDate"=in_valid_start_date,
"ValidEndDate"=in_valid_end_date,
"IsCurrent"=in_is_current,
"ModifiedDate"= current_date
WHERE "CourseCode"=in_course_code
$BODY$;


The above update mapper will update course details by Course Code. There can be different variants of the update mapper depending on the need; for example, there can be an update mapper keyed not only by Course Code but also by the IsCurrent attribute, as sketched below.
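The following is a minimal sketch of such a variant; the procedure name is an assumption and only illustrates the idea of adding IsCurrent to the filtering criteria:

CREATE OR REPLACE PROCEDURE public.update_course_owner_if_current(
    in_course_code character varying,
    in_course_owner character varying)
LANGUAGE 'sql'
AS $BODY$
-- Update the owner only for the current version of the given course code
UPDATE public."Course"
SET "CourseOwner" = in_course_owner,
    "ModifiedDate" = current_date
WHERE "CourseCode" = in_course_code
  AND "IsCurrent" = B'1'
$BODY$;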

The following script will be used to execute the update mapper:


CALL public.update_course(
'C001',
'Business Analytics',
'Shane Kimpsons',
'Shane Kimpsons',
'Paul Wilson',
'2020-01-01',
'9999-12-31',
B'0'
)

It is important to note that, to improve the update mapper's performance, the necessary indexes should be added. Indexing will be discussed in Chapter 8, Working with Indexes.

Delete Mapper
Let us see how users can use the Delete mapper to delete rows from the course table. Implementing the Delete mapper is much simpler than the Insert or Update mapper, as it requires fewer arguments.

The following script shows the Delete mapper for the course table:
-- PROCEDURE: public.delete_course(character varying)
-- DROP PROCEDURE public.delete_course(character varying);

CREATE OR REPLACE PROCEDURE public.delete_course(


in_course_code character varying)
LANGUAGE 'sql'
AS $BODY$
DELETE FROM public."Course"
WHERE "CourseCode"=in_course_code
$BODY$;

The delete mapper shown above will delete a course by its Course Code. As with the Update mapper, there can be different variants of the Delete mapper depending on the need; for example, a Delete mapper keyed not only by Course Code but also by the IsCurrent attribute.

Let us look at the Unit of Work design pattern.


Unit of Work
The Unit of Work design pattern ensures that only modified data is written to the database instead of writing all of the data again. Let us explain this with an example. An Order entity is typically stored in two tables, which we can view in the following screenshot:

In addition to the standard design, we have included two additional constraints to this
design:

1. A unique constraint is added to the Order Detail table on the columns (Order Number, ProductID), as there can be only one row per product for a given order.
2. The Amount column is a generated column defined as Unit Price * Quantity. Due to this, you do not have to enter any value into the Amount column; it is calculated automatically.

Following is the script for the OrderHeader table:


-- Table: public."OrderHeader"
-- DROP TABLE public."OrderHeader";

CREATE TABLE public."OrderHeader"


(
"OrderNumber" integer NOT NULL,
"OrderDate" date,
"CustomerID" smallint,
CONSTRAINT "OrderHeader_pk" PRIMARY KEY ("OrderNumber")
)

Following is the script for the OrderDetail table:


-- Table: public."OrderDetail"
-- DROP TABLE public."OrderDetail";

CREATE TABLE public."OrderDetail"
(
    "OrderNumber" integer NOT NULL,
    "OrderLine" integer NOT NULL,
    "ProductID" smallint,
    "Quantity" numeric(9,2),
    "UnitPrice" numeric(9,2),
    "Amount" numeric(9,2) GENERATED ALWAYS AS ("Quantity" * "UnitPrice") STORED,
    CONSTRAINT "OrderDetail_pk" PRIMARY KEY ("OrderNumber", "OrderLine"),
    CONSTRAINT "UNQ_OrderNumber_Product" UNIQUE ("OrderNumber", "ProductID"),
    CONSTRAINT "FK_OrderID" FOREIGN KEY ("OrderNumber")
        REFERENCES public."OrderHeader" ("OrderNumber") MATCH FULL
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
)

The important configuration in this table is the additional unique constraint and computed
column for the Amount column.

In this model, there are multiple records per order. In some designs, when the user adds a new order line, the entire order is rewritten even though only one order line has changed; in some cases the entire order is even deleted and re-entered. As you can see, this is a very inefficient technique.

In this design pattern, only the modified records are written, as shown in the following script:
INSERT INTO public."OrderHeader"
    ("OrderNumber", "OrderDate", "CustomerID")
VALUES (1, '2020-01-01', 1);

INSERT INTO public."OrderDetail"
    ("OrderNumber", "OrderLine", "ProductID", "Quantity", "UnitPrice")
VALUES (1, 2, 1, 2, 5)
ON CONFLICT ("OrderNumber", "ProductID")
DO UPDATE SET "Quantity" = 2, "UnitPrice" = 5;

The INSERT ... ON CONFLICT form used above is commonly referred to as UPSERT; a MERGE statement is another way to achieve a similar result.
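As a small refinement, and only as a sketch of the same upsert, the EXCLUDED pseudo-table can be used so that the conflicting row is updated with whatever values were supplied in the INSERT rather than with repeated literals:

INSERT INTO public."OrderDetail"
    ("OrderNumber", "OrderLine", "ProductID", "Quantity", "UnitPrice")
VALUES (1, 2, 1, 3, 5)
ON CONFLICT ("OrderNumber", "ProductID")
-- EXCLUDED refers to the row that was proposed for insertion
DO UPDATE SET "Quantity" = EXCLUDED."Quantity",
              "UnitPrice" = EXCLUDED."UnitPrice";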

Let us look at the Lazy Loading design pattern, which is described in the section below.

Lazy Loading
The lazy load design pattern is an important database design pattern, especially for large-volume databases. The application layer may request a large volume of data but consume only a small portion of it; sometimes users need far fewer records than are generated. The lazy load pattern enables users to retrieve only the data that is needed.

Let us look at the following script:


SELECT *
FROM public."Course"
LIMIT 3
OFFSET 5

The above script retrieves only three records (LIMIT 3), starting from record number 5 (OFFSET 5). If the user needs the next data set, a different OFFSET value can be sent to retrieve it. In practice, an ORDER BY clause should accompany LIMIT/OFFSET so that successive pages are deterministic, as sketched below.
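A minimal sketch of the same query with an explicit ordering column (CourseCode is simply the column chosen here for illustration):

SELECT *
FROM public."Course"
ORDER BY "CourseCode"  -- deterministic page boundaries
LIMIT 3
OFFSET 5;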

When database designers design databases, they are initially looking at a very small volume of data. This can lead to ignoring performance and the future growth of the data. When data grows over time, applications can become unusable due to poor performance. Therefore, database designers have to design the database with an allowance for future data growth.

If you do not include lazy loading, the application may receive the entire table. Initially, this table might have few records; however, over time it will become a large table and the application will not be able to cope with the volume. Therefore, it is essential to follow the lazy load design pattern from the start of the database design phase.

In the next section, let us look at the database pattern to avoid Many-to-many
relationships.

Avoiding Many-to-Many Relationships


It is a common design pattern to avoid many-to-many relationships in database design. Let us see this via an example.

Let us assume that in a large-scale enterprise, multiple staff members are assigned to facilitate each client. This means one client will have multiple staff members, and on the other hand, one staff member will be assigned to multiple clients. So when an invoice is raised against a customer, there will be multiple staff members assigned to that invoice, as shown in the following entity diagram:


As shown in the above screenshot, one invoice has exactly one customer, while there can be multiple staff members for a given invoice. Modeling the Customer and Invoice tables is trivial, as shown in the screenshot below.

The following screenshot shows the Customer and Invoice table design:

In the above table diagram, only key attribute columns are listed.

Now let us see how we can model the Invoice and Staff relationship. Since there are multiple staff members per invoice, you cannot capture this with a single column in the invoice table.

In relational database modeling, many-to-many relationships cannot be implemented directly. To avoid a many-to-many relationship, an intermediate or bridge table is introduced. This means the many-to-many relationship is replaced by two one-to-many relationships.

To facilitate the above requirement, the typical design pattern is to introduce an intermediate table that contains the two columns which originally had the many-to-many relationship. The following screenshot shows how these relationships are maintained:


In the above table design, the lnk_InvoiceStaff table is introduced to facilitate the user requirement. The introduced table contains the InvoiceID and StaffID columns, which are the columns that had the many-to-many relationship in the requirement. The Invoice and lnk_InvoiceStaff tables have a one-to-many relationship via the InvoiceID column, while the Staff and lnk_InvoiceStaff tables have a one-to-many relationship via the StaffID column.

The following screenshot shows the entire model including the Customer table:


In the above model, all the relationships are one-to-many relationships and all the many-to-many relationships are avoided.

The following code block shows the script for the four tables above, and they can be
executed in PostgreSQL:
-- object: public."Customer" | type: TABLE --
CREATE TABLE public."Customer" (
    "CustomerID" smallint NOT NULL,
    "CustomerCode" varchar(8),
    "CustomerName" varchar(50),
    CONSTRAINT "UNQ_CustomerCode" UNIQUE ("CustomerCode"),
    CONSTRAINT "Customer_pk" PRIMARY KEY ("CustomerID")
);

-- object: public."Invoice" | type: TABLE --
CREATE TABLE public."Invoice" (
    "InvoiceID" smallint NOT NULL,
    "InvoiceDate" date,
    "CustomerID" smallint,
    "InvoiceAmount" numeric(15,2),
    CONSTRAINT "Invoice_pk" PRIMARY KEY ("InvoiceID")
);

-- object: public."Staff" | type: TABLE --
CREATE TABLE public."Staff" (
    "StaffID" smallint NOT NULL,
    "StaffCode" varchar(12),
    "StaffName" varchar(50),
    CONSTRAINT "UNQ_StaffCode" UNIQUE ("StaffCode"),
    CONSTRAINT "Staff_pk" PRIMARY KEY ("StaffID")
);

-- object: public."lnk_InvoiceStaff" | type: TABLE --
CREATE TABLE public."lnk_InvoiceStaff" (
    "InvoiceID" smallint NOT NULL,
    "StaffID" smallint NOT NULL,
    CONSTRAINT "lnk_InvoiceStaff_pk" PRIMARY KEY ("InvoiceID", "StaffID")
);

-- object: "FK_CustomerID" | type: CONSTRAINT --
ALTER TABLE public."Invoice" ADD CONSTRAINT "FK_CustomerID" FOREIGN KEY ("CustomerID")
    REFERENCES public."Customer" ("CustomerID") MATCH FULL
    ON DELETE NO ACTION ON UPDATE NO ACTION;

-- object: "FK_Invoice_LNK" | type: CONSTRAINT --
ALTER TABLE public."lnk_InvoiceStaff" ADD CONSTRAINT "FK_Invoice_LNK" FOREIGN KEY ("InvoiceID")
    REFERENCES public."Invoice" ("InvoiceID") MATCH FULL
    ON DELETE NO ACTION ON UPDATE NO ACTION;

-- object: "FK_Staff_LNK" | type: CONSTRAINT --
ALTER TABLE public."lnk_InvoiceStaff" ADD CONSTRAINT "FK_Staff_LNK" FOREIGN KEY ("StaffID")
    REFERENCES public."Staff" ("StaffID") MATCH FULL
    ON DELETE NO ACTION ON UPDATE NO ACTION;

The above script creates the tables as well as the necessary Primary Key, Unique Key, and Foreign Key constraints. A query such as the sketch below can then resolve the relationship through the bridge table.
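The following is a minimal sketch (assuming only the tables above, with whatever data they hold) of how the staff members assigned to each invoice can be listed by joining through the lnk_InvoiceStaff bridge table:

SELECT i."InvoiceID", i."InvoiceDate", s."StaffCode", s."StaffName"
FROM public."Invoice" AS i
JOIN public."lnk_InvoiceStaff" AS l ON l."InvoiceID" = i."InvoiceID"
JOIN public."Staff" AS s ON s."StaffID" = l."StaffID"
ORDER BY i."InvoiceID", s."StaffCode";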

Another important design decision that the database designer has to make is whether the database follows the OLTP or the OLAP model, which we will discuss in the next section.

OLAP versus OLTP



As we discussed in Chapter 1, Overview of PostgreSQL Relational Databases, the business requirements should be considered before deciding on the database design. As indicated before, OLTP is mainly used in transactional systems while OLAP is mainly used in reporting systems. Relational databases are mainly used for OLTP, while multi-dimensional analytical databases are used for OLAP. However, most databases fall into the OLTP design model.

Let us see the key differences between these two types of processing:

Data: OLTP is the source of the data; OLAP data is extracted from OLTP and aggregated for queries.
Transactions: OLTP uses short transactions; OLAP uses long transactions.
Operations: OLTP has roughly equal reads and writes; OLAP is mostly reads with very few writes.
Duration: OLTP operations are of short duration; OLAP operations are of long duration.
Queries: OLTP uses simple queries; OLAP uses complex queries.
Normalization: OLTP uses normalized table structures; OLAP uses de-normalized table structures.
Integrity: In OLTP, integrity is a concern because there are many writes; in OLAP it is not a concern, as most operations are reads.

In OLTP systems, the major design pattern is normalization of the data model.

Let us see a simple table model for the Item master. In the Item master, there are attributes such as Item Code, Item Name, Sub Category, and Item Category. In the following screenshot we can view the OLTP design:


The sample script in the code block below can be executed in PostgreSQL to create those
tables:
-- object: public."Item" | type: TABLE --
CREATE TABLE public."Item" (
    "ItemID" integer NOT NULL,
    "ItemCode" varchar(8),
    "ItemDescription" varchar(40),
    "ItemSubCategoryID" smallint,
    CONSTRAINT "UNQ_ItemCode" UNIQUE ("ItemCode"),
    CONSTRAINT "Item_pk" PRIMARY KEY ("ItemID")
);

-- object: public."ItemSubCategory" | type: TABLE --
CREATE TABLE public."ItemSubCategory" (
    "ItemSubCategoryID" smallint NOT NULL,
    "ItemSubCategory" varchar(50),
    "ItemCategoryID" smallint,
    CONSTRAINT "ItemSubCategory_pk" PRIMARY KEY ("ItemSubCategoryID")
);

-- object: public."ItemCategory" | type: TABLE --
CREATE TABLE public."ItemCategory" (
    "ItemCategoryID" smallint NOT NULL,
    "ItemCategory" varchar(50),
    CONSTRAINT "ItemCategory_pk" PRIMARY KEY ("ItemCategoryID")
);

-- object: "FK_ItemSubCategory" | type: CONSTRAINT --
ALTER TABLE public."Item" ADD CONSTRAINT "FK_ItemSubCategory" FOREIGN KEY ("ItemSubCategoryID")
    REFERENCES public."ItemSubCategory" ("ItemSubCategoryID") MATCH FULL
    ON DELETE NO ACTION ON UPDATE NO ACTION;

-- object: "FK_ItemCategory" | type: CONSTRAINT --
ALTER TABLE public."ItemSubCategory" ADD CONSTRAINT "FK_ItemCategory" FOREIGN KEY ("ItemCategoryID")
    REFERENCES public."ItemCategory" ("ItemCategoryID") MATCH FULL
    ON DELETE NO ACTION ON UPDATE NO ACTION;

Now let us see how the Item tables are designed in an OLAP structure.

In an OLAP system, there are typically two types of tables, namely Dimension and Fact tables.

De-normalized structures are the major design pattern in OLAP models. Therefore, the three tables listed in the previous code block, which describe the Item master, will be converted into a single table as shown in the following screenshot:


Let us see how this table is created in PostgreSQL by using in the following script:
-- object: public."Item" | type: TABLE --
CREATE TABLE public."Item" (
"ItemID" serial NOT NULL,
"ItemCode" varchar(8),
"ItemDescription" varchar(50),
"ItemSubCategoryID" smallint,
"ItemSubCategory" varchar(50),
"ItemCategoryID" smallint,
"ItemCategory" varchar(50),
CONSTRAINT "Item_pk" PRIMARY KEY ("ItemID")
);

All three tables designed during OLTP modeling are collapsed into a single table in the OLAP model. With the OLAP model, fewer joins are needed when data has to be retrieved, and fewer joins mean less processing per read. Since OLAP models are read-heavy, having fewer tables improves the performance of reporting, analytics, and so on, as the comparison sketched below illustrates.
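As a sketch only (assuming both variants of the Item tables above, which would live in separate databases or schemas in practice), the same question answered against each design shows the difference in the number of joins:

-- OLTP (normalized): item counts per category require two joins
SELECT c."ItemCategory", COUNT(*) AS item_count
FROM public."Item" AS i
JOIN public."ItemSubCategory" AS sc ON sc."ItemSubCategoryID" = i."ItemSubCategoryID"
JOIN public."ItemCategory" AS c ON c."ItemCategoryID" = sc."ItemCategoryID"
GROUP BY c."ItemCategory";

-- OLAP (de-normalized): the same answer comes from a single table
SELECT "ItemCategory", COUNT(*) AS item_count
FROM public."Item"
GROUP BY "ItemCategory";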

Let us see what are the common issues in database design patterns in the next section.

Common Issues in Design Patterns


Though design patterns are handy templates for database designers, there can be situations where designers tend to apply patterns every time. Database designers can easily be drawn into the wrong conclusion that design patterns will work in every situation. It is essential to understand that design patterns are guidelines for specific problems and may not work for all problems and all situations.


The following are the considerations that you need to look at when you are implementing
database design patterns.

Database Technology
Depending on the database technology that you have selected, or are forced to use, some database design patterns may not be possible. For example, in PostgreSQL, a plain SELECT statement cannot return a result set from a procedure; you need to use functions or cursors to return query results, as sketched below.
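The following is a minimal sketch of the function alternative; the function name and the filter column are assumptions made for illustration:

CREATE OR REPLACE FUNCTION public.get_courses_by_owner(in_course_owner character varying)
RETURNS TABLE ("CourseCode" character varying, "Description" character varying)
LANGUAGE sql
AS $BODY$
    -- Return the matching courses as a result set
    SELECT c."CourseCode", c."Description"
    FROM public."Course" AS c
    WHERE c."CourseOwner" = in_course_owner;
$BODY$;

-- Usage:
SELECT * FROM public.get_courses_by_owner('Shane Kimpsons');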

Data Volume
Depending on the volume of the database that you are designing, you may have to modify your database design pattern. For example, if you are dealing with a small database, there may be no need for the Lazy Loading design pattern.

Data Growth
If the database you are designing is expected to grow relatively slowly over time, you may decide against implementing some database design patterns.

After discussing database design patterns, let us discuss the Database anti-patterns in the
next section.

Avoiding Database Design Anti-Patterns


Database design patterns serve as templates for common problems that are encountered during database design and usage. Database anti-patterns, in contrast, are practices you should not use, or should at least avoid. Though it is sometimes difficult to avoid them completely, database designers should try to minimize the use of anti-patterns.

There are standard anti-patterns which we have discussed in earlier chapters, such as missing Primary Keys, missing Foreign Keys, and so on. In the following sections, we will discuss a different set of database anti-patterns.


Implementing Business Logic in Database Systems
Database designers have the option of writing business logic in the database using stored procedures and triggers. However, it is recommended to avoid implementing business logic at the database level due to security concerns: business logic implemented at the database level can be accessed by the system administrators. To avoid this unnecessary security threat, it is better to keep the business logic at the application layer rather than at the database layer.

Replace NULL values with Empty Values


Many database designers advocate replacing NULL values with empty values or some other placeholder values. This is mainly done to avoid application development trouble. NULL is a special value and it should not be replaced with any other value. At design time, it is important to define the NULLABLE columns so that application developers can write their code appropriately.

If you examine the following course table definition, you will see multiple nullable
columns:

As shown in the above screenshot, CurrentCourseOwner, PreviousCourseOwner, ValidEndDate, CreatedDate, and ModifiedDate are nullable columns. This means that for all the other columns you need to enter a value, which cannot be null.

However, NULLABLE columns need to be handled specially in scripts.

If you want to list all the rows in the course table where PreviousCourseOwner is null, the following script has to be executed:
SELECT *
FROM public."Course"
WHERE "PreviousCourseOwner" IS NULL;

Similarly, you can get the rows that have a value in the PreviousCourseOwner column by using IS NOT NULL in the WHERE clause.

Triggers
Many database technologies have the option of running code at the database server level through objects such as stored procedures and triggers. Of course, it is more efficient to perform this processing close to the data; otherwise, the data has to be transmitted to the client application and processed at the client end.

However, if too much processing is carried out at the database end, there can be situations where the database cannot keep up with incoming client requests. Therefore, in high-velocity database systems, it is better to avoid triggers and move that code to the client side, or at least keep the trigger logic minimal, as in the sketch below.

Eager Loading
Eager loading is the exact opposite of lazy loading, and it is much simpler to implement. For small data sets such as departments or countries, you can use eager loading, as it will not consume much processing to load the data. However, you need to make sure that the data volume will not grow rapidly: today you might have a small data set, but in the future it may become large, which means eager loading may work today but not tomorrow. As a database designer, you should have a vision of the future.

Recursive Views
Views are used to provide easy access for end users, and a view typically combines a set of tables. There are instances where views are also used inside other views. As a database designer, it is better to avoid multiple layers of views nested inside views. Deeply nested views can cause a lot of maintenance issues, even though views are introduced to ease the management of data.

Summary
In this chapter, we discussed database design patterns. We said that database design patterns can be used as templates for common database problems. It was also mentioned that patterns cannot be used blindly, as several modifications may have to be made to the defined patterns due to various environmental conditions.

Among the database design patterns, we identified Data Mapper, Unit of Work, and Lazy Loading as the three main patterns. Data Mapper bridges the database and the application. Unit of Work and Lazy Loading are mainly used as tools to counter potential performance issues that will occur in the future.

In database modeling, it is essential to decide whether the database model is OLAP or OLTP. OLAP is geared towards analytical and reporting systems, whereas OLTP is more suited to transactional systems. For OLAP, we saw that de-normalized structures are preferred so that reporting can be done much faster.

We also discussed that it is essential to eliminate many-to-many relationships. We observed that a many-to-many relationship can be replaced by introducing an intermediate table containing the two many-to-many attributes; with this technique, the many-to-many relationship is replaced with two one-to-many relationships.

Apart from database design patterns, there are database design anti-patterns, which are what we should avoid. We mainly identified three anti-patterns in this chapter: implementing business logic in the database layer, triggers, and recursive (nested) views.

Exercise
1. What are the possible database design patterns that can be used for the database model that you have built so far?
2. List the anti-patterns that could arise in the above design.
3. Explain the limitations of the design patterns with respect to your design.


Questions
Why do database designers have to be careful when selecting database design patterns?

It is important to choose the correct design pattern. Also, some design patterns may be outdated because they were introduced a decade or two ago. Since we now have fast memory, superior CPUs, and less costly storage, some design patterns may no longer be valid. Refer to the section Identifying Issues in Software Patterns.

Why should you be careful about using database design patterns as they are?

Database design patterns act as templates for existing common problems; that is, they provide basic guidelines. However, you may have a different environment and domain. Therefore, you need to be careful and analyze your problem thoroughly. Refer to the section Identifying Issues in Software Patterns.

What are the issues with implementing the Data Mapper layer?

Though there are a lot of advantages to the Data Mapper database design pattern, it is important to note that a large number of procedures will be created, which can be difficult to manage. Also, managing security on those procedures has to be done methodically.

Why is the Lazy Loading design pattern important even if the database is not large at the time of designing?

Databases tend to grow rapidly over time, mainly due to business expansion. If lazy loading is not considered at design time, there can be performance issues when the data load becomes high. Lazy loading retrieves only a small part of the data at a time; eager loading, on the other hand, extracts all the data, so the application has to wait until all the data is received, which causes performance issues for the end users.

How can many-to-many relationships be avoided in the database design?

To avoid a many-to-many relationship, an intermediate table is introduced containing (at least) the two columns that had the many-to-many relationship. The intermediate table then has two one-to-many relationships with the original tables that previously had the many-to-many relationship. Refer to the section Avoiding Many-to-Many Relationships.

Why are de-normalization design patterns used in OLAP modeling?

OLAP models are mostly used for analytics or reporting, which means they have many reads and few writes. When there are more reads, it is better to have fewer tables, because fewer tables mean fewer joins and less processing to retrieve data. In OLTP table structures there are many tables, and reading data requires a large number of table joins, leading to higher processing consumption. This is why de-normalization is used in OLAP modeling. Refer to the section OLAP versus OLTP.

What factors should be considered when implementing database design patterns?

Database technology, data growth, and data volume are the important factors to consider when deciding whether to implement a database design pattern.

Why are triggers not recommended in a database design?

When a lot of processing takes place at the database end, the database consumes a lot of processing resources, which may block users from accessing the database. Therefore, triggers are not recommended in database design. If triggers are unavoidable, at least make sure that the trigger code is not complex so that it completes quickly.

Further Reading
Merge: https://www.postgresql.org/message-id/attachment/23520/sql-merge.html
Upsert: https://www.postgresql.org/docs/9.5/sql-insert.html
Computed Columns: https://www.postgresql.org/docs/12/ddl-generated-columns.html
OFFSET & LIMIT: https://www.postgresql.org/docs/9.3/queries-limit.html
Anti-Patterns: https://docs.microsoft.com/en-us/azure/architecture/antipatterns/busy-database/
Triggers: https://www.postgresql.org/docs/9.1/sql-createtrigger.html
Null: https://www.tutorialspoint.com/postgresql/postgresql_null_values.htm

8
Working with Indexes
Until now, we have discussed basic strategies for database design. We identified the different steps in designing a database: conceptual database modeling, logical database design, and physical database modeling. We emphasized the need for the E-R model in order to have better communication between technical and non-technical users. During logical database modeling, we understood the importance of data normalization at different levels; by applying database normalization, we were able to capture a better data model. Then we discussed database design patterns and anti-patterns for different common problems. From this chapter onwards, we will be looking at how a database is used in practice. Though a well-designed database can satisfy the functional requirements, non-functional requirements are a very important part of database design.

Among the non-functional requirements of a database, performance is an important element: it defines how usable your database is. There is no question that proper database modeling contributes to higher database performance. However, design alone will not make the database perform well. Indexing is an important concept in databases; indexes make database access more efficient, but applying them should be done strategically.

Further, there are a lot of misconceptions about indexes, largely because many engineers and database designers do not fully understand them. Those misconceptions will be addressed, and this chapter discusses the path you should follow when you are not sure about indexes.

In this chapter, we will look at the different types of indexes that PostgreSQL offers and when we should use them. We will also discuss a few field notes with regard to indexes. Since indexes are used heavily in the industry, it is essential to understand their practical usage.

The following topics will be covered in this chapter:

Non-Functional requirements of Databases
Indexes in Detail
Implementing Indexes in PostgreSQL
Complex Queries
What is Fill Factor
Disadvantage of Indexes
Maintaining Indexes
Different Types of Indexes
Field Notes


Let us discuss the non-functional requirements (NFRs) of a database in the following section.

Non-Functional requirements of Databases


Non-functional requirements for any system define the quality of the system. With respect to databases, non-functional requirements define the quality of use of the database, whereas functional requirements mainly deal with the functional aspects of the application that is connected to the database.

For example, tables, views, and procedures define how the database functions for end users. When a user executes a stored procedure with specific inputs, they get an output; when they execute OrderList with a date parameter, they receive all the orders that were raised on that day. This is a functional requirement of the application.

In the above scenario, if it takes more than one hour to retrieve that data, users will simply not use the procedure. This illustrates the importance of non-functional requirements. Apart from performance, there are other elements: scalability, capacity, availability, reliability, recoverability, maintainability, serviceability, security, regulatory compliance, manageability, data integrity, usability, and interoperability are the main elements of database non-functional requirements.

Apart from business viability, there are many compliance requirements that need to be satisfied. For example, Sarbanes-Oxley and UK data protection law are major regulations that have to be complied with. In addition, there are domain-specific compliance requirements. These are also considered non-functional requirements.

During the design phase, the major consideration is usually the functional design, not the non-functional requirements. However, after implementation, the system or database will not be used if the non-functional requirements are not covered. Therefore, it is essential to implement them.

Most of the time, non-functional requirements are implemented at the application level. However, since the database is a core component of the system, it is essential to implement non-functional requirements at the database level as well.

Let us look at the above mentioned Non-Functional Requirements in the following sections.


Performance
The database stores data created by users through applications, so it tends to accumulate a large volume of data, and with a large volume of data, data access tends to become slow. Typically, just after the implementation of a system, database access is not a problem because the data load is very low; however, as data grows over time, data access becomes slower. Therefore, it is essential to address performance at the design stage rather than fixing it in production.

Though database design and the selection of proper data types will improve database performance, special attention is still needed for performance. Indexes are the main mechanism used to improve performance.

There are different types of indexes to cater to different needs in database technologies, such as clustered indexes, non-clustered indexes, and column store indexes, and there are different ways to implement these indexes in different database technologies.

In this chapter, we will discuss different scenarios of indexes and different types of indexes
that can be implemented in PostgreSQL.

Security
Since data is considered one of the most valuable assets in any organization, it is needless to stress that you need to protect it. When data is accessed by multiple users in the organization, it is important to distinguish those users. In addition, different users have different levels of access to database objects. For example, the user Joe can access the customer table but cannot modify its data, while the user Alice can access the same table and can also write to it.

Similarly, two different users may both have access to a Project table, but one can view only a few columns whereas the other can retrieve all of them; different users may also be allowed to see different rows. With so many combinations, security in a database is a complex but important process.

Database security is discussed in terms of Authentication, Authorization, and Encryption. It is important to note that, in an actual implementation, a mix of authentication, authorization, and encryption is used to secure your data from unauthorized access.

Security will be discussed in Chapter 9, Securing the Database in detail.


Scalability
A key aspect of a database is the velocity and volume of the stored data: the database size will increase in the future, sometimes exponentially. Therefore, it is essential that the database supports scalability so that end users are not impacted by the growing data volume.

Indexes are one method that can be used to improve the scalability of database systems.

Availability
As stressed in many places, the database is a core component of the system, so it is important to provide continuous service from the database. Even when there are hardware failures, such as network or storage failures, you still need to provide continuous database access to applications and users.

Though it is ideal to achieve full availability, there are situations where this is impossible, mainly for practical, technical, and cost reasons. Therefore, organizations sometimes prefer to step back from full availability. The different availability configurations are read-only and deferred operations; partial, transient, or impending failures; and partial end-to-end failure. This is discussed in Chapter 3, Planning a Database Design.

Most of the time, infrastructure technologies are the main means of providing availability for databases. However, there are situations where some level of availability must be provided by the database itself, and different databases offer different technologies for this.

The following diagram is the most common design for Availability from the database end:


The method of synchronization is implemented differently in different database technologies.

Let us look at what synchronous and asynchronous replication are.

Synchronous Replication

In synchronous replication, when a transaction is committed on the primary database, the transaction completes only once the data has been replicated to the secondary database. Though this is considered the safest mode for data, the transaction duration increases, thus reducing the performance of the transaction.

In addition to the reduction in performance, there is a higher chance of transaction failures. Due to network failures, among other reasons, if the transaction fails at the secondary database, the entire transaction is rolled back; because of this, synchronous replication is not very popular among database administrators and users.

However, if you are looking for automatic failover for the database, you need to configure synchronous replication.

Asynchronous Replication

In asynchronous replication, data is replicated by a separate process after the transaction has completed on the primary database. This can be done by means of log files or database tables, and it can run on a different schedule.

Though this technique does not allow automatic failover and can result in data loss, it is more popular among database administrators, mainly because it is less complex. Further, this technique is largely resilient to network failures, since it can restart from where it was suspended.
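As a sketch only (the standby name standby1 is an assumption, and a full setup also requires streaming replication to be configured), the following server settings control whether commits on the primary wait for a standby in PostgreSQL:

-- Synchronous: a commit waits until at least one named standby confirms it
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
ALTER SYSTEM SET synchronous_commit = 'on';

-- Asynchronous: commits do not wait for any standby
ALTER SYSTEM SET synchronous_standby_names = '';

-- Reload the configuration so the changes take effect
SELECT pg_reload_conf();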

Recoverability
As discussed multiple times, the database is a key component of the system and holds data, which is a key asset of the organization. With respect to databases, there are two important parameters for recoverability: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO).

Let us look at what is RPO and RTO in detail.

Recovery Point Objective (RPO)

The Recovery Point Objective describes the state (point in time) to which the system can be recovered; in simple terms, it quantifies the potential data loss. Obviously, the business needs a minimal RPO for successful operation.

Recovery Time Objective (RTO)

The Recovery Time Objective (RTO) is the time taken to recover the system; in simple terms, it is the downtime. Like the RPO, it is essential to minimize the RTO in database systems, and in any database system it is very important to take measures to reduce both RPO and RTO for the successful operation of the business, as the example below illustrates.
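As a purely hypothetical illustration (the figures are assumptions, not from the original text): if transaction log backups are taken every 15 minutes, the worst-case RPO is roughly 15 minutes of lost transactions; if restoring the latest full backup and replaying those logs takes 45 minutes, the RTO is roughly 45 minutes of downtime.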

Let us discuss the Interoperability aspect of databases.


Interoperability
As we discussed in Chapter 1, every business operates in a heterogeneous environment with respect to data, tools, and processes. Therefore, it is important to have the ability to connect and communicate with different systems, devices, applications, or products. Since the database is the core component of the system, it must be able to connect with other systems and data.

Since we have dedicated this chapter to Indexes, let us discuss Indexes in detail.

Indexes In Detail
Let us consider the example of a book. If you are asked to search for the keyword Database Designers in a given book, what would be your strategy? Obviously, you would first look at the table of contents, and if the relevant keyword exists you would go directly to that page. Following is a screenshot of a book's table of contents.

From the above screenshot, it can be observed that page 22 has the relevant content.

Sometimes the keyword does not appear in the table of contents. In that situation, you would refer to the index at the back of the book, as shown in the screenshot below:


As shown in the above screenshot, users can get all the relevant page numbers for a given keyword; for example, the triggers keyword appears on page 245 and on pages 263 to 267. Imagine what would happen if you did not have this index: you would be flipping through the entire book page by page, which is not a pleasant experience for a reader. As these examples show, indexes are very helpful for easy and fast access to data.

Database indexes are like ordered data, or an ordered subset of your data. For example, in a transaction table you can order the rows by OrderID; if you need to search for a given order number, you do not have to scan the entire table, you can access that order number directly. Further, if you want to search the order table by Customer ID, you can create a subset of the data with ordered CustomerID values and the relevant OrderIDs; a search by CustomerID then only needs to scan that subset to fetch the relevant IDs.


There are different types of indexes in many database technologies, and PostgreSQL has the
following indexes for different uses:

B-Tree
Hash
GiST
Generalized Inverted Index (GIN)
SP-GiST
Block Range Index (BRIN)

Let us look at these indexes in detail.

B-Tree
B-Tree stands for Balanced Tree, not Binary Tree. All the leaf nodes are at an equal distance from the root, and a parent node can have multiple children, minimizing the tree depth. This reduces the traversal path for data retrieval.

B-Trees were first proposed by Rudolf Bayer and Edward M. McCreight when they were working at Boeing Research Labs. Initially, the B-Tree was introduced for the purpose of efficiently managing index pages for large random-access files. However, Bayer did not disclose what the B stands for; some have suggested that it may stand for Bayer or Boeing.

PostgreSQL has adopted the B-Tree variant defined by Philip L. Lehman of Carnegie-Mellon University and S. Bing Yao of Purdue University.

The following screenshot shows a diagram of the B-Tree structure:


Source: Efficient Locking for Concurrent Operations on B-Trees, Philip L. Lehman (Carnegie-Mellon University) and S. Bing Yao (Purdue University), https://www.csd.uoc.gr/~hy460/pdf/p650-lehman.pdf

Let us see how it will search for Key 56:


From the above screenshot, it is clear that searching for a specific value using a B-Tree is far more efficient than reading the data without any order.

Hash
Hash indexes are helpful when the indexed values are large (for example, more than 8 KB), since only the hash of the value is stored. The only operator that can be used with a Hash index is the equality (=) operator.

GiST
The GiST index is used for overlapping data such as geometries. GiST allows the development of custom data types with an appropriate access method, and range values are used for indexing. This index relies on key support functions such as UNION and DISTANCE:

UNION: used when inserting, if the range value has changed.
DISTANCE: used for ORDER BY and nearest-neighbor searches.

The GiST index is therefore most helpful for finding the nearest neighbor, the shortest path, and so on.


Generalized Inverted Index


GIN, or Generalized Inverted Index, is used to index arrays, JSON, and tsvector values, and it is useful for full-text search. Internally, GIN organizes its keys in a B-Tree structure. When an array is indexed, the array is split and each element is treated as a separate entry.

SP-GiST
Like the GiST index, SP-GiST allows the development of custom data types. SP-GiST is intended for non-balanced data structures, and this type of index can be used to store points.

Block Range Index (BRIN)

A BRIN index is not a B-Tree index; in fact, it is not a tree structure at all. A BRIN index stores summary information for ranges of physically adjacent table blocks, which is why BRIN indexes are typically very small in size.

Next, let us look at how indexes can be implemented in PostgreSQL.

Implementing Indexes in PostgreSQL


In PostgreSQL, an index can be created using any of the access methods that we discussed in the Indexes In Detail section.

The following screenshot shows how to create an index using an access method.


The B-Tree access method is the most common and is the default access method when creating an index; equivalent SQL for the other access methods is sketched below.
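The following is a minimal sketch of creating indexes with different access methods; the index names are assumptions, and the tables are the ones used earlier in the book:

-- Default B-Tree index
CREATE INDEX idx_orderdetail_product_btree ON public."OrderDetail" USING btree ("ProductID");

-- Hash index, usable only for equality comparisons
CREATE INDEX idx_course_code_hash ON public."Course" USING hash ("CourseCode");

-- GIN index on a tsvector expression for full-text search
CREATE INDEX idx_course_description_gin ON public."Course" USING gin (to_tsvector('english', "Description"));

-- BRIN index, suited to very large tables whose values follow the physical row order
CREATE INDEX idx_orderdetail_order_brin ON public."OrderDetail" USING brin ("OrderNumber");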

When indexes are created in PostgreSQL, they can be seen in pgAdmin, as shown in the screenshot below:

The B-Tree index is the most popular index in PostgreSQL. For tables, there are mainly two kinds of indexes to consider: clustered and non-clustered indexes.

Clustered Index
A clustered index determines the physical order of your data: when a clustered index is defined, the table data is physically stored in the order of the index key.

Let us see what are the best practices for the Clustered indexes.

Clustered Index Best Practices


Since clustered indexes are an important concept, the clustered index should be selected strategically:

Ideally, it is recommended to have a clustered index for every table; this can be skipped for small, static tables.
The clustered index key should be small in size; columns such as names and dates should be avoided.
Columns whose values change over time should not be selected as clustered index keys.
Integer, auto-increment columns are better choices for a clustered index.

Clustered Index Scenarios


Since there are a lot of misconceptions and myths about clustered indexes, let us look at a
few scenarios with a sample table.

Let us look at this OrderDetail table as shown in the following screenshot.

Following is the script to create the OrderDetail table:


CREATE TABLE public."OrderDetail"
(
"OrderNumber" integer NOT NULL,
"OrderLine" integer NOT NULL,
"ProductID" smallint,
"Quantity" numeric(9,2),
"UnitPrice" numeric(9,2),
"Amount" numeric(9,2)
)


To demonstrate multiple scenarios, a sufficient number of records (100,000+) were inserted; the following screenshot shows a few of them:

Whenever there is doubt about index usage, the go-to approach is to examine the query plan, which can be produced as sketched below.

Let us look at different scenarios to verify the usage of indexes.
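A minimal sketch of producing such a plan for the first query in this section (EXPLAIN shows the estimated plan; EXPLAIN ANALYZE also executes the query and reports actual timings):

EXPLAIN ANALYZE
SELECT "OrderNumber", "OrderLine", "ProductID", "Quantity", "UnitPrice", "Amount"
FROM public."OrderDetail"
WHERE "OrderNumber" = 48087;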

With and Without Index


First, we will look at the impact on a query with and without an index.

Let us execute the following query in OrderTable without any index:


SELECT "OrderNumber", "OrderLine", "ProductID", "Quantity", "UnitPrice",
"Amount"
FROM public."OrderDetail"
WHERE "OrderNumber" = '48087'

This query executed in less than one second and retrieved 31 rows.

Though we always tend to look at query execution duration as the basic parameter to assess query performance, it can often be misleading. In a production situation, when multiple queries are executed concurrently, one query might block another; although this reduces the blocked query's performance, the real cause of the issue is the other query. Further, there can be situations where database backups or index rebuilds consume resources, which also leaves query performance slower.

Following is the query plan for the above query when no indexes are present:

The above screenshot shows that to retrieve 31 records, 121,286 records had to be scanned and filtered.

Let us create a clustered index for OrderDetail on OrderNumber and OrderLine:
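For reference, the following is one way this can be scripted in PostgreSQL (the index name is an assumption); note that CLUSTER physically reorders the table once, rather than maintaining the order continuously:

CREATE INDEX "IX_OrderDetail_OrderNumber_OrderLine"
    ON public."OrderDetail" USING btree ("OrderNumber", "OrderLine");

-- Physically reorder the table rows according to the index
CLUSTER public."OrderDetail" USING "IX_OrderDetail_OrderNumber_OrderLine";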


Let us see the query plan for the same query after the index was added:


Now the query plan has changed and only relevant rows are retrieved.

Different WHERE Clause


Since we can select the data with different WHERE clauses, let us analyze the behavior of the table with a clustered index for the following two queries.
--Query 1
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderNumber" = 58950;

--Query 2
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderLine" = 68519;

The only difference between the above two queries is that the WHERE clauses use different columns; both columns, however, are part of the clustered index.

Let us see the query plan for both queries as shown in the below screenshot:


From the EXPLAIN plan, it can be seen that both queries are behaving similarly.

Interchanging Where Clause


Let us analyse another scenario of the Clustered index using the following two queries.

In this scenario, the WHERE clause is interchanged:


--Query 1
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderNumber" = 58950
AND "OrderLine" = 68519;

--Query 2
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderLine" = 68519
AND "OrderNumber" = 58950;

Following is the query plan for both queries:


This shows that interchanging the columns of an AND-combined WHERE clause has no
impact on query performance.

Interchanging SELECT Clause


Let us see what the performance impact is of interchanging the SELECT clause columns.
Let us look at a scenario where the only difference is the column order in the SELECT
clause:
--Query 1
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderNumber" = 58950
AND "OrderLine" = 68519;

--Query 2
SELECT "OrderLine", "OrderNumber"
FROM public."OrderDetail"
WHERE "OrderLine" = 68519
AND "OrderNumber" = 58950;

Following is the query plan for both queries:


As we observed in the previous two scenarios, we are seeing the same query plan for both
queries, which means that there is no performance impact when the order of the columns
in the SELECT clause is changed.

ORDER BY Clause
Typically, we use the ORDER BY clause when we need to explicitly order the result set.

First, let us look at the following query:


SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderNumber" = 58950
ORDER BY "OrderLine";

The following is the query plan for the above query:


Since the clustered key column is used in the ORDER BY clause, the ORDER BY has no
impact on either the performance or the results. If you recall, the above is the same query
plan that was observed without the ORDER BY clause.

Now let us see a query with a different column for ORDER BY clause:
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderNumber" = 58950
ORDER BY "Amount";

Let us see the query plan for the above query:

Now you will see a different query plan altogether. Since sorting is done on
the Amount column, and that column is not part of the clustered index, an explicit sort has
to be performed.

This means sorting behaviour depends on the clustered index columns.

AND and OR in the WHERE Clause


Most users think that there is no difference between using AND and OR conditions with
respect to performance.

Let us look at this scenario, where the only difference between the two queries is the AND
versus the OR condition.
--Query 1
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderNumber" = 58950
AND "OrderLine" = 68519;

--Query 2
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderLine" = 68519
OR "OrderNumber" = 58950;

The following is the query plan for the second query, as we have already seen the query
plan for the first query:

From the above query plan, it is clearly visible that the query with the OR condition is more
complex and time-consuming than the one with the AND condition.

Additional Columns in the SELECT Clause


Let us analyze another scenario where the only difference is the addition of another column
to the SELECT clause.

The new column is not part of the clustered index:


--Query 1
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderNumber" = 58950
AND "OrderLine" = 68519;

--Query 2
SELECT "OrderNumber", "OrderLine","ProductID"
FROM public."OrderDetail"
WHERE "OrderLine" = 68519
AND "OrderNumber" = 58950;

Let us see the difference in the query plan in the following screenshot:

In the first query, the required data is available in the index structure itself, while in the
second query the leaf-node (table) data has to be accessed to retrieve the additional column.
Therefore, a different operation is needed to get the data.

Next, let us look at the details of the Non-Clustered Index.

Non-Clustered Index
A non-clustered index can be considered equivalent to the index at the back of a book. If
you want to search for a keyword, rather than searching through the book content, you first
search the index. Once you find the keyword, you also find the page number next to it.

Then you refer to the relevant page to get the relevant content, as shown in the below
screenshot:

In a non-clustered index in a database, a subset of the data is created. For example, we can
create an index on the ProductID column of the OrderDetail table.

This index can be created from the following script:


CREATE INDEX "IX_OrderDetail_ProductID"
ON public."OrderDetail" USING btree
("ProductID" ASC NULLS LAST)

Let us analyze the non-clustered index with different scenarios, starting with the
following query:
SELECT "OrderNumber",
"OrderLine"
FROM public."OrderDetail"
WHERE "ProductID" = 797;

First, run this query without an index; the following is the query plan.


This shows that to search for records that satisfy the criteria, it has to do a table scan.

Let us create an index on the ProductID column, execute the same query, and verify the
query plan:

In this query, since there is an index on ProductID, the index is searched first to locate the matching rows.

As we discussed before, you can have only one clustered index per table.
However, you can create many non-clustered indexes per table. Since
indexes slow down writes to the table, when creating indexes for
transaction-oriented tables it is essential to identify the optimum
number of indexes. In the case of an analysis system, you can create any
number of non-clustered indexes.

Let us look at more complex queries with Clustered and Non-Clustered index together.

Complex Queries
In the real world, we will have to incorporate many tables with different types of joins such
as INNER JOIN and LEFT JOIN. These types of queries will have combinations of clustered
and non-clustered indexes, and they need special attention when it comes to EXPLAIN
plans. Let us look at the behavior of queries when multiple tables are joined and when an
ORDER BY clause is used in the coming sections.

Multiple Joins
In most of the cases, multiple tables are joined together to obtain the required results.

Let us join the OrderDetail table with the Product table. In the Product table, ProductID is
the Clustered index whereas OrderDetail's ProductID is the non-clustered key.

Let us examine the following query:


SELECT "OrderNumber",
"OrderLine" , "Name", "Amount"
FROM public."OrderDetail"
INNER JOIN public."Product"
ON public."OrderDetail"."ProductID" = public."Product"."ProductID"
WHERE public."Product"."Color" = 'Blue'
ORDER BY "Amount";

The following screenshot is the query plan for the above query:

Since there is no index on the Color column, it has to be filtered by doing a table scan.

If this is a frequently running query, it is better to add a non-clustered index on the Color
column using the below script:
CREATE INDEX "IX_Product_Color"
ON public."Product" USING btree
("Color" COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;

Now let us execute the same query after the index creation on the Color column.


The following is the query plan:

Since the non-clustered index is created for the Color column, instead of a table scan on
the Product table, an index scan is done, which improves the performance of the query.

When non-clustered indexes are defined, it is better to look at both the
search conditions and the joining conditions.

When more than two tables are included in a query, two tables are joined first and the
other tables are added one by one.

Let us examine the following query:


SELECT "OrderNumber",
"OrderLine" , "Name", "Amount"
FROM public."OrderDetail"
INNER JOIN public."Product"
ON public."OrderDetail"."ProductID" = public."Product"."ProductID"
INNER JOIN public."ItemCategory"
ON public."ItemCategory"."ItemCategoryID" = public."Product"."CategoryID"
WHERE public."Product"."Color" = 'Blue'
ORDER BY "Amount";

The following is the query plan for the above query.


As you can see from the above query plan, Product and ItemCategory tables are joined first
and then the OrderDetail table is joined.

GROUP BY Condition
For aggregate functions, indexes play a key role.

Let us look at this by using a sample query:


SELECT "Name", SUM("Amount")
FROM public."OrderDetail"
INNER JOIN public."Product"
ON public."OrderDetail"."ProductID" = public."Product"."ProductID"
GROUP BY public."Product"."Name"

Since there is no index on the Name column, a table scan has to be done for the Product
Table as shown in the below screenshot:

Let us run the same query after adding an index on the Name column, as shown in the
following script:
CREATE INDEX "IX_Product_Name"
ON public."Product" USING btree
("Name" COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;

After the above index is added, you will observe that there is no change to the query plan
since we are extracting the entire data set.
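If, on the other hand, the query is restricted to a single product name, the index on the Name column can be used to locate the matching rows before aggregation. The following is only an illustrative variant, and the product name value is hypothetical:

SELECT "Name", SUM("Amount")
FROM public."OrderDetail"
INNER JOIN public."Product"
ON public."OrderDetail"."ProductID" = public."Product"."ProductID"
WHERE public."Product"."Name" = 'Mountain Bike'
GROUP BY public."Product"."Name";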

Let us look at the importance of INCLUDE Indexes in the following section.


INCLUDE Indexes
Let us assume that you want to select columns that are not part of the non-clustered index.
However, you have an index for the where clause columns. This means that to locate the
records, the index can be used but to retrieve data you need to refer back to the data pages.
This will have a performance impact.

In legacy days, there was a technique called covering indexes. A covering
index included all the columns referenced by the relevant query.
However, this resulted in a large index. The large index size resulted in
more index fragmentation, which can have a negative performance impact.
Therefore, the covering index option was not very popular among database
administrators.

The INCLUDE index was introduced in most databases in order to replace covering indexes.
In an INCLUDE index, the included columns are kept only at the leaf nodes; in a covering
index, all the columns are kept at all nodes, including the leaves and branches.

The following screenshot shows how to create an INCLUDE index in the PostgreSQL:


In the above index, the Color column is the index key while the Name and FinishedGoodsFlag
columns are configured as included columns.

Following is the script to create the above include indexes:


-- drop the earlier index of the same name before recreating it with INCLUDE columns
DROP INDEX IF EXISTS "IX_Product_Color";

CREATE INDEX "IX_Product_Color"
ON public."Product" USING btree
("Color" COLLATE pg_catalog."default" ASC NULLS LAST)
INCLUDE("Name", "FinishedGoodsFlag");

It is important to note that, in an INCLUDE index, the included columns
are not helpful for searching; they are only helpful for the columns in
the SELECT list.

Let us look at the two query plans, with and without the INCLUDE index, for the following
query:

SELECT "Name", "Color", "FinishedGoodsFlag"
FROM public."Product"
WHERE "Color" = 'Blue'

The following screenshot shows the query plan for the above query without include index
and with include index respectively:

The above query plans show that the plan with the INCLUDE index is better than the one
using the index without included columns.

The Fill factor is an important configuration in indexes which will be discussed in the
following section.

What is Fill Factor


Physically, data is stored in pages in a database. What happens when a page is already full
and its data is updated? When the data page does not have sufficient space, the page has to
be split and pointers have to be rearranged. This means that when pages are filled with
data and there is a tendency for frequent updates, there will be a performance impact. To
avoid this, we can keep additional free space for future updates. How much space you
wish to keep is defined by the FILL FACTOR.

In the following example, a FILL FACTOR of 80 means that every page will be filled only to
80%, and the remaining 20% will be kept for updates:


Similarly, the fill factor can be adjusted from a script as shown in the below script:
CREATE INDEX "IX_Product_Category"
ON public."Product" USING btree
("CategoryID" ASC NULLS LAST)
WITH (FILLFACTOR=80)

However, you cannot keep a very low value for the FILL FACTOR, as it will leave unnecessary
free space in the data pages. This will result in a large number of data pages, and to retrieve
data, many pages have to be traversed.

The default value of the FILL FACTOR is 100. This means by default all
data pages will be fully filled. The best-recommended value for the FILL
FACTOR is 80. Further, there are instances where FILL FACTOR is set to
90.

It is important to note that the FILL FACTOR value is not maintained continuously; it is
only applied during index creation and reindexing.
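For an existing index, the fill factor can also be changed afterwards, and the new value only takes effect when the index is rebuilt. A minimal sketch, assuming the IX_Product_Category index created above:

ALTER INDEX "IX_Product_Category" SET (FILLFACTOR = 90);

-- the new fill factor is applied only when the index is rebuilt
REINDEX INDEX "IX_Product_Category";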

Let us discuss the disadvantages of indexes in the following section.

Disadvantages of Indexes
We have been discussing the advantages and usages of indexes, and in previous discussions
we identified the advantages of indexes with scenarios. This might lead you to think that
you can add any number of indexes at will. However, there are disadvantages to indexes as
well.

Performance of Insert Queries


As we have discussed in the Performance section and a few other places, an index will
improve read performance. However, when data is inserted, the index has to be updated. If
you look at the B-Tree structure we discussed in the B-Tree section, when there are data
insertions the structure has to be changed. That is the reason it is recommended to have
sequential, small columns for a clustered index.

Further, if there are many non-clustered indexes, those indexes have to be updated.
Therefore, when adding non-clustered indexes for the transaction-oriented systems, it is
essential to have an optimum number of indexes. In the case of analytical systems, though
it is recommended to have many indexes to improve read queries, during the Extract-
Transform-Load (ETL) it is essential to disable the indexes and re-enable them after the ETL
is completed.

Storage
A non-clustered index is not part of the existing table; instead, it is stored separately, with
pointers back to the table.

Recall the analogy of the index at the back of a book: when you have
additional indexes, you need additional pages to hold them.

Because of this, database storage will increase. When the database size increases, database
backup and restore times increase as well. Although storage is not a huge concern these
days, the increase in other maintenance tasks should be considered when designing indexes
for the database.

Let us look at the options for index maintenance in the following section.

Maintaining Indexes
When indexes are present, over time, data deletes and inserts will fragment them.
Fragmented indexes have a negative impact on query performance.

Therefore, indexes have to be reindexed at a suitable frequency. During index maintenance,
CPU and memory are consumed and user queries will be impacted, so it is essential to
choose a maintenance window in which fewer user queries are affected.

Reindexing is also needed when the indexes are corrupted due to various hardware and
software reasons. In PostgreSQL, reindexing can be done at three levels, DATABASE,
TABLE, and INDEX levels.

Following is the option in pgAdmin to REINDEX an index:

Similarly, this can be done using a script as shown in the below script:
--REINDEXING INDEX
REINDEX INDEX "IX_Product_Category";

--REINDEXING A TABLE
REINDEX TABLE public."Product"

--REINDEXING A DATABASE
REINDEX DATABASE "SampleDatabase"

When reindexing is applied at the database level, it consumes significant resources and
time. Therefore, it is recommended to reindex at the index level rather than the database
level.
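On PostgreSQL 12 and later, the CONCURRENTLY option can be added so that the rebuild blocks user queries less; a sketch using the same index as above:

REINDEX INDEX CONCURRENTLY "IX_Product_Category";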

Apart from the indexes discussed, there are a few other index options available in other
database technologies, which will be discussed in the following section.


Different types of Indexes


Though PostgreSQL does not support all of them natively, there are a few common index
types supported by other database technologies. Two of those are the Filtered Index and the
Column Store Index.

Filtered Index
With a filtered index, you have the option of creating an index on a selected subset of the
data. For example, on an invoice table, a filtered index lets you create an index for a
selected region only. The advantage of this index is that the index storage is reduced;
hence, maintenance of filtered indexes is much simpler.
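PostgreSQL provides an equivalent capability natively through partial indexes, where a WHERE predicate restricts the rows that are indexed. The following is only a sketch; the Invoice table and its Region and InvoiceDate columns are assumptions for illustration:

CREATE INDEX "IX_Invoice_AsiaRegion"
ON public."Invoice" USING btree
("InvoiceDate" ASC NULLS LAST)
WHERE "Region" = 'Asia';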

Column Store Index


All the indexes we have discussed are row-based indexes. If you need to read all the
records in a table, row-based storage is not efficient. Therefore, column store indexes were
introduced. For example, Online Analytical Processing (OLAP) cubes have to read all the
records in a fact table; if you implement a column store index on the fact table, OLAP cube
processing will be much faster.

Only one column store index can be created per table, and depending on the database
technology you are using, there are clustered and non-clustered column store indexes.
Further, in some database technologies, tables cannot be updated or inserted into when a
column store index is implemented. In those technologies, column store indexes are not
suitable for transaction-oriented databases. However, in the case of analytical systems such
as reporting, data warehouses, and OLAP cubes, performance can be improved by including
a column store index.

In the case of a non-updatable column store, the index has to be disabled before inserting
data and enabled again after the data load is completed. This results in additional time
when inserting data into column store indexed tables.

Field Notes
Indexes are widely used in the industry, as they help to improve data retrieval performance,
as we discussed in multiple instances.

Let us look at a few real-world scenarios of index implementations:


In a factory, there is a production belt on which finished goods flow through.
There are operators close to the belt whose task is to pick any passing product
and scan its barcode with a barcode reader. They experienced that at the initial
stage scanning took less than one second, but towards the end of the season
scanning took more than five seconds. The five-second duration was unbearable,
as it delayed the process significantly. Having noticed that the performance
degradation was due to the data volume, they took the step of archiving data.
Even though there was a performance improvement with the data archive, this was
not sustainable, and they had to look for expert advice. The expert, just by
listening to the problem, understood that the issue was a lack of indexes. The
reason the duration was higher when the data volume was large was that a table
scan had to be done. With less data a table scan can be done quickly, but with
data growth the number of data pages is high and the table scan takes a long
time. After applying an index, the data read was done very quickly, no matter
what the data volume was.
In a factory, there is a daily process called Day End. Day End is a heavy
process that typically takes 5-6 hours. This process involves many transactions,
hence end users did not know exactly which query was performing poorly. Experts
were brought in to provide a solution, and they were able to find the
slow-performing queries. After applying the necessary indexes, the Day End time
was reduced to less than one hour.
In an analytical system, there were more than five OLAP cubes and cube
processing took hours. However, there were no column store indexes in the
selected database technology. As this had become a major issue for the
management, it was decided to move the database to another database technology
which supported column store indexes. After moving the database to the new
technology, column store indexes were implemented and OLAP cube processing was
much faster than before. However, the selected database technology did not
support updatable column store indexes. Therefore, before updating the fact
table, the column store index was disabled and re-enabled after the data
insertion was completed. Though this led to many modifications in the system and
additional maintenance overhead, it improved the cube processing compared to
before.
In a transactional system, end users were complaining about slow query
performance. When this was analysed, it was found that the database was much
larger than expected. After a detailed analysis, it was revealed that the FILL
FACTOR was set to 20 instead of 80. A FILL FACTOR of 20 means that 80% of every
data page is left free, resulting in a large number of data pages; when a query
is executed, it has to read many pages. The FILL FACTOR was then reset to 80 and
reindexing was done. This reduced the database size, and query performance
improved significantly.


In a legacy system, a database was converted to a new version from a very old
version. However, after the conversion the database was used as it was, and the
new features of the database were not used. After some time, users were
complaining about query performance, but they had experienced that after
reindexing the performance was good again. Due to the user load, they were
unable to perform a reindex from time to time. After a thorough analysis of the
database, it was revealed that in the previous version there were a lot of
covering indexes, and those indexes were fragmenting frequently. After that
finding, it was decided to convert those covering indexes to INCLUDE indexes.
With the INCLUDE indexes, index fragmentation was reduced drastically and query
performance also improved.

Summary
Until this chapter, we had been discussing the functional aspects of database design. In this
chapter, we understood that non-functional requirements play a key part in database
design. We discussed performance, security, high availability, and scalability as
non-functional requirements of databases, and we identified indexes as a key factor for
improving performance so that data can be retrieved much faster.

Next, we identified different types of indexes in PostgreSQL such as B-Tree, Hash, GiST,
Generalized Inverted Index (GIN), SP-GiST, and Block Range Index (BRIN). Out of these,
the default and most common index type is B-Tree. We looked at the clustered and
non-clustered index implementations with different scenarios, and from those scenarios we
identified that indexes play a huge role when retrieving data from databases. It is also
recommended that query performance be verified from the EXPLAIN plan.

The FILL FACTOR is another important concept we discussed, and we identified that the
default value of the FILL FACTOR is 100 while the best-recommended value is 80. Although
we use indexes to improve query performance, there are disadvantages to indexes as well:
decreased insert performance and additional storage requirements are the identified
disadvantages. Further, we identified that the COVERING indexes used in legacy databases
can be replaced much more effectively by INCLUDE indexes.

Over time, indexes become fragmented due to data inserts and deletes. To remove
fragmentation and improve performance, REINDEX should be done. Reindexing is available
at the database, table, and index level, and we said that the better option is to perform it at
the index level to avoid resource contention. Apart from the PostgreSQL index
implementations, we identified that Filtered Indexes and Column Store Indexes are also
used in the industry with other database technologies.

Since there are a lot of myths in the industry with respect to indexes, we also looked at
challenging and therefore interesting case studies of indexes.

The next chapter, Designing a Database with Transactions, discusses how you can ensure data
integrity and handle database errors with the help of transactions. The readers will also
learn how they can design a database using transactions.

Questions
What is the importance of implementing Non-Functional Requirements from the
database perspective?

Non-Functional Requirements decide the quality of the system. Since the
database is a core component of the system, by maintaining the Non-
Functional Requirements, the quality of the system can be improved.
Performance, Scalability, Capacity, Availability, Reliability, Recoverability,
Maintainability, Serviceability, Security, Regulatory, Manageability, Data
Integrity, Usability, Interoperability are the main elements for Database Non-
Functional Requirements. Apart from these quality measures, enterprise
systems need to adhere to the different compliances such as Sarbanes Oxley,
Data Protection Law. Due to these factors, it is important to achieve Non-
Functional Requirements in a Database.

Why are indexes an essential feature of a database?

The database is a core component of a system and has a mandate to store
large volumes of data. However, the database's sole operation is not
limited to storing data; it also needs to provide efficient and
effective data retrieval methods for multiple user requirements. When you
need to access data from large volumes, indexes play a pivotal role in
providing efficient data access.

How many clustered indexes can be created per table, and why?

Only one clustered index can be created per table. Since the table data is
physically ordered in the order of the clustered index, multiple clustered
indexes cannot be created.


During interviews, questions about the clustered index are asked in
different ways: What is the behavior of the table when there are two
clustered indexes? What will happen when a second clustered index is
created? All these questions are checking whether you know that only one
clustered index can be created per table.

What are the best recommendations for the Clustered Indexes?

The clustered index key should be very small in size; integer columns are
the most suitable. Further, it should contain sequential values, which
means sequence columns are better suited. In addition, the selected
clustered column should be a static column that does not change over time.
In particular, columns such as names and dates are not suitable for
clustered indexes.

What is the difference between AND and OR in the WHERE clause?

The AND clause performs better than the OR clause when the clustered
columns are used in the WHERE clause. For the OR clause, the index has to
be scanned twice before the OR operation is done, whereas for the AND
operation only one scan is needed.

What is the Indexing strategy for the data warehouse?

The data warehouse is mostly used as an analytical system. This means most
operations are read operations, whereas during the ETL a bulk of data is
inserted. As we have discussed, indexes improve read performance while
having a negative impact on write performance. The data warehouse consists
of fact and dimension tables. Fact tables consist of surrogate keys and
measure columns. This means the fact table does not need a clustered
index, and in the dimension tables the surrogate keys can be chosen as the
clustered indexes. In the fact table, all the surrogate keys should be
configured as non-clustered indexes, as they will be joined with the
dimension tables. This leads to a large number of indexes in the fact
table since, typically, the fact table has a large number of surrogate
keys. In order to improve data load performance, indexes can be dropped
during the ETL and re-created after the ETL is completed.

Why Include indexes are better than Covering Indexes?

The covering index includes all the columns in a query, that is, all
columns in the WHERE and SELECT clauses. Due to this, the index size
becomes larger, and in the B-Tree all the columns are included in the
branch and leaf nodes. Since the index is large, there is a tendency for
the index to fragment. However, the SELECT columns of the query are not
needed for searching. In INCLUDE indexes, the SELECT columns are placed in
the included columns, which are stored only at the leaf node. Due to this
implementation, the index structure is much narrower and the tendency for
index fragmentation is lower. This means INCLUDE indexes are much better
than COVERING indexes.

What is the importance of FILL FACTOR and what are the best-recommended
values for the FILL FACTOR?

The FILL FACTOR decides what percentage of each data page is filled. If
the FILL FACTOR is set to 80, which is the best-recommended setting, 20
percent of each data page is left empty. That free space will be utilized
for data updates. Due to this, page splits will not occur, and that will
improve query performance.

What are the instances that you can implement Column Store Indexes?

Column store indexes are mainly used in data warehouses to process OLAP
cubes. However, depending on the database technology, tables may not be
updatable once column store indexes are implemented.

Exercise
For the database design you did for the mobile bill, identify the possible indexes.

Identify the query performance with and without indexes.


Examine the EXPLAIN plans for each query.

Further Reading
Sarbanes Oxley: https://www.sarbanes-oxley-101.com/sarbanes-oxley-audits.htm
Data Protection Law: https://www.gov.uk/data-protection
B Tree: https://www.csd.uoc.gr/~hy460/pdf/p650-lehman.pdf
Create Index: https://www.postgresql.org/docs/9.1/sql-createindex.html
Reindex: https://www.postgresql.org/docs/9.4/sql-reindex.html

9
Designing a Database with
Transactions
We have discussed database design concepts mainly with respect to the functional
requirements of users during Chapters 1-7. We discussed in detail how to design a database
with different models such as Conceptual and Physical Models in Chapter 4 Representation
Models. During that discussion, we identified different normalization forms in order to
achieve different advantages. Then we identified different design patterns which will be
helpful for database designers to tackle common problems. To achieve high performance in
a database, which is very much necessary, we discussed different aspects of Indexes and
how to use them as a tool to improve performance using different use cases or scenarios
in Chapter 8 Working with Indexes.

When databases are used, multiple queries are executed. For example, in order to raise an
invoice, you will be working with different tables. This means that several technical queries
have to be treated as a single business operation. If one of the queries fails, the effects of the
other queries should not persist, in order to maintain integrity in the database. This single
business operation is considered a transaction, and in this chapter we will discuss how to
achieve integrity by means of transactions, with examples in PostgreSQL.

The following topics will be covered in this chapter:

Understanding the Transactions


ACID Theory
CAP Theory
Transaction Controls
Isolation Levels
Designing Databases with Transaction
Field Notes

Understanding the Transaction


We have discussed database design in the previous chapters in order to satisfy different
user functions. During this design, we identified that to satisfy user requirements, multiple
tables have to be introduced. This means that for one process, multiple tables have to be
updated. From the business point of view, this process should be considered as one, even
though technically there are multiple processes.

Let us define what a transaction is so that we are clear about what we are discussing in this
chapter.

Definition of Transaction
A transaction is a single business and program unit whose execution changes the content of
the database. A single business unit may contain multiple queries. However, in a
transaction, all those queries are treated as one unit so that the database remains in a
consistent state.

If you look at the definition, a transaction modifies the database content. Therefore, a
transaction will contain Data Manipulation Language (DML) statements such as INSERT,
UPDATE, or DELETE.

Let us look at a transaction by means of an ATM example. Let us assume John has 500 USD
in his account, whereas Jane has 350 USD in her account. John wants to transfer 50 USD to
Jane.

The following screenshot shows the state of the database before and after the transaction:


Before the transaction, John and Jane had 850 USD together. Since it is an internal transfer,
the total should be the same after the transaction: John's and Jane's account balances should
still total 850 USD.

Just imagine if there is a failure after John's account is deducted but before the amount is
credited to Jane's account, as shown in the following screenshot:


This shows that before the failure the total was 850 USD, but after the failure the total is
800 USD. This means that 50 USD is lost. We introduce transactions to avoid these types of
losses. Since a transaction is considered one unit, either all changes of the entire unit are
applied or none of them are.

Let us consider another example that has multiple table updates. Let us assume you are
raising an invoice. The invoice has two steps as shown in the following screenshot:

First, let us look at this scenario when no transactions are implemented:


In this scenario, Invoice records are stored but customer balance and inventory are not
updated. This will leave the database in an inconsistent state.

Let us look at the same scenario with the Transaction:


Since the entire transaction is considered as one unit, due to the failure of the Update
Customer Balance step, there are no invoice records. In other words, from the database's
perspective, this transaction has not happened. This means the database is in a consistent
state.

To maintain the consistency of the database, the consistency of transactions should be
maintained. For this, there are two main theories of transactions, ACID and CAP, which we
will learn about in the next sections.

ACID Theory
To make transactions consistent, one approach is to maintain the ACID properties. ACID
stands for Atomicity, Consistency, Isolation, and Durability.

Let us discuss these ACID properties in detail.

Atomicity
We have already discussed the atomic nature of a transaction in the Definition of Transaction
section. In that section, with the bank ATM example as well as the invoice example, we said
that the entire transaction should be considered as one unit even though it contains multiple
technical queries. In simple terms, to satisfy the Atomicity property, either the entire
transaction takes effect or nothing does. There should not be any partial transaction that
leaves the database in an inconsistent state.

In database technologies, the Transaction Management component takes care of Atomicity.

Consistency
We said during the introduction of database transactions that their main idea is to keep the
database in a consistent state. The Consistency property of a transaction means that if the
database is in a consistent state before the transaction, it should be in a consistent state
after the transaction as well. In the bank ATM example, we saw that there were two states,
before the transfer and after the transfer, and in both states the total of the two account
balances was the same.

There is no special module in the database to maintain the consistency of
transactions. If you can maintain the other transaction properties,
Atomicity, Isolation, and Durability, then Consistency is achieved.

Isolation
Isolation in database transactions means that every transaction is logically isolated from
the others. In other words, one transaction should not be impacted by another transaction.

It is important to note that Isolation does not mean only one transaction
can run at a time. If that were the case, the purpose of the database would
be lost and the database would not be a popular choice among businesses.
Just imagine designing a system for a supermarket chain and saying that
only one transaction can happen at a given time; during a shopping season,
it would be chaos.

The Concurrency Control Unit of the database is the component that manages Isolation in
databases. Locking, Blocking, and Deadlocking are the techniques used to achieve isolation
in the Concurrency Control Unit.

Every database system offers different isolation levels. We will discuss the isolation levels of
PostgreSQL in the Isolation Levels section.


Durability
Durability means that once data changes are made to the database, they should persist
irrespective of hardware failures, software failures, server restarts, and so on. Let us say you
created an invoice, and after the invoice is created there is a server restart. Even after the
restart, that invoice should persist until you change it. The Database Recovery Management
unit is the component that achieves Durability in a database transaction.

The ACID theory is the most popular transaction mechanism which is used mostly by
Relational Database Management Systems (RDBMS).

Next, we will discuss CAP Theory which is mostly used in NoSQL databases.

CAP Theory
The CAP theory, which is mostly used in NoSQL databases and in distributed systems,
stands for Consistency, Availability, and Partition Tolerance.

In distributed systems and in NoSQL systems, there are multiple nodes in the database. The
combination of these three concepts is shown in the following screenshot:


Unlike the ACID properties, a system can satisfy only two of these three properties. This
means the available options are CA, AP, and CP, as you can see in the intersections of the
screenshot above.

Let us look at these concepts in detail.

Consistency
Every node of the distributed system will provide either the most recent state or the
previous state; no node shows a partial state. This means that every node will show the
same state. Since there can be latency in data replication, this is also called Eventual
Consistency.


Availability
Availability means that every node has read and write access. In a distributed system, since
there can be multiple nodes, it may not always be possible to achieve Availability due to
network latency and other hardware latencies.

Partition Tolerance
Partition Tolerance is the ability of the system to keep operating when nodes cannot
communicate with each other. Partition Tolerance is unavoidable in most distributed
systems. This means that you have to compromise either Consistency or Availability, and in
most cases Consistency is compromised in favour of eventual consistency.

Since no database system can satisfy all three concepts of the CAP theory, let us look at
which databases follow the different combinations of the CAP theory:

Combination of CAP Theory    Database System
CA                           SQL Server, Oracle, MySQL, PostgreSQL
CP                           MongoDB, HBase, Redis
AP                           CouchDB, Cassandra
Next, let us look at what are the Transaction Controls in databases and in PostgreSQL.

Transaction Controls
Like other database technologies, PostgreSQL also has commands to control transactions
from SQL.

In a database, there are different types of languages as shown in the below table.

Language    Description                     Example
DML         Data Manipulation Language      INSERT, UPDATE, DELETE
DDL         Data Definition Language        CREATE TABLE, ALTER TABLE
DCL         Data Control Language           ROLES, USERS
TCL         Transaction Control Language    BEGIN, COMMIT, ROLLBACK
DQL         Data Query Language             SELECT

Out of these languages, TCL is used to control transactions.

A transaction goes through different states, such as Active, Partially Committed,
Committed, Failed, and Aborted, as shown in the following table:

Active: This is the initial state of every transaction. In the Active state, the transaction is said
to be executing.
Partially Committed: When a transaction has executed its final operation, it is said to be in a
partially committed state.
Committed: If a transaction executes all its operations successfully, it is said to be committed.
Once the transaction is committed, its modifications are permanently established in the
database system.
Failed: When the database recovery system's checks fail, the transaction is said to be in a
failed state. A failed transaction can no longer proceed; you need to restart the transaction if
you wish to proceed.
Aborted: When a transaction has reached the failed state, the recovery manager module rolls
back all its modifications to bring the database back to the state it was in before the
transaction executed. Transactions in this state are called aborted.

The Partially Committed state is a special state. The transaction goes into this state after its
final statement is executed. In this state, if there is a violation of an integrity constraint, the
transaction is moved to the Failed state and then a rollback is done.

These states have a relationship as shown in the below screenshot:


Every transaction has one of two outcomes: it is either successfully completed or it fails.
The transaction is committed when it is successful, and the database moves to a new,
consistent state. On the other hand, a transaction that fails should be aborted; this is done
by the rollback.

A committed transaction cannot be aborted. If the committed transaction
was a mistake, it has to be reversed by another transaction. This
reversing transaction is called a compensating transaction.

After a transaction aborts, the database recovery module can choose either to restart the
transaction or to kill it.

Transactions in PostgreSQL
Let us see how we can use transactions in PostgreSQL. Let us simulate the ATM transaction
as we discussed in the Definition of Transaction section.

Let us create the table as shown in the following screenshot:
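The creation script itself appears only as a screenshot; the following is a minimal sketch of a matching table. The ID and AccountBalance columns are the ones used by the update scripts below, while the Name column is an assumption:

CREATE TABLE public."AccountBalance"
(
    "ID" integer PRIMARY KEY,
    "Name" character varying(50),   -- assumed column for the account holder
    "AccountBalance" numeric(9,2)
);

-- initial data: John has 500 USD and Jane has 350 USD
INSERT INTO public."AccountBalance" ("ID", "Name", "AccountBalance")
VALUES (1, 'John', 500), (2, 'Jane', 350);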


The following is the initial data set as shown in the below screenshot:

Let us see how a fund transfer can be done without a transaction, using the following script:
UPDATE public."AccountBalance"
SET "AccountBalance"= "AccountBalance" - 50
WHERE "ID" = 1;

UPDATE public."AccountBalance"
SET "AccountBalance"= "AccountBalance" + 'AA'
WHERE "ID" = 2;

When the above statements are executed as a batch, the first statement succeeds while the
second fails. However, since no transaction is implemented, the first update is applied but
not the second, as shown in the following screenshot:


Though the total of both accounts should be 850 USD, after the above statements it is 800
USD.

Let us do the same process with a transaction as shown in the following script:
BEGIN TRANSACTION;
UPDATE public."AccountBalance"
SET "AccountBalance"= "AccountBalance" - 50
WHERE "ID" = 1;
UPDATE public."AccountBalance"
SET "AccountBalance"= "AccountBalance" + 'AA'
WHERE "ID" = 2;
ROLLBACK TRANSACTION;

After the failure of the second statement, the entire batch is rolled back and you will see
that the database is back in its previous state, as shown in the following screenshot:

Since the two account balances still total 850 USD, the database is in a consistent state.

Let us see the same script with a successful transaction as shown in the following script:
BEGIN TRANSACTION;
UPDATE public."AccountBalance"
SET "AccountBalance"= "AccountBalance" - 50
WHERE "ID" = 1;
UPDATE public."AccountBalance"
SET "AccountBalance"= "AccountBalance" + 50
WHERE "ID" = 2;
COMMIT TRANSACTION;

Now, you will see that the database has changed to a new state as shown in the following
screenshot:


Since the two account balances again total 850 USD, the database is in a consistent state.

Different databases have different isolation levels to support various user requirements. In
the next section, we will look at the isolation levels supported by PostgreSQL.

Isolation Levels
As we discussed in the ACID Theory section, Isolation means treating every transaction as
isolated from the others.

In concurrent user environments, if there is no isolation in place, the problems shown in the
following table can occur.

Dirty Read: A transaction reads data written by a different, active but not yet committed
transaction.
Nonrepeatable Read: A transaction re-reads data it has previously read in the same
transaction, but a different value is retrieved because the data was modified by a separately
committed transaction.
Phantom Read: A transaction reads a set of data, but that set is different when read again
because of another committed transaction.
Serialization Anomaly: One transaction inserts multiple records, but only a few of them are
seen by another transaction.
Let us look at the isolation levels in PostgreSQL, which are Read Committed, Read
Uncommitted, Repeatable Read, and Serializable, in the next sections.

Read Committed
This is the default isolation level in PostgreSQL, as it is in many database systems such as
SQL Server. In the Read Committed isolation level, only committed data is read, so dirty
reads are avoided. In this isolation level, however, two successive SELECT statements can
see different data.

Let us see this in the screenshot shown below.


As shown in the above screenshot, at the start of Transaction 1 the data was read as A.
During this transaction, another transaction updated the value to B. This means that the
next read in Transaction 1 will return B.
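This behaviour can be sketched with two sessions, assuming the AccountBalance table from the previous section; the value used in the update is illustrative only:

-- Session 1
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT "AccountBalance" FROM public."AccountBalance" WHERE "ID" = 1;
-- sees the value as it was before the other session's update

-- Session 2 (executed while Session 1 is still open, autocommitted)
UPDATE public."AccountBalance" SET "AccountBalance" = 400 WHERE "ID" = 1;

-- Session 1, same transaction
SELECT "AccountBalance" FROM public."AccountBalance" WHERE "ID" = 1;
-- now sees the committed update made by Session 2
COMMIT;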

Read Uncommitted
This is the weakest of the existing isolation levels. In the Read Uncommitted isolation level,
dirty reads can occur; that is, uncommitted changes from other transactions can be seen in a
transaction. In PostgreSQL, the Read Uncommitted isolation level is treated the same as the
Read Committed isolation level.

Repeatable Read
In the Repeatable Read isolation level, all statements of the current transaction can only see
rows committed before the first query of the transaction. Further, they can see any data
modifications that were executed within the same transaction.

This can be explained with the following screenshot.


As you can see in the above screenshot, when Transaction 1 starts, the data value is A.
Before Transaction 1 ends, Transaction 2 modifies the data value to B. However, this change
is not visible to Transaction 1 even when it reads the data again.

Serializable
The Serializable isolation level ensures that concurrent transactions produce the same result
as if they had run sequentially, one by one, in some order. As you can imagine, though full
isolation is achieved, this can become a less efficient isolation level, as some concurrent
transactions may be blocked or rolled back. It is considered the strongest of the available
isolation levels.

The following table shows a summary of each phenomenon for each isolation level:

Isolation Level     Dirty Read     Non-Repeatable Read   Phantom Read   Serialization Anomaly
Read Uncommitted    Possible*      Possible              Possible       Possible
Read Committed      Not Possible   Possible              Possible       Possible
Repeatable Read     Not Possible   Not Possible          Possible*      Possible
Serializable        Not Possible   Not Possible          Not Possible   Not Possible

* Though these are theoretically possible, in PostgreSQL these phenomena do not occur at
this level.

In most database systems, as shown in the above table, Read Committed and
Read Uncommitted are two different isolation levels. The difference
between them is that a dirty read is possible in the Read Uncommitted
isolation level while it is not possible in the Read Committed isolation
level. However, in PostgreSQL, a dirty read is not possible in either Read
Committed or Read Uncommitted. For this reason, in PostgreSQL, the Read
Committed and Read Uncommitted isolation levels behave the same.
The following screenshot shows the isolation levels on a weak-to-strong scale:

This is how you can set the different isolation levels in PostgreSQL:
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

After discussing the different isolation levels, let us discuss a couple of database design
techniques that improve how transactions behave in a database.

Designing a database using transactions


When designing a database for transactions, it is essential to make sure that transaction
times are kept short. There are two approaches that can be used to achieve this.

Let us look at the following sample data:

When there is a need to calculate the account balance as at a given date, you need to
aggregate all the previous rows, as shown in the following script:
SELECT SUM("Amount")
FROM public."BankTransaction"
WHERE "AccountName" ='100-1235-5789'
AND "TransactionDate" <= '2010-08-04';

The above script will give you the balance of account 100-1235-5789 as at 2010-08-04.

This leads to a longer transaction duration. There are two types of solutions to the above
problem: one is to include an additional column, and the second is to include another table.


Running Totals
When there is a need to retrieve a value as at a given date, we can use the Running Total
technique.

You can add another column holding the running total, and you will see the data as shown
in the below screenshot:

If you want the balance as at any day, you now need to read only one row, which results in
a shorter transaction.

The following is the equivalent query to get the previous result:


SELECT "RunningTotal"
FROM public."BankTransaction"
WHERE "AccountName" ='100-1235-5789'
AND "TransactionDate" = '2010-08-04';

The above query reads only one record, whereas the previous query needs to read many records.
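The RunningTotal column could be computed, for example, with a window function; the following is a sketch only, assuming the BankTransaction table shown above. In practice, the stored column would be maintained as new rows are inserted:

SELECT "AccountName", "TransactionDate", "Amount",
       SUM("Amount") OVER (PARTITION BY "AccountName"
                           ORDER BY "TransactionDate") AS "RunningTotal"
FROM public."BankTransaction";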


Summary Table
Instead of an additional column, we can have another table. This will be a better solution
when you need month-end balances.

The following data sample is for the summary table:

If you want to retrieve the balance as at the end of 2010-06, it is a simple query on this
summary table instead of an aggregate query on the transaction table, as shown in the below
script:
SELECT "Amount"
FROM public."MonthlyBalance"
WHERE "AccountName" = '100-1235-5789'
AND "Month" = '2010-06'

With the query, only one record is read and that will reduce the transaction duration.
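The MonthlyBalance table can be refreshed periodically from the transaction table; the following is a sketch only, assuming that the Amount column holds the cumulative balance as at the end of each month:

INSERT INTO public."MonthlyBalance" ("AccountName", "Month", "Amount")
SELECT "AccountName", "Month",
       -- cumulative balance up to and including each month
       SUM("MonthTotal") OVER (PARTITION BY "AccountName" ORDER BY "Month") AS "Amount"
FROM (
    SELECT "AccountName",
           to_char("TransactionDate", 'YYYY-MM') AS "Month",
           SUM("Amount") AS "MonthTotal"
    FROM public."BankTransaction"
    GROUP BY "AccountName", to_char("TransactionDate", 'YYYY-MM')
) AS monthly;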

Indexes
As we discussed in Chapter 8, Working with Indexes, an index is used to read only the needed
records instead of the entire table. This will reduce the transaction duration.
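For the balance queries shown earlier, for example, an index on the account and date columns would let PostgreSQL read only the relevant rows; the index name below is an assumption:

CREATE INDEX "IX_BankTransaction_Account_Date"
ON public."BankTransaction" USING btree
("AccountName" ASC NULLS LAST, "TransactionDate" ASC NULLS LAST);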

Next, let us look at a few real-world scenarios regarding transactions.


Field Notes
In an organization with more than 4,000 employees, the salary is processed at
the end of the month. This organization had implemented transactions for all its
processes. Since the salary process takes a long time, more than 2-3 hours, the
database tends to grow large during the transaction as the log file grows, and
because of the transaction there were many maintenance issues. During the salary
process, no other transactions are entered. To manage this, it was decided to
remove the transaction. Before the start of the salary process, a database
backup is taken, and in case of a failure the backed-up database can be used to
restore the database. This was possible only because no other transactions are
possible during the processing of the salary.
There was a requirement to identify the record modification date for a
transaction table. The database designers decided to add an UPDATE trigger to
the table so that when a record is updated, the MODIFIED DATE column is updated
by the trigger. However, after this was done, users experienced a longer
transaction time. This was because the trigger added its own execution time to
the total transaction time, as a trigger is a synchronous process. It was then
decided to modify the application code and remove the trigger. This modification
reduced the transaction execution time.

Summary
In this chapter, we looked at the transaction aspects of database design. We considered a
transaction to be a single logical business unit that should be treated as one unit, and that
ensures the database is left in a consistent state. We identified two main transaction
theories, the ACID and CAP theories. ACID stands for Atomicity, Consistency, Isolation,
and Durability, whereas CAP stands for Consistency, Availability, and Partition Tolerance.

In PostgreSQL, there are three main Transaction Control Language commands: BEGIN
TRANSACTION, ROLLBACK TRANSACTION, and COMMIT TRANSACTION. Every
transaction either fails or succeeds, and in the case of a committed transaction that turns out
to be a mistake, we need to run a compensating transaction to reverse it. To support
different isolation requirements, PostgreSQL provides the Serializable, Read Committed,
and Repeatable Read isolation levels. Unlike other database technologies, in PostgreSQL
both Read Committed and Read Uncommitted are treated the same, as dirty reads are not
possible in either isolation level.

During database design, we identified that a few measures can be taken in order to make
transactions better. We introduced two methods: one is adding a column with running
totals, and the other is adding a separate summary table. In both cases, the aggregate query
is replaced with a simple query so that the transaction duration is reduced.

In the next chapter, we will discuss the involvement of a database designer during the
maintenance period of a database.

Questions
Why is the transaction an important concept in a database?

During the design stage, we identify that to satisfy different business
user requirements, multiple tables have to be introduced. Further,
following the normalization process results in duplicated data being
separated into multiple tables. This means that a single unit of a business
process may need to update multiple tables. During the modification of
those tables, if one update fails, the database will be left in an
inconsistent state. In order to maintain a consistent state in the
database, transactions are used.

Explain which database modules are implemented to achieve the ACID
properties of database transactions.

ACID stands for Atomicity, Consistency, Isolation, and Durability as transaction
properties. The following table shows the different database
modules that implement each ACID property.

Property      Component
Atomicity     Transaction Management component
Consistency   No specific component. If Atomicity, Isolation, and Durability are maintained, Consistency is maintained automatically.
Isolation     Concurrency Control Unit
Durability    Database Recovery Management
Questions on ACID properties are very common in interviews and exam
papers. It is important to be able to explain the concepts and usage of ACID properties
fluently.

Why do we have to compromise either Consistency or Availability in a distributed
system?

Distributed systems follow the CAP theorem, which covers Consistency,
Availability, and Partition Tolerance. Out of these three concepts, only two
can be implemented in a system. In distributed systems, Partition
Tolerance is unavoidable. This means we have to compromise either
Consistency or Availability in a distributed system.

What is the Default Isolation level in PostgreSQL?

The default isolation level in PostgreSQL is Read Committed. Though this is
a very simple question, most users do not have a proper idea about
isolation. During interviews, this question is mostly asked to verify
whether the candidate has a basic idea about isolation in PostgreSQL.

What is the difference between the Read Committed and Read Uncommitted
isolation levels in PostgreSQL?

Though in most database technologies there is a difference between the
Read Committed and Read Uncommitted isolation levels, there is no
difference in PostgreSQL. This is mainly due to the fact that, unlike in other
database systems, dirty reads are not possible in either isolation level in
PostgreSQL. In most other database systems, dirty reads are possible in
the Read Uncommitted isolation level. If the Read Uncommitted isolation
level is requested, PostgreSQL will use Read Committed instead.

What are the specific design decisions that can be made in order to reduce the
transaction time?

It is important to note that there are vital decisions that need to be taken during
the database design in order to reduce the transaction time. The main
objective in this regard is to reduce the number of rows that are read or written
during the transaction. To achieve this, a summary column or a summary
table can be added so that aggregate queries can be eliminated.
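As a hedged illustration of the summary-table idea mentioned above (the AccountSummary table and its columns are hypothetical names used only for this sketch, not objects from the examples in this book), a pre-aggregated value lets the transaction read a single row instead of aggregating many:
CREATE TABLE public."AccountSummary"
(
    "AccountID"      integer PRIMARY KEY,
    "CurrentBalance" numeric(18,2) NOT NULL DEFAULT 0
);

-- Inside a transaction, a single-row read replaces an aggregate over many rows:
SELECT "CurrentBalance"
FROM public."AccountSummary"
WHERE "AccountID" = 1001;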

Exercise
Identify how transactions can be implemented for the Mobile bill example.

What are the possible best practices that can be implemented?

What is the best Isolation level for the above scenario?


Further Reading
ACID Theory: https://en.wikipedia.org/wiki/ACID
CAP Theory: https://en.wikipedia.org/wiki/CAP_theorem
Transaction: https://www.postgresql.org/docs/8.3/tutorial-transactions.html
Error Handling: https://www.postgresql.org/docs/9.1/ecpg-errors.html
Isolation Levels: https://www.postgresql.org/docs/9.5/transaction-iso.html
Setting Isolation Levels: https://www.postgresql.org/docs/9.3/sql-set-transaction.html

10
Maintaining a Database
During the last nine chapters, we concentrated our discussions mainly on database design
aspects. In those design conversations, we discussed the design aspects of databases, from
planning to conceptual design and physical implementation. After database design, we
discussed the non-functional requirements of databases and identified that indexes are the
key components to improve database performance. During the discussion of database transactions, we
identified that keeping the database consistent is an important factor. After the above
aspects are met, your database is ready for developers to start application development.
However, as a database designer, your responsibilities are not complete. Typically,
database designers are much needed during the phase of application development.
Moreover, their knowledge is needed even after the database is released to production,
which is neglected in most cases.

This chapter discusses how to maintain and properly design a database with growing data.
Further, this chapter emphasizes that maintaining a database is a key role of a database
designer, though it is typically neglected. In addition to these aspects, we will be looking at
working with triggers in order to support different customer needs.

In this chapter we will cover the following topics:

Role of a designer in Database Maintenance
Implementing Views for better maintenance
Using Triggers for design changes
Modification of Tables
Maintaining Indexes
Handling of other Maintenance Tasks

Role of a designer in Database Maintenance


Many people, including database designers and application developers, are under the
impression that database designers do not have a role to play in database
maintenance. However, database designers have a dual role in database maintenance. As a
database designer, you need to design the database in such a way that it needs minimum
maintenance while it is live. The database designer's second role is the
adaptation to the customers' requirements that will come after the database is live.

When you are doing the design, it is very hard to predict the database growth and the
velocity of the data. If you design a database without considering its future aspects,
you will run into performance and scalability issues in production. This means that
the end-users as well as the application developers will have limited
options with the database. If the database is designed without considering the future
aspects, it will lead to complex and time-consuming maintenance tasks.

When the database is released to production, new requirements and database
structural changes will come into the picture. As the database is in production, you
need to carry out the required database changes. However, when changes are required,
it is essential to perform them without impacting current business operations. If
the business impact cannot be avoided, as a database designer, you need to choose the
option that has the least impact on current business operations.

Let us assume you have a table with a large number of records. For
example, let us say the table size is 100 GB and it has more than ten
million rows. Typically, though it is not always true, a large table means
that it is frequently used by the application. When there is a need to
modify the table, such as changing a column's data type or populating a
newly added column, this modification may lead to a table lock.
This will prevent the table from being accessed by the application.

In this type of scenario, the database designer's knowledge is important for the successful
implementation of the change, so that there won't be any impact on the business.

In the next section we will see how views can be used so that database maintenance becomes easier.

Implementing Views for better maintenance


A view is a database object that encapsulates a query. Users can access a view as a
virtual table in PostgreSQL. Views are available in most other database
technologies as well.

[ 277 ]
Maintaining a Database Chapter 10

The following screenshot shows the semantic diagram for the database view:

As shown in the above screenshot, three tables are used: T1, T2, and T3. The T1 and T2 tables
are joined on the C1 column, and the T1 and T3 tables are joined on the C2 column to create a
database view. The advantage of the view is that the person consuming it does not
need to know its implementation. For example, they do not need to know the
joined tables, the joining columns, or any other complex logic of the underlying query.
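To make the diagram concrete, a minimal sketch of such a view is shown below; T1, T2, T3, C1, and C2 follow the diagram, and the additional column names are placeholders only, not objects from the sample database:
CREATE VIEW vw_t1_t2_t3 AS
SELECT t1."C1",
       t1."C2",
       t2."ColumnFromT2",   -- placeholder column
       t3."ColumnFromT3"    -- placeholder column
FROM "T1" t1
JOIN "T2" t2 ON t2."C1" = t1."C1"
JOIN "T3" t3 ON t3."C2" = t1."C2";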

Next, let us see how we can create views in PostgreSQL, and see the performance aspects in
views.

Now we will learn about views in the sample database DVDRental.

Sample Views in PostgreSQL


The DVDRental database is the sample database for PostgreSQL. The following screenshot
shows the views in the sample database:


Let us see the code for a few of these views. Here is the code for the
view nicer_but_slower_film_list:
CREATE OR REPLACE VIEW public.nicer_but_slower_film_list
AS
SELECT film.film_id AS fid,
film.title,
film.description,
category.name AS category,
film.rental_rate AS price,
film.length,
film.rating,
group_concat(((upper("substring"(actor.first_name::text, 1, 1)) ||
lower("substring"(actor.first_name::text, 2))) ||
upper("substring"(actor.last_name::text, 1, 1))) ||
lower("substring"(actor.last_name::text, 2))) AS actors
FROM category
LEFT JOIN film_category ON category.category_id =
film_category.category_id
LEFT JOIN film ON film_category.film_id = film.film_id
JOIN film_actor ON film.film_id = film_actor.film_id
JOIN actor ON film_actor.actor_id = actor.actor_id
GROUP BY
film.film_id,
film.title,
film.description,
category.name,
film.rental_rate,
film.length,
film.rating;

When a user executes the above view, the following data will be returned. This view will
return the film title, description, category, price, length, ratings and actors:


As you can see in the above code, the view is a very complex query. To retrieve the
above data set, five tables are used: category, film_category, film, film_actor, and actor.
These tables are joined with different types of joins, INNER JOIN and OUTER JOIN.

INNER JOIN and OUTER JOIN are different types of joins used to relate
different tables. Since this book is dedicated to database design concepts,
we will not be discussing the details of JOINs. However, if you wish to
learn those aspects, please follow this link:
https://www.w3schools.com/sql/sql_join.asp.

Apart from those complexities, there are other complexities such as grouping and
manipulating data. If you revisit the code of the view, you will see that grouping and
string manipulation have to be done to get the actors in a comma-separated format.
However, end-users do not need to know this complex logic; a simple select query
will provide the necessary data for them.

Further, end-users can select necessary columns and include WHERE and ORDER BY
clauses as shown in the following script:
SELECT title,
description,
category,
price,
length,
rating
FROM public.nicer_but_slower_film_list
WHERE length > 2
ORDER BY title;

Let us look at another view actor_info in the sample database:


By looking at the output this seems to be a very simple view. However, let us look at the
code for the same view:
CREATE OR REPLACE VIEW public.actor_info
AS
SELECT a.actor_id,
a.first_name,
a.last_name,
group_concat(DISTINCT (c.name::text || ': '::text) || (( SELECT
group_concat(f.title::text) AS group_concat
FROM film f
JOIN film_category fc_1 ON f.film_id = fc_1.film_id
JOIN film_actor fa_1 ON f.film_id = fa_1.film_id
WHERE fc_1.category_id = c.category_id AND fa_1.actor_id = a.actor_id
GROUP BY fa_1.actor_id))) AS film_info
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
LEFT JOIN film_category fc ON fa.film_id = fc.film_id
LEFT JOIN category c ON fc.category_id = c.category_id
GROUP BY a.actor_id, a.first_name, a.last_name;

Now let us assume that you want to add a new column to the view, and the column is
available in a different table that is not already used in the code of the view. In that
scenario, you only have to change the view; for the application users, it is simply another
column in the view, so no major code modifications are needed.

Creating Views in PostgreSQL


Let us see how we can create database views in PostgreSQL.

As shown in the following screenshot, vw_OrderDetails is created:


From the Code tab, the view definition is supplied as shown in the following screenshot:


With those two configurations, the database view is created, and users or applications can
access the view.
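Since the view definition entered on the Code tab is not reproduced here, the following is only an assumed sketch of what the vw_OrderDetails definition could look like in SQL; the OrderHeader and OrderDetail tables and the selected columns are assumptions used for illustration:
CREATE OR REPLACE VIEW public."vw_OrderDetails" AS
SELECT oh."OrderNumber",
       oh."OrderDate",
       oh."CustomerID",
       od."OrderLine",
       od."ProductID"
FROM public."OrderHeader" oh
JOIN public."OrderDetail" od
    ON od."OrderNumber" = oh."OrderNumber";
The same result can be achieved either from the user interface or by running a CREATE VIEW statement from a query tool.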

Let us look at the performance aspects in the database views.

Performance aspects in Views


Most users are under the impression that views provide a performance gain. In this section,
we will verify this with the help of an example.

Let us verify the EXPLAIN plan with the following query:


SELECT
title,
description,
category,
price,
length,
rating
FROM public.nicer_but_slower_film_list

You will see the following query plan:

Since the above query plan is complex, let us look at the same query plan in text format as
shown in the following screenshot:


If you run the same query that is embedded in the database view, you will find that it is the
same query plan.
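You can check this yourself with EXPLAIN; the following statement shows the plan for the view, and running EXPLAIN on the embedded SELECT shown earlier in this chapter produces the same plan:
EXPLAIN
SELECT title,
       description,
       category,
       price,
       length,
       rating
FROM public.nicer_but_slower_film_list;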

It is a myth among many users that a database view will improve
database performance. It is important to understand that a database
view is a logical layer where no data is saved. When the view is accessed,
the embedded script is executed. This means there is no difference,
as far as performance is concerned, between executing the view and
executing the embedded script separately.

Though database views are not meant for performance improvement (except for
materialized views), database designers create database views in order to manage
application development much better.

In the view definition, it is technically possible to use views inside views. For example, you
can create another view using the nicer_but_slower_film_list view. However, this is
not recommended. This is mainly due to the fact that views inside views might be difficult
to maintain at a later stage.

Next, let us see how Triggers can be utilized during the maintenance period.

Using Triggers for design changes


As we discussed in Chapter 9, Designing a Database with Transactions, database triggers will
increase transaction time. However, there are instances where
database designers have to implement triggers during the maintenance period, as major
changes are not possible. Triggers can be used as an auditing option and as a data
separation option.

Let us look at how triggers can be used in different situations in the following sections.

Introduction to Triggers
A database trigger is a set of code attached to a table or a view, and that code will execute
depending on the operation. There are three types of triggers in PostgreSQL:

BEFORE
AFTER
INSTEAD OF

The trigger can be specified to execute before the operation is attempted on a row
(BEFORE/INSTEAD OF) or after the operation has completed (AFTER). If the trigger
fires before or instead of the operation, the trigger can skip the operation for the current
row or change the row being inserted. If the trigger fires after the event, all changes
that were done are visible to the trigger.

Further, triggers may be defined to fire for TRUNCATE, though only FOR EACH
STATEMENT.

TRUNCATE is equivalent to dropping and re-creating the table. Although
both TRUNCATE and DELETE can empty a table, DELETE is a Data
Manipulation Language (DML) statement whereas TRUNCATE is a Data
Definition Language (DDL) statement. Due to this difference, TRUNCATE
is recommended when you want to delete the contents of an entire table,
as there is a performance benefit.

The following table summarizes which types of triggers may be used on tables and views:


Trigger      Database Operation          Row Level   Statement Level
BEFORE       INSERT / UPDATE / DELETE    Tables      Tables and Views
             TRUNCATE                    -           Tables
AFTER        INSERT / UPDATE / DELETE    Tables      Tables and Views
             TRUNCATE                    -           Tables
INSTEAD OF   INSERT / UPDATE / DELETE    Views       -
             TRUNCATE                    -           -
The following screenshot shows how the AFTER trigger works:

As shown in the above screenshot, when any data modifications are done by users or by
applications, the relevant table trigger is executed. For example, if an INSERT statement is
executed, the INSERT trigger is executed and so on.

Let us look at how triggers are a useful option to capture auditing data in the following
section.


Triggers as an Auditing Option


As we discussed in Chapter 9, Designing a Database with Transactions, triggers tend to
increase the transaction time. Therefore, as database designers, we do not recommend
using triggers. However, there are instances where you need to implement triggers, at least on
a temporary basis.

Let us see how triggers can be implemented as an auditing mechanism. Let us assume that
we need to track the changes to the Product table. As far as the database is concerned,
modifications mean INSERT, UPDATE, and DELETE.

For auditing, we need to cover the basic three W's: Who, When, and What.
These cover who made the change, when the change was done, and what
the change was. We discussed auditing table structures in Chapter 6,
Table Structures.

Let us see how we can use triggers in PostgreSQL to implement data audit.

1. Let us create a table to store the audit data, as shown in the following script:
CREATE TABLE public."ProductAudit"
("ID" Serial,
"ProductID" Integer,
"Operation" char(1),
"User" text,
"DateTime" timestamp
)

2. Next is to create the trigger. For easy maintenance, we will create a single trigger
to capture INSERT, UPDATE, and DELETE, as shown in the following script. Note that the
function it references, which is created in the next step, must exist in the database before
this CREATE TRIGGER statement will succeed:
CREATE TRIGGER log_Products_update
AFTER INSERT OR UPDATE OR DELETE ON Public."Product"
FOR EACH ROW
EXECUTE PROCEDURE process_product_audit();

3. In the preceding code block, we can see that a function named process_product_audit
is called. The script for that function is listed below:
CREATE OR REPLACE FUNCTION public.process_product_audit()
    RETURNS trigger
    LANGUAGE 'plpgsql'
    COST 100
    VOLATILE NOT LEAKPROOF
AS $BODY$
BEGIN
    IF (TG_OP = 'DELETE') THEN
        INSERT INTO public."ProductAudit"
            ("ProductID","User","Operation","DateTime")
        SELECT OLD."ProductID", user, 'D', now();
        RETURN OLD;
    ELSIF (TG_OP = 'UPDATE') THEN
        INSERT INTO public."ProductAudit"
            ("ProductID","User","Operation","DateTime")
        SELECT OLD."ProductID", user, 'U', now();
        RETURN NEW;
    ELSIF (TG_OP = 'INSERT') THEN
        INSERT INTO public."ProductAudit"
            ("ProductID","User","Operation","DateTime")
        SELECT NEW."ProductID", user, 'I', now();
        RETURN NEW;
    END IF;
    RETURN NULL;
END;
$BODY$;

4. Now let us perform some modifications to the Product table:
--Data Insert
INSERT INTO public."Product"
("ProductID","Name","ProductNumber",
"Color","MakeFlag","FinishedGoodsFlag")
VALUES(1098,'Sample Product','CAX-4578','Red',1,1)

--Data Update
UPDATE public."Product" SET "MakeFlag" = 0
WHERE "ProductID" = 1098

--Data Delete
DELETE FROM public."Product"
WHERE "ProductID" =319

Let us see how data was captured to the ProductAudit table as shown in the following
screenshot:


You can extend this trigger to capture all the modified values, as the above trigger captures
only the ProductID.

The AFTER triggers can be used not only for auditing, but for troubleshooting
as well. As we discussed in Chapter 9, Designing a Database with Transactions, if triggers are
not necessary, you can disable the trigger from the user interface, as seen in
the following screenshot:

By disabling a trigger, you can use it whenever you need to, rather than dropping and
recreating it again.
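The same can be done with SQL as well; a brief example using the trigger created earlier in this section (the unquoted trigger name is stored in lowercase):
-- Temporarily disable the auditing trigger on the Product table
ALTER TABLE public."Product" DISABLE TRIGGER log_products_update;

-- Re-enable it when auditing is needed again
ALTER TABLE public."Product" ENABLE TRIGGER log_products_update;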

Let us look at how triggers can be used to support table partition.

Triggers to Support Table Partition


Database triggers can be used as a mechanism to support data partitioning as well. Let us say
we already have a table, Order Detail, in the production environment. After some time,
due to the data volume, you decide to partition this table year-wise. However, this
might require a huge code change that will not be accepted by many stakeholders. With
fewer code changes, you can implement triggers to achieve physical data partitioning.

In this method, we need to transfer the data to different physical tables as shown in the
following screenshot:

As shown in the above screenshot, our target is that when data is inserted into the Order
Detail table, depending on the year of the Order Date, the data will be moved to either Order
Detail 2019 or Order Detail 2020.

2. Let us create two tables, OrderHeader_2019 and OrderHeader_2020, as shown in the
following code block:
CREATE TABLE public."OrderHeader_2019"
(
    "OrderNumber" integer NOT NULL,
    "OrderDate" date,
    "CustomerID" smallint,
    CONSTRAINT "OrderHeader_2019_pk" PRIMARY KEY ("OrderNumber")
);

CREATE TABLE public."OrderHeader_2020"
(
    "OrderNumber" integer NOT NULL,
    "OrderDate" date,
    "CustomerID" smallint,
    CONSTRAINT "OrderHeader_2020_pk" PRIMARY KEY ("OrderNumber")
);

3. Next is to create the trigger on OrderHeader as shown in the following code block:
CREATE TRIGGER trg_order_partition
BEFORE INSERT ON public."OrderHeader"
FOR EACH ROW
EXECUTE PROCEDURE process_partiton_orders();

4. The heart of this operation is the function process_partiton_orders, which is listed
in the below code block:
CREATE OR REPLACE FUNCTION public.process_partiton_orders()
    RETURNS trigger
    LANGUAGE 'plpgsql'
    COST 100
    VOLATILE NOT LEAKPROOF
AS $BODY$
BEGIN
    IF (NEW."OrderDate" >= '2019-01-01' AND NEW."OrderDate" <= '2019-12-31') THEN
        INSERT INTO "OrderHeader_2019"
        SELECT NEW."OrderNumber", NEW."OrderDate", NEW."CustomerID";
        RETURN NULL;
    ELSIF (NEW."OrderDate" >= '2020-01-01' AND NEW."OrderDate" <= '2020-12-31') THEN
        INSERT INTO "OrderHeader_2020"
        SELECT NEW."OrderNumber", NEW."OrderDate", NEW."CustomerID";
        RETURN NULL;
    END IF;
    -- Rows outside the defined ranges fall through and are inserted into the base table
    RETURN NEW;
END;
$BODY$;

It is important that triggers should not include complex logic, as it will increase the
transaction duration, which will result in unnecessary timeouts in applications.

In the next section, we will look at the design changes that need to be adopted during the
addition of a new column to an existing table.

Modification of Tables
As we emphasized in the Role of a designer in Database Maintenance section, one of the most important tasks
that database designers have to undertake is adding a column to an existing large, high-
velocity table. As a database designer, it is your duty to satisfy the user requirement while
not disturbing current business operations.


There are two approaches to this: adding a column to the required table, or adding a
separate table joined to the original table. Let us discuss these approaches and their issues
in the upcoming sections.

Adding a Column
The obvious and simpler method is to add a column to the existing table. Adding a column
to a small, not heavily used table will not be a huge problem. However, the problem
occurs when you try to add a column to a high-volume table. Adding the column to the
existing table alone will not be problematic; however, when you add a column, you may
also have to populate data into the column.

For example, let us say we want to add a column called Product Category to the Order
Detail table. After adding the column, for the column to be usable, you need to populate the
relevant Product Category in the Order Detail table by doing a lookup on the Product Sub
Category and Product Category tables.

If you are looking at a large table, the update should be done batch-wise in order to keep
the table accessible to the end-users.
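A rough sketch of such a batch-wise update is shown below. The table and column names (a new ProductCategoryID column on OrderDetail, and the lookup tables) are assumptions used only to illustrate the pattern of updating a limited number of rows per statement:
-- Update at most 10,000 rows per execution; repeat until 0 rows are updated.
WITH batch AS (
    SELECT ctid
    FROM public."OrderDetail"
    WHERE "ProductCategoryID" IS NULL
    LIMIT 10000
)
UPDATE public."OrderDetail" od
SET "ProductCategoryID" = pc."ProductCategoryID"
FROM batch,
     public."Product" p
     JOIN public."ProductSubCategory" psc ON psc."SubCategoryID" = p."SubCategoryID"
     JOIN public."ProductCategory" pc ON pc."ProductCategoryID" = psc."ProductCategoryID"
WHERE od.ctid = batch.ctid
  AND od."ProductID" = p."ProductID";
-- This assumes every OrderDetail row has a matching Product; rows without a
-- match would need to be handled separately.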

Adding a Table
When there is a requirement to add multiple columns to a table, the better approach would be
to add a different table. Let us look at this with the help of an example.

1. Let us say we have the OrderDetail table as shown below:

In the above table, the Order Number and Order Line columns are combined to make the
Primary Key. Now let us say we want to add a few columns, such as Amount, Tax, Discount,
and Final Amount, to the Order Detail table. Obviously, you can add these columns to the Order
Detail table, provided that the table is small in size and not a table which is frequently used
by the application users.

2. Next is to create a new table as follows:

This OrderDetailExtented table will have a one-to-one relationship with the Order Detail table,
as shown below:


During the Normalization process, discussed in Chapter 5, Applying
Normalization, we saw that if there is a one-to-one relationship, the two tables
can be combined into a single table. That approach is applied during the initial
design. However, we are proposing the separate table approach discussed
above for a table that is already in the live environment.

3. Then we will create a view. Though the data is now in two different tables, a view can be created
as shown in the following screenshot:

By designing a database view, end-users do not have to worry about joining the base tables
when selecting data. Instead, they simply access the view.
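As a sketch of this step (the exact column list depends on your tables, so the names here are assumptions based on the columns mentioned above), the view could be defined as follows:
CREATE OR REPLACE VIEW public."vw_OrderDetailFull" AS
SELECT od."OrderNumber",
       od."OrderLine",
       ode."Amount",
       ode."Tax",
       ode."Discount",
       ode."FinalAmount"
FROM public."OrderDetail" od
JOIN public."OrderDetailExtented" ode
    ON  ode."OrderNumber" = od."OrderNumber"
    AND ode."OrderLine"   = od."OrderLine";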

Since indexes are a key part of the database, maintaining them is also an important process.
Let us discuss index maintenance in the next section.

Maintaining Indexes

In Chapter 8, Working with Indexes, we discussed the importance of indexes for the
efficient usage of the database. In an ideal situation, it is better if you can identify indexes at
the design stage. However, practically, this is not possible. Most of the time, indexes
need to be added by looking at slowly running queries. When indexes are added, it is
important to ensure that they are not duplicated.

Let us look at the importance of identifying duplicate indexes and unused indexes in the
following sections.

Duplicate Indexes
As we discussed in Chapter 9, Designing a Database with Transactions, in transaction-oriented
systems we need to have an optimal number of indexes, as having more indexes will
hamper the performance of data writes.

Let us say we have an index on the Year and Month columns. If you need to have another index
on the Year, Month, and Day columns, it is advised to drop the Year, Month index and create an
index for Year, Month, Day.

Let us say we have an INCLUDE index as follows:


CREATE INDEX "IX_Category_INCL_Color"
ON public."Product" USING btree
("CategoryID" ASC NULLS LAST)
INCLUDE("FinishedGoodsFlag")

If you need to add another index for Category, FinishedGoodsFlag, then the above index
should be dropped and the following index should be added:
CREATE INDEX "IX_Category_FinishedGoodsFlag"
ON public."Product" USING btree
("CategoryID" ASC NULLS LAST,
"FinishedGoodsFlag" ASC NULLS LAST)

As discussed in Chapter 8, Working with Indexes, it is important to maintain indexes during
the production stage. Index maintenance is not only about de-fragmentation of indexes or
reindexing, but also about carefully analyzing the existing indexes before creating new ones.

Unused Indexes
Though we create indexes, some of them may not be used in production. It is
essential to identify those indexes periodically and remove them.


The following script will identify the unused indexes in PostgreSQL:
SELECT
relid::regclass AS table,
indexrelid::regclass AS index,
pg_size_pretty(pg_relation_size(indexrelid::regclass)) AS index_size,
idx_tup_read,
idx_tup_fetch,
idx_scan
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE idx_scan = 0
AND indisunique IS FALSE;

The following is the output for the above query when it was executed in the DVDRental
Database:

Typically, unused indexes are identified on a monthly basis. However, there can be reports
that run on a quarterly or annual basis. These reports may need indexes that are
listed by the unused index script. Therefore, do not simply drop the indexes even though
they appear in the unused index list.

Let us look at the design decisions that you have to make for database maintenance
tasks.


Handling of other maintenance tasks

When a database is moved to the production environment, there are maintenance tasks that
have to be done in order to keep the database healthy. Though these maintenance tasks are
assigned to database administrators, database designers have their own specific tasks.

Let us look at the tasks database designers have during standard
maintenance tasks such as Backup and Reindex operations.

Backup
Backup is the basic process that any database administrator will adopt as a basic Disaster
Recovery option. If there are bulk processes in the system, such as a day-close or a month-
end process, you need to ask the database administrators to reschedule the
backup time so that it does not clash with the bulk process.

Typically, you can request the database administrators to perform the database backup
after the bulk process is completed, so that the backup will contain the processed data.

Reindex
As we discussed in Chapter 9, Designing a Database with Transactions, reindexing is an
important maintenance task that should be carried out by the database administrators.
However, a database reindex will consume resources such as CPU, memory, and so
on, which may reduce database performance. Hence, as a database designer, you should
inform the database administrator about the best time to perform the database
reindex.

Further, there are tables in the system which are used around the clock. If you want to
reindex those tables, you need to choose a suitable time. There can be a situation where
there are multiple schedules of database reindexing, separating the heavy index processes to a
different time.

Summary
During this chapter, we discussed the involvement of a database designer during the
maintenance phase, or after the database is released to the production environment. We
mainly identified that database views can be used to support different database changes.
Though database views cannot be considered a performance improvement, they can be
utilized to improve the maintenance tasks.

We also identified triggers to support different activities such as auditing data and physical
table partitioning. However, we specified that triggers should be used only when necessary, as
triggers will increase the transaction time, which will reduce the application performance.

Another design task is the addition of columns after the database is released to
production. When adding a column, it is important not to disturb existing business
operations. Therefore, when there is a need to add a column to a large table, we
recommended that data should be updated batch-wise after the column is added. If there is
a requirement to add many columns, we recommended adding a different table with the
same primary key. For easy access, we said that it is advisable to create a database view by
joining the new and the previous table.

We further identified that you may not be able to capture all the index requirements
during the database design stage. Hence, database designers need to identify new
indexes after the database is released to production. Another task of the database
designer is to identify the unused indexes strategically so that an optimal set of indexes can be kept.
Further, we identified that the database designer has an important role to play, along with
database administrators, in designing an optimum schedule for the database maintenance
tasks.

In the next chapter, we will discuss the methods of designing scalable databases
and their best practices.

Questions
Why is database maintenance an important task for the database designer?

A database designer's work starts at the early stage of the project, even at the
requirement gathering stage. They will identify most of the facts at the
beginning of the project. However, real usage patterns will be found only when the
database is released to production and users are using it. During
this stage, there may be limitations on changes, as the database is loaded
with data and is being used. Therefore, special skills and attention are
needed from the database designers during the database maintenance period.

Why are database views recommended for accessing data through the application,
rather than accessing the tables directly?

If applications access the database through direct tables, application users
need to know the joining columns and conditions. In addition, if there are
database changes, application users have to change their code as well. To
avoid these types of issues, database views can be introduced. Since
application users are connecting through views, in case of modifications,
application users need minimal changes.

Why does a database view not improve query performance?

A database view is a logical layer. Whenever a database view is accessed,
the embedded code in the view is executed. This means that there is no
performance difference between executing the database view and executing the embedded
code separately. This is a very common question in database interviews,
as there is a myth among users that a database view will improve
database performance.

How can triggers be utilized as an option to partition a table?

When a large table exists in production, there can be a situation where you
need to do physical table partitioning depending on the year or any other
parameter. This can be a tedious task due to the fact that applications and users
are already connecting to this table. However, by physically separating the
table, table partitioning can be achieved with a database trigger. After
creating the separate physical tables, a trigger can be created on the base
table so that the data is inserted into the correct physical table.

What are the approaches that can be taken if multiple columns need to be
added to a large and heavily used table?

The better option is to create a separate table that has a one-to-one
relationship with the main table. Then a database view can be created by
joining these two tables. When data needs to be retrieved, application users
can access the view.

Why is it necessary to identify unused indexes?

Though indexes improve the select or data retrieval processes, too many
indexes will decrease the data write performance in a transaction-oriented
database. Therefore, as a database designer, you need to identify the unused
indexes periodically.

What are the challenges of maintaining unused indexes?

In an application, there can be monthly, quarterly, or annual reports. There
can be indexes that support those infrequent reports. Depending on the time of
the unused index identification, these indexes will be listed as unused
indexes. Therefore, before choosing to drop unused indexes, index usage
should be thoroughly analyzed.

What is the involvement of database designers in database maintenance tasks
such as Database Backup and Reindex?

Database backup and reindexing are resource-intensive tasks for a database.
During these operations, database operations will become slower. Due to this,
as a database designer, it is important to schedule heavy data
processes, such as month-end salary processing, so that they do not conflict with
the database maintenance tasks.

Further Reading
CREATE VIEW: https://www.postgresql.org/docs/9.2/sql-createview.html
Triggers: https://www.postgresql.org/docs/9.1/sql-createtrigger.html
Trigger Functions: https://www.postgresql.org/docs/9.2/plpgsql-trigger.html
Routine Database Maintenance Tasks: https://www.postgresql.org/docs/9.0/maintenance.html
Finding Unused Indexes: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783555338/6/ch06lvl1sec72/finding-unused-indexes
11
Designing Scalable Databases
After ten chapters, we have now obtained sound knowledge of database design. In the
database design discussions, we were mainly concerned with providing the functional
aspects of the customers' requirements. As a database designer, your task is not only to
deliver a database that fulfills the functional requirements of the customers but also to
provide for the non-functional requirements. Therefore, in Chapter 8, Working with Indexes, in the Non-
Functional requirements of Databases section, we looked at the non-functional
requirements of the database extensively in order to fulfill the efficient usage of the
database. In that discussion, we covered the performance aspects of the database
by means of index concepts. After discussing transaction management in order to
maintain data integrity, in the last chapter we discussed the maintenance aspects of the
database. As a database designer, you need to design your database to support future loads.
Apart from the user load, the data volume of the database will typically
increase in an exponential order. To facilitate this, you need to design a scalable
database.

This chapter discusses the approach towards a scalable database design. We also move on
to covering database reliability.

In this chapter we will cover the following topics:

Introduction to Database Scalability
Selection of Scalability Methods
Introduction to Vertical Scalability
Understanding Horizontal Scalability
In-Memory Tables for High Scale
Design Patterns in Database Scalability
Modern Scalability in Cloud Infrastructure

Introduction to Database Scalability


Database scalability is the ability of the database to cater to customer requests even with
the increasing demands of the users. When a database designer is designing a database, they
may not have an idea of the future load. Further, future usage patterns and requirements
will change. For example, when designing a database, users might be located
in only one location. However, after another 2-3 years, there may be users who are
physically located in a different geographical location. Since your database was designed only
for users at a single location, it may not be able to cater to users in
different locations.

Database scalability has two dimensions: data and requests. We will learn about
them in the following sections.

Data
It is needless to say that data is one of the key assets of an organization. Typically, data
grows in an exponential order. Research has suggested that we generate as much as half of the
existing data every year. Further, other research has suggested that every ten
minutes we generate more data than was generated from prehistoric times until 2003. These two
observations tell us how data volume is increasing. However, it is not only the volume that database
designers have to worry about.

There are other factors, as shown in the following screenshot:


The rate of database inserts and modifications will change over time due to the increase in
transactions and users; this is called Velocity. Data might come in different types, such as
relational and non-relational; in databases, we need to deal with different types
of data such as relational, text, and so on, and this is called Variety. Veracity means the
accuracy, or quality, of the data. All of these aspects are challenges to
database designers. In the case of scalability, database designers have to plan for the future,
which is an additional and huge challenge.

Let us see why user requests are a key factor in database scalability along with data.

Requests
Users are the other key factor when deciding the scalability of the database. User requests
can change in multiple ways. Obviously, the number of users will increase over time.
During design, designers have to design the database with the view that the number
of users will increase. Not only the number of users, but also the users' activity will change.
For example, at the time of database design, you may have only a small number of users in one
geographical location. Due to business expansions or acquisitions, users' locations may
increase. This means that your database, which was designed for users in one location, now
has to cater to users in different locations.


Apart from the number of users and their locations, the users' usage patterns will differ
over time. For example, at the start, the database may cater to a report that runs on a
daily basis. Over time, the same report may need to be processed every hour. When designing a
report, it is essential to understand that the report execution frequency may increase.

During the database design requirement gathering phase, it is essential to
understand the frequency of the business processes. During the design,
you need to design the database so that it can withstand increased
frequency as well.

Having understood the need for database scalability, let us discuss the major types of
scalability methods that are used in database design.

Selection of Scalability Methods

There are two popular database scaling types that are practiced by database designers in
the industry. These two scaling types are:

Horizontal Scaling
Vertical Scaling

The following screenshot shows how databases can be arranged in the different types of
scaling:


As you can see from the above screenshot, horizontal scaling distributes the same data, or
partitions of the data, across multiple databases, while vertical scaling increases the capacity
of a single database server.

Let us look at vertical scalability in the following section.

Introduction to Vertical Scalability


This approach involves adding more virtual or physical resources to the server that is
hosting the database, or to the database itself. As we are aware, CPU, memory, and storage are
the main components that can be vertically scaled. Technically, CPU, memory, and storage
can all be increased. However, there is always an optimum value, as there is a cost involved
with every hardware component.

Normally, Performance Monitor, also known as perfmon, is used to evaluate these hardware
parameters on the Microsoft platform.

The Microsoft Windows Performance Monitor is a tool that system
administrators can utilize to measure how programs execute on their
computers or servers. This is a time-based tool and it can be used in
real time. You can refer to the details of perfmon at
https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/perfmon.

In this tool, there are a lot of counters to choose from, as shown in the following screenshot:

As you can see in the above screenshot, you can measure the CPU for the postgres process
only. By looking at this measure, you can decide whether you need to increase the CPU.
Similarly, memory and storage requirements can be evaluated with this tool.

In most incidents, database administrators look at CPU, memory, and storage
to improve vertical scalability. However, apart from the server level, there are database-
level configurations that can be done to improve vertical scalability.
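As a hedged example of such database-level settings, parameters such as shared_buffers and work_mem are commonly reviewed when scaling vertically; the values below are placeholders, not recommendations:
-- Inspect the current values
SHOW shared_buffers;
SHOW work_mem;

-- Adjust them; shared_buffers requires a server restart to take effect
ALTER SYSTEM SET shared_buffers = '2GB';
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();  -- applies settings that do not need a restart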


As we discussed in Chapter 8, Working with Indexes, we used indexes to improve
performance. Indexes can be considered a method to increase vertical scalability.
Further, proper schema design is done to achieve vertical scalability. In Chapter 4,
Representation Models, we understood that a transaction-oriented database requires an Online
Transaction Processing (OLTP) database design, while an analytics-oriented database
requires Online Analytical Processing (OLAP). We discussed normalization methodologies
in Chapter 5, Applying Normalization, so that vertical scalability can be achieved.

When applications are managed by external vendors, the most likely
option is vertical scaling. Since vertical scaling is applied to the server or
to the database, the application will not be impacted.

One of the most common ways of achieving vertical scalability is table partitioning, which
will be discussed in the following section.

Table Partitioning
Table partitioning splits the data of a table into multiple smaller physical tables.

There are benefits to partitioning tables, such as:

Query performance is improved dramatically, which helps to improve vertical scalability.
Bulk loads and batch deletes are much faster.
Infrequently used data can be moved to cheaper hardware to reduce cost.

There are two methods of table partitioning in PostgreSQL: the Trigger method and the Partition
By method. We will learn about them in the following sections.

Trigger Method
In Chapter 10, Maintaining a Database, we discussed how partitioning can be incorporated
using a table trigger.

In this method, as shown in the following screenshot, data is diverted to separate physical tables:


When data is accessed, depending on the request, only a single table is accessed. If there is a
request to return data from multiple tables, the data can be returned to the end-user through
a view over those tables.

These partitioned tables are mandated to hold only a set range of data. For
example, the Order Detail 2018 table should hold Order Detail records from
2018-01-01 to 2018-12-31. To ensure data integrity, it is recommended
to implement a CHECK constraint on the date column. This will
ensure that there won't be any records outside the allocated date range,
even if some request directly tries to insert records into the Order Detail 2018
table.
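A brief sketch of such a constraint is shown below; the table and column names follow the example in this box and are assumptions:
ALTER TABLE public."OrderDetail_2018"
    ADD CONSTRAINT "chk_OrderDetail_2018_OrderDate"
    CHECK ("OrderDate" >= DATE '2018-01-01' AND "OrderDate" <= DATE '2018-12-31');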

However, it is important to note that introducing the trigger will increase the transaction
duration. If the trigger is more complex, there can be more transaction
timeouts.

Partition By Method
Though we used triggers as a workaround to implement partitioning, there are in-built
mechanisms in most database technologies. In PostgreSQL too, there is a native
mechanism to implement table partitions. Let us look at the usage of table partitioning by
means of an example. In the following steps, the OrderHeader table will be partitioned
based on OrderDate. Let us say we will partition the OrderHeader table on a yearly
basis.

1. The following script will create the OrderHeader table, defining OrderDate as the
partition column:
CREATE TABLE public."OrderHeader"(
OrderID int not null,
OrderDate date not null,
CustomerID int,
OrderAmount int
) PARTITION BY RANGE (OrderDate);

2. Next, we will define the partition ranges for the OrderHeader table as shown in
the following code script:
CREATE TABLE public."OrderHeader2018" PARTITION OF "OrderHeader"
FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

CREATE TABLE public."OrderHeader2019" PARTITION OF "OrderHeader"
FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

CREATE TABLE public."OrderHeader2020" PARTITION OF "OrderHeader"
FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

After the above scripts are executed, three partitions, OrderHeader2018, OrderHeader2019,
and OrderHeader2020, are added to the OrderHeader table as shown in the following screenshot:


With these partitions in place, you can query the OrderHeader table or the partition tables
individually.

3. Indexes should be created on each partition for the performance benefits, as
shown in the following code script:
CREATE INDEX ON public."OrderHeader2018" (OrderDate);
CREATE INDEX ON public."OrderHeader2019" (OrderDate);
CREATE INDEX ON public."OrderHeader2020" (OrderDate);

4. When records are inserted into the OrderHeader table, the data will be stored in the
relevant partition. The following script shows the data distribution for each
partition:
SELECT
    nmsp_parent.nspname AS parent_schema,
    parent.relname AS parent,
    nmsp_child.nspname AS child_schema,
    child.relname AS child,
    pg_stat_user_tables.n_live_tup AS RecordCount
FROM pg_inherits
JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
JOIN pg_class child ON pg_inherits.inhrelid = child.oid
JOIN pg_namespace nmsp_parent ON nmsp_parent.oid = parent.relnamespace
JOIN pg_namespace nmsp_child ON nmsp_child.oid = child.relnamespace
JOIN pg_stat_user_tables ON pg_stat_user_tables.relname = child.relname
WHERE parent.relname = 'OrderHeader';

The following screenshot shows the data distribution:

Maintaining partitions is also an important task. If you want to delete the data for 2018
without partitioning, a DELETE statement has to be used. If DELETE is used, every record is
deleted individually, which will take a long duration and will consume data IOPS heavily.
However, in the case of partitioning, removing the data is only a metadata change.

Let us see how data can be deleted much more easily with partitioning:

1. To remove a partition, you can simply drop it as shown in the following script:
DROP TABLE public."OrderHeader2018";

2. However, with DROP TABLE the data will be lost. You can instead remove a partition
from the partitioned table while keeping its data using the following script:
ALTER TABLE public."OrderHeader" DETACH PARTITION "OrderHeader2019";

Partitions are often used in large transactional tables. Further, in the case of data
warehousing, partitioning can be incorporated in large fact tables as well. It is essential to
determine the range for the partition table at the initial stage. Though this can be altered
later, deciding it at the design stage will reduce unnecessary maintenance and
downtime tasks.

LIST is another type of partitioning available in PostgreSQL, other than the RANGE partitioning
that we discussed before. By using LIST partitions, you can partition tables by specific
values. This is very handy if you want to partition your data location-wise.
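A minimal sketch of LIST partitioning is shown below, assuming a hypothetical Customer table partitioned by a Region column; the names are illustrative only:
CREATE TABLE public."Customer"
(
    "CustomerID" integer NOT NULL,
    "Name"       text,
    "Region"     text NOT NULL
) PARTITION BY LIST ("Region");

CREATE TABLE public."Customer_Asia"   PARTITION OF public."Customer" FOR VALUES IN ('Asia');
CREATE TABLE public."Customer_Europe" PARTITION OF public."Customer" FOR VALUES IN ('Europe');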

Vertical Partition
In some tables, there are infrequently accessed columns. If these columns are large in size, it is
better to store them outside the main table. For example, the Product table may need to
include attributes like Product Description, Product Images, and so on.

The following screenshot shows how tables are vertically partitioned:

Since the Product Description and the Product Images are infrequently used attributes and
they are large in size, those attributes can be separated as shown. Typically, this type of
partitioning is mostly done on master tables such as Customer, Product, Supplier, and so
on. In the case of transaction tables, customer reviews can be vertically partitioned.
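As a hedged sketch of this idea (assuming ProductID is the primary key of the Product table; the column names here are illustrative), the large, infrequently used attributes can be moved to a separate table with a one-to-one relationship:
CREATE TABLE public."ProductExtended"
(
    "ProductID"          integer PRIMARY KEY
        REFERENCES public."Product" ("ProductID"),
    "ProductDescription" text,
    "ProductImage"       bytea
);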

You can combine Vertical, Horizontal, and Hybrid partitioning to suit
your environment. For example, the Product table can be horizontally
partitioned into frequently and infrequently selling products. Then each
product list can be vertically partitioned as shown in the above example.

Having discussed the options for vertical scalability, let us look at its advantages and
disadvantages.


Advantages in Vertical Scalability

From the application development perspective, vertical scalability is the much-preferred
option. When vertical scalability is changed, there is no change required to the
application. For example, adding an index to a table or adding more CPU to the server
does not require any modifications to the application.

In addition, from the data center perspective, you don't need additional physical servers.
Since you don't need additional physical servers, additional cooling systems are not needed
either. Due to these factors, you don't need additional hardware in vertical scaling.

Disadvantages in Vertical Scalability

In some database technologies, such as Microsoft SQL Server, the licensing cost is tightly
attached to the number of CPUs. When you increase the number of CPUs, the licensing cost
will increase. There is also a limitation on hardware upgrades. For example, you cannot
increase memory to multiple terabytes, as some configurations are not compatible.

After discussing vertical scalability, we will discuss horizontal scalability in the
next section.

Understanding Horizontal Scalability

Horizontal scalability is one of the most sophisticated and modern techniques being used
in the industry. When you are planning to leverage the features of horizontal scalability, it is
essential that proper design is done, whereas for vertical scalability rigorous design
discussions are not needed.

There are different configurations for horizontal scalability. We will be
discussing more horizontal scalability options in Chapter 13,
Distributed Databases.

Horizontal scaling is about adding more nodes and more databases. There can be
many ways of configuring horizontal scalability. One of them is shown in the following
screenshot:


As seen in the screenshot, the primary database is replicated to multiple secondary
databases. There are multiple replication scenarios in different database technologies.
You can replicate either the entire database instance or only several tables. Further, you can
replicate selected columns or rows.

During replication, you can configure it to be Active-Passive replication. Active-Passive
replication means that the secondary database is only used as a read-only instance
and you are unable to perform data writes on the secondary instance. The application
layer will send read-write requests to the primary database, while read-only requests
will be diverted to the secondary databases.

Database Sharding is another horizontal technique that can be utilized in database design.

We will discuss the strategies and techniques of database sharding in
Chapter 13, Distributed Databases.

In the case of sharding, data is not replicated but data is physically partitioned. As shown
in the following screenshot, data is partitioned by the customers:


As shown in the above screenshot, each customer's data is grouped into one database. When a user
request comes to the primary database, depending on the customer, the request will be
diverted to the relevant database. The challenge in database sharding is replicating
master data to the shards. If you have a large number of shards, master data has to be
replicated to all of them. There are further complications if there are customer movements
between shards. Though this is not a frequent task, as a database designer you need to plan
for such tasks.

Let us look at the Advantages and Disadvantages of Horizontal Scalability in the following
sections.

Advantages in Horizontal Scalability

Typically, horizontal scalability is much cheaper since individual nodes are smaller.
Upgrades are simple, as it is a matter of adding another node. In theory, you can add infinite nodes to
achieve horizontal scalability. Further, since there are multiple nodes, high availability and
disaster recovery can be managed easily.


Disadvantages in Horizontal Scalability

The obvious disadvantage of horizontal scalability is the complexity of the environment.
When you have multiple nodes, there need to be complex techniques to synchronize data
between these nodes. Due to this complexity, there can be data latency between the nodes, which
will lead to inconsistency issues. To avoid higher latency, you need to invest in
network resources.

Apart from the configuration complexity, application development is complex as well.
Especially, transaction management implementation is complex, as a transaction has to be
distributed between multiple nodes of database instances. There are maintenance
difficulties too, such as bug fixes, indexes, and patches that have to be applied on all nodes.

After looking at vertical and horizontal scaling, let us look at how to improve database
scale using in-memory databases.

In-Memory Databases for High Scale


In databases, we always retrieve data to the application. If we read the same data multiple times, that
will take unnecessary round trips to the database. As we are aware, reading from memory
is faster than reading from disk.

Most modern databases have a built-in caching mechanism to improve performance. Therefore, when choosing a database, you need to find out whether the selected database technology is capable of handling the data cache efficiently.

Though reading from memory is faster, it should be noted that memory devices are more costly than disks. Therefore, as a database designer, it is a challenge to choose what should be cached and for how long.

Let us look at a basic architectural diagram for memory-enabled data access, as shown in the following screenshot:


First, the cache is checked to see whether the requested data already exists there. If it is available, the data is read from the cache. If not, the data is loaded into the cache from the database.

As we discussed in Chapter 8, Working with Indexes, advanced query optimization techniques can be used to improve query performance. If there is frequently accessed data, we can move it to memory-backed tables to improve performance. Based on the performance requirements and the data access pattern, we can determine the caching strategy, such as what data (tables, rows, columns) to cache, how long to retain it, and when to clear it. There are different algorithms to update the cache when the underlying data is updated in the database; without them, consistency will not be maintained.
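
PostgreSQL does not provide true in-memory tables natively, but a materialized view can play a similar caching role for frequently read, expensive queries: the result is pre-computed and refreshed on a schedule that matches the acceptable staleness. The table and view names below are illustrative assumptions.

-- Cache an expensive aggregation that is read frequently
CREATE MATERIALIZED VIEW daily_sales_cache AS
SELECT sales_date, SUM(amount) AS total_amount
FROM public.sales
GROUP BY sales_date;

-- Refresh the cache when the underlying data changes (or on a schedule)
REFRESH MATERIALIZED VIEW daily_sales_cache;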

Like database design patterns, there are design patterns for database Scalability that will be
discussed in the following section.


Design Patterns in Database Scalability


Similar to software design patterns and database design patterns, there are database scaling patterns for efficient database scaling. We are looking at scalability patterns for relational database management systems and will not be focusing on design patterns for NoSQL databases.

Let us look at a few commonly used design patterns.

Load Balancing
One of the most common design patterns for database scalability is load balancing. In the load balancing pattern, a load balancer decides which database instance each request should access. There are several techniques to select a database instance from the available set.

Those techniques are listed in the following table:

Load Balancing Technique | Description
Random | Any database instance is selected, without any specific logic.
Round Robin | Every database instance gets an opportunity to serve the application in turn.
Least Busy | The least consumed database instance, from a CPU or number-of-connections perspective, is selected.
Customized | In practice, it might not be possible to replicate every database object, and some instances might be read-only while only a small number of read-write instances are available. Therefore, depending on the request type, you have to divert requests to the relevant database instances.
The most common of these techniques is Round Robin. However, with the round-robin technique, there can be situations where 100% scalability will not be achievable.

Master-Master or Multi-Master
We use data duplication to achieve database scalability, as discussed in the Horizontal Scaling section. For data replication, the most common technique is known as Master-Slave or Publisher-Subscriber replication. In this technique, one node is responsible for data publication, and any number of nodes subscribe to the publisher. However, the publisher is considered a single point of failure. A single point of failure means that the entire system can fail due to a failure at one place. This phenomenon is shown in the following screenshot.


In the above configuration, even if Server B fails, Server A can still communicate with Server C. However, if Server A fails, the entire system fails.

To avoid the single point of failure, Multi-Master replication or Peer-to-Peer (P2P) replication can be used, as shown in the below screenshot.


In the above Multi-Master configuration, even if one server fails the other two servers will
be able to communicate.

Connection Pooling
Though we tend to think that opening a database connection is a very simple operation, it is actually an expensive one. Connection pooling is used to keep database connections open so that they can be reused. This technique avoids repeatedly reopening the network connection, validating server authentication, checking database authorization, and so on.

Without connection pooling, a connection request takes 40-50 milliseconds, whereas with connection pooling it takes only 1-5 milliseconds. Further, connection pooling reduces the chance of the server crashing. In PostgreSQL, there is a limit on the number of connections; however, by using connection pooling, many client connections can be served without crashing the PostgreSQL server.
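
For example, PgBouncer is a widely used connection pooler for PostgreSQL. The following pgbouncer.ini fragment is only a rough sketch; the database name, port, and pool sizes are assumptions to be tuned for your environment.

[databases]
salesdb = host=127.0.0.1 port=5432 dbname=salesdb

[pgbouncer]
listen_port = 6432
auth_type = md5
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20

With such a configuration, applications connect to port 6432 instead of PostgreSQL directly, and PgBouncer reuses a small pool of server connections on their behalf.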

In the modern era, most of the organisations are exploring opportunities in the Cloud
infrastructure. Let us look at how Cloud can be utilized to improve database scalability.


Modern Scalability in Cloud Infrastructure


Cloud has become a buzzword, and most organizations are looking at options for leveraging cloud infrastructure for their applications. Most systems have a dynamic need for scalability. This means that during some periods of operation, you need to scale up the configuration of the system. With a classical on-premise configuration, if you increase the capacity, it is difficult to bring it back to the previous configuration once the heavy load is over.

In the case of Extract-Transform-Load (ETL) processing, or monthly processing such as month-end salary calculation, databases need more CPU and memory. However, these processes do not last long. If you scale up on-premise servers to a high configuration, you will not easily be able to scale them down again, which means that during low-load times the return on investment is low. Since the cloud offers both scale-up and scale-down options, you can provision the necessary configuration only for the required time.

Most cloud vendors support different cloud infrastructure models, such as Infrastructure as a Service (IaaS), Software as a Service (SaaS), and Platform as a Service (PaaS). In the case of IaaS, since you are responsible for the hardware configuration, you can increase or decrease it when you need to improve scalability. In PaaS infrastructure, configurations are dictated by the vendor. However, since most vendors offer different PaaS scalability tiers, you have the option of selecting between them when the need arises.

Field Notes
A database was chosen to store sales transactions. In this database, purchase orders, orders, invoices, and payments were stored along with customer and product data. Since this was a highly transaction-oriented system, the database was highly normalized. After a few years, management requested various types of reports. Many of those were analytical reports, such as monthly comparison reports, a debtors ageing report, and so on. As these reports need a lot of table joins over a large data volume, there were two issues. The first was that the reports took a long time to process; the other was that transactions were delayed while the reports were being processed. To avoid both issues, it was decided to create a separate reporting database with de-normalized data structures. On a daily basis, data was transferred to the reporting database using an ETL mechanism. Since these are analytical reports, the management was happy to retrieve data with a maximum of one day's latency.


An organization is heavily dependent on dashboards for its key decision making. At the design stage, these dashboards were built to support management officers in the Indian region. However, due to business expansion, management has to travel to countries outside the Indian subcontinent, such as Australia, countries in Europe, and the USA. During these visits, they have to access those management dashboards in order to make key decisions. Due to server proximity, they found it difficult to access the dashboards. When this was relayed to the technical teams, the issue was narrowed down to the latency of database access. Since the dashboards were created a few years back, drastic changes were not acceptable. Since the dashboards required only a subset of the data set, that data was replicated to servers in those geographical locations. When a user accesses the dashboards from outside the Indian subcontinent, they read data from the local server rather than from the primary server. This led to an improved user experience for management.

Summary
In this chapter, we identified that there are two main types of scalability: vertical and horizontal. We found that vertical scalability is the easiest option from the application point of view, as it requires minimal application changes. With vertical scalability, we introduced more CPU, memory, and disk storage, as well as indexes and partitioning. We identified that there are different types of partitions, such as vertical, horizontal, and hybrid partitions. In PostgreSQL, there are two native partition types, RANGE and LIST. Apart from the native partition options, we looked at how triggers can be utilized to implement partitioning.

The modern scalability technique is horizontal scalability, where the database is duplicated. Replication and sharding are the most common techniques in horizontal scalability. In this scalability method, there can be read-write nodes as well as read-only nodes, and depending on the request type, connections should be diverted to the relevant database. In-memory caching is another modern technique that we discussed under scalability.

We further discussed a few database scalability design patterns, such as load balancing, peer-to-peer replication, and connection pooling. Since most organizations have set their vision towards cloud infrastructure, we discussed how the cloud can be utilized to improve database scalability, and saw that IaaS and PaaS cloud infrastructure can be used effectively for this purpose.

In the next chapter, we will dedicate our discussion to another important non-functional requirement: database security.


Questions
Why is database scalability a challenging aspect of database design?
Database scalability is the ability of the database to cater to customer requests even as user demand increases. These increasing demands may not be possible to visualize at design time. Further, usage patterns can change dramatically due to business needs. Since database design is done at an early stage, it is a challenging task to design a database for future scalability.

In which instances is vertical scaling better than horizontal scaling?
There can be instances where database objects cannot be separated due to the application design. Further, due to licensing concerns, it might be difficult to separate the databases. This means that there are instances where vertical scalability is the only option you have.

How can partitioning be utilized in PostgreSQL to improve database scalability?

There are two native partitioning options available in PostgreSQL: RANGE and LIST. With RANGE partitioning, you can define ranges, such as a date range, for a partition. With LIST partitioning, you define partitions by values, such as location. In both types of partitioning, additional partitions are created as the data volume increases. For example, with RANGE partitioning, an additional partition can be created on a yearly basis so that data access is easier. With LIST partitioning, when an additional location is added, that data goes into another partition, and data access is improved in the same way. Apart from the access point of view, there are maintenance advantages to table partitioning as well. In addition, vertical partitioning can be used to separate infrequently accessed columns into a different table. A minimal example follows.
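
As a brief illustration of declarative RANGE partitioning in PostgreSQL 10 and later (the table and partition names are assumptions):

CREATE TABLE sales (
    sale_id    bigint,
    sale_date  date NOT NULL,
    amount     numeric
) PARTITION BY RANGE (sale_date);

-- One partition per year; add a new partition as new years of data arrive
CREATE TABLE sales_2019 PARTITION OF sales
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
CREATE TABLE sales_2020 PARTITION OF sales
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');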

What are the advantages of using horizontal scalability?

With horizontal scalability, we duplicate the database using different techniques. Database performance can be improved by considering the future load on the database. When the number of users and user requests increases, additional database instances can be added. Therefore, horizontal scalability can achieve a greater degree of scale.

How can the cloud be used to improve database scalability?

In IaaS and PaaS cloud infrastructure, most cloud vendors provide the ability to modify the configuration when the need arises. After the heavy load period is over, you can switch back to the previous configuration. With this ability, users have the option of improving the configuration whenever they need to, at lower cost.

What is the advantage of using connection pooling?

When a connection is made to the database, it needs to open a network session, validate the server authentication, and verify the database authorization. These tasks typically take 40-50 milliseconds. If connections are cached or pooled, a cached connection can be reused, and establishing the database connection then takes only 1-5 milliseconds.

Further Reading
Replication: https://www.howtoforge.com/tutorial/postgresql-replication-on-ubuntu-15-04/
Table Partitioning: https://www.postgresql.org/docs/10/ddl-partitioning.html

12
Securing a Database
In Chapter 8, Working with Indexes, we discussed how important it is to implement the non-functional requirements for a database. In that discussion, we noted that identifying non-functional requirements is challenging, mainly because your clients will not be able to explain them to you. Therefore, it is your duty to explore the non-functional requirements for the database. During that discussion, we identified several non-functional requirement factors, such as performance, scalability, high availability, and security.

Although security is one of the most important aspects of a database, it is often neglected during the database design phase. This is mainly because database designers are more concerned with schema design than with the non-functional requirements. Since database security is an integral part of a database, we have dedicated this chapter to the security aspects of database design. This chapter will help you understand how a database can be secured through a step-by-step approach.

In this chapter we will cover the following topics:

Why security is important in database design


How data breaches have occurred
Implementing authentication
Implementing authorization
Avoiding SQL Injection
Encryption for better security
Data auditing
Best practices for security
Field notes

Why security is important in database design

Though it is needless to stress that security is a paramount aspect of any system, most of the time it is neglected at the database design stage. There are many consequences that can occur when security is ignored during the design stage:

1. Data loss
2. Loss of availability
3. Loss of privacy
4. Loss of reputation
5. Loss of Income
6. Penalties

If an unauthorized user enters the system, they can erase part of the data. When it comes to erasing data, it can be one or more records, one or more tables, or an entire database. No matter how large the erased portion is, it is a data loss.

If unnecessary users log into the database, they will consume a lot of resources such as CPU and memory. Because of these unauthorized users, authorized users may not be able to access the database, and it becomes unavailable to them.

In sectors such as healthcare, data privacy is a must, as personal records are stored in the databases. Beyond the health sector, data such as dates of birth (DOB) and social security numbers (SSN) has to be protected. If this data is compromised, your clients' privacy is violated.

In today's environment, brand reputation is an important aspect of any business. If your business has a history of data loss and data breaches, it will be difficult to maintain its brand image and reputation.

When you lose your reputation, you will automatically tend to lose revenue. Apart from the loss of revenue, there can be situations where your organization is penalized legally for not taking adequate security precautions. This means you face a negative financial impact in two ways.

For these reasons, it is essential to pay attention to data security as a proactive measure to secure the data in the database. However, it is also important to plan for what to do if a data breach does occur. For example, you need to take actions to avoid data loss, but also plan for data loss with measures such as disaster recovery systems. Similarly, as a database designer, you need to take actions to avoid unauthorized access, but you also need to implement methods to detect unauthorized access and recover from it.

Let us look at the instances of data breaches historically and their properties.

How Data Breaches have Occurred


Following is the chart for the types of data exposed in data breaches in 2019.

Source: Help Net Security

The above chart shows that date of birth, SSN, and address are the prime targets of data hackers. This indicates that you need to put extra security measures in place when storing such data.

There is a common belief that most data breaches are external. Even though the majority of data breaches are indeed external, there is an upward trend in internal breaches, as shown in the following screenshot.


Source: Verizon Data Breach Investigation Report

This screenshot shows that in 2018, 34% of security breaches were internal, up from 25% in 2016. This upward trend underlines the need for database designers to implement security in databases that protects not only against external parties but against internal parties as well.

The following screenshot shows the per capita cost of data breaches by industry in 2018.


Source: Data Breach Report, 2018

The health sector has the highest breach cost, close to double that of the financial sector, which is the second highest.

These different factors indicate the severity of the security aspect of database design.

In database design, there are three main aspects of database security:

1. Authentication
2. Authorization
3. Encryption

Let us discuss Authentication in the following section.


Implementing Authentication
Authentication refers to identifying the user, or verifying that the user is who they claim to be. In typical applications, there are several ways of identifying the user, such as passwords, bar codes, swipe cards, fingerprint scans, Radio-Frequency Identification (RFID) cards, and so on. For a database, the most common form of authentication is a user name and password, as the other authentication modes are generally used by applications.

As we know, we need to provide a user name and a password to log into an operating system. Similarly, we need to connect to the database to access data. In some Database Management Systems (DBMS), you can allow operating system users to connect to the database. This configuration reduces the number of separate users to manage. In contrast, some DBMSs keep their own list of users inside the database system.

In Microsoft SQL Server, there are two types of authentication: Windows Authentication and SQL Authentication. When configuring a database server, you can define which type of authentication is allowed. In Microsoft SQL Server, this can be either Windows Authentication only or both; allowing both is called Mixed Authentication.

Apart from the user name and password, there are additional configurations for managing users, listed below.

Limit on the number of connections - Sometimes the same user account is used to connect to the database from different applications. Allowing too many connections may lead to difficulties in user management.
Allowable login times - You can limit a user to logging into the system at specific times. For example, you can specify that a user can connect to the database on weekdays between 9 AM and 5 PM.
Not a login account - Some user accounts are created only for applications to connect with. Where an account does not need interactive access, it is advisable to mark it as not a login account so that unnecessary logins can be avoided.
Superuser - A user who can do anything on the server. Since this user can perform any action on the server, it is important to create as few superusers as possible.

Let us see how password policies can be set.


Password Policies
The password is an important component of authentication, as it is the only secret the user holds to connect to the database. Due to this importance, several measures are taken to protect passwords from being compromised.

The main way to protect a password is to ensure that it is strong. Various databases implement several rules to keep passwords strong. Typically, a strong password needs to satisfy all of the following rules:

The password should be of sufficient length. Typically, a password should be eight or more characters long.
The password should not contain the complete user name or any part of it. For example, if the user name is john, the password should not be something like john123.
When changing the password, it should not be a password that was used before.
The password should contain different combinations of characters. A strong password should contain at least three of the following:
lower case characters (a, b, c)
upper case characters (A, B, C)
numbers (1, 2, 3)
special characters (!, @, #)

Apart from the password settings, there are other tasks that should be carried out by the database designer or the database administrator.

Reset the password on first login. Since user accounts are created by the database administrator, the administrator knows the password that was set for the created user. Therefore, the password must be changed by the user at first login.
If the user account is created for a real user, not for an application to connect to the database, it is essential to set a password expiration date.

Some of these options may not be available in every database technology. However, even if these options are not available, as a database designer you need to implement workarounds. For example, if you cannot set an expiry date for a user's password, you need to implement a separate monitoring mechanism to notify the system administrators about passwords that should have expired.

Let us look at how users can be created in the PostgreSQL in the following section.


Creating a User in PostgreSQL


In PostgreSQL, user creation is a simple operation. The following is the Login and Group Roles creation option in pgAdmin.

The above screenshot shows the Login and Group Roles available on the server. From the icons, logins and group roles can be clearly distinguished.

In PostgreSQL, authenticated users are referred to as Logins.

Let us see the steps in the creation of logins in PostgreSQL.

1. First, define the user name. A naming best practice is suggested in the Best Practices for Security section. Optionally, you can add a comment for the created user, as shown in the below screenshot.


2. Define the password, the account expiry date, and the connection limit. How to set a strong password was discussed earlier in the Password Policies section.

In the above screen, the connection limit can be set as well. Unlimited connections are allowed when the Connection Limit is set to -1, which is the default setting. When a password is not provided, the user is treated as a role, which will be discussed in the Implementing Authorization section.

3. Additional configuration can be done for the logins as shown by the following
screenshot.


As shown in the above screen, you can configure whether this login can be used to log into the database instance and/or whether it is a superuser.

Superuser is a special property of a login. When a login is given the superuser permission, that login can do anything on the database server. Further, any denied permissions have no effect, because the login is a superuser. Due to this behaviour, it is important to pay close attention when creating a superuser login.

We will discuss the other options for login creation under authorization.

4. The following script can be used to create the login.
CREATE ROLE dba_admin WITH
LOGIN
NOSUPERUSER
NOCREATEDB
NOCREATEROLE
INHERIT
NOREPLICATION
CONNECTION LIMIT -1
VALID UNTIL '2020-09-30T20:26:54+05:30'
PASSWORD 'xxxxxx';
COMMENT ON ROLE dba_admin IS 'This is the account for the database administrator';

Please note that when scripting out the login account, the password is not scripted, as a security measure.

The next stage of securing data in the database is authorization, which we will look at in the following section.

Implementing Authorization
As we discussed in the Implementing Authentication section, authentication allows a user to log into the database system by verifying their identity. Authorization, on the other hand, grants the authenticated user permissions on different objects in the database.

Authorization is a security mechanism used to determine the access levels of users to the database and its data. Authorization is normally preceded by authentication for user identity verification.

Database authorization is the process of establishing the relationship between database objects and users through different authorization modes such as SELECT, INSERT, UPDATE, and so on. This definition means there are three components in database authorization. They are:

1. User - The authenticated user who is requesting access to the database, database objects, or data.
2. Objects - In the context of databases, there are many kinds of objects, such as tables, views, procedures, and so on. Objects can be narrowed down further to columns and rows.
3. Operation - The type of action that the authenticated user is requesting. It can be SELECT, INSERT, UPDATE, or DELETE.

GRANT is the common command in most database technologies for granting authorization to different users. We will discuss the GRANT options in PostgreSQL in the Providing Authorization in PostgreSQL section.

Having discussed the basics of database authorization, let us discuss a few important aspects of it. The first of these is roles, discussed in the next section.


Roles
The basic concept of database authorization is that the authenticated user should be given only the privileges necessary to do what they need to do in the database, and no more. Even in a medium-scale organization, there can be twenty to thirty or more users. Most of these users share common sets of permissions; for example, developers may have read and write permissions, whereas administrators have a different set.

If you have to configure authorization user by user, it becomes a tedious task and runs into many maintenance issues.

Instead, group roles are assigned the different permissions, and users are added as members. Those users do not have privileges assigned explicitly; however, since a user is a member of a group role, they inherit the privileges of the group role.
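
As a minimal sketch (the role, table, and user names are assumptions), a group role can be created, granted privileges, and then granted to individual logins, which inherit those privileges:

-- A group role that cannot log in itself
CREATE ROLE developers NOLOGIN;

-- Privileges are granted once, to the group role
GRANT SELECT, INSERT, UPDATE ON public."Cashier" TO developers;

-- Individual logins become members and inherit the privileges
GRANT developers TO dev_testuser;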

Conflicting of Privileges
A user can be a member of multiple group roles, and every group has different privileges. When a user logs into the system, they get the privileges of all of their groups. However, there can be situations where permissions conflict. For example, if the user has SELECT permission on a table, whereas a group role they are a member of has an explicit denial of that permission, then there is a conflict of privileges.

In the case of a conflict of privileges, there is a golden rule, listed in the following information box.

Denial of access always outweighs grant of access, except when the user is the system admin or the superuser.

This golden rule can be applied to resolve many conflicts. Further, when you grant a user superuser access, you cannot control them by denying permissions.

In PostgreSQL, there is no explicit command to deny a permission. In some other database technologies, such as Microsoft SQL Server, there is a DENY command along with the GRANT and REVOKE commands. Since there is no DENY permission in PostgreSQL, there is no confusion from conflicting permissions.

Let us look at a few combinations of privileges and the resulting effective privileges, in order to understand conflicts of privileges.


User Permission | Group Permission | Effective Permission
Grant access to Table A | Denial of access to Table A | Denial of access to Table A
Denial of access to Table A | Grant access to Table A | Denial of access to Table A
Denial of read access to Table A | Read-write access to Table A | Can write but cannot read data from Table A
Denial of access to Table A | Superuser | Can do anything
Superuser | Denial of access to Table A | Can do anything
The above table indicates how the above rules work with different combinations.

Let us look at how row-level security is an important design concept in database security.

Row Level Security


Up to now, we have discussed how users can be given access to database objects such as tables, columns, and views. These objects are not data but metadata.

However, when data is inserted into a table, different rows can belong to different entities. If there is a requirement that one user should see only one subset of the data, the implementation can become complex.

For example, if you look at the following screenshot, within the same data set there are different logical partitions depending on the location.

If there is a requirement that one user should see only the data for one location, you would need to physically divide the table by location. After dividing it into multiple tables, each relevant user can be given permission on the relevant table. For users who need access to all locations, a view can be created combining all the tables. However, there are several potential practical problems with this implementation.

If there are a large number of partitions, you need to create many physical tables. For example, if you are implementing physically separated tables per sales representative, you will end up with a large number of tables.
When new partitions are added, you need to add a new table, grant permissions to users, update the view, and so on. This leads to a lot of maintenance overhead.

However, most database technologies, including PostgreSQL, provide a feature called Row-Level Security (RLS).

After the table is created, it has to be enabled for row-level security, as shown in the below code:
ALTER TABLE public."BankTransaction"
ENABLE ROW LEVEL SECURITY;

Then policies are created for each user, providing the necessary permissions. The details of enabling RLS in PostgreSQL can be found at https://www.postgresql.org/docs/10/ddl-rowsecurity.html
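
As a minimal sketch of such a policy (the column name, location value, and role are assumptions based on the earlier example), a policy can restrict a role to the rows for a single location:

-- Only rows whose Location is 'Colombo' are visible to dev_testuser
CREATE POLICY colombo_only ON public."BankTransaction"
    FOR SELECT
    TO dev_testuser
    USING ("Location" = 'Colombo');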

Since PostgreSQL is the database technology we have selected, let us see how authorization is implemented in PostgreSQL.

Providing Authorization in PostgreSQL


In PostgreSQL, there are predefined roles. As a database designer or database administrator, it is important to understand the default roles in PostgreSQL. The following list shows the default roles:

Role
pg_execute_server_program
pg_monitor
pg_read_all_settings
pg_read_all_stats
pg_read_server_files
pg_signal_backend
pg_stat_scan_tables
pg_write_server_files
Refer to https://www.postgresql.org/docs/11/default-roles.html to find out the allowed access for each default role.


When creating a login, there are two options for providing authorization: one from the Privileges tab and one from the Membership tab.

The set of privileges defines whether the login can create databases, create roles, and so on.

If the login is a superuser, all permissions are allocated to it.

The next option is adding the login to different group roles from the Membership tab, as shown in the below screenshot.


A login can be a member of multiple default or user-defined roles.

Let us see how the user is granted permissions in PostgreSQL with GRANT command.

GRANT Command in PostgreSQL


The GRANT command has three basic components: user, object, and operation. As discussed, the user should be a valid login. A database object can be a database, column, table, view, sequence, procedure, and so on. Possible operations include SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER, CREATE, CONNECT, and so on. Further details of GRANT can be viewed at https://www.postgresql.org/docs/9.0/sql-grant.html.

The following are some of the different ways of using the GRANT command.

--Granting SELECT permission to dev_testuser on Cashier Table
GRANT SELECT ON public."Cashier" TO dev_testuser;

--Granting SELECT, UPDATE, INSERT permission to dev_testuser on Cashier Table
GRANT SELECT, UPDATE, INSERT ON public."Cashier" TO dev_testuser;


--Granting SELECT on the "CashierID" and "FirstName" columns and UPDATE on the "FirstName" column of the Cashier Table
GRANT SELECT ("CashierID","FirstName"), UPDATE ("FirstName") ON
public."Cashier" TO dev_testuser;

The last GRANT command shows that you can give users access to selected columns only, rather than letting them select all columns. For example, if there is a salary column in the Employee table, you can allow the human resources team to select all columns except the salary column, and allow the finance team to select all columns except a rating column that should be viewed only by the human resources team.

Using Views as a Security Option


In Chapter 10, Maintaining a Database, under Working with Views, we discussed how views are useful as a maintenance option for the database. Apart from maintainability and re-usability, views are also used as a security option.

As shown in the following screenshot, an end user can be given access to a view. The view is built on one or more tables. To access the view, the user does not need permissions on the underlying tables.

Let us verify this with the following code.

Let us grant the dba_admin user permission on a view called film_list in the DVDRental database.
GRANT SELECT on public.film_list to dba_admin;

With the above grant of permission, the user dba_admin can access the film_list view, as shown in the below screenshot.


Then, let us access one of the tables that is used by the view.

The error shown in the above screenshot verifies that the user cannot access the tables directly.

Let us look at what kinds of actions can be implemented in order to avoid SQL injection in the following section.

Avoiding SQL Injection


SQL injection is one of the oldest methods used to compromise a database. SQL injection exploits simple string-building techniques.

Let us assume the application runs the following query.
SELECT * FROM Users
WHERE UserName ='John' AND
Password = 'XXXX'

If you provide an invalid user name or password, the above query returns no records.

However, with string building, an attacker can modify the query into the following code.
SELECT * FROM Users
WHERE UserName ='' OR 1 =1 -- AND Password = 'XXXX'

The above ignores the password part of the query and returns records. If the application is written in such a way, an unauthorized user can log into the system.

Let us see what possible attacks can be carried out using SQL injection.

What SQL Injection can do


Though SQL injection appears to be a very simple technique, it can do a lot of damage to your data, your database, or the database server. The following is a list of actions that can be performed through SQL injection.

Logging into applications without proper authentication.
Obtaining the table structures of the database.
Manipulating data after obtaining the table structures.
Dropping tables, views, and procedures.
Retrieving data that is not available to users through the application.
Executing other programs on the database server and increasing CPU and memory consumption on the server.
Shutting down the database instance.

As you can see from the above list, SQL injection can be used to retrieve data as well as to shut down a database instance.

Let us see what are the measures that can be implemented to prevent SQL Injection attacks.

Preventing SQL Injection Attacks


Due to the high impact of SQL Injection, it is essential to implement precautions against
SQL Injection.

The most important way to prevent SQL injection is to give users only the necessary permissions. For example, even if an end user is able to submit a DROP TABLE or DROP DATABASE statement, it will fail if they do not have the proper permissions. Further, if you provide user access via views, their INSERT, UPDATE, and DELETE commands will fail.
Use procedures instead of inline SQL queries.
The SQL injection technique is built on string building, which is possible only if you are using inline queries. If applications use procedures or functions to retrieve data, SQL injection can largely be eliminated. As a database designer, it is essential to plan for procedures even at the design stage of the database (a minimal sketch follows this list).

Remove unnecessary applications from the database server.
SQL injection attacks can execute applications available on the server. By launching a high number of instances of these applications, an attacker can make the server consume high CPU and memory, which causes the database instance to slow down. Therefore, it is recommended to remove unwanted applications from the server.
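
The following is a minimal sketch of replacing the inline login query with a function whose parameters are passed as values rather than concatenated strings; the function name is an illustrative assumption and the Users table is the one from the earlier example.

CREATE FUNCTION check_login(p_username text, p_password text)
RETURNS boolean AS $$
    -- Parameters are treated as values, not as SQL text,
    -- so input such as '' OR 1=1 -- cannot change the query
    SELECT EXISTS (
        SELECT 1 FROM Users
        WHERE UserName = p_username
          AND Password = p_password
    );
$$ LANGUAGE sql STABLE;

-- The application calls the function instead of building an inline string
SELECT check_login('John', 'XXXX');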

There are a few other techniques that can be applied by application developers, such as limiting text box sizes, string replacement, string validation, and so on. In this section, we looked at the prevention techniques that can be adopted by database designers and database administrators.

In the next section, we will look at how encryption can be used to protect the data in your database.

Encryption for Better Security


Encryption is a technique that has been used in many domains. Encryption converts data into a code to prevent unauthorized access.

Since encryption is a vast topic, we will only discuss how to apply encryption in database design. We will not be covering the different types of encryption (such as symmetric encryption) or the different algorithms (such as MD5) in detail.

Let us look at how to encrypt passwords in databases.

1. Let us create a table to store user details.

CREATE TABLE SystemUsers
(UserID int,
UserName character varying (50),
Password character varying (500)
);

Please note that we have created only a limited number of columns, just to demonstrate the password-encryption capabilities.
password-encryption capabilities.

2. Encrypt the password using the crypt() and gen_salt() functions from the pgcrypto extension, as shown in the below code.

CREATE EXTENSION IF NOT EXISTS pgcrypto;

INSERT INTO public.systemusers( userid, username, password)
SELECT 1, 'John', crypt('1qaz2wsx@', gen_salt('bf', 4));

3. Let us verify the encrypted password.

You can verify that the password is encrypted in the table.
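
To check a login attempt against the stored hash, crypt() can be called again with the stored value as the salt; a match confirms the password. The literal values below are the ones from the example above.

SELECT userid, username
FROM public.systemusers
WHERE username = 'John'
  AND password = crypt('1qaz2wsx@', password);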

We used the bf algorithm in the above example. There are more algorithms supported in PostgreSQL, as shown in the below table.

Algorithm | Description | Max Password Length
bf | Blowfish | 72
md5 | MD5 | unlimited
xdes | Extended DES | 8
des | DES | 8
However, there are a few drawbacks to encryption, as listed below.

1. Storage - Encryption needs more storage than plain text. Though storage is not a major concern these days, if you are storing database backups, you need to account for the additional storage requirement.
2. Performance - With encryption, you need to encrypt the data when storing it and decrypt it when reading. These two additional processes mean that your transaction duration will increase.
3. Resource consumption - As discussed under Performance, two additional processes are needed for encryption, and depending on the encryption algorithm, they consume additional CPU and memory.
We will be looking at how Data Auditing is implemented with respect to the Security
aspect in database design.

Data Auditing
Auditing is mainly used as a security measure to satisfy various data protection laws. Since there are many data and structural modifications in a database, it is essential to keep track of those modifications.

We discussed the design of auditing table structures in detail in Chapter 6, Table Structures, in the Designing Audit Tables section.

In PostgreSQL, a few options are available for auditing, listed below.

By using native logging in PostgreSQL (log_statement = all)
This is a native option available with PostgreSQL (a minimal example follows this list).

By using table triggers
We discussed, with an implementation, how triggers can be used to implement data auditing in Chapter 10, Maintaining a Database, in the section Triggers as an Auditing Option. Though this seems to be a much easier option, it increases the transaction duration. Hence, from the performance perspective, the trigger option may not be ideal.

By using community-provided PostgreSQL tools such as
audit-trigger 91plus (https://github.com/2ndQuadrant/audit-trigger)
pgaudit extension (https://github.com/pgaudit/pgaudit)
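
As a minimal example of the native logging option mentioned above, log_statement can be set for the whole instance and the configuration reloaded:

-- Log every SQL statement executed on the instance
ALTER SYSTEM SET log_statement = 'all';
SELECT pg_reload_conf();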

Let us look at the best practices in Data Auditing.


Best Practices in Data Auditing


Auditing is unavoidable in enterprise application development, as it provides significant benefits. However, there are best practices we need to follow for auditing.

Separate auditing data from application data.

It is better if we can keep a separate database for auditing. With this approach, we can protect the audit data from application users and other administrators much more easily. If the auditing data is kept with the application data, then through negligence the audit data could end up accessible to a user who should not have access to it. If you have multiple databases, which is very common for an enterprise application, it is recommended to have a separate database for auditing; in enterprise applications, you can even have a separate server for auditing.

Having a separate database may cause issues if you are using triggers to generate audit data, as this creates distributed transactions, and distributed transactions may cause performance issues.

Audit only essential data.

It is much simpler to enable auditing for an entire database or for all databases. However, as a database designer, you need to work out which data actually needs to be audited. During database design, it is also useful to plan an auditing configuration that can be switched on when the need arises.

Provide a mechanism for archiving auditing data.

Audit data grows at an exponential rate, and there are often legal reasons to keep audit data for a substantial period of time. Techniques such as partitioning can be used as a data archival mechanism for auditing data.

No indexes on auditing data.

As we discussed in Chapter 8, Working with Indexes, indexes improve data reading performance while decreasing data write performance. Reading auditing data is a rare event, occurring only on special occasions such as troubleshooting or fact-finding. However, since there is a high volume of writes to auditing data, it is important to have few or no indexes on the audit tables. When the need arises to read the audit data, you can implement the necessary indexes.

Let us look at the security best practices that should be implemented by database designers.


Best Practices for Security


Though there are many options available, there are also best practices that can be followed. These best practices help database designers take proactive decisions during the database design stage.

When naming users, logins, or group roles, it is always recommended to follow a proper naming convention. For example, the following list can be used as prefixes to indicate users' roles. Apart from the prefix, you can define a mechanism for the suffix part of the user name. For example, for John Murray, who is a database administrator, you can define a user name such as dba_johnm.

Type of Users | Prefix
Application | app_
Database Administrators | dba_
Developers | dev_
Quality Assurance | qa_

As we discussed in the Using Views as a Security Option section, views should be used to provide access to end users and applications, as this achieves better security.
Enabling alerts for failed logins is an important configuration to implement. Typically, before someone tries to compromise a database system, they will make a few attempts. If there is an alert indicating invalid login attempts, the security team can take action immediately.
When saving sensitive data such as SSNs, dates of birth, and credit card numbers, it is essential to question whether you really need to store this data. If you do have to store it, special measures should be taken; for example, you can encrypt the data or store it in separate tables.
Drop unnecessary databases on the live server. In particular, it is very common to see sample databases such as DVDRental available on live servers. Since their structures are known to everyone, there is a possibility of attacks using these databases.

Field Notes
An organization has a large number of support staff members who are responsible for uploading data files sent by their clients. Sometimes, due to invalid file uploads, data was overwritten, raising concerns over the stability of the support process. The application team was able to limit the number of issues after adding a lot of validations and restrictions to the process. However, there was still some mishandling of these support files. It was then decided to implement a data disaster recovery system (DDRS). In the proposed DDRS, data is replicated to a read-only instance; however, the read-only instance is updated only after two hours. This means the secondary instance is always two hours behind the primary. In case of an invalid data load, data administrators can restore the data from the secondary instance. In addition to the DDRS feature, this instance can be used as a reporting instance with limited features.
A pharmaceutical organization has many medical representatives around the country. Since the medical representatives were operating locally, the organization decided to centralize the data into a single database for better analysis. Later, they decided to extend this analysis facility to the medical representatives themselves. However, since the data is in one table and medicine is a fiercely competitive domain, one representative should not be able to access any other representative's data. Therefore, before extending the feature to the medical representatives, they applied Row-Level Security to the table. With Row-Level Security, each medical representative can access only their own data, while the organization is still able to analyse the entire data set.

Summary
In this chapter, we discussed another important non-functional requirement: security. We identified that security lapses can lead to data loss, loss of availability, loss of privacy, loss of reputation, and loss of income. Further, we looked at how data breaches have occurred historically, across different years and domains.

We identified the three main options in security: authentication, authorization, and encryption. Authentication is verifying that users are who they claim to be. Authorization is providing access to different objects. We discussed that authorization is more complex, since it has to deal with different database objects, such as tables, views, columns, procedures, and so on. Further, there are different types of operations, such as SELECT, INSERT, UPDATE, DELETE, and so on. Apart from these complexities, there are role conflicts and row-level security to increase the complexity further.
SQL injection is a string-building technique used to attack databases, and we discussed how to avoid SQL injection from the database design perspective. Data auditing is another data security technique that we discussed. We also discussed a few field notes with regard to database security.


Questions
Can we deny permission to a superuser?
This question is often asked to mislead candidates during interviews. It is possible to deny permission to a superuser. However, those denials of permission will have no effect, because they are a superuser.

If a user is granted permission to select data from a table, and a role in which the user is a member has a denial of permission to select data on the same table, what is the effective permission when the user logs into the database?
This question can also be asked with the group and the user interchanged, to test your knowledge of security. Whatever the scenario, whether it is the user or the group, denial of permission always outweighs the grant of permission if the user or the group is not a superuser. However, if this question is asked specifically about PostgreSQL, note that there is no denial of permission in PostgreSQL.

Why is authorization more complex than authentication?

Though authentication is an extremely important mechanism, authorization is more complex because of the many configurations involved. In authorization, mainly three components are involved: the user, the database object, and the operation. When it comes to database objects, there are different types, such as tables, views, procedures, and so on. When it comes to operations, they can be SELECT, UPDATE, INSERT, DELETE, ALTER, TRUNCATE, and so on. When it comes to the data itself, row-level security may need to be implemented. Apart from these, there can be authorization conflicts between multiple roles and the user. Considering all these facts, authorization is more complex than authentication.

When giving a user permission to access a view, do you need to grant access to the tables that are used by the view?
No. There is a myth that for users to access a view, they need permissions on the tables used by the view. However, one of the main reasons for introducing views is to use them as a security mechanism. By implementing security through views, you do not have to expose the tables to end users.

Further Reading
Insider Threats Statistics: https://www.ekransystem.com/en/blog/insider-threat-statistics-facts-and-figures

Cost of Data Breaches: https://healthitsecurity.com/news/healthcare-data-breach-costs-remain-highest-among-industries

Database Roles and Privileges: https://www.postgresql.org/docs/8.4/user-manag.html

CREATE ROLE: https://www.postgresql.org/docs/11/sql-createrole.html

GRANT Command: https://www.postgresql.org/docs/9.0/sql-grant.html

Row Level Security: https://www.postgresql.org/docs/10/ddl-rowsecurity.html

Encryption: https://www.postgresql.org/docs/8.1/encryption-options.html

Encryption Function: https://www.postgresql.org/docs/9.0/pgcrypto.html

Configuring Logging: https://www.postgresql.org/docs/9.5/runtime-config-logging.html

Avoiding SQL Injection: https://rhianedavies.wordpress.com/2012/04/13/is-your-database-secure-by-dinesh-asanka-sql-server-standard-magazine/

13
Distributed Databases
In Chapter 11, Designing Scalable Databases, we briefly discussed horizontal scalability, which can be considered a form of distributed database. Apart from that discussion, most of our design discussions have been confined to centralized database management systems.

These days, businesses are moving away from legacy centralized business processes. Since technology should align with the business process, systems are designed with distributed patterns. To facilitate distributed systems, databases have to be distributed.

In this chapter, we will discuss the need for distributed databases, and the design concepts
in the distributed databases. Later, we will discuss real-world implementations with
distributed databases.

In this chapter we will cover the following topics:

1. Why Distributed Databases


2. Designing Distributed Databases
3. Implementing Sharding
4. Transparency in Distributed Databases
5. Design Concepts in Distributed Databases
6. Challenges in Distributed Databases
7. Field Notes

Why Distributed Databases?


One or two decades ago, most organizations operated in a centralized model. Due to limited transaction volumes and limited expectations, applications and data were implemented in a centralized model. Even though organizations may have had branches around the world, they could save their data in a centralized database located at one site.

Though data storage is centralized, from the business process point of view there are many different logical units. Some of those logical units are branches, offices, factories, departments, projects, business units, operating segments, sister companies, divisions, and so on. Due to the growth of the business and the competition it faces, it is often better to separate business processes by logical unit, such as department, project, or division. To align with these business process units, it is important to distribute the systems to match the business process distribution.

Since business independence is paramount in business distribution, it is vital to distribute the applications as well. With this distribution, business process units can make analytical decisions independently. In order to make independent analytical decisions, it is important to implement separate system units, or the systems should be separated. Since data is at the core of these system units, it is important to distribute the data as well.

Apart from distributing data by logical business unit, data distribution can also be done with respect to functionality. We will discuss the different mechanisms for data distribution in the Architecture for Distributed Databases section.

Another technical advance that has helped database distribution is the increased efficiency of network systems. Since the strength of network connectivity plays a major role in distribution, an efficient network is a must for the success of a distributed system. Modern improvements in network engineering have paved the way for the implementation of distributed databases.

The behavioural patterns of users have also driven the implementation of distributed databases. These days, most management users prefer to analyse and retrieve data on mobile devices such as laptops, mobile phones, and other handheld devices. These dynamic user behaviours can be facilitated only by keeping the data distributed.

Let us look at a basic diagram for the distributed database system in the following
screenshot.


In the above screenshot, there is a data distribution framework to which the user connects.
Depending on the data distribution method, the framework decides how to connect to the
relevant database, and depending on the request, it may have to access one or more
databases. These databases can be located in multiple physical locations. For the user it
is one logical database, though physically it is three databases. If the databases were
centralized, all three would be implemented as a single database. This simple distributed
database architecture shows how a single database can be distributed across multiple
databases and how those databases are accessed.

Having understood the basics of distributed databases, let us see what the properties of
distributed databases are.

Properties of Distributed Databases


A distributed database is a single logical database that is split into multiple physical
fragments using different technologies. The following are important properties of
distributed databases.

Data is split into multiple fragments.


These fragments are chosen to align with business processing units or business
functions. During the distributed database design, the database designer has to
choose the fragmentation policies. Since it is difficult to modify the fragmentation
later, more thought and discussion have to be put in at the design stage.

A single logical database is formed from the collection of fragments.


For end-users, it will appear as a single logical database. We will discuss this aspect
in the Transparency in Distributed Databases section.

Fragments can be replicated using different database techniques.


This replication can be done using technologies such as database replication,
mirroring, log-based transfer, extract-transform-load (ETL) and so on.

Data fragments are allocated to the sites.


There is a many-to-many relationship between data fragments and sites: one site can
have multiple data fragments, and one data fragment can be shared between multiple
sites. However, as a database designer, it is better to minimize the sharing of data
fragments between sites. Minimizing the sharing of data fragments improves the
performance of the entire system.

Sites are connected by communication links.


As we discussed in the previous section, connectivity is vital to the success of a
distributed system. Though the sites are designed to operate independently, there
needs to be a network connection between them. The sites may not need to be connected
to each other at all times, but at a minimum each site should have the capability to
connect to a master site.

Data at each site is controlled by a Database Management System (DBMS).


Not all sites need to have the same DBMS. Having the same DBMS at all sites improves
usability and maintainability; however, cost and other operational issues may lead to
different DBMSs being used.

Let us look at the advantages and disadvantages of distributed databases.

Advantages in Distributed Databases


Let us discuss the additional advantages that a distributed database architecture can
provide.

Better Alignment with the Business Processing Units


Many organizations have internal or external, logical or physical business
processing units. When business units operate, resources such as projects and
employees are attached to them. If an employee wants to analyse data, it is better
if they can query a local distributed database rather than a centralized database.
In addition, business processing units can configure the local distributed database
with their own parameters without impacting other business units.

Local Independence
With the implementation of distributed databases, data can be stored locally, and
administrators have more control over that data. For example, if the distributed
databases are separated geographically, different data fragments are located at the
local sites. Due to this separation, local administrators have more control over their
data. For maintenance, local administrators can choose a maintenance window so that the
majority of users are not impacted. If the data is centralized, it is difficult to
choose a single maintenance window that suits all users without impacting their
business functions.

Apart from maintenance independence, functional independence can be achieved too.
Configuration can be done with respect to local functionality without impacting other
business units.
Improved High Availability
In a centralized database implementation, when there is a failure in the database,
none of the users are able to perform their usual tasks. However, when the database is
distributed, it is unlikely that all databases will be unavailable at the same time.
Therefore, during an outage of one database, the other sites are not impacted. This is
called Degraded High Availability.

In Degraded High Availability, you provide users with some level of availability for a
given period, since full availability cannot be provided immediately. Typically,
read-only databases are provided for users to work with until full availability is
restored. Further, during degraded high availability there can be performance issues,
such as poor responsiveness and high data latency, as well as functional limitations.

Further, since the data is distributed into small fragments, implementing a high
availability solution is also simpler than in a centralized database implementation.


High Performance
Since data is stored closer to user locations, user queries hit the local database,
reducing network latency. You are also dealing with a smaller data set compared to a
large centralized database, so you get faster results. In addition, with distributed
databases all user queries are distributed among the databases in the architecture,
so user contention is lower than in a centralized database. Apart from user
contention, CPU and I/O contention are also lower than in a centralized database.

Apart from these obvious benefits, designers and database administrators can also work
locally to improve performance. As we discussed in Chapter 8: Working with Indexes, in the
Disadvantages of Indexes section, indexes decrease write query performance. However, as a
database designer, your scope is now narrowed to a smaller decentralized database rather
than a large centralized one. This smaller database allows designers to create indexes
by considering only the local query patterns.
Improved Reliability
Reliability is an important property that databases must achieve. In distributed
databases, data is replicated with different techniques such as replication,
mirroring and so on. With data replication, data is duplicated to a different site.
In case of a failure, at least on a temporary basis, the users at the failed site can
be directed to a working site. This means that with distributed databases, database
reliability can be achieved.

Easier Expansion
Expansion is unavoidable in database systems. In a centralized system, the rate of
expansion is high, and database administrators have to work on CPU, memory and I/O
capacity. When the database is distributed, expansion is much slower and administrators
have enough time to work on expansion strategies.

In a distributed database architecture, when the need arises to expand, new sites can
be created with less hassle. Since the distributed architecture itself supports
horizontal expansion, less effort is needed. In a centralized database architecture, by
contrast, horizontal expansion is a costly operation, as it needs more time and effort.
Vertical expansion is usually the only practical option for a centralized database,
whereas both vertical and horizontal expansion are possible with distributed databases.


Disadvantages in Distributed Databases


Though there are definite advantages to distributed databases, there are disadvantages
to distributed database implementations as well.

Complexity
Even from the very simple example discussed in this chapter, we can see the
complexities of distributed database systems. In simple terms, instead of one
centralized database you now have multiple databases, which increases the network
complexity of the system.

Though every site keeps its data locally, data needs to be shared between sites. To
share the data, replication, mirroring or log-based sharing must be introduced. This
data sharing further increases the complexity of the system, including the complexity
of maintaining it.
Security
In a centralized database implementation, security needs to be applied to a single
database, which raises fewer security concerns. Since a distributed system has
multiple databases at multiple sites, security becomes a challenge. Because most of
the databases are connected through a network, network security is a concern. As we
discussed in Chapter 12: Securing a Database, we need to maintain Authentication and
Authorization at each database.

Further, when users are added to or removed from the system, you may have to perform
this operation in many databases. Not only the users, but also authorization on
different levels of objects has to be maintained in all the databases.
Cost
In a distributed database system, more nodes need to be added, and when more nodes
are added, network connectivity should be improved. These additional hardware
requirements mean that there is an additional cost involved with distributed
databases. Apart from the hardware cost, additional software applications are
required to support replication or mirroring.

With more nodes in place, more investment is needed in monitoring the distributed
database system, which again increases the cost. When databases are distributed, more
database administrators are required, and you need to recruit highly skilled personnel
to manage the system. All of these factors mean that you need to spend more on a
distributed database than on a centralized one.
Maintenance


When there are multiple nodes and replication between them, the maintenance effort of
the system increases. Software patches have to be applied to all the nodes, whereas in
the case of a centralized database, patches need to be applied to only one database
node.

Lack of Resources
A distributed database needs specialized software and skills. For example, a
distributed database needs more skilled database designers, and skilled database
administrators are required to maintain the distributed system.

Given these advantages and disadvantages, when choosing a distributed database
architecture, you need to find the optimum solution as a database designer.

Designing Distributed Databases


In distributed database design, two important concepts need to be addressed:
fragmentation and replication.

One of the most common methods of distributing a database is replicating the entire
database, which is discussed in the next section.

Full Replication Distribution


Replicating the entire database is the easiest method of database distribution. The
distribution mechanism is simpler as there is no fragmentation, and since the entire
database is replicated, there is less hassle for applications.

However, since the entire database is replicated to the secondary nodes, network
consumption is very high.

In full database replication, there can be two types of replication. One of them is
Active-Active replication, as shown in the following screenshot.


In Active-Active replication, both nodes can be used by applications for both writes
and reads. However, in this configuration there can be conflicts, as the same record
can be updated by two different users at the same time.

Replication conflicts can be resolved by conflict resolution techniques. Some database
tools have built-in conflict resolution. The most popular technique is assigning a
priority to each node: when a conflict arises, the lowest-priority node becomes the
loser and its data modification is ignored. Similarly, there are other techniques such
as maximum value, minimum value, customized value and so on.

Since data modifications take place at multiple nodes, Active-Active replication has
limitations when it comes to expanding to many nodes. Often, this method is used to
distribute data to users who do not have continuous connectivity.

The most popular replication type is Active-Passive replication, as shown in the
following screenshot.


In the Active-Passive replication type, data writing is possible only at a single node,
whereas the other nodes are used as read-only nodes. Since the implementation of this
type is much simpler, there are situations where you can configure multiple read-only
nodes. A read-only node is mostly used as a reporting database to reduce the load on
the main system.

Mirroring and log-based replication are the most commonly used techniques for full
database replication. In both of these methods, asynchronous replication is preferred.
In asynchronous replication, after a transaction is completed on the primary server,
the user acknowledgement is sent without waiting for the secondary to complete the
transaction. The following screenshot shows the difference in transaction duration
between asynchronous and synchronous replication.


As observed in the above screenshot, synchronous replication increases the transaction
duration and therefore impacts system performance negatively. This negative performance
impact has led many users to configure asynchronous replication for database
distribution.

Data consistency is not guaranteed in asynchronous replication. If the transaction
fails at the secondary database, the primary transaction will still prevail; however,
the two nodes will not be in a transactionally consistent state. In synchronous
replication, the primary database transaction will be rolled back if the secondary
database fails.
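
In PostgreSQL, for example, streaming replication can be switched between these two
behaviours through configuration. The following is a minimal sketch, assuming a standby
named standby1; the names and values are illustrative assumptions, not a recommendation
from this chapter.

-- Asynchronous replication (no synchronous standby listed): the commit is
-- acknowledged without waiting for the standby.
ALTER SYSTEM SET synchronous_standby_names = '';

-- Synchronous replication: the commit waits for standby1, which lengthens
-- the transaction duration as described above.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
ALTER SYSTEM SET synchronous_commit = 'on';

SELECT pg_reload_conf();  -- apply the changed settings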

Full replication distribution is commonly used due to its simple implementation.
However, because of the network utilization, database designers also look at partial
replication distribution.

Let us look at partial replication distribution in detail.


Partial Replication Distribution


Depending on the use of fragmentation and replication, there are four types of database
distribution: Full Replication, Horizontal Fragmentation, Vertical Fragmentation, and
Hybrid Fragmentation.

Full Replication
In the Full Replication technique, the entire object is replicated. In this method,
selected tables and views are replicated between nodes. For example, all the rows and
columns of a table are replicated, as shown in the following screenshot.

As shown in the above screenshot, the entire table is replicated. This method of
distribution is common in distributed database systems. Though it will not consume as
much network as full database replication, there will still be substantial network
utilization.

Horizontal and vertical fragmentation are implemented in order to utilize the network
more precisely.

Horizontal Fragmentation
Horizontal fragmentation or row-level fragmentation is shown in the following screenshot.

As shown in the above screenshot, every node holds its relevant data. For example, if
we look at the customer table, we can distribute data by location. In the customer
table there are three locations: USA, UK and CANADA. We can distribute the customer
table by location, as shown in the following screenshot.


The data for USA customers is then stored at the USA site so that USA users can access
their data at the site closest to them. Similarly, other users can access their data
close to their sites. In horizontal fragmentation, it is important to split the data
into fragments of similar size; otherwise, large fragments will not achieve the desired
performance expectations.
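
As a hedged illustration, the same row-level split can be sketched inside a single
PostgreSQL database with declarative list partitioning; in a genuinely distributed
design each fragment would live at its own site. The table and column names below are
assumptions.

CREATE TABLE Customer (
    CustomerNumber int          NOT NULL,
    CustomerName   varchar(100) NOT NULL,
    Location       varchar(20)  NOT NULL,
    PRIMARY KEY (CustomerNumber, Location)  -- the partition key must be part of the key
) PARTITION BY LIST (Location);

-- One fragment per location, matching the USA, UK and CANADA split above
CREATE TABLE Customer_USA    PARTITION OF Customer FOR VALUES IN ('USA');
CREATE TABLE Customer_UK     PARTITION OF Customer FOR VALUES IN ('UK');
CREATE TABLE Customer_CANADA PARTITION OF Customer FOR VALUES IN ('CANADA');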

Vertical Fragmentation
Another mode of fragmentation is vertical fragmentation, as shown in the following
screenshot.

The same customer table example is used to explain vertical fragmentation in
distributed databases, as shown in the following screenshot for the same dataset.


As shown in the above screenshot, the table is distributed by columns. In the preceding
example, one distributed database node has Employee Number, Employee Name and Date of
Birth, while the other node has Employee Number, Location and Salary.

In vertical fragmentation, the primary key must be available in every fragment.
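
A minimal sketch of vertical fragmentation follows, assuming the employee-style
attributes shown above; both fragments carry the primary key so the original relation
can be rebuilt with a join.

CREATE TABLE EmployeePersonal (
    EmployeeNumber int PRIMARY KEY,
    EmployeeName   varchar(100),
    DateOfBirth    date
);

CREATE TABLE EmployeeFinancial (
    EmployeeNumber int PRIMARY KEY REFERENCES EmployeePersonal (EmployeeNumber),
    Location       varchar(20),
    Salary         numeric(10,2)
);

-- Reconstructing the original relation from the two fragments
SELECT p.EmployeeNumber, p.EmployeeName, p.DateOfBirth, f.Location, f.Salary
FROM EmployeePersonal p
JOIN EmployeeFinancial f ON f.EmployeeNumber = p.EmployeeNumber;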

Hybrid Fragmentation
Hybrid fragmentation is a combination of horizontal fragmentation and vertical
fragmentation, as shown in the following screenshot.


In the above example, every fragment is equally distributed. Let us use the same
customer example to get a better understanding of hybrid fragmentation.

In the above example, the table is fragmented vertically by columns into {Employee
Number, Employee Name, Date of Birth} and {Employee Number, Location, Salary}, and
horizontally by employee location. However, this type of distribution leads to a large
number of fragments, which creates heavy maintenance problems. Further, some fragments
will be very small and some of them meaningless.

Rather than fragmenting the database purely theoretically, as a database designer you
need to fragment it more meaningfully. Let us fragment the customer table as shown in
the following screenshot.

In the distributed database design shown above, UK customers are kept with all their
attributes in one fragment. The US records are fragmented into {Employee Number,
Employee Name, Date of Birth} and {Employee Number, Location, Salary}, whereas the Canada


records are fragmented into {Employee Number, Employee Name, Location, Date of Birth}
and {Employee Number, Salary}.

Deciding on the fragmentation has to be done by carefully examining the requirements.
Though it is not impossible to modify the fragmentation later, doing so is associated
with higher costs such as downtime and resource costs.

Fragmentation of data in distributed databases is important, so it is essential to plan
the fragmentation at the design stage. Further, proper testing should be done to verify
operations with the distributed databases. Especially with hybrid distribution, more
testing has to be done, since there can be many different fragments.

Having understood the different ways to fragment a database, let us look at a special
distributed database technique, sharding, in the next section.

Implementing Sharding
Sharding is the process of breaking up high-volume tables into smaller fragments named
shards. These shards can be spread across multiple servers or sites. A shard is a
horizontal data partition.

Though MongoDB supports sharding out of the box, database technologies such as Oracle,
Microsoft SQL Server, PostgreSQL and MySQL do not support sharding natively. Therefore,
as a database designer, you need to design sharding from first principles.
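
One PostgreSQL building block that can help when hand-rolling such a design is the
postgres_fdw extension, which exposes a table on a remote shard as a local foreign
table. The following is a hedged sketch; the host name, credentials and table names are
assumptions.

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Register the remote shard as a foreign server
CREATE SERVER shard2_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard2.example.com', dbname 'sales');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER shard2_server
    OPTIONS (user 'app_user', password 'secret');

-- The Customer table on the remote shard becomes queryable locally
CREATE FOREIGN TABLE Customer_Shard2 (
    CustomerKey  int,
    CustomerName varchar(100)
) SERVER shard2_server OPTIONS (schema_name 'public', table_name 'customer');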

In sharding, the entire database is divided into multiple databases depending on a
business key. For example, we can divide the database by customer. If we divide it into
multiple databases by customer, all the data related to a given customer should reside
in the same shard.


The above screenshot shows a reference architecture for the sharded database system. In
this system, end-users connect to the shard catalogue. Depending on the query, the shard
catalogue decides which shard the data should be read from or written to. Master data
that is relevant to all shards is replicated from the master database.

Let us look at an example to match the above reference diagram in the following
screenshot.


In the above shard design, two shards are introduced. Customers with customer keys
{1, 2, 3} are allocated to the first shard, while customers with customer keys
{1001, 1002} are allocated to Shard 2. Each shard has a range so that new customers can
get the next number. In Shard 1, the Order table holds the orders for customers
{1, 2, 3}. To maintain uniqueness in the Order table, the first two digits of the order
number represent the shard.

Since shards are stored in different locations and on different servers, it is
impossible to implement UNIQUE constraints across the entire distributed database
system.
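
One common workaround, sketched below under the assumption that each shard is
configured with its own shard number, is to embed the shard number as a prefix of the
generated key so that values cannot collide across shards.

-- On shard 1; the literal 1 below is the configured shard number (an assumption)
CREATE SEQUENCE OrderNumberSeq;

CREATE TABLE Orders (
    OrderNumber bigint PRIMARY KEY
        DEFAULT 1 * 100000000 + nextval('OrderNumberSeq'),
    CustomerKey int  NOT NULL,
    OrderDate   date NOT NULL
);
-- Shard 2 would use 2 * 100000000 + nextval(...), so the leading digits identify
-- the shard and order numbers never overlap between shards.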

Product data, which is a master or reference table, should be replicated to all shards,
as the Product table is required by all of them.

Depending on the situation, you may have to move data between shards. Therefore, as a
database designer, data transfer between shards needs to be planned beforehand.

When a user requests data from the sharded database system, they need to provide at
least a sharding key. If the sharding key is unknown, a secondary key, which is the
Customer Key in this example, should be provided in order to find the shard.
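
A minimal sketch of such a shard catalogue is shown below; the table, columns and
connection details are assumptions made for illustration.

CREATE TABLE ShardCatalogue (
    ShardID          int  PRIMARY KEY,
    CustomerKeyFrom  int  NOT NULL,
    CustomerKeyTo    int  NOT NULL,
    ConnectionString text NOT NULL  -- where the shard is hosted
);

INSERT INTO ShardCatalogue VALUES
    (1,    1, 1000, 'host=shard1 dbname=sales'),
    (2, 1001, 2000, 'host=shard2 dbname=sales');

-- Resolving the shard for customer key 1002
SELECT ShardID, ConnectionString
FROM ShardCatalogue
WHERE 1002 BETWEEN CustomerKeyFrom AND CustomerKeyTo;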

Let us discuss the concept of transparency in distributed databases.


Transparency in Distributed Database Systems

Transparency in distributed database systems means hiding the implementation of the
distribution from the user. In other words, the end-user should not feel any difference
between a distributed and a centralized database system.

It is extremely difficult to achieve 100% transparency in distributed database systems.
If you try to achieve 100% transparency, the distributed database system will become
unmanageable.

There are three types of transparency in distributed databases: Distribution
Transparency, Transaction Transparency, and Performance Transparency. They are
discussed in the following sections.

Distribution Transparency
As we discussed in the Designing Distributed Databases section, there are different
types of fragmentation, and the fragments may reside in different databases. When a
user executes a query, they should not need to worry about where the fragments and
their databases are.

Let us look at a more complex user query against the following database distribution.

Let us assume that the user needs to execute the following query in a centralized database.
SELECT EmployeeNumber, EmployeeName, Salary


FROM Employees
WHERE Salary > 3000

With the distributed database design, the above query will no longer work. Instead, the
following query needs to be executed.
SELECT
EmployeeUS1.EmployeeNumber,
EmployeeUS1.EmployeeName,
EmployeeUS2.Salary
FROM EmployeeUS1 INNER JOIN
EmployeeUS2 ON EmployeeUS1.EmployeeNumber = EmployeeUS2.EmployeeNumber
WHERE EmployeeUS2.Salary > 3000

UNION

SELECT
EmployeeUK.EmployeeNumber,
EmployeeUK.EmployeeName,
EmployeeUK.Salary
FROM EmployeeUK
WHERE EmployeeUK.Salary > 3000

UNION

SELECT
EmployeeCANADA1.EmployeeNumber,
EmployeeCANADA1.EmployeeName,
EmployeeCANADA2.Salary
FROM EmployeeCANADA1 INNER JOIN EmployeeCANADA2 ON
EmployeeCANADA1.EmployeeNumber = EmployeeCANADA2.EmployeeNumber
WHERE EmployeeCANADA2.Salary > 3000

You can see that distributed queries are more complex.

If these types of queries are being executed frequently, the fragmentation should be
reconsidered. In the case of different physical sites, such distributed queries will
impact performance. Further, frequent distributed queries are prone to errors.
With distribution transparency, the replication method should also be transparent to
the user. You may be using different techniques to distribute the data, and these
should be hidden from the end-user.
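
One way such transparency is often restored, sketched here using the fragment names
from the example above (and assuming the remote fragments are reachable locally, for
instance as foreign tables), is to define a view that hides the fragmentation so the
original single-table query keeps working.

CREATE VIEW Employees AS
SELECT e1.EmployeeNumber, e1.EmployeeName, e2.Salary
FROM EmployeeUS1 e1
JOIN EmployeeUS2 e2 ON e2.EmployeeNumber = e1.EmployeeNumber
UNION ALL
SELECT EmployeeNumber, EmployeeName, Salary
FROM EmployeeUK
UNION ALL
SELECT c1.EmployeeNumber, c1.EmployeeName, c2.Salary
FROM EmployeeCANADA1 c1
JOIN EmployeeCANADA2 c2 ON c2.EmployeeNumber = c1.EmployeeNumber;

-- End-users can now run the original centralized-style query unchanged
SELECT EmployeeNumber, EmployeeName, Salary
FROM Employees
WHERE Salary > 3000;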

Transaction Transparency
In Chapter 9: Designing a Database with Transactions, in the ACID Theory section, we
discussed how important it is to maintain the four properties of Atomicity, Consistency,
Isolation and


Durability in a transaction. When a transaction executes on a distributed database, the
system should handle the distributed transaction while hiding its complexities.

In some operating systems, such as Microsoft Windows, a component called the
Distributed Transaction Coordinator (DTC) exists to coordinate the different resources
participating in a single transaction.

Performance Transparency
Performance transparency means that the end-user should not experience a difference in
performance when using the distributed database. In other words, their experience with
the distributed system should be the same as with a centralized database system with
respect to performance.

During query processing in a distributed system, the process has to identify the
fragment and its location. In the distributed environment, additional query cost is
incurred due to communication between sites, on top of the CPU and I/O cost.

Since each site has its own query processor, a distributed query can be processed in
parallel. Parallel processing improves query performance if the data load is similar
across fragments. Therefore, when designing fragments, a key deciding factor should be
the data distribution across the fragments.

Twelve Rules for Distributed Databases


In 1987, C. J. Date introduced twelve rules for distributed databases. These rules are
intended to ensure that, to end-users, the system feels like a non-distributed database
system.

1. Local Autonomy for Sites - Each site should be able to operate locally, and local
data should be managed locally.
2. No Reliance on a Central Site - There should not be a site without which the entire
system cannot operate. In other words, there should be no single point of failure.
3. Continuous Operation - There should be no need for downtime when adding or removing
a site.
4. Location Independence - A user should be able to access any site, provided they have
permission.
5. Fragmentation Independence - Users should be able to access data without


knowing how it is fragmented, provided they have the necessary permissions.
6. Replication Independence - Users should be unaware of whether the data is replicated
and of the mechanisms used for replication.
7. Distributed Query Processing - The distributed system should be capable of
processing queries that need to access multiple sites and fragments.
8. Distributed Transaction Processing - The distributed system should be capable of
processing transactions that need to access multiple sites and fragments.
9. Hardware Independence - Since there are different sites in a distributed database
system, different sites should be able to run on different hardware.
10. Operating System Independence - Since there are different sites in a distributed
database system, different sites should be able to run on different operating systems.
11. Network Independence - Multiple sites may be connected with different network
topologies; the distributed database should work regardless of the network topology.
12. Database Independence - Since there are different sites in a distributed database
system, different sites should be able to run on different database technologies.

Let us discuss the challenges that database designers and database administrators
encounter with distributed databases.

Challenges in Distributed Databases

Though there are many benefits to distributed database systems, a few challenges are
encountered due to the complexity of the implementation.

One of the major challenges in a distributed database system is designing or
identifying the fragments. There are several methods, such as horizontal, vertical
and hybrid fragmentation. Since there are many attributes that can be chosen for
fragmentation, the database designer has to decide on the optimal method of
fragmentation.
Depending on changes in user requests in the future, fragments and fragment locations
may need to change in order to provide better service. However, moving data between
fragments is not a simple task: not only does all the related data have to be
transferred, but the application also has to change.
Due to the complexity of the distribution architecture, maintaining the distributed
system requires significant effort. To maintain the distributed databases,


skilled administrators are required who have experience in databases, networking and
operating systems.
Implementing security is challenging in distributed databases. In a centralized
database system, security is maintained at one database. When a distributed database
system is implemented, there are multiple databases, and authentication and
authorization need to be implemented on all of them. Apart from implementing security
for all the databases, network security has to be improved, as there is a great deal
of communication between the databases.

Let us look at a few real-world examples of distributed databases in the following
section.

Field Notes
The following is a simplified flow diagram of a reporting system for an
organization.


Due to the business unit structure, the transaction databases were distributed across
multiple databases: Database1, Database2 and so on. For reporting purposes, transaction
data was extracted, transformed and loaded (ETL) into two reporting databases. Though
the transaction databases were distributed to match the business units, the reporting
databases were distributed according to reporting functionality. To improve performance
and usability, an Online Analytical Processing (OLAP) cube was introduced, and all
reporting requirements were met by connecting reports to the OLAP cube.

The organization has many sales representatives who are responsible for sales
around the country. Every morning, they receive stock from the storehouse

and then leave to make their sales. Since there was no proper system in place, all the
data was captured manually. Later, it was decided to implement a system to computerize
the entire process so that the work of the sales representatives and administrators
would become easier. It was decided to provide each sales representative with a PDA.
Every representative has a dedicated database, and early in the morning the database
data is replicated to the PDA. During their sales, the PDA data is updated, and at the
end of the day they update the main office database after completing their sales. Since
data needs to be updated from both the server and the client, Active-Active replication
was used. Since there is little chance of conflicts, the basic priority-node conflict
resolution method was used.

A Learning Management System (LMS) was designed to cater to a large number of
educational institutes. These institutes fall into two categories. Large-scale
institutes require high availability and a disaster recovery system. Small-scale
institutes are not high revenue-generating and do not need rich features such as high
availability, disaster recovery and fast response times. Since the two sets of
institutes needed two different types of hardware, it was decided to shard the database
into two databases. Institutes requiring the rich feature set were placed on high-end
hardware with additional servers for high availability and different disaster recovery
options. However, there were instances where institutes requested to move between
shards, so scripts were prepared so that institute data could be moved between shards.

Summary
In this chapter, we discussed the different aspects of distributed database systems. We
identified that, due to business growth and to provide independence to business units,
a distributed database is a better solution than a centralized database. Fragmentation
was identified as an important concept in distributed databases, where horizontal,
vertical and hybrid fragmentation can be applied by carefully examining the
requirements. We discussed sharding as a special technique for distributing data, in
which a shard catalogue is implemented to find the correct route to the sharded
database.

Transparency is an important concept in distributed databases, as it hides the
complexities of the distributed database implementation. We discussed three types of
transparency in distributed databases in this chapter: Distribution Transparency,
Transaction Transparency, and Performance Transparency. Further, we looked at a few
case studies with distributed databases that emphasised both the benefits of
distributed databases and how complex their implementation is.


Having completed the theoretical aspects of database design with practical examples, it
is now time to apply all this learning to a case study. The next chapter is dedicated to
a case study of database design for a Learning Management System.

Questions
Why are distributed databases needed for modern businesses?
Modern businesses operate as many business units, such as branches, projects,
business segments and so on. Due to increasing competition, businesses are moving
towards operating these units independently. Apart from business process
independence, businesses also want maintenance independence, where one unit can
schedule its own maintenance without impacting other business processes. To separate
the business processes, the technical systems are also distributed where needed. With
the applications separated, the databases need to be separated too. By distributing
data, separate units have the luxury of creating their own analytics. However, a
naive separation of databases runs into many technical and practical issues.
Therefore, databases should be distributed taking the business processes into
account.

What are the security challenges in distributed databases?

In a centralized database system, security is maintained at one database. As we
discussed, database security is applied mainly by means of Authentication and
Authorization. When a distributed database system is implemented, there are multiple
databases, and we need to implement authentication and authorization on all of them.

Apart from implementing security for all the databases, network security has to be
improved. Since there is a great deal of communication between databases in a
distributed system, it is essential to improve network security. This means that a
distributed system presents more security challenges than a centralized database.
In what scenarios would you choose a distributed database system?
If an organization's processes are distributed across multiple locations and these
processes can be operated independently or semi-independently, you can implement a
distributed database system.

What are the challenges of Active-Active replication in distributed databases?

In Active-Active replication, data is modified from multiple nodes. If the

same record is modified at the same time at multiple nodes, there will be a data
conflict. To resolve these conflicts and choose a winner and loser(s), a conflict
resolution mechanism should be introduced.

How do you maintain UNIQUE constraints across shards?

With sharding in distributed databases, data is separated into multiple databases.
Since the same entity is available in multiple databases, it is impossible to implement
a physical UNIQUE constraint across multiple tables sitting on multiple servers.
Therefore, to preserve uniqueness, the shard key can be integrated into the primary key
of transaction tables such as Order Number, Invoice Number and so on. For master data
such as Customer, key ranges can be defined per shard. Master tables that have to be
shared are replicated between shards.

Further Reading
High Availability: https://scalegrid.io/blog/managing-high-availability-in-postgresql-part-1/

https://blog.timescale.com/blog/scalable-postgresql-high-availability-read-scalability-streaming-replication-fb95023e2af/

Sharding Strategies: https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/

Chapter 14: Case Study - LMS using PostgreSQL
In all the chapters up to now, we have discussed different aspects of database design.
We emphasised design aspects such as Conceptual Database Design, E-R Modeling,
Normalization of data models, and so on. Apart from the functional requirements of a
database, we also discussed many non-functional requirements, such as indexes,
scalability and transactions. In the last chapter, we discussed how databases can be
designed using distributed principles.

Having discussed every aspect of databases, it is now time to put all this knowledge
into a case study. In this chapter, we will use the design concepts we have learned to
design a database for a Learning Management System (LMS).

To design a database for an LMS, we will look at the business case and build a
conceptual model for the LMS. We will define the necessary table structures using
PostgreSQL, and discuss the necessary indexing and a few other advanced features for
the database design.

In this chapter we will cover the following topics:

1. Business Case
2. Planning a Database Design
3. Building the Conceptual Model
4. Applying Normalization
5. Table Structures
6. Indexing for Better performance
7. Other Considerations

Business Case
A Learning Management System (LMS) supports the e-learning activities of different
educational institutes. Further, an LMS handles the management and delivery of
e-learning courses.

Let us look at a hypothetical LMS as a case study to design databases.

Educational Institute
The main logical entity of the LMS is the educational institute. These institutes
operate at different scales. A small number of educational institutes need 24x7
support for their end users; further, these high-scale institutes need additional
disaster recovery options. A large number of institutes require the same functions as
the high-scale institutes, but not on the same scale. Since these are small to
medium-scale institutes, their users may not require 24x7 support. Considering the
different support levels, there are three categories of educational institutes: High,
Medium and Small.

Though the educational institutes have different scales, depending on licensing cost
they may request an upgrade or downgrade between the three categories.

Every educational institute has multiple faculties, and every faculty has multiple
departments, as shown in the following example.


The above screenshot shows how institutes, faculties, departments and courses are
organized, and how courses are defined in the Computer Engineering Department.

In this simplified LMS, there are three main business actors, which are discussed in
the following section.

Business Actors in LMS


In the LMS there are three business actors: Lecturers, Students and Administrators.

Lecturers - Lecturers have total ownership of the courses. Lecturers define the
content of the courses and the modules within them. Every module has one or many
module leaders; a module leader is another lecturer from the institute. The module
leader defines the module curriculum, mode of evaluation, references and module plan.
A lecturer can initiate discussions in which students can participate.
In addition to the module leader(s), there can be optional lecturers such as visiting
lecturers, industry experts, subject matter experts and so on. The module leader will
update the course content, mark attendance, set deadlines, mark

the assignments, do the grading and so on.

Course Administrators - Course administrators have permission to create courses,
register students, and enroll students in courses.
Students - Students participate in each module by submitting assignments and
initiating discussions with lecturers and fellow students. In addition to their
engagement with courses, they can meet a lecturer by making an appointment.

In the LMS, the course is an important entity. Let us look at how the course is defined
in the LMS.

Course in LMS
Students are enrolled in a course, and each course belongs to a department. Every
course has a duration and multiple modules, and a module can be a core module or an
optional module. Each module should have a study plan, which contains the schedule and
assignments, as shown in the following example.

As shown in the above example, Advanced Databases is one of the modules in the BSc (IT)
course. However, Advanced Databases can also be a module in another course. In the
example, a weekly study plan and one assignment are listed. The study plan includes
notes and research papers, whereas the assignment has assignment documents and a

[ 382 ]
Case Study - LMS using PostgreSQL Chapter 14

marking schema. An assignment has multiple deadlines, such as a soft deadline and a
hard deadline, that apply to the students. There are two important dates for lecturers:
the notification date and the marking completion date.

Every course has a course coordinator and every module has a module leader. There can be
one or many lecturers assigned to a module. A Lecturer can be a content provider for
multiple modules.

Student Discussions
Students are encouraged to hold productive discussions through the LMS. A student or a
lecturer can initiate a discussion. Every discussion has a title and a description.
Users can tag a discussion with different labels, such as question, database and so on,
so that it can be searched quickly at a later stage. The module leader, the course
administrator or the initiator can close a discussion.

Auditing Requirements
Auditing is an important aspect, as there can be legal issues with respect to
assignments. Therefore, all activities related to assignments should be audited. The
auditing mechanism should retain the time of each activity and the user who performed
it.

Having understood the basic requirements for the LMS, let us plan for the database design
in the next section.

Planning a Database Design


First of all, we need to identify the type of schema that we should use to design the
LMS. In Chapter 3: Planning a Database Design, we came across two different types of
database schemas. In that chapter, in the Database design approaches section, we
identified that, depending on the business and technical needs, the database designer
has to choose the correct database schema. In the database world, there are two types
of systems: Online Transaction Processing (OLTP) and Online Analytical Processing
(OLAP).

Since the LMS is a transaction-driven, application-oriented system that supports the
day-to-day business of an educational institute, the LMS database should be designed as
an OLTP system.

In Chapter 3: Planning a Database Design, we discussed different approaches to

database design, mainly referred to as bottom-up and top-down. The LMS database design
will take the top-down approach, where the model is designed first, then the entities
and then the attributes. In the top-down approach, the designer starts with a basic
idea of what is required for the system. Since the top-down approach is suited to large
and complex database systems, we can utilize it to design the LMS database.

Apart from the design approach, we need to plan the database as a distributed database,
as there is a clear need to distribute data. We can distribute the data depending on
the scale of the educational institutes.

Let us look at how the conceptual model is built for the LMS database in the following
section.

Building the Conceptual Model


As we discussed in Chapter 4: Representation of Data Models, in the Creating Conceptual
Database Design section, the following core tasks are identified for the conceptual
model:

1. Identification of Entity Types
2. Identification of Relationship Types
3. Identify and associate attributes with entity or relationship types
4. Determine attribute domains
5. Determine candidate and primary keys

Identification of Entity Types


During the brief requirement analysis of the LMS, we identified Institute, Department,
Course, Module, Student, Lecturer, Administrator, Appointment and Assignment as the
main entity types of the LMS system.

Initially, we need to draw up an entity type table to identify the entity types and
refer to them at a future date. The following is the table of entity types.

Entity Type | Description | Similar Names / Different Roles
Institute | Educational institute that is a logical unit combining courses, lecturers, and students. | University, School, Academy
Department | Separation of teaching units. | Division, Teaching Unit
Course | Plan of study on a subject. | -
Module | Different subjects of a course. | Subject
Student | The person who is enrolled in a course and who engages with the course curriculum. | Pupil
Lecturer | The person who is responsible for the delivery of courses. | Teacher, Professor, Course Coordinator, Module Leader, Department Head
Administrator | The person who manages the course. | Course Manager
Appointment | A meeting between a lecturer and a student. | Meetings, Interviews
As presented in the above table, Lecturer is an important entity type in the LMS
system. However, within the Lecturer entity type there are specific roles, such as
Course Coordinator, Module Leader and Department Head, which are all lecturers.

At this stage, the core entity types of the LMS database are identified. However,
this list can be modified later, during the normalization stage.

After identifying the core entity types, let us identify the relationships between
them.

Identification of Relationship Types


Our next task is to identify the core relationship types by means of E-R diagrams.

Let us first identify the hierarchical relationships between the above entity types.
The lecturer is the main actor in the LMS database, as they play key and varied roles
in the system.


As we identified in the Business Case section, there is a natural hierarchy in the LMS.
As we discussed in Chapter 1, hierarchical databases can be used to store such data.
However, since there are other relations that follow the relational model, we have
chosen a relational database to implement the LMS, and the hierarchical data is
represented with the relations above.

Of the identified entity types, the lecturer is the most complex. Though the
lecturer's main role is to teach different modules, there are additional roles carried
out by the lecturer. The head of the department is a lecturer who manages the
department's affairs. The course coordinator, an entity from the Lecturer entity type,
manages a course. Similarly, a lecturer becomes a module leader when it comes to
managing a module. All of these functionalities are presented in the following
screenshot.


Apart from the lecturer, there are administrators who manage the course, as shown in
the following diagram.

In the LMS system, apart from the lecturer, the student is another important entity
type. A student engages with courses and modules: a student is enrolled in a course and
thereby has to follow its modules, engaging with each module by participating in it.
This relationship is shown in the following screenshot.


Students participate in a module in multiple ways, such as attending the module,
submitting assignments, presenting and writing exams. These aspects are shown in the
following relationship diagram.


Since students engage with a module in many different ways, the relationship between
student and module is complex.

A relationship can be considered an entity type in a different database design. For
example, the assignment relationship could be identified as an entity type in another
design. Since there can be different approaches to database design, let us consider
these as relationships here.

Apart from the standard relationships, there are recursive relationships for courses
and modules. Most courses have prerequisite courses; a prerequisite means that to enrol
in a course, certain other courses must be completed. Similarly, some modules have
prerequisite modules. The recursive relationships for Course and Module are shown in
the following screenshot.


Let us identify the attributes for the identified entity types.

Identify and Associate attributes with entity or relationship types
After identifying the entity types in the LMS system, the next step is to identify the
attributes of each entity type. In the E-R Modeling section of Chapter 2: Representation
of Data Models, we looked at different notations for presenting attributes. Using those
standard notations, let us define the attributes of the Lecturer entity type. The
following is the screenshot for the Lecturer entity type.

In the above E-R diagram, Lecturing Departments, Teaching Modules, Qualifications and
Mobile Numbers are multi-valued attributes. Age is an attribute derived from the Date of
Birth attribute. Lecturer Full Name is a composite attribute of Title, First Name,
Middle Name and Last Name. Lecturer Code is the primary key of the Lecturer entity
type.


Let us look at the E-R diagram for the Course entity type, as Course is an important
entity type in the LMS.

As shown in the above screenshot, Course Code is identified as the primary key of the
Course entity type. Every course has multiple modules, and there can be prerequisite
courses for a particular course.

Since every course has multiple modules, let us look at the E-R diagram of the Module
entity type, as shown in the following screenshot.

After looking at the attributes of the important entity types Lecturer, Course and
Module, we need to identify the attributes of entity types such as Student, Appointment
and so on.


Next, let us look at the documentation of entities.

Documentation
Let us look at the documentation for the Student entity type.

Entity Type / Relationship | Attribute Name | Description | Data Type | Length | Is Required | Multi-Valued | Composite
Student | StudentNo | The attribute that identifies the student uniquely. | String | 8 | Yes | No | No
Student | Title | Title of the student; Mr, Mrs, Ms are the possible values. | String | 4 | Yes | No | No
Student | First Name | First name of the student. | String | 30 | Yes | No | No
Student | Middle Name | Middle name of the student. | String | 30 | No | No | No
Student | Last Name | Surname or last name of the student. | String | 30 | Yes | No | No
Student | Full Name | Composite attribute of Title, First Name, Middle Name, and Last Name. | String | - | - | No | Yes
Student | Mobile Number | Mobile number(s) of the student. | String | 10 | No | Yes | No
Student | Date of Birth | Date of birth of the student. | Date | - | No | No | No
After identifying the entity types, our next stage is to apply normalization to the
conceptual data model, which is covered in the following section.

Applying Normalization
In Chapter 5: Applying Normalization, we discussed that normalization is an important
process in database design. Since the LMS is a transactional (OLTP) database, we need to
follow the normalization process. In the same chapter, we discussed the different levels
of normalization: First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form
(3NF), Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), Fifth Normal Form (5NF),
and Domain-Key Normal Form (DKNF). Of these normal forms, we need to apply at least 3NF
to the database design.

Let us look at a few examples of applying normalization for the LMS database entities.

Applying First Normal Form


Let us look at the Department entity with the following sample data set.
Department Code | Department Name | Head of the Department | Faculty Code | Faculty Name | Dean | Institute Code | Institute Name | Institute Address
I01F01D001 | Electrical Engineering | Dr. Susan Rose | F01 | Engineering Faculty | Prof. Jean Dias | I01 | Institute of Engineering & Science | Mumbai, India
I01F01D002 | Civil Engineering | Prof. George Khan | F01 | Engineering Faculty | Prof. Jean Dias | I01 | Institute of Engineering & Science | Mumbai, India
I01F01D003 | Chemical Engineering | Dr (Mrs). Deepika Shanmugam | F01 | Engineering Faculty | Prof. Jean Dias | I01 | Institute of Engineering & Science | Mumbai, India
I01F02D004 | Computer Science | Dr. Simmon Hall | F02 | Science Faculty | Dr. John Arthur | I01 | Institute of Engineering & Science | Mumbai, India
I01F02D005 | Micro Biology | Mr. Peter De Costa | F02 | Science Faculty | Dr. John Arthur | I01 | Institute of Engineering & Science | Mumbai, India
If you closely analyse the above data set for the Department entity type, there are
repeating attributes:
Repeated Attributes { Faculty Code, Faculty Name, Dean, Institute Code,
Institute Name, Institute Address }

One of the main objectives of normalization is to eliminate repeating attributes, as
discussed in the First Normal Form section of Chapter 5: Normalization.

This means the repeated attributes can be separated into other entities as follows:
Faculty { Faculty Code, Faculty Name, Dean, Institute Code }
Institute { Institute Code, Institute Name, Institute Address, Head of the Institute }
Department { Department Code, Department Name, Head of the Department, Faculty Code }

If you look further at the above entities, you will see that the Dean, Head of the
Institute, and Head of the Department are entities from the Lecturer entity type.
Therefore, we introduce the Lecturer entity and reference the Lecturer Code of the
Dean, Head of the Institute and Head of the Department, as they are instances of the
Lecturer entity.
Faculty { Faculty Code, Faculty Name, Dean Code, Institute Code }

Institute { Institute Code, Institute Name, Institute Address, Head of the Institute Code }

Department { Department Code, Department Name, Head of the Department Code, Faculty Code }

Lecturer { Lecturer Code, Full Name, Qualifications }

Let us look at the ER diagram for these entities as shown in the below screenshot.

With the above design, we have removed the repeating attributes. However, this has
increased the number of entity types.
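
As a sketch of how these normalized relations might be realized in PostgreSQL (the data
types and lengths below are assumptions made for illustration):

CREATE TABLE Lecturer (
    LecturerCode varchar(10)  PRIMARY KEY,
    FullName     varchar(100) NOT NULL
);

CREATE TABLE Institute (
    InstituteCode       varchar(10)  PRIMARY KEY,
    InstituteName       varchar(100) NOT NULL,
    InstituteAddress    varchar(200),
    HeadOfInstituteCode varchar(10) REFERENCES Lecturer (LecturerCode)
);

CREATE TABLE Faculty (
    FacultyCode   varchar(10)  PRIMARY KEY,
    FacultyName   varchar(100) NOT NULL,
    DeanCode      varchar(10) REFERENCES Lecturer (LecturerCode),
    InstituteCode varchar(10) NOT NULL REFERENCES Institute (InstituteCode)
);

CREATE TABLE Department (
    DepartmentCode       varchar(15)  PRIMARY KEY,
    DepartmentName       varchar(100) NOT NULL,
    HeadOfDepartmentCode varchar(10) REFERENCES Lecturer (LecturerCode),
    FacultyCode          varchar(10) NOT NULL REFERENCES Faculty (FacultyCode)
);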

Let us see how we can apply the second level of normalization.

Applying Second Normal Form


As we discussed in the Second Normal Form section of Chapter 5: Normalization, Second
Normal Form (2NF) applies to relations whose primary keys are composed of two or more
attributes.

Let us look at the Student Enrollment data set shown below.

Student Number | Student Name | Course Code | Course Name | Course Enrolment Date
1 | Andy Oliver | C001 | Introduction to Databases | 2020/01/01
2 | Harry Steward | C001 | Introduction to Databases | 2020/01/01
2 | Harry Steward | C004 | Advanced Database Management | 2020/05/01
3 | Alex Robert | C004 | Advanced Database Management | 2020/06/01
4 | Rose Elliot | C001 | Introduction to Databases | 2020/01/01
4 | Rose Elliot | C004 | Advanced Database Management | 2020/06/15
5 | Michel Jacob | C004 | Advanced Database Management | 2020/06/01
6 | Emily Jones | C004 | Advanced Database Management | 2020/06/01
In the above relation, {Student Number, Course Code, Course Enrollment Date} is
considered the primary key, since a student can be enrolled in the same course
multiple times.

The above relation is divided into three relations as shown below.


Course { Course Code, Course Name }

Student { Student Number, Full Name }

Enrollment { Enrollment ID, Course Code, Student Number, Enrollment Date }

This can be shown in the following E-R diagram.

Similarly, Assignment Submission and Exams can be brought to the second normal form.

The following is the Assignment Submission E-R diagram.


In the above design, Assignment Submission is separated into different entity types with the
submission date, marked date, and so on.

Similar to Assignment Submission, the same normalization level should be followed for exams.

Breaking Multi-Valued Attributes


In some instances, there are multi-valued attributes. For example, lecturers have multiple
qualifications. Though these can be stored in the same table as follows, this might have a
negative impact on performance and usability.

Lecturer Code | Lecturer Name | Qualification
L001 | Rose Kempton | B.Sc (Eng), M.Sc, PhD
L002 | Kieth Taylor | M.Sc (AI), PG Dip (Management)
L003 | Frank Nelson | B.Sc (Eng), M.Sc (AI), MPhil
L004 | Peter Manuel | PhD, PG Dip


This can be split into multiple entities as shown below.

Lecturer { Lecturer Code, Lecturer Name }

Lecturer Qualification { Lecturer Code, Qualification }

Similar to qualifications, lecturers' memberships are treated in the same way.

Lecturer { Lecturer Code, Lecturer Name }

Lecturer Membership { Lecturer Code, Membership }

Both relationships can be shown in the below ER diagram.
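As a minimal sketch, the two decomposed multi-valued attributes could be expressed in PostgreSQL as follows. These table and column names are illustrative only and assume a Lecturer table keyed by Lecturer Code; they differ from the final physical design built later in this chapter.

-- Illustrative only: one row per lecturer/qualification pair
CREATE TABLE public."LecturerQualification"
(
    "LecturerCode"  character varying(8)  NOT NULL,
    "Qualification" character varying(50) NOT NULL,
    -- the pair is unique, so a lecturer cannot hold the same qualification twice
    CONSTRAINT "LecturerQualification_pkey" PRIMARY KEY ("LecturerCode", "Qualification")
);

-- Illustrative only: one row per lecturer/membership pair
CREATE TABLE public."LecturerMembership"
(
    "LecturerCode" character varying(8)  NOT NULL,
    "Membership"   character varying(50) NOT NULL,
    CONSTRAINT "LecturerMembership_pkey" PRIMARY KEY ("LecturerCode", "Membership")
);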

After we have identified the entities, the next step is to implement this design in a physical
database. Let us see how we can implement the identified tables in physical databases in the
next section.

Table Structures
Until now, we did not need to know which database technology is going to be used
for the implementation. We have decided to use PostgreSQL for the implementation of the
LMS. In this design, we will be looking at how to implement distributed databases, tables,
and views.

Distributed Databases
In Chapter 13: Distributed Databases, during the sharding discussion, we identified that the
shards need to be designed. As identified during the requirement elicitation, sharding will
be done based on the institute. There will be three shards and one master shard.

In the database design, we have named LMS_Master as the master shard, and the shards
LMS_Shard_01, LMS_Shard_02, and LMS_Shard_03, as shown in the server explorer.

In the master shard (LMS_Master), there will be a master table that defines which shard
every institute belongs to. The following is the table structure for the shard master table.

It is important to implement constraints for the above table so that data integrity can be
maintained.

The InstituteID attribute is the primary key of the InstituteMasterShard table. Therefore,
the InstituteID attribute is NOT NULL and UNIQUE. InstituteID is an auto-increment
value.
The InstituteCode attribute should be UNIQUE.


The ShardID attribute decides the shard that an institute belongs to. Therefore,
it should be NOT NULL and should have a value of 1, 2, or 3. To maintain
data integrity, a CHECK constraint is implemented. In case shards are added to
expand the horizontal scalability, you need to modify the constraint.
The IsActive attribute allows you to enable or disable an institute. This attribute can
contain either 1 or 0. Similar to ShardID, to maintain data integrity, a CHECK
constraint is implemented on the IsActive column.

The following screenshot shows how CHECK constraints are implemented in PostgreSQL.

The following script shows how to create the InstituteMasterShard table.


CREATE TABLE public."InstituteMasterShard"
(
"InstituteID" bigserial NOT NULL,
"InstituteCode" character varying(10) COLLATE pg_catalog."default",
"InstituteName" character varying(50) COLLATE pg_catalog."default" NOT
NULL,
"ShardID" smallint NOT NULL,
"IsActive" bit(1),
CONSTRAINT "InstituteMasterShard_pkey" PRIMARY KEY ("InstituteID"),
CONSTRAINT "InstituteMasterShard_InstituteCode_key" UNIQUE
("InstituteCode"),
CONSTRAINT "CHECK_ShardID " CHECK ("ShardID" >= 0 AND "ShardID" <= 3),
CONSTRAINT "CHECK_IsActive" CHECK ("IsActive" = '0'::"bit" OR
"IsActive" = '1'::"bit")
)

Let us look at some sample data for the above table.
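Since the sample data appears only as a screenshot, the following is an illustration of what such rows could look like; the institute codes, names, and shard assignments below are made up for demonstration.

INSERT INTO public."InstituteMasterShard"
    ("InstituteCode", "InstituteName", "ShardID", "IsActive")
VALUES
    ('IN0001', 'Institute of Engineering & Science', 1, B'1'),  -- assumed shard assignment
    ('IN0002', 'University of Business Management', 2, B'1'),
    ('IN0003', 'Institute of Fine Arts', 3, B'0');               -- an inactive institute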


In the master shard, you do not need to keep all the details of the institutes. Those details
can be stored at each shard level.

Maintaining the shards is an important function. As a database administrator, it is
important to monitor the distribution of institutes. This can be achieved with the following
script.
SELECT COALESCE(ActiveInstitutes."ShardID", InactiveInstitutes."ShardID") AS "ShardID",
       ActiveInstitutes."ActiveInstitutes",
       InactiveInstitutes."InactiveInstitutes"
FROM
(
    SELECT "ShardID",
           COUNT(*) AS "ActiveInstitutes"
    FROM public."InstituteMasterShard"
    WHERE "IsActive" = B'1'
    GROUP BY "ShardID"
) ActiveInstitutes
FULL OUTER JOIN
(
    SELECT "ShardID",
           COUNT(*) AS "InactiveInstitutes"
    FROM public."InstituteMasterShard"
    WHERE "IsActive" = B'0'
    GROUP BY "ShardID"
) InactiveInstitutes
ON ActiveInstitutes."ShardID" = InactiveInstitutes."ShardID"


The following is the result of the above query, showing the distribution of the institutes.

After the master shard is completed, let us look at how the other tables are designed.

Proposed Tables & Relations


In the table design, we have taken three important decisions:

To ease the management of shards, every table will have an Institute ID, which is the
shard attribute. This will help database administrators and customer support
staff to move data between shards. This is a requirement that was identified at
the requirement elicitation.
The primary key will be an auto-increment attribute and the business key will
be used as a UNIQUE index. For example, in the Lecturer entity type, there will
be a LecturerID, and the business key, LecturerCode, will have a UNIQUE index.
Composite types will be used for common composite attributes such as Address and
Full Name. During the identification of entity types, we noticed that an Address
attribute can be introduced to Institute, Lecturer, Administrator, and Student.
Similarly, Full Name can be included for Lecturer, Student, and Administrator.
Rather than using individual attributes such as Address I, Address II, and City, it is
better to use a combined attribute.

Let us see how we can include Composite data types.

Including Composite Data Types


Let us look at the Address attribute in the Institute, Lecturer, Student, and Administrator
entity types.


Let us see how we can create composite types in PostgreSQL. The following is the Address
type.


The same Address type can be created from the below script.
CREATE TYPE public."Address" AS
(
"AddressI" character varying(25),
"AddressII" character varying(25),
"City" character varying(15),
"Province" character varying(15),
"Country" character varying(15),
"Postcode" character varying(10)
);

Similarly, we can extend the composite type to the Full Name as well.


Following is the implementation for FullName composite data type in PostgreSQL.

The above implementation can be generated from the following script.


CREATE TYPE public."FullName" AS
(
"Title" character varying(6),
"FirstName" character varying(25),
"MiddleName" character varying(25),
"LastName" character varying(25)
);


Similarly, we can implement a composite data type for the audit columns. For audit
purposes, we need the IsActive, RecordCreatedDate, RecordModifiedDate,
and InstituteID attributes in every table. Rather than adding all four columns individually,
we can create an AuditColumns composite type as shown below.
CREATE TYPE public."AuditColumns" AS
(
"IsActive" bit(1),
"RecordCreatedDate" date,
"RecordModifiedDate" date,
"InstitiuteID" bigint
);

Now let us see how we can implement this in the Lecturer table, as the Lecturer table needs
the FullName, Address, and AuditColumns composite data types. The following screenshot
shows how to include composite data types in a table in PostgreSQL.

The following is the script to create the Lecturer table.


CREATE TABLE public."Lecturer"


(
"LecturerID" serial NOT NULL,
"LecturerCode" character varying(8) NOT NULL,
"Name" "FullName",
"Address" "Address",
"AllocatedDepartmentID" integer,
"DateofBirth" date,
"AuditColumns" "AuditColumns",
CONSTRAINT "Lecturer_pkey" PRIMARY KEY ("LecturerID"),
CONSTRAINT "UNIQUE_LecturerCode" UNIQUE ("LecturerCode"),
CONSTRAINT "FK_AllocatedDepartmentID_Department_DepartmentID" FOREIGN KEY
("AllocatedDepartmentID")
REFERENCES public."Department" ("DepartmentID") MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)

In the Lecturer table, LecturerID is the primary key, whereas LecturerCode is considered
a UNIQUE key. AllocatedDepartmentID has a foreign key constraint referencing the
Department table.

Let us see how we can insert a record into the Lecturer table. To populate the Lecturer table, we
first need to populate the Institute, Faculty, and Department tables, as shown in the following script.
INSERT INTO public."Institute"(
"InstituteID", "InstituteCode", "InstituteName", "Address",
"HeadofInstitueID", "AuditColumns")
VALUES (2, 'IN0002', 'University of Business Management',
ROW('','','Mumbai','','India','213456'), NULL,
ROW(B'1','2020-05-18',NULL,2))

INSERT INTO public."Faculty"(


"FacultyID", "FacultyCode", "FacultyName", "DeanID", "AuditColumns")
VALUES (1,'FAC0001', 'Faculty of Management',NULL,
ROW(B'1','2020-05-18',NULL,2));

INSERT INTO public."Department"(


"DepartmentID", "DepartmentCode", "DepartmentName", "HeadofDepartmentID",
"FacultyID", "AuditColumns")
VALUES (1, 'DEP001', 'Department of Statistics', NULL, 1,
ROW(B'1','2020-05-18',NULL,2) );

INSERT INTO public."Lecturer"(


"LecturerCode", "Name", "Address", "AllocatedDepartmentID", "DateofBirth",
"AuditColumns")
VALUES ('L0001',
ROW('Dr','Shaun',NULL,'De Mel'),
ROW('','','Mumbai','','India','213456') ,1,'1974-11-28',


ROW(B'1','2020-05-18',NULL,2))

When inserting values into composite attributes, the values should be supplied with the ROW syntax,
as shown in the above script.

When accessing these composite attributes, specific query syntax needs to be used,
as shown below.
--SELECTING Data
SELECT "LecturerCode", "Name",
("Address")."City" -- selecting the City column from the Address composite attribute
FROM public."Lecturer"
WHERE
("AuditColumns")."IsActive" = B'1' -- choosing the IsActive attribute from AuditColumns

The following screenshot shows the result of the above query, where the Name attribute is a
composite attribute.

Associated Tables for Lecturer Table


As we discussed in the requirements, Institute, Faculty, and Department fall into a
hierarchical relationship. Each of these three logical business units has a head who is an
instance of the Lecturer entity type. That relationship is shown in the below screenshot.


Apart from the above basic details for Lecturer, there are a few other important properties of
lecturers. For example, details such as qualifications, contact telephone numbers, email
addresses, and memberships have to be stored. More of these types of details will be requested in
the future. Therefore, if you decided to implement a table for each of them, you would end up with a
large number of tables, which would make the data difficult to maintain.

To avoid a large number of tables, a single table can be introduced with a type label. In the future,
if there is a requirement for a new type, it is a matter of inserting records into this table
instead of creating a new table.

The following screenshot shows the table structure for the extended properties of the
Lecturer.


As usual, the AuditColumns composite attribute is added, and an index is created on the
ExtendedDetailType column.
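Since the table structure appears only as a screenshot, the following is a minimal sketch of what a LecturerExtendedDetail table of this shape could look like; the exact column names and lengths are assumptions.

CREATE TABLE public."LecturerExtendedDetail"
(
    "LecturerExtendedDetailID" serial NOT NULL,
    "LecturerID" integer NOT NULL,
    -- the type label, for example 'Email', 'Mobile', 'Qualification', 'Membership'
    "ExtendedDetailType" character varying(25) NOT NULL,
    "ExtendedDetailValue" character varying(100) NOT NULL,
    "AuditColumns" "AuditColumns",
    CONSTRAINT "LecturerExtendedDetail_pkey" PRIMARY KEY ("LecturerExtendedDetailID"),
    CONSTRAINT "FK_LecturerID_Lecturer_LecturerID" FOREIGN KEY ("LecturerID")
        REFERENCES public."Lecturer" ("LecturerID")
);

-- index to support filtering by detail type, as described above
CREATE INDEX "IX_ExtendedDetailType_LecturerExtendedDetail"
    ON public."LecturerExtendedDetail" ("ExtendedDetailType");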

Let us see the relation of the LecturerExtendedDetail table with the Lecturer table in the
below screenshot.


Let us see how data is organized in the above table. The following screenshot shows different
details of a single lecturer.


When another type is needed, it is simply a matter of inserting records into the extended table. This
configuration avoids having an unnecessarily large number of tables that are difficult to
manage.

Database Design Enrollment Process


Student enrolment in different courses is an important feature of an LMS.

For course enrolment, there are mainly three entities: Students, Courses, and Lecturers.
Students are enrolled in courses, and courses have many modules. Every module has
assignments and exams.

The Student table is a standard table with a serial integer column and the usual audit columns, as
shown in the below screenshot.


In the above Student table, Student is linked to the Department table through
DepartmentID.
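As the Student table is shown only as a screenshot, the following is a rough sketch of its structure, following the conventions used for the Lecturer table; the exact columns and lengths are assumptions.

CREATE TABLE public."Student"
(
    "StudentID" serial NOT NULL,
    "StudentCode" character varying(8) NOT NULL,   -- business key
    "Name" "FullName",
    "Address" "Address",
    "DepartmentID" integer,
    "DateofBirth" date,
    "AuditColumns" "AuditColumns",
    CONSTRAINT "Student_pkey" PRIMARY KEY ("StudentID"),
    CONSTRAINT "UNIQUE_StudentCode" UNIQUE ("StudentCode"),
    CONSTRAINT "FK_DepartmentID_Department_DepartmentID" FOREIGN KEY ("DepartmentID")
        REFERENCES public."Department" ("DepartmentID")
);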

Let us look at the Course Table as shown in the below screenshot.


The course table is one of the core tables in the LMS database. There are prerequisite
courses for some courses. Since there can be multiple prerequisite courses, that
configuration cannot be stored in the Course table. Hence, you need to introduce an
additional table which has the following table schema.


In the above table, the combination of CourseID and PrerequisiteCourseID forms a
unique constraint. This avoids configuring the same course twice as a prerequisite for a
given course. If a course has no prerequisite courses, there will be no record for it in the
PrerequisiteCourse table.
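A minimal sketch of such a table, assuming the Course table has a serial CourseID primary key; the column names here follow the conventions used elsewhere in this design and are not the exact ones from the screenshot.

CREATE TABLE public."PrerequisiteCourse"
(
    "CourseID" integer NOT NULL,
    "PrerequisiteCourseID" integer NOT NULL,  -- the course that must be completed first
    "AuditColumns" "AuditColumns",
    -- the combination avoids configuring the same prerequisite twice for a course
    CONSTRAINT "UNIQUE_CourseID_PrerequisiteCourseID" UNIQUE ("CourseID", "PrerequisiteCourseID"),
    CONSTRAINT "FK_CourseID_Course_CourseID" FOREIGN KEY ("CourseID")
        REFERENCES public."Course" ("CourseID"),
    CONSTRAINT "FK_PrerequisiteCourseID_Course_CourseID" FOREIGN KEY ("PrerequisiteCourseID")
        REFERENCES public."Course" ("CourseID")
);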

Now let us look at all the relationships with the course table as shown in the below
screenshot.


When a student is enrolled in a course, the student will be allocated to multiple modules.
Those modules will be either compulsory or optional. A module can be linked with
multiple courses, but a module has an offering department and a module leader. There
can be multiple lecturers for a module, and there are prerequisite modules just as we
saw for courses.

The following screenshot shows the table structures for the Module.


To avoid overloading the above screenshot, the Department, Lecturer, and Student tables are
removed from the diagram. However, those three tables are part of this design.

After the design of modules, the next step is student submissions. Student submissions can
be for assignments and for exams.


Similar to assignment submission, exam submission and exam marking can also be
designed.

Let us see the important factors in this design.

Important points in the Database Design


Every design has its own features. In this design of the LMS, we have taken special
approaches considering various database design considerations.

We distributed the database with respect to the institute. Sharding was used as the
database distribution technique. It was decided to have three shards depending on the
scale of institute operations. Since there can be situations where we need to transfer data
between shards in order to upgrade or downgrade an institute, an Institute ID attribute is
maintained in every table.
If there is a request to move an institute from one shard to another, the data can be moved
table by table, since we have included InstituteID in every table.

Every table will have audit columns, namely IsActive, RecordCreatedDate,
RecordModifiedDate, and InstituteID. A composite data type is used for these audit
columns as it makes them easier to add to the tables.
Apart from the audit columns, we have used composite data types for the Name and
Address columns, as they are the most commonly used attributes in the LMS database.
For example, Student, Lecturer, and Administrator all have an address and a name.
For all the tables, we have used serial columns as the primary key. The serial
data type (or bigserial data type) generates values automatically in increasing order.
With this implementation, we have ensured that there is less fragmentation. Since we
are using integer data types to join data between tables, query performance will be enhanced.
For lecturer details such as emails, mobile numbers, qualifications, and memberships,
we have used only one table called LecturerExtendedDetail. This will enable developers
to include more types in the future without modifying the database schema.
For each business process, we have included separate tables. For example, we have
included tables for enrollments, submissions, and exams. With this approach,
transactions per table are reduced, and thus overall system performance is improved. For
transaction tables, we can introduce partitions per year so that we can achieve
performance, scalability, and so on.
For all tables, foreign key constraints are established. With this implementation,
data integrity is achieved. Further, we have used the default option NO
ACTION so that in the case of conflicts, data will not be deleted or updated.

Let us discuss how to improve performance in the designed database by using indexes.

Indexing for Better performance


In Chapter 8: Working with Indexes, we discussed how important it is to create indexes to improve
data retrieval performance. In that chapter, in the Disadvantages of Indexes section, we
indicated that we should select an optimum set of indexes, as a large number of indexes will hinder
write performance. The LMS is an OLTP database where there will be roughly equal reads
and writes. Therefore, in the designed LMS database we need to choose an optimum
number of indexes.


Since we have assigned primary keys, we have made the primary key the clustered index
for all the tables. The following screenshot shows the configuration of the
clustered index on the AssignmentSubmissionID column of the AssignmentSubmission
table.

Following is the script for the clustered index created in the above screenshot.
CREATE UNIQUE INDEX "CIX_AssigmentSubmission"
ON public."AssignmentSubmission" USING btree
("AssignmentSubmissionID" ASC NULLS LAST)
TABLESPACE pg_default;

ALTER TABLE public."AssignmentSubmission"


CLUSTER ON "CIX_AssigmentSubmission";

Apart from primary keys, we have defined foreign key constraints for tables where
necessary. During table joins, foreign keys will typically be used as the joining columns.
Therefore, to improve performance, non-clustered indexes are created on those foreign
key columns.


Foreign key constraints exist to maintain data integrity between tables,
not as a performance option. A foreign key constraint does not create an
index automatically. Therefore, you have to create an index explicitly on foreign key
constraint columns.

For the AssignmentSubmission table, we can create two non-clustered indexes, on StudentID and
AssignmentID, as shown in the below script.
CREATE INDEX "IX_SutdentID_AssginmentSubmission"
ON public."AssignmentSubmission" USING btree
("StudentID" ASC NULLS LAST)
TABLESPACE pg_default;

CREATE INDEX "IX_AssginmentID_AssignmentSubmission"


ON public."AssignmentSubmission" USING btree
("AssignmentID" ASC NULLS LAST)
TABLESPACE pg_default;

Apart from primary keys and foreign key constraints, some tables have business keys.
For example, the Student table will have StudentCode, the Lecturer table will have LecturerCode,
the Course table will have CourseCode, and so on. Normally, from the application point of
view, these business keys will be used for searching. Therefore, to facilitate searches, indexes can
be implemented on them. Since business keys should also be unique, a UNIQUE index can be created
for these business key columns.

The following screenshot shows the configuration of the unique index for CourseCode in the
Course table.


Following is the script for the above unique index.


CREATE UNIQUE INDEX "UIX_CourseCode_Course"
ON public."Course" USING btree
("CourseCode" COLLATE pg_catalog."default" ASC NULLS LAST)
INCLUDE("CourseDescription")
WITH (FILLFACTOR=90)
TABLESPACE pg_default;

In the above index, two important configurations are implemented.

1. Fill Factor: Since the index is on a string column, there can be fragmentation. To
reduce fragmentation, a fill factor of 90 is used.
2. Include Columns: So that queries searching by CourseCode can return the
CourseDescription column directly from the index, the CourseDescription column is used
as an include column.

Having discussed the indexes for the designed database in detail, let us look at other
considerations such as security, auditing, and so on.

Other Considerations
After the database design is completed with table structures and indexes, as a database
designer you need to attend to other factors of database design, such as security, auditing,
and so on.

Security
In this database design, we have a distributed database system. This means we have
multiple databases in the system. Since these are separate physical databases, from the
administrative point of view, different administrative privileges can be granted to
different administrators.

From the application point of view, there are sub-systems such as Course Enrollment,
Student Registration, Assignment Submission, and so on. For each sub-system, a database
user is created so that the particular application sub-system uses its own database
user. Those sub-system users will have only execute permissions on the relevant procedures and
read access to views; direct access to tables is prohibited.
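A rough sketch of how such a restricted sub-system user could be set up in PostgreSQL; the role name, view, and function below are illustrative assumptions, not objects defined earlier in this design.

-- Illustrative role for the enrollment sub-system
CREATE ROLE enrollment_app LOGIN PASSWORD 'change-me';

-- No direct table privileges are granted; only the objects the sub-system needs
GRANT USAGE ON SCHEMA public TO enrollment_app;
GRANT SELECT ON public."CourseView" TO enrollment_app;                                  -- an assumed view
GRANT EXECUTE ON FUNCTION public."EnrollStudent"(integer, integer) TO enrollment_app;   -- an assumed procedure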

Auditing
Though we have implemented the AuditColumns composite data type in every table as a
preliminary auditing mechanism, in the case of an LMS we need to adopt a more extensive auditing
mechanism.

In this regard, we can use a separate auditing database so that all the auditing records are
stored there. Those auditing tables will not have indexes, so that transaction
duration will not be impacted.

High Availability
Though high availability falls mainly into the administrators' hands, from the database
design perspective we can take a few steps so that better availability options can be
specified at the design stage. For example, we can use read-only copies of the
database so that, in case of an issue in the main database, users can connect to the read-only
databases. With this approach, users will at least have the option of reading data from the
read-only databases.

Scalability
In the LMS database, Course Enrollment, Exams, and Submission are heavy transaction
tables. We have achieved some level of scalability from the database distribution, and we can
improve the scalability further.

We can implement partitioning for the above-mentioned tables to enhance scalability. Since
we are not sure about the ideal partition interval, we can implement monthly partitions, for
example as sketched below.
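A minimal sketch of monthly range partitioning in PostgreSQL (declarative partitioning, version 11 or later), using a simplified, assumed Enrollment table; the real table would carry the full set of columns and constraints described earlier.

CREATE TABLE public."Enrollment"
(
    "EnrollmentID"   bigserial NOT NULL,
    "CourseID"       integer   NOT NULL,
    "StudentID"      integer   NOT NULL,
    "EnrollmentDate" date      NOT NULL,
    "AuditColumns"   "AuditColumns"
) PARTITION BY RANGE ("EnrollmentDate");

-- one partition per month; new partitions are added as time moves on
CREATE TABLE public."Enrollment_2020_01" PARTITION OF public."Enrollment"
    FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');

CREATE TABLE public."Enrollment_2020_02" PARTITION OF public."Enrollment"
    FOR VALUES FROM ('2020-02-01') TO ('2020-03-01');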

Data Archival
In the LMS, data will tend to grow rapidly. However, since we are dealing with multiple
institutes, every institute will have a different approach to data archival. Since we have
included InstituteID in the AuditColumns composite type of every table, data archival will be much easier.

Summary
After discussing different aspects of databases in the previous chapters, we dedicated our
discussion in this chapter to designing a database. We chose a Learning Management
System (LMS) as the case study.

Looking at future scalability, we chose sharding, a distributed database architecture, for the
LMS. We chose institutes as the sharding partitions.

We applied the normalization process to the tables, as the LMS is a transactional database. In
the database design, we included serial columns for all the tables and made the
business keys UNIQUE indexes. For all the tables, we included audit columns using a
composite data type, just as we used composite data types for names and addresses.
We used separate tables as much as possible in order to reduce transaction time per table;
this also reduces contention between concurrent users. Apart from including primary keys on the
serial columns, we implemented foreign key constraints for the reference
columns.

Clustered indexes were used for the primary keys, and non-clustered indexes were
implemented for the foreign keys to enhance query performance.

In the next chapter, we will be looking at the typical mistakes made by database designers.

Exercise
Expand the designed LMS database to include discussions, as identified during the
requirement elicitation.

Further Reading
What is LMS: https://www.talentlms.com/what-is-an-lms

Composite Data Types: https://www.postgresql.org/docs/9.5/rowtypes.html

15
Overcoming Database Design Mistakes
Over the previous fourteen chapters, we discussed all aspects of database design. During
this discussion, we covered database design strategies such as database modelling,
database normalization, and so on. Further, we extended our discussion to non-functional
requirements for the efficient and effective usage of databases, including transaction
management, database maintenance, scalability, and, last but not least, security. Though we
discussed different aspects of database design, it is important to stress the common mistakes
made by database designers. These can be considered the DON'Ts in database design.

In this chapter, we will discuss the following common database design errors and tips and
hacks that could help us avoid these mistakes.

Why Mistakes are Important?


Poor or No Planning
Ignoring normalization
Neglecting the Performance Aspect
Poor naming standards
Less Priority for Security
Usage of Improper Data Types
Lack of documentation
Not using SQL facilities to protect data integrity
Not using Procedures to access data
Trying to build Generic Objects
Lack of Testing
Other Common Mistakes

Why Mistakes are Important?


In every discipline, whether medicine or engineering, the best people learn from their own or
others' mistakes. Database design is no different. By going through the common mistakes
and their consequences, we can design databases that serve their purpose better.

Another reason for identifying others' mistakes is to avoid repeating them in your own
project. Some mistakes are irreversible, and it might be too late to recover once you have
made them. Correcting some mistakes will cost you in multiple ways, such as time, money,
resources, reputation, and so on. Hence, it is better to know the common mistakes of
others so that you do not follow the same path.

At some stage of our careers, we have all made a few mistakes. Those mistakes are
valuable only if we analyse them and learn from them. In this chapter, we look at
different mistakes that can occur during database design.

Let us look at mistake number one, poor or no planning, in the following section.

Poor or No Planning
The following is a very common and popular proverb:

Failing to plan is planning to fail - Alan Lakein (well-known author)

In simple terms, if you do not plan, the chances of failing are high. Most of the time,
planning for database design is neglected. This is mainly due to the wrong assumption that
database planning is simple and can be done during the development stage; one of the
main reasons for this assumption is the wish to save time. It is important to understand that time
spent on planning is not wasted. If you do not spend adequate, quality time
during the planning stage of the database, you will lose that time at a later
stage. Importantly, you need to communicate the importance of planning in database
design to higher technical management.

By not planning, you might save a few hours or days. However, due to the lack of
planning, you may have to spend many more hours on maintenance. Unnecessary
maintenance may lead to downtime and will have a business impact as well.

Apart from the maintenance, there is a higher chance that you may have to re-design your
database. Re-designing will not only consume more time, but your reputation will also be
negatively impacted.


Apart from no planning at all, another common mistake is poor planning, where no quality
time is spent on database planning. It is essential to spend quality time: during
planning, you need to understand the environment and the possible strategies that can be
used.

In Chapter 3: Planning a Database Design, we dedicated our discussion to database planning.
In that chapter, we discussed the importance of planning, design approaches, and so on, and
we examined how to identify the nature of the database during the planning stage.

Next, we will discuss another mistake, Ignoring Normalization.

Ignoring normalization
In Chapter 5: Applying Normalization, we discussed the importance of
normalization. However, most of the time, normalization is ignored during database design
discussions. As we discussed in the Purpose of Normalization section, normalization helps us
optimize database performance. When normalization is ignored, data will be repeated, and
excessive repetition of data will cause database performance issues.

On the other hand, over-normalization will result in too many tables. When there
are too many tables, you need too many joins to retrieve data, and too many joins will impact
database performance negatively.

Normalization is better suited to OnLine Transaction Processing (OLTP) databases, as OLTP
databases have both reads and writes. However, in OnLine Analytical Processing
(OLAP) databases, you will have mostly data reads. Normalizing an OLAP database will
hurt performance; therefore, data models should be de-normalized in OLAP
databases.

If normalization is ignored at the design stage of a database, you will run into
unnecessary maintenance tasks once the database is in the live environment. If
normalization has to be applied to a database in the live environment, there will be downtime,
and downtime will impact business efficiency. Therefore, it is essential to apply the necessary
normalization or denormalization levels during the database design stage.

Let us look at another mistake made by database designers, which is neglecting the
performance aspect of the database.


Neglecting the Performance Aspect


If the database design is done by accidental database designers, the performance
aspect is mostly neglected, as the functional aspects take priority. During the development of
the application, developers will be working with a small sample data set. With a small
data set, developers will not encounter any major performance issues. However, with the
data load in the live environment, the application will become slower. Since fixing
performance issues when the database is in the live environment is challenging, it is essential
to design databases to sustain future data volumes.

As a general practice with indexes, it is recommended to create clustered indexes for the
primary keys and non-clustered indexes for the foreign key constraints. For the business
keys, it is recommended to create non-clustered unique indexes.

Apart from the above standard indexes, during development, developers can
recommend indexes for specific application queries. As we discussed in Chapter 8:
Working with Indexes, developers can view the query execution plan in order to find the
queries that need indexes.

Apart from indexes, there are other steps that can be taken in order to improve the
performance of database queries. Implementation of partitions is one of them, as
partitioning improves query performance. However, as a database
designer, it is important to implement partitioning at the early stages of the database design.

Another trivial mistake made by developers is using SELECT * in views and
procedures. At the time of writing a query you might need all the columns, so as a developer
you would think that SELECT * will suit your requirement. However, due to
demanding requirements, you may later have to include additional columns in the table. Since
you have used SELECT * in queries, these unnecessary columns will be retrieved. As a
general practice, as a developer and a database designer, you should avoid using SELECT *
in views and stored procedures.

Naming standards are important even in databases; they will be discussed in the following
section.

Poor naming standards


Following a standard for naming database objects is an important aspect of database design.
However, in most cases, we ignore following a standard for the naming of
database objects.


Out of the available database objects, the table is an important object. Let us see what
standards we can adopt for table naming.

1. Do not use prefixes or suffixes for table names - Some database designers tend to
name tables with prefixes or suffixes. For example, they will use TblStudent or
StudentTable for the Student table name, which should be avoided.
2. Use singular names for table names - As we know, a table holds a collection of
entities, so there is no need for plural table names. For example, instead of
UserEnrollments, it is advised to use UserEnrollment as a table name.
3. Never use spaces in table names - When naming tables, do not leave
spaces in the table name, such as User Enrollment; instead,
use UserEnrollment for the table name.
4. Do not use special characters in table names - Some designers tend to use {, (, - or _
in table names. In some development environments, these types of
table names will not be possible. As we do not know what types of development
environments will be used in the future, it is better to avoid using special characters
in table names.

Apart from tables, columns should also be named to a standard. Similar to table names, you
should not include spaces or special characters in column names. Especially for
columns that have boolean or bit data types, start the column name with
Is. For example, during the design of the Module table structure in Chapter 14, we used
the IsCompulsory column name.

Since tables and columns are visible to the development teams and other third parties, some
attention is usually paid when deciding names for tables, views, columns, and so on.
However, there are internal objects such as foreign key constraints, indexes, and check
constraints that are not visible outside the database. Though they are not visible to the outside,
it is recommended to follow a standard when naming these objects as well.

In a foreign key constraint, there are three objects involved: the column, the referenced table,
and the referenced primary key column. When naming foreign key constraints, the
FK_ColumnName_ReferencedTableName_PrimaryKey standard can be used.
In the case of a StudentEnrollment foreign key referencing the StudentID column in the Student table,
FK_StudentID_Student_StudentID can be used as the foreign key constraint name.

Similar to foreign key constraints, indexes are also internal objects in a database. Though
indexes are not visible outside the database, it is essential to follow a naming
convention for them as well. Typically, there are three types of indexes: clustered,
non-clustered, and unique. We can use the CLS_, NIX_, and UNQ_ prefixes for these index
types respectively. When naming indexes, we can use the table name and the column name
as part of the index name. For example, we can use CLS_Student_StudentID for the
clustered index on the Student table.

Further, for check constraints, we need to implement a naming convention similar to the one
used for indexes, as illustrated in the sketch below.
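A brief sketch of these conventions in PostgreSQL; the StudentEnrollment table, its columns, and the IsActive bit(1) column are assumed here for illustration.

-- Foreign key named FK_ColumnName_ReferencedTableName_PrimaryKey
ALTER TABLE public."StudentEnrollment"
    ADD CONSTRAINT "FK_StudentID_Student_StudentID"
    FOREIGN KEY ("StudentID") REFERENCES public."Student" ("StudentID");

-- Clustered index named CLS_TableName_ColumnName
CREATE UNIQUE INDEX "CLS_Student_StudentID"
    ON public."Student" ("StudentID");
ALTER TABLE public."Student" CLUSTER ON "CLS_Student_StudentID";

-- Check constraint following the same kind of pattern (assuming IsActive is bit(1))
ALTER TABLE public."StudentEnrollment"
    ADD CONSTRAINT "CHECK_StudentEnrollment_IsActive"
    CHECK ("IsActive" = B'1' OR "IsActive" = B'0');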

Let us look at how to overcome security-related mistakes in database design in the
following section.

Less Priority for Security


During database design, most database designers and business stakeholders focus on the
functional requirements of the database. Due to this, there is a possibility that important
non-functional requirements such as security will be neglected at the design stage. This will
have a major impact in the live environment, and it can even stop the system from being
used by the business users.

During database design, we often do not plan the separation of the database server from the
application server or web server. It is important to separate the database server and the web server
onto different physical machines. This separation will provide more security for the
database as well as the application server. Apart from the security aspect, there are other
benefits, such as better use of CPU and memory.

Never store passwords in a database in their native format. Always encrypt (or hash) the password
when storing it and handle decryption or verification at the application end when required. Apart from
passwords, Social Security Numbers and credit card numbers should not be stored in their native
format; under some data protection laws, you are not allowed to store them in their native format at all.
Therefore, during database design, it is important to encrypt these types of data.

Most of the time, applications are designed to connect with the admin user, so
authorization is never tested. We should never grant admin permissions outside of the
database administrator. As we discussed in Chapter 12: Securing a Database, one of the ways
to avoid SQL Injection is by limiting the permission levels of the application user.

Another major oversight by database designers, not using proper data types, will be
discussed in the next section.

Usage of Improper Data Types


Data types are important constraints in a database. By choosing appropriate data types,
data integrity can be achieved to a certain extent. In the Selecting Data Types section of
Chapter 6: Table Structures, we discussed in detail how important it is to choose correct data
types. However, there are common mistakes made by database designers with respect to
data types. Let us look at the basic ones.

In tables, ID columns should be assigned integer data types. Since ID columns are
used for foreign key constraints, there will be a performance benefit when tables are
joined. Apart from the join benefit, storage is also reduced.

Another common mistake made by database designers is assigning the same integer data
type for all such columns. Every database technology has different integer
data types. For example, in PostgreSQL, there are three integer data types, with the
different properties shown in the following table.

Data Type | Storage | Range
smallint | 2 bytes | -32,768 to +32,767
integer | 4 bytes | -2,147,483,648 to +2,147,483,647
bigint | 8 bytes | -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807

Rather than using the integer data type in all instances, as a database designer you
should choose the correct integer data type. For example, for columns like Age and Duration,
the smallint data type can be used instead of integer, as illustrated below. As indicated in the preceding
table, using the smallint data type will reduce storage. Indirectly, reduced storage will
improve the performance of user queries.
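A small illustrative sketch of choosing the narrower type; the table and column names are made up for this example.

CREATE TABLE public."LecturerProfile"
(
    "LecturerProfileID" serial NOT NULL,
    "Age"               smallint,   -- fits easily within smallint's range, so integer is not needed
    "ExperienceYears"   smallint,   -- likewise, instead of integer
    "ProfileViews"      integer,    -- could exceed the smallint range, so integer is used
    CONSTRAINT "LecturerProfile_pkey" PRIMARY KEY ("LecturerProfileID")
);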

The composite data type, another useful data type, was discussed in Chapter 14: Case Study - LMS
using PostgreSQL. By implementing proper composite data types, database design and
development will be much easier.

Another mistake that database designers make is not using bit data types. Instead of bit data
types, they prefer to use integer data types. For boolean data, it is always advisable to
use the bit data type.

Let us discuss the lack of documentation in the database design process in the next
section.

Lack of documentation
Documentation is often neglected in IT systems, especially in database system design. As
you may remember, in Chapter 4: Representation of Data Models, we discussed different models
such as conceptual and E-R models. At the different levels of modelling, it is
important to document the necessary findings and the design. By documenting the database design,
you will revisit your findings, which will make you put more emphasis on the design.

Further, documentation will help others to understand the findings. At the initial stage,
documentation becomes the means of communication between you and the business
team. Apart from the business users, newcomers to the database design and development
teams can use it to understand the database design.

Apart from the design documentation, another important form of documentation is commenting on
the database code. In PostgreSQL, you can comment on tables and columns as shown in
the below screenshot.

These comments should be detailed enough that another designer or developer can
understand them. Apart from table objects, views, procedures, and functions can also be
commented on.
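Since the screenshot is not reproduced here, a minimal sketch of commenting in PostgreSQL; the object names are illustrative.

COMMENT ON TABLE public."Lecturer"
    IS 'Stores the basic details of lecturers; extended details live in LecturerExtendedDetail.';

COMMENT ON COLUMN public."Lecturer"."LecturerCode"
    IS 'Business key used by applications to search for a lecturer; enforced as UNIQUE.';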

Another oversight by database designers is not updating the documentation. As we
know, database design is not a static process. With the dynamic requirements of the
business users, you need to modify the table structures. Typically, most database
designers will modify the physical database structures but will not update the relevant
documentation. With this, the documentation becomes obsolete over time and no one
uses it. Therefore, it is important not only to document the database but, even more
importantly, to update the documentation whenever the database objects change.


Let us look at another mistake made by database designers in the following section.

Not using SQL facilities to protect data integrity

Most database designers focus on the functionality of the database rather than the
non-functional requirements. Data integrity is an important requirement
that is missed by many database designers. They ignore the data integrity options during
the database design stage, arguing that integrity can be maintained by the
application. For example, the rule that the Student Name cannot be blank can be implemented
in the application.

However, with increasing requirements and the expansion of the business in the future, data
will come into the database from different sources. Therefore, it is essential to
implement constraints at the database end as well. From the database side, we can
implement NOT NULL constraints, FOREIGN KEY constraints, CHECK constraints, and
UNIQUE constraints. With this implementation, every application writing data has to adhere
to these constraints, and the data integrity of the database will be maintained throughout,
as sketched below.
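A short illustration of declaring these constraints directly in a table definition; the table and its columns are invented for this example, and a Department table is assumed to exist.

CREATE TABLE public."ExampleStudent"
(
    "StudentID"    serial                NOT NULL,
    "StudentCode"  character varying(8)  NOT NULL,   -- NOT NULL: must always be supplied
    "StudentName"  character varying(50) NOT NULL,   -- cannot be blank at the database level
    "Age"          smallint,
    "DepartmentID" integer,
    CONSTRAINT "ExampleStudent_pkey" PRIMARY KEY ("StudentID"),
    CONSTRAINT "UNIQUE_ExampleStudent_StudentCode" UNIQUE ("StudentCode"),
    CONSTRAINT "CHECK_ExampleStudent_Age" CHECK ("Age" IS NULL OR "Age" > 0),
    CONSTRAINT "FK_DepartmentID_Department_DepartmentID" FOREIGN KEY ("DepartmentID")
        REFERENCES public."Department" ("DepartmentID")
);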

Not using Procedures to access data


When the database is designed by application developers, there is a greater chance that
the use of procedures will be avoided. However, there are many advantages to using stored
procedures over inline SQL code in the applications, as listed below.

1. Code reusability - When queries are executed through a procedure,
developers do not need to know the complex implementation of the code, and
the code can be re-used in many places. In addition, structural changes to the
database may result in changes to the business logic. When procedures are used,
it is only a matter of changing them in one or a few places; with inline SQL code,
you need to make changes in multiple places.
2. Security - As we discussed in the SQL Injection section of Chapter 12: Securing a Database,
one of the easiest ways to avoid SQL Injection is to use procedures
instead of inline SQL code. Apart from avoiding SQL Injection, by using
procedures you can avoid granting access at the table level. With this measure,
database authorization is enhanced.


3. Performance - When a procedure executes for the first time, its code is parsed and
cached, so subsequent executions can reuse the cached code. As cached code is
generally faster than standard inline code, using procedures can enhance the
performance of the database.

For these three reasons, it is always recommended to use procedures for database access.
However, due to the implementation effort involved, many database designers
avoid using them.
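A minimal sketch of wrapping data access in a PostgreSQL function and granting only execute rights; the function name, the Lecturer table columns, and the app_user role are illustrative assumptions.

-- Illustrative function: returns lecturers by business key from an assumed Lecturer table.
-- SECURITY DEFINER makes it run with the owner's privileges, so the caller needs no table grant.
CREATE OR REPLACE FUNCTION public."GetLecturerByCode"(p_lecturer_code character varying)
RETURNS TABLE ("LecturerID" integer, "LecturerCode" character varying)
LANGUAGE sql
SECURITY DEFINER
AS $$
    SELECT "LecturerID", "LecturerCode"
    FROM public."Lecturer"
    WHERE "LecturerCode" = p_lecturer_code;
$$;

-- The application role (assumed) gets execute rights only, with no direct table access
GRANT EXECUTE ON FUNCTION public."GetLecturerByCode"(character varying) TO app_user;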

Let us see another mistake that database designers make during database design, which is
trying to build generic objects.

Trying to build Generic Objects


As database designers, we tend to design database objects as generic objects so that they
can be standardized. However, in business, generic objects might fail. For example, in
Chapter 14: Case Study - LMS using PostgreSQL, we decided to have two separate tables
for assignments and for exams. If you went with the generic object approach, since both are
similar types of assessment, you could easily configure a single table.

With a single table, in the future, there will be columns that sit idle for one of the objects. Further,
the data volume in that one table will be very large. This will result in more table locking, which
will have an unnecessary performance impact on the database.

Therefore, in database design, it is usually better to separate business objects into separate
tables, even though the larger number of tables will result in more maintenance complexity.
During database design, the paramount concern is to ensure high performance for database queries.
Due to this, it is essential to avoid building overly generic objects during database design.

Another basic mistake made by database designers is the lack of testing, which will be
discussed in the next section.

Lack of Testing
Typically, we test applications from the functional and performance perspectives. It is
very rare that database designers carry out testing on databases. With database-specific
testing, you can avoid a lot of mistakes that would otherwise occur at a later stage of the project.

Initially, during the conceptual database design stage, as a database designer, you need to
go through the requirement elicitation documents and verify whether your model supports
the business requirements. In this exercise, it is better to sit with the business team so that
you can verify your queries and concerns directly with them.

When the table structures are created, it is important to insert some data into them.
With this approach, we can verify whether the defined data lengths and
data types are valid. Further, you can verify the validity of unique, foreign key, and check
constraints with sample data.

Apart from functional testing of the database, it is essential to verify the performance aspect
of the database. Performance can be impacted for two reasons: an increase in users and an
increase in data loads. Therefore, with the help of quality assurance experts, we can test the
feasibility of the database for a large number of users and a high volume of data.

Let us look at other common mistakes made by database designers in the next section.

Other Common Mistakes


In addition to the common mistakes that we discussed in the previous sections, there are a few
more mistakes made by database designers. Let us briefly discuss how to overcome them.

1. For sensitive and mission-critical systems, it is essential to use disaster
recovery and availability features such as failover clustering, automatic backups, replication,
mirroring, and so on. These have to be designed in the early phase rather than
adopted at a later stage of the project.
By designing the disaster recovery system at an early stage, you will have the option of
utilizing its features. For example, if we have a read-only replica, we can design
our reports to read data from the read-only replicas. This will enhance the
performance of reports as well as of other database queries.

2. Large attributes such as images and profile descriptions should be avoided in
frequently queried tables. With large attributes, the table size becomes larger, and
queries against such tables need to read more data pages, which results in slower query
performance. Therefore, large attributes should be placed in separate tables, and a
pointer to them can be used in the queried tables when necessary.


Summary
After discussing database design in detail, we dedicated this chapter to the common
mistakes made by database designers. We identified that poor planning is a critical mistake
made by database designers. Further, ignoring normalization and neglecting the performance
aspect are key mistakes, and they result in a database that is very difficult to manage.

Most database designers do not maintain proper naming standards, which turns a
database into a mess. We further identified that most database designers place too little focus
on the security of the database. Other important mistakes that we identified are the lack of
documentation and the lack of testing for databases.

Another basic mistake we identified is that database designers tend to use generic data
types rather than the correct data types. Not using data integrity options
such as check constraints, foreign key constraints, and unique constraints is another common
mistake that will lead to data integrity problems in the future.

Questions
What is the indexing strategy that you would apply during database design?

Since it is difficult to know the usage patterns at the start of the design stage of a
database, there are a few standard strategies for indexing at the design stage. For the
primary keys, we can include a clustered index. For the foreign key constraint
columns, we can implement non-clustered indexes, as the foreign key constraint
columns will be used to join tables. Apart from these two types of indexes, we
can implement unique non-clustered indexes for the business keys. Beyond these,
as an application developer, you need to find the possible indexes for specific queries
so that those indexes can be identified at the development stage.

Why is it important to use procedures instead of inline SQL code?

Procedures have many advantages over using inline SQL code in the
application itself. One obvious reason to use procedures is to support the
reusability of the SQL code. Apart from reusability, another advantage of
using procedures is as a security measure.

One way to avoid SQL Injection is the introduction of procedures. Since SQL code cannot
be injected into procedures, hackers will not be able to break into the database
with the SQL Injection technique. Further, with procedures, application users only
need execute permissions on the procedures and do not need any
permissions on the underlying tables.

Procedures use pre-compiled and cached code. This gives procedures
an advantage over inline queries with respect to performance.
