Database Design and Modeling With PostgreSQL
Dinesh Asanka
Twitter: @dineshasanka74
[email protected]
Table of Contents
Chapter 1: Overview of PostgreSQL Relational Databases 1
Understanding relational databases 2
File-based databases 2
Hierarchical database 3
Document database 4
Relational database 6
Introduction to PostgreSQL 7
Installation and configuration 7
Limitations in PostgreSQL 9
Understanding tables, columns and rows 10
Introduction to constraints 16
NOT NULL 17
PRIMARY KEY 18
UNIQUE 19
CHECK 20
FOREIGN KEY 22
Exclusion Constraints 24
Deferrable Constraints 25
Fieldnotes 25
Summary 26
Questions 27
Further Reading 28
Chapter 2: Building Blocks Simplified 29
Identifying Entities and Entity-Types 30
Entity sets 30
Strong and Weak Entities 36
Introduction to Attributes 37
Attribute Types 38
Representation of Attributes and Entities 40
Identifying Entities and Attributes 41
Identifying main Entity Types 43
Identifying Attributes of Entities 44
Generalization and differentiation of Entities 49
Naming Entities and Attributes 49
Assembling the building blocks 50
Building block rules 50
Each table should represent one and only one entity-type 51
All columns must be atomic 51
Columns cannot be multi-valued 51
Summary 52
Exercise 52
Questions 53
Chapter 3: Planning a Database Design 56
Understanding Database Design Principles 57
Usability 58
Extensibility 58
Integrity 59
Performance 60
Availability 61
Read-only and deferred operations 61
Partial, transient, or impending failures 62
Partial end-to-end failure 62
Security 62
Introduction to the database schema 63
Online Transaction Processing 64
Online Analytical Processing 64
OLTP versus OLAP 64
Choosing Database design approaches 65
Bottom-up design approaches 65
Top-down design approaches 66
Centralized design approach 66
De-centralized design 67
Importance of data modeling 67
Quality 67
Cost 68
Time to market 68
Scope 68
Performance 69
Documentation 69
Risk 69
Role of databases 69
Storing Data 70
Access Data 72
Secure Data 75
Common database design challenges 75
Data security 76
Performance 76
Data accuracy 77
High availability 77
Summary 78
Questions 79
Chapter 4: Representation of Data Models 82
Introduction to Database Design 83
Summary 247
Questions 248
Exercise 250
Further Reading 250
Chapter 9: Designing a Database with Transactions 251
Understanding the Transaction 252
Definition of Transaction 252
ACID Theory 256
Atomicity 256
Consistency 257
Isolation 257
Durability 258
CAP Theory 258
Consistency 259
Availability 259
Partition Tolerance 260
Transaction Controls 260
Transactions in PostgreSQL 262
Isolation Levels 265
Read Committed 265
Read Uncommitted 266
Repeatable Read 266
Serializable 267
Designing a database using transactions 268
Running Totals 269
Summary Table 270
Indexes 271
Field Notes 272
Summary 272
Questions 273
Exercise 274
Further Reading 274
Chapter 10: Maintaining a Database 276
Role of a designer in Database Maintenance 277
Implementing Views for better maintenance 277
Sample Views in PostgreSQL 278
Creating Views in PostgreSQL 281
Performance aspects in Views 283
Using Triggers for design changes 285
Introduction to Trigger 285
Triggers as an Auditing Option 286
Triggers to Support Table Partition 289
Modification of Tables 291
Roles 335
Conflicting of Privileges 336
Row Level Security 337
Providing Authorization in PostgreSQL 338
GRANT Command in PostgreSQL 340
Using Views as a Security Option 341
Avoiding SQL Injection 342
What SQL Injection can do 343
Preventing SQL Injection Attacks 343
Encryption for Better Security 344
Data Auditing 346
Best Practices in Data Auditing 346
Best Practices for Security 347
Field Notes 348
Summary 349
Questions 349
Further Reading 350
Chapter 13: Distributed Databases 352
Why Distributed Databases? 353
Properties of Distributed Databases 354
Advantages in Distributed Databases 355
Disadvantages in Distributed Databases 357
Designing Distributed Databases 359
Full Replication Distribution 359
Partial Replication Distribution 362
Full Replication 363
Horizontal Fragmentation 363
Vertical Fragmentation 364
Hybrid Fragmentation 365
Implementing Sharding 367
Transparency in Distributed Database Systems 370
Distribution Transparency 370
Transaction Transparency 371
Performance Transparency 372
Twelve Rules for Distributed Databases 372
Challenges in Distributed System 373
Field Notes 374
Summary 376
Questions 377
Further Reading 378
Chapter 14: Case Study - LMS using PostgreSQL 379
Business Case 380
Educational Institute 380
Chapter 1: Overview of PostgreSQL Relational Databases
Relational databases are the most common of the available database technologies, and among the many products that implement them, PostgreSQL is one of the popular choices. Before using any technology, it is important to understand its capabilities, the scale at which it is used in the industry, and the relevant best practices. Since this book is based on PostgreSQL, we need to understand the capabilities of the tool. Apart from the tool, there are a few important database design techniques, such as primary keys, unique keys, check constraints, and foreign key constraints, which need to be discussed.
This chapter provides an overview of PostgreSQL with respect to relational databases. Further, it discusses the basic elements of tables, such as columns and rows.
The reader requires a basic understanding of IT systems and how databases fit into them. Also, the reader should have an understanding of the different processes of development methodologies.
Understanding relational databases
Storing data is an essential element of any system. However, system designers have multiple storage options to choose from, depending on the system requirements. Historically, there have been a few popular options, such as file-based, hierarchical, document, and, most importantly, relational databases, which we will learn about in this section.
Let us look at the contact list of a mobile phone, which is a good example of a database. In the contact list, you want a contact to be stored and retrieved when necessary. When accessing the phone book, you don't have to traverse the entire contact list; instead, by typing the first couple of letters of the contact, you will be able to retrieve the required data.
Relational algebra and relational calculus are advanced topics related to relational queries. Since relational queries are out of scope for this book, these topics will not be discussed.
File-based databases
You don't always need fancy technologies to save your data; you can use simple text-based files instead. This is equivalent to a manual system where all data is kept in separate files. Though this type of database is ideal for a small set of data, it may not scale when the data volume increases. Furthermore, with a growing number of objects, file-based databases do not remain a feasible option; therefore, this type of database is not commonly used.
Popular formats such as CSV, TSV, and XML files are other examples of file-based databases. These types of databases provide limited indexing capabilities. In addition to these file types, different vendors have developed their own file-based formats, so there are many incompatible file formats. Another disadvantage of this database type is that it runs into a lot of data integrity issues, since relationships cannot be created within the files; instead, relationships have to be implemented in the applications.
There are also challenges in implementing different security levels for file-based databases. Due to these issues, the file-based database has become a legacy technology and is not used much in the industry today. In spite of these limitations, file databases are still used internally by some computer applications to store data related to configurations, and so on.
Hierarchical database
The hierarchical database model is a model where data is organized in a tree structure. In hierarchical databases, a parent-child relationship is maintained.
In the above data model, it can be observed that Invoice is a child of the parent Customer. In this model, both the IN001 and IN002 invoices are children of customer C001.
Hierarchical databases are used to store organizational structures and folder structures, as these have a hierarchy built into them. In engineering, applications such as electricity flow and water pipe flow can also be considered hierarchical structures, so they are further examples of hierarchical databases. In the case of electrical power flow, power flows from one point to other points, and there can be multiple roots; to model power and water flows, hierarchical databases need to support multiple roots.
The major disadvantage of the hierarchical data model is its limited flexibility. For example, it is difficult to change the structure after a particular structure has been defined. In the example mentioned in this section, if one payment is used to settle multiple invoices, the above model will not be suitable. The model mostly supports pre-defined queries; when ad-hoc queries or structural changes are required, this model may not be well suited.
Document database
Though file-based databases and hierarchical databases are not much favoured in the industry today, document databases are becoming popular due to the limitations of relational databases. Document databases are part of the NoSQL family of databases.
Relational databases are well suited for transactional, or structured, data. However, in the industry there is more unstructured data than structured data. It is believed that in 2015, around 88% of the data in the industry was unstructured, accounting for some 300 exabytes in volume. Further, unstructured data is following an exponential growth that is far ahead of structured data. Therefore, document databases are used in the industry today to cater to the growing volume of unstructured data.
In the following code block, we can see a sample document from a document database, which follows the JSON format:
{
"CustomerCode": "CUS001",
"Name" : "Greg Wilsons",
"Address": "15 Scarborough Street",
"Interest": "riding"
}
In the above figure, it can be seen that multiple attributes are stored in the same document. For the product, both types (accessory and case) are stored within the same document. In a relational database, you might need a separate table for this. Since everything is stored with the document, there are no relationships to resolve and read performance increases.
Relational database
In 1970, computer scientist E. F. Codd proposed the relational model for data in a seminal paper, A Relational Model of Data for Large Shared Data Banks (https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf), which became the foundation of relational databases. In the proposed relational model, data is modelled and logically structured into tables, which consist of columns and rows. Tables have multiple constraints, such as primary keys and check constraints, and tables are related to other tables with foreign keys. Apart from these basic features, relational database management systems support features such as scalability, transactions, security, and so on.
Although there are many types of databases to choose from for a given solution, the relational database has a lot of advantages over the other database types, such as file-based, document, and hierarchical databases. The main ability of relational databases is the option of creating relationships between tables. Every table consists of columns and rows. When choosing a database technology, it is essential to understand what each of the different database technologies provides and how suitable they are for your problem.
There are a few tools that support relational databases. Oracle, SQL Server, and DB2 are the most common proprietary relational database tools, whereas MySQL and PostgreSQL are the open-source tools.
Since we will be explaining the design concepts using PostgreSQL, let us look at the basic features of PostgreSQL.
Introduction to PostgreSQL
PostgreSQL, also known as Postgres, is an open-source relational database management system (RDBMS). In 1982, Michael Stonebraker at the University of California was the leader of the Ingres project, which eventually led to PostgreSQL. He left the project, returned in 1985, and started a new project called Post-Ingres. In 1996, the project was renamed PostgreSQL in order to reflect the tool's support for SQL queries. The first PostgreSQL release was version 6.0 in 1997.
After this bit of history on PostgreSQL, it is important to know which organizations are using the tool in order to understand its capabilities. Here are a few organizations that use PostgreSQL: Apple, BioPharm, Etsy, IMDB, Macworld, Debian, Fujitsu, Red Hat, Sun Microsystems, Cisco, and Skype. This list indicates that PostgreSQL has rich capabilities to support scalable, large, and versatile data.
PostgreSQL is a free, open-source, cross-platform tool. This means that PostgreSQL can be installed on Linux, Microsoft Windows, Solaris, FreeBSD, OpenBSD, and Mac OS X. PostgreSQL follows the SQL:2011, or ISO/IEC 9075:2011, standard. Like most popular relational databases, PostgreSQL follows the ACID transaction properties; ACID stands for Atomicity, Consistency, Isolation, and Durability. Apart from typical tables, PostgreSQL supports Views, Triggers, and Functions as well, and a designer can choose from these different options when designing a database. Indexes are available in PostgreSQL to improve the performance of data retrieval. PostgreSQL uses Multiversion Concurrency Control (MVCC) as its concurrency method; concurrency control is important when multiple users are accessing the same objects and resources. Data types are an important feature in any database: by using different data types, users can pick the required data type for their use case. PostgreSQL supports the data types defined by SQL:2008.
Installation and configuration
You can download pgAdmin 4, which will be used as the main client tool to connect to PostgreSQL, from the following link: https://www.pgadmin.org/download/.
Always remember to set a complex password, as the postgres user has administrative privileges and can do anything on the database instance. Apart from the superuser password configuration, there is additional configuration for the PostgreSQL data directory and the port. As always, the data directory should be on a different drive, and it is recommended to change the port number. The default port number for PostgreSQL is 5432, and it is recommended to change it to a five-digit number so that it is difficult to guess. After installing PostgreSQL, there will be a default database named postgres with no tables in it.
After installing PostgreSQL, the next step is to install the client tool that will be used to connect to PostgreSQL. As said earlier, pgAdmin 4 is installed.
When working with databases, it is better to have a database that contains some meaningful data. Like AdventureWorks for SQL Server and sakila for MySQL, PostgreSQL also has a sample database, a port of sakila, which can be downloaded from http://www.postgresqltutorial.com/postgresql-sample-database/. This sample database is a DVD rental database that consists of 15 tables. The Entity Relationship (ER) model and the table descriptions can be found at the same link. The sample database can be restored by using the restore option in pgAdmin 4 or from the command prompt. All the steps to restore the database can be found at http://www.postgresqltutorial.com/load-postgresql-sample-database/.
In the following example, the sample database is restored as SampleDB. Once SampleDB is restored, the database can be viewed in the pgAdmin 4 client as follows:
Now the basic configuration of PostgreSQL is done, the sample database is in place, and you are ready to go ahead.
Limitations in PostgreSQL
Before starting to design a database in PostgreSQL, it is important to understand the limitations of the tool, so that the necessary workarounds can be planned in the early stages of the design.
The following list shows the different limitations of the PostgreSQL database, in the form Limitation (Limit): Description.

Database Size (No Limit): There is no limitation on the database size; according to reports on the internet, there are PostgreSQL databases larger than 6 TB.
Table Size (16 TB): By default, PostgreSQL stores data in 8 KB chunks, and a table can have a 32-bit signed integer number of chunks, which is about two billion. The chunk size can be increased to 32 KB, which raises the table size limit to 64 TB. However, it is not practical to see tables of more than 4 TB in size, so 16 TB is more than enough.
No of Rows in a Table (No Limit): You can have any number of rows per table in PostgreSQL.
Number of Indexes (No Limit): Though there is no limit on the number of indexes per table, it is essential to keep a balanced number of indexes, as more indexes will decrease the performance of insert operations.
Field Size (1 GB): Though the limit for the field size is 1 GB, practically the server memory will be the limit.
Number of Fields (250 - 1600): The exact limit depends on the data types of the fields. In any case, this is a very large number, and it is difficult to imagine a scenario where you need this many fields in a table.
Row Size (1.6 TB): Again, a very large number.
By looking at the limitations above, it can be concluded that PostgreSQL has the capability to hold a large volume of data robustly.
Understanding tables, columns and rows
Each table is represented as a two-dimensional object. For example, the film table has information about films and the customer table has information about customers.
In the following screenshot, we can see the sample data set for the film table. In this film table, the film title, description, and release year are stored:
Similarly, the following screenshot shows the sample data set for the customer table. In this table, basic customer details such as first name, last name, and email are stored:
Both the customer and film tables hold master records, which means these are the base records for day-to-day transactions. Similarly, the staff and country tables fall into the category of master tables.
When a customer rents a film, the rental table is updated with the necessary customer and inventory data. The sample data set for the rental table is shown in the following figure.
For each transaction or DVD rental, the rental table is updated. Since transaction tables receive data whenever a transaction occurs, these tables tend to grow rapidly. The rental table has relationships to master tables such as staff, inventory, and customer, as well as some transaction-relevant dates.
Each table has multiple attributes, which are named columns, and in the following screenshot we can see the attributes/columns:
Each attribute has its own data type, which will be discussed in detail in Chapter 6, Table Structures. For the discussion on columns, let us examine the data types of each column from a script.
CREATE TABLE public.customer
(
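    -- The remainder of this statement was cut off at a page break in the source
    -- text. The columns below are a hedged reconstruction based on the columns
    -- discussed in the following paragraphs (first_name, create_date, activebool,
    -- address_id) and the publicly available DVD rental sample schema.
    customer_id integer NOT NULL,
    store_id smallint NOT NULL,
    first_name character varying(45) NOT NULL,
    last_name character varying(45) NOT NULL,
    email character varying(50),
    address_id smallint NOT NULL,
    activebool boolean NOT NULL DEFAULT true,
    create_date date NOT NULL DEFAULT CURRENT_DATE,
    CONSTRAINT customer_pkey PRIMARY KEY (customer_id)
);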
Let us only look at the columns' data types for the moment, as the other details will be discussed later in this chapter and in other chapters. The first_name column, which is created to store the first name of the customer, has a length of 45 characters. This means that the first_name column can hold a maximum of 45 characters. create_date has the date data type. By using the date data type, you get all the features of dates for free. As you are aware, the date type carries a lot of rules: some months end on the 30th and others on the 31st; in leap years February ends on the 29th, while in non-leap years it ends on the 28th; and the leap year definition itself is not very simple. Imagine if you had to configure or implement all of this yourself. To avoid this complication, you simply need to select the correct data type. Apart from this, important date calculations, such as adding an interval to an existing date or finding the interval between two dates, can be done only if you have the date data type. If not, you will run into major performance and implementation issues.
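As a hedged illustration using the sample customer table restored earlier, the following query shows the kind of date arithmetic that the date data type gives you for free:

-- add an interval to a date and compute the interval between two dates
SELECT create_date + INTERVAL '30 days' AS follow_up_date,
       AGE(CURRENT_DATE, create_date)   AS customer_since
FROM customer
LIMIT 5;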
The activebool column specifies whether the customer is an active customer or not. It can hold one of two values, true or false; therefore, the boolean data type is selected. An important configuration here is the setting of a default value. This means that, in case no value is specified, the value true is stored for that customer by default. If you want to store the value false in a record, then you need to specify the value false explicitly.
Also, there are NOT NULL constraints on some columns, which means that they are compulsory columns. In this table, address_id is NOT NULL, which means that address_id is compulsory and cannot be left as NULL.
In this example, city_id 3 is the instance for the city of Abu Dhabi. This instance is linked to a record in another table: in the Abu Dhabi record, country_id is 101. The following screenshot shows the country table with the relevant record:
The relevant record for the city of Abu Dhabi is country_id 101, which is the United Arab Emirates. This means that the data of a record can be split across multiple tables. We will look at this design aspect in Chapter 6, Table Structures.
Depending on the constraints defined on the attributes, which will be discussed in the next section, the rows will hold values that comply with the rules defined in those constraints.
Introduction to constraints
Table constraints are implemented to enforce limits on the data that can be inserted into, updated in, or deleted from a table. By introducing table constraints, data integrity can be achieved. PostgreSQL supports the following constraint types:
NOT NULL
PRIMARY KEY
UNIQUE
CHECK
FOREIGN KEY
EXCLUDE
Enforcing these rules only at the application level leads to data integrity problems and maintenance difficulties. In addition, when the system is integrated with other sources and systems, it is much better if the table constraints are implemented at the database level. If not, data integrity issues will occur when data is written via third-party applications and systems.
NOT NULL
NULL is a special indicator in the database world that shows a value does not exist. It is important to note that NULL is not equivalent to empty. Further, NULL is not equal to NULL, as NULL represents a state of an attribute rather than a value.
When defining a column in a table, you have the option of setting NOT NULL in PostgreSQL, as shown in the following screenshot:
In the above example, CustomerName is a NOT NULL column while the Remarks column is a nullable column. This means that, for every row, CustomerName must have a value, whereas the Remarks column is optional.
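The screenshot above comes from pgAdmin; an equivalent SQL sketch is shown below (the column data types are assumptions made for illustration):

CREATE TABLE "SampleTableConstraints"
(
    "CustomerID"   integer,
    "CustomerName" character varying(100) NOT NULL, -- compulsory column
    "Remarks"      character varying(200)           -- nullable, optional column
);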
If you try to insert a record without a value for CustomerName, the insertion will fail with the following error:
ERROR: null value in column "CustomerName" violates not-null constraint
DETAIL: Failing row contains (2, null, A).
SQL state: 23502
Therefore, the NOT NULL constraint ensures that compulsory columns are given a value when a record is inserted.
The NOT NULL constraint is applicable not only at the time of table creation or when adding a new column, but also when modifying a column. If a column needs to be changed to a NOT NULL column, you need to make sure that the column has a value for all existing records before modifying it. If not, you will not be allowed to add the NOT NULL constraint.
PRIMARY KEY
Selecting a proper primary key is the most critical step in a table design. The primary key is a column, or a combination of columns, that can be used to uniquely identify a row. Though the primary key is an optional constraint, a table needs a primary key to ensure row-level accessibility to the table. You can have only one primary key per table.
The primary key can be defined and configured in PostgreSQL as follows:
In the above example, CustomerID is the primary key, which is the column you can use to distinguish the records.
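The screenshot shows the pgAdmin dialog; a hedged SQL equivalent (the constraint name matches the error message shown below) would be:

ALTER TABLE "SampleTableConstraints"
    ADD CONSTRAINT "SampleTableConstraints_pkey" PRIMARY KEY ("CustomerID");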
The values of the column or columns defined as the primary key must be unique, and primary key columns cannot be nullable. When defining a primary key, you do not have to set the columns as NOT NULL; as soon as the primary key is defined, those columns automatically become NOT NULL columns.
If a user tries to insert a duplicate value into the primary key column, it will fail with the following error:
ERROR: duplicate key value violates unique constraint
"SampleTableConstraints_pkey"
DETAIL: Key ("CustomerID")=(2) already exists.
SQL state: 23505
There can be multiple candidates for a primary key. For example, in the above example, either CustomerID or CustomerName could be considered as the primary key, as both can be used to identify the customer. However, when selecting a primary key, it is recommended for performance reasons to select a shorter column. Typically, we do not choose lengthy string columns or date columns for the primary key.
UNIQUE
The UNIQUE constraint ensures that all values in a column are different. A PRIMARY KEY constraint automatically has a UNIQUE constraint. Unlike the primary key, you can have any number of unique constraints per table, and a UNIQUE constraint is nullable.
Let us look at the same example of the customer table in the following screenshot. In the customer table, CustomerID is the PRIMARY KEY. However, we know that the customer name should also be unique. In organizations, duplicate values often creep in, and they cause a lot of maintenance issues: if the customer name is not set to unique, there can be multiple rows for the same customer, and you end up dividing that customer's transactions across multiple customer records even though, in reality, it is the same customer.
To avoid this issue, we can set the customer name as a unique key, which can be defined in PostgreSQL as shown in the following screenshot:
In this unique constraint, CustomerName is selected as the unique column. The deferrable concept will be explained later in this chapter.
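A hedged SQL equivalent of this unique constraint (the constraint name matches the error message below) would be:

ALTER TABLE "SampleTableConstraints"
    ADD CONSTRAINT "UNQ_Customer_Name" UNIQUE ("CustomerName");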
When a duplicate entry is inserted into the table, the insertion will fail with the following error message:
ERROR: duplicate key value violates unique constraint "UNQ_Customer_Name"
DETAIL: Key ("CustomerName")=(John Muller) already exists.
SQL state: 23505
Similar to primary key constraints, when adding a unique constraint to a column in a table that already holds data, the existing data must not contain any duplicates; if there are duplicates, the unique constraint creation will fail. In addition to being a data integrity feature, UNIQUE constraints also provide a performance benefit, since they are backed by an index.
CHECK
CHECK constraints are defined to ensure the validity of data in a database table and to provide data integrity in the database. In any business, there is business logic that is specific to that business. For example, an employee's age should be in the range of 18 to 55; in addition, the sales commission should not exceed a certain percentage of the salary. These rules can be implemented in the application, but that leads to a complicated client, which in turn leads to data integrity issues. Therefore, it is much better to implement them in the database by means of CHECK constraints.
In PostgreSQL, check constraints can be implemented for a table as shown in the following screenshot. In this example, the Age column should be in the range of 18 to 55:
A check constraint rule can also be a combination of multiple columns. For example, the commission should be less than 25% of the salary, as shown in the following screenshot:
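A hedged SQL equivalent of the two check constraints shown in the screenshots (the table and column names are assumptions based on the text and the error message below):

ALTER TABLE "SampleTable"
    ADD CONSTRAINT "CHK_Age" CHECK ("Age" BETWEEN 18 AND 55);

ALTER TABLE "SampleTable"
    ADD CONSTRAINT "CHK_Salary_Commion" CHECK ("Commission" < "Salary" * 0.25);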
CHECK constraints are validated when data is inserted or updated in the table. If the data violates a check constraint, the following error is generated:
ERROR: new row for relation "SampleTable" violates check constraint "CHK_Salary_Commion"
DETAIL: Failing row contains (3, John Muller, 19, $2,500.00, $5,000.00)
SQL state: 23514
Similar to other constraints, if a CHECK constraint is added to a table that already contains data, the existing data has to satisfy the rule.
FOREIGN KEY
In a database, there are multiple tables, and there are relationships between them. These relationships should be maintained in order to preserve data quality, and FOREIGN KEY constraints are used to do so. If foreign keys are not present, DELETE and UPDATE commands can break the relationships between the database tables. The FOREIGN KEY constraint also prevents users from entering invalid data that would violate the rules of the table relationship. For example, country_id in the city table is linked to country_id in the country table, and that relationship is built in PostgreSQL as shown in the following screenshot:
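A hedged SQL equivalent of the foreign key shown in the screenshot (the referential actions are left at their defaults):

ALTER TABLE city
    ADD CONSTRAINT fk_country FOREIGN KEY (country_id)
    REFERENCES country (country_id);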
When this FOREIGN KEY constraint is violated by an insert or update, the data will not be inserted and an error will be generated. Let us look at it in the following code block:
ERROR: insert or update on table "city" violates foreign key constraint "fk_country"
DETAIL: Key (country_id)=(1222) is not present in table "country".
SQL state: 23503
As said before, foreign keys are introduced in order to maintain the relationships between tables. However, we know that this data is not static, as there will be updates and deletes. So, when data is updated or deleted, actions need to be taken in order to maintain data integrity.
In PostgreSQL, there are five actions that define what should happen on delete and on update, as shown in the following screenshot:
The five available actions are NO ACTION, RESTRICT, CASCADE, SET NULL, and SET DEFAULT.
Exclusion Constraints
Exclusion constraints are somewhat uncommon constraints in databases. All the constraints we have discussed so far apply to a single row; for example, a check constraint on the Age column restricts the data being inserted into that row to the correct range. Now let us assume scenarios such as the following: a student can register for only four courses in one semester, or, in a banking scenario, you do not want your clients to withdraw money from two different ATMs within a given time period. Both of these scenarios deal with multiple rows, and an exclusion constraint is used to implement them.
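The book configures exclusion constraints through pgAdmin; as a minimal, hedged sketch of the underlying syntax, the following hypothetical table prevents two reservations of the same room from overlapping in time:

-- btree_gist is needed so that the = operator on room_id can be used in a GiST index
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE room_reservation
(
    room_id  integer,
    reserved tsrange,
    EXCLUDE USING gist (room_id WITH =, reserved WITH &&)
);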
Deferrable Constraints
Deferrable and deferred are important settings for constraints. By default, when a row is inserted, constraints take effect immediately: if the data satisfies the rule, it is inserted or updated; if not, it is rejected with an error. However, sometimes you need to defer this checking for a batch of transactions. Not all constraints are deferrable: only UNIQUE, PRIMARY KEY, FOREIGN KEY, and EXCLUDE constraints accept this setting, while NOT NULL and CHECK constraints are never deferrable. When creating a UNIQUE, PRIMARY KEY, FOREIGN KEY, or EXCLUDE constraint, you can leave it as not deferrable, which means that data is rejected immediately if the relevant rule is violated. Even when a constraint is declared deferrable, its checking is only postponed while the constraint is set to deferred, in which case the check happens at the end of the transaction.
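A hedged sketch of a deferrable foreign key (the constraint name is hypothetical); the check is postponed until COMMIT only while the constraint is set to deferred inside the transaction:

ALTER TABLE city
    ADD CONSTRAINT fk_country_deferrable FOREIGN KEY (country_id)
    REFERENCES country (country_id)
    DEFERRABLE INITIALLY IMMEDIATE;

BEGIN;
SET CONSTRAINTS fk_country_deferrable DEFERRED;
-- statements that temporarily violate the constraint can run here,
-- as long as the data is consistent when COMMIT performs the check
COMMIT;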
Let us look at a real world scenario regarding database constraints in the following section.
Fieldnotes
An organization has three branches around the globe. The branches raise purchase orders, and at the head office these purchase orders are approved. Due to the nature of the business, there are more than a thousand product items. After some time, they found that, since the branches are isolated, when ordering, users who could not find a product item would simply create a new product item. After a while, they discovered that, due to this user behaviour, there were a lot of duplicate records for the same item. To make things worse, purchase orders were raised for the same item against the many different records in the system. This resulted in errors in most of the reports, and the system became unusable.
The following steps are taken to correct the data duplication issue in the system:
1. Data Cleansing: First, duplicates were removed from the product item master. Next, the transaction tables were modified to point to the cleaned product item master.
2. Enforcing a UNIQUE Constraint: A simple UNIQUE constraint was implemented on the product item master, defining that the product item description is unique.
3. FULL-TEXT Search Facility: Enforcing a UNIQUE constraint solves only half the problem. Users often misspell item descriptions, and the same item name may exist with a - instead of a space; Item-A and Item A might be the same item, but the UNIQUE constraint will not catch that. Therefore, users were provided with a FULL-TEXT search facility (out of scope for this book), and before creating a product master record they should search and check whether the product item record already exists.
4. Periodic Audit: Whatever you do, there can be errors in the data. Therefore, periodic auditing and correction should be done.
After the above implementations, the data duplication issue was fixed to a large extent. As the UNIQUE constraint alone is not the ultimate solution for data duplication, the solution needed to be complemented with FULL-TEXT search and data auditing.
Summary
In this chapter, we started by understanding relational databases and comparing them with other database types, such as file-based, hierarchical, and document databases. Then we were introduced to PostgreSQL, learned how to install and configure it, and also learned about its limitations. Next, we briefly learned about tables, rows, and columns.
PostgreSQL, an open-source database, has many advantages over its counterparts. Its limits are very high, which means that even the most robust implementations can be done using PostgreSQL. Importantly, PostgreSQL is compatible with the most common operating systems. Different data types can be used when designing tables so that a proper design can be achieved. To facilitate data integrity, as well as performance in some instances, many constraints, such as NOT NULL, PRIMARY KEY, UNIQUE, CHECK, FOREIGN KEY, and EXCLUDE, can be implemented in PostgreSQL.
After setting up the environment and gaining an understanding of the basic rules and limitations of PostgreSQL, the next step is to understand the basic representation of database entities. The next chapter will cover the basic building blocks of a database and how these building blocks can be represented.
Questions
What is the default port for PostgreSQL? This question verifies whether you have
a basic understanding of PostgreSQL databases.
What happens when a bulk data set that contains a duplicate primary key value is inserted?
When a bulk data set is being inserted into a table where a primary key is defined, and there is a duplicate in the middle of that data set, the insertion fails immediately and the remaining records are not inserted. To avoid this, you can use the EXCEPTION WHEN unique_violation THEN construct. It will still report the error but, importantly, the remaining records are also inserted.
What is the difference between a UNIQUE constraint and a primary key?
The difference is that you can have only one primary key per table, whereas you can define several UNIQUE constraints. Also, primary key columns are not nullable, while UNIQUE constraint columns may be nullable.
How can we automatically delete records in the referencing table when the referenced primary data is deleted?
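One hedged way to achieve this with the sample database tables is to declare the foreign key with the ON DELETE CASCADE action:

ALTER TABLE city DROP CONSTRAINT IF EXISTS fk_country;
ALTER TABLE city
    ADD CONSTRAINT fk_country FOREIGN KEY (country_id)
    REFERENCES country (country_id)
    ON DELETE CASCADE;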
Further Reading
You can check the following links to learn more about the topics we have just covered:
SQL:2011: https://en.wikipedia.org/wiki/SQL:2011
SQL:2008: https://en.wikipedia.org/wiki/SQL:2008
Constraints: https://www.postgresql.org/docs/9.4/ddl-constraints.html
Chapter 2: Building Blocks Simplified
This chapter will help you gather the basic knowledge of database entities and attributes. It covers all the building blocks of a database, which are the basis for designing a database. We will also cover how these building blocks are identified, how they can be represented in a database, the best practices for naming them, and the rules that can help us work more efficiently. You will need an understanding of information system requirements and the role of the database in an information system.

Identifying Entities and Entity-Types
An Entity has a set of attributes, or values, which describe the Entity. Some of those attributes can be used to uniquely identify the Entity. For example, a person will have a name, mobile number, address(es), and gender as attributes, but the person ID attribute will be used to uniquely identify the person.
An Entity Type contains similar entities. In other words, an Entity is an instance of an Entity Type. As an example, the employee with EmployeeID E0001 is an Entity of the Employee Entity Type.
In the above screenshot, the instances are called Entities; E0004 is an Entity. The Entity Type is the entire collection of all the instances. In the above example, the Employee Entity Type contains five Entities.
Let us look at Entity sets which are an important factor when building a data model.
Entity sets
A sub-collection of entities, or an Entity Set, is a set of entities of the same type that share the same properties. Entity Sets can be defined for logical reasons where the user needs a subset of the entities. Entity Sets can be disjoint or joint (overlapping) Entity Sets.
In the above Entity Sets, the employees of HR and the employees of Accounts are disjoint Entity Sets, where there is no intersection between the two entity sets. With this type of Entity Set, you do not have to worry about the intersection, as there is none. Another property of these Entity Sets is that there are no entities outside the given Entity Sets.
The following Entity Sets are more complicated than the above example, as shown in the following screenshot:
In the Entity Sets seen in the preceding screenshot, the E0002 instance is common to both Entity Sets, while the E0005 instance is not part of either entity set. In this example, the Entity Sets are drawn from the same entities. In some cases, there can be common instances across different Entity Types; for example, Mango can be a common instance of both the Fruit and Vegetable Entity Types.
If you are retrieving instances from intersecting Entity Sets, UNION should be used to prevent the intersecting instances from being displayed multiple times. UNION provides the instances of multiple entities or entity sets without any duplicates; if there are duplicates, UNION provides distinct values in the output.
This means that UNION should be used for Entities and Entity Sets that have intersecting instances, as shown in the below screenshot.
As you can see, only the distinct values are listed in the UNION output.
However, since UNION has to check for common instances, it can cause a performance issue. To avoid this performance penalty, there are cases where you can use UNION ALL: if you are sure that the entities are disjoint, then you can use UNION ALL so that retrieval performance is improved. In other words, if you are sure that the Entities or Entity Sets do not overlap, UNION ALL should be used instead of UNION. UNION ALL does not check for intersecting values; instead, it simply combines the multiple entities or entity sets without removing duplicates. If you are using UNION ALL for overlapping Entities or Entity Sets, the overlapping instances will be duplicated, as shown in the following screenshot.
As indicated in the above screenshot, labels such as India, UK, and Canada are duplicated
since we have used UNION ALL.
The UNION operator is used to combine the result-set of two or more SELECT statements.
When performing a UNION, each SELECT statement within the UNION must have the same number of columns, and those columns should have the same data types. Also, the columns in each SELECT statement must be in the same order.
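A hedged sketch of the difference (the employees_hr and employees_accounts tables are hypothetical):

-- UNION removes duplicate rows from the combined result
SELECT employee_id, country FROM employees_hr
UNION
SELECT employee_id, country FROM employees_accounts;

-- UNION ALL keeps duplicates, so overlapping instances appear twice,
-- but it is faster because no duplicate check is performed
SELECT employee_id, country FROM employees_hr
UNION ALL
SELECT employee_id, country FROM employees_accounts;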
Another type of Entity Set is a sub Entity Set of another Entity Set, as shown in the following screenshot:
As shown in the screenshot, all instances belong to the Entity Set Employees of Age Under 35, and Employees of Age Under 30 is a sub Entity Set of Employees of Age Under 35. In the above example, E001 is an employee whose age is under 26; that means his age is under 35 as well, and thus he fits into the Entity Set of Employees of Age Under 35 as well as the Entity Set of Employees of Age Under 30.
Strong and Weak Entities
While a strong entity is denoted by a solid rectangle, a weak entity is denoted by a double rectangle. Another major difference is that a weak entity does not have a primary key of its own; it has a partial key that uniquely discriminates between the weak entities. The primary key of a weak entity is a composite key formed from the primary key of the strong entity and the partial key of the weak entity.
The following screenshot shows the relationship between the Customer and Loan entities:
To obtain a loan, the customer should exist. From the screenshot above, it is clear that a Loan entity cannot exist without a Customer entity.
After the Entity Types are identified, the next important step is to identify the attributes of each entity.
Introduction to Attributes
As indicated before in Identifying Entities and Entity-Types, an Entity consists of a set of attributes. Attributes are descriptive properties of each member of the Entity Type; in other words, entities are described by their attributes. For example, the Employee Entity Type will have attributes such as Employee ID, Employee Name, Address, Qualification, and so on.
Each attribute has a range of permitted values. The Employee ID might be a number from 1 to 100000, whereas the Employee Name should be a character column where numbers are not permitted. Sometimes, depending on the business, you might decide your own range; for example, for the Employee Type attribute, you can define Permanent, Casual, and Temporary as the allowed values. Also, you might need to implement a domain for Age stating that you only employ people whose ages are between 18 and 55. This range is called the Domain of the Attribute. As discussed in Chapter 1, Overview of PostgreSQL Relational Databases, CHECK constraints are used to stop users from entering values that are outside the domain.
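A hedged sketch of how such a domain could be declared in PostgreSQL (the names are illustrative, not from the book):

CREATE DOMAIN working_age AS integer
    CHECK (VALUE BETWEEN 18 AND 55);

CREATE TABLE employee_example
(
    employee_id integer PRIMARY KEY,
    age         working_age  -- values outside 18-55 are rejected
);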
Attributes can be classified into different types for analysis purposes, as shown in the next section.
Attribute Types
Attributes can be categorized into five main types. Let us view them in the following screenshot:
There are five types of attributes: Simple, Composite, Single-Valued, Multi-Valued, and Derived. We will now look at these types briefly:
Simple: Simple attributes are attributes that cannot be divided further; they are also called atomic attributes. The Mobile Number attribute of the Student Entity Type can be considered a simple attribute; as you can imagine, it is not practical to divide a mobile number into further attributes. Most attributes fall into the simple attribute type.
Composite: Composite attributes are attributes that can be divided into sub-attributes. If you look at the Full Name attribute of an employee, it comprises the Title, First Name, Middle Name, and Last Name attributes. The Address attribute of an employee comprises the Address I, Address II, City, Postal Code, and Country attributes.
Multi-Valued: Multi-valued attributes have multiple values per instance. Some employees will have multiple mobile phone numbers, so the Mobile Number attribute is a multi-valued attribute. Similarly, an employee may have more than one child, which is again a multi-valued attribute. Typically, these attributes are moved from the main entity to another table, and a relationship is created back to the main entity. Refer to Rule 3, under the section Building block rules, for more.
Derived: Some attributes are derived from other attributes. The age of an employee and the number of years of experience are examples of derived attributes: Age is derived from the difference between the employee's birth date and the current date, and Experience is derived from the difference between the joined date and the current date. Age and Experience are very simple, standard derived attributes; there are also customized derived attributes that differ from one organization to another. For example, the calculation of a grade for a subject will depend on the institute and the subject, and sometimes on the semester too.
In the following table you can see how Age and Experience are derived for sample Employee entities from the simple attributes of those entities:
Employee ID | ... | Birth Date  | Joined Date | Age | Experience
12345       | ... | 1974-Nov-28 | 2010-Jan-16 | 45  | 9
12356       | ... | 1976-Aug-05 | 2012-Feb-15 | 43  | 7
13467       | ... | 1978-Jan-01 | 2016-Dec-01 | 41  | 2
13478       | ... | 1972-Sep-13 | 2017-Aug-01 | 47  | 2
Some database designers prefer not to store derived attributes; instead, they prefer to calculate them at run time. For example, if the Amount is equal to Unit Price * Quantity, most designers prefer to calculate the Amount at run time to save some disk space.
With the derived attribute option, storage space is the concern, while the advantage is that less computation power is needed. However, in the modern era of computing, storage is not a major concern due to the rapid advances in storage technology. Therefore, derived attributes are usually the better option in database design, as they improve read performance.
Derived attributes are updated when data is inserted and when the related attributes are updated. An important aspect of derived attributes is that they should not be changed directly by users, as they are derived from other attributes; if derived attributes can be modified directly, a data cleansing issue will arise.
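A hedged sketch of a stored derived attribute (this requires PostgreSQL 12 or later, and the table is hypothetical): a generated column is kept in sync by the database automatically and cannot be written directly by users.

CREATE TABLE order_line
(
    unit_price numeric(10,2) NOT NULL,
    quantity   integer       NOT NULL,
    -- derived attribute: maintained by the database, direct updates are rejected
    amount     numeric(12,2) GENERATED ALWAYS AS (unit_price * quantity) STORED
);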
Let us see how we can represent attributes and entities in the next section.
Representation of Attributes and Entities
Employee is the Entity Type, which consists of instances of employees. An Entity Type is represented by a rectangle with a solid outline. Employee ID is the primary key (discussed in Chapter 1, Overview of PostgreSQL Relational Databases), which is used to identify the entity; Employee ID is therefore underlined in the image. Since Employee ID is also a simple attribute, it is represented by an oval with a solid outline. Employee Name is a composite attribute whose source attributes are Title, First Name, Middle Name, and Last Name; the Employee Name attribute is linked to the Employee Entity, and the source attributes are linked to the composite attribute, Employee Name. Mobile Number is a multi-valued attribute, and it is shown as a double-lined oval. The Age attribute is derived from Date of Birth; therefore, the Age attribute is shown as a dashed oval connected to its source attribute, Date of Birth.
Identifying Entities and Attributes
Identifying Entities and Attributes is an iterative process, so it might not be easy to capture all the Entities and their attributes at once.
Please note that the following screenshot is a sample and not an original invoice:
In the following sections, we will identify Entity Types and relevant attributes to satisfy the
above invoice.
Identifying main Entity Types
Invoice
Patient
Cashier
The next important aspect of this invoice is the relationship between the patient and the cashier. Basically, the patient purchases drugs from the cashier and an invoice is issued. This means that Drug, or Item, is also an Entity Type:
Drug or Item
By further analyzing this invoice, you can see that there is an Order Number. This means that orders can come through different departments or from different wards of the hospital, so Order can also be considered an Entity Type:
Order
Doctor
The company, or the hospital, is also a conceptual entity to which all these entities belong:
Hospital
Now we have identified the main entities, and this list of entities becomes the initial version of your data model.
After the first analysis, the following Main Entity Types have been discovered: Invoice, Patient, Cashier, Drug or Item, Order, Doctor, and Hospital.

Identifying Attributes of Entities
When identifying attributes, consider not only the values that appear on the invoice, but also the possible values which may be needed in the future. By including those attributes at this stage itself, you avoid unnecessary design changes and maintenance issues which may arise later. Let us identify the attributes of the Invoice Entity with the standard symbols.
Since the main Entity is Invoice, let us identify the attributes of the Invoice Entity, as shown in the following screenshot:
In the Entity-Attribute screenshot above, all the attributes are linked to the Invoice Entity. However, during the exploration of entities, we identified that Patient, Drug, Cashier, and Order are not attributes of Invoice but separate entities. The next stage is to separate those attributes and move them to different entities.
The following tables list the Entities, their attributes, and the attribute types.
Attribute       | Attribute Type
Amount          | Simple
Quantity        | Simple
Number of Items | Simple
Payment Type    | Simple
The following table has attributes of the Hospital entity:
Attribute               | Attribute Type | Remarks
Hospital Name           | Primary Key    |
Hospital Postal Address | Composite      | Comprises of (House Number, Street, District, Province, Country, Postal Code)
The following table shows the attributes of the Patient entity:
After distributing the attributes to their entities, the next step is to create relationships between these entities. For example, there can be only one patient per invoice, but there can be one or more drugs in an invoice. We will look at these relationships in the Entity-Relationship (ER) diagram in Chapter 3, Planning a Database Design.

Generalization and differentiation of Entities
With generalization, two entities that represent different types of the same thing can be combined into one entity. On the other hand, at this phase, one entity can be divided into two separate entities if it is identified that the entity represents two different alternatives. For example, you might decide to break the Payment Type of the Invoice Entity out into a new Entity, rather than keeping it as an attribute.
After identifying Entities and their attributes, the next important step is to name them in a methodical way so that they can be identified later.
Naming conventions are followed to maintain better readability, which makes your database easier to manage.
After identifying Entity Types and Attributes, the next step is to assemble them by identifying the relationships between them.
Let us discuss the basic rules for defining the building blocks of a database.
The Employee and Manager entity types can both be represented in the same table, as employee and manager entities have the same attributes, and manager entities are also employee entities; in the business context, a manager is an employee. In the database, self-joins are used to join these entities.
Composite attributes are used for display purposes. Therefore, at the table level, the sub-attributes are stored, and they are composed into one attribute at the view level. For example, at the table level the atomic columns are Title, FirstName, MiddleName, and LastName, and at the view level there is a concatenated derived column called Full Name that includes all four columns.
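A hedged sketch of this rule (the employee table and its columns are illustrative); concat_ws skips NULL sub-attributes, so a missing middle name does not break the composite value:

CREATE TABLE employee
(
    employee_id integer PRIMARY KEY,
    title       character varying(10),
    first_name  character varying(45) NOT NULL,
    middle_name character varying(45),
    last_name   character varying(45) NOT NULL
);

CREATE VIEW employee_full_name AS
SELECT employee_id,
       concat_ws(' ', title, first_name, middle_name, last_name) AS full_name
FROM employee;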
Summary
In this chapter, we looked at Entities and Entity Types and the process of identifying them. We looked at the different types of entities and their relationships with other entities. Entities have attributes, and we looked at how to identify those attributes. We also discussed naming conventions for Entities and Attributes. After identifying the entities and attributes, the next important task is to assemble them into building blocks, which we discussed with three rules.
After identifying the Entities and their Attributes, the next step is to plan the design of the database. In the next chapter, planning a database design is discussed, along with the challenges in planning.
Exercise
The following screenshot is a bill from a mobile phone company. Identify the Entities and Attributes that could be used to implement a database design. You will learn more about relationships in the upcoming chapters.
Let us look at the common questions that you will mostly come across in interviews.
Questions
What is the difference between UNION and UNION ALL?
UNION will provide instances of multiple entities or entity sets without any
duplicates. In case there are duplicates, UNION provides distinct values in the output. This means that UNION should be used for Entities and Entity Sets that have intersecting instances. However, since UNION has to check for common instances, it can cause a performance issue. If you are sure that the Entities or Entity Sets do not overlap, UNION ALL should be used instead of UNION. UNION ALL does not check for intersecting values; instead, it simply combines the multiple entities or entity sets without removing duplicates. If you use UNION ALL for overlapping Entities or Entity Sets, the overlapping instances will be duplicated. However, UNION ALL has a performance advantage over UNION, as it does not need to check for duplicate values.
Why are views used to define composite attributes of entities instead of defining them in the table?
Composite attributes are used for display purposes, but they are much easier to update when they are maintained at the sub-attribute level. Therefore, at the table level, the sub-attributes are stored, and they are composed into one attribute at the view level. For example, at the table level the atomic columns are Title, FirstName, MiddleName, and LastName, and at the view level there is a concatenated derived column called Full Name that includes all four columns.
When forming composite attributes, what action should you take if one of the source attributes is NULL?
Concatenating with a NULL value makes the whole result NULL, so each source column should be wrapped with COALESCE, or a NULL-safe function such as concat_ws, which skips NULL arguments, should be used, as shown in the following code snippet:
SELECT concat_ws(' ', Title, First_Name, Middle_Name, Last_Name) AS Full_Name;
Why are derived attributes stored instead of being calculated at run time?
With the derived attribute option, storage space is the concern. The advantage
of this option is that less computation power is needed. However, in the
modern era of computing, storage is not a major concern due to the rapid
drop in storage costs.
3
Planning a Database Design
Database design is not something that you can start straight away. Like any design work
such as building a house or manufacturing a car, you need thorough planning for a
database as well. In the case of building a house, you need to plan for the land where you
are building your house. Then you need to plan the house depending on your requirements
and your budget. It is the same for a database: you need to understand your
environment and your users' requirements. Just as you plan a house for the future, you
need to plan a database for the future as well.
It is important to understand what you are designing for, who your
intended users are, and what your limitations are. Also, every database is unique and
different from the others. Therefore, proper planning is required for database design.
Depending on the system, the weightage of each parameter may be different. If you intend
to design a database system for a core banking system, almost every parameter should be
considered very carefully. However, if you are designing a database for your inter-company
DVD club, then performance and availability may not be significant factors. Similarly, for a
data warehousing system, you might not be looking at availability with high concern, but
integrity will be an essential parameter. This means that, depending on the database that
you design, you need to evaluate the importance of the parameters carefully. Let us look at
these parameters and how they should be addressed at the design stage.
Usability
Concepts of relational databases are introduced in order to improve the usability of data.
We covered relational databases in Chapter 1: Overview of PostgreSQL Relational
Databases. That doesn't imply that just because you have a relational database, you have
usability. However, when you are designing the database, you need to ensure that
your database objects are storing meaningful data that can be used by the end-users or
applications. To guarantee that the database objects are storing meaningful data, it is
essential that all required data is gathered during the process of requirement gathering.
Another aspect of data usability is that the stored data can be converted into meaningful data
without major, complex queries. In relational databases, data is stored in different objects.
When a user requests a data set, it might be necessary to combine one or many tables. In the
case of multiple tables, it has to be simple to combine them. If usability is maintained in the
database design, combining them will not be a hard task.
In PostgreSQL, Views and Stored procedures are used for improving the usability of the
database. Though you can define views and stored procedures later, it is essential to
identify the requirement at the planning stage so that you don't run into last-minute issues.
Extensibility
Business processes are not static but dynamic. When designing a database, you should make
sure that the design can withstand future changes. It is highly
unlikely that you won't have to make any modifications to your database; however, it is not
viable to change your database on a daily basis or very frequently. If the
original database is very complex and not properly organized, then you will need to change
the database design frequently. At the planning stage, you need to predict future changes
to the database.
There are a few options that should be considered to ensure the extensibility of your
database. The generalization of entities is one way to attain extensibility in databases.
For example, if you have employees in different departments, rather than creating tables
department-wise (Employee_Accounts, Employee_HR, Employee_IT), it is advisable to
create a single Employee table with a department attribute. Otherwise, whenever a new
department is introduced, you would need to create a new table; with generalization, it is
only a matter of adding a record to the department table. On the other hand, if there are
different types of employees, such as permanent, temporary, executive, and so on, it is better
to define them in separate tables simply because each type has different parameters.
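A minimal sketch of this generalization, assuming hypothetical department and employee tables, could look as follows; a new department then becomes a new row rather than a new table:

CREATE TABLE department (
    department_id serial PRIMARY KEY,
    name          varchar(50) NOT NULL
);

CREATE TABLE employee (
    employee_id   serial PRIMARY KEY,
    first_name    varchar(50) NOT NULL,
    last_name     varchar(50) NOT NULL,
    department_id integer NOT NULL REFERENCES department (department_id)
);

-- Adding a department is a data change, not a schema change
INSERT INTO department (name) VALUES ('Accounts'), ('HR'), ('IT');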
Integrity
It doesn't matter how large your database is: if your data is garbage, it is unusable. Also, no
one will trust your data and your database will become obsolete. It is essential to identify
what rules you need to implement in the database, and those rules should be identified at
the requirement stage itself.
There are different types of integrity we need to identify during the requirement analysis
phase, as shown in the following screenshot:
The following are the basic implementations of the different integrity types, and it is essential to
identify them at the planning stage:
Entity Integrity: This concerns the structure of the tables. Typically, Primary
Keys and Unique keys are used to achieve Entity Integrity.
Domain Integrity: This concerns the attributes of the entities, or the columns in the
table. Nullability, data type, data format, and data length are the mechanisms
used to achieve Domain Integrity. When the Name of a Person is
defined, it should typically be a string data type with a length of 50. For a phone
number, depending on the environment, you can define a format such as
###-######-##.
Referential Integrity (or Foreign Key Constraint): This was discussed in Chapter
1, Overview of PostgreSQL Relational Databases, where entity dependency is
maintained. For example, the Employee entity has a dependency on the
Department entity. This means that there cannot be an employee without a
department, or with an invalid department. Also, you are unable to delete
departments that have related records in the employee table.
Transactional Integrity: A transaction is one set of operations that is
considered a single unit. It might insert, update, or delete multiple
tables or records; however, from the transaction point of view, it should be one
unit. The Atomicity, Consistency, Isolation, and Durability (ACID) theory was
introduced to maintain transaction integrity. Transaction integrity will be
discussed in detail in Chapter 9, Designing a Database with Transactions.
Custom Integrity: Every organization has its own integrity rules. Salary should
not be higher than 50,000, or age should not be less than 18, and so on, are some
of those custom rules. This type of integrity is implemented using CHECK
constraints, which were discussed in Chapter 1, Overview of PostgreSQL Relational
Databases.
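The following is a minimal sketch of how these integrity types can map to PostgreSQL constraints, using a hypothetical person table:

CREATE TABLE person (
    person_id    serial PRIMARY KEY,                               -- entity integrity
    name         varchar(50) NOT NULL,                             -- domain integrity
    phone_number varchar(13)
        CHECK (phone_number ~ '^[0-9]{3}-[0-9]{6}-[0-9]{2}$'),     -- domain integrity (format)
    age          integer CHECK (age >= 18),                        -- custom integrity
    salary       numeric(10,2) CHECK (salary <= 50000)             -- custom integrity
);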
Performance
Databases tend to hold a large volume of data over time. Sometimes this may be in the
order of multiple terabytes. When dealing with this much data, it is essential to retrieve it
in an efficient way. A database should be designed in a way that improves both
data retrieval and data insertion.
Apart from table design, there are indexes that can be introduced in order to improve
data selection from the database.
Clustered Indexes
Non-Clustered Indexes
Bitmap Indexes
Include Indexes
Filtered Indexes
Column Store Indexes
These indexes are used in different situations and in different combinations. We will
look at the details of indexes in Chapter 8, Working with Indexes.
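As an illustration, a minimal sketch of creating indexes in PostgreSQL is shown below, assuming a hypothetical employee table with last_name, department_id, and is_active columns; the second statement is a partial index, PostgreSQL's closest equivalent to a filtered index:

-- A B-tree index to speed up searches on last_name
CREATE INDEX idx_employee_last_name ON employee (last_name);

-- A partial index that only covers active employees
CREATE INDEX idx_employee_active_department ON employee (department_id)
    WHERE is_active = true;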
Availability
Availability means keeping the data available to the end-users for the maximum possible
time. Many of us would think that availability is achieved via hardware, and hardware only.
With hardware, when a system is not available, a different set of hardware at a different
geolocation can be used so that users can keep operating. However, there can be cases where
a costly hardware solution is not feasible due to budget constraints. Hence, rather than
using a costly hardware solution, you can design your database to achieve some level of high
availability.
This might not be a 100% availability option, but it will provide you with breathing space so
that clients can perform some of their tasks to some extent until you bring the system back to
normal.
Depending on your needs and by considering the cost factor, there are different types of
High Availability options for you.
Also, in some database systems, large columns that are more than 1 GB cannot be
replicated. This means that you need to understand what you are going to replicate before
you start the database design.
There are two challenges with the sharding design. Since you do not have all the clients'
data in one database, it will be challenging to generate combined reports, as there is no single
database that holds all the data. Also, there are cases where clients need to move between
shards; in that situation, there should be an automated mechanism to transfer data between
shards.
When designing databases for high availability, it is understood that the approach will be
difficult to change later. Because of this, it is essential to identify what
availability the customer needs. If you have to change the approach at a later stage, it will
require unnecessary cost and effort.
Security
Security is a key concept in any information system. In a database system, there are three
ways to achieve security, as follows:
Authentication: Authentication is identifying who the user is. There are multiple
ways to identify the user and the popular technique is user name and password.
When the requirement gathering is done, it is essential to capture the requirements from the
perspective of these parameters. Sometimes, your client might not have enough
information and knowledge to provide this. However, it is always better to get this
information during the planning phase, as changing it at a later stage will be hugely costly
for both you and your client. More importantly, it will hugely impact the project timeline as
well. For example, changing the database to adapt to a different high availability mode
would be very difficult, as different high availability methods require different database
designs.
After understanding database design principles, we will next look at the types of
database schemas you need to work with.
The following table shows the major differences between OLTP and OLAP:
OLTP | OLAP
Stores current data | Stores historical data
Stores detailed data | Mainly summarized data
Dynamic data, lots of updates and deletes | Mostly static data
Transaction-driven | Analysis-driven
Application-oriented | Subject-oriented
Supports day-to-day operations | Supports strategic decisions
Defined queries | Ad-hoc and unstructured queries
Highly normalized data structures | Denormalized data structures
More relationships between tables | Fewer relationships between tables
High number of users | Fewer users
If you are designing a database for OLTP, it is essential to keep the above properties in mind.
For example, OLTP schemas should typically be highly normalized, and an OLTP system
should support a high number of users.
This means that you need to understand whether you are designing a database for OLTP
purposes or OLAP purposes. Sometimes, there can be a mix of OLTP and OLAP in the
database. For example, there can be some limited analytical reports in an OLTP system. In
that type of scenario, to decide whether the system is OLTP or OLAP, the system mandate
should be identified. If it is mainly towards OLTP, then the design should be done by
considering the OLTP and vice-versa.
For any design, there are multiple approaches, and database design is no exception.
In the bottom-up method, database designers will inspect all the system user interfaces,
such as reports, entry screens, and forms. The designers will then work backward through
the system to determine what attributes should be stored in the database. The bottom-up
design method is most suitable for less complex and small database systems. This approach
becomes extremely difficult when you have a large number of attributes and entities.
In the top-down method, the designer starts with a basic idea of what is required for the
system, and the requirements are gathered from the end-users in collaboration with them,
based on what data they need to store in the database. A detailed understanding of the
system is required when the top-down method is used. For a large and complex database
system, the top-down method is more suitable, as it can be used to identify the attributes at
an early stage.
De-centralized design
In the de-centralized design approach, different users' perspectives are maintained
separately. Different user requirements are gathered separately into local data models, which
are then merged into a global data model. This approach is well suited for complex systems
and for systems that operate in different geographical locations. The de-centralized database
design should be used when there are a large number of database objects and the database
has complex requirements. When there is a lot of disagreement between users about the
requirements, the de-centralized design approach is much preferred.
Though there is no hard rule on which approach to use, it always depends on the use case.
Every use case has its own unique properties and different environments. Also, it is
possible that, rather than using a single approach, a combination of approaches or a
hybrid approach will be used.
Now that we have looked into the various approaches, let us look into data modeling as it
is an important factor in database design in various aspects such as Quality, Cost, and Time
to Market, and so on.
Quality
Data models enable you to define the problem and provide you with multiple avenues for
the next stage. On average, about seventy percent of software development efforts fail, and
a major reason for failure is premature coding without building a data model. When you
have the model in your hand, you are clear about what to do, and the next stage becomes
easier.
Cost
The data model promotes clarity and provides the groundwork for generating much of the
needed database and programming code. Typically, data modeling is estimated at
ten to fifteen percent of the entire project cost. However, it has the potential to reduce
programming costs by sixty to seventy percent. Data modeling catches errors at an early
stage, when they are easy to fix. It is always better to fix them at an early stage than at a
later stage, and more importantly, earlier than when the software is in the customer's hands.
The data model is reusable during future development and maintenance. This helps to
reduce the project cost during the future development and maintenance phases of the
project.
Time to market
Having a sound data model at the start of the project can avoid unnecessary troubles which
you will face during and after the project. When unseen issues occur during the development
phase, it is more likely that the project will run into chaos. This, in turn, will definitely
impact project timelines.
Also, no system operates in isolation. For example, one database system might be
linked with other systems to extract data, as in the case of data warehousing. If the data
model is present at that time, it is much easier to integrate with other systems and to deploy
the solution to production. This means that having a proper data model not only reduces
the time to market of the current database but also of the other systems which depend on
this database system.
Scope
There is always a gap between the customer and the technical team: the gap between
what you want and what you will get. The data model document provides a mechanism
to bridge this gap. The data model defines the scope of the data, which helps to
start the dialogue between the customer and the developers. Business staff can visualize what
the developers are building and compare it with their own understanding from the data
modeling.
A data model also promotes agreement on vocabulary and technical jargon. The data model
highlights the chosen terms so that they can be driven forward. If there are any changes
required at this stage, both parties can agree on that at this level.
Performance
A proper data model makes database tuning much easier. As a rule of thumb, a
properly designed database typically performs better. Database performance plays a key role
in achieving optimal system performance. With the presence of a data model, it is
much easier to translate the model into a database design. This means that by using a
proper model, database performance will be improved.
On the other hand, data modeling provides a means to understand a database. This means
that a database developer is able to tune the database for fast performance by
understanding the data model.
Documentation
No matter how small or large your system is, documentation is very important. Even if it is a
small system now, you never know how large it will become. Therefore, even though you
are starting at a tiny size, it is important to document. Also, there is technical jargon that
will need to be communicated to the customers. Both parties can agree on naming
conventions with respect to database objects. For this, the database model is used as a
communication tool, as the data model standardizes the names and properties of data
elements and their entities.
Risk
Risk analysis is a key task in any project. It is far better to identify the risk at the early stage
to mitigate future risks. The size of the data model will be a good measure of the project
risk. Since the data model provides a basic idea about the programming concepts and
project effort, it can be used as a risk measuring tool.
It is essential to understand the role of the database in a complete user system which is
discussed in the following section.
Role of databases
The database has a mandate to maintain data in storage and provide it to the end-users and
end-user application interfaces when the need arises. Depending on the need,
there can be one or multiple databases for the system, which can span multiple
database servers to suit the business needs.
The following screenshot shows how databases are listed in the PostgreSQL server when it
is accessed from pgAdmin:
As shown in the above screenshot, there are HR, Invoice, SampleDB, and postgres
databases in the system, and you can have many databases in PostgreSQL. There are
instances where a server contains over 10,000 databases per PostgreSQL instance. Also, there
is no limitation on the database size, and there are PostgreSQL databases larger
than 6 TB.
Storing Data
The main role of the database is to store data. Data has its own relationships between
database objects. In the database, data is stored by means of tables. The list of tables in the
sample database is shown in the following screenshot:
There is no limitation on the number of tables in a database, and a table can have any
number of records. A table can contain up to 16 TB of data.
The following table contains a sample data set from the actor table in the sample database,
shown in row-column format:
Every organization has a large volume of data, and databases are mandated to store that
data. By using tables, columns, and rows, a large volume of data can be methodically
stored in a database. INSERT, UPDATE, and DELETE are the common SQL commands used to
store and modify data in relational databases.
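The following is a minimal sketch of these commands, assuming the actor table shown above has first_name and last_name columns and a system-generated key:

-- Add a new row
INSERT INTO actor (first_name, last_name) VALUES ('PENELOPE', 'GUINESS');

-- Change an existing row
UPDATE actor SET last_name = 'GUINNESS' WHERE first_name = 'PENELOPE';

-- Remove the row again
DELETE FROM actor WHERE first_name = 'PENELOPE';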
The database also plays a key role when multiple users are accessing the same table. The
database maintains the ACID properties so that transactions are handled correctly. Without a
database, transactions could not be handled in a user-friendly manner.
Access Data
Accessing data is an important capability of a database. A database is useless if the large
volume of data that is stored cannot be accessed. Also, when accessing data, it should be
available in a user-readable manner, and retrieval should not be time-consuming. As we
discussed before, data is stored in multiple tables. For the end-users, these tables should be
joined by their relationships to present data meaningfully. Though there are a large number
of columns and rows, users may not need all of them, so there should be a way to retrieve
data with filtering. Users may also need aggregations of data such as SUM, AVERAGE,
MINIMUM, MAXIMUM, and many more, among various other functionalities.
An important feature of a relational database is fast access to data. It doesn't matter how
large the data volume is, users should be able to access the desired data in a timely manner.
Providing users' data efficiently is another key role of a database.
In PostgreSQL databases, there are three ways of accessing the data: functions, stored
procedures, and views.
The following screenshot shows the list of views in the sample database:
In a database view, multiple tables can be joined to form meaningful data.
The following code block shows the definition of the actor_info view, in which you can see
multiple tables joined together with a few functions:
SELECT a.actor_id,
a.first_name,
a.last_name,
group_concat(DISTINCT (c.name::text || ': '::text) || (( SELECT
group_concat(f.title::text) AS group_concat
FROM film f
JOIN film_category fc_1 ON f.film_id = fc_1.film_id
JOIN film_actor fa_1 ON f.film_id = fa_1.film_id
WHERE fc_1.category_id = c.category_id AND fa_1.actor_id = a.actor_id
GROUP BY fa_1.actor_id))) AS film_info
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
LEFT JOIN film_category fc ON fa.film_id = fc.film_id
LEFT JOIN category c ON fc.category_id = c.category_id
GROUP BY a.actor_id, a.first_name, a.last_name;
Apart from views and functions, stored procedures can also be used as data access
mechanisms in a PostgreSQL database. Both functions and procedures accept
parameters. In the following screenshot, we can see the list of functions in the PostgreSQL
sample database:
In the following screenshot, we can see the definition of the film_in_stock function
which has three parameters.
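As a simple illustration of a parameterized function, the following is a minimal, hypothetical sketch (not the actual film_in_stock definition from the sample database):

-- A scalar SQL function with two input parameters
CREATE FUNCTION add_tax(p_amount numeric, p_rate numeric)
RETURNS numeric AS $$
    SELECT p_amount * (1 + p_rate);
$$ LANGUAGE sql;

-- Call the function with two arguments
SELECT add_tax(100, 0.15);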
Secure Data
A database will be accessed by multiple users who have different roles in the organization.
Therefore, the database has the role of maintaining the data securely. Security can be
achieved via authentication, authorization, and encryption in the database technology.
Every database poses different challenges when it comes to database design. However, it is
better to consider the common challenges that database designers will come across, which
are discussed in the following sections.
Data security
Within a couple of years, more than 100,000 systems were compromised simply because their
databases had been completely exposed to the public internet. In many instances, the
database is the main or the first culprit in a compromise. The challenge in database
design with respect to security is that database security can be implemented with
various options.
Data security comes with three aspects of the database. These are:
Authentication
Authorization
Encryption
Also, at the time of database design, the designer has to plan for security in the database.
There can also be a need to plan for user groups, where multiple users are placed in groups
and these groups are given authorization options. Having groups will ease the
configuration at a later stage. However, it is important to note that by introducing
user groups, conflicts can occur. For example, a single user can be in multiple groups
where those groups have conflicting permissions. All of these complexities
raise challenges for database designers.
Apart from authorization, encryption is also challenging. Though you can enable
encryption for the entire database, it will cause performance and maintenance issues.
Therefore, at design time, the level and extent of encryption need to be clearly identified.
It will be a challenging task to identify what should be encrypted at the time of design
planning.
To the relief of database designers, most Relational Database Management Systems
(RDBMS) support different types of security implementations. Since the database designer
has to decide on what database technology will be used, it is essential to understand what
security level the user requires and what capabilities the tool has.
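The following is a minimal sketch of group-style roles and grants in PostgreSQL, assuming hypothetical role and table names:

-- A group role and an individual user who belongs to it
CREATE ROLE reporting_users NOLOGIN;
CREATE ROLE alice LOGIN PASSWORD 'change_me';
GRANT reporting_users TO alice;

-- Authorize the whole group on a table
GRANT SELECT ON invoice TO reporting_users;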
Performance
Most database designers do not consider the performance aspect of the database.
Beyond that, database designers have the challenge of visualizing or
predicting the performance of the system in the future. Often there is miscommunication,
or no communication, between the development team and the database team; hence, the
database team does not know which queries are frequently run from the application end.
This makes it a challenging task for database designers to apply the correct indexes to the
database. Database designers cannot apply indexes to all the columns, as having a lot of
indexes will reduce performance.
Indexes can at least be applied at a later stage, in production. However, there are table
design changes, such as denormalization, that improve select performance. These need to be
done at design time, but to do this, the database designer needs to understand the
performance needs, which is a challenging task.
Another performance challenge is that designers are unable to predict the data
volume in the future. With the increase in data volume, database performance will be
impacted negatively. Since database designers have difficulty predicting data
growth, planning the performance of the database becomes challenging.
Data accuracy
The database will not be useful unless your data is clean. It is the database designer's
challenge to visualize or predict what users will enter. Also, the database designer has to
understand what level of transaction support is needed in order to maintain the ACID
properties of a transaction; a particular challenge is implementing transactions in distributed
environments.
Also, the database has other constraints such as Foreign key, Check, and Unique constraints.
As a database designer, it is important to identify what level of constraints should be
implemented, and the designer has to identify these constraints at design time. It will be
challenging to enable these constraints after invalid data has already been saved in the
database. Therefore, the database designer has the challenge of identifying the constraints at
a very early stage so that the purity of the data is achieved, which will lead to the users'
trust in the data in the database.
High availability
As stressed in several places, high availability is something that is often neglected by
database designers. The main reason for this negligence is that many are of the view that
high availability can be achieved once the database is deployed to production, and that it is
the duty of the database administrators, not the designer's responsibility. Though there
won't be many difficulties for a database designer when there is a plan to implement full
availability via infrastructure, there will be a great challenge when it comes to
implementing partial availability.
The challenge of designing availability for a database comes with defining the boundaries of
high availability. The designer has the option of defining high availability function-wise
or by business objects, such as Customer, Project, and so on. The database designer has the
challenge of identifying the partition object at the start of the database design phase,
because it is very difficult to change the partition object at a later stage.
Another challenge with high availability is the existence of multiple databases. When
multiple databases exist, there is the further challenge of integrating data between
databases and maintaining data quality. The database designer has to implement a
mechanism for data integration, as default integrity mechanisms such as foreign key
constraints cannot be implemented across databases.
Another challenge is moving data between these physical partitions. For example, let us say
you have decided to design a database where customers are the unit of high
availability. This means that your high-importance customers' data will be in one database
and other customers' data will be in one or many other databases. Over time, customers
might need to move between these partitions. As a database designer, you need to define a
mechanism to move data between these partitions without impacting the current business,
or at least with minimum impact.
Summary
In this chapter, the planning of a database was discussed from different perspectives. We
saw that database design principles should be aligned with high availability,
integrity, security, extensibility, and performance.
We identified that there are two popular types of database schema, which are OLAP
and OLTP. OLTP is mainly used for transactional systems while OLAP is used for
analytical and reporting systems. With this analysis, we know what type of database we
will be designing. During the planning stage of a database system, it is necessary to identify
whether it is OLAP or OLTP so that you can make decisions on the database design. There
are database design approaches which are the bottom-up and top-down approaches. Further,
we looked at a couple of other approaches to designing databases, which are the centralized
and de-centralized approaches. The database has a role in every system: mainly, databases
are used for storing and retrieving data in a secure way. In the case of PostgreSQL databases, tables,
views, functions, and stored procedures are used for these actions. There are major
database design challenges, such as security, performance, high availability, and data
accuracy.
As a database designer, you need to plan for these challenges. We have identified the common
challenges that database designers will encounter. It is essential to plan the database design
rather than start the design straight away, to avoid unnecessary chaos in the stages of design
and development, and also when the system is in the hands of the customer. Data security,
data accuracy, and high availability are the main challenges we identified.
In the next chapter, we will discuss different data models to provide an understanding of
these models and how they can be built.
Questions
Why is it important to understand the high availability option for the database, at
the database planning stage?
Why is it important to decide whether the database that you are designing is for
OLTP or OLAP?
The mandates for OLTP and OLAP are different. OLTP designs are mainly for
systems that support day-to-day transactions. In these types of systems, reads
and writes are almost equal and transaction times are very short. Also, in
OLTP most of the transactions are pre-defined. For example, in a customer
order creation, you will enter a date, customer code, and product items, and
finally the discount and taxes. This shows that in OLTP, transactions are
predominantly pre-defined. However, when it comes to OLAP, one user can
start with a product-wise analysis while another user will start with a
monthly, customer-wise analysis. You cannot set rules for analysis; due to
this, OLAP analysis is more ad-hoc in nature. Since OLAP predominantly
caters to the analysis of data, most, if not 95%, of the transactions in an OLAP
system are retrievals. For a database which requires more reads, it is better to
have a smaller number of tables. If you have a large number of tables, you
need to join multiple tables to retrieve information, and joining multiple
tables requires a high cost in resources. However, when there are fewer
tables, it doesn't consume as many resources. This means that though
normalized structures are suited for OLTP systems, for OLAP, de-normalized
structures are preferred.
How do you plan for performance improvement during the database designing
stage?
4
Representation of Data Models
After the planning for the database was completed in the previous chapter, we now need to
look at how database models are represented. Most database designs do not start in a
database technology itself; if you start a database design directly with the database
technology, you will run into a lot of rework issues. Therefore, a database
should be designed with a proper process, following a scientific method.
The database design will have multiple processes and stages. Initially, we will look at
building the conceptual model and how the conceptual model is verified.
Next, we will examine how the semantic data model is designed. Then we will look
briefly at the physical data model, as there is a separate chapter on defining the
physical data model.
Apart from building these models, we will look at how to verify them.
Further, we will look at how the necessary documentation is done for each model.
The database design methodology contains multiple phases, in which each phase contains
several steps and milestones to achieve. Because of this phased methodology, project
managers can plan the activities, and milestones can be tracked effectively. With this
approach, the database can be designed in a standardized and well-defined manner.
Let us see what the critical factors in database design are in the following section.
There are several critical success factors in database design, which are listed below.
1. Continuous interaction with all stakeholders: You need to understand that the
ultimate beneficiaries of the database are the users of the system. Therefore, it is
always important to keep interacting with them. Also, there are cases where some
stakeholders, such as database administrators and network administrators, are ignored,
as many are of the view that those stakeholders do not have any impact on the
database design. As we discussed in the previous chapters, network and
database administrators also have an impact on the database design.
Let us consider the three main types of database design: conceptual database design,
logical database design, and physical database design.
Let us look at how conceptual database design is done in the next section.
In the Conceptual Design phase, we will start with Local Conceptual Design in the next
section.
Let us assume that you are assigned to build a database for a supermarket billing system.
There will be three user views for this scenario.
Let us see the differences between the different sub-user views, as shown in the following
table:
During this chapter, we will build the conceptual model for the Customer.
Let us see how entity types are identified from nouns or noun-phrases in the following
section.
Another way of identifying entities is by checking for existing objects, which is discussed in
the following section.
Objects
An alternative way of identifying Entity Types is to look for the objects of existence in the
scenario. If you look at the previous example, we know for a fact that in a supermarket,
there are objects such as Staff, Customer, and so on. Database designers' experience plays a
key role when identifying Entity Types with this method.
The requirement document may not be very clear. As you know, the requirement
document will be written in free-form text; therefore, it can be vague and it will not be easy
to find the relevant entity types. The requirement document contains views from various
users, and it will be challenging to identify the entity types from it.
It becomes even more challenging when there are synonyms and homonyms in the requirement
document. Synonyms are words that have the same meaning. Employee and Staff are
often both used, and it needs to be identified that they have the same meaning; therefore,
Employee and Staff shouldn't be two different Entity Types but a single Entity Type. Client
and Customer is another example of a synonym, as are Item and Product.
A homonym is the same word with different meanings. In this case, it has to become two
different Entity Types with different names.
The user requirement document will always have acronyms that are often used by users.
GRNs and POs are generally used by end-users to specify Goods Received Notes and
Purchase Orders, respectively.
When identifying Entity Types using the object method, a lot depends on personal
judgment and personal experience.
Documentation
Documentation is an important factor in every phase of the database design process. When
Entity Types are identified, it is important to document them.
Entity Type | Description | Similar Names
Supplier | Suppliers are those who supply products and services to the supermarket. There are retail and wholesale suppliers registered with the supermarket. | Contractor, Dealer, Provider
Point | When a customer buys products, points are earned which can be redeemed. Also, special points are received for selected products during promotions. |
Sales | Customers who get products and services from the supermarket through Orders and Invoices. | Orders, Invoices
Purchases | Getting products and services from the suppliers to offer to customers. | Purchase Orders (PO), Goods Received Notes (GRN)
From the table above, we can see that all the entity types are listed along with similar terms
and a description of each entity type. This should be a continuously maintained document
from which any stakeholder can get information at any time.
Let us see the process of identifying the relationship types in the following section.
Just as we used nouns and noun phrases to identify the entity types, verbs can be used to
identify relationship types. However, in most cases, the requirement document does not
explicitly mention the verbs, and it is the designer's job to identify the verbs that are
implicitly mentioned and derive the relationships from them.
Binary Relationships
Most of the time, entity types have a binary relationship, which means that the relationship
involves two entity types, or the degree of the relationship type is two. Let us look at the
requirement of Customer Buys Product.
There can be scenarios where multiple relationships exist between two entity types. For
example, a Customer will earn points when he/she buys a product or a service, and he/she
also has the option of redeeming earned points, subject to some rules.
In the above model, the Customer and Product Entity Types have a relationship named Buys.
Complex Relationships
Though there are simple binary relationships, there can be cases where a relationship has a
degree of three or more.
When a customer buys a product, it will be bought through an invoice. This can be
represented in the following figure:
As shown in the above screenshot, Customer, Product and Invoice entities are related by
means of the relationship called Buys.
Recursive Relationships
Recursive relationships are relationships where the same entity type participates more than
once. For example, within the same Staff entity type, there are managers and subordinates
who report to the managers.
However, both of those employees belong to the same entity type, as shown below:
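Alongside the diagram, the following is a minimal sketch of how such a recursive relationship could be captured in a hypothetical staff table, where the manager is simply another staff row:

CREATE TABLE staff (
    staff_id   serial PRIMARY KEY,
    name       varchar(50) NOT NULL,
    manager_id integer REFERENCES staff (staff_id)  -- self-referencing foreign key
);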
In the next section, we will identify the attributes associated with and relevant to Entities or
relationship types.
Let us see what are the possible attributes for the Employee entity type below:
EmployeeNo
Name (composite: Title, First Name, Middle Name, Last Name)
Address (composite: Address I, Address II, City, PostCode)
Designation
Previous Designations (multi-valued)
Manager (composite: Title, First Name, Middle Name, Last Name)
Date of Birth
Age (derived: date of Birth)
Gender
Above are the basic attributes for the Staff Entity Type.
Now let us look at the possible attributes for the relationships that are relevant to the Staff
Entity Type.
For example, let us say a staff member joins the supermarket and is later promoted to a
designation. When you examine these two relationships, there are attributes relevant to the
relationships themselves: for the join relationship, there will be a Join Date, and for the
promotion relationship, there will be a Promotion Date. For previous promotions, there
will be promotion dates as well.
When identifying attributes it is necessary to identify a few other details of those attributes
as well:
Identifying attributes will present a few obstacles, which are discussed in the next section.
Missing Attributes: As you are aware, there are a large number of attributes associated with
entities and relationships. It is not a crime to miss an attribute, but as a database designer, it
is your duty to identify the essential attributes and to allow for adding attributes later. Since
attributes can be identified at different stages, you need to make sure that all the relevant
documents are updated by going backward.
Duplicate Attributes: Sometimes, there can be situations where the same attribute is
associated with multiple objects. For example, the employee joined date is associated with
the Employee entity and it is also an attribute of the join relationship.
Documentation
It is not necessary to emphasize the importance of documentation. It is essential to document
the attribute information as well.
Let us see a simple documentation of the Employee entity in the following table.
Let us see how we can determine candidate and primary keys, in the next section.
When selecting a primary key, the following considerations should be taken into account:
1. Unique: Needless to say, the primary key has to be unique, as the primary key is the
column which will be used to identify the record.
2. Definite: The primary key should not have NULL or Empty values, and there has
to be a value for the primary key.
3. Stable: The Primary key should not change with time. When you are selecting a
Primary Key, you need to select an attribute(s) that are not changing over time.
4. Minimal: The Primary Key should have fewer attributes and less in length. For
example, Customer Name, Customer Addresses are not recommended to be
Primary Key.
5. Factless: The Primary Key should ideally be a factless value. There can be cases
where the primary key is an internal value which does not have any business
meaning.
6. Accessible: The Primary key should be accessible at the time of record creation, it
should not be filled later.
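A minimal sketch of a surrogate key that satisfies these points, using a hypothetical customer table, could look like this:

CREATE TABLE customer (
    customer_id serial PRIMARY KEY,   -- system-generated, stable, minimal, factless
    name        varchar(100) NOT NULL,
    address     varchar(200)
);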
After the model is built, the next step is to validate it. Model validation will be
discussed in the following section.
One of the validation techniques is to validate the local conceptual model against user
transactions. Let us discuss this technique in the following section.
Let us see how to review the local conceptual model with the users in the following section.
If you are unable to perform user transactions against the model, then you need to
rework the model. Most likely, you have made a mistake, or there might have been an error
in the user requirements. This has to be corrected before moving to the next stage. After
finding and correcting the error, the previous steps have to be revisited.
Further, there can be attributes that must always have a value. For example, every employee
should have a department, so empty or null values should not be permitted.
Also, it is important to specify what actions should be taken when data is deleted or updated;
otherwise, a delete or update can bring the data into an inconsistent state.
The following are the available options for such actions in PostgreSQL.
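As an illustration, a minimal sketch of these referential actions on a hypothetical foreign key is shown below; PostgreSQL supports NO ACTION, RESTRICT, CASCADE, SET NULL, and SET DEFAULT:

CREATE TABLE employee (
    employee_id   serial PRIMARY KEY,
    department_id integer REFERENCES department (department_id)
        ON UPDATE CASCADE    -- propagate key changes
        ON DELETE SET NULL   -- orphaned employees keep no department
);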
scenarios. This step is an important task, as it helps to eliminate any gaps
between the business users and the technical designers.
In the case of any issues, you have to go back to the initial step and follow
the steps again. However, it is always better to correct those errors rather
than finding them when the system is deployed to production. When
users are using the system, it will be very difficult to do the error
corrections.
There can be cases where you have only one local model. If only one local logical model is
present, then you are done with the logical model and it is not necessary to proceed to
the other steps. In most scenarios, you will have only one local logical model.
In this step, it is essential to review the entity names and their relationships, and then these
entities should be merged. At the conceptual modeling level, you won't have many
attributes; however, at the logical data modeling level, you have more attributes. Therefore,
merging is more challenging at the logical design level.
1. Merging Entities / Relation with the same names and the same primary keys.
2. Merging Entities / Relation with the same names and the different primary keys.
3. Merging Entities / Relation with the different names and the same primary keys.
4. Merging Entities / Relation with the different names and the different primary
keys.
After the merging of entities, then the foreign keys and other constraints should be merged.
If this step fails even after successful validation of the local logical
data models, it means that there are shortcomings in the merging step.
During validation, the database designers have to review the merging of the
logical models.
Now you are ready to go to the business users with the logical model.
Though these are presented as steps, in practice, database designers will follow an
incremental design at different scales.
Now we will discuss defining the semantic data model in the next section.
In 1981, Hammer and McLeod presented the Semantic Data Model, and this was
extended by Shipman by means of the Functional Data Model. Later, in 1983, the Semantic
Association Model was introduced by Su. All of these attempts aim to build semantics into
the data model.
We know that the Nile River flows through countries such as Ethiopia, Sudan, and Egypt.
Since we know that these countries are in Africa, from these semantics we can conclude that
the Nile River flows through Africa. In similar terms, we can build semantics from
databases as well.
Let us examine how we can implement a semantic data model in a database from the
following screenshot:
Let us see how to define the E-R data model in the following section.
E-R Modeling
We discussed identifying Entities and their attributes in detail in Chapter
2, Building Blocks Simplified. The E-R model is mainly used as a tool to communicate between
technical and non-technical users. The E-R model is identified as a top-down approach to
database design. E-R modeling starts with identifying Entities, their attributes, and
their relationships.
The following screenshot shows the standard symbols for E-R modeling.
In Chapter 2, we identified the different types of attributes for a given entity, Invoice, as
shown in the following screenshot.
In the invoice entity type, there are different types of attributes such as Invoice ID (Primary
Keys), Patient Name (Composite Attribute), Drug (Multi-Valued Attribute), Invoice Date
(Normal Attribute), Patient Age (Derived Attribute).
In a business scenario, there are multiple entities, and these were identified at the
conceptual modeling stage. Use case scenarios, as shown in the screenshot below, can be used
to identify the Entities in the business case.
In the above use case diagram, there are three actors: Patient, Cashier, and Doctor, along with
some simple use cases. By analyzing these use case diagrams, we can identify the Entities;
in this business scenario, the Cashier, Patient, Prescription, Doctor, and Invoice entities can
be identified.
These Entities do not live in isolation; they are related to each other through different
relationships. Entities and their relationships are shown in an E-R diagram. A sample E-R
diagram for the above scenario is shown in the following screenshot.
It is essential to identify the relationships, as they will be a key factor when defining required
and multi-valued attributes. Further, the above E-R diagram indicates the relationships of
entities with other entities. For example, the Cashier entity is related to Prescription, Invoice,
and Patient. In addition, Cashier has a relationship with the Doctor entity via the Prescription
entity.
It is important to be able to read this E-R diagram, as it will become the communication
language between technical and non-technical users. Relationships can be read as shown in
the following screenshot.
This means that by reading the E-R diagram, business users and technical users will be able
to understand a pictorial view of the business case.
Though the E-R model is widely used in the industry, let us look at the problems with the
E-R model.
Fan Traps
Let us discuss this connection trap by means of an example.
The above E-R model explains that each employee belongs to a company and each
company has one or many employees. On the other hand, each company has one or more
divisions and each division belongs to one company. With the above E-R diagram, it is not
possible to find the division of an employee. If you want to know which Division the
employee Simon belongs to, you will not be able to answer. The inability to answer this
question is a result of the fan trap associated with the E-R diagram.
We can resolve the fan trap issue by changing the relationships. If you closely analyze the
above relationships, you will understand that it is a hierarchical structure, as shown in the
below screenshot.
Now, you can modify the E-R diagram to suit the above hierarchical structure.
With the modified E-R diagram, it is now possible to find the Division that each employee
belongs to. Since each division belongs to one company, it is possible to find the company of
the employee as well.
Chasm Traps
A Chasm Trap occurs when it is not possible to find a pathway between entities for some
occurrences of those entities. Let us illustrate this by means of an E-R diagram.
In the above E-R diagram, the Division and Employee entity relationship can be understood
easily as it is trivial. Let us look at the Project to Employee relationship. For each project,
there can be zero or one employee, and for each employee, there can be zero or more projects.
This means there can be projects where no employees are assigned. Because of this optional
participation of the Employee, for the projects where no employees are assigned, it is not
possible to find out which Division each project belongs to. This behavior is called the
Chasm Trap. To avoid the Chasm Trap, it is essential to include a direct relationship between
Division and Project, without going through Employee, as shown in the below screenshot.
The Chasm Trap shows how important it is to find the correct relationship
type, whether it is zero-or-more or one-or-more. This issue will not occur if
the relationship is one-to-many instead of zero-to-many.
When an additional relationship is introduced between Project and Division, there should
not be a conflict when the project is allocated to an employee. For example, let us assume
Project A is assigned to Division A; later, it should only be attached to an employee who is a
member of Division A. If an employee who is a member of a different division is assigned,
it is important to correct the previously assigned Division to the right value.
After the conceptual and logical models are completed, the next step is to implement your
design as a physical data model, which is discussed in the next section.
At this stage, you need experienced resources in the selected database technology. In
addition, the resource person needs to know the features of the database technology that
you want to use. Further, it is recommended to have an understanding of the technology
road map for the chosen database technology.
This is what you see in the database as tables. As there are multiple steps in conceptual
and logical data modeling, there are multiple steps to define the physical data model as
well.
Let us design the physical data model for the following E-R model.
Now we need to define tables for the identified entities. The following are the tables defined
for the above E-R diagram.
In the above relation, or table, CashierID is the Primary Key. FirstName and LastName are
required, whereas the MiddleName column is optional. This is set with the NOT NULL option,
as shown in the above screenshot.
Now let us design the Doctor table, which is much the same as the Cashier table; the only
exception is the addition of the Specialty, ContactNumber, and MainHospital attributes.
The Doctor table can be created from the following script in PostgreSQL.
CREATE TABLE public."Doctor"
(
"DoctorID" serial NOT NULL ,
"FirstName" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"MiddleName" character varying(50) COLLATE pg_catalog."default",
"LastName" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"Specialty" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"ContactNumber" character varying(15) COLLATE pg_catalog."default" NOT
NULL,
"MainHospital" character varying(50) COLLATE pg_catalog."default",
As a database designer, you need to make a choice as to whether you provide additional
tables for the above queries or calculate them when needed.
If you provide additional tables, there can be a data gap or a storage cost.
With the second option, the processing cost is the question you need to answer.
The Gender of the Patient can be considered for an enterprise constraint. You can define Male
and Female as the allowed values for this column.
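A minimal sketch of such a constraint, assuming a hypothetical Patient table with a Gender column, could look as follows:

ALTER TABLE "Patient"
    ADD CONSTRAINT patient_gender_check
    CHECK ("Gender" IN ('Male', 'Female'));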
In addition to distributed databases, tables can be partitioned too. This has to be designed
in the physical data model, and it will be discussed in Chapter 12, Distributed
Databases.
Indexes are a key concept in a database that allows users to access data efficiently. Though
indexes will improve user access queries such as SELECT queries, having a large
number of indexes will slow down INSERT queries. This means that we need to find the
optimal number of indexes for a system, balancing INSERT and SELECT workloads. A detailed
discussion on indexes will take place in Chapter 8, Working with Indexes.
In the patient example mentioned earlier, the Invoice is relevant for the Patient and the Cashier, as well as
for the Doctor. Therefore, it is essential to create multiple views on the same invoice to cater to
the different requirements of different users or actors. These views can be extended to the
security model as well.
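As a rough sketch (the Invoice column names used here are assumptions, not taken from the book's model), actor-specific views could look like this:

-- A doctor-facing view that exposes only clinically relevant columns.
CREATE VIEW public."DoctorInvoiceView" AS
SELECT "InvoiceID", "PatientID", "DoctorID", "InvoiceDate"
FROM public."Invoice";

-- A cashier-facing view that exposes the payment-related columns instead.
CREATE VIEW public."CashierInvoiceView" AS
SELECT "InvoiceID", "PatientID", "CashierID", "InvoiceDate", "TotalAmount"
FROM public."Invoice";

Access to each view can then be granted to the matching role, which is how such views extend into the security model.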
Security can be considered at two levels:
1. System security
2. Data Security
System security covers the access and usage of the database at the system level; this covers
user names, passwords, and so on. Data security covers database objects such as tables, views,
and procedures.
Depending on the mode of implementation of security, there can be three types of security
implementation:
1. Authentication
2. Authorization
3. Encryption
Summary
In this chapter, we looked at how data is modeled. We looked at how the database is
physically designed through stages such as the conceptual and logical data models.
In the conceptual model, the main entity types are identified along with their relationships.
Further, it is extremely important to validate the conceptual model against the user
requirements and data so that problems can be solved at an early stage.
At the logical level, we defined the E-R diagram, which is considered the
communication tool between technical and non-technical people. We identified two
issues that can occur in an E-R diagram, which are the Fan Trap and the Chasm Trap. Further, we
discussed that it is important to identify the relationship type between the entities. As with the
conceptual model, we discussed that it is necessary to build a logical model for local users
and then generate a global model. However, at multiple places, it is necessary to validate
this model so that no issues will occur in production. Normalization techniques
should be followed at this stage, but they will be discussed in the next chapter.
The semantic model is the model that introduces meaning to the database model so that
end-users or application developers can utilize the model much more easily. The physical model
is the model which will finally be implemented, in which you have the normalized data
models, data types, and so on.
Exercise
You have done the exercise given in Chapter 2, Building Blocks Simplified, in which you were
asked to identify Entity Types and their attributes. Here, you can extend
that exercise to develop a:
Questions
How do you make sure that all the attributes are identified at the conceptual
database design stage?
First, you need to choose the candidate columns, out of which one should be
selected as the Primary key. To choose the Primary key, you need to choose the
unique candidate columns. The Primary key should be definite, which means that
it cannot be NULL or empty. The Primary key should be very small in
size; you cannot have lengthy attributes as the Primary key. The selected
Primary key should not change over time, and it should be accessible to the
users.
How do you avoid redundancy in data modeling? What are the differences
between the conceptual and semantic data model?
The semantic data model adds meaning to the conceptual data model.
Before which model do you need to choose the database technology for the
database?
You need to choose the database technology before defining the
physical model. When building the conceptual model or the semantic model,
you do not need to know what the technology is. The conceptual model and the
semantic model are independent of the database technology, whereas the
physical model is directly dependent on the database technology.
What is the purpose of integrity constraints and what are the main types of
constraints?
Constraints are used to maintain consistency in the database so that the data
is clean. Clean data will help to make correct, timely decisions. Therefore, to
make correct decisions, it is essential to implement constraints. The main
types of constraints are UNIQUE, FOREIGN KEY, PRIMARY KEY, CHECK, and
NOT NULL (required).
What is the importance of drawing the E-R diagram for the business case?
For database designers, the E-R model will be helpful to define the physical data model.
What are the issues that can occur when you try to fix the Chasm Trap and how
to fix them?
Until the conceptual and logical model design stage, you are more worried
about the data model than about the technology. Though in some practical
situations you might have the database technology in mind during the conceptual
and logical model design, you need to decide on the
database technology only at the physical data model design phase.
Further Reading
Constraints: https://www.postgresql.org/docs/9.5/ddl-constraints.html
Database Growth Rate: http://www.silota.com/docs/recipes/sql-mom-growth-rate.html
Chapter 5: Applying Normalization
Until now, we have discussed different stages of modeling in order to design databases. In
this modeling, we need to identify a relevant set of relations in the databases. The
technique used to identify these relations is called Normalization. In this chapter, we
will discuss the different levels of normalization in detail, with suitable examples. This
chapter will also help you understand at what stage you stop normalization, and at what
instances you do not have to perform normalization.
History of Normalization
Purpose of Normalization
Determine Functional Dependencies
Steps for Normalization
First Normal Form
Second Normal Form
Third Normalization Form
Boyce-Codd Normal Form
Fourth Normal Form
Fifth Normal Form
Domain-Key Normal Form
De-Normalization Data Structures
Normalization Cheat Sheet
The main purpose of database normalization is to have proper relations between the
entity types. Database normalization organizes the attributes into natural groups. Data
redundancy and operation anomalies are the main issues that will be fixed by performing
database normalization.
Data redundancy is one of the main things that will be addressed by the Normalization
process.
Data Redundancy
The major outcome of the data normalization is reducing the data redundancy. Let us look
at data redundancy through an example.
Operation Anomalies
In databases, there are three main operations, which we call Data Manipulation Language
(DML) operations: Insert, Update, and Delete. By introducing database
normalization, you can avoid insertion, modification, and deletion anomalies.
Insertion Anomalies
Insertion anomalies can occur in multiple ways. Let us look at the previous example of
EmployeeDepartment that we discussed.
In the above example, an insert can occur in multiple ways: there can be a new employee
as well as a new department. They are discussed below:
Modification Anomalies
Modification anomalies will occur when there are data updates. Similar to insertion
anomalies, there are two scenarios where modification anomalies can occur, as follows:
Deletion Anomalies
Deletion anomalies can also occur in three ways. They are discussed below:
There can be instances where only one employee exists for a department. If
you are deleting the last employee of a department, you will lose the
department details, since the Department data is stored with the Employee relation and
there is no separate Department relation. If so, for every delete, you need to verify
whether the employee being deleted is the last employee of the department. If it is the
last employee of the department, the department details should be kept by emptying
the employee data. Also, there can be aggregate data for the department
details.
To overcome this, one solution is to store the number of employees in the
department with the employee relation. As shown in the relation below, an
attribute holding the number of employees in the department is included.
| Employee ID | Employee Name | Age | Designation | Department ID | Department Name | Department Location | Number Of Employees |
|---|---|---|---|---|---|---|---|
| 1 | Simmon Rusule | 32 | Senior Manager | 1 | Administration | Building 1 / 2nd Floor | 3 |
| 2 | Anthony Howard | 34 | HR Manager | 2 | Human Resources | Main Building / 1st Floor | 2 |
| 3 | Eli Thomas | 23 | Assistant Manager | 1 | Administration | Building 1 / 2nd Floor | 3 |
| 4 | Patrick Jones | 45 | Assistant HR Manager | 2 | Human Resources | Main Building / 1st Floor | 2 |
| 5 | John Young | 56 | Supervisor | 1 | Administration | Building 1 / 2nd Floor | 3 |
If you are deleting an employee record, you need to update the number of employees
attribute. If you are deleting Employee ID 2, the number of employees for the HR
department will become 1, which has to be updated in the entire EmployeeDepartment relation.
This is another anomaly that can occur during deletion. In the case of the normalized
model, this will be one attribute in the Department relation, as shown below:
Department {DepartmentID, Department Name, Location, NumberofEmployees}
Department Deletion: If you are deleting a department, you need to empty all
the relevant employees' details before deleting the department. In the case of the
normalized data model, you only have to empty the DepartmentID attribute in
the Employee relation.
Determining functional dependencies between the attributes is an important task for database
normalization. Let us discuss why it is important to determine functional dependencies
and how it can be done.
Let us assume the Employee ID and Designation attributes and the functional relation shown
below:
Every Employee ID has a designation, and one designation can be related to multiple
Employee IDs. When identifying the functional dependencies, it is important to identify all
possible values.
The relationship from Employee ID to Designation is many-to-one: one employee will
have only one designation at a time, while a designation can be shared by many employees.
In other words, {Employee ID} -> {Designation} is a functional dependency, as shown below:
The normalization process is achieved in multiple steps. We will look at the steps of the
Normalization process in the following section.
As you go further into the process of database normalization, the relations become
stronger than before. It is important to note that the first normal form
(1NF) is a must, and all the other normalization forms are optional. However, to avoid the
operational anomalies which were discussed earlier in this chapter, it is essential to
proceed at least to the third normal form (3NF).
Let us look at an example of students and the courses that they have enrolled in. This is the
StudentCourse relation:
| Student Number | Student Name | Gender | Age | Course Number | Course Name | Course Credit | Course Start Date | Course End Date |
|---|---|---|---|---|---|---|---|---|
| 1 | Andy Oliver | Male | 18 | C001 | Introduction to Databases | 2 | 2020/01/01 | 2020/05/31 |
| 2 | Harry Steward | Male | 19 | C001 | Introduction to Databases | 2 | 2020/01/01 | 2020/05/31 |
| 3 | Alex Robert | Male | 18 | C004 | Advanced Database Management | 3 | 2020/06/01 | 2020/09/30 |
| 4 | Rose Elliot | Female | 21 | C001 | Introduction to Databases | 2 | 2020/01/01 | 2020/05/31 |
| 5 | Michel Jacob | Male | 20 | C004 | Advanced Database Management | 3 | 2020/06/01 | 2020/09/30 |
| 6 | Emily Jones | Female | 21 | C004 | Advanced Database Management | 3 | 2020/06/01 | 2020/09/30 |
In the above relation, it was assumed that one student is enrolled in only one
course. This is done in order to simplify the demonstration of the first
normal form (1NF).
The following are the attributes of the above-mentioned unnormalized StudentCourse relation:
StudentCourse {Student Number, Student Name, Gender, Age, Course Number, Course Name, Course Credits, Course Start Date, Course End Date}
The above design is done by using pgModeler. You can download the
demo version from https://pgmodeler.io/. This is an open-source tool
that is compatible with Windows, macOS, and Linux. More details of the
tool can be viewed at https://pgmodeler.io/support/docs
It is very clear that the Student No is the Primary key in this relation.
If you analyze it closely, you will be able to identify the repeated attributes:
Repeated Attributes {Course Number, Course Name, Course Credits, Course Start Date, Course End Date}
In the above model, the Student relation is created with the Primary Key of StudentNo along
with the other student data. For the course information, the StudentCourse relation is created
with the Course Number as the Primary Key. These two relations are related through the
Course Number attribute.
Let us look at the same example which was discussed in the 1NF section, without the
assumption where a student can be enrolled in only one course. This means that in the
following example, a student can be enrolled in multiple (one or more) courses:
| Student Number | Student Name | Gender | Age | Course Number | Course Name | Course Credit | Course Start Date | Course End Date |
|---|---|---|---|---|---|---|---|---|
| 1 | Andy Oliver | Male | 18 | C001 | Introduction to Databases | 2 | 2020/01/01 | 2020/05/31 |
| 2 | Harry Steward | Male | 19 | C001 | Introduction to Databases | 2 | 2020/01/01 | 2020/05/31 |
| 2 | Harry Steward | Male | 19 | C004 | Advanced Database Management | 3 | 2020/05/01 | 2020/07/01 |
| 3 | Alex Robert | Male | 18 | C004 | Advanced Database Management | 3 | 2020/06/01 | 2020/09/30 |
| 4 | Rose Elliot | Female | 21 | C001 | Introduction to Databases | 2 | 2020/01/01 | 2020/05/31 |
| 4 | Rose Elliot | Female | 21 | C004 | Advanced Database Management | 3 | 2020/06/15 | 2020/09/15 |
| 5 | Michel Jacob | Male | 20 | C004 | Advanced Database Management | 3 | 2020/06/01 | 2020/09/30 |
| 6 | Emily Jones | Female | 21 | C004 | Advanced Database Management | 3 | 2020/06/01 | 2020/09/30 |
From the table above, we see that the Student Number and Course Number attributes together
form the composite Primary Key:
{Student Number, Course Number} -> {Course Start Date, Course End Date}
(Primary Key)
{Student Number} -> {Student Name, Gender, Age} (Partial Dependency)
{Course Number} -> {Course Name, Course Credit} (Partial Dependency)
The above screenshot shows the relationships between Student, Course, and Enrolment
relations along with the possible data types.
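As a rough PostgreSQL sketch of these three relations (the data types, lengths, and exact column names are assumptions, not the book's exact choices from the screenshot):

CREATE TABLE "Student" (
    "StudentNumber" integer PRIMARY KEY,
    "StudentName"   character varying(50) NOT NULL,
    "Gender"        character varying(10),
    "Age"           smallint
);

CREATE TABLE "Course" (
    "CourseNumber"  character varying(10) PRIMARY KEY,
    "CourseName"    character varying(50) NOT NULL,
    "CourseCredit"  smallint
);

-- Enrollment resolves the many-to-many relationship and carries the
-- attributes that depend on the whole composite key.
CREATE TABLE "Enrollment" (
    "StudentNumber"   integer REFERENCES "Student",
    "CourseNumber"    character varying(10) REFERENCES "Course",
    "CourseStartDate" date,
    "CourseEndDate"   date,
    PRIMARY KEY ("StudentNumber", "CourseNumber")
);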
Let us look at the next level of Normalization, which is the third normal form (3NF).
| Student Number | Student Name | Gender | Age | Course Number | Course Name | Course Credit | Lecturer Number | Lecturer Name | Course Start Date | Course End Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Alex Robert | Male | 18 | C004 | Advanced Database Management | 3 | L002 | Ryan Thoms | 2020/06/01 | 2020/09/30 |
| 4 | Rose Elliot | Female | 21 | C001 | Introduction to Databases | 2 | L002 | Rose Taylor | 2020/01/01 | 2020/05/31 |
| 4 | Rose Elliot | Female | 21 | C004 | Advanced Database Management | 3 | L002 | Ryan Thoms | 2020/06/15 | 2020/09/15 |
| 5 | Michel Jacob | Male | 20 | C004 | Advanced Database Management | 3 | L002 | Ryan Thoms | 2020/06/01 | 2020/09/30 |
| 6 | Emily Jones | Female | 21 | C004 | Advanced Database Management | 3 | L002 | Ryan Thoms | 2020/06/01 | 2020/09/30 |
| 7 | James Dixon | Male | 24 | C004 | Advanced Database Management | 3 | L002 | Ryan Thoms | 2020/07/01 | 2020/12/31 |
| 7 | James Dixon | Male | 24 | C005 | Database Administration | 4 | L002 | Ryan Thoms | 2020/09/30 | 2021/01/30 |
Transitive dependency is an important concept that will be discussed under the third
normal form.
Transitive Dependency
Let us consider three attributes X, Y and Z in relation R. These three attributes are related in
such a way that X-> Y and Y-> Z. This means that Z is transitively dependent on X through
Y.
In this example, a transitive dependency exists: Student Number -> Lecturer Number
through the attribute Course Number.
Let us see how the third normal form is applied.
Applying 3NF
With the additional attributes, such as Course Credit, the Course relation will be impacted as
follows:
{Course Number} -> {Course Name, Course Credit, Course Lecturer Number, Course Lecturer Name}
When normalization is carried out up to 2NF, the following relations are obtained. The
highlighted attributes are the newly introduced attributes:
Student {Student Number, Student Name, Gender, Age}
Course {Course Number, Course Name, Course Credit, Lecturer Number, Lecturer Name}
Enrollment {Student Number, Course Number, Course Start Date, Course End Date}
The following table consists of data for the Course and Lecturer relations respectively:
| Course Number | Course Name | Course Credit | Lecturer Number |
|---|---|---|---|
| C001 | Introduction to Databases | 2 | L001 |
| C004 | Advanced Database Management | 3 | L002 |
| C005 | Database Administration | 4 | L002 |
The following is the sample data set for Lecturer relation.
Normalizing from First Normal Form to Third Normal Form is the minimum for a database design; the
next levels of normalization forms are optional. Let's discuss the Boyce-Codd Normal Form (BCNF).
For a relation that is already in 3NF, BCNF adds the requirement that the determinant of every
functional dependency must be a super key. In other words, if there is a dependency such as X -> Y,
where X is a non-prime attribute and Y is a prime attribute, the relation violates BCNF because X is not a super key.
Let us look at this through an example and extend the previous student and lecturer example. As
we know, students need to meet lecturers to discuss their study matters. Since lecturers are
dealing with many students, you need to schedule an appointment. The
appointment details can be defined as follows. Let us name this relation
StudentMeetingSchedule:
| Student Number | Meeting Date | Meeting Time | Lecturer Number | Room Number |
|---|---|---|---|---|
| 1 | 2010-Jan-02 | 11:00 | L001 | R001 |
| 2 | 2010-Jan-02 | 13:00 | L001 | R001 |
| 3 | 2010-Jan-02 | 13:00 | L002 | R002 |
| 2 | 2010-Jan-05 | 11:00 | L001 | R002 |
| 4 | 2010-Jan-05 | 11:00 | L001 | R001 |
| 4 | 2010-Jan-05 | 13:00 | L002 | R002 |
The StudentMeetingSchedule relation along with its attributes is shown below:
StudentMeetingSchedule {Student Number, Meeting Date, Meeting Time, Lecturer Number, Room Number}
A student can have only one meeting at a given date and time; therefore, {Student Number,
Meeting Date, Meeting Time} is a candidate key. A lecturer can have only one meeting at a
given date and time; therefore, {Lecturer Number, Meeting Date, Meeting Time} is a
candidate key. On the other hand, a room can host only one meeting at a given date and time;
therefore, {Room Number, Meeting Date, Meeting Time} is a candidate key.
Out of these three candidate keys, {Student Number, Meeting Date, Meeting Time} is chosen as
the primary key for the StudentMeetingSchedule relation.
FD1, FD2, and FD3 satisfy 3NF, but their determinants are not super keys, so the super key
requirement of BCNF is not met. To achieve BCNF, the LecturerRoom relation is introduced, and
the following are the modified relations:
StudentMeetingSchedule {Student Number, Meeting Date, Meeting Time, Lecturer Number}
LecturerRoom {Lecturer Number, Meeting Date, Meeting Time, Room Number}
The following is the modified data set for the StudentMeetingSchedule relation:
If you look at the above tables, it is important to note that these two relations can be joined back
into one table. So, the decision to stop the database modeling at 3NF or progress to the next level
of normalization, BCNF, depends on the significance of the super key relationship.
Multi-Valued Dependency
To have a multi-valued dependency (MVD) in a relation, that relation should have at least
three attributes, which means that an MVD is not possible in a relation with two attributes.
There are two more conditions to satisfy an MVD. Let us assume that there are three attributes
X, Y, and Z, where X is the primary key of a relation R. For a single X value, there will be
multiple values of Y and multiple values of Z, and the Y and Z values are independent of each other.
In the above table, for one record the Sport attribute is empty. It is because
Student Number 2 has three hobbies and 2 sports. In the above design, if
the number of attributes is not the same, there can be empty records.
Applying 4NF
To apply 4NF to a relation, the relation should satisfy BCNF. Apart from that
requirement, the relation should have multi-valued dependencies. If both of
these requirements are met, then 4NF can be applied.
Let us apply the 4NF to the above relation StudentHobbySport by introducing two relations
as follows:
The following table shows the dataset for StudentHobby:
| Student Number | Hobby |
|---|---|
| 1 | Watching TV |
| 1 | Riding Cycle |
| 2 | Watching TV |
| 2 | Movies |
| 2 | Playing Cricket |
| 3 | Growing Flowers |
| 3 | Watching TV |
The following table shows the dataset for StudentSport:
To avoid a large number of tiny relations, common relations can be built with the
combinations of all the values. The following table shows the relation after combining all
the tiny relations:
The next level of normalization form is the Fifth Normal Form (5NF).
To apply 5NF, the data should have achieved 4NF and a JOIN dependency should exist. Next, let
us see what a JOIN dependency is.
JOIN Dependency
If a relation can be regenerated by joining multiple relations, and each of these relations has
a subset of the attributes of the original relation, then the relation has a join dependency.
If relation R has the attributes X, Y, and Z, then R can be divided into the relations R1(X, Y),
R2(X, Z), and R3(Y, Z).
Applying 5NF
Let us look at the following scenario.
If Student (Andy) enrolled in the Subject ( Java)
Subject (Java) is conducted by Lecturer (Smith)
Student (Andy) learning from Lecturer (Smith)
Then Student (Andy) Enrolled in Subject (Java) conducted by Lecturer
(Smith).
The StudentCourse relation is as follows:
| Student | Course |
|---|---|
| Andy | Java |
| Andy | C# |
| Simon | Java |
The CourseLecturer relation is as follows:
| Course | Lecturer |
|---|---|
| Java | Smith |
| C# | Joel |
| Java | Joel |
The StudentLecturer relation is as follows.
| Student | Lecturer |
|---|---|
| Andy | Smith |
| Andy | Joel |
| Simon | Joel |
In 5NF, we have eliminated the join dependency by dividing the relation into multiple
relations.
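As a sketch (the relation and column names follow the tables above and are assumptions about how they would be defined), the original three-column relation can be regenerated by joining the three 5NF relations:

SELECT sc."Student", sc."Course", cl."Lecturer"
FROM   "StudentCourse"   sc
JOIN   "CourseLecturer"  cl ON cl."Course"   = sc."Course"
JOIN   "StudentLecturer" sl ON sl."Student"  = sc."Student"
                           AND sl."Lecturer" = cl."Lecturer";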
When this type of range relation is introduced, equi-joins are not
possible; instead, non-equi (inequality) joins should be used. Though non-equi joins
have a negative impact on query performance, range relations are normally
extremely small relations. This means that performance is not a major
consideration for small range relations. In data warehousing and data analytics
systems, a similar type of Range Dimension is used for detailed analysis of data.
Though there are a lot of cases for the normalization of data, there are cases where de-normalized
structures are still preferred. We will learn about them in the next section.
Data normalization is great for OLTP systems, where transactions are well spread between
read and write operations. However, data analytics systems have far fewer
writes, which means that operational anomalies are not a concern.
So what are the issues with data normalization as far as analytics systems are
concerned? If data is normalized, end-users have to join multiple tables or
relations to retrieve data.
Practical Usage: Most of the analytical systems are self-service which will be
used by business users, not technical users. Therefore, it is essential to keep the
model simple. If the data model is normalized, then the business user has to join
tables. This may not be a happy option for the business user. He would love to
see de-normalized data in a single entity so that he can pick and choose the
necessary attributes when necessary for his analysis.
Performance: If there are many tables, which is unavoidable with data
normalization, you need to join multiple tables when the need arises. However,
joining tables negatively impacts performance when it comes to data
retrieval. This means that data normalization will have a negative performance
impact on systems that retrieve large amounts of data.
The above two reasons emphasize that data normalization is not something that you can
apply blindly. When it comes to data retrieval systems, it is better to keep the data in a
denormalized form. It should also be noted that there are exceptions to 4NF and 5NF,
which were discussed in the Exceptions for 4NF and Exceptions for 5NF sections.
A lot of concepts are discussed in this chapter. Therefore, it is useful to have a cheat sheet
so that they are easier to refer to.
Summary
In this chapter, we looked at an important concept in database modeling. We looked at the
history of normalization first. Then we defined Normalization as the methodical and
formal process that will be used to identify relations in the data model based on their keys
and different dependencies. Normalization solves two issues in data models:
repeated data and modification anomalies.
Then we discussed the several levels of normalization that can be applied to the data model.
At the first level of normalization, each row is identified by means of a single attribute,
which we called a Primary Key. The next level is the second normal form,
in which repeated values are removed. In the third normal form, transitive
dependencies are removed. As database designers, we determined that at
least the third normal form should be achieved. In BCNF, or the 3.5 normal
form, functional dependencies whose determinants are not super keys are removed. After
that, we discussed the fourth normal form, in which multi-valued dependencies are removed. In
the fifth normal form, JOIN dependencies are removed. Including domain-specific
range relations is what is achieved in the Domain-Key Normal Form.
In the next chapter, we will discuss how to implement the designed data models in the
physical table structures.
Questions
Describe the requirements of the database normalization.
What are the anomalies that will be resolved by database normalization?
The third normalization form is the minimal normalization form that should
be applied to a relation.
Normalization will help to identify the relations. Also, it will help to resolve
the anomalies which we discussed before. The major disadvantage of the
normalization process is that it will create a large number of relations, which
might be difficult to maintain. When a large number of relations are present,
you have to join those relations whenever you need to get any information. This
can lead to performance issues.
You can avoid the normalization process for relations where you have few or
no data writes. Also, relations which don't have many
rows are another place where you don't need the normalization process.
When there would be a large number of tiny relations, you can avoid the fourth
normal form. If not, you will end up with unmanageable relations,
which will lead to security and maintenance issues.
The JUNK dimensions are created by cross joining tiny dimensions. This will
create combinations of instances in all small dimensions. However, there can
be instances where some combinations may not occur. You can delete these
combinations to maintain the data quality in the databases. However,
keeping those combinations will not add any performance issues.
What are the scenarios in which you can ignore the implementation of 5NF?
Exercise
For the data model defined in Chapter 4, Representation of Data Models, apply normalization to the
highest possible normalization level. State the challenges that you encountered during
the normalization process.
You may have to insert sample data into the model so that you can
visualize the customer requirement.
Further Reading
Microsoft, https://support.microsoft.com/en-gb/help/283878/description-of-the-database-normalization-basics
Introduction to normalization, https://web.archive.org/web/20071104115206/http://www.utexas.edu/its-archive/windows/database/datamodeling/rm/rm7.html
Chapter 6: Table Structures
In chapters 2-5, we have discussed how to identify the relevant entities and their attributes
in the data models. Also, we have had a detailed discussion of different types of database
models that will be used to design databases. We have also extensively discussed different
normalization forms with relevant examples; hence, the database entities, their
attributes, and their relationships were identified scientifically.
In this chapter, it is time to define these entities physically in the identified database
technology. Up to now, we have discussed the conceptual aspects of database design and
did not care about the database technology that will be used. Now we will move on to the
physical design of the databases. Therefore, we need to choose a database technology; since
this book covers PostgreSQL, we will be looking at how this physical database modeling can
be done with PostgreSQL. Apart from the database technology, we need to understand the
available data types so that we know their strengths and limitations. Also, there are different
types of relations or tables for which the table design differs. For example, there are Master
tables, Transaction tables, Reporting tables, and Audit tables. In this important chapter, we
will discuss the options for Master tables. Transaction tables can be considered the core tables
in a database system. We will discuss the important aspects of Reporting and Auditing tables in
this chapter as well.
The following are the most common and popular database technologies that are widely used by
many database designers:
Oracle
Microsoft SQL Server
PostgreSQL
MariaDB
Teradata
IBM DB2
MySQL
Sybase
MS Access
When choosing a relational database technology, sometimes the database designer does
not have a choice, as the organization may have decided to use an already licensed, known
technology. However, even then the database designer has to verify that the features of the database
technology are sufficient to support the client requirements.
If the database designer has a choice in selecting the relational database technology, then the
cost of the database, the support of the database vendor, and the available resources, such as
people and drivers, will be the key parameters.
Even after the selection of the database technology, there are different versions of the
relational database to choose from. So let us assume that PostgreSQL 12 has been chosen as the
relational database technology.
It is important to understand what data types are available in PostgreSQL so that the
database designer knows what his options are. Let us see what the limitations and properties
of the PostgreSQL data types are.
In PostgreSQL, there are many data types, but we will be looking at only the important data
types in PostgreSQL, so that database design can be done. There are different data types
categories in PostgreSQL, and they are Character Types, Numeric Types, Monetary Types,
Date/Time Types, and Boolean Types. Apart from those data types, there are data types
such as Binary Types, JSON, Geometric Types, and XML Types which are not much used.
Let us see what are the character data types available in PostgreSQL.
The character varying(n) data type stores only the actual length of the value, whereas
character(n) has a fixed length, padding the remaining length with blanks.
In addition to these two data types, PostgreSQL provides another character data type called
text. The text data type stores strings of any length and is not part of the SQL
standard.
Apart from these three string data types, there are two other fixed-length character data types:
"char" and name, which have storage sizes of 1 byte and 64 bytes respectively. The name data type
should not be used by users as it is intended for internal use.
Similarly, these data types can be used in a SQL script as shown below:
CREATE TABLE public."Customer"
(
"Title" character(4) ,
"FirstName" character varying(25) ,
"MiddleName" character varying(25) ,
"LastName" character varying(25) C,
"Status" "char",
"Profile" text
)
It is important to note that the maximum length that the character data types can have is
10,485,760. The error ERROR: length for type varchar cannot exceed 10485760 will occur if you try to
create a column longer than 10,485,760. However, it is unlikely that you will need a character
column longer than that.
Collation Support
In the character data types, collation is an important concept that allows users to work with
multiple languages. As you know, different languages have their own character sets and sorting
properties. If you are designing a database for multi-lingual requirements, the collation has to
be selected accordingly. By default, all the character data types will take the collation of the
database.
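As a brief sketch, a collation can also be set per column. The collation names depend on the operating system and the locales installed, so "en_US" and "de_DE" below are assumptions, and the table name is illustrative:

CREATE TABLE public."MultilingualName"
(
    -- Each column sorts and compares according to its own language rules.
    "NameEnglish" character varying(100) COLLATE "en_US",
    "NameGerman"  character varying(100) COLLATE "de_DE"
);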
Let us see what numerical data types are available in PostgreSQL, along with their limitations
and properties.
tables that refer to the department table as well. For example, if the employee
table refers to the department table, that will also be impacted. Also, when
indexes are created, they will require more storage; this will be discussed in detail
in Chapter 8, Working with Indexes.
Also, unnecessary usage of a larger data type will need additional storage and time for the
database backup to execute, and also additional time to restore the database if needed.
When you need to search for data, you may have to scan additional, unnecessary data
pages if the wrong data type is selected. This means that it is essential to select the correct
data type, as the choice has both a storage and a performance impact.
When an integer data type is divided by another integer data type, the result is an integer. If you
want the result as a numeric value, make sure you cast either the denominator or the numerator.
The outputs are different when the data types are different. This means you need to carefully
select data types.
When a serial column is created, a SEQUENCE object is created automatically behind the scenes.
This sequence will be dropped once the serial column is dropped. You
can also drop the sequence on its own; in that case, the column will not be dropped,
but its values will no longer be incremented automatically.
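As a minimal sketch (the table and column names are illustrative), the backing sequence can be located with the built-in pg_get_serial_sequence function:

CREATE TABLE public."Ticket"
(
    "TicketID" serial NOT NULL,
    "Subject"  character varying(100)
);

-- Returns the name of the sequence backing the serial column,
-- for example public."Ticket_TicketID_seq".
SELECT pg_get_serial_sequence('public."Ticket"', 'TicketID');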
Monetary data types are another set of important data types in PostgreSQL, which will be discussed
in the following section.
When the money data type is divided by values of different data types, the results can be of
different types. If you want to obtain the money data type as the output, the correct input
types have to be used.
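A short illustration (the figures are arbitrary): dividing money by an integer keeps the money type, while dividing money by money yields a plain double precision ratio.

SELECT 100::money / 3        AS money_by_integer, -- money result:            $33.33
       100::money / 8::money AS money_by_money;   -- double precision result: 12.5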
The following table shows the different date-time data types and their properties and their
limitations.
The interval data type has the following parameters to choose from to indicate the interval
type:
As indicated before, we have discussed the important data types in PostgreSQL. However,
PostgreSQL has much richer data types to suit your various needs. You can go through the
PostgreSQL documentation which is listed in the Further Readings section.
When it comes to designing physical database structures, there are mainly six data layers, as
shown in the following screenshot:
Across these data layers, there are four table types: Master Tables, Transactional Tables,
Reporting Data, and Transaction Audit Data, which we will look at in
the following sections.
Let us see the different serial data types that are available in PostgreSQL.
code will be used. For example, the enrollment entity will have the course code. There can
be two issues with this design.
1. When the business key is used as the primary key, transaction tables
reference the business key, as said above. Since these codes are governed
by the business, the business has the right to change the business key. If the business
keys are changed, changing the master tables will not be a great issue, as master
tables typically have a small number of records. However, you will need to change all
the records in the transaction table, which may have a large number of records, in
the order of millions. Modifying the transaction table records may result in a
table lock, and that will cause the table to be unavailable until the update is
over.
2. Typically, a business key has alpha-numeric values, as said earlier in this section. For
most queries, you need to join the transaction and the master tables. When
joining on alpha-numeric columns, there will be a performance impact.
Considering these two issues, we typically introduce a new serial column that will
be used as the primary key. The existing business key will then be covered by a UNIQUE index.
Let us see why it is important to introduce created-date and modified-date columns in master tables.
The following screenshot shows the sample for the course table, which includes the
primary key, and created and updated columns:
"The following code block shows the script for the table above:
CREATE TABLE public."Course"
(
"CourseID" integer NOT NULL DEFAULT
nextval('"Course_CourseID_seq"'::regclass),
"CourseCode" character varying(10) COLLATE pg_catalog."default" NOT NULL,
"Description" character varying(50) COLLATE pg_catalog."default",
"CreatedDate" timestamp without time zone,
"ModifiedDate" timestamp without time zone,
CONSTRAINT "PK_Course" PRIMARY KEY ("CourseID")
)
The following screenshot shows the sample data set for the Course table:
However, there are different types of master tables depending on how the history of the master
data is maintained. Depending on the type of master table, the table design will differ.
Although updating the existing record is the most common master table design, there can
be a case where you need to keep the historical aspect of the master tables.
Let us examine the different design approaches for the Master tables.
There are three designs for historical master tables: Row-based, Column-based, and Hybrid
Historical Master Tables. Let us learn more about the options for designing historical
master tables in the following sections.
To facilitate this, we need ValidStartDate, ValidEndDate, and IsCurrent columns, as
shown in the following screenshot:
Please note that there are two modifications done to the previous Course
table apart from the addition of the ValidStartDate, ValidEndDate, and
IsCurrent columns. When we created the Course table before, the data type of
CourseID was set to serial. However, you will see that the column data type
is integer now. As we said in the Serial Data Types section, the serial data
type is not a true data type; it is a variant of the integer data type. You
will also observe that there is another column included in this table,
named CourseOwner; this column will be used to demonstrate how to
keep historical data in a master table.
Let us assume that the course called JAVA is inserted into the table using the following
script:
INSERT INTO PUBLIC."Course" (
"CourseCode"
,"Description"
,"CourseOwner"
,"ValidStartDate"
,"IsCurrent"
,"CreatedDate"
,"ModifedDate"
)
VALUES (
'JAVA'
,'Introduction to JAVA'
,'Phil Jason'
,'2012-01-01'
,B'1'
,NOW()
,NOW()
);
After the insertion of the record, the Course table will look as shown in the following
screenshot:
Now let us assume that, on 2017-05-01, the course ownership was changed to 'David Baker'. A
new record will be inserted and the existing record will be updated using the following
script:
UPDATE PUBLIC."Course"
SET "ValidEndDate" = '2017-04-30', "IsCurrent" = B'0', "ModifedDate" = NOW()
WHERE "CourseCode" = 'JAVA' AND "IsCurrent" = B'1';
If you want to implement row-based historical master tables, the business key (Course Code
in this example) cannot be used as the primary key, as the business key can be duplicated.
Therefore, if you wish to implement row-based historical master tables, you need to include
an additional serial column.
The following screenshot shows the flow chart for the Row-based historical master table
implementation:
However, if there are columns that change frequently, this type of design will not be
effective, as the master table tends to grow rapidly. Also, not all columns need to be
treated as historical data. For example, in the above example, if the Course Name is modified, it
will simply be overwritten rather than a new row being added.
Let us look at the same example of the Course table with the CourseOwner columns. As
shown in the following screenshot, the previous owner will be stored in a column called
PreviousCourseOwner in the same record:
Following is the script that was used to create the above table:
CREATE TABLE public."Course"
(
"CourseID" serial NOT NULL ,
"CourseCode" character varying(10) COLLATE pg_catalog."default" NOT NULL,
"Description" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"CourseOwner" character varying(50) COLLATE pg_catalog."default" NOT NULL,
"PreviousCourseOwner" character varying(50) COLLATE pg_catalog."default",
"CreatedDate" timestamp without time zone,
"ModifedDate" timestamp without time zone,
CONSTRAINT "PK_Course" PRIMARY KEY ("CourseID")
)
When the first record is inserted, PreviousCourseOwner is null as shown in the below
screenshot:
Now let us assume that the course ownership was changed from 'Phil Jason' to 'David Baker'.
'Phil Jason' will be moved to the PreviousCourseOwner column, and the new value, 'David
Baker', will be written to the CourseOwner column in the Course table. This is done
by using the following script:
UPDATE public."Course"
SET "PreviousCourseOwner" ="CourseOwner"
,"CourseOwner" = 'David Baker'
,"ModifedDate" = NOW()
WHERE "CourseCode" = 'JAVA'
Unlike the row-based master table, there won't be a large number of rows in the column-based
master table. However, it is important to note that this approach can keep only one previous
value. Also, this type of master table is not scalable if you want to keep history for a large
number of columns.
The following screenshot shows the sample data set for the above implementation:
Sometimes, designers prefer to have the current value in all the records as well, as shown in
the sample data set in the following screenshot:
Though the design of master tables may seem to be a trivial problem to solve, there are
multiple options available to designers to achieve the best outcome. The designer's task is to
carefully select the approach that caters to the client's requirements.
In the next section, we will look at the design of transaction tables, which is a core
part of database design.
Data volume, transaction velocity, and data integrity are the major challenges that database
designers come across when designing transaction tables.
Let us look at the challenges posed by high volumes in transaction tables.
mechanism for dividing data by usage pattern. For example, you can archive older data in
cheaper data storage as partitions can be mapped to different physical locations.
However, the partitioning strategy of a table must be selected with great consideration
so that the benefits can be maximized while minimizing adverse effects on transactions. The
partitioning process and partitioning strategies will be looked at in detail in Chapter 12,
Distributed Databases.
Another challenge that database designers come across is high velocity in transaction
tables, which will be discussed in the next section.
Data Integrity
When it comes to transactions, a transaction has to update multiple tables and refer to multiple
master tables. If you consider the order table, to create an order you need to refer to the
customer table, product table, cashier table, and so on. When the order is raised, it has to
be made sure that the correct references are in place. This integrity can be achieved by referential
data integrity, which is discussed later in this chapter.
When an order is converted to an invoice, the transaction has to update the Order
transaction table, the customer balance, the inventory tables, and so on. For one transaction to
complete, the transaction has to make sure that all the relevant tables are updated. In case
one table update fails, the entire transaction should be rolled back to achieve the Atomicity property
of transactions. This integrity will be discussed in Chapter 09, Designing a Database with
Transactions.
In the above system design, the Order entity is divided into three physical tables in order to
cater to the challenges that we discussed for transaction table design. For a
given order, all three tables have to be updated to complete the order. All tables
have CreatedDate and ModifiedDate columns in order to support integration with third-party
systems if needed.
Typically, there are a lot of relationships between the transaction tables and the master tables.
These relationships are usually implemented with foreign key constraints, which are discussed
later in this chapter.
The following is the simple formula for the calculation of the Amount value using the Quantity and
UnitPrice values:
Amount = Quantity * UnitPrice
Now the question is whether Amount should be stored as a calculated column or whether it can be
processed when it is required. There are positive and negative aspects to both cases. If you
consider the storage and the writing time, you should choose Amount as a processed
column. However, if you are looking at improving the reading performance rather than the
writing performance, it is better to store it as a calculated column.
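A minimal sketch of the calculated-column option, using PostgreSQL 12's generated columns (the table and column names are illustrative, not the book's exact design):

CREATE TABLE public."OrderDetail"
(
    "OrderDetailID" bigserial PRIMARY KEY,
    "Quantity"      integer        NOT NULL,
    "UnitPrice"     numeric(12, 2) NOT NULL,
    -- Computed and stored at write time, so reads do not repeat the multiplication.
    "Amount"        numeric(14, 2) GENERATED ALWAYS AS ("Quantity" * "UnitPrice") STORED
);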
Let us assume a student submits an assignment, then the lecturer will mark the assignment
as received or rejected. Then the lecturer will mark this for grading and finally, grades will
be released to the students.
The following table shows how a single record changes over time as it moves through the
different stages.
record has to be updated later, it is essential to find the record in a timely manner. Therefore,
indexes make an important contribution to finding the relevant record.
In the next section, we will discuss reporting tables, which exist mainly to support reports in
the database system.
Auditing is an important concept in database design, and the nature of audit tables is different
from that of the other tables. Therefore, special attention is needed to design audit tables, which
will be discussed in the next section.
| Question | Meaning | Details |
|---|---|---|
| WHO | Author of the change | It can be the operating system user, the database user, or the application user. |
| WHAT | What was changed | The data which was changed. Operation: INSERT / DELETE / UPDATE / TRUNCATE / permissions granted. The executed command. |
| WHEN | When the change was done | The timestamp of the change. If the users are accessing from multiple time zones, it is essential to store the time zone information as well. |
| WHERE | From where the change was done | IP address / client machine name, etc. |
The following are the typical attributes that can be captured for auditing. Please note
that, due to practical and technical issues, you may not be able to capture all the details.
Such an audit table can be created with the below script:
CREATE TABLE public."GeneralAudit"
(
"AuditID" bigserial NOT NULL ,
"StartTimeStamp" timestamp with time zone NOT NULL,
"EndTimeStamp" timestamp with time zone,
"OperatingUserName" character varying(50) COLLATE pg_catalog."default",
"SystemUserName" character varying(50) COLLATE pg_catalog."default",
"ApplicationName" character varying(50) COLLATE pg_catalog."default",
"Command" character varying(2000) COLLATE pg_catalog."default",
"HostName" character varying(15) COLLATE pg_catalog."default",
"DatabaseName" character varying(15) COLLATE pg_catalog."default",
Audit tables are write-heavy tables and are very rarely used for retrieval. Audit data needs to
be read only when there is an absolute need. Therefore, typically, indexes are not created
on audit tables. Refer to Chapter 8, Working with Indexes.
When there is a need to retrieve data from audit tables, you need indexes.
For example, if you want to analyze audit data for a given application,
you need to apply an index to the ApplicationName column. You can copy
the audit database to a different server and apply the necessary indexes to
facilitate the audit queries.
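On such a copy, a minimal index for that scenario could be (the index name is illustrative; the table and column come from the GeneralAudit script above):

CREATE INDEX "IX_GeneralAudit_ApplicationName"
    ON public."GeneralAudit" ("ApplicationName");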
Also, no foreign key constraints or other check constraints are implemented on the audit
table, in order to improve the data-writing performance of the audit table. Only a primary key is
added to the audit table.
There are several ways to capture audit data, as listed below, but they are not in the scope of this
book:
Archiving is another important aspect of audit tables. Some laws govern the period of
audit data retention. Typically, the table partitioning concept is used for audit tables so that
data archival can be done much more easily. The partitioning process will be looked at in detail in
Chapter 12, Distributed Databases.
Foreign keys are one of the main ways to maintain data integrity between multiple tables.
If we look at the example of course enrollment, lecturers and courses are referred to in
the course enrollment table. Let us look at the basic columns of these three entities to describe
the foreign key constraints, using the following screenshot:
Please note that in the above E-R diagram, only the essential columns are taken into
consideration.
Foreign key constraints are created in PostgreSQL as shown below.
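As a minimal SQL sketch (the table and column names are assumptions based on the entities above, not the book's exact screenshot):

ALTER TABLE public."CourseEnrollment"
    ADD CONSTRAINT "FK_CourseEnrollment_Course"
    FOREIGN KEY ("CourseID") REFERENCES public."Course" ("CourseID");

ALTER TABLE public."CourseEnrollment"
    ADD CONSTRAINT "FK_CourseEnrollment_Lecturer"
    FOREIGN KEY ("LecturerID") REFERENCES public."Lecturer" ("LecturerID");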
A detailed discussion was done on Constraints (Check / Unique / Foreign Key) in Chapter 1,
Overview of PostgreSQL Relational Databases. Also, Transactions are discussed in detail
in Chapter 09, Designing a Database with Transactions.
Summary
After discussing the identifying entities and modeling by E-R modeling and different levels
of normalization forms in the previous chapters, this chapter dedicated the discussion to
the physical implementation of these entities in database tables. In this chapter, it was
identified that there are mainly four types of table structures in physical database
implementation. Master Tables, Transaction Tables, Reporting Tables, and Audit Tables are
those four types of tables. It was identified that there are several ways to design master
tables mainly depending on how history data is maintained. Most of the time, history is not
maintained in the master tables. However, when it is required to maintain history,
three types of designs were discussed, depending on how the historical data is stored. Those
types are the Row-based, Column-based, and Hybrid master table implementations.
In the transaction tables, it is essential to overcome multiple challenges such as high data
volume and high data velocity. Maintaining integrity is an important phenomenon in
transactions; otherwise, transactions will be invalid. Reporting tables are used mainly to
support high read volumes; therefore, it was suggested to have de-normalized structures for
reporting tables. Auditing tables are used for many reasons, mainly to support various
domain laws. For audit tables, it was discussed that it is essential not to have indexes, in order to
support their write-heavy workload. When database designers want to design a database, there
are common problems that can be solved by database design patterns.
In the next chapter, we will discuss the design patterns of databases that will be used as a
template for the common problems in database designing.
Questions
What are the parameters you will consider when choosing a database technology
to match the client's requirements?
Why should the character varying(n) data type be used over the character(n) data type?
When storing data in a character varying(n) data type, it consumes only the
storage of the data that you are inserting to the column. It does not consume
the entire allocated storage. For example, let us assume that there is a column
of character varying(50) data type. If you are storing "MESSAGE" value in the
above column, it will consume 7 bytes which is the length of the MESSAGE
text. On the other hand, if you are using the character(50) data type, regardless of
the size of the value, it will consume the entire size of 50
bytes. This means it is always recommended to utilize the character varying(n)
data type for columns such as names, addresses, and so on.
In the character data types, collation is an important concept that allows
users to work with multiple languages. As you know, different languages have their own
character sets and sorting properties. If you are designing a database for multi-lingual
requirements, the collation has to be selected accordingly. By default, all
the character data types will take the collation of the database.
Why it is essential to choose the correct data type among smallint, integer, and
bigint?
The smallint data type requires 2 bytes and the integer data type requires 4
bytes, whereas the bigint data type requires 8 bytes. If you are not utilizing the
correct numeric data types, you will be spending data storage unnecessarily.
Additional storage means that row sizes are larger, which will reduce the
transaction read as well as write performance significantly. Apart from the
query performance, the database will be unnecessarily larger than it should be.
This means additional storage is required for the database and its backups.
Also, database backup and database restoration times will be higher than they
should be.
What is the strategy that you will use if you are dealing with users from multiple
geographical locations?
What is the importance of including an additional serial column for the master
table?
If no serial data type column is used, the business key has to be used as
the primary key. By introducing an additional serial column, the primary key
can be the new serial column. This serial key will remove the
dependency of the primary key on the business. Additionally, by
introducing a serial data type column as the primary key of the table, table joins are
performed over numeric columns rather than alpha-numeric columns,
which results in higher performance during table joins.
An organization has identified some of its female employees have changed their
names after marriage. What would be the best type of historical master table you
would recommend for this scenario?
Why is at least the third normalization level required for transaction tables?
Since transaction tables are dealing with high volume and high-velocity
transactions, it is essential to keep the transaction table at a minimum row
length. When the third normalization level is implemented to the transaction
tables, the row size of the transaction tables is reduced.
Why are de-normalized database structures most suitable for reporting tables?
Reporting tables have high read operations with high volume. If reporting
tables are fully normalized, data will be scattered and you need to perform a
large number of table joins. If there are large table joins for a large volume of
data, it will require high CPU. When tables are de-normalized, you don't
need many table joins, hence it will improve the read performance of the
reporting tables.
Auditing data should be stored securely.
Exercise
Let us implement table structures for the database designed in the previous chapters.
Further Reading
Full list of data types in PostgreSQL: https://www.postgresql.org/docs/12/datatype.html
Collation: https://www.postgresql.org/docs/9.1/collation.html
Chapter 7: Working with Design Patterns
During chapters 1-4, we discussed the modeling aspect of the database. In those chapters,
we discussed how to capture the users' requirements with respect to the database. During
these chapters, we identified different data models such as conceptual models, semantic
modeling, and E-R modeling. Then we discussed how to optimize identified entities and
their attributes, by using different levels of normalization models. We also discussed where
we need to ignore the database normalization.
In the previous chapter, Table Structures, we discussed how physical
table structures are implemented for various types of tables, such as Master tables,
Transaction tables, Reporting tables, and Auditing tables.
In this chapter, we will discuss design patterns. A software design pattern is a general,
reusable solution to a commonly occurring problem. However, it is important to know that
a pattern should be applied depending on the environment. When it comes to database
design, there are common problems that can be addressed by database design patterns.
There are also anti-patterns, which describe what we should not do during the design of
databases.
Patterns are most commonly used in application development. Let us see why software
patterns are used and what their benefits are.
Before we discuss database patterns, it is important to understand the basic criticisms of
design patterns, so let us see the issues in software patterns in the next section.
Let us then see what the issue of choosing the wrong design pattern is, and discuss the
issue of implementing a design pattern without any modifications.
environment, and from domain to domain. This means that you need to adapt the given
design patterns to match your environment.
Let us look at another issue of design patterns: outdated design patterns.
Let us see the common database design patterns in the next section.
Data Mapper
As we know, the prime target of the database is to store data. Typically, this data comes to
the database through an application rather than through direct user intervention in the
database. There are three main user actions in the database: INSERT, UPDATE, and
DELETE. The Data Mapper design pattern maps these actions between the application and
the database. By doing so, end-users don't need to know the schema of the relevant table.
Let us revisit the Course table which is shown in the screenshot below:
If you examine this table, apart from CourseID, CreatedDate, and ModifiedDate, all the
other columns, CourseCode, Description, CourseOwner, CurrentCourseOwner,
PreviousCourseOwner, ValidStartDate, ValidEndDate, and IsCurrent, need to be filled by a
user or by an application.
The following screenshot shows the block diagram for the data mapper design pattern:
In the above diagram, three layers are defined. For the application end, the user will call the
Course Mapper objects.
Insert Mapper
An insert stored procedure can be created; let us see how that can be done in PostgreSQL.
The procedure name, variables, and variable types define the signature of the procedure.
There can be multiple procedures with one name; however, they must have a different
number of arguments or different argument types. In the argument list, the in_ prefix is
used to indicate that it is an input argument. in_is_current has a default value, which
means that even if no value is passed for that argument, the default value, 1, will be set.
The procedure above will insert all the passed values, and the current_date function is
used to populate the CreatedDate and ModifiedDate columns.
Let us look at the script of the above procedure in the following code block:
-- PROCEDURE: public.insert_course(character varying, character varying,
-- character varying, character varying, character varying, date, date, bit)
-- (The CREATE statement header below is reconstructed from the signature above;
-- the original generated header is only shown as a screenshot.)
CREATE OR REPLACE PROCEDURE public.insert_course(
    in_course_code character varying,
    in_description character varying,
    in_course_owner character varying,
    in_current_course_owner character varying,
    in_previous_course_owner character varying,
    in_valid_start_date date,
    in_valid_end_date date,
    in_is_current bit DEFAULT B'1')
LANGUAGE SQL
AS $BODY$
INSERT INTO public."Course"
    ("CourseCode",
     "Description",
     "CourseOwner",
     "CurrentCourseOwner",
     "PreviousCourseOwner",
     "ValidStartDate",
     "ValidEndDate",
     "IsCurrent",
     "CreatedDate",
     "ModifiedDate")
VALUES
    (in_course_code,
     in_description,
     in_course_owner,
     in_current_course_owner,
     in_previous_course_owner,
     in_valid_start_date,
     in_valid_end_date,
     in_is_current,
     current_date,
     current_date);
$BODY$;
At the application end, users don't have to worry about the table structures and their
names:
CALL public.insert_course(
    'C001',
    'Business Analytics',
    'Shane Kimpsons',
    'Shane Kimpsons',
    'Paul Wilson',
    '2020-01-01',
    '9999-12-31',
    B'1'
);
Further, users do not have to worry about internal columns such as the CreatedDate and
ModifiedDate columns. They simply have to call the preceding script to insert data into
the Course table.
Update Mapper
Let us see how users can use Update mapper to update the course table. Update mapper
will have the same parameters as insert mapper.
The following script shows the SQL code for the Update Mapper:
-- PROCEDURE: public.update_course(character varying, character varying,
-- character varying, character varying, character varying, date, date, bit)
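The full body of the procedure is not reproduced here; a minimal sketch, assuming the same argument list as insert_course and an update keyed on the CourseCode column, could look like the following:

CREATE OR REPLACE PROCEDURE public.update_course(
    in_course_code character varying,
    in_description character varying,
    in_course_owner character varying,
    in_current_course_owner character varying,
    in_previous_course_owner character varying,
    in_valid_start_date date,
    in_valid_end_date date,
    in_is_current bit DEFAULT B'1')
LANGUAGE SQL
AS $BODY$
UPDATE public."Course"
   SET "Description" = in_description,
       "CourseOwner" = in_course_owner,
       "CurrentCourseOwner" = in_current_course_owner,
       "PreviousCourseOwner" = in_previous_course_owner,
       "ValidStartDate" = in_valid_start_date,
       "ValidEndDate" = in_valid_end_date,
       "IsCurrent" = in_is_current,
       "ModifiedDate" = current_date   -- only ModifiedDate is refreshed on update
 WHERE "CourseCode" = in_course_code;
$BODY$;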
The above update mapper script will update course details by the Course Code. There can
be different variants of the update mapper, depending on the need. For example, there can
be an update mapper keyed not only on Course Code but also on the IsCurrent attribute.
Delete Mapper
Let us see how users can use the Delete mapper to delete rows from the course table.
Implementation of the Delete mapper is much simpler than the Insert or Update mapper,
as the Delete mapper requires fewer arguments.
The following script shows the Delete mapper for the course table:
-- PROCEDURE: public.delete_course(character varying)
-- DROP PROCEDURE public.delete_course(character varying);
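The body of the procedure is not shown above; a minimal sketch, assuming deletion is keyed on the CourseCode column, could be:

CREATE OR REPLACE PROCEDURE public.delete_course(
    in_course_code character varying)
LANGUAGE SQL
AS $BODY$
DELETE FROM public."Course"
 WHERE "CourseCode" = in_course_code;  -- delete the course identified by its business key
$BODY$;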
The delete mapper shown above will delete the course by the Course Code. There can be
different variants of the delete mapper, as we saw for the update mapper, depending on the
need. For example, there can be a delete mapper keyed not only on Course Code but also
on the IsCurrent attribute.
Unit of Work
The Unit of Work design pattern ensures that only modified data is written to the
database, instead of writing the entire data set again. Let us explain this with an example.
For the Order entity, there are typically two tables. We can view them in the following
screenshot:
In addition to the standard design, we have included two additional constraints to this
design:
1. A unique constraint is included on the Order Detail table for the columns
(OrderNumber, ProductID), as there can be only one row per product for a given order.
2. The Amount column is a generated column computed as Unit Price * Quantity. Due to
this implementation, you do not have to enter any value into the Amount column; it will
be calculated automatically.
The important configuration in this table is the additional unique constraint and computed
column for the Amount column.
In this model, there are multiple records per order. When the user adds new order lines to
the Order, in some designs the entire Order is updated, even though only one order line
was modified. As you can see, this is a very inefficient technique. In some cases, the entire
order is even deleted and re-entered.
In this data pattern, only modified records are updated as shown in the following script:
INSERT INTO public."OrderHeader"
( "OrderNumber", "OrderDate", "CustomerID")
VALUES (1, '2020-01-01',1)
There are workarounds to the above technique using the UPSERT or MERGE statements.
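A minimal sketch of the UPSERT approach in PostgreSQL, assuming OrderNumber carries a primary key or unique constraint on the OrderHeader table, is shown below; the row is inserted if it does not exist, otherwise it is updated in place:

INSERT INTO public."OrderHeader" ("OrderNumber", "OrderDate", "CustomerID")
VALUES (1, '2020-01-01', 1)
ON CONFLICT ("OrderNumber")
DO UPDATE SET "OrderDate"  = EXCLUDED."OrderDate",
              "CustomerID" = EXCLUDED."CustomerID";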
Let us look at the Lazy Loading design pattern, which is described in the section below.
Lazy Loading
The lazy load design pattern is an important database design pattern, especially for
large-volume databases. The application layer may request a large volume of data but
consume only a small part of it. Sometimes, users need far fewer records than are
generated. The lazy load pattern enables users to retrieve only the needed data.
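The query from the original screenshot is not reproduced here; it is of the following general form, assuming the OrderDetail table introduced in the previous section:

SELECT "OrderNumber", "ProductID", "Quantity"
FROM public."OrderDetail"
ORDER BY "OrderNumber"
LIMIT 3 OFFSET 5;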
The above script retrieves only three records (LIMIT 3), skipping the first five records
(OFFSET 5). If the user needs the next data set, a different OFFSET value can be sent to
retrieve the data.
If you do not include lazy loading, the application may receive the entire table. Initially,
this table might have few records; however, over time it will become a large table, and the
application will not be able to handle this volume. Therefore, it is essential to follow the
lazy load design pattern from the start of the database design phase.
In the next section, let us look at the database pattern to avoid Many-to-many
relationships.
Let us assume that, in a large-scale enterprise, multiple staff members are assigned to
facilitate each client. This means one client will have multiple staff members and, on the
other hand, one staff member will be assigned to multiple clients. So when an invoice is
raised against a customer, there will be multiple staff members assigned to that invoice, as
shown in the following entity diagram:
As shown in the above screenshot, one invoice will have one customer; however, there can
be multiple staff members for a given invoice. Customer and Invoice modeling is trivial, as
shown in the screenshot below.
In the above table diagram, only key attribute columns are listed.
Now let us see how we can model the Invoice and Staff relations. Since there are multiple
staff members per invoice, you cannot include a single staff column in the invoice table.
In the above table design, the lnk_InvoiceStaff table is introduced to facilitate the user
requirement. The introduced table contains the InvoiceID and StaffID columns, which are
the many-to-many relationship columns in the requirement. The Invoice and
lnk_InvoiceStaff tables have a one-to-many relationship via the InvoiceID column, while
the Staff and lnk_InvoiceStaff tables have a one-to-many relationship via the StaffID
column.
The following screenshot shows the entire model including the Customer table:
In the above model, all the relationships are one-to-many relationships, and all the
many-to-many relationships are avoided.
The following code block shows the script for the four tables above, and they can be
executed in PostgreSQL:
-- object: public."Customer" | type: TABLE --
CREATE TABLE public."Customer" (
    "CustomerID" smallint NOT NULL,
    "CustomerCode" varchar(8),
    "CustomerName" varchar(50),
    CONSTRAINT "UNQ_CustomerCode" UNIQUE ("CustomerCode"),
    CONSTRAINT "Customer_pk" PRIMARY KEY ("CustomerID")
);
-- object: public."Invoice" | type: TABLE --
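The remainder of the script (the Invoice, Staff, and lnk_InvoiceStaff definitions) is not reproduced here; a minimal sketch of the link table alone, assuming Invoice and Staff tables keyed on InvoiceID and StaffID, is shown below:

CREATE TABLE public."lnk_InvoiceStaff" (
    "InvoiceID" integer NOT NULL,
    "StaffID"   integer NOT NULL,
    -- the composite primary key prevents assigning the same staff member to an invoice twice
    CONSTRAINT "lnk_InvoiceStaff_pk" PRIMARY KEY ("InvoiceID", "StaffID"),
    CONSTRAINT "Invoice_fk" FOREIGN KEY ("InvoiceID") REFERENCES public."Invoice" ("InvoiceID"),
    CONSTRAINT "Staff_fk" FOREIGN KEY ("StaffID") REFERENCES public."Staff" ("StaffID")
);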
The above script will create tables, as well as necessary Primary Key constraints, Unique
Key constraints, and Foreign Key Constraints.
Another important design decision that the database designer has to make is, whether the
database is following the OLTP or OLAP model, which we will discuss in the next section.
              OLTP                          OLAP
Data          Source of data                Extracted from OLTP data and aggregated for queries
Transaction   Short transactions            Long transactions
Operations    Equal reads / writes          Mostly reads and very few writes
Duration      Short duration                Long duration
Queries       Simple queries                Complex queries
Normalization Normalized table structures   De-normalized table structures
Integrity     A concern, as there are many writes   Not a concern, as most of the transactions are reads
In OLTP systems, the major design pattern is normalization of the data model.
Let us see a simple table model for the Item master table. In the Item master table, there are
attributes such as Item Code, Item Name, Sub Category, and Item Category. In the
following screenshot we can view the OLTP design:
The sample script in the code block below can be executed in PostgreSQL to create those
tables:
-- object: public."Item" | type: TABLE --
CREATE TABLE public."Item" (
    "ItemID" integer NOT NULL,
    "ItemCode" varchar(8),
    "ItemDescription" varchar(40),
    "ItemSubCategoryID" smallint,
    CONSTRAINT "Item_pk" PRIMARY KEY ("ItemID")
);
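The full script for the three tables is not reproduced here; a minimal sketch of the remaining two tables, assuming the column names used in the OLAP version below, could be:

CREATE TABLE public."ItemSubCategory" (
    "ItemSubCategoryID" smallint NOT NULL,
    "ItemSubCategory"   varchar(50),
    "ItemCategoryID"    smallint,
    CONSTRAINT "ItemSubCategory_pk" PRIMARY KEY ("ItemSubCategoryID")
);

CREATE TABLE public."ItemCategory" (
    "ItemCategoryID" smallint NOT NULL,
    "ItemCategory"   varchar(50),
    CONSTRAINT "ItemCategory_pk" PRIMARY KEY ("ItemCategoryID")
);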
Now let us see how Item tables are designed in the OLAP structures.
In the OLAP system, typically there are two types of tables, namely,
Dimension and Fact tables.
De-normalized structures are the major design pattern in OLAP models. Therefore, the
three tables listed in the previous code block, which describe the Item master data, will be
converted to a single table as shown in the following screenshot:
Let us see how this table is created in PostgreSQL using the following script:
-- object: public."Item" | type: TABLE --
CREATE TABLE public."Item" (
"ItemID" serial NOT NULL,
"ItemCode" varchar(8),
"ItemDescription" varchar(50),
"ItemSubCategoryID" smallint,
"ItemSubCategory" varchar(50),
"ItemCategoryID" smallint,
"ItemCategory" varchar(50),
CONSTRAINT "Item_pk" PRIMARY KEY ("ItemID")
);
All three tables designed during OLTP modeling are merged into a single table in the
OLAP model. In OLAP modeling, there are fewer joins when there is a need to retrieve
data, and fewer table joins mean less processing for reads. Since OLAP models are
read-heavy, having fewer tables will improve the performance of reporting, analytics, and
so on.
Let us see the common considerations for database design patterns in the next section.
The following are the considerations that you need to look at when you are implementing
database design patterns.
Database Technology
Depending on the database technology that you have selected or you are forced to choose,
some of the database design patterns may not be possible. For example, in PostgreSQL,
a procedure cannot return a result set from a simple SELECT statement. Therefore, you
need to use functions or cursors to implement SELECT statements inside stored procedures.
Data Volume
Depending on the volume of the database that you are designing, you may have to modify
your database design pattern. For example, if you are dealing with a small volume of data,
there is no need for the Lazy Loading database design pattern.
Data Growth
If the database that you are designing is expected to grow relatively slowly over time, you
can decide against implementing some database design patterns.
After discussing database design patterns, let us discuss database anti-patterns in the
next section.
There are standard anti-patterns which we have discussed in earlier chapters, such as No
Primary Keys, No Foreign Keys, and so on. In the following sections, we will discuss a
different set of database anti-patterns.
If you examine the following course table definition, you will see multiple nullable
columns:
To avoid NULL values in these columns, you would need to enter some value which is not null.
If you want to list out all the rows in the course table where PreviousCourseOwner is null,
the following script has to be executed:
SELECT *
FROM public."Course"
WHERE "PreviousCourseOwner" IS NULL;
Similarly, you can get the rows that have a value in the PreviousCourseOwner column by
using IS NOT NULL in the WHERE clause.
Triggers
Many database technologies have the option of running code at the database server level, as
databases have objects such as stored procedures and triggers. It is often more efficient to
perform this processing close to the data; otherwise, users have to transmit the data to a
client application and process it at the client end.
However, if more processing is carried out at the database end, there can be a situation
where the database will not be able to handle incoming client requests. Therefore, in the
case of high-velocity database systems, it is better to avoid triggers and move that code to
the client-side.
Eager Loading
Eager loading is the exact opposite of lazy loading, and it is much simpler to implement.
For small data sets such as departments and countries, you can implement eager loading,
as it won't consume much processing to load the data. In addition, you need to make sure
that the data volume will not grow rapidly: today you might have a small data set, but in
the future it may become large. This means that eager loading may work today but not in
the future. As a database designer, you should design with future growth in mind.
Recursive Views
Views are used to provide easy access for end-users. Typically, a view will contain a set of
tables. There are instances where views are also used inside other views. As a database
designer, it is better to avoid multiple layers of views inside views. Multiple recursive
views may cause a lot of maintenance issues for users, even though views are introduced
to ease the management of data.
Summary
In this chapter, we discussed database design patterns. We said that database design
patterns can be used as templates for common database problems. It was mentioned in
this chapter that database patterns cannot be used blindly, as several modifications may
have to be made to the defined patterns due to various environmental conditions.
Among the database design patterns, we identified Data Mapper, Unit of Work, and Lazy
Loading as the main three design patterns. Data Mapper bridges the database and the
application. The Unit of Work and Lazy Loading design patterns are mainly used as tools
to counter potential performance issues that may occur in the future.
In database modeling, it is essential to define whether the database model is OLAP or
OLTP. OLAP is geared towards analytical and reporting systems, whereas OLTP is more
suited for transaction systems. In OLAP, we understood that de-normalized structures are
preferred so that reporting can be done much faster.
Apart from database design patterns, there are database anti-patterns, which are what we
should avoid. We mainly identified three database anti-patterns in this chapter:
implementing business logic in the application layer, triggers, and recursive views were
identified as the major database anti-patterns.
Exercise
1. What are the possible database design patterns that can be used for the database
model that you have built so far?
2. What anti-patterns can be identified for the above design?
3. Explain the limitations of the design patterns with respect to your design.
Questions
Why do database designers have to be careful when selecting database design
patterns?
Why should you be careful about using database design patterns as they are?
Database design patterns act as templates for existing common problems. This
means that they provide basic guidelines. However, you might have a different
environment and domain. Therefore, you need to be careful and analyze your
problem thoroughly. Refer to the section Identifying Issues in Software Patterns.
What are the issues with implementing the data mapper data layer?
Though there are a lot of advantages to the data mapper database design
pattern, it is important to note that there will be a large number of
procedures, which can be difficult to manage. Also, managing security on
those procedures has to be done methodically.
Why is the Lazy Loading database design pattern important even if the database is
not large at the time of designing?
Databases tend to grow rapidly over time, mainly due to business expansion.
If lazy loading is not considered at design time, there can be performance
issues when the data load is high. Lazy loading retrieves only a small part of
the data. On the other hand, eager loading extracts all the data; when eager
loading is used, applications have to wait until all the data is received, which
will cause performance issues for the end-users.
OLAP models are mostly used for analytics or reporting. This means OLAP
models have more reads and fewer writes. When there are more reads, it is
better to have fewer tables, because fewer tables mean fewer processing
resources are needed to retrieve data. In OLTP table structures, there is a
large number of tables, and when the need arises to read data, a large
number of table joins is needed, which leads to higher processing
consumption. This is why de-normalization is used for OLAP modeling.
Refer to the section OLAP versus OLTP.
Database technology, data growth, and data volume are the important
factors that should be considered when deciding whether to implement
database design patterns.
When a lot of processing takes place at the database end, the database will
consume a lot of processing resources. This may block users from accessing
the database. Therefore, triggers are not recommended in the database
design. If triggers are unavoidable, at least make sure that the trigger code is
not complex, so that it completes in a short duration.
Further Reading
Merge: https://www.postgresql.org/message-id/attachment/23520/sql-merge.html
Upsert: https://www.postgresql.org/docs/9.5/sql-insert.html
Computed Columns: https://www.postgresql.org/docs/12/ddl-generated-columns.html
OFFSET & LIMIT: https://www.postgresql.org/docs/9.3/queries-limit.html
Anti-Patterns: https://docs.microsoft.com/en-us/azure/architecture/antipatterns/busy-database/
Triggers: https://www.postgresql.org/docs/9.1/sql-createtrigger.html
Null: https://www.tutorialspoint.com/postgresql/postgresql_null_values.htm
8
Working with Indexes
Until now, we have discussed basic design strategies for databases. We identified different
steps in designing databases, such as conceptual database modeling, logical database
design, and physical database modeling. We emphasized the need for the E-R model in
order to have better communication between technical and non-technical users. During
logical database modeling, we understood the importance of data normalization at
different levels; by applying database normalization, we were able to capture a better data
model. Then we discussed database design patterns and anti-patterns for different
common problems. From this chapter onwards, we will look at how a database can be
used in practice. Though a well-designed database can satisfy the functional requirements,
non-functional requirements are a very important part of database design.
Further, there are a lot of misconceptions about indexes among engineers and database
designers. Those misconceptions will be addressed, and this chapter discusses the path
you should follow when you are not sure about indexes.
In this chapter, we will look at the different types of indexes that PostgreSQL has and when
we should use them. We will also discuss a few field notes with regard to indexes. Since
indexes are used heavily in the industry, it is essential to understand their practical usage.
Let us discuss the non-functional requirements (NFRs) for a database in the following
section.
For example, tables, views, and procedures define how the database functions for
end-users. When a user executes a stored procedure with specific inputs, he will get an
output. When he executes OrderList with a date parameter, he will receive all the orders
that were raised on that day. This is a functional requirement for the application.
However, if it takes more than one hour to retrieve the above-said data, users will not use
the procedure. This example indicates the importance of non-functional requirements.
Apart from performance, there are other elements in non-functional requirements:
scalability, capacity, availability, reliability, recoverability, maintainability, serviceability,
security, regulatory compliance, manageability, data integrity, usability, and
interoperability are the main elements of database non-functional requirements.
Apart from business viability, there are a lot of compliance requirements which need to be
satisfied. For example, Sarbanes-Oxley and the UK data protection law are major
compliance requirements that have to be implemented. Apart from these laws, there are
domain-specific compliance requirements that have to be implemented. Those are also
considered under non-functional requirements.
During the design phase, the major consideration is usually the functional design, not the
non-functional requirements. However, after implementation, the system or the database
will not be used if the non-functional requirements are not met. Therefore, it is essential to
implement the non-functional requirements.
Most of the time, non-functional requirements are implemented at the application level.
However, since databases are core components of the system, it is essential to implement
non-functional requirements at the database level as well.
Let us look at the above-mentioned non-functional requirements in the following sections.
Performance
The database is used to store data that is created by users from applications. Therefore,
the database tends to store a large volume of data, and when there is a large volume of
data, data access tends to become slow. Typically, just after the implementation of the
system, database access will not be a problem, as the data load is very low. However, as
data grows over time, data access becomes slower. Therefore, it is essential to look at
performance at the design stage rather than fixing it in production.
Though database design and the selection of proper data types will improve database
performance, special attention is still needed for performance. Indexes are the main
implementation to improve performance.
In this chapter, we will discuss different scenarios of indexes and different types of indexes
that can be implemented in PostgreSQL.
Security
Since data is considered one of the most valuable assets in any organization, it is needless
to stress that you need to protect your data. When data is accessed by multiple users in the
organization, it is important to distinguish these users. In addition, different users have
different levels of access to database objects. For example, the user Joe can access the
customer table but cannot modify its data. On the other hand, the user Alice can access the
same table and can write to it as well.
Similarly, two different users may have access to a Project table, but one user can view
only a few columns, whereas another user can retrieve all the columns. Further, different
users may be able to see only certain rows. With so many combinations, security in a
database has become a complex but important process.
Scalability
Key aspects of the database are the velocity and volume of the stored data. This means
that the database size will increase in the future; sometimes, the growth can be
exponential. Therefore, it is essential that the database supports scalability so that end
users will not be impacted by the data volume.
Indexes are one method that can be used to improve the scalability of database systems.
Availability
As stressed in many places, the database is a core component of the system. Therefore, it is
important to provide continuous service from the database. When there are hardware
failures, such as network or storage failures, you still need to provide continuous database
access to applications and users.
Most of the time, infrastructure technologies are used as the main implementation for
database availability. However, there are situations where some level of availability must
be provided from the database itself, and different databases have different technologies
for this.
The following diagram is the most common design for availability from the database end:
Synchronous Replication
rolled back; thus, this is not very popular among database administrators
and users.
However, if you are looking for automatic failover for the database, you need
to configure synchronous replication.
Asynchronous Replication
Though this technique does not allow automatic failover and carries the risk
of data loss, it is a more popular technique among database administrators,
mainly because it is less complex. Further, this technique is mostly resilient
to network failures, as replication can restart from where it was suspended.
Recoverability
As discussed multiple times, the database is a key component of the system. It holds data,
which is a key asset of the organization. With respect to databases, there are two important
parameters for recoverability: the Recovery Point Objective (RPO) and the Recovery Time
Objective (RTO).
The Recovery Point Objective describes the state to which the system can be recovered. In
simple terms, this can be thought of as the acceptable data loss. It is obvious that the
business needs a minimum RPO for successful operation.
The Recovery Time Objective (RTO) is the time taken to recover the system. In simple
terms, this is referred to as downtime. Like RPO, it is essential to minimize the RTO in
database systems. In any database system, it is very important to take measures to reduce
RPO and RTO for the successful operation of the business.
Interoperability
As we discussed in Chapter 1, every business operates in a heterogeneous environment
with respect to data, tools, and processes. Therefore, it is important to have the ability to
connect and communicate between different systems, devices, applications, or products.
Since the database is the core component of the system, it is important to have the ability
to connect with other systems and data.
Since we have dedicated this chapter to Indexes, let us discuss Indexes in detail.
Indexes In Detail
Let us look at the example of a book. If you are asked to search for the keyword Database
Designers in a given book, what would be your strategy? Obviously, you will first look at
the table of contents and, if the relevant keyword exists, you will go directly to that page.
The following is a screenshot of the table of contents of a book.
From the above screenshot, it can be observed that page 22 has the relevant content.
Sometimes, you do not see the keyword in the table of contents. In that situation, you
would refer to the index at the back of the book, as shown in the screenshot below:
As shown in the above screenshot, users can get all the relevant page numbers for a given
keyword. For example, the triggers keyword exists on page 245 and pages 263 to 267.
Imagine what would happen if you did not have these indexes: you would be flipping
through the entire book, page by page. That would not be a pleasant experience for a
reader. As you saw in these examples, indexes are very helpful for easy and fast access to
data.
Indexes are like ordered data or a subset of your data. For example, in a transaction table,
you can order the rows by OrderID. If you need to search for a given order number, you do
not have to scan the entire table; instead, you can go directly to that order number.
Further, if you want to search by CustomerID in the order table, you can create a subset of
data with ordered CustomerID values and the relevant OrderIDs. If you want to search by
CustomerID, then the customer subset is searched and the relevant IDs can be fetched.
There are different types of indexes in many database technologies, and PostgreSQL has the
following indexes for different uses:
B-Tree
Hash
GiST
Generalized Inverted Index (GIN)
SP-GiST
Block Range Index (BRIN)
B-Tree
B-Tree stands for balanced tree, not binary tree. All the leaf nodes are at an equal
distance from the root. A parent node can have multiple children, minimizing the tree
depth. This reduces the traversal path for data retrieval.
The following screenshot shows the diagram of the B-Tree structure:
https://www.csd.uoc.gr/~hy460/pdf/p650-lehman.pdf
From the above screenshot, it is obvious that searching for a specific value using a B-Tree is
faster than reading the data without any order.
Hash
Hash indexes are helpful when you have values that are larger than 8 KB. Further, the only
operator that can be used with a hash index is the equality (=) operator.
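As an illustration, a hash index can be created by specifying the hash access method; the index name below is hypothetical:

CREATE INDEX "IX_OrderDetail_ProductID_hash"
    ON public."OrderDetail" USING hash ("ProductID");
-- useful only for equality searches such as WHERE "ProductID" = 797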
GiST
The GiST index is used for overlapping data such as geometries. GiST allows the
development of custom data types with the appropriate access method. In this index, range
values are used for indexing. This index has key functions such as UNION and DISTANCE.
The GiST index is most helpful for finding the nearest neighbor, the shortest path, and so on.
SP-GiST
Like the GiST index, SP-GiST allows the development of custom data types. SP-GiST is for
non-balanced data structures. This type of index can be used to store points.
The following screenshot shows how to create an index using an access method.
The B-Tree access method is the most common and the default access method for creating
an index.
When indexes are created in PostgreSQL, they can be seen in pgAdmin, as shown in the
screenshot below:
The B-Tree index is the most popular index in PostgreSQL. Mainly, there are two types of
index implementations for tables: clustered and non-clustered indexes.
Clustered Index
A clustered index sorts your data in the order of the index key. When a clustered index is
defined, data is physically ordered in the order of the index.
Let us see the best practices for clustered indexes:
Ideally, it is recommended to have a clustered index for every table. This can be
avoided for small and static tables.
The clustered index key should be small in size. Columns such as names and dates
should be avoided for a clustered index.
Columns that change over time should not be selected as clustered index keys.
Integer, auto-increment columns are the better choices for the clustered index.
Whenever there is doubt about indexes, the go-to place is the query plan.
This query executed in less than one second and retrieved 31 rows.
There can be situations where database backups or index rebuilds consume the
resources. This will leave query performance slower.
Following is the query plan for the above query when no indexes are present:
The above screenshot shows that to retrieve 31 records, 121286 records were filtered.
Let us create a clustered index for the OrderDetail table on OrderNumber and OrderLine:
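The index creation statement is not reproduced here; in PostgreSQL, a comparable effect can be achieved by creating an index on the two columns and physically ordering the table with the one-time CLUSTER command, for example (the index name is hypothetical):

CREATE INDEX "IX_OrderDetail_OrderNumber_OrderLine"
    ON public."OrderDetail" USING btree ("OrderNumber", "OrderLine");

-- physically reorder the table rows by the index (a one-time operation)
CLUSTER public."OrderDetail" USING "IX_OrderDetail_OrderNumber_OrderLine";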
Let us see the query plan for the same query after the index was added:
Now the query plan has changed and only relevant rows are retrieved.
--Query 2
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderLine" = 68519;
The only difference between the two queries is that the WHERE clauses use different
columns, but both columns are part of the clustered index.
Let us see the query plan for both queries as shown in the below screenshot:
From the EXPLAIN plan, it can be seen that both queries are behaving similarly.
--Query 2
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderLine" = 68519
AND "OrderNumber" = 58950;
This shows that interchanging the columns in the WHERE clause when they are combined
with AND won't have any impact on query performance.
--Query 2
SELECT "OrderLine", "OrderNumber"
FROM public."OrderDetail"
WHERE "OrderLine" = 68519
AND "OrderNumber" = 58950;
As we observed in the previous two scenarios, we see the same query plan for both
queries, which means that there is no performance impact when the column order in the
SELECT clause is changed.
Order Clause
Typically, we use the ORDER BY clause when we need to explicitly order the result set.
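The first query from the screenshot is not reproduced; it is of the following general form, ordering by a column that is part of the clustered key:

SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderNumber" = 58950
ORDER BY "OrderLine";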
Since the clustered key column is included in the ORDER BY clause, the ORDER BY clause
has no impact on performance or on the results. If you recall, this is the same query plan
that was observed without the ORDER BY clause.
Now let us see a query with a different column for ORDER BY clause:
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderNumber" = 58950
ORDER BY "Amount";
Now you will see a different query plan altogether. Since sorting is done on the Amount
column, and that column is not part of the clustered index, sorting has to be done
explicitly.
Let us look at this scenario where the only difference is AND and OR clauses in both
queries.
--Query 1
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderNumber" = 58950
AND "OrderLine" = 68519;
--Query 2
SELECT "OrderNumber", "OrderLine"
FROM public."OrderDetail"
WHERE "OrderLine" = 68519
OR "OrderNumber" = 58950;
The following is the query plan for the second query (we have already seen the query plan
for the first query):
From the above query plan, it is clearly visible that the query with the OR condition is
more complex and time-consuming than the query with the AND condition.
--Query 2
SELECT "OrderNumber", "OrderLine","ProductID"
FROM public."OrderDetail"
WHERE "OrderLine" = 68519
AND "OrderNumber" = 58950;
Let us see the difference in the query plan in the following screenshot:
In the first query, the required data is available in the index structure itself, whereas in the
second query, leaf node data is needed to retrieve the ProductID column. Therefore, a
different operation is needed to get the data.
Non-Clustered Index
A non-clustered index can be considered equivalent to the index at the back of a book. If
you want to search for a keyword, rather than searching the book content, you first search
the index. Next to the keyword you find the page number, and then you refer to the
relevant page to get the required content, as shown in the following screenshot:
Let us analyze the Non-clustered index with different scenarios. Let us analyze the
following query:
SELECT "OrderNumber",
"OrderLine"
FROM public."OrderDetail"
WHERE "ProductID" = 797;
First, run this query without an index and the following is the query plan.
This shows that to search for records that satisfy the criteria, it has to do a table scan.
Let us create an index on the ProductID column and let us execute the same query and
verify the query plan:
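The index creation itself is not reproduced here; an equivalent statement, with a hypothetical index name, is:

CREATE INDEX "IX_OrderDetail_ProductID"
    ON public."OrderDetail" USING btree ("ProductID" ASC NULLS LAST)
    TABLESPACE pg_default;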
In this query, since there is an index on ProductID, first it will search for the value.
As we discussed before, you can have only one clustered index per table.
However, you can create many non-clustered indexes per table. Since
indexes slow down writes to the table, when creating indexes for
transaction-oriented tables it is essential to identify the optimum number of
indexes. In the case of analysis systems, you can create a larger number of
non-clustered indexes.
Let us look at more complex queries with Clustered and Non-Clustered index together.
Complex Queries
In the real world, we will have to incorporate many tables with different types of joins
such as INNER JOIN and LEFT JOIN. These queries will have combinations of clustered
and non-clustered indexes, and such complex queries need special attention when it comes
to EXPLAIN plans. Let us look at the behavior of queries when there are tables with
multiple joins and with an ORDER BY clause in the coming sections.
Multiple Joins
In most of the cases, multiple tables are joined together to obtain the required results.
Let us join the OrderDetail table with the Product table. In the Product table, ProductID is
the Clustered index whereas OrderDetail's ProductID is the non-clustered key.
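The join query itself is not reproduced from the screenshot; it is of the following general form, filtering on the Color column of the Product table (the selected columns and the color value are assumptions):

SELECT od."OrderNumber", od."OrderLine", p."Name"
FROM public."OrderDetail" od
INNER JOIN public."Product" p ON p."ProductID" = od."ProductID"
WHERE p."Color" = 'Red';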
The following screenshot is the query plan for the above query:
Since there is no non-clustered index on the Color column, it has to be filtered by doing a table scan.
If this is a frequently running query, it is better to add a non-clustered index to the Color
column using the query below:
CREATE INDEX "IX_Product_Color"
ON public."Product" USING btree
("Color" COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
Now let us execute the same query after the index creation on the Color column.
Since a non-clustered index has been created on the Color column, instead of a table scan
on the Product table an index scan is done, which will improve the performance of the
query.
When more than two tables are included in a query, two tables are joined together first
and the other tables are added one by one.
As you can see from the above query plan, Product and ItemCategory tables are joined first
and then the OrderDetail table is joined.
GROUP BY Condition
For aggregate functions, indexes play a key role.
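The aggregate query used in this example is not reproduced from the screenshot; it is of the following general form, grouping by the product Name (the aggregated Quantity column is an assumption):

SELECT p."Name", SUM(od."Quantity") AS "TotalQuantity"
FROM public."OrderDetail" od
INNER JOIN public."Product" p ON p."ProductID" = od."ProductID"
GROUP BY p."Name";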
Since there is no index on the Name column, a table scan has to be done for the Product
Table as shown in the below screenshot:
Let us do the same query by adding an index to the Name column as shown in the
following script:
CREATE INDEX "IX_Product_Name"
ON public."Product" USING btree
("Name" COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
After the above index is added, you will observe that there is no change to the query plan
since we are extracting the entire data set.
INCLUDE Indexes
Let us assume that you want to select columns that are not part of the non-clustered index,
but you have an index on the WHERE clause columns. This means that the index can be
used to locate the records, but to retrieve the data you need to refer back to the data pages.
This has a performance impact.
In legacy systems, there was a technique called covering indexes.
Covering indexes were used to include all the columns covered by the
relevant query. However, this resulted in a large index, which in turn
resulted in more index fragmentation and a possible negative performance
impact. Therefore, the covering index option was not very popular among
database administrators.
The INCLUDE index was introduced in most databases in order to replace covering
indexes. In INCLUDE indexes, the included columns are kept only at the leaf nodes; in
covering indexes, all the columns are kept at all nodes, including the leaves and the
branches.
The following screenshot shows how to create an INCLUDE index in PostgreSQL:
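The screenshot is not reproduced here; an equivalent statement, with a hypothetical index name, is:

CREATE INDEX "IX_Product_Color_Include"
    ON public."Product" USING btree ("Color" ASC NULLS LAST)
    INCLUDE ("Name", "FinishedGoodsFlag")
    TABLESPACE pg_default;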
In the above index, the Color column is the index while Name and FinishedGoodsFlag
columns are configured as include columns.
Let us look at the query plans with and without the INCLUDE index for the same query:
The following screenshot shows the query plan for the above query without the INCLUDE
index and with the INCLUDE index, respectively:
The query plans show that the plan using the INCLUDE index is better than the plan for
the index without INCLUDE columns.
The fill factor is an important configuration for indexes, which is discussed below.
In the following example, FILLFACTOR 80 means that every page will be filled only to
80%, and the remaining 20% will be kept for updates:
Similarly, the fill factor can be adjusted from a script as shown in the below script:
CREATE INDEX "IX_Product_Category"
ON public."Product" USING btree
("CategoryID" ASC NULLS LAST)
WITH (FILLFACTOR=80)
However, you cannot keep a very low value for FILLFACTOR, as it will leave unnecessary
free space in the data pages. This will result in a large number of unnecessary data pages,
and when there are many pages, retrieving data requires traversing through more pages.
The default value of the FILL FACTOR is 100. This means by default all
data pages will be fully filled. The best-recommended value for the FILL
FACTOR is 80. Further, there are instances where FILL FACTOR is set to
90.
It is important to note that the FILLFACTOR value is not maintained continuously; it is
only applied during index creation and reindexing.
Disadvantages of Indexes
We have been discussing the advantages and usages of indexes, and in the previous
discussions we identified the advantages of indexes with scenarios. This might lead you to
think that you can add any number of indexes at will. However, there are disadvantages to
indexes as well.
Further, if there are many non-clustered indexes, all of those indexes have to be updated on
every write. Therefore, when adding non-clustered indexes for transaction-oriented
systems, it is essential to have an optimum number of indexes. In the case of analytical
systems, though it is recommended to have many indexes to improve read queries, during
the Extract-Transform-Load (ETL) process it is essential to disable the indexes and
re-enable them after the ETL is completed.
Storage
A non-clustered index is not part of the existing table; instead, the non-clustered index is
stored separately with pointers to the table.
Think of the index at the back of a book: when you have additional indexes,
you need additional pages to store them.
Due to this, database storage will increase. When database size increases, database backup
and restore times increase as well. Although storage is not a huge concern in modern days,
the increase in other maintenance tasks should be considered when designing indexes for
the database.
Let us look at the options for index maintenance in the following section.
Maintaining Indexes
When indexes are available, over time, with data deletes and inserts, the indexes become
fragmented. Fragmented indexes have a negative impact on query performance.
Therefore, indexes have to be rebuilt at a given frequency. During the maintenance of
indexes, CPU and memory will be consumed and user queries will be impacted. Therefore,
it is essential to choose a window where fewer user queries are impacted.
Reindexing is also needed when indexes are corrupted due to various hardware or
software reasons. In PostgreSQL, reindexing can be done at three levels: the DATABASE,
TABLE, and INDEX levels.
Similarly, this can be done using a script as shown below:
--REINDEXING AN INDEX
REINDEX INDEX "IX_Product_Category";
--REINDEXING A TABLE
REINDEX TABLE public."Product";
--REINDEXING A DATABASE
REINDEX DATABASE "SampleDatabase";
When reindexing is applied at the database level, it will consume significant resources and
take a long time. Therefore, it is recommended to reindex at the index level rather than at
the database level.
Apart from the said indexes, there are a few other index options available in other database
technologies, which will be discussed in the following section.
Filtered Index
With a filtered index, you have the option of creating an index on a selected data set. For
example, for the invoice table, you can create a filtered index for a selected region. The
advantage of this index is that the index storage is reduced; hence, maintenance of filtered
indexes is much simpler.
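In PostgreSQL, the closest equivalent is a partial index; a minimal sketch, assuming hypothetical Region and InvoiceDate columns on the Invoice table, is shown below:

CREATE INDEX "IX_Invoice_APAC"
    ON public."Invoice" ("InvoiceDate")
    WHERE "Region" = 'APAC';   -- only rows of the selected region are indexed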
Column Store Index
Only one column store index can be created per table and, depending on the database
technology that you are using, there are clustered and non-clustered column store indexes.
Further, in some database technologies, tables cannot be updated or inserted into when a
column store index is implemented. In that type of database technology, column store
indexes are not suitable for transaction-oriented databases. However, in the case of
analytical systems such as reporting, data warehouses, and OLAP cubes, performance can
be improved by including a column store index.
In the case of a non-updatable column store index, the index has to be disabled before
inserting data into the column store table and enabled again after the data load is
completed. This results in additional time when inserting data into column store index
tables.
Field Notes
Indexes are used widely in the industry, as they help to improve data retrieval
performance, as we have discussed in multiple instances.
Summary
Until this chapter, we had been discussing the functional aspects of database design. In
this chapter, we understood that non-functional requirements play a key part in database
design. We discussed performance, security, high availability, and scalability as
non-functional options in databases. We identified indexes as a key factor in improving
performance in databases to retrieve data much faster.
Next, we identified different types of indexes in PostgreSQL, such as B-Tree, Hash, GiST,
Generalized Inverted Index (GIN), SP-GiST, and Block Range Index (BRIN). Out of these
indexes, the default and most common index type is B-Tree. We looked at the two types of
index implementations, clustered and non-clustered indexes, with different scenarios.
From the different scenarios, we identified that indexes play a huge role when retrieving
data from databases. It is also recommended that query performance be verified with the
EXPLAIN plan.
FILLFACTOR is another important concept we discussed; we identified that the default
value for FILLFACTOR is 100 and the best-recommended value is 80. Although we use
indexes to improve query performance, there are disadvantages to indexes as well.
Decreased insert performance and additional storage requirements are the identified
disadvantages of indexes. Further, we identified that the covering indexes used in legacy
databases can be replaced by INCLUDE indexes much more effectively.
Over time, indexes become fragmented due to data inserts and deletes. To remove
fragmentation and improve performance, REINDEX should be done. Reindexing is
available at the database, table, and index levels, and we said that the better option is
performing it at the index level to avoid resource contention. Apart from the PostgreSQL
index implementations, we identified that filtered indexes and column store indexes are
also used in the industry with other database technologies.
Since there are a lot of myths in the industry with respect to indexes, we also looked at
challenging and interesting case studies of indexes.
The next chapter, Designing a Database with Transactions, discusses how you can ensure data
integrity and handle database errors with the help of transactions. Readers will also learn
how they can design a database using transactions.
Questions
What is the importance of implementing Non-Functional Requirements from the
database perspective?
Non-functional requirements decide the quality of the system. Since the
database is a core component of the system, by maintaining the
non-functional requirements, the quality of the system can be improved.
Performance, scalability, capacity, availability, reliability, recoverability,
maintainability, serviceability, security, regulatory compliance,
manageability, data integrity, usability, and interoperability are the main
elements of database non-functional requirements. Apart from these quality
measures, enterprise systems need to adhere to different compliance regimes
such as Sarbanes-Oxley and data protection law. Due to these factors, it is
important to achieve non-functional requirements in a database.
Only one clustered index can be created per table. Since table data is
ordered in the order of the clustered index, multiple clustered indexes
cannot be created.
During interviews, questions are asked in different ways with respect to the
clustered index. What is the behavior of the table when there are two
clustered indexes? What will happen when a second clustered index is
created? All these questions check whether you know that only one clustered
index can be created per table.
The clustered index key should be very small in size. Most likely, integer
columns are suitable. Further, it should contain sequential values, so
sequence columns are better suited. In addition to the above requirements,
the selected clustered column should be a static column that does not change
over time. In particular, columns such as names and dates are not suitable
columns for clustered indexes.
The AND clause performs better than the OR clause when the clustered
columns are used in the WHERE clause. For the OR clause, the index has to
be scanned twice before the OR operation is done; for the AND operation,
only one scan is needed.
The data warehouse is mostly used as an analytical system. This means most
of the operations are read operations, whereas during the ETL, bulk data
will be inserted. As we have discussed, indexes improve read performance
while having a negative impact on write performance. The data warehouse
consists of fact and dimension tables. Fact tables consist of surrogate keys
and measure columns. This means the fact table does not need a clustered
index, and in the dimension tables, surrogate keys can be chosen as the
clustered indexes. In the fact table, all the surrogate keys should be
configured as non-clustered indexes, as they will be joined with the
dimension tables. This will lead to a large number of indexes in the fact
table, as typically the fact table has a large number of surrogate keys. In
order to improve data load performance, indexes can be dropped during the
ETL and re-created after the ETL is completed.
The covering index includes all the columns in a query, that is, all columns
in the WHERE and SELECT clauses. Due to this, the index becomes larger,
and in the B-Tree all the columns are included in the branch and leaf nodes.
Since the index is large, there is a tendency for the index to fragment.
However, the SELECT columns are not needed for searching. In INCLUDE
indexes, the SELECT columns are placed in the include columns, and the
include columns are stored only at the leaf nodes. Due to this
implementation, the index structure is much narrower and the tendency for
index fragmentation is lower. This means INCLUDE indexes are much better
than covering indexes.
What is the importance of FILLFACTOR and what is the best-recommended
value for the FILLFACTOR?
FILLFACTOR decides what percentage of each data page is filled. If
FILLFACTOR is set to 80, which is the best-recommended setting, 20 percent
of each data page is left empty. That free space will be utilized for data
updates. Due to this, page splits will not occur, which improves query
performance.
In what instances can you implement column store indexes?
Column store indexes are mainly used in data warehouses and to process
OLAP cubes. However, depending on the database technology, tables may
not be updatable once column store indexes are implemented.
Exercise
For the database design you did for the mobile bill, identify the possible indexes.
Further Reading
Sarbanes Oxley: https://www.sarbanes-oxley-101.com/sarbanes-oxley-audits.htm
Data Protection Law: https://www.gov.uk/data-protection
B Tree: https://www.csd.uoc.gr/~hy460/pdf/p650-lehman.pdf
Create Index: https://www.postgresql.org/docs/9.1/sql-createindex.html
Reindex: https://www.postgresql.org/docs/9.4/sql-reindex.html
9
Designing a Database with Transactions
We discussed database design concepts, mainly with respect to the functional
requirements of users, during Chapters 1-7. We discussed in detail how to design a database
with different models, such as conceptual and physical models, in Chapter 4, Representation
of Data Models. During that discussion, we identified different normalization forms in order
to achieve different advantages. Then we identified different design patterns which help
database designers tackle common problems. To achieve high performance in a database,
which is very much necessary, we discussed different aspects of indexes and how to use
them as a tool to improve performance using different use cases and scenarios in Chapter 8,
Working with Indexes.
When databases are used, a single user action typically involves multiple queries. For
example, in order to raise an invoice, you will be looking at different tables. This means that
one user request has to be treated as a single business operation even though it consists of
several technical queries. If one of the queries fails, the effects of the other queries should not
persist, in order to preserve the integrity of the database. This one business operation is
considered a transaction, and in this chapter we will discuss how to achieve integrity by
means of transactions, with examples in PostgreSQL.
Let us first define what a transaction is, so that we are clear about what we are discussing in
this chapter.
Definition of Transaction
A transaction is a single business and program unit whose execution changes the content of
the database. A single business unit may contain multiple queries; however, in a transaction,
all those queries should be treated as one unit so that the database remains in a consistent
state.
If you look at the definition, a transaction should modify the database content. Therefore,
a transaction should contain Data Manipulation Language (DML) statements such as INSERT,
UPDATE, or DELETE.
The following screenshot shows the state of the database before and after the transaction:
In the above example, before the transaction, John and Jane had 850 USD together. Since it
is an internal transfer, the total should be the same after the transaction: John and Jane's
accounts should still total 850 USD.
Just imagine that there is a failure after John's account is debited but before the amount is
credited to Jane's account, as shown in the following screenshot:
This shows that before the failure the total was 850 USD, but after the failure the total is
800 USD, which means that 50 USD is lost. We introduce transactions to avoid this type of
loss. Since a transaction is considered one unit, either all changes of the entire unit take
effect or none of them do.
Let us consider another example that has multiple table updates. Let us assume you are
raising an invoice. The invoice has two steps as shown in the following screenshot:
In this scenario, without a transaction, the invoice records are stored but the customer
balance and inventory are not updated. This leaves the database in an inconsistent state.
With a transaction, since the entire operation is considered one unit, the failure of the
Update Customer Balance step means that no invoice records remain. In other words, from
the database's perspective, this transaction has never happened, and the database stays in
a consistent state.
ACID Theory
To keep transactions consistent, one approach is to maintain the ACID properties. ACID
stands for Atomicity, Consistency, Isolation, and Durability.
Atomicity
We have already discussed the atomic nature of a transaction in the Definition of Transaction
section. In that section, from the bank ATM example as well as the invoice example, we said
that the entire transaction should be considered as one unit even though it consists of
multiple technical queries. In simple terms, to satisfy the Atomicity property, either the
entire transaction should take effect or nothing should. There should not be any partial
transaction that leaves the database in an inconsistent state.
Consistency
We said during the introduction of the database transaction that its main idea is to keep the
database in a consistent state. The Consistency property of a transaction means that if the
database is in a consistent state before the transaction, it should be in a consistent state after
the transaction as well. In the bank ATM example, we saw that there were two states, before
the transfer and after the transfer, and during both states the total of the two account
balances was the same.
Isolation
Isolation in database transactions means that every transaction is logically isolated from
the others. In other words, one transaction should not be impacted by another transaction.
It is important to note that Isolation does not mean only one transaction
can run at a time. If that were the case, the purpose of the database would
be lost and databases would not be a popular choice among businesses.
Just imagine designing a system for a supermarket chain in which only one
transaction can happen at a given time; during a shopping season, it would
be chaos.
The Concurrency Control unit of the database is the component that manages isolation.
Locking and blocking are the main techniques used to achieve isolation, with deadlocks
occurring as a possible side effect of locking.
Every database system supports several isolation levels. We will discuss the isolation levels
in PostgreSQL in the Isolation Levels section.
Durability
Durability means that once data changes are made to the database, they should be
persistent irrespective of hardware failures, software failures, server restarts, and so on.
Let us say you created an invoice and the server restarts afterwards. Even after the restart,
that invoice should persist until you change it. The Database Recovery Management unit is
the component responsible for achieving durability in database transactions.
The ACID properties form the most popular transaction model, used mostly by
Relational Database Management Systems (RDBMS).
Next, we will discuss CAP Theory which is mostly used in NoSQL databases.
CAP Theory
The CAP theorem, which applies mostly to NoSQL databases and distributed systems,
stands for Consistency, Availability, and Partition Tolerance.
In distributed systems and NoSQL systems, there are multiple nodes in the database. The
combination of these three concepts is shown in the following screenshot:
Unlike the ACID properties, a system can provide only two of these three properties at the
same time. This means the available options are CA, AP, and CP, as you can see in the
intersections of the above screenshot.
Consistency
Consistency means that every node of the distributed system returns the most recent state,
or, failing that, the same previous state; no node returns a partial state. In other words,
every node presents the same view of the data. Since there can be latency in data
replication, this is often relaxed to what is called eventual consistency.
Availability
Availability means that every request receives a response and every node has read and
write access. In a distributed system with multiple nodes, it may not always be possible to
achieve availability due to network latency and other hardware latencies.
Partition Tolerance
Partition Tolerance is the ability of the system to keep operating even when nodes cannot
communicate with each other. Partitions are unavoidable in most distributed systems, which
means that you have to compromise on either Consistency or Availability; in most cases,
consistency is compromised in favor of eventual consistency.
Since no database or system can satisfy all three concepts of the CAP theorem at once, let us
look at which databases follow the different combinations of the CAP theorem:
Transaction Controls
Like other database technologies, PostgreSQL also has commands to control transactions
from SQL.
In a database, there are different types of languages as shown in the below table.
Abbreviation  Language                       Examples
DDL           Data Definition Language       CREATE TABLE, ALTER TABLE
DCL           Data Control Language          ROLES, USERS
TCL           Transaction Control Language   BEGIN, COMMIT, ROLLBACK
DQL           Data Query Language            SELECT
Out of these languages, TCL is used to control transactions.
A transaction passes through different states, such as Active, Partially Committed,
Committed, Failed, and Aborted, as shown in the following table:
State                 Description
Active                The initial state of every transaction; in the Active state, the transaction is said to be in execution.
Partially Committed   When a transaction is executing its final operation, it is said to be in a partially committed state.
Committed             If a transaction executes all its operations successfully, it is said to be committed. Once the transaction is committed, its modifications are permanently established in the database system.
Failed                When the database recovery system's checks fail, the transaction is said to be in a failed state. A failed transaction can no longer proceed; you need to restart the transaction if you wish to proceed.
Aborted               When a transaction has reached the failed state, the recovery manager module rolls back all its modifications, bringing the database back to the state it was in before the transaction executed. Transactions in this state are called aborted.
The Partially Committed state is a special state: the transaction enters it after its final
statement is executed. During this state, if there is a violation of an integrity constraint,
the transaction is moved to the failed state and then rolled back.
Every transaction has one of two outcomes: it either completes successfully or fails. When it
succeeds, the transaction is committed and the database moves to a new, consistent state.
When it fails, the transaction is aborted, which is done by a rollback.
After a transaction aborts, the database recovery module can either restart the transaction
or kill it.
Transactions in PostgreSQL
Let us see how we can use transactions in PostgreSQL by simulating the ATM transfer we
discussed in the Definition of Transaction section.
The initial data set is shown in the below screenshot:
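A minimal sketch of the assumed setup for this example follows; the table and column names match the scripts below, but the owner names and the 600/250 split of the 850 USD total are assumptions:
CREATE TABLE public."AccountBalance" (
    "ID"             int PRIMARY KEY,
    "AccountOwner"   text,
    "AccountBalance" numeric
);

INSERT INTO public."AccountBalance" ("ID", "AccountOwner", "AccountBalance")
VALUES (1, 'John', 600),
       (2, 'Jane', 250);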
Let us see how a fund transfer can be done without transaction using the following script:
UPDATE public."AccountBalance"
SET "AccountBalance"= "AccountBalance" - 50
WHERE "ID" = 1;
UPDATE public."AccountBalance"
SET "AccountBalance"= "AccountBalance" + 'AA'
WHERE "ID" = 2;
When the above statements are executed as a batch, the first statement will succeed while
the second will fail. However, since no transaction is used, the first update is applied but
not the second, as shown in the following screenshot:
Though the total of both accounts should be 850 USD, after the above operations it is
800 USD.
Let us do the same process with a transaction as shown in the following script:
BEGIN TRANSACTION;
UPDATE public."AccountBalance"
SET "AccountBalance"= "AccountBalance" - 50
WHERE "ID" = 1;
UPDATE public."AccountBalance"
SET "AccountBalance"= "AccountBalance" + 'AA'
WHERE "ID" = 2;
ROLLBACK TRANSACTION;
After the failure of the second statement, the entire batch is rolled back and the database
returns to its previous state, as shown in the following screenshot:
Since the total of both account balances is 850 USD, the database is in a consistent state.
Let us see the same script with a successful transaction as shown in the following script:
BEGIN TRANSACTION;
UPDATE public."AccountBalance"
SET "AccountBalance"= "AccountBalance" - 50
WHERE "ID" = 1;
UPDATE public."AccountBalance"
SET "AccountBalance"= "AccountBalance" + 50
WHERE "ID" = 2;
COMMIT TRANSACTION;
Now, you will see that the database has changed to a new state as shown in the following
screenshot:
Since the total of both account balances is still 850 USD, the database is in a consistent state.
Different databases offer different isolation levels to support various user requirements. In
the next section, we will look at the isolation levels that are supported by PostgreSQL.
Isolation Levels
As we discussed under ACID Theory, isolation means treating every transaction as isolated
from other transactions.
In concurrent user environments, if there is no isolation in place, the problems shown in the
following table can occur.
Phenomenon             Description
Dirty Read             A transaction reads data written by another transaction that is active and not yet committed.
Nonrepeatable Read     A transaction re-reads data it has previously read in the same transaction and gets a different value, because the data was modified by a separately committed transaction.
Phantom Read           A transaction re-reads a set of data and finds that the set has changed because of another committed transaction.
Serialization Anomaly  One transaction inserts multiple records, but only some of them are seen by another transaction.
Let us look at the isolation levels available in PostgreSQL, namely Read Committed, Read
Uncommitted, Repeatable Read, and Serializable, in the next sections.
Read Committed
This is the default isolation level in PostgreSQL like in the many database systems such as
SQL Server. In the Read Committed isolation level, only committed transactions are read
from which Dirty reads are avoided. In this isolation level, two successive SELECT will
return can see different data.
As shown in the above screenshot, at the start of Transaction 1, the data was read as A.
During this transaction, another transaction updated the value to B. This means that the
next read in Transaction 1 will return the value B.
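A minimal sketch of this behavior, using the AccountBalance table and two separate sessions (the values are illustrative assumptions; the comments indicate the interleaving):
-- Session 1
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT "AccountBalance" FROM public."AccountBalance" WHERE "ID" = 1;  -- returns, say, 600

-- Session 2, while Session 1 is still open (auto-committed)
UPDATE public."AccountBalance" SET "AccountBalance" = 550 WHERE "ID" = 1;

-- Session 1, same transaction, second read
SELECT "AccountBalance" FROM public."AccountBalance" WHERE "ID" = 1;  -- now returns 550
COMMIT;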
Read Uncommitted
This is the weakest isolation level out of the existing isolation levels. In the Read
Uncommitted isolation level, dirty reads can occur. That means that non-committed
changes from other transactions can be seen in a transaction. In PostgreSQL, Rean
Uncommitted isolation levels are treated the same as the Read Committed isolation level.
Repeatable Read
In the Repeatable Read isolation level, all statements of the current transaction can only see
rows committed before the first query of the transaction. Further, it can see any day
modifications that was executed in this transaction.
As you can see in the above screenshot, when Transaction 1 starts, the data value is A.
Before Transaction 1 ends, Transaction 2 modifies the data value to B. However, this change
is not visible to Transaction 1, even when it reads the data again.
Serializable
The Serializable isolation level ensures that concurrent transactions produce the same result
as if they had run sequentially, one after another. As you can imagine, although this
achieves the strongest isolation, it can reduce effective concurrency, because conflicting
transactions have to be rolled back and retried. This is the strongest isolation level among
the available levels.
The following table shows the summary of each phenomenon for each isolation level:
* Though these are theoretically possible, in PostgreSQL these phenomena are not
possible.
This is how you can set the different isolation levels in PostgreSQL:
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
After discussing the different isolation levels, let us discuss a couple of database design
techniques that help keep transactions short.
Designing a database using transactions
When there is a need to calculate an account balance as at a given date, you need to
aggregate all the previous records, as shown in the following script:
SELECT SUM("Amount")
FROM public."BankTransaction"
WHERE "AccountName" ='100-1235-5789'
AND "TransactionDate" <= '2010-08-04';
The above script will give you the account balance of the account 100-1235-5789 as
at 2010-08-04.
This leads to a long transaction duration. There are two types of solutions to this problem:
one is adding a column, and the other is adding another table.
Running Totals
When there is a need to retrieve a value as at a given date, we can use the Running Total
technique.
You can add another column with the running total, and you will see the data as shown in
the below screenshot:
If you want the balance as at any given date, you now need to read only one row, which
keeps the transaction short.
Such a query reads only a single record, whereas the previous query needs to read many records.
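A minimal sketch of this approach, assuming a hypothetical "RunningTotal" column that is populated whenever a row is inserted:
ALTER TABLE public."BankTransaction" ADD COLUMN "RunningTotal" numeric;

-- Balance as at 2010-08-04: read only the latest row on or before that date.
SELECT "RunningTotal"
FROM public."BankTransaction"
WHERE "AccountName" = '100-1235-5789'
  AND "TransactionDate" <= '2010-08-04'
ORDER BY "TransactionDate" DESC
LIMIT 1;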
Summary Table
Instead of an additional column, we can have another table. This will be a better solution
when you need month-end balances.
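A minimal sketch of such a summary table; the table and column names are assumptions chosen to match the query below:
CREATE TABLE public."MonthlyBalance" (
    "AccountName" text    NOT NULL,
    "Month"       text    NOT NULL,  -- for example '2010-06'
    "Amount"      numeric NOT NULL,
    PRIMARY KEY ("AccountName", "Month")
);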
If you want to retrieve the balance as at the end of 2010-06, it is a simple matter of running
a query on this summary table, instead of running an aggregate query on the Transaction
table, as shown in the below script:
SELECT "Amount"
FROM public."MonthlyBalance"
WHERE "AccountName" = '100-1235-5789'
AND "Month" = '2010-06'
With this query, only one record is read, which reduces the transaction duration.
Indexes
As we discussed in Chapter 8, Working with Indexes, an index allows a query to read only the
needed records instead of the entire table. This reduces the transaction duration.
Field Notes
In an organization with more than 4,000 employees, salaries are processed at the end
of the month. This organization had implemented transactions for all its processes.
Since the salary process takes a long time, more than 2-3 hours, the database tends to
grow large during the transaction as the log file grows, and the transaction caused
many maintenance issues. During the salary process, no other transactions are entered,
so it was decided to remove the transaction. Before the start of the salary process, a
database backup is taken, and in case of a failure the backed-up database can be used
to restore the database. This was possible only because no other transactions can occur
during the salary processing.
There was a requirement to record the modified date for a transaction table. The
database designers decided to add an UPDATE trigger to the table so that when a
record is updated, the MODIFIED DATE column is updated by the trigger. However,
after this was done, users experienced longer transaction times. This was because the
trigger execution added to the total transaction time, as a trigger is a synchronous
process. It was then decided to modify the application code and remove the trigger,
and this modification reduced the transaction execution time.
Summary
In this chapter, we looked at the transaction aspects of database design. We considered a
transaction to be a single logical business unit that must be treated as one unit, ensuring
that the database is kept in a consistent state. We identified two main transaction theories,
ACID and CAP. ACID stands for Atomicity, Consistency, Isolation, and Durability, whereas
CAP stands for Consistency, Availability, and Partition Tolerance.
In PostgreSQL, there are three main Transaction Control Language commands: BEGIN
TRANSACTION, ROLLBACK TRANSACTION, and COMMIT TRANSACTION. Every
transaction either fails or succeeds, and in the case of a failed transaction we roll it back, or
run a compensating transaction to reverse an already committed, erroneous one. To
support different isolation requirements, PostgreSQL provides the Read Committed,
Repeatable Read, and Serializable isolation levels. Unlike some other database technologies,
in PostgreSQL Read Committed and Read Uncommitted are treated the same, so dirty
reads are not possible in either isolation level.
During the database design, we identified that a few measures can be taken in order to keep
transactions short. We introduced two methods: one is adding a column with running
totals, and the other is adding a separate summary table. In both cases, the aggregate query
is replaced with a simple query so that the transaction duration is reduced.
In the next chapter, we will discuss the involvement of a database designer during the
maintenance period of a database.
Questions
Why is the transaction an important concept in databases?
During the design stage, we identify that multiple tables have to be
introduced to satisfy different business user requirements. Further, when we
follow the normalization process, duplicated data is separated into multiple
tables. This means that a single unit of a business process may need to
update multiple tables. During the modification of these tables, if one update
fails, the database will be left in an inconsistent state. In order to maintain a
consistent state in the database, transactions are used.
Explain which database modules are implemented to achieve the ACID properties of
database transactions.
Property     Component
Atomicity    Transaction Management component
Consistency  No specific component; if Atomicity, Isolation, and Durability are maintained, Consistency is maintained automatically
Isolation    Concurrency Control Unit
Durability   Database Recovery Management
Questions on ACID properties are very common in interviews and exam
papers. It is important to be able to explain the concepts and usage of the
ACID properties fluently.
What is the CAP theorem and what does it mean for distributed systems?
The CAP theorem stands for Consistency, Availability, and Partition
Tolerance. Out of these three concepts, only two can be implemented in a
system. In distributed systems, Partition Tolerance is unavoidable, which
means we have to compromise either Consistency or Availability in a
distributed system.
What specific design decisions can be made in order to reduce the transaction time?
It is important to note that there are vital decisions that need to be taken
during the database design in order to reduce the transaction time. The main
objective in this regard is to reduce the number of rows that are read or
written during the transaction. To achieve this, a summary column or a
summary table can be added so that aggregate queries can be eliminated.
Exercise
Identify how transactions can be implemented for the Mobile bill example.
Further Reading
ACID Theory: https://en.wikipedia.org/wiki/ACID
Transaction: https://www.postgresql.org/docs/8.3/tutorial-transactions.html
10
Maintaining a Database
During the last nine chapters, we concentrated our discussions mainly on database design
aspects, from planning to conceptual design and physical implementation. After database
design, we discussed the non-functional requirements of databases and identified that
indexes are key components for improving database performance. When discussing
database transactions, we identified that keeping the database consistent is an important
factor. After the above aspects are met, your database is ready for developers to start
application development.
However, as a database designer, your responsibilities are not yet complete. Typically,
database designers are much needed during the application development phase, but their
knowledge is needed even after the database is released to production, which is neglected
in most cases.
This chapter discusses how to maintain and properly design a database with growing data.
Further, this chapter emphasizes that maintaining a database is a key role of a database
designer, though it is typically neglected. In addition to these aspects, we will be looking at
working with triggers in order to support different customer needs.
When you are doing the design, it is very hard to predict the database growth and the
velocity of the data. If you design a database without considering its future, you will run
into performance and scalability issues in production. This means that the end-users as
well as the application developers will have limited options with the database, and it will
lead to complex and time-consuming maintenance tasks.
When the database is released to production, new requirements and database structural
changes will come into the picture. As the database is in production, you need to carry out
the required database changes, and it is essential to perform these changes without
impacting current business operations. If the business impact cannot be avoided, as a
database designer you need to choose the option that has the least impact on current
business operations.
Let us assume you have a table with a large number of records. For
example, let us say the table size is 100 GB and it has more than ten
million rows. Typically, though not always, a large table means that it is
frequently used by the application. When there is a need to modify the
table, such as changing a column data type or updating a column after
adding it, the modification may lead to a table lock. This will prevent the
application from accessing the table.
In this type of scenario, the database designer's knowledge is important for a successful
implementation, so that there won't be any impact on the business.
In the next section, we will see how views can be used so that database maintenance
becomes easier.
The following screenshot shows a schematic diagram for a database view:
As shown in the above screenshot, three tables are used: T1, T2, and T3. T1 and T2 are
joined on the C1 column, and T1 and T3 are joined on the C2 column to create the database
view. The advantage of the view is that the person consuming it does not need to know its
implementation; for example, they do not need to know the joined tables, the joining
columns, or any other complex logic of the underlying queries.
Next, let us see how we can create views in PostgreSQL, and see the performance aspects in
views.
Let us see the code for a few of these views. Here is the code for the
view nicer_but_slower_film_list:
CREATE OR REPLACE VIEW public.nicer_but_slower_film_list
AS
SELECT film.film_id AS fid,
film.title,
film.description,
category.name AS category,
film.rental_rate AS price,
film.length,
film.rating,
group_concat(((upper("substring"(actor.first_name::text, 1, 1)) ||
lower("substring"(actor.first_name::text, 2))) ||
upper("substring"(actor.last_name::text, 1, 1))) ||
lower("substring"(actor.last_name::text, 2))) AS actors
FROM category
LEFT JOIN film_category ON category.category_id =
film_category.category_id
LEFT JOIN film ON film_category.film_id = film.film_id
JOIN film_actor ON film.film_id = film_actor.film_id
JOIN actor ON film_actor.actor_id = actor.actor_id
GROUP BY
film.film_id,
film.title,
film.description,
category.name,
film.rental_rate,
film.length,
film.rating;
When a user executes the above view, the following data will be returned. This view will
return the film title, description, category, price, length, ratings and actors:
As you can see in the above code, the view is based on a very complex query. To retrieve the
above data set, five tables are used: category, film_category, film, film_actor, and actor.
These tables are joined with different types of joins, INNER JOIN and OUTER JOIN.
INNER JOIN and OUTER JOIN are different types of joins used to relate
different tables. Since this book is dedicated to database design concepts,
we will not be discussing the details of JOINs. However, if you wish to
learn about those aspects, please follow the
https://www.w3schools.com/sql/sql_join.asp link.
Apart from those complexities, there are others, such as grouping and manipulating data.
If you revisit the code of the view, you will see that grouping and string manipulation have
to be done to get the actors in a comma-separated format. However, end-users do not need
to know this complex logic; a simple SELECT query provides the necessary data for them.
Further, end-users can select necessary columns and include WHERE and ORDER BY
clauses as shown in the following script:
SELECT title,
description,
category,
price,
length,
rating
FROM public.nicer_but_slower_film_list
WHERE length > 2
ORDER BY title;
By looking at its output, the actor_info view seems to be a very simple view. However, let
us look at the code for the same view:
CREATE OR REPLACE VIEW public.actor_info
AS
SELECT a.actor_id,
a.first_name,
a.last_name,
group_concat(DISTINCT (c.name::text || ': '::text) || (( SELECT
group_concat(f.title::text) AS group_concat
FROM film f
JOIN film_category fc_1 ON f.film_id = fc_1.film_id
JOIN film_actor fa_1 ON f.film_id = fa_1.film_id
WHERE fc_1.category_id = c.category_id AND fa_1.actor_id = a.actor_id
GROUP BY fa_1.actor_id))) AS film_info
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
LEFT JOIN film_category fc ON fa.film_id = fc.film_id
LEFT JOIN category c ON fc.category_id = c.category_id
GROUP BY a.actor_id, a.first_name, a.last_name;
Now let us assume that you want to add a new column to the view, and that column comes
from a table not already used in the view's code. In that scenario, you only have to change
the view; for the application users it is simply another column in the view, so no major code
modifications are needed on their side.
From the Code tab, the view definition is supplied as shown in the following screenshot:
With those two configurations, the database view is created, and users or applications can
access the view.
Since the above query plan is complex, let us look at the same query plan in text format as
shown in the following screenshot:
If you run the same query that is embedded in the database view, you will find that it is the
same query plan.
It is a myth among many users that a database view will improve
database performance. It is important to understand that a database
view is a logical layer where no data is saved. When the view is accessed,
the embedded script is executed. This means there is no performance
difference between executing the view and executing the embedded
script directly.
Though database views are not meant for performance improvement (except for
materialized views), database designers create database views in order to manage
application development much better.
In the view definition, it is technically possible to use views inside views. For example, you
can create another view using the nicer_but_slower_film_list view. However, this is
not recommended. This is mainly due to the fact that views inside views might be difficult
to maintain at the later stage.
Next, let us see how triggers can be utilized during the maintenance period. We will look at
how triggers can be used in different situations in the following sections.
Introduction to Trigger
A database trigger is a set of code attached to a table or a view that executes depending on
the operation. There are three types of triggers in PostgreSQL:
BEFORE
AFTER
INSTEAD OF
The trigger can be specified to execute before the operation is attempted on a row
(BEFORE/INSTEAD OF) or after the operation has completed (AFTER). If the trigger fires
before or instead of the operation, it can skip the operation for the current row or change
the row being inserted. If the trigger fires after the event, all changes that were made are
visible to the trigger.
Further, triggers may be defined to fire for TRUNCATE, though only FOR EACH
STATEMENT.
The following table summarizes which types of triggers may be used on tables and views:
Trigger      Database Operation         Row Level  Statement Level
BEFORE       INSERT / UPDATE / DELETE   Tables     Tables and Views
BEFORE       TRUNCATE                   -          Tables
AFTER        INSERT / UPDATE / DELETE   Tables     Tables and Views
AFTER        TRUNCATE                   -          Tables
INSTEAD OF   INSERT / UPDATE / DELETE   Views      -
INSTEAD OF   TRUNCATE                   -          -
The following screenshot shows how the AFTER trigger works:
As shown in the above screenshot, when any data modifications are done by users or by
applications, the relevant table trigger is executed. For example, if an INSERT statement is
executed, the INSERT trigger is executed and so on.
Let us look at how triggers are a useful option for capturing auditing data in the following
section.
Triggers as an Auditing Option
Let us see how triggers can be implemented as an auditing mechanism. Let us assume that
we need to track the changes to the Product table. As far as the database is concerned,
modifications mean INSERT, UPDATE, and DELETE.
For auditing, we need to cover the basic three W's. Those W's are Who,
When and What. These three W's cover who did the change, When was
the change done, and What was the change. We have discussed auditing
table structures in Chapter 6, Table Structures.
Let us see how we can use triggers in PostgreSQL to implement data audit.
1. Let us create a table to store the Audit data that is shown in the following script:
CREATE TABLE public."ProductAudit"
("ID" Serial,
"ProductID" Integer,
"Operation" char(1),
"User" text,
"DateTime" timestamp
)
2. Next is to create the Trigger. For easy maintenance, we will create a single trigger
to capture INSERT, UPDATE and DELETE as shown in the following script:
CREATE TRIGGER log_Products_update
AFTER INSERT OR UPDATE OR DELETE ON Public."Product"
FOR EACH ROW
EXECUTE PROCEDURE process_product_audit();
3. The trigger executes a function named process_product_audit(). Only the tail of its
listing survives here:
RETURN NEW;
END IF;
RETURN NULL;
END;
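A minimal sketch of a complete version of this function, reconstructed as an assumption to match the ProductAudit table created in step 1, could look like this:
CREATE OR REPLACE FUNCTION process_product_audit() RETURNS TRIGGER AS $$
BEGIN
    IF (TG_OP = 'DELETE') THEN
        INSERT INTO public."ProductAudit" ("ProductID", "Operation", "User", "DateTime")
        VALUES (OLD."ProductID", 'D', current_user, now());
        RETURN OLD;
    ELSIF (TG_OP = 'UPDATE') THEN
        INSERT INTO public."ProductAudit" ("ProductID", "Operation", "User", "DateTime")
        VALUES (NEW."ProductID", 'U', current_user, now());
        RETURN NEW;
    ELSIF (TG_OP = 'INSERT') THEN
        INSERT INTO public."ProductAudit" ("ProductID", "Operation", "User", "DateTime")
        VALUES (NEW."ProductID", 'I', current_user, now());
        RETURN NEW;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
Note that the function must exist before the CREATE TRIGGER statement in step 2 can reference it.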
4. Now let us perform some modifications to the Product table:
--Data Insert
INSERT INTO public."Product"
("ProductID","Name","ProductNumber",
"Color","MakeFlag","FinishedGoodsFlag")
VALUES(1098,'Sample Product','CAX-4578','Red',1,1)
--Data Update
UPDATE public."Product" SET "MakeFlag" = 0
WHERE "ProductID" = 1098
--Data Delete
DELETE FROM public."Product"
WHERE "ProductID" =319
Let us see how data was captured to the ProductAudit table as shown in the following
screenshot:
You can extend this trigger to capture all the modified column values; the above trigger
captures only the ProductID.
AFTER triggers can be used not only for auditing but for the troubleshooting process as
well. As we discussed in Chapter 9, Designing a Database with Transactions, triggers add to
the transaction time; if a trigger is not necessary, you can disable it from the user interface,
as seen in the following screenshot:
By disabling the trigger, you can re-enable it whenever you need it, rather than dropping
and recreating it.
Triggers to Support Table Partition
In this method, we need to transfer the data to different physical tables, as shown in the
following screenshot:
As shown in the above screenshot, our target is that when data is inserted into the Order
Detail table, depending on the year of the Order Date, the row is moved to either Order
Detail 2019 or Order Detail 2020.
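A minimal sketch of such a routing trigger follows; the table names, the OrderDate column, and the function name are assumptions for illustration:
CREATE OR REPLACE FUNCTION order_detail_insert_router() RETURNS TRIGGER AS $$
BEGIN
    IF date_part('year', NEW."OrderDate") = 2019 THEN
        INSERT INTO public."OrderDetail2019" VALUES (NEW.*);
    ELSIF date_part('year', NEW."OrderDate") = 2020 THEN
        INSERT INTO public."OrderDetail2020" VALUES (NEW.*);
    END IF;
    RETURN NULL;  -- suppress the insert into the base table itself
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_order_detail_partition
BEFORE INSERT ON public."OrderDetail"
FOR EACH ROW
EXECUTE PROCEDURE order_detail_insert_router();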
It is important that triggers do not include complex logic, as this will increase the
transaction duration, which can result in unnecessary timeouts in applications.
In the next section, we will look at the design changes that need to be adopted when adding
a new column to an existing table.
Modification of Tables
As we emphasized earlier in this chapter, one of the most important tasks that database
designers have to undertake is adding a column to an existing large, high-velocity table. As
a database designer, it is your duty to satisfy the user requirement while not disturbing the
current business operation.
There are two approaches to this: adding a column to the required table, and adding a
separate table joined to the original table. Let us discuss these approaches and their issues
in the upcoming sections.
Adding a Column
The obvious and simpler method is to add a column to the existing table. Adding a column
to a small table that is not heavily used will not be a huge problem. However, problems
occur when you try to add a column to a high-volume table. Adding the column itself is
usually not problematic; however, once the column is added, you may have to populate
data into it.
For example, let us say we want to add a column called Product Category to the Order
Detail table. After adding the column, to bring it into operation you need to populate the
relevant Product Category for each Order Detail row by looking up the Product Sub
Category and Product Category tables.
If you are dealing with a large table, the update should be done batch-wise in order to keep
the table accessible to the end-users, as sketched below.
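A minimal sketch of a batch-wise update; the table and column names are assumptions, and the statement is repeated until it reports zero updated rows:
UPDATE public."OrderDetail" od
SET "ProductCategoryID" = psc."ProductCategoryID"
FROM public."Product" p
JOIN public."ProductSubCategory" psc
  ON psc."ProductSubCategoryID" = p."ProductSubCategoryID"
WHERE p."ProductID" = od."ProductID"
  AND od."ProductCategoryID" IS NULL
  AND (od."OrderNumber", od."OrderLine") IN (
        SELECT "OrderNumber", "OrderLine"
        FROM public."OrderDetail"
        WHERE "ProductCategoryID" IS NULL
        LIMIT 10000);  -- limit each batch so locks are held only briefly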
Adding a Table
When there is a requirement to add multiple columns to a table, the better approach would
be to add a separate table. Let us look at this with the help of an example.
In the above table, the Order Number and Order Line columns are combined to form the
primary key. Now let us say we want to add a few columns, such as Amount, Tax, Discount,
and Final Amount, to the Order Detail table. Obviously, you can add these columns to the
Order Detail table, provided that the table is small in size and not frequently used by the
application users. For a large, heavily used table, the new columns can instead be placed in
a separate table, for example OrderDetailExtented.
This OrderDetailExtented table will have a one-to-one relation with the Order Detail table,
as shown in the below table:
3. Then we will create a view. Though the data now lives in two different tables, a view
can be created over them, as shown in the following screenshot:
By creating a database view, end-users do not have to worry about joining the base tables
when selecting data; instead, they simply access the view.
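A minimal sketch of such a view; the column list is an assumption for illustration, and the join uses the shared Order Number and Order Line key:
CREATE OR REPLACE VIEW public."vw_OrderDetailFull" AS
SELECT od."OrderNumber",
       od."OrderLine",
       od."ProductID",
       od."Quantity",
       ode."Amount",
       ode."Tax",
       ode."Discount",
       ode."FinalAmount"
FROM public."OrderDetail" od
JOIN public."OrderDetailExtented" ode
  ON ode."OrderNumber" = od."OrderNumber"
 AND ode."OrderLine"   = od."OrderLine";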
Since indexes are a key part of the database, maintaining them is also an important process
of the database. Let us discuss Index Maintenance in the next section.
Maintaining Indexes
In Chapter 8, Working with Indexes, we discussed the importance of indexes for the efficient
usage of the database. In an ideal situation, it is better if you can identify the indexes at the
design stage. However, practically this is not always possible; most of the time, indexes
need to be added by looking at slowly running queries. When indexes are added, it is
important to ensure that they are not duplicated.
Let us look at the importance of identifying Duplicate Indexes and UnUsed Indexes in the
following sections.
Duplicate Indexes
As we discussed in Chapter 9, Designing a Database with Transactions, in transaction-oriented
systems, we need to have the optimal number of indexes as having more indexes will
hamper the performance of data writes.
Let us say we have an index on the Year and Month columns. If you need another index
on Year, Month, and Day, it is advised to drop the Year, Month index and create a single
index on Year, Month, Day.
Similarly, if an index already exists on CategoryID and you need another index on
CategoryID and FinishedGoodsFlag, the existing index should be dropped and the
following index added:
CREATE INDEX "IX_Category_FinishedGoodsFlag"
ON public."Product" USING btree
("CategoryID" ASC NULLS LAST,
"FinishedGoodsFlag" ASC NULLS LAST)
Unused Indexes
Though we create indexes, some of them may not be used in production. It is essential to
identify those indexes periodically and remove them.
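A query along these lines can be used to list candidate unused indexes (this is a sketch based on the pg_stat_user_indexes statistics view, not necessarily the exact query used in the book):
SELECT s.schemaname,
       s.relname      AS table_name,
       s.indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0        -- never used since statistics were last reset
  AND NOT i.indisunique     -- keep unique indexes, as they enforce constraints
ORDER BY pg_relation_size(s.indexrelid) DESC;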
The following is the output for the above query when it was executed in the DVDRental
Database:
Typically, unused indexes are identified on a monthly basis. However, there can be reports
that run on a quarterly or annual basis, and these reports may need indexes that are listed
by the unused index script. Therefore, do not simply drop indexes just because they appear
in the unused index list.
Finally, let us look at the design decisions a database designer has to make for standard
database maintenance tasks such as Backup and Reindex operations.
Backup
Backup is the basic process that any database administrator adopts as a disaster recovery
option. If there are bulk processes in the system, such as a day-close or month-end process,
you need to ask the database administrators to reschedule the backup time so that it does
not clash with the bulk process.
Typically, you can request the database administrators to perform the database backup
after the bulk process is completed, so that the backup contains the processed data.
Reindex
As we discussed in Chapter 8, Working with Indexes, reindexing is an important maintenance
task that should be carried out by the database administrators. However, a database reindex
consumes resources such as CPU, memory, and so on, which may reduce database
performance. Hence, as a database designer, you should inform the database administrator
about the best time to perform the reindex.
Further, there are tables in the system that are used around the clock; if you want to reindex
those tables, you need to choose a suitable time. There can be situations where multiple
reindex schedules are needed, separating the heavy index operations into different time
slots.
Summary
During this chapter, we discussed the involvement of a database designer during the
maintenance phase, after the database is released to the production environment. We
mainly identified that database views can be used to support different database changes.
Though database views cannot be considered a performance improvement, they can be
used to hide the complexity of the underlying queries and to make ongoing maintenance
easier.
We also identified triggers as a way to support different activities, such as auditing data
and physical table partitioning. However, we specified that triggers should be used only
when necessary, as they increase the transaction time, which reduces application
performance.
Another design task is the addition of columns after the database is released to production.
When adding a column, it is important not to disturb the existing business operations.
Therefore, when there is a need to add a column to a large table, we recommended that
data should be updated batch-wise after the column is added. If there is a requirement to
add many columns, we recommended adding a separate table with the same primary key.
For easy access, we said that it is advisable to create a database view joining the new and
the existing table.
We further identified that you may not be able to capture all the index requirements
during the database design stage. Hence, database designers need to identify new indexes
after the database is released to production. Another task of the database designer is to
identify unused indexes strategically so that an optimal set of indexes can be kept.
Further, we identified that the database designer has an important role to play, along with
the database administrators, in designing an optimum schedule for database maintenance
tasks.
In the next chapter, we will discuss methods of designing scalable databases and their best
practices.
Questions
Why is database maintenance an important task for a database designer?
A database designer's work starts at an early stage of the project, even at the
requirement-gathering stage, and most of the facts are identified at the
beginning of the project. However, the real usage patterns are found only
when the database is released to production and users are working with it.
At this stage, the options for change are limited, as the database is loaded
with data and is being used. Therefore, special skills and attention are
needed from database designers during the database maintenance period.
Why are database views recommended for accessing data through the application,
rather than accessing the tables directly?
If applications access the database through the tables directly, application
developers need to know the joining columns and conditions. In addition, if
there are database changes, the application code has to change as well. To
avoid these types of issues, database views can be introduced. Since
applications connect through views, in the case of modifications only
minimal application changes are needed.
How can a database trigger support physical table partitioning for a large table in
production?
When a large table exists in production, there can be a situation where you
need to do physical table partitioning based on the year or some other
parameter. This can be a tedious task, due to the fact that applications and
users are already connected to this table. However, after physically
separating the data into multiple tables, the partitioning can be achieved
with a database trigger: after creating the separate physical tables, a trigger
is created on the base table so that each row is inserted into the correct
physical table.
What are the approaches that can be taken if multiple columns need to be added to a
large and heavily used table?
Rather than adding the columns directly to the large table, a separate table
with the same primary key can be created to hold the new columns, in a
one-to-one relation with the original table. A database view joining the two
tables can then be created so that end-users and applications can access all
the columns through a single object.
Why should unused indexes be identified periodically?
Though indexes improve select or data retrieval processes, too many indexes
will decrease the data write performance in a transaction-oriented database.
Therefore, as a database designer, you need to identify the unused indexes
periodically.
Further Reading
CREATE VIEW: https://www.postgresql.org/docs/9.2/sql-createview.html
Triggers: https://www.postgresql.org/docs/9.1/sql-createtrigger.html
11
Designing Scalable Databases
After ten chapters, we have now obtained sound knowledge of database design. In the
database design discussions, we were mainly concerned with providing the functional
aspects of the customers' requirements. As a database designer, your task is not only to
deliver a database that fulfills the functional requirements of the customers but also to
satisfy the non-functional requirements. Therefore, in Chapter 8, Working with Indexes, in
the Non-Functional Requirements of Databases section, we looked at the non-functional
requirements of the database extensively in order to support its efficient usage. In that
discussion, we covered the performance aspects of the database by means of index
concepts. After discussing transaction management in order to maintain data integrity, in
the last chapter we discussed the maintenance aspects of the database. As a database
designer, you need to design your database to support future loads. Apart from the user
load, the data volume of the database will typically increase exponentially. To facilitate
this, you need to design a scalable database.
This chapter discusses the approach towards a scalable database design. We also move on
to covering database reliability.
Database scalability has two dimensions: data and requests. We will learn about them in
the following sections.
Data
It is needless to say that data is one of the key assets of an organization, and typically it
grows exponentially. Research suggests that half of the existing data is generated every
year, and another study claimed that every ten minutes we generate more data than was
produced from prehistoric times until 2003. These two facts show how data volume is
increasing, and volume is not the only thing database designers have to worry about.
The rate at which data is inserted and modified will change over time due to the increase in
transactions and users; this is called Velocity. Data may also be of different types, such as
relational, text, and so on; this is called Variety. Veracity refers to the accuracy or quality of
your data. All these aspects are challenges for database designers. In the case of scalability,
designers also have to plan for the future, which is an additional and significant challenge.
Let us see why user requests are a key factor in database scalability along with data.
Requests
Users are the other key factor when deciding on the scalability of a database. User requests
can change in multiple ways. Obviously, the number of users will increase over time, and
designers have to design the database with the view that it will. Not only the number of
users but also the users' activity will change. For example, at the time of database design
you may have only a small number of users in one geographical location; due to business
expansion or acquisitions, the number of user locations may increase. This means that a
database designed for users in one location now has to cater for users in several different
locations.
Apart from the number of users and their locations, the usage patterns will also differ over
time. For example, at the start, the database may cater for a report that runs on a daily
basis; over time, the same report may need to run every hour. When designing for a report,
it is essential to understand that its execution frequency may increase.
Having understood the need for database scalability, let us discuss the major types of
scalability methods used in database design:
Horizontal Scaling
Vertical Scaling
The following screenshot shows how databases can be arranged in different types of
scaling:
As you can see from the above screenshot, horizontal scaling distributes the same data, or
partitions of it, across multiple databases, while vertical scaling increases the capacity of a
single database.
Normally, on the Microsoft platform, Performance Monitor (also known as perfmon) is used
to evaluate these hardware parameters.
In this tool, there are a lot of counters to choose from, as shown in the following screenshot:
As you can see in the above screenshot, you can measure the CPU for the postgres process
only. By looking at this measure, you can decide whether you need to increase the CPU.
Similarly, memory and storage requirements can be evaluated with this tool.
In most cases, database administrators look at CPU, memory, and storage to improve
vertical scalability. However, apart from the server level, there are database-level
configurations that can be done to improve vertical scalability.
When applications are managed by external vendors, the most likely
option is vertical scaling. Since vertical scaling is applied to the server or
to the database, the application will not be impacted.
One of the most common ways of achieving vertical scalability is table partitioning, which
will be discussed in the following section.
Table Partitioning
Table partitioning splits the data of a table into multiple smaller physical tables.
There are two methods of table partitioning in PostgreSQL, Trigger Method and Partition
By method. We will learn about them in the following sections.
Trigger Method
In Chapter 10, Maintaining a Database, we discussed how partitioning can be implemented
using a table trigger.
In this method, as shown in the screenshot, data is diverted to the separate physical tables:
When data is accessed, only the single table relevant to the request is read. If a request
needs data from multiple tables, it can be served through a view over the physical tables,
and the combined data will be returned to the end-user.
Each partitioned table is expected to hold only a specific range of data; for
example, the Order Detail 2018 table should contain only Order Detail rows
from 2018-01-01 to 2018-12-31. To ensure data integrity, it is recommended
to implement a CHECK constraint on the date column. This guarantees that
there won't be any records outside the allocated date range, even if a request
tries to insert records directly into the Order Detail 2018 table.
However, it is important to note that introducing the trigger will increase the transaction
duration. If the trigger is complex, there can be more transaction timeouts.
Partition By Method
Though we used triggers as a workaround to implement partitioning, most database
technologies have built-in mechanisms, and PostgreSQL too has a native mechanism to
implement table partitions. Let us look at the usage of table partitioning by means of an
example. In the following steps, the OrderHeader table will be partitioned based on the
OrderDate column, on a yearly basis.
1. The following script creates the OrderHeader table, defining the OrderDate column as
the partition key:
CREATE TABLE public."OrderHeader"(
OrderID int not null,
OrderDate date not null,
CustomerID int,
OrderAmount int
) PARTITION BY RANGE (OrderDate);
2. Next, we will define the partition ranges for the OrderHeader table as shown in
the following code script:
CREATE TABLE public."OrderHeader2018" PARTITION OF "OrderHeader"
FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
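The listings for the remaining partitions are not shown here; statements along the following lines, assuming the same yearly scheme, would create them:
CREATE TABLE public."OrderHeader2019" PARTITION OF "OrderHeader"
FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

CREATE TABLE public."OrderHeader2020" PARTITION OF "OrderHeader"
FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');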
After the above scripts are executed, three partitions, OrderHeader2018, OrderHeader2019,
and OrderHeader2020, are added to the OrderHeader table, as shown in the screenshot:
You can query the OrderHeader table as a whole, or query each partition table individually.
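As a minimal sketch, sample rows inserted through the parent table are routed automatically to the matching partition (the values here are assumptions for illustration):
INSERT INTO public."OrderHeader" (OrderID, OrderDate, CustomerID, OrderAmount)
VALUES (1, '2018-03-15', 100, 250),
       (2, '2019-07-01', 101, 400),
       (3, '2020-11-20', 102, 150);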
4. When records are inserted into the OrderHeader table, each row is stored in the
relevant partition. The following script shows the data distribution for each
partition:
SELECT
nmsp_parent.nspname AS parent_schema,
parent.relname AS parent,
nmsp_child.nspname AS child_schema,
child.relname AS child,
pg_stat_user_tables.n_live_tup AS RecordCount
FROM pg_inherits
JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
JOIN pg_class child ON pg_inherits.inhrelid = child.oid
JOIN pg_namespace nmsp_parent ON nmsp_parent.oid = parent.relnamespace
JOIN pg_namespace nmsp_child ON nmsp_child.oid = child.relnamespace
WHERE parent.relname = 'OrderHeader';  -- this filter is assumed; the original listing is cut off at the page break
Maintaining partitions is also an important task. If you want to delete the data for 2018
without partitioning, a DELETE statement has to be used. With DELETE, every record is
deleted individually, which takes a long time and consumes IOPS heavily. In the case of
partitioning, however, it is only a metadata change.
Let us see how data can be removed much more easily with partitions:
1. To remove the partition, you can simply drop the partition as shown in the
following script.
DROP TABLE public."OrderHeader2018";
2. However, with DROP TABLE the data will be lost. You can instead remove the
partition from the partitioned table while keeping its data by detaching it, using the
following script:
ALTER TABLE public."OrderHeader" DETACH PARTITION "OrderHeader2019";
Partitions are often used in large transactional tables. Further, in the case of data
warehousing, partitioning can be incorporated into large fact tables as well. It is essential to
determine the range for the partitioned table at the initial stage. Though this can be altered
later, deciding it at the design stage will reduce unnecessary maintenance and downtime
tasks.
LIST is another partitioning type available in PostgreSQL, in addition to the RANGE
partitioning we discussed before. With LIST partitioning, you can partition tables by
specific values, which is very handy if you want to partition your data by location.
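A minimal sketch of LIST partitioning, assuming a hypothetical Customer table partitioned by a Region column:
CREATE TABLE public."Customer" (
    CustomerID int NOT NULL,
    Name       text,
    Region     text NOT NULL
) PARTITION BY LIST (Region);

CREATE TABLE public."CustomerAsia" PARTITION OF "Customer"
    FOR VALUES IN ('Asia');

CREATE TABLE public."CustomerEurope" PARTITION OF "Customer"
    FOR VALUES IN ('Europe');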
Vertical Partition
In some tables, there are infrequently accessed columns. If these columns are large in size,
it is better to store them outside the main table. For example, the Product table may need to
include attributes like Product Description, Product Images, and so on.
Since the Product Description and the Product images are infrequently used attributes and
there are large in size, those attributes can be separated as shown. Typically, this type of
partitioned mostly done on the master tables such as Customer, Product, Supplier and so
on. In case of transaction tables, customer reviews can be vertically partitioned.
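A minimal sketch of such a vertical split is shown below; the column definitions for the
Product table are illustrative assumptions:
-- frequently accessed columns stay in the main table
CREATE TABLE public."Product"(
    ProductID int PRIMARY KEY,
    ProductName varchar(100),
    UnitPrice numeric(10,2)
);

-- infrequently accessed, large columns are moved to a separate table
CREATE TABLE public."ProductDetail"(
    ProductID int PRIMARY KEY REFERENCES public."Product"(ProductID),
    ProductDescription text,
    ProductImage bytea
);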
Having discussed the options for vertical scalability, let us look at the
advantages and disadvantages of vertical scalability.
In addition to the above two features, from the data center perspective, vertical scaling
does not require additional physical servers, and therefore no additional cooling
systems. Due to these factors, you don't need additional hardware in vertical scaling.
Having discussed vertical scalability, we will discuss horizontal scalability in the
next section.
Horizontal scaling is about adding more nodes and more databases. There can be
many ways of configuring horizontal scalability. One of them is shown in the following
screenshot:
As seen in the screenshot, the primary database is replicated to multiple secondary
databases. There are multiple replication scenarios in different database technologies.
You can replicate the entire database instance, or only several tables. Further, you can
replicate selected columns or rows.
Database sharding is another horizontal scaling technique that can be utilized in database design.
In the case of sharding, data is not replicated; instead, it is physically partitioned. As shown
in the following screenshot, data is partitioned by customer:
As shown in the above screenshot, each customer's data is grouped into one database. When a user
request comes to the primary database, the request will be diverted to the relevant database
depending on the customer. The challenge in database sharding is replicating
master data to the shards. If you have a large number of shards, the master data has to be
replicated to each of them. There are further complications if customers move
between shards. Though this is not a frequent task, as a database designer you need to plan
for such tasks.
Let us look at the Advantages and Disadvantages of Horizontal Scalability in the following
sections.
After looking at vertical and horizontal scaling, let us look at how to improve database
scalability using in-memory databases.
Though reading from memory is faster, it should be noted that memory devices are
more costly than disks. Therefore, as a database designer, it is a challenge to choose what
should be cached and how long it should be cached.
Let us look at a basic architectural diagram for memory-enabled data access, as shown in the
following screenshot:
First, the requested data is checked for its existence in the cache. If it is available, the
data is read from the cache. If not, the data is read from the database and loaded into the
cache.
Like database design patterns, there are design patterns for database Scalability that will be
discussed in the following section.
Load Balancing
One of the most common design patterns for database scalability is load balancing. In the
load balancing pattern, the load balancer decides which database instance should be accessed.
There are several techniques for selecting a database instance from the available database set:
Random: Any database instance is selected without any specific logic.
Round Robin: Every database instance gets an opportunity in turn to serve the application.
Least Busy: The least consumed database instance, from a CPU or number-of-connections
perspective, is selected.
Customized: In practice, it might not be possible to replicate every database object, and some
instances might be read-only while only a smaller number of read-write database instances
are available. Therefore, depending on the request type, requests are diverted to the relevant
database instances.
The most common technique of these is Round Robin. However, with the round-robin
technique, there can be situations where 100% scalability is not achievable.
Master-Master or Multi-Master
We use data duplication to achieve database scalability, as discussed in the Horizontal
Scaling section. For data replication, the most common technique is known as
the Master-Slave or Publisher-Subscriber technique. In this technique, one node is
responsible for data publication, and a number of nodes subscribe to the
publisher. However, this is considered a single point of failure. A single point of
failure means that there is a place in the configuration whose failure will cause the entire
system to fail. This phenomenon is shown in the following screenshot.
In the above configuration, even if Server B fails, Server A can communicate with Server C.
However, if Server A fails, the entire system will fail.
In the above Multi-Master configuration, even if one server fails, the other two servers will
still be able to communicate.
Connection Pooling
Though we may think that opening a database connection is a very simple operation, it is
actually an expensive one. Connection pooling is used to keep database
connections open so that they can be reused. This technique avoids reopening
the network connection, revalidating server authentication, re-checking database
authorization, and so on.
Without connection pooling, a connection request can take 40-50 milliseconds,
whereas with connection pooling it takes only 1-5 milliseconds. Further, connection pooling
reduces the chance of the server crashing. In PostgreSQL, there is a limit on the number of
connections. However, by using connection pooling, many client connections can be served
without crashing the PostgreSQL server.
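As a quick way to inspect the configured connection limit and the number of connections
currently open, the following queries can be used (a minimal sketch; a pooler such as
PgBouncer would sit in front of these connections):
SHOW max_connections;                      -- configured connection limit
SELECT count(*) FROM pg_stat_activity;     -- connections currently open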
In the modern era, most organizations are exploring opportunities in cloud
infrastructure. Let us look at how the cloud can be utilized to improve database scalability.
Most cloud vendors support different cloud infrastructure models such as
Infrastructure as a Service (IaaS), Software as a Service (SaaS) and Platform as a Service
(PaaS). In the case of IaaS, since you are responsible for the hardware configuration, you
can increase or decrease the configuration when needed to improve scalability. In PaaS
infrastructure, configurations are dictated by the vendor. However, since
most vendors offer different PaaS scalability tiers, you have the option of selecting between
them when the need arises.
Field Notes
A database is chosen to store sales transactions. In this database, purchase orders,
orders, invoices and payments are stored along with the customer and product data.
Since this database is a highly transaction-oriented system, the database was
highly normalized. After a few years, management requested various types
of reports. Many of those reports were analytical reports such as monthly
comparison reports, Debtors Ageing Reports, and so on. As these reports need a
lot of table joins and a large volume of data, there were two issues. The first was that
the reports took a long time to process, and the other was that during report
processing, transactions were delayed. To avoid both issues,
it was decided to create a separate reporting database with de-normalized data
structures. On a daily basis, data was transferred to the reporting database using an
ETL mechanism. Since these are analytical reports, the management was happy
to retrieve data with a maximum of one-day latency.
Summary
In this chapter, we identified that there are mainly two types of scalability: vertical and
horizontal scalability. We found that vertical scalability is the easiest scalability option
from the application point of view, as the required application changes are
minimal. Under vertical scalability, we discussed adding more CPU, memory and disk storage,
as well as indexes and partitions. We identified that there are different types of partitions,
such as vertical, horizontal and hybrid partitions. In PostgreSQL, there are two types of
native partitions, RANGE and LIST. Apart from the native partition options, we looked at
how triggers can be utilized to implement partitions.
The modern scalability technique is horizontal scalability, where the database is replicated.
Replication and sharding are the most common techniques in horizontal scalability. In this
scalability method, there can be read/write nodes as well as read-only nodes. Depending on
the request types, connections should be diverted to the relevant database. The in-memory
database is another modern technique that we discussed under scalability.
We further discussed a few database scalability design patterns such as Load Balancing,
Peer-to-Peer, and Connection Pooling. Since most organizations have set
their vision towards cloud infrastructure, we discussed how the cloud can be utilized to
improve database scalability. In this discussion, we saw that IaaS and PaaS cloud
infrastructure can be effectively utilized to improve database scalability.
In the next chapter, we will dedicate our discussion to another important non-functional
requirement called database security.
Questions
Why is database scalability a challenging aspect of database design?
Database scalability is the ability of the database to cater to customer requests
even with the increasing demands of the users. These increasing demands may
not be possible to visualize at design time. Further, user usage patterns will
change dramatically due to business needs. Since database design is done at an
early stage, it is a challenging task to design a database for future scalability.
What are the instances where vertical scaling is better than horizontal
scaling?
There can be instances where database objects cannot be separated due to the
application design. Further, due to licensing concerns, it might be difficult to
separate the databases. This means that there are instances where vertical
scalability is the only option you have to work with.
have the option of improving the configuration whenever they need, with less
cost.
Further Reading
Replication: https://www.howtoforge.com/tutorial/postgresql-replication-on-ubuntu-15-04/
Table Partitioning: https://www.postgresql.org/docs/10/ddl-partitioning.html
12
Securing a Database
In Chapter 8, Working with Indexes, we discussed how important it is to implement the non-
functional requirements for a database. In that discussion, we noted that identifying
non-functional requirements is challenging, mainly because your clients will not be
able to explain them to you. Therefore, it is your duty to explore the non-functional
requirements for the database. During that discussion, we identified a few non-functional
requirements such as performance, scalability, high availability and security.
Although security is one of the most important aspects of a database, it is often neglected
during the database design phase. This is mainly because database designers are
more concerned with the schema design than with the non-functional requirements. Since
database security is an integral part of a database, we have dedicated this chapter to
security aspects in database design. This chapter will help you understand how a
database can be secured through a step-by-step approach.
A lapse in database security can lead to the following consequences:
1. Data loss
2. Loss of availability
3. Loss of privacy
4. Loss of reputation
5. Loss of income
6. Penalties
If an unauthorized user enters the system, he can erase part of the data. When it
comes to erasing data, it can be one or more records, one or more tables, or an
entire database. No matter how large the erased portion is, it is a data loss.
If unnecessary users are logging into the database, those users will consume a lot of resources
such as CPU and memory. Due to these unauthorized users, authorized users will not have
access to the databases and the database will become unavailable to them.
In sectors such as health care, data privacy is a must, as personal
records are stored in the databases. Apart from the health sector, data such as date of birth
(DOB) and social security numbers (SSN) have to be protected from others. If this data is
compromised, your clients' privacy is violated.
When you lose your reputation, you will automatically tend to lose revenue. Apart from
the loss of revenue, there can be situations where your organization will be penalized
legally for not taking adequate precautions in its security implementation. This
means that you are exposed to negative financial impact in two ways.
Due to these factors, it is essential to pay attention to data security as a proactive measure
to secure data in the database. However, it is also important to plan for the case where a
data breach does occur. For example, you need to take actions to avoid data losses and to
plan for data losses with mechanisms such as data disaster recovery systems. Similarly, as a
database designer, you need to take actions to avoid unauthorized access. However, in the
case of unauthorized user access, as a database designer, you need to implement methods
to detect and recover from it.
Let us look at historical instances of data breaches and their properties.
The above chart shows that date of birth, SSN and address are the prime targets of data
hackers. This indicates that you need to put extra security measures in place when storing
such types of data.
There is a common belief that most data breaches are external. Even though the
majority of data breaches are external, there is an upward trend in internal breaches, as
shown in the following screenshot.
This screenshot shows that in 2018, 34% of security breaches were internal, up from 25%
in 2016. This upward trend underlines the need for database designers to implement security
for databases to protect them not only from external parties but from internal
parties as well.
The following screenshot shows the per capita cost of data breaches by industry in
2018.
The health sector has the highest cost of breaches, close to double the cost of
the financial sector, which is the second-highest.
The above factors indicate the severity of the security aspect in database
design. There are three main mechanisms for securing a database:
1. Authentication
2. Authorization
3. Encryption
Implementing Authentication
Authentication refers to identifying the user, or verifying that the user is who they claim to be. In
typical applications, there are several ways of identifying the user, such as passwords, bar
codes, swipe cards, fingerprint scans, Radio-Frequency Identification (RFID) cards and so on. For
a database, the most common authentication method is a user name and password, as the other
authentication modes are typically handled by applications.
As we know, we need to provide a user name and a password to log in to an operating
system. Similarly, we need to authenticate when connecting to the database to access data. In
some Database Management Systems (DBMS), you can allow operating system users to
connect to the database. This configuration reduces the number of users to manage. In
contrast, some DBMSs keep a separate list of users in the database system itself.
Limit the number of connections - Sometimes, the same user account is used to
connect to the database from different applications. Allowing too many
connections may lead to difficulties in user management.
Allowable Login Times - You can limit the user to logging in to the system at specific
times. For example, you can define that a user can connect to the database
on weekdays between 9 AM and 5 PM.
Not a login account - You can create user accounts only to be used for connections from
applications. It is advisable to set such user accounts as not-a-login accounts so
that unnecessary logins can be avoided.
Superuser - A user who can do anything on the server. Since this user can perform any
action on the server, it is important to create as few superusers as possible.
Password Policies
The password is an important component of authentication as it is the only secret the user
holds to connect to the database. Due to this importance, several measures are taken to protect
the password from being hacked.
The major way to protect the password is to ensure that the password is strong. Several
rules are implemented by various databases to keep passwords strong. Typically, a
strong password needs to comply with all of the following rules.
Apart from the password settings, there are other tasks that should be carried out by the
database designer or the database administrator.
Reset the password on the first login. Since user names are created by the database
administrator, he knows the password that he set for the created user. Therefore,
the password must be changed by the user on first login.
If the user account is created for a real user, not for an application to connect to
the database, it is essential to set a password expiration date.
Some of these options may not be available in every database technology. However,
even if these options are not available, as a database designer, you need to implement
workarounds. For example, if you cannot set an expiry date for the password of a user
name, you need to implement a separate monitoring mechanism to notify the system
administrators about expired passwords.
Let us look at how users can be created in PostgreSQL in the following section.
The above screenshot shows the Logins and Group Roles available in the server. From
the icons, logins and group roles can be clearly distinguished.
1. First, define the user name. A best practice is suggested in the Best Practices for
Security section. Optionally, you can define comments for the created user, as
shown in the screenshot below.
2. Define the password, account expiry date and the connection limit. How to set a
strong password was discussed earlier in the Password Policies section.
In the above screen, the connection limit can be set as well. Unlimited connections are possible
when the Connection Limit is set to -1, which is the default setting. When the password is not
provided, the user will be considered a role, which will be discussed in the Implementing
Authorization section.
3. Additional configuration can be done for the login, as shown in the following
screenshot.
As shown in the above screen, you can configure whether this login can be used to log in to the
database instance and/or whether it is a superuser.
4. The following script creates the login:
CREATE ROLE dba_admin WITH
LOGIN
NOSUPERUSER
NOCREATEDB
NOCREATEROLE
INHERIT
NOREPLICATION
CONNECTION LIMIT -1
VALID UNTIL '2020-09-30T20:26:54+05:30'
PASSWORD 'xxxxxx';
COMMENT ON ROLE dba_admin IS 'This is the account for the database
administrator';
Please note that when scripting out the login account, the password is not scripted as a
security measure.
The next stage of securing data in the database is authorization, which will be looked at in the
following section.
Implementing Authorization
As we discussed in the Implementing Authentication section, authentication allows a user to
log in to the database system by verifying his identity. Authorization, on the other hand, is
granting authenticated users permissions on different objects in the database.
Database authorization is the process of establishing the relationship between database
objects and users through different authorization modes such as SELECT, INSERT, UPDATE and
so on. This definition means there are three components in database
authorization. They are:
1. User - the authenticated user who is requesting access to the
database, database objects or data.
2. Objects - in the context of databases, there are a lot of objects, such as tables,
views, procedures and so on. The objects can be narrowed down to columns
and rows.
3. Operation - the type of action that the authenticated user is
requesting. It can be SELECT, INSERT, UPDATE or DELETE.
GRANT is the common command in most database technologies to provide
authorization to different users. We will discuss the GRANT options in PostgreSQL
in the Providing Authorization in PostgreSQL section.
Having discussed the basics of database authorization, let us discuss a few important
aspects of it. The first of them is roles, which will be discussed in the next
section.
Roles
The basic concept of database authorization is that the authenticated user should be given
the privileges necessary to do what they need to do in the database, and no more. Even for a
medium-scale organization, there can be more than twenty to thirty users. Most of these users
share a common set of permissions. For example, developers may have read and write
permissions, whereas administrators will have a different set of permissions.
If you have to configure authorization user by user, it becomes a tedious task and will
run into many maintenance issues.
Group roles are assigned different permissions, and users are added as members of them.
Those users do not have privileges assigned explicitly. However, since a user is a member
of a group role, he inherits the privileges of the group role.
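A minimal sketch of this approach is shown below; the group role name and the table are
illustrative, while dba_admin is the login created earlier:
-- create a group role that cannot log in on its own
CREATE ROLE developers NOLOGIN;

-- grant the permissions once, to the group role
GRANT SELECT, INSERT, UPDATE ON Employee TO developers;

-- make an individual login a member; it inherits the group's privileges
GRANT developers TO dba_admin;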
Conflicting of Privileges
A user can be a member of multiple group roles. Every group has different privileges.
When a user logs into the system, he gets the privileges of all his groups. However,
there can be situations where permissions conflict. For example, if the user has SELECT
permission on a table, whereas a group role he is a member of has an explicit denial of
permission on the same table, then there is a conflict of privileges.
In case of conflict of privileges, there is a golden rule that is listed in the following
information box.
Denial of Access always outweighs grant of access except when the user
is the system admin or the superuser.
This golden rule can be applied to resolve many conflicts. Further, when you enable a
user with superuser access, you cannot control them through denial of permission.
Let us look at a few combinations of privileges and the resulting effective privileges in order to
understand conflicts of privileges.
Let us look at how row-level security is an important design concept in database security.
When data is inserted into a table, the data can be relevant to different entities. If there
is a requirement that one user should see only one set of data, then the
implementation becomes complex.
For example, if you look at the following screenshot, within the same data set there are different
logical partitions depending on the location.
If there is a requirement that one user should see only the data for one location, you need to
physically divide the table by location. After dividing it into multiple tables,
each relevant user can be given permission on the relevant table. For the users who need to
access all the locations, one view can be created by combining all the tables. However, there
are drawbacks to this approach:
If there are a large number of partitions, you need to create many physical tables.
For example, if you are implementing physically separated tables for sales
representatives, you will end up with a large number of tables.
When new partitions are added, you need to add a new table, grant
permissions to users, update the view, and so on. This leads to a lot of
maintenance overhead.
After the table is created, it has to be enabled for ROW LEVEL SECURITY, as
shown in the code below:
ALTER TABLE public."BankTransaction"
ENABLE ROW LEVEL SECURITY;
Then policies are created for each user, providing the necessary permissions. The details of
enabling RLS in PostgreSQL can be found at
https://www.postgresql.org/docs/10/ddl-rowsecurity.html.
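A minimal sketch of such a policy is shown below, assuming the BankTransaction table has a
Location column and a uk_user role that should see only UK rows (both assumptions):
CREATE POLICY uk_only_policy ON public."BankTransaction"
FOR SELECT
TO uk_user
USING (Location = 'UK');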
Since PostgreSQL is the database technology that we have selected, let us see how
authorization is implemented in PostgreSQL.
PostgreSQL provides the following default roles:
pg_execute_server_program
pg_monitor
pg_read_all_settings
pg_read_all_stats
pg_read_server_files
pg_signal_backend
pg_stat_scan_tables
pg_write_server_files
Refer to https://www.postgresql.org/docs/11/default-roles.html to find out the
allowed access for each default role.
When creating a login, there are two options for providing authorization: one is from the
Privileges tab and the other is from the Membership tab.
The set of privileges on the Privileges tab defines whether the login can create databases,
create roles and so on.
If the login is a superuser, all the permissions are allocated to it.
The next option is adding the login to different group roles from the Membership tab, as
shown in the screenshot below.
Let us see how a user is granted permissions in PostgreSQL with the GRANT command.
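The following is a minimal sketch of typical GRANT statements; the roles and table are
illustrative. The last statement grants access to selected columns only:
GRANT SELECT ON Employee TO reporting_user;
GRANT SELECT, INSERT, UPDATE ON Employee TO app_user;
GRANT SELECT (EmployeeID, EmployeeName, Department) ON Employee TO hr_user;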
The last GRANT command shows that you can grant users access to selected columns only,
rather than letting them select all the columns. For example, if there is a salary column in the
Employee table, you can allow the human resources team to select all the
columns except the salary column. For the finance team, you can allow all the
columns except the rating column, which should only be viewed by the human resources team.
As shown in the following screenshot, an end user can be given access to a view. This view
can be based on one or more tables. For a user to access the view, he does not need permission
on the underlying tables.
Let us grant permission to the user dba_admin on a view called film_list in the DVDRental
database.
GRANT SELECT ON public.film_list TO dba_admin;
With the above grant of permission, the user dba_admin can access the film_list view, as
shown in the screenshot below.
Then, let us access one of the tables that the view reads from.
The error shown in the above screenshot verifies that the user cannot directly access the
tables.
Let us look at what kinds of actions can be implemented in order to avoid SQL injection in
the following section.
Let us assume you are running the following query in the application:
SELECT * FROM Users
WHERE UserName ='John' AND
Password = 'XXXX'
If you provide an invalid user name and password, the above query returns no records.
However, with string-building techniques the query can be modified into the following
code:
SELECT * FROM Users
WHERE UserName ='' OR 1 =1 -- AND Password = 'XXXX'
The above will ignore the password part of the query and will return records. If the
application is written in such a way, an unauthorized user can log in to the system.
Let us see what possible attacks can be carried out using SQL injection.
As you can see from the above list, SQL injection can be used to retrieve data as well as to
shut down a database instance.
Let us see what measures can be implemented to prevent SQL injection attacks.
The most important way to prevent SQL injection is to provide only the necessary
permissions to the user. For example, even if the end user is able to submit a
DROP TABLE or DROP DATABASE statement, it will fail if they do not have the
proper permissions. Further, if you are providing user access via views, their
INSERT, UPDATE and DELETE commands will fail.
Use Procedures instead of Inline SQL Queries.
The SQL injection technique is built on string building. String
building is possible only if you are using inline queries. If
applications use procedures to retrieve data, SQL injection can be
eliminated (a sketch is shown after this list). As a database designer, it is essential to plan
for procedures even at the design stage of the database.
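A minimal sketch of this idea, using a PL/pgSQL function and the Users table from the
earlier example; the function name and parameters are illustrative:
CREATE OR REPLACE FUNCTION authenticate_user(p_user_name text, p_password text)
RETURNS boolean AS $$
BEGIN
    -- the parameters are never concatenated into the SQL text,
    -- so injected strings cannot change the query structure
    RETURN EXISTS (
        SELECT 1
        FROM Users
        WHERE UserName = p_user_name
          AND Password = p_password
    );
END;
$$ LANGUAGE plpgsql;

-- the application calls the function instead of building an inline query
SELECT authenticate_user('John', 'XXXX');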
There are a few other techniques that can be applied by application developers, such as
limiting text box sizes, string replacement, string validation and so on. In this
section, we looked at the prevention techniques that can be adopted by database
designers and database administrators.
In the next section, we will look at how encryption can be used to protect the data in your
database.
CREATE TABLE Users
(UserID int,
UserName character varying (50),
Password character varying (500)
)
Please note that we have created only a limited number of columns, just to demonstrate the
password-encryption capabilities.
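A minimal sketch of storing and verifying a password with the pgcrypto extension's crypt()
function and the bf algorithm is shown below; the values are illustrative:
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- store the password hashed with a Blowfish (bf) salt
INSERT INTO Users (UserID, UserName, Password)
VALUES (1, 'John', crypt('secret_password', gen_salt('bf')));

-- verify a password at login time
SELECT UserID
FROM Users
WHERE UserName = 'John'
  AND Password = crypt('secret_password', Password);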
We have used the bf (Blowfish) algorithm in the above example. Further, more algorithms are
supported in PostgreSQL, as shown in the table below.
1. Storage - Encryption needs more storage than plain text. Though storage is not a
major concern these days, if you are storing database backups, you need to
account for the additional storage requirement.
2. Performance - With encryption in place, you need to encrypt data when
storing it and decrypt the data when reading it. These two additional processes mean
that there is a performance overhead on both writes and reads.
We will now look at how data auditing is implemented with respect to the security
aspects of database design.
Data Auditing
Auditing is mainly used as a security option to satisfy many data protection laws.
Since there are many data and structural modifications in a database, it is essential to
keep track of those modifications.
In PostgreSQL, a few options are available for auditing, which are listed below.
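One of those options is trigger-based auditing. The following is a minimal sketch, with an
audit table and function of our own naming, attached here to the BankTransaction table:
CREATE TABLE public."DataAudit"(
    AuditID bigserial PRIMARY KEY,
    TableName text,
    Operation text,
    ChangedBy text DEFAULT current_user,
    ChangedAt timestamptz DEFAULT now(),
    RowData jsonb
);

CREATE OR REPLACE FUNCTION audit_row_change() RETURNS trigger AS $$
BEGIN
    IF (TG_OP = 'DELETE') THEN
        INSERT INTO public."DataAudit"(TableName, Operation, RowData)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
        RETURN OLD;
    ELSE
        INSERT INTO public."DataAudit"(TableName, Operation, RowData)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- attach the trigger to any table that must be audited
CREATE TRIGGER trg_audit_bank_transaction
AFTER INSERT OR UPDATE OR DELETE ON public."BankTransaction"
FOR EACH ROW EXECUTE PROCEDURE audit_row_change();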
Having a separate audit database may cause issues if you are implementing triggers to
generate the audit data, as there will be distributed transactions. Distributed
transactions may cause performance issues for the original transactions.
Let us look at the best practices that should be implemented by database
designers.
Field Notes
An organization has a large number of support staff members who are
responsible for updating data files sent by their clients. Sometimes, due to invalid
file uploads, data was overwritten, raising concerns over the stability of the support process.
The application team was able to limit the number of issues after adding a lot of
validations and restrictions to the process. However, there was still some
mishandling of those support files. It was then decided to implement a data
disaster recovery system (DDRS). In the proposed DDRS, data is
replicated to a read-only instance; however, the read-only instance is
updated only after two hours. This means the secondary instance is always two
hours behind the primary. In case of an invalid data load, data administrators
can restore the data from the secondary instance. In addition to the DDRS feature,
this instance can be used as a reporting instance with limited features.
A pharmaceutical organization has many medical representatives around the
country. Since medical representatives were operating locally, the organization
decided to centralize the data in a single database for better analysis. Later, they
decided to extend this analysis facility to the medical representatives. However,
since the data is in one table and medicine is a fiercely competitive domain, one
representative should not be able to access any other representative's data.
Therefore, before extending the feature to medical representatives, they applied
Row Level Security to the table. With Row Level Security, each medical
representative can access only his own data, while the organization is able
to analyze the entire data set.
Summary
In this chapter, we discussed another important non-functional requirement, which is
security. We identified that due to security lapses, there can be data loss, loss of availability,
loss of privacy, loss of reputation and loss of income. Further, we looked at how data
breaches have occurred historically, across different years and domains.
We identified the three main options in security, which are authentication, authorization
and encryption. Authentication is verifying that the user is who they claim to be.
Authorization is providing access to different objects. We discussed that authorization is
more complex since it has to deal with different database objects, such as tables, views,
columns, procedures and so on. Further, there are different types of operations such as
SELECT, INSERT, UPDATE, DELETE and so on. Apart from these complexities, role
conflicts and row-level security add to the complexity.
SQL injection is a string-building technique that is used to attack databases, and we
discussed how to avoid SQL injection from the database design perspective. Data auditing
is another data security technique that we discussed. We also discussed a few field notes
with regard to database security.
Questions
Can we deny permission to a superuser?
This question is often asked to mislead candidates during interviews. It is
possible to deny permission to a superuser. However, those denials of
permission will have no effect, as he is the superuser.
If a user is granted permission to select data from a table, while a role the user
is a member of has a denial of permission to select data on the same table,
what is the effective permission when the user logs into the database?
This question can be asked by interchanging the group and the user to test your
knowledge of security. Whatever the scenario, whether it is the user or the
group, denial of permission always outweighs the grant of permission if the user
or the group is not a superuser. However, if this question is asked specifically about
PostgreSQL, note that there is no explicit denial of permission in PostgreSQL.
Further Reading
Insider Threats Statistics: https://www.ekransystem.com/en/blog/insider-threat-statistics-facts-and-figures
Encryption: https://www.postgresql.org/docs/8.1/encryption-options.html
13
Distributed Databases
In Chapter 11, Designing Scalable Databases, we briefly discussed horizontal
scalability, which can be considered a form of distributed database. Apart from that discussion,
most of our design discussions have been confined to centralized database management
systems.
In modern times, business is moving away from legacy centralized business
processes. Since technology should align with the business process, systems are designed
with distributed patterns. To facilitate distributed systems, databases have to be
distributed.
In this chapter, we will discuss the need for distributed databases and the design concepts
of distributed databases. Later, we will discuss real-world implementations of
distributed databases.
Though businesses may be centralized in data storage, from the business process point of view
there are many different logical units. Some of those logical units are branches, offices, factories,
departments, projects, business units, operating segments, sister companies, divisions and
so on. Due to the growth of the business and the competition with its competitors, it is
often better to separate business processes by logical unit, such as departments,
projects, divisions and so on. To align with these business units, it is important to
distribute the data to match the business process distribution.
Apart from data distribution by logical business unit, data distribution can be done
with respect to functionality as well. We will discuss the different mechanisms for data
distribution in the Architecture for Distributed Databases section.
Another technical advance that has helped the implementation of database distribution is the
increased efficiency of network systems. Since the strength of the network connectivity
plays a major role in distribution, an efficient network is a must for the success of a
distributed system. Modern improvements in network engineering have paved the
way for the implementation of distributed databases.
Let us look at a basic diagram of a distributed database system in the following
screenshot.
In the above screenshot, there is a data distribution framework to which the user connects.
Depending on the data distribution method, the framework decides how to connect to
the relevant database. Depending on the request, it may have to access one or more
databases. These databases can be located in multiple physical locations. For the user, it is
one logical database, though physically it is three databases. If the system were
centralized, all three databases would be implemented as a single database. This simple
distributed database architecture shows how a single database can be distributed across
multiple databases and how those databases are accessed.
Having understood the basics of distributed databases, let us see what the properties of
distributed databases are.
Local Independency
With the implementation of distributed databases, data can be stored locally.
With local storage of data, administrators have more control over the data.
For example, if the distributed databases are implemented with geographical
separation, different data fragments are located at the local sites. Due to this
separation, local administrators have more control over their data. In the case of
maintenance, local administrators can choose a better maintenance window so
that the majority of users are not impacted. If the data is centralized, it is difficult
to choose a single maintenance window that supports all the users without
impacting their business functions.
High Performance
Since data is stored closer to the user locations, user queries hit the local
database, reducing the network latency to the database. On the other hand, you
are dealing with a smaller data set compared with a large centralized database.
With the smaller data set, you will get faster results. In addition, with
distributed databases, all user queries are spread among the existing
databases in the distributed architecture. With this architecture, user contention
is lower in the distributed database than in the centralized database. Apart
from user contention, CPU and I/O contention are also lower than in a
centralized database.
Apart from these obvious benefits of distributed databases, designers and
database administrators can work locally to improve performance. As we
discussed in Chapter 8, Working with Indexes, in the Disadvantages of Indexes section,
indexes will decrease write query performance. However, as a database
designer, your scope is now narrowed to a smaller decentralized database rather
than a large centralized database. This smaller database allows database
designers to create indexes considering only the local query patterns.
Improved Reliability
Reliability is an important property that has to be achieved from
databases. In distributed databases, data is replicated with different techniques
such as replication, mirroring and so on. With data replication, data is duplicated
to a different site. In case of a failure, at least on a temporary basis, the users of
the failed site can be directed to a working site. This means that with
distributed databases, database reliability can be achieved.
Expansion Maintainability
Expansion is unavoidable in database systems. In a centralized system, the rate of
expansion will be high. When the expansion is high, database administrators
have to work on CPU, memory and I/O. When the database is distributed,
expansion is much slower and administrators have enough time to work on
expansion strategies.
In the distributed database architecture, when the need arises to expand, new
sites can be created with less hassle. Since the distributed architecture itself supports
horizontal expansion, less effort is needed. However, in the case of a centralized
database architecture, horizontal expansion is a costly operation as it needs more
time and effort. Vertical expansion is often the only feasible expansion for a
centralized database, whereas both vertical and horizontal expansion are
possible in distributed databases.
Complexity
From the very simple example we discussed in this chapter, we can understand
the complexities of distributed database systems. In simple terms, instead of one
centralized database, you now have multiple databases, which increases the
network complexity of the system.
Though every site keeps its data locally, data needs to be shared
between sites. To share the data, replication, mirroring or log-based sharing
must be introduced. This data sharing further increases the complexity of the
system. These techniques not only increase the complexity of the system itself but
also increase the complexity of tasks such as maintaining the system.
Security
In a centralized database implementation, security needs to be applied to a
single database, which raises fewer security concerns. Since there are multiple
databases at multiple sites in a distributed system, security becomes a
challenge. Since most of the databases are connected through a network, network
security becomes a concerning factor. As we discussed in Chapter 12, Securing a
Database, at each database we need to maintain authentication and
authorization.
Further, when new users are added to or removed from the system, you may have
to perform this change in many databases. Not only the users, but authorization for
different levels of objects also has to be maintained in all the databases.
Cost
In a distributed database system, more nodes need to be added. When more
nodes are added, network connectivity should be improved. These additional
hardware requirements mean that there is an additional cost involved with
distributed databases. Apart from the hardware cost, additional software
applications are required to support replication or mirroring.
With more nodes in place, more investment is needed in monitoring
the distributed database system, which again increases the cost.
When databases are distributed, more database administrators are required.
Further, you need to recruit highly skilled personnel to manage the
distributed database system. All of these factors indicate that you need to spend
more on a distributed database than on a centralized database.
Maintenance
When there are multiple nodes and replication between nodes, the maintenance
effort for the system increases. When there are multiple nodes, software patches
have to be applied to all the nodes. In the case of a centralized
database, patches would be applied to only one database node.
Lack of Resources
Distributed databases need special software and skills. For example, a
distributed database needs more skilled database designers, and to maintain the
distributed system, skilled database administrators are required.
Given these advantages and disadvantages of distributed databases, when you are
choosing a distributed database architecture, you need to find the optimum solution as a
database designer.
One of the most common methods of distributing a database is replicating the entire
database, which will be discussed in the next section.
However, since the entire database is replicated to the secondary nodes, this method's
network consumption is very high.
In full database replication, there can be two types of replication. One of them is Active-
Active replication, as shown in the screenshot below.
In Active-Active replication, both nodes can be used by applications for both writes and
reads. However, in this configuration, there can be conflicts, as the same record can be
updated by two different users at the same time.
Since data modifications take place on multiple nodes, Active-Active replication has
limitations when expanding to many nodes. Often, this method is used to distribute data to
users who do not have continuous connectivity.
The most popular replication type is Active-Passive replication, as shown in the
following screenshot.
In the above Active-Passive replication type, data writing is possible only at a single node,
whereas the other nodes are used as read-only nodes. Since the implementation of this type is
much simpler, there are situations where you can configure multiple read-only nodes. A
read-only node is mostly used as a reporting database to reduce the load on the main
system.
Mirroring and log-based replication are the most commonly used techniques for full
database replication. In both of these methods, it is preferred to use asynchronous
replication. In asynchronous replication, after a transaction is completed on the primary
server, the acknowledgement is sent to the user without waiting for the secondary to
complete the transaction. The following screenshot shows the difference in transaction
duration between asynchronous and synchronous replication.
As observed in the above screenshot, synchronous replication increases the transaction
duration. Hence, synchronous replication impacts system performance negatively.
This negative performance impact has led many users to configure asynchronous replication
for database distribution.
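In PostgreSQL, this trade-off is largely controlled by the synchronous_commit and
synchronous_standby_names settings; a minimal sketch follows, where the standby name is
an illustrative assumption:
-- wait for the local WAL flush only, not for any standby (asynchronous behaviour)
SET synchronous_commit = local;

-- or name a synchronous standby so that commits wait for its confirmation
ALTER SYSTEM SET synchronous_standby_names = 'standby1';
SELECT pg_reload_conf();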
Full database replication is commonly used due to its simple implementation.
However, due to its network utilization, database designers also look at partial
replication for distributed databases.
Full Replication
In the full replication technique, the entire object is replicated. In this method of
replication, tables and views are selected to replicate between nodes. For example, for a table,
all the rows and columns are replicated, as shown in the following screenshot.
As shown in the above screenshot, the entire table is replicated. This method of distribution
is common in distributed database systems. Though this does not consume as much network
bandwidth as full database replication, there is still substantial network utilization.
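A minimal sketch of table-level replication using PostgreSQL logical replication (available
from version 10) is shown below; the table name and connection details are illustrative:
-- on the publisher node (requires wal_level = logical)
CREATE PUBLICATION customer_pub FOR TABLE Customer;

-- on the subscriber node, where the Customer table must already exist
CREATE SUBSCRIPTION customer_sub
    CONNECTION 'host=primary-host dbname=sales user=replicator password=xxxxxx'
    PUBLICATION customer_pub;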
Horizontal and vertical fragmentation are implemented in order to utilize the
network more precisely.
Horizontal Fragmentation
Horizontal fragmentation, or row-level fragmentation, is shown in the following screenshot.
As shown in the above screenshot, every node has its relevant data. For example, if we are
looking at the customer table, we can distribute data by location. In the customer table,
there are three locations: USA, UK and CANADA. We can distribute the customer table
by location, as shown in the following screenshot.
Then the USA rows will be stored at the USA site so that USA users can access their
data at the site closest to them. Similarly, other users can access their data close to their
sites. In horizontal fragmentation, it is important to split the data into fragments of similar
size; otherwise, the large fragments will not achieve the desired performance
expectations.
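A minimal sketch of building such location-based fragments and a combined view is shown
below; it assumes an Employee table with a Location column, and all names are illustrative:
-- each site stores only its own rows
CREATE TABLE EmployeeUSA AS
SELECT * FROM Employee WHERE Location = 'USA';

CREATE TABLE EmployeeUK AS
SELECT * FROM Employee WHERE Location = 'UK';

CREATE TABLE EmployeeCANADA AS
SELECT * FROM Employee WHERE Location = 'CANADA';

-- a view can reconstruct the full table when all fragments are reachable
CREATE VIEW EmployeeAll AS
SELECT * FROM EmployeeUSA
UNION ALL
SELECT * FROM EmployeeUK
UNION ALL
SELECT * FROM EmployeeCANADA;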
Vertical Fragmentation
Another mode of fragmentation is vertical fragmentation, as shown in the following
screenshot.
The same customer table example is used to explain vertical fragmentation in distributed
databases, as shown in the screenshot below for the same dataset.
As shown in the above screenshot, the customer table is distributed by columns. In the
preceding example, one distributed database node has Employee Number, Employee
Name and Date of Birth, while the other node has Employee Number, Location and
Salary.
Hybrid Fragmentation
Hybrid fragmentation is a combination of horizontal fragmentation and vertical
fragmentation, as shown in the screenshot below.
In the above example, every fragment is equally distributed. Let us use the same customer
example to get a better understanding of hybrid fragmentation.
In the above example, the customer table is vertically distributed into the column sets
{Employee Number, Employee Name, Date of Birth} and {Employee Number, Location, Salary},
and horizontally by employee location. However, this type of distribution can lead to a large
number of fragments. The presence of a large number of fragments leads to heavy
maintenance problems. Further, some fragments will be very small and some of them
are meaningless.
In the distributed database design shown above, UK customers have all their attributes in
one fragment. US location employees were fragmented into {Employee Number, Employee
Name, Date of Birth} and {Employee Number, Location, Salary}, whereas the Canada
employees were fragmented into {Employee Number, Employee Name, Location, Date of Birth}
and {Employee Number, Salary}.
Having understood the different ways to fragment the database, let us look at a special
distributed database technique, sharding, in the next section.
Implementing Sharding
Sharding is the process of breaking up high-volume tables into smaller table fragments
named shards. These shards can be spread across multiple servers or sites. A shard is a
horizontal data partition.
In sharding, the entire database is divided into multiple databases depending on a
business key. For example, we can divide the database by customer. If we are dividing
it into multiple databases by customer, all of a customer's related data should go to that
customer's shard.
The above screenshot shows a reference architecture for a sharded database system. In this
system, end users connect to the shard catalogue. Depending on the query, the shard
catalogue decides from which shard data should be read or written. However, master data
that is related to all shards is replicated from the master database.
Let us look at an example that matches the above reference diagram in the following
screenshot.
In the above shard design, two shards are introduced. In the first shard, customers
with customer keys {1,2,3} are allocated, while in shard 2 customers with
customer keys {1001, 1002} are allocated. Each shard has a key range so that new
customers can get the next number. In shard 1, the Order table will have the orders for
customers {1,2,3}. To maintain the unique constraint for the Order table, the first two
digits of the order number represent the shard.
Product data, which is a master or reference table, should be replicated to all shards, as the
Product table is required by all shards.
When a user requests data from the sharded database system, he needs to provide at
least a sharding key. If the sharding key is unknown, a secondary key, which is the
Customer Key in this example, should be provided in order to find the shard.
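A minimal sketch of such a lookup is shown below, assuming a shard catalogue table of our
own naming:
CREATE TABLE ShardCatalogue(
    CustomerKey int PRIMARY KEY,
    ShardName text NOT NULL
);

-- the routing layer looks up which shard holds a given customer
SELECT ShardName
FROM ShardCatalogue
WHERE CustomerKey = 1002;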
Distribution Transparency
As we discussed in the Designing Distributed Databases section, there are different types of
fragmentation. These fragments may reside in different databases. When a user executes a
query, he should not need to worry about where the fragments and their databases are.
Let us look at a more complex user query for the following database distribution.
Let us assume that the user needs to execute the following query against a centralized database.
SELECT EmployeeNumber, EmployeeName, Salary
FROM Employees
WHERE Salary > 3000
With the database distribution design, the above query will no longer work. Instead, the
following query needs to be executed.
SELECT
EmployeeUS1.EmployeeNumber,
EmployeeUS1.EmployeeName,
EmployeeUS2.Salary
FROM EmployeeUS1 INNER JOIN
EmployeeUS2 ON EmployeeUS1.EmployeeNumber = EmployeeUS2.EmployeeNumber
WHERE EmployeeUS2.Salary > 3000
UNION
SELECT
EmployeeUK.EmployeeNumber,
EmployeeUK.EmployeeName,
EmployeeUK.Salary
FROM EmployeeUK
WHERE EmployeeUK.Salary > 3000
UNION
SELECT
EmployeeCANADA1.EmployeeNumber,
EmployeeCANADA1.EmployeeName,
EmployeeCANADA2.Salary
FROM EmployeeCANADA1 INNER JOIN EmployeeCANADA2 ON
EmployeeCANADA1.EmployeeNumber = EmployeeCANADA2.EmployeeNumber
WHERE EmployeeCANADA2.Salary > 3000
Transaction Transparency
In Chapter 9, Designing a Database with Transactions, in the ACID Theory section, we
discussed how important it is to maintain the four properties of Atomicity, Consistency,
Isolation and Durability.
Performance Transparency
Performance transparency means the end user should not experience a difference with
respect to performance when he is using distributed databases. In other words, his
experience with the distributed system should be the same as with a centralized database
system with respect to performance.
During query processing in a distributed system, the process has to identify the
fragments and their locations. In the distributed environment, additional query
cost is incurred due to the communication between sites, apart from the CPU and I/O cost.
Since each site has its own query processor, a distributed query can be processed in
parallel. Parallel processing improves query processing if the data load is similar across the
fragments. Therefore, when designing fragments, the deciding factor should be the data
distribution across fragments.
1. Local Autonomy for Sites - Each site should be able to operate locally. Local data
should be managed locally.
2. No Reliance on a Central Site - There should not be a site without which the entire
system cannot operate. In other words, there should not be a single point of
failure.
3. Continuous Operation - There should not be a need for downtime when it comes
to adding or removing a site.
4. Location Independence - A user should be able to access any site, provided he has
permission.
5. Fragmentation Independence - A user should be able to access data without
knowing how the data is fragmented, provided he has the necessary permissions.
6. Replication Independence - A user should be unaware of whether the data is
replicated or of the mechanisms of replication.
7. Distributed Query Processing - The distributed system should be capable of
query processing when there is a need to access multiple sites and
fragments.
8. Distributed Transaction Processing - The distributed system should be capable of
transaction processing when there is a need to access multiple sites and
fragments.
9. Hardware Independence - Since there are different sites in distributed database
systems, different sites should be able to run on different hardware.
10. Operating System Independence - Since there are different sites in distributed
database systems, different sites should be able to run on different operating
systems.
11. Network Independence - Since there can be multiple sites, and those sites are
connected with different network topologies, the distributed database should
work regardless of the network topology.
12. Database Independence - Since there are different sites in distributed database
systems, different sites should be able to run on different database technologies.
Let us discuss the challenges that will be encountered by database designers and database
administrators with distributed databases.
Let us look at a few examples of real-world scenarios with respect to distributed
databases in the following section.
Field Notes
The following is a simplified flow diagram of a reporting system for an
organization.
The organization has many sales representatives who are responsible for sales
around the country. Every morning, they receive the stock from the storehouse
[ 375 ]
Distributed Databases Chapter 13
and then they leave for the sales. Since there was no proper system in place, all
the data captured in manually. Later it was decided to implement a system to
computerized the entire process so that the work of the sales representatives and
administrators tasks will become easier. It was decided to provide sales
representatives. Every representative will have dedicated databases and early
morning database data will be replicated to the PDA. During their sales, PDA
data is updated. End of the day, they will update the main office database after
completing their sales. Since there is a need for data to be updated from both
server and client, Active-Active replication was used. Since there are fewer
chances for conflicts, the basic conflict resolution method of priority nodes
conflict resolution method was used.
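As a rough illustration of the replication mechanics (not the exact setup used in this scenario), PostgreSQL's built-in logical replication can publish a table from one database and subscribe to it from another. Native logical replication is one-directional, so an active-active arrangement would need two publication/subscription pairs plus a conflict-handling strategy layered on top. The object names and connection string below are hypothetical.

-- On the main office database (publisher); table name is hypothetical
CREATE PUBLICATION sales_pub FOR TABLE public."SalesOrder";

-- On the representative's database (subscriber); connection details are hypothetical
CREATE SUBSCRIPTION sales_sub
    CONNECTION 'host=mainoffice dbname=sales user=repl password=secret'
    PUBLICATION sales_pub;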
Summary
In this chapter, we discussed the different aspects of distributed database systems. We identified that, due to business growth and the need to provide independence to business units, a distributed database can be a better solution than a centralized database. Fragmentation was identified as an important phenomenon in distributed databases, where horizontal, vertical, and hybrid fragmentation can be applied after carefully examining the requirements. We discussed sharding as a special technique to distribute data; in sharding, a sharding catalogue is implemented to route requests to the correct shard database.
Having completed the theoretical aspects of database design with practical examples, it is now time to apply all that learning to a case study. The next chapter is dedicated to discussing a case study of database design for a Learning Management System.
Questions
Why are distributed databases needed for modern businesses?
Modern businesses operate across many business units such as branches, projects, business segments, and so on. Due to increasing competition, businesses are moving towards operating these different units independently. Apart from business process independence, businesses are looking for maintenance independence, where one unit can schedule its own maintenance without impacting other business processes. To separate the business processes, the technical systems are also distributed when needed. With the application separation, databases need to be separated too. By distributing data, separate units have the luxury of creating their own analytics. However, a simple separation of databases will run into a lot of technical and practical issues. Therefore, databases should be distributed taking business processes into account.
Apart from implementing security for all the databases, network security has to be improved. Since there is a lot of communication between databases in a distributed system, it is essential to improve network security. This means that, in a distributed system, there are more security challenges than in a centralized database.
In what scenarios would you choose a distributed database system?
If an organization's processes are distributed across multiple locations and these processes can operate independently or semi-independently, you can implement a distributed database system.
When the same record is modified at multiple sites at the same time, there will be a conflict of data. To resolve these conflicts and choose a winner and victim(s), a conflict resolution mechanism should be introduced.
Further Reading
High Availability: https://scalegrid.io/blog/managing-high-availability-in-postgresql-part-1/
https://blog.timescale.com/blog/scalable-postgresql-high-availability-read-scalability-streaming-replication-fb95023e2af/
14
Case Study - LMS using
PostgreSQL
In all the chapters up to now, we have discussed different aspects of database design. We emphasised design aspects such as conceptual database design, E-R modeling, normalization of data models, and so on. Apart from the functional requirements of a database, we also discussed many non-functional requirements, such as indexes, scalability, and transactions. In the last chapter, we discussed how databases can be designed using distributed principles.
Having discussed every aspect of databases, now it is time to put all the knowledge into a
case study. In this chapter, we will be using design concepts that we learned to design a
database for the Learning Management System (LMS).
To design a database for an LMS, we will look at the business case and build a conceptual model for the LMS. We will define the necessary table structures for the LMS using PostgreSQL, and discuss the necessary indexing and a few other advanced features for the database design. In this chapter, we will cover the following topics:
1. Business Case
2. Planning a Database Design
3. Building the Conceptual Model
4. Applying Normalization
5. Table Structures
6. Indexing for Better performance
7. Other Considerations
Business Case
A Learning Management System (LMS) supports the e-learning activities of different educational institutes. Further, an LMS handles the management and delivery of e-learning courses.
Educational Institute
The main logical entity of the LMS is the educational institute. These institutes operate at different scales. A small number of educational institutes need to provide 24x7 support for their end users; further, these high-scale institutes need additional disaster recovery options. A large number of institutes require the same functionality as the high-scale institutes, but not at the same scale. Since these are small to medium scale institutes, their users may not require 24x7 support. Considering the different support levels, there are three categories of educational institutes: High, Medium, and Small.
Though there are different scales of educational institutes, depending on the licensing cost they may request an upgrade or downgrade between the three categories.
Every educational institute has multiple faculties and every faculty has multiple departments, as shown in the following example.
The above screenshot shows how institutes, faculties, departments, and courses are organized, and how courses are defined in the Computer Engineering department.
In this simplified LMS, there are three main business actors, which will be discussed in the following section.
Lecturers - Lecturers have total ownership of the courses. Lecturers define the content of the courses and the modules within them. Every module has one or many module leaders; a module leader is another lecturer from the institute. The module leader will define the module curriculum, mode of evaluation, references, and module plan. A lecturer can initiate discussions so that students can participate.
In addition to the module leader(s), there can be optional lecturers such as visiting lecturers, industry experts, subject matter experts, and so on. The module leader will update the course contents, mark attendance, set the deadlines, and mark the assignments.
In the LMS, the course is an important entity. Let us look at how the course is defined in the LMS.
Course in LMS
Students are enrolled in a course, and each course belongs to a department. Every course has a duration. Every course has multiple modules, and a module can be a core module or an optional module. Each module should have a study plan. The study plan contains the schedule and assignments, as shown in the below example.
As shown in the above example, Advanced Databases is one of the modules in the BSc(IT) course. However, Advanced Databases can also be a module in another course. As shown in the above example, a weekly study plan and one assignment are listed. The study plan includes notes and research papers, whereas an assignment has assignment documents and a marking schema. An assignment has multiple deadlines, such as a soft deadline and a hard deadline, that are applicable to the students. There are two important dates for lecturers: the notification date and the marking completion date.
Every course has a course coordinator and every module has a module leader. There can be
one or many lecturers assigned to a module. A Lecturer can be a content provider for
multiple modules.
Student Discussions
Students are encouraged to carry out productive discussions through the LMS. A student or a lecturer can initiate a discussion. Every discussion has a title and a description. Users can tag every discussion with different labels, such as question, database, and so on, so that at a later stage it can be searched quickly. Either the module leader, the course administrator, or the initiator can close a discussion.
Auditing Requirements
Auditing is an important aspect as there can be legal issues with respect to the assignments.
Therefore, all the activities related to the assignments should be audited. Auditing
mechanism should retain the time of the activities and the user of the activity.
Having understood the basic requirements for the LMS, let us plan for the database design
in the next section.
Since the LMS is a transaction-driven, application-oriented system that supports the day-to-day business of an educational institute, the LMS database should be designed according to the OLTP method.
There are two main approaches to database design, referred to as bottom-up and top-down. The LMS database design will take the top-down approach, where the model is designed first, then the entities, and then the attributes. In the top-down approach, the designer starts with a basic idea of what is required for the system. Since the top-down approach suits large and complex database systems, we can utilize it to design the LMS database.
Apart from the design, we need to plan the database as a distributed database as there is a
clear need to distribute data. We can distribute the data depending on the scale of the
education institutes.
Let us look at how the conceptual model is built for the LMS database in the following
section.
Initially, we need to draw an entity type table to identify the entity types and to refer to at a later date. The following is the table of entity types.

Entity Type | Description | Similar Names / Different Roles
Institute | Educational institute that is a logical unit combining courses, lecturers, and students. | University, School, Academy
Department | Separation of teaching units. | Division, Teaching Unit
Course | Plan of study on a subject. |
Module | Different subjects of a course. | Subject
Student | The person who is enrolled in a course and who engages with the course curriculum. | Pupil
Lecturer | The person who is responsible for the delivery of courses. | Teacher, Professor, Course Coordinator, Module Leader, Department Head
Administrator | The person who manages the course. | Course Manager
Appointment | A meeting between a lecturer and a student. | Meetings, Interviews
As presented in the above table, Lecturer is an important entity type in the LMS system. Within the Lecturer entity type, there are specific roles such as Course Coordinator, Module Leader, and Department Head, all of which are lecturers.
At this stage, the core entity types of the LMS database have been identified. However, this list can be modified at a later stage, during normalization.
After identifying the core entity types, let us identify the relationships between the identified entities.
Let us first identify the hierarchical relationships between the above entity types. The lecturer is the main actor in the LMS database, as they play key and varied roles in the system.
As we identified in the Business Case section, there is a natural hierarchy in the LMS. As we discussed in Chapter 1, hierarchical databases can be used to store such data. However, since there are other relations that follow the relational model, we have chosen a relational database to implement the LMS, and the hierarchical data model is represented as the relations shown above.
Out of the identified entity types, the Lecturer is the most complex. Though the lecturer's main role is to teach different modules, there are additional roles carried out by the lecturer. The head of the department is a lecturer who manages the department's affairs. The course coordinator, an entity from the Lecturer entity type, manages the course. Similarly, a lecturer becomes a module leader when it comes to managing a module. All of these functionalities are presented in the following screenshot.
Apart from the lecturer, there will be administrators who manage the course as shown in
the below diagram.
In the LMS system, apart from the lecturer, the student is another important entity type. A student engages with courses and modules. A student is enrolled in a course and thereby has to follow its modules. The student engages with a module by participating in it. This relationship is shown in the below screenshot.
Student participation in a module takes multiple forms, such as attending the module, submitting assignments, presenting, and writing exams. These aspects are shown in the below relationship diagram.
Since there are a lot of engagements with the module by students in different aspects, the
relationship between student and module is complex.
Apart from the standard relationships, there are recursive relationships for courses and modules. Most courses have prerequisite courses; a prerequisite means that, to enrol in a course, certain other courses must be completed first. Similarly, there are prerequisite modules for some modules. The recursive relationship for the Course and Module is shown in the below screenshot.
Let us look at the E-R diagram for the Course entity type as Course is an important Entity
Type in the LMS.
As shown in the above screenshot, Course Code is identified as the primary key for the Course entity type. Every course has multiple modules, and there can be prerequisite courses for a particular course.
Since every course has multiple modules, let us look at the E-R diagram of the module, as shown in the below screenshot.
After looking at attributes of important Entity Types Lecturer, Course and Module, we
need to find out the attributes for entity types such as Student, Appointment and so on.
Documentation
Let us look at the documentation for the Student entity type.
Applying Normalization
In Chapter 5, Applying Normalization, we discussed that normalization is an important process in database design. Since the LMS is a transactional (OLTP) database, we need to follow the normalization process. In the same chapter, we discussed that there are different levels of normalization, that is, First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), Fifth Normal Form (5NF), and Domain-Key Normal Form (DKNF). Out of these normal forms, we need to apply at least 3NF to the database design.
Let us look at a few examples of applying normalization to the LMS database entities. Repeated attributes can be separated into other entities as follows:
Faculty { Faculty Code, Faculty Name, Dean, Institute Code }
Institute { Institute Code, Institute Name, Institute Address, Head of the Institute }
Department { Department Code, Department Name, Head of the Department, Faculty Code }
If you look further at the above entities, you will see that the Dean, the Head of the Institute, and the Head of the Department are entities from the Lecturer entity type. Therefore, we include the Lecturer code of the Dean, the Head of the Institute, and the Head of the Department, as they are instances of the Lecturer entity type.
Faculty { Faculty Code, Faculty Name, Dean Code, Institute Code }
Institute { Institute Code, Institute Name, Institute Address, Head of the Institute Code }
Department { Department Code, Department Name, Head of the Department Code, Faculty Code }
Let us look at the ER diagram for these entities as shown in the below screenshot.
With the above design, we have removed the repeating attributes. However, this has increased the number of entity types.
Student Number | Student Name | Course Code | Course Name | Course Enrollment Date
2 | Harry Steward | C004 | Advanced Database Management | 2020/05/01
3 | Alex Robert | C004 | Advanced Database Management | 2020/06/01
4 | Rose Elliot | C001 | Introduction to Databases | 2020/01/01
4 | Rose Elliot | C004 | Advanced Database Management | 2020/06/15
5 | Michel Jacob | C004 | Advanced Database Management | 2020/06/01
6 | Emily Jones | C004 | Advanced Database Management | 2020/06/01
In the above relation, {Student Number, Course Code, Course Enrollment Date} is considered the primary key, since a student can be enrolled in the same course multiple times.
In the above design, Assignment Submission is separated into its own entity type, with the submission date, marked date, and so on.
After identifying the entities, the next step is to implement the design in a physical database. Let us see how we can implement the identified tables in the next section.
Table Structures
Until now, we did not need to know which database technology would be used for the implementation. We have decided to use PostgreSQL for the implementation of the LMS. In this design, we will look at how to implement distributed databases, tables, and views.
Distributed Databases
In Chapter 13, Distributed Databases, during the sharding discussion, we identified that the shards need to be designed. As identified during the requirement elicitation, sharding will be done based on the institute. There will be three shards and one master shard. In the database design, we have named LMS_Master as the master shard.
In the master shard (LMS_Master), there will be a master table that defines which shard every institute belongs to. Following is the table structure for the shard master table.
It is important to implement constraints on the above table so that data integrity can be maintained.
The ShardID attribute decides the shard that an institute belongs to. Therefore, it should be NOT NULL and should hold a value of 1, 2, or 3. To maintain data integrity, a CHECK constraint is implemented. If shards are later added to expand horizontal scalability, you need to modify this constraint.
The IsActive attribute allows an institute to be enabled or disabled. This attribute can contain either 1 or 0. Similar to ShardID, a CHECK constraint is implemented on the IsActive column to maintain data integrity.
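A minimal sketch of how such a shard master table could be scripted is shown below, assuming the table name InstituteShard; the column names and data types are assumptions rather than the book's exact definition.

CREATE TABLE public."InstituteShard"
(
    "InstituteID"   serial NOT NULL,                 -- surrogate key for the institute
    "InstituteCode" character varying(8) NOT NULL,   -- business key
    "ShardID"       smallint NOT NULL,               -- shard the institute belongs to
    "IsActive"      bit(1) NOT NULL DEFAULT B'1',    -- enable or disable the institute
    CONSTRAINT "InstituteShard_pkey" PRIMARY KEY ("InstituteID"),
    CONSTRAINT "UNIQUE_InstituteCode" UNIQUE ("InstituteCode"),
    CONSTRAINT "CHK_InstituteShard_ShardID" CHECK ("ShardID" IN (1, 2, 3)),
    CONSTRAINT "CHK_InstituteShard_IsActive" CHECK ("IsActive" IN (B'0', B'1'))
);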
In the master shard, you do not need to keep all the details of the institutes. Those details
can be stored at each shard level.
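A query of the kind referred to below, using the hypothetical InstituteShard table from the earlier sketch, might look like the following.

-- Count how many institutes are allocated to each shard
SELECT "ShardID", COUNT(*) AS "InstituteCount"
FROM public."InstituteShard"
GROUP BY "ShardID"
ORDER BY "ShardID";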
The following are the results of the above query, showing the distribution of the institutes.
After the master shard is completed, let us look at how the other tables are designed.
To ease the management of shards, every table will have an InstituteID, which is the shard attribute. This will help database administrators and customer support staff to move data between shards. This is a requirement that was identified during the requirement elicitation.
The primary key will be an auto-increment attribute, and the business key will have a UNIQUE index. For example, in the Lecturer entity type, there will be a LecturerID primary key, and the business key, LecturerCode, will have a UNIQUE index.
Composite types will be used for common composite attributes such as Address and Full Name. During the entity type identification, we noticed that an Address attribute applies to Institute, Lecturer, Administrator, and Student. Similarly, Full Name applies to Lecturer, Student, and Administrator. Rather than using individual attributes such as AddressI, AddressII, and City everywhere, it is better to use a combined attribute.
Let us see how we can create composite types in PostgreSQL. Following is the Address type.
The same Address type can be created from the below script.
CREATE TYPE public."Address" AS
(
"AddressI" character varying(25),
"AddressII" character varying(25),
"City" character varying(15),
"Province" character varying(15),
"Country" character varying(15),
"Postcode" character varying(10)
);
Similarly, we can extend the composite type to the Full Name as well.
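A minimal sketch of such a Full Name composite type is shown below; the field names and lengths are assumptions for illustration.

CREATE TYPE public."FullName" AS
(
    "FirstName"  character varying(25),
    "MiddleName" character varying(25),
    "LastName"   character varying(25)
);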
Similarly, we can implement a composite data type for the audit columns. For audit purposes, we need the IsActive, RecordCreatedDate, RecordModifiedDate, and InstitiuteID attributes in every table. Rather than adding all four columns individually, we can create an AuditColumns composite type as shown below.
CREATE TYPE public."AuditColumns" AS
(
"IsActive" bit(1),
"RecordCreatedDate" date,
"RecordModifiedDate" date,
"InstitiuteID" bigint
);
Now let us see how we can implement this in the Lecturer table, as the Lecturer table needs the FullName, Address, and AuditColumns composite data types. The following script shows how to include composite data types in a table in PostgreSQL.
CREATE TABLE public."Lecturer"
(
"LecturerID" serial NOT NULL,
"LecturerCode" character varying(8) NOT NULL,
"Name" "FullName",
"Address" "Address",
"AllocatedDepartmentID" integer,
"DateofBirth" date,
"AuditColumns" "AuditColumns",
CONSTRAINT "Lecturer_pkey" PRIMARY KEY ("LecturerID"),
CONSTRAINT "UNIQUE_LecturerCode" UNIQUE ("LecturerCode"),
CONSTRAINT "FK_AllocatedDepartmentID_Department_DepartmentID" FOREIGN KEY
("AllocatedDepartmentID")
REFERENCES public."Department" ("DepartmentID") MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)
In the Lecturer table, LecturerID is the primary key, whereas LecturerCode is a UNIQUE key. AllocatedDepartmentID is a foreign key constraint referencing the Department table.
Let us see how we can insert a record into the Lecturer table. To populate the Lecturer table, we first need to populate the Institute, Faculty, and Department tables, as shown in the following script.
INSERT INTO public."Institute"(
"InstituteID", "InstituteCode", "InstituteName", "Address",
"HeadofInstitueID", "AuditColumns")
VALUES (2, 'IN0002', 'University of Business Management',
ROW('','','Mumbai','','India','213456'), NULL,
ROW(B'1','2020-05-18',NULL,2))
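The Faculty and Department inserts could, under purely hypothetical column names and values, look something like the following; they simply mirror the ROW syntax used above.

-- Hypothetical Faculty and Department rows referencing the institute inserted above
INSERT INTO public."Faculty"(
    "FacultyID", "FacultyCode", "FacultyName", "DeanID", "InstituteID", "AuditColumns")
VALUES (1, 'FA0001', 'Faculty of Engineering', NULL, 2,
    ROW(B'1','2020-05-18',NULL,2));

INSERT INTO public."Department"(
    "DepartmentID", "DepartmentCode", "DepartmentName", "HeadOfDepartmentID", "FacultyID", "AuditColumns")
VALUES (1, 'DE0001', 'Computer Engineering', NULL, 1,
    ROW(B'1','2020-05-18',NULL,2));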
When inserting values into composite attributes, the values should be supplied with the ROW syntax, as shown in the above script.
When accessing these composite attributes, specific queries need to be executed, as shown below.
--SELECTING Data
SELECT "LecturerCode", "Name",
("Address")."City" --Selecting the City column from Address composite
attribute
FROM public."Lecturer"
WHERE
("AuditColumns")."IsActive" = B'1' --Choosing IsActive attribute from
AuditColumns
The following screenshot shows the result of the above query, where the Name attribute is a composite attribute.
Apart from the above basic details for the Lecturer, there are a few other important properties of lecturers. For example, details like qualifications, contact telephone numbers, email addresses, and memberships have to be stored, and more such detail types will be requested in future. If you decide to implement a separate table for each type of detail, you will end up with a large number of tables, which will make the data difficult to maintain.
To avoid a large number of tables, a single table can be introduced with a label column. In future, if there is a requirement for a new detail type, it is simply a matter of inserting records into this table instead of creating a new table.
The following screenshot shows the table structure for the extended properties of the Lecturer.
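A minimal sketch of how such an extended-detail table might be scripted is shown below; the column names (DetailType, DetailValue) and lengths are assumptions for illustration.

CREATE TABLE public."LecturerExtendedDetail"
(
    "LecturerExtendedDetailID" serial NOT NULL,
    "LecturerID"   integer NOT NULL,
    "DetailType"   character varying(25) NOT NULL,   -- e.g. Email, Mobile, Qualification, Membership
    "DetailValue"  character varying(100) NOT NULL,
    "AuditColumns" "AuditColumns",
    CONSTRAINT "LecturerExtendedDetail_pkey" PRIMARY KEY ("LecturerExtendedDetailID"),
    CONSTRAINT "FK_LecturerID_Lecturer_LecturerID" FOREIGN KEY ("LecturerID")
        REFERENCES public."Lecturer" ("LecturerID") MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
);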
Let us see the relation of the LecturerExtendedDetail table with the Lecturer table in the
below screenshot.
Let us see how data is organized in the above table. The following screenshot shows different details for a single lecturer.
If another detail type is needed, it is simply a matter of inserting records into the extended table. This configuration avoids having an unnecessarily large number of tables that are difficult to manage.
For enrolment of courses, there are mainly three entities, Students, Courses and Lecturers.
Students are enrolled in courses where courses have many modules. Every module has
assignments and exams.
The Student table is a standard table with a serial integer column and the usual audit columns, as shown in the below screenshot.
In the above Student table, Student is linked to the Department table through
DepartmentID.
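For reference, here is a rough sketch of what the Student table described above could look like in script form; the exact column list is an assumption based on the surrounding description, not the book's definition.

CREATE TABLE public."Student"
(
    "StudentID"    serial NOT NULL,
    "StudentCode"  character varying(8) NOT NULL,    -- business key
    "Name"         "FullName",
    "Address"      "Address",
    "DateofBirth"  date,
    "DepartmentID" integer,
    "AuditColumns" "AuditColumns",
    CONSTRAINT "Student_pkey" PRIMARY KEY ("StudentID"),
    CONSTRAINT "UNIQUE_StudentCode" UNIQUE ("StudentCode"),
    CONSTRAINT "FK_DepartmentID_Department_DepartmentID" FOREIGN KEY ("DepartmentID")
        REFERENCES public."Department" ("DepartmentID") MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
);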
The course table is one of the core tables in the LMS database. There are prerequisite
courses for some courses. Since there can be multiple prerequisite courses, that
configuration cannot be stored in the Course table. Hence, you need to introduce an
additional table which has the following table schema.
In the above table, the combination of CourseID and PrerequisiteCourseID forms a unique constraint. This prevents the same course from being configured twice as a prerequisite for a given course. If a course has no prerequisites, there will simply be no record for it in the PrerequisiteCourse table.
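A hedged sketch of such a prerequisite table, with assumed names, could be as follows.

CREATE TABLE public."PrerequisiteCourse"
(
    "PrerequisiteID"       serial NOT NULL,
    "CourseID"             integer NOT NULL,   -- the course that has a prerequisite
    "PrerequisiteCourseID" integer NOT NULL,   -- the course that must be completed first
    "AuditColumns"         "AuditColumns",
    CONSTRAINT "PrerequisiteCourse_pkey" PRIMARY KEY ("PrerequisiteID"),
    CONSTRAINT "UNIQUE_CourseID_PrerequisiteCourseID" UNIQUE ("CourseID", "PrerequisiteCourseID"),
    CONSTRAINT "FK_CourseID_Course_CourseID" FOREIGN KEY ("CourseID")
        REFERENCES public."Course" ("CourseID") MATCH SIMPLE,
    CONSTRAINT "FK_PrerequisiteCourseID_Course_CourseID" FOREIGN KEY ("PrerequisiteCourseID")
        REFERENCES public."Course" ("CourseID") MATCH SIMPLE
);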
Now let us look at all the relationships with the course table as shown in the below
screenshot.
When a student is enrolled in a course, the student is allocated to multiple modules. Those modules are either compulsory or optional. A module can be linked to multiple courses, but each module has an offering department and a module leader. There can be multiple lecturers for a module, and there are prerequisite modules, just as we saw for courses.
To avoid overloading the above screenshot, the Department, Lecturer, and Student tables have been removed from the diagram. However, those three tables are part of this design.
After the design of the modules, the next step is student submissions. Student submissions can be for assignments and for exams.
Similar to assignment submission, exam submission and exam marking can also be designed.
Let us recap the key decisions made in the table structure design:
If there is a request to move an institute from one shard to another, data can be moved table by table, since we have included InstituteID in every table.
Every table has the audit columns IsActive, RecordCreatedDate, RecordModifiedDate, and InstituteID. A composite data type is used for these audit columns, as it makes them easier to add to the tables.
Apart from the audit columns, we have used composite data types for the Name and Address columns, as they are the most commonly used composite attributes in the LMS database. For example, Student, Lecturer, and Administrator all have an address and a name.
For all the tables, we have used serial columns as the primary key. The serial data type (or bigserial data type) generates values automatically in increasing order. With this implementation, we have ensured that there is less fragmentation, and since we are using integer data types to join data between tables, query performance will be enhanced.
For lecturer details such as emails, mobile numbers, qualifications, and memberships, we have used a single table called LecturerExtendedDetail. This enables developers to include more detail types in future without modifying the database schema.
We have included separate tables for each business process; for example, there are tables for enrollments, submissions, and exams. With this approach, the number of transactions per table is reduced and overall system performance is improved. For the transaction tables, we can introduce partitions per year to achieve better performance and scalability.
Foreign key constraints are established for all tables. With this implementation, data integrity is achieved. Further, we have used the default option NO ACTION so that, in case of conflicts, data will not be deleted or updated.
Let us discuss how to improve performance in the designed database by using indexes.
Since we have assigned primary keys, we have made the same primary key the clustered index for all the tables. The following screenshot shows the configuration of the clustered index on the AssignmentSubmissionID column of the AssignmentSubmission table.
Following is the script for the clustered index created in the above screenshot.
CREATE UNIQUE INDEX "CIX_AssigmentSubmission"
ON public."AssignmentSubmission" USING btree
("AssignmentSubmissionID" ASC NULLS LAST)
TABLESPACE pg_default;
Apart from primary keys, we have defined foreign key constraints for tables where necessary. During table joins, the foreign key columns will typically be used as the joining columns. Therefore, to improve performance, non-clustered indexes are created on those foreign key columns.
Foreign key constraints are there to maintain data integrity between tables, not as a performance option. A foreign key constraint does not automatically create an index on the referencing column. Therefore, you have to create an index explicitly on the foreign key constraint columns.
For the AssignmentSubmission table, we can create two non-clustered indexes for the StudentID and
AssignmentID as shown in the below script.
CREATE INDEX "IX_SutdentID_AssginmentSubmission"
ON public."AssignmentSubmission" USING btree
("StudentID" ASC NULLS LAST)
TABLESPACE pg_default;
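The second non-clustered index, on AssignmentID, would plausibly follow the same pattern; the index name below is an assumption.

CREATE INDEX "IX_AssignmentID_AssignmentSubmission"
    ON public."AssignmentSubmission" USING btree
    ("AssignmentID" ASC NULLS LAST)
    TABLESPACE pg_default;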
Apart from primary keys and foreign key constraints, some tables have business keys. For example, the Student table has StudentCode, the Lecturer table has LecturerCode, the Course table has CourseCode, and so on. Normally, from the application point of view, these business keys will be used for searches. Therefore, to facilitate searches, indexes can be implemented on them. Moreover, you can create a UNIQUE index for these business key columns so that their uniqueness is enforced as well.
The following screenshot shows the configuration of the unique index for CourseCode in the Course table.
1. Fill Factor: Since the index is on a string column, there can be fragmentation. To reduce fragmentation, a fill factor of 90 is used.
2. Include Columns: To allow queries to retrieve the CourseDescription column directly from the index, CourseDescription is used as an include column.
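Scripted out, such a unique index could look like the following; the index name is an assumption.

-- INCLUDE is available in PostgreSQL 11 and later
CREATE UNIQUE INDEX "UNQ_Course_CourseCode"
    ON public."Course" USING btree
    ("CourseCode" ASC NULLS LAST)
    INCLUDE ("CourseDescription")
    WITH (fillfactor = 90)
    TABLESPACE pg_default;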
Having discussed the indexes for the designed database in detail, let us look at other considerations in the following section.
Other Considerations
After database design is completed with table structures and indexes, as a database
designer, you need to attend to other factors of database design such as security, auditing
and so on.
Security
In this database design, we have a distributed database system, which means we have multiple databases. Since these are separate physical databases, from the administrative point of view, different administrative privileges can be granted to different administrators.
From the application point of view, there are sub-systems such as Course Enrollment, Student Registration, Assignment Submission, and so on. For each sub-system, a database user is created so that the particular application sub-system uses its own database user. Those sub-system users will have only the permissions needed to execute procedures and query views; direct access to tables is prohibited.
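A hedged sketch of how such a sub-system user might be set up is shown below; the role name, the view name, and the password are hypothetical.

-- Application user for the Assignment Submission sub-system (hypothetical name)
CREATE ROLE assignment_submission_app WITH LOGIN PASSWORD 'change-me';

-- No direct table access
REVOKE ALL ON ALL TABLES IN SCHEMA public FROM assignment_submission_app;

-- Only what the sub-system needs: schema usage, routine execution, and selected views
GRANT USAGE ON SCHEMA public TO assignment_submission_app;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO assignment_submission_app;
GRANT SELECT ON public."AssignmentSubmissionView" TO assignment_submission_app;  -- hypothetical view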
Auditing
Though we have implemented the AuditColumns composite data type in every table as a preliminary audit mechanism, in the case of an LMS we need to adopt a more extensive auditing mechanism.
In this regard, we can use a separate auditing database so that all the auditing records are stored there. Those auditing tables will not have indexes, so that transaction duration is not impacted.
High Availability
Though high availability mainly falls into the administrator's hands, from the database design perspective we can take a few steps so that better availability options can be specified at the design stage. For example, we can use read-only copies of the database so that, in case of an issue in the main database, users can connect to the read-only databases. With this approach, users will at least have the option of reading data from the read-only databases.
Scalability
In the LMS database, Course Enrollment, Exams, and Submission are heavy transaction tables. Although we have achieved some level of scalability from the database distribution, we can improve scalability further.
We can implement partitioning for the above-mentioned tables to enhance scalability. Since we are not sure about the ideal partition interval, we can implement monthly partitions.
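A minimal sketch of monthly range partitioning in PostgreSQL (version 10 or later) is shown below; the table and column names are hypothetical and do not reproduce the book's exact definitions.

-- Parent table partitioned by submission date (hypothetical structure)
CREATE TABLE public."AssignmentSubmissionPartitioned"
(
    "AssignmentSubmissionID" bigserial NOT NULL,
    "StudentID"              integer NOT NULL,
    "AssignmentID"           integer NOT NULL,
    "SubmissionDate"         date NOT NULL
) PARTITION BY RANGE ("SubmissionDate");

-- One partition per month
CREATE TABLE public."AssignmentSubmission_2020_06"
    PARTITION OF public."AssignmentSubmissionPartitioned"
    FOR VALUES FROM ('2020-06-01') TO ('2020-07-01');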
Data Archival
In the LMS, data will tend to grow rapidly. However, since we are dealing with multiple institutes, every institute will have a different approach to data archival. Since we have included InstituteID in the AuditColumns composite type of every table, data archival will be much easier.
Summary
After discussing different aspects of databases in the previous chapters, we dedicated this chapter to designing a complete database. We chose a Learning Management System (LMS) as the case study.
Looking ahead to future scalability, we chose sharding, a distributed database architecture, for the LMS, with institutes as the sharding key.
We applied the normalization process to the tables, as the LMS is a transactional database. In the database design, we included serial columns for all the tables and made the business key a UNIQUE index. For all the tables, we included audit columns implemented as a composite data type, the same approach we used for names and addresses.
We used separate tables per business process as much as possible in order to reduce transaction time, which reduces contention between concurrent users as well. Apart from the primary keys on the serial columns, we implemented foreign key constraints for the reference columns.
Clustered indexes were used for the primary keys, and non-clustered indexes were implemented for the foreign keys to enhance query performance.
In the next chapter, we will look at the typical mistakes made during database design.
Exercise
Expand the designed LMS database to include discussions as discovered in the
Requirement Elicitation.
Further Reading
What is LMS: https://www.talentlms.com/what-is-an-lms
15
Overcoming Database Design
Mistakes
Over the previous fourteen chapters, we discussed the complete range of database design topics. We extensively covered database design strategies such as database modelling and database normalization, and extended our discussion to non-functional requirements for the efficient and effective use of databases, including transaction management, database maintenance, scalability, and, last but not least, security. Though we have discussed many different aspects of database design, it is important to stress the common mistakes made by database designers. These can be considered the DON'Ts of database design.
In this chapter, we will discuss the following common database design errors and tips and
hacks that could help us avoid these mistakes.
Another reason to study the mistakes of others is to avoid repeating them in your own project. Some mistakes are irreversible, and it might be too late to recover if you make them again. Correcting some mistakes will cost you in multiple ways, such as time, money, resources, reputation, and so on. Hence, it is better to know the common mistakes of others so that you do not follow the same path.
At some stage of our careers, we have all made a few mistakes. Those mistakes are valuable only if we analyse them and learn from them. In this chapter, we look at the different mistakes that can occur during database design.
Let us look at mistake number 1, poor or no planning, in the following section.
Poor or No Planning
Following is a very common and popular proverb: if you fail to plan, you are planning to fail.
In simple terms, if you are not planning, the chances of failing are high. Most of the time, planning for database design is neglected. This is mainly due to the wrong assumption that database planning is simple and can be done during the development stage; one of the main reasons for this assumption is the wish to save time. It is important to understand that time spent on planning is not wasted. If you do not spend adequate, quality time during the planning stage of the database, you will lose that time at a later stage. Importantly, you need to communicate the importance of planning in database design to higher technical management.
By not planning, you might save a few hours or days. However, due to the lack of planning, you may have to spend many more hours on maintenance. Unnecessary maintenance may lead to downtime and will have a business impact as well.
Apart from the maintenance, there is a greater chance that you will have to re-design your database. Re-designing will not only consume more time, but your reputation will be negatively impacted as well.
Apart from not planning at all, another common mistake is poor planning, where no quality time is spent on database planning. It is essential to spend quality time: during planning, you need to understand the environment and the possible strategies that can be used.
Ignoring normalization
During the discussions in Chapter 5, Applying Normalization, we covered the importance of normalization. However, during database design discussions, normalization is often ignored. As we discussed in the Purpose of Normalization section, normalization helps optimize database performance. When normalization is ignored, data will be repeated, and excessive repetition of data will cause database performance issues.
On the other hand, over-normalization will result in too many tables. When there are too many tables, you need too many joins to retrieve data, and too many joins will impact database performance negatively.
Normalization is better suited to OnLine Transaction Processing (OLTP) databases, as OLTP databases have both reads and writes. However, OnLine Analytical Processing (OLAP) databases are read-heavy, and keeping them fully normalized hurts performance. Therefore, data models should be de-normalized in OLAP databases.
If normalization is ignored at the design stage of a database, you will run into unnecessary maintenance tasks once the database is live. Applying normalization to a database that is already in the live environment will require downtime, and downtime will impact business efficiency. Therefore, it is essential to apply the necessary normalization or denormalization levels during the database design stage.
Let us look at another mistake made by database designers, which is neglecting the performance aspects of the database.
As a general practice with indexes, it is recommended to create Clustered indexes for the
Primary keys and Non-Clustered indexes for the Foreign Key constraints. For the business
keys, it is recommended to create non-clustered unique indexes.
Apart from the above standard indexes, during development time, developers can recommend indexes for specific application queries. As we discussed in Chapter 8, Working with Indexes, developers can view the query execution plan in order to find the possible indexes.
Apart from indexes, there are other steps that can be taken to improve the performance of database queries. Implementation of partitioning is one of them, as query performance increases when partitions are implemented. As a database designer, it is important to introduce partitioning at the early stages of the database design.
Another trivial mistake made by developers is using SELECT * in views and procedures. Though at the time of writing the query you might need all the columns, so that SELECT * appears to suit your requirement, demanding requirements may later force you to add more columns to the table. Since you have used SELECT * in your queries, these unnecessary columns will then be retrieved. As a general practice, as a developer and as a database designer, you should avoid using SELECT * in views and stored procedures.
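As a small illustration of the point above, a view with an explicit column list only exposes the columns the consumer actually needs; the view name here is hypothetical and reuses the Lecturer table from the case study.

-- Explicit column list instead of SELECT *
CREATE VIEW public."ActiveLecturer" AS
SELECT "LecturerCode", "Name", "AllocatedDepartmentID"
FROM public."Lecturer"
WHERE ("AuditColumns")."IsActive" = B'1';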
Naming standards are important even in databases; they will be discussed in the following section.
Out of the available database objects, the table is an important one. Let us see what standards we can adopt for table naming.
1. Do not use prefixes or suffixes for table names - Some database designers tend to name tables with prefixes or suffixes. For example, they will use TblStudent or StudentTable for the Student table; this should be avoided.
2. Use singular names for tables - A table already holds a collection of entities, so there is no need for plural table names. For example, instead of UserEnrollments, it is advisable to use UserEnrollment as the table name.
3. Never use spaces in table names - When naming tables, do not include spaces in the name: use UserEnrollment rather than User Enrollment.
4. Do not use special characters in table names - Some designers tend to use characters such as {, (, - or _ in table names. In some development environments, these kinds of table names will not be usable. As we do not know what kinds of development environments will be used in future, it is better to avoid special characters in table names.
Apart from tables, columns should also be named to a standard. As with table names, you should not include spaces or special characters in column names. For columns with boolean or bit data types, start the column name with Is; for example, during the design of the Module table structure in Chapter 14, we used the IsCompulsory column name.
Since tables and columns are visible to the development teams and other third parties, some attention is usually paid when deciding the names for tables, views, columns, and so on. However, there are internal objects, such as foreign key constraints, indexes, and check constraints, that are not visible outside the database. Though they are not visible externally, it is recommended to follow a standard when naming these objects as well.
A foreign key constraint involves three objects: the column, the referenced table, and the referenced column. When naming foreign key constraints, the FK_ColumnName_ReferencedTableName_PrimaryKey pattern can be used. In the case of the StudentEnrollment foreign key on the StudentID column referencing the Student table, FK_StudentID_Student_StudentID can be used as the constraint name.
Similar to foreign key constraints, indexes are internal objects in a database. Though indexes are not visible outside the database, it is essential to follow a naming convention for them as well. Typically, there are a few types of indexes: clustered, non-clustered, and unique. We can use the CLS_, NIX_, and UNQ_ prefixes for these indexes respectively. When naming indexes, we can use the table name and the column name as part of the index name. For example, we can use CLS_Student_StudentID for the clustered index on the Student table.
Further, for check constraints, we need to implement a naming convention similar to that of the indexes.
Let us look at how we should overcome security-related mistakes in database design in the following section.
Often, during database design, we do not plan the separation of the database server from the application server or the web server. It is important to separate the database server and the web server onto different physical machines. This separation provides more security for both the database and the application server. Apart from the security aspect, there are other benefits, such as better use of CPU and memory.
Never store passwords in a database in their native format. Always encrypt the password when storing it and decrypt it at the application end only when required. Apart from passwords, Social Security Numbers and credit card numbers should not be stored in their native format, and under some data protection laws you are not allowed to store them in the native format. Therefore, during database design, it is important to encrypt these types of data.
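For passwords specifically, one common option in PostgreSQL is to store a salted hash using the pgcrypto extension rather than a reversible value; the table and column names below are hypothetical.

CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Store a salted bcrypt hash instead of the plain-text password
UPDATE public."ApplicationUser"
SET "PasswordHash" = crypt('plain-text-password', gen_salt('bf'))
WHERE "UserName" = 'student001';

-- Verify a password at login: re-hash the supplied value with the stored salt
SELECT 1
FROM public."ApplicationUser"
WHERE "UserName" = 'student001'
  AND "PasswordHash" = crypt('plain-text-password', "PasswordHash");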
Most of the time, applications are designed to connect with an admin user, so authorization is never really tested. We should never grant admin permissions to anyone other than the database administrator. As we discussed in Chapter 12, Securing a Database, one of the ways to avoid SQL injection is by limiting the permission levels of the application user.
Another major oversight by database designers is not using proper data types, which will be discussed in the next section.
By choosing the correct data types, data integrity can be achieved to a certain extent. In the Selecting Data Types section of Chapter 6, Table Structures, we discussed in detail how important it is to choose correct data types. However, there are common mistakes made by database designers with respect to data types. Let us look at the basic ones.
In tables, ID columns should be assigned integer data types. Since ID columns are used for foreign key constraints, there will be a performance benefit when tables are joined. Apart from the foreign key constraints, storage is also reduced.
Another common mistake made by database designers is assigning the same integer data type to every column that needs one. Every database technology has several integer data types; for example, PostgreSQL has three integer data types with different properties, as shown in the following table.

Data Type | Storage Size | Range
smallint | 2 bytes | -32768 to +32767
integer | 4 bytes | -2147483648 to +2147483647
bigint | 8 bytes | -9223372036854775808 to +9223372036854775807

Rather than using the integer data type everywhere, as a database designer you should choose the correct integer data type. For example, for columns like Age and Duration, the smallint data type can be used instead of integer. As indicated in the preceding table, using the smallint data type reduces storage, and reduced storage indirectly improves the performance of user queries.
Composite data types, another kind of data type, were discussed in Chapter 14, Case Study - LMS using PostgreSQL. By implementing proper composite data types, database design and development become much easier.
Another mistake that database designers make is not using bit data types; instead, they prefer to use integer data types. For boolean data, it is always advisable to use the bit data type.
Let us discuss the lack of documentation in the database design process in the next section.
Lack of documentation
Documentation is always neglected in IT systems, especially in database system design. As you will remember from Chapter 4, Representation of Data Models, we discussed different models such as conceptual and E-R models. During the different levels of modelling, it is important to document the models and the design decisions behind them.
Further, documentation will help others to understand the findings. At the initial stage, documentation becomes the means of communication between you and the business team. Apart from the business users, newcomers to the database design and development teams can use it to understand the database design.
In PostgreSQL, descriptive comments can also be attached to database objects. These comments should be detailed enough that another designer or developer can understand them. Apart from tables, views, procedures, and functions can also be commented.
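A brief illustration using PostgreSQL's COMMENT ON statement is shown below; the comment text is, of course, just an example.

-- Attach documentation to a table and one of its columns
COMMENT ON TABLE public."Lecturer"
    IS 'Master data for lecturers; one row per lecturer, shard attribute is InstituteID.';
COMMENT ON COLUMN public."Lecturer"."AllocatedDepartmentID"
    IS 'Department to which the lecturer is primarily allocated.';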
Let us look at another mistake made by database designers in the following section.
Often, data integrity rules are enforced only at the application level. However, with growing requirements and the expansion of the business, data may in future come into the database from different sources. Therefore, it is essential to implement constraints at the database end as well. From the database side, we can implement NOT NULL constraints, FOREIGN KEY constraints, CHECK constraints, and UNIQUE constraints. With this implementation, every application feeding data has to adhere to these constraints, and data integrity will be maintained throughout.
3. Performance - When a procedure executes for the first time, its code is cached, so subsequent executions use the pre-cached code. As pre-cached code is faster than standard inline code, using procedures will enhance the performance of the database.
For these reasons, it is always recommended to use procedures for database access. However, due to the implementation effort, many database designers avoid using procedures.
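As a small sketch of wrapping data access in a stored routine, the function below (a hypothetical name, reusing the Lecturer table from the case study) returns a lecturer by business key; PostgreSQL 11 and later also offers CREATE PROCEDURE for routines that manage their own transactions.

CREATE OR REPLACE FUNCTION public.get_lecturer_by_code(p_lecturer_code character varying)
RETURNS TABLE (lecturer_id integer, lecturer_code character varying)
LANGUAGE sql
AS $$
    -- Parameterised access; the caller never builds SQL strings
    SELECT "LecturerID", "LecturerCode"
    FROM public."Lecturer"
    WHERE "LecturerCode" = p_lecturer_code;
$$;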
Let us look at another mistake that database designers make, which is trying to build generic objects.
With a single generic table, some columns will inevitably sit idle for certain object types, and the data volume in that one table will become very large. This results in more table locking, which in turn causes an unnecessary performance impact on the database.
Therefore, in database design, it is always better to separate business objects into separate tables. Although this results in a larger number of tables and some added maintenance complexity, the paramount concern during database design is to ensure high performance for database queries. For this reason, it is essential to avoid building generic objects during database design.
Another basic mistake made by database designers is the lack of testing, which will be discussed in the next section.
Lack of Testing
Typically, we test applications from the functional and performance perspectives, but it is very rare for database designers to carry out testing on the database itself. With database-specific testing, you can avoid a lot of mistakes that would otherwise surface at a later stage of the project.
Initially, during the conceptual database design stage, as a database designer you need to go through the requirement elicitation documents and verify whether your model supports the business requirements. In this exercise, it is better to sit with the business team so that you can verify your queries and concerns directly with them.
When the table structures are created, it is important to insert some data into them. With this approach, we can verify whether the defined data lengths and data types are valid. Further, you can verify the validity of unique, foreign key, and check constraints with sample data.
Apart from functional testing of the database, it is essential to verify the performance aspects. Performance can be impacted for two reasons: an increase in users and an increase in data loads. Therefore, with the help of quality assurance experts, we can test the feasibility of the database for a large number of users and a high volume of data.
Let us look at other common mistakes made by database designers in the next section.
2. Large attributes such as images and profile descriptions should be avoided in frequently queried tables. With large attributes, the table size becomes larger, and queries against such tables need to read more data pages, which results in slower query performance. Therefore, large attributes should be placed in separate tables, and a pointer to them can be used in the frequently queried tables when necessary.
Summary
After discussing the details of database design, we dedicated this chapter to the common mistakes made by database designers. We identified poor planning as a critical mistake, and ignoring normalization and neglecting performance as further key mistakes; these mistakes result in a database that is very difficult to manage.
Many database designers do not maintain proper naming standards, which turns a database into a mess. We further identified that most database designers place too little focus on the security of the database. Other important mistakes that we identified are the lack of documentation and the lack of testing for databases.
Another basic mistake we identified is that database designers tend to use generic data types rather than the most appropriate ones. Lack of use of data integrity options such as check constraints, foreign key constraints, and unique constraints is another common mistake that will result in data integrity issues in the database in future.
Questions
What is the indexing strategy that you would apply during database design?
Since it is difficult to know the usage patterns at the start of the design stage of a database, there are a few indexing strategies for the design stage. For the primary keys, we can include a clustered index. For the foreign key constraint columns, we can implement non-clustered indexes, as those columns will be used to join tables. Apart from these two types of indexes, we can implement unique non-clustered indexes for the business keys. Beyond these, as an application developer, you need to identify the further indexes needed by your queries so that they can be added during the development stage.
One way to avoid SQL injection is the introduction of procedures. Since SQL code cannot be injected into procedures, attackers will not be able to break into the database using the SQL injection technique. Further, the use of procedures means that application users only need execute permissions on the procedures and do not need any direct permissions on the underlying tables.
Index