Database Principles and Technologies – Based on Huawei GaussDB
Huawei Technologies Co., Ltd.
Hangzhou, China
© Posts & Telecom Press 2023. This book is an open access publication.
Open Access This book is licensed under the terms of the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nc-
nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license and indicate if you modified the licensed material. You do not have permission
under this license to share adapted material derived from this book or parts of it.
The images or other third party material in this book are included in the book’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
This work is subject to copyright. All commercial rights are reserved by the author(s), whether the whole
or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Regarding these commercial rights a non-exclusive license has been
granted to the publisher.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publishers, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
Database technology has developed from the early stage of simply storing and
processing data files into a rich, comprehensive discipline with data modeling and
database management systems at its core, and it now serves as the foundation of
modern computer application systems. In the Internet era, traditional database
systems have struggled to meet the storage needs of big data, and enterprise
customers urgently need a new generation of database products, that is, products
with characteristics such as elastic scaling, high throughput and low cost. As a
result, cloud computing-based databases have emerged and risen, showing a
future-oriented trend toward cloud-based, distributed, and multi-mode processing.
Based on Huawei's GaussDB (for MySQL) cloud database, this book focuses on the
features and application scenarios of cloud computing-based databases. The book's
eight chapters are organized as follows:
Chapter 1 mainly introduces databases, including database technology overview,
database technology history, relational database architecture, and mainstream appli-
cation scenarios of relational databases.
Chapter 2 mainly teaches database basics, including the main responsibilities and
contents of database management, and introduces some common and important
basic concepts of databases.
Chapter 3 introduces SQL syntax, including GaussDB (for MySQL) data types,
system functions and operators, and aims to help beginners get started with
SQL syntax.
Chapter 4 focuses on SQL syntax classification and further explains SQL state-
ments accordingly, covering data query, data update, data definition, and data
control.
Chapter 5 focuses on database security fundamentals, including basic security
management techniques for databases, such as access control, user management,
permission management, object permissions, and cloud auditing services, which are
elaborated in terms of basic concepts, usage, and application scenarios.
Contents
1 Introduction to Databases
1.1 Overview of Database Technology
1.1.1 Data
1.1.2 Database
1.1.3 Database Management System
1.1.4 Database System
1.2 History of Database Technology
1.2.1 Emergence and Development of Database Technology
1.2.2 Comparison of the Three Stages of Data Management
1.2.3 Benefits of Database
1.2.4 Development Characteristics of the Database
1.2.5 Hierarchical Model, Mesh Model and Relational Model
1.2.6 Structured Query Language
1.2.7 Characteristics of Relational Databases
1.2.8 Historical Review of Relational Database Products
1.2.9 Other Data Models
1.2.10 New Challenges for Data Management Technologies
1.2.11 NoSQL Database
1.2.12 NewSQL Database
1.2.13 Database Ranking
1.3 Architecture of Relational Databases
1.3.1 Development of Database Architecture
1.3.2 Single-Host Architecture
1.3.3 Group Architecture: Master-Standby Architecture
1.3.4 Group Architecture: Master-Slave Architecture
1.3.5 Group Architecture: Multi-Master Architecture
1.3.6 Shared Disk Architecture
1.3.7 Sharding Architecture
1.3.8 Shared-Nothing Architecture
Index
About the Author
Chapter 1
Introduction to Databases
1.1.1 Data
Data refers to raw records that have not been processed. Generally speaking, such
data is not clearly organized or classified, and thus cannot clearly express the
meaning of the things it represents. Data can be a pile of magazines, a stack of
newspapers, the minutes of a meeting, or a copy of medical records. Early computer
systems were primarily used for scientific calculations and dealt with numerical
data, that is, numbers in the generalized concept of data: not only integers such as
1, 2, 3, 4 and 5, but also floating-point numbers such as 3.14, 100.34 and 25.336.
1.1.2 Database
Database is a large collection of organized and shareable data stored in the computer
for a long time, with the following three characteristics.
(1) Long-term storage: The database should provide a reliable mechanism to support
long-term data storage, so that data can be recovered after a system failure and
data loss in the database is prevented.
(2) Organization: Data should be organized, described and stored according to a certain
data model. Model-based storage gives the data less redundancy, higher independence
and easier scalability.
(3) Shareability: The data in the database is shared and used by all types of users,
rather than being exclusive to a single user.
The student information database shown in Fig. 1.3 should be accessible
non-exclusively to different users such as students, teachers and parents
simultaneously.
storage space, and how to realize the linkage between the data. The basic goal of
data organization and storage is to improve storage space utilization, facilitate
data access, and provide a variety of data access methods to improve access efficiency.
(3) Data manipulation. The DBMS also provides a data manipulation language (DML),
with which users can manipulate data to perform such basic operations as query,
insertion, deletion and modification (see the example after this list).
(4) Transaction management and operation management of the database. The database
is managed and controlled in a unified manner by the DBMS during establishment,
operation and maintenance, to ensure the correct execution of transactions, the
security and integrity of data, the concurrent use of data by multiple users, and
system recovery after a failure.
(5) Database establishment and maintenance. This function covers the initial data
input and conversion of the database, database dump and recovery, database
reorganization, performance monitoring and analysis, etc. These functions are
usually implemented by certain programs or management tools.
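As a minimal illustration of these DML operations (the student table and its columns below are hypothetical, not taken from this book), the four basic operations can be written in SQL as follows.

-- Query: retrieve students older than 18
SELECT id, name, age FROM student WHERE age > 18;
-- Insertion: add a new row
INSERT INTO student (id, name, age) VALUES (1001, 'Zhang San', 20);
-- Modification: change an existing row
UPDATE student SET age = 21 WHERE id = 1001;
-- Deletion: remove the row
DELETE FROM student WHERE id = 1001;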
The database system (DBS) is a system for storing, managing, processing and
maintaining data composed of database, DBMS and its application development
tools, applications and database administrators.
In Fig. 1.5, the parts other than the user and the operating system are the
components of the database system.
Though the operating system is not a component of the database system, the DBMS
needs to call the interface provided by the operating system in order to access the
database.
specified in the application, but also the physical structure, including storage
structures, access methods, etc., had to be designed. Thus, on the one hand,
programmers had a very heavy workload; on the other hand, non-programmers were
unable to use the computer system.
(2) File system stage (from the late 1950s to the mid-1960s). In this phase, data was
organized into separate data files, which were accessed by file name and saved and
retrieved record by record, with file opening, closing and access support provided
by the file system.
(3) Database system stage (from the late 1960s to present). In the late 1960s,
database systems (proprietary software systems) emerged to allow large-scale
data management. In this stage, with the development of the times, hierarchical
databases, mesh databases, and classic relational databases have emerged suc-
cessively. In the last 20 years or so, emerging databases such as NoSQL and
NewSQL have also emerged.
data is stored repeatedly and data redundancy is high. Such a separate management
approach is prone to data inconsistency.
The lack of file independence means that the file serves a specific application and
the logical structure of the file is designed for this application. If the logical structure
of the data changes, the definition of the file structure in the application must be
modified, because the data depends on the application. In addition, files do not reflect
the intrinsic linkage between things in the real world because they are independent of
each other. From file system to database system, data management technology has
made a leap.
interfering with each other and affecting the results obtained from the access.
Data recovery refers to the function that the DBMS restores the database from an
error state to a known correct state when the database system has hardware
failure, software failure, operation error, etc.
The database has become one of the important foundations and core technologies of
computer information systems and intelligent application systems, as shown in
Fig. 1.7.
The development of database systems presents the following three characteristics.
(1) The database development is concentrated on the data model development. The
data model is the core and foundation of the database system, so the develop-
ment of the database system and the development of the data model are insep-
arable. How to divide the data model is an important criterion for database
system division.
(2) Intersection and combination with other computer technologies. With the continual
emergence of new computer technologies, intersecting and combining with them has
become a significant feature of the development of database systems, such as the
distributed database resulting from the combination with
Fig. 1.7 Applications and related technologies and models of database systems
The hierarchical model, mesh model and relational model are the three classical data
models that have emerged throughout history.
1. Hierarchical model
The hierarchical model presents a tree-like data structure, as shown in Fig. 1.8.
It has two very typical features.
(1) There is one and only one node without a parent node, which is called the root node.
(2) Each node other than the root node has one and only one parent node. This
hierarchical model is often used to represent common organizational structures.
2. Mesh model
The mesh model has a data structure similar to a network diagram, as shown in
Fig. 1.9. In the mesh model diagram, E represents an entity and R represents the
relation between entities. In the mesh model, more than one node is allowed to
have no parent node, and a node can have more than one parent node. As shown in
Fig. 1.9, E1 and E2 have no parent node, while E3 and E5 each have two parent
nodes. The mesh model is able to express many of the many-to-many relations found
in reality, such as students choosing courses and teachers teaching them.
3. Relational model
A strict concept of relation is the basis of the relational model; the relation
must be normalized and every component of the relation must be an indivisible data
item, as shown in Fig. 1.10 (a brief SQL sketch is given after this list).
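As a minimal sketch of what a normalized relation looks like in SQL (the table name and columns are hypothetical and chosen only for illustration), each column of the two-dimensional table holds an indivisible, atomic value:

CREATE TABLE student (
  student_id INT PRIMARY KEY,   -- each row (tuple) is uniquely identified
  name       VARCHAR(50),       -- an atomic value, not a nested structure
  birth_date DATE,
  class_id   INT                -- a linkage to another relation is itself just a value
);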
In 1970, Dr. Edgar Frank Codd, a researcher at IBM, published a paper entitled "A
Relational Model of Data for Large Shared Data Banks" in Communications of the ACM.
He introduced the concept of the relational model and laid the theoretical
foundation for it. Dr. Codd later published several articles on normal form
(normalization) theory and the 12 criteria for measuring relational systems, which
gave the relational model a solid mathematical foundation.
Built on set algebra, the relational model consists of a set of relations, each
with a normalized two-dimensional table as its data structure. As the student
Table 1.2 Comparison of hierarchical model, mesh model, and relational model

Data structure. Hierarchical model: a formatted model with a simple and clear
tree-like structure. Mesh model: a formatted model. Relational model: a
normalization-compliant model.

Data manipulation. Hierarchical model: a child node cannot be inserted without a
parent node, and child nodes are deleted when their parent node is deleted. Mesh
model: when nodes are added or deleted, the corresponding information (e.g.,
pointers) in the parent nodes must also be added or deleted. Relational model:
data operations are set operations, in which both the object and the result of an
operation are relations; data operations must satisfy the integrity constraints on
relations.

Data linkage. Hierarchical model: the linkage between data is reflected by the
access path. Mesh model: the linkage between data is reflected by the access path.
Relational model: the linkage between data is reflected by the relation.

Advantages. Hierarchical model: (1) simple and clear data structure; (2) high query
efficiency; (3) good integrity support. Mesh model: (1) a more direct description
of the real world, reflecting many-to-many relations; (2) good performance, with
high storage and access efficiency. Relational model: (1) based on a rigorous
mathematical theory; (2) a single concept, with relations representing both
entities and the linkages between them; (3) access paths transparent to the user,
giving a high degree of independence and confidentiality; (4) simplified
development work for programmers.

Disadvantages. Hierarchical model: (1) many non-hierarchical linkages that exist in
the real world are not suitable for representation by hierarchical models;
(2) representing many-to-many linkages generates a lot of redundant data;
(3) hierarchical commands tend to be procedural due to the tight structure. Mesh
model: (1) the structure becomes extremely complex as applications expand; (2) the
complexity of the object definition and manipulation language requires embedding in
high-level languages (COBOL, C), making it difficult for users to master and use;
(3) because multiple access paths exist, the user must understand the details of
the system structure, which increases the burden of writing code. Relational model:
(1) the hidden access path leads to less efficient queries than in the formatted
models; (2) optimization of the user's query is required.
time; there must be no consistency violation where Account A has been debited but
Account B has not been credited.
(3) Isolation. The execution of a transaction in the database cannot be interfered
with by other transactions; that is, the internal operations of a transaction and
its use of data are isolated from other transactions, and multiple concurrently
executing transactions cannot interfere with each other.
(4) Durability. Once a transaction is committed, its changes to the data in the
database are permanent. Subsequent operations or failures will not have any effect
on the result of the transaction. (A brief SQL sketch of these properties follows.)
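As a minimal, hypothetical SQL sketch of these properties (the account table and amounts are made up for illustration), a transfer from Account A to Account B is committed as a whole or not at all:

START TRANSACTION;
-- Deduct 100 from Account A
UPDATE account SET balance = balance - 100 WHERE account_id = 'A';
-- Add 100 to Account B
UPDATE account SET balance = balance + 100 WHERE account_id = 'B';
COMMIT;
-- Atomicity and consistency: either both updates take effect or neither does;
-- if anything fails before COMMIT, ROLLBACK undoes both changes.
-- Durability: once COMMIT succeeds, the transfer survives later failures.
-- Isolation: other concurrent transactions do not see the intermediate state.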
The introduction of the relational model was an epochal and significant event in the
history of database development. The great success in the research of relational
theory and the development of relational DBMS has further promoted the develop-
ment of relational database. The last 40 years have been the most “glorious” years for
relational databases, during which many successful database products have been
born, having a great impact on the development of society and our life. Some of the
relational database products are shown in Fig. 1.12.
(1) Oracle is the database product of Oracle Corporation, which is one of the most
popular relational databases in the world. In 1977, Larry Ellison and his
colleague Bob Miner founded Software Development Labs (SDL), and they
also developed the first version of Oracle in assembly language based on a
paper published by Dr. Codd (released to the public in 1979).
(2) Teradata is a database product launched by Teradata Corporation of the US. Its
first database computer, the DBC/1012, released in 1984, was the first
database-dedicated platform with a massively parallel processing (MPP)
architecture. The Teradata database was primarily available in the early days as an
all-in-one machine, positioned as a large data warehouse system. Proprietary
software and hardware gave it excellent OLAP performance, but it was very expensive.
(3) DB2 is the database product of IBM. DB2 is the main relational database product
promoted by IBM; at the beginning it served only IBM mainframes and minicomputers,
and then started to support Windows, UNIX and other platforms in 1995. It is named
DB2 because DB1 was a hierarchical database.
(4) Ingres was originally a relational database research project initiated by the
University of California, Berkeley in 1974, and the code of Ingres was avail-
able for free, so much commercial database software was produced based on it,
including Sybase, Microsoft SQL Server, Informix, and the successor project
PostgreSQL. It can be said that Ingres is one of the most influential computer
research projects in history.
(5) Informix was the first commercial Ingres product to appear in 1982, but was
later acquired by IBM in 2000 due to management failures by its owner. The
source code of Informix was then licensed to GBASE from China, which
developed the Chinese-made Gbase 8t product on the basis of its source code.
(6) Sybase is a database product of Sybase Inc. The company was founded in 1984,
named after the combination of the words “System” and “Database”, and one
of its founders, Bob Epstein, was one of the main designers of Ingres. Sybase
first proposed and implemented the idea of the Client/Server database archi-
tecture. The company began working with Microsoft in 1987 to develop the
Sybase SQL Server product. After the termination of the partnership, Microsoft
continued to develop the MS SQL Server and Sybase continued to develop the
Sybase ASE. Its subsequent relational database, Sybase IQ, designed especially
With the expansion of the database industry and the diversification of data objects,
the traditional relational database model begins to reveal many weaknesses, such as
poor identification capability for complex objects, weak semantic expression capa-
bility, and poor processing capability for data types such as text, time, space, sound,
image and video. For example, multimedia data are basically stored as binary data
streams in relational databases, but for binary data streams, the generic database has
poor identification capability and poor semantic expression capability, which is not
conducive to retrieval and query.
In view of this, many new data models have been proposed to adapt to the new
application requirements, specifically the following.
(1) Object oriented data model (OODM). This model, combining the semantic data
model and object-oriented programming methods, uses a series of object-
oriented methods and new concepts to form the basis of the model. However,
the OODM operation language is too complex, which increases the burden of system
upgrades for enterprises, and it is difficult for users to accept such a complex
way of working. As a result, OODM has not been as widely accepted as the relational
model except in some specific application markets.
(2) XML data model. With the rapid development of the Internet, there are a large
number of semi-structured and unstructured data sources. Extensible markup
language (XML) has become a common data model for exchanging data on the
Internet and a hot spot for database research, from which an XML data model for
semi-structured data was derived. A pure XML database, based on the XML node tree
model, supports XML data management, but it likewise has to solve the various
problems faced by traditional relational databases.
(3) RDF data model. Information on the Internet lacks a unified representation, so
the World Wide Web Consortium (W3C) proposed describing and annotating Internet
resources with the resource description framework (RDF). RDF is a markup language
for describing Internet resources, with the triple, consisting of a resource
(subject), an attribute (predicate) and an attribute value (object), as its basic
structure. Such a triple is also called a statement; an attribute value can be
either a resource or a literal (if it is a literal, it can only be an atomic value,
such as a number or a date), and the attribute describes the relationship between
the resource and the attribute value. A statement can also be represented as a
graph: a directed edge points from the statement's resource to the attribute value,
with the attribute on the edge; the attribute value of one statement can be the
resource of another statement. (A small relational sketch of triples follows this list.)
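As a rough illustration only (not an approach prescribed by this book), RDF triples are sometimes stored in a relational database as a simple three-column table; the table and sample statements below are hypothetical.

CREATE TABLE rdf_triple (
  subject   VARCHAR(200),   -- the resource being described
  predicate VARCHAR(200),   -- the attribute (relationship)
  object    VARCHAR(200)    -- the attribute value: another resource or a literal
);
INSERT INTO rdf_triple VALUES
  ('http://example.org/book1', 'title',  'Database Principles'),
  ('http://example.org/book1', 'author', 'http://example.org/person1');
-- All statements about one resource:
SELECT predicate, object FROM rdf_triple WHERE subject = 'http://example.org/book1';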
Although new data models are emerging from time to time, all of them have failed to
replace the relational database model as the common basic model for database
products due to problems such as lack of convenience and generality.
New challenges for data management technologies are as follows.
(1) With the automation, diversification and intelligence of data acquisition means,
the volume of data is soaring, so databases need to provide a high degree of
scalability and elasticity.
(2) The ability to deal with diverse data types is needed. Data can be classified into
structured data, semi-structured and unstructured data, including texts, graphics,
images, audio, videos and other multimedia data, stream data, queue data, etc.
Diverse data types require database products to develop the ability to deal with
multiple data types and to process heterogeneous data.
(3) The development of sensing, network and communication technologies has put
forward higher requirements for data acquisition and processing in real-time.
(4) With the advent of the big data era, data problems such as massive heterogene-
ity, complicated forms, high growth rate, and low value density have posed
comprehensive challenges to traditional relational databases. NoSQL technol-
ogy has flourished in response to the needs of big data development. Big data has
4V characteristics, as shown in Fig. 1.13.
The 4Vs are Volume (huge data volume), Variety (rich data types), Velocity (fast
generation speed), and Veracity (widely varying veracity). Volume: The volume of
data covered by Big Data processing is huge, having risen from the traditional
terabyte level to the petabyte level. Variety: Big data processing involves a wide
range of data types, where in addition to traditional structured data, Internet web
logs, videos, pictures, geolocation information, etc. can also be found; moreover,
semi-structured data and unstructured data also need to be processed. Velocity: The
high processing speed in Internet of Things (IoT) applications is particularly
significant, with the requirement for real-time processing. Veracity: Big data
processing pursues high quality data, i.e., mining valuable data from massive data
with a lot of noise, and due to low data value density, high-value information needs
to be mined among a large amount of low-value data.
To meet the challenges of the big data era, new models and technologies have sprung
up, typically the NoSQL database technology, which first emerged in 1998 as a
lightweight, open-source, non-relational database technology that does not provide
SQL functionality. By 2009, the concept began to return, but it was a completely
different concept compared to the original one. The NoSQL technology widely
accepted today stands for "Not Only SQL"; it no longer means rejecting SQL altogether.
Many different types of NoSQL database products have been created over the years,
and although they have different characteristics, their common features are that
they are non-relational, distributed, and not guaranteed to satisfy the ACID properties.
NoSQL databases have the following three technical features.
(1) Partitioning of data (Partition). It can distribute data across multiple nodes in a
cluster, and then conduct parallel processing on a large number of nodes to
achieve high performance; it also facilitates the scaling of the cluster by scaling
horizontally.
(2) Relaxation of the ACID consistency constraints. Based on the BASE principle, it
accepts eventual consistency and allows temporary inconsistency.
(3) Backup of each data partition. The general principle of triple backup is
followed to cope with node failures and improve system availability: three copies
of the data are kept, on the current node, on another node in the same rack, and on
a node in another rack, guarding against both node failures and rack failures. More
backups mean greater data redundancy; weighing security against redundancy, triple
backup is generally regarded as the most reasonable setting.
The four common types of NoSQL database technologies are divided by storage
model, including key-value database, graph database, column family database, and
document database, as shown in Fig. 1.14.
Table 1.3 briefly introduces the main NoSQL databases. Key-value databases are
generally implemented based on hash tables, with each key pointing to a value;
keeping keys in memory enables extremely efficient key-based query and write
operations, which suits application scenarios such as caching user information,
session information, configuration files and shopping carts (a rough relational
analogy is sketched below). Column family databases, document databases and graph
databases each have their own characteristics, but since this book is mainly
concerned with relational databases, they will not be covered here.
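As a rough analogy only (real key-value products are not implemented this way), the access pattern of a key-value store can be pictured in relational terms as a two-column table that is read and written solely by key; the names and values below are hypothetical.

CREATE TABLE kv_store (
  k VARCHAR(128) PRIMARY KEY,  -- the key, e.g., a session identifier
  v TEXT                       -- the value, e.g., serialized session data
);
-- Write: associate a value with a key
REPLACE INTO kv_store (k, v) VALUES ('session:1001', '{"user":"alice","cart":[42]}');
-- Read: a hash-table-style point lookup by key only
SELECT v FROM kv_store WHERE k = 'session:1001';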
NoSQL was not created to replace the relational DBMS (RDBMS); it has both
significant advantages and disadvantages, and it is designed to work alongside
RDBMSs to build a complete data ecosystem.
Since its introduction, NoSQL has been recognized for its high scalability and ease
of use; applying this scalability to traditional databases can greatly enhance
them. Therefore, relational databases that combine the scalability of NoSQL with
support for the relational model have been developed. This new type of database is
mainly oriented to online transaction processing (OLTP) scenarios, which have high
requirements for speed and concurrency. Because it uses SQL as its main language,
it is called a NewSQL database.
“NewSQL” is only a description of this class, not an officially defined name.
NewSQL database is a relational database system that supports the relational model
(including ACID features) while achieving the scalability of NoSQL, mainly ori-
ented to the OLTP scenario, allowing SQL as the primary language.
The classification of NewSQL databases is as follows.
(1) Databases re-constructed with new architecture.
In the early days when the data size was not too large, the database system used a
very simple stand-alone service, i.e., database software was installed on a dedicated
server to provide external data access services. However, as business expands, the
data volume in the database and the pressure on the service grow. This requires
the database architecture to change accordingly. The architecture classification
shown in Fig. 1.16 is a way to distinguish the database architecture according to
the number of hosts.
An architecture with only one database host is a single-host architecture, while an
architecture with more than one database host is a multi-host architecture. The single
host in the single-host architecture deploys both database application and database
on the same host; while the stand-alone host deploys them separately, with the
database exclusively on a separate database server. The multi-host architecture
enhances the availability and service capability of the overall database services by
increasing the number of servers. This architecture can be classified into two models
based on whether data shards are generated. One type is the group architecture, in
which, depending on the role of each server, the servers are further divided into
master-standby, master-slave and multi-master architectures. Regardless of the
grouping method, the databases share the same structure and store exactly the
same data, essentially replicating data between multiple databases with synchroni-
zation techniques. Another model is the sharding architecture, which spreads the
data shards within different hosts through a certain mechanism.
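As a loose, single-server analogy for spreading data by hashing a key (true sharding places the shards on different hosts, usually through middleware or the database's own cluster features; the table below is hypothetical), MySQL-style hash partitioning can be sketched as follows.

CREATE TABLE orders (
  order_id    BIGINT NOT NULL,
  customer_id INT NOT NULL,
  amount      DECIMAL(10,2),
  PRIMARY KEY (order_id)
)
PARTITION BY HASH(order_id)  -- rows are spread over 4 partitions by hash value
PARTITIONS 4;
-- In a real sharding architecture each partition (shard) would reside on a
-- different host, and a routing layer would send each query to the right host.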
To prevent application services and database services from competing for resources,
the single-host architecture evolved from the earlier single-host model to a
stand-alone database host, which separates the application services from the data
services. For the application services, the number of servers can be increased to
balance the load, thus enhancing the concurrency capability of the system.
Single-host deployment offers flexibility and ease of deployment and suits R&D,
learning and simulation environments, as shown in Fig. 1.17.
The LAMP (Linux, Apache, MySQL, and PHP) architecture of the early Internet
is a typical single-host architecture, with following obvious shortcomings.
(1) Poor scalability. The single-host architecture only supports vertical expansion,
improving performance by increasing the hardware configuration, but there is an
upper limit to the hardware resources that can be configured on a single host.
(2) Single point of failure. Expanding the single-host architecture often requires
downtime, during which the service is suspended. In addition, a hardware failure
can easily make the entire service unavailable, and can even cause
data loss.
(3) As business expands, the single-host architecture is bound to encounter perfor-
mance bottlenecks.
The master-standby architecture in the group architecture evolved from the
single-host architecture to solve the single point of failure problem, as shown in Fig. 1.18.
The database is deployed on two servers, where the server that undertakes the
data read/write service is called the host, and the other server, standby, is used as a
backup to copy the data from the host using the data synchronization mechanism.
Only one server provides data services at the same time.
This architecture has the advantage that the application does not require addi-
tional development to cope with database failures, plus it improves data fault
tolerance compared to a stand-alone architecture.
The disadvantage is the waste of resources: the standby has the same configuration
as the host, but its resources are basically idle. In addition, the performance
pressure is still concentrated on a single server, so the performance bottleneck is
not addressed. When a failure occurs, switching between the host and the standby
requires some manual intervention or monitoring. In short, this model only
addresses data availability and cannot break through the performance bottleneck;
performance is still limited by the hardware configuration of a single server and
cannot be improved by increasing the number of servers.
does not result in higher performance, and storage I/O can easily become a bottle-
neck affecting the overall system performance.
The massively parallel processing (MPP) architecture spreads tasks in parallel across
multiple servers and nodes. After the computation on each node is completed, the
results of each part are aggregated into the final result, as shown in Fig. 1.24.
The MPP architecture is characterized by the fact that the tasks are executed in
parallel, while the computation is distributed. Two minor variations exist here, one is
the non-shared host architecture and the other is the shared host architecture. In the
non-shared host architecture, all nodes are peer-to-peer, and data can be queried and
loaded by any node, which generally does not have performance bottlenecks and
single-point risks, but the technical implementation is more complex.
The common MPP architecture products are as follows.
(1) Non-shared host architecture: Vertica and Teradata.
(2) Shared host architecture: Greenplum and Netezza.
Teradata and Netezza are hardware-software all-in-one machines, while GaussDB
(DWS), Greenplum, and Vertica are software versions of MPP architecture databases.
The sharding architecture is the basis of the shared-nothing architecture; a
shared-nothing cluster is only possible if the data is sharded.
Data consistency. Single-host architecture: no data consistency issues.
Master-standby architecture: a data synchronization mechanism keeps the master and
standby in sync, but there are data latency issues and risks of data loss.
Master-slave architecture: the same as the master-standby mode; as the number of
slaves increases, the data latency issues and data loss risks become more
prominent. Multi-master architecture: data needs to be synchronized in both
directions between multiple hosts, so it is prone to data inconsistency; the shared
disk architecture with shared storage, however, does not have data consistency
issues. Sharding architecture: based on sharding technology, data is scattered
across the nodes and does not need to be synchronized between nodes, so there are
no data consistency issues.

Scalability. Single-host architecture: only vertical scaling is supported, and
hardware performance bottlenecks are encountered because there is a single host.
Master-standby architecture: only vertical scaling is supported, and hardware
performance bottlenecks are encountered because there is a single host.
Master-slave architecture: slaves can be scaled horizontally for better concurrent
read capacity. Multi-master architecture: good scalability, but increasing the
number of hosts leads to a dramatic increase in data synchronization complexity.
Sharding architecture: linear scaling is theoretically possible, so scalability is the best.
The concept of online analytical processing (OLAP) was first proposed by Edgar
Frank Codd in 1993 relative to OLTP system, and refers to the query and analysis
the development of NewSQL database technology. Readers can find related mate-
rials to expand their knowledge on their own.
1.5 Summary
This chapter introduces the basic concepts of databases and database management
systems, reviews the decades-long development history of databases, details the
development of databases from the early mesh and hierarchical models to the
relational model, and introduces the NoSQL and NewSQL concepts that have emerged in
recent years; it provides a comparative analysis of the main architectures of
relational databases and briefly explains the advantages and disadvantages of the
various architectures in different scenarios; finally, it introduces and contrasts
the mainstream OLTP and OLAP application scenarios of relational databases.
Through the study of this chapter, readers should be able to describe the concepts
related to database technology, enumerate the main relational databases,
distinguish different relational database architectures, and describe and identify
the main application scenarios of relational databases.
1.6 Exercises
1. [Multiple Choice] The characteristics of the data stored in the database are ( ).
A. Permanently stored
B. Organized
C. Independent
D. Shareable
2. [Multiple Choice] The components of the concept of a database system are ( ).
A. Database management system
B. Database
C. Application development tool
D. Application
3. [True or False] Database applications can read database files directly, without
using the database management system. ( )
A. True
B. False
4. [Multiple Choice] What are the stages in the development of data management? ( )
A. Manual stage
B. Intelligent system
C. File system
D. Database system
5. [Single Choice] In which data model is more than one node allowed to have no
parent node, and a node allowed to have more than one parent node? ( )
A. Hierarchical model
B. Relational model
C. Object-oriented model
D. Mesh model
6. [Multiple Choice] Which of the following are NoSQL databases? ( )
A. Graph database
B. Document database
C. Key-value database
D. Column family database
7. [True or False] The emergence of NoSQL and NewSQL databases can
completely subvert and replace the original relational database systems. ( )
A. True
B. False
8. [True or False] The master-standby architecture can improve the overall read/
write concurrency by separating read and write. ( )
A. True
B. False
9. [Single Choice] Which database architecture has good linear scalability? ( )
A. Master-slave architecture
B. Shared-nothing architecture
C. Shared disk architecture
D. Master-standby architecture
10. [True or False] The characteristic of the sharding architecture is that the data is
scattered across the database nodes of the cluster through a certain algorithm, and
the large number of servers in the cluster is exploited for parallel computing. ( )
A. True
B. False
11. [Multiple Choice] Test metrics used to measure OLTP systems include ( ).
A. tpmC
B. Price/tpmC
C. qphH
D. qps
12. [Multiple Choice] OLAP system is suitable for which of the following scenar-
ios? ( )
A. Reporting system
B. Online transaction system
C. Multi-dimensional analysis and data mining systems
D. Data warehouse
13. [True or False] OLAP system can analyze and process a large volume of data, so
it can also meet the processing performance requirements of OLTP for small
data volume. ( )
A. True
B. False
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-
nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license and indicate if you modified the licensed material.
You do not have permission under this license to share adapted material derived from this chapter or
parts of it.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 2
Basic Knowledge of Database
Various database products have different characteristics, but they share some
common ground in the main database concepts; that is, they all implement various
database objects and different levels of security protection measures, and they all
emphasize performance management and daily operation and maintenance management of
the database.
This chapter is about the main responsibilities and contents of database manage-
ment, and introduces some common but important basic concepts of databases to lay
a good foundation for the next stage of learning. After completing this chapter,
readers will be able to describe the main work of database management, including
distinguishing different backup methods, listing measures for security management
and describing the work of performance management, as well as describing the
important basic concepts of database and the usage of each database object.
storage, processes, threads, and all other hardware and software objects that are
available and at the disposal of the database. Competition refers to the demand for
the use of the same resource by multiple workloads at the same time, and this conflict
arises because the number of resources is less than the demand of the workload.
Database environment management covers such database operation and mainte-
nance management as installation, configuration, upgrade, migration and other
management work to ensure the normal operation of the IT infrastructure, including
database systems.
A database object is a general term for the various concepts and structures used to
store and point to data in the database. Object management is the management
process of creating, modifying or deleting various database objects using object
definition languages or tools. Basic database objects generally include tables,
views, indexes, sequences, stored procedures and functions, as shown in Table 2.1.
Database products themselves do not impose strict naming restrictions, but
arbitrary naming of objects leads to an uncontrollable and unmaintainable system,
and may even cause maintenance difficulties for the entire system. Developing a
naming convention is a basic requirement of database design, because a good naming
convention means a good start.
There are several suggestions for naming conventions, as follows (a brief example
is given after the list).
(1) Unify the case of names. The case can be standardized on a project basis, such
as all upper case, all lower case, or initial capitalization.
(2) Use prefixes to identify the type of object, such as the table name prefix
"t_", the view prefix "v_", the function prefix "f_" and so on.
(3) Try to choose meaningful, easy-to-remember, descriptive, short and unique
English words for naming; the use of Chinese Pinyin is not recommended.
(4) Use a name dictionary to define some common abbreviations on a project basis,
such as "amt" for "amount".
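A minimal sketch of these suggestions (the object names are hypothetical) might look as follows.

-- Table name prefixed with "t_", short meaningful English words, agreed abbreviation "amt"
CREATE TABLE t_customer (
  customer_id INT PRIMARY KEY,
  name        VARCHAR(50),
  credit_amt  DECIMAL(12,2)
);
-- View name prefixed with "v_"
CREATE VIEW v_customer_credit AS
  SELECT customer_id, credit_amt FROM t_customer;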
Some commercial databases set length limits for table names and view names in
early versions, for example, they cannot exceed 30 characters. Names that are too
long are not easy to remember or communicate, nor convenient when writing SQL code. Some
public database naming specifications can be used as a blueprint to develop some
industry- and project-oriented database naming conventions according to project
characteristics, as shown in Table 2.2.
There are many possible reasons for data loss, mainly storage media failure, user
operation errors, server failure, virus invasion, natural disasters, etc. Backing
up a database means additionally storing the data in the database and the relevant
information needed to ensure normal operation of the database system, so that it
can be used to restore the database after a system failure.
The objects of database backup include but are not limited to the data itself and
data-related database objects, users, permissions, the database environment
(profiles, scheduled tasks), etc. Data recovery is the activity of restoring a
database system from a failed or paralyzed state to an operational one in which
data is restored to an acceptable state.
For enterprises and other organizations, database systems and other application
systems constitute a larger information system platform, so database backup and
recovery is not independent, but should be combined with other application systems
to consider the overall disaster recovery performance of the whole information
system platform. This is the so-called enterprise-level disaster recovery.
Disaster backup refers to the process of backing up data, data processing systems,
network systems, infrastructure, specialized technical information and operational
management information for the purpose of recovery after a disaster occurs. Disaster
backup has two objectives, one is recovery time objective (RTO) and the other is
recovery point objective (RPO). RTO is the time limit within which recovery must
be completed after a disaster has stopped an information system or business function.
RPO is the requirement for the point in time to which the system and data must be
recovered after a disaster. For example, if the RPO requirement is one day, then
the system and data must be recoverable to their state 24 h before the failure
caused by the disaster, and the possibility of losing the data of those 24 h is
accepted. However, if the data can only be restored to the state of two days
earlier, that is, 48 h before, the requirement of RPO = 1 day is not satisfied. The
RTO emphasizes the availability of the service: the smaller the RTO, the smaller
the loss of service. The RPO targets data loss: the smaller the RPO, the less data
is lost. A typical disaster recovery goal of an enterprise is RTO < 30 min, with
zero data loss (RPO = 0).
China’s GB/T 20988-2007: Information security technology—Disaster recovery
specifications for information systems divides disaster recovery into six levels, as
shown in Fig. 2.1.
Level 1: Basic support. The data backup system is required to guarantee data backup
at least once a week, and the backup media can be stored off-site, with no specific
requirements for the backup data processing system and backup network system.
For example, it is required to store the data backup on a tape placed in another
location in the same city.
Level 2: Alternate site support. On the basis of meeting Level 1, it is required to
equip part of the data processing equipment required for disaster recovery, or to
deploy the required data processing equipment to the backup site within a
predetermined time after a disaster; it is also required to equip part of the
communication lines and corresponding network equipment, or to deploy the
required communication lines and network equipment to the backup site within a
predetermined time after a disaster.
Level 3: Electronic transmission and equipment support. It is required to conduct at
least one full data backup every day, and the backup media is stored off-site,
while using communication equipment to transfer critical data to the backup site
in batches at regular intervals several times a day; part of the data processing
equipment, communication lines and corresponding network equipment required
for disaster recovery should be equipped.
Level 4: Electronic transfer and complete device support. On the basis of Level 3, it
is required to configure all data processing equipment, communication lines and
corresponding network equipment required for disaster recovery, which must be
in ready or operational status.
Level 5: Real-time data transfer and complete device support. In addition to requir-
ing at least one full data backup per day and backup media stored off-site, it also
requires the use of remote data replication technology to replicate critical data to
the backup site in real time through the communication network.
Level 6: Zero data loss and remote cluster support. It is required to realize remote
real-time backup with zero data loss; the backup data processing system should
have the same processing capability as the production data processing system,
and the application software should be “clustered” and can be switched seam-
lessly in real time.
Table 2.3 exemplifies the disaster recovery levels defined by the Information security
technology—Disaster recovery specifications for information systems.
The higher the disaster recovery level, the better the protection of the information
system, but this also means a sharp increase in cost. Therefore, a reasonable disaster
recovery level needs to be determined for service systems based on the cost-risk
balance principle (i.e., balancing the cost of disaster recovery resources against the
potential loss due to risk). For example, the disaster recovery capability specified for
core financial service systems is Level 6, while non-core services are generally
specified as Level 4 or Level 5 depending on the scope of service and industry
standards; the disaster recovery level for SMS networks in the telecom industry is
Level 3 or Level 4. Each industry should follow the specifications to assess the
importance of its own service systems to determine the disaster recovery level of
each system.
Different databases provide different backup tools and means, but all involve
various “backup strategies”. Backup strategies are divided into full backup, differ-
ential backup and incremental backup according to the scope of data collection; or
into hot backup, warm backup and cold backup according to whether the database is
deactivated; or into physical backup and logical backup according to the backup
content.
Full backup, also called complete backup, refers to the complete backup of all
data and corresponding structures at a specified point in time. Full backup is
characterized by the most complete data and is the basis for differential and incre-
mental backups, as well as the most secure backup type, whose backup and recovery
time increases significantly with the increase in data volume. While important, full
backup also comes at a cost in time and expenses, and is prone to a performance
impact on the entire system.
The amount of data to be backed up each time for full backup is quite large and
takes a long time, so it should not be operated frequently, even with the highest data
security. Differential backup is a backup of data that has changed since the last full
backup. Incremental backup is a backup of the data that has changed after the
previous backup, as shown in Fig. 2.2.
Given that incremental backups have the advantage of not backing up data
repeatedly, each incremental backup involves a small volume of data and requires
very little time, but the reliability of each backup must be guaranteed. For example,
when a system failure occurs in the early hours of Thursday morning and the system
needs to be restored, the full backup on Sunday, the incremental backup on Monday,
the incremental backup on Tuesday, and the incremental backup on Wednesday
must all be prepared and restored in chronological order. If Tuesday’s incremental
backup file is corrupted, then Wednesday’s incremental backup will also fail, so that
only the data state at 12:00 PM on Monday can be restored.
The differential backup has the same advantage as the incremental backup: the
volume of data per backup is small and the backup time is short, and the
availability of the system data is easier to guarantee, because restoration only
needs the data from the last full backup and the most recent differential backup.
For example, if a failure occurs early
Thursday morning and the system needs to be restored, simply prepare the full
backup on Sunday and the differential backup on Wednesday.
In terms of the volume of data to be backed up, the largest volume of data to be
backed up is the full backup, followed by the differential backup and finally the
incremental backup. Usually, full backup + differential backup is recommended
when the backup time window allows. If the incremental data volume for differential
backup is larger and the backup operation cannot be completed within the allowed
backup time window, then the full backup + incremental backup can be used.
Hot backup is performed when the database is running normally, where read/
write operations can be performed on the database during the backup period.
Warm backup means that only database read operations can be performed during
the backup period, and no write operation is allowed, where the database availability
is weaker than hot backup.
Cold backup means that read/write operations are not available during the backup
period, and the backup data is the most reliable.
In the case that the database application does not allow the service to stop, a hot
backup solution must be used, but the absolute accuracy of the data cannot be
guaranteed. In the case where the read/write service of the application can be stopped
and the accuracy of the backup data is required, the cold backup solution is preferred.
For example, the hot backup solution should be used as much as possible for routine
daily backups, while a cold backup solution is recommended in the case of system
migration, so as to ensure data accuracy.
A physical backup is a direct backup of the data files corresponding to the
database or even the entire disk. Logical backup refers to exporting data from the
database and archiving the exported data for backup. The difference between the two
is shown in Table 2.4.
Backup portability means that the backup results of the database can be restored
to different database versions and database platforms. In terms of recovery
efficiency, the physical backup only needs to directly recover data files of data
blocks, which is highly efficient; the logical backup is equivalent to re-executing
SQL statements when recovering, so the system overhead is high and inefficient
when the data volume is large. Compared with the strong dependence of physical
backup on the physical format of logs, the logical backup is only based on logical
changes of data, which makes the application more flexible and enables cross-
version replication, replication to other heterogeneous databases, and customization
support when the table structure of source and target databases are inconsistent.
Logical backups only support backup to SQL script files. Logical backups take up
less space than physical backups, because the latter copy the underlying data
files. Logical backups also allow backing up only the metadata (object
definitions), in which case the backup result takes up the least amount of space.
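As a rough illustration (the table and rows are hypothetical), the result of a logical backup is essentially a SQL script; restoring means re-executing its statements.

-- Excerpt of a hypothetical logical backup script
CREATE TABLE t_customer (
  customer_id INT PRIMARY KEY,
  name        VARCHAR(50)
);
INSERT INTO t_customer (customer_id, name) VALUES (1, 'Alice');
INSERT INTO t_customer (customer_id, name) VALUES (2, 'Bob');
-- Replaying these statements rebuilds the table and its data, which is why logical
-- recovery is slower than copying data files but portable across versions and platforms.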
In a broad sense, the database security framework is divided into three levels:
network security, operating system security, and DBMS security.
(1) Network security. The main technologies for maintaining network security are
encryption technology, digital signature technology, firewall technology, and
intrusion detection technology. The security at the network level focuses on the
encryption of transmission contents. Before transmission through the network,
the transmission content should be encrypted, and the receiver should decrypt
the data after receiving it to ensure the security of the data in the transmission
process.
(2) Operating system security. Encryption aiming at securing the operating system
refers to the encryption of data files stored in the operating system, the core of
which is to ensure the security of the server, mainly in terms of the server’s user
accounts, passwords, access rights, etc. Data security is mainly reflected in encryption technology, the security of data storage and the security of data transmission, using technologies such as Kerberos, IPsec, SSL and VPN.
(3) DBMS security. The encryption aimed at DBMS security refers to the encryp-
tion and decryption of data in the process of reading and writing data by means
of custom functions or built-in system functions, involving database encryption,
data access control, security auditing, and data backup.
To summarize, all three levels of security involve encryption. The security at the
network level focuses on encryption of the transmission content, where the sender
encrypts the transmission content before the network transmission and the receiver
decrypts the information after receiving it, thus securing the transmission. The
encryption aiming at securing the operating system refers to the encryption of data
files stored in the operating system. The encryption aimed at DBMS security refers
to the encryption and decryption of data in the process of reading and writing data by
means of custom functions or built-in system functions.
Security control is to provide security against intentional and unintentional
damage at different levels of the database application system, for example:
(1) Encryption of access data → intentional illegal activities.
(2) User authentication and restriction of operation rights → intentional illegal operations.
(3) Improvement of system reliability and data backup → unintentional damage behavior.
The security control model shown in Fig. 2.3 is only a schematic diagram, while all
database products nowadays have their own security control models. When a user
needs to access a database, he or she first has to enter the database system. The user
provides his identity to the application, and the application submits the user’s
identity to the DBMS for authentication, after that, only legitimate users can proceed
to the next step. When a legitimate user is performing a database operation, the
DBMS further verifies that the user has such operation rights. The user can only
operate if he or she has operation rights, otherwise the operation will be denied. The
operating system also has its own protection measures, such as setting access rights
to files and encrypting storage for files stored on disk, so that the data is unreadable
even if it is stolen. In addition, it is possible to save multiple copies of data files, thus
avoiding data loss when accidents occur.
The authentication of database users is the outermost security protection provided
by the DBMS to prevent unauthorized users from accessing.
GaussDB (for MySQL) sets a password security policy for new database users created
on the client side.
The password length is at least eight characters.
The password should contain at least one uppercase letter, one lowercase letter,
one digit and one special character.
The password should be changed periodically.
Users obtain the appropriate rights to access the corresponding database objects if they have the corresponding roles.
For example, if User A wants to query the data of Table T, then we can grant User
A the right to query Table T directly, or we can create Role R, then grant the right to
view Table T to Role R, and finally grant Role R to User A.
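A minimal sketch of the two approaches in MySQL-compatible syntax is given below; the database name db1 and the exact user and role identifiers are assumed for illustration.

-- Approach 1: grant the query right on Table T directly to User A.
GRANT SELECT ON db1.T TO 'A'@'%';
-- Approach 2: create Role R, grant the right to Role R, then grant the role to User A.
CREATE ROLE 'R';
GRANT SELECT ON db1.T TO 'R';
GRANT 'R' TO 'A'@'%';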
Audit can help database administrators to find the vulnerabilities in the existing
architecture and its usage. Audit of users and database administrators is to analyze
and report on various operations, such as creating, deleting, and modifying
instances, resetting passwords, backing up and restoring, creating, modifying, and
deleting parameter templates, and other operations.
The levels of database audit are as follows.
(1) Access and authentication audit: analysis of database user’s login (log in) and
logout (log out) information, such as login and logout time, connection method
and parameter information, login path, etc.
(2) User and database administrator audit: analysis and reporting on the activities
performed by users and database administrators.
(3) Security activity monitoring: recording of any unauthorized or suspicious activ-
ities in the database and generation of audit reports.
(4) Vulnerability and threat audit: identification of possible vulnerabilities in the
database and the “users” who intend to exploit them.
The encryption of database is divided into two layers—the encryption of kernel layer
and the encryption of outer layer. Kernel-layer encryption means that the data is
encrypted or decrypted before physical access, which is transparent to the database
users. If encrypted storage is used, the encryption operation runs on the server side, which will increase the load on the server to some extent. Outer-layer encryption means developing special encryption and decryption tools, or defining encryption and decryption methods, which can control the granularity of the encrypted objects and perform encryption and decryption at the table or field level, so that users only need to focus on the range of sensitive information.
There are upper limits on the processing capacity of resources. For example, the disk
space is limited, and there are also upper limits on CPU frequency, memory size and
network bandwidth. Resources are divided into supply resources and concurrency
control resources. The supply resources, also called basic resources, are the
resources corresponding to computer hardware, including the resources managed
by the operating system, whose processing capacity is ordered as "CPU > memory >> disk > network". Concurrency control resources include but are not limited to locks, queues, caches, mutual exclusion semaphores, etc., which are also resources
managed by the database system. The basic principle of performance management is
to make full use of resources and not to waste them.
Unlike the even supply of resources, the use of resources is uneven. For example,
if a distributed system fails to choose a reasonable data slicing method, the nodes
with more data will be heavily loaded and their resources will be strained, but the
nodes with less data will be lightly loaded and their resources will be relatively
“idle”, as shown in Table 2.5.
1 ns = 10^-9 s
Resource bottlenecks can be exchanged. For example, a system with low I/O performance but sufficient memory can trade higher memory and CPU consumption for I/O. A system with limited network bandwidth can also improve the efficiency of data transfer by compressing the data before transfer, i.e., using CPU resources in exchange for network bandwidth.
(3) System optimization for sudden slowdown during system operation (emergency
processing). In the emergency processing scenario, performance problems never happen without a reason. Sudden performance changes are often caused by
code changes, such as put-into-production of newly developed business, new
requirement changes, DDL changes, unexpected configuration changes, data-
base upgrades, etc. Generally, this kind of problem has a high degree of urgency,
which often requires the intervention of experienced personnel and quick
response.
(4) Performance optimization for the situation where the system suddenly becomes
slow and then returns to normal after a period of time. This is generally due to
bottlenecks that limit throughput during peak periods, and capacity expansion is
the simplest way to solve it. However, due to the extra investment and time
period involved, this method needs to be supported by sufficient resources. A
more natural solution is to reduce the number of operations per unit (concur-
rency control) or to reduce the resource consumption per unit of operation.
(5) System optimization based on the reduction of resource consumption. In this
scenario, the whole system generally does not suffer from obvious performance
problems, but rather emphasizes the effectiveness of resource usage, which is
relatively well-timed and less stressful. For example, to analyze and optimize the
top ten jobs that consume the most resources and have the longest response time
in system application.
(6) Preventive daily inspection. Inspection work is generally applicable to scenarios
where the whole system does not have obvious performance problems.
The data to be collected for performance management include CPU usage data,
space utilization, users and roles using the database system, response time of
heartbeat queries, performance data submitted to the database with SQL as the
basic unit, and job-related performance data submitted by database tools (such as
load, unload, backup, restore, etc.). As far as the timing of data collection is
concerned, some daily data collection can be arranged, or data collection can be
carried out during the time period when users use the system intensively in one day,
or during the time period when the system pressure is relatively high.
After data collection is completed, corresponding performance reports need to be
generated. For example, periodic performance reports or performance trend analysis
reports. There are many monitoring reports that can be extracted in the database
system, for example, regular performance reports (daily, weekly and monthly
reports) can be established by using performance-related data; the performance
trend analysis report can be established by using common indicators to obtain an
intuitive display of the current system performance; you can also generate reports of
specific trend types, such as reports based on abnormal events, reports of SQL or
jobs that consume a lot of resources, reports of resource consumption of specific
users and user groups, and reports of resource consumption of specific applications.
Built-in resource views and monitoring reports are advanced features; they are provided by some databases but not available in others.
1. Database installation
The basic principles adopted by different database products are similar, but each
product has its own characteristics and precautions, which users need to under-
stand and learn before installation.
The first is the installation of the database, the process of which is shown in
Fig. 2.4.
Database installation requires some basic preparations in advance.
2. Database uninstallation
Before the database is upgraded, it is necessary to uninstall and clean up the
old version of the database. The basic steps of traditional database uninstallation
are as follows.
(1) (Optional) Make a full backup of the database.
(2) Stop the database service.
(3) Uninstall the database.
The basic steps of cloud database uninstallation are as follows.
(1) (Optional) Make a full backup of the database.
(2) Delete the data instance from the cloud platform.
The uninstallation approaches are similar for single-host, master-standby,
and one-master-multi-standby architectures, and the uninstallation operations
to be performed on each node are the same. Uninstallation of distributed
clusters generally uses proprietary uninstallation tools. Some users need to
destroy the data on the storage media after uninstalling the database in order to
prevent data leakage.
3. Database migration
Database migration requires different migration schemes to be designed according to different migration scenarios, and several factors need to be considered.
4. Database expansion
The capacity of any database system is determined after estimating the volume
of data in the future at a certain time point. When determining the capacity, not
only the volume of data storage should be considered, but also the following
shortcomings should be avoided:
(1) Inadequacy of computing power (average daily busy level of CPU of the
whole system > 90%).
(2) Insufficient response and concurrency capability (QPS and TPS are significantly reduced, failing to meet the SLA).
SLA is the abbreviation of Service Level Agreement. When signing a contract with a
customer, some performance commitments are generally made to the customer, for
example, the database system provided should be able to meet 10,000 queries/s, the
response time for a single query should not exceed 30 ms, and other database-related service indicators should be met. The SLA may also include service commitments such as 7×24 response.
5. Routine maintenance
In order to carry out routine maintenance, a more rigorous work plan should be
formulated for each job, and implemented to check the risks and ensure the safe
and efficient operation of the database system.
Database troubleshooting mainly involves the following matters.
(1) Configure database monitoring indicators and alarm thresholds.
(2) Set the alarming process for the fault events at each level.
(3) Receive the alarm and locate the fault according to the logs.
(4) Record the original information in detail for the problems encountered.
(5) Strictly abide by the operating procedures and industry safety regulations.
(6) For major operations, the operation feasibility should be confirmed before
operation, and the operation personnel with authority should perform them
after the corresponding backup, emergency and safety measures are in place.
Database health inspection mainly involves the following matters.
(1) View health inspection tasks.
(2) Manage health inspection reports.
(3) Modify the health inspection configuration.
A database system is built to manage data, and a database is actually a collection of data, expressed as a collection of physical operating system files or disk data blocks, such as data files, index files and structure files. But
not all database systems are file-based, there are also databases that write data
directly into memory.
Database instance refers to a series of processes in the operating system and the
memory blocks allocated for these processes, which are the channels to access the
database. Generally speaking, a database instance corresponds to a database, as
shown in Fig. 2.5.
A database is a collection of physically stored data, and a database instance is the
collection of software processes, threads and memory that access data. Oracle is
process-based, so its instance refers to a series of processes; and a MySQL instance
is a series of threads and the memory associated with the threads.
Multi-instance is to build and run multiple database instances on a physical
server, each using a different port to listen through a different socket, and each
having a separate parameter profile. Multi-instance operation can make full use of
hardware resources and maximize the service performance of the database.
Distributed database presents unified instances, and generally does not allow
users to directly connect to instances on data nodes. A distributed cluster is a set
of mutually independent servers that form a computer system through a high-speed
network. Each server may have a complete copy or a partial copy of the database,
and all servers are connected to each other through the network, together forming a
complete global large-scale database that is logically centralized and physically
distributed.
Multi-instance and distributed cluster are shown in Fig. 2.6.
into the connection pool for the next user to request. Connection creation and
disconnection are managed by the connection pool itself, and the initial number of
connections, the upper and lower limits of the number of connections, the maximum number of uses per connection, and the maximum idle time can be controlled by setting
the parameters of the connection pool. However, there are alternatives, that is,
monitoring the number and usage of database connections through its own manage-
ment mechanism. Connections also vary by database product. Oracle’s connection
overhead is large, while MySQL’s is relatively small. For highly concurrent service
scenarios, if there are many connections accumulated, the overall connection cost of
the whole database should also be considered by database administrators.
2.2.3 Schema
Schema is a collection of related database objects that allows multiple users to share
the same database without interfering with each other. The schema organizes database objects into logical groups for easier management and forms namespaces to avoid object name conflicts. A schema contains tables, other database objects, data
types, functions, operators, etc.
“table_a” shown in Fig. 2.9 indicates tables with the same name. Since they
belong to different schemas, they are allowed to use the same name, but in fact they
may store different data and have different structures. When accessing one of the
tables with the same name, it is necessary to specify the schema name to explicitly
point to the target table.
2.2.4 Tablespace
Tablespace is composed of one or more data files, with which you can define where
database object files are stored. All objects in the database are logically stored in the
tablespace, and physically stored in the data files belonging to the tablespace.
The function of a tablespace is to arrange the physical storage location of data according to the usage pattern of database objects, so as to improve database performance. Frequently used indexes are placed on disks with stable performance and fast access speed, while tables that are used less frequently and require lower access performance (for example, archived data) are placed on slower disks.
You can also specify the physical disk space occupied by data through
tablespaces and set the upper limit of physical space usage to avoid running out of
disk space.
In view of the fact that tablespaces correspond to physical data files, tablespaces can actually associate data with storage, and the tablespaces themselves specify the storage locations of database objects such as tables and indexes in the database.
After the database administrator creates a tablespace, he or she can refer to it when
creating database objects.
Figure 2.10 shows the tablespaces of GaussDB (for MySQL). The system predefines six tablespaces, namely the SYSTEM, TEMP, TEMP2, TEMP2_UNDO, UNDO and USERS tablespaces, as shown in Table 2.6.
The TEMP tablespace holds the intermediate result sets of SQL statements and is used by users' ordinary temporary tables. When executing DML (insert, update and
delete, etc.) operations, the old data generated before the execution of the operation
will be written to the UNDO tablespace, which is mainly used to implement
transaction rollback, database instance recovery, read consistency and flashback
queries.
2.2.5 Table
GaussDB (for MySQL) supports the creation of temporary tables, which are used
to hold the data needed for a session or a transaction. When a session exits or a user commits or rolls back a transaction, the data in the temporary table is automatically
cleared, but the table structure remains.
The data in a temporary table is temporary and procedural, with no need to be
retained permanently like a normal data table.
Temporary tables cannot be displayed using the [SHOW TABLES] command.
To avoid deleting a permanent table with the same table name, you can use the
[DROP TEMPORARY TABLE staff_history_session;] command when performing
a delete of the table structure.
The data in the temporary table exists only for the life of the session and is
automatically cleared when the user exits the session and the session ends, as shown
below.
Temporary tables with the same name can be created for different sessions. The
name of the temporary table can be the same as the name of the permanent table.
Execute the following command to create a temporary table in GaussDB (for
MySQL):
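A minimal sketch is given below, using the table name mentioned earlier; the column definitions are assumed for illustration.

CREATE TEMPORARY TABLE staff_history_session
(
    staff_id   INT,
    staff_name VARCHAR(50),
    leave_date DATE
);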
According to the way data is stored, tables are divided into row store and column
store, as shown in Fig. 2.11. GaussDB (for MySQL) currently supports only row
store, while GaussDB (DWS) supports both row store and column store. The default store mode is row store; the two modes differ only in how the data is physically stored. From
the presentation form of tables, the tables in the two store modes still hold two-dimensional data, which accords with the relational theory of relational databases.
A table in the form of row store (row store table) stores the data of the same row in adjacent physical locations, so a record can be written in a single operation when performing INSERT and UPDATE operations; but when querying, even if only a few columns are queried, all the data of the row will be read.
The table in the form of column store (column store table) first splits the rows
when writing data, at which time a row is split into multiple columns, and then the
data of the same column is stored in an adjacent physical area. Therefore, in the column store mode, the number of writes for a row record is obviously higher than in the row store mode. This increase in the number of writes leads to higher overhead and poorer performance of the column store table compared with the row store table
when performing INSERT and UPDATE operations. However, when querying,
column store tables just scan the columns involved and then read them, so the I/O
scanning and reading range are much smaller than row store tables. Column-store
query can eliminate irrelevant columns. If only a few columns need to be queried, it
can greatly reduce the amount of data to be queried, and then speed up the query. In
addition, in column store tables each column holds data of the same data type, and data of the same type can be compressed by a lightweight compression algorithm
to achieve a good compression ratio, so the space occupied by the column store table
is relatively small.
Row store tables, on the other hand, are difficult to compress because the field
types of the tables are not uniform and cannot be compressed dynamically unless
they are confirmed in advance.
Regarding the choice of store mode, row store is the default store mode. The
scenarios for which column store is suitable are mainly queries of statistical analysis
type (scenarios with a lot of GROUP and JOIN operations), OLAP, data mining and
other application query scenarios that make a lot of query requests. One of the main
advantages of column storage is that it can greatly reduce the I/O occupation of the
system in the reading process, especially when querying massive data, I/O has
always been one of the main bottlenecks of the system. Row store is suitable for
scenarios such as point queries (simple queries with fewer returned records and
based on indexes), lightweight transactions like OLTP, and scenarios that involve a
lot of write operations and more data additions, deletions and changes. Row store is
more suitable for OLTP, such as the traditional applications based on addition,
deletion, change and check operations. Column store is more suitable for OLAP,
and is well suited to play a role in the field of data warehousing, such as data
analysis, mass store and business intelligence, which mainly involves infrequently
updated data.
2.2.7 Partition
A partitioned table is obtained by dividing the data of a large table into many small
subsets of data. The main types of partitioned tables are as follows.
(1) Range-partitioned table: The data is mapped to each partition based on a range
determined by the partition key specified when the partition table is created. This
is the most commonly used partition method, and the date is often used as the
partitioning key, for example, the sales data is partitioned by month.
(2) List-partitioned table: The data is mapped to each partition based on a list of discrete values of the partition key, which partitions a huge table into small manageable blocks.
(3) Hash-partitioned tables: In many cases, users cannot predict the range of data
changes on a particular column, and therefore cannot create a fixed number of
range partitions or list partitions. In this case, hash-partitioned tables provide a
way to divide the data equally among a specified number of partitions, so that
the data written to the table is evenly distributed among the partitions; however,
the user cannot predict which partition the data will be written to. For example, if
the sales cities are spread all over the country, it is difficult to partition the table
in a list, and then the table can be hash-partitioned.
(4) Interval-partitioned table: It is a special kind of range-partitioned table. For
ordinary range partition, users will pre-create partitions, and if the inserted
data is not in the partition, the database will report an error. In this case, the
user can add the partition manually or use the interval partition. For example, the
user can use the range-partitioned table in the way of one partition per day, and
create a batch of partitions (e.g. 3 months) for subsequent use when the service is
deployed, but the partitions need to be created again after 3 months, otherwise
the subsequent service data entry will report an error. This approach of range
partition increases maintenance costs and requires the kernel to support auto-
matic partition creation. But with interval partition, the user does not need to care
about creating subsequent partitions, which reduces partition design and main-
tenance costs.
Example: The code for range-partitioning a date is as follows.
CREATE TABLE tp
(
id INT,
name VARCHAR(50),
purchased DATE
)
PARTITION BY RANGE( YEAR(purchased))
(
PARTITION p0 VALUES LESS THAN (2015),
PARTITION p1 VALUES LESS THAN (2016),
PARTITION p2 VALUES LESS THAN (2017),
PARTITION p3 VALUES LESS THAN (2018),
PARTITION p4 VALUES LESS THAN (2019),
PARTITION p5 VALUES LESS THAN (2020)
);
Scenario 1 (Row 1 in Table 2.8): The data satisfying the query condition is within several partitions, so when the query statement scans data, it will only search the specific partitions instead of scanning the whole table, thanks to partition pruning. In general, the I/O overhead of partition scanning is n/m compared to
scanning the entire table, where m is the total number of partitions and n is the
number of partitions that satisfy the WHERE condition.
Scenario 2 (Row 2 in Table 2.8): Inserting data into an empty partition is similar to
loading data into an empty table, and the efficiency of inserting data is higher with
this internal implementation.
Scenario 3 (Row 3 in Table 2.8): If data is to be deleted or truncated, the data in some
partitioned tables can be processed directly because the quick positioning and
deletion function of partition makes the processing much more efficient than the
scenario without partition.
The data tables of GaussDB (DWS) distributed database are scattered on all data
nodes (DNs), so you need to specify the distribution columns when you create the
tables, as shown in Table 2.9.
The sample code for the Hash distribution is as follows.
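A minimal sketch of a hash-distributed table in GaussDB (DWS) syntax is given below; the table and column names are assumed for illustration.

CREATE TABLE sales_info
(
    sale_id INT,
    city    VARCHAR(50),
    amount  DECIMAL(10,2)
)
DISTRIBUTE BY HASH(sale_id);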
The data in the database is classified into basic data, compound data, serial number
data and geometric data. Basic data includes numerical value, character, binary data,
date and time, Boolean data, enumeration data, etc., as shown in Table 2.10.
2.2.10 View
Unlike the base table, a view is not physically present, but is a dummy table. If the
data in the base table changes, then the data queried from the view will also change.
In this sense a view is a window through which the data of interest to the user in the
database and its changes can be seen; the view's defining query is executed each time the view is referenced.
“author_v1” shown in Fig. 2.14 is vertically split data, only two columns in the
base table are visible, and other columns are not visible through the view;
“author_v2” is horizontally split data, only all data in the table with age values
greater than 20 are visible, but all columns are visible. No matter how to split, the
data of “author_v1” and “author_v2” views are not really stored in the database.
When the user accesses the view through the SELECT statement, the user accesses
the data in the underlying base table through the view, so the view is called a
“dummy table”. To the user, accessing a view is exactly the same as accessing a
table.
The main functions of a view are as follows.
(1) Simplifies operations. When querying, we often have to use aggregate functions
and display information about other fields, and we may need to associate other
tables, which results in a long statement to write. If such an operation happens frequently, we can create a view, after which we only need to execute a SELECT * FROM view statement.
(2) Improves security. Users can only query and modify the data they see, because
the view is virtual, not physically present, and it just stores a collection of data.
The view is a dynamic collection of data, and the data is updated as the base table
is updated. We can present the important field information in the base table to the
user through the view, but the user cannot change and delete the view at will to
ensure the security of the data.
(3) Achieves logical independence and shields the impact from the structure of real
tables. Views allow the application and database tables to be somewhat inde-
pendent of each other. Without a view, the application must be built on top of the
table; but with a view, the application can be built on top of the view. The
application is separated from the database table by the view.
The following sample code encapsulates more complex logic through views.
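A minimal sketch is given below; it assumes a base table abc with columns a, b and c, and encapsulates the filter logic in a view named v_abc (the name used in the update example that follows).

CREATE VIEW v_abc AS
SELECT a, b
FROM abc
WHERE c = 'VALID' AND b IS NOT NULL;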
The user uses the same simplified SQL query statement as the normal table, with
the code shown below.
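A minimal sketch of querying the view in the same way as an ordinary table:

SELECT * FROM v_abc WHERE b = 'xxxx';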
This form is called a simple view, which enables the modification of the table
through the view, for example, using the "UPDATE v_abc SET a='101' WHERE b='xxxx';" statement.
However, if the view has aggregate functions, summary functions, or GROUP
BY grouping calculations, or if the view is a result view with multiple table
associations, they are complex views that cannot be used to make changes to the
base table data.
2.2.11 Index
An index provides pointers to data values stored in specified columns of a table, like
a table of contents of a book. It can speed up table queries, but also increase the
processing time of insertion, update, and deletion operations.
If you want to add an index to a table, then which fields the index is built on is a
question that must be considered before creating the index. It is also necessary to
analyze the service processing of the application, data usage, fields that are often
used as query conditions or required to be sorted, so as to determine whether to
establish an index.
When creating indexes, the following suggestions are used as a reference.
(1) Create indexes on columns that are frequently required to be searched and
queried, which can speed up the search and query.
(2) Create an index on a column that is used as the primary key, which emphasizes the
uniqueness of the column and organizes the arrangement structure of the data in
the table.
(3) Create indexes on columns that often need to be searched based on ranges as the
ordering of indexes can ensure the continuity of the specified ranges.
(4) Create indexes on columns that need to be ordered frequently as the ordering of
indexes can reduce query time.
(5) Create indexes on the columns that often use the WHERE clause to speed up the
judgment of the condition.
(6) Create indexes for fields that often follow the keywords ORDER BY, GROUP
BY, and DISTINCT.
A created index is not necessarily used; after the index is successfully created, the system automatically decides when to use it. Indexes are used when the system considers them faster than a sequential scan. Successfully created indexes must be kept synchronized with their tables to ensure that new data can be found accurately, which increases the load of data operations. We also need to remove useless indexes periodically, and we can check the execution plan with the EXPLAIN statement to determine whether an index is used.
The indexing methods are shown in Table 2.11.
If a table declares a unique constraint or primary key, a unique index (possibly a
multi-field index) is automatically created on the fields that make up the unique
constraint or primary key to implement those constraints.
Create a normal index
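A minimal sketch in MySQL-compatible syntax follows; the table and column names are assumed for illustration.

CREATE INDEX idx_staffs_name ON staffs (name);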
2.2.12 Constraints
Data integrity refers to the correctness and consistency of data. Integrity constraints
can be defined at the time of defining a table. Integrity constraint itself is a rule that
does not occupy database space. Integrity constraints are stored in the data dictionary
together with the table structure definitions.
Figure 2.15 shows the common constraint types, as follows.
(1) Unique (UNIQUE) and primary key (PRIMARY KEY) constraints. When all
values in the field will not have duplicate records, you can add unique constraints
to the corresponding fields, such as ID card field and employee number field. If a
table does not have a unique constraint, then duplicate records can appear in the
table. If the fields can be guaranteed to satisfy the unique constraint and not-null
constraint, then the primary key constraint can be used, and usually a table can
only have one primary key constraint.
(2) Foreign key (REFERENCES) constraint is used to establish a relationship between two tables, for which it is necessary to specify which column of the primary table is referenced.
(3) Check constraint is a constraint on the range of legal values in a field. For
example, the balance in the savings account table is not allowed to be negative,
so a check constraint can be added to the balance field so that the balance field takes a value >= 0.
(4) Not-null constraint. If the current field should not have null values or unknown
data in service sense, you can add not-null constraints to ensure that the inserted
data are all not-null data, such as the ID card field of personal information.
(5) Default constraint. When inserting data, if no specific value is given, then the
default constraint will be used to give a default initial value, for example, if the
default value of the initial member’s rank is 0, when a new member record is
added, the member’s rank will be 0.
If the field values can be filled in from the service level, then it is recommended that
the default constraint not be used to avoid unintended results when the data is loaded.
Add a not-null constraint to a field that clearly cannot have a null value, and the optimizer will automatically optimize it. Explicitly name the constraints that are allowed to be explicitly named; explicit naming is supported for all types of constraints except not-null and default constraints.
If a default constraint is used, it is actually assigned by default for some unex-
pected cases. Such default values may hide potential problems, so default constraints should be used with particular caution in OLAP scenarios.
2.2.13 Transaction
(1) Explicit commit: Transactions have explicit start and end marks.
(2) Implicit commit: Each data operation statement automatically becomes a trans-
action. GaussDB (for MySQL) adopts implicit COMMIT by default, without
adding COMMIT statement, and each statement is regarded as an automatic
commit of transaction.
Implicit commit can be turned off with the SET autocommit = 0 statement.
The code to set explicit commit is as follows.
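A minimal sketch in MySQL-compatible syntax is given below; the account table and its columns are assumed for illustration.

SET autocommit = 0;      -- turn off implicit commit
START TRANSACTION;       -- the transaction starts explicitly here
UPDATE account SET balance = balance - 200 WHERE id = 'A';
UPDATE account SET balance = balance + 200 WHERE id = 'B';
COMMIT;                  -- explicit end mark (or ROLLBACK to undo the changes)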
Figure 2.17 shows the specific operations of transaction commit and rollback.
GaussDB (for MySQL) is an OLTP database that adopts an explicit transaction
processing model, but it does not provide a statement that explicitly defines the
transaction start, instead, it takes the first executable SQL as the transaction start.
Dirty read is a kind of data inconsistency that you may face: one transaction reads data that has not yet been committed by another transaction. The uncommitted data is called "dirty" data because of the possibility of rollback.
The transaction T1 shown in Fig. 2.18 transfers $200 from Account A to
Account B, where the initial balance of Account A is $1000 and the initial balance
of Account B is $500.
Fig. 2.18 Dirty read. (a) Transaction T1 changes the value of A from 1000 to 800 and changes the
value of B from 500 to 700, but has not yet committed the transaction. (b) At this time, Transaction
T2 starts to read the data, and gets A of value 800 modified by the transaction T1. (c) Transaction T1
is rolled back, but because it is not committed, A recovers to the initial value 1000, while the value
of B is 500; at this time, for Transaction T2, the value of A is still 800. This case is dirty read, that is,
Transaction T2 reads data that has not been committed by Transaction T1.
The ANSI SQL standard defines 4 transaction isolation levels to avoid 3 kinds of
data inconsistency. The transaction levels, from high to low, are shown below.
(1) Serializable. All transactions in the system are executed one by one in a serial
manner, so all data inconsistencies can be avoided. However, this serializable
execution method of controlling concurrent transactions in an exclusive manner
will lead to queuing of transactions, which significantly reduces the concurrency of the system, so it should be used with great caution.
Here serialization means that all operations are serially queued, for example:
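A minimal sketch in MySQL-compatible syntax is given below, with the account table assumed for illustration; under the serializable level, conflicting transactions are effectively queued one after another.

SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
SELECT balance FROM account WHERE id = 'A';   -- other transactions writing this row must wait
COMMIT;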
2.3 Summary
This chapter describes the core objectives of database management, and introduces
the scope of database management work, explaining the basic concepts of database
object management, backup recovery, and disaster recovery levels, as well as the
important concepts of database. Some concepts that tend to be confused are com-
pared and explained, and the important but rather obscure concepts are introduced
and analyzed based on scenarios.
2.4 Exercises
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-
nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license and indicate if you modified the licensed material.
You do not have permission under this license to share adapted material derived from this chapter or
parts of it.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 3
Getting Started with SQL Syntax
Data type is a basic attribute of data. Data are generally divided into common data
and uncommon data. Common data types include numeric value, character, date and
time, and so on. Uncommon data types include Boolean data, spatial data, JSON
data, etc.
Different data types occupy different storage space, and can perform different oper-
ations. The data in the database is stored in the data tables. Each column in the data
table is defined with the data type. When storing data, the user must comply with the
attributes of these data types, otherwise errors may occur.
1. Numeric Value
The numeric value types available in GaussDB (for MySQL) database include
integer, floating-point number and fixed-point number, which support basic
32-bit integer and 64-bit integer.
(1) There are five types of integers, as shown in Table 3.1.
INTEGER (32-bit signed integer) occupies 4 bytes, with value range from -2^31 to 2^31-1, which can be expressed by the keywords INT, INTEGER, BINARY_INTEGER, INT SIGNED, INTEGER SIGNED, SHORT, SMALLINT and TINYINT. BIGINT (64-bit signed integer) occupies 8 bytes, with value range from -2^63 to 2^63-1, which can be expressed by the keywords BIGINT, BINARY_BIGINT and BIGINT SIGNED.
(2) Floating-point numbers are divided into two types as follows.
FLOAT: single-precision floating-point number occupying 4 bytes, with
8-bit precision.
DOUBLE: double-precision floating-point number occupying 8 bytes,
with 16-bit precision.
(3) The fixed-point numbers occupy 4–24 bytes, with actual length related to the
effective number it represents, and with value range from -1.0E128 to
1.0E128, which can be expressed by keywords DECIMAL and NUMERIC.
They are in the following syntax format, requiring s ≤ p.
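The notation is sketched below.

DECIMAL(p, s)    -- p: total number of significant digits; s: number of digits after the decimal point
NUMERIC(p, s)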
If the values of “p” and “s” are not specified, “p” defaults to 10, meaning
that there is no restriction on the value after the decimal point. If the value of "s" is not specified or s = 0, the fixed-point number has no decimal part.
2. Character
The character types supported by GaussDB (for MySQL) are CHAR,
VARCHAR, BINARY, VARBINARY, TEXT, BLOB, ENUM, and SET.
Under the default encoding set “utf8mb4”, Chinese characters occupy 3 bytes,
numeric and English characters occupy 1 byte, and other characters occupy up to
4 bytes. The characters are divided into fixed-length strings and variable-length
strings.
CHAR(n) is used to store fixed-length bytes or strings, with the n indicating
the length of the string, and taking values from 0 to 255. If the length of the input
string is less than n, the right end will be made up with spaces. For example,
CHAR(4) will occupy 4 bytes no matter how many characters are input.
VARCHAR(n) is used to store variable-length bytes or strings, with the
n indicating the length of the string, and taking values from 0 to 65535. If the
length of the input string is less than n, there is no need to make up with spaces.
The number of bytes occupied by VARCHAR is the actual number of characters
input + 1 byte (n ≤ 255) or 2 bytes (n > 255), so VARCHAR(4) occupies 4 bytes
when 3 English characters are input.
In the string comparison between CHAR and VARCHAR, the case sensitivity and the
spaces at the end are ignored.
To store date and time data that does not involve a time zone, you can use the DATETIME, DATE and TIMESTAMP types, which can
all indicate year, month, day, hour, minute and second information; however,
unlike DATE and DATETIME which support up to seconds, TIMESTAMP can
support up to microseconds.
YEAR can also be expressed as a two-digit string “YY”, ranging from 00 to
99, among which, values of 00–69 and 70–99 are converted to YEAR values of
2000–2069 and 1970–1999.
The value range of DATETIME/DATE is [0001-01-01 00:00:00, 9999-12-31
23:59:59], expressed, for example, as "2019-08-22 17:29:13".
TIMESTAMP[(n)] can specify the precision to be saved through the parameter
n, ranging from 0 to 6; or takes no parameter, in which case the default precision
of decimals after the second is 6. For example, 2019-08-22 17:29:13.263183 (n = 6), 2019-08-22 17:34:36.383 (n = 3). The value range of TIMESTAMP is [0001-
01-01 00:00:00.000000, 9999-12-31 23:59:59.999999].
When storing timestamp data with time zone, TIMESTAMP(n) WITH TIME
ZONE and TIMESTAMP(n) WITH LOCAL TIME ZONE can be used. The
difference between the two is that TIMESTAMP(n) WITH TIME ZONE holds
the time and time zone information and therefore occupies 12 bytes, e.g., 2019-
08-22 18:41:30.135428 +08:00. TIMESTAMP(n) WITH LOCAL TIME ZONE
uses local data information, which only saves time information, not time zone
information. It will be converted to the timestamp of the current time zone of the
database when stored, and will be displayed with the information of the local time
zone when viewed, so it occupies 8 bytes. For example, when stored, it is displayed as 2019-08-22 18:41:30.135428; when viewed, it is displayed as 2019-08-22 18:41:30.135428 +08:00.
Boolean data can be stored by the keywords BOOL and BOOLEAN, occupying
1 byte. For string input, the normal strings TRUE and FALSE are supported, as well
as the single characters T and F, and the string values 1 and 0. Boolean data can be
converted to and from INT and BIGINT data because Boolean data can be seen as
the numbers 0 and 1, so it can be converted to the integers 0 and 1. Integer data can
also be converted to Boolean data. The conversion rule is that integer 0 corresponds
to the Boolean value FALSE, and other non-zero integers correspond to the Boolean
value TRUE. For the output of Boolean data, when Boolean data is displayed, or
when converting Boolean data to character data, the GaussDB database uniformly out-
puts 1 as string TRUE and 0 as string FALSE. When the input value is null, the
output of the Boolean data is also null.
Spatial data types include GEOMETRY, POINT, LINESTRING,
POLYGON, etc.
Native JSON (JavaScript Object Notation) data is supported, allowing for more efficient storage and management of JSON documents.
To store department information of a company, first create a table with fields for
department information. Suppose the department information to be stored includes
department number, department level, department name, establishment time, and
whether it is an excellent department, etc., we need to determine the data type of the
specific information first: if the department number is numeric data, it can be
expressed as NUMBER; if the department level is integer data, it can be expressed
as INT; if the department name is character data, it can be expressed as VARCHAR;
the establishment time can be expressed by date data; whether it is an excellent
department can be expressed by Boolean data. The CREATE TABLE statement to create the department information table is shown below.
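A minimal sketch follows; the table and column names are assumed for illustration and match the data types discussed above.

CREATE TABLE section_info
(
    section_id    NUMBER,        -- department number
    section_level INT,           -- department level
    section_name  VARCHAR(100),  -- department name
    found_date    DATE,          -- establishment time
    is_excellent  BOOL           -- whether it is an excellent department
);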
After the table is created successfully, if you want to store the department
description information in the table, you can add more columns to the table. Suppose
the column name is “section_description”, if the department description information
is expressed by a string, the content may be bulky; and if it is defined as BLOB data,
it can be achieved by the following statement.
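A minimal sketch, continuing the assumed table name used above:

ALTER TABLE section_info ADD COLUMN section_description BLOB;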
The numeric calculation functions are responsible for calculating numeric values, such as the absolute value function ABS(x), sine function SIN(x), cosine function COS(x), inverse sine function ASIN(x), and inverse cosine function ACOS(x).
The ABS(x) function is used to calculate the absolute value. The input parameter
can be a numeric value or a non-numeric value that can be implicitly converted to a
numeric value. The type of the return value is the same as that of the input parameter.
x must be an expression that can be converted to a numeric value type. ABS(x)
eventually returns the absolute value of x (including INT, BIGINT, REAL, NUM-
BER, and DECIMAL types).
The SIN(x) and COS(x) functions are used to calculate the sine and cosine values,
whose input parameter is an expression that can be converted to a numeric value, and
the return value is of type NUMBER.
The ASIN(x) and ACOS(x) functions are used to calculate the arc sine and arc
cosine values, whose input parameter is an expression that can be converted to a
numeric value, with the range of [-1, 1], and the return value is of type NUMBER.
The code is shown below.
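A minimal sketch of the functions described above (return values are shown approximately in the comments):

SELECT ABS(-10);        -- 10
SELECT SIN(0), COS(0);  -- 0 and 1
SELECT ASIN(1);         -- about 1.5707963 (π/2)
SELECT ACOS(-1);        -- about 3.1415926 (π)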
ROUND(X,D) truncates the numeric value X before or after the decimal point according to the value specified by D, rounds it, and returns the result. The value of D is in the range [-30, 30]. If D is omitted, the whole fractional part is truncated and rounded. If D is negative, it means that the digits to the left of the decimal point are filled with zeros after rounding, and the decimal part is removed. The code is shown below.
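A minimal sketch of ROUND with different values of D:

SELECT ROUND(15.793);       -- 16
SELECT ROUND(15.793, 1);    -- 15.8
SELECT ROUND(15.793, -1);   -- 20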
The CEIL(X) function is used to calculate the smallest integer greater than or
equal to the specified expression n, whose input parameter is an expression that can
be converted to a numeric value, and the return value is an integer. For example,
CEIL(15.3) is calculated as 16. Numeric calculation functions are shown in
Table 3.3.
The SIGN(X) function is used to take the sign of a numeric value, which returns 1 if the value is greater than 0, returns -1 if it is less than 0, and returns 0 if it is equal to 0. The returned value is of the numeric value type. For example, for SIGN(-2*-3), (-2) × (-3) = 6, which is greater than 0, so the result is 1.
The SQRT(X) function is used to calculate the square root of a non-negative real
number, whose input parameter is an expression that can be converted to a
non-negative values, and the return value is of type DECIMAL. For example,
SQRT(49) is calculated as 7.
The TRUNCATE (X,D) function is used to intercept the input numeric data in the
specified format, without rounding, where X indicates the data to be intercepted, and
D for the interception accuracy, and the return value is of type NUMBER. For
example, TRUNCATE(15.79,1) is 15.7 after intercepting a decimal to the right;
TRUNCATE(15.79,-1) is 10 after intercepting an integer to the left.
The FLOOR(X) function is used to find the nearest integer less than or equal to
the value of the expression, whose input parameter is an expression that can be
converted to a numeric value, and the return value is of type NUMBER. For
example, FLOOR(12.8) is calculated as 12.
The PI() function is used to return the value of π, with valid number default to
7 digits. For example, PI() returns 3.141593.
The MOD(X,Y) function is used for modulo operations, whose input parameter is
an expression that can be converted to a NUMBER data, and the return value is of
type NUMBER. For example, MOD (29,3) is calculated as 2.
Other numeric calculation functions include the exponentiation function POWER
(), etc.
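A minimal sketch of the string splicing calls discussed below (the exact statements are assumed):

SELECT CONCAT('11', 'NULL', '22');          -- returns 11NULL22
SELECT CONCAT_WS('-', '11', NULL, '22');    -- returns 11-22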
In the above example, the CONCAT() function splices the strings '11', 'NULL' and '22' to return the string 11NULL22, and the CONCAT_WS() function splices '11', NULL and '22' with the separator '-', where NULL means a null value, to return 11-22.
The HEX (str) function returns a string of hexadecimal value, whose input
parameter is of numeric value type or character type, and the return value is of string
type. The HEX2BIN (str) and HEXTORAW (str) functions return strings
represented as hexadecimal strings. The difference between the two is that the
HEX2BIN() function returns the BINARY type, where the input hexadecimal string
must be prefixed with 0x, while the HEXTORAW() function returns the RAW type.
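A minimal sketch of the hexadecimal calls discussed below (the exact statements are assumed):

SELECT HEX('ABC');         -- returns 414243
SELECT HEX2BIN('0X28');    -- returns the string "(" of BINARY type
SELECT HEXTORAW('ABC');    -- returns ABC of RAW type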
In the above example, the HEX(‘ABC’) function returns the hexadecimal string
414243 for ABC. The HEX2BIN(‘0X28’) function returns the string “(” represented
by the hexadecimal string 28. The HEXTORAW(‘ABC’) function returns the
hexadecimal string ABC of type RAW.
The LEFT(str, length) function returns the leftmost length characters of the specified string. For example, the result of executing LEFT('abcdef',3) is abc. If length is less than or equal to 0, then a null string is returned. The function of RIGHT(str, length) is opposite to that of LEFT(), returning the rightmost characters of the specified string. For example, the result after executing RIGHT('abcdef',3) is def. If length is less than or equal to 0, then a null string is returned.
The LEFT () and RIGHT () functions are described as follows. str is the source
string from which the substring is to be extracted. length is a positive integer,
specifying the number of characters returned from the left or right. If length is 0 or
a negative number, then a null string is returned. If length is greater than the length of
the str string, the function returns the entire str string. The client currently supports a
maximum string of 32767 bytes, so the function returns a maximum value of 32767
bytes.
The LENGTH(str) function is used to get the length of a string; for example, the result of executing LENGTH('1234大') is 7. The LENGTH() function returns the length of str in bytes (under the utf8mb4 encoding, the Chinese character occupies 3 bytes, hence the result 7), whose input parameter is an expression that can be converted to a string, and the return value is of type INT.
The LOWER(str) function is used to convert a string to the corresponding
lowercase form. For example, the result of executing LOWER(‘ABCD’) is abcd,
without converting the numeric value type. Corresponding to the LOWER() func-
tion, the UPPER(str) function is used to convert a string to the corresponding
uppercase form. For example, the result of executing UPPER(‘abcd’) is ABCD,
without converting the numeric value type. The LOWER() and UPPER() functions
have input parameters that can be converted to string expressions and return values
that are of string type.
The SPACE(n) function generates n spaces, and the value range of n is [0,4000]. For example, the result of CONCAT('123', SPACE(4), 'abc') is "123    abc".
The REVERSE(str) function returns the reverse order of the string, only supports
the string type. For example, the result of REVERSE(‘abcd’) is dcba.
SUBSTR(str,start,len) is a string interception function. For example, SUBSTR
('abcdefg',3,4) means intercepting a string of length 4 starting from the third character, delivering the result cdef. The SUBSTR() function intercepts and returns a substring
with len characters from start in str, where the input parameter str must be an
expression that can be converted into a string, and the input parameters start and
len must be expressions that can be converted into INT type. The return value is of
string type.
mysql> SELECT DATE_FORMAT(SYSDATE(),'%W'), DATE_FORMAT(SYSDATE(),'%w'),
    -> DATE_FORMAT(SYSDATE(),'%Y-%m-%d');
+-----------------------------+-----------------------------+-----------------------------------+
| DATE_FORMAT(SYSDATE(),'%W') | DATE_FORMAT(SYSDATE(),'%w') | DATE_FORMAT(SYSDATE(),'%Y-%m-%d') |
+-----------------------------+-----------------------------+-----------------------------------+
| Tuesday                     | 2                           | 2020-05-19                        |
+-----------------------------+-----------------------------+-----------------------------------+
1 row in set (0.00 sec)
The EXTRACT(field from datetime) function extracts the specified time field
“field” from the specified datetime, where the values of the field include year, month,
day, hour, minute, and second, and the return value is of the numeric value type. If
the field value is SECOND, the return value is of the floating-point number type,
where the integer part indicates second, and the decimal part indicates microsecond.
This function takes any numeric value or any non-numeric value that can be
implicitly converted to a numeric value as a parameter and returns the same data
type as the parameter.
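A minimal sketch of the extraction call discussed below; only the EXTRACT statement is sketched, and the year-truncation statement is not reconstructed here.

SELECT EXTRACT(MONTH FROM '2019-08-23');    -- returns 8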
The above code extracts the month from “2019-08-23”, and returns the result 8;
and intercepts from the system date according to “YY”, and the result is 2019-01-01
00:00:00. Time and date functions are shown in Table 3.5.
When using the CAST() function for data type conversion, the following conditions must be met, otherwise an error will be reported.
(1) The two expressions can be converted implicitly.
(2) The data types must be explicitly converted.
The code is shown below.
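A minimal sketch of CAST conversions:

SELECT CAST('2020-05-19' AS DATE);    -- string converted to a date
SELECT CAST(123.45 AS CHAR(10));      -- number converted to a string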
The CONVERT(value, type) function converts value into the specified type; the supported types are all data types except LONGBLOB, BLOB, and IMAGE.
The code is shown below.
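A minimal sketch of CONVERT conversions:

SELECT CONVERT('123', SIGNED);      -- string converted to a signed integer
SELECT CONVERT(123.45, CHAR(10));   -- number converted to a string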
System information functions are used to query the system information of GaussDB
(for MySQL). The VERSION() function is used to return the database version
number; the CONNECTION_ID() function returns the server connection ID num-
ber; the DATABASE() function returns the name of the current database; the
SCHEMA() function returns the name of the current Schema; the USER(),
SYSTEM_USER(), SESSION_USER(), and CURRENT_USER() functions return
the name of the current user; the LAST_INSERT_ID() function returns the value of
auto_increment; the CHARSET(str) function returns the character set of the string str;
and the COLLATION(str) function returns the character alignment of the string str.
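A minimal sketch of querying several system information functions at once (the actual output depends on the environment):

SELECT VERSION(), DATABASE(), CURRENT_USER(), CONNECTION_ID();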
3.4 Operators
An operator can process one or more operands, which may be before, after, or
between two operands. It is an important element that makes up an expression,
specifying the operation to be performed on the operands. Operators are classified
into unary and binary operators depending on the number of operands required. The
priority of operators determines the order in which different operators are computed
in an expression. Operators of the same priority are computed in left-to-right order.
Common operators can be divided into logical operators, comparison operators,
arithmetic operators, test operators, wildcards and other operators according to usage
scenarios.
If you want to query employees who joined after 2000 or whose salary is >5000
from the staffs table, i.e., if one of the two conditions is required to satisfy, the two
conditions after WHERE should be joined by OR.
If you want to query from staffs table for employees who did not join after 2000
and whose salary is >5000, you can add NOT in front of the condition of joining
after 2000; at this time, the relationship between hiredate and salary is AND, so the
two conditions after WHERE are joined by AND.
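The two queries can be sketched as follows, assuming the staffs table has hiredate and salary columns and that "joined after 2000" means a hire date later than 2000-12-31.
SELECT * FROM staffs WHERE hiredate > '2000-12-31' OR salary > 5000;
SELECT * FROM staffs WHERE NOT hiredate > '2000-12-31' AND salary > 5000;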
All data types can be compared using comparison operators, and the comparison
returns a value of Boolean type. The comparison operators are all binary operators,
and the two pieces of data being compared must be of the same data type or of types
that can be implicitly converted. The GaussDB database provides six comparison
operators, namely <, >, <=, >=, =, and <> or != (not equal to), which should be
selected according to the service scenario.
The comparison operator > is used to query the employees whose salary is
greater than 5000 from staffs table.
The comparison operator <> is used to query the employees whose salary is not
equal to 5000 from staffs table.
The arithmetic operations take the forms +, -, *, /, etc., and the order of priority is: the
four arithmetic operations > left and right shift > bitwise AND > bitwise exclusive
OR > bitwise inclusive OR.
When one of the above bitwise operations is executed, if the input parameter has
decimal places, the input parameter will be rounded before the bitwise operation is
done. A code example is as follows.
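For instance, a decimal input is rounded before the bitwise operation is applied (a minimal sketch):
SELECT 3.6 & 1;    -- 3.6 is rounded to 4 first, so the bitwise AND returns 0
SELECT 2 << 1;     -- left shift by one bit, returns 4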
EXISTS means that an eligible element exists, and NOT EXISTS means that no
eligible element exists. The sample code is as follows.
IS NULL means the field is equal to NULL; while IS NOT NULL means the field
is not equal to NULL. The sample code is as follows.
ANY means it is enough that one of the values in the subquery satisfies the
condition, and it matches each of the following three forms.
(1) =ANY: The function is exactly the same as that of the IN operator.
SELECT * FROM emp WHERE sal IN ( SELECT sal FROM emp WHERE job =
'MANAGER');
(2) >ANY: Greater than the smallest value in the records returned by the subquery.
(3) <ANY: Smaller than the largest value in the records returned by the subquery.
LIKE means matching with the expression; NOT LIKE means no match with the
expression. Only character type is supported. The sample code is as follows.
REGEXP and REGEXP_LIKE indicate that the string matches the regular expres-
sion, and the expression return value is of Boolean type. The syntax of
REGEXP_LIKE is REGEXP_LIKE(str, pattern[, match_param]). The input parameter
“str” is the string subject to regular matching, supporting the string type and
NUMBER type; the input parameter “pattern” is the regular expression to be
matched; the input parameter “match_param” indicates the matching mode (‘i’ means
case-insensitive search; ‘c’ means case-sensitive search; ‘c’ is the default). The sample
code is as follows.
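A sketch of both forms, using an arbitrary sample string:
SELECT 'GaussDB' REGEXP 'DB$';                  -- returns 1 when the string matches the pattern
SELECT REGEXP_LIKE('GaussDB', '^gauss', 'i');   -- 'i' makes the match case-insensitive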
Require the system to return 1 when there is a string equal to “zhangsan” in the
NAME field in the table, and the EXISTS operator can be used for conditional query.
To find out the information in the table with ID fields between 1 and 2, you can
use BETWEEN 1 AND 2 for conditional query.
To query the information of the rows in the table whose NAME field is NULL,
you can use the IS NULL operator for conditional query.
To query the information of the rows in the table whose ID field is 1, 3 and 5, you
can use the ANY operator for conditional query.
To find the information of rows with “an” string in the NAME field, use the LIKE
operator with the wildcard %.
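The following sketches correspond to these descriptions, assuming a sample table named test with ID and NAME columns:
SELECT EXISTS (SELECT * FROM test WHERE NAME = 'zhangsan');   -- returns 1 when such a row exists
SELECT * FROM test WHERE ID BETWEEN 1 AND 2;
SELECT * FROM test WHERE NAME IS NULL;
SELECT * FROM test WHERE ID = ANY (SELECT ID FROM test WHERE ID IN (1, 3, 5));   -- ANY takes a subquery
SELECT * FROM test WHERE NAME LIKE '%an%';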
Wildcard and other operators are shown in Tables 3.11 and 3.12. % matches any
number of characters, including no character at all, while _ matches exactly one
unknown character. These two characters are often used in LIKE and NOT LIKE
statements to achieve string matching.
Single quotes (') are used to indicate the string type. If a single quotation mark is
included in the string text, then two single quotation marks must be used. The sample
code is as follows.
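For example (the string literal is arbitrary):
SELECT 'It''s a GaussDB example';   -- the embedded quote is written as two single quotes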
Double quotes (") and back quotes (`) are used to indicate the name of an object
such as a table, field, or index, or an alias. Quoted names are case-sensitive and may
use keywords as names or aliases. If the object name is not enclosed in double quotes
or back quotes, the GaussDB database treats it case-insensitively and folds both upper
and lower case to upper case.
3.5 Summary
This chapter is about the data types, system functions, operators and SQL statements
involved in Huawei GaussDB (for MySQL) to help readers get a preliminary
understanding of GaussDB (for MySQL) and lay a good foundation for the next
step of learning.
3.6 Exercises
A. True
B. False
3. [Single Choice] Run
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-
nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license and indicate if you modified the licensed material.
You do not have permission under this license to share adapted material derived from this chapter or
parts of it.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 4
SQL Syntax Categories
Data query is used to query the data within a database, specifically the operation of
retrieving data from one or more tables and views. Data query is one of the basic
applications of database. GaussDB (for MySQL) database provides rich query
methods, including simple query, conditional query, join query, subquery, set
operation, data grouping, sorting and restriction, etc. It is necessary to describe the
type of data query language and its usage based on the actual usage scenario.
The most common query in daily use is that implemented by the FROM clause,
whose syntax format is as follows.
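A simplified sketch of the general form (clauses beyond those discussed in this section are omitted):
SELECT [DISTINCT] select_item [, ...]
FROM table_reference [, ...]
[WHERE condition];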
The expressions that appear after the SELECT keyword and before the FROM
clause are called SELECT item, and the SELECT item is used to specify the columns
to be queried. If you want to query all columns, you can use the * sign after the
SELECT keyword, while if you only query specific columns, you can directly
specify the column name after the SELECT keyword, and note that the column
names should be separated by commas. The part after the keyword FROM specifies
which table(s) to query from, which can be a single table, multiple tables, or a clause.
A simple query is the case where the FROM keyword specifies a single table.
Example: Create a training table “training”, insert three rows of data into the table
and then view all the columns in the training table.
Create the training table.
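A possible definition of the table and the insertions, reconstructed from the data shown below; the column names other than staff_id and course_name, and the column lengths, are assumptions.
CREATE TABLE training (
    staff_id    INT NOT NULL,     -- employee ID, must not be empty
    course_name VARCHAR(64),      -- training course name
    exam_date   DATETIME,         -- exam date (assumed column name)
    score       INT               -- test score
);
INSERT INTO training (staff_id, course_name, exam_date, score)
VALUES (10, 'SQL majorization', '2017-06-25 12:00:00', 90);
INSERT INTO training VALUES (10, 'information safety', '2017-06-26 12:00:00', 95);
INSERT INTO training VALUES (10, 'master all kinds of thinking methods', '2017-07-25 12:00:00', 97);
SELECT * FROM training;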
The above code first creates a table by CREATE TABLE statement, and then
inserts data into the table by INSERT statement, where the table name “training” is
followed by the field information to be inserted; VALUES is followed by the
information of the specific inserted data, which item-by-item corresponds to the
field information behind the table name “training”. The staff_id field is defined as
NOT NULL, which means that the field data cannot be empty, and the field must
have data when inserting. If the value in VALUES contains all the columns in the
training table, the specific field specified after the training table can be omitted. After
that, you can insert three rows of data into the table by the same INSERT statement,
and then query the table by SELECT statement after the insertion is completed.
If you want to query all columns in the table, just add the * sign behind the
SELECT keyword. The sample code is as follows.
--------------------------------------------------------------
10 SQL majorization 2017-06-25 12:00:00 90
10 information safety 2017-06-26 12:00:00 95
10 master all kinds of thinking methods 2017-07-25 12:00:00 97
The keyword FROM is followed by the table name “training”, so all the data
information in the table training can be queried.
Sometimes there may be duplicate records in the table, and when retrieving these
records, it is necessary to do so by retrieving only unique records, not duplicate ones,
which can be achieved by the keyword DISTINCT. The DISTINCT keyword means
to remove all duplicate rows from the result set of SELECT, so that each row in the
result set is unique, and the range of values is the names of the fields that already
exist or the expressions of the fields. The syntax format is as follows.
The keyword DISTINCT is added before the SELECT item, and if there is only
one column after the DISTINCT keyword, that column will be used to calculate the
duplicate value; if there are two or more columns, the combined result of those
columns will be used for duplicate checking.
Table 4.1 shows the employee information table of a department. Now let’s query
the employees' job and bonus information and remove the records with duplicate
jobs and bonuses. According to the contents of the query, we can see that the
SELECT item includes job and bonus; to remove the records with the same jobs
and bonuses, we need to use the keyword DISTINCT. Add the DISTINCT keyword
in front of job and bonus in the SELECT item to achieve de-duplication and get the
corresponding results without duplicate values, with the specific code as follows.
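A sketch of the statement, assuming the employee information table in Table 4.1 is named employees:
SELECT DISTINCT job, bonus FROM employees;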
----------------------------------------------------------------
developer 9000
tester 7000
developer 10000
3 rows fetched.
When selecting query columns, the column names can be represented in the follow-
ing forms.
(1) Manually enter the column names, separated by English commas (,). For
example, to query both the a and b columns of table t1 and the f1 and f2 columns
of table t2, use the SELECT a, b, f1, f2 FROM t1, t2 statement, where columns a
and b are the columns of table t1, while f1 and f2 are the columns of table t2, and
the results are displayed in the form of Cartesian product.
(2) Calculate the fields. For example, to query the sum of the two fields a and b in
table t1, perform numerical calculation on the columns a and b, with the
statement SELECT a + b FROM t1.
(3) Use table names to qualify the column names. If two or more tables happen to
have some common column names, it is recommended to use the table name to
qualify the column names. You can also get query results without qualifying the
column names, but the use of fully qualified table and column names not only
makes the SQL statement clearer and easier to understand, but also reduces the
processing workload inside the database, thus improving the return performance
of the query. For example, querying column a of table t1 and column f1 of table
t2 can be achieved by the SELECT t1.a,t2.f1 FROM t1,t2 statement.
Again, take the training table as an example. To view the number of the staff taking
the course and the training course name in the training table, you can specify to query
staff_id column and course_name column in the SELECT item, with the SELECT
staff_id, course_name FROM training statement. This allows you to query the staff
number and course name information directly from the training table. The sample
code is as follows.
------------------------------------------------------------–
10 SQL majorization
10 information safety
10 master all kinds of thinking methods
Another example is about student scores. There are two score tables, Math and
English, both of which contain student numbers and corresponding scores, as shown
in Tables 4.2 and 4.3.
Now let’s find the math and English scores of the student with the student number
10. To make it easier to describe and use, alias the math score table to “a” and the
English score table to “b”. The score column in the math score table is aliased to
“MATH”, and that in the English score table is aliased to “ENGLISH”. The WHERE
clause can be used to query the scores of student 10. The specific
approach is to restrict the student number sid in the math score table to be equal to
10, and restrict the student number sid in the English score table to be equal to
10 also, and the relationship between the two conditions is “AND”, connected by the
logical operator AND. In this way, we can find out the math score and English score
of the student whose student number is 10 at one time, with the specific code as
follows.
The above aliases are set using the clause AS some_name, which allows you to
assign another name to a table name or column name for display. Generally aliases
are created to make the column names more readable.
The SQL aliases for columns and tables follow the corresponding column or table
names, respectively, and can be interspersed with the keyword AS. To replace the
staff_id field in the training table with “empno” to display the results, you can do so
by using the SELECT staff_id AS empno, course_name FROM training statement.
The keyword AS can be omitted. The alias can also be indicated by adding double
quotes (SELECT staff_id “empno”, course_name FROM training) so that the
staff_id field in the table is displayed as empno. In the previous example the math
table “math” uses the alias “a”, and the English table “english” uses the alias “b”.
The same is true for aliasing the MATH and ENGLISH columns in the math and
English score tables, respectively. The specific code is as follows.
------------------------------------------------------------–
10 SQL majorization
10 information safety
10 master all kinds of thinking methods
SELECT a.sid, a.score math, b.score english FROM math a, english b WHERE
a.sid = 10 AND b.sid = 10;
SID MATH ENGLISH
------------------------------------
10 95 82
The above example uses a conditional query for querying a student’s scores. A
conditional query is a query that sets conditions in the SELECT statement to get
more accurate results. The condition is specified by both the expression and the
operator, and the value returned by the conditional query is TRUE, FALSE or
UNKNOWN. The query conditions can be applied not only to the WHERE clause
but also to the HAVING clause, where the HAVING clause is used for further
conditional filtering of the grouped result set.
Its syntax formats include both CONDITION clause and PREDICATE clause.
The CONDITION clause is a conditional query statement, followed by the
PREDICATE clause as the query expectation condition, and can be used with
other conditions to perform AND, OR and other operations. The syntax format is
as follows.
{expression { = | <> | != | > | >= | < | <= } { ALL | ANY } expression | ( SELECT )
| string_expression [ NOT ] LIKE string_expression
| expression [ NOT ] BETWEEN expression AND expression
| expression IS [ NOT ] NULL
| expression [ NOT ] IN ( SELECT | expression [ , . . . n ] )
| [ NOT ] EXISTS ( SELECT )
}
The query condition is defined by the expression and the operator jointly. The
common ways to define conditions are as follows.
(1) Use the comparison operators >, <, >=, <=, !=, <>, =, etc. to specify the
comparison query conditions. When comparing with data of numeric type,
single quotes may be used or omitted at will; but when comparing with data of
character or date type, the data must be enclosed in single quotes.
(2) Use the test operator to specify the range query conditions. If you expect the
returned results to satisfy more than one condition, you can use the AND logical
operator to connect these conditions; if you expect the returned results to satisfy
one of several conditions, you can use the OR logical operator to connect these
conditions.
Example: To query the information about trainees taking the course SQL
majorization. Here you can use a comparison operator to specify the query conditions.
This is done by specifying that course_name is equal to the course name string “SQL
majorization” after the WHERE keyword in the conditional query, as follows.
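A sketch of the statement (the WHERE form follows the description above):
SELECT * FROM training WHERE course_name = 'SQL majorization';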
------------------------------------------------------------–
10 SQL majorization 2017-06-25 12:00:00 90
1 rows fetched.
The commonly used logical operators are AND, OR and NOT, which return
TRUE, FALSE and NULL, respectively, where NULL stands for unknown. Their
operation priority is: NOT > AND > OR.
The operation rules are shown in Table 4.4.
The test operators are also explained in the previous chapters. GaussDB (for
MySQL) supports the test operators shown in Table 4.5.
Example: Query the information from the employee bonus table “bonuses_depa”
shown in Table 4.6, which contains four fields: staff_id, name, job, and bonus.
If you need to query from the table the employees as a developer and with bonus
greater than 8000, in view of the query information subject to conditions, you can
use the conditional query, that is, specify job equal to the string developer and bonus
greater than 8000, and use AND to connect the two, because they are of AND
relationship. The sample code is as follows.
SELECT * FROM bonuses_depa WHERE job = 'developer' AND bonus > 8000;
STAFF_ID NAME JOB BONUS
-----------------------------------------------------------
30 wangxin developer 9000
If you need to query from the table the employees whose surname is wang and
bonus between 8500 and 9500, you should also use the conditional query; since
there are many employees with the surname wang, you should use the operator LIKE
and the wildcard % together if you want to query all employees with the surname
wang. As for the bonus value range, you can use the test operator BETWEEN...
AND...; since the two conditions are of AND relationship, they should be connected
with AND, with sample code as follows.
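A sketch of the statement against the bonuses_depa table:
SELECT * FROM bonuses_depa WHERE name LIKE 'wang%' AND bonus BETWEEN 8500 AND 9500;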
In practical applications, when querying the required data, it is often necessary to use
two or more tables or views. Such query of two or more data tables or views is called
a join query, which is usually built between “parent-child” tables that are related to
each other.
The syntax format is as follows.
The table_reference clause can be a table name, view name, query clause, etc.,
and the join keyword is JOIN. The OUTER represents the outer join, and INNER
represents the inner join. The outer join includes the left join, right join, and full join.
ON is followed by restrictions and other information.
When more than one table appears in the FROM clause of the query, the database
will perform the join operation.
(1) The SELECT column of the query can be any one column of these tables, as in
the above-mentioned example of score query. Similarly, the sample code for
querying a column value in Tables 1 and 2 is as follows.
(2) Most join queries contain at least one join condition, which can be either in the
FROM clause or in the WHERE clause. The sample code is as follows.
(3) The WHERE clause can be used to convert the join relationship of a table to an
outer join by specifying the + operator, but it is not recommended to use this
method because it is not standard SQL syntax.
The keyword for inner join is INNER JOIN, where INNER can be omitted.
The join execution order of an inner join does not necessarily follow the order in
which the tables are written in the statement.
Example: To query employee ID, highest degree and test scores. The query
operation is performed using the relevant column (staff_id) in both the training
and education tables.
We know that the education table contains the employee ID and highest
degree information, while the training table contains the employee ID and test
score information. To query the employee ID, highest degree and test scores at
one time, you need to use inner join query between the two tables, because the
employee ID in the two tables are corresponding. Firstly, the staff_id column of
the two tables is conditionally queried to get the corresponding information, and
then the same staff_id field is used as the query condition for the join query, so
as to achieve the simultaneous query of employee ID, highest degree and test
scores. The sample code is as follows.
------------------------------------------------------------–
10 SQL majorization 2017-06-25 12:00:00 90
11 BIG DATA 2018-06-25 12:00:00 92
12 Performance Turning 2018-06-29 12:00:00 95
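A sketch of how such an inner join can be written; the highest-degree column name in the education table is an assumption.
SELECT t.staff_id, e.highest_degree, t.score
FROM training t
INNER JOIN education e ON t.staff_id = e.staff_id;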
A join query queries multiple tables for related rows. The result set returned by an
inner join query contains only those rows that match the query and join conditions.
However, sometimes it is necessary to include data from unrelated rows, i.e., the
result set returns not only the rows that match the join condition, but also all the rows
in the left table or the right table, or both tables, thus an outer join is required.
The two data sources specified by an inner join are on equal footing, unlike an
outer join, which is based on one data source, and conditionally matches another data
source to the base data source.
An inner join returns all the data records in both tables that satisfy the join
condition. An outer join returns not only the rows that satisfy the join condition,
but also the rows that do not satisfy the join condition.
Outer joins are further divided into left outer join, right outer join and full
outer join.
Left outer join, also known as left join, refers to querying the left table as the base
table, associating the right table according to the specified join conditions, and
getting the data of the base table and the right table that matches the conditions;
for the records that exist in the base table but cannot be matched in the right table, the
corresponding field position of the right table is expressed as NULL, as shown in
Fig. 4.1.
In the query statement, the left table is the education table and the right table is the
training table, so the left join takes the education table as the base table and matches
the right table training by the employee ID. The query result contains two parts.
Suppose the employee IDs of the left table are 11, 12 and 13, and the right table has
the same employee IDs 11 and 12, according to the specified SELECT item, the
result will contain the information of the employee IDs 11 and 12 and the highest
degree information in the left table, and the information of the test scores
corresponding to these employee IDs in the right table. Since the employee ID
13 in the left table does not match any content in the right table, the result will
contain the information of the employee ID 13 and the highest degree information in
the left table, and the test score corresponding to it in the right table is null, as shown
in Table 4.7.
The specific code is as follows.
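A sketch of the left join, with the same assumed column name for the highest degree:
SELECT e.staff_id, e.highest_degree, t.score
FROM education e
LEFT JOIN training t ON e.staff_id = t.staff_id;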
The right outer join, also known as the right join, corresponds to the left join. It
means that the right table is the base table and the data in the right table is queried on
the basis of the inner join (data not in the left table is filled with NULL), as shown in
Fig. 4.2.
The left table is the education table, and the right table is the training table. The
right join takes the training table as the base table, and matches the highest degree in
the left table by the employee ID. The query results will contain these two parts. If
the employee IDs in the right table are 10, 11 and 12, of which the same as those in
the left table are 11 and 12, the result will contain information about the employee
IDs 11 and 12 in the left table and information about their highest degrees, as well as
information about the test scores corresponding to these employee IDs in the right
table, according to the SELECT item specified. Since the employee ID 10 in the right
table does not match any record in the left table, the result will also contain the test
score information for employee ID 10 from the right table, while the corresponding
highest degree field from the left table is null, as shown in Table 4.8.
The sample code is as follows.
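A corresponding sketch for the right join (column names as assumed above):
SELECT t.staff_id, e.highest_degree, t.score
FROM education e
RIGHT JOIN training t ON e.staff_id = t.staff_id;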
Anti join is a special type of join without a specified keyword in SQL. It is the
opposite of a semi join and is implemented by adding a NOT IN or NOT EXISTS
subquery after WHERE, returning all rows in the main query that do not satisfy the
condition.
For example, if you query the highest degree information of employees who have
not attended training, first find information about the same employee ID in the
education table and the training table in the subquery after the keyword NOT IN;
then find the information with different employee IDs in the education table
according to the same employee ID information found; and finally return the
employee IDs and highest degree and so on. As you can see, it is the opposite of
the above-mentioned semi join. The sample code is as follows.
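A sketch of the NOT IN form described above (the highest-degree column name is assumed):
SELECT staff_id, highest_degree
FROM education
WHERE staff_id NOT IN (SELECT staff_id FROM training);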
4.1.6 Subquery
An uncorrelated subquery means that the subquery is independent of the outer main
query. The execution of the subquery does not need to obtain the value of the main
query in advance, but only serves as a query condition of the main query. When the
query is executed, the subquery and the main query can be divided into two
independent steps, i.e., the subquery is executed first, and then the main query is
executed.
The syntax format of a subquery is the same as that of a normal query, and it can
appear in the FROM clause, the WHERE clause, and the WITH AS clause. A
subquery in the FROM clause is called an inline view, and a subquery in the
WHERE clause is called a nested subquery.
The WITH AS clause defines a SQL fragment that will be used by the entire SQL
statement, making the SQL statement more readable. The table that stores the SQL
fragment is different from the base table in that it is a dummy table. The database
does not store the definition and data corresponding to the view, and these data are
still stored in the original base table. If the data in the base table changes, the data
queried from the table where the SQL fragment is stored also changes. The syntax
format is as follows.
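In outline, the clause can be sketched as follows, using the parameter names explained below:
WITH table_name AS (
    select_statement1
)
select_statement2;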
table_name is the user-defined name of the table where the SQL fragment is
stored, i.e., the dummy table's name.
select_statement1 is the SELECT statement that queries the data from the base
table; the data found forms the data of the dummy table.
select_statement2 is the SELECT statement that queries the data from the user-
defined table where the SQL fragment is stored, i.e., the SQL statement that retrieves
data from the dummy table.
Example: To find the employees in each department whose salary is above the
average salary of the department by a correlated subquery.
A staffs table contains information such as names, department IDs and salaries,
etc. Now to query the information of employees in each department who have
higher-than-average salary in the department, you can do it by subquery. For each
row of the staffs table, the main query uses a correlated subquery to calculate the
average salary of members of the same department, with the following code.
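A sketch of such a correlated subquery; the name column is an assumption, while section_id and salary follow the description below.
SELECT s1.name, s1.section_id, s1.salary
FROM staffs s1
WHERE s1.salary > (
    SELECT AVG(s2.salary)            -- department average for the same section_id
    FROM staffs s2
    WHERE s2.section_id = s1.section_id
);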
For each row of the staffs table, the main query uses a correlated subquery to
calculate the average salary of members of the same department. The correlation
subquery performs the following steps for each row of the staffs table.
(1) Determine the section_id of the row. The alias of the staffs table is s1 in the main
query and s2 in the subquery, and the subquery condition is the section_id in the
main query table has the same information as in the subquery table.
(2) Use the aggregate function AVG() to calculate the average salary of the department
identified by that section_id, which is then used in the main query.
(3) Compare the salary field with the average salary in the main query and take the
results that are greater than the average salary (if the salary in this row is greater
than the average salary in the department, the row is returned).
Each row in the staffs table will be calculated once by the subquery.
The following is an example of the WITH AS subquery.
Example: To query information about employees who have attended BIG DATA
courses.
Example: To create a table with the same structure as the training table by the
subquery.
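A sketch of the statement, using the always-false condition discussed below:
CREATE TABLE training_new AS SELECT * FROM training WHERE 1 <> 1;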
<> means not equal, so the condition 1<>1 is never true and the subquery will
not return data.
Since the condition following WHERE is never true, only the table structure is
created and no data is inserted into it.
Insert all the data of the training table into the training_new table by subquery.
Find out all the data in the training table by subquery, and then insert the data into
the new table training_new by INSERT statement, where the training_new table
already exists, with the table structure same as the training table.
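A sketch of the corresponding statement:
INSERT INTO training_new SELECT * FROM training;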
In most databases only one SELECT query statement is used to return a result set. If
you want to query multiple SQL statements at once and merge the results of all
SELECT queries into a single result set, you need to use the merging result set
operator to merge multiple SELECT statements. This type of query is called a
merging or compound query, which can be implemented with the UNION operator.
The UNION operator combines the result sets of multiple query blocks into a
single result and outputs it. The following should be noted when using it.
(1) Each query block must have the same number of queried columns. For example,
if two tables are queried, the number of fields selected from both tables must be
the same.
(2) The query columns corresponding to each query block must be of the same data
type or of the same data type group. For tables, the data types of the columns
queried from the two tables should be the same or belong to the same (intercon-
vertible) data type group.
(3) The keyword ALL means keeping all duplicate data; omitting ALL means
removing duplicate data.
Figure 4.3 shows tables A and B, where table A has one column containing
1 and 2; table B also has one column, defined the same as table A's column,
containing 2 and 3. If A UNION B is executed, the identical value “2” in tables A
and B is merged in the output, i.e. the result set is 1, 2, 3; if A UNION ALL B is
executed, the returned result set outputs both occurrences of “2” from A and B,
i.e. the result set is 1, 2, 2, 3.
There are employee information of two departments, as shown in Tables 4.9 and
4.10, let's query the information of employees who have received bonuses over
7000. We know the employee information tables of department 1 and department
2, which have the same number of columns and the same definitions, so we can merge the result
sets to get the information of employees with bonuses over 7000 in the two
departments. First, use the SELECT condition to query the IDs, names and bonuses
of employees with bonuses over 7000 in department 1; after that, use the same
SELECT condition to query such information in department 2; then use UNION
ALL to combine the results of the two queries into a result set. This way you can
query information from two departments at once. The code is shown below.
SELECT staff_id, name, bonus FROM bonuses_depa1 WHERE bonus > 7000
UNION ALL SELECT
staff_id, name, bonus FROM bonuses_depa2 WHERE bonus > 7000 ;
STAFF_ID STAFF_NAME BONUS
-----------------------------------------------------------–
30 wangxin 9000
35 caoming 10000
25 liulili 8000
29 liuxue 9000
What corresponds to the merging result set is the difference result set, which
performs subtraction on the query result sets to calculate the results that exist in the
output of the left query statement but not in the output of the right query statement.
Obtaining the difference between result sets can be realized with the MINUS and
EXCEPT operators. Using A MINUS B MINUS C removes from result set A all records
that also appear in result set B or result set C, i.e., it returns the records that exist in A
but not in B or C, with the syntax format as follows.
select_statement1 is the SELECT statement that produces the first result set,
similar to result set A.
select_statement2 is the SELECT statement that produces the second result set,
similar to result set B.
The result returned is the difference result set between result set A and result
set B, i.e., the data information that is in result set A but not in result set B.
The contents of result set A are 1, 2, and 3, and the contents of result set B are 2, 3,
and 4. Since the column definitions of result set A and result set B are the same,
conduct the difference result set calculation for A and B, i.e., A MINUS B yields a
difference result of 1, as shown in Fig. 4.4.
The code for querying data using MINUS is as follows.
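A sketch with assumed table and column names table_a, table_b, and id:
SELECT id FROM table_a
MINUS
SELECT id FROM table_b;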
Data grouping combines rows that have the same values in the specified columns,
which can be achieved by the keyword GROUP BY with the following syntax
format.
GROUP BY { column_name } [ , . . . ]
The HAVING clause can further filter the data in the result set of the grouping by
comparing some properties of the groups with a constant value, where only the
groups that meet the conditions of the HAVING clause are extracted. It is often used
in conjunction with the GROUP BY clause to select special groups, and the syntax
format is as follows.
HAVING CONDITION { , . . . }
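For example, the bonuses_depa table can be grouped by job, keeping only groups whose average bonus exceeds an illustrative threshold:
SELECT job, AVG(bonus) AS avg_bonus
FROM bonuses_depa
GROUP BY job
HAVING AVG(bonus) > 7000;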
The ORDER BY clause sorts the rows returned by the query statement according to
the specified columns. Without the ORDER BY clause, multiple executions of the
same query will not necessarily retrieve rows in the same order. The syntax format of
ORDER BY is as follows.
------------------------------------------------------------–
31 xufeng document developer 6000
30 wangxin developer 9000
34 denggui quality control 5000
35 caoming tester 10000
If there are many rows of data in a table, but only a few of them need to be queried,
you can use the LIMIT clause to implement the data restriction function. The data
restriction consists of two separate clauses, the LIMIT clause and the OFFSET
clause.
The LIMIT clause is used to limit the rows allowed to be returned by the query,
which can specify the offset and the number of rows or percentage of rows to be
returned. This clause can be used to implement top-N statements. To get consistent
results, specify the ORDER BY clause to determine the order. The OFFSET clause is
used to set the starting position of return. The syntax is shown below.
start indicates the number of rows to be skipped before the return line, and count
is the maximum number of rows to be returned. When both start and count are
specified, the start rows will be skipped before the count rows to be returned is
counted. To return 20 rows of the result set and skip the first 5 rows, you can do so
with the LIMIT 20 OFFSET 5 expression.
In Table 4.10, the query for employee information is limited to a total of 2 rows of
data after skipping the first 1 row of the query. Since only 2 rows of data are queried,
then the LIMIT clause can be used, where LIMIT 2 can be added after the query
statement to limit the query to only 2 rows; to skip the first 1 row, the OFFSET
clause can be used, where OFFSET 1 can be added after LIMIT, so that the
corresponding data information can be queried. Similarly, the order of the LIMIT and
OFFSET clauses can be exchanged, or the restriction can be expressed directly with the
LIMIT clause alone, that is, adding LIMIT 1, 2 directly after the query statement. The
specific code is as follows.
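Assuming the department 2 table is bonuses_depa2, the equivalent forms described above can be sketched as:
SELECT * FROM bonuses_depa2 LIMIT 2 OFFSET 1;
SELECT * FROM bonuses_depa2 OFFSET 1 LIMIT 2;   -- exchanged clause order, as described above
SELECT * FROM bonuses_depa2 LIMIT 1, 2;         -- LIMIT offset, count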
There are three main ways to update data (data manipulation): data insertion, data
modification, and data deletion. These operations are all commonly used by database
developers.
At the time of data query, the table must have data, otherwise the data will not be
queried. Therefore, data should be inserted into the table first.
The following items should be noted when inserting data.
(1) Only the user with INSERT permission can insert data into the table. SYS user is
the system administrator super user. Ordinary users are not allowed to create
SYS user objects.
(2) To use the RETURNING clause, the user must have SELECT permission for the
table.
(3) If the QUERY clause is used to insert data rows from a query, the user also
needs to have the SELECT permission on the table referenced in the query.
(4) The commit of INSERT transaction is enabled by default.
The keyword of the data insertion statement is INSERT, and the syntax format
presents the following three forms.
(1) Value insertion. Construct a row and insert it into the table with the following
syntax.
IGNORE means that the INSERT statement ignores errors that occur during
execution, and does not support simultaneous use with ON DUPLICATE KEY
UPDATE. tbl_name is the name of the table to be inserted into; partition_name is one
or more partitions or subpartitions (or both) of the table, with the list of names
separated by commas; col_name is the name of the table field to be inserted, and
expression is the value or expression of the inserted field. If the INSERT
statement specifies a field name that contains all the fields in the table, the
field name can be omitted.
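Based on the parameters just described, the value-insertion form can be sketched as follows:
INSERT [IGNORE] INTO tbl_name
    [PARTITION (partition_name [, ...])]
    [(col_name [, ...])]
VALUES (expression [, ...])
[ON DUPLICATE KEY UPDATE col_name = expression [, ...]];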
(2) Query insertion. Use the result set returned by the SELECT clause to construct
one or more rows and insert them into the table, with the syntax shown below.
select_clause is the SELECT query result set, which will be used as the value
in the newly inserted table.
(3) Insert a record, and if a primary key conflict error is reported, then perform the
UPDATE operation to update the specified field value, with the syntax below.
The second step is to perform the value insertion. Insert a row into the
training1 table with the INSERT statement as shown below.
In the third step, a query insertion is performed. Insert all the data of the
training table into the training1 table by subquery. This can be achieved by the
following statements, using INSERT and SELECT statements to query all the
data in the training table and insert it into training1, with the specific statement as
follows.
Step 4, if there is a primary key conflict error, execute the UPDATE opera-
tion. First, create the primary key in the training table (achieved by the ALTER
TABLE ... ADD PRIMARY KEY statement), then use the ON DUPLICATE KEY
UPDATE statement in the INSERT statement to achieve the record insertion
operation, and update the primary key name, exam date and other fields when a
primary key conflict occurs. The specific code is shown below.
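A sketch of step 4; the key column and the updated values are illustrative.
ALTER TABLE training ADD PRIMARY KEY (staff_id);
INSERT INTO training (staff_id, course_name, exam_date, score)
VALUES (10, 'SQL majorization', '2017-07-01 12:00:00', 92)
ON DUPLICATE KEY UPDATE course_name = 'SQL majorization',
                        exam_date   = '2017-07-01 12:00:00';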
Data modification, as the name implies, is to modify the value of the relevant data in
the table, in which the following matters should be noted.
(1) The commit of the UPDATE transaction is enabled by default, not requiring the
COMMIT clause.
(2) The user who performs the operation needs to have the UPDATE permission of
the table.
The data modification keyword is UPDATE, and the syntax format is as
follows.
{ table_name
| join_table
}
table_name is the table name to be updated, whose value range is the name of
the existing table; col_name is the field name to be modified, whose value range
is the name of the existing field; and expression is the value or expression
assigned to the field. condition is an expression returning values of Boolean
type, only rows returning TRUE under this expression will be updated.
The join_table clause is a set of tables used for linked queries, including inner
join, left join, and right join.
Then you can create the tables education and training by the CREATE
TABLE statement, with the code as follows.
Then insert data into the two tables by the INSERT statement, inserting
2 pieces of data into the education table and 4 pieces of data into the training
table with the following code.
Now you can update the contents of the table, updating the first_name field
that carries the same record on staff_id in the training table and staff_id in the
education table. The table to be updated is the training table, so the keyword
UPDATE is followed by the table name “training”. This update involves two
tables, so it can be done with the JOIN clause. The update condition is that the
staff_id in the training table and the staff_id in the education table are the same,
so the JOIN condition indicates that the staff_id in training table is equal to the
staff_id in the education table. The specific record to be updated is the first_name
in the training table. Therefore, the keyword SET is followed by the setting
information of the first_name in the training table. To set the eligible first_name
to ALAN, the following statement can be executed.
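A sketch of the statement described above:
UPDATE training t
JOIN education e ON t.staff_id = e.staff_id
SET t.first_name = 'ALAN';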
Data deletion is to delete data rows from a table, where the following matters should
be paid attention to.
(1) The user using this statement must have the DELETE permission of the table.
(2) The commit of the DELETE transaction is enabled by default.
The keyword for data deletion is DELETE, which, like INSERT, is a transaction operation.
The specific syntax format is as follows.
DELETE FROM table_name
[ WHERE condition ]
[ ORDER BY { column_name [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] } [ , ... ] ]
[ LIMIT [ start, ] count
| LIMIT count OFFSET start
| OFFSET start [ LIMIT count ] ]
table_name is the name of the table to which the data to be deleted belongs.
condition is the condition of the data to be deleted.
The ORDER BY clause specifies the fields of the result set to be sorted.
ASC or DESC specifies whether the ORDER BY clause is to be sorted in
ascending or descending order.
NULLS FIRST specifies the sorting position of NULL values in ORDER BY,
where FIRST means that rows containing NULL values will be at the top, and LAST
means that rows containing NULL values will be at the bottom. If this option is not
specified, ASC defaults to NULLS LAST and DESC defaults to NULLS FIRST.
count specifies the number of rows of data to be returned, and start specifies the
number of rows to be skipped before the value is returned. When both are specified,
it means that the start rows will be skipped before the count rows is returned.
Deleting a row in a table that matches another table can be done in two ways.
The first is achieved by the DELETE FROM statement, where table_ref_list
refers to the table to which the data to be deleted belongs (temporary tables are
not supported here), and join_table is a collection of associated tables, used in a
similar way to how it is used in data insertion.
The second is achieved by the DELETE FROM and USING statements, whose
contents are the same as in the first method. Both methods can achieve the deletion
of data.
Example: To delete the training record with staff_id of 10 and course name
“information safety” from the training table.
First create the training table. This table may already exist, so follow the deletion
method introduced in the data insertion section to delete the training table that may
already exist by the DROP TABLE IF EXISTS statement, with the code as follows.
Then you can create the training table by the CREATE TABLE statement. The
code is as follows.
Then insert the data into the table by the INSERT statement.
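With the table populated, the deletion itself can be sketched as:
DELETE FROM training WHERE staff_id = 10 AND course_name = 'information safety';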
Data definition is to define the objects in the database. Database objects are the
components of the database, mainly including tables, indexes, views, stored pro-
cedures, defaults, rules, triggers, functions, etc.
A table is a special data structure in the database for storing data objects and the
relationship between objects, consisting of rows and columns.
An index is a structure for sorting the values of one or more columns in a database
table, with which the quick access to specific information in a database table is
workable.
A view is a dummy table derived from one or several basic tables that can be used
to control user access to data.
A stored procedure is a collection of SQL statements designed to accomplish a
specific function. It is generally used for report statistics, data migration, etc.
Defaults are pre-determined values assigned to columns or column data items that
do not have specific values specified when creating columns or column data to a
table.
does not exist, a new table will be created. table_name is the name of the table, which
cannot be duplicated with the existing table name. relational_properties is the table
properties, including column name, type, row constraint and out-of-row constraint
information. DEFAULT is the default value of the column, AUTO_INCREMENT is
the specified auto-increment, COMMENT 'string' is the comment of the specified
column, inline_constraint is the column constraint, out_of_line_constraint is the
table constraint, and AS QUERY is the specified subquery to insert the rows returned
by the subquery into the table when creating the table.
The following items should be noted when creating a table.
(1) To create the current user's table, the user needs to be granted CREATE TABLE
system permissions.
(2) The table name and column name (data type and size) must be specified when
creating the table.
(3) Self-incrementing columns only support INT and BIGINT types, a table only
supports one self-incrementing column, and the self-incrementing column must
be a primary key or a unique index.
(4) When creating a foreign key, if no column is specified, the primary key of the
parent table is taken by default. If the parent table does not have a primary key,
an error is reported.
(5) The partition key must be an integer or an expression whose result is an integer.
In some scenarios, you can use columns directly for partitioning.
(6) Current supported partition types: RANGE, LIST, HASH, and KEY.
(7) Up to 1024 partition intervals are supported. If the total number of partitions
exceeds 1024, an error is reported.
Partitioning is to divide the data of a table into several smaller parts in some way, but
logically it is still a table. The GaussDB database supports partitioning by range
(RANGE), by hash (HASH), by list (LIST), and by key (KEY). Take the
range partitioning as an example, the syntax format is as follows.
boundary. MAXVALUE can be used when creating a range partition, usually for
setting the upper boundary of the last partition. TABLESPACE is the tablespace
keyword, followed by tablespace_name as the name of the tablespace where the
partition is located, and physical_properties_clause, which specifies the physical
storage properties of the partition.
Example: To create the education table.
CREATE TABLE is followed by the table name, and the column name and
column definition are specified in parentheses after the table name, with the preced-
ing column name and the following column definition separated by a space. The
different columns are separated by commas, where the employee ID is of the integer
type; the highest degree is of the fixed-length string type, with the length of 8 bytes,
NOT NULL means the value of the column cannot be empty; the school is a
variable-length string with the maximum length of 64 bytes; the graduation time is
of the time type; and the graduation description is a variable-length string with the
maximum length of 70 bytes.
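A sketch of the statement matching this description; the column names other than staff_id are assumptions.
CREATE TABLE education (
    staff_id        INT,               -- employee ID
    highest_degree  CHAR(8) NOT NULL,  -- fixed-length string, cannot be empty
    school          VARCHAR(64),       -- variable-length string
    graduation_date DATETIME,          -- graduation time
    description     VARCHAR(70)        -- graduation description
);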
Create a partition table “training”.
CREATE TABLE is followed by the table name, as well as the column name and
column definition. The keyword PARTITION BY is followed by RANGE(staff_id)
if you create a range partition table with the employee ID as the partition key. The
keyword PARTITION is followed by the specific partition name in parentheses.
Since it is a range partition, you need to specify the upper boundary keyword for
it. The value in parentheses after VALUES LESS THAN is the upper boundary
value, and the last value MAXVALUE indicates the upper boundary of the last range
partition.
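A sketch of the partitioned table; the partition names and boundary values are illustrative.
CREATE TABLE training (
    staff_id    INT NOT NULL,
    course_name VARCHAR(64),
    exam_date   DATETIME,
    score       INT
)
PARTITION BY RANGE (staff_id) (
    PARTITION p1 VALUES LESS THAN (20),
    PARTITION p2 VALUES LESS THAN (40),
    PARTITION p3 VALUES LESS THAN MAXVALUE   -- upper boundary of the last partition
);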
If, after the table is created, the table properties are found to be inappropriate and
need to be modified, the table properties can be modified by the ALTER TABLE
statement. The specific operations of modifying table properties include: adding,
deleting, modifying and renaming columns, adding, deleting, enabling and disabling
constraints, modifying the table name, and modifying the tablespace of the partition.
The syntax format is as follows.
When modifying table properties, the following points should not be overlooked.
(1) When adding column properties to a table, you need to ensure that there are no
rows in the table.
(2) When modifying the column properties of the table, make sure that the data
types in the table do not conflict, and if there is a conflict, the value of the column
needs to be set to NULL.
Commonly used operation examples are as follows.
Add a column full_masks to the training table.
Modify the data type of the course_name column in the training table.
Add a constraint.
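Sketches of the three operations; the data types and constraint chosen here are illustrative.
ALTER TABLE training ADD COLUMN full_masks INT;                                   -- add a column
ALTER TABLE training MODIFY COLUMN course_name VARCHAR(128);                      -- change a column's data type
ALTER TABLE training ADD CONSTRAINT uq_staff_course UNIQUE (staff_id, course_name);  -- add a constraint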
Users can delete tables under their own name. If you need to delete a table under
another user name, you need to have the DROP TABLE permission. Ordinary users
cannot delete system user objects.
The syntax format of DROP is as follows.
IF EXISTS is used to detect the existence of the specified table and delete it if it
exists; if not, the deletion operation will not report an error.
4.3.5 Index
An index is a structure for sorting the values of one or more columns in a database
table, with which the quick access to specific information in a database table is
workable. Indexes can greatly improve the speed of SQL retrieval. Take the direc-
tory (index) of Chinese dictionary as an example, we can quickly find the Chinese
character we need through a directory sorted by pinyin, strokes, radicals, etc.
For example, to look up the information of the employee with ID 10000 from an
employee table with 200,000 pieces of data, if there is no index, you have to go
through the whole table until you find the row whose ID equals 10000. Once an index
is built on the ID, you can look it up in the index. Since the index is algorithmically
optimized, the lookup is much faster. Therefore, indexes allow fast access to data.
The SQL statements involved in the index are shown in Table 4.13.
Indexes can be classified into single-column indexes and multi-column indexes
by number of index columns, and into common indexes, unique indexes, functional
indexes, and partitioned indexes by index usage method.
UNIQUE means to create a unique index, which will detect if there are duplicate
values in the table each time data is added, and report an error if the inserted or
updated values will result in duplicate records.
index_name is the name of the index to be created.
table_name is the name of the table where the index is to be created, which is
allowed to have a user modifier.
The sample code for creating an index online on the normal table “posts” is as
follows.
(1) Create the normal table “posts”.
Example code for creating a partitioned index on the partition table “education” is as
follows.
(1) Create the partition table “education”.
Create a list partition on the highest degree field, with the doctor partition
indicating the highest degree of doctor, master partition indicating master, and
bachelor partition indicating bachelor. Create indexes on the employee ID and
highest degree fields of the education table, with idx_education as the index
name. The indexes are built on three partitions, with the keyword PARTITION
followed by the names of the three partitions - doctor, master and bachelor.
An existing index definition can be changed by modifying the index properties, with
the following syntax format.
To create an index on the posts_id and post_name columns of the posts table, the
table name follows ON, the column names are in parentheses, and the default is
ascending, ASC can be omitted. Add the keyword ONLINE to create indexes online.
To rebuild an index online, you can use the ALTER INDEX statement; the online
rebuild keyword is REBUILD ONLINE, and idx_posts is the index name.
Renaming an index can be done using the ALTER INDEX statement, renaming
idx_posts to idx_posts_temp. The specific code is as follows.
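Sketches of these operations as just described; the keyword placement (ONLINE, REBUILD ONLINE, RENAME TO) follows the text and may differ slightly from the exact GaussDB syntax.
CREATE INDEX idx_posts ON posts (posts_id, post_name) ONLINE;   -- create the index online
ALTER INDEX idx_posts REBUILD ONLINE;                           -- rebuild the index online
ALTER INDEX idx_posts RENAME TO idx_posts_temp;                 -- rename the index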
4.3.6 View
A view is a dummy table derived from one or several base tables to control user
access to data, where the SQL statements involved are shown in Table 4.14.
A view is different from a base table in that only the definition of the view is
stored in the database, not the data corresponding to the view, which is still stored in
the original base table. The data queried from the view will change as the data in the
base table changes. In this sense, a view is like a window through which you can see
the data of interest to the user in the database and its changes.
The keyword to create a view is CREATE VIEW, and the syntax format is as
follows.
The view can be created by the CREATE VIEW statement; if it exists, you
need to update it, so add the keyword OR REPLACE, followed by the view
name training_view, and AS is followed by the subquery. If you need to view all
the data in the staff_id and score fields of the training table, the subquery is
SELECT staff_id,score FROM training.
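A sketch of the statement described in this step:
CREATE OR REPLACE VIEW training_view AS
SELECT staff_id, score FROM training;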
(2) Create the view training_view and specify the view column alias. As required,
the view should be updated if it exists, so the OR REPLACE keyword needs to
be added. The keyword is followed by the view name training_view; to specify
the view column alias, you can specify the column alias after the view name, and
the alias corresponds to the results found in the subquery, with the specific
statement as follows.
(3) View the data in the view. The method is the same as querying the data in the
table, replacing the table name after the query statement with the view name,
with the specific statement as follows.
(4) View the view structure. You can view the view structure by the DESCRIBE
statement, with the specific syntax as follows.
DESCRIBE training_view;
The keyword to delete the view is DROP VIEW, and the syntax format is as
follows.
If the view exists, IF EXISTS performs the deletion operation, and returns
success if the view does not exist.
For example, DROP VIEW IF EXISTS training_view; means if the view
training_view exists, then delete the view, and if it does not exist, then return
success.
The commit transaction statement makes permanent all operations in the current
transaction unit of work and ends the transaction.
The syntax format is as follows.
COMMIT;
SET autocommit=0;
COMMIT;
A transaction rollback undoes all operations in the current unit of work and ends the
transaction. The keyword to roll back a transaction is ROLLBACK, and the
syntax format is as follows.
ROLLBACK;
After the successful execution, the operation performed in the second step
will be undone, that is, the inserted data cannot be found from the table posts. In
the above example, if you do not add ROLLBACK, there will be a record in the
table; after adding ROLLBACK, the data in the table is null.
Transaction save point is a save point set in the transaction. Transaction save point
provides a flexible rollback method, where the transaction can be rolled back to a
save point during the execution, the operation before the save point is valid, and the
subsequent operations are rolled back. A transaction can set multiple save points.
The syntax format for setting the transaction save point is as follows.
SAVEPOINT savepoint_name
savepoint_name is the name of the save point. After rolling back to this save
point, the transaction state is the same as the transaction state at the time of setting the
save point, and the transaction operations of the database after this save point will be
rolled back.
An example of setting the transaction save points is as follows.
SAVEPOINT S1;
SAVEPOINT S2;
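A rollback to one of these save points can then be sketched as:
ROLLBACK TO SAVEPOINT S1;   -- operations performed after S1 are undone; those before S1 remain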
4.5 Others
Function description: The SHOW statement has many forms that provide information
about the database, tables, columns, server status, etc.
The syntax format is as follows.
SHOW DATABASES;
CREATE TABLE bonus_2019(staff_id INT NOT NULL, staff_name CHAR(50),
job VARCHAR(30), bonus INT);
SHOW TABLES;
SHOW TABLES FROM database_name;
# Tables_in_demo
bonus_2019
Function description: The SET statement enables the user to assign values to
different server or client variables.
The syntax format is as follows.
TRADITIONAL (affects the current session only)
SET SESSION sql_mode ='TRADITIONAL';
SET LOCAL sql_mode ='TRADITIONAL';
SET @@ SESSION.sql_mode ='TRADITIONAL';
SET @@ LOCAL.sql_mode ='TRADITIONAL';
SET @@ sql_mode ='TRADITIONAL';
SET sql_mode ='TRADITIONAL';
4.6 Summary
Upon studying this chapter, readers are expected to master the four languages of
SQL statements.
(1) Data query language (DQL): including simple query, conditional query, join
query, subquery, merging result set and other query methods.
(2) Data manipulation language (DML): including data insertion, data modification
and data deletion, etc.
(3) Data definition language (DDL): including the creation and deletion of tables,
indexes, sequences, etc.
(4) Data control language (DCL): including the commit and rollback of transactions.
This chapter introduces the syntax formats, notes, usage scenarios and typical
examples of each language.
The next step is to practice and think more, in order to understand how and why and
to apply the knowledge flexibly, thus improving the efficiency of database use and development.
4.7 Exercises
1. [Single Choice] The logical expression to find the record whose job is engineer
and salary is above 6000 is ( ).
A. position = 'engineer' or salary > 6000
B. position = engineer and salary > 6000
C. position = engineer or salary > 6000
D. position = 'engineer' and salary > 6000
2. [Single Choice] The expression “age BETWEEN 20 AND 30” in the WHERE
clause is equivalent to ( ).
A. age >= 20 AND age <= 30
B. age >= 20 OR age <= 30
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-
nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license and indicate if you modified the licensed material.
You do not have permission under this license to share adapted material derived from this chapter or
parts of it.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 5
Database Security Fundamentals
Database security management aims to protect the data in the database system to
prevent data leakage, tampering, and destruction. A database system stores all kinds of
important and sensitive data, and as a multi-user system, it is critical to provide
appropriate permissions for different users.
This chapter introduces the basic security management techniques used in the
database, including access control, user management, permission management,
object permissions, and cloud audit services, which will be elaborated in detail
from three aspects: basic concepts, usage methods, and application scenarios.
In a broad sense, the database security framework can be divided into three levels, as
shown in Fig. 5.1.
See Sect. 2.1.4 of this book for a detailed description of the database security
framework.
GaussDB (for MySQL) has the following main security defenses against intentional
and unintentional compromises.
(1) The first line of defense is formed through access control and SSL connection to
prevent client counterfeiting, information leakage and interactive message
tampering.
(2) The second line of defense is formed by user rights management, which mainly
reinforces the database server to prevent risks such as permission changes.
(3) The third line of defense is formed by security audit management, so that all operations
on the database can be traced.
GaussDB (for MySQL) also provides anti-DoS protection to prevent clients from
maliciously occupying server-side session resources. If a connection is not authen-
ticated within the set authentication time, the server will forcibly disconnect the
connection and release the session resources it occupies to avoid the connection
session resources exhaustion caused by malicious TCP connections. This setting can
effectively prevent DOS attacks.
This chapter will introduce the main strategies of database security management
from three aspects: access control, user rights management and cloud audit service.
Identity and Access Management (IAM) is a basic Huawei Cloud service for access
management, which helps users securely control access rights to Huawei Cloud
services and resources.
IAM can be used without payment, and users only need to pay for the resources in
the account. After registering with Huawei Cloud, the system will automatically create an
account, which is the subject of resource attribution and billing. Users have full
control over the resources they own and can access all the cloud services of Huawei
Cloud. If a user has purchased multiple resources in Huawei Cloud, such as Elastic
Cloud Server (ECS), Cloud Hard Disk (Elastic Volume Service, EVS), Bare Metal
Server (BMS), etc. for his/her team or application needs, he/she can use the user
management function of IAM to create IAM users for employees or applications and
grant each IAM user the appropriate permissions according to the job requirements.
Newly created IAM users can log in to Huawei Cloud using their individual user
names and passwords. IAM users are useful to avoid sharing passwords for accounts
when multiple users collaborate to operate the same account. The use of IAM is
shown in Fig. 5.2.
IAM also supports functions such as delegating resources to other accounts or cloud
services, setting account security policies, and eventual consistency. Its main
functions are described below.
(1) Fine-grained permission management.
Using IAM, different resources within the account can be assigned to the
created IAM users on demand to achieve fine-grained permission management,
as shown in Fig. 5.3.
For example, control user Charlie has the right to manage the VPCs in
Project B, while restricted user James only has the right to view the data of the
VPCs in Project B.
(2) Secure access.
You can use IAM to generate identity credentials for users or applications
without sharing the account password with other people, and the system will
allow users to securely access the resources in the account through the permis-
sion information carried in the identity credentials.
(3) Sensitive operations.
IAM provides sensitive operation protections including login protection and
operation protection. When logging in to the console or performing sensitive
operations, the system will require a second authentication such as a verification
code for email, cell phone or virtual MFA, so as to provide a higher level of
security protection for the account and resources.
(4) Bulk management of user permissions through user groups.
Instead of individual authorization for each user, just plan the user group and
grant the corresponding permission to the user group, then add the user to the
user group, so that the user inherits the permissions of the user group. If the user's
permissions change, just remove the user from the user group or add the user to
other user groups to achieve quick user authorization.
(5) Isolation of resources within a region.
Through creating sub-projects in the region, the resources between projects
under the same region can be isolated from each other.
IAM provides authentication and authorization functions for other Huawei Cloud
services. Users created in IAM can use other services in the system according to their
permissions after authorization. For services that do not support the use of IAM
authorization, the IAM user created in the account must log in with the account to
use the cloud services. The explanation of related terms in IAM authorization is
shown below.
(1) Service: Cloud services that use IAM authorization, whose service name can be
clicked to display the permissions supported by the service and the difference
between the different permissions.
(2) Region: The region selected for authorization by the cloud service when using
IAM authorization.
(3) Global region: The service is deployed without specifying a physical region, i.e.,
a global-level service, where the service is authorized in a global project and can
be accessed without switching regions.
(4) Other regions: The service is deployed in a specified physical region, i.e., a
project-level service, where authorization is performed in regions other than the
global region and takes effect only in the authorized region; accessing the
cloud service requires switching to the corresponding region.
(5) Console: Whether the cloud service supports permission management in the
IAM console.
(6) API: Whether the cloud service supports calling API for permission
management.
(7) Delegation: The user delegates operation permissions to the service, and allows
the service to use other cloud services as itself, performing daily tasks on behalf
of the user.
(8) Policy: whether the cloud service supports permission management through
policies. A policy describes a set of permissions in JSON format, which
precisely allows or denies users to perform specified operations on the
resource types of the service; an illustrative sketch follows this list.
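An illustrative sketch of what such a JSON policy might look like; the version number and the action string are assumptions for illustration only, not taken from this book.
{
  "Version": "1.1",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["gaussdb:instance:list"]
    }
  ]
}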
The flow of IAM using GaussDB (for MySQL) is shown in Fig. 5.4.
(1) Create a user group and authorize it. Create a user group in IAM console and
grant GaussDB (for MySQL) read-only access "GaussDB ReadOnlyAccess".
(2) Create users and join user groups. Create users in the IAM console and add them
to the user group created in the previous step.
(3) Users log in and verify permissions. Switch to the authorization area in the
newly created user login console and verify the permissions. Select GaussDB
(for MySQL) in the "Service List" to display the main interface of GaussDB (for
MySQL), click the "Purchase a database instance" button in the upper right
corner, and try to buy an instance of GaussDB (for MySQL). If the purchase
failed (assuming that the current permission only contains GaussDB
ReadOnlyAccess), it means that "GaussDB ReadOnlyAccess" is in effect.
Select any service other than cloud database GaussDB (for MySQL) in the
"Service List" (assuming the current policy only contains GaussDB
ReadOnlyAccess), and if it indicates insufficient permissions, it means
"GaussDB ReadOnlyAccess" is in effect.
The Secure Sockets Layer (SSL) protocol is a security protocol that provides security
and data integrity to network communications. It is important for the following
reasons.
(1) It is very dangerous to transmit sensitive data (bank data, transaction informa-
tion, password information, etc.) in clear text in the network, and the purpose of
SSL protocol is to provide communication security and data integrity guarantee.
(2) In the seven-layer Open Systems Interconnection (OSI) model, the SSL protocol is located
between the transport layer and the application layer, providing support for
secure communication. Many application layer protocols have derived more
secure protocols by integrating SSL protocol, such as HTTPS.
(3) Google, Facebook, Taobao and other current mainstream websites and applica-
tions all support SSL communication encryption.
(4) GaussDB (for MySQL) supports SSL communication encryption between client
and server to ensure the security and integrity of data transmission.
The symmetric encryption algorithm of SSL is to use the same key for encryption
and decryption, which is characterized by open algorithm, fast encryption and
decryption, and high efficiency. Asymmetric encryption algorithm contains a pair
of keys: public key and private key. Encryption and decryption use different keys
and are characterized by high algorithm complexity, high security and poor perfor-
mance compared to symmetric encryption. SSL uses an asymmetric encryption
algorithm to negotiate the session key during the handshake phase. After the
encryption channel is established, the transmitted data is encrypted and decrypted
using a symmetric encryption algorithm.
Permissions are the ability to execute specific SQL statements and the ability to
access or maintain particular objects. As you can imagine, it is easy to
manage a village with only a few dozen households, but it would be relatively
difficult to manage a large city with several million people. Permission control on
users is especially important for database resource and security management.
GaussDB (for MySQL) supports the management of user permissions, which
allows you to configure the user's operational access to database objects and the use
of database functions.
The permissions granted to GaussDB (for MySQL) accounts determine the
operations that the accounts can perform. The different permissions of GaussDB
(for MySQL) differ in the contexts and operation levels to which they apply, as
shown below.
(1) Administrative permission: enables users to manage GaussDB (for MySQL)
server operations; the permission is global, as it is not specific to a particular
database.
(2) Database permission: applies to the database and all objects in it; the permission
can be granted for a specific database or globally in order to meet different needs.
(3) Object permission: can be granted to specific objects in the database, all objects
of a given type in the database (such as all tables in the database), or all objects
globally (such as tables, indexes, views, and stored routines).
GaussDB (for MySQL) supports both static and dynamic permissions, with static
permissions built into the server. They can always be granted to user accounts and
cannot be unregistered. Dynamic permissions can be registered and deregistered at
runtime, but this affects their availability. Dynamic permissions that have not been
registered cannot be granted.
The GaussDB (for MySQL) server controls user access to the database through
permission tables, which are stored in the GaussDB (for MySQL) database and
initialized when the database is initialized. An example of permission table is shown
in Table 5.1.
5.3.2 Users
As a database administrator, you should create a database user for each user who
needs to connect to the database. The database user connects to the database by user
name and password. The user here refers to a database user who, after connecting to
the database, can manipulate database objects and access database data, such as
creating tables, accessing tables, and executing SQL statements.
By default, users of GaussDB (for MySQL) database can be divided into
3 categories.
System administrator: has the highest permissions of the database (e.g. SYS user,
SYSDBA user).
Security administrator: has the CREATE USER permission.
Ordinary user: by default, has PUBLIC object permission and only has the
permission of the object they created; if you need other permissions, you need to
be empowered by the system administrator through the GRANT statement.
SYSDBA is a user who can log in to the database without a password, using "zsql /
AS SYSDBA" to connect to the database.
Two points should be noted here. First, when connecting to a database, the database
user must use a database that already exists, and cannot connect to a database that
does not exist. Second, a user can establish multiple connections to the database, that
is, multiple sessions can be established for operations.
Users can be created by the CREATE USER statement. When using this state-
ment, the following three points should be noted.
(1) The user executing this statement needs to have CREATE USER system
permissions, otherwise no new user can be created.
(2) When creating a user, you need to specify the user name and password; these are
the user name and password required when the user later connects to the
database.
(3) The root user is not allowed to be created, because it is a system-preset user.
The common syntax format for creating users is as follows.
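A minimal sketch of the likely form, with user_name and password as placeholders:
CREATE USER user_name IDENTIFIED BY 'password';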
user_name is the user name; password is the user password, which needs to be
enclosed by single quotes. After the user is successfully created, you can connect to
the database with the corresponding user name and password.
The following special characters are not allowed in the user name.
Semicolon (;), vertical line (|), backquote (`), dollar sign ($), bit operator (&),
greater than sign (>), less than sign (<), double quote (""), single quote (''),
exclamation mark (!), spaces, and the copyright symbol (©). Double quotes and
backquotes are also not allowed. If the user name contains any special characters
other than those prohibited above, it must be enclosed in double quotation marks ("")
or backquotes (``).
When setting a password for a user name, the following requirements must be
met.
(1) The length of the password must be greater than or equal to eight characters.
(2) When creating a password, the password must be enclosed in single quotes.
Example: To create a user with the username "smith" and the password "data-
base_123", you can execute the following statement.
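A possible statement, assuming the form sketched above:
CREATE USER smith IDENTIFIED BY 'database_123';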
The user name consists of letters, and the password contains letters, special
symbols and numbers, which meet the requirements and can be created successfully.
The password in the example satisfies the password requirements.
You can modify users with the ALTER USER statement, during which you should pay attention to the
following matters.
(1) The user executing this statement needs to have ALTER USER system permis-
sions, similar to CREATE USER permissions.
(2) If the specified user does not exist, an error message will be displayed. Only the
user that already exists can be modified.
User modification is mainly applied to the following scenarios.
(1) Modify the user password.
(2) Manually lock the user or unlock the user. For example, if a user has been locked
out after a certain number of failed login attempts, the user needs to be unlocked.
The syntax format for changing the user password is as follows.
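A minimal sketch of the likely form, with user_name and new_password as placeholders:
ALTER USER user_name IDENTIFIED BY 'new_password';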
user_name is the user name to be changed and new_password is the new user
password.
Example: To change user smith's password to "database_456". The administrator
can change it directly with the following statement.
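For example, assuming the form sketched above:
ALTER USER smith IDENTIFIED BY 'database_456';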
When a user is no longer in use, it is necessary to delete the user, and all the objects
created by the user will be deleted accordingly. You can delete a user by the DROP
USER statement. Note that the user executing the statement needs to have the DROP
USER system permission, similar to the CREATE USER permission.
The syntax format for deleting a user is as follows.
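A minimal sketch of the likely form:
DROP USER [IF EXISTS] user_name;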
Example: To delete user smith, you can use the following statement.
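A possible statement, assuming the form sketched above:
DROP USER smith;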
5.3.5 Roles
A role is a set of permissions, by which the database can divide permissions at the
organization level. The concept of roles was not introduced until MySQL 8. A
database may be accessed by multiple users, so for easy management, you can
first group permissions and assign them to roles, with each set of permissions
corresponding to one role. For users with different permission levels, you can
grant different roles to users, equivalent to granting the permissions that users
need in bulk, instead of granting them one by one.
For example, a company can have multiple financial roles with permissions such
as paying wages and allocating funds. A role does not belong to any user, that is, a
role is not private to a user, but can be owned by multiple users. For example, finance
is a role that is not private to a single employee, but can be shared by multiple
employees. Suppose the user smith creates the role staffs, then smith.staffs is private
to smith. Other users can access or operate on smith.staffs if they have the appro-
priate permissions, but smith.staffs belongs only to the smith user.
Roles can be created through the CREATE ROLE statement. It should be noted
that the user executing the statement needs to have the CREATE ROLE system
permission. The role neither belongs to any user nor can log in to the database and
execute SQL statement operations, and the role must be unique in the system.
GaussDB (for MySQL) contains the following four system-preconfigured roles
by default.
(1) Database administrator: has all system permissions, which cannot be deleted.
(2) RESOURCE, the role to create base object: has the permission to create stored
procedures, functions, triggers, tables, and sequences.
(3) CONNECT, the role to connect: has the permission to connect to the database.
(4) STATISTICS, the statistics role.
The syntax format for creating a role is as follows.
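A minimal sketch of the likely form:
CREATE ROLE role_name;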
The role can be deleted with the DROP ROLE statement. When deleting a role,
the user executing the statement must have the DROP ANY ROLE system permis-
sion, or be the creator of the role, or have been granted the role and have the WITH
GRANT OPTION attribute. If the role to be deleted does not exist, an error message
is displayed. When a role is deleted, the permissions contained in the role are recovered
from the users or other roles to which the role was granted, and those users or roles
lose the permissions contained in the role.
The syntax format for deleting a role is as follows.
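A minimal sketch of the likely form:
DROP ROLE role_name;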
5.3.6 Authorization
The previous sections all mention permissions, which need to be granted. Authori-
zation is the granting of permissions or roles to users or other roles, so that the
corresponding users or roles have the appropriate permissions. For example, a newly
created user has no permission and cannot perform any operations on the database or
even connect to the database. If you grant the user the CREATE SESSION (create
connection) permission, the user gains the right to connect to the database. If the
user needs to create a table, he/she needs to have the CREATE TABLE permission
to create a table. The table created by this user belongs to the object of this user, and
this user can add, delete, change, and check the data in the table. Authorization can
be achieved through the GRANT statement, which can grant one permission to a
user or role, or multiple permissions to a user or role at the same time. You can grant
Permission 1 to User 1, or grant Permissions 1, 2, and 3 to Role 1, which can then be
granted to Role 2, and finally you can grant the permissions of Role 2 to the user,
as shown in Fig. 5.6.
The common syntax format for permission granting is as follows.
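A minimal sketch of the likely form, with placeholders for the permission names and the grantee:
GRANT permission1 [, permission2, ...] TO user_or_role [WITH GRANT OPTION];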
To grant a permission, the user executing the grant statement needs to have been
granted the permission and have the WITH GRANT OPTION attribute.
Example: To grant the CREATE USER permission to the user smith, and allow
smith to grant this permission to other users or roles.
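A possible statement, assuming the form sketched above:
GRANT CREATE USER TO smith WITH GRANT OPTION;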
The syntax format for granting roles is similar to the format for granting permis-
sions, as follows.
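A minimal sketch of the likely form:
GRANT role_name TO grantee [WITH GRANT OPTION];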
role_name is the role name and grantee is the user or role to be granted. WITH
GRANT OPTION is optional, if set, the granted user or role can re-grant the granted
role to other users or roles.
To grant the role, the user executing the granting role statement needs to meet one
of the following conditions.
(1) It has been granted the role and has the WITH GRANT OPTION attribute.
(2) It is the creator of the role.
Example: To grant the role of teacher to smith and allow smith to grant this role to
other users or roles.
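A possible statement, assuming the form sketched above:
GRANT teacher TO smith WITH GRANT OPTION;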
Having the WITH GRANT OPTION attribute means that the authorized user can
re-grant the acquired permission or role to other users or roles.
When a user who has been granted a role no longer needs to have the permissions
contained in the role, the user's role permissions should be recovered. For example, if
Employee A is a finance employee who has the right to view the company's funds,
his/her finance role must be recovered when he/she leaves. The system administrator
(SYS user, user in the database administrator role) has all system permissions,
including the GRANT ANY ROLE system permission, so the system administrator
can execute the role recover statement.
If the role is to be recovered, the user who performs the REVOKE operation
needs to meet one of the following conditions.
(1) It has been granted the role and has the WITH GRANT OPTION attribute.
(2) It is the creator of the role being recovered.
The common syntax format for recovering a role is as follows.
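A minimal sketch of the likely form, with role_name and revokee as placeholders:
REVOKE role_name FROM revokee;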
role_name is the name of the role, and revokee is the user or role whose
permissions are recovered. Up to 63 users or roles can be assigned at a time. Note
that you are not allowed to recover the permissions of the database administrator
role. The initial permissions of the database administrator role are determined when
the database is created, and you can subsequently grant permissions to the database
administrator role, but are not allowed to recover its permissions.
The use of permissions should follow the principle of minimization, and in order
to ensure the security of the database, permissions and roles need to be recovered in
time when they are not in use.
An example of the application of users, roles and permissions is as follows.
To create the user smith, with the password database_123.
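A possible sequence of statements; the role name staffs is taken from the earlier discussion, and the granted permissions are illustrative.
CREATE USER smith IDENTIFIED BY 'database_123';
CREATE ROLE staffs;
GRANT CREATE SESSION, CREATE TABLE TO staffs;
GRANT staffs TO smith;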
The log audit module is the core component of information security audit function,
and is an important part of enterprises' and organizations' risk control on information
system security. In the context of gradual cloudization of information systems,
global information and data security management organizations at all levels, includ-
ing China's National Standardization Technical Committee, have issued several
standards in this area, such as ISO/IEC 27000, GB/T 20945-2013, COSO, COBIT,
ITIL, NIST SP 800, etc.
Cloud Trace Service (CTS) is a professional log auditing service contained in
Huawei's cloud security solution, providing the collection, storage, and query
functions for various cloud resource-related operation records, which can be used
to support common application scenarios such as security analysis, compliance
audit, resource tracking, and problem location, as shown in Fig. 5.7.
With the cloud audit service, operation events related to GaussDB (for MySQL)
instances can be recorded for future queries, audits, and tracebacks. The key
operation events supported by the cloud audit service are shown in Table 5.2.
After the cloud audit service is started, the system begins to record operations on
cloud service resources, and these tracked events can be viewed. The cloud audit service management
console keeps a record of the last seven days of operations. Log in to the manage-
ment console and select the "Manage & Deploy > Cloud Audit Service" option in
the "All Services" or "Service List" to enter the information page of the cloud audit
service; then select the "Event List" option in the left navigation tree to enter the
event list information page. The event list supports filtering to query the corresponding
operation events. The current event list supports combined queries across four
dimensions, with the relevant content described below.
Table 5.2 Key operation events supported by the cloud audit service
Operation                                            Resource type   Event
Create an instance                                   Instance        createInstance
Add a read-only node                                 Instance        addNodes
Delete a read-only node                              Instance        deleteNode
Restart an instance                                  Instance        restartinstance
Modify an instance port                              Instance        changeInstancePort
Modify an instance security group                    Instance        modifySecurityGroup
Upgrade a read-only instance to a primary instance   Instance        instanceFailOver
Bind or unbind a public IP                           Instance        setOrResetPublicIP
Remove an instance                                   Instance        deleteInstance
Rename an instance                                   Instance        renameInstance
Modify the node priority                             Instance        modifyPriority
Modify the specification                             Instance        instanceAction
Reset the password                                   Instance        resetPassword
Back up and restore to a new instance                Instance        restoreInstance
Create a backup                                      Backup          createManualSnapshot
Delete a backup                                      Backup          deleteManualSnapshot
Create a parameter template                          parameterGroup  createParameterGroup
Modify a parameter template                          parameterGroup  updateParameterGroup
Delete a parameter template                          parameterGroup  deleteParameterGroup
Copy a parameter template                            parameterGroup  copyParameterGroup
Reset a parameter template                           parameterGroup  resetParameterGroup
Compare parameter templates                          parameterGroup  compareParameterGroup
Apply a parameter template                           parameterGroup  applyParameterGroup
(1) Event source, resource type and filter type. You can select the corresponding
query conditions in the drop-down box.
Generally, select "CloudTable" as the event source; select "All Resource
Types" as the resource type, or specify a specific resource type; and select "All
Filter Types" as the filter type, or select one of "By Event Name", "By Resource
ID", "By Resource Name".
(2) Operation user. You can select a specific operation user in the drop-down box,
and this operation user is at user level, not at tenant level.
(3) Event Level. The options are "All Event Levels", "Normal", "Warning", "Inci-
dent". Only one of them can be selected.
(4) Start time and end time. The operation events can be queried by selecting the
time period.
5.5 Summary
This chapter firstly introduces the basic concepts, usage and application scenarios of
users, roles and permissions, and the relationship between the three; then elaborates
on authorization and permission recovery, including the syntaxes and the conditions
that need to be satisfied by users who perform authorization or permission recovery
operations.
5.6 Exercises
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-
nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license and indicate if you modified the licensed material.
You do not have permission under this license to share adapted material derived from this chapter or
parts of it.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 6
Database Development Environment
6.1.2 JDBC
Java database connectivity (JDBC) is a Java API for executing SQL statements that
provides a unified access interface to a variety of relational databases. Applications
manipulate data through JDBC. The flow of JDBC connection to database is shown
in Fig. 6.2.
GaussDB (for MySQL) database provides support for JDBC 4.0 features. To
compile the program code, you need to use JDK 1.8.
The installation and configuration steps of JDBC are as follows.
(1) Configure the JDBC package.
Download the driver package from the relevant website, decompress it and
configure it in the project.
JDBC package name: com.huawei.gauss.jdbc.ZenithDriver.jar.
(2) Load the driver.
Before creating a database connection, you need to load the database driver
class by calling Class.forName("com.huawei.gauss.jdbc.ZenithDriver") in the
code.
(3) Connect to the database.
Before remotely accessing the database, you need to set the IP address and
port number for LSNR_IP and LSNR_PORT monitoring in the configuration
file zengine.ini.
When creating a database connection using JDBC, the following function is
required.
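In code, this is the standard java.sql.DriverManager call; the URL below follows the jdbc:zenith:@ip:port form used in the full example later in this section, and the user name and password are placeholders.
Connection conn = DriverManager.getConnection("jdbc:zenith:@ip:port", "user", "password");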
Another way to load the database driver classes is to pass the parameter at
the start of the JVM (Java Virtual Machine), where jdbctest is the name of the test
case program.
java -Djdbc.drivers=com.huawei.gauss.jdbc.ZenithDriver jdbctest
This method is not commonly used, so you just need to know about it without going
into particular detail.
After the database driver class is loaded, you need to connect to the database.
Before remote access to the database, set the IP address and port number to be
monitored by the corresponding parameters in the configuration file, and then use the
JDBC to create a database connection. The database connection includes three
parameters: url, user, and password, as shown in Table 6.1.
In the url parameter, ip is the database server name, port is the database server
port, and the url connection attributes are split by the & symbol. Each property is a
key/value pair.
Table 6.2 shows the common interfaces of the JDBC.
The following introduces the development and debugging of JDBC application
with the Eclipse environment under Windows operating system as an example.
• Operating system environment: Win10-64bit.
• Compiling and debugging environment: Eclipse SDK version:3.6.1.
The steps of running JDBC application are shown below.
(1) Create a project in Eclipse.
New → Project → Java Project → Next → enter ProjectName (such as
test_jdbc) → Finish.
The code for compiling and running the JDBC application is as follows.
package com.huawei.gauss.jdbc.executeType;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import com.huawei.gauss.jdbc.inner.GaussConnectionImpl;
public class jdbc_test {
    public void test() {
        // Driver class
        String driver = "com.huawei.gauss.jdbc.ZenithDriver";
        // Database connection descriptor
        String sourceURL = "jdbc:zenith:@10.255.255.1:1888";
        Connection conn = null;
        try {
            // Load the database driver class
            Class.forName(driver).newInstance();
        } catch (Exception e) {
            // Print the exception stack trace
            e.printStackTrace();
        }
        try {
            // Connect to the database; test_1 is the user name and Gauss_234 is the password
            conn = DriverManager.getConnection(sourceURL, "test_1", "Gauss_234");
            // Execute the SQL statement; if this row can be found in the database, the insertion succeeded
            PreparedStatement ps = conn.prepareStatement("INSERT INTO t1 values (1, 2)");
            ps.execute();
            // If execution succeeds, the console prints "Connection succeed!"
            System.out.println("Connection succeed!");
        } catch (Exception e) {
            // Print the exception stack trace
            e.printStackTrace();
        }
    }
}
6.1.3 ODBC
[GaussDB]
Driver64=/usr/local/odbc/lib/libzeodbc.so
setup=/usr/local/lib/libzeodbc.so
[zenith]
Driver=DRIVER_N
Servername=192.168.0.1 (database server IP)
Port=1888 (database listening port)
export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH
export ODBCSYSINI=/usr/local/etc
export ODBCINI=/usr/local/etc/odbc.ini
#ifdef WIN32
#include <windows.h>
#endif
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "sql.h"
#include "sqlext.h"
int main()
{
    SQLHANDLE h_env, h_conn, h_stmt;
    SQLRETURN ret;
    SQLCHAR *dsn = (SQLCHAR *)"myzenith";     /* data source name */
    SQLCHAR *username = (SQLCHAR *)"sys";     /* user name */
    SQLCHAR *password = (SQLCHAR *)"sys";     /* password */
    SQLSMALLINT dsn_len = (SQLSMALLINT)strlen((const char *)dsn);
    SQLSMALLINT username_len = (SQLSMALLINT)strlen((const char *)username);
    SQLSMALLINT password_len = (SQLSMALLINT)strlen((const char *)password);
    h_env = h_conn = h_stmt = NULL;
    // Allocate the environment handle
    ret = SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &h_env);
    if ((ret != SQL_SUCCESS) && (ret != SQL_SUCCESS_WITH_INFO)) {
        return SQL_ERROR;
    }
    // Set environment handle attributes
    if (SQL_SUCCESS != SQLSetEnvAttr(h_env,
            SQL_ATTR_ODBC_VERSION, (void *)SQL_OV_ODBC3, 0)) {
        SQLFreeHandle(SQL_HANDLE_ENV, h_env);
        return SQL_ERROR;
    }
    // Allocate the connection handle
    if (SQL_SUCCESS != SQLAllocHandle(SQL_HANDLE_DBC, h_env, &h_conn)) {
        return SQL_ERROR;
    }
    // Set the autocommit attribute on the connection handle
    if (SQL_SUCCESS != SQLSetConnectAttr(h_conn,
            SQL_ATTR_AUTOCOMMIT, (void *)1, 0)) {
        SQLFreeHandle(SQL_HANDLE_DBC, h_conn);  // Release the ODBC handles
        SQLFreeHandle(SQL_HANDLE_ENV, h_env);
        return SQL_ERROR;
    }
    // Connect to the data source
    if (SQL_SUCCESS != SQLConnect(h_conn, dsn, dsn_len, username,
            username_len, password, password_len)) {
        SQLFreeHandle(SQL_HANDLE_DBC, h_conn);
        SQLFreeHandle(SQL_HANDLE_ENV, h_env);
        return SQL_ERROR;
    }
    // Allocate the statement handle
    if (SQL_SUCCESS != SQLAllocHandle(SQL_HANDLE_STMT, h_conn, &h_stmt)) {
        SQLFreeHandle(SQL_HANDLE_DBC, h_conn);
        SQLFreeHandle(SQL_HANDLE_ENV, h_env);
        return SQL_ERROR;
    }
    // Create a table and insert a record
    SQLCHAR *create_table_sql = (SQLCHAR *)"CREATE TABLE test(col INT)";
    // Execute the SQL statement directly
    SQLExecDirect(h_stmt, create_table_sql, (SQLINTEGER)strlen((const char *)create_table_sql));
    SQLCHAR *insert_sql = (SQLCHAR *)"INSERT INTO test (col) values (:col)";
    // Prepare the SQL statement to execute
    SQLPrepare(h_stmt, insert_sql, (SQLINTEGER)strlen((const char *)insert_sql));
    int col = 1;
    // Bind the parameter to the prepared statement handle
    SQLBindParameter(h_stmt, 1, SQL_PARAM_INPUT, SQL_C_SLONG,
        SQL_INTEGER, sizeof(int), 0, &col, 0, NULL);
    // Execute the SQL statement
    SQLExecute(h_stmt);
    printf("Connection succeed!\n");
    // Free the statement handle and disconnect from the database
    SQLFreeHandle(SQL_HANDLE_STMT, h_stmt);
    SQLDisconnect(h_conn);
    // Free handle resources
    SQLFreeHandle(SQL_HANDLE_DBC, h_conn);
    SQLFreeHandle(SQL_HANDLE_ENV, h_env);
    return SQL_SUCCESS;
}
6.1.4 Others
The code to create a connection object using the GSC (C-API) is as follows.
(2) Go driver. The Go driver is released as source code, and the upper-level
application brings the code into the application project and compiles it with
the application for use. From the file level, Go driver is divided into three parts:
Go API, C driver library and C header file. The Zenith Go driver is based on the
Zenith C driver, which is wrapped through cgo technology. The lib
subdirectory contains the dynamic library of the C driver, and the include
subdirectory contains the C header files involved in cgo. The Go driver relies on
GCC 5.4 or above and Go 1.12.1 or higher.
(3) Dynamic library of the Python driver: pyzenith.so. When using the Python
driver to connect to a database, get the Connection and establish the connection
by calling pyzenith.connect. The Python driver for GaussDB (for MySQL) is used on
the Linux operating system. The driver supports date and time objects, which are
constructed with the following functions.
Date(year,month,day)—constructs an object containing the date.
Time(hour,minute,second)—constructs an object containing the time.
Timestamp(year,month,day,hour,minute,second,usec)—constructs an object
containing the timestamp.
DateFromTicks(ticks)—constructs the date value from the given ticks value.
TimeFromTicks(ticks)—constructs the time value from the given ticks value.
TimestampFromTicks(ticks)—constructs the timestamp value from the given ticks
value.
The sample code to execute the SQL statement and get all the tuples is as
follows.
import pyzenith
conn = pyzenith.connect('192.168.0.1', 'gaussdba', 'database_123', '1888')
c = conn.cursor()
c.execute("CREATE TABLE testexecute(a INT, b CHAR(10), c DATE)")
c.execute("INSERT INTO testexecute values(1,'s','2012-12-13')")
c.execute("SELECT * FROM testexecute")
row = c.fetchall()
c.close()
conn.close()
The disadvantages of sharding implemented by the application itself include intrusive
application updates and the high cost of operation and maintenance, and the fact that only Java is
supported in most cases. The advantages of middleware sharding solutions (e.g.,
open source Mycat, Cobar, commercial software Ekoson, etc.) are zero changes to
the application, language-independence, full transparency to the application for
database scaling, and effective convergence of the number of connections through
connection sharing. The disadvantage is the possibility of additional latency (<4%).
The key features of DDM are read/write separation, data sharding, and smooth
database scaling. In the past, the read/write separation was controlled by the appli-
cation itself, including configuring all database information in the client and realiz-
ing the read/write separation; database adjustment requires synchronous
modification of the application, and database failure requires modification of the
application, at which time the operation and maintenance and development need to
synchronize the adjustment configuration. Nowadays, DDM achieves read/write
separation, including: plug-and-play—automatic read/write separation and support
for configuring performance weights for different nodes; application transparency—
the application still operates a single node and database adjustments are not
application-aware; and high availability—master-slave switchover or slave node
failure is transparent to the application, as shown in Fig. 6.6.
The sharding application logic implemented by the application itself is complex:
the application rewrites SQL statements, routes the SQL to different databases, and
aggregates the results; database failure and adjustment require synchronous adjust-
ment by the application, which makes operation and maintenance more difficult
dramatically; the application upgrade and update maintenance workload is large and
unacceptable for large systems. Today, data sharding is implemented by DDM with
zero application changes: large table sharding, which supports automatic sharding by
hash and other algorithms; automatic routing, which routes SQL to the real data
source according to the sharding rules; connection multiplexing, which is used to
substantially improve concurrent database access through connection pool
multiplexing of MySQL instances. A comparison of data sharding is shown in
Fig. 6.7.
If the application itself implements the horizontal expansion of the database, the
expansion is prone to application downtime and service interruption, so tools for
data migration are required. Nowadays, horizontal expansion of database by DDM
can automatically balance data, achieve unlimited expansion (unlimited number of
supported shards, ease coping with massive data), full automation (one-click expan-
sion, automatic rollback of abnormalities), and small impact on services (second-
level interruption, no service awareness at other times). A comparison of database
horizontal scaling is shown in Fig. 6.8.
The applicable scenarios of DDM are as follows.
(1) High-frequency transactions on large applications: e-commerce, finance, O2O,
retail, and social applications. Characteristics: large user base, frequent market-
ing activities, and increasingly slow response of the core database. Countermea-
sures: The linear horizontal scaling function provided by DDM can easily cope
with the high concurrent real-time transaction scenarios.
(2) IoT massive sensors: industrial monitoring, smart city, and Internet of Vehicles.
Characteristics: many sensing devices, high sampling frequency, large data
scale, breakthrough of single database bottleneck. Countermeasures: The capac-
ity horizontal expansion function provided by DDM can help users to store
massive data at low cost.
(3) Massive video and picture data index: Internet, social applications, etc. Charac-
teristics: billions of pictures, documents, videos and other data items, and
extremely high performance requirements for indexing these files and providing
real-time addition, deletion, change and query operations. Countermeasures:
The ultra-high performance and distributed expansion function provided by
DDM can effectively improve the search efficiency of the index.
(4) Traditional program, hardware and government agencies: large enterprises and
banks. Characteristics: Traditional solutions rely on commercial solutions with
high hardware cost such as minicomputers and high-end storage. Countermea-
sures: The linear horizontal scaling function provided by DDM can easily cope
with highly concurrent real-time transaction scenarios.
How to use DDM - buy a database middleware.
Step 1: Console > Database > Distributed Database Middleware (DDM) Instance
Management.
Step 2: Click the “Buy Database Middleware Instance” button, as shown in Fig. 6.9.
Step 3: Select “Pay As You Go” for the billing mode. Leave the default settings for
region, available partitions and instance specifications if there are no special
needs, as shown in Fig. 6.10.
Step 4: Enter the instance name, select the corresponding virtual private cloud,
subnet and security group (the virtual private cloud must be consistent with the
database instance), and then click the “Buy Now” button, as shown in Fig. 6.11.
Step 5: Confirm the specification, check the check box to agree to the service
agreement, and click the “Submit” button, as shown in Fig. 6.12.
Multi-instance and distributed cluster are shown in Fig. 6.13.
How to use DDM - data sharding.
Step 1: Console > Database > Distributed Database Middleware (DDM) Instance
Management.
Step 2: Select the instance that needs to be sharded, and click the “Create Logical
Library” text hyperlink, as shown in Fig. 6.14.
Step 3: Select the split mode, set the logical library name and transaction model, and
select the associated RDS instance, as shown in Fig. 6.15.
Step 4: Select the RDS instance where the logical library can be created (same as the
case of virtual private cloud), and click the “Create” button, as shown in Fig. 6.16.
The data is successfully sharded, as shown in Fig. 6.17.
6.2.2 DRS
Data replication service (DRS) is an easy-to-use, stable, and efficient cloud service
for online database migration and real-time database synchronization. DRS targets
cloud databases, reducing the complexity of data flow between databases and
effectively reducing the cost of data transfer.
It features the following capabilities. Online migration: It supports a variety of
service scenarios such as cross-cloud platform database migration, under-cloud
database migration to the cloud or cross-region database migration on the cloud.
DRS classifies users who need to be migrated into three categories, i.e. users who
can be migrated completely, users who need to be downgraded and users who cannot
be migrated, as shown in Fig. 6.22.
In DRS parameter migration, most of the parameters that are not migrated do not
cause the migration to fail, but they often have a direct impact on the operation and
performance of the service. DRS supports parameter migration to make the service
and application run more smoothly and worry-free after database migration. Service
parameters include character set settings, maximum number of connections, sched-
uling related settings, lock wait time, Timestamp default behavior and connection
wait time. The performance parameters include *_buffer_size and *_cache_size, as
shown in Fig. 6.23.
The migration progress can be viewed through the table, and when the “number of objects” and “number of migrated
objects” are equal, the migration of the object is complete. You can view the
migration progress of each object through the “View Details” hyperlink, and when
the progress is 100%, the migration is complete, as shown in Fig. 6.25.
The macro comparison at object level is used to determine whether data objects
are missing; the data is proofread in detail by data-level comparison. The comparison
of rows and contents at different levels is shown in Fig. 6.26.
6.2.3 DAS
Data is the core asset of the enterprise. How to control the access rights of
sensitive data, realize the security of database changes, audit the operation retroac-
tively and reduce the labor cost of DBA is an important demand of enterprises when
the number of database instances reaches a certain scale.
The advantages of DAS enterprise version are as follows.
(1) Secure data access: Employees do not have access to database login name and
password, and need to apply for permission first for querying the library; it
supports multi-dimensional query control on total number of queries per day,
total data rows, maximum number of rows returned per query, etc.
(2) Sensitive data protection: Sensitive fields are automatically identified and
marked; sensitive data will be desensitized and displayed when employees
perform query and export operations.
(3) Change security: All operations on the library are recorded in audit logs, and the
database operation behavior is traceable.
(4) Operation audit: It features risk identification of SQL change, service audit
control; automatic detection of database water level when change is executed;
and data cleaning for large data tables.
(5) Improved efficiency and reduced cost: It features flexible security risk and
approval process customization; the empowerment of the roles of service head
and database administrator on the library delegates the low-risk library change
operations to the service supervisor, reducing the labor cost of the database
administrator in the enterprise.
How to use DAS—add a database connection. The steps are as follows.
Step 1: Console > Database > Data Admin Service (DAS).
Step 2: Click the “Add Database Login” button, as shown in Fig. 6.29.
Step 3: Select the database type as GaussDB (for MySQL).
Step 4: Select the database source (RDS instance), and select the instance under the
corresponding source, as shown in Fig. 6.30.
Step 5: Fill in the login user name and password under the selected instance, and it is
recommended to check the “Remember Password” checkbox, as shown in
Fig. 6.31.
Step 6: Click the “Add Now” button, as shown in Fig. 6.32.
Step 7: Select the database instance to log in, and click the “Login” hyperlink, as
shown in Fig. 6.33.
Step 8: Log in to the DAS Administration page successfully, as shown in Fig. 6.34.
How to use DAS—create an object.
In the Library Management page, we can create and manage database objects,
diagnose SQL, and collect metadata, following the steps as follows.
Step 1: Click “New Database” on the homepage, fill in the database name and click
"OK", as shown in Fig. 6.35.
Step 2: After successful login, you can enter the Library Management page, as
shown in Fig. 6.36.
Step 3: Click the “New Table” button, as shown in Fig. 6.37.
Step 4: Enter the New Table page, set the table's basic information, fields, indexes
and other information, as shown in Fig. 6.38.
Step 5: After setting up, click the “Create Now” button, as shown in Fig. 6.39.
Step 6: In addition to tables, we can also create new views, stored procedures, events
and other objects, as shown in Fig. 6.40.
How to use DAS—perform SQL operations.
Open the SQL Operations page, where automatic SQL input prompts assist in
completing SQL statements.
Step 1: Click the “SQL Window” button at the top of the page, or the “SQL Query”
hyperlink at the bottom to open the SQL Operations page, as shown in Fig. 6.41.
On the SQL Operations page, we can perform SQL operations, such as query, etc.
Step 2: Write SQL statements. DAS provides SQL prompt function to facilitate
writing SQL statements, as shown in Fig. 6.42.
Step 3: After the execution of SQL statement, you can check the operation result and
execution record at the bottom, as shown in Fig. 6.43.
How to use DAS—import and export.
In the Import and Export page, we can import the existing SQL statements into
the database for execution, and export the database file or SQL result set for saving.
Step 1: Create a new import task. You can import an SQL file or CSV file.
Step 2: Select the file source, either imported locally or from the OBS.
Step 3: Select the database. The imported file will be executed within the
corresponding database, as shown in Fig. 6.44.
Step 4: Create a new export task and select the database file to be exported, or choose
to export the SQL result set, as shown in Fig. 6.45.
How to use DAS - compare the table structures.
In the Structure Scheme page, we can compare the structures of the tables within
the two databases and choose whether to synchronize after the comparison, as shown
in Fig. 6.46.
Step 1: Create a table structure comparison and synchronization task.
Step 2: Select benchmark database and target database.
Step 3: Select the synchronization type.
Step 4: Start the comparison task.
Step 5: Start the synchronization task.
6.3 Client Tools
Client tools are mainly for users to connect, operate and debug databases more
conveniently.
(1) gsql is an interactive database connection tool run by GaussDB (DWS) at the
command line.
(2) Data Studio is a graphical interface tool that allows users to connect to GaussDB
(for MySQL) and debug and execute SQL statements and stored procedures
through Data Studio.
6.3.1 zsql
groupadd dbgrp
useradd -g dbgrp -d /home/omm -m -s /bin/bash omm
passwd omm
The method of integrity check of the zsql client installation package is as follows.
(1) Execute the following command to output the check value of the installation
package.
sha256sum GAUSSDB100-V300R001C00-ZSQL-EULER20SP8-64bit.tar.gz
export PATH=/home/omm/app/bin:$PATH
export LD_LIBRARY_PATH=/home/omm/app/lib:/home/omm/app/add-ons:$LD_LIBRARY_PATH
After completing the preparations for installing zsql, you need to log in to the server where
GaussDB100 is located as the root user. Taking the zsql client user omm as an
example, put the client installation package under the directory “/home/omm”, and
change the owner and group of the installation package.
cd /home/omm
chown omm:dbgrp GAUSSDB100-V300R001C00-ZSQL-EULER20SP8-64bit.tar.gz
Next, make changes to the user group and execute the su command to switch to
the user under which the zsql client is running.
su - omm
cd /home/omm
tar -zxvf GAUSSDB100-V300R001C00-ZSQL-EULER20SP8-64bit.tar.gz
If the database user's password contains the special character $, you must escape it
with the escape character \ when connecting to the database via zsql, otherwise the
login will fail.
cd GAUSSDB100-V300R001C00-ZSQL-EULER20SP8-64bit
Here -U is the user running the zsql client, e.g. omm. -R is the directory where the
zsql client is installed.
After completing the installation of the zsql client, use zsql to connect.
Log in as the database administrator with the following code format.
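A sketch of the likely command form; the administrator user name, password and port are assumptions, and the bracketed options are described below.
zsql sys/sys_password@ip:1888 [-q] [-s log_file] [-w timeout]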
(1) -q: This parameter is used to cancel SSL login authentication, and can
be used together with the -w parameter.
(2) -s: This parameter is used to set the prompt-free mode to execute SQL
statements.
(3) -w: This parameter indicates the waiting timeout when the client connects
to the database, 10s by default; it can be used with the -q parameter. The
value meanings of the waiting timeout are as follows.
• 1: means wait for the server response, no timeout.
• 0: means do not wait for the timeout, and return the result directly.
• n: means wait for n seconds.
After using the -w parameter, when zsql starts to connect to the database, the waiting
timeout is set to the specified value. After starting, the waiting response timeout for
the currently established connection, the waiting response timeout for the new
connection re-established and the query timeout are all specified values; the setting
expires after exiting the zsql process.
When logging in as a normal database user, the following three types of logins are
available.
(1) Interactive Login Method 1.
user is the database user name and user_password is the database user
password. ip:port is the IP address and port number of the host where the
database is located, which is 1888 by default.
In Interactive Login Method 1 there is no CONN command: the connection
information is supplied first and the password is entered when prompted.
Interactive Login Method 2 uses the CONN command, supplying the user name and
password as part of the connection. The Non-Interactive Login Method does not use
CONN either: the user name and password are supplied directly on the zsql command
line. The most commonly used is the Non-Interactive Login Method; you only need to
know about the interactive methods, because they all produce the same result.
Example: User gaussdba logs in locally to the database.
[gaussdba@plat1~]$ zsql
SQL> CONN gaussdba/[email protected]:1611
connected.
// Set the response wait timeout when starting the zsql process
[gaussdba@plat1~]$ zsql gaussdba/[email protected]:1611
-w 20
connected.
// Create the new user jim and grant it the CREATE SESSION permission
SQL> DROP USER IF EXISTS jim;
CREATE USER jim IDENTIFIED BY database_123;
GRANT CREATE SESSION TO jim;
// Switch users; the response wait timeout of the newly established connection is also 20s
CONN jim/[email protected]:1611
connected.
EXIT
When starting the zsql process, the response wait timeout is set to 20s, so the
response timeout for connections established during that session is 20s. After exiting
the zsql process, the setting expires and the waiting response timeout for new
connections returns to the default value.
When connecting to zsql, you can set the parameters to meet your specific
functional requirements. If you set the -s parameter to execute SQL statements in
promptless mode, the results will be output to the specified file instead of being
displayed back on the current screen. This parameter should be placed at the end of
the command.
Example: User hr connects to the database in the silent mode, specifying the
output log name as silent.log.
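A possible command; the password is a placeholder, and the address and port follow the earlier examples.
zsql hr/hr_password@127.0.0.1:1611 -s silent.log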
Multiple normal SQL statements can be entered in the -c parameter, but the
statements need to be separated by a semicolon (;). When entering procedure
statements in the -c parameter, only a single entry is supported, and the procedure
needs to be ended with a slash /.
The sample code is as follows.
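A possible command; the two statements are illustrative and are separated by a semicolon.
zsql gaussdba/database_123@127.0.0.1:1611 -c "SELECT 1 FROM SYS_DUMMY; SELECT 2 FROM SYS_DUMMY;"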
Objects whose names contain $ need the escape character \ added. The maximum length of a single executable SQL statement must not exceed 1 MB.
The -f parameter executes SQL scripts and cannot be used with the -c or -s parameter. Like the -c and -s parameters, the -f parameter is placed at the end of the command.
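A minimal sketch (the script path is a placeholder):
zsql gaussdba/database_123@127.0.0.1:1611 -f /home/gaussdba/test.sql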
Or as follows.
The -a parameter outputs the executed SQL statements. It can be used together with the -f parameter and must be placed in front of it; in that case the SQL statements in the SQL script are both output and executed. If the -a parameter is not set, only the execution results of the statements in the SQL script are output, and the SQL statements themselves are not echoed.
SQL>
123
------------
123
1 rows fetched.
SQL>
Succeed.
The format of the statement to view the database object definition information is
as follows.
Or as follows.
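A hedged sketch of the two forms, using the placeholders defined below (object_name is a table or other object; expression is a query statement):
SQL> DESC object_name
SQL> DESC expression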
The DESC column size shows the derived value (maximum derived value) at the
time of SQL parsing, and the execution returns column data values that do not
exceed that size.
expression is a query statement.
Query the definition information of the table privilege.
SPOOL file_path
Save the execution results to the specified file.
SPOOL OFF
Close the current output file stream.
When the SPOOL file is specified, zsql results are output to a file. The contents of
the file are approximately the same as those displayed on the zsql command line, and
the output is closed only after SPOOL OFF is specified.
If the file specified by the SPOOL command does not exist, zsql will create a file.
If the specified file already exists, zsql appends the execution result to the original
result.
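A hedged sketch of a typical session (the file name matches the cat example below, and the SELECT statements are illustrative):
SQL> SPOOL spool.txt
SQL> SELECT 'This SQL will be output into ./spool.txt' FROM SYS_DUMMY;
SQL> SPOOL OFF
SQL> SELECT 'This SQL will not be output into ./spool.txt' FROM SYS_DUMMY;
SQL> EXIT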
Exit zsql and enter cat spool.txt to view the contents of the spool.txt file, as
follows.
Note that spool.txt does not contain the SELECT 'This SQL will not be output into ./spool.txt' FROM SYS_DUMMY; statement, because the SPOOL OFF statement was executed before this statement.
Logical import IMP and logical export EXP.
(1) Logical import and logical export do not support the export of SYS user data.
(2) During the logical import and logical export of data, you need to have the
corresponding operation permission for the object to be exported.
(3) If FILETYPE=BIN is specified during logical import and logical export, three types of files are exported: metadata files (user-specified files), data files (.D files), and LOB files (.L files).
(4) If there is a file with the same name existing in the directory during logical
import and logical export, the file will be overwritten directly without any
prompt.
(5) When logically exporting data, a metadata file and a subdirectory named data
will be generated under the specified export file path. If no export file path is
specified, a metadata file and a subdirectory named data will be generated under
the current path by default. When FILETYPE=BIN is specified, the generated
subfiles (data file, LOB file) will be placed under the secondary directory data; if
the specified metadata file and the generated subfiles already exist, an error will
be reported.
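As a rough illustration only: the keyword=value style and the TABLES and FILE keywords below are assumptions inferred from the FILETYPE=BIN option mentioned above, and the table and file names are placeholders.
SQL> EXP TABLES=training FILE=training_exp.sql FILETYPE=BIN
SQL> IMP FILE=training_exp.sql FILETYPE=BIN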
Generate the analysis report WSR.
WSR (Workload Statistics Report) is used to generate the performance analysis
report. By default, only SYS users have permission to perform the related operations.
If an ordinary user needs to use it, the SYS user must execute GRANT STATISTICS TO user, which grants the statistics role to the ordinary user. After authorization, the ordinary user has the permissions to create snapshots, delete snapshots, view snapshots, and generate WSR reports, but does not have the permission to change WSR parameters. When an ordinary user performs an operation, he/she needs to carry the SYS name to execute the corresponding stored procedure, such as CALL SYS.WSR$CREATE_SNAPSHOT.
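Based only on the statements mentioned above, the authorization and a snapshot creation look roughly as follows (the user name jim is a placeholder):
// Executed as the SYS user: grant the statistics role to the ordinary user
SQL> GRANT STATISTICS TO jim;
// Executed by the authorized ordinary user, carrying the SYS name
SQL> CALL SYS.WSR$CREATE_SNAPSHOT;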
Other functions include SHOW (query parameter information), SET (set param-
eters), DUMP (export data), LOAD (import data), COL (set column width), WHENEVER (set whether to continue or exit the connection operation when the script runs
abnormally), etc.
6.3.2 gsql
To configure the database server using gsql, log in to any node in the GaussDB (DWS) cluster as the omm user and execute the command source ${BIGDATA_HOME}/mppdb/.mppdbgs_profile to load the environment variables.
Execute the following command to add the IP addresses or host names (separated by commas) of the external service, where NodeName is the current node name and 10.11.12.13 is the IP address of the network card of the server where the CN is located. The connection address of the cluster is its IP address or domain name. The database user is the user name of the cluster database; when you connect to the cluster for the first time using the client, specify the default administrator user set when creating the cluster, such as "dbadmin". The database port is the "database port" set when creating the cluster, as shown in Fig. 6.48.
Usage: gsql can directly send query statements to the database for execution, and
return the execution results.
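A hedged sketch of a typical connection command (the database name, host address, port, user and password are placeholders taken from the description above):
gsql -d postgres -h 10.11.12.13 -p 8000 -U dbadmin -W password -r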
The gsql tool also provides some more useful meta-commands for quick interac-
tion with the database. For example, to quickly view the object definition, the code is
as follows.
postgres=# \d dual
View "pg_catalog.dual"
Column | Type | Modifiers
------–+------+---------–
dummy | text |
Column: field name;
Type: field type;
Modifiers: constraint information.
For more commands, you can use \? to view the usage instructions.
Data Studio is a graphical user interface (GUI) tool that can be used to connect to
GaussDB databases, execute SQL statements, manage stored procedures, and man-
age database objects. Data Studio currently supports most of the basic features of
GaussDB, providing database developers with a user-friendly graphical interface
that simplifies database development and application development tasks, and can
significantly improve the efficiency of building programs.
Let's download, install and run Data Studio.
(1) Download and install Data Studio under Windows.
Download: Log in to the Huawei support website, go to the "Technical Support > Cloud Computing > FusionInsight > FusionInsight Tool" page, and select the corresponding version of Data Studio to download.
Installation: After downloading, unzip the Data Studio installation package.
(2) Set the Data Studio profile (optional).
You can modify the configuration file “Data Studio.ini” to personalize the
Data Studio operating parameters. The modified parameters will take effect after
restarting the Data Studio.
The Data Studio user's manual teaches how to use each parameter.
Fig. 6.49 Connect to GaussDB (for MySQL) database using Data Studio
MySQL Workbench is a GUI tool for designing and creating new database models, building database documentation, and performing complex MySQL migrations. As
a next-generation tool for visual database design and management, it is available in
both open source and commercial versions. This software supports Windows and
Linux operating systems. MySQL Workbench provides database developers with a
user-friendly graphical interface that simplifies database development and applica-
tion development tasks, and can significantly improve the efficiency of building
programs.
It comes with the MySQL Workbench client for Windows-based platforms.
MySQL Workbench download: Log in to the MySQL official website and select MySQL Workbench from the DOWNLOADS option at the bottom of the page; unzip the downloaded client installation package (32-bit or 64-bit) to the path
you need to install (e.g. D:\MySQL Workbench); open the installation directory and
double-click MySQL Workbench.exe (or click the right mouse button and run as
administrator).
Connection to the database using MySQL Workbench: Enter the connection
information in MySQL Workbench.
The main interface of MySQL Workbench includes the navigation bar, the SQL editing window, the query result window, and basic database information, as shown in Fig. 6.51.
The basic functions of MySQL Workbench fall into the following areas. Navigation bar: shows the management functions of the database, such as status check, connection management, user management, and data import and export; provides the entrance to various object management operations, such as starting and stopping instances, querying logs, and viewing operation files; and shows the performance of the database, where reports can be set or generated. SQL editing window: edits, formats, and executes various SQL statements; while SQL statements are being edited, the syntax assistant automatically offers completion suggestions based on user input. Query result window: displays the results returned by query statements, where users can sort, dynamically filter, copy, export, and edit the results. Database overview: shows the existing databases and the basic situation of the objects at each level under them. Database backup: provides enterprise-level online backup and backup recovery functions according to customer demand. Audit check: the search field can be used to filter the displayed operation events by type, such as query-type events; all activities are displayed by default. Custom filters are also available.
6.4 Summary
This chapter introduces tools related to the GaussDB database. The drivers provided by the GaussDB database include JDBC, ODBC, etc., and the connection tools provided include zsql, gsql, and Data Studio.
6.5 Exercises
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-
nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license and indicate if you modified the licensed material.
You do not have permission under this license to share adapted material derived from this chapter or
parts of it.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 7
Database Design Fundamentals
difficult thing. The goals that are too large or too high will result in unachievable
goals, and targets that are too small will be unacceptable to the customer. Therefore,
the goals should be planned reasonably in stages and levels so as to form sustainable
solutions for the construction process, ultimately meeting the needs of the users and
achieving the goals.
In October 1978, database experts from more than 30 countries gathered in New Orleans, USA, to discuss database design methods. They applied the ideas and methods of software engineering to propose a database design specification, the famous New Orleans design methodology, which is now recognized as a relatively complete and authoritative database design specification method. The
New Orleans design methodology divides the database design into four phases, as
shown in Fig. 7.1.
These four phases are requirement analysis, conceptual design, logical design,
and physical design. The requirement analysis phase mainly analyzes user require-
ments and produces requirement statements; the conceptual design phase mainly
analyzes and defines information and produces conceptual models; the logical
design phase mainly designs based on entity connections and produces logical
models; and the physical design phase mainly designs physical structures based on
physical characteristics of database products and produces physical models.
In addition to the New Orleans design methodology, there are also database
design methods based on E-R diagrams, and design methods based on the 3NF.
They are all specific techniques and methods used in different phases of database
design, which will be described in detail in later chapters.
In real life, a building without a good foundation will lean. Experience has proven that poor requirement analysis can directly lead to incorrect design. If many
Fig. 7.1 The four phases in the New Orleans design methodology
problems are not discovered until the system testing stage and then go back to correct
them, it will be costly, so the requirement analysis stage must be given high priority.
The requirement analysis phase mainly collects information and analyzes and
organizes it to provide sufficient information for the subsequent phases. This stage is
the most difficult and time-consuming stage, but is also the basis of the whole
database design. If the requirement analysis is not done well, the whole database
design may be reworked.
The following points should be done in the requirement analysis phase.
(1) Understand the operation of the existing system, such as the service carried by
the existing system, the service process and the deficiencies.
(2) Determine the functional requirements of the new system, that is, to understand
the end-user's ideas, functional requirements and the desired results.
(3) Collect the basic data and related service processes that can achieve the objec-
tives, so as to prepare for a better understanding of service processes and user
requirements.
The main task of the requirement analysis phase is first to investigate user service
behaviors and processes, then to conduct system research, collect and analyze
requirements, determine the scope of system development, and finally prepare a
requirement analysis report.
The investigation of user service behaviors and processes requires understanding the users' expectations and goals for the new system and the main problems of the existing system. In the stage of system research and requirement collection and analysis, the scope of system development is determined, with the main tasks divided into the following three parts.
(1) Information research. It is necessary to determine all the information to be used
in the designed database system and to clarify the sources, methods, data formats
and contents of the information. The main goal of the requirement analysis phase
is to clarify what data is to be stored in the designed database, what data needs to
be processed, and what data needs to be used for the next system.
(2) Processing requirements. Translate the user's service functional requirements
into a requirement statement that defines the functional points of the database
system to be designed. That is, convert the requirements described by users in
service language into design requirements that can be understood by computer
systems or developers; it is necessary to describe the operational functions of
data processing, the sequence of operations, the frequency and occasion of
execution of operations, and the connection between operations and data, as
well as to specify the response time and processing methods required by users.
These contents form the necessary part of the user requirement specification.
(3) Understand and record user requirements in terms of security and integrity. In
the stage of writing requirement analysis report, it needs to go through the
process of system research, collection and processing, and generally the output
product in this stage is the requirement analysis report, including user require-
ment specification and data dictionary. The data dictionary here is a summary
document of the data items and data of the existing services, not the data
dictionary inside the database product.
The focus of requirement analysis is to sort out the "information flow" and "service
flow" of users. The "service flow" refers to the current status of the service, including
service policies, organization, service processes, etc. The "information flow" refers
to the data flow, including the source, flow and focus of data, the process and
frequency of data generation and modification, and the relation between data and
service processing. External requirements should be clarified during the requirement
analysis phase, including but not limited to data confidentiality requirements, query
response time requirements, output report requirements, etc.
According to the actual situation and the possible support from users, the
requirement investigation can be done by a combination of means, for example,
viewing the design documents and reports of existing systems, talking with service
personnel, and questionnaire surveys. If conditions permit, sample data from
existing service systems should also be collected as part of the design process to
verify some service rules and understand the quality of data.
During the requirement analysis process, do not make assumptions or guesses about
the user's ideas. Always check with the user for assumptions or unclear areas.
The data dictionary is the result obtained after the introduction of requirement
analysis, data collection and data analysis. Unlike the data dictionary in the database,
the data dictionary here mainly refers to the description of the data, not the data itself,
and includes the following contents.
(1) Data items: They mainly include the data item name, meaning, data type, length,
value range, unit and logical relation with other data items, which are the basis of
model optimization in logic design stage.
(2) Data structure: Data structure reflects the combination relation between data
items, and a data structure can be composed of several data items and data
structures.
(3) Data flow: The data dictionary is required to represent the data flow, that is, the
transmission path of data in the system, including data source, flow direction,
average flow, peak flow, etc.
(4) Data storage: This includes data access frequency, retention time duration, and
data access methods.
(5) Processing process: This includes the function of the data processing process and
processing requirements. Function refers to what the processing process is used
to do, and the requirements include how many transactions are processed per unit of time, how much data volume is involved, time response requirements, etc.
There is no fixed document specification for the format of the data dictionary; in practice, it can follow the content items above and be reflected in different descriptive documents or in the model file. The data dictionary is therefore a concept at the abstract level, a collection of documents. In the requirement analysis phase, the most important output is the user requirement specification, where the data dictionary often exists as an annex or appendix to provide a reference for the model designers in their subsequent work.
The task of the conceptual design phase is to analyze the requirements proposed by
the users, synthesize, summarize and abstract the user requirements, and form a
conceptual-level abstract model independent of the concrete DBMS, i.e., the con-
ceptual data model (hereinafter referred to as the conceptual model). The conceptual model is a high-level abstract model that is independent of any specific database product and is not bound by the characteristics or physical attributes of any database product.
The conceptual model has the following four main features.
(1) It can truly and fully reflect the real world, including the connection between
things and things, as a real model of the real world.
(2) It is easy to understand, enabling discussion with users who are not familiar with
the database.
(3) It is easy to change, when the application environment and application require-
ments change, the conceptual model can be modified and expanded.
(4) It is easy to convert to a relational data model.
The latter two are the basic conditions for the smooth progress of the next stage
of work.
In practice, the conceptual model does not have to be designed in detail down to the attribute level; it can be designed to the entity level. Planning out all the attributes in detail increases the workload of the conceptual model. The E-R diagram of the conceptual model should delineate the linkages between entities and express them clearly for the practical application project, so it is sufficient for the general conceptual model to reach the level that reflects the linkages between entities.
The linkages within and between entities are usually represented by diamond-
shaped boxes. In most cases, the data model is concerned with the linkages between
entities. The linkages between entities are usually divided into three categories.
(1) One-to-one linkage (1:1): Each instance in entity A has at most one instance
linked to it in entity B, and vice versa. For example, a class has a Class Advisor,
this linkage is recorded in the form of 1:1.
(2) One-to-many linkage (1:n): Each instance in entity A has n instances linked to it
in entity B, while each instance in entity B has at most 1 instance linked to it in
entity A, which is recorded as 1:n. For example, there are n students in a class.
(3) Many-to-many linkage (m:n): Each instance in entity A has n instances linked to
it in entity B, while each instance in entity B has m instances linked to it in entity
A, which is recorded as m:n. Take for example the linkage between students and
elective courses. A student can take more than one course, and a course can be
taken by more than one student.
Simply put, conceptual design is the conversion of realistic conceptual abstractions
and linkages into the form of an E-R diagram, as shown in Fig. 7.4.
Logical design is the process of converting a conceptual model into a concrete data
model. According to the basic E-R diagram established in the conceptual design
phase, the selected target data model (hierarchical, mesh, relational, or object-
oriented) is converted into the corresponding logical-layer target data model, and
what is obtained is the logical data model (hereinafter referred to as logical model).
For relational databases, this conversion has to conform to the principles of the
relational data model.
The most important work in the logical design phase is to determine the attributes
and primary keys of the logical model. The primary key is the keyword that uniquely identifies a record in the table, also known as a code. A primary key can consist of a single
field or multiple fields. The more common way of logical design work is to use E-R
design tool and IDEF1X method for logical model building. Commonly used E-R
diagram representations include IDEF1X, Crow's Foot for IE models, Unified
Modeling Language (UML) class diagrams, etc.
The logical model of this book adopts the IDEF1X (Integration DEFinition for
Information Modeling) method. IDEF, which stands for Integration DEFinition
method, was established in the US Air Force ICAM (Integrated Computer Aided
Manufacturing) project, and three methods were initially developed - functional
modeling (IDEF0), information modeling (IDEF1), and dynamic modeling (IDEF2).
Later, as information systems were developed one after another, IDEF cluster
methods were introduced, such as data modeling method (IDEF1X), process
description acquisition method (IDEF3), object-oriented design method (IDEF4),
OO design method using C++ (IDEF4C++), entity description acquisition method
(IDEF5), design theory acquisition method (IDEF6), and Human-system interaction
design method (IDEF8), service constraint discovery method (IDEF9), network
design method (IDEF14), etc. IDEF1X is an extended version of IDEF1 in the
IDEF family of methods, which adds some rules to the E-R method to make the
semantics richer.
The IDEF1X method has several features when used for logic modeling.
(1) It supports the semantic structure necessary for the development of conceptual
and logical models, and has good scalability.
(2) It has concise and consistent structure in semantic concept representation.
(3) It is easy to understand, enabling service personnel, IT technicians, database
administrators and designers to communicate based on the same language.
According to the characteristics of entities, they can be divided into two categories.
(1) Independent entity, which is usually represented by a rectangular box with right-angle corners. An independent entity is an entity that exists independently and does not depend on other entities.
(2) Dependent entity, which is usually represented by a rectangular box with round
corners. Dependent entities must depend on other entities, and the primary key in
a dependent entity must be part or all of the primary key of an independent
entity.
The primary key of the independent entity will appear in and become part of the
primary key of the dependent entity, as shown in Fig. 7.5, where the chapter entity
depends on the book entity. For example, many books have a Chapter 2. If the book ID is not part of the primary key to distinguish the Chapter 2 of different books, only one record of Chapter 2 can appear in the chapter entity. But in fact, the title, page number and word count of Chapter 2 differ from book to book, so the chapter entity must depend on the book entity in order to work.
Attributes are the characteristics of the entity, containing the following types to be
noted.
(1) Primary key. The primary key is an attribute or group of attributes that identifies
the uniqueness of an entity instance. For example, the name of a student entity
cannot be used as a primary key because there may be cases of duplication of
name. The school number or ID number can be used as an attribute that uniquely
identifies the student, i.e., it can be used as a primary key.
(2) Optional key. Other attributes or groups of attributes that can also identify an entity instance.
(3) Foreign key. Two entities are linked, and the foreign key of one entity is the
primary key of the other entity. You can also call the primary key entity the
parent entity and the entity with the foreign key the child entity.
(4) Non-key attribute. An attribute inside an entity other than the primary key and foreign key attributes.
(5) Derived attribute. It is a field that can be counted or derived from other fields.
The primary key of the book entity shown in Fig. 7.6 is the book ID, while other
attributes are non-key attributes. The primary key of the chapter is the book ID plus
the chapter number, while other attributes are non-key attributes. The book ID in the
chapter entity is a foreign key.
How to distinguish the relation between primary key, foreign key and index? A primary key uniquely identifies an instance; it has no duplicate values, is non-null, and should not be updated. Its role is to determine the uniqueness of a record and ensure data integrity, so an entity can have only one primary key.
A foreign key is generally the primary key of another entity; it can be duplicated or null in this entity, and its role is to establish data reference consistency and the relation between two entities, so an entity can have more than one foreign key. For example, attribute A is a foreign key in table X, and it is duplicable in table
Table 7.1 Relation between primary key, foreign key and index
Characteristic: A primary key uniquely identifies an instance, has no duplicate values, is non-null, and should not be updated. A foreign key is the primary key of another entity and can be duplicate and null. A unique index is an object built on a table, with no duplicate values but possibly a null value. A non-unique index is an object built on a table that can be null and can have duplicate values.
Role: The primary key determines the uniqueness of records and ensures data integrity. The foreign key establishes data reference consistency and the relation between two entities. Unique and non-unique indexes improve query efficiency.
Quantity: An entity can have only one primary key and can have multiple foreign keys. A table can have multiple unique indexes and multiple non-unique indexes.
Primary keys and foreign keys are logical concepts in the logical model, while indexes
are physical objects. Many databases can create primary keys when building a table, in which case the primary key attributes are implemented as a unique, non-null index.
After determining the entities and important attributes, you also need to under-
stand the relations between the entities. Relations are used to describe how entities
are related to each other. For example, if a book "includes" several chapters,
"includes" is the relation between these two entities. The relation is directional.
The book "includes" the chapter rather than the chapter "includes" the book, so the
relation between the chapter and the book is "belonging to".
Cardinality is a service rule that reflects the relation between two or more entities,
and the relation cardinality is used to express the concept of "linkage" in the E-R
method.
Figure 7.7 illustrates cardinality in IDEF1X. Understanding the meanings of the labels helps to quickly clarify the relation between entities when you look at the model structure. From left to right, the first symbol represents a one-to-many relation where the cardinality of the many party is 0, 1, or n. The P symbol represents a one-to-many relation where the cardinality of the many party is 1 or n. The difference
The significance of cardinality is that it reflects the service rules of the relation, as shown in Fig. 7.8. First of all, both the left and right sides are "including" relations, and the left end of the relation is 1:1, which means that a chapter must belong to a book, and belongs to one and only one book. In the example on the left, the values 0 to n express the optional requirement that a book may contain one or more chapters; a cardinality equal to 0 means that a book is not divided into chapters. In practice, when the cardinality is equal to 0, null values may appear when the two tables are associated with each other. The example on the right takes the values 1 to n, a mandatory form of expression: the cardinality cannot be 0, which means that a book must contain one or more chapters.
Identifying relation occurs between independent and dependent entities, where
the instance unique identification of a child entity is associated with the parent entity
and the primary key attribute of the parent entity becomes one of the primary key
attributes of the child entity. The primary key book ID of the parent entity book
shown in Fig. 7.6 becomes the primary key attribute component of the chapter.
Non-identifying relation means that the child entity does not need the relation
with the parent entity to determine the uniqueness of the instance. At this point the
two entities are independent entities with no dependencies. In Fig. 7.6, if the chapter
entity does not depend on the book entity and becomes independent, then each
chapter number can only have one record, and the same chapters of different books
will cover each other, and there is a problem with this design. In this case, the
solution is to modify the non-identifying relation into an identifying relation. It can
be summarized as follows: according to whether the parent entity and child entity
have a foreign key relation, if there is a foreign key, it is a child entity; if there is a
primary key, it is the parent entity. The location of the foreign key determines
whether the parent entity and the child entity are of identifying or non-identifying
relation. If the foreign key appears in the primary key of the child entity, it is an
identifying relation; if the foreign key appears in the non-key attribute of the child
entity, it is a non-identifying relation.
Recursive relation means that the parent entity and the child entity are the same
entity, forming a recursive or nested relation, and the primary key of the entity also
becomes its own foreign key. A recursive relation occurs when the entities them-
selves form a hierarchical relation. In practical applications, such entities of recursive
relation are very common. For example, the organization structure includes superior
departments and subordinate departments. One department may have one or more
subordinate departments, the lowest department has no subordinate department, and
the top department has no superior department, as shown in Fig. 7.9.
Subtype relation is the relation between a subclass entity and the parent entity to
which it belongs. There are two types of subtype relation. One is complete subtype
relation, also called complete classification, where each instance of the parent entity
to which it belongs can be associated with an instance of the entity in the subtype
group, and all instances can be found in the classification case, with no exception.
The other is incomplete subtype, also called incomplete classification, where each
instance of the parent entity is not necessarily associated with an entity instance in
the subclass group, and only some instances can be classified in the subclass, and
some instances cannot be classified or do not need to care about the classification.
Remember that in practice you should not force a catch-all "others" subclass just to pursue complete classification, as this will bring uncertainty to future service development.
The logic model is summarized as follows.
(1) Entity is the metadata that describes the service.
(2) The primary key is an attribute or group of attributes that identifies the unique-
ness of an entity instance.
(3) Relations exist between entities only if there are foreign keys, and no relation can
be established without foreign keys.
(4) The cardinalities of the relations reflect the service rules between the relations.
The logic model is as follows.
• A customer can have only one type of savings account.
• A customer can have more than one type of savings account.
• An order can correspond to only one shipping order.
• A product includes multiple parts.
7.4.4 NF Theory
According to the specific service requirements, database design needs to make clear
how to construct a database design pattern that meets the requirements, and how
many entities need to be generated, which attributes these entities are composed of,
and what is the relation between entities. To be precise, these are the questions that
need to be addressed in the logical design stage of relational database. The relational
model is based on strict mathematical theory, so designing the relational model
based on the normalization theory of relational database can construct a reasonable
relational model. In the database logic design phase, the process of placing attributes
in the correct entity is called normalization. Different NFs satisfy different levels of
requirements.
Between 1971 and 1972, Dr. E.F. Codd systematically proposed the concept of
1NF to 3NF, which fully discussed the model normalization issues. Later, other researchers deepened the theory and proposed higher-level NF standards, but for relational databases it is sufficient to achieve the 3NF in practical applications.
The relational data model designed by following the normalization theory has the
following implications.
1. It can avoid the generation of redundant data.
2. The risk of data inconsistency can be reduced.
3. The model has good scalability.
4. It can be flexibly adjusted to reflect changing service rules.
In contrast to the normalization performed when checking the logical model, denormalization is often applied when the physical model is built, i.e., some normalization rules are deliberately violated, strengthening physical attributes to improve performance when the database is used.
When determining entity attributes, the question often faced is: which attributes
belong to the corresponding entities? This is the question to be addressed by the NF
theory. For example, there will be a lot of business dealings between banks and
individuals, and the same person may be engaged in business such as saving,
spending on credit cards, buying financial products for investment and financial
management, and buying cars and houses with loans. For banks, different services
are carried out by different departments and service systems. For example, if you
spend with credit cards, you have a credit card (credit card number) and a customer
number in the credit card system; if you handle financial management, you open a
financial account; if you make a deposit, you open a savings account. The individual
a bank faces is a person. When building a model, how do you group individuals into
a single customer entity? Do you create three entities or use one entity when
counting a customer's assets? For customers who do not have a loan relation with
the bank, if there is a loan relation in the future, what should the current model
consider in advance for this change? These are all questions that need to be
addressed in the logical design, and the theoretical basis for this is the NF model.
The one that satisfies the minimum requirements is called the first NF (1NF), the
one that further satisfies the requirements based on the 1NF is the second NF (2NF),
and so on. A low-level NF relation pattern can be transformed into a collection of
several higher-level NF relation patterns by schema decomposition. This process is
called normalization, as shown in Fig. 7.10.
Domain is the set of legal values of an attribute, which defines the valid range of
values of the attribute. The values inside the domain are legal data. The domain
reflects the relevant rules.
For example, the domain of the employee ID shown in Fig. 7.11 is an integer greater than 0, so values such as 0 and -10 are data outside the domain. As another example, if cell phone numbers are 11-digit integers, 12345678910 is legal data; however, if we consider the actual situation, it cannot be legal data because different operators have different number segments.
A relation (table or entity) conforms to the 1NF if and only if each attribute contains only atomic values (values that cannot be split further), and the value of each attribute can only be a single value from the value range (not a subset).
Table 7.4 Customer information table (3)
Customer ID (PK) Name Age Phone number
123 XXX 30 555-666-1234
123 XXX 30 333-888-5678
456 YYY 40 555-777-8080 ext. 43
456 YYY 40 155-0099-9900
789 ZZZ 50 777-808-9234
adapt to new situations, which will lead to instability of the model structure, that
is, business development brings instability impact to the model.
(3) Ambiguity arises when using data. Which number should be placed first? Which
number should be put in the second place? What are the rules? Which telephone
number shall prevail when obtaining contact information of customers? All of
the above questions can lead to semantic confusion and ambiguity in the use of
data for service.
To solve the above problems, the solution is to turn the duplicate group into a high
table and put the phone number in the same attribute. This is in line with the 1NF, as
shown in Table 7.4.
Atomicity means indivisibility. But to what degree should data be split? Many people are prone to misunderstand the concept of atomicity in practical applications. Generally speaking, codes with coding rules are actually composite codes, which are
divisible in terms of rules. For example, ID numbers and cell phone numbers can
both be further split into data of smaller granularity, such as birth year and gender.
However, from the field perspective, the field of ID number is legal as long as it
conforms to the coding rules, i.e., it is atomic data and does not need further splitting.
The 2NF means that each table must have a primary key, with other data elements
corresponding to the primary key one by one. This relation is often referred to as
functional dependence, where all other data elements in the table depend on the
primary key, or the data element is uniquely identified by the primary key. The 2NF
emphasizes full functional dependence, which simply put, all non-primary key fields
are dependent on the primary key as a whole, not some of them.
There are two necessary conditions to satisfy the 2NF: firstly, the 1NF should be
satisfied; secondly, every non-primary attribute is fully functionally dependent on
any of the candidate keys. It can be simply understood that all non-primary key fields
depend on the whole primary key, not a part of it. What is shown in Table 7.5 does
not satisfy the 2NF because the order date depends only on the order number and has
nothing to do with the part number. So the table will have a lot of redundant data as
the order number is repeated.
A simple tip: if an entity has only one primary key field, then it basically satisfies the 2NF.
Modify Table 7.5 by taking the order date and the order number it depends on to form another entity with the order number as the primary key; then both entities satisfy the 2NF. This is normalization, where a lower-level NF relation pattern is converted into a collection of several higher-level NF relation patterns through schema decomposition, as shown in Tables 7.6 and 7.7.
The 3NF requires that all non-primary-key fields depend on the whole primary key and not on other non-primary-key attributes. There are two necessary conditions to satisfy the 3NF: firstly, the 2NF should be satisfied; secondly, every non-primary attribute is not transitively dependent on the primary key. That is to say, under the 3NF every non-primary-key field depends on the whole primary key rather than on a non-primary-key attribute. The customer name shown in Table 7.8 depends on the non-primary-key attribute customer ID, so the 3NF is not satisfied.
The 3NF mainly constrains field redundancy: the table cannot contain derived fields. If there are redundant fields in the table, update efficiency will be reduced when updating data because of the redundant data, which easily leads to inconsistent data. The solution is to split the table into two tables and form a primary-foreign key relation, as shown in Tables 7.9 and 7.10.
Table 7.9 Order table
Order number (PK) Order date Customer ID (FK)
1000 2010-08-01 1230008
2000 2010-11-15 1290004
3000 2010-09-30 1280003
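To make the split concrete, a minimal SQL sketch of the two tables and their primary-foreign key relation might look as follows (the table and field names are illustrative rather than taken from the book's physical design):
CREATE TABLE customer (
  customer_id   INTEGER PRIMARY KEY,
  customer_name VARCHAR(60)
);
CREATE TABLE order_t (
  order_number  INTEGER PRIMARY KEY,
  order_date    DATE,
  customer_id   INTEGER REFERENCES customer(customer_id)  -- foreign key to the customer table
);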
In 1970, Dr. E.F. Codd, an IBM researcher, published a paper that introduced the
concept of relational model and laid the theoretical foundation of relational model.
After publishing this paper, he defined the concept of the 1NF, 2NF and 3NF in the
early 1970s. In practical applications, it is sufficient for a relational model to satisfy
the 3NF.
The KEY—1st Normal Form (1NF)
The WHOLE Key—2nd Normal Form (2NF)
AND NOTHING BUT the Key—3rd Normal Form (3NF)—E.F. Codd
Database design now satisfies at most the 3NF. It is generally believed that although higher NFs constrain data relations better, they also make database I/O busier because of the increased number of relation tables, so in real projects there are basically no cases that go beyond the 3NF.
When designing a logic model, some principle issues should be noted. The first is the
establishment of naming rules. Similar to other language development, it is advisable
to establish naming rules and follow them during logical modeling. The main
purpose of establishing naming rules is to unify ideas, facilitate communication
and achieve standardized development. For example, with unified naming, the amount field is named amount, abbreviated as amt, and its corresponding physical type is DECIMAL(9,2); this field needs to be accurate to two decimal places when calculating. However, if the naming is inconsistent, for example, some people define the customer ID as cid and some define it as customer_id, it is easy to question whether the two attributes belong to the same object, which makes different roles have different understandings of the same model.
The naming suggestions for entities and attributes are as follows.
(1) Entity name: capitalize the type domain + entity descriptor (full name, initial
capitalization).
(2) Attribute name: use full names with initial capitalization, and some conventional
abbreviations are provided after the spaces.
(3) Avoid mixing English and Chinese Pinyin.
(4) If abbreviations are used, they must be abbreviations of English words; avoid using acronym abbreviations of Pinyin.
Also pay attention to designing the logic model according to the design process,
determining the entities and attributes, for example, defining the entity's primary key
(PK), defining some non-primary key attributes (Non-Key Attribute), defining
non-unique attribute groups and adding the corresponding comment content.
Finally, it is necessary to determine the relation between entities, e.g., use foreign
keys to determine whether the relation between entities is identifiable and determine
whether the cardinality of the relation is of 1:1, 1:n or n:m. When adding non-key
attributes of entities, it is important to consider whether the added attributes conform
to the design of the 3NF according to the rules of the 3NF. If the added attributes
violate the 3NF, entity splitting is required to determine the relation between the new
entity and the original entity. The content of the annotation is generally a literal
description of the service meaning, code value, etc.
Physical design is the adjustment of the physical attributes of the model based on the
logical model in order to optimize the database performance and improve the
efficiency of service operation and application efficiency. The physical design
should be adjusted in conjunction with the physical attributes of the target database
product, with the ultimate goal of generating a deployable DDL for the target
database.
The main contents include but are not limited to the following.
(1) Non-regularized processing of entities.
(2) Physical naming of tables and fields.
(3) Determining the type of fields, including attributes such as length, precision, and
case sensitivity.
(4) Adding physical objects that do not exist in the logical model, such as indexes,
constraints, partitions, etc.
Table 7.11 shows the designations of the same concept at different stages. For
example, relations in relational theory are called entities in the logical model and
tables in the physical model. A tuple in relational theory is an instance in the logical
model and a row in the physical model. Attributes in relational theory are called
attributes in the logical model and fields of a table in the physical model.
In the comparison between the logical and physical models shown in Table 7.12, the logical model contains entities and attributes, which correspond to tables and fields in the physical model. As for the key values, the physical model
generally does not use primary keys, but more often uses unique constraints and
not-null constraints to achieve this. Because the data quality requirement is too high
if primary key constraint is used, the constraint requirement is generally reduced in
the physical implementation, and the primary key is mainly reflected in the logical
concept. In terms of name definition, the logical model is named according to the
service rules and the naming convention of real-world objects, while the physical
model needs to consider the limitations of database products, such as no illegal
characters, no database keywords, and no over-length. In terms of regularization, the
logical model design should try to meet the 3NF and be regularized; the physical
model pursues high performance and may have to be denormalized, which is
non-regularized processing.
Table 7.13 Order table
Order number (PK) Order date Customer ID (FK)
1000 2010-08-01 1230008
2000 2010-11-15 1290004
3000 2010-09-30 1280003
The complexity of SQL can be reduced by adding redundant columns and using
duplicate groups, as shown in Tables 7.16 and 7.17. This example is a conversion
from a high table above to a wide table below, a means often used in the front-end
report query process, which is more suitable for fixed class reports with style
requirements determined in advance.
Tables 7.18 and 7.19 show the reduction of function calculation by adding
derived columns, which is a very common application scenario. For example,
extracting customer age information from ID card numbers; classifying users into
VIP customers, platinum customers, ordinary customers, etc. based on their spending amounts; and flagging suspicious transactions and suspicious accounts after they are judged in the AML system. This method is generally used in customer relation management projects. In Table 7.19, users are divided into different groups by age, including elderly, middle-aged and young.
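As an illustration of such a derived column, a hedged SQL sketch of the age-group example might look like this (the table and field names, as well as the age thresholds, are assumptions):
SELECT customer_id,
       age,
       CASE
         WHEN age >= 60 THEN 'elderly'
         WHEN age >= 40 THEN 'middle-aged'
         ELSE 'young'
       END AS age_group   -- derived field that would be stored redundantly after denormalization
FROM customer;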
Denormalization is commonly handled by the following means.
(1) Adding duplicate groups.
(2) Performing pre-linkage.
(3) Adding derived fields.
(4) Creating summary tables or temporary tables.
(5) Horizontally or vertically splitting tables.
The negative impact of denormalization is relatively large for OLAP systems, but is
more common for OLTP systems, and is generally used to improve the system's high
concurrency performance for scenarios that require a large number of transactions.
The impact of denormalization needs to be considered more in OLAP systems for the
following reasons.
(1) Denormalization does not bring performance improvement to all processing
processes, and the negative impact needs to be balanced.
(2) Denormalization may sacrifice the flexibility of data models.
(3) Denormalization poses the risk of data inconsistency.
(3) Use triggers. The trigger has good real-time processing effect. After the appli-
cation updates the data of Table A, the database will automatically trigger the
update of Table B, but the cost of using the trigger is that it will cause pressure on
the database. The use of triggers in the actual application means a significant
negative impact on performance, so there are fewer and fewer scenarios in which it is used.
The table-level physicalization operations listed here are only part of the work, not
covering all table-level physicalization work.
There are several methods for table physicalization as follows.
(1) Perform the denormalization operation using the methods described earlier.
(2) Decide whether to perform partitioning. Partitioning large tables can reduce the
amount of I/O scanning workload and narrow the scope of queries. But the
granularity of partitioning is not the finer the better. For example, if you only
query monthly summary or conduct monthly query, you only need to partition
by month instead of by day.
(3) Decide whether to split the history table and the current table. History table is
some cold data with low frequency of use, for which you can use low-speed
storage; and current table is the hot data with high frequency of query, for which
you can use high-speed storage. History tables can also use compression to
reduce the storage space occupied.
For field-level physicalization efforts, first try to use data types with short fields.
Data types with shorter lengths not only reduce the size of data files and improve I/O
performance, but also reduce memory consumption during related calculations and
improve computational performance. For example, for integer data, try not to use
INT if you can use SMALLINT, and try not to use BIGINT if you can use INT. The
second is to use consistent data types, trying to use the same data type for table
linkage operations. Otherwise, the database must dynamically convert them into the
same data type for comparison, which will bring some performance overhead. The
last is the use of efficient data. Generally speaking, integer data operations (including =, >, <, >=, <=, <> and other conventional comparison operations, as well as GROUP BY) are more efficient than operations on strings and floating-point numbers.
The premise of using efficient data is that the data type must meet the service
requirements of the value field. For example, the service context is the amount field
with decimals, then you cannot force the use of integers in pursuit of high efficiency.
Thinking: A certain identification-class field takes the values 0 and 1. If you want to set a data type for this field, which one is appropriate?
(6) Create indexes on columns that are often used in WHERE clauses to speed up condition evaluation.
The scenarios above allow the use of indexes, but indexes are not mandatory; whether an index is actually used after it is created is decided by the database system itself. However, creating more indexes has negative effects, such as the need for more index space; and when inserting data into the base table, the efficiency of the insertion operation is reduced because the index data must be inserted at the same time. Therefore, invalid indexes should be deleted in time to avoid wasting space.
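A minimal sketch of creating such an index and removing it when it proves useless (the table, column and index names are illustrative):
CREATE INDEX idx_order_date ON order_t (order_date);
DROP INDEX idx_order_date;   -- delete the index if it turns out to be invalid or unused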
Other physicalization means are judged to be used according to the situation, for
example, whether to further compress the data, whether to encrypt or desensitize the
data, etc.
During the physical design process, we typically use modeling software for both
logical and physical modeling. Automation software delivers many benefits, such as
forward generation of DDL, reverse analysis of database, and comprehensive satis-
faction of various requirements in modeling, so that efficient modeling can be
carried out.
Advantages of using modeling software for logical modeling and physical
modeling are as follows.
(1) Powerful and rich.
(2) Forward DDL generation and reverse analysis of database.
(3) Free switch of views between logical model and physical model.
(4) Comprehensive satisfaction of various requirements in modeling for efficient
modeling.
The following are some of the commonly used modeling software.
(1) ERwin's full name is ERwin Data Modeler, a data modeling tool from CA,
which supports all major database systems.
(2) PowerDesigner is SAP's enterprise modeling and design solution that uses a
model-driven approach to integrate service and IT, helps deploy effective
enterprise architecture, and provides powerful analysis and design techniques
for R&D lifecycle management. PowerDesigner uniquely integrates multiple
standard data modeling techniques (UML, service process modeling, and
market-leading data modeling) with leading development platforms such as .NET, WorkSpace, PowerBuilder, Java, Eclipse, etc., providing business
analysis and standardized database design solutions for traditional software
development cycle management.
(3) ER/Studio is a set of model-driven data structure management and database
design products that help companies discover, reuse and document data assets. It
empowers data structures with the ability to fully analyze existing data sources
The products that should be output during the physical model design phase include
the following:
(1) A physical data model, usually an engineering file for some automated modeling
software;
(2) Physical model naming convention, which is a standard convention that every-
one in the project should follow;
(3) Design specification of the physical data model;
(4) DDL table building statements for the target database.
The entities and attributes that can be proposed in the order shown in Fig. 7.13 are
order number, order date, customer ID, customer name, contact information, ID
number, customer address, part number, part description, part unit price, part
quantity, part total price, and order total price.
If this information is generated directly into an entity where the design result is a
table that needs to cover all the information, then the part number, part description,
part unit price, part quantity, and part total price are called duplicate attribute groups
that have to appear repeatedly in the entity, as shown in Fig. 7.14. For example,
including Part Number 1, Part Description 1, Part Unit Price 1, Part Quantity 1, Part
Number 2, Part Description 2, Part Unit Price 2, etc. This situation does not satisfy
the 1NF.
For the duplicate group problem of part information, the relevant information of
the part is extracted to form a separate entity with several parts for each order, then
the primary key of the new entity is the order number plus the part number, as shown
in Fig. 7.15.
Thinking: After eliminating the duplicate groups, which NF does the model now conform to?
There are still partial dependencies on the part information in the current
model, so normalization should be continued to resolve them.
Extract the information that depends only on the part number to form a new
entity, the part entity, as shown in Fig. 7.16.
Thinking: After eliminating the partial dependencies, which NF does the model now conform to?
Table 7.21 Order table

Order number (PK)    Order date    Customer ID (FK)
1000                 2010-08-01    123008
2000                 2010-11-15    129004
3000                 2010-09-30    128003
The problem with the current model is that the customer information depends on
the customer ID, and the customer ID depends on the order number. Such a dependency
is transitive rather than direct, so a conversion from the
2NF to the 3NF has to be carried out to eliminate this transitive dependency.
Eliminating the transitive dependency means extracting the customer information
into a separate entity, the customer entity, as shown in Fig. 7.17.
Thinking: After eliminating the transitive dependency, which NF does the model now conform to?
At this point, the logical model is essentially complete. However, note that the
order total price and the part total price are derived fields, which strictly speaking
do not meet the requirements of the 3NF, so they should be removed.
After the normalization process is completed, the entities of the 3NF model are
obtained, and the primary keys and foreign keys are marked in the two-dimensional
tables. The resulting 3NF model is shown in Tables 7.21 and 7.22.
Since the part total price attribute has been removed from the order-part table, if
you now want to obtain the part total price, you need to compute it from the
order-part table by multiplying the part quantity by the sale price. The pseudo SQL
code is as follows.
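The following is one possible rendering of that pseudo code. It is only a sketch: the table name Order_Detail and the column names follow Table 7.28 and are assumptions rather than the book's exact identifiers.

-- Part total price per order line: part quantity times sale price
-- (Order_Detail, Item_Quantity and Sale_Price are assumed names).
SELECT Order_Num,
       Item_Id,
       Item_Quantity * Sale_Price AS Item_Total
FROM Order_Detail;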
If you now want to get the order total price, the pseudo SQL code is as follows.
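Again as a sketch under the same naming assumptions, the order total price can be obtained by summing the per-part totals of each order.

-- Order total price: sum of the part totals of each order.
SELECT Order_Num,
       SUM(Item_Quantity * Sale_Price) AS Total_Price
FROM Order_Detail
GROUP BY Order_Num;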
After completing the logical model design, the physical model design begins. First,
name the tables and fields according to an agreed convention, avoid using database
keywords, and settle on case conventions; then determine the data type of each
field and, for character fields, set the defined length according to the upper limit of
the possible values of the actual data; after that, determine whether each field needs a
non-null constraint, a unique constraint, or other constraints, as shown in Tables 7.23, 7.24, 7.25, and 7.26.
The samples in the above tables are only examples; you can adjust them to the
actual situation in practice.
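As an illustration only, the kind of DDL such a physical design produces might look as follows for the order table of Table 7.21; the identifier names, data types, and lengths are assumptions and should be adjusted to your own naming convention and data analysis.

-- Hypothetical order table DDL; names and types are assumptions.
CREATE TABLE Orders (
    Order_Num  INTEGER NOT NULL,  -- order number, primary key
    Order_Date DATE    NOT NULL,  -- order date
    Cust_Id    INTEGER NOT NULL,  -- customer ID, foreign key to the customer table
    CONSTRAINT pk_orders PRIMARY KEY (Order_Num)
);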
Thinking: If the data type of the sale price is DECIMAL(5,2), what is the range of values it can represent?
7.6.4 Denormalization
The denormalization shown in Tables 7.27 and 7.28 solves some service problems
by adding derived fields. For example, Total_Price states the order total price
of a particular order, and Item_Total indicates the sales amount of a part within an order.
Whether to continue deriving fields or perform other pre-association operations
depends on the service problems to be solved, the computational complexity, and
whether denormalization can actually speed up these queries.
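A minimal sketch of such a denormalization step, assuming the table names Orders and Order_Detail and the derived columns Total_Price and Item_Total of Tables 7.27 and 7.28: the derived fields are added and then back-filled from the base data.

-- Add and populate the derived part total price (assumed names).
ALTER TABLE Order_Detail ADD COLUMN Item_Total DECIMAL(9,2);
UPDATE Order_Detail SET Item_Total = Item_Quantity * Sale_Price;

-- Add and populate the derived order total price.
ALTER TABLE Orders ADD COLUMN Total_Price DECIMAL(9,2);
UPDATE Orders o
SET Total_Price = (SELECT SUM(d.Item_Total)
                   FROM Order_Detail d
                   WHERE d.Order_Num = o.Order_Num);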
Table 7.28 Order detail table

Field name       Field type      Constraint
Order_Num        INTEGER         NOT NULL
Item_Id          INTEGER         NOT NULL
Sale_Price       DECIMAL(5,2)    NOT NULL
Item_Quantity    SMALLINT        NOT NULL
Item_Total       DECIMAL(9,2)    NOT NULL
Thinking: What is the average monthly sales for Q1? What are the top three parts by sales? You can further refine the derived fields on your own based on such service questions.
Taking Tables 7.23 and 7.25 as examples, the results of adding indexes are
shown in Tables 7.29 and 7.30, where some partition indexes and query indexes are
added. There is no standard answer for adding indexes; it needs to be
judged according to the actual scenario and data volume. For OLTP, each table
needs a primary key, and if there is no natural primary key, a surrogate primary key
such as a sequence can be used. For an OLAP distributed database, each
table also needs a carefully chosen distribution key.
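For illustration, and with assumed index and column names, a query index and a GaussDB (DWS)-style hash distribution key could be declared as follows; whether either one helps still has to be judged against the actual scenario and data volume.

-- Query index on the order date of the order table (assumed names).
CREATE INDEX idx_orders_order_date ON Orders (Order_Date);

-- Distributed (OLAP) variant of the order detail table with a hash
-- distribution key on the order number.
CREATE TABLE Order_Detail (
    Order_Num     INTEGER      NOT NULL,
    Item_Id       INTEGER      NOT NULL,
    Sale_Price    DECIMAL(5,2) NOT NULL,
    Item_Quantity SMALLINT     NOT NULL,
    Item_Total    DECIMAL(9,2) NOT NULL
)
DISTRIBUTE BY HASH (Order_Num);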
7.7 Summary
This chapter focuses on the New Orleans design methodology for database modeling
and explains its four phases of requirement analysis, conceptual design, logical
design, and physical design, with the tasks of each design phase clearly laid out.
The significance of the requirement analysis stage is expounded. The E-R approach is
introduced in the conceptual design stage. For the logical design part, the
important basic concepts and the 3NF are expounded, and each NF is explained in
depth with examples. For the physical design stage, denormalization techniques
and the key points of the work are emphasized. The chapter concludes with a small
practical case to illustrate the main elements of logical and physical modeling.
7.8 Exercises
1. [Single Choice] The next phase after the logical design phase in the New
Orleans design methodology is ( ).
A. Requirement analysis
B. Physical design
C. Conceptual design
D. Logical design
2. [Multiple Choice] The efficiency of the database operating environment is reflected in
which of the following aspects? ( )
A. Data access efficiency
B. Time cycle of data storage
C. Storage space utilization
D. Efficiency of database system operation and management
3. [Multiple Choice] In the process of requirement investigation, which of the
following methods can be used? ( )
A. Questionnaire survey
B. Interviews with service personnel
C. Sample data collection, and data analysis
D. Review of the User Requirement Specification
4. [Multiple Choice] Which of the following options are included in the three
elements of the E-R diagram in model design? ( )
A. Entity
B. Relation
C. Cardinality
D. Attribute
12. [True or False] Because partitioning can reduce the I/O scan overhead during
data query, the more partitions are created during the physicalization process,
the better. ( )
A. True
B. False
13. [True or False] The foreign key is the unique identifier that identifies each
instance in an entity. ( )
A. True
B. False
14. [True or False] Atomicity as required by the 1NF means that each attribute is split
down to the smallest granularity and cannot be split any further. ( )
A. True
B. False
15. [True or False] A relation between entities exists only if a foreign key exists, and
a relation between two entities cannot be established without a foreign key. ( )
A. True
B. False
16. [Multiple Choice] In the process of building a logical model, which of the
following options fall within the scope of work of determining the attributes of an
entity? ( )
A. Define the primary key of the entity
B. Define some of the non-key attributes
C. Define non-unique attribute groups
D. Define constraints on attributes
17. [True or False] The data dictionary in the requirement analysis phase of the New
Orleans design methodology has the same meaning as the data dictionary in a
database product. ( )
A. True
B. False
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-
nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license and indicate if you modified the licensed material.
You do not have permission under this license to share adapted material derived from this chapter or
parts of it.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 8
Introduction to Huawei Cloud Database
GaussDB
Going from scratch to strength means the accumulation of time and
the precipitation of experience. As the saying goes, it takes ten years to sharpen a sword:
Huawei officially released the GaussDB database series products on May 15, 2019.
In order to pay tribute to the German mathematician Gauss, Huawei named its self-
developed databases GaussDB. The Kunpeng ecology develops in three directions:
improving technology and O&M efficiency, helping users focus on core business innovation,
and introducing innovative technologies and new services faster. GaussDB also offers rich
ecological options: in addition to the commitment to building the Huawei ecology, it is
compatible with widely used open ecologies such as MySQL, which facilitates users'
application migration and development and ensures the continuity of user investment and business.
improve the throughput of data centers, improve the scalability of network applica-
tions, and implement auto-tuning.
In fact, Huawei divides the database into three parts: SQL layer, abstraction layer
and storage layer. From the physical level, it can be divided into two layers: one is
the SQL layer, which adopts a one-master-multi-standby model; the other is the
storage abstraction layer, which maintains database services for different tenants,
including building pages, log processing, and other related functions, as shown in
Fig. 8.3.
For the SQL layer, planning, querying, and transaction management can be isolated
by managing client connections and parsing SQL requests in the form of one read-
write copy and multiple read-only copies. Meanwhile, Huawei also launched HWSQL
and has made many performance improvements based on it, including query
result cache, query plan cache, and online DDL.
A unique feature of the whole design is that frequent page reads from memory are
reduced by SQL replication across multiple nodes: when an update
occurs on the master server, the replica SQL databases also receive the transaction
and commit the update list.
There is also a storage abstraction layer (SAL). SAL is a logical layer that isolates
SQL front ends, transactions, and queries from the storage units. When manipulating
database pages, SAL supports accessing multiple versions of the same page. Based on
spaceID and pageID, SAL can shard all data, with its storage and memory resources
growing proportionally.
In terms of performance, GaussDB (for MySQL) takes full advantage of some
features of Huawei. The system container uses Huawei's Hi1882 high-performance
chip, so it is better than the general container in terms of performance; the RDMA
application greatly reduces computational costs; the Co-Processor achieves data
processing with as few resources as possible, reducing the workload of the SQL
nodes, as shown in Fig. 8.4.
The architecture of GaussDB (for MySQL) is shown in Fig. 8.5.
(1) Ultimate reliability: zero data loss, flash recovery from failure, and support for
cross-AZ high availability.
(2) Multi-dimensional expansion: compute node expansion in both directions.
Horizontal expansion: support for horizontal expansion in 1-write & 15-read
mode. Vertical expansion: online elastic expansion and on-demand billing.
Compared with traditional RDS for MySQL, the log-as-data architecture no longer needs to flush
pages: all update operations only record logs, removing secondary writes and thus
reducing the consumption of precious network bandwidth.
The instance specifications for GaussDB (for MySQL) are shown in Table 8.3.
The financial industry is currently asset-light, and rapid expansion is the driver for
its use of cloud databases. However, the whole industry experiences the pain
point of unpredictable user traffic and data volumes: the user experience suffers
at business peaks, and the service may even have to be stopped for expansion.
GaussDB (for MySQL) compute nodes support bi-directional expansion based
on cloud virtualization: the specification of a single node can be changed, and the
cluster supports 1 write node and 15 read nodes with an expansion ratio of 0.9. It also
supports storage pooling, with a maximum of 128 TB of storage space, and the expansion
of compute nodes does not increase storage costs.
In the enterprise-level market that SaaS applications are entering, the business pain
points of large Internet companies and traditional large enterprises are huge business volumes,
high throughput, and unsolved open source database problems, which force them to
adopt complicated solutions such as database and table sharding. Enterprise users
generally prefer commercial databases (e.g., SQL Server and Oracle), which carry
high license costs.
GaussDB (for MySQL) adopts storage pooling, uses MySQL native optimiza-
tion, and also has hardware advantages such as RDMA, V5 CPUs, and Optane;
in terms of architecture, database logic is pushed down to release computing
power and reduce network overhead.
(1) Rapid business development, with annual data growth of more than 30%;
(2) Real-time analysis capability required for the data analysis platform to achieve
intelligent user experience;
(3) Support for independent report development and visual analysis.
To this end, GaussDB database gives the following solutions:
(1) On-demand elastic expansion to support rapid business development;
(2) SQL on HDFS support for real-time analysis of instant exploration scenarios,
Kafka stream data entry at high speed, and real-time report generation;
(3) Key technologies such as multi-tenant load management and approximate cal-
culation enabling efficient report development and visual analysis.
These solutions generate the following user benefits:
(1) On-demand capacity expansion without business interruption;
(2) Real-time analysis results thanks to the new data analysis model, with marketing
accuracy rate increased by more than 50%;
(3) Response time of typical visual report query and analysis reduced from the past
minute level to within 5 s, and report development cycle reduced from the past
2 weeks to 0.5 h.
GaussDB database is suitable for small and medium-sized banks' Internet-based
transaction systems, such as mobile apps, websites, etc. It is compatible with the
industry's mainstream commercial database ecology, with high performance, secu-
rity and reliability, etc.
The advantages of GaussDB database are as follows.
(1) Security and reliability. It supports SSL encrypted connections and KMS data
encryption to ensure data security; it also supports a master-standby database
architecture, where, when the master machine fails, the standby machine is
automatically promoted to master to ensure business continuity.
(2) Ultra-high performance. With high-performance, low-latency transaction
processing capability, Sysbench performance under a typical configuration is
30% to 50% higher than that of open source databases.
GTM: Global Transaction Manager, which provides the information required for
global transaction control and uses the multi-version concurrency control (MVCC)
mechanism.
WLM: Workload Manager, which controls the allocation of system resources and
prevents excessive business load from hitting the system, leading to business
congestion and system crashes.
Coordinator Node: Acts as the business entry point and result-return point of the whole system;
receives access requests from business applications; decomposes tasks and sched-
ules the parallel execution of task shards.
Data Node: The logical entity that executes query task sharding.
GDS Loader: Performs parallel data loading; multiple loaders can be configured; supports
text file formats with automatic recognition of error data.
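As a hedged sketch of how GDS-based parallel loading is usually expressed in GaussDB (DWS): a read-only foreign table points at the GDS service, and the data is then inserted into the target table. The host, port, file pattern, error limit, and column list below are assumptions for illustration.

-- GDS foreign table over files served by a GDS process (assumed address).
CREATE FOREIGN TABLE ft_orders (
    Order_Num  INTEGER,
    Order_Date DATE,
    Cust_Id    INTEGER
)
SERVER gsmpp_server
OPTIONS (location 'gsfs://192.168.0.90:5000/orders*',
         format 'TEXT',
         delimiter ',')
PER NODE REJECT LIMIT 'unlimited';  -- tolerate bad rows instead of aborting

-- Load the data in parallel into the target table.
INSERT INTO Orders SELECT * FROM ft_orders;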
GaussDB (DWS) has the following main features and significant advantages over
traditional data warehouses, and can address ultra-large-scale data processing and
common platform management across multiple industries.
(1) Easy use.
One-stop visualization and convenient management: The GaussDB (DWS)
management console is used to complete O&M work such as connecting applica-
tions to the data warehouse, data backup, data recovery, and monitoring of data
warehouse resources and performance.
Seamless integration with big data: You can use standard SQL to query data
on HDFS and OBS without data relocation.
One-click heterogeneous database migration tool: Provides migration tools
that support the migration of SQL scripts from MySQL, Oracle and Teradata to
GaussDB (DWS).
(2) Easy scalability.
On-demand expansion: Non-shared open architecture, where nodes can be
added at any time according to business conditions to improve the data storage
capacity and query analysis performance of the system.
Linear performance improvement upon expansion: Capacity and perfor-
mance improve linearly as the cluster expands, with a linear ratio of 0.8.
Capacity expansion without business interruption: The expansion process
supports data insert, delete, update, and query operations, as well as
DDL operations (DROP/TRUNCATE/ALTER TABLE); table-level online
expansion technology ensures no business interruption and no perception during
expansion.
(3) High performance.
Cloud-based distributed architecture: GaussDB (DWS) adopts fully parallel
MPP architecture, where business data is scattered across multiple nodes, and
data analysis tasks are pushed to the data site for execution nearby, so that
large-scale data processing can be done in parallel and fast response to data
processing can be realized.
High query performance, with trillion-row data responses within seconds:
GaussDB (DWS) realizes parallel execution of instructions in the background.
Unified analysis portal: GaussDB (DWS)'s SQL is used as the unified portal for
upper-layer applications, and application developers can access all data using
familiar SQL.
Real-time interactive analysis: For immediate analysis needs, analysts can get
information from the big data platform in real time.
Flexible adjustment: Adding nodes can expand the system's data storage capacity
and query and analysis performance, which can support petabyte-scale data storage
and calculation.
Data Studio application scenarios also include enhanced ETL and real-time BI
analysis, as shown in Fig. 8.11.
Data Migration: It supports multiple data sources, as well as efficient batch and
real-time data import.
High performance: It supports petabyte-scale data storage at low cost and correlation
analysis of trillions of rows of data with second-level response.
Real time: Real-time integration of business data streams helps users optimize
and adjust business decisions in a timely manner.
The application scenarios of Data Studio also include real-time data analysis, as
shown in Fig. 8.12.
Real-time streaming data entry: IoT, Internet and other data can be written to
GaussDB (DWS) in real time after being processed by streaming computing and AI
services.
8.3 NoSQL Databases
NoSQL, also called “Not Only SQL” or “non-relational”, refers to databases
that differ from traditional relational databases.
There are many significant differences between NoSQL and relational databases.
For example, NoSQL does not guarantee the ACID properties of relational databases;
NoSQL does not use SQL as the query language; NoSQL data storage can work
without a fixed table schema; and NoSQL often avoids SQL JOIN operations.
NoSQL features easy scalability, high performance, etc.
Huawei's self-developed distributed multi-mode NoSQL database service with
computing-storage separation architecture covers four mainstream NoSQL database
services: GaussDB (for Mongo), GaussDB (for Cassandra), GaussDB (for Redis),
and GaussDB (for Influx), as shown in Fig. 8.13.
GaussDB NoSQL supports high-availability clusters across three AZs and, compared
with the community versions, has the advantages of minute-level compute capacity
expansion, second-level storage capacity expansion, strong data consistency, ultra-low
latency, and high-speed backup and recovery. It is cost-effective
and suitable for IoT, meteorology, Internet, gaming, and other fields.
The cloud database GaussDB (for Mongo) is a cloud-native NoSQL database
compatible with MongoDB ecology. It features enterprise-class performance, flex-
ibility, high reliability, visual management, etc.
GaussDB (for Mongo), which supports computing-storage separation, extreme
availability and massive storage, mainly demonstrates the following benefits.
(1) Separation of storage and computing: The storage layer adopts DFV high-
performance distributed storage, and the computing and storage resources are
expanded independently on demand.
(2) Extreme availability: It supports distributed deployment with 3–12 nodes, tol-
erates n-1 node failure, and has three copies of data storage to ensure data
security.
(3) Massive storage: It allows up to 96TB storage space.
(4) Autonomy and controllability: It supports Kunpeng architecture.
(5) Compatibility: It is compatible with MongoDB protocol for consistent develop-
ment experience.
The computing-storage separation architecture of GaussDB (for Mongo) allows
computing and storage to expand on-demand separately, effectively reducing
costs; based on shared storage, Rebalance does not migrate data; 3AZ disaster
recovery is supported.
GaussDB (for Mongo) offloads replica sets to distributed storage, reducing the
number of storage copies; all ShardServers can handle business; distributed storage is
based on sharded replication, which better aggregates I/O performance and fault
reconstruction performance; the RocksDB storage engine guarantees good write perfor-
mance; and a local SSD read cache is used to optimize read performance.
8.4 Summary
This chapter introduces Huawei database products and their features, including the
relational databases GaussDB (for MySQL), GaussDB (openGauss), and GaussDB (DWS), and
expounds the product features and business value of the NoSQL databases, including
GaussDB (for Mongo) and GaussDB (for Cassandra).
Fig. 8.14 GaussDB (for Cassandra) use cases for industrial manufacturing and meteorology
industries
8.5 Exercises
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-
nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license and indicate if you modified the licensed material.
You do not have permission under this license to share adapted material derived from this chapter or
parts of it.
The images or other third party material in this chapter are included in the chapter's Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter's Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Index
N
NF theory, 260–265
Numeric calculation functions, 94–96, 112

O
Object permissions, v, 167, 174, 175, 182, 186
Online analytical processing (OLAP), 16, 17, 34–37, 39, 54, 58, 68, 78, 245, 270, 273, 281, 287–289, 311
Online transaction processing (OLTP), 21, 34–39, 54, 58, 68, 79, 81, 245, 270, 273, 281, 287–289
Operators, v, 63, 87, 104–112, 119–124, 131, 133, 176, 261, 297

R
Relational model, 10–13, 15, 21, 37, 38, 260, 265

S
Secure sockets layer (SSL), 49, 168, 173–174, 186, 228, 299, 310
SQL syntax categories, 115–164
System permissions, 146, 154, 172, 176–179, 181, 182, 186

T
Transaction control, 155, 302