Denormalization

Abstract- It is currently the norm that relational database designs should be based on a normalized logical data model. The primary objectives of this design technique are data integrity and database extendibility. The Third Normal Form is regarded by academicians and practitioners alike to be the point at which the database design is most efficient. Unfortunately, even this lower normalization form has a major drawback with regard to query evaluation: information retrievals from the database can result in a large number of joins, which degrades query performance, so theoretical rules sometimes need to be broken for real-world performance gains. Most existing Conceptual Level RDBMS data models provide a set of constructs that only describes "what data is used" and does not capture "how the data is being used". The question of "how data is used" gets embedded in the implementation level details. As a result, every application built on the existing database extracts the same or similar data in different ways. If the functional use of the data is also captured, common query evaluation techniques can be formulated and optimized at the design phase, without affecting the normalized database structure constructed at the Conceptual Design phase. This paper looks at denormalization as an effort to improve the performance of data retrievals made from the database without compromising data integrity. A study on a hierarchical database table shows the performance gain - with respect to response time - of a denormalization technique.

Keywords: denormalization, database design, performance tuning, materialized views, query evaluation

I. INTRODUCTION

Most of the applications existing today have been built, or are still being built, using RDBMS or ORDBMS technologies. The RDBMS is thus not dead, as stated by Arnon Roten-Gal-Oz [Roten_Gal, 2009]. Van Couver, a software engineer with vast experience in databases at Sun Microsystems, emphasizes that RDBMSs are here to stay but do require improvements in scalability and performance bottlenecks [Couver, 2009].

Normalization is the process of putting one fact, and nothing more than one fact, in exactly one appropriate place. Related facts about a single entity are stored together, and every attribute of each entity is non-transitively associated with the Primary Key of that entity. This design technique results in enhanced data integrity and removes the insert, update and delete anomalies that would otherwise have been present in a non-normalized database. Another goal of normalization is to minimize redesign of the database structure. Admittedly, it is impossible to predict every need that your database design will have to fulfill and every issue that is likely to arise, but it is important to mitigate potential problems as much as possible by careful planning. Arguably, normalizing your data is essential to good performance and ease of development, but the question always comes up: "How normalized is normalized enough?" Many books on normalization mention that 3NF is essential, and many times BCNF, and that 4NF and 5NF are really useful and well worth the time required to implement them [Davidson, 2007]. This optimization, however, results in performance degradation in data retrievals from the database, as a large number of joins need to be done to solve queries [Date, 1997] [Inmon, 1987] [Schkolnick and Sorenson, 1980].

"Third normal form seems to be regarded by many as the point where your database will be most efficient ... If your database is overnormalized you run the risk of excessive table joins. So you denormalize and break theoretical rules for real world performance gains." [Sql Forums, 2009]. There is thus a wide gap between the academicians and the database application practitioners which needs to be addressed. Normalization promotes an optimal design from a logical perspective; denormalization is a design level one step up from normalization. With respect to retrieval performance, denormalization is not necessarily a bad decision if implemented following a systematic approach in large scale databases where dozens of relational tables are used.

Denormalization is an effort that seeks to optimize performance while maintaining data integrity. A denormalized database is thus not equivalent to a database that has not been normalized. Instead, you only seek to denormalize a data model that has already been normalized. This distinction is important to understand, because you go from normalized to denormalized, not from nothing to denormalized. The mistake that some software developers make is to directly build a denormalized database considering only the performance aspect. This optimizes only one part of the equation, namely database reads. Denormalization is a design level that is one step up from normalization and should not be treated naively. Framing denormalization against normalization purely in the context of performance
P a g e | 45 Global Journal of Computer Science and Technology
is unserious and can result in major application problems [Thought Clusters, 2009]. We need to understand how and when to use denormalization.

This paper is organized as follows: Section 1 introduces the concept of and current need for denormalization. Section 2 provides a background of the related work in this area from the academic and the practitioners' points of view. Section 3 makes a strong case for denormalization, while Section 4 presents the framework for a systematic denormalization. Section 5 elucidates some denormalization techniques that can be followed during the database design life cycle and shows the performance gain of this technique over a Hierarchical Normalized Relation.

II. BACKGROUND AND RELATED WORK

Relational databases can be roughly categorized into Transaction Processing (OLTP) and Data Warehouse (OLAP) databases. As a general rule, OLTP databases use normalized schemas and ACID transactions to maintain database integrity, as the data needs to be continuously updated when transactions occur. As a general rule, OLAP databases use unnormalized schemas (the "star schema" is the paradigmatic OLAP schema) and are accessed without transactions, because each table row is written exactly once and then never deleted or updated. Often, new data is added to OLAP databases in an overnight batch, with only queries occurring during normal business hours [Lurie M., IBM, 2009] [Microsoft SQL Server guide] [Wiseth, Oracle].

Software developers and practitioners mention that database design principles besides normalization include the building of indices on the data and the denormalization of some tables for performance. Performance tuning methods like indices and clustering data of multiple tables exist, but these methods tend to optimize a subset of queries at the expense of the others. Indices consume extra storage and are effective only when they work on a single attribute or an entire key value. The evaluation plans sometimes skip the secondary indexes that are created by users if these indices are nonclustering [Khaldtiance, 2008].

Materialized Views can also be used as a technique for improving performance [Vincent et al, 97], but these consume vast amounts of storage and their maintenance results in additional runtime overheads. Blind application of Materialized Views can actually result in worse query evaluation plans, so they should be used carefully [Chaudhuri et al, 1995]. View update techniques have been researched, and a relatively new method of updating using additional views has been proposed [Ross et al, 1996].

In the real world, denormalization is sometimes necessary. There have been two major trends in the approach to denormalization. The first approach uses a "non-normalized ERD", where the entities in the ERD are collapsed to decrease the joins. In the second approach, denormalization is done at the physical level by consolidating relations, adding synthetic attributes and creating materialized views to improve performance. The disadvantage of this approach is the overhead required in view consistency maintenance. Denormalization is not necessarily a bad decision if implemented wisely [Mullins, 2009].

Some denormalization techniques have been researched and implemented in many strategic applications to improve query response times. These strategies are followed in the creation of data warehouses and data marts [Shin and Sanders, 2006] [Barquin and Edelstein] and are not directly applicable to an OLTP system. Restructuring a monolithic Web application, composed of Web pages that address queries to a single database, into a group of independent Web services querying each other also requires denormalization for improved performance [Wei Z et al, 2008].

Several researchers have developed lists of normalization and denormalization types, and have subsequently mentioned that denormalization should be carefully deployed according to how the data will be used [Hauns, 1994] [Rodgers, 1989]. The primary methods that have been identified are: combining tables, introducing redundant data, storing derivable data, allowing repeating groups, partitioning tables, creating report tables, and mirroring tables. These "denormalization patterns" have been classified as Collapsing Relations, Partitioning Relations, Adding Redundant Attributes and Adding Derived Attributes [Sanders and Shin, 2001].

III. A CASE FOR DENORMALIZATION

Four main arguments that have guided experienced practitioners in database design are listed here [26].

The Convenience Argument
The presence of calculated values in tables aids the evaluation of ad hoc queries and report generation. Programmers do not need to know anything about the API to do the calculation.

The Stability Argument
As systems evolve, new functionality must be provided to the users while retaining the original. History data may still need to be retained in the database.

The Simple Queries Argument
Queries that involve join jungles are difficult to debug and dangerous to change. Eliminating joins makes queries simpler to write, debug and change.

The Performance Argument
Denormalized databases require fewer joins than normalized relations. Computing joins is expensive and time consuming; fewer joins translate directly to improved performance.

Denormalization of databases, i.e., the systematic creation of a database structure whose goal is performance improvement, is thus needed for today's business processing requirements. This should be an intermediate step in the DataBase Design Life Cycle, integrated between the Logical DataBase Design Phase and the Physical DataBase Design Phase. Retrieval performance needs dictate very quick retrieval capability for
data stored in relational databases, especially since more accesses to databases are being done through the Internet. Users are more concerned with prompt responses than with an optimum database design. To create a Denormalization Schema, the functional usage of the operational data must be analyzed for optimal Information Retrieval.

Some of the benefits of denormalization can be listed:

(a) Performance improvement by:
- Precomputing derived data
- Minimizing joins
- Reducing Foreign Keys
- Reducing indices and saving storage
- Smaller search sets of data for partitioned tables
- Caching the Denormalized structures at the Client for ease of access, thereby reducing query/data shipping cost.

(b) Since the Denormalized structures are primarily designed keeping in mind the functional usage of the application, users can directly access these structures rather than the base tables for report generation. This also reduces bottlenecks at the server.

A framework for denormalization needs to address the following issues:
(i) Identify the stage in the DataBase Design Life Cycle where Denormalization structures need to be created.
(ii) Identify situations and the corresponding candidate base tables that cause performance degradation.
(iii) Provide strategies for boosting query response times.
(iv) Provide a method for performing the cost-benefit analysis.
(v) Identify and strategize security and authorization constraints on the denormalized structures.
Although (iv) and (v) above are important issues in denormalization, they will not be considered in this paper and will be researched later.

IV. A DENORMALIZATION FRAMEWORK

The framework presented in this paper differs from the papers surveyed above in the following respects:
- It does not create denormalized tables with all contributing attributes from the relevant entities, but instead creates a set of Denormalized Structures over a set of Normalized tables. This is an important and pertinent criterion, as these structures can be built over existing applications with no "side effects of denormalization" on the existing data.
- The entire sets of attributes from the contributing entities are not stored in the Denormalized structure. This greatly reduces the storage requirements and redundancies.
- The Insert, Update and Delete operations (IUDs) are not done on the denormalized structures directly and thus do not violate data integrity. The IUDs to data are done on the Base Tables, and the denormalized structures are kept in synch by triggers on the base tables.
- Since the denormalized structures are used for information retrieval, they need to consider the authorization access that users have over the base tables.
- The construction of the "Denormalization View" is not an intermediate step between the Logical and the Physical Design phases, but needs to be consolidated by considering all 3 views of the ANSI/SPARC architectural specifications.

Most existing Conceptual Level RDBMS data models provide a set of constructs that describes the structure of the database [Elmasri and Navathe]. This higher level of conceptual modeling only informs the end user "what data is used" and does not capture "how the data is being used". The question of "how data is used" gets embedded in the implementation level details. As a result, every application built on the existing database extracts the same or similar data in different ways. If the functional use of the data is also captured, common query evaluation techniques can be formulated and optimized at the design phase, without affecting the normalized database structure constructed at the Conceptual Design phase. Business rules are descriptive integrity constraints or functional (derivative or active) rules, and ensure the well functioning of the system. Common models used during the modeling process of information systems do not allow the high level specification of business rules, except for a subset of ICs taken into account by the data model [Amghar and Mezaine, 1997].

The ANSI 3 level architecture stipulates 3 levels: the External Level and the Conceptual Level, which capture data at rest, and the Physical Level, which describes how the data is stored and depends on the DBMS used. External Schemas or subschemas relate to the user views. The Conceptual Schema describes all the types of data that appear in the database and the relationships between data items. Integrity constraints are also specified in the conceptual schema. The Internal Schema provides definitions for stored records, methods of representation, data fields, indexes, and hashing schemes. Although this architecture provides the application development environment with logical and physical data independence, it does not provide an optimal query evaluation platform. The DBA has to balance conflicting user requirements before creating indices and consolidating the Physical schema.

The reason denormalization is at all possible in relational databases is that, courtesy of the relational model, which creates lossless decompositions of the original relation, no information is lost in the process. The Denormalized structure can be reengineered and populated from the existing Normalized database and vice-versa. In a distributed application development environment, the Denormalization Views can be cached on the client, resulting in a major performance boost by saving run time shipping
costs. It would require only the Denormalization View Manager to be installed on the Client.

A High Level Architecture that this framework considers is defined as follows:

The inputs that are required for the construction of the Denormalized schema can be identified as:
- the logical and external views schema design,
- the physical storage and access methods provided by the DBMS,
- the authorization the users have for the manipulation and access of the data within the database,
- the interaction (inter and intra) between the entities,
- the number of entities the queries involve,
- the usage of the data (i.e., the kind of attributes and their frequency of extraction within queries and reports),
- the volume of data being analyzed and extracted in queries (cardinality and degree of relations, number and frequency of tuples, blocking factor of tuples, clustering of data, estimated size of a relation),
- the frequency of occurrence and the priority of the query,
- the time taken by the queries to execute (with and without denormalization).

A system cannot enforce truth, only consistency. Internal Predicates (IPs) are what the data means to the system, and External Predicates (EPs) are what the data means to a user. The EPs result in a criterion for the acceptability of IUD operations on the data, which is an unachievable goal [Date, Kannan, Swamynathan], especially when Materialized Views are created. In the framework presented in this paper, IUDs on the Denormalized Structures are never rejected, as these are automatically propagated to the base relations where the
Domain and Table level ICs are enforced. Once the base relations are updated, the Denormalized Schema Relation triggers are invoked atomically to synchronize the data, ensuring simultaneous consistency of the Base and Denormalized tables. Further, the primary reason for the Denormalization Structures is faster information retrieval and not data manipulation; hence no updates need be made to the Denormalization Schema directly.

Every Normalized Relation requires a Primary Key which satisfies the Key Integrity Constraint. This PK maintains the uniqueness of tuples in the database and is not necessarily the search key value for users. For the RDIRS we define

The Denormalization Schema Design is an input to the Query Optimizer for collapsing access paths, resulting in the IRT, which is then submitted to the Query Evaluation Engine.

Although the metadata tables are queryable at the server, the Denormalized Structure Manager can have its own metadata stored locally (at the node where the DSs are stored):

DS_Metadata_Scheme(DS_Name, DS_Trigger_Name, DS_Procedure_Name, DS_BT1_Name, Creator, DS_BT1_Trigger_Name, DS_BT2_Trigger_Name, DS_BT1_Authorization, DS_BT2_Authorization)
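The trigger-based synchronization between base and denormalized tables described above can be sketched as follows. This is a minimal illustration using SQLite; the table, column and trigger names are hypothetical stand-ins, not the paper's actual DS_Metadata_Scheme entries, and a production system would generate the triggers from that metadata.

```python
import sqlite3

# All IUDs go through the base relation, where the key and referential
# integrity constraints live; triggers propagate each change to the
# denormalized copy so both stay consistent within the same transaction.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    ItemNo       TEXT PRIMARY KEY,                 -- key integrity constraint
    ParentItemNo TEXT REFERENCES item(ItemNo),
    ItemName     TEXT
);

-- Denormalized structure: only the attributes the queries need.
CREATE TABLE dn_item (
    ItemNo       TEXT PRIMARY KEY,
    ParentItemNo TEXT,
    ItemName     TEXT
);

-- Triggers on the base table keep the denormalized copy in synch.
CREATE TRIGGER item_ai AFTER INSERT ON item BEGIN
    INSERT INTO dn_item VALUES (NEW.ItemNo, NEW.ParentItemNo, NEW.ItemName);
END;
CREATE TRIGGER item_au AFTER UPDATE ON item BEGIN
    UPDATE dn_item SET ParentItemNo = NEW.ParentItemNo,
                       ItemName     = NEW.ItemName
    WHERE ItemNo = NEW.ItemNo;
END;
CREATE TRIGGER item_ad AFTER DELETE ON item BEGIN
    DELETE FROM dn_item WHERE ItemNo = OLD.ItemNo;
END;
""")

conn.execute("INSERT INTO item VALUES ('100', NULL, 'Assembly')")
conn.execute("INSERT INTO item VALUES ('101', '100', 'SubPart')")
conn.execute("DELETE FROM item WHERE ItemNo = '101'")
rows = conn.execute("SELECT ItemNo FROM dn_item").fetchall()
print(rows)  # the denormalized copy mirrors the base table: [('100',)]
```

Because the denormalized table is written only by the triggers, applications can read it freely, while any direct write path remains the base table, preserving data integrity.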
Figure 3: Partial Hierarchical Item Data (a tree of items rooted at item 100)

The Normalized Relation for the Hierarchical Item Table would be stored as:

ItemNo  ParentItemNo  OtherItemDetails
100     -             ...
101     100           ...
105     100           ...
108     101           ...
200     101           ...
203     101           ...
204     101           ...
109     108           ...
110     108           ...
111     108           ...
112     108           ...
209     204           ...

Retrieving all sub-items of item 100 then requires a query of the form:

Select ItemNo from item where ParentItemNo = '100'
Union
Select ItemNo from item where ParentItemNo in
    (Select ItemNo from item where ParentItemNo = '100')
Union
Select ItemNo from item where ParentItemNo in
    (Select ItemNo from item where ParentItemNo in
        (Select ItemNo from item where ParentItemNo = '100'))

This retrieval query, besides being extremely inefficient, requires one to know the maximum depth of the hierarchy.

The Denormalized Schema for the Item Information in the RDIRS:

DN_Item_Hierarchy (ParentItemNo, ChildItemNo, ItemName, ChildLevel, IsLeaf, Item_URowId)

The ChildLevel ascertains the level in the hierarchy at which the child node sits; IsLeaf specifies whether that node has further child nodes, and makes queries like "Find all items that have no subparts" efficiently solvable.

The (part) extension of the DN_Item_Hierarchy Schema:

ParentItemNo  ChildItemNo  ItemName  ChildLevel  IsLeaf  ItemRowId

With an increased set of tuples, and a greater depth in the hierarchy, the improvement will be substantial.
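A minimal runnable sketch of the denormalized structure at work, using SQLite for illustration: the rows are derived from the Item table above (ItemName and the row id columns are omitted for brevity, and the ChildLevel/IsLeaf values are the ones implied by that data).

```python
import sqlite3

# Populate an abbreviated DN_Item_Hierarchy with the Figure 3 data:
# each row carries its level in the hierarchy and a leaf flag.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE DN_Item_Hierarchy (
    ParentItemNo TEXT, ChildItemNo TEXT,
    ChildLevel INTEGER, IsLeaf INTEGER)""")
conn.executemany(
    "INSERT INTO DN_Item_Hierarchy VALUES (?, ?, ?, ?)",
    [('100', '101', 1, 0), ('100', '105', 1, 1),
     ('101', '108', 2, 0), ('101', '200', 2, 1),
     ('101', '203', 2, 1), ('101', '204', 2, 0),
     ('108', '109', 3, 1), ('108', '110', 3, 1),
     ('108', '111', 3, 1), ('108', '112', 3, 1),
     ('204', '209', 3, 1)])

# "Find all items that have no subparts": a single scan of the IsLeaf
# flag, with no joins, no nested sub-selects, and no prior knowledge
# of the hierarchy's maximum depth.
leaves = sorted(r[0] for r in conn.execute(
    "SELECT ChildItemNo FROM DN_Item_Hierarchy WHERE IsLeaf = 1"))
print(leaves)  # ['105', '109', '110', '111', '112', '200', '203', '209']
```

The same structure answers "all descendants of the root" as a plain scan, whereas the normalized Item table needs one additional nested sub-select per level of depth.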
[4] Chirkova R., Chen Li and Li J., "Answering queries using materialized views with minimum size", VLDB Journal 2006, 15(3), pp. 191-210.
[5] Date C.J., "The Normal is so ... interesting", Database Programming and Design, Nov 1997, pp. 23-25.
[6] Halevy A., "Answering queries using views: A survey", in VLDB, 2001.
[7] Hauns M., "To normalize or denormalize, that is the question", Proceedings of the 19th Int. Conf. for Management and Performance Evaluation of Enterprise Computing Systems, San Diego, CA, 1994, pp. 416-423.
[8] Inmon W.H., "Denormalization for Efficiency", ComputerWorld, Vol. 21, 1987, pp. 19-21.
[9] Ross K., Srivastava D. and Sudarshan S., "Materialized View Maintenance and integrity constraint checking: trading space for time", ACM SIGMOD Conference 1996, pp. 447-458.
[10] Rodgers U., "Denormalization: why, what and how?", Database Programming and Design, 1989 (12), pp. 46-53.
[11] Sanders G. and Shin S.K., "Denormalization Effects on Performance of RDBMS", Proceedings of the 34th International Conference on Systems Sciences, 2001.
[12] Schkolnick M. and Sorenson P., "Denormalization: A performance oriented database design technique", Proceedings of the AICA 1980 Congress, Italy.
[13] Shin S.K. and Sanders G.L., "Denormalization strategies for data retrieval from data warehouses", Decision Support Systems, Vol. 42, No. 1, pp. 267-282, 2006.
[14] Vincent M., Mohania M. and Kambayashi Y., "A Self-Maintainable View maintenance technique for data warehouses", 8th Int. Conf. on Management of Data, Chennai, India.
[15] Wei Z., Dejun J., Pierre G., Chi C.H. and Steen M., "Service-Oriented Data Denormalization for Scalable Web Applications", Proceedings of the 17th International WWW Conference 2008, Beijing, China.
[16] Barquin R. and Edelstein H., "Planning and Designing the Data Warehouse", Prentice Hall.
[17] Date C.J., Kannan A. and Swamynathan S., "An Introduction to Database Systems", 8th Ed., Pearson Education.
[18] Elmasri R. and Navathe S., "Fundamentals of Database Systems", 3rd Ed., Addison-Wesley.
[19] Davidson L., "Ten common design mistakes", software engineers blog, Feb 2007.
[20] Downs K., "The argument for Denormalization", The Database Programmer, Oct 2008.
[21] Khaldtiance S., "Evaluate Index Usage in Databases", SQL Server Magazine, October 2008.
[22] Lurie M., IBM, "Winning Database Configurations".
[23] Mullins C., "Denormalization Guidelines", Platinum Technology Inc., Data Administration Newsletter, Accessed June 2009.
[24] Microsoft, SQL Server 7.0 Resource Guide, "Chapter 12 - Data Warehousing Framework".
[25] Roten-Gal-Oz A., "Cirrus Minor" in "Making IT work", Musings of a Holistic Architect, Accessed June 2009.
[26] Van Couver D., on his blog "Van Couvering is not a verb", Accessed June 2009.
[27] Wiseth K., Editor-in-Chief of Oracle Technology News, in "Find Meaning", Accessed June 2009.
[28] Thought Clusters on software, development and programming, website, March 2009.
[29] Website: http://www.sqlteam.com/Forums/, Accessed July 2009.