Automatic Database Normalization and Primary Key Generation
Authorized licensed use limited to: Indian Institute of Technology - Jodhpur. Downloaded on August 14,2022 at 13:46:38 UTC from IEEE Xplore. Restrictions apply.
2. REPRESENTING DEPENDENCIES
We use three structures to represent and manipulate dependencies amongst the attributes of a relation: the dependency graph, the Dependency Matrix (DM), and the Directed Graph matrix (DG).

2.1. Dependency Graph Diagram
With functional dependencies we can monitor all relations between the different attributes of a table. We can show these dependencies graphically by using a set of simple symbols, of which the arrow is the most important. In addition, in our way of representing the relationship graph, a (dotted) horizontal line separates simple keys (i.e., attributes) from composite keys (i.e., keys composed of more than one attribute). A dependency graph is generated using the following rules.
1. Each attribute of the table is encircled, and all attributes of the table are drawn at the lowest level (i.e., bottom) of the graph.
2. A horizontal line is drawn on top of all attributes.
3. Each composite key (if any) is encircled, and all composite keys are drawn on top of the horizontal line.
4. All functional dependency arrows are drawn.
5. All reflexivity rule dependencies are drawn using dotted arrows (for example AB → A, AB → B).
Consider the functional dependency set of Example 1 for a relation r.

Example 1: FDs = {A → BCD, C → DE, EF → DG, D → G}

Figure 1 is the graphical representation of the dependencies.

[Figure 1 graph: composite key EF above the horizontal line; attributes A B C D E F G below it]
Figure 1: Graphical representation of dependencies

If we are able to obtain all dependencies between determinant keys, we can produce all dependencies between all attributes of a relation. These dependencies are represented by using a Dependency Matrix (DM). Using path finding algorithms and Armstrong's transitivity rule, new dependencies are discovered from the existing dependency set. This is the basis of the normalization algorithm discussed in the following sections.

2.2. Dependency Matrix
From a dependency graph, the corresponding Dependency Matrix (DM) is generated as follows:
i. Define matrix DM[n][m], where
   n = number of determinant keys,
   m = number of simple keys.
ii. Suppose that β ⊆ α and γ ⊄ α, with β, γ ∈ {Simple key set} and α ∈ {Determinant key set}.
iii. Establish the DM elements as follows:
   if α → β, DM[α][β] = 2,
   if α → γ, DM[α][γ] = 1,
   otherwise DM[α][γ] = 0.
The DM for Example 1 is shown in Figure 2.

      A  B  C  D  E  F  G
A     2  1  1  1  0  0  0
C     0  0  2  1  1  0  0
D     0  0  0  2  0  0  1
EF    0  0  0  1  2  2  1
Figure 2: Initial dependency matrix

2.3. Directed Graph Matrix
The Directed Graph (DG) matrix for determinant keys is used to represent all possible direct dependencies between determinant keys. The DG is an n×n matrix, where n is the number of determinant keys. The process of determining the elements of this matrix is as follows.
The elements of the DG matrix are initially set to zero. Starting from the first row of the dependency matrix DM, the matrix is examined in row-major order. Suppose we are examining the row corresponding to determinant key x. If all simple keys that x is composed of depend on a determinant key other than x, then x also depends on that determinant key (Armstrong's augmentation rule). The dependency of a simple key on a determinant key is represented by a non-zero entry in the DM.
For example, suppose that FDs = {AB → E, BC → A, DE → A}. The corresponding dependency matrix and the initial directed graph matrix are shown in Figure 3.

(a): Dependency Matrix
      A  B  C  D  E
AB    2  2  0  0  1
BC    1  2  2  0  0
DE    1  0  0  2  2

(b): Directed Graph Matrix
      AB  BC  DE
AB     0   0   0
BC     0   0   0
DE     0   0   0
Figure 3: Initializing DM and DG Matrices

In part (a) of Figure 4, we start with the first row of the DM. The determinant key of this row is AB. A and B are the subsets of AB, and they appear in columns one and two of the matrix. In row one, columns one and two are both non-zero. Therefore AB depends on AB. In the second row, columns one and two are both non-zero, too. Hence, AB depends on BC. However, for the third row, it is not the case that both A and B depend on DE. Therefore, a -1 value is put in the intersection of row DE and column AB in the DG matrix of part (b) of Figure 4.

(a)
      A  B  C  D  E
AB    2  2  0  0  1
BC    1  2  2  0  0
DE    1  0  0  2  2

(b)
      AB  BC  DE
AB     1  -1  -1
BC     1   1  -1
DE    -1  -1   1
Figure 4: Dependency Matrix and Directed Graph Matrix

The algorithm for producing the DG graph follows.
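As a runnable companion to the DM rules of Section 2.2 and the row scan of Section 2.3, here is a minimal Python sketch. It assumes single-letter attribute names, so that a determinant key such as "AB" can be written as a string. The function names build_dm and build_dg are illustrative and do not come from the paper. The DG test is written with all(), which should give the same result as setting an entry to -1 as soon as one attribute of the key is missing.

```python
def build_dm(fds, attributes):
    """Dependency matrix as {determinant: {attribute: 0, 1 or 2}} (hypothetical helper)."""
    dm = {}
    for lhs, rhs in fds:
        row = dm.setdefault(lhs, {a: 0 for a in attributes})
        for a in lhs:
            row[a] = 2          # attribute is part of the determinant key
        for a in rhs:
            if a not in lhs:
                row[a] = 1      # attribute depends directly on the key
    return dm


def build_dg(dm):
    """DG[j][i] = 1 if every attribute of key i is non-zero in row j, else -1."""
    keys = list(dm)
    return {j: {i: 1 if all(dm[j][a] != 0 for a in i) else -1 for i in keys}
            for j in keys}


# FDs of the Figure 3 example: AB -> E, BC -> A, DE -> A
fds = [("AB", "E"), ("BC", "A"), ("DE", "A")]
dm = build_dm(fds, "ABCDE")
dg = build_dg(dm)

for key in dg:
    print(key, [dg[key][i] for i in dg])
# AB [1, -1, -1]
# BC [1, 1, -1]
# DE [-1, -1, 1]
```

The printed rows agree with part (b) of Figure 4.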
Directed-Graph-Matrix()
{
  for (i = 0; i < n; i++)
    for (k = each attribute that composes determinant key i)
      for (j = 0; j < n; j++) {
        if (DM[j][k] != 0 && DG[j][i] != -1)
          DG[j][i] = 1;
        else
          DG[j][i] = -1;
      }
}

After generating the DG matrix we turn our attention towards finding all possible paths between all pairs of determinant keys. The resulting path matrix shows all transitive dependencies between determinant keys. There are many such path finding algorithms, like the Prim, Kruskal, and Warshall algorithms. If there is a path from node x to node y, then y transitively depends on x. As an example, Figure 5 shows the complete determinant key transitive dependencies corresponding to the DM of Figure 3.

      AB  BC  DE
AB     1  -1  -1
BC     1   1  -1
DE    -1  -1   1
Figure 5: Determinant key transitive dependencies

From Figure 5 we can deduce that AB depends on BC. On the other hand, E depends on AB. Therefore, E depends on BC. That is, BC → AB, AB → E => BC → E. These dependencies are recognized through the dependency closure procedure, which is presented in Figure 6.

Dependency-closure()
{
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      if (i != j && Path[i][j] != -1) {
        for (k = 0; k < m; k++)
          if (DM[j][k] != 0 && DM[j][k] != 2)
            DM[i][k] = j;
      }
}
Figure 6: Recognition of dependency closure

The DM of Figure 3 is updated as follows to reflect all dependencies, including those that are obtained by the Dependency-closure procedure.

      A  B  C  D  E
AB    2  2  0  0  1
BC    1  2  2  0  AB
DE    1  0  0  2  2
Figure 7: E depends on BC via AB

In Figure 7, E depends on BC via AB. It is possible that E might depend on BC through some other determinant key, too. In that case it does not matter which determinant key is used in Figure 7 to represent this dependency. One issue to be careful of is that, by updating the DM to reflect transitive dependencies, some direct dependencies may fade away.

Consider FDs = {A → B, B → A, and B → C}. The DM and DG matrices are shown in Figure 8.

By applying the path finding algorithm, the updated matrix is shown in part (a) of Figure 9. As can be seen from part (a) of Figure 8, the direct dependency of C on B has faded away. To tackle this deficiency, the following Circular-Dependency algorithm is designed. This algorithm internally uses the recursive FindOne algorithm. The latter finds the direct dependency, if any, and uses it to replace the transitive one. This is reflected in part (b) of Figure 9.

(a)
     A  B  C
A    2  1  B
B    1  2  A

(b)
     A  B  C
A    2  1  B
B    1  2  1
Figure 9: The original B → C is restored

In Figure 10, DM2 represents the initial dependency matrix.

Circular-Dependency()
{
  for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
      if (DM[i][j] ∉ {0, 1, 2})
        if (FindOne(i, j, j, n) && DM2[i][j] == 1)
          DM[i][j] = 1;
}

int FindOne(int i, element j, int k, int n)
{
  if (DM[j][k] == 1 && n >= 1) return 0;
  else if (n < 1) return 1;
  else return FindOne(i, DM[i][k], k, n-1);
}
Figure 10: Replacing transitive dependency with original direct dependency

Example 2: Consider the following case taken from [8]: Relation GH {A, B, C, D, E, F, G, H, I, J, K, L} with dependencies: FDs = {A → BC, E → AD, G → AEJK, GH → FI, K → AL, and J → K}. The corresponding dependency graph is given in Figure 11.

[Figure 11 graph: composite key GH above the horizontal line; attributes A B C D E F G H I J K L below it]
Figure 11: Dependency graph for Example 2

Figure 12 shows the original DM for Figure 11.

      A  B  C  D  E  F  G  H  I  J  K  L
A     2  1  1  0  0  0  0  0  0  0  0  0
E     1  0  0  1  2  0  0  0  0  0  0  0
G     1  0  0  0  1  0  2  0  0  1  1  0
GH    0  0  0  0  0  1  2  2  1  0  0  0
K     1  0  0  0  0  0  0  0  0  0  2  1
J     0  0  0  0  0  0  0  0  0  2  1  0
Figure 12: Initial Dependency Matrix for Figure 11
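The Circular-Dependency step of Figure 10 leaves some room for interpretation. The Python sketch below is one reading of it, not the authors' implementation. It assumes that a non-numeric DM entry names the determinant key through which the dependency was derived, and it follows that chain of "via" labels. If the chain reaches a direct entry (1) the transitive entry is kept. If it only goes round in a circle and the initial matrix held a direct dependency, the entry is reset to 1. The names is_grounded and restore_direct are hypothetical.

```python
def is_grounded(dm, via, col, max_steps):
    """Follow 'via' labels in column col; True if a direct entry (1) is reached."""
    for _ in range(max_steps):
        entry = dm[via][col]
        if entry == 1:
            return True
        if entry == 0 or entry == 2:
            return False
        via = entry                      # keep following the chain of labels
    return False                         # chain is circular


def restore_direct(dm, dm_initial):
    """Reset circular transitive entries to 1 where the initial DM had a 1 (in place)."""
    n = len(dm)
    for key, row in dm.items():
        for col, entry in row.items():
            if entry == 0 or entry == 1 or entry == 2:
                continue
            if not is_grounded(dm, entry, col, n) and dm_initial[key][col] == 1:
                row[col] = 1
    return dm


# FDs = {A -> B, B -> A, B -> C}: initial DM and the matrix of Figure 9(a)
dm_initial = {"A": {"A": 2, "B": 1, "C": 0},
              "B": {"A": 1, "B": 2, "C": 1}}
dm_closed = {"A": {"A": 2, "B": 1, "C": "B"},
             "B": {"A": 1, "B": 2, "C": "A"}}

result = restore_direct(dm_closed, dm_initial)
print(result["A"])   # {'A': 2, 'B': 1, 'C': 'B'}
print(result["B"])   # {'A': 1, 'B': 2, 'C': 1}
```

Under this reading the result agrees with part (b) of Figure 9: B → C is direct again, while A → C stays transitive via B.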
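The DG, path matrix, and dependency closure steps can be traced on Example 2 with a short Python sketch. It assumes single-letter attributes, uses Warshall's algorithm for the path matrix, and transcribes the Dependency-closure procedure of Figure 6 literally, including its in-place updates. All function names are illustrative rather than taken from the paper.

```python
ATTRS = "ABCDEFGHIJKL"
FDS = [("A", "BC"), ("E", "AD"), ("G", "AEJK"),
       ("GH", "FI"), ("K", "AL"), ("J", "K")]


def build_dm(fds, attributes):
    """Initial DM: 2 = part of the key, 1 = direct dependency, 0 = none."""
    dm = {}
    for lhs, rhs in fds:
        row = dm.setdefault(lhs, {a: 0 for a in attributes})
        for a in lhs:
            row[a] = 2
        for a in rhs:
            if a not in lhs:
                row[a] = 1
    return dm


def build_dg(dm):
    """DG[j][i] = 1 if all attributes of key i are non-zero in row j, else -1."""
    keys = list(dm)
    return {j: {i: 1 if all(dm[j][a] != 0 for a in i) else -1 for i in keys}
            for j in keys}


def path_matrix(dg):
    """Warshall's algorithm over the DG entries (1 = edge, -1 = no edge)."""
    keys = list(dg)
    path = {i: dict(dg[i]) for i in keys}
    for k in keys:
        for i in keys:
            for j in keys:
                if path[i][k] == 1 and path[k][j] == 1:
                    path[i][j] = 1
    return path


def dependency_closure(dm, path):
    """Literal, in-place transcription of the procedure in Figure 6."""
    keys = list(dm)
    for i in keys:
        for j in keys:
            if i != j and path[i][j] != -1:
                for k, entry in dm[j].items():
                    if entry != 0 and entry != 2:
                        dm[i][k] = j     # i determines k via key j
    return dm


dm = build_dm(FDS, ATTRS)
dg = build_dg(dm)
path = path_matrix(dg)
closed = dependency_closure(dm, path)

print([path["GH"][k] for k in path])      # [1, 1, 1, 1, 1, 1]
print([closed["G"][a] for a in ATTRS])
# ['K', 'E', 'E', 'E', 1, 0, 2, 0, 0, 1, 'J', 'K']
```

The DG and path matrices produced this way agree with Figures 13 and 14, and rows A, E, G, GH, and K of the closed matrix agree with Figure 15. For row J, the literal transcription also records L as reachable via K, whereas Figure 15 shows 0 in that cell.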
      A   E   G   GH  K   J
A     1  -1  -1  -1  -1  -1
E     1   1  -1  -1  -1  -1
G     1   1   1  -1   1   1
GH   -1  -1   1   1  -1  -1
K     1  -1  -1  -1   1  -1
J    -1  -1  -1  -1   1   1
Figure 13: The DG matrix for Example 2

The path matrix is shown in Figure 14.

      A   E   G   GH  K   J
A     1  -1  -1  -1  -1  -1
E     1   1  -1  -1  -1  -1
G     1   1   1  -1   1   1
GH    1   1   1   1   1   1
K     1  -1  -1  -1   1  -1
J     1  -1  -1  -1   1   1
Figure 14: Determinant key transitive dependencies

New dependencies are applied to the DM, and Figure 15 is the semi-final result.

      A  B  C  D  E  F  G  H  I  J  K  L
A     2  1  1  0  0  0  0  0  0  0  0  0
E     1  A  A  1  2  0  0  0  0  0  0  0
G     K  E  E  E  1  0  2  0  0  1  J  K
GH    K  G  G  G  G  1  2  2  1  G  J  K
K     1  A  A  0  0  0  0  0  0  0  2  1
J     K  K  K  0  0  0  0  0  0  2  1  0
Figure 15: Dependency closure matrix

It is now time to restore direct dependencies that might have disappeared when the transitive dependencies were applied. However, the FindOne algorithm does not discover any faded dependency here. Therefore, Figure 15 shows the optimal dependency set. Entries with value 1 identify the components of this set.

We are now in a position to obtain candidate keys. A candidate key is a set of attributes on which all other attributes depend. From the final DM we notice that GH has this property.

There are other sets of attributes which can be considered as candidate keys. For example, the set {G, F, H, I} could be considered as a candidate key. However, the set with the least number of attributes amongst the determinant keys is taken as the primary key in the following discussion.

3. THE PROPOSED NORMALIZATION PROCESS
It is assumed that the reader is familiar with the definitions of the different normal forms. In addition, tables of a relational database are assumed to be in 1NF to begin with. Our proposed 2NF and 3NF normalization process makes use of both the dependency matrix and the determinant key transitive dependencies.

3.1 Second Normal Form (2NF)
To proceed with 2NF, it is assumed that the table is already in 1NF. The resulting 1NF relation is:
GH_Relation: {GH, A, B, C, D, E, F, I, J, K, L}

To produce the 2NF form, we should find all partial dependencies. To do this, the DM is scanned row by row (ignoring the primary key row), starting from the first row. If all values of the simple keys that make up the determinant key of the row being scanned are equal to 2, and the values of the corresponding columns of the candidate key are equal to 2, then a partial dependency is found.
In Figure 15, the dependency of G on GH is partial. Therefore, we have to create a new table. From the DM, we notice that E and J are directly dependent on G. The new table will be composed of G, E, J, and all simple keys which are transitively dependent on G. The transitive dependencies are obtained from the determinant key transitive dependencies matrix. G is the primary key of this table. There is no other partial dependency. In Figure 16, the DM is partitioned into two new DMs corresponding to the new tables.

      A  B  C  D  E  G  J  K  L
A     2  1  1  0  0  0  0  0  0
E     1  A  A  1  2  0  0  0  0
G     K  E  E  E  1  2  1  J  K
K     1  A  A  0  0  0  0  2  1
J     K  K  K  0  0  0  2  1  0
(a): G_Relation: {G, E, J, K, A, B, C, D, L}

      F  G  H  I
GH    1  2  2  1
(b): GH_Relation: {GH, F, I}
Figure 16: Database normalized up to 2NF

3.2 Third Normal Form (3NF)
In order to transform the relations into 3NF, each DM is scanned row by row, starting from the first row. If a determinant key is encountered whose dependency is neither partial (from Figure 16) nor wholly dependent on part of the primary key [9], a separate table has to be formed. Of course, if such a table has already been formed, a duplicate is not generated. This new table will include the determinant key and all other attributes which transitively depend on this key. As can be seen, there is no transitive dependency in part (b) of Figure 16. However, the dependencies of A, E, K, and J in part (a) are of transitive form. Each of these dependencies leads to the production of a new table.

      F  G  H  I
GH    1  2  2  1
(a)

     J  K
J    2  1
(b)

     A  K  L
K    1  2  1
(c)

     E  G  J
G    1  2  1
(d)

     A  D  E
E    1  1  2
(e)

     A  B  C
A    2  1  1
(f)
Figure 17: Database Normalized up to 3NF

3.3 The BCNF Normal Form
For a relation with only one candidate key, 3NF and BCNF are equivalent. Transforming the relations to BCNF requires the creation of a new relation for each transitive dependency. The resulting BCNF relations are:
GH_Relation: {GH, F, I}, J_Relation: {J, K},
K_Relation: {K, A, L}, G_Relation: {G, E, J},
E_Relation: {E, A, D}, and A_Relation: {A, B, C}.
To develop the process of generating the BCNF form, consider the case where there is more than one candidate key for the table being normalized.
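As a cross-check on the relations obtained for Example 2, the sketch below uses the textbook attribute-closure and redundancy-removal approach instead of the DM scan proposed in the paper. It assumes single-letter attributes and GH as the primary key. The helper names are illustrative, and left-hand sides are not reduced, which is sufficient for this example but not in general.

```python
FDS = [("A", "BC"), ("E", "AD"), ("G", "AEJK"),
       ("GH", "FI"), ("K", "AL"), ("J", "K")]


def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def partial_determinants(fds, primary_key):
    """Determinants that are proper subsets of the primary key (2NF violations)."""
    key = set(primary_key)
    return [lhs for lhs, _ in fds if set(lhs) < key]


def drop_redundant(fds):
    """Split right-hand sides, then drop every FD implied by the remaining ones."""
    unit = [(lhs, a) for lhs, rhs in fds for a in rhs]
    kept = list(unit)
    for fd in unit:
        rest = [f for f in kept if f != fd]
        if fd[1] in closure(fd[0], rest):
            kept = rest
    return kept


def group_by_determinant(unit_fds):
    """One relation per determinant: the key attributes plus their direct dependents."""
    relations = {}
    for lhs, a in unit_fds:
        relations.setdefault(lhs, set(lhs)).add(a)
    return relations


partial = partial_determinants(FDS, "GH")
g_relation = closure("G", FDS)
relations = group_by_determinant(drop_redundant(FDS))

print(partial)                 # ['G']
print(sorted(g_relation))      # ['A', 'B', 'C', 'D', 'E', 'G', 'J', 'K', 'L']
for key, attrs in relations.items():
    print(key, sorted(attrs))
```

The closure of G gives the attribute set of G_Relation in Figure 16, and the grouped relations have the same attribute sets as the six relations listed in Section 3.3.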
3.4 A Complete Normalization Example
The following is a complete example with multiple candidate keys.
Example 3: Consider the following case taken from [9]: Relation AB: {A, B, C, D, E, F, G, H} with dependencies:
FDs = {AB → CEFGH, A → D, F → G, BF → H, BCH → ADEFG, and BCF → ADE}.

[Figure 18 graph: composite keys AB, BF, BCH, BCF above the horizontal line; attributes A B C D E F G H below it]
Figure 18: Dependency graph for Example 3

Figure 19 shows the original DM for Figure 18.

       A  B  C  D  E  F  G  H
AB     2  2  1  0  1  1  1  1
A      2  0  0  1  0  0  0  0
F      0  0  0  0  0  2  1  0
BF     0  2  0  0  0  2  0  1
BCH    1  2  2  1  1  1  1  2
BCF    1  2  2  1  1  2  0  0
Figure 19: Initial Dependency Matrix for Figure 18

Figure 20 is the corresponding DG matrix.

       AB  A   F   BF  BCH BCF
AB      1   1   1   1   1   1
A      -1   1  -1  -1  -1  -1
F      -1  -1   1  -1  -1  -1
BF     -1  -1   1   1  -1  -1
BCH     1   1   1   1   1   1
BCF     1   1   1   1  -1   1
Figure 20: The DG matrix for Example 3

The path matrix is shown in Figure 21.

       AB  A   F   BF  BCH BCF
AB      1   1   1   1   1   1
A      -1   1  -1  -1  -1  -1
F      -1  -1   1  -1  -1  -1
BF     -1  -1   1   1  -1  -1
BCH     1   1   1   1   1   1
BCF     1   1   1   1   1   1
Figure 21: Determinant key transitive dependencies

New dependencies are applied to the DM, and Figure 22 is the semi-final result.

       A  B  C  D   E    F    G   H
AB     2  2  1  A   BCH  BCH  F   BF
A      2  0  0  1   0    0    0   0
F      0  0  0  0   0    2    1   0
BF     0  2  0  0   0    2    0   1
BCH    1  2  2  AB  AB   AB   AB  2
BCF    1  2  2  AB  AB   2    AB  AB
Figure 22: Dependency closure matrix

It is now time to restore direct dependencies which might have been replaced by transitive dependencies. The FindOne algorithm discovers all faded dependencies. Two such dependencies are shown in Figure 23, at the intersections of AB with F and of AB with E.

       A  B  C  D   E   F   G   H
AB     2  2  1  A   1   1   F   BF
A      2  0  0  1   0   0   0   0
F      0  0  0  0   0   2   1   0
BF     0  2  0  0   0   2   0   1
BCH    1  2  2  AB  AB  AB  AB  2
BCF    1  2  2  AB  AB  2   AB  AB
Figure 23: After Circular Dependency

3.4.1 Candidate Keys of Example 3
A candidate key is a set of attributes on which all other attributes completely depend. AB is a candidate key because every simple key depends on AB either directly or via a determinant key which is not qualified to be a candidate key. For the other potential candidate keys, BCH and BCF, there are some dependencies of simple keys which go via a potential candidate key. For example, C depends on BCH via AB, which is a potential candidate key. These kinds of dependencies have to be reexamined to check whether the dependency persists if the dependency through the potential candidate key is ignored. This can be done by applying the Dependency-closure routine to the initial DM of the relation. However, the Dependency-closure routine has to be modified for this purpose. We would like to ignore dependencies of a potential candidate key on another potential candidate key. To do so, the statement
  if (i != j && Path[i][j] != -1)
is replaced by
  if (i != j && Path[i][j] != -1 && j ∉ {Potential candidate key set})

The result is depicted in Figure 24. This figure shows the set of optimal dependencies and the real candidate keys.

       A  B  C  D  E  F  G  H
AB     2  2  1  A  1  1  F  BF
A      2  0  0  1  0  0  0  0
F      0  0  0  0  0  2  1  0
BF     0  2  0  0  0  2  F  1
BCH    1  2  2  A  1  1  F  2
BCF    1  2  2  A  1  2  F  BF
Figure 24: The set of optimal dependencies

In the following we deal with 2NF and 3NF. From now on we assume AB is the primary key.

3.4.2 2NF of Example 3
The normalization of 1NF relations to 2NF involves the removal of partial dependencies on the primary key. If a partial dependency exists, we remove the functionally dependent attributes from the relation by placing them in a new relation with a copy of their determinant. Having identified the functional dependencies, we continue the process of normalizing the relation. We begin by testing whether the relation is in 2NF by identifying the presence of any partial dependencies on the primary key. We see that the attribute D is partially dependent on part of the primary key, namely A. On the other hand, the attributes C, E, and F are fully dependent on the whole primary key. We note that H is not wholly dependent on part of the primary key AB and therefore does not violate 2NF. Hence, we need to create a new relation called A_Relation. As a result, the DM is also partitioned as follows.
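Before the partitioned matrices, the candidate keys and the partial dependency of Example 3 can be cross-checked with a small Python sketch. It relies on the textbook attribute-closure test rather than on the DM of Figure 24, assumes single-letter attributes, and only examines the determinant keys of the given FDs. The function names are illustrative.

```python
from itertools import combinations

ATTRS = set("ABCDEFGH")
FDS = [("AB", "CEFGH"), ("A", "D"), ("F", "G"),
       ("BF", "H"), ("BCH", "ADEFG"), ("BCF", "ADE")]


def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_candidate_key(key, fds, attrs):
    """True if key determines every attribute and no proper subset of it does."""
    if closure(key, fds) != attrs:
        return False
    return all(closure(sub, fds) != attrs
               for r in range(1, len(key))
               for sub in combinations(key, r))


# Candidate keys among the determinant keys of the FD set
candidates = [lhs for lhs, _ in FDS if is_candidate_key(lhs, FDS, ATTRS)]

# Attributes determined by a proper subset of the primary key AB
primary_key = "AB"
partial = {lhs: closure(lhs, FDS) - set(lhs)
           for lhs, _ in FDS if set(lhs) < set(primary_key)}

print(candidates)   # ['AB', 'BCH', 'BCF']
print(partial)      # {'A': {'D'}}
```

Under this definition all three of AB, BCH, and BCF qualify as candidate keys, and the only partial dependency on the primary key AB is A → D, which is the dependency moved into A_Relation.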
       A  B  C  E  F  G  H
AB     2  2  1  1  1  F  BF
F      0  0  0  0  2  1  0
BF     0  2  0  0  2  F  1
BCH    1  2  2  1  1  F  2
BCF    1  2  2  1  2  F  BF
(a): AB_Relation: {AB, C, E, F, G, H}

     A  D
A    2  1
(b): A_Relation: {A, D}
Figure 25: Database normalized up to 2NF

3.4.3 3NF of Example 3
The normalization of a 2NF table to 3NF involves the removal of transitive dependencies. If a transitive dependency exists, we remove the transitively dependent attributes from the relation by placing them in a new relation along with a copy of their determinant. First, we examine the functional dependencies within the A and AB relations, which are shown in Figure 25.
The A_Relation does not have transitive dependencies on the primary key. However, although all the non-primary-key attributes within the AB_Relation are functionally dependent on the primary key, G is also dependent on F. This is an example of a transitive dependency, which occurs when a non-primary-key attribute is dependent on another non-primary-key attribute. Although BF → H, BF is not a non-primary key (as B is part of the primary key). Therefore, we do not remove this dependency at this stage. In other words, this dependency is not wholly transitively dependent on a non-primary-key attribute and therefore does not violate 3NF. To transform the AB_Relation into third normal form, we must first remove the transitive dependency. This is done by creating two new relations called F_Relation and AB_Relation [9]. The resulting 3NF relations are shown in Figure 26.

       A  B  C  E  F  H
AB     2  2  1  1  1  BF
BF     0  2  0  0  2  1
BCH    1  2  2  1  1  2
BCF    1  2  2  1  2  BF
(a): AB_Relation: {AB, C, E, F, H}

     A  D
A    2  1
(b): A_Relation: {A, D}

     F  G
F    2  1
(c): F_Relation: {F, G}
Figure 26: Third normal form of Example 3

3.4.4 Boyce-Codd Normal Form (BCNF)
We now examine the A, AB, and F relations to determine whether they are in BCNF. A relation is in BCNF if every determinant of the relation is a candidate key. Therefore, to test for BCNF, we simply identify all the determinants and make sure they are candidate keys. We can see from DM (b) and DM (c) in Figure 26 that their relations are already in BCNF. To transform AB_Relation into BCNF, we must remove the dependency that violates BCNF by creating one new relation for BF → H. The resulting BCNF relations are shown in Figure 27.
In this example, the decomposition of the original AB_Relation into BCNF relations has resulted in the 'loss' of the functional dependency BCH → AEF. On the other hand, we recognize that if the functional dependency BF → H is not removed, the AB_Relation will have data redundancy [9]. In practice, some designers stop at 3NF and do not proceed to BCNF, in which case some redundancies may remain in the designed database.

       A  B  C  E  F
AB     2  2  1  1  1
BCF    1  2  2  1  2
(a): AB_Relation: {AB, C, E, F}

      B  F  H
BF    2  2  1
(b): BF_Relation: {BF, H}

     A  D
A    2  1
(c): A_Relation: {A, D}

     F  G
F    2  1
(d): F_Relation: {F, G}
Figure 27: Database normalized up to BCNF

4. CONCLUSION
A new, complete, automated relational database normalization method is presented. The process is based on the generation of the dependency matrix, the directed graph matrix, and the determinant key transitive dependency matrix. The details of the methods for 2NF, 3NF, and BCNF are discussed. Two examples, one without multiple candidate keys and one with multiple candidate keys, are considered, and the defined algorithms are applied to produce the desired final tables. A nice feature of the developed algorithms is the automatic identification of one primary key for every final table that is generated. We believe that the algorithms are very efficient. However, a comparison of our algorithms with other similar algorithms is left for future work.

5. REFERENCES
[1] M. Arenas, L. Libkin, An Information-Theoretic Approach to Normal Forms for Relational and XML Data, Journal of the ACM (JACM), Vol. 52(2), pp. 246-283, 2005.
[2] Kolahi, S., Dependency-Preserving Normalization of Relational and XML Data, Journal of Computer System Science, Vol. 73(4), pp. 636-647, 2007.
[3] Mora, A., M. Enciso, P. Cordero, IP de Guzman, An Efficient Preprocessing Transformation for Functional Dependencies Sets Based on the Substitution Paradigm, CAEPIA2003, pp. 136-146, 2003.
[4] Du H., and L. Wery, A Normalization Tool for Relational Database Designers, Journal of Network and Computer Applications, Volume 22, No. 4, pp. 215-232, October 1999.
[5] Yazici, A., and Z. Karakaya, Normalizing Relational Database Schemas Using Mathematica, LNCS, Springer-Verlag, Vol. 3992, pp. 375-382, 2006.
[6] Kung, H. and T. Case, Traditional and Alternative Database Normalization Techniques: Their Impacts on IS/IT Students' Perceptions and Performance, International Journal of Information Technology Education, Vol. 1, No. 1, pp. 53-76, 2004.
[7] Akehurst, D.H., B. Bordbar, P.J. Rodgers, and N.T.G. Dalgliesh, Automatic Normalization via Metamodelling, ASE 2002 Workshop on Declarative Meta Programming to Support Software Development, 2002.
[8] Date, C.J., An Introduction to Database Systems, Addison-Wesley, Seventh Edition, 2000.
[9] Connolly, Thomas, and Carolyn Begg, Database Systems: A Practical Approach to Design, Implementation, and Management, Pearson Education, Third Edition, 2005.