Chapter 5
Functional Dependency and Normalization
This chapter discusses some of the theory that has been developed to evaluate relational schemas for design quality, that is, to measure formally why one grouping of attributes into relation schemas is better than another.

There are two levels at which we can discuss the "goodness" of relation schemas:
1. The logical (or conceptual) level
• How users interpret the relation schemas and the meaning of their attributes.
• Having good relation schemas at this level enables users to understand clearly the meaning of the data in the relations, and hence to formulate their queries correctly.
2. The implementation (or storage) level
• How the tuples in a base relation are stored and updated.

Database design may be performed using two approaches:
A) Bottom-up
• A bottom-up design methodology (also called design by synthesis) considers the basic relationships among individual attributes as the starting point and uses those to construct relation schemas.
B) Top-down
• Starts with a number of groupings of attributes into relations that exist together naturally, for example, on an invoice, a form, or a report.
• The relations are then analyzed individually and collectively, leading to further decomposition until all desirable properties are met.

INFORMAL DESIGN GUIDELINES FOR RELATION SCHEMAS
Four informal measures of quality for relation schema design
• Semantics of the attributes
• Reducing the redundant values in tuples
• Reducing the null values in tuples
• Disallowing the possibility of generating spurious tuples

Whenever we group attributes to form a relation schema, we assume that attributes belonging to one relation have a certain real-world meaning and a proper interpretation associated with them. This meaning, or semantics, specifies how to interpret the attribute values stored in a tuple of the relation, in other words, how the attribute values in a tuple relate to one another. If the conceptual design is done carefully, followed by a systematic mapping into relations, most of the semantics will have been accounted for and the resulting design should have a clear meaning.

Example
The meaning of the EMPLOYEE relation schema is quite simple: each tuple represents an employee, with values for the employee's name (ENAME), social security number (SSN), birth date (BDATE), and address (ADDRESS), and the number of the department that the employee works for (DNUMBER). The DNUMBER attribute is a foreign key that represents an implicit relationship between EMPLOYEE and DEPARTMENT. The semantics of the DEPARTMENT and PROJECT schemas are also straightforward: each DEPARTMENT tuple represents a department entity, and each PROJECT tuple represents a project entity. The attribute DMGRSSN of DEPARTMENT relates a department to the employee who is its manager, while DNUM of PROJECT relates a project to its controlling department; both are foreign key attributes. The ease with which the meaning of a relation's attributes can be explained is an informal measure of how well the relation is designed.

GUIDELINE 1
• Design a relation schema so that it is easy to explain its meaning. Do not combine attributes from multiple entity types and relationship types into a single relation.
• Each tuple of the EMP_DEPT relation of Figure 10.3a represents a single employee but includes additional information, namely the name (DNAME) of the department for which the employee works and the social security number (DMGRSSN) of the department manager.
• For the EMP_PROJ relation of Figure 10.3b, each tuple relates an employee to a project but also includes the employee name (ENAME), project name (PNAME), and project location (PLOCATION).
• Although there is nothing wrong logically with these two relations, they are considered poor designs because they violate Guideline 1 by mixing attributes from distinct real-world entities.
• EMP_DEPT mixes attributes of employees and departments, and EMP_PROJ mixes attributes of employees and projects.

Redundant Information in Tuples and Update Anomalies
One goal of schema design is to minimize the storage space used by the base relations. Grouping attributes into relation schemas has a significant effect on storage space. For example, compare the space used by the two base relations EMPLOYEE and DEPARTMENT with that for an EMP_DEPT base relation. In EMP_DEPT, the attribute values pertaining to a particular department (DNUMBER, DNAME, DMGRSSN) are repeated for every employee who works for that department. In contrast, in the design that keeps EMPLOYEE and DEPARTMENT as separate relations, each department's information appears only once, in the DEPARTMENT relation. Only the department number (DNUMBER) is repeated in the EMPLOYEE relation for each employee who works in that department. Similar comments apply to the EMP_PROJ relation, which augments the WORKS_ON relation with additional attributes from EMPLOYEE and PROJECT.

Another serious problem with using the relations in Figure 10.4 as base relations is the problem of update anomalies. These can be classified into insertion anomalies, deletion anomalies, and modification anomalies.

Insertion Anomalies.
Insertion anomalies can be differentiated into two types, illustrated by the following examples based on the EMP_DEPT relation:
• To insert a new employee tuple into EMP_DEPT, we must include either the attribute values for the department that the employee works for, or nulls (if the employee does not work for a department as yet). For example, to insert a new tuple for an employee who works in department number 5, we must enter the attribute values of department 5 correctly so that they are consistent with the values for department 5 in other tuples in EMP_DEPT. In the design of Figure 10.2 we do not have to worry about this consistency problem because we enter only the department number in the employee tuple; all other attribute values of department 5 are recorded only once in the database, as a single tuple in the DEPARTMENT relation.
• It is difficult to insert a new department that has no employees as yet in the EMP_DEPT relation. The only way to do this is to place null values in the attributes for employee. This causes a problem because SSN is the primary key of EMP_DEPT, and each tuple is supposed to represent an employee entity, not a department entity. This problem does not occur in the design of Figure 10.2, because a department is entered in the DEPARTMENT relation whether or not any employees work for it, and whenever an employee is assigned to that department, a corresponding tuple is inserted in EMPLOYEE.

Deletion Anomalies: The problem of deletion anomalies is related to the second insertion anomaly situation discussed earlier. If we delete from EMP_DEPT an employee tuple that happens to represent the last employee working for a particular department, the information concerning that department is lost from the database. This problem does not occur in the database of Figure 10.2 because DEPARTMENT tuples are stored separately.
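The deletion anomaly described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, employee names, and the helper functions departments_in and delete_employee are hypothetical and do not come from the figures.

```python
# Hypothetical sample data; the values are illustrative, not taken from the figures.
emp_dept = [
    {"SSN": "111", "ENAME": "Ann", "DNUMBER": 5, "DNAME": "Research", "DMGRSSN": "111"},
    {"SSN": "222", "ENAME": "Bob", "DNUMBER": 5, "DNAME": "Research", "DMGRSSN": "111"},
    {"SSN": "333", "ENAME": "Cal", "DNUMBER": 4, "DNAME": "Admin", "DMGRSSN": "333"},
]


def departments_in(rows):
    """Return the set of department numbers that appear in the given rows."""
    return {row["DNUMBER"] for row in rows}


def delete_employee(rows, ssn):
    """Return a new list of rows without the employee identified by ssn."""
    return [row for row in rows if row["SSN"] != ssn]


# Single-relation design: deleting the last employee of department 4
# also removes every trace of department 4.
after_single = delete_employee(emp_dept, "333")
lost = departments_in(emp_dept) - departments_in(after_single)

# Two-relation design: department facts are stored once, in their own relation.
employee = [
    {"SSN": "111", "ENAME": "Ann", "DNUMBER": 5},
    {"SSN": "222", "ENAME": "Bob", "DNUMBER": 5},
    {"SSN": "333", "ENAME": "Cal", "DNUMBER": 4},
]
department = [
    {"DNUMBER": 5, "DNAME": "Research", "DMGRSSN": "111"},
    {"DNUMBER": 4, "DNAME": "Admin", "DMGRSSN": "333"},
]
after_split = delete_employee(employee, "333")
departments_kept = departments_in(department)  # untouched by the employee delete

print("lost in EMP_DEPT design:", lost)
print("still recorded in DEPARTMENT:", departments_kept)
```

In the single-relation case the department disappears as a side effect of an employee delete; in the split design the DEPARTMENT rows are not affected at all.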
Modification Anomalies: In EMP_DEPT, if we change the value of one of the attributes of a particular department (say, the manager of department 5) we must update the tuples of all employees who work in that department; otherwise, the database will become inconsistent. If we fail to update some tuples, the same department will be shown to have two different values for manager in different employee tuples, which would be wrong.

Based on the preceding three anomalies, we can state the following guideline.

GUIDELINE 2
• Design the base relation schemas so that no insertion, deletion, or modification anomalies are present in the relations. If any anomalies are present, note them clearly and make sure that the programs that update the database will operate correctly.

Null Values in Tuples
In some schema designs we may group many attributes together into a "fat" relation. If many of the attributes do not apply to all tuples in the relation, we end up with many nulls in those tuples. This can waste space at the storage level and may also lead to problems with understanding the meaning of the attributes and with specifying JOIN operations at the logical level. Another problem with nulls is how to account for them when aggregate operations such as COUNT or SUM are applied. Nulls can have multiple interpretations, such as the following:
• The attribute does not apply to this tuple.
• The attribute value for this tuple is unknown.
• The value is known but absent; that is, it has not been recorded yet.
Having the same representation for all nulls compromises the different meanings they may have. Therefore, we may state another guideline.

GUIDELINE 3
• As far as possible, avoid placing attributes in a base relation whose values may frequently be null. If nulls are unavoidable, make sure that they apply in exceptional cases only and do not apply to a majority of tuples in the relation.

FUNCTIONAL DEPENDENCIES
A functional dependency is a constraint between two sets of attributes from the database. Suppose that our relational database schema has n attributes A1, A2, ..., An, and think of the whole database as being described by a single relation schema R = {A1, A2, ..., An}.

Definition: A functional dependency, denoted by X -> Y, between two sets of attributes X and Y that are subsets of R specifies a constraint on the possible tuples that can form a relation state r of R. This means that the values of the Y component of a tuple in r depend on, or are determined by, the values of the X component; alternatively, the values of the X component of a tuple uniquely (or functionally) determine the values of the Y component.
• We also say that there is a functional dependency from X to Y, or that Y is functionally dependent on X.
• The abbreviation for functional dependency is FD or f.d. The set of attributes X is called the left-hand side of the FD, and Y is called the right-hand side.
• Thus, X functionally determines Y in a relation schema R if, and only if, whenever two tuples of r(R) agree on their X-value, they must necessarily agree on their Y-value.

Note the following:
• If a constraint on R states that there cannot be more than one tuple with a given X value in any relation instance r(R), that is, X is a candidate key of R, this implies that X -> Y for any subset of attributes Y of R (because the key constraint implies that no two tuples in any legal state r(R) will have the same value of X).
• If X -> Y in R, this does not say whether or not Y -> X in R.

A functional dependency is a property of the semantics or meaning of the attributes. Consider the relation schema EMP_PROJ. From the semantics of the attributes, we know that the following functional dependencies should hold:
a. SSN -> ENAME
b. PNUMBER -> {PNAME, PLOCATION}
c. {SSN, PNUMBER} -> HOURS
These functional dependencies specify that (a) the value of an employee's social security number (SSN) uniquely determines the employee name (ENAME), (b) the value of a project's number (PNUMBER) uniquely determines the project name (PNAME) and location (PLOCATION), and (c) a combination of SSN and PNUMBER values uniquely determines the number of hours the employee currently works on the project (HOURS).

Example: Consider a relation state of the TEACH relation schema. We cannot confirm an FD such as TEACHER -> COURSE from one state alone; we can confirm it only if we know that it is true for all possible legal states of TEACH. It is, however, sufficient to demonstrate a single counterexample to disprove a functional dependency. For example, because 'Smith' teaches both 'Data Structures' and 'Data Management', we can conclude that TEACHER does not functionally determine COURSE.

Figure 10.3 introduces a diagrammatic notation for displaying FDs: each FD is displayed as a horizontal line. The left-hand-side attributes of the FD are connected by vertical lines to the line representing the FD, while the right-hand-side attributes are connected by arrows pointing toward the attributes, as shown in Figures 10.3a and 10.3b.

Inference Rules for Functional Dependencies
We denote by F the set of functional dependencies that are specified on relation schema R. Typically, the schema designer specifies the functional dependencies that are semantically obvious, but numerous other functional dependencies still hold in all legal relation instances that satisfy the dependencies in F. Those other dependencies can be inferred or deduced from the FDs in F. For example, if each department has one manager, so that DEPT_NO uniquely determines MGR_SSN (DEPT_NO -> MGR_SSN), and a manager has a unique phone number called MGR_PHONE (MGR_SSN -> MGR_PHONE), then these two dependencies together imply that DEPT_NO -> MGR_PHONE. This is an inferred FD and need not be explicitly stated in addition to the two given FDs.
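The tuple-agreement definition and the TEACH counterexample above can be sketched in Python. This is a minimal sketch under stated assumptions: the function name fd_holds and the sample rows are hypothetical; only the two 'Smith' rows reflect the example in the text.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether lhs -> rhs holds in this one relation state.

    A True result only means no counterexample exists in these rows;
    it does not prove the FD for all legal states of the schema.
    """
    seen = {}  # maps each X-value to the first Y-value observed with it
    for row in rows:
        x_value = tuple(row[a] for a in lhs)
        y_value = tuple(row[a] for a in rhs)
        # Two tuples agreeing on X but not on Y form a counterexample.
        if seen.setdefault(x_value, y_value) != y_value:
            return False
    return True


# Hypothetical state of TEACH; the third row is made up for illustration.
teach = [
    {"TEACHER": "Smith", "COURSE": "Data Structures"},
    {"TEACHER": "Smith", "COURSE": "Data Management"},
    {"TEACHER": "Hall", "COURSE": "Compilers"},
]

teacher_determines_course = fd_holds(teach, ["TEACHER"], ["COURSE"])
course_determines_teacher = fd_holds(teach, ["COURSE"], ["TEACHER"])

print(teacher_determines_course)  # False: the two Smith rows disagree on COURSE
print(course_determines_teacher)  # True for this state only
```

The asymmetry is the point: one state can refute an FD, but a state that happens to satisfy it says nothing about the other legal states.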
Definition: Formally, the set of all dependencies that include F as well as all dependencies that can be inferred from F is called the closure of F; it is denoted by F+.
• We use the notation F |= X -> Y to denote that the functional dependency X -> Y is inferred from the set of functional dependencies F.
• For example, suppose that we specify the following set F of obvious functional dependencies on the relation schema of Figure 10.3a:
F = {SSN -> {ENAME, BDATE, ADDRESS, DNUMBER}, DNUMBER -> {DNAME, DMGRSSN}}
Some of the additional functional dependencies that we can infer from F are the following:
SSN -> {DNAME, DMGRSSN}
SSN -> SSN
DNUMBER -> DNAME

The following six rules IR1 through IR6 are well-known inference rules for functional dependencies:
IR1 (reflexive rule): If X ⊇ Y, then X -> Y.
IR2 (augmentation rule): {X -> Y} |= XZ -> YZ.
IR3 (transitive rule): {X -> Y, Y -> Z} |= X -> Z.
IR4 (decomposition, or projective, rule): {X -> YZ} |= X -> Y.
IR5 (union, or additive, rule): {X -> Y, X -> Z} |= X -> YZ.
IR6 (pseudotransitive rule): {X -> Y, WY -> Z} |= WX -> Z.
Reading assignment: prove each of these rules.
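One common way to test whether F |= X -> Y is to compute the closure of the attribute set X under F and check whether Y is contained in it. The slides do not describe this procedure, so the sketch below is an added illustration rather than the chapter's own method; the function names attribute_closure and is_implied are assumptions. The set F is the one given above for the schema of Figure 10.3a.

```python
def attribute_closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds.

    fds is a list of (lhs, rhs) pairs, each side a set of attribute names.
    """
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left-hand side is already determined, so is the right-hand side.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure


def is_implied(lhs, rhs, fds):
    """Return True if lhs -> rhs can be inferred from fds."""
    return set(rhs) <= attribute_closure(lhs, fds)


# F as specified for the schema of Figure 10.3a.
F = [
    ({"SSN"}, {"ENAME", "BDATE", "ADDRESS", "DNUMBER"}),
    ({"DNUMBER"}, {"DNAME", "DMGRSSN"}),
]

ssn_closure = attribute_closure({"SSN"}, F)

print(is_implied({"SSN"}, {"DNAME", "DMGRSSN"}, F))  # inferred via transitivity
print(is_implied({"SSN"}, {"SSN"}, F))               # inferred via reflexivity
print(is_implied({"DNUMBER"}, {"DNAME"}, F))         # inferred via decomposition
print(is_implied({"DNUMBER"}, {"SSN"}, F))           # not inferable from F
```

The loop applies the given FDs repeatedly until nothing new is added, which mirrors chaining the reflexive, augmentation, and transitive rules by hand.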
Normalization of Relations
• The normalization process was first proposed by Codd (1972a).
• The normalization process, which proceeds in a top-down fashion by evaluating each relation against the criteria for normal forms and decomposing relations as necessary, is considered relational design by analysis.
• Normalization of data can be looked upon as a process of analyzing the given relation schemas based on their FDs and primary keys to achieve the desirable properties of:
1. minimizing redundancy
2. minimizing the insertion, deletion, and update anomalies
– Unsatisfactory relation schemas that do not meet certain conditions (the normal form tests) are decomposed into smaller relation schemas that meet the tests and hence possess the desirable properties.

Definitions of Keys and Attributes Participating in Keys
• Definition: A superkey of a relation schema R = {A1, A2, ..., An} is a set of attributes S ⊆ R with the property that no two tuples t1 and t2 in any legal relation state r of R will have t1[S] = t2[S].
• A key K is a superkey with the additional property that removal of any attribute from K will cause K not to be a superkey any more.
• The difference between a key and a superkey is that a key has to be minimal; that is, if we have a key K = {A1, A2, ..., Ak} of R, then K - {Ai} is not a superkey of R for any Ai, 1 <= i <= k.
• Ex: {SSN} is a key for EMPLOYEE, whereas {SSN, ENAME}, {SSN, ENAME, BDATE}, and any set of attributes that includes SSN are all superkeys.
• If a relation schema has more than one key, each is called a candidate key.
• One of the candidate keys is arbitrarily designated to be the primary key, and the others are called secondary keys.
• Definition: An attribute of relation schema R is called a prime attribute of R if it is a member of some candidate key of R.
• An attribute is called nonprime if it is not a prime attribute, that is, if it is not a member of any candidate key.
• Ex: both SSN and PNUMBER are prime attributes of WORKS_ON, whereas the other attributes of WORKS_ON are nonprime.

First Normal Form
• First normal form (1NF) is now considered to be part of the formal definition of a relation in the basic (flat) relational model.
• Historically, it was defined to disallow multivalued attributes, composite attributes, and their combinations.
• It states that the domain of an attribute must include only atomic (simple, indivisible) values and that the value of any attribute in a tuple must be a single value from the domain of that attribute.
• Hence, 1NF disallows having a set of values, a tuple of values, or a combination of both as an attribute value for a single tuple.
• In other words, 1NF disallows "relations within relations" or "relations as attribute values within tuples." The only attribute values permitted by 1NF are single atomic (or indivisible) values.
• Consider the DEPARTMENT relation schema shown in Figure 10.1, whose primary key is DNUMBER, and suppose that we extend it by including the DLOCATIONS attribute as shown in Figure 10.8a. We assume that each department can have a number of locations.
• As can be seen, this is not in 1NF because DLOCATIONS is not an atomic attribute, as illustrated by the first tuple in Figure 10.8b. There are two ways we can look at the DLOCATIONS attribute:
– The domain of DLOCATIONS contains atomic values, but some tuples can have a set of these values. In this case, DLOCATIONS is not functionally dependent on the primary key DNUMBER.
– The domain of DLOCATIONS contains sets of values and hence is nonatomic. In this case, DNUMBER -> DLOCATIONS, because each set is considered a single member of the attribute domain.
• In either case, the DEPARTMENT relation of Figure 10.8 is not in 1NF.

Techniques to achieve first normal form for such a relation:
1. Remove the attribute DLOCATIONS that violates 1NF and place it in a separate relation DEPT_LOCATIONS along with the primary key DNUMBER of DEPARTMENT.
The primary key of this relation is the combination {DNUMBER, DLOCATION}, as shown in Figure 10.2. A distinct tuple in DEPT_LOCATIONS exists for each location of a department. This decomposes the non-1NF relation into two 1NF relations.
2. Expand the key so that there will be a separate tuple in the original DEPARTMENT relation for each location of a DEPARTMENT, as shown in Figure 10.8c. In this case, the primary key becomes the combination {DNUMBER, DLOCATION}. This solution has the disadvantage of introducing redundancy in the relation.
3. If a maximum number of values is known for the attribute (for example, if it is known that at most three locations can exist for a department) replace the DLOCATIONS attribute by three atomic attributes: DLOCATION1, DLOCATION2, and DLOCATION3. This solution has the disadvantage of introducing null values if most departments have fewer than three locations.

Second Normal Form
• Second normal form (2NF) is based on the concept of full functional dependency.
• A functional dependency X -> Y is a full functional dependency if removal of any attribute A from X means that the dependency does not hold any more; that is, for any attribute A ∈ X, (X - {A}) does not functionally determine Y.
• A functional dependency X -> Y is a partial dependency if some attribute A ∈ X can be removed from X and the dependency still holds; that is, for some A ∈ X, (X - {A}) -> Y.
• For example, in Figure 10.3b, {SSN, PNUMBER} -> HOURS is a full dependency (neither SSN -> HOURS nor PNUMBER -> HOURS holds).
• However, the dependency {SSN, PNUMBER} -> ENAME is partial because SSN -> ENAME holds.
• Definition: A relation schema R is in 2NF if every nonprime attribute A in R is fully functionally dependent on the primary key of R.
• The test for 2NF involves testing for functional dependencies whose left-hand side attributes are part of the primary key.
• If the primary key contains a single attribute, the test need not be applied at all.
• The EMP_PROJ relation in Figure 10.3b is in 1NF but is not in 2NF. The nonprime attribute ENAME violates 2NF because of FD2 (SSN -> ENAME), as do the nonprime attributes PNAME and PLOCATION because of FD3 (PNUMBER -> {PNAME, PLOCATION}).
• The functional dependencies FD2 and FD3 make ENAME, PNAME, and PLOCATION partially dependent on the primary key {SSN, PNUMBER} of EMP_PROJ, thus violating the 2NF test.
• If a relation schema is not in 2NF, it can be "second normalized" or "2NF normalized" into a number of 2NF relations in which nonprime attributes are associated only with the part of the primary key on which they are fully functionally dependent.
• The functional dependencies FD1, FD2, and FD3 in Figure 10.3b hence lead to the decomposition of EMP_PROJ into the three relation schemas EP1, EP2, and EP3.

Third Normal Form
• Reading assignment