Unit 4

This document covers SQL and data normalization, detailing various database languages including DDL, DML, DCL, and TCL, along with their commands and functions. It explains aggregate functions, grouping with GROUP BY and HAVING clauses, nested subqueries, and the concept of views in SQL. Additionally, the document discusses normalization, its importance in eliminating anomalies, and the process of achieving different normal forms to ensure data integrity and minimize redundancy.

Uploaded by

iqacgfgch2024

Unit 4: SQL and Data Normalization


4.1 Database Languages
Data Definition Language (DDL) is used by the DBA and by database designers to define schemas. The DBMS has a
DDL compiler that processes DDL statements, identifies the descriptions of the schema constructs, and stores the
schema description in the DBMS catalog.
A schema is created with the CREATE SCHEMA statement, which can include the definitions of all the schema elements.
Alternatively, the schema can be assigned a name and authorization identifier, and the elements can be defined
later. For example, a single CREATE SCHEMA statement can create a schema called COMPANY, owned by the user with
authorization identifier ‘Jsmith’. Note that each statement in SQL ends with a semicolon.
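The COMPANY example described above can be written as one statement. This is a sketch in the textbook style of SQL; support for the AUTHORIZATION clause, and whether the identifier is written with or without quotes, varies between DBMSs.

```sql
-- Create a schema named COMPANY owned by the authorization identifier 'Jsmith'.
-- Many systems expect the identifier unquoted or in double quotes instead.
CREATE SCHEMA COMPANY AUTHORIZATION 'Jsmith';
```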

The CREATE TABLE command is used to specify a new relation by giving it a name and specifying its attributes and
initial constraints.

(OR)

DDL also allows us to specify different constraints (write about PRIMARY KEY and FOREIGN KEY constraints).
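As an illustration of CREATE TABLE with PRIMARY KEY and FOREIGN KEY constraints, the sketch below defines two COMPANY tables. The attribute names follow the EMPLOYEE and DEPARTMENT schemas of Figure 4.2; the data types, the NOT NULL choices, and the Salary column are assumptions made for the example.

```sql
-- DEPARTMENT is created first so that EMPLOYEE can reference it.
CREATE TABLE DEPARTMENT (
    Dnumber   INT          NOT NULL,
    Dname     VARCHAR(15)  NOT NULL,
    Dmgr_ssn  CHAR(9),
    PRIMARY KEY (Dnumber)          -- department numbers are unique and never NULL
);

CREATE TABLE EMPLOYEE (
    Ssn      CHAR(9)        NOT NULL,
    Ename    VARCHAR(30)    NOT NULL,
    Bdate    DATE,
    Address  VARCHAR(30),
    Salary   DECIMAL(10,2),            -- assumed column, not shown in Figure 4.2
    Dno      INT,
    PRIMARY KEY (Ssn),
    -- Dno must either be NULL or match an existing DEPARTMENT.Dnumber.
    FOREIGN KEY (Dno) REFERENCES DEPARTMENT (Dnumber)
);
```

The PRIMARY KEY clause enforces entity integrity (unique, non-NULL key values), while the FOREIGN KEY clause enforces referential integrity between the two tables.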

Data Manipulation Language (DML)


There are two main types of DMLs. A high-level or nonprocedural DML can be used on its own to specify complex
database operations concisely. Many DBMSs allow high-level DML statements either to be entered interactively from
a display monitor or terminal or to be embedded in a general-purpose programming language. In the latter case,
DML statements must be identified within the program so that they can be extracted by a precompiler and processed
by the DBMS. A low-level or procedural DML must be embedded in a general-purpose programming language. This
type of DML typically retrieves individual records or objects from the database and processes each separately.

(Explain SELECT, INSERT, DELETE, and UPDATE)
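The four basic DML statements might look as follows. This is a minimal sketch that assumes a table EMPLOYEE(Ssn, Ename, Salary, Dno); the table layout and all data values are made up for illustration.

```sql
-- INSERT adds a new tuple to the table.
INSERT INTO EMPLOYEE (Ssn, Ename, Salary, Dno)
VALUES ('123456789', 'John Smith', 30000, 5);

-- SELECT retrieves the tuples that satisfy a condition.
SELECT Ename, Salary
FROM   EMPLOYEE
WHERE  Dno = 5;

-- UPDATE changes attribute values in the selected tuples
-- (here, a 10% raise for everyone in department 5).
UPDATE EMPLOYEE
SET    Salary = Salary * 1.10
WHERE  Dno = 5;

-- DELETE removes the selected tuples.
DELETE FROM EMPLOYEE
WHERE  Ssn = '123456789';
```

All four are high-level (nonprocedural) statements: each one says which tuples are affected, not how to find them.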

Data Control Language (DCL)


These are the SQL commands that give database access privileges to users and withdraw them.
■ The GRANT command gives access privileges or permissions such as ALL, SELECT, and EXECUTE on database
objects such as views and tables.
■ The REVOKE command withdraws access privileges or permissions given with the GRANT command.
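A short sketch of the two commands, assuming a table named EMPLOYEE and a hypothetical user account Jsmith:

```sql
-- Allow Jsmith to read EMPLOYEE and to insert new tuples into it.
GRANT SELECT, INSERT ON EMPLOYEE TO Jsmith;

-- Later, withdraw only the INSERT privilege; SELECT remains granted.
REVOKE INSERT ON EMPLOYEE FROM Jsmith;
```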

Transaction Control Language (TCL)


Transactions group a set of tasks into a single execution unit. Each transaction begins with
a specific task and ends when all the tasks in the group successfully complete. If any of the
tasks fail, the transaction fails. Therefore, a transaction has only two results: success or
failure. The commands used in TCL are:
■ COMMIT: commits a transaction, making its changes permanent.
■ ROLLBACK: rolls back a transaction if an error occurs.
■ SAVEPOINT: sets a save point within a transaction.
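The three commands can be seen working together in the sketch below. It assumes a table EMPLOYEE(Ssn, Salary, Dno) and a made-up savepoint name; the statement that opens a transaction differs between systems (some use BEGIN, others start a transaction implicitly).

```sql
START TRANSACTION;

-- First task: raise salaries in department 5.
UPDATE EMPLOYEE SET Salary = Salary + 1000 WHERE Dno = 5;

SAVEPOINT after_dept5;                -- mark a point we can return to

-- Second task: raise salaries in department 4.
UPDATE EMPLOYEE SET Salary = Salary + 1000 WHERE Dno = 4;

ROLLBACK TO SAVEPOINT after_dept5;    -- undo only the department 4 update

COMMIT;                               -- make the department 5 update permanent
```

A plain ROLLBACK (without a savepoint) would instead undo every change made since the transaction started.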


4.2 Aggregate Functions in SQL
Aggregate functions are used to summarize information from multiple tuples into a single-tuple summary.
A number of built-in aggregate functions exist: COUNT, SUM, MAX, MIN, and AVG.
Example 1: Find the sum of the salaries of all employees, the maximum salary, the minimum salary, and the
average salary
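One way to write this query, assuming an EMPLOYEE table with a numeric Salary column (the column name is an assumption):

```sql
-- Each aggregate collapses all EMPLOYEE tuples into one value,
-- so the result is a single tuple with four columns.
SELECT SUM(Salary), MAX(Salary), MIN(Salary), AVG(Salary)
FROM   EMPLOYEE;
```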

Example 2: Retrieve the number of employees in the ‘Research’ department.
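A possible formulation, assuming EMPLOYEE(…, Dno) and DEPARTMENT(Dnumber, Dname, …) as in Figure 4.2:

```sql
-- Join each employee to its department, keep only 'Research',
-- and count the tuples that remain.
SELECT COUNT(*)
FROM   EMPLOYEE, DEPARTMENT
WHERE  Dno = Dnumber
  AND  Dname = 'Research';
```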

4.3 Grouping: The GROUP BY and HAVING Clauses


Grouping is used to create subgroups of tuples before summarization. For example, we may want to find the average
salary of employees in each department or the number of employees who work on each project. In these cases we
need to partition the relation into nonoverlapping subsets (or groups) of tuples. Each group (partition) will consist of
the tuples that have the same value of some attribute(s), called the grouping attribute(s). We can then apply the
function to each such group independently to produce summary information about each group.
Example 1: For each department, retrieve the department number, the number of employees in the department,
and their average salary
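This query can be sketched as follows, assuming EMPLOYEE has the columns Dno and Salary (Salary is an assumed name):

```sql
-- One group per distinct Dno value; COUNT and AVG are applied to each group.
SELECT   Dno, COUNT(*), AVG(Salary)
FROM     EMPLOYEE
GROUP BY Dno;
```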

The results are:

If NULLs exist in the grouping attribute, then a separate group is created for all tuples with a NULL value in the
grouping attribute. For example, if the EMPLOYEE table had some tuples that had NULL for the grouping attribute
Dno, there would be a separate group for those tuples in the result above.
Example 2: For each project, retrieve the project number, the project name, and the number of employees who
work on that project.
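A sketch of this query, assuming PROJECT(Pnumber, Pname, …) and a WORKS_ON table whose project column is named Pno (the textbook-style name Pno is an assumption; Figure 4.2 calls it Pnumber):

```sql
-- The join pairs every project with its WORKS_ON tuples;
-- grouping then yields one row per project.
SELECT   Pnumber, Pname, COUNT(*)
FROM     PROJECT, WORKS_ON
WHERE    Pnumber = Pno
GROUP BY Pnumber, Pname;
```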

The above SQL statement shows how we can use a join condition in conjunction with GROUP BY. In this case, the
grouping and functions are applied after the joining of the two relations. Sometimes we want to retrieve the values
of these functions only for groups that satisfy certain conditions. For example, suppose that we want to modify the
above SQL statement so that only projects with more than two employees appear in the result. SQL provides a
HAVING clause, which can appear in conjunction with a GROUP BY clause, for this purpose. HAVING provides a
condition on the summary information regarding the group of tuples associated with each value of the grouping
attributes. Only the groups that satisfy the condition are retrieved in the result of the query. This is illustrated below:
Example 3: For each project on which more than two employees work, retrieve the project number, the project
name, and the number of employees who work on the project.
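With the same assumed tables, PROJECT(Pnumber, Pname, …) and WORKS_ON(Essn, Pno, Hours), the HAVING version might read:

```sql
SELECT   Pnumber, Pname, COUNT(*)
FROM     PROJECT, WORKS_ON
WHERE    Pnumber = Pno
GROUP BY Pnumber, Pname
HAVING   COUNT(*) > 2;     -- keep only groups with more than two employees
```

WHERE filters individual tuples before grouping, whereas HAVING filters whole groups after the aggregate has been computed.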

Example 4: For each project, retrieve the project number, the project name, and the number of employees from
department 5 who work on the project.
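A possible formulation, again assuming WORKS_ON(Essn, Pno, Hours) and EMPLOYEE(Ssn, …, Dno):

```sql
-- The condition Dno = 5 is applied to tuples before grouping,
-- so only department 5 employees are counted for each project.
SELECT   Pnumber, Pname, COUNT(*)
FROM     PROJECT, WORKS_ON, EMPLOYEE
WHERE    Pnumber = Pno
  AND    Ssn = Essn
  AND    Dno = 5
GROUP BY Pnumber, Pname;
```

Note that a project with no department 5 employees produces no group at all in this formulation, so it does not appear in the result.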

4.4 Nested Sub Queries


Some queries require that existing values in the database be fetched and then used in a comparison condition. Such
queries can be conveniently formulated by using nested queries, which are complete select-from-where blocks
within the WHERE clause of another query. That other query is called the outer query.

Example 5: Make a list of all project numbers for projects that involve an employee whose last name is ‘Smith’, either
as a worker or as a manager of the department that controls the project.
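Without nesting, this query can be sketched as the union of two joins. The column names Lname, Dnum (the department controlling a project), Essn, and Pno are assumptions in the textbook style; Dmgr_ssn follows Figure 4.2.

```sql
-- Projects controlled by a department that 'Smith' manages ...
(SELECT DISTINCT Pnumber
 FROM   PROJECT, DEPARTMENT, EMPLOYEE
 WHERE  Dnum = Dnumber
   AND  Dmgr_ssn = Ssn
   AND  Lname = 'Smith')
UNION
-- ... together with projects that 'Smith' works on.
(SELECT DISTINCT Pnumber
 FROM   PROJECT, WORKS_ON, EMPLOYEE
 WHERE  Pnumber = Pno
   AND  Essn = Ssn
   AND  Lname = 'Smith');
```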

The same query can also be expressed with nested subqueries in the WHERE clause of an outer query.
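A nested formulation of Example 5 might look like this, under the same naming assumptions (Lname, Dnum, Essn, Pno are assumed column names):

```sql
SELECT DISTINCT Pnumber
FROM   PROJECT
WHERE  Pnumber IN
         -- projects controlled by a department that 'Smith' manages
         (SELECT Pnumber
          FROM   PROJECT, DEPARTMENT, EMPLOYEE
          WHERE  Dnum = Dnumber
            AND  Dmgr_ssn = Ssn
            AND  Lname = 'Smith')
   OR  Pnumber IN
         -- projects that 'Smith' works on
         (SELECT Pno
          FROM   WORKS_ON, EMPLOYEE
          WHERE  Essn = Ssn
            AND  Lname = 'Smith');
```

Each inner query produces a set of project numbers, and the IN operator tests whether the outer tuple's Pnumber belongs to that set.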

4.5 Views in SQL


A view in SQL terminology is a single table that is derived from other tables. These other tables can be base
tables or previously defined views. A view does not necessarily exist in physical form; it is considered to be a virtual
table, in contrast to base tables, whose tuples are always physically stored in the database. This limits the possible
update operations that can be applied to views, but it does not provide any limitations on querying a view.
A view is a way of specifying a table that we need to reference frequently, even though it may not exist physically.
For example, referring to the COMPANY database, we may frequently issue queries that retrieve the employee name
and the project names that the employee works on. Rather than having to specify the join of the three tables
EMPLOYEE, WORKS_ON, and PROJECT every time we issue this query, we can define a view that is specified as the
result of these joins. Then we can issue queries on the view, which are specified as single table retrievals rather than
as retrievals involving two joins on three tables. We call the EMPLOYEE, WORKS_ON, and PROJECT tables the defining
tables of the view.
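Two view definitions of this kind are sketched below, labeled V1 and V2. The view names WORKS_ON1 and DEPT_INFO come from the text; the column names Fname, Lname, Essn, Pno, Salary and the new attribute names in V2 are assumptions in the textbook style.

```sql
-- V1: employee names with the projects they work on and the hours spent.
CREATE VIEW WORKS_ON1 AS
SELECT Fname, Lname, Pname, Hours
FROM   EMPLOYEE, PROJECT, WORKS_ON
WHERE  Ssn = Essn
  AND  Pno = Pnumber;

-- V2: one row per department, with explicitly renamed attributes.
CREATE VIEW DEPT_INFO (Dept_name, No_of_emps, Total_sal) AS
SELECT   Dname, COUNT(*), SUM(Salary)
FROM     DEPARTMENT, EMPLOYEE
WHERE    Dnumber = Dno
GROUP BY Dname;
```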

In V1, we did not specify any new attribute names for the view WORKS_ON1 (although we could have); in this
case, WORKS_ON1 inherits the names of the view attributes from the defining tables EMPLOYEE, PROJECT, and
WORKS_ON. View V2 explicitly specifies new attribute names for the view DEPT_INFO, using a one-to-one
correspondence between the attributes specified in the CREATE VIEW clause and those specified in the SELECT clause
of the query that defines the view.
We can now specify SQL queries on a view (or virtual table) in the same way we specify queries involving base
tables. For example, to retrieve the last name and first name of all employees who work on the ‘ProductX’ project,
we can query the WORKS_ON1 view as if it were a single table.
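Assuming WORKS_ON1 exposes the columns Fname, Lname, and Pname (assumed names), the retrieval could be written as:

```sql
-- No join is written here; the view already hides the join of the defining tables.
SELECT Fname, Lname
FROM   WORKS_ON1
WHERE  Pname = 'ProductX';
```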

If we do not need a view any more, we can use the DROP VIEW command to dispose of it. For example, the view
defined in V1 is removed by a DROP VIEW statement that names WORKS_ON1.
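In its simplest form the statement only names the view:

```sql
-- Removes the view definition; the defining base tables are not affected.
DROP VIEW WORKS_ON1;
```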

4.6 Normalization
4.6.1 Anomalies in relational database design
Storing natural joins of base relations leads to redundancy and to an additional problem referred to as
update anomalies. These can be classified into insertion anomalies, deletion anomalies, and modification anomalies.
Insertion Anomalies. Insertion anomalies can be differentiated into two types, illustrated by the following
examples based on the EMP_DEPT relation shown in Figure 4.1(a).

Figure 4.1: Two relation schemas suffering from update anomalies. (a) EMP_DEPT and (b) EMP_PROJ.

■ To insert a new employee tuple into EMP_DEPT, we must include either the attribute
values for the department that the employee works for, or NULLs (if the employee does
not work for a department as yet). For example, to insert a new tuple for an employee
who works in department number 5, we must enter all the attribute values of department
5 correctly so that they are consistent with the corresponding values for department 5
in other tuples in EMP_DEPT.
■ It is difficult to insert a new department that has no employees as yet in the
EMP_DEPT relation. The only way to do this is to place NULL values in the attributes
for employee. This violates the entity integrity for EMP_DEPT because Ssn is its
primary key. Moreover, when the first employee is assigned to that department, we do
not need this tuple with NULL values any more.

Deletion Anomalies. The problem of deletion anomalies is related to the second insertion anomaly situation just
discussed. If we delete from EMP_DEPT an employee tuple that happens to represent the last employee working for
a particular department, the information concerning that department is lost from the database.

Modification Anomalies. In EMP_DEPT, if we change the value of one of the attributes of a particular department—
say, the manager of department 5—we must update the tuples of all employees who work in that department;
otherwise, the database will become inconsistent. If we fail to update some tuples, the same department will be
shown to have two different values for manager in different employee tuples, which would be wrong.
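The modification and deletion anomalies can be reproduced with a small experiment. The sketch below assumes EMP_DEPT has the attributes Ename, Ssn, Bdate, Address, Dnumber, Dname, and Dmgr_ssn; the data types and all rows are made up for illustration.

```sql
-- EMP_DEPT stores employee data together with repeated department data.
CREATE TABLE EMP_DEPT (
    Ename     VARCHAR(30),
    Ssn       CHAR(9) PRIMARY KEY,
    Bdate     DATE,
    Address   VARCHAR(30),
    Dnumber   INT,
    Dname     VARCHAR(15),
    Dmgr_ssn  CHAR(9)
);

-- Two employees of department 5: the department values appear twice.
INSERT INTO EMP_DEPT VALUES ('Smith', '111111111', NULL, NULL, 5, 'Research', '333333333');
INSERT INTO EMP_DEPT VALUES ('Wong',  '222222222', NULL, NULL, 5, 'Research', '333333333');

-- Modification anomaly: the manager is changed in only one of the tuples ...
UPDATE EMP_DEPT SET Dmgr_ssn = '222222222' WHERE Ssn = '111111111';

-- ... so department 5 now shows two different managers.
-- This query lists every department in that inconsistent state.
SELECT   Dnumber
FROM     EMP_DEPT
GROUP BY Dnumber
HAVING   COUNT(DISTINCT Dmgr_ssn) > 1;

-- Deletion anomaly: removing the employees of department 5 also removes
-- the only record that the department exists.
DELETE FROM EMP_DEPT WHERE Dnumber = 5;
```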

4.6.2 Decomposition
These three anomalies are undesirable: they make it difficult to maintain consistency of data and they require
unnecessary updates. The solution to all of these problems is to decompose the tables into base tables, as shown
in Figure 4.2.

(a)
EMPLOYEE (Ename, Ssn, Bdate, Address, Dno)
DEPARTMENT (Dnumber, Dname, Dmgr_ssn)

(b)
WORKS_ON (Ssn, Pnumber, Hours)
PROJECT (Pnumber, Pname, Plocation)

Figure 4.2: Relation schemas after resolving the update anomalies in Figure 4.1. Ename is stored only in
EMPLOYEE, since it depends on Ssn and not on Pnumber.

4.6.3 Functional dependencies


A functional dependency, denoted by X → Y, between two sets of attributes X and Y
that are subsets of R specifies a constraint on the possible tuples that can form a relation
state r of R. The constraint is that, for any two tuples t1 and t2 in r that have
t1[X] = t2[X], they must also have t1[Y] = t2[Y].

Note the following:

■ If X is a candidate key of R, then X → Y for any subset of attributes Y of R. In particular, X → R.
■ If X → Y in R, this does not say whether or not Y → X in R.
For example, Figure 4.3 shows a particular state of the TEACH relation schema. Although at first glance we may think
that Text → Course, we cannot confirm this unless we know that it is true for all possible legal states of TEACH. It is,
however, sufficient to demonstrate a single counterexample to disprove a functional dependency. For example,
because ‘Smith’ teaches both ‘Data Structures’ and ‘Data Management,’ we can conclude that Teacher does not
functionally determine Course.

Figure 4.3: A relation state of TEACH with a possible functional dependency TEXT → COURSE. However, TEACHER → COURSE is ruled out.

See the illustrative example relation in Figure 4.4. Here, the following FDs may hold because the four tuples in the
current extension have no violation of these constraints: B → C; C → B; {A, B} → C; {A, B} → D; and {C, D} → B.
However, the following do not hold because we already have violations of them in the given extension: A → B
(tuples 1 and 2 violate this constraint); B → A (tuples 2 and 3 violate this constraint); D → C (tuples 3 and 4 violate
it).

Figure 4.4: A relation R (A, B, C, D) with its extension.
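A relation state can be tested for violations of a functional dependency with GROUP BY and HAVING. The sketch below loads an assumed state of R(A, B, C, D) that agrees with the violations listed in the text; the rows in the original figure may differ.

```sql
CREATE TABLE R (A VARCHAR(2), B VARCHAR(2), C VARCHAR(2), D VARCHAR(2));
INSERT INTO R VALUES ('a1', 'b1', 'c1', 'd1');
INSERT INTO R VALUES ('a1', 'b2', 'c2', 'd2');
INSERT INTO R VALUES ('a2', 'b2', 'c2', 'd3');
INSERT INTO R VALUES ('a3', 'b3', 'c4', 'd3');

-- A -> B is violated when one A value appears with more than one B value.
-- Here the value a1 is returned (tuples 1 and 2).
SELECT A FROM R GROUP BY A HAVING COUNT(DISTINCT B) > 1;

-- B -> C: no rows are returned, so this state does not violate it.
SELECT B FROM R GROUP BY B HAVING COUNT(DISTINCT C) > 1;

-- {A, B} -> D: group by every attribute on the left-hand side.
SELECT A, B FROM R GROUP BY A, B HAVING COUNT(DISTINCT D) > 1;
```

An empty result only shows that the dependency may hold: it is not violated in this particular state. A non-empty result is a counterexample and rules the dependency out.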

4.7 Normal forms based on primary keys


4.7.1 Normalization of Relations
The normalization process, as first proposed by Codd (1972a), takes a relation schema through a series of tests to
certify whether it satisfies a certain normal form. The process, which proceeds in a top-down fashion by evaluating
each relation against the criteria for normal forms and decomposing relations as necessary, can thus be considered
as relational design by analysis. Initially, Codd proposed three normal forms, which he called first, second, and third
normal form. A stronger definition of 3NF—called Boyce-Codd normal form (BCNF)—was proposed later by Boyce
and Codd. All these normal forms are based on a single analytical tool: the functional dependencies among the
attributes of a relation. Later, a fourth normal form (4NF) and a fifth normal form (5NF) were proposed, based on
the concepts of multivalued dependencies and join dependencies, respectively.

Normalization of data can be considered a process of analyzing the given relation schemas based on their FDs and
primary keys to achieve the desirable properties of (1) minimizing redundancy and (2) minimizing the insertion,
deletion, and update anomalies discussed in Section 4.6.1. It can be considered as a “filtering” or “purification”
process to make the design have successively better quality. Unsatisfactory relation schemas that do not meet certain
conditions—the normal form tests—are decomposed into smaller relation schemas that meet the tests and hence
possess the desirable properties.

Definition. The normal form of a relation refers to the highest normal form condition that it
meets, and hence indicates the degree to which it has been normalized.

Normal forms, when considered in isolation from other factors, do not guarantee a good database design. It is
generally not sufficient to check separately that each relation schema in the database is, say, in BCNF or 3NF. Rather,
the process of normalization through decomposition must also confirm the existence of additional properties that
the relational schemas, taken together, should possess. These would include two properties:

■ The nonadditive join or lossless join property, which guarantees that the spurious tuple
generation problem does not occur with respect to the relation schemas created after
decomposition.

■ The dependency preservation property, which ensures that each functional dependency is
represented in some individual relation resulting after decomposition.

The nonadditive join property is extremely critical and must be achieved at any cost, whereas the dependency
preservation property, although desirable, is sometimes sacrificed.

Prime attribute: An attribute of relation schema R is called a prime attribute of R if it is a
member of some candidate key of R. An attribute is called nonprime if it is not a prime
attribute, that is, if it is not a member of any candidate key.

4.7.2 First Normal Form

First normal form (1NF) states that the domain of an attribute must include only atomic (simple, indivisible)
values and that the value of any attribute in a tuple must be a single value from the domain of that attribute.

Hence, 1NF disallows having a set of values, a tuple of values, or a combination of both as an attribute value for a
single tuple. In other words, 1NF disallows relations within relations or relations as attribute values within tuples.
The only attribute values permitted by 1NF are single atomic (or indivisible) values.

Consider the DEPARTMENT schema shown in Figure 4.5(a), in which Dnumber is the primary key of the relation.
Figure 4.5(b) shows a sample state. We assume that each department can have a number of locations. There are
two ways we can look at the Dlocations attribute:

■ The domain of Dlocations contains atomic values, but some tuples can have a set of these values. In this case,
Dlocations is not functionally dependent on the primary key Dnumber.

■ The domain of Dlocations contains sets of values and hence is nonatomic. In this case, Dnumber → Dlocations
because each set is considered a single member of the attribute domain.

Figure 4.5: Normalization into 1NF. (a) A relation schema that is not in 1NF. (b) Sample state of relation DEPARTMENT. (c) 1NF version of the same relation with redundancy.

In either case, the DEPARTMENT relation in Figure 4.5 is not in 1NF. There are three main techniques to achieve
first normal form for such a relation:

1. Remove the attribute Dlocations that violates 1NF and place it in a separate relation DEPT_LOCATIONS along
with the primary key Dnumber of DEPARTMENT. The primary key of this relation is the combination
{Dnumber, Dlocation}, as shown in Figure 4.6. A distinct tuple in DEPT_LOCATIONS exists for each location of
a department. This decomposes the non-1NF relation into two 1NF relations.

Figure 4.6: DEPARTMENT relation of figure 4.5 which is in 1NF

2. Expand the key so that there will be a separate tuple in the original DEPARTMENT relation for each location
of a DEPARTMENT, as shown in Figure 4.5(c). In this case, the primary key becomes the combination
{Dnumber, Dlocation}. This solution has the disadvantage of introducing redundancy in the relation.

3. If a maximum number of values is known for the attribute—for example, if it is known that at most three
locations can exist for a department—replace the Dlocations attribute by three atomic attributes:
Dlocation1, Dlocation2, and Dlocation3. This solution has the disadvantage of introducing NULL values if most
departments have fewer than three locations. It further introduces spurious semantics about the ordering
among the location values that is not originally intended. Querying on this attribute becomes more difficult;
for example, consider how you would write the query: List the departments that have ‘Bellaire’ as one of
their locations in this design.

Of the three solutions above, the first is generally considered best because it does not suffer from redundancy
and it is completely general, having no limit placed on a maximum number of values.
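The first technique can be sketched in SQL as follows. The relation and attribute names follow the text; the data types are assumptions.

```sql
CREATE TABLE DEPARTMENT (
    Dnumber   INT PRIMARY KEY,
    Dname     VARCHAR(15),
    Dmgr_ssn  CHAR(9)
);

-- One tuple per (department, location) pair, so every value is atomic.
CREATE TABLE DEPT_LOCATIONS (
    Dnumber    INT,
    Dlocation  VARCHAR(15),
    PRIMARY KEY (Dnumber, Dlocation),
    FOREIGN KEY (Dnumber) REFERENCES DEPARTMENT (Dnumber)
);

-- List the departments that have 'Bellaire' as one of their locations.
SELECT D.Dnumber, D.Dname
FROM   DEPARTMENT D, DEPT_LOCATIONS L
WHERE  D.Dnumber = L.Dnumber
  AND  L.Dlocation = 'Bellaire';

-- Under the third technique the same question needs one test per column:
--   WHERE Dlocation1 = 'Bellaire' OR Dlocation2 = 'Bellaire' OR Dlocation3 = 'Bellaire'
```

The query against DEPT_LOCATIONS stays the same no matter how many locations a department has, which is why this design is described as completely general.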

4.7.3 Second Normal Form


A functional dependency X → Y is a partial dependency if some attribute A ∈ X can be removed from X and the
dependency still holds; that is, for some A ∈ X, (X – {A}) → Y. In Figure 4.7, {Ssn, Pnumber} → Hours is a full
dependency (neither Ssn → Hours nor Pnumber → Hours holds). However, the dependency {Ssn, Pnumber} →
Ename is partial because Ssn → Ename holds.

Figure 4.7 A relation which is in 1NF but not in 2NF

Definition. A relation schema R is in 2NF if every nonprime attribute A in R is fully functionally
dependent on the primary key of R.

The EMP_PROJ relation in Figure 4.7 is in 1NF but is not in 2NF. The nonprime attribute Ename violates
2NF because of FD2, as do the nonprime attributes Pname and Plocation because of FD3. The functional
dependencies FD2 and FD3 make Ename, Pname, and Plocation partially dependent on the primary key
{Ssn, Pnumber} of EMP_PROJ, thus violating the 2NF test.

The functional dependencies FD1, FD2, and FD3 lead to the decomposition of EMP_PROJ into the three
relation schemas EP1, EP2, and EP3 shown in Figure 4.8, each of which is in 2NF.

Figure 4.8 Normalizing EMP_PROJ into 2NF relations
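The decomposition can be sketched as three tables, one per functional dependency. The assignment of attributes to EP1, EP2, and EP3 is an assumption based on FD1 ({Ssn, Pnumber} → Hours), FD2 (Ssn → Ename), and FD3 (Pnumber → {Pname, Plocation}); the data types are also assumed.

```sql
-- EP2 (from FD2): Ename depends on Ssn alone.
CREATE TABLE EP2 (
    Ssn    CHAR(9) PRIMARY KEY,
    Ename  VARCHAR(30)
);

-- EP3 (from FD3): Pname and Plocation depend on Pnumber alone.
CREATE TABLE EP3 (
    Pnumber    INT PRIMARY KEY,
    Pname      VARCHAR(15),
    Plocation  VARCHAR(15)
);

-- EP1 (from FD1): Hours depends on the whole key {Ssn, Pnumber}.
CREATE TABLE EP1 (
    Ssn      CHAR(9),
    Pnumber  INT,
    Hours    DECIMAL(4,1),
    PRIMARY KEY (Ssn, Pnumber),
    FOREIGN KEY (Ssn)     REFERENCES EP2 (Ssn),
    FOREIGN KEY (Pnumber) REFERENCES EP3 (Pnumber)
);
```

In each table every nonprime attribute now depends on the full primary key, so no partial dependency remains.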

4.7.4 Third Normal Form



Figure 4.9 A relation schema which is not in 3NF

Definition. According to Codd’s original definition, a relation schema R is in 3NF if it satisfies 2NF and
no nonprime attribute of R is transitively dependent on the primary key.

A functional dependency X → Y in a relation schema R is a transitive dependency if there exists a set of attributes Z
in R that is neither a candidate key nor a subset of any key of R, and both X → Z and Z → Y hold.

The dependency Ssn → Dmgr_ssn is transitive through Dnumber in EMP_DEPT in Figure 4.9, because both the
dependencies Ssn → Dnumber and Dnumber → Dmgr_ssn hold and Dnumber is neither a key itself nor a subset of
the key of EMP_DEPT. Intuitively, we can see that the dependency of Dmgr_ssn on Dnumber is undesirable in
EMP_DEPT since Dnumber is not a key of EMP_DEPT.

The relation schema EMP_DEPT is in 2NF, since no partial dependencies on a key exist. However, EMP_DEPT is not
in 3NF because of the transitive dependency of Dmgr_ssn (and also Dname) on Ssn via Dnumber. We can normalize
EMP_DEPT by decomposing it into the two 3NF relation schemas ED1 and ED2 shown in Figure 4.10.

Figure 4.10 Normalizing EMP_DEPT into 3NF relations.
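A sketch of this decomposition in SQL is given below. The split of attributes between ED1 and ED2 is an assumption based on the transitive dependency through Dnumber; the data types and the manager value in the UPDATE are made up.

```sql
-- ED2: department facts, stored once per department.
CREATE TABLE ED2 (
    Dnumber   INT PRIMARY KEY,
    Dname     VARCHAR(15),
    Dmgr_ssn  CHAR(9)
);

-- ED1: employee facts plus a reference to the department.
CREATE TABLE ED1 (
    Ename    VARCHAR(30),
    Ssn      CHAR(9) PRIMARY KEY,
    Bdate    DATE,
    Address  VARCHAR(30),
    Dnumber  INT,
    FOREIGN KEY (Dnumber) REFERENCES ED2 (Dnumber)
);

-- Joining ED1 and ED2 on Dnumber recovers the original EMP_DEPT tuples
-- without generating spurious tuples, because Dnumber is the key of ED2.
CREATE VIEW EMP_DEPT AS
SELECT E.Ename, E.Ssn, E.Bdate, E.Address, E.Dnumber, D.Dname, D.Dmgr_ssn
FROM   ED1 E, ED2 D
WHERE  E.Dnumber = D.Dnumber;

-- Changing the manager of department 5 now touches exactly one tuple.
UPDATE ED2 SET Dmgr_ssn = '222222222' WHERE Dnumber = 5;
```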

Figure 4.11 Summary of Normal Forms Based on Primary Keys and Corresponding Normalization