CS6302 Notes
UNIT I
include only those customers who have an account balance of $10,000 or more. A program to
generate such a list does not exist. Again, the officer has the preceding two options, neither of
which is satisfactory.
3.Data isolation
Because data are scattered in various files, and files may be in different formats, writing new
application programs to retrieve the appropriate data is difficult.
4.Integrity problems
The data values stored in the database must satisfy certain types of consistency constraints.
Example:
The balance of certain types of bank accounts may never fall below a prescribed amount .
Developers enforce these constraints in the system by adding appropriate code in the various
application programs.
5.Atomicity problems
Atomic means the transaction must happen in its entirety or not at all. It is difficult to ensure
atomicity in a conventional file processing system.
Example:
Consider a program to transfer $50 from account A to account B. If a system failure occurs
during the execution of the program, it is possible that the $50 was removed from account A but
was not credited to account B, resulting in an inconsistent database state.
6.Concurrent access anomalies
For the sake of overall performance of the system and faster response, many systems allow
multiple users to update the data simultaneously. In such an environment, interaction of
concurrent updates is possible and may result in inconsistent data. To guard against this
possibility, the system must maintain some form of supervision. But supervision is difficult to
provide because data may be accessed by many different application programs that have not been
coordinated previously.
Example: When several reservation clerks try to assign a seat on an airline flight, the system
should ensure that each seat can be accessed by only one clerk at a time for assignment to a
passenger.
7. Security problems
Enforcing security constraints in a file-processing system is difficult.
VIEWS OF DATA
A major purpose of a database system is to provide users with an abstract view of the data, i.e. the
system hides certain details of how the data are stored and maintained.
Views have several other benefits.
Views provide a level of security. Views can be set up to exclude data that some users should
not see.
Views provide a mechanism to customize the appearance of the database.
A view can present a consistent, unchanging picture of the structure of the database, even if the
underlying database is changed.
The ANSI / SPARC architecture defines three levels of data abstraction.
External level / view level
Conceptual level / logical level
Internal level / physical level
The objectives of the three level architecture are to separate each user's view of the database
from the way the database is physically represented.
External level
This is the users' view of the database. The external level describes the part of the database that is
relevant to each user.
The external level consists of a number of different external views of the database. Each user has
a view of the 'real world' represented in a form that is familiar to that user. The external view
includes only those entities, attributes, and relationships in the real world that the user is
interested in.
The use of external models has several major advantages:
Makes application programming much easier.
Simplifies the database designer's task.
Helps in ensuring the database security.
Conceptual level
This is the community view of the database. The conceptual level describes what data is stored in
the database and the relationships among the data.
The middle level in the three level architecture is the conceptual level. This level contains the
logical structure of the entire database as seen by the DBA. It is a complete view of the data
requirements of the organization that is independent of any storage considerations. The
conceptual level represents:
All entities, their attributes and their relationships
The constraints on the data
Semantic information about the data
Security and integrity information.
The conceptual level supports each external view. However, this level must not contain any
storage dependent details. For instance, the description of an entity should contain only data
types of attributes and their length, but not any storage consideration such as the number of bytes
occupied.
Internal level
This is the physical representation of the database on the computer. The internal level describes
how the data is stored in the database.
The internal level covers the physical implementation of the database to achieve optimal runtime
performance and storage space utilization. It covers the data structures and file organizations
used to store data on storage devices. The internal level is concerned with
Storage space allocation for data and indexes.
Record descriptions for storage
Record placement.
Data compression and data encryption techniques.
Below the internal level there is a physical level that may be managed by the operating system
under the direction of the DBMS.
Physical level
The physical level below the DBMS consists of items only the operating system knows such as
exactly how the sequencing is implemented and whether the fields of internal records are stored
as contiguous bytes on the disk.
Similar to types and variables in programming languages, a schema is the logical structure of the
database (e.g., the database consists of information about a set of customers and accounts and the
relationships between them), analogous to the type information of a variable in a program.
DATA MODELS
The data model is a collection of conceptual tools for describing data, data relationships, data
semantics, and consistency constraints. A data model provides a way to describe the design of a
database at the physical, logical and view levels.
The purpose of a data model is to represent data and to make the data understandable.
According to the types of concepts used to describe the database structure, there are three data
models:
1.An external data model, to represent each user's view of the organization.
2.A conceptual data model, to represent the logical view that is DBMS independent
3.An internal data model, to represent the conceptual schema in such a way that it can be
understood by the DBMS.
Categories of data model:
1.Record-based data models
2.Object-based data models
3.Physical-data models.
The first two are used to describe data at the conceptual and external levels; the last is used to
describe data at the internal level.
1.Record-based data models
In a record-based model, the database consists of a number of fixed format records possibly of
differing types. Each record type defines a fixed number of fields, each typically of a fixed
length.
There are three types of record-based logical data model:
Hierarchical data model.
Network data model.
Relational data model.
Entity: An entity was defined as anything about which data are to be collected and stored. Each
row in the relational table is known as an entity instance or entity occurrence in the ER model.
Each entity is described by a set of attributes that describes particular characteristics of the entity.
Object oriented model:
In the object-oriented data model (OODM), both data and their relationships are contained in a
single structure known as an object. An object is described by its factual content. An object
includes information about relationships between the facts within the object, as well as
information about its relationships with other objects. Therefore, the facts within the object are
given greater meaning. The OODM is said to be a semantic data model because semantic
indicates meaning. The OO data model is based on the following components:
An object is an abstraction of a real-world entity.
Attributes describe the properties of an object.
DATABASE SYSTEM ARCHITECTURE
Transaction Management
A transaction is a collection of operations that performs a single logical function in a database
application. Transaction management ensures that the database remains in a consistent state
despite system failures and concurrent transaction execution.
Storage Management
A storage manager is a program module that provides the interface between the low-level data
stored in the database and the application programs and queries submitted to the system.
Database Administrator
Coordinates all the activities of the database system; the database administrator has a good
understanding of the enterprise's information resources and needs:
Schema definition
Database Users
Users are differentiated by the way they expect to interact with the system.
Specialized users write specialized database applications that do not fit into the traditional
data processing framework
Naive users invoke one of the permanent application programs that have been written
previously
File manager
manages allocation of disk space and data structures used to represent information on disk.
Database manager
The interface between low level data and application programs and queries.
Query processor
translates statements in a query language into low-level instructions the database manager
understands. (May also attempt to find an equivalent but more efficient form.)
DML precompiler
converts DML statements embedded in an application program to normal procedure calls in a
host language. The precompiler interacts with the query processor.
DDL compiler
converts DDL statements to a set of tables containing metadata stored in a data dictionary. In
addition, several data structures are required for physical system implementation:
Data files: store the database itself.
Data dictionary: stores information about the structure of the database. It is used heavily, so great
emphasis should be placed on developing a good design and an efficient implementation of the
dictionary.
Indices: provide fast access to data items holding particular values.
The entity relationship (ER) data model was developed to facilitate database design by
allowing specification of an enterprise schema that represents the overall logical structure of a
database. The E-R data model is one of several semantic data models.
The semantic aspect of the model lies in its representation of the meaning of the data. The E-R
model is very useful in mapping the meanings and interactions of real-world enterprises onto a
conceptual schema.
The ERDs represent three main components: entities, attributes, and relationships.
Entity sets:
An entity is a thing or object in the real world that is distinguishable from all other objects.
Example:
Each person in an enterprise is an entity.
An entity has a set of properties, and the values for some set of properties may uniquely identify
an entity.
Example:
A person may have a person-id attribute whose value uniquely identifies that person.
An entity may be concrete, such as a person or a book, or it may be abstract, such as a loan, a
holiday, or a concept. An entity set is a set of entities of the same type that share the same
properties, or attributes.
Example:
The set of all persons who are customers at a given bank can be defined as the entity set customer.
Relationship sets:
A relationship is an association among several entities.
Example:
A relationship that associates customer Smith with loan L-16 specifies that Smith is a customer
with loan number L-16.
A relationship set is a set of relationships of the same type.
The number of entity sets that participate in a relationship set is the degree of the
relationship set.
A unary relationship exists when an association is maintained within a single entity.
Attributes:
For each attribute, there is a set of permitted values, called the domain, or value set, of that
attribute.
Example:
The domain of attribute customer name might be the set of all text strings of a certain length.
An attribute of an entity set is a function that maps from the entity set into a domain.
An attribute can be characterized by the following attribute types:
Simple and composite attributes.
Single valued and multi valued attributes.
Derived attribute.
Simple attribute (atomic attributes)
An attribute composed of a single component with an independent existence is called a simple
attribute.
Simple attributes cannot be further subdivided into smaller components.
Relationships are
1.visits
2.creates
Keys:
A super key of an entity set is a set of one or more attributes whose values uniquely determine
each entity.
A candidate key of an entity set is a minimal super key.
Although several candidate keys may exist, one of the candidate keys is selected to be the
primary key.
The combination of primary keys of the participating entity sets forms a candidate key of a
relationship set.
We must consider the mapping cardinality and the semantics of the relationship set when selecting
the primary key.
(social-security, account-number) is the primary key of depositor
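As a minimal SQL sketch of such a composite key (the column types here are assumptions for illustration, not from these notes):
create table depositor (
social_security char(9),
account_number char(10),
primary key (social_security, account_number)
);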
E-R Diagram Components
Rectangles represent entity sets.
Ellipses represent attributes.
Diamonds represent relationship sets.
Lines link attributes to entity sets and entity sets to relationship sets.
Double ellipses represent multivalued attributes.
Dashed ellipses denote derived attributes.
Primary key attributes are underlined.
Weak Entity Set
An entity set that does not have a primary key is referred to as a weak entity set. The existence of
a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via
a one-to-many relationship set. The discriminator (or partial key) of a weak entity set is the set of
attributes that distinguishes among all the entities of a weak entity set. The primary key of a
weak entity set is formed by the primary key of the strong entity set on which the weak entity set
is existence dependent, plus the weak entity set's discriminator. A weak entity set is depicted by
double rectangles.
Specialization
This is a top-down design process that designates subgroupings within an entity set that are
distinctive from other entities in the set.
These subgroupings become lower-level entity sets that have attributes or participate in
relationships that do not apply to the higher-level entity set.
Depicted by a triangle component labeled ISA (e.g., savings-account 'is an' account).
Generalization:
This is a bottom-up design process that combines a number of entity sets sharing the same features
into a higher-level entity set.
Specialization and generalization are simple inversions of each other; they are represented in an
E-R diagram in the same way.
Attribute inheritance: a lower-level entity set inherits all the attributes and relationship
participation of the higher-level entity set to which it is linked.
Design Constraints on Generalization:
Constraint on which entities can be members of a given lower-level entity set.
condition-defined
user-defined
-Constraint on whether or not entities may belong to more than one lower-level entity set within
a single generalization.
disjoint
overlapping
-Completeness constraint specifies whether or not an entity in the higher-level entity set must
belong to at least one of the lower-level entity sets within a generalization.
total
partial
Aggregation
Treat relationship as an abstract entity.
Allows relationships between relationships.
Abstraction of relationship into new entity.
Aggregation allows us to represent, without introducing redundancy, that:
A customer takes out a loan
An employee may be a loan officer for a customer-loan pair
RELATIONAL DATABASES
A relational database is based on the relational model and uses a collection of tables to
represent both data and the relationships among those data. It also includes a DML and DDL.
The relational model is an example of a record-based model.
Record-based models are so named because the database is structured in fixed-format records of
several types.
A relational database consists of a collection of tables, each of which is assigned a unique name.
A row in a table represents a relationship among a set of values.
A table is an entity set, and a row is an entity. Example: a simple relational database.
Columns in relations (table) have associated data types.
The relational model includes an open-ended set of data types, i.e. users will be able to define
their own types as well as being able to use system-defined or built in types.
Every relation value has two parts:
1) A set of column-name: type-name pairs.
2) A set of rows.
The optimizer is the system component that determines how to implement user requests. The
process of navigating around the stored data in order to satisfy the user's request is performed
automatically by the system, not manually by the user. For this reason, relational systems are
sometimes said to perform automatic navigation. Every DBMS must provide a catalog or
dictionary function.
The catalog is a place where all of the various schemas (external, conceptual, internal) and all
of the corresponding mappings (external/conceptual, conceptual/internal) are kept. In other
words, the catalog contains detailed information (sometimes called descriptor information or
metadata) regarding the various objects that are of interest to it.
RELATIONAL ALGEBRA
A basic expression in the relational algebra consists of either one of the following:
A relation in the database
A constant relation
Let E1 and E2 be relational-algebra expressions; the following are all relational-algebra
expressions:
E1 ∪ E2
E1 − E2
E1 × E2
σ_P(E1), where P is a predicate on attributes in E1
Π_S(E1), where S is a list consisting of some of the attributes in E1
ρ_x(E1), where x is the new name for the result of E1
The select, project and rename operations are called unary operations, because they operate on
one relation.
The union, Cartesian product, and set difference operations operate on pairs of relations and are
called binary operations
Selection (or Restriction) (σ)
The selection operation works on a single relation R and defines a relation that contains only
those tuples of R that satisfy the specified condition (predicate).
Syntax:
σ_predicate(R)
Example:
List all staff with a salary greater than 10000.
Sol:
σ_salary > 10000(Staff)
The input relation is staff and the predicate is salary>10000. The selection operation defines a
relation containing only those staff tuples with a salary greater than 10000.
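For comparison, the equivalent query in SQL (assuming the same Staff table) would be:
select * from Staff where salary > 10000;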
Projection (Π):
The projection operation works on a single relation R and defines a relation that contains a
vertical subset of R, extracting the values of specified attributes and eliminating duplicates.
Syntax:
Π_a1, ..., an(R)
Example:
Produce a list of salaries for all staff, showing only the staffNo, name and salary.
Π_staffNo, name, salary(Staff)
Rename (ρ):
Rename operation can rename either the relation name or the attribute names or both
Syntax:
ρ_S(B1, B2, ..., Bn)(R) or ρ_S(R) or ρ_(B1, B2, ..., Bn)(R)
S is the new relation name and B1, B2, ..., Bn are the new attribute names.
The first expression renames both the relation and its attributes, the second renames the relation
only, and the third renames the attributes only. If the attributes of R are (A1, A2, ..., An) in that
order, then each Ai is renamed as Bi.
Union
The union of two relations R and S defines a relation that contains all the tuples of R or S or both
R and S, duplicate tuples being eliminated. Union is possible only if the schemas of the two
relations match.
Syntax:
R ∪ S
Example:
List all cities where there is either a branch office or a property for rent.
Π_city(Branch) ∪ Π_city(PropertyForRent)
Set difference:
The set difference operation defines a relation consisting of the tuples that are in relation R, but
not in S. R and S must be union-compatible.
Syntax
R-S
Example:
List all cities where there is a branch office but no properties for rent.
Sol:
Π_city(Branch) − Π_city(PropertyForRent)
FUNCTIONAL DEPENDENCY
A functional dependency X -> Y between two sets of attributes specifies that the values of X
uniquely determine the values of Y. X is the determinant set and Y is the dependent attribute.
Thus, given a tuple and the values of the attributes in X, one can determine the corresponding
value of the Y attribute. The set of all functional dependencies that are implied by a given set of
functional dependencies X is called the closure of X.
A set of inference rules, called Armstrong's axioms, specifies how new functional dependencies
can be inferred from given ones.
Let A, B, C and D be subsets of the attributes of the relation R. Armstrong's axioms are as
follows:
1) Reflexivity: If B is a subset of A, then A -> B.
2) Augmentation: If A -> B, then A,C -> B,C.
3) Transitivity: If A -> B and B -> C, then A -> C.
KEYS
A key is a set of attributes that uniquely identifies an entire tuple, whereas a functional dependency
allows us to express constraints that uniquely identify the values of certain attributes.
A candidate key is always a determinant, but a determinant doesn't need to be a key.
CLOSURE
Let a relation R have some functional dependencies F specified. The closure of F (usually
written as F+) is the set of all functional dependencies that may be logically derived from F.
Often F is the set of most obvious and important functional dependencies and F+, the closure, is
the set of all the functional dependencies including F and those that can be deduced from F. The
closure is important and may, for example, be needed in finding one or more candidate keys of
the relation.
AXIOMS
Before we can determine the closure of the relation Student, we need a set of rules.
Developed by Armstrong in 1974, there are six rules (axioms) from which all possible functional
dependencies may be derived.
Using the first rule alone, from our example we have 2^7 = 128 subsets. This will further lead to
many more functional dependencies, which defeats the purpose of normalizing relations. Instead,
we find what attributes depend on a given set of attributes and therefore ought to be together.
The closure X^c of a set of attributes X is computed as follows:
Step 1 Start with X^c <- X.
Step 2 Let the next dependency be A -> B. If A is in X^c and B is not, X^c <- X^c + B.
Step 3 Repeat Step 2 until no new attributes can be added to X^c.
For example, given the dependencies SNo -> SName and CNo -> CName, the closure of
X = (SNo, CNo) is computed as:
Step 1 --- X^c <- X, that is, X^c <- (SNo, CNo)
Step 2 --- Consider SNo -> SName, since SNo is in X^c and SName is not, we have: X^c <-
(SNo, CNo) + SName
Step 3 --- Consider CNo -> CName, since CNo is in X^c and CName is not, we have: X^c <-
(SNo, CNo, SName) + CName
Step 4 --- Again, consider SNo -> SName but this does not change X^c.
Step 5 --- Again, consider CNo -> CName but this does not change X^c.
NORMALIZATION
Initially Codd (1972) presented three normal forms (1NF, 2NF and 3NF) all based on
functional dependencies among the attributes of a relation. Later Boyce and Codd proposed
another normal form called the Boyce-Codd normal form (BCNF). The fourth and fifth
normal forms are based on multi-value and join dependencies and were proposed later.
Suppose we had started with bor_loan. How would we know to split it up (decompose it) into
borrower and loan?
We write a rule: if there were a schema (loan_number, amount), then loan_number would be a
candidate key:
loan_number -> amount
In bor_loan, because loan_number is not a candidate key, the amount of a loan may have to be
repeated. This indicates the need to decompose bor_loan.
A careless decomposition can lose information: if we cannot reconstruct the original employee
relation from the decomposed relations, the decomposition is a lossy decomposition.
A relational schema R is in first normal form if the domains of all attributes of R are atomic.
Non-atomic values complicate storage and encourage redundant (repeated) storage of data.
Example: a set of accounts stored with each customer, and a set of owners stored with each
account.
Atomicity is actually a property of how the elements of the domain are used.
Suppose that students are given roll numbers that are strings of the form CS0012 or EE1127.
If the first two characters are extracted to find the department, the domain of roll numbers is not
atomic.
Doing so is a bad idea: it leads to encoding of information in the application program rather than
in the database.
First normal form is a relation in which the intersection of each row and column contains one
and only one value. To transform an un-normalized table (a table that contains one or more
repeating groups) to first normal form, identify and remove the repeating groups within the table
(i.e. multi-valued attributes, composite attributes, and their combinations).
Example: multi-valued attribute - phone number;
composite attribute - address.
There are two common approaches to removing repeating groups from un-normalized tables:
1)Remove the repeating groups by entering appropriate data in the empty columns of rows
containing the repeating data. This approach is referred to as 'flattening' the table, with this
approach, redundancy is introduced into the resulting relation, which is subsequently removed
during the normalization process.
2)Remove the repeating group by placing the repeating data, along with a copy of the original
key attribute(s), in a separate relation. A primary key is identified for the new relation.
Example 1 (multi-valued):
Consider the contacts table, which contains contact-tracking information.
The table contains a repeating group of the date and description of two conversations. The
only advantage of designing the table like this is that it avoids the need
GOALS OF 1NF
In the case that a relation R is not in good form, decompose it into a set of relations {R1, R2, ...,
Rn} such that each relation is in good form and the decomposition is a lossless-join
decomposition. The theory of good form is based on:
functional dependencies
multivalued dependencies
Second normal form applies to relations with composite keys, i.e. relations with a primary key
composed of two or more attributes. A relation with a single-attribute primary key is
automatically in at least 2NF.
A relation that is in first normal form and every non-primary-key attribute is fully functionally
dependent on the primary key is in Second Normal Form.
The normalization of 1NF relations to 2NF involves the removal of partial dependencies. If a
partial dependency exists, remove the functionally dependent attributes from the relation by
placing them in a new relation along with a copy of their determinant.
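A hedged sketch of such a decomposition (the Enrolment schema below is invented for illustration): suppose Enrolment(studentNo, courseNo, studentName, grade) has the partial dependency studentNo -> studentName. Removing it gives:
create table student (
studentNo char(6) primary key,
studentName varchar(30)
);
create table enrolment (
studentNo char(6) references student,
courseNo char(6),
grade char(2),
primary key (studentNo, courseNo)
);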
The dependency SSN -> DMGRSSN is transitive through DNumber in EMP_DEPT because
both the dependencies SSN -> DNumber and DNumber -> DMGRSSN hold, and DNumber
is neither a key itself nor a subset of the key of EMP_DEPT.
A relation that is in first and second normal form, and in which no non-primary-key attribute is
transitively dependent on the primary key, is in Third Normal Form. The normalization of 2NF
relations to 3NF involves the removal of transitive dependencies. If a transitive dependency
exists, remove the transitively dependent attribute(s) from the relation by placing the attribute(s)
in a new relation along with a copy of the determinant. The update (insertion, deletion and
modification) anomalies arise as a result of the transitive dependency.
Example:
To transform the EMP_DEPT relation into third normal form, first remove the transitive
dependency by creating two new relations ED1 and ED2, where ED2 holds the attributes of the
transitive dependency:
ED2(DNo, DName, DMgrSSN)
Relations that have redundant data may have problems called update anomalies, which are
classified as insertion, deletion or modification anomalies. These anomalies occur because, when
the data in one table is deleted or updated or new data is inserted, the related data is also not
correspondingly updated or deleted. One of the aims of the normalization is to remove the update
anomalies.
Boyce-Codd Normal Form (BCNF) is based on functional dependencies that take into account all
candidate keys in a relation. A candidate key is a unique identifier of each of the tuples. For a
relation with only one candidate key, third normal form and BCNF are equivalent. A relation is in
BCNF if and only if every determinant is a candidate key. To test whether a relation is in BCNF,
identify all the determinants and make sure that they are candidate keys. A determinant is an
attribute or a group of attributes on which some other attribute is fully functionally
dependent. The difference between third normal form and BCNF is that for a functional
dependency A -> B, the third normal form allows this dependency in a relation if B is a primary-
key attribute and A is not a candidate key, whereas BCNF insists that for this dependency to
remain in a relation, A must be a candidate key.
Consider the client interview relation.
This relation has three candidate keys:
(clientNo, interviewDate),
(staffNo, interviewDate, interviewTime),
and (roomNo, interviewDate, interviewTime).
Select (clientNo, interviewDate) to act as the primary key for this relation.
The client interview relation has the following functional dependencies:
fd1: clientNo, interviewDate -> interviewTime, staffNo, roomNo (primary key)
fd2: staffNo, interviewDate, interviewTime -> clientNo (candidate key)
fd3: roomNo, interviewDate, interviewTime -> staffNo, clientNo (candidate key)
fd4: staffNo, interviewDate -> roomNo
As the determinants of fd1, fd2, and fd3 are all candidate keys for this relation, none of these
dependencies will cause problems for the relation.
This relation is not in BCNF due to the presence of the (staffNo, interviewDate) determinant in
fd4, which is not a candidate key for the relation. BCNF requires that all determinants in a
relation be candidate keys for the relation.
In this example, members of staff called Ann Beech and David Ford work at branch B003, and
property owners called Carl Farrel and Tina Murphy are registered at branch B003. However,
there is no direct relationship between members of staff and property owners. The MVDs in this
relation are:
branchNo ->> SName
branchNo ->> OName
A multi-valued dependency A ->> B in relation R is trivial if (a) B is a subset of A or (b)
A ∪ B = R.
A multi-valued dependency A ->> B is nontrivial if neither (a) nor (b) is satisfied.
This relation is not in 4NF because of the presence of the nontrivial MVD. Decompose the
relation into the BranchStaff and BranchOwner relations.
Both new relations are in 4NF because the BranchStaff relation contains the trivial MVD
branchNo ->> SName, and the BranchOwner relation contains the trivial MVD branchNo ->> OName.
BranchStaff(branchNo, SName)
BranchOwner(branchNo, OName)
Whenever we decompose a relation into two relations, the resulting relations should have the
lossless-join property. This property refers to the fact that we can rejoin the resulting relations to
produce the original relation.
Example:
The decomposition of the BranchStaffOwner relation.
The PropertyItemSupplier relation with the form (A, B, C) satisfies the join dependency
JD(R1(A, B), R2(B, C), R3(A, C)); i.e., performing the join on all three will recreate the original
PropertyItemSupplier relation.
3. Define DBMS.
A Database-management system consists of a collection of interrelated data and a set of
programs to access those data. The collection of data, usually referred to as the database,
contains information about one particular enterprise. The primary goal of a DBMS is to
provide an environment that is both convenient and efficient to use in retrieving and
storing database information.
25.Define BCNF.
A relation schema R is in BCNF with respect to a set F of FDs if for all FDs of the form A -> B,
where A is contained in R and B is contained in R, at least one of the following holds:
1. A -> B is a trivial FD
2. A is a superkey for schema R.
26. Define 4NF.
A relation schema R is in 4NF with respect to a set D of functional and multivalued dependencies
if for all MVDs of the form A ->> B, at least one of the following holds:
1. A ->> B is a trivial MVD
2. A is a superkey for schema R.
16 MARKS QUESTIONS
1.Briefly explain about Database system architecture:
2.Explain about the Purpose of Database system.
3. Briefly explain about Views of data.
4. Explain E-R Model in detail with suitable example.
5. Explain about various data models.
6. Draw an E R Diagram for Banking, University, Company, Airlines, ATM, Hospital, Library,
Super market, Insurance Company.
7. Explain 1NF, 2NF and BCNF with suitable examples.
8. Consider the universal relation R = {A, B, C, D, E, F, G, H, I, J} and the set of functional
dependencies
F = {{A,B} -> {C}, {A} -> {D,E}, {B} -> {F}, {F} -> {G,H}, {D} -> {I,J}}. What is the key for
R? Decompose R into 2NF, then 3NF relations.
9. What are the pitfalls in relational database design? With a suitable example, explain the role of
functional dependency in the process of normalization.
10. What is normalization? Explain all Normal forms.
11. Write about decomposition preservation algorithm for all FDs.
12.Explain functional dependency concepts
13.Explain 2NF and 3NF in detail
14.Define BCNF. How does it differ from 3NF?
15.Explain Codd's rules for relational database design.
UNIT -2
SQL FUNDAMENTALS
Structured Query Language (SQL) is the standard command set used to communicate with
relational database management systems. All tasks related to relational data management, such as
creating tables and querying the database for information, can be performed using SQL.
Advantages of SQL:
SQL is a high level language that provides a greater degree of abstraction than procedural
languages.
Increased acceptance and availability of SQL.
Applications written in SQL can be easily ported across systems.
SQL as a language is independent of the way it is implemented internally.
Simple and easy to learn.
The set-at-a-time feature of SQL makes it more powerful than the record-at-a-time
processing technique.
SQL can handle complex situations.
SQL data types:
SQL supports the following data types.
CHAR(n) - fixed-length string of exactly 'n' characters.
VARCHAR(n) - varying-length string whose maximum length is 'n' characters.
FLOAT - floating point number.
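A minimal sketch using these types; the book table mirrors the one used in the Desc and Truncate examples below, though its columns are assumptions:
create table book (
isbn char(10),
title varchar(100),
price float
);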
Drop table
An existing base table can be deleted at any time by using the drop table statement.
Syntax
Drop table table-name;
DESC
Desc command used to view the structure of the table.
Syntax
Desc table-name;
Example:
SQL>Desc book;
Truncate table
If there is no further use of records stored in a table and the structure has to be retained then the
records alone can be deleted.
Syntax
Truncate table table-name;
Example:
SQL>Truncate table book;
Table truncated.
This command would delete all the records from the table, book.
INTEGRITY
Data integrity refers to the correctness and completeness of the data in a database, i.e. an
integrity constraint is a mechanism used to prevent invalid data entry into the table.
The various types of integrity constraints are
1)Domain integrity constraints
2)Entity integrity constraints
3)Referential integrity constraints
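A hedged sketch showing all three kinds of constraint (table and column names are assumptions for illustration):
create table branch (
branch_name varchar(30) primary key -- entity integrity
);
create table account (
account_number char(10) primary key, -- entity integrity
balance float check (balance >= 0), -- domain integrity
branch_name varchar(30) references branch(branch_name) -- referential integrity
);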
TRIGGER
A database trigger is procedural code that is automatically executed in response to certain events
on a particular table or view in a database. The trigger is mostly used for maintaining the integrity
of the information in the database. For example, when a new record (representing a new worker)
is added to the employees table, new records should also be created in the tables of the taxes,
vacations and salaries.
Triggers are for
Customization of database management;
centralization of some business or validation rules;
logging and audit.
Overcome the mutating-table error.
Maintain referential integrity between parent and child.
Generate calculated column values
Log events (connections, user actions, table updates, etc.)
Gather statistics on table access
Modify table data when DML statements are issued against views
Enforce referential integrity when child and parent tables are on different nodes of a
distributed database
Publish information about database events, user events, and SQL statements to
subscribing applications
Enforce complex security authorizations (e.g., prevent DML operations on a table after
regular business hours)
Prevent invalid transactions
Enforce complex business or referential integrity rules that you cannot define with
constraints
Control the behavior of DDL statements, as by altering, creating, or renaming objects
Audit information of system access and behavior by creating transparent logs
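As a hedged sketch of the worker example above, in SQL:1999-style trigger syntax (the employees and vacations tables and their columns are assumptions):
create trigger new_employee_trigger
after insert on employees
referencing new row as nrow
for each row
insert into vacations (employee_id, days_remaining)
values (nrow.employee_id, 0);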
SECURITY
Authorization
Insert - allows insertion of new data, but not modification of existing data.
<user list> is:
a user-id,
public, which allows all valid users the privilege granted, or
a role.
Granting a privilege on a view does not imply granting any privileges on the underlying
relations.
The grantor of the privilege must already hold the privilege on the specified item (or be the
database administrator).
Privileges in SQL
select: allows read access to a relation, or the ability to query using the view.
Example: grant users U1, U2, and U3 select authorization on the branch relation:
grant select on branch to U1, U2, U3
all privileges: used as a short form for all the allowable privileges
Revoking Authorization in SQL
Example:
revoke select on branch from U1, U2, U3
All privileges that depend on the privilege being revoked are also revoked.
<privilege-list> may be all, to revoke all privileges the revokee may hold.
If the same privilege was granted twice to the same user by different grantors, the user may
retain the privilege after the revocation.
EMBEDDED SQL
The SQL standard defines embeddings of SQL in a variety of programming languages such as
C,Java, and Cobol.
A language to which SQL queries are embedded is referred to as a host language, and the SQL
structures permitted in the host language comprise embedded SQL.
The basic form of these languages follows that of the System R embedding of SQL into PL/I.
EXEC SQL statement is used to identify embedded SQL request to the preprocessor
Note: this varies by language (for example, the Java embedding uses #sql { ... };)
From within a host language, find the names and cities of customers with more than the variable
amount dollars in some account:
EXEC SQL
declare c cursor for
select customer_name, customer_city
from depositor, customer, account
where depositor.customer_name = customer.customer_name
and depositor.account_number = account.account_number
and account.balance > :amount
END_EXEC
The fetch statement causes the values of one tuple in the query result to be placed on host
language variables.
EXEC SQL fetch c into :cn, :cc END_EXEC
Repeated calls to fetch get successive tuples in the query result.
A variable called SQLSTATE in the SQL communication area (SQLCA) gets set to '02000' to
indicate no more data is available.
The close statement causes the database system to delete the temporary relation that holds the
result of the query.
DYNAMIC SQL
Example of the use of dynamic SQL from within a C program:
char *sqlprog = "update account set balance = balance * 1.05 where account_number = ?";
EXEC SQL prepare dynprog from :sqlprog;
char account[10] = "A-101";
EXEC SQL execute dynprog using :account;
The dynamic SQL program contains a ?, which is a placeholder for a value that is provided
when the SQL program is executed.
ODBC (Open Database Connectivity) works with C, C++, C#, and Visual Basic
VIEWS
A relation that is not of the conceptual model but is made visible to a user as a virtual relation
is called a view.
A view is defined using the create view statement, which has the form
create view v as <query expression>
where <query expression> is any legal SQL expression and the view name is represented by v.
Once a view is defined, the view name can be used to refer to the virtual relation that the view
generates.
USES OF VIEWS
Consider a user who needs to know a customer's name, loan number and branch name, but has
no need to see the loan amount.
Define a view
Grant the user permission to read cust_loan_data, but not borrower or loan
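A sketch of such a view, assuming the textbook schemas borrower(customer_name, loan_number) and loan(loan_number, branch_name, amount):
create view cust_loan_data as
select customer_name, borrower.loan_number, branch_name
from borrower, loan
where borrower.loan_number = loan.loan_number;
grant select on cust_loan_data to U1;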
PROCESSING OF VIEWS
When a view is created the query expression is stored in the database along with the view name
A view relation v1 is said to depend directly on a view relation v2 if v2 is used in the expression
defining v1
A view relation v1 is said to depend on view relation v2 if either v1 depends directly on v2 or there
is a path of dependencies from v1 to v2.
VIEW EXPANSION
Let view v1 be defined by an expression e1 that may itself contain uses of view relations. A view
is expanded as follows:
repeat
find any view relation vi in e1;
replace the view relation vi by the expression defining vi;
until no more view relations are present in e1
As long as the view definitions are not recursive, this loop will terminate.
DATABASE LANGUAGES
In many DBMSs, where no strict separation of levels is maintained, one language, called the data
definition language (DDL), is used by the DBA and by database designers to define both
schemas.
In DBMSs where a clear separation is maintained between the conceptual and internal levels, the
DDL is used to specify the conceptual schema only. Another language, the storage definition
language (SDL), is used to specify the internal schema.
The mappings between the two schemas may be specified in either one of these languages.
For a true three-schema architecture a third language, the view definition language (VDL), would
be needed to specify user views and their mappings to the conceptual schema, but in most DBMSs
the DDL is used to define both conceptual and external schemas. Once the database schemas are
compiled and the database is populated with data, users must have some means to manipulate the
database. The DBMS provides a set of operations or a language called the data manipulation
language (DML); manipulations include retrieval, insertion, deletion, and modification of the
data.
QUERY PROCESSING
The aims of query processing are to transform a query written in a high-level language, typically
SQL, into a correct and efficient execution strategy expressed in a low-level language
(implementing the relational algebra), and to execute the strategy to retrieve the required data.
The steps involved in processing a query are:
Parsing and translation
Optimization
Evaluation
Before query processing can begin, the system must translate the query into a usable internal form.
Thus, the first action the system must take in query processing is to translate a given query into
its internal form. In generating the internal form of the query, the parser checks the syntax of the
user's query, verifies that the relation names appearing in the query are names of relations in
the database, and so on.
The system constructs a parse-tree representation of the query, which it then translates into a
relational-algebra expression.
Example:
Consider the query
Select balance from account where balance < 2500.
This query can be translated into either of the following relational-algebra expressions:
σ_balance < 2500(Π_balance(account))
Π_balance(σ_balance < 2500(account))
To specify fully how to evaluate a query, we need not only to provide the relational algebra
expression, but also to annotate it with instructions specifying how to evaluate each operation.
A relational-algebra operation annotated with instructions on how to evaluate it is called an
evaluation primitive.
A sequence of primitive operations that can be used to evaluate a query is a query-execution plan
or query-evaluation plan. In an evaluation plan, a particular index may be specified for the
selection operation, for example.
The query-execution engine takes a query-evaluation plan, executes that plan, and returns the
answers to the query.
It is the responsibility of the system to construct a query-evaluation plan that minimizes the cost
of query evaluation. This task is called query optimization.
In order to optimize a query, a query optimizer must know the cost of each operation. Although
the exact cost is hard to compute, since it depends on many parameters such as actual memory
available to the operation, it is possible to get a rough estimate of execution cost for each
operation.
SORTING ALGORITHM
Relations that do not fit in main memory can be sorted with the external sort-merge algorithm
(M denotes the number of memory blocks available).
1. In the first stage, a number of sorted runs are created:
i = 0;
repeat
read M blocks of the relation, or the rest of the relation;
sort the in-memory part of the relation;
write the sorted data to run file Ri;
i = i + 1;
until the end of the relation
2. In the second stage, the runs are merged. The merge stage operates as follows:
read one block of each of the N run files Ri into a buffer page in memory;
repeat
choose the first tuple in sort order among all buffer pages;
write the tuple to the output, and delete it from the buffer page;
if the buffer page of any run Ri is empty and not end-of-file(Ri) then
read the next block of Ri into the buffer page;
until all buffer pages are empty
DATABASE TUNING
Tuning begins by monitoring statistics such as the times required for different phases of query
and transaction processing. These and other statistics create a profile of the contents and use of
the database. Other information obtained from monitoring the database system activities and
processes includes:
Storage statistics
Index statistics.
How to allocate resources such as disks, RAM and processes for most efficient utilization.
Tuning indexes
The initial choice of indexes may have to be revised for the following reasons.
Certain queries may take too long to run for lack of an index.
Certain indexes may be causing excessive overhead because the index is on an attribute that
undergoes frequent changes.
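A hedged sketch of both adjustments (index, table, and column names are assumptions, and the exact drop index syntax varies by DBMS):
-- add an index for a query that runs too long without one
create index idx_account_branch on account (branch_name);
-- drop an index on a frequently updated attribute to cut overhead
drop index idx_account_balance;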
If a given physical database design does not meet the expected objectives, we may revert to the
logical database design, make adjustments to the logical schema, and remap it to a new set of
physical tables and indexes.
If the processing requirements are dynamically changing, the design needs to respond by
making changes to the conceptual schema if necessary and to reflect those changes in the
logical schema and physical design. These changes may be of the following nature.
Existing tables may be joined because certain attributes from two or more tables are frequently
needed together.
For the given set of tables, there may be alternative design choices, all of which achieve 3NF or
BCNF. One may be replaced by the other.
The query plan shows that relevant indexes are not being used.
TWO MARKS WITH ANSWER
2. Define Merge-join.
The merge-join algorithm can be used to compute natural joins and equi-joins.
Define Query optimization.
Query optimization refers to the process of finding the lowest-cost method of evaluating a given
query.
query.
6.Define Aggregate Functions.
Aggregate functions are functions that take a collection of values as input and return a single
value. SQL offers five built-in aggregate functions:
Average: avg
Minimum: min
Maximum: max
Total: sum
Count: count
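A small sketch using these functions, assuming the account table used elsewhere in these notes:
select branch_name, avg(balance), count(*)
from account
group by branch_name;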
12.Define Assertions.
An assertion is a predicate expressing a condition that we wish the database to always satisfy.
E.g. create assertion <assertion-name> check <predicate>
13.Define Triggers.
A trigger is a statement that is executed automatically by the system as a side effect of a
modification to the database. To design a trigger mechanism, we must meet two requirements:
1. Specify the conditions under which the trigger is to be executed.
2. Specify the actions to be taken when the trigger executes.
14.Define Catalog
The catalog is the place where all of the various schemas (external, conceptual, internal) and all
of the corresponding mappings are kept. The catalog contains detailed information, called
descriptor information or metadata, regarding the various objects of the system.
15.Define Types.
A type is defined as a set of values.
16 MARKS
d. For each publisher, find the names of employees who have borrowed more than five books of
that publisher.
3. Explain Embedded and Dynamic SQL.
UNIT -3
Transaction States
Active
This is the initial state; the transaction stays in this state while it is executing.
Partially committed
A transaction is in this state when it has executed the final statement.
Failed
A transaction is in this state once the normal execution of the transaction cannot proceed.
Aborted
A transaction is said to be aborted when the transaction has rolled back and the database is being
restored to the consistent state prior to the start of the transaction.
Committed
A transaction is in the committed state once it has been successfully executed and the database is
transformed into a new consistent state.
A transaction starts in the active state. A transaction contains a group of statements that form a
logical unit of work. When the transaction has finished executing the last statement, it enters the
partially committed state. At this point the transaction has completed execution, but it is still
possible that it may have to be aborted, because the actual output may still be in main
memory and a hardware failure can still prevent successful completion. The database system
then writes enough information to the disk. When the last of this information is written, the
transaction enters the committed state.
A transaction enters the failed state once the system determines that the transaction can no longer
proceed with its normal execution. This could be due to hardware failures or logical errors. Such
a transaction should be rolled back. When the rollback is complete, the transaction enters the
aborted state. When a transaction aborts, the system has two options as follows:
Restart the transaction
Kill the transaction
ACID properties
There are properties that all transactions should possess. The four basic, or so-called ACID,
properties of a transaction are:
Atomicity: The 'all or nothing' property. A transaction is an indivisible unit that is either
performed in its entirety or is not performed at all. It is the responsibility of the recovery
subsystem of the DBMS to ensure atomicity.
Consistency: A transaction must transform the database from one consistent state to another
consistent state. It is the responsibility of both the DBMS and the application developers to
ensure consistency. The DBMS can ensure consistency by enforcing all the constraints that have
been specified on the database schema, such as integrity and enterprise constraints. However in
itself this is insufficient to ensure consistency.
Example:
If a transaction that is intended to transfer money from one bank account to another debits one
account but credits the wrong account, because the programmer made an error in the transaction
logic, then the database is left in an inconsistent state.
Isolation: Transactions execute independently of one another, i.e. the partial effects of
incomplete transactions should not be visible to other transactions. It is the responsibility of the
concurrency control subsystem to ensure isolation.
Durability: The effects of a successfully completed transaction are permanently recorded in
the database and must not be lost because of a subsequent failure. It is the responsibility of the
recovery subsystem to ensure durability.
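A minimal SQL sketch of the transfer transaction used as the running example (table and column names are assumptions, and transaction-control syntax varies by DBMS); atomicity means either both updates take effect or neither does:
begin transaction;
update account set balance = balance - 50 where account_number = 'A';
update account set balance = balance + 50 where account_number = 'B';
commit; -- on a failure before this point, the system rolls both updates back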
DO performs the operation and records the 'before' and 'after' values in the transaction log.
UNDO reverses an operation, using the log entries written by the DO portion of the sequence.
REDO redoes an operation, using the log entries written by the DO portion of the sequence.
To ensure that the DO, UNDO, and REDO operations can survive a system crash while they are
being executed, a write-ahead protocol is used. The write-ahead protocol forces the log entry to
be written to permanent storage before the actual operation takes place.
The two-phase commit protocol defines the operations between two types of nodes: The
coordinator and one or more subordinates, or cohorts. The participating nodes agree on a
coordinator. Generally, the coordinator role is assigned to the node that initiates the transaction.
Phase 1: Preparation
1) The coordinator sends a PREPARE TO COMMIT message to all subordinates.
2) The subordinates receive the message, write the transaction log using the write-ahead
protocol, and send an acknowledgement (YES / PREPARED TO COMMIT or NO / NOT
PREPARED) message to the coordinator.
3) The coordinator makes sure that all nodes are ready to commit, or it aborts the action.
If all nodes are PREPARED TO COMMIT, the transaction goes to phase-2. If one or more nodes
reply NO or NOT PREPARED, the coordinator broadcasts an ABORT message to all
subordinates.
Phase 2: The Final COMMIT
1) The coordinator broadcasts a COMMIT message to all subordinates and waits for the replies.
2) Each subordinate receives the COMMIT message, then updates the database using the DO
protocol.
If one or more subordinates did not COMMIT, the coordinator sends an ABORT message,
thereby forcing them to UNDO all changes.
The objective of the two-phase commit is to ensure that all nodes commit their part of the
transaction, otherwise, the transaction is aborted. If one of the nodes fails to commit, the
information necessary to recover the database is in the transaction log, and the database can be
recovered with the DO-UNDO-REDO protocol.
LOCKING
Locking is a procedure used to control concurrent access to data: when one transaction is
accessing the database, a lock may deny access to other transactions to prevent incorrect results.
A transaction must obtain a read or write lock on a data item before it can perform a read or write
operation.
The read lock is also called a shared lock. The write lock is also known as an exclusive lock. The
lock depending on its types gives or denies access to other operations on the same data item.
The basic rules for locking are
If a transaction has a read lock on a data item, it can read the item but not update it.
If a transaction has a read lock on a data item, other transactions can obtain a read lock on the
data item, but no write locks.
If a transaction has a write lock on a data item, it can both read and update the data item.
If a transaction has a write lock on a data item, then other transactions cannot obtain either a
read lock or a write lock on the data item.
If the data item for which the lock is requested is not already locked, the transaction is granted
the requested lock.
If the item is currently locked, the DBMS determines what kind of lock is the current one. The
DBMS also finds out what lock is requested.
If a read lock is requested on an item that is already under a read lock, then the request will
be granted.
If a read lock or a write lock is requested on an item that is already under a write lock, then the
request is denied and the transaction must wait until the lock is released.
A transaction continues to hold the lock until it explicitly releases it either during execution or
when it terminates.
The effects of a write operation will be visible to other transactions only after the write lock is
released.
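A sketch of explicit locking from SQL (the select ... for update idiom is widely supported but its syntax varies by DBMS; the account table is an assumption):
begin transaction;
select balance from account where account_number = 'A-101' for update; -- obtains a write (exclusive) lock
update account set balance = balance - 50 where account_number = 'A-101';
commit; -- the write lock is released when the transaction terminates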
Live lock
Suppose a transaction T2 has a shared lock on a data item and another transaction T1 requests
an exclusive lock on the same data item. T1 will have to wait until T2 releases the lock.
Meanwhile, another transaction T3 requests a shared lock on the data item. Since the lock request
of T3 is compatible with the lock granted to T2, T3 will be granted the shared lock on the data
item. At this point, even if T2 releases the lock, T1 will have to wait until T3 also releases the
lock. The transaction T1 can wait for an exclusive lock endlessly if other transactions continue to
request and acquire shared locks on the data item. The transaction T1 is then starved (or in live
lock), as it is not making any progress.
Two phase locking protocol requires that each transaction issue lock and unlock requests in two
phases:
1. Growing phase
A transaction may obtain locks, but may not release any lock.
2. Shrinking phase
A transaction may release locks, but may not obtain any new locks.
Initially, a transaction is in the growing phase. The transaction acquires locks as needed. Once
the transaction releases a lock, it enters the shrinking phase, and it can issue no more lock
requests.
The point in the schedule where the transaction has obtained its final lock (the end of its growing
phase) is called the lock point of the transaction.
Another variant of two-phase locking is the rigorous two-phase locking protocol, which
requires that all locks be held until the transaction commits.
If lock conversion is allowed, then upgrading of locks (from read-locked to write-locked) must
be done during the growing phase, and downgrading of locks (from write-locked to read-locked)
must be done in the shrinking phase.
Strict two-phase locking and rigorous two-phase locking (with lock conversions) are used
extensively in commercial database systems.
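As a small illustration, the following Python sketch checks whether one transaction's sequence of lock and unlock actions obeys the two-phase rule; the helper name and the action encoding are assumptions for illustration.

# Hypothetical helper: verify that one transaction's sequence of
# ('lock'/'unlock', item) actions obeys the two-phase rule, i.e. all
# lock requests precede the first unlock.
def obeys_two_phase(actions):
    shrinking = False
    for action, item in actions:
        if action == 'unlock':
            shrinking = True          # growing phase is over
        elif action == 'lock' and shrinking:
            return False              # a lock after an unlock: violation
    return True

print(obeys_two_phase([('lock', 'A'), ('lock', 'B'),
                       ('unlock', 'A'), ('unlock', 'B')]))  # True
print(obeys_two_phase([('lock', 'A'), ('unlock', 'A'),
                       ('lock', 'B')]))                     # False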
INTENT LOCKING
In the concurrency-control schemes, each individual data item is used as the unit on which
synchronization is performed.
There are circumstances, however, where it would be advantageous to group several data items
and to treat them as one synchronization unit.
Example:
If a transaction Tj needs to access the entire database, and a locking protocol is used, then Tj
must lock each item in the database; clearly, executing these locks is time-consuming. It would
be better if Tj could issue a single lock request to lock the entire database. On the other hand, if
transaction Tj needs to access only a few data items, it should not be required to lock the entire
database, since otherwise concurrency is lost.
Granularity
Granularity is the size of data items chosen as the unit of protection by a concurrency control
protocol.
Hierarchy of granularity
The granularity of locks is represented in a hierarchical structure where each node represents
data items of different sizes.
The multiple granularity locking (MGL) protocol consists of the following rules.
1. It must observe the lock compatibility function.
2. It must lock the root of the tree first, and can lock it in any mode.
3. It can lock a node N in S or IS mode only if it currently has the parent of node N locked in
either IX or IS mode.
4. It can lock a node N in X, SIX, or IX mode only if it currently has the parent of node N locked
in either IX or SIX mode.
5. It can lock a node only if it has not previously unlocked any node.
6. It can unlock a node N only if it currently has none of the children of node N locked.
The multiple - granularity protocol requires that locks be acquired in top-down (root - to - leaf)
order, whereas locks must be released in bottom - up (leaf -to - root) order.
To ensure serializability with locking at multiple granularity levels, the multiple-granularity
protocol is used together with two-phase locking; a sketch of the parent-lock checks follows.
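The parent-lock conditions in rules 3 and 4 above can be sketched in Python as follows; the data structures ('held', 'parent') are illustrative assumptions, not a real lock-table layout.

# Sketch of the parent-lock checks in rules 3 and 4 of the MGL
# protocol. 'held' maps (txn, node) -> lock mode; 'parent' maps a
# node to its parent in the granularity tree. Names are illustrative.
def may_lock(txn, node, mode, held, parent):
    p = parent.get(node)
    if p is None:                      # the root may be locked in any mode
        return True
    pmode = held.get((txn, p))
    if mode in ('S', 'IS'):            # rule 3
        return pmode in ('IS', 'IX')
    if mode in ('X', 'SIX', 'IX'):     # rule 4
        return pmode in ('IX', 'SIX')
    return False

# A transaction holding IX on the root may lock a child file in X:
print(may_lock('T1', 'file1', 'X', {('T1', 'db'): 'IX'}, {'file1': 'db'}))  # True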
DEADLOCK
Deadlock occurs when each transaction T in a set of two or more transactions is waiting for some
item that is locked by some other transaction T' in the set.
There is only one way to break deadlock: abort one or more of the transactions. This usually
involves undoing all the changes made by the aborted transaction(s).
There are three general techniques for handling deadlock:
Timeouts
Deadlock prevention
Deadlock detection
Recovery.
Timeouts
A transaction that requests a lock will wait for only a system-defined period of time. If the lock
has not been granted within this period, the lock request times out. In this case, the DBMS
assumes the transaction may be deadlocked, even though it may not be, and it aborts and
automatically restarts the transaction.
Deadlock Prevention
Another possible approach to deadlock prevention is to order transactions using transaction
timestamps.
The Wait-Die algorithm allows only an older transaction to wait for a younger one; otherwise,
the transaction is aborted (dies) and restarted with the same timestamp, so that eventually it will
become the oldest active transaction and will not die.
Wound-Wait allows only a younger transaction to wait for an older one. If an older transaction
requests a lock held by a younger one, the younger one is aborted (wounded).
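Both prevention schemes reduce to a simple timestamp comparison, sketched below in Python; smaller timestamps denote older transactions, and the function names and return strings are assumptions for illustration.

# Sketch of both timestamp-based deadlock-prevention schemes.
# A smaller timestamp means an older transaction.
def wait_die(requester_ts, holder_ts):
    # An older requester may wait for a younger holder; a younger
    # requester dies (is aborted, keeping its original timestamp).
    return 'wait' if requester_ts < holder_ts else 'abort requester'

def wound_wait(requester_ts, holder_ts):
    # An older requester wounds (aborts) the younger holder; a
    # younger requester is allowed to wait.
    return 'abort holder' if requester_ts < holder_ts else 'wait'

print(wait_die(1, 5))    # wait: the older transaction waits
print(wound_wait(1, 5))  # abort holder: the younger holder is wounded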
Deadlock detection and Recovery
Deadlock detection is usually handled by the construction of a wait-for graph (WFG) that
shows the transaction dependencies: transaction Ti is dependent on Tj if transaction Tj holds
the lock on a data item that Ti is waiting for. Deadlock exists if and only if the WFG contains
a cycle.
When a detection algorithm determines that a deadlock exists, the system must recover from the
deadlock. The most common solution is to roll back one or more transactions to break the
deadlock.
Starvation occurs when the same transaction is always chosen as the victim, and the transaction
can never complete.
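A minimal Python sketch of this detection step: represent the wait-for graph as a dictionary and search it for a cycle with depth-first search. The graph representation is an assumption for illustration.

# Deadlock detection sketch: the wait-for graph is a dict mapping
# each transaction to the set of transactions it waits for.
def has_cycle(wfg):
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {t: WHITE for t in wfg}
    def visit(t):
        colour[t] = GREY
        for u in wfg.get(t, ()):
            if colour.get(u, WHITE) == GREY:
                return True            # back edge: a cycle, hence deadlock
            if colour.get(u, WHITE) == WHITE and visit(u):
                return True
        colour[t] = BLACK
        return False
    return any(colour[t] == WHITE and visit(t) for t in list(colour))

# T1 waits for T2 and T2 waits for T1: deadlock.
print(has_cycle({'T1': {'T2'}, 'T2': {'T1'}}))  # True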
SERIALIZABILITY
Non serial schedule is a schedule where the operations from a set of concurrent transactions are
interleaved.
The objective of serializability is to find non serial schedules that allow transactions to execute
concurrently without interfering with one another, and there by produce a database state that
could be produced by a serial execution.
Conflict serializability:
In serializability, the ordering of read and write operations is important:
If two transactions only read a data item, they do not conflict and order is not important.
If two transactions either read or write completely separate data items, they do not conflict and
order is not important.
If one transaction writes a data item and another either reads or writes the same data item, the
order of execution is important.
The instructions Ii and Ij conflict if they are operations by different transactions on the same
data item, and at least one of these instructions is a write operation.
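A small Python sketch of testing conflict serializability: build the precedence graph from a schedule and then check it for a cycle (the has_cycle routine sketched in the deadlock section above would serve). The schedule encoding as (transaction, operation, item) triples is an assumption for illustration.

# Build the precedence graph of a schedule given as a list of
# (txn, op, item) triples, where op is 'r' or 'w'.
def precedence_graph(schedule):
    edges = set()
    for i, (ti, op1, x1) in enumerate(schedule):
        for tj, op2, x2 in schedule[i + 1:]:
            # two operations conflict if they are by different
            # transactions, touch the same item, and one is a write
            if ti != tj and x1 == x2 and 'w' in (op1, op2):
                edges.add((ti, tj))
    return edges

# r1(A) w2(A) w1(A): edges T1->T2 and T2->T1, a cycle, so the
# schedule is not conflict-serializable.
print(precedence_graph([('T1', 'r', 'A'), ('T2', 'w', 'A'), ('T1', 'w', 'A')]))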
View serializability:
The schedules S and S' are said to be view equivalent if the following conditions are met:
For each data item x, if transaction Ti reads the initial value of x in schedule S, then
transaction Ti must, in schedule S', also read the initial value of x.
For each data item x, if transaction Ti executes read(x) in schedule S, and if that value was
produced by a write(x) operation executed by transaction Tj, then the read(x) operation of
transaction Ti must, in schedule S', also read the value of x that was produced by the same
write(x) operation of transaction Tj.
For each data item x, the transaction that performs the final write(x) operation in schedule S
must perform the final write(x) operation in schedule S'.
1.What is Recovery?
Recovery means to restore the database to a correct state after some failure has rendered the
current state incorrect or suspect
2.What is a Transaction?
A transaction is a logical unit of work. It begins with BEGIN TRANSACTION
and ends with COMMIT or ROLLBACK.
4.What is Correctness?
The database must always be consistent, which is defined as not violating any known integrity
constraint. The DBMS can enforce consistency, but not correctness.
9.What is Concurrency?
Concurrency ensures that database transactions are performed concurrently without violating the
data integrity of the respective databases.
10.What is transaction?
A transaction is a unit of program execution that accesses and possibly updates various data
items. A transaction usually results from the execution of a user program written in a high-level
data-manipulation language or programming language, and is delimited by statements of the
form begin transaction and end transaction. The transaction consists of all operations executed
between the begin and end of the transaction.
13.What is Locking?
A transaction locks a portion of the database to prevent concurrency problems.
Exclusive lock (write lock): locks out all other transactions.
Shared lock (read lock): locks out writes, but allows other reads.
14.What is Deadlock?
Strict two-phase locking may result in deadlock if two transactions each take a shared lock
before one of them tries to take an exclusive lock, or if the second one tries to take an exclusive
lock where the first already has a shared lock, and the first in turn is waiting for additional
shared locks.
16.What is Serializability?
An interleaved execution is considered correct if and only if it is serializable.
A set of transactions is serializable if and only if it is guaranteed to produce the same result as
when each transaction is completed prior to the following one being started.
UNIT -4
RAID
RAID systems are used for their higher reliability and higher performance rate, rather than for
economic reasons. Another key justification for RAID use is easier management and operations.
Improvement of reliability via redundancy
If we store only one copy of the data, then each disk failure will result in loss of a significant
amount of data.
The solution to the problem of reliability is to introduce redundancy, i.e., some extra information
that is not needed normally, but that can be used in the event of a disk failure to rebuild the
lost information. Thus, even if a disk fails, data are not lost, and the effective mean time to
failure is increased.
The simplest (expensive) approach to introducing redundancy is to duplicate every disk. This
technique is called mirroring.
Mean time to repair is the time it takes to replace a failed disk and to restore the data on it.
With disk mirroring, the rate at which read requests can be handled is doubled, since read
requests can be sent to either disk. The transfer rate of each read is the same as in a single-disk
system, but the number of reads per unit time has doubled.
With multiple disks, the transfer rate can be improved as well by striping data across multiple
disks. In its simplest form, data striping consists of splitting the bits of each byte across multiple
disks. Such striping is called bit-level striping. Block-level striping stripes blocks across
multiple disks. There are two main goals of parallelism in a disk system:
Load-balance multiple small accesses, so that the throughput of such accesses increases.
Parallelize large accesses so that the response time of large accesses is reduced.
RAID levels
Mirroring provides high reliability, but it is expensive. Striping provides high data - transfer
rates, but does not improve reliability. Various alternative schemes aim to provide redundancy at
lower cost by combining disk striping with "parity" bits. The schemes are classified into RAID
levels.
RAID level 0
RAID level 0 uses data striping at the level of blocks and has no redundant data (such as
mirroring or parity bits); it therefore has the best write performance, since updates do not have
to be duplicated. However, its reliability is poor, since the failure of a single disk results in
data loss.
RAID level 1
RAID level 1 refers to disk mirroring with block striping. Its read performance is better than
that of RAID level 0. Performance improvement is possible by scheduling a read request to the
disk with the shortest expected seek and rotational delay.
RAID level 2
RAID level 2 uses memory-style redundancy by using Hamming codes, which contain parity bits
for distinct overlapping subsets of components. If one of the bits in the byte gets damaged, the
parity of the byte changes and thus will not match the stored parity. Similarly, if the stored parity
bit gets damaged, it will not match the computed parity.
The disks labeled P store the error-correction bits. If one of the disks fails, the remaining bits of
the byte and the associated error-correction bits can be read from other disks, and can be used to
reconstruct the damaged data.
RAID level 3
Bit inter leaved parity organization, improves on level 2 by exploiting the fact that disk,
controllers, can detect whether a sector has been read correctly, so a single parity bit can be used
for error correction.
If one of the sectors gets damaged, the system knows exactly which sector it is, and, for each bit
in the sector, the system can figure out whether it is a 1 or a 0 by computing the parity of the
corresponding bits from sectors in the other disks. If the parity of the remaining bits is equal
to the stored parity, the missing bit is 0; otherwise, it is 1.
RAID level 3 supports a lower number of I/O operations per second, since every disk has to
participate in every I/O request.
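The parity computation behind levels 3, 4, and 5 is a bytewise XOR, so any one lost block can be rebuilt from the parity and the surviving blocks. A minimal Python sketch follows; the block contents are made-up sample data.

# Parity-based reconstruction as used in RAID levels 3-5: the parity
# block is the bytewise XOR of the data blocks, so a lost block is
# the XOR of the parity with the surviving blocks.
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, byte_tuple)
                 for byte_tuple in zip(*blocks))

data = [b'\x01\x02', b'\x0f\x00', b'\xa0\xff']   # three data disks
parity = xor_blocks(data)
# Disk 1 fails: rebuild its block from the parity and the other disks.
rebuilt = xor_blocks([parity, data[0], data[2]])
assert rebuilt == data[1]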
RAID level 4
RAID level 4, block-interleaved parity organization, uses block-level striping and keeps a parity
block on a separate disk for corresponding blocks from N other disks. If one of the disks fails,
the parity block can be used with the corresponding blocks from the other disks to restore the
blocks of the failed disk.
Multiple read accesses can proceed in parallel, leading to a higher overall I/O rate.
A single write requires four disk accesses: two to read the old data block and the old parity
block, and two to write the new data block and the new parity block.
RAID level 5
RAID level 5, block-interleaved distributed parity, improves on level 4 by partitioning data and
parity among all N + 1 disks. In level 5, all disks can participate in satisfying read requests, so
level 5 increases the total number of requests that can be met in a given amount of time. For each
set of N logical blocks, one of the disks stores the parity, and the other N disks store the blocks.
RAID level 6
RAID level 6, the P + Q redundancy scheme, is much like RAID level 5, but stores extra
redundant information to guard against multiple disk failures. Instead of parity, level 6 uses
error-correcting codes. In this scheme, 2 bits of redundant data are stored for every 4 bits of
data, and the system can tolerate two disk failures.
Choice of RAID level
The factors to be taken into account in choosing a RAID level are
Monetary cost of extra disk-storage requirements.
FILE ORGANIZATION
The order in which records are stored and accessed in the file is dependent on the file
organization.
The physical arrangement of data in a file into records and pages on secondary storage is called
file organization.
The main types of file organization are:
Heap files
Sequential (ordered) files
Hash files
Heap files
Records are placed on disk in no particular order.
Records are placed in the file in the same order as they are inserted. A new record is inserted in
the last page of the file. If there is insufficient space in the last page, a new page is added to the
file.
A linear search must be performed to access a record from the file until the required record is
found.
To delete a record, the required page first has to be retrieved, the record marked as deleted, and
the page written back to disk.
Heap files are one of the best organizations for bulk loading data into a table, as records are
inserted at the end of the sequence.
Sequential (ordered files)
Records are ordered by the value of specified fields.
A binary search must be performed to access a record, as follows:
Retrieve the mid-page of the file and check whether the required record is between the first and
last records of this page. If so, the required record lies on this page and no more pages need to
be retrieved.
If the value of the key field in the first record on the page is greater than the required value,
the required record occurs on an earlier page, so repeat the above steps on the lower half of
the file.
If the value of the key field in the last record on the page is less than the required value, the
required record occurs on a later page, so repeat the above steps on the upper half of the file.
If the record is not found during the binary search, the overflow file has to be searched linearly.
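A minimal Python sketch of this page-wise binary search, modeling a page as a sorted list of keys and the file as a list of pages; the representation is an assumption for illustration.

# Page-wise binary search over an ordered file. Returns the index of
# the page that must hold the key, or None (search the overflow file).
def find_page(pages, key):
    lo, hi = 0, len(pages) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        first, last = pages[mid][0], pages[mid][-1]
        if first <= key <= last:
            return mid                    # the record lies on this page
        if key < first:
            hi = mid - 1                  # earlier page
        else:
            lo = mid + 1                  # later page
    return None

pages = [[2, 5, 9], [12, 15, 20], [22, 30, 31]]
print(find_page(pages, 15))  # 1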
Ordered files are rarely used for database storage unless a primary index is added to the file.
Hash files (Random or direct files)
Records are placed on disk according to a hash function. A hash function calculates the address
of the page in which the record is to be stored based on one or more fields in the record.
The base field is called the hash field, or, if the field is also a key field of the file, the hash key.
The hash function is chosen so that records are as evenly distributed as possible throughout the
file.
Division-remainder hashing: this technique uses the mod function, which takes the field
value, divides it by some predetermined integer value, and uses the remainder of this division as
the disk address.
Each address generated by a hashing function corresponds to a page, or bucket, with slots for
multiple records. Within a bucket, records are placed in order of arrival. When the same address
is generated for two or more records, it is called a collision, and the records are called
synonyms.
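A small Python sketch of division-remainder hashing; the bucket count of 97 is an assumed example value, chosen as a prime near the intended file size.

NUM_BUCKETS = 97          # assumed number of pages, for illustration

def hash_address(hash_field_value):
    # page (bucket) address = field value modulo the bucket count
    return hash_field_value % NUM_BUCKETS

# Two records whose key values hash to the same address are synonyms:
print(hash_address(1234))        # 70
print(hash_address(1234 + 97))   # 70 again: a collision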
There are several techniques that can be used to manage collisions:
Open addressing
Unchained overflow
Chained overflow
Multiple hashing
Open addressing
If a collision occurs, the system performs a linear search to find the first available slot to insert
the new record.
Unchained overflow
Instead of searching for a free slot, an overflow area is maintained for collisions that cannot be
placed at the hash address.
Chained overflow
An overflow area is maintained for collisions that cannot be placed at the hash address, and each
bucket has an additional field, called a synonym pointer, that indicates whether a collision has
occurred and, if so, points to the overflow page used. If the pointer is zero, no collision has
occurred.
Multiple hashing
An alternative approach to collision management is to apply a second hashing function if the first
one results in a collision. The aim is to produce a new hash address that will avoid a collision.
The second hashing function is generally used to place records in an overflow area.
Indices whose search key specifies an order different from the sequential order of the file are
called nonclustering indices, or secondary indices.
Files that are ordered sequentially on some search key, with a clustering index on that search
key, are called index-sequential files. There are several types of ordered indexes:
Primary index
Clustering index
Secondary index
Primary indexes
A primary index is an ordered file whose records are of fixed length with two fields. The first
field is the primary key of the data file, and the second field is a pointer to a disk block (a block
address).
There is one index entry (or index record) in the index file for each block in the data file. Each
index record has the value of the primary key field for the first record in a block and a pointer to
that block as its two field values: i = <K(i), P(i)>. The first record in each block of the data file
is called the anchor record of the block, or block anchor.
Indexes can also be characterized as dense or sparse.
A dense index has an index entry for every search key value in the data file.
A sparse (or nondense) index has index entries for only some of the search values.
A primary index is hence a nondense (sparse) index, since it includes an entry for each disk
block of the data file, with the key of its anchor record, rather than an entry for every search
value.
To retrieve a record, given the value K of its primary key field, do a binary search on the index
file to find the appropriate index entry i, and then retrieve the data file block whose address
is P(i).
Clustering indexes
If the records of a file are physically ordered on a non-key field, that field is called the
clustering field. We can create a different type of index, called a clustering index, to speed up
retrieval of records that have the same value for the clustering field.
A clustering index is also an ordered file with two fields. The first field is of the same type as the
clustering field of the data file, and the second field is a block pointer.
This differs from a primary index, which requires that the ordering field of the data file have a
distinct value for each record.
Record insertion and deletion still cause problems, because the data records are physically
ordered. To alleviate the problem of insertion, it is common to reserve
a whole block for each value of the clustering field. All records with that value are placed in the
block.
A secondary index is also an ordered file similar to a primary index. However, whereas the data
file associated with a primary index is sorted on the index key, the data file associated with a
secondary index may not be sorted on the indexing key. Further, the secondary index key need
not contain unique values.
There are several techniques for handling non-unique secondary indexes.
Produce a dense secondary index that maps on to all records in the data file, thereby allowing
duplicate key values to appear in the index.
Allow the secondary index to have an index entry for each distinct key value, but allow the
block pointers to be multi-valued, with an entry corresponding to each duplicate key value in the
data file.
Allow the secondary index to have an index entry for each distinct key value. However, the
block pointer would not point to the data file but to a bucket that contains pointers to the
corresponding records in the data file.
The secondary index may be on a field which is a candidate key and has a unique value in
every record, or a non key with duplicate values.
A secondary index structure on a key field that has a distinct value for every record: such a
field is sometimes called a secondary key. In this case there is one index entry for each record in
the data file, which contains the value of the secondary key for the record and a pointer either to
the block in which the record is stored or to the record itself. Hence, such an index is dense.
The index is an ordered file with two fields. The first field is of the same data type as some non
ordering field of the data file that is an indexing field. The second field is either a block pointer
or a record pointer.
Multilevel indexes
When an index file becomes large and extends over many pages, the search time for the required
index entry increases.
B + TREE
A binary tree has order 2, in which each node has no more than two children. The rules for a B+
tree are as follows:
If the root is not a leaf node, it must have at least two children.
For a tree of order n, each node except the root and leaf nodes must have between n/2 and n
pointers and children. If n/2 is not an integer, the result is rounded up.
For a tree of order n, the number of key values in a leaf node must be between (n-1)/2 and
(n-1). If (n-1)/2 is not an integer, the result is rounded up.
The number of key values contained in a non-leaf node is 1 less than the number of pointers.
The tree must always be balanced: every path from the root node to a leaf must have the same
length.
Leaf nodes are linked in order of key values.
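The occupancy rules above can be stated directly in code; this small Python sketch checks the pointer and key-count bounds for a node of a B+ tree of order n.

# Occupancy checks for a B+ tree of order n: internal nodes hold
# between ceil(n/2) and n pointers, and leaves hold between
# ceil((n-1)/2) and n-1 key values.
import math

def internal_ok(num_pointers, n, is_root=False):
    if is_root:
        return 2 <= num_pointers <= n
    return math.ceil(n / 2) <= num_pointers <= n

def leaf_ok(num_keys, n):
    return math.ceil((n - 1) / 2) <= num_keys <= n - 1

print(internal_ok(3, 5))   # True: ceil(5/2) = 3 pointers is the minimum
print(leaf_ok(1, 5))       # False: a leaf of order 5 needs at least 2 keys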
HASHING
In static hashing the hash address space is fixed when the file is created. The term bucket denotes
a unit of storage that can store one or more records.
A hash function h is a function from K to B, where K denotes the set of all search-key values
and B denotes the set of all bucket addresses.
Hash functions
The worst possible hash function maps all search-key values to the same bucket.
An ideal hash function distributes the stored keys uniformly across all the buckets, so that every
bucket has the same number of records.
Choose a hash function that assigns search-key values to buckets in such a way that the
distribution has these qualities:
1. The distribution is uniform.
2. The distribution is random.
A poorly chosen hash function may result in a non-uniform distribution of search keys.
Bucket overflow can be handled by using overflow buckets. If a record must be inserted into a
bucket b, and b is already full, the system provides an overflow bucket for b and inserts the
record into the overflow bucket. If the overflow bucket is also full, the system provides another
overflow bucket, and so on. All the overflow buckets of a given bucket are chained together in a
linked list. Overflow handling using such a linked list is called overflow chaining.
Lookup algorithm
The system uses the hash function on the search key to identify a bucket b. The system must
examine all the records in bucket b to see whether they match the search key. If bucket b has
overflow buckets, the system must examine the records in all the overflow buckets also.
Closed hashing means the set of buckets is fixed and there are overflow chains.
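A minimal Python sketch of closed hashing with overflow chaining, following the lookup algorithm above; the Bucket class and its capacity are assumptions for illustration.

# Closed hashing with overflow chaining: each bucket holds a fixed
# number of records and links to an overflow bucket when full.
class Bucket:
    def __init__(self, capacity=2):
        self.records = []
        self.capacity = capacity
        self.overflow = None              # chain of overflow buckets

def insert(buckets, h, key, record):
    b = buckets[h(key)]
    while len(b.records) >= b.capacity:   # bucket full: follow the chain
        if b.overflow is None:
            b.overflow = Bucket()
        b = b.overflow
    b.records.append((key, record))

def lookup(buckets, h, key):
    b, hits = buckets[h(key)], []
    while b is not None:                  # examine the bucket and all overflows
        hits += [r for k, r in b.records if k == key]
        b = b.overflow
    return hits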
In open hashing, the set of buckets is fixed, and there are no overflow chains. If a bucket is full,
the system inserts records in the next bucket (in cyclic order) that has space; this policy is called
linear probing.
Open hashing has been used to construct symbol tables for compilers and assemblers, but closed
hashing is preferable for database systems.
Hash indices
Hashing can be used not only for file organization, but also for index structure creation. A hash
index organizes the search keys, with their associated pointers, into a hash file structure.
A distributed database system consists of loosely coupled sites that share no physical component.
Database systems that run on each site are independent of each other.
In a homogeneous distributed database, the sites are aware of each other and agree to cooperate
in processing user requests, and each site surrenders part of its autonomy in terms of the right to
change schemas or software. In a heterogeneous distributed database, sites may not be aware of
each other and may provide only limited facilities for cooperation in transaction processing.
Full replication of a relation is the case where the relation is stored at all sites.
Fully redundant databases are those in which every site contains a copy of the entire database.
Advantages of Replication
Availability: failure of a site containing relation r does not result in unavailability of r if
replicas exist.
Reduced data transfer: relation r is available locally at each site containing a replica of r.
Disadvantages of Replication
Increased complexity of concurrency control: concurrent updates to distinct replicas may lead
to inconsistent data unless special concurrency control mechanisms are implemented.
One solution: choose one copy as the primary copy and apply concurrency control operations on
the primary copy.
Vertical fragmentation : the schema for relation r is split into several smaller schemas
All schemas must contain a common candidate key (or superkey) to ensure lossless join
property.
A special attribute, the tuple-id attribute may be added to each schema to serve as a candidate
key.
Data transparency: the degree to which a system user may remain unaware of the details of how
and where the data items are stored in a distributed system.
Fragmentation transparency
Replication transparency
Location transparency
Centralized scheme: a name server.
Structure: the name server assigns all names; each site maintains a record of local data items;
sites ask the name server to locate non-local data items.
Advantages: satisfies the naming criteria.
Disadvantages: does not scale well, and the name server is a single point of failure.
Alternative to the centralized scheme: each site prefixes its own site identifier to any name that
it generates, e.g., site17.account.
This fulfills the need for unique identifiers and avoids the problems associated with central
control.
Solution:
Create a set of aliases for data items; Store the mapping of aliases to the real names at each site.
The user can be unaware of the physical location of a data item, and is unaffected if the data
item is moved from one site to another.
Transactions may access data at several sites. Each site has a local transaction manager
responsible for:
Participating in coordinating the concurrent execution of the transactions executing at that site.
Coordinating the termination of each transaction that originates at the site, which may result in
the transaction being committed at all sites or aborted at all sites.
Many database applications require data from a variety of preexisting databases located in a
heterogeneous collection of hardware and software platforms
Creates an illusion of logical database integration without any physical database integration
ADVANTAGES
Preservation of investment in existing
hardware
system software
applications
Avoidance of the organizational/political difficulties of physical integration
Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
data can be partitioned and each processor can work independently on its own partition.
Queries are expressed in high level language (SQL, translated to relational algebra)
Different queries can be run in parallel with each other. Concurrency control takes care of conflicts.
Horizontal partitioning: tuples of a relation are divided among many disks such that each tuple
resides on one disk.
Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning:
Let i denote result of hash function h applied to the partitioning attribute value of a tuple. Send tuple to
disk i.
Range partitioning:
Let v be the partitioning attribute value of a tuple, and let [v0, v1, ..., vn-2] be the partitioning
vector. Tuples such that vi <= v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0, and
tuples with v >= vn-2 go to disk n - 1.
E.g., with a partitioning vector [5,11], a tuple with partitioning attribute value of 2 will go to
disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
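The three partitioning strategies can be sketched in a few lines of Python; the range_partition function reproduces the [5,11] example above, and the other two functions are illustrative.

# Horizontal partitioning strategies for n disks.
def round_robin(i, n):                # the i-th tuple goes to disk i mod n
    return i % n

def hash_partition(value, n):         # hash the partitioning attribute
    return hash(value) % n

def range_partition(value, vector):   # vector [5, 11] splits into 3 ranges
    for i, boundary in enumerate(vector):
        if value < boundary:
            return i
    return len(vector)

print(range_partition(2, [5, 11]))    # disk 0
print(range_partition(8, [5, 11]))    # disk 1
print(range_partition(20, [5, 11]))   # disk 2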
INTERQUERY PARALLELISM
Increases transaction throughput; used primarily to scale up a transaction processing system to support
a larger number of transactions per second.
Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even
sequential database systems support concurrent processing.
Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest
version of the data.
INTRAQUERY PARALLELISM
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-
running queries.
Intraoperation parallelism: parallelize the execution of each individual operation in the query.
Most Web documents are hypertext documents formatted via the HyperText Markup Language
(HTML). HTML documents contain
hypertext links to other documents, which can be associated with regions of the text
forms, enabling users to enter data which can then be sent back to the Web server
Web browsers have become the de-facto standard user interface to databases
Avoid the need for downloading/installing specialized code, while providing a good graphical user
interface
Examples: banks, airline and rental car reservations, university course registration and grading, and so on.
1.Define Cache?
The cache is the fastest and most costly form of storage. Cache memory is small; its use is
managed by the computer system hardware.
4.Define RAID.
Redundant arrays of independent disks (RAID) are disk-organization techniques that have been
proposed to address the performance and reliability issues. RAIDs are used for their higher
reliability and higher data transfer rate. The I in RAID, which once stood for inexpensive,
now stands for independent.
In a dense index, an index record appears for every search-key value in the file; each index
record contains the search-key value and a pointer to the first data record with that search-key
value.
The data transfer rate is the rate at which data can be retrieved from or stored to the disk.
25.What are the techniques to be evaluated for both ordered indexing and hashing?
Access types
Access time
Insertion time
Deletion time
Space overhead
35. Define Interoperation Parallelism
16 MARK QUESTIONS
1. How are records represented and organized in files? Explain with a suitable example.
5,3,4,9,7,15,14,21,22,23
UNIT-5
Extend the relational data model by including object orientation and constructs to deal with
added data types.
Allow attributes of tuples to have complex types, including non-atomic values such as nested
relations.
Preserve relational foundations, in particular the declarative access to data, while extending
modeling power.
COMPLEX DATATYPES
Motivation:
Intuitive definition:
allow relations wherever we allow atomic (scalar) values: relations within relations
lastname varchar(20))
final
not final
Note: final and not final indicate whether subtypes can be created
name Name,
address Address,
dateOfBirth date)
METHODS
for CustomerType
begin
end
from customer
INHERITANCE
Subtypes can redefine methods by using overriding method in place of method in the method
declaration
Define a type Department with a field name and a field head which is a reference to the type
Person, with table people as scope:
We can omit the declaration scope people from the type declaration and instead make an addition
to the create table statement:
create table departments of Department
(head with options scope people)
PATH EXPRESSIONS
If department head were not a reference, a join of departments with people would be required to
get at the address
XML
Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML
Documents have tags giving extra information about sections of the document
Users can add new tags, and separately specify how the tag should be handled for display
The ability to specify new tags, and to create nested tag structures make XML a great way to
exchange data, not just documents.
Much of the use of XML has been in data exchange applications, not as a replacement for HTML
E.g.
<bank>
<account>
</account>
<depositor>
</depositor>
</bank>
Examples:
Scientific data
Chemistry: ChemML,
Each application area has its own set of standards for representing information
XML has become the basis for all new generation data interchange formats
Earlier-generation formats were based on plain text with line headers indicating the meaning of
fields; they were tied too closely to low-level document structure (lines, spaces, etc.).
Each XML based standard defines what are valid elements, using
XML Schema
A wide variety of tools is available for parsing, browsing and querying XML documents/data
Wide acceptance, not only in database systems, but also in browsers, tools, and applications
STRUCTURE OF XML
Element: section of data beginning with <tagname> and ending with matching </tagname>
Proper nesting
Improper nesting
Formally: every start tag must have a unique matching end tag, that is in the context of the same
parent element.
Example
<bank-1>
<customer>
<account>
</account>
<account>
</account>
</customer>
.
.
</bank-1>
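As a brief illustration, documents of this shape can be parsed with Python's standard xml.etree.ElementTree module; the element names follow the bank examples in this section, and the sample document is made up.

# Parsing a small bank document with the standard library.
import xml.etree.ElementTree as ET

doc = """<bank>
  <account>
    <account_number>A-102</account_number>
    <branch_name>Perryridge</branch_name>
    <balance>400</balance>
  </account>
</bank>"""

root = ET.fromstring(doc)
for account in root.findall('account'):       # iterate over account elements
    print(account.find('account_number').text,
          account.find('balance').text)       # A-102 400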
NESTING
With multiple orders, customer name and address are stored redundantly
normalization replaces nested structures in each order by foreign key into table storing customer
name and address information
External application does not have direct access to data referenced by a foreign key
Example:
<account>
<account_number> A-102</account_number>
<branch_name> Perryridge</branch_name>
<balance>400 </balance>
</account>
Attributes are specified by name=value pairs inside the starting tag of an element
An element may have several attributes, but each attribute name can only occur once
In the context of documents, attributes are part of markup, while subelement contents are part of
the basic document contents
In the context of data representation, the difference is unclear and may be confusing
<account>
<account_number>A-101</account_number>
</account>
Suggestion: use attributes for identifiers of elements, and use subelements for contents
Same tag name may have different meaning in different organizations, causing confusion on
exchanged documents
Avoid using long unique names all over document by using XML Namespaces
<bank xmlns:FB="http://www.FirstBank.com">
<FB:branch>
<FB:branchname>Downtown</FB:branchname>
</FB:branch>
</bank>
Elements without subelements or text content can be abbreviated by ending the start tag with a
/> and deleting the end tag
To store string data that may contain tags, without the tags being interpreted as subelements, use
CDATA as below
<![CDATA[<account> </account>]]>
Database schemas constrain what information can be stored and the data types of stored values.
XML documents need not have an associated schema, but schemas are important for XML data
exchange; otherwise, a site cannot automatically interpret data received from another site.
Widely used
XML Schema
What subelements can/must occur inside each element, and how many times.
DTD syntax
names of elements, or
Example
Notation:
| - alternatives
+ - 1 or more occurrences
* - 0 or more occurrences
<!DOCTYPE bank [
For each attribute, the DTD specifies its
Name
Type of attribute (e.g. CDATA, ID, IDREF)
Whether it is
mandatory (#REQUIRED)
or neither mandatory nor defaulted (#IMPLIED)
Examples
<!ATTLIST customer
customer_id ID #REQUIRED
Decision-support systems are used to make business decisions, often based on data collected by
on-line transaction-processing systems.
Data analysis tasks are simplified by specialized tools and SQL extensions
Example tasks
For each product category and each region, what were the total sales in the
last quarter and how do they compare with the same quarter last year
Data mining seeks to discover knowledge automatically in the form of statistical rules and
patterns from large databases.
A data warehouse archives information gathered from multiple sources, and stores it under a
unified schema, at a single site.
Important for large businesses that generate data from multiple divisions, possibly
at multiple sites
Data that can be modeled with dimension attributes and measure attributes are called
multidimensional data.
Measure attributes: measure some value and can be aggregated upon.
Dimension attributes: define the dimensions on which measure attributes are viewed,
e.g. the attributes item_name, color, and size of the sales relation.
Values for one of the dimension attributes form the row headers
The SQL:1999 standard actually uses null values in place of all, despite the confusion with
regular null values.
Sometimes called dicing, particularly when values for multiple dimensions are
fixed.
Drill down: The opposite operation - that of moving from coarser-granularity data to finer-
granularity data
E.g. the dimension DateTime can be used to aggregate by hour of day, date, day of week,
month, quarter, or year.
The earliest OLAP systems used multidimensional arrays in memory to store data cubes,
and are referred to as multidimensional OLAP (MOLAP) systems.
OLAP implementations using only relational database features are called relational
OLAP (ROLAP) systems
Hybrid systems, which store some summaries in memory and store the base data and
other summaries in a relational database, are called hybrid OLAP (HOLAP) systems.
Early OLAP systems precomputed all possible aggregates in order to provide online response;
on n dimension attributes, there are 2^n combinations of group by.
It suffices to precompute some aggregates, and compute others on demand from one of the
precomputed aggregates
Can compute aggregate on (item-name, color) from an aggregate on (item-name, color, size)
Can compute aggregates on (item-name, color, size), (item-name, color) and (item-name) using a
single sorting of the base data
The relational representation of the cross-tab that we saw earlier, but with null in place of all,
can be computed using the cube operation.
The grouping() function returns 1 if the value is a null value representing all, and returns 0 in
all other cases.
The decode() function can be used in the select clause to replace such nulls by a value such as all.
select student-id, rank( ) over (order by marks desc) as s-rank from student-marks
Ranking may leave gaps: e.g. if 2 students have the same top mark, both have rank 1, and the
next rank is 3
WINDOWING
E.g.: moving average: Given sales values for each date, calculate for each date the
average of the sales on that day, the previous day, and the next day
E.g. range between 10 preceding and current row: all rows with values between the current
row's value minus 10 and the current row's value.
E.g. Given a relation transaction (account-number, date-time, value), where value is positive for
a deposit and negative for a withdrawal
Find total balance of each account after each transaction on the account
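A minimal Python sketch of the moving-average window from the example above (one preceding row, the current row, one following row), computed over (date, sales) pairs sorted by date; the data are made-up sample values.

# Moving average over a 3-row window: 1 preceding .. 1 following.
def moving_average(sales):
    out = []
    for i, (date, _) in enumerate(sales):
        window = sales[max(0, i - 1): i + 2]          # clipped at the edges
        out.append((date, sum(v for _, v in window) / len(window)))
    return out

print(moving_average([('d1', 10), ('d2', 20), ('d3', 60)]))
# [('d1', 15.0), ('d2', 30.0), ('d3', 40.0)]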
DATAWAREHOUSING
Data sources often store only current data, not historical data
Corporate decision making requires a unified view of all organizational data, including historical
data
A data warehouse is a repository (archive) of information gathered from multiple sources, stored
under a unified schema, at a single site
Shifts decision support query load away from transaction processing systems
DESIGN ISSUES
Keeping warehouse exactly synchronized with data sources (e.g. using two-phase
commit) is too expensive
Schema integration
Data cleansing
Queries on raw data can often be transformed by query optimizer to use aggregate
values
Dimension values are usually encoded using small integers and mapped to full values via
dimension tables
DATAMINING
Data mining is the process of semi-automatically analyzing large databases to find useful
patterns
Predict if a credit card applicant poses a good credit risk, based on some attributes
(income, job type, age, ..) and past history
Classification
Regression formulae
Descriptive Patterns
Associations
Find books that are often bought by similar customers. If a new such
customer buys one such book, suggest the others too.
Clusters
Classification rules for above example could use a variety of data, such as educational level,
salary, age, etc.
E.g. for all persons P, P.degree = masters and P.income > 75,000 => P.credit = excellent
Each internal node of the tree partitions the data into groups based on a
partitioning attribute, and a partitioning condition for the node
Leaf node:
all (or most) of the items at the node belong to the same class, or
The purity of a set S of training instances can be measured quantitatively in several ways.
Gini(S) = 1 - sum over i of (p_i)^2, where p_i is the fraction of the instances in S that belong
to class i.
It reaches its maximum (of 1 - 1/k) if each of the k classes contains the same number of
instances.
Procedure GrowTree(S)
Partition(S);
Procedure Partition(S)
if (purity(S) > dp or |S| < ds) then return;
for each attribute A, evaluate splits on attribute A;
use the best split found (across all attributes) to partition S into S1, S2, ..., Sr;
for i = 1, 2, ..., r: Partition(Si);
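A small Python sketch of the Gini purity measure defined above, with a training set summarized as a list of per-class instance counts.

# Gini measure: 1 minus the sum of squared class fractions.
def gini(class_counts):
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

print(gini([50, 50]))   # 0.5, the maximum 1 - 1/k for k = 2 classes
print(gini([100, 0]))   # 0.0: a pure node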
Bayesian classifiers require
computation of p(d | cj), the probability of instance d occurring in class cj
precomputation of p(cj), the probability of each class
To simplify the task, naive Bayesian classifiers assume attributes have independent
distributions, and thereby estimate p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj).
Each of the p (di | cj ) can be estimated from a histogram on di values for each class
cj
REGRESSION
Given values for a set of variables, X1, X2, ..., Xn, we wish to predict the value of a
variable Y.
One way is to infer coefficients a0, a1, ..., an such that Y = a0 + a1*X1 + a2*X2 + ... + an*Xn.
In general, the process of finding a curve that fits the data is also called curve fitting.
Regression aims to find coefficients that give the best possible fit.
ASSOCIATION RULES
Retail shops are often interested in associations between different items that people buy.
A person who bought the book Database System Concepts is quite likely also to buy
the book Operating System Concepts.
E.g. when a customer buys a particular book, an online shop may suggest
associated books.
Association rules:
E.g. each transaction (sale) at a shop is an instance, and the set of all
transactions is the population
Support is a measure of what fraction of the population satisfies both the antecedent and the
consequent of the rule.
E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers.
The support for the rule milk => screwdrivers is low.
Confidence is a measure of how often the consequent is true when the antecedent is true.
E.g. the rule bread => milk has a confidence of 80 percent if 80 percent of the
purchases that include bread also include milk.
We are generally only interested in association rules with reasonably high support (e.g.
support of 2% or greater)
Naive algorithm
1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e. count how many transactions purchase all items
in the set).
Large itemsets: sets with a high count at the end of the pass
If memory not enough to hold all counts for all itemsets use multiple passes, considering only
some itemsets in each pass.
Optimization: Once an itemset is eliminated because its count (support) is too small none of its
supersets needs to be considered.
Pass 1: count support of all sets with just 1 item. Eliminate those items with low support
Pass i: candidates: every set of i items such that all its i-1 item subsets are large
E.g. if many people purchase bread, and many people purchase cereal, quite a few
would be expected to purchase both
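A minimal Python sketch of one pass of support counting: enumerate candidate itemsets of a given size and keep those whose support meets a threshold. The transactions are made-up sample data, and a full a priori implementation would generate candidates from the previous pass rather than from all items.

# One support-counting pass over a set of transactions.
from itertools import combinations

transactions = [{'bread', 'milk'}, {'bread', 'milk', 'cereal'},
                {'bread'}, {'milk', 'screwdriver'}]

def large_itemsets(transactions, size, min_support):
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for cand in combinations(items, size):
        count = sum(1 for t in transactions if set(cand) <= t)
        if count / n >= min_support:          # keep only large itemsets
            result[cand] = count / n
    return result

print(large_itemsets(transactions, 2, 0.5))  # {('bread', 'milk'): 0.5}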
CLUSTERING
Clustering: Intuitively, finding clusters of points in the given data such that similar points lie in
the same cluster
Group points into k sets (for a given k) such that the average distance of points from the centroid
of their assigned group is minimized
Another metric: minimize average distance between every pair of points in a cluster
Data mining systems aim at clustering techniques that can handle very large data
sets
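A hedged Python sketch of the centroid-based formulation above, using one-dimensional points and a fixed number of iterations for brevity; a production clustering algorithm would handle convergence tests and very large data sets differently.

# Minimal k-means: assign points to the nearest centroid, then move
# each centroid to the mean of its cluster, and repeat.
import random

def k_means(points, k, iterations=10):
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign to nearest centroid
            clusters[min(range(k), key=lambda i: abs(p - centroids[i]))].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = k_means([1, 2, 3, 10, 11, 12], 2)
print(sorted(centroids))   # approximately [2.0, 11.0]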
2. Define a type Department with a field name and a field head which is a reference to the
type Person, with table people as scope:
create type Student
under Person
(degree varchar(20),
department varchar(20))
create type Teacher
under Person
(salary integer,
department varchar(20))
METHODS
5.Define Motivation:
6. Define Intuitive definition.
Allow relations wherever we allow atomic (scalar) values: relations within relations.
7. Define XML
Extensible Markup Language
Derived from SGML (Standard Generalized Markup Language), but simpler to use than
SGML
Documents have tags giving extra information about sections of the document
Extensible, unlike HTML: users can add new tags, and separately specify how the tag should be
handled for display.
A wide variety of tools is available for parsing, browsing and querying XML documents/data
Wide acceptance, not only in database systems, but also in browsers, tools, and applications
12.What is an Element?
An element is a section of data beginning with <tagname> and ending with the matching
</tagname>.
Decision-support systems are used to make business decisions, often based on data collected by
on-line transaction-processing systems.
Data analysis tasks are simplified by specialized tools and SQL extensions
Example tasks
For each product category and each region, what were the total sales in the
last quarter and how do they compare with the same quarter last year
A data warehouse archives information gathered from multiple sources, and stores it under a
unified schema, at a single site.
This is important for large businesses that generate data from multiple divisions, possibly
at multiple sites.
e.g. the attributes item_name, color, and size of the sales relation
16 MARKS