ADBMS

MODULE I: RELATIONAL DATABASES

Introduction

 A database-management system (DBMS) is a collection of interrelated data and a set of programs to access those data.
 The collection of data, usually referred to as the database, contains information relevant to an enterprise.
 The primary goal of a DBMS is to provide a way to store and retrieve database information that is both convenient and efficient.

Purpose of Database System

 Before database-management systems (DBMSs) were introduced, organizations usually stored information in file-processing systems.
 Keeping organizational information in a file-processing system has a number of major disadvantages:
 Data redundancy and inconsistency. In addition, redundancy may lead to data inconsistency.
 Difficulty in accessing data.
 Data isolation.
 Integrity problems. The data values stored in the database must satisfy certain types of consistency constraints.
 Atomicity problems.
 Concurrent-access anomalies.
 Security problems.
View of Data

 A database system is a collection of interrelated data and a set of programs that allow users to access and modify these data.
 A major purpose of a database system is to provide users with an abstract view of the data.
 That is, the system hides certain details of how the data are stored and maintained.

Data Abstraction

 Developers hide the complexity from users through several levels of abstraction, to simplify users' interactions with the system:
 Physical level. The lowest level of abstraction describes how the data are actually stored. The physical level describes complex low-level data structures in detail.

 Logical level. The next-higher level of abstraction describes what data are stored in the database, and what relationships exist among those data.
 The user of the logical level does not need to be aware of the complexity of physical-level structures. This is referred to as physical data independence.
 View level. The highest level of abstraction describes only part of the entire database.
 Many users of the database system do not need all information; instead, they need to access only a part of the database.
 The system may provide many views for the same database.
Instances and Schemas

 Databases change over time as information is inserted and deleted. The collection of information stored in the database at a particular moment is called an instance of the database.
 The overall design of the database is called the database schema.
 Database systems have several schemas, partitioned according to the levels of abstraction.
 The physical schema describes the database design at the physical level, while the logical schema describes the database design at the logical level.
 A database may also have several schemas at the view level, sometimes called subschemas, that describe different views of the database.

Data Models

 Underlying the structure of a database is the data model: a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints.
 A data model provides a way to describe the design of a database at the physical, logical, and view levels.
 The data models can be classified into four different categories:
1. Relational Model.

 The relational model uses a collection of tables to represent both data and the relationships among those data.
 Each table has multiple columns, and each column has a unique name.
 Tables are also known as relations.
 The relational model is an example of a record-based model.

2. Entity-Relationship Model.

 The entity-relationship (E-R) data model uses a collection of basic objects, called entities, and relationships among these objects.
 An entity is a "thing" or "object" in the real world that is distinguishable from other objects.
 The entity-relationship model is widely used in database design.

3. Object-Based Data Model.

 Object-oriented programming (especially in Java, C++, or C#) has become a dominant software-development methodology.
 This led to the development of an object-oriented data model that can be seen as extending the E-R model with notions of encapsulation, methods (functions), and object identity.
 The object-relational data model combines features of the object-oriented data model and the relational data model.

4. Semistructured Data Model.

 The semistructured data model permits the specification of data where individual data items of the same type may have different sets of attributes.
 This is in contrast to the data models mentioned earlier, where every data item of a particular type must have the same set of attributes. The Extensible Markup Language (XML) is widely used to represent semistructured data.
Database Architecture

 Database applications are usually partitioned into two or three parts.

Database Users and Administrators

 A primary goal of a database system is to retrieve information from and store new information into the database. People who work with a database can be categorized as database users or database administrators.

Database Users and User Interfaces

 There are four different types of database-system users, differentiated by the way they expect to interact with the system.
 Different types of user interfaces have been designed for the different types of users.
 1. Naïve users are unsophisticated users who interact with the system by invoking one of the application programs that have been written previously. For example, a clerk in the university.
 The typical user interface for naïve users is a forms interface, where the user can fill in appropriate fields of the form.
 2. Application programmers are computer professionals who write application programs. Application programmers can choose from many tools to develop user interfaces.
 3. Sophisticated users interact with the system without writing programs. Instead, they form their requests either using a database query language or by using tools such as data-analysis software.
 4. Specialized users are sophisticated users who write specialized database applications that do not fit into the traditional data-processing framework.
 Among these applications are computer-aided design systems, knowledge-base and expert systems, systems that store data with complex data types (for example, graphics data and audio data), and environment-modeling systems.

Database Administrator

 One of the main reasons for using DBMSs is to have central control of both the data and the programs that access those data.
 A person who has such central control over the system is called a database administrator (DBA).
 The functions of a DBA include:
 Schema definition. The DBA creates the original database schema by executing a set of data-definition statements in the DDL.
 Storage structure and access-method definition.
 Schema and physical-organization modification. The DBA carries out changes to the schema and physical organization to reflect the changing needs of the organization, or to alter the physical organization to improve performance.
 Granting of authorization for data access. By granting different types of authorization, the database administrator can regulate which parts of the database various users can access. The authorization information is kept in a special system structure that the database system consults whenever someone attempts to access the data in the system.
 Routine maintenance. Examples of the database administrator's routine maintenance activities are:
 Periodically backing up the database, either onto tapes or onto remote servers, to prevent loss of data in case of disasters such as flooding.
 Ensuring that enough free disk space is available for normal operations, and upgrading disk space as required.
 Monitoring jobs running on the database and ensuring that performance is not degraded by very expensive tasks submitted by some users.

The Entity-Relationship Model

 The entity-relationship (E-R) data model was developed to facilitate database design.
 The E-R data model employs three basic concepts: entity sets, relationship sets, and attributes.

Entity Sets

 An entity is a "thing" or "object" in the real world that is distinguishable from all other objects.
 For example, each person in a university is an entity.
 An entity has a set of properties, and the values for some set of properties may uniquely identify an entity.
 For instance, a person may have a person_id property whose value uniquely identifies that person.
 An entity set is a set of entities of the same type that share the same properties, or attributes.
 The set of all people who are instructors at a given university, for example, can be defined as the entity set instructor.
 Similarly, the entity set student might represent the set of all students in the university.
 Entity sets do not need to be disjoint.
 For example, it is possible to define the entity set of all people in a university (person). A person entity may be an instructor entity, a student entity, both, or neither.

Attributes

 An entity is represented by a set of attributes. Attributes are descriptive properties possessed by each member of an entity set.
 Possible attributes of the instructor entity set are ID, name, deptname, and salary.
 Possible attributes of the course entity set are courseid, title, deptname, and credits.
 Each entity has a value for each of its attributes.

Relationship Sets

 A relationship is an association among several entities.
 For example, we can define a relationship advisor that associates instructor James with student Shankar. This relationship specifies that James is an advisor to student Shankar.
 A relationship set is a set of relationships of the same type.
 Formally, it is a mathematical relation on n ≥ 2 (possibly non-distinct) entity sets. If E1, E2, ..., En are entity sets, then a relationship set R is a subset of {(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}, where (e1, e2, ..., en) is a relationship.
 The association between entity sets is referred to as participation; that is, the entity sets E1, E2, ..., En participate in relationship set R.
 The function that an entity plays in a relationship is called that entity's role.

Attributes

 A relationship may also have attributes called descriptive attributes.
 Consider a relationship set advisor with entity sets instructor and student.
 We could associate the attribute date with that relationship to specify the date when an instructor became the advisor of a student.
 For each attribute, there is a set of permitted values, called the domain, or value set, of that attribute.
 The domain of attribute courseid might be the set of all text strings of a certain length.
 An attribute, as used in the E-R model, can be characterized by the following attribute types:
 Simple and composite attributes.
 Single-valued and multivalued attributes.
 Derived attribute.
Simple and composite attributes.

 Simple attributes cannot be divided into subparts.
 Composite attributes can be divided into subparts (that is, other attributes).
 For example, an attribute name could be structured as a composite attribute consisting of first_name, middle_initial, and last_name.

Single-valued and multivalued attributes.

 A single-valued attribute has a single value for a particular entity.
 For instance, the studentID attribute for a specific student entity refers to only one student ID. Such attributes are said to be single valued.
 Multivalued attributes: an attribute has a set of values for a specific entity.
 E.g.: a phone_number attribute.

Derived attribute.

 The value for this type of attribute can be derived from the values of other related attributes or entities.
 E.g.: age.

Constraints

 An E-R enterprise schema may define certain constraints to which the contents of a database must conform.
Mapping Cardinalities

 Mapping cardinalities, or cardinality ratios, express the number of entities to which another entity can be associated via a relationship set.
 For a binary relationship set R between entity sets A and B, the mapping cardinality must be one of the following:
 One-to-one.
 One-to-many.
 Many-to-one.
 Many-to-many.
 One-to-one. An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A.
 One-to-many. An entity in A is associated with any number (zero or more) of entities in B. An entity in B, however, can be associated with at most one entity in A.
 Many-to-one. An entity in A is associated with at most one entity in B. An entity in B, however, can be associated with any number (zero or more) of entities in A.
 Many-to-many. An entity in A is associated with any number (zero or more) of entities in B, and an entity in B is associated with any number (zero or more) of entities in A.

Participation Constraints

 The participation of an entity set E in a relationship set R is said to be total if every entity in E participates in at least one relationship in R.
 If only some entities in E participate in relationships in R, the participation of entity set E in relationship set R is said to be partial.

Keys

 The values of the attributes of an entity must be such that they can uniquely identify the entity.

Entity-Relationship Diagrams

 An E-R diagram can express the overall logical structure of a database graphically.
 Basic Structure
 An E-R diagram consists of the following major components:
 Rectangles divided into two parts represent entity sets. The first part contains the name of the entity set. The second part contains the names of all the attributes of the entity set.
 Undivided rectangles represent the attributes of a relationship set. Attributes that are part of the primary key are underlined.
 Diamonds represent relationship sets.
 Lines link entity sets to relationship sets.
 Dashed lines link attributes of a relationship set to the relationship set.
 Double lines indicate total participation of an entity in a relationship set.
 Double diamonds represent identifying relationship sets linked to weak entity sets.
Roles

Weak Entity Sets

 An entity set that does not have sufficient attributes to form a primary key is termed a weak entity set.
 An entity set that has a primary key is termed a strong entity set.
 For a weak entity set to be meaningful, it must be associated with another entity set, called the identifying or owner entity set.
 Every weak entity must be associated with an identifying entity; that is, the weak entity set is said to be existence dependent on the identifying entity set.
 The identifying entity set is said to own the weak entity set that it identifies.
 The relationship associating the weak entity set with the identifying entity set is called the identifying relationship.
 The identifying relationship is many-to-one from the weak entity set to the identifying entity set, and the participation of the weak entity set in the relationship is total.
 The identifying relationship set should not have any descriptive attributes, since any such attributes can instead be associated with the weak entity set.
 The discriminator of a weak entity set is a set of attributes that allows this distinction to be made.
 The primary key of a weak entity set is formed by the primary key of the identifying entity set, plus the weak entity set's discriminator.
 In E-R diagrams, a weak entity set is depicted via a rectangle, like a strong entity set, but there are two main differences:
 The discriminator of a weak entity is underlined with a dashed, rather than a solid, line.
 The relationship set connecting the weak entity set to the identifying strong entity set is depicted by a double diamond.
Problem

 A company database needs to store information about employees (identified by ssn, with salary and phone as attributes), departments (identified by dno, with dname and budget as attributes), and children of employees (with name and age as attributes).
 Employees work in departments; each department is managed by an employee; a child must be identified uniquely by name when the parent (who is an employee; assume that only one parent works for the company) is known. We are not interested in information about a child once the parent leaves the company.
 Draw an ER diagram that captures this information.

Introduction to the Relational Model

Structure of Relational Databases

 A relational database consists of a collection of tables, each of which is assigned a unique name.
 For example, consider the instructor table, which stores information about instructors. The table has four column headers: ID, name, deptname, and salary.
 Each row of this table records information about an instructor.
 In the relational model, the term relation is used to refer to a table, while the term tuple is used to refer to a row. Similarly, the term attribute refers to a column of a table.
 We use the term relation instance to refer to a specific instance of a relation, i.e., one containing a specific set of rows.
 For each attribute of a relation, there is a set of permitted values, called the domain of that attribute.
 The domain of the name attribute is the set of all possible instructor names.
Database Schema

 We require that, for all relations r, the domains of all attributes of r be atomic.
 A domain is atomic if elements of the domain are considered to be indivisible units.
 The database schema is the logical design of the database, and the database instance is a snapshot of the data in the database at a given instant in time.
 In general, a relation schema consists of a list of attributes and their corresponding domains.
 department(deptname, building, budget)

Keys

 A super key is a set of one or more attributes that, taken collectively, allow us to identify uniquely a tuple in the relation.
 Minimal super keys are called candidate keys.
 We use the term primary key to denote a candidate key that is chosen by the database designer as the principal means of identifying tuples within a relation.
 A relation, say r1, may include among its attributes the primary key of another relation, say r2. This attribute is called a foreign key from r1, referencing r2.

Relational Query Languages

 A query language is a language in which a user requests information from the database.
 These languages are usually on a level higher than that of a standard programming language.
 Query languages can be categorized as either procedural or nonprocedural.
 In a procedural language, the user instructs the system to perform a sequence of operations on the database to compute the desired result.
 In a nonprocedural language, the user describes the desired information without giving a specific procedure for obtaining that information.
The Relational Algebra

 The relational algebra is a procedural query language.
 It consists of a set of operations that take one or two relations as input and produce a new relation as their result.
 The fundamental operations in the relational algebra are select, project, union, set difference, Cartesian product, and rename.
 In addition to the fundamental operations, there are several other operations, namely set intersection, natural join, and assignment.

Fundamental Operations

 The select, project, and rename operations are called unary operations, because they operate on one relation.
 The other three operations operate on pairs of relations and are, therefore, called binary operations.

The Select Operation

 The select operation selects tuples that satisfy a given predicate.
 We use the lowercase Greek letter sigma (σ) to denote selection.
 The predicate appears as a subscript to σ.
 The argument relation is in parentheses after the σ.
 Thus, to select those tuples of the instructor relation where the instructor is in the "Physics" department, we write:
 σ deptname="Physics" (instructor)
 We can find all instructors with salary greater than $90,000 by writing:
 σ salary>90000 (instructor)
 We allow comparisons using =, ≠, <, ≤, >, and ≥ in the selection predicate. Furthermore, we can combine several predicates into a larger predicate by using the connectives and (∧), or (∨), and not (¬).
 Thus, to find the instructors in Physics with a salary greater than $90,000, we write:
 σ deptname="Physics" ∧ salary>90000 (instructor)
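To make the σ notation concrete, here is a minimal Python sketch that filters an in-memory relation with a predicate, just as select does. The instructor rows are invented for illustration:

```python
# Minimal sketch of the select operation; sample rows are invented.
instructor = [
    {"ID": "10101", "name": "Srinivasan", "deptname": "Comp. Sci.", "salary": 65000},
    {"ID": "22222", "name": "Einstein",   "deptname": "Physics",    "salary": 95000},
    {"ID": "33456", "name": "Gold",       "deptname": "Physics",    "salary": 87000},
]

def select(relation, predicate):
    """sigma_predicate(relation): keep only tuples satisfying the predicate."""
    return [t for t in relation if predicate(t)]

# sigma_{deptname="Physics" AND salary>90000}(instructor)
result = select(instructor,
                lambda t: t["deptname"] == "Physics" and t["salary"] > 90000)
print(result)   # only the Einstein tuple survives
```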
The Project Operation

 Projection is denoted by the uppercase Greek letter pi (Π). We list those attributes that we wish to appear in the result as a subscript to Π.
 The argument relation follows in parentheses.
 Π ID,name,salary (instructor)
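Projection can be sketched the same way; one detail worth showing is that, like Π, it must drop duplicate tuples, since relations are sets. The instructor rows below are again invented:

```python
# Minimal sketch of projection; sample rows are invented.
instructor = [
    {"ID": "10101", "name": "Srinivasan", "deptname": "Comp. Sci.", "salary": 65000},
    {"ID": "22222", "name": "Einstein",   "deptname": "Physics",    "salary": 95000},
]

def project(relation, attributes):
    """Pi_attributes(relation): keep the named columns, removing duplicates."""
    seen, result = set(), []
    for t in relation:
        row = tuple(t[a] for a in attributes)
        if row not in seen:            # relations are sets: no duplicate tuples
            seen.add(row)
            result.append(dict(zip(attributes, row)))
    return result

# Pi_{ID, name, salary}(instructor)
print(project(instructor, ["ID", "name", "salary"]))
```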

Composition of Relational Operations

 "Find the name of all instructors in the Physics department."
 Π name (σ deptname="Physics" (instructor))

The Union Operation

 To find the set of all courses taught in the Fall 2009 semester, we write:
 Π courseid (σ semester="Fall" ∧ year=2009 (section))
 To find the set of all courses taught in the Spring 2010 semester, we write:
 Π courseid (σ semester="Spring" ∧ year=2010 (section))
 Courses taught in Fall 2009, Spring 2010, or both:
 Π courseid (σ semester="Fall" ∧ year=2009 (section)) ∪ Π courseid (σ semester="Spring" ∧ year=2010 (section))

The Set-Difference Operation

 The set-difference operation, denoted by −, allows us to find tuples that are in one relation but are not in another. The expression r − s produces a relation containing those tuples in r but not in s.
 Courses offered in the Fall 2009 semester but not in the Spring 2010 semester:
 Π courseid (σ semester="Fall" ∧ year=2009 (section)) − Π courseid (σ semester="Spring" ∧ year=2010 (section))
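Because relations behave as sets of tuples, union and set difference map directly onto Python set operations. A small sketch with an invented section relation:

```python
# Hypothetical section relation: (courseid, semester, year) tuples.
section = {
    ("CS-101", "Fall",   2009),
    ("CS-347", "Fall",   2009),
    ("CS-101", "Spring", 2010),
}

fall_2009   = {c for (c, sem, yr) in section if sem == "Fall"   and yr == 2009}
spring_2010 = {c for (c, sem, yr) in section if sem == "Spring" and yr == 2010}

print(fall_2009 | spring_2010)   # union: taught in either semester
print(fall_2009 - spring_2010)   # difference: Fall 2009 but not Spring 2010
```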

The Cartesian-Product Operation

 The Cartesian-product operation, denoted by a cross (×), allows us to combine information from any two relations. We write the Cartesian product of relations r1 and r2 as r1 × r2.

The Rename Operation

 The rename operator, denoted by the lowercase Greek letter rho (ρ), is written ρ x(E), giving the result of expression E the name x.
Formal Definition of the Relational Algebra

 Let E1 and E2 be relational-algebra expressions. Then, the following are all relational-algebra expressions:
 E1 ∪ E2
 E1 − E2
 E1 × E2
 σP(E1), where P is a predicate on attributes in E1
 ΠS(E1), where S is a list consisting of some of the attributes in E1
 ρx(E1), where x is the new name for the result of E1

Additional Relational-Algebra Operations

The Set-Intersection Operation

 Suppose that we wish to find the set of all courses taught in both the Fall 2009 and the Spring 2010 semesters. Using set intersection, we can write:
 Π courseid (σ semester="Fall" ∧ year=2009 (section)) ∩ Π courseid (σ semester="Spring" ∧ year=2010 (section))
 Note that we can rewrite any relational-algebra expression that uses set intersection by replacing the intersection operation with a pair of set-difference operations:
 r ∩ s = r − (r − s)
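The identity can be checked directly on two small invented sets:

```python
# r ∩ s expressed with two set differences, checked on toy data.
r = {"CS-101", "CS-347"}
s = {"CS-101", "CS-315"}

assert r & s == r - (r - s)   # the rewrite gives the same relation
print(r & s)                  # {'CS-101'}
```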

The Natural-Join Operation

 The natural join is a binary operation that allows us to combine certain selections and a Cartesian product into one operation.
 It is denoted by the join symbol ⋈.
 The natural-join operation forms a Cartesian product of its two arguments, performs a selection forcing equality on those attributes that appear in both relation schemas, and finally removes duplicate attributes.
 "Find the names of all instructors together with the courseid of all courses they taught."
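The three steps just described (product, equality selection, duplicate-column removal) can be sketched in a few lines of Python; the instructor and teaches rows are invented for illustration:

```python
# Minimal natural-join sketch; sample rows are invented.
instructor = [{"ID": "10101", "name": "Srinivasan"},
              {"ID": "22222", "name": "Einstein"}]
teaches    = [{"ID": "10101", "courseid": "CS-101"},
              {"ID": "10101", "courseid": "CS-315"}]

def natural_join(r, s):
    if not r or not s:
        return []
    common = set(r[0]) & set(s[0])        # attributes shared by both schemas
    result = []
    for tr in r:                          # Cartesian product ...
        for ts in s:
            if all(tr[a] == ts[a] for a in common):  # ... equality forced
                result.append({**tr, **ts})          # merged, no duplicate cols
    return result

# "Names of instructors with the courseid of the courses they taught"
print([(t["name"], t["courseid"]) for t in natural_join(instructor, teaches)])
```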
The Assignment Operation

 It is convenient at times to write a relational-algebra expression by assigning parts of it to temporary relation variables.
 The assignment operation, denoted by ←, works like assignment in a programming language.

OUTER JOINs

 Notice that much of the data can be lost when applying a join to two relations. In some cases this lost data might hold useful information. An outer join retains the information that would have been lost from the tables, replacing missing data with nulls.
 There are three forms of the outer join, depending on which data is to be kept:
 LEFT OUTER JOIN - keep data from the left-hand table
 RIGHT OUTER JOIN - keep data from the right-hand table
 FULL OUTER JOIN - keep data from both tables
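SQLite supports LEFT OUTER JOIN, so a short sqlite3 session can show the NULL padding; the tables and rows below are invented:

```python
import sqlite3

# Invented toy tables to show how LEFT OUTER JOIN pads missing rows with NULL.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE instructor (ID TEXT, name TEXT);
    CREATE TABLE teaches    (ID TEXT, courseid TEXT);
    INSERT INTO instructor VALUES ('10101', 'Srinivasan'), ('76543', 'Singh');
    INSERT INTO teaches    VALUES ('10101', 'CS-101');
""")

# Singh teaches nothing, so an inner join would drop him; the outer join
# keeps him and fills courseid with NULL (None in Python).
for row in con.execute("""
        SELECT name, courseid
        FROM instructor LEFT OUTER JOIN teaches
        ON instructor.ID = teaches.ID"""):
    print(row)   # ('Srinivasan', 'CS-101') then ('Singh', None)
```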
 1. Write a relational algebra expression that returns the food items required to cook the recipe "Pasta and Meat-balls". For each such food item, return the item paired with the number of ounces required by the recipe.
 2. Write a relational algebra expression that returns the food items that are sold at "Aldi" and their price.
 3. Write a relational algebra expression that returns food items (item) that are of type "Wheat product" or of type "Meat" and have at least 20 calories per ounce (attribute calories).

Module II

DATABASE DESIGN

NORMALIZATION

 Normalization is the process of efficiently organizing data in a database.
 E.F. Codd proposed the concept of normalization.
 Normalization removes redundant data from the tables to improve storage efficiency, data integrity, and scalability.

Need for normalization

 Normalization is the process of converting a relation into a standard form.
 The problems in an unnormalized relation are as follows:
 Data redundancy
 Update anomalies
 Deletion anomalies
 Insertion anomalies

Data redundancy:

 In an unnormalized table design, some information may be stored repeatedly.
 In the student table example below, the branch information, HOD, and office telephone number are repeated.
 This information is known as redundant data.

Functional Dependency

 A functional dependency (FD) is a relationship between two attributes, typically between the PK and other non-key attributes within a table.
 For any relation R, attribute Y is functionally dependent on attribute X (usually the PK) if, for every valid instance of X, that value of X uniquely determines the value of Y.
 This relationship is indicated by the representation below:
 X ———–> Y
 The left side of the FD is called the determinant, and the right side is the dependent.
 SIN ———-> Name, Address, Birthdate
 SIN determines Name, Address, and Birthdate.
 SIN, Course ———> DateCompleted
 SIN and Course determine the date completed (DateCompleted). This must also work for a composite PK.
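Whether an FD X → Y actually holds in a given relation instance can be checked mechanically: group the tuples by their X-value and verify that each group agrees on Y. A small sketch with invented rows:

```python
# Check whether the FD X -> Y holds in a relation instance.
def fd_holds(relation, X, Y):
    seen = {}
    for t in relation:
        key = tuple(t[a] for a in X)
        val = tuple(t[a] for a in Y)
        if seen.setdefault(key, val) != val:   # same X-value, different Y-value
            return False
    return True

# Invented sample rows.
rows = [
    {"SIN": 1, "Name": "Ann", "Course": "DB", "DateCompleted": "2020"},
    {"SIN": 1, "Name": "Ann", "Course": "OS", "DateCompleted": "2021"},
]
print(fd_holds(rows, ["SIN"], ["Name"]))                     # True
print(fd_holds(rows, ["SIN"], ["DateCompleted"]))            # False
print(fd_holds(rows, ["SIN", "Course"], ["DateCompleted"]))  # True
```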

Types of Functional Dependency

1. Trivial functional dependency

 A → B is a trivial functional dependency if B is a subset of A.
 Consider a table with two columns, Employee_Id and Employee_Name.
 {Employee_Id, Employee_Name} → Employee_Id is a trivial functional dependency, as Employee_Id is a subset of {Employee_Id, Employee_Name}.
2. Non-trivial functional dependency

 A → B is a non-trivial functional dependency if B is not a subset of A.
 When A ∩ B is empty, A → B is called completely non-trivial.
 ID → Name

Inference Rules

 Armstrong's axioms are a set of inference rules used to infer all the functional dependencies on a relational database.
 They were developed by William W. Armstrong.
 Axiom of reflexivity: if Y is a subset of X, then X determines Y.
 Axiom of augmentation: if X determines Y, then XZ determines YZ for any Z.
 Axiom of transitivity: if X determines Y, and Y determines Z, then X must also determine Z.

Prime and non-prime attributes

 The attributes of a candidate key are called prime attributes, and the rest of the attributes of the relation are non-prime.
 Secondary rules: these rules can be derived from the axioms.

Functional Dependency Set

 The functional dependency set, or FD set, of a relation is the set of all FDs present in the relation.
 { STUD_NO -> STUD_NAME, STUD_NO -> STUD_PHONE, STUD_NO -> STUD_STATE, STUD_NO -> STUD_COUNTRY, STUD_NO -> STUD_AGE, STUD_STATE -> STUD_COUNTRY }

Attribute Closure:

 The attribute closure of an attribute set is the set of attributes that can be functionally determined from it.
 If the attribute closure of an attribute set contains all attributes of the relation, the attribute set is a super key of the relation.
 To find the attribute closure of an attribute set (a sketch of this procedure follows the questions below):
 Add the elements of the attribute set to the result set.
 Recursively add to the result set the elements that can be functionally determined from the elements already in the result set.
 Question 1: Given relational schema R(P, Q, R, S, T) and the set of functional dependencies FD = { P->QR, RS->T, Q->S, T->P }, determine the closure (T)+.
 Solution: T+ = { T, P, Q, R, S }.
 Question 2: Consider the relation scheme R = {E, F, G, H, I, J, K, L, M, N} and the set of functional dependencies {E, F} -> {G}, {F} -> {I, J}, {E, H} -> {K, L}, K -> {M}, L -> {N} on R. What is the key for R?
 A. {E, F}
 B. {E, F, H}
 C. {E}
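The closure procedure above is a simple fixed-point loop. The sketch below (the FD representation, a list of left-side/right-side attribute strings, is an assumption) reproduces both worked answers:

```python
# Attribute closure: repeatedly add attributes determined by what we have.
def closure(attrs, fds):
    """attrs: iterable of single-letter attributes; fds: (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

fds = [("P", "QR"), ("RS", "T"), ("Q", "S"), ("T", "P")]
print(sorted(closure("T", fds)))     # ['P', 'Q', 'R', 'S', 'T']

fds2 = [("EF", "G"), ("F", "IJ"), ("EH", "KL"), ("K", "M"), ("L", "N")]
print(sorted(closure("EFH", fds2)))  # all of E..N, so {E, F, H} is a key
```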

 Solution to Question 2: {E, F}+ = {E, F, G, I, J}; {E, F, H}+ = {E, F, H, G, I, J, K, L, M, N}; {E}+ = {E}. Only {E, F, H} determines all attributes of R, so the key is {E, F, H} (option B).

Canonical Cover of Functional Dependencies / Minimal Set of Functional Dependencies

 A canonical cover of a set of functional dependencies F is a simplified set of functional dependencies that has the same closure as the original set F.
 Extraneous attributes: an attribute of a functional dependency is said to be extraneous if we can remove it without changing the closure of the set of functional dependencies.
 A canonical cover Fc of a set of functional dependencies F is a set such that ALL the following properties are satisfied:
 F logically implies all dependencies in Fc.
 Fc logically implies all dependencies in F.
 No functional dependency in Fc contains an extraneous attribute.
 Each left side of a functional dependency in Fc is unique.

Finding Canonical Cover

 Let F = {A → B, A → C, BC → D}. Can A determine D uniquely?
 Solution: yes. A+ = {A, B, C, D}, since A → B and A → C give A → BC, and BC → D; so A determines D.

 Consider a relation scheme R = (A, B, C, D, E, H) on which the following functional dependencies hold: {A -> B, BC -> D, E -> C, D -> A}. What are the candidate keys of R? [GATE 2005]
 (a) AE, BE
 (b) AE, BE, DE
 (c) AEH, BEH, BCH
 (d) AEH, BEH, DEH
 Answer: (d). Neither E nor H is determined by anything, so both must appear in every key; AEH, BEH, and DEH each have a closure covering all of R.
Module III

TRANSACTION MANAGEMENT & CONCURRENCY CONTROL

Transactions

 Collections of operations that form a single logical unit of work are called transactions.
 A database system must ensure proper execution of transactions despite failures: either the entire transaction executes, or none of it does.
 A transaction is a unit of program execution that accesses and possibly updates various data items.

What is a Transaction?

 Any action that reads from and/or writes to a database may consist of:
 A simple SELECT statement to generate a list of table contents
 A series of related UPDATE statements to change the values of attributes in various tables
 A series of INSERT statements to add rows to one or more tables
 A combination of SELECT, UPDATE, and INSERT statements
 A transaction is a logical unit of work that must be either entirely completed or aborted.
 A successful transaction changes the database from one consistent state to another, one in which all data integrity constraints are satisfied.
 Most real-world database transactions are formed by two or more database requests, where a database request is the equivalent of a single SQL statement in an application program or transaction.
Evaluating Transaction Results

 Not all transactions update the database.
 SQL code represents a transaction because the database was accessed.
 Improper or incomplete transactions can have a devastating effect on database integrity.
 Some DBMSs provide means by which a user can define enforceable constraints based on business rules.
 Other integrity rules are enforced automatically by the DBMS when table structures are properly defined, thereby letting the DBMS validate some transactions.

Transaction Properties (ACID properties)

 Atomicity
 Requires that all operations (SQL requests) of a transaction be completed; if not, then the transaction is aborted.
 A transaction is treated as a single, indivisible, logical unit of work.
 This "all-or-none" property is referred to as atomicity.
 Consistency
 The consistency property ensures that the database must remain in a consistent state before the start of the transaction and after the transaction is over.
 Consistency states that only valid data will be written to the database.
 If for some reason a transaction is executed that violates the database consistency rules, the entire transaction will be rolled back.

 Isolation
 Data used during the execution of a transaction cannot be used by a second transaction until the first one is completed.
 Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished.

Durability

 After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
 Durability can be implemented by writing all transactions into a transaction log that can be used to recreate the system state right before the failure.
 A transaction can only be regarded as committed after it is written safely in the log.
 For example, in an application that transfers funds from one account to another, the durability property ensures that the changes made to each account will not be reversed.
 These properties are called the ACID properties.
Transaction State

 A transaction must be in one of the following states:
 Active: the initial state; the transaction stays in this state while it is executing.
 Partially committed: after the final statement has been executed.
 Failed: after the discovery that normal execution can no longer proceed.
 Aborted: after the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction.
 Committed: after successful completion.

Transaction Management with SQL

 ANSI has defined standards that govern SQL database transactions.
 Transaction support is provided by two SQL statements: COMMIT and ROLLBACK.
 ANSI standards require that, when a transaction sequence is initiated by a user or an application program, it must continue through all succeeding SQL statements until one of the four events listed below occurs.
1. A COMMIT statement is reached: all changes are permanently recorded within the database.
2. A ROLLBACK statement is reached: all changes are aborted and the database is restored to a previous consistent state.
3. The end of the program is successfully reached: equivalent to a COMMIT.
4. The program abnormally terminates and a rollback occurs.

The Transaction Log

 The transaction log keeps track of all transactions that update the database. It contains:
 A record for the beginning of the transaction
 For each transaction component (SQL statement):
   - the type of operation being performed (update, delete, insert)
   - the names of objects affected by the transaction (the name of the table)
   - "before" and "after" values for updated fields
   - pointers to the previous and next transaction log entries for the same transaction
 The ending (COMMIT) of the transaction
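A short sqlite3 session (with an invented account table) illustrates events 1 and 2: COMMIT making a change permanent, and ROLLBACK restoring the previous consistent state:

```python
import sqlite3

# Invented toy schema: COMMIT makes a change durable, ROLLBACK undoes it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (acct TEXT PRIMARY KEY, balance INTEGER)")
con.execute("INSERT INTO account VALUES ('A', 300)")
con.commit()      # event 1: COMMIT, the insert is now permanent

con.execute("UPDATE account SET balance = balance - 50 WHERE acct = 'A'")
con.rollback()    # event 2: ROLLBACK, the update is aborted

print(con.execute("SELECT balance FROM account").fetchone())  # (300,)
```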

 Logging increases processing overhead, but the ability to restore a corrupted database is worth the price.
 If a system failure occurs, the DBMS will examine the log for all uncommitted or incomplete transactions and will restore the database to a previous state.
 The log is itself a database, and to maintain its integrity many DBMSs will implement it on several different disks to reduce the risk of system failure.
Transactions and Schedules

 A transaction is seen by the DBMS as a series, or list, of actions.
 Actions include reads and writes of database objects.
 Assume that an object O is always read into a program variable that is also named O.
 Denote transaction T reading an object O as RT(O), and similarly writing as WT(O).
 Each transaction must specify as its final action either commit or abort: AbortT and CommitT.
 A schedule is a list of actions from a set of transactions.
 A schedule represents an actual or potential execution sequence.
 A schedule that contains either abort or commit for each transaction is called a complete schedule.
 If transactions are executed from start to finish, one by one, the schedule is a serial schedule.

Example schedule (one plausible reading of the original two-column layout):

  T1        T2
  R(A)
  W(A)
            R(B)
            W(B)
  R(C)
  W(C)
Concurrent Execution of Transactions

 Transaction processing systems usually allow multiple transactions to run concurrently.
 Allowing multiple transactions to run concurrently and to update data concurrently causes several complications with the consistency of data.
 Ensuring consistency in the presence of concurrency requires extra work.
 Two reasons to allow concurrency are:
 Improved throughput and resource utilization (throughput: the number of transactions that can be executed in a given amount of time).
 Reduced waiting time.
 There may be a mix of transactions running on a system, some short and some long.
 If transactions are run serially, a short transaction may have to wait for a preceding long transaction to complete, which can lead to unpredictable delays in running a transaction.
 But concurrent execution reduces the unpredictable delays in running transactions.

TYPES OF SCHEDULE
1. Serial Schedule
2. Non-serial Schedule
3. Serializable Schedule
1. Serial Schedule

 A serial schedule is a type of schedule where one transaction is executed completely before starting another transaction.
 In a serial schedule, when the first transaction completes its cycle, then the next transaction is executed.
 For example: suppose there are two transactions T1 and T2 which have some operations. If there is no interleaving of operations, then there are the following two possible outcomes:
 Execute all the operations of T1, followed by all the operations of T2.
 Execute all the operations of T2, followed by all the operations of T1.
 In figure (a), Schedule A shows the serial schedule where T1 is followed by T2.
 In figure (b), Schedule B shows the serial schedule where T2 is followed by T1.

2. Non-serial Schedule / Concurrent Execution

 If interleaving of operations is allowed, then there will be a non-serial schedule.
 It contains many possible orders in which the system can execute the individual operations of the transactions.
 In figures (c) and (d), Schedule C and Schedule D are non-serial schedules; they have interleaving of operations.
Problems with Concurrent Execution

 In a database transaction, the two main operations are READ and WRITE. So there is a need to manage these two operations in the concurrent execution of transactions.
 The following problems occur with the concurrent execution of operations:
 Lost Update Problem (W-W Conflict)
 Dirty Read Problem (W-R Conflict)
 Unrepeatable Read Problem (R-W Conflict) / Inconsistent Retrievals Problem

Problem 1: Lost Update Problem (W-W Conflict)

 The problem occurs when two different database transactions perform read/write operations on the same database items in an interleaved manner (i.e., concurrent execution) that makes the values of the items incorrect, hence making the database inconsistent.
 Consider the diagram below, where two transactions TX and TY are performed on the same account A, where the balance of account A is $300.
 At time t1, transaction TX reads the value of account A, i.e., $300 (only read).
 At time t2, transaction TX deducts $50 from account A, which becomes $250 (only deducted, not yet updated/written).
 Alternately, at time t3, transaction TY reads the value of account A, which will be $300 only, because TX didn't update the value yet.
 At time t4, transaction TY adds $100 to account A, which becomes $400 (only added, not yet updated/written).
 At time t6, transaction TX writes the value of account A, which will be updated as $250 only, as TY didn't update the value yet.
 Similarly, at time t7, transaction TY writes the value of account A, so it will write the value computed at time t4, i.e., $400. This means the value written by TX is lost, i.e., $250 is lost.
 Hence the data becomes incorrect and the database is left inconsistent.

Dirty Read Problem (W-R Conflict) / Uncommitted Data

 The dirty read problem occurs when one transaction updates an item of the database and then fails, and before the data gets rolled back, the updated database item is accessed by another transaction. This is the write-read conflict between the two transactions.

 At time t1, transaction TX reads the value of account A, i.e., $300.
 At time t2, transaction TX adds $50 to account A, which becomes $350.
 At time t3, transaction TX writes the updated value in account A, i.e., $350.
 Then at time t4, transaction TY reads account A, which will be read as $350.
 Then at time t5, transaction TX rolls back due to a server problem, and the value changes back to $300 (as initially).
 But the value for account A remains $350 for transaction TY, which has already committed; this is the dirty read, and therefore this is known as the Dirty Read Problem.
Unrepeatable Read Problem (R-W Conflict) / Inconsistent Retrievals Problem

 Also known as the Inconsistent Retrievals Problem, which occurs when, within one transaction, two different values are read for the same database item.

Serializability

 When multiple transactions run concurrently, it may give rise to inconsistency of the database.
 Serializability is a concept that helps to identify which non-serial schedules are correct and will maintain the consistency of the database.
 If a given schedule of n transactions is found to be equivalent to some serial schedule of n transactions, then it is called a serializable schedule.

Difference between Serial Schedules and Serializable Schedules

 The only difference between serial schedules and serializable schedules is that:
 In serial schedules, only one transaction is allowed to execute at a time, i.e., no concurrency is allowed.
 Whereas in serializable schedules, multiple transactions can execute simultaneously, i.e., concurrency is allowed.
Types of Serializability

Conflict Serializability

 A schedule is called conflict serializable if it can be transformed into a serial schedule by swapping non-conflicting operations.
 Let us consider a schedule S in which there are two consecutive instructions Ii and Ij of transactions Ti and Tj, respectively (i ≠ j).
 If Ii and Ij refer to different data items, then we can swap Ii and Ij without affecting the results of any instruction in the schedule.
 However, if Ii and Ij refer to the same data item Q, then the order of the two steps may matter. There are four cases we need to consider:
 Ii = read(Q), Ij = read(Q): the order of Ii and Ij does not matter.
 Ii = read(Q), Ij = write(Q): if Ii comes before Ij, then Ti does not read the value of Q that is written by Tj in instruction Ij; thus the order of Ii and Ij matters.
 Ii = write(Q), Ij = read(Q): the order of Ii and Ij matters.
 Ii = write(Q), Ij = write(Q): the order of Ii and Ij does not matter for these two instructions; however, the value obtained by the next read(Q) instruction is affected.
 If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting instructions, we say that S and S' are conflict equivalent.
 A schedule S is conflict serializable if it is conflict equivalent to a serial schedule.
 Two operations are said to be conflicting if all the following conditions are satisfied:
 They belong to different transactions.
 They operate on the same data item.
 At least one of them is a write operation.

Precedence Graph

 A precedence graph, or serialization graph, is commonly used to test the conflict serializability of a schedule.
 It is a directed graph (V, E) consisting of a set of nodes V = {T1, T2, T3, ..., Tn} and a set of directed edges E = {e1, e2, e3, ..., em}.
 The graph contains one node for each transaction Ti.
 An edge ei is of the form Tj -> Tk, where Tj is the starting node of ei and Tk is the ending node of ei.
 An edge ei is constructed from node Tj to node Tk if one of the operations in Tj appears in the schedule before some conflicting operation in Tk.
ALGORITHM

 Create a node T in the graph for each participating transaction in the schedule.
 Check for conflicting instructions in the schedule:
 For the conflicting operations read_item(X) and write_item(X) (RW conflict): if a transaction Tj executes write_item(X) after Ti executes read_item(X), draw an edge from Ti to Tj in the graph.
 For the conflicting operations write_item(X) and read_item(X) (WR conflict): if a transaction Tj executes read_item(X) after Ti executes write_item(X), draw an edge from Ti to Tj in the graph.
 For the conflicting operations write_item(X) and write_item(X) (WW conflict): if a transaction Tj executes write_item(X) after Ti executes write_item(X), draw an edge from Ti to Tj in the graph.
 The schedule S is serializable if there is no cycle in the precedence graph.
 If there is no cycle in the precedence graph, it means we can construct a serial schedule S' which is conflict equivalent to the schedule S.
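The algorithm translates directly into code: collect an edge for every pair of conflicting operations, then test the graph for a cycle. A sketch, assuming schedules are represented as (transaction, action, item) triples; it is applied to the two schedules S1 and S2 from the exercise that appears later in this module:

```python
# Build the precedence graph of a schedule and test it for a cycle.
def precedence_edges(schedule):
    """schedule: list of (transaction, action, item); action is 'R' or 'W'."""
    edges = set()
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "W" in (ai, aj):
                edges.add((ti, tj))   # earlier conflicting op -> later op
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    visiting, done = set(), set()
    def dfs(u):                       # depth-first search for a back edge
        visiting.add(u)
        for v in graph.get(u, ()):
            if v in visiting or (v not in done and dfs(v)):
                return True
        visiting.discard(u)
        done.add(u)
        return False
    return any(u not in done and dfs(u) for u in list(graph))

S1 = [("T1", "R", "X"), ("T1", "R", "Y"), ("T2", "R", "X"),
      ("T2", "R", "Y"), ("T2", "W", "Y"), ("T1", "W", "X")]
S2 = [("T1", "R", "X"), ("T2", "R", "X"), ("T2", "R", "Y"),
      ("T2", "W", "Y"), ("T1", "R", "Y"), ("T1", "W", "X")]
print(has_cycle(precedence_edges(S1)))  # True:  S1 is not conflict serializable
print(has_cycle(precedence_edges(S2)))  # False: S2 is conflict serializable
```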

PROBLEM 1

 SOLUTION

 Clearly, there exists a cycle in the precedence graph.

 Therefore, the given schedule S is not conflict serializable.


CONFLICT EQUIVALENT

Example: conflict serializable and conflict equivalent

 Using the precedence graph, we found that schedule 3 is conflict serializable, since no cycles are formed in the graph.
 If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting instructions, then we can say S and S' are conflict equivalent.
 Adjacent non-conflicting pairs are swapped by position.
 Consider schedule 3, which is conflict serializable.
 To find the conflict equivalent of schedule 3, we need to perform certain swaps, i.e., swapping the positions of non-conflicting adjacent instructions in T1 and T2.
 After a series of swaps, we will get a serial schedule which is conflict equivalent to schedule 3.

 Question: Consider the following schedules involving two transactions. Which one of the following statements is true?
 S1: R1(X) R1(Y) R2(X) R2(Y) W2(Y) W1(X)
 S2: R1(X) R2(X) R2(Y) W2(Y) R1(Y) W1(X)
 Both S1 and S2 are conflict serializable
 Only S1 is conflict serializable
 Only S2 is conflict serializable
 None
 Answer: Only S2 is conflict serializable.

View Serializability

 If a given schedule is found to be view equivalent to some serial schedule, then it is called a view serializable schedule.

View Equivalent Schedules

 Consider two schedules S1 and S2, each consisting of two transactions T1 and T2.
 Two schedules S1 and S2 are said to be view equivalent if the conditions below are satisfied:
 1. Initial Read
 2. Updated Read
 3. Final Write

1. Initial Read
 The initial read of both schedules must be the same. Suppose two schedules S1 and S2. In schedule S1, if a transaction T1 is reading the data item A, then in S2, transaction T1 should also read A.

2. Updated Read
 In schedule S1, if Ti is reading A which is updated by Tj, then in S2 also, Ti should read A as updated by Tj.

3. Final Write
 The final write must be the same in both schedules. In schedule S1, if a transaction T1 updates A last, then in S2 the final write operation should also be done by T1.

 Conditions 1 and 2 ensure that each transaction reads the same values in both schedules S1 and S2 (and so performs the same computation).
 Condition 3, together with conditions 1 and 2, ensures that both schedules result in the same final state.
 Every conflict serializable schedule is also view serializable, but not all view serializable schedules are conflict serializable.
 Blind writes appear in any view serializable schedule that is not conflict serializable.
View Serializability

 If a given schedule is found to be view equivalent to some serial schedule, then it is called a view serializable schedule.
 Consider two schedules S1 and S2, each consisting of two transactions T1 and T2. Schedules S1 and S2 are called view equivalent if the following three conditions hold true for them:
 1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S1, then transaction Ti must also read the initial value of Q in schedule S2. ("Initial reads must be the same for all data items.")
 2. If transaction Ti reads a data item that has been updated by transaction Tj in schedule S1, then in schedule S2 also, transaction Ti must read the same data item updated by transaction Tj. ("The write-read sequence must be the same.")
 3. For each data item Q, the transaction that performs the final write(Q) operation in schedule S1 must perform the final write(Q) operation in schedule S2. ("Final writers must be the same for all data items.")

How to check whether a given schedule is view serializable or not?

 Method 1:
 Check whether the given schedule is conflict serializable or not.
 If the given schedule is conflict serializable, then it is surely view serializable.
 If the given schedule is not conflict serializable, then it may or may not be view serializable; check using the other methods.
 Method 2:
 Check if there exists any blind write operation (writing without reading a value is known as a blind write).
 If there does not exist any blind write, then the schedule is surely not view serializable; stop and report your answer.
 If there exists any blind write, then the schedule may or may not be view serializable; check using the other methods.
 Method 3:
 Try to find a view equivalent serial schedule.

EXAMPLE

 To check whether S is view serializable:

SOLUTION

 Step 1: Final updation on data items.
 Step 2: Initial Read. The initial read operation in S is done by T1, and in S1 it is also done by T1.
 Step 3: Final Write. The final write operation in S is done by T3, and in S1 it is also done by T3. So S and S1 are view equivalent.
 The first schedule S1 satisfies all three conditions, so we don't need to check another schedule.
 In both schedules S and S1 there is no read except the initial read, which is why we don't need to check the updated-read condition.
 Hence, the view equivalent serial schedule of S is S1: T1 → T2 → T3.

Irrecoverable Schedules

 If in a schedule,
 a transaction performs a dirty read operation from an uncommitted transaction,
 and commits before the transaction from which it has read the value,
 then such a schedule is known as an irrecoverable schedule.

Example: Irrecoverable schedule

 In the above example:
 T2 performs a dirty read operation.
 T2 commits before T1.
 T1 fails later and rolls back.
 The value that T2 read now stands to be incorrect.
 T2 cannot recover since it has already committed.
 So the above schedule is an irrecoverable schedule.

Recoverable Schedules

 If in a schedule,
 a transaction performs a dirty read operation from an uncommitted transaction,
 and its commit operation is delayed until the uncommitted transaction either commits or rolls back,
 then such a schedule is known as a recoverable schedule.

EXAMPLE: Recoverable Schedules

 In the above example, T2 performs a dirty read operation.
 The commit operation of T2 is delayed until T1 commits or rolls back.
 T1 commits later.
 T2 is now allowed to commit.
 In case T1 had failed, T2 would have had a chance to recover by rolling back.
 Since the commit operation of the transaction that performs the dirty read is delayed, it still has a chance to recover if the uncommitted transaction fails later.

Recoverable schedules are of two types:
 Cascading schedule
 Cascadeless schedule

CASCADING SCHEDULE

 Even if a schedule is recoverable, to recover correctly from the failure of a transaction Ti, we may have to roll back several transactions.
 Such situations occur if transactions have read data written by Ti.
 In the above example, transaction T8 has been aborted.
 T8 must be rolled back.
 Since T9 is dependent on T8, T9 must be rolled back. Since T10 is dependent on T9, T10 must be rolled back.
 The phenomenon in which a single transaction failure leads to a series of transaction rollbacks is called cascading rollback.
 Cascading rollback is undesirable, since it leads to the undoing of a significant amount of work.

Cascadeless schedule

 A cascadeless schedule is one where, for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj.
 This type of schedule is called a cascadeless schedule.

Module IV

DATA STORAGE AND QUERYING

RAID (Redundant Arrays of Independent Disks)

The data-storage requirements of some applications (in particular Web, database, and multimedia applications) have been growing so fast that a large number of disks are needed to store their data, even though disk-drive capacities have been growing very fast.
Having a large number of disks in a system presents opportunities for improving the rate at which data can be read or written, if the disks are operated in parallel. Several independent reads or writes can also be performed in parallel.
A variety of disk-organization techniques, collectively called redundant arrays of independent disks (RAID), have been proposed to achieve improved performance and reliability.
RAID (redundant array of independent disks) is a way of storing the same data in different places on multiple hard disks or solid-state drives to protect data in the case of a drive failure.
RAID systems store extra information that is not needed normally, but that can be used in the event of failure of a disk to rebuild the lost information.
The effective mean time to failure is thereby increased.
The simplest (but expensive) approach to redundancy is to duplicate every disk. This technique is called mirroring (shadowing).

With mirroring, a logical disk consists of two physical disks, and every write is carried out on both disks. If one of the disks fails, the data can be read from the other. Data will be lost only if the second disk fails before the first failed disk is repaired.
With disk mirroring, the rate at which read requests can be handled is doubled, since read requests can be sent to either disk. The transfer rate of each read is the same as in a single-disk system, but the number of reads per unit time has doubled.
With multiple disks, we can improve the transfer rate as well (or instead) by striping data across multiple disks. In its simplest form, data striping consists of splitting the bits of each byte across multiple disks; such striping is called bit-level striping.

Different types of data striping:

1. Bit-level striping: splitting the bits of each byte across multiple disks.
   The number of disks is either a multiple of 8 or a factor of 8.
   These disks are considered as a single disk.
   E.g.: with an array of eight disks, write bit i of each byte to disk i.

2. Block-level striping: stripes blocks across multiple disks, fetching n blocks in parallel from the n disks.
   It treats the array of disks as a single large disk, and it gives blocks logical numbers; we assume the block numbers start from 0. With an array of n disks, block-level striping assigns logical block i of the disk array to disk (i mod n) + 1; it uses the ⌊i/n⌋th physical block of that disk to store logical block i. For example, with 4 disks, logical block 11 is stored in physical block 2 of disk 4 (since ⌊11/4⌋ = 2).
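The placement rule is easy to tabulate; the helper below, with a hypothetical array of n = 4 disks, reproduces the worked example:

```python
# Block-level striping placement: logical block i on an array of n disks.
def place(i, n):
    disk = (i % n) + 1         # disks numbered 1..n
    physical_block = i // n    # physical block on that disk
    return disk, physical_block

n = 4  # hypothetical array size
for i in range(12):
    disk, blk = place(i, n)
    print(f"logical block {i:2d} -> disk {disk}, physical block {blk}")
# ... logical block 11 -> disk 4, physical block 2
```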

When reading a large file, block-level striping In summary, there are two main goals of
fetches n blocks at a time in parallel from the n parallelism in a disk system:
disks, giving a high data-transfer rate for large 1. Load-balance multiple small accesses (block
reads. accesses), so that the throughput of such accesses
When a single block is read, the data-transfer increases.
rate is the same as on one disk, but the 2. Parallelize large accesses so that the response
remaining n − 1 disks are free to perform other time of large accesses is reduced.
actions.
RAID Levels

Mirroring provides high reliability, but it is expensive. Striping provides high data-transfer rates, but does not improve reliability. Various alternative schemes aim to provide redundancy at lower cost by combining disk striping with "parity" bits. These schemes have different cost-performance trade-offs, and are classified into RAID levels.

RAID level 0 refers to disk arrays with striping at the level of blocks, but without any redundancy (such as mirroring or parity bits).

RAID level 1 refers to disk mirroring with block striping.

RAID level 2, known as memory-style error-correcting-code (ECC) organization, employs parity bits. Memory systems have long used parity bits for error detection and correction.

Each byte in a memory system may have a parity bit associated with it that records whether the number of bits in the byte that are set to 1 is even (parity = 0) or odd (parity = 1). If one of the bits in the byte gets damaged (either a 1 becomes a 0, or a 0 becomes a 1), the parity of the byte changes and thus will not match the stored parity. Similarly, if the stored parity bit gets damaged, it will not match the computed parity. Thus, all 1-bit errors will be detected by the memory system. Error-correcting schemes store 2 or more extra bits, and can reconstruct the data if a single bit gets damaged.

The idea of error-correcting codes can be used directly in disk arrays by striping bytes across disks. For example, the first bit of each byte could be stored in disk 0, the second bit in disk 1, and so on until the eighth bit is stored in disk 7, with the error-correction bits stored in further disks. The disks labeled P store the error-correction bits. If one of the disks fails, the remaining bits of the byte and the associated error-correction bits can be read from other disks and used to reconstruct the damaged data.

RAID level 3, bit-interleaved parity organization, improves on level 2 by exploiting the fact that disk controllers, unlike memory systems, can detect whether a sector has been read correctly, so a single parity bit can be used for error correction as well as for detection. If one of the sectors gets damaged, the system knows exactly which sector it is, and, for each bit in the sector, the system can figure out whether it is a 1 or a 0 by computing the parity of the corresponding bits from sectors in the other disks: if the parity of the remaining bits is equal to the stored parity, the missing bit is 0; otherwise, it is 1.

RAID level 3 is as good as level 2, but is less expensive in the number of extra disks (it has only a one-disk overhead), so level 2 is not used in practice. RAID level 3 also has two benefits over level 1: it needs only one parity disk for several regular disks, whereas level 1 needs one mirror disk for every disk, and thus level 3 reduces the storage overhead.
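A minimal sketch of this recovery idea (Python): XOR-ing the surviving blocks with the stored parity reconstructs the lost block. Byte-wise XOR over toy two-byte blocks stands in for the bit-level computation described above.

    from functools import reduce

    def parity(blocks: list[bytes]) -> bytes:
        """Byte-wise XOR parity over equal-sized blocks."""
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    data = [b"\x0f\xa0", b"\x33\x11", b"\xc4\x5e"]   # three data disks
    p = parity(data)                                  # stored on the parity disk

    # Disk 1 fails: XOR the remaining blocks with the parity to rebuild it.
    rebuilt = parity([data[0], data[2], p])
    assert rebuilt == data[1]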
RAID level 4, block-interleaved parity organization, uses block-level striping, like RAID 0, and in addition keeps a parity block on a separate disk for corresponding blocks from N other disks. If one of the disks fails, the parity block can be used with the corresponding blocks from the other disks to restore the blocks of the failed disk.

RAID level 5, block-interleaved distributed parity, improves on level 4 by partitioning data and parity among all N + 1 disks, instead of storing data in N disks and parity in one disk.

RAID level 6, the P + Q redundancy scheme, is much like RAID level 5, but stores extra redundant information to guard against multiple disk failures. Instead of using parity, level 6 uses error-correcting codes.
File Organization

A database is mapped into a number of different files that are maintained by the underlying operating system. These files reside permanently on disks. A file is organized logically as a sequence of records, and these records are mapped onto disk blocks.

Each file is also logically partitioned into fixed-length storage units called blocks, which are the units of both storage allocation and data transfer. Most databases use block sizes of 4 to 8 kilobytes by default. A block may contain several records; the exact set of records that a block contains is determined by the form of physical data organization being used.

In a relational database, tuples of distinct relations are generally of different sizes. One approach to mapping the database to files is to use several files, and to store records of only one fixed length in any given file. An alternative is to structure our files so that we can accommodate multiple lengths for records.

Fixed-Length Records

As an example, let us consider a file of instructor records for our university database. Each record of this file is defined (in pseudocode) with attributes ID, name, dept_name, and salary. Assume that each character occupies 1 byte and that numeric(8,2) occupies 8 bytes. Suppose that instead of allocating a variable number of bytes for the attributes ID, name, and dept_name, we allocate the maximum number of bytes that each attribute can hold. Then, the instructor record is 53 bytes long. A simple approach is to use the first 53 bytes for the first record, the next 53 bytes for the second record, and so on.
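A runnable sketch of this 53-byte layout in Python; the field widths (5 + 20 + 20 + 8 = 53) are reconstructed from the calculation above, and the sample values are illustrative:

    import struct

    # ID varchar(5), name varchar(20), dept_name varchar(20), salary numeric(8,2),
    # each stored at its maximum size; '=' disables alignment padding.
    RECORD_FMT = "=5s20s20sd"        # 'd' (8 bytes) stands in for numeric(8,2)
    RECORD_SIZE = struct.calcsize(RECORD_FMT)   # 53

    def pack(ID, name, dept, salary):
        # struct pads short strings with null bytes up to the fixed width
        return struct.pack(RECORD_FMT, ID.encode(), name.encode(),
                           dept.encode(), salary)

    rec = pack("10101", "Srinivasan", "Comp. Sci.", 65000.00)
    assert len(rec) == 53            # record i lives at file offset 53 * i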
However, there are two problems with this simple approach:

1. Unless the block size happens to be a multiple of 53 (which is unlikely), some records will cross block boundaries. That is, part of the record will be stored in one block and part in another. It would thus require two block accesses to read or write such a record.

2. It is difficult to delete a record from this structure. The space occupied by the record to be deleted must be filled with some other record of the file, or we must have a way of marking deleted records so that they can be ignored.

To avoid the first problem, we allocate only as many records to a block as would fit entirely in the block. For the second problem, when a record is deleted, we could move the record that came after it into the space formerly occupied by the deleted record, and so on, until every record following the deleted record has been moved ahead. Such an approach requires moving a large number of records; it might be easier simply to move the final record of the file into the space occupied by the deleted record.
It is undesirable to move records to occupy the
space freed by a deleted record, since doing so
requires additional block accesses. Since
insertions tend to be more frequent than
deletions, it is acceptable to leave open the space
occupied by the deleted record, and to wait for a
subsequent insertion before reusing the space.
A simple marker on a deleted record is not
sufficient, since it is hard to find this available
space when an insertion is being done. Thus, we
need to introduce an additional structure.

At the beginning of the file, we allocate a certain number of bytes as a file header. The header will contain a variety of information about the file. For now, all we need to store there is the address of the first record whose contents are deleted. We use this first record to store the address of the second available record, and so on. Intuitively, we can think of these stored addresses as pointers, since they point to the location of a record. The deleted records thus form a linked list, which is often referred to as a free list.

On insertion of a new record, we use the record pointed to by the header, and we change the header pointer to point to the next available record. If no space is available, we add the new record to the end of the file. Insertion and deletion for files of fixed-length records are simple to implement, because the space made available by a deleted record is exactly the space needed to insert a record.
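A minimal in-memory sketch of the free list (Python); slot numbers stand in for disk addresses, and each deleted slot stores the number of the next free slot:

    FREE_END = -1

    slots = [("A",), ("B",), ("C",), ("D",)]   # fixed-length record slots
    header = FREE_END                          # address of first deleted record

    def delete(i):
        global header
        slots[i] = header                      # deleted slot holds next-free pointer
        header = i

    def insert(rec):
        global header
        if header == FREE_END:
            slots.append(rec)                  # no free slot: append at end of file
        else:
            i, header = header, slots[header]  # reuse the first free slot
            slots[i] = rec

    delete(1); delete(3)                       # free list is now 3 -> 1
    insert(("E",))                             # reuses slot 3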
Variable-Length Records

Variable-length records arise in database systems in several ways:
• Storage of multiple record types in a file.
• Record types that allow variable lengths for one or more fields.
• Record types that allow repeating fields, such as arrays or multisets.

Different techniques for implementing variable-length records exist. Two different problems must be solved by any such technique:
• How to represent a single record in such a way that individual attributes can be extracted easily.
• How to store variable-length records within a block, such that records in a block can be extracted easily.

The representation of a record with variable-length attributes typically has two parts: an initial part with fixed-length attributes, followed by data for variable-length attributes. Fixed-length attributes, such as numeric values, dates, or fixed-length character strings, are allocated as many bytes as required to store their value. Variable-length attributes, such as varchar types, are represented in the initial part of the record by a pair (offset, length), where offset denotes where the data for that attribute begins within the record, and length is the length in bytes of the variable-sized attribute.
The values for these attributes are stored consecutively, after the initial fixed-length part of the record. Thus, the initial part of the record stores a fixed size of information about each attribute, whether it is fixed-length or variable-length.

The record also contains a null bitmap, which indicates which attributes of the record have a null value. For example, if the salary of an instructor record were null, the fourth bit of the bitmap would be set to 1, and the salary value (stored, in the example layout, in bytes 12 through 19) would be ignored.

The slotted-page structure is commonly used for organizing variable-length records within a block.

There is a header at the beginning of each block,


containing the following information:
1. The number of record entries in the header.
2. The end of free space in the block.
3. An array whose entries contain the location
and size of each record
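A compact sketch of such a block (Python, in memory; the insertion and deletion behavior is described next):

    class SlottedPage:
        def __init__(self, size=4096):
            self.data = bytearray(size)
            self.slots = []             # header array: (offset, length) per record
            self.free_end = size        # records grow downward from the block's end

        def insert(self, rec: bytes) -> int:
            self.free_end -= len(rec)
            self.data[self.free_end:self.free_end + len(rec)] = rec
            self.slots.append((self.free_end, len(rec)))
            return len(self.slots) - 1  # the slot number identifies the record

        def delete(self, slot: int):
            off, _ = self.slots[slot]
            self.slots[slot] = (off, -1)   # size -1 marks the entry as deleted

        def read(self, slot: int) -> bytes:
            off, length = self.slots[slot]
            assert length != -1, "deleted record"
            return bytes(self.data[off:off + length])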
The actual records are allocated contiguously in the block, starting from the end of the block. The free space in the block is contiguous, between the final entry in the header array and the first record. If a record is inserted, space is allocated for it at the end of free space, and an entry containing its size and location is added to the header. If a record is deleted, the space that it occupies is freed, and its entry is set to deleted (its size is set to −1, for example).

Organization of Records in Files

Several of the possible ways of organizing records in files are:

Heap file organization. Any record can be placed anywhere in the file where there is space for the record. There is no ordering of records. Typically, there is a single file for each relation.

Sequential file organization. Records are stored in sequential order, according to the value of a "search key" of each record.

Hashing file organization. A hash function is computed on some attribute of each record. The result of the hash function specifies in which block of the file the record should be placed.

Generally, a separate file is used to store the records of each relation. However, in a multitable clustering file organization, records of several different relations are stored in the same file.

Sequential File Organization

A sequential file is designed for efficient processing of records in sorted order based on some search key. A search key is any attribute or set of attributes; it need not be the primary key, or even a superkey. To permit fast retrieval of records in search-key order, we chain together records by pointers. The pointer in each record points to the next record in search-key order. Furthermore, to minimize the number of block accesses in sequential file processing, we store records physically in search-key order, or as close to search-key order as possible.
The sequential file organization allows records to be read in sorted order; that can be useful for display purposes, as well as for certain query-processing algorithms. For insertion, we apply the following rules:

1. Locate the record in the file that comes before the record to be inserted in search-key order.

2. If there is a free record (that is, space left after a deletion) within the same block as this record, insert the new record there. Otherwise, insert the new record in an overflow block. In either case, adjust the pointers so as to chain together the records in search-key order.

Multitable Clustering File Organization

A multitable clustering file organization is a file organization that stores related records of two or more relations in each block. Such a file organization allows us to read records that would satisfy a join condition by using one block read; thus, we are able to process such join queries more efficiently.

Indexing and Hashing

Basic Concepts

An index for a file in a database system works in much the same way as the index of a textbook. Database-system indices play the same role as book indices in libraries. For example, to retrieve a student record given an ID, the database system would look up an index to find on which disk block the corresponding record resides, and then fetch that disk block to get the appropriate student record.

Keeping a sorted list of students' IDs would not work well on very large databases with thousands of students, since the index would itself be very big; further, even though keeping the index sorted reduces the search time, finding a student can still be rather time-consuming. Instead, more sophisticated indexing techniques may be used.
There are two basic kinds of indices:

Ordered indices. Based on a sorted ordering of the values.

Hash indices. Based on a uniform distribution of values across a range of buckets. The bucket to which a value is assigned is determined by a function called a hash function.

There are several techniques for both ordered indexing and hashing. Each technique must be evaluated on the basis of these factors:

• Access types: The types of access that are supported efficiently. Access types can include finding records with a specified attribute value and finding records whose attribute values fall in a specified range.

• Access time: The time it takes to find a particular data item, or set of items, using the technique in question.

• Insertion time: The time it takes to insert a new data item. This value includes the time it takes to find the correct place to insert the new data item, as well as the time it takes to update the index structure.

• Deletion time: The time it takes to delete a data item. This value includes the time it takes to find the item to be deleted, as well as the time it takes to update the index structure.

• Space overhead: The additional space occupied by an index structure. Provided that the amount of additional space is moderate, it is usually worthwhile to sacrifice the space to achieve improved performance.

We often want to have more than one index for a file. An attribute or set of attributes used to look up records in a file is called a search key.
1. Ordered Indices

To gain fast random access to records in a file, we can use an index structure. Each index structure is associated with a particular search key. Just like the index of a book or a library catalog, an ordered index stores the values of the search keys in sorted order, and associates with each search key the records that contain it. The records in the indexed file may themselves be stored in some sorted order, just as books in a library are stored according to some attribute.

A file may have several indices, on different search keys. If the file containing the records is sequentially ordered, a clustering index is an index whose search key also defines the sequential order of the file. Clustering indices are also called primary indices. Indices whose search key specifies an order different from the sequential order of the file are called nonclustering indices, or secondary indices.

1.1 Dense and Sparse Indices

An index entry, or index record, consists of a search-key value and pointers to one or more records with that value as their search-key value. The pointer to a record consists of the identifier of a disk block and an offset within the disk block to identify the record within the block.

There are two types of ordered indices that we can use:
Dense index: In a dense index, an index entry appears for every search-key value in the file. In a dense clustering index, the index record contains the search-key value and a pointer to the first data record with that search-key value. The rest of the records with the same search-key value would be stored sequentially after the first record, since, because the index is a clustering one, records are sorted on the same search key. In a dense nonclustering index, the index must store a list of pointers to all records with the same search-key value.

Sparse index: In a sparse index, an index entry appears for only some of the search-key values. Sparse indices can be used only if the relation is stored in sorted order of the search key, that is, if the index is a clustering index.
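A sketch contrasting the two (Python, over toy data sorted on the search key; here every fourth key goes into the sparse index, an arbitrary choice):

    import bisect

    records = [(k, f"record-{k}") for k in range(0, 100, 5)]   # sorted file

    dense  = {k: i for i, (k, _) in enumerate(records)}          # every key
    sparse = [(records[i][0], i) for i in range(0, len(records), 4)]

    def lookup_sparse(key):
        keys = [k for k, _ in sparse]
        j = bisect.bisect_right(keys, key) - 1   # largest indexed key <= key
        i = sparse[j][1]
        while i < len(records) and records[i][0] < key:
            i += 1                               # short sequential scan from there
        if i < len(records) and records[i][0] == key:
            return records[i]
        return None

    assert lookup_sparse(35) == (35, "record-35")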
Multilevel Indices

Indices with two or more levels are called multilevel indices.

B+-Tree Index Files

The B+-tree is a balanced search tree that follows a multilevel index format. In a B+-tree, the leaf nodes hold the actual data pointers, and the tree ensures that all leaf nodes remain at the same depth. The leaf nodes are linked together in a linked list; therefore, a B+-tree can support sequential access as well as random access.

Structure of a B+-Tree

The main disadvantage of the index-sequential file organization is that performance degrades as the file grows. The B+-tree index structure is the most widely used of several index structures that maintain their efficiency despite insertion and deletion of data.

A B+-tree index takes the form of a balanced tree in which every path from the root of the tree to a leaf of the tree is of the same length; that is, every leaf node is at equal distance from the root node. A B+-tree is of order n, where n is fixed for a particular tree, and it contains internal (nonleaf) nodes and leaf nodes. Each nonleaf node in the tree has between ⌈n/2⌉ and n children.

A B+-tree index is a multilevel index, but it has a structure that differs from that of the multilevel index-sequential file. A node contains up to n − 1 search-key values K1, K2, ..., Kn−1, and n pointers P1, P2, ..., Pn. The search-key values within a node are kept in sorted order; thus, if i < j, then Ki < Kj.

We consider first the structure of the leaf nodes. For i = 1, 2, ..., n−1, pointer Pi points to a file record with search-key value Ki. Pointer Pn has a special purpose: since there is a linear order on the leaves based on the search-key values that they contain, we use Pn to chain together the leaf nodes in search-key order. This ordering allows for efficient sequential processing of the file.

The nonleaf nodes of the B+-tree form a multilevel (sparse) index on the leaf nodes. The structure of nonleaf nodes is the same as that for leaf nodes, except that all pointers are pointers to tree nodes. A nonleaf node may hold up to n pointers, and must hold at least ⌈n/2⌉ pointers. The number of pointers in a node is called the fanout of the node. Nonleaf nodes are also referred to as internal nodes.
B+-trees are always balanced: the length of every path from the root to a leaf node is the same. This balance property is a requirement for a B+-tree; indeed, the "B" in B+-tree stands for "balanced." It is the balance property of B+-trees that ensures good performance for lookup, insertion, and deletion.
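A sketch of lookup in such a tree (Python; nodes are simple objects, and the last pointer of a leaf plays the role of Pn chaining the leaves):

    class Node:
        def __init__(self, keys, pointers, leaf):
            self.keys = keys          # search-key values, kept sorted
            self.pointers = pointers  # leaf: record pointers; nonleaf: children
            self.leaf = leaf

    def find(node, key):
        # Descend from the root to the unique leaf that may contain key.
        while not node.leaf:
            i = 0
            while i < len(node.keys) and key >= node.keys[i]:
                i += 1                # choose the child covering the key's range
            node = node.pointers[i]
        for k, rec in zip(node.keys, node.pointers):
            if k == key:
                return rec
        return None

    leaf2 = Node([30, 40], ["r30", "r40"], leaf=True)
    leaf1 = Node([10, 20], ["r10", "r20", leaf2], leaf=True)  # Pn chains the leaves
    root = Node([30], [leaf1, leaf2], leaf=False)
    assert find(root, 30) == "r30"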

Module IV, Part 2

Hashing: Static Hashing

One disadvantage of sequential file organization is that we must access an index structure to locate data. File organizations based on the technique of hashing allow us to avoid accessing an index structure.

In our description of hashing, we shall use the term bucket to denote a unit of storage that can store one or more records. A bucket is typically a disk block, but could be chosen to be smaller or larger than a disk block.
Let K denote the set of all search-key values, and let B denote the set of all bucket addresses. A hash function h is a function from K to B. To insert a record with search key Ki, we compute h(Ki), which gives the address of the bucket for that record. To perform a lookup on a search-key value Ki, we simply compute h(Ki), then search the bucket with that address.

Suppose that two search keys, K5 and K7, have the same hash value; that is, h(K5) = h(K7). If we perform a lookup on K5, the bucket h(K5) contains records with search-key value K5 as well as records with search-key value K7. Thus, we have to check the search-key value of every record in the bucket to verify that the record is one that we want.

Deletion is equally straightforward. If the search-key value of the record to be deleted is Ki, we compute h(Ki), then search the corresponding bucket for that record, and delete the record from the bucket.

Hashing can be used for two different purposes. In a hash file organization, we obtain the address of the disk block containing a desired record directly by computing a function on the search-key value of the record. In a hash index organization, we organize the search keys, with their associated pointers, into a hash file structure.
Hash Functions

An ideal hash function distributes the stored keys uniformly across all the buckets, so that every bucket has the same number of records. Since we do not know at design time precisely which search-key values will be stored in the file, we want to choose a hash function that assigns search-key values to buckets in such a way that the distribution has these qualities:
 The distribution is uniform.
 The distribution is random.

Handling of Bucket Overflows

Hash functions require careful design. A bad hash function may result in lookup taking time proportional to the number of search keys in the file. A well-designed function gives an average-case lookup time that is a (small) constant, independent of the number of search keys in the file.

If a bucket does not have enough space, a bucket overflow is said to occur. Bucket overflow can occur for several reasons:

Insufficient buckets. The number of buckets, which we denote nB, must be chosen such that nB > nr/fr, where nr denotes the total number of records that will be stored and fr denotes the number of records that will fit in a bucket. This designation, of course, assumes that the total number of records is known when the hash function is chosen.
Skew. Some buckets are assigned more records than are others, so a bucket may overflow even when other buckets still have space. This situation is called bucket skew. Skew can occur for two reasons:
1. Multiple records may have the same search key.
2. The chosen hash function may result in nonuniform distribution of search keys.

So that the probability of bucket overflow is reduced, the number of buckets is chosen to be (nr/fr) ∗ (1 + d), where d is a fudge factor, typically around 0.2. Some space is wasted: about 20 percent of the space in the buckets will be empty. But the benefit is that the probability of overflow is reduced.

Despite allocation of a few more buckets than required, bucket overflow can still occur. We handle bucket overflow by using overflow buckets. If a record must be inserted into a bucket b, and b is already full, the system provides an overflow bucket for b, and inserts the record into the overflow bucket. If the overflow bucket is also full, the system provides another overflow bucket, and so on. All the overflow buckets of a given bucket are chained together in a linked list; overflow handling using such a linked list is called overflow chaining.

We must change the lookup algorithm slightly to handle overflow chaining. As before, the system uses the hash function on the search key to identify a bucket b, and must examine all the records in bucket b to see whether they match the search key. In addition, if bucket b has overflow buckets, the system must examine the records in all the overflow buckets also. This form of hash structure is called closed hashing.

Under an alternative approach, called open hashing, the set of buckets is fixed, and there are no overflow chains. Instead, if a bucket is full, the system inserts records in some other bucket in the initial set of buckets B. One policy is to use the next bucket (in cyclic order) that has space; this policy is called linear probing.
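A small sketch of a hash file organization with overflow chaining (Python; the bucket capacity and hash function are arbitrary choices):

    BUCKET_CAPACITY = 2
    NUM_BUCKETS = 4
    buckets = [[] for _ in range(NUM_BUCKETS)]   # each bucket: a chain of pages

    def h(key):                                  # toy hash function
        return hash(key) % NUM_BUCKETS

    def insert(key, rec):
        chain = buckets[h(key)]
        for page in chain:
            if len(page) < BUCKET_CAPACITY:
                page.append((key, rec))
                return
        chain.append([(key, rec)])               # allocate an overflow bucket

    def lookup(key):
        # Examine the primary bucket and then every overflow bucket in the chain.
        return [rec for page in buckets[h(key)]
                    for k, rec in page if k == key]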

Dynamic Hashing

The need to fix the set B of bucket addresses presents a serious problem with the static hashing technique: most databases grow larger over time. If we are to use static hashing for such a database, we have three classes of options:
1. Choose a hash function based on the current file size.
2. Choose a hash function based on the anticipated size of the file at some point in the future.
3. Periodically reorganize the hash structure in response to file growth.

Several dynamic hashing techniques allow the hash function to be modified dynamically to accommodate the growth or shrinkage of the database. Extendable hashing (a form of dynamic hashing) copes with changes in database size by splitting and combining buckets as the database grows and shrinks. As a result, space efficiency is retained.

With extendable hashing, we choose a hash function h with the desirable properties of uniformity and randomness. However, this hash function generates values over a relatively large range, namely, b-bit binary integers. A typical value for b is 32.
Module V: Distributed Databases

 A distributed database is basically a database that is not limited to one system; it is spread over different sites, i.e., on multiple computers or over a network of computers.
 A distributed database system is located on various sites that don't share physical components. This may be required when a particular database needs to be accessed by various users globally.
 It needs to be managed such that, to the users, it looks like one single database.

Homogeneous and Heterogeneous Databases

 In a homogeneous distributed database system, all sites have identical database-management system software, are aware of one another, and agree to cooperate in processing users' requests.
 In contrast, in a heterogeneous distributed database, different sites may use different schemas, and different database-management system software. The sites may not be aware of one another, and they may provide only limited facilities for cooperation in transaction processing.

Distributed Data Storage

 Consider a relation r that is to be stored in the database. There are two approaches to storing this relation in the distributed database:
• Replication. The system maintains several identical replicas (copies) of the relation, and stores each replica at a different site. The alternative to replication is to store only one copy of relation r.
• Fragmentation. The system partitions the relation into several fragments, and stores each fragment at a different site.
 Fragmentation and replication can be combined: a relation can be partitioned into several fragments, and there may be several replicas of each fragment.

Data Replication

 If relation r is replicated, a copy of relation r is stored in two or more sites. In the most extreme case, we have full replication, in which a copy is stored in every site in the system. There are a number of advantages and disadvantages to replication.
 Availability: If one of the sites containing relation r fails, then the relation r can be found in another site. Thus, the system can continue to process queries involving r, despite the failure of one site.

 Increased parallelism. In the case where the majority of accesses to the relation r result in only the reading of the relation, several sites can process queries involving r in parallel. The more replicas of r there are, the greater the chance that the needed data will be found in the site where the transaction is executing. Hence, data replication minimizes movement of data between sites.
 Increased overhead on update. The system must ensure that all replicas of a relation r are consistent; otherwise, erroneous computations may result. Thus, whenever r is updated, the update must be propagated to all sites containing replicas. The result is increased overhead. For example, in a banking system, where account information is replicated in various sites, it is necessary to ensure that the balance in a particular account agrees in all sites.
 We can simplify the management of replicas of relation r by choosing one of them as the primary copy of r. For example, in a banking system, an account can be associated with the site in which the account has been opened.

Data Fragmentation

 If relation r is fragmented, r is divided into a number of fragments r1, r2, . . . , rn. These fragments contain sufficient information to allow reconstruction of the original relation r.
 There are two different schemes for fragmenting a relation: horizontal fragmentation and vertical fragmentation. Horizontal fragmentation splits the relation by assigning each tuple of r to one or more fragments. Vertical fragmentation splits the relation by decomposing the scheme R of relation r.

 In horizontal fragmentation, a relation r is partitioned into a number of subsets r1, r2, . . . , rn. Each tuple of relation r must belong to at least one of the fragments, so that the original relation can be reconstructed, if needed. For instance, the account relation can be divided into several different fragments, each of which consists of tuples of accounts belonging to a particular branch.
 A horizontal fragment can be defined as a selection on the global relation r. That is, we use a predicate Pi to construct fragment ri:
ri = σPi (r)
 We reconstruct the relation r by taking the union of all fragments; that is:
r = r1 ∪ r2 ∪ · · · ∪ rn
 Vertical fragmentation is the same as decomposition. Vertical fragmentation of r(R) involves the definition of several subsets of attributes R1, R2, . . . , Rn of the schema R so that:
R = R1 ∪ R2 ∪ · · · ∪ Rn
 Each fragment ri of r is defined by a projection:
ri = ΠRi (r)
 We can reconstruct relation r from the fragments by taking the natural join:
r = r1 ⋈ r2 ⋈ · · · ⋈ rn
 One way of ensuring that the relation r can be reconstructed is to include the primary-key attributes of R in each Ri. More generally, any superkey can be used. It is often convenient to add a special attribute, called a tuple-id, to the schema R.

Transparency

 The user of a distributed database system should not be required to know where the data are physically located nor how the data can be accessed at the specific local site. This characteristic, called data transparency, can take several forms:
 Fragmentation transparency. Users are not required to know how a relation has been fragmented.
 Replication transparency. Users view each data object as logically unique. The distributed system may replicate an object to increase either system performance or data availability. Users do not have to be concerned with what data objects have been replicated, or where replicas have been placed.
 Location transparency. Users are not required to know the physical location of the data. The distributed database system should be able to find any data as long as the data identifier is supplied by the user transaction.
 Data items, such as relations, fragments, and replicas, must have unique names. This property is easy to ensure in a centralized database. In a distributed database, however, we must take care to ensure that two sites do not use the same name for distinct data items.
 One solution to this problem is to require all names to be registered in a central name server. The name server helps to ensure that the same name does not get used for different data items.
 The database system can also create a set of alternative names, or aliases, for data items. A user may thus refer to data items by simple names that are translated by the system to complete names.

Distributed Transactions

 Access to the various data items in a distributed system is usually accomplished through transactions, which must preserve the ACID properties.
 There are two types of transaction that we need to consider. The local transactions are those that access and update data in only one local database; the global transactions are those that access and update data in several local databases.

System Structure

 Each site has its own local transaction manager, whose function is to ensure the ACID properties of those transactions that execute at that site. The various transaction managers cooperate to execute global transactions. Each site contains two subsystems:
 The transaction manager manages the execution of those transactions (or subtransactions) that access data stored in a local site.
 The transaction coordinator coordinates the execution of the various transactions (both local and global) initiated at that site.
 Each transaction manager is responsible for:
• Maintaining a log for recovery purposes.
• Participating in an appropriate concurrency-control scheme to coordinate the concurrent execution of the transactions executing at that site.
 The coordinator is responsible for:
• Starting the execution of the transaction.
• Breaking the transaction into a number of subtransactions and distributing these subtransactions to the appropriate sites for execution.
• Coordinating the termination of the transaction, which may result in the transaction being committed at all sites or aborted at all sites.

System Failure Modes

 A distributed system may suffer from the same types of failure that a centralized system does. The basic failure types are:
• Failure of a site.
• Loss of messages.
• Failure of a communication link.
• Network partition.

Object-Based Databases

 Complex application domains require correspondingly complex data types, such as nested record structures, multivalued attributes, and inheritance, which are supported by traditional programming languages.
 The object-relational data model extends the relational data model by providing a richer type system including complex data types and object orientation.
 Object-relational database systems, that is, database systems based on the object-relational model, provide a convenient migration path for users of relational databases who wish to use object-oriented features.
 Two approaches are used:
1. Build an object-oriented database system, that is, a database system that natively supports an object-oriented type system, and allows direct access to data from an object-oriented programming language using the native type system of the language.
2. Automatically convert data from the native type system of the programming language to a relational representation, and vice versa. Data conversion is specified using an object-relational mapping.

Complex Data Types

 Traditional database applications have conceptually simple data types. The basic data items are records that are fairly small and whose fields are atomic.
 Consider, for example, addresses. While an entire address could be viewed as an atomic data item of type string, this view would hide details such as the street address, city, state, and postal code, which could be of interest to queries. On the other hand, if an address were represented by breaking it into the components (street address, city, state, and postal code), writing queries would be more complicated, since they would have to mention each field.
 A better alternative is to allow structured data types: for example, a type address with subparts street address, city, state, and postal code.
 With complex type systems we can represent E-R model concepts, such as composite attributes, multivalued attributes, generalization, and specialization, directly, without a complex translation to the relational model.
Structured Types and Inheritance in SQL

 Rather than view a database as a set of records, users of certain applications view it as a set of objects (or entities).

Structured Types

 Structured types allow composite attributes of E-R designs to be represented directly. For instance, we can define the following structured type to represent a composite attribute name with component attributes firstname and lastname:
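In standard SQL syntax (the lengths are illustrative; the final clause is explained under table inheritance below):

    create type Name as (
        firstname varchar(20),
        lastname  varchar(20)
    ) final;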

 Similarly, the following structured type can be used to represent a composite attribute address. We can then use these types to create composite attributes in a relation, by simply declaring an attribute to be of one of these types; for example, we could create a table person as follows:
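Again in standard SQL syntax (attribute names and lengths are illustrative):

    create type Address as (
        street  varchar(20),
        city    varchar(20),
        zipcode varchar(9)
    ) not final;

    create table person (
        name        Name,
        address     Address,
        dateOfBirth date
    );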

 Such types are called user-defined types in SQL. The final and not final specifications are related to subtyping, which is discussed below.
 The components of a composite attribute can be accessed using
a “dot” notation; for instance name.firstname returns the
firstname component of the name attribute. An access to
attribute name would return a value of the structured type
Name.
 We can also create a table whose rows are of a user-defined
type. For example, we could define a type PersonType and
create the table person as follows:
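For instance (standard SQL syntax):

    create type PersonType as (
        name        Name,
        address     Address,
        dateOfBirth date
    ) not final;

    create table person of PersonType;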

 An alternative way of defining composite attributes in SQL is to use unnamed row types. Such a definition is equivalent to the preceding table definition, except that the attributes name and address have unnamed types, and the rows of the table also have an unnamed type.
 The following query illustrates how to access component attributes of a composite attribute; the query finds the last name and city of each person.
 A structured type can have methods defined on it. We declare methods as part of the type definition of a structured type, as in the sketches below:
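Sketches in standard SQL syntax: a table using unnamed row types, the component-access query, and a textbook-style ageOnDate method (the method name and body are illustrative):

    create table person_r (
        name    row(firstname varchar(20), lastname varchar(20)),
        address row(street varchar(20), city varchar(20), zipcode varchar(9)),
        dateOfBirth date
    );

    select name.lastname, address.city
    from person;

    create type PersonType as (
        name        Name,
        address     Address,
        dateOfBirth date
    ) not final
        method ageOnDate(onDate date) returns interval year;

    create instance method ageOnDate(onDate date)
        returns interval year
        for PersonType
    begin
        return onDate - self.dateOfBirth;
    end;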
 Note that the for clause indicates which type this method is for, while the keyword instance indicates that this method executes on an instance of the type. The variable self refers to the instance on which the method is invoked.
 Constructor functions are used to create values of structured types. A function with the same name as a structured type is a constructor function for the structured type.

 We could declare a constructor for the type Name as shown below. We can then use new Name('John', 'Smith') to create a value of the type Name. Similarly, we can construct a row value by listing its attributes within parentheses.
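A textbook-style constructor sketch in standard SQL:

    create function Name(firstname varchar(20), lastname varchar(20))
        returns Name
    begin
        set self.firstname = firstname;
        set self.lastname  = lastname;
    end;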
 By default every structured type has a constructor with no
arguments, which sets the attributes to their default values.
Any other constructors have to be created
explicitly. There can be more than one constructor for the same
structured type; although they have the same name, they must
be distinguishable by the number
of arguments and types of their arguments.
 For example, we can create a new tuple in the Person relation by invoking such constructors (e.g., new Name('John', 'Smith')) within the values clause of an insert statement.

Type Inheritance

 Suppose that we have a type definition for people, with name and address attributes. We may want to store extra information in the database about people who are students, and about people who are teachers. Since students and teachers are also people, we can use inheritance to define the student and teacher types in SQL, as sketched below:
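A textbook-style sketch (attribute names and lengths are illustrative; exact clause placement varies by implementation):

    create type Person as (
        name    varchar(20),
        address varchar(20)
    ) not final;

    create type Student under Person as (
        degree     varchar(20),
        department varchar(20)
    );

    create type Teacher under Person as (
        salary     integer,
        department varchar(20)
    );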

 Both Student and Teacher inherit the attributes of Person, namely name and address. Student and Teacher are said to be subtypes of Person, and Person is a supertype of Student, as well as of Teacher.
 Multiple inheritance, where a type inherits from more than one supertype (for example, a teaching assistant who is both a student and a teacher), is also conceptually possible.
Table Inheritance

 The SQL standard requires an extra field at the end of the type definition, whose value is either final or not final. The keyword final says that subtypes may not be created from the given type, while not final says that subtypes may be created.
 We can create a table people of the type Person:
create table people of Person;
 We can then define tables students and teachers as subtables of people, as sketched below:
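In standard SQL:

    create table students of Student under people;
    create table teachers of Teacher under people;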

 Further, when we declare students and teachers as subtables of people, every tuple present in students or teachers becomes implicitly present in people. Thus, if a query uses the table people, it will find not only tuples directly inserted into that table, but also tuples inserted into its subtables, namely students and teachers. However, only those attributes that are present in people can be accessed by that query.
 SQL permits us to find tuples that are in people but not in its subtables by using "only people" in place of people in a query. The only keyword can also be used in delete and update statements. Without the only keyword, a delete statement on a supertable, such as people, also deletes tuples that were originally inserted in subtables (such as students); for example,
delete from people where P;
would delete all tuples from the table people, as well as its subtables students and teachers, that satisfy P.
 If the only keyword is added to the above statement, tuples that were inserted in subtables are not affected, even if they satisfy the where clause conditions.
 Multiple inheritance is possible with tables.

Array and Multiset Types in SQL

 SQL supports two collection types: arrays and multisets. A multiset is an unordered collection, where an element may occur multiple times.

Creating and Accessing Collection Values

 An array of values can be created in SQL, and a multiset of keywords can be constructed similarly, as in the sketch below:
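A sketch in standard SQL syntax (the books table and the sample values are illustrative):

    -- an array value and a multiset value
    array['Silberschatz', 'Korth', 'Sudarshan']
    multiset['computer', 'database', 'SQL']

    -- for example, inserting into a table with collection-valued attributes
    insert into books
    values ('Compilers', array['Smith', 'Jones'],
            multiset['parsing', 'analysis']);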
Object-Identity and Reference Types in SQL

 Object-oriented languages provide the ability to refer to objects. An attribute of a type can be a reference to an object of a specified type. For example, in SQL we can define a type Department with a field name and a field head that is a reference to the type Person, and a table departments of type Department, as sketched below.
 We can omit the declaration scope people from the type declaration and instead make an addition to the create table statement.
 The referenced table must have an attribute that stores the identifier of the tuple. We declare this attribute, called the self-referential attribute, by adding a ref is clause to the create table statement.
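Sketches in standard SQL syntax (two alternative placements of the scope, followed by the self-referential attribute):

    create type Department as (
        name varchar(20),
        head ref(Person) scope people
    );
    create table departments of Department;

    -- alternatively, omit the scope from the type and add it in create table:
    create type Department as (
        name varchar(20),
        head ref(Person)
    );
    create table departments of Department
        (head with options scope people);

    -- the referenced table declares a self-referential attribute:
    create table people of Person
        ref is person_id system generated;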

Next-Generation Databases

CAP Theorem

 The CAP theorem comprises three components (hence its name) as they relate to distributed data stores:
 Consistency. All reads receive the most recent write or an error.
 Availability. All reads contain data, but it might not be the most recent.
 Partition tolerance. The system continues to operate despite network failures (i.e., dropped partitions, slow network connections, or unavailable network connections between nodes).
 In normal operations, your data store provides all three functions. But the CAP theorem maintains that when a distributed database experiences a network failure, you can provide either consistency or availability.
 In the theorem, partition tolerance is a must. The assumption is that the system operates on a distributed data store, so the system, by nature, operates with network partitions. Network failures will happen, so to offer any kind of reliable service, partition tolerance is necessary: the P of CAP.
 If a partition means a break in communication, then partition tolerance means that the system should still be able to work even if there is a partition in the system: if a node fails to communicate, then one of the replicas of the node should be able to retrieve the data required by the user.
 The CAP theorem thus states that a distributed database system has to make a tradeoff between consistency and availability when a partition occurs. That leaves a decision between the other two, C and A. When a network failure happens, one can choose to guarantee consistency or availability:
 High consistency comes at the cost of lower availability.
 High availability comes at the cost of lower consistency.

Non-Relational Databases

 Non-relational databases (often called NoSQL databases) are different from traditional relational databases in that they store their data in a non-tabular form. Instead, non-relational databases might be based on data structures like documents. A document can be highly detailed while containing a range of different types of information in different formats.
 There are several advantages to using non-relational databases, including:
 Massive dataset organization. In the age of Big Data, non-relational databases can not only store massive quantities of information, but they can also query these datasets with ease. Scale and speed are crucial advantages of non-relational databases.
 Flexible database expansion. Data is not static. As more information is collected, a non-relational database can absorb these new data points, enriching the existing database with new levels of granular value even if they don't fit the data types of previously existing information.
 Multiple data structures.
 Built for the cloud.

MongoDB

 MongoDB is an open-source document database and a leading NoSQL database.
 MongoDB works on the concepts of collections and documents. Rather than using the tables and fixed schemas of a relational database management system (RDBMS), MongoDB uses key-value storage in collections of documents. It also supports a number of options for horizontal scaling in large, production environments.
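A minimal sketch using the PyMongo driver (assumes a locally running mongod; the database and collection names are hypothetical):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    coll = client["university"]["instructors"]

    # Documents are schemaless key-value structures; fields may vary per document.
    coll.insert_one({"ID": "10101", "name": "Srinivasan",
                     "dept_name": "Comp. Sci.", "salary": 65000})
    doc = coll.find_one({"dept_name": "Comp. Sci."})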

MongoDB Sharding

 MongoDB is a NoSQL document database system that scales well horizontally and implements data storage through a key-value system.
 MongoDB achieves scaling through a technique known as "sharding": the process of writing data across different servers to distribute the read and write load and the data-storage requirements.
 Sharding is the process of storing data records across multiple machines, and it is MongoDB's approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput.

MongoDB Replication

 Sharding solves the problem of horizontal scaling: with sharding, you add more machines to support data growth and the demands of read and write operations.
 Replica sets are a great way to replicate MongoDB data across multiple servers and have the database automatically fail over in case of server failure.

MongoDB Sharding Basics

 MongoDB sharding works by creating a cluster of MongoDB instances consisting of at least three servers. That means sharded clusters consist of three main components:
 The shard
 Mongos
 Config servers

Shard

 A shard is a single MongoDB instance that holds a subset of the sharded data. Shards can be deployed as replica sets to increase availability and provide redundancy. The combination of multiple shards creates a complete data set. For example, a 2 TB data set can be broken down into four shards, each containing 500 GB of data from the original data set.

Mongos

 Mongos act as the query router, providing a stable interface between the application and the sharded cluster. This MongoDB instance is responsible for routing the client requests to the correct shard.

Config Servers

 Configuration servers store the metadata and the configuration settings for the whole cluster.

Query routing proceeds as follows:
 The application communicates with the routers (mongos) about the query to be executed.
 The mongos instance consults the config servers to check which shard contains the required data set, in order to send the query to that shard.
 Finally, the result of the query will be returned to the application.
HBase

 HBase is a column-oriented non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases. It is well suited for real-time data processing or random read/write access to large volumes of data.
 Unlike relational database systems, HBase does not support a structured query language like SQL; in fact, HBase isn't a relational data store at all. HBase applications are written in Java, much like a typical Apache MapReduce application.
 HBase is a column-oriented database, and the tables in it are sorted by row. The table schema defines only column families, which are the key-value pairs. A table can have multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on the disk. Each cell value of the table has a timestamp.
 In HBase:
 A table is a collection of rows.
 A row is a collection of column families.
 A column family is a collection of columns.
 A column is a collection of key-value pairs.

Features of HBase

 HBase is linearly scalable.
 It has automatic failure support.
 It provides consistent reads and writes.
 It integrates with Hadoop, both as a source and a destination.
 It has an easy Java API for clients.
 It provides data replication across clusters.

Cassandra

 Apache Cassandra is an open-source, distributed, and decentralized storage system (database) for managing very large amounts of structured data spread out across the world. It provides a highly available service with no single point of failure.
 It is scalable, fault-tolerant, and consistent.
 It is a column-oriented database.
 Its distribution design is based on Amazon's Dynamo, and its data model on Google's Bigtable.
 Created at Facebook, it differs sharply from relational database management systems.
 Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful "column family" data model.
 Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
Features of Cassandra

 Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more data as per requirement.
 Always-on architecture − Cassandra has no single point of failure and is continuously available for business-critical applications that cannot afford a failure.
 Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
 Flexible data storage − Cassandra accommodates all possible data formats, including structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.
 Easy data distribution − Cassandra provides the flexibility to distribute data where you need it by replicating data across multiple data centers.
 Transaction support − Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).
 Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the read efficiency.

Components of Cassandra

 Node − The place where data is stored.
 Data center − A collection of related nodes.
 Cluster − A component that contains one or more data centers.
 Commit log − A crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
 Mem-table − A memory-resident data structure. After the commit log, the data will be written to the mem-table. Sometimes, for a single-column family, there will be multiple mem-tables.
 SSTable − A disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
 Bloom filter − Quick, nondeterministic algorithms for testing whether an element is a member of a set; a special kind of cache. Bloom filters are accessed after every query.
Cassandra Query Language

 Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. Programmers use cqlsh, a prompt for working with CQL, or separate application-language drivers.
 Write operations: Every write activity of nodes is captured by the commit logs written in the nodes. Later the data will be captured and stored in the mem-table. Whenever the mem-table is full, data will be written into the SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables, discarding unnecessary data.
 Read operations: During read operations, Cassandra gets values from the mem-table and checks the bloom filter to find the appropriate SSTable that holds the required data.
Module V: XML

 XML stands for eXtensible Markup Language.
 XML was designed to store and transport data.
 XML is a software- and hardware-independent tool for storing and transporting data.

The Difference Between XML and HTML

 XML and HTML were designed with different goals:
 XML was designed to carry data, with focus on what data is.
 HTML was designed to display data, with focus on how data looks.
 XML tags are not predefined like HTML tags are.
 XML does not use predefined tags: the XML language has no predefined tags.
 XML is extensible: most XML applications will work as expected even if new data is added (or removed).
 XML simplifies things:
 It simplifies data sharing.
 It simplifies data transport.
 It simplifies platform changes.
 It simplifies data availability.

XML DTD

 DTD stands for Document Type Definition.
 A DTD defines the structure and the legal elements and attributes of an XML document.
 An XML document with correct syntax is called "Well Formed".
 An XML document validated against a DTD is both "Well Formed" and "Valid".
 A DOCTYPE declaration in an XML document contains a reference to a DTD file (or an inline DTD). The purpose of a DTD is to define the structure and the legal elements and attributes of an XML document, as in the example below:
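For example, a DTD for a simple note document (the same elements reappear in the XML Schema example later):

    <!DOCTYPE note
    [
    <!ELEMENT note (to, from, heading, body)>
    <!ELEMENT to (#PCDATA)>
    <!ELEMENT from (#PCDATA)>
    <!ELEMENT heading (#PCDATA)>
    <!ELEMENT body (#PCDATA)>
    ]>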

 #PCDATA means parseable character data.


XML Schema

 An XML Schema describes the structure of an XML document, just like a DTD.
 An XML document with correct syntax is called "Well Formed".
 An XML document validated against an XML Schema is both "Well Formed" and "Valid".

 <xs:element name="note"> defines the element called "note".
 <xs:complexType> indicates that the "note" element is a complex type.
 <xs:sequence> indicates that the complex type is a sequence of elements.
 <xs:element name="to" type="xs:string"> indicates that the element "to" is of type string (text).
 <xs:element name="from" type="xs:string"> indicates that the element "from" is of type string.
 <xs:element name="heading" type="xs:string"> indicates that the element "heading" is of type string.
 <xs:element name="body" type="xs:string"> indicates that the element "body" is of type string.

 XML Schemas are written in XML.
 XML Schemas are extensible to additions.
 XML Schemas support data types.
 XML Schemas support namespaces.
XML Applications

 Storing data with complex structure
 Many applications need to store data that are structured, but are not easily modeled as relations. XML-based representations are now widely used for storing documents, spreadsheet data, and other data that are part of office application packages. XML is also used to represent data with complex structure that must be exchanged between different parts of an application.
 Standardized data exchange formats
 XML-based standards for representation of data have been developed for a variety of specialized applications, ranging from business applications such as banking and shipping to scientific applications such as chemistry and molecular biology.

 Web services
 When the information is to be used directly by a human, organizations provide Web-based forms, where users can input values and get back desired information in HTML form. However, there are many applications where such information needs to be accessed by software programs, rather than by end users. Providing the results of a query in XML form is a clear requirement. In addition, it makes sense to specify the input values to the query also in XML format.
