DBMS Bal Krishna Nyaupane PDF
Course Manual
On
Database Management System
[Figure: three applications (Application 1, 2, 3) access the database through corresponding views (View 1, 2, 3), with the DBMS mediating between the views and the stored database.]
1.3 Objectives
A database should provide for efficient storage, update, and retrieval of data.
A database should be reliable - the stored data should have high integrity and promote
user trust in that data.
A database should be adaptable and scalable to new and unforeseen requirements and
applications.
A database should identify the existence of common data and avoid duplicate recording.
Selective redundancy is sometimes allowed to improve performance or for better
reliability.
Structure
data types
data behavior
Persistence
store data on secondary storage
Retrieval
a declarative query language
a procedural database programming language
Banking: transactions
Airlines: reservations, schedules
Universities: registration, grades
Sales: customers, products, purchases
Online retailers: order tracking, customized recommendations
Manufacturing: production, inventory, orders, supply chain
Human resources: employee records, salaries, tax deductions
Logical Level
Describes what data are stored in the database and what relationships exist among the
data.
Describes the entire database in terms of a small number of relatively simple
structures.
type instructor = record
ID : string;
name : string;
dept_name : string;
salary : integer;
end;
View Level:
Application programs hide details of data types. Views can also hide information
(such as an employee’s salary) for security purposes.
Data independence which means that upper levels are unaffected by changes in lower
levels.
In general, the interfaces between the various levels and components should be well
defined so that changes in some parts do not seriously influence others.
When a schema at a lower level is changed, only the mappings between this schema and
higher-level schemas need to be changed in a DBMS that fully supports data
independence.
The higher-level schemas themselves are unchanged. Hence, the application programs
need not be changed since they refer to the external schemas.
Two types of independence
1. Physical Data Independence
2. Logical Data Independence
Physical Data Independence
It indicates that the physical storage structures or devices can be changed
without affecting the conceptual schema.
The ability to modify the physical schema without changing the logical schema.
Applications depend on the logical schema
Logical Data Independence
The capacity to change the conceptual schema without having to change the
external schemas and their application programs.
The conceptual schema can be changed without affecting the existing external
schemas.
A set of concepts to describe the structure of a database, and certain constraints that
the database should obey.
Categories of Data Models:
Conceptual (high-level, semantic) data models: Provide concepts that are
close to the way many users perceive data. (Also called entity-based or object-
based data models.)
♦ entity
♦ attribute
♦ relationship
Physical (low-level, internal) data models: Provide concepts that describe
details of how data is stored in the computer.
♦ record formats
♦ record ordering
♦ access paths
Implementation (representational) data models: Provide concepts that fall
between the above two, balancing user views with some computer storage
details.
♦ relational
♦ network
♦ hierarchical
2.2 E-R Model
A database can be modeled as:
a collection of entities,
Relationship among entities.
Entity:
An entity is an object that exists and is distinguishable from other objects.
Real-world object distinguishable from other objects.
Example: specific person, company, event, plant
Entity set :
An entity set is a set of entities of the same type that share the same properties.
All entities in an entity set have the same set of attributes.
Each entity set has a key.
Each attribute has a domain
Example: set of all Departments, Professors, Students, Administrators
Relationship
The connection among two or more entity sets.
A relationship is an association among several entities.
Key
A group of one or more attributes that uniquely identifies an entity in the entity set.
Types of Keys
1. Super Key:
a set of attributes that allows us to identify an entity uniquely in the
entity set
2. Candidate Key:
Is a minimal super key that uniquely identifies either an entity or a
relationship
3. Primary Key:
Is a candidate key that is chosen by the database designer to identify
the entities of an entity set
Fig: The labels “manager” and “worker” are called roles; they specify how employee entities
interact via the works for relationship set. Roles are indicated in E-R diagrams by labeling the
lines that connect diamonds to rectangles.
Specialization
The process of designating subgroupings within an entity set is called
specialization.
It is a top-down design process: designate subgroupings within an entity set that
are distinctive from the other entities in the set.
These subgroupings become lower-level entity sets that have attributes or
participate in relationships that do not apply to the higher-level entity set.
Depicted by a triangle component labeled ISA (E.g., instructor “is a” person).
Attribute inheritance – a lower-level entity set inherits all the attributes and
relationship participation of the higher-level entity set to which it is linked.
Generalization
A bottom-up design process – combine a number of entity sets that share the
same features into a higher-level entity set
Generalization is a containment relationship between a higher-level
entity set and one or more lower-level entity sets.
Specialization and generalization are simple inversions of each other; they are
represented in an E-R diagram in the same way
The ISA relationship also referred to as superclass - subclass relationship
2. Overlap constraint
The same entity may be a member of more than one subclass of the
specialization.
An overlap constraint is shown by placing an o in the specialization circle.
Example: a person may be both an employee and a customer; an employee can also be a customer.
3. Completeness constraint
The completeness constraint may be either total or partial.
A total specialization constraint specifies that every entity in the super class
must be a member of at least one subclass of the specialization.
Total specialization is shown by using a double line to connect the super class
to the circle.
A single line is used to display a partial specialization, meaning that an entity
does not have to belong to any of the subclasses.
2.6 Aggregation
The model was first proposed by Dr. E.F. Codd of IBM in 1970 in the following paper:
"A Relational Model for Large Shared Data Banks," Communications of the ACM, June
1970.
The relational Model of Data is based on the concept of a Relation.
A Relation is a mathematical concept based on the ideas of sets.
The strength of the relational approach to data management comes from the formal
foundation provided by the theory of relations.
The relational model is the most widely used data model for commercial data processing.
It is used so widely because it is simple and easy to maintain.
View Definition
Transaction Control
Embedded SQL and Dynamic SQL
Integrity constraints ensure that changes made to the database by authorized users do not
result in loss of data consistency.
Domain Constraints
A domain of possible values must be associated with every attribute in the database.
Declaring an attribute to be of a particular domain acts as a constraint on the values it can take.
Domain constraints are easily tested by the system. Example: an integer attribute cannot be set to the value “cat”.
Data Integrity
For relational databases, there are entity integrity and referential integrity rules which
help to make sure that we have data integrity
Attribute Integrity
Attribute integrity is not part of the relational model; it is used by database software to
help with data integrity. The software makes sure that data for particular fields is of the
correct type (e.g., letters or numbers) and the correct length.
Entity Integrity
The entity integrity rule applies to primary keys: the value of a primary key in a table
must be unique, and it can never be null (have no value).
Operations on the database which insert new data, update existing data, or delete data must
follow this rule
Referential Integrity
The referential integrity rule applies to foreign keys. A relation schema may have an
attribute that corresponds to the primary key of another relation; such an attribute is called a
foreign key.
The referential integrity rule says that the value of a foreign key must either be null (i.e.,
have no value) or be equal to a value in the linked table where the foreign key is the
primary key.
Ensuring that a value that appears in one relation for a given set of attributes also appears
for a certain set of attributes in another relation.
Database Modification
Cascading
A constraint restricts the values that the table can store. We can declare integrity
constraints at the table level or the column level. In a column-level constraint, the
constraint type is specified after the column data type, i.e., before the
delimiting comma. In a table-level constraint, the constraint type is
specified as a separate comma-delimited clause after the columns are defined.
1. Not Null
2. Unique Key
3. Check
4. Primary Key
5. Foreign Key
6. Default
1. Not Null
If a column in a table is specified as NOT NULL, then it is not possible to insert a null into
that column. It can be implemented with the CREATE and ALTER commands. When we add
the NOT NULL constraint with the ALTER command, there must not be any null values in the
existing table.
2. Unique Key
A unique key constraint ensures that no two rows store the same value in the
constrained column. Unlike a primary key, a unique column may contain null values.
3. Check
Check constraint is used to restrict the values before inserting into a table.
4. Primary Key
The column (or columns) with which we can identify each row of the table is called the
primary key. A primary key is a combination of a unique and a not-null
constraint: it allows neither null nor duplicate values. A table can have only one
primary key.
A primary key can be declared on two or more columns as a Composite Primary
Key.
5. Foreign Key
Columns defined as foreign keys refer to the primary key of another table. The
foreign key “points” to a primary key of another table, guaranteeing that you
cannot enter data into a table unless the referenced table already contains the matching
data; this enforces REFERENTIAL INTEGRITY. A foreign key column may take null values.
6. Default
A default constraint supplies a predefined value for a column when an INSERT
statement does not specify one.
Example
Create table Employee (empno number (4) constraint pk_emp primary key,
ename varchar2(50),
salary number(10,2),
hire_date date,
gender char(1) constraint chk_gen check(gender in ('M', 'F', 'm', 'f')),
email varchar2(50) unique );
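The constraints above can be exercised end to end. The sketch below uses Python's sqlite3 with a trimmed employee table adapted from the example (the Oracle varchar2 types become plain TEXT; the exact error text is SQLite-specific):

```python
import sqlite3

# In-memory database; schema adapted from the Employee example above
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE employee (
        empno  INTEGER PRIMARY KEY,
        ename  TEXT NOT NULL,
        gender TEXT CHECK (gender IN ('M', 'F', 'm', 'f')),
        email  TEXT UNIQUE
    )""")

con.execute("INSERT INTO employee VALUES (1, 'Ram', 'M', 'ram@x.com')")

# A value outside the CHECK list is rejected by the integrity machinery
try:
    con.execute("INSERT INTO employee VALUES (2, 'Sita', 'X', 'sita@x.com')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The second insert never reaches the table, so the row count stays at one.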
CREATE TABLE Persons ( P_Id int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255) );
Example
To add a constraint to a table
3.6.4 Data Manipulation language (DML)
The SELECT statement is used to select data from a table. The tabular result is stored in a
result table (called the result-set).
To select all columns from a table, use the * symbol instead of column names.
The DISTINCT keyword is used to return only distinct (different) values.
Syntax
SELECT "column_name" FROM "table_name"
SELECT * FROM "table_name"
SELECT DISTINCT "column_name" FROM "table_name"
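A minimal, self-contained sketch of SELECT *, a column projection, and SELECT DISTINCT, using Python's sqlite3; the student table and its rows are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE student (crn INTEGER, name TEXT, program TEXT)")
con.executemany("INSERT INTO student VALUES (?, ?, ?)",
                [(1, 'Ram', 'BCT'), (2, 'Sita', 'BEX'), (3, 'Hari', 'BCT')])

# SELECT * returns every column of every row
rows = con.execute("SELECT * FROM student").fetchall()

# DISTINCT collapses duplicate values in the named column
programs = con.execute("SELECT DISTINCT program FROM student").fetchall()
print(rows, programs)
```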
Where
To conditionally select data from a table, a WHERE clause can be added to the SELECT
statement.
Comparison results can be combined using the logical connectives and, or, and not.
With the WHERE clause, the following operators can be used
Syntax
SELECT "column_name" FROM "table_name"
WHERE "condition"
SELECT "column_name" FROM "table_name"
WHERE "simple condition" [AND|OR] "simple condition"
SELECT "column_name" FROM "table_name"
WHERE "column_name" BETWEEN 'value1' AND 'value2'
SELECT "column_name" FROM "table_name"
WHERE "column_name" IN ('value1', 'value2', ...)
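Each WHERE form above can be tried directly (Python sqlite3; the account table and its values are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (number TEXT, branch TEXT, balance INTEGER)")
con.executemany("INSERT INTO account VALUES (?, ?, ?)",
                [('A-101', 'KTM', 500), ('A-102', 'PKR', 1500), ('A-103', 'KTM', 3000)])

# Simple conditions combined with AND
r1 = con.execute(
    "SELECT number FROM account WHERE branch = 'KTM' AND balance > 1000").fetchall()

# BETWEEN is inclusive on both ends
r2 = con.execute(
    "SELECT number FROM account WHERE balance BETWEEN 500 AND 1500").fetchall()

# IN tests membership in a list of values
r3 = con.execute(
    "SELECT number FROM account WHERE branch IN ('KTM', 'PKR')").fetchall()
print(r1, r2, r3)
```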
String Operations
The LIKE operator is used for pattern matching on strings: % matches any substring
and _ matches any single character (e.g., LNAME LIKE 'S%' matches last names beginning with S).
Order By
Used to sort the tuples in either ascending or descending order in the result of a query.
Specify desc for descending order or asc for ascending order for each attribute;
ascending order is the default.
syntax
SELECT "column_name" FROM "table_name"[WHERE "condition"]
ORDER BY "column_name" [ASC, DESC]
Suppose that we wish to list the entire Student relation in ascending order of age. If
several students have the same age, then order them in descending order by first name.
Syntax
SELECT * FROM Student ORDER BY age asc, first_name desc
Tuple variables are defined in the from clause via the use of the as clause.
Find the names of all branches that have greater assets than some branch located in KTM
select distinct T.branch_name from branch as T, branch as S
where T.assets > S.assets and S.branch_city = 'KTM'
Keyword as is optional and may be omitted
borrower as T ≡ borrower T
The SQL allows renaming relations and attributes using the as clause:
old-name as new-name
Find the name, loan number and loan amount of all customers; rename the column name
loan_number as loan_id.
Select customer_name, borrower.loan_number as loan_id, amount
from borrower, loan
where borrower.loan_number = loan.loan_number
3.6.4.5 Set Operations
The set operations union, intersect, and except operate on relations and correspond to
the relational algebra operations ∪, ∩, −.
Each of the above operations automatically eliminates duplicates; to retain all duplicates
write union all, intersect all and except all in place of union, intersect and except
respectively.
Suppose a tuple occurs m times in r and n times in s, then, it occurs:
m + n times in r union all s
min(m,n) times in r intersect all s
max(0, m – n) times in r except all s
Syntax: (SQL statement) set-operation (SQL statement)
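The duplicate-handling rules can be checked quickly (Python sqlite3; note that SQLite supports UNION and UNION ALL but not INTERSECT ALL or EXCEPT ALL, so only the union case is demonstrated):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE r (v INTEGER)")
con.execute("CREATE TABLE s (v INTEGER)")
con.executemany("INSERT INTO r VALUES (?)", [(1,), (1,), (2,)])  # 1 occurs m = 2 times
con.executemany("INSERT INTO s VALUES (?)", [(1,), (3,)])        # 1 occurs n = 1 time

# UNION eliminates duplicates; UNION ALL keeps m + n copies of each tuple
union     = con.execute("SELECT v FROM r UNION SELECT v FROM s").fetchall()
union_all = con.execute("SELECT v FROM r UNION ALL SELECT v FROM s").fetchall()
ones = [t for t in union_all if t == (1,)]
print(sorted(union), len(ones))
```

Here the tuple 1 appears 2 + 1 = 3 times in the UNION ALL result but only once after UNION.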
An aggregate function summarizes the results of an expression over a number of rows, returning a
single value.
Some of the commonly used aggregate functions are
avg: average value
min: minimum value
max: maximum value
sum: sum of values
count: number of values
Remember: COUNT(*) is the only aggregate function that does not ignore nulls. SUM,
AVG, MIN, and MAX all ignore nulls.
The GROUP BY clause is used to group tuples that have the same value on the given
attribute(s): each subgroup consists of the set of tuples that share the same value for the
grouping attribute(s), and tuples that agree on the attributes listed in the GROUP BY
clause are placed in one group.
Syntax
SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name condition
GROUP BY column_name
The HAVING clause is used for specifying a selection condition on groups (rather than
on individual tuples)
Predicates in the having clause are applied after the formation of groups whereas
predicates in the where clause are applied before forming groups
Syntax
SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name condition
GROUP BY column_name
HAVING aggregate_function(column_name) condition
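The WHERE-before-grouping versus HAVING-after-grouping distinction can be sketched as follows (Python sqlite3; the instructor rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE instructor (name TEXT, dept_name TEXT, salary INTEGER)")
con.executemany("INSERT INTO instructor VALUES (?, ?, ?)",
                [('A', 'CS', 70000), ('B', 'CS', 90000),
                 ('C', 'Math', 60000), ('D', 'Math', 62000)])

# HAVING filters whole groups after they are formed
rows = con.execute("""
    SELECT dept_name, AVG(salary)
    FROM instructor
    GROUP BY dept_name
    HAVING AVG(salary) > 65000
""").fetchall()
print(rows)
```

Only the CS group (average 80000) survives the HAVING condition; the Math group (average 61000) is dropped.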
3.6.4.7 Null Values
It is possible for tuples to have a null value, denoted by null, for some of their attributes
Null signifies an unknown value or that a value does not exist.
The predicate is null can be used to check for null values.
Example: Find all loan numbers which appear in the loan relation with null values for amount.
select loan_number
from loan
where amount is null
The result of any arithmetic expression involving null is null
Example: 5 + null returns null
Any comparison with null returns unknown
Example: 5 < null, null <> null, and null = null all evaluate to unknown
Three-valued logic using the truth value unknown:
OR: (unknown or true) = true,
(unknown or false) = unknown
(unknown or unknown) = unknown
AND: (true and unknown) = unknown,
(false and unknown) = false,
(unknown and unknown) = unknown
NOT: (not unknown) = unknown
“P is unknown” evaluates to true if predicate P evaluates to unknown
Result of where clause predicate is treated as false if it evaluates to unknown
Total all loan amounts
select sum (amount )
from loan
Above statement ignores null amounts
Result is null if there is no non-null amount
All aggregate operations except count (*) ignore tuples with null values on the
aggregated attributes.
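These null rules can be verified directly (Python sqlite3; the loan rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE loan (loan_number TEXT, amount INTEGER)")
con.executemany("INSERT INTO loan VALUES (?, ?)",
                [('L-11', 900), ('L-14', 1500), ('L-15', None)])

# SUM and COUNT(column) skip nulls; COUNT(*) counts every tuple
total  = con.execute("SELECT SUM(amount) FROM loan").fetchone()[0]
n_amt  = con.execute("SELECT COUNT(amount) FROM loan").fetchone()[0]
n_rows = con.execute("SELECT COUNT(*) FROM loan").fetchone()[0]

# IS NULL is the only way to test for a missing amount
nulls = con.execute("SELECT loan_number FROM loan WHERE amount IS NULL").fetchall()
print(total, n_amt, n_rows, nulls)
```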
3.6.5 Nested Queries (Sub Queries)
A nested query is a form of SELECT command that appears inside another SQL
statement; it is also termed a subquery.
A SELECT command containing a subquery is referred to as the parent statement. The
rows returned by the subquery are used by the parent statement.
Sub queries are SELECT statements embedded within another SELECT statement
the results of the inner SELECT statement (or sub select) are used in the outer
statement to help determine the contents of the final result
Evaluation proceeds from the inner query outward: the innermost subquery is evaluated first, and its result is used by the enclosing statement
A subselect can be used in the WHERE and HAVING clauses of an
outer SELECT statement
Sub queries can be used with a number of operators:
IN, NOT IN
ALL
SOME, ANY
EXISTS, NOT EXISTS
The ALL operator may be used with subqueries that produce a single column of numbers.
If the subquery is preceded by the keyword ALL, the condition will only be TRUE if it is
satisfied by all the values produced by the subquery
The SOME operator may be used with subqueries that produce a single column of
numbers. SOME and ANY can be used interchangeably.
If the subquery is preceded by the keyword SOME, the condition will only be TRUE if it
is satisfied by any (one or more) values produced by the subquery.
EXISTS and NOT EXISTS produce a simple TRUE/FALSE result.
EXISTS is TRUE if and only if there exists at least one row in the result table returned by
the subquery; it is FALSE if the subquery returns an empty result table. NOT EXISTS is
the opposite of EXISTS.
(= some) ≡ in However, (≠ some) ≠ not in
(≠ all) ≡ not in However, (= all) ≠ in
Sql Example: Find all customers who have both an account and a loan at the Perryridge
branch
Select distinct customer_name
from borrower, loan
where borrower.loan_number = loan.loan_number and
branch_name = 'Perryridge' and
(branch_name, customer_name )
in (select branch_name, customer_name
from depositor, account
where depositor.account_number = account.account_number )
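The IN / NOT IN forms can be sketched on a pair of toy relations (Python sqlite3; the borrower and depositor rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE borrower (customer_name TEXT)")
con.execute("CREATE TABLE depositor (customer_name TEXT)")
con.executemany("INSERT INTO borrower VALUES (?)", [('Ram',), ('Sita',), ('Hari',)])
con.executemany("INSERT INTO depositor VALUES (?)", [('Sita',), ('Gita',)])

# IN: borrowers who also appear in depositor (loan AND account)
both = con.execute("""
    SELECT customer_name FROM borrower
    WHERE customer_name IN (SELECT customer_name FROM depositor)
""").fetchall()

# NOT IN: borrowers with a loan but no account
only_loan = con.execute("""
    SELECT customer_name FROM borrower
    WHERE customer_name NOT IN (SELECT customer_name FROM depositor)
""").fetchall()
print(both, only_loan)
```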
Find all branches that have greater assets than some branch located in KTM.
Select branch_name
from branch
where assets > some
(select assets
from branch
where branch_city = 'KTM')
Find all customers who have a loan at the bank but do not have an account at the bank
Select distinct customer_name
from borrower
where customer_name not in (select customer_name
from depositor)
Update Statement
UPDATE "table_name" SET "column_1" = [new value] WHERE {condition}
UPDATE table_name
SET column = CASE
WHEN predicate1 THEN result1
WHEN predicate2 THEN result2
...
WHEN predicateN THEN resultN
ELSE result
END
Example: Increase all accounts with balances of 20,000 or less by 10%; all other accounts
receive 15%.
Update account
set balance = case
when balance <= 20000 then balance *1.1
else balance * 1.15
end
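The same CASE update can be run under Python's sqlite3 (account rows invented; balances at or below 20,000 get 10%, the rest 15%):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (number TEXT, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?)",
                [('A-1', 10000), ('A-2', 30000)])

# CASE chooses the raise per row inside a single UPDATE statement
con.execute("""
    UPDATE account
    SET balance = CASE
        WHEN balance <= 20000 THEN balance * 1.1
        ELSE balance * 1.15
    END
""")
rows = con.execute("SELECT number, balance FROM account ORDER BY number").fetchall()
print(rows)
```

After the update, A-1 holds about 11,000 and A-2 about 34,500.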
Join operations take two relations and return as a result another relation.
These additional operations are typically used as subquery expressions in the from clause
In relational databases, a join operation matches records in two tables. The two tables
must be joined by at least one common field. That is, the join field is a member of both
tables.
Equi-join–a join in which the joining condition is based on equality between values in
the common columns; common columns appear redundantly in the result table
Natural join–an equi-join in which one of the duplicate columns is eliminated in the
result table
An inner join is a join in which the DBMS selects records from two tables only when the
records have the same value in the common field that links the tables
An outer join returns all rows that satisfy the join condition and those rows from one
table for which no rows from the other satisfy the join condition. Such rows are not
returned by a simple join.
There are three types of an OUTER JOIN: LEFT, RIGHT, and FULL. The LEFT
OUTER JOIN keeps the stray rows from the “left” table (the one listed first in your query
statement). In the result set, columns from the other table that have no corresponding data
are filled with NULL values. Similarly, the RIGHT OUTER JOIN keeps stray rows from
the right table, filling columns from the left table with NULL values. The FULL OUTER
JOIN keeps all stray rows as part of the result set.
subject
sub_code sub_name lecture_hour practical_hour
201CT DBMS 6 3
335CT DSA 6 3
346CT TOC 4 0
445CT CG 4 3
author
author_ID sub_name
A03 SE
B315 DSA
KP05 TOC
SP35 MP
SELECT * FROM subject S FULL OUTER JOIN author A ON S.sub_name = A.sub_name
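A runnable sketch of inner versus left outer join on the subject/author tables above (Python sqlite3; FULL OUTER JOIN requires SQLite 3.39 or newer, so the LEFT variant is shown here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE subject (sub_code TEXT, sub_name TEXT)")
con.execute("CREATE TABLE author (author_id TEXT, sub_name TEXT)")
con.executemany("INSERT INTO subject VALUES (?, ?)",
                [('201CT', 'DBMS'), ('335CT', 'DSA'), ('346CT', 'TOC')])
con.executemany("INSERT INTO author VALUES (?, ?)",
                [('B315', 'DSA'), ('KP05', 'TOC'), ('SP35', 'MP')])

# Inner join keeps only rows that match in both tables
inner = con.execute("""
    SELECT s.sub_name, a.author_id
    FROM subject s JOIN author a ON s.sub_name = a.sub_name
""").fetchall()

# LEFT OUTER JOIN also keeps stray subjects; author_id is filled with NULL
left = con.execute("""
    SELECT s.sub_name, a.author_id
    FROM subject s LEFT OUTER JOIN author a ON s.sub_name = a.sub_name
""").fetchall()
print(inner, left)
```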
3.8 Embedded SQL
The SQL standard defines embeddings of SQL in a variety of programming languages
such as C, Java, and C#.
A language in which SQL queries are embedded is referred to as a host language, and the
SQL structures permitted in the host language comprise embedded SQL.
Approach: Embed SQL in the host language.
A preprocessor converts the SQL statements into special API calls.
Then a regular compiler is used to compile the code.
Language constructs:
Connecting to a database:
EXEC SQL CONNECT
Declaring variables:
EXEC SQL BEGIN DECLARE SECTION ... EXEC SQL END DECLARE SECTION
Statements:
EXEC SQL Statement
3.9 View
A view is a logical or virtual table that does not exist physically in the
database. Views are also called dictionary objects.
Advantages
Security - The confidential columns can be suppressed from viewing
and manipulation.
Complexity can be avoided.
Logical columns can be created.
Views can represent a subset of the data contained in a table
Views can join and simplify multiple tables into a single virtual table
Views can act as aggregated tables, where the database engine aggregates data
(sum, average etc.) and presents the calculated results as part of the data
Views take very little space to store; the database contains only the definition of a
view, not a copy of all the data it presents
Views can limit the degree of exposure of a table or tables to the outer world
Views can be classified into two categories based on the way they are created.
Simple View
If the view is created from a single table, it is called a simple
view. DML operations are allowed on a simple view.
Complex View
If the view is created on multiple tables, it is called a complex
view.
To create view
Create View < View Name> as <Query>
Example
A view of instructors without their salary
create view faculty as
select ID, name, dept_name
from instructor
Create a view of department salary totals
create view departments_total_salary(dept_name, total_salary) as
select dept_name, sum (salary)
from instructor
group by dept_name;
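The faculty view can be reproduced under Python's sqlite3; the point is that the view exposes no salary column and stores no data of its own:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE instructor (id TEXT, name TEXT, dept_name TEXT, salary INTEGER)")
con.executemany("INSERT INTO instructor VALUES (?, ?, ?, ?)",
                [('1', 'A', 'CS', 70000), ('2', 'B', 'CS', 90000)])

# The view definition is stored; the data stays in the base table
con.execute("""
    CREATE VIEW faculty AS
    SELECT id, name, dept_name FROM instructor
""")
cur = con.execute("SELECT * FROM faculty")
rows = cur.fetchall()
cols = [d[0] for d in cur.description]
print(rows, cols)
```

Queries against faculty see every instructor but cannot reach the suppressed salary column.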
Database Security
Security Levels
Authorization
For security purposes, we may assign a user several forms of authorization on parts of the
database, which allow:
Read: read tuples.
Insert: insert new tuple, not modify existing tuples.
Update: modification, not deletion, of tuples.
Delete: deletion of tuples.
We may assign the user all, none, or a combination of these. In addition to the previously
mentioned, we may also assign a user rights to modify the database schema:
Index: allows creation and modification of indices.
Resource: allows creation of new relations.
Alteration: addition or deletion of attributes in a tuple.
Drop: allows the deletion of relations.
Authorization in SQL
The SQL language offers a fairly powerful mechanism for defining authorizations by
using privileges.
Privileges in SQL
SQL standard includes the privileges:
Delete
Insert
Select
Update
References: permits declaration of foreign keys.
SQL includes commands to grant and revoke privileges.
GRANT Command Syntax
grant <privilege list> on <relation or view name> to <user>
The following privileges can be specified:
SELECT: Can read all columns (including those added later via ALTER TABLE
command).
INSERT (col-name): Can insert tuples with non-null or non-default values in this
column.
INSERT means same right with respect to all columns.
DELETE: Can delete tuples.
REFERENCES (col-name): Can define foreign keys (in other tables) that refer to
this column.
If a user has a privilege with the GRANT OPTION, he or she can pass the privilege on to other
users (with or without passing on the GRANT OPTION).
Examples
GRANT INSERT, SELECT ON Student TO Ram
• Ram can query Student or insert tuples into it.
GRANT DELETE ON Student TO Hari WITH GRANT OPTION
• Hari can delete tuples, and also authorize others to do so.
GRANT UPDATE (age) ON Student TO Dinesh
• Dinesh can update (only) the age field of Student tuples.
GRANT SELECT ON ActiveSailors TO Guppy, Yuppy
• This does NOT allow Guppy and Yuppy to query the underlying Sailors table directly!
By default, a user granted privileges is not allowed to grant those privileges to other
users. To allow this, we append the “with grant option” clause to the appropriate
grant command.
grant select on branch to U1 with grant option
To revoke a privilege we use the ‘revoke’ clause, which is used very much like ‘grant’.
Revoke Command syntax
revoke <privilege list> on <relation or view name> from <user list>
Example: Revoke INSERT, SELECT ON Sailors from Horatio
Roles
Roles are a collection of privileges or access rights. When there are many users in a
database, it becomes difficult to grant or revoke privileges user by user. If you
define roles instead, granting or revoking a privilege on a role automatically grants
or revokes it for every user assigned to that role.
The Syntax to create a role is:
CREATE ROLE role_name
To grant the CREATE TABLE privilege to a user by way of a role:
First, create a manager role:
• CREATE ROLE manager
Second, grant the CREATE TABLE privilege to the role manager. You can add
more privileges to the role:
• GRANT CREATE TABLE TO manager
Third, grant the role to Ram:
• GRANT manager TO Ram
To revoke the CREATE TABLE privilege from the manager role, you can write:
• REVOKE CREATE TABLE FROM manager
3.11 Triggers and Assertion
Triggers
A trigger is a statement that the system executes automatically as a side effect of a
modification to the database.
To design a trigger we must meet two requirements:
Specify when a trigger is to be executed. This is broken up into an event that
causes the trigger to be checked and a condition that must be satisfied for trigger
execution to proceed.
Specify the actions to be taken when the trigger executes.
This is referred to as the event-condition-action model of triggers. The database stores
triggers just as if they were regular data. This way they are persistent and are accessible
to all database operations. Once a trigger is entered into the database, the database system
takes on the responsibility of executing it whenever the event occurs and the condition is
satisfied.
Need for Triggers: A good use for a trigger would be, for instance, if you own a
warehouse and you sell out of a particular item, to automatically re-order that item and
automatically generate the order invoice. So, triggers are very useful for automating
things in your database.
Three parts of a trigger:
Event (activates the trigger) insert, delete or update of the database.
Condition (tests whether the trigger should run) a Boolean statement or a query
Action (what happens if the trigger runs) wide variety of options.
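The warehouse re-order scenario above can be sketched as an event-condition-action trigger in SQLite (table and trigger names are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stock  (item TEXT, qty INTEGER)")
con.execute("CREATE TABLE orders (item TEXT)")
con.execute("INSERT INTO stock VALUES ('widget', 1)")

# Event: UPDATE on stock; condition: quantity reaches 0; action: insert a re-order
con.execute("""
    CREATE TRIGGER reorder AFTER UPDATE ON stock
    WHEN new.qty = 0
    BEGIN
        INSERT INTO orders VALUES (new.item);
    END
""")

# Selling the last widget fires the trigger automatically
con.execute("UPDATE stock SET qty = qty - 1 WHERE item = 'widget'")
orders = con.execute("SELECT item FROM orders").fetchall()
print(orders)
```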
Assertions
An assertion is a statement in SQL that ensures a certain condition will always exist in
the database. Assertions are like column and table constraints, except that they are
specified separately from table definitions. An example of a column constraint is NOT
NULL, and an example of a table constraint is a compound foreign key, which, because
it's compound, cannot be declared with column constraints.
Defined independently from any table.
Activated on any modification of any table mentioned in the assertion.
Components include:
a constraint name,
followed by CHECK,
followed by a condition
Query result must be empty
If the query result is not empty, the assertion has been violated
Example : The salary of an employee must not be greater than the salary of the manager
of the department that the employee works for
Example on SQL
STUDENT (CRN, FNAME, LNAME, DOB, AGE, DISTRICT, WARD_NO,
VDC/MUNICIPALITY, PROGRAM, BATCH, PHONE)
ACCOUNT (CRN, ACCOUNT_ID, FEE, PAID_MONEY)
1. Find Name (combine FNAME and LNAME) of students who study in either BCT or BEX.
SELECT FNAME+' '+LNAME AS NAME
FROM STUDENT
WHERE PROGRAM IN ('BCT','BEX')
Alternate method:
SELECT FNAME+' '+LNAME AS NAME
FROM STUDENT
WHERE PROGRAM ='BCT' or PROGRAM ='BEX'
4. Find FNAME of the students whose last name begins with ‘S’.
SELECT FNAME
FROM STUDENT
WHERE LNAME LIKE 'S%'
6. Find CRN and fname of the students whose lname has at least five characters.
SELECT CRN,FNAME
FROM STUDENT
WHERE LNAME LIKE '_____%'
7. Sort the list of students according to age in ascending order. If there are number of students
having same age then sort them in descending order according to first name.
SELECT *
FROM STUDENT
ORDER BY AGE ASC,FNAME DESC
8. Find the name of the students who live in KTM district and whose paid money is 20000.
SELECT FNAME+' '+LNAME AS NAME
FROM STUDENT, ACCOUNT
WHERE STUDENT.CRN = ACCOUNT.CRN
AND ACCOUNT.PAID_MONEY = 20000
AND STUDENT.DISTRICT = 'KATHMANDU'
9. Count the total number of students in each ward of Lalitpur metropolitan city.
SELECT WARD_NO, COUNT(CRN) AS NUMBER_OF_STUDENT_IN_WARD
FROM STUDENT
WHERE DISTRICT='LALITPUR'
GROUP BY WARD_NO
10. Find the eldest person name in each batch.
SELECT FNAME, AGE, BATCH
FROM STUDENT, (SELECT MAX(AGE) AS MAX_AGE, BATCH FROM STUDENT GROUP BY BATCH) AS T(A, B)
WHERE STUDENT.AGE = T.A AND STUDENT.BATCH = T.B
11. Find the name of students whose age and batch is same as of Rita.
SELECT FNAME,AGE,BATCH
FROM STUDENT
WHERE AGE IN (SELECT AGE
FROM STUDENT
WHERE FNAME='RITA')AND
BATCH IN(SELECT BATCH FROM STUDENT
WHERE FNAME='RITA')
12. Find the program and average age of students in each program where the average age is greater than 20.
SELECT PROGRAM, AVG(AGE) AS AVG_AGE
FROM STUDENT
GROUP BY PROGRAM
HAVING AVG(AGE)>20
13. Display NAME and ACCOUNT_ID of student who live in Chitwan district.
SELECT FNAME+' '+LNAME AS NAME, ACCOUNT_ID
FROM STUDENT, ACCOUNT
WHERE STUDENT.CRN = ACCOUNT.CRN
AND STUDENT.DISTRICT = 'CHITWAN'
16. Find the name of students who are older than some student who lives in KTM district.
19. Change the batch 2066 to 2068 of all students whose program is 'BCT'
UPDATE student
SET batch = 2068
WHERE program = 'BCT' and batch = 2066
20. Update the batch of all students from 2066 to 2068 and from 2067 to 2069, leaving all
other batches unchanged.
UPDATE student
SET batch = case
When batch = 2066 then 2068
When batch = 2067 then 2069
Else batch
End
Relational algebra is the basic set of operations for the relational model
Relational Algebra is algebra whose operands are relations and operators are designed to
do the most commons things that we need to do with relations.
A relation schema is given by R(A1,…,Ak), the name of the relation and the list of the
attributes in the relation
A relation is a set of tuples that are valid instances of its schema
Relational algebra expressions take as input relations and produce as output new
relations.
After each operation, the attributes of the participating relations are carried to the new
relation. The attributes may be renamed, but their domain remains the same.
Select
Project
Union
Set Difference (or Subtract or minus)
Cartesian product
Rename
Bank Schema
Select Operation
Unary operations
The SELECT operation (denoted by σ (sigma)) is used to select a subset of the tuples
from a relation based on a selection condition
Notation: σ p(r)
p is called the selection predicate
Defined as:
σp(r) = {t | t ∈ r and p(t)}
Where p is a formula in propositional calculus consisting of terms connected by : ∧ (and),
∨ (or), ¬ (not)
Each term is one of:
<attribute> op <attribute> or <attribute> op <constant>
where op is one of: =, ≠, >, ≥, <, ≤
The SELECT operation σ<selection condition>(R) produces a relation S that has the same
schema (same attributes) as R
SELECT σ is commutative:
σ<condition1>(σ<condition2>(R)) = σ<condition2>(σ<condition1>(R))
Because of the commutativity property, a cascade (sequence) of SELECT operations may be
applied in any order:
σ<cond1>(σ<cond2>(σ<cond3>(R))) = σ<cond2>(σ<cond3>(σ<cond1>(R)))
A cascade of SELECT operations may be replaced by a single selection with a
conjunction of all the conditions:
σ <cond1> (σ <cond2> (σ <cond3> (R))) = σ <cond1> AND <cond2> AND <cond3> (R)
The number of tuples in the result of a SELECT is less than (or equal to) the number of
tuples in the input relation R
Example of selection:
σ branch_name=“Perryridge”(account)
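The commutativity and cascade properties above can be checked on a small relation modeled as a Python set of tuples; the account tuples below are invented sample data, and `select` is a hypothetical helper standing in for σ.

```python
# A relation as a set of (account_number, branch_name, balance) tuples.
# Sample data is invented for illustration.
account = {("A-101", "Perryridge", 500),
           ("A-102", "Perryridge", 900),
           ("A-201", "Brighton", 700)}

def select(pred, r):
    """sigma: keep the tuples of r satisfying the predicate."""
    return {t for t in r if pred(t)}

p1 = lambda t: t[1] == "Perryridge"   # branch_name = "Perryridge"
p2 = lambda t: t[2] > 600             # balance > 600

# Cascaded selections commute...
assert select(p1, select(p2, account)) == select(p2, select(p1, account))
# ...and equal one selection with the conjunction of the conditions.
assert select(p1, select(p2, account)) == select(lambda t: p1(t) and p2(t), account)
print(select(p1, select(p2, account)))  # {('A-102', 'Perryridge', 900)}
```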
Project Operation
Unary operations
Notation: ∏ A1, A2, …, Ak (r)
Where A1, A2, …, Ak are attribute names and r is a relation name.
The result is defined as the relation of k columns obtained by erasing the columns that are
not listed
Duplicate rows removed from result, since relations are sets
PROJECT is not commutative
Example: To eliminate the branch_name attribute of account
∏account_number, balance (account)
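Because relations are sets, duplicate elimination after a projection falls out automatically when relations are modeled as Python sets; the tuples below are invented, and `project` is a hypothetical helper standing in for ∏.

```python
# Projection erases the unlisted columns; since relations are sets,
# rows that become identical merge into one. Sample data is invented.
account = {("A-101", "Perryridge", 500),
           ("A-102", "Perryridge", 500),
           ("A-201", "Brighton", 700)}

def project(indices, r):
    """pi: keep only the columns at the given positions."""
    return {tuple(t[i] for i in indices) for t in r}

# Keep branch_name and balance: two rows collapse into one.
result = project((1, 2), account)
print(result)  # {('Perryridge', 500), ('Brighton', 700)}
```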
Union Operation
Binary operations
Notation: r ∪ s
Defined as:
r ∪ s = {t | t ∈ r or t ∈ s}
Duplicate tuples are eliminated
For r ∪ s to be valid:
r, s must have the same arity (same number of attributes)
The attribute domains must be compatible (example: the 2nd column of r deals with
the same type of values as does the 2nd column of s)
Example: to find all customers with either an account or a loan
∏customer_name (depositor) ∪ ∏customer_name (borrower)
Cartesian-Product Operation
Notation r x s
Defined as:
r x s = {t q | t ∈ r and q ∈ s}
If attributes of r(R) and s(S) are not disjoint, then renaming must be used.
This operation is used to combine tuples from two relations in a combinatorial fashion.
Denoted by R(A1, A2, . . ., An) x S(B1, B2, . . ., Bm)
Result is a relation Q with degree n + m attributes: Q(A1, A2, . . ., An, B1, B2, . . ., Bm),
in that order.
The resulting relation state has one tuple for each combination of tuples—one from R and
one from S. Hence, if R has nR tuples (denoted as |R| = nR ), and S has nS tuples, then R x
S will have nR * nS tuples.
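The tuple count |R × S| = nR * nS can be confirmed by forming the product of two small invented relations as sets of tuples:

```python
from itertools import product

# Cartesian product: each tuple of r pairs with each tuple of s, so
# |r x s| = |r| * |s|. Sample data is invented.
r = {("A-101", 500), ("A-102", 900), ("A-201", 700)}
s = {("Perryridge", "Horseneck"), ("Brighton", "Brooklyn")}

cross = {t + q for t, q in product(r, s)}  # concatenate attribute lists
assert len(cross) == len(r) * len(s)
print(len(cross))  # 6
```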
Rename Operation
In some cases, we may want to rename the attributes of a relation or the relation name or
both
The general RENAME operation ρ can be expressed by any of the following forms:
• ρS (B1, B2, …, Bn )(R) changes both:
The relation name to S, and the column names to B1, B2, …, Bn
• ρS(R) changes:
the relation name only to S
If a relational-algebra expression E has arity n, then
ρ X (A1, A2, …, An) (E)
• Returns the result of expression E under the name X, and with the attributes
renamed to A1, A2, …, An.
Find the largest account balance
• ∏balance(account) - ∏account.balance (σaccount.balance < d.balance (account x ρd (account)))
Set-Intersection Operation
Notation: r ∩ s
Defined as:
r ∩ s = { t | t ∈ r and t ∈ s }
Assume:
r, s have the same Arity
Attributes of r and s are compatible
Note: r ∩ s = r – (r – s)
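The identity r ∩ s = r − (r − s) is easy to verify with sets of tuples; the customer names below are invented sample data.

```python
# Intersection expressed with set difference alone: r ∩ s = r − (r − s).
# Sample data is invented.
r = {("Smith",), ("Jones",), ("Hayes",)}   # e.g. depositor names
s = {("Jones",), ("Hayes",), ("Curry",)}   # e.g. borrower names

assert r & s == r - (r - s)
print(r - (r - s))  # {('Jones',), ('Hayes',)}
```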
Aggregate Functions and Operations
An aggregate function takes a collection of values and returns a single value as a result.
avg: average value
min: minimum value
max: maximum value
sum: sum of values
count: number of values
Aggregate operation in relational algebra
G1, G2, …, Gn 𝒢 F1(A1), F2(A2), …, Fn(An) (E)
Where E is any relational-algebra expression and G1, G2, …, Gn is a list of
attributes on which to group (can be empty)
Each Fi is an aggregate function
Each Ai is an attribute name
Example
Delete, Insert and Update Operation
Delete
A delete request is expressed similarly to a query, except that instead of displaying
tuples to the user, the selected tuples are removed from the database.
Can delete only whole tuples; cannot delete values on only particular attributes
A deletion is expressed in relational algebra by:
r←r–E
Where r is a relation and E is a relational algebra query.
Delete all account records in the New Road branch
account ← account − σ branch_name = “New Road” (account)
Delete all accounts at branches located in KTM
r1 ← σ branch_city = “KTM” (account ⋈ branch)
r2 ← ∏ account_number, branch_name, balance (r1)
r3 ← ∏ customer_name, account_number (r2 ⋈ depositor)
account ← account − r2
depositor ← depositor − r3
Insertion
To insert data into a relation, we either:
specify a tuple to be inserted
write a query whose result is a set of tuples to be inserted
in relational algebra, an insertion is expressed by:
r← r ∪ E
Where r is a relation and E is a relational algebra expression.
The insertion of a single tuple is expressed by letting E be a constant relation containing
one tuple.
Insert information in the database specifying that Smith has Rs. 1200 in account A-973 at
the Patan branch.
Account ← account ∪ {(“A-973”, “Patan”, 1200)}
Depositor ← depositor ∪ {(“Smith”, “A-973”)}
Provide as a gift for all loan customers in the Patan branch a Rs. 200 savings account. Let
the loan number serve as the account number for the new savings account.
r1 ← σ branch_name = “Patan” (borrower ⋈ loan)
Account ← account ∪ ∏ loan_number, branch_name, 200 (r1)
Depositor ← depositor ∪ ∏ customer_name, loan_number (r1)
Updating
A mechanism to change a value in a tuple without changing all values in the tuple
Use the generalized projection operator to do this task
r ← ∏ F1, F2, …, Fl (r)
Each Fi is either
the i-th attribute of r, if the i-th attribute is not updated, or,
if the attribute is to be updated, Fi is an expression, involving only constants and
the attributes of r, which gives the new value for the attribute
Pay all accounts with balances over $10,000 6 percent interest and pay all others 5
percent
Account ← ∏ account_number, branch_name, balance * 1.06 (σ balance > 10000 (account))
∪ ∏ account_number, branch_name, balance * 1.05 (σ balance ≤ 10000 (account))
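The interest update can be mimicked with two generalized projections unioned together, as a Python sketch over a set of invented (account_number, branch_name, balance) tuples:

```python
# Generalized projection for the interest update: one expression list for
# balances over 10000 (6%), another for the rest (5%), unioned together.
# Sample data is invented; round() keeps the floats tidy.
account = {("A-101", "Patan", 20000.0), ("A-102", "Patan", 8000.0)}

high = {(a, b, round(bal * 1.06, 2)) for (a, b, bal) in account if bal > 10000}
low  = {(a, b, round(bal * 1.05, 2)) for (a, b, bal) in account if bal <= 10000}
account = high | low

print(sorted(account))  # [('A-101', 'Patan', 21200.0), ('A-102', 'Patan', 8400.0)]
```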
The general form of a join operation on two relations R(A1, A2, . . ., An) and S(B1, B2,
. . ., Bm) is: R ⋈ <join condition> S
Where R and S can be any relations that result from general relational
algebra expressions.
The join condition is called theta
Theta can be any general Boolean expression on the attributes of R and S;
Most join conditions involve one or more equality conditions “AND”ed together.
Natural-Join Operation
Notation: r ⋈ s
Let r and s be relations on schemas R and S respectively. Then, r ⋈ s is a relation on
schema R ∪ S obtained as follows:
Consider each pair of tuples tr from r and ts from s.
If tr and ts have the same value on each of the attributes in R ∩ S, add a tuple t to
the result, where
• t has the same value as tr on R
• t has the same value as ts on S
Example:
R = (A, B, C, D)
S = (E, B, D)
Result schema = (A, B, C, D, E)
r ⋈ s is defined as:
∏ r.A, r.B, r.C, r.D, s.E (σr.B = s.B ∧ r.D = s.D (r x s))
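The definition above, a projection over a selection on the product, can be sketched directly in Python; the tuples of r(A, B, C, D) and s(E, B, D) below are invented.

```python
# Natural join of r(A, B, C, D) and s(E, B, D): select r.B = s.B and
# r.D = s.D from the product, then keep A, B, C, D, E once each.
# Sample data is invented.
r = {(1, "a", "x", 10), (2, "b", "y", 20)}   # (A, B, C, D)
s = {("e1", "a", 10), ("e2", "b", 99)}       # (E, B, D)

joined = {(A, B, C, D, E)
          for (A, B, C, D) in r
          for (E, B2, D2) in s
          if B == B2 and D == D2}
print(joined)  # {(1, 'a', 'x', 10, 'e1')}
```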
Outer Join
Example On Relational Algebra
2. Find the first name of student whose age is 20 and live in Kathmandu district
π f_name (σ age=20 ∧ district="KTM" (student))
3. Find the first name of student who live in Kathmandu or Lalitpur district
π f_name (σ district="KTM" ∨ district="LPR" (student))
π f_name (σ district="KTM" (student)) ∪ π f_name (σ district="LPR" (student))
4. Find the first name of student who do not live in Kathmandu district
π f_name (σ district≠"KTM" (student))
π f_name (student) − π f_name (σ district="KTM" (student))
6. Find the customer name who have made a loan at the bank located in Kathmandu city
π customer_name (σ Borrower.loan_number=loan.loan_number ∧ loan.branch_name=branch.branch_name ∧ branch_city="KTM" (Borrower × Loan × Branch))
7. Find the first name of youngest student name of the college
π f_name (student) − π student.f_name (σ student.age > d.age (student × ρ d (student)))
8. Find the first name of student whose batch and programme are same as Ram's batch and
programme.
π f_name (σ student.batch=d.Rbatch ∧ student.programme=d.Rprogramme (student × ρ d(Rbatch, Rprogramme) (π batch, programme (σ f_name="RAM" (student)))))
9. Find the name of customer who have an account and a loan at the bank
Π customer_name (borrower) ∩ Π customer_name (depositor)
10. Count the total number of students in each batch
batch 𝒢 count(CRN) as total_Student (student)
11. Find the average age of students of a college
𝒢 avg(age) as average_age (student)
12. Find the min age of student in each programme
programme 𝒢 min(age) as minimum_age (student)
13. Find the name of customer whose balance is 10000 at the bank.
Π customer_name (σ balance=10000 (account ⋈ depositor))
14. Delete the record of students who live in KTM district.
student ← student − σ district ="KTM " (student )
15. Delete the record of student whose fee is less than 10000.
r1 ← σ fee<10000 (student ⋈ account)
r2 ← Π CRN, f_name, l_name, age, batch, programme (r1)
r3 ← Π account_id, fee, due_amount, CRN (r1)
student ← student − r2
account ← account − r3
16. Delete the account records of customer who live in Kathmandu city
r1 ← σ customer_city="KTM" (account ⋈ depositor ⋈ customer)
r 2 ← Π account _ number ,branch _ name,balance (r1)
account ← account − r 2
r 3 ← Π account _ number ,customer _ name (r1)
depositor ← depositor − r 3
17. Increase the balance of all accounts by 10%
account ← Π account_number, branch_name, balance * 1.1 (account)
Chapter 4
4.1 Introduction
Normalization is a formal process for determining which fields belong in which tables
in a relational database.
Database normalization is the process of removing redundant data from tables in order
to improve storage efficiency, data integrity, and scalability.
Normalization is the process of efficiently organizing data in a database with two
goals in mind
First goal: eliminate redundant data
Second Goal: ensure data dependencies make sense
Bad database design may have
Repetition of information
Inability to represent certain information
In order to comply with the relational model it is necessary to
Remove repeating groups and
Avoid redundancy and data anomalies by removing partial and transitive
functional dependencies.
Relational Database Design: All attributes in a table must be atomic, and fully
dependent upon the primary key of that table.
Redundant data is where we have stored the same ‘information’ more than once, i.e.,
the redundant data could be removed without loss of information. Such
redundancy can lead to insert, delete, and update anomalies.
Purpose of Normalization
To avoid redundancy by storing each ‘fact’ within the database only once.
To put data into a form that conforms to relational principles (e.g., single valued
attributes, each relation represents one entity) - no repeating groups.
To put the data into a form that is more able to accurately accommodate change.
To avoid certain updating ‘anomalies’.
To facilitate the enforcement of data constraints.
Advantages of Normalization
Given a set of attributes A, define the closure of A under F (denoted by A+) as the set of
attributes that are functionally determined by A under F
Algorithm to compute A+, the closure of A under F
result := A;
while (changes to result) do
for each β → γ in F do
begin
if β ⊆ result then result := result ∪ γ
end
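The closure algorithm above translates directly into a fixpoint loop; the sketch below writes each dependency β → γ as a pair of frozensets, with an invented FD set on R(A, B, C, D) for illustration.

```python
# Attribute closure: repeatedly apply every FD whose left side is already
# contained in the result, until nothing changes.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# Invented example FDs on R(A, B, C, D): A -> B, B -> C, A -> D.
F = [(frozenset("A"), frozenset("B")),
     (frozenset("B"), frozenset("C")),
     (frozenset("A"), frozenset("D"))]

print(sorted(closure({"A"}, F)))  # ['A', 'B', 'C', 'D']  -> A is a superkey
print(sorted(closure({"B"}, F)))  # ['B', 'C']            -> B is not
```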
Uses of Attribute Closure
• Testing for superkey - To test if α is a superkey, we compute α+ and check
whether α+ contains all attributes of R; if so, α is a superkey of R.
• Testing functional dependencies -To check if a functional dependency α → β
holds (or, in other words, is in F+), just check if β ⊆ α+.
Sets of functional dependencies may have redundant dependencies that can be inferred
from the others
For example: A → C is redundant in: {A → B, B → C, A → C}
Parts of a functional dependency may be redundant
E.g.: on RHS: {A → B, B → C, A → CD} can be simplified to
{A → B, B → C, A → D}
E.g.: on LHS: {A → B, B → C, AC → D} can be simplified to
{A → B, B → C, A → D}
Intuitively, a canonical cover of F is a “minimal” set of functional dependencies
equivalent to F, having no redundant dependencies or redundant parts of dependencies
Example 1: Given F = {A → C, AB → C }
B is extraneous in AB → C because A → C alone logically implies
AB → C (i.e., AB → C still holds after dropping B).
A relation is Unnormalised when it has not had any normalization rules applied to it, and
it suffers from various anomalies.
A relation has a repeating group if it has more than one value for a given key.
A repeating group is an attribute (or set of attributes) that can have more than one value
for a primary key value.
Repeating Groups are not allowed in a relational design, since all attributes have to be
‘atomic’ - i.e., there can only be one value per cell in a table.
A relational schema R is in first normal form if the domains of all attributes of R are
atomic. Domain is atomic if its elements are considered to be indivisible units.
Ensure that each table has a primary key: minimal set of attributes which can uniquely
identify a record.
Eliminate repeating groups (categories of data which would seem to be required a
different number of times on different records) by defining keyed and non-keyed
attributes appropriately.
Atomicity: Each attribute must contain a single value, not a set of values.
A relation is in 2NF if, and only if, it is in 1NF and every non-key attribute is fully
dependent on the primary key.
Remove partial functional dependencies into a new relation
A relation R is in 2NF if
R is 1NF , and
All non-prime attributes are fully dependent on the candidate keys.
Converting from 1NF to 2NF:
Identify the primary key for the 1NF relation.
Identify the functional dependencies in the relation.
If partial dependencies exist on the primary key, remove them by placing them in a
new relation along with a copy of their determinant
A relation is in 3NF if, and only if, it is in 2NF and every non-key attribute is non-
transitively dependent on the primary key.
Remove transitive dependencies into a new relation
A relation schema R is in third normal form (3NF) if for all α → β in F+, at least one of
the following holds:
α → β is trivial (i.e., β ⊆ α)
α is a superkey for R
Each attribute A in β − α is contained in a candidate key of R
BCNF refers to decompositions involving Relations with more than one candidate key,
where the candidate keys are composite and overlapping
A relation is in BCNF if and only if every determinant is a candidate key.
A determinant is any attribute whose value determines other values within a row.
If a table contains only one candidate key, the 3NF and the BCNF are equivalent. BCNF
is a special case of 3NF.
Example 2:
In UNF
T(Patient_Num,F_name,L_name,Ward_num,ward_Name,Prescription_date,Drug_Code,
Drug_Name,Dosage,length_Of_Treatment)
In 1NF
T1 (Patient_Num,F_name,L_name,Ward_num,ward_Name)
T2 (Patient_Num, Prescription_date, Drug_Code, Drug_Name, Dosage,
length_Of_Treatment)
In 2NF
T1 (Patient_Num,F_name,L_name,Ward_num,ward_Name)
T2 (Patient_Num, Prescription_date, Drug_Code, Dosage, length_Of_Treatment)
T3 ( Drug_Code, Drug_Name)
In 3NF
T1 (Patient_Num,F_name,L_name,Ward_num)
T2 (Patient_Num, Prescription_date, Drug_Code, Dosage, length_Of_Treatment)
T3 ( Drug_Code, Drug_Name)
T4 (Ward_num, ward_Name)
Example2: BCNF
• Let us assume the following reality
For each subject, each student is taught by one Instructor
Each Instructor teaches only one subject
Each Subject is taught by several Instructors
This relation is in 3NF but NOT in BCNF, so we should decompose it to meet the BCNF
property: Learning (Student, Instructor), Teaching (Instructor, Subject)
Decomposition of R = (A, B, C) into R1 = (A, B) and R2 = (B, C)

R:          ∏A,B(R):    ∏B,C(R):
A  B  C     A  B        B  C
1  a  x     1  a        a  x
2  b  y     2  b        b  y
3  c  z     3  c        c  z

∏A,B,C(R1 ⋈ R2):
A  B  C
1  a  x
2  b  y
3  c  z
For the case of R = (R1, R2), we require that for all possible relations r on schema R
r = ∏ R1 (r) ⋈ ∏ R2 (r)
A decomposition of R into R1 and R2 is lossless join if at least one of the following
dependencies is in F+:
R1 ∩ R2 → R1
R1 ∩ R2 → R2
The above functional dependencies are a sufficient condition for lossless join
decomposition; the dependencies are a necessary condition only if all constraints are
functional dependencies.
If we have first two tuples (1 & 2), then the last two tuples (3 & 4) must also be in
the relation. Name-phone and Name-soda relations are independent.
Representation of X ↠ Y: partition the attributes into X, Y, and the others. For any two
tuples that are equal on X, exchanging their Y components produces tuples that must also
be in the relation.
5.1 Introduction
Query Processing – activities involved in retrieving data from the database:
SQL query translation into low-level language implementing relational algebra.
Query execution
Query Optimization – selection of an efficient query execution plan
Phases of Query Processing
We can choose a strategy based on reliable information, database systems may store
statistics for each relation r. These statistics includes
The number of tuples in a relation
The size of a tuple in a relation
The number of distinct values that appear in the relation r for a particular
attribute.
Cost is generally measured as total elapsed time for answering query
Many factors contribute to time cost
disk accesses, CPU, or even network communication
Typically disk access is the predominant cost, and is also relatively easy to estimate.
Measured by taking into account
Number of seeks * average-seek-cost
Number of blocks read * average-block-read-cost
Number of blocks written * average-block-write-cost
Cost to write a block is greater than cost to read a block
data is read back after being written to ensure that the write was successful
For simplicity we just use the number of block transfers from disk and the number of
seeks as the cost measures
tT – time to transfer one block
tS – time for one seek
Cost for b block transfers plus S seeks
b * tT + S * tS
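The cost measure b * tT + S * tS can be packaged as a small helper; the timing values below are invented sample numbers, not measurements.

```python
# Disk-cost estimate from the formula above: b block transfers plus S seeks.
# tT and tS are illustrative sample timings in milliseconds (invented).
def query_cost(b, S, tT=0.5, tS=4.0):
    return b * tT + S * tS

# A linear scan of 100 blocks with one initial seek:
print(query_cost(b=100, S=1))  # 54.0
```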
5.3 Query Operation
5.3.1 Selection Operation
File scan – search algorithms that locate and retrieve records that fulfill a
selection condition.
Algorithm A1 (linear search). Scan each file block and test all records to see
whether they satisfy the selection condition.
Cost estimate = br block transfers + 1 seek
♦ br denotes number of blocks containing records from relation r
If selection is on a key attribute, can stop on finding record
♦ cost = (br /2) block transfers + 1 seek
Linear search can be applied regardless of
♦ selection condition or
♦ ordering of records in the file, or
♦ availability of indices
Nested-Loop Join
Hash-Join
Only the last in a sequence of projection operations is needed; the others can be
omitted:
∏ t1 (∏ t2 (… (∏ tn (E)) …)) = ∏ t1 (E)
Storage Hierarchy
Storage Type
Primary storage:
Fastest media but volatile (cache, main memory).
Secondary storage:
next level in hierarchy, non-volatile, moderately fast access time
also called on-line storage
E.g. flash memory, magnetic disks
Tertiary storage:
lowest level in hierarchy, non-volatile, slow access time
also called off-line storage
E.g. magnetic tape, optical storage
Main memory:
Fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds)
Generally too small (or too expensive) to store the entire database
Capacities of up to a few Gigabytes widely used currently
Capacities have gone up and per-byte costs have decreased steadily and rapidly (roughly
factor of 2 every 2 to 3 years)
Volatile — contents of main memory are usually lost if a power failure or system crash
occurs.
Flash memory:
Data survives power failure
Data can be written at a location only once, but location can be erased and written to
again
Can support only a limited number (10K – 1M) of write/erase cycles.
Erasing of memory has to be done to an entire bank of memory
Reads are roughly as fast as main memory
But writes are slow (few microseconds), erase is slower
Cost per unit of storage roughly similar to main memory
Widely used in embedded devices such as digital cameras
Is a type of EEPROM (Electrically Erasable Programmable Read-Only Memory)
Magnetic-disk
Data is stored on spinning disk, and read/written magnetically
Primary medium for the long-term storage of data; typically stores entire database.
Data must be moved from disk to main memory for access, and written back for storage
Much slower access than main memory
direct-access – possible to read data on disk in any order, unlike magnetic tape
Capacities range up to roughly 1000 GB currently
Much larger capacity and cost/byte than main memory/flash memory
Growing constantly and rapidly with technology improvements (factor of 2 to 3
every 2 years)
Survives power failures and system crashes
disk failure can destroy data, but is rare
Read-write head
Positioned very close to the platter surface (almost touching it)
Reads or writes magnetically encoded information.
Optical storage
non-volatile, data is read optically from a spinning disk using a laser
CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
Write-one, read-many (WORM) optical disks used for archival storage (CD-R, DVD-R,
DVD+R)
Multiple write versions also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM)
Reads and writes are slower than with magnetic disk
Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism
for automatic loading/unloading of disks available for storing large volumes of data
non-volatile, used primarily for backup (to recover from disk failure), and for archival
data
sequential-access – much slower than disk
very high capacity (40 to 300 GB tapes available)
tape can be removed from drive ⇒ storage costs much cheaper than disk, but drives are
expensive
Tape jukeboxes available for storing massive amounts of data
hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1 petabyte =
10^15 bytes)
Two problems:
Difficult to delete a record from the structure
♦ Deleted space must be filled by another record or
♦ Marking deleted records so that they can be ignored
Unless the block size is a multiple of the record size, some records will cross
block boundaries, so reading such a record requires two block accesses
Approaches for deletion :
Move records i + 1, . . ., n to i, . . . , n – 1 (Requires more access to move the
record )
Don’t move records, but link all free records on a free list
Store the address of the first deleted record in the file header.
Use this first record to store the address of the second deleted record, and so on
Can think of these stored addresses as pointers since they “point” to the location of a
record.
More space efficient representation: reuse space for normal attributes of free records to
store pointers. (No pointers stored in in-use records.)
Variable-Length Records
Byte-String Representation
For Insertion follow the following rule: –locate the position where the record is to be
inserted
If there is free space insert there
If no free space, insert the record in an overflow block
In either case, pointer chain must be updated
For Deletion – use pointer chains
Need to reorganize the file from time to time to restore sequential order
Dense Index
Sparse Index
♦ Sparse Index: contains index records for only some search-key values.
♦ Applicable when records are sequentially ordered on search-key
♦ To locate a record with search-key value K we:
• Find index record with largest search-key value < K
• Search file sequentially starting at the record to which the index record
points
Sparse Index
Multilevel Index
Secondary Index
A bucket is a unit of storage containing one or more records (a bucket is typically a disk
block).
In a hash file organization we obtain the bucket of a record directly from its search-key
value using a hash function.
Hash function h is a function from the set of all search-key values K to the set of all
bucket addresses B.
Hash function is used to locate records for access, insertion as well as deletion.
Records with different search-key values may be mapped to the same bucket; thus entire
bucket has to be searched sequentially to locate a record.
Hash file organization of account file, using branch_name as key
There are 10 buckets,
The binary representation of the ith character is assumed to be the integer i.
The hash function returns the sum of the binary representations of the characters
modulo 10
♦ E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
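The hash function described above, summing the alphabet positions of the key's letters modulo the number of buckets, reproduces the stated values; `h` below is a hypothetical helper that ignores spaces and case.

```python
# Treat the i-th letter of the alphabet as the integer i, sum over the
# letters of the key (ignoring spaces and case), and take mod 10.
def h(key, buckets=10):
    return sum(ord(c) - ord("a") + 1
               for c in key.lower() if c.isalpha()) % buckets

print(h("Perryridge"))  # 5
print(h("Round Hill"))  # 3
print(h("Brighton"))    # 3
```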
Extendable hashing
In this hashing scheme the set of keys can be varied, and the address space is
allocated dynamically
Good for database that grows and shrinks in size
Allows the hash function to be modified dynamically
Extendable hashing – one form of dynamic hashing
Keys stored in buckets.
Each bucket can only hold a fixed size of items.
Index is an extendible table; h(x) hashes a key value x to a bit map; only a
portion of a bit map is used to build a directory.
Once the bucket is full – split the bucket into two
Two situations are possible:
♦ The directory remains the same size: adjust the pointer to a bucket
♦ The size grows from 2^k to 2^(k+1), i.e. directory size can be 1, 2, 4, 8, 16, etc.
• Number of buckets will remain the same, i.e. some references
will point to the same bucket.
Finally, one can use bitmap to build the index but store an actual key in the
bucket!
Assume that a hashing technique is applied to a dynamically changing file
composed of buckets, and each bucket can hold only a fixed number of items.
Extendible hashing accesses the data stored in buckets indirectly through an index
that is dynamically adjusted to reflect changes in the file.
The characteristic feature of extendible hashing is the organization of the index,
which is an expandable table.
A hash function applied to a certain key indicates a position in the index and not
in the file (or table or keys). Values returned by such a hash function are called
pseudo keys.
6.4 B+ Tree
Example of B+ Tree
A transaction is a unit of program execution that accesses and possibly updates various
data items.
A transaction is an action, or a series of actions, carried out by a single user or an
application program, which reads or updates the contents of a database.
A transaction must see a consistent database.
During transaction execution the database may be temporarily inconsistent. When the
transaction completes successfully (is committed), the database must be consistent. After
a transaction commits, the changes it has made to the database persist, even if there are
system failures.
Multiple transactions can execute in parallel.
Two main issues to deal with:
Failures of various kinds, such as hardware failures and system crashes
Concurrent execution of multiple transactions
A transaction is a unit of program execution that accesses and possibly updates various
data items. To preserve the integrity of data the database system must ensure:
Atomicity
♦ Either all operations of the transaction are properly reflected in the
database or none are.
♦ Transaction is indivisible – it completes entirely or not at all, despite
failures.
Consistency. Execution of a transaction in isolation preserves the consistency of
the database. Consistency transfers the database from one consistent state to
another consistent state.
Isolation:
♦ Although multiple transactions may execute concurrently, each transaction
must be unaware of other concurrently executing transactions.
Intermediate transaction results must be hidden from other concurrently
executed transactions. That is, for every pair of transactions Ti and Tj, it
appears to Ti that either Tj, finished execution before Ti started, or Tj started
execution after Ti finished.
♦ The effects of a transaction are not visible to other transactions until it has
completed
Durability. After a transaction completes successfully, the changes it has made
to the database persist, even if there are system failures.
Atomicity requirement — if the transaction fails after step 3 and before step 6, the
system should ensure that its updates are not reflected in the database, else an
inconsistency will result.
Consistency requirement – the sum of A and B is unchanged by the execution of the
transaction.
Isolation requirement — if between steps 3 and 6, another transaction is allowed to
access the partially updated database, it will see an inconsistent database (the sum A + B
will be less than it should be).
Isolation can be ensured trivially by running transactions serially, that is one after
the other.
However, executing multiple transactions concurrently has significant benefits, as
we will see later.
Durability requirement — once the user has been notified that the transaction has
completed (i.e., the transfer of the 50 has taken place), the updates to the database by the
transaction must persist despite failures.
Let T1 and T2 be the transactions defined previously. The following schedule is not a
serial schedule, but it is equivalent to Schedule 1. In Schedules 1, 2 and 3, the sum A
+ B is preserved
7.5 Serializability
A schedule S is serial if, for every transaction T participating in the schedule, all the
operations of T is executed consecutively in the schedule.
No interleaving occurs in serial schedule
A schedule S of n transactions is serializable if it is equivalent to some serial schedule of
the same n transactions.
Two schedules are called result equivalent if they produce the same final state of the
database.
A schedule is serializable if it is equivalent to a serial schedule. Different forms of
schedule equivalence give rise to the notions of:
Conflict serializability
View serializability
Conflict serializability
Two actions Ai and Aj executed on the same data object by Ti and Tj conflicts if
either one of them is a write operation.
Let Ai and Aj are consecutive non-conflicting actions that belong to different
transactions. We can swap Ai and Aj without changing the result.
Two schedules are conflict equivalent if they can be turned one into the other by a
sequence of non-conflicting swaps of adjacent actions.
Schedule 3 Schedule 4
Example
Assume we have these three transactions:
T1: r1(x); w1(x); r1(y); w1(y)
T2: r2 (z); r2(y); w2(y); r2(x); w2(x)
T3: r3(y); r3(z);w3(y);w3(z)
Assume we have these schedules:
S1: r2(z);r2(y);w2(y); r3(y);r3(z); r1(x);w1(x); w3(y);w3(z);r2(x); r1(y);w1(y); w2(x)
No equivalent serial schedule
(Cycle in the precedence graph: T1 → T2 on x, T2 → T1 on y)
(Cycle: T1 → T2 on x, T2 → T3 on y and z, T3 → T1 on y)
Assume we have another schedule for the same transactions:
S2: r3(y);r3(z);r1(x); w1(x);w3(y);w3(z);r2(z); r1(y);w1(y);r2(y); w2(y);r2(x);w2(x)
Equivalent serial schedule
T3 → T1 → T2
T1 T2
Read(A)
Write(A)
Read(A)
Read(B)
If T1 should abort, T2 would have read (and possibly shown to the user) an inconsistent
database state. Hence database must ensure that schedules are recoverable.
Cascading rollback – a single transaction failure leads to a series of transaction
rollbacks. Consider the following schedule where none of the transactions has yet
committed (so the schedule is recoverable)
T1 T2 T3
Read(A)
Read(B)
Write(A)
Read(A)
Write(A)
Read(A)
A transaction may be granted a lock on an item if the requested lock is compatible with
locks already held on the item by other transactions.
If a transaction has a shared lock on a data item, it can read the item but not update it.
If a transaction has a shared lock on a data item, other transactions can obtain a shared
lock on the data item, but not an exclusive lock.
If a transaction has an exclusive lock on a data item, it can both read and update the item.
If a transaction has an exclusive lock on a data item, other transactions can obtain neither
a shared lock nor an exclusive lock on the data item.
Any number of transactions can hold shared locks on an item,
But if any transaction holds an exclusive lock on the item, no other transaction
may hold any lock on the item.
If a lock cannot be granted, the requesting transaction is made to wait till all incompatible
locks held by other transactions have been released. The lock is then granted.
Neither T3 nor T4 can make progress — executing lock-S (B) causes T4 to wait for
T3 to release its lock on B, while executing lock-X (A) causes T3 to wait for T4 to
release its lock on A.
Such a situation is called a deadlock.
♦ To handle a deadlock one of T3 or T4 must be rolled back
and its locks released.
Starvation is also possible if concurrency control manager is badly designed. For
example:
♦ Suppose a transaction T2 has a shared mode lock on a data item, and
another transaction T1 requests an exclusive mode lock on the data item.
Clearly, T1 has to wait for T2 to release the shared mode lock.
♦ Meanwhile, a transaction T3 may request a shared mode lock on same
data item. At this time T2 may release the lock, but still T1 has to wait for
T3 to finish.
♦ But again, there may be a new transaction T4 that request a shared mode
lock on the same data item, and is granted the lock before T3 release it.
♦ It is possible that there is a sequence of transactions that each requests a
shared mode lock on the same data item, and each transaction release the
lock a short while after it is granted, but T1 never gets the exclusive mode
lock on the data item. The transaction T1 may never make progress, and is
said to be starved.
Concurrency control manager can be designed to prevent starvation.
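One starvation-free design is to grant locks in FIFO order: a new request is granted only if it is compatible both with the locks already held and with every request waiting ahead of it. A minimal sketch, with illustrative names (FairLock):

```python
from collections import deque

S, X = "S", "X"

class FairLock:
    """FIFO lock granting for a single data item, preventing starvation."""

    def __init__(self):
        self.held = []          # list of (txn, mode) currently granted
        self.waiting = deque()  # FIFO queue of (txn, mode) requests

    def _conflicts(self, mode, others):
        # S conflicts only with X; X conflicts with everything
        return any(mode == X or m == X for _, m in others)

    def request(self, txn, mode):
        # Granting past an earlier incompatible waiter would starve it,
        # so the request must clear the wait queue as well as the held locks.
        if (not self._conflicts(mode, self.held)
                and not self._conflicts(mode, self.waiting)):
            self.held.append((txn, mode))
            return "granted"
        self.waiting.append((txn, mode))
        return "waiting"
```

Replaying the scenario above: T2's shared lock is granted, T1's exclusive request waits, and T3's later shared request is made to wait behind T1 instead of jumping the queue.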
T1              T2              T3
Lock-X(A)
Read(A)
Lock-S(B)
Read(B)
Write(A)
Unlock(A)
                Lock-X(A)
                Read(A)
                Write(A)
                Unlock(A)
                                Lock-S(A)
                                Read(A)
Lock conversion provides a mechanism for upgrading a shared lock to an exclusive lock,
and downgrading an exclusive lock to a shared lock.
Upgrading can take place only in the growing phase, whereas downgrading can take
place only in the shrinking phase.
Two-phase locking with lock conversions:
First Phase:
♦ can acquire a lock-S on item
♦ can acquire a lock-X on item
♦ can convert a lock-S to a lock-X (upgrade)
Second Phase:
♦ can release a lock-S
♦ can release a lock-X
♦ can convert a lock-X to a lock-S (downgrade)
This protocol assures serializability, but it still relies on the programmer to insert the
various locking instructions.
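The two phases and the conversion rules can be sketched as a per-transaction state machine; TwoPhaseTxn and its methods are illustrative names, and a real system would enforce these rules inside the lock manager:

```python
class TwoPhaseTxn:
    """Tracks a transaction's 2PL phase and enforces the conversion rules."""
    GROWING, SHRINKING = "growing", "shrinking"

    def __init__(self):
        self.phase = self.GROWING
        self.locks = {}  # item -> "S" or "X"

    def acquire(self, item, mode):
        # Locks may be acquired only before the first release
        if self.phase != self.GROWING:
            raise RuntimeError("cannot acquire after first release (2PL)")
        self.locks[item] = mode

    def upgrade(self, item):
        # S -> X conversion is allowed only in the growing phase
        if self.phase != self.GROWING:
            raise RuntimeError("upgrade allowed only in growing phase")
        self.locks[item] = "X"

    def release(self, item):
        # The first release moves the transaction into its shrinking phase
        self.phase = self.SHRINKING
        del self.locks[item]

    def downgrade(self, item):
        # X -> S conversion is allowed only in the shrinking phase
        if self.phase != self.SHRINKING:
            raise RuntimeError("downgrade allowed only in shrinking phase")
        self.locks[item] = "S"
```

Once a transaction releases any lock, further acquisitions raise an error, which is exactly the 2PL restriction.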
T1                      T2
Lock-X on X
Write(X)
                        Lock-X on Y
                        Write(Y)
                        wait for Lock-X on X
wait for Lock-X on Y
System is deadlocked if there is a set of transactions such that every transaction in the set
is waiting for another transaction in the set.
Deadlock prevention protocols ensure that the system will never enter a deadlock
state. Some prevention strategies:
Require that each transaction locks all its data items before it begins execution
(pre declaration).
Impose a partial ordering of all data items and require that a transaction lock
data items only in the order specified by the partial order (graph-based protocol).
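The ordering-based strategy can be sketched as follows, assuming a total order over the items; OrderedLocker is an illustrative name. Because every transaction acquires locks in increasing order, no cycle of waits can form:

```python
class OrderedLocker:
    """Enforces lock acquisition in a fixed order over the data items."""

    def __init__(self, order):
        # total order over the data items, e.g. ["A", "B", "C"]
        self.rank = {item: i for i, item in enumerate(order)}
        self.max_acquired = -1
        self.held = []

    def lock(self, item):
        r = self.rank[item]
        if r <= self.max_acquired:
            # acquiring out of order is what could close a wait cycle
            raise RuntimeError("out-of-order lock request on " + item)
        self.max_acquired = r
        self.held.append(item)
        return True
```

A transaction that has locked C can no longer lock B, so it must plan its accesses in order up front.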
The deferred database modification scheme records all modifications to the log, but
defers all the writes until after partial commit.
If the system crashes before the transaction completes its execution, or if the transaction
aborts, then the information on the log is simply ignored.
Assume that transactions execute serially
Transaction starts by writing <Ti start> record to log.
A write(X) operation results in a log record <Ti, X, V> being written, where V is the
new value for X
Note: old value is not needed for this scheme
The write is not performed on X at this time, but is deferred.
When Ti partially commits, <Ti commit> is written to the log.
Finally, the log records are read and used to actually execute the previously deferred
writes.
Using the log, the system can handle any failure that results in the loss of information.
The recovery scheme uses the following recovery procedure:
Redo(Ti) sets the value of all data items updated by transaction Ti to the new
values. The new values can be found in the log.
The redo() operation must be idempotent:
Executing it several times must be equivalent to executing it once
This is required if we are to generate correct behavior even if a failure
occurs during the recovery process.
During recovery after a crash, a transaction needs to be redone if and only if both
<Ti start> and <Ti commit> are in the log.
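The deferred-modification redo rule can be sketched as a small recovery routine, assuming log records are simple tuples; this is an illustrative format, not a real log layout. Note that the routine is idempotent, as the scheme requires:

```python
def recover(log, db):
    """Redo committed transactions from a deferred-modification log.

    log is a list of tuples:
      ("start", T), ("write", T, X, V), ("commit", T)
    where V is the new value for item X (no old value is needed)."""
    # A transaction is redone iff both its start and commit records exist;
    # collecting commits is enough, since a commit implies an earlier start.
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, txn, item, value = rec
            db[item] = value  # physical redo: safe to repeat after a crash
    return db
```

In the example below, T1 committed so its write to A is redone, while T2's uncommitted write to B is simply ignored.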
Support for high-concurrency locking techniques, such as those used for B+-tree
concurrency control, which release locks early
Supports “logical undo”
Recovery based on “repeating history”, whereby recovery executes exactly the same
actions as normal processing
including redo of log records of incomplete transactions, followed by subsequent
undo
Key benefits
♦ supports logical undo
♦ easier to understand/show correctness
Physical Redo
Redo information is logged physically (that is, new value for each write) even for
operations with logical undo
Logical redo is very complicated since database state on disk may not be
“operation consistent” when recovery starts
Physical redo logging does not conflict with early lock release
Transaction Rollback
Crash Recovery
The following actions are taken when recovering from a system crash:
1. (Redo phase): Scan the log forward from the last <checkpoint L> record till the end of the log
Repeat history by physically redoing all updates of all transactions,
Create an undo-list during the scan as follows
• undo-list is set to L initially
• Whenever <Ti start> is found Ti is added to undo-list
• Whenever <Ti commit> or <Ti abort> is found, Ti is deleted from
undo-list
• This brings the database to its state as of the crash, with committed as
well as uncommitted transactions having been redone. Now undo-list
contains transactions that are incomplete, that is, have neither
committed nor been fully rolled back.
2. (Undo phase): Scan log backwards, performing undo on log records of
transactions found in undo-list.
Log records of transactions being rolled back are processed as described
earlier, as they are found
• Single shared scan for all transactions being undone
When <Ti start> is found for a transaction Ti in undo-list, write a <Ti
abort> log record.
Stop scan when <Ti start> records have been found for all Ti in undo-list
This undoes the effects of incomplete transactions (those with neither commit nor abort
log records). Recovery is now complete.
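The two-pass procedure above can be sketched as follows. This is a simplification: checkpoint handling is omitted, each write record is assumed to carry the old value so the undo pass can restore it physically, and no <Ti abort> or compensation records are written:

```python
def crash_recover(log, db):
    """Two-pass crash recovery: forward redo, then backward undo.

    log records are tuples:
      ("start", T), ("write", T, X, old, new), ("commit", T), ("abort", T)."""
    # 1. Redo phase: scan forward, repeat history, build undo-list
    undo_list = set()
    for rec in log:
        kind = rec[0]
        if kind == "start":
            undo_list.add(rec[1])
        elif kind in ("commit", "abort"):
            undo_list.discard(rec[1])
        elif kind == "write":
            _, txn, item, old, new = rec
            db[item] = new  # physically redo every update, committed or not
    # 2. Undo phase: scan backward, undoing writes of incomplete transactions
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] in undo_list:
            _, txn, item, old, new = rec
            db[item] = old  # restore the old value
    return db, undo_list
```

After the redo pass the database reflects the state as of the crash; the undo pass then rolls back exactly the transactions left on undo-list.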
A distributed database system consists of loosely coupled sites that share no physical
component
Database systems that run on each site are independent of each other
Transactions may access data at one or more sites
Data spread over multiple machines (also referred to as sites or nodes).
Network interconnects the machines
Data shared by users on multiple machines
The main difference between a centralized and a distributed system is that in the former
the data reside in one single location, whereas in the latter the data reside in several locations.
There are mainly two types:
Heterogeneous Database
Homogeneous Database
Data Replication
Data Fragmentation
Division of relation r into fragments r1, r2, …, rn which contain sufficient information to
reconstruct relation r.
Two different schemes for fragmentation
Horizontal fragmentation:
♦ each tuple of r is assigned to one or more fragments
Vertical fragmentation:
♦ the schema for relation r is split into several smaller schemas
All schemas must contain a common candidate key (or super key) to ensure lossless join
property.
A special attribute, the tuple-id attribute may be added to each schema to serve as a
candidate key.
Example : Relation account with following schema
Account = (branch_name, account_number, balance )
Horizontal Fragmentation of Account Relation
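Horizontal fragmentation of the account relation by branch_name can be sketched as a selection per branch, with reconstruction by union; the branch names below are illustrative:

```python
def horizontal_fragment(account, branch):
    """A horizontal fragment is a selection: the tuples of one branch."""
    return [t for t in account if t["branch_name"] == branch]

def reconstruct(fragments):
    """The original relation is recovered as the union of its fragments."""
    return [t for frag in fragments for t in frag]

# Illustrative instance of Account = (branch_name, account_number, balance)
account = [
    {"branch_name": "Hillside",   "account_number": "A-305", "balance": 500},
    {"branch_name": "Valleyview", "account_number": "A-177", "balance": 205},
    {"branch_name": "Hillside",   "account_number": "A-226", "balance": 336},
]
```

Each fragment can then be stored at the site of its branch, placing tuples where they are most frequently accessed.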
Advantages of Fragmentation
Horizontal:
♦ allows parallel processing on fragments of a relation
♦ allows a relation to be split so that tuples are located where they are most
frequently accessed
Sharing data – users at one site are able to access data residing at other sites.
Autonomy – each site is able to retain a degree of control over data stored locally.
Higher system availability through redundancy — data can be replicated at remote sites,
and the system can function even if a site fails.
Disadvantage: added complexity required to ensure proper coordination among sites.
Software development cost.
Greater potential for bugs.
Increased processing overhead.
Parallel database systems consist of multiple processors and multiple disks connected by
a fast interconnection network.
A coarse-grain parallel machine consists of a small number of powerful processors
A massively parallel or fine grain parallel machine utilizes thousands of smaller
processors.
Two main performance measures:
throughput --- the number of tasks that can be completed in a given time interval
response time --- the amount of time it takes to complete a single task from the
time it is submitted
Shared Nothing
Node consists of a processor, memory, and one or more disks. Processors at one node
communicate with another processor at another node using an interconnection network. A
node functions as the server for the data on the disk or disks the node owns.
Examples: Teradata, Tandem, Oracle on nCUBE
Data accessed from local disks (and local memory accesses) do not pass through
interconnection network, thereby minimizing the interference of resource sharing.
Shared-nothing multiprocessors can be scaled up to thousands of processors without
interference.
Main drawback: cost of communication and non-local disk access; sending data involves
software interaction at both ends.
Hierarchical
Combines characteristics of shared-memory, shared-disk, and shared-nothing
architectures.
Top level is a shared-nothing architecture – nodes connected by an interconnection
network, and do not share disks or memory with each other.
Each node of the system could be a shared-memory system with a few processors.
Alternatively, each node could be a shared-disk system, and each of the systems sharing a
set of disks could be a shared-memory system.
The complexity of programming such systems can be reduced by distributed
virtual-memory architectures
Also called non-uniform memory architecture (NUMA)
A single, complete, and consistent store of data, obtained from a variety of different
sources and made available to end users in a form they can understand and use in a
business context, is called a data warehouse.
A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of
data in support of management's decision-making process.
Subject-oriented
Data is arranged and optimized to provide answers to questions from diverse
functional areas
Data is organized and summarized by topic
♦ Sales / Marketing / Finance / Distribution / Etc.
It focuses on modeling and analysis of data for decision makers.
Excludes data not useful in decision support process.
Integrated
A data warehouse is constructed by integrating multiple heterogeneous sources.
Data preprocessing is applied to ensure consistency.
The data warehouse is a centralized, consolidated database that integrates data
derived from the entire organization
♦ Multiple Sources
♦ Diverse Sources
♦ Diverse Formats
Time-variant
The Data Warehouse represents the flow of data through time
Can contain projected data from statistical models
Data is periodically uploaded, and then time-dependent data is recomputed
Provides information from historical perspective e.g. past 5-10 years
Every key structure contains either implicitly or explicitly an element of time
Nonvolatile
Once data is entered it is NEVER removed
Represents the company’s entire history
♦ Near term history is continually added to it
♦ Always growing
♦ Must support terabyte databases and multiprocessors
Read-Only database for data analysis and query processing
Data warehouse requires two operations in data accessing
♦ Initial loading of data
♦ Access of data
Data Preparation
Identify the main data sets to be used by the data mining operation (usually the
data warehouse)
Data Analysis and Classification
Study the data to identify common data characteristics or patterns
♦ Data groupings, classifications, clusters, sequences
♦ Data dependencies, links, or relationships
♦ Data patterns, trends, deviation
Knowledge Acquisition
Uses the Results of the Data Analysis and Classification phase
Data mining tool selects the appropriate modeling or knowledge-acquisition
algorithms
♦ Neural Networks
♦ Decision Trees
♦ Rules Induction
♦ Genetic algorithms
♦ Memory-Based Reasoning
Industry has a huge amount of operational data. Knowledge workers want to turn this
data into useful information, which they use to support strategic decision
making.
It is a platform for consolidated historical data for analysis.
It stores data of good quality so that knowledge workers can make correct decisions.
From a business perspective:
It is the latest marketing weapon
Helps to keep customers by learning more about their needs
A valuable tool in today's competitive, fast-evolving world
OLAP (Online Analytical Processing) is a term used to describe the analysis of complex
data from the data warehouse.
DSS (Decision Support Systems) also known as EIS (Executive Information Systems)
supports organization’s leading decision makers for making complex and important
decisions.
Data Mining is used for knowledge discovery, the process of searching data for
unanticipated new knowledge.
Vector model
Modeling
Assume 2-D and GIS application, two basic things need to be represented:
Objects in space: cities, forests, or rivers
♦ single objects
Coverage/Field: say something about every point in space (e.g., partitions,
thematic maps)
♦ spatially related collections of objects
Spatial primitives for objects
Point: object represented only by its location in space, e.g. center of a state
Line (actually a curve or polyline): representation of moving through or
connections in space, e.g. a road or river
Region: representation of an extent in 2-D space, e.g. a lake or city
Coverages
Partition: a set of region objects that are required to be disjoint (adjacent region
objects have common boundaries), e.g. thematic maps
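The spatial primitives above can be sketched as simple 2-D types; the names are illustrative, not a real GIS API:

```python
from dataclasses import dataclass

@dataclass
class Point:                 # location only, e.g. the center of a state
    x: float
    y: float

@dataclass
class Line:                  # a polyline, e.g. a road or river
    vertices: list           # list of Point

    def length(self):
        """Total length along the polyline's segments."""
        return sum(((b.x - a.x) ** 2 + (b.y - a.y) ** 2) ** 0.5
                   for a, b in zip(self.vertices, self.vertices[1:]))

@dataclass
class Region:                # an extent in 2-D space, e.g. a lake or city
    boundary: list           # closed ring of Point forming the boundary
```

A coverage such as a partition would then be a collection of Region objects constrained to be disjoint.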