
DATABASE SYSTEMS

Concepts, Design and Applications

Second Edition

S. K. SINGH
Head
Maintenance Engineering Department (Electrical)
Tata Steel Limited
Jamshedpur

Delhi • Chennai • Chandigarh


Contents

Foreword

Preface to the Second Edition

Preface

About the Author

PART I: DATABASE CONCEPTS

Chapter 1 Introduction to Database Systems

1.1 Introduction

1.2 Basic Concepts and Definitions

1.2.1 Data

1.2.2 Information

1.2.3 Data Versus Information

1.2.4 Data Warehouse

1.2.5 Metadata

1.2.6 System Catalog

1.2.7 Data Item or Fields

1.2.8 Records

1.2.9 Files

1.3 Data Dictionary

1.3.1 Components of Data Dictionaries


1.3.2 Active and Passive Data Dictionaries

1.4 Database

1.5 Database System

1.5.1 Operations Performed on Database Systems

1.6 Data Administrator (DA)

1.7 Database Administrator (DBA)

1.7.1 Functions and Responsibilities of DBAs

1.8 File-oriented System versus Database System

1.8.1 Advantages of File-oriented System

1.8.2 Disadvantages of File-oriented System

1.8.3 Database Approach

1.8.4 Database System Environment

1.8.5 Advantages of DBMS

1.8.6 Disadvantages of DBMS

1.9 Historical Perspective of Database Systems

1.10 Database Language

1.10.1 Data Definition Language (DDL)

1.10.2 Data Storage Definition Language (DSDL)

1.10.3 View Definition Language (VDL)

1.10.4 Data Manipulation Language (DML)

1.10.5 Fourth-generation Language (4GL)

1.11 Transaction Management

Review Questions

Chapter 2 Database System Architecture

2.1 Introduction
2.2 Schemas, Sub-schemas, and Instances

2.2.1 Schema

2.2.2 Sub-schema

2.2.3 Instances

2.3 Three-level ANSI-SPARC Database Architecture

2.3.1 Internal Level

2.3.2 Conceptual Level

2.3.3 External Level

2.3.4 Advantages of Three-tier Architecture

2.3.5 Characteristics of Three-tier Architecture

2.4 Data Independence

2.4.1 Physical Data Independence

2.4.2 Logical Data Independence

2.5 Mappings

2.5.1 Conceptual/Internal Mapping

2.5.2 External/Conceptual Mapping

2.6 Structure, Components, and Functions of DBMS

2.6.1 Structure of a DBMS

2.6.2 Execution Steps of a DBMS

2.6.3 Components of a DBMS

2.6.4 Functions and Services of DBMS

2.7 Data Models

2.7.1 Record-based Data Models

2.7.2 Object-based Data Models

2.7.3 Physical Data Models


2.7.4 Hierarchical Data Model

2.7.5 Network Data Model

2.7.6 Relational Data Model

2.7.7 Entity-Relationship (E-R) Data Model

2.7.8 Object-oriented Data Model

2.7.9 Comparison between Data Models

2.8 Types of Database Systems

2.8.1 Centralized Database System

2.8.2 Parallel Database System

2.8.3 Client/Server Database System

2.8.4 Distributed Database System

Review Questions

Chapter 3 Physical Data Organisation

3.1 Introduction

3.2 Physical Storage Media

3.2.1 Primary Storage Device

3.2.2 Secondary Storage Device

3.2.3 Tertiary Storage Device

3.2.4 Cache Memory

3.2.5 Main Memory

3.2.6 Flash Memory

3.2.7 Magnetic Disk Storage

3.2.8 Optical Storage

3.2.9 Magnetic Tape Storage

3.3 RAID Technology


3.3.1 Performance Improvement Using Data Striping (or Parallelism)

3.3.2 Advantages of RAID Technology

3.3.3 Disadvantages of RAID Technology

3.3.4 Reliability Improvement Using Redundancy

3.3.5 RAID Levels

3.3.6 Choice of RAID Levels

3.4 Basic Concept of Files

3.4.1 File Types

3.4.2 Buffer Management

3.5 File Organisation

3.5.1 Records and Record Types

3.5.2 File Organisation Techniques

3.6 Indexing

3.6.1 Primary Index

3.6.2 Secondary Index

3.6.3 Tree-based Indexing

Review Questions

PART II: RELATIONAL MODEL

Chapter 4 The Relational Algebra and Calculus

4.1 Introduction

4.2 Historical Perspective of Relational Model

4.3 Structure of Relational Database

4.3.1 Domain

4.3.2 Keys of Relations

4.4 Relational Algebra


4.4.1 Selection Operation

4.4.2 Projection Operation

4.4.3 Joining Operation

4.4.4 Outer Join Operation

4.4.5 Union Operation

4.4.6 Difference Operation

4.4.7 Intersection Operation

4.4.8 Cartesian Product Operation

4.4.9 Division Operation

4.4.10 Examples of Queries in Relational Algebra using Symbols

4.5 Relational Calculus

4.5.1 Tuple Relational Calculus

4.5.2 Domain Relational Calculus

Review Questions

Chapter 5 Relational Query Languages

5.1 Introduction

5.2 Codd’s Rules

5.3 Information System Based Language (ISBL)

5.3.1 Query Examples for ISBL

5.3.2 Limitations of ISBL

5.4 Query Language (QUEL)

5.4.1 Query Examples for QUEL

5.4.2 Advantages of QUEL

5.5 Structured Query Language (SQL)

5.5.1 Advantages of SQL


5.5.2 Disadvantages of SQL

5.5.3 Basic SQL Data Structure

5.5.4 SQL Data Types

5.5.5 SQL Operators

5.5.6 SQL Data Definition Language (DDL)

5.5.7 SQL Data Query Language (DQL)

5.5.8 SQL Data Manipulation Language (DML)

5.5.9 SQL Data Control Language (DCL)

5.5.10 SQL Data Administration Statements (DAS)

5.5.11 SQL Transaction Control Statements (TCS)

5.6 Embedded Structured Query Language (SQL)

5.6.1 Advantages of Embedded SQL

5.7 Query-By-Example (QBE)

5.7.1 QBE Queries on One Relation (Single Table Retrievals)

5.7.2 QBE Queries on Several Relations (Multiple Table Retrievals)

5.7.3 QBE for Database Modification (Update, Delete & Insert)

5.7.4 QBE Queries on Microsoft Access (MS-ACCESS)

5.7.5 Advantages of QBE

5.7.6 Disadvantage of QBE

Review Questions

Chapter 6 Entity-Relationship (ER) Model

6.1 Introduction

6.2 Basic E-R Concepts

6.2.1 Entities

6.2.2 Relationship
6.2.3 Attributes

6.2.4 Constraints

6.3 Conversion of E-R Model into Relations

6.3.1 Conversion of E-R Model into SQL Constructs

6.4 Problems with E-R Models

6.5 E-R Diagram Symbols

Review Questions

Chapter 7 Enhanced Entity-Relationship (EER) Model

7.1 Introduction

7.2 Subclasses, Subclass Entity Types and Super-classes

7.2.1 Notation for Superclasses and Subclasses

7.2.2 Attribute Inheritance

7.2.3 Conditions for Using Supertype/Subtype Relationships

7.2.4 Advantages of Using Superclasses and Subclasses

7.3 Specialisation and Generalisation

7.3.1 Specialisation

7.3.2 Generalisation

7.3.3 Specifying Constraints on Specialisation and Generalisation

7.4 Categorisation

7.5 Example of EER Diagram

Review Questions

PART III: DATABASE DESIGN

Chapter 8 Introduction to Database Design

8.1 Introduction

8.2 Software Development Life Cycle (SDLC)


8.2.1 Software Development Cost

8.2.2 Structured System Analysis and Design (SSAD)

8.3 Database Development Life Cycle (DDLC)

8.3.1 Database Design

8.4 Automated Design Tools

8.4.1 Limitations of Manual Database Design

8.4.2 Computer-aided Software Engineering (CASE) Tools

Review Questions

Chapter 9 Functional Dependency and Decomposition

9.1 Introduction

9.2 Functional Dependency

9.2.1 Functional Dependency Diagram and Examples

9.2.2 Full Functional Dependency (FFD)

9.2.3 Armstrong’s Axioms for Functional Dependencies

9.2.4 Redundant Functional Dependencies

9.2.5 Closures of a Set of Functional Dependencies

9.3 Decomposition

9.3.1 Lossy Decomposition

9.3.2 Lossless-Join Decomposition

9.3.3 Dependency-Preserving Decomposition

Review Questions

Chapter 10 Normalization

10.1 Introduction

10.2 Normalization

10.3 Normal Forms


10.3.1 First Normal Form (1NF)

10.3.2 Second Normal Form (2NF)

10.3.3 Third Normal Form (3NF)

10.4 Boyce-Codd Normal Forms (BCNF)

10.5 Multi-valued Dependencies and Fourth Normal Forms (4NF)

10.5.1 Properties of MVDs

10.5.2 Fourth Normal Form (4NF)

10.5.3 Problems with MVDs and 4NF

10.6 Join Dependencies and Fifth Normal Forms (5NF)

10.6.1 Join dependency (JD)

10.6.2 Fifth Normal Form (5NF)

Review Questions

PART IV: QUERY, TRANSACTION AND SECURITY MANAGEMENT

Chapter 11 Query Processing and Optimization

11.1 Introduction

11.2 Query Processing

11.3 Syntax Analyzer

11.4 Query Decomposition

11.4.1 Query Analysis

11.4.2 Query Normalization

11.4.3 Semantic Analyzer

11.4.4 Query Simplifier

11.4.5 Query Restructuring

11.5 Query Optimization

11.5.1 Heuristic Query Optimization


11.5.2 Transformation Rules

11.5.3 Heuristic Optimization Algorithm

11.6 Cost Estimation in Query Optimization

11.6.1 Cost Components of Query Execution

11.6.2 Cost Function for SELECT Operation

11.6.3 Cost Function for JOIN Operation

11.7 Pipelining and Materialization

11.8 Structure of Query Evaluation Plans

11.8.1 Query Execution Plan

Review Questions

Chapter 12 Transaction Processing and Concurrency Control

12.1 Introduction

12.2 Transaction Concepts

12.2.1 Transaction Execution and Problems

12.2.2 Transaction Execution with SQL

12.2.3 Transaction Properties

12.2.4 Transaction Log (or Journal)

12.3 Concurrency Control

12.3.1 Problems of Concurrency Control

12.3.2 Schedule

12.3.3 Degree of Consistency

12.3.4 Permutable Actions

12.3.5 Serializable Schedule

12.4 Locking Methods for Concurrency Control

12.4.1 Lock Granularity


12.4.2 Lock Types

12.4.3 Deadlocks

12.5 Timestamp Methods for Concurrency Control

12.5.1 Granule Timestamps

12.5.2 Timestamp Ordering

12.5.3 Conflict Resolution in Timestamps

12.5.4 Drawbacks of Timestamp

12.6 Optimistic Methods for Concurrency Control

12.6.1 Read Phase

12.6.2 Validation Phase

12.6.3 Write Phase

12.6.4 Advantages of Optimistic Methods for Concurrency Control

12.6.5 Problems of Optimistic Methods for Concurrency Control

12.6.6 Applications of Optimistic Methods for Concurrency Control

Review Questions

Chapter 13 Database Recovery System

13.1 Introduction

13.2 Database Recovery Concepts

13.2.1 Database Backup

13.3 Types of Database Failures

13.4 Types of Database Recovery

13.4.1 Forward Recovery (or REDO)

13.4.2 Backward Recovery (or UNDO)

13.4.3 Media Recovery

13.5 Recovery Techniques


13.5.1 Deferred Update

13.5.2 Immediate Update

13.5.3 Shadow Paging

13.5.4 Checkpoints

13.6 Buffer Management

Review Questions

Chapter 14 Database Security

14.1 Introduction

14.2 Goals of Database Security

14.2.1 Threats to Database Security

14.2.2 Types of Database Security Issues

14.2.3 Authorisation and Authentication

14.3 Discretionary Access Control

14.3.1 Granting/Revoking Privileges

14.3.2 Audit Trails

14.4 Mandatory Access Control

14.5 Firewalls

14.6 Statistical Database Security

14.7 Data Encryption

14.7.1 Simple Substitution Method

14.7.2 Polyalphabetic Substitution Method

Review Questions

PART V: OBJECT-BASED DATABASES

Chapter 15 Object-Oriented Databases

15.1 Introduction
15.2 Object-Oriented Data Model (OODM)

15.2.1 Characteristics of Object-Oriented Databases (OODBs)

15.2.2 Comparison of an OODM and E-R Model

15.3 Concept of Object-Oriented Database (OODB)

15.3.1 Objects

15.3.2 Object Identity

15.3.3 Object Attributes

15.3.4 Classes

15.3.5 Relationship or Association among Objects

15.3.6 Structure, Inheritance, and Generalisation

15.3.7 Operation

15.3.8 Polymorphism

15.3.9 Advantages of OO Concept

15.4 Object-Oriented DBMS (OODBMS)

15.4.1 Features of OODBMSs

15.4.2 Advantages of OODBMSs

15.4.3 Disadvantages of OODBMSs

15.5 Object Data Management Group (ODMG) and Object-Oriented Languages

15.5.1 Object Model

15.5.2 Object Definition Language (ODL)

15.5.3 Object Query Language (OQL)

Review Questions

Chapter 16 Object-Relational Database

16.1 Introduction

16.2 History of Object-relational DBMS (ORDBMS)


16.2.1 Weaknesses of RDBMS

16.2.2 Complex Objects

16.2.3 Emergence of ORDBMS

16.3 ORDBMS Query Language (SQL3)

16.4 ORDBMS Design

16.4.1 Challenges of ORDBMS

16.4.2 Features of ORDBMS

16.4.3 Comparison of ORDBMS and OODBMS

16.4.4 Advantages of ORDBMS

16.4.5 Disadvantages of ORDBMS

Review Questions

PART VI: ADVANCED AND EMERGING DATABASE CONCEPTS

Chapter 17 Parallel Database Systems

17.1 Introduction

17.2 Parallel Databases

17.2.1 Advantages of Parallel Databases

17.2.2 Disadvantages of Parallel Databases

17.3 Architecture of Parallel Databases

17.3.1 Shared-memory Multiple CPU Parallel Database Architecture

17.3.2 Shared-disk Multiple CPU Parallel Database Architecture

17.3.3 Shared-nothing Multiple CPU Parallel Database Architecture

17.4 Key Elements of Parallel Database Processing

17.4.1 Speed-up

17.4.2 Scale-up

17.4.3 Synchronization
17.4.4 Locking

17.5 Query Parallelism

17.5.1 I/O Parallelism (Data Partitioning)

17.5.2 Intra-query Parallelism

17.5.3 Inter-query Parallelism

17.5.4 Intra-Operation Parallelism

17.5.5 Inter-Operation Parallelism

Review Questions

Chapter 18 Distributed Database Systems

18.1 Introduction

18.2 Distributed Databases

18.2.1 Difference between Parallel and Distributed Databases

18.2.2 Desired Properties of Distributed Databases

18.2.3 Types of Distributed Databases

18.2.4 Desired Functions of Distributed Databases

18.2.5 Advantages of Distributed Databases

18.2.6 Disadvantages of Distributed Databases

18.3 Architecture of Distributed Databases

18.3.1 Client/Server Architecture

18.3.2 Collaborating Server System

18.3.3 Middleware Systems

18.4 Distributed Database System (DDBS) Design

18.4.1 Data Fragmentation

18.4.2 Data Allocation

18.4.3 Data Replication


18.5 Distributed Query Processing

18.5.1 Semi-JOIN

18.6 Concurrency Control in Distributed Databases

18.6.1 Distributed Locking

18.6.2 Distributed Deadlock

18.6.3 Timestamping

18.7 Recovery Control in Distributed Databases

18.7.1 Two-phase Commit (2PC)

18.7.2 Three-phase Commit (3PC)

Review Questions

Chapter 19 Decision Support Systems (DSS)

19.1 Introduction

19.2 History of Decision Support System (DSS)

19.2.1 Use of Computers in DSS

19.3 Definition of Decision Support System (DSS)

19.3.1 Characteristics of DSS

19.3.2 Benefits of DSS

19.3.3 Components of DSS

19.4 Operational Data versus DSS Data

Review Questions

Chapter 20 Data Warehousing and Data Mining

20.1 Introduction

20.2 Data Warehousing

20.2.1 Evolution of Data Warehouse Concept

20.2.2 Main Components of Data Warehouses


20.2.3 Characteristics of Data Warehouses

20.2.4 Benefits of Data Warehouses

20.2.5 Limitations of Data Warehouses

20.3 Data Warehouse Architecture

20.3.1 Data Marts

20.3.2 Online Analytical Processing (OLAP)

20.4 Data Mining

20.4.1 Data Mining Process

20.4.2 Data Mining Knowledge Discovery

20.4.3 Goals of Data Mining

20.4.4 Data Mining Tools

20.4.5 Data Mining Applications

Review Questions

Chapter 21 Emerging Database Technologies

21.1 Introduction

21.2 Internet Databases

21.2.1 Internet Technology

21.2.2 The World Wide Web

21.2.3 Web Technology

21.2.4 Web Databases

21.2.5 Advantages of Web Databases

21.2.6 Disadvantages of Web Databases

21.3 Digital Libraries

21.3.1 Introduction to Digital Libraries

21.3.2 Components of Digital Libraries


21.3.3 Need for Digital Libraries

21.3.4 Digital Libraries for Scientific Journals

21.3.5 Technical Developments in Digital Libraries

21.3.6 Technical Areas of Digital Libraries

21.3.7 Access to Digital Libraries

21.3.8 Database for Digital Libraries

21.3.9 Potential Benefits of Digital Libraries

21.4 Multimedia Databases

21.4.1 Multimedia Sources

21.4.2 Multimedia Database Queries

21.4.3 Multimedia Database Applications

21.5 Mobile Databases

21.5.1 Architecture of Mobile Databases

21.5.2 Characteristics of Mobile Computing

21.5.3 Mobile DBMS

21.5.4 Commercial Mobile Databases

21.6 Spatial Databases

21.6.1 Spatial Data

21.6.2 Spatial Database Characteristics

21.6.3 Spatial Data Model

21.6.4 Spatial Database Queries

21.6.5 Techniques of Spatial Database Query

21.7 Clustering-based Disaster-proof Databases

Review Questions

PART VII: CASE STUDIES


Chapter 22 Database Design: Case Studies

22.1 Introduction

22.2 Database Design for Retail Banking

22.2.1 Requirement Definition and Analysis

22.2.2 Conceptual Design: Entity-Relationship (E-R) Diagram

22.2.3 Logical Database Design: Table Definitions

22.2.4 Logical Database Design: Sample Table Contents

22.3 Database Design for an Ancillary Manufacturing System

22.3.1 Requirement Definition and Analysis

22.3.2 Conceptual Design: Entity-Relationship (E-R) Diagram

22.3.3 Logical Database Design: Table Definitions

22.3.4 Logical Database Design: Sample Table Contents

22.3.5 Functional Dependency (FD) Diagram

22.4 Database Design for an Annual Rate Contract System

22.4.1 Requirement Definition and Analysis

22.4.2 Conceptual Design: Entity-Relationship (E-R) Diagram

22.4.3 Logical Database Design: Table Definitions

22.4.4 Logical Database Design: Sample Table Contents

22.4.5 Functional Dependency (FD) Diagram

22.5 Database Design of Technical Training Institute

22.5.1 Requirement Definition and Analysis

22.5.2 Conceptual Design: Entity-Relationship (E-R) Diagram

22.5.3 Logical Database Design: Table Definitions

22.6 Database Design of an Internet Bookshop

22.6.1 Requirement Definition and Analysis


22.6.2 Conceptual Design: Entity-Relationship (E-R) Diagram

22.6.3 Logical Database Design: Table Definitions

22.6.4 Change (Addition) in Requirement Definition

22.6.5 Modified Table Definition

22.6.6 Schema Refinement

22.6.7 Modified Entity-Relationship (E-R) Diagram

22.6.8 Logical Database Design: Sample Table Contents

22.7 Database Design of Customer Order Warehouse

22.7.1 Requirement Definition and Analysis

22.7.2 Conceptual Design: Entity-Relationship (E-R) Diagram

22.7.3 Logical Database Design: Table Definition

22.7.4 Logical Database Design: Sample Table Contents

22.7.5 Functional Dependency (FD) Diagram

22.7.6 Logical Record Structure and Access Path

Review Questions

PART VIII: COMMERCIAL DATABASES

Chapter 23 IBM DB2 Universal Database

23.1 Introduction

23.2 DB2 Products

23.2.1 DB2 SQL

23.3 DB2 Universal Database (UDB)

23.3.1 Configuration of DB2 Universal Database

23.3.2 Other DB2 UDB Related Products

23.3.3 Major Components of DB2 Universal Database

23.3.4 Features of DB2 Universal Database


23.4 Installation Prerequisite for DB2 Universal Database Server

23.4.1 Installation Prerequisite: DB2 UDB Personal Edition

23.4.2 Installation Prerequisite: DB2 Workgroup Server Edition and Non-partitioned DB2 Enterprise Server Edition

23.4.3 Installation Prerequisite: Partitioned DB2 Enterprise

23.4.4 Installation Prerequisite: DB2 Connect Personal Edition

23.4.5 Installation Prerequisite: DB2 Connect Enterprise Edition

23.4.6 Installation Prerequisite: DB2 Query Patroller Server

23.4.7 Installation Prerequisite: DB2 Cube Views

23.5 Installation Prerequisite for DB2 Clients

23.5.1 Installation Prerequisite: DB2 Clients

23.5.2 Installation Prerequisite: DB2 Query Patroller Clients

23.6 Installation and Configuration of DB2 Universal Database Server

23.6.1 Performing Installation Operation for IBM DB2 Universal Database Version 8.1

Review Questions

Chapter 24 Oracle

24.1 Introduction

24.2 History of Oracle

24.2.1 The Oracle Family

24.2.2 The Oracle Software

24.3 Oracle Features

24.3.1 Application Development Features

24.3.2 Communication Features

24.3.3 Distributed Database Features

24.3.4 Data Movement Features

24.3.5 Performance Features


24.3.6 Database Management Features

24.3.7 Backup and Recovery Features

24.3.8 Oracle Internet Developer Suite

24.3.9 Oracle Lite

24.4 SQL*Plus

24.4.1 Features of SQL*Plus

24.4.2 Invoking SQL*Plus

24.4.3 Editor Commands

24.4.4 SQL*Plus Help System and Other Useful Commands

24.4.5 Formatting the Output

24.5 Oracle's Data Dictionary

24.5.1 Data Dictionary Tables

24.5.2 Data Dictionary Views

24.6 Oracle System Architecture

24.6.1 Storage Management and Processes

24.6.2 Logical Database Structure

24.6.3 Physical Database Structure

24.7 Installation of Oracle 9i

Review Questions

Chapter 25 Microsoft SQL Server

25.1 Introduction

25.2 Microsoft SQL Server Setup

25.2.1 SQL Server 2000 Editions

25.2.2 SQL Server 2005 Editions

25.2.3 Features of Microsoft SQL Server


25.3 Stored Procedures in SQL Server

25.3.1 Benefits of Stored Procedures

25.3.2 Structure of Stored Procedures

25.4 Installing Microsoft SQL Server 2000

25.4.1 Installation Steps

25.4.2 Starting and Stopping SQL Server

25.4.3 Starting the SQL Server Services Automatically

25.4.4 Connection to Microsoft SQL Server Database System

25.4.5 The Sourcing of Data

25.4.6 Security

25.5 Database Operation with Microsoft SQL Server

25.5.1 Connecting to a Database

25.5.2 Database Creation

Review Questions

Chapter 26 Microsoft Access

26.1 Introduction

26.2 An Access Database

26.2.1 Tables

26.2.2 Queries

26.2.3 Reports

26.2.4 Forms

26.2.5 Macros

26.3 Database Operation in Microsoft Access

26.3.1 Creating Forms

26.3.2 Creating a Simple Query


26.3.3 Modifying a Query

26.4 Features of Microsoft Access

Review Questions

Chapter 27 MySQL

27.1 Introduction

27.2 An Overview of MySQL

27.2.1 Features of MySQL

27.2.2 MySQL Stability

27.2.3 MySQL Table Size

27.2.4 MySQL Development Roadmap

27.2.5 Features Available in MySQL 4.0

27.2.6 The Embedded MySQL Server

27.2.7 Features of MySQL 4.1

27.2.8 MySQL 5.0: The Next Development Release

27.2.9 The MySQL Mailing Lists

27.2.10 Operating Systems Supported by MySQL

27.3 PHP-An Introduction

27.3.1 PHP Language Syntax

27.3.2 PHP Variables

27.3.3 PHP Operations

27.3.4 Installing PHP

27.4 MySQL Database

27.4.1 Creating Your First Database

27.4.2 MySQL Connect

27.4.3 Choosing the Working Database


27.4.4 MySQL Tables

27.4.5 Create Table MySQL

27.4.6 Inserting Data into MySQL Table

27.4.7 MySQL Query

27.4.8 Retrieving Information from MySQL

27.5 Installing MySQL on Windows

27.5.1 Windows Systems Requirements

27.5.2 Choosing an Installation Package

27.5.3 Installing MySQL with the Automated Installer

27.5.4 Using the MySQL Installation Wizard

27.5.5 Downloading and Starting the MySQL Installation Wizard

27.5.6 MySQL Installation Steps

27.5.7 Set up Permissions and Passwords

Review Questions

Chapter 28 Teradata RDBMS

28.1 Introduction

28.2 Teradata Technology

28.3 Teradata Tools and Utilities

28.3.1 Operating System Platform

28.3.2 Hardware Platform

28.3.3 Features of Teradata

28.3.4 Teradata Utilities

28.3.5 Teradata Products

28.3.6 Teradata-specific SQL Procedures Pass-through Facilities

28.4 Teradata RDBMS


28.4.1 Parallel Database Extensions (PDE)

28.4.2 Teradata File System

28.4.3 Parsing Engine (PE)

28.4.4 Access Module Processor (AMP)

28.4.5 Call Level Interface (CLI)

28.4.6 Teradata Director Program (TDP)

28.5 Teradata Client Software

28.6 Installation and Configuration of Teradata

28.7 Installation of Teradata Tools and Utilities Software

28.7.1 Installing with Microsoft Windows Installer

28.7.2 Installing with Parallel Upgrade Tool (PUT)

28.7.3 Typical Installation

28.7.4 Custom Installation

28.7.5 Network Installation

28.8 Basic Teradata Query (BTEQ)

28.8.1 Usage Tips

28.8.2 Frequently Used Commands

28.9 Open Database Connectivity (ODBC) Application Development

Review Questions

Answers

Bibliography
About the Author

S. K. Singh is Head of the Maintenance Engineering Department (Electrical) at
Tata Steel Limited, Jamshedpur, India. He has received degrees in both
Electrical and Electronics Engineering, as well as an M.Sc. (Engineering) in
Power Electronics from the Regional Institute of Technology, Jamshedpur,
India. He also obtained an Executive Post Graduate Diploma in International
Business from the Indian Institute of Foreign Trade, New Delhi.
Dr. Singh is an accomplished academician with over 30 years of rich
industrial experience in the design, development, implementation, marketing
and sales of IT, Automation, and Telecommunication solutions; Electrical and
Electronics maintenance; process improvement initiatives (Six Sigma, TPM,
TOC); Training and Development; and Relationship Management.
Dr. Singh has published a number of papers in both national and
international journals and has presented these at various seminars and
symposia. He has written several successful engineering textbooks for
undergraduate and postgraduate students on Industrial Instrumentation and
Control, Process Control, etc. He has been a visiting faculty member and an
external examiner for the Electrical Engineering, Computer Science, and
Electronics and Communication Engineering branches at the National
Institute of Technology (NIT), Jamshedpur, and Birsa Institute of Technology
(BIT), Sindri, Dhanbad, Jharkhand. He worked as an observer at the
Jamshedpur Centre for DOEACC examinations, Government of India, and was
a counsellor for computer papers at the Jamshedpur Centre of the Indira
Gandhi National Open University (IGNOU).
He has been conferred with the Eminent Engineer and Distinguished
Engineer Awards by The Institution of Engineers (India) for his
contributions to the field of computer science and engineering. He is
associated with many professional, educational and social organizations. He
is a Chartered Engineer and a Fellow Member of The Institution of Engineers
(India). He is a Referee of The Institution of Engineers (India) for assessing
the project work of students appearing for the Section 'B' examination in the
Computer Engineering branch.
Dedicated to
my wife Meena
and children Alka, Avinash, and Abhishek
for their love, understanding, and support
Foreword

Databases have become ubiquitous, touching almost every activity. Every IT
application today uses databases in some form or the other. They have a
tremendous impact in all applications, and have made qualitative changes in
fields as diverse as health, education, entertainment, industry, and banking.
Database systems have evolved from the hierarchical and network models of
the late 1960s to today's relational model. From the earlier file-based
systems (basically repositories of data, providing very simple retrieval
facilities), they now address complex environments, offering a range of
functionalities in a user-friendly environment. The academic community has
strived, and continues to strive, to improve these services. Behind these
complex software packages lie mathematics and other research, which provide
the backbone and basic building blocks of these systems. It is a challenge to
provide good database services in a dynamic and flexible environment in a
user-friendly way. An understanding of the basics of database systems is
crucial to designing good applications.
Database systems is a core course in most of the B. Tech./MCA/IT
programs in the country. This book addresses the needs of students and
teachers, comprehensively covering the syllabi of these courses, as well as
the needs of professional developers by addressing many practical issues.
I am confident that students, teachers and developers of database systems
alike will benefit from this book.

S. K. GUPTA
Professor
Department of Computer Science and Engineering
IIT Delhi
Preface to the Second Edition

The first edition of the book received an overwhelming response from both
students and teaching faculties of undergraduate and postgraduate
engineering courses, and also from practicing engineers in computer and
IT-application industries. The large number of reprints of the first edition
in the last five years indicates the great demand and popularity of the book
amongst the student and teaching communities.
The advancement and rapid growth in computing and communication
technologies have revolutionized computer applications in everyday life. The
dependence of business establishments on computers is also increasing at
an accelerated pace. Thus, the key to the success of a modern business is an
effective data-management strategy and interactive data-analysis
capabilities. To meet these challenges, database management has evolved
from a specialized computer application to a central component of the
modern computing environment. This has resulted in the development of
new database application platforms.
While retaining the features of the previous edition, this book contains a
chapter on a new commercial database called the "Teradata Relational
Database Management System". Teradata is a parallel processing system
and is linearly and predictably scalable in all dimensions of a database
system workload (data volume, breadth, number of users and complexity of
queries). Due to these scalability features, it is popular in enterprise data
warehousing applications.
This book also includes a study card giving brief definitions of important
topics related to DBMS. This will help students quickly grasp the subject.
Preface

Database Systems: Concepts, Design and Applications is a comprehensive
introduction to the vast and important field of database systems. It presents
a thorough treatment of the principles that govern these systems and
provides a detailed survey of future developments in this field.
It serves as a textbook for both undergraduate and postgraduate students
of computer science and engineering and information technology, as well as
those enrolled in BCA, MCA, and MBA courses. It will also serve as a
handbook for practicing engineers and as a guide for research and field
personnel at all levels.
Covering the concepts, design, and applications of database systems, this
book is divided into eight main parts consisting of 27 chapters.
The first part of the book (Database Concepts: Chapters 1 to 3) provides
a broad introduction to the concepts of database systems, database
architecture, and physical data organization.
The second part of the book (Relational Model: Chapters 4 to 7)
introduces the relational model and discusses relational systems, query
languages, and entity-relationship (E-R) model.
The third part of the book (Database Design: Chapters 8 to 10) covers the
design aspects of database systems and discusses methods for achieving
minimum redundancy through functional decomposition and normalization
processes.
The fourth part (Query, Transaction and Security Management: Chapters
11 to 14) deals with query, transaction, and recovery management aspects of
database systems. This includes query optimization techniques (used to choose
an efficient execution plan to minimise runtime), the main properties of
database transactions, database recovery and the techniques that can be used
to ensure database consistency in the event of failures, and the potential
threats to data security and protection against unauthorized access.
The fifth part (Object-based Databases: Chapters 15 to 16) discusses key
concepts of object-oriented databases, object-oriented languages, and the
emerging class of commercial object-oriented database management
systems.
The sixth part (Advanced and Emerging Database Concepts: Chapters 17
to 21) introduces advanced and emerging database concepts such as parallel
databases, distributed database management, decision support systems, data
warehousing and data mining. This part also covers emerging database
technologies such as Web-enabled databases, mobile databases, multimedia
databases, spatial databases, and digital libraries.
The seventh part (Case Studies: Chapter 22) provides six different case
studies related to real-time applications for the design of database systems.
This will help students revisit the concepts and refresh their understanding.
The eighth part (Commercial Databases: Chapters 23 to 27) is devoted to
the commercial databases available in the market today, including DB2
Universal Database, Oracle, Microsoft SQL Server, Access, and MySQL.
This will help students in bridging the gap between the theory and the
practical implementation in real-time industrial applications.
The book is further enhanced by the inclusion of a large number of
illustrative figures throughout as well as examples at the end of the
chapters. Suggestions for the further improvement of the book are welcome.

ACKNOWLEDGEMENTS

I am indebted to my colleagues and friends who have helped, inspired, and
given me moral support and encouragement, in various ways, in completing
this task. I am pleased to acknowledge the helpful comments and
suggestions provided by many students and engineers. I would like to give
special recognition to the reviewers of the manuscripts for this book. The
suggestions of reviewers were essential and are much appreciated.
I am thankful to the senior executives of Tata Steel for their
encouragement and support without which I would not have been able to
complete this book. I owe a debt of gratitude to Mr. J. A. C. Saldanha,
former executive of Tata Steel, who has been a continuous source of
inspiration to me.
I owe a special debt to my wife Meena and my children for their
sacrifices of patience, and their understanding and encouragement during
the completion of this book. My eternal gratitude goes to my parents for
their love, support, and inspiration.
I would like to place on record my gratitude and deep obligation to Dr. J.
J. Irani, former Managing Director, Tata Steel, for his interest and
encouragement.
Finally, I wish to acknowledge the assistance given by the team of editors
at Pearson Education.

S. K. SINGH
Part-I

DATABASE CONCEPTS
Chapter 1

Introduction to Database Systems

1.1 INTRODUCTION

In today's competitive environment, data (or information) and its efficient
management are the most critical business objectives of an organisation. It is
also a fact that we are in the age of information explosion, where people are
bombarded with data and it is a difficult task to get the right information at
the right time to make the right decision. Therefore, the success of an
organisation is now, more than ever, dependent on its ability to acquire
accurate, reliable and timely data about its business or operations for an
effective decision-making process.
A database system is a tool that simplifies the above tasks of managing
data and extracting useful information in a timely fashion. It helps analyse
and guide the activities or business purposes of an organisation. It is the
central repository of the data in the organisation's information system and is
essential for supporting the organisation's functions, maintaining the data for
these functions and helping users interpret the data in decision-making.
Managers are seeking to use knowledge derived from databases for
competitive advantage, for example, to determine customer buying patterns,
track sales, support customer relationship management (CRM), on-line
shopping and employee relationship management, implement decision support
systems (DSS), manage inventories and so on. To meet changing
organisational needs, database structures must be flexible enough to accept
new data and accommodate new relationships to support new decisions.
With the rapid growth in computing technology and its application in all
spheres of modern society, databases have become an integral component of
our everyday life. We encounter several activities in our day-to-day life that
involve interaction with a database, for example, a bank database to withdraw
and deposit money, air or railway reservation databases for booking tickets,
a library database for searching for a particular book, supermarket goods
databases to keep the inventory, checking for sufficient credit balance while
purchasing goods using credit cards, and so on.
In fact, databases and database management systems (DBMS) have
become essential for managing our businesses, governments, banks,
universities and every other kind of human endeavour. Thus, they are a
critical element of today's software industry, which must support these
requirements and face the daunting task of solving the problems of managing
the huge amounts of data that are increasingly being stored.
This chapter introduces the basic concepts of databases and database
management systems (DBMS), and reviews the goals of a DBMS, the types of
data models and storage management systems.

1.2 BASIC CONCEPTS AND DEFINITIONS

With the growing use of computers, organisations are fast migrating from
manual systems to computerised information systems, for which the data
within the organisation is a basic resource. Therefore, proper organisation
and management of data is necessary to run the organisation efficiently. The
efficient use of data for planning, production control, marketing, invoicing,
payroll, accounting and other functions in an organisation has a major
impact on its competitive edge. In this section, formal definitions of the terms
used in databases are provided.

1.2.1 Data
Data may be defined as known facts that can be recorded and that have
implicit meaning. Data are raw or isolated facts from which the required
information is produced.
Data are distinct pieces of information, usually formatted in a special way.
They are binary computer representations of stored logical entities. A single
piece of data represents a single fact about something in which we are
interested. For an industrial organisation, it may be the fact that Thomas
Mathew’s employee (or social security) number is 106519, or that the largest
supplier of the casting materials of the organisation is located in Indore, or
that the telephone number of one of the key customers M/s Elbee Inc. is 001-
732-3931650. Similarly, for a Research and Development set-up it may be
the fact that the largest number of new products as on date is 100, or for a
training institute it may be the fact that largest enrolment were in Database
Management course. Therefore, a piece of data is a single fact about
something that we care about in our surroundings.
Data can exist in a variety of forms that have meaning in the user's
environment, such as numbers or text on a piece of paper, bits or bytes stored
in a computer's memory, or facts stored in a person's mind. Data can also
be objects such as documents, photographic images and even video
segments. Examples of data are shown in Table 1.1.

Table 1.1 Example of data

In Salesperson's view    In Electricity supplier's context    In Employer's mind

Customer-name            Consumer-name                        Employee-name
Customer-account         Consumer-number                      Identification-number
Address                  Address                              Department
Telephone numbers        Telephone numbers                    Date-of-birth
-                        Unit consumed                        Qualification
-                        Amount-payable                       Skill-type

Usually there are many facts to describe something of interest to us. For
example, let us consider the facts that, as a manager of M/s Elbee Inc., we
might be interested in about our employee Thomas Mathew. We want to remember
that his employee number is 106519, his basic salary rate is Rs. 2,00,000
(US$ 4000) per month, his home town is Jamshedpur, his home country is
India, his date of birth is September 6th, 1957, his marriage anniversary is on
May 29th, his telephone number is 0091-657-2431322 and so forth. We need
to know these things in order to process Mathew’s payroll check every
month, to send him company greeting cards on his birthday or marriage
anniversary, print his salary slip, to notify his family in case of any
emergency and so forth. It certainly seems reasonable to collect all the facts
(or data) about Mathew that we need for the stated purposes and to keep
(store) all of them together. Table 1.2 shows all these facts about Thomas
Mathew that concern payroll and related applications.

Table 1.2 Thomas Mathew’s payroll facts

Data is also known as the plural of datum, which means a single piece of
information. However, in practice, data is used as both the singular and the
plural form of the word. The term data is often used to distinguish machine-
readable (binary) information from human-readable (textual) information.
For example, some applications make a distinction between data files (which
contain binary data) and text files (which contain ASCII data). Data can be
represented by numbers, characters, or both.

1.2.1.1 Three-layer data architecture


To centralise the mountain of data scattered throughout the organisation and
make it readily available for efficient decision support applications, data
is organised in the following layered structure:
Operational data
Reconciled data
Derived data

Figure 1.1 shows a three-layer data structure that is generally used for data
warehousing applications (the detailed discussion on data warehouse is
given in Chapter 20).
Fig. 1.1 Three-layer data structure

Operational data are stored in various operational systems (both internal
and external to the organisation).
Reconciled data are stored in the organisation's data warehouse and in the
operational data store. They are detailed and current data, intended as the
single, authoritative source for all decision support applications.
Derived data are stored in each data mart (a selected, limited and
summarised data warehouse). Derived data are selected, formatted and
aggregated for end-user decision support applications.
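
As an illustration, derived data in a data mart is typically built by selecting
and summarising reconciled data held in the warehouse. The following SQL
sketch uses hypothetical table and column names (not taken from the text), and
the exact syntax varies across DBMS products:

-- Hypothetical reconciled (warehouse) table of daily sales
CREATE TABLE reconciled_sales (
    sale_date    DATE,
    region       VARCHAR(30),
    product_code VARCHAR(10),
    amount       DECIMAL(12,2)
);

-- Derived data for a data mart: selected, summarised and formatted
-- for end-user decision support (monthly sales totals by region)
CREATE TABLE mart_monthly_sales AS
SELECT region,
       EXTRACT(YEAR FROM sale_date)  AS sale_year,
       EXTRACT(MONTH FROM sale_date) AS sale_month,
       SUM(amount)                   AS total_amount
FROM   reconciled_sales
GROUP  BY region,
          EXTRACT(YEAR FROM sale_date),
          EXTRACT(MONTH FROM sale_date);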

1.2.2 Information
Data and information are closely related and are often used interchangeably.
Information is processed, organised or summarised data. It may be defined
as a collection of related data that, when put together, communicates a
meaningful and useful message to a recipient who uses it to make a decision
or to interpret the data to derive its meaning.
Data are processed to create information, which is meaningful to the
recipient, as shown in Fig. 1.2. For example, from the salesperson's view, we
might want to know the current balance of a customer, M/s Waterhouse Ltd.,
or perhaps we might ask for the average current balance of all the customers
in Asia. The answers to such questions are information. Thus, information
involves the communication and reception of knowledge or intelligence.
Information apprises and notifies, surprises and stimulates. It reduces
uncertainty, reveals additional alternatives or helps in eliminating irrelevant
or poor ones, influences individuals and stimulates them into action. It gives
warning signals before something starts going wrong. It predicts the future
with a reasonable level of accuracy and helps the organisation to make the
best decisions.

Fig. 1.2 Information cycle
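
For instance, the salesperson's questions above could be answered by
processing stored data with a query. A minimal SQL sketch, assuming a
hypothetical customer table with current_balance and region columns (names
chosen for illustration only):

-- Current balance of one customer (a specific fact retrieved from data)
SELECT current_balance
FROM   customer
WHERE  customer_name = 'M/s Waterhouse Ltd.';

-- Average current balance of all customers in Asia (summarised information)
SELECT AVG(current_balance) AS average_balance
FROM   customer
WHERE  region = 'Asia';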

1.2.3 Data Versus Information


Let us take the following examples with the given lists of facts or data, as
shown in Fig. 1.3. The examples 1.1, 1.2 and 1.3 given below satisfy the
definition of data, but the data are useless in their present form as they are
unable to convey any meaningful message. Even if we guess that in example
1.1 they are persons' names together with some identification or social
security numbers, that in example 1.2 they are customers' names together with
some money transactions, and that in example 1.3 they may be students' names
together with the marks obtained in some examination, the data remain useless
since they do not convey any meaning about the purpose of the entries.

Fig. 1.3 Data versus Information

Now let us modify the data in example 1.1 by adding a few additional data
items, providing some structure, and placing the same data in the context
shown in Fig. 1.4 (a). The data has now been rearranged or processed to
provide a meaningful message or information, which is an Employee Master
of M/s Metal Rolling Pvt. Ltd. This is useful information for the departmental
head or the organisational head for taking decisions related to the additional
requirement of experienced and qualified manpower.

Fig. 1.4 Converting data into information for Example 1.1

(a) Data converted into textual information

(b) Data converted into summarised information

Another way to convert data into information is to summarise them or


otherwise process and present them for human interpretation. For example,
Fig. 1.4 (b) shows summarised data related to the number of employees
versus experience and qualification presented as graphical information. This
information could be used by the organisation as a basis for deciding
whether to add or hire new experienced or qualified manpower.
Data in Example 1.2 can be modified by adding additional data and
providing some structure, as shown in Fig. 1.5. The data has now been
rearranged or processed to provide a meaningful message or information,
which is a Customer Invoicing of M/s Metal Rolling Pvt. Ltd. This is
useful information for the organisation for sending reminders to the customer
for the payment of the pending balance amount and so on. Similarly, as shown
in Fig. 1.6, the data has been converted into textual and summarised
information for Example 1.3.
Today, a database may contain either data or information (or both),
according to the organisation's definition and needs. For example, a database
may contain an image of the Employee Master shown in Fig. 1.4 (a) or the
Customer Master shown in Fig. 1.5 or the Student's Performance Roster shown
in Fig. 1.6 (a), and also in summarised (trend or picture) form shown in Figs.
1.4 (b) and 1.6 (b) for decision support functions by the organisation. In this
book, the terms data and information have been treated as synonymous.
Fig. 1.5 Converting data into information for Example 1.2

Fig. 1.6 Converting data into information for Example 1.3

(a) Data converted into textual information


(b) Data converted into summarised information

1.2.4 Data Warehouse


A data warehouse is a collection of data designed to support management in
the decision-making process. It is a subject-oriented, integrated, time-
variant, non-updatable collection of data used in support of management
decision-making processes and business intelligence. It contains a wide
variety of data that present a coherent picture of business conditions at a
single point of time. It is a unique kind of database, which focuses on
business intelligence, external data and time-variant data (and not just
current data).
Data warehousing is the process by which organisations extract meaning
from their informational assets and inform decision making through the use
of data warehouses. It is a recent initiative in information technology and
has evolved very rapidly. Further details on data warehousing are given in
Chapter 20.

1.2.5 Metadata
Metadata (also called the data dictionary) is data about the data. It is
also called the system catalog; it reflects the self-describing nature of the
database and provides program-data independence. The system catalog
integrates the metadata. Metadata is the data that describes objects in the
database and makes it easier for those objects to be accessed or manipulated.
It describes the database structure, constraints, applications, authorisation,
sizes of data types and so on. Metadata is often used as an integral tool for
information resource management.
Metadata is found in documentation describing source systems. It is used
to analyze the source files selected to populate the target data warehouse.
is also produced at every point along the way as data goes through the data
integration process. Therefore, it is an important by-product of the data
integration process. The efficient management of a production or enterprise
warehouse relies heavily on the collection and storage of metadata. Metadata
is used for understanding the content of the source, all the conversion steps it
passes through and how it is finally described in the target system or data
warehouse.
Metadata is used by developers who rely on it to help them develop the
programs, queries, controls and procedures to manage and manipulate the
warehouse data. Metadata is also used for creating reports and graphs in
front-end data access tools, as well as for the management of enterprise-wide
data and report changes for the end-user. Change management relies on
metadata to administer all of the related objects for example, data model,
conversion programs, load jobs, data definition language (DDL), and so on,
in the warehouse that are impacted by a change request. Metadata is
available to database administrators (DBAs), designers and authorised users
as on-line system documentation. This improves the control of database
administrators (DBAs) over the information system and the users’
understanding and use of the system.

1.2.5.1 Types of Metadata


The advent of data warehousing technology has highlighted the importance
of metadata. There are three types of metadata, and Fig. 1.7 shows how they
are linked to the three-layer data structure.

Fig. 1.7 Metadata layer

Operational metadata: It describes the data in the various operational


systems that feed the enterprise data warehouse. Operational metadata
typically exist in a number of different formats and unfortunately are often
of poor quality.
Enterprise data warehouse (EDW) metadata: These types of metadata are
derived from the enterprise data model. EDW metadata describe the
reconciled data layer as well as the rules for transforming operational data to
reconciled data.
Data mart metadata: They describe the derived data layer and the rules
for transforming reconciled data to derived data.

1.2.6 System Catalog


A system catalog is a repository of information describing the data in the
database, that is, the metadata (or data about the data). The system catalog is
a system-created database that describes all database objects, data dictionary
information and user access information. It also describes table-related data
such as table names, table creators or owners, column names, data types,
data size, foreign keys and primary keys, indexed files, authorized users,
user access privileges and so forth.
The system catalog is created by the database management system and the
information is stored in system files, which may be queried in the same
manner as any other data table, if the user has sufficient access privileges. A
fundamental characteristic of the database approach is that the database
system contains not only the database but also a complete definition or
description of the database structure and constraints. This definition is stored
in the system catalog, which contains information such as the structure of
each file, the type and storage format of each data item and various
constraints on the data. The information stored in the catalog is called
metadata. It describes the structure of the primary database.
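
Many relational DBMSs expose the system catalog through queryable views such
as the SQL-standard INFORMATION_SCHEMA (supported, with variations, by
several products). A minimal sketch of querying such a catalog, assuming a
table named employee exists and the user holds the required privileges:

-- Column names, data types and sizes recorded in the catalog (metadata),
-- retrieved with an ordinary query rather than by reading the table's own data
SELECT column_name, data_type, character_maximum_length
FROM   information_schema.columns
WHERE  table_name = 'employee';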

1.2.7 Data Item or Fields


A data item is the smallest unit of data that has meaning to its user. It is
traditionally called a field or data element. It is an occurrence of the smallest
unit of named data. It is represented in the database by a value. Names,
telephone numbers, bill amount, address and so on in a telephone bill, and
name, basic allowances, deductions, gross pay, net pay and so on in an
employee salary slip, are a few examples of data items. Data items are the
molecules of the database. There are atoms and sub-atomic particles
composing each molecule (bits and bytes), but they do not convey any
meaning in their own right and so are of little concern to the users. A data
item may be used to construct other, more complex structures.

1.2.8 Records
A record is a collection of logically related fields or data items, with each
field possessing a fixed number of bytes and having a fixed data type. A
record consists of values for each field. It is an occurrence of a named
collection of zero, one, or more than one data items or aggregates. The data
items are grouped together to form records. The grouping of data items can
be achieved through different ways to form different records for different
purposes. These records are retrieved or updated using programs.

1.2.9 Files
A file is a collection of related records stored in sequence. In many cases, all
records in a file are of the same record type (each record having an identical
format). If every record in the file has exactly the same size (in bytes), the
file is said to be made up of fixed-length records. If different records in the
file have different sizes, the file is said to be made of variable-length
records.

Table 1.3 Employee payroll file for M/s Metal Rolling Pvt. Ltd.

Table 1.3 illustrates an example of a payroll file in tabular form. Each
kind of fact in each column, for example, employee number or home town,
is called a field. The collection of facts about a particular employee in one
line or row (that is, all the fields across all the columns) of the table is an
example of a record. The collection of payroll facts for all of the employees
(all columns and rows), that is, the entire table in Table 1.3, is an example of
a file.
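
In relational terms, the payroll file of Table 1.3 corresponds to a table, each
record to a row and each field to a column. A hedged SQL sketch of such a
fixed-length record layout (the column names and sizes are illustrative
assumptions, not taken from Table 1.3):

-- Each column is a field; each row is one employee's payroll record;
-- the table as a whole plays the role of the payroll file
CREATE TABLE employee_payroll (
    emp_no        CHAR(6),       -- e.g. 106519
    emp_name      CHAR(30),      -- fixed-length fields give fixed-length records
    home_town     CHAR(20),
    basic_salary  DECIMAL(10,2),
    date_of_birth DATE,
    telephone     CHAR(16)
);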

1.3 DATA DICTIONARY


A data dictionary (also called an information repository) is a mini database
management system that manages metadata. It is a repository of
information about a database that documents the data elements of the
database. The data dictionary is an integral part of a database management
system (DBMS) and stores metadata, that is, information about the database
such as attribute names and definitions for each table in the database. Data
dictionaries aid the database administrator in the management of a database
and of user view definitions, as well as their use.
The most general structure of a data dictionary is shown in Fig. 1.8. It
contains descriptions of the database structure and database use. The data in
the data dictionary are maintained by several programs and produce diverse
reports on demand. Most data dictionary systems are stand-alone systems,
and their database is maintained independently of the DBMS, thereby
enabling inconsistencies between the database and the data dictionary. To
prevent them, the data dictionary is integrated with DBMSs in which the
schema and user view definitions are controlled through the data dictionary
and are made available to the DBMS software.

Fig. 1.8 Structure of data dictionary

Data dictionary is usually a part of the system catalog that is generated for
each database. A useful data dictionary system usually stores and manages
the following types of information:
Descriptions of the schema of the database.
Detailed information on physical database design, such as storage structures, access paths
and file and record sizes.
Description of the database users, their responsibilities and their access rights.
High-level descriptions of the database transactions and applications and of the relationships
of users to transactions.
The relationship between database transactions and the data items referenced by them. This is
useful in determining which transactions are affected when certain data definitions are
changed.
Usage statistics such as frequencies of queries and transactions and access counts to different
portions of the database.

Let us take an example of a manufacturing company, M/s ABC Motors
Ltd., which has decided to computerise its activities related to various
departments. The manufacturing department is concerned with the types (or
brands) of motors in its manufacturing inventory, while the personnel
department is concerned with keeping track of the employees of the
company. The manufacturing department wants to store details (also
called an entity set) such as the model no., model description and so on.
Similarly, the personnel department wants to keep facts such as the employee's
number, last name, first name and so on. Fig. 1.9 illustrates the two data
processing (DP) files, namely the INVENTORY file of the manufacturing
department and the EMPLOYEE file of the personnel department.
Fig. 1.9 Data processing files of M/s ABC Motors Ltd

(a) INVENTORY file of manufacturing department

(b) EMPLOYEE file of personnel department

Now, though the manufacturing and personnel departments are interested in
keeping track of their inventory and employee details respectively, the data
processing (DP) department of M/s ABC Motors Ltd. would be interested in
tracking and managing the entities themselves (the individual fields and the
two files), that is, the data dictionary. Fig. 1.10 shows a sample of the data
dictionary for the two files (the field's file and the file's file) of Fig. 1.9.
As can be seen from Fig. 1.10, all data fields of both the files are included
in the field's file and both the files (INVENTORY and EMPLOYEE) are
included in the file's file. Thus, the data dictionary contains attributes for the
field's file, such as FIELD-NAME, FIELD-TYPE and FIELD-LENGTH, and for
the file's file, such as FILE-NAME and FILE-LENGTH.
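
A minimal sketch of how the data processing department might itself hold this
dictionary as two tables, using the attribute names mentioned above (the exact
layout of Fig. 1.10 is not reproduced here, and the extra file_name column
linking a field to its file is an assumption for illustration):

-- "File's file": one row per application file tracked by the dictionary
CREATE TABLE dd_files (
    file_name   VARCHAR(30),   -- e.g. INVENTORY, EMPLOYEE
    file_length INTEGER
);

-- "Field's file": one row per field of each application file
CREATE TABLE dd_fields (
    field_name   VARCHAR(30),  -- e.g. MOD-NO, EMP-NO
    field_type   VARCHAR(15),
    field_length INTEGER,
    file_name    VARCHAR(30)   -- assumed link to the file the field appears in
);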
In the manufacturing department’s INVENTORY file, each row
(consisting of fields namely MOD-NO, MOD-NAME, MOD-DESC, UNIT-
PRICE) represents the details of a model of a car, as shown in Fig. 1.9 (a). In
the personnel department’s EMPLOYEE file, each row (consisting of fields
namely EMP-NO, EMP-LNAME, EMP-FNAME, EMP-SALARY)
represents details about an employee, as shown in Fig. 1.9 (b). Similarly, in
the data dictionary, each row of the field’s file (consisting of entries namely
FIELD-NAME, FIELD-TYPE, FIELD-LENGTH) represents one of the
fields in one of the application data files (in this case INVENTORY and
EMPLOYEE files) processed by the data processing department, as shown
in Fig. 1.10 (a). Also, each row of the file's file (consisting of entries namely
FILE-NAME, FILE-LENGTH) represents one of the application files (in
this case the INVENTORY and EMPLOYEE files) processed by the data
processing department, as shown in Fig. 1.10 (b). Therefore, we see that each
row of the field’s file in Fig. 1.10 (a) represents one of the fields of one of
the files in Fig. 1.9, and each row of the file’s file in Fig. 1.10 (b) represents
one of the files in Fig. 1.9.

Fig. 1.10 Data dictionary files of M/s ABC Motors Limited


The data dictionary also keeps track of the relationships among the entities, which is important in the data processing environment because it shows how these entities interrelate. Figure 1.11 shows the links (relationships) between fields and files. These relationships are important for the data processing department.

Fig. 1.11 Data dictionary showing relationships

1.3.1 Components of Data Dictionaries


As discussed in the previous section, data dictionary contains the following
components:
Entities
Attributes
Relationships
Key

1.3.1.1 Entities
An entity is a real physical object or an event that the user is interested in keeping track of. In other words, any item about which information is stored is called an entity. For example, in Fig. 1.9 (b), Thomas Mathew, a real living person and an employee of M/s ABC Motors Ltd., is an entity about which the company is interested in keeping various details or facts. Similarly, in Fig. 1.9 (a), the Maharaja model car (model no. M-1000), a real physical object manufactured by M/s ABC Motors Ltd., is an entity. A collection of entities of the same type, for example, "all" of the company's employees (the rows in the EMPLOYEE file in Fig. 1.9 (b)) or "all" of the company's models (the rows in the INVENTORY file in Fig. 1.9 (a)), is called an entity set. In other words, we can say that a record describes an entity and a file describes an entity set.

1.3.1.2 Attributes
An attribute is a property or characteristic (field) of an entity. In Fig. 1.9 (b), Mathew's EMP-NO, EMP-SALARY and so forth are all his attributes. Similarly, in Fig. 1.9 (a), the Maharaja car's MOD-NO, MOD-DESC, UNIT-PRICE and so forth are all its attributes. In other words, the values in all the fields are attributes. Fig. 1.12 shows an example of an entity set and its attributes.

Fig. 1.12 Entity set and attributes

1.3.1.3 Relationships
The associations, or the ways in which different entities relate to each other, are called relationships, as shown in Fig. 1.11. The relationship between any pair of entities of a data dictionary can have value to some part or department of the organisation. Some data dictionaries define a limited set of relationships among their entities, while others allow relationships between every pair of entities. Some examples of common data dictionary relationships are given below:
Record construction: for example, which field appears in which records.
Security: for example, which user has access to which file.
Impact of change: for example, which programs might be affected by changes to which files.
Physical residence: for example, which files are residing in which storage device or disk
packs.
Program data requirement: for example, which programs use which file.
Responsibility: for example, which users are responsible for updating which files.

Relationships can be of the following types:


One-to-one (1:1) relationship
One-to-many (1:m) relationships
Many-to-many (n:m) relationships

Let us take the example shown in Fig. 1.9 (b), wherein there is only one EMP-NO (employee identification number) in the personnel department's EMPLOYEE file for each employee, and it is unique. This is called a unary association or one-to-one (1:1) relationship, as shown in Fig. 1.13 (a).
Now let us assume that an employee belongs to a manufacturing
department. While for a given employee there is one manufacturing
department, in the manufacturing department there may be many employees.
Thus, in this case, there is a one-to-one association in one direction and a multiple association in the other direction. This combination is called a one-to-many (1:m) relationship, as shown in Fig. 1.13 (b).
Fig. 1.13 Entity relationship (ER) diagram

(a) One-to-one relationship

(b) One-to-many relationship

(c) Many-to-many relationship

Finally, consider the situation in which an employee gets a particular salary. While for a given employee there is one salary amount (for example, 4000), the same amount may be given to many employees in the department. In this case, there are multiple associations in both directions, and this combination is called a many-to-many (n:m) relationship, as shown in Fig. 1.13 (c).
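
In a relational DBMS, such relationships are most commonly recorded with primary and foreign keys. The sketch below is illustrative only: the DEPARTMENT table and all data types are assumptions, and the field names of Fig. 1.9 (b) are written with underscores for SQL. It shows one way the one-to-many relationship between a department and its employees could be represented.

   -- One department ...
   CREATE TABLE DEPARTMENT (
       DEPT_NO   CHAR(4) PRIMARY KEY,
       DEPT_NAME CHAR(30)
   );

   -- ... can have many employees, while each employee belongs to one department (1:m)
   CREATE TABLE EMPLOYEE (
       EMP_NO     CHAR(6) PRIMARY KEY,
       EMP_LNAME  CHAR(20),
       EMP_FNAME  CHAR(20),
       EMP_SALARY DECIMAL(9,2),
       DEPT_NO    CHAR(4) REFERENCES DEPARTMENT (DEPT_NO)
   );

A many-to-many (n:m) relationship would need a third, linking table that holds pairs of keys, one from each of the two related entity sets.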

1.3.1.4 Key
The data item (or field) that a computer uses to identify a record in a database system is referred to as a key. In other words, a key is a single attribute or a combination of attributes of an entity set that is used to identify one or more instances of the set. There are various types of keys:
Primary key
Concatenated key
Secondary key
Super key
A primary key is used to uniquely identify a record. It is also called an entity identifier; for example, EMP-NO in the EMPLOYEE file of Fig. 1.9 (b) and MOD-NO in the INVENTORY file of Fig. 1.9 (a). When more than one data item is used to identify a record, the combination is called a concatenated (or composite) key; for example, EMP-NO and EMP-FNAME together in the EMPLOYEE file of Fig. 1.9 (b), and MOD-NO and MOD-TYPE together in the INVENTORY file of Fig. 1.9 (a).
A secondary key is used to identify all those records which have a certain property. It is an attribute or combination of attributes that may not be a concatenated key but that classifies the entity set on a particular characteristic. A super key includes any number of attributes that together possess the uniqueness property. For example, if we add additional attributes to a primary key, the resulting combination still uniquely identifies an instance of the entity set. Such keys are called super keys. Thus, a primary key is a minimal super key.
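
In SQL terms, a primary key and a concatenated (composite) key are declared on the table definition itself. The sketch below uses the INVENTORY fields of Fig. 1.9 (a) with assumed data types; the PRICE_HISTORY table is purely hypothetical and is introduced only to illustrate a composite key.

   -- Single-attribute primary key (the entity identifier)
   CREATE TABLE INVENTORY (
       MOD_NO     CHAR(6) PRIMARY KEY,
       MOD_NAME   CHAR(20),
       MOD_DESC   CHAR(30),
       UNIT_PRICE DECIMAL(9,2)
   );

   -- Concatenated (composite) key: more than one data item identifies a record
   CREATE TABLE PRICE_HISTORY (
       MOD_NO     CHAR(6),
       PRICE_DATE DATE,
       UNIT_PRICE DECIMAL(9,2),
       PRIMARY KEY (MOD_NO, PRICE_DATE)    -- the pair together is the key
   );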

1.3.2 Active and Passive Data Dictionaries


Data dictionary may be either active or passive. An active data dictionary
(also called integrated data dictionary) is managed automatically by the
database management software. Since active data dictionaries are maintained
by the system itself, they are always consistent with the current structure and
definition of the database. Most of the relational database management
systems contain active data dictionaries that can be derived from their
system catalog.
The passive data dictionary (also called non-integrated data dictionary) is
the one used only for documentation purposes. Data about fields, files,
people and so on, in the data processing environment are entered into the
dictionary and cross-referenced. A passive dictionary is simply a self-contained application and a set of files used for documenting the data processing environment. It is managed by the users of the system and is modified whenever the structure of the database is changed. Since this modification must be performed manually by the user, it is possible that the data dictionary will not remain consistent with the current structure of the database.
However, the passive data dictionaries may be maintained as a separate
database. Thus, it allows developers to remain independent from using a
particular relational database management system for as long as possible.
Also, passive data dictionaries are not limited to information that can be
discerned by the database management system. Since passive data
dictionaries are maintained by the user, they may be extended to contain
information about organisational data that is not computerized.

1.4 DATABASE

A database is defined as a collection of logically related data stored together that is designed to meet the information needs of an organisation. It is basically an electronic filing cabinet, which contains computerized data files. It can contain one data file (a very small database) or a large number of data files (a large database), depending on organisational needs. A database is
organised in such a way that a computer program can quickly select desired
pieces of data.
A database can further be defined as one that:
a. is a collection of interrelated data stored together without harmful or unnecessary
redundancy. However, redundancy is sometimes useful for performance reasons but is costly.
b. serves multiple applications in which each user has his own view of data. This data is
protected from unauthorized access by security mechanism and concurrent access to data is
provided with recovery mechanism.
c. stores data independent of programs and changes in data storage structure or access strategy
do not require changes in accessing programs or queries.
d. has structured data to provide a foundation for growth and controlled approach is used for
adding new data, modifying and restoring data.

The names, addresses, telephone numbers and so on of the people we know, maintained in an address book, stored on computer storage (such as a floppy or hard disk), or kept in a Microsoft Excel worksheet, are examples of a database. Since it is a collection of related data (addresses of people we know) with an implicit meaning, it is a database.
A database is designed, built and populated with data for a specific
purpose. It has an intended group of users and some preconceived
applications in which these users are interested. In other words, database has
some source from where data is derived, some degree of interaction with
events in the real world and an audience that is actively interested in the
contents of the database. A database can be of any size and of varying
complexity. It may be generated and maintained manually or it may be
computerized. A computerized database may be created and maintained
either by a group of application programs written specifically for that task or
by a database management system.
A database consists of the following four components as shown in Fig.
1.14:
Data item
Relationships
Constraints and
Schema.

Fig. 1.14 Components of database

As explained in the earlier sections, data (or data item) is a distinct piece
of information. Relationships represent a correspondence (or
communication) between various data elements. Constraints are predicates
that define correct database states. Schema describes the organisation of data
and relationships within the database. It defines various views of the
database for the use of the various system components of the database
management system and for application security. A schema separates the
physical aspect of data storage from the logical aspects of data
representation.
An organisation of a database is shown in Fig. 1.15. It consists of the
following three independent levels:
Physical storage organisation or internal schema layer
Overall logical organisation or global conceptual schema layer
Programmers’ logical organisation or external schema layer.
Fig. 1.15 Database organisation

The internal schema defines how and where the data are organised in
physical data storage. The conceptual schema defines the stored data
structure in terms of the database model used. The external schema defines a
view of the database for particular users. A database management system
provides for accessing the database while maintaining the required
correctness and consistency of the stored data.
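
As a small illustration of an external schema, most relational DBMSs let the DBA define a view over the conceptual schema. The table and column names below are assumptions carried over from the earlier EMPLOYEE example:

   -- An external view for the personnel office: only the identifying number,
   -- name and salary are visible; the rest of the stored structure is hidden
   CREATE VIEW EMP_SALARY_VIEW AS
       SELECT EMP_NO, EMP_LNAME, EMP_SALARY
       FROM   EMPLOYEE;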

1.5 DATABASE SYSTEM

A database system, also called a database management system (DBMS), is a generalized software system for manipulating databases. It is basically a computerized record-keeping system that stores information and allows users to add, delete, change, retrieve and update that information on demand. It provides for simultaneous use of a database by multiple users and tools for accessing and manipulating the data in the database. A DBMS is also a collection of programs that enables users to create and maintain a database. It
is a general-purpose software system that facilitates the process of defining
(specifying the data types, structures and constraints), constructing (process
of storing data on storage media) and manipulating (querying to retrieve
specific data, updating to reflect changes and generating reports from the
data) for various applications.
Typically, a DBMS has three basic components, as shown in Fig. 1.16,
and provides the following facilities:
Fig. 1.16 DBMS Components

Data definition language (DDL): It allows users to define the database and to specify the data types, data structures and constraints on the data to be stored in the database. The DDL translates the schema written in a source language into the object schema, thereby creating a logical and physical layout of the database (a hedged DDL sketch follows this list).
Data manipulation language (DML) and query facility: It allows users to insert, update, delete and retrieve data from the database, usually through a data manipulation language (DML). It also provides a general query facility through the Structured Query Language (SQL).
Software for controlled access of database: It provides controlled access to the database, for example, preventing unauthorized users from accessing the database, providing a concurrency control system to allow shared access to the database, activating a recovery control system to restore the database to a previous consistent state following a hardware or software failure, and so on.
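
A minimal sketch of the DDL facility, in standard SQL, is shown below. The table and its columns are illustrative assumptions (they loosely follow the CUSTOMER file discussed later in this chapter), not definitions taken from the book's figures.

   -- DDL: define a table, its data types and constraints on the stored data
   CREATE TABLE CUSTOMER (
       CUST_ID   CHAR(6)  NOT NULL PRIMARY KEY,
       CUST_NAME CHAR(30) NOT NULL,
       COUNTRY   CHAR(20),
       BAL_AMT   DECIMAL(10,2) CHECK (BAL_AMT >= 0)   -- a simple integrity constraint
   );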
The database and the DBMS software together are called a database system. A database system overcomes the limitations of the traditional file-oriented system, such as a large amount of data redundancy, poor data control, inadequate data manipulation capabilities and excessive programming effort, by supporting an integrated and centralized data structure.

1.5.1 Operations Performed on Database Systems


As discussed in the previous section, a database system can be regarded as a repository or container for a collection of computerized data files in the form of an electronic filing cabinet. Users can perform a variety of operations on database systems. Some of the important operations performed on such files are as follows:
Inserting new data into existing data files
Adding new files to the database
Retrieving data from existing files
Changing data in existing files
Deleting data from existing files
Removing existing files from the database.

Let us take an example of M/s Metal Rolling Pvt. Ltd., which has a very small database containing just one file, called EMPLOYEE, as shown in Table 1.4. The EMPLOYEE file in turn contains data concerning the details of the employees working in the company. Fig. 1.17 depicts the various operations that can be performed on the EMPLOYEE file and the results thereafter displayed on the computer screen.
Table 1.4 EMPLOYEE file of M/s Metal Rolling Pvt. Ltd.
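
If the EMPLOYEE file of Table 1.4 were held in a relational DBMS, the operations of Fig. 1.17 could be expressed in SQL roughly as follows. The column names and sample values are assumptions invented for illustration, since the table itself is not reproduced here.

   -- (a) Inserting new data into an existing file
   INSERT INTO EMPLOYEE (EMP_NO, EMP_NAME, COUNTRY, EMP_SALARY)
   VALUES ('E106', 'R. Kumar', 'India', 6500);

   -- (b) Retrieving existing data from a file
   SELECT * FROM EMPLOYEE WHERE EMP_NO = 'E106';

   -- (c) Changing existing data of a file
   UPDATE EMPLOYEE SET EMP_SALARY = 7000 WHERE EMP_NO = 'E106';

   -- (d) Deleting existing data from a file
   DELETE FROM EMPLOYEE WHERE EMP_NO = 'E106';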

1.6 DATA ADMINISTRATOR (DA)

A data administrator (DA) is an identified individual in the organisation who has the central responsibility for controlling data. As discussed earlier, data are important assets of an organisation.
Fig. 1.17 Operations on EMPLOYEE file

(a) Inserting new data into a file

(b) Retrieving existing data from a file


(c) Changing existing data of a file

(d) Deleting existing data from a file

Therefore, it is important that someone at a senior level in the organisation understands these data and the organisational needs with respect to data. Thus, a DA is this senior-level person in the organisation whose job is to decide what data should be stored in the database and to establish policies for maintaining and dealing with that data. He decides exactly what information is to be stored in the database, identifies the entities of interest to the organisation and the information to be recorded about those entities. A DA decides the content of the database at an abstract level. This process performed by the DA is known as logical or conceptual database design. DAs are managers and need not be technical persons; however, knowledge of information technology helps them in an overall understanding and appreciation of the system.

1.7 DATABASE ADMINISTRATOR (DBA)

A database administrator (DBA) is an individual person or a group of persons with an overview of one or more databases who controls the design and the
use of these databases. A DBA provides the necessary technical support for
implementing policy decisions of databases. Thus, a DBA is responsible for
the overall control of the system at technical level and unlike a DA, he or she
is an IT professional. A DBA is the central controller of the database system
who oversees and manages all the resources (such as database, DBMS and
related software). The DBA is responsible for authorizing access to the
database, for coordinating and monitoring its use and for acquiring software
and hardware resources as needed. The DBA is accountable for the security system, appropriate response times, adequate performance of the database system and a variety of other technical services. The
database administrator is supported with a number of staff or a team of
people such as system programmers and other technical assistants.

1.7.1 Functions and Responsibilities of DBAs


Following are some of the functions and responsibilities of database
administrator and his staff:
a. Defining conceptual schema and database creation: A DBA creates the conceptual
schema (using data definition language) corresponding to the abstract level database design
made by the data administrator. The DBA creates the original database schema and structure of the database. The resulting object schema is used by the DBMS in responding to access requests.
b. Storage structure and access-method definition: DBA decides how the data is to be
represented in the stored database, the process called physical database design. Database
administrator defines the storage structure (called internal schema) of the database (using
data definition language) and the access method of the data from the database.
c. Granting authorisation to the users: One of the important responsibilities of a DBA is liaising with end-users to ensure the availability of the required data to them. A DBA grants users access to the database and regulates the usage of specific parts of the database by various users (a hedged example of such a grant appears after this list). The authorisation information is kept in a special system structure that the database system consults whenever someone attempts to access the data in the system. DBAs also assist users with problem definition and its resolution.
d. Physical organisation modification: The DBA carries out the changes or modification to
the description of the database or its relationship to the physical organisation of the database
to reflect the changing needs of the organisation or to alter the physical organisation to
improve performance.
e. Routine maintenance: The DBA maintains periodical back-ups of the database, either
onto hard disks, compact disks or onto remote servers, to prevent loss of data in case of
disasters. It ensures that enough free storage space is available for normal operations and
upgrading disk space as required. A DBA is also responsible for repairing damage to the
database due to misuse or software and hardware failures. DBAs define and implement an
appropriate damage control mechanism involving periodic unloading or dumping of the
database to backup storage device and reloading the database from the most recent dump
whenever required.
f. Job monitoring: DBAs monitor jobs running on the database and ensure that performance
is not degraded by very expensive tasks submitted by some users. With change in
requirements (for example, reorganisation of the database), DBAs are responsible for making appropriate adjustments or tuning of the database.
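
As a hedged illustration of point (c) above, in SQL-based systems the DBA typically grants and withdraws such access rights with GRANT and REVOKE statements. The user names below are hypothetical:

   -- Allow a personnel clerk to read and update employee records ...
   GRANT SELECT, UPDATE ON EMPLOYEE TO clerk_patel;

   -- ... but give the payroll application read-only access
   GRANT SELECT ON EMPLOYEE TO payroll_app;

   -- Withdraw a privilege when it is no longer required
   REVOKE UPDATE ON EMPLOYEE FROM clerk_patel;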

1.8 FILE-ORIENTED SYSTEM VERSUS DATABASE SYSTEM

Computer-based data processing systems were initially used for scientific and engineering calculations. With the increased complexity of business requirements, they were gradually introduced into business applications.
The manual method of filing in an organisation, used to hold all internal and external correspondence relating to a project or activity, client, task, product, customer or employee, was to maintain different manual folders. These files or folders were labelled and stored in one or more
cabinets or almirahs under lock and key for safety and security reasons. As
and when required, the concerned person in the organisation used to search
for a specific folder or file serially starting from the first entry. Alternatively,
files were indexed to help locate the file or folder more quickly. Ideally, the
contents of each file folder were logically related. For example, a file folder
in a supplier’s office might contain customer data; one file folder for each
customer. All data in that folder described only that customer’s transaction.
Similarly, a personnel manager might organise personnel data of employees
by category of employment (for example, technical, secretarial, sales,
administrative, and so on). Therefore, a file folder labelled ‘technical’ would
contain data pertaining to only those people whose duties were properly
classified as technical.
The manual system worked well as a data repository as long as the data collection was relatively small and the organisation's managers had few reporting requirements. However, as the organisation grew and the reporting requirements became more complex, it became difficult to keep track of data in the manual file system. Also, report generation from a
manual file system could be slow and cumbersome. Thus, this manual filing
system was replaced with a computer-based filing system. File-oriented
systems were an early attempt to computerize the manual filing system that
we are familiar with. Because these systems performed normal record-
keeping functions, they were called data processing (DP) systems. Rather
than establish a centralised store for organisation’s operational data, a
decentralised approach was taken, where each department, with the
assistance of DP department staff, stored and controlled its own data.
Table 1.5 shows an example of file-oriented system of an organisation
engaged in product distribution. Each table represents a file in the system,
for example, PRODUCT file, CUSTOMER file, SALES file and so on. Each
row in these files represents a record in the file. PRODUCT file contains 6
records and each of these records contains data about different products. The
individual data items or fields in the PRODUCT file are PRODUCT-ID,
PRODUCT-DESC, MANUF-ID and UNIT-COST. CUSTOMER file
contains 5 records and each of these records contains data about customer.
The individual data items in CUSTOMER file are CUST-ID, CUST-NAME,
CUST-ADDRESS, COUNTRY, TEL-NO and BAL-AMT. Similarly, SALES
file contains 5 records and each of these records contains data about sales
activities. The individual data items in SALES file are SALES-DATE,
CUST-ID, PROD-ID, QTY and UNIT-PRICE.
Table 1.5 File-oriented system

With the assistance of the DP department, the files were used for a number of different applications by the user departments; for example, an account receivable program was written to generate billing statements for customers. This program used the CUSTOMER and SALES files; these files were both
stored in the computer in order by CUST-ID and were merged to create a
printed statement. Similarly, sales statement generation program (using
PRODUCT and SALES files) was written to generate product-wise sales
performance. This type of program, which accomplishes a specific task of
practical value in a business situation is called application program or
application software. Each application program that is developed is designed
to meet the specific needs of the particular requesting department or user
group.
Fig. 1.18 illustrates structures in which application programs are written
specifically for each user department for accessing their own files. Each set
of departmental programs handles data entry, file maintenance and the
generation of a fixed set of specific reports. Here, the physical structure and
storage of the data files and records are defined in the application program.
For example:

Fig. 1.18 File-oriented system

a. Sales department stores details relating to sales performance, namely SALES (SALE-DATE, CUST-ID, PROD-ID, QTY, UNIT-PRICE).
b. Customer department stores details relating to customer invoice realization summary, namely
CUSTOMER (CUST-ID, CUST-NAME, CUST-ADD, COUNTRY, TEL-NO, BAL-AMT).
c. Product department stores details relating to product categorization summary, namely
PRODUCT (PROD-ID, PROD-DESC, MANUF-ID, UNIT-COST).

It can be seen from the above examples that there is a significant amount of duplication of data storage across different departments (for example, CUST-ID and PROD-ID), which is generally true of a file-oriented system.

1.8.1 Advantages of Learning File-oriented System


Although the file-oriented system is now largely obsolete, there are several advantages to learning about file-based systems:
It provides a useful historical perspective on how we handle data.
The characteristics of a file-based system help in an overall understanding of the design complexity of database systems.
Understanding the problems and limitations inherent in the file-based system helps avoid these same problems when designing database systems, thereby resulting in a smooth transition.

1.8.2 Disadvantages of File-oriented System


Conventional file-oriented system has the following disadvantages:
a. Data redundancy (or duplication): Since a decentralised approach was taken, each
department used its own independent application programs and special files of data. This resulted in duplication of the same data and information in several files, for example,
duplication of PRODUCT-ID data in both PRODUCT and SALES files, and CUST-ID data
in both CUSTOMER and SALES files as shown in Table 1.5. This redundancy or duplication
of data is wasteful and requires additional or higher storage space, costs extra time and
money, and requires increased effort to keep all files up-to-date.
b. Data inconsistency (or loss of data integrity): Data redundancy also leads to data
inconsistency (or loss of data integrity), since either the data formats may be inconsistent or
data values (various copies of the same data) may no longer agree or both.
Fig. 1.19 Inconsistent product description data

Fig. 1.19 shows an example of data inconsistency in which a field for product description is maintained by all three department files, namely SALES, PRODUCT and ACCOUNTS. It can be seen in this example that even though it was always the product
description, the related field in all the three department files often had a different name, for
example, PROD-DESC, PROD-DES and PRODDESC. Also, the same data field might have
different length in the various files, for example, 15 characters in SALES file, 20 characters
in PRODUCT file and 10 characters in ACCOUNTS file. Furthermore, suppose a product
description was changed from steel cabinet to steel chair. This duplication (or redundancy) of
data increased the maintenance overhead and storage costs. As shown in Fig. 1.19, the product description field might be updated immediately in the SALES file but updated incorrectly, or only a week later, in the PRODUCT and ACCOUNTS files. Over a period of
time, such discrepancies can cause serious degradation in the quality of information
contained in the data files and can also affect the accuracy of reports.
c. Program-data dependence: As we have seen, file descriptions (physical structure, storage
of the data files and records) are defined within each application program that accesses a
given file. For example, “Account receivable program” of Fig. 1.18 accesses both
CUSTOMER file and SALES file. Therefore, this program contains a detailed file
description for both these files. As a consequence, any change for a file structure requires
changes to the file description for all programs that access the file. It can also be noticed in
Fig. 1.18 that SALES file has been used in both “Account receivable program” and “Sales
statement program”. If it is decided to change the CUST-ID field length from 4 characters to
6 characters, the file descriptions in each program that is affected would have to be modified
to conform to the new file structure. It is often difficult to even locate all programs affected by
such changes. It could be very time consuming and subject to error when making changes.
This characteristic of file-oriented system is known as program-data dependence.
d. Poor data control: As shown in Fig. 1.19, a file-oriented system being decentralised in
nature, there was no centralised control at the data element (field) level. It could be very
common for the data field to have multiple names defined by the various departments of an
organisation, depending on the file it was in. This could lead to different meanings of a data field in different contexts and, conversely, the same meaning for different fields. This leads to poor data control, resulting in confusion.
e. Limited data sharing: There are limited data-sharing opportunities with the traditional file-
oriented system. Each application has its own private files and users have little opportunity to
share data outside their own applications. To obtain data from several incompatible files in
separate systems will require a major programming effort. In addition, a major management
effort may also be required since different organisational units may own these different files.
f. Inadequate data manipulation capabilities: File-oriented systems do not provide strong connections between data in different files, and therefore their data manipulation capability is very limited.
g. Excessive programming effort: There was a very high interdependence between program
and data in file-oriented system and therefore an excessive programming effort was required
for a new application program to be written. Even though an existing file may contain some
of the data needed, the new application often requires a number of other data fields that may
not be available in the existing file. As a result, the programmer had to rewrite the code for
definitions for needed data fields from the existing file as well as definitions of all new data
fields. Therefore, each new application required that the developers (or programmers)
essentially start from scratch by designing new file formats and descriptions and then write
the file access logic for each new program. Also, both initial and maintenance programming
efforts for management information applications were significant.
h. Security problems: Every user of the database system should not be allowed to access all
the data. Each user should be allowed to access the data concerning his area of application
only. Since application programs are added to the file-oriented system in an ad hoc manner, it was difficult to enforce such a security system.

1.8.3 Database Approach


The problems inherent in file-oriented systems make using the database
system very desirable. Unlike the file-oriented system, with its many
separate and unrelated files, the database system consists of logically related data stored in a single, shared data repository. Therefore, the database approach represents a change in the way end-user data are stored, accessed and managed. It emphasizes the integration and sharing of data throughout the organisation. Database systems overcome the disadvantages of the file-oriented system. They eliminate problems related to data redundancy and data control by supporting an integrated and centralised data structure. Data are
controlled via a data dictionary (DD) system which itself is controlled by
database administrators (DBAs). Fig. 1.20 illustrates a comparison between
file-oriented and database systems.
Fig. 1.20 File-oriented versus database systems

(a) File-oriented system


(b) Database system

1.8.4 Database System Environment


A database system refers to an organisation of components that define and
regulate the collection, storage, management and use of data within a
database environment. It consists of four main parts:
Data
Hardware
Software
Users (People)

Data: From the user’s point of view, the most important component of
database system is perhaps the data. The term data has been explained in
Section 1.2.1. The totality of data in the system is all stored in a single
database, as shown in Fig. 1.20 (b). These data in a database are both
integrated and shared in a system. Data integration means that the database
can be thought of as a unification of several otherwise distinct files, with redundancy among the files at least partly eliminated. In data sharing, individual pieces of data in the database can be shared among different users, and each of those users can have access to the same piece of data, possibly for different purposes. Different users can even access the same piece of data concurrently (at the same time). Such concurrent access of data by different users is possible because the database is integrated.
Depending on the size and requirement of an organisation or enterprise,
database systems are available on machines ranging from the small personal
computers to the large mainframe computers. The requirement could be a
single-user system (in which at most one user can access the database at a
given time) or multi-user system (in which many users can access the
database at the same time).
Hardware: All the physical devices of a computer are termed as
hardware. The computer can range from a personal computer
(microcomputer), to a minicomputer, to a single mainframe, to a network of
computers, depending upon the organisation’s requirement and the size of
the database. From the point of view of the database system the hardware
can be divided into two components:
The processor and associated main memory to support the execution of database system
(DBMS) software and
The secondary (or external) storage devices (for example, hard disk, magnetic disks, compact
disks and so on) that are used to hold the stored data, together with the associated peripherals
(for example, input/output devices, device controllers, input/output channels and so on).
A database system requires a minimum amount of main memory and disk
space to run. With a large number of users, a very large amount of main
memory and disk space is required to maintain and control the huge quantity
of data stored in a database. In addition, high-speed computers, networks and
peripherals are necessary to execute the large number of data access required
to retrieve information in an acceptable amount of time. The advancement in
computer hardware technology and the development of powerful and less expensive computers have resulted in increased development of database technology and its applications.
Software: Software is the basic interface (or layer) between the physical
database and the users. It is most commonly known as database management
system (DBMS). It comprises the application programs together with the
operating system software. All requests from the users to access the database
are handled by DBMS. DBMS provides various facilities, such as adding
and deleting files, retrieving and updating data in the files and so on.
Application software is generally written by company employees to solve a
specific common problem.
Application programs are written typically in a third-generation
programming language (3GL), such as C, C++, Visual Basic, Java, COBOL,
Ada, Pascal, Fortran and so on, or using fourth-generation language (4GL),
such as SQL, embedded in a third-generation language. Application
programs use the facilities of the DBMS to access and manipulate data in the
database, providing reports or documents needed for the information and
processing needs of the organisation. The operating system software
manages all hardware components and makes it possible for all other
software to run on the computers.
Users: The users are the people interacting with the database system in
any form. There could be various categories of users. The first category of
users is the application programmers who write database application
programs in some programming language. The second category of users is
the end users who interact with the system from online workstations or
terminals and access the database via one of the online application
programs to get information for carrying out their primary business
responsibilities. The third category of users is the database administrators
(DBAs), as explained in Section 1.7, who manage the DBMS and its proper
functioning. The fourth category of users is the database designers who
design the database structure.

1.8.5 Advantages of DBMS


Due to the centralised management and control, the database management
system (DBMS) has numerous advantages. Some of these are as follows:
a. Minimal data redundancy: In a database system, views of different user groups (data
files) are integrated during database design into a single, logical, centralised structure. By
having a centralised database and centralised control of data by the DBA, unnecessary duplication of data is avoided. Each primary fact is ideally recorded in only one place in the
database. The total data storage requirement is effectively reduced. It also eliminates the
extra processing to trace the required data in a large volume of data. Incidentally, we do not
mean or suggest that all redundancy can or necessarily should be eliminated. Sometimes
there are sound business and technical reasons for maintaining multiple copies of the same
data, for example, to improve performance, model relationships and so on. In a database
system, however, this redundancy can be carefully controlled. That is, the DBMS is aware of
it, if it exists and assumes the responsibility for propagating updates and ensuring that the
multiple copies are consistent.
b. Program-data independence: The separation of metadata (data description) from the
application programs that use the data is called data independence. In the database
environment, it allows for changes at one level of the database without affecting other levels.
These changes are absorbed by the mappings between the levels. With the database approach,
metadata are stored in a central location called the repository. This property of database systems
allows an organisation’s data to change and evolve (within limits) without changing the
application programs that process the data.
c. Efficient data access: DBMS utilizes a variety of sophisticated techniques to store and
retrieve data efficiently. This feature is especially important if the data is stored on external
storage devices.
d. Improved data sharing: Since, database system is a centralised repository of data
belonging to the entire organisation (all departments), it can be shared by all authorized
users. Existing application programs can share the data in the database. Furthermore, new
application programs can be developed on the existing data in the database to share the same
data and add only that data that is not currently stored, rather having to define all data
requirements again. Therefore, more users and applications can share more of the data.
e. Improved data consistency: Inconsistency is the corollary to redundancy. As explained in
Section 1.8.2 (b), in the file-oriented system, when data is duplicated and the changes made at one site are not propagated to the other site, it results in inconsistency. Such a database supplies incorrect or contradictory information to its users. So, if the redundancy is removed or controlled, the chance of having inconsistent data is also removed or controlled. In a database system, such inconsistencies are avoided to some extent by making the redundancy known to the DBMS. The DBMS ensures that any change made to either of the two entries in the database is automatically applied to the other one as well. This process is known as propagating updates.
f. Improved data integrity: Data integrity means that the data contained in the database is both accurate and consistent. Integrity is usually expressed in terms of constraints, which are consistency rules that the database system should not violate. For example, in Table 1.4, the marriage month (MRG-MTH) in the EMPLOYEE file might be shown as 14 instead of 12. Centralised control of data in the database system ensures that adequate checks are incorporated in the DBMS to avoid such data integrity problems. For example, an integrity check for the data field marriage month (MRG-MTH) can be introduced to restrict it to the range 01 to 12 (a hedged sketch of such a check appears after this list). Another integrity check can be incorporated in the database to ensure that if there is a reference to a certain object, that object must exist. For example, in the case of a bank's automatic teller machine (ATM), a user is not allowed to transfer funds from a nonexistent savings account to a checking account.
g. Improved security: Database security is the protection of database from unauthorised
users. The database administrator (DBA) ensures that proper access procedure is followed,
including proper authentication schemes for access to the DBMS and additional checks
before permitting access to sensitive data. A DBA can define (and the DBMS enforces)
user names and passwords to identify people authorised to use the database. Different levels
of security could be implemented for various types of data and operations. The access of data
by authorised user may be restricted for each type of access (for example, retrieve, insert,
modify, update, delete and so on) to each piece of information in the database. The
enforcement of security could be data-value dependent (for example, a works manager has
access to the performance details of employees in his or her department only), as well as
data-type dependent (but the manager cannot access the sensitive data such as salary details
of any employees, including those in his or her department).
h. Increased productivity of application development: The DBMS provides many of the
standard functions that the application programmer would normally have to write in a file-
oriented application. It provides all the low-level file-handling routines that are typical in
application programs. The provision of these functions allows the application programmer to
concentrate on the specific functionality required by the users without having to worry about
low-level implementation details. DBMSs also provide a high-level (4GL) environment
consisting of productivity tools, such as forms and report generators, to automate some of the
activities of database design and simplify the development of database applications. This
results in increased productivity of the programmer and reduced development time and cost.
i. Enforcement of standards: With central control of the database, a DBA defines and
enforces the necessary standards. Applicable standards might include any or all of the
following: departmental, installation, organisational, industry, corporate, national or
international. Standards can be defined for data formats to facilitate exchange of data
between systems, naming conventions, display formats, report structures, terminology,
documentation standards, update procedures, access rules and so on. This facilitates
communication and cooperation among various departments, projects and users within the
organisation. The data repository provides DBAs with a powerful set of tools for developing
and enforcing these standards.
j. Economy of scale: Centralising all the organisation's operational data into one database and creating a set of application programs that work on this source of data results in drastic cost savings. The DBMS approach permits consolidation of data and applications, and thus reduces the amount of wasteful overlap between the activities of data-processing personnel in
different projects or departments. This enables the whole organisation to invest in more
powerful processors, storage devices or communication gear, rather than having each
department purchase its own (low-end) equipment. Thus, a combined low cost budget is
required (instead of accumulated large budget that would normally be allocated to each
department for file-oriented system) for the maintenance and development of system. This
reduces overall costs of operation and management, leading to an economy of scale.
k. Balance of conflicting requirements: Knowing the overall requirements of the
organisation (instead of the requirements of individual users), the DBA resolves the
conflicting requirements of various users and applications. A DBA can structure the system
to provide an overall service that is best for the organisation. A DBA can choose the best file
structure and access methods to get optimal performance for the response-critical operations,
while permitting less critical applications to continue to use the database (with a relatively
slower response). For example, a physical representation can be chosen for the data in
storage that gives fast access for the most important applications.
l. Improved data accessibility and responsiveness: As a result of integration in database
system, data that crosses departmental boundaries is directly accessible to the end-users. This
provides a system with potentially much more functionality. Many DBMSs provide query
languages or report writers that allow users to ask ad hoc questions and to obtain the required
information almost immediately at their terminal, without requiring a programmer to write
some software to extract this information from the database. For example (from Table 1.4), a
works manager could list from the EMPLOYEE file, all employees belonging to India with a
monthly salary greater than INR 5000 by entering the following SQL command at a terminal,
as shown in Fig. 1.21.

Fig. 1.21 SQL for selected data fields
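
Fig. 1.21 itself is not reproduced here. A minimal sketch of such a query, assuming the EMPLOYEE file of Table 1.4 carries fields named EMP_NAME, COUNTRY and EMP_SALARY (an assumption, since the table is not shown), would be:

   -- List all employees in India earning more than INR 5000 per month
   SELECT EMP_NAME, EMP_SALARY
   FROM   EMPLOYEE
   WHERE  COUNTRY = 'India'
     AND  EMP_SALARY > 5000;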

m. Increased concurrency: DBMSs manage concurrent database access and prevent problems such as loss of information or loss of integrity.
n. Reduced program maintenance: The problems of high maintenance effort required in
file-oriented system, as explained in Section 1.8.2 (g), are reduced in database system. In a
file-oriented environment, the descriptions of data and the logic for accessing data are built
into individual application programs. As a result, changes to data formats and access methods
inevitably result in the need to modify application programs. In database environment, data
are more independent of the application programs.
o. Improved backup and recovery services: DBMS provides facilities for recovering from
hardware or software failures through its back up and recovery subsystem. For example, if
the computer system fails in the middle of a complex update program, the recovery
subsystem is responsible and makes sure that the database is restored to the state it was in
before the program started executing. Alternatively, the recovery subsystem ensures that the
program is resumed from the point at which it was interrupted so that its full effect is
recorded in the database.
p. Improved data quality: The database system provides a number of tools and processes to
improve data quality.
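
As a hedged sketch of the integrity check mentioned in point (f) above, such a range constraint on the marriage month could be declared as follows (the column name and its numeric type are assumptions):

   -- Restrict MRG-MTH to the valid calendar months 01 to 12
   ALTER TABLE EMPLOYEE
       ADD CONSTRAINT CHK_MRG_MTH CHECK (MRG_MTH BETWEEN 1 AND 12);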

1.8.6 Disadvantages of DBMS


In spite of the advantages, the database approach entails some additional
costs and risks that must be recognized and managed when implementing
DBMS. Following are the disadvantages of using DBMS:
a. Increased complexity: A multi-user DBMS becomes an extremely complex piece of
software due to the functionality expected from it. It becomes necessary for database designers, developers, database administrators and end-users to understand this functionality in order to take full advantage of it. Failure to understand the system can lead to bad design decisions, which can
have serious consequences for an organisation.
b. Requirement of new and specialized manpower: Because of rapid changes in database
technology and the organisation's business needs, the organisation needs to hire, train or retrain its manpower on a regular basis to design and implement databases, provide database
administration services and manage a staff of new people. Therefore, an organisation needs
to maintain specialized skilled manpower.
c. Large size of DBMS: The large complexity and wide functionality makes the DBMS an
extremely large piece of software. It occupies many gigabytes of storage disk space and
requires substantial amounts of main memory to run efficiently.
d. Increased installation and management cost: The large and complex DBMS software
has a high initial cost. It requires trained manpower to install and operate and also has
substantial annual maintenance and support costs. Installing such a system also requires
upgrades to the hardware, software and data communications systems in the organisation.
Substantial training of manpower is required on an ongoing basis to keep up with new
releases and upgrades. Additional or more sophisticated and costly database software may be
needed to provide security and to ensure proper concurrent updating of shared data.
e. Additional hardware cost: The cost of DBMS installation varies significantly, depending
on the environment and functionality, size of the hardware (for example, micro-computer,
mini-computer or main-frame computer) and the recurring annual maintenance cost of
hardware and software.
f. Conversion cost: The cost of conversion (both in terms of money and time) from legacy
system (old file-oriented and/or older database technology) to modern DBMS environment is
very high. In some situations, the cost of DBMS and extra hardware may be insignificant
compared with the cost of conversion. This cost includes the cost of training manpower
(staff) to use these new systems and cost of employing specialists manpower to help with the
conversion and running of the system.
g. Need for explicit backup and recovery: For a centralised shared database to be accurate
and available at all times, a comprehensive procedure is required to be developed and used for
providing backup copies of data and for restoring a database when damage occurs. A modern
DBMS normally automates many more of the backup and recovery tasks than a file-oriented
system.
h. Organisational conflict: A centralised and shared database (which is the case with
DBMS) requires a consensus on data definitions and ownership as well as responsibilities for
accurate data maintenance. Past history and experience show that conflicts over data definitions, data formats and coding, rights to update shared data, and associated issues are frequent and often difficult to resolve. Organisational commitment to the
database approach, organisationally astute database administrators and a sound evolutionary
approach to database development is required to handle these issues.

1.9 HISTORICAL PERSPECTIVE OF DATABASE SYSTEMS

From the earliest days of computers, storing and manipulation of data have
been a major application focus. Historically, the initial computer applications
focused on clerical tasks, for example, employee’s payroll calculation, work
scheduling of a manufacturing industry, order and entry processing and so
on. Based on the request from the users, such applications accessed data
stored in computer files, converted stored data into information, and
generated various reports useful for the organisation. These were called file-
based systems. Decades-long evolution in computer technology, data
processing and information management, have resulted into development of
sophisticated modern database system. Due to the needs and demands of
organisations, database technology has developed from the primitive file-
based methods of the fifties to the powerful integrated database systems of
today. The file-based system still exists in specific areas of applications. Fig.
1.22 illustrates the evolution of database system technologies in the last
decades.
During the 1960s, the US President, Mr. Kennedy, initiated a project called the "Apollo Moon Landing", with the objective of landing a man on the moon by the end of that decade. The project was expected to generate a large volume of data, and no system available at that time, including the file-based system, could handle such voluminous data. Database systems were first
introduced during this time to handle such requirements. The North
American Aviation (now known as Rockwell International), which was the
prime contractor for the project, developed software known as the Generalized Update Access Method (GUAM) to meet the voluminous data processing demands of the project. The GUAM software was based on the concept that smaller components come together as parts of larger components, and so on, until the final product is assembled. This structure conformed to an upside-down tree and was named the hierarchical structure. Thereafter, database systems have continued to evolve during subsequent decades.

Fig. 1.22 Evolution of database system technology

However, in the mid-1960s, the first general-purpose DBMS was designed by Charles Bachman at General Electric, USA, and was called the Integrated Data
Store (IDS). IDS formed the basis for the network data model. The network
data model was standardized by the Conference of Data Systems Languages
(CODASYL), comprising representatives of the US government and the
world of business and commerce. It strongly influenced database systems
throughout 1960s. CODASYL formed a List Processing Task Force (LPTF)
in 1965, subsequently renamed the Data Base Task Force (DBTG) in 1967.
The terms of reference for the DBTG were to define standard specifications for an environment that would allow database creation and data
manipulation. Bachman was the first recipient of the computer science
equivalent of the Nobel Prize, called Association of Computing Machinery
(ACM) Turing Award, for work in the database area. He received this award
in 1973 for his work.
IBM joined North American Aviation to develop GUAM into what is known as the Information Management System (IMS) DBMS, released in 1968.
Since serial storage devices, such as magnetic tape, were the market
requirement at that time, IBM restricted IMS to the management of
hierarchies of records to allow the use of these serial storage devices. Later
on, IMS was made usable for other storage devices also and is still the main
hierarchical DBMS for most large mainframe computer installations. IMS
formed the basis for an alternative data representation framework called the
hierarchical data model. The SABRE system for making airline reservations
was jointly developed by American Airlines and IBM around the same time.
It allowed several people to access the same data through a computer
network. Today the SABRE system is being used to power popular web-
based travel services.
The hierarchical model structured data as a directed tree with a root at the
top and leaves at the bottom. The network model structured data as a
directed graph without circuits, a slight generalisation that allowed the
network model to represent certain real-world data structures more easily.
The CODASYL and hierarchical structure represented the first-generation of
DBMSs. During the decade 1970s, the hierarchical and network database
management systems were developed largely to cope with increasingly
complex data structures that were extremely difficult to manage with
conventional file processing methods. Both approaches are still being used
by most organisations today and are called legacy systems. Following were
the major drawbacks with these products:
Queries against the data were difficult to execute, normally requiring a program written by an
expert programmer who understood what could be a complex navigational structure of the
data.
Data independence was very limited, so that programs were not insulated from changes to data formats.
A widely accepted theoretical foundation was not available.

In 1970, Edgar Codd, at IBM's San Jose Research Laboratory, wrote a landmark paper proposing a new data representation framework called the relational data model, along with non-procedural ways of querying data in the relational model. This model, considered the second generation of DBMSs,
received widespread commercial acceptance and diffusion during the 1980s.
It sparked rapid development of several DBMSs based on relational model,
along with a rich body of theoretical results that placed the field on a firm
foundation. In the relational model, all data are represented in the form of tables, and a simple fourth-generation language called Structured Query Language (SQL) is used for data retrieval. The simplicity of the relational
model, the possibility of hiding implementation details completely from the
programmer and ease of access for non-programmers, solved the major
drawbacks of first-generation DBMSs. Codd won the 1981 ACM’s Turing
Award for his seminal work.
The relational model was not used in practice initially because of its
perceived performance disadvantages and remained mainly of academic
interest. It could not match the performance of the existing network and
hierarchical data models. In the 1970s, IBM initiated the System R project, which
developed techniques for the construction of an efficient relational database
system. This led to the development of a fully functional relational database
product, called Structured Query Language / Database System (SQL/DS).
This resulted in relational model consolidating its position as the dominant
DBMS paradigm and database systems continued to gain widespread use.
Relational databases were very easy to use and eventually replaced network
and hierarchical databases. Now there are several relational DBMSs for both
mainframe and PC environments for commercial applications, such as
Ingres from Computer Associates, Informix from Informix Software Inc.,
ORACLE, IBM DB2, Digital Equipment Corporation’s relational database
(DEC Rdb), Access and FoxPro from Microsoft, Paradox from Corel
Corporation, InterBase and BDE from Borland and R-Base from R-Base
Technologies. These databases played an important role in advancing
techniques for efficient processing of declarative queries.
In the late 1980s and early 1990s, SQL was standardised and adopted by
the American National Standards Institute (ANSI) and the International
Organization for Standardization (ISO). During this period, concurrent execution of
database programs, called transactions, became the most widely used form of
concurrent programming. Transaction processing applications are update
intensive, and users write programs as if they are to be run by themselves.
The responsibility for running them concurrently is given to the DBMS. For
his contribution to the field of transaction management in a DBMS, James
Gray won the 1999 ACM’s Turing award.
In the late 1990s, a new era of computing began, marked by client/server
computing, data warehousing, and Internet applications on the World Wide
Web (WWW). During this period, advances were made in many areas of
database systems, and multimedia data (including graphics, sound, images
and video) became increasingly common. Object-oriented databases
(OODBMSs) and object-relational databases (ORDBMSs) were introduced
during this period to cope with such increasingly complex data. Object-
oriented databases are considered third-generation databases.
The emergence of enterprise resource planning (ERP) and management
resource planning (MRP) packages has added a substantial layer of
application-oriented features on top of a DBMS. Widely used ERP and
MRP packages include systems from Baan, Oracle, PeopleSoft, SAP and
Siebel. These packages identify a set of common tasks (for example,
inventory management, financial analysis, human resource planning,
production planning, order management and so on) encountered by large
organisations and provide a general application layer to carry out these tasks.
The DBMS continues to gain importance as more and more data is
brought on-line and made ever more accessible through computer
networking. Today the database field is being driven by exciting visions such
as multimedia databases, interactive video, digital libraries, data mining and
so on.

1.10 DATABASE LANGUAGE

As explained in Section 1.5, to support a variety of users, a DBMS must
provide appropriate languages and interfaces for each category of users to
express database queries and updates. Once the design of database is
complete and a DBMS is chosen to implement the database, it is important
to first specify the conceptual and internal schemas for the database and any
mappings between the two. The following languages are used to specify
database schemas:
Data definition language (DDL)
Data storage definition language (DSDL)
View definition language (VDL)
Data manipulation language (DML)
Fourth-generation language (4GL)

In practice, the data definition and data manipulation languages are not
two separate languages. Instead they simply form parts of a single database
language and a comprehensive integrated language is used such as the
widely used Structured Query Language (SQL). SQL represents a combination
of DDL, VDL and DML, as well as statements for constraint specification
and schema evolution. It includes constructs for conceptual schema
definition, view definition and data manipulation.

1.10.1 Data Definition Language (DDL)


Data definition (also called description) language (DDL) is a special
language used to specify a database conceptual schema using a set of
definitions. It supports the definition or declaration of database objects (or
data elements). DDL allows the DBA or user to describe and name the
entities, attributes and relationships required for the application, together
with any associated integrity and security constraints. Theoretically, different
DDLs are defined for each schema in the three-level schema-architecture
(for example, for conceptual, internal and external schemas). However, in
practice, there is one comprehensive DDL that allows specification of at
least the conceptual and external schemas.
Various techniques are available for writing DDL statements. One
widely used technique is writing the DDL into a text file (similar to a source
program written in a programming language). A DDL
compiler or interpreter then processes the DDL file or statements in order to
identify the descriptions of the schema constructs and to store the schema
description in the DBMS catalog (or tables), in a form the
DBMS can understand. The result of the compilation of DDL statements is a set of tables
stored in special files collectively called the system catalog (explained in Section
1.2.6) or data dictionary.
For example, let us look at the following statements of DDL:

Example 1
CREATE TABLE PRODUCT
(PROD-ID CHAR (6),
PROD-DESC CHAR (20),
UNIT-COST NUMERIC (4));

Example 2
CREATE TABLE CUSTOMER
(CUST-ID CHAR (4),
CUST-NAME CHAR (20),
CUST-STREET CHAR (25),
CUST-CITY CHAR (15),
CUST-BAL NUMERIC (10));

Example 3
CREATE TABLE SALES
(CUST-ID CHAR (4),
PROD-ID CHAR (6),
PROD-QTY NUMERIC (3));

The execution of the above DDL statements will create the PRODUCT,
CUSTOMER and SALES tables, as illustrated in Fig. 1.23 (a), (b) and (c)
respectively.

Fig. 1.23 Table creation using DDL

(a) Table created for PRODUCT (Example 1)

(b) Table created for CUSTOMER (Example 2)


(c) Table created for SALES (Example 3)

1.10.2 Data Storage Definition Language (DSDL)


Data storage definition language (DSDL) is used to specify the internal
schema in the database. The mapping between the conceptual schema (as
specified by DDL) and the internal schema (as specified by DSDL) may be
specified in either one of these languages. In DSDL, the storage structure
and access methods used by the database system are specified by a set of
statements. These statements define the implementation details of the
database schemas, which are usually hidden from the users.
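The SQL standard does not define a separate DSDL; in practice most relational DBMSs expose storage-level choices through statements such as CREATE INDEX or through vendor-specific storage and tablespace clauses. A minimal sketch of such an internal-level definition follows (the index name is illustrative, and the hyphenated names follow the book's figures; a real system would normally use underscores or quoted identifiers):

-- Internal level: build an index to speed up retrieval by city,
-- without changing the conceptual definition of CUSTOMER.
CREATE INDEX CUST-CITY-IDX
ON CUSTOMER (CUST-CITY);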

1.10.3 View Definition Language (VDL)


View definition language (VDL) is used to specify user’s views (external
schema) and their mappings to the conceptual schema. However, in most of
DBMSs, DDL is used to specify both conceptual and external schemas.
There are two views of data. One is the logical view, the form in which the
programmer perceives the data to be. The other is the physical view. This
reflects the way that data is actually stored on disk (or other storage devices).
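In most relational DBMSs the role of a VDL is played by the SQL CREATE VIEW statement. A minimal sketch over the CUSTOMER table of Section 1.10.1 (the view name is illustrative):

-- External schema: a user sees only the name and city of each customer.
CREATE VIEW CUST-LIST AS
SELECT CUST-NAME, CUST-CITY
FROM CUSTOMER;

A user querying CUST-LIST works only with these two attributes; the DBMS maintains the mapping back to the conceptual CUSTOMER table.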

1.10.4 Data Manipulation Language (DML)


Data manipulation language (DML) is a mechanism that provides a set of
operations to support the basic data manipulation operations on the data held
in the database. It is used to retrieve data stored in a database and to express
database queries and updates. In other words, it helps in communicating with the
DBMS. Data manipulation applies to all three (conceptual, internal and
external) levels of schema. The part of DML that provides data retrieval is
called query language.
The DML provides the following functional access (or manipulation
operations) to the database:
Retrieve data and/or records from database.
Add (or insert) records to database files.
Delete records from database files.
Retrieve records sequentially in the key sequence.
Retrieve records in the physically recorded sequence.
Rewrite records that have been updated.
Modify data and/or record in the database files.

For example, let us look at the following statements of DML that are
specified to retrieve data from tables shown in Fig. 1.24.
Fig. 1.24 Retrieve data from tables using DML

(a) PRODUCT table

(b) CUSTOMER table

(c) SALES table

Example 1
SELECT PRODUCT.PROD-DESC
FROM PRODUCT
WHERE PROD-ID = ‘B4432’;

The above query (or DML statement) specifies that those rows from the
table PRODUCT where the PROD-ID is B4432 should be retrieved and the
PROD-DESC attribute of these rows should be displayed on the screen.
Once this query is run for table PRODUCT, as shown in Fig. 1.24 (a), the
result will be displayed on the computer screen as shown below.

Freeze

Example 2

SELECT CUSTOMER.CUST-ID,
CUSTOMER.CUST-NAME
FROM CUSTOMER
WHERE CUST-CITY = ‘Mumbai’;

The above query (or DML statement) specifies that those rows from the
table CUSTOMER where the CUST-CITY is Mumbai will be retrieved. The
CUST-ID and CUST-NAME attributes of these rows will be
displayed on the screen.
Once this query is run for the table CUSTOMER, as shown in Fig. 1.24 (b), the
result will be displayed on the computer screen as shown below.

1001 Waterhouse Ltd.


1010 Concept Shapers

DML query may be used for retrieving information from more than one
table as explained in example 3 below.
Example 3

SELECT CUSTOMER.CUST-NAME,
CUSTOMER.CUST-BAL
FROM CUSTOMER, SALES
WHERE SALES.PROD-ID = ‘B23412’
AND CUSTOMER.CUST-ID = SALES.CUST-ID;

The above query (or DML statement) specifies that those rows from the
tables CUSTOMER and SALES where the PROD-ID = B23412 and CUST-
ID is the same in both the tables will be retrieved, and the CUST-NAME and
CUST-BAL attributes of those rows will be displayed on the screen.
Once this query is run for tables CUSTOMER and SALES, as shown in
Fig. 1.24 (b) and (c), the result will be displayed on the computer screen as
shown below.

KLY System 40000.00

There are two ways of accessing (or retrieving) data from the database. In
one way, an application program issues an instruction (called embedded
statements) to the DBMS to find certain data in the database and returns it to
the program. This is called procedural DML. Procedural DML allows the
user to tell the system what data is needed and exactly how to retrieve the
data. Procedural DML retrieves a record, processes it and retrieves another
record based on the results obtained by this processing and so on. The
process of such retrievals continues until the data request from the retrieval
has been obtained. Procedural DML is embedded in a high-level language,
which contains constructs to facilitate iteration and handle navigational
logic.
In the second way of accessing the data, the person seeking data sits down
at a computer display terminal and issues a command in a special language
(called query) directly to the DBMS to find certain data and returns it to the
display screen. This is called non-procedural DML (or declarative
language). Non-procedural DML allows the user to state what data are
needed, rather than how they are to be retrieved.
DBMS translates a DML statement into a procedure (or set of procedures)
that manipulates the required set of records. This removes the concern of the
user to know how data structures are internally implemented, what
algorithms are required to retrieve and how to transform the data. This
provides users with a considerable degree of data independence.
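The difference can be illustrated by retrieving the Mumbai customers both ways. The first fragment is a sketch of procedural access using embedded SQL with a cursor inside a host program (host-language loop, declarations and host-variable names are illustrative and omitted or simplified); the second is the equivalent non-procedural query, where the DBMS decides how the rows are located:

-- Procedural (embedded) access: the program navigates record by record.
EXEC SQL DECLARE C1 CURSOR FOR
    SELECT CUST-ID, CUST-NAME FROM CUSTOMER WHERE CUST-CITY = 'Mumbai';
EXEC SQL OPEN C1;
EXEC SQL FETCH C1 INTO :W-CUST-ID, :W-CUST-NAME;   -- repeated in a host-language loop
EXEC SQL CLOSE C1;

-- Non-procedural (declarative) access: one statement states what is needed.
SELECT CUST-ID, CUST-NAME
FROM CUSTOMER
WHERE CUST-CITY = 'Mumbai';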

1.10.5 Fourth-generation Language (4GL)


The fourth-generation language (4GL) is a compact (a short-hand type),
efficient and non-procedural programming language that is used to improve
the productivity of the DBMS. In 4GL, the user defines what is to be done
and not how it is to be done. The 4GL depends on higher-level 4GL tools,
which are used by the users to define parameters to generate an application
program. The 4GL has the following components inbuilt in it:
Query languages
Report generators
Spreadsheets
Database languages
Application generators to define operations such as insert, retrieve and update data from the
database to build applications
High-level languages to generate application program.

Structured query language (SQL) and query by example (QBE) are the
examples of fourth-generation language.

1.11 TRANSACTION MANAGEMENT

All work that logically represents a single unit is called a transaction. A
sequence of database operations that represents a logical unit of work is
grouped together as a single transaction, which accesses a database and
transforms it from one state to another. A transaction can update a record,
delete a record, modify a set of records and so on. When the DBMS does a
‘commit’, the changes made by the transaction are made permanent. If the
changes are not to be made permanent, the transaction can be ‘rolled back’ and the
database will remain in its original state.
When updates are performed on a database, we need some way to
guarantee that a set of updates will succeed all at once or not at all.
Transaction ensures that all the work completes or none of it affects the
database. This is necessary in order to keep the database in a consistent state.
For example, a transaction might involve transferring money from a person’s
savings account to a checking account. Although this typically
involves two separate database operations, first a withdrawal from the
savings account and then a deposit into the checking account, it is logically
considered one unit of work. It is not acceptable to do one operation and not
the other, because that would violate the integrity of the database.
Thus, either both the withdrawal and the deposit must be completed (committed) or the
partial transaction must be aborted (rolled back), so that uncompleted work does
not affect the database.
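A sketch of such a transfer in SQL, assuming a hypothetical ACCOUNT table with ACC-NO and BALANCE columns (names and values are illustrative; the statement used to start a transaction varies between DBMSs):

-- Transfer 1000 from savings account 'S101' to checking account 'C202'
-- as one atomic unit of work.
BEGIN TRANSACTION;
UPDATE ACCOUNT SET BALANCE = BALANCE - 1000 WHERE ACC-NO = 'S101';
UPDATE ACCOUNT SET BALANCE = BALANCE + 1000 WHERE ACC-NO = 'C202';
COMMIT;      -- both changes become permanent together
-- If either update fails, issuing ROLLBACK instead restores the original state.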
Consider another example of a railway reservation system in which at any
given instant, it is likely that several travel agents are looking for
information about available seats on various trains and routes and making
new reservations. When several users (travel agents) access the railway
database concurrently, the DBMS must order their requests carefully to avoid
conflicts. For example, when one travel agent looks for a train no. 8314 on
some given day and finds an empty seat, another travel agent may
simultaneously be making a reservation for the same seat, thereby making
the information seen by the first agent obsolete.
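One common way an SQL-based DBMS avoids such a conflict is to lock the row being examined until the reserving transaction finishes. A hedged sketch, assuming a hypothetical SEAT table with TRAIN-NO, SEAT-NO and STATUS columns (seat number and values are illustrative):

BEGIN TRANSACTION;
-- Lock the seat row so a second agent cannot simultaneously read it as free.
SELECT STATUS FROM SEAT
WHERE TRAIN-NO = 8314 AND SEAT-NO = 21
FOR UPDATE;
UPDATE SEAT SET STATUS = 'BOOKED'
WHERE TRAIN-NO = 8314 AND SEAT-NO = 21;
COMMIT;      -- the lock is released and other agents now see the seat as booked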
Through its transaction management feature, a database management system
must protect users from the effects of system failures or crashes. The DBMS
ensures that all data is restored to a consistent state when the system
is restarted after a crash or failure. For example, if the travel agent asks for a
reservation to be made and the DBMS has responded saying that the
reservation has been made, the reservation is not lost even if the system
crashes or fails. On the other hand, if the DBMS has not yet responded to the
request, but is in the process of making the necessary changes to the data
when the crash occurs, the partial changes are not reflected in the database
when the system is restored.
Transaction has, generally, following four properties, called ACID:
Atomicity
Consistency
Isolation
Durability

Atomicity means that either all the work of a transaction or none of it is
applied. With the atomicity property of a transaction, other operations can
only access any of the rows involved in transactional access either before the
transaction occurs or after the transaction is complete, but never while the
transaction is partially complete. Consistency means that the transaction’s
work will represent a correct (or consistent) transformation of the database’s
state. Isolation requires that a transaction not be influenced by changes
made by other concurrently executing transactions. Durability means that the
work associated with a successfully completed transaction is applied to the
database and is guaranteed to survive system or media failures.
Thus, summarising above arguments, we can say that a transaction is a
collection of operations that performs a single logical function in a database
application. Each transaction is a unit of ACID (that is, atomicity,
consistency, isolation and durability). Transaction management plays an
important role in shaping many DBMS capabilities, including concurrency
control, backup and recovery and integrity enforcement. Transaction
management is further discussed in greater detail in Chapter 12.

REVIEW QUESTIONS
1. What is data?
2. What do you mean by information?
3. What are the differences between data and information?
4. What is database and database system? What are the elements of database system?
5. Why do we need a database?
6. What is system catalog?
7. What is database management system? Why do we need a DBMS?
8. What is transaction?
9. What is data dictionary? Explain its function with a neat diagram.
10. What are the components of data dictionary?
11. Discuss active and passive data dictionaries.
12. What is entity and attribute? Give some examples of entities and attributes in a
manufacturing environment.
13. Name some entities and attributes with which an educational institution would be concerned.
14. Name some entities and attributes related to a personnel department and storage warehouse.
15. Why are relationships between entities important?
16. Describe the relationships among the entities you have found in Questions 13 and 14.
17. Outline the advantages of implementing database management system in an organisation.
18. What is the difference between a data definition language and a data manipulation language?
19. The data file shown in Table 1.6 is used in the data processing system of M/s ABC Motors
Ltd., which makes cars of different models.

Table 1.6 Data file of M/s ABC Motors Ltd.

a. Name one of the entities described in the data file. How would you describe the
entity set?
b. What are the attributes of the entities? Choose one of the entities and describe it.
c. Choose one of the attributes and discuss the nature of the set of values that it can
take.

20. What do you mean by redundancy? What is the difference between controlled and
uncontrolled redundancy? Illustrate with examples.
21. Define the following terms:

a. Data
b. Database
c. Database system
d. DBMS
e. Database catalog
f. DBA
g. Metadata
h. DA
i. End user
j. Security
k. Data Independence
l. Data Integrity
m. Files
n. Records
o. Data warehouse.

22. Who is a DBA? What are the responsibilities of a DBA?


23. With a neat diagram, explain the organisation of a database.
24. List some examples of database systems.
25. Describe a file-oriented system and its approach taken to the handling of data. Give some
examples of file-oriented system.
26. Discuss advantages and disadvantages of file-oriented system.
27. Compare file-oriented system and database system.
28. List five significant differences between a file-oriented system and a DBMS.
29. Describe the main characteristics of the database approach in contrast with the file-oriented
approach.
30. Describe various components of DBMS environment and discuss how they relate to each
other.
31. Describe the different types of database languages and their functions in database system.
32. Discuss the roles of the following personnel in the database environment:

a. Data administrator
b. Database administrator
c. Application developer
d. End users.

33. Discuss the advantages and disadvantages of a DBMS.


34. Explain the difference between external, internal and conceptual schemas.
35. Describe with diagram the three-layer data structure that is generally used for data warehouse
applications.
36. Give historical perspective of database system.
37. When the following SQL command is given, what will be the effect of retrieval on the
EMPLOYEE database of M/s KLY System Ltd. of Table 1.7.

(a) SELECT EMP-NO, EMP-LNAME, EMP-FNAME, DEPT
FROM EMPLOYEE
WHERE SALARY >= 4000;
(b) SELECT EMP-FNAME, EMP-LNAME, DEPT, TEL-NO
FROM EMPLOYEE
WHERE EMP-NO = 123456;
(c) SELECT EMP-NO, EMP-FNAME, DEPT, SALARY
FROM EMPLOYEE
WHERE EMP-LNAME = ‘Kumar’;
(d) SELECT EMP-NO, EMP-LNAME, EMP-FNAME
FROM EMPLOYEE
WHERE SALARY >= 7000;

Table 1.7 EMPLOYEE file of M/s KLY System Ltd.

38. Show the effects of the following SQL operations on the EMPLOYEE file of M/s KLY System
Ltd. of Table 1.7.

(a) INSERT INTO EMPLOYEE (EMP-NO, EMP-LNAME, EMP-FNAME, SALARY, COUNTRY, BIRTH-CITY,
DEPT, TEL-NO)
VALUES (221333, ‘Deo’, ‘Kapil’, 8800, ‘IND’, ‘Kolkata’,
‘HR’, 3342217);
(b) UPDATE EMPLOYEE
SET DEPT = ‘DP’
WHERE EMP-NO = 123243;
(c) DELETE
FROM EMPLOYEE
WHERE EMP-NO = 106519;
(d) UPDATE EMPLOYEE
SET SALARY = SALARY + 1500
WHERE DEPT = ‘MFG’;
39. Write SQL statements to perform the following operations on the EMPLOYEE data file of M/s
KLY System Ltd., of Table 1.7.

a. Get employee’s number, employee’s name and telephone number for all employees
of DP department.
b. Get employee’s number, employee’s name, department and telephone number for all
employees of Indian origin.
c. Add 250 in the salary of employees belonging to USA.
d. Remove all records of employees getting salary of more than 6000.
e. Add a new employee details whose details are as follows: employee no.: 106520,
last name: Joseph, first name: Gorge, salary: 8200, country: AUS, birth place:
Melbourne, department: DP, and telephone no.: 334455661

40. List the DDL statements to be given to create three tables shown in Fig. 1.25.

Fig. 1.25 Database tables

(a) PRODUCT table

(b) CUSTOMER table


(c) SALES table

41. Show the effects of the following DML statements, which retrieve data from the tables shown
in Fig. 1.24.

(a) SELECT PRODUCT.PROD-DESC
FROM PRODUCT
WHERE PROD-ID = ‘A2983455’;
(b) SELECT CUSTOMER.CUST-ID,
CUSTOMER.CUST-NAME
FROM CUSTOMER
WHERE CUST-CITY = ‘Chicago’;
(c) SELECT CUSTOMER.CUST-NAME,
CUSTOMER.CUST-BAL
FROM CUSTOMER, SALES
WHERE SALES.PROD-ID = ‘B4433234’
AND CUSTOMER.CUST-ID = SALES.CUST-ID;

42. A personnel department of an enterprise has the structure of an EMPLOYEE data file, as shown in
Table 1.8.
Table 1.8 EMPLOYEE data file of an enterprise

a. How many records does the file contain, and how many fields are there per record?
b. What data redundancies do you detect and how could these redundancies lead to
anomalies?
c. If you wanted to produce a listing of the database file contents by the last name,
city’s name, country’s name and telephone number, how would you alter the file
structure?
d. What problem would you encounter if you wanted to produce a listing by city? How
would you solve this problem by altering the file structure?

43. What could be the entities of interest to the following enterprises?

a. Technical university
b. Public library
c. General hospital
d. Departmental store
e. Fastfood restaurant
f. Software marketing company.

For each such entity set, list the attributes that could be used to model each of the entities.
What are some of the applications that may be automated for the above enterprise using a
DBMS?
44. Datasoft Inc. is an enterprise involved in the design, development, testing and marketing of
software for the auto industry (two-wheelers). What entities are of interest to such an enterprise?
Give a list of these entities and the relationships among them.
45. Some of the entities relevant to a technical university are given below.

a. STUDENT and ENGG-BRANCH (students register for engg branches).
b. BOOK and BOOK-COPY (books have copies).
c. ENGG-BRANCH and SECTION (branches have sections).
d. SECTION and CLASS-ROOM (sections are scheduled in class rooms).
e. FACULTY and ENGG-BRANCH (faculty teach in a particular branch).

For each of them, indicate the type of relationship existing among them (for example, one-to-
one, one-to-many or many-to-many). Draw a relationship diagram for each of them.
STATE TRUE/FALSE

1. Data is also called metadata.


2. Data is a piece of fact.
3. Data are distinct pieces of information.
4. In DBMS, data files are the files that store the database information.
5. The external schema defines how and where the data are organised in a physical data storage.
6. A collection of data designed for use by different users is called a database.
7. In a database, data integrity can be maintained.
8. The data in a database cannot be shared.
9. The DBMS provides support languages used for the definition and manipulation of the data
in the database.
10. Data catalog and data dictionary are the same.
11. The data catalog is required to get information about the structure of the database.
12. A database cannot avoid data inconsistency.
13. Using database redundancy can be reduced.
14. Security restrictions cannot be applied in a database system.
15. Data and metadata are the same.
16. Metadata is also known as data about data.
17. A system catalog is a repository of information describing the data in the database.
18. The information stored in the catalog is called metadata.
19. DBMSs manage concurrent database access and prevent the problems of loss of
information or loss of integrity.
20. View definition language is used to specify user views (external schema) and their mappings
to the conceptual schema.
21. Data storage definition language is used to specify the conceptual schema in the database.
22. Structured query language (SQL) and query by example (QBE) are the examples of fourth-
generation language.
23. A transaction cannot update a record, delete a record, modify a set of records and so on.

TICK (✓) THE APPROPRIATE ANSWER

1. Which of the following is related to information?

a. data
b. communication
c. knowledge
d. all of these.

2. Data is:

a. a piece of fact
b. metadata
c. information
d. none of these.
3. Which of the following is element of the database?

a. data
b. constraints and schema
c. relationships
d. all of these.

4. What represents a correspondence between the various data elements?

a. data
b. constraints
c. relationships
d. schema.

5. Which of the following is an advantage of using database system?

a. security enforcement
b. avoidance of redundancy
c. reduced inconsistency
d. all of these.

6. Which of the following is characteristic of the data in a database?

a. independent
b. secure
c. shared
d. all of these.

7. The name of the system database that contains descriptions of data in the database is:

a. data dictionary
b. metadata
c. table
d. none of these.

8. Following is the type of metadata:

a. operational
b. EDW
c. data mart
d. all of these.

9. System catalog is a system-created database that describes:

a. database objects
b. data dictionary information
c. user access information
d. all of these.
10. A file is a collection of:

a. related records
b. related fields
c. related data items
d. none of these.

11. Relationships could be of the following type:

a. one-to-one relationship
b. one-to-many relationships
c. many-to-many relationships
d. all of these.

12. In a file-oriented system there is:

a. data inconsistency
b. duplication of data
c. data dependence
d. all of these.

13. In a database system there is:

a. increased productivity
b. improved security
c. economy of scale
d. all of these.

14. In a database system there is:

a. large size of DBMS


b. increased overall costs
c. increased complexity
d. all of these.

15. IDS formed the basis for the:

a. network model
b. hierarchical model
c. relational model
d. all of these.

16. Recipient of ACM Turing Award in 1981 was:

a. Bachman
b. Codd
c. James Gray
d. None of them.
17. DSDL is used to specify:

a. internal schema
b. external schema
c. conceptual schema
d. none of these.

18. VDL is used to specify:

a. internal schema
b. external schema
c. conceptual schema
d. none of these.

19. The DML provides following functional access to the database:

a. retrieve data and/or records


b. add (or insert) records
c. delete records from database files
d. all of these.

20. 4GL has the following components inbuilt in it:

a. query languages
b. report generators
c. spreadsheets
d. all of these.

FILL IN THE BLANKS

1. _____ is the most critical resource of an organisation.


2. Data is a raw _____ whereas information is _____.
3. A _____ is a software that provides services for accessing a database.
4. Two important languages in the database system are (a) _____ and (b) _____.
5. To access information from a database, one needs a _____.
6. DBMS stands for _____.
7. SQL stands for _____.
8. 4GL stands for _____.
9. The three-layer data structures for data warehouse applications are (a) _____, (b) _____ and
(c) _____.
10. DDL stands for _____.
11. DML stands for _____.
12. Derived data are stored in _____.
13. The four components of data dictionary are (a) _____ , (b) _____ , (c) _____ and (d) _____.
14. The four types of keys used are (a) _____, (b) _____, (c) _____ and (d) _____.
15. The two types of data dictionaries are (a) _____ and (b) _____.
16. CODASYL stands for _____.
17. LPTF stands for _____.
18. DBTG stands for _____.
19. In the mid-1960s, the first general-purpose DBMS, designed by Charles Bachman at General
Electric, USA, was called _____.
20. The first recipient of the computer science equivalent of the Nobel Prize, the Association for
Computing Machinery (ACM) Turing Award, for work in the database area, in 1973, was
_____.
21. When the DBMS does a commit, the changes made by the transaction are made _____.
Chapter 2

Database System Architecture

2.1 INTRODUCTION

An organisation requires accurate and reliable data and an efficient database
system for effective decision-making. To achieve this goal, the organisation
maintains records for its varied operations by building appropriate database
models and by capturing essential properties of the objects and record
relationship. Users of a database system in the organisation look for an
abstract view of data they are interested in. Furthermore, since database is a
shared resource, each user may require a different view of the data held in
the database. Therefore, one of the main aims of a database system is to
provide users with an abstract view of data, hiding certain details of how
data is stored and manipulated. To satisfy these needs, we need to develop
architecture for the database systems. The database architecture is a
framework in which the structure of the DBMS is described.
The DBMS architecture has evolved from early-centralised monolithic
systems to the modern distributed DBMS system with modular design.
Large centralised mainframe computers have been replaced by hundreds of
distributed workstations and personal computers connected via
communications networks. In the early systems, the whole DBMS package
was a single, tightly integrated system, whereas the modern DBMS is based
on client-server system architecture. Under the client-server system
architecture, the majority of the users of the DBMS are not present at the site
of the database system, but are connected to it through a network. On server
machines, the database system runs, whereas on client machines (which are
typically workstations or personal computers) remote database users work.
The client-server architecture is explained in detail in Section 2.8.3
in this chapter.
Database applications are usually partitioned into a two-tier
architecture or a three-tier architecture, as shown in Fig. 2.1. In a two-tier
architecture, the application is partitioned into a component that resides at
the client machines, which invokes database system functionality at the
server machine through query language statements. Application program
interface standards are used for interaction between the client and the server.
In a three-tier architecture, the client machine acts as merely a front-end
and does not contain any direct database calls. Instead, the client end
communicates with an application server, usually through a forms interface.
The application server in turn communicates with a database system to
access data. The business logic of the application, which says what actions to
carry out and under what conditions, is embedded in the application server,
instead of being distributed across multiple clients. Three-tier architectures
are more appropriate for large applications and for applications that run on
the World Wide Web (WWW).
Not every database system can be fitted or matched to a particular
framework, and no single framework can be said to be the only possible
one for defining database architecture. However, in this chapter,
a generalised architecture of the database system, which fits most
systems reasonably well, will be discussed.
Fig. 2.1 Database system architectures

2.2 SCHEMAS, SUBSCHEMAS AND INSTANCES

When the database is designed to meet the information needs of an
organisation, the plan (or scheme) of the database and the actual data to be stored
in it become the most important concerns of the organisation. It is important
to note that the data in the database changes frequently, while the plans
remain the same over long periods of time (although not necessarily
forever). The database plans consist of types of entities that a database deals
with, the relationships among these entities and the ways in which the
entities and relationships are expressed from one level of abstraction to the
next level for the users’ view. The users’ view of the data (also called logical
organisation of data) should be in a form that is most convenient for the
users and they should not be concerned about the way data is physically
organised. Therefore, a DBMS should do the translation between the logical
(users’ view) organisation and the physical organisation of the data in the
database.
2.2.1 Schema
The plan (or formulation of scheme) of the database is known as schema.
Schema gives the names of the entities and attributes. It specifies the
relationship among them. It is a framework into which the values of the data
items (or fields) are fitted. The plan or format of the schema remains the
same, but the values fitted into this format change from instance to
instance. In other terms, a schema means an overall plan of all the data item
(field) types and record types stored in a database. A schema includes the
definition of the database name, the record types and the components that
make up those records. Let us look at Fig. 1.23 and assume that it is a sales
record database of M/s ABC, a manufacturing company. The structure of the
database consisting of three files (or tables) namely, PRODUCT,
CUSTOMER and SALES files is the schema of the database. A database
schema corresponds to the variable declarations (along with associated type
definitions) in a program. Fig. 2.2 shows a schema diagram for the database
structure shown in Fig. 1.23. The schema diagram displays the structure of
each record type but not the actual instances of records. Each object in the
schema, for example, PRODUCT, CUSTOMER or SALES, is called a
schema construct.
Fig. 2.2 Schema diagram for database of M/s ABC Company

(a) Schema diagram for sales record database

(b) Schema defined using database language

Fig. 2.3 shows the schema diagram and the relationships for another
example of purchasing system of M/s KLY System. The purchasing system
schema has five records (or objects), namely PURCHASE-ORDER,
SUPPLIER, PURCHASE-ITEM, QUOTATION and PART. Solid arrows
connecting different blocks show the relationships among the objects. For
example, the PURCHASE-ORDER record is connected to the PURCHASE-
ITEM records of which that purchase order is composed and the SUPPLIER
record to the QUOTATION records showing the parts that supplier can
provide and so forth. The dotted arrows show the cross-references between
attributes (or data items) of different objects or records.
As can be seen in Fig. 2.3 (c), the duplication of attributes is avoided
using relationships and cross-referencing. For example, the attributes SUP-
NAME, SUP-ADD and SUP-DETAILS are included in separate SUPPLIER
record and not in the PURCHASE-ORDER record. Similarly, attributes such
as PART-NAME, PART-DETAILS and QTY-ON-HAND are included in
separate PART record and not in the PURCHASE-ITEM record. Thus, the
duplication of including PART-DETAILS and SUPPLIERS in every
PURCHASE-ITEM is avoided. With the help of relationships and cross-
referencing, the records are linked appropriately with each other to complete
the information and data is located quickly.
Fig. 2.3 Schema diagram for database of M/s KLY System

(a) Schema diagram of purchasing system database

(b) Schema defined using database language


(c) Schema relationship diagrams

The database system can have several schemas partitioned according to
the levels of abstraction. In general, a schema can be categorised in two parts:
(a) a logical schema and (b) a physical schema. The logical schema is
concerned with exploiting the data structures offered by a DBMS in order to
make the scheme understandable to the computer. The physical schema, on
the other hand, deals with the manner in which the conceptual database shall
get represented in the computer as a stored database. The logical schema is
the most important as programs use it to construct applications. The physical
schema is hidden beneath the logical schema and can usually be changed
easily without affecting application programs. DBMSs provide a data
definition language (DDL) and a data storage definition language (DSDL)
in order to make the specification of both the logical and physical schema
easy for the DBA.

2.2.2 Subschema
A subschema is a subset of the schema and inherits the same property that a
schema has. The plan (or scheme) for a view is often called subschema.
Subschema refers to an application programmer’s (user’s) view of the data
item types and record types which he or she uses. It gives the user a
window through which he or she can view only that part of the database
which is of interest to him or her. In other words, a subschema defines the portion of
the database as “seen” by the application programs that actually produced
the desired information from the data contained within the database.
Therefore, different application programs can have different view of data.
Fig. 2.4 shows subschemas viewed by two different application programs
derived from the example of Fig. 2.3.
As shown in Fig. 2.4, the SUPPLIER-MASTER record of first application
program {Fig. 2.4 (a)} now contains additional attributes such as SUP-
NAME and SUP-ADD from SUPPLIER record of Fig. 2.3 and the
PURCHASE-ORDER-DETAILS record contains additional attributes such
as PART-NAME, SUP-NAME and PRICE from two records PART and
SUPPLIER respectively. Similarly, ORDER-DETAILS record of second
application program {Fig. 2.4 (b)} contains additional attributes such as
SUP-NAME and QTY-ORDRD from the two records SUPPLIER and
PURCHASE-ITEM respectively.
Individual application programs can change their respective subschema
without affecting the subschema views of others. The DBMS software derives
the subschema data requested by application programs from schema data.
The database administrator (DBA) ensures that the subschema requested by
application programs is derivable from schema.
Fig. 2.4 Subschema views of two applications programs

(a) Subschema for first application program

(b) Subschema for second application program

The application programs are not concerned about the physical
organisation of data. The physical organisation of data in the database can
change without affecting application programs. In other words, with the
change in physical organisation of data, application programs for subschema
need not be changed or modified. Subschemas also act as a unit for
enforcing controlled access to the database, for example, it can bar a user of
a subschema from updating a certain value in the database but allows him to
read it. Further, the subschema can be made basis for controlling concurrent
operations on the database. Subschema definition language (SDL) is used to
specify a subschema in the DBMS. The nature of this language depends
upon the data structure on which a DBMS is based and also upon the host
language within which DBMS facilities are used. The subschema is
sometimes referred to as an LVIEW or logical view. Many different
subschemas can be derived from one schema.
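In a relational DBMS, a subschema of this kind is typically realised as a view combined with access privileges. A minimal sketch based on the SUPPLIER record of Fig. 2.3 (the user name PURCHASE-CLERK is illustrative); granting SELECT but not UPDATE gives the read-only behaviour described above:

-- Subschema: a restricted window on the SUPPLIER record.
CREATE VIEW SUPPLIER-MASTER AS
SELECT SUP-NAME, SUP-ADD
FROM SUPPLIER;

GRANT SELECT ON SUPPLIER-MASTER TO PURCHASE-CLERK;   -- read access only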

2.2.3 Instances
When the schema framework is filled in with data item values, the contents
of the database at any point of time (the current contents) are referred to as
an instance of the database. An instance is also called a state of the
database or a snapshot. Each variable has a particular value at a given instant.
The values of the variables in a program at a point in time correspond to an
instance of a database schema, as shown in Fig. 2.5.
The difference between database schema and database state or instance is
very distinct. In the case of a database schema, it is specified to DBMS when
new database is defined, whereas at this point of time, the corresponding
database state is empty with no data in the database. Once the database is
first populated with the initial data, from then on, we get another database
state whenever an update operation is applied to the database. At any point
of time, the current state of the database is called the instance.
Fig. 2.5 Instance of the database of M/s ABC Company

(a) Instance of the PRODUCT relation

(b) Instance of the CUSTOMER relation

(c) Instance of the SALES relation
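To make the distinction concrete, the CREATE TABLE statements of Fig. 1.23 define the schema of the PRODUCT relation, while the rows present at a given moment form an instance. A sketch with illustrative data (the values shown are hypothetical and not taken from Fig. 2.5):

-- The schema of PRODUCT (defined once):
--   CREATE TABLE PRODUCT (PROD-ID CHAR (6), PROD-DESC CHAR (20), UNIT-COST NUMERIC (4));
-- One possible instance of PRODUCT, produced by update operations:
INSERT INTO PRODUCT VALUES ('B4432', 'Freeze', 900);
INSERT INTO PRODUCT VALUES ('A1201', 'Mixer', 750);
-- After these inserts the current state (instance) has two rows;
-- a later INSERT, UPDATE or DELETE yields another instance of the same schema.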

2.3 THREE-LEVEL ANSI-SPARC DATABASE ARCHITECTURE

In 1971, the Database Task Group (DBTG), appointed by the
Conference on Data Systems Languages (CODASYL), produced the first
proposal for a general architecture for database systems. The DBTG proposed
a two-tier architecture as shown in Fig. 2.1 (a) with a system view called the
schema and user views called subschemas. In 1975, ANSI-SPARC
(American National Standards Institute — Standards Planning and
Requirements Committee) produced a three-tier architecture with a system
catalog. The architecture of most commercial DBMSs available today is
based to some extent on ANSI-SPARC proposal.
ANSI-SPARC three-tier database architecture is shown in Fig. 2.6. It
consists of the following three levels:
Internal level,
Conceptual level,
External level.

Fig. 2.6 ANSI-SPARC three-tier database structure

The view at each of the above levels is described by a scheme or schema.
As explained in Section 2.2, a schema is an outline or plan that describes the
records, attributes and relationships existing in the view. The terms view,
scheme and schema are used interchangeably. A data definition language
(DDL), as explained in Section 1.10.1, is used to define the conceptual and
external schemas. Structured query language (SQL) commands are used to
describe aspects of the physical (or internal) schema. Information about
the internal, conceptual and external schemas is stored in the system catalog,
as explained in Section 1.2.6.

Fig. 2.7 CUSTOMER record definition

(a) CUSTOMER record

(b) Integrated record definition of CUSTOMER record

Let us take an example of the CUSTOMER record of Fig. 2.2, as shown in
Fig. 2.7 (a). The integrated record definition of the CUSTOMER record is
shown in Fig. 2.7 (b). The data has been abstracted in three levels
corresponding to three views (namely internal, conceptual and external
views), as shown in Fig. 2.8. The lowest level of abstraction of data contains
a description of the actual method of storing data and is called the internal
view, as shown in Fig. 2.8 (c). The second level of abstraction is the
conceptual or global view, as shown in Fig. 2.8 (b). The third level is the
highest level of abstraction seen by the user or application program and is
called the external view or user view, as shown in Fig. 2.8 (a). The
conceptual view is the sum total of the user (or external) views of the data.
Fig. 2.8 Three views of the data

(a) Logical records

(b) Conceptual records

(c) Internal record

From Fig. 2.8, the following explanations can be derived:


At the internal or physical level, as shown in Fig. 2.8 (c), customers are represented by a
stored record type called STORED-CUST, which is 74 characters (or bytes) long.
CUSTOMER record contains five fields or data items namely CUST-ID, CUST-NAME,
CUST-STREET, CUST-CITY and CUST-BAL corresponding to five properties of customers.
At the conceptual or global level, as shown in Fig. 2.8 (b), the database contains information
concerning an entity type called CUSTOMER. Each individual customer has a CUST-ID (4
digits), CUST-NAME (20 characters), CUST-STREET (40 characters), CUST-CITY (10
characters) and CUST-BAL (8 digits).
The user view 1 in Fig. 2.8 (a) has an external schema of the database in which each
customer is represented by a record containing two fields or data items namely CUST-NAME
and CUST-CITY. The other three fields are of no interest to this user and have therefore been
omitted.
The user view 2 in Fig. 2.8 (a) has an external schema of the database in which each
customer is represented by a record containing three fields or data items namely CUST-ID,
CUST-NAME and CUST-BAL. The other two fields are of no interest to this user and have
thus been omitted.
There is only one conceptual schema and one internal schema per database.

2.3.1 Internal Level


Internal level is the physical representation of the database on the computer
and this view is found at the lowest level of abstraction of database. This
level indicates how the data will be stored in the database and describes the
data structures, file structures and access methods to be used by the database.
It describes the way the DBMS and the operating system perceive the data in
the database. Fig. 2.8 (c) shows internal view record of a database. Just
below the internal level there is physical level data organisation whose
implementation is covered by the internal level to achieve routine
performance and storage space utilization. The internal schema defines the
internal level (or view). The internal schema contains the definition of the
stored record, the method of representing the data fields (or attributes),
indexing and hashing schemes and the access methods used. Internal level
provides coverage to the data structures and file organisations used to store
data on storage devices.
Essentially, internal schema summarizes how the relations described in the
conceptual schema are actually stored on secondary storage devices such as
disks and tapes. It interfaces with the operating system access methods (also
called file management techniques for storing and retrieving data records) to
place the data on the storage devices, build the indexes, retrieve the data and
so on. Internal level is concerned with the following activities:
Storage space allocation for data and indexes.
Record descriptions for storage with stored sizes for data items.
Record placement.
Data compression and data encryption techniques.

The process of arriving at a good internal (or physical) schema is called
physical database design. The internal schema is written using SQL or
internal data definition language (internal DDL).

2.3.2 Conceptual Level


The conceptual level is the middle level in the three-tier architecture. At this
level of database abstraction, all the database entities and relationships
among them are included. Conceptual level provides the community view of
the database and describes what data is stored in the database and the
relationships among the data. It contains the logical structure of the entire
database as seen by the DBA. One conceptual view represents the entire
database of an organisation. It is a complete view of the data requirements of
the organisation that is independent of any storage considerations. The
conceptual schema defines conceptual view. It is also called the logical
schema. There is only one conceptual schema per database. Fig. 2.8 (b)
shows conceptual view record of a database. This schema contains the
method of deriving the objects in the conceptual view from the objects in the
internal view. Conceptual level is concerned with the following activities:
All entities, their attributes and their relationships.
Constraints on the data.
Semantic information about the data.
Checks to retain data consistency and integrity.
Security information.

The conceptual level supports each external view, in that any data
available to a user must be contained in, or derived from, the conceptual
level. However, this level must not contain any storage-dependent details.
For example, the description of an entity should contain only data types of
attributes (for example, integer, real, character and so on) and their length
(such as the maximum number of digits or characters), but not any storage
consideration, such as the number of bytes occupied. The choice of relations
and the choice of fields (or data items) for each relation are not always obvious.
The process of arriving at a good conceptual schema is called conceptual
database design. The conceptual schema is written using conceptual data
definition language (conceptual DDL).

2.3.3 External Level


The external level is the user’s view of the database. This level is at the
highest level of data abstraction where only those portions of the database of
concern to a user or application program are included. In other words, this
level describes that part of the database that is relevant to the user. Any
number of user views, even identical, may exist for a given conceptual or
global view of the database. Each user has a view of the “real world”
represented in a form that is familiar for that user. The external view
includes only those entities, attributes and relationships in the “real world”
that the user is interested in. Other entities, attributes and relationships that
are not of interest to the user, may be represented in the database, but the
user will be unaware of them. Fig. 2.8 (a) shows external or user view record
of a database.
In the external level, the different views may have different
representations of the same data. For example, one user may view data in the
form as day, month, year while another may view as year, month, day. Some
views might include derived or calculated data, that is, data is not stored in
the database but are created when needed. For example, the average age of
an employee in an organisation may be derived or calculated from the
individual age of all employees stored in the database. External views may
include data combined or derived from several entities.
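Such a derived item can be defined as part of an external view. A sketch assuming a hypothetical EMPLOYEE relation with DEPT and AGE attributes (the age attribute is not part of the running example):

-- External-level object: average employee age per department,
-- computed on demand rather than stored in the database.
CREATE VIEW DEPT-AVG-AGE AS
SELECT DEPT, AVG(AGE) AS AVG-AGE
FROM EMPLOYEE
GROUP BY DEPT;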
An external schema describes each external view. The external schema
consists of the definition of the logical records and the relationships in the
external view. It also contains the method of deriving the objects (for
example, entities, attributes and relationships) in the external view from the
object in the conceptual view. External schemas allow data access to be
customized at the level of individual users or groups of users. Any given
database has exactly one internal or physical schema and one conceptual
schema because it has just one set of stored relations, as shown in Fig. 2.8
(a) and (b). But, it may have several external schemas, each tailored to a
particular group of users, as shown in Fig. 2.8 (a). The external schema is
written using external data definition language (external DDL).

2.3.4 Advantages of Three-tier Architecture


The main objective of the three-tier database architecture is to isolate each
user’s view of the database from the way the database is physically stored or
represented. Following are the advantages of a three-tier database
architecture:
Each user is able to access the same data but have a different customized view of the data as
per their own needs. Each user can change the way he or she views the data and this change
does not affect other users of the same database.
The user is not concerned about the physical data storage details. The user’s interaction with
the database is independent of physical data storage organisation.
The internal structure of the database is unaffected by changes to the physical storage
organisation, such as changeover to a new storage device.
The database administrator (DBA) is able to change the database storage structures without
affecting the user’s view.
The DBA is able to change the conceptual structure of the database without affecting all
users.

2.3.5 Characteristics of Three-tier Architecture


Table 2.1 shows degree of abstraction, characteristics and type of DBMS
used for the three levels.
Table 2.1 Features of three-tier structure

2.4 DATA INDEPENDENCE

Data independence (briefly discussed in Section 1.8.5 (b)) is a major
objective of implementing a DBMS in an organisation. It may be defined as
the immunity of application programs to change in physical representation
and access techniques. Alternatively, data independence is the characteristic
of a database system to change the schema at one level without having to
change the schema at the next higher level. In other words, the application
programs do not depend on any one particular physical representation or
access technique. This characteristic of DBMS insulates the application
programs from changes in the way the data is structured and stored. The data
independence is achieved by the DBMS through the use of the three-tier
architecture of data abstraction. There are two types of data independence as
shown in the mapping of three-tier architecture of Fig. 2.9.
i. Physical data independence.
ii. Logical data independence.

2.4.1 Physical Data Independence


Immunity of the conceptual (or external) schemas to changes in the internal
schema is referred to as physical data independence. In physical data
independence, the conceptual schema insulates the users from changes in the
physical storage of the data. Changes to the internal schema, such as using
different file organisations or storage structures, using different storage
devices, modifying indexes or hashing algorithms, must be possible without
changing the conceptual or external schemas. In other words, physical data
independence indicates that the physical storage structures or devices used
for storing the data could be changed without necessitating a change in the
conceptual view or any of the external views. The change is absorbed by
conceptual/internal mapping, as discussed in Section 2.5.1.
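Concretely, an internal-level change such as adding or dropping an index alters how records are located, but queries written against the conceptual schema stay the same. A minimal sketch (index name as in the DSDL sketch of Section 1.10.2; DROP INDEX syntax varies slightly between DBMSs):

-- Internal schema change only:
DROP INDEX CUST-CITY-IDX;
-- The application's query against the conceptual schema is unaffected:
SELECT CUST-NAME
FROM CUSTOMER
WHERE CUST-CITY = 'Mumbai';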

Fig. 2.9 Mappings of three-tier architecture

2.4.2 Logical Data Independence


Immunity of the external schemas (or application programs) to changes in
the conceptual schema is referred to as logical data independence. In logical
data independence, the users are shielded from changes in the logical
structure of the data or changes in the choice of relations to be stored.
Changes to the conceptual schema, such as the addition and deletion of
entities, addition and deletion of attributes, or addition and deletion of
relationships, must be possible without changing existing external schemas
or having to rewrite application programs. Only the view definition and the
mapping need be changed in a DBMS that supports logical data
independence. It is important that the users for whom the changes have been
made should not be concerned. In other words, the application programs that
refers to the external schema constructs must work as before, after the
conceptual schema undergoes a logical reorganisation.
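For example, adding an attribute to the conceptual schema should leave existing external schemas untouched. A minimal sketch (the new column name is illustrative):

-- Conceptual schema change: a new attribute is added to CUSTOMER.
ALTER TABLE CUSTOMER ADD CUST-TEL CHAR (12);
-- External schemas defined earlier, such as the CUST-LIST view of Section 1.10.3,
-- continue to work unchanged because they do not reference the new column.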

2.5 MAPPINGS

The three schemas and their levels discussed in Section 2.3 are the
description of data that actually exists in the physical database. In the three-
schema architecture database system, each user group refers only to its own
external schema. Hence, the user’s request specified at external schema level
must be transformed into a request at conceptual schema level. The
transformed request at conceptual schema level should be further
transformed at internal schema level for final processing of data in the stored
database as per user’s request. The final result from processed data as per
user’s request must be reformatted to satisfy the user’s external view. The
process of transforming requests and results between the three levels are
called mappings. The database management system (DBMS) is responsible
for this mapping between internal, conceptual and external schemas. The
three-tier architecture of ANSI-SPARC model provides the following two-
stage mappings as shown in Fig. 2.9:
Conceptual/Internal mapping
External/Conceptual mapping

2.5.1 Conceptual/Internal Mapping


The conceptual schema is related to the internal schema through
the conceptual/internal mapping. The conceptual/internal mapping defines the
correspondence between the conceptual view and the stored database. It
specifies how conceptual records and fields are presented at the internal
level. It enables DBMS to find the actual record or combination of records in
physical storage that constitute a logical record in the conceptual schema,
together with any constraints to be enforced on the operations for that logical
record. It also allows any differences in entity names, attribute names,
attribute orders, data types, and so on, to be resolved. In case of any change
in the structure of the stored database, the conceptual/internal mapping is
also changed accordingly by the DBA, so that the conceptual schema can
remain invariant. Therefore, the effects of changes to the database storage
structure are isolated below the conceptual level in order to preserve the
physical data independence.

2.5.2 External/Conceptual Mapping


Each external schema is related to the conceptual schema by the
external/conceptual mapping. The external/conceptual mapping defines the
correspondence between a particular external view and the conceptual view.
It gives the correspondence among the records and relationships of the external and conceptual views. It enables the DBMS to map names in the user's view onto the relevant part of the conceptual schema. Any number of external views can exist at the same time, any number of users can share a given external view, and different external views can overlap.
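
The following hedged SQL sketch (using the same illustrative EMPLOYEE table as earlier; the view and column names are assumptions) shows two overlapping external views defined over one conceptual table, with the external/conceptual mapping resolving the names used in each view to the conceptual attributes.

-- Two overlapping external views over the same conceptual table.
-- The mapping resolves STAFF_NO and NAME to EMP_ID and EMP_NAME.
CREATE VIEW PAYROLL_VIEW (STAFF_NO, NAME) AS
    SELECT EMP_ID, EMP_NAME FROM EMPLOYEE;

CREATE VIEW DEPT_VIEW (NAME, DEPARTMENT) AS
    SELECT EMP_NAME, DEPT_NAME FROM EMPLOYEE;
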
There could be one mapping between conceptual and internal levels and
several mappings between external and conceptual levels. The
conceptual/internal mapping is the key to physical data independence while
the external/conceptual mapping is the key to the logical data independence.
Fig. 2.9 illustrates the three-tier ANSI-SPARC architecture with mappings.
The information about the mappings among the various schema levels is included in the system catalog of the DBMS. The DBMS uses additional software to accomplish the mappings by referring to the mapping information in the system catalog. When the schema at some level is changed, the schema at the next higher level remains unchanged; only the mapping between the two levels is changed. Thus, data independence is accomplished. The two-stage mapping of the ANSI-SPARC three-tier structure provides greater data independence, but the mapping overhead makes access less efficient. ANSI-SPARC therefore also allows the direct mapping of external schemas onto the internal schema (bypassing the conceptual schema), which is more efficient but provides reduced data independence (it is more data-dependent).

2.6 STRUCTURE, COMPONENTS, AND FUNCTIONS OF DBMS

As discussed in Chapter 1, Section 1.5, a database management system


(DBMS) is highly complex and sophisticated software that handles access to
the database. The structure of a DBMS varies greatly from system to system and, therefore, it is not possible to give a single generalised component structure of a DBMS.

2.6.1 Structure of a DBMS


A typical structure of a DBMS with its components and relationships
between them is shown in Fig. 2.10. The DBMS software is partitioned into
several modules. Each module or component is assigned a specific operation
to perform. Some of the functions of the DBMS rely on basic services provided by the operating system (OS), on top of which the DBMS is built. The physical data and the system catalog are stored on a physical disk. Access to the disk is controlled primarily by the OS, which schedules disk input/output. Therefore, while designing a DBMS, its interface with the OS must be taken into account.

2.6.2 Execution Steps of a DBMS


As shown in Fig. 2.10, conceptually, the following logical steps are carried out while executing a user's request to access the database system:
Fig. 2.10 Structure of DBMS

i. Users issue a query using a particular database language, for example, SQL commands.
ii. The submitted query is presented to a query optimiser, which uses information about how the data is stored to produce an efficient execution plan for evaluating the query.
iii. The DBMS accepts the user's SQL commands and analyses them.
iv. The DBMS produces the query evaluation plan using the external schema for the user, the corresponding external/conceptual mapping, the conceptual schema, the conceptual/internal mapping, and the storage structure definition. An evaluation plan is thus a blueprint for evaluating a query.
v. The DBMS executes these plans against the physical database and returns the answers to the users.

Using components such as transaction manager, buffer manager, and


recovery manager, the DBMS supports concurrency and crash recovery by
carefully scheduling user requests and maintaining a log of all changes to
the database.
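
Many relational DBMSs let the user inspect the evaluation plan produced in the steps above. The sketch below is hedged: the EXPLAIN keyword is not part of the SQL standard, and its name and output format differ from product to product; the EMPLOYEE table is the illustrative one used earlier.

-- Step (i): the query submitted by the user.
SELECT EMP_NAME FROM EMPLOYEE WHERE DEPT_NAME = 'Maintenance';

-- Inspecting the plan chosen by the optimiser (product-specific keyword).
EXPLAIN SELECT EMP_NAME FROM EMPLOYEE WHERE DEPT_NAME = 'Maintenance';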

2.6.3 Components of a DBMS


As explained in Section 2.6.2, the DBMS accepts the SQL commands
generated from a variety of user interfaces, produces query evaluation plans,
executes these plans against the database, and returns the answers. As shown
in Fig. 2.10, the major software modules or components of DBMS are as
follows:
i. Query processor: The query processor transforms user queries into a series of low-level
instructions directed to the run time database manager. It is used to interpret the online user’s
query and convert it into an efficient series of operations in a form capable of being sent to
the run time data manager for execution. The query processor uses the data dictionary to find
the structure of the relevant portion of the database and uses this information in modifying
the query and preparing an optimal plan to access the database.
ii. Run time database manager: Run time database manager is the central software
component of the DBMS, which interfaces with user-submitted application programs and
queries. It handles database access at run time. It converts operations in user’s queries
coming directly via the query processor or indirectly via an application program from the
user’s logical view to a physical file system. It accepts queries and examines the external and
conceptual schemas to determine what conceptual records are required to satisfy the user's request. The run time data manager then places a call to the physical database to perform the request. It enforces constraints to maintain the consistency and integrity of the data, as well as its security. It also performs backup and recovery operations. The run time database manager
is sometimes referred to as the database control system and has the following components:

Authorization control: The authorization control module checks that the user has
necessary authorization to carry out the required operation.
Command processor: The command processor processes the queries passed by
authorization control module.
Integrity checker: The integrity checker checks the necessary integrity constraints for all the requested operations that change the database.
Query optimizer: The query optimizer determines an optimal strategy for the
query execution. It uses information on how the data is stored to produce an
efficient execution plan for evaluating query.
Transaction manager: The transaction manager performs the required processing of the operations it receives from transactions. It ensures that transactions request and release locks according to a suitable locking protocol, and it schedules the execution of transactions.
Scheduler: The scheduler is responsible for ensuring that concurrent operations
on the database proceed without conflicting with one another. It controls the relative
order in which transaction operations are executed.
Data manager: The data manager is responsible for the actual handling of data in
the database. This module has the following two components:
Recovery manager: The recovery manager ensures that the database remains in a consistent
state in the presence of failures. It is responsible for (a) transaction commit and abort operations,
(b) maintaining a log, and (c) restoring the system to a consistent state after a crash.
Buffer manager: The buffer manager is responsible for the transfer of data between the main
memory and secondary storage (such as disk or tape). It brings in pages from the disk to the main memory as needed in response to user read requests. The buffer manager is sometimes referred to as the cache manager.

iii. DML processor: Using a DML compiler, the DML processor converts the DML
statements embedded in an application program into standard function calls in the host
language. The DML compiler converts the DML statements written in a host programming
language into object code for database access. The DML processor must interact with the
query processor to generate the appropriate code.
iv. DDL processor: Using a DDL compiler, the DDL processor converts the DDL statements
into a set of tables containing metadata. These tables contain the metadata concerning the
database and are in a form that can be used by other components of the DBMS. These tables
are then stored in the system catalog while control information is stored in data file headers.
The DDL compiler processes schema definitions, specified in the DDL and stores description
of the schema (metadata) in the DBMS system catalog. The system catalog includes
information such as the names of data files, data items, storage details of each data file,
mapping information amongst schemas, and constraints.
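
As a hedged illustration of the metadata kept by the DDL processor, many relational products expose the system catalog through the SQL-standard INFORMATION_SCHEMA views queried below; other products use their own catalog tables, and the schema and table names here are assumptions.

-- Listing the tables recorded in the system catalog for one schema.
SELECT TABLE_NAME
FROM   INFORMATION_SCHEMA.TABLES
WHERE  TABLE_SCHEMA = 'UNIVERSITY';

-- Listing the columns and data types recorded for one table.
SELECT COLUMN_NAME, DATA_TYPE
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_NAME = 'EMPLOYEE';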

2.6.4 Functions and Services of DBMS


As discussed in Chapter 1, Section 1.8.5, the DBMS offers several
advantages over file-oriented systems. A DBMS performs several important
functions that guarantee integrity and consistency of data in the database.
Most of these functions are transparent to end-users. Fig. 2.11 illustrates the
functions and services provided by a DBMS.
Fig. 2.11 Functions of DBMS

i. Data Storage Management: The DBMS creates the complex structures required for data
storage in the physical database. It provides a mechanism for management of permanent
storage of the data. The internal schema defines how the data should be stored by the storage
management mechanism and the storage manager interfaces with the operating system to
access the physical storage. This relieves the users from the difficult task of defining and
programming the physical data characteristics. The DBMS provides storage not only for the data, but also for related data entry forms or screen definitions, report definitions, data validation rules, procedural code, structures to handle video and picture formats, and so on.
ii. Data Manipulation Management: A DBMS furnishes users with the ability to retrieve,
update and delete existing data in the database or to add new data to the database. It includes
a DML processor component (as shown in Fig. 2.10) to deal with the data manipulation
language (DML).
iii. Data Definition Services: The DBMS accepts the data definitions such as external
schema, the conceptual schema, the internal schema, and all the associated mappings in
source form. It converts them to the appropriate object form using a DDL processor
component (as shown in Fig. 2.10) for each of the various data definition languages (DDLs).
iv. Data Dictionary/System Catalog Management: The DBMS provides a data dictionary or
system catalog function in which descriptions of data items are stored and which is accessible
to users. As explained in Chapter 1, Section 1.2.6 and 1.3, a system catalog or data dictionary
is a system database, which is a repository of information describing the data in the database.
It is the data about the data or metadata. All of the various schemas and mappings and all of
the various security and integrity constraints, in both source and object forms, are stored in
the data dictionary. The system catalog is automatically created by the DBMS and consulted
frequently to resolve user requests. For example, the DBMS will consult the system catalog
to verify that a requested table exists and that the user issuing the request has the necessary
access privileges.
v. Database Communication Interfaces: The end-user’s requests for database access (may
be from remote location through internet or computer workstations) are transmitted to DBMS
in the form of communication messages. The DBMS provides special communication
routines designed to allow the database to accept end-user requests within a computer
network environment. The response to the end user is transmitted back from DBMS in the
form of such communication messages. The DBMS integrates with a communication
software component called data communication manager (DCM), which controls such
message transmission activities. Although, the DCM is not a part of DBMS, both work in
harmony in which the DBMS looks after the database and the DCM handles all messages to
and from the DBMS.
vi. Authorisation / Security Management: The DBMS protects the database against
unauthorized access, either intentional or accidental. It furnishes mechanism to ensure that
only authorized users can access the database. It creates a security system that enforces user
security and data privacy within the database. Security rules determine which users can
access the database, which data items each user may access and which data operations (read,
add, delete and modify) the user may perform. This is especially important in a multi-user environment where many users can access the database simultaneously. The DBMS monitors
user requests and rejects any attempts to violate the security rules defined by the DBA. It
monitors and controls the level of access for each user and the operations that each user can
perform on the data depending on the access privileges or access rights of the users.
There are many ways for a DBMS to identify legitimate users. The most common method is
to establish accounts with passwords. Some DBMSs use data encryption mechanisms to
ensure the information written to disk cannot be read or changed unless the user provides the
encryption key that unscrambles the data. Some DBMSs also provide users with the ability to
instruct the DBMS, via user exits, to employ custom-written routines to encode the data. In
some cases, organisations may be interested in conducting security audits, particularly if they
suspect the database may have been tampered with. Some DBMSs provide audit trails, which are traces or logs that record various kinds of database access activities (for example, unsuccessful access attempts). Security management is discussed in further detail in Chapter 14.
vii. Backup and Recovery Management: The DBMS provides mechanisms for backing up
data periodically and recovering from different types of failures. This prevents the loss of
data. It ensures that the aborted or failed transactions do not create any adverse effect on the
database or other transactions. The recovery mechanisms of DBMSs make sure that the
database is returned to a consistent state after a transaction fails or aborts due to a system
crash, media failure, hardware or software errors, power failure, and so on. Many DBMSs
enable users to make full or partial backups of their data. A full backup saves all the data in
the target resource, such as the entire file or an entire database. These are useful after a large
quantity of work has been completed, such as loading data into a newly created database.
Partial, or incremental, backups usually record only the data that has been changed since the
last full backup. These are less time-consuming than full backups and are useful for capturing
periodic changes. Some DBMSs support online backups, enabling a database to be backed up
while it is open and in use. This is important for applications that require support for
continuous operations and cannot afford to have the database inaccessible. Recovery
management is discussed in further detail in Chapter 13.
viii. Concurrency Control Services: Since DBMSs support sharing of data among multiple
users, they must provide a mechanism for managing concurrent access to the database.
DBMSs ensure that the database is kept in a consistent state and that the integrity of the data is preserved. They ensure that the database is updated correctly when multiple users are updating the database concurrently. Concurrency control is discussed in further detail in Chapter 12.
ix. Transaction Management: A transaction is a series of database operations, carried out by
a single user or application program, which accesses or changes the contents of the database.
Therefore, a DBMS must provide a mechanism to ensure either that all the updates
corresponding to a given transaction are made or that none of them is made. A detailed discussion on transaction management has been given in Chapter 1, Section 1.11, and transaction processing is covered further in Chapter 12. (A short SQL sketch of data definition, security and transaction statements appears after this list.)
x. Integrity Services: As discussed in Chapter 1, Section 1.5 (f), database integrity refers to
the correctness and consistency of stored data and is especially important in transaction-oriented database systems. Therefore, a DBMS must provide means to ensure that both the
data in the database and changes to the data follow certain rules. This minimises data
redundancy and maximises data consistency. The data relationships stored in the data
dictionary are used to enforce data integrity. Various types of integrity mechanisms and
constraints may be supported to help ensure that the data values within a database are valid,
that the operations performed on those values are valid and that the database remains in a
consistent state.
xi. Data Independence Services: As discussed in Chapter 1, Section 1.8.5 (b) and Section
2.4, a DBMS must support the independence of programs from the actual structure of the
database.
xii. Utility Services: The DBMS provides a set of utility services used by the DBA and the
database designer to create, implement, monitor and maintain the database. These utility
services help the DBA to administer the database effectively.
xiii. Database Access and Application Programming Interfaces: All DBMSs provide
interface to enable applications to use DBMS services. They provide data access via
structured query language (SQL). The DBMS query language contains two components: (a) a
data definition language (DDL) and (b) a data manipulation language (DML). As discussed
in Chapter 1, Section 1.10, the DDL defines the structure in which the data are stored and the
DML allows end users to extract the data from the database. The DBMS also provides data
access to application programmers via procedural (3GL) languages such as C, PASCAL,
COBOL, Visual BASIC and others.
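
The short SQL sketch below, referred to in the transaction management item above, shows how several of these services are invoked through the DDL and DML. It is illustrative only: the table, column and user names are assumptions, and the transaction-start keyword (START TRANSACTION, BEGIN, or BEGIN TRANSACTION) varies between products.

-- Data definition and integrity services: constraints declared with the schema.
CREATE TABLE ACCOUNT (
    ACC_NO  INTEGER PRIMARY KEY,
    CUST_ID INTEGER NOT NULL,
    BALANCE DECIMAL(12,2) CHECK (BALANCE >= 0)
);

-- Authorisation/security management: access privileges granted to a user.
GRANT SELECT, UPDATE ON ACCOUNT TO CLERK1;

-- Transaction management: either both updates are made or neither is;
-- a failure before COMMIT causes the recovery manager to roll the work back.
START TRANSACTION;
UPDATE ACCOUNT SET BALANCE = BALANCE - 500 WHERE ACC_NO = 101;
UPDATE ACCOUNT SET BALANCE = BALANCE + 500 WHERE ACC_NO = 202;
COMMIT;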

2.7 DATA MODELS

A model is an abstraction process that concentrates (or highlights) the essential and inherent aspects of the organisation's applications while ignoring (or hiding) superfluous or accidental details. It is a representation of the real
world objects and events and their associations. A data model (also called
database model) is a mechanism that provides this abstraction for database
application. It represents the organisation itself. It provides the basic concepts and notations that allow database designers and end-users to communicate their understanding of the organisational data unambiguously and accurately. Data modelling is used for representing entities of
interest and their relationships in the database. It allows the
conceptualisation of the association between various entities and their
attributes. A data model is a conceptual method of structuring data. It provides mechanisms to structure the data for the entities being modelled (consisting of a set of rules according to which databases can be constructed), allows a set of manipulation operations (for example, updating or retrieving data from the database) to be defined on them, and enforces a set of constraints (or integrity rules) to ensure the accuracy of the data.
To summarise, we can say that a data model is a collection of
mathematically well-defined concepts that help an enterprise to consider and
express the static and dynamic properties of data intensive applications. It
consists of the following:
Static properties, for example, objects, attributes and relationships.
Integrity rules over objects and operations.
Dynamic properties, for example, operations or rules defining new database states based on
applied state changes.

Data models can be broadly classified into the following three categories:
Record-based data models
Object-based data models
Physical data models

Most commercial DBMSs support a single data model but the data models
supported by different DBMSs differ.

2.7.1 Record-based Data Models


Record-based data models are used to specify the overall logical structure of the database. In record-based models, the database consists of a number of fixed-format records, possibly of different types. Each record type defines a fixed number of fields, each typically of a fixed length. Data integrity constraints cannot be explicitly specified using record-based data models. There are three principal types of record-based data models:
Hierarchical data model.
Network data model.
Relational data model.

2.7.2 Object-based Data Models


Object-based data models are used to describe data and its relationships. They use concepts such as entities, attributes and relationships, whose definitions have already been explained in Chapter 1, Section 1.3.1. They have flexible data structuring capabilities, and data integrity constraints can be explicitly specified using object-based data models. Following are the common types of object-
based data models:
Entity-relationship.
Semantic.
Functional.
Object-oriented.

The entity-relationship (E-R) data model is one of the main techniques for database design and is widely used in practice. The object-oriented data
models extend the definition of an entity to include not only the attributes
that describe the state of the object but also the actions that are associated
with the object, that is, its behaviour.

2.7.3 Physical Data Models


Physical data models are used for a lower-level description of storage structures and access mechanisms. They describe how data is stored in the computer, representing information such as record structures, record orderings and access paths. It is possible to implement the database at the system level using physical data models. There are comparatively few physical data models. The most common physical data models are as follows:
Unifying model.
Frame memory model.
2.7.4 Hierarchical Data Model
The hierarchical data model is represented by an upside-down tree. The user
perceives the hierarchical database as a hierarchy of segments. A segment is
the equivalent of a file system’s record type. In a hierarchical data model, the
relationship between the files or records forms a hierarchy. In other words,
the hierarchical database is a collection of records that is perceived as
organised to conform to the upside-down tree structure. Fig. 2.12 shows a
hierarchical data model. A tree may be defined as a set of nodes such that
there is one specially designated node called the root (node), which is
perceived as the parent (like a family tree having parent-child or an
organisation tree having owner-member relationships between record types)
of the segments directly beneath it. The remaining nodes are partitioned into disjoint sets and are perceived as children of the segment above them. Each disjoint set is in turn a tree and a sub-tree of the root. At the root of the tree
is the single parent. The parent can have none, one or more children. A
hierarchical model can represent a one-to-many relationship between two
entities where the two are respectively parent and child. The nodes of the
tree represent record types. If we define the root record type to level-0, then
the level of its dependent record types can be defined as being level-1. The
dependents of the record types at level-1 are said to be at level-2 and so on.

Fig. 2.12 Hierarchical data model

A hierarchical path that traces the parent segments to the child segments,
beginning from the left, defines the tree shown in Fig. 2.12. For example, the
hierarchical path for segment ‘E’ can be traced as ABDE, tracing all
segments from the root starting at the leftmost segment. This left-traced path
is known as preorder traversal or the hierarchical sequence. As can be noted from Fig. 2.12, each parent can have many children but each child has only one parent.
Fig. 2.13 (a) shows a hierarchical data model of a UNIVERSITY tree type
consisting of three levels and three record types such as DEPARTMENT,
FACULTY and COURSE. This tree contains information about university
academic departments along with data on all faculties for each department
and all courses taught by each faculty within a department. Fig. 2.13 (b)
shows the defined fields or data types for department, faculty, and course
record types. A single department record at the root level represents one
instance of the department record type. Multiple instances of a given record
type are used at lower levels to show that a department may employ many
(or no) faculties and that each faculty may teach many (or no) courses. For
example, we have a COMPUTER department at the root level and as many instances of the FACULTY record type as there are faculties in the computer department. Similarly, there will be as many COURSE record instances for
each FACULTY record as that faculty teaches. Thus, there is a one-to-many
(1:m) association among record instances, moving from the root to the
lowest level of the tree. Since there are many departments in the university,
there are many instances of the DEPARTMENT record type, each with its
own FACULTY and COURSE record instances connected to it by
appropriate branches of the tree. This database then consists of a forest of
such tree instances; as many instances of the tree type as there are
departments in the university at any given time. Collectively, these comprise a single hierarchical database, and multiple such databases may be online at a time.
Fig. 2.13 Hierarchical data model relationship of university tree type

Suppose we are interested in adding information about departments to our


hierarchical database. For example, since the departments teach various subjects, we may want to keep a record of the subjects offered by each department in the university. In that case, we would expand the diagram of
Fig. 2.13 to look like that of Fig. 2.14. DEPARTMENT is still related to
FACULTY which is related to COURSE. DEPARTMENT is also related to
SUBJECT which is related to TOPIC. We see from this diagram that
DEPARTMENT is at the top of a hierarchy from which a large amount of
information can be derived.
Fig. 2.14 Hierarchical relationship of department with faculty and subject

The hierarchical database model is one of the oldest database models, used by enterprises in the past. Information Management System (IMS), developed jointly by IBM and North American Rockwell for the mainframe computer platform, was one of the first hierarchical databases. IMS became
the world’s leading hierarchical database system in the 1970s and early
1980s. Hierarchical database model was the first major commercial
implementation of a growing pool of database concepts that were developed
to counter the computer file system’s inherent shortcomings.

2.7.4.1 Advantages of Hierarchical Data Model


Following are the advantages of hierarchical data model:
Simplicity: Since the database is based on the hierarchical structure, the relationship
between the various layers is logically (or conceptually) simple, and the design of a hierarchical database is simple.
Data sharing: Because all data are held in a common database, data sharing becomes
practical.
Data security: Hierarchical model was the first database model that offered the data
security that is provided and enforced by the DBMS.
Data independence: The DBMS creates an environment in which data independence can
be maintained. This substantially decreases the programming effort and program
maintenance.
Data integrity: Given the parent/child relationship, there is always a link between the
parent segment and its child segments under it. Because the child segments are automatically
referenced to its parent, this model promotes data integrity.
Efficiency: The hierarchical data model is very efficient when the database contains a large
volume of data in one-to-many (1:m) relationships and when the users require large numbers
of transactions using data whose relationships are fixed over time.
Available expertise: Due to a large number of available installed mainframe computer
base, experienced programmers were available.
Tried business applications: There was a large number of tried-and-true business
applications available within the mainframe environment.

2.7.4.2 Disadvantages of Hierarchical Data Model


Implementation complexity: Although the hierarchical database is conceptually simple, easy to design and free of data-independence problems, it is quite complex to implement. The
DBMS requires knowledge of the physical level of data storage and the database designers
should have very good knowledge of the physical data storage characteristics. Any changes
in the database structure, such as the relocation of segments, require changes in all
applications programs that access the database. Therefore, implementation of a database
design becomes very complicated.
Inflexibility: A hierarchical database lacks flexibility. The changes in the new relations or
segments often yield very complex system management tasks. A deletion of one segment
may lead to the involuntary deletion of all the segments under it. Such an error could be very
costly.
Database management problems: If you make any changes to the database structure of
the hierarchical database, then you need to make the necessary changes in all the application
programs that access the database. Thus, maintaining the database and the applications can
become very difficult.
Lack of structural independence: Structural independence exists when changes to the database structure do not affect the DBMS's ability to access data. The hierarchical
database is known as a navigational system because data access requires that the preorder
traversal (a physical storage path) be used to navigate to the appropriate segments. So the
application programmer should have a good knowledge of the relevant access paths to access
the data from the database. Modifications or changes in the physical structure can lead to the
problems with application programs, which will also have to be modified. Thus, in a hierarchical database system the benefits of data independence are limited by structural dependence.
Application programming complexity: Applications programming is very time
consuming and complicated. Due to the structural dependence and the navigational structure,
the application programmers and the end-users must know precisely how the data is
distributed physically in the database and how to write lines of control codes in order to
access data. This requires knowledge of complex pointer systems, which is often beyond the
grasp of ordinary users who have little or no programming knowledge.
Implementation limitation: Many of the common relationships do not conform to the one-to-many relationship format required by the hierarchical database model. For example, each
student enrolled at a university can take many courses, and each course can have many
students. Thus, such many-to-many (n:m) relationships, which are more common in real life,
are very difficult to implement in a hierarchical data model.
No standards: There is no precise set of standard concepts for the hierarchical data model, nor does its implementation conform to a specific standard.
Extensive programming effort: Use of the hierarchical model requires extensive programming activity, and therefore it has been called a system created by programmers for programmers. Modern data processing environments do not accept such an approach.

2.7.5 Network Data Model


The Database Task Group of the Conference on Data System Languages
(DBTG/CODASYL) formalized the network data model in the late 1960s.
The network data models were eventually standardised as the CODASYL
model. The network data model is similar to a hierarchical model except that
a record can have multiple parents. The network data model has three basic components: record types, data items (or fields) and links. Further, in network model terminology, a relationship is called a set, and each set is composed of at least two record types. The first record type is called the owner record, which is equivalent to the parent in the hierarchical model. The second record type is called the member record, which is equivalent to the child in the
hierarchical model. The connection between an owner and its member
records is identified by a link to which database designers assign a set-name.
This set-name is used to retrieve and manipulate data. Just as the branches of
a tree in the hierarchical data models represent access path, the links
between owners and their members indicate access paths in network models
and are typically implemented with pointers. In the network data model, a member can appear in more than one set and thus may have several owners; the model therefore facilitates many-to-many (n:m) relationships. A set
represents a one-to-many (1:m) relationship between the owner and the
member.
Fig. 2.15 Network data model

Fig. 2.15 shows a diagram of network data model. It can be seen in the
diagram that member ‘B’ has only one owner ‘A’ whereas member ‘E’ has
two owners namely ‘B’ and ‘C’. Fig. 2.16 illustrates an example of
implementing network data model for a typical sales organisation in which
CUSTOMER, SALES_REPRESENTATIVE, INVOICE, INVOICE_LINE,
PRODUCT and PAYMENT represent record types. It can be seen in Fig. 2.16 that INVOICE_LINE is owned by both PRODUCT and INVOICE.
Similarly, INVOICE has two owners namely SALES_REPRESENTATIVE
and CUSTOMER. In network data model, each link between two record
types represents a one-to-many (1:m) relationship between them.

Fig. 2.16 Network data model for a sales organisation

Unlike the hierarchical data model, network data model supports multiple
paths to the same record, thus avoiding the data redundancy problem
associated with hierarchical system.

2.7.5.1 Advantages of Network Data Model


Simplicity: Similar to hierarchical data model, network model is also simple and easy to
design.
Facilitating more relationship types: The network model facilitates in handling of one-to-
many (1:m) and many-to-many (n:m) relationships, which helps in modelling the real life
situations.
Superior data access: Data access and flexibility are superior to those found in the hierarchical data model. An application can access an owner record and all the member records within a set. If a member record in the set has two or more owners (like a faculty member working for two departments), then one can move from one owner to another.
Database integrity: Network model enforces database integrity and does not allow a
member to exist without an owner. First of all, the user must define the owner record and
then the member.
Data independence: The network data model provides sufficient data independence by at
least partially isolating the programs from complex physical storage details. Therefore,
changes in the data characteristics do not require changes in the application programs.
Database standards: Unlike the hierarchical model, the network data model is based on the universal standards formulated by DBTG/CODASYL and augmented by ANSI-SPARC. All network data models conform to these standards, which also include a DDL and a DML.

2.7.5.2 Disadvantages of Network Data Model


System complexity: Like the hierarchical data model, the network model also provides a navigational access mechanism to the data, in which the data are accessed one record at a time. This mechanism makes the system implementation very complex. Consequently, the
DBAs, database designers, programmers and end users must be familiar with the internal
data structure in order to access the data and take advantage of the system’s efficiency. In
other words, network database models are also difficult to design and use properly.
Absence of structural independence: It is difficult to make changes in a network
database. If changes are made to the database structure, all subschema definitions must be
revalidated before any applications programs can access the database. In other words,
although the network model achieves data independence, it does not provide structural
independence.
Not user-friendly: The network data model is not designed to be user-friendly; it is a highly skill-oriented system.

2.7.6 Relational Data Model


E.F. Codd of IBM Research first introduced the relational data model in a
paper in 1970. The relational data model is implemented using very
sophisticated Relational Database Management System (RDBMS). The RDBMS performs the same basic functions as the hierarchical and network DBMSs, plus a host of other functions that make the relational data model easier to understand and implement. The relational data model simplified the
user’s view of the database by using simple tables instead of the more
complex tree and network structures. It is a collection of tables (also called
relations) as shown in Fig. 2.17 (a) in which data is stored. Each of the tables
is a matrix of a series of row and column intersections. Tables are related to each other by sharing a common entity characteristic (attribute). For example, a
CUSTOMER table might contain an AGENT-ID that is also contained in the
AGENT table, as shown in Fig. 2.17 (a) and (b).
Even though the customer and agent data are stored in two different
tables, the common link between the CUSTOMER and AGENT tables,
which is AGENT-ID, helps in connecting or matching of the customer to its
sales agent. Although tables are completely independent of one another, data
between the tables can be easily connected using common links. For
example, the agent of customer “Lions Distributors” of CUSTOMER table
can be retrieved as “Greenlay & Co.” from AGENT table with the help of a
common link AGENT-ID, which is AO-9999. Further details on the relational data model are given in Chapter 4.
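
A hedged SQL sketch of the linkage in Fig. 2.17: the hyphenated field names of the figure are written with underscores to form legal SQL identifiers, and the exact column names (CUST_NAME, AGENT_NAME) are assumptions, since the figure's full field lists are not reproduced here.

-- Retrieving the agent of customer 'Lions Distributors' through the
-- common link AGENT_ID shared by the CUSTOMER and AGENT tables.
SELECT A.AGENT_NAME
FROM   CUSTOMER C
       JOIN AGENT A ON C.AGENT_ID = A.AGENT_ID
WHERE  C.CUST_NAME = 'Lions Distributors';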

2.7.6.1 Advantages of Relational Data Model


Simplicity: A relational data model is even simpler than hierarchical and network models.
It frees the designers from the actual physical data storage details, thereby allowing them to
concentrate on the logical view of the database.
Structural independence: Unlike hierarchical and network models, the relational data
model does not depend on the navigational data access system. Changes in the database
structure do not affect the data access.
Ease of design, implementation, maintenance and uses: The relational model provides
both structural independence and data independence. Therefore, it makes the database design,
implementation, maintenance and usage much easier.
Flexible and powerful query capability: The relational database model provides very
powerful, flexible, and easy-to-use query facilities. Its structured query language (SQL)
capability makes ad hoc queries a reality.
Fig. 2.17 Relational data model

(a) Relational Tables

(b) Linkage between relational tables

2.7.6.2 Disadvantages of Relational Data Model


Hardware overheads: Relational data models need more powerful computing hardware and data storage devices to perform the RDBMS-assigned tasks. Consequently, they tend to be slower than the other database systems. However, with rapid advancements in computing technology and the development of much more efficient operating systems, this disadvantage is fading.
Easy-to-design capability leading to bad design: The easy-to-use features of relational databases can result in untrained people generating queries and reports without much understanding of, or thought for, proper database design. As the database grows, poor design results in a slower system, degraded performance and data corruption.

2.7.7 Entity-Relationship (E-R) Data Model


An entity-relationship (E-R) model is a logical database model, which gives a logical representation of data for an enterprise or business establishment. It was introduced by Chen in 1976. In the E-R data model, a collection of objects of similar structure is called an entity set. The relationship between entity sets is represented on the basis of the number of entities from one entity set that can be associated with the number of entities of another set, such as one-to-one (1:1), one-to-many (1:n), or many-to-many (n:m) relationships, as explained in Chapter 1, Section 1.3.1.3. The E-R model is shown graphically as an E-R diagram.
Fig. 2.18 shows building blocks or symbols to represent E-R diagram. The
rectangular boxes represent entity, ellipses (or oval boxes) represent
attributes (or properties) and diamonds represent relationship (or association)
among entity sets. There is no industry standard notation for developing E-R
diagram. However, the notations or symbols of Fig. 2.18 are widely used
building blocks for E-R diagram.
Fig. 2.18 Building blocks (symbols) of E-R diagram

Fig. 2.19 (a) illustrates a typical E-R diagram for a product sales
organisation called M/s ABC & Co. This organisation manufactures various
products, which are sold to the customers against an order. Fig. 2.19 (b)
shows data items and records of entities. According to the E-R diagram of
Fig. 2.19 (a), a customer having identification no. 1001, name Waterhouse
Ltd. with address Box 41, Mumbai [as shown in Fig. 2.19 (b)], is an entity
since it uniquely identifies one particular customer. Similarly, a product
A1234 with a description Steel almirah and unit cost of 4000 is an entity
since it uniquely identifies one particular product and so on.
Now the set of all products of M/s ABC & Co. (all records in the PRODUCT table of Fig. 2.19 (b)) is defined as the entity set PRODUCT.
Similarly, the entity set CUSTOMER represents the set of all the customers
of M/s ABC & Co. and so on. An entity set is represented by set of attributes
(called data items or fields). Each rectangular box represents an entity for
example, PRODUCT, CUSTOMER and ORDER. Each ellipse (or oval
shape) represents attributes (or data items or fields). For example, attributes
of entity PRODUCT are PROD-ID, PROD-DESC and UNIT-COST.
CUSTOMER entity contains attributes such as CUST-ID, CUST-NAME and
CUST-ADDRESS. Similarly, entity ORDER contains attributes such as
ORD-DATE, PROD-ID and PROD-QTY. There is a set of permitted values
for each attribute, called the domain of that attribute, as shown in Fig. 2.19
(b).
Fig. 2.19 E-R diagram for M/s ABC & Co

(a) E-R diagram for a product sales organisation


(b) Attributes (data items) and records of entities

The E-R diagram has become a widely accepted data model. It is widely used for the design of relational databases; one possible relational representation is sketched below. Further details on the E-R data model are given in Chapter 6.
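
One possible relational representation of the E-R diagram of Fig. 2.19 is sketched below in SQL. It is an illustration under stated assumptions: ORDER is a reserved word, so the table is named ORDERS; the CUST_ID column in ORDERS is added to realise the CUSTOMER-ORDER relationship (it is not listed among the ORDER attributes in the figure); and the data types are guesses based on the sample values.

-- Entity sets of Fig. 2.19 mapped to relations.
CREATE TABLE CUSTOMER (
    CUST_ID      INTEGER PRIMARY KEY,
    CUST_NAME    VARCHAR(50),
    CUST_ADDRESS VARCHAR(100)
);

CREATE TABLE PRODUCT (
    PROD_ID   CHAR(5) PRIMARY KEY,
    PROD_DESC VARCHAR(50),
    UNIT_COST DECIMAL(10,2)
);

-- The ORDER entity and its relationships; CUST_ID is an assumed foreign key.
CREATE TABLE ORDERS (
    CUST_ID  INTEGER REFERENCES CUSTOMER (CUST_ID),
    PROD_ID  CHAR(5) REFERENCES PRODUCT (PROD_ID),
    ORD_DATE DATE,
    PROD_QTY INTEGER
);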

2.7.7.1 Advantages of E-R Data Model


Straightforward relational representation: Having designed an E-R diagram for a database application, the relational representation of the database model becomes relatively straightforward.
Easy conversion for E-R to other data model: Conversion from E-R diagram to a
network or hierarchical data model can easily be accomplished.
Graphical representation for better understanding: An E-R model gives graphical and
diagrammatical representation of various entities, its attributes and relationships between
entities. This in turn helps in the clear understanding of the data structure and in minimizing
redundancy and other problems.

2.7.7.2 Disadvantages of E-R Data Model


No industry standard for notation: There is no industry standard notation for developing
an E-R diagram.
Limited to high-level design: The E-R data model is suitable mainly for high-level (conceptual) database design; by itself it cannot describe the lower-level details of the database.

2.7.8 Object-oriented Data Model


Object-oriented data model is a logical data model that captures the
semantics of objects supported in an object-oriented programming. It is a
persistent and sharable collection of defined objects. It has the ability to
model complete solution. Object-oriented database models represent an
entity and a class. A class represents both object attributes as well as the
behaviour of the entity. For example, a CUSTOMER class will have not only the customer attributes such as CUST-ID, CUST-NAME, CUST-ADDRESS and so on, but also procedures that imitate actions expected of a customer, such as update-order. Instances of the class object correspond to individual customers. Within an object, the class attributes take specific values, which
distinguish one customer (object) from another. However, all the objects
belonging to the class, share the behaviour pattern of the class. The object-
oriented database maintains relationships through logical containment.
The object-oriented database is based on encapsulation of data and code
related to an object into a single unit, whose contents are not visible to the
outside world. Therefore, object-oriented data models emphasise on objects
(which is a combination of data and code), rather than on data alone. This is
largely due to their heritage from object-oriented programming languages,
where programmers can define new types or classes of objects that may
contain their own internal structures, characteristics and behaviours. Thus,
data is not thought of as existing by itself. Instead, it is closely associated
with code (methods of member functions) that defines what objects of that
type can do (their behaviour or available services). The structure of object-
oriented data model is highly variable. Unlike traditional databases (such as
hierarchical, network or relational), it has no single inherent database
structure. The structure for any given class or type of object could be
anything a programmer finds useful, for example, a linked list, a set, an array
and so forth. Furthermore, an object may contain varying degrees of
complexity, making use of multiple types and multiple structures.
The object-oriented database management system (OODBMS) is among the most recent approaches to database management. OODBMSs started in engineering and design domain applications, and became favoured systems for financial, telecommunications, and World Wide Web (WWW) applications. They are suited for multimedia applications as well as for data with complex relationships that are difficult to model and process in a relational DBMS. Further details on the object-oriented model are given in Chapter 15.

2.7.8.1 Advantages of Object-oriented Data Model


Capable of handling a large variety of data types: Unlike traditional databases (such as hierarchical, network or relational), object-oriented databases are capable of storing different types of data, for example, pictures, voice, video, text, numbers and so on.
Combining object-oriented programming with database technology: Object-oriented
data model is capable of combining object-oriented programming with database technology
and thus, providing an integrated application development system.
Improved productivity: Object-oriented data models provide powerful features such as
inheritance, polymorphism and dynamic binding that allow the users to compose objects and
provide solutions without writing object-specific code. These features increase the
productivity of the database application developers significantly.
Improved data access: Object-oriented data model represents relationships explicitly,
supporting both navigational and associative access to information. It further improves the
data access performance over relational value-based relationships.

2.7.8.2 Disadvantages of Object-oriented Data Model


No precise definition: It is difficult to provide a precise definition of what constitutes an
object-oriented DBMS because the name has been applied to a variety of products and
prototypes, some of which differ considerably from one another.
Difficult to maintain: Object definitions need to be changed periodically, and existing databases must be migrated to conform to the new object definitions as organisational information needs change. Changing object definitions and migrating databases pose real challenges.
Not suited for all applications: Object-oriented data models are used where there is a need to manage complex relationships among data objects. They are especially suited for specific applications such as engineering, e-commerce, medicine and so on, and not for all applications. Their performance degrades, and their processing requirements are high, when they are used for ordinary applications.

2.7.9 Comparison between Data Models


Table 2.2 summarises the characteristics of different data models discussed
above.
Table 2.2 Comparison between different data models

2.8 TYPES OF DATABASE SYSTEMS

The classification of a database management system (DBMS) is greatly


influenced by the underlying computing system on which it runs, in
particular of computer architecture such as parallel, networked or
distributed. However, the DBMS can be classified according to the number
of users, the database site locations and the expected type and extent of use.
a. On the basis of the number of users:

Single-user DBMS.
Multi-user DBMS.

b. On the basis of the site locations:

Centralised DBMS.
Parallel DBMS.
Distributed DBMS.
Client/server DBMS.

c. On the basis of the type and the extent of use:


Transactional or production DBMS.
Decision support DBMS.
Data warehouse.

In this section, we discuss some of the important types of DBMS systems that are presently in use.

2.8.1 Centralised Database System


The centralised database system consists of a single processor together with
its associated data storage devices and other peripherals. It is physically
confined to a single location. The system offers data processing capabilities
to users who are located either at the same site, or, through remote terminals,
at geographically dispersed sites. The management of the system and its data is controlled centrally from a single, central site. Fig. 2.20 illustrates an
example of centralised database system.

Fig. 2.20 Centralised database system

2.8.1.1 Advantages of Centralised Database System


Most of the functions such as update, backup, query, control access and so on, are easier to
accomplish in a centralised database system.
The size of the database and the computer on which it resides have no bearing on whether the database is centralised. For example, a small enterprise with its database on a personal computer (PC) has a centralised database, just as a large enterprise may have its database entirely controlled by a mainframe.

2.8.1.2 Disadvantages of Centralised Database System


When the central site computer or database system goes down, all users are blocked from using the system until it comes back up.
Communication costs from the terminals to the central site can be expensive.

To take care of disadvantages of centralised database systems, parallel or


distributed database systems are used, which are discussed in chapters 17
and 18.

2.8.2 Parallel Database System


The architecture of a parallel database system consists of multiple central processing units (CPUs) and data storage disks operating in parallel. Hence, parallel database systems improve processing and input/output (I/O) speeds. Parallel database systems
are used in the applications that have to query extremely large databases or
that have to process an extremely large number of transactions per second.
Several different architectures can be used for parallel database systems,
which are as follows:
Shared data storage disk
Shared memory
Hierarchical
Independent resources.
Fig. 2.21 Parallel database system architectures: (a) shared data storage disk, (b) shared memory, (c) independent resource, (d) hierarchical

Fig. 2.21 illustrates the different architectures of parallel database systems. In the shared data storage disk architecture, all the processors share a common disk (or set of disks), as shown in Fig. 2.21 (a). In the shared memory architecture, all the processors share a common memory, as shown in Fig. 2.21 (b). In the independent resource architecture, the processors share neither a common memory nor a common disk; they have their own independent resources, as shown in Fig. 2.21 (c). The hierarchical architecture is a hybrid of the three earlier architectures, as shown in Fig. 2.21 (d). Further details on parallel database systems are given in Chapter 17.

2.8.2.1 Advantages of a Parallel Database System


Parallel database systems are very useful for the applications that have to query extremely large databases (of the order of terabytes, for example, 10^12 bytes) or that have to process an extremely large number of transactions per second (of the order of thousands of transactions per second).
In a parallel database system, the throughput (that is, the number of tasks that can be
completed in a given time interval) and the response time (that is, the amount of time it takes
to complete a single task from the time it is submitted) are very high.
2.8.2.2 Disadvantages of a Parallel Database System

In a parallel database system, there is a startup cost associated with initiating each process, and in an operation consisting of many parallel processes the total startup time may overshadow the actual processing time, affecting speedup adversely.
Since processes executing in a parallel system often access shared resources, a slowdown
may result from interference of each new process as it competes with existing processes for
commonly held resources, such as shared data storage disks, system bus and so on.

2.8.3 Client/Server Database System


The client/server architecture of a database system has two logical components, namely the client and the server. Clients are generally personal computers or workstations, whereas the server is a large workstation, a mid-range computer system or a mainframe computer system. The applications and tools of the DBMS run on one or more client platforms, while the DBMS software resides on the server. The server computer is called the back-end and the client's computer is called the front-end. These server and client computers are
connected into a network. The applications and tools act as clients of the
DBMS, making requests for its services. The DBMS, in turn, processes these
requests and returns the results to the client(s). In the client/server architecture, the client handles the graphical user interface (GUI) and does computations and other programming of interest to the end user, while the server handles the parts of the job that are common to many clients, for example, database access and updates.
Fig. 2.22 illustrates client/server database architecture.
Fig. 2.22 Client/server database architecture

As shown in Fig. 2.22, the client/server database architecture consists of


three components namely, client applications, a DBMS server and a
communication network interface. The client applications may be tools,
user-written applications or vendor-written applications. They issue SQL
statements for data access. The DBMS server stores the related software,
processes the SQL statements and returns results. The communication
network interface enables client applications to connect to the server, send
SQL statements and receive results or error messages or error return codes
after the server has processed the SQL statements. In client/server database
architecture, the majority of the DBMS services are performed on the server.
The client/server architecture is a part of the open systems architecture in
which all computing hardware, operating systems, network protocols and
other software are interconnected as a network and work in concert to
achieve user goals. It is well suited for online transaction processing and
decision support applications, which tend to generate a number of relatively
short transactions and require a high degree of concurrency.
Further details on client/server database system is given in Chapter 18,
Section 18.3.1.

2.8.3.1 Advantages of Client/server Database System


A client/server system uses less expensive platforms to support applications that had previously been running only on large and expensive minicomputers or mainframes.
Clients offer icon-based, menu-driven interfaces, which are superior to the traditional command-line, dumb-terminal interfaces typical of mini and mainframe computer systems.
The client/server environment facilitates more productive work by the users and better use of existing data.
A client/server database system is more flexible than a centralised system.
Response time and throughput are high.
The server (database) machine can be custom-built (tailored) to the DBMS function and thus can provide better DBMS performance.
The client (application) machine might be a personal workstation, tailored to the needs of the end users and thus able to provide better interfaces, high availability, faster responses and overall improved ease of use to the user.
A single database (on server) can be shared across several distinct client (application)
systems.

2.8.3.2 Disadvantages of Client/Server Database System


Labour and programming costs are high in client/server environments, particularly in the initial
phases.
There is a lack of management tools for diagnosis, performance monitoring and tuning, and
security control for the DBMS, clients, operating systems and networking environments.

2.8.4 Distributed Database System


Distributed database systems are similar to the client/server architecture in a
number of ways. Both typically involve the use of multiple computer
systems and enable users to access data from remote systems. However, a
distributed database system broadens the extent to which data can be shared
well beyond what can be achieved with a client/server system. Fig. 2.23
shows a diagram of the distributed database architecture.
As shown in Fig. 2.23, in distributed database system, data is spread
across a variety of different databases. These are managed by a variety of
different DBMS softwares running on a variety of different computing
machines supported by a variety of different operating systems. These
machines are spread (or distributed) geographically and connected together
by a variety of communication networks. In distributed database system, one
application can operate on data that is spread geographically on different
machines. Thus, in distributed database system, the enterprise data might be
distributed on different computers in such a way that data for one portion (or
department) of the enterprise is stored in one computer and the data for
another department is stored in another. Each machine can have data and
applications of its own. However, the users on one computer can access
data stored in several other computers. Therefore, each machine acts as a
server for some users and as a client for others. Further details on distributed
database systems are given in Chapter 18.

2.8.4.1 Advantages of Distributed Database System


Distributed database architecture provides greater efficiency and better performance.
Response time and throughput are high.
The server (database) machine can be custom-built (tailored) to the DBMS function and thus
can provide better DBMS performance.
The client (application) machine might be a personal workstation, tailored to the needs of
the end users and thus able to provide better interfaces, high availability, faster responses and
overall improved ease of use to the user.
A single database (on the server) can be shared across several distinct client (application)
systems.
As data volumes and transaction rates increase, users can grow the system incrementally.
Adding new locations has less impact on ongoing operations.
A distributed database system provides local autonomy.
Fig. 2.23 Distributed database system

2.8.4.2 Disadvantages of Distributed Database System


Recovery from failure is more complex in distributed database systems than in centralized
systems.

REVIEW QUESTIONS
1. Describe the three-tier ANSI-SPARC architecture. Why do we need mappings between
different schema levels? How do different schema definition languages support this
architecture?
2. Discuss the advantages and characteristics of the three-tier architecture.
3. Discuss the concept of data independence and explain its importance in a database
environment.
4. What is logical data independence and why is it important?
5. What is the difference between physical data independence and logical data independence?
6. How does the ANSI-SPARC three-tier architecture address the issue of data independence?
7. Explain the difference between external, conceptual and internal schemas. How are these
different schema layers related to the concepts of physical and logical data independence?
8. Describe the structure of a DBMS.
9. Describe the main components of a DBMS.
10. With a neat sketch, explain the structure of DBMS.
11. What is a transaction?
12. How does the hierarchical data model address the problem of data redundancy?
13. What do you mean by a data model? Describe the different types of data models used.
14. Explain the following with their advantages and disadvantages:

a. Hierarchical database model


b. Network database model
c. Relational database model
d. E-R data models
e. Object-oriented data model.

15. Define the following terms:

a. Data independence
b. Query processor
c. DDL processor
d. DML processor.
e. Run time database manager.

16. How does the hierarchical data model address the problem of data redundancy?
17. What does each of the following acronyms represent and how is each related to the birth of the
network database model?

a. SPARC
b. ANSI
c. DBTG
d. CODASYL.

18. Describe the basic features of the relational data model. Discuss their advantages,
disadvantages and importance to the end-user and the designer.
19. A university has an entity COURSE with a large number of courses in its catalog. The
attributes of COURSE include COURSE-NO, COURSE-NAME and COURSE-UNITS.
Each course may have one or more different courses as prerequisites or may have no
prerequisites. Similarly, a particular course may be a prerequisite for any number of courses,
or may not be a prerequisite for any other course. Draw an E-R diagram for this situation.
20. A company called M/s ABC Consultants Ltd. has an entity EMPLOYEE with a number of
employees having attributes such as EMP-ID, EMP-NAME, EMP-ADD and EMP-BDATE.
The company has another entity PROJECT that has several projects having attributes such as
PROJ-ID, PROJ-NAME and START-DATE. Each employee may be assigned to one or more
projects, or may not be assigned to a project. A project must have at least one employee
assigned and may have any number of employees assigned. An employee’s billing rate may
vary by project, and the company wishes to record the applicable billing rate (BILL-RATE)
for each employee when assigned to a particular project. By making additional assumptions,
if so required, draw an E-R diagram for the above situation.
21. An entity type STUDENT has the attributes such as name, address, phone, activity, number
of years and age. Activity represents some campus-based student activity, while number of
years represents the number of years the student has engaged in these activities. A given
student may engage in more than one activity. Draw an E-R diagram for this situation.
22. Draw an E-R diagram for an enterprise or an organisation you are familiar with.
23. What is meant by the term client/server architecture and what are the advantages and
disadvantages of this approach?
24. Compare and contrast the features of hierarchical, network and relational data models. What
business needs led to the development of each of them?
25. Differentiate between schema, subschema and instances.
26. Discuss the various execution steps that are followed while executing a user's request to access
the database system.
27. With a neat sketch, describe the various components of database management systems.
28. With a neat sketch, describe the various functions and services of database management
systems.
29. Describe in detail the different types of DBMSs.
30. Explain with a neat sketch, advantages and disadvantages of a centralised DBMS.
31. Explain with a neat sketch, advantages and disadvantages of a parallel DBMS.
32. Explain with a neat sketch, advantages and disadvantages of a distributed DBMS.

STATE TRUE/FALSE

1. In a database management system, data files are the files that store the database information.
2. The external schema defines how and where data are organised in physical data storage.
3. In a network database terminology, a relationship is a set.
4. A feature of relational database is that a single database can be spread across several tables.
5. An SQL is a fourth generation language.
6. An object-oriented DBMS is suited for multimedia applications as well as data with complex
relationships.
7. An OODBMS allows for fully integrated databases that hold data, text, voice, pictures and
video.
8. The hierarchical model assumes that a tree structure is the most frequently occurring
relationship.
9. The hierarchical database model is the oldest data model.
10. The data in a database cannot be shared.
11. The primary difference between the different data models lies in the methods of expressing
relationships and constraints among the data elements.
12. In a database, the data are stored in such a fashion that they are independent of the programs
of users using the data.
13. The plan (or formulation of scheme) of the database is known as schema.
14. The physical schema is concerned with exploiting the data structures offered by a DBMS in
order to make the scheme understandable to the computer.
15. The logical schema, deals with the manner in which the conceptual database shall get
represented in the computer as a stored database.
16. Subschemas act as a unit for enforcing controlled access to the database.
17. The process of transforming requests and results between three levels are called mappings.
18. The conceptual/ internal mapping defines the correspondence between the conceptual view
and the stored database.
19. The external/conceptual mapping defines the correspondence between a particular external
view and the conceptual view.
20. A data model is an abstraction process that concentrates essential and inherent aspects of the
organisation’s applications while ignores superfluous or accidental details.
21. Object-oriented data model is a logical data model that captures the semantics of objects
supported in object-oriented programming.
22. Centralised database system is physically confined to a single location.
23. Parallel database systems architecture consists of one central processing unit (CPU) and data
storage disks in parallel.
24. Distributed database systems are similar to client/server architecture.

TICK (✓) THE APPROPRIATE ANSWER

1. Which of the following is a database element?

a. data
b. constraints and schema
c. relationships
d. all of these.

2. What separates the physical aspects of data storage from the logical aspects of data
representation?

a. data
b. schema
c. constraints
d. relationships.

3. What schema defines how and where the data are organised in a physical data storage?

a. external
b. internal
c. conceptual
d. none of these.

4. Which of the following schemas defines the stored data structures in terms of the database
model used?

a. external
b. conceptual
c. internal
d. none of these.

5. Which of the following schemas defines a view or views of the database for particular users?

a. external
b. conceptual
c. internal
d. none of these.

6. A collection of data designed to be used by different people is called:

a. Database
b. RDBMS
c. DBMS
d. none of these.

7. Which of the following is a characteristic of the data in a database?

a. shared
b. secure
c. independent
d. all of these.

8. Which of the following is the database management activity of coordinating the actions of
database manipulation processes that operate concurrently, access shared data and can
potentially interfere with each other?

a. concurrency management
b. database management
c. transaction management
d. information management.

9. An object-oriented DBMS is capable of holding:

a. data and text


b. pictures and images
c. voice and video
d. all of the above.

10. Which of the following is an object-oriented feature?


a. inheritance
b. abstraction
c. polymorphism
d. all of these.

11. Immunity of the conceptual (or external) schemas to changes in the internal schema is
referred to as:

a. physical data independence


b. logical data independence
c. both (a) and (b)
d. none of these.

12. Physical data models are used to:

a. specify overall logical structure of the database


b. describe data and its relationships
c. higher-level description of storage structure and access mechanism
d. all of these.

13. Immunity of the external schemas (or application programs) to changes in the conceptual
schema is referred to as:

a. physical data independence


b. logical data independence
c. both (a) and (b)
d. none of these.

14. Record-based data models are used to:

a. specify overall logical structure of the database


b. describe data and its relationships
c. higher-level description of storage structure and access mechanism
d. all of these.

15. Object-oriented data models are used to:

a. specify overall logical structure of the database


b. describe data and its relationships
c. higher-level description of storage structure and access mechanism
d. all of these.

16. The relational data model was first introduced by:

a. SPARC
b. E.F. Codd
c. ANSI
d. Chen.
17. The E-R data model was first introduced by:

a. SPARC
b. E.F. Codd
c. ANSI
d. Chen.

FILL IN THE BLANKS

1. Relational data model stores data in the form of a _____.


2. The _____ defines various views of the database.
3. The _____ model defines the stored data structures in terms of the database model used.
4. The object-oriented data model maintains relationships through _____.
5. The _____ data model represents an entity as a class.
6. _____ represent a correspondence between the various data elements.
7. To access information from a database one needs a _____.
8. A _____ is a sequence of database operations that represent a logical unit of work and that
access a database and transforms it from one state to another.
9. The database applications are usually partitioned into a _____ architecture or a _____
architecture.
10. A subschema is a _____ of the schema.
11. Immunity of the conceptual (or external) schemas to changes in the internal schema is
referred to as _____.
12. Immunity of the external schemas (or application programs) to changes in the conceptual
schema is referred to as _____.
13. The process of transforming requests and results between three levels are called _____.
14. The conceptual/internal mapping defines the correspondence between the _____ view and the
_____.
15. The external/conceptual mapping defines the correspondence between a particular _____
view and the _____ view.
16. The hierarchical data model is represented by an _____ tree.
17. Information Management System (IMS) was developed jointly by _____ and _____.
18. Network data model was formalized by _____ in the late _____.
19. The three basic components of network model are (a) _____, (b) _____ and (c) _____.
20. The relational data model was first introduced by _____.
21. Client/server architecture of database system has two logical components namely _____
and_____.
Chapter 3

Physical Data Organisation

3.1 INTRODUCTION

As discussed in the preceding chapters, the goal of a database system is to simplify and facilitate access to data. The users of the system should not be
burdened with the physical details of the implementation of the system.
Databases are stored physically on storage devices and organised as files and
records. The overall performance of a database system is determined by the
physical database organisation. Therefore, it is important that the physical
organisation of data is efficiently managed.
Not all data processed by a computer can reside in main memory, for two reasons:
i. Main memory is a scarce resource and cannot hold large programs and large volumes of data in their entirety.
ii. It is often necessary to store data from one execution of a program to the next.

Thus, large volumes of data and programs are stored in physical storage
devices, called secondary, auxiliary or external storage devices. The database
management system (DBMS) software then retrieves, updates and processes
this data as needed. When data are stored physically on secondary storage
devices, the organisation of data determines the way data can be accessed.
The organisation of data is influenced by a number of factors such as:
Maximizing the amount of data that can be stored efficiently in a particular storage device by
suitable structuring and blocking of data or records.
Time (also called response time) required for accessing a record, writing a record, modifying
a record and transferring a record to the main memory. This affects the types of applications
that can use the data and the time and cost required to do so.
Minimizing or zero data redundancy.
Characteristics of secondary storage devices.
Expandability of data.
Recovery of vital data in case of system failure or data loss.
Data independence.
Complexity and cost.

This chapter introduces various aspects of physical database organisation


that foster efficient database operation. These include various types of
physical storage media and technologies, the concept of files and file
organisation, and the indexing and hashing of files.

3.2 PHYSICAL STORAGE MEDIA

As discussed above, the data in a database management system (DBMS) is stored on physical storage devices such as main memory and secondary
(external) storage. Thus, it is important that the physical database (or
storage) is properly designed to increase data processing efficiency and
minimise the time required by users to interact with the information system.

Fig. 3.1 System of physically accessing the database


When required, a record is fetched from the disk into main memory for
further processing. The file manager is the software that manages the allocation
of storage locations and data structures. It determines the page on which a
record resides. The file manager sometimes uses auxiliary data structures to
quickly identify the page that contains a desired record and then issues a
request for that page to the buffer manager. The buffer manager fetches the
requested page from disk into a region of main memory called the buffer
pool and tells the file manager the location of the requested page. The buffer
manager is thus the software that controls the movement of data between main
memory and disk storage.
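The division of labour between the file manager and the buffer manager can be sketched in a few lines of Python. This is a simplified illustration under assumed names (BufferManager, FileManager and read_page_from_disk are inventions for the sketch), not the implementation of any particular DBMS: the buffer pool is modelled as a dictionary from page numbers to page contents, and no replacement policy is shown.

    class BufferManager:
        """Keeps recently used pages in an in-memory buffer pool."""

        def __init__(self, read_page_from_disk):
            self.buffer_pool = {}                   # page_no -> page bytes
            self.read_page_from_disk = read_page_from_disk

        def get_page(self, page_no):
            if page_no not in self.buffer_pool:     # page not in memory: fetch it from disk
                self.buffer_pool[page_no] = self.read_page_from_disk(page_no)
            return self.buffer_pool[page_no]        # location of the page in the buffer pool


    class FileManager:
        """Maps a record identifier to the page that holds it."""

        def __init__(self, buffer_manager, records_per_page):
            self.buffer_manager = buffer_manager
            self.records_per_page = records_per_page

        def get_record_page(self, record_no):
            page_no = record_no // self.records_per_page   # which page holds the record
            return self.buffer_manager.get_page(page_no)


    # Example use with a fake "disk" of 100-byte pages:
    bm = BufferManager(lambda page_no: bytes(100))
    fm = FileManager(bm, records_per_page=4)
    page = fm.get_record_page(record_no=10)   # record 10 lives on page 2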
Several types of physical storage devices exist in most computer
systems. These storage devices are classified by the speed of data access, the
cost per unit of data to purchase the medium and reliability of the medium.
Typical physical storage devices can be categorised mainly as:
a. Primary,
b. Secondary,
c. Tertiary storage devices.

The primary storage devices can be categorised as:


Cache,
Main memory,
Flash memory.

The secondary (also called on-line) storage can be further categorised as:
Magnetic disk.

The tertiary (also called off-line) storage can be categorised as:


Magnetic tape,
Optical storage.

Memory in a computer system is arranged in a hierarchy, as shown in Fig. 3.2.
As we move up the hierarchy of storage devices, the cost per bit increases and
the speed of access becomes faster. As we move down the hierarchy, capacity,
stability and access time all increase. The highest-speed storage is the most
expensive and is therefore available with the least capacity. The lowest-speed
storage offers virtually unlimited capacity, but at the price of a high access time.

3.2.1 Primary Storage Device


At the top of the hierarchy are the primary storage devices, which are
directly accessible by the processor. Primary storage devices, also called
main memory, store active executing programs, data and portion of the
system control program (for example, operating system, database
management system, network control program and so on.) that is being
processed. As soon as a program terminates, its memory becomes available
for use by other processes. It is evident from Fig. 3.2 that primary storage
devices are the fastest and costliest storage media. They provide very fast
access to data. The cost of main memory is more than 100 times the cost of
secondary storage, and higher still when compared with tertiary storage devices.
Primary storage devices are volatile in nature, which means that they lose
their contents when the power to the memory is switched off or the computer is
restarted (after a shutdown or a crash). They require a battery back-up
system to avoid data loss from the memory. Primary storage includes
main memory, cache memory and flash memory.

Fig. 3.2 Hierarchy of physical storage devices


3.2.2 Secondary Storage Device
The secondary storage devices (also called external or auxiliary storage)
provide stable storage where software (program) and data can be held ready
for direct use by the operating system and applications. The secondary
storage devices usually have a larger capacity, lower cost and slower
speed. Data on secondary storage cannot be processed directly by the
CPU; it must first be transferred into primary storage. Secondary storage
devices are non-volatile. Secondary storage includes magnetic disks. Magnetic disks
are also used as virtual memory or swap space for processes and data that
either are too big to fit in the primary memory or must be temporarily
swapped out to disk to enable other processes to run.

3.2.3 Tertiary Storage Device


Tertiary storage devices are primarily used for archival purposes. Data held
on tertiary devices is not directly loaded and saved by application programs.
Instead, operating system utilities are used to move data between tertiary and
secondary stores as required. The tertiary storage devices are also non-
volatile. Tertiary storage devices such as optical disk and magnetic tape are
the slowest class of storage devices.
Since cost of primary storage devices is very high, buying enough main
memory to store all data is prohibitively expensive. Thus, secondary and
tertiary storage devices play an important role in database management
systems for storage of very large volume of data. Large volume of data is
stored on the disks and/or tapes and a database system is built that can
retrieve data from lower levels of the memory hierarchy into main memory
as needed for processing. In balancing the requirements for different types of
storage, there are considerable trade-offs involving cost, speed, access time
and capacity. Main memory and most magnetic disks are fixed storage
media. The capacity of such devices can only be increased by adding further
devices. Optical disks and tape, although slower, are relatively inexpensive
because they are removable media. Once the read/write devices are installed,
storage capacity may be expanded simply by purchasing further tapes, disks,
CD-ROM and so on.

3.2.4 Cache Memory


Cache memory is a primary memory. It is a small storage that provides a
buffering capability by which the relatively slow and ever-larger
main memory can interface with the central processing unit (CPU) at the
processor cycle time. It is used in conjunction with main memory in
order to optimise performance. Cache is a high-speed storage that is much
faster, but also far more expensive, than main storage. Therefore, only a small
amount of cache storage is used. Cache storage is managed by the computer
system hardware, so its management is not of much concern in a database
management system.

Advantages:
High-speed storage and much faster than main memory.

Disadvantages:
Small storage device.
Expensive as compared to main memory.
Volatile memory.

3.2.5 Main Memory


The main memory (also called primary memory) is a high-speed random
access memory (RAM). It stores data and/or information, general-purpose
machine instructions or programs that are required for execution or
processing by the computer system. It is volatile in nature, which means that
data, information or programs are stored in it as long as power is available to
the computer. During power failure or computer system crash, the content of
the main memory is lost. The operation of main memory is very fast,
typically measured in tens of nanoseconds, but it is very costly. Therefore,
main memory is usually small in size (of the order of a few megabytes or
gigabytes), and only the data and programs required for immediate access are
stored in it. The rest of the data and programs are stored on secondary storage
devices. This is why main memory is also called an immediate access storage
(IAS) device. It is directly accessible by the central processing unit (CPU).
Relevant data and programs are transferred from secondary storage to main
memory for execution. Decreasing memory costs have made large main memory
systems possible, making it feasible to keep large parts of a database active in
main memory rather than on secondary storage devices.

Advantages:
High-speed random access memory.
Its operation is very fast.

Disadvantages:
Usually small in size but bigger than cache memory.
Very costly.
Volatile memory.

3.2.6 Flash Memory


Flash memory is also a primary memory. It is a type of read only memory
(ROM), which is non-volatile and data remains intact even after power
failure. It is also called as electrically erasable programmable read-only
memory (EEPROM). Flash memory is as fast as the main memory and it
takes very little time (less than 1/10th of a microsecond) to read data from a
flash memory. However, writing data to flash memory takes a little longer
time (about 4 to 10 microseconds). Also, data to flash memory can be
written once and cannot be overwritten again directly. The entire bank of
flash memory has to be erased at once to overwrite again to the flash
memory. The Flash memory supports limited number of erase cycles,
ranging from 10,000 to 1 million. It is used for the storage of a small volume
of data (ranging from 5 to 10 megabytes) in low-cost computer systems such
as in hand-held computing devices, digital electronic devices, real-time
computers for process control applications and so on.
Advantages:
Non-volatile memory.
It is as fast as main memory.

Disadvantages:
Usually small in size.
It is costly as compared to secondary storage.

3.2.7 Magnetic Disk Storage


Magnetic disks are the main form of secondary storage memory, which are
non-volatile. They are used for bulk storage of data and programs, which are
required infrequently at a much lower cost than the high-speed main
memory. Usually, the entire database is stored on magnetic disk and portions of
it are transferred to main memory as needed. Its disadvantages are a much
longer access time and the need for interface boards and software to
connect it to the CPU. These storage devices operate asynchronously with
respect to the CPU, so care has to be taken in deciding on the appropriate
technique for transferring data between the CPU, the fast-access main memory
and the secondary storage. Magnetic disks have a larger storage capacity and
are less expensive per bit of information stored than main memory. Magnetic
disks are available today that store very large volumes of data, typically ranging
from a few gigabytes to 100 gigabytes. With advancements in computing technology,
the storage capacity of magnetic disks is increasing every year. The time
required to access data or information is much greater in case of magnetic
disk storage.
Magnetic disks are a type of direct access storage device (DASD) in
which one record can be accessed directly by specifying the location (or
address) of the record on the storage media. Different types of magnetic
disks and their sub-types are given below:
a. Fixed disks

Hard disks.
Removable-pack disks.
Winchester disks.
b. Exchangeable or flexible disks

Floppy disks.
Zip disks.
Jaz disks.
Super disks.

A magnetic disk is single-sided if it stores information on only one of its surfaces and double-sided if both its surfaces are used. To increase the storage
capacity, disks are assembled into a disk pack which may include many disks
and hence many surfaces. The physical unit in which the magnetic disk
recording medium is contained is called disk drive. Each disk drive contains
one disk pack (also called volume). Disk packs consist of a number of
platters which are stacked on top of each other on a spindle. Each disk
platter has a flat circular shape and its surfaces are covered with a magnetic
material. Data and information are stored on these surfaces. Platters are
made from rigid metal or glass and are usually covered on both surfaces with
a magnetic recording material. Fig. 3.3 illustrates mechanism of magnetic
disk storage. Data is stored on disk in units called disk blocks (or pages),
which is a contiguous sequence of bytes. Data is written to a disk and read
from a disk in the form of disk blocks. Blocks are arranged in concentric
rings called tracks. Each disk platter has two disk surfaces. The disk surfaces
are logically divided into tracks. Set of all tracks with the same diameter is
called a cylinder. A cylinder contains one track per platter surface. Tracks
can be recorded on one or both surfaces of a platter. If tracks are recorded on
one surface of the platter, it is called a single-sided disk storage. When tracks
are recorded on both the surfaces of the platter, it is called a double-sided
disk storage. Each track is subdivided into arcs called sectors. A sector is the
smallest unit of information that can be read from or written to the disk. The
size of the sectors cannot be changed. Presently available disks have a
sector size of 512 bytes, more than 16,000 tracks on each platter and 2 to 4
platters per disk. The tracks closer to the spindle (called inner tracks) are
shorter; the outer tracks contain more sectors (typically 400 sectors)
than the inner tracks (typically 200 sectors). With fast developments in
computing technology, these numbers and definitions are also changing very
fast.
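The geometry figures above are enough to estimate the raw capacity of such a disk. The short calculation below is only an illustration using the sample numbers quoted in this section (512-byte sectors, 16,000 tracks per surface, an average of about 300 sectors per track and 4 double-sided platters); real drives publish their own figures.

    bytes_per_sector   = 512
    sectors_per_track  = 300          # rough average of inner (200) and outer (400) tracks
    tracks_per_surface = 16_000
    surfaces           = 4 * 2        # 4 platters, both sides recorded

    capacity_bytes = bytes_per_sector * sectors_per_track * tracks_per_surface * surfaces
    print(capacity_bytes / 10**9, "GB")   # roughly 19.7 GB for these sample figures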
An array of electromagnetic read/write heads (one per recorded surface of
a platter) is provided. These read/write heads are mounted on a single
assembly called disk arm assembly, as shown in Fig. 3.3. The disk platters
mounted on a spindle and the read/write heads mounted on disk arm
assembly are together known as head-disk assembly. Data or information is
transferred to or from the disk through the read/write heads. Each read/write
head floats just above or below the surface of a disk while the disk is
rotating constantly at a high speed (usually 60, 90, 120 or 250 revolutions
per second). The read/write heads are kept as close as possible to the disk
surface to increase the recording density. The read/write head never actually
touches the disk but hovers a few thousandths or millionths of an inch over it.
The spinning of the disk creates a small breeze and the head assembly is
shaped so that the breeze keeps the head floating just above the disk surface.
Because the read/write head floats so close to the surface, platters must be
machined carefully to be flat. A dust particle or a human hair on the disk
surface could cause the contact of read/write head to the disk surface and
thus causing head to crash into the disk. This event is known as head crash.
In the case of a head crash, the head can scrape the recording medium of the disk,
destroying the data that had been stored on the disk. Under normal
circumstances, a head crash results in failure of the entire disk, which must
then be replaced. Current-generation disk drives use a thin film of magnetic
metal as the recording medium, which is much less susceptible to failure by
head crashes than the older oxide-coated disks.
Fig. 3.3 Magnetic disk storage mechanism

With a fixed disk drive, the head-disk assembly is permanently mounted on the disk drive and has a separate head for each track. This arrangement
allows the computer to switch from track to track quickly, without having to
move the head assembly. In the case of an exchangeable disk drive, on the other
hand, the head-disk assembly is movable. For exchangeable disks (such as a
floppy disk), the read/write heads are attached to a movable arm to form a
comb-like access assembly. For accessing data on a particular track, the
whole assembly moves to position the read/write heads over the desired
track. While many read/write heads may be in the position for a read/write
transaction at a given point in time, data transmission can only take place
through one read/write head at a time. Access time to locate data on hard
disks is 10 to 100 milliseconds compared to 100 to 600 milliseconds on
floppy disks. A disk controller, typically embedded in the disk drive,
controls the disk drive and interfaces it to the computer system. The disk
controller accepts high-level input/output (I/O) commands and takes
appropriate action to position the arm and cause the read/write action to
take place. To transfer a disk block, given its address, the disk controller first
mechanically positions the read/write head on the correct track.

3.2.7.1 Factors Affecting Magnetic Disk Performance


The performance of a magnetic disk is measured by the following
parameters:
Access time.
Data-transfer rate.
Reliability or mean time to failure (MTTF).
Storage capacity.

Access time is the time from when a read or write request is issued to
when the data transfer begins. The read/write arm first moves to get
positioned over the correct track to access data on a given sector of a disk. It
then waits for the sector to appear under it as the disk rotates. The time
required to move the read/write heads from their current position to a new
cylinder address is termed as seek time (or access motion time). Seek time
increases with the distance that the arm moves and it ranges typically from 2
to 30 milliseconds depending on how far the track is from the initial arm
position. The seek time is the most significant delay when accessing
data on a disk with a movable-head assembly. Therefore, it is always desirable
to minimise the total seek time. Once the seek is complete, the read/write head
waits for the sector to be accessed to appear under it. This waiting time, due to
rotational delay, is termed the rotational latency time. There is a third timing
factor, called head activation time, which is the time required to electronically
activate the read/write head over the disk surface where the data transfer is to
take place. Head activation time is regarded as negligible compared to the other
factors. Therefore, access time depends on both the seek time and the rotational
latency time.
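A rough back-of-the-envelope calculation shows how these two components combine. The figures used below are illustrative assumptions taken from the ranges quoted in this section: an average seek time of 10 milliseconds and a spindle speed of 120 revolutions per second (giving, on average, half a revolution of rotational latency).

    avg_seek_ms = 10.0                         # assumed average seek time
    revolutions_per_second = 120               # assumed spindle speed
    full_rotation_ms = 1000.0 / revolutions_per_second   # about 8.33 ms per revolution
    avg_rotational_latency_ms = full_rotation_ms / 2      # about 4.17 ms on average

    avg_access_time_ms = avg_seek_ms + avg_rotational_latency_ms
    print(round(avg_access_time_ms, 2), "ms")  # roughly 14.17 ms before the transfer begins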
The data-transfer rate is the rate at which data can be transferred between the
disk and main memory, that is, the rate at which data can be retrieved from or
stored to the disk. It is a function of the rotational speed and the density of the
recorded data. Ideally, current magnetic disks have data-transfer rates of about
25 to 40 megabytes per second; however, actual data-transfer rates are
significantly lower (of the order of 4 to 8 megabytes per second).
Mean time to failure (MTTF) is a measure of the reliability of the disk.
The MTTF of a disk is the amount of time that the disk is, on average,
expected to run continuously without any failure. Theoretically, the MTTF of
presently available disks typically ranges from 30,000 to 1,200,000 hours
(about 3.4 to 136 years). In practice, however, the MTTF is computed from the
probability of failure when the disk is new; an MTTF of 1,200,000 hours
does not mean that a disk can be expected to function for 136 years. Most
disks have an expected life span of about 5 years and show higher rates of
failure as they age.

3.2.7.2 Advantages of Magnetic Disks


Non-volatile memory.
It has very large storage capacity.
Less expensive.

3.2.7.3 Disadvantages of Magnetic Disks


Greater access time as compared to main memory.

3.2.8 Optical Storage


Optical storage is a tertiary storage device which offers access times
somewhat slower than conventional magnetic disks. Optical storage devices
have storage capacity of several hundred megabytes or more per disk. In
optical storage, data and programs are stored (or written) optically using
laser beams. Optical devices use different technologies, such as magneto-optical
(MO) methods, which combine magnetic and optical storage, and
purely optical methods. There are various types of optical disk storage
devices namely:
Compact disk - read only memory (CD-ROM).
Digital video disk (DVD).
Write-once, read-many (WORM) disks.
CD-R and DVD-R.
CD-RW and DVD-RW.

The most popular forms of optical storage are the compact disk (CD) and the
digital video disk (DVD). A CD can store more than 1 gigabyte of data and a
DVD can store more than 20 gigabytes of data on both sides of the disk.
Like audio CDs, CD-ROMs come with data already encoded onto them. The
data is permanent and can be read any number of times but cannot be
modified. A CD-ROM player is required to read data from CD-ROM drive.
There are record-once versions of compact disks called CD-recordable (CD-
R) and DVD-Recordable (DVD-R), which can be written only once. Such
disks are also called write-once, read-many (WORM) disks.
Multiple-write versions of compact disks called CD-ReWritable (CD-RW)
and digital video disks called DVD-ReWritable (DVD-RW) and DVD-RAM
are also available, which can be written multiple times. Magneto-optical
disks, in contrast, are storage devices that use optical means to read
magnetically encoded data. Such optical disks are useful for archival storage
of data as well as for distribution of data.
Since the head assembly is heavier, DVD and CD drives have much
longer seek time (typically 100 milliseconds) as compared to magnetic-disk
drives. Rotational speeds of DVD and CD drives are lower than that of
magnetic disk drives. Faster DVD and CD drives have rotational speed of
about 3000 rotations per minute, which is comparable to speed of lower-end
magnetic-disk drives. Data transfer rates of DVD and CD drives are less
than that of magnetic disk drives. The data transfer rate of CD drive is
typically 3 to 6 megabytes per second and that of DVD drive is 8 to 15
megabytes per second. The transfer rate of optical drives is characterised as
n×, which means the drive supports transfer at n times the standard rate.
Commonly available CD drives are rated at 50× and DVD drives at 12×. Due
to their high storage capacity, longer lifetime than magnetic disks and
removability, CD-R/CD-RW and DVD-R/DVD-RW disks are popular for
archival storage of data.
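To make the n× notation concrete, the snippet below converts drive ratings into approximate transfer rates. The base (1×) rates used here, about 150 kilobytes per second for CD and about 1.385 megabytes per second for DVD, are the commonly quoted standard rates and are assumptions not stated in the text above.

    CD_BASE_MB_PER_S  = 0.15      # 1x CD rate, ~150 KB/s (assumed standard value)
    DVD_BASE_MB_PER_S = 1.385     # 1x DVD rate, ~1.385 MB/s (assumed standard value)

    def optical_rate(multiplier, base_mb_per_s):
        """Approximate sustained transfer rate of an n-times optical drive."""
        return multiplier * base_mb_per_s

    print(optical_rate(50, CD_BASE_MB_PER_S), "MB/s for a 50x CD drive")    # ~7.5 MB/s
    print(optical_rate(12, DVD_BASE_MB_PER_S), "MB/s for a 12x DVD drive")  # ~16.6 MB/s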

3.2.8.1 Advantages of Optical Storage


Large storage device.
Reliable as compared to floppy disks.
Non-volatile.
Cheap to mass-produce.

3.2.8.2 Disadvantages of Optical Storage


Special care is required for its handling.
Data once written, cannot be modified.
Data transfer rates and rotational speeds are lower than that of magnetic disk drives.

3.2.9 Magnetic Tape Storage


Magnetic tape storage devices are tertiary storage devices which are non-
volatile. They are also used for bulk storage of data and programs and are
mainly used for backup and archival data. Magnetic tape is a ferrite-coated
magnetic strip of plastic wound on reels, on which data is encoded. It is
kept on a spool and is wound or rewound past a read/write head. Magnetic
tape can take seconds or even minutes to move to the correct spot. The physical
appearance of magnetic tape reels is similar to that of the stereo tapes used for
sound (music) recording. The reels used in computers, however, are
wider and much larger (about half an inch wide and 2,400 feet long) and are
capable of storing more than 6,000 bytes per inch. Data or information on
magnetic tape is stored character by character. Magnetic tapes are accessed
sequentially, are also referred to as sequential-access storage devices and
are much slower than magnetic disks. Tapes have a high storage capacity,
ranging from 40 gigabytes to more than 300 gigabytes. Magnetic tapes also
have a read/write head, which is an electromagnet. The read/write head reads
magnetized areas (which represent data on the tape), converts them into
electrical signals and sends them to main memory and CPU for execution or
further processing.
Fig. 3.4 Layout of inter-record gaps and inter-block gaps

(a) Inter-record gap (IRG)

(b) Inter-block gap (IBG)

For reading or writing data or information on magnetic tape, the tape is mounted on a magnetic tape drive and fed past the read/write heads at a particular speed
(typically 125 inches per second). When a command to read or write is
issued by the CPU, tape accelerates from a stop (rest) position to a constant
high speed. Following the completion of a read or write command, the tape
is decelerated to a stop position. It is not possible to stop a tape exactly
where we want. Thus, during either an acceleration or deceleration phase, a
certain length of tape is passed over. This section of tape is neither read nor
written upon. This space appears between successive records and is called an
inter-record gap (IRG). An IRG varies from ½ to ¾ inch, depending on the
nature of the tape drive. If each record is short, then a large number of IRGs will
be required, which means that much of the tape length will be left blank. The
greater the number of IRGs, the smaller the storage capacity of the tape. When
there is a gap after every record (as in the case of IRGs), the computer reads
and processes the data between one gap and the next, then between that gap
and the one following, and so on, until the end of the file or tape is reached.
Fig. 3.4 (a) illustrates the inter-record gap layout. As shown,
only one record will be processed at a time and the tape must start and stop
between each gap.
To circumvent this problem, several records are grouped together into
blocks. The gap between blocks of records is called an inter-block gap (IBG),
as shown in Fig. 3.4 (b). In this case, one write command can transfer a
number of consecutive records to the tape without requiring IRGs between
them. The computer now reads and processes a block at a time. The number
of records in a block is called the blocking factor. The average time taken to
read or write a record is inversely proportional to the blocking factor, since
fewer gaps must be spanned and more records can be read or written per
command. The blocking factor should therefore be large in order to utilise tape
storage efficiently and minimise reading and writing time.
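The effect of the blocking factor on tape utilisation can be estimated with a few lines of arithmetic. The numbers below are illustrative assumptions consistent with the figures quoted in this section: 100-byte records, a recording density of 6,000 bytes per inch and a ¾-inch gap after every block.

    record_bytes     = 100          # assumed logical record size
    bytes_per_inch   = 6_000        # recording density quoted for computer tape
    gap_inches       = 0.75         # one inter-block gap after every block

    def tape_utilisation(blocking_factor):
        """Fraction of the tape length that actually carries data."""
        data_inches = (blocking_factor * record_bytes) / bytes_per_inch
        return data_inches / (data_inches + gap_inches)

    for bf in (1, 10, 50):
        print(bf, round(tape_utilisation(bf), 3))
    # bf = 1  -> about 0.022 (unblocked records waste most of the tape)
    # bf = 10 -> about 0.182
    # bf = 50 -> about 0.526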
Magnetic tape storage devices are available in a variety of forms such as:
Quarter-inch cartridge (QIC) tapes.
8-mm Ampex helical scan tapes.
Digital audio tape (DAT) cartridge.
Digital linear tape (DLT).

Current magnetic tapes are available with high storage capacities. The digital
audio tape (DAT) cartridge offers a storage capacity in the range of
a few gigabytes, whereas digital linear tape (DLT) offers a storage
capacity of more than 40 gigabytes. The storage capacity of the Ultrium tape
format is more than 100 gigabytes and that of Ampex helical scan tapes is in
the range of 330 gigabytes. Data transfer rates of these tapes are of the order
of a few megabytes per second to tens of megabytes per second.

3.2.9.1 Advantages of Magnetic Tape Storage


It is much cheaper than magnetic disks.
Non-volatile.

3.2.9.2 Disadvantages of Magnetic Tape Storage


Data access is sequential and much slower.
Records must be processed in the order in which they reside on the tape.

3.3 RAID TECHNOLOGY

With fast-growing database applications such as the World Wide Web, multimedia and so on, the data storage requirements are also growing at the
same pace. Also, faster microprocessors with larger and larger primary
memories are continually becoming available with the exponential growth in
the performance and capacity of semiconductor devices and memories.
Therefore, it is expected that secondary storage technology must also take
steps to keep up in performance and reliability with processor technology to
match the growth.
Development of redundant arrays of inexpensive disks (RAID) was a
major advancement in secondary storage technology to achieve improved
performance and reliability of storage system. Lately, the “I” in RAID is said
to stand for independent. The main goal of RAID is to even out the widely
different rates of performance improvement of disks against those in
memory and microprocessor. RAID technology provides a disk array
arrangement in which a large number of small independent disks operate in
parallel and act as a single higher-performance logical disk in place of a
single very large disk. The parallel operation of several disks improves the
rate at which data can be read or written and allows several independent
reads and writes to be performed in parallel. In a RAID system, a combination
of data striping (also called parallelism) and data redundancy is implemented. Data
is distributed over several disks and redundant information is stored on
multiple disks. Thus, in case of disk failure the redundant information is
used to reconstruct the content of the failed disk. Therefore, failure of one
disk does not lead to the loss of data. The RAID system increases the
performance and improves reliability of the resulting storage system.

3.3.1 Performance Improvement Using Data Striping (or Parallelism)


Data striping distributes data transparently over multiple disks to make
them appear as a single large, fast disk. Data striping (or parallelism)
consists of the segmentation or splitting of data into equal-sized partitions,
which are distributed over multiple disks (called a disk array). The size of a
partition is called the striping unit, and the partitions are usually distributed
over the disk array using a round-robin algorithm. A disk array gives the user
the abstraction of having a single, very large disk. Fig. 3.5 shows a file
distributed, or striped, over four disks.
Fig. 3.5 Example of data striping

There are two types of striping, namely:


a. Bit-level striping.
b. Block-level striping.

In bit-level striping, the bits of each byte are split across multiple disks. For
example, in an array of n disks, bit i of each byte is written to disk (i mod n);
with 8 disks, bit i of each byte simply goes to disk i. The array of n disks can
be treated as a single disk with sectors that are n times the normal size and
that has n times the transfer rate. Since any n successive data bits are
distributed over all n data disks in the array, all read/write (also called
input/output) operations involve all n disks in the array. Since the smallest
unit of transfer from a disk is a block, each read/write request involves the
transfer of at least n blocks, one from each of the n disks in the array. Since
these n blocks can be read from the n disks in parallel, the transfer rate of
each read/write request is n times the transfer rate of a single disk. In such an
arrangement, every disk participates in every read/write operation and each
read/write request uses the aggregate bandwidth of all disks in the array.
Therefore, the number of read/write operations that can be processed per
second is about the same as on a single disk. In bit-level striping, the
number of disks in the array is either a multiple of 8 or a factor of 8. For
example, in an array of 4 disks, bits i and 4 + i of each byte go to disk i.
In block-level striping, blocks are split across multiple disks and the array of
disks is treated as a single large disk. Block-level striping is the most commonly
used form of data striping. In block-level striping, a read/write request of the
size of a disk block is processed by one disk in the array. When there are many
read/write requests of the size of a disk block and the requested blocks reside
on different disks, all the requests can be processed in parallel, thus reducing
the average response time of a read/write operation. With an array of n disks,
block-level striping assigns logical block i of the disk array to disk (i mod n) + 1
and uses physical block ⌊i/n⌋ of that disk to store logical block i. In block-level
striping, the blocks are given logical numbers starting from 0. With 8 disks in
the array, logical block 0 is stored in physical block 0 of disk 1, while logical
block 11 is stored in physical block 1 of disk 4. When reading a large file,
block-level striping fetches n blocks at a time in parallel from the n disks in the
array, giving a high data-transfer rate for large read/write operations. When a
single block is read, the data-transfer rate is the same as on one disk, but the
remaining n − 1 disks are free to serve other requests.
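The mapping just described can be written down directly. The function below is a small sketch of the logical-to-physical translation, using 1-based disk numbers and 0-based block numbers exactly as in the example in the text.

    def locate_block(logical_block, n_disks):
        """Return (disk number, physical block) for a block-striped array."""
        disk = (logical_block % n_disks) + 1        # disks numbered 1 .. n
        physical_block = logical_block // n_disks   # position of the block on that disk
        return disk, physical_block

    print(locate_block(0, 8))    # (1, 0): logical block 0 -> physical block 0 of disk 1
    print(locate_block(11, 8))   # (4, 1): logical block 11 -> physical block 1 of disk 4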
In a data striping arrangement, the overall performance of the storage system
increases, whereas the reliability of the system is reduced. For example, with a
mean time to failure (MTTF) of 100,000 hours for a single disk, the expected
life of the disk would be about 11.4 years. But the MTTF of an array of 100
such disks would be only 100,000/100 = 1,000 hours, or about 42 days,
assuming that failures occur independently and that the failure probability of a
disk does not change over time.
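This reliability penalty is easy to reproduce. The lines below simply restate the arithmetic of the example above; the 100,000-hour figure is the sample MTTF used in the text.

    single_disk_mttf_hours = 100_000
    disks_in_array = 100

    array_mttf_hours = single_disk_mttf_hours / disks_in_array   # independent failures assumed
    print(array_mttf_hours, "hours =", round(array_mttf_hours / 24, 1), "days")  # 1000 hours ~ 41.7 days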

3.3.2 Advantages of RAID Technology


Since read requests can be sent to any of multiple disks in parallel, the rate of read requests
(i.e. the number of reads per unit time) is almost doubled.
High data transfer rates.
Throughput of the system increases.
Improved overall performance.

3.3.3 Disadvantages of RAID Technology


Having more disks reduces overall system reliability.

3.3.4 Reliability Improvement Using Redundancy


The disadvantage of lower reliability with data striping can be overcome by
introducing redundancy of data. With redundancy, extra or duplicate
data/information is stored, which is used for rebuilding the lost information
in the event of a disk failure or crash. The redundancy increases the
MTTF of the disk array, because data are not lost even if a disk fails.
Redundancy is introduced using mirroring (also called shadowing)
technique in which data is duplicated on every disk. In this case, the logical
disk consists of two physical disks and every write is carried out on both
disks. In the event of failure of one disk, the data can be read from the other
disk. Data can be lost only when the second disk fails before the first failed
disk is repaired. Therefore, the MTTF of a mirrored disk depends on the MTTF
of the individual disks as well as on the mean time to repair (MTTR) of an
individual disk. MTTR is the average time it takes to replace a failed disk and
to restore the data on it.
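One commonly used estimate (an assumption here, not a formula stated in the text) puts the mean time to data loss of a mirrored pair at roughly MTTF squared divided by twice the MTTR, since data is lost only when the second disk fails during the repair window of the first. The snippet below plugs in illustrative figures: a 100,000-hour MTTF and a 10-hour MTTR.

    mttf_hours = 100_000     # assumed MTTF of one disk
    mttr_hours = 10          # assumed time to replace the disk and restore its data

    mean_time_to_data_loss = mttf_hours ** 2 / (2 * mttr_hours)
    print(mean_time_to_data_loss, "hours ~",
          round(mean_time_to_data_loss / (24 * 365)), "years")
    # about 57,000 years under these (idealised) assumptions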
While incorporating redundancy into a disk array, the redundant
information can either be stored on a small number of check disks or
distributed uniformly over all the disks. Most disk arrays store parity
information in an extra check disk. The check disk stores parity information
that can be used to recover from failure of any one disk in the array.

Advantages:
Improved overall reliability.

Disadvantages:
Expensive.
Redundant data.

3.3.5 RAID Levels


In RAID technology the disk array is partitioned into reliability groups. A
reliability group consists of a set of data disks and a set of check disks. The
number of check disks depends on the RAID level chosen. RAID levels are
the various alternative schemes of providing redundancy at a lower cost by
combining disk striping with mirroring and parity bits. These schemes have
different cost-performance trade-offs. Fig. 3.6 shows a scheme of RAID
levels in which four disks have been assumed to accommodate all the sample
data. Depending on the RAID level chosen, the number of check disks will
vary from 0 to 4. As shown in the figure, m indicates a mirror copy of the data
and i indicates an error-correcting bit.

3.3.5.1 RAID Level 0: Non-redundant Disk Striping


The RAID level 0 scheme uses data striping at the block level to increase
the maximum bandwidth available. At this level, no redundant information
(such as parity bits or mirroring) is maintained. Fig. 3.6 (a) shows RAID
level 0 with a disk array size of 4.
Because of non-redundancy, RAID level 0 has the best write performance
of all RAID levels, but it does not have the best read performance of all
RAID levels. Effective disk space utilisation for a RAID level 0 system is
always 100 percent.

Fig. 3.6 RAID levels

(a) RAID level 0: Non-redundant disk striping

(b) RAID level 1: Mirrored-disk

(c) RAID level 2: Error-correcting codes


(d) RAID level 3: Bit-interleaved parity

(e) RAID level 4: Block-interleaved parity

(f) RAID level 5: Block-interleaved distributed parity

(g) RAID level 6: P + Q redundancy

3.3.5.2 RAID level 1: Mirrored-disk


The RAID level 1 scheme uses mirror copy of data in which two identical
copies of data on two different disks are maintained instead of having one
copy of the data. Every write operation takes place on both disks. Since a
global system failure might occur while writing the blocks, the write
operations may not be performed simultaneously. Therefore, a block is
always written on one disk first and then on the mirror disk as a copy. Since
these two copies of each block exist on different disks, the read operations
can be distributed between the two disks. This allows parallel reads of
different disk blocks that conceptually reside on the same disk. Fig. 3.6 (b)
shows RAID level 1 with a mirror organisation that holds 4 disks' worth of data.
Since RAID level 1 does not split the data over different disks, the transfer
rate for a single request is comparable to the transfer rate of a single disk. It
is the most expensive scheme of all RAID levels and does not have the best
read performance of all RAID levels. Effective disk space utilisation for a
RAID level 1 system is 50 per cent and is independent of the number of data
disks.

3.3.5.3 RAID level 2: Error-correcting Codes


The RAID level 2 scheme uses a single-bit striping unit and employs parity
bits for error correction. It is known as the memory-style error-correcting codes
(ECC) organisation. It uses Hamming codes. It uses only three check disks in
addition to the 4 data disks, as shown in Fig. 3.6 (c). The number of check disks
increases logarithmically with the number of data disks. The disks labelled i
store the error correction bits. If one of the disks fails, the remaining bits of
the byte and the associated error-correction bits can be read from other disks
and can be used to reconstruct the damaged data. As shown in Fig. 3.6 (c),
a RAID level 2 system requires only 3 overhead disks, as compared to 4 in the
RAID level 1 scheme.
For each read/write request the aggregated bandwidth of all data disks is
used. Due to this, RAID level 2 is good for workloads with many large
read/write requests. But it is bad for small read/write requests of the size of
an individual block. The cost of the RAID level 2 scheme is less than that of
RAID level 1, but it keeps more redundant information than is necessary.
Since the Hamming code makes it possible to identify which disk has failed,
the check disks do not need to contain separate information to identify the
failed disk. Effective disk space utilisation for a RAID level 2 system with
4 data disks is about 57 per cent (4 of the 7 disks hold data). The effective
space utilisation increases with an increase in the number of data disks.

3.3.5.4 RAID level 3: Bit-interleaved Parity


The RAID level 3 scheme uses a single parity disk relying on the disk
controller to figure out which disk has failed. It uses disk controllers to
detect whether a sector has been read correctly. A single parity bit is used for
error correction as well as for detection. In case of disk failure, the system
knows exactly which sector has failed by comparing the parity bit with other
disks.
The performance characteristics of RAID level 3 are very similar to those
of RAID level 2. Its reliability overhead is a single disk, which is the lowest
possible overhead, as shown in Fig. 3.6 (d). It is less expensive compared to
RAID level 2. Since every disk participates in every read/write operation,
RAID level 3 supports a lower number of read/write (input/output) operations
per second. The RAID level 3 configuration with 4 data disks requires just one
check disk and thus its effective storage space utilisation is 80 per cent. Since
only one check disk is ever required, the effective space utilisation increases
with the number of data disks.

3.3.5.5 RAID level 4: Block-interleaved Parity


The RAID level 4 system uses block-level striping like RAID level 0,
instead of the single-bit striping of the RAID level 3 scheme. In addition, it
keeps a parity block on a separate disk for the corresponding blocks from the
n other disks, as shown in Fig. 3.6 (e). Block-level striping has the advantage
that read requests of the size of a disk block can be served entirely by the disk
where the requested block resides. Large read requests of several disk blocks
can still utilise the aggregated bandwidth of the multiple disks. If one of the
disks fails, the parity block can be used with the corresponding blocks from
the other disks to restore the blocks of the failed disk. A single block read
requires only one data disk and one check disk, allowing other requests to be
processed by other data disks.
The data-transfer rate for each access is slower, but multiple read accesses
can proceed in parallel, leading to a higher overall read/write rate. Since all the
disks can be read in parallel, the data-transfer rate for large reads is high.
Similarly, since the data and the parity can be written in parallel, large writes
also have a high data-transfer rate.
A write of a single block requires a read-modify-write cycle and accesses
only the data disk on which the block is stored, as well as the check (parity)
disk. The new parity can be obtained by computing the difference between the
old data block and the new data block and then applying that difference to the
parity block on the check disk. Thus, the parity on the check disk is updated
without reading all n disk blocks. The read-modify-write cycle involves
reading the old data block and the old parity block, modifying the two
blocks, and writing them back to disk, resulting in 4 disk accesses per write.
Thus, a single write requires 4 disk accesses: 2 to read the two old blocks and
2 to write the two new blocks.
The RAID level 4 configuration with 4 data disks require just one check
disk and thus its effective storage space utilization is 80 per cent. Since
always only one check disk is required, the effective space utilisation
increases with the number of data disks.
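
The following minimal Python sketch (an illustration, not the book's
notation) shows the block-interleaved parity idea, assuming blocks are byte
strings of equal length: the check block is the bitwise XOR of the data
blocks, a small write updates the parity with the read-modify-write cycle
just described, and a lost block is rebuilt from the surviving blocks and the
parity.

def xor_blocks(a, b):
    # Bitwise XOR of two equal-length blocks.
    return bytes(x ^ y for x, y in zip(a, b))

def parity(blocks):
    # Parity block = XOR of all the data blocks.
    p = bytes(len(blocks[0]))
    for blk in blocks:
        p = xor_blocks(p, blk)
    return p

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # 4 data disks
p = parity(data)                              # check (parity) disk

# Small write: change block 2 without reading blocks 0, 1 and 3.
old, new = data[2], b"CXCX"
p = xor_blocks(p, xor_blocks(old, new))       # new parity from old parity only
data[2] = new
assert p == parity(data)                      # same as recomputing from scratch

# Disk failure: block 1 is lost but can be rebuilt from the others plus parity.
rebuilt = parity([data[0], data[2], data[3], p])
assert rebuilt == data[1]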

3.3.5.6 RAID level 5: Block-interleaved Distributed Parity


In the RAID level 5 configuration, the parity blocks are distributed
uniformly over all n disks, instead of being stored on a single check disk.
Fig. 3.6 (f) shows the RAID level 5 configuration.
A RAID level 5 system has advantages such as:
a. Several write requests can be processed in parallel, since the parity blocks are not all on a
single check disk.
b. Read requests have a higher level of parallelism, since the data is distributed over all the
disks.

The RAID level 5 configuration has the best redundancy performance for
small and large reads and for large write requests. Small writes require a
read-modify-write cycle and are thus less efficient than in a RAID level 1
system. The effective space utilisation of a RAID level 5 system is 80 per
cent, the same as in RAID level 3 and 4 systems.

3.3.5.7 RAID level 6: P + Q Redundancy


RAID level 6, also called the P+Q redundancy scheme, uses error-
correcting codes (ECC) known as Reed-Solomon codes instead of parity. It
is similar to RAID level 5, but stores extra redundant information to guard
against multiple disk failures. As shown in Fig. 3.6 (g), in RAID level 6, 2
bits of redundant data are stored for every 4 bits of data, unlike the 1 parity
bit in RAID level 5.
The performance characteristics of RAID level 6 for small and large read
requests and for large write requests are analogous to those of a RAID
level 5 system. For small writes, the read-modify-write cycle involves 6
disk accesses instead of the 4 required in RAID level 5. Since 6 disks are
required for a storage capacity equal to 4 data disks, the effective storage
space utilisation of a RAID level 6 system is about 66 per cent.

3.3.6 Choice of RAID Levels


Factors to be considered while choosing a RAID level are as follows:
Cost of extra disk requirements.
Performance requirements in terms of number of read/write (input/output) operations.
Performance requirement in the event of disk failure.
Performance requirement during rebuild operation.

Orientation table for RAID levels is shown in Table 3.1.

Table 3.1 Orientation table for RAID levels


3.4 BASIC CONCEPT OF FILES

As explained in Chapter 1, Section 1.2.9, a file is a collection of related
records. As further explained in Sections 1.2.7 and 1.2.8, records are a
collection of logically related fields or data items made up of bytes and
words of binary-coded information, the data item being the smallest unit of
data that has meaning to its user. One or more data items or fields in a
record serve as a unique identifier, called a key, that differentiates between
records. For example, the model number (MOD-NO) of a model record in
the INVENTORY file and the employee number (EMP-NO) of an
employee record in the EMPLOYEE file of Fig. 1.9 (a) and (b),
respectively, are the unique identifiers (keys) used to differentiate between
the records of these two files. These fields are unique identifiers because
each has a unique value. Not all data items or fields can be unique
identifiers, for example model name (MOD-NAME) or employee name
(EMP-NAME) in Fig. 1.9, because the same name can be spelled
differently or shared by more than one model or employee. A file resides in
secondary storage (for example, on a magnetic disk); in other words, all
data on a secondary storage device is stored as files, each with a unique file
name. The structure of the file is known to the application software, which
manipulates it. In physical storage (such as a magnetic disk), each record of
a file has a physical storage location (called an address) associated with it.
Files can be composed to form a set of files. When the application
programs of an organisation or an enterprise use this set of files, and these
files exhibit some association or relationships between their records, then
such a collection or set of files is referred to as a database. In other words,
a database can be defined as a collection of logically related data, stored
together, that is designed to meet the information needs of an organisation.
A database and its organisation are explained in detail in Chapter 1,
Sections 1.4 and 1.5. Fig. 3.7 illustrates the information-structure hierarchy
of file-processing applications. Fields or data items, records, files and
databases are logical terms, independent of how they are realised
physically on a secondary storage device.
Fig. 3.7 Information structure hierarchy

3.4.1 File Types


Following are the three types of files that are used in database environment:
Master files.
Transaction files.
Report files.

Master file: A master file is a file that stores information of a permanent
nature about entities that the user is interested in monitoring. Master files
are used as a source of reference data for processing transactions and for
accumulating information based on the transaction data. For example, the
EMPLOYEE master file of an organisation will contain details such as
employee number (EMP-NO), name of employee (EMP-NAME), address
of employee (EMP-ADD) and so on. Similarly, an INVENTORY master
file of a manufacturing set-up may have details such as INVENTORY-ID,
PART-NO, PART-NAME and so on.
Transaction file: A transaction file is a collection of records describing the
activities (called transactions) being carried out by the organisation. As
explained in Chapter 1, Section 1.11, a transaction is a logical unit of work
encompassing a sequence of database operations such as updating a record,
deleting a record, modifying a set of records and so on. Transaction files
are used to permanently update the details in the master file.
Report file: A report file is a file that is created by extracting relevant or
desired data items from records in order to prepare reports.

3.4.2 Buffer Management


A buffer is the part of main memory that is available for storage of contents
of disk blocks. When several blocks need to be transferred from disk to main
memory and all the block addresses are known, several buffers can be
reserved in main memory to speed up the transfer. While one buffer is being
read or written, the CPU can process data in the other buffer. As shown in
Fig. 3.1, buffer manager is software that controls the movement of data
between the main memory and disk storage in units of disk blocks. To
retrieve, update, or modify information stored in a file residing on disk
storage (secondary memory), buffer manager transfers (or fetches)
appropriate portion of the file in units of disk blocks into the buffer of main
memory. Buffer manager manages the allocation of buffer space in the main
memory.
In a database system, users submit their requests through programs (or the
file management system) to the buffer manager for the transfer of desired
blocks of the file from the secondary storage disk. If the requested block is
already in the buffer, the buffer manager passes the address of the block in
main memory to the requester (user) via the file manager. If the block is
not in the buffer, the buffer manager first allocates space in the buffer of
main memory for the block. If necessary (when empty space is not
available in the buffer), the buffer manager creates space for the new block
by throwing out some other block from the buffer. Then, the buffer
manager reads the requested block from the storage disk into the empty
buffer and passes the address of the block in main memory to the requester.
The thrown-out block is written back to disk only if it has been modified
since the most recent time that it was written to the disk. Thus, the buffer
manager is like the virtual-memory manager found in operating systems.
The buffer manager uses various techniques such as buffer replacement
strategies, pinned blocks, buffer caching and so on, to serve the database
system efficiently.
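
The following minimal Python sketch illustrates the behaviour just
described, under the simplifying assumptions that the "disk" is a dictionary
of blocks and that the replacement strategy is least-recently-used: a
requested block already in the pool is returned directly, otherwise a victim
block is thrown out and written back to disk only if it is dirty.

from collections import OrderedDict

class BufferManager:
    def __init__(self, disk, pool_size=3):
        self.disk = disk                       # block_id -> block contents ("disk")
        self.pool_size = pool_size
        self.pool = OrderedDict()              # block_id -> [data, dirty flag]

    def fetch(self, block_id):
        if block_id in self.pool:              # already buffered: just return it
            self.pool.move_to_end(block_id)
            return self.pool[block_id][0]
        if len(self.pool) >= self.pool_size:   # no free buffer: evict the LRU block
            victim, (data, dirty) = self.pool.popitem(last=False)
            if dirty:                          # write back only if modified
                self.disk[victim] = data
        self.pool[block_id] = [self.disk[block_id], False]
        return self.pool[block_id][0]

    def update(self, block_id, data):
        self.fetch(block_id)                   # make sure the block is buffered
        self.pool[block_id] = [data, True]     # mark it dirty
        self.pool.move_to_end(block_id)

disk = {1: "blk-1", 2: "blk-2", 3: "blk-3", 4: "blk-4"}
bm = BufferManager(disk)
bm.fetch(1); bm.update(2, "blk-2*"); bm.fetch(3)
bm.fetch(4)                                    # evicts block 1 (clean, no write-back)
bm.fetch(1)                                    # evicts block 2, which is written back
assert disk[2] == "blk-2*"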

3.5 FILE ORGANISATION

A file organisation in a database system is essentially a technique for the
physical arrangement of the records of a file on a secondary storage device.
It is a method of arranging data on secondary storage devices and
addressing it in such a way that storage and read/write (input/output)
operations on the data or information requested by the user are facilitated.
The organisation of data in a file is influenced by a number of factors that
must be taken into consideration while choosing a particular technique.
Some of these factors are as follows:
Fast response time required to access a record (data retrieval), transfer the data to main
memory, write a record and/or modify a record.
High throughput.
Intended use (type of application).
Efficient utilisation of secondary storage space.
Efficient file manipulation operations.
Protection from failure or data loss (disk crashes, power failures and so on).
Security from unauthorised use.
Provision for growth.
Cost.

3.5.1 Records and Record Types


The method of file organisation chosen determines how the data can be
accessed in a record. Data is usually stored in the form of records. As we
described earlier, a record is an entity composed of data items or fields in a
file. Each data item is formed of one or more bytes and corresponds to a
particular field of the record. Records usually describe entities and their
attributes. For example, a purchase record represents a purchasing order
entity, and each field value in the record specifies some attribute of that
purchase order, such as ORD-NO, SUP-NAME, SUP-CITY, ORD-VAL, as
shown in Fig. 3.8 (a). Records of a file may reside on one or several pages in
the secondary storage. Each record has a unique identifier called a record-id.
A file can be of:
a. Fixed length records.
b. Variable length records.

3.5.1.1 Fixed-length Records


In a file with fixed-length records, all records on a page are of the same
slot length. Record slots are uniform, and records are arranged
consecutively within a page. Every record in the file has exactly the same
size (in bytes). Fig. 3.8 (a) shows the structure of the PURCHASE record
and Fig. 3.8 (b) shows a number of records in the PURCHASE file. As
shown, all records have the same fixed length of 50 bytes in total, if we
assume that each character occupies 1 byte of space. That means each
record uses 50 bytes and occupies a slot in the page, one after another in
serial sequence. A record is identified using both the page-id and the slot
number of the record.
Fig. 3.8 PURCHASE record

(a) Structure of record

(b) Number of records

The first operation is to insert records into the first available slots (or
empty spaces). Whenever a record is deleted, the empty slot created by the
deletion must be filled with some other record of the file. This can be
achieved using a number of alternatives. The first alternative is that the
record that came after the deleted record is moved into the empty space
formerly occupied by the deleted record. This operation continues until
every record following the deleted record has been moved ahead. Fig. 3.9
(a) shows an empty slot created by the deletion of record 5, whereas in Fig.
3.9 (b) all the subsequent records, from record 6 onwards, have moved one
slot upward. All empty slots appear together at the end of the page. Such an
approach requires moving a large number of records, depending on the
position of the deleted record in a page of the file.
Fig. 3.9 Deletion operation on PURCHASE record

(a) Empty slot created by deletion of record 5

(b) Empty slot occupation by subsequent records


(c) Empty slot occupation by last record (number 10)
(d) File header with addresses of deleted record 1, 5, and 9

The second alternative is that only the last record is shifted into the empty
slot of the deleted record, instead of disturbing a large number of records,
as shown in Fig. 3.9 (c). In both of these alternatives, it is not desirable to
move records to occupy the empty slot of a deleted record, because doing
so requires additional block accesses. As insertion of records is a more
frequently performed operation than deletion, it is more appropriate to
leave the empty slot of the deleted record vacant, so that a subsequent
insertion can reuse the space.
Therefore, a third alternative is used in which the deletion of a record is
handled by using an array of bits (or bytes), called the file header, at the
beginning of the file, one bit per slot, to keep track of free (or empty) slot
information. As long as a record is stored in a slot, its bit is ON; when the
record is deleted, its bit is turned OFF. The file header tracks this bit
becoming ON or OFF. A file header contains a variety of information about
the file, including the addresses of the slots of deleted records. When the
first record is deleted, the file header stores its slot address. This empty slot
of the first deleted record is then used to store the empty slot address of the
second deleted record, and so on, as shown in Fig. 3.9 (d). These stored
empty slot addresses of deleted records are also called pointers, since they
point to the location of a record. The empty slots of deleted records thus
form a linked list, which is referred to as a free list. Under this
arrangement, whenever a new record is inserted, the first available empty
slot pointed to by the file header is used to store it. The file header pointer
is then made to point to the next available empty slot for storing the next
inserted record. If no empty slot is available, the new record is added at the
end of the file.
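
The following minimal Python sketch illustrates the free-list arrangement
just described, assuming a page of fixed-length slots held in a list: the "file
header" holds the address of the first deleted slot, each deleted slot holds
the address of the next one, and insertion reuses the slot at the head of the
chain.

class FixedLengthPage:
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.free_head = None                  # file-header pointer to first free slot

    def insert(self, record):
        if self.free_head is not None:         # reuse a deleted slot first
            slot = self.free_head
            self.free_head = self.slots[slot]  # follow the chain of pointers
        else:
            slot = self.slots.index(None)      # otherwise add at the end of the file
        self.slots[slot] = record
        return slot

    def delete(self, slot):
        self.slots[slot] = self.free_head      # deleted slot stores the next pointer
        self.free_head = slot                  # header now points at this slot

page = FixedLengthPage(5)
for r in ("rec-1", "rec-2", "rec-3", "rec-4"):
    page.insert(r)
page.delete(1); page.delete(3)                 # free list: header -> slot 3 -> slot 1
assert page.insert("rec-5") == 3               # most recently freed slot reused first
assert page.insert("rec-6") == 1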

Advantages of fixed-length record:


Because the space made available by a deleted record is exactly the space needed to insert a
new record, insertion and deletion for files are simple to implement.

3.5.1.2 Variable-length Records


In a file with variable-length records, the records on a page are not all of
the same length; different records in the file have different sizes. A file in a
database system can have multiple record types, records with variable field
lengths, or repeating fields in a record. The main problem with variable-
length records is that when a new record is to be inserted, an empty slot of
just the right length is required. If the empty slot is smaller than the new
record, it cannot be used; similarly, if the empty slot is too big, extra space
is wasted. Therefore, it is important that just the right amount of space is
allocated while inserting new records, and that records are moved to fill the
space created by deletions so that all the free space in the file remains
contiguous.
To implement variable-length records, the structure of the file is first
made flexible, as shown in Fig. 3.10. This structure is related to the
purchasing system database of an organisation in which the PURCHASE-
LIST record has been defined with an array (PURCHASE-INFO) having
an arbitrary number of elements, so that the number of elements in the
array is not limited. However, any actual record will have a specific
number of elements in its array. There is no limit on how large a record can
be (except the limit imposed by the size of the disk storage device).
Fig. 3.10 Flexible structure of PURCHASE-LIST record

Byte-string representation: Different techniques are used to implement
variable-length records in a file. Byte-string representation is one of the
simplest techniques for implementing variable-length records. In the byte-
string technique, a special symbol (⊥), called end-of-record, is attached at
the end of each record, and each record is stored as a string of consecutive
bytes. Fig. 3.11 shows an implementation of the byte-string technique
using the end-of-record symbol to represent the fixed-length records of
Fig. 3.9 (a) as variable-length records.

Fig. 3.11 Byte-string representation of variable-length records
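
The following minimal Python sketch illustrates the byte-string technique,
with the byte 0x00 standing in for the end-of-record symbol (⊥) and
commas assumed to separate field values; the record values themselves are
illustrative only.

END_OF_RECORD = b"\x00"     # stands in for the end-of-record symbol

def pack(records):
    """Store variable-length records consecutively in one byte string."""
    return b"".join(",".join(r).encode() + END_OF_RECORD for r in records)

def unpack(data):
    """Scan the byte string and split it back into individual records."""
    return [chunk.decode().split(",")
            for chunk in data.split(END_OF_RECORD) if chunk]

purchases = [
    ["KLY System", "P3-101", "45000"],
    ["Concept Shapers", "P3-102", "P3-107", "12000"],   # an extra ORD-NO field
]
stored = pack(purchases)
assert unpack(stored) == purchases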

Disadvantages of byte-string representation:


It is very difficult to reuse the empty slot space formerly occupied by a deleted record.
A large number of small fragments of disk storage are wasted.
There is hardly any space for future growth of records.

Due to the above disadvantages and other limitations, the byte-string
technique is not usually used for implementing variable-length records.
Fixed-length representation: Fixed-length representation is another
technique for implementing variable-length records in a file. In this
technique, one or more fixed-length records are used to represent a
variable-length record. Two methods, namely (a) reserved space and (b)
list representation, are used to implement it. In the reserved-space method,
a fixed-length record of a size equal to the maximum record length in the
file (a length that is never exceeded) is used. The unused space of records
shorter than the maximum size is filled with a special null or end-of-record
symbol. Fig. 3.12 shows the fixed-length representation of the file of Fig.
3.11. As shown, the suppliers KLY System, Concept Shapers and Trinity
Agency have a maximum of two order numbers (ORD-NO). Therefore, the
PURCHASE-INFO array of the PURCHASE-LIST record contains exactly
two entries, for a maximum of two ORD-NO per supplier. Suppliers with
fewer than two ORD-NO will have records with a null field (symbol ⊥) in
the place of the second ORD-NO. The reserved-space method is useful
when most records have a length close to the maximum; otherwise, a
significant amount of space may be wasted.

Fig. 3.12 Reserved-space method of Fixed-length representation for implementing variable-length


records

In the list representation (also called the linked-list method), a list of
fixed-length records, chained together by pointers, is used to implement a
variable-length record. This method is similar to the file-header method of
Fig. 3.9 (d), except that in the file-header method pointers are used to
chain together only deleted records, whereas in list representation, pointers
chain together all records pertaining to the same supplier. Fig. 3.13 (a)
shows an example of the linked-list method of fixed-length representation
for implementing variable-length records.
As shown in Fig. 3.13 (a), the linked-list method has the disadvantage of
wasting space in all records except the first in the chain, because only the
first record carries the supplier name (SUP-NAME) and order value (ORD-
VAL); the subsequent repeating supplier records do not need these fields.
Even though they remain empty, field space is still reserved for SUP-
NAME in all records (for example, records 6, 7 and 8), lest the records not
be of fixed length. To overcome this problem, two types of block
structures, namely (a) anchor-block and (b) overflow-block structures, are
used. Fig. 3.13 (b) shows the anchor-block and overflow-block structures
of the linked-list configuration of fixed-length record representation for
implementing variable-length records. The anchor block contains the first
record of each chain, while the overflow block contains the records other
than the first record of a chain, that is, the repeating orders of the same
supplier. Thus all records within a block have the same length, even though
not all records in the file have the same length.
Fig. 3.13 Linked-list method of fixed-length representation for implementing variable-length records

(a) Linked-list method with a lot of empty (unused) space

(b) Anchor-block and overflow-block structures

3.5.2 File Organisation Techniques


As explained above, a file organisation is a way of arranging the records in
a file when the file is stored on secondary (magnetic disk) storage. The
method of file organisation chosen determines how the data stored on the
secondary storage device can be accessed. The file organisation also
affects the types of applications that can use the data and the time and cost
necessary to do so. The following operations are generally performed on a
file:
Scanning or fetching records from the file through the buffer pool.
Searching records that satisfy an equality selection (that is, fetching a specific record).
Searching records between a particular range.
Inserting records into a file.
Deleting a record from the file.

There are different types of file organisations that are used by
applications. The operations to be performed, as discussed above, and the
selection of storage device influence the choice of a particular file
organisation. The different types of file organisation used in a database
environment are as follows:
Heap file organisation.
Sequential file organisation.
Indexed-sequential file organisation.
Direct or hash file organisation.

3.5.2.1 Heap File Organisation


In a heap file (also called a pile or serial file) organisation, records are
collected in their arrival order. A heap file has no particular order and is
therefore equivalent to a file of unordered records. Wherever empty space
is available in a block, any record can be placed there, and pointers are
used to link the record blocks. If there is no space available to
accommodate the inserted record, a new block is allocated and the new
record is placed into it. Thus, a heap or serial file is generated by
appending records at the end. When the file is organised in pages, records
are always added to the last page until there is insufficient room to hold a
complete record. At this point a new page is allocated and the new record
is inserted. Fig. 3.14 shows the addition of records to a heap file of hospital
patient records. Fig. 3.14 (a) shows a PATIENT heap file with two pages of
records. Page 1 has three records and is full, whereas Page 2 has two
records and can accommodate one more. Fig. 3.14 (b) shows the same file
after two new records have been added. Since Page 2 could accommodate
only one of them and was then full, a new page (Page 3) was allocated to
accommodate the second new record.
If records are randomly appended, the logical ordering of the file with
respect to a given key bears no correspondence to the physical sequence.
Updating an individual record or a group of records can be done if it is
assumed that the records are of fixed length and modifications do not
change their size. Retrieval of a particular record calls for searching the
file from beginning to end.

Fig. 3.14 Heap file of PATIENT record

(a) Records with two pages before adding new records

(b) Records with three pages after adding new records

As discussed in Section 3.5.1 (under the heading of fixed-length records),
when a record is deleted, all records following the deleted record can be
moved forward, or the last record in the file can be brought into the place
vacated by the deleted record. However, these methods require many
additional accesses, possibly up to the end of the file. A more practical
method is therefore used in which a record is deleted only logically: a flag
(called a deletion bit) is set whenever a record is deleted, and the space of
the deleted record is reused by a future insertion.
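
The following minimal Python sketch pulls the above together for a heap
file, assuming a page holds three records: insertion reuses a logically
deleted slot if one exists or appends to the last page (allocating a new page
when it is full), retrieval scans the file from the beginning, and deletion
merely sets the deletion bit. The patient names are illustrative.

PAGE_SIZE = 3                                   # records per page (assumed)

class HeapFile:
    def __init__(self):
        self.pages = [[]]                       # each page: list of [record, deleted]

    def insert(self, record):
        for page in self.pages:                 # reuse a logically deleted slot if any
            for slot in page:
                if slot[1]:
                    slot[0], slot[1] = record, False
                    return
        if len(self.pages[-1]) == PAGE_SIZE:    # last page full: allocate a new page
            self.pages.append([])
        self.pages[-1].append([record, False])

    def find(self, key, value):
        for page in self.pages:                 # heap search: scan beginning to end
            for slot in page:
                if not slot[1] and slot[0][key] == value:
                    return slot[0]
        return None

    def delete(self, key, value):
        for page in self.pages:
            for slot in page:
                if not slot[1] and slot[0][key] == value:
                    slot[1] = True              # logical delete: set the deletion bit
                    return

f = HeapFile()
for name in ("Ram", "Sita", "Arjun", "Gita"):
    f.insert({"PATIENT-NAME": name})
assert len(f.pages) == 2                        # the fourth record forced a new page
f.delete("PATIENT-NAME", "Sita")
f.insert({"PATIENT-NAME": "Mohan"})             # reuses the deleted slot on the first page
assert f.find("PATIENT-NAME", "Mohan") is not None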

Advantages of a heap or serial file:


Insertion of new record is less time consuming.
It has a fill factor of 100 per cent because each page is filled to capacity as new records are
added.
Space utilisation is high, making the heap suitable for conserving space in large files.

Limitations of a heap or serial file:


Slow retrieval of record.
High updating cost of records.
High cost of record retrieval.
Has restricted use.

Uses of a heap or serial file:


Storing small files covering only a few pages.
Where data is difficult to organise.
When data is collected prior to processing.
Very efficient for bulk-loading large volumes of data.

3.5.2.2 Sequential File Organisation


A sequential file (also called an ordered file) is a set of contiguously stored
records on a physical storage device such as a magnetic disk, tape or CD-
ROM. A sequential file can be created by sorting the records in a heap file.
In a sequential file organisation, records are stored in sequential (ascending
or descending) order onto the secondary storage media. The logical and
physical sequence of records is the same in a sequential file organisation. A
search-key is used to sort all records in sequential order. A sequential file is
organised for efficient processing of records in sorted order based on this
search-key. A search-key can be any attribute (field item) or set of attributes
and need not necessarily be a primary key or a super key. A key has already
been defined in Chapter 1, Section 1.3.1.4. To locate a particular record, a
program scans the file from the beginning until the desired record is
located. If the ordering is based on a unique key, the scan stops when one
matching record has been found. Unlike the heap file, where all records
must be scanned to locate those matching a search-key, the sequential scan
stops as soon as a greater value is found. Common examples of sequential-
file organisation are the alphabetical list of persons in a telephone
directory, an English/Hindi dictionary and so on. Fig. 3.15 shows an
example of a sequential file that has been obtained by sorting the
PURCHASE file of Fig. 3.12 on the primary key SUP-NAME in ascending
order. Thus, in a
sequential file, records are maintained in the logical sequence of their search
(primary) key values.

Fig. 3.15 Sequential file sorted in ascending order
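
The following minimal Python sketch shows this early-stopping scan on a
file sorted by SUP-NAME; the supplier names are illustrative.

# A sequential file ordered on the search key SUP-NAME.
purchase_file = sorted(
    ["Concept Shapers", "JUSCO Ltd", "KLY System", "Trinity Agency"])

def sequential_lookup(records, key):
    for record in records:
        if record == key:
            return record
        if record > key:        # passed the place where the key would be
            break               # stop early: the key is not in the file
    return None

assert sequential_lookup(purchase_file, "KLY System") == "KLY System"
assert sequential_lookup(purchase_file, "Baker & Co") is None   # stops at the first record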

In the case of a multiple-key search (or sort), the first key is called the
primary key while the others are called secondary keys. Fig. 3.16 (a) shows
a simple EMPLOYEE file of an organisation, while Fig. 3.16 (b) shows the
same file sorted on three keys in ascending order. As shown in Fig. 3.16
(b), the first key (primary key) is the employee’s last name (EMP-
LNAME), the second key (secondary key) is the employee’s identification
number (EMP-ID) and the third key (secondary key) is the country
(COUNTRY) to which the employee belongs. Whenever an attribute (field
item) or a set of attributes is added to the record, the entire file has to be
reorganised to effect the addition of the new attribute in each record.
Therefore, extra fields are always kept in the sequential file for future
addition of items.
Fig. 3.16 EMPLOYEE payroll file of an organisation

(a) Unsorted

(b) Sorted on multiple key
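
The following minimal Python sketch shows a multiple-key sort of the kind
illustrated in Fig. 3.16 (b), using EMP-LNAME as the primary key and
EMP-ID and COUNTRY as secondary keys; the employee values are made
up for illustration.

employees = [
    {"EMP-LNAME": "Singh", "EMP-ID": 123243, "COUNTRY": "India"},
    {"EMP-LNAME": "Brown", "EMP-ID": 100221, "COUNTRY": "UK"},
    {"EMP-LNAME": "Singh", "EMP-ID": 100198, "COUNTRY": "USA"},
]

# Sort on (primary key, secondary key, secondary key) in ascending order.
sorted_file = sorted(
    employees,
    key=lambda e: (e["EMP-LNAME"], e["EMP-ID"], e["COUNTRY"]))

assert [e["EMP-ID"] for e in sorted_file] == [100221, 100198, 123243]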

Sequential file organisation can exist on all types of secondary storage
devices such as magnetic tapes, magnetic disks and so on. Due to the
physical nature of magnetic tape storage, the sequential file records stored
on it are processed sequentially: accessing a particular record requires
accessing all previous records in the file along the length of the tape. When
sequential file records are stored on magnetic disks, they can be processed
either sequentially or directly. In the case of sequential processing of
records on disk, while retrieving a particular record all the records
preceding it have to be processed first. Thus the entire retrieval process
becomes slow if the target record resides towards the end of the file.
The efficiency of a sequential file organisation depends upon the type of
query. If the query is for a specific record identified by a record key (for
example, EMP-ID in Fig. 3.16 (b)), the file is searched from the beginning
until the record is found. The retrieval of a record from a sequential file, on
average, requires access to half the records in the file. Thus, making
enquiries on a sequential file is inefficient and very time consuming for
large files. If the query is a batch operation, the batched requests are sorted
in the order of the search-key of the sequential file and the processing
(update) of the sequential file is then done in a single pass. Such a file,
containing the updates to be made to a sequential file, is also referred to as
a transaction file. The processing efficiency of the batch operation
improves and the cost of processing reduces.

Advantages of a sequential file:


Sequential retrieval of records on a primary key is very fast.
On an average, queries can be performed in half the time taken for a similar query on a heap
file with a similar fill factor.
Ease of access to the next record.
Simplicity of organisation.
Absence of auxiliary data structure.
Creation of an automatic backup copy.

Limitations of a sequential file:


Simple queries are time consuming for large files.
Multiple key retrieval requires scanning of an entire file.
While deleting records, it creates wasted space and requires reorganising of records.
Insertion updates of records require creation (rewriting) of a new file.
Inserting and deleting records are expensive operations because the records must remain
physically ordered.

Uses of a sequential file:


For range queries and some partial match queries without the need for a complete scan of the
file.
Processing of high percentage of records in a file.
Batch oriented commercial processing.

3.5.2.3 Indexed-sequential File Organisation


An indexed-sequential file organisation is a direct processing method that
combines the features of both sequential and direct access of records. A
sequential (sorted on primary keys) file that is indexed is called an indexed
sequential file. As in case of a sequential file, the records in indexed-
sequential file are stored in the physical sequence by primary key. In
addition, an index of record locations is stored on the disk. Thus, indexes
associated with the file are provided to quickly locate any given record for
random processing. The indexes and the records are both stored on disk. The
index provides for random access of records, while the sequential nature of
the file provides easy access to the subsequent records as well as sequential
processing. This method allows records to be accessed sequentially for
applications requiring the updating of large numbers of records as well as
providing the ability to access records directly in response to user queries.
An indexed-sequential file organisation consists of three components, namely:
a. Primary data storage.
b. Overflow area.
c. Hierarchy of indices.

The primary data storage area (also called prime area) is an area in which
records are written when an indexed-sequential file is originally created. It
contains the records written by the users’ programs. The records are written
in data blocks in ascending key sequence. These data blocks are in turn
stored in ascending sequence in the primary data storage area. The data
blocks are sequenced by the highest key of logical records contained in
them. The prime area is essentially a sequential file.
The overflow area is essentially used to store new records, which cannot
be otherwise inserted in the prime area without rewriting the sequential file.
It permits the addition of records to the file whenever a new record is
inserted in the original logical block. Multiple records belonging to the same
logical area may be chained to maintain logical sequencing. A pointer is
associated with each record in the prime area which indicates that the next
sequential record is stored in the overflow area. Two types of overflow areas
are generally used, which are known as:
a. Cylinder overflow area.
b. Independent overflow area.
Either or both of these overflow areas may be specified for a particular
file. In a cylinder overflow area, spare tracks in every cylinder are reserved
for accommodating the overflow records, whereas an independent
overflow area may receive overflow records from anywhere in the prime
area.
In the case of a random enquiry or update, a hierarchy of indices is
maintained and accessed to get the physical location of the desired record.
The data of an indexed-sequential file is stored on cylinders, each of which
is made up of a number of tracks. Some of these tracks are reserved for the
primary data storage area and others are used for an overflow area
associated with the primary data area on the cylinder. A track index is
written and maintained for each cylinder. It contains an entry for each
primary data track in the cylinder as well as an entry to indicate whether
any records have overflowed from the track.
Fig. 3.17 shows an example of indexed-sequential file organisation and
access. Fig. 3.17 (a) shows how an overflow area is created. As shown,
when a new record 512 is inserted in an existing logical block having
records 500, 505, 510, 515, 520 and 525, an overflow area is created and
record 525 is shifted into it. Fig. 3.17 (b) illustrates the relationships
between the different levels of indices. Locating a particular record
involves a search of the master index to find the proper cylinder index (for
example, Cyl index 1) with which the record is associated. Next, a search
is made of the cylinder index to find the cylinder (for example, Cyl 1) on
which the record is located. A search of the track index is then made to
find the track number on which the record resides (for example, Track 0).
Finally, a search of the track is required to locate the desired record. The
master index resides in main memory during file processing and remains
there until the file is closed. However, a master index is not always
necessary and should be requested only for large files. The master index is
the highest level of index in an indexed-sequential file organisation.
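
The following minimal Python sketch illustrates the search path just
described through the hierarchy of indices (master index, cylinder index,
track index); each index entry pairs the highest key it covers with a pointer,
and all keys and labels are illustrative.

import bisect

def lookup(index, key):
    """Return the pointer of the first entry whose highest key is >= key."""
    highs = [high for high, _ in index]
    return index[bisect.bisect_left(highs, key)][1]

master_index   = [(299, "cyl-index-1"), (599, "cyl-index-2")]
cylinder_index = {"cyl-index-1": [(149, "cyl-1"), (299, "cyl-2")],
                  "cyl-index-2": [(449, "cyl-3"), (599, "cyl-4")]}
track_index    = {"cyl-2": [(224, "track-0"), (299, "track-1")]}

key = 220
cyl_idx = lookup(master_index, key)              # -> "cyl-index-1"
cylinder = lookup(cylinder_index[cyl_idx], key)  # -> "cyl-2"
track = lookup(track_index[cylinder], key)       # -> "track-0", then scan the track
assert (cyl_idx, cylinder, track) == ("cyl-index-1", "cyl-2", "track-0")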
An example of an indexed-sequential file organisation, developed by IBM,
is the Indexed Sequential Access Method (ISAM). Since the records are
organised and stored sequentially in ISAM files, adding new records to the
file can be a problem. To overcome this problem, ISAM files maintain an
overflow area for records added after a file is created. Pointers are used to
find the records in their proper sequence when the file is processed
sequentially. If the overflow area becomes full, an ISAM file can be
reorganised by merging the records in the overflow area with the records in
the primary data storage area to produce a new file with all the records in
the proper sequence. The Virtual Storage Access Method (VSAM) is an
advanced version of the ISAM file in which virtual storage methods are
used. It is a variation of the B+-tree discussed in Section 3.6.3. In VSAM
files, instead of using an overflow area for adding records, the new records
are inserted into the appropriate place in the file and the records that follow
are shifted to new physical locations. The shifted records are logically
connected through pointers located at the end of the inserted records. Thus,
VSAM files do not require reorganisation, as is the case with ISAM files.
The VSAM method is much more efficient than ISAM files.

Fig. 3.17 Indexed-sequential file organisation

(a) Shifting of the last record into overflow area while inserting a record
(b) Relationship between different levels of indices

Advantages of an indexed-sequential file


Allows records to be processed efficiently in both sequential and random order, depending on
the processing operations.
Data can be accessed directly and quickly.
Centrally maintained data can be kept up-to-date.

Limitations of an indexed-sequential file


Only the key attributes determine the location of the record, and therefore retrieval operations
involving non-key attributes may require searching of the entire file.
Lowers the computer system’s efficiency.
As the file grows, performance deteriorates rapidly because of overflows, and consequently
reorganisation is required. This problem, however, is taken care of by VSAM files.

Uses of an indexed-sequential file


Most popular in commercial data processing applications.
Applications where rate of insertion is very high.

3.5.2.4 Direct (or hash) File Organisation


In a direct (also called hash) file organisation, mapping (also called
transformation) of the search key value of a record is made directly to the
address of the storage location at which that record is to reside in the file.
One mechanism used for doing this mapping is called hashing. Hashing is
the process of direct mapping by performing some arithmetic manipulation.
It is a method of record addressing that eliminates the need for maintaining
and searching indexes. Elimination of the index avoids the need to make two
trips to secondary storage to access a record; one to read the index and the
other to access the file. In a hashed file organisation, the records are
clustered into buckets. A bucket is either one disk block or a cluster of
contiguous blocks. The hashing function maps a key into a relative bucket
number, rather than assigning an absolute block address to the bucket. A
table maintained in the file header converts the bucket number into the
corresponding disk block address, as illustrated in Fig. 3.18.
Fig. 3.18 Matching of bucket numbers and disk block addresses

In a hash file, the data is scattered throughout the disk in what may appear
to be a random order. The processing of a hash file depends on how the
search-key set of the records is transformed (or mapped) into addresses on
the secondary storage device (for example, a hard disk) to locate the
desired record. The search condition must be an equality condition on a
single field, called the hash field of the file. In most cases, the hash field is
also a key field of the file, in which case it is called the hash key. In
hashing operations, there is a function h, called a hash function or
randomising function, that is applied to the hash field value v of a record.
This operation yields the address of the disk block in which the record is
stored. A search for the record within the block can then be carried out in
the main memory buffer. The function h(v) indicates the number of the
bucket in which the record with key value v is to be found. It is desirable
that h “hashes” v, that is, that h(v) takes each of its possible values with
roughly equal probability as v ranges over likely collections of values for
the key.
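
The following minimal Python sketch illustrates this mechanism: a hash
function h(v) maps a hash-key value to a relative bucket number, and a
table standing in for the file header converts bucket numbers into disk
block addresses; the key values and block addresses are illustrative.

NUM_BUCKETS = 4

def h(v):
    """Hash (randomising) function: fold the key into a relative bucket number."""
    return sum(ord(c) for c in str(v)) % NUM_BUCKETS

# File-header table converting bucket numbers into disk block addresses.
bucket_to_block = {0: "block-17", 1: "block-05", 2: "block-21", 3: "block-09"}
buckets = {b: [] for b in range(NUM_BUCKETS)}   # in-memory stand-in for the blocks

def insert(record, key_field):
    buckets[h(record[key_field])].append(record)

def find(value, key_field):
    # Only the one bucket given by h(value) has to be searched.
    for record in buckets[h(value)]:
        if record[key_field] == value:
            return record, bucket_to_block[h(value)]
    return None

insert({"EMP-ID": "E1012", "EMP-NAME": "Ravi"}, "EMP-ID")
record, block = find("E1012", "EMP-ID")
assert record["EMP-NAME"] == "Ravi" and block == bucket_to_block[h("E1012")]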

Advantages of a direct file


Data can be accessed directly and quickly.
Centrally maintained data can be kept up-to-date.

Limitations of a direct file


Because all data must be stored on disks, the hardware is expensive.
Because files are updated directly and no transaction files are maintained, there may not be
any backup in case a file is destroyed.

3.6 INDEXING

An index is a table or a data structure that is maintained to determine the
location of rows (records) in a file that satisfy some condition. Each entry
in the index table consists of the value of the key attribute for a particular
record and a pointer to the location where the record is stored. Thus, each
index entry corresponds to a data record on the secondary storage (disk)
device. A record is retrieved from the disk by first searching the index
table for the address of the record and then reading the record from this
address. Fig. 3.17 (b) illustrates an example of maintaining index tables,
for example, the master index, cylinder index, track index and so on. A
library catalogue organised author-wise or title-wise is an example of
indexing. There are mainly two types of indexing that are used in a
database environment:
Ordered indexing.
Hashed indexing.

Ordered indexing is based on a sorted ordering of the values of the
records, whereas hashed indexing is based on the values of the records
being uniformly distributed using a hash function.
There are two types of ordered indexing, namely:
Dense indexing.
Sparse indexing.

In the case of dense indexing, an index record or entry appears for every
search-key value in the file. The index record contains the search-key
value and a pointer to the first data record with that search-key value. The
rest of the records with the same search-key value are stored sequentially
after the first record. To locate a record, the index entry with the search-
key value is found, the record pointed to by that index entry is read, and
the pointers in the file are then followed until the desired record is found.
In the case of sparse indexing, an index record is created only for some
of the search-key values. As in dense indexing, each such index record
contains a search-key value and a pointer to the first data record with that
search-key value. To locate a record, the index entry with the largest
search-key value that is less than or equal to the search-key value of the
desired record is found. The search starts from the record pointed to by
that index entry and follows the pointers in the file until the desired record
is located.

Fig. 3.19 Dense indexing

Let us take as an example the PURCHASE record of Fig. 3.8, considering
only three data items, namely supplier name (SUP-NAME), order number
(ORD-NO) and order value (ORD-VAL). Fig. 3.19 shows dense indexing
for the records of the PURCHASE file. Suppose that we are looking up the
records for the supplier name “KLY System”. Using the dense index, the
pointer is followed directly to the first record with SUP-NAME “KLY
System”. This record is processed and the pointer in that record is
followed to locate the next record in order of the search-key (SUP-
NAME). The processing of records continues until a record for a supplier
name other than “KLY System” is encountered. In the case of sparse
indexing, as shown in Fig. 3.20, there is no index entry for the supplier
name “KLY System”. Since the last entry in alphabetical order before
“KLY System” is “JUSCO Ltd”, that pointer is followed. The PURCHASE
file is then read in sequential order until the first “KLY System” record is
found, and processing begins at that point.

Fig. 3.20 Sparse indexing
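
The following minimal Python sketch mirrors the sparse-index search just
described: the entry with the largest key not exceeding the search key is
located, and the file is then scanned sequentially from the record it points
to, stopping once a larger key is seen; entries and record positions are
illustrative.

import bisect

# Sequential PURCHASE file, ordered on the search key SUP-NAME only.
data_file = ["Concept Shapers", "Concept Shapers", "JUSCO Ltd",
             "KLY System", "KLY System", "Trinity Agency"]

# Sparse index: entries only for some search-key values (key, record position).
sparse_index = [("Concept Shapers", 0), ("JUSCO Ltd", 2), ("Trinity Agency", 5)]

def sparse_lookup(key):
    keys = [k for k, _ in sparse_index]
    # Largest indexed key that is <= the key being searched for.
    start = sparse_index[max(0, bisect.bisect_right(keys, key) - 1)][1]
    matches = []
    for i in range(start, len(data_file)):      # sequential scan from that record
        if data_file[i] == key:
            matches.append(i)
        elif data_file[i] > key:                # passed the key: stop scanning
            break
    return matches

assert sparse_lookup("KLY System") == [3, 4]    # reached via the "JUSCO Ltd" entry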

The choice between ordered indexing and hash indexing is evaluated on
the basis of a number of factors, which are mentioned below:
Access type: This type of access includes finding records with specified attribute value (for
example, EMP-ID = 123243) and finding records whose attribute values fall in a specified
range (for example, SALARY between 4000 and 5000), as shown in Fig. 3.16.
Access time: It is the time taken to find a particular data item or set of items using the given
technique.
Insertion time: It is the time taken to insert a new data item or record into the file. Insertion
time is the total of the time it takes to find the correct place to insert the new record and the
time taken to update the index structure.
Deletion time: It is the time taken to delete a data item or record. Deletion time is the sum of
the time it takes to find the record for deletion, time for deleting the record and the time taken
to update the index structure.
Space overhead: It is the additional space occupied by the index structure.

3.6.1 Primary Index


A primary index is an ordered file whose records are of fixed length with
two fields. The first field is of the same data type as the ordering key field
(called the primary key) of the data file and the second field is a pointer to a
disk block (a block address). There is one index entry (or index record) in
the index file for each block in the data file. Each index entry has the value
of primary key field for the first record in a block and a pointer to that block
as its two field values. Primary index (also called clustering index)
associates a primary key with the physical location in which a record is
stored. When a user requests a record, the disk operating system first loads
the primary index into the computer’s main memory and searches the index
sequentially for the primary key. When it finds the entry for the primary key,
it then reads the address in which record is stored. The disk system then
proceeds to this address and reads the contents of the desired record. In
primary index, the file containing the records is sequentially ordered.
Indexed-sequential file organisation, explained in Section 3.5.2.3, is an
example of a primary index. Since a primary index contains only two
pieces of information, the primary key of the record and the physical
address of the record, the search operation is very fast.

3.6.2 Secondary Index


A secondary index is also an ordered file with two fields. The first field is of
the same data type as some non-ordering field of the data file that is an
indexing field. The second field is either a block pointer or record pointer.
Secondary index (also called non-clustering index) is used to search a file on
the basis of secondary keys. The search key of secondary index specifies an
order that is different from the sequential order of the file. For example, in
case of EMPLOYEE payroll file of Fig. 3.16, employee identification
number (EMP-ID) may be used as primary key for constructing primary
index, whereas employee’s last and first names (EMP-LNAME and EMP-
FNAME) may be used to construct secondary index. Therefore, a search
operation can be made by the user to access the records by either the
employee identification number (EMP-ID) or the employee’s names (EMP-
FNAME and EMP-LNAME).
3.6.3 Tree-based Indexing


A tree-based indexing system is widely used in practical systems as the
basis for both primary and secondary key indexing. Unlike natural trees,
these trees are depicted upside down, with the root at the top and the leaves
at the bottom. The root is a node that has no parent; it can have only child
nodes. Leaves, on the other hand, have no children, or rather their children
are null. A tree can be defined recursively as follows:
1. An empty structure is an empty tree.
2. If t1, …, tk are disjoint trees, then the structure whose root has as its children the roots of
t1, …, tk is also a tree.
3. Only structures generated by rules 1 and 2 are trees.

Fig. 3.21 shows an example of trees. Each node has to be reachable from
the root through a unique sequence of arcs called a path. The number of arcs
in a path is called the length of the path. The length of the path from the root
to the node plus 1 is called the level of the node. The height of a non-empty tree is
the maximum level of a node in the tree. The empty tree is a legitimate tree
of height 0 and a single node is a tree of height 1. This is the only case in
which a node is both the root and a leaf. The level of a node must be
between the levels of the root (that is 1) and the height of the tree. Fig. 3.21
shows an example of a tree structure that reflects the hierarchy of a
manufacturing organisation.
In a tree-based indexing scheme, the search generally starts at the root
node. Depending on the conditions that are satisfied at the node under
examination, a branch is made to one of several nodes and the procedure is
repeated until a match is found or a leaf node is encountered. A leaf node is
the last node, beyond which there are no more nodes available. There are
several types of tree-based index structures; however, detailed explanations
of B-tree indexing and B+-tree indexing are provided in this section.
Fig. 3.21 Example of trees

3.6.3.1 B-tree Indexing


B-tree indexing operates closely with secondary storage and can be tuned
to reduce the impediments imposed by this storage. The size of each node
of a B-tree is as large as the size of a block. The number of keys in one
node can vary depending on the sizes of the keys, the organisation of the
data and the size of a block; the block size is the size of each node of a B-
tree. The order of a B-tree specifies the maximum number of children.
Sometimes the nodes of a B-tree of order m are defined as having k keys
and k + 1 references, where m ≤ k ≤ 2m; in this definition the order m
specifies the minimum number of keys in a node. A B-tree is always at
least half full, has few levels and is perfectly balanced. It is an access
method supported by a number of commercial database systems such as
DB2, SQL/DS, ORACLE and NonStop/SQL. It is also a dominant access
method used by other relational database management systems such as
Ingres and Sybase. The B-tree provides fast random and sequential access
as well as the dynamic maintenance that virtually eliminates the overflow
problems that occur in the indexed-sequential and hashing methods.

Fig. 3.22 Example of a B-tree

Characteristics of a B-tree index:


The root has at least two sub-trees unless it is a leaf.
Each non-root and each non-leaf node holds k-1 keys and k references to sub-trees where
[m/2] ≤ k ≤ m.
Each leaf node holds k-1 keys where [m/2] ≤ k ≤ m.
All leaves are on the same level.

3.6.3.2 B+-tree Indexing


A B+-tree index is a balanced tree in which the internal nodes direct the
search operation and the leaf nodes contain the data entries. Every path
from the root to a leaf of the tree is of the same length. Since the tree
structure grows and shrinks dynamically, it is not feasible to allocate the
leaf pages sequentially. In order to retrieve all leaf pages efficiently, they
are linked using page pointers. In a B+-tree index, references to data are
made only from the leaves. The internal nodes of the B+-tree form an
index for fast access to the data, called the index set. The leaves have a
different structure from the other nodes of the B+-tree. Usually the leaves
are linked sequentially to form a sequence set, so that scanning this list of
leaves yields the data in ascending order. Hence, a B+-tree index is a
regular B-tree plus a linked list of data. The B+-tree index is a widely used
structure.

Characteristics of a B+-tree index:


It is a balanced tree.
A minimum occupancy of 50 per cent is guaranteed for each node except the root.
Since files grow rather than shrink, deletion is often implemented by simply locating the data
entry and removing it, without adjusting the tree.
Searching for a record requires just a traversal from the root to the appropriate leaf.
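
The following minimal Python sketch (an illustration, not a full
implementation) captures these characteristics for a two-level B+-tree:
internal nodes hold only routing keys, data entries live in the leaves, and
the leaves are chained into a sequence set that yields the keys in ascending
order. The key values are illustrative.

import bisect

class Leaf:
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.next = None                      # link to the next leaf (sequence set)

class Internal:
    def __init__(self, separators, children):
        self.separators = separators          # routing keys only, no data entries
        self.children = children

def search(node, key):
    while isinstance(node, Internal):         # walk a single root-to-leaf path
        node = node.children[bisect.bisect_right(node.separators, key)]
    return key in node.keys                   # data entries are found in the leaf

leaves = [Leaf([5, 12]), Leaf([20, 28]), Leaf([35, 47])]
leaves[0].next, leaves[1].next = leaves[1], leaves[2]
root = Internal([20, 35], leaves)             # keys < 20 | 20..34 | >= 35

assert search(root, 28) is True
assert search(root, 13) is False

# Scanning the linked leaves (the sequence set) gives ascending order.
node, ordered = leaves[0], []
while node:
    ordered += node.keys
    node = node.next
assert ordered == sorted(ordered)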

REVIEW QUESTIONS
1. Discuss physical storage media available on the computer system.
2. What is a file? What are records and data items in a file?
3. List down the factors that influence organisation of data in a database system.
4. What is a physical storage? Explain with block diagrams, a system of physically accessing
the database.
5. A RAID system allows replacing failed disks without stopping access to the system. Thus,
the data in the failed disk must be rebuilt and written to the replacement disk while the
system is in operation. With which of the RAID levels is the amount of interference between
the rebuild and ongoing disk accesses least? Explain.
6. How are records and files related?
7. List down the factors that influence the organisation of a file.
8. Explain the differences between master files, transaction files and report files.
9. Consider the deletion of record 6 from file of Fig. 3.8 (b). Compare the relative merits of the
following techniques for implementing the deletion:
a. Move record 7 to the space occupied by record 6 and move record 8 to the space
occupied by record 7 and so on.
b. Move record 8 to the space occupied by record 6.
c. Mark record 6 as deleted and move no records.

10. Show the structure of the file of Fig. 3.9 (d) after each of the following steps:

a. Insert (P4-010, IBM System, New York, 223312).


b. Delete record 7.
c. Insert (P3-111, KLY System, Jamshedpur, 445566).

11. Give an example of a database application in which variable-length records are preferred to
the pointer method. Explain your answer.
12. What is a file organisation? What are the different types of file organisation? Explain each of
them using a sketch, with their advantages and disadvantages.
13. What is a sequential file organisation and a sequential file processing?
14. What are the advantages and disadvantages of a sequential file organisation?
15. In the sequential file organisation, why is an overflow block used even if there is, at the
moment, only one overflow record?
16. What is indexing and hashing?
17. When is it preferable to use a dense index rather than a sparse index? Explain your answer.
18. What is the difference between primary index and secondary index?
19. What is the most important difference between a disk and a tape?
20. Explain the terms seek time, rotational delay and transfer time.
21. Explain what buffer manager must do to process a read request for a page.
22. What is direct file organisation? Write its advantages and disadvantages.
23. What are secondary indexes and what are they used for?
24. When does a buffer manager write a page to disk?
25. What do you mean by indexed-sequential file processing?
26. Explain the difference between the following:

a. Primary versus secondary indexes.


b. Dense versus sparse indexes.

27. Compare the different methods of implementing variable-length records.


28. What are the characteristics of data that affect the choice of file organisation?
29. Why are buffers used in data transfer from a secondary storage device? Explain the function of
the buffer manager.
30. Compare the advantages and disadvantages of a heap and sequential file organisation. If the
records are to be processed randomly, which one would you prefer among these two
organisations?
31. How are fixed-length records stored and manipulated in a file?
32. What is variable-length record? What are its types?
33. How is variable-length record implemented?
34. What techniques will you use to shorten the average access time of an indexed-sequential file
organisation?
35. Discuss the differences between the following file organisations:
a. Heap
b. Sequential
c. Indexed-sequential.

36. What are the advantages and disadvantages of indexed-sequential file?


37. Direct-access devices are sometimes called random-access devices. What is the basis for this
synonym-type of relationship?
38. Define each of the following terms:

a. File organisation
b. Sequential file organisation
c. Indexed-file organisation
d. Direct file organisation
e. Indexing
f. RAID
g. File manager
h. Buffer manager
i. Tree
j. Leaf.
k. Cylinder
l. Main memory.

39. Compare sequential, indexed-sequential and direct file organisations.


40. How is efficiency accomplished using pointers?
41. Distinguish between a primary key and a secondary key.
42. What are the main differences between a main memory and an auxiliary memory?
43. What are root nodes and leaf nodes in an index hierarchy?
44. What is the difference between B-tree and B+-tree indexes?
45. What are secondary storage devices and what are their uses?
46. What are advantages of secondary storage devices?
47. What are the factors that should be used to evaluate an indexing technique?
48. Explain the difference between sequential and random access.
49. What are magnetic tapes? Explain the working of magnetic tape.
50. What is a dense index and a sparse index? How are records accessed in these indexes?
51. What is a magnetic disk? Describe the working of magnetic disks.
52. What are flexible disks?
53. With a neat sketch, discuss the hierarchy of physical storage devices.
54. Write short notes on the following:

a. Cache memory
b. Main memory
c. Magnetic disk
d. Magnetic tape
e. Optical disk
f. Flash memory.
55. With a neat sketch, explain the advantages and disadvantages of a magnetic disk storage
mechanism.
56. Explain the factors affecting the performance of magnetic disk storage device.
57. What do you mean by RAID technology? What are the various RAID levels?
58. What are the factors that influence the choice of RAID levels? Provide an orientation table
for RAID levels.
59. Explain the working of a tree-based indexing.

STATE TRUE/FALSE

1. The efficiency of the computer system greatly depends on how it stores data and how fast it
can retrieve the data.
2. Because of the high cost and volatile nature of the auxiliary memory, permanent storage of
data is done in the main memory.
3. In a computer, a file is nothing but a series of bytes.
4. An indexed-sequential file organisation is a direct processing method.
5. In a physical storage, a record has a physical storage location or address associated with it.
6. Access time is the time from when a read or write request is issued, to the time when data
transfer begins.
7. The file manager is a software that manages the allocation of storage locations and data
structure.
8. The different types of files are master files, report files and transaction files.
9. The secondary devices are volatile whereas the tertiary storage devices are non-volatile.
10. The buffer manager fetches a requested page from disk into a region of main memory called
the buffer pool and tells the file manager the location of the requested page.
11. The term non-volatile means it stores and retains the programs and data even after the
computer is switched off.
12. Auxiliary storage devices are also useful for transferring data from one computer to another.
13. Transaction files contain relatively permanent information about entities.
14. Master file is a collection of records describing activities or transactions by organisation.
15. Report file is a file created by extracting data to prepare a report.
16. Auxiliary storage devices process data faster than main memory.
17. The capacity of secondary storage devices is practically unlimited.
18. It is more economical to store data on secondary storage devices than in primary storage
devices.
19. Delete operation deletes the current record and updates the file on the disk to reflect the
deletion.
20. In case of sequential file organisation, records are stored in some predetermined sequence,
one after another.
21. A file could be made of records which are of different sizes. These records are called
variable-length records.
22. Sequential file organisation is most common because it makes effective use of the least
expensive secondary storage devices such as magnetic disk.
23. When using sequential access to reach a particular record, all the records preceding it need
not be processed.
24. In direct file processing, on an average, finding one record will require that half of the
records in the file be read.
25. In a direct file, the data may be organised in such a way that they are scattered throughout the
disk in what may appear to be random in order.
26. Auxiliary and secondary storage devices are the same.
27. Sequential access storage is off-line.
28. Magnetic tapes are direct-access media.
29. Direct access systems do not search the entire file, instead, they move directly to the needed
record.
30. Hashing is a method of determining the physical location of a record.
31. In hashing, the record key is processed mathematically.
32. The file storage organisation determines how to access the record.
33. Files could be made of fixed-length records or variable-length records.
34. A file in which all the records are of the same length is said to contain fixed-length records.
35. Because tapes are slow, they are generally used only for long-term storage and backup.
36. There are many types of magnetic disks such as hard disks, flexible disks, zip disks and jaz
disks.
37. Data transfer time is the time it takes to transfer the data to the primary storage.
38. Optical storage is a low-speed direct access storage device.
39. In magnetic tape, the read/write head reads magnetized areas (which represent data on the
tape), converts them into electrical signals and sends them to main memory and CPU for
execution or further processing.
40. In a bit-level stripping, splitting of bits of each byte is done across multiple disks.
41. In a block-level stripping, splitting of blocks is done across multiple disks and it treats the
array of disks as a single large disk.
42. B+-tree index is a balanced tree in which the internal nodes direct the search operation and
the leaf nodes contain the data entries.

TICK (✓) THE APPROPRIATE ANSWER

1. If data are stored sequentially on a magnetic tape, they are ideal for:

a. on-line applications
b. batch processing applications
c. spreadsheet applications
d. decision-making applications.

2. Compared to the main memory, secondary memory is:

a. costly
b. volatile
c. faster
d. none of these.

3. Secondary storage devices are used for:


a. backup of data
b. permanent data storage
c. transferring data from one computer to another
d. all of these.

4. Which of the following is direct access processing method?

a. relative addressing
b. indexing
c. hashing
d. all of these.

5. Compared to the main memory, secondary memory is:

a. costly
b. volatile
c. faster
d. none of these.

6. What is a collection of bytes stored as an individual entity?

a. record
b. file
c. field
d. none of these.

7. Which of the following does a file contain that is needed for information processing?

a. knowledge
b. instructions
c. data
d. none of these.

8. Which of the following is a valid file type?

a. master file
b. report file
c. transaction file
d. all of these.

9. Which of the following stores data that is permanent in nature?

a. transaction file
b. master file
c. report file
d. none of these.

10. Which of the following is an auxiliary device?


a. magnetic disks
b. magnetic tapes
c. optical disks
d. all of these

11. Which of the following file is created by extracting data to prepare a report?

a. report file
b. master file
c. transaction file
d. all of these.

12. DASD stands for:

a. Discrete Application Scanning Devices


b. Double Amplification Switching Devices
c. Direct Access Storage Devices
d. none of these.

13. Advantages of secondary storage devices are:

a. economy
b. security
c. capacity
d. all of these.

14. Employee ID, Supplier ID, Model No and so on are examples of:

a. primary keys
b. fields
c. unique record identifier
d. all of these.

15. Which of the following is DASD?

a. magnetic tape
b. magnetic disk
c. zip disk
d. DAT cartridge.

16. Which of the following is sequential access storage device?

a. hard disks
b. magnetic tape
c. jaz disk
d. floppy disk.

17. In primary data storage area:


a. records are written when an indexed-sequential file is originally created
b. records are written by the users’ programs
c. records are written in data blocks in ascending key sequence
d. all of these.

18. Which storage media does not permit a record to be read and written in the same place?

a. magnetic disk
b. hard disk
c. magnetic tape
d. none of these.

19. Access time is the time:

a. from when a read or write request is issued to when data transfer begins
b. amount of time required to transfer data from the disk to or from main memory
c. required to electronically activate the read/write head over the disk surface where
data transfer is to take place
d. none of these.

20. Data transfer rate is the time:

a. from when a read or write request is issued to when data transfer begins
b. amount of time required to transfer data from the disk to or from main memory
c. required to electronically activate the read/write head over the disk surface where
data transfer is to take place
d. none of these.

21. Optical storage is a:

a. high-speed direct access storage device


b. low-speed direct access storage device
c. medium-speed direct access storage device
d. high-speed sequential access storage device.

22. Head activation time is the time:

a. from when a read or write request is issued to when data transfer begins
b. amount of time required to transfer data from the disk to or from main memory
c. required to electronically activate the read/write head over the disk surface where
data transfer is to take place
d. none of these.

23. Which of the following is a factor that affects the access time of hard disks?

a. rotational delay time


b. data transfer time
c. seek time
d. all of these.

24. Which is the least expensive secondary storage device?

a. zip disk
b. hard disk
c. magnetic tape
d. none of these.

25. Which of the following is not a flexible disk?

a. optical disk
b. zip disk
c. hard disk
d. jaz disk.

26. Which of the following is not an optical disk?

a. WORM
b. Super disk
c. CD-ROM
d. CD-RW.

27. WORM stands for:

a. Write Once Read Many


b. Write Optical Read Magnetic
c. Write On Redundant Material
d. none of these.

28. What is the expansion of ISAM?

a. Indexed Sequential Access Method


b. Internal Storage Access Mechanism
c. Integrated Storage and Management
d. none of these.

29. What is the expansion of VSAM?

a. Very Stable Adaptive Machine


b. Varying Storage Access Mechanism
c. Virtual Storage Access Method
d. none of these.

30. Which company developed ISAM?

a. DEC
b. IBM
c. COMPAC
d. HP.

31. Tertiary storage devices are:

a. slowest
b. fastest
c. medium speed
d. none of these.

32. Data stripping (or parallelism) consists of segmentation or splitting of:

a. data into equal-size partitions


b. distribution of the data over multiple disks (also called a disk array)
c. both (a) and (b)
d. none of these.

33. The primary storage device is:

a. cache memory
b. main memory
c. flash memory
d. all of these.

FILL IN THE BLANKS

1. The _____ temporarily stores data and programs in its main memory while the data are being
processed.
2. The most common types of _____ devices are magnetic tapes, magnetic disks, floppy disks,
hard disks and optical disks.
3. The buffer manager fetches a requested page from disk into a region of main memory called
_____ pool.
4. _____ is also known as secondary memory or auxiliary storage.
5. Redundancy is introduced using _____ technique.
6. In a bit-level stripping, splitting of bits of each byte is done across _____ .
7. There are two types of secondary storage devices (a) _____ and (b) _____ .
8. A collection of related record is called _____.
9. RAID stands for _____.
10. ISAM stands for _____.
11. VSAM stands for _____.
12. There are mainly two kinds of file operations (a) _____ and (b) _____.
13. Direct access storage devices are called _____.
14. Mean time to failure (MTTF) is the measure of _____ of the disk.
15. The overflow area is essentially used to store _____, which cannot be otherwise inserted in
the prime area without rewriting the sequential file.
16. Primary index is called _____ index.
17. Primary index is an index based on a set of fields that include _____ key.
18. Data to be used regularly is almost always kept on a _____.
19. A dust particle or a human hair on the magnetic disk surface could cause the head to crash
into the disk. This is called _____.
20. Secondary index is used to search a file on the basis of _____ keys.
21. The two forms of record organisations are (a) _____ and (b) _____.
22. In sequential processing, one field referred to as the _____, usually determines the sequence
or order in which the records are stored.
23. Secondary storage is called _____ storage whereas Tertiary storage is called _____ storage
device.
24. Processing data using sequential access is referred to as _____.
25. _____ is the duration taken to complete a data transfer, from the time when the computer
requests data from a secondary storage device to the time when the transfer of data is complete.
26. A _____ is a field or set of fields whose contents is unique to one record and can therefore be
used to identify that record.
27. Hashing is also known as _____.
28. _____ is the time it takes an access arm (read/write head) to get into position over a
particular track.
29. In an indexing method, a _____ associates a primary key with the physical location at which
a record is stored.
30. When the records in a large file must be accessed immediately, then _____ organisation must
be used.
31. In an _____, the records are stored either sequentially or non-sequentially and an index is
created that allows the applications to locate the individual records using the index.
32. In an indexed organisation, if the records are stored sequentially based on primary key value,
then that file organisation is called an _____.
33. A track is divided into smaller units called _____.
34. The sectors are further divided into _____.
35. CD-R drive is short for _____.
36. _____ stands for write-once, read-many.
37. In tree-based indexing scheme, the search generally starts at the _____ node.
38. Deletion time is the time taken to delete _____.
39. ISAM was developed by _____.
PART II

RELATIONAL MODEL
Chapter 4

Relational Algebra and Calculus

4.1 INTRODUCTION

The relational database model originated from the mathematical concept of a relation and set theory. As discussed in Chapter 2, Section 2.7.6, it was first proposed as an approach to data modelling by Dr. Edgar F. Codd of IBM Research in 1970 in his paper entitled “A Relational Model of Data for Large Shared Data Banks”. This paper marked the beginning of the field of relational databases. The relational model uses the concept of a mathematical relation in the form of a table of values as its building block. Relational databases became operational only in the mid-1980s. Apart from the widespread success of the hierarchical and network database models in commercial data processing until the early 1980s, the main reasons for the delay in the development and implementation of the relational model were:
Inadequate capabilities of the contemporary hardware.
Need to develop efficient implementation of simple relational operations.
Need for automatic query optimisation.
Unavailability of efficient software techniques.
Requirement of increased processing power.
Requirement of increased input/output (I/O) speeds to achieve comparable performance.

In this chapter, we will discuss the historical perspective of relational databases and describe the structure of the relational model, the operators of relational algebra and relational calculus.

4.2 HISTORICAL PERSPECTIVE OF RELATIONAL MODEL


While introducing the relational model to the database community in 1970, Dr. E. F. Codd stressed the independence of the relational representation from physical computer implementation, such as ordering on physical devices, indexing and the use of physical access paths. Dr. Codd also proposed criteria for accurately structuring relational databases and an implementation-independent language to operate on these databases. On the basis of his proposal, research led to three significant developments that resulted in overwhelming interest in the relational model.
The first development was the prototype relational database management system (DBMS) System R at IBM’s San Jose Research Laboratory in California, USA, during the late 1970s. System R provided a practical implementation of the relational data structures and operations. It also provided information about transaction management, concurrency control, recovery techniques, query optimisation, data security, integrity, user interface and so on. System R led to the following two major developments:
A structured query language called SQL, pronounced S-Q-L or See-Quel.
Production of various commercial relational DBMSs, such as DB2 and SQL/DS from IBM and ORACLE from Oracle Corporation, during the late 1970s and 1980s.

The second development was the relational DBMS INGRES (Interactive Graphics Retrieval System) at the University of California at Berkeley, USA. The INGRES project involved the development of a prototype RDBMS, with the research concentrating on the same overall objectives as the System R project.
The third development was the Peterlee Relational Test Vehicle at the IBM UK Scientific Centre in Peterlee. This project had a more theoretical orientation than the System R and INGRES projects and was significant principally for research into such issues as query processing, optimisation and functional extension.
Since the introduction of the relational model, there have been many more developments in its theory and application. During the ensuing years, the relational approach to databases received a great deal of publicity. Yet, only since the early 1980s have commercially viable relational database management systems (RDBMSs) been available. Today, hundreds of RDBMSs are commercially available for various hardware platforms (both mainframes and microcomputers). ORACLE from Oracle, INGRES, System R from IBM, Access and FoxPro from Microsoft, Paradox from Corel Corporation, Interbase and BDE from Borland, and R:Base from R:BASE Technologies are some examples of RDBMSs that are used on microcomputer (PC) platforms. Similarly, in addition to ORACLE and INGRES, other RDBMSs available on mainframe computers are DB2, UDB, INFORMIX and so on.
The saga of RDBMSs is one of the most fascinating stories in this still young field of databases. How they compare with hierarchical and network DBMSs in terms of operation, performance and overall philosophy is not only interesting but also highly instructive for a true understanding of some of the most basic concepts in databases.

4.3 STRUCTURE OF RELATIONAL DATABASE

A relational database system has a simple logical structure with a sound theoretical foundation. The relational model is based on the core concept of a relation. In the relational model, all data is logically structured within relations (also called tables). Informally, a relation may be viewed as a named two-dimensional table representing an entity set. A relation has a fixed number of named columns (or attributes) and a variable number of rows (or tuples). Each tuple represents an instance of the entity set and each attribute contains a single value of some recorded property for the particular instance. All members of the entity set have the same attributes. The number of tuples is called the cardinality, and the number of attributes is called the degree.

4.3.1 Domain
Fig. 4.1 shows the structure of an instance, or extension, of a relation called EMPLOYEE. The EMPLOYEE relation has seven attributes (field items), namely EMP-NO, LAST-NAME, FIRST-NAME, DATE-OF-BIRTH, SEX, TEL-NO and SALARY. The extension has seven tuples (records). Each attribute contains values drawn from a particular domain. A domain is a set of atomic values. Atomic means that each value in the domain is indivisible to the relational model. A domain is usually specified by name, data type, format and a constrained range of values. For example, in Fig. 4.1, attribute EMP-NO is defined on a domain whose data type is an integer with values ranging between 1,00,000 and 2,00,000. Additional information for interpreting the values of a domain can also be given; for example, SALARY should have the unit of measurement as Indian Rupees or US Dollars. Table 4.1 shows an example of seven different domains with respect to the EMPLOYEE record of Fig. 4.1. The value of each attribute within each tuple is atomic, that is, it is a single value drawn from the domain of the attribute. Multiple or repeating values are not permitted.
Fig. 4.1 EMPLOYEE relation

Table 4.1 Example of domain
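Some SQL systems allow a domain to be declared explicitly and then reused as the data type of attributes. The following is only a minimal sketch, assuming an implementation that supports the standard CREATE DOMAIN statement (for example, PostgreSQL) and using the EMP-NO domain of Table 4.1; the identifier EMP_NO_DOM is made up for illustration, and hyphens are replaced by underscores because SQL identifiers cannot contain hyphens:

    -- A reusable domain of atomic integer values between 1,00,000 and 2,00,000
    CREATE DOMAIN EMP_NO_DOM AS INTEGER
        CHECK (VALUE BETWEEN 100000 AND 200000);

    -- The domain is then used as the data type of an attribute
    CREATE TABLE EMPLOYEE_DEMO (
        EMP_NO EMP_NO_DOM NOT NULL
    );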

The relation R for a given number n of domains D (D1, D2, D3, …, Dn) consists of an unordered set of n-tuples with attributes (A1, A2, A3, …, An), where each value Ai is drawn from the corresponding domain Di. Thus,

A1 ∈ D1, A2 ∈ D2, …, An ∈ Dn   (4.1)

Each tuple is a member of the set formed by the Cartesian product (that is, all possible distinct combinations) of the domains D1 × D2 × D3 × … × Dn. Thus, each tuple is distinct from all others and any instance of the relation is a subset of the Cartesian product of its domains.

Table 4.2 Summary of structural terminology

Formal relational term        Informal equivalents
relation                      table
attribute                     column or field
tuple                         row or record
cardinality                   number of rows
degree                        number of columns
domain                        pool of legal or atomic values
key                           unique identifier

Table 4.2 presents a summary of the structural terminology used in the relational model. As shown in the table, the informal equivalents have only rough (approximate) and ready definitions, while the formal relational terms have precise definitions. For example, the term “relation” and the term “table” are not really the same thing, although it is common in practice to pretend that they are.

4.3.2 Keys of Relations


A relation always has a unique identifier, a field or group of fields (attributes) whose values are unique throughout all of the tuples of the relation. Thus, each tuple is distinct and can be identified by the values of one or more of its attributes, called a key. Keys are always minimal sets of attributes that provide the uniqueness quality.

4.3.2.1 Superkey
Superkey is an attribute, or set of attributes, that uniquely identifies a tuple
within a relation. In Fig. 4.1, the attribute EMP-NO is a superkey because
only one row in the relation has a given value of EMP-NO. Taken together,
the two attributes EMP-NO and LAST-NAME are also a superkey because
only one tuple in the relation has a given value of EMP-NO and LAST-
NAME. In fact, all the attributes in a relation taken together are a superkey
because only one row in a relation has a given value for all the relation
attributes.

4.3.2.2 Relation Key


A relation key is defined as a set of one or more relation attributes concatenated. Most relational theory restricts the relation key to a minimum number of attributes and excludes any unnecessary ones. Such restricted keys are called relation keys. The following three properties should hold for all time and for any instance of the relation:
Uniqueness: The set of attributes has a unique value in the relation for each tuple.
Non-redundancy: If an attribute is removed from the set of attributes, the remaining attributes will not possess the uniqueness property.
Validity: No attribute value in the key may be null.

A relation key can be made up of one or many attributes. Relation keys are logical and bear no relationship to how the data are to be accessed. A relation key only specifies that a relation has at most one row with a given value of the relation key. Furthermore, the term relation key refers to all the attributes in the key as a whole, not to each one individually.
Fig. 4.2 Relation ASSIGN

Fig. 4.2 illustrates the relation ASSIGN, showing the projects on which the employees defined in the relation EMPLOYEE work. In each row, the column YRS-SPENT-BY-EMP-ON-PROJECT indicates the years that the employee in the column EMP-NO has spent on the project in the column PROJECT. The relation ASSIGN has a relation key with two attributes, EMP-NO and PROJECT. The values in these two columns together uniquely identify the tuples in ASSIGN. EMP-NO cannot be a relation key by itself because more than one tuple can have the same value of EMP-NO, as shown in tuples 1, 3 and 5 in Fig. 4.2. That means an employee can work on more than one project. Similarly, PROJECT cannot be a relation key on its own because more than one employee can work on the same project.

4.3.2.3 Candidate Key


When more than one attribute or group of attributes can serve as a unique identifier, each of them is called a candidate key. A relation can therefore have more than one relation key, as shown in the relation USE of Fig. 4.3. It contains information about a project (PROJECT), its project manager (PROJ-MANAGER), a machine (MACHINE) used by the project and the quantity of machines used (QTY-USED). It has been assumed that each project has one project manager and that each project manager manages only one project.
Fig. 4.3 Relation USE

The project manager of project P1 is Thomas and this project uses five excavators and four drills. There will be at most one row for a combination of a project and a machine, so {PROJECT, MACHINE} is a relation key. It is to be noted that a project has only one project manager and that consequently PROJ-MANAGER can identify a project. {PROJ-MANAGER, MACHINE} is therefore also a relation key. Thus the relation USE of Fig. 4.3 has two relation keys. Some keys are more important than others. For example, {PROJECT, MACHINE} is considered more important than {PROJ-MANAGER, MACHINE} because PROJECT is a more stable identifier of projects. PROJ-MANAGER is not a stable identifier because a project’s manager can change during its execution. Since it is the more important key, {PROJECT, MACHINE} is often chosen as the primary key, senior to the other candidate keys.
A candidate key can also be described as a superkey without redundancies. In other words, a candidate key is a superkey such that no proper subset of it is a superkey within the relation. There may be several candidate keys for a relation.

4.3.2.4 Primary Key


A primary key is a candidate key that is selected to identify tuples uniquely within the relation. For example, if a company assigns each employee a unique employee identification number (for example, EMP-NO in the EMPLOYEE record of Fig. 4.1), then the attribute EMP-NO is a primary key which can be used to uniquely identify a particular tuple (record). On the other hand, if a company does not use an employee identification number, then the LAST-NAME and FIRST-NAME attributes may have to be taken as a group to provide a unique key for the relation. In this case, the combination {LAST-NAME, FIRST-NAME} is a candidate key.

Fig. 4.4 Example of foreign key

4.3.2.5 Foreign Key


A foreign key may be defined as an attribute, or set of attributes, within one relation that matches the candidate key of some (possibly the same) relation. Thus, as shown in Fig. 4.4, the foreign key in relation R1 is a set of one or more attributes that is a relation key in another relation R2, but not a relation key of relation R1. The foreign key is used to maintain database (referential) integrity.
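As a point of reference, the keys discussed above can be declared directly in SQL. The sketch below assumes simplified versions of the EMPLOYEE and ASSIGN relations of Figs. 4.1 and 4.2, with made-up column types and with hyphens in the attribute names replaced by underscores:

    CREATE TABLE EMPLOYEE (
        EMP_NO      INTEGER      NOT NULL,
        LAST_NAME   VARCHAR(30),
        FIRST_NAME  VARCHAR(30),
        SALARY      DECIMAL(10,2),
        PRIMARY KEY (EMP_NO)                      -- primary key: uniquely identifies each tuple
    );

    CREATE TABLE ASSIGN (
        EMP_NO      INTEGER      NOT NULL,
        PROJECT     VARCHAR(10)  NOT NULL,
        YRS_SPENT_BY_EMP_ON_PROJECT INTEGER,
        PRIMARY KEY (EMP_NO, PROJECT),            -- composite relation key of ASSIGN
        FOREIGN KEY (EMP_NO) REFERENCES EMPLOYEE (EMP_NO)  -- foreign key enforcing referential integrity
    );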

4.4 RELATIONAL ALGEBRA

Relational algebra is a collection of operations to manipulate or access relations. It is a procedural (or abstract) language with operations that are performed on one or more existing relations to derive another (result) relation without changing the original relation(s). Furthermore, relational algebra defines the complete schema for each of the result relations. Relational algebra consists of a set of relational operators. Each operator has one or more relations as its input and produces a relation as its output. Thus, both the operands and the results are relations, and so the output from one operation can become the input to another operation.
The relational algebra is a relation-at-a-time (or set) language in which all
tuples, possibly from several relations, are manipulated in one statement
without looping. There are many variations of the operations that are
included in relational algebra. Originally eight operations were proposed by
Dr. Codd, but several others have been developed. These eight operators are
divided into the following two categories:
Set-theoretic operations.
Native relational operations.

Set-theoretic operations make use of the fact that tables are essentially
sets of rows. There are four set-theoretical operations, as shown in Table 4.3.

Table 4.3 Set-theoretic operations

Native relational operations focus on the structure of the rows. There are four native relational operations, as shown in Table 4.4.

Table 4.4 Native relational operations


4.4.1 SELECTION Operation
The SELECT operator is used to extract (select) entire rows (tuples) from some relation (a table). It can be used to extract either just those tuples whose attributes satisfy some condition (expressed as a predicate), or all tuples in the relation without qualification. The general form of the SELECT operation is given as:

SELECT table (or relation) name <where predicate(s)>
Into RESULT (output relation)

In some variations, the SELECTION is also known as RESTRICTION operation and the general form is given as:

RESTRICTION table (or relation) name <where predicate(s)>
Into RESULT (output relation)

For queries in SQL, the SELECTION operation is expressed as:

SELECT target data
from relation (or table) name(s) for all tables involved in the query
<where predicate(s)>

For example, let us consider a relation WAREHOUSE as shown in Fig. 4.5 (a). Now, we want to select the attributes WH-ID, NO-OF-BINS and PHONE from the table WAREHOUSE for warehouses located in Mumbai. The operation may be written as:

SELECT WH-ID, NO-OF-BINS, PHONE
from WAREHOUSE
where LOCATION = ‘Mumbai’
Into R1

Or, it can also be written as:

R1 = SELECT WH-ID, NO-OF-BINS, PHONE
     from WAREHOUSE where LOCATION = ‘Mumbai’

Fig. 4.5 The SELECT operation

(a) Table (relation) WAREHOUSE


(d) Relation R3

The above operations will select the attributes WH-ID, NO-OF-BINS and PHONE of all tuples for warehouses located in Mumbai and create a new relation R1, as shown in Fig. 4.5 (b).
When data that has to be retrieved consists of all attributes (columns) in a
relation (or table) as shown in Fig. 4.5 (c), the SQL requirement to name
each attribute can be avoided by using “*” (star) to indicate that data from all
attributes of the relation should be returned. This operation may be written
as:

SELECT *
From WAREHOUSE
where LOCATION = ‘Mumbai’
into R2

Or, it can also be written as:

R2 = SELECT * from WAREHOUSE
     where LOCATION = ‘Mumbai’

We can also impose conditions on more than one attribute. For example,

SELECT *
From WAREHOUSE
where LOCATION = ‘Mumbai’ and NO-OF-BINS
>
into R3

Or, it can also be written as:

R3 = SELECT * from WAREHOUSE
     where LOCATION = ‘Mumbai’ and NO-OF-BINS >

The result of this operation is shown in Fig. 4.5 (d).
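In present-day SQL syntax, the selections of Fig. 4.5 could be written as sketched below, assuming a WAREHOUSE table with columns WH_ID, LOCATION, NO_OF_BINS and PHONE (hyphens in the book's attribute names are replaced by underscores, since SQL identifiers cannot contain hyphens):

    -- Selection of three attributes for warehouses located in Mumbai (corresponds to R1)
    SELECT WH_ID, NO_OF_BINS, PHONE
    FROM WAREHOUSE
    WHERE LOCATION = 'Mumbai';

    -- Selection of all attributes (corresponds to R2)
    SELECT *
    FROM WAREHOUSE
    WHERE LOCATION = 'Mumbai';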

4.4.2 PROJECTION Operation


The PROJECTION operator is used to extract entire columns (attributes)
from some relation (a table), just as SELECT extracts rows (tuples) from the
relation. It constructs a new relation from some existing relation by selecting
only specified attributes of the existing relation and eliminating duplicate
tuples in the newly formed relation. It can also be used to change the left-to-
right order of columns within the result table (that is, new relation). The
general form of PROJECTION operation is given as:

PROJECT table (relation) name ON (or OVER) column (attribute) name(s)
Into RESULT (output relation)

In the case of PROJECTION operation, the SQL does not follow the relational model and the operation is expressed as:

SELECT distinct attribute data
from relation (or table)
Fig. 4.6 The PROJECT operation

For example, let us again consider the relation WAREHOUSE shown in Fig. 4.5 (a). Now, we want to project the attributes WH-ID, LOCATION and PHONE from the table WAREHOUSE for all the tuples. The operation may be written as:

PROJECT WAREHOUSE
ON WH-ID, LOCATION, PHONE
Into R4

Or, it can also be written as:

R4 = PROJECT WAREHOUSE OVER WH-ID, LOCATION, PHONE

The above operations will select the attributes WH-ID, LOCATION and PHONE of all tuples from WAREHOUSE and create a new relation R4, as shown in Fig. 4.6 (a).
As shown in Fig. 4.6 (b), we can also form the following relation to eliminate a duplicate tuple:

PROJECT WAREHOUSE
ON LOCATION
Into R5

Or, it can also be written as:

R5 = PROJECT WAREHOUSE OVER LOCATION
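A corresponding SQL sketch, using the same assumed WAREHOUSE table as before; the DISTINCT keyword performs the duplicate elimination implied by the PROJECT operator:

    -- Projection on WH_ID, LOCATION and PHONE (corresponds to R4)
    SELECT DISTINCT WH_ID, LOCATION, PHONE
    FROM WAREHOUSE;

    -- Projection on LOCATION alone, with duplicate cities removed (corresponds to R5)
    SELECT DISTINCT LOCATION
    FROM WAREHOUSE;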

4.4.3 JOINING Operation


JOINING is a method of combining two or more relations into a single relation. It brings together rows (tuples) from different relations (or tables) based on the truth of some specified condition. It requires choosing attributes to match tuples in the relations. Tuples in different relations but with the same value of the matching attributes are combined into a single tuple in the output (result) relation. Joining is the most useful of all the relational algebra operations.
The general form of the JOINING operation is given as:

JOIN table (relation) name
With table (relation) name
ON (or OVER) domain name
Into RESULT (output relation)

In SQL, the JOIN operation is expressed as:

SELECT attribute(s) data
from outer table (relation), inner table (relation) name
<where predicate(s)>
For example, let us consider the relation ITEMS in addition to the relation WAREHOUSE, as shown in Fig. 4.7 (a). The relation ITEMS contains the number of items held by each warehouse. Now, we can join the two relations, WAREHOUSE and ITEMS, using the common attribute WH-ID. The operation may be written as:

JOIN WAREHOUSE
with ITEMS
ON WH-ID
Into R6

Or, it can also be written as:

R6 = JOIN WAREHOUSE, ITEMS OVER WH-ID

The above operations will select all the attributes of both relations WAREHOUSE and ITEMS with the same value of the matching attribute WH-ID and create a new relation R6, as shown in Fig. 4.7 (b). Thus, in the JOIN operation, the tuples that have the same value of the matching attribute in relations WAREHOUSE and ITEMS are combined into a single tuple in the new relation R6.
Fig. 4.7 The JOIN operation

(a) Two relations WAREHOUSE and ITEMS

(b) Relation R6

There are several types of JOIN operations. The JOIN operation discussed
above is called equijoin, in which two tuples are combined if the values of
the two nominated attributes are the same. A JOIN operation may be for
conditions such as a ‘greater-than’, ‘less-than’ or ‘not-equal’. The JOIN
operation requires a domain that is common to the tables (or relations) being
joined. This prerequisite for performing JOIN operation enables RDBMS
that support domains to check for a common domain before performing the
join requested. This check protects users from possible errors.
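In SQL, the equijoin of WAREHOUSE and ITEMS over WH-ID can be sketched as follows, assuming an ITEMS table that also carries a WH_ID column; both the older comma/WHERE form and the explicit JOIN ... ON form are shown:

    -- Implicit (comma) join syntax
    SELECT *
    FROM WAREHOUSE, ITEMS
    WHERE WAREHOUSE.WH_ID = ITEMS.WH_ID;

    -- Equivalent explicit join syntax (corresponds to R6)
    SELECT *
    FROM WAREHOUSE
    JOIN ITEMS ON WAREHOUSE.WH_ID = ITEMS.WH_ID;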

4.4.4 OUTER JOIN Operation


OUTER JOIN is an extension of the JOIN operation that concatenates rows (tuples) under the same matching condition, but also preserves unmatched rows. Often, in joining two relations, a tuple in one relation does not have a matching tuple in the other relation; in other words, there is no matching value in the join attributes. Therefore, we may want a tuple from one of the relations to appear in the result even when there is no matching value in the other relation. This can be accomplished by the OUTER JOIN operation. The missing values for the second relation are set to null. The advantage of an OUTER JOIN compared to other joins is that information (tuples) that would have been lost by other types of join is preserved. The general form of the OUTER JOIN operation is given as:

OUTER JOIN outer table (relation) name, inner table (relation) name
ON (or OVER) domain name
Into RESULT (output relation)

In SQL, the OUTER JOIN operation can be expressed using the LEFT OUTER JOIN form, which preserves the rows of the outer table:

SELECT attribute(s) data
from outer table (relation) LEFT OUTER JOIN inner table (relation) name
ON <join predicate(s)>
Fig. 4.8 The OUTER JOIN operation

For example, let us consider the relations WAREHOUSE and ITEMS, as shown in Fig. 4.7 (a). This example concatenates tuples (rows) from WAREHOUSE (the outer table or relation) and ITEMS (the inner table or relation) when the warehouse identification (WH-ID) of these relations matches. Where there is no match, the system will concatenate the relevant row from the outer relation with NULL indicators, one for each attribute of the inner relation, as shown in Fig. 4.8. The operation may be written as:

OUTER JOIN WAREHOUSE, ITEMS


ON WH-ID
Into R7

Or, it can also be written as:

R7 = OUTER JOIN WAREHOUSE, ITEMS OVER WH-ID
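A sketch of the same operation in modern SQL, under the assumptions made for the earlier WAREHOUSE and ITEMS examples; warehouses without a matching WH_ID in ITEMS appear with NULLs in the ITEMS columns, as in Fig. 4.8:

    -- Left outer join: WAREHOUSE is the outer (preserved) relation (corresponds to R7)
    SELECT *
    FROM WAREHOUSE
    LEFT OUTER JOIN ITEMS ON WAREHOUSE.WH_ID = ITEMS.WH_ID;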

4.4.5 UNION Operation


UNION is directly analogous to the union operator of basic set mathematics applied to tables. The union of two tables (or relations) consists of every row (tuple) that appears in either (or both) of the two tables. In other words, union compares the rows (tuples) in two relations and creates a new relation that contains rows (tuples) from each of the input relations. The tables (or relations) on which it operates must contain the same number of columns (attributes). Also, corresponding columns must be defined on the same domain. If R and S have K and L tuples, respectively, their UNION is obtained by concatenating them into one relation with a maximum of (K + L) tuples. The general form of the UNION operation is given as:

UNION table name 1, table name 2
into RESULT (output relation)

In SQL, the union operation is expressed as:

SELECT *
from relation 1
UNION
SELECT *
from relation 2
Fig. 4.9 The UNION operation

(a) Relations R8 and R9

(b) Relations R10

For example, let us consider relations R8 and R9, as shown in Fig. 4.9 (a).
Now, UNION of the two relations, R8 and R9, is given in relation R10, as
shown in Fig. 4.9 (b). The operation may be written as:

UNION R8, R9
Into R10

Or, it can also be written as:

R10 = UNION R8, R9
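A SQL sketch of the same union, assuming R8 and R9 exist as union-compatible tables (same number of columns, defined on compatible domains); UNION removes duplicate rows, whereas UNION ALL would retain them:

    -- Union of R8 and R9 (corresponds to R10)
    SELECT * FROM R8
    UNION
    SELECT * FROM R9;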

4.4.6 DIFFERENCE Operation


The DIFFERENCE operator subtracts from the first named relation (or table) those tuples (rows) that appear in the second named relation (or table) and creates a new relation. The general form of the DIFFERENCE operation is given as:

DIFFERENCE table (relation) name 2, table (relation) name 1
into RESULT (output relation)

In SQL, the difference operation may be expressed as:

SELECT *
from relation 1
MINUS
SELECT *
from relation 2
Fig. 4.10 The difference operation

(a) Relations R11 and R12

(b) Relations R13

For example, let us consider relations R11 and R12 as shown in Fig. 4.10
(a). Now, DIFFERENCE of the two relations, R11 and R12 is given in
relation R13, as shown in Fig. 4.10 (b). The operation may be written as:

DIFFERENCE R12, R11


Into R13

Or, it can also be written as:

R13 = DIFFERENCE R12, R11

In the case of difference, only those tuples (rows) are output (in R13) that appear in relation R11 but not in relation R12.
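A SQL sketch of this difference, again assuming R11 and R12 are union-compatible tables; standard SQL uses the keyword EXCEPT, while Oracle uses MINUS:

    -- Tuples that appear in R11 but not in R12 (corresponds to R13)
    SELECT * FROM R11
    EXCEPT        -- MINUS in Oracle
    SELECT * FROM R12;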

4.4.7 INTERSECTION Operation


In the case of the INTERSECTION operator, only those rows (tuples) that appear in both of the named relations (tables) are given as the output result. The general form of the INTERSECTION operation is given as:

INTERSECTION table (relation) name 1, table (relation) name 2
into RESULT (output relation)

In SQL, the intersection operation may be expressed as:

SELECT *
from relation 1
INTERSECT
SELECT *
from relation 2

For example, let us consider relations R11 and R12 as shown in Fig. 4.10
(a). Now, INTERSECTION of the two relations, R11 and R12 is given in
relation R14, as shown in Fig. 4.11. The operation may be written as:

Fig. 4.11 The INTERSECTION operation

INTERSECTION R11, R12


Into R14

Or, it can also be written as:


R14 = INTERSECTION R11, R12

In the case of intersection, those tuples (rows) are output (in R14) that appear in both the relations R11 and R12.
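A SQL sketch of the intersection, under the same assumption that R11 and R12 are union-compatible tables:

    -- Tuples that appear in both R11 and R12 (corresponds to R14)
    SELECT * FROM R11
    INTERSECT
    SELECT * FROM R12;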

4.4.8 CARTESIAN PRODUCT Operation


In case of a CARTESIAN PRODUCT operator (also called cross-product), it
takes each tuple (row) from the first named table (relation) and concatenates
it with every row (tuple) of the second table (relation). The CARTESIAN
PRODUCT operation multiplies two relations to define another relation
consisting of all possible pairs of tuples from the two relations. Therefore, if
one relation has K tuples and M attributes and the other has L tuples and N
attributes, the Cartesian product relation will contain (K*L) tuples with
(N+M) attributes. It is possible that the two relations may have attributes
with the same name. In this case, the attribute names are prefixed with the
relation name to maintain the uniqueness of attribute names within a
relation.
CARTESIAN PRODUCT operation is costly and its practical use is limited. The general form of this operation is given as:

CARTESIAN PRODUCT table (relation) name 1, table (relation) name 2
into RESULT (output relation)

In SQL, the product operation may be expressed as:

SELECT *
from relation 1, relation 2

For example, let us consider relations R15 and R16 as shown in Fig. 4.12 (a). Now, the CARTESIAN PRODUCT of the two relations, R15 and R16, is given in relation R17, as shown in Fig. 4.12 (b). The operation may be written as:

CARTESIAN PRODUCT R15, R16
Into R17

Or, it can also be written as:

R17 = CARTESIAN PRODUCT R15, R16
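A SQL sketch of the Cartesian product, assuming R15 and R16 exist as tables; listing both tables without any join condition (or, equivalently, using CROSS JOIN) pairs every row of one with every row of the other:

    -- Cartesian product of R15 and R16 (corresponds to R17)
    SELECT *
    FROM R15, R16;

    -- Equivalent explicit form
    SELECT *
    FROM R15 CROSS JOIN R16;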

4.4.9 DIVISION Operation


The DIVISION operation is useful for a particular type of query that occurs quite frequently in database applications. Let relation R be defined over the attribute set A and relation S over the attribute set B, such that B is a subset of A (B ⊆ A). Let C = A − B; that is, C is the set of attributes of R that are not attributes of S. The DIVISION operator then defines a relation over the attributes C that consists of the set of tuples from R that match the combination of every tuple in S. The general form of the DIVISION operation is given as:
Fig. 4.12 The CARTESIAN PRODUCT operation

(a) Relations R15 and R16

(b) Relations R17

DIVISION table (relation) name 2, table (relation) name


1
into RESULT (output relation)

SQL has no DIVISION keyword; the division operation is usually expressed in SQL with nested NOT EXISTS subqueries, as sketched after Fig. 4.13.
Suppose we have two relations R18 and R19, as shown in Fig. 4.13. If R18
is the dividend and R19 the divisor, then relation R20 = R18 / R19. The
operation may be written as:

DIVISION R18, R19


Into R20

Or, it can also be written as:

R20 = DIVISION R18, R19

Fig. 4.13 The DIVISION operation
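Since SQL provides no division operator, R18 ÷ R19 is commonly simulated with nested NOT EXISTS subqueries. The sketch below assumes, purely for illustration, that R18 has the columns (C, B) and R19 the single column (B), mirroring the attribute sets C and B used in the definition above:

    -- All C-values of R18 that are paired with every B-value of R19 (corresponds to R20)
    SELECT DISTINCT x.C
    FROM R18 x
    WHERE NOT EXISTS
          (SELECT *
           FROM R19 y
           WHERE NOT EXISTS
                 (SELECT *
                  FROM R18 z
                  WHERE z.C = x.C
                  AND   z.B = y.B));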

The summary of relational algebra operators for relations R and S is shown in Table 4.5.
Table 4.5 Summary of relational algebra operators

4.4.10 Examples of Queries in Relational Algebra Using Symbols


Examples of various queries that illustrate the use of the relational algebra operations, using the corresponding symbols, are given below.

Query # 1 Select the EMPLOYEE tuples (rows) whose (a) department (DEPT-NO) is 10, (b) salary (SALARY) is greater than INR 80,000.

(a) σ DEPT-NO=10 (EMPLOYEE)

(b) σ SALARY>80000 (EMPLOYEE)
Query # 2 Select tuples for all employees in the relation
EMPLOYEE who either work in DEPT-NO 10 and get
annual salary of more than INR 80,000, or work in
DEPT-NO 12 and get annual salary of more than INR
90,000.

σ (DEPT-NO=10 AND SALARY > 80000) OR (DEPT-NO=12 AND SALARY > 90000) (EMPLOYEE)
Query # 3 List each employee’s identification number (EMP-ID),
name (EMP-NAME) and salary (SALARY).

∏ EMP-ID, EMP-NAME, SALARY (EMPLOYEE)


Query # 4 Retrieve the name (EMP-NAME) and salary (SALARY) of all employees in the relation EMPLOYEE who work in DEPT-NO 10.

∏ EMP-NAME, SALARY (σ DEPT-NO=10 (EMPLOYEE))

Or

EMP-DEPT-10 ← σ DEPT-NO=10 (EMPLOYEE)

RESULT ← ∏ EMP-NAME, SALARY (EMP-DEPT-10)


Query # 5 Retrieve the employees identification number (EMP-ID)
of all employees who either work in DEPT-NO 10 or
directly supervise (EMP-SUPERVISION) an employee
who works in DEPT-NO=10.

EMP-DEPT-10 ← σ DEPT-NO=10 (EMPLOYEE)

RESULT 1 ← ∏ EMP-ID (EMP-DEPT-10)

RESULT 2 (EMP-ID) ← ∏ EMP-SUPERVISION (EMP-DEPT-10)

FINAL-RESULT ← RESULT 1 ∪ RESULT 2


Query # 6 Retrieve for each female employee (EMP-SEX=’F’) a
list of the names of her dependents (EMP-
DEPENDENT).

FEMALE-EMP ← (σEMP-SEX=‘F’ (EMPLOYEE)

ALL-EMP ← ∏EMP-ID, EMP-NAME(FEMALE-EMP)

DEPENDENTS ← ALL-EMP × EMP-DEPENDENT

ACTUAL-DEPENDENTS ← (σEMP-ID=FEPT-
ID(DEPENDENTS)

FINAL-RESULT ← ∏EMP-NAME, DEPENDENT-


NAME(ACTUAL-DEPENDENTS)
Query # 7 Retrieve the name of the manager (MANAGER) of each
department (DEPT).

DEPT-MANAGER ← DEPT ⋈ MANAGER-


ID=EMP-ID (EMPLOYEE)

FINAL-RESULT ← ∏DEPT-NAME, EMP-NAME(DEPT-


MANAGER)
Query # 8 Retrieve the names of employees in relation
EMPLOYEE who work on all the projects in relation
PROJECT controlled by DEPT-NO-10.

DEPT-10-PROJECT (PROJ-NO) ← ∏PROJECT-NUM


(σDEP-NO=10(PROJECT)

EMP-PROJ (EMP-ID, PROJ-NO) ← ∏ EMP-ID, PROJ-NO (WORKS-ON)

RESULT-EMP-ID ← EMP-PROJ ÷ DEPT-10-PROJECT

FINAL-RESULT ← ∏EMP-NAME(RESULT-EMP-ID *
EMPLOYEE)
Query # 9 Retrieve the names of employees who have no
dependents.

ALL-EMP ← ∏EMP-ID(EMPLOYEE)

EMP-WITH-DEPENDENT (EMP-ID) ← ∏ EMP-ID (DEPENDENT)

EMP-WITHOUT-DEPENDENT ← (ALL-EMP -
EMP-WITH-DEPENDENT)

FINAL-RESULT ← ∏EMP-NAME(EMP-WITHOUT-
DEPENDENT * EMPLOYEE)
Query # 10 Retrieve the names of managers who have at least one
dependent.

MANAGER (EMP-ID) ← ∏MGR-ID(DEPARTMENT)

EMP-WITH-DEPENDENT (EMP-ID) ← ∏ EMP-ID (DEPENDENT)

MGRS-WITH-DEPENDENT ← (MANAGER ⋂
EMP-WITH-DEPENDENT)

FINAL-RESULT ← ∏EMP-NAME(MGRS-WITH-
DEPENDENT * EMPLOYEE)
Query # 11 Prepare a list of project numbers (PROJ-NO) for
projects (PROJECT) that involve an employee whose
name is “Thomas”, either as a technician or as a
manager of the department that controls the project.

Thomas (EMP-ID) ← ∏ EMP-ID (σ EMP-NAME=‘Thomas’ (EMPLOYEE))

Thomas-TECH-PROJ ← ∏PROJ-NO(WORKS-ON *
Thomas)

MGRS ← ∏EMP.NAME, DEPT-NO (EMPLOYEE) ⋈


EMP-ID=MGR-ID DEPARTMENT)

Thomas-MANAGED-DEPT (DEPT-NUM) ← ∏DEPT-


NO(σEMP-NAME=‘Thomas’(MGRS)

Thomas-MGR-PROJ (PROJ-NUM) ← ∏PROJ-


NO(Thomas-MANAGED-DEPT * PROJECT)

FINAL-RESULT ← (Thomas-TECH-PROJ ⋃
Thomas-MGR-PROJ)
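For comparison, the simpler of these symbolic queries map directly onto SQL. A sketch of Query # 4, assuming an EMPLOYEE table with columns EMP_NAME, SALARY and DEPT_NO corresponding to the attributes used above:

    -- Query # 4: name and salary of all employees who work in department 10
    SELECT EMP_NAME, SALARY
    FROM EMPLOYEE
    WHERE DEPT_NO = 10;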

4.5 RELATIONAL CALCULUS

Tuple and domain calculi are collectively referred to as relational calculus. Relational calculus is a query system wherein queries are expressed as formulas consisting of a number of variables and an expression involving these variables. Such formulas describe the properties of the required result relation without specifying the method of evaluating it. Thus, in a relational calculus, there is no description of how to evaluate a query; a relational calculus query specifies what is to be retrieved rather than how to retrieve it. It is up to the DBMS to transform these nonprocedural queries into equivalent and efficient procedural queries.
A relational calculus has nothing to do with differentiation or integration
of mathematical calculus, but takes its name from a branch of symbolic logic
called predicate calculus, which is calculating with predicates. Let us look at
Fig. 4.14 (a), in which a conventional method of writing the statements is to
place the predicate first and then follow it with the object enclosed in
parentheses. Therefore, the statement “ABC is a company” can be written as
“is a company (ABC)”. Now we drop the “is a” part and write the first
statement as “company (ABC)”. Finally, if we use symbols for both
predicate and the subject, we can rewrite the statements of Fig. 4.14 (a) as
P(x). The lowercase letters from the end of the alphabet (….x, y, z) denote
variables, the beginning letters (a, b, c,…..) denote constants, and uppercase
letters denote predicates. P(x), where x is the argument, is called a one-place
or monadic predicate. COMPANY(x), and DBMS(x) are examples of
monadic predicates. The variables x and y are replaceable by constants.

Fig. 4.14 Examples of statement

Let us take another example. In Fig. 4.14 (b), predicates “is smaller than”,
“is greater than”, “is north of”, “is south of” require two objects and are
called two-place predicates.
In database applications, a relational calculus is of two types:
Tuple relational calculus.
Domain relational calculus.

4.5.1 Tuple Relational Calculus


The tuple relational calculus was originally proposed by Dr. Codd in 1972.
In the tuple relational calculus, tuples are found for which a predicate is true.
The calculus is based on the use of tuple variables. A tuple variable is a
variable that ranges over a named relation, that is, a variable whose only
permitted values are tuples of the relation. To specify the range of a tuple
variable R as the EMPLOYEE relation, it can be written as:

EMPLOYEE(R)

To express the query ‘Find the set of all tuples R such that F(R) is true’,
we write:

{R|F(R)}

F is called a well-formed formula (WFF) in mathematical logic. Thus, relational calculus expressions are formally defined by means of well-formed formulas (WFFs) that use tuple variables to represent tuples. Tuple variable names are the same as relation names. If tuple variable R represents tuple r at some point, R.A will represent the A-component of r, where A is an attribute of R. A term can be defined as:

where <variable name> = <tuple variable> . <attribute name>


= R.A
<condition> = binary operations
= .NOT., >, <, ≥ and ≤

This term can be illustrated for relations shown in Fig. 4.15, for example,

WAREHOUSE.LOCATION = MUMBAI
or, ITEMS.ITEM-NO > 30
All tuple variables in terms are defined to be free. In defining a WFF, the following symbols, commonly found in predicate calculus, are used:

⌉ = negation
∃ = existential quantifier (meaning ‘there EXISTS’), used in formulae that must be true for at least one instance
∀ = universal quantifier (meaning ‘FOR ALL’), used in statements about every instance

Tuple variables that are quantified by ∀ or ∃ are called bound variables. Otherwise, they are called free variables.
Dr. Codd defined well-formed formulas (WFFs) as follows:
Any term is a WFF.
If x is a WFF, so are (x) and ⌉x. All free tuple variables in x remain free in (x) and ⌉x, and all bound tuple variables in x remain bound in (x) and ⌉x.
If x and y are WFFs, so are x ⋀ y and x ⋁ y. All free tuple variables in x and y remain free in x ⋀ y and x ⋁ y.
If x is a WFF containing a free tuple variable T, then ∃T(x) and ∀T(x) are WFFs. T now becomes a bound tuple variable, but any other free tuple variables remain free. All bound terms in x remain bound in ∃T(x) and ∀T(x).
No other formulas are WFFs.

Examples of WFFs are:

STORED.ITEM-NO = ITEMS.ITEM-NO ⋀ ITEMS.WT > 30


∃ ITEMS (ITEMS.DESC = ‘Bulb’ ⋀ ITEMS.ITEM-NO
= STORED.ITEM-NO)

In the above examples, STORED and ITEMS are free variables in the first WFF. In the second WFF, only STORED is free, whereas ITEMS is bound. Bound and free variables are important in formulating calculus expressions. A calculus expression may be given in the form mentioned below so that all tuple variables preceding WHERE are free in the WFF.
Fig. 4.15 Sample relations
Relational calculus expressions can be used to retrieve data from one or
more relations, with the simplest expressions being those that retrieve data
from one relation only.

4.5.1.1 Query Examples for Tuple Relational Calculus


a. List the names of employees who do not have any property.

{E.FNAME, E.INAME | EMPLOYEE(E) ⋀ ⌉(∃P) (PROPERTY-FOR-RENT(P) ⋀ (E.EMP-NO = P.EMP-NO))}

b. List the details of employees earning a salary of more than INR 40000.

{E.FNAME, E.INAME | EMPLOYEE(E) ⋀ E.SAL > 40000}

c. List the details of cities where there is a branch office but no properties for rent.

{B.CITY | BRANCH(B) ⋀ ⌉(∃P) (PROPERTY-FOR-RENT(P) ⋀ B.CITY = P.CITY)}

d. List the names of clients who have viewed a property for rent in Delhi.

{C.FNAME, C.INAME | CLIENT(C) ⋀ (∃V) (∃P) (VIEWING(V) ⋀ PROPERTY-FOR-RENT(P) ⋀ (C.CLIENT-NO = V.CLIENT-NO) ⋀ (V.PROPERTY-NO = P.PROPERTY-NO) ⋀ P.CITY = ‘Delhi’)}

e. List all the cities where there is a branch office and at least one property for the client.

{B.CITY | BRANCH(B) ⋀ (∃P) (PROPERTY-FOR-RENT(P) ⋀ B.CITY = P.CITY)}
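As a point of comparison, query (d) above maps naturally onto SQL. A sketch, assuming tables CLIENT, VIEWING and PROPERTY_FOR_RENT with the column names used in the calculus expression (hyphens replaced by underscores):

    -- Names of clients who have viewed a property for rent in Delhi
    SELECT DISTINCT C.FNAME, C.INAME
    FROM CLIENT C, VIEWING V, PROPERTY_FOR_RENT P
    WHERE C.CLIENT_NO = V.CLIENT_NO
    AND   V.PROPERTY_NO = P.PROPERTY_NO
    AND   P.CITY = 'Delhi';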

4.5.2 Domain Relational Calculus


Domain relational calculus was proposed by Lacroix and Pirotte in 1977. In
domain relational calculus, the variables take their values from domains of
attributes rather than tuples of relations. An expression for the domain
relational calculus has the following general form

{d1, d2, …, dn | F(d1, d2, …, dm)}   m ≥ n

where d1, d2, …, dm represent domain variables and F(d1, d2, …, dm) represents a formula composed of atoms. Each atom has one of the following forms:
R(d1, d2, …, dn), where R is a relation of degree n and each di is a domain variable.
di θ dj, where di and dj are domain variables and θ is one of the comparison operators (<, ≤, >, ≥, =, ≠); the domains of di and dj must have members that can be compared by θ.
di θ c, where di is a domain variable, c is a constant from the domain of di, and θ is one of the comparison operators.

We recursively build up formulas from atoms using the following rules:

An atom is a formula.
If F1 and F2 are formulas, so are their conjunction F1 ⋀ F2, their disjunction F1 ⋁ F2 and the negation ⌉F1.
If F(X) is a formula with domain variable X, then (∃X) (F(X)) and (∀X) (F(X)) are also formulas.

The expressions of domain relational calculus use the same operators as those of tuple calculus. The difference is that in domain calculus, instead of using tuple variables, we use domain variables to represent the components of tuples. A tuple calculus expression can be converted to a domain calculus expression by replacing each tuple variable by n domain variables, where n is the arity of the tuple variable.

4.5.2.1 Query Examples for Domain Relational Calculus


a. List the details of employees working on a SAP project.

{FN, IN | (∃EN, PROJ, SEX, DOB, SAL) (EMPLOYEE(EN, FN, IN, PROJ, SEX, DOB,
SAL) ⋀ PROJ = ‘SAP’)}

b. List the details of employees working on a SAP project and drawing a salary of more than INR 30000.

{FN, IN | (∃EN, PROJ, SEX, DOB, SAL) (EMPLOYEE(EN, FN, IN, PROJ, SEX, DOB,
SAL) ⋀ PROJ = ‘SAP’ ⋀ SAL > 30000)}

c. List the names of clients who have viewed a property for rent in Delhi.

{FN, IN | (∃CN, CN1, PN, PN1, CITY) (CLIENT(CN, FN, IN, TEL, PT, MR) ⋀ VIEWING(CN1, PN1, DT, CMT) ⋀ PROPERTY-FOR-RENT(PN, ST, CITY, PC, TYP, RMS, MT, ON, SN) ⋀ (CN = CN1) ⋀ (PN = PN1) ⋀ CITY = ‘Delhi’)}

d. List the details of cities where there is a branch office but no properties for rent.
{CITY | (BRANCH(BN, ST, CITY, PC) ⋀ ⌉(∃CITY1) (PROPERTY-FOR-RENT(PN, ST1, CITY1, PC1, TYP, RMS, RNT, ON, SN, BN1) ⋀ (CITY = CITY1)))}

e. List all the cities where there is both a branch office and at least one property for client.

{CITY | (BRANCH(BN, ST, CITY, PC) ⋀ (∃CITY1) (PROPERTY-FOR-RENT(PN, ST1, CITY1, PC1, TYP, RMS, RNT, ON, SN, BN1) ⋀ (CITY = CITY1)))}

f. List all the cities where there is either a branch office or a property for client.

{CITY | BRANCH(BN, ST, CITY, PC) ⋁ PROPERTY-FOR-RENT(PN, ST1, CITY1, PC1, TYP, RMS, RNT, ON, SN, BN)}

REVIEW QUESTIONS
1. In the context of a relational model, discuss each of the following concepts:

a. relation
b. attributes
c. tuple
d. cardinality
e. domain.

2. Discuss the various types of keys that are used in relational model.
3. The relations (tables) shown in Fig. 4.15 are a part of the relational database (RDBMS) of an
organisation.
Find primary key, secondary key, foreign key and candidate key.
4. Let us assume that a database system has the following relations:

STUDENTS (NAME, ROLL-NO, ADDRESS, MAIN)


ADMISSION (ROLL-NO, COURSE, SEMESTER)
FACULTY (COURSE, FACULTY, SEMESTER)
OFFERING (BRANCH, COURSE)

Using relational algebra, derive relations to obtain the following information:

a. All courses taken by a given student.


b. All faculty that at some time taught a given student.
c. The names of students admitted in a particular course in a given semester.
d. The branch with which a particular student has taken courses.
e. Were two students (x and y) ever admitted in the same course in the same semester?
f. Students who have taken all courses offered by a given faculty.

5. Repeat (a) through (f) of exercise 4 using relational calculus.


6. What do you mean by relational algebra? Define all the operators of relational algebra.
7. Write a short note on the historical perspective of relational model of database system.
8. What do you mean by structure of a relational model of database system? Explain the
significance of domain and keys in the relational model.
9. What is relational algebra? What is its use? List relational operators.
10. Find the relation key for each of the following relations:

a. SALES (SELLER, PRODUCT, CATEGORY, PRICE, ADDRESS). Each SELLER


has a set price for each category and each SELLER has one SELLER-ADDRESS.
b. ORDERS (ORDER-NO, ORDER-DATE, PROJECT, DEPARTMENT). Each order
is made by one project and each project is in one department.
c. PAYMENTS (ACCOUNT-NO, CUSTOMER, AMOUNT-PAID, DATE-PAID).
There is one customer for each account. The payments on each account can be made
on different days and can be in different amounts. There is at most one payment on
an account each day.
d. REPORTS (REPORT-NAME, AUTHOR-SURNAME, REPORT-DATE, AUTHOR-
DEPARTMENT). Each author is in one department and each report is produced by
one author and has one REPORT-DATE.

11. Construct relations to store the following information:

a. The PERSON-ID and SURNAME of the current occupants of each POSITION in


the organization and their APPOINTMENT-DATE to the POSITION, together with
the DEPARTMENT of the position.
b. The PRICE of parts (identified by PART-ID) for each SUPPLIER and the EFFECTIVE-DATE of that price.
c. The NAMEs of persons in DEPARTMENTs and the SKILLs of these persons.
d. The TIME that each vehicle (identified by REG-NO) checks in at CHECK-POINTs during a race.

12. Let us assume that a relation MANUFACTURE of a database system is given, as shown in
Fig. 4.16 below:

Fig. 4.16 Relation MANUFACTURE


Write relational calculus and relational algebra expressions to retrieve the following
information:

a. The components of a given ASSEMBLY.


b. The components of the components of a given ASSEMBLY.

13. What do you mean by relational calculus? What are the types of relational calculus?
14. Define the structure of well-formed formula (WFF) in both the tuple relational calculus and
domain relational calculus.
15. What is difference between JOIN and OUTER JOIN operator?
16. Describe the relations that would be produced by the following tuple relational calculus
expressions:

a. {H.HOTEL-NAME | HOTEL(H) ⋀ H.CITY = ‘Mumbai’}


b. {H.HOTEL-NAME | HOTEL(H) ⋀ (∃R) (ROOM(R) ⋀ H.HOTEL-NO = R.HOTEL-NO ⋀ R.PRICE > 4000)}
c. {H.HOTEL-NAME | HOTEL(H) ⋀ (∃B) (∃G) (BOOKING(B) ⋀ GUEST(G) ⋀ H.HOTEL-NO = B.HOTEL-NO ⋀ B.GUEST-NO = G.GUEST-NO ⋀ G.GUEST-NAME = ‘Thomas Mathew’)}
d. {H.HOTEL-NAME, G.GUEST-NAME, B1.DATE-FROM, B2.DATE-FROM | HOTEL(H) ⋀ GUEST(G) ⋀ BOOKING(B1) ⋀ BOOKING(B2) ⋀ H.HOTEL-NO = B1.HOTEL-NO ⋀ G.GUEST-NO = B1.GUEST-NO ⋀ B2.HOTEL-NO = B1.HOTEL-NO ⋀ B2.GUEST-NO = B1.GUEST-NO ⋀ B2.DATE-FROM ≠ B1.DATE-FROM}

17. Provide the equivalent domain relational calculus and relational algebra expressions for each
of the tuple relational calculus expressions of Exercise 16.
18. Generate the relational algebra, tuple relational calculus, and domain relational calculus
expressions for the following queries:

a. List all hotels.


b. List all single rooms.
c. List the names and cities of all guests.
d. List the price and type of all rooms at Taj Hotel.
e. List all guests currently staying at the Taj Hotel.
f. List the details for all rooms at the Taj Hotel, including the name of guest staying in
the room, if the room is occupied.
g. List the guest details of all guests staying at the Taj Hotel.

19. You are given the relational database as shown in Fig. 4.15. How would you retrieve the
following information, using relational algebra and relational calculus?

a. The WH-ID of warehouse located in Mumbai.


b. The ITEM-NO of the small items whose weight exceeds 8.
c. The ORD-DATE of orders made by KLY.
d. The location of warehouse that hold items with DESC “Electrode”.
e. The warehouse that stores items in orders made by ABC Co.
f. The warehouse that holds all the items in order ORD-1.
g. The total QTY of items held by each warehouse.
h. The ITEM-NO of items included in order made by KLY and held in Kolkata.

20. For the relation A and B shown in Fig. 4.17 below, perform the following operations and
show the resulting relations.

a. Find the projection of B on the attributes (Q, R).


b. Find the join of A and B on the common attributes.
c. Divide A by the relation that is obtained by first selecting those tuples of B where
the value of Q is either q1 or q2 and then projecting B on the attributes (R, S).

Fig. 4.17 Exercise for 4.20

21. Consider a database for the telephone company that contains relation SUBSCRIBERS,
whose attributes are given as:

SUB-NAME, SSN, ADDRESS, CITY, ZIP, INFORMATION-NO

Assume that the INFORMATION-NO is the unique 10-digit telephone number, including
area code, provided for subscribers. Although one subscriber may have multiple phone
numbers, such alternate numbers are carried in a separate relation (table). The current relation
has a row for each distinct subscriber (but note that husband and wife, subscribing together,
can occupy two rows and share an information number). The database administrator has set
up the following rules about the relation, reflecting design intentions for the data:

No two subscribers (on separate rows) have the same social security number (SSN).
Two different subscribers can share the same information number (for example,
husband and wife). They are listed separately in the SUBSCRIBERS relation.
However, two different subscribers with the same name cannot share the same
address, city, and zip code and also the same information number.

a. Identify all candidate keys for the SUBSCRIBERS relation, based on the
assumptions given above. Note that there are two such keys: one of them contains the
INFORMATION-NO attribute and a different one contains the ZIP attribute.
b. Which of these candidate keys would you choose for a primary key? Explain why.

22. What is the difference between a database and a table?


23. A relational database is given with the following relations:

EMPLOYEE (EMP-NAME, STREET, CITY)


WORKS (EMP-NAME, COMPANY-NAME, SALARY)
COMPANY (COMPANY-NAME, CITY)
MANAGES (EMP-NAME, MANAGER-NAME)

a. Give a relational algebra expression for each of the following queries:


i. Find the company with the most employees.
ii. Find the company with the smallest payroll.
iii. Find those companies whose employees earn a higher salary, on average, than the average salary
at ABC Co.

b. The primary keys in the relations are underlined. Give an expression in the relational
algebra to express each of the following queries:
i. Find the names of all employees who work for ABC Co.
ii. Find the names and cities of residence of all employees who work for ABC Co.
iii. Find the names, street address, and cities of residence of all employees who work for ABC Co.
and earn more than INR 35000 per month.
iv. Find names of all employees who live in the same city and on the same street as do their
managers.
v. Find the names of employees who do not work for ABC Co.

24. Describe the evolution of the relational database model.


25. Explain the relational database structure.
26. What is a domain and how is it related to a data value?
27. Describe the eight relational operators. What do they accomplish?
28. Consider the following relations:

SUPPLIERS(S-ID: integer, S-NAME: string, ADDRESS: string)


PARTS(P-ID: integer, P-NAME: string, COLOUR: string)
CATALOGUE(S-ID: integer, P-ID: integer, COST: real)

The key fields are underlined, and the domain of each field is listed after the field name.
Write the following queries in relational algebra, tuple relational calculus, and domain
relational calculus:

a. Find the names of suppliers who supply some red part.


b. Find the S-IDs of suppliers who supply some red or green part.
c. Find the S-IDs of suppliers who supply some red part or are at 12, Beldih Avenue.
d. Find the S-IDs of suppliers who supply some red part and some green parts.
e. Find the S-IDs of suppliers who supply every part.
f. Find the S-IDs of suppliers who supply every red part.
g. Find the S-IDs of suppliers who supply every red or green part.
h. Find the P-IDs of parts that are supplied by at least two different suppliers.
i. Find the P-IDs of parts supplied by every supplier at less than INR 550.

29. Why are tuples in a relation not ordered?


30. Why are duplicate tuples not allowed in a relation?
31. Define a foreign key. What is this concept used for? How does it play a role in the JOIN
operation?
32. List the operations of relational algebra and the purpose of each.

STATE TRUE/FALSE

1. In 1980 Dr. E. F. Codd was working with Oracle Corporation.


2. DB2, System R, and ORACLE are examples of relational DBMS.
3. In the RDBMS terminology, a table is called a relation.
4. The relational model is based on the core concept of relation.
5. Cardinality of a table means the number of columns in the table.
6. In the RDBMS terminology, an attribute means a column or a field.
7. A domain is a set of atomic values.
8. Data values are assumed to be atomic, which means that they have no internal structure as far
as the model is concerned.
9. A table cannot have more than one attribute, which can uniquely identify the rows.
10. A candidate key is an attribute that can uniquely identify a row in a table.
11. A table can have only one alternate key.
12. A table can have only one candidate key.
13. The foreign key and the primary key should be defined on the same underlying domain.
14. A relation always has a unique identifier.
15. Primary key performs the unique identification function in a relational database model.
16. In a reality, NULL is not a value, but rather the absence of a value.
17. A relational database is a finite collection of relations, and a relation is defined in terms of domains,
attributes, and tuples.
18. Atomic means that each value in the domain is indivisible to the relational model.
19. Superkey is an attribute, or set of attributes, that uniquely identifies a tuple within a relation.
20. Codd defined well-formed formulas (WFFs).

TICK (✓) THE APPROPRIATE ANSWER

1. The father of the relational database system is:

a. Pascal
b. C.J. Date
c. Dr. Edgar F. Codd
d. none of these.

2. Who wrote the paper titled “A Relational Model of Data for Large Shared Data Banks”?

a. F.R. McFadden
b. C.J. Date
c. Dr. Edgar F. Codd
d. none of these.

3. The first large scale implementation of Codd’s relational model was IBM’s:

a. DB2
b. system R
c. ingress
d. none of these.

4. Which of the following is not a relational database system?

a. ingress
b. DB2
c. IMS
d. sybase.

5. What is the RDBMS terminology for a row?

a. tuple
b. relation
c. attribute
d. domain.

6. What is the cardinality of a table with 1000 rows and 10 columns?

a. 10
b. 100
c. 1000
d. none of these.

7. What is the cardinality of a table with 5000 rows and 50 columns?

a. 10
b. 50
c. 500
d. 5000.

8. What is the degree of a table with 1000 rows and 10 columns?

a. 10
b. 100
c. 1000
d. none of these.

9. What is the degree of a table with 5000 rows and 50 columns?


a. 50
b. 500
c. 5000
d. none of these.

10. Which of the following keys in a table can uniquely identify a row in a table?

a. primary key
b. alternate key
c. candidate key
d. all of these.

11. A table can have only one:

a. primary key
b. alternate key
c. candidate key
d. all of these.

12. What are all candidate keys, other than the primary key, called?

a. secondary keys
b. alternate keys
c. eligible keys
d. none of these.

13. What is the name of the attribute or attribute combination of one relation whose values are
required to match those of the primary key of some other relation?

a. candidate key
b. primary key
c. foreign key
d. matching key.

14. What is the RDBMS terminology for a column?

a. tuple
b. relation
c. attribute
d. domain.

15. What is the RDBMS terminology for a table?

a. tuple
b. relation
c. attribute
d. domain.
16. What is the RDBMS terminology for a set of legal values that an attribute can have?

a. tuple
b. relation
c. attribute
d. domain.

17. What is the RDBMS terminology for the number of tuples in a relation?

a. degree
b. relation
c. attribute
d. cardinality.

18. What is a set of possible data values called?

a. degree
b. attribute
c. domain
d. tuple.

19. What is the RDBMS terminology for the number of attributes in a relation?

a. degree
b. relation
c. attribute
d. cardinality.

20. Which of the following aspects of data is the concern of a relational database model?

a. data manipulation
b. data integrity
c. data structure
d. all of these.

21. What is the smallest unit of data in the relational model?

a. data type
b. field
c. data value
d. none of these.

FILL IN THE BLANKS

1. The relational model is based on the core concept of _____.


2. The foundation of relational database technology was laid by _____.
3. Dr. E. F. Codd, in his paper titled _____, laid down the basic principles of the RDBMS.
4. The first attempt at a large implementation of Codd’s relational model was _____.
5. In the RDBMS terminology, a record is called a _____.
6. Degree of a table means the number of _____ in a table.
7. A domain is a set of _____ values.
8. The smallest unit of data in the relational model is the individual _____.
9. A _____ is set of all possible data values.
10. The number of attributes in a relation is called the _____ of the relation.
11. The number of tuples or rows in a relation is called the _____ of the table.
12. A table can have only one _____ key.
13. All the values that appear in a column of a table must be taken from the same _____.
14. _____ is an attribute, or set of attributes, that uniquely identifies a tuple within a relation.
15. Tuple relational calculus was originally proposed by _____ in _____.
Chapter 5

Relational Query Languages

5.1 INTRODUCTION

In Chapter 4, we discussed relational queries based on relational
algebra and relational calculus. The relational algebra and calculus provide a
powerful set of operations to specify queries. This forms the basis for the
data manipulation (query) language component of the DBMS. But, such
languages can be expensive to implement and use. In reality, data
manipulation languages generally have capabilities beyond those of
relational algebra and calculus. All data manipulation languages include
capabilities such as insertion, deletion and modification commands, arithmetic
capability, assignment and print commands, aggregate functions
and so on, which are not part of relational algebra or calculus.
Queries in a relational language should be able to use any attribute as a
key and so must have access to powerful indexing capability. However, such
queries can span a number of relations, the implementation of which can be
prohibitively expensive, making system development infeasible in a
commercial environment.
Two relational systems, System R (developed at the IBM’s Research
Laboratory in San Jose, California, USA) and INGRESS (called Interactive
Graphics and Retrieval System, developed at the University of California at
Berkeley) were developed in the early 1970s to provide a practical relational
implementation for commercial use. Both these systems proved
successful and were commercialised. System R, converted to DB2, is now
IBM’s standard RDBMS product. INGRESS is also commercially marketed.
Other RDBMS products, such as ORACLE and SUPRA, are also
commercially available.
In this chapter, the features of some relational query languages are
demonstrated: information system based language (ISBL), query
language (QUEL), structured query language (SQL) and query-by-example
(QBE).

5.2 CODD’S RULES

Dr. Edgar F. Codd proposed a set of rules that were intended to define the
important characteristics and capabilities of any relational system [Codd
1986]. Today, Codd’s rules are used as a yardstick for what can be expected
from a conventional relational DBMS. Though, it is referred to as “Codd’s
twelve rules”, in reality there are thirteen rules. The Codd’s rules are
summarised in Table 5.1.

Table 5.1 Codd’s rules


Rule 0 (Foundation Rule): A relational database management system must manage the database entirely through its relational capabilities.
Rule 1 (Information Rule): All information is represented logically by values in tables.
Rule 2 (Guaranteed Access Rule): Every data value is logically accessible by a combination of table name, column name and primary key value.
Rule 3 (Missing Information Rule): Null values are systematically supported, independent of data type.
Rule 4 (System Catalogue Rule): The logical description of the database is represented, and may be interrogated by authorised users, in the same way as normal data.
Rule 5 (Comprehensive Language Rule): A high-level relational language with well-defined syntax, expressible as character strings, must be provided to support all of the following: data and view definitions, integrity constraints, interactive and programmable data manipulation, and transaction start, commit and rollback.
Rule 6 (View Update Rule): The system should be able to perform all theoretically possible updates on views.
Rule 7 (Set Level Update Rule): The ability to treat whole tables as single objects applies to insertion, modification and deletion, as well as retrieval of data.
Rule 8 (Physical Data Independence Rule): User operations and application programs should be independent of any changes in physical storage or access methods.
Rule 9 (Logical Data Independence Rule): User operations and application programs should be independent of any changes in the logical structure of base tables, provided they involve no loss of information.
Rule 10 (Integrity Independence Rule): Entity and referential integrity constraints should be defined in the high-level relational language referred to in Rule 5, stored in the system catalogues and enforced by the system, not by application programs.
Rule 11 (Distribution Independence Rule): User operations and application programs should be independent of the location of data when it is distributed over multiple computers.
Rule 12 (Non-subversion Rule): If a low-level procedural language is supported, it must not be able to subvert the integrity or security constraints expressed in the high-level relational language.

Of the rules given in Table 5.1, Rules 1 to 5 and Rule 8 are well supported by
the majority of current commercially available RDBMSs. Rule 11 is
applicable to distributed database systems.

5.3 INFORMATION SYSTEM BASED LANGUAGE (ISBL)

Information system based language (ISBL) is a pure relational algebra based


query language, which was developed at IBM’s Peterlee Centre in the UK in
1973. It was first used in an experimental interactive database management
system called Peterlee Relational Test Vehicle (PRTV). Using ISBL, a
database system can be created with a size of about 50 relations, each
containing at most 65,000 tuples. Each tuple can have at most 128 columns.
Table 5.2 shows the correspondence of syntax of ISBL and relational algebra
for relations R and S. In both ISBL and relational algebra, R and S can be
any relational expression, and F is a Boolean formula.

Table 5.2 Comparison of syntax of ISBL and relational algebra
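Since the table itself is not reproduced here, the correspondence can be summarised from the examples that appear later in this section; this summary is a reconstruction rather than the original table. For relations R and S and a Boolean formula F:

R : F          (selection of the tuples of R that satisfy F)
R % A, B       (projection of R onto the attributes A and B)
R * S          (natural join of R and S; a Cartesian product when no attribute names are shared)
R + S          (union of R and S)
R - S          (difference of R and S)
LIST R         (print the value of R)
N!R            (delayed evaluation of the name R)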


In ISBL, each relation in the logical database is defined as follows:

〈 relation name 〉 (〈 domain-name 〉 : 〈 attribute 〉,…)

A domain of attributes is defined as follows:

CREATE DOMAIN 〈 domain-name 〉 : 〈 data-type 〉

where data-type can be either numeric (a number or integer) N, or string


(or a character) C.
To print the value of an expression, the command is preceded by LIST, for
example LIST P (to print the value of P). To assign the value of an
expression to a relation, the ‘equal’ symbol is used, for example R = A
(assigning the value of A to relation R). Another interesting feature of
assignment is that the binding of relations to names in an expression can be
delayed until the name on the left of the assignment is used. To delay
evaluation of a name, it is preceded by N!. N! is the delayed evaluation
operator, which serves the following two important purposes:
It allows the programmer to construct an expression in easy stages, by giving temporary
names to important sub-expressions.
It serves as a rudimentary facility for defining views.

Let us take an example of assignment statement in which we want to use


the composition of binary relations R (A, B) and S (C, D). Now, if we write
ISBL statement

XYZ = (R * S) : B = C % A, D

the composition of the current relations R and S would be computed and


assigned to the relation name XYZ. Since R and S have attributes with different
names, the *, or natural join, operator here amounts to a Cartesian product.
But, if we want XYZ to stand for the formula for composing R and S and
not for the composition of the current values of R (A, B) and S (C, D), then
we write ISBL statement
XYZ = (N!R * N!S) : B = C % A, D

The above ISBL statement causes no evaluation of relations. Rather, it


defines XYZ to stand for the formula (R * S) : B = C % A, D. If we ever use
XYZ in a statement that requires its evaluation, such as:

LIST XYZ
P = XYZ + Q

the values of R and S are at that time substituted into the formula for XYZ
to get a value for XYZ.

5.3.1 Query Examples for ISBL


a. Create an external relation.

CREATE DOMAIN NAME : C, NUMBER : N


STUDENTS (NAME : S-NAME, NUMBER : ROLL-NO., NAME : ADDRESS, NAME :
MAIN)
ADMISSION (NUMBER : ROLL-NO., NAME : COURSE, NUMBER : SEMESTER)
FACULTY (NAME : COURSE, NAME : FACULTY, NUMBER : SEMESTER)
OFFERING (NAME : BRANCH, NAME : COURSE)

b. Create another relation:

CREATE DOMAIN NAME : C, NUMBER : N


CUSTOMER (NAME : C-NAME, NAME : C-ADDRESS, NAME : C-LOCATION)
SUPPLIERS (NUMBER : S-ID, NAME : S-NAME, NAME : P-NAME, NAME :
ADDRESS)
PARTS (NUMBER : P -ID, NAME : P-NAME, NAME : M-NAME, NUMBER : COST)
CATALOG (NUMBER : S-ID, NUMBER : P-ID)

c. Print names of parts (in the relations of example (b)) having cost more than Rs. 4000.00.

NEW = PARTS : (COST > 4000)


NEW = NEW % P-NAME
LIST NEW

d. Find the Cartesian product of relations R (A, B) and S (B, D):

NEW = R % (B → C) {NEW (A, C)}


P = NEW *S {P (A, C, B, D)}
LIST P

e. Print names of the suppliers (in the relations of example (b)) who supply every part ordered by
the customer “Abhishek”

A = PARTS : (C-NAME = “Abhishek”)


A = A % (P-NAME)
B = SUPPLIERS % (S-NAME, P-NAME)
C = B% (S-NAME)
D=C*A
D=D−B
D = D% (S-NAME)
NEW = C − D
LIST NEW

f. Print names of the suppliers (in the relations of example (b)) who have not supplied any part
ordered by the customer “Abhishek”

A = PARTS : (C-NAME = “Abhishek”)


A = A% (P-NAME)
B = SUPPLIERS % (S-NAME)
C = SUPPLIERS * A
C = C % (S-NAME)
D=B−C
LIST D

5.3.2 Limitations of ISBL


When compared with query languages used in RDBMSs, the use of ISBL is
limited. However, the PRTV system can be used to write arbitrary PL/I
programs and integrate them into the processing of relations. Following are
the limitations of ISBL:
It has no aggregate operators (e.g., average).
There are no facilities for insertion, deletion or modification of tuples.

5.4 QUERY LANGUAGE (QUEL)

Query language (QUEL) is a tuple relational calculus language of a


relational database system INGRESS (Interactive Graphics and Retrieval
System). INGRESS runs under the UNIX operating system, which was developed at
AT&T Bell Laboratories, USA. The ‘C’ programming language has been used for the
implementation of both INGRESS and UNIX. The language can be used either
in a stand-alone manner, by typing commands to the QUEL processor, or
embedded in the ‘C’ programming language. When it is embedded in ‘C’,
QUEL statements are preceded by a double hash (##) and handled by a processor.
INGRESS statements are used for implementing QUEL.
Let us consider a tuple relational calculus expression of the general form

{q | (∃t1)(∃t2) … (∃tn)(R1(t1) ⋀ R2(t2) ⋀ … ⋀ Rn(tn) ⋀ q[1] = ti1[j1] ⋀ q[2] = ti2[j2] ⋀ … ⋀ q[r] = tir[jr] ⋀ ψ(t1, t2, …, tn))}

The above expression states that each ti is in Ri, and q is composed of r
components of the ti’s. The above tuple relational calculus expression can be
written in QUEL as follows:

range of t1 is R1
range of t2 is R2
:
:
range of tn is Rn
RETRIEVE (ti1.A1, ti2.A2, …, tir.Ar)
where Ψ

where Am = the jm-th attribute of relation Rim, for m = 1, 2, …, r
Ψ = translation of condition ψ into a QUEL expression.

The meaning of the statement “range of t is R” is that any subsequent
operations, until t is redeclared by another range statement, are to be carried
out once for each tuple in R, with t equal to each of these tuples in turn.
To perform the translation Ψ of condition ψ into a QUEL expression, the
following rules must be followed (a small illustrative translation is sketched after this list):
Replace any reference of ψ to a component q[m] by a reference to tim.Am, where Am is the jm-th attribute of Rim.
Replace any reference to tm[n] by tm.B, where B is the nth attribute of relation Rm, for any n and m.
Replace ≤ by <=, ≥ by >= and ≠ by != (not equal to).
Replace ⋀ by AND, ⋁ by OR and ¬ by NOT.
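For instance, using the SUPPLIERS relation of the query examples in Section 5.4.1 (the condition PRICE ≥ 100 is chosen arbitrarily, purely for illustration), the tuple relational calculus expression {t.SUP-NAME | SUPPLIERS(t) ⋀ t.PRICE ≥ 100} translates into QUEL as:

range of t is SUPPLIERS
RETRIEVE (t.SUP-NAME)
where t.PRICE >= 100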

Table 5.3 shows various QUEL operations for relations R(A1, …, An) and
S(B1, …, Bm).

Table 5.3 Summary of QUEL operations
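The table itself is not reproduced here; as a rough summary drawn only from the statements used in this section, the QUEL operations include:

range of t is R                  (declare a tuple variable t ranging over R)
RETRIEVE (t.A, s.B, …)           (project the listed attributes of the qualifying tuples)
RETRIEVE INTO T (…)              (materialise the result as a new relation T)
where ψ                          (selection condition attached to RETRIEVE or DELETE)
DELETE t where ψ                 (delete the qualifying tuples of the relation over which t ranges)
CREATE R (A IS 〈format〉, …)      (define a new relation)
COPY R (…) FROM "file"           (load a relation from a UNIX file)
SORT R, PRINT R                  (sort and print a relation)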

5.4.1 Query Examples for QUEL


a. Following relations are given as:

CUSTOMERS (CUST-NAME, CUST-ADDRESS, BALANCE)


ORDERS (ORDER-NO, CUST-NAME, ITEM, QTY)
SUPPLIERS (SUP-NAME, SUP-ADDRESS, ITEM, PRICE)

Execute the following queries:

i. Print the names of customers with negative balances

range of t is CUSTOMERS
RETRIEVE (t. CUST-NAME)
where t. BALANCE < 0
ii. Print the supplier names, items and prices of all suppliers that supply at least one
item ordered by M/s ABC Co.

range of t is ORDERS
range of s is SUPPLIERS
RETRIEVE (s. SUP-NAME, s.ITEM, s.PRICE)
where t. CUST-NAME = “M/s ABC Co.” and t. ITEM = s. ITEM

iii. Print the supplier names that supply every item ordered by M/s ABC Co.
This query can be executed in the following five steps.

Step 1: Compute the set of all supplier-item pairs and
store it in the DUMMY relation.
range of s is SUPPLIERS
range of i is SUPPLIERS
RETRIEVE INTO DUMMY (S = s.SUP-NAME, I
= i.ITEM)
Step 2: Delete from the DUMMY relation those supplier-item
pairs (S, I) such that S supplies I. This leaves
those (S, I) pairs wherein S does not supply I.
range of s is SUPPLIERS
range of t is DUMMY
DELETE t
where t.S = s.SUP-NAME and t.I = s.ITEM
Step 3: Create a relation JUNK of supplier-item pairs (S, I) such
that S is any supplier, I is not supplied by S, but
I is ordered by “M/s ABC Co.”.
range of r is ORDERS
range of t is DUMMY
RETRIEVE INTO JUNK (S = t.S, I = t.I)
where r.CUST-NAME = “M/s ABC Co.” and
r.ITEM = t.I
Step 4: Create the relation SUPPLIERS-FINAL containing all
supplier names (those appearing as the first
component of a tuple in JUNK are removed in Step 5).
range of s is SUPPLIERS
RETRIEVE INTO SUPPLIERS-FINAL (S =
s.SUP-NAME)
Step 5: Print the final result.
range of u is SUPPLIERS-FINAL
range of j is JUNK
DELETE u
where u.S = j.S
SORT SUPPLIERS-FINAL
PRINT SUPPLIERS-FINAL

b. Create an external relation.

CREATE STUDENT_ADMISSION
CREATE STUDENTS (S-NAME IS 〈format〉, Roll-NO. IS 〈format〉, ADDRESS IS 〈format〉,
MAIN IS 〈format〉)
CREATE ADMISSION (ROLL-NO. IS 〈format〉, COURSE IS 〈format〉, SEMESTER IS
〈format〉)
CREATE FACULTY (COURSE IS 〈format〉, FACULTY IS 〈format〉, SEMESTER IS
〈format〉)
CREATE OFFERING (BRANCH IS 〈format〉, COURSE IS 〈format〉)

c. Transfer the relations of example (b) to or from a UNIX file

COPY STUDENTS (S-NAME IS 〈format〉, ROLL-NO IS 〈format〉, ADDRESS IS 〈format〉,


MAIN IS 〈format〉)
FROM “student.txt”
COPY ADMISSION (ROLL-NO IS 〈format〉, COURSE IS 〈format〉, SEMESTER IS
〈format〉)
FROM “admission.txt”
COPY FACULTY (COURSE IS 〈format〉, FACULTY IS 〈format〉, SEMESTER IS 〈format〉)
FROM “faculty.txt”
COPY OFFERING (BRANCH IS 〈format〉, COURSE IS 〈format〉)
FROM “offering.txt”

5.4.2 Advantages of QUEL


QUEL supports aggregate functions such as SUM, AVG, COUNT, MIN and MAX. The
argument of such a function can be any expression involving components of a single relation,
constants and arithmetic operators.

5.5 STRUCTURED QUERY LANGUAGE (SQL)

Structured Query Language (SQL), also called Structured English Query


Language (SEQUEL), is a relational query language. It is the standard
command set used to communicate with the relational database management
system (RDBMS). It is based on the tuple relational calculus, though not as
closely as QUEL. SQL resembles relational algebra in some places and tuple
relational calculus in others. It is a non-procedural language in which block
structured format of English key words is used. SEQUEL (widely known as
SQL) was the first prototype query language, developed by IBM in the early
1970s. It was first implemented on a large scale in the IBM prototype called
System R and subsequently extended to numerous commercial products
from IBM as well as other vendors. In 1986, SQL was declared a standard
for relational data retrieval languages by the American National Standards
Institute (ANSI) and by the International Standards Organisation (ISO), and
was called SQL-86. In 1987, IBM published its own corporate SQL standard,
the System Application Architecture Database Interface (SAA-SQL). ANSI
published extended standards for SQL: SQL-89 in 1989, SQL-92 in 1992
and, most recently, SQL-1999.
SQL is both the data definition language and the data manipulation language of a
number of relational database systems such as System R, SQL/DS and DB2
of IBM, ORACLE of Oracle Corporation, INGRES of Relational
Technologies and so on. ORACLE, developed in 1979, was the first
commercial RDBMS that supported SQL. SQL is very simple to use and
interactive in nature. Users with very little or no expertise in computers can
find it easy to use. SQL facilitates the execution of all tasks related to an RDBMS
such as creating tables, querying the database for information, modifying the
data in the database, deleting them, granting access to users and so on. Thus,
it has various features such as query formulation, facilities for insertion,
deletion and update operations. It includes statements such as RETURN,
LOOP, IF, CALL, SET, LEAVE, WHILE, CASE, REPEAT and several other
related features such as variables and exception handlers. It also creates new
relations and controls the sets of indexes maintained on the database. SQL
can be used interactively to support ad hoc requests, or be embedded into
procedural code to support operational transactions. Different database
vendors use different dialects of SQL, but the basic features of all of them
are the same. They use the same base standard of the ANSI SQL standard.
SQL is essentially a free-format language, which means that parts of the
statement do not have to be typed at particular locations on the screen. There
are many software packages for example, SQL generators, CASE tools and
application development environment, where SQL statements can
automatically be generated. CASE tools such as Designer-2000, Information
Engineering Facility (IEF) and so on can be used to generate the entire
application, including the SQL statements. In a PowerBuilder application, its
DataWindow package can be used to generate SQL code. SQL code can also be
generated using browser software packages like MS-Query for querying and
updating data in a database. SQL is the main interface for communicating
between the users and RDBMS.
SQL has the following main components:
a. Data structure.
b. Data type.
c. SQL operators.
d. Data definition language (DDL).
e. Data query language (DQL).
f. Data manipulation language (DML).
g. Data control language (DCL).
h. Data administration statements (DAS).
i. Transaction control statements (TCS).

5.5.1 Advantages of SQL


SQL is the standard query language.
It is very flexible.
It has an essentially free-format syntax, which gives users the ability to structure SQL
statements in the way best suited to them.
SQL is a high level language and the command structure of SQL consists of Standard
English words.
It is supported by every product in the market.
It gives the users the ability to specify key database operations, such as table, view and index
creation, on a dynamic basis.
It can express arithmetic operations as well as operations to aggregate data and sort data for
output.
Applications written in SQL can be easily ported across systems.

5.5.2 Disadvantages of SQL


SQL is very far from being the perfect relational language and it suffers from sins of both
omission and commission.
It is not a general-purpose programming language and thus the development of an application
requires the use of SQL with a programming language.

5.5.3 Basic SQL Data Structure


In SQL, the data appears to be stored as simple linear files or relations.
These files or relations are called ‘tables’ in SQL terminology. SQL is set-
oriented in which the referenced data objects are always tables. SQL always
produces results in tabular format. The tables are accessed either
sequentially or through indexes. An index can reference one or a
combination of columns of a table. A table can have several indexes built
over it. When the data in a table changes, SQL automatically updates the
corresponding data in any indexes that are affected by that change. In SQL,
the concept of logical and physical views is implemented. A physical view is
called a ‘base table’, whereas a logical view is simply called ‘view’. The
logical view is derived from one or more base tables of physical view. A
view may consist of a subset of the columns of a single table or of two or
more joined tables.
The creation of a view in SQL does not entail the creation of a new table
by physically duplicating data in a base table. Instead, information
describing the nature of the view is kept in one or several system catalogs.
As discussed in Chapter 1, Section 1.2.6, the catalogue is a set of schemes,
which when put together, constitutes a description of a database. The queries
can be issued to either base tables or views. When a query references a view,
the information about the view in the catalogue maps it onto the base table
where the required data is physically stored. As discussed in Chapter 2,
Section 2.2, the schema is that structure which contains descriptions of
objects created by a user, such as base tables, views, constraints and so on as
part of the database.

5.5.4 SQL Data Types


Data type of every data object is required to be declared by the programmer
while using programming languages. Also, most database systems require
the user to specify the type of each data field. The data type varies from one
programming language to another and from one database application to
another. Table 5.4 lists data types supported by SQL.

Table 5.4 SQL data types


1. BIT(n): Fixed-length bit string of ‘n’ bits, numbered 1 to n.
2. BIT VARYING(n): Variable-length bit string with a maximum length of ‘n’ bits.
3. CHAR(n) or CHARACTER(n): Fixed-length character string of exactly ‘n’ characters.
4. VARCHAR(n) or CHAR VARYING(n): Variable-length character string with a maximum length of ‘n’ characters.
5. DECIMAL(p, s) or DEC(p, s) or NUMERIC(p, s): Exact decimal numeric value. The number of decimal digits (the precision) is given by ‘p’, and the number of digits after the decimal point (the scale) by ‘s’.
6. INTEGER or INT: Integer number.
7. FLOAT(p): Floating point number with precision equal to or greater than ‘p’.
8. REAL: Single precision floating point number.
9. DOUBLE PRECISION: Double precision floating point number.
10. SMALLINT: Integer number of lower precision than INTEGER.
11. DATE: Date expressed as YYYY-MM-DD.
12. TIME: Time expressed as HH:MM:SS.
13. TIME(p) or TIME WITH TIME ZONE or TIME(p) WITH TIME ZONE: The optional fractional seconds precision (p) extends the HH:MM:SS format to include fractions of seconds, for example TIME(2). WITH TIME ZONE adds six positions for a relative displacement from -12:59 to +13:00 in hours:minutes.
14. INTERVAL: Relative time interval (positive or negative). Intervals are either year/month, expressed as ‘YYYY-MM’ YEAR TO MONTH, or day/time, for example ‘DD HH:MM:SS’ DAY TO SECOND(p).
15. TIMESTAMP: Absolute time expressed as YYYY-MM-DD HH:MM:SS.
16. TIMESTAMP(p): The optional fractional seconds precision (p) extends the format as for TIME. Timestamps are guaranteed to be unique and to increase monotonically.
17. TIMESTAMP WITH TIME ZONE or TIMESTAMP(p) WITH TIME ZONE: Same as TIME WITH TIME ZONE (Serial No. 13).

5.5.5 SQL Operators


SQL operators and conditions are used to form arithmetic and
comparison expressions. Operators are represented by single characters or
reserved words, whereas conditions are combinations of several operators
or expressions that evaluate to TRUE, FALSE or UNKNOWN. Two types of
operators are used, namely binary and unary. The unary operator operates on
only one operand, while the binary operator operates on two operands. Table
5.5 shows various types of SQL operators.

Table 5.5 SQL operators


SN Operators Description
Arithmetic Operators
1. +, − Unary operators for denoting a positive (+ve) or
negative (−ve) expression.
2. * Binary operator for multiplication.
3. / Binary operator for division.
4. + Binary operator for addition.
5. − Binary operator for subtraction.
Comparison Operators
6. = Equality.
7. !=, <>, ¬= Inequality.
8. < Less than.
9. > Greater than.
10. >= Greater than or equal to.
11. <= Less than or equal to.
12. IN Equal to any member of.
13. NOT IN Not equal to any member of.
14. IS NULL Test for nulls.
15. IS NOT NULL Test for anything other than nulls.
16. LIKE Returns true when the first expression matches
the pattern of the second expression.
17. ALL Compares a value to every value in a list.
18. ANY, SOME Compares a value to each value in a list.
19. EXISTS True if sub-query returns at least one row.
20. BETWEEN x AND y >= x and <= y.
Logical Operators
21. AND Returns true if both component conditions are
true, otherwise returns false.
22. OR Returns true if either component conditions are
true, otherwise returns false.
23. NOT Returns true if the condition is false, otherwise
returns false.
Set Operators
24. UNION Returns all distinct rows from both queries.
25. UNION ALL Returns all rows from both queries.
26. INTERSECT Returns all rows selected by both queries.
27. MINUS Returns all distinct rows that are in the first
query but not in the second one.
Aggregate Operators
28. AVG Average.
29. MIN Minimum.
30. MAX Maximum.
31. SUM Total.
32. COUNT Count.

5.5.6 SQL Data Definition Language (DDL)


The SQL data definition language (DDL) provides commands for defining
relation schemas, deleting relations and modifying relation schemas. These
commands are used to create, alter and drop tables; the corresponding keywords
are CREATE, ALTER and DROP. The main logical SQL data
definition statements are:

CREATE TABLE
CREATE VIEW
CREATE INDEX
ALTER TABLE
DROP TABLE
DROP VIEW
DROP INDEX

5.5.6.1 CREATE TABLE Operation


Tables are the basic building blocks of RDBMSs. Tables contain rows
(called tuples) and columns (called attributes) of data in a database.
CREATE TABLE operation is one of the more frequently used DDL
statements. It defines the names of tables and columns, as well as specifies
the type of data allowed in each column. Fig. 5.1 illustrates the syntax of
statements for table creation operations.

Fig. 5.1 Syntax for creating SQL table

The CREATE TABLE statement specifies a logical definition of a stored


table (or base table). It specifies the name of the table and lists the name and
type of each column. The type of column may be standard data type or a
domain name. The keywords NULL and NOT NULL are optional. A
DEFAULT clause may be used to set column values automatically wherever
a new row is inserted. In the absence of a specified default value, nullable
columns will contain nulls. A type-dependent value, such as zero or an
empty string, will be used for non-nullable columns.
The PRIMARY KEY clause lists one or more columns that form the
primary key. The FOREIGN KEY clause is used to specify referential
integrity constraints and, optionally, the actions to be taken if the related
tuple is deleted or the value of its primary key is updated. If the table
contains other unique keys, the column can be specified in a UNIQUE
clause.
Data types with defined constraints and default values can be combined
into domain definitions. A domain definition is a specialised data type,
which can be defined within a schema and used as desired in columns
definitions. Limited support for domains is provided by the CREATE
DOMAIN statements, which associates a domain with a data type and,
optionally, a default value. For example, suppose we wish to define a
domain of person identifiers to be used in the column definitions of various
tables. Since we will be using it over and over again in the database schema,
we would like to simplify our work and thus, we create a domain as follows:

CREATE DOMAIN PERSON-IDENTIFIER NUMBER (6) DEFAULT (0)
or
CREATE DOMAIN PERSON-IDENTIFIER NUMERIC (6) DEFAULT (0) CHECK (VALUE IS NOT NULL);

The above definition says that a domain named PERSON-IDENTIFIER


has the following properties: its data type is a six-digit numeric and its default
value is zero. Any column defined with this domain as its data type will have
all these properties. As shown in the second form above, the domain
definition may also be followed by a constraint definition that limits the
range of possible values by employing a CHECK clause. Here the domain has
the additional property that it can never be null. Now we can define columns in
our schema with PERSON-IDENTIFIER as their data type. Fig. 5.2
illustrates examples of creating tables for an employee health centre database.

Fig. 5.2 Creating SQL table for employee health centre schema
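The table definitions of Fig. 5.2 are not reproduced here. A minimal sketch of what two of them might look like is given below; the DOCTOR-ID, PHONE-NO, ROOM-NO, PATIENT-ID, DATE-REGISTERED and REGISTERED-WITH columns and the PATIENT-REG constraint are taken from the surrounding text, while the remaining column names and sizes are only illustrative assumptions.

CREATE TABLE DOCTOR
    (DOCTOR-ID       PERSON-IDENTIFIER NOT NULL,   -- uses the domain defined above
     DOCTOR-NAME     CHAR (30),                    -- assumed column
     PHONE-NO        CHAR (15),                    -- referenced again in Section 5.5.6.5
     ROOM-NO         CHAR (5),                     -- referenced in the ALTER TABLE examples
     PRIMARY KEY (DOCTOR-ID));

CREATE TABLE PATIENT
    (PATIENT-ID       PERSON-IDENTIFIER NOT NULL,
     PATIENT-NAME     CHAR (30),                   -- assumed column
     DATE-REGISTERED  DATE,
     REGISTERED-WITH  PERSON-IDENTIFIER,           -- foreign key to DOCTOR
     PRIMARY KEY (PATIENT-ID),
     CONSTRAINT PATIENT-REG
         FOREIGN KEY (REGISTERED-WITH) REFERENCES DOCTOR (DOCTOR-ID)
             ON DELETE SET NULL
             ON UPDATE CASCADE);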
As shown in Fig. 5.2 under the PATIENT table, a constraint may be named by
preceding it with CONSTRAINT 〈constraint-name〉. The ON UPDATE and ON
DELETE clauses are used to trigger referential integrity checks and
to specify their corresponding actions. The possible actions of these clauses
are SET NULL, SET DEFAULT and CASCADE. Both SET NULL and SET
DEFAULT remove the relationship by resetting the foreign key value to null,
or to its default if it has one. The action is same for both updates and deletes.
The effect of CASCADE depends on the event. With ON UPDATE, a
change to the primary key value in the related tuple is reflected in the foreign
key. Changing a primary key should normally be avoided but it may be
necessary when a value has been entered incorrectly. Cascaded update
ensures that referential integrity is maintained. With ON DELETE, if the
related tuple is deleted then the tuple containing the foreign key is also
deleted. Cascaded deletes are therefore appropriate for mandatory
relationships such as those involving weak entity classes. As shown in Fig.
5.2, the PATIENT table includes the following named referential integrity
constraints:

CONSTRAINT PATIENT-REG
FOREIGN KEY (REGISTERED-WITH) REFERENCES DOCTOR
(DOCTOR-ID)
ON DELETE SET NULL
ON UPDATE CASCADE

In the above statements, the registration of patient with a doctor is


optional to enable patient details to be entered before the patient is assigned
to a doctor, and to simplify the task of transferring a patient from one doctor
to another. The foreign key REGISTERED-WITH will be updated to reflect
any change in the primary key of the doctor table, but if the related doctor
tuple is deleted, it will be set to null. By default, all constraints are
immediate and not deferrable. This means that they are checked immediately
after any change is made and that this behavior cannot be changed.
It is also possible to create local or global temporary tables within a
transaction as shown in Fig. 5.3. They may be preserved or deleted when the
transaction is committed.
Fig. 5.3 Creating local or global temporary table

5.5.6.2 DROP TABLE Operation


DROP operation is used for deleting tables from the schema. It can be used
to delete all rows currently in the named table and to remove the entire
definition of the table from the schema. An entire schema can also be dropped. The
syntax of DROP statement is given in Fig. 5.4 below:

Fig. 5.4 Syntax for DROP operation

An example of DROP operations is given below:

DROP SCHEMA HEALTH-CENTRE


or DROP TABLE PATIENT
or DROP COLUMN CONSTRAINT

Since a simple DROP statement can be a dangerous operation, either


CASCADE or RESTRICT must be specified with it as shown below:
DROP SCHEMA HEALTH-CENTRE CASCADE
or DROP TABLE PATIENT CASCADE

The above statement means to drop the schema named as well as all
tables, data and other schema objects that still exist (that means removing
the entire schema irrespective of its content).

DROP SCHEMA HEALTH-CENTRE RESTRICT


DROP TABLE PATIENT RESTRICT

The above statement means to drop the schema only if all other schema
objects have already been deleted (that is, only if the schema is empty).
Otherwise, an exception will be raised.

5.5.6.3 ALTER TABLE operation


ALTER operation is used for changing the definitions of tables. It is a
schema evolution command. It can be used to add one or more columns to a
table, change the definition of an existing column, or drop a column from a
table. The syntax of ALTER statement is given in Fig. 5.5, as shown below.

Fig. 5.5 Syntax for ALTER operation

An example of ALTER operations is given below:

ALTER TABLE PATIENT ADD COLUMN ADDRESS CHAR (30)


or ALTER TABLE DOCTOR DROP COLUMN ROOM-NO
or ALTER TABLE DOCTOR DROP COLUMN ROOM-NO
RESTRICT
or ALTER TABLE DOCTOR DROP COLUMN ROOM-NO
CASCADE

Again, CASCADE and RESTRICT can be used in the above statements to


determine the drop behavior when constraints or views depend on the
affected column. Column default values may be altered or dropped, as
shown below:

ALTER TABLE APPOINTMENT


ALTER COLUMN APPT-DURATION SET DEFAULT 20
or ALTER TABLE APPOINTMENT
ALTER COLUMN APPT-DURATION DROP DEFAULT

Here, the default value of 10 for the appointment duration has been
changed to 20. The default can even be removed, as shown in the second
statement above.

5.5.6.4 CREATE INDEX Operation


An index is a structure that provides faster access to the rows of a table
based on the values of one or more columns. The index stores data values
and pointers to the rows (tuples) where those data values occur. An index
sorts data values and stores them in ascending or descending order. Indexes
are created in most RDBMSs to provide rapid random and sequential access
to base table data. It can help in quickly executing a query to locate
particular column and rows. The CREATE INDEX operation allows the
creation of an index for an already existing relation. The columns to be used
in the generation of the index are also specified. The index is named and the
ordering for each column used in the index can be specified as either
ascending or descending. Like tables, indexes can also be created
dynamically. Fig. 5.6 illustrates the syntax of statements for index creation
operations.
Fig. 5.6 Syntax for creating index

The CLUSTER option can also be specified to indicate that the records
are to be placed in physical proximity to each other. The UNIQUE option
specifies that only one record can exist at any time with a given value for
the column(s) specified in the statement that creates the index. An example of
creating an index for the EMPLOYEE relation is given below.

CREATE INDEX EMP-INDEX


ON EMPLOYEE (LAST-NAME ASC, SEX DESC);

The above statement causes a creation of an index called EMP-INDEX


with columns LAST-NAME and SEX from the relation (table) EMPLOYEE.
The entries in the index are ascending by LAST-NAME value and
descending by SEX. In the above example, there are no restrictions on the
number of records with the same LAST-NAME and SEX. An existing
relation or index can be deleted for the database by using the DROP
statement in the similar way as explained for table and schema operations.

5.5.6.5 Create View Operation


A view is a named table that is represented by its definition in terms of other
named tables. It is a virtual table, which is constructed automatically as
needed by the DBMS and is not maintained as real data. The real data are
stored in base tables. The CREATE VIEW operation defines a logical table
from one or more tables or views. Views may not be indexed. Fig. 5.7
illustrates the general syntax of creating view definition.
Fig. 5.7 Syntax for creating view definition

The sub-query cannot include either UNION or ORDER BY. The clause
‘WITH CHECK OPTION’ indicates that modifications (update and insert)
operations against the view are to be checked to ensure that the modified
row satisfies the view-defining condition. There are limitations on updating
data through views. Where views can be updated, those changes can be
transferred to the underlying base tables originally referenced to create the
view. An example of creating a view over the DOCTOR and PATIENT relations is given below:

CREATE VIEW PATIENT-VIEW
AS SELECT DOCTOR.DOCTOR-ID, DOCTOR.PHONE-NO,
PATIENT.PATIENT-ID, PATIENT.DATE-REGISTERED
FROM DOCTOR, PATIENT

The above view operation will result in the creation of a PATIENT-VIEW
table listing the columns DOCTOR-ID and PHONE-NO from the
DOCTOR table and PATIENT-ID and DATE-REGISTERED from the
PATIENT table.
The main purpose of a view is to simplify query commands. However, a
view may also provide data security and significantly enhance programming
productivity for a database. A view always contains the most recent derived
values and is thus superior in terms of data currency to constructing a
temporary real table from several base tables. It consumes very little storage
space. However, it is costly because its contents must be calculated each
time they are requested.

5.5.7 SQL Data Query Language (DQL)


SQL data query language (DQL) is one of the most commonly used SQL
statements that enable the users to query one or more tables to get the
information they want. DQL has only one data query statement whose
syntax is SELECT. The SELECT statement is used for retrieval of data from
the tables and produce reports. It is the basis for all database queries. The
SELECT statement of SQL has no relationship to the SELECT or
RESTRICT operations of relational algebra, which was discussed in Chapter
4. SQL table departs from the strict definition of a relation in that unique
rows are not enforced. SQL allows a table (relation) to have two or more
rows (tuples) that are identical in all their attribute (column) values. Thus, a
query result may contain duplicate rows. Hence, in general, an SQL table is
not a set of tuples as is the case with relation, because a set does not allow
two identical members. In fact, an SQL table is a multiset (sometimes called
a bag) of tuples (or rows). Some SQL relations are constrained to be sets
because a key constraint has been declared or because the DISTINCT option
has been used with the SELECT statement. A typical SQL statement for
SELECT operation can be made up of two or more of the clauses as shown
in Fig. 5.8 below:

Fig. 5.8 Syntax for SQL SELECT statement

In the above syntax, the clauses such as WHERE, GROUP BY, HAVING
and ORDER BY, are optional. They are included in the SELECT statement
only when the functions provided by them are required in the query. In its basic
form, the SQL SELECT statement is formed of three clauses, namely
SELECT, FROM and WHERE. This basic form of the SELECT statement is
sometimes called a mapping or a select-from-where block. These three
clauses correspond to the relational algebra operations as follows:
The SELECT clause corresponds to the projection operation of the relational algebra. It is
used to list the attributes (columns) desired in the result of a query. SELECT * is used to get
all the columns of a particular table.
The FROM clause corresponds to the Cartesian-product operation of the relational algebra. It
is used to list the relations (tables) to be scanned from where data has to be retrieved.
The WHERE clause corresponds to the selection predicate of the relational algebra. It
consists of a predicate involving attributes of the relations that appear in the FROM clause. It
tells SQL to include only certain rows of data in the result set. The search criteria are specified
in the WHERE clause.

Fig. 5.9 illustrates the variations of SELECT statements and their results


from the queries made to the database systems shown in Fig. 4.15, Chapter
4.

Fig. 5.9 Examples of query using SELECT statement
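The queries of Fig. 5.9 are not reproduced here. As an illustration only, assuming the warehouse database of Fig. 4.15 contains tables such as ITEMS (ITEM-NO, DESC, WEIGHT), STORED (WH-ID, ITEM-NO, QTY) and WAREHOUSE (WH-ID, LOCATION), with these names drawn partly from the exercises of Chapter 4 and partly assumed, typical SELECT statements look as follows:

SELECT *
FROM ITEMS;                        -- retrieve every column of every item

SELECT ITEM-NO, DESC
FROM ITEMS
WHERE WEIGHT > 8;                  -- projection combined with a selection predicate

SELECT S.WH-ID, I.DESC, S.QTY
FROM STORED S, ITEMS I
WHERE S.ITEM-NO = I.ITEM-NO;       -- join of two tables using the alias names S and I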


5.5.7.1 Abbreviation or Alias Name
Column names may be qualified by the name of the table (or relation) in
which they are found. But this is only necessary, to prevent ambiguity, where
queries involve two or more tables containing columns with the same name.
For this reason, the abbreviations (also called correlation or alias names) S
and I have been used in Query 5 for the two relations STORED and
ITEMS. Instead of abbreviations, the relation names can also be directly
used to qualify the attribute name, for example, STORED.ITEM-NO,
ITEMS.ITEM-NO and so on. Where the column name is unique the table
qualification may be omitted. Queries can also be shortened by using an
abbreviation name for a table name. This abbreviation or alias is specified in
the FROM clause.

5.5.7.2 Aggregate Functions and the GROUP BY Clause


SQL provides several set, or aggregate, functions and the GROUP BY
clause for summarising the contents of columns. The GROUP BY clause is usually
used with aggregate functions such as AVG, SUM, MIN, MAX and so on. It
is used to produce summary information when querying the tables of a
database. Examples of the GROUP BY clause, with reference to Fig. 4.15 of
Chapter 4, are given in Fig. 5.10 below.

5.5.7.3 HAVING Clause


The HAVING clause is used to include only certain groups produced by the
GROUP BY clause in the query result set. It plays the same role for groups that the
WHERE clause plays for rows, and is used to specify the search condition when a GROUP
BY clause is specified. An example of the HAVING clause, with reference to Fig.
4.15 of Chapter 4, is given in Fig. 5.11 below.
Fig. 5.10 Examples of aggregate functions and GROUP BY clauses
Fig. 5.11 Examples of HAVING clause
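The queries of Figs. 5.10 and 5.11 are not reproduced here. As an illustration only, using the assumed STORED (WH-ID, ITEM-NO, QTY) table introduced above, GROUP BY and HAVING are typically combined with aggregate functions as follows:

SELECT WH-ID, SUM(QTY)
FROM STORED
GROUP BY WH-ID;                    -- total quantity of items held by each warehouse

SELECT WH-ID, SUM(QTY)
FROM STORED
GROUP BY WH-ID
HAVING SUM(QTY) > 1000;            -- keep only the groups whose total exceeds an arbitrary threshold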

5.5.7.4 ORDER BY Clause


The ORDER BY clause is used to sort the results based on the data in one or
more columns in the ascending or descending order. The default of ORDER
BY clause is ascending (ASC) and if nothing is specified the result set will
be sorted in ascending order. An example of ORDER BY clause, with
reference to Fig. 4.15 of Chapter 4, is given in Fig. 5.12 below:

Fig. 5.12 Examples of ORDER BY clause
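The query of Fig. 5.12 is not reproduced here; an illustrative ORDER BY query over the assumed ITEMS table is:

SELECT ITEM-NO, WEIGHT
FROM ITEMS
ORDER BY WEIGHT DESC, ITEM-NO ASC;   -- heaviest items first, ties broken by ascending item number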

5.5.8 SQL Data Manipulation Language (DML)


The SQL data manipulation language (DML) provides query language based
on both the relational algebra and the tuple relational calculus. It provides
commands for updating, inserting, deleting, modifying and querying the data
or tuples in the database. These commands may be issued interactively, so
that a result is returned immediately following the execution of the
statement. The SQL DML commands are INSERT, DELETE and
UPDATE.

5.5.8.1 SQL INSERT Command


The SQL INSERT command is used to add a new tuple (row) to a relation.
The relation (or table) name and list of values of the tuple must be specified.
The value of each attribute (column or field) of the tuple (row or record) to
be inserted is either specified by an expression or could come from selected
records of existing relations. The values should be listed in the same order in
which the corresponding attributes were specified in the CREATE TABLE
commands (already discussed in Section 5.5.6) or in the order of existing
relation. The syntax for INSERT command is given as:
INSERT INTO 〈table-name〉 [(attributes-name)]
VALUES (list of values for row 1,
list of values for row 2,
:
:
list of values for row n);

In the above syntax, the list of attribute names following the relation name is optional.
An example of INSERT command, with reference to Fig. 4.15 of Chapter 4,
is given in Fig. 5.13 below:
Fig. 5.13 Examples of INSERT command
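The queries of Fig. 5.13 are not reproduced here. Illustrative INSERT statements over the assumed ITEMS (ITEM-NO, DESC, WEIGHT) table, with purely invented values, might read:

INSERT INTO ITEMS
VALUES ('I-101', 'Electrode', 12);              -- Query 1 style: one value for every column, in CREATE TABLE order

INSERT INTO ITEMS (ITEM-NO, DESC)
VALUES ('I-102', 'Welding rod');                -- Query 2 style: explicit attribute names; WEIGHT receives its DEFAULT or NULL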

In Query 2 of the above example, the INSERT command allows the user
to specify explicit attribute names that correspond to the values provided in
the INSERT command. This is useful if a relation has many attributes, but
only a few of those attributes are assigned values in the new tuple. The
attributes not specified in the command format (as shown in Query 2), are
set to their DEFAULT or to NULL and the values are listed in the same
order as the attributes are listed in the INSERT command itself.

5.5.8.2 SQL DELETE Command


The SQL DELETE command is used to delete or remove tuples (rows) from
a relation. It includes WHERE clause, similar to that used in an SQL query,
to select the tuples to be deleted. Tuples are explicitly deleted from only one
relation (table) at a time. The syntax of the DELETE command is given as:

DELETE FROM 〈table-name〉


WHERE 〈predicate(s)〉

An example of DELETE command, with reference to Fig. 4.15 of Chapter


4, is given in Fig. 5.14 below:

Fig. 5.14 Examples of DELETE command
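The queries of Fig. 5.14 are not reproduced here. Illustrative DELETE statements over the assumed ITEMS and STORED tables, matching the forms discussed below, might read:

DELETE FROM ITEMS
WHERE ITEM-NO = 'I-101';                        -- delete only the selected tuples

DELETE FROM ITEMS;                              -- Query 2 style: no WHERE clause, so every tuple is deleted

DELETE FROM STORED
WHERE ITEM-NO IN
    (SELECT ITEM-NO
     FROM ITEMS
     WHERE DESC = 'Electrode');                 -- Query 3 style: WHERE clause containing a sub-query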

If the WHERE clause is not given, as is the case in Query 2 of the above
example, all tuples in the relation are deleted. However,
the table remains in the database as an empty table. To remove the table
completely, a DROP statement (as discussed in Section 5.5.6) can be used.
The WHERE clause of a DELETE command may contain a sub-query as
illustrated in Query 3 in the above example. In this case, the ITEM-NO
column of each row in the STORED table is tested for membership of the
multi-set returned by the sub-query.

5.5.8.3 SQL UPDATE Command


The SQL UPDATE command is used to modify attribute (column) values of
one or more selected tuples (rows). The tuples to be modified are specified by a
predicate in a WHERE clause, and the new values of the columns to be
updated are specified by a SET clause. The syntax of the UPDATE command
is given as:

UPDATE 〈table-name〉
SET 〈target-value-list〉
WHERE 〈predicate〉

An example of UPDATE command, with reference to Fig. 4.15 of Chapter


4, is given in Fig. 5.15 below:
Fig. 5.15 Examples of UPDATE command
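The queries of Fig. 5.15 are not reproduced here. Illustrative UPDATE statements over the assumed tables might read:

UPDATE STORED
SET QTY = QTY + 50
WHERE ITEM-NO = 'I-101';                        -- simple update of the selected rows

UPDATE STORED
SET QTY = 0
WHERE ITEM-NO IN
    (SELECT ITEM-NO
     FROM ITEMS
     WHERE DESC = 'Electrode');                 -- Query 2 style: update driven by a search condition on another table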

As with other statements, an update may be performed according to the


result of a search condition involving other tables, as illustrated in Query 2
in the above example.

5.5.9 SQL Data Control Language (DCL)


SQL data control language (DCL) provides commands to help database
administrator (DBA) to control the database. It consists of the commands
that control the user access to the database objects. Thus, SQL DCL is
mainly related to the security issues, that is, determining who has access to
the database objects and what operations they can perform on them. It
includes commands to grant or revoke privileges (or authorisation) to access
the database or particular objects within the database and to store or remove
transactions that would affect the database. The commands are GRANT and REVOKE.

5.5.9.1 SQL GRANT Command


The SQL GRANT command is used by the DBA to grant privileges to users.
The syntax of the GRANT command is given as:

GRANT 〈privilege(s)〉
ON 〈table-name/view-name〉
TO 〈user(s)-id〉, 〈group(s)-id〉, 〈public〉

The key words for this command are GRANT, ON and TO. A privilege is
typically a SQL command such as CREATE, UPDATE or DROP and so on.
The user-id is the identification code of the user to whom the DBA wants to
grant the specific privilege. Examples of the GRANT command are given
below:

Example 1: GRANT CREATE


ON ITEMS
TO Abhishek
Example 2: GRANT DROP
ON ITEMS
TO Abhishek
Example 3: GRANT UPDATE
ON ITEMS
TO Abhishek
Example 4: GRANT CREATE, UPDATE, DROP,
SELECT
ON ITEMS
TO Abhishek
WITH GRANT OPTION

In the above examples, the DBA has granted a user-id named Abhishek the
capability to create, update, drop and/or select tables. As shown in Example
4, the DBA has granted Abhishek the right to create, update, drop and select
data in the ITEMS table. Furthermore, because of the WITH GRANT OPTION clause,
Abhishek can grant these same rights to others at his discretion.

5.5.9.2 SQL REVOKE Command


The SQL REVOKE command is issued by the DBA to revoke privileges
from users. It is opposite to the GRANT command. The syntax of the
REVOKE command is given as:

REVOKE 〈privilege(s)〉
ON 〈table-name/view-name〉
FROM 〈user(s)-id〉, 〈group(s)-id〉, 〈public〉

The key words for this command are REVOKE, ON and FROM. Examples
of the REVOKE command are given below:

Example 1: REVOKE CREATE


ON ITEMS
FROM Abhishek
Example 2: REVOKE DROP
ON ITEMS
FROM Abhishek
Example 3: REVOKE UPDATE
ON ITEMS
FROM Abhishek
Example 4: REVOKE CREATE, UPDATE, DROP,
INSERT, SELECT
ON ITEMS
FROM Abhishek

In the above examples, the DBA has revoked the privileges that were previously granted to the user-id Abhishek.
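
In the ISO SQL standard (this option is not shown in the simplified syntax above, so treat it as a hedged addition rather than part of the book's syntax), a REVOKE statement also carries a CASCADE or RESTRICT keyword that says what should happen to privileges the user has in turn passed on under WITH GRANT OPTION. For example:

REVOKE SELECT
ON ITEMS
FROM Abhishek CASCADE

With CASCADE, any grants of SELECT on ITEMS that Abhishek had made to other users are revoked as well; with RESTRICT, the REVOKE fails if such dependent grants exist.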

5.5.10 SQL Data Administration Statements (DAS)


The SQL data administration statements (DAS) allow the user to perform audits and analyses of operations within the database. They are also used to analyse the performance of the system. Data administration differs from database administration in that database administration is the overall administration of the database, whereas data administration is only a subset of it. DAS has only two statements, START AUDIT and STOP AUDIT.

5.5.11 SQL Transaction Control Statements (TCS)


A transaction is a logical unit of work consisting of one or more SQL
statements that is guaranteed to be atomic with respect to recovery. It may be
defined as a process that contains either read commands, write commands or
both. An SQL transaction automatically begins with a transaction-initiating
SQL query executed by a user or program. SQL TCS manages all the
changes made by the DML statements. The two main TCS commands are COMMIT and ROLLBACK.
A COMMIT statement ends the transaction successfully, making the
database changes permanent. A new transaction starts after COMMIT with
the next transaction-initiating statement.
A ROLLBACK statement aborts the transaction, backing out any changes
made by the transaction. A new transaction starts after ROLLBACK with the
next transaction-initiating statement.
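
As a hedged illustration (the ITEM-NAME column of ITEMS and the WAREHOUSE-NO column of STORED are assumed names), the following unit of work records a new item and where it is stored; COMMIT makes both changes permanent together, whereas a ROLLBACK issued instead would undo both:

INSERT INTO ITEMS (ITEM-NO, ITEM-NAME)
VALUES (2001, 'Gasket');
INSERT INTO STORED (WAREHOUSE-NO, ITEM-NO)
VALUES (1001, 2001);
COMMIT;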

5.6 EMBEDDED STRUCTURED QUERY LANGUAGE (SQL)


We have looked at a wide range of SQL query constructs in the previous
sections, wherein SQL is treated as an independent language in its own right.
An RDBMS supports an interactive SQL interface through which users directly enter these SQL commands. In practice, however, we often need the greater flexibility of a general-purpose programming language, for example to integrate a database application with a graphical user interface, in addition to the data manipulation facilities provided by SQL. To deal with such requirements, SQL statements can be embedded directly in a procedural language (that is, in a program's source code in COBOL, C, Java, PASCAL, FORTRAN, PL/I and so on) along with other statements of the programming language. A language in which SQL queries are embedded is referred to as the host programming language. The use of SQL commands within a host program is called embedded SQL. Special delimiters specify
the beginning and end of the SQL statements in the program. Thus, SQL’s
powerful retrieval capabilities can be used even within a traditional type of
programming language. The command syntax in the embedded mode is
basically the same as in the interactive SQL query mode, except that some additional mechanisms (such as a special pre-processor) are required to compensate for the differences between the nature of SQL queries and the programming
language environment. The general syntax for embedded SQL is given as:

EXEC SQL 〈embedded SQL statement〉


END-EXEC

For example, the following code segment shows how an SQL statement is
included in a COBOL program.

COBOL statements


EXEC SQL
SELECT 〈attribute(s)-name〉
INTO :WS-NAME
FROM 〈table(s)-name〉
WHERE 〈conditions〉
END-EXEC

The embedded SQL statements are thus used in the application to perform
the data access and manipulation tasks. A special SQL pre-compiler accepts the combined source code, that is, code containing the embedded SQL statements together with the programming language statements, and compiles it into an executable form. This compilation process is slightly different from the compilation of a program that does not have embedded SQL statements.
The exact syntax for embedded SQL requests depends on the language in
which SQL is embedded. For instance, a semicolon (;) is used instead of END-EXEC (as in the case of COBOL) when SQL is embedded in ‘C’. The Java embedding of SQL (called SQLJ) uses the syntax

#SQL {〈embedded SQL statement〉};

A statement SQL INCLUDE is placed in the program to identify the place


where the pre-processor should insert the special variables used for
communication between the program and the database system. Variables of the host programming language can be used within embedded SQL statements, but they must be preceded by a colon (:) to distinguish them from SQL
variables. A CURSOR is used to enable the program to loop over the
multiset of rows and process them one at a time. In this case the syntax of
embedded SQL can be written as:

EXEC SQL
DECLARE 〈variable-name〉 CURSOR FOR
SELECT 〈attribute(s)-name〉
FROM 〈table(s)-name〉
WHERE 〈conditions〉
END-EXEC

The declaration of CURSOR has no immediate effect on the database. The


query is only executed when the cursor is opened, after which the cursor
refers to the first record in the result set. Data values are then copied from
the table structures into host programming language variables using the FETCH
statement. When no more tuples (records) are available, the cursor is closed.
An embedded SQL program executes a series of FETCH statements to
retrieve tuples of the result. The FETCH statement requires one host
programming language variable for each attribute of the result relation.
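
As a hedged sketch in the same COBOL-style notation used above (the cursor name ITEM-CURSOR and the host variables WS-ITEM-NO and WS-ITEM-NAME are assumed for illustration), the usual sequence is to OPEN the cursor, FETCH repeatedly until the standard status variable SQLCODE reports that no more rows are available (conventionally the value +100), and then CLOSE the cursor:

EXEC SQL OPEN ITEM-CURSOR END-EXEC

EXEC SQL
FETCH ITEM-CURSOR
INTO :WS-ITEM-NO, :WS-ITEM-NAME
END-EXEC

EXEC SQL CLOSE ITEM-CURSOR END-EXEC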

5.6.1 Advantages of Embedded SQL


Since SQL statements are merged with the host programming language, it combines the
strengths of two programming environments.
The executable program of embedded SQL is very efficient in its CPU usage because the use of a pre-compiler shifts the CPU-intensive parsing and optimisation to the development phase.
The program’s run-time interface to the private database routines is transparent to the application program. The programmers work with the embedded SQL at the source code level. They need not be concerned about other database-related issues.
The portability is very high.

5.7 QUERY-BY-EXAMPLE (QBE)

Query-By-Example (QBE) is a two-dimensional domain calculus language.


It was originally developed for mainframe database processing and became
prevalent in personal computer database systems as well. QBE was
originally developed by M.M. Zloof at IBM’s T.J. Watson Research Centre,
Yorktown Heights, in the early 1970s, to help users in their retrieval of data from
a database. The QBE data manipulation language was later used in IBM’s
Query Management Facility (QMF) with SQL/DS and DB2 database system
on IBM mainframe computers. QMF is IBM’s front-end relational query and
report generating product. QBE was so successful that this facility is now provided in one form or another by most RDBMSs, including many of today's database systems for personal computers, such as Microsoft Access. It is the most widely available direct-manipulation database query language. Almost all RDBMSs, such as MS-Access, DB2, INGRES, ORACLE and so on, have some form of QBE or QBE-based query system. QBE is a terminal-based query language in which queries are composed using a special screen editor. A button on the terminal allows the user to call
tables show the relation schema. An example of a skeleton table, with
reference to Fig. 4.15 of Chapter 4, is given in Fig. 5.16 below:
The first column (entry) in the skeleton table denotes the relation (or
table) name. The rest of the columns denote attribute (field) names. In QBE, rather than clutter the display with all skeletons, the user can select only those skeletons that are needed for a given query and fill in the skeletons with example rows. An example row consists of two kinds of items, namely (a) constants, which are used without qualification, and (b) domain variables (example elements), for which an underscore character ‘_’ is used as a qualifier.
There are several QBE commands that are used in a QBE query. All of the QBE commands begin with a command letter followed by a period ‘.’. Table 5.6 illustrates the QBE commands used in constructing queries.

5.7.1 QBE Queries on One Relation (Single Table Retrievals)


Let us take the example of Fig. 4.15 in Chapter 4 to execute QBE queries on a single relation (single-table retrievals). As discussed above, a skeleton of the target relation (table) must first be brought onto the screen to execute a QBE query. Fig. 5.17 illustrates QBE queries for single-table retrievals.
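
For instance, a hedged sketch of such a query (the ITEMS skeleton and its columns ITEM-NO, ITEM-NAME and QTY are assumed for illustration and may differ from Fig. 4.15) prints the numbers and names of all items whose quantity exceeds 100:

ITEMS | ITEM-NO | ITEM-NAME | QTY
      | P.      | P.        | > 100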

5.7.2 QBE Queries on Several Relations (Multiple Table Retrievals)


QBE allows queries that span several different relations (tables). This is analogous to the join operations (such as the Cartesian product or natural join) of the relational algebra already discussed in Chapter 4. All the involved tables are displayed simultaneously for a join operation in QBE. The join attributes (fields) are
indicated by placing the same (arbitrary) variable name under the matching
join columns in the different tables. Attributes are added to and deleted from
one of the tables to create the model for the desired output. Fig. 5.18
illustrates QBE queries involving multiple table retrievals with reference to
the example of Fig. 4.15 in Chapter 4.
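
As a hedged sketch (the WAREHOUSE-NO column of the STORED skeleton and the columns of ITEMS are assumed), placing the same example element _X under ITEM-NO in both skeletons joins the two tables and prints the names of the items held in warehouse 1001:

ITEMS  | ITEM-NO | ITEM-NAME | QTY
       | _X      | P.        |

STORED | WAREHOUSE-NO | ITEM-NO
       | 1001         | _X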

Fig. 5.16 QBE skeleton tables

Table 5.6 QBE commands


SN Command Description
1. P Print or Display the entire contents of a table
2. D Delete
3. I Insert
4. U Update
5. AO Ascending Order
6. DO Descending Order
7. LIKE Keyword for pattern matching with wildcard characters
8. % To replace an arbitrary number of unknown
characters
9. _ (Underscore) To replace a specific number of unknown
characters
10. CNT In-built function for counting the entries in a column
11. UNQ Keyword for ‘Unique’ (equivalent SQL’s
‘DISTINCT’)
12. G Keyword for ‘Grouping’ (equivalent SQL’s
‘GROUP BY’)
13. SUM, AVG, MAX, MIN In-built aggregate functions
14. >, <, = Comparison operators

5.7.3 QBE for Database Modification (Update, Delete and Insert)


The QBE data manipulation operations are straightforward and follow the
same general syntax that we have explained in earlier sections. Fig. 5.19
illustrates QBE queries involving update, delete and insert operations with
reference to the example of Fig. 4.15 in Chapter 4.
Fig. 5.17 QBE queries for single table retrievals
Fig. 5.18 QBE queries for multiple table retrievals
Fig. 5.19 QBE Queries for data manipulation operations
5.7.4 QBE Queries on Microsoft Access (MS-ACCESS)
QBE for Microsoft Access (MS-ACCESS) is designed for a graphical display environment and is accordingly called graphical query-by-example (GQBE). As with general QBE, both single-table and multiple-table queries can be built with GQBE. A query is designed to make an enquiry about data in a database, that is, to tell MS-Access what data to retrieve. MS-Access offers a number of query types. Table 5.7 lists the queries that can be created in MS-Access; those that can be created with the Query Wizard have an asterisk ‘*’ beside their names.

Table 5.7 List of MS-Access Queries


SN Type of Query Description
1. Select * Makes an enquiry or defines a set of criteria about the data
in one or more tables. It gathers information from one, two
or more database tables. The select query is the standard
query on which all other queries are built. It is also known
as simple query.
2. Append Copies data from one or several different tables to a single
table.
3. Aggregate (total) Performs calculations on groups of records.
4. AutoLookup Automatically enters certain field values in new records.
5. Calculation Allows a field to be added to the query results table for making calculations with the data that the query returns.
6. Advanced Filter/Sort Sorts data on two or more fields instead of one, in ascending or descending order. This type of query works on one database table only. To sort on two, three or more fields, one has to run an Advanced Filter/Sort query.
7. Parameter Displays a dialog box that tells the data-entry person what type of information to enter.
8. Find Matched * or Find Finds duplicate records in related tables. In other words, it
Duplicates * finds all records with field values that are also found in
other records.
9. Find Unmatched* Compares database tables to find distinct records in related
tables, that is records in the first table for which a match
cannot be found in the second table. This query is useful in
maintaining referential integrity in a database.
10. Crosstab* Displays information in a matrix instead of a standard table. It allows large amounts of data to be summarised and presented in a compact matrix form. Crosstab makes it easier to compare the information in a database.
11. Delete Permanently deletes records that meet certain criteria from
the database.
12. Make-table Creates a table from the results of a query. This type of
query is useful for backing up records.
13. Summary Finds the sum, average, lowest or highest value, number of,
standard deviation, variance, or first or last value in the
field in a query results table.
14. Update Finds records that meet certain criteria and updates those
records en masse. This query is useful for updating records.
15. Top-Value Finds the highest or lowest values in a field.
16. SQL Uses an SQL statement to combine data from different
database tables. SQL queries come in four kinds: a data definition query creates or alters objects in a database; a pass-through query sends commands for retrieving data or changing records; a subquery queries another query for certain results; and a union query combines fields from different queries.
Fig. 5.20 MS-Access 2000 screen for database creation with the name of WAREHOUSE

(a) Menu Screen

(b) WAREHOUSE Database table creation


The select query is the foundation on which all the others are built. It is
the most commonly used query. Using select queries, one can view, analyse or make changes to the data in the database. One can create one's own select query; however, it can also be created using the Query Wizard. When we create our own select query from scratch, MS-Access opens the Select Query Window and displays a dialog box (that is, it lists the tables we have created). The Select Query Window is the graphical Query-By-Example (QBE) tool. Through this Select Query Window dialog box, the tables and/or queries that contain the data we want to add to the query are selected. With the help of the graphical features of QBE, one can use a mouse to select, drag or manipulate objects in the window to define an example of the records one wants to view. The fields and records to be included in the query are specified in the QBE grid. Whenever a query is created using the QBE grid, MS-Access constructs the equivalent SQL statement in the background. This SQL statement can be viewed or edited in the SQL view of MS-Access. Fig. 5.21 illustrates QBE queries in MS-Access.
Fig. 5.21 QBE queries on MS-Access (panels a to h)
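
For instance, for a grid query over an assumed ITEMS table that keeps only items with a quantity above 100 and sorts them by name, the statement that MS-Access builds in its SQL view might look roughly like the following (the exact bracketing of names and formatting depend on the Access version):

SELECT ITEM-NO, ITEM-NAME
FROM ITEMS
WHERE QTY > 100
ORDER BY ITEM-NAME;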

Whenever a select query is run, MS-Access collects the retrieved data in a location called a dynaset. A dynaset is a dynamic view of the data from one or more tables, selected and specified by the query.

5.7.5 Advantages of QBE


The user does not have to specify a structured query explicitly.
The query is formulated in QBE by filling in templates of relations that are displayed on the
screen. The user does not have to remember the names of the attributes or relations, because
they are displayed as part of the templates.
The user does not have to follow any rigid syntax rules for query specification.
It is the most widely available direct-manipulation database language.
It is easy to learn and considered to be highly productive.
It is very simple and popular language for developing prototypes.
It is especially useful for end-user database programming.
Complete database application can be written in QBE.
QBE, unlike SQL, performs duplicate elimination automatically.
5.7.6 Disadvantage of QBE
There is no official standard for QBE as has been defined for SQL.

REVIEW QUESTIONS

1. What is relation? What are primary, candidate and foreign keys?


2. What are the Codd’s twelve rules? Describe in detail.
3. Describe the SELECT operation. What does it accomplish?
4. Describe the PROJECT operation. What does it accomplish?
5. Describe the JOIN operation. What does it accomplish?
6. Let us consider the following relations as shown in Fig. 5.22 below.

Fig. 5.22 Relations

With reference to the above relations display the result of the following commands:

a. Select tuples from the CUSTOMER relation in which CUST-NO=100512.


b. Select tuples from the CUSTOMER relation in which SALES-PERS-NO=2222.
c. Project the CITY over the CUSTOMER relation.
d. Select tuples from the CUSTOMER relation in which SALES-PERS-NO=1824.
Project the CUST-NO and CITY over that result.

7. With reference to Fig. 5.22, write relational statements to answer the following queries:

a. Find the salesperson records for Alka.


b. Find the customer records for all customers at Jamshedpur.
c. Display the list of all customers by CUST-NO with the city in which each is located.
d. Display a list of the customers by CUST-NO and city for which salesperson 3333 is
responsible.
e. Print a list of the customers by CUST-NO and CITY for which salesperson
Abhishek is responsible.
f. What are the names of the salespersons who have accounts in Delhi?

8. What is Information System Based Language (ISBL)? What are its limitations?
9. Explain the syntax of ISBL for executing a query. Show the comparison of the syntax of ISBL and
relational algebra.
10. How do we create an external relation using ISBL syntax?
11. What will be the output of the following ISBL syntax?

A = PARTS : (C-NAME= ‘Abhishek’)


A = A% (P-NAME)
B = SUPPLIERS % (S-NAME, P-NAME)
C = B% (S-NAME)
D=C*A
D = D − B
D = D% (S-NAME)
NEW = C − D
LIST NEW

12. What will be the output of the following ISBL syntax?

A = PARTS : (C-NAME = ‘Abhishek’)


A = A% (P-NAME)
B = SUPPLIERS % (S-NAME)
C = SUPPLIER * A
C = C % (S-NAME)
D=B−C
LIST D

13. What is query language? What are its advantages?


14. Explain the syntax of QUEL for executing query. Show the comparison of syntax of QUEL
and relational algebra.
15. Let us assume that following QUEL statements are given:

CUSTOMERS (CUST-NAME, CUST-ADDRESS, BALANCE)


ORDERS (ORDER-NO, CUST-NAME, ITEM, QTY)
SUPPLIERS (SUP-NAME, SUP-ADDRESS, ITEM, PRICE)

Execute the following query:

a. Print the supplier names, items, and prices of all suppliers that supply at least one
item ordered by M/s ABC Co.
b. Print the supplier names that supply every item ordered by M/s ABC Co.
c. Print the names of customers with negative balance.
16. How do we create an external relation using QUEL? Explain.
17. What is structured query language? What are its advantages and disadvantages?
18. Explain the syntax of SQL for executing query.
19. What is the basic data structure of SQL? What do you mean by SQL data type? Explain.
20. What are SQL operators? List them in a tabular form.
21. What are the uses of views? How are data retrieved using views?
22. What are the main components of SQL? List the commands/statements used under these
components.
23. What are logical operators in SQL? Explain with examples.
24. Write short notes on the following:

a. Data manipulation language (DML)


b. Data definition language (DDL)
c. Transaction control statement (TCS)
d. Data control language (DCL)
e. Data administration statements (DAS).

25. How do we create table, views and index using SQL commands?
26. What would be the output of following SQL statements?

a. DROP SCHEMA HEALTH-CENTRE


b. DROP TABLE PATIENT
c. DROP SCHEMA HEALTH-CENTRE CASCADE
d. DROP TABLE PATIENT CASCADE
e. ALTER TABLE PATIENT ADD COLUMN ADDRESS CHAR (30)
f. ALTER TABLE DOCTOR DROP COLUMN ROOM-NO
g. ALTER TABLE DOCTOR DROP COLUMN ROOM-NO RESTRICT
h. ALTER TABLE DOCTOR DROP COLUMN ROOM-NO CASCADE.

27. What is embedded SQL? Why do we use it? What are its advantages?
28. The following four relations (tables), as shown in Fig. 5.23, constitute the database of an
appliance repair company named M/s ABC Appliances Company. The company maintains
the following information:

a. Data on its technicians (employee number, name and title).


b. The types of appliances that it services along with the hourly billing rate to repair
each appliance, the specific appliances (by serial number) for which it has sold
repair contracts.
c. Technicians that are qualified to service specific types of appliances (including the
number of years that a technician has been qualified on a particular appliance type).

Formulate the SQL commands to answer the following requests for data from M/s ABC
Appliances Company database:

a. The major appliances of the company.


b. Owner of the freezers.
c. Serial numbers and ages of the toasters on service contracts.
d. Average age of washers on service contract.
e. Number of different types of job titles represented.
f. Name of the technician who is most qualified.
g. Average age of each owner’s major appliances.
h. Average billing rate for the major appliances that Avinash is qualified to fix.
i. Owner of the freezers over 6 years old.

Fig. 5.23 Database of M/s ABC appliances company

29. Using the database of M/s ABC Appliances Company of Fig. 5.23, translate the meaning of the following SQL commands and indicate their results with the data shown.

(a) SELECT *
    FROM TECHNICIAN
    WHERE JOB-TITLE = ‘Sr. Technician’

(b) SELECT APPL-NO, APPL-OWN, APPL-AGE
    FROM APPLIANCES
    WHERE APPL-TYPE = ‘Freezer’
    ORDER BY APPL-AGE

(c) SELECT APPL-TYPE, APPL-OWN
    FROM APPLIANCES
    WHERE APPL-AGE BETWEEN 4 AND 9

(d) SELECT COUNT(*)
    FROM TECHNICIAN

(e) SELECT AVG(RATE)
    FROM TYPES
    GROUP BY APPL-CAT

(f) SELECT APPL-NO, APPL-OWN
    FROM TYPES, APPLIANCES
    WHERE TYPES.APPL-TYPE = APPLIANCES.APPL-TYPE
    AND APPL-CAT = ‘Minor’

(g) SELECT APPL-NAME, APPL-OWN
    FROM TECHNICIAN, QUALIFICATION, APPLIANCES
    WHERE TECHNICIAN.TECH-ID = QUALIFICATION.TECH-NO
    AND QUALIFICATION.APPL-TYPE = APPLIANCES.APPL-TYPE
    AND TECH-NAME = ‘Rajesh Mathew’

30. What are the uses of SUM(), AVG(), COUNT(), MIN() and MAX()?
31. What is query-by-example (QBE)? What are its advantages?
32. List the QBE commands in relational database system. Explain the meaning of these
commands with examples.
33. Using the database of M/s ABC Appliances Company of Fig. 5.23, translate the meaning of
following QBE commands and indicate their results with the data shown.

34. Consider the following relational schema in which an employee can work in more than one
department.

EMPLOYEE (EMP-ID: int, EMP-NAME: str, SALARY : real)


WORKS (EMP-ID: int, DEPT-ID: int)
DEPARTMENT (DEPT-ID: int, DEPT-NAME: str,
MGR-ID: int, FLOOR-NO: int)

Write the following QBE queries:


a. Display the names of all employees who work on the 12th floor and earn less than
INR 5,000.
b. Print the names of all managers who manage 2 or more departments on the same
floor.
c. Give a 20% hike in salary to every employee who works in the Production department.
d. Print the names of departments in which employee named Abhishek work in.
e. Print the names of employees who make more than INR 12,000 and work in either
the Production department or the Maintenance department.
f. Display the name of each department that has a manager whose last name is Mathew
and who is neither the highest-paid nor the lowest-paid employee in the department.

35. Consider the following relational schema.

SUPPLIER (SUP-ID: int, SUP-NAME: str, CITY: str)


PARTS (PARTS-ID: int, PARTS-NAME: str, COLOR: str)
ORDERS (SUP-ID: int, PARTS-ID: int, QLTY: int)

Write the following QBE queries:

a. Display the names of the suppliers located in Delhi.


b. Print the names of the RED parts that have been ordered from suppliers located in
Mumbai, Kolkata or Jamshedpur.
c. Print the name and city of each supplier from whom the following parts have been
ordered in quantities of at least 450: a green shaft, red bumper and a yellow gear.
d. Print the names and cities of suppliers who have an order for more than 300 units of
red and blue parts.
e. Print the largest quantity per order for each SUP-ID such that the minimum quantity
per order for that supplier is greater than 250.
f. Display PARTS-ID of parts that have been ordered from a supplier named M/s KLY
System, but have also been ordered from some supplier with a different name in a
quantity that is greater than the M/s KLY System order by at least 150 units.
g. Print the names of parts supplied both by M/s Concept Shapers and M/s Megapoint,
in ascending order alphabetically.

STATE TRUE/FALSE

1. Dr. Edgar F. Codd proposed a set of rules that were intended to define the important
characteristics and capabilities of any relational system.
2. Codd’s Logical Data Independence rule states that user operations and application programs
should be independent of any changes in the logical structure of base tables provided they
involve no loss information.
3. The entire field of RDBMS has its origin in Dr. E.F. Codd’s paper.
4. ISBL has no aggregate operators for example, average, mean and so on.
5. ISBL has no facilities for insertion, deletion or modification of tuples.
6. QUEL is a tuple relational calculus language of a relational database system INGRESS
(Interactive Graphics and Retrieval System).
7. QUEL supports relational algebraic operations such as intersection, minus or union.
8. The first commercial RDBMS was IBM’s DB2.
9. The first commercial RDBMS was IDM’s INGRES.
10. SEQUEL and SQL are the same.
11. SQL is a relational query language.
12. SQL is essentially not a free-format language.
13. SQL statements can be invoked either interactively in a terminal session but cannot be
embedded in application programs.
14. In SQL data type of every data object is required to be declared by the programmer while
using programming languages.
15. HAVING clause is equivalent of WHERE clause and is used to specify the search criteria or
search condition when GROUP BY clause is specified.
16. HAVING clause is used to eliminate groups just as WHERE is used to eliminate rows.
17. If HAVING is specified, ORDER BY clause must also be specified.
18. ALTER TABLE command enables us to delete columns from a table.
19. The SQL data definition language provides commands for defining relation schemas,
deleting relations and modifying relation schemas.
20. In SQL, it is not possible to create local or global temporary tables within a transaction.
21. All tasks related to relational data management cannot be done using SQL alone.
22. DCL commands let users insert data into the database, modify and delete the data in the
database.
23. DML consists of commands that control the user access to the database objects.
24. If nothing is specified, the result set is stored in descending order, which is the default.
25. ‘*’ is used to get all the columns of a particular table.
26. The CREATE TABLE statement creates new base table.
27. A base table is not an autonomous named table.
28. DDL is used to create, alter and delete database objects.
29. SQL data administration statement (DAS) allows the user to perform audits and analysis on
operations within the database.
30. COMMIT statement ends the transaction successfully, making the database changes
permanent.
31. Data administration Commands allow the users to perform audits and analysis on operations
within the database.
32. Transaction control statements manage all the changes made by the DML statement.
33. DQL enables the users to query one or more table to get the information they want.
34. In embedded SQL, SQL statements are merged with the host programming language.
35. The DISTINCT keyword is illegal for MAX and MIN.
36. Application written in SQL can be easily ported across systems.
37. Query-By-Example (QBE) is a two-dimensional domain calculus language.
38. QBE was originally developed by M.M. Zloof at IBM’s T.J. Watson Research Centre.
39. QBE represents a visual approach for accessing information in a database through the use of
query templates.
40. The QBE make-table action query is an action query as it performs an action on existing
table or tables to create a new table.
41. QBE differs from SQL in that the user does not have to specify a structured query explicitly.
42. In QBE, user does not have to remember the names of the attributes or relations, because
they are displayed as part of the templates.
43. The delete action query of QBE deletes one or more than one records from a table or more
than one table.

TICK (✓) THE APPROPRIATE ANSWER

1. Codd’s Foundation rule states that:

a. a relational database management system must manage the database entirely


through its relational capabilities.
b. null values are systematically supported independent of data type.
c. all information is represented logically by values in tables.
d. None of these.

2. Codd’s View Update rule states that:

a. the system should be able to perform all theoretically possible updates on views.
b. the logical description of the database is represented and may be interrogated by
authorised users, in the same way as for normal data.
c. the ability to treat whole tables as single objects applies to insertion, modification
and deletion, as well as retrieval of data.
d. null values are systematically supported independent of data type.

3. Codd’s system catalogue rule states that:

a. the system should be able to perform all theoretically possible updates on views.
b. the logical description of the database is represented and may be interrogated by
authorised users, in the same way as for normal data.
c. the ability to treat whole tables as single objects applies to insertion, modification
and deletion, as well as retrieval of data.
d. null values are systematically supported independent of data type.

4. Who developed SEQUEL?

a. Dr. E.F. Codd


b. Chris Date
c. D. Chamberlain
d. None of these.

5. System R was based on

a. SEQUEL
b. SQL
c. QUEL
d. All of these.

6. Which of the following is not a data definition statement?

a. INDEX
b. CREATE
c. MODIFY
d. DELETE.

7. Which of the following is a data query statement in QUEL?

a. GET
b. RETRIEVE
c. SELECT
d. None of these.

8. Which of the following is supported in QUEL?

a. COUNT
b. Intersection
c. Union
d. Subquery.

9. Codd’s Non-subversion rule states that:

a. entity and referential integrity constraints should be defined in the high-level


relational language referred to in Rule 5, stored in the system catalogues and
enforced by the system, not by application programs.
b. the logical description of the database is represented and may be interrogated by
authorised users in the same way as for normal data.
c. the ability to treat whole tables as single objects applies to insertion, modification
and deletion, as well as retrieval of data.
d. if a low-level procedural language is supported, it must not be able to subvert
integrity or security constraints expressed in the high-level relational language.

10. QUEL is a tuple relational calculus language of a relational database system:

a. INGRES
b. DB2
c. ORACLE
d. None of these.

11. The first commercial RDBMS is:

a. INGRESS
b. DB2
c. ORACLE
d. None of these.

12. Which of the following statements is used to create a table?

a. CREATE TABLE
b. MAKE TABLE
c. CONSTRUCT TABLE
d. None of these.

13. Which of the following is the result of a SELECT statement?

a. TRIGGER
b. INDEX
c. TABLE
d. None of these.

14. The SQL data definition language (DDL) provides commands for:

a. defining relation schemas


b. deleting relations
c. modifying relation schemas
d. all of these.

15. Which of the following is a clause in SELECT statement?

a. GROUP BY and HAVING


b. ORDER BY
c. WHERE
d. All of these.

16. The first IBM’s RDBMS is:

a. DB2
b. SQL/DS
c. IMS
d. None of these.

17. Which of the following statements is used to modify a table?

a. MODIFY TABLE
b. UPDATE TABLE
c. ALTER TABLE
d. All of these.

18. DROP operation of SQL is used for:

a. deleting tables from schemas


b. changing the definition of table
c. Both of these
d. None of these.

19. ALTER operation of SQL is used for:

a. deleting tables from schema


b. changing the definition of table
c. Both of these
d. None of these.

20. Which of the following clause specifies the table or tables from where the data has to be
retrieved?

a. WHERE
b. TABLE
c. FROM
d. None of these.

21. SELECT operation of SQL is a:

a. data query language


b. data definition language
c. data manipulation language
d. data control language.

22. Which of the following is used to get all the columns of a table?

a. *
b. @
c. %
d. #

23. GRANT command of SQL is a:

a. data query language


b. data definition language
c. data manipulation language
d. data control language.

24. Which of the following is a comparison operator used in SELECT statement?

a. LIKE
b. BETWEEN
c. IN
d. None of these

25. How many tables can be joined to create a view?


a. 1
b. 2
c. Database dependent
d. None of these.

26. Which of the following clause is usually used together with aggregate functions?

a. ORDER BY ASC
b. GROUP BY
c. ORDER BY DESC
d. None of these.

27. REVOKE command of SQL is a:

a. data query language


b. data definition language
c. data manipulation language
d. data control language.

28. CREATE operation of SQL is a:

a. data query language


b. data definition language
c. data manipulation language
d. data control language.

29. COMMIT statement of SQL TCS:

a. ends the transaction successfully


b. aborts the transaction
c. Both of these
d. None of these.

30. ROLLBACK statement of SQL TCS

a. ends the transaction successfully


b. aborts the transaction
c. Both of these
d. None of these.

31. QBE was originally developed by

a. Dr. E.F. Codd


b. M.M. Zloof
c. T.J. Watson
d. None of these.
32. What will be the result of statement such as SELECT * FROM EMPLOYEE WHERE
SALARY IN (4000, 8000)?

a. all employees whose salary is either 4000 or 8000.


b. all employees whose salary is between 4000 and 8000.
c. all employees whose salary is not between 4000 and 8000.
d. None of these.

33. Which of the following is not a DDL statement?

a. ALTER
b. DROP
c. CREATE
d. SELECT.

34. Which of the following is not a DCL statement?

a. ROLLBACK
b. GRANT
c. REVOKE
d. None of these.

35. Which of the following is not a DML statement?

a. UPDATE
b. COMMIT
c. INSERT
d. DELETE.

FILL IN THE BLANKS

1. Information system based language (ISBL) is a pure relational algebra based query language,
which was developed in _____ in UK in the year _____.
2. ISBL was first used in an experimental interactive database management system called
_____.
3. In ISBL, to print the value of an expression, the command is preceded by _____.
4. _____ is a standard command set used to communicate with the RDBMS.
5. To query data from tables in a database, we use the _____ statement.
6. The expanded from of QUEL is _____.
7. QUEL is a tuple relational calculus language of a relational database system called _____.
8. QUEL is based on _____.
9. INGRES is the relational database management system developed at _____.
10. _____ is the data definition and data manipulation language for INGRES.
11. The data definition statements used in QUEL are (a)_____, (b)_____, (c)_____, (d)_____ and
(e)_____.
12. The basic data retrieval statement in QUEL is _____.
13. SEQUEL was the first prototype query language of _____.
14. SEQUEL was implemented in the IBM prototype called _____ in early-1970s.
15. SQL was first implemented on a relational database called _____.
16. DROP operation of SQL is used for _____ tables from the schema.
17. The SQL data definition language provides commands for (a)_____, (b)_____, and (c
)_____.
18. _____ is an example of data definition language command or statement.
19. _____ is an example of data manipulation language command or statement.
20. The _____ clause sorts or orders the results based on the data in one or more columns in the
ascending or descending order.
21. The _____ clause _____ specifies a summary query.
22. _____ is an example of data control language command or statement.
23. The _____ clause _____ specifies the table or tables from where the data has to be retrieved.
24. The _____ clause _____ directs SQL to include only certain rows of data in the result set.
25. _____ is an example of data administration system command or statement.
26. _____ is an example of transaction control statement.
27. SQL data administration statement (DAS) allows the user to perform (a) _____and (b) _____
on operations within the database.
28. The five aggregate functions provided by SQL are (a) _____, (b) _____, (c) _____, (d )
_____and (e) _____.
29. Portability of embedded SQL is _____.
30. Query-By-Example (QBE) is a two-dimensional _____ language.
31. QBE was originally developed by _____ at IBM’s T.J. Watson Research Centre.
32. The QBE _____ creates a new table from all or part of the data in one or more tables.
33. QBE’s _____ can be used to update or modify the values of one or more records in one or
more than one table in a database.
34. In QBE, the query is formulated by filling in _____ of relations that are displayed on the MS-Access screen.
Chapter 6

Entity Relationship (E-R) Model

6.1 INTRODUCTION

As explained in Chapter 2, Section 2.7.7, an entity-relationship (E-R) model


was introduced by P. P. Chen in 1976. The E-R model is an effective and standard method of communication amongst different designers, programmers and
end-users who tend to view data and its use in different ways. It is a non-
technical method, which is free from ambiguities and provides a standard
and a logical way of visualising the data. It gives precise understanding of
the nature of the data and how it is used by the enterprise. It provides useful
concepts that allow the database designers to move from an informal
description of what users want from their database, to a more detailed and
precise description that can be implemented in a database management
system. Thus, E-R modelling is an important technique for any database
designer to master. It has found wide acceptance in database design.
In this chapter, the basic concepts of the E-R model are introduced and a few examples of E-R diagrams of an enterprise database are illustrated.

6.2 BASIC E-R CONCEPTS

E-R modelling is a high-level conceptual data model developed to facilitate


database design. A conceptual data model is a set of concepts that describe
the structure of a database and the associated retrieval and update
transactions on the database. It is independent of any particular database
management system (DBMS) and hardware platform. E-R model is also
defined as a logical representation of data for an enterprise. It was developed
to facilitate database design by allowing specification of an enterprise
schema, which represents the overall logical structure of a database. It is a
top-down approach to database design. It is an approximate description of
the data, constructed through a very subjective evaluation of the information
collected during requirements analysis. It is sometimes regarded as a
complete approach to designing a logical database schema. E-R model is one
of a several semantic data models. It is very useful in mapping the meanings
and interactions of real-world enterprise onto a conceptual schema. Many
database design tools draw on concepts from the E-R model. E-R model
provides the following main semantic concepts to the designers:
Entities: which are distinct objects in a user enterprise.
Relationships: which are meaningful interactions among the objects.
Attributes: which describe the entities and relationships. Each such attribute is associated
with a value set (also called domain) and can take a value from this value set.
Constraints: on the entities, relationships and attributes.

6.2.1 Entities
An entity is an ‘object’ or a ‘thing’ in the real world with an independent
existence and that is distinguishable from other objects. Entities are the
principle data objects about which information is to be collected. An entity
may be an object with a physical existence such as a person, car, house,
employee or city. Or, it may be an object with a conceptual existence such as
a company, an enterprise, a job or an event of informational interest. Each
entity has attributes. Some of the examples of the entity are given below:

Person: STUDENT, PATIENT, EMPLOYEE, DOCTOR,


ENGINEER
Place: CITY, COUNTRY, STATE
Event: SEMINAR, SALE, RENEWAL, COMPETITION
Object: BUILDING, AUTOMOBILE, MACHINE, FURNITURE,
TOY
Concept: COURSE, ACCOUNT, TRAINING CENTRE, WORK
CENTRE
In E-R modelling, entities are considered as abstract but meaningful
‘things’ that exist in the user enterprise. Such things are modelled as entities
that may be described by attributes. They may also interact with one another
in any number of relationships. A semantic net can be used to describe a
model made up of a number of entities. An entity is represented by a set of
attributes. Each entity has a value for each of its attributes.
Fig. 6.1 shows an example of a semantic net of an enterprise made up of
four entities. The two entities E1 and E2 are PERSON entities, whereas P1 and P2 are PROJECT entities. In the semantic net, the symbol ‘•’ represents entities, whereas the symbol ‘◊’ represents relationships. The PERSON entity set has four attributes, namely PERSON-ID, PERSON-NAME, DESG and DOB, associated with it. Each attribute takes a value from its associated value set. For example, the value of the attribute PERSON-ID in the entity set PERSON (entity E2) is 122186. Similarly, the entity set PROJECT has three
attributes namely, PROJ-NO, START-DATE and END-DATE.

Entity Type (or Set) and Entity Instance

An entity set (also called entity type) is a set of entities of the same type that
share the same properties or attributes. In E-R modelling, similar entities are
grouped into an entity type. An entity type is a group of objects with the
same properties. These are identified by the enterprise as having an
independent existence. It can have objects with physical (or real) existence
or objects with a conceptual (or abstract) existence. Each entity type is
identified by a name and a list of properties. A database normally contains many different entity types. The word ‘entity’ in E-R modelling refers to an entity type and not to a single entity occurrence; in other words, it corresponds to a table and not to a row in the relational environment. The E-R model refers to a specific
table row as an entity instance or entity occurrence. An entity occurrence
(also called entity instance) is a uniquely identifiable object of an entity type.
For example, in a relation (table) PERSONS, each individual person is an entity occurrence, described by attributes such as person identification (PERSON-ID), person name (PERSON-NAME), designation (DESG) and date of birth (DOB). In Fig. 6.1, there are two entity sets, namely PROJECT and PERSON.
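
Since an entity type corresponds roughly to a table in the relational environment, the PERSON entity type of Fig. 6.1, as a hedged sketch, could eventually be realised by a table such as the following (the data types and lengths are assumed, not taken from the figure):

CREATE TABLE PERSON
(PERSON-ID    INTEGER    NOT NULL,
 PERSON-NAME  CHAR (30),
 DESG         CHAR (20),
 DOB          DATE,
 PRIMARY KEY (PERSON-ID));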

Classification of Entity Types

Entity types can be classified as strong or weak. An entity type that is not existence-dependent on some other entity type is called a strong entity type. A strong entity type has the characteristic that each entity occurrence is uniquely identifiable using the primary key attribute(s) of that entity type. Strong entity types are sometimes referred to as parent, owner or dominant entities. An entity type that is existence-dependent on some other entity type is called a weak entity type. A weak entity type has the characteristic that each entity occurrence cannot be uniquely identified using only the attributes associated with that entity type. Weak entity types are sometimes referred to as child, dependent or subordinate entities.
With reference to the semantic net of Fig. 6.1, Fig. 6.2 illustrates the distinction between an entity type and two of its instances.
Fig. 6.1 Semantic net of an enterprise

Fig. 6.2 Entity type with instances


6.2.2 Relationship
A relationship is an association among two or more entities that is of interest to the enterprise. It represents a real-world association. A relationship, as such, has no physical or conceptual existence other than that which depends upon its entity associations. A particular occurrence of a relationship is called a relationship instance or relationship occurrence. A relationship occurrence is a
uniquely identifiable association, which includes one occurrence from each
participating entity type. It indicates the particular entity occurrences that are
related. Relationships are also treated as abstract objects. As shown in Fig.
6.1, the semantic net of the enterprise has three relationships namely R1, R2
and R3, each describing an interaction between an entity PERSON and an
entity PROJECT. The relationship is joined by lines to the entities that
participate in the relationship. The lines have names that differ from the
names of other lines emanating from the relationship. Thus, the relationship
R1 is an interaction between PROJECT P1 and PERSON E1. It has two attributes, STATUS and HRS-SPENT, and two links, PERSON and PROJECT, to its interacting entities. Attribute values of a relationship describe the effects or method of interaction between entities. Thus, HRS-SPENT describes the time that a PERSON spent on a PROJECT and STATUS describes the status of a person on a project. A relationship is only
labelled in one direction, which normally means that the name of the
relationship only makes sense in one direction.
In E-R modelling, similar relationships are grouped into relationship sets
(also called relationship type). Thus, a relationship type is a set of
meaningful associations between one or more participating entity types.
Each relationship type is given a name that describes its function.
Relationships with the same attributes fall into one relationship set. In Fig.
6.1, there is one relationship type (or set) WORK-ON consisting of
relationships namely R1, R2, and R3.
Relationships are described in the following types:
Degree of a relationship.
Connectivity of a relationship.
Existence of a relationship.
n-ary Relationship.

6.2.2.1 Degree of a Relationship


The degree of a relationship is the number of entities associated or
participants in the relationship. Following are the three degrees of
relationships:
Recursive or unary relationship.
Binary relationship.
Ternary relationship.

Recursive Relationship: A recursive relationship is a relationship between


the instances of a single entity type. It is a relationship type in which the
same entity type is associated more than once in different roles. Thus, the
entity relates only to another instance of its own type. For example, a
recursive binary relationship ‘manages’ relates an entity PERSON to another
PERSON by management as shown in Fig. 6.3. Recursive relationships are
sometimes called unary relationships. Each entity type that participates in a
relationship type plays a particular role in the relationship. Relationships
may be given role names to signify the purpose that each participating entity
type plays in a relationship. Role names can be important for recursive
relationships to determine the function of each participant. The use of role
names to describe the recursive relationship ‘manages’ is shown in Fig. 6.3.
The first participation of the PERSON entity type in the ‘manages’
relationship is given the role name ‘manager’ and the second participation is
given the role name ‘managed’. Role names may also be used when two
entities are associated through more than one relationship. Role names are
usually not necessary in relationship types where the function of the
participating entities in a relationship is distinct and unambiguous.
Binary Relationship: The association between the two entities is called
binary relationship. As shown in Fig. 6.3, two entities are associated in
different ways, for example, DEPT and DIVN, PERSON and PROJECT,
DEPT and PERSON and so on. A binary relationship is the most common type of relationship and its degree is two (2).
Ternary Relationship: A ternary relationship is an association among
three entities and its degree of relationship is three (3). The construct of
ternary relationship is a single diamond connected to three entities. For
example, as shown in Fig. 6.3, three entities SKILL, PERSON and
PROJECT are connected with single diamond ‘uses’. Here, the connectivity
of each entity is designated as either ‘one’ or ‘many’. An entity in a ternary relationship is considered to be ‘one’ if only one instance of it can be
associated with one instance of each of the other two associated entities. In
either case, one instance of each of the other entities is assumed to be given.
Ternary relationship is required when binary relationships are not sufficient
to accurately describe the semantic of the associations among three entities.

Fig. 6.3 Degree of a relationship


6.2.2.2 Connectivity of a Relationship
The connectivity of a relationship describes a constraint on the mapping of
the associated entity occurrences in the relationship. Values for occurrences
are either ‘one’ or ‘many’. As shown in Fig. 6.4, a connectivity of ‘one’ for
department and ‘many’ for person in a relationship between entities DEPT
and PERSON, means that there is at most one entity occurrence of DEPT
associated with many occurrences of PERSON. The actual count of elements
associated with the connectivity is called cardinality of the relationship
connectivity. Cardinality is used much less frequently than the connectivity
constraint because the actual values are usually variable across relationship
instances.

Fig. 6.4 Connectivity of a relationship

As shown in Fig. 6.4, there are three basic constructs of connectivity for
a binary relationship, namely one-to-one (1:1), one-to-many (1:N) and many-to-many (M:N). In the case of a one-to-one connection, exactly one PERSON
manages the entity DEPT and each person manages exactly one DEPT.
Therefore, the maximum and minimum connectivities are exactly one for
both the entities. In case of one-to-many (1:N), the entity DEPT is associated
to many PERSON, whereas each person works within exactly one DEPT.
The maximum and minimum connectivities to the PERSON side are of
unknown value N, and one respectively. Both maximum and minimum
connectivities on DEPT side are one only. In case of many-to-many (M:N)
connectivity, the entity PERSON may work on many PROJECTS and each
project may be handled by many persons. Therefore, maximum connectivity
for PERSON and PROJECT are M and N respectively, and minimum
connectivities are each defined as one. If the values of M and N are 10 and 5
respectively, it means that the entity PERSON may be a member of a maximum of 5 PROJECTs, whereas the entity PROJECT may contain a maximum of 10 PERSONs.
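
As a hedged sketch of how these connectivities are eventually carried into a relational schema (the column names are assumed), the 1:N ‘works-in’ case is usually handled by a single DEPT-NO foreign-key column placed in the PERSON table, whereas the M:N ‘works-on’ case needs a table of its own with one row per person/project pair:

CREATE TABLE WORKS-ON
(PERSON-ID  INTEGER NOT NULL,
 PROJ-NO    INTEGER NOT NULL,
 PRIMARY KEY (PERSON-ID, PROJ-NO),
 FOREIGN KEY (PERSON-ID) REFERENCES PERSON,
 FOREIGN KEY (PROJ-NO) REFERENCES PROJECT);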

Fig. 6.5 An n-ary relationship

6.2.2.3 N-ary Relationship


In case of n-ary relationship, a single relationship diamond with n
connections, one to each entity, represents some association among n
entities. Fig. 6.5 shows an n-ary relationship. An n-ary relationship has n+1
possible variations of connectivity: all n sides have connectivity ‘one’; n-1 sides have connectivity ‘one’ and one side has connectivity ‘many’; n-2 sides have connectivity ‘one’ and two sides have ‘many’; and so on, until all
sides are ‘many’.

6.2.2.4 Existence of a Relationship


In case of existence relationship, the existence of entities of an enterprise
depends on the existence of another entity. Fig. 6.6 illustrates examples of
existence of a relationship. Existence of an entity in a relationship is defined
as either mandatory or optional. In a mandatory existence, an occurrence of
either the ‘one’ or ‘many’ side entity must always exist for the entity to be
included in the relationship. In case of optional existence, the occurrence of
that entity need not exist. For example, as shown in Fig. 6.6, the entity
PERSON may or may not be the manager of any DEPT, thus making the
entity DEPT optional in the ‘is-managed-by’ relationship between PERSON
and DEPT.
As explained in Fig. 2.18 of Chapter 2, the optional existence is defined
by a zero (0) and a line perpendicular to the connection line between an entity and a relationship. In the case of mandatory existence, there is only a line perpendicular to the connection. If neither a zero nor a perpendicular line is
shown on the connection line between the relationship and entity, then it is
called the unknown type of existence. In such a case, it is neither optional
nor mandatory and the minimum connectivity is assumed to be one.

6.2.3 Attributes
An attribute is a property of an entity or a relationship type. An entity is
described using a set of attributes. All entities in a given entity type have the
same or similar attributes. For example, an EMPLOYEE entity type could
use name (NAME), social security number (SSN), date of birth (DOB) and
so on as attributes. A domain of possible values identifies each attribute
associated with an entity type. Each attribute is associated with a set of
values called a domain. The domain defines the potential values that an
attribute may hold and is similar to the domain concept in relational model
explained in Chapter 4, Section 4.3.1. For example, if the age of an
employee in an enterprise is between 18 and 60 years, we can define a set of
values for the age attribute of the ‘employee’ entity as the set of integers
between 18 and 60. Domain can be composed of more than one domain. For
example, domain for the date of birth attribute is made up of sub-domains
namely, day, month and year. Attributes may share a domain and is called
the attribute domain. The attribute domain is the set of allowable values for
one or more attributes. For example, the date of birth attributes for both
‘worker’ and ‘supervisor’ entities in an organisation can share the same
domain.
Fig. 6.6 Existence of a relationship

The attributes hold values that describe each entity occurrence and
represent the main part of the data stored in the database. For example, an
attribute NAME of EMPLOYEE entity might be the set of 30 characters
strings, SSN might be of 10 integers and so on. Attributes can be assigned to
relationships as well as to entities. An attribute of a many-to-many
relationship such as the ‘works-on’ relationship of Fig. 6.4 between the
entities PERSON and PROJECT could be ‘task-assignment’ or ‘start-date’.
In this case, a given task assignment or start date is common only to an
instance of the assignment of a particular PERSON to a particular
PROJECT, and it would be multivalued when characterising either the
PERSON or the PROJECT entity alone. Attributes of relationships are
assigned only to binary many-to-many relationships and to ternary
relationships and normally not to one-to-one or one-to-many relationships.
This is because at least one side of the relationship is a single entity and
there is no ambiguity in assigning the attribute to a particular entity instead
of assigning it to relationship.
Attributes can be classified into the following categories:
Simple attribute.
Composite attribute.
Single-valued attribute.
Multi-valued attribute.
Derived attribute.
6.2.3.1 Simple Attributes
A simple attribute is an attribute composed of a single component with an
independent existence. A simple attribute cannot be subdivided or broken
down into smaller components. Simple attributes are sometimes called
atomic attributes. EMP-ID, EMP-NAME, SALARY and EMP-DOB of the
EMPLOYEE entity are the example of simple attributes.

6.2.3.2 Composite Attributes


A composite attribute is an attribute composed of multiple components, each
with an independent existence. Some attributes can be further broken down
or divided into smaller components with an independent existence of their
own. For example, let us assume that EMP-NAME attribute of EMPLOYEE
entity holds data as ‘Abhishek Singh’. Now this attribute can be further
divided into FIRST-NAME and LAST-NAME attributes such that they hold
data namely ‘Abhishek’ and ‘Singh’ respectively. Fig. 6.7 (a) illustrates
examples of composite attributes. The decision of modelling (or
subdividing) an attribute into simple or composite attributes depend on the
user view of the data. That means, it is dependent on whether the user view
of the data refers to the employee name attribute as a single unit or as
individual components.
Fig. 6.7 Composite attributes

Composite attributes can form a hierarchy. For example, STREET-


ADDRESS can be subdivided into three simple attributes, namely STREET-NO, STREET-NAME and APPARTMENT-NO, as shown in Fig. 6.7 (b). The value of the composite attribute is the concatenation of the values of its constituent simple attributes.

6.2.3.3 Single-valued Attributes


A single-valued attribute is an attribute that holds a single value for each
occurrence of an entity type. For example, each occurrence of the
EMPLOYEE entity has a single value for the employee identification
number (EMP-ID) attribute, for example ‘106519’, and therefore the EMP-
ID attribute is referred to as being single-valued. The majority of attributes
are single-valued.

6.2.3.4 Multi-valued Attributes


A multi-valued attribute is an attribute that holds multiple values for each
occurrence of an entity type. That means, multi-valued attributes can take
more than one value. Fig. 6.8 illustrates semantic representation of attributes
taking more than one value. Fig. 6.8 (a) shows an example of multi-attribute
in which each person is modelled as possessing a number of SKILL
attributes. Thus, a person whose PER-ID is 106519 has three SKILL
attributes, whose values are PROGRAMMING, DESIGNING and SIX-
SIGMA. Fig. 6.8 (b) shows an example of multi-value in which each person
has only one SKILL attribute, but it can take more than one value. Thus, the
person whose PER-ID is 106519 now has only one SKILL attribute whose
values are PROGRAMMING, DESIGNING and SIX-SIGMA.
Fig. 6.8 Semantic representation of multi-valued attribute

(a) Multi-attribute

(b) Multi-value

An E-R diagram of a multi-valued attribute is shown in Fig. 6.9. A multi-


valued attribute may have a set of numbers with upper and lower limits. For
example, let us assume that the SKILL attribute has between one and three
values. In other words, a skill-set may have a single skill to a maximum of
three skills.
Fig. 6.9 E-R diagram of Multi-valued attribute

6.2.3.5 Derived Attributes


A derived attribute is an attribute that represents a value that is derivable
from the value of a related attribute or set of attributes, not necessarily in the
same entity set. Therefore, the values held by some attributes are derived
from two or more attribute values. For example, the value for the project
duration (PROJ-DURN) attribute of the entity PROJECT can be calculated
from the project start date (START-DATE) and project end date (END-
DATE) attributes. The PROJ-DURN attribute is referred to as derived
attribute. In the E-R diagram of Fig. 6.9, the attribute YRS-OF-
EXPERIENCE is a derived attribute, which has been derived from the
attribute DATE-EMPLOYED. In some cases, the value of an attribute is
derived from the entity occurrences in the same entity type. For example,
total number of persons (TOT-PERS) attribute of the PERSON entity can be
calculated by counting the total number of PERSON occurrences.

6.2.3.6 Identifier Attributes


Each entity is required to be identified uniquely in a particular entity set.
This identification is done using one or more entity attributes as an entity
identifier. These attributes are known as identifier attributes. Therefore, an
identifier is an attribute or combination of attributes that uniquely identifies
individual instances of an entity type. For example, PER-ID can be an
identifier attribute in the case of PERSON, or PROJ-ID in the case of PROJECTS.
The identifier attribute is underlined in the E-R diagram, as shown in Fig.
6.9. In some entity types, two or more attributes are used as the identifier
because no single attribute serves the purpose. Such a combination of
attributes used to identify an entity type is known as a composite identifier.
Fig. 6.10 illustrates an example of a composite identifier in which the entity
TRAIN has a composite identifier TRAIN-ID. The composite identifier
TRAIN-ID in turn has two component attributes, TRAIN-NO and TRAIN-
DATE. This combination is required to uniquely identify individual
occurrences of trains travelling to a particular destination.

Fig. 6.10 Composite identifier

Similarly, identifiers of relationship sets are attributes that uniquely


identify relationships in a particular relationship set. Relationships are
usually identified by more than one attribute. Most often the identifier
attributes of a relationship are those data values that are also identifiers of
entities that participate in the relationship. For example in Fig. 6.1, the
combination of a value of PROJ-ID with a value of PERSON-ID uniquely
identifies each relationship in the relationship set WORK-ON. PROJ-ID and
PERSON-ID are the identifiers of the entities that interact in the relationship.
If the same domain of values is used to identify entities in their own right as
well as in relationships, it is usually convenient to use the same name for
both the entity identifier attribute and the identifier attribute of that entity in the
relationships. In Fig. 6.1, PERSON-ID is used as an identifier in PERSON
entities and also to identify persons in the relationship set WORK-ON.

6.2.4 Constraints
Relationship types usually have certain constraints that limit the possible
combinations of entities that may participate in the corresponding
relationship set. The constraints should reflect the restrictions on the
relationships as perceived in the ‘real world’. For example, there could be a
requirement that each department in the entity DEPT must have a person and
each person in the PERSON entity must have a skill. The main types of
constraints on relationships are multiplicity, cardinality, participation and so
on.

6.2.4.1 Multiplicity Constraints


Multiplicity is the number (or range) of possible occurrences of an entity
type that may relate to a single occurrence of an associated entity type
through a particular relationship. It constrains the way that entities are
related. It is a representation of the policies and business rules established by
the enterprise or the user. It is important that all appropriate enterprise
constraints are identified and represented while modelling an enterprise.

6.2.4.2 Cardinality Constraints


A cardinality constraint specifies the number of instances of one entity that
can (or must) be associated with each instance of another entity. There are two types
of cardinality constraints namely minimum and maximum cardinality
constraints. The minimum cardinality constraint of a relationship is the
minimum number of instances of an entity that may be associated with each
instance of another entity. The maximum cardinality constraint of a
relationship is the maximum number of instances of one entity that may be
associated with a single occurrence of another entity.

6.2.4.3 Participation Constraints


The participation constraint specifies whether the existence of an entity
depends on its being related to another entity via the relationship type. There
are two types of participation constraints namely total and partial
participation constraints. Total participation constraints means that every
entity in ‘the total set’ of an entity must be related to another entity via a
relationship. Total participation is also called existence dependency. Partial
participation constraints means that some or the ‘part of the set of’ an entity
are related to another entity via a relationship, but not necessarily all. The
cardinality ratio and participation constraints are together known as the
structural constraints of a relationship type.

6.2.4.4 Exclusion and Uniqueness Constraints


E-R modelling also has constraints such as the exclusion constraint and the uniqueness
constraint, which can result in a poor semantic base and force entity-
attribute decisions early in the conceptual modelling process. In the exclusion
constraint, the normal or default treatment of multiple relationships is
inclusive OR, which allows any or all of the entities to participate. In some
situations, however, multiple relationships may be affected by the exclusive
(disjoint or exclusive OR) constraint, which allows at most one entity
instance among several entity types to participate in the relationship with a
single root entity.
Fig. 6.11 Example of exclusion constraint

Fig. 6.11 illustrates an example of exclusion constraint in which the root


entity work-task has two associated entities, external-project and internal-
project. A work-task can be assigned to either an external-project or an
internal-project, but not to both. That means, at most one of the associated
entity instances could apply to an instance of work-task. The uniqueness
constraints combine three or more entities such that the combination of roles
for the two entities in one direction uniquely determines the value of the
single entity in the other direction. This, in effect, defines the functional
dependencies (FD) from the composite keys of the entities in the first
direction to the key of the entity in the second direction, and thus partly
defines a ternary relationship. Functional dependencies (FD) have been
discussed in detail in Chapter 9.

6.3 CONVERSION OF E-R MODEL INTO RELATIONS

An E-R model can be converted to relations, in which each entity set and
each relationship set is converted to a relation. Fig. 6.12 illustrates a
conversion of E-R diagram into a set of relations.
Fig. 6.12 Conversion of E-R model to relations

A separate relation represents each entity set and each relationship set.
The attributes of the entities in the entity set become the attributes of the
relation, which represents that entity set. The entity identifier becomes the
key of the relation and each entity is represented by a tuple in the relation.
Similarly, the attributes of the relationships in each relationship set become
the attributes of the relation, which represents the relationship set. The
relationship identifiers become the key of the relation and each relationship
is represented by a tuple in that relation.
The E-R model of Fig. 6.1 and Fig. 6.12 (a) is converted to the following
three relations as shown in Fig. 6.12 (b):

PERSONS (PER-ID, DESIGN, LAST-NAME, DOB)


from entity set PERSONS
PROJECTS (PROJ-ID, START-DATE, END-DATE)
from entity set PROJECTS
WORKS-ON (PROJ-ID, PER-ID, HRS-SPENT, STATUS)
from relationship set WORKS-ON

6.3.1 Conversion of E-R Model into SQL Constructs


The E-R model is transformed into SQL constructs using transformation
rules. The following three types of tables are produced during the
transformation of E-R model into SQL constructs:
a. An entity table with the same information content as the original entity: This transformation
rule always occurs with the following relationships:

Entities with recursive relationships that are many-to-many (M:N).


Entities with binary relationships that are many-to-many (M:N), one-to-many (1:N)
on the ‘1’ (parent) side, and one-to-one (1:1) on one side.
Ternary or higher-degree relationships.

b. An entity table with the embedded foreign key of the parent entity: This is one of the most
common ways CASE tools handle relationships. It prompts the user to define a foreign key in
the ‘child’ table that matches a primary key in the ‘parent’ table. This transformation rule
always occurs with the following relationships:

Each entity recursive relationship that is one-to-one (1:1) or one-to-many (1:N).


Binary relationships that are one-to-many (1:N) for the entity on the ‘N’ (child)
side, and one-to-one (1:1) relationships for one of the entities.

c. A relationship table with the foreign keys of all the entities in the relationship: This is the
other most common way CASE tools handle relationships in the E-R model. In this case, a
many-to-many (M:N) relationship can only be defined in terms of a table that contains
foreign keys that match the primary keys of the two associated entities. This new table may
also contain attributes of the original relationship. This transformation rule always occurs
with the following relationships:

Recursive and many-to-many (M:N).


Binary and many-to-many (M:N).
Ternary or higher-degree.

In the above transformations, the following rules apply to handle SQL null
values:
Nulls are allowed in an entity table for foreign keys of associated (referenced) optional
entities.
Nulls are not allowed in an entity table for foreign keys of associated (referenced) mandatory
entities.
Nulls are not allowed for any key in a relationship table because only complete row entries
are meaningful in such a table.

The following sub-sections show the standard SQL statements needed to
define each type of E-R model construct.

6.3.1.1 Conversion of Recursive Relationships into SQL Constructs


As illustrated in Fig. 5.2 of Chapter 5, an E-R model can be converted into SQL
constructs. Fig. 6.13 illustrates the conversion of recursive relationships into
SQL constructs.
Fig. 6.13 Recursive relationship conversion into SQL constructs

(a) One-to-one (1:1) relationship with both sides optional


(b) One-to-many (1:N) relationship with ‘1’ side mandatory and ‘N’ side optional

(c) Many-to-many (M:N) relationship with both sides optional


6.3.1.2 Conversion of Binary Relationships into SQL Constructs
Fig. 6.14 illustrates the conversion of binary relationships into SQL
constructs.
Fig. 6.14 Binary relationship conversion into SQL constructs

(a) One-to-one (1:1) relationship with both entities mandatory


(b) One-to-one (1:1) relationship with one entity optional and one mandatory

(c) One-to-one (1:1) relationship with both entities optional

6.3.1.3 Conversion of Ternary Relationships into SQL Constructs


Fig 6.15 through 6.18 illustrate the conversion of ternary relationships into
SQL constructs.

Fig. 6.15 One-to-one-to-one (1:1:1) ternary relationship


Fig. 6.16 One-to-one-to-many (1:1:N) ternary relationship
Fig. 6.17 One-to-many-to-many (1:M:N) ternary relationship
Fig. 6.18 Many-to-many-to-many (M:N:P) ternary relationship
6.4 PROBLEMS WITH E-R MODELS

Some problems, called connection traps, may arise when creating an E-R
model. The connection traps normally occur due to a misinterpretation of the
meaning of certain relationships. There are mainly two types of connection
traps:
Fan traps.
Chasm traps.

6.4.1 Fan Traps


In a fan trap, a model represents a relationship between entity types, but the
pathway between certain entity occurrences is ambiguous. A fan trap may
exist where two or more one-to-many (1:N) relationships fan out from the
same entity. Fig. 6.19 (a) shows an example of fan trap problem.
As shown in Fig. 6.19 (a), the E-R model represents the fact that a single
bank has one or more counters and has one or more persons. Here, there are
two one-to-many (1:N) relationships namely ‘has’ and ‘operates’, emanating
from the same entity called BANK. A problem arises when we want to know
which persons work at a particular counter. As can be seen from
the semantic net of Fig. 6.19 (b), it is difficult to give a specific answer to the
question: “At which counter does person number ‘106519’ work?”. We can
only say that the person with identification number ‘106519’ works at
‘Cash’ or ‘Teller’ counter. The inability to answer this question specifically
is the result of a fan trap associated with the misrepresentation of the correct
relationships between the PERSON, BANK and COUNTER entities. This
fan trap can be resolved by restructuring the original E-R model of Fig. 6.19
(a) to represent the correct association between these entities, as shown
in Fig. 6.19 (c). Similarly, a semantic net can be reconstructed for this
modified E-R model, as shown in Fig. 6.19 (d). Now we can find the correct
answer to our earlier question: person number ‘106519’ works at the ‘Cash’
counter, which is part of bank B1.
Fig. 6.19 Example of fan trap and its removal

(a) An example of fan trap

(b) Semantic net of E-R model


(c) Restructured E-R model to eliminate fan trap

(d) Modified semantic net

6.4.2 Chasm Traps


In a chasm trap, a model suggests the existence of a relationship between
entity types, but the pathway does not exist between certain entity
occurrences. A chasm trap may occur where there are one or more
relationships with a minimum multiplicity of zero forming part of the
pathway between related entities. Fig. 6.20 (a) shows a chasm trap problem,
which illustrates the fact that a single counter has one or more persons who
oversee zero or more loan enquiries. It is also to be noted that not all persons
oversee loan enquiries and not all loan enquiries are overseen by a
person. A problem arises when we want to know which loan enquiries are
available at each counter.
As can be seen from the semantic net of Fig. 6.20 (b), it is difficult to give a
specific answer to the question: “At which counter is the ‘car loan’ enquiry
available?”. Since this ‘car loan’ is not yet allocated to any
person working at a counter, we are unable to answer this question. The
inability to answer this question is considered to be a loss of information,
and is the result of a chasm trap. The multiplicity of both the PERSON and
LOAN entities in the ‘oversees’ relationship has a minimum value of zero,
which means that some loans cannot be associated with a counter through a
person. Therefore, we need to identify the missing link to solve
this problem. In this case, the missing link is the ‘offers’ relationship
between the entities COUNTER and LOAN. This chasm trap can be
resolved by restructuring the original E-R model of Fig. 6.20 (a) to represent
the correct association between these entities, as shown in Fig. 6.20 (c).
Similarly, a semantic net can be reconstructed for this modified E-R model,
as shown in Fig. 6.20 (d). Now, by examining occurrences of the ‘has’,
‘oversees’, and ‘offers’ relationship types, we can find the correct answer to
our earlier question: a ‘car loan’ enquiry is available at the ‘Teller’ counter.
Fig. 6.20 Example of Chasm Trap and its removal

(a) An example of chasm trap

(b) Semantic net of E-R model

(c) Restructured E-R model to eliminate chasm trap


(d) Modified semantic net
Fig. 6.21 Building blocks (symbols) of E-R diagram

6.5 E-R DIAGRAM SYMBOLS

An E-R model is normally expressed as an entity-relationship diagram


(called an E-R diagram). An E-R diagram is a graphical representation of an E-R
model. As depicted in the previous diagrams, the set of symbols (building
blocks) used to represent an E-R diagram, already shown in Chapter 2, Section 2.7.7,
in Fig. 2.18, is further summarised in Fig. 6.21.
Fig. 6.22 Complete E-R diagram of banking organisation database

6.5.1 Examples of E-R Diagrams


Examples of E-R diagram schema for full representation of a conceptual
model for the databases of a Banking Organisation, Project Handling
Company and Railway Reservation System are depicted in Fig. 6.22, 6.23
and 6.24 respectively.
Fig. 6.23 E-R diagram of project handling company database

In Fig. 6.23 above, the database schema of a Project Handling Company is
displayed using the (min/max) relationship notation and primary key
identification. In E-R diagram notation, one uses either the min/max
notation or the cardinality ratios with single-line or double-line notation.
However, the min/max notation is more precise because structural
constraints of any degree can be depicted.
As can be seen from the above Figs. 6.22, 6.23 and 6.24, the E-R diagrams
include the entity sets, attributes, relationship sets and mapping cardinalities.
Fig. 6.24 E-R diagram of Railway Reservation System database
REVIEW QUESTIONS
1. What is E-R modelling? How is it different from SQL?
2. Discuss the basic concepts of E-R model.
3. What are the semantic concepts of E-R model?
4. What do you understand by an entity? What is an entity set? What are the different types of
entity type? Explain the concept using semantic net.
5. When is the concept of weak entity used in the data modelling?
6. What do you understand by a relationship? What is the difference between relationship and
relationship occurrence?
7. What is a relationship type? Explain the differences between a relationship type, a
relationship instance and a relationship set.
8. Explain with diagrammatical illustrations about the different types of relationships.
9. What do you understand by degree of relationship? Discuss with diagrammatic
representation about various types of degree of relationships.
10. What do you understand by existence of a relationship? Discuss with diagrammatic
representation about various types of existence of relationships.
11. What do we mean by recursive relationship type? Give some examples of recursive
relationship types.
12. An engineering college database contains information about faculty (identified by faculty
identification number, FACULTY-ID) and courses they are teaching. Each of the following
situations concerns the ‘Teaches’ relationship set. For each situation, draw an E-R diagram
that describes it.

a. Faculty can teach the same course in several semesters and each offering must be
recorded.
b. Faculty can teach the same course in several semesters and only the most recent
such offering needs to be recorded.
c. Every faculty must teach some course and only the most recent such offering needs
to be recorded.
d. Every faculty teaches exactly one course and every course must be taught by some
faculty.

13. Discuss the E-R symbols used for E-R diagram. Discuss the conventions for displaying an E-
R model database schema as an E-R diagram.
14. E-R diagram of Fig. 6.25 shows a simplified schema for an Airline Reservations System.
From the E-R diagram, extract the requirements and constraints that produced this schema.
15. A university needs a database to hold current information on its students. An initial analysis
of these requirements produced the following facts:

a. Each of the faculties in the university is identified by a unique name and a faculty
head is responsible for each faculty.
b. There are several major courses in the university. Some major courses are managed
by one faculty member, whereas others are managed jointly by two or more faculty
members.
c. Teaching is organised into courses and varying numbers of tutorials are organised
for each course.
d. Each major course has a number of required courses.
e. Each course is supervised by one faculty member.
f. Each major course has a unique name.
g. A student has to pass the prerequisite courses to take certain courses.
h. Each course is at a given level and has a credit-point value.

Fig. 6.25 E-R diagram of airline reservation system database

i. Each course has one lecturer in charge of the course. The university keeps a record
of the lecturer’s name and address.
j. Each course can have a number of tutors.
k. Any number of students can be enrolled in each of the major courses.
l. Each student can be enrolled in only one major course and the university keeps a
record of that student’s name and address and an emergency contact number.
m. Any number of students can be enrolled in a course and each student in a course can
be enrolled in only one tutorial for that course.
n. Each tutorial has one tutor assigned to it.
o. A tutor can tutor in more than one tutorial for one or more courses.
p. Each tutorial is given in an assigned class room at a given time on a given day.
q. Each tutor not only supervises tutorials but also is in charge of some course.
Identify the entities and relationships for this university and construct an E-R
diagram.

16. Some new information has been added in the database of Exercise 6.15, which are as
follows:

a. Some tutors work part time and some are full-time staff members. Some tutors (may
be from both full-time and part-time) are not in charge of any units.
b. Some students are enrolled in major courses, whereas others are enrolled in a single
course only. Change your E-R diagrams considering the additional information.

17. What do you understand by connectivity of a relationship? Discuss with diagrammatic


representation about various types of connectivity of relationships.
18. An enterprise database needs to store information as follows:

EMPLOYEE (EMP-ID, SALARY, PHONE)


DEPARTMENTS (DEPT-ID, DEPT-NAME, BUDGET)
EMPLOYEE-CHILDREN (NAME, AGE)

Employees ‘work’ in departments. Each department is ‘managed by’ an employee. A child


must be identified uniquely by ‘name’ when the parent (who is an employee) is known. Once
the parent leaves the enterprise, the information about the child is not required.
Draw an E-R diagram that captures the above information.
19. What is an entity type? What is an entity set? Explain the difference among an entity, entity
type and entity set.
20. What do you mean by attributes? What are the different types of attributes? Explain.
21. Explain how E-R model can be converted into relations?
22. What are the problems with E-R models?
23. Briefly explain the following terms:

a. Attribute
b. Domain
c. Relationship
d. Entity
e. Entity set
f. Relationship set
g. 1:1 relationship
h. 1:N relationship
i. M:N relationship
j. Strong entity
k. Weak entity
l. Constraint
m. Role name
n. Identifier
o. Degree of relationship
p. Composite attribute
q. Multi-valued attribute
r. Derived attribute.

24. Compare the following terms:

a. Derived attribute and stored attribute


b. Entity type and relationship type
c. Strong entity and weak entity
d. Degree and cardinality
e. Simple attribute and composite attribute
f. Entity type and entity instance.

25. Define the concept of aggregation. Give a few examples of where this concept is used.
26. We can convert any weak entity set into a strong entity set by adding appropriate attributes.
Why, then, do we have a weak entity set?
27. A person identified by a PER-ID and a LAST-NAME, can own any number of vehicles. Each
vehicle is of a given VEH-MAKE and is registered in any one of a number of states
identified STATE-NAME. The registration number (REG-NO) and the registration
termination date (REG-TERM-DATE) are of interest, and so is the address of a registration
office (REG-OFF-ADD) in each state.
Identify the entities and relationships for this enterprise and construct an E-R diagram.
28. An organisation purchases items from a number of suppliers. Suppliers are identified by
SUP-ID. It keeps track of the number of each item type purchased from each supplier. It also
keeps a record of supplier’s addresses. Supplied items are identified by ITEM-TYPE and
have a description (DESC). There may be more than one such address for each supplier and
the price charged by each supplier for each item type is stored.
Identify the entities and relationships for this organisation and construct an E-R diagram.
29. Given the following E-R diagram of Fig. 6.26, define the appropriate SQL tables.

Fig. 6.26 A sample E-R diagram

30. (a) Construct an E-R diagram for a hospital management system with a set of doctors and a
set of patients. With each patient, a series of various tests and examinations are conducted.
On the basis of a preliminary report, patients are admitted to a particular speciality ward.
(b) Construct appropriate tables for the above E-R diagram.
31. A chemical testing laboratory has several chemists who work on one or more projects.
Chemists may have a variety of equipment on each project. The CHEMIST has the attributes
namely EMP-ID (identifier), CHEM-NAME, ADDRESS and PHONE-NO. The PROJECT
has attributes such as PROJ-ID (identifier), START-DATE and END-DATE. The
EQUIPMENT has attributes such as EQUP-SERIAL-NO and EQUP-COST. The laboratory
management wants to record the EQUP-ISSUE-DATE when given equipment item is
assigned to a particular chemist working on a specified project. A chemist must be assigned
to at least one project and one equipment item. A given equipment item need not be assigned
and a given project need not be assigned either a chemist or an equipment item.
Draw an E-R diagram for this situation.
32. A project handling organisation has persons identified by a PER-ID and a LAST-NAME.
Persons are assigned to departments identified by a DEP-NAME. Persons work on projects
and each project has a PROJ-ID and a PROJ-BUDGET. Each project is managed by one
department and a department may manage many projects. But a person may work on only
some (or none) of the projects in his or her department.

a. Identify the entities and relationships for this organisation and construct an E-R
diagram.
b. Would your E-R diagram change if the person worked on all the projects in his or
her department?
c. Would there be any change if you also recorded the TIME-SPENT by the person on
each project?

STATE TRUE/FALSE

1. E-R model was first introduced by Dr. E.F. Codd.


2. E-R modelling is a high-level conceptual data model developed to facilitate database design.
3. E-R model is dependent on a particular database management system (DBMS) and hardware
platform.
4. A binary relationship exists when an association is maintained within a single entity.
5. A weak entity type is independent of the existence of another entity.
6. An entity type is a group of objects with the same properties, which are identified by the
enterprise as having an independent existence.
7. An entity occurrence is also called entity instance.
8. An entity instance is a uniquely identifiable object of an entity type.
9. A relationship is an association among two or more entities that is of interest to the
enterprise.
10. The participation is optional if an entity’s existence requires the existence of an associated
entity in a particular relationship.
11. An entity type does not have an independent existence.
12. An attribute is viewed as the atomic real world item.
13. Domains can be composed of more than one domain.
14. The degree of a relationship is the number of entities associated or participants in the
relationship.
15. The connectivity of a relationship describes a constraint on the mapping of the associated
entity occurrences in the relationship.
16. In case of mandatory existence, the occurrence of that entity need not exist.
17. An attribute is a property of an entity or a relationship type.
18. In E-R diagram, if the attribute is simple or single-valued then they are connected using a
single line.
19. In E-R diagram, if the attribute is derived then they are connected using double lines.
20. An entity type that is not existence-dependent on some other entity type is called a strong
entity type.
21. Weak entities are also referred to as child, dependent or subordinate entities.
22. An entity type can be an object with a physical existence but cannot be an object with a
conceptual existence.
23. Simple attributes can be further divided.
24. In an E-R diagram, the entity name is written in uppercase whereas the attribute name is
written in lowercase letters.

TICK (✓) THE APPROPRIATE ANSWER

1. E-R Model was introduced by:

a. Dr. E.F. Codd.


b. Boyce.
c. P.P. Chen.
d. Chamberlain.

2. An association among three entities is called:

a. binary relationship.
b. ternary relationship.
c. recursive relationship.
d. none of these.

3. The association between the two entities is called:

a. binary relationship.
b. ternary relationship.
c. recursive relationship.
d. none of these.

4. Which data model is independent of both hardware and DBMS?

a. external.
b. internal.
c. conceptual.
d. all of these.

5. A relationship between the instances of a single entity type is called:


a. binary relationship.
b. ternary relationship.
c. recursive relationship.
d. none of these.

6. A simple attribute is composed of:

a. single component with an independent existence.


b. multiple components, each with an independent existence.
c. both (a) and (b).
d. none of these.

7. A composite attribute is composed of:

a. single component with an independent existence.


b. multiple components, each with an independent existence.
c. both (a) and (b).
d. none of these.

8. What are the components of E-R model?

a. entity.
b. attribute.
c. relationship.
d. all of these.

9. The attribute composed of a single component with an independent existence is called:

a. composite attribute.
b. atomic attribute.
c. single-valued attribute.
d. derived attribute.

10. The attribute composed of multiple components, each with an independent existence is
called:

a. composite attribute.
b. simple attribute.
c. single-valued attribute.
d. derived attribute.

11. Which of these expresses the specific number of entity occurrences associated with one
occurrence of the related entity?

a. degree of relationship.
b. connectivity of relationship.
c. cardinality of relationship.
d. none of these.
FILL IN THE BLANKS

1. E-R model was introduced by _____ in _____.


2. An entity is an _____ or _____ in the real world.
3. A relationship is an _____ among two or more _____ that is of interest to the enterprise.
4. A particular occurrence of a relationship is called a _____ .
5. The database model uses the (a) _____ , (b) _____ and (c) _____ to construct representation
of the real world system.
6. The relationship is joined by _____ to the entities that participate in the relationship.
7. An association among three entities is called _____.
8. A relationship between the instances of a single entity type is called _____.
9. The association between the two entities is called _____.
10. The actual count of elements associated with the connectivity is called _____ of the
relationship connectivity.
11. An attribute is a property of _____ or a _____ type.
12. The components of an entity or the qualifiers that describe it are called _____ of the entity.
13. In E-R diagram, the _____ are represented by a rectangular box with the name of the entity in
the box.
14. The major components of an E-R diagram are (a) _____ , (b) _____ , (c) _____ and (d) _____ .
15. The E-R diagram captures the (a) _____ and (b) _____ .
16. _____ entities are also referred to as parent, owner or dominant entities.
17. A _____ is an attribute composed of a single component with an independent existence.
18. In E-R diagram, _____ are underlined.
19. Each uniquely identifiable instance of an entity type is also referred to as an _____ or _____ .
20. A _____ relationship exists when two entities are associated.
21. In an E-R diagram, if the attribute is _____, its component attributes are shown in ellipses
emanating from the composite attribute.
Chapter 7

Enhanced Entity-Relationship (EER) Model

7.1 INTRODUCTION

The basic concepts of an E-R model discussed in Chapter 6 are adequate for
representing database schemas for traditional and administrative database
applications in business and industry such as customer invoicing, payroll
processing, product ordering and so on. However, it poses inherent problems
when representing complex applications of newer databases that are more
demanding than traditional applications such as Computer-aided Software
Engineering (CASE) tools, Computer-aided Design (CAD) and Computer-
aided Manufacturing (CAM), Digital Publishing, Data Mining, Data
Warehousing, Telecommunications applications, images and graphics,
Multimedia Systems, Geographical Information Systems (GIS), World Wide
Web (WWW) applications and so on. To represent these modern and more
complex applications, designers use additional semantic modelling
concepts. There are various abstractions available to capture semantic
features, which cannot be explicitly modelled by entities and relationships.
Enhanced Entity-Relationship (EER) model uses such additional semantic
concepts incorporated into the original E-R model to overcome these
problems. The EER model consists of all the concepts of the E-R model
together with the following additional concepts:
Specialisation/Generalisation
Categorisation.

This chapter describes the entity types called superclasses (or supertype)
and subclasses (or subtype) in addition to these additional concepts
associated with the EER model. How to convert the E-R model into EER
model has also been demonstrated in this chapter.

7.2 SUPERCLASS AND SUBCLASS ENTITY TYPES

As it has been discussed in Chapter 6, Section 6.2.1, an entity type is a set of


entities of the same type that share the same properties or characteristics.
Subclasses (or subtypes) and superclasses (or supertypes) are the special
type of entities.
Subclasses or subtypes are the sub-grouping of occurrences of entities in
an entity type that is meaningful to the organisation and that shares common
attributes or relationships distinct from other sub-groupings. Subtype is one
of the data-modelling abstractions used in EER. In this case, objects in one
set are grouped or subdivided into one or more classes in many systems. The
objects in each class may then be treated differently in certain circumstances.
Superclass or supertype is a generic entity type that has a relationship with
one or more subtypes. It includes an entity type with one or more distinct
sub-groupings of its occurrences, which is required to be represented in a
data model. Each member of the subclass or subtype is also a member of the
superclass or supertype. That means, the subclass member is the same as the
entity in the superclass, but has a distinct role. The relationship between a
superclass and subclass is a one-to-one (1:1) relationship. In some cases, a
superclass can have overlapping subclasses.
A superclass/subclass is simply called class/subclass or supertype/subtype.
For example, the entity type PERSONS describes the type (that is, attributes
and relationships) of each person entity and also refers to the current set of
PERSONS entities in the enterprise database. In many cases an entity type
has sub-groupings of its entities that are meaningful and need to be
represented explicitly. For example, the entities that are members of the
PERSONS entity type may be grouped into MANAGERS, ENGINEERS,
TECHNICIANS, SECRETARY and so on. The set of entities in each of the
latter groupings is a subset of the entities that belong to the PERSON entity
set. This means that every entity that is a member of one of these sub-
groupings is also a person. Each of these sub-groupings is called a subclass
of the PERSONS entity type and the PERSONS entity type is called the
superclass for each of these subclasses. The relationship between superclass
and any one of its subclasses is called superclass/subclass relationship. The
PERSONS/MANAGERS and PERSONS/ENGINEERS are two
class/subclass relationships. It is to be noted that a member entity of the
subclass represents the same real-world entity as some member of the
superclass. A superclass/subclass relationship is often called an IS-A (or IS-AN)
relationship because of the way we refer to the concept. We say “a
SECRETARY IS-A PERSON”, an “ENGINEER IS-A PERSON”, “a
MANAGER IS-A PERSON” and so forth.
Fig. 7.1 illustrates the semantic diagram of the classes both at enterprise
level and occurrence level. As shown in Fig. 7.1 (a), an enterprise may
employ a set of persons and each person is treated in the same way for
personnel purposes. That is, details such as date-of-birth, employment-
history, health insurance and so on may be recorded for each person. Some
persons, however, may be hired as managers and others may be hired as
engineers. These persons may be treated differently depending on the class
to which they belong. Thus, managers may appear in a management
relationship with departments, whereas engineers may be assigned to
projects.
Fig. 7.1 Semantic diagram of the classes

The enterprise level of this semantic diagram has an IS-A association


between sets, which states that a member in one set is also a member in
another. All engineers, therefore, are persons, as are all managers. The
occurrence level semantic net of Fig. 7.1 (b) depicts individual members of
the sets and subsets. In the set PERSONS, which has four members namely
Thomas, Avinash, Alka and Mathew, Thomas is also a member of
MANAGERS set and Alka, Mathew and Avinash are members of the
ENGINEERS set. All members of the ENGINEERS set must also be in the
PERSONS set and so ENGINEERS and MANAGERS are subsets of the
PERSONS set. Because subclasses occur frequently in systems, semantic
modelling methods enable database designers to model them.

7.2.1 Notation for Superclasses and Subclasses


The basic notation for superclass and subclass is illustrated in Fig. 7.2. The
superclass is connected with a line to a circle. The circle in turn is connected
by line to each subtype that has been defined. The U-shaped symbols on
each line connecting a subtype to the circle, indicates that the subtype is a
subset of the supertype. It also indicates the direction of the
supertype/subtype relationship. The attributes shared by all the entities
(including the identifier) of the supertype (or shared by all the subtypes) are
associated with the supertype entity. The attributes that are unique to a
particular subtype are associated with the respective subtype.

Fig. 7.2 Basic notation of superclass and subclass relationships

For example, suppose that an enterprise has an entity called EMPLOYEE,


which has three subtypes namely FULL-TIME, PART-TIME and
CONSULTANT. Some of the important attributes for each of these types of
employees are as follows:
FULL-TIME-EMPLOYEE (EMP-ID, EMP-NAME, ADDRESS, DATE-OF-BIRTH, DATE-
OF-JOINING, SALARY, ALLOWANCES).
PART-TIME-EMPLOYEE (EMP-ID, EMP-NAME, ADDRESS, DATE-OF-BIRTH, DATE-
OF-JOINING, HOURLY-RATE).
CONSULTANT (EMP-ID, EMP-NAME, ADDRESS, DATE-OF-BIRTH, DATE-OF-
JOINING, CONTRACT-NO, BILLING-RATE).

It can be noticed from the above attributes that all the three categories of
employees have some attributes in common such as EMP-ID, EMP-NAME,
ADDRESS, DATE-OF-BIRTH and DATE-OF-JOINING. In addition to
these common attributes, each type has one or more attributes that is unique
to that type. For example, SALARY and ALLOWANCES are unique to a
fulltime employee, whereas the HOURLY-RATE is unique to the part time
employees and so on. While developing a conceptual data model in this
situation, the database designer might consider the following three choices:
a. Treat these entities as three separate ones. In this case, the model will fail to exploit the
common attributes of all employees, thus creating an inefficient model.
b. Treat these entities as a single entity, which contains a superset of all attributes. In this case,
the model requires the use of nulls (for the attributes for which the different entities have no
value), thus making the design more complex.
c. Treat EMPLOYEE as a supertype entity with subtypes for each category of employee, as
described below.

The above example is an ideal situation for the use of supertype/subtype


representation. In this case, the EMPLOYEE can be defined as supertype
entity with subtypes for a FULL-TIME-EMPLOYEE, PART-TIME-EMPLOYEE
and CONSULTANT. The attributes that are common to all the three subtypes
can be associated with the supertype. This approach exploits the common
properties of all employees, yet recognises the unique properties of each
type.
Fig. 7.3 illustrates a representation of the EMPLOYEE supertype with its
three subtypes, using EER notation. Attributes shared by all employees are
associated with the EMPLOYEE entity type whereas the attributes that are
unique to each subtype are included with that subtype only.
Fig. 7.3 EMPLOYEE supertype with three subtypes

7.2.2 Attribute Inheritance


Attribute inheritance is the property by which subtype entities inherit values
of all attributes of the supertype. This feature makes it unnecessary to
associate the supertype attributes with the subtypes, thus avoiding
redundancy. For example, the attributes of the EMPLOYEE supertype entity
such as EMP-ID, EMP-NAME, ADDRESS, DATE-OF-BIRTH, DATE-OF-
JOINING, are inherited to the subtype entities. As we explained in our
earlier discussions, the attributes it possesses and the relationship types in
which it participates define the entity type. Because an entity in the subtype
represents the same real-world entity from the supertype, it should possess
values for its specific attributes as well as values of attributes as a member of
the supertype. We say that an entity that is a member of a subtype, inherits
all the attributes of the entity as a member of the supertype. The entity also
inherits all the relationships in which the supertype participates. It can be
noticed here that a subtype with its own specific attributes and relationships
together with all the attributes and relationships it inherits from the
supertype, can be considered an entity type in its own right.
A subtype with more than one supertype is called a shared subtype. In
other words, a member of a shared subtype must be a member of the
associated supertype. As a consequence, the shared subtype inherits the
attributes of the supertypes, which may also have its own additional
attributes. This is referred to as multiple inheritances.

7.2.3 Conditions for Using Supertype/Subtype Relationships


The supertype/subtype relationships should be used when either or both of
the following conditions are satisfied:
There are attributes that are common to some (but not all) of the instances of an entity type.
For example, EMPLOYEE entity type in Fig. 7.3.
The instances of a subtype participate in a relationship unique to that subtype.

Let us expand the EMPLOYEE example of Fig. 7.3 to illustrate the above
conditions, as shown in Fig. 7.4. The EMPLOYEE supertype has three
subtypes namely FULL-TIME-EMPLOYEE, PART-TIME-EMPLOYEE and
CONSULTANT. All employees have common attributes like EMP-ID,
EMP-NAME, ADDRESS, DATE-OF-BIRTH and DATE-OF-JOINING.
Each subtype has attributes unique to that subtype. Full time employees have
SALARY and ALLOWANCE, part time employees have HOURLY-RATE,
and consultants have CONTRACT-NO and BILLING-RATE. The full time
employees have a unique relationship with the TRAINING entity. Only full
time employees can enrol in the training courses conducted by the enterprise.
Thus, this is a case where one has to use a supertype/subtype relationship, as
there exist instances of a subtype that participate in a relationship that is
unique to that subtype.

7.2.4 Advantages of Using Superclasses and Subclasses


The concepts of introducing superclasses and subclasses into E-R model
provides the following enhanced features:
It avoids the need to describe similar concepts more than once, thus saving time for the data
modelling person.
It results in more readable and better-looking E-R diagrams.
Superclass and subclass relationships add more semantic content and information to the
design in a concise form.

Fig. 7.4 EMPLOYEE supertype/subtype relationship

7.3 SPECIALISATION AND GENERALISATION

Both specialisation and generalisation are useful techniques for developing


superclass/subclass relationships. The use of the specialisation or generalisation
technique for a particular situation depends on the following factors:
Nature of the problem.
Nature of the entities and relationships.
The personal preferences of the database designer.

7.3.1 Specialisation
Specialisation is the process of identifying subsets of an entity set (the
superclass or supertype) that share some distinguishing characteristic. In
other words, specialisation maximises the differences between the members
of an entity by identifying the distinguishing and unique characteristics (or
attributes) of each member. Specialisation is a top-down process of defining
superclasses and their related subclasses. Typically the superclass is defined
first, the subclasses are defined next and subclass-specific attributes and
relationship sets are then added. If the specialisation approach were not applied,
the EMPLOYEE entity type would have looked as depicted in Fig. 7.5 (a).
Creation of three subtypes for the EMPLOYEE supertype in Fig. 7.5 (b),
is an example of specialisation. The three subclasses have many attributes in
common. But there are also attributes that are unique to each subtype, for
example, SALARY and ALLOWANCES for the FULL-TIME-EMPLOYEE.
Also there are relationships unique to some subclasses, for example,
relationship of full-time employee to TRAINING. In this case, specialisation
has permitted a preferred representation of the problem domain.
Fig. 7.5 Example of specialisation

(a) Entity type EMPLOYEE before specialisation

(b) Entity type EMPLOYEE after specialisation


Fig. 7.6 Example of specialisation

(a) Entity type ITEM before specialisation

(b) Entity type ITEM after specialisation


Fig. 7.6 illustrates another example of the specialisation process. Fig. 7.6
(a) shows an entity type named ITEM having several attributes namely
DESCRIPTION, ITEM-NO, UNIT-PRICE, SUPPLIER-ID, ROUTE-NO,
MANUFNG-DATE, LOCATION and QTY-IN-HAND. The identifier is
ITEM-NO, and the attribute SUPPLIER-ID is multi-valued because there
may be more than one supplier for an item. Now, after analysis it can be
observed that an ITEM can either be manufactured internally or purchased
from outside. Therefore, the supertype ITEM can be divided into two
subtypes namely MANUFACTURED-ITEM and PURCHASED-ITEM. It
can also be observed from Fig. 7.6 (a) that some of the attributes apply to all
items regardless of source, whereas others depend on the source. Thus,
ROUTE-NO and MANUFNG-DATE apply only to the MANUFACTURED-
ITEM subtype. Similarly, SUPPLIER-ID and UNIT-PRICE apply only to the
subtype PURCHASED-ITEM. Thus, ITEM is specialised by defining the
subtypes MANUFACTURED-ITEM and PURCHASED-ITEM, as shown in Fig.
7.6 (b).

7.3.2 Generalisation
Generalisation is the process of identifying some common characteristics of
a collection of entity sets and creating a new entity set that contains entities
possessing these common characteristics. In other words, it is the process of
minimising the differences between the entities by identifying the common
features. Generalisation is a bottom-up process, just opposite to the
specialisation process. It identifies a generalised superclass from the original
subclasses. Typically, these subclasses are defined first, the superclass is
defined next and any relationship sets that involve the superclass are then
defined. Creation of the EMPLOYEE superclass with common attributes of
three subclasses namely FULL-TIME-EMPLOYEE, PART-TIME-
EMPLOYEE and CONSULTANT as shown in Fig. 7.7, is an example of
generalisation.
Fig. 7.7 Example of generalisation

(a) Three entity types namely CAR, TRUCK and TWO-WHEELER


(b) Generalisation to VEHICLE type

Another example of generalisation is shown in Fig. 7.7. As shown in Fig.


7.7 (a), three entity types are defined as CAR, TRUCK and TWO-
WHEELER. After analysis it is observed that these entities have a number of
common attributes such as REGISTRATION-NO, VEHICLE-ID, MODEL,
PRICE and MAX-SPEED. This fact basically indicates that each of these
three entity types is a version of a general vehicle type. Fig. 7.7 (b)
illustrates the generalised model of entity type VEHICLE together with the
resulting supertype/subtype relationships. The entity CAR has the specific
attribute as NO-OF-PASSENGERS, while the entity type TRUCK has
specific attribute as CAPACITY. Thus, generalisation has allowed to group
entity types along with their common attributes, while at the same type
preserving specific attributes that are specific to each subtype.

7.3.3 Specifying Constraints on Specialisation and Generalisation


The constraints are applied on specialisation and generalisation to capture
important business rules of the relationships in an enterprise. There are
mainly two types of constraints that may apply:
Participation constraint.
Disjoint constraint

7.3.3.1 Participation Constraints


Participation constraint may be of two types namely (a) total or (b) partial. A
total participation (also called a mandatory participation) specifies that
every member (or entity) in the supertype (or superclass) must participate as
a member of some subclass in the specialisation/generalisation. For example,
if every EMPLOYEE must be one of the three subclasses namely a FULL-
TIME EMPLOYEE, a PART-TIME EMPLOYEE, or a CONSULTANT in
an organisation and no other type of employees, then it is a total
participation constraint. A total participation constraint is represented as
double lines to connect the supertype and the specialisation/generalisation
circle, as shown in Fig. 7.8 (a).
Fig. 7.8 Participation constraints

(a) Total (or mandatory) constraint

(b) Partial (or optional) constraints

A partial participation (also called an optional participation) constraint


specifies that a member of a supertype need not belong to any of its
subclasses of a specialisation/generalisation. For example, member of the
EMPLOYEE entity type need not have an additional role as a UNION-
MEMBER or a CLUB-MEMBER. In other words, there can be employees
who are not union members or club members. A partial (or optional)
participation constraint is represented as single line connecting the supertype
and the specialisation/ generalisation circle, as shown in Fig. 7.8 (b).

7.3.3.2 Disjoint Constraints


Disjoint constraint specifies the relationship between members of the
subtypes and indicates whether it is possible for a member of a supertype to
be a member of one, or more than one, subtype. The disjoint constraint is
only applied when a supertype has more than one subtype. If the subtypes
are disjoint, then an entity occurrence can be a member of at most one of the
subtypes of the specialisation/generalisation. For example, the subtype of the
EMPLOYEE supertype namely FULL-TIME EMPLOYEE, PART-TIME
EMPLOYEE and CONSULTANT are connected as shown in Fig. 7.9 (a).
This means that an employee has to be one of these three subtypes, that is,
either full-time, part-time or consultant. The disjoint constraint is
represented by placing letter ‘d’ in the circle that connects the subtypes to
the supertype.
Fig. 7.9 Example of disjoint constraint

(a) Disjoint constraint

(b) Overlapping constraint

If the subtypes are not constrained to be disjoint, the sets of entities may
overlap. In other words, the same real-world entity may be a member of
more than one subtype of the specialisation/generalisation. This is called an
overlapping constraint. For example, the subtypes of EMPLOYEE
supertype namely UNION-MEMBER and CLUB-MEMBER are connected
as shown in Fig. 7.9 (b). This means that an employee can be a member of
one, or two of the subtypes. In other words, an employee can be a union-
member as well as a club-member. The overlapping constraint is represented
by placing letter ‘o’ in the circle that connects the subtypes to the supertype.
7.4 CATEGORISATION

Categorisation is a process of modelling of a single subtype (or subclass)


with a relationship that involves more than one distinct supertype (or
superclass). Till now all the relationships that have been discussed, are a
single distinct supertype. However, there could be need for modelling a
single supertype/subtype relationship with more than one supertype, where
the supertypes represent different entity set.

Fig. 7.10 Categorisation

For example, let us assume that a vehicle is purchased in a company for


transportation of goods from one department to another. Now, the owner of
the vehicle can be a department, an employee or the company itself. This is a
case of modelling a single supertype/subtype relationship with more than
one supertype, where the supertypes represent three entity types. In this case,
the subtype represents a collection of objects that is a subset of the union of
distinct entity types. Thus, a category called OWNER can be created as a
subtype of the UNION of the three entity sets DEPARTMENT,
EMPLOYEE and COMPANY. Each supertype and the subtype are connected to
the circle with the ‘U’ symbol. Fig. 7.10 illustrates an EER diagram of
categorisation.

7.5 EXAMPLE OF EER DIAGRAM


Fig. 7.11 illustrates an example of EER diagram using all the concepts
discussed in the previous sections of this chapter, for a database schema of
technical university.
The technical university database keeps track of the students and their
main department, transcripts and registrations as well as the university’s
course offerings. The database also keeps track of the sponsored research
projects of the faculty and the undergraduate students. The database
maintains person-wise information such as person’s name (F-NAME),
person identification number (PER-ID), date of birth (DOB), sex (SEX) and
address (ADDRESS). Supertype PERSON has two subtypes namely
STUDENT and FACULTY. The specific attribute of STUDENT is
CLASS (freshman = 1, sophomore = 2, post graduate = 3, doctoral = 4,
undergraduate = 5). Each student is related to his or her main and auxiliary
departments, if known (‘main’ and ‘auxiliary’), to the course sections he or she
is currently attending (‘registered’), and to the courses completed (‘transcript’).
Each transcript instance includes the GRADE the student received in the
‘course section’.
Fig. 7.11 EER for technical university database

The specific attributes of FACULTY are OFFICE, CONTACT-NO and


SALARY. All faculty members are related to the academic departments to
which they belong. Since a faculty member can be associated with more than
one department, a many-to-many (M:N) relationship exists. The UNDER-
GRADUATE student is another subtype of STUDENT with the defining
predicate Class = 5. For each under-graduate student, the university keeps a
list of previous degrees in a composite, multi-valued attribute called
DEGREES. The under-graduate students are also related to a faculty
‘advisor’ and to a thesis ‘committee’ if one exists. The academic
DEPARTMENT has attributes such as department name (D-NAME) and
office name (OFFICE). Department is related to the faculty member who is
its ‘heads’ and to the college to which it belongs (‘college- dept’). Each
COLLEGE has attributes namely DEAN, C-NAME and C-OFFICE.
The entity type COURSE has attributes such as course number (C-NO),
course name (C-NAME) and course description (C-DESC). Several sections
of each course are offered. Each SECTION has attributes namely section no.
(SEC-NO), year (YEAR) and quarter (QTR) in which it was offered. SEC-
NO uniquely identifies each section and the sections being offered during
the current semester are in a subtype CURRENT-SECTION of SECTION.
Each section is related to the instructor who ‘teaches’ it.
The category INSTRUCTOR-RESEARCHER is a subset of the union (U)
of the FACULTY and the UNDER-GRADUATE student and includes all the
faculty as well as the under-graduate students who are supported by teaching
or research. Finally, the entity type GRANT keeps track of research grants
and contracts awarded to the university. Each GRANT has attributes such as
grant number (NO), grant title (TITLE) and the starting date (ST-DATE). A
grant is related to one ‘principal-investigator’ and to all researchers it
supports (‘support’). Each instance of support has attributes such as the
starting date of support (START), the ending date of support (END) and the
time being spent (TIME) on the project by the researcher being supported.
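To make the supertype/subtype notation concrete, the sketch below shows one common way of carrying the PERSON supertype and its STUDENT and FACULTY subtypes of Fig. 7.11 into relational tables. The SQL table names, column names and data types are hypothetical, and this is only one of several possible mappings:

-- A minimal, hypothetical relational mapping of the PERSON supertype
-- and its STUDENT and FACULTY subtypes.
CREATE TABLE PERSON (
    PER_ID  INT PRIMARY KEY,
    F_NAME  VARCHAR(40),
    DOB     DATE,
    SEX     CHAR(1),
    ADDRESS VARCHAR(100)
);

CREATE TABLE STUDENT (
    PER_ID INT PRIMARY KEY REFERENCES PERSON(PER_ID),  -- subtype inherits the key of its supertype
    CLASS  INT CHECK (CLASS BETWEEN 1 AND 5)           -- 5 = undergraduate
);

CREATE TABLE FACULTY (
    PER_ID     INT PRIMARY KEY REFERENCES PERSON(PER_ID),
    OFFICE     VARCHAR(20),
    CONTACT_NO VARCHAR(15),
    SALARY     DECIMAL(10, 2)
);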

REVIEW QUESTIONS
1. What are the disadvantages or limitations of an E-R Model? What led to the development of
EER model?
2. What do you mean by superclass and subclass entity types? What are the differences between
them? Explain with an example.
3. Using a semantic net diagram, explain the concept of superclasses and subclasses.
4. With an example, explain the notations used for EER diagram while designing database for
an enterprise.
5. What do you mean by attribute inheritance? Why do we use it in an EER diagram? Explain with
an example.
6. Differentiate between a shared subtype and multiple inheritance.
7. What are the conditions that must be considered while deciding on supertype/subtype
relationship? Explain with an example.
8. What are the advantages of using supertypes and subtypes?
9. What do you understand by specialisation and generalisation in EER modelling? Explain
with examples.
10. Discuss the constraints on specialisation and generalisation.
11. What is participation constraint? What are its types? Explain with an example.
12. What is partial participation? Explain with an example.
13. What is mandatory participation? Explain with an example.
14. What do you mean by disjoint constraints of specialisation/generalisation? Explain with an
example.
15. What is overlapping constraint? Explain with an example.
16. A non-government organisation (NGO) depends on the number of different types of persons
for its operations. The NGO is interested in three types of persons namely volunteers, donors
and patrons. The attributes of such persons are person identification number, person name,
address, city, pin code and telephone number. The patrons have only a date-elected attribute
while the volunteers have only skill attribute. The donors only have a relationship ‘donates’
with an ITEM entity type. A donor must have donated one or more items and an item may
have no donors, or one or more donors. There are persons other than donors, volunteers and
patrons who are of interest to the NGO, so that a person need not belong to any of these three
groups. On the other hand, at a given time a person may belong to two or more of these
groups.
Draw an EER diagram for this NGO database schema.
17. Draw an EER diagram for a typical banking organisation. Make assumptions wherever
required.

STATE TRUE/FALSE

1. Subclasses are the sub-grouping of occurrences of entities in an entity type that shares
common attributes or relationships distinct from other sub-groupings.
2. In case of supertype, objects in one set are grouped or subdivided into one or more classes in
many systems.
3. Superclass is a generic entity type that has a relationship with one or more subtypes.
4. Each member of the subclass is also a member of the superclass.
5. The relationship between a superclass and a subclass is a one-to-many (1:N) relationship.
6. The U-shaped symbols in the EER model indicate that the supertype is a subset of the subtype.
7. Attribute inheritance is the property by which supertype entities inherit values of all
attributes of the subtype.
8. Specialisation is the process of identifying subsets of an entity set of the superclass or
supertype that share some distinguishing characteristic.
9. Specialisation minimizes the differences between members of an entity by identifying the
distinguishing and unique characteristics of each member.
10. Generalisation is the process of identifying some common characteristics of a collection of
entity sets and creating a new entity set that contains entities possessing these common
characteristics.
11. Generalisation maximizes the differences between the entities by identifying the common
features.
12. Total participation is also called an optional participation.
13. A total participation specifies that every member (or entity) in the supertype (or superclass)
must participate as a member of some subclass in the specialisation/generalisation.
14. The participation constraint can be total or partial.
15. A partial participation constraint specifies that a member of a supertype need not belong to
any of its subclasses of a specialisation/generalisation.
16. A nondisjoint constraint is also called an overlapping constraint.
17. A partial participation is also called a mandatory participation.
18. Disjoint constraint specifies the relationship between members of the subtypes and indicates
whether it is possible for a member of a supertype to be a member of one, or more than one,
subtype.
19. The disjoint constraint is only applied when a supertype has one subtype.
20. A partial participation is represented using a single line between the supertype and the
specialisation/generalisation circle.
21. A subtype is not an entity on its own.
22. A subtype cannot have its own subtypes.

TICK (✓) THE APPROPRIATE ANSWER

1. The U-shaped symbols on each line connecting a subtype to the circle, indicates that the
subtype is a

a. subset of the supertype.


b. direction of supertype/subtype relationship.
c. both of these.
d. none of these.

2. The use of the specialisation or generalisation technique for a particular situation depends on

a. the nature of the problem.


b. the nature of the entities and relationships.
c. the personal preferences of the database designer.
d. all of these.

3. In specialisation, the differences between members of an entity are

a. maximized.
b. minimized.
c. both of these.
d. none of these.

4. In generalisation, the differences between members of an entity are

a. maximized.
b. minimized.
c. both of these.
d. none of these.

5. Specialisation is a

a. top-down process of defining superclasses and their related subclasses.


b. bottom-up process of defining superclasses and their related subclasses.
c. both of these.
d. none of these.

6. EER stands for

a. extended E-R.
b. effective E-R.
c. expanded E-R.
d. enhanced E-R.

7. Which additional concepts are added to the E-R model?

a. specialisation.
b. generalisation.
c. supertype/subtype entity.
d. all of these.

FILL IN THE BLANKS

1. The relationship between a superclass and a subclass is _____.


2. The U-shaped symbols in the EER model indicate that the _____ is a _____ of the _____.
3. Attribute inheritance is the property by which _____ entities inherit values of all attributes of
the _____.
4. The E-R model that is supported with the additional semantic concepts is called the _____.
5. Attribute inheritance avoids _____.
6. A subtype with more than one supertype is called a _____.
7. A total participation specifies that every member in the _____ must participate as a member
of some _____ subclass in the _____.
8. A partial participation constraint specifies that a member of a _____ need not belong to any
of its _____ of a _____.
9. A total participation is also called _____ participation.
10. A partial participation is also called _____ participation.
11. The disjoint constraint is only applied when a supertype has more than _____ subtype.
12. The property of the subtypes inheriting the relationships of the supertype is called _____.
13. The disjoint constraint is represented by placing letter _____ in the _____ that connects the
subtypes to the supertype.
14. The overlapping constraint is represented by placing letter _____ in the _____ that connects
the subtypes to the supertype.
15. A subtype with more than one supertype is called _____.
16. _____ is the process of minimizing the difference between the entities by identifying the
common features.
17. The process of _____ is the reverse of the process of specialisation.
18. The expanded form of EER is _____.
Part-III

DATABASE DESIGN
Chapter 8

Introduction to Database Design

8.1 INTRODUCTION

Data is an important corporate resource of an organisation, and the
database is a fundamental component of an information system. Therefore,
management and control of the corporate data and the corporate database are
very important. Database design is a process of arranging the corporate data
fields into an organised structure needed by one or more applications in the
organisation. The organised structure must foster the required relationships
among the fields while conforming to the physical constraints of the
particular database management system in use. This structure must result
in the advantages explained in Chapter 1, Section 1.8.5, such as:
Controlled data redundancy.
Data independence.
Application.
Performance.
Data security.
Ease of programming.

The number and types of data fields, whether one composite database or
several databases are needed, and other such decisions necessary to fulfil the
requirements of an enterprise are derived from the information system
strategic planning exercise known as the information system life cycle. This
is also called the software development life cycle (SDLC).
In this chapter, the basic concepts of the software development life cycle
(SDLC), structured system analysis and design (SSAD), the database
development life cycle (DDLC) and automated design tools are explained.
8.2 SOFTWARE DEVELOPMENT LIFE CYCLE (SDLC)

A computer-based information system of an enterprise or organisation
consists of computer hardware, applications software, a database, database
software, the users and the system developer. The software development life
cycle (SDLC) is a formal software engineering framework that is essential
for developing reliable, maintainable and cost-effective application and other
software. The software process starts with concept exploration and ends
when the product is finally retired (decommissioned). During this period, the
product goes through a series of phases.
There are many variations of this software process life cycle model. But,
by and large, the software life cycle can be partitioned into the following
phases:
Requirements (or Concept) Phase, in which the concept is explored and refined. The client’s
(user’s or plant’s) requirements are also ascertained and analysed.
Specification Phase, in which the client’s requirements are presented in the form of the
specification document, explaining what the software product is supposed to do.
Planning Phase, in which a plan (the software project management plan), is drawn up,
detailing every aspect of the proposed software development.
Design Phase, in which the specification document prepared in the specification phase
undergoes two consecutive design processes. The first is called the Global Design phase,
in which the software product as a whole is broken down into components called modules. Then
each module in turn is designed in a phase termed Detailed Design. The two resulting design
documents describe how the software product does what the specification requires.
Programming (or Coding or Implementation) Phase, in which the various components (or
modules) of the software are coded in a specific computer programming language.
Integration (or Testing) Phase, in which the components are tested individually as well as
combined and tested as a whole. When the software developers are satisfied with the
software product, it is tested by the client (or user) for acceptance of the system. This phase
ends when the software is acceptable to the client and goes into operations mode.
Maintenance Phase, in which all corrective maintenance (required changes or software
repair) is done. It consists of the removal of residual faults while leaving the specifications
unchanged. It also includes the enhancement (or software update) which consists of changes
to the specifications and its implementation.
Retirement Phase, in which the software product is removed from the service.

8.2.1 Software Development Cost


Changes to the specifications of software products constantly occur
within growing organisations. This means that the maintenance of
software products in the form of enhancement is a positive part of an
organisation’s activities, reflecting that the organisation is on the progress
path.
But frequent software changes, without a change in specifications, are an
indication of a bad design. Fig. 8.1 illustrates the approximate percentage of
time (= money) spent on each phase of the software process. As can be seen
from this figure, about two-thirds of the total software cost is devoted to
maintenance. Thus, maintenance is an extremely time-consuming and
expensive phase of the software process. Because maintenance is so
important, a major aspect of software engineering consists of those
techniques, tools and practices that lead to a reduction in maintenance costs.

Fig. 8.1 Approximate relative costs for the phases of the software process

The relative cost of fixing a fault in a later phase is higher than that of
fixing the same fault in an earlier phase of the software process, as shown in Fig.
8.2. The solid straight line in this figure is the best fit for the data relating to
the larger projects and the dashed line is the best fit for the smaller projects.
For each of the phases of the software process, the corresponding relative
cost to detect and correct a fault is depicted in Fig. 8.3. Each point in Fig. 8.3
is constructed by taking the corresponding point on the solid straight line of
Fig. 8.2 and plotting the data on a linear scale.
It is evident from Fig. 8.3 that if it costs $30 to detect and correct a fault
during the integration phase, the same fault would have cost only about $2
to fix during the specification phase, but during the maintenance phase it
would cost around $200 to detect and correct. The moral of the story is that
the designer must find faults early, or else it will cost more money. The
designer should therefore employ techniques for detecting faults during the
requirements and specification phases.

Fig. 8.2 Relative cost of fixing a fault at each phase of the software process

8.2.2 Structured System Analysis and Design (SSAD)


Structured system analysis and design (SSAD) is a software engineering
approach to the specification, design, construction, testing and maintenance
of software for maximising the reliability and maintainability of the system
as well as for reducing software life-cycle costs. The use of graphics to
specify software was an important technique of the 1970s. Three methods
became particularly popular, namely those of DeMarco, Gane and Sarsen
and Yourdon. The three methods are all equally good and are similar in
many ways. Gane and Sarsen’s approach is presented here.

Fig. 8.3 Relative cost to fix an error (fault) plotted on linear scale

8.2.2.1 Structured System Analysis


Structured system analysis uses the following tools to build structured
specification of software:
Data flow diagram.
Data dictionary.
Structured English.
Decision tables.
Decision trees.
The various steps involved in structured analysis are listed below:

Step 1: Draw the Data Flow Diagram (DFD): The DFD is a
pictorial representation of all aspects of the logical data
flow. It uses four basic symbols (as per Gane and Sarsen), as
shown in Fig. 8.4.
With the help of these symbols, the data flow diagram of the
software problem is drawn and refined (broken down)
further till a logical flow of data is achieved.
Step 2: Put in the details of data flow: Data items are identified that
are required to go into various data flows. In case of a large
system, a data dictionary is created to keep track of the
various data elements involved.
Step 3: Define the logic of processes: The logical steps (and
algorithm) within each process is determined. For
developing the logic within the process, decision tree and
decision table techniques are used.
Step 4: Define data store: Exact content of each data store and its
format are defined. These help in database design and
building database.
Step 5: Define the physical resources: Now that the designer
(developer) knows what is required online and the format of
each element, blocking factors are decided. In addition, for
each file, the file name, organisation, storage medium and
records, down to the field level, are specified.
Step 6: Determine the input/output specifications: The input and
output forms are specified. Input screens, display screens,
printed output format, are decided.
Step 7: Perform sizing: The volume of input (daily, hourly, monthly
and so on), the frequency of each printed report, the size
and number of records of each type that are to pass between
the CPU and mass storage and the size of each file are
estimated.
Step 8: Determine the hardware requirements: Based on the
information estimated in Step 7, the hardware configuration
such as, storage capacity, processor speed, CPU size and so
on is decided.

Fig. 8.4 Symbols of Gane and Sarsen’s structured systems analysis

Determining the hardware configuration is the final step of Gane and
Sarsen’s specification method. The resulting specification document, after
approval by the client, is handed over to the design team, and the software
process continues. Fig. 8.5 illustrates an example of data flow diagram
(DFD) for process modeling of a steel making process.

8.2.2.2 Structured Design


Structured design is a specific approach to the design process that results in
small, independent, black-box modules, arranged in a hierarchy in a top-
down fashion. Structured design uses the following tools to build the
systems specifications document:
Cohesion.
Coupling.
Data flow analysis.

Cohesion of a component is a measure of how well its parts fit together. A
cohesive module performs a single task within a software procedure,
requiring little interaction with procedures being performed in other parts of
a program. If the component includes parts which are not directly related to
its logical function, it has a low degree of cohesion. Therefore, cohesion is
the degree of interaction within a single software module.

Fig. 8.5 A typical example of DFD for modelling of steel making process

Coupling is a measure of the interconnections among the modules in a software
system. Highly coupled systems have strong interconnections, with program units
dependent on each other. Loosely coupled systems are made up of units
which are independent. In software design, we strive for the loosest (lowest)
possible coupling. Therefore, coupling is the degree of interaction between
two software modules.
Data flow analysis (DFA) is a design method of achieving software
modules with high cohesion. It can be used in conjunction with most
specification methods, such as structured system analysis. The input to DFA
is a data flow diagram (DFD).

8.3 DATABASE DEVELOPMENT LIFE CYCLE

As stated above, a database system is a fundamental component of the larger
enterprise information system. The database development life cycle (DDLC)
is the process of designing, implementing and maintaining a database system
to meet the strategic or operational information needs of an organisation or
enterprise, such as:
Improved customer support and customer satisfaction.
Better production management.
Better inventory management.
More accurate sales forecasting.

The database development life cycle (DDLC) is inherently associated with
the software development life cycle (SDLC) of the information system. The
DDLC goes hand-in-hand with the SDLC, and database development
activities start right at the requirements phase. Information system planning
is the main source of database development projects. After the information
system planning exercise, the data stores identified in the data flow diagram
(DFD) of SSAD are used as inputs to the database design process. The
various stages of the database development life cycle (DDLC) are
shown in Fig. 8.6 and include the following steps:
Feasibility study and requirement analysis.
Database design.
Database implementation.
Data and application conversion.
Testing and validation.
Monitoring and maintenance.
Fig. 8.6 Database development life cycle (DDLC)

Feasibility Study and Requirement Analysis: At this stage, a
preliminary study is conducted of the existing business situation of an
enterprise or organisation and of how information systems might help solve
the problem. The business situation is then analysed to determine the
organisation’s needs, and a functional specification document (FSD) is
produced. The feasibility study and requirement analysis stage addresses the
following:
Study of existing systems and procedures.
Technological, operational and economic feasibilities.
The scope of the database system.
Information requirements.
Hardware and software requirements.
Processing requirements.
Intended number and types of database users.
Database applications.
Interfaces for various categories of users.
Problems and constraints of database environment such as response time constraints,
integrity constraints, access restrictions and so on.
Data, data volume, data storage and processing needs.
Properties and inter-relationships of the data.
Operation requirements.
The growth rate of the database.
Data security issues.

Database Design: At this stage, a database model suitable for the
organisation’s needs is decided. The findings of the FSD serve as the input to
the database design stage. A design specification document (DSD) is produced
at the end of this stage, and a complete logical and physical design of the
database system on the chosen DBMS becomes ready. The database design stage
addresses the following:
Conceptual database design, that is defining the data elements for inclusion in the database,
the relationships between them and the value constraints for defining the permissible values
for a specific data item.
Logical database design.
Physical database design for determining the physical structure of the database.
Developing specification.
Database implementation and tuning.

Database Implementation: In database implementation, the steps
required to change a conceptual design to a functional database are decided.
During this stage, a database management system (DBMS) is selected and
acquired, and then the detailed conceptual model is converted to the
implementation model of the DBMS. The database implementation stage
addresses the following:
Selection and acquisition of DBMS.
The process of specifying conceptual, external and internal data definitions.
Mapping of conceptual model to a functional database.
Building data dictionary.
Creating empty database files.
Developing and implementing software applications.
Procedures for using the database.
User training.

Data and Application Conversion: This stage addresses the following:


Populating database either by loading the data directly or by converting existing files into the
database system format.
Converting previous software applications into the new system.

Testing and Validation: At this stage, the new database system is tested
and validated for its intended results.

Monitoring and Maintenance: At this stage, the system is constantly
monitored and maintained. The following are addressed at this stage:
Growth and expansion of both data content and software applications.
Major modifications and reorganisation whenever required.

8.3.1 Database Design


Database design is the process of designing the logical and physical
structure of one or more databases to accommodate the information needs or
queries of the users for a defined set of applications and to support the
overall operation and objectives of an enterprise. In other words, the
performance of a DBMS on commonly asked user queries and typical update
operations is the ultimate measure of a database design. The performance of
the database can be improved by adjusting some of the parameters of the
DBMS, identifying performance bottlenecks and adding hardware to
eliminate these bottlenecks. It is therefore important to make a good choice
of database design to help in achieving good performance.
There are many approaches to the design of a database, as given below:
Bottom-up approach.
Top-down approach.
Inside-out approach.
Mixed strategy approach.

Bottom-up database design approach: The bottom-up database design
approach starts at the fundamental level of attributes (or abstractions), that is,
the properties of entities and relationships. It then combines or adds to these
abstractions, which are grouped into relations that represent types of entities
and relationships between entities. New relationships among entity types
may be added as the design progresses. The bottom-up approach is
appropriate for the design of simple databases with a relatively small number
of attributes. The normalisation process (as discussed in Chapter 10)
represents a bottom-up approach to database design. The process of
generalising entity types into higher-level generalised superclasses is another
example of a bottom-up approach. The bottom-up approach has limitations
when applied to the design of more complex databases with a large number
of attributes, where it is difficult to establish all the functional dependencies
between the attributes.

Top-down database design approach: The top-down database design
approach starts with the development of data models (or schemas) that
contain high-level abstractions (entities and relationships). Successive
top-down refinements are then applied to identify lower-level entities,
relationships and the associated attributes. The E-R model is an example of
the top-down approach and is more suitable for the design of complex
databases. The process of specialisation to refine an entity type into
subclasses is another example of a top-down database design approach.

Inside-out database design approach: The inside-out database design
approach starts with the identification of a set of major entities and then
spreads out to consider other entities, relationships and attributes associated
with those first identified. The inside-out database design approach is a
special case of the bottom-up approach, in which attention is first focussed
on a central set of concepts that are most evident and then spread outward by
considering others in the vicinity of the existing ones.

Mixed strategy database design approach: The mixed strategy
database design approach uses both the bottom-up and the top-down
approach for different parts of the data model, instead of following any
single approach, before finally combining all the parts together. In this case,
the requirements are partitioned according to a top-down approach, and part
of the schema is designed for each partition according to a bottom-up
approach. Finally, all the schema parts are combined together.
Fig. 8.7 illustrates the different phases involved in a good database design
and includes the following main steps:
Data requirements collection and analysis.
Conceptual database design.
DBMS selection.
Logical database design.
Physical database design.
Prototyping.
Database implementation and tuning.
Fig. 8.7 Database design phases

8.3.1.1 Database Requirement Analysis


It is the process of a detailed analysis of the expectations of the users and of
the intended uses of the database. It is a time-consuming but important phase
of database design. This step includes the following activities:
Collection and analysis of current data processing.
Study of current operating environment and planned use of the information.
Collection of written responses to sets of questions from the potential database users or user
groups to know the users’ priorities and importance of applications.
Analysis of general business functions and their database needs.
Justification of the need for new databases in support of the business.

8.3.1.2 Conceptual Database Design


Conceptual database design may be defined as the process of the following:
Analysing overall data requirements of the proposed information system of an organisation.
Defining the data elements for inclusion in the database, the relationships between them and
the value constraints. The value constraint is a rule defining the permissible values for a
specific data item.
Constructing a model of the information used in an enterprise, independent of all physical
considerations.

As shown in Fig. 8.7, the conceptual database design stage involves two parallel activities
namely:

i. Conceptual schema design, which examines the data requirements and produces a
conceptual database schema, and
ii. Transaction and application design, which examines the database applications and
produces high-level specifications for them.
To carry out the conceptual database design, the database administrator (DBA) group
consists of members with expertise in design concepts as well as skills for working with
user groups. They design portions of the database, called views, which are intended for
use by the user groups. These views are integrated into a complete database schema,
which defines the logical structure of the entire database, as illustrated in Fig. 2.6 of
Chapter 2, Section 2.3. The conceptual design process also resolves conflicts between
different user groups by negotiation and by establishing reasonable control as to which
groups can access which data.

The conceptual database design is independent of implementation details
such as the hardware platform, application programs, programming
languages, target DBMS software or any physical considerations. A
high-level data model such as the E-R model or the EER model is often used
during this phase to produce a conceptual schema of the database. The
conceptual data model is often said to follow a top-down approach, which is
driven from a general understanding of the business area and not from
specific information processing activities. The conceptual database design
step includes the following:
Identification of the scope of database requirements for the proposed information system.
Analysis of overall data requirements for business functions.
Development of a preliminary conceptual data model, including major entities and relationships.
Development of a detailed conceptual data model, including all entities, relationships, attributes
and business rules.
Making the conceptual data models consistent with other models of the information system.
Population of the repository with all conceptual database specifications.
Specification of the functional characteristics of the database transactions by identifying their
input/output and functional behaviour.

8.3.1.3 DBMS Selection


The purpose of selecting a particular DBMS is to meet the current and
expanding future requirements of the enterprise. The selection of an
appropriate DBMS to support the database application is governed by the
following two main factors:
Technical factors.
Economic factors.

The technical factors are concerned with the suitability of the DBMS for
the intended task to be performed. They include issues such as:
Types of DBMS, such as relational, hierarchical, network, object-oriented, object-relational
and so on.
The storage structures.
Access paths that the DBMS supports.
User and programmer interfaces available.
Types of high-level query languages.
Availability of development tools.
Ability to interface with other DBMS via standard interfaces.
Architectural options related to client-server operation.

The economic factors are concerned with the costs of the DBMS product
and consider the following issues:
Costs of additional hardware and software required to support the database system.
Purchase cost of basic DBMS software and other products such as language options, different
interface options such as forms, menu and Web-based graphic user interface (GUI) tools,
recovery and backup options, special access methods, documentation and so on.
Cost associated with the changeover.
Cost of staff training.
Maintenance cost.
Cost of database creation and conversion.
8.3.1.4 Logical Database Design
The logical database design may be defined as the process of the following:
Translating the conceptual schema and external schemas produced in the conceptual database
design stage from the high-level data model into the data model of the selected DBMS.
Organising the data fields into non-redundant groupings based on the data relationship and an
initial arrangement of those logical groupings into structures based on the nature of the
DBMS and the applications that will use the data.
Constructing a model of the information used in an enterprise based on a specific data model,
but independent of a particular DBMS and other physical considerations.

The logical database design is dependent on the choice of the database
model that is used. In the logical design stage, the conceptual data model is
first translated into the internal model, that is, a standard notation called
relations, based on relational database theory (as explained in Chapter 4).
Then, a detailed review of the transactions, reports, displays and inquiries
is performed. This approach is called bottom-up analysis, and the exact data
to be maintained in the database and the nature of these data as needed for
each transaction, report, display and so on are verified. Finally, the
combined and reconciled data specifications are transformed into basic or
atomic elements following the well-established rules of relational database
theory and the normalisation process for well-structured data specifications.
Thus, the logical database design transforms the DBMS-independent
conceptual model into a DBMS-dependent model.
There are several techniques for performing logical database design, each
with its own emphasis and approach. The logical database design step
includes the following:
Detailed analysis of transactions, forms, displays and database views (inquiries).
Integrating database views into conceptual data model.
Identifying data integrity and security requirements, and population of repository.
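As a small illustration of the translation described above, the sketch below shows a conceptual entity DEPARTMENT, an entity EMPLOYEE and a 1:N relationship between them mapped into relations for a relational target DBMS. The table, column and constraint names and data types are hypothetical:

-- Hypothetical logical design: the DEPARTMENT and EMPLOYEE entities,
-- with a 1:N 'works-in' relationship, mapped into relations.
CREATE TABLE DEPARTMENT (
    DEPT_NO INT PRIMARY KEY,
    D_NAME  VARCHAR(40) NOT NULL UNIQUE
);

CREATE TABLE EMPLOYEE (
    EMP_ID   INT PRIMARY KEY,
    EMP_NAME VARCHAR(40) NOT NULL,
    DEPT_NO  INT REFERENCES DEPARTMENT(DEPT_NO)  -- the 1:N relationship becomes a foreign key
);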

8.3.1.5 Physical Database Design


The physical database design may be defined as the process of the following:
Deriving the physical structure of the database and refitting the derived structures to conform
to the performance and operational idiosyncrasies of the DBMS, guided by the application’s
processing requirements.
Selecting the data storage and data access characteristics of the database.
Producing a description of the implementation of the database on secondary storage.
Describing the base relations, file organisations and indexes used to achieve efficient access
to the data and any associated integrity constraints and security measures.

In physical database design, the physical schema is designed, guided by the
nature of the data and its intended use. As user requirements evolve, the
physical schema is tuned or adjusted to achieve good performance. During
the physical database design phase, specifications of the stored database
(internal schema) are designed in terms of physical storage structures, record
placement and indexes. This phase decides what access methods will be used
to retrieve data and what indexes will be built to improve the performance of
the system. The physical database design is done in close coordination with
the design of all other aspects of the physical information system, such as
computer hardware, application software, operating systems, data
communication networks and so on. Thus, the physical database design
translates the logical design into a hardware-
dependent model. The physical database design step includes the following:
Defining database to DBMS.
Deciding on physical organisation of data.
Designing database processing programs.
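A typical physical design decision of this kind is the choice of indexes for frequently executed queries. The following is a minimal sketch, assuming the hypothetical EMPLOYEE relation from the logical design stage and standard CREATE INDEX syntax (details vary by DBMS):

-- Hypothetical physical design choices for the EMPLOYEE relation.

-- A secondary index to speed up the frequent lookup of employees by department.
CREATE INDEX IDX_EMPLOYEE_DEPT ON EMPLOYEE (DEPT_NO);

-- A composite index chosen because a common report filters on department
-- and sorts by employee name.
CREATE INDEX IDX_EMPLOYEE_DEPT_NAME ON EMPLOYEE (DEPT_NO, EMP_NAME);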

8.3.1.6 Prototyping
Prototyping is a rapid method of interactively building a working model of
the proposed database application. It is one of the rapid application
development (RAD) methods to design a database system. RAD is an
iterative process of rapidly repeating analysis, design and implementation
steps until the system fulfils the user requirements. Therefore, prototyping is
an iterative process of database system development in which the user
requirements are converted to a working system that is continually revised
through close work between the database designer and the users.
A prototype does not normally have all the required features and
functionality of the final system. It basically allows users to identify the
features of the proposed system that work well, or are inadequate and if
possible to suggest improvements or even new features to the database
application.
Fig. 8.8 shows the prototyping steps. With the increasing use of visual
programming tools such as Java, Visual Basic, Visual C++ and fourth
generation languages, it has become very easy to modify the interface
between the system and the user while prototyping. Prototyping has the following
advantages:
Relatively inexpensive.
Quick to build.
Easy to change the contents and layout of user reports and displays.
With changing needs and evolving system requirements, the prototype database can be
rebuilt.

Fig. 8.8 Prototyping steps

8.3.1.7 Database Implementation and Tuning


Database implementation and tuning is carried out by the DBA in
conjunction with the database designer. In this phase, the application
programs for processing of databases are written, tested and installed. The
programs for generating reports and displays can be written in standard
programming languages like COBOL, C++, Visual Basic or Visual C++, in
special database processing languages like SQL, or in special-purpose
non-procedural languages. The language statements of the selected DBMS in
the data definition language (DDL) and storage definition language (SDL)
are compiled and used to create the database schema and empty database
files. During the database implementation and testing phase, the database
and application programs are implemented, tested, tuned and eventually
deployed for service. The various transactions and applications are tested
individually as well as in conjunction with each other. The database is then
populated (loaded) with the data from existing files and databases of legacy
applications and with the newly identified data. Finally, all database
documentation is completed, procedures are put in place and users are
trained. The database implementation and tuning stage includes the
following:
Coding and testing of database programs.
Documentation of the complete database and users’ training materials.
Installation of the database.
Conversion of data from earlier systems.
Fixing of errors in the database and database applications.

As can be seen from the above discussions, database design is an iterative
process, which has a starting point and an almost endless procession of
refinements. Relational database management systems (RDBMSs) provide a
few tools to assist with physical database design and tuning. For example,
Microsoft SQL Server has a tuning wizard that makes suggestions on indexes
to create. The wizard also suggests dropping an index when the addition of
other indexes makes the maintenance cost of the index outweigh its benefits
on queries. Similarly, IBM DB2 V6, Oracle and others have tuning wizards
that make recommendations on global parameters, suggest adding/deleting
indexes and so on.
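The kind of recommendation such wizards produce usually translates into ordinary DDL statements. The example below is only a sketch of the resulting tuning actions; the index and table names are hypothetical, and the exact DROP INDEX syntax differs between products:

-- Add an index that benefits several frequently run queries.
CREATE INDEX IDX_ORDERS_CUSTOMER ON ORDERS (CUSTOMER_ID);

-- Drop an index whose maintenance cost on updates outweighs its benefit
-- to queries (DROP INDEX syntax varies by DBMS).
DROP INDEX IDX_ORDERS_OLD_STATUS;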
8.4 AUTOMATED DESIGN TOOLS

Whether the target database is an RDBMS or an object-oriented RDBMS,
the overall database design activity has to follow a systematic process, called
a design methodology, predominantly spanning the conceptual design,
logical design and physical design stages, as discussed above. In the early
days, database design was carried out manually by experienced and
knowledgeable database designers.

8.4.1 Limitations of Manual Database Design


Difficulty in dealing with the increased number of alternative designs to model the same
information for rapidly evolving applications and more complex data, in terms of relationships
and constraints.
The ever-increasing size of databases, and of their entity types and relationship types, making
the task of managing manual designs almost impossible.

8.4.2 Computer-aided Software Engineering (CASE) Tools


The limitations of manual database design gave rise to the development of
Computer-aided Software Engineering (CASE) tools for database design, in
which the various design methodologies are implicit. CASE tools are
software that provides automated support for some portion of the systems
development process. They help the database administration (DBA) staff
carry out the database development activities as efficiently and effectively as
possible. They help the designer to draw data models using
entity-relationship (E-R) diagrams, ensure consistency across diagrams,
generate code and so on.
CASE tools may be categorised into the following three application levels
to automate various stages of development life cycle:
Upper-CASE tools.
Lower-CASE tools.
Integrated-CASE tools.

Fig. 8.9 illustrates the three application levels of CASE tools with respect
to the database development life cycle (DDLC) of Fig. 8.6. Upper-CASE
tools support the initial stages of the DDLC, that is, from feasibility study
and requirement analysis through database design. Lower-CASE tools
support the later stages of the DDLC, that is, from database implementation
through testing to monitoring and maintenance. Integrated-CASE tools
support all stages of the DDLC and provide the functionality of both
upper-CASE and lower-CASE tools in one tool.
Facilities provided by CASE Tools: CASE tools provide the following
facilities to the database designer:
Create a data dictionary to store information about the database application’s data.
Design tools to support data analysis.
Tools to permit development of the corporate data model and the conceptual and logical data
models.
Help in drawing conceptual schema diagrams using entity-relationship (E-R) and other
notations such as entity types, relationship types, attributes, keys and so on.
Generation of schemas (or code) in SQL DDL for various RDBMSs through model mapping and
implementation algorithms.
Decomposition and normalisation.
Indexing.
Tools to enable prototyping of applications.
Performance monitoring and measurement.
Fig. 8.9 Application levels of CASE Tools

8.4.2.1 Characteristics of CASE Tools


Good database design CASE tools have the following characteristics:
An easy-to-use graphical, point-and-click interface.
Analytical components for performing tasks such as evaluation of physical design
alternatives and detection of conflicting constraints among views.
Heuristic components to evaluate design alternatives.
Trade-off comparative analysis for choosing from multiple alternatives.
Display of design results, such as schemas in diagrammatic form, multiple design layouts
and so on.
Design verification to verify whether the resulting design satisfies the initial requirements.
8.4.2.2 Benefits of CASE Tools
Following are the benefits of CASE tools:
Improved efficiency of the development process.
Improved effectiveness of the development system.
Reduced time and cost of realising a database application.
Increased satisfaction level of the users of the database.
Enforcement of standards on software across the organisation.
Integration of all parts of the system.
Improved documentation.
Consistency checking.
Automatic transformation of parts of the design specification into executable code.

Some of the popular CASE tools being used for database design are
shown in Table 8.1.
Table 8.1 Popular CASE tools for database design

REVIEW QUESTIONS
1. What is software development life cycle (SDLC)? What are the different phases of a SDLC?
2. What is the cost impact of frequent software changes? Explain.
3. What is structured system analysis and design (SSAD)? Explain.
4. What do you mean by database development life cycle (DDLC)? When does DDLC start?
5. What are the various stages of DDLC? Explain each of them.
6. What are the different approaches of database design? Explain each of them.
7. What are the different phases of database design? Discuss each phase.
8. Discuss the relationship between the SDLC and DDLC.
9. Write short notes on the following:
a. Conceptual database design
b. Logical database design
c. Physical database design
d. Prototyping
e. CASE tools
f. DBMS selection.
g. Database implementation and tuning

10. Which of the different phases of database design are considered the main activities of the
database design process itself? Why?
11. Consider an actual application of a database system for an off-shore software development
company. Define the requirements of the different levels of users in terms of data needed,
types of queries and transactions to be processed.
12. What functions do the typical automated database design tools provide?
13. What are the limitations of manual database design?
14. Discuss the main purpose and activities associated with each phase of the DDLC.
15. Compare and contrast the various phases of database design.
16. Identify the stage where it is appropriate to select a DBMS and describe an approach to
selecting the best DBMS for a particular use.
17. Describe the main advantages of using a prototyping approach when building a database
application.
18. What are computer-aided software engineering (CASE) tools?
19. What are the facilities provided by CASE tools?
20. What should be the characteristics of right CASE tools?
21. List the various types of CASE tools and their functions provided by different vendors.

STATE TRUE/FALSE

1. Database design is a process of arranging the data fields into an organised structure needed
by one or more applications.
2. Frequent software changes, without a change in specifications, are an indication of a good
design.
3. Maintenance is an extremely time-consuming and expensive phase of the software process.
4. The relative cost of fixing a fault at later phases is less as compared to fixing the fault at the
early phases of the software process.
5. Database requirement phase of database design is the process of detailed analysis of the
expectations of the users and intended uses of the database.
6. The conceptual database design is dependent on specific DBMS.
7. The physical database design is independent of any specific DBMS.
8. The bottom-up approach is appropriate for the design of simple databases with a relatively
small number of attributes.
9. The top-down approach is appropriate for the design of simple databases with a relatively
small number of attributes.
10. The top-down database design approach starts at the fundamental level of abstractions.
11. The bottom-up database design approach starts with the development of data models that
contains high-level abstractions.
12. The inside-out database design approach uses both the bottom-up and top-down approach
instead of following any particular approach.
13. The mixed strategy database design approach starts with the identification of set of major
entities and then spreading out.
14. The objective of developing a prototype of the system is to demonstrate the understanding of
the user requirements and the functionality that will be provided in the proposed system.

TICK (✓) THE APPROPRIATE ANSWER

1. The organised database structure gives advantages such as

a. data redundancy
b. data independence
c. data security
d. all of these.

2. Structured system analysis and design (SSAD) is a software engineering approach to the
specification, design, construction, testing and maintenance of software for

a. maximizing the reliability and maintainability of the system


b. reducing software life-cycle costs
c. both (a) and (b)
d. none of these.

3. The database development life cycle (DDLC) is to meet strategic or operational information
needs of an organisation and is a process of

a. designing
b. implementing
c. maintaining
d. all of these.

4. The physical database design is the process of

a. deriving the physical structure of the database.


b. creating a conceptual and external schemas for the high-level data model.
c. analysing overall data requirements.
d. none of these.

5. Which of the following is the SDLC phase that starts after the software is released into use?

a. integration and testing


b. development
c. maintenance
d. none of these.
6. SDLC stands for

a. software development life cycle


b. software design life cycle
c. scientific design of linear characteristics
d. structured distributed life cycle.

7. DDLC stands for

a. distributed database life cycle


b. database development life cycle
c. direct digital line connectivity
d. none of these.

8. The logical database design is the process of

a. deriving the physical structure of the database.


b. creating a conceptual and external schemas for the high-level data model.
c. analysing overall data requirements.
d. none of these.

9. The conceptual database design is the process of

a. deriving the physical structure of the database.


b. creating a conceptual and external schemas for the high-level data model.
c. analysing overall data requirements.
d. none of these.

10. Which of the following design is both hardware and software independent?

a. physical database design


b. logical database design
c. conceptual database design
d. none of these.

11. The bottom-up database design approach starts at the

a. fundamental level of attributes.


b. development of data models that contains high-level abstractions.
c. identification of set of major entities.
d. all of these.

12. The top-down database design approach starts at the

a. fundamental level of attributes.


b. development of data models that contains high-level abstractions.
c. identification of set of major entities.
d. all of these.
13. Which database design method transforms DBMS-independent conceptual model into a
DBMS dependent model?

a. conceptual
b. logical
c. physical
d. none of these.

14. The inside-out database design approach starts at the

a. fundamental level of attributes.


b. development of data models that contains high-level abstractions.
c. identification of set of major entities.
d. all of these.

FILL IN THE BLANKS

1. Database design is a process of arranging the _____ into an organised structure needed by
one or more _____.
2. Frequent software changes, without change in specifications, is an indication of a _____
design.
3. Structured system analysis and design (SSAD) is a software engineering approach to the
specification, design, construction, testing and maintenance of software for (a) maximising
the _____ and _____ of the system as well as for reducing _____.
4. The _____ is the main source of database development projects.
5. The four database design approaches are (a) _____, (b) _____, (c) _____ and (d) _____.
6. The bottom-up database design approach starts at the _____.
7. The top-down database design approach starts at the _____.
8. The inside-out database design approach starts at the _____.
9. It is in the _____ phase that the system design objectives are defined.
10. In the _____ phase, the conceptual database design is translated into internal model for the
selected DBMS.
Chapter 9

Functional Dependency and Decomposition

9.1 INTRODUCTION

As explained in the earlier chapters, the purpose of database design is to
arrange the corporate data fields into an organised structure such that it
generates a set of relationships and stores information without unnecessary
redundancy. In fact, redundancy and database consistency are the most
important logical criteria in database design. A bad database design may
result in repetitive data and information and an inability to represent desired
information. It is, therefore, important to examine the relationships that exist
among the data of an entity to refine the database design.
In this chapter, functional dependency and decomposition concepts are
discussed as a means to achieve minimum redundancy without compromising
the easy data and information retrieval properties of the database.

9.2 FUNCTIONAL DEPENDENCY (FD)

A functional dependency (FD) is a property of the information represented
by a relation. It defines the most commonly encountered type of relatedness
property between the data items of a database. Usually, relatedness between
the attributes (columns) of a single relational table is considered. An FD
concerns the dependence of the values of one attribute or set of attributes on
those of another attribute or set of attributes. In other words, it is a constraint
between two attributes or two sets of attributes. An FD is a property of the
semantics or meaning of the attributes in a relation. The semantics indicate
how attributes relate to one another and specify the FDs between attributes.
The database designers use their understanding of the semantics of the
attributes of a relation R to specify the FDs that should hold on all relation
states (extensions) r of R. Whenever the semantics of two sets of attributes in
relation R indicate that an FD should hold, the dependency is specified as a
constraint. Thus, the main use of functional dependencies is to describe a
relation schema R further by specifying constraints on its attributes that must
hold at all times. Certain FDs can be specified without referring to a specific
relation, but as a property of those attributes. An FD cannot be inferred
automatically from a given relation extension r but must be defined
explicitly by the database designer who knows the semantics of the attributes
of R.
Functional dependency allows the database designer to express facts about
the enterprise that the designer is modelling with the enterprise databases. It
allows the designer to express constraints, which cannot be expressed with
superkeys.
Functional dependency is a term derived from mathematical theory, which
states that for every element of one attribute (appearing on some row), there
is a unique corresponding element of another attribute (on the same row).
Let us assume that the rows (tuples) of a relational table T are represented by
the notation r1, r2, ……, and the individual attributes (columns) of the table
are represented by the letters A, B, ……… The letters X, Y, ……… represent
subsets of attributes. Thus, as per mathematical theory, for a given table T
containing at least two attributes A and B, we can say that A → B. The arrow
notation ‘→’ is read as “functionally determines”. Thus, we can say that A
functionally determines B, or B is functionally dependent on A. In other
words, given two rows r1 and r2 in table T, if r1(A) = r2(A) then r1(B) =
r2(B).
Fig. 9.1 Graphical depiction of functional dependency

Fig. 9.1 illustrates a graphical representation of the functional dependency


concept. As shown in Fig. 9.1 (a), A functionally determines B. Each value
of A corresponds to only one value of B. However, in Fig. 9.1 (b), A does not
functionally determine B. Some values of A correspond to more than one
value of B.
Therefore, in general terms, it can be stated that a set of attributes (subset)
X in a relational model table T is said to be functionally dependent on a set
of attributes (subset) Y in the table T if a given set of values for each attribute
in Y determines a unique (only one) value for the set of attributes in X. The
notation used to denote that X is functionally dependent on Y is Y → X.
X is said to be functionally dependent on Y only if each Y-value in the
relation (or table) T has exactly one X-value in T associated with it.
Therefore, whenever two tuples (rows or records) of the relational table T
have the same Y-value, and the functional dependency holds, they must also
agree on their X-values.
The attributes in the subset Y are sometimes known as the determinant of
FD: Y → X. The left-hand side of the functional dependency is called the
determinant, whereas the right-hand side is called the dependent. The
determinant and the dependent are both sets of attributes.
A functional dependency is a many-to-one relationship between two sets
of attributes X and Y of a given table T, where X and Y are subsets of the set of
attributes of T. Thus, the functional dependency X → Y is said to hold
in T if and only if, whenever two tuples (rows or records) of T have
the same value of X, they also have the same value for Y.
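The “same X-value implies same Y-value” test can be illustrated with a short supplementary sketch in Python (the representation of the relation as a list of rows and the sample data are illustrative assumptions, not taken from the text):

# A minimal sketch: test whether the FD X -> Y appears to hold in a relation
# extension (a list of rows). A relation extension can only refute an FD; the
# FD itself is a semantic constraint declared by the database designer.

def fd_holds(rows, X, Y):
    """Return True if no two rows agree on X but disagree on Y."""
    seen = {}                       # maps an X-value to the first Y-value seen
    for row in rows:
        x_val = tuple(row[a] for a in X)
        y_val = tuple(row[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False            # two tuples agree on X but not on Y
        seen[x_val] = y_val
    return True

# Illustrative data in the spirit of the BUDGET relation of Example 1 below.
budget = [
    {"PROJECT": "P1", "PROJECT-BUDGET": "INR 100 CR"},
    {"PROJECT": "P2", "PROJECT-BUDGET": "INR 150 CR"},
    {"PROJECT": "P1", "PROJECT-BUDGET": "INR 100 CR"},
]
print(fd_holds(budget, ["PROJECT"], ["PROJECT-BUDGET"]))   # prints True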

9.2.1 Functional Dependency Diagram and Examples


In a functional dependency diagram (FDD), functional dependency is
represented by rectangles representing attributes and a heavy arrow showing
dependency. Fig. 9.2 shows a functional dependency diagram for the
simplest functional dependency, that is, FD: Y → X.

Fig. 9.2 Functional dependency diagrams

Example 1

Let us consider a functional dependency of relation R1: BUDGET, as shown


in Fig. 9.3 (a), which is given as:

FD: {PROJECT} → {PROJECT-BUDGET}


Fig. 9.3 Example 1

It means that in the BUDGET relation (or table), PROJECT-BUDGET is


functionally dependent on PROJECT, because each project has one given
budget value. Thus, once a project name is known, a unique value of
PROJECT-BUDGET is also immediately known. Fig. 9.3 (b) shows the
functional dependency diagram (FDD) for this example.

Example 2

Let us consider a functional dependency that there is one person working on


a machine each day, which is given as:

FD: {MACHINE-NO, DATE-USED}→ {PERSON-ID}

It means that once the values of MACHINE-NO and DATE-USED are


known, a unique value of PERSON-ID also can be known. Fig. 9.4 (a)
shows the functional dependency diagram (FDD) for this example.
Fig. 9.4 FDD of Example 2

Similarly, in the above example, if the person also uses one machine each
day, then FD can be given as:

FD: {PERSON-ID, DATE-USED} → {MACHINE-NO}

It means that once the values of PERSON-ID and DATE-USED are


known, a unique value of MACHINE-NO also can be known. Fig. 9.4 (b)
shows the functional dependency diagram (FDD) for this example.

Example 3

Let us consider a functional dependency of relation R2: ASSIGN, as shown


in Fig. 9.5 (a), which is given as:

FD: {EMP-ID, PROJECT} → {YRS-SPENT-BY-EMP-ON-PROJECT}

It means that in the ASSIGN relation (or table), once the values of EMP-ID
and PROJECT are known, a unique value of YRS-SPENT-BY-EMP-ON-PROJECT
can also be known. Fig. 9.5 (b) shows the functional
dependency diagram (FDD) for this example.
Fig. 9.5 Example 3

Example 4

Let us consider a functional dependency of relation R3: BOOK_ORDER, as


shown in Fig. 9.6. Fig. 9.6 satisfies several functional dependencies, which
can be given as:

Fig. 9.6 Relation BOOK_ORDER

FD: {BOOK-ID, CITY-ID} → {BOOK-NAME}


FD: {BOOK-ID, CITY-ID} → {QTY}
FD: {BOOK-ID, CITY-ID} → {QTY, BOOK-NAME}
In the above examples, it means that in a BOOK_ORDER relation (or
table), once the values of BOOK-ID and CITY-ID are known, a unique value
of BOOK-NAME and QTY also can be known.
Fig. 9.7 shows the functional dependency diagrams (FDD) for this
example.

Fig. 9.7 FDDs for Example 4

In Example 4, when the right-hand side of the functional dependency is a


subset of the left-hand side, it is called trivial dependency. An example of
trivial dependency can be given as:

FD: {BOOK-ID, CITY-ID} → {BOOK-ID}

Trivial dependencies are satisfied by all relations. For example, X → X is


satisfied by all relations involving attribute X. In general, a functional
dependency of the form X → Y is trivial if Y ⊆ X. Trivial dependencies are
not useful in practice and are eliminated to reduce the size of the set of
functional dependencies.

Example 5
A number of (or all) functional dependencies can be represented on one
functional dependency diagram (FDD). In this case FDD contains one entry
for each attribute and shows all functional dependencies between attributes.
Fig. 9.8 shows a functional dependency diagram with a number of functional
dependencies.
As shown in FDD of Fig. 9.8, suppliers make deliveries to warehouses.
One of the attributes, WAREHOUSE-NAME, identifies a warehouse. Each
warehouse has one WAREHOUSE-ADDRESS. An attribute QTY-IN-
STORE-ON-DATE is determined by the combination of attributes
WAREHOUSE-NAME, INVENTORY-DATE and PART-NO. This is an
example of a technique for modelling of time variations by functional
dependencies.
Another technique used by functional dependencies is modelling
composite identifiers. As shown in Fig. 9.8, a delivery is identified by a
combination of SUPPLIER-NAME and DELIVERY-NO within the supplier.
The QTY-DELIVERED of a particular part is determined by the
combination of this composite identifier, that is, SUPPLIER-NAME and
DELIVERY-NO.
Fig. 9.8 also shows one-to-one dependencies, in which each warehouse
has one manager and each manager manages one warehouse. These
dependencies are modelled by the double arrow between WAREHOUSE-
NAME and MANAGER-NAME, showing that following arguments are
true:

FD: {WAREHOUSE-NAME} → {MANAGER-NAME}


FD: {MANAGER-NAME} → {WAREHOUSE-NAME}
Fig. 9.8 FDD for a number of FDs

9.2.2 Full Functional Dependency (FFD)


The term full functional dependency (FFD) is used to indicate the minimum
set of attributes in a determinant of a functional dependency (FD). In other
words, the set of attributes X will be fully functionally dependent on the set
of attributes Y if the following conditions are satisfied:
X is functionally dependent on Y and
X is not functionally dependent on any proper subset of Y.

In relation ASSIGN of Fig. 9.9, it is true that

FD: {EMP-ID, PROJECT, PROJECT-BUDGET} → {YRS-SPENT-BY-


EMP-ON-PROJECT}

The values of EMP-ID, PROJECT and PROJECT-BUDGET determine a


unique value of YRS-SPENT-BY-EMP-ON-PROJECT. However, it is not a
full functional dependency because neither EMP-ID → YRS-SPENT-BY-EMP-ON-PROJECT
nor PROJECT → YRS-SPENT-BY-EMP-ON-PROJECT holds true. In fact, it is sufficient to know only the value of a
subset of {EMP-ID, PROJECT, PROJECT-BUDGET}, namely {EMP-ID,
PROJECT}, to determine YRS-SPENT-BY-EMP-ON-PROJECT.
Thus, the correct full functional dependency (FFD) can be written as:

FD: {EMP-ID, PROJECT} → {YRS-SPENT-BY-EMP-ON-PROJECT}

It is to be noted that, like FD, FFD is a property of the information


represented by the relation. It is not an indication of the way that attributes
are formed into relations or the current contents of the relations.

Fig. 9.9 Relation BUDGET and ASSIGN
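A data-level check of full functional dependency can be sketched along the same lines: the FD must hold for the whole determinant, while no proper subset of the determinant may already determine the dependent. The Python sketch below is a supplementary illustration; the sample rows only mimic the shape of the ASSIGN relation, and the attribute YRS abbreviates YRS-SPENT-BY-EMP-ON-PROJECT.

from itertools import combinations

def fd_holds(rows, X, Y):
    # True if no two rows agree on X but disagree on Y.
    seen = {}
    for row in rows:
        x_val = tuple(row[a] for a in X)
        y_val = tuple(row[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False
        seen[x_val] = y_val
    return True

def full_fd(rows, X, Y):
    """X -> Y holds and no proper subset of X already determines Y."""
    if not fd_holds(rows, X, Y):
        return False
    for k in range(1, len(X)):
        for sub in combinations(X, k):
            if fd_holds(rows, list(sub), Y):
                return False       # a smaller determinant suffices, so not full
    return True

assign = [  # illustrative extension in the shape of the ASSIGN relation
    {"EMP-ID": 1, "PROJECT": "P1", "PROJECT-BUDGET": 100, "YRS": 5},
    {"EMP-ID": 1, "PROJECT": "P2", "PROJECT-BUDGET": 150, "YRS": 2},
    {"EMP-ID": 2, "PROJECT": "P1", "PROJECT-BUDGET": 100, "YRS": 3},
]
print(full_fd(assign, ["EMP-ID", "PROJECT", "PROJECT-BUDGET"], ["YRS"]))  # False
print(full_fd(assign, ["EMP-ID", "PROJECT"], ["YRS"]))                    # True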

9.2.3 Armstrong’s Axioms for Functional Dependencies


Issues such as non-redundant sets of functional dependencies and complete
sets (closures) of functional dependencies must be understood for a good
relational design. Both issues arise because new FDs can be derived from
existing FDs.

For example, if X→Y


and Y→Z
then it is also true that X→Z
This derivation is obvious, because, if a given value of X determines a
unique value of Y and this value of Y in turn determines a unique value of Z,
the value of X will also determine this value of Z. Conversely, it is possible
for a set of FDs to contain some redundant FDs.
Let us assume that we are given a table T and that all sets of attributes X,
Y, Z are contained in the heading of T. Then the following is a set of inference
rules, called Armstrong’s axioms, used to derive new FDs from given FDs:

Rule 1 Reflexivity (inclusion): If Y ⊆ X, then X → Y.
Rule 2 Augmentation: If X → Y, then XZ → YZ.
Rule 3 Transitivity: If X → Y and Y → Z, then X → Z.

Fig. 9.10 illustrates a diagrammatical representation of the above three


Armstrong’s axioms.
From Armstrong’s axioms, a number of other rules of implication among
FDs can be proved. Again, let us assume that all sets of attributes W, X, Y, Z
are contained in the heading of a table T. Then the following additional rules
can be derived from Armstrong’s axioms:

Rule 4 Self-determination: X → X.
Rule 5 Pseudo-transitivity: If X → Y and YW → Z, then XW → Z.
Rule 6 Union or additive: If X → Z and X → Y, then X → YZ.
Rule 7 Decomposition or projective: If X → YZ, then X → Y and X → Z.
Rule 8 Composition: If X → Y and Z → W, then XZ → YW.
Rule 9 Self-accumulation: If X → YZ and Z → W, then X → YZW.
Fig. 9.10 Diagrammatical representation of Armstrong’s axioms
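As an illustration of how the derived rules follow from the three axioms, the union rule (Rule 6) can be proved in a few steps using only augmentation and transitivity (a supplementary worked derivation, shown here in LaTeX notation):

\begin{align*}
&\text{Given } X \to Y \text{ and } X \to Z:\\
&1.\quad X \to Y \;\Rightarrow\; X \to XY  &&\text{(augment both sides with } X)\\
&2.\quad X \to Z \;\Rightarrow\; XY \to YZ &&\text{(augment both sides with } Y)\\
&3.\quad X \to XY \text{ and } XY \to YZ \;\Rightarrow\; X \to YZ &&\text{(transitivity)}
\end{align*}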

9.2.4 Redundant Functional Dependencies


A functional dependency in the set is redundant if it can be derived from the
other functional dependencies in the set. A redundant FD can be detected
using the following steps and set of rules discussed in the previous section:

Step 1: Start with a set S of functional dependencies (FDs).
Step 2: Remove an FD f and create a set of FDs S’ = S − f.
Step 3: Test whether f can be derived from the FDs in S’ by
using Armstrong’s axioms and the derived rules,
as discussed in the previous section.
Step 4: If f can be so derived, it is redundant and S’ is
equivalent to S; continue with S’. Otherwise, put f
back so that S’ = S once again.
Step 5: Repeat steps 2 to 4 for all FDs in S.

Armstrong’s axioms and derived rules, as discussed in the previous


section, can be used to find redundant FDs. For example, suppose the
following set of FDs is given in the algorithm:

Z→A B→X AX → Y ZB → Y

Because ZB → Y can be derived from other FDs in the set, it can be


shown to be redundant. The following argument can be given:
a. Z → A by augmentation rule will yield ZB → AB.
b. B → X and AX → Y by pseudo-transitivity rule will yield AB → Y.
c. ZB → AB and AB → Y by transitivity rule will yield ZB → Y.

An algorithm (called membership algorithm) can be developed to find


redundant FDs, that is, to determine whether an FD f (A → B) can be derived
from a set of FDs S. Fig. 9.11 illustrates the steps and the logics of the
algorithm.
Using the algorithm of Fig. 9.11, following set of FDs can be checked for
the redundancy, as shown in Fig. 9.12.

Z→A B→X AX → Y ZB → Y
Fig. 9.11 Membership algorithm to find redundant FDs
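The membership test of Fig. 9.11 can also be pictured with a small supplementary Python sketch based on attribute closure: an FD f: A → B is derivable from S’ = S − f exactly when B is contained in the closure of A computed using only the FDs in S’. The helper functions and the single-letter attribute encoding below are illustrative assumptions, not the book’s algorithm verbatim.

# Minimal sketch of the membership test: the FD A -> B is redundant in a set
# S of FDs if B is contained in the attribute closure of A under S - {A -> B}.

def closure(attrs, fds):
    """Attribute closure of 'attrs' under fds (each FD is a (lhs, rhs) pair of frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_redundant(fd, fds):
    lhs, rhs = fd
    rest = [g for g in fds if g != fd]          # S' = S - f
    return rhs <= closure(lhs, rest)            # can f be derived from S'?

F = lambda l, r: (frozenset(l), frozenset(r))   # small helper for readability
S = [F("Z", "A"), F("B", "X"), F("AX", "Y"), F("ZB", "Y")]

print(is_redundant(F("ZB", "Y"), S))            # True: ZB -> Y is derivable
print(is_redundant(F("Z", "A"), S))             # False

Running the sketch on the set Z → A, B → X, AX → Y, ZB → Y reports ZB → Y as redundant, agreeing with the derivation given above.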

9.2.5 Closures of a Set of Functional Dependencies


A closure of a set (also called a complete set) of functional dependencies
defines all the FDs that can be derived from a given set of FDs. Given a set
F of FDs on the attributes of a table T, the closure of F, denoted F+, is
defined as the set of all FDs implied by F.
Armstrong’s axioms can be used to develop an algorithm that allows
computing F+ from F.
Let us consider the set of F of FDs given by

F = {A → B, B → C, C → D, D → E, E → F, F → G, G → H}

Now by transitivity rule of Armstrong’s axioms,


A → B and B → C together imply A → C, which must be included in F+.
Also, B → C and C → D together imply B → D.
In fact, every attribute appearing before the last one in the
sequence A B C D E F G H can be shown, by the transitivity rule, to functionally
determine every attribute to its right in the sequence. Trivial FDs such
as A → A are also present.
Now by union rule of Armstrong’s axioms, other FDs can be generated
such as A → A B C D E F G H. All FDs derived above are contained in F+.
To be sure that all possible FDs have been derived by applying the axioms
and rules, an algorithm similar to membership algorithm of Fig. 9.12 (b), can
be developed. Fig. 9.13 illustrates such an algorithm to compute a certain
subset of the closure. In other words, for a given set F of attributes of table T
and a set S of FDs that hold for T, the set of all attributes of T that are
functionally dependent on F is called the closure F+ of F under S.
Fig. 9.12 Finding redundancy using membership algorithm

Fig. 9.13 Computing closure F+ of F under S

Let us consider the functional dependency diagram (FDD) of relation


schema EMP_PROJECT, as shown in Fig. 9.14. From the semantics of the
attributes, we know that the following functional dependencies should hold:

FD: {EMPLOYEE-NO} → {EMPLOYEE-NAME}


FD: {EMPLOYEE-NO, PROJECT-NO} → {HOURS-SPENT}
FD: {PROJECT-NO} → {PROJECT-NAME, PROJECT-LOCATION}

Now, from the semantics of attributes, following set F of FDs can be


specified that should hold on EMP_PROJECT:

F = {EMPLOYEE-NO → EMPLOYEE-NAME,
PROJECT-NO → {PROJECT-NAME, PROJECT-LOCATION},
{EMPLOYEE-NO, PROJECT-NO} → HOURS-SPENT}

Fig. 9.14 Functional dependency diagram of relation EMP_PROJECT

Using algorithm of Fig. 9.13, the closure sets with respect to F can be
calculated as follows:

{EMPLOYEE-NO}+ = {EMPLOYEE-NO, EMPLOYEE-NAME}
{PROJECT-NO}+ = {PROJECT-NO, PROJECT-NAME, PROJECT-LOCATION}
{EMPLOYEE-NO, PROJECT-NO}+ = {EMPLOYEE-NO, PROJECT-NO, EMPLOYEE-NAME, PROJECT-NAME, PROJECT-LOCATION, HOURS-SPENT}.
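The closure computation of Fig. 9.13 can be illustrated for EMP_PROJECT with a short supplementary Python sketch (the attribute names are abbreviated and the helper function is an illustration under that assumption):

# Illustration of attribute closure for EMP_PROJECT (abbreviations:
# ENO = EMPLOYEE-NO, ENAME = EMPLOYEE-NAME, PNO = PROJECT-NO,
# PNAME = PROJECT-NAME, PLOC = PROJECT-LOCATION, HOURS = HOURS-SPENT).

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

F = [
    (["ENO"], ["ENAME"]),
    (["PNO"], ["PNAME", "PLOC"]),
    (["ENO", "PNO"], ["HOURS"]),
]

print(sorted(closure(["ENO"], F)))          # ['ENAME', 'ENO']
print(sorted(closure(["PNO"], F)))          # ['PLOC', 'PNAME', 'PNO']
print(sorted(closure(["ENO", "PNO"], F)))   # all six attributes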

9.3 DECOMPOSITION
A functional decomposition is the process of breaking down the functions of
an organisation into progressively greater (finer and finer) levels of detail. In
decomposition, one function is described in greater detail by a set of other
supporting functions. In other words, decomposition breaks a relation into
smaller relations so that the data model can be brought into normal forms
and redundancies avoided. The decomposition of a relation schema R consists of
replacing the relation schema by two or more relation schemas that each
contain a subset of the attributes of R and together include all attributes in R.
The algorithm of relational database design starts from a single universal
relation schema R= {A1, A2, A3…, An}, which includes all the attributes of
the database. The universal relation assumption states that every attribute name is
unique. Using the functional dependencies, the design algorithms decompose
the universal relation schema R into a set of relation schemas D = {R1 ,R2,
R3,… Rm}. Now, D becomes the relational database schema and D is called
a decomposition of R.
The decomposition of a relation schema R = {A1, A2, A3, …, An} is its
replacement by a set of relation schemas D = {R1, R2, R3, …, Rm}, such that

Ri ⊆ R for 1 ≤ i ≤ m
and R1 ∪ R2 ∪ R3 ∪ … ∪ Rm = R

Decomposition helps in eliminating some of the problems of bad design


such as redundancy, inconsistencies and anomalies. When required, the
database designer (DBA) decides to decompose an initial set of relation
schemes.
Let us consider the relation STUDENT_INFO, as shown in Fig. 9.15 (a).
Now, this relation is replaced with the following three relation schemes:

STUDENT (STUDENT-NAME, PHONE-NO, MAJOR-SUBJECT)


TRANSCRIPT (STUDENT-NAME, COURSE-ID, GRADE)
FACULTY (COURSE-ID, PROFESSOR)
Fig. 9.15 Decomposition of relation STUDENT-INFO into STUDENT, TRANSCRIPT and
FACULTY

(a) Relation STUDENT_INFO

(b) Relation STUDENT


The first relation scheme STUDENT stores only once the phone number
and major subject of each student. Any change in the phone number will
require a change in only one tuple (row) of this relation. The second relation
scheme TRANSCRIPT stores the grade of each student in each course in
which the student is enrolled. The third relation scheme FACULTY stores
the professor of each course that is taught to the students. Fig. 9.15 (b), (c)
and (d) illustrates the decomposed relation schemes STUDENT,
TRANSCRIPT and FACULTY respectively.

9.3.1 Lossy Decomposition


One of the disadvantages of decomposition into two or more relational
schemes (or tables) is that some information is lost during retrieval of
original relation or table. Let us consider the relation scheme (or table) R(A,
B, C) with functional dependencies A → B and C → B as shown in Fig. 9.16.
The relation R is decomposed into two relations, R1(A, B) and R2(B, C).
If the two relations R1 and R2 are now joined, the join will contain rows in
addition to those in R. It can be seen in Fig. 9.16 that this is not the original
table content for R (A, B, C). Since it is difficult to know what table content
was started from, information has been lost by the above decomposition and
the subsequent join operation. This phenomenon is known as a lossy
decomposition, or lossy-join decomposition. Thus, the decomposition of
R(A, B, C) into R1 and R2 is lossy when the join of R1 and R2 does not yield
the same relation as R. That means neither B → A nor B → C is true.
Now, let us consider that relation scheme STUDENT_INFO, as shown in
Fig. 9.15 (a) is decomposed into the following two relation schemes:

STUDENT (STUDENT-NAME, PHONE-NO, MAJOR-SUBJECT,


GRADE)
COURSE (COURSE-ID, PROFESSOR)

The above decomposition is a bad decomposition for the following


reasons:
There is redundancy and update anomaly, because the data for the attributes PHONE-NO and
MAJOR-SUBJECT (657-2145063, Computer Graphics) are repeated.
There is loss of information, because the fact that a student has a given grade in a particular
course, is lost.
Fig. 9.16 Lossy decomposition

9.3.2 Lossless-Join Decomposition


A relational table is decomposed (or factored) into two or more smaller
tables, in such a way that the designer can capture the precise content of the
original table by joining the decomposed parts. This is called lossless-join
(or non-additive join) decomposition. The decomposition of R(X, Y, Z) into
R1(X, Y) and R2(X, Z) is lossless if, for the attributes X common to both R1 and
R2, either X → Y or X → Z holds.
All decompositions must be lossless. The word loss in lossless refers to
the loss of information. The lossless-join decomposition is always defined
with respect to a specific set F of dependencies. A decomposition D≡{R1,
R2,R3,…, Rm} of R is said to have the lossless-join property with respect to
the set of dependencies F on R if, for every relation state r of R that satisfies
F, the following relation holds:

⋈ (∏R1(r), ∏R2(r), …, ∏Rm(r)) = r

where ∏ denotes projection and ⋈ denotes the natural join of all relations in D.

The lossless-join decomposition is a property of decomposition, which


ensures that no spurious tuples are generated when a natural join operation is
applied to the relations in the decomposition.
Let us consider the relation scheme (or table) R (X, Y, Z) with functional
dependencies YZ → X, X → Y and X → Z, as shown in Fig. 9.17. The
relation R is decomposed into two relations, R1 and R2 that are defined by
following two projections:

R1 = projection of R over X, Y
R2 = projection of R over X, Z

where X is the set of common attributes in R1 and R2.


The decomposition is lossless if R = join of R1 and R2 over X
and the decomposition is lossy if R ⊂ join of R1 and R2 over X.
It can be seen in Fig. 9.17 that the join of R1 and R2 yields the same
number of rows as does R. The decomposition of R (X, Y, Z) into R1 (X, Y)
and R2 (X, Z) is lossless if for attributes X, common to both R1 and R2, either
X → Y or X → Z. Thus, in the example of Fig. 9.16, the common attribute of R1
and R2 is B, but neither B → A nor B → C holds; hence the decomposition
is lossy. In Fig. 9.17, however, the decomposition is lossless because for the
common attribute X, both X → Y and X → Z hold.

Fig. 9.17 Lossless decomposition
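The lossy and lossless behaviour of Figs 9.16 and 9.17 can be reproduced with a small supplementary Python sketch that projects a relation onto two attribute sets, natural-joins the projections on their common attributes and compares the result with the original relation. The sample rows below are illustrative and are not copied from the figures.

# Minimal sketch: decompose r into projections on attrs1 and attrs2, natural-join
# them on the common attributes, and compare the result with the original relation.

def project(rows, attrs):
    return {tuple((a, row[a]) for a in attrs) for row in rows}

def natural_join(p1, p2, common):
    joined = set()
    for t1 in p1:
        d1 = dict(t1)
        for t2 in p2:
            d2 = dict(t2)
            if all(d1[a] == d2[a] for a in common):
                merged = {**d1, **d2}
                joined.add(tuple(sorted(merged.items())))
    return joined

def is_lossless(rows, attrs1, attrs2):
    common = [a for a in attrs1 if a in attrs2]
    joined = natural_join(project(rows, attrs1), project(rows, attrs2), common)
    original = {tuple(sorted(row.items())) for row in rows}
    return joined == original

# Illustrative data in the spirit of Fig. 9.16: A -> B and C -> B hold, but the
# common attribute B determines neither A nor C, so the decomposition is lossy.
r = [{"A": "a1", "B": "b1", "C": "c1"},
     {"A": "a2", "B": "b1", "C": "c2"}]
print(is_lossless(r, ["A", "B"], ["B", "C"]))   # False (spurious tuples appear)

# Data in the spirit of Fig. 9.17: X -> Y and X -> Z, so joining on X is lossless.
s = [{"X": "x1", "Y": "y1", "Z": "z1"},
     {"X": "x2", "Y": "y1", "Z": "z2"}]
print(is_lossless(s, ["X", "Y"], ["X", "Z"]))   # True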

9.3.3 Dependency-Preserving Decomposition


Dependency preservation is another property of a
decomposed relational database schema D, in which each functional
dependency X → Y specified in F either appears directly in one of the
relation schemas Ri of the decomposition D or can be inferred from the
dependencies that appear in some Ri. A decomposition D = {R1, R2, R3, …,
Rm} of R is said to be dependency-preserving with respect to F if the union
of the projections of F on each Ri in D is equivalent to F. In other words,

(∏R1(F) ∪ ∏R2(F) ∪ … ∪ ∏Rm(F))+ = F+

The dependencies are preserved because each dependency in F represents


a constraint on the database. If decomposition is not dependency-preserving,
some dependency is lost in the decomposition.

REVIEW QUESTIONS
1. What do you mean by functional dependency? Explain with an example and a functional
dependency diagram.
2. What is the importance of functional dependencies in database design?
3. What are the main characteristics of functional dependencies?
4. Describe Armstrong’s axioms. What are derived rules?
5. Describe how a database designer typically identifies the set of FDs associated with a
relation.
6. A relation schema R (A, B, C) is given, which represents a relationship between two entity
sets with primary key A and B respectively. Let us assume that R has the FDs A → B and B
→ A, amongst others. Explain what such a pair of dependencies means about the relationship
in the database model?
7. What is a functional dependency diagram? Explain with an example.
8. Draw a functional dependency diagram (FDD) for the following:

a. The attribute ITEM-PRICE is determined by the attributes ITEM-NAME and the


SHOP in which the item is sold.
b. A PERSON occupies a POSITION in an organisation. The PERSON starts in a
POSITION at a given START-TIME and relinquishes it at a given END-TIME. At
the most, one POSITION can be occupied by one person at a given time.
c. A TASK is defined within a PROJECT. TASK is a unique name within the project.
TASK-START is the start time of the TASK and TASK-COST is its cost.
d. There can be any number of PERSONs employed in a DEPARTMENT, but each
PERSON is assigned to one DEPARTMENT only.
9. A legal instance of relation schema S has the following three tuples (rows) over the three
attributes A, B, C: (1, 2, 3), (4, 2, 3) and (5, 3, 3).

a. Which of the following dependencies can you infer do not hold over schema S?
i. A → B
ii. BC → A
iii. B → C.

b. Identify dependencies, if any, that hold over S?

10. Describe the concept of full functional dependency (FFD).


11. Let us assume that the following is given:

Attribute set R = ABCDEFGH


FD set of F = {AB → C, AC → B, AD → E, B → D, BC → A, E → G}

Which of the following decompositions of R = ABCDEG, with the same set of dependencies
F, is
(a) dependency-preserving and (b) lossless-join?

a. {AB, BC, ABDE, EG}


b. {ABC, ACDE, ADG}.

12. What is the dependency preservation property for decomposition? Why is it important?
13. Let R be decomposed into R1, R2,…, Rn and F be a set of functional dependencies (FDs) on
R. Define what it means for F to be preserved in the set of decomposed relations.
14. A relation R having three attributes ABC is decomposed into relations R1 with attributes AB
and R2 with attributes BC. State the definition of lossless-join decomposition with respect to
this example, by writing a relational algebra equation involving R, R1, and R2.
15. What is the lossless or non-additive join property of decomposition? Why is it important?
16. The following relation is given:

LIBRARY-USERS (UNIVERSITY, CAMPUS, LIBRARY, STUDENT)

A university can have any number of campuses. Each campus has one library. Each library is
on one campus. Each library has a distinct name. A student is at one university only and can
use the libraries at some, but not all, of the campuses.

Which of the following decompositions of LIBRARY-USERS are lossless?

Decomposition 1
R1 (UNIVERSITY, CAMPUS, LIBRARY)
R2 (STUDENT, UNIVERSITY)
Decomposition 2
R1 (UNIVERSITY, CAMPUS, LIBRARY)
R2 (STUDENT, LIBRARY)
17. Consider the relation SUPPLIES given as:

SUPPLIES (SUPPLIER, PART, CONTRACT, QTY)


CONTRACT → PART
PART → SUPPLIER
SUPPLIER, CONTRACT → QTY

Now the above relation is decomposed into the following two relations:

SUPPLIERS (SUPPLIER, PART, QTY)


CONTRACTS (CONTRACT, PART)

a. Explain whether any information has been lost in this decomposition.


b. Explain whether any information has been lost if the decomposition is changed to

SUPPLIERS (CONTRACT, PART, QTY)


CONTRACTS (CONTRACT, SUPPLIER).

18. What do you mean by trivial dependency? What is its significance in database design?
19. What are redundant functional dependencies? Explain with an example. Discuss the
membership algorithm to find redundant FDs.
20. What do you mean by the closure of a set of functional dependencies? Discuss how
Armstrong’s axioms can be used to develop algorithm that will allow computing F+ from F.
21. Illustrate the three Armstrong’s axioms using diagrammatical representation.
22. A relation R(A, B, C, D) is given. For each of the following sets of FDs, assuming they are
the only dependencies that hold for R, state whether or not the proposed decomposition of R
into smaller relations is a good decomposition. Briefly explain your answer why or why not.

a. B → C, D → A; decomposed into BC and AD.


b. AB → C, C → A, C → D; decomposed into ACD and BC.
c. A → BC, C → AD, decomposed into ABC and AD.
d. A → B, B → C, C → D; decomposed into AB, AD and CD.
e. A → B, B → C, C → D; decomposed into AB and ACD.

23. Consider the relation R (ABCD) and the FDs {A → B, C → D, A → E}. Is the decomposition
of R into (ABC), (BCD) and (CDE) lossless?
24. Remove any redundant FDs from the following sets of FDs:

Set 1: A → B, B → C, AD → C
Set 2: XY → V, ZW → V, VX → Y, W → Y, Z → X
Set 3: PQ → R, PS → Q, QS → P, PR → Q, S → R.

25. The following are sets of FDs:

Set 1: A → BC, AC → Z, Z → BV, AB → Z


Set 2: P → RST, VRT → SQP, PS → T, Q → TR, QS → P,
SR → V
Set 3: KM → N, K → LM, LN → K, MP → K, P → N
a. Examine each for non-redundancy.
b. Identify for any redundant FDs.

26. Consider that there are the following requirements for a university database to keep track of
students’ transcripts:

a. The university keeps track of each student’s name (STDT-NAME), student number
(STDT-NO), social security number (SS-NO), present address (PREST-ADDR),
permanent address (PERMT-ADDR), present contact number (PREST-CONTACT-
NO), permanent contact number (PERMT- CONTACT-NO), date of birth (DOB),
sex (SEX), class (CLASS) for example fresh, graduate and so on, major department
(MAJOR-DEPT), minor department (MINOR-DEPT) and degree program (DEG-
PROG) for example, BA, BS, PH.D and so on. Both SS-NO and STDT-NO have
unique values for each student.
b. Each department is described by a name (DEPT-NAME), department code (DEPT-
CODE), office number (OFF-NO), office phone (OFF-PHONE) and college
(COLLEGE). Both DEPT-NAME and DEPT-CODE have unique values for each
department.
c. Each course has a course name (COURSE-NAME), description (COURSE-DESC),
course number (COURSE-NO), credit for number of semester hours (CREDIT),
level (LEVEL) and course offering department (COURSE-DEPT). The COURSE-
NO is unique for each course.
d. Each section has a faculty (FACULTY-NAME), semester (SEMESTER), year
(YEAR), section course (SEC-COURSE) and section number (SEC-NO). The SEC-
NO distinguishes different sections of the same course that are taught during the
same semester/year. The values of SEC-NO are 1, 2, 3,…, up to the total number of
sections taught during each semester.
e. A grade record refers to a student (SS-NO), a particular section (SEC-NO), and a
grade (GRADE).
i. Design a relational database schema for this university database application.
ii. Specify the key attributes of each relation.
iii. Show all the FDs that should hold among attributes.
Make appropriate assumptions for any unspecified requirements to render the specification
complete.

STATE TRUE/FALSE

1. A functional dependency (FD) is a property of the information represented by the relation.


2. Functional dependency allows the database designer to express facts about the enterprise that
the designer is modelling with the enterprise databases.
3. A functional dependency is a many-to-many relationship between two sets of attributes X and
Y of a given table T.
4. The term full functional dependency (FFD) is used to indicate the maximum set of attributes
in a determinant of a functional dependency (FD).
5. A functional dependency in the set is redundant if it can be derived from the other functional
dependencies in the set.
6. A closure of a set (also called complete sets) of functional dependency defines all the FDs
that can be derived from a given set of FDs.
7. A functional decomposition is the process of breaking down the functions of an organisation
into progressively greater (finer and finer) levels of detail.
8. The word loss in lossless refers to the loss of attributes.
9. The dependencies are preserved because each dependency in F represents a constraint on the
database.
10. If decomposition is not dependency-preserving, some dependency is lost in the
decomposition.

TICK (✓) THE APPROPRIATE ANSWER

1. A functional dependency is a

a. many-to-many relationship between two sets of attributes.


b. one-to-one relationship between two sets of attributes.
c. many-to-one relationship between two sets of attributes.
d. none of these.

2. Decomposition helps in eliminating some of the problems of bad design such as

a. redundancy
b. inconsistencies
c. anomalies
d. all of these.

3. The word loss in lossless refers to the

a. loss of information.
b. loss of attributes.
c. loss of relations.
d. none of these.

4. The dependency preservation decomposition is a property of decomposed relational database


schema D in which each functional dependency X → Y specified in F

a. appeared directly in one of the relation schemas Ri in the decomposed D.


b. could be inferred from the dependencies that appear in some Ri.
c. both (a) and (b).
d. none of these.
5. The set of attributes X will be fully functionally dependent on the set of attributes Y if the
following conditions are satisfied:

a. X is functionally dependent on Y.
b. X is not functionally dependent on any subset of Y.
c. both (a) and (b).
d. none of these.

FILL IN THE BLANKS

1. A _____ is a many-to-one relationship between two sets of _____ of a given relation.


2. The left-hand side and the right-hand side of a functional dependency are called the (a)
_____and the _____ respectively.
3. The arrow notation ‘→’ in FD is read as _____.
4. The term full functional dependency (FFD) is used to indicate the _____ set of attributes in a
_____ of a functional dependency (FD).
5. A functional dependency in the set is redundant if it can be derived from the other _____ in
the set.
6. A closure of a set (also called complete sets) of functional dependency defines all _____ that
can be derived from a given set of _____.
7. A functional decomposition is the process of _____ the functions of an organisation into
progressively greater (finer and finer) levels of detail.
8. The lossless-join decomposition is a property of decomposition, which ensures that no _____
are generated when a _____ operation is applied to the relations in the decomposition.
9. The word loss in lossless refers to the _____.
10. Armstrong’s axioms and derived rules can be used to find _____ FDs.
Chapter 10

Normalization

10.1 INTRODUCTION

Relational database tables derived from ER models or from some other


design method, can suffer from serious problems in terms of performance,
integrity and maintainability. A large database defined as a single table
results in a large amount of redundant data. Storing large numbers of
redundant values can result in lengthy search operations for just a
small number of target rows, and in long and expensive
updates. In other words, such a large collection of values is generally
inefficient, error-prone and difficult to manage. Table 10.1 illustrates the
situation with a single large relation STUDENT_INFO (an example
from Fig. 9.15 (a) of the previous Chapter 9) containing redundant data.
It can be seen in Table 10.1 that the relation STUDENT_INFO is not a
good design. For example, the STUDENT-NAME values “Abhishek” and “Alka” have
repeated (redundant) PHONE-NO information. This data redundancy
or repetition wastes storage space and leads to the loss of data integrity (or
consistency) in the database.
Therefore, the most critical criteria in a database design are redundancy
and database consistency. Data redundancy and database consistency are
interdependent. As explained in previous chapters, minimising redundancy means that
no fact about the data should be stored more than once in the database. A good
database design, with only the minimum redundancy necessary to represent the
semantics of the database, minimises the storage needed to store the database.
Also, with minimum redundancy, queries become efficient and the same
query cannot return different answers.
Table 10.1 Relational table STUDENT_INFO

This chapter focuses on the various stages of accomplishing normalization.


Normal forms are discussed in detail for relational databases, along with the
database design steps to normalise relational tables. This helps in achieving
minimum redundancy without compromising the easy data and information
retrieval properties of the database.

10.2 NORMALIZATION

Normalization is a process of decomposing a set of relations with anomalies


to produce smaller and well-structured relations that contain minimum or no
redundancy. It is a formal process of deciding which attributes should be
grouped together in a relation. Normalization provides the designer with a
systematic and scientific process of grouping of attributes in a relation.
Using normalization, any change to the values stored in the database can be
achieved with the fewest possible update operations.
Therefore, the process of normalization can be defined as a procedure of
successive reduction of a given collection of relational schemas based on
their FDs and primary keys to achieve some desirable form of minimised
redundancy, minimised insertion, minimised deletion and minimised update
anomalies.
A normalised schema has a minimal redundancy, which requires that the
value of no attribute of a database instance is replicated except where tuples
are linked by foreign keys (a set of attributes in one relation that is a key in
another). Normalization serves primarily as a tool for validating and
improving the logical database design, so that the logical design satisfies
certain constraints and avoids unnecessary duplication of data. The process
of normalization provides the following to the database designers:
A formal framework for analysing relation schemas based on their keys and on the functional
dependencies among their attributes.
A series of normal form tests that can be carried out on individual relation schemas so that
the relational database can be normalised to any desired degree.

However, during normalization, it is ensured that a normalised schema


does not lose any information present in the un-normalised schema,
does not include spurious information when the original schema is reconstructed
preserves dependencies present in the original schema.

The process of normalization was first proposed by E.F. Codd.


Normalization is a bottom-up design technique for relational databases.
Therefore, it is difficult to use in large database designs. However, this
technique is still useful in the circumstances mentioned below:
As a different method of checking the properties of design arrived at through EER modelling.
As a technique for reverse engineering a design from an existing undocumented
implementation.

10.3 NORMAL FORMS

A normal form is a state of a relation that results from applying simple rules
regarding functional dependencies (FDs) to that relation. The normal form of a
relation refers to the highest normal form condition that it meets and hence
indicates the degree to which it has been normalised. The normal forms are used
to ensure that various types of anomalies and inconsistencies are not introduced
into the database. For determining whether a particular relation is in normal form or
not, the FDs between the attributes in the relation are examined, and not the
current contents of the relation. C. Beeri and his co-workers first proposed a
notation to emphasise these relational characteristics. They proposed that a
relation is defined as containing two components, namely (a) the attributes and
(b) the FDs between them. It takes the form

R1 = ({X, Y, Z},{X → Y, X → Z})

The first component of the relation R1 is the attributes, and the second
component is the FDs. For example, let us look at the relation ASSIGN of
Table 10.2.

The first component of the relation ASSIGN is

{EMP-NO, PROJECT, PROJECT-BUDGET, YRS-SPENT-BY-EMP-ON-


PROJECT}

The second component of the relation ASSIGN is

{{EMP-NO, PROJECT} → YRS-SPENT-BY-EMP-ON-PROJECT, PROJECT → PROJECT-BUDGET}

Table 10.2 Relation ASSIGN

The FDs between attributes are important when determining the relation’s
key. A relation key uniquely identifies a tuple (row). Hence, the key or prime
attributes uniquely determine the values of the non-key or non-prime
attributes. Therefore, a full FD exists from the prime to the non-prime
attributes. It is with full FDs whose determinants are not keys of a relation
that problems arise. For example, in the relation ASSIGN of Table 10.2, the
key is {EMP-NO, PROJECT}. However, PROJECT-BUDGET depends on
only part of the key. Alternatively, the determinant of the FD: PROJECT →
PROJECT-BUDGET is not the key of the relation. This undesirable property
causes the anomalies. Conversion to normal forms requires a choice of
relations that do not contain such undesirable dependencies. Various types of
normal forms used in relational database are as follows:
First normal form (1NF).
Second normal form (2NF).
Third normal form (3NF).
Boyce/Codd normal form (BCNF).
Fourth normal form (4NF).
Fifth normal form (5NF).

A relational schema is said to be in a particular normal form if it satisfies


a certain prescribed set of conditions, as discussed in subsequent sections.
Fig. 10.1 illustrates the levels of normal forms. As shown in the figure, every
relation is in 1NF. Also, every relation in 2NF is also in 1NF, every relation
in 3NF is also in 2NF and so on.
Initially, E.F. Codd proposed three normal forms, namely 1NF, 2NF and
3NF. Subsequently, BCNF was introduced jointly by R. Boyce and E.F.
Codd. Later, the normal forms 4NF and 5NF were introduced, based on the
concepts of multi-valued dependencies and join dependencies, respectively.
All of these normal forms are based on functional dependencies among the
attributes of a relational table.
Fig. 10.1 Levels of normalization

10.3.1 First Normal Form (1NF)


A relation is said to be in first normal form (1NF) if the values in the domain
of each attribute of the relation are atomic (that is simple and indivisible). In
1NF, all domains are simple and in a simple domain, all elements are atomic.
Every tuple (row) in the relational schema contains only one value of each
attribute and no repeating groups. 1NF data requires that every data entry, or
attribute (field) value, must be non-decomposable. Hence, 1NF disallows
having a set of values, a tuple of values or a combination of both as an
attribute value for a single tuple. 1NF also disallows multi-valued attributes that
are themselves composite; such structures are called “relations within relations”,
nested relations, or “relations as attributes of tuples”.

Example 1
Consider a relation LIVED_IN, as shown in Fig. 10.2 (a), which keeps
records of persons and their residences in different cities. In this relation, the
domain RESIDENCE is not simple. For example, the PERSON value “Abhishek”
can have residences in Jamshedpur, Mumbai and Delhi. Therefore, the relation
is un-normalised. Now, the relation LIVED_IN is normalised by combining
each row in residence with its corresponding value of PERSON and making
this combination a tuple (row) of the relation, as shown in Fig. 10.2 (b).
Thus, now non-simple domain RESIDENCE is replaced with simple
domains.
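The flattening of the non-simple RESIDENCE domain can be pictured with a brief supplementary Python sketch; the city list for “Abhishek” follows the example above, while the second person is an invented illustration.

# Minimal sketch: an un-normalised relation with a multi-valued RESIDENCE
# attribute, flattened into 1NF with one atomic value per row.

unnormalised = {
    "Abhishek": ["Jamshedpur", "Mumbai", "Delhi"],   # from the LIVED_IN example
    "Thomas":   ["Chennai"],                         # illustrative extra person
}

lived_in_1nf = [
    {"PERSON": person, "RESIDENCE": city}
    for person, cities in unnormalised.items()
    for city in cities
]

for row in lived_in_1nf:
    print(row["PERSON"], row["RESIDENCE"])
# Abhishek Jamshedpur
# Abhishek Mumbai
# Abhishek Delhi
# Thomas Chennai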

Example 2

Let us consider another relation PATIENT_DOCTOR, as shown in Table


10.3 which keeps the records of appointment details between patients and
doctors. This relation is in 1NF. The relational table can be depicted as:

PATIENT_DOCTOR (PATIENT-NAME, DATE-OF-BIRTH, DOCTOR-


NAME, CONTACT-NO, DATE-TIME, DURATION-MINUTES)
Fig. 10.2 Relation LIVED-IN

Table 10.3 Relation PATIENT_DOCTOR in 1NF

It can be observed from the relational table that a doctor cannot have two
simultaneous appointments, and thus DOCTOR-NAME and DATE-TIME together
form a compound key. Similarly, a patient cannot have appointments with two
different doctors at the same time. Therefore, the PATIENT-NAME and DATE-TIME
attributes together also form a candidate key.
Problems with 1NF
1NF contains redundant information. For example, the relation PATIENT_DOCTOR in 1NF
of Table 10.3 has the following problems with the structure:

a. A doctor, who does not currently have an appointment with a patient, cannot be
represented.
b. Similarly, we cannot represent a patient who does not currently have an appointment
with a doctor.
c. There is redundant information such as the patient’s date-of-birth and the doctor’s
phone numbers, stored in the table. This will require considerable care while
inserting new records, updating existing records or deleting records to ensure that
all instances retain the correct values.
d. While deleting the last remaining record containing details of a patient or a doctor,
all records of that patient or doctor will be lost.

Therefore, the relation PATIENT_DOCTOR has to be normalised further


by separating the information relating to several distinct entities. Fig. 10.3
shows the functional dependencies diagrams in the PATIENT_DOCTOR
relation. Now, it is clear from the functional dependency diagram that
although the patient’s name, date of birth and the duration of the
appointment are dependent on the key (DOCTOR-NAME, DATE-TIME),
the doctor’s contact number depends on only part of the key (DOCTOR-
NAME).
Fig. 10.3 Functional dependency diagram for relation PATIENT-DOCTOR

10.3.2 Second Normal Form (2NF)


A relation R is said to be in second normal form (2NF) if it is in 1NF and
every non-prime attribute of R is fully functionally dependent on each
relation (primary) key of R. In other words, no attribute of the relation (or
table) should be functionally dependent on only one part of a concatenated
primary key. Thus, 2NF can be violated only when a key is a composite key
or one that consists of more than one attribute. 2NF is based on the concept
of full functional dependency (FFD), as explained in Section 9.2.2, Chapter
9. 2NF is an intermediate step towards higher normal forms. It eliminates the
problems of 1NF.

Example 1

As shown in Fig. 10.3, the partial dependency of the doctor’s contact


number on the key DOCTOR-NAME indicates that the relation is not in
2NF. Therefore, to bring the relation in 2NF, the information about doctors
and their contact numbers have to be separated from information about
patients and their appointments with doctors. Thus, the relation is
decomposed into two tables, namely PATIENT_DOCTOR and DOCTOR, as
shown in Table 10.4. The relational table can be depicted as:

PATIENT_DOCTOR (PATIENT-NAME, DATE-OF-BIRTH,


DOCTOR-NAME, DATE-TIME, DURATION-MINUTES)
DOCTOR (DOCTOR-NAME, CONTACT-NO)

Table 10.4 Relation PATIENT_DOCTOR decomposed into two tables for refinement into 2NF

(a) Relation PATIENT_DOCTOR

Relation: DOCTOR
DOCTOR-NAME CONTACT-NO
Abhishek 657-2145063
Sanjay 651-2214381
Thomas 011-2324567

(b) Relation DOCTOR


Fig. 10.4 shows the functional dependencies diagrams (FDD) of relations
PATIENT_DOCTOR and DOCTOR.
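The refinement of PATIENT_DOCTOR into 2NF can be mimicked by projecting the 1NF rows onto the two decomposed schemas and removing duplicate tuples. The short Python sketch below is a supplementary illustration; the two sample appointments are invented, and only the doctor’s contact number follows Table 10.4 (b).

# Illustrative sketch: projecting the 1NF PATIENT_DOCTOR rows onto the 2NF
# schemas; duplicate (DOCTOR-NAME, CONTACT-NO) pairs collapse into one row.

def project(rows, attrs):
    return sorted({tuple(row[a] for a in attrs) for row in rows})

appointments = [   # small invented sample in the shape of Table 10.3
    {"PATIENT-NAME": "Ravi",  "DATE-OF-BIRTH": "12-01-1990",
     "DOCTOR-NAME": "Sanjay", "CONTACT-NO": "651-2214381",
     "DATE-TIME": "10-05-2005 09:00", "DURATION-MINUTES": 30},
    {"PATIENT-NAME": "Meena", "DATE-OF-BIRTH": "03-07-1985",
     "DOCTOR-NAME": "Sanjay", "CONTACT-NO": "651-2214381",
     "DATE-TIME": "10-05-2005 10:00", "DURATION-MINUTES": 15},
]

patient_doctor = project(appointments, ["PATIENT-NAME", "DATE-OF-BIRTH",
                                        "DOCTOR-NAME", "DATE-TIME",
                                        "DURATION-MINUTES"])
doctor = project(appointments, ["DOCTOR-NAME", "CONTACT-NO"])

print(len(patient_doctor))   # 2 appointment rows survive
print(doctor)                # [('Sanjay', '651-2214381')] -- stored once only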

Example 2

Let us consider another relation ASSIGN as shown in Table 10.2. This


relation is not in 2NF because the non-prime attribute PROJECT-BUDGET
is not fully dependent on the relation (or primary) key EMP-NO and
PROJECT. Here PROJECT-BUDGET is, in fact, fully functionally
dependent on PROJECT, which is a subset of the relation key. Thus, the
relation ASSIGN is decomposed into two relations, namely ASSIGN and
PROJECTS, as shown in Table 10.5. Now, both the relations ASSIGN and
PROJECTS are in 2NF.

Fig. 10.4 FDDs for relations PATIENT-DOCTOR and DOCTOR

Table 10.5 Decomposition of relations ASSIGN into ASSIGN and PROJECTS as 2NF
Relation: ASSIGN
EMP-NO PROJECT YRS-SPENT-BY EMP-ON-PROJECT
106519 P1 5
112233 P3 2
106519 P2 5
123243 P4 10
106519 P3 3
111222 P1 4

(a)

Relation: PROJECT
PROJECT PROJECT-BUDGET
P1 INR 100 CR
P2 INR 150 CR
P3 INR 200 CR
P4 INR 100 CR
P5 INR 150 CR
P6 INR 300 CR

(b)

Let us create a new relation PROJECT_DEPARTMENT by adding


information DEPARTMENT and DEPARTMENT-ADDRESS in the relation
PROJECT of Table 10.5. The new relation PROJECT_DEPARTMENT is
shown in Table 10.6. The functional dependencies between the attributes of
the relation are shown in Fig. 10.5.
Fig. 10.5 Functional dependency diagram for relation PROJECT_DEPARTMENT

As can be seen from Table 10.6 and Fig. 10.5, each project is in
one department, and each department has one address. It is, however,
possible for a department to include more than one project. The relation has
only one relation (primary) key, namely, PROJECT. Both DEPARTMENT
and DEPARTMENT- ADDRESS are fully functionally dependent on
PROJECT. Thus, relation PROJECT_DEPARTMENT is in 2NF.

Table 10.6 Relation PROJECT_DEPARTMENT

Table 10.7 Relation EMPLOYEE_PROJECT_ASSIGNMENT


Example 3

Let us consider another relation


EMPLOYEE_PROJECT_ASSIGNMENT as shown in Table 10.7. This
relation has keys EMP-ID and PROJECT-ID together. Employee’s name
(EMP-NAME) is determined by employee’s identification number (EMP-
ID) and so is functionally dependent on a part of the key. That means, an
attribute EMP-ID of the employee is sufficient to identify the employee’s
name. Thus, the relation is not in 2NF.
This relation EMPLOYEE_PROJECT_ASSIGNMENT has the following
problems:
The employee’s name is repeated in every tuple (row) that refers to an assignment for that
employee.
If the name of the employee changes, every tuple (row) recording an assignment of that
employee must be updated. In other words, it has an update anomaly.
Because of the redundancy, the data might become inconsistent, with different tuples
showing different names for the same employee.
If at some time there are no assignments for the employee, there may be no tuple in which to
keep the employee’s name. In other words, it has insertion anomaly.

Table 10.8 Relations EMPLOYEE and PROJECT_ASSIGNMENT in 2NF

Relation: EMPLOYEE
EMP-ID EMP-NAME
106519 Kumar Abhishek

112233 Thomas Mathew

(a)
Relation: PROJECT_ASSIGNMENT
EMP-ID PROJECT-ID PROJ-START-DATE
106519 P1 20.05.04

112233 P1 11.11.04

106519 P2 03.03.05

123243 P3 12.01.05

112233 P4 30.03.05

(b)

The relation EMPLOYEE_PROJECT_ASSIGNMENT can now be


decomposed into the following two relations, as shown in Table 10.8.

EMPLOYEE (EMP-ID, EMP-NAME)


PROJECT_ASSIGNMENT (EMP-ID, PROJECT-ID, PROJ-START-
DATE)

To bring the relation EMPLOYEE_PROJECT_ASSIGNMENT into 2NF,


it is decomposed into two relations EMPLOYEE and
PROJECT_ASSIGNMENT, as shown in Table 10.8. These decomposed
relations EMPLOYEE and PROJECT_ASSIGNMENT are now in 2NF and
the problems as discussed previously, are eliminated. These decomposed
relations are called the projection of the original relation
EMPLOYEE_PROJECT_ASSIGNMENT. It can be noticed in Table 10.8
that the relation PROJECT_ASSIGNMENT still has five tuples. This is so
because the values for EMP-ID, PROJECT-ID and PROJ-START-DATE,
taken together, were unique. However, in the relation EMPLOYEE, there are
only two tuples, because there were only two unique sets of values for EMP-
ID and EMP-NAME. Thus, data redundancy and the possibility of anomalies
have been eliminated.

Problems with 2NF


As shown in Table 10.4, deleting a record from relation PATIENT_DOCTOR may lose
patient’s details.
Any changes in the details of the patient of Table 10.4, may involve changing multiple
occurrences because this information is still stored redundantly.
As shown in Table 10.6 and Fig. 10.5, a department’s address may be stored more than once,
because there is a functional dependency between non-prime attributes, as DEPARTMENT-
ADDRESS is functionally dependent on DEPARTMENT.

10.3.3 Third Normal Form (3NF)


A relation R is said to be in third normal form (3NF) if the relation R is in
2NF and the non-prime attributes (that is, attributes that are not part of the
primary key) are
mutually independent,
functionally dependent on the primary (or relation) key.

In other words, no attributes of the relation should be transitively


functionally dependent on the primary key. Thus, in 3NF, no non-prime
attribute is functionally dependent on another non-prime attribute. This
means that a relation in 3NF consists of the primary key and a set of
independent nonprime attributes. 3NF is based on the concept of transitive
dependency, as explained in Section 9.2.3, Chapter 9.The 3NF eliminates the
problems of 2NF.

Example 1

Let us again take example of relation PATIENT_DOCTOR, as shown in


Table 10.4 (a). In this relation, there is no dependency between PATIENT-
NAME and DURATION-MINUTES. However, PATIENT-NAME and
DATE-OF-BIRTH are not mutually independent. Therefore, the relation is
not in 3NF. To convert this PATIENT_DOCTOR relation in 3NF, it has to be
decomposed to remove the parts that are not directly dependent on relation
(or primary) key. Though each value of the primary key has a single
associated value of DATE-OF-BIRTH, there is a further dependency, called a
transitive dependency, which links DATE-OF-BIRTH to the primary key only
indirectly, through its dependency on PATIENT-NAME. A functional dependency
diagram is shown in Fig. 10.6. Thus, following three relations are created:

PATIENT (PATIENT-NAME, DATE-OF-BIRTH)


PATIENT_DOCTOR (PATIENT-NAME, DOCTOR-NAME, DATE-
TIME, DURATION-MINUTES)
DOCTOR (DOCTOR-NAME, CONTACT-NO)

Fig. 10.6 Functional dependency diagram for relation PATIENT_DOCTOR
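A simplified 3NF check can be sketched in Python: for every FD X → Y, either X must be a superkey or every attribute in Y must be prime (belong to some candidate key). The sketch below is a supplementary illustration that assumes the candidate keys are supplied by the designer and uses abbreviated attribute names for the PATIENT_DOCTOR example.

# Simplified sketch of a 3NF check: for each FD X -> Y, X must be a superkey or
# every attribute of Y must be prime (belong to some candidate key). Candidate
# keys are assumed to be supplied by the designer.

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def violates_3nf(attributes, fds, candidate_keys):
    prime = {a for key in candidate_keys for a in key}
    bad = []
    for lhs, rhs in fds:
        is_superkey = closure(lhs, fds) >= set(attributes)
        if not is_superkey and not set(rhs) <= prime:
            bad.append((lhs, rhs))
    return bad

# PATIENT_DOCTOR before this 3NF step (names abbreviated): the transitive
# dependency PATIENT-NAME -> DATE-OF-BIRTH is reported as a 3NF violation.
attrs = ["PATIENT", "DOB", "DOCTOR", "DATETIME", "DURATION"]
fds = [
    (["DOCTOR", "DATETIME"], ["PATIENT", "DOB", "DURATION"]),
    (["PATIENT", "DATETIME"], ["DOCTOR", "DOB", "DURATION"]),
    (["PATIENT"], ["DOB"]),
]
keys = [["DOCTOR", "DATETIME"], ["PATIENT", "DATETIME"]]
print(violates_3nf(attrs, fds, keys))   # [(['PATIENT'], ['DOB'])]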

Example 2

Similarly, the information in the relation PROJECT_DEPARTMENT of
Table 10.6 can be represented in 3NF by decomposing it into two relations,
namely, PROJECTS and DEPARTMENT. As shown in Table 10.9, both
these relations PROJECTS and DEPARTMENT are in 3NF and department
addresses are stored once only.

Table 10.9 Decomposition of relation PROJECT_DEPARTMENT into PROJECTS and


DEPARTMENT as 3NF

Relation: PROJECT
PROJECT PROJECT-BUDGET DEPARTMENT
P1 INR 100 CR Manufacturing
P2 INR 150 CR Manufacturing
P3 INR 200 CR Manufacturing
P4 INR 100 CR Training
(a)

Relation: DEPARTMENT
DEPARTMENT DEPARTMENT-ADDRESS
Manufacturing Jamshedpur-1
Training Mumbai-2

(b)

Example 3

In the previous examples of 3NF, only one relation (primary) key has been
used. Conversion into 3NF becomes problematic when the relation has more
than one relation key. Let us consider another relation USE, as shown in
Fig. 10.7 (a). Functional dependency diagram (FDD) of relation USE is
shown in Fig. 10.7 (b).

Fig. 10.7 Relation USE in 3NF

As shown in Fig. 10.7 (a), the relation USE stores the machines used by
both projects and project managers. Each project has one project manager
and each project manager manages one project. Now, it can be observed that
this relation USE has two relation (primary) keys, namely, {PROJECT,
MACHINE} and {PROJ-MANAGER, MACHINE}. The keys overlap
because MACHINE appears in both keys, whereas, PROJECT and PROJ-
MANAGER each appear in one relation key only.
The relation USE of Fig. 10.7 has only one non-prime attribute called,
QTY-USED, which is fully functionally dependent on each of the two
relation keys. Thus, relation USE is in 2NF. Furthermore, as there is only one
non-prime attribute QTY-USED, there can be no dependencies between non-
prime attributes. Thus, the relation USE is also in 3NF.

Problems with 3NF


Since relation USE of Fig. 10.7 has two relation keys that overlap because MACHINE is
common to both, the relation has following undesirable properties:

The project manager of each project is stored more than once.


A project’s manager cannot be stored until the project has ordered some machines.
A project cannot be entered unless that project’s manager is known.
If a project’s manager changes, several tuples (rows) must also be changed.

There is a dependency between PROJECT and PROJ-MANAGER, both of which appear in one
relation key only. This dependency leads to redundancy.

10.4 BOYCE-CODD NORMAL FORM (BCNF)

To eliminate the problems and redundancy of 3NF, R.F. Boyce proposed a


normal form known as Boyce-Codd normal form (BCNF). A relation R is said
to be in BCNF if, for every FD X → Y that holds in R, either of the following is true:
X is a superkey of R, or
X → Y is a trivial FD, that is, Y ⊆ X.

In other words, a relation must only have candidate keys as determinants.


Thus, to find whether a relation is in BCNF or not, the FDs within the relation
are examined. If the only determinants are candidate keys, the
relation is in BCNF.
Any relation in BCNF is also in 3NF and consequently in 2NF. However,
a relation in 3NF is not necessarily in BCNF. The BCNF is a simpler form of
3NF and eliminates the problems of 3NF. The difference between 3NF and
BCNF is that for a functional dependency A → B, 3NF allows this
dependency in a relation if B is a prime (candidate-key) attribute even though A is not a
candidate key, whereas BCNF insists that for this dependency to remain in
a relation, A must be a candidate key. Therefore, BCNF is a stronger form of
3NF, such that every relation in BCNF is also in 3NF.

Example 1

Relation USE in Fig. 10.7 (a) does not satisfy the above condition, as it
contains the following two functional dependencies:

PROJ-MANAGER → PROJECT
PROJECT → PROJ-MANAGER

But neither PROJ-MANAGER nor PROJECT is a super key.


Now, the relation USE can be decomposed into the following two BCNF
relations:

USE (PROJECT, MACHINE, QTY-USED)


PROJECTS (PROJECT, PROJ-MANAGER)

Both of the above relations are in BCNF. The only FD between the USE
attributes is

PROJECT, MACHINE → QTY-USED

and (PROJECT, MACHINE) is a super key.


The two FDs between the PROJECTS attributes are

PROJECT → PROJ-MANAGER
PROJ-MANAGER → PROJECT

Both PROJECT and PROJ-MANAGER are super keys of relation


PROJECTS and PROJECTS is in BCNF.
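The BCNF condition (every determinant of a nontrivial FD must be a superkey) can likewise be tested with attribute closure. The supplementary Python sketch below applies this test to the USE relation before decomposition; the helper is an illustration, not the book’s procedure.

# Sketch of a BCNF test: every nontrivial FD X -> Y must have a determinant X
# that is a superkey (its closure covers all attributes of the relation).

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violations(attributes, fds):
    bad = []
    for lhs, rhs in fds:
        if set(rhs) <= set(lhs):                       # trivial FD, ignore
            continue
        if not closure(lhs, fds) >= set(attributes):   # determinant is not a superkey
            bad.append((lhs, rhs))
    return bad

# Relation USE before decomposition: PROJECT and PROJ-MANAGER determine
# each other, but neither is a superkey of the whole relation.
attrs = ["PROJECT", "PROJ-MANAGER", "MACHINE", "QTY-USED"]
fds = [
    (["PROJECT"], ["PROJ-MANAGER"]),
    (["PROJ-MANAGER"], ["PROJECT"]),
    (["PROJECT", "MACHINE"], ["QTY-USED"]),
]
print(bcnf_violations(attrs, fds))
# [(['PROJECT'], ['PROJ-MANAGER']), (['PROJ-MANAGER'], ['PROJECT'])]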
Example 2

Let us consider another relation PROJECT_PART, as shown in Table


10.10. The relation is given as:

PROJECT_PART (PROJECT-NAME, PART-CODE, VENDOR-NAME,


QTY)

This table lists the projects, the parts, the quantities of those parts they use
and the vendors who supply these parts. There are two assumptions. Firstly,
each project is supplied with a specific part by only one vendor, although a
vendor can supply that part to more than one project. Secondly, a vendor
makes only one part but the same part can be made by other vendors. The
primary key of the relation PROJECT_PART is the combination of PROJECT-NAME and
PART-CODE. However, another, overlapping, candidate key is present in the
concatenation of the VENDOR-NAME (assumed unique for all vendors)
and PROJECT-NAME (assumed unique for all projects) attributes. These
would also uniquely identify each tuple of the relation (table).

Table 10.10 Relation PROJECT_PART

The relation PROJECT_PART is in 3NF, since there are no transitive FDs


on the prime key. However, it is not in BCNF because the attribute
VENDOR-NAME is a determinant of PART-CODE (a vendor makes only
one part) but is not a candidate key. As a result of this, the relation can give rise to anomalies. For
example, if the bottom tuple (row) is updated because “John” replaces
“Abhishek” as the supplier of part “bca” to project “P2”, then the
information that “Abhishek” makes part “bca” is lost from the database. If a
new vendor becomes a part supplier, this fact cannot be recorded in the
database until the vendor is contracted to a project. There is also an element
of redundancy present in that “Thomas”, for example, is shown twice as
making part “abc”. Decomposing this single relation PROJECT_PART into
two relations PROJECT_VENDOR and VENDOR_PART solves the
problem. The decomposed relations are given as:

PROJECT_VENDOR (PROJECT-NAME, VENDOR-NAME, QTY)


VENDOR_PART (VENDOR-NAME, PART-CODE)

10.4.1 Problems with BCNF


Even if a relation is in 3NF or BCNF, undesirable internal dependencies are exhibited with
dependencies between elements of compound keys composed of three or more attributes.
The potential to violate BCNF may occur in a relation that:

i. contains two or more composite candidate keys, and

ii. has candidate keys that overlap, that is, share at least one attribute in common.

Let us consider a relation PERSON_SKILL, as shown in Table 10.11. This relation contains
the following:

a. The SKILL-TYPE possessed by each person. For example, “Abhishek” has “DBA”
and “Quality Auditor” skills.
b. The PROJECTs to which a person is assigned. For example, “John” is assigned to
projects “P1” and “P2”.
c. The MACHINEs used on each project. For example, “Excavator”, “Shovel” and
“Drilling” are used on project “P1”.
Table 10.11 Relation PERSON_SKILL

There are no FDs between attributes of relation PERSON_SKILL and yet


there is a clear relationship between them. Thus, relation PERSON_SKILL
contains many undesirable characteristics, such as:
i. The fact that “Abhishek” has both “DBA” and “Quality Auditor” skills is stored a number of
times.
ii. The fact that “P1” uses “Excavator”, “Shovel” and “Drilling” machines is also stored a number of times.

For most purposes 3NF, or preferably BCNF, is considered to be sufficient


to minimise problems arising from update, insertion, and deletion anomalies.
Fig. 10.8 illustrates various actions taken to convert an un-normalised
relation into various normal forms.

10.5 MULTI-VALUED DEPENDENCIES (MVD) AND FOURTH NORMAL FORM (4NF)

To deal with the problem of BCNF, R. Fagin introduced the idea of multi-
valued dependency (MVD) and the fourth normal form (4NF). A multi-
valued dependency (MVD) is a functional dependency where the dependency
may be to a set and not just a single value. It is defined as X→→Y in relation
R (X, Y, Z), if each X value is associated with a set of Y values in a way that
does not depend on the Z values. Here X and Y are both subsets of R. The
notation X→→Y is used to indicate that a set of attributes of Y shows a
multi-valued dependency (MVD) on a set of attributes of X.

Fig. 10.8 Actions to convert un-normalized relation into normal forms


Thus, informally, MVDs occur when two or more independent multi-
valued facts about the same attribute occur within the same relation. There
are two important things to be noted in this definition of MVD. Firstly, in
order for a relation to contain an MVD, it must have three or more attributes.
Secondly, it is possible to have a table containing two or more attributes which are inter-dependent multi-valued facts about another attribute. This alone does not give rise to an MVD. For the relation to contain an MVD, the multi-valued attributes must be independent of each other.
Functional dependency (FD) concerns itself with the case where one
attribute is potentially a ‘single-value fact’ about another. Multi-valued
dependency (MVD), on the other hand, concerns itself with the case where
one attribute value is potentially a ‘multi-valued fact’ about another.

Example 1

Let us consider a relation STUDENT_BOOK, as shown in Table 10.12.


The relation STUDENT_BOOK lists students (STUDENT-NAME), the text
books (TEXT-BOOK) they have borrowed from library, the librarians
(LIBRARIAN) issuing the books and the month and year (MONTH-YEAR)
of borrowing. It contains three multi-valued facts about students, namely, the
books they have borrowed, the librarians who have issued these books to
them and the month and year upon which the books were borrowed.

Table 10.12 Relation STUDENT_BOOK


However, these multi-valued facts are not independent of each other.
There is clearly an association between librarians, the text books they have
issued and the month and year upon which they have issued the books.
Therefore, there are no MVDs in the relation. Also, there is no redundant
information in this relation. The fact that student “Thomas”, for example,
has borrowed the book “Database Management” is recorded twice, but these
are different borrowings, one in “May, 04” and the other in “Oct, 04” and
therefore constitute different items of information.

Table 10.13 Relation COURSE_STUDENT_BOOK

Relation: COURSE_STUDENT_BOOK
COURSE STUDENT-NAME TEXT-BOOK
Computer Engg Thomas Database Management
Computer Engg Thomas Software Engineering
Computer Engg John Database management
Computer Engg John Software Engineering
Electronics Engg Thomas Digital Electronics
Electronics Engg Thomas Pulse Theory
MCA Abhishek Computer Networking
MCA Abhishek Data Communication

Now, let us consider another relation COURSE_STUDENT_BOOK, as


shown in Table 10.13. This relation involves courses (COURSE) being
attended by the students, students (STUDENT-NAME) taking the courses
and text books (TEXT-BOOKS) applicable for the courses. The text books
are prescribed by the authorities for each course, that is, the students have no
say in the matter. Clearly, the attributes STUDENT-NAME and TEXT-
BOOK give multi-valued facts about the attribute COURSE. However, since
a student has no influence over the text books to be used for a course, these
multi-valued facts about courses are independent of each other. Thus, the
relation COURSE_STUDENT_BOOK contains an MVD. Because of this MVD, it contains a high degree of redundant information, unlike the
STUDENT_BOOK relation example of Table 10.12. For example, the fact
that the student “Thomas” attends the “Computer Engg” course, is recorded
twice, as are the text books prescribed for that course.
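Anticipating the 4NF treatment in Section 10.5.2, this redundancy can be removed by keeping the two independent multi-valued facts about COURSE in separate relations. A minimal SQL sketch is given below (table and column names are adapted with underscores, and column types are assumed for illustration):

-- Each independent multi-valued fact about a course is now stored only once.
CREATE TABLE COURSE_STUDENT (
    COURSE        VARCHAR(30),
    STUDENT_NAME  VARCHAR(30),
    PRIMARY KEY (COURSE, STUDENT_NAME)
);

CREATE TABLE COURSE_BOOK (
    COURSE     VARCHAR(30),
    TEXT_BOOK  VARCHAR(60),
    PRIMARY KEY (COURSE, TEXT_BOOK)
);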
The formal definition of MVD specifies that, given a particular value of X,
the set of values of Y determined by this value of X is completely determined
by X alone and does not depend on the values of the remaining attributes Z
of R. Hence, whenever two tuples (rows) exist that have distinct values of Y
but the same value of X, these values of Y must be repeated in separate tuples
with every distinct value of Z that occurs with that same value of X. Unlike
FDs, MVDs are not properties of the information represented by relations. In
fact, they depend on the way the attributes are structured into relations.
MVDs occur whenever a relation with more than one non-simple domain is
normalised.

Example 2

The relation PERSON_SKILL of Table 10.11 is a relation with more than


one non-simple domain. Let us suppose that X is PERSON and Y is SKILL-
TYPE, then Z becomes a relation key {PROJECT, MACHINE}. Suppose, a
particular value of PERSON “John” is selected. Consider all tuples (rows)
that have some value of Z, for example, PROJECT = P1 and MACHINE =
“Shovel”. The value of Y in these tuples (rows) is (<Programmer>). Consider also all tuples with the same value of X, that is, PERSON, but with some other value of Z, say, PROJECT = P2 and MACHINE = “Welding”. The value of Y in these tuples is again (<Programmer>).
obtained for PERSON = “John”, irrespective of the values chosen for
PROJECT and MACHINE. Hence X →→ Y, or PERSON →→ SKILL-
TYPE. It can be verified that in the relation PERSON_SKILL the result is:

PROJECT →→ PERSON, SKILL-TYPE


PROJECT →→ MACHINE
PERSON →→ PROJECT, MACHINE
Y(X, Z) is defined as the set of Y values, given a set of X values and a set of Z values. Thus, in relation PERSON_SKILL, we get:

SKILL-TYPE(John, P1, Excavator) = (<Programmer>)

SKILL-TYPE(John, P2, Dumper) = (<Programmer>)

In the formal definition of MVD, the values of attribute Y depend only on the attributes X but are independent of the attributes Z. So, given a value of X, the set of Y values will be the same for any two values Z1 and Z2 of Z. The value Y1, given a set of values X and Z1, is Y(X, Z1), and the value Y2, given a set of values X and Z2, is Y(X, Z2). The MVD requires that Y1 = Y2; that is, X →→ Y holds in relation R (X, Y, Z) if Y(X, Z1) = Y(X, Z2) for any values Z1, Z2.
It may be noticed that MVDs always come in pairs. Thus, if X →→ Y in
relation R (X, Y, Z), it is also true that X →→ Z.
Thus, alternatively it can be stated that if X →→ Y is an MVD in relation
R (X, Y, Z), whenever the two tuples (x1, y1, z1) and (x2, y2, z2) are in R, the
tuples (x1, y1, z2) and (x2, y2, z1) must also be in R. In this definition, X, Y
and Z represent sets of attributes rather than individual attributes.

Example 3

Let us examine the relation PERSON_SKILL, as shown in Fig. 10.9, which does not contain the MVD. As discussed earlier, for the MVD to hold we would require

PERSON →→ PROJECT

But, we observe from Fig. 10.9 that


Fig. 10.9 Normal form of relation PERSON_SKILL not containing MVD

PROJECT(John, Drilling, Programmer) = (<P1>, <P2>)

whereas, PROJECT(John, Excavator, Programmer) = (<P1>)

Thus, PROJECT(John, Excavator, Programmer) = (<P1>) does not equal PROJECT(John, Drilling, Programmer) = (<P1>, <P2>), which would have to hold if PERSON →→ PROJECT. But, if MACHINE is projected out of relation PERSON_SKILL, then PERSON →→ PROJECT will become an MVD. An MVD that does not hold in the relation R, but holds for a projection on some subset of the attributes of relation R, is sometimes called an embedded MVD of R.
Like trivial FDs, there are trivial MVDs also. An MVD X →→ Y in
relation R is called a trivial MVD if
a. Y is a subset of X, or
b. X ∪ Y = R.

It is called trivial because it does not specify any significant or meaningful


constraint on R. An MVD that satisfies neither (a) nor (b) is called non-
trivial MVD.
Like trivial FDs, there are two kinds of trivial MVDs as follows:
a. X →→ϕ, where ϕ is an empty set of attributes.
b. X →→ A − X, where A comprises all the attributes in a relation.

Both these types of trivial MVDs hold for any set of attributes of R and
therefore can serve no purpose as design criteria.

10.5.1 Properties of MVDs


Beeri described the relevant rules to derive MVDs. The following four axioms were proposed to derive a closure D+ of MVDs:

Rule 1 Reflexivity (inclusion): If Y ⊆ X, then X →→ Y.
Rule 2 Augmentation: If X →→ Y and V ⊆ W ⊆ U, then WX →→ VY.
Rule 3 Transitivity: If X →→ Y and Y →→ Z, then X →→ Z - Y.
Rule 4 Complementation: If X →→ Y, then X →→ U - X - Y holds.

In the above rules, X, Y, and Z all are sets of attributes of a relation R and
U is the set of all the attributes of R. These four axioms can be used to derive
the closure of a set D+, of D of multi-valued dependencies. It can be noticed
that there are similarities between the Armstrong’s axioms for FDs and
Berri’s axioms for MVDs. Both have reflexivity, augmentation, and
transitivity rules. But, the MVD set also has a complementation rule.
Following additional rules can be derived from the above Berri’s axioms
to derive closure of a set of FDs and MVDs:

Rule 5 Intersection: If X →→ Y and X →→ Z, then X →→ Y ∩ Z.
Rule 6 Pseudo-transitivity: If X →→ Y and YW →→ Z, then XW →→ (Z - WY).
Rule 7 Union: If X →→ Y and X →→ Z, then X →→ YZ.
Rule 8 Difference: If X →→ Y and X →→ Z, then X →→ Y - Z and X →→ Z - Y.

Further additional rules can be derived from above rules, which are as
follows:

Rule 9 Replication: If X → Y, then X →→ Y.
Rule 10 Coalescence: If X →→ Y and Z ⊆ Y and there is a W such that W ⊆ U, W ∩ Y is empty, and W → Z, then X → Z.

10.5.2 Fourth Normal Form (4NF)


A relation R is said to be in fourth normal form (4NF) if it is in BCNF and
for every non-trivial MVD (X →→ Y) in F+, X is a super key for R. The
fourth normal form (4NF) is concerned with dependencies between the
elements of compound keys composed of three or more attributes. The 4NF eliminates the problems left unresolved by BCNF. 4NF is violated when a relation has undesirable MVDs and hence can be used to identify and decompose such relations.

Example 1

Let us consider a relation EMPLOYEE, as shown in Fig. 10.10. A tuple in


this relation represents the fact that an employee (EMP-NAME) works on
the project (PROJ-NAME) and has a dependent (DEPENDENT-NAME).
This relation is not in 4NF because in the non-trivial MVDs EMP-NAME
→→ PROJ-NAME and EMP-NAME →→ DEPENDENT-NAME, EMP-
NAME is not a super key of EMPLOYEE.
Now the relation EMPLOYEE is decomposed into EMP_PROJ and
EMP_DEPENDENTS. Thus, both EMP_PROJ and EMP_DEPENDENTS are
in 4NF, because the MVDs EMP-NAME →→ PROJ-NAME in EMP_PROJ
and EMP-NAME →→ DEPENDENT-NAME in EMP_DEPENDENTS are
trivial MVDs. No other non-trivial MVDs hold in either EMP_PROJ or
EMP_DEPENDENTS. No FDs hold in these relation schemas either.
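A minimal SQL sketch of this 4NF decomposition is shown below (column types assumed for illustration; hyphenated attribute names written with underscores). In each table the whole row is the key, so only trivial MVDs remain.

CREATE TABLE EMP_PROJ (
    EMP_NAME   VARCHAR(30),
    PROJ_NAME  VARCHAR(30),
    PRIMARY KEY (EMP_NAME, PROJ_NAME)      -- all attributes form the key
);

CREATE TABLE EMP_DEPENDENTS (
    EMP_NAME        VARCHAR(30),
    DEPENDENT_NAME  VARCHAR(30),
    PRIMARY KEY (EMP_NAME, DEPENDENT_NAME)
);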

Example 2

Similarly, the relation PERSON_SKILL of Fig. 10.9 is not in 4NF


because it has the non-trivial MVDs PROJECT →→ MACHINE, but
PROJECT is not a super key. To convert this relation into 4NF, it is
necessary to decompose relation PERSON_SKILL into the following
relations:

R1 (PROJECT, MACHINE)
R2 (PERSON, SKILL-TYPE)
R3 (PERSON, PROJECT)

Example 3

Let us consider a relation R (A, B, C, D, E, F) with the MVDs A →→ B and CD →→ EF.

Let us decompose the relation R into two relations R1 (A, B) and R2 (A, C, D, E, F) by applying the MVD A →→ B and its complement A →→ CDEF. The relation R1 is now in 4NF because A →→ B is trivial and is the only MVD in the relation. The relation R2, however, is still only in BCNF and not in 4NF because of the non-trivial MVD CD →→ EF.

Now, R2 is decomposed into the relations (C, D, E, F) and (C, D, A) by applying the MVD CD →→ EF and its complement CD →→ A. Both of the decomposed relations are now in 4NF.

Fig. 10.10 A relation EMPLOYEE decomposed into EMP_PROJ and EMP_DEPENDENTS

10.5.3 Problems with MVDs and 4NF


FDs, MVDs and 4NF are not sufficient to identify all data redundancies. Let
us consider a relation PERSONS_ON_JOB_SKILLS, as shown in Table
10.14. This relation stores information about people applying their skills to the jobs to which they are assigned. A person applies a particular skill to a job only when the job needs that skill.

Table 10.14 Relation PERSON_ON_JOB_SKILL in BCNF and 4NF

Relation: PERSONS_ON_JOB_SKILLS
PERSON SKILL-TYPE JOB
Thomas Analyst J-1
Thomas Analyst J-2
Thomas DBA J-2
Thomas DBA J-3
John DBA J-1
Abhishek Analyst J-1

The relation PERSONS_ON_JOB_SKILLS of Table 10.14 is in BCNF


and 4NF. Yet it can lead to anomalies because of the dependency that exists among the joins of its projections. For example, person “Thomas”, who possesses skills “Analyst” and
“DBA” applies them to job J-2, as J-2 needs both these skills. The same
person “Thomas” applies skill “Analyst” only to job J-1, as job J-1 needs
only skill “Analyst” and not skill “DBA”. Thus, if we delete <Thomas,
DBA, J-2>, we must also delete <Thomas, Analyst, J-2>, because persons
must apply all their skills to a job if that requires those skills.

10.6 JOIN DEPENDENCIES AND FIFTH NORMAL FORM (5NF)

The anomalies that remain after MVDs and 4NF are eliminated by join dependencies (JD) and 5NF.

10.6.1 Join Dependencies (JD)


A join dependency (JD) can be said to exist if the join of R1 and R2 over C is equal to relation R, where R1 and R2 are the decompositions R1 (A, B, C) and R2 (C, D) of a given relation R (A, B, C, D). Alternatively, R1 and R2 form a lossless decomposition of R. In other words, *((A, B, C), (C, D)) will be a join dependency of R if the join of its components over their common attributes is equal to relation R. Here, *(R1, R2, R3, …) indicates that the relations R1, R2, R3 and so on are a join dependency (JD) of R. Therefore, a necessary condition for a relation R to satisfy a JD *(R1, R2, …, Rn) is that

R = R1 ⋃ R2 ⋃ …… ⋃ Rn

where each Ri is interpreted as a set of attributes of R.

Thus, whenever we decompose a relation R into R1 = X ∪ Y and R2 = (R − Y) based on an MVD X →→ Y that holds in relation R, the decomposition has the lossless-join property. Therefore, lossless-join dependency can be defined as a property of decomposition, which ensures that no spurious tuples are generated when the decomposed relations are recombined through a natural join operation.
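The lossless-join property can also be checked directly in SQL by comparing the natural join of the projections with the original relation. The sketch below uses hypothetical tables r, r1 and r2 for the decomposition of R (A, B, C, D) into R1 (A, B, C) and R2 (C, D) discussed above; EXCEPT is standard SQL set difference.

-- Rows that differ between R and the recombination of R1 and R2 over the
-- common attribute C.  An empty result here, and for the query with the two
-- operands swapped, indicates a lossless join.
SELECT a, b, c, d FROM r
EXCEPT
SELECT r1.a, r1.b, r1.c, r2.d
FROM   r1 JOIN r2 ON r1.c = r2.c;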

Example 1

Let us consider a relation PERSONS_ON_JOB_SKILLS, as shown in


Table 10.14. This relation can be decomposed into three relations namely,
HAS_SKILL, NEEDS_SKILL and JOB_ASSIGNED. Fig. 10.11 illustrates the join dependencies of the decomposed relations. It can be noted that no two of the decomposed relations by themselves form a lossless decomposition of PERSONS_ON_JOB_SKILLS. In fact, only a join of all three decomposed
relations yields a relation that has the same data as does the original relation
PERSONS_ON_JOB_SKILLS. Thus, each relation acts as a constraint on
the join of the other two relations.
Now, if we join decomposed relations HAS_SKILL and NEEDS_SKILL,
a relation CAN_USE_JOB_SKILL is obtained, as shown in Fig. 10.11. This
relation stores the data about persons who have skills applicable to a
particular job. But, each person who has a skill required for a particular job
need not be assigned to that job. The actual job assignments are given by the
relation JOB_ASSIGNED. When this relation is joined with HAS_SKILL, a
relation is obtained that will contain all possible skills that can be applied to
each job. This happens because the persons assigned to that job possess those
skills. However, some of the jobs do not require all the skills. Thus,
redundant tuples (rows) that show unnecessary SKILL-TYPE and JOB
combinations are removed by joining with relation NEEDS_SKILL.

Fig. 10.11 Join dependencies of relation PERSONS_ON_JOB_SKILLS

10.6.2 Fifth Normal Form (5NF)


A relation is said to be in fifth normal form (5NF) if every join dependency in it is a consequence of its relation (candidate) keys. Alternatively, for every non-trivial join dependency *(R1, R2, R3, …), each decomposed relation Ri is a super key of the main relation R. 5NF is also called project-join normal form (PJNF).

There are some relations which cannot be decomposed into two higher normal form relations by means of projections, as discussed for 1NF, 2NF, 3NF and BCNF. Such relations are decomposed into three or more relations, which can be reconstructed by means of a three-way or higher join operation. This is called fifth normal form (5NF). The 5NF eliminates the problems of 4NF. 5NF deals with relations having join dependencies. Any relation that is in 5NF is also in the other normal forms, namely 2NF, 3NF and 4NF. 5NF is mainly of theoretical interest and is rarely used in practical database design.

Example 1

Let us consider the relation PERSONS_ON_JOB_SKILLS of Fig. 10.11.


The three relations are

HAS_SKILL (PERSON, SKILL-TYPE)


NEEDS_SKILL (SKILL-TYPE, JOB)
JOB_ASSIGNED (PERSON, JOB))

Now by applying the definition of 5NF, the join dependency is given as:

*((PERSON, SKILL-TYPE), (SKILL-TYPE, JOB), (PERSON, JOB))

The above statement is true because the join of these three relations is equal to the original relation PERSONS_ON_JOB_SKILLS. The consequence of this join dependency is that none of its components, (PERSON, SKILL-TYPE), (SKILL-TYPE, JOB) or (PERSON, JOB), is a relation key, and hence the relation is not in 5NF. Now suppose the second tuple (row 2) is removed from the relation PERSONS_ON_JOB_SKILLS; a new relation is created that no longer has any such join dependency. Thus, the new relation will be in 5NF.
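Under this join dependency, the original relation can be recovered only by joining all three projections together; a minimal SQL sketch (hypothetical tables has_skill, needs_skill and job_assigned, with underscore column names) is:

-- Joining any two of the projections alone would generate spurious tuples;
-- the third projection acts as a constraint that removes them.
SELECT h.person, h.skill_type, n.job
FROM   has_skill    h
JOIN   needs_skill  n ON n.skill_type = h.skill_type
JOIN   job_assigned j ON j.person = h.person AND j.job = n.job;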
REVIEW QUESTIONS

1. What do you understand by the term normalization? Describe the data normalization process.
What does it accomplish?
2. Describe the purpose of normalising data.
3. What are different normal forms?
4. Define 1NF, 2NF and 3NF.
5. Describe the characteristics of a relation in un-normalised form and how is such a relation
converted to a first normal form (1NF).
6. What undesirable dependencies are avoided when a relation is in 3NF?
7. Given a relation R(A, B, C, D, E) and F = (A → B, BC → D, D → BC, DE → ϕ), synthesise
a set of 3NF relation schemes.
8. Define Boyce-Codd normal form (BCNF). How does it differ from 3NF? Why is it considered a stronger form than 3NF? Provide an example to illustrate.
9. Why is 4NF preferred to BCNF?
10. A relation R(A, B, C) has FDs AB → C and C → A. Is R in 3NF or in BCNF? Justify your answer.
11. A relation R(A, B, C, D) has FD C → B. Is R in 3NF? Justify your answer.
12. A relation R(A, B, C) has FD A → C. Is R in 3NF? Does AB → C hold? Justify your answer.
13. Given the relation R(A, B, C, D, E) with the FDs (A → BCDE, B → ACDE, C → ABDE),
what are the join dependencies of R? Give the lossless decomposition of R.
14. Given the relation R(A, B, C, D, E, F) with the set X = (A → CE, B → D, C → ADE, BD
→→ F), find the dependency basis of BCD.
15. Explain the following:

a. Why is R1 in 1NF but not in 2NF, where

R1 = ({A, B, C, D}, {B → D, AB → C})

b. Why R2 is in 2NF but not 3NF, where

R2 = ({A, B, C, D, E}, {AB → CE, E → AB, C → D})

c. Why R3 is in 3NF but not BCNF, where

R3 = ({A, B, C, D}, {A → C, D → B})

d. What is the highest form of each of the following relations?

R1 = ({A, B, C}, {A ↔ B, A → C})

R2 = ({A, B, C}, {A ↔ B, C → A})

R3 = ({A, B, C, D}, {A → C, D → B})


R4 = ({A, B, C, D}, {A → C, CD → B})

16. Consider the functional dependency diagram as shown in Fig. 10.12. Following relations are
given:

a. SALE1 (SALE-NO, SALE-ITEM, QTY-SOLD)


b. SALE2 (SALE-NO, SALE-ITEM, QTY-SOLD, ITEM-PRICE)
c. SALE3 (SALE-NO, SALE-ITEM, QTY-SOLD, LOCATION)
d. SALE4 (SALE-NO, QTY-SOLD)
e. SALE5 (SALESMAN, SALE-ITEM, QTY-SOLD)
f. SALE6 (SALE-NO, SALESMAN, LOCATION)

Fig. 10.12 Functional dependency diagram

i. What are the relation keys of these relations?


ii. What is the highest normal form of the relations?

17. Consider the following FDs:

PROJ-NO → PROJ-NAME
PROJ-NO → START-DATE
PROJ-NO, MACHINE-NO → TIME-SPENT-ON-PROJ
MACHINE-NO, PERSON-NO → TIME-SPENT-BY-PERSON

State whether the following relations are in BCNF?

R1 = (PROJ-NO, MACHINE-NO, PROJ-NAME, TIME-SPENT-ON-PROJ)


R2 = (PROJ-NO, PERSON-NO, MACHINE-NO, TIME-SPENT-ON-PROJ)
R3 = (PROJ-NO, PERSON-NO, MACHINE-NO).

18. Define the concept of multi-valued dependency (MVD) and describe how this concept relates
to 4NF. Provide an example to illustrate your answer.
19. Following relation is given:

STUDENT (COURSE, STUDENT, FACULTY, TERM, GRADE)


Each student receives only one grade in a course during a terminal examination. A student
can take many courses and each course can have more than one faculty in a terminal.

a. Define the FDs and MVDs in this relation.


b. Is the relation in 4NF? If not, decompose the relation.

20. Following relation is given:

ACTING (PLAY, ACTOR, PERF-TIME)

This relation stores the actors in each play and the performance times of each play. It is
assumed that each actor takes part in every performance.

a. What are MVDs in this relation?


b. Is the relation in 4NF? If not, decompose the relation.
c. If actors in a play take part in some but not all performances of the play, what will be the MVDs?
d. Is the relation of (c) in 4NF? If not, decompose it.

21. A role of the actor is added in the relation of exercise 20, which now becomes

ACTING (PLAY, ACTOR, ROLE, PERF-TIME)

a. Assuming that each actor has one role in each play, find the MVDs for the following
cases:
i. Each actor takes part in every performance of the play.
ii. An actor takes part in only some performances of the play.

b. In each case determine whether the relation is in 4NF and decompose it if it is not.

22. For exercise 6 of Chapter 9, design relational schemas for the database that are each in 3NF
or BCNF.
23. Consider the universal relation R (A, B, C, D, E, F, G, H, I, J) and the set of FDs

F = ({A, B} → {C}, {A} → {D, E}, {B} → {F}, {F} → {G, H}, {D} → {I, J}).

a. What is the key of R?


b. Decompose R into 2NF, then 3NF relations.

24. In a relation R (A, B, C, D, E, F, G, H, I, J), different set of FDs are given as

G = ({A, B} → {C}, {B, D} → {E, F}, {A, D} → {G, H}, {A} → {I}, {H} → {J}).

a. What is the key of R?


b. Decompose R into 2NF, then 3NF relations.

25. Following relations for an order-processing application database of M/s KLY Ltd. are given:
ORDER (ORD-NO, ORD-DATE, CUST-NO, TOT-AMNT)
ORDER_ITEM (ORD-NO, ITEM-NO, QTY-ORDRD, TOT-PRICE, DISCT%)

Assume that each item has a different discount. The TOT-PRICE refers to the price of one
item. ORD-DATE is the date on which the order was placed. TOT-AMNT is the amount of
the order.

a. If natural join is applied on the relations ORDER and ORDER_ITEM in the above
database, what does the resulting relation schema look like?
b. What will be its key?
c. Show the FDs in this resulting relation.
d. State why or why not is it in 2NF.
e. State why or why not is it in 3NF.

26. Following relation for published books is given

BOOK (BOOK-TITLE, AUTH-NAME, BOOK-TYPE, LIST-PRICE, AUTH-AFFL,


PUBLISHER)

AUTH-AFFL refers to the affiliation of author. Suppose that the following FDs exist:

BOOK-TITLE → PUBLISHER, BOOK-TYPE


BOOK-TYPE → LIST-PRICE
AUTH-NAME → AUTH-AFFL

a. What normal form is the relation in? Explain your answer.


b. Apply normalization until the relations cannot be decomposed any further. State the
reason behind each decomposition.

27. Set of FDs given are A → BCDEF, AB → CDEF, ABC → DEF, ABCD → EF, ABCDE →
F, B → DG, BC → DEF, BD → EF and E → BF.

a. Find the minimum set of 3NF relations.


b. Designate the candidate key attributes of these relations.
c. Is the set of relations that has been derived also BCNF?

28. A relation R(A, B, C, D) has FD AB → C.

a. Is R is in 3NF?
b. Is R in BCNF?
c. Does the MVD AB →→ C hold?
d. Does the set {R1(A, B, C), R2(A, B, D)} satisfy the lossless join property?

29. A relation R(A, B, C) and the set {R1(A, B), R2(B, C)}satisfies the lossless decomposition
property.

a. Is R in 4NF?
b. Is B a candidate key?
c. Does the MVD B →→ C hold?

30. Following relations are given:

a. EMPLOYEE(E-ID, E-NAME, E-ADDRESS, E-PHONE, E-SKILL)


FD: E-ADDRESS → E-PHONE
b. STUDENT(S-ID, S-NAME, S-BLDG, S-FLOOR, S-RESIDENT)
FD: S-BLDG, S-FLOOR → S-RESIDENT
c. WORKER(W-ID, W-NAME, W-SPOUSE-ID, W-SPOUSE-NAME)
FD: W-SPOUSE-ID → W-SPOUSE-NAME
For each of the above relations,
i. Indicate which normal forms the relations conform to, if any.
ii. Show how each relation can be decomposed into multiple relations, each of which conforms to the highest normal form.

31. A life insurance company has a large number of policies. For each policy, the company wants
to know the policy holder’s social security number, name, address, date of birth, policy
number, annual premium and death benefit amount. The company also wants to keep track of
agent number, name, and city of residence of the agent who made the policy. A policy holder can have many policies and an agent can make many policies.
Create a relational database schema for the above life insurance company with all relations in
4NF.
32. Define the concept of join dependency (JD) and describe how this concept relates to 5NF.
Provide an example to illustrate your answer.
33. Give an example of a relation schema R and a set of dependencies such that R is in BCNF,
but is not in 4NF.
34. Explain why 4NF is a normal form more desirable than BCNF.

STATE TRUE/FALSE

1. Normalization is a process of decomposing a set of relations with anomalies to produce


smaller and well-structured relations that contain minimum or no redundancy.
2. A relation is said to be in 1NF if the values in the domain of each attribute of the relation are
non- atomic.
3. 1NF contains no redundant information.
4. 2NF is always in 1NF.
5. 2NF is the removal of the partial functional dependencies or redundant data.
6. When a relation R in 2NF with FDs A → B and B → CDEF (where A is the only candidate
key), is decomposed into two relations R1 (with A → B) and R2 (with B → CDEF), the
relations R1 and R2

a. are always a lossless decomposition of R.


b. usually have total combined storage space less than R.
c. have no delete anomalies.
d. will always be faster to execute a query than R.

7. When a relation R in 3NF with FDs AB → C and C → B is decomposed into two relations
R1 (with AB → null, that is, all key) and R2 (with C → B), the relations R1 and R2

a. are always a lossless decomposition of R.


b. are both dependency preservation.
c. Are both in BCNF.

8. When a relation R in BCNF with FDs A → BCD (where A is the primary key) is decomposed
into two relations R1 (with A → B) and R2 (with A → CD), the resulting two relations R1
and R2

a. are always dependency preserving.


b. usually have total combined storage space less than R.
c. have no delete anomalies.

9. In 3NF, no non-prime attribute is functionally dependent on another non-prime attribute.


10. In BCNF, a relation must only have candidate keys as determinants.
11. Lossless-join dependency is a property of decomposition, which ensures that no spurious
tuples are generated when relations are returned through a natural join operation.
12. Multi-valued dependencies are the result of 1NF, which prohibited an attribute from having a
set of values.
13. 5NF does not require semantically related multiple relationships.
14. Normalization is a formal process of developing data structure in a manner that eliminates
redundancy and promotes integrity.
15. 5NF is also called projection-join normal form (PJNF).

TICK (✓) THE APPROPRIATE ANSWER

1. Normalization is a process of

a. decomposing a set of relations.


b. successive reduction of relation schema.
c. deciding which attributes in a relation are to be grouped together.
d. all of these.

2. The normalization process was developed by

a. E.F. Codd.
b. R.F. Boyce.
c. R. Fagin.
d. Collin White.

3. A normal form is
a. a state of a relation that results from applying simple rules regarding FDs.
b. the highest normal form condition that it meets.
c. an indication of the degree to which it has been normalised.
d. all of these.

4. Which of the following is the formal process of deciding which attributes should be grouped
together in a relation?

a. optimization
b. normalization
c. tuning
d. none of these.

5. In 1NF,

a. all domains are simple.


b. in a simple domain, all elements are atomic
c. both (a) & (b).
d. none of these.

6. 2NF is always in

a. 1NF.
b. BCNF.
c. MVD.
d. none of these.

7. A relation R is said to be in 2NF

a. if it is in 1NF.
b. every non-prime key attributes of R is fully functionally dependent on each relation
key of R.
c. if it is in BCNF.
d. both (a) and (b).

8. A relation R is said to be in 3NF if the

a. relation R is in 2NF.
b. nonprime attributes are mutually independent.
c. functionally dependent on the primary key.
d. all of these.

9. The idea of multi-valued dependency was introduced by

a. E.F. Codd.
b. R.F. Boyce.
c. R. Fagin.
d. none of these.
10. The expansion of BCNF is

a. Boyd-Codd Normal Form.


b. Boyce-Ccromwell Normal Form.
c. Boyce-Codd Normal Form.
d. none of these.

11. The fourth normal form (4NF) is concerned with dependencies between the elements of
compound keys composed of

a. one attribute.
b. two attributes.
c. three or more attributes.
d. none of these.

12. When all the columns (attributes) in a relation describe and depend upon the primary key, the
relation is said to be in

a. 1NF.
b. 2NF.
c. 3NF.
d. 4NF.

FILL IN THE BLANKS

1. Normalization is a process of _____a set of relations with anomalies to produce smaller and
well-structured relations that contain minimum or no _____.
2. _____ is the formal process for deciding which attributes should be grouped together.
3. In the _____ process we analyse and decompose the complex relations and transform them
into smaller, simpler, and well-structured relations.
4. _____ first developed the process of normalization.
5. A relation is said to be in 1NF if the values in the domain of each attribute of the relation
are_____.
6. A relation R is said to be in 2NF if it is in _____ and every non-prime key attributes of R is
_____ on each relation key of R.
7. 2NF can be violated only when a key is a _____ key or one that consists of more than one attribute.
relation is said to be in _____.
9. In 3NF, no non-prime attribute is functionally dependent on _____ .
10. Relation R is said to be in BCNF if for every nontrivial FD: _____ between attributes X and Y
holds in R.
11. A relation is said to be in the _____ when transitive dependencies are removed.
12. A relation is in BCNF if and only if every determinant is a _____ .
13. Any relation in BCNF is also in _____ and consequently in_____.
14. The difference between 3NF and BCNF is that for a functional dependency A → B, 3NF
allows this dependency in a relation if B is a _____ key attribute and A is not a _____ key.
Whereas, BCNF insists that for this dependency to remain in a relation, A must be a _____
key.
15. 4NF is violated when a relation has undesirable _____.
16. A relation is said to be in 5NF if every join dependency is a _____ of its relation keys.
Part-IV

QUERY, TRANSACTION AND SECURITY MANAGEMENT


Chapter 11

Query Processing and Optimization

11.1 INTRODUCTION

In Chapter 5, we described various types of relational query languages to


specify queries for the data manipulation in a database. Therefore, query
processing and its optimization become important and necessary functions
for any database management system (DBMS). The query for a database
application can be simple, complex or demanding. Thus, the efficiency of
query processing algorithms is crucial to the performance of a DBMS.
In this chapter, we discuss the techniques used by a DBMS to process,
optimise and execute high-level queries. This chapter describes some of the
basic principles of query processing, with particular emphasis on the ideas
underlying query optimization. It discusses the techniques used to split complex queries into multiple simple operations and the methods of implementing these low-level operations. The chapter describes the query optimization techniques used to choose an efficient execution plan that will minimise runtime as well as the use of various other resources, such as the number of disk I/Os, CPU time and so on.

11.2 QUERY PROCESSING

Query processing is the procedure of transforming a high-level query (such


as SQL) into a correct and efficient execution plan expressed in low-level
language that performs the required retrievals and manipulations in the
database. A query processor selects the most appropriate plan that is used in
responding to a database request. When a database system receives a query
(using query languages discussed in chapter 5) for update or retrieval of
information, it goes through a series of query compilation steps, which produce an execution plan, before it begins execution. In the first phase, called the syntax-checking phase, the system parses the query and checks that it obeys the syntax rules. It then matches objects in the query syntax with views, tables and columns listed in system tables. Finally, it performs appropriate query modification. During this phase, the system validates that the user has
appropriate privileges and that the query does not disobey any relevant
integrity constraints. The execution plan is finally executed to generate a
response. Query processing is a stepwise process. Fig. 11.1 shows the
different steps of processing a high-level query.
As shown in Fig. 11.1, the user gives the query request, which may be in
QBE or other form. This is first transformed into standard high-level query
language, such as SQL. This SQL query is read by the syntax analyser so
that it can be checked for correctness. At this stage, the syntax analyser uses
the grammar of SQL as input and the parser portion of the query processor
checks the syntax and verifies whether the relations and attributes of the
requested query are defined in the database. The correct query then passes to
the query decomposer. At this stage, the SQL query is translated into
algebraic expressions using various rules and information such as
equivalency rules, idempotency rules, transformation rules and so on, from
the database dictionary.
Fig. 11.1 Typical steps in high-level query processing
The relational algebraic expression now passes to the query optimiser.
Here, optimization is performed by substituting equivalent expressions for
those in the query. The substitution of this equivalent expression depends on
the factors such as the existence of certain database structures, whether or
not a given file is sorted, the presence of different indexes and so on. The
query optimization module works in tandem with the join manager module
to improve the order in which the joins are performed. At this stage, cost
model and several estimation formulas are used to rewrite the query. The
modified query is written to utilise system resources so as to yield optimal
performance. The query optimiser then generates an action (also called
execution) plan. These action plans are converted into query code that is finally executed by the run-time database processor. The run-time database processor estimates the cost of each access plan and chooses the optimal one for execution.
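Although the text does not depend on any particular product, most relational DBMSs let the user inspect the execution plan chosen by this process. For example, PostgreSQL and MySQL provide an EXPLAIN statement; the table and column used below are hypothetical:

-- Asks the optimiser to display its chosen plan (access paths, join order,
-- estimated costs) instead of simply executing the query.
EXPLAIN
SELECT emp_name, emp_salary
FROM   employee
WHERE  emp_dept_no = 20;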

11.3 SYNTAX ANALYSER

The syntax analyser takes the query from the users, parses it into tokens and
analyses the tokens and their order to make sure they comply with the rules
of the language grammar. If an error is found in the query submitted by the
user, it is rejected and an error code together with an explanation of why the
query was rejected is returned to the user.
A simple form of language grammar that could be used to implement a
SQL statement is given below:

QUERY := SELECT_CLAUSE + FROM_CLAUSE + WHERE_CLAUSE
SELECT_CLAUSE := ‘SELECT’ + <COLUMN_LIST>
FROM_CLAUSE := ‘FROM’ + <TABLE_LIST>
WHERE_CLAUSE := ‘WHERE’ + VALUE1 OP VALUE2
VALUE1 := VALUE / COLUMN_NAME
VALUE2 := VALUE / COLUMN_NAME
OP := +, −, /, *, =
The above grammar can be used to implement a SQL query such as the
one shown below:

SELECT COLUMN1, COLUMN2, COLUMN3, COLUMN4


FROM TEST1
WHERE COLUMN2 > 50000
AND COLUMN3 = ‘DELHI’
AND COLUMN4 BETWEEN 10000 and 80000

11.4 QUERY DECOMPOSITION

The query decomposition is the first phase of query processing whose aims
are to transform a high-level query into a relational algebra query and to
check whether that query is syntactically and semantically correct. Thus, a
query decomposition phase starts with a high-level query and transforms into
a query graph of low-level operations (algebraic expressions), which
satisfies the query. In practice, SQL (a relational calculus query) is used as
high-level query language, which is used in most commercial RDBMSs. The
SQL is then decomposed into query blocks (low-level operations), which
form the basic units. The query block contains expressions such as a single
SELECT-FROM-WHERE, as well as clause such as GROUP BY and
HAVING, if these are part of the block. Hence, nested queries within a query
are identified as separate query blocks. The query decomposer goes through
five stages of processing for decomposition into low-level operation and to
accomplish the translation into algebraic expressions. Fig. 11.2 shows the
five stages of query decomposer. The five stages of query decomposition
are:
Fig. 11.2 Stages of query decomposer

Query analysis.
Query normalization.
Semantic analysis.
Query simplifier.
Query restructuring.

11.4.1 Query Analysis


During the query analysis phase, the query is lexically and syntactically
analysed using the programming language compilers (parsers) in the same
way as conventional programming to find out any syntax errors. A
syntactically legal query is then validated, using the system catalogues, to
ensure that all database objects (relations and attributes) referred to by the
query are defined in the database. It is also verified whether relationships of
the attributes and relations mentioned in the query are correct as per the
system catalogue. The type specification of the query qualifiers and result is
also checked at this stage.
Let us consider the following query:

SELECT EMP-ID, EMP-DESIG


FROM EMPLOYEE
WHERE EMP-DESIG > 100;

The above query will be rejected because of the following two reasons:
In the SELECT list, the attribute EMP-ID is not defined for the relation EMPLOYEE.
In the WHERE clause, the comparison “> 100” is incompatible with the data type EMP-
DESIG, which is a variable character string.

11.4.1.1 Query Tree Notation


At the end of query analysis phase, the high-level query (SQL) is
transformed into some internal representation that is more suitable for
processing. This internal representation is typically a kind of query tree. A
query tree is a tree data structure that corresponds to a relational algebra
expression. A query tree is also called as relational algebra tree. The query
tree has the following components:
Leaf nodes of the tree, representing the base input relations of the query.
Internal (non-leaf) nodes of the tree, representing an intermediate relation which is the result
of applying an operation in the algebra.
Root of the tree, representing the result of the query.
The sequence of operations (or data flow) is directed from leaves to the root.

The query tree is executed by executing an internal node operation


wherever its operands are available. The internal node is then replaced by
the relation that results from executing the operation. The execution
terminates when the root node is executed and produces the result relation
for the query.
Let us consider a SQL query in which it is required to list the project
number (PROJ-NO.), the controlling department number (DEPT-NO.), and
the department manager’s name (MGR-NAME), address (MGR-ADD) and
date of birth (MGR-DOB) for every project located in ‘Mumbai’. The SQL
query can be written as follows:
SELECT (P.PROJ-NO, P.DEPT-NO, E.NAME, E.ADD,
E.DOB)
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DEPT-NO = D.D-NUM AND D.MGR-ID =
E.EMP-ID AND P.PROJ-LOCATION = ‘Mumbai’

In the above SQL query, the join condition DEPT-NO = D-NUM relates a
project to its controlling department, whereas the join condition MGR-ID =
EMP-ID relates the controlling department to the employee who manages
that department. The equivalent relational algebra expression for the above
SQL query can be written as:

MUMBAI-PROJECT ← σ PROJ-LOCATION = ‘Mumbai’ (PROJECT)

CONTROL-DEPT ← MUMBAI-PROJECT ⋈ DEPT-NO = D-NUM (DEPARTMENT)

PROJ-DEPT-MGR ← CONTROL-DEPT ⋈ MGR-ID = EMP-ID (EMPLOYEE)

FINAL-RESULT ← ∏ PROJ-NO, DEPT-NO, NAME, ADD, DOB (PROJ-DEPT-MGR)

Or, as a single expression,

∏ PROJ-NO, DEPT-NO, NAME, ADD, DOB (((σ PROJ-LOCATION = ‘Mumbai’ (PROJECT)) ⋈ DEPT-NO = D-NUM (DEPARTMENT)) ⋈ MGR-ID = EMP-ID (EMPLOYEE))

Fig. 11.3 shows an example of a query tree for the above SQL statement
and relational algebra expression. This type of query tree is also referred as
relational algebra tree.
As shown in Fig. 11.3 (a), the three relations PROJECT, DEPARTMENT
and EMPLOYEE are represented by leaf nodes P, D and E, while the
relational algebra operations of the expression are represented by internal
tree nodes. It can be seen from the query tree of Fig. 11.3 (a) that leaf node
1 first begins execution before leaf node 2 because some resulting tuples of
operation of leaf node 1 must be available before the start of execution
operation of leaf node 2. Similarly, leaf node 2 begins executing and
producing results before leaf node 3 can start execution and so on. Thus, it
can be observed that the query tree represents a specific order of operations
for executing a query. Fig. 11.3 (b) shows the initial query tree for the SQL
query discussed above.
Fig. 11.3 Query tree representation

(a) Query tree corresponding to the relational algebra expressions

(b) Initial query tree for SQL query

Same SQL query can have many different relational algebra expressions
and hence many different query trees. The query parser typically generates a
standard initial (canonical) query tree to correspond to an SQL query,
without doing any optimization. For example, the initial query tree is shown
in Fig. 11.3 (b) for a SELECT-PROJECT-JOIN query. The CARTESIAN
PRODUCT (×) of the relations specified in the FROM clause is first applied,
then the SELECTION and JOIN conditions of the WHERE clause are
applied, followed by the PROJECTION on the SELECT clause attributes.
Because of the CARTESIAN PRODUCT (×) operations, a relational algebra expression represented by this initial query tree is very inefficient if executed directly.

11.4.1.2 Query Graph Notation


Query graph is sometimes also used for representation of a query, as shown
in Fig. 11.4. In query graph representation, the relations (PROJECT,
DEPARTMENT and EMPLOYEE in our example) in the query are
represented by relation nodes. These relation nodes are displayed as single
circle. The constant values from the query selection (project
location=‘Mumbai’ in our example) are represented by constant nodes,
displayed as double circles. The selection and join conditions are represented
by the graph edges, for example, P.DEPT-NO = D.DEPT-NUM and
D.MGR-ID=E.EMP-ID, as shown in Fig. 11.4. Finally, the attributes to be
retrieved from each relation are displayed in square brackets above each
relation, for example [P.PROJ-NO, P.DEPT-NO] and [E.EMP-NAME, E.EMP-ADD, E.EMP-DOB], as shown in Fig. 11.4. A query graph representation corresponds to a relational calculus expression.
Fig. 11.4 Query graph notation

The disadvantage of a query graph is that it does not indicate an order in which the operations are to be performed, as a query tree does. Therefore,
a query tree representation is preferred over the query graph in practice.
There is only one graph corresponding to each query. Query tree and query
graph notations are used as the basis for the data structures that are used for
internal representation of queries.

11.4.2 Query Normalization


The primary goal of normalization phase is to avoid redundancy. The
normalization phase converts the query into a normalised form that can be
more easily manipulated. In the normalization phase, a set of equivalency
rules is applied so that the projection and selection operations included in the
query are simplified to avoid redundancy. The projection operation
corresponds to the SELECT clause of an SQL query and the selection operation corresponds to the predicates found in the WHERE clause. The equivalency transformation rules that are applied to an SQL query are shown in Table 11.1, in which UNARYOP means a UNARY operation, BINOP means a BINARY operation, and REL1, REL2 and REL3 are relations.

Table 11.1 Equivalency rules


Rule 1. Commutativity of UNARY operations: UNARYOP1 UNARYOP2 REL ↔ UNARYOP2 UNARYOP1 REL
Rule 2. Commutativity of BINARY operations: REL1 BINOP (REL2 BINOP REL3) ↔ (REL1 BINOP REL2) BINOP REL3
Rule 3. Idempotency of UNARY operations: UNARYOP REL ↔ UNARYOP1 UNARYOP2 REL
Rule 4. Distributivity of UNARY operations with respect to BINARY operations: UNARYOP (REL1 BINOP REL2) ↔ UNARYOP (REL1) BINOP UNARYOP (REL2)
Rule 5. Factorisation of UNARY operations: UNARYOP (REL1) BINOP UNARYOP (REL2) ↔ UNARYOP (REL1 BINOP REL2)

By applying these equivalency rules, the normalization phase rewrites the


query into a normal form which can be readily manipulated in later steps.
The predicate is converted into one of the following two normal forms:
Conjunctive normal form.
Disjunctive normal form.

Conjunctive normal form is a sequence of conjuncts that are connected


with the ‘AND’ (‘^’) operator. Each conjunct contains one or more terms
connected by the ‘OR’ (∨) operator. A conjunctive selection contains only
those tuples that satisfy all conjuncts. An example of conjunctive normal
form can be given as:

(EMP-DESIG=‘Programmer’ ∨ EMP-SALARY > 40000) ^


LOCATION=‘Mumbai’

Disjunctive normal form is a sequence of disjuncts that are connected with the ‘OR’ (‘∨’) operator. Each disjunct contains one or more terms connected by the ‘AND’ (‘^’) operator. A disjunctive selection contains those tuples
formed by the union of all tuples that satisfy the disjunct. An example of
disjunctive normal form can be given as:
(EMP-DESIG=‘Programmer’ ^ LOCATION=‘Mumbai’) ∨
(EMP-SALARY > 40000^ LOCATION=‘Mumbai’)

Disjunctive normal form is most often used, as it allows the query to be


broken into a series of independent sub-queries linked by unions. In practice,
the query is usually held as a graph structure by the query processor.
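As an illustration of why disjunctive normal form is convenient, a query whose WHERE clause carries the disjunctive predicate shown above could be evaluated as a union of independent sub-queries (the table and column names here are assumed):

-- Each disjunct becomes its own sub-query; UNION merges the results and
-- removes duplicate tuples.
SELECT emp_id, emp_name
FROM   employee
WHERE  emp_desig = 'Programmer' AND location = 'Mumbai'
UNION
SELECT emp_id, emp_name
FROM   employee
WHERE  emp_salary > 40000 AND location = 'Mumbai';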

11.4.3 Semantic Analyser


The objective of semantic analyser phase of query processing is to reduce
the number of predicates that must be evaluated by refuting incorrect or
contradictory queries or qualifications. The semantic analyser rejects the
normalised queries that are incorrectly formulated or contradictory. A query
is incorrectly formulated if its components do not contribute to the generation of the result. This happens in the case of a missing join specification. A query is contradictory if its predicate cannot be satisfied by any tuple in the relation. The
semantic analyser examines the relational calculus query (SQL) to make sure
it contains only data objects (that is, tables, columns, views, indexes) that are
defined in the database catalogue. It makes sure that each object in the query
is referenced correctly according to its data type.
In case of missing join specifications the components do not contribute to
the generation of the results, and thus, a query may be incorrectly
formulated. A query is contradictory if its predicate cannot be satisfied by
any tuple. For example, let us consider the following query:

(EMP-DESIG=‘Programmer’^ EMP-DESIG=‘Analyst’)

As an employee cannot be both ‘Programmer’ and ‘Analyst’


simultaneously, the above predicate on the EMPLOYEE relation is
contradictory.
Algorithms to determine correctness exist only for the subset of queries
that do not contain disjunction and negation. Connection graphs (or query
graph), as shown in Fig. 11.5, can be constructed to check the correctness
and contradictions as follows:
Constructing a query graph for relation: If the relation graph is not connected, the query is
incorrectly formulated.
Constructing a query graph for normalised attribute: If the graph has a cycle for which the
valuation sum is negative, the query is contradictory.

Example of Correctness and Contradiction

Let us consider the following SQL query:

SELECT (P.PROJ-NO, P.PROJ-LOCATION)


FROM PROJECT AS P, VIEWING AS V, DEPARTMENT AS
D
WHERE D.DEPT-ID = V.DEPT-ID AND D.MAX-BUDGET >= 85000 AND D.COMPLETION-YEAR = ‘2005’ AND
P.PROJ-MGR = ‘Mathew’;

A relation query graph for the above SQL query is shown in Fig. 11.5 (a),
which is not fully connected. That means, query is not correctly formulated.
In this graph, the join condition (V.PROJ-NO = P.PROJ-NO) has been
omitted.
Now let us consider the SQL query given as:

SELECT (P.PROJ-NO, P.PROJ-LOCATION)


FROM PROJECT AS P, VIEWING AS V, DEPARTMENT AS D
WHERE D.MAX-BUDGET > 85000 AND D.DEPT-ID =
V.DEPT-ID AND V.PROJ-NO = P.PROJ-NO AND
D.COMPL-YEAR = ‘2005’ AND D.MAX-BUDGET <
50000;

A normalised relation query graph for the above SQL query is shown in
Fig. 11.5 (b). This graph has a cycle between the nodes D.MAX-BUDGET
and 0 with a negative valuation sum. Thus, it indicates that the query is
contradictory. Clearly, we cannot have a department with a maximum budget
that is both greater than INR 85,000 and less than INR 50000.

Fig. 11.5 Connection (or query) graphs

(a) Relation query graph showing incorrectly formulated query

(b) Normalised attribute query graph showing contradictory query

11.4.4 Query Simplifier


The objectives of a query simplifier are to detect redundant qualification,
eliminate common sub-expressions and transform sub-graphs (query) to
semantically equivalent but more easily and efficiently computed forms.
Commonly integrity constraints, view definitions and access restrictions are
introduced into the graph at this stage of analysis so that the query can be
simplified as much as possible. Integrity constraints define constants which
must hold for all states of the database, so any query that contradicts an
integrity constraint must be void and can be rejected without accessing the
database. If the user does not have the appropriate access to all the
components of the query, the query must be rejected. Queries expressed in
terms of views can be simplified by substituting the view definition, since
this will avoid having to materialise the view before evaluating the query
predicate on it. A query that violates an access restriction cannot have an
answer returned to the user, so can be answered without accessing the
database. The final form of simplification is obtained by applying the
idempotence rules of Boolean algebra, as shown in Table 11.2.

Table 11.2 Idempotence rules of Boolean algebra

Rule 1. PRED AND PRED ↔ PRED (P ^ P ≡ P)
Rule 2. PRED AND TRUE ↔ PRED (P ^ TRUE ≡ P)
Rule 3. PRED AND FALSE ↔ FALSE (P ^ FALSE ≡ FALSE)
Rule 4. PRED AND NOT (PRED) ↔ FALSE (P ^ (∼P) ≡ FALSE)
Rule 5. PRED1 AND (PRED1 OR PRED2) ↔ PRED1 (P1 ^ (P1 ∨ P2) ≡ P1)
Rule 6. PRED OR PRED ↔ PRED (P ∨ P ≡ P)
Rule 7. PRED OR TRUE ↔ TRUE (P ∨ TRUE ≡ TRUE)
Rule 8. PRED OR FALSE ↔ PRED (P ∨ FALSE ≡ P)
Rule 9. PRED OR NOT (PRED) ↔ TRUE (P ∨ (∼P) ≡ TRUE)
Rule 10. PRED1 OR (PRED1 AND PRED2) ↔ PRED1 (P1 ∨ (P1 ^ P2) ≡ P1)

Example of using idempotence rules

Let us consider the following query:

SELECT D.DEPT-ID, M.BRANCH-MGR, M.BRANCH-ID,


B.BRANCH-ID, B.BRANCH-LOCATION,
E.EMP-NAME, E.EMP-SALARY
FROM DEPARTMENT AS D, MANAGER AS M, BRANCH AS B, EMPLOYEE AS E
WHERE D.DEPT-ID =M.DEPT-ID
AND M.BRANCH-ID = B.BRANCH-ID
AND M.BRANCH-MGR = E.EMP-ID
AND B.BRANCH-LOCATION = ‘Mumbai’
AND NOT (B.BRANCH-LOCATION =
‘Mumbai’
AND B.BRANCH-LOCATION = ‘Delhi’)
AND B.PROFITS-TO-DATE > 100,00,000.00
AND E.EMP-SALARY > 85,000.00
AND NOT (B.BRANCH-LOCATION = ‘Delhi’)
AND D.DEPT-LOCATION = ‘Bangalore’

Let us examine the following part of the above query statement in greater
detail:

AND B.BRANCH-LOCATION = ‘Mumbai’


AND NOT (B.BRANCH-LOCATION = ‘Mumbai’
AND B.BRANCH-LOCATION = ‘Delhi’)

In the above query statement, let us equate as follows:

B. BRANCH-LOCATION = ‘Mumbai’ = PRED1


B. BRANCH-LOCATION = ‘Mumbai’ = PRED2
B. BRANCH-LOCATION = ‘Delhi’ = PRED3

Now, the above part of the query can be represented in the form of
idempotence rules of Boolean algebra as follows:
PRED1 AND NOT (PRED2 AND PRED3) = P1 ^ ∼(P2 ^ P3)

The above predicate is received by the query normalizer (section 11.4.2)


and converted into the following form by applying equivalency rule 2 of
Table 11.1:

(PRED1 AND NOT (PRED1)) AND NOT (PRED3) = (P1 ^ (∼ P1)) ^ (∼ P3)

The query normaliser now applies rule 4 of idempotency rules (Table


11.2) of query simplifier phase and obtains the following form:

FALSE AND NOT (PRED3) = FALSE ^ ∼ (P3)

The above form is equivalent to NOT (PRED3) or ∼ (P3).


Now translating the WHERE predicate into SQL, our WHERE clause
(without JOINS) looks like

NOT (B.BRANCH-LOCATION = ‘Delhi’)


AND B.PROFITS-TO-DATE > 100,00,000.00
AND E.EMP-SALARY > 85,000.00
AND NOT (B.BRANCH-LOCATION = ‘Delhi’)

But in the above WHERE clause, the first and the last predicates, NOT (B.BRANCH-LOCATION = ‘Delhi’), are identical. Now the query
simplifier module applies idempotency rule 1 (Table 11.2) to obtain the
following form:

NOT (B.BRANCH-LOCATION = ‘Delhi’)


AND B.PROFITS-TO-DATE > 100,00,000.00
AND E.EMP-SALARY > 85,000.00

The SQL query in our example finally looks like the following form:

SELECT D.DEPT-ID, M.BRANCH-MGR, M.BRANCH-ID,


B. BRANCH-ID, B.BRANCH-LOCATION,
E.EMP-NAME, E.EMP-SALARY
FROM DEPARTMENT AS D, MANAGER AS M, BRANCH AS B, EMPLOYEE AS E
WHERE D.DEPT-ID =M.DEPT-ID
AND M.BRANCH-ID = B.BRANCH-ID
AND M.BRANCH-MGR = E.EMP-ID
AND NOT (B.BRANCH-LOCATION = ‘Delhi’)
AND B.PROFITS-TO-DATE > 100,00,000.00
AND E.EMP-SALARY > 85,000.00
AND D.DEPT-LOCATION = ‘Bangalore’

Thus, in the above example, the original query contained many redundant
predicates, which were eliminated without changing the semantics of the
query.
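The view substitution mentioned earlier in this section can also be shown with a small SQL sketch (the view and the base table here are hypothetical): a query posed against the view is rewritten by the simplifier in terms of the base table, merging the two WHERE clauses.

-- A view restricting a hypothetical BRANCH table to one location.
CREATE VIEW MUMBAI_BRANCH AS
SELECT BRANCH_ID, BRANCH_MGR, PROFITS_TO_DATE
FROM   BRANCH
WHERE  BRANCH_LOCATION = 'Mumbai';

-- A query posed against the view ...
SELECT BRANCH_ID FROM MUMBAI_BRANCH WHERE PROFITS_TO_DATE > 10000000;

-- ... is simplified by substituting the view definition:
SELECT BRANCH_ID FROM BRANCH
WHERE  BRANCH_LOCATION = 'Mumbai' AND PROFITS_TO_DATE > 10000000;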

11.4.5 Query Restructuring


In the final stage of query decomposition, the query can be restructured to
give a more efficient implementation. Transformation rules are used to
convert one relational algebra expression into an equivalent form that is
more efficient. The query can now be regarded as a relational algebra
program, consisting of a series of operations on relations.

11.5 QUERY OPTIMIZATION

The primary goal of query optimiser is of choosing an efficient execution


strategy for processing a query. The query optimiser attempts to minimise
the use of certain resources (mainly the number of I/Os and CPU time) by
choosing the best of a set of alternative query access plans. Query
optimization starts during the validation phase by the system to validate
whether the user has appropriate privileges. Existing statistics for the tables
and columns are located, such as how many rows (tuples) exist in the table
and relevant indexes are found with their own applicable statistics. Now an
access plan is generated to perform the query. The access plan is then put
into effect with the execution plan of generated during query processing
phase, wherein the indexes and tables are accessed and the answer to the
query is derived from the data.
Fig. 11.6 shows a detailed block diagram of query optimiser. Following
four main inputs are used in the query optimiser module:
Relational algebra query trees generated by the query simplifier module of query
decomposer.
Estimation formulas used to determine the cardinality of the intermediate result tables.
A cost model.
Statistical data from the database catalogue.

The output of the query optimiser is the execution plan in the form of an optimised relational algebra query. A query typically has many possible
execution strategies, and the process of choosing a suitable one for
processing a query is known as query optimization. The basic issues in query
optimization are:
How to use available indexes.
How to use memory to accumulate information and perform intermediate steps such as
sorting.
How to determine the order in which joins should be performed.

The term query optimization does not mean that the chosen execution plan is always the optimal (best) strategy; it is just a reasonably efficient strategy for execution of the query. The decomposed query blocks of SQL are translated into an equivalent extended relational algebra expression (or
operators) and then optimised. There are two main techniques for
implementing query optimization. The first technique is based on heuristic
rules for ordering the operations in a query execution strategy. A heuristic
rule works well in most cases but is not guaranteed to work well in every
possible case. The rules typically reorder the operations in a query tree. The
second technique involves systematic estimation of the cost of different
execution strategies and choosing the execution plan with the lowest cost
estimate. Semantic query optimization is used in combination with the
heuristic query transformation rules. It uses constraints specified on the
database schema such as unique attributes and other more complex
constraints, in order to modify one query into another query that is more
efficient to execute.

Fig. 11.6 Detailed block diagram of query optimiser

11.5.1 Heuristic Query Optimization


Heuristic rules are used as an optimization technique to modify the internal representation of a query. Usually, heuristic rules are applied to the query tree or query graph data structure, as explained in Section 11.4.1 (Figs. 11.3 and 11.4), to improve its performance. One of the main heuristic rules is to apply SELECT operations before applying the JOIN or other binary operations. This is because the size of the file resulting from a binary operation such as JOIN is usually a multiplicative function of the sizes of the input files. The SELECT and PROJECT operations reduce the size of a file and hence should be applied before a JOIN or other binary operation.
The query tree of Fig. 11.3 (b) as discussed in section 11.4.1 is a simple
standard form that can be easily created. Now, the heuristic query optimiser
transforms the initial (canonical) query tree into a final query tree using
equivalence transformation rules. These final query trees are efficient to
execute.
Let us consider the following relations of a company database:

EMPLOYEE (EMP-NAME, EMP-ID, BIRTH-DATE, EMP-ADDRESS, SEX, EMP-SALARY, EMP-DEPT-NO)
DEPARTMENT (DEPT-NAME, DEPT-NO, DEPT-MGR-ID, DEPT-MGR-START-DATE)
DEPT_LOCATION (DEPT-NO, DEPT-LOCATION)
PROJECT (PROJ-NAME, PROJ-NO, PROJ-LOCATION, PROJ-DEPT-NO)
WORKS_ON (E-ID, P-NO, HOURS)
DEPENDENT (E-ID, DEPENDENT-NAME, SEX, BIRTH-DATE, RELATION)

Now, let us consider a query on the above database to find the names of employees born after 1970 who work on a project named ‘Growth’. This SQL query can be written as follows:

SELECT EMP-NAME
FROM   EMPLOYEE, WORKS_ON, PROJECT
WHERE  PROJ-NAME = ‘Growth’ AND PROJ-NO = P-NO
AND    E-ID = EMP-ID AND BIRTH-DATE > ‘31-12-1970’;

Fig. 11.7 (a) shows the initial query tree for the above SQL query. It can be observed that executing this initial query tree directly creates a very large file containing the CARTESIAN PRODUCT (×) of the entire EMPLOYEE, WORKS_ON and PROJECT files. However, the query needs only one tuple (record) from the PROJECT relation, for the ‘Growth’ project, and only the EMPLOYEE records of those whose date of birth is after ‘31-12-1970’.
Fig. 11.7 (b) shows an improved version of a query tree that first applies
the SELECT operations to reduce the number of tuples that appear in the
CARTESIAN PRODUCT. As shown in Fig. 11.7 (c), further improvement in
the query tree is achieved by applying more restrictive SELECT operations
and switching the positions of the EMPLOYEE and PROJECT relations in
the query tree. The information that PROJ-NO is a key attribute of
PROJECT relation is used. Hence, SELECT operation on the PROJECT
relation retrieves a single record only.
A further improvement in the query tree can be achieved by replacing any
CARTESIAN PRODUCT (×) operation and SELECT operations with JOIN
operations, as shown in Fig. 11.7 (d). Another improvement in the query tree
can be achieved by keeping only the attributes needed by the subsequent
operations in the intermediate relations, by including PROJECT (∏)
operations in the query tree, as shown in Fig. 11.7 (e). This reduces the
attributes (columns or fields) of the intermediate relations, whereas the
SELECT operations reduce the number of tuples (rows or records).
Fig. 11.7 Steps in converting query tree during heuristic optimization

(a) Initial query tree

(b) Improved query tree by applying SELECT operations


(c) Improved query tree by applying more restrictive SELECT operations
(d) Improved query tree by replacing CARTESIAN PRODUCT (×) and SELECT operations with
JOIN operations
(e) Improved query tree by moving PROJECT operations down the query

To summarise, we can conclude from the preceding example that a query tree can be transformed step by step into another, more efficient, executable query tree. However, one must ensure that the transformation steps always lead to an equivalent query tree so that the desired output is achieved.
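To make the heuristic concrete, the following minimal Python sketch (illustrative only; the node representation and the relation and attribute names are assumptions, not the book's code) builds a query tree and pushes a SELECT below a CARTESIAN PRODUCT whenever its predicate refers to only one input's attributes, mirroring the improvement from Fig. 11.7 (a) to Fig. 11.7 (b).

class Node:
    def __init__(self, op, children=(), **params):
        self.op = op              # 'RELATION', 'SELECT', 'PROJECT', 'PRODUCT'
        self.children = list(children)
        self.params = params      # e.g. name=..., attrs=..., pred_attrs=...

    def attributes(self):
        """Attributes visible at this node (union of children for PRODUCT)."""
        if self.op == 'RELATION':
            return set(self.params['attrs'])
        if self.op == 'PROJECT':
            return set(self.params['attrs'])
        return set().union(*(c.attributes() for c in self.children))

def push_selections_down(node):
    """Heuristic: move each SELECT as far down the tree as its attributes allow."""
    node.children = [push_selections_down(c) for c in node.children]
    if node.op == 'SELECT' and node.children[0].op == 'PRODUCT':
        product = node.children[0]
        needed = set(node.params['pred_attrs'])
        for i, side in enumerate(product.children):
            if needed <= side.attributes():
                # Re-anchor the SELECT on the one side that supplies its attributes.
                product.children[i] = Node('SELECT', [side], **node.params)
                return product
    return node

# Initial (canonical) tree for: select the tuple of project 'Growth'.
tree = Node('SELECT',
            [Node('PRODUCT', [
                Node('RELATION', name='EMPLOYEE', attrs=['EMP-ID', 'EMP-NAME', 'BIRTH-DATE']),
                Node('RELATION', name='PROJECT', attrs=['PROJ-NO', 'PROJ-NAME'])])],
            pred="PROJ-NAME = 'Growth'", pred_attrs=['PROJ-NAME'])

optimised = push_selections_down(tree)
print(optimised.op)                 # PRODUCT: the SELECT now sits below it
print(optimised.children[1].op)     # SELECT applied directly to PROJECT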

11.5.2 Transformation Rules


Transformation rules are used by the query optimiser to transform one relational algebra expression into an equivalent expression that is more efficient to execute. Two relations are considered equivalent if they have the same set of attributes, possibly in a different order, and represent the same information. These transformation rules are used to restructure the initial (canonical) relational algebra query tree generated during query decomposition. In the rules below, let R, S and T be three relations, with R defined over the attributes A = {A1, A2, …, An} and S defined over B = {B1, B2, …, Bn}; let c, c1, c2, …, cn denote predicates, and let L, L1, L2, M, M1, M2, N denote sets of attributes.

Rule 1: Cascading of Selection (σ)
σc1 ∧ c2 ∧ … ∧ cn (R) ≡ σc1 (σc2 (… (σcn (R)) …))

Example:
σBRANCH-LOCATION = ‘Mumbai’ ∧ EMP-SALARY > 85000 (EMPLOYEE) ≡ σBRANCH-LOCATION = ‘Mumbai’ (σEMP-SALARY > 85000 (EMPLOYEE))
Rule 2: Commutativity of Selection (σ)
σc1 (σc2 (R)) ≡ σc2 (σc1 (R))

Example:
σBRANCH-LOCATION = ‘Mumbai’ (σEMP-SALARY > 85000 (EMPLOYEE)) ≡ σEMP-SALARY > 85000 (σBRANCH-LOCATION = ‘Mumbai’ (EMPLOYEE))
Rule 3: Cascading of Projection (∏)
∏L (∏M (… (∏N (R)) …)) ≡ ∏L (R), where L ⊆ M ⊆ … ⊆ N

Example:
∏EMP-NAME (∏BRANCH-LOCATION, EMP-NAME (EMPLOYEE)) ≡ ∏EMP-NAME (EMPLOYEE)
Rule 4: Commutativity of Selection (σ) and Projection (∏)
∏L (σc (R)) ≡ σc (∏L (R)), provided the predicate c involves only the attributes in L

Example:
∏EMP-NAME, EMP-DOB (σEMP-NAME = ‘Thomas’ (EMPLOYEE)) ≡ σEMP-NAME = ‘Thomas’ (∏EMP-NAME, EMP-DOB (EMPLOYEE))
Rule 5: Commutativity of Join (⋈) and Cartesian product (×)
R ⋈c S ≡ S ⋈c R
R × S ≡ S × R

Example:
EMPLOYEE ⋈EMPLOYEE.BRANCH-NO = BRANCH.BRANCH-NO BRANCH ≡ BRANCH ⋈EMPLOYEE.BRANCH-NO = BRANCH.BRANCH-NO EMPLOYEE
Rule 6: Commutativity of Selection (σ) and Join (⋈) or Cartesian product (×)
If the selection predicate c involves only the attributes of R, the selection and the join (or Cartesian product) commute:
σc (R ⋈ S) ≡ (σc (R)) ⋈ S
σc (R × S) ≡ (σc (R)) × S

Alternatively, if the selection predicate is a conjunctive predicate of the form (c1 AND c2, or c1 ∧ c2), where condition c1 involves only the attributes of R and condition c2 involves only the attributes of S, the selection and join operations commute as follows:
σc1 ∧ c2 (R ⋈ S) ≡ (σc1 (R)) ⋈ (σc2 (S))

Example:
σEMP-TITLE = ‘Manager’ ∧ CITY = ‘Mumbai’ (EMPLOYEE ⋈EMPLOYEE.BRANCH-NO = BRANCH.BRANCH-NO BRANCH) ≡ (σEMP-TITLE = ‘Manager’ (EMPLOYEE)) ⋈EMPLOYEE.BRANCH-NO = BRANCH.BRANCH-NO (σCITY = ‘Mumbai’ (BRANCH))
Rule 7: Commutativity of Projection (∏) and Join (⋈) or Cartesian product (×)
If the projection list is L = L1 ∪ L2, where L1 involves only attributes of R and L2 involves only attributes of S, and the join condition c involves only attributes in L, then:
∏L (R ⋈c S) ≡ (∏L1 (R)) ⋈c (∏L2 (S))

Example:
∏EMP-TITLE, CITY, BRANCH-NO (EMPLOYEE ⋈EMPLOYEE.BRANCH-NO = BRANCH.BRANCH-NO BRANCH) ≡ (∏EMP-TITLE, BRANCH-NO (EMPLOYEE)) ⋈EMPLOYEE.BRANCH-NO = BRANCH.BRANCH-NO (∏CITY, BRANCH-NO (BRANCH))

If the join condition c contains additional attributes not in L, say the attributes M = M1 ∪ M2 where M1 involves only attributes of R and M2 involves only attributes of S, then these must be added to the projection lists and a final Projection (∏) operation is needed:
∏L (R ⋈c S) ≡ ∏L ((∏L1 ∪ M1 (R)) ⋈c (∏L2 ∪ M2 (S)))

Example:
∏EMP-TITLE, CITY (EMPLOYEE ⋈EMPLOYEE.BRANCH-NO = BRANCH.BRANCH-NO BRANCH) ≡ ∏EMP-TITLE, CITY ((∏EMP-TITLE, BRANCH-NO (EMPLOYEE)) ⋈EMPLOYEE.BRANCH-NO = BRANCH.BRANCH-NO (∏CITY, BRANCH-NO (BRANCH)))
Rule 8: Commutativity of Union (∪) and Intersection (∩)
R ∪ S ≡ S ∪ R
R ∩ S ≡ S ∩ R
Rule 9: Commutativity of Selection (σ) and set operations such as Union (∪), Intersection (∩) and Set difference (−)
σc (R ∪ S) ≡ σc (R) ∪ σc (S)
σc (R ∩ S) ≡ σc (R) ∩ σc (S)
σc (R − S) ≡ σc (R) − σc (S)

If θ stands for any of the set operations Union (∪), Intersection (∩) or Set difference (−), the above expressions can be written as:
σc (R θ S) ≡ (σc (R)) θ (σc (S))

Rule 10: Commutativity of Projection (∏) and Union (∪)
∏L (R ∪ S) ≡ (∏L (R)) ∪ (∏L (S))
Rule 11: Associativity of Join (⋈) and Cartesian product (×)
(R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)
(R × S) × T ≡ R × (S × T)

If the join condition c involves only attributes from the relations S and T, then the join is associative in the following manner:
(R ⋈b S) ⋈c T ≡ R ⋈b (S ⋈c T), where b is the join condition between R and S

If θ stands for any of the operations Join (⋈), Union (∪), Intersection (∩) or Cartesian product (×), then the above expression can be written as:
(R θ S) θ T ≡ R θ (S θ T)
Rule 12: Associativity of Union (∪) and Intersection (∩)
(R ∪ S) ∪ T ≡ R ∪ (S ∪ T)
(R ∩ S) ∩ T ≡ R ∩ (S ∩ T)
Rule 13: Converting a Selection and Cartesian Product (σ, ×)
sequence into Join (⋈)
σc (R × S) ≡ (R ⋈c S)
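Because the rules assert equivalences, they can be spot-checked on concrete data. The following small Python sketch (illustrative only; the sample EMPLOYEE tuples and predicate names are assumptions) verifies Rules 1, 2 and 3 on a relation modelled as a list of dictionaries.

def select(pred, rel):
    return [t for t in rel if pred(t)]

def project(attrs, rel):
    # Use a set of tuples to model the duplicate elimination of relational PROJECT.
    return sorted({tuple(t[a] for a in attrs) for t in rel})

EMPLOYEE = [
    {'EMP-NAME': 'Asha',   'BRANCH-LOCATION': 'Mumbai', 'EMP-SALARY': 90000},
    {'EMP-NAME': 'Ravi',   'BRANCH-LOCATION': 'Delhi',  'EMP-SALARY': 95000},
    {'EMP-NAME': 'Thomas', 'BRANCH-LOCATION': 'Mumbai', 'EMP-SALARY': 60000},
]

c1 = lambda t: t['BRANCH-LOCATION'] == 'Mumbai'
c2 = lambda t: t['EMP-SALARY'] > 85000

# Rule 1 (cascading of selection): sigma_{c1 AND c2}(R) == sigma_{c1}(sigma_{c2}(R))
assert select(lambda t: c1(t) and c2(t), EMPLOYEE) == select(c1, select(c2, EMPLOYEE))

# Rule 2 (commutativity of selection): sigma_{c1}(sigma_{c2}(R)) == sigma_{c2}(sigma_{c1}(R))
assert select(c1, select(c2, EMPLOYEE)) == select(c2, select(c1, EMPLOYEE))

# Rule 3 (cascading of projection): pi_L(pi_M(R)) == pi_L(R) when L is a subset of M
lhs = project(['EMP-NAME'],
              [dict(zip(['BRANCH-LOCATION', 'EMP-NAME'], row))
               for row in project(['BRANCH-LOCATION', 'EMP-NAME'], EMPLOYEE)])
assert lhs == project(['EMP-NAME'], EMPLOYEE)

print('Rules 1, 2 and 3 hold on the sample relation')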

Examples of Transformation Rules

Let us consider the SQL query in which the prospective renters are looking
for a ‘Bungalow’. Now, we have to develop a query to find the properties
that match their requirements and are owned by owner ‘Mathew’.
The SQL query for the above requirement can be written as:

SELECT P.PROPERTY-NO, P.CITY
FROM   CLIENT AS C, VIEWING AS V, PROPERTY_FOR_RENT AS P
WHERE  C.PREF-TYPE = ‘Bungalow’ AND
       C.CLIENT-NO = V.CLIENT-NO AND
       V.PROPERTY-NO = P.PROPERTY-NO AND
       C.MAX-RENT >= P.RENT AND
       C.PREF-TYPE = P.TYPE AND
       P.OWNER = ‘Mathew’;

The above SQL query is converted into the following relational algebra expression:

∏P.PROPERTY-NO, P.CITY (σC.PREF-TYPE = ‘Bungalow’ ∧ C.CLIENT-NO = V.CLIENT-NO ∧ V.PROPERTY-NO = P.PROPERTY-NO ∧ C.MAX-RENT >= P.RENT ∧ C.PREF-TYPE = P.TYPE ∧ P.OWNER = ‘Mathew’ ((C × V) × P))

The above query is represented as the initial (canonical) relational algebra tree shown in Fig. 11.8 (a).
Now, the following transformation rules can be applied to improve the efficiency of the execution:
Rule 1 to split the conjunction of Selection operations into individual Selection operations, then Rule 2 and Rule 6 to reorder the Selection operations and then commute the Selections and Cartesian products. The result is shown in Fig. 11.8 (b).
Rewrite each Selection with an equijoin predicate over a Cartesian product operation as an equijoin operation. The result is shown in Fig. 11.8 (c).
Rule 11 to reorder the equijoins so that the more restrictive selection on P.OWNER = ‘Mathew’ is performed first, as shown in Fig. 11.8 (d).
Rule 4 and Rule 7 to move the Projections down past the equijoins and create new Projection operations as required. The result is shown in Fig. 11.8 (e).
Reduce the Selection operation C.PREF-TYPE = P.TYPE to P.TYPE = ‘Bungalow’, because the first clause already fixes C.PREF-TYPE = ‘Bungalow’. This allows the Selection to be pushed further down the tree, resulting in the final reduced relational algebra tree shown in Fig. 11.8 (f).
Fig. 11.8 Relational algebra tree optimization using transformation rules

(a) Initial (canonical) relational algebra query tree

(b) Improved query tree by applying SELECT operations


(c) Improved query tree by changing Selection and Cartesian products to equijoins

(d) Improved query tree using associatives of equijoins


(e) Improved query tree by moving PROJECT operations down the query
(f) Final reduced relational algebra query tree

11.5.3 Heuristic Optimization Algorithm


Database management systems use a heuristic optimization algorithm that utilises some of the transformation rules to transform an initial (canonical) query tree into an optimised and efficiently executable query tree. The steps of the heuristic optimization algorithm that can be applied during query processing and optimization are shown in Table 11.3.
In our example of heuristic query optimization (Section 11.5.1), Fig. 11.7
(b) shows the improved version of query tree after applying steps 1 and 2 of
Table 11.3. Fig. 11.7 (c) shows the query tree after applying step 4, Fig. 11.7
(d) after step 3 and Fig. 11.7 (e) after applying step 5.

Table 11.3 Heuristic optimization algorithm


Step 1: Perform Selection operations as early as possible to reduce the subsequent processing of the relation.
Use transformation rule 1 to break up any SELECT operations with conjunctive conditions into a cascade of SELECT operations.

Step 2: Commute each SELECT operation with other operations as early as possible to move it down the query tree.
Use transformation rules 2, 4, 6 and 9 concerning the commutativity of SELECT with other (unary and binary) operations, and move each SELECT operation as far down the tree as is permitted by the attributes involved in the SELECT condition. Keep selection predicates on the same relation together.

Step 3: Combine a Cartesian product with a subsequent SELECT operation whose predicate represents a join condition into a JOIN operation.
Use transformation rule 13 to combine a Cartesian product operation with a subsequent SELECT operation.

Step 4: Use the commutativity and associativity of binary operations.
Use transformation rules 5, 11 and 12 concerning commutativity and associativity to rearrange the leaf nodes of the tree so that the leaf nodes with the most restrictive Selection operations are executed first in the query tree representation. The most restrictive SELECT operations are (a) the ones that produce a relation with the fewest tuples (records) or with the smallest absolute size, and (b) the ones with the smallest selectivity. Make sure that the ordering of leaf nodes does not cause Cartesian product operations.

Step 5: Perform Projection operations as early as possible to reduce the cardinality of the relation and the subsequent processing of that relation, and move the Projection operations as far down the query tree as possible.
Use transformation rules 3, 4, 7 and 10 concerning the cascading and commuting of Projection operations with other (binary) operations. Break down and move the Projection attributes down the tree as far as possible by creating new PROJECT operations as needed. Keep the projection attributes on the same relation together.

Step 6: Compute common expressions once.
Identify sub-trees that represent groups of operations that can be executed by a single algorithm.
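The steps of Table 11.3 can be viewed as an ordered pipeline of rewrite passes over the query tree. The Python sketch below is only a skeleton of that organisation: the pass bodies are placeholders, and a real optimiser would implement each one using the transformation rules cited in the table.

def cascade_selections(tree):        # Step 1: rule 1
    return tree

def push_selections_down(tree):      # Step 2: rules 2, 4, 6, 9
    return tree

def products_to_joins(tree):         # Step 3: rule 13
    return tree

def reorder_leaf_nodes(tree):        # Step 4: rules 5, 11, 12
    return tree

def push_projections_down(tree):     # Step 5: rules 3, 4, 7, 10
    return tree

def share_common_subtrees(tree):     # Step 6
    return tree

HEURISTIC_PASSES = [
    cascade_selections,
    push_selections_down,
    products_to_joins,
    reorder_leaf_nodes,
    push_projections_down,
    share_common_subtrees,
]

def optimise(tree):
    """Apply the heuristic passes of Table 11.3, in order, to the canonical tree."""
    for heuristic_pass in HEURISTIC_PASSES:
        tree = heuristic_pass(tree)
    return tree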
11.6 COST ESTIMATION IN QUERY OPTIMIZATION

The main aim of query optimization is to choose the most efficient way of implementing the relational algebra operations at the lowest possible cost. Therefore, the query optimiser should not depend solely on heuristic rules; it should also estimate the cost of executing the different strategies and find the strategy with the minimum cost estimate. The method of optimising a query by choosing the strategy that results in the minimum cost is called cost-based query optimization. Cost-based query optimization uses formulae that estimate the costs for a number of options and selects the option with the lowest cost, that is, the one most efficient to execute. The cost functions used in query optimization are estimates and not exact cost functions, so the optimization may select a query execution strategy that is not the optimal one.
The cost of an operation is heavily dependent on its selectivity, that is, the proportion of the input relation(s) that forms the output. In general, different algorithms are suitable for low- and high-selectivity queries. In order for a query optimiser to choose a suitable algorithm for an operation, an estimate of the cost of executing that algorithm must be provided. The cost of an algorithm is dependent on the cardinality of its input. To estimate the cost of different query execution strategies, the query tree is viewed as containing a series of basic operations which are linked in order to perform the query. Each basic operation has an associated cost function whose argument(s) are the cardinality of its input(s). It is also important to know the expected cardinality of an operation’s output, since this forms the input to the next operation in the tree. The expected cardinalities are derived from statistical estimates of a query’s selectivity, that is, the proportion of tuples satisfying the query.

11.6.1 Cost Components of Query Execution


The success of estimating size and cost of intermediate relational algebra
operations depends on the amount and accuracy of the statistical data
information stored with the database management system (DBMS). The cost
of executing a query includes the following components:
a. Access cost to secondary storage: Access cost is the cost of searching for, reading and writing data blocks (consisting of a number of tuples or records) that reside on secondary storage, mainly on the disks of the DBMS. The cost of searching for tuples in a database relation (or table or file) depends on the type of access structures on that relation, such as ordering, hashing and primary or secondary indexes. In addition, factors such as whether the file blocks are allocated contiguously on the same disk cylinder or scattered across the disk affect the access cost.
b. Storage cost: Storage cost is the cost of storing any intermediate relations (or tables or files) that are generated by the execution strategy for the query.
c. Computation cost: Computation cost is the cost of performing in-memory operations on the data buffers during query execution. Such operations include searching for and sorting records, merging records for a join and performing computations on field values.
d. Memory usage cost: Memory usage cost is the cost pertaining to the number of memory buffers needed during query execution.
e. Communication cost: Communication cost is the cost of transferring the query and its results from the database site to the site or terminal of query origination.

Out of the above five cost components, the most important is the secondary storage access cost. The emphasis of cost minimisation depends on the size and type of the database application. For example, in smaller databases the emphasis is on minimising the computation cost, because most of the data in the files involved in the query can be stored completely in main memory. For large databases, the main emphasis is on minimising the access cost to secondary storage. For distributed databases, the communication cost is minimised, because many sites are involved in the data transfer.
To estimate the costs of various execution strategies, we must keep track
of any information that is needed for the cost functions. This information
may be stored in the DBMS catalog, where it is accessed by the query
optimiser. Typically, the DBMS is expected to hold the following types of
information in its system catalogue:
i. The number of tuples (records) in relation R, given as [nTuples(R)].
ii. The average record size in relation R.
iii. The number of blocks required to store relation R, given as [nBlocks(R)].
iv. The blocking factor of relation R (that is, the number of tuples of R that fit into one block),
given as [bFactor(R)].
v. Primary access method for each file.
vi. Primary access attributes for each file.
vii. The number of levels of each multi-level index I (primary, secondary, or clustering), given as [nLevelsA(I)].
viii. The number of first-level index blocks, given as [nLfBlocksA(I)].
ix. The number of distinctive values that appear for attribute A in relation R, given as
[nDistinctA(R)].
x. The minimum and maximum possible values for attribute A in relation R, given as [minA(R),
maxA(R)].
xi. The selectivity of an attribute, which is the fraction of records satisfying an equality
condition on the attribute.
xii. The selection cardinality of attribute A in relation R, given as [SCA(R)]. The selection
cardinality is the average number of tuples (records) that satisfy an equality condition on
attribute A.

To use these statistics in estimating the cost of various execution strategies, the query optimiser needs reasonably up-to-date values of frequently changing parameters, such as the number of tuples (records) in a relation (or file). However, updating the catalogue every time a tuple is inserted, deleted or updated would have a significant impact on the performance of the DBMS during peak times. Alternatively, the DBMS may update the statistics on a periodic basis, for example fortnightly, or whenever the system is idle. This keeps the cost estimates reasonably accurate while minimising the overhead of maintaining them.
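As an illustration of how these statistics feed the estimates, the short Python sketch below (an assumption-laden example, not the book's code) computes the selection cardinality SCA(R) and a range-selectivity estimate under the usual uniform-distribution assumption.

import math

def selection_cardinality(n_tuples, n_distinct):
    """SC_A(R): average number of tuples satisfying an equality condition A = value."""
    return math.ceil(n_tuples / n_distinct)

def range_cardinality(n_tuples, min_a, max_a, value):
    """Estimated tuples satisfying A > value, assuming values are spread uniformly."""
    if value >= max_a:
        return 0
    if value <= min_a:
        return n_tuples
    return math.ceil(n_tuples * (max_a - value) / (max_a - min_a))

# Hypothetical catalogue statistics for an attribute with 6,000 tuples and 1,000 distinct values:
print(selection_cardinality(6000, 1000))              # about 6 tuples per value
print(range_cardinality(6000, 20000, 80000, 30000))   # tuples with value > 30000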

11.6.2 Cost Function for SELECT operation


As discussed in Chapter 4, Section 4.4.1, the Selection operation in the relational algebra works on a single relation R and defines a relation S containing only those tuples of R that satisfy the specified predicate. There are a number of different implementation strategies for the Selection operation, depending on the structure of the file in which the relation is stored, whether the attributes involved in the predicate have been indexed or hashed, and so on. Table 11.4 shows the estimated costs of different strategies for the Selection operation.

Table 11.4 Estimated cost of strategies for Selection operation


Strategy: Cost estimate
Linear search: [nBlocks(R)/2], if the record is found; [nBlocks(R)], if no record satisfies the condition
Binary search: [log2(nBlocks(R))], if the equality condition is on a key attribute, because SCA(R) = 1 in this case; [log2(nBlocks(R))] + [SCA(R)/bFactor(R)] − 1, otherwise
Using primary index or hash key to retrieve a single record: 1, assuming no overflow
Equality condition on primary key: [nLevelsA(I) + 1]
Inequality condition on primary key: [nLevelsA(I) + 1] + [nBlocks(R)/2]
Inequality condition on a secondary index (B+-tree): [nLevelsA(I)] + [nLfBlocksA(I)/2] + [nTuples(R)/2]
Equality condition on clustering (secondary) index: [nLevelsA(I) + 1] + [SCA(R)/bFactor(R)]
Equality condition on non-clustering (secondary) index: [nLevelsA(I) + 1] + [SCA(R)]
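The following Python sketch (illustrative only) expresses a few of the cost formulas of Table 11.4 as functions; costs are counted in disk-block accesses, and ceil() models reading whole blocks.

from math import ceil, log2

def cost_linear_search(n_blocks, key_found=True):
    # On average half the blocks are scanned when the matching record is found;
    # all blocks are scanned when no record satisfies the condition.
    return ceil(n_blocks / 2) if key_found else n_blocks

def cost_binary_search_key(n_blocks):
    # Equality condition on a key attribute, so SC_A(R) = 1.
    return ceil(log2(n_blocks))

def cost_hash_single_record():
    return 1                       # assuming no overflow

def cost_primary_index_equality(n_levels):
    return n_levels + 1

def cost_btree_inequality(n_levels, n_leaf_blocks, n_tuples):
    # Inequality condition on a secondary (B+-tree) index.
    return n_levels + ceil(n_leaf_blocks / 2) + ceil(n_tuples / 2)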

Example of Cost Estimation for Selection Operation

Let us consider the relation EMPLOYEE having following attributes:

EMPLOYEE (EMP-ID, DEPT-ID, POSITION, SALARY)

Let us consider the following assumptions:


There is a hash index with no overflow on the primary key attribute EMP-ID.
There is a clustering index on the foreign key attribute DEPT-ID.
There is B+-tree index on the SALARY attribute.

Let us also assume that the EMPLOYEE relation has the following
statistics stored in the system catalog:

nTuples(EMPLOYEE) = 6,000
bFactor(EMPLOYEE) = 60
nBlocks(EMPLOYEE) = [nTuples(EMPLOYEE)/bFactor(EMPLOYEE)] = 6,000/60 = 100
nDistinctDEPT-ID(EMPLOYEE) = 1,000
SCDEPT-ID(EMPLOYEE) = [nTuples(EMPLOYEE)/nDistinctDEPT-ID(EMPLOYEE)] = 6,000/1,000 = 6
nDistinctPOSITION(EMPLOYEE) = 20
SCPOSITION(EMPLOYEE) = [nTuples(EMPLOYEE)/nDistinctPOSITION(EMPLOYEE)] = 6,000/20 = 300
nDistinctSALARY(EMPLOYEE) = 1,000
SCSALARY(EMPLOYEE) = [nTuples(EMPLOYEE)/nDistinctSALARY(EMPLOYEE)] = 6,000/1,000 = 6
minSALARY(EMPLOYEE) = 20,000
maxSALARY(EMPLOYEE) = 80,000
nLevelsDEPT-ID(I) = 2
nLevelsSALARY(I) = 2
nLfBlocksSALARY(I) = 50
The estimated cost of a linear search on the key attribute EMP-ID is 50
blocks and the cost of a linear search on a non-key attribute is 100 blocks.
Now let us consider the following Selection operations, and use the
strategies of Table 11.4 to improve on these two costs:

Selection 1: σEMP-ID = ‘106519’ (EMPLOYEE)
Selection 2: σPOSITION = ‘Manager’ (EMPLOYEE)
Selection 3: σDEPT-ID = ‘SPA-04’ (EMPLOYEE)
Selection 4: σSALARY > 30000 (EMPLOYEE)
Selection 5: σPOSITION = ‘Manager’ ∧ DEPT-ID = ‘SPA-04’ (EMPLOYEE)

Now we will choose the query execution strategies by comparing the costs as follows:

Selection 1: This Selection operation contains an equality condition on the primary key EMP-ID of the relation EMPLOYEE. Therefore, as the attribute EMP-ID is hashed, we can use Strategy 3 of Table 11.4 to estimate the cost as 1 block. The estimated cardinality of the result relation is SCEMP-ID (EMPLOYEE) = 1.
Selection 2: The attribute in the predicate is a non-key, non-indexed attribute. Therefore, we cannot improve on the linear search method, giving an estimated cost of 100 blocks. The estimated cardinality of the result relation is SCPOSITION (EMPLOYEE) = 300.
Selection 3: The attribute in the predicate is a foreign key with a clustering index. Therefore, we can use strategy 7 of Table 11.4 to estimate the cost as (2 + [6/60]) = 3 blocks. The estimated cardinality of the result relation is SCDEPT-ID (EMPLOYEE) = 6.
Selection 4: The predicate here involves a range search on the SALARY attribute, which has a B+-tree index. Therefore, we can use strategy 6 of Table 11.4 to estimate the cost as (2 + [50/2] + [6000/2]) = 3027 blocks. However, this is significantly worse than the linear search strategy, so a linear search should be used in this case. Assuming salaries are uniformly distributed, the estimated cardinality of the result relation is SCSALARY (EMPLOYEE) = [6000 × (80000 − 30000)/(80000 − 20000)] = 5000.
Selection 5: While retrieving each tuple using the clustering index, we can check whether it satisfies the first condition (POSITION = ‘Manager’). We know that the estimated cardinality of the second condition is SCDEPT-ID (EMPLOYEE) = 6. Let us call this intermediate relation S. Then the number of distinct values of POSITION in S can be estimated as [(6 + 20)/3] = 9. Let us now apply the second condition using the clustering index on DEPT-ID (Selection 3 above), which has an estimated cost of 3 blocks. Thus, the estimated cardinality of the result relation will be SCPOSITION (S) = 6/9 ≈ 1, which would be correct if there is one manager for each department.
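As a quick check (again only a sketch, repeating two of the functions from the earlier listing so the snippet runs on its own), the formulas reproduce the 1-block, 100-block and 3027-block estimates obtained above for Selections 1, 2 and 4.

from math import ceil

def cost_linear_search(n_blocks, key_found=True):
    return ceil(n_blocks / 2) if key_found else n_blocks

def cost_btree_inequality(n_levels, n_leaf_blocks, n_tuples):
    return n_levels + ceil(n_leaf_blocks / 2) + ceil(n_tuples / 2)

print(1)                                          # Selection 1: hash key lookup, 1 block
print(cost_linear_search(100, key_found=False))   # Selection 2: linear search, 100 blocks
print(cost_btree_inequality(2, 50, 6000))         # Selection 4: 2 + 25 + 3000 = 3027 blocks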

11.6.3 Cost Function for JOIN operation


Join operation is the most time-consuming operation to process. An estimate
for the size (number of tuples or records) of the file that results after the join
operation is required to develop reasonably accurate cost functions for join
operations. As discussed in chapter 4, Section 4.4.3, the Join operation
defines a relation containing tuples that satisfy a specified predicate F from
the Cartesian product of two relations R and S. Table 11.5 shows the
estimation of costs for different strategies for join operation.

Table 11.5 Estimated cost of strategies for Join operation


Strategy: Cost estimate
Block nested-loop join:
(a) nBlocks(R) + (nBlocks(R) * nBlocks(S)), if the buffer has only one block
(b) nBlocks(R) + [nBlocks(S) * (nBlocks(R)/(nBuffer − 2))], if (nBuffer − 2) blocks are available for R
(c) nBlocks(R) + nBlocks(S), if all blocks of R can be read into the database buffer
Indexed nested-loop join:
(a) nBlocks(R) + nTuples(R) * (nLevelsA(I) + 1), if the join attribute A in S is the primary key
(b) nBlocks(R) + nTuples(R) * (nLevelsA(I) + [SCA(R)/bFactor(R)]), for a clustering index I on attribute A
Sort-merge join:
(a) nBlocks(R) * [log2(nBlocks(R))] + nBlocks(S) * [log2(nBlocks(S))], for the sorts
(b) nBlocks(R) + nBlocks(S), for the merge
Hash join:
(a) 3(nBlocks(R) + nBlocks(S)), if the hash index is held in memory
(b) 2(nBlocks(R) + nBlocks(S)) * [log(nBlocks(S)) − 1] + nBlocks(R) + nBlocks(S), otherwise
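The join cost formulas of Table 11.5 can likewise be sketched as Python functions (an illustration only, under the assumption that R denotes the outer relation and S the inner relation).

from math import ceil, log2

def cost_block_nested_loop(nblocks_r, nblocks_s, n_buffer):
    if n_buffer > nblocks_r + 1:
        return nblocks_r + nblocks_s                     # all of R fits in the buffer
    if n_buffer > 2:
        return nblocks_r + nblocks_s * ceil(nblocks_r / (n_buffer - 2))
    return nblocks_r + nblocks_r * nblocks_s             # single-block buffer

def cost_indexed_nested_loop_pk(nblocks_r, ntuples_r, n_levels):
    # Join attribute of S is its primary key, accessed through an index.
    return nblocks_r + ntuples_r * (n_levels + 1)

def cost_sort_merge(nblocks_r, nblocks_s, already_sorted=False):
    merge = nblocks_r + nblocks_s
    if already_sorted:
        return merge
    sorts = nblocks_r * ceil(log2(nblocks_r)) + nblocks_s * ceil(log2(nblocks_s))
    return sorts + merge

def cost_hash_join_in_memory(nblocks_r, nblocks_s):
    return 3 * (nblocks_r + nblocks_s)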

Example of Cost Estimation for Join Operation

Let us consider the relations EMPLOYEE, DEPARTMENT and PROJECT having the following attributes:

EMPLOYEE (EMP-ID, DEPT-ID, POSITION, SALARY)


DEPARTMENT (DEPT-ID, EMP-ID)
PROJECT (PROJ-ID, DEPT-ID, EMP-ID)

Let us consider the following assumptions:


There are separate hash indexes with no overflow on the primary key attributes EMP-ID of the relation EMPLOYEE and DEPT-ID of the relation DEPARTMENT.
There are 200 database buffer blocks.

Let us also assume that the EMPLOYEE relation has the following
statistics stored in the system catalog:

nTuples(EMPLOYEE) = 12,000
bFactor(EMPLOYEE) = 120
nBlocks(EMPLOYEE) = [nTuples(EMPLOYEE)/bFactor(EMPLOYEE)] = 12,000/120 = 100
nTuples(DEPARTMENT) = 600
bFactor(DEPARTMENT) = 60
nBlocks(DEPARTMENT) = [nTuples(DEPARTMENT)/bFactor(DEPARTMENT)] = 600/60 = 10
nTuples(PROJECT) = 80,000
bFactor(PROJECT) = 40
nBlocks(PROJECT) = [nTuples(PROJECT)/bFactor(PROJECT)] = 80,000/40 = 2,000

Now let us consider the following two Join operations and use the strategies of Table 11.5 to improve on the costs:

JOIN 1: EMPLOYEE ⋈EMP-ID PROJECT


JOIN 2: DEPARTMENT ⋈DEPT-ID PROJECT

The estimated I/O cost of Join operations for the above two joins is shown
in Table 11.6.
Table 11.6 Estimated I/O cost of Join operations for Join 1 and Join 2

It can be seen for both Join 1 and Join 2 that the cardinality of the result relation can be no larger than the cardinality of the first relation, since we are joining over the key of the first relation. Note also that no single strategy is best for both join operations. The sort-merge join is the best for Join 1, provided both relations are already sorted. The indexed nested-loop join is the best for Join 2.

11.7 PIPELINING AND MATERIALIZATION

When a query is composed of several relational algebra operators, the result of one operator is sometimes pipelined to another operator without creating a temporary relation to hold the intermediate result. When the input relation to a unary operation (for example, Selection or Projection) is pipelined into it, the operation is sometimes said to be applied on-the-fly. Pipelining (or on-the-fly processing) is therefore used to improve the performance of queries. Otherwise, the results of intermediate algebra operations are temporarily written to secondary storage (disk). If the output of an operation is saved in a temporary relation for processing by the next operation, the tuples are said to be materialized, and this process of temporarily writing intermediate results is called materialization. The materialization process starts from the lowest-level operations in the expression, which are at the bottom of the query tree. The inputs to the lowest-level operations are the relations (tables) in the database. The lowest-level operations on the input relations are executed and their results stored in temporary relations. These temporary relations are then used to execute the operations at the next level up in the tree. Thus, in the materialization process, the output of one operation is stored in a temporary relation for processing by the next operation. By repeating the process, the operation at the root of the tree is evaluated, giving the final result of the expression. The process is called materialization because the result of each intermediate operation is created (materialised) and then used for evaluation of the next-level operations. The cost of a materialised evaluation includes the cost of writing the result of each operation, that is, the temporary relation(s), to secondary storage.
Alternatively, the efficiency of query evaluation can be improved by reducing the number of temporary files that are produced. To do this, several relational operations are combined into a pipeline of operations, in which the result of one operation is passed to another operation without creating a temporary relation to hold the intermediate result. A pipeline is implemented as a separate process within the DBMS. Each pipeline takes a stream of tuples from its inputs and creates a stream of tuples as its output. A buffer is created for each pair of adjacent operations to hold the tuples being passed from the first operation to the second one. Pipelining eliminates the cost of reading and writing temporary relations.

Advantages
The use of pipelining saves on the cost of creating temporary relations and reading the results
back in again.

Disadvantages
The inputs to operations are not necessarily available all at once for processing. This can
restrict the choice of algorithms.
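The difference between the two evaluation methods can be illustrated with a toy Python sketch (the ROOMS data and the operator names are assumptions for the example): the materialised version builds a temporary list for every intermediate result, while the pipelined version streams tuples through generators so no intermediate relation is ever built.

ROOMS = [{'ROOM-NO': i, 'TYPE': 'S' if i % 2 else 'D', 'PRICE': 100 * i} for i in range(1, 7)]

# Materialised evaluation: SELECT then PROJECT, each producing a temporary result.
temp1 = [t for t in ROOMS if t['PRICE'] > 200]          # temporary relation 1
temp2 = [(t['ROOM-NO'], t['PRICE']) for t in temp1]     # temporary relation 2
print(temp2)

# Pipelined (on-the-fly) evaluation: each operator pulls tuples from the one below it.
def scan(rel):
    for t in rel:
        yield t

def select_op(pred, rows):
    return (t for t in rows if pred(t))

def project_op(attrs, rows):
    return (tuple(t[a] for a in attrs) for t in rows)

pipeline = project_op(['ROOM-NO', 'PRICE'],
                      select_op(lambda t: t['PRICE'] > 200, scan(ROOMS)))
print(list(pipeline))    # same answer, produced one tuple at a time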

11.8 STRUCTURE OF QUERY EVALUATION PLANS

An evaluation plan defines exactly what algorithm should be used for each operation and how the execution of the operations should be coordinated. So far we have discussed two basic approaches to choosing an execution (access) plan, namely (a) heuristic optimization and (b) cost-based optimization. Most query optimisers combine elements of both these approaches. Fig. 11.9 shows one possible evaluation plan.

Fig. 11.9 An evaluation plan

11.8.1 Query Execution Plan


The query execution plan may be classified into the following:
Left-deep (join) tree query execution plan.
Right-deep query execution plan.
Linear tree query execution plan.
Bushy (non-linear) tree query execution plan.

The above terms were defined by Graefe and DeWitt in 1987. They refer to how operations are combined to execute the query. The naming convention relates to the way the inputs of binary operations, particularly joins, are treated. Most operations treat their two inputs in different ways, so the performance characteristics differ according to the ordering of the inputs. Fig. 11.10 illustrates the different schemes of query evaluation plans.
A left-deep (join) tree query execution plan starts from a relation (table) and constructs the result by successively adding an operation involving a single relation (table) until the query is completed. That is, only one input into a binary operation is an intermediate result. The term relates to how operations are combined to execute the query: only the left-hand side of a join is allowed to be something that results from a previous join, and hence the name left-deep tree. Fig. 11.10 (a) shows an example of a left-deep query execution plan. All the relational algebra trees we have discussed in the earlier sections of this chapter are left-deep (join) trees. The left-deep tree query execution plan has the advantages of reducing the search space and allowing the query optimiser to be based on dynamic programming techniques. Left-deep join plans are particularly convenient for pipelined evaluation, since the right operand is a stored relation, and thus only one input to each join is pipelined. The main disadvantage is that, in reducing the search space, many alternative execution strategies are not considered, some of which may be of lower cost than the one found using the linear tree.
Fig. 11.10 Query execution plan

Right-deep tree execution plans have applications where there is a large


main memory. Fig. 11.10 (b) shows an example of right-deep query
execution plan.
The combinations of left-deep and right-deep trees are also known as linear trees, as shown in Fig. 11.10 (c). With linear trees, the relation on one side of each operator is always a base relation. However, because we need to examine the entire inner relation for each tuple of the outer relation, inner relations must always be materialised. This makes left-deep trees appealing, as inner relations are always base relations and thus already materialised.
Bushy (also called non-linear) tree execution plans are the most general type of plan. They allow both inputs into a binary operation to be intermediate results. Fig. 11.10 (d) shows an example of a bushy query execution plan. Left-deep and right-deep plans are special cases of bushy plans. The advantage of this added flexibility is that a wider variety of plans can be considered, which may yield better plans for some queries. However, the disadvantage is that this flexibility may considerably increase the search space.
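The search-space argument can be illustrated with a small Python sketch (relation names assumed for the example) that enumerates left-deep plans: for n relations there are only n! left-deep orderings, far fewer than the number of possible bushy trees.

from itertools import permutations

def left_deep_plans(relations):
    """Yield left-deep join trees as nested tuples: ((R1 join R2) join R3) ..."""
    for order in permutations(relations):
        plan = order[0]
        for rel in order[1:]:
            plan = (plan, rel)       # only the left input may be an intermediate result
        yield plan

plans = list(left_deep_plans(['EMPLOYEE', 'DEPARTMENT', 'PROJECT']))
print(len(plans))      # 3! = 6 left-deep orderings for three relations
print(plans[0])        # (('EMPLOYEE', 'DEPARTMENT'), 'PROJECT')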

REVIEW QUESTIONS
1. What do you mean by the term query processing? What are its objectives?
2. What are the typical phases of query processing? With a neat sketch discuss these phases in
high-level query processing.
3. Discuss the reasons for converting SQL queries into relational algebra queries before query
optimization is done.
4. What is syntax analyser? Explain with an example.
5. What is the objective of query decomposer? What are the typical phases of query
decomposition? Describe these phases with a neat sketch.
6. What is a query execution plan?
7. What is query optimization? Why is it needed?
8. With a detailed block diagram, explain the function of query optimization.
9. What is meant by the term heuristic optimization? Discuss the main heuristics that are
applied during query optimization to improve the processing of query.
10. Explain how heuristic query optimization is performed with an example.
11. How does a query tree represent a relational algebra expression?
12. Write and justify an efficient relational algebra expression that is equivalent to the following
given query:

SELECT B1.BANK-NAME
FROM BANK1 AS B1, BANK2 AS B2
WHERE B1.ASSETS > B2.ASSETS AND
B2.BANK-LOCATION = ‘Jamshedpur’

13. What is query tree? What is meant by an execution of a query tree? Explain with an example.
14. What is relational algebra query tree?
15. What is the objective of query normalization? What are its equivalence rules?
16. What is the purpose of syntax analyser? Explain with an example.
17. What is the objective of a query simplifier? What are the idempotence rules used by the query simplifier? Give an example to explain the concept.
18. What are query transformation rules?
19. Discuss the rules for transformation of query trees and identify when each rule should be
applied during optimization.
20. Discuss the main cost components for a cost function that is used to estimate query execution
cost.
21. What cost components are used most often as the basis for cost functions?
22. List the cost functions for the SELECT and JOIN operations.
23. What are the cost functions of the SELECT operation for a linear search and a binary search?
24. Consider the relations R(A, B, C), S(C, D, E) and T(E, F), with primary keys A, C and E,
respectively. Assume that R has 2000 tuples, S has 3000 tuples, and T has 1000 tuples.
Estimate the size of R ⋈ S ⋈ T and give an efficient strategy for computing the join.
25. What is meant by semantic query optimization?
26. What are heuristic optimization algorithms? Discuss various steps in heuristic optimization
algorithm.
27. What is a query evaluation plan? What are its advantages and disadvantages?
28. Discuss the different types of query evaluation trees with the help of a neat sketch.
29. What is materialization?
30. What is pipelining? What are its advantages?
31. Let us consider the following relations (tables) that form part of a database of a relational
DBMS:

HOTEL (HOTEL-NO, HOTEL-NAME, CITY)


ROOM (ROOM-NO, HOTEL-NO, TYPE, PRICE)
BOOKING (HOTEL-NO, GUEST-NO, DATE-FROM, DATE-TO,
ROOM-NO)
GUEST (GUEST-NO, GUEST-NAME, GUEST-ADDRESS)

Using the above HOTEL schema, determine whether the following queries are semantically
correct:

(a) SELECT R.TYPE, R.PRICE


FROM ROOM AS R, HOTEL AS H
WHERE R.HOTEL-NUM = H.HOTEL-NUM AND
H.HOTEL-NAME = ‘Taj Residency’ AND
R.TYPE > 100;
(b) SELECT G.GUEST-NO, G.GUEST-NAME
FROM GUEST AS G, BOOKING AS B, HOTEL AS
H
WHERE R.HOTEL-NO = B.HOTEL-NO AND
H.HOTEL-NAME = ‘Taj Residency’;
(c) SELECT R.ROOM-NO, H.HOTEL-NO
FROM ROOM AS R, HOTEL AS H, BOOKING AS H
WHERE H.HOTEL-NO = B.HOTEL-NO AND
H.HOTEL-NO = ‘H40’ AND
B.ROOM-NO = R.ROOM-NO AND
R.TYPE > ‘S’ AND B.HOTEL-NO = ‘H50’;

32. Using the hotel schema of exercise 31, draw a relational algebra tree for each of the
following queries. Use the heuristic rules to transform the queries into a more efficient form.

(a) SELECT R.ROOM-NO, R.TYPE, R.PRICE


FROM ROOM AS R, HOTEL AS H, BOOKING AS B
WHERE R.ROOM-NO = B.ROOM-NO AND
B.HOTEL-NO = H.HOTEL-NO AND
H.HOTEL-NAME = ‘Taj Residency’ AND
R.PRICE > 1000;
(b) SELECT G.GUEST-NO, G.GUEST-NAME
FROM GUEST AS G, BOOKING AS B, HOTEL AS
H, ROOM AS R
WHERE H.HOTEL-NO = B.HOTEL-NO AND
G.GUEST-NO = B.GUEST-NO AND
H.HOTEL-NO = R.HOTEL-NO AND
H.HOTEL-NAME = ‘Taj Residency’ AND
B.DATE-FROM >= ‘1-Jan-05’ AND
B.DATE-TO <= ‘31-Dec-05’;
33. Using the hotel schema of exercise 31, let us consider the following assumptions:

There is a hash index with no overflow on the primary key attributes ROOM-NO,
HOTEL-NO in the relation ROOM.
There is a clustering index on the foreign key attribute HOTEL-NO in the relation
ROOM.
There is B+-tree index on the PRICE attribute in the relation ROOM.
There is a secondary index on the attribute type in the relation ROOM.

Let us also assume that the schema has the following statistics stored in the system catalogue:

nTuples(ROOM) = 10,000
nTuples(HOTEL) = 50
nTuples(BOOKING) = 100000
nDistinctHOTEL-NO = 50
(ROOM)
nDistinctTYPE (ROOM) = 10
nDistinctPRICE (ROOM) = 500
minPRICE (ROOM) = 200
maxPRICE (ROOM) = 50
nLevelsHOTEL-NO (I) =2
nLevelPRICE (I) =2
nLfBlocksPRICE(I) = 50
bFactor(ROOM) = 200
bFactor(HOTEL) = 40
bFactor(BOOKING) = 60
a. Calculate the cardinality and minimum cost for each of the following Selection
operations:

Selection 1: σROOM-NO = 1 ∧ HOTEL-NO = ‘H040’ (ROOM)
Selection 2: σTYPE = ‘D’ (ROOM)
Selection 3: σHOTEL-NO = ‘H050’ (ROOM)
Selection 4: σPRICE > 100 (ROOM)
Selection 5: σTYPE = ‘S’ ∧ HOTEL-NO = ‘H060’ (ROOM)
Selection 6: σTYPE = ‘S’ ∧ PRICE < 100 (ROOM)

b. Calculate the cardinality and minimum cost for each of the following Join
operations:

Join 1: HOTEL ⋈HOTEL-NO ROOM
Join 2: HOTEL ⋈HOTEL-NO BOOKING
Join 3: ROOM ⋈ROOM-NO BOOKING
Join 4: ROOM ⋈HOTEL-NO HOTEL
Join 5: BOOKING ⋈HOTEL-NO HOTEL
Join 6: BOOKING ⋈ROOM-NO ROOM

STATE TRUE/FALSE

1. Query processing is the procedure of selecting the most appropriate plan that is used in
responding to a database request.
2. Execution plan is a series of query compilation steps.
3. The cost of processing a query is usually dominated by secondary storage access, which is
slow compared to memory access.
4. The transformed query is used to create a number of strategies called execution (or access)
plans.
5. The internal query representation is usually a binary query tree.
6. A query is contradictory if its predicate cannot be satisfied by any tuple in the relation(s).
7. A query tree is also called a relational algebra tree.
8. Heuristic rules are used as an optimization technique to modify the internal representation of
a query.
9. Transformation rules are used by the query optimiser to transform one relational algebra
expression into an equivalent expression that is more efficient to execute.
10. Systematic query optimization is used for estimation of the cost of different execution
strategies and choosing the execution plan with the lowest cost estimate.
11. Usually, heuristic rules are used in the form of query tree or query graph data structure.
12. The heuristic optimization algorithm utilizes some of the transformation rules to transform an
initial query tree into an optimised and efficiently executable query tree.
13. The emphasis of cost minimisation depends on the size and type of database applications.
14. The success of estimating size and cost of intermediate relational algebra operations depends
on the amount and accuracy of the statistical data information stored with the database
management system (DBMS).
15. The cost of materialised evaluation includes the cost of writing the result of each operation to
the secondary storage and reading them back for the next operation.
16. Combining operations into a pipeline eliminates the cost of reading and writing temporary
relations.

TICK (✓) THE APPROPRIATE ANSWER

1. During the query processing, the syntax of the query is checked by

a. parser.
b. compiler.
c. syntax checker.
d. none of these.

2. A query execution strategy is evaluated by

a. access or execution plan.


b. query tree.
c. database catalog
d. none of these.

3. The query is parsed, validated, and optimised in the method called

a. static query optimization.


b. recursive query optimization.
c. dynamic query optimization.
d. repetitive query optimization.

4. The first phase of query processing is

a. decomposition.
b. restructuring.
c. analysis.
d. none of these.

5. In which phase of the query processing is the query lexically and syntactically analysed using
parsers to find out any syntax errors?

a. normalization.
b. semantic analysis.
c. analysis.
d. all of these.
6. Which of the following represents the result of a query in a query tree?

a. root node.
b. leaf node.
c. intermediate node.
d. none of these.

7. In which phase of query processing are queries that are incorrectly formulated or contradictory rejected?

a. simplification.
b. semantic analysis.
c. analysis.
d. none of these.

8. The objective of query simplifier is

a. transformation of the query to a semantically equivalent and more efficient form.


b. detection of redundant qualifications.
c. elimination of common sub-expressions.
d. all of these.

9. Which of the following is not true?

a. R ∪ S = S ∪ R.
b. R ∩ S = S ∩ R.
c. R − S = S − R.
d. All of these.

10. Which of the following transformation is referred to as cascade of selection?

a.
b.
c. ∏L∏M………∏N (R) ≡ ∏L.
d.

11. Which of the following transformation is referred to as commutativity of selection?

a.
b.
c. ∏L∏M………∏N (R) ≡ ∏L.
d.

12. Which of the following transformation is referred to as cascade of projection?

a.
b.
c. ∏L∏M………∏N (R) ≡ ∏L.
d.

13. Which of the following transformation is referred to as commutativity of selection and


projection?

a.
b.
c. ∏L∏M………∏N (R) ≡ ∏L.
d.

14. Which of the following transformation is referred to as commutativity of projection and join?

a.
b. R ∪ S = S ∪ R.
c. R ∩ S = S ∩ R.
d. both (b) and (c).

15. Which of the following transformation is referred to as commutativity of union and


intersection?

a.
b. R ∪ S = S ∪ R.
c. R ⋂ S = S ⋂ R.
d. both (b) and (c).

16. Which of the following will produce an efficient execution strategy?

a. performing Projection operations as early as possible.


b. performing Selection operations as early as possible.
c. computing common expressions only once.
d. all of these.

17. Which of the following cost is the most important cost component to be considered during
the cost-based query optimization?

a. memory uses cost.


b. secondary storage access cost.
c. communication cost.
d. all of these.

18. Usually, heuristic rules are used in the form of

a. query tree.
b. query graph data structure.
c. both (a) and (b).
d. either (a) or (b).

19. The emphasis of cost minimization depends on the

a. size of database applications.


b. type of database applications.
c. both (a) and (b)
d. none of these.

20. The success of estimating the size and cost of intermediate relational algebra operations depends on the

a. amount of statistical data information stored with the DBMS.


b. accuracy of statistical data information stored with the DBMS.
c. both (a) and (b).
d. none of these.

21. Which of the following query processing method is more efficient?

a. pipelining.
b. materialization.
c. tunnelling.
d. none of these.

FILL IN THE BLANKS

1. A query processor transforms a _____ query into an _____ that performs the required
retrievals and manipulations in the database.
2. Execution plan is a series of _____ steps.
3. In syntax-checking phase of query processing the system _____ the query and checks that it
obeys the _____ rules.
4. _____ is the process of transforming a query written in SQL (or any high-level language)
into a correct and efficient execution strategy expressed in a low-level language.
5. During the query transformation process, the _____ checks the syntax and verifies if the
relations and the attributes used in the query are defined in the database.
6. Query transformation is performed by transforming the query into _____ that are more
efficient to execute.
7. The four main phases of query processing are (a) _____, (b) _____, (c) _____ and (d) _____.
8. The two types of query optimization techniques are (a) _____ and (b) _____.
9. In _____, the query is parsed, validated and optimised once.
10. The objective of _____ is to transform the high-level query into a relational algebra query
and to check whether that query is syntactically and semantically correct.
11. The five stages of query decomposition are (a) _____ , (b) _____, (c) _____, (d) _____ and
(e) _____.
12. In the _____ stage, the query is lexically and syntactically analysed using parsers to find out
any syntax error.
13. In _____ stage, the query is converted into normalised form that can be more easily
manipulated.
14. In _____ stage, incorrectly formulated and contradictory queries are rejected.
15. _____ uses the transformation rules to convert one relational algebraic expression into an
equivalent form that is more efficient.
16. The main cost components of query optimization are (a) _____ and (b) _____.
17. A query tree is also called a _____ tree.
18. Usually, heuristic rules are used in the form of _____ or _____ data structure.
19. The heuristic optimization algorithm utilises some of the transformation rules to transform an
_____ query tree into an _____ and _____ query tree.
20. The emphasis of cost minimization depends on the _____ and _____ of database
applications.
21. The process of query evaluation in which several relational operations are combined into a
pipeline of operations is called_____.
22. If the results of the intermediate processes in a query are created and then are used for
evaluation of the next-level operations, this kind of query execution is called _____.
Chapter 12

Transaction Processing and Concurrency Control

12.1 INTRODUCTION

A transaction is a logical unit of work that represents real-world events of an organisation or an enterprise, whereas concurrency control is the management of concurrent transaction execution. Transaction processing systems execute database transactions against large databases with hundreds of concurrent users; examples include railway and airline reservation systems, banking systems, credit card processing, stock market monitoring, and supermarket inventory and checkout systems. Transaction processing and concurrency control are important activities of any database system.
In this chapter, we will learn the main properties of database transaction
and how SQL can be used to present transactions. We will discuss the
concurrency control problems and how DBMS enforces concurrency control
to take care of lost updates, uncommitted data and inconsistent summaries
that can occur during concurrent transaction execution. We will finally
examine various methods used by the concurrency control algorithm such as
locks, deadlocks, time stamping and optimistic methods.

12.2 TRANSACTION CONCEPTS

A transaction is a logical unit of work of database processing that includes one or more database access operations. A transaction can be defined as an action, or series of actions, that is carried out by a single user or application program to perform operations for accessing the contents of the database.
The operations can include retrieval (Read), insertion (Write), deletion and modification. A transaction must either complete or be aborted. A transaction is a program unit whose execution may change the contents of a database. It can either be embedded within an application program or be specified interactively via a high-level query language such as SQL. Its execution preserves the consistency of the database, and no intermediate states are acceptable: if the database is in a consistent state before a transaction executes, then the database should still be in a consistent state after its execution. Therefore, to ensure these conditions and preserve the integrity of the database, a database transaction must be atomic. An atomic transaction is a transaction in which either all actions associated with the transaction are executed to completion or none are performed. In other words, each transaction should access shared data without interfering with the other transactions, and whenever a transaction successfully completes its execution, its effect should be permanent. However, if for any reason a transaction fails to complete its execution (for example, due to a system failure), it should not have any effect on the stored database. This basic abstraction frees the database application programmer from the following concerns:
Inconsistencies caused by conflicting updates from concurrent users.
Partially completed transactions in the event of systems failure.
User-directed undoing of transactions.

Let us take an example in which a client (or consumer) of Reliance mobile wants to pay his mobile bill using Reliance’s on-line bill payment facility. The client will do the following:
Log on to the Reliance site, enter user name and password and select the bill information
system page.
Enter the mobile number in the bill information system page. The site will display the bill
details and the amount that the client has to pay.
Select the on-line payment facility by clicking the appropriate link for his bank. The link will connect the client to his bank’s system.
Enter his credit card detail and the bill amount (for example INR 2000) to be paid to Reliance
mobile.

For the client, the entire process as explained above is a single operation
called transaction, which is payment of the mobile bill to Reliance mobile.
But within the database system, this comprises several operations. It is
essential that either all these operations occur, in which case the bill payment will be successful, or, in case of a failure, none of the operations take place, in which case the bill payment will be unsuccessful and the client will be asked to try again. It is unacceptable if the client’s credit card account is debited and Reliance mobile’s account is not credited: the client would lose the money and his mobile number would be deactivated.
A transaction is a sequence of READ and WRITE actions that are grouped together to form a database access. Whenever we read from and/or write to (update) the database, a transaction is created. A transaction may consist of a simple SELECT operation to generate a list of table contents, or it may consist of a series of related UPDATE command sequences. A transaction can include the following basic database access operations:
Read_item(X): This operation reads a database item named X into a program variable Y. Execution of the Read_item(X) command includes the following steps:
Find the address of disk block that contains the item X.
Copy that disk block into a buffer in main memory.
Copy item X from the buffer to the program variable named Y.

Write_item(X): This operation writes the value of a program variable Y into the database item named X. Execution of the Write_item(X) command includes the following steps:
Find the address of the disk block that contains item X.
Copy that disk block into a buffer in main memory.
Copy item X from the program variable named Y into its correct location in the buffer.
Store the updated block from the buffer back to disk.
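A toy Python sketch of these two operations (illustrative only; the block identifier, item name and values are assumptions) with dicts standing in for disk blocks and main-memory buffers:

disk = {'block-7': {'X': 1500}}           # the disk block containing database item X
buffer_pool = {}                          # main-memory buffers

def read_item(block_id, item):
    if block_id not in buffer_pool:                     # find and copy the disk block
        buffer_pool[block_id] = dict(disk[block_id])    # into a main-memory buffer
    return buffer_pool[block_id][item]                  # copy item into a program variable

def write_item(block_id, item, value):
    if block_id not in buffer_pool:
        buffer_pool[block_id] = dict(disk[block_id])
    buffer_pool[block_id][item] = value                 # update the buffered copy of X
    disk[block_id] = dict(buffer_pool[block_id])        # store the block back to disk

y = read_item('block-7', 'X')             # Read_item(X): Y := X
write_item('block-7', 'X', y + 500)       # Write_item(X): X := Y + 500
print(disk['block-7']['X'])               # 2000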

Below is an example of a transaction that updates columns (attributes) in several relation (table) rows (tuples) by incrementing their values by 500:

BEGIN_TRANSACTION_1:
READ (TABLE = T1, ROW = 15, OBJECT = COL1);
:COL1 = COL1 + 500;
WRITE (TABLE = T1, ROW = 15, OBJECT = COL1, VALUE = :COL1);
READ (TABLE = T2, ROW = 30, OBJECT = COL2);
:COL2 = COL2 + 500;
WRITE (TABLE = T2, ROW = 30, OBJECT = COL2, VALUE = :COL2);
READ (TABLE = T3, ROW = 45, OBJECT = COL3);
:COL3 = COL3 + 500;
WRITE (TABLE = T3, ROW = 45, OBJECT = COL3, VALUE = :COL3);
END_OF_TRANSACTION_1;

As can be seen from the above update operation, the transaction is basically divided into three pairs of READ and WRITE operations. Each pair reads the value of a column from a table row and increments it by the given amount, then writes the new value back into the column before proceeding to the next table.
Fig. 12.1 illustrates an example of a typical loan transaction that updates a
salary database table of M/s KLY Associates. In this example, a loan amount
of INR 10000.00 is being subtracted from an already stored loan value of
INR 80000.00. After the update, it leaves INR 70000.00 as loan balance in
the database.
A transaction that changes the contents of the database must alter the
database from one consistent state to another. A consistent database state is
one in which all data integrity constraints are satisfied. To ensure database
consistency, every transaction must begin with the database in a known
consistent state. If the database is not in a consistent state, the transaction
will result into an inconsistent database that violates its integrity and
business rules.
Much of the complexity of database management systems (DBMSs) can be hidden behind the transaction interface. In applications such as distributed and multimedia systems, the transaction interface is used by the DBMS designer as a means of making the system complexity transparent to the user and insulating applications from implementation details.
Fig. 12.1 An example of transaction update of a salary database

12.2.1 Transaction Execution and Problems


A transaction is not necessarily just a single database operation, but is a
sequence of several such operations that transforms a consistent state of the
database into another consistent state, without necessarily preserving
consistency of all intermediate points. The simplest case of a transaction
processing system forces all transactions into a single stream and executes
them serially, allowing no concurrent execution at all. This is not a practical
strategy for large multi-user database, so mechanisms to enable multiple
transactions to execute without causing conflicts or inconsistencies are
necessary.
A transaction which successfully completes its execution is said to have
been committed. Otherwise, the transaction is aborted. Thus, if a committed
transaction performs any update operation on the database, its effect must be
reflected on the database even if there is a failure. A transaction can be in
one of the following states:
a. Active state: After the transaction starts its operation.
b. Partially committed: After the final operation of the transaction has been executed.
c. Aborted: When the normal execution can no longer be performed.
d. Committed: After successful completion of transaction.
A transaction may be aborted when the transaction itself detects an error
during execution from which it cannot recover, for example, a transaction trying to debit an employee's loan amount from an insufficient gross salary.
A transaction may also be aborted before it has been committed due to
system failure or any other circumstances beyond its control. When a
transaction aborts due to any reason, the DBMS either kills the transaction
or restarts the execution of transaction. A DBMS restarts the execution of
transaction when the transaction is aborted without any logical errors in the
transaction. In either case, any effect on the stored database due to the
aborted transaction must be eliminated.
A transaction is said to be in a committed state if it has partially
committed and it can be ensured that it will never be aborted. Thus, before a
transaction can be committed, the DBMS must take appropriate steps to
guard against a system failure. But, once a transaction is committed, its
effect must be made permanent even if there is a failure.
Fig. 12.2 illustrates a state transition diagram that describes how a
transaction moves through its execution states. A transaction goes into an
active state immediately after it starts execution, where it can issue READ
and WRITE operations. When the transaction ends, it moves to the partially
committed state. At this point, some recovery protocols need to ensure that a
system failure will not result in an inability to record the changes of the
transaction permanently. Once this check is successful, the transaction is said
to have reached its commit point and enters the committed state. Once a
transaction is committed, it has concluded its execution successfully and all
its changes must be recorded permanently in the database. However, a
transaction can go to an aborted state if one of the checks fails or if the
transaction is aborted during its active state. The transaction may then have
to be rolled back to undo the effect of its WRITE operations on the database.
In the terminated state, the transaction information maintained in system
tables while the transaction has been running is removed. Failed or aborted
transactions may be restarted later, either automatically or after being
resubmitted by the user as new transactions.
Fig. 12.2 Transaction execution state transition diagram
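
The permitted transitions can be summarised in a small sketch. The Python fragment below is illustrative only; the ALLOWED table is simply a transcription of the transitions described above and is an assumption, not DBMS code.

# Illustrative sketch of the transaction state transitions described above.
ALLOWED = {
    "ACTIVE":              {"PARTIALLY_COMMITTED", "ABORTED"},   # end of transaction or failure
    "PARTIALLY_COMMITTED": {"COMMITTED", "ABORTED"},             # commit point reached or a check fails
    "COMMITTED":           {"TERMINATED"},
    "ABORTED":             {"TERMINATED"},                       # may later be restarted as a new transaction
}

def move(state, new_state):
    if new_state not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

state = "ACTIVE"
state = move(state, "PARTIALLY_COMMITTED")
state = move(state, "COMMITTED")
state = move(state, "TERMINATED")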

In a multiprogramming/multi-user environment, the recovery problem of a transaction after its failure is compounded by the cascading effect: other transactions that have read the failed transaction's uncommitted data may also have to be rolled back.

12.2.2 Transaction Execution with SQL


The American National Standards Institute (ANSI) has defined standards
that govern SQL database transactions. Transaction support is provided by
two SQL statements namely COMMIT and ROLLBACK. The ANSI
standards require that, when a transaction sequence is initiated by a user or
an application program, it must continue through all succeeding SQL
statements until one of the following four events occurs:
A COMMIT statement is reached, in which case all changes are permanently recorded within
the database. The COMMIT statement automatically ends the SQL transaction. The
COMMIT operation indicates a successful end-of-transaction.
A ROLLBACK statement is reached, in which case all the changes are aborted and the
database is rolled back to its previous consistent state. The ROLLBACK operation indicates
an unsuccessful end-of-transaction.
The end of a program is successfully reached, in which case all changes are permanently
recorded within the database. This action is equivalent to COMMIT.
The program is abnormally terminated, in which case the changes made in the database are
aborted and the database is rolled back to its previous consistent state. This action is
equivalent to ROLLBACK.

Let us consider an example of COMMIT, which updates an employee’s loan balance (EMP-LOAN-BAL) and the project’s cost (PROJ-COST) in the tables EMPLOYEE and PROJECT, respectively.
UPDATE EMPLOYEE
SET EMP-LOAN-BAL = EMP-LOAN-BAL - 10000
WHERE EMP-ID = ‘106519’
UPDATE PROJECT
SET PROJ-COST = PROJ-COST + 40000
WHERE PROJ-ID = ‘PROJ-1’
COMMIT;

As shown in the above example, a transaction begins implicitly when the first SQL statement is encountered. Not all SQL implementations follow the ANSI standard. Some SQL implementations use the following transaction execution statements to indicate the beginning and end of a new transaction:

BEGIN TRANSACTION_T1

READ (TABLE = EMPLOYEE, EMP-ID = ‘106519’, OBJECT = EMP-LOAN-BAL);
:EMP-LOAN-BAL = EMP-LOAN-BAL - 10000;
WRITE (TABLE = EMPLOYEE, EMP-ID = ‘106519’, OBJECT = EMP-LOAN-BAL, VALUE = :EMP-LOAN-BAL);
READ (TABLE = PROJECT, PROJ-ID = ‘PROJ-1’, OBJECT = PROJ-COST);
:PROJ-COST = PROJ-COST + 40000;
WRITE (TABLE = PROJECT, PROJ-ID = ‘PROJ-1’, OBJECT = PROJ-COST, VALUE = :PROJ-COST);

END TRANSACTION_T1;
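
Readers who wish to observe this behaviour can do so with the standard Python sqlite3 module. The sketch below is an illustration only; the schema is invented for the example (hyphens in the book's column names are replaced by underscores, since hyphens are not valid in SQL identifiers), and it shows a transaction that either commits both updates or rolls both back.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EMP_ID TEXT PRIMARY KEY, EMP_LOAN_BAL INTEGER)")
conn.execute("CREATE TABLE PROJECT  (PROJ_ID TEXT PRIMARY KEY, PROJ_COST INTEGER)")
conn.execute("INSERT INTO EMPLOYEE VALUES ('106519', 80000)")
conn.execute("INSERT INTO PROJECT  VALUES ('PROJ-1', 500000)")
conn.commit()

try:
    # the transaction begins implicitly with the first UPDATE statement
    conn.execute("UPDATE EMPLOYEE SET EMP_LOAN_BAL = EMP_LOAN_BAL - 10000 WHERE EMP_ID = '106519'")
    conn.execute("UPDATE PROJECT  SET PROJ_COST    = PROJ_COST    + 40000 WHERE PROJ_ID = 'PROJ-1'")
    conn.commit()                  # successful end-of-transaction
except sqlite3.Error:
    conn.rollback()                # unsuccessful end-of-transaction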

12.2.3 Transaction Properties


A transaction must have the following four properties, called ACID
properties (also called ACIDITY of a transaction), to ensure that a database
remains in a stable state after the transaction is executed:
Atomicity.
Consistency.
Isolation.
Durability.

Atomicity: The atomicity property of a transaction requires that all operations of a transaction be completed; if not, the transaction is aborted. In other words, a transaction is treated as a single, indivisible logical unit of work. Therefore, a transaction must execute and complete each operation in its logic before it commits its changes. As stated earlier, the transaction is considered as one operation even though there are multiple reads and writes. Thus, a transaction completes or fails as one unit. The atomicity property of a
transaction is ensured by the transaction recovery subsystem of a DBMS. In
the event of a system crash in the midst of transaction execution, the
recovery techniques undo any effects of the transaction on the database.
Atomicity is also known as all or nothing.
Consistency: Database consistency is the property that every transaction
sees a consistent database instance. In other words, execution of a
transaction must leave a database in either its prior stable state or a new
stable state that reflects the new modifications (updates) made by the
transaction. If the transaction fails, the database must be returned to the state
it was in prior to the execution of the failed transaction. If the transaction
commits, the database must reflect the new changes. Thus, all resources are
always in a consistent state. The preservation of consistency is generally the
responsibility of the programmers who write the database programs or of the
DBMS module that enforces integrity constraints. A database program
should be written in a way that guarantees that, if the database is in a
consistent state before executing the transaction, it will be in a consistent
state after the complete execution of the transaction, assuming that no
interference with other transactions occurs. In other words, a transaction must
transform the database from one consistent state to another consistent state.
Isolation: Isolation property of a transaction means that the data used
during the execution of a transaction cannot be used by a second transaction
until the first one is completed. This property isolates transactions from one
another. In other words, if a transaction T1 is being executed and is using the
data item X, that data item cannot be accessed by any other transaction (T2
…… Tn ) until T1 ends. The transaction must act as if it is the only one
running against the database. It acts as if it owned its own copy and could
not affect other transactions executing against their own copies of the
database. No other transaction is allowed to see the changes made by a
transaction until the transaction safely terminates and returns the database to
a new stable or prior stable state. Thus, transactions do not interfere with
each other. The isolation property of a transaction is particularly important in
multi-user database environments because several different users can access
and update the database at the same time. The isolation property is enforced
by the concurrency control subsystem of the DBMS.
Durability: The durability property of a transaction indicates the permanence of the database’s consistent state. It states that the changes
made by a transaction are permanent. They cannot be lost by either a system
failure or by the erroneous operation of a faulty transaction. When a
transaction is completed, the database reaches a consistent state and that
state cannot be lost, even in the event of system’s failure. Durability property
is the responsibility of the recovery subsystem of the DBMS.
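
A minimal sketch of the "all or nothing" behaviour follows, using an in-memory dictionary as a stand-in for the database; the names and the rollback-by-copy mechanism are assumptions made purely for the illustration.

# Illustrative sketch: either both updates are applied or neither is (atomicity).
database = {"EMP_LOAN_BAL": 80000, "PROJ_COST": 500000}

def run_atomically(db, operations):
    before = dict(db)              # remember the prior consistent state
    try:
        for op in operations:
            op(db)                 # apply each operation of the transaction
    except Exception:
        db.clear()
        db.update(before)          # failure: restore the prior state (rollback)
        raise
    # success: the new state is retained (commit)

def debit_loan(db):
    db["EMP_LOAN_BAL"] -= 10000

def add_project_cost(db):
    db["PROJ_COST"] += 40000

run_atomically(database, [debit_loan, add_project_cost])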

12.2.4 Transaction Log (or Journal)


To support transaction processing, DBMSs maintain a record of every change made to the database in a log (also called a journal). The DBMS
maintains this log to keep track of all transaction operations that affect the
values of database items. This helps DBMS to be able to recover from
failures that affect transactions. Log is a record of all transactions and the
corresponding changes to the database. The information stored in the log is
used by the DBMS for a recovery requirement triggered by a ROLLBACK statement, a program’s abnormal termination, a system (power or
network) failure, or disk crash. Some relational database management
systems (RDBMSs) use the transaction log to recover a database forward to
a currently consistent state. After a server failure, these RDBMSs (for example, ORACLE) automatically roll back uncommitted transactions and roll forward transactions that were committed but not yet written to the
physical database storage.
The DBMS automatically updates the transaction log while executing
transactions that modify the database. The transaction log stores before-and-
after data about the database and any of the tables, rows and attribute values
that participated in the transaction. The beginning and the ending
(COMMIT) of the transaction are also recorded in the transaction log. The
use of a transaction log increases the processing overhead of a DBMS and
the overall cost of the system. However, its ability to restore a corrupted
database is worth the price. For each transaction, the following data is
recorded on the log:
A start-of-transaction marker.
The transaction identifier which could include who and where information.
The record identifiers which include the identifiers for the record occurrences.
The operation(s) performed on the records (for example, insert, delete, modify).
The previous value(s) of the modified data. This information is required for undoing the
changes made by a partially completed transaction. It is called the undo log. Where the
modification made by the transaction is the insertion of a new record, the previous values can
be assumed to be null.
The updated value(s) of the modified record(s). This information is required for making sure
that the changes made by a committed transaction are in fact reflected in the database and can
be used to redo these modifications. This information is called the redo part of the log. In
case the modification made by the transaction is the deletion of a record, the updated values
can be assumed to be null.
A commit transaction marker if the transaction is committed, otherwise an abort or rollback
transaction marker.

The log is written before any updates are made to the database. This is
called write-ahead log strategy. In this strategy, a transaction is not allowed
to modify the physical database until the undo portion of the log has been written to stable storage. Table 12.1 illustrates an example of a transaction log for section 12.2.2, in which the previous two SQL sequences are reflected for the database
tables EMPLOYEE and PROJECT. In case of a system failure, the DBMS
examines the transaction log for all uncommitted or incomplete transactions
and restores (ROLLBACK) the database to its previous state based on the
information in the transaction log. When the recovery process is completed, the DBMS applies to the physical database all committed transactions recorded in the transaction log that had not been physically written to the database before the failure occurred. The TRANSACTION-ID is automatically assigned by the DBMS. If a
ROLLBACK is issued before the termination of a transaction, the DBMS
restores the database only for that particular transaction, rather than for all
transactions, in order to maintain the durability of the previous transactions.
In other words, committed transactions are not rolled back.
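
A hedged sketch of the kind of undo/redo record described above follows; the field names and the in-memory list are assumptions, whereas real systems write compact records to stable storage.

# Illustrative sketch of undo/redo log records written ahead of the database update.
log = []

def log_update(txn_id, table, row_id, column, old_value, new_value):
    # the undo part (old_value) and the redo part (new_value) are appended to the log
    # before the physical database page is modified (write-ahead logging)
    log.append({
        "txn": txn_id, "table": table, "row": row_id, "column": column,
        "undo": old_value, "redo": new_value,
    })

log.append({"txn": "T1", "marker": "START"})
log_update("T1", "EMPLOYEE", "106519", "EMP_LOAN_BAL", 80000, 70000)
log.append({"txn": "T1", "marker": "COMMIT"})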

Table 12.1 Example of a transaction log

The transaction log itself is a database. It is managed by the DBMS like any other database. The transaction log is kept on disk, so it is not affected by any type of failure except disk failure. Thus, the transaction log is subject to such common database dangers as disk-full conditions and disk crashes. Because the transaction log contains some of the most critical data in a DBMS, some implementations support periodic backups of transaction logs on several different disks or on tapes to reduce the risk of a system failure.

12.3 CONCURRENCY CONTROL

Concurrency control is the process of managing the simultaneous execution of transactions (such as queries, updates, inserts, deletes and so on) in a multiprocessing database system without having them interfere with one another. This property of a DBMS allows many transactions to access the same database at the same time without interfering with each other. The primary goal of concurrency control is to ensure the serialisability of the execution of transactions in a multi-user database environment. Concurrency control mechanisms attempt to interleave the READ and
WRITE operations of multiple transactions so that the interleaved execution
yields results that are identical to the results of a serial schedule execution.
This interleaving creates the impression that the transactions are executing
concurrently. Concurrency control is important because the simultaneous
execution of transactions over shared database can create several data
integrity and consistency problems.

12.3.1 Problems of Concurrency Control


When concurrent transactions are executed in an uncontrolled manner,
several problems can occur. The three main problems caused by uncontrolled concurrency are:
Lost updates.
Dirty read (or uncommitted data).
Unrepeatable read (or inconsistent retrievals).

12.3.1.1 Lost Update Problem


A lost update problem occurs when two transactions that access the same database items have their operations interleaved in a way that makes the value of some database item incorrect. In other words, if transactions T1 and T2 both read a record and then update it, the effect of the first update will be overwritten
by the second update. Let us consider an example where two accountants in
a Finance Department of M/s KLY Associates are updating the salary record
of a marketing manager ‘Abhishek’. The first accountant is giving an annual
salary adjustment to ‘Abhishek’ and the second accountant is reimbursing
the travel expenses of his marketing tours to customer organisations. Without
a suitable concurrency control mechanism the effect of the first update will
be overwritten by the second.
Fig. 12.3 Example of lost update

Fig. 12.3 illustrates an example of lost update in which the update performed by the transaction T2 is overwritten by transaction T1. Let us now consider the example of the SQL transaction of section 12.2.2, which updates an attribute called employee’s loan balance (EMP-LOAN-BAL) in the table EMPLOYEE. Assume that the current value of EMP-LOAN-BAL is INR 70000. Now assume that two concurrent transactions T1 and T2 update the EMP-LOAN-BAL value for the same item in the EMPLOYEE table. The transactions are as follows:

Transaction T1 : take additional loan of INR 20000 →
EMP-LOAN-BAL = EMP-LOAN-BAL + 20000
Transaction T2 : repay loan of INR 30000 →
EMP-LOAN-BAL = EMP-LOAN-BAL − 30000

Table 12.2 Normal execution of transactions T1 and T2


Table 12.2 shows the serial execution of these transactions under normal
circumstances, yielding the correct result of EMP-LOAN-BAL = 60000.
Now, suppose that a transaction is able to read employee’s EMP-LOAN-
BAL value from the table before a previous transaction for EMP-LOAN-
BAL has been committed.

Table 12.3 Example of lost updates

Table 12.3 illustrates the sequence of execution resulting in the lost update problem. It can be observed from this table that the first transaction T1 has
not yet been committed when the second transaction T2 is executed.
Therefore, transaction T2 still operates on the value 70000, and its
subtraction yields 40000 in the memory. In the meantime, transaction T1
writes the value 90000 to the storage disk, which is immediately overwritten
by transaction T2. Thus, the addition of INR 20000 is lost during the process.
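
The interleaving of Table 12.3 can be reproduced step by step with a small sketch; the explicit local variables standing for each transaction's private buffer are an assumption made to expose the race.

# Illustrative sketch of the lost update: both transactions read 70000 before either writes.
emp_loan_bal = 70000

t1_local = emp_loan_bal            # T1 reads 70000
t2_local = emp_loan_bal            # T2 reads 70000 before T1 has written
t1_local = t1_local + 20000        # T1 adds the additional loan
t2_local = t2_local - 30000        # T2 subtracts the repayment
emp_loan_bal = t1_local            # T1 writes 90000 to the database
emp_loan_bal = t2_local            # T2 overwrites it with 40000: T1's update is lost
print(emp_loan_bal)                # 40000 instead of the correct 60000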

12.3.1.2 Dirty Read (or Uncommitted Data) Problem


A dirty read problem occurs when one transaction updates a database item
and then the transaction fails for some reason. The updated database item is
accessed by another transaction before it is changed back to the original
value. In other words, a transaction T1 updates a record, which is read by the
transaction T2. Then T1 aborts and T2 now has values which have never
formed part of the stable database. Let us consider an example where an
accountant in a Finance Department of M/s KLY Associates records the
travelling allowance of INR 10000.00 to be given to the marketing manager
‘Abhishek’ every time he visits customer organisation. This value is read by
a report-generating transaction which includes it in the report before the
accountant realizes the error and corrects the travelling allowance value. The error arises because the second transaction sees the
first’s updates before it commits.

Fig. 12.4 Example of dirty read (or uncommitted data)

Fig. 12.4 illustrates an example of dirty read in which T1 uses a value written by T2 which never forms part of the stable database. In a dirty read,
data are not committed when two transactions T1 and T2 are executed
concurrently and the first transaction T1 is rolled back after the second
transaction T2 has already accessed the uncommitted data. Thus, it violates
the isolation property of transactions.
Let us consider the same example of lost update transaction with a
difference that this time the transaction T1 is rolled back to eliminate the
addition of INR 20000. Because transaction T2 subtracts INR 30000 from
the original INR 70000, the correct answer should be INR 60000. The
transactions are as follows:

Transaction T1 : take additional loan of INR 20000 →
EMP-LOAN-BAL = EMP-LOAN-BAL + 20000 (Rollback)
Transaction T2 : repay loan of INR 30000 →
EMP-LOAN-BAL = EMP-LOAN-BAL − 30000

Table 12.4 Normal execution of transactions T1 and T2

Table 12.4 shows the serial execution of these transactions under normal
circumstances, yielding the correct result of EMP-LOAN-BAL = 60000.
Table 12.5 illustrates the sequence of execution resulting in dirty read (or
uncommitted data) problem when the ROLLBACK is completed after
transaction T2 has begun its execution.

Table 12.5 Example of dirty read (uncommitted data)

12.3.1.3 Unrepeatable Read (or Inconsistent Retrievals) Problem


Unrepeatable read (or inconsistent retrievals) occurs when a transaction
calculates some summary (aggregate) function over a set of data while other
transactions are updating the data. The problem is that the transaction might
read some data before they are changed and other data after they are
changed, thereby yielding inconsistent results. In an unrepeatable read, the
transaction T1 reads a record and then does some other processing during
which the transaction T2 updates the record. Now, if T1 rereads the record,
the new value will be inconsistent with the previous value. Let us suppose
that a report transaction produces a profile of average monthly travelling
details for every marketing manager of M/s KLY Associates whose travel
bills are more than 5% different from the previous month’s. If the travelling
records are updated after this transaction has started, it is likely to show
details and totals which do not meet the criterion for generating the report.

Fig. 12.5 Example of unrepeatable read (or inconsistent retrievals)

Fig. 12.5 illustrates an example of unrepeatable read in which, if T1 were to read the value of X after T2 had updated X, the result of T1 would be
different. Let us consider the same example of section 12.3.1 with the
following conditions:
Transaction T1 calculates the total loan balance of all employees in the EMPLOYEE table of
M/s KLY Associates.
In parallel (at the same time), transaction T2 updates the employee’s loan balance (EMP-LOAN-BAL) for two employees (EMP-ID) ‘106519’ and ‘112233’ of the EMPLOYEE table.
The above two transactions are as follows:

Transaction T1: SELECT SUM (EMP-LOAN-BAL)
               FROM EMPLOYEE
Transaction T2: UPDATE EMPLOYEE
               SET EMP-LOAN-BAL = EMP-LOAN-BAL + 20000
               WHERE EMP-ID = ‘106519’
               UPDATE EMPLOYEE
               SET EMP-LOAN-BAL = EMP-LOAN-BAL - 20000
               WHERE EMP-ID = ‘112233’
               COMMIT;

As can be observed from the above transactions, transaction T1 calculates the total loan balance of all employees of M/s KLY Associates, while transaction T2 represents the correction of a typing error. Let us assume that
the user added INR 20000 to EMP-LOAN-BAL of EMP-ID = ‘112233’ but
meant to add INR 20000 to EMP-LOAN-BAL of EMP-ID = ‘106519’. To
correct the problem, the user subtracts INR 20000 from EMP-LOAN-BAL
of EMP-ID = ‘112233’ and adds INR 20000 to EMP-LOAN-BAL of EMP-
ID = ‘106519’. The initial and final EMP-LOAN-BAL values are shown in
Table 12.6.
Table 12.6 Transaction results after correction

Although the final results are correct after the adjustment, inconsistent
retrievals are possible during the correction process as illustrated in Table
12.7.
As shown in Table 12.7, the computed answer of INR 350000 is obviously
wrong, because we know that the correct answer is INR 330000. Unless the
DBMS exercises concurrency control, a multi-user database environment
can create havoc within the information system.

12.3.2 Degree of Consistency


Following four levels of transaction consistency have been defined by Gray
(1976):
Level 0 consistency: In general, level 0 transactions are not recoverable since they may have interactions with the external world which cannot be undone. They have the following properties:
The transaction T does not overwrite other transaction’s dirty (or uncommitted) data.
Table 12.7 Example of unrepeatable read (inconsistent retrievals)

Level 1 consistency: Level 1 consistency is the minimum requirement that allows a transaction to be recovered in the event of a system failure. Level 1 transactions have the following properties:
The transaction T does not overwrite other transaction’s dirty (or uncommitted) data.
The transaction T does not make any of its updates visible before it commits.

Level 2 consistency: Level 2 consistency isolates a transaction from the updates of other transactions. Level 2 transactions have the following properties:
The transaction T does not overwrite other transaction’s dirty (or uncommitted) data.
The transaction T does not make any of its updates visible before it commits.
The transaction T does not read other transaction’s dirty (or uncommitted) data.

Level 3 consistency: Level 3 transaction consistency adds consistent reads so that successive reads of a record will always give the same values. They have the following properties:
The transaction T does not overwrite other transaction’s dirty (or uncommitted) data.
The transaction T does not make any of its updates visible before it commits.
The transaction T does not read other transaction’s dirty (or uncommitted) data.
The transaction T can perform consistent reads, that is, no other transaction can update data
read by the transaction T before T has committed.

Most conventional database applications require level 3 consistency, and that is provided by all major commercial DBMSs.

12.3.3 Permutable Actions


An action is a unit of processing that is indivisible from the DBMS’s
perspective. In systems where the granule is a page, the actions are typically
read-page and write-page. The actions provided are determined by the
system designers, but in all cases they are independent of side-effects and do not produce side-effects.
A pair of actions is permutable if every execution of Ai followed by Aj has
the same result as the execution of Aj followed by Ai on the same granule.
Actions on different granules are always permutable. For the actions read
and write we have:

Read-Read: Permutable
Read-Write: Not permutable, since the result is different depending on whether the Read is first or the Write is first.
Write-Write: Not permutable, as the second write always nullifies the
effects of the first write.

12.3.4 Schedule
A schedule (also called a history) is a sequence of actions or operations (for example, reading, writing, aborting or committing) that is constructed by
merging the actions of a set of transactions, respecting the sequence of
actions within each transaction. As we have explained in our previous
discussions, as long as two transactions T1 and T2 access unrelated data,
there is no conflict and the order of execution is not relevant to the final
result. But, if the transactions operate on the same or related
(interdependent) data, conflict is possible among the transaction components
and the selection of one operational order over another may have some
undesirable consequences. Thus, DBMS has inbuilt software called
scheduler, which determines the correct order of execution. The scheduler
establishes the order in which the operations within concurrent transactions
are executed. The scheduler interleaves the execution of database operations
to ensure serialisability (as explained in section 12.3.5). The scheduler bases
its actions on concurrency control algorithms, such as locking or time
stamping methods. The scheduler also ensures the efficient utilisation of the central processing unit (CPU) of the computer system.
Fig. 12.6 shows a schedule involving two transactions. It can be observed
that the schedule does not contain an ABORT or COMMIT action for either
transaction. Schedules which contain either an ABORT or COMMIT action
for each transaction whose actions are listed in it are called a complete
schedule. If the actions of different transactions are not interleaved, that is,
transactions are executed one by one from start to finish, the schedule is
called a serial schedule. A non-serial schedule is a schedule where the
operations from a group of concurrent transactions are interleaved.

Fig. 12.6 Schedule involving two transactions

A serial schedule guarantees correctness, since transactions do not interfere with one another. The disadvantage of a serial schedule is that it represents inefficient processing because no interleaving of operations from
different transactions is permitted. This can lead to low CPU utilisation
while a transaction waits for disk input/output (I/O), or for another
transaction to terminate, thus slowing down processing considerably.

12.3.5 Serialisable Schedules


As we have discussed earlier, the objective of concurrency control is to
arrange or schedule the execution of transactions in such a way as to avoid
any interference. This objective can be achieved by execution and commit of
one transaction at a time in serial order. But in multi-user environment,
where there are hundreds of users and thousands of transactions, the serial
execution of transactions is not viable. Thus, the DBMS schedules the transactions so that many transactions can execute concurrently without interfering with one another, thereby maximising concurrency in the system.
A schedule is a sequence of operations by a set of concurrent transactions
that preserves the order of the operations in each of the individual
transactions. A serialisable schedule is a schedule that allows a set of transactions to execute concurrently such that the effects are equivalent to executing them in some serial order, as in a serial schedule. The execution of
transactions in a serialisable schedule is a sufficient condition for preventing
conflicts. The serial execution of transactions always leaves the database in a
consistent state.
Serialisability describes the concurrent execution of several transactions.
The objective of serialisability is to find the non-serial schedules that allow
transactions to execute concurrently without interfering with one another and
thereby producing a database state that could be produced by a serial
execution. Serialisability must be guaranteed to prevent inconsistency from
transactions interfering with one another. The order of Read and Write operations is important in serialisability. The serialisability rules are as
follows:
If two transactions T1 and T2 only Read a data item, they do not conflict and the order is not
important.
If two transactions T1 and T2 either Read or Write completely separate data items, they do
not conflict and the execution order is not important.
If one transaction T1 Writes a data item and another transaction T2 either Reads or Writes the
same data item, the order of execution is important.

Serialisability can also be depicted by constructing a precedence graph. A precedence relationship between T1 and T2 can be defined as follows: transaction T1 precedes transaction T2 if there are two non-permutable actions A1 and A2 such that A1 is executed by T1 before A2 is executed by T2. Given the existence of non-permutable actions and the sequence of actions in
a transaction it is possible to define a partial order of transactions by
constructing a precedence graph. A precedence graph is a directed graph in
which:
The set of vertices is the set of transactions.
An arc exists from transaction T1 to transaction T2 if T1 precedes T2.

A schedule is serialisable if and only if the precedence graph is acyclic. The serialisability property of transactions is important in multi-user and
distributed databases, where several transactions are likely to be executed
concurrently.
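
The test can be sketched in a few lines of Python. The encoding of a schedule as (transaction, action, item) triples and the helper names are assumptions made for the example; a pair of actions on the same item creates a precedence edge only when at least one of them is a Write.

# Illustrative sketch: build a precedence graph from a schedule and test it for cycles.
schedule = [("T1", "R", "X"), ("T2", "W", "X"), ("T1", "W", "X")]   # assumed encoding

def precedence_edges(sched):
    edges = set()
    for i, (ti, ai, gi) in enumerate(sched):
        for tj, aj, gj in sched[i + 1:]:
            # an edge Ti -> Tj exists for non-permutable actions on the same granule
            if ti != tj and gi == gj and (ai == "W" or aj == "W"):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    def reachable(start, target, seen=()):
        return any(v == target or (v not in seen and reachable(v, target, seen + (v,)))
                   for v in graph.get(start, ()))
    return any(reachable(t, t) for t in {u for u, _ in edges})

print("serialisable" if not has_cycle(precedence_edges(schedule)) else "not serialisable")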

12.4 LOCKING METHODS FOR CONCURRENCY CONTROL

A lock is a variable associated with a data item that describes the status of
the item with respect to possible operations that can be applied to it. It
prevents access to a database record by a second transaction until the first
transaction has completed all of its actions. Generally, there is one lock for
each data item in the database. Locks are used as means of synchronising the
access by concurrent transactions to the database items. Thus, locking
schemes aim to allow the concurrent execution of compatible operations. In
other words, permutable actions are compatible. Locking is the most widely
used form of concurrency control and is the method of choice for most
applications. Locks are granted and released by a lock manager. The
principal data structure of a lock manager is the lock table. In the lock table, an entry consists of a transaction identifier, a granule identifier and a lock type. The simplest locking scheme has two types of lock, namely (a) S locks (shared or Read locks) and (b) X locks (exclusive or Write locks). The lock manager refuses incompatible requests, so if
a. Transaction T1 holds an S lock on granule G1. A request by transaction T2 for an S lock will
be granted. In other words, Read-Read is permutable.
b. Transaction T1 holds an S lock on granule G1. A request by transaction T2 for an X lock will
be refused. In other words, Read-Write is not permutable.
c. Transaction T1 holds an X lock on granule G1. No request by transaction T2 for a lock on G1 will be granted. In other words, Write-Read and Write-Write are not permutable.
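
Rules (a) to (c) amount to the usual shared/exclusive compatibility matrix, sketched below for illustration; the table and the grant helper are assumptions, not a real lock manager.

# Illustrative sketch of the S/X lock compatibility rules (a)-(c) above.
COMPATIBLE = {
    ("S", "S"): True,    # Read-Read is permutable: the request is granted
    ("S", "X"): False,   # Read-Write is not permutable: the request is refused
    ("X", "S"): False,   # Write-Read is not permutable
    ("X", "X"): False,   # Write-Write is not permutable
}

def grant(held_lock, requested_lock):
    return held_lock is None or COMPATIBLE[(held_lock, requested_lock)]

print(grant("S", "S"))   # True  - case (a)
print(grant("S", "X"))   # False - case (b)
print(grant("X", "S"))   # False - case (c)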

12.4.1 Lock Granularity


A database is basically represented as a collection of named data items. The
size of the data item chosen as the unit of protection by a concurrency
control program is called granularity. Granularity can be a field of some
record in the database, or it may be a larger unit such as record or even a
whole disk block. Granule is a unit of data individually controlled by the
concurrency control subsystem. Granularity refers to the size of the lockable unit in a lock-based concurrency control scheme. Lock granularity indicates the level of
lock use. Most often, the granule is a page, although smaller or larger units
(for example, tuple, relation) can be used. Most commercial database
systems provide a variety of locking granularities. Locking can take place at
the following levels:
Database level.
Table level.
Page level.
Row (tuple) level.
Attributes (fields) level.

Thus, the granularity affects the concurrency control of the data items,
that is, what portion of the database a data item represents. An item can be as
small as a single attribute (or field) value or as large as a disk block, or even
a whole file or the entire database.

12.4.1.1 Database Level Locking


At database level locking, the entire database is locked. Thus, it prevents the
use of any tables in the database by transaction T2 while transaction T1 is
being executed.
Database level of locking is suitable for batch processes. Being very slow,
it is unsuitable for on-line multi-user DBMSs.

12.4.1.2 Table Level Locking


At table level locking, the entire table is locked. Thus, it prevents the access
to any row (tuple) by transaction T2 while transaction T1 is using the table. If
a transaction requires access to several tables, each table may be locked.
However, two transactions can access the same database as long as they
access different tables.
Table level locking is less restrictive than database level. But, it causes
traffic jams when many transactions are waiting to access the same table.
Such a condition is especially problematic when transactions require access
to different parts of the same table but would not interfere with each other.
Table level locks are not suitable for multi-user DBMSs.

12.4.1.3 Page Level Locking


At page level locking, the entire disk-page (or disk-block) is locked. A page
has a fixed size such as 4 K, 8 K, 16 K, 32 K and so on. A table can span
several pages, and a page can contain several rows (tuples) of one or more
tables.
Page level of locking is most suitable for multi-user DBMSs.

12.4.1.4 Row Level Locking


At row level locking, a particular row (or tuple) is locked. A lock exists for
each row in each table of the database. The DBMS allows concurrent
transactions to access different rows of the same table, even if the rows are
located on the same page.
The row level lock is much less restrictive than database level, table level,
or page level locks. The row level locking improves the availability of data.
However, the management of row level locking requires high overhead cost.

12.4.1.5 Attribute (or Field) Level Locking


At attribute level locking, a particular attribute (or field) is locked. Attribute
level locking allows concurrent transactions to access the same row, as long
as they require the use of different attributes within the row.
The attribute level lock yields the most flexible multi-user data access.
However, it requires a high level of computer overhead.

12.4.2 Lock Types


The DBMS mainly uses the following types of locking techniques:
Binary locking.
Exclusive locking.
Shared locking.
Two-phase locking (2PL).
Three-phase locking (3PL).

12.4.2.1 Binary Locking


In binary locking, there are two states of locking namely (a) locked (or ‘1’)
or (b) unlocked (or ‘0’). If a database object, such as a table, page, tuple (row) or attribute (field), is locked by a transaction, no other transaction can use that
object. A distinct lock is associated with each database item. If the value of
lock on data item X is 1, item X cannot be accessed by a database operation
that requires the item. If an object (or data item) X is unlocked, any
transaction can lock the object for its use. As a rule, a transaction must
unlock the object after its termination. Any database operation requires that
the affected object be locked. Therefore, every transaction requires a lock
and unlock operation for each data item that is accessed. The DBMS manages and schedules these operations.
Two operations, lock_item(data item) and unlock_item(data item) are
used with binary locking. A transaction requests access to a data item X by
first issuing a lock_item(X) operation. If LOCK(X) = 1, the transaction is
forced to wait. If LOCK(X) = 0, it is set to 1 (that is, transaction locks the
data item X) and the transaction is allowed to access item X. When the
transaction is through using the data item, it issues unlock_item(X)
operation, which sets LOCK(X) to 0 (unlocks the data item) so that X may be
accessed by other transactions. Hence, a binary lock enforces mutual
exclusion on the data item.
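
A minimal sketch of lock_item and unlock_item on a single item follows; the busy-wait loop is an assumption used to keep the example short, whereas a real DBMS would suspend the waiting transaction.

# Illustrative sketch of binary locking on a single data item.
LOCK = {"X": 0}                    # 0 = unlocked, 1 = locked

def lock_item(name):
    while LOCK[name] == 1:         # the transaction is forced to wait
        pass                       # (a real DBMS suspends it rather than busy-waiting)
    LOCK[name] = 1                 # lock the item and proceed

def unlock_item(name):
    LOCK[name] = 0                 # other transactions may now access the item

lock_item("X")
# ... read and write the item X ...
unlock_item("X")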

Table 12.8 Binary lock

Table 12.8 illustrates the binary locking technique for the example of lost
update (section 12.3.1). It can be observed from the above table that the lock
and unlock features eliminate the lost update problem as depicted in table
12.3. The binary locking scheme has the advantage of being easy to implement. However, it has the limitation of being too restrictive to yield optimal concurrency conditions. For example, the DBMS will not allow two transactions to read the same database object concurrently, even though neither transaction updates the database and therefore no concurrency problem, such as a lost update, could occur.

12.4.2.2 Shared/Exclusive (or Read/Write) Locking


A shared/exclusive (or Read/Write) lock uses multiple lock modes. In this type of locking, there are three locking operations, namely (a) Read_lock(A), (b) Write_lock(A), and (c) Unlock(A). A read-locked item is also called share-locked, because other transactions are allowed to read the item. A write-locked item is called exclusive-locked, because a single transaction exclusively holds the lock on the item. A shared lock is denoted by S and an exclusive lock is denoted by X. An exclusive lock exists when access is specifically reserved for the transaction that locked the object. The exclusive lock must be used when there is a chance of conflict. An exclusive lock is
used when a transaction wants to write (update) a data item and no locks are
currently held on that data item by any other transaction. If transaction T2
updates data item A, then an exclusive lock is required by transaction T2
over data item A. The exclusive lock is granted if and only if no other locks
are held on the data item.
A shared lock exists when concurrent transactions are granted READ
access on the basis of a common lock. A shared lock produces no conflict as
long as the concurrent transactions are Read-only. A shared lock is used
when a transaction wants to Read data from the database and no exclusive
lock is held on that data item. Shared locks allow several READ transactions
to concurrently Read the same data item. For example, if transaction T1 has
shared lock on data item A, and transaction T2 wants to Read data item A,
transaction T2 may also obtain a shared lock on data item A.
If an exclusive or shared lock is already held on data item A by transaction T1, an exclusive lock cannot be granted to transaction T2.

12.4.2.3 Two-phase Locking (2PL)


Two-phase locking (also called 2PL) is a method or a protocol of controlling
concurrent processing in which all locking operations precede the first
unlocking operation. Thus, a transaction is said to follow the two-phase locking protocol if all locking operations (such as read_lock, write_lock)
precede the first unlock operation in the transaction. Two-phase locking is
the standard protocol used to maintain level 3 consistency (section 12.3.2).
2PL defines how transactions acquire and relinquish locks. The essential discipline is that after a transaction has released a lock it may not obtain any further locks. In practice this means that transactions hold all their locks until they are ready to commit. 2PL has the following two phases:
A growing phase, in which a transaction acquires all the required locks without unlocking
any data. Once all locks have been acquired, the transaction is in its locked point.
A shrinking phase, in which a transaction releases all locks and cannot obtain any new lock.

The above two-phase locking is governed by the following rules:


Two transactions cannot have conflicting locks.
No unlock operation can precede a lock operation in the same transaction.
No data are affected until all locks are obtained, that is, until the transaction is in its locked
point.
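
The rule that no lock may be acquired after the first unlock can be enforced mechanically, as in the following illustrative sketch; the TwoPhaseTransaction class and its method names are assumptions made for the example.

# Illustrative sketch of the two-phase locking discipline.
class TwoPhaseTransaction:
    def __init__(self):
        self.locks = set()
        self.shrinking = False     # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violated: no lock may follow an unlock")
        self.locks.add(item)       # growing phase

    def unlock(self, item):
        self.shrinking = True      # shrinking phase begins at the first unlock
        self.locks.discard(item)

t = TwoPhaseTransaction()
t.lock("A"); t.lock("B")           # growing phase: acquire all required locks
# ... read and write A and B ...
t.unlock("A"); t.unlock("B")       # shrinking phase: release locks, acquire no more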

Fig. 12.7 illustrates a schematic of two-phase locking. In the case of strict two-phase locking, the interleaving is not allowed. Fig. 12.8 shows a schedule with strict two-phase locking in which transaction T1 would obtain an exclusive lock on A first and then Read and Write A.
Fig. 12.9 illustrates an example of strict two-phase locking with serial
execution in which first strict locking is done as explained above, then
transaction T2 would request an exclusive lock on A. However, this request
cannot be granted until transaction T1 releases its exclusive lock on A, and
the DBMS therefore suspends transaction T2. Transaction T1 now proceeds
to obtain an exclusive lock on B, Reads and Writes B, then finally commits,
at which time its locks are released. The lock request of transaction T2 is
now granted, and it proceeds. Similarly, Fig. 12.10 illustrates the schedule
following strict two-phase locking with interleaved actions.
Fig. 12.7 Schematic of Two-phase locking (2PL)

Fig. 12.8 Schedule with strict two-phase locking

Two-phase locking guarantees serialisability, which means that transactions can be executed in such a way that their results are the same as
if each transaction’s actions were executed in sequence without interruption.
But, two-phase locking does not prevent deadlocks and therefore is used in
conjunction with a deadlock prevention technique.
Fig. 12.9 Schedule with strict two-phase locking with serial execution

12.4.3 Deadlocks
A deadlock is a condition in which two (or more) transactions in a set are
waiting simultaneously for locks held by some other transaction in the set.
Neither transaction can continue because each transaction in the set is on a
waiting queue, waiting for one of the other transactions in the set to release
the lock on an item. Thus, a deadlock is an impasse that may result when
two or more transactions are each waiting for locks to be released that are
held by the other. Transactions whose lock requests have been refused are
queued until the lock can be granted. A deadlock is also called a circular
waiting condition where two transactions are waiting (directly or indirectly)
for each other. Thus in a deadlock, two transactions are mutually excluded
from accessing the next record required to complete their transactions, also
called a deadly embrace. A deadlock exists when two transactions T1 and T2
exist in the following mode:
Fig. 12.10 Schedule with strict two-phase locking with interleaved actions

Table 12.9 Deadlock situation

Transaction T1 = access data items X and Y
Transaction T2 = access data items Y and X
If transaction T2 has not unlocked the data item Y, transaction T1 cannot continue. Similarly, if transaction T1 has not unlocked the data item X, transaction T2 cannot continue. Consequently, transactions T1 and T2 wait indefinitely, each waiting for the other to unlock the required data item.
Table 12.9 illustrates a deadlock situation of transactions T1 and T2. In this
example, only two concurrent transactions have been shown to demonstrate
a deadlock situation. In a practical situation, DBMS can execute many more
transactions simultaneously, thereby increasing the probability of generating
deadlocks. Many proposals have been made for detecting and resolving
deadlocks, all of which rely on detecting cycles in a waits-for graph. A
waits-for graph is a directed graph in which the nodes represent transactions
and a directed arc links a node waiting for a lock with the node that has the
lock. In other words, wait-for graph is a graph of “who is waiting for
whom”. A waits-for graph can be used to represent conflict for any resource.

12.4.3.1 Deadlock Detection and Prevention


Deadlock detection is a periodic check by the DBMS to determine if the
waiting line for some resource exceeds a predetermined limit. The frequency
of deadlocks is primarily dependent on the query load and the physical
organisation of the database. For estimating deadlock frequency, Gray proposed in 1981 that the deadlock rate rises with the square of the degree of multiprogramming and the fourth power of the transaction size. There are
following three basic schemes to detect and prevent deadlock:
Never allow deadlock (deadlock prevention): Deadlock prevention technique avoids the
conditions that lead to deadlocking. It requires that every transaction lock all data items it
needs in advance. If any of the items cannot be obtained, none of the items are locked. In
other words, a transaction requesting a new lock is aborted if there is the possibility that a
deadlock can occur. Thus, a timeout may be used to abort transactions that have been idle for
too long. This is a simple but indiscriminate approach. If the transaction is aborted, all the
changes made by this transaction are rolled back and all locks obtained by the transaction are
released. The transaction is then rescheduled for execution. Deadlock prevention technique is
used in two-phase locking.
Detect deadlock whenever a transaction is blocked (deadlock detection): In a deadlock
detection technique, the DBMS periodically tests the database for deadlocks. If a deadlock is
found, one of the transactions is aborted and the other transaction continues. The aborted
transaction is now rolled back and restarted. This scheme is expensive since most blocked
transactions are not involved in deadlocks.
Detect deadlocks periodically (deadlock avoidance): In a deadlock avoidance technique, the
transaction must obtain all the locks it needs before it can be executed. Thus, it avoids
rollback of conflicting transactions by requiring that locks be obtained in succession. This is
the optimal scheme if the detection period is suitable. The ideal period is that which, on
average, detects one deadlock cycle. A shorter period than this means that deadlock detection
is done unnecessarily and a longer period involves transactions in unnecessarily long waits
until the deadlock is broken.

The best deadlock control technique depends on the database environment. For example, if the probability of deadlocks is low, the
deadlock detection technique is recommended. However, if the probability
of a deadlock is high, the deadlock prevention technique is recommended. If
response time is not high on the system priority list, the deadlock avoidance
technique might be employed.
A simple way to detect a state of deadlock is for the system to construct
and maintain a wait-for graph. In a wait-for graph, an arrow is drawn from the transaction to the record being sought, and another arrow is drawn from that record to the transaction that is currently using it. If the graph has
cycles, deadlock is detected. Thus, in a wait-for graph, one node is created
for each transaction that is currently executing. Whenever a transaction T1 is
waiting to lock a data item X that is currently locked by transaction T2 a
directed edge (T1 → T2) is created in the wait-for graph. When transaction
T2 releases the lock(s) on the data items that the transaction T1 was waiting
for, the directed edge is dropped from the wait-for graph. We have a state of
deadlock if and only if the wait-for graph has a cycle.
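
The cycle test and the choice of a victim can be sketched as follows. The waits-for edges and start timestamps are assumptions made for the example; the sketch aborts the most recent (youngest) transaction on the cycle, in line with the discussion below.

# Illustrative sketch: detect a cycle in a waits-for graph and choose a victim to abort.
waits_for = {"T1": {"T2"}, "T2": {"T1"}, "T3": {"T2"}}   # Ti -> transactions holding locks Ti needs
start_time = {"T1": 100, "T2": 105, "T3": 110}           # assumed start timestamps (larger = younger)

def find_cycle(graph):
    def dfs(node, path):
        if node in path:
            return path[path.index(node):]               # the cycle portion of the path
        for nxt in graph.get(node, ()):
            cycle = dfs(nxt, path + [node])
            if cycle:
                return cycle
        return None
    for start in graph:
        cycle = dfs(start, [])
        if cycle:
            return cycle
    return None

cycle = find_cycle(waits_for)                            # here: ['T1', 'T2']
if cycle:
    victim = max(cycle, key=start_time.get)              # the most recent transaction on the cycle
    print("abort and restart", victim)                   # T2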
Fig. 12.11 illustrates waits-for graph for deadlocks involving two or more
transactions. In the simple case, a deadlock only involves two transactions as
shown in Fig. 12.11 (a). In this case the cycle of transactions T1 and T2
represents a deadlock and transaction T3 is waiting for transaction T2. In a
more complex situation, deadlock may involve several transactions as shown
in Fig. 12.11 (b). In the simple deadlock case of Fig. 12.11 (a), the deadlock
may be broken by aborting one of the transactions involved and restarting it.
Since it is expensive to abort and restart a transaction, it is desirable to abort
the one that has done the least work. However, if the victim is always the
transaction that has done the least work, it is possible that a transaction may
be repeatedly aborted and thus prevented from completing. Therefore, in
practice, it is generally better to abort the most recent transaction, which is
likely to have done least work, and restart it with its original identifier. This
scheme ensures that a transaction that is repeatedly aborted will eventually
become the oldest active transaction in the system and will eventually
complete. The transaction identifier could be the monotonically increasing
sequence based on system clock.

Fig. 12.11 Waits-for graph for deadlocks involving two or more transactions

(a) Waits-for graph for simple case

(b) Waits-for graph for complex case

In the complex situation of Fig. 12.11 (b), two alternatives can be used
such as (a) to minimize the amount of work done by the transactions to be
aborted or (b) to find the minimal cut-set of the graph and abort the
corresponding transactions.

12.4.3.2 Deadlock in Distributed Systems


A deadlock in a distributed system may be either local or global. The local
deadlocks are handled in the same way as the deadlocks in centralised
Global deadlocks occur when there is a cycle in the global waits-for graph
involving cohorts in session wait and lock wait. Figure 12.12 shows an
example of distributed deadlock.

Fig. 12.12 Waits-for graph for distributed deadlocks

A distributed transaction has a number of cohorts, each operating on a separate node of the system, as shown in Fig. 12.12. A cohort is a process and so may be in one of a number of states (for example, processor wait, execution wait, I/O wait and so on). The session wait and lock wait are the
states of interest for deadlock detection. In session wait, a cohort waits for
data from one or more other cohorts.
The detection of deadlock in a distributed system is a more difficult problem because a cycle may involve several nodes. Cycles in a distributed
waits-for graph are detected through actions of a designated process at one
node which:
Periodically requests fragments of local waits-for graph from all other distributed sites.
Receives from each site its local graph containing cohorts in session wait.
Constructs the global waits-for graph by matching up the local fragments.
Selects victims until there are no remaining cycles in the global graph.
Broadcasts the result, so that the session managers at the sites coordinating the victims can
abort them.

The deadlock detection rounds do not have to be synchronised if the list of victims from the previous round of deadlock detection is remembered, since this allows the global deadlock detector to eliminate from the graph any transactions that have previously been aborted.

12.5 TIMESTAMP METHODS FOR CONCURRENCY CONTROL

A timestamp is a unique identifier created by the DBMS to identify the relative starting time of a transaction. Typically, timestamp values are assigned in the
order in which the transactions are submitted to the system. So, a timestamp
can be thought of as the transaction start time. Therefore, time stamping is a
method of concurrency control in which each transaction is assigned a
transaction timestamp. A transaction timestamp is a monotonically
increasing number, which is often based on the system clock. The
transactions are managed so that they appear to run in a timestamp order.
Timestamps can also be generated by incrementing a logical counter every
time a new transaction starts. The timestamp value produces an explicit
order in which transactions are submitted to the DBMS. Timestamps must
have two properties namely (a) uniqueness and (b) monotonicity. The
uniqueness property assures that no equal timestamp values can exist and
monotonicity assures that timestamp values always increase. The READ and
WRITE operations of database within the same transaction must have the
same timestamp. The DBMS executes conflicting operations in timestamp
order, thereby ensuring serialisability of the transactions. If two transactions
conflict, one often is stopped, rescheduled and assigned a new timestamp
value.
Timestamping is a concurrency control protocol in which the fundamental
goal is to order transactions globally in such a way that older transactions get
priority in the event of a conflict. The timestamp method does not require
any locks. Therefore, there are no deadlocks. The timestamp methods do not
make the transactions wait to prevent conflicts as is the case with locking.
Transactions involved in a conflict are simply rolled back and restarted.
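
The general idea can be illustrated with the basic timestamp-ordering checks shown below. This is a generic sketch made for illustration (it is not one of the specific variants described in section 12.5.2); the read_ts and write_ts bookkeeping for a single granule is an assumption.

# Illustrative sketch of timestamp-ordering checks on one granule.
read_ts, write_ts = 0, 0           # timestamps of the last read and last write of the granule

def read_granule(txn_ts):
    global read_ts
    if txn_ts < write_ts:          # a younger transaction has already written the granule
        raise RuntimeError("conflict: roll back and restart with a new timestamp")
    read_ts = max(read_ts, txn_ts)

def write_granule(txn_ts):
    global write_ts
    if txn_ts < read_ts or txn_ts < write_ts:   # conflicting access out of timestamp order
        raise RuntimeError("conflict: roll back and restart with a new timestamp")
    write_ts = txn_ts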

12.5.1 Granule Timestamps


Granule timestamp is a record of the timestamp of the last transaction to
access it. Each granule accessed by an active transaction must have a granule
timestamp. A separate record of last Read and Write accesses may be kept.
Granule timestamps may cause additional Write operations for Read accesses if they are stored with the granules. The problem can be avoided by
maintaining granule timestamps as an in-memory table. The table may be of
limited size, since conflicts may only occur between current transactions. An
entry in a granule timestamp table consists of the granule identifier and the
transaction timestamp. The record containing the largest (latest) granule
timestamp removed from the table is also maintained. A search for a granule
timestamp, using the granule identifier, will either be successful or will use
the largest removed timestamp.

12.5.2 Timestamp Ordering


Following are the three basic variants of timestamp-based methods of
concurrency control:
Total timestamp ordering.
Partial timestamp ordering.
Multiversion timestamp ordering.

12.5.2.1 Total Timestamp Ordering


The total timestamp ordering algorithm depends on maintaining access to
granules in timestamp order by aborting one of the transactions involved in
any conflicting access. No distinction is made between Read and Write
access, so only a single value is required for each granule timestamp.

12.5.2.2 Partial Timestamp Ordering


In a partial timestamp ordering, only non-permutable actions are ordered to
improve upon the total timestamp ordering. In this case, both Read and
Write granule timestamps are stored. The algorithm allows the granule to be
read by any transaction younger than the last transaction that updated the
granule. A transaction is aborted if it tries to update a granule that has
previously been accessed by a younger transaction. The partial timestamp
ordering algorithm aborts fewer transactions than the total timestamp
ordering algorithm, at the cost of extra storage for granule timestamps.

12.5.2.3 Multiversion Timestamp Ordering


The multiversion timestamp ordering algorithm stores several versions of an
updated granule, allowing transactions to see a consistent set of versions for
all granules it accesses. So, it reduces the conflicts that result in transaction
restarts to those where there is a Write-Write conflict. Each update of a
granule creates a new version, with an associated granule timestamp. A
transaction that requires read access to the granule sees the youngest version
that is older than the transaction. That is, the version having a timestamp
equal to or immediately below the transaction’s timestamp.

12.5.3 Conflict Resolution in Timestamps


To deal with conflicts in timestamp algorithms, some transactions involved in conflicts are made to wait, while others are aborted. The following are the main strategies of conflict resolution in timestamp methods:
Wait-Die: The older transaction waits for the younger if the younger has
accessed the granule first. The younger transaction is aborted (dies) and
restarted if it tries to access a granule after an older concurrent transaction.
Wound-Wait: The older transaction pre-empts the younger by aborting (wounding) it if the older requires a granule that the younger transaction has already accessed. A younger transaction that requires a granule already accessed by an older concurrent transaction waits for the older one to commit.
The handling of aborted transactions is an important aspect of any conflict
resolution algorithm. If the aborted transaction is the one requesting
access, it must be restarted with a new (younger) timestamp, and such a
transaction can be repeatedly aborted if it keeps conflicting with other
transactions. An aborted transaction that had prior access to the granule
where the conflict occurred can be restarted with the same timestamp.
Retaining the old timestamp gives it priority and eliminates the possibility
of the transaction being continuously locked out.
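
The two rules can be summarised in Python as simple decision functions; this is purely illustrative, with smaller timestamps denoting older transactions.

def wait_die(requester_ts, holder_ts):
    # Wait-Die: an older requester waits; a younger requester dies
    # (is aborted and restarted).
    return "wait" if requester_ts < holder_ts else "abort requester"

def wound_wait(requester_ts, holder_ts):
    # Wound-Wait: an older requester wounds (aborts) the younger holder;
    # a younger requester waits.
    return "abort holder" if requester_ts < holder_ts else "wait"

# A transaction with timestamp 3 (older) conflicts with a holder stamped 7 (younger):
assert wait_die(3, 7) == "wait"
assert wound_wait(3, 7) == "abort holder"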

12.5.4 Drawbacks of Timestamp


Each value stored in the database requires two additional timestamp fields, one for the last
time the field (attribute) was read and one for the last update.
This increases the memory requirements and the processing overhead of the database.

12.6 OPTIMISTIC METHODS FOR CONCURRENCY CONTROL

The optimistic method of concurrency control is based on the assumption
that conflicts between database operations are rare and that it is better to
let transactions run to completion and only check for conflicts before they
commit. Optimistic concurrency control methods are also known as
validation or certification methods. No checking is done while the
transaction is executing. The optimistic method does not require locking or
timestamping techniques. Instead, a transaction is executed without
restrictions until it is committed. In optimistic methods, each transaction
moves through the following phases:
Read phase.
Validation or certification phase.
Write phase.

12.6.1 Read Phase


In a Read phase, the updates are prepared using private (or local) copies (or
versions) of the granule. In this phase, the transaction reads values of
committed data from the database, executes the needed computations, and
makes the updates to a private copy of the database values. All update
operations of the transaction are recorded in a temporary update file, which
is not accessed by the remaining transactions. It is conventional to allocate a
timestamp to each transaction at the end of its Read phase in order to
determine the set of transactions that must be examined by the validation
procedure. This set consists of the transactions that have finished their Read
phases since the start of the transaction being validated.

12.6.2 Validation Phase


In a validation (or certification) phase, the transaction is validated to assure
that the changes made will not affect the integrity and consistency of the
database. If the validation test is positive, the transaction goes to the write
phase. If the validation test is negative, the transaction is restarted, and the
changes are discarded. Thus, in this phase the list of granules is checked for
conflicts. If conflicts are detected in this phase, the transaction is aborted and
restarted. The validation algorithm must check that the transaction has:
seen all modifications of transactions committed after it starts.
not read granules updated by a transaction committed after its start.

12.6.3 Write Phase


In a Write phase, the changes are permanently applied to the database and
the updated granules are made public. Otherwise, the updates are discarded
and the transaction is restarted. This phase is only for the Read-Write
transactions and not for Read-only transactions.
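
The check performed in the validation phase can be sketched as follows. The transaction attributes (start timestamp, read set, write set) and the set of recently committed transactions are assumptions made for illustration.

def validate(txn, recently_committed):
    """Return True if txn may proceed to its Write phase."""
    for other in recently_committed:
        # txn must not have read a granule updated by a transaction that
        # committed after txn started.
        if other.commit_ts > txn.start_ts and (txn.read_set & other.write_set):
            return False      # conflict: discard private copies and restart txn
    return True               # validation passed: apply updates in the Write phase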

12.6.4 Advantages of Optimistic Methods for Concurrency Control


The optimistic concurrency control has the following advantages:
This technique is very efficient when conflicts are rare. The occasional conflict results in the
transaction being rolled back.
Since the rollback involves only the local copy of the data, the database itself is not affected
and thus there will not be any cascading rollbacks.

12.6.5 Problems of Optimistic Methods for Concurrency Control


The optimistic concurrency control suffers from the following problems:
Conflicts are expensive to deal with, since the conflicting transaction must be rolled back.
Longer transactions are more likely to have conflicts and may be repeatedly rolled back
because of conflicts with short transactions.

12.6.6 Applications of Optimistic Methods for Concurrency Control


Only suitable for environments where there are few conflicts and no long transactions.
Acceptable for mostly Read or Query database systems that require very few update
transactions.

REVIEW QUESTIONS
1. What is a transaction? What are its properties? Why are transactions important units of
operation in a DBMS?
2. Draw a state diagram and discuss the typical states that a transaction goes through during
execution.
3. How does the DBMS ensure that the transactions are executed properly?
4. What is consistent database state and how is it achieved?
5. What is transaction log? What are its functions?
6. What are the typical kinds of records in a transaction log? What are transaction commit
points and why are they important?
7. What is a schedule? What does it do?
8. What is concurrency control? What are its objectives?
9. What do you understand by the concurrent execution of database transactions in a multi-user
environment?
10. What do you mean by atomicity? Why is it important? Explain with an example.
11. What do you mean by consistency? Why is it important? Explain with an example.
12. What do you mean by isolation? Why is it important? Explain with an example.
13. What do you mean by durability? Why is it important? Explain with an example.
14. What are transaction states?
15. A hospital blood bank transaction system is given which records the following information:

a. Deliveries of different blood products in standard units.


b. Issues of blood products to hospital wards, clinics, and Operation Theatres (OTs).
Assume that each issue is for an identified patient and each unit is uniquely
identified.
c. Returns of unused blood products from hospital wards.
Describe a concurrency control scheme for this system which allows maximum
concurrency, always allows Read access to the stock and accurately records the
blood products used by each patient.

16. Discuss the transition execution state with a state transition diagram and related problems.
17. What are ACID properties of a database transaction? Discuss each of these properties and
how they relate to the concurrency control. Give examples to illustrate your answer.
18. Explain the concepts of serial, non-serial and serialisable schedules. State the rules for
equivalence of schedules.
19. Explain the distinction between the terms serial schedule and serialisable schedule.
20. What is locking? What is the relevance of lock in database management system? How does a
lock work?
21. What are the different types of locks?
22. What is deadlock? How can a deadlock be avoided?
23. Discuss the problems of deadlock and the different approaches to dealing with these
problems.
24. Consider the following two transactions:

T1 : Read (A)
Read (B)
If A = 0 then B := B + 1
Write (B).
T2 : Read (B)
Read (A)
If B = 0 then A := A + 1
Write (A).
a. Add lock and unlock instructions to transactions T1 and T2 , so that they observe
the two-phase locking protocol.
b. Can the execution of these transactions result in a deadlock?

25. Compare binary locks to shared/exclusive locks. Why is the former type of locks preferable?
26. Discuss the actions taken by Read_item and Write_item operations on a database.
27. Discuss how serialisability is used to enforce concurrency control in a database system. Why
is serialisability sometimes considered too restrictive as a measure of correctness for
schedules?
28. Describe the four levels of transaction concurrency.
29. Define the violations caused by the following:

a. Lost updates.
b. Dirty read (or uncommitted data).
c. Unrepeatable read (or inconsistent retrievals).

30. Describe the wait-die and wound-wait techniques for deadlock prevention.
31. What is a timestamp? How does the system generate timestamp?
32. Discuss the timestamp ordering techniques for concurrency control.
33. When a transaction is rolled back under timestamp ordering, it is assigned a new timestamp.
Why can it not simply keep its old timestamp?
34. How does the optimistic concurrency control method differ from other concurrency control
methods? Why are they also called validation or certification methods?
35. How does the granularity of data items affect the performance of concurrency control
methods? What factors affect selection of granularity size of data items?
36. What is serialisability? What is its objective?
37. Using an example, illustrate how two-phase locking works.
38. Two transactions are said to be serialisable if they can be executed in parallel (interleaved) in
such a way that their results are identical to that achieved if one transaction was processed
completely before the other was initiated. Consider the following two interleaved
transactions, and suppose a consistency condition requires that data items A or B must always
be equal to 1. Assume that A = B = 1 before these transactions execute.

Transaction T1 Transaction T2
Read_item(A)
Read_item(B)
Read_item(A)
Read_item(B)
If A = 1
then B := B + 1
If B = 1
then A := A + 1
Write_item(A)
Write_item(B)
a. Will the consistency requirement be satisfied? Justify your answer.
b. Is there an interleaved processing schedule that will guarantee serialisability? If so,
demonstrate it. If not, explain why?

39. Assuming a transaction log with immediate updates, create the log entries corresponding to
the following transaction actions:

T: read (A, a1) Read the current customer balance


a1 := a1 − 500 Debit the account by INR 500
write (A, a1) Write the new balance
T: read (B, b1) Read the current accounts payable balance
Credit the account balance by INR 500
b1 := b1 + 500
write (B, b1) Write the new balance.

40. Suppose that in Question 39 a failure occurs just after the transaction log record for the action
write (B, b1) has been written.

a. Show the contents of the transaction log at the time of failure.


b. What action is necessary and why?
c. What are the resulting values of A and B?

41. What is wait-for graph? Where is it used? Explain with an example.


42. Produce a wait-for graph for the transaction scenario of Table 12.10 below, and determine
whether deadlock exists:

Table 12.10 Transaction scenario

Transaction    Data items locked by transaction    Data items transaction is waiting for
T1             X2                                  X1, X3
T2             X3, X10                             X7, X8
T3             X8                                  X4, X5
T4             X7                                  X1
T5             X1, X5                              X3
T6             X4, X9                              X6
T7             X6                                  X5

43. What is the two-phase locking? How does it work?


44. What do you mean by degree of consistency? What are the various levels of consistency?
Explain with examples.
45. What is a timestamp ordering? What are the variants of timestamp ordering?
46. Discuss how a conflict is resolved in a timestamp. What are the drawbacks of a timestamp?
47. What is the optimistic method of concurrency control? Discuss the different phases through
which a transaction moves during optimistic control.
48. List the advantages, problems and applications of optimistic method of concurrency control.
49. Consider a database with objects (data items) X and Y. Assume that there are two transactions
T1 and T2. Transaction T1 Reads objects X and Y and then Writes object X. Transaction T2
Reads objects X and Y and then Writes objects X and Y.

a. Give an example schedule with actions of transactions T1 and T2 on objects X and


Y that results in a Write-Read conflict.
b. Give an example schedule with actions of transactions T1 and T2 on objects X and
Y that results in a Read-Write conflict.
c. Give an example schedule with actions of transactions T1 and T2 on objects X and
Y that results in a Write-Write conflict.
d. For each of the three schedules, show that strict two-phase locking disallows the
schedule.

STATE TRUE/FALSE

1. The transaction consists of all the operations executed between the beginning and end of the
transaction.
2. A transaction is a program unit, which can either be embedded within an application program
or can be specified interactively via a high-level query language such as SQL.
3. The changes made to the database by an aborted transaction should be reversed or undone.
4. A transaction that is either committed or aborted is said to be terminated.
5. An atomic transaction is a transaction in which either all actions associated with the transaction
are executed to completion, or none are performed.
6. The effects of a successfully completed transaction are permanently recorded in the database
and must not be lost because of a subsequent failure.
7. Level 0 transactions are recoverable.
8. Level 1 transaction is the minimum consistency requirement that allows a transaction to be
recovered in the event of system failure.
9. Log is a record of all transactions and the corresponding changes to the database.
10. Level 2 transaction consistency isolates from the updates of other transactions.
11. The DBMS automatically updates the transaction log while executing transactions that modify
the database.
12. A committed transaction that has performed updates transforms the database into a new
consistent state.
13. The objective of concurrency control is to schedule or arrange the transactions in such a way
as to avoid any interference.
14. Incorrect analysis problem is also known as dirty read or unrepeatable read.
15. A consistent database state is one in which all data integrity constraints are satisfied.
16. The serial execution always leaves the database in a consistent state although different results
could be produced depending on the order of execution.
17. Cascading rollbacks are not desirable.
18. Locking and timestamp ordering are optimistic techniques, as they are designed based on the
assumption that conflict is rare.
19. Two types of locks are Read and Write locks.
20. In the two-phase locking, every transaction is divided into (a) growing phase and (b)
shrinking phase.
21. A dirty read problem occurs when one transaction updates a database item and then the
transaction fails for some reason.
22. The size of the locked item determines the granularity of the lock.
23. There is no deadlock in the timestamp method of concurrency control.
24. A transaction that changes the contents of the database must alter the database from one
consistent state to another.
25. A transaction is said to be in committed state if it has partially committed, and it can be
ensured that it will never be aborted.
26. Level 3 transaction consistency adds consistent reads so that successive reads of a record will
always give the same values.
27. A lost update problem occurs when two transactions that access the same database items
have their operations in a way that makes the value of some database item incorrect.
28. Serialisability describes the concurrent execution of several transactions.
29. Unrepeatable read occurs when a transaction calculates some summary function over a set of
data while other transactions are updating the data.
30. A lock prevents access to a database record by a second transaction until the first transaction has
completed all of its actions.
31. In a shrinking phase, a transaction releases all locks and cannot obtain any new lock.
32. A deadlock in a distributed system may be either local or global.

TICK (✓) THE APPROPRIATE ANSWER

1. Which of the following is the activity of coordinating the actions of process that operate in
parallel and access shared data?

a. Transaction management
b. Recovery management
c. Concurrency control
d. None of these.

2. Which of the following is the ability of a DBMS to manage the various transactions that
occur within the system?

a. Transaction management
b. Recovery management
c. Concurrency control
d. None of these.

3. Which of the following is transaction property?

a. Isolation
b. Durability
c. Atomicity
d. All of these.
4. Which of the following ensures the consistency of the transactions?

a. Application programmer
b. Concurrency control
c. Recovery management
d. Transaction management.

5. Which of the following ensures the durability of a transaction?

a. Application programmer
b. Concurrency control
c. Recovery management
d. Transaction management.

6. In a shrinking phase, a transaction:

a. releases all locks.


b. cannot obtain any new lock.
c. both (a) and (b).
d. none of these.

7. Which of the following ensures the atomicity of a transaction?

a. Application programmer
b. Concurrency control
c. Recovery management
d. Transaction management.

8. Which of the following ensures the isolation of a transaction?

a. Application programmer
b. Concurrency control
c. Recovery management
d. Transaction management.

9. Which of the following is a transaction state?

a. Active
b. Commit
c. Aborted
d. All of these.

10. The concurrency control has the following problem:

a. lost updates
b. dirty read
c. unrepeatable read
d. all of these.
11. Which of the following is not a transaction management SQL command?

a. COMMIT
b. SELECT
c. SAVEPOINT
d. ROLLBACK.

12. Which of the following is a statement after which you cannot issue a COMMIT command?

a. INSERT
b. SELECT
c. UPDATE
d. DELETE.

13. Timestamps must have the following properties, namely

a. uniqueness.
b. monotonicity.
c. both (a) and (b).
d. none of these.

14. Which of the following is a phase of validation-based concurrency control?

a. validation
b. write
c. read
d. all of these.

15. The READ and WRITE operations on the database within the same transaction must have

a. same timestamp.
b. different timestamp.
c. no timestamp.
d. none of these.

16. Which of the following is a transaction state when the normal execution of the transaction
cannot proceed?

a. Failed
b. Active
c. Terminated
d. Aborted.

17. Locking can take place at the following levels:

a. Page level.
b. Database level.
c. Row level.
d. all of these.

18. In binary locking, there are

a. one state of locking.


b. two states of locking.
c. three states of locking.
d. none of these.

19. Which of the following is the way to undo the effects of a committed transaction?

a. Recovery
b. Compensating transaction
c. Rollback
d. None of these.

20. Which of the following is the size of the data item chosen as the unit of protection by a
concurrency control program?

a. Blocking factor
b. Granularity
c. Lock
d. none of these.

21. A transaction can include the following basic database access operations:

a. Read_item(X).
b. Write_item(X).
c. both (a) & (b).
d. none of these.

22. Which of the following is a problem resulting from concurrent execution of transaction?

a. Incorrect analysis
b. Multiple update
c. Uncommitted dependency
d. all of these.

23. Which of the following is not a deadlock handling strategy?

a. Timeout
b. Deadlock annihilation
c. Deadlock prevention
d. Deadlock detection.

24. In which of the following schedules are the transactions performed one after another, one at a
time?
a. Non-serial schedule
b. Conflict serialisable schedule
c. Serial schedule
d. None of these.

25. A shared lock exists when concurrent transactions are granted the following access on the
basis of a common lock:

a. READ
b. WRITE
c. SHRINK
d. UPDATE.

26. In a growing phase, a transaction acquires all the required locks

a. by locking data
b. without unlocking any data
c. with unlocking any data
d. None of these.

27. Which of the following is an optimistic concurrency control method?

a. Validation-based
b. Timestamp ordering
c. Lock-based
d. None of these.

28. The basic variants of timestamp-based methods of concurrency control are

a. Total timestamp ordering


b. Partial timestamp ordering
c. Multiversion timestamp ordering
d. All of these.

29. In optimistic methods, each transaction moves through the following phases:

a. read phase
b. validation phase
c. write phase
d. All of these.

FILL IN THE BLANKS

1. Transaction is a _____ of work that represents real-world events of any organisation or an


enterprise, whereas concurrency control is the management of concurrent transaction
execution.
2. _____ is the activity of coordinating the actions of processes that operate in parallel, access
shared data, and therefore potentially interfere with each other.
3. A simple way to detect a state of deadlock is for the system to construct and maintain a
_____ graph.
4. A transaction is a sequence of _____ and _____ actions that are grouped together to form a
database access.
5. _____ is the ability of a DBMS to manage the various transactions that occur within the
system.
6. An atomic transaction is a transaction in which either _____ with the transaction are executed to
completion or none are performed.
7. The ACID properties of a transaction are (a) _____, (b) _____, (c) _____ and (d) _____.
8. _____ means that execution of a transaction in isolation preserves the consistency of the
database.
9. The _____ of the DBMS ensures the atomicity of each transaction.
10. Transaction log is a _____ of all _____ and the corresponding changes to the _____.
11. Ensuring durability is the responsibility of the _____ of the DBMS.
12. Isolation property of transaction means that the data used during the execution of a
transaction cannot be used by _____ until the first one is completed.
13. A consistent database state is one in which all _____ constraints are satisfied.
14. A transaction that changes the contents of the database must alter the database from one
_____ to another.
15. The isolation property is the responsibility of the _____ of DBMS.
16. A transaction that completes its execution successfully is said to be _____.
17. Level 2 transaction consistency isolates from the _____ of other transactions.
18. When a transaction has not successfully completed its execution we say that it has _____.
19. A _____ is a schedule where the operations from a group of concurrent transactions are
interleaved.
20. The objective of _____ is to find non-serial schedules.
21. The situation where a single transaction failure leads to a series of rollbacks is called a
_____.
22. _____ is the size of the data item chosen as the unit of protection by a concurrency control
program.
23. Optimistic concurrency control techniques are also called _____ concurrency scheme.
24. The only way to undo the effects of a committed transaction is to execute a _____.
25. Collections of operations that form a single logical unit of work are called _____.
26. Serialisability must be guaranteed to prevent _____ from transactions interfering with one
another.
27. Precedence graph is used to depict _____.
28. Lock prevents access to a _____ by a second transaction until the first transaction has
completed all of its actions.
29. A shared/exclusive (or Read/Write) lock uses _____ lock.
30. A shared lock exists when concurrent transactions are granted _____ access on the basis of a
common lock.
31. Two-phase locking is a method of controlling _____ in which all locking operations precede
the first unlocking operation.
32. In a growing phase, a transaction acquires all the required locks without _____ any data.
33. In a shrinking phase, a transaction releases _____ and cannot obtain any _____ lock.
Chapter 13

Database Recovery System

13.1 INTRODUCTION

Concurrency control and database recovery are intertwined and both are a
part of the transaction management. Recovery is required to protect the
database from data inconsistencies and data loss. It ensures the atomicity and
durability properties of transactions as discussed in chapter 12, Section
12.2.3. This characteristic of a DBMS helps it to recover from failures and
restore the database to a consistent state. It minimises the time for which the
database is not usable after a crash and thus provides high availability. The
recovery system is an integral part of a database system.
In this chapter, we will discuss the database recovery and examine the
techniques that can be used to ensure the database remaining in a consistent
state in the event of failures. Finally, we will examine the buffer
management method used for database recovery.

13.2 DATABASE RECOVERY CONCEPTS

Database recovery is the process of restoring the database to a correct


(consistent) state in the event of a failure. In other words, it is the process of
restoring the database to the most recent consistent state that existed shortly
before the time of system failure. The failure may be the result of a system
crash due to hardware or software errors, a media failure such as head crash,
or a software error in the application such as a logical error in the program
that is accessing the database. Recovery restores a database from a given
state, usually inconsistent, to a previously consistent state.
The recovery techniques that are used are based on the
atomicity property of transactions. A transaction is considered as a single
unit of work in which all operations must be applied and completed to
produce a consistent database. If, for some reason, any transaction operation
cannot be completed, the transaction must be aborted and any change to the
database must be rolled back (undone). Thus, transaction recovery reverses
all the changes that the transaction has made to the database before it was
aborted.
The database recovery process generally follows a predictable scenario. It
first determines the type and extent of the required recovery. If the entire
database needs to be recovered to a consistent state, the recovery uses the
most recent backup copy of the database in a known consistent state. The
backup copy is then rolled forward to restore all subsequent transactions by
using the transaction log information. If the database needs to be recovered
but the committed portion of the database is still usable, the recovery
process uses the transaction log to undo all the transactions that were not
committed.

13.2.1 Database Backup


Database backup and recovery functions constitute a very important
component of DBMSs. Some DBMSs provide functions that allow the
database administrator to schedule automatic database backups to secondary
storage devices, such as disks, CDs, tapes and so on. Database backups can
be taken at the following levels:
A full backup or dump of the database.
A differential backup of the database in which only the last modifications done to the
database, when compared with the previous backup copy, are copied.
A backup of the transaction log only. This level backs up all the transaction log operations that
are not reflected in a previous backup copy of the database.

The database backup is stored in a secure place, usually in a different


building and protected against danger such as fire, theft, flood and other
potential calamities. The backup’s existence guarantees database recovery
following system failures.
13.3 TYPES OF DATABASE FAILURES

There are many types of failures that can affect database processing. Some
failures affect the main memory only, while others involve secondary
storage. Following are the types of failure:
Hardware failures: Hardware failures may include memory errors, disk crashes, bad disk
sectors, disk full errors and so on. Hardware failures can also be attributed to design errors,
inadequate (poor) quality control during fabrication, overloading (use of under-capacity
components) and wearout of mechanical parts.
Software failures: Software failures may include failures related to software such as the
operating system, DBMS software, application programs and so on.
System crashes: System crashes are due to hardware or software errors, resulting in the loss
of main memory. There could be a situation that the system has entered an undesirable state,
such as deadlock, which prevents the program from continuing with normal processing. This
type of failure may or may not result in corruption of data files.
Network failures: Network failures can occur while using a client-server configuration or a
distributed database system where multiple database servers are connected by common
networks. Network failures such as communication software failures or aborted
asynchronous connections will interrupt the normal operation of the database system.
Media failures: Such failures are due to head crashes or unreadable media, resulting in the
loss of parts of secondary storage. They are the most dangerous failures.
Application software errors: These are logical errors in the program that is accessing the
database, which cause one or more transactions to fail.
Natural physical disasters: These are failures such as fires, floods, earthquake or power
failures.
Carelessness: These are failures due to unintentional destruction of data or facilities by
operators or users.
Sabotage: These are failures due to intentional corruption or destruction of data, hardware, or
software facilities.

In the event of failure, there are two principal effects that happen, namely
(a) loss of main memory including the database buffer and (b) the loss of the
disk copy (secondary storage) of the database. Depending on the type and
the extent of the failure, the recovery process ranges from a minor short-term
inconvenience to major long-term rebuild action. Regardless of the extent of
the required recovery process, recovery is not possible without backup.

13.4 TYPES OF DATABASE RECOVERY

In case of any type of failures, a transaction must either be aborted or


committed to maintain data integrity. Transaction log plays an important role
for database recovery and bringing the database in a consistent state in the
event of failure. Transactions represent the basic unit of recovery in a
database system. The recovery manager guarantees the atomicity and
durability properties of transactions in the event of failures. During recovery
from failure, the recovery manager ensures that either all the effects of a
given transaction are permanently recorded in the database or none of them
are recorded. A transaction begins with the successful execution of a "<T,
BEGIN>" (begin transaction) statement. It ends with the successful execution of
a COMMIT statement. The following two types of transaction recovery are
used:
Forward recovery.
Backward recovery.

13.4.1 Forward Recovery (or REDO)


Forward recovery (also called roll-forward) is the recovery procedure used
in case of physical damage, for example a crash of the disk pack
(secondary storage), failures during writing of data to database buffers, or
failure during flushing (transferring) buffers to secondary storage. The
intermediate results of the transactions are written in the database buffers.
The database buffers occupy an area in the main memory. From this buffer,
the data is transferred to and from secondary storage of the database. The
update operation is regarded as permanent only when the buffers are flushed
to the secondary storage. The flushing operation can be triggered by the
COMMIT operation of the transaction or automatically in the event of
buffers becoming full. If the failure occurs between writing to the buffers
and flushing of buffers to the secondary storage, the recovery manager must
determine the status of the transaction that performed the WRITE at the time
of failure. If the transaction had already issued its COMMIT, the recovery
manager must redo (roll forward) that transaction's updates to the database.
This redoing of transaction updates is also known as roll-forward. The
forward recovery guarantees the durability property of transaction.
To recreate the data lost for the reasons explained above, the system begins
by reading the most recent copy of the lost data and the transaction log
(journal) of the changes to it. A program then starts reading log entries,
starting from the first one that was recorded after the copy of database was
made and continuing through to the last one that was recorded just before the
disk was destroyed. For each of these log entries, the program changes the
data value concerned in the copy of the database to the ‘after’ value shown
in the log entry. This means that whatever processing took place in the
transaction that caused the log entry to be made, the net result on the
database after that transaction will be stored. This operation is performed
for every transaction (each entry in the log) that caused a change in the
database since the copy was taken, in the same order in which these
transactions were originally executed. This brings the database copy to the up-to-date level of
the database that was destroyed.
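
A minimal sketch of this roll-forward loop is shown below. The database copy is modelled as a dictionary and the log as a list of (transaction, item, before image, after image) tuples; these are illustrative assumptions rather than a real log format.

def roll_forward(database_copy, log_entries):
    """Apply the 'after' image of every logged update, in the original order,
    to the restored backup copy of the database."""
    for txn_id, item, before, after in log_entries:
        database_copy[item] = after
    return database_copy

# For example, two logged updates to item A bring the old copy up to date:
restored = roll_forward({"A": 50}, [("T1", "A", 50, 20), ("T2", "A", 20, 70)])
assert restored == {"A": 70}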

Fig. 13.1 Forward (or roll-forward) recovery or redo

Fig. 13.1 illustrates an example of a forward recovery system. There are a
number of variations of the forward recovery method. In one variation,
where several changes may have been made to the same piece of data since
the last database copy was made, only the last of those changes made before
the disk was destroyed needs to be applied when updating the database copy
in the roll-forward operation. Another roll-forward variation is to record an
indication of what the transaction itself looked like at the point of being
executed, along with other necessary supporting information, instead of
recording before and after images of the data in the log.
13.4.2 Backward Recovery (or UNDO)
Backward recovery (also called roll-backward) is the recovery procedure
used when an error occurs in the midst of normal operation on the database.
The error could be a human error while keying in a value, or a program
ending abnormally and leaving incomplete some of the changes to the
database that it was supposed to make. If the transaction had not committed
at the time of failure, it will cause inconsistency in the database because, in
the interim, other programs may have read the incorrect data and made use
of it. The recovery manager must then undo (roll back) any effects of the
transaction on the database. The backward recovery guarantees the atomicity
property of transactions.
Fig. 13.2 illustrates an example of the backward recovery method. In case of
a backward recovery, the recovery is started with the database in its current
state and the transaction log positioned at the last entry that was made in
it. A program then reads 'backward' through the log, resetting each updated
data value in the database to its 'before' image as recorded in the log, until
it reaches the point where the error was made. Thus, the program 'undoes'
each transaction in the reverse order from that in which it was made.

Fig. 13.2 Backward (or roll-backward) recovery or undo
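
Using the same illustrative log format as before, the backward pass can be sketched as follows, where stop_index marks the point at which the error was made (an assumption for the sketch).

def roll_backward(database, log_entries, stop_index):
    """Undo log_entries[stop_index:] by reading them in reverse order and
    restoring each 'before' image."""
    for txn_id, item, before, after in reversed(log_entries[stop_index:]):
        database[item] = before
    return database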

Example 1
Roll-backward (undo) and roll forward (redo) can be explained with an
example as shown in Fig. 13.3 in which there are a number of concurrently
executing transactions T1, T2, ……, T6. Now, let us assume that the DBMS
starts execution of transactions at time ts but fails at time tf due to disk crash
at time tc. Let us also assume that the data for transactions T2 and T3 has
already been written to the disk (secondary storage) before failure at time tf.
It can be observed from Fig. 13.3 that transactions T1 and T6 had not
committed at the point of the disk crash. Therefore, the recovery manager
must first undo transactions T1 and T6. However, it is not clear from Fig.
13.3 to what extent the changes made by the other, already committed,
transactions T2, T3, T4 and T5 have been propagated to the database on
secondary storage. This uncertainty arises because the buffers may or may
not have been flushed to secondary storage. Thus, the recovery manager
would be forced to redo transactions T2, T3, T4 and T5.

Fig. 13.3 Example of roll backward (undo) and roll froward (redo)

Example 2
Let us consider another example in which a transaction log operation
history is given as shown in Table 13.1. Besides the operation history, log
entries are listed that are written into the log buffer memory (resident in
main or physical memory) for the database recovery. The second transaction
operation W1 (A, 20) in Table 13.1 is assumed to represent an update by
transaction T1, changing the balance column value to 20 for a row in the
accounts table with ACCOUNT-ID = A. In the same sense, in the write log entry
(W, 1, A, 50, 20), the value 50 is the before image for the balance column in this
row and 20 is the after image for this column. Now, let us assume that a
system crash occurs immediately after the operation W1 (B, 80) has
completed, in the sequence of events of Table 13.1. This means that the log
entry (W, 1, B, 50, 80) has been placed in the log buffer, but the last point at
which the log buffer was written out to disk was with the log entry (C, 2).
This is the final log entry that will be available when recovery is started to
recover from the crash. At this time, since transaction T2 has committed
while transaction T1 has not, we want to make sure that all updates
performed by transaction T2 are placed on disk and that all updates performed
by transaction T1 are rolled back on disk. The final values for these data
items after recovery has been performed should be A = 50, B = 50 and C =
50, which are the values that existed just before the history of Table 13.1 began.
After the crash, the system is reinitialised and a command is given to initiate
database recovery. The process of recovery takes place in two phases, namely
(a) roll backward or ROLLBACK and (b) roll forward or ROLL
FORWARD. In the ROLLBACK phase, the entries in the sequential log file
are read in reverse order back to system start-up, when all data access
activity began. We assume that the system start-up happened just before the
first operation R1 (A, 50) of transaction history. In the ROLL FORWARD
phase, the entries in the sequential log file are read forward again to the last
entry. During the ROLLBACK step, recovery performs UNDO of all the
updates that should not have occurred, because the transaction that made
them did not commit. It also makes a list of all transactions that have
committed. We have assumed here that the ROLLBACK phase occurs first
and the ROLL FORWARD phase afterward, as is the case in most of the
commercial DBMSs such as DB2, System R of IBM.

Table 13.1 Transaction history and corresponding log entries

Table 13.2 ROLLBACK process for transaction history crashed just after Wl (B, 80)

SN Log Entry ROLLBACK action performed


1. (C, 2) Put transaction T2 in the committed list.

2. (W, 2, C, 100, 50) Since transaction T2 is in the committed list, do nothing.
3. (S, 2) Make a note that transaction T2 is no longer active.

4. (W, 1, A, 50, 20) Transaction T1 has never committed. Its last operation was a write. Therefore, the system performs UNDO of this update by writing the before image value (50) into data item A. Put transaction T1 into the uncommitted list.
5. (S, 1) Make a note that transaction T1 is no longer active. Since no transactions remain active, the ROLLBACK phase ends here.

Tables 13.2 and 13.3 list all the log entries encountered and the actions
taken during the ROLLBACK and ROLL FORWARD phases of recovery. It is
to be noted that the steps of ROLLBACK are numbered on the left and the
numbering is continued during the ROLL FORWARD phase of Table 13.3.
During ROLLBACK the system reads backward through the log entries of
the sequential log file and makes a list of all transactions that did and did not
commit. The list of committed transactions is used in the ROLL
FORWARD, but the list of transactions that did not commit is used to decide
when to UNDO updates. Since the system knows which transactions did not
commit as soon as it encounters (reading backward) the final log entry, it can
immediately begin to UNDO write log changes of uncommitted transactions
by writing before images onto disk over the row values affected. Disk
buffering is used during recovery to read in pages containing rows that need
to be updated by UNDO or REDO steps. An example of UNDO write is
shown in step 4 of table 13.2. Since the transaction responsible for the write
log entry did not commit, it should not have any transactional updates out on
disk. It is possible that some values given in the after images of these write
log entries are not out on disk. But, in any event it is clear that writing the
before images in place of these data items cannot hurt. Eventually, we return
to the value such data items had before any uncommitted transactions tried
to change them.

Table 13.3 ROLL FORWARD process for transaction history taking place after ROLLBACK of
table 13.2

SN Log Entry ROLL FORWARD action performed


6. (S, 1) No action required.
7. (W, 1, A, 50, 20) Transaction T1 is not in the committed list. No action required.

8. (S, 2) No action required.


9. (W, 2, C, 100, 50) Since transaction T2 is on the committed list, REDO
this update by writing after image value (50) into data
item C.
10. (C, 2) No action required.
11. Roll forward through all log entries and terminate
recovery.
During the ROLL FORWARD phase of table 13.3, the system simply uses
the list of committed transactions gathered during the ROLLBACK phase as
a guide to REDO updates of committed transactions that might not have
gotten out to disk. An example of REDO is shown in step 9 of Table 13.3. At
the end of this phase the data item would have the right values. All updates
of transactions that committed are applied and all updates of transactions
that did not complete are rolled back. It can be noted that in step 4 of
ROLLBACK of table 13.2, the value 50 is written to the data item A and in
step 9 of ROLL FORWARD of table 13.3, the value 50 is written to data
item C. It can be recalled that the crash occurred just after the operation in
W1 (B, 80) of transaction log operation history. Since the log entry for this
operation did not get to the disk, as can be seen in table 13.1, the before
image of B cannot be applied during recovery. The update for B to the value
80 also did not get out to disk. Thus, the final values for the three data items
mentioned in the original transaction log history are A = 50, B = 50 and C =
50, which are the values that existed just before the history of Table 13.1 began.
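
The two phases just described can be sketched in Python for this example. The tuple encodings ('S', t) for a start record, ('W', t, item, before, after) for a write record and ('C', t) for a commit record follow the spirit of Table 13.1 but are otherwise an assumption, as is the state of the disk at the time of the crash (here, T1's update to A is taken to have reached disk while T2's update to C has not).

def recover(disk, log):
    committed = set()
    # ROLLBACK phase: read the log backward, note committed transactions and
    # UNDO writes of transactions that never committed.
    for entry in reversed(log):
        if entry[0] == 'C':
            committed.add(entry[1])
        elif entry[0] == 'W' and entry[1] not in committed:
            _, txn, item, before, after = entry
            disk[item] = before
    # ROLL FORWARD phase: read the log forward and REDO writes of committed
    # transactions that might not have reached the disk.
    for entry in log:
        if entry[0] == 'W' and entry[1] in committed:
            _, txn, item, before, after = entry
            disk[item] = after
    return disk

# The log available after the crash (the entry for W1(B, 80) never reached disk):
log = [('S', 1), ('W', 1, 'A', 50, 20), ('S', 2), ('W', 2, 'C', 100, 50), ('C', 2)]
assert recover({'A': 20, 'B': 50, 'C': 100}, log) == {'A': 50, 'B': 50, 'C': 50}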

13.4.3 Media Recovery


Media recovery is performed when there is a head crash (record scratched by
a phonograph needle) on the disk. During a head crash, the data stored on the
disk is lost. Media recovery is based on periodically making a copy of the
database. In the simplest form of media recovery, before system start-up,
bulk copy is performed for all disks being run on a transactional system. The
copies are made to duplicate disks or to less expensive tape media. When a
database object such as a file or a page is corrupted or a disk has been lost in
a system crash, the disk is replaced with a backup disk, and the normal
recovery process is performed. During this recovery, however, ROLLBACK
is performed all the way back to system start-up, since one cannot depend on
the backup disk to have any updates that were forced out up to the last
checkpoint. Then, ROLL FORWARD is performed from that point to the
time of the system crash. Thus, the normal recovery process recovers all
updates onto this backup disk.
13.5 RECOVERY TECHNIQUES

Database recovery techniques used by DBMS depend on the type and extent
of damage that has occurred to the database. These techniques are based on
the atomic transaction property. All portions of transactions must be treated
as a single logical unit of work, in which all operations must be applied and
completed to produce a consistent database. The following two types of
damages can take place to the database:
a. Physical damage: If the database has been physically damaged, for example disk crash has
occurred, then the last backup copy of the database is restored and update operations of
committed transactions are reapplied using the transaction log file. It is to be noted that the
restoration in this case is possible only if the transaction log has not been damaged.
b. Non-physical or Transaction failure: If the database has become inconsistent due to a
system crash during execution of transactions, then the changes that caused the inconsistency
are rolled-backward (undo). It may also be necessary to roll-forward (redo) some transactions
to ensure that the updates performed by them have reached secondary storage. In this case,
the database is restored to a consistent state using the before- and after-images held in the
transaction log file. This technique is also known as log-based recovery technique. The
following two techniques are used for recovery from nonphysical or transaction failure:

Deferred update.
Immediate update.

13.5.1 Deferred Update


In case of the deferred update technique, updates are not written to the
database until after a transaction has reached its COMMIT point. In other
words, the updates to the database are deferred (or postponed) until the
transaction completes its execution successfully and reaches its commit
point. During transaction execution, the updates are recorded only in the
transaction log and in the cache buffers. After the transaction reaches its
commit point and the transaction log is forced-written to disk, the updates
are recorded in the database. If a transaction fails before it reaches this point,
it will not have modified the database and so no undoing of changes will be
necessary. However, it may be necessary to redo the updates of committed
transactions as their effect may not have reached the database. In the case of
deferred update, the transaction log file is used in the following ways:
When a transaction T begins, transaction begin (or <T, BEGIN>) is written to the transaction
log.
During the execution of transaction T, a new log record containing all log data specified
previously, e.g., new value ai for attribute A is written, denoted as “<WRITE (A, ai)>”. Each
record consists of the transaction name T, the attribute name A and the new value of attribute
ai.
When all actions comprising transaction T are successfully completed, we say that the
transaction T partially commits and the record "<T, COMMIT>" is written to the
transaction log. After transaction T partially commits, the records associated with transaction
T in the transaction log are used in executing the actual updates by writing to the appropriate
records in the database.
If a transaction T aborts, the transaction log records for transaction T are ignored and the
writes are not performed.

Table 13.4 Normal execution of transaction T

Time snap-shot    Transaction Step       Actions

Time-1            READ (A, a1)           Read the current employee's loan balance
Time-2            a1 := a1 + 20000       Increase the loan balance of the employee by INR 20000
Time-3            WRITE (A, a1)          Write the new loan balance to EMP-LOAN-BAL
Time-4            READ (B, b1)           Read the current loan cash balance
Time-5            b1 := b1 − 20000       Reduce the loan cash balance left by INR 20000
Time-6            WRITE (B, b1)          Write the new balance to CUR-LOAN-CASH-BAL

Let us now consider the example of a transaction, which updates an


attribute called employee's loan balance (EMP-LOAN-BAL) in the table
EMPLOYEE. Assume that the current value of EMP-LOAN-BAL is INR
70000. Now assume that the transaction T takes place for making a loan
payment of INR 20000 to the employee. Let us also assume that current loan
cash balance (CUR-LOAN-CASH-BAL) is INR 80000. Table 13.4 shows
the transaction steps for recording loan payment of INR 20000. The
corresponding transaction log entries are shown in table 13.5.
After a failure has occurred, the DBMS examines the transaction log to
determine which transactions need to be redone. If the transaction log
contains both the start record “<T, BEGIN>” and commit record “<T,
COMMIT>” for transaction T, the transaction T must be redone. That
means, the database may have been corrupted, but the transaction execution
was completed and the new values for the relevant data items are contained
in the transaction log. Therefore, the transaction is needed to be reprocessed.
The transaction log is used to restore the state of the database system using a
REDO(T) procedure. Redo sets the value of all data items updated by
transaction T to the new values that are recorded in the transaction log. Now
let us assume that database failure occurred in the following conditions:

Table 13.5 Deferred update log entries for transaction T

just after the COMMIT record is entered in the transaction log and before the updated
records are written to the database.
just before the execution of the WRITE operation.

Table 13.6 shows the transaction log when a failure has occurred just after
the "<T, COMMIT>" record is entered in the transaction log and before the
updated records are written to the database. When the system comes back
up, the transaction log contains both the "<T, BEGIN>" and "<T, COMMIT>"
records for transaction T, so the REDO operation is executed, resulting in
the values INR 90000 and INR 60000 being written to the database as the
updated values of A and B.
Table 13.6 Deferred update log entries for transaction T after failure occurrence and updates are
written to the database

Table 13.7 shows the transaction log when a failure has occurred just
before the execution of the write operation “WRITE (B, b1)”. When the
system comes back up, no action is necessary because no COMMIT record
for transaction T appears in the transaction log. The values of A and B in the
database remain INR 70000 and INR 80000 respectively. In this case, the
transaction must be restarted.

Table 13.7 Deferred update log entries for transaction T when failure occurs before the WRITE
action to the database

Therefore, using the transaction log, the DBMS can handle any failure that
does not involve the loss of the log information itself. The loss of the
transaction log is prevented by keeping a parallel backup (replica) of the
transaction log on more than one disk (secondary storage). Since the
probability of losing the transaction log is then very small, this arrangement
is usually referred to as stable storage.
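
The deferred update rule can be condensed into a short Python sketch. Log tuples of the form ('BEGIN', T), ('WRITE', T, attribute, new value) and ('COMMIT', T) are assumed for illustration.

def redo_deferred(database, log):
    """Apply logged writes only for transactions whose COMMIT record is in
    the log; log records of other transactions are simply ignored."""
    committed = {txn for kind, txn, *rest in log if kind == 'COMMIT'}
    for kind, txn, *rest in log:
        if kind == 'WRITE' and txn in committed:
            attribute, new_value = rest
            database[attribute] = new_value
    return database

# Replaying the committed transaction T of Table 13.4 yields A = 90000, B = 60000:
log = [('BEGIN', 'T'), ('WRITE', 'T', 'A', 90000), ('WRITE', 'T', 'B', 60000), ('COMMIT', 'T')]
assert redo_deferred({'A': 70000, 'B': 80000}, log) == {'A': 90000, 'B': 60000}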
13.5.2 Immediate Update
In case of immediate update technique, all updates to the database are
applied immediately as they occur without waiting to reach the COMMIT
point and a record of all changes is kept in the transaction log. As discussed
in the previous case of deferred update, if a failure occurs, the transaction
log is used to restore the state of the database to a consistent previous state.
Similarly in immediate update also, when a transaction begins, a record “<T,
BEGIN>” and update operations are written to the transaction log on disk
before it is applied to the database. This type of recovery method requires
two procedures, namely (a) redoing transaction T, REDO(T), and (b) undoing
transaction T, UNDO(T). The first procedure redoes the same operations as
before, whereas the second one restores the values of all attributes updated
by transaction T to their old values.
transaction log after the execution of transaction T. After a failure has
occurred, the recovery system examines the transaction log to identify those
transactions that need to be undone or redone.

Table 13.8 Immediate update log entries for transaction T

In the case of immediate update, the transaction log file is used in the
following ways:
When a transaction T begins, transaction begin (or “<T, BEGIN>”) is written to the
transaction log.
When a write operation is performed, a record containing the necessary data is written to the
transaction log file.
Once the transaction log is written, the update is written to the database buffers.
The updates to the database itself are written when the buffers are next flushed (transferred)
to secondary storage.
When the transaction T commits, a transaction commit (“<T, COMMIT>”) record is written
to the transaction log.
If the transaction log reveals the record “<T, BEGIN>” but does not reveal “<T, COMMIT>”,
transaction T is undone. The old values of affected data items are restored and transaction T
is restarted.
If the transaction log contains both of the preceding records, transaction T is redone. The
transaction is not restarted.

Now suppose that database failure occurred in the following conditions:


just before the write action “WRITE (B, b1)”.
just after “<T, COMMIT>” is written to the transaction log but before the new values are
written to the database.

Table 13.9 Immediate update log entries for transaction T when failure occurs before the WRITE
action to the database

Table 13.9 shows the transaction log when a failure has occurred just
before the execution of the write operation “WRITE (B, b1)” of table 13.4.
When the system comes back up, it finds the record “<T, BEGIN>” but no
corresponding "<T, COMMIT>". This means that the transaction T must be
undone. Thus, an UNDO(T) operation is executed. This restores the value
of A to INR 70000 and the transaction can be restarted.
Table 13.10 shows the transaction log when a failure has occurred just
after the execution of “<T, COMMIT>” is written to the transaction log but
before the new values are written to the database. When the system comes
back up, a scan of the transaction log shows the corresponding "<T,
BEGIN>" and "<T, COMMIT>" records. Thus, a REDO(T) operation is
executed. This results in the values of A and B becoming INR 90000 and
INR 60000 respectively.
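
The undo/redo decision just illustrated can be summarised in a short Python sketch. Log tuples of the form ('BEGIN', T), ('WRITE', T, attribute, before, after) and ('COMMIT', T) are illustrative assumptions.

def action_for(txn_id, log):
    """Return 'redo', 'undo' or None for a transaction after a failure."""
    has_begin = any(kind == 'BEGIN' and txn == txn_id for kind, txn, *rest in log)
    has_commit = any(kind == 'COMMIT' and txn == txn_id for kind, txn, *rest in log)
    if has_begin and has_commit:
        return 'redo'       # reapply after images; the transaction is not restarted
    if has_begin:
        return 'undo'       # restore before images; the transaction is restarted
    return None

# The two failure cases of Tables 13.9 and 13.10 for transaction T:
log_before_write = [('BEGIN', 'T'), ('WRITE', 'T', 'A', 70000, 90000)]
log_after_commit = log_before_write + [('WRITE', 'T', 'B', 80000, 60000), ('COMMIT', 'T')]
assert action_for('T', log_before_write) == 'undo'
assert action_for('T', log_after_commit) == 'redo'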

13.5.3 Shadow Paging


Shadow paging was introduced by Lorie in 1977 as an alternative to the log-
based recovery schemes. The shadow paging technique does not require the
use of a transaction log in a single-user environment. However, in a multi-
user environment a transaction log may be needed for the concurrency
control method. In the shadow page scheme, the database is considered to be
made up of logical units of storage of fixed-size disk pages (or disk blocks).
The pages are mapped into physical blocks of storage by means of a page
table, with one entry for each logical page of the database. This entry
contains the block number of the physical (secondary) storage where this
page is stored. Thus, the shadow paging scheme is one possible form of the
indirect page allocation.
Table 13.10 Immediate update log entries for transaction T when failure occurs just after the
COMMIT action

Fig. 13.4 Virtual memory management paging scheme

The shadow paging scheme is similar to the one which is used by the
operating system for virtual memory management. In case of virtual memory
management, the memory is divided into pages that are assumed to be of a
certain size (in terms of bytes, kilobytes, or megabytes). The virtual or
logical pages are mapped onto physical memory blocks of the same size as
the pages. The mapping is provided by means of a table known as a page
table, as shown in Fig. 13.4. The page table contains one entry for each
logical page of the process's virtual address space.
The shadow paging technique maintains two page tables during the life of
a transaction namely (a) a current page table and (b) a shadow page table,
for a transaction that is going to modify the database. Fig. 13.5 shows
shadow paging scheme. The shadow page table is the original page table, and
the transaction addresses the database using the current page table. At the
start of a transaction the two tables are the same and both point to the same blocks of
physical storage. The shadow page table is never changed thereafter, and is
used to restore the database in the event of a system failure. However,
current page table entries may change during execution of a transaction. The
current page table is used to record all updates to the database. When the
transaction completes, the current page table becomes the shadow page
table.

Fig. 13.5 Shadow paging scheme

As shown in Fig. 13.5, the pages that are affected by a transaction are
copied to new blocks of physical storage and these blocks, along with the
blocks not modified, are accessible to the transaction via the current page
table. The old version of the changed pages remains unchanged and these
pages continue to be accessible via the shadow page table. The shadow page
table contains the entries that existed in the page table before the start of the
transaction and points to the blocks that were never changed by the
transaction. The shadow page table remains unaltered by the transaction and
is used for undoing the transaction.
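
The copy-on-write behaviour of the two page tables can be sketched as follows; the classes and the block-allocation scheme are illustrative assumptions rather than an actual implementation.

class ShadowPagedDB:
    def __init__(self, blocks):
        self.blocks = dict(blocks)             # block number -> page contents
        self.shadow = {p: p for p in blocks}   # shadow page table (never changed)
        self.current = dict(self.shadow)       # current page table (updated)
        self.next_block = max(blocks) + 1

    def write_page(self, page, data):
        # An affected page is copied to a newly allocated block; only the
        # current page table is changed to point at it.
        new_block = self.next_block
        self.next_block += 1
        self.blocks[new_block] = data
        self.current[page] = new_block

    def commit(self):
        # On completion the current page table becomes the shadow page table.
        self.shadow = dict(self.current)

    def abort(self):
        # Undoing the transaction is free: simply revert to the shadow table.
        self.current = dict(self.shadow)

# Example: page 1 is updated in a new block while the shadow table still
# points at the original block, so an abort recovers the old state.
db = ShadowPagedDB({0: "page0-data", 1: "page1-data"})
db.write_page(1, "page1-updated")
db.abort()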

13.5.3.1 Advantages of Shadow Paging


The overhead of maintaining the transaction log file is eliminated.
Since there is no need for undo or redo operations, recovery is significantly faster.

13.5.3.2 Disadvantages of Shadow Paging

Data fragmentation or scattering.


Need for periodic garbage collection to reclaim inaccessible blocks.

13.5.4 Checkpoints
The point of synchronisation between the database and the transaction log
file is called a checkpoint. As explained in the preceding discussions, the
general method of database recovery uses the information in the transaction
log. The main difficulty with this approach is knowing how far back to go
in the transaction log to search in the event of a failure. In the absence of this
information, we may end up redoing transactions that have already
been safely written to the database, which can be very time-consuming
and wasteful. A better way is to find a point that is sufficiently far back to
ensure that any item written before that point has been written correctly and
stored safely. This method is called checkpointing. In checkpointing, all
buffers are force-written to secondary storage. The checkpoint technique is
used to limit (a) the volume of log information, (b) the amount of searching and
(c) the subsequent processing that needs to be carried out on the transaction log
file. The checkpoint technique is an additional component of the transaction
logging method.
During execution of transactions, the DBMS maintains the transaction log
as we have described in the preceding sections but periodically performs
checkpoints. Checkpoints are scheduled at predetermined intervals and
involve the following operations:
Writing a start-of-checkpoint record, along with the time and date, to the log on a stable
storage device, identifying it as a checkpoint.
Writing all transaction log file records in main memory to secondary storage.
Writing the modified blocks in the database buffers to secondary storage.
Writing a checkpoint record to the transaction log file. This record contains the identifiers of
all transactions that are active at the time of the checkpoint.
Writing an end-of-checkpoint record and saving of the address of the checkpoint record on a
file accessible to the recovery routine on start-up after a system crash.

For all transactions active at the checkpoint, their identifiers and their database
modification actions, which at that time are reflected only in the database
buffers, are propagated to the appropriate storage. The frequency of
checkpointing is a design consideration of the database recovery system. A
checkpoint can be taken at a fixed interval of time (for example, every 15
minutes, every 30 minutes, every hour and so on).
In case of a failure during the serial execution of transactions, the
transaction log file is checked to find the last transaction that started before
the last checkpoint. Any earlier transactions would have committed
previously and would have been written to the database at the checkpoint.
Therefore, it is necessary to redo only (a) the transaction that was active at the
checkpoint and (b) any subsequent transactions for which both start and
commit records appear in the transaction log. If a transaction is active at the
time of failure, it must be undone. If transactions are executed
concurrently, all transactions that have committed since the checkpoint are redone
and all transactions that were active at the time of failure are undone.
Fig. 13.6 Example of checkpointing

Let us assume that a transaction log is used with immediate updates. Also,
consider that the timelines for transactions T1, T2, T3 and T4 are as shown in
Fig. 13.6. When the system fails at time tf, the transaction log need only be
scanned as far back as the most recent checkpoint tc. Transaction T1 requires no
action, unless there has been a disk failure that destroyed it and possibly other
records prior to the last checkpoint. In that case, the database is reloaded
from the backup copy that was made at the last checkpoint. In either case,
transactions T2 and T3 are redone from the transaction log, and transaction
T4 is undone using the transaction log.
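
The checkpoint-based recovery scan described above can be expressed in a compact sketch. It is an illustration only, under simplifying assumptions (a flat list of log records and a checkpoint record that lists the transactions active at that time); the record layout and the function name recover_from_checkpoint are not taken from any particular DBMS.

def recover_from_checkpoint(log):
    # log is a list of records such as ("CHECKPOINT", [active transactions]),
    # ("START", tx), ("COMMIT", tx) or ("WRITE", tx, item, old, new).
    active = set()
    start = 0
    for i, record in enumerate(log):
        if record[0] == "CHECKPOINT":
            start, active = i, set(record[1])   # most recent checkpoint wins

    committed = set()
    for record in log[start:]:
        if record[0] == "START":
            active.add(record[1])
        elif record[0] == "COMMIT":
            committed.add(record[1])
            active.discard(record[1])

    # Redo the transactions committed since the checkpoint; undo those that
    # were still active at the time of failure.
    return committed, active

With a log corresponding to Fig. 13.6, for example

log = [("START", "T1"), ("COMMIT", "T1"),
       ("CHECKPOINT", ["T2", "T3"]),
       ("START", "T4"), ("COMMIT", "T2"), ("COMMIT", "T3")]

recover_from_checkpoint(log) returns ({"T2", "T3"}, {"T4"}), that is, redo T2 and T3 and undo T4, while T1 needs no action.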

13.6 BUFFER MANAGEMENT

DBMS application programs require input/output (I/O) operations, which are


performed by a component of operating system. These I/O operations
normally use buffers to match the speed of the processor and the relatively
fast main (or primary) memories with the slower secondary storages and also
to minimise the number of I/O operations between the main and secondary
memories wherever possible. The buffers are reserved blocks of main
memory. The assignment and management of memory blocks is called buffer
management, and the component of the operating system that performs this task
is called the buffer manager. The buffer manager is responsible for the efficient management of
the database buffers that are used to transfer (flush) pages between the buffers
and secondary storage. It ensures that as many data requests made by
programs as possible are satisfied from data copied from secondary
storage into the buffers. The buffer manager takes care of reading pages
from the disk (secondary storage) into the buffers (physical memory) until
the buffers become full and then using a replacement strategy to decide
which buffer(s) to force-write to disk to make space for new pages that need
to be read from disk. Some of the replacement strategies used by the buffer
manager are (a) first-in-first-out (FIFO) and (b) least recently used (LRU).
A computer system uses buffers that are in effect virtual memory buffers.
Thus, a mapping is required between a virtual memory buffer and the
physical memory, as shown in Fig. 13.7. The physical memory is managed
by the memory management component of the operating system. In
virtual memory management, the buffers containing pages of
the database undergoing modification by a transaction could be written out
to secondary storage. The timing of this premature writing of a buffer is
decided by the memory management component of the operating system and
is independent of the state of the transaction. To decrease the number of
buffer faults, the least recently used (LRU) algorithm is used for buffer
replacement.
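
The LRU replacement policy mentioned above can be sketched in a few lines. The code below is a simplified illustration rather than the buffer manager of any specific DBMS; it assumes a fixed pool of frames and uses an ordered dictionary to track how recently each page was used.

from collections import OrderedDict

class LRUBufferPool:
    # Keeps at most `capacity` pages in memory; on overflow the least
    # recently used page is evicted (a real buffer manager would first
    # force-write the page to disk if it were dirty).

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk   # callable: page_no -> page data
        self.frames = OrderedDict()            # page_no -> page data

    def get_page(self, page_no):
        if page_no in self.frames:
            self.frames.move_to_end(page_no)   # mark as most recently used
            return self.frames[page_no]
        data = self.read_from_disk(page_no)    # buffer fault: fetch from disk
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)    # evict the least recently used
        self.frames[page_no] = data
        return data

For instance, LRUBufferPool(3, lambda p: "page-%d" % p) keeps only the three most recently requested pages in memory.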

Fig. 13.7 DBMS buffers in virtual memory


The buffer manager effectively provides a temporary copy of a database
page. It is therefore used in database recovery, in which the
modifications are made in this temporary copy while the original page remains
unchanged in secondary storage. Both the transaction log and the data
pages are written to buffer pages in virtual memory. The COMMIT
operation takes place in two phases, and thus it is called a two-phase
commit. In the first phase of the COMMIT operation, the transaction log buffers
are written out (write-ahead log). In the second phase, the data buffers are
written out. If a data buffer is being used by another transaction, the writing
of that buffer is delayed. This delay does not cause any problem, because the
log is always forced out during the first phase of the COMMIT. Since no
uncommitted modifications are reflected in the database, no undo operations
are required in this method of database recovery.
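
The ordering constraint just described, namely that the log buffers are forced out before the data buffers, can be shown in a small sketch. The names used here (commit, stable_log and so on) are assumptions made for illustration; this is not the commit protocol of any particular DBMS.

def commit(tx, log_buffers, data_buffers, stable_log, database):
    # Phase 1 (write-ahead log): force all log records of tx to stable storage.
    for record in log_buffers.pop(tx, []):
        stable_log.append(record)

    # Phase 2: flush the data buffers modified by tx to the database on disk.
    # If a buffer is still in use by another transaction this write may be
    # delayed, which is safe because the log was already forced in phase 1.
    for page_no, data in data_buffers.pop(tx, {}).items():
        database[page_no] = data

Because a change can reach the database only after its log record is safely on stable storage, no uncommitted modification is ever reflected in the database, which is why undo is not needed in this scheme.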

REVIEW QUESTIONS
1. Discuss the different types of transaction failures that may occur in a database environment.
2. What is database recovery? What is meant by forward and backward recovery? Explain with
an example.
3. How does the recovery manager ensure atomicity and durability of transactions?
4. What is the difference between stable storage and disk?
5. Describe how the transaction log file is a fundamental feature in any recovery mechanism.
6. What is the difference between a system crash and media failure?
7. Describe how transaction log file is used in forward and backward recovery.
8. Explain with the help of examples why it is necessary to store transaction log records in a
stable storage before committing that transaction when immediate update is allowed.
9. What can be done to recover the modifications made by partially completed transactions that
are running at the time of a system crash? Can an on-line transaction be recovered?
10. What are the types of damages that can take place to the database? Explain.
11. Differentiate between immediate update and deferred update recovery techniques.
12. Assuming a transaction log with immediate updates, create log entries corresponding to the
transactions as shown in Table 13.11 below.

Table 13.11 Immediate updates entries for transaction T


Time snap-shot   Transaction Step   Actions
Time-1           READ (A, a1)       Read the current employee’s loan balance
Time-2           a1 := a1 − 500     Debit the account by INR 500
Time-3           WRITE (A, a1)      Write the new loan balance
Time-4           READ (B, b1)       Read the current account payable balance
Time-5           b1 := b1 + 500     Credit the account balance by INR 500
Time-6           WRITE (B, b1)      Write the new balance

13. Suppose that in Question 12 a failure occurs just after the transaction log record for the action
WRITE (B, b1) has been written.

a. Show the contents of the transaction log at the time of failure.


b. What action is necessary and why?
c. What are the resulting values of A and B?

14. Suppose that in Question 12 a failure occurs just after the “<T, COMMIT>” record is written
to the transaction log.

a. Show the contents of the transaction log at the time of failure.


b. What action is necessary and why?
c. What are the resulting values of A and B?

15. Consider the entries shown in Table 13.12 at the time of database system failure in the
recovery log.

a. Assuming a deferred update log, describe for each case (A, B, C) what recovery
actions are necessary and why. Indicate what are the values for the given attributes
after the recovery actions are completed.

Table 13.12 Recovery log entries at the time of failure

Entry A              Entry B              Entry C
<T1, BEGIN>          <T1, BEGIN>          <T1, BEGIN>
<T1, A, 500, 395>    <T1, A, 500, 395>    <T1, A, 500, 395>
<T1, B, 800, 950>    <T1, B, 800, 950>    <T1, B, 800, 950>
                     <T1, COMMIT>         <T1, COMMIT>
                     <T2, BEGIN>          <T2, BEGIN>
                     <T2, C, 320, 419>    <T2, C, 320, 419>
                                          <T2, COMMIT>

b. Assuming an immediate update log, describe for each case (A, B, C) what recovery
actions are necessary and why. Indicate what are the values for the given attributes
after the recovery actions are completed.

16. What is a checkpoint? How is the checkpoint information used in the recovery operation
following a system crash?
17. Describe the shadow paging recovery technique. Under what circumstances does it not
require a transaction log? List the advantages and disadvantages of shadow paging.
18. What is a buffer? Explain the buffer management technique used in database recovery.

STATE TRUE/FALSE

1. Concurrency control and database recovery are intertwined and both are a part of the
transaction management.
2. Database recovery is a service that is provided by the DBMS to ensure that the database is
reliable and remains in consistent state in case of a failure.
3. Database recovery is the process of restoring the database to a correct (consistent) state in the
event of a failure.
4. Forward recovery is the recovery procedure, which is used in case of physical damage.
5. Backward recovery is the recovery procedure, which is used in case an error occurs in the
midst of normal operation on the database.
6. Media failures are the most dangerous failures.
7. Media recovery is performed when there is a head crash (record scratched by a phonograph
needle) on the disk.
8. The recovery process is closely associated with the operating system.
9. Shadow paging technique does not require the use of a transaction log in a single-user
environment
10. In shadowing both the before-image and after-image are kept on the disk, thus avoiding the
need for a transaction log for the recovery process.
11. The REDO operation updates the database with new values (after-image) that is stored in the
log.
12. The REDO operation copies the old values from log to the database, thus restoring the
database prior to a state before the start of the transaction.
13. In case of deferred update technique, updates are not written to the database until after a
transaction has reached its COMMIT point.
14. In case of an immediate update technique, all updates to the database are applied immediately
as they occur with waiting to reach the COMMIT point and a record of all changes is kept in
the transaction log.
15. A checkpoint is a point of synchronisation between the database and the transaction log file.
16. In checkpointing, all buffers are force-written to secondary storage.
17. The deferred update technique is also known as the UNDO/REDO algorithm.
18. Shadow paging is a technique where transaction log are not required.
19. Recovery restores a database from a given state, usually inconsistent, to a previously
consistent state.
20. The assignment and management of memory blocks is called the buffer manager.

TICK (✓) THE APPROPRIATE ANSWER

1. Which of the following is not a recovery technique?

a. Shadow paging.
b. Deferred update.
c. Write-ahead logging.
d. Immediate update.

2. Incremental logging with deferred updates implies that the recovery system must necessarily
store

a. the old value of the updated item in the log.


b. the new value of the updated item in the log.
c. both the old and new value of the updated item in the log.
d. only the begin transaction and commit transaction records in the log.

3. Which of the following are copies of physical database files?

a. Transaction log
b. Physical backup
c. Logical backup
d. None of these.

4. In case of transaction failure under a deferred update incremental logging scheme, which of
the following will be needed:

a. An undo operation
b. A redo operation
c. Both undo and redo operations
d. None of these.

5. Which of the following types of failure is caused by hardware?

a. Operations
b. Design
c. Physical
d. None of these.

6. For incremental logging with immediate updates, a transaction log record would contain

a. a transaction name, data item name, old value of item and new value of item.
b. a transaction name, data item name, old value of item.
c. a transaction name, data item name, new value of item.
d. a transaction name and data item name.

7. Which of the following is the most dangerous type of failure?

a. Hardware
b. Network
c. Media
d. Software.

8. When a failure occurs, the transaction log is referred and each operation is either undone or
redone. This is a problem because

a. searching the entire transaction log is time consuming.


b. many redo operations are necessary.
c. Both (a) and (b).
d. None of these.

9. Hardware failures may include

a. memory errors.
b. disk crashes.
c. disk full errors.
d. All of these.

10. Software failures may include failures related to software such as

a. operating system.
b. DBMS software.
c. application programs.
d. All of these.

11. Which of the following is a facility provided by the DBMS to assist the recovery process?
a. Recovery manager
b. Logging facilities
c. Backup mechanism
d. All of these.

12. In the event of failure, principal effects that happen are

a. loss of main memory including the database buffer.


b. the loss of the disk copy (secondary storage) of the database.
c. Both (a) and (b).
d. None of these.

13. When using a transaction log based recovery scheme, it might improve performance as well
as providing a recovery mechanism by

a. writing the appropriate log records to disk during the transaction’s execution.
b. writing the log records to disk when each transaction commits.
c. never writing the log records to disk.
d. waiting to write the log records until multiple transactions commit and writing them
as a batch.

14. Which of the following is an example of a NO-UNDO/REDO algorithm?

a. Shadow paging
b. Immediate update
c. Deferred update
d. None of these.

15. To cope with media (or disk) failures, it is necessary

a. to keep a redundant copy of the database.


b. to never abort a transaction.
c. for the DBMS to only execute transactions in a single-user environment.
d. All of these.

16. Which of the following is an example of a UNDO/REDO algorithm?

a. Shadow paging
b. Immediate update
c. Deferred update
d. None of these.

17. If the shadowing approach is used for flushing a data item back to disk, then the item is
written to

a. the same disk location from which it was read.


b. disk before the transaction commits.
c. disk only after the transaction commits.
d. a different location on disk.

18. Shadow paging was introduced by

a. Lorie
b. Codd
c. IBM
d. Boyce.

19. Shadow paging technique maintains

a. two page tables.


b. three page tables.
c. four page tables.
d. five page tables.

20. The checkpoint technique is used to limit

a. the volume of log information.


b. amount of searching.
c. subsequent processing that is needed to carry out on the transaction log file.
d. All of these.

21. Which of the following recovery technique does not need logs?

a. Shadow paging
b. Immediate update
c. Deferred update
d. None of these.

22. The failure may be the result of

a. a system crash due to hardware or software errors.


b. a media failure such as head crash.
c. a software error in the application such as a logical error in the program that is
accessing the database.
d. All of these.

23. The database backup is stored in a secure place, usually

a. in a different building.
b. protected against danger such as fire, theft, flood.
c. other potential calamities.
d. All of these.

FILL IN THE BLANKS


1. _____ is a process of restoring a database to the correct state in the event of a failure.
2. If only the transaction has to be undone, then it is called a _____.
3. When all the active transactions have to be undone, then it is called a _____.
4. If all pages updated by a transaction are immediately written to disk when the transaction
commits, this is called a _____ and the writing is called a _____.
5. If the pages are flushed to the disk only when they are full or at some time interval, then it is
called _____.
6. Shadow paging technique does not require the use of a transaction log in _____ environment.
7. Shadow paging technique is classified as _____ algorithm.
8. Concurrency control and database recovery are intertwined and are both part of _____.
9. Recovery is required to protect the database from (a) _____ and (b) _____.
10. The failure may be the result of (a) _____, (b) _____, (c) _____ or (d) _____.
11. Recovery restores a database from a given state, usually _____, to a _____ state.
12. The database backup is stored in a secure place, usually in (a) _____ and (b) _____ such as
fire, theft, flood and other potential calamities.
13. System crashes are due to hardware or software errors, resulting in loss of _____.
14. In the event of failure, there are two principal effects that happen, namely (a) _____ and (b)
_____.
15. Media recovery is performed when there is _____ on the disk.
16. In case of deferred update technique, updates are not written to the database until after a
transaction has reached its _____.
17. In case of immediate update technique, all updates to the database are applied immediately as
they occur _____ to reach the COMMIT point and a record of all changes is kept in the
_____.
18. Shadow paging technique maintains two page tables during the life of a transaction namely
(a) _____ and (b) _____.
19. In checkpointing, all buffers are _____ to secondary storage.
20. The assignment and management of memory blocks is called _____ and the component of
the operating system that performs this task is called _____.
Chapter 14

Database Security

14.1 INTRODUCTION

Database security is an important issue in database management because of


the sensitivity and importance of data and information of an organisation.
The data stored in a DBMS is often vital to the business interests of the
organisation and is regarded as a corporate asset. Thus, a database represents
an essential resource of an organisation that should be properly secured. The
database environment is becoming more and more complex with the
growing popularity and use of distributed databases with client/server
architectures as compared to the mainframes. The access to the database has
become more open through the Internet and corporate intranets. As a result,
managing database security effectively has also become more difficult and
time-consuming. Therefore, it is important for the database administrator
(DBA) to develop overall policies, procedures and appropriate controls to
protect the databases.
In this chapter, the potential threats to data security and protection against
unauthorised access have been discussed. Various security mechanisms, such
as discretionary access control, mandatory access control and statistical
database security, have also been discussed.

14.2 GOALS OF DATABASE SECURITY

The goal of database security is the protection of data against threats such as
accidental or intentional loss, destruction or misuse. These threats pose
problems to the database integrity and access. Threats may be defined as any
situation or event, whether intentional or accidental, that may adversely
affect a system and consequently the organisation. A threat may be caused
by a situation or event involving a person, action or circumstances that are
likely to harm the organisation. The harm may be tangible, such as loss of
hardware, software, or data. The harm could be intangible, such as loss of
credibility or client confidence in the organisation. Database security
involves allowing or disallowing users from performing actions on the
database and the objects within it, thus protecting the database from abuse or
misuse.
The database administrator (DBA) is responsible for the overall security
of the database system. Therefore, the DBA of an organisation must identify
the most serious threats and enforce security to take appropriate control
actions to minimise these threats. Any individual user (a person) or a user
group (a group of persons) needing to access the database system applies to the DBA
for a user account. The DBA then creates an account number and password
for the user to access the database, on the basis of legitimate need and the policy of
the organisation. The user thereafter logs in to the DBMS using the given
account number and password whenever database access is needed. The
DBMS checks the validity of the entered account number and
password, and a valid user is then permitted to use the DBMS and access the
database. The DBMS maintains these two fields, the user account number and
the password, in an encrypted table. The DBMS appends a new record to this
table whenever a new account is created. When an account is cancelled, the
corresponding record is deleted from the encrypted table.
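
A rough illustration of such an account table is sketched below. It stores hashed rather than plain-text passwords and uses assumed names (AccountCatalog, create_account, authenticate); real DBMSs keep this information in their own, more elaborate catalog structures.

import hashlib

class AccountCatalog:
    # Toy catalog of user accounts: account number -> hashed password.

    def __init__(self):
        self.accounts = {}

    def create_account(self, account_no, password):
        # Append a record when the DBA creates a new account; only a hash of
        # the password is stored, never the plain text.
        self.accounts[account_no] = hashlib.sha256(password.encode()).hexdigest()

    def cancel_account(self, account_no):
        # Delete the corresponding record when the account is cancelled.
        self.accounts.pop(account_no, None)

    def authenticate(self, account_no, password):
        # Check the validity of the entered account number and password.
        digest = hashlib.sha256(password.encode()).hexdigest()
        return self.accounts.get(account_no) == digest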

14.2.1 Threats to Database Security


Threats to database security may be direct, for example, browsing, changing
or stealing of data through unauthorised access. To ensure a secure database,
all parts of the system must be secure, including the database, the hardware,
the operating system, the network, the users, and even the buildings
housing the computer systems. Some of the threats that must be addressed in
a comprehensive database security plan are as follows:
Loss of availability.
Loss of data integrity.
Loss of confidentiality or secrecy.
Loss of privacy.
Theft and fraud.
Accidental losses.

Loss of availability means that the data, or the system, or both cannot be
accessed by the users. This situation can arise due to sabotage of hardware,
networks or applications. The loss of availability can seriously cause
operational difficulties and affect the financial performance of an
organisation. Almost all organisations are now seeking virtually continuous
operation, the so called 24 × 7 operations, that is, 24 hours a day and seven
days a week.
Loss of data integrity causes invalid or corrupted data, which may
seriously affect the operation of an organisation. Unless data integrity is
restored through established backup and recovery procedures, an
organisation may suffer serious losses or make incorrect and expensive
decisions based on the wrong or invalid data.
Loss of confidentiality refers to loss of protecting or maintaining secrecy
over critical data of the organisation, which may have strategic value to the
organisation. Loss of confidentiality could lead to loss of competitiveness.
Loss of privacy refers to loss of protecting data from individuals. Loss of
privacy could lead to blackmail, bribery, public embarrassment, stealing of
user passwords or legal action being taken against the organisation.
Theft and fraud affect not only the database environment but also the
entire organisation. Since these situations involve people, attention should be
given to reducing the opportunity for these activities to occur. For example,
control of physical security, so that
unauthorised personnel are not able to gain access to the computer room,
should be established. Another example of a security procedure could be
establishment of a firewall to protect from unauthorised access to
inappropriate parts of the database through outside communication links.
This will hamper people who are intent on theft or fraud. Theft and fraud do
not necessarily alter data, as is the case for loss of confidentiality or loss of
privacy.
Accidental losses could be unintentional threats including human error,
software and hardware-caused breaches. Operating procedures, such as user
authorisation, uniform software installation procedures and hardware
maintenance schedules, can be established to address threats from accidental
losses.

14.2.2 Types of Database Security Issues


Database security addresses many issues some of which are as follows:
Legal and ethical issues: These issues relate to the right of an individual user or user group to
access certain information. Certain private information cannot legally be accessed by
unauthorised persons.
System-related issues: In system-related issues, various security functions are enforced at
system levels, for example, at physical hardware level, at the DBMS level, or at the operating
system level.
Organisation-based issues: In this case, some organisations identify multiple security levels
and categorise the data and users based on classifications, such as top secret, secret,
confidential and unclassified. In such cases, the security policy of the organisation must be
enforced with respect to permitting access to various classifications of data.
Policy-based issues: At the institutional, corporate or government level, there is at times a
policy about which information can be shared or made public and which cannot.

The DBMS must provide techniques to certain users or user groups to


access selected portions of the database without gaining access to the rest of
the database. This is particularly important when a large integrated
database is accessed by many different users within the same organisation in
a multi-user environment. In such cases, the database security system of a
DBMS is responsible for ensuring the security of portions of a database
against unauthorised access.

14.2.3 Authorisation and Authentication


Authorisation is the process of granting a right or privilege to the user(s)
to have legitimate access to a system or to objects (such as database tables) of the
system. It is the culmination of the administrative policies of the
organisation, expressed as a set of rules that can be used to determine which
user has what type of access to which portion of the database. The process of
authorisation involves authentication of the user(s) requesting access to objects.
Authentication is a mechanism that determines whether a user is who he
or she claims to be. In other words, authentication checks whether a user
operating upon the database is, in fact, allowed to do so. It verifies the
identity of a person (user) or program connecting to a database. The simplest
form of authentication is a secret password which must be
presented when a connection to the database is opened. Password-based
authentication is widely used by operating systems as well as databases. For
a more secure scheme, especially in a network environment, other
authentication schemes are used, such as challenge-response systems, digital
signatures and so on.
Authorisation and authentication controls can be built into the software.
Authorisation rules are incorporated in the DBMSs that restrict access to
data and also restrict the actions that people may take when they access data.
For example, a user or a person using a particular password may be
authorised to read any record in a database but cannot necessarily modify
any of those records. For this reason, authorisation controls are sometimes
referred to as access controls. The following two types of access control
techniques are used in database security systems:
Discretionary access control.
Mandatory access control.

Using the above controls, database security aims to minimise losses or


damage to the database caused by anticipated events in a cost-effective
manner without unduly constraining the users of the database. The DBMS
provides these access control mechanisms to allow users to access only the
data for which they have been authorised, and not all the data in an
unrestricted manner. Most DBMSs support either the discretionary security
scheme or the mandatory security scheme or both.

14.3 DISCRETIONARY ACCESS CONTROL


Discretionary access control (also called security scheme) is based on the
concept of access rights (also called privileges) and mechanism for giving
users such privileges. It grants the privileges (access rights) to users on
different objects, including the capability to access specific data files,
records or fields in a specified mode, such as, read, insert, delete or update or
combination of these. A user who creates a database object such as a table or
a view automatically gets all applicable privileges on that object. The DBMS
keeps track of how these privileges are granted to other users. Discretionary
security schemes are very flexible. However, they have certain weaknesses, for
example, a devious unauthorised user can trick an authorised user into
disclosing sensitive data.

14.3.1 Granting/Revoking Privileges


Granting and revoking privileges to users is the responsibility of the
database administrator (DBA) of the DBMS. The DBA classifies users and data
in accordance with the policy of the organisation. The DBA's privileged
commands include commands for granting and revoking privileges to
individual accounts, users or user groups. The DBA performs the following types of
actions:
a. Account creation: Account creation action creates a new account and password for a user or a
group of users to enable them to access the DBMS.
b. Privilege granting: Privilege granting action permits the DBA to grant certain privileges
(access rights) to certain accounts.
c. Privilege revocation: Privilege revoking action permits the DBA to revoke (cancel) certain
privileges (access rights) that were previously given to certain accounts.
d. Security level assignment: Security level assignment action consists of assigning user
accounts to the appropriate security classification level.

Having an account and a password does not necessarily entitle a user or user
group to access all the functions of the DBMS. Generally, the following two
levels of privilege assignment are used for accessing the database system:
a. The account level privilege assignment: At the account level privilege assignment, the DBA
specifies the particular privileges that each account holds independently of the relations in
the database. The account level privileges apply to the capabilities provided to the account
itself and can include the following in SQL:
CREATE SCHEMA privilege : to create a schema
CREATE TABLE privilege : to create a table
CREATE VIEW privilege : to create a view
ALTER privilege : to apply schema changes such as adding or removing attributes
from relations
DROP privilege : to delete relations or views
MODIFY privilege : to delete, insert, or update tuples
SELECT privilege : to retrieve information from the database using a SELECT query

b. The relation (or table) level privilege assignment: At relation or table level of privilege
assignment, the DBA controls the privilege to access each individual relation or view in the
database. Privileges at the relation level specify for each user the individual relations on
which each type of command can be applied. Some privileges also refer to individual
attributes (columns) of relations. Granting and revoking of relation privileges is controlled by
assigning an owner account for each relation R in a database. The owner account is typically
the account that was used when the relation was first created. The owner of the relation is
given all privileges on the relation. In SQL, the following types of privileges can be granted
on each individual relation R:

SELECT privilege on R : to read or retrieve tuples from R
MODIFY privilege on R : to modify (UPDATE, INSERT and DELETE) tuples of R
REFERENCES privilege on R : to reference relation R

14.3.1.1 Examples of GRANT Privileges


In SQL, granting of privileges is accomplished using the GRANT command.
The syntax for the GRANT command is given as

GRANT {ALL | privilege-list}
ON {table-name [(column-comma-list)] | view-name [(column-comma-list)]}
TO {PUBLIC | user-list}
[WITH GRANT OPTION]

or

GRANT {ALL | privilege-list [(column-comma-list)]}
ON {table-name | view-name}
TO {PUBLIC | user-list}
[WITH GRANT OPTION]

Meaning of the various clauses is as follows:

ALL : All the privileges for the object for which the user issuing the GRANT has
grant authority are granted.
privilege-list : Only the listed privileges are granted.
ON : It specifies the object on which the privileges are granted. It can be a table
or a view.
column-comma-list : The privileges are restricted to the specified columns. If this is not
specified, the grant applies to the entire table/view.
TO : It identifies the users to whom the privileges are granted.
PUBLIC : The privileges are granted to all known users of the system who have a
valid user ID and password.
user-list : The privileges are granted to the user(s) specified in the list.
WITH GRANT OPTION : The recipient has the authority to grant the privileges that were
granted to him to other users.

Some of the examples of granting privileges are given below.


GRANT SELECT
ON EMPLOYEE
TO ABHISHEK, MATHEW

This means that the users ‘ABHISHEK’ and ‘MATHEW’ are authorised
to perform SELECT operations on the table (or relation) EMPLOYEE.

GRANT SELECT
ON EMPLOYEE
TO PUBLIC

This means that all users are authorised to perform SELECT operations on
the table (or relation) EMPLOYEE.

GRANT SELECT, UPDATE (EMP-ID)


ON EMPLOYEE
TO MATHEW

This means that the user ‘MATHEW’ has the right to perform SELECT
operations on the table EMPLOYEE as well as the right to update the EMP-
ID attribute.

GRANT SELECT, DELETE, UPDATE


ON EMPLOYEE
TO MATHEW
WITH GRANT OPTION

This means that the user ‘MATHEW’ is authorised to perform SELECT,


DELETE and UPDATE operations on the table (or relation) EMPLOYEE
with the capability to grant those privileges to other users on EMPLOYEE
table.

GRANT CREATE TABLE, CREATE VIEW


TO ABHISHEK
This means that the user ‘ABHISHEK’ is authorised to create tables and
views.

14.3.1.2 Examples of REVOKE Privileges


In SQL, revoking of privileges is accomplished using the REVOKE
command. The syntax for the REVOKE command is given as

REVOKE {ALL | privilege-list}
ON {table-name [(column-comma-list)] | view-name [(column-comma-list)]}
FROM {PUBLIC | user-list}

or

REVOKE {ALL | privilege-list [(column-comma-list)]}
ON {table-name | view-name}
FROM {PUBLIC | user-list}

Meaning of the various clauses is as follows:

ALL : All the privileges for the specified object are revoked.
privilege-list : Only the listed privileges are revoked.
ON : It specifies the object from which the privileges are removed. It can be a
table or a view.
column-comma-list : The privileges are restricted to the specified columns. If this is not
specified, the revoke applies to the entire table/view.
FROM : It identifies the users from whom the privileges are removed.
PUBLIC : The privileges are revoked from all known users of the system.
user-list : The privileges are revoked from the user(s) specified in the list. The user
issuing the REVOKE command should be the user who granted the
privileges in the first place.

Some of the examples of revoking privileges are given below.

REVOKE SELECT
ON EMPLOYEE
FROM MATHEW

This means that the user ‘MATHEW’ is no longer authorised to perform


SELECT operations on the EMPLOYEE table.

REVOKE CREATE TABLE


FROM MATHEW

This means that the system privilege for creating table is removed from
the user ‘MATHEW’.

REVOKE ALL
ON EMPLOYEE
FROM MATHEW

This means that all the privileges on the EMPLOYEE table are removed from the user ‘MATHEW’.

REVOKE DELETE, UPDATE (EMP-ID, EMP-SALARY)


ON EMPLOYEE
FROM ABHISHEK

This means that the DELETE and UPDATE authority on the EMP-ID and
EMP-SALARY attributes (columns) are removed from the user
‘ABHISHEK’.
The above examples illustrate a few of the possibilities for granting or
revoking authorisation privileges. The GRANT option may cascade among
users. For example, if ‘Mathew’ has the right to grant authority X to another
user ‘Abhishek’, then ‘Abhishek’ has the right to grant authority X to
another user ‘Rajesh’ and so on. Consider the following example:

Mathew:
GRANT SELECT
ON EMPLOYEE
TO ABHISHEK
WITH GRANT OPTION
Abhishek:
GRANT SELECT
ON EMPLOYEE
TO RAJESH
WITH GRANT OPTION

As long as the user has received a GRANT OPTION, he or she can confer
the same authority to others. However, if the user ‘Mathew’ later wishes to
revoke a GRANT OPTION, he could do so by using the following
command:

REVOKE SELECT
ON EMPLOYEE
FROM ABHISHEK

This revocation would apply to the user ‘Abhishek’ as well as to anyone


to whom he had conferred authority and so on.

14.3.2 Audit Trails


An audit trail is essentially a special file or database in which the system
automatically keeps track of all operations performed by users on the regular
data. It is a log of all changes (for example, updates, deletes, insert and so
on) to the database, along with information such as which user performed
the change and when the change was performed. In some systems, the audit
trail is physically integrated with the transaction log, in others the transaction
log and audit trail might be distinct. A typical audit trail entry might contain
the information as shown in Fig. 14.1.
The audit trail aids security to the database. For example, if the balance on
a bank account is found to be incorrect, bank may wish to trace all the
updates performed on the account, to find out incorrect updates, as well as
the persons who carried out the updates. The bank could then also use the
audit trail to trace all the updates performed by these persons, in order to
find other incorrect updates. Many DBMSs provide built-in mechanisms to
create audit trails. It is also possible to create an audit trail by defining
appropriate triggers on relation updates using system-defined variables that
identify the user name and time.

Fig. 14.1 Typical entries in audit trail file

14.4 MANDATORY ACCESS CONTROL

Mandatory access control (also called security scheme) is based on system-


wide policies that cannot be changed by individual users. It is used to
enforce multi-level security by classifying the data and users into various
security classes or levels and then implementing the appropriate security
policy of the organisation. Thus, in this scheme each data object is labelled
with a certain classification level and each user is given a certain clearance
level. A given data object can then be accessed only by users with the
appropriate clearance of a particular classification level. Thus, a mandatory
access control technique classifies data and users based on security classes
such as top secret (TS), secret (S), confidential (C) and unclassified (U). The
DBMS determines whether a given user can read or write a given object
based on certain rules that involve the security level of the object and the
clearance of the user.
The commonly used mandatory access control technique for multi-level
security is known as the Bell-LaPadula model. The Bell-LaPadula model is
described in terms of subjects (for example, users, accounts, programs),
objects (for example, relations or tables, tuples, columns, views, operations),
security classes (for example, TS, S, C or U) and clearances. The Bell-
LaPadula model classifies each subject and object into one of the security
classifications TS, S, C or U. The security classes in a system are organised
according to a particular order, with a most secure class and a least secure
class. The Bell-LaPadula model enforces the following two restrictions on data
access based on the subject/object classifications:
a. Simple security property: In this case, a subject S is not allowed read access to an object O
unless classification of subject S is greater than or equal to classification of an object O. In
other words

class (S) ≥ class (O)

b. Star security property (or *-property): In this case, a subject S is not allowed to write an
object O unless classification of subject S is less than or equal to classification of an object O.
In other words

class (S) ≤ class (O)

If discretionary access controls are also specified, these rules represent


additional restrictions. Thus, to read or write a database object a user must
have the necessary privileges obtained via GRANT commands. Also, the
security class of the user and the object must satisfy the preceding
restrictions. The mandatory security scheme is hierarchical in nature and more
rigid than the discretionary security scheme. It addresses the loopholes of the
discretionary access control mechanism.
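
The two Bell-LaPadula rules can be stated compactly in code. The sketch below only illustrates the rules themselves; the numeric ordering of the classes and the function names are assumptions made for this example.

# Security classes ordered from least to most secure:
# unclassified (U) < confidential (C) < secret (S) < top secret (TS).
LEVEL = {"U": 0, "C": 1, "S": 2, "TS": 3}

def can_read(subject_class, object_class):
    # Simple security property ("no read up"): class(S) >= class(O).
    return LEVEL[subject_class] >= LEVEL[object_class]

def can_write(subject_class, object_class):
    # Star (*) property ("no write down"): class(S) <= class(O).
    return LEVEL[subject_class] <= LEVEL[object_class]

For example, a user with clearance S may read a C object (can_read("S", "C") is True) but may not write to it (can_write("S", "C") is False), so higher-classified information cannot be written down into lower-classified objects.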

14.5 FIREWALLS

A firewall is a system designed to prevent unauthorized access to or from a


private network. Firewalls can be implemented in both hardware and
software, or a combination of both. They are frequently used to prevent
unauthorized Internet users from accessing private networks connected to
the Internet, especially intranets. All messages entering or leaving the
intranet pass through the firewall, which examines each message and blocks
those that do not meet the specified security criteria. Following are some of
the firewall techniques that are used in database security:
a. Packet Filter: Packet filter looks at each packet entering or leaving the network and
accepts or rejects it based on user-defined rules. Packet filtering is a fairly effective
mechanism and transparent to users. However, packet filters are susceptible to IP spoofing, a
technique used by intruders to gain unauthorized access to computers.
b. Application Gateway: In an application gateway, security mechanisms are applied to specific
applications, such as File Transfer Protocol (FTP) and Telnet servers. This is a very effective
security mechanism.
c. Circuit-level Gateway: In circuit-level gateway, security mechanisms are applied when a
Transport Control Protocol (TCP) or User Datagram Protocol (UDP) connection is
established. Once the connection has been made, packets can flow between the hosts without
further checking.
d. Proxy Server: Proxy server intercepts all messages entering and leaving the network. The
proxy server in effect hides the true network addresses.

14.6 STATISTICAL DATABASE SECURITY

Statistical database security system is used to control the access to a


statistical database, which is used to provide statistical information or
summaries of values based on various criteria. A statistical database contains
confidential information about individuals or organisations, which is used to
answer statistical queries concerning sums, averages, and numbers with
certain characteristics. Thus, a statistical database permits queries that derive
aggregated (statistical) information, for example, sums, averages, counts,
maximums, minimums, standard deviations, means, totals, or a query such
as “What is the average salary of Analysts?”, etc. They do not permit queries
that derive individual information, for example, the query “What is the
salary of an Analyst Abhishek?”.
In statistical queries, statistical functions are applied to a population of
tuples. A population is a set of tuples of relation (or table) that satisfy some
selection condition. For example, let us consider a relation EMPLOYEE, as
shown in Fig. 14.2. Each selection condition on the EMPLOYEE relation
will specify a particular population of EMPLOYEE tuples. For example, the
condition EMP-SEX = ‘F’ specifies the female population. The condition
(EMP-SEX = ‘F’) AND (EMP-CITY = ‘Jamshedpur’) specifies the female
population living in Jamshedpur.
Statistical database security prohibits users from retrieving individual data,
such as the salary of a specific employee. This is controlled by prohibiting
queries that retrieve attribute values and by allowing only queries that
involve statistical aggregate functions such as SUM, STANDARD
DEVIATION, MEAN, MAX, MIN, COUNT and AVERAGE.
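
A tiny sketch of this restriction is given below. It is illustrative only; the function name statistical_query and the in-memory relation are assumptions, and practical statistical databases impose further controls beyond simply limiting queries to aggregate functions.

ALLOWED_AGGREGATES = {
    "SUM": sum,
    "COUNT": len,
    "MAX": max,
    "MIN": min,
    "AVERAGE": lambda values: sum(values) / len(values),
}

def statistical_query(relation, condition, aggregate, attribute):
    # Answer only aggregate queries over the population selected by
    # `condition`; reject anything that is not an allowed aggregate.
    if aggregate not in ALLOWED_AGGREGATES:
        raise PermissionError("only statistical aggregate queries are permitted")
    population = [row[attribute] for row in relation if condition(row)]
    return ALLOWED_AGGREGATES[aggregate](population)

For example, statistical_query(employees, lambda r: r["EMP-SEX"] == "F" and r["EMP-CITY"] == "Jamshedpur", "AVERAGE", "EMP-SALARY") returns the average salary of the selected population, whereas a request that is not one of the allowed aggregates raises PermissionError.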

Fig. 14.2 Relation EMPLOYEE

14.7 DATA ENCRYPTION

Data encryption is a method of coding or scrambling of data so that humans


cannot read them. In encryption method, data is encoded by a special
algorithm that renders the data unreadable by any program or humans
without the decryption key. Data encryption technique is used to protect
from threats in which user attempts to bypass the system, for example, by
physically removing part of the database or by tapping into a communication
line and so on. The data encryption technique converts readable text to
unreadable text by use of an algorithm. Encrypted data cannot be read by an
intruder unless that user knows the method of encryption. There are various
types of encryption methods; some are simple while others are more complex and
provide a higher level of data protection. Some of the encryption schemes
used in database security are as follows:
Simple substitution method.
Polyalphabetic substitution method.

14.7.1 Simple Substitution Method


In a simple substitution method, each letter of a plaintext is shifted to its
immediate successor in the alphabet. The blank appears immediately before
the alphabet ‘a’ and it follows the alphabet ‘z’. Now suppose we wish to
encrypt the plaintext message given as

Well done.

The above readable plaintext message will be encrypted (transformed to


ciphertext) to

xfmmaepof.

Thus, if an intruder or unauthorized user sees the message “xfmmaepof”,


there is probably insufficient information to break the code. However, if a
large number of words are examined, it is possible to statistically examine the
frequency with which characters occur and, thereby, easily break the code.
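
The shift-by-one substitution described above can be written in a few lines. The sketch below treats the alphabet as the 27 characters blank, a, ..., z arranged in a circle, exactly as in the text; the function name encrypt_simple is assumed for illustration, and characters outside this alphabet are ignored.

ALPHABET = " abcdefghijklmnopqrstuvwxyz"   # blank immediately before 'a' and after 'z'

def encrypt_simple(plaintext):
    # Replace each character by its immediate successor in the circular
    # 27-character alphabet.
    out = []
    for ch in plaintext.lower():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + 1) % len(ALPHABET)])
    return "".join(out)

As in the example above, encrypt_simple("Well done") returns "xfmmaepof".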

14.7.2 Polyalphabetic Substitution Method


In a polyalphabetic substitution method, an encryption key is used. Suppose
we wish to encrypt the plaintext message “Well done” and that the encryption
key is, say, “safety”. The encryption is done as follows:
a. The key is aligned beneath the plaintext and is repeated as many times as necessary for the
plaintext to be completely “covered”. In this example, we would have

Well done
safetysaf

b. The blank space occupies the twenty-seventh (last) position in the alphabet. For each
character, the alphabet position of the plaintext character and that of the key character are
added, and the remainder on dividing the sum by 27 is taken. For our example, the first letter
of the plaintext, ‘W’, is found in the twenty-third place in the alphabet, while the first letter of
the key, ‘s’, is found in the nineteenth position. Thus, (23 + 19) = 42, and the remainder on
division by 27 is 15. This process is called division modulo 27.

The letter in the fifteenth position in the alphabet is ‘o’. Thus, the plaintext letter ‘W’ is
encrypted as the letter ‘o’ in the ciphertext. In this way, all the letters can be encrypted.

The polyalphabetic substitution method is also simple; however, it provides a
higher level of data protection.
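
As described above, the scheme is essentially a Vigenère-style cipher over a 27-character alphabet. The sketch below follows the steps of the worked example; the function name encrypt_poly is assumed, and characters outside the 27-character alphabet are ignored.

ALPHABET = "abcdefghijklmnopqrstuvwxyz "   # positions 1-26 for a-z, 27 for the blank

def position(ch):
    return ALPHABET.index(ch) + 1           # 1-based position, as in the text

def letter(pos):
    return ALPHABET[(pos - 1) % 27]         # a remainder of 0 maps to the blank

def encrypt_poly(plaintext, key):
    # Ciphertext position = (plaintext position + key position) mod 27,
    # with the key repeated beneath the plaintext as often as necessary.
    plaintext = [c for c in plaintext.lower() if c in ALPHABET]
    key = [c for c in key.lower() if c in ALPHABET]
    out = []
    for i, ch in enumerate(plaintext):
        k = key[i % len(key)]
        out.append(letter((position(ch) + position(k)) % 27))
    return "".join(out)

For the first character of the example, ‘W’ (23) plus ‘s’ (19) gives 42, and 42 mod 27 = 15, so encrypt_poly("Well done", "safety") begins with the letter ‘o’.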

REVIEW QUESTIONS
1. What is database security? Explain the purpose and scope of database security.
2. What do you mean by threat in a database environment? List the potential threats that could
affect a database system.
3. List the types of database security issues.
4. Differentiate between authorization and authentication.
5. Discuss each of the following terms:

a. Database Authorization
b. Authentication
c. Audit Trail
d. Privileges
e. Data encryption
f. Firewall.

6. What is data encryption? How is it used in database security?


7. Discuss the two types of data encryption mechanisms used in database security.
8. Using the polyalphabetic substitution method and the encryption key, SECURITY, encrypt
the plaintext message ‘SELL ALL STOCKS’.
9. What is meant by granting and revoking privileges? Discuss the types of privileges at
account level and those at the table or relation level.
10. Suppose a user has to implement suitable authorization mechanisms for a relational database
system where owner of an object can grant access rights to other users with GRANT and
REVOKE options.

a. Give an outline of the data structure which can be used by the first user to
implement this scheme.
b. Explain how this scheme can keep track of the current access rights of the users.

11. Explain the use of an audit trail.


12. What is difference between discretionary and mandatory access control? Explain with an
example.
13. Explain the intuition behind the two rules in the Bell-LaPadula model for mandatory access
control.
14. If a DBMS already supports discretionary and mandatory access controls, is there a need for
data encryption?
15. What are the typical database security classifications?
16. Discuss the simple security property and the star-property.
17. What do you mean by firewall? What are the firewall techniques that are used in database
security? Discuss in brief.
18. What is statistical database? Discuss the problems of statistical database security.

STATE TRUE/FALSE

1. Database security encompasses hardware, software, network, people and data of the
organisation.
2. Threats are any situation or event, whether intentional or accidental, that may adversely
affect a system and consequently the organisation.
3. Authentication is a mechanism that determines whether a user is who he or she claims to be.
4. When a user is authenticated, he or she is verified as an authorized user of an application.
5. Authorization and authentication controls can be built into the software.
6. Privileges are granted to users at the discretion of other users.
7. A user automatically has all object privileges for the objects that are owned by him/her.
8. The REVOKE command is used to take away a privilege that was granted.
9. Encryption alone is sufficient for data security.
10. Discretionary access control (also called security scheme) is based on the concept of access
rights (also called privileges) and mechanism for giving users such privileges.
11. A firewall is a system designed to prevent unauthorized access to or from a private network.
12. Statistical database security system is used to control the access to a statistical database.
13. Data encryption is a method of coding or scrambling of data so that humans cannot read
them.

TICK (✓) THE APPROPRIATE ANSWER

1. A threat may be caused by a

a. a situation or event involving a person that are likely to harm the organisation.
b. an action that is likely to harm the organisation.
c. circumstances that are likely to harm the organisation.
d. All of these.

2. Loss of availability means that the

a. data cannot be accessed by the users.


b. system cannot be accessed by the users.
c. Both data and system cannot be used by the users.
d. None of these.

3. Which of the following is the permission to access a named object in a prescribed manner?

a. Role
b. Privilege
c. Permission
d. All of these.

4. Loss of data integrity means that the

a. data and system cannot be accessed by the users.


b. invalid and corrupted data has been generated.
c. loss of protecting or maintaining secrecy over critical data of the organisation.
d. loss of protecting data from individuals.

5. Which of the following is not a part of the database security?

a. Data
b. Hardware and Software
c. People
d. External hackers.

6. Discretionary access control (also called security scheme) is based on the concept of

a. access rights
b. system-wide policies
c. Both (a) and (b)
d. None of these.

7. Mandatory access control (also called security scheme) is based on the concept of

a. access rights
b. system-wide policies
c. Both (a) and (b)
d. None of these.

8. Loss of confidentiality means that the

a. data and system cannot be accessed by the users.


b. invalid and corrupted data has been generated.
c. loss of protecting or maintaining secrecy over critical data of the organisation.
d. loss of protecting data from individuals.

9. Loss of privacy means that the

a. data and system cannot be accessed by the users.


b. invalid and corrupted data has been generated.
c. loss of protecting or maintaining secrecy over critical data of the organisation.
d. loss of protecting data from individuals.

10. Which of the following is the process by which a user’s identity is checked?

a. Authorization
b. Authentication
c. Access Control
d. None of these.

11. Legal and ethical issues are related to the

a. rights to access of an individual user, or user groups to access certain information.


b. enforcement of various security functions at system levels, for example, at physical
hardware level, at the DBMS level or at the operating system level.
c. enforcement of the security policy of the organisation with respect to permitting
access to various classifications of data.
d. None of these.

12. Which of the following is the process by which a user’s privileges are ascertained?

a. Authorization
b. Authentication
c. Access Control
d. None of these.

13. System-related issues are related to the

a. rights to access of an individual user or user groups to access certain information.


b. enforcement of various security functions at system levels, for example, at physical
hardware level, at the DBMS level or at the operating system level.
c. enforcement of the security policy of the organisation with respect to permitting
access to various classifications of data.
d. None of these.

14. Which of the following is the process by which a user’s access to physical data in the
application is limited, based on his privileges?

a. Authorization
b. Authentication
c. Access Control
d. None of these.

15. Organisational level issues are related to the:

a. rights to access of an individual user, or user groups to access certain information


b. enforcement of various security functions at system levels, for example, at physical
hardware level, at the DBMS level or at the operating system level
c. enforcement of the security policy of the organisation with respect to permitting
access to various classifications of data
d. None of these.

16. Which of the following is a database privilege?

a. The right to create a table or relation.


b. The right to select rows from another user’s table.
c. The right to create a session.
d. All of these.

FILL IN THE BLANKS

1. The goal of database security is the _____ of data against _____.


2. _____ is the protection of the database against intentional and unintentional threats.
3. Loss of availability can arise due to (a) _____ and (b) _____.
4. Loss of data integrity causes _____ or _____ data, which may seriously affect the operation
of an organisation.
5. _____ is a privilege or right to perform a particular action.
6. System privileges are granted to or revoked from users using the commands _____ and
_____.
7. _____ is the process of granting a right or privilege to the user(s) to have legitimate access to a
system or objects (database table) of the system.
8. _____ is the technique of encoding data so that only authorized users can understand it.
9. _____ is a mechanism that determines whether a user is who he or she claims to be.
10. Data encryption is a method of _____ of data so that humans cannot read them.
11. Two of the most popular encryption standards are (a) _____ and (b) _____.
12. _____ is responsible for the overall security of the database system.
13. Discretionary access control is based on the concept of _____ and mechanism for giving
users such privileges.
14. Mandatory access control is based on _____ that cannot be changed by individual users.
15. The commonly used mandatory access control technique for multi-level security is known as
the _____ model.
16. _____ is a system designed to prevent unauthorized access to or from a private network.
17. _____ system is used to control the access to a statistical database.
Part-V

OBJECT-BASED DATABASES
Chapter 15

Object-Oriented Databases

15.1 INTRODUCTION

An object-oriented approach to the development of software was first


proposed in the late 1960s. However, it took almost 20 years for object
technologies to become widely used. Object-oriented methods gained
popularity during the 1980s and 1990s. Throughout the 1990s, the object-oriented concept became the paradigm of choice for many software product builders and a growing number of information systems and engineering professionals. Since then, object technologies have slowly replaced classical software development and database design approaches.
Object-oriented database (OODB) systems are usually associated with
applications that draw their strength from intuitive graphical user interfaces
(GUIs), powerful modelling techniques and advanced data management
capabilities. In this chapter, we will discuss key concepts of object-oriented
databases (OODBs). We will also discuss the object-oriented DBMSs
(OODBMSs) and object-oriented languages used in OODBMSs.

15.2 OBJECT-ORIENTED DATA MODEL (OODM)

As we discussed in the earlier chapters, the relational data model was first proposed by Dr. E.F. Codd in his seminal paper, which addressed the disadvantages of legacy database approaches such as hierarchical and network (CODASYL) databases. Since then, more than a hundred commercial relational DBMSs have been developed and put to use in both mainframe and PC environments. However, RDBMSs have their own disadvantages, particularly limited modelling capabilities. Various data models were therefore developed and implemented for database design that represent the 'real world' more closely. Fig. 15.1 shows the history of data models.
Each new data model addressed the shortcomings of its predecessors. The hierarchical model was superseded by the network model because the latter made it much easier to represent complex (many-to-many) relationships. In turn, the relational model offered several advantages over the hierarchical and network models through its simpler data representation, superior data independence and relatively easy-to-use query language. Thereafter, the entity-relationship (E-R) model was introduced by Chen for an easy-to-use graphical data representation, and it became the database design standard. As more intricate real-world problems were modelled, a need arose for a data model that represents the real world more closely. The Semantic Data Model (SDM) was therefore developed by M. Hammer and D. McLeod to capture more meaning from real-world objects. SDM incorporated more semantics into the data model and introduced concepts such as class, inheritance and so forth, which helped to model real-world objects more naturally. In response to the increasing complexity of database applications, the following two new data models emerged:
Object-oriented data model (OODM).
Object-relational data model (ORDM), also called extended-relational data model (ERDM).
Fig. 15.1 History of evolution of data model

Object-oriented data models (OODMs) and object-relational data models


(ORDMs) represent third-generation DBMSs. Object-oriented data models (OODMs) are logical data models that capture the semantics of objects supported in object-oriented programming.
models directly and can represent complexities that are beyond the
capabilities of relational systems. OODBs have adopted many of the
concepts that were developed originally for object-oriented programming
languages (OOPLs). Objects in an OOPL exist only during program
execution and are hence called transient objects. Fig. 15.2 shows the origins
of OODM drawn from different areas.
An object-oriented database (OODB) is a persistent and sharable
collection of objects defined by an OODM. An OODB can extend the
existence of objects so that they are stored permanently. Hence, the objects
persist beyond program termination and can be retrieved later and shared by
other programs. In other words, OODBs store persistent objects permanently
on secondary storage (disks) and allow the sharing of these objects among
multiple programs and applications. An OODB system interfaces with one
or more OOPLs to provide persistent and shared object capabilities.

15.2.1 Characteristics of Object-oriented Databases (OODBs)


Maintain a direct correspondence between real-world and database objects so that objects do
not lose their integrity and identity.

Fig. 15.2 Origins of OODM

OODBs provide a unique system-generated object identifier (OID) for each object so that an
object can easily be identified and operated upon. This is in contrast with the relational model
where each relation must have a primary key attribute whose value identifies each tuple
uniquely.
OODBs are extensible, that is, capable of defining new data types as well as the operations to
be performed on them.
Support encapsulation, that is, the data representation and the method implementations are
hidden from external entities.
Exhibit inheritance, that is, an object inherits the properties of other objects.

15.2.2 Comparison of an OODM and E-R Model


A comparison between an object-oriented data model (OODM) and an entity-relationship (E-R) model is shown in Table 15.1.
The main difference between an OODM and conceptual data modelling (CDM), which is based on entity-relationship (E-R) modelling, is the encapsulation of both state and behaviour in an object in the OODM, whereas CDM captures only state and has no knowledge of behaviour. CDM has no concept of messages and consequently no provision for encapsulation.

Table 15.1 Comparison between OODM and E-R data model

SN   OO Data Model                     E-R Data Model
1.   Type                              Entity definition
2.   Object                            Entity
3.   Class                             Entity set/super type
4.   Instance Variable                 Attribute
5.   Object identifier (OID)           No corresponding concept
6.   Method (message or operations)    No corresponding concept
7.   Class structure (or hierarchy)    E-R diagram
8.   Inheritance                       Entity
9.   Encapsulation                     No corresponding concept
10.  Association                       Relationship

15.3 CONCEPT OF OBJECT-ORIENTED DATABASE (OODB)


Object orientation is a set of design and development principles based on
conceptually autonomous computer structures known as objects. Each object
represents a real-world entity with the ability to interact with itself and with
other objects. We live in a world of objects. These objects exist in nature, in
human-made entities, in business and in the products that we use. They can
be categorised, described, organised, combined, manipulated and created.
Therefore, an object-oriented view enables us to model the world in ways
that help us better understand and navigate it.
In an object-oriented (OO) system, the problem domain is characterised as
a set of objects that have specific attributes and behaviours. The objects are
manipulated with a collection of functions (called methods, operations or
services) and communicate with one another through a messaging protocol.
Objects are categorised into classes and subclasses. OO technologies lead to
reuse, and reuse leads to faster software development and design.
Object-oriented concepts stem from object-oriented programming
languages (OOPLs), which were developed as an alternative to traditional
programming methods. OO concepts first appeared in programming
languages such as Ada, Algol, LISP and SIMULA. Later on, Smalltalk and
C++ became dominant object- oriented programming languages (OOPLs).
Today OO concepts are applied in the areas of databases, software
engineering, knowledge bases, artificial intelligence and computer systems
in general.

15.3.1 Objects
An object is an abstract representation of a real-world entity that has a
unique identity, embedded properties and the ability to interact with other
objects and itself. It is a uniquely identified entity that contains both the
attributes that describe the state of a real-world object and the actions that
are associated with it. An object may have a name, a set of attributes and a
set of actions or services. An object may stand alone or it may belong to a
class of similar objects. Thus, the definition of objects encompasses a
description of attributes, behaviours, identity, operations and messages. An
object encapsulates both data and the processing that is applied to the data.
A typical object has two components: (a) state (value) and (b) behaviour
(operations). Hence, it is somewhat similar to a program variable in a
programming language, except that it will typically have a complex data
structure as well as specific operations defined by the programmer. Fig. 15.3
illustrates the examples of objects. Each object is represented by a rectangle.
The first item in the rectangle is the name of the object. The name of the
object is separated from the object attributes by a straight line. An object
may have zero or more attributes. Each attribute has its own name, value and
specifications. The list of attributes is followed by a list of services or
actions. Each service has a name associated with it and eventually will be
translated to executable program (machine) code. Services or actions are
separated from the list of attributes by a horizontal line.

Fig. 15.3 Examples of objects

15.3.2 Object Identity


An object has a unique identity, which is represented by an Object Identifier
(OID). The OODB system provides a unique OID to each independent object stored in the database. No two objects can share the same OID. The OID is assigned by the system and does not depend on the object's attribute values. The value of an OID is not visible to the external user, but it is used internally by the system to identify each object uniquely and to create and manage inter-object references. The OID has the following characteristics:
It is system-generated.
It is unique to the object.
It cannot be changed.
It can never be altered during its lifetime.
It is deleted only if the object itself is deleted.
It can never be reused.
It is independent of the values of its attributes.
It is invisible to the user.

The OID should not be confused with the primary key of a relational database. In contrast to the OID, a primary key in a relational database is a user-defined value of selected attributes and can be changed at any time.
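These properties of an OID can be pictured with a small, purely illustrative C++ sketch. The class names and the simple counter used below are assumptions made only for illustration; a real OODBMS generates and manages OIDs internally and transparently.

#include <cstdint>
#include <string>

// Purely illustrative sketch: a real OODBMS generates OIDs internally.
class OID {
public:
    OID() : value_(next_++) {}                      // system-generated, not chosen by the user
    std::uint64_t value() const { return value_; }  // read-only; there is no way to change it
private:
    std::uint64_t value_;                 // fixed for the object's lifetime
    static std::uint64_t next_;           // simple counter standing in for the system generator
};
std::uint64_t OID::next_ = 1;

class Student {
public:
    explicit Student(std::string name) : name_(std::move(name)) {}
    const OID& oid() const { return oid_; }            // identity, independent of attribute values
    void rename(const std::string& n) { name_ = n; }   // attributes may change; the OID does not
private:
    OID oid_;             // assigned once; never reused or altered
    std::string name_;
};

In this sketch the attribute name_ plays the role that a user-defined primary key would play in a relational database: it can be changed at any time, whereas the OID cannot.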

15.3.3 Object Attributes


In an OO environment, objects are described by their attributes, known as
instance variables. Each attribute has a unique name and a data type
associated with it. For example, the object ‘Student’ has attributes such as
name, age, city, subject and sex, as shown in Fig. 15.3. Similarly, the object 'Knife' has attributes such as price, size, model, sharpness and so on.
Fig. 15.4 shows attributes of objects ‘Student’ and ‘Chair’. Traditional
data types (also known as base types) such as real, integer, string and so on
can also be used. Attributes also have a domain. The domain logically groups
and describes the set of all possible values that an attribute can have. For
example, the GPA’s possible values, as shown in Fig. 15.4, can be
represented by the real number base data type.
Fig. 15.4 Attributes of objects 'Student' and 'Chair'

Examples of Objects

The data structures for an OO database schema can be defined using type
constructors. Fig. 15.5 illustrates how the objects ‘Student’ and ‘Chair’ of
Fig. 15.4 can be declared corresponding to the object instances.

Fig. 15.5 Specifying object types ‘Student’ and ‘Chair’
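A rough C++ counterpart of such a type declaration for the object 'Student' is sketched below. The attribute names follow the text above; the concrete C++ types chosen (for example, double for GPA) are assumptions made only for illustration.

#include <string>

// Rough C++ counterpart of the tuple-of-attributes declaration of 'Student'
// sketched in Fig. 15.5 (the concrete types are assumptions).
struct Student {
    std::string name;
    int         age;        // assumed integer
    std::string city;
    std::string subject;
    char        sex;        // assumed single-character code
    double      gpa;        // domain: real numbers, as noted for GPA in Fig. 15.4
};
// The object 'Chair' of Fig. 15.4 could be declared in the same way with its own attribute list.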

15.3.4 Classes
A class is a collection of similar objects with shared structure (attributes) and
behaviour (methods). It contains the description of the data structure and the
method implementation details for the objects in that class. Therefore, all
objects in a class share the same structure and respond to the same messages.
In addition, a class acts as a storage bin for similar objects. Thus, a class has
a class name, a set of attributes and a set of services or actions. Each object
in a class is known as a class instance or object instance. There are two
implicit service or action functions defined for each class, namely GET<attribute> and PUT<attribute>. The GET function returns the value of the attribute associated with it, and the PUT function assigns a computed value to the attribute.
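The implicit GET<attribute> and PUT<attribute> services correspond roughly to accessor ('get') and mutator ('put') functions in an object-oriented programming language. A minimal C++ sketch is shown below; the attribute major_subject is assumed purely for illustration.

#include <string>

// Minimal sketch: the attribute is reachable only through GET/PUT-style services.
class Student {
public:
    // GET<attribute>: returns the value of the attribute associated with it.
    std::string get_major_subject() const { return major_subject_; }
    // PUT<attribute>: assigns a computed value to the attribute.
    void put_major_subject(const std::string& value) { major_subject_ = value; }
private:
    std::string major_subject_;   // hidden state, visible only through the services above
};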
Fig. 15.6 illustrates example of a class ‘Furniture’ with two instances. The
‘Chair’ is a member (or instance) of a class ‘Furniture’. A set of generic
attributes can be associated with every object in the class ‘Furniture’, for
example, price, dimension, weight, location and colour. Because ‘Chair’ is a
member of ‘Furniture’, ‘Chair’ inherits all attributes defined for the class.
Once the class has been defined, the attributes can be reused when new
instances of the class are created. For example, assume that a new object
called ‘Table’ has been defined that is a member of the class ‘Furniture’, as
shown in Fig. 15.6. ‘Table’ inherits all of the attributes of ‘Furniture’. The
services associated with the class 'Furniture' are buy (purchase the furniture
object), sell (sell the furniture object) and move (move the furniture object
from one place to another).
Fig. 15.6 Example of Class ‘Furniture’

Fig. 15.7 illustrates another example of a class ‘Student’ with three


instances, namely 'Abhishek', 'Avinash' and 'Alka'. Each instance is an
object belonging to the class ‘Student’. Each instance of ‘Student’ is an
object which has all of the attributes and services of ‘Student’ class. The
services associated with this class are Store (write the student object to a
file), Print (print out the object attributes) and Update (replace the value of
one or more attributes with a new value).
Fig. 15.7 Examples of class ‘Student’

Fig. 15.8 illustrates another example of a class ‘Employee’ with two


instances, namely 'Jon' and 'Singh'. Each instance is an object belonging to the class 'Employee'. Each instance of 'Employee' is an object which has all of the attributes and services of the 'Employee' class. The services associated with this class are Print (print out the object attributes) and Update (replace the value of one or more attributes with a new value). To distinguish an object from the class, a class is represented with a double-lined rectangle and an object with a single-lined rectangle. The double line may be interpreted as more than one object being present in the class.
Fig. 15.8 Example of class 'Employee'

Examples of Classes

Fig. 15.9 illustrates how the type definitions of the objects of Fig. 15.5 may be extended with operations (services or actions) to define the classes of Fig. 15.7.

Fig. 15.9 Specifying class 'Student' with operations

Fig. 15.10 illustrates an example of a class ‘Employee’ implemented with


the C++ programming language.
15.3.5 Relationship or Association among Objects
Two objects, either of the same class or of two different classes, may be associated with each other by a given relationship. For example, in the 'Student' class we
may define the association ‘Same-Major Subject’ to associate particular
objects of this class with each other. Thus, two objects are associated if and
only if their ‘Major Subject’ attributes have the same value. In other words,
two students of the same ‘Major Subject’ are related to or associated with
each other.
An association between classes or objects may be a one-to-one (1:1), one-to-many (1:N), many-to-one (N:1) or many-to-many (N:M) association. The
graphic representation of association among objects is shown by a line
connecting the associated classes or objects together. For associations that
are not one-to-one (1:1), a multiplicity number (or range of numbers) may be
introduced, written below the association line, to indicate the association
multiplicity number. When the name of the relationship is not clear from the
context or when there is more than one association between objects, the
name of the association may be written on top of the association line. For associations whose inverse has a different name, the two names may be written
on either side of the association line.
Fig. 15.11 illustrates the relationships or associations among classes. Fig.
15.11 (a) shows the relationship between classes ‘Student’ and ‘Course’. In
this case, the relationship between the two classes is called 'Enrolled' and is a one-to-many (1:N) association, which means that a student may be enrolled in zero or more courses. In Fig. 15.11 (b), the association between classes 'Student' and 'Advisor' is called 'Advisee-of' and is a many-to-one (N:1)
association. The association between classes ‘Advisor’ and ‘Student’ is
called ‘Advisor-of’ and is a one-to-many (1:N) association. In Fig. 15.11 (c),
the association between classes ‘Student’ and ‘Sport_Team’ is called
‘Member-of’ and the association between classes ‘Sport_team’ and
‘Student’ is called ‘Team’. Both of these associations are many-to-many
(N:M).
Fig. 15.10 Implementation of class ‘Employee’ using C++

Fig. 15.11 Association among classes


(a)

(b)

(c)

The above associations are called binary associations because they involve only two classes. However, associations need not be binary; they may involve more than two classes, in which case they are called n-ary associations.
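As an illustration, the one-to-many 'Enrolled' association of Fig. 15.11 (a) can be sketched in C++ by letting a 'Student' object hold references to zero or more associated 'Course' objects. The class and member names below are assumptions made for illustration only.

#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch of the 1:N 'Enrolled' association of Fig. 15.11 (a):
// a student may be enrolled in zero or more courses.
class Course {
public:
    explicit Course(std::string title) : title_(std::move(title)) {}
    const std::string& title() const { return title_; }
private:
    std::string title_;
};

class Student {
public:
    void enrol(Course* c) { enrolled_.push_back(c); }                // add one association link
    std::size_t courses_enrolled() const { return enrolled_.size(); }
private:
    std::vector<Course*> enrolled_;   // zero or more associated Course objects
};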

15.3.6 Structure, Inheritance and Generalisation

15.3.6.1 Structure
Structure is basically the association of class and its objects. Let us consider
the following classes:
a. Person
b. Student
c. Employee
d. Graduate
e. Undergraduate
f. Administration
g. Staff
h. Faculty

From the above classes following observations can be made:


Graduate and Undergraduate are each a subclass of the 'Student' class.
Administration, Staff and Faculty are each a subclass of the 'Employee' class.
Student and Employee are each a subclass of the 'Person' class.

Fig. 15.12 illustrates the relationships between classes and subclasses.


Now from this figure, it can also be observed that
The 'Person' class is a superclass of the 'Student' and 'Employee' classes.
The 'Employee' class is a superclass of the 'Faculty', 'Administration' and 'Staff' subclasses.
The 'Student' class is a superclass of the 'Graduate' and 'Undergraduate' classes.

Association lines between a superclass and its subclass all originate from
a half circle that is connected to the superclass, as shown in Fig. 15.12. The
relationship between a superclass and its subclass is known as
generalisation.
Fig. 15.12 Subclass and superclass structure

Assembly Structure

An assembly structure (or relationship) is used to identify the parts of an


object or a class. An assembly structure is also called a Whole-Part Structure
(WPS). Fig. 15.13 shows a diagram of an assembly structure of an object
called ‘My_Desk’. As shown in the diagram, a desk has several parts namely
Top, Side, Drawer and Wheel. The desk has a single top, three sides (two sides and a back panel), five drawers and 0, 4, 6 or 8 wheels. For the assembly relationship, as for a general relationship, we can write the multiplicity of the parts next to the association lines, as shown in Fig.
15.13.
Fig. 15.13 Assembly structure of desk and its parts

Combined Structure

There may be a situation in which an object or a class is a subclass of another class and also has an assembly relationship with its own parts. Figure 15.14 shows such a structure, in which a superclass called 'Furniture' has the subclasses 'Chair', 'Desk' and 'Sofa'. As we can see in Fig. 15.13, the subclass 'Desk' has an assembly relationship with its parts 'Top', 'Side', 'Drawer' and 'Wheel'.
As is shown in Fig. 15.14, we may combine the two relationships
superclass-subclass and assembly structure on the same diagram.

15.3.6.2 Inheritance
Inheritance is copying the attributes of the superclass into all of its subclasses.
It is the ability of an object within the structure (or hierarchy) to inherit the
data structure and behaviour (methods) of the classes above it. For example,
as shown in Fig. 15.12, class ‘Graduate’ inherits its data structure and
behaviour from the superclasses ‘Student’ and ‘Person’. Similarly, class
‘Staff’ inherits its data structure and behaviour from the superclasses
‘Employee’, ‘Person’ and so on. The inheritance of data and methods goes
from top to bottom in the class hierarchy. There are two types of inheritance (a C++ sketch of both cases is given after the list):
a. Single inheritance: Single inheritance exists when a class has only one immediate (parent) superclass above it. An example of single inheritance is the classes 'Student' and 'Employee' each inheriting from the immediate superclass 'Person'.
b. Multiple inheritance: Multiple inheritance exists when a class is derived from several parent superclasses immediately above it.
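A minimal C++ sketch of both kinds of inheritance follows. The member attributes, and the class 'TeachingAssistant' used to show multiple inheritance, are assumptions made for illustration and are not taken from Fig. 15.12.

#include <string>

// Single inheritance: 'Student' and 'Employee' each have exactly one
// immediate (parent) superclass, 'Person'.
class Person {
public:
    std::string name;            // inherited by every subclass
};

class Student : public Person {
public:
    std::string major_subject;
};

class Employee : public Person {
public:
    double salary = 0.0;
};

// Multiple inheritance: a class derived from several parent superclasses
// immediately above it ('TeachingAssistant' is a hypothetical example).
class TeachingAssistant : public Student, public Employee {
    // Inherits the data structure and behaviour of both parents. Note that
    // 'Person' is reached along two paths; C++ would need virtual inheritance
    // if a single shared 'Person' sub-object were wanted.
};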
Fig. 15.14 Combined structure

15.3.7 Operation
An operation is a function or a service that is provided by all the instances of
a class. It is only through such operations that other objects can access or
manipulate the information stored in an object. The operation, therefore,
provides an external interface to a class. The interface presents the outside
view of the class without showing its internal structure or how its operations
are implemented. The operations can be classified into the following four types (a C++ sketch follows the list):
a. Constructor operation: It creates a new instance of a class.
b. Query operation: It accesses the state of an object but does not alter the state. It has no side
effects.
c. Update operation: This operation alters the state of an object. It has side effects.
d. Scope operation: This operation applies to a class rather than an object instance.
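The four types of operation can be sketched in C++ as follows, using a simplified 'Employee' class. The attribute and the static instance counter are assumptions made only for illustration.

#include <iostream>
#include <string>

// Sketch of the four operation types on a simplified 'Employee' class.
class Employee {
public:
    // Constructor operation: creates a new instance of the class.
    explicit Employee(std::string name) : name_(std::move(name)) { ++count_; }

    // Query operation: reads the state but does not alter it (no side effects).
    void print() const { std::cout << name_ << '\n'; }

    // Update operation: alters the state of the object (has side effects).
    void update_name(const std::string& new_name) { name_ = new_name; }

    // Scope operation: applies to the class as a whole, not to one instance.
    static int instance_count() { return count_; }

private:
    std::string name_;
    static int  count_;
};
int Employee::count_ = 0;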

15.3.8 Polymorphism
Object-oriented systems provide for polymorphism of operations. The
polymorphism is also sometimes referred to as operator overloading. The polymorphism concept allows the same operator name or symbol to be bound to two or more different implementations of the operator, depending on the type of objects to which the operator is applied.
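A short C++ sketch of operator overloading is given below: the same operator symbol '+' is bound to different implementations depending on the types of the objects to which it is applied. The types 'Money' and 'Path' are assumptions made for illustration.

#include <string>

// The same operator symbol '+' bound to different implementations.
struct Money {
    double amount;
};
inline Money operator+(const Money& a, const Money& b) {
    return Money{a.amount + b.amount};        // '+' on Money adds the amounts
}

struct Path {
    std::string value;
};
inline Path operator+(const Path& a, const Path& b) {
    return Path{a.value + "/" + b.value};     // '+' on Path joins the components
}
// The built-in '+' on int or double is yet another implementation of the same
// symbol; the compiler selects one according to the operand types.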

15.3.9 Advantages of OO Concept


The OO concepts have been widely applied to many computer-based
disciplines, especially those involving complex programming and design
problems. Table 15.2 summarises the advantages of OO concepts to many
computer-based disciplines.

Table 15.2 OO advantages


SN  Computer-based Discipline        OO advantages
1.  Programming languages            Easier to maintain; reduces development time; enhances code reusability; reduces the number of lines of code; enhances programming productivity.
2.  Graphical User Interface (GUI)   Improves system user-friendliness; enhances the ability to create easy-to-use interfaces; makes it easier to define standards.
3.  Design                           Better representation of the real-world situation; captures more of the data model's semantics.
4.  Operating System                 Enhances system portability; improves system interoperability.
5.  Databases                        Supports complex objects; supports abstract data types; supports multimedia databases.

15.4 OBJECT-ORIENTED DBMS (OODBMS)

Object-oriented database management system (OODBMS) is the manager of


an OODB. Many OODBMSs use a subset of the OODM features. Therefore,
those who create an OODBMS tend to select the OO features that best serve the OODBMS's purpose, such as support for early or late binding of data types and methods and support for single or multiple inheritance.
OODBMSs have been implemented in research and commercial
applications. Each one has a different set of features. Table 15.3 shows some
of the OODBMSs developed by various vendors.

Table 15.3 Summary of commercial OODBMSs

SN   OODBMS name      Vendor/Inventor
1.   GemStone         Gemstone System Inc.
2.   Itasca           IBEX Knowledge System SA
3.   Objectivity/DB   Objectivity Inc.
4.   ObjectStore      eXcelon Corporation
5.   Ontos            Ontos Inc.
6.   Poet             Poet Software Corporation
7.   Jasmine          Computer Associates
8.   Versant          Versant Corporation
9.   Vbase            Andrews and Harris
10.  Orion            MCC
11.  PDM              Manola and Dayal
12.  IRIS             Hewlett-Packard
13.  O2               Leeluse

15.4.1 Features of OODBMSs


Object-oriented (OO) Features:
Must support complex objects.
Must support object identity.
Must support encapsulation.
Must support types or classes.
Types or classes must be able to inherit from their ancestors.
Must support dynamic binding.
The data manipulation language (DML) must be computationally complete.
The set of data types must be extensible.

General DBMS Features:


Data persistence must be provided, that is, the system must be able to remember data locations.
Must be capable of managing very large databases.
Must support concurrent users.
Must be capable of recovery from hardware and software failures.
Data query must be simple.

15.4.2 Advantages of OODBMSs


Enriched modelling capabilities: It allows the real-world to be modelled more closely.
Reduced redundancy and increased extensibility: It allows new data types to be built from
existing types. It has the ability to factor out common properties of several classes and form them into superclasses that can be shared with subclasses.
Removal of impedance mismatch: It provides single language interface between the data
manipulation language (DML) and the programming language.
More expressive query language: It provides navigational access from one object to the next
for data access in contrast to the associative access of SQL.
Support for schema evolution: In OODBMS, schema evolution is more feasible.
Generalisation and inheritance allow the schema to be better structured, to be more intuitive
and to capture more of the semantics of the application.
Support for long-duration transactions: OODBMSs use different protocols to handle long-duration transactions, in contrast to RDBMSs, which enforce serialisability on concurrent transactions to maintain database consistency.
Applicability to advanced database applications: The enriched modelling capabilities of
OODBMSs have made them suitable for advanced database applications such as computer-aided design (CAD), office information systems (OIS), computer-aided software engineering (CASE), multimedia systems and so on.
Improved performance: It improves the overall performance of the DBMS.

15.4.3 Disadvantages of OODBMSs


Lack of universal data model.
Lack of experience.
Lack of standards.
Competition posed by RDBMS and the emerging ORDBMS products.
Query optimization compromises encapsulation.
Locking at object level may impact performance.
Complexity due to increased functionality provided by an OODBMS.
Lack of support for views.
Lack of support for security.

15.5 OBJECT DATA MANAGEMENT GROUP (ODMG) AND OBJECT-ORIENTED


LANGUAGES
Object-oriented languages are used to create an object database schema. As
we have discussed in the earlier chapters, SQL is used as a standard
language in relational DBMSs. In the case of object-oriented DBMSs (OODBMSs), a consortium of OODBMS vendors called the Object Data Management Group (ODMG) was formed. The ODMG included several important vendors such as Sun Microsystems, eXcelon Corporation, Objectivity Inc., POET Software, Computer Associates, Versant Corporation and so on. The ODMG has been working on standardising language extensions to C++ and SmallTalk to support persistence and on defining class libraries that support persistence. The ODMG standard for object-oriented programming languages (OOPLs) is made up of the following parts:
i. Object model.
ii. Object definition language (ODL).
iii. Object query language (OQL).
iv. Language bindings.

The bindings have been specified for three OOPLs namely, (a) C++, (b)
SmallTalk and (c) JAVA. Some vendors offer specific language bindings,
without offering the full capabilities of ODL and OQL.
The ODMG proposed a standard known as the ODMG-93 or ODMG 1.0 standard, released in 1993. This was later revised into ODMG 2.0 in 1997.
In late 1999, ODMG 3.0 was released that included a number of
enhancements to the object model and to the JAVA language binding.

15.5.1 Object Model


The ODMG object model is a superset of the OMG (Object Management Group) object model, which enables both
designs and implementations to be ported between compliant systems. It is
the data model upon which the object definition language (ODL) and object
query language (OQL) are based. The object model provides the data type,
type constructors and other concepts that can be utilised in the ODL to
specify object database schemas. Hence, an object model provides a standard
data model for object-oriented databases (OODBs), just as the SQL report
describes a standard data model for relational databases (RDBs). An object
model also provides a standard terminology in a field where the same terms
are sometimes used to describe different concepts.

15.5.2 Object Definition Language (ODL)


The object definition language (ODL) is designed to support the semantic
constructs of the ODMG object model. It is equivalent to the Data Definition
Language (DDL) of traditional RDBMSs. ODL is independent of any
particular programming language. Its main objective is to facilitate portability of schemas between compliant systems. It creates object specifications, that is, classes and interfaces. It defines the attributes and relationships of types and specifies the signatures of the operations. It does not address the implementation of signatures. It is not a full programming
language. A user can first specify a database schema in ODL independently
of any programming language. Then the user can use the specific
programming language bindings to specify how ODL constructs can be
mapped to constructs in specific programming languages, such as C++,
SmallTalk and JAVA. The syntax of ODL extends the Interface Definition
Language (IDL) of CORBA.
Let us consider the example of the EER diagram of the technical university database as shown in Fig. 7.10 of chapter 7, section 7.5. Fig. 15.15 shows a
possible object schema for part of the university database.
Fig. 15.16 illustrates one possible set of ODMG C++ ODL class
definitions for the university database. There can be several possible
mappings from an object schema diagram or EER schema diagram into ODL
classes. Entity types are mapped into ODL classes. The classes ‘Person’,
‘Faculty’, ‘Student’ and ‘GradStudent’ have the extents persons, faculty,
students and grad_students, respectively. Both the classes 'Faculty' and 'Student' EXTENDS 'Person'. The class 'GradStudent' EXTENDS 'Student'. Hence, the collection of students and the collection of faculty will be constrained to be subsets of the collection of persons at any point of time. Similarly, the collection of grad_students will be a subset of the collection of students.
At the same time, individual ‘Student’ and ‘Faculty’ objects will inherit the
properties (attributes and relationships) and operations of ‘Person’ and
individual ‘GradStudent’ objects will inherit those of ‘Student’.
The classes ‘Department’, ‘Course’, ‘Section’ and ‘CurrSection’ in Fig.
15.16 are straightforward mappings of the corresponding entity types in Fig.
15.15. However, the class ‘Grade’ requires some explanation. As shown in
Fig. 15.15, class ‘Grade’ corresponds to (N:M) relationship between
‘Student’ and ‘Section’. It was made into separate class because it includes
the relationship attribute grade. Hence, the (N:M) relationship is mapped to
the class ‘Grade’ and a pair of 1:N relationships, one between Student and
Grade and the other between ‘Section’ and ‘Grade’. These two relationships
are represented by the relationship properties namely completed-sections of
‘Student’; section and student of ‘Grade’; and students of ‘Section’. Finally,
the class ‘Degree’ is used to represent the composite, multi-valued attribute
degrees of ‘GradStudent’.
Fig. 15.15 Structure of university database with relationship

Fig. 15.16 ODMG C++ ODL schema for university database
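The inheritance structure that the EXTENDS clauses express can be pictured with the plain C++ sketch below. This is only a structural illustration, not the ODMG C++ ODL schema of Fig. 15.16; the attributes shown are assumptions, and the subset constraint on the extents has no direct counterpart in ordinary C++.

#include <string>
#include <vector>

// Structural sketch of the class hierarchy described above
// (not the actual ODMG C++ ODL schema of Fig. 15.16).
class Person {
public:
    std::string name;
};

class Faculty : public Person {       // Faculty EXTENDS Person
public:
    std::string rank;
};

class Student : public Person {       // Student EXTENDS Person
public:
    double gpa = 0.0;
};

class GradStudent : public Student {  // GradStudent EXTENDS Student
public:
    std::vector<std::string> degrees; // stands in for the composite attribute 'degrees'
};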


15.5.3 Object Query Language (OQL)
The object query language (OQL) is the query language proposed for the
ODMG object model. It is designed to work closely with the programming
languages for which an ODMG binding is defined, such as, C++, SmallTalk
and JAVA. An OQL query is embedded into these programming languages.
The query returns objects that match the type system of these languages.
The OQL provides declarative access to the object database. The OQL
syntax for query is similar to the syntax of the relational standard query
language SQL. OQL syntax has additional features for ODMG concepts,
such as, object identity, complex objects, operations, inheritance,
polymorphism and relationships. Basic OQL syntax is similar to that of SQL
structure as explained in chapter 5, section 5.5.7 and the format for SELECT
clause is given as:

SELECT [ALL | DISTINCT] ‹ column-name / expression›


FROM ‹ table(s)-name ›
WHERE ‹ conditional expression ›
GROUP BY ‹ attribute 1: expression 1, attribute 2: expression 2, ›
ORDER BY ‹ column(s)-name / expression›

Fig. 15.17 illustrates a few query statements in OQL, with reference to Fig. 15.15, and their corresponding results.

Fig. 15.17 Query examples in OQL


REVIEW QUESTIONS
1. What do you understand by an object-oriented (OO) method? What are its advantages?
2. What are the origins of the object-oriented approach? Discuss the evolution and history of
object-oriented concepts with a neat sketch.
3. What is object-oriented data model (OODM)? What are its characteristics?
4. Define and describe the following terms:
a. Object
b. Attributes
c. Object identifier
d. Class.

5. With an example, differentiate between object, object identity and object attributes.
6. What is OID? What are its advantages and disadvantages?
7. Explain how the concept of OID in OO model differs from the concept of tuple equality in
the relational model.
8. Using an example, illustrate the concepts of class and class instances.
9. Discuss the implementation of class using C++ programming language.
10. Define the concepts of class structure (or hierarchy), superclasses and subclasses.
11. What is the relationship between a subclass and superclass in a class structure?
12. What do you mean by an operation in OODM? What are its types? Explain.
13. Discuss the concept of polymorphism or operator overloading.
14. Compare and contrast the OODM with the E-R and relational models.
15. A car-rental company maintains a vehicle database for all vehicles in its current fleet. For all
vehicles, it includes the vehicle identification number, license number, manufacturer, model,
date of purchase and colour. Special data are included for certain types of vehicles:

a. Trucks: cargo capacity.


b. Sports cars: horsepower, renter age requirement.
c. Vans: number of passengers.
d. Off-road vehicles: ground clearance, drivetrain (four-or two-wheel drive).

Construct an object-oriented database schema definition for this database. Use


inheritance where appropriate.

16. List the features of OODBMS.


17. List the advantages and disadvantages of OODBMSs.
18. Discuss the main concepts of the ODMG object model.
19. What is the function of the ODMG object definition language?
20. Discuss the functions of object definition language (ODL) and object query language (OQL)
in object- oriented databases.
21. List the advantages and disadvantages of OODBMS.
22. Using ODMG C++, give schema definitions corresponding to the E-R diagram in Fig. 6.22,
using references to implement relationships.
23. Using ODMG C++, give schema definitions corresponding to the E-R diagram in Fig. 6.23,
using references to implement relationships.
24. Using ODMG C++, give schema definitions corresponding to the E-R diagram in Fig. 6.24,
using references to implement relationships.

STATE TRUE/FALSE

1. An OODBMS is suited for multimedia applications as well as data with complex relationships that are difficult to model and process in an RDBMS.
2. An OODBMS does not call for fully integrated databases that hold data, text, pictures, voice
and video.
3. OODMs are logical data models that capture the semantics of objects supported in object-
oriented programming.
4. OODMs implement conceptual models directly and can represent complexities that are
beyond the capabilities of relational systems.
5. OODBs maintain a direct correspondence between real-world and database objects so that
objects do not lose their integrity and identity.
6. The conceptual data modelling (CDM) is based on an OO modelling.
7. Object-oriented concepts stem from object-oriented programming languages (OOPLs).
8. A class is a collection of similar objects with shared structure (attributes) and behaviour
(methods).
9. Structure is the association of class and its objects.
10. Inheritance is copying the attributes of the superclass into all of its subclass.
11. Single inheritance exists when a class has one or more immediate (parent) superclass above
it.
12. Multiple inheritances exist when a class is derived from several parent superclasses
immediately above it.
13. An operation is a function or a service that is provided by all the instances of a class.
14. The object definition language (ODL) is designed to support the semantic constructs of the
ODMG object model.
15. An OQL query is embedded into these programming languages.

TICK (✓) THE APPROPRIATE ANSWER

1. An object-oriented approach was developed in the

a. late 1960s.
b. late 1970s.
c. early 1980s.
d. late 1990s.

2. Semantic Data Model (SDM) was developed by

a. M. Hammer and D. McLeod.


b. M. Hammer.
c. D. McLeod.
d. E.F. Codd.

3. Object-oriented data models (OODMs) and object-relational data models (ORDMs) represent

a. first-generation DBMSs.
b. second-generation DBMSs.
c. third-generation DBMSs.
d. none of these.
4. An OODBMS can hold

a. data and text.


b. voice and video.
c. pictures and images.
d. All of these.

5. Which of the following is an OO feature?

a. Polymorphism
b. Inheritance
c. Abstraction
d. all of these.

6. Object-oriented concepts stem from

a. SQL.
b. OOPL.
c. QUEL.
d. None of these.

7. OO concepts first appeared in programming languages such as

a. Ada.
b. Algol.
c. SIMULA.
d. All of these.

8. A class is a collection of

a. similar objects.
b. similar objects with shared attributes.
c. similar objects with shared attributes and behaviour.
d. None of these.

9. Today, OO concepts are applied in the areas of databases such as

a. software engineering.
b. knowledge base.
c. artificial intelligence.
d. All of these.

10. An association between classes or objects may be a

a. one-to-one.
b. many-to-one.
c. many-to-many.
d. All of these.
11. OODBMSs have

a. enriched modelling capabilities.


b. more expressive query languages.
c. support for schema evolution.
d. All of these.

12. OODBMSs lack

a. experience.
b. standards.
c. support for views.
d. All of these

13. Following ODMG standards are available

a. ODMG 1.0.
b. ODMG 2.0.
c. ODMG 3.0.
d. All of these.

14. ODL constructs can be mapped into

a. C++.
b. SmallTalk.
c. JAVA.
d. All of these.

FILL IN THE BLANKS

1. An object-oriented approach was developed in _____.


2. Object-oriented data models (OODMs) and object-relational data models (ORDMs) represent
_____ generation DBMSs.
3. OODMs are logical data models that capture the _____ of objects supported in _____.
4. The OO model maintains relationships through _____.
5. An OODB maintains a direct correspondence between _____ and _____ so that objects do not lose their
_____ and _____.
6. The main difference between OODM and CDM is the encapsulation of both _____ and _____ in an
object in OODM.
7. Object-oriented concepts stem from _____.
8. A class is a collection of similar objects with shared _____ and _____.
9. Structure is basically the association of _____ and its _____.
10. Inheritance is the ability of an object within the structure (or hierarchy) to inherit the _____
and _____ of the classes above it.
11. Single inheritance exists when a class has _____ immediate superclass above it.
12. An operation is a function or a service that is provided by all the instances of a _____.
13. Object-oriented languages are used to create _____.
14. The object definition language (ODL) is designed to support the semantic constructs of the
_____ model.
15. An OQL query is _____ into _____.
Chapter 16

Object-Relational Database

16.1 INTRODUCTION

In part 2, chapters 4, 5, 6 and 7, we discussed the relational databases


(RDBMSs), entity-relationship (E-R) models and enhanced entity-relationship
(EER) models to develop database schemas. In chapter 15 we examined the
basic concepts of object-oriented data models (OODMs) and object-oriented
database management systems (OODBMSs). We also discussed about
object-oriented languages used in OODBMS. Relational and object- oriented
database systems each have certain strengths and weaknesses. However, due
to enriched modelling capacity and closer to real-word, OODBMSs have
gained a steady growth in the commercial applications. Thus, looking at the
wide acceptance of OODBMSs in commercial applications and the inherent
weaknesses of traditional RDBMSs, the RDBMS community extended
RDBMS with object-oriented features such that they continue to maintain
the supremacy in the commercial applications segments. This led to the
industry to develop a new generation of hybrid database system known as
object-relational DBMS (ORDBMS) or enhanced relational DBMSs
(ERDBMSs). This product supports both object and relational capabilities.
All the major vendors of RDBMSs are developing object-relational versions
of their current RDBMSs.
In this chapter, we will discuss background concepts of this emerging
class of commercial ORDBMS, its applications, advantages and
disadvantages. We will also discuss the structured query language SQL3
used with ORDBMS.
16.2 HISTORY OF OBJECT-RELATIONAL DBMS (ORDBMS)

In response to the weaknesses of relational database systems (RDBMSs) and


in defence against the potential threats posed by the rise of OODBMSs, the RDBMS community extended the RDBMS with object-oriented features. Let us first discuss the inherent weaknesses of legacy RDBMSs and the requirement of modern database applications to store and manipulate complex objects, followed by the emergence of object-relational DBMSs (ORDBMSs).

16.2.1 Weaknesses of RDBMS


The inherent weaknesses of relational database management systems
(RDBMSs) make them unsuitable for the modern database applications.
Some of the weaknesses of RDBMSs are listed below:
Poor representation of 'real world' entities resulting from the process of normalization.
Semantic overloading due to absence of any mechanism to distinguish between entities and
relationships, or to distinguish between different kinds of relationship that exist between
entities.
Poor support for integrity and enterprise constraints.
Homogeneous (fixed) data structure of RDBMS is too restrictive for many ‘real world’
objects that have a complex structure, leading to unnatural joins, which are inefficient.
Limited operations (having only fixed set of operations), such as set and tuple-oriented
operations and operations that are provided in the SQL specification.
Difficulty in handling recursive queries due to atomicity (repeating groups not allowed) of
data.
Impedance mismatch due to lack of computational completeness with most of Data
Manipulation Languages (DMLs) for RDBMSs.
Problems associated with concurrency, schema changes and poor navigational access.

16.2.2 Complex Objects


Increasingly, modern database applications need to store and manipulate
objects that are complex (neither small nor simple). These complex objects
are primarily in the areas that involve a variety of types of data. Examples of
such complex object types could be:
Text in computer-aided desktop publishing.
Images in weather forecasting or satellite imaging.
Complex non-conventional data in engineering designs.
Complex non-conventional data in the biological genome information.
Complex non-conventional data in architectural drawings.
Time series data in history of stock market transactions or sales histories.
Spatial and geographic data in maps.
Spatial and geographic data in air/water pollution.
Spatial and geographic data in traffic control.

Thus, in addition to storing general and simple data types (such as numeric, character and temporal), modern databases are required to handle the complex data types required by modern business applications. Table 16.1 summarises some of the common complex data
types or objects.

Table 16.1 Complex data types

Therefore, modern databases must be designed so that they can create, manipulate, maintain and perform other operations on the above complex objects, which are not predefined. Furthermore, it has become necessary to handle digitised information that represents audio and video data streams, requiring the storage of binary large objects (BLOBs) in
DBMSs. For example, a planning department of an organisation might need
to store written documents along with diagrams, maps, photographs, audio
and video recordings of some events and so on. But the traditional data types and search capabilities of relational DBMSs (for example, SQL) are not sufficient to meet these diverse requirements. Thus, what is required is not only a collection of new data types and functions, but also a facility that lets users define new data types and functions of their own. Allowing users to define their own data types and functions, and to define rules that govern the behaviour of data, increases the value of stored data by increasing its semantic content.

16.2.3 Emergence of ORDBMS


The inability of the legacy DBMSs and the basic relational data model as
well as the earlier RDBMSs to meet the challenges of new applications,
triggered the need for the development of extended ORDBMS. The
RDBMSs enhanced their capabilities in the following two ways in order to
accommodate and facilitate these new requirements and trends of modern
DBMSs:
Adding an ‘object infrastructure’ to the database system itself, in the form of support for
user-defined data types (UDTs), functions and rules.
Building ‘relational extenders’ on top of the object infrastructure that support specialised
applications and handling of complex data types, such as, advanced text searching, image
retrieval, geographic applications and so on.

An object-relational database management system (ORDBMS) is a


database engine that supports both the above relational and object-oriented
features in an integrated fashion. Thus, the users can themselves define,
manipulate and query both relational data and objects while using common
interface such as SQL. An ORDBMS system provides a bridge between the
relational and object-oriented paradigms. The ORDBMS evolved to
eliminate the weaknesses of RDBMSs. ORDBMSs combine the advantages
of modern object- oriented programming languages (OOPLs) with relational
database features. Several vendors have released ORDBMS products known
as universal servers. The term universal server is a more contemporary term,
which is based on the stated objective of managing any type of data,
including user-defined data types (UDTs), with the same server technology.
Following are some of the examples of ORDBMS universal servers:
Universal Database (UDB) version of DB2 by IBM using DB2 extenders.
Postgres ('Post INGRES') by Michael Stonebraker of Ingres.
Informix Universal Server by Informix Corporation using DataBlade modules.
Oracle 8i Universal Server by Oracle Corporation using Data Cartridge.
ODB-II (Jasmine) by Computer Associates.
Odapter by Hewlett Packard (HP) which extends Oracle’s DBMS.
Open ODB from HP which extends HP’s own Allbase/SQL product.
UniSQL from UniSQL Inc.

16.3 ORDBMS QUERY LANGUAGE (SQL3)

In chapter 5, we discussed extensively about structured query language


(SQL) standards presented in 1992. This standard was commonly referred to
as SQL2 or SQL-92. Later on, the ANSI and ISO SQL standardization bodies added features to the SQL specification in 1999 to support object-oriented data management. SQL:1999 is often referred to as the SQL3 standard. SQL3 supports many of the complex data types shown in Table 16.1. SQL3, in addition to relational query facilities, also provides support for object technology.
The SQL3 standard includes the following parts:
SQL/Framework
SQL/Foundation
SQL/Bindings
SQL/Temporal
SQL/Object
New parts addressing temporal and transaction aspects of SQL.
SQL/Call level interface (CLI).
SQL/Persistent stored modules (PSM)
SQL/Transaction
SQL/Multimedia
SQL/Real-Time

SQL/Foundation deals with new data types, new predicates, relational


operations, cursors, rules and triggers, user-defined types (UDTs),
transaction capabilities and stored routines.
SQL/Bindings include embedded SQL and Direct Invocation.
SQL/Temporal deals with historic data, time series data, versions and
other temporal extensions.
SQL/Call level interface (CLI) provides rules that allow execution of application code without providing source code and avoids the need for preprocessing. It provides a new type of language binding, analogous to dynamic SQL, and provides an application programming interface (API) to the database.
SQL/Persistent stored modules (PSM) specifies facilities for partitioning an application between a client and a server. It allows procedures and user-defined functions to be written in a third-generation language (3GL) or in SQL and stored in the database, making SQL computationally complete. It enhances performance by minimising network traffic.
SQL/Transaction specification formalises the XA interface for use by
SQL implementers.
SQL/Multimedia develops a set of multimedia library specifications that
will include multimedia objects for spatial and full-text objects, still images, general-purpose user-defined types (that is, UDTs such as complex numbers and vectors) and generalised data types for coordinates, geometry and their
operations.
SQL/Real-Time handles real-time concepts, such as the ability to place
real-time constraints on data processing requests and to model temporally
consistent data.
For example, let us suppose that an organisation wants to create a relation
EMPLOYEE to record employee data with the attributes as shown in Table
16.2. The traditional relational attributes of this relation EMPLOYEE are
EMP-ID, EMP-NAME, EMP-ADDRESS, EMP-CITY and EMP-DOB. In
addition to these attributes, the organisation now would like to store a
photograph of each employee as part of the record (tuple) for identification
purposes. Thus, it is required to add an additional attribute named EMP-
PHOTO in the record (tuple), as shown in Table 16.2.
SQL3 provides statements to create both tables and data types (or object classes) to store the employee photo in the relation EMPLOYEE. The SQL3 statement for this example may appear as follows:

CREATE TABLE EMPLOYEE (
EMP-ID INTEGER NOT NULL,
EMP-NAME CHAR(20) NOT NULL,
EMP-ADDRESS CHAR(30) NOT NULL,
EMP-CITY CHAR(15) NOT NULL,
EMP-DOB DATE NOT NULL,
EMP-PHOTO TYPE IMAGE NOT NULL);

In this example, IMAGE is a complex data type or class and EMP-


PHOTO is an object of that class. IMAGE may be either predefined class or
a user-defined class. If it is a user-defined class, a SQL CREATE CLASS
statement is used to define the class. The user can also define methods that
operate on the data defined in a class. For example, one method of IMAGE
is Scale (), which can be used to expand or reduce the size of the
photograph.

Table 16.2 Relation EMPLOYEE of ORDBMS

SQL3 includes extensions for content addressing with complex data types.
For example, suppose that user wants to issue the following query:
“Given a photograph of a person, scan the EMPLOYEE table to determine if there is a close
match for any employee to that photo and then display the record (tuple) of the employee
including the photograph”.

Suppose that the electronic image of the photograph is stored in a location


called ‘MY-PHOTO’. Then a simple query in SQL3 might appear as
follows:

SELECT *
FROM EMPLOYEE
WHERE MY-PHOTO LIKE EMP-PHOTO

The content addressing in an ORDBMS is a very powerful feature that


allows users to search for matches to multimedia objects such as images,
audio or video segments and documents. One such application for this
feature is searching databases for fingerprint or voiceprint matches.

16.4 ORDBMS DESIGN

The rich variety of data types in an ORDBMS offers a database designer


many opportunities for a more efficient design. As discussed in previous
sections, an ORDBMS supports a number of much better design solutions compared to an RDBMS and other databases.
An ORDBMS allows the video (see the space probe example below) to be stored as a user-defined abstract data type (ADT) object, with methods that capture any special manipulation that a user wishes to perform. Allowing users to define arbitrary new data types is a key feature of ORDBMSs. The ORDBMS allows users to store and retrieve objects of type jpeg_image, which stores a compressed image representing a single frame of film, just like an object of any other type, such as integer. New
atomic data types usually need to have type-specific operations defined by the user who
creates them. For example, one might define operations on an image data type such as
compress, rotate, shrink and crop. The combination of an atomic type and its associated
method is called an abstract data type (ADT). Traditional SQL comes with built-in ADTs,
such as integers (with the associated arithmetic methods), or strings (with the equality,
comparison and LIKE methods). ORDBMSs include these ADTs and also allow users to
define their own ADTs.
The user can store the location sequence for a probe in a single tuple, along with the video information. This layout eliminates the need for joins in queries that involve both the sequence and the video information.
Let us take an example of several space probes, each of which
continuously records a video of different parts of space. A single video
stream is associated with each probe. While this video stream was collected
over a certain time period, we assume that it is now a complete object
associated with the probe. The probe’s location was also periodically
recorded during the time period over which the video was collected. Thus,
the information associated with the probe has the following parts:
A probe identifier (PROBE-ID) that uniquely identifies the probe
A video stream (VIDEO)
A location sequence (LOC-SEQ) of (time, location) pairs
A camera (CAMERA) string

In ORDBMS design, we can have a single relation (table) called


PROBES_INFO as follows:

PROBES_INFO (PROBE-ID: integer, LOC-SEQ: location_seq,


CAMERA: string, VIDEO: mpeg_stream)

Here, the mpeg_stream type is an abstract data type (ADT), with a


method display() that takes a start time and an end time and displays the
portion of the video recorded during that interval. This method can be
implemented efficiently by looking at the total recording duration and the
total length of the video. This information can then be interpolated to extract
the segment recorded during the interval specified in the query.
Now, we can issue the following queries using extended SQL (SQL3)
syntax and the display() method:

Query 1: Retrieve only required segment of the video, rather than the
entire video.
SELECT display (P.video, 6.00 a.m. Jan 01
2005, 6.00 a.m. Jan 30 2005)
FROM PROBES_INFO AS P
WHERE P.PROBE-ID = 05
Query 2: Create the location_seq type by defining list type containing
a list of ROW type objects. Extract the time column from
this list to obtain a list of timestamp values. Apply the MIN
aggregate operator to this list to find the earliest time at
which the given probe recorded.
CREATE TYPE location_seq listof
(row (TIME: timestamp, LAT: real, LONG: real))
SELECT P.PROBE-ID, MIN(P.LOC-
SEQ.TIME)
FROM PROBES_INFO AS P

Here, LAT means latitude and LONG means longitude.

From the above examples we can see that an ORDBMS gives us many
useful design options that are not available in a RDBMS.

16.4.1 Challenges of ORDBMS


The enhanced functionalities of ORDBMSs raise several implementation
challenges.
Storage and access methods: Since ORDBMS stores new types of data, it is required to
revisit some of the storage and indexing issues. In particular, the system must efficiently store
ADT objects and structured objects and provide efficient indexed access to both.
Storing large ADT and structured type objects: Large ADT objects and structured objects
complicate the layout of data on disk storage.
Indexing new types: An important issue for ORDBMSs is to provide efficient indexes for
ADT methods and operators on structured objects.
Query processing: ADTs and structured types call for new functionality (such as user-defined aggregates, security and so on) in processing queries in ORDBMSs. They also change a number of assumptions that affect the efficiency of queries (such as method caching and pointer swizzling).
Query optimization: New indexes and query processing techniques widen the choices available to a query optimiser. In order to handle the new query processing functionality, an optimiser must know about the functionality and use it appropriately.

16.4.2 Features of ORDBMS


An enhanced version of SQL can be used to create and manipulate both relational tables and
object types or classes.
Support for traditional object-oriented functions including inheritance, polymorphism, user-
defined data types and navigational access.

16.4.3 Comparison of ORDBMS and OODBMS


Table 16.3 shows a comparison between ORDBMS and OODBMS from
three perspectives namely data modelling, data access, and data sharing.

Table 16.3 Comparison of ORDBMS and OODBMS


Table 16.4 summarises the characteristics of different DBMSs. The
hierarchical, network and object-oriented models of DBMSs assume that the
object survives the changes of all its attributes. These DBMSs are record-
based systems. A record of the real-world item appears in the database and,
even though the record contents may change completely, the record itself
represents the application entity. In contrast, the relational, object-relational
and deductive DBMS models are value-based. They assume that the real-
world item has no identity independent of its attribute values.

Table 16.4 Characteristics of different DBMSs

As shown in Table 16.4, each DBMS model uses a particular style of
access language to manipulate the database contents. Hierarchical, network
and object-oriented DBMS models employ a procedural language describing
the precise sequence of operations to compute the desired results. Relational,
object-relational and deductive DBMS models use non-procedural
languages, stating only the desired results and leaving specific computation
to the DBMS.
16.4.4 Advantages of ORDBMS
Resolving many weaknesses of RDBMS.
Reduced network traffic.
Reuse and sharing.
Improved application and query performance.
Simplified software maintenance.
Preservation of the significant body of knowledge and experience that has gone into
developing relational applications.
Integrated data and transaction management.

16.4.5 Disadvantages of ORDBMS


Complexity and associated increased costs.
Loss of simplicity and purity of the relational model due to extensions of complex objects.
Large semantic gap between object-oriented and relational technologies.

REVIEW QUESTIONS

1. What are the weaknesses of legacy RDBMSs?


2. What is object-relational database? What are its advantages and disadvantages?
3. How did ORDBMSs emerge? Discuss in detail.
4. What are the ORDBMS products available for commercial applications?
5. Compare RDBMSs with ORDBMSs. Describe an application scenario for which you would
choose a RDBMS and explain the reason for choosing it. Similarly, describe an application
scenario for which you would choose an ORDBMS and again explain why you have chosen
it.
6. What do you mean by complex objects? List some of the complex objects that can be
handled by ORDBMS.
7. What is the structured query language used in ORDBMSs? What are its standard parts?
Discuss them in brief.
8. Discuss the ORDBMS design with query examples in brief.
9. What are the implementation challenges raised by the enhanced functionalities of ORDBMSs?
10. Compare different DBMSs.

STATE TRUE/FALSE

1. ORDBMS is an extended RDBMS with object-oriented features.


2. Object-relational DBMS (ORDBMS) is also called enhanced relational DBMSs
(ERDBMSs).
3. ORDBMSs have good representation capabilities of ‘real world’ entities.
4. ORDBMS has poor support for integrity and enterprise constraints.
5. ORDBMS has difficulty in handling recursive queries due to atomicity (repeating groups not
allowed) of data.
6. SQL/Foundation deals with new data types, new predicates, relational operations, cursors,
rules and triggers, user-defined types (UDTs), transaction capabilities and stored routines.
7. SQL/Temporal deals with embedded SQL and Direct Invocation.
8. SQL/Bindings deal with historic data, time series data, versions and other temporal
extensions.
9. ORDBMS provides improved application and query performance.
10. ORDBMS increases network traffic.

TICK (✓) THE APPROPRIATE ANSWER

1. ORDBMS can handle

a. complex objects.
b. user-defined types.
c. abstract data types.
d. All of these.

2. Object-relational DBMS (ORDBMS) is also called

a. enhanced relational DBMS.


b. general relational DBMS.
c. object-oriented DBMS.
d. All of these.

3. ORDBMS supports

a. object capabilities.
b. relational capabilities.
c. Both (a) and (b).
d. None of these.

4. Example of complex objects are

a. spatial and geographic data in maps.


b. spatial and geographic data in air/water pollution.
c. spatial and geographic data in traffic control.
d. All of these.

5. Example of complex objects are

a. complex non-conventional data in engineering designs.


b. complex non-conventional data in the biological genome information.
c. complex non-conventional data in architectural drawings.
d. All of these.
6. An ORDBMS product developed by IBM is known as

a. universal database.
b. postgres.
c. informix.
d. ODB-II.

7. An ORDBMS product developed by ORACLE is known as

a. universal database.
b. postgres.
c. informix.
d. None of these.

8. An ORDBMS product developed by Computer associates is known as

a. universal database.
b. postgres.
c. informix.
d. ODB-II.

9. An ORDBMS product developed by Ingres is known as

a. universal database.
b. postgres.
c. informix.
d. ODB-II.

10. An ORDBMS product developed by HP is known as

a. universal database.
b. postgres.
c. adapter.
d. ODB-II.

FILL IN THE BLANKS

1. ORDBMS is an extended _____ with _____ features.


2. Image is _____ data types.
3. Complex objects can be manipulated by _____ DBMS.
4. Images in weather forecasting or satellite imaging are examples of _____ data type.
5. Open ODB is an ORDBMS product developed by _____.
6. Informix is an ORDBMS product developed by _____.
7. SQL/Persistent stored modules (PSM) specify facilities for partitioning an application
between a _____ and a _____.
8. The three disadvantages of an ORDBMS are (a) _____ , (b) _____ and (c) _____.
9. The three advantages of an ORDBMS are (a) _____, (b) _____ and (c) _____.
10. _____ has large semantic gap between object-oriented and relational technologies.
Part-VI

ADVANCED AND EMERGING DATABASE CONCEPTS


Chapter 17

Parallel Database Systems

17.1 INTRODUCTION

The architecture of a database system is greatly influenced by the underlying


computer system on which the database system runs. Database systems can
be centralised, parallel or distributed. In the preceding chapters, we introduced
concepts of centralised database management systems (for example,
hierarchical, network, relational, object-oriented and object-relational)
that are based on a single central processing unit (CPU) or computer
architecture. In this architecture, all the data is maintained at a single site (or
computer system) and the processing of individual transactions is assumed to
be essentially sequential. But today, a single-CPU computer
architecture is not capable enough for modern databases that are required
to handle the more demanding and complex requirements of users, for
example, high performance, increased availability, distributed access to data,
analysis of distributed data and so on.
To meet the complex requirements of users, the modern database systems
today operate with an architecture where multiple CPUs are working in
parallel to provide complex database services. In some of the architectures,
multiple CPUs are working in parallel and are physically located in a close
environment in the same building and communicating at very high speed.
The databases operating in such an environment are known as parallel
databases.
In this chapter, we will briefly discuss the different types of database
system architectures that support multiple CPUs working in parallel. We will
also discuss significant issues related to various methods of query
parallelism that are implemented on parallel databases.

17.2 PARALLEL DATABASES

In parallel database systems, multiple CPUs work in parallel to improve


performance through parallel implementation of various operations such as
loading data, building indexes and evaluating queries. Parallel processing
divides a large task into many smaller tasks and executes the smaller tasks
concurrently on several nodes (CPUs). As a result, the larger task completes
more quickly. Parallel database systems improve processing and input/output
(I/O) speeds by using multiple CPUs and disks working in parallel. Parallel
databases are especially useful for applications that have to query large
databases and process a large number of transactions per second. In parallel
processing, many operations are performed simultaneously, as opposed to
centralised processing, in which serial computation is performed.
Thus, the goal of parallel database systems is usually to ensure that the
database system can continue to perform at an acceptable speed, even as the
size of the database and the number of transactions increases. Increasing the
capacity of the system by increasing the parallelism provides a smoother
path for growth for an enterprise than does replacing a centralised system by
a faster machine.
Parallel database systems are usually designed from the ground up to
provide best cost-performance and they are quite uniform in site machine
(computer) architecture. The cooperation between site machines is usually
achieved at the level of the transaction manager module of a database
system. Parallel database systems represent an attempt to construct a faster
centralised computer using several small CPUs. It is more economical to
have several smaller CPUs that together have the power of one large CPU.

17.2.1 Advantages of Parallel Databases


Increased throughput (scale-up).
Improved response time (speed-up).
Useful for applications that query extremely large databases and process transactions at an
extremely high rate (in the order of thousands of transactions per second).
Substantial performance improvements.
Increased availability of system.
Greater flexibility.
Possible to serve a large number of users.

17.2.2 Disadvantages of Parallel Databases


More start-up costs.
Interference problem.
Skew problem.

17.3 ARCHITECTURE OF PARALLEL DATABASES

As discussed in the preceding sections, in parallel database architecture,


there are multiple central processing units (CPUs) connected to a computer
system. There are several architectural models for parallel machines. Three
of the most prominent ones are listed below:
Shared-memory multiple CPU.
Shared-disk multiple CPU.
Shared-nothing multiple CPU.

17.3.1 Shared-memory Multiple CPU Parallel Database Architecture


In a shared-memory system, a computer has several (multiple)
simultaneously active CPUs that are attached to an interconnection network
and can share (or access) a single (or global) main memory and a common
array of disk storage. Thus, in shared-memory architecture, a single copy of
a multithreaded operating system and multithreaded DBMS can support
multiple CPUs. Fig. 17.1 shows a schematic diagram of a shared-memory
multiple CPU architecture. The shared-memory architecture of a parallel
database system is closest to the traditional single-CPU architecture of
centralised database systems, but much faster in performance compared to
a single CPU of the same power. This structure is attractive for achieving
moderate parallelism. Many commercial database systems have been ported
to shared-memory platforms with relative ease.
17.3.1.1 Benefits of shared-memory architecture
Communication between CPUs is extremely efficient. Data can be accessed by any CPU
without being moved with software. A CPU can send messages to the other CPUs much
faster by using memory writes, which usually takes less than a microsecond, than by sending
a message through a communication mechanism.
The communication overheads are low, because main memory can be used for this purpose
and operating system services can be leveraged to utilise the additional CPUs.

Fig. 17.1 Shared-memory multiple CPU architecture

17.3.1.2 Limitations of Shared-memory Architecture


Memory access uses a very high-speed mechanism that is difficult to partition without losing
efficiency. Thus, the design must take special precautions to ensure that the different CPUs have equal
access to the common memory. Also, the data retrieved by one CPU should not be
unexpectedly modified by another CPU acting in parallel.
Since the communication bus or interconnection network is shared by all the CPUs, the
shared-memory architecture is not scalable beyond 80 or 100 CPUs in parallel. The bus or
the interconnection network becomes a bottleneck as the number of CPUs increases.
The addition of more CPUs causes CPUs to spend time waiting for their turn on the bus to
access memory.

17.3.2 Shared-disk Multiple CPU Parallel Database Architecture


In a shared disk system, multiple CPUs are attached to an interconnection
network and each CPU has its own memory but all of them have access to
the same disk storage or, more commonly, to a shared array of disks. The
scalability of the system is largely determined by the capacity and
throughput of the interconnection network mechanism. Since memory is not
shared among CPUs, each node has its own copy of the operating system
and the DBMS. It is possible that, with the same data accessible to all nodes,
two or more nodes may want to read or write the same data at the same time.
Therefore, a kind of global (or distributed) locking scheme is required to
ensure the preservation of data integrity. Sometimes, the shared-disk
architecture is also referred to as a parallel database system. Fig. 17.2 shows
a schematic diagram of a shared-disk multiple CPU architecture.

17.3.2.1 Benefits of Shared-disk Architecture


Shared-disk architecture is easy to load-balance, because data does not have to be
permanently divided among available CPUs.
Since each CPU has its own memory, the memory bus is not a bottleneck.
It offers a low-cost solution to provide a degree of fault tolerance. In case of a CPU or
memory failure, the other CPUs take over its task, since the database is resident on disks that
are accessible from all CPUs.
It has found acceptance in wide applications.

Fig. 17.2 Shared-disk multiple CPU architecture


17.3.2.2 Limitations of Shared-disk Architecture
Shared-disk architecture also faces similar problems of interference and memory contention
bottleneck as the number of CPUs increases. As more CPUs are added, existing CPUs are
slowed down because of the increased contention for memory accesses and network
bandwidth.
Shared-disk architecture also has a problem of scalability. The interconnection to the disk
subsystem becomes a bottleneck, particularly when the database makes a large number of
accesses to the disks.

17.3.3 Shared-nothing Multiple CPU Parallel Database Architecture


In a shared-nothing system, multiple CPUs are attached to an
interconnection network through a node and each CPU has a local memory
and disk storage, but no two CPUs can access the same disk storage area. All
communication between CPUs is through a high-speed interconnection
network. Each node functions as the server for the data on the disk or disks that
the node owns. Thus, shared-nothing environments involve no sharing of
memory or disk resources. Each CPU has its own copy of operating system,
its own copy of the DBMS, and its own copy of a portion of data managed
by the DBMS. Fig. 17.3 shows a schematic diagram of a shared-nothing
multiple CPU architecture. In this type of architecture, CPUs sharing
responsibility for database services usually split up the data among
themselves. CPUs then perform transactions and queries by dividing up the
work and communicating by message over the high-speed network (at the
rate of megabits per second).

17.3.3.1 Benefits of Shared-nothing Architecture


Shared-nothing architectures minimise contention among CPUs by not sharing resources and
therefore offer a high degree of scalability.
Since local disk references are serviced by local disks at each CPU, the shared-nothing
architecture overcomes the limitations of requiring all I/O to go through a single
interconnection network. Only queries, accesses to non-local disks and result relations pass
through the network.
The interconnection networks for shared-nothing architectures are usually designed to be
scalable. Thus, adding more CPUs and more disks enables the system to grow (or scale) in a
manner that is proportionate to the power and capacity of the newly added components. This
provides for scalability that is nearly linear, enabling users to get a large return on their
investment in new hardware (resources). In other words, shared-nothing architecture provides
linear speed-up and linear scale-up.
Linear speed-up and scale-up properties increase the transmission capacity of shared-nothing
architecture as more nodes are added; therefore, it can easily support a large number of
CPUs.

Fig. 17.3 Shared-nothing multiple CPU architecture

17.3.3.2 Limitations of Shared-nothing Architecture


Shared-nothing architectures are difficult to load-balance. In many multi-CPU environments,
it is necessary to split the system workload in some way so that all system resources are
being used efficiently. Proper splitting or balancing this workload across a shared-nothing
system requires an administrator to properly partition or divide the data across the various
disks such that each CPU is kept roughly as busy as the others. In practice this is difficult to
achieve.
Adding new CPUs and disks to a shared-nothing architecture means that the data may need to
be redistributed in order to take advantage of the new hardware (resources), and thus requires
more extensive reorganisation of the DBMS code.
The costs of communication and non-local disk access are higher than in shared-disk or
shared-memory architecture since sending data involves software interaction at both ends.
The high-speed networks are limited in size, because of speed-of-light considerations. This
leads to the requirement that a parallel architecture has CPUs that are physically close
together. This network architecture is also known as local area network (LAN).
Shared-nothing architectures introduce a single point of failure to the system. Since each
CPU manages its own disk(s), data stored on one or more of these disks becomes inaccessible
if its CPU goes down.
It requires an operating system that is capable of accommodating the heavy amount of
messaging required to support inter-processor communications.

17.3.3.3 Applications of Shared-nothing Architecture


Shared-nothing architectures are well suited for relatively cheap CPU technology. Since
scalability is high, users can start with a relatively small (and low-cost) system, adding more
relatively low-cost CPUs to meet increased capacity needs.
The shared-nothing approach forms the basis for massive parallel processing systems.

17.4 KEY ELEMENTS OF PARALLEL DATABASE PROCESSING

Following are the key elements of parallel database processing:


Speed-up
Scale-up
Synchronisation
Locking

17.4.1 Speed-up
Speed-up is a property in which the time taken for performing a task
decreases in proportion to the increase in the number of CPUs and disks in
parallel. In other words, speed-up is the property of running a given task in
less time by increasing the degree of parallelism (more number of hardware).
With additional hardware, speedup holds the task constant and measures the
time saved. Thus, speed-up enables users to improve the system response
time for their queries, assuming the size of their databases remain roughly
the same. Speed-up due to parallelism can be defined as

    Speed-up = TO / TP

where

    TO = execution time of a task on the original or smaller machine (or
         original processing time)
    TP = execution time of the same task on the parallel or larger machine (or
         parallel processing time)

Here, the original processing time (or execution time on original or


smaller machine) TO is the elapsed time spent by a small system on the given
task and parallel processing time (or execution time on parallel or larger
machine) TP is the elapsed time spent by a larger parallel system on the
given task.
Consider a database application running on a parallel system with a
certain number of CPUs and disks. Let us suppose that the size of the system
is increased by increasing the number of CPUs, disks and other hardware
components. The goal is to process the task in time inversely proportional to
the number of CPUs and disks allocated. For example, if the original system
takes 60 seconds to perform a task and a parallel system with twice the resources takes 30
seconds to perform the same task, then the value of speed-up is 60/30 = 2.
The speed-up value 2 is an indication of linear speed-up. In other words, the
parallel system is said to demonstrate linear speedup if the speed-up is N
when the larger system has N times the resources (CPUs, disks and so on) of
the smaller system. If the speed-up is less than N, the system is said to
demonstrate sub-linear speed-up. Fig. 17.4 illustrates linear and sub-linear
speed-up curves of parallelism. The speed-up curve shows how, for a
fixed database size, more transactions can be executed per second by adding
more resources such as CPUs and disks.
Fig. 17.4 Speed-up with increasing resources

17.4.2 Scale-up
Scale-up is the property in which the performance of the parallel database is
sustained if the number of CPUs and disks is increased in proportion to the
amount of data. In other words, scale-up is the ability of handling larger
tasks by increasing the degree of parallelism (providing more resources) in
the same time period as the original system. With added hardware (CPUs
and disks), a formula for scale-up holds the time constant and measures the
increased size of the task, which can be performed. Thus, scale-up enables
users to increase the sizes of their databases while maintaining roughly the
same response time. Scale-up due to parallelism can be defined as

    Scale-up = VP / VO

where

    VP = parallel or large processing volume
    VO = original or small processing volume

Here, original processing volume is the transaction volume processed in a


given amount of time on a small system. Parallel processing volume is the
transaction volume processed in a given amount of time on a parallel system.
For example, if the original system can process 3000 transactions in a given
amount of time and if the parallel system can process 6000 transactions in
the same amount of time, then the scale-up value would be 6000/3000 = 2.
The scale-up value 2 is an indication of linear scale-up, which means that
twice as much hardware can process twice the data volume in the same
amount of time. If the scale-up value is less than 2, then it is called sub-
linear scale-up. Fig. 17.5 illustrates linear and sub-linear scale-up curve of
the parallelism.
The scale-up curve shows how adding more resources (CPUs) enables the
user to process larger tasks. The first scale-up curve measures the number of
transactions executed per second as the database size is increased and the
number of CPUs is correspondingly increased. An alternative way to
measure scale-up is to consider the time taken per transaction (execution
time) as more CPUs are added to process an increasing number of
transactions per second. Thus, in this case, the goal is to sustain the response
time per transaction. For example, let us consider a task (query) QN which is
N times bigger than the original task Q. Suppose that the execution time of
the task Q on a given original (small) computer system CO is TO, and
execution time of task QN on a parallel (or large) computer system Cp is TP,
which is N times larger than CO. The scale-up can then be defined as

    Scale-up = TO / TP

Fig. 17.5 Scale-up with increasing resources

The parallel computer system CP is said to demonstrate linear scale-up on


task Q if TP = TO. If TP > TO, the system is said to demonstrate sub-linear
scale-up. Scale-up is usually the more important parameter for measuring
performance of parallel database systems.
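
The two definitions can be exercised with a small sketch (Python; purely illustrative, using the example figures given above):

# Illustrative calculation of speed-up and scale-up from the definitions
# above; the numbers follow the examples used in the text.

def speed_up(t_original, t_parallel):
    """Speed-up = TO / TP."""
    return t_original / t_parallel

def scale_up(v_parallel, v_original):
    """Scale-up = VP / VO."""
    return v_parallel / v_original

print(speed_up(60, 30))      # 2.0 -> linear speed-up with twice the resources
print(scale_up(6000, 3000))  # 2.0 -> linear scale-up with twice the resources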

17.4.3 Synchronisation
Synchronisation is the coordination of concurrent tasks. It is necessary for
correctness. For the successful operation of parallel database systems, the tasks
should be divided such that the synchronisation requirement is low. With
a lower synchronisation requirement, better speed-up and scale-up can be
achieved. The amount of synchronisation depends on the amount of
resources (CPUs, disks, memory, databases, communication network and so
on) and the number of users and tasks working on the resources. More
synchronisation is required to coordinate a large number of concurrent tasks
and less synchronisation is necessary to coordinate a small number of
concurrent tasks.

17.4.4 Locking
Locking is a method of synchronising concurrent tasks. Both internal as well
as external locking mechanisms are used for synchronisation of tasks that are
required by the parallel database systems. For external locking, a distributed
lock manager (DLM) is used, which is a part of the operating system
software. DLM coordinates resource sharing between communication nodes
running a parallel server. The instances of a parallel server use the DLM to
communicate with each other and coordinate modification of database
resources. The DLM allows applications to synchronise access to resources
such as data, software and peripheral devices, so that concurrent requests for
the same resource are coordinated between applications running on different
nodes.

17.5 QUERY PARALLELISM

As we have discussed in the preceding sections, parallelism is used to


provide speed-up and scale-up. This is done so that the queries are executed
faster by adding more resources and the increasing workload is handled
without increasing the response time, by increasing the degree of
parallelism. However, the main challenge in parallel databases is query
parallelism. That is, how to design an architecture that will allow parallel
execution of multiple queries, or decompose (divide) a query into parts that
act in parallel. The shared-nothing parallel database architectures have been
very successful in achieving this goal. Following are some of the forms of query
parallelism that take care of such requirements:
Input/output (I/O) parallelism.
Intra-query parallelism.
Inter-query parallelism.
Intra-operation parallelism.
Inter-operation parallelism.

17.5.1 I/O Parallelism (Data Partitioning)


Input/output (I/O) parallelism is the simplest form of parallelism in which
the relations (tables) are partitioned on multiple disks to reduce the retrieval
time of relations from disk. In I/O parallelism, the input data is partitioned
and then each partition is processed in parallel. The results are combined
after the processing of all partitioned data. I/O parallelism is also called data
partitioning. The following four types of partitioning techniques can be
used:
Hash partitioning.
Range partitioning.
Round-robin partitioning.
Schema partitioning.

17.5.1.1 Hash Partitioning


In the technique of hash partitioning, a hash function whose range is
{0, 1, 2, …, n−1} is applied to the partitioning attribute. Each tuple (row) of the
original relation is hashed on the partitioning attributes. The output of this
function determines the disk on which the data for that tuple is placed.
For example, let us assume that there are n disks d0, d1, d2, …,
dn−1, across which the data are to be partitioned. Now, if the hash function
returns 2, then the tuple is placed on disk d2. Fig. 17.6 (a) illustrates an
example of hash partitioning.
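
The idea can be sketched as follows (a minimal Python sketch; the tuple layout, the in-memory disk lists and the use of Python's built-in hash are assumptions for illustration only):

# Sketch of hash partitioning: each tuple is routed to one of n disks by
# hashing the partitioning attribute. The lists stand in for real disks.

def hash_partition(tuples, key, n_disks):
    disks = [[] for _ in range(n_disks)]
    for row in tuples:
        disks[hash(row[key]) % n_disks].append(row)   # disk chosen by hash value
    return disks

employees = [{"EMP-ID": 106519, "NAME": "A"}, {"EMP-ID": 106520, "NAME": "B"}]
partitions = hash_partition(employees, key="EMP-ID", n_disks=4)

# A point query on EMP-ID = 106519 needs to look at only one partition:
candidates = partitions[hash(106519) % 4]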

Advantages: Hash partitioning has the advantage of providing for even


distribution of data across the available disks, helping to prevent skew.
Skew can slow performance when one or more CPUs and disks
get more work than others. Hash partitioning is best suited for point
queries (involving exact matches) based on the partitioning attribute. For
example, if a relation is partitioned on the employee identification numbers
(EMP-ID), then we can answer the query “Find the record of the employee
with employee identification number = 106519” using SQL statement as
follows:

SELECT *
FROM EMPLOYEE
WHERE EMP-ID = 106519;
Hash partitioning is also useful for sequential scans of the entire relation
(table) placed on n disks. The time taken to scan the relation is
approximately 1/n of the time required to scan the relation in a single disk
system.

Disadvantages: Hash partitioning technique is not well suited for point


queries on non-partitioning attributes. It is also not well suited for answering
range queries, since hash functions typically do not preserve proximity
within a range. For example, hash partitioning will not perform well for queries
involving range searches such as:

SELECT *
FROM EMPLOYEE
WHERE EMP-ID > 105000 and EMP-ID < 150000;

In such a case, the search (scanning) would have to involve most (or all)
disks over which the relation has been partitioned.

17.5.1.2 Range Partitioning


In the technique of range partitioning an administrator specifies that
attribute-values within a certain range are to be placed on a certain disk. In
other words, range partitioning distributes contiguous attribute-value ranges
to each disk. For example, range partitioning with three disks numbered 0,
1 and 2 might place tuples with employee identification numbers up to 100000 on
disk 0, tuples with employee identification numbers 100001–150000 on disk
1, and tuples with employee identification numbers 150001–200000 on disk 2. Fig. 17.6 (b)
illustrates an example of range partitioning.
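
A minimal sketch of a range-partitioning lookup via a partitioning vector (Python; the boundary values follow the illustrative employee-number ranges above):

import bisect

# Sketch of range partitioning: a partitioning vector of upper bounds maps
# an EMP-ID to a disk; a range query touches only a contiguous set of disks.
PARTITION_VECTOR = [100000, 150000, 200000]   # disks 0, 1 and 2

def disk_for(emp_id):
    """Locate the disk whose range contains emp_id (point query)."""
    return bisect.bisect_left(PARTITION_VECTOR, emp_id)

def disks_for_range(low, high):
    """Locate the disks a range query between low and high must scan."""
    return list(range(disk_for(low), disk_for(high) + 1))

print(disk_for(106519))                  # 1
print(disks_for_range(105000, 150000))   # [1]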

Advantages: Range partitioning involves placing tuples containing


attribute values that fall within a certain range on a disk. This offers good
performance for range-based queries and also provides reasonable
performance for exact-match (point) queries involving the partitioning
attribute. For point queries, the partitioning vector can be used to locate the
disk where the tuples reside. For range queries, the partitioning vector is
used to find the range of disks on which the tuples may reside. In both cases,
the search narrows to exactly those disks that might have any tuples of
interest.

Disadvantages: Range partitioning can cause skewing in some cases.


For example, consider an EMPLOYEE relation that is partitioned across
disk according to employee identification numbers. If tuples containing
numbers 100000–150000 are placed on disk 0 (d0) and tuples containing
numbers 150001–200000 are placed on disk 1 (d1), data will be evenly
distributed if the company employs 200000 employees. However, if the
company employs only 160000 employees currently and most are assigned
numbers 100000–150000, the bulk of the tuples for this relation will be
skewed towards disk 0 (d0).

17.5.1.3 Round-robin Partitioning


In the round-robin partitioning technique, the relation (table) is scanned
in some order and the ith tuple is sent to disk d(i mod n). In other words,
disks ‘take turns’ receiving new tuples of data. For example, a system with n
disks would place tuple A on disk 0 (d0), tuple B on disk 1 (d1), tuple C on
disk 2 (d2) and so forth. Round-robin technique ensures an even distribution
of tuples across disks. That is, each disk has approximately the same number
of tuples as the others. Fig. 17.6 (c) illustrates an example of round-robin
partitioning.
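
A minimal sketch of this assignment (Python; the tuples and the number of disks are placeholders):

# Sketch of round-robin partitioning: the ith tuple goes to disk i mod n,
# so tuples are spread evenly regardless of their attribute values.

def round_robin_partition(tuples, n_disks):
    disks = [[] for _ in range(n_disks)]
    for i, row in enumerate(tuples):
        disks[i % n_disks].append(row)
    return disks

print(round_robin_partition(["A", "B", "C", "D"], n_disks=3))
# [['A', 'D'], ['B'], ['C']]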

Advantages: Round-robin partitioning is ideally suited for applications


that wish to read the entire relation sequentially for each query.

Disadvantages: With round-robin partitioning technique, both point


queries and range queries are complicated to process, since each of the n
disks must be used for search.

17.5.1.4 Schema Partitioning


In schema partitioning technique, different relations (tables) within a
database are placed on different disks. Fig. 17.6 (d) illustrates an example of
schema partitioning.

Disadvantages: Schema partitioning is more prone to data skewing.


Most vendors support schema partitioning along with one or more other
techniques.
Fig. 17.6 Data partitioning techniques

17.5.2 Intra-query Parallelism


Intra-query parallelism refers to the execution of a single query in parallel on
multiple CPUs using the shared-nothing parallel architecture technique. Intra-
query parallelism is sometimes called parallel query processing. For
example, suppose that a relation (table) has been partitioned across multiple
disks by range partitioning on some attribute and now user wants to ‘sort’ on
the partitioning attribute. The ‘sort’ operation can be implemented by sorting
each partition in parallel, then concatenating the sorted partitions to get the
final sorted relation. Thus, a query can be parallelised by parallelising
individual operations. Parallelism of operations is discussed in more detail
in sections 17.5.4 and 17.5.5.

Fig. 17.7 Intra-query parallelism

Fig. 17.7 shows an example of intra-query parallelism. Generally two


approaches are used in intra-query parallelism. In the first approach, each
CPU can execute the same task against some portion of the data. This
approach is the most common approach to parallel query processing in
commercial products. In the second approach, the task can be divided into
different subtasks, with each CPU executing a different subtask. Both
approaches presume that the data is partitioned across disks in an appropriate
manner.
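
A rough sketch of the first approach (Python multiprocessing; purely illustrative, with in-memory lists standing in for range-partitioned disk data):

from multiprocessing import Pool

# Sketch of intra-query parallelism: the same operation (sorting) runs on
# every range partition in parallel; concatenating the sorted partitions is
# valid here because the partitions are already range-ordered.

def sort_partition(partition):
    return sorted(partition)

if __name__ == "__main__":
    partitions = [[7, 3, 9], [15, 12, 11], [27, 21, 25]]   # range-partitioned data
    with Pool(processes=len(partitions)) as pool:
        sorted_parts = pool.map(sort_partition, partitions)
    result = [x for part in sorted_parts for x in part]    # final sorted relation
    print(result)   # [3, 7, 9, 11, 12, 15, 21, 25, 27]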

Advantages
Intra-query parallelism speeds up long-running queries.
It is beneficial for decision-support applications that issue complex, read-only queries,
including queries involving multiple joins.

17.5.3 Inter-query Parallelism


In inter-query parallelism, multiple transactions are executed in parallel,
one by each CPU. Inter-query parallelism is sometimes also called parallel transaction
processing. The primary use of inter-query parallelism is to scale up a
transaction-processing system to support a larger number of transactions per
second. Fig. 17.8 shows an example of inter-query parallelism. To support
inter-query parallelism, the DBMS generally uses some means of task or
transaction dispatching. This helps to ensure that incoming requests are
routed to the least busy processor, enabling the overall workload to be kept
balanced. However, it may be difficult to fully automate this process,
depending on the underlying hardware architecture of the computer. For
example, a shared-nothing architecture dictates that data stored on certain
disks be accessible only to certain CPUs. Therefore, requests that involve
this data cannot be dispatched to just any CPU.
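A rough sketch of the dispatching idea follows (Python; the query functions and the thread pool are placeholders, not a DBMS interface):

from concurrent.futures import ThreadPoolExecutor

# Sketch of inter-query parallelism: several independent queries are
# dispatched to a pool of workers and run concurrently, raising overall
# throughput; each individual query is no faster than before.

def query_account_balance(account_id):
    return ("balance", account_id)    # placeholder for a real query

def query_branch_total(branch):
    return ("total", branch)          # placeholder for a real query

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(query_account_balance, 101),
               pool.submit(query_branch_total, "Jamshedpur"),
               pool.submit(query_account_balance, 202)]
    print([f.result() for f in futures])
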
Efficient lock management is another method used by the DBMS to support
inter-query parallelism, particularly in a shared-disk architecture. Since each
query within inter-query parallelism is still run sequentially, this form of
parallelism does not help in speeding up long-running queries. The DBMS must
understand the locks held by different transactions executing on different CPUs
in order to preserve overall data integrity. If memory is shared among CPUs, lock
information can be kept in buffers in global memory and updated with little
overhead. However, if only disks are shared (and not memory), this lock
information must be kept on the only shared resource, that is disk. Inter-
query parallelism on shared-disk architecture performs best when
transactions that execute in parallel do not access the same data. The Oracle
8 and Oracle Rdb systems are examples of shared-disk parallel database
systems that support inter-query parallelism.
Fig. 17.8 Inter-query parallelism

17.5.3.1 Advantages
Easiest form of parallelism to support in a database system, particularly in shared-memory
parallel system.
Increased transaction throughput.
It scales up a transaction-processing system to support a larger number of transactions per
second.

17.5.3.2 Disadvantages
Response times of individual transactions are no faster than they would be if the transactions
were run in isolation.
It is more complicated in a shared-disk or shared-nothing architecture.

17.5.4 Intra-operation Parallelism


In intra-operation parallelism, we parallelise the execution of each individual
operation of a task, such as sorting, projection, join and so on.
Since the number of operations in a typical query is small, compared to
the number of tuples processed by each operation, intra-operation
parallelism scales better with increasing parallelism.

17.5.4.1 Advantages
Intra-operation parallelism is natural in a database.
Degree of parallelism is potentially enormous.
17.5.5 Inter-operation Parallelism
In inter-operation parallelism, the different operations in a query expression
are executed in parallel. The following two types of inter-operation
parallelism are used:
Pipelined parallelism
Independent parallelism

17.5.5.1 Pipelined Parallelism


In pipelined parallelism, the output tuples of one operation A are consumed
by a second operation B, even before the first operation has produced the
entire set of tuples in its output. Thus, it is possible to run operations A and B
simultaneously on different processors (CPUs), so that operation B
consumes tuples in parallel with operation A producing them. The major
advantage of pipelined parallelism in a sequential evaluation is that we can
carry out a sequence of such operations without writing any of the
intermediate results to disk.
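
A rough single-process sketch of the pipelining idea uses Python generators: operation B consumes tuples as operation A produces them (in a real parallel DBMS the two operations would run on different CPUs):

# Sketch of pipelined parallelism: the selection (operation B) consumes
# tuples while the scan (operation A) is still producing them, so no
# intermediate result needs to be written out.

def operation_a(relation):               # producer, e.g. a table scan
    for row in relation:
        yield row

def operation_b(rows, predicate):        # consumer, e.g. a selection
    for row in rows:
        if predicate(row):
            yield row

employees = [{"EMP-ID": 1, "SAL": 900}, {"EMP-ID": 2, "SAL": 1500}]
pipeline = operation_b(operation_a(employees), lambda r: r["SAL"] > 1000)
print(list(pipeline))                    # [{'EMP-ID': 2, 'SAL': 1500}]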

Advantages: Pipelined parallelism is useful with a small number of


CPUs. Also, pipelined executions avoid writing intermediate results to disk.

Disadvantages: Pipelined parallelism does not scale up well. First,


pipeline chains generally do not attain sufficient length to provide a high
degree of parallelism. Second, it is not possible to pipeline relational
operators that do not produce output until all inputs have been accessed.
Third, only marginal speed-up is obtained for the frequent cases in which
one operator’s execution cost is much higher than those of the others.

17.5.5.2 Independent Parallelism


In an independent parallelism, the operations in a query expression that do
not depend on one another can be executed in parallel.

Advantages: Independent parallelism is useful with a lower degree of


parallelism.
Disadvantages: Like pipelined parallelism, independent parallelism
does not provide a high degree of parallelism. It is less useful in a highly
parallel system.

REVIEW QUESTIONS
1. What do you mean by parallel processing and parallel databases? What are the typical
applications of parallel databases?
2. What are the advantages and disadvantages of parallel databases?
3. Discuss the architecture of parallel databases.
4. What is shared-memory architecture? Explain with a neat sketch. What are its benefits and
limitations?
5. What is shared-disk architecture? Explain with a neat sketch. What are its benefits and
limitations?
6. What is shared-nothing architecture? Explain with a neat sketch. What are its benefits and
limitations?
7. Discuss the key elements of parallel processing in brief.
8. What do you mean by speed-up and scale-up? What is the importance of linearity in speed-
up and scale-up? Explain with diagrams and examples.
9. What is synchronisation? Why is it necessary?
10. What is locking? How is locking performed?
11. What is query parallelism? What are its types?
12. What do you mean by data partitioning? What are the different types of partitioning
techniques?
13. For each of the partitioning techniques, give an example of a query for which that
partitioning technique would provide the fastest response.
14. In a range selection on a range-partitioned attribute, it is possible that only one disk may need
to be accessed. Describe the advantages and disadvantages of this property.
15. What form of parallelism (inter-query, inter-operation or intra-operation) is likely to be the
most important for each of the following tasks:

a. Increasing the throughput of a system with many small queries.


b. Increasing the throughput of a system with a few large queries, when the number of
disks and CPUs is large.

16. What do you mean by pipelined parallelism? Describe the advantages and disadvantages of
pipelined parallelism.
17. Write short notes on the following:

a. Hash partitioning.
b. Round-robin partitioning.
c. Range partitioning.
d. Schema partitioning.
18. Write short notes on the following:

a. Intra-query parallelism.
b. Inter-query parallelism.
c. Intra-operation parallelism.
d. Inter-operation parallelism.

STATE TRUE/FALSE

1. With a good scale-up, additional CPUs reduce system response time.


2. Synchronisation is necessary for correctness.
3. The key to successful parallel processing is to divide up tasks so that very little
synchronisation is necessary.
4. The more the synchronisation, the better the speed-up and scale-up.
5. With good speed-up, if transaction volumes grow, response time can be kept constant by
adding hardware resources such as CPUs.
6. Parallel database systems can make it possible to overcome the limitations, enabling a single
system to serve thousands of users.
7. In a parallel database system, multiple CPUs are working in parallel and are physically
located in a close environment in the same building and communicating at a very high speed.
8. Parallel processing divides a large task into many smaller tasks and executes the smaller
tasks concurrently on several nodes (CPUs).
9. The goal of parallel database systems is usually to ensure that the database system can
continue to perform at an acceptable speed, even as the size of the database and the number
of transactions increases.
10. In a shared-disk system, multiple CPUs are attached to an interconnection network through a
node and each CPU has a local memory and disk storage, but no two CPUs can access the
same disk storage area.
11. In a shared-nothing system, multiple CPUs are attached to an interconnection network and
each CPU has their own memory but all of them have access to the same disk storage or,
more commonly, to a shared array of disks.
12. In a shared-memory system, a computer has multiple simultaneously active CPUs that are
attached to an interconnection network and can access global main memory and a common
array of disk storage.
13. In a shared-memory architecture, a single copy of a multithreaded operating system and
multithreaded DBMS can support multiple CPUs.
14. In a shared-memory architecture, memory access uses a very high-speed mechanism that is
easy to partition without losing efficiency.
15. Shared-disk architecture is easy to load-balance.
16. Shared-nothing architectures are difficult to load-balance.
17. Speed-up is a property in which the time taken for performing a task increases in proportion
to the increase in the number of CPUs and disks in parallel.
18. Speed-up enables users to improve the system response time for their queries, assuming the
size of their databases remain roughly the same.
19. Scale-up is the ability of handling larger tasks by increasing the degree of parallelism
(providing more resources) in the same time period as the original system.
20. Scale-up enables users to increase the sizes of their databases while maintaining roughly the
same response time.
21. Hash partitioning prevents skewing.

TICK (✓) THE APPROPRIATE ANSWER

1. In which case is the query executed as a single large task?

a. Parallel processing
b. Centralised processing
c. Sequential processing
d. None of these.

2. What is the value of speed-up if the original system took 200 seconds to perform a task, and
two parallel systems took 50 seconds to perform the same task?

a. 2
b. 3
c. 4
d. None of these.

3. What is the value of scale-up if the original system can process 1000 transactions in a given
time, and the parallel system can process 3000 transactions in the same time?

a. 2
b. 3
c. 4
d. None of these.

4. Which of the following is the expansion of DLM?

a. Deadlock Limiting Manager


b. Dynamic Lock Manager
c. Distributed Lock Manager
d. None of these.

5. Which of the following is a benefit of a parallel database system?

a. Improved performance
b. Greater flexibility
c. Better availability
d. All of these.

6. The architecture having multiple CPUs working in parallel and physically located in a close
environment in the same building and communicating at very high speed is called
a. parallel database system.
b. distributed database system.
c. centralised database system.
d. None of these.

7. Parallel database system has the disadvantage of

a. more start-up cost.


b. interference problem.
c. skew problem.
d. All of these.

8. In a shared-memory system, a computer has

a. several simultaneously active CPUs that are attached to an interconnection network


and can share a single main memory and a common array of disk storage.
b. multiple CPUs attached to an interconnection network and each CPU has their own
memory but all of them have access to the same disk storage or to a shared array of
disks.
c. multiple CPUs attached to an interconnection network through a node and each
CPU has a local memory and disk storage, but no two CPUs can access the same
disk storage area.
d. None of these.

9. In a shared-disk system, a computer has

a. several simultaneously active CPUs that are attached to an interconnection network


and can share a single main memory and a common array of disk storage.
b. multiple CPUs attached to an interconnection network and each CPU has their own
memory but all of them have access to the same disk storage or to a shared array of
disks.
c. multiple CPUs attached to an interconnection network through a node and each
CPU has a local memory and disk storage, but no two CPUs can access the same
disk storage area.
d. None of these.

10. In a shared-nothing system, a computer has

a. several simultaneously active CPUs that are attached to an interconnection network


and can share a single main memory and a common array of disk storage.
b. multiple CPUs attached to an interconnection network and each CPU has their own
memory but all of them have access to the same disk storage or to a shared array of
disks.
c. multiple CPUs attached to an interconnection network through a node and each
CPU has a local memory and disk storage, but no two CPUs can access the same
disk storage area.
d. None of these.
11. The shared-memory architecture of parallel database system is closest to the

a. centralised database system.


b. distributed database system.
c. client/server system.
d. None of these.

12. In shared-nothing architecture, each CPU has its own copy of

a. DBMS.
b. portion of data managed by the DBMS.
c. operating system.
d. All of these.

13. A global locking system is required in

a. shared-disk architecture.
b. shared-nothing architecture.
c. shared-memory architecture.
d. None of these.

14. The scalability of shared-disk system is largely determined by the

a. capacity of the interconnection network mechanism.


b. throughput of the interconnection network mechanism.
c. Both (a) and (b).
d. None of these.

15. Locking is the

a. coordination of concurrent tasks.


b. method of synchronising concurrent tasks.
c. Both (a) and (b).
d. None of these.

16. Speed-up is a property in which the time taken for performing a task

a. decreases in proportion to the increase in the number of CPUs and disks in parallel.
b. increases in proportion to the increase in the number of CPUs and disks in parallel.
c. Both (a) and (b).
d. None of these.

17. Synchronisation is the

a. coordination of concurrent tasks.


b. method of synchronising concurrent tasks.
c. Both (a) and (b).
d. None of these.
18. Parallelism in which the relations are partitioned on multiple disks to reduce the retrieval
time of relations from disk is called

a. I/O parallelism.
b. inter-operation parallelism.
c. intra-query parallelism.
d. inter-query parallelism.

19. In intra-query parallelism

a. the execution of a single query is done in parallel on multiple CPUs using shared-
nothing parallel architecture technique.
b. multiple transactions are executed in parallel, one by each (CPU).
c. we parallelise the execution of each individual operation of a task, such as sorting,
projection, join and so on.
d. the different operations in a query expression are executed in parallel.

20. In inter-query parallelism,

a. the execution of a single query is done in parallel on multiple CPUs using shared-
nothing parallel architecture technique.
b. multiple transactions are executed in parallel, one by each (CPU).
c. we parallelise the execution of each individual operation of a task, such as sorting,
projection, join and so on.
d. the different operations in a query expression are executed in parallel.

21. In intra-operation parallelism,

a. the execution of a single query is done in parallel on multiple CPUs using shared-
nothing parallel architecture technique.
b. multiple transactions are executed in parallel, one by each (CPU).
c. we parallelise the execution of each individual operation of a task, such as sorting,
projection, join and so on.
d. the different operations in a query expression are executed in parallel.

22. In inter-operation parallelism,

a. the execution of a single query is done in parallel on multiple CPUs using shared-
nothing parallel architecture technique.
b. multiple transactions are executed in parallel, one by each (CPU).
c. we parallelise the execution of each individual operation of a task, such as sorting,
projection, join and so on.
d. the different operations in a query expression are executed in parallel.

FILL IN THE BLANKS


1. _____ divides larger tasks into many smaller tasks, and executes the smaller tasks
concurrently on several communication nodes.
2. Coordination of concurrent tasks is called _____.
3. _____ is the ability of a system N times larger to perform a task N times larger in the same
time period as the original system.
4. The architecture having multiple CPUs working in parallel and physically located in a close
environment in the same building and communicating at very high speed is called _____.
5. In a shared-memory architecture, communication between CPUs is extremely _____.
6. In a shared-memory architecture, the communication overheads are _____.
7. In a shared-disk architecture, the scalability of the system is largely determined by the _____
and _____ of the interconnection network mechanism.
8. High degree of scalability is offered by _____ architecture.
9. In a shared-nothing architecture, the costs of communication and non-local disk access are
_____.
10. In a shared-nothing architecture, the high-speed networks are limited in size, because of
_____ considerations.
11. Shared-nothing architectures are well suited for relatively cheap _____ technology.
12. The property in which the time taken for performing a task decreases in proportion to the
increase in the number of CPUs and disks in parallel is called _____.
13. Speed-up is directly proportional to _____ and inversely proportional to _____.
14. Scale-up is directly proportional to _____ and inversely proportional to _____.
15. Scale-up is the ability of handling larger tasks by increasing the _____ of _____ in the same
time period as the original system.
16. Scale-up enables users to increase the _____ of their databases while maintaining roughly the
same _____.
17. Synchronisation is the coordination of _____.
18. Locking is a method of synchronising concurrent tasks _____.
19. Skewing can be prevented by _____ partitioning.
Chapter 18

Distributed Database Systems

18.1 INTRODUCTION

In the previous chapter we discussed parallel database architectures in which


multiple CPUs are used in close vicinity (for instance, in a building). The
processors (CPUs) are tightly coupled and constitute a single database
system. In another architecture, multiple CPUs are loosely coupled and
geographically distributed at several sites, may be in different buildings or
cities, communicating relatively slowly by telephone lines, optical fibre
networks or satellite networks, with no sharing of physical components. The
databases operating in such an environment are known as distributed databases.
Distributed database technology has now become a reality due to the rapid
development in recent times in the field of networking and data
communication technology epitomised by the Internet, mobile and wireless
computing, and intelligent devices.
In this chapter, we will briefly discuss the key features of distributed
database management and its architectures. We will also discuss the query
processing techniques, concurrency control and recovery control mechanism
of distributed database environment.

18.2 DISTRIBUTED DATABASES

A distributed database system (DDBS) is a database physically stored on


several computer systems across several sites connected together via
communication network. Each site is typically managed by a DBMS that is
capable of running independently of the other sites. In other words, each site
is a database system site in its own right and has its own local users, its own
local DBMS, and its own local data communications manager. Each site has
its own transaction management software, including its own local locking,
logging and recovery software. Although geographically dispersed, a
distributed database system manages and controls the entire database as a
single collection of data. The location of data items and the degree of
autonomy of individual sites have a significant impact on all aspects of the
system, including query optimisation and processing, concurrency control
and recovery.
In a distributed database system, both data and transaction processing are
divided between one or more computers (CPUs) connected by a network,
each computer playing a special role in the system. The computers in the
distributed systems communicate with one another through various
communication media, such as high-speed networks or telephone lines. They
do not share main memory or disks. A distributed database system allows
applications to access data from local and remote databases. Distributed
database systems use client/server architecture to process information
requests. The computers in distributed system may vary in size and function,
ranging from workstations up to mainframe systems. The computers in a
distributed database system are referred to by a number of different names,
such as sites or nodes. The general structure of a distributed database system
is shown in Fig. 18.1.
Distributed database systems arose from the need to offer local database
autonomy at geographically distributed locations. For example, local
branches of multinational or national banks or a large company can have their
localised databases situated at different branches.
communication and networking systems triggered the development of
distributed database approach. It became possible to allow these distributed
systems to communicate among themselves, so that the data can be
effectively accessed among machines (computer systems) in different
geographical locations. Distributed database systems tie together pre-
existing systems in different geographical locations. As a result, the different
site machines are quite likely to be heterogeneous, with entirely different
individual architectures, for example, ORACLE database system on a Sun
Solaris UNIX system at one site, DB2 database system on an OS/390
machine at another, and Microsoft SQL on an NT machine at a third.
Ingres/Star, DB2 and Oracle, are some of the examples of commercial
distributed database management system (DDBMS).

Fig. 18.1 Distributed database architecture

18.2.1 Differences between Parallel and Distributed Databases


Distributed databases are also a kind of shared-nothing architecture as
discussed in the previous chapter, section 17.3.3. However, the major
differences exist in the mode of operation. Following are the main
differences between parallel and distributed databases:
Distributed databases are typically geographically separated, separately administered, and
have a slower interconnection.
In a distributed database system, local and global transactions are differentiated.

18.2.2 Desired Properties of Distributed Databases


Distributed database system should make the impact of data distribution
transparent. Distributed database systems should have the following
properties:
Distributed data independence.
Distributed transaction atomicity.

18.2.2.1 Distributed Data Independence


Distributed data independence property enables users to ask queries without
specifying where the referenced relations, or copies or fragments of the
relations, are located. This principle is a natural extension of physical and
logical data independence. Further, queries that span multiple sites should be
optimised systematically in a cost-based manner, taking into account
communication costs and difference in local computation costs.

18.2.2.2 Distributed Transaction Atomicity


Distributed transaction atomicity property enables users to write transactions
that access and update data at several sites just as they would write
transactions over purely local data. In particular, the effects of a transaction
across sites should continue to be atomic. That is, all changes persist if the
transaction commits, and none persist if it aborts.

18.2.3 Types of Distributed Databases


As discussed in previous sections, in distributed database system (DDBS),
data and software are distributed over multiple sites connected by
communication network. However, DDBS can describe various systems that
differ from one another in many respects depending on various factors, such
as, degree of homogeneity, degree of local autonomy, and so on. Following
two types of distributed databases are most commonly used:
Homogeneous DDBS.
Heterogeneous DDBS.

18.2.3.1 Homogeneous DDBS


Homogeneous DDBS is the simplest form of a distributed database where
there are several sites, each running their own applications on the same
DBMS software. All sites have identical DBMS software, all users (or
clients) use identical software, are aware of one another, and agree to
cooperate in processing users’ requests. The applications can all see the same
schema and run the same transactions. That is, there is location transparency
in homogeneous DDBS. The provision of location transparency forms the
core of distributed database management system (DDBMS) development.
Fig. 18.2 shows an example of homogeneous DDBS.
In homogeneous DDBS, the use of a single DBMS avoids any problems
of mismatched database capabilities between nodes, since the data is all
managed within a single framework. In homogeneous DDBS, local sites
surrender a portion of their autonomy in terms of their rights to change
schema or DBMS software.
Fig. 18.2 Homogeneous DDBS

18.2.3.2 Heterogeneous DDBS


In heterogeneous distributed database system, different sites run under the
control of different DBMSs, essentially autonomously and are connected
somehow to enable access to data from multiple sites. Different sites may
use different schemas and different DBMS software. The sites may not be
aware of one another and they may provide only limited facilities for
cooperation in transaction processing. In other words, in heterogeneous
DDBS, each server (site) is an independent and autonomous centralised
DBMS that has its own local users, local transactions, and database
administrator (DBA).
Heterogeneous distributed database system is also referred to as a multi-
database system or a federated database system (FDBS). Heterogeneous
database systems have well-accepted standards for gateway protocols to
expose DBMS functionality to external applications. The gateway protocols
help in masking the differences (such as capabilities, data formats and so on)
of accessing database servers, and bridge the differences between the
different servers in a distributed system. In heterogeneous FDBS, one server
may be a relational DBMS, another network DBMS, and a third an
ORDBMS or centralised DBMS.

18.2.4 Desired Functions of Distributed Databases


The distributed database management system (DDBMS) must be able to
provide the following additional functions as compared to a centralised
DBMS:
Fig. 18.3 Heterogeneous DDBS

Ability of keeping track of data, data distribution, fragmentation, and replication by
expanding the DDBMS catalogue.
Provide local autonomy.
Should be location independent.
Distributed catalogue management.
No reliance on a central site.
Ability of replicated data management to access and maintain the consistency of a replicated
data item.
Ability to manage distributed query processing to access remote sites and transmission of
queries and data among various sites via a communication network.
Ability of distributed transaction management by devising execution strategies for queries
and transactions that access data from several sites.
Should have fragmentation independence, that is users should be presented with a view of the
data in which the fragments are logically recombined by means of suitable JOINs and
UNIONs.
Should be hardware independent.
Should be operating system independent.
Efficient distributed database recovery management in case of site crashes and
communication failures.
Should be network independent.
Should be DBMS independent.
Proper management of security of data by providing authorized access privileges to users while
executing distributed transactions.

18.2.5 Advantages of Distributed Databases


Sharing of data where users at one site may be able to access the data residing at other sites
and at the same time retain control over the data at their own site.
Increased efficiency of processing by keeping the data close to the point where it is most
frequently used.
Efficient management of distributed data with different levels of transparency.
It enables the structure of the database to mirror the structure of the enterprise in which local
data can be kept locally, where it most logically belongs, while at the same time remote data
can be accessed when necessary.
Increased local autonomy where each site is able to retain a degree of control over data that are
stored locally.
Increased accessibility by allowing data to be accessed across several sites (for example,
accessing a New Delhi account while sitting in New York, and vice versa) via the
communication network.
Increased availability in which, if one site fails, the remaining sites may be able to continue
operating.
Increased reliability due to greater accessibility.
Improved performance.
Improved scalability.
Easier expansion with the growth of organization in terms of adding more data, increasing
database sites, or adding more CPUs.
Parallel evaluation by subdividing a query into sub-queries involving data from several sites.

18.2.6 Disadvantages of Distributed Databases


Recovery of failure is more complex.
Increased complexity in the system design and implementation.
Increased transparency leads to a compromise between ease of use and the overhead cost of
providing transparency.
Increased software development cost.
Greater potential for bugs.
Increased processing overhead.
Technical problems of connecting dissimilar machines.
Difficulty in database integrity control.
Security concerns for replicated data in multiple locations and for the network.
Lack of standards.

18.3 ARCHITECTURE OF DISTRIBUTED DATABASES

Following three architectures are used in distributed database systems:


Client/server Architecture.
Collaborating server system.
Middleware system.

18.3.1 Client/Server Architecture


Client/server architectures are those in which a DBMS-related workload is
split into two logical components namely client and server, each of which
typically executes on different systems. Client is the user of the resource
whereas the server is a provider of the resource. Client/server architecture
has one or more client processes and one or more server processes. The
applications and tools are put on one or more client platforms (generally,
personal computers or workstations) and are connected to database
management system that resides on the server (typically a large workstation,
midrange system, or a mainframe system). The applications and tools act as
‘client’ of the DBMS, making requests for its services. The DBMS, in turn,
services these requests and returns the results to the client(s). A client
process can send a query to any one-server process. Clients are responsible
for user-interface issues and servers manage data and execute transactions.
In other words, the client/server architecture can be used to implement a
DBMS in which the client is the transaction processor (TP) and the server is
the data processor (DP). A client process could run on a personal computer
and send queries to a server running on a mainframe computer. Most modern
information systems are based on the client/server architecture of computing.
Fig. 18.4 shows a schematic of client/server architecture.
Fig. 18.4 Client/server database architecture

Client/server architecture consists of the following main components:


Clients in form of intelligent workstations as the user’s contact point.
DBMS server as common resources performing specialised tasks for devices requesting their
services.
Communication networks connecting the clients and the servers.
Software applications connecting clients, servers and networks to create a single logical
architecture.

The client applications (which may be tools, vendor-written applications


or user-written applications) issue SQL statements for data access, just as
they do in centralised computing environment. The networking interface
enables client applications to connect to the server, send SQL statements and
receive results or error return code after the server has processed the SQL
statements. The applications themselves often make use of presentation
services, such as graphic user interface, on the client.
While writing client/server applications, it is important to remember the
boundary between the client and the server and to keep the communication
between them as set-oriented as possible. Application writing
(programming) is most often done using a host language (for example, C,
C++, COBOL and so on) with embedded data manipulation language
(DML) statements (for example, SQL), which are communicated to the
server.
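As an illustration of the set-oriented principle mentioned above, the following self-contained Python sketch contrasts a chatty, row-at-a-time interaction with a single set-oriented request. The StubServer class, its sample rows and the salary predicate are invented for illustration only and do not represent any particular product's API.

# A self-contained sketch of the client/server boundary discussed above.
# StubServer stands in for a remote DBMS; the point is the number of
# round trips, not any real product's interface.

class StubServer:
    def __init__(self, rows):
        self.rows = rows
        self.messages = 0                     # round trips over the "network"

    def execute(self, predicate):
        self.messages += 1
        return [r for r in self.rows if predicate(r)]

server = StubServer([{"EMP-ID": i, "EMP-SALARY": 1000 * i} for i in range(1, 101)])

# Chatty style: one request per row (100 round trips).
chatty = [row for i in range(1, 101)
          for row in server.execute(lambda r, i=i: r["EMP-ID"] == i)]

# Set-oriented style: one request describing the whole result set.
set_oriented = server.execute(lambda r: r["EMP-SALARY"] > 50000)

print(server.messages)   # 101 in total: 100 chatty calls plus 1 set-oriented call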

18.3.1.1 Benefits of Client/Server Database Architecture


This architecture is relatively simple to implement, due to its clean separation of functionality
and because the server is centralised.
Better adaptability to the computing environment to meet the ever-changing business needs
of the organisation.
Use of graphical user interface (GUI) on microcomputers by the user at client, improves
functionality and simplicity.
Architecture tends to be less expensive than alternative mini or mainframe solutions.
Expensive server machines are optimally utilised by relegating mundane user-interactions to
inexpensive client machines.
Considerable cost advantage to off-load application development from the mainframe to
powerful personal computers (PCs).
Computing platform independence.
Overall productivity improvement due to decentralised operations.
Use of PC as client provides numerous data analysis and query tools to facilitate interaction
with many of the DBMSs that are available on PC platform.
Improved performance with more processing power scattered throughout the organisation.

18.3.1.2 Limitations of Client/server Database Architecture


The client/server architecture does not allow a single query to span multiple servers because
the client process would have to be capable of breaking such a query into appropriate sub-
queries to be executed at different sites and then putting together the answers to the sub-
queries.
The client process is quite complex and its capabilities begin to overlap with the server. This
results in difficulty in distinguishing between clients and servers.
An increase in the number of users and processing sites often create security problems.

18.3.2 Collaborating Server Systems


In collaborating server architecture, there are several database servers, each
capable of running transactions against local data, which cooperatively
execute transactions spanning multiple servers. When a server receives a
query that requires access to data at other servers, it generates appropriate
sub-queries to be executed by other servers and puts the results together to
compute answers to the original query.
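A toy, self-contained Python sketch of this idea is given below: each "server" object holds a horizontal fragment of a relation, and the server that receives a spanning query runs its own sub-query and ships the remaining sub-queries to its peers before combining the answers. The class, method names and sample rows are purely illustrative and are not drawn from any real DDBMS.

# A toy sketch of collaborating servers: the receiving server answers the
# local part of the query and asks its peers for theirs, then unions the
# partial results.

class Server:
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows                       # local fragment (list of dicts)

    def run_local(self, predicate):
        return [r for r in self.rows if predicate(r)]

    def run_spanning_query(self, predicate, peers):
        result = self.run_local(predicate)     # local sub-query
        for peer in peers:
            result.extend(peer.run_local(predicate))   # remote sub-queries
        return result

mumbai = Server("Mumbai", [{"EMP-ID": 1, "DEPT-ID": 2}])
london = Server("London", [{"EMP-ID": 7, "DEPT-ID": 5}])

# A query issued at Mumbai that needs data from both sites:
print(mumbai.run_spanning_query(lambda r: r["EMP-ID"] > 0, [london]))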

18.3.3 Middleware Systems


The middleware database architecture, also called data access middleware,
is designed to allow a single query to span multiple servers, without
requiring all database servers to be capable of managing such multi-site
execution strategies. Data access middleware provides users with a
consistent interface to multiple DBMSs and file systems in a transparent
manner. Data access middleware simplifies a heterogeneous environment for
programmers and provides users with an easier means of accessing live data
in multiple sources. It eliminates the need for programmers to code many
environment specific requests or calls in any application that needs access to
current data rather than copies of such data. The direct requests or calls for
data movement to several DBMSs are handled by the middleware, and hence a
major rewrite of the application program is not required.
The middleware is basically a layer of software, which works as a special
server and coordinates the execution of queries and transactions across one
or more independent database servers. The middleware layer is capable of
executing joins and other relational operations on data obtained from the
other servers, but typically, does not itself maintain any data. Middleware
provides an application with a consistent interface to some underlying
services, shielding the application from different native interfaces and
complexities required to execute the services. Middleware might be
responsible for routing a local request to one or more remote servers,
translating the request from one SQL dialect to another as needed,
supporting various networking protocols, converting data from one format
to another, coordinating work among various resource managers and
performing other functions.
Fig. 18.5 Data access middleware architecture

Fig. 18.5 illustrates sample data access middleware architecture. Data


access middleware architecture consists of middleware application
programming interface (API), middleware engine, drivers and native
interfaces. The application programming interface (API) usually consists of
a series of available function calls as well as a series of data access
statements (dynamic SQL, QBE and so on). The middleware engine is
basically the component that routes requests to the various drivers and
performs other functions. It handles data access
requests that have been issued. Drivers are used to connect various back-end
data sources and they translate requests issued through the middleware API
to a format intelligible to the target data source. Translation service may
include SQL translation, data type translation and error messages and return
code translation.
Many data access middleware products have client/server architecture and
access data residing on multiple remote systems. Therefore, networking
interfaces may be provided between the client and the middleware, as well
as between the middleware and data sources. Specific configurations of
middleware vary from product to product. Some are largely client-centric,
with the middleware engine and drivers residing on a client workstation or
PC. Others are largely server-centric, with a small layer of software on the
client provided to connect it into the remainder of the middleware solution,
which resides primarily on a LAN server or host system.
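The routing role of the middleware engine and its drivers can be sketched, under simplifying assumptions, in the short Python example below. The driver classes merely tag the request with a dialect name to stand in for real SQL, data type and error-code translation; the class names and data source names are hypothetical.

# A minimal sketch of data access middleware: one engine, several drivers,
# each driver responsible for its own back-end "dialect". Purely illustrative.

class OracleDriver:                      # placeholder driver
    def run(self, request):
        return f"[ORACLE dialect] {request}"

class DB2Driver:                         # placeholder driver
    def run(self, request):
        return f"[DB2 dialect] {request}"

class MiddlewareEngine:
    def __init__(self):
        self.drivers = {}

    def register(self, source_name, driver):
        self.drivers[source_name] = driver

    def query(self, source_name, request):
        # Route the request to the right driver; the driver handles the
        # translation needed for its data source.
        return self.drivers[source_name].run(request)

engine = MiddlewareEngine()
engine.register("sales", OracleDriver())
engine.register("inventory", DB2Driver())
print(engine.query("sales", "SELECT * FROM ORDERS"))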

18.4 DISTRIBUTED DATABASE SYSTEM (DDBS) DESIGN

The design of a distributed database system is a complex task. Therefore, a


careful assessment of the strategies and objectives is required. Some of the
strategies and objectives that are common to most DDBS designs are as
follows:
Data fragmentation, which are applied to relational database system to partition the relations
among network sites.
Data allocation, in which each fragment is stored at the site with optimal distribution.
Data replication, which increases the availability and improves the performance of the
system.
Location transparency, which enables a user to access data without knowing, or being
concerned with, the site at which the data resides. The location of the data is hidden from the
user.
Replication transparency, meaning that when more than one copy of the data exists, one copy
is chosen while retrieving data and all other copies are updated when changes are being
made.
Configuration independence, which enables the organisation to add or replace hardware
without changing the existing software components of the DBMS. It ensures the
expandability of existing system when its current hardware is saturated.
Support for non-homogeneous DBMSs, which helps in integrating databases maintained by different
DBMSs at different sites on different computers (as explained in section 18.2.3).

Data fragmentation and data replication are the techniques most commonly
used during the process of DDBS design to break up the database into logical
units and to store certain data at more than one site. These two techniques are
discussed below in detail.
18.4.1 Data Fragmentation
Technique of breaking up the database into logical units, which may be
assigned for storage at the various sites, is called data fragmentation. In the
data fragmentation, a relation can be partitioned (or fragmented) into several
fragments (pieces) for physical storage purposes and there may be several
replicas of each fragment. These fragments contain sufficient information to
allow reconstruction of the original relation. All fragments of a given
relation will be independent: none of the fragments can be derived from the
others, nor does any of them have a restriction or a projection that can be derived from the others.
For example, let us consider an EMPLOYEE relation as shown in table 18.1.

Table 18.1 Relation EMPLOYEE

Now this relation can be fragmented into three fragments as follows:

FRAGMENT EMPLOYEE AS
MUMBAI_EMP AT SITE ‘Mumbai’ WHERE DEPT-ID = 2
JAMSHEDPUR_EMP AT SITE ‘Jamshedpur’ WHERE DEPT-ID = 4
LONDON_EMP AT SITE ‘London’ WHERE DEPT-ID = 5;

The above fragmented relation will be stored at various sites as shown in


Fig. 18.6 in which the tuples (records or rows) for ‘Mumbai’ employees
(with DEPT-ID = 2) are stored at the Mumbai site, tuples for ‘Jamshedpur’
employees (with DEPT-ID = 4) are stored at the Jamshedpur site and tuples
for ‘London’ employees (with DEPT-ID = 5) are stored at the London site. It
can be noted in this example that the distributed database system’s internal
fragment names are MUMBAI_EMP, JAMSHEDPUR_EMP and
LONDON_EMP. Reconstruction of the original relation is done via suitable
JOIN and UNION operations.
A system that supports data fragmentation should also support
fragmentation independence (also called fragmentation transparency). That
means, users should not be logically concerned about fragmentation. The
users should have a feeling as if the data were not fragmented at all. DDBS
insulates the user from knowledge of the data fragmentation. In other words,
fragmentation independence implies that users will be presented with a view
of the data in which the fragments are logically recombined by means of
suitable JOINs and UNIONs. It is the responsibility of the system optimiser
to determine which fragments need to be physically accessed in order to
satisfy any given user request. Following are the different schemes for
fragmenting a relation:
Horizontal fragmentation
Vertical fragmentation
Mixed fragmentation

18.4.1.1 Horizontal Fragmentation


A horizontal fragment of a relation is a subset of the tuples (rows) with all
attributes in that relation. Horizontal fragmentation splits the relation
‘horizontally’ by assigning each tuple or group (subset) of tuples of a
relation to one or more fragments, where each tuple or a subset has a certain
logical meaning. These fragments can then be assigned to different sites in
the distributed system. A horizontal fragmentation is produced by specifying
a predicate that performs a restriction on the tuples in the relation. It is
defined using the SELECT operation of the relational algebra, as discussed
in chapter 4, section 4.4. A horizontal fragmentation may be defined as:

σP (R)

where σ = relational algebra operator for selection
P = predicate based on one or more attributes of the relation
R = a relation (table)

Fig. 18.6 An example of horizontal data fragmentation

The fragmentation example of Fig. 18.6 is a horizontal fragmentation and


can be written in terms of relational algebra as:

MUMBAI_EMP : σDEPT-ID=2 (EMPLOYEE)
JAMSHEDPUR_EMP : σDEPT-ID=4 (EMPLOYEE)
LONDON_EMP : σDEPT-ID=5 (EMPLOYEE)

Horizontal fragmentation corresponds to the relational operation of
restriction (selection). In horizontal fragmentation, a UNION operation is done to
reconstruct the original relation.
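The following short Python sketch mirrors this example with a few illustrative EMPLOYEE tuples: a selection predicate on DEPT-ID produces the three horizontal fragments, and a UNION of the fragments reconstructs the original relation. The sample rows are invented for illustration only.

# Horizontal fragmentation by a predicate on DEPT-ID, followed by
# reconstruction of the original relation through a UNION of the fragments.

EMPLOYEE = [
    {"EMP-ID": 101, "EMP-NAME": "Anil",  "DEPT-ID": 2},
    {"EMP-ID": 102, "EMP-NAME": "Sunil", "DEPT-ID": 4},
    {"EMP-ID": 103, "EMP-NAME": "John",  "DEPT-ID": 5},
]

def select(relation, predicate):            # sigma_P(R)
    return [t for t in relation if predicate(t)]

MUMBAI_EMP     = select(EMPLOYEE, lambda t: t["DEPT-ID"] == 2)
JAMSHEDPUR_EMP = select(EMPLOYEE, lambda t: t["DEPT-ID"] == 4)
LONDON_EMP     = select(EMPLOYEE, lambda t: t["DEPT-ID"] == 5)

# Reconstruction: the UNION of the horizontal fragments gives back EMPLOYEE.
reconstructed = MUMBAI_EMP + JAMSHEDPUR_EMP + LONDON_EMP
assert sorted(t["EMP-ID"] for t in reconstructed) == [101, 102, 103]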

18.4.1.2 Vertical Fragmentation


Vertical fragmentation splits the relation by decomposing ‘vertically’ by
columns (attributes). A vertical fragment of a relation keeps only certain
attributes of the relation at a particular site, because each site may not need
all the attributes of a relation. Thus, vertical fragmentation groups together
the attributes in a relation that are used jointly by the important transactions.
A simple vertical fragmentation that merely splits the attributes is not
sufficient when the two fragments are stored separately: since there is no common
attribute between the two fragments, we cannot put the original employee tuples back together.
Therefore, it is necessary to include the primary key (or candidate key)
attributes in every vertical fragment so that the full relation can be
reconstructed from the fragments. In vertical fragmentation, system-
provided ‘tuple-ID’ (or TID) is used as the primary key (or candidate key)
attribute along with the stored relation as address for linking tuples. Vertical
fragmentation corresponds to the relational operations of projection and is
defined as

∏a1, a2, …, an (R)

where ∏ = relational algebra operator for projection
a1, a2, …, an = attributes of the relation
R = a relation (table)

For example, the relation EMPLOYEE of table 18.1 can be vertically


fragmented as follows:
FRAGMENT EMPLOYEE AS
MUMBAI_EMP (TID, EMP-ID, EMP-NAME) AT SITE ‘Mumbai’
JAMSHEDPUR_EMP (TID, DEPT-ID) AT SITE ‘Jamshedpur’
LONDON_EMP (TID, EMP-SALARY) AT SITE ‘London’;

The above vertical fragmentation can be written in terms of relational
algebra as:

MUMBAI_EMP : ∏TID, EMP-ID, EMP-NAME (EMPLOYEE)
JAMSHEDPUR_EMP : ∏TID, DEPT-ID (EMPLOYEE)
LONDON_EMP : ∏TID, EMP-SALARY (EMPLOYEE)
The above fragmented relation will be stored at various sites as shown in


Fig. 18.7 in which the attributes (TID, EMP-ID, EMP-NAME) for ‘Mumbai’
employees are stored at the Mumbai site, attributes (TID, DEPT-ID) for
‘Jamshedpur’ employees are stored at the Jamshedpur site and attributes
(TID, EMP-SALARY) for ‘London’ employees are stored at the London
site. JOIN operation is done to reconstruct the original relation.

Fig. 18.7 An example of vertical fragmentation
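A matching Python sketch of vertical fragmentation is given below: each fragment retains the TID together with a subset of the attributes, and the original tuples are reconstructed by joining the fragments on TID. The sample tuples and salary values are illustrative only.

# Vertical fragmentation: each fragment keeps TID plus some attributes, and
# the original relation is rebuilt by joining the fragments on TID.

EMPLOYEE = [
    {"TID": 1, "EMP-ID": 101, "EMP-NAME": "Anil", "DEPT-ID": 2, "EMP-SALARY": 50000},
    {"TID": 2, "EMP-ID": 102, "EMP-NAME": "John", "DEPT-ID": 5, "EMP-SALARY": 60000},
]

def project(relation, attrs):               # pi_attrs(R)
    return [{a: t[a] for a in attrs} for t in relation]

MUMBAI_EMP     = project(EMPLOYEE, ["TID", "EMP-ID", "EMP-NAME"])
JAMSHEDPUR_EMP = project(EMPLOYEE, ["TID", "DEPT-ID"])
LONDON_EMP     = project(EMPLOYEE, ["TID", "EMP-SALARY"])

def join_on_tid(r, s):
    # Natural join on the common TID attribute.
    return [{**x, **y} for x in r for y in s if x["TID"] == y["TID"]]

# Reconstruction: JOIN the vertical fragments back together on TID.
reconstructed = join_on_tid(join_on_tid(MUMBAI_EMP, JAMSHEDPUR_EMP), LONDON_EMP)
assert reconstructed == EMPLOYEE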

18.4.1.3 Mixed Fragmentation


Sometimes, horizontal or vertical fragmentation of a database schema by itself
is insufficient to adequately distribute the data for some applications.
Instead, mixed or hybrid fragmentation is required. Thus, horizontal (or
vertical) fragmentation of a relation, followed by further vertical (or
horizontal) fragmentation of some of the fragments, is called mixed
fragmentation. A mixed fragmentation is defined using the selection
(SELECT) and projection (PROJECT) operations of the relational algebra.
The original relation is obtained by a combination of JOIN and UNION
operations. A mixed fragmentation is given as

σP (∏a1, a2, …, an (R))

or

∏a1, a2, …, an (σP (R))

In the example of vertical fragmentation, the relation EMPLOYEE was
vertically fragmented as
Fig. 18.8 An example of vertical fragmentation

S1 = MUMBAI_EMP : ∏TID, EMP-ID, EMP-NAME (EMPLOYEE)
S2 = JAMSHEDPUR_EMP : ∏TID, DEPT-ID (EMPLOYEE)
S3 = LONDON_EMP : ∏TID, EMP-SALARY (EMPLOYEE)

We could now horizontally fragment S2, for example, according to
DEPT-ID as follows:

S21 = σDEPT-ID=2 (S2) : σDEPT-ID=2 (JAMSHEDPUR_EMP)
S22 = σDEPT-ID=4 (S2) : σDEPT-ID=4 (JAMSHEDPUR_EMP)
S23 = σDEPT-ID=5 (S2) : σDEPT-ID=5 (JAMSHEDPUR_EMP)

Fig. 18.8 shows an example of mixed fragmentation.

18.4.2 Data Allocation


Data allocation describes the process of deciding where to locate (or place)
data among the several sites. Following are the data placement strategies that are
used in distributed database systems:
Centralised
Partitioned or fragmented
Replicated

In the case of the centralised strategy, the entire database and the DBMS
are stored at one site. However, users are geographically distributed across the
network. Locality of reference is lowest as all sites, except the central site,
have to use the network for all data accesses. Thus, the communication costs
are high. Since the entire database resides at one site, there is loss of the
entire database system in case of failure of the central site. Hence, the
reliability and availability are low.
In partitioned or fragmented strategy, database is divided into several
disjoint parts (fragments) and stored at several sites. If the data items are
located at the site where they are used most frequently, locality of reference
is high. As there is no replication, storage costs are low. The failure of
system at a particular site will result in the loss of data of that site. Hence,
the reliability and availability are higher than centralised strategy. However,
overall reliability and availability are still low. The communication cost is
low and overall performance is good as compared to centralised strategy.
In replication strategy, copies of one or more database fragments are
stored at several sites. Thus, the locality of reference, reliability and
availability and performance are maximised. But, the communication and
storage costs are very high.

18.4.3 Data Replication


Data replication is a technique that permits storage of certain data in more
than one site. The system maintains several identical replicas (copies) of the
relation and stores each replica at a different site. Typically, data replication
is introduced to increase the availability of the system. When a copy is not
available due to site failure(s), it should be possible to access another copy.
For example, with reference to Fig. 18.6, the data can be replicated as:

REPLICATE LONDON_EMP AS
LONMUM_EMP AT SITE ‘Mumbai’
REPLICATE MUMBAI_EMP AS
MUMLON_EMP AT SITE ‘London’

Fig. 18.9 shows an example of replication. Like fragmentation, data


replication should also support replication independence (also known as
replication transparency). That means, users should be able to behave as if
the data were in fact not replicated at all. Replication independence
simplifies user program and terminal activities. It allows replicas to be
created and destroyed at any time in response to changing requirements,
without invalidating any of those user programs or activities. It is the
responsibility of the system optimiser to determine which replicas physically
need to be accessed in order to satisfy any given user request.

Fig. 18.9 An example of mixed fragmentation
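To make replication transparency concrete, the following minimal Python sketch keeps one copy of a logical data item per site, reads from any available copy, and propagates every write to all available copies. The in-memory dictionary of sites is only a stand-in for real site databases; the site names and values are illustrative.

# Read-one/write-all replication: the user reads and writes a logical item
# without naming a site; the layer underneath hides the copies.

class ReplicatedItem:
    def __init__(self, sites, initial):
        # One copy of the item per site, e.g. {"Mumbai": v, "London": v}.
        self.copies = {site: initial for site in sites}
        self.available = set(sites)

    def read(self):
        # Any available copy will do; the user never names a site.
        site = next(iter(self.available))
        return self.copies[site]

    def write(self, value):
        # Write-all: every available copy must be updated, which is the extra
        # update overhead that replication introduces.
        for site in self.available:
            self.copies[site] = value

item = ReplicatedItem(["Mumbai", "London"], initial=0)
item.write(42)
item.available.discard("London")      # simulate a site failure
assert item.read() == 42              # data still available at Mumbai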


18.4.3.1 Advantages of Data Replication
Data replication enhances the performance of read operations by increasing the processing
speed at site. That means, with data replication, applications can operate on local copies
instead of having to communicate with remote sites.
Data replication increases the availability of data to read-only transactions. That means, a
given replicated object remains available for processing, at least for retrieval, so long as at
least one copy remains available.

18.4.3.2 Disadvantages of Data Replication


Increased overheads for update transactions. That means, when a given replicated object is
updated, all copies of that object must be updated.
More complexity in controlling concurrent updates by several transactions to replicated data.

18.5 DISTRIBUTED QUERY PROCESSING

In a DDBMS, a query may require data from the databases distributed in


more than one site. Some database systems support relational databases whose
parts are physically separated. Different relations might reside at different
sites, multiple copies of a single relation can be distributed among several
sites, or one relation might be partitioned into sub-relations and these sub-
relations distributed to multiple sites. In order to evaluate a query issued at a
given site, it may be necessary to transfer data between various sites.
Therefore, it is important here to minimise the time required to evaluate
such a query, which is largely made up of the time spent in transmitting data
between sites rather than the time spent on retrieval from disk storage or on
computation.

18.5.1 Semi-JOIN
In a distributed query processing, the transmission or communication cost is
high. Therefore, semijoin operation is used to reduce the size of a relation
that needs to be transmitted and hence the communication costs. Let us
suppose that the relation R (EMPLOYEE) and S (PROJECT) are stored at
site C (Mumbai) and site B (London), respectively as shown in Fig. 18.10. A
user issues a query at site C to prepare a project allocation list, which
requires the computation JOIN of the two relations given as
JOIN (R, S)
or JOIN (EMPLOYEE, PROJECT)

One way of joining the above relations is to transmit all attributes of


relation S (PROJECT) at site C (Mumbai) and compute the JOIN at site C.
This would involve the transmission of all 12 values of relation S
(PROJECT) and will have a high communication cost.
Another way would be by first projecting the relation S (PROJECT) at
site B (London) on the attribute EMP-ID and transmitting the result to site C
(Mumbai), which can be computed as:

X = ∏EMP-ID (S)
or X = ∏EMP-ID (PROJECT)

The result of the projection operation is shown in Fig. 18.11. Now at site
C (Mumbai), those tuples of relation R (EMPLOYEE) are selected that have
the same value for the attribute EMP-ID as a tuple in X = ∏EMP-ID
(PROJECT) by a JOIN and can be computed as

Y = JOIN (R, X)
or Y = EMPLOYEE ⋈ X
Fig. 18.10 An example of data replication

Fig. 18.11 Obtaining a join using semijoin


The entire operation of first projecting the relation S (PROJECT) and then
performing the join is called ‘semijoin’ and is denoted by ⋉. This means that

Y = EMPLOYEE ⋉ PROJECT
≅ EMPLOYEE ⋈ X

The result of the semijoin operation is shown in Fig. 18.12. But, as can be
seen, the desired result is not obtained after the semijoin operation. The
semijoin operation reduces the number of tuples of relation R (EMPLOYEE)
that have to be transmitted to site B (London). The final result is obtained by
joining of the reduced relation R (EMPLOYEE) and relation S (PROJECT)
as shown in Fig. 18.12 and can be computed as
R⋈S = Y⋈S
or EMPLOYEE ⋈ PROJECT = Y ⋈ PROJECT

The semijoin operator (⋉) is used to reduce the communication cost. If, Z
is the result of the semijoin of relations R and S, then semijoin can be
defined as

Z = R⋉S

Z represents the set of tuples of relation R that join with some tuple(s) in
relation S. Z does not contain tuples of relation R that do not join with any
tuple in relation S. Thus, Z represents the reduced R that can be transmitted
to a site of S for a join with it. If the join of R and S is highly selective, the
size of Z would be a small proportion of the size of R. To get the join of R
and S, we now join Z with S, which is given as

T = Z⋈S
= (R ⋉ S) ⋈ S
= (S ⋉ R) ⋈ R
= (R ⋉ S) ⋈ (S ⋉ R)
Fig. 18.12 Result of projection operation at site B

Fig. 18.13 Result of semijoin operation at site C


The semijoin is a reduction operator and R ⋉ S can be read as R semijoin
S or the reduction of R by S. It is to be noted that the semijoin operation is
not commutative. That means, in our example of relations EMPLOYEE and
PROJECT in Fig. 18.10, EMPLOYEE ⋉ PROJECT is not the same as
PROJECT ⋉ EMPLOYEE. The former produces a reduction in the number
of tuples of EMPLOYEE, while the latter, in this example, is the same relation as PROJECT.
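The complete strategy described above can be traced in the following self-contained Python sketch, in which only the projection X and the reduced relation Y need to cross the "network". The EMPLOYEE and PROJECT tuples are invented for illustration and do not reproduce the figures.

# Computing a join via a semijoin to reduce transmission between sites.

EMPLOYEE = [  # at site C (Mumbai)
    {"EMP-ID": 101, "EMP-NAME": "Anil"},
    {"EMP-ID": 102, "EMP-NAME": "Sunil"},
    {"EMP-ID": 103, "EMP-NAME": "John"},
]
PROJECT = [   # at site B (London)
    {"PROJ-ID": "P1", "EMP-ID": 101},
    {"PROJ-ID": "P2", "EMP-ID": 103},
]

# Step 1 (at B): X = projection of PROJECT on EMP-ID, transmitted to C.
X = {t["EMP-ID"] for t in PROJECT}

# Step 2 (at C): Y = EMPLOYEE semijoin PROJECT, i.e. the EMPLOYEE tuples
# whose EMP-ID appears in X; only Y is transmitted back to B.
Y = [t for t in EMPLOYEE if t["EMP-ID"] in X]

# Step 3 (at B): final join of the reduced EMPLOYEE (Y) with PROJECT.
result = [{**e, **p} for e in Y for p in PROJECT if e["EMP-ID"] == p["EMP-ID"]]
print(result)   # two joined tuples; only |X| + |Y| tuples crossed the network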

18.6 CONCURRENCY CONTROL IN DISTRIBUTED DATABASES

As we have discussed, the database of a distributed system resides at several
sites, and control of data integrity therefore becomes more problematic. Since data are
distributed, the transaction activities may take place at a number of sites and
it can be difficult to maintain a time ordering among actions. The
concurrency control becomes difficult when two or more transactions are
executing concurrently (at the same time) and both require access to the
same data record in order to complete the processing.
Fig. 18.14 Final result by joining Y and S at site B
In a DDBS, there may be multiple copies of the same record due to the
existence of data fragmentation and data replication. Therefore, all copies
must have the same value at all times, or else transactions may operate on
inaccurate data. Most concurrency control algorithms for DDBSs use some
form of check to see that the result of a transaction is the same as if its
actions were executed serially. All concurrency control mechanisms must
ensure that the consistency of data items is preserved, and that each atomic
action is completed in a finite time. A good concurrency control mechanism
for distributed DBMSs should have the following characteristics:
Resilient to site and communication failures.
Permit parallelism to satisfy performance requirements.
Modest (minimum possible) computational storage overheads.
Satisfactory performance in a network environment having communication delays.

The concurrency control discussed in chapter 12, section 12.3 can be


modified for use in distributed environment. In DDBSs, concurrency control
can be achieved by use of locking or by timestamping.

18.6.1 Distributed Locking


Locking is the simplest method of concurrency control in DDBSs. As
discussed in chapter 12, section 12.4, different locking types can be applied
in distributed locking. In DDBS, the lock manager function is distributed
over several sites. The DDBS maintains a lock manager at each site whose
function is to administer the lock and unlock requests for those data items
that are stored at that site. In case of distributed locking, a transaction sends
a message to the lock manager site requesting appropriate locks on specific
data items. If the request for the lock could be granted immediately, the lock
manager replies granting the request. If the request is incompatible with the
current state of locking of the requested data items, the request is delayed
until it can be granted. Once it has determined that the lock request can be
granted, the lock manager sends a message back to the initiator that it has
granted the lock request.
As in the case of centralised databases, the lock can be applied in two
modes namely shared mode (S-lock) and exclusive mode (X-lock). If a
transaction locks a record (tuple) in shared mode (also called read lock), the
data items or record from any site containing a copy of it is locked and then
read. In the shared mode of locking, a transaction can read that record but
cannot update the record. If a transaction locks a record in exclusive mode (also
called write lock), all copies of the data item or record have to be modified
and locked. In the exclusive mode of locking, a transaction can both read and
update the record, and no other transaction can access the record while it is
exclusively locked. At no time can two transactions hold exclusive locks on
the same record. However, any number of transactions should be able to
acquire shared locks on the same record at the same time.
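A minimal Python sketch of such a per-site lock manager is shown below: shared (S) locks are compatible with other shared locks, exclusive (X) locks are incompatible with everything, and an incompatible request is simply refused so that the caller can wait. Queueing, timeouts and deadlock handling are deliberately omitted; the transaction and item names are illustrative.

# One lock manager per site, administering S and X locks on local data items.

class LockManager:
    def __init__(self):
        self.locks = {}                  # item -> (mode, set of txn ids)

    def request(self, txn, item, mode):  # mode is "S" or "X"
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True                              # granted
        held_mode, holders = held
        if mode == "S" and held_mode == "S":
            holders.add(txn)                         # S is compatible with S
            return True
        return False                                 # incompatible: caller waits

    def release(self, txn, item):
        mode, holders = self.locks[item]
        holders.discard(txn)
        if not holders:
            del self.locks[item]

lm = LockManager()
assert lm.request("T1", "D", "S") and lm.request("T2", "D", "S")
assert not lm.request("T3", "D", "X")    # X conflicts with the shared holders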

18.6.1.1 Advantages
Simple implementation.
Reduces the degree of bottleneck.
Reasonably low overhead, requiring two message transfers for handling lock requests, and
one message transfer for handling unlock requests.

18.6.1.2 Disadvantages
More complex deadlock handling because the lock and unlock requests are not made at a
single site.
Possibility of inter-site deadlocks even when there is no deadlock within a single site.

18.6.2 Distributed Deadlock


Concurrency control with a locking-based algorithm may result in
deadlocks, as discussed in chapter 12, section 12.4.3. As in the centralised
DBMS, deadlock must be detected and resolved in a DDBS by aborting
some deadlocked transactions. In a DDBS, each site maintains a local wait-for
graph (LWFG) and a cycle in the local graph indicates a deadlock. However,
there can be a deadlock even if no local graph contains a cycle.
Let us consider a distributed database system with four sites and full data
replication. Suppose that transactions T1 and T2 wish to lock data item D in
exclusive mode (X-lock). Transaction T1 may succeed in locking data item D
at sites S1 and S3, while transaction T2 may succeed in locking data item D at
sites S2 and S4. Each transaction then must wait to acquire the third lock and
hence a deadlock has occurred. Such deadlocks can be avoided easily by
requiring all sites to request locks on replicas of a data item in the same
predetermined order. One simple method of recovering from deadlock
situation is to allow a transaction to wait for a finite amount of time for an
incompatibly locked data item. If at the end of that time the resource is still
locked, the transaction is aborted. The period of time should be neither too
short nor too long.
In a distributed system, the detection of a deadlock requires the generation
of not only a local wait-for graph (LWFG) for each site, but also a global
wait-for graph (GWFG) for the entire system. However, the GWFG has the
disadvantage of the overhead required in generating such graphs.
Furthermore, a deadlock detection site has to be chosen where the GWFG is
created. This site becomes the location for detecting deadlocks and selecting
the transactions that have to be aborted to recover from deadlock.
In a distributed database system, deadlock prevention methods that work by
aborting a transaction, such as the timestamp-based wait-die and wound-wait
methods, can be used. The aborted transactions are reinitiated with the
original timestamp to allow them to eventually run to completion.
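The need for a global wait-for graph can be illustrated with the short Python sketch below, in which neither local graph contains a cycle but their union does. The transaction names and edges are illustrative only.

# Merging local wait-for graphs into a global wait-for graph (GWFG) and
# checking it for a cycle, which signals a distributed deadlock.

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)

    def visit(node, path):
        if node in path:
            return True
        return any(visit(nxt, path | {node}) for nxt in graph.get(node, ()))

    return any(visit(n, set()) for n in graph)

lwfg_site1 = [("T1", "T2")]       # at site 1, T1 waits for T2
lwfg_site2 = [("T2", "T1")]       # at site 2, T2 waits for T1
gwfg = lwfg_site1 + lwfg_site2    # union of the local graphs

assert not has_cycle(lwfg_site1) and not has_cycle(lwfg_site2)
assert has_cycle(gwfg)            # global deadlock, invisible locally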

18.6.3 Timestamping
As discussed in chapter 12, section 12.5, timestamping is a method of
identifying messages with their time of transaction. In the DDBSs, each
copy of the data item contains two timestamp values, namely read timestamp
and the write timestamp. Also, each transaction in the system is assigned a
timestamp value that determines its serialisability order.
In distributed systems, each site generates unique local timestamp using
either a logical counter or the local clock and concatenates it with the site
identifier. If the local timestamp were unique, its concatenation with the
unique site identifier would make the global timestamp unique across the
network. The global timestamp is obtained by concatenating the unique local
timestamp with the site identifier, which also must be unique. The site
identifier must be the least significant digits of the timestamp so that the
events can be ordered according to their occurrence and not their location.
Thus, this ensures that the global timestamps generated in one site are not
always greater than those generated in another site.
There could be a problem if one site generates local timestamps at a rate
faster than that of the other sites. Therefore, a mechanism is required to
ensure that local timestamps are generated fairly across the system and
synchronised. The synchronisation is achieved by including the timestamp in
the messages (called logical timestamp) sent between sites. On receiving a
message, a site compares its clock or counter with the timestamp contained
in the message. If it finds its clock or counter slower, it sets it to some value
greater than the message timestamp. In this way, an inactive site’s counter or
a slower clock gets synchronised with the others at the first message
interaction with another site.
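The following Python sketch, under the simplifying assumption of a logical counter per site, shows how the (counter, site-identifier) pair yields globally unique timestamps with the site identifier as the least significant part, and how a slow site advances its counter when it receives a message carrying a larger timestamp. The site identifiers and loop counts are illustrative.

# Global timestamps as (local counter, site id), with Lamport-style
# synchronisation when a message carrying a larger timestamp arrives.

class Site:
    def __init__(self, site_id):
        self.site_id = site_id
        self.counter = 0

    def next_timestamp(self):
        self.counter += 1
        return (self.counter, self.site_id)     # unique across the network

    def receive(self, message_ts):
        # Synchronise a slow counter with the timestamp in the message.
        counter, _ = message_ts
        if self.counter < counter:
            self.counter = counter

fast, slow = Site(1), Site(2)
for _ in range(5):
    ts = fast.next_timestamp()                  # the fast site races ahead
slow.receive(ts)                                # message carries (5, 1)
assert slow.next_timestamp() > ts               # (6, 2) is ordered after (5, 1)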

18.7 RECOVERY CONTROL IN DISTRIBUTED DATABASES

As with local recovery, distributed database recovery aims to maintain the


atomicity and durability of distributed transactions. A database must
guarantee that all statements in a transaction, distributed or non-distributed,
either commit or roll back as a unit. The effects of an ongoing transaction
should be invisible to all other transactions at all sites. This transparency
should be true for transactions that include any type of operations, including
queries, updates or remote procedure calls. In a distributed database
environment also the database management system must coordinate
transaction control with these characteristics over a communication network
and maintain data consistency, even if network or system failure occurs.
In a DDBMS, a given transaction is submitted at one site, but it can
access data at other sites as well. When a transaction is submitted at a
site, the transaction manager at that site breaks it up into a collection of
one or more sub-transactions that execute at different sites. The transaction
manager then submits these sub-transactions to the transaction managers at
the other sites and coordinates their activities. To ensure the atomicity of the
global transaction, the DDBMS must ensure that sub-transactions of the
global transaction either all commit or all abort.
In recovery control, transaction atomicity must be ensured. When a
transaction commits, all its actions across all the sites that it executes at must
persist. Similarly, when a transaction aborts, none of its actions must be
allowed to persist. Recovery control in distributed system is typically based
on the two-phase commit (2PC) protocol or three-phase commit (3PC)
protocol.

18.7.1 Two-phase Commit (2PC)


Two-phase commit protocol (2PC) is the simplest and most widely used
technique for recovery control in a distributed database
environment. The 2PC mechanism guarantees that all database servers
participating in a distributed transaction either all commit or all abort. In a
distributed database system, each sub-transaction (that is, part of a
transaction getting executed at each site) must show that it is prepared-to-
commit. Otherwise, the transaction and all of its changes are entirely
aborted. For a transaction to be ready to commit, all of its actions must have
been completed successfully. If any sub-transaction indicates that its actions
cannot be completed, then all the sub-transactions are aborted and none of
the changes are committed. The two-phase commit process requires the
coordinator to communicate with every participant site.
As the name implies, two-phase commit (2PC) protocol has two phases
namely the voting phase and the decision phase. Both phases are initiated by
a coordinator. The coordinator asks all the participants whether they are
prepared to commit the transaction. In the voting phase, the sub-transactions
are requested to vote on their readiness to commit or abort. In the decision
phase, a decision as to whether all sub-transactions should commit or abort
is made and carried out. If one participant votes to abort or fails to respond
within a timeout period, then the coordinator instructs all participants to
abort the transaction. If all vote to commit, then the coordinator instructs all
participants to commit the transaction. The global decision must be adopted
by all participants. Figs. 18.15 and 18.16 illustrate the voting phase and
decision phase, respectively, of the two-phase commit protocol.
The basic principle of 2PC is that any of the transaction managers involved
(including the coordinator) can unilaterally abort a transaction. However,
there must be unanimity to commit a transaction. When a message is sent in
2PC, it signals a decision by the sender. In order to ensure that this decision
survives a crash at the sender’s site, the log record describing the decision is
always forced to stable storage before the message is sent. A transaction is
officially committed at the time the coordinator’s commit log record reaches
stable storage. Subsequent failures cannot affect the outcome of the
transaction. The transaction is irrevocably committed.
A log record is maintained with entries such as type of the record, the
transaction identification and the identity of the coordinator. When a system
comes back after crash, recovery process is invoked. The recovery process
reads the log and processes all transactions that were executing the commit
protocol at the time of the crash.
Fig. 18.15 Voting phase of two-phase commit (2PC) protocol
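The coordinator's side of the two phases can be sketched in a few lines of Python, as below. Participants are represented as plain callables that answer PREPARE with YES or NO; forcing log records to stable storage, timeouts and failure handling are omitted, so this is only an outline of the message flow, not a complete protocol.

# Coordinator's view of 2PC: collect votes, then broadcast one global decision.

def two_phase_commit(participants):
    # Phase 1 (voting): ask every participant whether it is prepared to commit.
    votes = [p("PREPARE") for p in participants]

    # Phase 2 (decision): commit only if every vote was YES; otherwise abort.
    decision = "GLOBAL-COMMIT" if all(v == "YES" for v in votes) else "GLOBAL-ABORT"
    for p in participants:
        p(decision)
    return decision

ready     = lambda msg: "YES" if msg == "PREPARE" else "ACK"
not_ready = lambda msg: "NO"  if msg == "PREPARE" else "ACK"

assert two_phase_commit([ready, ready]) == "GLOBAL-COMMIT"
assert two_phase_commit([ready, not_ready]) == "GLOBAL-ABORT"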

Limitations
A failure of the coordinator of sub-transactions can result in the transaction being blocked
from completion until the coordinator is restored.
Requirement of coordinator results into more messages and more overhead.

18.7.2 Three-phase Commit (3PC)


Three-phase commit protocol (3PC) is an extension of a two-phase commit
protocol. It avoids the blocking limitation of a two-phase commit protocol.
3PC is non-blocking for site failures, except in the event of the failure of
all sites. It avoids blocking even if the coordinator site fails during recovery.
3PC assumes that no network partition occurs and that no more than a
predetermined number of sites fail. Under these assumptions, the 3PC
protocol avoids blocking by introducing an extra third phase where multiple
sites are involved in the decision to commit. In 3PC, the coordinator
effectively postpones the decision to commit until it ensures that at least a
predetermined number of sites know that it intended to commit the
transaction. If the coordinator fails, the remaining sites first select a new
coordinator. This new coordinator checks the status of the protocol from the
remaining sites. If the earlier coordinator had decided to commit, at least one
of the other predetermined sites that it informed will be up and will ensure that
the commit decision is respected. The new coordinator restarts the third
phase of the protocol if some site knew that the old coordinator intended to
commit the transaction. Otherwise the new coordinator aborts the
transaction.

Fig. 18.16 Decision phase of two-phase commit (2PC) protocol

As discussed above, the basic purpose of 3PC is to remove uncertainty


period for participants that have voted for commit and are waiting for the
global abort or global commit from the coordinator. 3PC introduces a third
phase, called pre-commit, between voting and the global decision. The 3PC
protocol is not used in practice because of significant additional cost and
overheads required during normal execution.

Advantages
3PC does not block the sites.

Limitations
3PC adds to the overhead and cost.

REVIEW QUESTIONS
1. What is distributed database? Explain with a neat diagram.
2. What are the main advantages and disadvantages of distributed databases?
3. Differentiate between parallel and distributed databases.
4. What are the desired properties of distributed databases?
5. What do you mean by architecture of a distributed database system? What are different types
of architectures? Discuss each of them with neat sketch.
6. What is client/server computing? What are its main components?
7. Discuss the benefits and limitations of client/server architecture of the DDBS.
8. What are the various types of distributed databases? Discuss in detail.
9. What are homogeneous DDBSs? Explain in detail with an example.
10. What are heterogeneous DDBSs? Explain in detail with an example.
11. What do you mean by distributed database design? What strategies and objectives are
common to most of the DDBMSs?
12. What is a fragment of a relation? What are the main types of data fragments? Why is
fragmentation a useful concept in distributed database design?
13. What is horizontal data fragmentation? Explain with an example.
14. What is vertical data fragmentation? Explain with an example.
15. What is mixed data fragmentation? Explain with an example.
16. Consider the following relations

EMPLOYEE (EMP, NAME, ADDRESS, SKILL, PROJ-ID)

EQUIPMENT (EQP-ID, EQP-TYPE, PROJECT)

Suppose that EMPLOYEE relation is horizontally fragmented by PROJ-ID and each


fragment is stored locally at its corresponding project site. Assume that the EQUIPMENT
relation is stored in its entirety at the Tokyo location. Describe a good strategy for processing
each of the following queries:
a. Find the join of relations EMPLOYEE and EQUIPMENT.
b. Get all employees for projects using EQP-TYPE as “Welding machine”.
c. Get all machines being used at the Mumbai Project.
d. Find all employees of the project using equipment number 110.

17. For each of the strategy of the previous question, state how your choice of a strategy depends
on:

a. The site at which the query was entered.


b. The site at which the result is desired.

18. What is data replication? Why is data replication useful in DDBMSs? What are the typical
units of data replicated?
19. What is data allocation? Discuss.
20. Write short notes on the following:

a. Distributed Database
b. Data Fragmentation
c. Data Allocation
d. Data Replication
e. Two-phase Commit
f. Three-phase Commit
g. Timestamping
h. Distributed Locking
i. Semi-JOIN
j. Distributed Deadlock.

21. Contrast the following terms:

a. Distributed database and parallel database.


b. Homogeneous database and heterogeneous database.
c. Horizontal fragmentation and vertical fragmentation.
d. Distributed data independence and distributed transaction atomicity.

22. What do you mean by data replication? What are its advantages and disadvantages?
23. What is distributed database query processing? How is it achieved?
24. What is semi-JOIN in a DDBS query processing? Explain with an example.
25. Compute a semijoin for the following relation shown in Fig. 18.17 kept at two different sites.
Fig. 18.17 Obtaining a join using semijoin

26. What is the difference between a homogeneous and a heterogeneous DDBS? Under what
circumstances would such systems generally arise?
27. Discuss the issues that have to be addressed with distributed database design.
28. What is middleware system architecture? Explain with a neat sketch and an example.
29. Under what condition is

(R ⋉ S) = (S ⋉ R)

30. Consider a relation that is fragmented horizontally by PLANT-NO and given as

EMPLOYEE (NAME, ADDRESS, SALARY, PLANT-NO).

Assume that each fragment has two replicas; one stored at the Bangalore site and one stored
locally at the plant site of Jamshedpur. Describe a good processing strategy for the following
queries entered at the Singapore site:

a. Find all employees at the Jamshedpur plant.


b. Find the average salary of all employees.
c. Find the highest-paid employee at each of the plant sites namely Thailand, Mumbai,
New Delhi and Chennai.
d. Find the lowest-paid employee in the company.
31. How do we achieve concurrency control in a distributed database system? What should be
the characteristics of a good concurrency control mechanism?
32. How do we achieve recovery control in a distributed database system?
33. What is distributed locking? What are its advantages and disadvantages?
34. Differentiate between deadlock and timestamping.
35. Explain the functioning of two-phase and three-phase commit protocols used in recovery
control of distributed database system.

STATE TRUE/FALSE

1. In a distributed database system, each site is typically managed by a DBMS that is dependent
on the other sites.
2. Distributed database systems arose from the need to offer local database autonomy at
geographically distributed locations.
3. The main aim of client/server architecture is to utilise the processing power on the desktop
while retaining the best aspects of centralised data processing.
4. Distributed transaction atomicity property enables users to ask queries without specifying
where the reference relations, or copies or fragments of the relations, are located.
5. Distributed data independence property enables users to write transactions that access and
update data at several sites just as they would write transactions over purely local data.
6. Although geographically dispersed, a distributed database system manages and controls the
entire database as a single collection of data.
7. In homogeneous DDBS, there are several sites, each running their own applications on the
same DBMS software.
8. In heterogeneous DDBS, different sites run under the control of different DBMSs, essentially
autonomously and are connected somehow to enable access to data from multiple sites.
9. A distributed database system allows applications to access data from local and remote
databases.
10. Homogeneous database systems have well-accepted standards for gateway protocols to
expose DBMS functionality to external applications.
11. Distributed database do not use client/server architecture.
12. In the client/server architecture, client is the provider of the resource whereas the server is a
user of the resource.
13. The client/server architecture does not allow a single query to span multiple servers.
14. A horizontal fragmentation is produced by specifying a predicate that performs a restriction
on the tuples in the relation.
15. Data replication is used to improve the local database performance and protect the
availability of applications.
16. Transparency in data replication makes the user unaware of the existence of the copies.
17. The server is the machine that runs the DBMS software and handles the functions required
for concurrent, shared data access.
18. Data replication enhances the performance of read operations by increasing the processing
speed at site.
19. Data replication decreases the availability of data to read-only transactions.
20. In distributed locking, the DDBS maintains a lock manager at each site whose function is to
administer the lock and unlock requests for those data items that are stored at that site.
21. In distributed systems, each site generates unique local timestamp using either a logical
counter or the local clock and concatenates it with the site identifier.
22. In a recovery control, transaction atomicity must be ensured.
23. The two-phase commit protocol guarantees that all database servers participating in a
distributed transaction either all commit or all abort.
24. The use of 2PC is not transparent to the users.

TICK (✓) THE APPROPRIATE ANSWER

1. A distributed database system allows applications to access data from

a. local databases.
b. remote databases.
c. both local and remote databases
d. None of these.

2. In homogeneous DDBS,

a. there are several sites, each running their own applications on the same DBMS
software.
b. all sites have identical DBMS software.
c. all users (or clients) use identical software
d. All of these.

3. In heterogeneous DDBS,

a. different sites run under the control of different DBMSs, essentially autonomously.
b. different sites are connected somehow to enable access to data from multiple sites.
c. different sites may use different schemas, and different DBMS software.
d. All of these.

4. The main components of the client/server architecture are

a. communication networks.
b. server.
c. application softwares.
d. All of these.

5. Which of the following is not a benefit of client/server architecture?

a. Reduction in operating cost


b. Adaptability
c. Platform independence
d. None of these.
6. Which of the following are the components of DDBS?

a. Communication network
b. Server
c. Client
d. All of these.

7. Which of the following computing architectures is used by DDBS?

a. Client/Server computing
b. Mainframe computing
c. Personal computing
d. None of these.

8. In collaborating server architecture,

a. there are several database servers.


b. each server is capable of running transactions against local data.
c. transactions are executed spanning multiple servers.
d. All of these.

9. The middleware database architecture

a. is designed to allow single query to span multiple servers.


b. provides users with a consistent interface to multiple DBMSs and file systems in a
transparent manner.
c. provides users with an easier means of accessing live data in multiple sources.
d. All of these.

10. Data fragmentation is a

a. technique of breaking up the database into logical units, which may be assigned for
storage at the various sites.
b. process of deciding about locating (or placing) data to several sites.
c. technique that permits storage of certain data in more than one site.
d. None of these.

11. A horizontal fragmentation is produced by specifying a

a. predicate operation of relational algebra.


b. projection operation of relational algebra.
c. selection and projection operations of relational algebra.
d. None of these.

12. A vertical fragmentation is produced by specifying a

a. predicate operation of relational algebra.


b. projection operation of relational algebra.
c. selection and projection operations of relational algebra.
d. None of these.

13. A mixed fragmentation is produced by specifying a

a. predicate operation of relational algebra.


b. projection operation of relational algebra.
c. selection and projection operations of relational algebra.
d. None of these.

14. Data allocation is a

a. technique of breaking up the database into logical units, which may be assigned for
storage at the various sites
b. process of deciding about locating (or placing) data to several sites.
c. technique that permits storage of certain data in more than one site.
d. None of these.

15. Data replication is a

a. technique of breaking up the database into logical units, which may be assigned for
storage at the various sites.
b. process of deciding about locating (or placing) data to several sites.
c. technique that permits storage of certain data in more than one site.
d. None of these.

16. Which of the following refers to the operation of copying and maintaining database objects in
multiple databases belonging to a distributed system?

a. Replication
b. Backup
c. Recovery
d. None of these.

17. In distributed query processing, semijoin operation is used to

a. reduce the size of a relation that needs to be transmitted.
b. reduce the communication costs.
c. Both (a) and (b).
d. None of these.

18. In DDBS, the lock manager function is

a. distributed over several sites.
b. centralised at one site.
c. no lock manager is used.
d. None of these.
19. Which of the following is the recovery management technique in DDBS?

a. 2PC
b. Backup
c. Immediate update
d. None of these.

20. In distributed system, the detection of a deadlock requires the generation of

a. local wait-for graph.
b. global wait-for graph.
c. Both (a) and (b)
d. None of these.

21. In a distributed database system, deadlock prevention by aborting transactions can be based on

a. timestamping.
b. wait-die method.
c. wound-wait method.
d. All of these.

22. Which of the following is the function of a distributed DBMS?

a. Distributed data recovery
b. Distributed query processing
c. Replicated data management
d. All of these.

FILL IN THE BLANKS

1. A distributed database system is a database physically stored on several computer systems across _____ connected together via _____.
2. Distributed database systems arose from the need to offer local database autonomy at _____
locations.
3. _____ is an architecture that enables distributed computing resources on a network to share
common resources among groups of users of intelligent workstations.
4. The two desired properties of distributed databases are (a) _____ and (b) _____.
5. _____ is a database physically stored in two or more computer systems.
6. Heterogeneous distributed database system is also referred to as a _____ or _____.
7. Client/server architectures are those in which a DBMS-related workload is split into two
logical components namely (a) _____ and (b) _____.
8. The client/server architecture consists of the four main components namely (a) _____, (b) _____, (c) _____ and (d) _____.
9. Three main advantages of distributed databases are (a) _____, (b) _____ and (c) _____.
10. Three main disadvantages of distributed databases are (a) _____, (b) _____ and (c) _____.
11. The middleware database architecture is also called _____.
12. The middleware is basically a layer of _____, which works as a special server and coordinates the execution of _____ and _____ across one or more independent database servers.
13. A horizontal fragment of a relation is a subset of _____ with all _____ in that relation.
14. In horizontal fragmentation, _____ operation is done to reconstruct the original relation.
15. Data replication enhances the performance of read operations by increasing the _____ at site.
16. Data replication has increased overheads for _____ transactions.
17. In a distributed query processing, semijoin operation is used to reduce the _____ of a relation
that needs to be transmitted and hence the _____ costs.
18. In a distributed database deadlock situation, LWFG stands for _____.
19. In a distributed database deadlock situation, GWFG stands for _____.
20. In the DDBSs, each copy of the data item contains two timestamp values namely (a) _____
and (b) _____.
21. Two-phase commit protocol has two phases namely (a) _____ and (b) _____.
22. 3PC protocol avoids the _____ limitation of two-phase commit protocol.
Chapter 19

Decision Support Systems (DSS)

19.1 INTRODUCTION

Since data are a crucial raw material in the information age, the preceding chapters focussed on data storage and its management for efficient database design and implementation. Those chapters were largely devoted to good database design and controlled data redundancy, producing effective operational databases that fulfil business needs, such as tracking customers, sales and inventories, and thereby support management decision-making.
In the last few decades, there has been a revolutionary change in computer-based technologies to improve the effectiveness of managerial decision-making, especially in complex tasks. The decision support system (DSS) is one such technology, developed to facilitate the decision-making
process. DSS helps in the analysis of business information. It provides a
computerised interface that enables business decision makers to creatively
approach, analyse and understand business problems. Decision support
systems, more than 30 years old, have already proven themselves by
providing business with substantial savings in time and money.
This chapter introduces decision support system (DSS) technology.

19.2 HISTORY OF DECISION SUPPORT SYSTEM (DSS)

The concept of the decision support system (DSS) can be traced back to the 1940s and 1950s, with the emergence of operations research, behavioural and scientific theories of management and statistical process control, much before the general availability of computers. In those days, the basic objective was to collect business operational data and convert them into a form that could be used to analyse and modify the behaviour of the business in an intelligent manner. Fig. 19.1 illustrates the evolution of the decision support system.
In the late 1960s and early 1970s, researchers at Harvard and
Massachusetts Institute of Technology (MIT), USA, introduced the use of
computers in the decision-making process. The computing systems that helped in the decision-making process were known as management decision systems (MDS) or management information systems (MIS); later on, they became most commonly known as decision support systems (DSS). The term management decision system (MDS) was introduced by Scott-Morton in the early 1970s.
During the 1970s, several query languages were developed and a number of custom-built decision support systems were built around such languages. These custom-built DSS were implemented using report generators such as RPG or data retrieval products such as FOCUS, DATATRIEVE and NOMAD. The data were stored in simple flat files until the early 1980s, when relational databases began to be used for decision support purposes.

19.2.1 Use of Computers in DSS


As can be observed from Fig. 19.1, computers have been used as tools to
support managerial decision-making for over four decades. As per Kroeber and Watson, the computerised tools (or decision aids) can be grouped into
the following categories:
Fig. 19.1 Evolution of decision support system (DSS)

Electronic data processing (EDP).
Transaction processing systems (TPS).
Management information systems (MIS).
Office automation systems (OAS).
Decision support systems (DSS).
Expert systems (ES).
Executive information systems (EIS).

The above systems grouped together are called computer-based information systems (CBIS). The CBIS progressed through time. The first
electronic data processing (EDP) and transaction processing systems (TPS)
appeared in the mid-1950s, followed by management information system
(MIS) in the 1960s, as shown in Fig. 19.1. Office automation system (OAS)
and decision support system (DSS) were developed in the 1970s. DSS started
expanding in the 1980s and then commercial applications of expert system
(ES) emerged. The executive information system (EIS) came into existence
to support the work of senior executives of the organisation.

Fig. 19.2 Relation among EDP, DSS and MIS

Fig. 19.2 illustrates the relation among EDP, MIS and DSS. As shown,
DSS can be considered as a subset of MIS.

19.3 DEFINITION OF DECISION SUPPORT SYSTEM

The decision support system (DSS) emerged from a data processing world of
routine static reports. According to Clyde Holsapple, professor in the
decision science department of the College of Business and Economics at
the University of Kentucky in Lexington, “Decision-makers can’t wait a
week or a month for a report”. As per Holsapple, the advances in the 1960s,
such as the IBM 360 and other mainframe technologies, laid the foundation
for DSS. But, he claims, it was during the 1970s that DSS took off, with the
arrival of query systems, what-if spreadsheets, rules-based software
development and packaged algorithms from companies such as Chicago-
based SPSS Inc. and Cary, N.C.-based SAS Institute Inc.
The concept of the decision support system (DSS) was first articulated in the early 1970s by Scott-Morton under the term management decision systems (MDS). He defined such systems as “interactive computer-based systems, which help decision makers utilise data and models to solve unstructured
problems”.
Keen and Scott-Morton provided another classical definition of decision
support system as, “Decision support systems couple the intellectual
resources of individuals with the capabilities of the computer to improve the quality of decisions. It is a computer-based support system for management
decision makers who deal with semi-structured problems”.
A working definition of DSS can be given as “a DSS is an interactive, flexible and adaptable computer-based information system (CBIS) that utilises decision rules, models and a model base coupled with a comprehensive database and the decision maker’s own insights, leading to specific, implementable decisions in solving problems”. DSS is a methodology designed to extract information from data and to use such information as a basis for decision-making. It is an arrangement of
computerised tools used to assist managerial decision-making within a
business. The DSS is used at all levels within an organisation and is often
tailored to focus on specific business areas or problems, such as, insurance,
financial, banking, health care, manufacturing, marketing and sales and so
on.
The DSS is an interactive computerised system and provides ad hoc query
tools to retrieve data and display data in different formats. It helps decision
makers compile useful information received in the form of raw data from a wide
range of sources, different documents, personal knowledge and/or business
models to identify and solve problems and make decisions.
For example, a national on-line book seller wants to begin selling its
products internationally but first needs to determine if that will be a wise
business decision. The vendor can use a DSS to gather information from its
own resources to determine if the company has the ability or potential ability
to expand its business and also from external resources, such as industry
data, to determine if there is indeed a demand to meet. The DSS will collect
and analyse the data and then present it in a way that can be interpreted by
humans. Some decision support systems come very close to acting as
artificial intelligence agents.
DSS applications are not single information resources, such as a database
or a program that graphically represents sales figures, but the combination of
integrated resources working together.
Typical information that a decision support application might gather and
present would be:
Access to all current information assets of an enterprise, including legacy and relational
data sources, cubes, data warehouses and data marts.
Comparative sales figures between one week and the next.
Projected revenue figures based on new product sales assumptions.
The consequences of different decision alternatives, given past experience in a context that is
described.

The best DSS combines data from both internal and external sources in a
common view allowing managers and executives to have all of the
information they need at their fingertips.

19.3.1 Characteristics of DSS


Following are the major characteristics of DSS:
DSS incorporates both data and model.
They are designed to assist decision-makers (managers) in their decision processes in semi-
structured or unstructured tasks.
They support, rather than replace, managerial judgement.
They improve the effectiveness of the decisions, not the efficiency with which decisions are
being made.
A DSS enables the solution of complex problems that ordinarily cannot be solved by other computerised approaches.
A DSS enables a thorough, quantitative analysis in a very short time. It provides fast
response to unexpected situations that result in changed conditions.
It has the ability to try several different strategies under different configurations, quickly and
objectively.
The users can be exposed to new insights and learning through the composition of the model and extensive sensitivity (“what-if”) analysis.
DSS leads to learning, which leads to new demands and refinement of the system, which
leads to additional learning and so forth, in a continuous process of developing and
improving the DSS.
19.3.2 Benefits of DSS
Facilitated communication among managers.
Improved management control and performance.
More consistent and objective decisions derived from DSS.
Improved managerial effectiveness.
Cost savings.
DSS can be used to support individuals and/or groups.

19.3.3 Components of DSS


As shown in Fig. 19.3, a DSS is usually composed of the following four main
high-level components:
Data Management (data extraction and filtering).
Data store.
End-user tool.
End-user presentation tool.

19.3.3.1 Data Management


The data management component consists of data extraction and filtering. It is used to
extract and validate the data taken from the operational database and the
external data sources. This component extracts the data, filters them to select
the relevant records and packages the data in the right format to add into the
DSS data store component. For example, to determine the relative market
share by selected product line, the DSS requires data on competitors’ products. Such data can be located in external databases provided by
industry groups or by companies that market such data.
Fig. 19.3 Main components of DSS

19.3.3.2 Data Store


The data store is the database of decision support system (DSS). It contains
two main types of data, namely business data and business model data. The
business data are extracted from operational database and from external data
sources. The business data summarise and arrange the operational data in
structures that are optimised for data analysis and query speed. External data
sources provide data that cannot be found within the company but are
relevant to the business, such as, stock-prices, market indicators, marketing
demographics, competitor’s data and so on.
Business model data are generated by special algorithms, such as, linear
programming, linear regression, matrix-optimisation techniques and so on.
They model the business in order to identify and enhance the understanding
of business situations and problems. For example, to define the relationship
between advertising types, expenditures and sales for forecasting, the DSS
might use some type of regression model, and then use the results to perform
a time-series analysis.
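To make this concrete, the sketch below shows how such a business model might be computed directly in SQL using the standard linear-regression aggregates REGR_SLOPE and REGR_INTERCEPT, which are available in DBMSs such as Oracle and PostgreSQL. The table PROMO_SALES and its columns are hypothetical, introduced only for illustration.

-- Hypothetical table: PROMO_SALES(ad_type, ad_spend, sales_value)
-- Fit sales_value = intercept + slope * ad_spend for each advertising type
SELECT ad_type,
       REGR_SLOPE(sales_value, ad_spend)     AS slope,
       REGR_INTERCEPT(sales_value, ad_spend) AS intercept
FROM   promo_sales
GROUP  BY ad_type;

The resulting slope and intercept per advertising type form a simple business model that the DSS could then feed into a time-series or what-if analysis.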
19.3.3.3 End-user Tool
The end-user tool is used by the data analyst to create the queries that access
the database. Depending on the DSS implementation, the query tool accesses
either the operational database or, more commonly, the DSS database. The
tool advises the user on which data to select and how to build a reliable
business data model.

19.3.3.4 End-user Presentation Tool


The end-user presentation tool is used by the data analyst to organise and
present the data. This tool helps the end-user select the most appropriate
presentation format, such as, summary report, map, pie or bar graph, mixed
graphs, and so on. The query and presentation tools form the front end of the DSS.

19.4 OPERATIONAL DATA VERSUS DSS DATA

Operational data and DSS data serve different purposes. Their formats and
structure differ from one another. While operational data captures daily
business transactions, the DSS data give tactical and strategic business
meaning to the operational data. Most operational data are stored in
relational database in which the structures (tables) tend to be highly
normalised. The operational data storage is optimised to support transactions
that present daily operations. Customer data, inventory data and so on, are
frequently updated as when its status change. Operational systems store data
in more than one table for effective update performance. For example, sales
transaction might be represented by many tables, such as invoice, discount,
store, department and so on. Therefore, operational databases are not query
friendly. For example, to extract an invoice details, one has to join several
tables.
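The contrast can be sketched with two hedged SQL queries; the table and column names (INVOICE, INVOICE_LINE, SALES_SUMMARY and so on) are hypothetical and stand for a typical normalised operational schema and a typical summarised DSS table.

-- Operational query: reconstructing one invoice requires joining several tables
SELECT i.invoice_no, c.cust_name, s.store_name, l.product_code,
       l.quantity, l.line_amount
FROM   invoice i
       JOIN customer     c ON c.cust_id    = i.cust_id
       JOIN store        s ON s.store_id   = i.store_id
       JOIN invoice_line l ON l.invoice_no = i.invoice_no
WHERE  i.invoice_no = 10001;

-- DSS-style query: a single denormalised summary table, broad in scope
SELECT region, sales_month, SUM(sales_amount) AS total_sales
FROM   sales_summary
GROUP  BY region, sales_month;

The second query already assumes the summarisation and denormalisation that characterise DSS data, discussed next.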
DSS data differ from operational data in three main areas, namely time
span, granularity and dimensionality. Table 19.1 shows the difference
between operational data and DSS data under these three areas.

Table 19.1 Difference between operational data and DSS data


A. From Analyst's Point of View

Time span
Operational data: Represent current (atomic) transactions and cover a short time frame. For example, transactions might define a purchase order, a sales invoice, an inventory movement and so on.
DSS data: Tend to be historic in nature and cover a larger time frame. They represent company transactions up to a given point in time, such as yesterday, last week, last month, last year and so on. For example, marketing managers are not interested in the value of a specific sales invoice; instead they focus on the sales generated during the last month, the last year, the last two years and so on.

Granularity
Operational data: Represent specific transactions that occur at a given time, for example, a customer purchasing a particular product in a specific store.
DSS data: Represented at different levels of aggregation, from highly summarised to near atomic. Managers at different levels in the organisation require data with different levels of aggregation, and a single problem may require data with different summarisation levels. For example, if a manager must analyse sales by region, he or she must be able to access data showing the sales by region, by city within the region, by stores within the city, and so on.

Dimensionality
Operational data: Represent a single-transaction view of data, focusing on atomic transactions rather than on the effects of the transactions over time.
DSS data: Represent a multidimensional view of data. For example, a marketing manager might want to know how a product fared relative to another product during the past six months by region, state, city, store and customer.

B. From Designer's Point of View

Data currency
Operational data: Represent current operations and transactions as they happen, in real time.
DSS data: Snapshots of the operational data at a given point in time (for example, week/month/year); they are historic, representing a time slice of the operational data.

Transaction volumes
Operational data: Characterised by update transactions, which require periodic updates; transaction volume tends to be very high.
DSS data: Characterised by query (read-only) transactions; periodic updates are required to load new data summarised from the operational data; transaction volume is at low to medium levels.

Storage tables
Operational data: Commonly stored in many tables, and the stored data represent information about a given transaction only.
DSS data: Generally stored in a few tables holding data derived from the operational data; they do not include the details of each operational transaction but represent transaction summaries, so the DSS stores data that are integrated, aggregated and summarised for decision support purposes.

Degree of summarisation
Operational data: Low; some aggregate fields.
DSS data: Very high, with a great deal of derived data.

Granularity
Operational data: Atomic, detailed data.
DSS data: Summarised data.

Data model
Operational data: Highly normalised structures with many tables, each containing the minimum number of attributes; mostly use relational DBMSs.
DSS data: Non-normalised, complex structures with few tables containing large numbers of attributes; mostly use relational DBMSs, though some use multidimensional DBMSs.

Query activity
Operational data: Query frequency and complexity tend to be low in order to allow more processing cycles for the more crucial update transactions; queries are narrow in scope, low in complexity and speed-critical.
DSS data: Exist for the sole purpose of serving query requirements; queries are broad in scope, high in complexity and less speed-critical.

Data volumes
Operational data: Medium to large (hundreds of megabytes to gigabytes).
DSS data: Very large (hundreds of gigabytes to terabytes).

From the above table it can be observed that operational data have a narrow time span, low granularity and a single focus. They are normally seen in tabular formats in which each row represents a single transaction, and it is difficult to derive useful information from them. DSS data, on the other hand, focus on a broader time span, have multiple levels of granularity and can be seen from multiple dimensions.

REVIEW QUESTIONS
1. What do you mean by the decision support system (DSS)? What role does it play in the
business environment?
2. Discuss the evolution of decision support system.
3. What are the main components of a DSS? Explain the functions of each of them with a neat
diagram.
4. What are the differences between operational data and DSS data?
5. Discuss the major characteristics of DSS.
6. List major benefits of DSS.

STATE TRUE/FALSE

1. DSS helps in the analysis of business information.
2. In the late 1950s and early 1960s, researchers at Harvard and Massachusetts Institute of
Technology (MIT), USA, introduced the use of computers in the decision-making process.
3. MDS and MIS were later on commonly known as DSS.
4. DSS is an interactive, flexible and adaptable CBIS.
5. The DSS is an interactive computerised system.
6. The end-user tool component of DSS is used by the data analyst to organise and present the
data.
7. The end-user presentation tool is used by the data analyst to create the queries that access the
database.
8. Operational data and DSS data serve the same purpose.
9. Operational data have a narrow time span, low granularity, and single focus.
10. DSS data focuses on a broader time span, and have levels of granularity.

TICK (✓) THE APPROPRIATE ANSWER

1. Use of computers in the decision making process was introduced during

a. late-1950s and early-1960s
b. late-1960s and early-1970s
c. late-1970s and early-1980s
d. late-1980s and early-1990s.

2. DSS was introduced during the

a. late-1950s and early-1960s
b. late-1960s and early-1970s
c. late-1970s and early-1980s
d. late-1980s and early-1990s.

3. The term management decision system (MDS) was introduced in the year

a. early-1960s
b. early-1970s.
c. early-1980s
d. None of these.

4. The term management decision system (MDS) was introduced by

a. Scott-Morton
b. Kroeber-Watson.
c. Harvard and MIT
d. None of these.

5. The computing systems to help in decision-making process were known as

a. MDS.
b. MIT.
c. MIS.
d. Both (b) and (c).

6. The term decision support system (DSS) was first articulated by

a. Scott-Morton
b. Kroeber-Watson
c. Harvard and MIT
d. None of these.

7. The best DSS

a. combines data from both internal and external sources in a common view allowing
managers and executives to have all of the information they need at their fingertips.
b. combines data from only internal sources in a common view allowing managers and
executives to have all of the information they need at their fingertips.
c. combines data from only external sources in a common view allowing managers
and executives to have all of the information they need at their fingertips.
d. None of these.

8. DSS incorporates

a. only data
b. only model.
c. both data and model
d. None of these.

9. DSS results into

a. improved managerial control
b. cost saving.
c. improved management control
d. all of these.

10. The end-user tool is used by the data analyst to

a. create the queries that access the database.
b. organise and present the data.
c. Both (a) & (b).
d. None of these.

11. The end-user presentation tool is used by the data analyst to

a. create the queries that access the database.
b. organise and present the data.
c. Both (a) & (b).
d. None of these.
12. DSS data differ from operational data in

a. time span
b. granularity.
c. dimensionality
d. All of these.

FILL IN THE BLANKS

1. Decision support system was developed to facilitate the _____ process.
2. The concept of decision support system (DSS) can be traced back to _____ and _____ with
the emergence of _____, _____ and _____ of management and _____.
3. The term management decision system (MDS) was introduced by _____ in the early _____.
4. These custom-built DSS were implemented using report generators such as _____.
5. The main components of DSS are (a) _____, (b) _____, (c) _____ and (d) _____.
6. Data store of DSS contains two main types of data, namely (a) _____ data and (b) _____
data.
7. While operational data captures _____, the DSS data give tactical and _____ meaning to the
_____ data.
8. DSS data differ from operational data in three main areas, namely (a) _____, (b) _____ and
(c) _____.
Chapter 20

Data Warehousing and Data Mining

20.1 INTRODUCTION

The operational database systems that we discussed in the previous chapters have been traditionally designed to meet mission-critical requirements for
online transaction processing and batch processing. These systems attempted
to automate the established business processes by leveraging the power of
computers to obtain significant improvements in efficiency and speed.
However, in today’s business environment, efficiency or speed is not the
only key for competitiveness.
Today, multinational companies and large organisations have operations in
many places within their country of origin and in other parts of the world. Each
place of operation may generate large volume of data. For example,
insurance companies may have data from thousands of local and external
branches, large retail chains have data from hundreds or thousands of stores,
large manufacturing organisations having complex structure may generate
different data from different locations or operational systems and so on.
Corporate decision makers require access to information from all such sources.
Therefore, the business of the 21st century is the competition between
business models and the ability to acquire, accumulate and effectively use
the collective knowledge of the organisation. It is the flexibility and
responsiveness that differentiate competitors in the new Web-enabled e-
business economy. The key to success of modern business will depend on an
effective data-management strategy of data warehousing and interactive data
analysis capabilities that culminates with data mining. Data warehousing
systems have emerged as one of the principal technological approaches to
the development of newer, leaner, meaner and more profitable corporate
organisations.
This chapter provides an overview of the technologies of data
warehousing and data mining.

20.2 DATA WAREHOUSING

As formally defined by W.H. Inmon, a data warehouse is “a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decisions”. This definition has the following four components:
Subject-oriented: Data is arranged and optimised to provide answers to diverse queries
coming from diverse functional areas within an organisation. Thus, the data warehouse is
organised and summarised around the major subjects or topics of the enterprise, such as
customers, products, marketing, sales, finance, distribution, transportation and so on, rather
than the major application areas such as customer invoicing, stock control, product sales and
others. For each one of these subjects the data warehouse contains specific topics of interests,
such as products, customers, departments, regions, promotions and so on.
Integrated: The data warehouse is a centralised, consolidated database that integrates
source data derived from the entire organisation. The source data coming from multiple and
diverse sources are often inconsistent with diverse formats. Data warehouse makes these data
consistent and provides a unified view of the overall organisational data to the users. Data
integration enhances decision making capabilities and helps managers in better understanding
of the organisation’s operations for strategic business opportunities.
Time-variant: The warehouse data represents the flow of data through time, as data in the
warehouse is only accurate and valid at some point in time or over some time interval. In
other words, the data warehouse contains data that reflect what happened the previous day,
last week, last month, the past two years and so on.
Non-volatile: Once the data enter the data warehouse, they are never deleted. The data is
not updated in real-time but is refreshed from operational systems on a regular basis. New
data is always added as a supplement to the database, rather than a replacement. The database
continually absorbs this new data, incrementally integrating it with the previous data.

Table 20.1 summarises the differences between the operational database and the data warehouse. A data warehouse is a special type of database with a
single, consistent and complete repository or archive of data and information
gathered from multiple sources of an organisation under a unified schema
and at a single site. Gathered data are stored for a long time permitting
access to historical data. Thus, data warehouse provides the end users a
single consolidated interface to data, making decision-support queries easier
to write. It provides storage, functionality and responsiveness to queries
beyond the capabilities of transaction-oriented databases.
Data warehousing is a blend of technologies aimed at the effective
integration of operational databases into an environment that enables the
strategic use of data. These technologies include relational and
multidimensional database management systems, client/server architecture,
metadata modelling and repositories, graphical user interfaces (GUIs) and so
on. Data warehouses extract information out of legacy systems for strategic
use of the organisation in reducing costs and improving revenues.

Table 20.1 Comparison between operational database and data warehouse

Subject-oriented
Operational database: Data are stored with a functional or process orientation, for example, invoices, credits, debits and so on.
Data warehouse: Data are stored with a subject orientation that facilitates multiple views of the data and decision making, for example, sales, products, sales by products and so on.

Integrated
Operational database: Similar data can have different representations or meanings, for example, business metrics, social security numbers and others.
Data warehouse: Provides a unified view of all data elements with a common definition and representation for all departments.

Time-variant
Operational database: Data represent current transactions, for example, the sales of a product on a given date, or over the last week, and so on.
Data warehouse: Data are historic in nature. A time dimension is added to facilitate data analysis and time comparisons.

Non-volatile
Operational database: Data updates and deletes are very common.
Data warehouse: Data are not changed but are only added periodically from the operational systems. Once data are stored, no changes are allowed.

Data warehousing is the computer systems development discipline dedicated to systems that allow business users to analyse how well their systems are running and figure out how to make them run better. Data warehousing systems are the diagnostic systems of the business world. Because data warehouses encompass large volumes of data from several sources, they are an order of magnitude larger than the source databases.

20.2.1 Evolution of Data Warehouse Concept


Since the 1970s, organisations have mostly been focussing their investments
in new computer systems that automated their business processes.
Organisations gained competitive advantage by this investment through
more efficient and cost-effective services to the customer. In this process of
adding new computing facilities, organisations accumulated a growing amount of data stored in their operational databases. However, using such
operational databases, it is difficult to support decision-making, which is the
requirement for regaining competitive advantage.
In the earliest days of computer systems, on-line transaction processing
(OLTP) systems were found effective in managing business needs of an
organisation. Today, OLTP facilitates the organisations to coordinate the
activities of hundreds of employees and meet the needs of millions of
customers, all at the same time. However, it was experienced that OLTP systems cannot help in analysing how well the business is running and how to improve the performance. Thus, management information system (MIS) and
the decision support system (DSS) were invented to meet these new business
needs. MIS and DSS were concerned not with the day-to-day running of the
business, but with the analysis of how the business is doing. Ultimately, with
the improvement in technologies, MIS/DSS systems gradually evolved with
time into what is now known as data warehousing.
Fig. 20.1 illustrates the evolution of the concept of data warehouse. The
evolution of data warehouse has been directed by the progression of
technical and business developments in the field of computing. During the
70s and 80s, the advancement in technology and the development of microcomputers (PCs), along with data-orientation in the form of relational databases, drove the emergence of end-user computing. Various tools (such as spreadsheets, end-user applications and others) enabled end-users to develop their own queries and have control over the data they required. By the mid-80s, end-users developed the ability to deal with both the business and
80s, end-users developed the ability to deal with both the business and
technical aspects of data. Thus, the dependence on data processing personnel
started diminishing.

Fig. 20.1 Evolution of data warehouse

As end-user computing was emerging, computing started shifting from a data-processing approach to a business-driven information technology strategy, as shown in Fig. 20.2. With the increasing power and
sophistication of computing technology, it became possible to automate even
more complex processes and derive many benefits, such as increased speed
of throughput, improved accuracy, reduction in cost of development and
maintenance of computing systems and applications and so on. The whole
business concept of using computing environment changed from saving
money to making money.
During the mid-to-late 80s, a need arose for a common method to describe the data to be obtained from the operational systems and make them available to the information environment for decision-making processes. Thus, data modelling approaches and tools emerged to help information system personnel in documenting the data needs of the users and the data structures. The data warehouse concept started by implementing the modelled data in the end-user environment. ABN AMRO, one of Europe’s largest banks, successfully implemented the data warehouse architecture in the mid-80s.

Fig. 20.2 Shift from data processing to information technology

In the early 90s, many industries were subjected to significant changes in their business environments. Worldwide recession
reduced the profit margins of industries, governments deregulated industries
that were once closely controlled, competition increased and so on. These
developments forced the industries to look for a new view of how to operate
by revolutionising data and focussing on business-driven warehouse
implementation. The data revolution led to the foundations for an expansion
of the warehouse concept beyond the types of data traditionally associated
with decision support and began to bring together all aspects of how end-
users perform their jobs. A 1996 study [IDC (1996)] of 62 data warehousing
projects undertaken in this period showed an average return on investment
(ROI) of 321% for these enterprise-wide implementations in an average
payback period of 2.73 years.
The technological advances in data modelling, databases and application
development methods resulted into paradigm shift from information system
(IS) approach to business-driven warehouse implementations. This led to a
significant change in the perception of the approach. New areas of benefits
led to new demands for data and new ways of using it.
The data-warehousing concept has picked up in the last 10 to 15 years, and industries have shown interest in implementing it. The 21st century has started witnessing the worldwide use of data warehouses, and industries have started realising the benefits from them. Data warehousing has
following advantages to the end-users:
A single information source.
Distributed information availability.
Providing information in a business context.
Automated information delivery.
Managing information quality and ownership.

Data warehousing typically delivers information to users in one of the following formats:
Query and reporting.
On-line analytical processing (OLAP).
Agent.
Statistical analysis.
Data mining.
Graphical/geographic system.

20.2.2 Main Components of Data Warehouses


Following are the three fundamental components of a data warehouse, as shown in Fig. 20.3:
Data acquisition.
Data storage.
Data access.

Fig. 20.3 Main components of data warehousing

20.2.2.1 Data Acquisition


The data acquisition component of a data warehouse is responsible for collecting data from legacy systems and converting them into a usable form for the users. It is responsible for importing and exporting data from legacy systems, and includes all of the programs, applications, data-staging areas and legacy system interfaces that are responsible for pulling the data out of the legacy systems, preparing them, loading them into the warehouse itself and exporting them out again when required. The data acquisition component does the following (a minimal SQL sketch of such a load step follows the list below):
Identification of data from legacy and other systems.
Validation of data about the accuracy, appropriateness and usability.
Extraction of data from original source.
Cleansing of data by eliminating meaningless values and making it usable.
Data formatting.
Data standardisation by getting them into a consistent form.
Data matching and reconciliation.
Data merging by taking data from different sources and consolidating into one place.
Data purging by eliminating duplicate and erroneous information.
Establishing referential integrity.
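As a minimal sketch of the extraction, cleansing, standardisation and purging steps listed above, the following SQL loads customer rows from a hypothetical legacy table into a hypothetical staging table; all object names are assumptions for illustration only.

-- Extract, cleanse and standardise legacy customer rows, then load them
-- into the warehouse staging area
INSERT INTO stg_customer (cust_id, cust_name, country_code, created_on)
SELECT cust_id,
       UPPER(TRIM(cust_name)),            -- standardise the name format
       COALESCE(country_code, 'UNKNOWN'), -- replace missing values
       created_on
FROM   legacy_customer
WHERE  cust_id IS NOT NULL                -- purge unusable rows
  AND  created_on >= DATE '2005-01-01';   -- select only relevant records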

20.2.2.2 Data Storage


The data storage is the centre of the data-warehousing system and is the data warehouse itself. It is a large, physical database that holds a vast amount of information from a wide variety of sources. The data within the data warehouse are organised so that they are easy to find and use, and can be updated frequently from their sources.

20.2.2.3 Data Access


Data access component provides the end-users with access to the stored
warehouse information through the use of specialised end-user tools. Data
access component of data warehouse includes all of the different data-
mining applications that make use of the information stored in the
warehouse. Data mining access tools have various categories such as query
and reporting, on-line analytical processing (OLAP), statistics, data
discovery and graphical and geographical information systems.

20.2.3 Characteristics of Data Warehouses


As listed by E.F. Codd (1993), data warehouses have the following distinct
characteristics:
Multidimensional conceptual view.
Generic dimensionality.
Unlimited dimensions and aggregation levels.
Unrestricted cross-dimensional operations.
Dynamic sparse matrix handling.
Client/server architecture.
Multi-user support.
Accessibility.
Transparency.
Intuitive data manipulation.
Consistent reporting performance.
Flexible reporting.

20.2.4 Benefits of Data Warehouses


High returns on investments (ROI).
More cost-effective decision-making.
Competitive advantage.
Better enterprise intelligence.
Increased productivity of corporate decision-makers.
Enhanced customer service
Business and information re-engineering.

20.2.5 Limitations of Data Warehouses


It is query-intensive.
Data warehouses themselves tend to be very large, perhaps in the order of 600 GB; as a result, performance tuning is hard.
Scalability can be a problem.
Hidden problems with various sources.
Increased end-user demands.
Data homogenisation.
High demand of resources.
High maintenance.
Complexity of integration.

20.3 DATA WAREHOUSE ARCHITECTURE

The data warehouse structure is based on a relational database management system server that functions as the central repository for informational data. In the data warehouse structure, operational data and processing are completely separate from data warehouse processing. This central
information repository is surrounded by a number of key components,
designed to make the entire environment functional, manageable and
accessible by both the operational systems that source data into the
warehouse and by end user query and analysis tools. Fig. 20.4 shows a
typical architecture of data warehouse.

Following are the main components of the data warehouse structure:
Operational and external data sources.
Data warehouse DBMS.
Repository system.
Data marts.
Application tools.
Management platform.
Information delivery system.
Typically, source data for the warehouse comes from the operational
applications or from an operational data store (ODS). The operational data
store (ODS) is a repository of current and integrated operational data used
for analysis. The ODS is often created when legacy operational systems are
found to be incapable of achieving reporting requirements. The ODS provides users with the ease of use of a relational database while remaining distant from the decision support functions of the data warehouse. The ODS is one of the more recent concepts in data warehousing. Its main purpose is to address the needs of users, particularly clerical and operational managers, for
an integrated view of current data. Data in ODS is subject-oriented,
integrated, volatile and current or near current. The subject-oriented and
integrated correspond to modelled and reconciled, while volatile and current
or near-current correspond to read/write, transient and current. The data
processed by ODS is a mixture of real-time and reconciled.
As the data enters the data warehouse, it is transformed into an integrated
structure and format. The transformation process may involve conversion,
summarisation, filtering and condensation of data. Because data within the data warehouse contain large historical components (sometimes 5 to 10
years), the data warehouse must be capable of holding and managing large
volumes of data as well as different data structures for the same database
over time.
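A typical periodic load illustrating this transformation might look like the hedged SQL below: yesterday's operational order lines are summarised and appended as a dated snapshot to a hypothetical warehouse fact table, preserving the non-volatile, time-variant nature of the warehouse. Table names and the date arithmetic syntax are assumptions and vary between DBMSs.

-- Summarise yesterday's operational sales and append them as a snapshot
INSERT INTO dw_sales_fact (snapshot_date, region, product_line, sales_amount)
SELECT CURRENT_DATE - INTERVAL '1' DAY,
       s.region,
       p.product_line,
       SUM(o.amount)
FROM   order_line o
       JOIN product p ON p.product_id = o.product_id
       JOIN store   s ON s.store_id   = o.store_id
WHERE  o.order_date = CURRENT_DATE - INTERVAL '1' DAY
GROUP  BY s.region, p.product_line;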
Fig. 20.4 Architecture of data warehousing

The central data warehouse DBMS is a cornerstone of the data warehousing environment. It is almost always implemented on the relational database
management system (RDBMS) technology. However, different technology
approaches, such as parallel databases, multi-relational databases (MRDBs),
multidimensional databases (MDDBs) and so on are also being used in data
warehouse environment to fulfil the need for flexible user view creation
including aggregates, multi-table joins and drill-downs. Data warehouse also
includes metadata, which is data about data that describes the data
warehouse. It is used for building, maintaining, managing and using the data
warehouse. Metadata provides interactive access to users to help understand
content and find data. Metadata management is provided via metadata
repository and accompanying software. Metadata repository software can be
used to map the source data to the target database, generate code for data
transformations, integrate and transform the data, and control moving data to
the warehouse. This software typically runs on a workstation and enables
users to specify how the data should be transformed, such as by data
mapping, conversion, and summarisation. Metadata repository maintains
information directory that helps technical and business users to exploit the
power of data warehousing. This directory helps integrate, maintain, and
view the contents of the data warehousing system.
Multidimensional databases (MDDBs) are tightly coupled with the online
analytical processing (OLAP) and other application tools that act as clients
to the multidimensional data stores. These tools architecturally belong to a
group of data warehousing components jointly categorised as the data query,
reporting, analysis and mining tools. These tools provide information to
business users for strategic decision making.

20.3.1 Data Marts


Data mart is a generalised term used to describe data warehouse
environments that are somehow smaller than others. It is a subsidiary of data
warehouse of integrated data. It is a localised, single-purpose data warehouse
implementation. Data mart is a relative and subjective term often used to
describe small, single-purpose minidata warehouses. The data mart is
directed at a partition of data that is created for the use of a dedicated group
of users. It delivers specific data to groups of users as required. Thus, data
mart can be defined as “a specialised, subject-oriented, integrated, time-
variant, volatile data store in support of specific subset of management’s
decisions”. A data mart may contain summarised, de-normalized, or
aggregated departmental data and can be customised to suit the needs of a
particular department that owns the data. Data mart is used to describe an
approach in which each individual department of a big enterprise
implements its own management information system (MIS), often based on
a large, parallel, relational database or on a smaller multidimensional or
spreadsheet-like system.
In a large enterprise, data marts tend to be a way to build a data warehouse
in a sequential, phased approach. A collection of data marts composes an
enterprise-wide data warehouse. Conversely, a data warehouse may be construed as a collection of data marts. Normally, data marts are resident on separate database servers, often on a local area network serving a dedicated user group. Data marts use automated data replication tools to populate the new databases, rather than the manual processes and specially developed programs that were used previously.
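A departmental data mart is often populated as a summarised subset of the warehouse. The sketch below assumes a hypothetical warehouse fact table DW_SALES_FACT and uses CREATE TABLE ... AS SELECT, which is supported, with minor syntax variations, by most relational DBMSs.

-- Build a sales data mart for one department as a summarised subset
CREATE TABLE sales_mart AS
SELECT region,
       product_line,
       sales_month,
       SUM(sales_amount) AS total_sales,
       COUNT(*)          AS num_transactions
FROM   dw_sales_fact
GROUP  BY region, product_line, sales_month;

Because the mart holds only the summarised rows a department needs, it can be refreshed and queried independently of the enterprise warehouse.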

20.3.1.1 Advantages of Data Marts


Data marts enable departments to customise the data as it flows into the data mart from the
data warehouse. There is no need for the data in the data mart to serve the entire enterprise.
Therefore, the department can summarise, sort, select and structure their own department’s
data independently.
Data marts enable a department to select a much smaller amount of historical data than that which is found in the data warehouse.
The department can select software for their data mart that is tailored to fit their needs.
Very cost-effective.

20.3.1.2 Limitations of Data Marts


Once in production, data marts are difficult to extend for use by other departments because of
inherent design limitations in building for a single set of business needs, and disruption of
existing users caused by any expansion of scope.
Scalability problem in situations where an initial small data mart grows quickly in multiple
dimensions.
Data integration problem.

20.3.2 Online Analytical Processing (OLAP)


Online analytical processing (OLAP) is an advanced data analysis
environment that supports decision making, business modelling and
operations research activities. It may be defined as the dynamic synthesis,
analysis and consolidation of large volumes of multi-dimensional data. It is
the interactive process of creating, managing, analysing and reporting on
data. OLAP is the use of a set of graphical tools that provides users with
multidimensional views of their data and allows them to analyse the data
using simple widowing techniques. Table 20.2 illustrates difference between
operational (one-dimensional) view and the multidimensional view.

Fig. 20.5 Operational and multidimensional views of data

(a)Operational view of data

(b)Multidimensional view of data

As can be seen in Fig. 20.5 (a), the tabular view (in the case of operational data) of sales data is not well-suited to decision support, because the relationship INVOICE → PRODUCT_LINE between INVOICE and PRODUCT_LINE does not provide a business perspective of the sales data. On the other hand, the end-users’ view of sales data from a business perspective is more closely represented by the multidimensional view of sales than by the tabular view of separate tables, as shown in Fig. 20.5 (b). It can also be noted that the multidimensional view allows end-users to consolidate or aggregate data at different levels, for example, total sales figures by customers and by date. The multidimensional view of data also allows a business data analyst to easily switch business perspectives from sales by customers to sales by division, by region, by products and so on.
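In SQL terms, this kind of consolidation at several aggregation levels can be expressed with the ROLLUP extension to GROUP BY (part of the SQL:1999 OLAP features and supported by most major DBMSs); the fact table DW_SALES_FACT and its columns are hypothetical.

-- One query returns sales totals by store, by city, by region and overall
SELECT region, city, store_id, SUM(sales_amount) AS total_sales
FROM   dw_sales_fact
GROUP  BY ROLLUP (region, city, store_id);

Each level of subtotal corresponds to one step of rolling up the dimension hierarchy region, city, store.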
OLAP is a database interface tool that allows users to quickly navigate
within their data. The term OLAP was coined in a white paper written for
Arbor Software Corporation in 1993. OLAP tools are based on
multidimensional databases (MDDBs). These tools allow the users to
analyse the data using elaborate, multidimensional and complex views.
These tools assume that the data is organised in a multidimensional model
that is supported by a special multidimensional database (MDDB) or by a
relational database designed to enable multidimensional properties, such as
multi-relational database (MRDB). OLAP tool is very useful in business
applications such as sales forecasting, product performance and profitability,
capacity planning, effectiveness of a marketing campaign or sales program
and so on. In summary, OLAP systems have the following main
characteristics:
Uses multidimensional data analysis techniques.
Provides advanced database support.
Provides easy-to-use end-user interfaces.
Supports client/server architecture.

20.3.2.1 OLAP Classifications


The OLAP tools can be classified as follows:
Multidimensional-OLAP (MOLAP) tools that operate on multidimensional data stores.
Relational-OLAP (ROLAP) tools that access data directly from a relational database.
Hybrid-OLAP (HOLAP) tools that combine the capabilities of both MOLAP and ROLAP
tools.
20.3.2.2 Commercial OLAP Tools
Some of the popular commercial OLAP tools available in the market are as
follows:
Essbase from Arbor/Hyperion.
Oracle Express from Oracle Corporation.
Cognos PowerPlay.
Microstrategy decision support system (DSS) server.
Microsoft decision support service.
Prodea from Platinum Technologies.
MetaCube from Informix.
Brio Technologies.

20.4 DATA MINING

Data mining may be defined as the process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions. It is a tool that allows end-users direct access to and manipulation of data from within the data warehousing environment, without the intervention of customised programming activity. In other words, data mining helps end-users extract useful business information from large databases. Data mining is related to the subarea of statistics called exploratory data analysis and to the subareas of artificial intelligence called knowledge discovery and machine learning.
Data mining has a collection of techniques that aim to find useful but
undiscovered patterns in collected data. It is used to describe those
applications of either statistical analysis or data discovery products, which
analyse large populations of unknown data to identify hidden patterns or
characteristics that might be useful for the business. Thus, the goal of data
mining is to create models for decision-making that predict future behaviour
based on analyses of past activity. Data mining supports knowledge
discovery defined by William Frawley and Gregory Piatetsky-Shapiro (MIT
Press, 1991) as the nontrivial extraction of implicit, previously unknown and
potentially useful information from data. Data mining applications can
leverage the data preparation and integration capabilities of data
warehousing and can help the business in achieving sustainable competitive
advantage.
As discussed above, data mining helps in extracting meaningful new
patterns that cannot be found necessarily by merely querying or processing
data or metadata in the data warehouse. Therefore, data mining applications
should be strongly considered early, during design of a data warehouse.
Also, data mining tools should be designed to facilitate their use in
conjunction with data warehouses.

20.4.1 Data Mining Process


Data mining process is a step forward towards turning data into knowledge,
also called knowledge discovery in databases (KDD). As shown in Fig. 20.6,
the knowledge discovery process comprises six steps, namely data selection,
pre-processing (data cleaning and enrichment), data transformation or
encoding, data mining and the reporting and display of the discovered
information (knowledge delivery). The raw data first undergoes data
selection step in which the target dataset and relevant attributes are
identified. In the pre-processing or data cleaning step, noise and outliers are
removed, field values are transformed to common units, new fields are
generated through combination of existing fields and data is brought into the
relational schema that is used as input to the data mining activity. In the data
mining step, actual patterns are extracted. The pattern is finally presented in
an understandable form to the end user. The results of any step in KDD
process might lead back to an earlier step in order to redo the process with
the new knowledge gained.
The data mining process focuses on optimising the data processing. It shows how to move from raw data to useful patterns to knowledge. Data mining addresses inductive knowledge, in which new rules and patterns are discovered from the supplied data. The better the data mining tool, the
more automated and painless is the transition from one step to the next.

20.4.2 Data Mining Knowledge Discovery


Data mining techniques result in different new knowledge discoveries, which can be represented in different forms, for example, in the form of rules and patterns, decision trees, semantic networks, neural networks and so on. Data mining may result in the discovery of the following new types of information or knowledge:
Association rules: In this case, the database is regarded as a collection of transactions, each involving a set of items. Association rules correlate the presence of a set of items with another range of values for another set of variables (see the SQL sketch after this list for the basic support-counting step). For example, (a) whenever a customer buys video equipment, he or she also buys another electronic gadget such as blank tapes or memory chips, (b) when a female customer buys a leather handbag, she is likely to buy a money purse, (c) when a customer releases an order for a specific item in a quarter, he or she is likely to release an order in the subsequent quarter also, and so on.
Classification trees: Classification is the process of learning a model that describes
different classes of data. The classes are predetermined. The classification rule creates a
hierarchy of classes from an existing set of events. For example, (a) a population may be
divided into several ranges of credit worthiness based on a history of previous credit
transactions, (b) mutual funds may be classified based on performance data using
characteristics such as growth, income and stability, (c) a customer may be classified by
frequency of visits, by types of financing used, by amount of purchase or by affinity for types
of items and some revealing statistics may be generated for such classes, (d) in a banking
application, customers who apply for a credit card may be classified as a “poor risk”, a “fair
risk” or a “good risk”.
Fig. 20.6 Data mining process

Sequential patterns: This rule defines a sequential pattern of transactions (events or actions). For example, (a) a customer who buys more than twice in the first quarter of the
year may be likely to buy at least once during the second quarter, (b) if a patient underwent
cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood
urea within a year of surgery, he or she is likely to suffer from kidney failure within the next
16 months.
Patterns within time series: This rule detects the similarities within positions of a time
series of data, which is a sequence of data taken at regular intervals such as daily sales or
daily closing stock prices. For example, (a) stocks of a utility company M/s KLY systems,
and a financial company M/s ABC Securities, showing the same pattern during 2005 in terms
of closing stock price, (b) two products showing the same selling pattern in summer but a
different one in winter and so on.
Clustering: In clustering, a given population of events or items can be partitioned or segmented into sets of similar elements. A set of records is partitioned into groups such that records within a group are similar to each other and records belonging to two different groups are dissimilar. Each such group is called a cluster and each record belongs to exactly one cluster. For example, the women population in India may be categorised into four major groups, from “most-likely-to-buy” to “least-likely-to-buy” a new product.
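
To make the idea of association rules concrete, the following sketch counts how often pairs of items occur together in a set of transactions and reports the pairs whose support and confidence exceed chosen thresholds. The transactions and the threshold values are invented for illustration; a production tool would use an efficient algorithm such as Apriori rather than this brute-force count.

from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions; each is the set of items in one sale.
transactions = [
    {"video equipment", "blank tapes"},
    {"video equipment", "memory chips"},
    {"leather handbag", "money purse"},
    {"video equipment", "blank tapes", "memory chips"},
    {"leather handbag"},
]

MIN_SUPPORT = 0.4      # the pair must occur in at least 40% of all transactions
MIN_CONFIDENCE = 0.6   # P(right-hand side | left-hand side) must be at least 60%

item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(frozenset(pair) for t in transactions
                      for pair in combinations(sorted(t), 2))

n = len(transactions)
for pair, count in pair_counts.items():
    support = count / n
    if support < MIN_SUPPORT:
        continue
    a, b = tuple(pair)
    for lhs, rhs in ((a, b), (b, a)):
        confidence = count / item_counts[lhs]
        if confidence >= MIN_CONFIDENCE:
            print(f"{lhs} => {rhs}  (support {support:.2f}, confidence {confidence:.2f})")

Run against these five transactions, the sketch reports rules such as "video equipment => blank tapes", mirroring example (a) in the association rules item above.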

20.4.3 Goals of Data Mining


Data mining is used to achieve certain goals, which fall into the following
broad categories:
Prediction: Data mining predicts the future behaviour of certain attributes within data. It
can show how certain attributes within data will behave in future. For example, on the basis
of analysis of buying transactions by customers, the data mining can predict what customer
will buy under certain discount or offering, how much sales volume will be generated in a
given period, what marketing and sales strategy would yield more profits and so on.
Similarly, on the basis of seismic wave patterns probability of an earthquake can be predicted
and so on.
Identification: Data mining can identify the existence of an event, item, or an activity on
the basis of the data patterns. For example, identification of authentication of a person or
group of persons accessing certain part of databases, identification of intruders trying to
break the system, identification of existence of gene based on certain sequences of nucleotide
symbols in the DNA sequence and so on.
Classification: Data mining can partition the data so that different classes or categories can
be identified based on combinations of parameters. For example, customers in a supermarket
can be categorised into discount-seeking customers, loyal and regular customers, brand-
specific customers, infrequent visiting customers and so on. This classification may be used
in different analysis of customer buying transactions as a post-mining activity.
Optimisation: Data mining can optimise the use of limited resources such as time, space,
money, or materials and to maximise output variables such as sales or profits under a given
set of constraints.

20.4.4 Data Mining Tools


There are various kinds of data mining tools and approaches to extract knowledge. Most data mining tools use Open Database Connectivity (ODBC), an industry standard interface for working with databases. It enables access to data in most of the popular database programs such as Access, Informix, Oracle and SQL Server (a minimal connection sketch follows the list below). Most of the tools work in the Microsoft Windows environment and a few in the UNIX operating system. The mining tools can be divided on the basis of several criteria. Some of these criteria are as follows:
Types of products.
Characteristics of the products.
Objectives or goals.
Roles of hardware, software and grayware in the delivery of information.
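
As a concrete illustration of the ODBC route mentioned above, the fragment below uses the third-party pyodbc package to open a connection through a data source name and pull rows for later analysis. The DSN, the credentials and the Sales table are hypothetical placeholders, not features of any particular product.

# Minimal ODBC sketch; requires the pyodbc package and a configured DSN.
import pyodbc

# "SalesDSN", the user name, the password and the Sales table are all made up.
conn = pyodbc.connect("DSN=SalesDSN;UID=analyst;PWD=secret")
cursor = conn.cursor()

# Pull the raw transactions that a mining tool would analyse further.
cursor.execute("SELECT customer_id, item, quantity FROM Sales WHERE quantity > 0")
for customer_id, item, quantity in cursor.fetchall():
    print(customer_id, item, quantity)

conn.close()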

20.4.4.1 Data Mining Tools Based on Types of Products


The data mining products can be divided into the following general types:
Query managers and report writers.
Spreadsheets.
Multidimensional databases.
Statistical analysis tools.
Artificial intelligence tools.
Advance analysis tools.
Graphical display tools.

20.4.4.2 Data Mining Tools Based on the Characteristics of the Products


There are several operational characteristics, which are shared by all data
mining products, such as:
Data-identification capabilities.
Output in several forms, for example, printed, green screen, standard graphics, enhanced full
graphics and so on.
Formatting capabilities, for example, raw data format, tabular, spreadsheet form,
multidimensional databases, visualisation and so on.
Computational facilities, such as columnar operation, cross-tab capabilities, spreadsheets,
multidimensional spreadsheets, rule-driven or trigger-driven computation and so on.
Specification management allowing end users to write and manage their own specifications.
Execution management.

20.4.4.3 Data Mining Tools Based on the Objectives or Goals


All application development programs and data mining tools fall into the
following three operational categories:
Data collection and retrieval.
Operational monitoring.
Exploration and discovery.

Since data collection and retrieval is the traditional role of on-line transaction processing (legacy or operational) systems, data mining tools are rarely applied in this category.
More than half of the data mining tools fall under the operational monitoring category and are applied to keep tabs on business operations and to support effective decision-making. They include query managers, report writers, spreadsheets, multidimensional databases and visualisation tools.
The exploration and discovery process is used to discover new things about how to run the business more efficiently. The rest of the data mining tools, such as statistical analysis, artificial intelligence, neural networks, advanced statistical analysis and advanced visualisation products, fall under this category. Data mining tools are best used in the exploration and discovery process.

20.4.4.4 Objectives of Data Mining Tools


Data mining tools achieve the following major objectives:
Significant improvement in the overall operational efficiency of an organisation by making
data monitoring cycle easier, faster and more efficient.
Better prediction of future activities by exploration of data and then applying analytical
criteria to discover new cause-and-effect relationships.

Table 20.3 shows some representative commercial data mining tools.

20.4.5 Data Mining Applications


The data mining technologies can be applied to a large variety of decision-
making in business environment such as marketing, finance, manufacturing,
health care and so on. Data mining applications include the following:
Marketing: (a) analysis of customer behaviour based on buying patterns, (b) identification of customer defection patterns and customer retention through preventive actions, (c) determination of marketing strategies such as advertising and warehouse location, (d) segmentation of customers, products and warehouses, (e) design of catalogues, warehouse layouts and advertising campaigns, (f) delivering superior sales and customer service by proper aggregation and delivery of information to front-end sales and service professionals, (g) providing accurate information and executing retention campaigns, lifetime value analysis, trending and targeted promotions, (h) identifying markets with above or below average growth, (i) identifying products that are purchased concurrently, or the characteristics of shoppers for certain product groups, (j) market basket analysis.
Table 20.3 Commercial data mining tools

Finance: (a) analysis of creditworthiness of clients, (b) segmentation of account


receivables, (c) performance analysis of finance investments such as stocks, mutual funds,
bonds and so on (d) risk assessment and fraud detection.
Manufacturing: (a) optimization of resources such as manpower, machines, materials,
energy and so on (b) optimal design of manufacturing processes, (c) product design, (d)
discovering the cause of production problems, (e) identifying usage patterns for products and
services.
Banking: (a) detecting patterns of fraudulent credit card use, (b) identifying loyal
customers, (c) predicting customers likely to change their credit card affiliation, (d)
determining credit card spending by customer groups.
Health Care: (a) discovering patterns in radiological images, (b) analysing side effects of
drugs, (c) characterising patient behaviour to predict surgery visits, (d) identifying successful
medical therapies for different illnesses.
Insurance: (a) claims analysis, (b) predicting which customers will buy new policies.
Other applications: Comparing campaign strategies for effectiveness.

REVIEW QUESTIONS
1. What is a data warehouse? How does it differ from a database?
2. What are the goals of a data warehouse?
3. What are the characteristics of a data warehouse?
4. What are the different components of a data warehouse? Explain with the help of a diagram.
5. List the benefits and limitations of a data warehouse.
6. Discuss what is meant by the following terms when describing the characteristics of the data
in a data warehouse:

a. subject-oriented
b. integrated
c. time-variant
d. non-volatile.

7. Differentiate between the operational database and data warehouse.


8. Define the terms OLAP, ROLAP and MOLAP.
9. What are OLAP classifications? List the OLAP tools available for commercial applications.
10. Describe the evolution of data warehousing with the help of a diagram.
11. Present a diagrammatic representation of the typical architecture and main components of a
data warehouse.
12. Describe the characteristics of a data warehouse.
13. What are data marts? What are their advantages and limitations?
14. What is the difference between a data warehouse and data marts?
15. What is data mining? What are its goals?
16. What is the difference between data warehouse and data mining?
17. What are the different phases of data mining process?
18. What do you understand by data mining knowledge discovery? Explain.
19. What are different types of data mining tools? What are their goals?
20. List the applications of data mining.

STATE TRUE/FALSE

1. In a data warehouse, data once loaded is not changed.


2. Metadata is data about data.
3. A data warehouse is a collection of computer-based information that is critical to successful
execution of organisation’s initiatives.
4. Data in a data warehouse differ from operational systems data in that they can only be read,
not modified.
5. Data warehouse provides storage, functionality and responsiveness to queries beyond the
capabilities of transaction-oriented databases.
6. As the end-user computing was emerging, the computing started shifting from a business-
driven information technology strategy to a data-processing approach.
7. In the data warehouse structure, operational data and processing is completely separate from
data warehouse processing.
8. The ODS is often created when legacy operational systems are found to be incapable of
achieving reporting requirements.
9. Once in production, data marts are difficult to extend for use by other departments.
10. OLAP is a database interface tool that allows users to quickly navigate within their data.
11. Data mining is the process of extracting valid, previously unknown, comprehensible and
actionable information from large databases and using it to make crucial business decisions.
12. The goal of data mining is to create models for decision-making that predict future behaviour
based on analyses of past activity.
13. Data mining predicts the future behaviour of certain attributes within data.
14. In the association rules in data mining, the database is regarded as a collection of
transactions, each involving a set of item.
15. In data mining, classification is the process of learning a model that describes different
classes of data.

TICK (✓) THE APPROPRIATE ANSWER

1. Which of the following is a characteristic of the data in a data warehouse?

a. Non-volatile.
b. Subject-oriented.
c. Time-variant.
d. All of these.

2. Data warehouse is a special type of database with a

a. Single archive of data.


b. Consistent archive of data.
c. Complete archive of data.
d. All of these.

3. Data warehouses extract information for strategic use of the organisation in reducing costs
and improving revenues, out of

a. legacy systems.
b. secondary storage.
c. main memory.
d. None of these.

4. The advancements in technology and the development of microcomputers (PCs) along with
data- orientation in form of relational databases, drove the emergence of end-user computing
during
a. 1970s and 1980s.
b. 1980s and 1990s.
c. 1990s and 2000s.
d. the start of the 21st century.

5. Which of the following is an advantage of data warehousing?

a. Better enterprise intelligence.


b. Business reengineering.
c. Cost-effective decision-making.
d. All of these.

6. Data warehousing concept started in the

a. 1970s.
b. 1980s.
c. 1990s.
d. early 2000.

7. Which of the following technological advances in data modelling, databases and application
development methods resulted into paradigm shift from information system (IS) approach to
business- driven warehouse implementations?

a. Data modelling.
b. Databases.
c. Application development methods.
d. All of these.

8. Data acquisition component of data warehouse is responsible for

a. collection of data from legacy systems.


b. convert legacy data into usable form for the users.
c. importing and exporting data from legacy system.
d. All of these.

9. Data access component of data warehouse is responsible for

a. providing the end-users with access to the stored warehouse information through the
use of specialised end-user tools.
b. collecting the data from legacy system and convert them into usable form for the
users.
c. holding a vast amount of information from a wide variety of sources.
d. None of these.

10. Data acquisition component of data warehouse is responsible for

a. providing the end-users with access to the stored warehouse information through the
use of specialised end-user tools.
b. collecting the data from legacy system and convert them into usable form for the
users.
c. holding a vast amount of information from a wide variety of sources.
d. None of these.

11. Data storage component of data warehouse is responsible for

a. providing the end-users with access to the stored warehouse information through the
use of specialised end-user tools.
b. collecting the data from legacy system and convert them into usable form for the
users.
c. holding a vast amount of information from a wide variety of sources.
d. None of these.

12. The data warehouse structure is based on

a. a relational database management system server that functions as the central


repository for informational data.
b. a network database system.
c. Both (a) and (b).
d. None of these.

13. In the data warehouse structure,

a. operational data and processing is completely separate from data warehouse


processing.
b. operational data and processing is part of data warehouse processing.
c. Both (a) and (b).
d. None of these.

14. Data in ODS is

a. subject-oriented.
b. integrated.
c. volatile.
d. All of these.

15. Data mart is a data store which is

a. specialised and subject-oriented.


b. integrated and time-variant.
c. volatile data store.
d. All of these.

16. A data mart may contain

a. summarised data.
b. de-normalised data.
c. aggregated departmental data.
d. All of these.

17. Online analytical processing (OLAP) is an advanced data analysis environment that supports

a. decision making.
b. business modelling.
c. operations research activities.
d. All of these.

18. OLAP is

a. a dynamic synthesis of multidimensional data.


b. an analysis of multidimensional data.
c. a consolidation of large volumes of multi-dimensional data.
d. All of these.

19. Data mining is

a. the process of extracting valid, previously unknown, comprehensible and actionable


information from large databases and using it to make crucial business decisions.
b. a tool that allows end-users direct access and manipulation of data from within data-
warehousing environment without the intervention of customised programming
activity.
c. a tool that helps end users extract useful business information from large databases.
d. All of these.

20. The expanded form of KDD is

a. Knowledge Discovery in Databases.


b. Knowledge of Decision making in Databases.
c. Knowledge-based Decision Data.
d. Karnough Decision Database.

FILL IN THE BLANKS

1. _____ is the extraction of hidden predictive information from large databases.


2. The bulk of data in a data warehouse resides in the _____.
3. There are four components in the definition of a data warehouse namely (a) _____, (b)
_____, (c) _____ and (d) _____.
4. As the end-user computing was emerging, the computing started shifting from a data-
processing approach to a _____ strategy.
5. The three main components of data warehousing are (a) _____, (b) _____ and (c ) _____.
6. The operational data store (ODS) is a repository of _____ and _____ operational data used
for analysis.
7. In ODS, the subject-oriented and integrated correspond to _____ and _____, while volatile
and current or near-current correspond to _____, _____ and _____.
8. The term OLAP was coined in a white paper written for _____ in the year _____.
9. Data mining is related to the subarea of statistics called _____ and subarea of artificial
intelligence called _____ and _____.
10. Data mining process is a step forward towards turning data into knowledge called _____.
11. The six steps of knowledge discovery process are (a) _____, (b) _____, (c) _____, (d )
_____, (e) _____ and (f) _____.
12. Four main goals of data mining are (a) _____, (b) _____, (c) _____ and (d) _____.
13. Data mining predicts the future behaviour of certain _____ within _____.
14. Data mining can identify the existence of an event, item or an activity on the basis of the
_____.
Chapter 21

Emerging Database Technologies

21.1 INTRODUCTION

In the preceding chapters of the book, we have discussed a variety of issues


related to databases, such as the basic concepts, architecture and
organisation, design, query and transaction processing, security and other
database management issues. We also covered other advanced data management systems in Chapters 15 to 20, such as object-oriented databases, distributed and parallel databases, and data warehousing and data mining, which provide very large databases and tools for the decision-support process.
Thus, for most of the history of databases, the types of data stored in the
databases were relatively simple. In the past few years, however, there has
been an increasing need for handling new data types in databases, such as
temporal data, spatial data, multimedia data, geographic data and so on.
Another major trend in the last decade has created its own issues, for
example, the growth of mobile computers, starting with laptop computers,
palmtop computers and pocket organisers. In more recent years, mobile
phones have also come with built-in computers. These trends have resulted
into the development of new database technologies to handle new data types
and applications.
In this chapter, some of the emerging database technologies have been briefly introduced. We have discussed how databases are used and accessed from the Internet using Web technologies, the use of mobile databases to allow users widespread and flexible access to data while being mobile, and multimedia databases providing support for the storage and processing of multimedia information. We have also introduced how to deal with geographic information data, or spatial data, and their applications.

21.2 INTERNET DATABASES

The Internet revolution of the late 1990s has resulted in an explosive growth of World Wide Web (WWW) technology and sharply increased direct user
access to databases. Organisations converted many of their phone interfaces
to databases into Web interfaces and made a variety of services and
information available on-line. The transaction requirements of organisations
have grown with increasing use of computers and the phenomenal growth in
the Web technology. These developments have created many sites with
millions of viewers and the increasing amount of data collected from these
viewers has produced extremely large databases at many companies.

21.2.1 Internet Technology


As its name suggests, the Internet is not a single homogeneous network. It is
an interconnected group of independently managed networks. Each network
supports the technical standards needed for inter-connection - the
Transmission Control Protocol/Internet Protocol (TCP/IP) family of
protocols and a common method for identifying computers - but in many
ways the separate networks are very different. The various sections of the
Internet use almost every kind of communications channel that can transmit
data. They range from fast and reliable to slow and erratic. They are
privately owned or operated as public utilities. They are paid for in different
ways. The Internet is sometimes called an information highway. A better
comparison would be the international transportation system, with
everything from airlines to dirt tracks.
Thus, the Internet may be defined as a network of networks, scattered
geographically all over the world. It is a worldwide collection of computer
networks connected by communication media that allow users to view and
transfer information between computers. Internet is made up of many
separate but interconnected networks belonging to commercial, educational
and government organisations and Internet Service Providers (ISPs). Thus,
the Internet is not a single organisation but cooperative efforts by multiple
organisations managing a variety of computers and different operating
systems. Fig. 21.1 illustrates a typical example of the Internet.

Fig. 21.1 Architecture of an Internet

21.2.1.1 Internet Services


Wide varieties of services are available on the Internet. Table 21.1 gives a
summary of the services available on the Internet.

Table 21.1 Internet services


Category: Communication
Electronic Mail: Electronic messages sent or received from one computer to another, commonly referred to as e-mail.
Newsgroups: Computer discussion groups where participants with common interests (like hobbies or professional associations) post messages, called “articles”, that can be read and responded to by other participants around the world via “electronic bulletin boards”.
Mailing Lists: Similar to newsgroups, except that participants exchange information via e-mail.
Chat: Real-time on-line conversations where participants type messages to other chat group participants and receive responses they can read on their screens.

Category: File Access
File Transfer Protocol (FTP): Sending (uploading) or receiving (downloading) computer files via the File Transfer Protocol (FTP) communication rules.

Category: Searching Tools
Search Engines: Programs that maintain indices of the contents of files at computers on the Internet. Users can use search engines to find files by searching the indices for specific words or phrases.

Category: World Wide Web (WWW)
Web Interfaces: A subset of the Internet using computers called Web servers that store multimedia files, or “pages”, that contain text, graphics, video, audio and links to other pages and that are accessed by software programs called Web browsers.

Category: E-commerce
Electronic Commerce: Customers can place and pay for orders via the business’s Web site.

Category: E-Business
Electronic Business: Complete integration of Internet technology into the economic infrastructure of the business.

Today, millions of people use the Internet to shop for goods and services,
listen to music, view network, conduct research, get stock quotes, keep up-
to-date with current events and send electronic mail to other Internet users.
More and more people are using the Internet at work and at home to view
and download multimedia computer files containing graphics, sound, video
and text.

Internet History

Historically, the Internet originated in two ways. One line of development


was the local area network that was created to link computers and terminals
within a department or an organisation. Many of the original concepts came
from Xerox’s Palo Alto Research Center. In the United States, universities
were pioneers in expanding small local networks into campus-wide
networks.
The second source of network was the national networks, known as wide
area networks. The best known of these was the ARPAnet, which, by the
mid 80s, linked about 150 computer science research organisations. In the
late 1960s, the US Department of Defense developed an internet of
dissimilar military computers called the Advanced Research Projects
Agency Network (ARPAnet). Its main purpose was to investigate how to
build networks that could withstand partial outages (like nuclear bomb
attacks) and still survive. Computers on this internet communicated by using
a newly developed standard of communication rules called the Transmission
Control Protocol/Internet Protocol (TCP/IP). The creators of ARPAnet also
developed a new technology called “packet switching”. The packet
switching allowed data transmission between computers by breaking up data
into smaller “packets” before being sent to its destination over a variety of
communication routes. The data was then assembled at its destination. These
changes in communication technology enabled data to be communicated
more efficiently between different types of computers and operating
systems.
Soon scientists and researchers at colleges and universities began using
this Internet to share data. In the 1980s, the military portion of this Internet became a separate entity called MILNET and the National Science
Foundation began overseeing the remaining non-military portions, which
were called the NSFnet. Thousands of other government, academic, and
business computer networks began connecting to the NSFnet. By the late
1980s, the term Internet became widely used to describe this huge
worldwide “network of networks”.

21.2.1.2 TCP/IP
The two basic protocols that hold the Internet together are TCP and IP, collectively known as TCP/IP; although usually mentioned together, they are two separate protocols.
The Internet Protocol (IP) joins together the separate network segments
that constitute the Internet. Every computer on the Internet has a unique
address, known as an IP address. The address consists of four numbers, each
in the range 0 to 255, such as 132.151.3.90. Within a computer, these are
stored as four bytes. When printed, the convention is to separate them with
periods as in this example. IP, the Internet Protocol, enables any computer on
the Internet to dispatch a message to any other, using the IP address. The
various parts of the Internet are connected by specialised computers, known
as “routers”. As their name implies, routers use the IP address to route each
message on the next stage of the journey to its destination. Messages on the
Internet are transmitted as short packets, typically a few hundred bytes in
length. A router simply receives a packet from one segment of the network
and dispatches it on its way. An IP router has no way of knowing whether
the packet ever reaches its ultimate destination.
The Transmission Control Protocol (TCP) is responsible for reliable delivery
of complete messages from one computer to another. On the sending
computer, an application program passes a message to the local TCP
software. TCP takes the message, divides it into packets, labels each with the
destination IP address and a sequence number and sends them out on the
network. At the receiving computer, each packet is acknowledged when
received. The packets are reassembled into a single message and handed
over to an application program.
TCP guarantees error-free delivery of messages, but it does not guarantee that they will be delivered punctually. Sometimes, as in the transmission of live audio or video, punctuality is more important than complete accuracy. If an occasional packet fails to arrive on time, the human ear would much prefer to lose a tiny section of the sound track rather than wait for the missing packet to be retransmitted, which would make the playback horribly jerky. Since TCP is unsuitable for such applications, they use an
alternate protocol, named UDP, which also runs over IP. With UDP, the
sending computer sends out a sequence of packets, hoping that they will
arrive. The protocol does its best, but makes no guarantee that any packets
ever arrive.
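
The contrast between TCP's reliable, ordered delivery and UDP's best-effort delivery is visible directly in the standard socket interface. The sketch below sends the same bytes over each protocol; the host name and port are arbitrary examples and assume some echo-style service is listening, so it is meant to be read rather than deployed as-is.

import socket

HOST, PORT = "example.org", 7        # hypothetical service; values are illustrative only
MESSAGE = b"hello, internet"

# TCP: connection-oriented, reliable, ordered delivery (SOCK_STREAM).
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as tcp:
    tcp.connect((HOST, PORT))        # a connection is set up before any data flows
    tcp.sendall(MESSAGE)             # TCP retransmits lost segments automatically

# UDP: connectionless, best-effort delivery (SOCK_DGRAM).
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp:
    udp.sendto(MESSAGE, (HOST, PORT))   # one datagram; no guarantee it ever arrives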

21.2.2 The World Wide Web (WWW)


The World Wide Web, or “the Web” as it is colloquially called, has been one of the great successes in the history of computing. Since its conception in 1989, the Web has become the most popular and powerful networked information system to date. The combination of Web technology and databases has resulted in many new opportunities for creating advanced database applications.
The World Wide Web is a subset of the Internet that uses computers called
Web servers to store multimedia files. These multimedia files are called Web
pages that are stored at locations known as Web sites. Web is a distributed
information system, based on hypertext and the Hypertext Transfer Protocol (HTTP), for providing, organising and accessing a wide variety of
resources such as text, video (images) and audio (sounds) that are available
via the Internet. Web is independent of computing platform and has lower
deployment and training costs. It also provides global application availability
to both users and organisations. Fig. 21.2 shows a typical architecture of web
databases.
The Web technology was developed in about 1990 by Tim Berners-Lee
and colleagues at CERN, the European research centre for high-energy
physics in Switzerland. It was made popular by the creation of a user
interface, known as Mosaic, which was developed by Marc Andreessen and
others at the University of Illinois, Urbana-Champaign. Mosaic was released
in 1993. Within a few years, numerous commercial versions of Mosaic
followed. The most widely used are the Netscape Navigator and Microsoft’s
Internet Explorer. These user interfaces are called Web browsers, or simply
browsers.

21.2.2.1 Features of Web


The basic reason for the success of the Web can be summarised succinctly.
It provides a convenient way to distribute information over the Internet.
Individuals can publish information and users can access that information by themselves,
with no training and no help from outsiders.
Only a small amount of computer knowledge is needed to establish a Web site and it is very easy to use a browser to access the information.

Fig. 21.2 Typical architecture of web databases

21.2.3 Web Technology


Technically, the Web is based on the following simple techniques:
Internet service providers (ISPs)
IP address
Hypertext Markup Language (HTML)
Hypertext Transfer Protocol (HTTP)
Uniform Resource Locators (URLs)
Multipurpose Internet Mail Extension (MIME) Data Types

21.2.3.1 Internet Service Providers (ISPs)


Internet service providers (ISPs) are commercial agents who maintain the
host computer, serve as gateway to the Internet and provide an electronic
mail box with facilities for sending and receiving e-mails. ISPs connect the
client computers to the host computers on the Internet. Commercial ISPs
usually charge for the access to the Internet and e-mail services. They supply
the communication protocols and front-end tools that are needed to access
the Internet.

21.2.3.2 Internet Protocol (IP) Address


All host computers on the Internet are identified by a unique address called
IP address. IP address consists of a series of numbers. Computers on the
Internet use these IP address numbers to communicate with each other. ISPs provide this IP address so that we can enter it as part of the setup process when we originally set up our communication connection to the ISP.

21.2.3.3 Hypertext Markup Language (HTML)


Hypertext Markup Language (HTML) is an Internet language for describing
the structure and appearance of text documents. It is used to create Web
pages stored at web sites. The Web pages can contain text, graphics, video,
audio and links to other areas of the same Web page, other Web pages at the
same Web site, or to a Web page at a different Web site. These links are
called hypertext link and are used to connect Web pages. They allow the user
to move from one Web page to another. When a link is clicked with the
mouse pointer, another area of same Web page, another Web page at the
same Web site, or a Web page at different Web site appears in the browser
window.
Fig. 21.3 shows a simple HTML file and how a typical browser might
display, or render, it.

Fig. 21.3 Sample HTML file

As shown in the example of Fig. 21.3, the HTML file contains both the text to be rendered and codes, known as tags, that describe the format or structure. The HTML tags can always be recognised by the angle brackets (< and >). Most HTML tags occur in pairs, with a “/” indicating the end of a pair.
Thus <title> and </title> enclose some text that is interpreted as a title. Some
of the HTML tags show format; thus <i> and </i> enclose text to be
rendered in italic and <br> shows a line break. Other tags show structure:
<p> and </p> delimit a paragraph and <h1> and </h1> bracket a level one
heading. Structural tags do not specify the format, which is left to the
browser.
For example, many browsers show the beginning of a paragraph by
inserting a blank line, but this is a stylistic convention determined by the
browser. This example also shows two features that are special to HTML and have been vital to the success of the Web. The first special feature is the ease of including colour images in Web pages. The tag:

<img src = “logo.gif”>

is an instruction to insert an image that is stored in a separate file. The


abbreviation “img” stands for “image” and “src” for “source”. The string
that follows is the name of the file in which the image is stored. The
introduction of this simple command by Mosaic brought colour images to
the Internet. Before the Web, Internet applications were drab. Common
applications used unformatted text with no images. The Web was the first widely used system to combine formatted text and colour images. Suddenly
the Internet came alive.
The second and even more important feature is the use of hyperlinks. Web
pages do not stand alone. They can link to other pages anywhere on the
Internet. In this example, there is one hyperlink, the tag:

<a href = “http://www.dlib.org/dlib.html”>

This tag is followed by a string of text terminated by </a>. When


displayed by a browser, as in the panel, the text string is highlighted; usually
it is printed in blue and underlined. The convention is simple. If something is
underlined in blue, the user can click on it and the hyperlink will be
executed. This convention is easy for both the user and the creator of the
Web page. In this example, the link is to an HTML page on another
computer, the home page of D-Lib Magazine.
One of the useful characteristics of HTML is that small mistakes in its syntax do not create any problem during execution. Other computing languages have strict syntax: omit a semi-colon in a computer program and the program fails or gives the wrong result. With HTML, if the mark-up is more or less right, most browsers will usually accept it. HTML is a simple, powerful and platform-independent document language.

21.2.3.4 Hypertext Transfer Protocol (HTTP)


Hypertext Transfer Protocol (HTTP) is a protocol for transferring HTML documents, such as Web pages, through the Internet between Web browsers and Web servers. It is a generic, stateless, object-oriented protocol for transmitting information between servers and clients. In computing, a protocol is a set of rules that are used to send messages between computer systems. A
typical protocol includes description of the formats to be used, the various
messages, the sequences in which they should be sent, appropriate
responses, error conditions and so on.
In addition to the transfer of a document, HTTP also provides powerful
features, such as the ability to execute programs with arguments supplied by
the user and deliver the results back as an HTML document. The basic
message type in HTTP is ‘get’. For example, clicking on the hyperlink with
the URL:

http://www.dlib.org/dlib.html

specifies an HTTP get command. An informal description of this command is:
Open a connection between the browser and the Web server that has the domain name
“www.dlib.org”.
Copy the file “dlib.html” from the Web server to the browser.
Close the connection.
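
The three informal steps above map almost directly onto a few lines of Python's standard http.client module: open a connection to the named Web server, ask it to copy the file, read the response and close the connection. The sketch simply assumes that the D-Lib server is reachable from the machine on which it runs.

import http.client

# Open a connection between this program and the Web server "www.dlib.org".
conn = http.client.HTTPConnection("www.dlib.org")

# Ask the server to copy the file "dlib.html" back to us.
conn.request("GET", "/dlib.html")
response = conn.getresponse()
print(response.status, response.reason)   # for example: 200 OK
html_document = response.read()           # the HTML document itself

# Close the connection.
conn.close()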

21.2.3.5 Uniform Resource Locator (URL)


Uniform resource locator (URL) is a key component of the Web. It is a
globally unique name for each document that can be accessed on the Web.
URL provides a simple addressing mechanism that allows the Web to link
information on computers all over the world. It is a string of alphanumeric
characters that represent the location or address of a resource on the Internet
and how that resource should be accessed. URL is a special code to identify
each Web page on the World Wide Web. An example of a URL is

http://www.google.co.in/google.html
This URL has three parts:

http = Internet communication protocol for transferring HTML documents
www.google.co.in = descriptive (domain) name of the Web server (computer) that contains the Web pages
google.html = file on the Web server

Some URLs are very lengthy and contain additional information about the
path and file name of the Web page. URLs can also contain the identifier of
a program located on the Web server, as well as arguments to be given to the
program. An example of such URL is given below:

http://www.google.co.in/topic/search?q=database

In the above example, “/topic/search” is the path of a program on the Web server and “q=database” is an argument passed to that program on the server www.google.co.in. Using the given argument, the program executes and returns an HTML document, which is then sent to the front end.
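
Python's standard urllib.parse module splits a URL into exactly the parts discussed above (the protocol, the server name, the path and the query arguments), which is a convenient way to check one's reading of a long URL. The Google URL from the text is reused here purely as an example.

from urllib.parse import urlparse, parse_qs

url = "http://www.google.co.in/topic/search?q=database"
parts = urlparse(url)

print(parts.scheme)            # 'http'              -> the transfer protocol
print(parts.netloc)            # 'www.google.co.in'  -> domain name of the Web server
print(parts.path)              # '/topic/search'     -> path of the resource or program
print(parse_qs(parts.query))   # {'q': ['database']} -> arguments passed to the program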

21.2.3.6 Multipurpose Internet Mail Extension (MIME) Data Types


A file of data in a computer is simply a set of bits, but, to be useful the bits
need to be interpreted. Thus, in the previous example, in order to display the
file “google.html” correctly, the browser must know that it is in the HTML
format. The interpretation depends upon the data type of the file. Common
data types are “html” for a file of text that is marked up in HTML format and “jpeg” for a file that represents an image encoded in the JPEG format.
In the Web and in a wide variety of Internet applications, the data type is
specified by a scheme called MIME, also called Internet Media Types.
MIME was originally developed to describe information sent by electronic
mail. It uses a two-part encoding, a generic part and a specific part. Thus text/ascii is the MIME type for text encoded in ASCII, image/jpeg is the type for an image in the JPEG format and text/html is text marked up with HTML tags. There is a standard set of MIME types that are used by numerous
computer programs and additional data types can be described using
experimental tags.
The importance of MIME types in the Web is that the data transmitted by
an HTTP get command has a MIME type associated with it. Thus, the file
“dlib.html” has the MIME type text/html. When the browser receives a file
of this type, it knows that the appropriate way to handle this file is to render
it as HTML text and display it in the screen.
Many computer systems use file names as a crude method of recording
data types. Thus, some Windows programs use file names that end in “.htm” for files of HTML data and Unix computers use “.html” for the same purpose.
MIME types are a more flexible and systematic method to record and
transmit typed data.
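
The mapping from file names to Internet media types that the text describes is available in Python's standard mimetypes module; the short sketch below simply guesses the MIME type for a few of the file names mentioned above.

import mimetypes

for name in ("dlib.html", "index.htm", "logo.gif", "photo.jpeg"):
    mime_type, _encoding = mimetypes.guess_type(name)
    print(f"{name:12s} -> {mime_type}")
# Typical output: text/html, text/html, image/gif and image/jpeg respectively.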

21.2.4 Web Databases


The Web is used as a front end to databases, which can run on any computer system. There is no need to download any special-purpose software to access the information. One of the most popular uses of the Web is the viewing,
searching and filtering of data. Whether you are using a search engine to find
a specific Web site, or browsing Amazon.com’s product listings, you are
accessing collections of Web-enabled data. Database information can be
published on the Web in two different formats:
Static Web publishing.
Dynamic Web publishing.

Static Web publishing simply involves creating a list or report, based on


the information stored in a database and publishing it online. Static Web
publishing is a good way to publish information that does not change often
and does not need to be filtered or searched. Dynamic Web publishing
involves creating pages “on the fly” based on information stored in the
database.
21.2.4.1 Web Database Tools
The most commonly used tools for creating Web databases are
Common gateway interface (CGI) tool.
Extensible Markup Language (XML).

Common Gateway Interface (CGI): Common Gateway Interface, also


known as CGI, is one of the most commonly used tools for creating Web
databases. The Common Gateway Interface (CGI) is a standard for
interfacing external applications with information servers, such as HTTP or
Web servers. A plain HTML document that the Web daemon retrieves is
static, which means it exists in a constant state: a text file that does not
change. A CGI program, on the other hand, is executed in real-time, so that
it can output dynamic information.
For example, let us assume that we wanted to “hook up” our Unix
database to the World Wide Web, to allow people from all over the world to
query it. Basically, we need to create a CGI program that the Web daemon
will execute to transmit information to the database engine and receive the
results back again and display them to the client.
The database example is a simple idea, but most of the time rather
difficult to implement. There really is no limit as to what we can hook up to
the Web. The only thing we need to remember is that whatever our CGI
program does, it should not take too long to process. Otherwise, the user will
just be staring at their browser, waiting for something to happen.
Since a CGI program is executable, it is basically the equivalent of letting
the world run a program on our system, which is not the safest thing to do.
Therefore, there are some security precautions that need to be implemented
when it comes to using CGI programs. Probably the one that will affect the
typical Web user the most is the fact that CGI programs need to reside in a
special directory, so that the Web server knows to execute the program rather
than just display it to the browser. This directory is usually under direct
control of the webmaster, prohibiting the average user from creating CGI
programs. There are other ways to allow access to CGI scripts, but it is up to
the webmaster to set these up for us.
With the NCSA HTTPd server distribution, a directory called /cgi-bin is available. This is the special directory where all the CGI programs reside. A CGI program can be written in any language that allows
it to be executed on the system, such as C/C++, Fortran, PERL, TCL, any Unix shell, Visual Basic or AppleScript. It just depends on what is available on the system. If we use a programming language like C or Fortran, we must compile the program before it will run; the source code for some of the CGI programs in the /cgi-bin directory can be found in the /cgi-src directory that came with the server distribution. If, however, we use one of the scripting languages instead, such as PERL, TCL or a Unix shell, the script itself only needs to reside in the /cgi-bin directory, since there is no
associated source code. Many people prefer to write CGI scripts instead of
programs, since they are easier to debug, modify and maintain than a typical
compiled program.
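
A CGI program of the kind described above only has to write an HTTP header, a blank line and then an HTML document to its standard output. The sketch below is a minimal Python script that echoes a query parameter back to the browser; the parameter name "q" and the canned response stand in for the database lookup a real program would perform.

#!/usr/bin/env python3
# Minimal CGI sketch: the Web server runs this script from /cgi-bin and returns its output.
import os
from urllib.parse import parse_qs

query = parse_qs(os.environ.get("QUERY_STRING", ""))
term = query.get("q", ["(nothing)"])[0]

# A real CGI program would pass "term" to the database engine and format the results.
print("Content-Type: text/html")   # HTTP header describing the MIME type of the output
print()                            # blank line separating the header from the body
print(f"<html><body><h1>Search results</h1>"
      f"<p>You searched for: {term}</p></body></html>")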

Extensible Markup Language (XML): XML is a meta-language for


describing markup languages for documents containing structured
information. Structured information contains both content (words, pictures
and so on) and some indication of what role that content plays (for example,
content in a section heading has a different meaning from content in a
footnote, which means something different than content in a figure caption
or content in a database table and so on). Almost all documents have some
structure. A markup language is a mechanism to identify structures in a
document. The XML specification defines a standard way to add markup to
documents. XML was created so that richly structured documents could be
used over the web.
In the context of XML, the word ‘document’ refers not only to traditional
documents, but also to the myriad of other XML ‘data formats’. These data
formats include graphics, e-commerce transactions, mathematical equations,
object metadata, server APIs and a thousand of other kinds of structured
information.
XML provides a facility to define tags and the structural relationships
between them. Since there is no predefined tag set, there cannot be any preconceived semantics. All of the semantics of an XML document will
either be defined by the applications that process them or by stylesheets.

XML is defined by a number of related specifications:


Extensible Markup Language (XML) 1.0, which defines the syntax of XML.
XML Pointer Language (XPointer) and XML Linking Language (XLink), which defines a
standard way to represent links between resources. In addition to simple links, like HTML’s
<A> tag, XML has mechanisms for links between multiple resources and links between read-
only resources. XPointer describes how to address a resource, XLink describes how to
associate two or more resources.
Extensible Style Language (XSL), which defines the standard stylesheet language for XML.

Fig. 21.4 shows an example of a simple XML document. It can be noted


that:

Fig. 21.4 Example of a simple XML document

The document begins with a processing instruction: <?xml …?>. This is the XML
declaration. While it is not required, its presence explicitly identifies the document as an
XML document and indicates the version of XML to which it was authored.
There is no document type declaration. XML does not require a document type declaration.
However, a document type declaration can be supplied and some documents will require one
in order to be understood unambiguously.
Empty elements (<applause/> in this example) have a modified syntax. While most elements
in a document are wrappers around some content, empty elements are simply markers where
something occurs. The trailing /> in the modified syntax indicates to a program processing
the XML document that the element is empty and no matching end-tag should be sought.
Since XML documents do not require a document type declaration, without this clue it could
be impossible for an XML parser to determine which tags were intentionally empty and
which had been left empty by mistake.

XML has softened the distinction between elements which are declared as
EMPTY and elements which merely have no content. In XML, it is legal to
use the empty-element tag syntax in either case. It is also legal to use a start-
tag/end-tag pair for empty elements: <applause></applause>. If
interoperability is of any concern, it is best to reserve empty-element tag
syntax for elements which are declared as EMPTY and to only use the
empty-element tag form for those elements.
XML documents are composed of markup and content. There are six
kinds of markup that can occur in an XML document, namely, elements,
entity references, comments, processing instructions, marked sections and
document type declarations.
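
Once an XML document of the kind shown in Fig. 21.4 is in hand, it can be processed by any XML parser; Python's standard xml.etree.ElementTree module is enough to show the idea. The tiny document below, including its empty <applause/> element, is invented for illustration since Fig. 21.4 itself is not reproduced here.

import xml.etree.ElementTree as ET

# A small, hypothetical XML document in the spirit of Fig. 21.4.
doc = """<?xml version="1.0"?>
<oldjoke>
  <burns>Say <quote>goodnight</quote>, Gracie.</burns>
  <gracie><quote>Goodnight, Gracie.</quote></gracie>
  <applause/>
</oldjoke>"""

root = ET.fromstring(doc)
print(root.tag)                     # 'oldjoke'
for child in root:
    # Empty elements such as <applause/> simply have no text and no children.
    print(child.tag, "->", (child.text or "").strip())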

21.2.5 Advantages of Web Databases


HTML is simple to use for both developers and end-users.
Platform-independent.
Good graphical user interface (GUI).
Standardisation of HTML.
Cross-platform support.
Transparent network access.
Scalable deployment.
Web enables organisations to provide new and innovative services and reach new customers
through globally accessible applications.

21.2.6 Disadvantages of Web Databases


The Internet is not yet very reliable.
Slow communication medium.
Security concern.
High cost for meeting increasing demands and expectations of customers.
Scalability problem due to enormous peak loads.
Limited functionality of HTML.

21.3 DIGITAL LIBRARIES


The Internet and the World Wide Web are two of the principal building
blocks that are used in the development of digital libraries. The Web and its
associated technology have been crucial to the rapid growth of digital
libraries.

21.3.1 Introduction to Digital Libraries


This is a fascinating period in the history of libraries and publishing. For the
first time, it is possible to build large-scale services where collections of
information are stored in digital formats and retrieved over networks. The
materials are stored on computers. A network connects the computers to
personal computers on the users’ desks. In a completely digital library,
nothing need ever reach paper. Digital libraries bring together facets of many
disciplines and experts with different backgrounds and different approaches.
Digital library can be defined as a managed collection of information,
with associated services, where the information is stored in digital formats
and accessible over a network. A key part of this definition is that the
information is managed. A stream of data sent to earth from a satellite is not
a library. The same data, when organised systematically, becomes a digital
library collection. Most people would not consider a database containing
financial records of one company to be a digital library, but would accept a
collection of such information from many companies as part of a library.
Digital libraries contain diverse collections of information for use by many
different users. Digital libraries range in size from tiny to huge. They can use
any type of computing equipment and any suitable software. The unifying
theme is that information is organised on computers and available over a
network, with procedures to select the material in the collections, to organise
it, to make it available to users and to archive it.
In some ways, digital libraries are very different from traditional libraries,
yet in others they are remarkably similar. People do not change because new
technology is invented. They still create information that has to be
organised, stored and distributed. They still need to find information that
others have created and use it for study, reference or entertainment.
However, the form in which the information is expressed and the methods
that are used to manage it are greatly influenced by technology and this
creates change. Every year, the quantity and variety of collections available
in digital form grows, while the supporting technology continues to improve
steadily. Cumulatively, these changes are stimulating fundamental alterations
in how people create information and how they use it.

21.3.2 Components of Digital Libraries


Digital libraries have the following components:
People
Economics
Computers and networks

21.3.2.1 People
Building digital libraries requires an understanding of the people who are developing them.
Technology has dictated the pace at which digital libraries have been able to
develop, but the manner in which the technology is used depends upon
people. Two important communities are the source of much of this
innovation. One group is the information professionals. They include
librarians, publishers and a wide range of information providers, such as
indexing and abstracting services. The other community contains the
computer science researchers and their offspring, the Internet developers.
Until recently, these two communities had disappointingly little interaction;
even now it is commonplace to find a computer scientist who knows nothing
of the basic tools of librarianship, or a librarian whose concepts of
information retrieval are years out of date. Over the past few years, however,
there has been much more collaboration and understanding.
A variety of words are used to describe the people who are associated
with digital libraries. One group of people are the creators of information in
the library. Creators include authors, composers, photographers, map
makers, designers and anybody else who creates intellectual works. Some
are professionals; some are amateurs. Some work individually, others in
teams. They have many different reasons for creating information.
Another group is the users of the digital library. Depending on the context,
users may be described by different terms. In libraries, they are often called
“readers” or “patrons”; at other times they may be called the “audience” or
the “customers”. A characteristic of digital libraries is that creators and users
are sometimes the same people. In academia, scholars and researchers use
libraries as resources for their research and publish their findings in forms
that become part of digital library collections.
The final group of people is a broad one that includes everybody whose
role is to support the creators and the users. They can be called information
managers. The group includes computer specialists, librarians, publishers,
editors and many others. The World Wide Web has created a new profession
of Webmaster. Frequently a publisher will represent a creator, or a library
will act on behalf of users, but publishers should not be confused with
creators or librarians with users. A single individual may be a creator, user
and information manager.

21.3.2.2 Economics
Technology influences the economic and social aspects of information and
vice versa. The technology of digital libraries is developing fast and so are
the financial, organisational and social frameworks. The various groups that
are developing digital libraries bring different social conventions and
different attitudes to money. Publishers and libraries have a long tradition of
managing physical objects, notably books, but also maps, photographs,
sound recordings and other artifacts. They evolved economic and legal
frameworks that are based on buying and selling these objects. Their natural
instinct is to transfer to digital libraries the concepts that have served them
well for physical artifacts. Computer scientists and scientific users, such as
physicists, have a different tradition. Their interest in digital information
began in the days when computers were very expensive. Only a few well-
funded researchers had computers on the first networks. They exchanged
information informally and openly with colleagues, without payment. The
networks have grown, but the tradition of open information remains.
The economic framework that is developing for digital libraries shows a
mixture of these two approaches. Some digital libraries mimic traditional
publishing by requiring a form of payment before users may access the
collections and use the services. Other digital libraries use a different
economic model. Their material is provided with open access to everybody.
The costs of creating and distributing the information are borne by the
producer, not the user of the information. Almost certainly, both have a long-
term future, but the final balance is impossible to forecast.

21.3.2.3 Computers and networks


Digital libraries consist of many computers connected by a communications
network. The dominant network is the Internet. The emergence of the
Internet as a flexible, low-cost, world-wide network has been one of the key
factors that have led to the growth of digital libraries.
Fig. 21.5 shows some of the computers that are used in digital libraries.
The computers have three main functions:

Fig. 21.5 Computers in digital library


To help users interact with the library.
To store collections of materials.
To provide services.

In the terminology of computing, anybody who interacts with a computer


is called a user or computer user. This is a broad term that covers creators,
library users, information professionals and anybody else who accesses the
computer. To access a digital library, users normally use personal computers.
These computers are given the general name clients. Sometimes, clients may interact with a digital library with no human user involved, such as the robots that automatically index library collections and sensors that gather data, such as information about the weather, and supply it to digital libraries.
The next major group of computers in digital libraries is repositories
which store collections of information and provide access to them. An
archive is a repository that is organised for long-term preservation of
materials.
Fig. 21.5 shows two typical services which are provided by digital
libraries: (a) location systems and (b) search systems. Search systems
provide catalogues, indexes and other services to help users find
information. Location systems are used to identify and locate information.
In some circumstances there may be other computers that sit between the
clients and computers that store information. These are not shown in the
figure. Mirrors and caches store duplicate copies of information, for faster
performance and reliability. The distinction between them is that mirrors
replicate large sets of information, while caches store recently used
information only. Proxies and gateways provide bridges between different
types of computer system. They are particularly useful in reconciling
systems that have conflicting technical specifications.
The generic term server is used to describe any computer other than the
user’s personal computer. A single server may provide several of the
functions listed above, perhaps acting as a repository, search system and
location system. Conversely, individual functions can be distributed across
many servers. For example, the domain name system, which is a locator
system for computers on the Internet, is a single, integrated service that runs
on thousands of separate servers.
In computing terminology, a distributed system is a group of computers
that work as a team to provide services to users. Digital libraries are some of
the most complex and ambitious distributed systems ever built. The personal
computers that users have on their desks have to exchange messages with the
server computers; these computers are of every known type, managed by
thousands of different organisations, running software that ranges from state-of-the-art to antiquated. The term interoperability refers to the task of
building coherent services for users, when the individual components are
technically different and managed by different organisations. Some people
argue that all technical problems in digital libraries are aspects of this one
problem, interoperability. This is probably an overstatement, but it is
certainly true that interoperability is a fundamental challenge in all aspects
of digital libraries.

21.3.3 Need for Digital Libraries


The fundamental reason for building digital libraries is a belief that they will
provide better delivery of information than was possible in the past.
Traditional libraries are a fundamental part of society, but they are not
perfect.
Enthusiasts for digital libraries point out that computers and networks
have already changed the ways in which people communicate with each
other. In some disciplines, they argue, a professional or scholar is better
served by sitting at a personal computer connected to a communications
network than by making a visit to a library. Information that was previously
available only to the professional is now directly available to all. From a
personal computer, the user is able to consult materials that are stored on
computers around the world. Conversely, all but the most diehard enthusiasts
recognise that printed documents are so much part of civilization that their
dominant role cannot change except gradually. While some important uses
of printing may be replaced by electronic information, not everybody
considers a large-scale movement to electronic information desirable, even if
it is technically, economically and legally feasible.

21.3.4 Digital Libraries for Scientific Journals


During the late 1980s several publishers and libraries became interested in
building online collections of scientific journals. The technical barriers that
had made such projects impossible earlier were disappearing, though still
present to some extent. The cost of online storage was coming down,
personal computers and networks were being deployed and good database
software was available. The major obstacles to building digital libraries were
that academic literature was on paper and not in electronic formats. Also the
institutions were organised around physical media and not computer
networks.
One of the earliest such efforts was the Mercury Electronic Library at Carnegie Mellon University; a slightly later effort was the CORE project at Cornell University to mount images of chemistry journals. Both projects worked with scientific publishers to scan journals and establish collections of online page images. Whereas Mercury set out to build a production system, CORE also emphasised research into user interfaces and other aspects of the use of the system by chemists.

21.3.4.1 Mercury
One of the first attempts to create a campus digital library was the Mercury Electronic Library, a project undertaken at Carnegie Mellon University between 1987 and 1993. It began in 1988 and went live in 1991 with a dozen
textual databases and a small number of page images of journal articles in
computer science. Mercury was able to build upon the advanced computing
infrastructure at Carnegie Mellon, which included a high-performance
network, a fine computer science department and the tradition of innovation
by the university libraries.

21.3.4.2 CORE
CORE was a joint project by Bellcore, Cornell University, OCLC and the
American Chemical Society that ran from 1991 to 1995. The project
converted about 400,000 pages, representing four years of articles from
twenty journals published by the American Chemical Society.
The project used a number of ideas that have since become popular in
conversion projects. CORE included two versions of every article, a scanned
image and a text version marked up in SGML. The scanned images ensured
that when a page was displayed or printed it had the same design and layout
as the original paper version. The SGML text was used to build a full-text
index for information retrieval and for rapid display on computer screens.
Two scanned images were stored for each page, one for printing and the
other for screen display. The printing version was black and white, 300 dots
per inch; the display version was 100 dots per inch, grayscale.
Although both the Mercury and CORE projects converted existing journal
articles from print to bitmapped images, conversion was not seen as the
long-term future of scientific libraries. It simply reflected the fact that none
of the journal publishers were in a position to provide other formats.
Mercury and CORE were followed by a number of other projects that
explored the use of scanned images of journal articles. One of the best
known was Elsevier Science Publishing’s Tulip project. For three years,
Elsevier provided a group of universities, which included Carnegie Mellon
and Cornell, with images from forty-three journals in materials science. Each university individually mounted these images on its own computers and made them available locally.

21.3.5 Technical Developments in Digital Libraries


The first attempts to store library information on computers started in the late 1960s. These early attempts faced serious technical barriers, including the high cost of computers, terse user interfaces and the lack of networks. Because storage was expensive, the first applications were in areas where financial benefits could be gained from storing comparatively small volumes of data online. An early success was the work of the Library of Congress in developing a format for Machine-Readable Cataloguing (MARC) in the late 1960s. The MARC format was used by the Online Computer Library Center
(OCLC) to share catalogue records among many libraries. This resulted in
large savings in costs for libraries.
Early information services, such as shared cataloguing, legal information
systems and the National Library of Medicine’s Medline service, used the
technology that existed when they were developed. Small quantities of
information were mounted on a large central computer. Users sat at a
dedicated terminal, connected by a low-speed communications link, which
was either a telephone line or a special purpose network. These systems
required a trained user who would accept a cryptic user interface in return
for faster searching than could be carried out manually and access to
information that was not available locally.
Such systems were no threat to the printed document. All that could be
displayed was unformatted text, usually in a fixed spaced font, without
diagrams, mathematics, or the graphic quality that is essential for easy
reading. When these weaknesses were added to the inherent defects of early computer screens (poor contrast and low resolution), it is hardly surprising that most people were convinced that users would never willingly read from a screen.
The past thirty years have steadily eroded these technical barriers. During
the early 1990s, a series of technical developments took place that removed
the last fundamental barriers to building digital libraries. Some of this
technology is still rough and ready, but low-cost computing has stimulated
an explosion of online information services.

21.3.6 Technical Areas of Digital Libraries


Four technical areas important to digital libraries are as follows:
Cheaper electronic storage than paper:

Large libraries are painfully expensive for even the richest organisations. Buildings are about
a quarter of the total cost of most libraries. Behind the collections of many great libraries are
huge, elderly buildings, with poor environmental control. Even when money is available,
space for expansion is often hard to find in the centre of a busy city or on a university
campus.
The costs of constructing new buildings and maintaining old ones to store printed books and
other artifacts will only increase with time, but electronic storage costs decrease by at least
30 per cent per annum. In 1987, work began on a digital library at Carnegie Mellon University, known as the Mercury library. The collections were stored on computers, each
with ten gigabytes of disk storage. In 1987, the list price of these computers was about
$120,000. In 1997, a much more powerful computer with the same storage cost about $4,000.
In ten years, the price was reduced by about 97 per cent. Moreover, there is every reason to
believe that by 2007 the equipment will be reduced in price by another 97 per cent.
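As a rough arithmetic check on these figures, a minimal sketch (assuming a steady 30 per cent annual decline, the lower bound quoted above; all values are approximate) confirms that ten years of such decline leaves only a few per cent of the original price:

    # Assumes a steady 30 per cent annual decline in storage cost (a lower bound).
    start_price = 120_000                     # 1987 list price, in dollars
    remaining_fraction = 0.7 ** 10            # fraction of the price left after ten years
    print(round(remaining_fraction, 3))       # 0.028, i.e. roughly a 97 per cent reduction
    print(round(start_price * remaining_fraction))  # about 3,390 dollars, close to the $4,000 quoted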

Ten years ago, the cost of storing documents on CD-ROM was already less than the cost of
books in libraries. Today, storing most forms of information on computers is much cheaper
than storing artifacts in a library. Ten years ago, equipment costs were a major barrier to
digital libraries. Today, they are much lower, though still noticeable, particularly for storing
large objects such as digitised videos, extensive collections of images, or high-fidelity sound
recordings. In ten years time, equipment that is too expensive to buy today will be so cheap
that the price will rarely be a factor in decision making.

Better personal computer displays:

Storage cost is not the only factor. Otherwise libraries would have standardised on microfilm
years ago. Until recently, very few people were happy to read from a computer. The quality
of the representation of documents on the screen was also poor. The usual procedure was to
print a paper copy. Recently, however, major advances have been made in the quality of
computer displays, in the fonts which are displayed on them and in the software that is used
to manipulate and render information. People are beginning to read directly from computer
screens, particularly materials that were designed for computer display, such as Web pages.
The best computer displays are still quite expensive, but every year they get cheaper and
better. It will be a long time before computers match the convenience of books for general
reading, but the high-resolution displays to be seen in research laboratories are very
impressive indeed.

Most users of digital libraries have a mixed style of working, with only part of the materials
that they use in digital form. Users still print materials from the digital library and read the
printed version, but every year more people are reading more materials directly from the
screen.

Widespread availability of high-speed networks:

The growth of the Internet over the past few years has been phenomenal.
Telecommunications companies compete to provide local and long distance Internet service
across the United States; international links reach almost every country in the world; every
sizable company has its internal network; universities have built campus networks;
individuals can purchase low-cost, dial-up services for their homes.

The coverage is not universal. Even in the US there are many gaps and some countries are
not yet connected at all, but in many countries of the world it is easier to receive information
over the Internet than to acquire printed books and journals by orthodox methods.
Portable computers:

Although digital libraries are based around networks, their utility has been greatly enhanced
by the development of portable, laptop computers. By attaching a laptop computer to a
network connection, a user combines the digital library resources of the Internet with the
personal work that is stored on the laptop. When the user disconnects the laptop, copies of
selected library materials can be retained for personal use.

During the past few years, laptop computers have increased in power, while the quality of
their screens has improved immeasurably. Although batteries remain a problem, laptops are
no heavier than a large book and the cost continues to decline steadily.

21.3.7 Access to Digital Libraries


Traditional libraries usually require that the user be a member of an
organisation that maintains expensive physical collections. In the United
States, universities and some other organisations have excellent libraries, but
most people do not belong to such an organisation. In theory, much of the
Library of Congress is open to anybody over the age of eighteen, and a few
cities have excellent public libraries, but in practice, most people are
restricted to the small collections held by their local public library. Even
scientists often have poor library facilities. Doctors in large medical centres
have excellent libraries, but those in remote locations typically have nothing.
One of the motives that led the Institute of Electrical and Electronics
Engineers (IEEE) to its early interest in electronic publishing was the fact
that most engineers do not have access to an engineering library. Users of
digital libraries need a computer attached to the Internet.
A factor that must be considered in planning digital libraries is that the
quality of the technology available to users varies greatly. A favoured few
have the latest personal computers on their desks, high-speed connections to
the Internet and the most recent release of software; they are supported by
skilled staff who can configure and tune the equipment, solve problems and
keep the software up to date. Most people, however, have to make do with
less. Their equipment may be old, their software out of date, their Internet
connection troublesome, and their technical support from staff who are
under-trained and over-worked. One of the great challenges in developing
digital libraries is to build systems that take advantage of modern
technology, yet perform adequately in less perfect situations.

21.3.8 Database for Digital Libraries


Digital libraries hold any information that can be encoded as sequences of
bits. Sometimes these are digitized versions of conventional media, such as
text, images, music, sound recordings, specifications and designs and many,
many more. As digital libraries expand, the contents are less often the digital
equivalents of physical items and more often items that have no equivalent,
such as data from scientific instruments, computer programs, video games
and databases.

21.3.8.1 Data and Metadata


The information stored in a digital library can be divided into data and
metadata. As discussed in introductory chapters, metadata is data about other
data. Common categories of metadata include descriptive metadata, such as
bibliographic information, structural metadata about formats and structures
and administrative metadata, which includes rights, permissions and other
information that is used to manage access. One item of metadata is the
identifier, which identifies an item to the outside world.
The distinction between data and metadata often depends upon the
context. Catalogue records or abstracts are usually considered to be
metadata, because they describe other data, but in an online catalogue or a
database of abstracts they are the data.

21.3.8.2 Items in a Digital Library


No generic term has yet been established for the items that are stored in a
digital library. The most general term is the material, which is anything that
might be stored in a library. The term digital material is used when needed
for emphasis. A more precise term is digital object. This is used to describe
an item as stored in a digital library, typically consisting of data, associated
metadata and an identifier.
Some people call every item in a digital library a document. However,
here we reserve the term for a digitised text, or for a digital object whose
data is the digital equivalent of a physical document.
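The idea of a digital object as data plus metadata plus an identifier can be made concrete with a minimal sketch; the class, field names and values below are hypothetical illustrations, not a standard:

    from dataclasses import dataclass, field

    @dataclass
    class DigitalObject:
        identifier: str                  # names the object to the outside world
        data: bytes                      # the stored sequence of bits
        descriptive: dict = field(default_factory=dict)     # bibliographic information
        structural: dict = field(default_factory=dict)      # formats and internal structure
        administrative: dict = field(default_factory=dict)  # rights, permissions, access rules

    article = DigitalObject(
        identifier="lib/12345",
        data=b"%PDF-...",
        descriptive={"title": "Sample Article", "creator": "A. Author"},
        structural={"format": "application/pdf", "pages": 12},
        administrative={"access": "open"},
    )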

21.3.8.3 Library Objects


The term library object is useful for the user’s view of what is stored in a
library. Consider an article in an online periodical. The reader thinks of it as
a single entity, a library object, but the article is probably stored on a
computer as several separate objects. They contain pages of digitised text,
graphics, perhaps even computer programs, or linked items stored on remote
computers. From the user’s viewpoint, this is one library object made up of
several digital objects.
This example shows that library objects have internal structure. They
usually have both data and associated metadata. Structural metadata is used
to describe the formats and the relationship of the parts.

21.3.8.4 Presentations, Disseminations and the Stored Form of a Digital Object
The form in which information is stored in a digital library may be very
different from the form in which it is used. A simulator used to train airplane
pilots might be stored as several computer programs, data structures,
digitised images and other data. This is called the stored form of the object.
The user is provided with a series of images, synthesised sound and
control sequences. Some people use the term presentation for what is
presented to the user and in many contexts this is appropriate terminology. A
more general term is dissemination, which emphasises that the
transformation from the stored form to the user requires the execution of
some computer program.
When digital information is received by a user’s computer, it must be
converted into the form that is provided to the user, typically by displaying
on the computer screen, possibly augmented by a sound track or other
presentation. This conversion is called rendering.
21.3.8.5 Works and Content
Finding terminology to describe content is especially complicated. Part of
the problem is that the English language is very flexible. Words have varying
meanings depending upon the context. Consider, for example, the phrase “the song Simple Gifts”. Depending on the context, that phrase could refer to the song
as a work with words and music, the score of the song, a performance of
somebody singing it, a recording of the performance, an edition of music on
compact disk, a specific compact disc, the act of playing the music from the
recording, the performance encoded in a digital library and various other
aspects of the song. Such distinctions are important to the music industry,
because they determine who receives money that is paid for a musical
performance or recording.
Several digital library researchers have attempted to define a general
hierarchy of terms that can be applied to all works and library objects. The
problem is that library materials have so much variety that a classification
may match some types of material well but fail to describe others
adequately.

21.3.9 Potential Benefits of Digital Libraries


The digital library brings the library to the user:

Using a library requires access. Traditional methods require that the user goes to the library. In a university, the walk to a library takes a few minutes, but not many people are members of universities or have a nearby library. Many engineers or physicians carry out their work with
depressingly poor access to the latest information.

A digital library brings the information to the user’s desk, either at work or at home, making
it easier to use and hence increasing its usage. With a digital library on the desk top, a user
need never visit a library building. The library is wherever there is a personal computer and a
network connection.

Computer power is used for searching and browsing:

Computing power can be used to find information. Paper documents are convenient to read,
but finding information that is stored on paper can be difficult. Despite the myriad of
secondary tools and the skill of reference librarians, using a large library can be a tough
challenge. A claim that used to be made for traditional libraries is that they stimulate
serendipity, because readers stumble across unexpected items of value. The truth is that
libraries are full of useful materials that readers discover only by accident.

In most aspects, computer systems are already better than manual methods for finding
information. They are not as good as everybody would like, but they are good and improving
steadily. Computers are particularly useful for reference work that involves repeated leaps
from one source of information to another.

Information can be shared:

Libraries and archives contain much information that is unique. Placing digital information
on a network makes it available to everybody. Many digital libraries or electronic
publications are maintained at a single central site, perhaps with a few duplicate copies
strategically placed around the world. This is a vast improvement over expensive physical
duplication of little used material, or the inconvenience of unique material that is inaccessible
without traveling to the location where it is stored.

Information is easier to keep current:

Much important information needs to be brought up to date continually. Printed material is


awkward to update, since the entire document must be reprinted; all copies of the old version
must be tracked down and replaced. Keeping information current is much less of a problem
when the definitive version is in digital format and stored on a central computer.

Many libraries have the provision of online text of reference works, such as directories or
encyclopedias. Whenever revisions are received from the publisher, they are installed on the
library’s computer. The new versions are available immediately. The Library of Congress has
an online collection, called Thomas. This contains the latest drafts of all legislation currently
before the US Congress; it changes continually.

The information is always available:

The doors of the digital library never close; a recent study at a British university found that
about half the usage of a library’s digital collections was at hours when the library buildings
were closed. Material is never checked out to other readers, mis-shelved or stolen; it is never in an off-campus warehouse. The scope of the collections expands beyond the walls of
the library. Private papers in an office or the collections of a library on the other side of the
world are as easy to use as materials in the local library.

Digital libraries are not perfect. Computer systems can fail and networks may be slow or
unreliable, but, compared with a traditional library, information is much more likely to be
available when and where the user wants it.

New forms of information become possible:

Most of what is stored in a conventional library is printed on paper, yet print is not always the
best way to record and disseminate information. A database may be the best way to store
census data, so that it can be analysed by computer; satellite data can be rendered in many
different ways; a mathematics library can store mathematical expressions, not as ink marks
on paper but as computer symbols to be manipulated by programs such as Mathematica or
Maple.

Even when the formats are similar, material that is created explicitly for the digital world is not the same as material originally designed for paper or other media. Words that are spoken
have a different impact from the words that are written and online textual material is subtly
different from either the spoken or printed word. Good authors use words differently when
they write for different media and users find new ways to use the information. Material
created for the digital world can have a vitality that is lacking in material that has been
mechanically converted to digital formats, just as a feature film never looks quite right when
shown on television.

Each of the benefits described above can be seen in existing digital


libraries. There is another group of potential benefits, which have not yet
been demonstrated, but hold tantalising prospects. The hope is that digital
libraries will develop from static repositories of immutable objects to
provide a wide range of services that allow collaboration and exchange of
ideas. The technology of digital libraries is closely related to the technology
used in fields such as electronic mail and teleconferencing, which have
historically had little relationship to libraries. The potential for convergence
between these fields is exciting.

21.4 MULTIMEDIA DATABASES

Multimedia computing has emerged as a major area of research and has started to pervade almost all facets of human life. Multimedia databases allow users to store and query different types of multimedia information. They have opened a wide range of potential applications by combining a variety of information sources.

21.4.1 Multimedia Sources


Multimedia databases use a wide variety of multimedia sources, such as:
Images
Video clips
Audio clips
Text or documents
The fundamental characteristics of multimedia systems are that they
incorporate continuous media, such as voice (audio), video and animated
graphics.

21.4.1.1 Images
Images include photographs, drawings and so on. Images are usually stored
in raw form as a set of pixel or cell values, or in a compressed form to save
storage space. The image shape descriptor describes the geometric shape of
the raw image, which is typically a rectangle of cells of a certain width and
height. Each cell contains a pixel value that describes the cell content. In black-and-white images, a pixel can be a single bit. In grayscale or colour images, each pixel requires multiple bits. Images require very large storage space. Hence, they are often stored in a compressed form, such as GIF or JPEG. These
compressed forms use various mathematical transformations to reduce the
number of cells stored, without disturbing the main image characteristics.
The mathematical transforms used to compress images include Discrete
Fourier Transform (DFT), Discrete Cosine Transform (DCT) and Wavelet
Transforms.
In order to identify particular objects in an image, the image is divided into homogeneous segments using a homogeneity predicate. The
homogeneity predicate defines the conditions for how to automatically group
those cells. For example, in a colour image, cells that are adjacent to one
another and whose pixel values are close are grouped into a segment.
Segmentation and compression can hence identify the main characteristics of
an image.
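A minimal sketch of this grouping idea, assuming a simple threshold on the difference between pixel values as the homogeneity predicate (real segmentation methods are considerably more sophisticated):

    def segment(image, threshold=10):
        """Group adjacent cells whose pixel values are close into numbered segments."""
        rows, cols = len(image), len(image[0])
        labels = [[None] * cols for _ in range(rows)]
        next_label = 0
        for r in range(rows):
            for c in range(cols):
                if labels[r][c] is not None:
                    continue
                stack, labels[r][c] = [(r, c)], next_label   # flood fill from (r, c)
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and labels[ny][nx] is None
                                and abs(image[ny][nx] - image[y][x]) < threshold):
                            labels[ny][nx] = next_label
                            stack.append((ny, nx))
                next_label += 1
        return labels

    print(segment([[10, 12, 90], [11, 13, 92]]))   # [[0, 0, 1], [0, 0, 1]]: two segments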
Inexpensive image-capture and storage technologies have allowed
massive collections of digital images to be created. However, as a database
grows, the difficulty of finding relevant images increases. Two general approaches to this problem, namely manual identification and automatic analysis, have been developed. Both approaches use metadata for image retrieval.

21.4.1.2 Video Clips


Video clippings include movies, newsreels, home videos and so on. A video
source is typically represented as a sequence of frames, where each frame is
a still image. However, rather than identifying the objects and activities in
every individual frame, the video is divided into video segments. Each video
segment is made up of a sequence of contiguous frames that includes the
same objects or activities. Its starting and ending frames identify each
segment. The objects and activities identified in each video segment can be
used to index the segments. An indexing technique called frame segment trees is used for video indexing. The index includes both objects (such as
persons, houses, cars and others) and activities (such as a person delivering a
speech, two persons talking and so on). Videos are also often compressed
using standards such as MPEG.
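A minimal sketch of indexing video segments by the objects and activities they contain; the segments and terms are hypothetical, and a production system would use frame segment trees rather than this flat index:

    from collections import defaultdict

    # Each segment: (start_frame, end_frame, objects, activities) - hypothetical values.
    segments = [
        (0,   250, {"person", "car"},   {"driving"}),
        (251, 600, {"person", "house"}, {"speech"}),
    ]

    index = defaultdict(list)            # term -> list of (start_frame, end_frame)
    for start, end, objects, activities in segments:
        for term in objects | activities:
            index[term].append((start, end))

    print(index["person"])               # [(0, 250), (251, 600)]
    print(index["speech"])               # [(251, 600)]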

21.4.1.3 Audio Clips


Audio clips include phone messages, songs, speeches, class presentations,
surveillance recording of phone messages and conversations by law
enforcement and others. Here, discrete transforms are used to identify the main characteristics of a certain person’s voice in order to support similarity-based indexing and retrieval. Audio characteristic features include loudness, intensity, pitch and clarity.

21.4.1.4 Text or Documents


Text or document sources include articles, books, journals and so on. A text
or document is basically the full text of some article, book, magazine or
journal. These sources are typically indexed by identifying the keywords that
appear in the text and their relative frequencies. However, filler words are eliminated from that process. Because the number of remaining keywords can still be large, a technique called singular value decomposition (SVD), based on matrix transformations, is used to reduce the number of keywords in a collection of documents. An indexing technique called telescoping vector trees, or TV-trees, can then be used to group similar documents together.
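A minimal sketch of keyword indexing with filler-word removal; the filler-word list and the document are illustrative, and the SVD and TV-tree steps are omitted:

    from collections import Counter

    FILLER_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}   # illustrative list

    def keyword_index(text):
        words = [w.strip(".,").lower() for w in text.split()]
        return Counter(w for w in words if w and w not in FILLER_WORDS)

    doc = "The index of a document is built from the keywords in the document."
    print(sorted(keyword_index(doc).items()))
    # [('built', 1), ('document', 2), ('from', 1), ('index', 1), ('keywords', 1)]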

21.4.2 Multimedia Database Queries


Multimedia databases provide features that allow users to store and query
different types of multimedia information. As discussed in the previous
section, the multimedia information includes images (for example, pictures,
photographs, drawings and more), video (for example, movies, newsreels,
home videos and others), audio (for example, songs, speeches, phone
messages and more) and text or documents (for example, books, journals,
articles and others). The main types of multimedia database queries are the
ones that help in locating multimedia data containing certain objects of
interest.

21.4.2.1 Content-based Retrieval


In multimedia databases, content-based queries are widely used. For
example, locating multimedia sources that contain certain objects of interest, such as locating all video clips in a video database that show a certain famous mountain, say Mount Everest, or retrieving all photographs containing a computer from a photo gallery, or retrieving video clips that contain a certain person from the video database. One may also want to
retrieve video clips based on certain activities included in them, for example,
video clips of all sixes in cricket test matches. These types of queries are
also called content-based retrieval, because the multimedia source is being
retrieved based on its containing certain objects or activities.
Content-based retrieval is useful in database applications where the query
is semantically of the form, “find objects that look like this one”. Such
applications include the following:
Medical imaging
Trademarks and copyrights
Art galleries and museums
Retailing
Fashion and fabric design
Interior design or decorating
Law enforcement and criminal investigation

21.4.2.2 Identification of Multimedia Sources


Multimedia databases use some model to organise and index the multimedia
sources based on their contents. Two approaches, namely automatic analysis
and manual identification are used for this purpose. In the first approach, an
automatic analysis of the multimedia sources is done to identify certain
mathematical characteristics of their contents. The automatic analysis
approach uses different techniques depending on the type of multimedia
source, for example image, video, text or audio. In the second approach,
manual identification of the objects and activities of interests is done in each
multimedia source. This information is used to index the sources. Manual
identification approach can be applied to all the different multimedia
sources. However, it requires a manual pre-processing phase where a person
has to scan each multimedia source to identify and catalog the objects and
activities it contains so that they can be used to index these sources.
A typical image database query would be to find images in the database
that are similar to a given image. The given image could be an isolated
segment that contains, say, a pattern of interest and the query is to locate
other images that contain that same pattern. There are two main techniques
for this type of search. The first technique uses a distance function to compare the given image with the stored images and their segments. If the distance
value returned is small, the probability of match is high. Indexes can be
created to group together stored images that are close in the distance metric
so as to limit the search space. The second technique is called the
transformation approach, which measures image similarity by having a small
number of transformations that can transform one image’s cells to match the
other image. Transformations include rotations, translations and scaling.
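A minimal sketch of the distance-function technique, assuming each image is summarised by a fixed-length feature vector such as a colour histogram (the vectors below are hypothetical):

    import math

    def distance(a, b):
        """Euclidean distance between two equal-length feature vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Hypothetical feature vectors for the stored images.
    stored = {"img1": [0.8, 0.1, 0.1], "img2": [0.2, 0.3, 0.5], "img3": [0.6, 0.3, 0.1]}
    query = [0.75, 0.15, 0.10]

    # A small distance means a probable match; rank the stored images accordingly.
    ranked = sorted(stored, key=lambda name: distance(stored[name], query))
    print(ranked)    # ['img1', 'img3', 'img2'] for these values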

21.4.3 Multimedia Database Applications


Multimedia data may be stored, delivered and utilised in many different
ways. Some of the important applications are as follows:
Repository applications.
Presentation applications.
Collaborative work using multimedia information.
Documents and records management.
Knowledge dissemination.
Education and training.
Marketing, advertising, retailing, entertainment and travel.
Real-time control and monitoring.

21.5 MOBILE DATABASES

The rapid technological development of mobile phones (cell phones), wireless and satellite communications, together with the increased mobility of individual users, has resulted in an increasing demand for mobile computing. Portable computing devices such as laptop computers and palmtop computers, coupled with wireless communications, allow clients to access data from virtually anywhere in the world and at any time. Mobile databases, built on these developments, allow users such as CEOs, marketing professionals and finance managers to access any data, anywhere, at any time, in order to take business decisions in real time. Mobile databases are especially useful to geographically dispersed organisations.
The flourishing of the mobile devices is driving businesses to deliver data
to employees and customers wherever they may be. The potential of mobile
gear with mobile data is enormous. A salesperson equipped with a PDA
running corporate databases can check order status, sales history and
inventory instantly from the client’s site. And drivers can use handheld
computers to log deliveries and report order changes for a more efficient
supply chain.

21.5.1 Architecture of Mobile Databases


A mobile database is a portable database that is physically separate from a centralised database server but is capable of communicating with that server from remote sites, allowing the sharing of corporate data. Using mobile databases, users have access to corporate data on their laptop computers, Personal Digital Assistants (PDAs), or other Internet access devices required for applications at remote sites. Fig. 21.6 shows the general architecture of a typical mobile database environment. It is a distributed architecture in which several computers, generally referred to as corporate database servers (or hosts), are interconnected through a high-speed communication network. A mobile database environment consists of the following components:
Corporate database server and DBMS to manage and store the corporate data and provide
corporate applications.
Mobile (remote) database and DBMS at several locations to manage and store the mobile
data and provide mobile applications.
End user mobile database platform consisting of laptop computer, PDA and other Internet
access devices to manage and store client (end user) data and provide client applications.
Communication links between the corporate and mobile DBMS for data access.

The communication between the corporate and mobile databases is intermittent and is established for short periods of time at irregular intervals.

21.5.2 Characteristics of Mobile Computing


Mobile computing has the following characteristics:
High communication latency caused by the processes unique to the wireless medium, such as
coding data for wireless transfer and tracking and filtering wireless signals at the receiver.
Intermittent wireless connectivity due to un-reachability of wireless signals to the places,
such as elevators, subway tunnels and so on.
Limited battery life, due to constraints on battery size and the power consumed by mobile device capabilities.
Changing client locations causing altering of the network topology and changing data
requirements.

21.5.3 Mobile DBMS


Mobile DBMSs are capable of communicating with a range of major relational DBMSs and provide services that require only the limited computing resources currently available on mobile devices. A mobile DBMS should have the following capabilities:
Communicating with centralised or corporate database server through wireless or Internet
access.
Replicating data on the centralised database server and mobile device.
Synchronising data on the centralized database server and mobile database.
Capturing data from various sources such as the Internet.
Managing data on the mobile devices such as laptop computer, palmtop computer and so on.
Analysing data on a mobile device.
Creating customised mobile applications.
Fig. 21.6 General architecture of mobile database
Now, most mobile DBMSs provide pre-packaged SQL functions for the
mobile application as well as support extensive database querying or data
analysis.
Mobile databases replicate data among themselves and with a central
database. Replication involves examining a database source for changes due
to recent transactions and propagating the changes asynchronously to other
database targets. Replication must be asynchronous, because users do not
have constant connections to the central database.
Transaction-based replication, in which only complete transactions are
replicated, is crucial for integrity across the databases. Replicating partial
transactions would lead to chaos. Serial transaction replication is important
to maintain the same order in each database. This process prevents
inconsistencies among the databases. Another consideration in mobile
database deployment is how conflicts over multiple updates to the same
record will be resolved.
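A minimal sketch of asynchronous, transaction-based replication (purely illustrative, not any vendor's replication API): complete transactions are queued in commit order and applied serially to the central database when a connection becomes available:

    from collections import deque

    replication_queue = deque()     # complete transactions, kept in commit order

    def commit(transaction):
        """Record a complete transaction for later propagation (asynchronous)."""
        replication_queue.append(transaction)

    def synchronise(target_db):
        """Apply queued transactions to the target serially, when a link is available."""
        while replication_queue:
            transaction = replication_queue.popleft()
            for statement in transaction:        # the whole transaction or nothing
                target_db.append(statement)

    central = []
    commit(["INSERT order 101", "UPDATE stock -1"])   # captured while disconnected
    commit(["INSERT order 102", "UPDATE stock -2"])
    synchronise(central)                              # later, when connected
    print(central)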

21.5.4 Commercial Mobile Databases


Sybase’s SQL Anywhere currently dominates the mobile database market.
The company has deployed SQL Anywhere to more than 6 million users at over 10,000 sites and serves 68 per cent of the mobile database market, according to a recent Gartner Dataquest study. Other mobile databases include IBM’s DB2 Everyplace 7, Microsoft SQL Server 2000 Windows CE
Edition and Oracle9i Lite. Smaller player Gupta Technologies’ SQLBase
also targets handheld devices.
Mobile databases are often stripped-down versions of their server-based
counterparts. They contain only basic SQL operations because of limited
resources on the devices. In addition to storage requirements for data tables,
the database engines require from 125 KB to 1 MB, depending on how well the
vendor was able to streamline its code.
Platform support is a key issue in choosing a mobile database. No
organisation wants to devote development and training resources to a
platform that may become obsolete. Microsoft’s mobile database supports
Win32 and Windows CE. The IBM, Oracle and Sybase products support
Linux, Palm OS, QNX Neutrino, Symbian EPOC, Windows CE and Win32.
Though the market is still evolving, there is already a sizeable choice of
sturdy products that will extend the business data to mobile workers.

21.6 SPATIAL DATABASES

Spatial databases keep track of objects in a multidimensional space. Spatial


data support in databases is important for efficiently storing, indexing and
querying of data based on spatial locations. For example, suppose that we want to store a set of polygons in a database and to query the database to find all polygons that intersect a given polygon. Standard index structures (such as B-trees or hash indices) are not suited to this task; efficient processing of such a query requires special-purpose index structures, such as the R-trees discussed in section 21.6.5.

21.6.1 Spatial Data


A common example of spatial data is geographic data, such as road maps and associated information. A road map is a two-dimensional object that contains points, lines and polygons that can represent cities, roads and political boundaries such as states or countries. A road map is a visualisation of geographic information. The location of cities, roads and political boundaries
that exist on the surface of the Earth are projected onto two-dimensional
display or piece of paper, preserving the relative positions and relative
distances of the rendered objects. The data that indicates the Earth location
(latitude and longitude, or height and depth) of these rendered objects is the
spatial data. When the map is rendered, this spatial data is used to project the
locations of the objects on a two-dimensional piece of paper. A geographic
information system (GIS) is often used to store, retrieve and render Earth-
relative spatial data.
Another type of spatial data is data from computer-aided design (CAD), such as integrated-circuit (IC) designs or building designs, and from computer-aided manufacturing (CAM). CAD/CAM spatial data work at a smaller scale, such as that of an automobile engine or a printed circuit board (PCB), compared with GIS data, which work at a much larger scale, for example indicating locations on the Earth.
Applications of spatial data initially stored data as files in a file system, as
did early-generation business applications. But with the growing complexity
and volume of the data and increased number of users, ad hoc approaches to
storing and retrieving data in a file system have proved insufficient for the
needs of many applications that use spatial data.

21.6.2 Spatial Database Characteristics


A spatial database stores objects that have spatial characteristics that describe them. The
spatial relationships among the objects are important and they are often needed when
querying the database. A spatial database can refer to an n-dimensional space for any ‘n’.
Spatial databases consist of extensions, such as models that can interpret spatial characteristics. In addition, special indexing and storage structures are often needed to
improve the performance. The basic extensions needed are to include two-dimensional
geometric concepts, such as points, lines and line segments, circles, polygons and arcs, in
order to specify the spatial characteristics of the objects.
Spatial operations are needed to operate on the objects’ spatial characteristics. For example,
we need spatial operations to compute the distance between two objects and other such
operations. We also need spatial Boolean conditions to check whether two objects spatially
overlap and perform other similar operations. For example, a GIS will contain the description
of the spatial positions of many types of objects. Some objects such as highways, buildings
and other landmarks have static spatial characteristics. Other objects like vehicles, temporary
buildings and others have dynamic spatial characteristics that change over time.
The spatial databases are designed to make the storage, retrieval and manipulation of spatial
data easier and more natural to users such as GIS. Once the data is stored in a spatial
database, it can be easily and meaningfully manipulated and retrieved as it relates to all other
data stored in the database.
Spatial databases provide concepts for databases that keep track of objects in a multi-
dimensional space. For example, geographic databases and GIS databases that store maps
include two-dimensional spatial descriptions of their objects. These databases are used in
many applications, such as environmental, logistics management and war strategies.

21.6.3 Spatial Data Model


The spatial data model is a hierarchical structure consisting of the following levels of representation of spatial data:
Elements.
Geometries.
Layers.

21.6.3.1 Elements
An element is a basic building block of a geometric feature for the Spatial
Data Option. The supported spatial element types are points, line strings and
polygons. For example, elements might model historic markers (point clusters), roads (line strings) and county boundaries (polygons). Each
coordinate in an element is stored as an X, Y pair.
Point data consists of one coordinate and the sequence number is ‘0’. Line
data consists of two coordinates representing a line segment of the element,
starting with sequence number ‘0’. Polygon data consists of coordinate pair
values, one vertex pair for each line segment of the polygon. The first
coordinate pair (with sequence number ‘0’), represents the first line segment,
with coordinates defined in either a clockwise or counter-clockwise order
around the polygon with successive sequence numbers. Each layer’s
geometric objects and their associated spatial index are stored in the
database in tables.

21.6.3.2 Geometries
A geometry or geometric object is the representation of a user’s spatial
feature, modelled as an ordered set of primitive elements. Each geometric
object is required to be uniquely identified by a numeric geometric identifier
(GID), associating the object with its corresponding attribute set. A complex
geometric feature such as a polygon with holes would be stored as a
sequence of polygon elements. In a multi-element polygon geometry, all sub-elements are wholly contained within the outermost element, thus building a
more complex geometry from simpler pieces. For example, geometry might
describe the fertile land in a village. This could be represented as a polygon
with holes that represent buildings or objects that prevent cultivation.

21.6.3.3 Layers
A layer is a homogeneous collection of geometries having the same attribute
set. For example, one layer in a GIS includes topographical features, while
another describes population density and a third describes the network of roads and bridges in the area (lines and points). Layers are composed of
geometries, which in turn are made up of elements. For example, a point
might represent a building location, a line string might be a road or flight
path and a polygon could be a state, city, zoning district or city block.
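A minimal sketch of the element-geometry-layer hierarchy described above; the names, GID and coordinates are hypothetical, and a real spatial option stores these structures in database tables with an associated spatial index:

    # Elements: ordered (x, y) coordinate pairs; the type decides their interpretation.
    marker   = {"type": "point",       "coords": [(86.2, 22.8)]}
    road     = {"type": "line_string", "coords": [(86.2, 22.8), (86.4, 22.9), (86.6, 23.0)]}
    boundary = {"type": "polygon",     "coords": [(86.0, 22.5), (87.0, 22.5),
                                                  (87.0, 23.5), (86.0, 23.5), (86.0, 22.5)]}

    # A geometry is an ordered set of elements with a unique geometric identifier (GID).
    district_boundary = {"gid": 1001, "elements": [boundary]}

    # A layer is a homogeneous collection of geometries sharing the same attribute set.
    admin_layer = {"name": "county_boundaries", "geometries": [district_boundary]}
    print(admin_layer["geometries"][0]["gid"])   # 1001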

21.6.4 Spatial Database Queries


Spatial query is the process of selecting features based on their geographic or
spatial relationship to other features. There are many types of spatial queries
that can be issued to spatial databases. The following categories illustrate
three typical types of spatial queries:
Range query: A range query finds the objects of a particular type that are within a given spatial area or within a particular distance from a given location. For example, find all English schools in Mumbai city, or find all hospitals or police vans within a distance of 50 kilometres from a given location (a simple sketch of a range query follows this list).

Nearest neighbour query or adjacency: This query finds an object of a particular type that is closest to a given location. For example, finding the police post that is closest to your house, finding the restaurants that lie within five kilometres of your residence, or finding the hospital nearest to an accident site.

Spatial joins or overlays: This query typically joins the objects of two types based on some
spatial condition, such as the objects intersecting or overlapping spatially or being within a
certain distance of one another. For example, find all cities that fall on the National Highway from Jamshedpur to Patna, or find all buildings within two kilometres of a steel plant.
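A minimal sketch of a range query by distance, assuming point locations given as (x, y) coordinates in kilometres; a real spatial DBMS would use a spatial index and geodetic distance rather than this linear scan:

    import math

    hospitals = {"H1": (0.0, 0.0), "H2": (30.0, 10.0), "H3": (80.0, 5.0)}   # hypothetical

    def within_distance(objects, centre, radius_km):
        cx, cy = centre
        return [name for name, (x, y) in objects.items()
                if math.hypot(x - cx, y - cy) <= radius_km]

    print(within_distance(hospitals, centre=(0.0, 0.0), radius_km=50))   # ['H1', 'H2']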

21.6.5 Techniques of Spatial Database Query


Various techniques are used for spatial database queries. R-trees and
quadtrees are widely used techniques for spatial database query.

21.6.5.1 R-Tree
To answer the spatial queries efficiently, special techniques for spatial
indexing are needed. One of the best-known techniques is the R-tree and its variations. R-trees group together objects that are in close spatial proximity on the same leaf nodes of a tree-
structured index. Since a leaf node can point to only a certain number of
objects, algorithms for dividing the space into rectangular subspaces that
include the objects are needed. Typical criteria for dividing space include
minimising the rectangular areas, since this would lead to a quicker
narrowing of the search space. Problems such as having objects with
overlapping spatial areas are handled differently by different variations of
R-trees. The internal nodes of R-trees are associated with rectangles whose
area covers all the rectangles in its sub-tree. Hence, R-trees can easily
answer queries, such as find all objects in a given area by limiting the tree
search to those sub-trees whose rectangles intersect with the area given in
the query.
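A minimal sketch of this pruning idea (the tree below is hand-built for illustration and omits the insertion and node-splitting algorithms): each node carries a rectangle covering its whole sub-tree, and a sub-tree is searched only if that rectangle intersects the query area:

    # A rectangle is (xmin, ymin, xmax, ymax).
    def intersects(a, b):
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    leaf1 = {"rect": (0, 0, 4, 4), "children": [], "objects": ["A", "B"]}
    leaf2 = {"rect": (6, 6, 9, 9), "children": [], "objects": ["C"]}
    root  = {"rect": (0, 0, 9, 9), "children": [leaf1, leaf2], "objects": []}

    def search(node, query_rect):
        if not intersects(node["rect"], query_rect):
            return []                       # prune this whole sub-tree
        if not node["children"]:
            return list(node["objects"])    # a fuller version would also test each object's own rectangle
        results = []
        for child in node["children"]:
            results.extend(search(child, query_rect))
        return results

    print(search(root, (1, 1, 3, 3)))       # ['A', 'B']; leaf2 is never visited

The same argument explains why minimising the covering rectangles matters: smaller rectangles intersect fewer queries, so more sub-trees are pruned.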

21.6.5.2 Quadtree
Other spatial storage structures include quadtrees and their variations. The quadtree is an alternative representation for two-dimensional data. A quadtree is a spatial index that generally divides each space or sub-space into equally sized areas and proceeds with the subdivision of each sub-space to identify the positions of various objects. Quadtrees are often used for
storing raster data. Raster is a cellular data structure composed of rows and
columns for storing images. Groups of cells with the same value represent
features.
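A minimal sketch of quadtree subdivision, in which each square region is split into four equal quadrants until it holds at most one point (the coordinates are illustrative):

    def build_quadtree(points, x, y, size):
        """Recursively split the square (x, y, size) until it holds at most one point."""
        inside = [(px, py) for px, py in points
                  if x <= px < x + size and y <= py < y + size]
        if len(inside) <= 1:
            return {"bounds": (x, y, size), "points": inside}
        half = size / 2
        return {"bounds": (x, y, size),
                "children": [build_quadtree(inside, x, y, half),
                             build_quadtree(inside, x + half, y, half),
                             build_quadtree(inside, x, y + half, half),
                             build_quadtree(inside, x + half, y + half, half)]}

    tree = build_quadtree([(1, 1), (2, 7), (6, 6)], x=0, y=0, size=8)
    print(len(tree["children"]))   # 4: the space was subdivided once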

21.7 CLUSTERING-BASED DISASTER-PROOF DATABASES

If downtime is not an option and the Web never closes for business, how do
we keep our company’s doors open 24/7? The answer lies in high-
availability (HA) systems that approach 100 per cent uptime.
The principles of high availability define a level of backup and recovery.
Until recently, high availability simply meant hardware or software recovery
via RAID (Redundant Array of Independent Disks). RAID addressed the need for fault tolerance in data but did not solve the problem of the failure of a complete DBMS.
For even more uptime, database administrators are turning to clustering as
the best way to achieve high availability. Recent moves by Oracle, with its
Real Application Cluster and Microsoft, with MCS (Microsoft Cluster
Service) have made multinode clusters for HA in production environments
mainstream.

Fig. 21.7 Clustering of database

In a high-availability setup, a cluster functions by associating servers that


have the ability to share a disk group. Fig. 21.7 shows an example of hot-
standby model for clustering of databases. As illustrated here, each node has a fail-over node within its cluster. If a failure occurs in Node 1, Node 2 picks
up the slack by assuming the resources and the unique logic and transaction
functions of the failed DBMS.
Clustering can have the added benefit of not being bound by node
colocation. Fiber-optic connections, which can be cabled for miles between
the nodes in a cluster, ensure continued operation even in the face of a
complete meltdown of the primary system.
When a hot-standby model is in place, downtimes may be less than a minute. This is especially important if the service-level agreement requires 99.9 per cent uptime or higher, since 99.9 per cent uptime translates to only about 8.7 hours of downtime per year (0.1 per cent of the 8,760 hours in a year).
Clustering technologies are pricey, however. The enterprise software and
hardware must be uniform and compatible with the clustering technology to
work properly. There is also the associated overhead in the design and
maintenance of redundant systems.
One cost-effective solution is log shipping, in which physically distinct databases are kept synchronised by sending transaction logs from one server to another. In the event of a failure, the logs can be used to
reinstate the settings up to the point of the failure. Other methods include
snapshot databases and replication technologies such as Sybase’s Replication
Server, which has been around for decades.
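A minimal sketch of the log-shipping idea (purely illustrative; real products ship actual transaction log files and replay them with recovery tooling): the primary records committed changes in a log, and the standby periodically replays the unapplied tail of that log:

    primary_data, primary_log = {}, []
    standby_data, standby_applied = {}, 0

    def commit_on_primary(key, value):
        primary_data[key] = value
        primary_log.append((key, value))     # committed change recorded in the log

    def ship_and_apply():
        """Ship the unapplied tail of the log to the standby and replay it in order."""
        global standby_applied
        for key, value in primary_log[standby_applied:]:
            standby_data[key] = value
        standby_applied = len(primary_log)

    commit_on_primary("order:101", "NEW")
    commit_on_primary("order:101", "SHIPPED")
    ship_and_apply()
    print(standby_data)    # {'order:101': 'SHIPPED'}: the standby has caught up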
High-availability add-ons to databases are useful but should be understood
in the context of a complete HA methodology. This requires a concerted
effort toward standardization on each of your mission-critical infrastructures.
Fault-tolerant application design with hands-off exception handling, self-
healing and redundant networks, and a stable operating system are all
prerequisites for high availability.

REVIEW QUESTIONS
1. What is Internet? What are the available Internet services?
2. What is WWW? What are Web technologies? Discuss each of them.
3. What are hypertext links?
4. What is HTML? Give an example of HTML file.
5. What is HTTP? How does it work?
6. What is an IP address? What is its importance?
7. What is domain name? What is its use?
8. What is a URL? Explain with an example.
9. What is MIME in the context of the WWW? What is its importance?
10. What are Web browsers?
11. What do you mean by web databases? What are Web database tools? Explain.
12. What is XML? What are XML documents? Explain with an example.
13. What are the advantages and disadvantages of Web databases?
14. What do you mean by spatial data? What are spatial databases?
15. What is a digital library? What are its components? Discuss each one of them.
16. Why do we use digital libraries?
17. Discuss the technical developments and technical areas of digital libraries.
18. How do we get access to digital libraries?
19. Discuss the application of digital libraries for scientific journals.
20. Explain the method or form in which data is stored in digital libraries.
21. What are the potential benefits of digital libraries?
22. What are multimedia databases?
23. What are multimedia sources? Explain each one of them.
24. What do you mean by content-based retrieval in multimedia databases?
25. What are the automatic analysis and manual identification approaches to multimedia indexing?
26. What are the different multimedia sources?
27. What are the properties of images?
28. What are the properties of the video?
29. What is a document and how is it stored in a multimedia database?
30. What are the properties of the audio source?
31. How is a query processed in multimedia databases? Explain.
32. How are multimedia sources identified in multimedia databases? Explain.
33. What are the applications of multimedia databases?
34. What is mobile computing?
35. Explain the mobile computing environment with the help of a diagram.
36. What is a mobile database? Explain the architecture of mobile database with neat sketch.
37. What is spatial data model?
38. What do you mean by element?
39. What is geometry or geometric object?
40. What is a layer?
41. What is spatial query?
42. What is spatial overlay?
43. Differentiate between range queries, neighbour queries and spatial joins.
44. What are R-trees and Quadtrees?
45. What are the main characteristics of spatial databases?
46. Explain the concept of clustering-based disaster-proof databases.

STATE TRUE/FALSE

1. The Internet is a worldwide collection of computer networks connected by communication


media that allow users to view and transfer information between computers.
2. The World Wide Web is a subset of the Internet that uses computers called Web servers to
store multimedia files.
3. An IP address consists of four numbers separated by periods. Each number must be between
‘0’ and 255.
4. HTML is an Internet language for describing the structure and appearance of text documents.
5. URLs are used to identify specific sites and files available on the WWW.
6. The program that can be used to interface with a database on the Web is Common Gateway
Interface (CGI).
7. Digital library is a managed collection of information, with associated services, where the
information is stored in digital formats and accessible over a network.
8. Different types of spatial data include data from GIS, CAD, and CAM systems.
9. The spatial database is designed to make the storage, retrieval, and manipulation of spatial
data easier and more natural to users such as a GIS.
10. Some of the spatial element types are points, line strings, and polygons.
11. Each geometric object is required to be uniquely identified by a numeric geometric identifier
(GID), associating the object with its corresponding attribute set.
12. R-trees are often used for storing raster data.
13. Multimedia databases provide features that allow users to store and query different types of
multimedia information.
14. The multimedia information includes images, video, audio, and documents.
15. In a mobile database the users have their applications and data with them on their portable
laptop computers.
16. A geometry or geometric object is the representation of a user’s spatial feature, modelled as
an ordered set of primitive elements.
17. A layer is a homogeneous collection of geometries having the same attribute set.

TICK (✓) THE APPROPRIATE ANSWER

1. Service provided by the Internet is

a. search engine.
b. WWW.
c. FTP.
d. All of these.

2. The first form of Internet developed is known as

a. ARPAnet.
b. NSFnet.
c. MILInet.
d. All of these.

3. Which of the following is not an Internet addressing system?

a. Domain name.
b. URL.
c. IP address.
d. HTTP.

4. Which of the following must be unique in Internet?

a. IP address
b. E-mail address
c. Domain name
d. All of these.

5. Which of the following is the expansion of HTTP?


a. Hypertext Transport Protocol.
b. Higher Tactical Team Performance.
c. Higher Telephonic Transport Protocol.
d. None of these.

6. What is the expansion of CGI?

a. Compiler Gateway Interface.


b. Common Gateway Interface.
c. Command Gateway Interface.
d. None of these.

7. Which of the following is an example of spatial data?

a. GIS data.
b. CAD data.
c. CAM data.
d. All of these.

8. The component of digital libraries is

a. people.
b. economic.
c. computers and networks.
d. All of these.

9. Which of the following is not a spatial data type?

a. Line
b. Points
c. Polygon
d. Area.

10. Which of the following finds objects of a particular type that is within a given spatial area or
within a particular distance from a given location?

a. Range query
b. Spatial joins
c. Nearest neighbour query
d. None of these.

11. Which of the following is a spatial indexing method?

a. X-trees
b. R-trees
c. B-trees
d. None of these.
12. Which of the following is a mathematical transformation used by image compression
standards?

a. Wavelet Transform
b. Discrete Cosine Transform
c. Discrete Fourier Transform
d. All of these.

13. Which of the following is not an image property?

a. Cell
b. Shape descriptor
c. Property descriptor
d. Pixel descriptor.

14. Which of the following is an example of a database application where content-based retrieval
is useful?

a. Trademarks and Copyrights


b. Medical imaging
c. Fashion and fabric design
d. All of these.

15. Spatial databases keep track of objects in a

a. multidimensional space.
b. single-dimensional space.
c. Both (a) & (b).
d. None of these.

FILL IN THE BLANKS

1. The Internet is a worldwide collection of _____ connected by _____ media that allow users
to view and transfer information between _____.
2. The World Wide Web is a subset of _____ that uses computers called _____ to store
multimedia files.
3. The _____ is a system, based on hypertext and HTTP, for providing, organising, and
accessing a wide variety of resources that are available via the Internet.
4. HTML is the abbreviation of _____.
5. HTML is used to create _____ stored at web sites.
6. URL is the abbreviation for _____.
7. An _____ is a unique number that identifies computers on the Internet.
8. _____ look up the domain name and match it to the corresponding IP address so that data can
be properly routed to its destination on the Internet.
9. The Common Gateway Interface CGI is a standard for interfacing _____ with _____.
10. _____ is the set of rules, or protocol, that governs the transfer of hypertext between two or
more computers.
11. _____ provide the concept of database that keep track of objects in a multidimensional space.
12. _____ provide features that allow users to store and query different types of multimedia
information like images, video clips, audio clips and text or documents.
13. The _____ is a hierarchical structure consisting of elements, geometries and layers, which
correspond to representations of spatial data.
14. _____ are composed of geometries, which in turn are made up of elements.
15. An element is the basic building block of a geometric feature for the _____.
16. A _____ finds objects of a particular type that are within a given spatial area or within a
particular distance from a given location.
17. _____ query finds an object of a particular type that is closest to a given location.
18. A _____ is the representation of a user’s spatial feature, modelled as an ordered set of
primitive elements.
19. The process of overlaying one theme with another in order to determine their geographic
relationships is called _____.
20. _____ joins the objects of two types based on some spatial condition, such as the objects
intersecting or overlapping spatially or being within a certain distance of one another.
21. The multimedia queries are called _____ queries.
22. _____ is a general-purpose mathematical analysis tool that has been used in a variety of
information-retrieval applications.
23. An indexing technique called _____ can then be used to group similar documents together.
24. Spatial databases keep track of objects in a _____ space.
Part-VII

CASE STUDIES
Chapter 22

Database Design: Case Studies

22.1 INTRODUCTION

From Chapter 6 to Chapter 10, we discussed the concepts of relational
databases and the database design steps. This chapter presents some
practical database design projects as case studies that use those concepts.
Different types of case studies have been considered, covering several
important aspects of real-life business situations, and a database design
exercise has been carried out for each. Each project starts with the
requirement definition and assumes that data analysis has been completed.
In all the case studies, while creating tables, only representative relations
(tables) and their attributes have been considered. In a real-life situation,
the number of relations (tables) and attributes may vary depending on the
actual business requirements.
Thus, this chapter provides the reader with an opportunity to
conceptualise how to design a database for a given application. It also gives
an overall insight into how a database for an application can be designed.

22.2 DATABASE DESIGN FOR RETAIL BANKING

M/s Greenlay Bank has just ventured into a retail banking system with the
following functions (sub-processes) at the beginning:
Saving Bank accounts.
Current Bank accounts.
Fixed deposits.
Loans.
DEMAT Account.
Fig. 22.1 shows various sub-processes of a typical retail banking system.
Each function (or sub-process), in turn, has multiple child processes that
work together in harmony for the process to be useful. In this example, we
will consider only three functions, namely saving bank (SB) accounts,
current bank (CB) accounts and fixed deposits (FDs).

22.2.1 Requirement Definition and Analysis


After a detailed analysis, following activities and functionalities have been
identified for each sub-process of the retail banking system:
Saving Account

Bank maintains record of each customer with the following details:

CUST-NAME : Customer name
ADDRESS : Customer address
CONT-NO : Customer contact number
INT-NAME : Introducer name
INT-ACC : Introducer account number

Saving bank transactions, both deposits and withdrawals, are updated on real-time
basis.

Fig. 22.1 Retail banking system

Current Account

Bank maintains record of each organisation or company with the following details:

ORG-NAME : Organisation name
ADDRESS : Organisation address
CONT-NO : Organisation contact number
INT-NAME : Introducer name
INT-ACC : Introducer account number

Current account transactions, both deposits and withdrawals, are updated on real-
time basis.

Fixed Deposit (FD)

Bank maintains record of each FD customer with the following details:

CUST-NAME : Customer name
ADDRESS : Customer address
CONT-NO : Customer contact number
FD-IRROR : Interest rate
FD-DUR : Duration (time period) of FD
INT-NAME : Introducer name
INT-ACC : Introducer account number

FD transactions are updated on periodic basis.

22.2.2 Conceptual Design: Entity-Relationship (E-R) Diagram


In the conceptual design, high-level description of the data in terms of
entity-relationship (E-R) model is developed. Fig. 22.2 illustrates a typical
initial E-R diagram, as discussed in chapter 6, for M/s Greenlay’s retail
banking system. An E-R diagram attempts to give a visual representation of
the entity relationship between various tables created or identified for the
retail banking system.
Fig. 22.2 E-R Diagram for retail banking
22.2.3 Logical Database Design: Table Definitions
Using the standard approach discussed in Chapter 5 (Section 5.5.6), the
following tables are created by mapping E-R diagram shown in Fig. 22.2 to
the relational model:

CREATE TABLE EMP_MASTER (EMP-NO CHAR (10),
    BR-NO CHAR (10),
    NAME CHAR (30),
    DEPT CHAR (25),
    DESG CHAR (20),
    PRIMARY KEY (EMP-NO, BR-NO));

CREATE TABLE ACCT_MASTER (ACCT-NO CHAR (10),
    BR-NO CHAR (10),
    ACCT-TYPE CHAR (2),
    NOMINEE CHAR (30),
    SF-NO CHAR (10),
    LF-NO CHAR (10),
    TITLE CHAR (30),
    INTR-CUST-NO CHAR (10),
    INTR-ACCT-NO CHAR (10),
    STATUS CHAR (1),
    PRIMARY KEY (ACCT-NO, BR-NO));

CREATE TABLE CUST_MASTER (CUST-NO CHAR (10),
    NAME CHAR (30),
    DOB DATE,
    PRIMARY KEY (CUST-NO));

CREATE TABLE ADD-DETAILS (HLD-NO CHAR (5),
    STREET CHAR (25),
    CITY CHAR (20),
    PIN CHAR (6));

CREATE TABLE FD_MASTER (FD-NO CHAR (10),
    BR-NO CHAR (10),
    SF-NO CHAR (10),
    TITLE CHAR (30),
    INTR-CUST-NO CHAR (10),
    INTR-ACCT-NO CHAR (10),
    NOMINEE CHAR (30),
    FD-AMT NUM (8, 2),
    PRIMARY KEY (FD-NO, BR-NO));

CREATE TABLE FD_DETAILS (FD-NO CHAR (10),
    TYPE CHAR (1),
    FD-DUR NUM (5),
    FD-AMT NUM (8, 2),
    PRIMARY KEY (FD-NO));

CREATE TABLE ACCT-FD-CUST_DETAILS (ACCT-FD-NO CHAR (10),
    CUST-NO CHAR (10),
    PRIMARY KEY (ACCT-FD-NO, CUST-NO));

CREATE TABLE TRANS_MASTER (TRANS-NO CHAR (10),
    ACCT-NO CHAR (10),
    TRANS-TYPE CHAR (1),
    TRANS-AMT NUM (8, 2),
    BALANCE NUM (8, 2),
    DATE DATE,
    PRIMARY KEY (TRANS-NO, ACCT-NO));

CREATE TABLE TRANS_DETAILS (TRANS-NO CHAR (10),
    BR-NAME CHAR (30),
    BANK-NAME CHAR (30),
    INV-DATE DATE,
    PRIMARY KEY (TRANS-NO));
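As a quick illustration of how these relations could be used (this query is not part of the original case study), the sketch below lists the transactions of one account together with the account title. The account number is a hypothetical value, and the hyphenated column names are kept exactly as written in the text; a real SQL engine would normally expect underscores (for example, ACCT_NO) or quoted identifiers.

-- Illustrative only: all transactions of a given account with its title
SELECT T.TRANS-NO, T.TRANS-TYPE, T.TRANS-AMT, T.BALANCE, A.TITLE
FROM   TRANS_MASTER T, ACCT_MASTER A
WHERE  T.ACCT-NO = A.ACCT-NO
AND    A.ACCT-NO = '0000012345'      -- hypothetical account number
ORDER BY T.DATE;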

22.2.4 Logical Database Design: Sample Table Contents


The sample contents for each of the above relations are shown in Fig. 22.3.

Fig. 22.3 Sample relations and contents for retail banking


22.3 DATABASE DESIGN FOR AN ANCILLARY MANUFACTURING SYSTEM

M/s ABC manufacturing company is in the business of ancillary


manufacturing. It manufactures special-purpose assemblies for its customers.
The company has a number of processes in the assembly line supervised by
separate departments. The company also has an account department to
manage the expenditure of the assemblies. It is required to carry out database
design for M/s ABC manufacturing company to meet its computerised
management information system (MIS) requirements.

22.3.1 Requirement Definition and Analysis


After a detailed analysis of the present and proposed functioning of M/s
ABC manufacturing company, the following requirements have been
identified for consideration while designing the database:
Each assembly is identified by a unique assembly identification number (ASS-ID), the
customer for the assembly (CUST) and ordered date (ORD-DATE).
To manufacture assemblies, the organisation contains a number of processes, each identified by a
unique process identification number (PROC-ID) and each supervised by a department
(DEPT).
The assembly processes of the organisation are classified into three types, namely painting
(PAINT), fitting (FIT) and cutting (CUT). The type set that represent these processes are
uniform and hence use the same identifier as the process. The following information is kept
about each type of process:

PAINT : PAINT-TYPE, PAINT-METHOD
FIT : FIT-TYPE
CUT : CUT-TYPE, MC-TYPE

During manufacturing, an assembly can pass through any sequence of processes in any order.
It may pass through the same process more than once.
A unique job number (JOB-NO) is assigned every time a process begins on an assembly.
Information recorded about a JOB-NO includes COST, DATE-COMMENCED and DATE-
COMPLETED at the process as well as additional information that depends on the type of
JOB process.
JOBs are classified into job type sets. These type sets are uniform and hence use the same
identifier as JOB-NO. Information stored about particular job types is as follows:

CUT-JOB : MC-TYPE, MC-TIME-USED, MATL-USED, LABOUR-TIME
(Here it is assumed that only one machine type and only one type of
material is used with each CUTTING process.)
PAINT-JOB : COLOUR, VOLUME, LABOUR-TIME
(Here it is assumed that only one COLOUR is used by each PAINTING
process.)
FIT-JOB : LABOUR-TIME

An accounting system is maintained by the organisation to record the expenditure for each of
the following:

PROC-ID
ASS
DEPT

Following three types of accounts are maintained by the organisation:

ASS-ACCT : To record costs of assemblies.
DEPT-ACCT : To record costs of departments.
PROC-ACCT : To record costs of processes.

The above account types can be kept in different type sets. The type sets are uniform and
hence use a common identifier, ACCOUNT.
As a job proceeds, cost transactions can be recorded against it. Each such transaction is
identified by a unique transaction number (TRANS-NO) and is for a given cost, SUP-COST.
Each transaction updates the following three accounts:

PROC-ACCT

ASS-ACCT

DEPT-ACCT

The updated process account is for the process used by a job.
The updated department account is for the department that manages that process.
The updated assembly account is for the assembly that requires the job.

22.3.2 Conceptual Design: Entity-Relationship (E-R) Diagram


Fig. 22.4 illustrates the E-R diagram for M/s ABC manufacturing company,
depicting a high-level description of the data items.

22.3.3 Logical Database Design: Table Definitions


Using the standard approach discussed in chapter 5 (Section 5.5.6),
following tables are created by mapping E-R diagram shown in Fig. 22.4 to
the relational model:

CREATE TABLE CUSTOMER (CUST-ID CHAR (10),
    ADDRESS CHAR (30),
    PRIMARY KEY (CUST-ID));

CREATE TABLE ACCOUNTS (ACCT-ID CHAR (10),
    DT-EST DATE,
    PRIMARY KEY (ACCT-ID));

CREATE TABLE ASSEMBLY_ACCOUNTS (ACCT-ID CHAR (10),
    DET-1 CHAR (30),
    PRIMARY KEY (ACCT-ID));

CREATE TABLE DEPT_ACCOUNTS (ACCT-ID CHAR (10),
    DET-2 CHAR (30),
    PRIMARY KEY (ACCT-ID));

CREATE TABLE PROCESS_ACCOUNTS (ACCT-ID CHAR (10),
    DET-3 CHAR (30),
    PRIMARY KEY (ACCT-ID));

CREATE TABLE A1 (ACCT-ID CHAR (10),
    ASS-ID CHAR (10),
    PRIMARY KEY (ACCT-ID, ASS-ID));

CREATE TABLE A2 (ACCT-ID CHAR (10),
    DEPT CHAR (20),
    PRIMARY KEY (ACCT-ID));

CREATE TABLE A3 (ACCT-ID CHAR (10),
    PROC-ID CHAR (10),
    PRIMARY KEY (ACCT-ID, PROC-ID));

CREATE TABLE ORDERS (CUST-ID CHAR (10),
    ASS-ID CHAR (10),
    PRIMARY KEY (CUST-ID, ASS-ID));

Fig. 22.4 E-R Diagram for ancillary manufacturing system

CREATE TABLE ASSEMBLIES (ASS-ID CHAR (10),
    ASS-DETAIL CHAR (30),
    DATE-ORD DATE,
    PRIMARY KEY (ASS-ID));

CREATE TABLE JOBS (JOB-NO CHAR (10),
    DATE-COMM DATE,
    DATE-COMP DATE,
    COST REAL,
    PRIMARY KEY (JOB-NO));

CREATE TABLE CUT-JOBS (CUT-JOB-NO CHAR (15),
    MATL-USED CHAR (15),
    MC-TYPE-USED CHAR (10),
    MC-TIME-USED TIME,
    LABOR-TIME TIME,
    PRIMARY KEY (CUT-JOB-NO));

CREATE TABLE FIT-JOBS (FIT-JOB-NO CHAR (15),
    LABOR-TIME TIME,
    PRIMARY KEY (FIT-JOB-NO));

CREATE TABLE PAINT-JOBS (PAINT-JOB-NO CHAR (15),
    VOLUME CHAR (5),
    COLOUR CHAR (5),
    LABOR-TIME TIME,
    PRIMARY KEY (PAINT-JOB-NO));

CREATE TABLE TRANSACTIONS (TRANS-NO CHAR (10),
    SUP-COST REAL,
    PRIMARY KEY (TRANS-NO));

CREATE TABLE ACTIVITY (TRANS-NO CHAR (10),
    JOB-NO CHAR (10),
    PRIMARY KEY (TRANS-NO, JOB-NO));

CREATE TABLE T1 (TRANS-NO CHAR (10),
    ACCT-1 CHAR (10),
    PRIMARY KEY (TRANS-NO));

CREATE TABLE T2 (TRANS-NO CHAR (10),
    ACCT-2 CHAR (10),
    PRIMARY KEY (TRANS-NO));

CREATE TABLE T3 (TRANS-NO CHAR (10),
    ACCT-3 CHAR (10),
    PRIMARY KEY (TRANS-NO));

CREATE TABLE DEPARTMENTS (DEPT CHAR (20),
    DEPT-DATA CHAR (20),
    PRIMARY KEY (DEPT));

CREATE TABLE PROCESSES (PROC-ID CHAR (10),
    PROC-DATA CHAR (20),
    PRIMARY KEY (PROC-ID));

CREATE TABLE PAINT-PROCESSES (PAINT-PROC-ID CHAR (10),
    PAINT-METHOD CHAR (20),
    PRIMARY KEY (PAINT-PROC-ID));

CREATE TABLE FIT-PROCESSES (FIT-PROC-ID CHAR (10),
    FIT-TYPE CHAR (5),
    PRIMARY KEY (FIT-PROC-ID));

CREATE TABLE CUT-PROCESSES (CUT-PROC-ID CHAR (10),
    CUT-TYPE CHAR (5),
    PRIMARY KEY (CUT-PROC-ID));
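To illustrate how a cost transaction would touch these relations (an illustrative sketch only, not part of the case study), the statements below record one transaction against a job and link it, through T1, T2 and T3, to the assembly, department and process accounts it updates. All identifier values are hypothetical, and the hyphenated names are kept as in the text.

INSERT INTO TRANSACTIONS VALUES ('TR001', 500.00);  -- transaction and its cost (SUP-COST)
INSERT INTO ACTIVITY VALUES ('TR001', 'J010');      -- the job on which the cost was incurred
INSERT INTO T1 VALUES ('TR001', 'A001');            -- assembly account updated
INSERT INTO T2 VALUES ('TR001', 'A002');            -- department account updated
INSERT INTO T3 VALUES ('TR001', 'A003');            -- process account updated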

22.3.4 Logical Database Design: Sample Table Contents


The sample contents for each of the above relations are shown in Fig. 22.5.

22.3.5 Functional Dependency (FD) Diagram


Fig. 22.6 shows the functional dependency (FD) diagram for the ancillary
manufacturing system.
Fig. 22.5 Sample relations and contents for ancillary manufacturing system
The mathematical representation of this FD can be given as

ASS-ID → DT-ORDERED, ASS-DETAILS, CUSTOMER
TRANS-NO → ASS-ID, DATE, PROC-ACCT, SUP-COST, DEPT-ACCT, PROC-ID
CUT-JOB-NO → JOB-NO, MC-TYPE-USED, MC-TIME-USED, MATL-USED
JOB-NO → DATE-COMM, DATE-COMP, ASS-ID, COST, PROC-ID
FIT-JOB-NO → JOB-NO, LABOUR-TIME
CUT-JOB-NO, FIT-JOB-NO, PAINT-JOB-NO → JOB-NO

Fig. 22.6 FD Diagram for ancillary manufacturing system

TRANS-NO, ASS-ACCT, JOB-NO → ASS-ID
ASS-ACCT → ASS-ID, ACCOUNT, DET-1
ACCOUNT → BALANCE
ASS-ACCT, DEPT-ACCT, PROC-ACCT → ACCOUNT
PAINT-JOB-NO → JOB-NO, COLOUR, VOLUME, LABOUR-TIME
PROC-ID → DEPT
PAINT-PROC-ID, FIT-PROC-ID, CUT-PROC-ID, PROC-ACCT, JOB-NO, TRANS-NO → PROC-ID
PROC-ID → PROC-DATA
PROC-ID, DEPT-ACCT → DEPT
DEPT → DEPT-DATA
DEPT-ACCT → DEPT, DET-2, ACCOUNT
PROC-ACCT → PROC-ID, ACCOUNT, DET-3
PAINT-PROC-ID → PAINT-TYPE, PAINT-METHOD
FIT-PROC-ID → FIT-TYPE

22.4 DATABASE DESIGN FOR AN ANNUAL RATE CONTRACT SYSTEM

M/s Global Computing System manufactures computing devices under its


brand name. It requires different types of items from various suppliers to
manufacture its branded product. Now the company wants to finalise an
annual rate contract with its suppliers for supply of various items. It is
required to carry out database design for M/s Global Computing System to
computerise the Annual Rate Contract (ARC) handling system.

22.4.1 Requirement Definition and Analysis


After a detailed analysis of the present and proposed functioning of M/s
Global Computing System, the following requirements have been identified:
The manufacturing company negotiates contracts with several suppliers for the supply of
different amounts of various items (components) at a price that is fixed for one full year.
Orders are placed by the company against any of the negotiated contracts for the supply of
items at a price finalised in the contract.
An order can consist of any amount of those items that are in that contract.
Any number of orders can be released against a contract. However, the sum of any given item
type in all orders made against one contract cannot exceed the amount of that item
type mentioned in the contract.
An inquiry would be made to find if a sufficient quantity of an item is available before an
order for that item is placed.
All the items in an order must be supplied as part of the same contract.
Each order is placed only against one contract and is released on behalf of one project.
An order is released for one or more item types in that contract.

22.4.2 Conceptual Design: Entity-Relationship (E-R) Diagram


Fig. 22.7 illustrates E-R diagram for M/s Global Computing System
depicting high-level description of the data items.

22.4.3 Logical Database Design: Table Definitions


Using the standard approach discussed in chapter 5 (Section 5.5.6),
following tables are created by mapping E-R diagram shown in Fig. 22.7 to
the relational model:

CREATE TABLE SUPPLIERS (SUPPLIER-NO CHAR (10),
    SUPPLIER-NAME CHAR (20),
    SUPPLIER-ADDRESS CHAR (30),
    PRIMARY KEY (SUPPLIER-NO));

Fig. 22.7 E-R Diagram for annual rate contract system

CREATE TABLE CONTRACTS (CONTRACT-NO CHAR (10),
    PRIMARY KEY (CONTRACT-NO));

CREATE TABLE NEGOTIATE (SUPPLIER-NO CHAR (10),
    CONTRACT-NO CHAR (10),
    DATE-OF-CONTRACT DATE,
    PRIMARY KEY (SUPPLIER-NO, CONTRACT-NO));

CREATE TABLE TO-SUPPLY (ITEM-NO CHAR (10),
    CONTRACT-NO CHAR (10),
    CONTRACT-PRICE REAL,
    CONTRACT-AMOUNT REAL,
    PRIMARY KEY (ITEM-NO, CONTRACT-NO));

CREATE TABLE ITEMS (ITEM-NO CHAR (10),
    ITEM-DESCRIPTION CHAR (20),
    PRIMARY KEY (ITEM-NO));

CREATE TABLE ORDERS (ORDER-NO CHAR (10),
    DATE-REQUIRED DATE,
    DATE-COMPLETE DATE,
    PRIMARY KEY (ORDER-NO));

CREATE TABLE PROJECTS (PROJECT-NO CHAR (10),
    PROJECT-DATA CHAR (20),
    PRIMARY KEY (PROJECT-NO));
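The requirement that orders against a contract must not exceed the contracted amount of an item suggests an availability check before an order is placed. The query below is an illustrative sketch only: it assumes a hypothetical ORDER-ITEMS relation (ORDER-NO, ITEM-NO, ORDER-QUANTITY) implied by the FD in Section 22.4.5, and a CONTRACT-NO column in ORDERS, neither of which appears in the table definitions above. The contract and item numbers are also hypothetical, and the hyphenated names are kept as in the text.

-- Illustrative only: quantity of item 'I100' still available under contract 'C001'
SELECT TS.CONTRACT-AMOUNT -
       (SELECT COALESCE(SUM(OI.ORDER-QUANTITY), 0)
        FROM   ORDER-ITEMS OI, ORDERS O
        WHERE  OI.ORDER-NO = O.ORDER-NO
        AND    O.CONTRACT-NO = TS.CONTRACT-NO
        AND    OI.ITEM-NO = TS.ITEM-NO) AS QTY-REMAINING
FROM   TO-SUPPLY TS
WHERE  TS.CONTRACT-NO = 'C001'
AND    TS.ITEM-NO = 'I100';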

22.4.4 Logical Database Design: Sample Table Contents


The sample contents for each of the above relations are shown in Fig. 22.8.
Fig. 22.8 Sample relations and contents for annual rate contract system

22.4.5 Functional Dependency (FD) Diagram


Fig. 22.9 shows the functional dependency (FD) diagram for the annual rate
contract system.

Fig. 22.9 FD Diagram for annual rate contract system

The mathematical representation of this FD can be given as

SUPPLIER-NO → SUPPLIER-NAME, SUPPLIER-ADDRESS
CONTRACT-NO → SUPPLIER-NO, DATE-OF-CONTRACT
CONTRACT-NO, ITEM-NO → CONTRACT-AMOUNT, CONTRACT-PRICE
ITEM-NO → ITEM-DESCRIPTION
ITEM-NO, ORDER-NO → ORDER-QUANTITY
PROJECT-NO → PROJECT-DATA
ORDER-NO → PROJECT-NO, CONTRACT-NO, DATE-REQUIRED, DATE-COMPLETED

22.5 DATABASE DESIGN OF TECHNICAL TRAINING INSTITUTE

A private technical training institute provides courses on different subjects of


computer engineering to its corporate and other customers. The institute
wants to computerise its training activities for efficient management of its
information system.

22.5.1 Requirement Definition and Analysis


The detailed analysis was done for the present and proposed functioning of
the technical training institute. Following requirements have been identified:
Each course in the institute is identified by a given course number (COURSE-NO).
Each course has a descriptive title (DESC-TITLE), for example DATABASE
MANAGEMENT SYSTEM, together with an originator and approved date of
commencement (COM-DATE).
Each course has a certain duration (DURATION) measured in number of days and is of given
class (CLASS), for example, ‘DATABASE’.
Each course may be offered any number of times. Each course presentation commences on a
given START-DATE at a given location. The composite identifier for each course offering is
as follows:

COURSE-NO, LOCATION-OFFERED, START-DATE

The course may be presented either to the general public (GEN-OFFERING), or, as a special
presentation (SPECIAL-OFFERING), to a specific organisation.
There can be any number of participants at each course presentation. Each participant has a
name and is associated with some organisation.
Each course has a fee structure. There is a standard FEE for each participant at a general
offering.
There is a separate SPECIAL-FEE if the course is a SPECIAL-OFFERING on an
organisation’s premises. In that case only a fixed fee is charged for the whole course to the
organisation and there is no extra fee for each participant.
Employees of the organisation can be authorised to be LECTURERs or ORIGINATORs. The
sets that represent these roles are uniform and use the same identifier as the source entity,
EMPLOYEE.
Each lecturer may spend any number of days on one presentation of a given course provided
that such an assignment does not exceed the duration of the course.
The DAYS-SPENT by a lecturer on a course offering are recorded.

22.5.2 Conceptual Design: Entity-Relationship (E-R) Diagram


Fig. 22.10 illustrates E-R diagram for private technical training institute
depicting high-level description of the data items.

Fig. 22.10 E-R Diagram for technical training institute


22.5.3 Logical Database Design: Table Definitions
Following tables are created by mapping E-R diagram shown in Fig. 22.10
to the relational model:

CREATE TABLE EMPLOYEE (EMP-ID CHAR (10),
    NAME CHAR (20),
    ADDRESS CHAR (30),
    PRIMARY KEY (EMP-ID));

CREATE TABLE PREPARE (COURSE-NO CHAR (10),
    ORG-NAME CHAR (20),
    PREP-TIME NUMBER,
    PRIMARY KEY (COURSE-NO));

CREATE TABLE COURSE (COURSE-NO CHAR (10),
    DESC-TITLE CHAR (20),
    LENGTH NUMBER,
    CLASS CHAR (5),
    PRIMARY KEY (COURSE-NO));

CREATE TABLE COURSE-HISTORY (COURSE-NO CHAR (10),
    START-DATE DATE,
    LOC-OFFRD CHAR (15),
    STATUS CHAR (5),
    PRIMARY KEY (COURSE-NO, LOC-OFFRD, START-DATE));

CREATE TABLE ASSIGN (COURSE-NO CHAR (10),
    LECT-NAME CHAR (20),
    START-DATE DATE,
    LOC-OFFRD CHAR (15),
    DAYS-SPENT NUMBER,
    PRIMARY KEY (COURSE-NO, LECT-NAME, START-DATE));

CREATE TABLE GENERAL-OFFERING (COURSE-NO CHAR (10),
    START-DATE DATE,
    LOC-OFFRD CHAR (15),
    PRIMARY KEY (COURSE-NO, LOC-OFFRD, START-DATE));

CREATE TABLE SPECIAL-OFFERING (COURSE-NO CHAR (10),
    START-DATE DATE,
    LOC-OFFRD CHAR (15),
    PRIMARY KEY (COURSE-NO, LOC-OFFRD, START-DATE));

CREATE TABLE GENERAL-ATTENDENCE (COURSE-NO CHAR (10),
    ATTENDEE-NAME CHAR (20),
    START-DATE DATE,
    LOC-OFFRD CHAR (15),
    FEE REAL,
    PRIMARY KEY (COURSE-NO, LOC-OFFRD, START-DATE));

CREATE TABLE SPECIAL-ARRANGEMENT (COURSE-NO CHAR (10),
    START-DATE DATE,
    LOC-OFFRD CHAR (15),
    ATTENDEE-NAME CHAR (20),
    PRIMARY KEY (COURSE-NO, LOC-OFFRD, START-DATE));

CREATE TABLE ORGANISATION (ORG-NAME CHAR (10),
    ATTENDENCE NUMBER (2),
    ATTENDEE-NAME CHAR (20),
    START-DATE DATE,
    LOC-OFFRD CHAR (15),
    PRIMARY KEY (LOC-OFFRD, START-DATE));

CREATE TABLE ATTENDENCE (ATTENDEE-NAME CHAR (20),
    TITLE CHAR (20));
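The requirement that a lecturer's time on a course presentation must not exceed the duration of the course can be checked against the ASSIGN and COURSE tables. The query below is an illustrative sketch only (hyphenated names kept as in the text); it lists any assignment whose recorded DAYS-SPENT exceeds the course duration stored in LENGTH.

-- Illustrative only: assignments that violate the stated business rule
SELECT A.LECT-NAME, A.COURSE-NO, A.START-DATE, A.DAYS-SPENT, C.LENGTH
FROM   ASSIGN A, COURSE C
WHERE  A.COURSE-NO = C.COURSE-NO
AND    A.DAYS-SPENT > C.LENGTH;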

22.6 DATABASE DESIGN OF AN INTERNET BOOKSHOP

M/s KLY Enterprise is one of the M/s KLY group companies and runs a large
book store. It keeps books on various subjects. Presently, M/s KLY
Enterprise takes orders from its customers over the phone, and inquiries
about order shipment, delivery status and so on are handled manually. M/s
KLY Enterprise wants to go online and automate its activities through
database design and implementation. It wants to move its entire activities to
a new Web site so that customers can access the catalogue and order books
directly over the Internet.
22.6.1 Requirement Definition and Analysis
The following requirements have been identified after the detailed analysis
of the existing system:
Customers browse the catalogue of books and place orders over the Internet.
M/s KLY’s Internet Book Shop has mostly corporate customers who call the book store and
give the ISBN of a book and a quantity. M/s KLY then prepares a shipment that
contains the books they have ordered. In case enough copies are not available in stock,
additional copies are ordered by M/s KLY. The shipment is delayed until the new copies
arrive, and the entire order is then shipped together.
The book store’s catalogue includes all the books that M/s KLY Enterprise sells.
For each book, the catalogue contains the following details:

ISBN : ISBN number
TITLE : Title of the book
AUTHOR : Author of the book
PUR-PRICE : Book purchase price
SALE-PRICE : Book sales price
PUB-YEAR : Year of publication of book

Most of the customers of M/s KLY Enterprise are regulars, and their records are kept as
follows:

CUST-NAME : Customer’s name
ADDRESS : Customer’s address
CARD-NO : Customer’s credit card number

New customers are given an account number before they can use the Web site.
On M/s KLY’s Web site, customers first identify themselves by their unique customer
identification number (CUST-ID) and then they are allowed to browse the catalogue and
place orders on line.

22.6.2 Conceptual Design: Entity-Relationship (E-R) Diagram


Fig. 22.11 illustrates a typical initial E-R diagram for M/s KLY Enterprise
Internet Shop.
Fig. 22.11 Initial E-R Diagram for internet book shop

22.6.3 Logical Database Design: Table Definitions


The following tables are created by mapping E-R diagram shown in Fig.
22.11 to the relational model:

CREATE TABLE BOOKS (ISBN CHAR (15),
    TITLE CHAR (50),
    AUTHOR CHAR (30),
    QTY-IN-STOCK INTEGER,
    PRICE REAL,
    PUB-YEAR INTEGER,
    PRIMARY KEY (ISBN));

CREATE TABLE ORDERS (ISBN CHAR (15),
    CUST-ID CHAR (10),
    QTY INTEGER,
    ORDER-DATE DATE,
    SHIP-DATE DATE,
    PRIMARY KEY (ISBN, CUST-ID),
    FOREIGN KEY (ISBN) REFERENCES BOOKS,
    FOREIGN KEY (CUST-ID) REFERENCES CUSTOMERS);

CREATE TABLE CUSTOMERS (CUST-ID CHAR (10),
    CUST-NAME CHAR (20),
    ADDRESS CHAR (30),
    CARD-NO CHAR (10),
    PRIMARY KEY (CUST-ID),
    UNIQUE (CARD-NO));

22.6.4 Change (Addition) in Requirement Definition


As can be observed, the ORDERS table contains the field ORDER-DATE
and has (ISBN, CUST-ID) as its primary key. Because of this, a customer
cannot order the same book on different days. Let us assume that the
following additional requirements are also added:
Customers should be able to purchase several different books in a single order. For example,
if a customer wants to purchase five copies of “Database Systems” and three copies of
“Software Engineering”, the customer should be able to place a single order for both books.
M/s KLY Enterprise will ship an ordered book as soon as they have enough copies of that
book, even if an order contains several books. For example, it could happen that five copies
of “Database Systems” are shipped in the first lot because there are ten copies in stock, but that
“Software Engineering” is shipped in a second lot, because there may be only two copies in
stock and more copies might arrive after some time.
Customers can place more than one order per day and they can identify the orders they
placed.

22.6.5 Modified Table Definition


To accommodate the additional requirements stated in Section 22.6.4, a new
attribute (field) order number (ORDER-NO) will be introduced into the
ORDERS table to uniquely identify an order. To purchase several books,
both ORDER-NO and ISBN will be required as the primary key to
determine QTY and SHIP-DATE in the ORDERS table. The modified
ORDERS table is given below:

CREATE TABLE ORDERS (ORDER-NO INTEGER,
    ISBN CHAR (15),
    CUST-ID CHAR (10),
    QTY INTEGER,
    ORDER-DATE DATE,
    SHIP-DATE DATE,
    PRIMARY KEY (ORDER-NO, ISBN),
    FOREIGN KEY (ISBN) REFERENCES BOOKS,
    FOREIGN KEY (CUST-ID) REFERENCES CUSTOMERS);

Thus, orders are now assigned sequential order numbers (ORDER-NO).
The orders that are placed later will have higher order numbers. If
several orders are placed by the same customer on a single day, these orders
will have different order numbers and can thus be distinguished.

22.6.6 Schema Refinement


The following can be observed from the modified table definitions:
The relation (table) BOOKS has only one key, ISBN, and no other functional
dependencies hold over the table. Thus, the relation BOOKS is in BCNF.
The relation CUSTOMERS has the primary key CUST-ID and, since a credit card number
(CARD-NO) uniquely identifies its card holder, the functional dependency (FD) CARD-NO
→ CUST-ID also holds. Since CUST-ID is the primary key and CARD-NO determines it, CARD-NO is
also a candidate key. No other dependencies hold. So, the relation CUSTOMERS is also in BCNF.
For the relation ORDERS, the pair (ORDER-NO, ISBN) is the key of the table.
In addition, since each order is placed by one customer on one specific date, the following two
functional dependencies (FDs) hold:

ORDER-NO → CUST-ID
ORDER-NO → ORDER-DATE

Thus, the table ORDERS is not even in 3NF.

To bring the ORDERS table into BCNF, it is decomposed into the following
two relations:

ORDERS (ORDER-NO, CUST-ID, ORDER-DATE)
ORDER_LIST (ORDER-NO, ISBN, QTY, SHIP-DATE)

The resulting two relations, ORDERS and ORDER_LIST, are both in
BCNF. Also, the decomposition is lossless-join since ORDER-NO, the common
attribute, is a key for the decomposed ORDERS table. Thus, the structure of
the two decomposed tables is given as:

CREATE TABLE ORDERS (ORDER-NO INTEGER,
    CUST-ID CHAR (10),
    ORDER-DATE DATE,
    PRIMARY KEY (ORDER-NO),
    FOREIGN KEY (CUST-ID) REFERENCES CUSTOMERS);

CREATE TABLE ORDER_LIST (ORDER-NO INTEGER,
    ISBN CHAR (15),
    QTY INTEGER,
    SHIP-DATE DATE,
    PRIMARY KEY (ORDER-NO, ISBN),
    FOREIGN KEY (ISBN) REFERENCES BOOKS);
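After the decomposition, the full detail of an order can still be reassembled by joining the two BCNF relations with CUSTOMERS and BOOKS. The following query is an illustrative sketch only; the order number 1001 is a hypothetical value, and the hyphenated names are kept as in the text.

-- Illustrative only: full detail of order 1001
SELECT O.ORDER-NO, C.CUST-NAME, O.ORDER-DATE,
       B.TITLE, L.QTY, L.SHIP-DATE
FROM   ORDERS O, ORDER_LIST L, CUSTOMERS C, BOOKS B
WHERE  O.ORDER-NO = L.ORDER-NO
AND    O.CUST-ID  = C.CUST-ID
AND    L.ISBN     = B.ISBN
AND    O.ORDER-NO = 1001;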

22.6.7 Modified Entity-Relationship (E-R) Diagram


Fig. 22.12 shows modified E-R diagram for M/s KLY Enterprise Internet
shop reflecting modified table definition.

22.6.8 Logical Database Design: Sample Table Contents


The sample contents for each of the modified relations are shown in Fig.
22.13.
Fig. 22.12 Modified E-R Diagram for internet shop

Fig. 22.13 Sample relations and contents for internet book shop
22.7 DATABASE DESIGN FOR CUSTOMER ORDER WAREHOUSE

A warehouse chain keeps several items to supply to its customers on the


basis of order released by them for particular item(s). The warehouse has
number of stores and the stores hold a variety of items. The warehouse is
required to meet all of a customer’s order requirements from those stores
located in the customer’s city. The warehouse chain wants an efficient
database design for its business to meet the increased service demand of its
customers.

22.7.1 Requirement Definition and Analysis


The following requirements have been identified after the detailed analysis
of the existing system:
The quantity of items held (QTY-HELD) by each store appears in relation HOLD and the
stores themselves are described in relation STORES.
The database stores information about the enterprise customers.
The city location of the customer, together with the date of the customer’s first order, is
stored in the database.
Each customer lives in one city only.
The customers order items from the enterprise. Each such order can be for any quantity
(QTY-ORDERED) of any number of items. The items ordered are stored in ITEM-
ORDERED.
Each order is uniquely identified by its order number (ORDER-NO).
The location of store is also kept in the database. Each store is located in one city and there
may be many stores in that city.
Each city has a main coordination centre known as HDQ-ADD for all its stores and there is
one HDQ-ADD for each city.
The database contains some derived data. The data in ITEM-CITY are derived from the relations
STORES and HOLD. Thus each item is taken and the quantities of the item (QTY-HELD) in
all the stores in a city are totalled into QTY-IN-CITY and stored in ITEM-CITY, as sketched in SQL below.
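A minimal sketch of this derivation, assuming the STORES and HOLD relations as defined later in Section 22.7.3 and keeping the hyphenated names used in the text (a real SQL engine would expect underscores or quoted identifiers), is:

-- Illustrative only: deriving the ITEM-CITY data from HOLD and STORES
SELECT H.ITEM-ID, S.LOCATED-IN-CITY AS CITY, SUM(H.QTY-HELD) AS QTY-IN-CITY
FROM   HOLD H, STORES S
WHERE  H.STORE-ID = S.STORE-ID
GROUP BY H.ITEM-ID, S.LOCATED-IN-CITY;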

22.7.2 Conceptual Design: Entity-Relationship (E-R) Diagram


Fig. 22.14 illustrates a high-level E-R diagram for the warehouse chain. An
E-R diagram gives a visual representation of the entity relationship between
various tables created or identified for the warehouse chain.

22.7.3 Logical Database Design: Table Definition


The following tables are created by mapping E-R diagram shown in Fig.
22.14 to the relational model:
Fig. 22.14 E-R Diagram for customer order warehouse

CREATE TABLE STORES (STORE-ID CHAR (5),
    PHONE INTEGER,
    NO-OF-BINS INTEGER,
    LOCATED-IN-CITY CHAR (15),
    PRIMARY KEY (STORE-ID));

CREATE TABLE CITY (CITY CHAR (15),
    HDQ-ADD CHAR (30),
    STATE CHAR (15),
    PRIMARY KEY (CITY));

CREATE TABLE HOLD (STORE-ID CHAR (5),
    ITEM-ID CHAR (3),
    QTY-HELD INTEGER,
    PRIMARY KEY (STORE-ID, ITEM-ID));

CREATE TABLE ITEMS (ITEM-ID CHAR (3),
    DESCRP CHAR (20),
    SIZE CHAR (10),
    WEIGHT REAL,
    PRIMARY KEY (ITEM-ID));

CREATE TABLE ORDERS (ORDER-NO CHAR (5),
    ORDER-DATE DATE,
    CUST-NAME CHAR (20),
    PRIMARY KEY (ORDER-NO));

CREATE TABLE ITEMS-ORDERED (ORDER-NO CHAR (10),
    ITEM-ID CHAR (3),
    QTY-ORDERED INTEGER,
    PRIMARY KEY (ORDER-NO, ITEM-ID));

CREATE TABLE CUSTOMERS (CUST-NAME CHAR (20),
    FIRST-ORDER-DATE DATE,
    LIVE-IN-CITY CHAR (15));

22.7.4 Logical Database Design: Sample Table Contents


The sample contents for each of the above relations are shown in Fig. 22.15.
Fig. 22.15 Sample relations and contents

22.7.5 Functional Dependency (FD) Diagram


Fig. 22.16 shows the functional dependency (FD) diagram for customer
order warehouse project.

Fig. 22.16 FD Diagram for customer order warehouse

The mathematical representation of this FD can be given as

ORDER-NO → ORDER-DATE, CUST-NAME
CUST-NAME → FIRST-ORDER-DATE, CITY
CITY → HDQ-ADD, STATE
ITEM-ID → SIZE, WEIGHT, DESCRP
STORE-ID → NO-OF-BINS, PHONE, CITY
ORDER-NO, ITEM-ID → QTY-ORDERED
ITEM-ID, CITY → QTY-IN-CITY
ITEM-ID, STORE-ID → QTY-HELD
22.7.6 Logical Record Structure and Access Path
Fig. 22.17 shows a logical record structure derived from the E-R diagram of
Fig. 22.14 and the access path on the logical record structure for the on-line
access requirements.

Fig. 22.17 Logical record structure and access path

REVIEW QUESTIONS
1. Draw functional dependency (FD) diagram for retail banking case study discussed in Section
22.2.
2. M/s KLY Computer System and Services is in the business of computer assembly and
retailing. It assembles personal computers (PCs) and sells them to its customers. To remain
competitive in the computer segment and provide its customers the best deals, M/s KLY has
decided to implement a computerised manufacturing and sales system. The requirement
definition and analysis is given below:

Requirement Definition and Analysis

M/s KLY computer system and services has the following main processes:

Marketing.
PC assembly.
Finished goods warehouse.
Sales and delivery.
Finance.
Purchase and stores.

The following requirements have been identified after the detailed analysis of the existing
system:

Customer places order for PCs with detailed technical specification in consultation
with the marketing person of M/s KLY.
Customer order is delivered to PC Assembly department. Assembly department
creates the customer invoice with advance details based on customer order together
with explicit unit cost and the total assembly costs.
PC Assembly department requisitions for parts or components from the Purchase
and Store department. After receiving parts, PCs are assembled and moved to the
Finished Goods Warehouse for temporary storage (prior to delivery to the customer)
along with finished goods delivery note.
The Purchase and Stores department buys computer parts in bulk from various
suppliers and stocks them in the stores.
PCs are dispatched to the customer by the Sales and Delivery department along with
a goods delivery challan.
After receiving the delivery challan, customer makes the payment at the Finance
department of M/s KLY.

Figs. 22.18, 22.19 and 22.20 show workflow diagrams of M/s KLY Computer System and
Services for Customer Order, PC Assembly and Delivery and Spare Parts Inventory,
respectively.

a. Identify entities and draw an E-R diagram.


b. Create tables by mapping an E-R diagram to the relational model.
c. Develop sample table contents.
d. Draw functional dependency (FD) diagram.
Fig. 22.18 Workflow diagram for customer order

Fig. 22.19 Workflow diagram for PC assembly and delivery

3. For the private technical training institute case discussed in Section 22.5, develop the following:

a. Develop sample table contents.


b. Draw functional dependency (FD) diagram.
Fig. 22.20 Workflow diagram for spare parts inventory

4. For the Internet book shop case discussed in Section 22.6, develop the following:

a. Develop sample table contents.


b. Draw functional dependency (FD) diagram.
Part-VIII

COMMERCIAL DATABASES
Chapter 23

IBM DB2 Universal Database

23.1 INTRODUCTION

DB2 is a trademark of International Business Machines (IBM) Corporation. The first


DB2 product was released in 1984 on the IBM mainframe platform. It was
followed over time by versions for other platforms. IBM has continually
enhanced the DB2 product in areas such as transaction processing, query
processing and optimisation, parallel processing, active database support and
object-oriented support, by leveraging the innovations from its Research
Division. The DB2 database engine is available in four different code bases
namely OS/390, VM, AS/400 and all other platforms. The common element in
all these code bases is the external interfaces, especially the data definition
language (DDL) and SQL, and basic tools such as administration. However,
differences do exist as a result of the differing development histories of the
code bases.
This chapter provides a practical, hands-on approach to learning how to
use DB2 Universal Database.

23.2 DB2 PRODUCTS

DB2 comes in four editions, namely DB2 Express, DB2 Workgroup
Server Edition, DB2 Enterprise Server Edition and DB2 Personal Edition.
All four editions provide the same full-function database management
system, but they differ from each other in terms of connectivity options,
licensing agreements and additional functions.

There are three main DB2 products, which are as follows:


DB2 Universal Database (UDB): DB2 UDB is designed for use in a variety of purposes and
in a variety of environments.
DB2 Connect: It provides the ability to access a host database with Distributed Relational
Database Architecture (DRDA). There are two versions of DB2 Connect, namely DB2
Connect Personal Edition and DB2 Connect Enterprise Edition.
DB2 Developer’s Edition: It provides the ability to develop and test a database application
for one user. There are two versions of DB2 Developer’s Edition, namely DB2 Personal
Developer’s Edition (PDE) and DB2 Universal Developer’s Edition (UDE).

All DB2 products have a common component called the DB2 Client
Application Enabler (CAE). Once a DB2 application has been developed,
the DB2 Client Application Enabler (CAE) component must be installed on each
workstation executing the application. Fig. 23.1 shows the relationship
between the application, CAE and the DB2 database server. If the
application and database are installed on the same workstation, the
application is known as a local client. If the application is installed on a
workstation other than the DB2 server, the application is known as a remote
client.
The Client Application Enabler (CAE) provides functions other than the
ability to communicate with a DB2 UDB server or DB2 Connect gateway
machine. CAE enables users to perform any of the following tasks:

Fig. 23.1 Remote client accessing DB2 server using CAE

Issue an interactive SQL statement using CAE on a remote client to access data on a remote
UDB server.
Graphically administer and monitor a UDB database server.
Run applications that were developed to comply with the Open Database Connectivity
(ODBC) standard.
Run Java applications that access and manipulate data in DB2 UDB databases using Java
Database Connectivity (JDBC).
There are no licensing requirements to install the Client Application
Enabler (CAE) component. Licensing is controlled at the DB2 UDB server.
The CAE installation depends on the operating system on the client machine.
There is a different CAE for each supported DB2 client operating system.
The supported platforms are OS/2, Windows NT, Windows 95, Window
2000, Window XP, Windows 3.x, AIX, HP-UX and Solaris. The CAE
component should be installed on all end-user workstations.
The DB2 database products are collectively known as the DB2 Family.
The DB2 family is divided into the following two main groups:
DB2 for midrange and large systems. This is supported on platforms such as OS/400,
VSE/VM and OS/390.
DB2 UDB for Intel and UNIX environments. This is supported on platforms such as MVS,
OS/2, Windows NT, Windows 95, Windows 2000, AIX, HP-UX and Sun Solaris.

The midrange and large system members of the DB2 Family are very
similar to DB2 UDB, but their features and implementations sometimes
differ due to operating system differences.
Table 23.1 summarises the DB2 family of products. The DB2 provides
seamless database connectivity using the most popular network
communications protocols, including NetBIOS, TCP/IP, IPX/SPX, Named
Pipes and APPC. DB2 provides the infrastructure within which DB2
database clients and DB2 database servers communicate.

23.2.1 DB2 SQL


DB2 SQL conforms to the ANSI/ISO SQL-92 Entry Level standard,
although IBM has added enhancements. DB2 SQL supports CUBE and
ROLLUP aggregations, full outer joins, CREATE SCHEMA and DROP
SCHEMA. It supports entity integrity (required values) and domain integrity
(each value is a legal value for a column). It supports role-based
authorisation and column level UPDATE and REFERENCES privileges.
DB2 SQL supports triggers, user-defined functions (UDFs), user-defined
types (UDTs) and stored procedures.
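For instance, a ROLLUP aggregation over a hypothetical SALES (REGION, CITY, AMOUNT) table, not a table shipped with DB2, produces per-city totals, per-region subtotals and a grand total in a single query:

-- Illustrative only: ROLLUP aggregation over a hypothetical SALES table
SELECT REGION, CITY, SUM(AMOUNT) AS TOTAL_SALES
FROM   SALES
GROUP BY ROLLUP (REGION, CITY);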
SQL dialects, such as SPL, PL/SQL and T-SQL, are block-structured
languages in their own right and capable of authoring stored procedures. The
model for DB2 SQL is different. DB2 has a rich SQL dialect but it does not
include constructs for procedural programming. To author stored
procedures and user-defined functions, DB2 developers use programming
languages such as C, Java or COBOL. DB2’s CREATE FUNCTION and
CREATE PROCEDURE statements include an EXTERNAL clause for
denoting procedures and UDFs written in external programming languages.
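A minimal sketch of such a registration is shown below. The procedure name, the library and the entry point ('updsal!update_salary') are hypothetical, and the exact set of clauses varies with the DB2 version; the point is simply that the body of the routine lives in externally compiled code rather than in SQL.

-- Illustrative sketch only: an external stored procedure written in C
CREATE PROCEDURE UPDATE_SALARY (IN EMP_NO CHAR(10), IN RATING INTEGER)
    LANGUAGE C
    PARAMETER STYLE GENERAL
    EXTERNAL NAME 'updsal!update_salary';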

Table 23.1 DB2 family of products


Database Servers
• DB2 UDB for UNIX (AIX, HP-UX)
• DB2 UDB for Windows (Windows NT, Windows 2000, Windows XP)
• DB2 UDB for OS/2, Linux
• DB2 UDB for OS/390
• DB2 UDB for AS/400
• DB2 for VM/VSE

Application Development
• VisualAge for Java, Basic, C, C++
• VisualAge Generator
• DB2 Forms for OS/390
• Lotus Approach

Database Management Tools
• DB2 Control Centre
• DB2 Admin Tools for OS/390
• DB2 Buffer Pool Tool
• DB2 Estimator for OS/390
• DB2 Performance Monitor
• DB2 Query Patroller

Mobile Data Access
• DB2 Everyplace
• DB2 Satellite Edition

Database Integration
• DB2 DataJoiner
• DataLinks
• Data Replication Services
• DB2 Connect

Business Intelligence
• DB2 OLAP Server
• DB2 Intelligent Miner
• DB2 Spatial Extender
• DB2 Warehouse Manager
• QMF for OS/390

Content Management
• Content Manager
• Content Manager VideoCharger

E-Business Tools
• DB2 Net Search Extender
• DB2 XML Extender
• Net.Data
• DB2 for WebSphere

Multimedia Delivery
• DB2 Object-Relational Extenders
• Digital Library

23.3 DB2 UNIVERSAL DATABASE (UDB)

DB2 Universal Database (UDB) is an object-oriented relational database


management system (OORDBMS) characterised by multimedia support,
content-based queries and an extensible architecture. It is a Web-enabled
relational database management system that supports data warehousing and
transaction processing. DB2 UDB provides SQL + objects + server
extensions, that is, it “understands” traditional SQL data, complex data and
abstract data types. DB2 UDB family of products consists of database
servers and a suite of related products. DB2 UDB is a powerful relational
database that can help the organisations in running their business
successfully. It can be scaled from hand-held computers to single processors
to clusters of computers and is multimedia-capable with image, audio, video
and text support.
The term “universal” in DB2 UDB refers to the ability to store all kinds of
electronic information. This electronic information includes traditional
relational data, structured and unstructured binary information, documents
and text in many languages, graphics, images, multimedia (audio and video)
and information specific to operations like engineering drawings, maps,
insurance claims forms, numerical control streams or any other type of
electronic information.

23.3.1 Configuration of DB2 Universal Database


DB2 UDB is a versatile family of products that supports many different
configurations and modes of use. DB2 UDB is capable of supporting
hardware platforms from laptops to massively parallel systems with
hundreds of nodes. This provides extensive and granular growth. DB2 UDB
provides both client and software for many kinds of platform. UDB clients
and servers can communicate with each other on local-area networks
(LANs) using various protocols such as APPC, TCP/IP, NetBIOS and
IPX/SPX. In addition, DB2 UDB can participate in heterogeneous networks
that are distributed throughout the world, using a protocol called Distributed
Relational Database Architecture (DRDA).
DRDA consists of two parts, namely an Application Requester (AR)
protocol and an Application Server (AS) protocol. Any client that
implements the AR protocol can connect to any server that implements the
AS protocol. All DB2 products and many other systems as well, implement
the DRDA protocols. For example, a user in Bangalore running UDB on
Windows NT might access a database in Tokyo managed by DB2 for
OS/390.
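For example, once the remote database has been catalogued on the client under an alias (TOKYODB below is a hypothetical name, as are the credentials and the table), the DB2 command line processor can connect to it over DRDA and query it like a local database:

CONNECT TO TOKYODB USER dbuser USING dbpasswd;  -- placeholder alias and credentials
SELECT COUNT(*) FROM ACCOUNTS;                  -- ACCOUNTS is a hypothetical remote table
CONNECT RESET;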
Fig. 23.2 illustrates all of the DB2 Universal Database server products.
DB2 Universal Database (UDB) product family includes four “editions”
(called DB2 database server products) that support increasingly complex
database and user environments. These four different DB2 Database server
products are as follows:
DB2 UDB Personal Edition.
DB2 UDB Workgroup Edition.
DB2 UDB Enterprise Edition.
DB2 UDB Enterprise-Extended Edition.

All the DB2 UDB editions contain the same database management engine,
support the full SQL language and provide graphical user interfaces (GUIs)
for interactive query and database administration.
DB2 Universal Database (UDB) product family also includes two
“developer’s editions” that provide tools for application program
development. These two developer’s editions are as
follows:
DB2 UDB Personal Developer’s Edition.
DB2 UDB Universal Developer’s Edition.

With the exception of the Personal Edition and the Personal Developer’s
Edition, all versions of DB2 UDB are multi-user systems that support remote
clients and include client software called Client Application Enablers
(CAEs) for all supported platforms. The licensing terms of the multi-user
versions of DB2 UDB depend on the number of users and the number of
processors in user’s hardware configuration.

23.3.1.1 DB2 UDB Personal Edition


DB2 UDB Personal Edition allows the users to create and use local
databases and access remote databases if they are available. It provides the
simplest UDB installation. This version of UDB can create, administer and
provide database access for one local user, running one or more applications.
DB2 UDB Personal Edition is available on Windows NT, Windows 95,
Windows 2000, Windows XP, Linux and OS/2 platforms. If access to
databases on host systems is required, DB2 UDB Personal Edition can be
used in conjunction with DB2 Connect Personal Edition.
DB2 UDB Personal Edition provides the same engine functions found in
Workgroup, Enterprise and Enterprise-Extended Editions. However, DB2
UDB Personal Edition cannot accept requests from a remote client. As the
name suggests, DB2 UDB Personal Edition is licensed for one user to create
databases on the workstation in which it was installed. DB2 UDB Personal
Edition can be used as a remote client to a DB2 UDB server where
Workgroup, Enterprise or Enterprise-Extended is installed, since it contains
the Client Application Enabler (CAE) component. Therefore, once the DB2
UDB Personal Edition has been installed, one can use this workstation as a
remote client connecting to a DB2 Server, as well as a DB2 Server managing
local databases.
Fig. 23.2 Configuration of DB2 universal database (UDB)

Fig. 23.3 illustrates configuration of DB2 UDB Personal Edition. The user
can access a local database on their mobile workstation (for example, a
laptop) and access remote databases found on the database server. From the
laptop, the user can make changes to the database throughout the day and
replicate those changes as a remote client to a DB2 UDB remote server. DB2
UDB Personal Edition includes graphical tools that enable users to
administer, tune for performance, access remote DB2 servers, process SQL
queries and manage other servers from a single workstation.
DB2 UDB Personal Edition product may be appropriate for the following
users:
DB2 mobile users who use a local database and can take advantage of the replication feature
in UDB to copy local changes to a remote server.
DB2 end-users requiring access to local and remote databases.

Fig. 23.3 Configuration of DB2 UDB personal edition

23.3.1.2 DB2 UDB Workgroup Edition


DB2 UDB Workgroup Edition is a server that supports both local and remote
users and applications. It contains all DB2 UDB Personal Edition product
functions with the added ability to accept requests from remote clients.
Remote clients can connect to a DB2 UDB Workgroup Edition server, but
DB2 UDB Workgroup Edition does not provide a way for its users to
connect to databases on host systems. DB2 UDB Workgroup Edition can be
installed on a symmetric multiprocessor platform containing up to four
processors. Like DB2 UDB Personal Edition, DB2 UDB Workgroup Edition
is for the Intel platform only. In Fig. 23.3, DB2 Personal Edition is shown as
a mobile user that occasionally connects to local area network (LAN). This
mobile user can access any of the databases on the workstation where DB2
UDB Workgroup Edition is installed.
DB2 UDB Workgroup Edition is designed for use in a LAN environment.
It provides support for both remote and local clients. A workstation with
Workgroup Edition installed can be connected to a network and participate
in a client/server environment. Fig. 23.4 illustrates a possible configuration
of DB2 UDB Workgroup Edition.
As shown in Fig. 23.4, there are local database applications, which can
also be executed by remote clients by performing proper client/server setup.
A DB2 application does not contain any specific information regarding the
physical location of the database. DB2 client applications communicate with
DB2 Workgroup Edition using a client/server-supported protocol with DB2
CAE. Depending on the client and server operating system involved, DB2
Workgroup supports protocols such as TCP/IP, NetBIOS, IPX/SPX,
Named Pipes and APPC.
DB2 Workgroup includes Net.Data for Internet support and graphical
management tools that are found in DB2 UDB Personal Edition. In addition,
the DB2 Client Pack is shipped with DB2 UDB Workgroup Edition. The
DB2 Client Pack contains all of the current DB2 Client Application Enablers
(CAEs). DB2 UDB Workgroup Edition is licensed on a per-user basis. The
base license is for one concurrent or registered DB2 user. It is available for
OS/2, AIX, HP-UX, Linux, Solaris, Windows Server 2003, Windows 2000,
Windows XP and Windows NT platforms. Additional entitlements (user
licenses) are available in multiples of 1, 5, 10 or 50. DB2 Workgroup is
licensed for a machine with one-to-four processors only. Entitlement keys
are required for additional users. The DB2 UDB Workgroup Edition is most
suitable for smaller departmental applications or for applications that do not
need access to remote databases on iSeries or zSeries.
Fig. 23.4 Configuration of DB2 UDB workgroup edition

Another edition, called DB2 Express Edition, has been introduced, which
has the same full-function DB2 database as DB2 Workgroup Edition with
additional new features. The new features make it easy to transparently
install within an application. DB2 Express Edition is available for the Linux
and Windows platforms (Windows NT, Windows 2000 and Windows XP).

23.3.1.3 DB2 UDB Enterprise Edition


DB2 UDB Enterprise Edition provides local and remote users with access to
local and remote databases. DB2 UDB Enterprise Edition includes all the
features provided in the DB2 UDB Workgroup Edition, plus supports for
host database connectivity, providing users with access to DB2 databases
residing on iSeries or zSeries platforms. Thus, it supports more users than
Workgroup Edition. It can be installed on symmetric multiprocessor
platforms with more than four processors. It implements both the application
requester (AR) and application server (AS) protocols and can participate in
distributed relational database architecture (DRDA) networks with full
generality. Fig. 23.5 illustrates configuration of DB2 UDB Enterprise
Edition.
DB2 UDB Enterprise Edition allows access to DB2 databases residing on
host systems such as DB2 for OS/390 Version 5.1, DB2 for OS/400 and DB2
for VSE/VM Version 5.1. The licensing for DB2 UDB Enterprise Edition is
based on number of users, number of machines installed and processor type.
The base license is for one concurrent or registered DB2 user. Additional
entitlements are available for 1, 5, 10 or 50 users. The base installation of
DB2 UDB Enterprise Edition is on a uni-processor. Tier upgrades are
available if the machine has more than one processor. The licensing for the
gateway capability for access to a host database is for 30 concurrent or
registered DB2 users. DB2 UDB Enterprise Edition is available on OS/2,
Windows NT, Windows 2000, Windows XP and the UNIX platforms namely
AIX, HP-UX, Linux, and Solaris.
Fig. 23.5 Configuration of DB2 UDB enterprise edition

The popularity of the Internet and the World Wide Web (WWW) has
created a demand for web access to enterprise data. The DB2 UDB server
product includes all supported DB2 Net.Data products. Applications that are
build with Net.Data may be stored on a web server and can be viewed from
any web browser because they are written in hypertext markup language
(HTML). While viewing these documents, users can either select automated
queries or define new ones that retrieve specific information directly from a
DB2 UDB database. The ability to connect to a host database (DB2
Connect) is built into Enterprise Edition.
23.3.1.4 DB2 UDB Enterprise-Extended Edition
As discussed earlier, all the DB2 UDB editions can take advantage of
parallel processing when installed on a symmetric multiprocessor platform.
DB2 UDB Enterprise-Extended Edition introduces a new dimension of
parallelism that can be scaled to a very large capacity and very high
performance. It contains all the features and functions of DB2 UDB
Enterprise Edition. It also provides the ability for an Enterprise-Extended
Edition (EEE) database to be partitioned across multiple independent
machines (computers) of the same platform that are connected by a network or
a high-speed switch. Additional machines can be added to an EEE system as
application requirements grow. The individual machines participating in an
EEE installation may be either uni-processors or symmetric multiprocessors.
To the end-user or application developer, the EEE database appears to be
on a single machine. While DB2 UDB Workgroup and DB2 UDB
Enterprise Edition can handle large databases, the Enterprise-Extended
Edition (EEE) is designed for applications where the database is simply too
large for a single machine to handle efficiently. SQL operations can operate
in parallel on the individual database partitions, thus speeding up the
execution of a single query.
DB2 UDB Enterprise-Extended Edition licensing is similar to that of DB2
Enterprise Edition. However, the licensing is based on the number of
registered or concurrent users, the type of processor and the number of
database partitions. The base license for DB2 UDB Enterprise-Extended
Edition is for machines ranging from a uni-processor up to a 4-way SMP.
The base number of users is different in Enterprise-Extended Edition than in
Enterprise Edition. The base user license is for one user with an additional
50 users, equalling 51 users for that database partition. The total number of
users per database partition also depends on the total number of database
partitions. For example, in a system configuration of four nodes, each node
or database partition could support 51 × 4 or 204 users. Tier upgrades also
are available. The first tier upgrade for a database partition provides the
rights to a 50 user entitlement pack for that database partition node.
Additional user entitlements are available for 1, 5, 10 or 50 users. DB2 UDB
Enterprise-Extended Edition is available on the AIX platform.

23.3.1.5 DB2 UDB Personal Developer’s Edition


DB2 UDB Personal Developer’s Edition includes all the tools needed to
develop application programs for DB2 UDB Personal Edition, including host
language pre-compilers, header files and sample application code. It includes
DB2 Universal Database Personal Edition, DB2 Connect Personal Edition
and DB2 Software Developer’s Kits for Windows platforms. It is available
for Windows NT, Windows 95, Windows 2000, Windows XP and OS/2
platforms.
The DB2 Personal Developer’s Edition allows a developer to design and
build single-user desktop applications. It provides all the tools needed to
create multimedia database applications that can run on Linux and Windows
platforms and can connect to any DB2 server. The kit includes Windows and
Linux versions of DB2 Personal Edition plus all the DB2 Extenders.

23.3.1.6 DB2 UDB Universal Developer’s Edition


DB2 UDB Universal Developer’s Edition includes all the tools needed to
develop application programs for all DB2 UDB servers, including Software
Developer’s Kits (SDKs) for Windows NT, Windows 9x, Sun Solaris, OS/2
and UNIX platforms (such as AIX, HP-UX), the DB2 Extenders, Warehouse
Manager and Intelligent Miner, along with the application development tools
for all supported platforms.
This kit gives all the tools needed to create multimedia database
applications that can run on any of the DB2 client or server platforms and
can connect to any DB2 server. It includes the complementary products
namely DB2 Everyplace Software Development Kit, DB2 Client Packs,
Lotus Approach, Domino Go Web Server, VisualAge for Basic, VisualAge for Java,
WebSphere Application Server, WebSphere Studio Site Developer
Advanced, WebSphere MQ and QMF for Windows. It also includes
DataPropagator (replication), DB2 Connect (host connectivity) and Net.Data
(Web server connectivity).
The application development environment provided with both DB2 UDB
Personal Developer’s Edition and DB2 UDB Universal Developer’s Edition
allows application developers to write programs using the following methods:
Embedded SQL.
Call Level Interface (CLI) (compatible with the Microsoft ODBC standard).
DB2 Application Programming Interfaces (APIs).
DB2 data access through the World Wide Web.

The programming environment also includes the necessary programming
libraries, header files, code samples and pre-compilers for the supported
programming languages. Several programming languages, including
COBOL, FORTRAN, REXX, C and C++, Basic and Java, are supported by
DB2.
Fig. 23.6 shows the contents of both the DB2 UDB Personal Developer’s
Edition and DB2 UDB Universal Developer’s Edition. UDB Server and
Connect products are part of the Developer’s Edition. The DB2 UDB
Personal Developer’s Edition contains DB2 UDB Personal Edition and DB2
Connect Personal Edition. This allows a single application developer to
develop and test a database application. DB2 Personal Developer’s Edition
is a single-user product available for OS/2, Windows 3.1, Windows NT,
Windows 2000 and Windows XP.
Fig. 23.6 DB2 UDB personal developer’s and DB2 UDB universal developer’s edition

DB2 UDB Universal Developer’s Edition is supported on all platforms
that support the DB2 Universal Database server product, except for the
Enterprise-Extended Edition product or partitioned database environment.
DB2 UDB Universal Developer’s Edition is intended for application
development and testing only. The database server can be on a platform that
is different from the platform on which the application is developed. It
contains the DB2 UDB Personal Edition, Workgroup and Enterprise Editions
of the database server product. Also, DB2 Connect Personal and Enterprise
Edition are found in the Universal Developer’s Edition (UDE) product. The
UDE is licensed for one user. Additional entitlements are available for 1, 5 or
10 concurrent or registered DB2 users.
As shown in Fig. 23.6, both DB2 UDB Personal Developer’s Edition and
DB2 UDB Universal Developer’s Edition contain the following:
Software Developer’s Kit (SDK): SDK provides the environment and tools to the user to
develop applications that access DB2 databases using embedded SQL or DB2 Call Level
Interface (CLI). This is found in both the Personal Developer’s Edition (PDE) and the Universal
Developer’s Edition (UDE). However, the SDK in the PDE is for OS/2, Windows 3.x (16-bit and
32-bit), Windows 95 and Windows NT. The SDK in the UDE is for all platforms.
Extender Support: It provides the ability to define large object data types and includes
related functions that allow the user applications to access and retrieve documents,
photographs, music, movie clips or fingerprints.
VisualAge for Basic: It is a suite of application development tools built around an advanced
implementation of the Basic programming language, to create GUI clients, DB2 stored
procedures and DB2 user-defined functions (UDFs). This is found in both PDE and UDE.
VisualAge for Java: It is a suite of application development tools for building Java-
compatible applications, applets and JavaBean components that run on any Java
Development Kit (JDK) enabled browser. It contains Enterprise Access Builder for building
Java Database Connectivity (JDBC) interfaces to data managed by DB2. It can also be used
to create DB2 stored procedures and DB2 user-defined functions (UDFs).
Net.Data: It is a comprehensive World Wide Web (WWW) development tool kit to create
dynamic web pages or complex web-based applications that can access DB2 databases.
These applications take the form of “web macros” that dynamically generate data for display
on a web page. Net.Data is used in conjunction with a web server that handles requests from
web browsers for display of web pages. When a requested page contains a web macro, the web
server calls Net.Data to expand the macro into dynamic content for the page. The
definition of a macro may include SQL statements that are submitted to a UDB server for
processing. For example, a web page might contain a form that users can fill in to request
information from a database and the requested information might be retrieved by Net.Data
and converted into HTML for display by the web server. Net.Data itself is a common
gateway interface (CGI) program that can be installed in the cgi-bin directory of the web
server. Net.Data is included with all versions of UDB except the Personal Edition.
Lotus Approach: Lotus Approach provides an easy-to-use interface for interfacing with
UDB and other relational databases. It provides a graphical interface to perform queries,
develop reports and analyse data. Using Lotus Approach, one can design data views in many
different formats, including forms, reports, worksheets and charts. While defining the view,
one can specify how the data in the view corresponds to underlying data in a UDB database.
By interacting with the view, users can query and manipulate the underlying data. Lotus
Approach allows users to perform searches, joins, grouping operations and database updates
without having any knowledge of SQL. It can also format data attractively for printing or for
display on a web page. One can also develop applications using LotusScript, a full-featured,
object-oriented programming language. Lotus Approach runs in the Windows environment
and is included with all versions of UDB (both the PDE and UDE products).
Domino Go Web Server: It is a scalable, high-performance web server that runs on a broad
range of platforms. It offers the latest in web security and supports key Internet standards.
This is found only in the UDE product.
Java Database Connectivity (JDBC): JDBC might be thought of as a CLI interface for the
Java programming language. Its functionality is equivalent to that of CLI, but in keeping with
its host language, it uses a more object-oriented programming style. For example, a Java
object called a ResultSet, which supports various methods for describing and fetching its
values, represents the result of a query. JDBC also allows the development of applets, which can be
downloaded and run by any Java-enabled web browser, thus making UDB data accessible to
web-based clients throughout the world. JDBC support is found in the PDE and UDE versions
of the Developer’s Edition (a short sketch follows this list).
Open Database Connectivity (ODBC): ODBC support is found in PDE and UDE versions of
the Developer’s Edition.
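
The following minimal JDBC sketch illustrates the style of access described in the JDBC item above. It is an illustration only: the driver class name is an assumption that varies with the DB2 client release installed, and SAMPLE and EMPLOYEE are a hypothetical database alias and table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Db2JdbcDemo {
    public static void main(String[] args) throws Exception {
        // Load a DB2 JDBC driver; the class name is an assumption and
        // depends on the DB2 client installed.
        Class.forName("COM.ibm.db2.jdbc.app.DB2Driver");

        // SAMPLE is a hypothetical catalogued database alias.
        Connection con = DriverManager.getConnection("jdbc:db2:SAMPLE");
        Statement stmt = con.createStatement();

        // EMPLOYEE is a hypothetical table used only for illustration.
        ResultSet rs = stmt.executeQuery("SELECT empno, lastname FROM employee");

        // The ResultSet object represents the query result and supports
        // methods for describing and fetching its values row by row.
        while (rs.next()) {
            System.out.println(rs.getString(1) + "  " + rs.getString(2));
        }

        rs.close();
        stmt.close();
        con.close();
    }
}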

23.3.2 Other DB2 UDB Related Products


There are other IBM software products that are complementary with or
closely related to DB2 UDB. Some of such products are summarised below.

23.3.2.1 DB2 Family of Database Managers


As discussed earlier, UDB is the version of DB2 that is designed for
personal computer and workstation platforms. In addition to UDB, the DB2
family of database managers includes DB2 for OS/390, DB2 for OS/400 and
DB2 for VSE and VM. All these products have compatibility with each
other and with industry standards.

23.3.2.2 DB2 Connect


DB2 Connect, formerly known as DDCS, is a communication product that
enables its users to connect to any database server that implements the
Distributed Relational Database Architecture (DRDA) protocol, including all
servers in the DB2 product family. The target database server for a DB2
Connect installation is known as a DRDA Application Server. All the
functionality of DB2 Connect is included in UDB Enterprise Edition. In
addition, DB2 Connect is available separately in the following two versions:
DB2 Connect Personal Edition.
DB2 Connect Enterprise Edition.

The most commonly accessed DRDA application server is DB2 for
OS/390. DB2 Connect supports the APPC communication protocol to
provide communications support between DRDA Application Servers and
DRDA Application Requesters. Also, DB2 for OS/390 Version 5.1 supports
TCP/IP in a DRDA environment. Any of the supported network protocols
can be used for the DB2 client (CAE) to establish a connection to the DB2
Connect gateway. The database application must request the data from a
DRDA Application Server through a DRDA Application Requester.
The DB2 Connect product provides DRDA Application Requester
functionality. The DRDA Application Server accessed using DB2 Connect
could be any of the DB2 Servers, namely DB2 for OS/390, DB2 for OS/ 400
or DB2 server for VSE/VM.
DB2 Connect enables applications to create, update, control and manage
DB2 databases and host systems using SQL, DB2 Administrative APIs,
ODBC, JDBC, SQLJ or DB2 CLI. In addition, DB2 Connect supports
Microsoft Windows data interfaces such as ActiveX Data Objects (ADO),
Remote Data Objects (RDO) and Object Linking and Embedding (OLE)
DB.

23.3.2.3 DB2 Connect Personal Edition


DB2 Connect Personal Edition provides access to remote databases for a
single workstation. It provides access from a desktop computer to DB2
databases residing on iSeries and zSeries host systems. Fig. 23.7 illustrates
DRDA Flow in DB2 Connect Personal Edition.
Fig. 23.8 shows an example of the DB2 Connect Personal Edition. DB2
Connect Personal Edition is available for the Linux and Windows platforms
such as Windows NT, Windows 2000 and Windows XP.

23.3.2.4 DB2 Connect Enterprise Edition


DB2 Connect Enterprise Edition provides access from network clients to
DB2 databases residing on iSeries and zSeries host systems. It can support a
cluster of client machines on a local network, collecting their requests and
forwarding them to a remote DRDA server for processing. A DB2 Connect
gateway routes each database request from the DB2 clients to the
appropriate DRDA Application Server database. Fig. 23.9 illustrates DRDA
Flow in DB2 Connect Enterprise Edition.
Fig. 23.7 DRDA flow in DB2 connect personal edition

Fig. 23.8 A sample configuration of DB2 Connect Personal Edition setup


Fig. 23.9 DRDA Flow in DB2 Connect Enterprise Edition

The DB2 Connect Enterprise Edition allows multiple clients to connect to
host data and can significantly reduce the effort required to establish and
maintain access to enterprise data. Fig. 23.10 shows an example of clients
connecting to host and iSeries databases through DB2 Connect Enterprise
Edition.
The licensing for DB2 Connect Enterprise is user-based. That is, it is
licensed on the number of concurrent or registered users. The base license is
for one user with additional entitlements of 1, 5, 10, or 50 users. DB2
Connect Enterprise Edition is available for OS/2, AIX, HP-UX, Solaris,
Linux, and Windows platforms such as Windows NT, Windows 95, Windows
2000 and Windows XP.

23.3.2.5 DB2 Extenders


DB2 Extenders are a vehicle for extending DB2 with new types and with
functions to support operations using those types. Each extender contains
sets of pre-defined datatypes and functions and exposes its own APIs for
working with rich data types such as text, image, audio and video data.
Each extender delivers new functionality in the form of user-defined
functions (UDFs), user-defined types (UDTs) and C-callable functions. The
Text Extender and the IAV Extenders (Image, Audio and Video) are not installed
automatically when the DB2 installation process is run. Although they are
included with DB2 UDB, they must be installed explicitly.

23.3.2.6 Text Extenders


The Text Extender provides searches on text databases. It provides the
ability to scan the articles in a database, compute statistics and create an
index that permits fast text searches. The Text Extender operates with files
or DB2 databases and permits queries against databases with documents up
to several gigabytes (GB) in length. It supports several indexing schemes
and keyword, conceptual, wildcard and proximity searches.
The Text Extender adds functions to DB2’s SQL grammar and exposes a
C-API for searching and browsing. Programs written in languages that
support C-style bindings can call the eight search and browse functions
exported from a dynamic link library named desclapi.dll. Some of the text
processing functions require an ODBC connection handle, but ODBC/CLI
connections can be used even when the text functions are called from an
embedded SQL program. The Text Extender UDFs take a handle as a
parameter, including external file handles. Handles contain document IDs,
information about the text index and information about the document.
The Text Extender provides linguistic, precise, dual and ngram indexes.
Preparing a linguistic index involves analysing a document’s text and
reducing words to their base form, indexing, for example, “machinery” as
“machine”. A linguistic index enables users to search for synonyms and word
variants. A precise index permits retrieval based on the exact form of a word:
if the index word is “Machinery”, a document containing “machinery” or
“machine” will not match. A dual index combines the capabilities of the
precise and linguistic indexes; it contains the precise and normalised forms
of each index term. An ngram index is useful for fuzzy searches and DBCS
characters because it parses text into sets of characters.
Fig. 23.10 A sample configuration of DB2 Connect Enterprise Edition Setup

The Text Extender provides user-defined functions (UDFs) that can be
used in SQL statements to perform text searches. The CONTAINS UDF
enables users to specify search arguments, whereas the REFINE function
enables them to refine a search. The CONTAINS, RANK,
NO_OF_MATCHES and SEARCH_RESULT functions are used to perform
text searches. These functions search the text index for instances of the
argument passed to the UDF.
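
As a rough sketch of how these UDFs appear in a query, consider the following; ARTICLES and its handle column DOC_HANDLE are hypothetical, the column is assumed to have been enabled and indexed with the Text Extender, and the exact search-argument syntax depends on the Text Extender release.

-- Return articles whose indexed text matches the search argument.
SELECT title, author
FROM   articles
WHERE  CONTAINS(doc_handle, '"database recovery"') = 1;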
23.3.2.7 Image, Audio, and Video (IAV) Extenders
The IAV Extenders provide the ability to use images, audio, and video data
in user’s applications. The Image Extender provides searching, updating, and
display of images based on format, width, and length. It supports more than
a dozen image formats, including BMP, JPEG and animated GIFs.
The Audio Extender provides searching, updating and playing of audio
clips based on sample rate, length and the number of channels. It supports
six audio formats, including MIDI and Waveform audio (WAV).
The Video Extender provides searching, updating and playing video clips
based on the number of frames, compression methods, frame rate and length.
The Video Extender supports AVI, MPEG1, MPEG2 and Quicktime Video.
The Image, Audio and Video Extenders also support the importing and
exporting of their respective data types. To program with the IAV Extenders,
we use an API of about 90 C functions. The IAV Extenders also provide
40 user-defined functions (UDFs) and distinct data types such as
DB2IMAGE (image handle), DB2AUDIO (audio handle) and DB2VIDEO
(video handle). The UDFs return information such as aspect ratios,
compression types, track names, frame rates, sampling rates and the score of
an image.

23.3.2.8 DB2 DataJoiner


DB2 DataJoiner is a version of DB2 Version 2 for Common Servers that
enables its users to interact with data from multiple heterogeneous sources,
providing an image of a single relational database. In addition to managing
its own data, DB2 DataJoiner allows users to connect to databases managed
by other DB2 systems, other relational systems (such as Oracle, Sybase,
Microsoft SQL Server and Informix) and non-relational systems (such as
IMS and VSAM). DataJoiner masks the differences among these various
data sources, presenting to the client the functionality of DB2 Version 2 for
Common Servers. All the data in the heterogeneous network appears to the
client in the form of tables in a single relational database.
DB2 DataJoiner includes an optimiser for cross-platform queries that
enables an SQL query, for example, to join a table stored in a DB2 database
in Bangalore with a table stored in an Oracle database in Mumbai. Data
manipulation statements (such as SELECT, INSERT, DELETE and
UPDATE) are independent of the location of the stored data but data
definition statements (such as CREATE TABLE) are less well standardised
and must be written in the native language of the system on which the data is
stored.
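
The following query sketch illustrates the kind of location-transparent data manipulation described above. The table and column names are hypothetical; in a DataJoiner installation they would be mappings of a table held in the DB2 database in Bangalore (ORDERS) and a table held in the Oracle database in Mumbai (CUSTOMERS), both of which appear to the client as ordinary local tables.

-- Join data from two heterogeneous sources as if they were local tables.
SELECT c.cust_name, o.order_no, o.order_value
FROM   orders o, customers c
WHERE  o.cust_id = c.cust_id
AND    o.order_value > 100000;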

23.3.2.9 DB2 OLAP Server


Essbase is an online analytical processing (OLAP) product produced by
Arbor Software Inc., which provides operations such as CUBE and
ROLLUP for multi-dimensional analysis of large data archives. DB2 OLAP
Server is a joint product of IBM and Arbor, which provides Essbase
functionality for data stored in UDB and other DB2 databases.

23.3.2.10 Visual Warehouse


Visual Warehouse is a data warehouse product that provides a facility for
periodically extracting data from operational databases for archiving and
analysis. It includes a catalogue facility for storing metadata about the
information assets of an enterprise.

23.3.2.11 Intelligent Miner


Intelligent Miner is a set of applications that can search large volumes of
data for unexpected patterns, such as correlations among purchases of
various products. Intelligent Miner can be used with databases managed by
UDB and other members of the DB2 family.

23.3.3 Major Components of DB2 Universal Database


The major components of the DB2 Universal Database system include the
following:
Administrator’s Tools.
Control Centre.
SmartGuides.
Command Centre.
Command Line Processor.
Database Engine.

23.3.3.1 Administrator’s Tools


The Administrator’s Tools folder contains a collection of graphical tools that
help manage and administer databases, and are integrated into the DB2
environment. Fig. 23.11 shows an example of DB2 Desktop Folder for
Windows NT.

Fig. 23.11 DB2 Desktop Folder for Windows NT

Fig. 23.12 shows the icons contained in the Administrator’s Tools folder. As
shown, the DB2 Administrator’s Tools Folder consists of the following
components:
Alert Centre.
Control Centre.
Event Analyser.
Journal.
Script Centre.
Tools Setting.

Fig. 23.12 Contents of administrator’s tools folder

23.3.3.2 Control Centre


The Control Centre is the central point of administration for DB2 Universal
Database. It provides the user with the tools necessary to perform typical
database administration tasks. The Control Centre provides a graphical
interface to administrative tasks such as recovering a database, defining
directories, configuring the system, managing media and more. It allows
easy access to all server administration tools, gives a clear overview of the
entire system, enables remote database management and provides step-by-
step assistance for complex tasks.
Fig. 23.13 shows an example of the information available from the
Control Centre. The Systems object represents both local and remote
machines. The object tree is expanded to display all the DB2 systems that
the system has catalogued by clicking on the plus (+) sign on Systems. As
shown in Fig. 23.13, the main components of the Control Centre are as
follows:
Menu Bar: to access Control Centre functions and online help.
Tool Bar: to access the other administration tools.
Object Pane: containing all the objects that can be managed from the Control Centre as
well as their relationship to each other.
Contents Pane: containing the objects that belong or correspond to the object selected on
the Objects Pane.
Contents Pane Toolbar: used to tailor the view of the objects and information in the
Contents Pane.

23.3.3.3 SmartGuides
SmartGuides are tutors that guide a user in creating objects and other
database operations. Each operation has detailed information available to
help the user. The DB2 SmartGuides are integrated into the administration
tools and assist us in completing administration tasks. As shown in Fig.
23.11, Client Configuration Assistant (CCA) tool of DB2 Desktop Folder is
used to set up communication on a remote client to the database server.
Fig. 23.13 Control centre

Fig. 23.14 shows several ways of adding a remote database. Users do not
have to know the syntax of commands, or even the location of the remote
database server. One option searches the network, looking for valid DB2
UDB servers for remote access.
Fig. 23.14 Client configuration assistant (CCA)-Add Database SmartGuide

Another SmartGuide, known as the Performance Configuration
SmartGuide, assists the user in database tuning. Fig. 23.15 shows an example
of the Performance Configuration SmartGuide.
Fig. 23.15 Performance configuration SmartGuide

By extracting information from the system and asking questions about the
database workload, the Performance Configuration SmartGuide tool will run
a series of calculations designed to determine an appropriate set of values for
the database and database manager configuration variables. One can choose
whether to apply the changes immediately, or to save them in a file that can
be executed at a later time.

23.3.3.4 Command Centre


The Command Centre provides a graphical interface to the Command Line
Processor (CLP), enabling access to data through the use of database
commands and interactive SQL.

23.3.3.5 Command Line Processor (CLP)


The Command Line Processor (CLP) is a component common to all DB2
products. It is a text-based application commonly used to execute SQL
statements and DB2 commands. For example, one can create a database,
catalog a database and issue dynamic SQL statements.
Fig. 23.16 shows a command and its output as executed from the
Command Line Processor (CLP). The Command Line Processor can be used
to issue interactive SQL statements or DB2 commands. The statements and
commands can be placed in a file and executed in a batch environment or
they can be entered from an interactive mode. The DB2 Command Line
Processor is provided with all DB2 Universal Database, Connect and
Developer’s products and Client Application Enablers. All SQL statements
issued from the Command Line Processor are dynamically prepared and
executed on the database server. The output, or result, of the SQL query is
displayed on the screen by default.

Fig. 23.16 Command Line Processor (CLP)
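
As a small illustration, the commands below form a CLP command file (a sketch only; the database name and query are illustrative) that could be executed in a DB2 command window with db2 -tvf, or typed one by one at the interactive CLP prompt. Lines beginning with -- are comments.

-- Create a database and connect to it.
CREATE DATABASE sample;
CONNECT TO sample;

-- Issue a dynamic SQL statement; the result is displayed on the screen.
SELECT CURRENT DATE FROM sysibm.sysdummy1;

-- Release the connection and end the CLP session.
CONNECT RESET;
TERMINATE;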

23.3.3.6 Database Engine


The database engine provides the base functions of the DB2 relational
database management system. Its functions are as follows:
Manages the data.
Controls all access to it.
Generates packages.
Generates optimised paths.
Provides transaction management.
Ensures data integrity and data protection.
Provides concurrency control.

All data access takes place through the SQL interface. The basic elements
of a database engine are database objects, system catalogs, directories and
configuration files.
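
Because all data access goes through the SQL interface, even the system catalog can be examined with an ordinary query. The following sketch lists the user tables recorded in the catalog; the SYSCAT.TABLES view and its columns are those of DB2 UDB, and TYPE = 'T' selects ordinary tables.

-- List user tables recorded in the system catalog.
SELECT tabschema, tabname, create_time
FROM   syscat.tables
WHERE  type = 'T'
ORDER BY tabschema, tabname;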

23.3.4 Features of DB2 Universal Database (UDB)


Following are the features of DB2 UDB:
DB2 UDB scales from a single-user database on a personal computer to terabyte databases
on large multi-user platforms.
It is capable of supporting hardware platforms from laptops to massively parallel systems
with hundreds of nodes.
The scalability of DB2 UDB allows it to meet the performance requirements of diverse
applications and to adapt easily to changing requirements.
It provides seamless database connectivity using the most popular network communications
protocols, including NetBIOS, TCP/IP, IPX/SPX, Named Pipes and APPC.
DB2 UDB is part of a broad IBM software product line that supports network computing and
distributed computing.
It supports IBM’s middleware products, transaction processing servers, e-commerce suites,
operating systems, symmetric multiprocessor (SMP) hardware, Web server software and
developer tools for major languages such as C++, Java and Basic.
DB2 UDB runs on a large variety of hardware and software environments, including AIX,
Solaris, SCO OpenServer, HP-UX, SINIX, Windows NT, OS/2, Windows 95, Windows 2000,
Windows XP and Macintosh systems.
DB2 is interfaced with message queue managers (MQSeries), transaction monitors (CICS,
Encina, IBM Transaction Server) or Web services (Net.Data). IBM Net.Data includes a
server and tools for building Web applications that operate with DB2 databases.
DB2 UDB scales from single processors to symmetric multiprocessors (SMPs).
DB2 UDB Extended Edition provides multiprocessing support, including symmetric
multiprocessor (SMP) clusters and massively parallel processor (MPP) architectures that use
as many as 512 processors.
DB2 is extensible using plug-ins written in Java and other programming languages.
DB2 includes Extenders that support multimedia (images, audio, video and text data).
The open-ended architecture of DB2 permits developers to add extensions for supporting
spatial data, time series, biometric data and other rich data types.
DB2 supports user-defined types (UDTs), known as distinct types (a short example appears after this list).
DB2’s multithreaded architecture supports parallel operations, content-based queries and
very large databases.
To provide scalability and high availability, DB2 UDB Enterprise Edition can operate on
symmetrical multiprocessor (SMP), massively parallel processor (MPP), clustered or shared-
nothing systems.
DB2 UDB provides multimedia extensions and supports SQL3 features such as containers.
DB2 UDB provides large object (LOB) locators and is capable of storing large objects (as
large as 2 gigabytes) in a database.
DB2 includes HTML and PDF documentation and a hypertext search engine.
DB2 provides wizard-like SmartGuides to assist in tasks such as creating databases.
DB2 databases contain a variety of system-supplied and user-defined objects, such as schemas,
nodegroups, tablespaces, tables, indexes, views, packages, aliases, constraints, stored
procedures, triggers, user-defined functions (UDFs) and user-defined types (UDTs).
DB2 supports forward-scrolling, engine-based cursors and distributed transactions using two-
phase commit.
DB2 UDB offers a compelling environment for database developers with an architecture that
is highly extensible and programmable in a variety of languages.
DB2 supports asynchronous I/O, parallel queries, parallel data loading and parallel backup
and restore.
DB2 supports database partitioning so that partitions, or nodes, include indexes, transaction
logs, configuration files and data.
Instances of DB2 database manager can operate across multiple nodes (partitions) and
support parallel operations.
DB2 SMP configuration provides intra-partition parallelism to permit parallel queries and
parallel index generation.
DB2 supports aggregation across partitioned and un-partitioned tables and also provides fault
tolerance with disk monitoring and automatic device failover.
DB2 supports multiple modes of failover support, including hot standby and concurrent
access to multiple partitions.
DB2 maintains partition definitions, known as nodegroups, in db2nodes.cfg. A nodegroup
can contain one or more database partitions and DB2 associates a partitioning map with each
nodegroup. Nodegroups are helpful in segregating data, such as putting decision support
tables in one nodegroup and transaction processing tables in another.
DB2 provides system-managed and database-managed tablespaces.
The DB2 Call Level Interface (CLI) conforms to important industry standards, including
Open Database Connectivity (ODBC) and ISO Database Language SQL (SQL-92).
DB2 UDB supports a rich set of interfaces for different kinds of users and applications. It
provides easy-to-use graphical interfaces for interactive users and for database administrators.
DB2 supports DB2 CLI (ODBC), JDBC, embedded SQL, and DRDA clients. Developers
writing DB2 clients can program in a variety of languages including C/C++, Java, COBOL,
FORTRAN, PL/I, REXX and BASIC.
DB2 UDB supports static interfaces in which SQL statements are preoptimized for high
performance and dynamic interfaces in which SQL statements are generated by running
applications.
The DB2 Client Configuration Assistant (CCA) helps in defining information about the
databases to which clients connect.
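
As a small illustration of the distinct types mentioned in the list above, the following DDL sketch defines a money-like type and uses it as a column type; the names RUPEES and SALARY_HISTORY are hypothetical.

-- Define a distinct type based on DECIMAL; WITH COMPARISONS asks DB2 to
-- generate comparison operators for the new type.
CREATE DISTINCT TYPE rupees AS DECIMAL(11,2) WITH COMPARISONS;

-- The distinct type can then be used like a built-in type.
CREATE TABLE salary_history (
    empno    CHAR(6) NOT NULL,
    eff_date DATE    NOT NULL,
    salary   rupees
);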

23.4 INSTALLATION PREREQUISITE FOR DB2 UNIVERSAL DATABASE SERVER

As discussed in the previous sections, the DB2 UDB server runs on many
different operating systems. However, in this section we will discuss
the installation of the DB2 server on the Windows platform.

23.4.1 Installation Prerequisite: DB2 UDB Personal Edition (Windows)

23.4.1.1 Disk Requirements


The disk space required for DB2 UDB Personal Edition depends on the type
of installation you choose and the type of disk drive you are installing on.
You may require significantly more space on partitions with large cluster
sizes. When you install DB2 Personal Edition using the DB2 Setup wizard,
size estimates are dynamically provided by the installation program based
on installation type and component selection. The DB2 Setup wizard provides the
following installation types:
Typical installation: DB2 Personal Edition is installed with most features and functionality,
using a typical configuration. Typical installation includes graphical tools such as the Control
Centre and Configuration Assistant. You can also choose to install a typical set of data
warehousing features.
Compact installation: Only the basic DB2 Personal Edition features and functions are
installed. Compact installation does not include graphical tools or federated access to IBM
data sources.
Custom installation: A custom installation allows you to select the features you want to
install. The DB2 Setup wizard will provide a disk space estimate for the installation options
you select. Remember to include disk space allowance for required software, communication
products and documentation. In DB2 version 8, HTML and PDF documentation is provided
on separate CD-ROMs.

23.4.1.2 Handling Insufficient Space


If the space required to install selected components exceeds the space found
in the path you specify for installing the components, the setup program
issues a warning about the insufficient space. You can continue with the
installation. However, if the space for the files being installed is in fact
insufficient, the DB2 installation will stop when there is no more space and
will roll back without manual intervention.

23.4.1.3 Memory Requirements


Table 23.2 below provides recommended memory requirements for DB2
Personal Edition installed with and without graphical tools. There are a
number of graphical tools you can install including the Control Centre,
Configuration Assistant and Data Warehouse Centre.

Table 23.2 Memory requirements for DB2 Personal Edition

Type of installation Recommended memory (RAM)


DB2 Personal Edition without graphical tools 64 MB

DB2 Personal Edition with graphical tools 128 MB

When determining memory requirements, be aware of the following:


The memory requirements documented above do not account for non-DB2 software that may
be running on your system.
Memory requirements may be affected by the size and complexity of your database system.

23.4.1.4 Operating System Requirements


To install DB2 Personal Edition, one of the following operating system
requirements must be met:
Windows ME.
Windows NT Version 4 with Service Pack 6a or higher.
Windows 2000.
Windows XP (32-bit or 64-bit).
Windows Server 2003 (32-bit or 64-bit).
Windows XP (64-bit) and Windows Server 2003 (64-bit) support: local 32-bit applications,
32-bit UDFs and stored procedures.

23.4.1.5 Hardware Requirements


For DB2 products running on Intel and AMD systems, a Pentium or Athlon
CPU is required.
23.4.1.6 Software Requirements
MDAC 2.7 is required. The DB2 Setup wizard will install MDAC 2.7 if it is not already
installed.
For 32-bit environments you will need a Java Runtime Environment (JRE) Version 1.3.1 to
run DB2’s Java-based tools, such as the Control Centre and to create and run Java
applications, including stored procedures and user-defined functions. During the installation
process, if the correct level of the JRE is not already installed, it will be installed.
For 64-bit environments you will need a Java Runtime Environment (JRE) Version 1.4 to run
DB2’s Java-based tools, such as the Control Center and to create and run Java applications,
including stored procedures and user-defined functions. During the installation process, if the
correct level of the JRE is not already installed, it will be installed. A browser is required to
view online help.

23.4.1.7 Communication Requirements


To connect to a remote database, you can use TCP/IP, NETBIOS and
NPIPE. To remotely administer a version 8 DB2 database, you must connect
using TCP/IP.
If you plan to use LDAP (Lightweight Directory Access Protocol), you
require either a Microsoft LDAP client or an IBM SecureWay LDAP client
V3.1.1.
Connections from 64-bit clients to downlevel 32-bit servers are not
supported.
Connections from downlevel 32-bit clients to 64-bit servers only support
SQL requests.
DB2 Version 8 Windows 64-bit servers support connections from DB2
Version 6 and Version 7 32-bit clients only for SQL requests. Connections
from Version 7 64-bit clients are not supported.
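
For example, a TCP/IP connection to a remote DB2 server is normally set up by cataloguing the server node and the remote database from the CLP, as in the sketch below; the node name, host name, port number, database name and credentials are placeholders.

-- Catalogue the remote server node (50000 is the usual default DB2 port).
CATALOG TCPIP NODE mynode REMOTE db2host.example.com SERVER 50000;

-- Catalogue the remote database at that node.
CATALOG DATABASE payroll AS payroll AT NODE mynode;

-- Connect to the remote database.
CONNECT TO payroll USER dbuser USING dbpasswd;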

23.4.2 Installation Prerequisite: DB2 Workgroup Server Edition and Non-partitioned DB2 Enterprise Server Edition (Windows)

23.4.2.1 Disk Requirements


The disk space required for DB2 Enterprise Server Edition (ESE) or
Workgroup Server Edition (WSE) depends on the type of installation you
choose and the type of disk drive. You may require significantly more space
on FAT drives with large cluster sizes. When you install DB2 Enterprise
Server Edition using the DB2 Setup wizard, size estimates are dynamically
provided by the installation program based on installation type and component
selection. The DB2 Setup wizard provides the following installation types:
Typical installation: DB2 is installed with most features and functionality, using a typical
configuration. Typical installation includes graphical tools such as the Control Centre and
Configuration Assistant. You can also choose to install a typical set of data warehousing or
satellite features.
Compact installation: Only the basic DB2 features and functions are installed. Compact
installation does not include graphical tools or federated access to IBM data sources.
Custom installation: A custom installation allows you to select the features you want to
install.

The DB2 Setup wizard will provide a disk space estimate for the
installation options you select. Remember to include disk space allowance
for required software, communication products, and documentation. In DB2
Version 8, HTML and PDF documentation is provided on separate CD-
ROMs.

23.4.2.2 Handling Insufficient Space


If the space required to install the selected components exceeds the space found
in the path you specify for installing the components, the setup program
issues a warning about the insufficient space. You can continue with the
installation. However, if the space for the files being installed is in fact
insufficient, the DB2 installation will stop when there is no more space. At
this time, you will have to manually stop the setup program if you cannot
free up space.

23.4.2.3 Memory Requirements


At a minimum, DB2 requires 256 MB of RAM. Additional memory may be
required. When determining memory requirements, be aware of the
following:
Additional memory may be required for non-DB2 software that may be running on your
system.
Additional memory is required to support database clients.
Specific performance requirements may determine the amount of memory needed.
Memory requirements will be affected by the size and complexity of your database system.
Memory requirements will be affected by the extent of database activity and the number of
clients accessing your system.

23.4.2.4 Operating System Requirement


To install DB2, the following operating system requirements must be met:

DB2 Workgroup Server Edition runs on:


Windows NT Version 4 with Service Pack 6a or higher.
Windows 2000. Service Pack 2 is required for Windows Terminal Server.
Windows XP (32-bit).
Windows Server 2003 (32-bit).

DB2 Enterprise Server Edition runs on:


Windows NT Version 4 with Service Pack 6a or higher.
Windows 2000. Service Pack 2 is required for Windows Terminal Server.
Windows Server 2003 (32-bit and 64-bit).

Windows 2000 SP3 and Windows XP SP1 are required for running DB2
applications in either of the following environments:
Applications that have COM+ objects using ODBC; or
Applications that use OLE DB Provider for ODBC with OLE DB resource pooling disabled.

If you are not sure about whether your application environment qualifies,
then it is recommended that you install the appropriate Windows service
level. The Windows 2000 SP3 and Windows XP SP1 are not required for the
DB2 server itself or any applications that are shipped as part of DB2
products.

23.4.2.5 Hardware Requirements


For 32-bit DB2 products, a Pentium or Pentium compatible CPU is required.
For 64-bit DB2 products, an Itanium or Itanium compatible CPU is required.

23.4.2.6 Software Requirements


For 32-bit environments you will need a Java Runtime Environment (JRE) Version 1.3.1 to
run DB2’s Java-based tools, such as the Control Centre and to create and run Java
applications, including stored procedures and user-defined functions. During the installation
process, if the correct level of the JRE is not already installed, it will be installed.
For 64-bit environments you will need a Java Runtime Environment (JRE) Version 1.4 to run
DB2’s Java-based tools, such as the Control Centre and to create and run Java applications,
including stored procedures and user-defined functions. During the installation process, if the
correct level of the JRE is not already installed, it will be installed.
A browser is required to view online help.
Windows 2000 SP3 and Windows XP SP1 are required for running DB2 applications in
either of the following environments:

Applications that have COM+ objects using ODBC; or


Applications that use OLE DB Provider for ODBC with OLE DB resource pooling
disabled.

If you are not sure about whether your application environment qualifies,
then it is recommended that you install the appropriate Windows service
level. The Windows 2000 SP3 and Windows XP SP1 are not required for
DB2 server itself or applications that are shipped as part of DB2 products.

23.4.2.7 Communication Requirement


You can use APPC, TCP/IP, MPTN (APPC over TCP/IP), Named Pipes and NetBIOS. To
remotely administer a Version 8 DB2 database, you must connect using TCP/IP. DB2 Version
8 servers, using the DB2 Connect server support feature, support only outbound client APPC
requests; there is no support for inbound client APPC requests.
For TCP/IP, Named Pipes and NetBIOS connectivity, no additional software is required.
For APPC (CPI-C) connectivity, through the DB2 Connect server support feature, one of the
following communication products is required as shown in Table 23.3:

Table 23.3 Supported SNA (APPC) products


Windows NT: IBM Communications Server Version 6.1.1 or later; IBM Personal Communications for Windows Version 5.0 with CSD 3; Microsoft SNA Server Version 3 Service Pack 3 or later.
Windows 2000: IBM Communications Server Version 6.1.1 or later; IBM Personal Communications for Windows Version 5.0 with CSD 3; Microsoft SNA Server Version 4 Service Pack 3 or later.
Windows XP: IBM Personal Communications for Windows Version 5.5 with APAR IC23490.
Windows Server 2003: Not supported.

If you plan to use LDAP (Lightweight Directory Access Protocol), you
require either a Microsoft LDAP client or an IBM SecureWay LDAP client
V3.1.1.

Windows (64-bit) considerations:


Local 32-bit applications are supported.
32-bit UDFs and stored procedures are supported.
SQL requests from remote 32-bit downlevel clients are supported.
DB2 Version 8 Windows 64-bit servers support connections from DB2 Version 6 and Version
7 32-bit clients only for SQL requests. Connections from Version 7 64-bit clients are not
supported.

Windows 2000 Terminal Server installation limitation:


You cannot install DB2 Version 8 from a network mapped drive using a remote session on
Windows 2000 Terminal Server edition. The available workaround is to use Universal Naming
Convention (UNC) paths to launch the installation, or run the install from the console session.
For example, if the directory c:\pathA\pathB\…\pathN on serverA is shared as serverdir, you
can open \\serverA\serverdir\filename.ext to access the file c:\pathA\pathB\…pathN\filename.ext
on serverA.

23.4.3 Installation Prerequisite: Partitioned DB2 Enterprise Server Edition (Windows)
23.4.3.1 Disk Requirements
The disk space required for a DB2 Enterprise Server Edition (ESE) depends
on the type of installation you choose and the type of disk drive. You may
require significantly more space on FAT drives with large cluster sizes.
When you install DB2 Enterprise Server Edition using the DB2 Setup
wizard, size estimates are dynamically provided by the installation program
based on installation type and component selection. The DB2 Setup wizard
provides the following installation types:
Typical installation: DB2 ESE is installed with most features and functionality, using a
typical configuration. Typical installation includes graphical tools such as the Control Centre
and Configuration Assistant. You can also choose to install a typical set of data warehousing
features.
Compact installation: Only the basic DB2 features and functions are installed. Compact
installation does not include graphical tools or federated access to IBM data sources.
Custom installation: A custom installation allows you to select the features you want to
install.

The DB2 Setup wizard will provide a disk space estimate for the
installation options you select. Remember to include disk space allowance
for required software, communication products, and documentation. In DB2
Version 8, HTML and PDF documentation is provided on separate CD-
ROMs.

23.4.3.2 Memory Requirements


At a minimum, DB2 requires 256 MB of RAM. Additional memory may be
required. In a partitioned database environment, the amount of memory
required for each database partition server depends heavily on your
configuration. When determining memory requirements, be aware of the
following:
Additional memory may be required for non-DB2 software that may be running on your
system.
Additional memory is required to support database clients.
Specific performance requirements may determine the amount of memory needed.
Memory requirements will be affected by the size and complexity of your database system.
Memory requirements will be affected by the extent of database activity and the number of
clients accessing your system.
Memory requirements in a partitioned environment may be affected by system design.
Demand for memory on one computer may be greater than the demand on another.

23.4.3.3 Operating System Requirements


DB2 Enterprise Server Edition runs on:
Windows NT Version 4 with Service Pack 6a or higher.
Windows 2000. Service Pack 2 is required for Windows Terminal Server.
Windows Server 2003 (32-bit and 64-bit).
Windows 2000 SP3 and Windows XP SP1 are required for running DB2 applications in
either of the following environments:

Applications that have COM+ objects using ODBC; or


Applications that use OLE DB Provider for ODBC with OLE DB resource pooling
disabled.

If you are unsure about whether your application environment qualifies,
then it is recommended that you install the appropriate Windows service
level. The Windows 2000 SP3 and Windows XP SP1 are not required for the
DB2 server itself or any applications that are shipped as part of DB2
products.

23.4.3.4 Hardware Requirements


For 32-bit DB2 products, a Pentium or Pentium compatible CPU is required.
For 64-bit DB2 products, an Itanium or Itanium compatible CPU is required.

23.4.3.5 Software Requirements


For 32-bit environments you will need a Java Runtime Environment (JRE) Version 1.3.1 to
run DB2’s Java-based tools, such as the Control Centre and to create and run Java
applications, including stored procedures and user-defined functions. During the installation
process, if the correct level of the JRE is not already installed, it will be installed.
For 64-bit environments you will need a Java Runtime Environment (JRE) Version 1.4 to run
DB2’s Java-based tools, such as the Control Centre and to create and run Java applications,
including stored procedures and user-defined functions. During the installation process, if the
correct level of the JRE is not already installed, it will be installed.
DB2 ESE provides support for host connections.
A browser is required to view online help.
Windows 2000 SP3 and Windows XP SP1 are required for running DB2 applications in
either of the following environments:

Applications that have COM+ objects using ODBC; or


Applications that use OLE DB Provider for ODBC with OLE DB resource pooling
disabled.

If you are not sure about whether your application environment qualifies,
then it is recommended that you install the appropriate Windows service
level. The Windows 2000 SP3 and Windows XP SP1 are not required for
the DB2 server itself or any applications that are shipped as part of DB2 products.

23.4.3.6 Communication Requirements


You can use TCP/IP, Named Pipes, NetBIOS and MPTN (APPC over TCP/IP). To remotely
administer a Version 8 DB2 database, you must connect using TCP/IP. DB2 Version 8
servers, using the DB2 Connect server support feature, support only outbound client APPC
requests; there is no support for inbound client APPC requests.
For TCP/IP, Named Pipes and NetBIOS connectivity, no additional software is required.
For APPC (CPI-C) connectivity, through the DB2 Connect server support feature, one of the
following communication products is required:

Table 23.4 Supported SNA (APPC) products

Windows NT: IBM Communications Server Version 6.1.1 or later; IBM Personal Communications for Windows Version 5.0 with CSD 3; Microsoft SNA Server Version 3 Service Pack 3 or later.
Windows 2000: IBM Communications Server Version 6.1.1 or later; IBM Personal Communications for Windows Version 5.0 with CSD 3; Microsoft SNA Server Version 4 Service Pack 3 or later.
Windows Server 2003: Not supported.

If you plan to use LDAP (Lightweight Directory Access Protocol), you
require either a Microsoft LDAP client or an IBM SecureWay LDAP client
V3.1.1.

Windows (64-bit) considerations:


Local 32-bit applications are supported.
32-bit UDFs and stored procedures are supported.
SQL requests from remote 32-bit downlevel clients are supported.
DB2 Version 8 Windows 64-bit servers support connections from DB2 Version 6 and Version
7 32-bit clients only for SQL requests. Connections from Version 7 64-bit clients are not
supported.

DB2 Administration Server (DAS) requirements:


A DAS must be created on each physical machine for the Control Center and the Task Center
to work properly.

Windows 2000 Terminal Server installation limitation:

You cannot install DB2 Version 8 from a network mapped drive using a
remote session on Windows 2000 Terminal Server edition. The available
workaround is to use Universal Naming Convention (UNC) paths to launch
the installation, or run the install from the console session.
For example, if the directory c:\pathA\pathB\...\pathN on serverA is
shared as serverdir, you can open \\serverA\serverdir\filename.ext to access
the file c:\pathA\pathB\...pathN\filename.ext on serverA.

23.4.4 Installation Prerequisite: DB2 Connect Personal Edition (Windows)

23.4.4.1 Disk Requirements


The disk space required for DB2 Connect Personal Edition depends
on the type of installation you choose and the type of disk drive. You may
require significantly more space on FAT drives with large cluster sizes.
When you install DB2 Enterprise Server Edition using the DB2 Setup
wizard, size estimates are dynamically provided by the installation program
based on installation type and component selection. The DB2 Setup wizard
provides the following installation types:
Typical installation: DB2 Connect Personal Edition is installed with most features and functionality, using a
typical configuration. Typical installation includes graphical tools such as the Control Centre
and Configuration Assistant. You can also choose to install a typical set of data warehousing
features.
Compact installation: Only the basic DB2 features and functions are installed. Compact
installation does not include graphical tools or federated access to IBM data sources.
Custom installation: A custom installation allows you to select the features you want to
install.

Remember to include disk space allowance for required software,
communication products, and documentation. In DB2 Version 8, HTML and
PDF documentation is provided on separate CD-ROMs.

23.4.4.2 Memory Requirements


The amount of memory required to run DB2 Connect Personal Edition
depends on the components you install. Table 23.5 below provides
recommended memory requirements for DB2 Personal Edition installed with
and without graphical tools such as the Control Centre and Configuration
Assistant.

Table 23.5 DB2 Connect Personal Edition for Windows Memory requirements

Type of installation Recommended memory (RAM)


DB2 Personal Edition without graphical tools 64 MB
DB2 Personal Edition with graphical tools 128 MB

When determining memory requirements, be aware of the following:


These memory requirements do not account for non-DB2 software that may be running on
your system.
The actual amount of memory needed may be affected by specific performance requirements.

23.4.4.3 Operating System Requirements


To install a DB2 Connect Personal Edition, one of the following operating
system requirements must be met:
Windows ME.
Windows NT Version 4 with Service Pack 6a or higher.
Windows 2000.
Windows XP (32-bit and 64-bit).
Windows Server 2003 (32-bit and 64-bit).
23.4.4.4 Software Requirements
For 32-bit environments you will need a Java Runtime Environment (JRE) Version 1.3.1 to
run DB2’s Java-based tools, such as the Control Centre and to create and run Java
applications, including stored procedures and user-defined functions. During the installation
process, if the correct level of the JRE is not already installed, it will be installed.
For 64-bit environments you will need a Java Runtime Environment (JRE) Version 1.4 to run
DB2’s Java-based tools, such as the Control Centre and to create and run Java applications,
including stored procedures and user-defined functions. During the installation process, if the
correct level of the JRE is not already installed, it will be installed.

23.4.4.5 Communication Requirements


You can use APPC, TCP/IP and MPTN (APPC over TCP/IP).
For SNA (APPC) connectivity, one of the following communication products is required:

With Windows ME:


IBM Personal Communications Version 5.0 (CSD 3) or later.

With Windows NT:


IBM Communications Server Version 6.1.1 or later.
IBM Personal Communications Version 5.0 (CSD 3) or later.

With Windows 2000:


IBM Communications Server Version 6.1.1 or later.
IBM Personal Communications Version 5.0 (CSD 3) or later.

With Windows XP:


IBM Personal Communications Version 5.5 (APAR IC23490).
Microsoft SNA Server Version 3 Service Pack 3 or later.

You should consider switching to TCP/IP as SNA may no longer be
supported in future releases of DB2 Connect. SNA requires significant
configuration knowledge and the configuration process itself can prove to be
error prone. TCP/IP is simple to configure, has lower maintenance costs, and
provides superior performance. SNA is not supported on Windows XP (64-
bit) and Windows Server 2003 (64-bit).

23.4.5 Installation Prerequisite: DB2 Connect Enterprise Edition (Windows)

23.4.5.1 Disk Requirements


The disk space required for DB2 Connect Enterprise Edition depends on the
type of installation you choose and the type of disk drive you are installing
on. You may require significantly more space on FAT drives with large
cluster sizes. When you install DB2 Connect Enterprise Edition using the
DB2 Setup wizard, size estimates are dynamically provided by the
installation program based on installation type and component selection. The
DB2 Setup wizard provides the following installation types:
Typical installation: DB2 Connect Enterprise Edition is installed with most features and
functionality, using a typical configuration. This installation includes graphical tools such as
the Control Center and Configuration Assistant.
Compact installation: Only the basic DB2 Connect Enterprise Edition features and
functions are installed. This installation does not include graphical tools or federated access
to IBM data sources.
Custom installation: A custom installation allows you to select the features you want to
install.

The DB2 Setup wizard will provide a disk space estimate for the
installation options you select. Remember to include a disk space allowance
for required software, communication products, and documentation. In DB2
Version 8, HTML and PDF documentation is provided on separate CD-
ROMs.

23.4.5.2 Memory Requirements


The amount of memory required to run DB2 Connect Enterprise Edition
depends on the components you install. Table 23.6 below provides
recommended memory requirements for DB2 Connect Enterprise Edition
installed with and without graphical tools such as the Control Centre and
Configuration Assistant.

Table 23.6 DB2 Connect Enterprise Edition memory requirements

Type of installation                                      Recommended memory (RAM)
DB2 Connect Enterprise Edition without graphical tools    64 MB
DB2 Connect Enterprise Edition with graphical tools       128 MB
When determining memory requirements, be aware of the following:
These memory requirements are for a base of 5 concurrent client connections. You will need
an additional 16 MB of RAM per 5 client connections.
The memory requirements documented above do not account for non-DB2 software that may
be running on your system.
Specific performance requirements may determine the amount of memory needed.

23.4.5.3 Hardware Requirements


For DB2 products running on Intel and AMD systems, a Pentium or Athlon CPU is required.
For 64-bit DB2 products, an Itanium or Itanium compatible CPU is required.

23.4.5.4 Operating System Requirements


To install a DB2 Connect Enterprise Edition, one of the following operating
system requirements must be met:
Windows NT Version 4 with Service Pack 6a or higher.
Windows 2000.
Windows XP.
Windows Server 2003 (32-bit and 64-bit).
Windows 2000 SP3 and Windows XP SP1 are required for running DB2 applications in
either of the following environments:

Applications that have COM+ objects using ODBC; or


Applications that use OLE DB Provider for ODBC with OLE DB resource pooling
disabled.

If you are unsure about whether your application environment qualifies, then it is recommended that you install the appropriate Windows service level. Windows 2000 SP3 and Windows XP SP1 are not required for the DB2 server itself or any applications that are shipped as part of DB2 products.

23.4.5.5 Software Requirements


For 32-bit environments you will need a Java Runtime Environment (JRE) Version 1.3.1 to
run DB2’s Java-based tools, such as the Control Centre and to create and run Java
applications, including stored procedures and user-defined functions. During the installation
process, if the correct level of the JRE is not already installed, it will be installed.
For 64-bit environments you will need a Java Runtime Environment (JRE) Version 1.4 to run
DB2’s Java-based tools, such as the Control Centre and to create and run Java applications,
including stored procedures and user-defined functions. During the installation process, if the
correct level of the JRE is not already installed, it will be installed.

23.4.5.6 Communication Requirements


You can use APPC, TCP/IP and MPTN (APPC over TCP/IP).
For SNA (APPC) connectivity, one of the following communications products is required:

You should consider switching to TCP/IP as SNA may no longer be supported in future releases of DB2 Connect. SNA requires significant configuration knowledge and the configuration process itself can prove to be error prone. TCP/IP is simple to configure, has lower maintenance costs, and provides superior performance.
With Windows NT:

IBM Communications Server Version 6.1.1 or later.


IBM Personal Communications Version 5 CSD3 or later.

With Windows 2000:

IBM Communications Server Version 6.1.1 or later.


IBM Personal Communications Version 5 CSD3 or later.
Microsoft SNA Server Version 3 Service Pack 3 or later.

Windows Server 2003 64-bit does not support SNA.

23.4.6 Installation Prerequisite: DB2 Query Patroller Server (Windows)

23.4.6.1 Hardware Requirements


For 32-bit Query Patroller servers: a Pentium or Pentium compatible
processor is required.

23.4.6.2 Operating System Requirements


Windows NT Version 4 with Service Pack 6a or higher.
Windows 2000. Service Pack 2 is required for Windows Terminal Server.
Windows Server 2003 (32-bit).

23.4.6.3 Software Requirements


DB2 Enterprise Server Edition (Version 8.1.2 or later) must be installed in order to install the
Query Patroller server component.
You need a Java Runtime Environment (JRE) Version 1.3.1 to run Query Patroller server, the
Query Patroller Java-based tools (such as the Query Patroller Centre) and to create and run
Java applications, including stored procedures and user-defined functions.
Netscape 6.2 or Microsoft Internet Explorer 5.5 is required to view the online installation
help.

23.4.6.4 Communication Requirements


TCP/IP.

23.4.6.5 Memory Requirements


At a minimum, Query Patroller server requires 256 MB of RAM. Additional
memory may be required. When determining memory requirements,
remember:
Additional memory may be required for non-DB2 software that is running on your system.
Additional memory is required to support DB2 clients that have the Query Patroller client
tools installed on them.
Specific performance requirements may determine the amount of memory needed.
Memory requirements are affected by the size and complexity of your database system.

23.4.6.6 Disk Requirements


The disk space required for Query Patroller server depends on the type of
installation you choose and the type of disk drive upon which Query
Patroller is installed. When you install Query Patroller server using the DB2
Setup wizard, size estimates are dynamically provided by the installation
program based on installation type and component selection. Disk space is
needed for the following:
To store the product code.
To store data that will be generated when using Query Patroller (for example, the contents of
the control tables).
Remember to include disk space allowance for required software, communication products
and documentation.

23.4.6.7 Insufficient Disk Space Management


If the space required to install the selected components exceeds the space
found in the path you specify for installing the components, the DB2 Setup
wizard issues a warning about the insufficient space. If you choose, you can
continue the installation. However, if the space for the files being installed
is, in fact, insufficient, then the Query Patroller server installation stops
when there is no more space. When this occurs, the installation is rolled
back. You will then see a final dialogue with the appropriate error messages.
You can then exit the installation.

23.4.7 Installation Prerequisite: DB2 Cube Views (Windows)

23.4.7.1 Disk Requirements


The disk space required for DB2 Cube Views depends on the type of
installation you choose. When you install the product using the DB2 Setup
wizard, the installation program dynamically provides size estimates based
on the type of installation and components you select. The DB2 Setup
wizard provides the following installation types:
Typical installation: This installs all components and documentation.
Compact installation: This is identical to a typical installation.
Custom installation: You can select the features you want to install.

The DB2 Setup wizard provides a disk space estimate for the installation
option you select. Remember to include disk space allowance for required
software, communication products and documentation. For DB2 Cube Views
Version 8.1, the HTML documentation is installed with the product and the
PDF documentation is on the product CD-ROM.

23.4.7.2 Memory Requirements


The memory you allocate for your edition of DB2 Universal Database is
enough for DB2 Cube Views. Memory use by the API (Multidimensional
Services) stored procedure depends upon the following factors:
The sizes catalogued for the parameters of the stored procedure.
The amount of metadata being processed at any given time.
The API output parameters require a minimum of 2 MB when two output
parameters are catalogued with their default sizes of 1 MB each. The memory
required for the API depends on the size of the input CLOB structure, the
type of metadata operation being performed and the amount of data returned
from the stored procedure. As you develop applications using the API, you
might have to take one or more of the following actions:
Recatalogue the stored procedure with different sizes for the parameters.
Modify the DB2 query heap size.
Modify the DB2 application heap size.

Additionally, when running an application that deals with large CLOB


structures, you might have to increase the stack or heap size of the
application. For example, you can use the /STACK and /HEAP linker
options with the Microsoft Visual C++ linker.

23.4.7.3 Operating System Requirements


DB2 Cube Views runs on the following levels of Windows:

Server component:
Microsoft Windows NT 4 32-bit.
Windows 2000 32-bit.

Client component:
Microsoft Windows NT 4 32-bit.
Windows 2000 32-bit.
Windows XP 32-bit.

23.4.7.4 Software Requirements


DB2 Universal Database Version 8.1.2 or later.
Optional: Office Connect Analytics Edition 4.0. To use Office Connect Analytics Edition,
you need Microsoft Excel 2000 (Office 2000 with service pack 1 or later) or Microsoft Excel
XP (Office XP with service pack 1 or later). Office Connect also requires Internet Explorer
5.5 with service pack 1 or later.

23.4.7.5 Communication Requirements


DB2 Cube Views is a feature of DB2 Universal Database and supports the
same communication protocols that DB2 supports.

23.5 INSTALLATION PREREQUISITE FOR DB2 CLIENTS

23.5.1 Installation Prerequisite: DB2 Clients (Windows)

23.5.1.1 Disk Requirements


The actual fixed disk requirements of your installation may vary depending
on your file system and the client components you install. Ensure that you
have included a disk space allowance for your application development tools
and communication products. When you install a DB2 client using the DB2
Setup wizard, size estimates are dynamically provided by the installation
program based on installation type and component selection.

23.5.1.2 Memory Requirements


The following list outlines the recommended minimum memory
requirements for the different types of DB2 clients:
The amount of memory required for the DB2 Run-Time client depends on the operating
system and applications that you are running. In most cases, it should be sufficient to use the
minimum memory requirements of the operating system as the minimum requirement for
running the DB2 Run-Time client.
To run the graphical tools on an Administration or Application Development client, you will
require a minimum of 256 MB of RAM. The amount of memory required for these clients
depends on the operating system and applications that you are running.

Performance may be affected if less than the recommended minimum memory requirements are used.

23.5.1.3 Operating System Requirements


One of the following:
Windows 98.
Windows ME.
Windows NT Version 4.0 with Service Pack 6a or later.
Windows NT Server 4.0, Terminal Server Edition (only supports the DB2 Run-Time Client)
with Service Pack 6 or later for Terminal Server.
Windows 2000.
Windows XP (32-bit and 64-bit editions).
Windows Server 2003 (32-bit and 64-bit editions).

23.5.1.4 Software Requirements


For 32-bit environments you will need a Java Runtime Environment (JRE) Version 1.3.1 to
run DB2’s Java-based tools, such as the Control Centre and to create and run Java
applications, including stored procedures and user-defined functions. During the installation
process, if the correct level of the JRE is not already installed, it will be installed.
For 64-bit environments you will need a Java Runtime Environment (JRE) Version 1.4 to run
DB2’s Java-based tools, such as the Control Centre and to create and run Java applications,
including stored procedures and user-defined functions. During the installation process, if the
correct level of the JRE is not already installed, it will be installed.
If you are installing the Application Development Client, you may require the Java
Developer’s Kit (JDK). During the installation process, if the JDK is not already installed, it
will be installed.
The DB2 Java GUI tools are not provided with the DB2 Version 8 Run-Time Client.
If you plan to use LDAP (Lightweight Directory Access Protocol), you require either a
Microsoft LDAP client or an IBM SecureWay LDAP client V3.1.1 or later. Microsoft LDAP
client is included with the operating system for Windows ME, Windows 2000, Windows XP
and Windows Server 2003.
If you plan to use the Tivoli Storage Manager facilities for backup and restore of your
databases, you require the Tivoli Storage Manager Client Version 3 or later.
If you have the IBM Antivirus program installed on your operating system, it must be
disabled or uninstalled to complete a DB2 installation.
If you are installing the Application Development Client, you must have a C compiler to
build SQL Stored Procedures.

23.5.1.5 Communication Requirements


Named Pipes, NetBIOS or TCP/IP.
The Windows base operating system provides Named Pipes, NetBIOS and TCP/IP
connectivity.

In Version 8, DB2 only supports TCP/IP for remotely administering a database.
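Because TCP/IP is the protocol DB2 Version 8 relies on for remote administration, it can be useful to confirm that plain TCP/IP connectivity to the server is available from the client machine. The Java sketch below is purely illustrative: the host name is a placeholder and 50000 is only a commonly used default instance port, not necessarily the one configured on your server.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class ProbeDb2Port {
        public static void main(String[] args) {
            String host = "db2server.example.com";   // placeholder host name
            int port = 50000;                        // commonly used default; adjust to your instance
            try {
                Socket s = new Socket();
                s.connect(new InetSocketAddress(host, port), 5000);   // 5-second timeout
                System.out.println("TCP/IP connection to " + host + ":" + port + " succeeded.");
                s.close();
            } catch (IOException e) {
                System.out.println("Could not reach " + host + ":" + port + " - " + e.getMessage());
            }
        }
    }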

23.5.2 Installation Prerequisite: DB2 Query Patroller Clients (Windows)

23.5.2.1 Hardware Requirements


For 32-bit DB2 clients with the Query Patroller client tools installed: a
Pentium or Pentium compatible processor.

23.5.2.2 Operating System Requirements


One of the following:
Windows 98.
Windows ME.
Windows NT Version 4.0 with Service Pack 6a or later.
Windows NT Server 4.0, Terminal Server Edition (only supports the DB2 Run-Time Client)
with Service Pack 6 or later for Terminal Server.
Windows 2000.
Windows XP (32-bit).
Windows Server 2003 (32-bit).

23.5.2.3 Software Requirements


A DB2 product (Version 8.1.2 or later) must be installed on the computer
that you will install the Query Patroller client tools on. The following
products are appropriate prerequisites:
Any DB2 client product (for example, DB2 Run-Time client or DB2 Application
Development client).
Any DB2 Connect product (for example, DB2 Connect Personal Edition or DB2 Connect
Enterprise Server Edition).
Any DB2 server product (for example, DB2 Enterprise Server Edition or DB2 Workgroup
Server Edition).
You need a Java Runtime Environment (JRE) Version 1.3.1 to run the Query Patroller Java-
based tools, such as the Query Patroller Center and to create and run Java applications,
including stored procedures and user-defined functions.
Netscape 6.2 or Microsoft Internet Explorer 5.5 is required to view the online installation
help.

23.5.2.4 Communication Requirements


TCP/IP.

23.5.2.5 Memory Requirements


The following list outlines the recommended minimum memory
requirements for running the Query Patroller client tools on a DB2 client
(Windows):
Running the Query Patroller client tools on a system administration workstation requires an additional 64 MB of RAM beyond the amount of RAM required to run your Windows operating system. For example, to run Query Patroller Centre on a system administration workstation running Windows 2000 Professional, you need a minimum of 64 MB of RAM for the operating system plus an additional 64 MB of RAM for the tools.
The memory required to run the Query Patroller client tools on a DB2 client that submits queries to the Query Patroller server depends on the Windows operating system you are using and the database applications you are running. It should be sufficient to use the minimum memory requirements of the Windows operating system as the minimum requirements for running these tools on a DB2 client.

Performance may be affected if less than the recommended minimum memory requirements are used.

23.5.2.6 Disk Requirements


The actual fixed disk requirements of your installation may vary depending
on your file system and the Query Patroller client tools you install. Ensure
that you have included a disk space allowance for your application
development tools (if necessary), and communication products.
When you install the Query Patroller client tools using the DB2 Setup
wizard, size estimates are dynamically provided by the installation program
based on installation type and component selection.

23.6 INSTALLATION AND CONFIGURATION OF DB2 UNIVERSAL DATABASE SERVER

Before installing a DB2 product, ensure that you meet the prerequisite requirements for hardware and software components such as disk, memory, communication and operating system, as discussed in the previous sections.

23.6.1 Performing Installation Operation for IBM DB2 Universal Database Version 8.1

Follow the steps below to install IBM DB2 Universal Database Server Version 8.1:
Step 1: Log on to your computer and shut down any other programs so
that the DB2 Setup Wizard can update files as required.
Step 2: Insert the DB2 UDB Server installation CD-ROM into CD-
drive. The autorun feature automatically starts the DB2 Setup
Wizard. The DB2 Setup Wizard determines the system language
and launches the DB2 Setup Wizard for that language.
Step 3: In the Welcome to DB2 dialogue box as shown in Fig. 23.17, you
can choose to see the installation prerequisites, the release
notes, and an interactive presentation of the product or you can
launch the DB2 Setup Wizard to install the product.

Fig. 23.17 The Welcome to DB2 dialogue box

Click on “Install Products” to open the “Select the Product You Would Like to Install” dialogue as shown in Fig. 23.18.
Step 4: Choose a DB2 product depending on the type of license you
have purchased.
Choose DB2 Universal Database Enterprise Server Edition if
you want the DB2 server plus the capability of having your
clients access enterprise servers such as DB2 for z/OS.
Choose DB2 Universal Workgroup Server Edition if you want
the DB2 server.
Step 5: Click Next. “Welcome to the DB2 Setup Wizard” dialogue box
appears as shown in Fig. 23.19.
Step 6: Click Next to continue. The “License Agreement” dialogue box
appears as shown in Fig. 23.20. Read the agreement carefully,
and if you agree, Click I accept the Terms in the License
Agreement to continue with the install operation.
Step 7: Click Next to continue. “Select the Installation Type” dialogue
box appears. Select the type you prefer by clicking the
appropriate button as shown in Fig. 23.21. An estimate of disk
space requirement is shown for each option.
Step 8: Click Next to continue. “Select Installation Folder” dialogue
box appears as shown in Fig. 23.22. Select a directory and a
drive where DB2 is to be installed. The amount of disk space
required appears on the dialogue box. Click the Disk Space
button to help you select a directory with enough available disk
space. Click the Change button if you need to change the
current destination folder.
Fig. 23.18 The “Product Selection” dialogue box

Fig. 23.19 The “Welcome to the DB2 Setup Wizard” dialogue box
Fig. 23.20 The “License Agreement” dialogue box
Fig. 23.21 The “Select the Installation Type” dialogue box
Fig. 23.22 The “Select Installation Folder” dialogue box

Step 9: Click Next to continue. “Set user Information for the DB2
Administration Server” appears on the dialogue box as shown in
Fig. 23.23. Enter the user name and password you would like to
use for the DB2 Administration Server.
Fig. 23.23 The “Set user information for the DB2 Administration Server” dialogue box

Step 10: Click Next to continue. “Set up the administration contact list” appears on the dialogue box as shown in Fig. 23.24. In
this box you can indicate where a list of administrator
contacts is to be located. This list will consist of the people
who should be notified if the database requires attention.
Choose “Local” if you want the list to be created on your
computer or “Remote” if you plan to use a global list for
your organisation. For example here, choose “Local”. This
dialogue box also allows you to enable notification to an
SMTP server that will send e-mail and pager notifications to
people on the list. For the purpose of an example here,
SMTP server has not been enabled.
Fig. 23.24 The “Set up the administration contact list” dialogue box

Step 11: Click Next to continue. “Configure DB2 Instances” appears on the dialogue box as shown in Fig. 23.25. Choose to create the default DB2 instance. The DB2 instance is typically used to store application data. You can also modify the protocol and startup settings for the DB2 instances.
Step 12: Click Next to continue. “Prepare the DB2 Tool Catalogue”
appears on the dialogue box as shown in Fig. 23.26. Here
you can select to prepare the tools catalogue to enable tools
such as the Task Centre and Scheduler. Select Prepare the
DB2 Tools Catalogue in a Local Database.
Step 13: Click Next to continue. “Specify a local database to store
the DB2 tools catalogue” appears on the dialogue box as
shown in Fig. 23.27. Here the DB2 tools catalogue will be
stored in a local database.
Step 14: Click Next to continue. “Specify a contact for health
monitor notification” appears on the dialogue box as shown
in Fig. 23.28. Here you can specify the name of the person
to be contacted in case your system needs attention. This
name can also be added and changed after the installation.
In such a case, select Defer the Task Until After Installation is Complete.
Fig. 23.25 The “Configure DB2 Instances” dialogue box

Fig. 23.26 The “Prepare the DB2 tools catalog” dialogue box
Fig. 23.27 The “Specify a local database to store the DB2 tools catalog” dialogue box
Fig. 23.28 The “Specify a contact for health monitor notification” dialogue box

Step 15: Click Next to continue. “Start copying files” appears on the
dialogue box as shown in Fig. 23.29. As you have already
given DB2 all the information required to install the product
on your computer, it gives you one last chance to verify the
values you have entered.
Fig. 23.29 The “Start copying files” dialogue box

Step 16: Click Install to have the files copied to your system. You
can also click Back to return to the dialogue boxes that you
have already completed to make any changes. The
installation progress bars appear on screen while the product
is being installed.
Step 17: After the completion of the installation process, “Setup is complete” appears on the dialogue box as shown in Fig. 23.30.
Step 18: Click the Finish button to complete the installation. “First
Steps” and “Congratulations!” appear on the dialogue box
as shown in Fig. 23.31 with the following options:
Create Sample Database.
Work with Sample Database.
Work with Tutorial.
View the DB2 Product Information Library.
Launch DB2 UDB Quick Tour.
Find other DB2 Resources on the World Wide Web.
Exit First Step.

With the completion of the above steps, the installation program has completed the following:
Created DB2 program groups and items (shortcuts).
Registered a security service.
Updated the Windows Registry.
Created a default instance named DB2, added it as a service and configured it for
communications.
Fig. 23.30 The “Setup is complete” dialogue box

Fig. 23.31 The “First Steps” dialogue box

Created the DB2 Administration Server, added it as a service, and configured it so that DB2
tools can administer the server. The service’s start type was set to Automatic.
Activated DB2 First Steps to start automatically following the first boot after installation.

Now all the installation steps have been completed and DB2 UDB can be
used to create DB2 UDB applications using options as shown in Fig. 23.32.

Fig. 23.32 The “First Steps” dialogue box with DB2 UDB Sample
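Once First Steps has been used to create the sample database, a short JDBC program is one convenient way to confirm that the new instance is reachable. The sketch below is illustrative only: it assumes the DB2 Universal JDBC (Type 4) driver (db2jcc.jar) is on the classpath, that the instance listens on the commonly used default port 50000, that you connect as the user who created the SAMPLE database, and that the user name and password are replaced with valid ones for your installation. The DEPARTMENT table is part of the standard SAMPLE schema.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class VerifyDb2Install {
        public static void main(String[] args) throws Exception {
            Class.forName("com.ibm.db2.jcc.DB2Driver");          // DB2 Universal JDBC driver
            Connection con = DriverManager.getConnection(
                    "jdbc:db2://localhost:50000/SAMPLE",         // port 50000 is only the usual default
                    "db2admin", "password");                     // placeholder credentials
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT deptno, deptname FROM department");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "  " + rs.getString(2));
            }
            rs.close();
            stmt.close();
            con.close();
        }
    }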

REVIEW QUESTIONS
1. What is a DB2? Who developed DB2 products?
2. What are the main DB2 products? What are their functions? Explain.
3. On what platforms can DB2 Universal Database be run?
4. What is DB2 SQL? Explain.
5. What tools are available to help administer and manage DB2 databases?
6. What is DB2 Universal Database? Explain with its configuration.
7. With neat sketches, write short notes on the following:

a. DB2 UDB Personal Edition.


b. DB2 UDB Workgroup Edition.
c. DB2 UDB Enterprise Edition.
d. DB2 UDB Enterprise-Extended Edition.
e. DB2 UDB Personal Developer’s Edition.
f. DB2 UDB Universal Developer’s Edition.

8. What do you mean by a local application and a remote application?


9. What are the two ways to use Java Database Connectivity to access DB2 data?
10. What is the name of the DB2 product feature that provides a parallel, partitioned database
server environment?
11. Name the interfaces that you can use when creating applications with the DB2 Application
Development Client.
12. Differentiate between DB2 Workgroup Server Edition and DB2 Enterprise Server Edition.
13. What is DB2 Connect product? What are its functions?
14. What are the two versions of DB2 Connect product? Explain each one of them with neat
sketch.
15. Write short notes on the following

a. DB2 Extenders
b. Text Extenders
c. IAV Extenders
d. DB2 DataJoiner.

16. What are the major components of DB2 Universal Database? Explain each of them.
17. What are the features of DB2 Universal Databases?
18. What is DB2 Administrator’s Tool Folder? What are its components?
19. What is Control Centre? What are its main components?
20. What is a SmartGuide?
21. What are the functions of Database engine?

STATE TRUE/FALSE

1. Once a DB2 application has been developed, the DB2 Client Application Enabler (CAE) component
must be installed on each workstation executing the application.
2. DB2 UDB is a Web-enabled relational database management system that supports data
warehousing and transaction processing.
3. DB2 UDB can be scaled from hand-held computers to single processors to clusters of
computers and is multimedia-capable with image, audio, video, and text support.
4. The term “universal” in DB2 UDB refers to the ability to store all kinds of electronic
information.
5. DB2 UDB Personal Edition allows the users to create and use local databases and access
remote databases if they are available.
6. DB2 UDB Workgroup Edition is a server that supports both local and remote users and
applications.
7. DB2 UDB Personal Edition provides different engine functions found in Workgroup,
Enterprise and Enterprise-Extended Editions.
8. DB2 UDB Personal Edition can accept requests from a remote client.
9. DB2 UDB Personal Edition is licensed for multiple users to create databases on the workstation
in which it was installed.
10. Remote clients can connect to a DB2 UDB Workgroup Edition server, but DB2 UDB
Workgroup Edition does not provide a way for its users to connect to databases on host
systems.
11. DB2 UDB Workgroup Edition is not designed for use in a LAN environment.
12. The DB2 UDB Workgroup Edition is most suitable for large enterprise applications.
13. DB2 Enterprise-Extended Edition provides the ability for an Enterprise-Extended Edition
(EEE) database to be partitioned across multiple independent machines (computers) of the
same platform that are connected by network or a high-speed switch.
14. Lotus Approach is a comprehensive World Wide Web (WWW) development tool kit to create
dynamic web pages or complex web-based applications that can access DB2 databases.
15. Net.Data provides an easy-to-use interface for interfacing with UDB and other relational
databases.
16. DB2 Connect enables applications to create, update, control, and manage DB2 databases and
host systems using SQL, DB2 Administrative APIs, ODBC, JDBC, SQLJ, or DB2 CLI.
17. DB2 Connect supports Microsoft Windows data interfaces such as ActiveX Data Objects
(ADO), Remote Data Objects (RDO) and Object Linking and Embedding (OLE) DB.
18. DB2 Connect Personal Edition provides access to remote databases for multiple workstations.
19. DB2 Connect Enterprise Edition provides access from network clients to DB2 databases
residing on iSeries and zSeries host systems.
20. The DB2 Extenders add functions to DB2’s SQL grammar and expose a C API for
searching and browsing.
21. The Text Extender provides linguistic, precise, dual and ngram indexes.
22. The IAV Extenders provide the ability to use images, audio and video data in user’s
applications.
23. DB2 DataJoiner is a version of DB2 Version 2 for Common Servers that enables its users to
interact with data from multiple heterogeneous sources, providing an image of a single
relational database.

TICK (✓) THE APPROPRIATE ANSWER

1. Which DB2 UDB product cannot accept requests from remote clients?

a. DB2 Enterprise Edition


b. DB2 Workgroup
c. DB2 Personal Edition
d. DB2 Enterprise-Extended Edition.

2. From which DB2 component could you invoke Visual Explain?

a. Control Centre
b. Command Centre
c. Client Configuration Assistant
d. Both (a) and (c).

3. Which of the following is the main function of the DB2 Connect product?

a. DRDA Application Requester


b. RDBMS Engine
c. DRDA Application Server
d. DB2 Application Run-time Environment.

4. Which product contains a database engine and an application development environment?

a. DB2 Connect
b. DB2 Personal Edition
c. DB2 Personal Developer’s Edition
d. DB2 Enterprise Edition.

5. Which communication protocol could you use to access a DB2 UDB database?

a. X.25
b. AppleTalk
c. TCP/IP
d. None of these.

6. What product is required to access a DB2 for OS/390 from a DB2 CAE workstation?

a. DB2 Universal Developer’s Edition


b. DB2 Workgroup
c. DB2 Connect Enterprise Edition
d. DB2 Personal Edition.

7. Which communication protocol can be used between DRDA Application Requester (such as
DB2 Connect) and a DRDA Application Server (such as DB2 for OS/390)?

a. TCP/IP
b. NetBIOS
c. APPC
d. Both (a) and (c).

8. Which of the following provides the ability to access a host database with Distributed
Relational Database Architecture (DRDA)?

a. DB2 Connect
b. DB2 UDB
c. DB2 Developer’s Edition
d. All of these.
9. Which of the following provides the ability to develop and test a database application for one
user?

a. DB2 Connect
b. DB2 UDB
c. DB2 Developer’s Edition
d. All of these.

10. DB2 Universal Database

a. is an object-oriented relational database management system.


b. is a Web-enabled relational database management system.
c. provides SQL + objects + server extensions.
d. DB2 Developer’s Edition.
e. All of these.

11. DB2 UDB Enterprise Edition

a. includes all the features provided in the DB2 UDB Workgroup Edition.
b. supports for host database connectivity.
c. provides users with access to DB2 databases residing on iSeries or zSeries
platforms.
d. All of these.

12. DB2 UDB Enterprise Edition

a. can be installed on symmetric multiprocessor platforms with more than four


processors.
b. implements both the application requester (AR) and application server (AS)
protocols.
c. can participate in distributed relational database architecture (DRDA) networks with
full generality.
d. All of these.

13. DB2 UDB Personal Developer’s Edition includes for Windows platform the following:

a. DB2 Universal Database Personal Edition


b. DB2 Connect Personal Edition
c. DB2 Software Developer’s Kits
d. All of these.

14. A comprehensive World Wide Web (WWW) development tool kit to create dynamic web
pages or complex web-based applications that can access DB2 databases, is provided by

a. Net.Data.
b. Lotus Approach.
c. SDK.
d. JDBC.
15. An easy-to-use interface for interfacing with UDB and other relational databases, is provided
by

a. Net.Data.
b. Lotus Approach.
c. SDK.
d. JDBC.

16. A CLI interface for the Java programming language, is provided by

a. Net.Data.
b. Lotus Approach.
c. SDK.
d. JDBC.

17. A communication product that enables its users to connect to any database server that
implements the Distributed Relational Database Architecture (DRDA) protocol, including all
servers in the DB2 product family, is known as

a. DB2 Extender.
b. DB2 DataJoiner.
c. DB2 Connect.
d. None of these.

18. Access from network clients to DB2 databases residing on iSeries and zSeries host systems,
is provided by

a. DB2 Connect Personal Edition.


b. DB2 DataJoiner.
c. DB2 Connect Enterprise Edition.
d. DB2 Extenders.

19. Access to remote databases for a single workstation, is provided by

a. DB2 Connect Personal Edition.


b. DB2 DataJoiner.
c. DB2 Connect Enterprise Edition.
d. DB2 Extenders.

20. A vehicle for extending DB2 with new types and functions to support operations, is known as

a. DB2 Connect Personal Edition.


b. DB2 DataJoiner.
c. DB2 Connect Enterprise Edition.
d. DB2 Extenders.

21. The Control Centre


a. provides a graphical interface to administrative tasks such as recovering a database,
defining directories, configuring the system, managing media, and more.
b. allows easy access to all server administration tools.
c. gives a clear overview of the entire system, enables remote database management
and provides step-by-step assistance for complex tasks.
d. All of these.

FILL IN THE BLANKS

1. All DB2 products have a common component called the _____.


2. DB2 SQL conforms to the _____.
3. DRDA consists of two parts, namely (a) _____ protocol and (b) _____ protocol.
4. If access to databases on host systems is required, DB2 UDB Personal Edition can be used in
conjunction with _____.
5. DB2 client applications communicate with DB2 Workgroup Edition using a _____ protocol
with DB2 CAE.
6. DB2 UDB Enterprise-Extended Edition introduces a new dimension of _____ that can be
scaled to _____ capacity and _____ performance.
7. The DB2 Personal Developer’s Edition allows a developer to design and build _____ desktop
applications.
8. DB2 UDB Universal Developer’s Edition is supported on all platforms that support the DB2
Universal Database server product, except for the _____ or _____ database environment.
9. SDK provides the environment and tools to the user to develop applications that access DB2
databases using _____ or _____.
10. The target database server for a DB2 Connect installation is known as a _____.
11. DB2 Connect Personal Edition provides access to remote databases for a _____ workstation.
12. The Text Extender provides searches on _____ databases.
13. The Text Extender provides _____ that can be used in SQL statements to perform text
searches.
14. Intelligent Miner is a set of applications that can search large volumes of data for _____.
15. The Administrator’s Tools folder contains a collection of _____ that help manage and
administer databases, and are integrated into the _____ environment.
16. SmartGuides are _____ that guide a user in _____ and other database operations.
17. The extended form of CLP is _____.
18. CLP is a text-based application commonly used to execute _____ and _____.
Chapter 24

Oracle

24.1 INTRODUCTION

On the basis of IBM papers on System/R and visualising the universal applicability of the relational model, Lawrence Ellison and his co-founders,
Bob Miner and Ed Oates created Oracle Corporation in 1977. Oracle is a
relational database management system product of Oracle Corporation. It
used SQL (pronounced “sequel”). Oracle was able to deliver the first
commercial relational database ever to reach the market in 1979. The first
version of Oracle, version 2.0, was written in assembly language for the
DEC PDP-11 machine. As early as version 3.0, the database was written in
the C language, a portable language.
In its early days, Oracle Corporation was known as an aggressive sales
and promotion organisation. Over the years, the Oracle database has grown
in depth and quality. Its technical capabilities now match its early hype.
This chapter provides the concepts and technologies behind the Oracle
database that form the foundation of Oracle’s technology products. It also
discusses the features and functionality of the Oracle products.

24.2 HISTORY OF ORACLE

Oracle has grown from its humble beginnings as one of a number of databases available in the 1970s to the overwhelming market leader of today.
Fig. 24.1 illustrates thirty years of Oracle innovation.
In 1983, a portable version of Oracle (Version 3) was created that ran not
only on Digital VAX/VMS systems, but also on Unix and other platforms.
By 1985, Oracle claimed the ability to run on more than 30 platforms (it runs
on more than 70 today). Some of these platforms are historical curiosities
today, but others remain in use. In addition to VMS, early operating systems
supported by Oracle included IBM MVS, DEC Ultrix, HP/UX, IBM AIX
and Sun’s Solaris version of Unix. Oracle was able to leverage and
accelerate the growth of minicomputers and Unix servers in the 1980s.
Today, Oracle leverages its portability on Microsoft Windows NT/2000 and
Linux to capture a significant market share on these more recent platforms.
In addition to multiple platform support, other core Oracle messages from
the mid 80s still ring true today, including complementary software
development and decision support tools, ANSI standard SQL and portability
across platforms and connectivity over standard networks. Since the mid
80s, the database deployment model has evolved from dedicated database
servers to client/server to Internet computing implemented with
PCs and thin clients accessing database applications via browsers.
With the Oracle8, Oracle8i and Oracle9i releases, Oracle has added more
power and features to its already solid base. Oracle8, released in 1997, added
a host of features (such as the ability to create and store complete objects in
the database) and dramatically improved the performance and scalability of
the database. Oracle8i, released in 1999, added a new twist to the Oracle
database: a combination of enhancements that made the Oracle8i database
the focal point of the new world of Internet computing. Oracle9i adds an
advanced version of Oracle Parallel Server named Real Application Clusters,
along with many additional self-tuning, management and data warehousing
features.
Fig. 24.1 Thirty years of Oracle innovation

Oracle introduced many innovative technical features to the database as
computing and deployment models changed (from offering the first
distributed database to the first Java Virtual Machine in the core database
engine). Table 24.1 presents a short list of Oracle’s major feature
introductions.

24.2.1 The Oracle Family


Oracle9i Database Server describes the most recent major version of the
Oracle Relational Database Management System (RDBMS) family of
products that share common source code. Leveraging predecessors including
the Oracle8 release that surfaced in 1997, the family includes:
Personal Oracle, a database for single users that is often used to develop a code for
implementation on other Oracle multi-user databases.
Oracle Standard Edition, which was named Workgroup Server in its first iteration as part of
the Oracle7 family and is often simply referred to as Oracle Server.
Oracle Enterprise Edition, which includes additional functionality.

Table 24.1 History of Oracle technology introductions

Year Feature
1979 Oracle Release 2: the first commercially available relational database to
use SQL.
1983 Single code base for Oracle across multiple platforms.
1984 Portable toolset.
1986 Client/server Oracle relational database.
1987 CASE and 4GL toolset.
1988 Oracle Financial Applications built on relational database.
1989 Oracle6.
1991 Oracle Parallel Server on massively parallel platforms.
1993 Oracle7 with cost-based optimiser.
1994 Oracle Version 7.1 generally available: parallel operations including query,
load and create index.
1996 Universal database with extended SQL via cartridges, thin client and
application server.
1997 Oracle8 generally available: including object-relational and Very Large
Database (VLDB) features.
1999 Oracle8i generally available: Java Virtual Machine (JVM) in the database.
2000 Oracle9i Application Server generally available: Oracle tools integrated in
middle tier.
2001 Oracle9i Database Server generally available: Real Application Clusters,
Advanced Analytic Services.
In 1998, Oracle announced Oracle8i, which is sometimes referred to as
Version 8.1 of the Oracle8 database. The “i” was added to denote added
functionality supporting Internet deployment in the new version. Oracle9i
followed, with Application Server available in 2000 and Database Server in
2001. The terms “Oracle”, “Oracle8”, “Oracle8i” and “Oracle9i” may appear
to be used somewhat interchangeably in this book, since Oracle9i includes
all the features of previous versions. When we describe a new feature that
was first made available specifically for Oracle8i or Oracle9i we have tried
to note that fact to avoid confusion, recognising that many of you may have
old releases of Oracle. We typically use the simple term “Oracle” when
describing features that are common to all these releases.
Oracle has focused development around a single source code model since
1983. While each database implementation includes some operating system-
specific source code, most of the code is common across the various
implementations. The interfaces that users, developers and administrators
deal with for each version are consistent. Features are consistent across
platforms for implementations of Oracle Standard Edition and Oracle
Enterprise Edition. As a result, companies have been able to migrate Oracle
applications easily to various hardware vendors and operating systems while
leveraging their investments in Oracle technology. From the company’s
perspective, Oracle has been able to focus on implementing new features
only once in its product set, instead of having to add functionality at
different times to different implementations.

24.2.1.1 Oracle Standard Edition


When Oracle uses the names Oracle8 Server, Oracle8i Server or Oracle9i
Server to refer to a specific database offering, it refers to what was formerly
known as Workgroup Server and is now sometimes called Standard Edition.
From a functionality and pricing standpoint, this product intends to compete
in the entry-level multi-user and small database category, which supports a
smaller number of users. These releases are available today on Windows
NT, Netware and Unix platforms such as Compaq (Digital), HP/UX, IBM
AIX, Linux, and Sun Solaris.
24.2.1.2 Oracle Enterprise Edition
Oracle Enterprise Edition is aimed at larger-scale implementations that
require additional features. Enterprise Edition is available on far more
platforms than the Oracle release for workgroups and includes advanced
management, networking, programming and data warehousing features, as
well as a variety of special-purpose options.

24.2.1.3 Oracle Personal Edition


Oracle Personal Edition is the single-user version of Oracle Enterprise
Edition. It is most frequently used by developers because it allows
development activities on a single machine. Since the features match those
of Enterprise Edition, a developer can write applications using the Personal
Edition and deploy them to multiuser servers. Some companies deploy
single-user applications using this product. However, Oracle Lite offers a
much more lightweight means of deploying the same applications.

24.2.1.4 Oracle Lite


Oracle Lite, formerly known as Oracle Mobile, is intended for single users
who are using wireless devices. It differs from other members of the Oracle
database family in that it does not use the same database engine. Instead,
Oracle developed a lightweight engine compatible with the limited memory
and storage capacity of notebooks and handheld devices. Oracle Lite is
described in more detail at the end of this chapter.
As the SQL supported by Oracle Lite is largely the same as the SQL for other Oracle databases, applications developed for those database engines can also be run on Oracle Lite. Replication of data between Oracle Lite and other Oracle versions is a key part of most implementations.
Table 24.2 summarises the situations in which each database product would typically be used, using the Oracle product names for the different members of the Oracle database family.

Table 24.2 Oracle family of database products

Database Name                     When Appropriate
Oracle Server/Standard Edition    Version of the Oracle server for a small number of users and a smaller database.
Oracle Enterprise Edition         Version of Oracle for a large number of users or a large database, with advanced features for extensibility, performance and management.
Oracle Personal Edition           Single-user version of Oracle typically used for development of applications for deployment on other Oracle versions.
Oracle Lite                       Lightweight database engine for mobile computing on notebooks and handheld devices.

24.2.2 The Oracle Software


The Oracle Corporation offers numerous products from relational and
object-relational database management systems, software development tools
and CASE tools to packaged applications such as Oracle Financials. When
discussing Oracle, one must refer to a specific piece of software. People who
talk about “buying Oracle” or who say “I have used Oracle” seldom actually
know or understand what they are talking about.
Under the headings above, one might categorise the Oracle product line as
shown in Table 24.3.

Table 24.3 Oracle software product line

Category: Database Servers
Software: Oracle7, Oracle8, Oracle8i, Oracle9i.
Description: The database engines that store and retrieve data, and servers that provide access to the DBMS over a LAN or the Internet/Web. There are several versions of each: Personal Oracle X (intended for a single desktop user), Oracle X (intended for small to medium-sized workgroups) and Oracle X Enterprise (intended for very large organisations), where X is some version like Oracle7, Oracle8, Oracle8i or Oracle9i.

Category: Application/Web servers
Software: (Web) Application Server, WebDB.
Description: Web development and application server software that allows applications to be served over the web. This is typically done in a 3-tier architecture.

Category: Software Development
Software: SQL*Plus (command-line interface); Developer/2000, or simply Developer (Forms, Reports, Graphics); Designer/2000, or simply Designer (CASE tools); Programmer/2000 (embedded SQL libraries); JDeveloper (Java application development). All of these pieces are now combined under one title: Oracle 9i Development Suite.
Description: Tools used to develop applications that access an Oracle DBMS (or multiple DBMSs), typically in a traditional 2-tier client/server architecture or a 3-tier architecture.

Category: Packaged Apps.
Software: Oracle Financials, Oracle CRM, Oracle Supply Chain Management and so on.
Description: Software written using Oracle tools by Oracle that is then installed and customised for a business.

24.3 ORACLE FEATURES


24.3.1 Application Development Features

24.3.1.1 Database Programming


All flavours of the Oracle database include different languages and
interfaces that allow programmers to access and manipulate the data in the
database. Database programming features usually interest two groups:
developers building Oracle-based applications that will be sold
commercially and IT organisations within companies that custom-develop
applications unique to their businesses.

24.3.1.2 SQL
The ANSI standard Structured Query Language (SQL) provides basic
functions for data manipulation, transaction control and record retrieval from
the database. However, most end users interact with Oracle through
applications that provide an interface that hides the underlying SQL and its
complexity.

24.3.1.3 PL/SQL
Oracle’s PL/SQL, a procedural language extension to SQL, is commonly
used to implement program logic modules for applications. PL/SQL can be
used to build stored procedures and triggers, looping controls, conditional
statements and error handling. You can compile and store PL/SQL
procedures in the database. You can also execute PL/SQL blocks via
SQL*Plus, an interactive tool provided with all versions of Oracle.
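Although PL/SQL blocks are written and compiled inside the database, stored procedures are most often invoked from client code. As a hedged illustration only, the Java fragment below calls a hypothetical procedure raise_salary(p_empno, p_pct) through JDBC; the procedure, connect string and credentials are placeholders, not objects shipped with Oracle.

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class CallRaiseSalary {
        public static void main(String[] args) throws Exception {
            // Older driver releases register the class as oracle.jdbc.driver.OracleDriver.
            Class.forName("oracle.jdbc.OracleDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:oracle:thin:@localhost:1521:ORCL",   // placeholder host/port/SID
                    "scott", "tiger");                         // placeholder credentials
            // Hypothetical PL/SQL procedure: raise_salary(p_empno IN NUMBER, p_pct IN NUMBER)
            CallableStatement cs = con.prepareCall("{call raise_salary(?, ?)}");
            cs.setInt(1, 7369);      // employee number (illustrative)
            cs.setDouble(2, 10.0);   // percentage increase (illustrative)
            cs.execute();
            cs.close();
            con.close();
        }
    }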

24.3.1.4 Java features and options


Oracle8i introduced the use of Java as a procedural language with a Java
Virtual Machine (JVM) in the database (originally called JServer). JVM
includes support for Java stored procedures, methods, triggers, Enterprise
JavaBeans (EJBs), CORBA, IIOP, and HTTP. The Accelerator is used for
project generation, translation and compilation. As of Oracle Version 8.1.7,
it can also be used to deploy/install shared libraries.
The inclusion of Java within the Oracle database allows Java developers
to leverage their skills as Oracle applications developers. Java applications
can be deployed in the client, Oracle9i Application Server or database,
depending on what is most appropriate.
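As a rough sketch of the kind of Java code that can live inside the database, the class below could be loaded with the loadjava utility and then published to SQL with a call specification such as CREATE OR REPLACE PROCEDURE log_event(msg VARCHAR2) AS LANGUAGE JAVA NAME 'EventLogger.log(java.lang.String)'. The EVENT_LOG table, class and procedure names are invented for illustration; inside the server JVM, the special URL jdbc:default:connection: gives the code access to the session that called it.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class EventLogger {
        public static void log(String message) throws Exception {
            // Only meaningful when executed inside the server JVM.
            Connection con = DriverManager.getConnection("jdbc:default:connection:");
            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO event_log (logged_at, msg) VALUES (SYSDATE, ?)");
            ps.setString(1, message);
            ps.executeUpdate();
            ps.close();   // the default (internal) connection itself is not closed
        }
    }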

24.3.1.5 Large objects


Interest in the use of large objects (LOBs) is growing, particularly for the
storage of nontraditional datatypes such as images. The Oracle database has
been able to store large objects for some time. Oracle8 added the capability
to store multiple LOB columns in each table.
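From a client program, LOB columns are commonly read through the standard java.sql.Blob interface. The following sketch streams one image column out of the database into a local file; the PRODUCT_IMAGES table, its columns and the connection details are placeholders invented for this example.

    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.sql.Blob;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class ReadImageBlob {
        public static void main(String[] args) throws Exception {
            Class.forName("oracle.jdbc.OracleDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:oracle:thin:@localhost:1521:ORCL", "scott", "tiger");   // placeholders
            PreparedStatement ps = con.prepareStatement(
                    "SELECT image FROM product_images WHERE product_id = ?");
            ps.setInt(1, 100);
            ResultSet rs = ps.executeQuery();
            if (rs.next()) {
                Blob blob = rs.getBlob(1);
                InputStream in = blob.getBinaryStream();
                FileOutputStream out = new FileOutputStream("product100.jpg");
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                out.close();
                in.close();
            }
            rs.close();
            ps.close();
            con.close();
        }
    }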

24.3.1.6 Object-oriented programming


Support of object structures has been included in Oracle8i to provide support
for an object-oriented approach to programming. For example, programmers
can create user-defined data types, complete with their own methods and
attributes. Oracle’s object support includes a feature called Object Views
through which object-oriented programs can make use of relational data
already stored in the database. You can also store objects in the database as
varying arrays (VARRAYs), nested tables or index organised tables (IOTs).

24.3.1.7 Third-generation languages (3GLs)


Programmers can interact with the Oracle database from C, C++, Java,
COBOL or FORTRAN applications by embedding SQL in those
applications. Prior to compiling the applications using a platform’s native
compilers, you must run the embedded SQL code through a pre-compiler.
The pre-compiler replaces SQL statements with library calls the native
compiler can accept. Oracle provides support for this capability through
optional “programmer” pre-compilers for languages such as C and C++
(Pro*C) and COBOL (Pro*COBOL). More recently, Oracle added SQLJ, a
pre-compiler for Java that replaces SQL statements embedded in Java with
calls to a SQLJ runtime library, also written in Java.

24.3.1.8 Database drivers


All versions of Oracle include database drivers that allow applications to
access Oracle via ODBC (the Open DataBase Connectivity standard) or
JDBC (the Java DataBase Connectivity open standard).
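As a small, hedged illustration of the JDBC route, the same client code can reach the database through either the pure-Java thin driver or the OCI-based driver simply by changing the JDBC URL. The host, port, SID/TNS alias and credentials below are placeholders, and older driver releases register the class as oracle.jdbc.driver.OracleDriver rather than oracle.jdbc.OracleDriver.

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class OracleConnectionFactory {
        public static Connection connect(boolean useThin) throws Exception {
            Class.forName("oracle.jdbc.OracleDriver");
            String url = useThin
                    ? "jdbc:oracle:thin:@localhost:1521:ORCL"   // no Oracle client install needed
                    : "jdbc:oracle:oci8:@ORCL";                 // goes through the Oracle client/OCI libraries
            return DriverManager.getConnection(url, "scott", "tiger");   // placeholder credentials
        }
    }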

24.3.1.9 The Oracle Call Interface (OCI)


If you are an experienced programmer seeking optimum performance, you
may choose to define SQL statements within host-language character strings
and then explicitly parse the statements, bind variables for them and execute
them using the Oracle Call Interface (OCI). OCI is a much more detailed
interface that requires more programmer time and effort to create and debug.
Developing an application that uses OCI can be time-consuming, but the
added functionality and incremental performance gains often make spending
the extra time worthwhile. OCI improves application performance or adds
functionality. For instance, in high-availability implementations in which
multiple systems share disks and implement Real Application
Clusters/Oracle Parallel Server, you may want users to reattach to a second
server transparently if the first fails. You can write programs that do this
using OCI.

24.3.1.10 National Language Support (NLS)


National Language Support (NLS) provides character sets and associated
functionality, such as date and numeric formats, for a variety of languages.
Oracle9i features full Unicode 3.0 support. All data may be stored as
Unicode, or select columns may be incrementally stored as Unicode. UTF-8
encoding and UTF-16 encoding provide support for more than 57 languages
and 200 character sets. Extensive localisation is provided (for example, for
data formats) and customised localisation can be added through the Oracle
Locale Builder.

24.3.1.11 Database Extensibility


The Internet and corporate intranets have created a growing demand for
storage and manipulation of nontraditional data types within the database.
There is a need for extensions to the standard functionality of a database for
storing and manipulating image, audio, video, spatial and time series
information. Oracle8 provides extensibility to the database through options
sometimes referred to as cartridges. These options are simply extensions to
standard SQL, usually built by Oracle or its partners through C, PL/SQL or
Java. You may find these options helpful if you are working extensively with
the type of data they are designed to handle.

24.3.1.12 Oracle interMedia


Oracle interMedia bundles what was formerly referred to as the “Context cartridge” for text manipulation with additional image, audio, video and locator functions and is included in the database license. interMedia offers the following major capabilities:
The text portion of interMedia (Oracle9i’s Oracle Text) can identify the gist of a document by searching for themes and key phrases within the document.
The image portion of interMedia can store and retrieve images.
The audio and video portions of interMedia can store and retrieve audio and video clips, respectively.
The locator portion of interMedia can retrieve data that includes spatial coordinate information.

24.3.1.13 Oracle Spatial


The Spatial option is available for Oracle Enterprise Edition. It can optimise
the display and retrieval of data linked to coordinates and is used in the
development of spatial information systems. Several vendors of Geographic
Information Systems (GIS) products now bundle this option and leverage it
as their search and retrieval engine.

24.3.1.14 Database Connection Features


The connection between the client and the database server is a key
component of the overall architecture of a computing system. The database
connection is responsible for supporting all communications between an
application and the data it uses. Oracle includes a number of features that
establish and tune your database connections.
24.3.2 Communication Features
The following features relate to the way the Oracle database handles the
connection between the client and server machines in a database interaction.
We have divided the discussion in this section into two categories: database
networking and Oracle9i Application Server.

24.3.2.1 Database Networking


Database users connect to the database by establishing a network
connection. You can also link database servers via network connections.
Oracle provides a number of features to establish connections between users
and the database and/or between database servers, as described in the
following sections.

24.3.2.2 Oracle Net/Net8


Oracle’s network interface, Net8, was formerly known as SQL*Net when
used with Oracle7 and previous versions of Oracle. You can use Net8 over a
wide variety of network protocols, although TCP/IP is by far the most
common protocol today. In Oracle9i, the name of Net8 has been changed to
Oracle Net and the features associated with Net8, such as shared servers, are
referred to as Oracle Net Services.

24.3.2.3 Oracle Names


Oracle Names allows clients to connect to an Oracle server without requiring
a configuration file on each client. Using Oracle Names can reduce
maintenance efforts, since a change in the topology of your network will not
require a corresponding change in configuration files on every client
machine.

24.3.2.4 Oracle Internet Directory


The Oracle Internet Directory (OID) was introduced with Oracle8i. OID
serves the same function as Oracle Names in that it gives users a way to
connect to an Oracle Server without having a client-side configuration file.
However, OID differs from Oracle Names in that it is an LDAP
(Lightweight Directory Access Protocol) directory; it does not merely
support the Oracle-only Oracle Net/Net8 protocol.

24.3.2.5 Oracle Connection Manager


Each connection to the database takes up valuable network resources, which
can impact the overall performance of a database application. Oracle’s
Connection Manager, illustrated in Fig. 24.2, reduces the number of network
connections to the database through the use of concentrators, which provide
connection multiplexing to implement multiple connections over a single
network connection. Connection multiplexing provides the greatest benefit
when there are a large number of active users.

Fig. 24.2 Concentrators with Connection Managers

You can also use the Connection Manager to provide multi-protocol
connectivity when clients and servers run different network protocols. This
capability replaces the multi-protocol interchange formerly offered by
Oracle, but it is less important today because many companies now use
TCP/IP as their standard protocol.
24.3.2.6 Advanced Security Option
Advanced Security, now available as an option, was formerly known as the
Advanced Networking Option (ANO). Key features include network
encryption services using RSA Data Security’s RC4 or DES algorithm,
network data integrity checking, enhanced authentication integration, single
sign-on and DCE (Distributed Computing Environment) integration.

Availability: Advanced networking features such as the Oracle Connection Manager and the Advanced Security Option have typically been available for the Enterprise Edition of the database, but not for the Standard Edition.

24.3.2.7 Oracle9i Application Server


The popularity of Internet and intranet applications has led to a change in
deployment from client/server (with fat clients running a significant piece of
the application) to a three-tier architecture (with a browser supplying
everything needed on a thin client). Oracle9i Application Server
(Oracle9iAS) provides a means of implementing the middle tier of a three-
tier solution for web-based applications, component-based applications and
enterprise application integration. Oracle9iAS replaces Oracle Application
Server (OAS) and Oracle Web Application Server. Oracle9iAS can be scaled
across multiple middle-tier servers.
This product includes a web listener based on the popular Apache listener,
servlets and JavaServer Pages (JSPs), business logic and/or data access
components. Business logic might include JavaBeans, Business Components
for Java (BC4J) and Enterprise JavaBeans (EJBs). Data access components
can include JDBC, SQLJ, BC4J, and EJBs.
Oracle9iAS offers additional solutions in the cache, portal, intelligence
and wireless areas:
Cache: Oracle9iAS Database Cache provides a middle tier for the caching of PL/SQL
procedures and anonymous PL/SQL blocks.
Portal: Oracle9iAS Portal is part of the Internet Developer Suite (discussed later in this
chapter) and is used for building easy-to-use browser interfaces to applications through
servlets and HTTP links. The developed portal is deployed to Oracle9iAS.
Intelligence: Oracle9iAS Intelligence often includes Oracle9iAS Portal, but also consists
of:

Oracle Reports, which provides a scalable middle tier for the reporting of prebuilt
query results.
Oracle Discoverer, for ad hoc query and relational online analytical processing
(ROLAP).
OLAP applications custom-built with JDeveloper.
Business intelligence beans that leverage Oracle9i Advanced Analytic Services.
Clickstream Intelligence.

Oracle Wireless Edition: Formerly known as Oracle Portal-to-Go, it includes:

Content adapters for transforming content to XML.


Device transformers for transforming XML to device-specific markup languages.
Personalisation portals for service personalisation of alerts, alert addresses, location
marks and profiles; the wireless personalisation portal is also used for the creation,
servicing, testing and publishing of URL service and for user management.

Fig. 24.3 shows many of the connection possibilities discussed above.

24.3.3 Distributed Database Features


One of the strongest features of the Oracle database is its ability to scale up
to handle extremely large volumes of data and users. Oracle scales not only
by running on more and more powerful platforms, but also by running in a
distributed configuration. Oracle databases on separate platforms can be
combined to act as a single logical distributed database. Some of the basic
ways that Oracle handles database interactions in a distributed database
system are listed below:
Fig. 24.3 Typical Oracle database connection

24.3.3.1 Distributed Queries and Transactions


The data within an organisation is often spread among multiple databases for
reasons of both capacity and organisational responsibility. Users may want to
query this distributed data or update it as if it existed within a single
database.
Oracle first introduced distributed databases in response to the
requirements for accessing data on multiple platforms in the early 80s.
Distributed queries can retrieve data from multiple databases. Distributed
transactions can insert, update or delete data on distributed databases.
Oracle’s two-phase commit mechanism guarantees that all the database
servers that are part of a transaction will either commit or roll back the
transaction. Distributed transactions that may be interrupted by a system
failure are monitored by a recovery background process. Once the failed
system comes back online, the same process will complete the distributed
transactions to maintain consistency across the databases.
You can also implement distributed transactions in Oracle using popular transaction processing (TP) monitors that interact with Oracle via XA, an industry-standard (X/Open) interface. Oracle8i also added native transaction
coordination with the Microsoft Transaction Server (MTS), so you can
implement a distributed transaction initiated under the control of MTS
through an Oracle database.
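As a sketch of how this looks in practice, a distributed query can reference a remote table through a database link. The link name SALES_LINK, the connect string and the EMP/DEPT tables below are illustrative assumptions:

create database link SALES_LINK
connect to scott identified by tiger
using 'sales';

select e.ENAME, d.DNAME
from EMP e, DEPT@SALES_LINK d
where e.DEPTNO = d.DEPTNO;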

24.3.3.2 Heterogeneous Services


Heterogeneous Services allow non-Oracle data and services to be accessed
from an Oracle database through tools such as Oracle Transparent Gateways.
For example, Transparent Gateways allow users to submit Oracle SQL
statements to a non-Oracle distributed database source and have them
automatically translated into the SQL dialect of the non-Oracle source
system, which remains transparent to the user. In addition to providing
underlying SQL services, Heterogeneous Services provide transaction
services utilising Oracle’s two-phase commit with non-Oracle databases and
procedural services that call third-generation language routines on non-
Oracle systems. Users interact with the Oracle database as if all objects are
stored in the Oracle database and Heterogeneous Services handle the
transparent interaction with the foreign database on the user’s behalf.
Heterogeneous Services work in conjunction with Transparent Gateways.
Generic connectivity via ODBC and OLEDB is included with the database.
Optional Transparent Gateways use agents specifically tailored for a variety
of target systems.

24.3.4 Data Movement Features


Moving data from one Oracle database to another is often a requirement
when using distributed databases, or when a user wants to implement
multiple copies of the same database in multiple locations to reduce network
traffic or increase data availability. You can export data and data dictionaries
(metadata) from one database and import them into another. Oracle also
offers many other advanced features in this category, including replication,
transportable tablespaces and Advanced Queuing. The technology used to
move data from one Oracle database to another automatically is discussed
below:

24.3.4.1 Basic Replication


You can use basic replication to move recently added and updated data from
an Oracle “master” database to databases on which duplicate sets of data
reside. In basic replication, only the single master is updated. You can
manage replication through the Oracle Enterprise Manager (OEM prior to
Oracle9i, EM in Oracle9i).
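In SQL terms, basic replication is typically set up as a read-only snapshot (called a materialised view from Oracle8i onwards) that is refreshed from the master over a database link. The names EMP_COPY and MASTER_LINK below are illustrative assumptions, and fast refresh additionally requires a materialised view log on the master table:

create materialized view EMP_COPY
refresh fast start with sysdate next sysdate + 1/24
as select * from EMP@MASTER_LINK;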

24.3.4.2 Advanced Replication


You can use advanced replication in multi-master systems in which any of
the databases involved can be updated and conflict-resolution features are
needed to resolve inconsistencies in the data. Because there is more than one
master database, the same data may be updated on multiple systems at the
same time. Conflict resolution is necessary to determine the “true” version of
the data. Oracle’s advanced replication includes a number of conflict-
resolution scenarios and also allows programmers to write their own.

24.3.4.3 Transportable Tablespaces


Transportable tablespaces were introduced in Oracle8i. Instead of using the
export/import process, which dumps data and the structures that contain it
into an intermediate file for loading, you simply put the tablespaces in read-
only mode, move or copy them from one database to another and mount
them. You must export the data dictionary (metadata) for the tablespace from
the source and import it at the target. This feature can save a lot of time
during maintenance, because it simplifies the process.
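A minimal sketch of the steps, assuming a tablespace named SALES_TS (the export/import parameters shown in the comments may vary by release):

alter tablespace SALES_TS read only;
-- export only the metadata for the tablespace with the export utility, e.g.:
--   exp ... transport_tablespace=y tablespaces=SALES_TS
-- copy the datafiles of SALES_TS to the target system, then import the metadata, e.g.:
--   imp ... transport_tablespace=y datafiles='...'
alter tablespace SALES_TS read write;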

24.3.4.4 Advanced Queuing


Advanced Queuing (AQ), introduced in Oracle8, provides the means to
asynchronously send messages from one Oracle database to another.
Because messages are stored in a queue in the database and sent
asynchronously when the connection is made, the amount of overhead and
network traffic is much lower than it would be using traditional guaranteed
delivery through the two-phase commit protocol between source and target.
By storing the messages in the database, AQ provides a solution with greater
recoverability than other queuing solutions that store messages in file
systems.

24.3.4.5 Oracle Messaging


It adds the capability to develop and deploy a content-based publish and
subscribe solution using a rules engine to determine relevant subscribing
applications. As new content is published to a subscriber list, the rules on the
list determine which subscribers should receive the content. This approach
means that a single list can efficiently serve the needs of different subscriber
communities.

24.3.4.6 Oracle9i AQ
It adds XML support and Oracle Internet Directory (OID) integration. This
technology is leveraged in Oracle Application Interconnect (OAI), which
includes adapters to non-Oracle applications, messaging products and
databases.

24.3.4.7 Availability
Although basic replication has been included with both Oracle Standard
Edition and Enterprise Edition, advanced features such as advanced
replication, transportable tablespaces and Advanced Queuing have typically
required Enterprise Edition.

24.3.5 Performance Features


Oracle includes several features specifically designed to boost performance
in certain situations. The performance features can be divided into two
categories, namely database parallelisation and data warehousing.
24.3.5.1 Database Parallelisation
Database tasks implemented in parallel speed up the querying, tuning and maintenance of the database. By breaking up a single task into smaller tasks
and assigning each subtask to an independent process, you can dramatically
improve the performance of certain types of database operations.
Parallel query features became a standard part of Enterprise Edition
beginning with Oracle 7.3. Examples of query features implemented in
parallel include:
Table scans.
Nested loops.
Sort merge joins.
GROUP BYs.
NOT IN subqueries (anti-joins).
User-defined functions.
Index scans.
Select distinct UNION and UNION ALL.
Hash joins.
ORDER BY and aggregation.
Bitmap star joins.
Partition-wise joins.
Stored procedures (PL/SQL, Java, external routines).

When you are using Oracle, the degree of parallelism for any operation is by default set to twice the number of CPUs. This degree can also be adjusted automatically for each subsequent query based on the system load. You can also generate statistics for the cost-based optimiser in parallel.
Maintenance functions such as loading (via SQL*Loader), backups and index builds can also be performed in parallel in Oracle Enterprise Edition. Oracle Partitioning for the Enterprise Edition enables additional parallel Data Manipulation Language (DML) inserts, updates and deletes, as well as parallel index scans.
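For example, a query can request parallel execution explicitly with an optimiser hint, or a default degree of parallelism can be set on a table. The SALES table and the degree of 8 below are illustrative assumptions:

select /*+ PARALLEL(SALES, 8) */ REGION, sum(AMOUNT)
from SALES
group by REGION;

alter table SALES parallel (degree 8);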

24.3.5.2 Data Warehousing


The parallel features as discussed above improve the overall performance of
the Oracle database. Oracle has also added some performance enhancements
that specifically apply to data warehousing applications.
Bitmap indexes: Oracle added support for stored bitmap indexes to Oracle 7.3 to provide a fast way of selecting and retrieving certain types of data. Bitmap indexes typically work best for columns that have few different values relative to the overall number of rows in a table. Rather than storing the actual value, a bitmap index uses an individual bit for each potential value, with the bit either “on” (set to 1) to indicate that the row contains the value or “off” (set to 0) to indicate that it does not. This storage mechanism can also provide performance improvements for the types of joins typically used in data warehousing (see the example after this list).
Star query optimization: Typical data warehousing queries occur against a large fact table
with foreign keys to much smaller dimension tables. Oracle added an optimisation for this
type of star query to Oracle 7.3. Performance gains are realised through the use of Cartesian
product joins of dimension tables with a single join back to the large fact table. Oracle8
introduced a further mechanism called a parallel bitmap star join, which uses bitmap indexes
on the foreign keys to the dimension tables to speed star joins involving a large number of
dimension tables.
Materialised views: In Oracle, materialised views provide another means of achieving a
significant speed-up of query performance. Summary-level information derived from a fact
table and grouped along dimension values is stored as a materialised view. Queries that can
use this view are directed to the view, transparently to the user and the SQL they submit.
Analytic functions: A growing trend in Oracle and other systems is the movement of some
functions from decision-support user tools into the database. Oracle8i and Oracle9i feature
the addition of ANSI standard OLAP SQL analytic functions for windowing, statistics,
CUBE and ROLLUP and more.
Oracle9i Advanced Analytic Services: Oracle9i Advanced Analytic Services are a
combination of what used to be called OLAP Services and Data Mining. The OLAP services
provide a Java OLAP API and are typically leveraged to build custom OLAP applications
through the use of Oracle’s JDeveloper product. Oracle9i Advanced Analytic Services in the
database also provide predictive OLAP functions and a multidimensional cache for doing the
same kinds of analysis previously possible in Oracle’s Express Server.

The Oracle9i database engine also includes data-mining algorithms that are exposed through
a Java data-mining API.
Availability: Oracle Standard Edition lacks many important data warehousing features
available in the Enterprise Edition, such as bitmap indexes and materialised views. Hence,
use of Enterprise Edition is recommended for data warehousing projects.
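The following statements sketch the bitmap index and materialised view features referenced above; the SALES table, its REGION and AMOUNT columns, and the object names are illustrative assumptions:

create bitmap index SALES_REGION_BIX on SALES (REGION);

create materialized view SALES_BY_REGION
enable query rewrite
as select REGION, sum(AMOUNT) TOTAL_AMOUNT
from SALES
group by REGION;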

24.3.6 Database Management Features


Oracle includes many features that make the database easier to manage. We
have divided the discussion in this section into four categories: Oracle
Enterprise Manager, add-on packs, backup and recovery and database
availability.

24.3.6.1 Oracle Enterprise Manager


As part of every Database Server, Oracle provides the Oracle Enterprise
Manager (EM), a database management tool framework with a graphical
interface used to manage database users, instances and features (such as
replication) that can provide additional information about the Oracle
environment. EM can also manage Oracle9iAS and Oracle iFS, Internet
Directory and Express.
Prior to the Oracle8i database, the EM software had to be installed on
Windows 95/98 or NT-based systems and each repository could be accessed
by only a single database manager at a time. Now you can use EM from a
browser or load it onto Windows 95/98/2000 or NT-based systems. Multiple
database administrators can access the EM repository at the same time. In
the EM release for Oracle9i, the super administrator can define services that
should be displayed on other administrators’ consoles, and management
regions can be set up.

24.3.6.2 Add-on Packs


Several optional add-on packs are available for Oracle. In addition to these
database-management packs, management packs are available for Oracle
Applications and for SAP R/3.

24.3.6.3 Standard Management Pack


The Standard Management Pack for Oracle provides tools for the
management of small Oracle databases (for example, Oracle Server/Standard
Edition). Features include support for performance monitoring of database
contention, I/O, load, memory use and instance metrics, session analysis,
index tuning and change investigation and tracking.

24.3.6.4 Diagnostics Pack


The Diagnostics Pack can be used to monitor, diagnose and maintain the
health of Enterprise Edition databases, operating systems and applications.
With both historical and real-time analysis, problems can be avoided
automatically before they occur. The pack also provides capacity planning
features that help you plan and track future system-resource requirements.
24.3.6.5 Tuning Pack
With the Tuning Pack, you can optimise system performance by identifying
and tuning Enterprise Edition database and application bottlenecks such as
inefficient SQL, poor data design and the improper use of system resources.
The pack can proactively discover tuning opportunities and automatically
generate the analysis and required changes to tune the system.

24.3.6.6 Change Management Pack


The Change Management Pack helps eliminate errors and loss of data when
upgrading Enterprise Edition databases to support new applications. It can
analyse the impact and complex dependencies associated with application
changes and automatically perform database upgrades. Users can initiate changes with easy-to-use wizards that teach the systematic steps necessary to upgrade.

24.3.6.7 Availability
Oracle Enterprise Manager can be used for managing Oracle Standard
Edition and/or Enterprise Edition. Additional functionality for diagnostics,
tuning and change management of Standard Edition instances is provided by
the Standard Management Pack. For Enterprise Edition, such additional
functionality is provided by separate Diagnostics, Tuning and Change
Management Packs.

24.3.7 Backup and Recovery Features


As every database administrator knows, backing up a database is a rather
mundane but necessary task. An improper backup makes recovery difficult,
if not impossible. Unfortunately, people often realise the extreme importance
of this everyday task only when it is too late—usually after losing business-
critical data due to a failure of a related system.

24.3.7.1 Recovery Manager


Typical backups include complete database backups (the most common
type), tablespace backups, datafile backups, control file backups and
archivelog backups. Oracle8 introduced the Recovery Manager (RMAN) for
the server-managed backup and recovery of the database. Previously,
Oracle’s Enterprise Backup Utility (EBU) provided a similar solution on
some platforms. However, RMAN, with its Recovery Catalogue stored in an
Oracle database, provides a much more complete solution. RMAN can
automatically locate, back up, restore and recover datafiles, control files and
archived redo logs. RMAN for Oracle9i can restart backups and restores and
implement recovery window policies when backups expire. The Oracle
Enterprise Manager Backup Manager provides a GUI-based interface to
RMAN.

24.3.7.2 Incremental backup and recovery


RMAN can perform incremental backups of Enterprise Edition databases.
Incremental backups back up only the blocks modified since the last backup
of a datafile, tablespace, or database; thus, they are smaller and faster than
complete backups. RMAN can also perform point-in-time recovery, which allows the recovery of data until just prior to an undesirable event (such as the mistaken dropping of a table).

24.3.7.3 Legato Storage Manager


Various media-management software vendors support RMAN. Oracle
bundles Legato Storage Manager with Oracle to provide media-management
services, including the tracking of tape volumes, for up to four devices.
RMAN interfaces automatically with the media-management software to
request the mounting of tapes as needed for backup and recovery operations.

24.3.7.4 Database Availability


Database availability depends upon the reliability and the management of the
database, the underlying operating system and the specific hardware
components of the system. Oracle has improved availability by reducing
backup and recovery times. It has done this through:
Providing online and parallel backup and recovery.
Improving the management of online data through range partitioning.
Leveraging hardware capabilities for improved monitoring and failover.

24.3.7.5 Partitioning option


Oracle introduced partitioning as an option to Oracle8 to provide a higher
degree of manageability and availability. You can take individual partitions
offline for maintenance while other partitions remain available for user
access. In data warehousing implementations, partitioning is frequently used
to implement rolling windows based on date ranges. Hash partitioning, in
which the data partitions are divided up as a result of a hashing function, was
added to Oracle8i to enable an even distribution of data. You can also use
composite partitioning to enable hash sub-partitioning within specific range
partitions. Oracle9i adds list partitioning, which enables the partitioning of
data based on discrete values such as geography.
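As an illustration, the following statements range-partition a hypothetical ORDERS table by date, which makes it possible to roll an old partition out of the window with a single DDL command:

create table ORDERS (
  ORDER_ID   number,
  ORDER_DATE date,
  AMOUNT     number
)
partition by range (ORDER_DATE) (
  partition P_2001_Q1 values less than (to_date('01-APR-2001','DD-MON-YYYY')),
  partition P_2001_Q2 values less than (to_date('01-JUL-2001','DD-MON-YYYY')),
  partition P_REST    values less than (maxvalue)
);

alter table ORDERS drop partition P_2001_Q1;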

24.3.7.6 Oracle9i Data Guard


Oracle first introduced a standby database feature in Oracle 7.3. The standby
database provides a copy of the production database to be used if the
primary database is lost—for example, in the event of primary site failure, or
during routine maintenance. Primary and standby databases may be
geographically separated. The standby database is created from a copy of the
production database and updated through the application of archived redo
logs generated by the production database. The Oracle9i Data Guard product
fully automates this process. Agents are deployed on both the production and
standby database, and a Data Guard Broker coordinates commands. A single
Data Guard command is used to run the eight steps required for failover.
In addition to providing physical standby database support, Oracle9i Data
Guard (second release) will be able to create a logical standby database. In
this scenario, Oracle archive logs are transformed into SQL transactions and
applied to an open standby database.

24.3.7.7 Failover features and options


The failover feature provides a higher level of reliability for an Oracle
database. Failover is implemented through a second system or node that
provides access to data residing on a shared disk when the first system or
node fails. Oracle Fail Safe for Windows NT/2000, in combination with
Microsoft Cluster Services, provides a failover solution in the event of a
system failure. Unix systems such as HP-UX and Solaris have long provided
similar functionality for their clusters.

24.3.7.8 Oracle Parallel Server/Real Application Clusters failover features


The Oracle Parallel Server (OPS) option, renamed Real Application Clusters
in Oracle9i, can provide failover support as well as increased scalability on
Unix and Windows NT clusters. Oracle8i greatly improved scalability for
read/write applications through the introduction of Cache Fusion. Oracle9i
improved Cache Fusion for write/write applications by further minimising
much of the disk write activity used to control data locking.
With Real Application Clusters, you can run multiple Oracle instances on
systems in a shared disk cluster configuration or on multiple nodes of a
Massively Parallel Processor (MPP) configuration. The Real Application
Cluster coordinates traffic among the systems or nodes, allowing the
instances to function as a single database. As a result, the database can scale
across hundreds of nodes. Since the cluster provides a means by which
multiple instances can access the same data, the failure of a single instance
will not cause extensive delays while the system recovers; you can simply
redirect users to another instance that’s still operating. You can write
applications with the Oracle Call Interface (OCI) to provide failover to a
second instance transparently to the user.

24.3.7.9 Parallel Fail Safe/RACGuard


Parallel Fail Safe, renamed RACGuard in Oracle9i, provides automated
failover with bounded recovery time in conjunction with Oracle Parallel
Server/Real Application Clusters. In addition, Parallel Fail Safe provides
client rerouting from the failed instance to the instance that is available with
fast reconnect and automatically captures diagnostic data.
24.3.8 Oracle Internet Developer Suite
Many Oracle tools are available to developers to help them present data and
build more sophisticated Oracle database applications. Although this book
focuses on the Oracle database, this section briefly describes the main Oracle
tools for application development: Oracle Forms Developer, Oracle Reports
Developer, Oracle Designer, Oracle JDeveloper, Oracle Discoverer
Administrative Edition and Oracle Portal.

24.3.8.1 Oracle Forms Developer


Oracle Forms Developer provides a powerful tool for building forms-based
applications and charts for deployment as traditional client/server
applications or as three-tier browser-based applications via Oracle9i
Application Server. Developer is a fourth-generation language (4GL). With a
4GL, you define applications by defining values for properties, rather than
by writing procedural code. Developer supports a wide variety of clients,
including traditional client/server PCs and Java-based clients. Version 6 of
Developer adds more options for creating easier-to-use applications,
including support for animated controls in user dialogues and enhanced user
controls. The Forms Builder in Version 6 includes a built-in JVM for
previewing web applications.

24.3.8.2 Oracle Reports Developer


Oracle Reports Developer provides a development and deployment
environment for rapidly building and publishing web-based reports via
Reports for Oracle9i Application Server. Data can be formatted in tables,
matrices, group reports, graphs and combinations. High-quality presentation
is possible using the HTML extension Cascading Style Sheets (CSS).

24.3.8.3 Oracle JDeveloper


Oracle JDeveloper was introduced by Oracle in 1998 to develop basic Java
applications without writing code. JDeveloper includes a Data Form wizard,
a BeansExpress wizard for creating JavaBeans and BeanInfo classes and a
Deployment wizard. JDeveloper includes database development features
such as various Oracle drivers, a Connection Editor to hide the JDBC API
complexity, database components to bind visual controls and a SQLJ
precompiler for embedding SQL in Java code, which you can then use with
Oracle. You can also deploy applications developed with JDeveloper using
the Oracle9i Application Server. Although JDeveloper uses wizards to allow
programmers to create Java objects without writing code, the end result is
generated Java code. This Java implementation makes the code highly
flexible, but it is typically a less productive development environment than a
true 4GL.

24.3.8.4 Oracle Designer


Oracle Designer provides a graphical interface for Rapid Application Development (RAD) for the entire database development process, from building the business model to schema design, generation and deployment.
Designs and changes are stored in a multiuser repository. The tool can
reverse-engineer existing tables and database schemas for reuse and redesign
from Oracle and non-Oracle relational databases.
Designer also includes generators for creating applications for Oracle
Developer, HTML clients using Oracle9i Application Server and C++.
Designer can generate applications and reverse-engineer existing
applications or applications that have been modified by developers. This
capability enables a process called round-trip engineering, in which a
developer uses Designer to generate an application, modifies the generated
application and reverse-engineers the changes back into the Designer
repository.

24.3.8.5 Oracle Discoverer Administration Edition


Oracle Discoverer Administration Edition enables administrators to set up
and maintain the Discoverer End User Layer (EUL). The purpose of this
layer is to shield business analysts using Discoverer as an ad hoc query or
ROLAP tool from SQL complexity. Wizards guide the administrator through
the process of building the EUL. In addition, administrators can put limits on
resources available to analysts monitored by the Discoverer query governor.

24.3.8.6 Oracle9i AS Portal


Oracle9iAS Portal, introduced as WebDB in 1999, provides an HTML-based
tool for developing web-enabled applications and content-driven web sites.
Portal application systems are developed and deployed in a simple browser
environment. Portal includes wizards for developing application components
incorporating “servlets” and access to other HTTP web sites. For example,
Oracle Reports and Discoverer may be accessed as servlets. Portals can be
designed to be user-customisable. They are deployed to the middle-tier
Oracle9i Application Server.
The main enhancement that Oracle9iAS Portal brings to WebDB is the
ability to create and use portlets, which allows a single web page to be
divided up into different areas that can independently display information
and interact with the user.

24.3.9 Oracle Lite


Oracle Lite is Oracle’s suite of products for enabling mobile use of database-
centric applications. Key components of Oracle Lite include the Oracle Lite
database and iConnect, which consists of Advanced Replication, Oracle
Mobile Agents (OMA), Oracle Lite Consolidator for Palm and Oracle AQ
Lite.
Although the Oracle Lite database engine can operate with much less memory than other Oracle implementations (it requires less than 1 MB of memory to run on a laptop), Oracle SQL, C, C++ and Java-based applications can run against the database. Java support includes support of
Java stored procedures, JDBC and SQLJ. The database is self-tuning and
self-administering. In addition to Windows-based laptops, Oracle Lite is also
supported on handheld devices running WindowsCE and Palm OS.
A variety of replication possibilities exist between Oracle and Oracle Lite,
including the following:
Connection-based replication via Oracle Net, Net8 or SQL*Net synchronous connections.
Wireless replication through the use of Advanced Queuing Lite, which provides a messaging
service compatible with Oracle Advanced Queuing (and replaces the Oracle Mobile Agents
capability available in previous versions of Oracle Lite).
File-based replication via standards such as FTP and MAPI.
Internet replication via HTTP or MIME.

You can define replication of subsets of data via SQL statements. Because data distributed to multiple locations can lead to conflicts (such as which location now has the “true” version of the data), multiple conflict-resolution algorithms are provided. Alternatively, you can write your own algorithm.
In the typical usage of Oracle Lite, the user will link her handheld or
mobile device running Oracle Lite to an Oracle Database Server. Data and
applications will be synchronised between the two systems. The user will
then remove the link and work in disconnected mode. After she has
performed her tasks, she will relink and resynchronise the data with the
Oracle Database Server.

24.4 SQL*PLUS

SQL*Plus is the interactive (low-level) user interface to the Oracle database management system. Typically, SQL*Plus is used to issue ad-hoc queries and to view the query result on the screen.

24.4.1 Features of SQL*Plus


A built-in command line editor can be used to edit (incorrect) SQL queries. Instead of this line editor, any editor installed on the computer can be invoked.
There are numerous commands to format the output of a query.
SQL*Plus provides online help.
Query results can be stored in files which can then be printed. Queries that are frequently issued can be saved to a file and invoked later. Queries can be parameterised such that it is possible to invoke a saved query with a parameter.

24.4.2 Invoking SQL*Plus


Before you start SQL*Plus make sure that the following UNIX shell
variables are properly set (shell variables can be checked using the env
command, for example, env | grep ORACLE):
ORACLE_HOME, e.g., ORACLE_HOME=/usr/pkg/oracle/734
ORACLE_SID, e.g., ORACLE_SID=prod

In order to invoke SQL*Plus from a UNIX shell, the command sqlplus has to be issued. SQL*Plus then displays some information about the product, as shown in Fig. 24.4, and prompts you for your user name and password for the Oracle system.

Fig. 24.4 Invoking SQL*Plus

SQL> is the prompt you get when you are connected to the Oracle
database system. In SQL*Plus you can divide a statement into separate lines,
each continuing line is indicated by a prompt such as 2>, 3> and so on. An
SQL statement must always be terminated by a semicolon (;). In addition to
the SQL statements, SQL*Plus provides some special SQL*Plus commands.
These commands need not be terminated by a semicolon. Upper and lower
case letters are only important for string comparisons. An SQL query can
always be interrupted by using <Control>C. To exit SQL*Plus you can
either type exit or quit.
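For example, a statement entered over several lines might look as follows (EMP is a sample table; the exact appearance of the continuation prompt depends on the SQL*Plus version):

SQL> select ENAME, SAL
2> from EMP
3> where SAL > 2000;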

24.4.3 Editor Commands


The most recently issued SQL statement is stored in the SQL buffer,
independent of whether the statement has a correct syntax or not. You can
edit the buffer using the following commands:
l[ist] lists all lines in the SQL buffer and sets the current line (marked with an “*”) to the last line in the buffer.
l <number> sets the current line to <number>.
c[hange]/<old string>/<new string> replaces the first occurrence of <old string> by <new string> (for the current line).
a[ppend] <string> appends <string> to the current line.
del deletes the current line.
r[un] executes the current buffer contents.
get <file> reads the data from the file <file> into the buffer.
save <file> writes the current buffer into the file <file>.
edit invokes an editor and loads the current buffer into the editor. After exiting the editor, the modified SQL statement is stored in the buffer and can be executed (command r).

The editor can be defined in the SQL*Plus shell by typing the command define editor = <name>, where <name> can be any editor such as emacs, vi, joe or jove.
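A short sample editing session, assuming the buffer contains a query with a misspelled column name:

SQL> l
  1  select ENAME, SALL
  2* from EMP
SQL> l 1
  1* select ENAME, SALL
SQL> c/SALL/SAL
  1* select ENAME, SAL
SQL> r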

24.4.4 SQL*Plus Help System and Other Useful Commands


To get online help in SQL*Plus, just type help <command>, or just help to get information about how to use the help command. In Oracle Version 7 one can get the complete list of possible commands by typing help command.
To change the password, in Oracle Version 7 the command alter user <user> identified by <new password>; is used. In Oracle Version 8 the command passw <user> prompts the user for the old/new password.
The command desc[ribe] <table> lists all columns of the given table together with their data types and information about whether null values are allowed or not.
You can invoke a UNIX command from the SQL*Plus shell by using host <UNIX command>. For example, host ls -la *.sql lists all SQL files in the current directory.
You can log your SQL*Plus session, and thus queries and query results, by using the command spool <file>. All information displayed on screen is then stored in <file>, which automatically gets the extension .lst. The command spool off turns spooling off.
The command copy can be used to copy a complete table. For example, the command copy from scott/tiger create EMPL using select * from EMP; copies the table EMP of the user scott with password tiger into the relation EMPL. The relation EMPL is automatically created and its structure is derived based on the attributes listed in the select clause.
SQL commands saved in a file <name>.sql can be loaded into SQL*Plus and executed using the command @<name>.
Comments are introduced by the clause rem[ark] (only allowed between SQL statements), or -- (allowed within SQL statements).

24.4.5 Formatting the Output


SQL*Plus provides numerous commands to format query results and to
build simple reports. For this, format variables are set and these settings are
only valid during the SQL*Plus session. They get lost after terminating
SQL*Plus. It is, however, possible to save settings in a file named login.sql in your home directory. Each time you invoke SQL*Plus, this file is automatically loaded.
The command column <column name> <option 1> <option 2> … is used
to format columns of your query result. The most frequently used options
are:
format A<n> For alphanumeric data, this option sets the display length of <column name> to <n>.
For columns having the data type number, the format command can be used to specify the format before and after the decimal point. For example, format 99,999.99 specifies that if a value has more than three digits in front of the decimal point, the digits are separated by a comma, and only two digits are displayed after the decimal point.
The option heading <text> relabels <column name> and gives it a new heading.
null <text> is used to specify the output of null values (typically, null values are not
displayed).
column <column name> clear deletes the format definitions for <column name>.

The command set linesize <number> can be used to set the maximum
length of a single line that can be displayed on screen.
set pagesize <number> sets the total number of lines SQL*Plus displays
before printing the column names and headings, respectively, of the selected
rows. Several other formatting features can be enabled by setting SQL*Plus
variables.
The command show all displays all variables and their current values.
To set a variable, type set <variable> <value>. For example, set timing on
causes SQL*Plus to display timing statistics for each SQL command that is
executed.
set pause on [<text>] makes SQL*Plus wait for you to press Return after
the number of lines defined by set pagesize has been displayed. <text> is the
message SQL*Plus will display at the bottom of the screen as it waits for
you to hit Return.
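The following commands illustrate these settings for a query against a sample EMP table (the heading text and format masks are illustrative choices):

column ENAME heading 'Employee' format A15
column SAL format 99,990.99
set linesize 80
set pagesize 24
select ENAME, SAL from EMP where SAL > 2000;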

24.5 ORACLE’S DATA DICTIONARY


The Oracle data dictionary is one of the most important components of the
Oracle DBMS. It contains all information about the structures and objects of
the database such as tables, columns, users, data files and so on. The data
stored in the data dictionary are also often called metadata. Although it is
usually the domain of database administrators (DBAs), the data dictionary is
a valuable source of information for end users and developers. The data
dictionary consists of two levels: the internal level contains all base tables
that are used by the various DBMS software components and they are
normally not accessible by end users. The external level provides numerous
views on these base tables to access information about objects and structures
at different levels of detail.

24.5.1 Data Dictionary Tables


An installation of an Oracle database always includes the creation of three
standard Oracle users:
SYS: This is the owner of all data dictionary tables and views. This user has the highest
privileges to manage objects and structures of an Oracle database such as creating new users.
SYSTEM: This is the owner of tables used by different tools such as SQL*Forms, SQL*Reports, etc. This user has fewer privileges than SYS.
PUBLIC: This is a “dummy” user in an Oracle database. All privileges assigned to this user
are automatically assigned to all users known in the database.

The tables and views provided by the data dictionary contain information
about the following:
Users and their privileges.
Tables, table columns and their data types, integrity constraints and indexes.
Statistics about tables and indexes used by the optimiser.
Privileges granted on database objects.
Storage structures of the database.

The SQL command select * from DICT[IONARY]; lists all tables and views of the data dictionary that are accessible to the user. The selected information includes the name and a short description of each table and view. Before issuing this query, check the column definitions of DICT[IONARY] using desc DICT[IONARY] and set the appropriate column formats using the format command.
The query select * from TAB; retrieves the names of all tables owned by the user who issues this command. The query select * from COL; returns all information about the columns of one’s own tables.
Each SQL query requires various internal accesses to the tables and views
of the data dictionary. Since the data dictionary itself consists of tables,
Oracle has to generate numerous SQL statements to check whether the SQL
command issued by a user is correct and can be executed.

For example, the SQL Query

select * from EMP

where SAL > 2000;

requires a verification whether (1) the table EMP exists, (2) the user has
the privilege to access this table, (3) the column SAL is defined for this table
and so on.

24.5.2 Data Dictionary Views


The external level of the data dictionary provides users a front end to access
information relevant to the users. This level provides numerous views (in
Oracle7 approximately 540) that represent (a portion of the) data from the
base tables in a readable and understandable manner. These views can be
used in SQL queries just like normal tables. The views provided by the data
dictionary are divided into three groups: USER, ALL and DBA.

USER

Tuples in the USER views contain information about objects owned by the
account performing the SQL query (current user).
USER_TABLES: all tables with their name, number of columns, storage information, statistical information and so on (TABS).
USER_CATALOG: tables, views and synonyms (CAT).
USER_COL_COMMENTS: comments on columns.
USER_CONSTRAINTS: constraint definitions for tables.
USER_INDEXES: all information about indexes created for tables (IND).
USER_OBJECTS: all database objects owned by the user (OBJ).
USER_TAB_COLUMNS: columns of the tables and views owned by the user (COLS).
USER_TAB_COMMENTS: comments on tables and views.
USER_TRIGGERS: triggers defined by the user.
USER_USERS: information about the current user.
USER_VIEWS: views defined by the user.

ALL

Rows in the ALL views include rows of the USER views and all information
about objects that are accessible to the current user. The structure of these
views is analogous to the structure of the USER views.
ALL_CATALOG: owner, name and type of all accessible tables, views and synonyms.
ALL_TABLES: owner and name of all accessible tables.
ALL_OBJECTS: owner, type and name of accessible database objects.
ALL_TRIGGERS …
ALL_USERS …
ALL_VIEWS …

DBA

The DBA views encompass information about all database objects, regardless of the owner. Only users with DBA privileges can access these views.
DBA_TABLES: tables of all users in the database.
DBA_CATALOG: tables, views and synonyms defined in the database.
DBA_OBJECTS: objects of all users.
DBA_DATA_FILES: information about data files.
DBA_USERS: information about all users known in the database.
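For example, the views listed above can be queried like ordinary tables (the column names shown are those commonly found in these views):

select TABLE_NAME, NUM_ROWS from USER_TABLES;
select OWNER, OBJECT_NAME, OBJECT_TYPE from ALL_OBJECTS where OBJECT_TYPE = 'TABLE';
select USERNAME, DEFAULT_TABLESPACE from DBA_USERS;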

24.6 ORACLE SYSTEM ARCHITECTURE

In the following sections, the main components of the Oracle DBMS (Version 7.X) architecture and the logical and physical database structures are discussed.
24.6.1 Storage Management and Processes
The Oracle DBMS server is based on a so-called multi-server architecture.
The server is responsible for processing all database activities such as the
execution of SQL statements, user and resource management and storage
management. Although there is only one copy of the program code for the
DBMS server, to each user connected to the server logically a separate
server is assigned. Fig. 24.5 illustrates the architecture of the Oracle DBMS
consisting of storage structures, processes and files.

24.6.1.1 System Global Area (SGA)


Each time a database is started on the server (instance startup), a portion of
the computer’s main memory is allocated, the so-called System Global Area
(SGA). The SGA consists of the shared pool, the database buffer and the
redo-log buffer. Furthermore, several background processes are started. The
combination of SGA and processes is called database instance. The memory
and processes associated with an instance are responsible for efficiently managing the data stored in the database and for allowing users to access the database concurrently. The Oracle server can manage multiple instances;
typically each instance is associated with a particular application domain.
The SGA serves as that part of the memory where all database operations
occur. If several users connect to an instance at the same time, they all share
the SGA. The information stored in the SGA can be subdivided into the
following three caches.
Fig. 24.5 Oracle System Architecture

24.6.1.2 Database Buffer


The database buffer is a cache in the SGA used to hold the data blocks that
are read from data files. Blocks can contain table data, index data and others.
Data blocks are modified in the database buffer. Oracle manages the space
available in the database buffer by using a least recently used (LRU)
algorithm. When free space is needed in the buffer, the least recently used
blocks will be written out to the data files. The size of the database buffer
has a major impact on the overall performance of a database.

24.6.1.3 Redo-Log-Buffer
This buffer contains information about changes of data blocks in the database buffer. While the redo-log buffer is filled during data modifications, the log writer process writes information about the modifications to the redo-log files. These files are used, for example after a system crash, to restore the database (database recovery).

Shared Pool

The shared pool is the part of the SGA that is used by all users. The main components of this pool are the dictionary cache and the library cache. Information about database objects is stored in the data dictionary tables. When information is needed by the database, for example, to check whether a table column specified in a query exists, the dictionary tables are read and the data returned is stored in the dictionary cache.

24.6.1.4 Library Cache


Note that all SQL statements require accessing the data dictionary. Thus,
keeping the relevant portions of the dictionary in the cache may increase the
performance. The library cache contains information about the most recently
issued SQL commands such as the parse tree and query execution plan. If
the same SQL statement is issued several times, it need not be parsed again
and all information about executing the statement can be retrieved from the
library cache.
Further storage structures in the computer’s main memory are the log-archive buffer (optional) and the Program Global Area (PGA). The log-archive buffer is used to temporarily cache redo-log entries that are to be archived in special files. The PGA is the area in memory that is used by a single Oracle user process. It contains the user’s context area (cursors, variables and others), as well as process information. The memory in the PGA is not sharable.
For each database instance, there is a set of processes. These processes
maintain and enforce the relationships between the database’s physical
structures and memory structures. The number of processes varies depending
on the instance configuration. One can distinguish between user processes
and Oracle processes. Oracle processes are typically background processes
that perform I/O operations at database run-time.

24.6.1.5 DBWR
This process is responsible for managing the contents of the database buffer
and the dictionary cache. For this, DBWR writes modified data blocks to the
data files. The process only writes blocks to the files if more blocks are
going to be read into the buffer than free blocks exist.

24.6.1.6 LGWR
This process manages writing the contents of the redo-log-buffer to the redo-
log files.

24.6.1.7 SMON
When a database instance is started, the system monitor process performs
instance recovery as needed (for example, after a system crash). It cleans up
the database from aborted transactions and objects involved. In particular,
this process is responsible for coalescing contiguous free extents to larger
extents.

24.6.1.8 PMON
The process monitor process cleans up behind failed user processes and it
also cleans up the resources used by these processes. Like SMON, PMON
wakes up periodically to check whether it is needed.

24.6.1.9 ARCH (optional)


The LGWR background process writes to the redo-log files in a cyclic
fashion. Once the last redo-log file is filled, LGWR overwrites the contents
of the first redo-log file. It is possible to run a database instance in the
archive-log mode. In this case the ARCH process copies redo-log entries to
archive files before the entries are overwritten by LGWR. Thus, it is possible
to restore the contents of the database to any time after the archive-log mode was started.

24.6.1.10 USER
The task of this process is to communicate with other processes started by
application programs such as SQL*Plus. The USER process then is
responsible for sending respective operations and requests to the SGA or
PGA. This includes, for example, reading data blocks.

24.6.2 Logical Database Structure


For the architecture of an Oracle database we distinguish between logical
and physical database structures that make up a database. Logical structures
describe logical areas of storage (name spaces) where objects such as tables
can be stored. Physical structures, in contrast, are determined by the
operating system files that constitute the database. The logical database
structures include the following:

24.6.2.1 Database
A database consists of one or more storage divisions, so-called tablespaces.

24.6.2.2 Tablespaces
A tablespace is a logical division of a database. All database objects are
logically stored in tablespaces. Each database has at least one tablespace, the
SYSTEM tablespace, that contains the data dictionary. Other tablespaces can
be created and used for different applications or tasks.
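A minimal sketch of creating an additional tablespace and placing a table in it (the file path, sizes and names are illustrative assumptions):

create tablespace APP_DATA
datafile '/disk1/oradata/prod/app_data01.dbf' size 100M;

create table PROJECTS (
  PROJ_ID   number,
  PROJ_NAME varchar2(40)
) tablespace APP_DATA;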

24.6.2.3 Segments
If a database object (for example, a table or a cluster) is created,
automatically a portion of the tablespace is allocated. This portion is called a
segment. For each table there is a table segment. For indexes, the so-called
index segments are allocated. The segment associated with a database object
belongs to exactly one tablespace.

24.6.2.4 Extent
An extent is the smallest logical storage unit that can be allocated for a database object, and it consists of a contiguous sequence of data blocks. If the size of a database object increases (for example, due to insertions of tuples into a table), an additional extent is allocated for the object. Information about the extents allocated for database objects can be found in the data dictionary view USER_EXTENTS.
A special type of segments are rollback segments. They do not contain a
database object, but contain a “before image” of modified data for which the
modifying transaction has not yet been committed. Modifications are undone
using rollback segments. Oracle uses rollback segments in order to maintain
read consistency among multiple users. Furthermore, rollback segments are
used to restore the “before image” of modified tuples in the event of a
rollback of the modifying transaction. Typically, an extra tablespace (RBS)
is used to store rollback segments. This tablespace can be defined during the
creation of a database. The size of this tablespace and its segments depends
on the type and size of transactions that are typically performed by
application programs.
A database typically consists of a SYSTEM tablespace containing the data
dictionary and further internal tables, procedures etc., and a tablespace for
rollback segments. Additional tablespaces include a tablespace for user data
(USERS), a tablespace for temporary query results and tables (TEMP) and a
tablespace used by applications such as SQL*Forms (TOOLS).

24.6.3 Physical Database Structure


The physical database structure of an Oracle database is determined by files
and data blocks:

24.6.3.1 Data Files


A tablespace consists of one or more operating system files that are stored on disk. Thus, a database essentially is a collection of data files that can be stored on different storage devices (magnetic tape, optical disks and so on). Typically, only magnetic disks are used. Multiple data files for a tablespace allow the server to distribute a database object over multiple disks (depending on the size of the object).

24.6.3.2 Blocks
An extent consists of one or more contiguous Oracle data blocks. A block
determines the finest level of granularity of where data can be stored. One
data block corresponds to a specific number of bytes of physical database
space on disk. A data block size is specified for each Oracle database when
the database is created. A database uses and allocates free database space in
Oracle data blocks. Information about data blocks can be retrieved from the data dictionary views USER_SEGMENTS and USER_EXTENTS. These views show how many blocks are allocated for a database object and how many blocks are available (free) in a segment/extent.
As mentioned in Section 24.6.1, aside from datafiles three further types of
files are associated with a database instance:

24.6.3.3 Redo-Log Files


Each database instance maintains a set of redo-log files. These files are used
to record logs of all transactions. The logs are used to recover the database’s
transactions in their proper order in the event of a database crash (the
recovering operations are called roll forward). When a transaction is
executed, modifications are entered in the redo-log buffer, while the blocks affected by the transaction are not immediately written back to disk, thus allowing the performance to be optimised through batch writes.

24.6.3.4 Control Files


Each database instance has at least one control file. In this file the name of
the database instance and the locations (disks) of the data files and redo-log
files are recorded. Each time an instance is started, the data and redo-log
files are determined by using the control file(s).

24.6.3.5 Archive/Backup Files


If an instance is running in the archive-log mode, the ARCH process archives the modifications of the redo-log files in extra archive or backup files. In contrast to redo-log files, these files are typically not overwritten.
The ER schema shown in Fig. 24.6 illustrates the architecture of an Oracle database instance and the relationships between physical and logical database structures (relationships can be read as “consists of”).

Fig. 24.6 Relationships between logical and physical database structures

24.7 INSTALLATION OF ORACLE 9I

The following instructions guide you through the installation of Oracle 9i Release 2.
REVIEW QUESTIONS
1. What is Oracle? Who developed Oracle?
2. List the names of operating systems supported by Oracle.
3. Discuss the evolution of the Oracle family of database products and Oracle’s major feature introductions.
4. What is the Oracle software product line?
5. Discuss the application development features of Oracle.
6. Discuss the communication features of Oracle.
7. Discuss the distributed database features of Oracle.
8. Discuss the data movement features of Oracle.
9. Discuss the performance features of Oracle.
10. Discuss the database management features of Oracle.
11. Discuss the backup and recovery features of Oracle.
12. What is Oracle Internet developer suite? Explain.
13. What is Oracle Lite? Explain.
14. What is SQL*Plus? What are its features?
15. How is SQL*Plus invoked?
16. What is Oracle’s data dictionary? Explain its significance.
17. Discuss the Oracle architecture with a neat sketch.
18. What do you mean by logical and physical database structures?
19. With a neat diagram, explain the relationship between logical and physical database
structures.

STATE TRUE/FALSE

1. In 1983, a portable version of Oracle (Version 3) was created that ran only on Digital
VAX/VMS systems.
2. Oracle Personal Edition is the single-user version of Oracle Enterprise Edition.
3. Oracle8i introduced the use of Java as a procedural language with a Java Virtual Machine
(JVM) in the database.
4. National Language Support (NLS) provides character sets and associated functionality, such
as date and numeric formats, for a variety of languages.
5. SQL*Plus is used to issue ad-hoc queries and to view the query result on the screen.
6. The SGA serves as that part of the hard disk where all database operations occur.

TICK (✓) THE APPROPRIATE ANSWER

1. Oracle is a

a. relational DBMS.
b. hierarchical DBMS.
c. networking DBMS.
d. None of these.

2. Oracle Corporation was created by

a. Lawrence Ellison.
b. Bob Miner.
c. Ed Oates.
d. All of these.

3. First commercial Oracle database was developed in

a. 1977.
b. 1979.
c. 1983.
d. 1985.

4. A portable version of Oracle (Version 3) was created that ran not only on Digital VAX/VMS
systems in

a. 1977.
b. 1979.
c. 1983.
d. 1985.

5. The first version of Oracle, version 2.0, was written in assembly language for the

a. Macintosh machine.
b. IBM Machine.
c. HP machine.
d. DEC PDP-11 machine.

6. Oracle was developed on the basis of paper on

a. System/R.
b. DB2.
c. Sybase.
d. None of these.

7. Oracle 8 was released in

a. 1997.
b. 1999.
c. 2000.
d. 2001.

8. Oracle 8i was released in

a. 1997.
b. 1999.
c. 2000.
d. 2001.

9. Oracle 9i database server was released in


a. 1997.
b. 1999.
c. 2000.
d. 2001.

10. Oracle DBMS server is based on a

a. single-server architecture.
b. multi-server architecture.
c. Both (a) and (b).
d. None of these.

FILL IN THE BLANKS

1. The first version of Oracle, version 2.0, was written in assembly language for the _____
machine.
2. Oracle 9i application server was developed in the year _____ and the database server was
developed in the year _____.
3. Oracle Lite is intended for single users who are using _____ devices.
4. Oracle’s PL/SQL is commonly used to implement _____ modules for applications.
5. Oracle Lite is Oracle’s suite of products for enabling _____ use of database-centric
applications.
6. SQL*Plus is the _____ to the Oracle database management system.
7. SGA is expanded as _____.
Chapter 25

Microsoft SQL Server

25.1 INTRODUCTION

Microsoft SQL Server (MSSQL) is a relational database management
system that was originally developed in the 1980s at Sybase for UNIX systems.
Microsoft later ported it to Windows NT. It is a multithreaded server
that scales from laptops and desktops to enterprise servers. It has a
compatible version based on PocketPC operating system, available for
handheld devices such as PocketPCs and bar-code scanners. Since 1994,
Microsoft has shipped SQL Server releases developed independently of
Sybase, which stopped using the SQL Server name in the late 1990s.
Microsoft SQL Server can operate on clusters and symmetrical
multiprocessing (SMP) configurations. The latest available release of
Microsoft SQL Server is SQL Server 2000, available in personal, developer,
standard and enterprise editions and localised for many languages around the
world. Microsoft now plans to release SQL Server 2005 later this year.
This chapter gives brief introduction to Microsoft SQL Server and some
of the features for server programming when creating database applications.

25.2 MICROSOFT SQL SERVER SETUP

Microsoft SQL Server is an application used to create computer databases
for the Microsoft Windows family of server operating systems. It provides
an environment used to generate databases that can be accessed from
workstations, the web or other media such as a personal digital assistant
(PDA). Microsoft SQL Server is probably the most accessible and the most
documented enterprise database environment right now.
25.2.1 SQL Server 2000 Editions
Microsoft SQL Server 2000 is a full-featured relational database
management system (RDBMS) that offers a variety of administrative tools to
ease the burdens of database development, maintenance and administration.
In this chapter, six of the more frequently used tools will be covered:
Enterprise Manager, Query Analyser, SQL Profiler, Service Manager, Data
Transformation Services and Books Online.

25.2.1.1 Enterprise Manager


Enterprise Manager is the main administrative console for SQL Server
installations. It provides a graphical “birds-eye” view of all of the SQL
Server installations on your network. You can perform high-level administrative
functions that affect one or more servers, schedule common maintenance tasks,
and create and modify the structure of individual databases.

25.2.1.2 Query Analyser


Query Analyser offers a quick and dirty method for performing queries
against any of the SQL Server databases. It is a great way to quickly pull
information out of a database in response to a user request, test queries
before implementing them in other applications, create or modify stored
procedures and execute administrative tasks.

25.2.1.3 SQL Profiler


SQL Profiler provides a window into the inner workings of your database.
Different event types can be monitored and database performance in real
time can be observed. SQL Profiler allows you to capture and replay system
“traces” that log various activities. It is a great tool for optimizing databases
with performance issues or troubleshooting particular problems.

25.2.1.4 Service Manager


Service Manager is used to control the MSSQLServer (the main SQL Server
process), MSDTC (Microsoft Distributed Transaction Coordinator) and
SQLServerAgent processes. An icon for this service normally resides in the
system tray of machines running SQL Server. Service Manager can be used
to start, stop or pause any one of these services.

25.2.1.5 Data Transformation Services (DTS)


Data Transformation Services (DTS) provide an extremely flexible method for
importing and exporting data between a Microsoft SQL Server installation
and a large variety of other formats. The most commonly used DTS
application is the “Import and Export Data” wizard found in the SQL Server
program group.

25.2.1.6 Books Online


Books Online is an often overlooked resource provided with SQL Server that
contains answers to a variety of administrative, development and installation
issues. It is a great resource to consult before turning to the Internet or
technical support.

25.2.2 SQL Server 2005 Editions


Microsoft plans to release SQL Server 2005 later this year and has packed
the new database engine full of features. There are four different editions of
SQL Server 2005 that Microsoft plans to release:
SQL Server 2005 Express: It will replace the Microsoft Data Engine (MSDE) edition of SQL
Server for application development and lightweight use. It will be a good tool for developing
and testing applications and extremely small implementations.
SQL Server 2005 Workgroup: It is the new product line, billed as a “small business SQL
Server”. Workgroup edition can have 2 CPUs with 3GB of RAM and will allow for most of
the functionality one would expect from a server-based relational database. It offers limited
replication capabilities as well.
SQL Server 2005 Standard Edition: It is the staple of the product line for serious database
applications. It can handle up to 4 CPUs with an unlimited amount of RAM. Standard
Edition 2005 introduces database mirroring and integration services.
SQL Server 2005 Enterprise Edition: With the release of 2005, Enterprise Edition will
allow unlimited scalability and partitioning.

25.2.3 Features of Microsoft SQL Server


SQL Server supports fallback servers and clusters using Microsoft Cluster Server. Fallback
support provides the ability of a backup server, given the appropriate hardware, to take over
the functions of a failed server.
SQL Server supports network management using the Simple Network Management Protocol
(SNMP).
It supports distributed transactions using Microsoft Distributed Transaction Coordinator (MS
DTC) and Microsoft Transaction Server (MTS). MS DTC is an integrated component of
Microsoft SQL Server.
SQL Server includes an e-mail interface (SQL Mail) and uses Open Database Connectivity
(ODBC) to support replication services among multiple copies of SQL Server as well as with
other database systems.
It provides Analytical Services, which is an integral part of the system, and includes online
analytical processing (OLAP) and data mining facilities.
SQL Server provides a large collection of graphical tools and “wizards” that guide database
administrators through tasks such as setting up regular backups, replicating data among
servers and tuning a database for performance.
SQL Server provides a suite of tools for managing all aspects of SQL Server development,
querying, tuning, testing and administration.
SQL Distributed Objects (SQLOLE) exposes SQL Server management functions as COM
objects and an Automation interface.
It supports Java as a language for building user-defined functions (UDFs) and stored
procedures.
Microsoft SQL Server is part of the BackOffice suite of applications for Windows NT.
Microsoft provides technology to link SQL Server to Internet Information Server (IIS). The
options for connecting IIS and SQL Server include Internet Database Connector and Active
Server Pages.
SQL Server includes visual tools such as an interactive SQL processor and a database
administration tool named SQL Enterprise Manager. Enterprise Manager helps the users in
managing multiple servers and databases, as well as database objects, such as triggers and
stored procedures.
Microsoft SQL Server consists of other components such as MS Query, SQL Trace and SQL
Server Web Assistant.
Microsoft SQL Server supports OLAP, data marts and data warehouses.
It adds CUBE and ROLLUP operations to T-SQL, integrates with the OLAP server and adds
OLAP extensions to OLE DB (see the sketch at the end of this list).
It supports parallel queries and special handling of star schemas and partitioned views used in
data marts and data warehouses.
Microsoft SQL Server supports triggers, stored procedures, declarative referential integrity
and SQL-92 Entry Level.
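As a brief illustration of the CUBE extension mentioned in the list above, the following T-SQL sketch uses a hypothetical SALES table (the columns Region, Product and Amount are assumptions, not objects from this chapter); GROUP BY ... WITH CUBE produces subtotal rows for every combination of the grouping columns:

-- Hypothetical SALES(Region, Product, Amount) table; WITH CUBE adds
-- subtotal rows for every combination of Region and Product.
SELECT Region, Product, SUM(Amount) AS TotalAmount
FROM SALES
GROUP BY Region, Product
WITH CUBE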

25.3 STORED PROCEDURES IN SQL SERVER

Microsoft SQL Server provides the stored procedure mechanism to simplify
the database development process by grouping Transact-SQL statements into
manageable blocks.

25.3.1 Benefits of Stored Procedures


Precompiled execution: SQL Server compiles each stored procedure once and then reutilises
the execution plan. This results in tremendous performance boosts when stored procedures
are called repeatedly.
Reduced client/server traffic: Stored procedures can reduce long SQL queries to a single line
that is transmitted over the wire.
Efficient reuse of code and programming abstraction: Stored procedures can be used by
multiple users and client programs. If utilised in a planned manner, the development cycle
takes less time.
Enhanced security controls: Users can be granted permission to execute a stored procedure
independently of the permissions on the underlying tables (see the sketch below).
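As a sketch of this last point, assuming the sp_GetInventory procedure defined in the next section and a hypothetical user account named WarehouseUser, execute permission can be granted without giving the account any rights on the underlying table:

-- WarehouseUser is a hypothetical account; it needs no SELECT permission
-- on the INVENTORY table itself to run the procedure.
GRANT EXECUTE ON sp_GetInventory TO WarehouseUser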

25.3.2 Structure of Stored Procedures


Stored procedures are extremely similar to the constructs seen in other
programming languages. They accept data in the form of input parameters
that are specified at execution time. These input parameters (if implemented)
are utilised in the execution of a series of statements that produce some
result. This result is returned to the calling environment through the use of a
recordset, output parameters and a return code. That may sound like a
mouthful, but stored procedures are actually quite simple. Let us take a look
at a practical example.

Example

Assume we have a table named INVENTORY as shown in Table 25.1.


Table 25.1 Table INVENTORY
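Since the contents of Table 25.1 are not reproduced here, the following is only a minimal sketch of how such a table might be defined; the columns PRODUCT, QTY and WAREHOUSE are the ones referenced by the queries below, while the data types and sizes are assumptions:

-- A possible definition of the INVENTORY table (data types are assumed).
CREATE TABLE INVENTORY
(
    PRODUCT   varchar(30) NOT NULL,
    WAREHOUSE varchar(10) NOT NULL,
    QTY       int         NOT NULL
)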

This information is updated in real-time and warehouse managers are
constantly checking the levels of products stored at their warehouse and
available for shipment. In the past, each manager would run queries similar
to the following:

SELECT PRODUCT, QTY
FROM INVENTORY
WHERE WAREHOUSE = ‘JAMSHEDPUR’

This resulted in very inefficient performance at the SQL Server. Each time
a warehouse manager executed the query, the database server was forced to
recompile the query and execute it from scratch. It also required the
warehouse manager to have knowledge of SQL and appropriate permissions
to access the table information.
We can simplify this process through the use of a stored procedure. Let us
create a procedure called sp_GetInventory that retrieves the inventory levels
for a given warehouse. Here is the SQL code:

CREATE PROCEDURE sp_GetInventory @location varchar(10)
AS
SELECT PRODUCT, QTY
FROM INVENTORY
WHERE WAREHOUSE = @location
Our Jamshedpur warehouse manager can then access inventory levels by
issuing the command

EXECUTE sp_GetInventory ‘JAMSHEDPUR’

The New Delhi warehouse manager can use the same stored procedure to
access that area’s inventory.

EXECUTE sp_GetInventory ‘New Delhi’

The benefit of abstraction here is that the warehouse manager does not
need to understand SQL or the inner workings of the procedure. From a
performance perspective, the stored procedure will work wonders. SQL
Server creates an execution plan once and then reutilises it by plugging in the
appropriate parameters at execution time.
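As noted earlier, a stored procedure can also return values through output parameters and a return code. The following sketch (the procedure name sp_GetTotalQty is hypothetical) returns the total quantity held at a warehouse through an OUTPUT parameter:

-- Hypothetical procedure that returns a value through an OUTPUT parameter.
CREATE PROCEDURE sp_GetTotalQty @location varchar(10), @total int OUTPUT
AS
SELECT @total = SUM(QTY)
FROM INVENTORY
WHERE WAREHOUSE = @location
GO

-- Calling the procedure and reading the output parameter:
DECLARE @qty int
EXECUTE sp_GetTotalQty 'JAMSHEDPUR', @qty OUTPUT
SELECT @qty AS TotalQty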

25.4 INSTALLING MICROSOFT SQL SERVER 2000

The installation of Microsoft SQL Server, like that of various modern
products, is fairly easy, whether you are using a CD called SQL Server
Developer Edition, a DVD or a downloaded edition. If you have it on CD or
DVD, you can put it in the drive and follow the instructions on the screen, as
we will review them.

25.4.1 Installation Steps


The following steps describe the installation on a Microsoft Windows 2000
Server (using the Administrator account), a Windows XP Home Edition or a
Windows XP Professional, or of the downloaded edition on a Microsoft
Windows 2000 Professional.

Step 01: Log on to your Windows 2000 Server or open Windows
2000/XP Professional.
Step 02: Put the CD or DVD in the drive or download the trial
edition of SQL Server.
If you are using the CD or DVD, a borderless window should come up, as
shown in Fig. 25.1 (if it doesn't, open Windows Explorer, access the drive
that has the CD or DVD and double-click autorun).

Fig. 25.1 Microsoft SQL Server 2000 Window

If you had downloaded the file, you may have the Download Complete
dialogue box, as shown in Fig. 25.2.
Fig. 25.2 Download complete dialog box

Step 03: In this case, click Open. A dialogue box will indicate where
the file would be installed as shown in Fig. 25.3.
Fig. 25.3 Installation folder

Step 04: You can accept the default and click Finish. You may be
asked whether you want to create the new folder that does
not exist and you should click Yes. After a while, you
should receive a message indicating success, as shown in
Fig. 25.4.

Fig. 25.4 Creating installation folder


Step 05: Click OK.
If you are using the CD installation, click SQL Server 2000
Components or press Alt+C.

25.4.2 Starting and Stopping SQL Server


To use SQL Server, it must be started as a service. You have two options: you can
start it every time you want to use it, or you can make it start automatically whenever
the computer boots.

Step 1: To start SQL Server, on the Taskbar as shown in Fig. 25.5, click
Start -> Programs -> Microsoft SQL Server -> Service
Manager.

Fig. 25.5 SQL Server startup taskbar

Step 2: On the SQL Server Service Manager dialogue box as shown in
Fig. 25.6, click the Start/Continue button if necessary.
Fig. 25.6 SQL Server service manager dialogue box

Step 3: On the lower-right corner of the desktop, on the clock section of
the Taskbar, the button of SQL Server appears with a
“Start/Continue” green play button.
Step 4: Close the dialogue box.
Step 5: To stop the SQL Server service, double-click the SQL Server
icon with green (big) play button on the Taskbar system tray as
shown in Fig. 25.7.
Fig. 25.7 SQL Server service manager dialogue box

Step 6: On the SQL Server Service Manager dialogue box, click the
Stop button.
Step 7: You will receive a confirmation message box. Click Yes.

25.4.3 Starting the SQL Server Service Automatically


Step 1: Display the Control Panel window and double-click
Administrative Tools.
Step 2: In the Administrative Tools window as shown in Fig. 25.8,
double-click Services.
Fig. 25.8 Administrative tool window

Step 3: In the Services window as shown in Fig. 25.9, scroll to the
middle of the right frame and click MSSQLSERVER.
Step 4: On the toolbar (Fig. 25.9), click the Start Service button.
Step 5: Close the Services window.

25.4.4 Connection to Microsoft SQL Server Database System


After installing Microsoft SQL Server, you must open it before you can use
it. Before performing any database operation, you must first connect to the
database server.

Step 1: If you are planning to work on the server, on the taskbar as
shown in Fig. 25.10, you can click Start → (All) Programs and
position the mouse on Microsoft SQL Server. You can then
click either Query Analyser or Enterprise Manager:
Fig. 25.9 Service window

Fig. 25.10 Taskbar window


Step 2: If you had clicked Enterprise Manager, it would open the SQL
Server Enterprise Manager as shown in Fig. 25.11.

Fig. 25.11 SQL Server enterprise manager dialogue box

Step 3: You can also establish the connection through the SQL Query
Analyser. To do this, from the task bar, you can click Start →
(All) Programs → Microsoft SQL Server → Query Analyser.
This action would open the Connect to SQL Server dialogue
box as shown in Fig. 25.12.
Fig. 25.12 Connect to SQL Server dialogue box

Step 4: If the Enterprise Manager was already opened but the server or
none of its nodes is selected, on the toolbar of the MMC, you
can click Tools → SQL Query Analyser. This also would
display the Connect to SQL Server dialogue box.

25.4.5 The Sourcing of Data


To establish a connection, you must specify the computer that you are connecting
to, which has Microsoft SQL Server installed. If you are working from the
SQL Server Enterprise Manager as shown in Fig. 25.13, first expand the
Microsoft SQL Servers node, followed by the SQL Server Group. If you do
not see any name of a server, you may not have registered it (this is the case
with some installations, probably on Microsoft Windows XP Home Edition).
The following steps are used only if you need to register the new server.
Fig. 25.13 SQL Server enterprise manager dialogue box

Step 1: To proceed, you can right-click the SQL Server Group node and
click New SQL Server Registration as shown in Fig. 25.14.
Step 2: Click Next in the first page of the wizard as shown in Fig.
25.15.
Step 3: In the Register SQL Server Wizard and in the Available Servers
combo box, you can select the desired server or click (local),
then click Add as shown in Fig. 25.16.
Step 4: After selecting the server, you can click Next. In the third page
of the wizard as shown in Fig. 25.17, you would be asked to
specify how security for the connection would be handled. If
you are planning to work in a non-production environment
where you would not be concerned with security, the first radio
button would be fine. In most other cases, you should select the
second radio button as it allows you to eventually perform some
security tests during your development. This second radio
button is associated with an account created automatically
during installation. This account is called sa.
Fig. 25.14 SQL Server group and registration dialogue box

Fig. 25.15 Register SQL Server wizard


Fig. 25.16 Select SQL Server dialogue box
Fig. 25.17 Select an authentication mode dialogue box

Step 5: After making the selection, you can click Next. If you had
clicked the second radio button in the third page, one option
would ask you to provide the user name and the password for
your account as shown in Fig. 25.18. You can then type either sa
or Administrator (or the account you would be using) in the
Login Name text box and the corresponding password. The
second option would ask you to let the computer prompt you for
a username and a password. For our exercise, you should accept
the first radio button, then type a username and a password.
Fig. 25.18 Select connection option dialogue box

Step 6: The next (before last) page would ask you to add the new
server to the existing SQL Server Group as shown in Fig.
25.19. If you prefer to add the server to another group, you
would click the second radio button, type the desired name
in the Group Name text box and click Next.
Step 7: Once all the necessary information has been specified, you
can click Finish.
Step 8: When the registration of the server is over, if everything is
fine, you would be presented with a dialogue box
accordingly as shown in Fig. 25.21.
Step 9: You can then click Close.
Step 10: To specify the computer you want to connect to, if you are
working from the SQL Server Enterprise Manager, you can
click either (local) or the name of the server you want to
connect to as shown in Fig. 25.22.
Fig. 25.19 Select SQL Server group dialogue box

Fig. 25.20 Completing the Register SQL Server Wizard


Fig. 25.21 Server registration complete

Fig. 25.22 Connecting to SQL Server


Step 11: If you are connecting to the server using the SQL Query
Analyser, we saw that you would be presented with the
Connect to SQL Server dialog box. Normally, the name of
the computer would be selected already. If not, you can
select either (local) or the name of the computer in the SQL
Server combo box.

Fig. 25.23 SQL Server enterprise manager dialogue box

Step 12: If the SQL Server Enterprise Manager is already opened and
you want to open SQL Query Analyser as shown in Fig.
25.23, in the left frame, you can click the server or any node
under the server to select it. Then, on the toolbar of the
MMC, click Tools → SQL Query Analyser. In this case, the
Query Analyser would open directly.

25.4.6 Security
An important aspect of establishing a connection to a computer is security.
Even if you are developing an application that would be used on a
standalone computer, you must take care of this issue. The security referred
to here has to do with the connection, not with how to protect your
database.
If you are using SQL Server Enterprise Manager, you can simply connect
to the computer using the steps we have reviewed so far.

Step 1: If you are accessing SQL Query Analyser from the taskbar
where you had clicked Start → (All) Programs → Microsoft
SQL Server → Query Analyser, after selecting the computer in
the SQL Server combo box, you can specify the type of
authentication you want. If security is not an issue in this
instance, you can click the Windows Authentication radio
button as shown in Fig. 25.24.
Step 2: If you want security to apply and if you are connecting to SQL
Query Analyser using the Connect To SQL Server dialogue box,
you must click the SQL Server Authentication radio button as
shown in Fig. 25.25.
Step 3: If you are connecting to SQL Query Analyser using the Connect
To SQL Server dialogue box and you want to apply
authentication, after selecting the second radio button, this
would prompt you for a username.
Step 4: If you are “physically” connecting to the server through SQL
Query Analyser, besides the username, you must also
provide a password to complete the authentication as shown in
Fig. 25.26.
Fig. 25.24 Connect to SQL Server dialogue box

Fig. 25.25 Connect to SQL Server dialogue box


Fig. 25.26 SQL Server authentication

Step 5: After providing the necessary credentials and once you click
OK, the SQL Query Analyser would display as shown in Fig.
25.27.
Fig. 25.27 SQL query analyzer display

25.5 DATABASE OPERATION WITH MICROSOFT SQL SERVER

25.5.1 Connecting to a Database


Microsoft SQL Server (including MSDE) ships with various ready-made
databases you can work with. In SQL Server Enterprise Manager, the
available databases and those you will create are listed in a node called
Databases. To display the list of databases, you can click the Databases node
as shown in Fig. 25.28.
Fig. 25.28 Displaying list of databases

If you are not trying to connect to one particular database, you do not need
to locate and click any. If you are attempting to connect to a specific
database, in SQL Server Enterprise Manager, you can simply click the
desired database as shown in Fig. 25.29.

Fig. 25.29 Connecting desired database


If you are working in SQL Query Analyser but you are not trying to
connect to a specific database, you can accept the default master selected in
the combo box of the toolbar as shown in Fig. 25.30. If you are trying to
work on a specific database, to select it, on the toolbar, you can click the
arrow of the combo box and select a database from the list:

Fig. 25.30 Accepting default ‘master’ database
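The same switch can also be made in T-SQL rather than through the combo box; assuming the Northwind sample database (described in the next section) is present, typing the following in Query Analyser changes the current database:

-- Switch the connection's current database (GO is the Query Analyser
-- batch separator).
USE Northwind
GO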

After using a connection and getting the necessary information from it,
you should terminate it. If you are working in SQL Server Enterprise
Manager or the SQL Query Analyser, to close the connection, you can
simply close the window as an application.

25.5.2 Database Creation


Before using a database, you must first have one. If you are just
starting with databases and you want to use one, Microsoft SQL Server ships
with two databases ready for you. One of these databases is called
Northwind and the other is called pubs.
Besides, or instead of, the Northwind and the pubs databases, you can
create your own. A database is primarily a group of computer files that each
has a name and a location. When you create a database using Microsoft SQL
Server, it is located in the Drive:\Program Files\Microsoft SQL
Server\MSSQL\Data folder.

To create a new database in SQL Server Enterprise Manager, do one of the
following:
In the left frame, you can right-click the server or the (local) node, position your mouse on
New and click Database.
In the left frame, you can also right-click the Databases node and click New Database.
When the server name is selected in the left frame, on the toolbar of the window, you can
click Action, position the mouse on New and click Database.
When the server name is selected in the left frame, you can right-click an empty area in the
right frame, position your mouse on New and click Database.
When the Databases node or any node under it is selected in the left frame, on the toolbar,
you can click Action and click New Database.
When the Databases node or any node under it is selected in the left frame, you can right-click
an empty area in the right frame and click New Database.

Any of these actions causes the Database Properties to display. You can
then enter the name of the database.

25.5.2.1 Naming of Created Database


Probably the most important requirement when creating a database is to give it a
name. SQL is very flexible when it comes to names. In fact, it is
less restrictive than most other computer languages. Still, there are rules you
must follow when naming the objects in your databases:
A name can start with either a letter (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x,
y, z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y or Z), a digit (0,
1, 2, 3, 4, 5, 6, 7, 8 or 9), an underscore (_) or a non-readable character. Examples are _n, act,
%783, Second.
After the first character (letter, digit, underscore or symbol), the name can have combinations
of underscores, letters, digits or symbols. Examples are _n24, act_52_t.
A name cannot include space, that is, empty characters. If you want to use a name that is
made of various words, start the name with an opening square bracket and end it with a
closing square bracket. Examples are [Full Name] or [Date of Birth].

Because of the flexibility of SQL, it can be difficult to maintain names in
a database. Based on this, there are conventions we will use for our objects.
In fact, we will adopt the rules used in C/C++, C#, Pascal, Java, Visual Basic
and so on. In our databases:
Unless stated otherwise (we will mention the exception, for example with variables, tables,
etc), a name will start with either a letter (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v,
w, x, y, z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y or Z) or an
underscore.
After the first character, we can use any combination of letters, digits or underscores.
A name will not start with two underscores.
A name will not include one or more empty spaces. That is, a name will be made in one
word.
If the name is a combination of words, at least the second word will start in uppercase.
Examples are dateHired, _RealSport, FullName or DriversLicenseNumber.

25.5.2.2 Creating a Database in the Enterprise Manager


Step 1: Start the Enterprise Manager (Start → (All) Programs →
Microsoft SQL Server → Enterprise Manager).
Step 2: Expand the Microsoft SQL Servers node, followed by the SQL
Server Group, followed by the name of the server and followed
by the Databases node.
Step 3: Right-click Databases and click New Database as shown in Fig.
25.31.
Step 4: In the Name text box, type StudentPreRegistration as shown
in Fig. 25.32.

25.5.2.3 Creating a Database Using the Database Wizard


Another technique you can use to create a database is by using the Database
Wizard. There are two main ways you can launch the Database Wizard. In
the left frame, when the server node or the Databases folder is selected, on
the toolbar, you can click the Tools button and click Wizards. This causes the
Select Wizard dialog box to display. In the Select Wizard dialogue box, you
can expand the Database node and click Create Database Wizard:

Step 1: On the toolbar of the SQL Server Enterprise Manager, click
Tools → Wizards.
Step 2: In the Select Wizard dialogue box, expand the Database node,
click Create Database Wizard and click OK as shown in Fig.
25.33.
Fig. 25.31 Creating new database

Fig. 25.32 Entering name for the new database


Step 3: In the first page of the wizard, read the text and click Next.
Step 4: In the second page of the wizard and in the Database Name text
box, you can specify the name you want for your database. For
this exercise, enter NationalCensus as shown in Fig. 25.34.
Fig. 25.33 Wizard dialogue box

Fig. 25.34 Creating database wizard dialogue box

Step 5: After entering the name, click Next.


Step 6: In the third, the fourth, the fifth and the sixth pages of the
wizard, accept the defaults by clicking Next on each page.
Step 7: The last page of the wizard, as shown in Fig. 25.35, shows a
summary of the database that will be created. If the information
is not accurate, you can click the Back button and make the
necessary changes. Once you are satisfied, you can click Finish.

Fig. 25.35 Completing the create database wizard

Step 8: If the database is successfully created, you would receive a
message box letting you know that the database was
successfully created, as shown in Fig. 25.36. You can then press
OK.
Fig. 25.36 Successful creation of database

25.5.2.4 Creating a Database Using SQL Query Analyser


The command used to create a database in SQL uses the following formula:

CREATE DATABASE DatabaseName

The CREATE DATABASE (remember that SQL is not case-sensitive,
even when you include it in a C++ statement) expression is required. The
DatabaseName factor is the name that the new database will carry. Although
SQL is not case-sensitive, as a C++ programmer, you should make it a habit
to be aware of the cases you use to name your objects.
As done in C++, every statement in SQL can be terminated with a semi-
colon. Although this is a requirement in many implementations of SQL, in
Microsoft SQL Server, you can omit the semi-colon. With the semi-colon, the
above formula would be:

CREATE DATABASE DatabaseName;


Fig. 25.37 shows an example of creating database using SQL Query
Analyser.

Fig. 25.37 Database creation using SQL Query Analyser
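For instance, the NationalCensus database created earlier with the wizard could also be created from Query Analyser. The following is only a sketch: the logical and physical file names, and the default data folder mentioned earlier, are assumptions that can be changed, and the statement assumes the database does not already exist:

-- A sketch using the default data folder mentioned earlier; names and
-- locations are assumptions.
CREATE DATABASE NationalCensus
ON PRIMARY
( NAME = NationalCensus_dat,
  FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL\Data\NationalCensus.mdf' )
LOG ON
( NAME = NationalCensus_log,
  FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL\Data\NationalCensus.ldf' );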

To assist you with writing code, the SQL Query Analyser includes
sections of sample code that can serve as placeholders. To access one of these
code templates, on the main menu of SQL Query Analyser, click File → New. Then,
in the general property page of the New dialogue box, you can double-click
a category to see the available options.

REVIEW QUESTIONS
1. What is Microsoft SQL Server? Explain.
2. What is Microsoft SQL Server 2000? What are its components? Explain.
3. Write the features of Microsoft SQL Server.
4. What do you mean by stored procedures in SQL Server? What are their benefits?
5. Explain the structure of a stored procedure.

STATE TRUE/FALSE
1. Microsoft SQL Server is a multithreaded server that scales from laptops and desktops to
enterprise servers.
2. Microsoft SQL Server can operate on clusters and symmetrical multiprocessing (SMP)
configurations.
3. SQL Profiler provides a window into the inner workings of the database.
4. Data Transmission Services (DTS) provide an extremely flexible method for importing and
exporting data between a Microsoft SQL Server installation and a large variety of other
formats.
5. SQL Server does not provide any graphical tools.

TICK (✓) THE APPROPRIATE ANSWER

1. Microsoft SQL Server is

a. Relational DBMS.
b. Hierarchical DBMS.
c. Networking DBMS.
d. None of these.

2. Microsoft SQL Server was developed in

a. 1980.
b. 1990.
c. 2000.
d. None of these.

3. Microsoft SQL Server was originally developed at Sybase for

a. Windows NT system.
b. UNIX system.
c. Both (a) and (b).
d. None of these.

4. Microsoft SQL Server can operate on

a. clusters.
b. symmetrical multiprocessing.
c. personal digital assistant.
d. All of these.

5. Service Manager of Microsoft SQL Server 2000 is used to control

a. the main SQL Server process.


b. Microsoft Distributed Transaction Coordinator.
c. SQLServerAgent processes.
d. All of these.
FILL IN THE BLANKS

1. Microsoft SQL Server is a _____ management system that was originally developed in the 1980s
at _____ for _____ systems.
2. Query Analyser offers a quick method for performing _____ against any of your SQL Server
databases.
3. Microsoft SQL Server provides the stored procedure mechanism to simplify the database
development process by grouping _____ into _____.
4. SQL Server supports network management using the _____.
5. SQL Server supports distributed transactions using _____ and _____.
Chapter 26

Microsoft Access

26.1 INTRODUCTION

Microsoft Access is a powerful and user-friendly database management
system for Windows. It is another relational database management system
provided by Microsoft with many innovative features for data storage and
retrieval using graphical tools of Windows environment to make tasks easier
to perform.
Microsoft Access has been designed for users who want to take full
advantage of the Windows environment for their database management tasks
while remaining end users and leaving the programming to others. Microsoft
Access provides users with one of the simplest and most flexible DBMS
solutions on the market today. It supports Object Linking and Embedding
(OLE) and Dynamic Data Exchange (DDE), and the ability to incorporate text and
graphics in a form or report. It provides a graphical user interface (GUI).
Reports, forms and queries are easy to design and execute.
This chapter aims at providing essential information about Microsoft
Access, the basic components and its basic features, such as creating forms,
creating and modifying queries and so on.

26.2 AN ACCESS DATABASE

Like other database management systems, Microsoft Access provides a way
to store and manage information. It considers both the tables of data that
store your information and the supplemental objects that present information
store your information and the supplement objects that present information
and work with it, to be a part of the database. This differs from standard
database system terminology, in which only the data itself is considered part
of the database. For example, when you use a package such as dBASE IV,
you might have an employee database, a client database and a supplier
database. Each of these databases is a separate file. You would have additional
files in your dBASE directory for reports and forms that work with the
database. With Access, you could have all three types of information in one
database along with the accompanying reports and forms. All of the data and
other database objects would be stored in one file in the same fashion as an
R:Base database.
Access stores data in tables that are organised by rows and columns. A
database can contain one table or many. Other objects such as reports, forms,
queries, macros and program modules are considered to be a part of the
database along with the tables. You can have these objects in the database
along with the tables, either including them from the beginning or adding
them as you need them. The basic requirement for a database is that you
have at least one table. All other objects are optional.
Since an Access database can contain many tables and other objects, it is
possible to create one database that will meet the information requirements
for an entire company. You can build the database gradually, adding
information and reports for various applications areas as you have time. You
can define relationships between pieces of information in tables.
You can have more than one database in Access. Each database has its
own tables and other objects. You can use the move and copy features of this
package to move and copy objects from one database to another, although
you can only work with one database at a time.

26.2.1 Tables
Tables in Access database are tabular arrangements of information. Columns
represent fields of information, or one particular piece of information that
can be stored for each entity in the table. The rows of the table contain the
records. A record contains one of each field in the database. Although a field
can be left blank, each record in the database has the potential for storing
information in each field in the table. Fig. 26.1 shows some of the fields and
records in an Access table.
Generally each major type of information in the database is represented by
a table. You might have a Supplier table, a Client table and an Employee
table. It is unlikely that such dissimilar information would be placed together
in the same table, although this information is all part of the same database.
Access Table Wizard makes table creation easy. When you use the Wizard
to build a table, you can select fields from one or more sample tables. Access
allows you to define relationships between fields in various tables. Using
Wizards, you can visually connect data in the various tables by dragging
fields between them.
Access provides two different views for tables, namely the Design view
and the Datasheet view. The Design view, as shown in Fig. 26.2, is used
when you are defining the fields that store the data in the table. For each
field in the table you define the field name and data type. You can also set
field properties to change the field format and caption (used for the fields on
reports and forms), provide validation rules to check data validity, create
index entries for the field and provide a default value.
In the Datasheet view, you can enter data into fields or look at existing
records in the table. Fig. 26.1 and 26.2 show the same Employee table: Fig.
26.1 presents the Datasheet view of it and Fig. 26.2 shows the design view.
Fig. 26.1 Access table in datasheet view
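Although tables are normally built in the Design view described above, Access (Jet) also accepts SQL data-definition statements run from a query's SQL View. The following is only a sketch with hypothetical column names and sizes for an Employee-style table:

-- Hypothetical Employee table; TEXT(n), LONG and DATETIME are Jet SQL types.
CREATE TABLE Employee
(
    EmployeeID LONG CONSTRAINT pkEmployee PRIMARY KEY,
    FirstName  TEXT(25),
    LastName   TEXT(25),
    DateHired  DATETIME
);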

26.2.2 Queries
Access supports different kinds of queries, such as select, crosstab and
action queries. You can also create parameters that let you customise the
query each time you use it. Select queries choose records from a table and
display them in a temporary table called a dynaset. Select queries are
essentially questions that ask Access about the entries in tables. You can create
queries with a Query-by-Example (QBE) grid. The entries you make in this
grid tell Access which fields and records you want to appear in a temporary
table (dynaset) that shows the query results. You can use complex
combinations of criteria to define your needs and see only the records that
you need. Fig. 26.3 shows the entries in the QBE grid that will select the
records you want. This QBE grid includes a Sort row that allows you to
specify the order of records in the resulting dynaset.
Fig. 26.2 Design view for a table

Fig. 26.3 QBE Grid with query entries

Crosstab queries provide a concise summary view of data in a spreadsheet
format. Access provides four types of action queries, namely make-table,
delete, append and update queries.
If you have defined relationships between tables, a query can recognise
the relationships and combine data from multiple tables in the query’s result,
which is called a dynaset. Fig. 26.4 shows the relationships window where
you view and maintain the relationships in the database. If the relationships
are not defined, you can still associate data in related tables by joining them
when you design the query.

Fig. 26.4 Relationships window

Queries can include calculated fields. These fields do not actually exist in
any permanent table, but display the results of calculations that use the
contents of one or more fields. Queries that use calculated fields let you
derive more meaningful information from the data you record in your tables,
such as year-end totals for sales and expenditures. The Query Wizard can
guide you through the steps of creating some common, but more complicated
types of queries.
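For example, using the Products table of the Northwind sample database (introduced later in this chapter), a select query with a calculated field might look like the following in SQL View; the alias InventoryValue is hypothetical:

SELECT ProductName, UnitPrice * UnitsInStock AS InventoryValue
FROM Products
WHERE UnitsInStock < 10
ORDER BY ProductName;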

26.2.3 Reports
In reports, you can see the detail as you can with a form on the screen but
you can also look at many records at the same time. Reports also let you
look at summary information obtained after reading every record in the
table, such as totals or averages. Reports can show the data from either a
table or a query. Fig. 26.5 shows a report created with Access. The drawing
was created using CorelDRAW software.
Access can use OLE and DDE, which are Windows features that let you
share data between applications. The Report Wizard of Access helps you in
creating reports.

26.2.4 Forms
You can use forms to view the records in tables or to add new records.
Unlike datasheets, which present many records on the screen at one time,
forms have a narrower focus and usually present one record on the screen at
a time. You can use either queries or tables as the input for a form. You can
create forms using Form Wizard of Access. Access also has an AutoForm
feature that can automatically create a form for a table or query.
Controls are placed on a form to display fields or text. You can select
these controls and move them to a new location or resize them to give your
form the look you want. You can move the controls for fields and the text
that describes that field separately. You can also add other text to the form.
You can change the appearance of text on a form by changing the font or
making the type boldface or italic. You also can show text as raised or
sunken or use a specific colour. Lines and rectangles can be added to a form
to enhance its appearance. Fig. 26.6 shows a form developed to present data
in an appealing manner.
Fig. 26.5 Access report

Fig. 26.6 Access form

Forms allow you to show data from more than one table. You can build a
query first to select the data from different tables to appear on a form or use
sub-forms to handle the different tables you want to work with. A sub-form
displays the records associated with a particular field on a form. Sub-forms
provide the best solution when one record in a table relates to many records
in another table. Sub-forms allow you to show the data from one record at
the top of the form with the data from related records shown below it. For
example, Fig. 26.7 shows a form that displays information from the Client
table at the top of the form and information from the Employee Time Log
table in the bottom half of the form, in a sub-form.

Fig. 26.7 Access form containing a sub-form

A form has events that you can have Access perform as different things
occur. Events happen at particular points in time in the use of a form. For
example, moving from one record to the next is an event. You can have
macros or procedures assigned to an event to tell Access what you want to
happen when an event occurs.
26.2.5 Macros
Macros are a series of actions that describe what you want Access to do.
Macros are an ideal solution for repetitive tasks. You can specify the exact
steps for a macro to perform and the macro can repeat them whenever you
need these steps executed again, without making a mistake.
Access macros are easy to work with. Access lets you select from a list of
all the actions that you can use in a macro. Once you select an action, you
use arguments to control the specific effect of the action. Arguments differ
for each of the actions, since each action requires different information
before it can perform a task. Fig. 26.8 shows macro instructions entered in a
Macro window. For many argument entries, Access provides its best guess at
which entry you will want; you only need to change the entry if you want
something different.
You can create macros for a command button in a form that will open
another form and select the records that appear in the other form. Macros
also allow other sophisticated options such as custom menus and popup
forms for data collection. Menu Builder box of Access offers easier way to
create custom menus to work with macros.
You can execute macros from the database window or other locations. Fig.
26.9 shows a number of macros in the Database Window. You can highlight
a macro and then select Run to execute it.
Fig. 26.8 Access macro

Fig. 26.9 Access Window with many macros listed


26.3 DATABASE OPERATION IN MICROSOFT ACCESS

26.3.1 Creating Forms


Step 1: Open your database.
Step 2: Click on the Forms tab under Objects. This will bring up a
list of the form objects currently stored in your database.
Step 3: Click on the New icon to create a new form.
Step 4: Select the creation method you wish to use. A variety of
different methods will be presented, which can be used to
create a form. The AutoForm options quickly create a form
based upon a table or query. Design View allows for the
creation and formatting of elaborate forms using Access’
form editing interface. The Chart Wizard and PivotTable
Wizard create forms revolving around those two Microsoft
formats.
Step 5: Select the data source. You can choose from
any of the queries and tables in your database.
Step 6: To create a form to facilitate the addition of customers to the
database, for this example, select the Customers table from the
pull-down menu and click OK.
Step 7: Select the form layout and click Next. You can choose from
either a columnar, tabular, datasheet or justified form layout.
We will use the justified layout to produce an organised form
with a clean layout. You may wish to come back to this step
later and explore the various layouts available.
Step 8: Select the form style and click Next. Microsoft Access
includes a number of built-in styles to give your forms an
attractive appearance. Click on each of the style names to see
a preview of your form and choose the one you find most
appealing.
Step 9: Provide a title for your form. Select something easily
recognisable-this is how your form will appear in the database
menu. Let us call our form “Customers” in this case. Select
the next action and click Finish. You may open the form as a
user will see it and begin viewing, modifying and/or entering
new data. Alternatively, you may open the form in design
view to make modifications to the form’s appearance and
properties. Let us do the latter and explore some of the
options available to us.
Step 10: Edit Properties. Click the Properties icon. This will bring up
a menu of user-definable attributes that apply to our form.
Edit the properties as necessary. Setting the “Data Entry”
property to Yes will only allow users to insert new records
and modify records created during that session.

26.3.2 Creating a Simple Query


Microsoft Access offers a powerful query function with an easy-to-learn
interface that makes it a snap to extract exactly the information you need
from your database.
Let us explore the process step-by-step. Our goal is to create a query
listing the names of all of our company’s products, current inventory levels
and the name and phone number of each product’s supplier.

Step 1: Open your database. If you have not already installed the
Northwind sample database, these instructions will assist you.
Otherwise, you need to go to the File tab, select Open and locate
the Northwind database on your computer.
Step 2: Select the queries tab. This will bring up a listing of the
existing queries that Microsoft included in the sample database
along with two options to create new queries as shown in Fig.
26.10.
Step 3: Double-click on “create query by using wizard”. The query
wizard simplifies the creation of new queries as shown in Fig.
26.10.

Fig. 26.10 Query wizard

Step 4: Select the appropriate table from the pull-down menu. When
you select the pull-down menu as shown in Fig. 26.11, you will
be presented with a listing of all the tables and queries currently
stored in your Access database. These are the valid data sources
for your new query. In this example, we want to first select the
Products table, which contains information about the products
we keep in our inventory.
Fig. 26.11 Simple query wizard with pull-down menu

Step 5: Choose the fields you wish to appear in the query results by
either double-clicking on them or by single clicking first on the
field name and then on the “>” icon as shown in Fig. 26.12. As
you do this, the fields will move from the Available Fields
listing to the Selected Fields listing. Notice that there are three
other icons offered. The “>>” icon will select all available
fields. The “<” icon allows you to remove the highlighted field
from the Selected Fields list while the “<<” icon removes all
selected fields. In this example, we want to select the
ProductName, UnitsInStock and UnitsOnOrder from the
Product table.
Fig. 26.12 Field selection for query result

Step 6: Repeat steps 4 and 5 to add information from additional
tables, as desired. In our example, we wanted to include
information about the supplier. That information was not
included in the Products table-it is in the Suppliers table. You
can combine information from multiple tables and easily
show relationships. In this example, we want to include the
CompanyName and Phone fields from the Suppliers table. All
you have to do is select the fields-Access will line up the
fields for you!
Note that this works because the Northwind database has
predefined relationships between tables. If you are creating a
new database, you will need to establish these relationships
yourself.
Step 7: Click on Next.
Step 8: Choose the type of results you would like to produce. We
want to produce a full listing of products and their suppliers,
so choose the Detail option as shown in Fig. 26.13.
Step 9: Click on Next.
Step 10: Give your query a title. You are almost done! On the next
screen you can give your query a title as shown in Fig. 26.14.
Select something descriptive that will help you recognise this
query later. We will call this query “Product Supplier
Listing.”
Fig. 26.13 Choosing result type

Fig. 26.14 Giving title to the query

Step 11: Click on Finish. You will be presented with the two windows
below. The first window (Fig. 26.15) is the Query tab that we
started with. Notice that there’s one additional listing now-the
Product Supplier Listing we created. The second window (Fig.
26.16) contains our results-a list of our company products,
inventory levels and the supplier’s name and telephone number!
Fig. 26.15 Query tab

Fig. 26.16 Query result

You have successfully created your first query using Microsoft Access!
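Behind the scenes, the wizard builds a SQL statement that can be viewed from the query's SQL View. Assuming the standard Northwind relationship between Products and Suppliers (joined on SupplierID), it would look roughly like this:

SELECT Products.ProductName, Products.UnitsInStock, Products.UnitsOnOrder,
       Suppliers.CompanyName, Suppliers.Phone
FROM Products INNER JOIN Suppliers
     ON Products.SupplierID = Suppliers.SupplierID;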

26.3.3 Modifying a Query


In the previous section, the query displayed the inventory levels for all of the
products in the inventory. Now several features can be added to the previous
query, such as (a) displaying only those products where the current inventory level
is less than ten with no products on order, (b) displaying the product name
along with the phone number and contact name of each product's supplier
and (c) sorting the final results alphabetically by product name.

26.3.3.1 Opening Query in Design View


Step 1: Select the appropriate query. From the Northwind database
menu, single click on the query you wish to modify. Choose the
“Product Supplier Listing” query, as shown in Fig. 26.8 that was
designed in the previous section.
Step 2: Click the Design View icon. This icon appears in the upper left
portion of the window. Immediately upon clicking this icon, you
will be presented with the Design View as shown in Fig. 26.17.

Fig. 26.17 Design view menu


26.3.3.2 Adding Fields
Adding a field is one of the most common query modifications. This is
usually done either to display additional information in the query results or
to add criteria to the query from information not displayed in the query results.
In our example, the purchasing department wanted the contact name of each
product’s supplier displayed. As this was not one of the fields in the original
query, we must add it now.

Step 1: Choose an open table entry. Look for an entry in the field row
that does not contain any information. Depending upon the size
of your window you may need to use the horizontal scroll bar at
the bottom of the table to locate an open entry.
Step 2: Select the desired field. Single click in the field portion of the
chosen entry and a small black down arrow will appear. Click
this once and you will be presented with a list of currently
available fields as shown in Fig. 26.18. Select the field of
interest by single clicking on it. In our example, we want to
choose the ContactName field from the Suppliers table (listed as
Suppliers.ContactName).

26.3.3.3 Removing Fields


Often, you will need to remove unnecessary information from a query. If the
field in question is not a component of any criteria or sort orders that we
wish to maintain, the best option is to simply remove it from the query
altogether. This reduces the amount of overhead involved in performing the
query and maintains the readability of our query design.
Fig. 26.18 Adding fields to the query

Step 1: Click on the field name. Single click on the name of the field
you wish to remove in the query table. In our example, we want
to remove the CompanyName field from the Suppliers table.
Step 2: Open the Edit menu and select Delete Columns. Upon
completion of this step, as shown in Fig. 26.19, the
CompanyName column will disappear from the query table.
Fig. 26.19 Removing fields

26.3.3.4 Adding Criteria


We often desire to filter the information produced by a database query based
upon the value of one or more database fields. For example, let us suppose
that the purchasing department is only interested in those products with a
small inventory and no products currently on order. In order to include this
filtering information, we can add criteria to our query in the Design View.

Step 1: Select the criteria field of interest. Locate the field that you
would like to use as the basis for the filter and single click
inside the criteria box for that field. In our example, we would
first like to limit the query based upon the UnitsInStock field of
the Products table.
Step 2: Type the selection criteria. We want to limit our results to those
products with less than ten items in inventory. To accomplish
this, enter the mathematical expression “< 10” in the criteria
field as shown in Fig. 26.20.
Step 3: Repeat steps 1 and 2 for additional criteria. We would also like
to limit our results to those instances where the UnitsOnOrder
field is equal to zero as shown in Fig. 26.20. Repeat the steps
above to include this filter as well.

Fig. 26.20 Filtering of query

26.3.3.5 Hiding Fields


Sometimes we will create a filter based upon a database field but will not
want to show this field as part of the query results. In our example, the
purchasing department wanted to filter the query results based upon the
inventory levels but did not want these levels to appear. We cannot remove
the fields from the query because that would also remove the criteria. To
accomplish this, we need to hide the field.

Step 1: Uncheck the appropriate Show box. It is that simple! Just
locate the field in the query table and uncheck the Show box by
single clicking on it as shown in Fig. 26.21. If you later decide
to include that field in the results just single click on it again so
that the box is checked.
Fig. 26.21 Hiding fields

26.3.3.6 Sorting the Results


The human mind prefers to work with data presented in an organised
fashion. For this reason, we often desire to sort the results of our queries
based upon one or more of the fields in the query. In our example, we want
to sort the results alphabetically based upon the product’s name.

Step 1: Click the Sort entry for the appropriate field. Single click in
the Sort area of the field entry and a black down arrow will
appear. Single click on this arrow and you’ll be presented with a
list of sort order choices as shown in Fig. 26.22. Do this for the
Products.ProductName field in our example.

Fig. 26.22 Setting sorting order

Step 2: Choose the sort order. For text fields, ascending order will sort
alphabetically and descending order will sort by reverse
alphabetic order as shown in Fig. 26.22. We want to choose
ascending order for this example.

That is it! Close the design view by clicking the “X” icon in the upper
right corner. From the database menu, double click on our query name and
you’ll be presented with the desired results as shown in Fig. 26.23.

Fig. 26.23 Final sorted query result
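Putting the modifications of this section together, the SQL behind the final query would look roughly like the following (again assuming the Northwind SupplierID join; the hidden criteria fields appear only in the WHERE clause):

SELECT Products.ProductName, Suppliers.ContactName, Suppliers.Phone
FROM Products INNER JOIN Suppliers
     ON Products.SupplierID = Suppliers.SupplierID
WHERE Products.UnitsInStock < 10
  AND Products.UnitsOnOrder = 0
ORDER BY Products.ProductName;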

26.4 FEATURES OF MICROSOFT ACCESS

It allows us to create the framework (forms, tables and so on) for storing information in a
database.
Microsoft Access allows you to open a table and scroll through the records contained within
it.
Microsoft Access forms provide a quick and easy way to modify and insert records into your
databases.
Microsoft Access has capabilities to answer more complex requests or queries.
Access queries provide the capability to combine data from multiple tables and place specific
conditions on the data retrieved.
Access provides a user-friendly forms interface that allows users to enter information in a
graphical form and have that information transparently passed to the database.
Microsoft Access provides features such as reports, web integration and SQL Server
integration that greatly enhance the usability and flexibility of the database platform.
Microsoft Access provides native support for the World Wide Web.
Features of Access 2000 provide interactive data manipulation capabilities to web users.
Microsoft Access provides capability to tightly integrate with SQL Server, Microsoft’s
professional database server product.
REVIEW QUESTIONS
1. What is Microsoft Access?
2. How are tables, forms, queries and reports created in Access? Explain.
3. What are the different types of queries that are supported by Access? Explain each of them.
4. What do you mean by macro? Explain how macros are used in Access.
5. What is form in Access? What are its purposes?

STATE TRUE/FALSE

1. Microsoft Access is a powerful and user-friendly database management system for UNIX
systems.
2. Access supports Object Linking and Embedding (OLE) and dynamic data exchange (DDE).
3. Access provides a graphical user interface (GUI).
4. Reports, forms and queries are difficult to design and execute with Access.
5. Access considers both the tables of data that store your information and the supplement
objects that present information and work with it, to be part of the database.
6. Select queries are essentially questions that ask Access about the entries in tables.
7. In Access, you cannot create queries with a Query-by-Example (QBE) grid.

TICK (✓) THE APPROPRIATE ANSWER

1. Access is a

a. relational DBMS.
b. hierarchical DBMS.
c. networking DBMS.
d. none of these.

2. The Design View of Access is used

a. when you are defining the fields that store the data in the table.
b. to enter data into fields or look at existing records in the table.
c. to create parameters that let you customise the query.
d. None of these.

3. The Datasheet View of Access is used

a. when you are defining the fields that store the data in the table.
b. to enter data into fields or look at existing records in the table.
c. to create parameters that let you customise the query.
d. None of these.

4. Access supports different types of queries such as


a. Select.
b. Crosstab.
c. Action.
d. All of these.

5. Access Reports can show the data from

a. a table.
b. a query.
c. either a table or a query.
d. None of these.

FILL IN THE BLANKS

1. Microsoft Access is a powerful and user-friendly database management system for _____.
2. Access provides two different views for tables, namely (a) _____ and (b)_____.
3. Select queries choose records from a table and display them in a temporary table called a
_____.
4. Crosstab queries provide a concise summary view of data in a _____ format.
5. Access provides four types of action queries, namely (a) _____, (b) _____, (c) _____ and (d) _____.
Chapter 27

MySQL

27.1 INTRODUCTION

MySQL is an Open Source SQL database management system developed, distributed and supported by MySQL AB of Sweden. MySQL AB is a commercial company founded by the MySQL developers David Axmark, Allan Larsson and Michael “Monty” Widenius. It is a second-generation Open Source company that unites Open Source values and methodology with a successful business model.
MySQL Server was originally developed to handle large databases much
faster than existing solutions and has been successfully used in highly
demanding production environments for several years. Although under
constant development, MySQL Server today offers a rich and useful set of
functions. Its connectivity, speed and security make MySQL Server highly
suited for accessing databases on the Internet.
This chapter provides the features and functionality of MySQL.

27.2 AN OVERVIEW OF MYSQL

27.2.1 Features of MySQL


The following are some of the important characteristics of the MySQL
Database Software:
MySQL is a relational database management system.
MySQL software is Open Source. Open Source means that it is possible for anyone to use
and modify the software. Anybody can download the MySQL software from the Internet and
use it without paying anything. If you wish, you may study the source code and change it to
suit your needs. The MySQL software uses the GPL (GNU General Public License), http://www.fsf.org/licenses, to define what you may and may not do with the software in
different situations. If you feel uncomfortable with the GPL or need to embed MySQL code
into a commercial application, you can buy a commercially licensed version from MySQL AB.
The MySQL Database Server is very fast, reliable and easy to use.
MySQL Server works in client/server or embedded systems. The MySQL Database Software
is a client/server system that consists of a multi-threaded SQL server that supports different
back-ends, several different client programs and libraries, administrative tools and a wide
range of application programming interfaces (APIs). MySQL Server is also provided as an
embedded multithreaded library that you can link into your application to get a smaller,
faster, easier-to-manage product.
A large amount of contributed MySQL software is available.
Internals and Portability.

Written in C and C++.


Tested with a broad range of different compilers.
Works on many different platforms.
Uses GNU Automake, Autoconf and Libtool for portability.
APIs for C, C++, Eiffel, Java, Perl, PHP, Python, Ruby and Tcl are available.
Fully multi-threaded using kernel threads. It can easily use multiple CPUs if they
are available.
Provides transactional and non-transactional storage engines.
Uses very fast B-tree disk tables (MyISAM) with index compression.
Relatively easy to add another storage engine. This is useful if you want to add an
SQL interface to an in-house database.
A very fast thread-based memory allocation system.
Very fast joins using an optimised one-sweep multi-join.
In-memory hash tables, which are used as temporary tables.
SQL functions are implemented using a highly optimised class library and should be
as fast as possible. Usually there is no memory allocation at all after query
initialisation.
The MySQL code is tested with Purify (a commercial memory leakage detector) as
well as with Valgrind, a GPL tool (http://developer.kde.org/~sewardj/).
The server is available as a separate program for use in a client/server networked
environment. It is also available as a library that can be embedded (linked) into
standalone applications. Such applications can be used in isolation or in
environments where no network is available.

Column Types

Many column types: signed/unsigned integers 1, 2, 3, 4 and 8 bytes long, FLOAT,
DOUBLE, CHAR, VARCHAR, TEXT, BLOB, DATE, TIME, DATETIME,
TIMESTAMP, YEAR, SET, ENUM and OpenGIS spatial types.
Fixed-length and variable-length records.
Statements and Functions

Full operator and function support in the SELECT and WHERE clauses of queries.
For example:

mysql> SELECT CONCAT(first_name, ' ', last_name)
    -> FROM citizen
    -> WHERE income/dependents > 10000 AND age > 30;

Full support for SQL GROUP BY and ORDER BY clauses. Support for group
functions (COUNT(), COUNT(DISTINCT …), AVG(), STD(), SUM(), MAX(),
MIN() and GROUP_CONCAT()); an example is given at the end of this list.
Support for LEFT OUTER JOIN and RIGHT OUTER JOIN with both standard
SQL and ODBC syntax.
Support for aliases on tables and columns as required by standard SQL.
DELETE, INSERT, REPLACE and UPDATE return the number of rows that were
changed (affected). It is possible to return the number of rows matched instead by
setting a flag when connecting to the server.
The MySQL-specific SHOW command can be used to retrieve information about
databases, database engines, tables and indexes. The EXPLAIN command can be
used to determine how the optimiser resolves a query.
Function names do not clash with table or column names. For example, ABS is a
valid column name. The only restriction is that for a function call, no spaces are
allowed between the function name and the '(' that follows it.
You can mix tables from different databases in the same query.
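As an illustration of the GROUP BY support and group functions mentioned above, the following hypothetical query lists each department once, together with a head count, the average salary and a comma-separated list of employee names (GROUP_CONCAT() is available from MySQL 4.1 onwards):

SELECT dept, COUNT(*), AVG(salary), GROUP_CONCAT(emp_name)
FROM employee
GROUP BY dept
ORDER BY dept;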

Security: A privilege and password system that is very flexible and secure, and that allows
host-based verification. Passwords are secure because all password traffic is encrypted when
you connect to a server.
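A minimal sketch of how such a privilege is granted; the account, host and database names here are hypothetical:

GRANT SELECT, INSERT ON shopdb.* TO 'clerk'@'%.example.com' IDENTIFIED BY 'secret';

The host part ('%.example.com') is what enables the host-based verification mentioned above.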
Scalability and Limits.

Handles large databases. MySQL AB uses MySQL Server with databases that contain 50
million records.
Up to 64 indexes per table are allowed. Each index may consist of 1 to 16 columns
or parts of columns. The maximum index width is 1000 bytes. An index may use a
prefix of a column for CHAR, VARCHAR, BLOB or TEXT column types.
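For example, an index can be built on a prefix of a character column (the table and column names are hypothetical):

CREATE INDEX idx_lastname ON customer (last_name(10));

Only a 10-character prefix of last_name is stored in the index, which keeps the index small while still being useful for lookups.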

Connectivity.

Clients can connect to the MySQL server using TCP/IP sockets on any platform. On
Windows systems in the NT family (NT, 2000, XP or 2003), clients can connect
using named pipes. On Unix systems, clients can connect using Unix domain socket
files.
In MySQL versions 4.1 and higher, Windows servers also support shared-memory
connections if started with the --shared-memory option. Clients can connect through
shared memory by using the --protocol=memory option.
The Connector/ODBC (MyODBC) interface provides MySQL support for client
programs that use ODBC (Open Database Connectivity) connections. For example,
you can use MS Access to connect to your MySQL server. Clients can be run on
Windows or Unix. MyODBC source is available. All ODBC 2.5 functions are
supported, as are many others.
The Connector/J interface provides MySQL support for Java client programs that
use JDBC connections. Clients can be run on Windows or Unix. Connector/J source
is available.

Localisation

The server can provide error messages to clients in many languages.


Full support for several different character sets, including latin1 (ISO-8859-1),
german, big5, ujis and more. For example, the Scandinavian characters ‘å’, ‘ä’ and
‘ö’ are allowed in table and column names. Unicode support is available as of
MySQL 4.1.
All data is saved in the chosen character set. All comparisons for normal string
columns are case-insensitive.
Sorting is done according to the chosen character set (using Swedish collation by
default). It is possible to change this when the MySQL server is started. To see an
example of very advanced sorting, look at the Czech sorting code. MySQL Server
supports many different character sets that can be specified at compile time and
runtime.

Clients and Tools

The MySQL server has built-in support for SQL statements to check, optimise, and
repair tables. These statements are available from the command line through the
mysqlcheck client. MySQL also includes myisamchk, a very fast command-line
utility for performing these operations on MyISAM tables.
All MySQL programs can be invoked with the --help or -? options to obtain online
assistance.

27.2.2 MySQL Stability


MySQL provides a stable code base and the ISAM table format used by the
original storage engine remains backward-compatible. Each release of the
MySQL Server has been usable. The MySQL Server design is multi-layered
with independent modules. Some of the newer modules are listed here with
an indication of how well-tested each of them is:
Replication (Stable): Large groups of servers using replication are in production use, with
good results. Work on enhanced replication features is continuing in MySQL 5.x.
InnoDB tables (Stable): The InnoDB transactional storage engine has been declared stable in
the MySQL 3.23 tree, starting from version 3.23.49. InnoDB is being used in large, heavy-
load production systems.
BDB tables (Stable): The Berkeley DB code is very stable, but we are still improving the
BDB transactional storage engine interface in MySQL Server.
Full-text searches (Stable): Full-text searching is widely used. Important feature
enhancements were added in MySQL 4.0 and 4.1.
MyODBC 3.51 (Stable): MyODBC 3.51 uses ODBC SDK 3.51 and is in wide production
use. Some issues brought up appear to be application-related and independent of the ODBC
driver or underlying database server.

27.2.3 MySQL Tables Size


MySQL 3.22 had a 4GB (4 gigabyte) limit on table size. With the MyISAM
storage engine in MySQL 3.23, the maximum table size was increased to 8
million terabytes (2^63 bytes). With this larger allowed table size, the
maximum effective table size for MySQL databases is usually determined by
operating system constraints on file sizes, not by MySQL internal limits.
The InnoDB storage engine maintains InnoDB tables within a tablespace
that can be created from several files. This allows a table to exceed the
maximum individual file size. The tablespace can include raw disk
partitions, which allows extremely large tables. The maximum tablespace
size is 64TB. Table 27.1 lists some examples of operating system file-size
limits.

Table 27.1 Operating system file size limits for MySQL

Operating System File-size Limit


Linux 2.2-Intel 32-bit 2GB (LFS: 4GB)

Linux 2.4 (using ext3 filesystem) 4TB

Solaris 9/10 16TB

NetWare w/NSS filesystem 8TB

win32 w/ FAT/FAT32 2GB/4GB


win32 w/ NTFS 2TB (possibly larger)

MacOS X w/ HFS+ 2TB


On Linux 2.2, you can get MyISAM tables larger than 2GB in size by
using the Large File Support (LFS) patch for the ext2 filesystem. On Linux
2.4, patches also exist for ReiserFS to get support for big files (up to 2TB).
Most current Linux distributions are based on kernel 2.4 and include all the
required LFS patches. With JFS and XFS, petabyte and larger files are
possible on Linux. However, the maximum available file size still depends
on several factors, one of them being the filesystem used to store MySQL
tables.
It should be noted for Windows users that FAT and VFAT (FAT32) are not
considered suitable for production use with MySQL. Use NTFS instead.
By default, MySQL creates MyISAM tables with an internal structure that
allows a maximum size of about 4 GB. You can check the maximum table
size for a table with the SHOW TABLE STATUS statement or with
myisamchk -dv tbl_name.
If you need a MyISAM table that is larger than 4 GB in size (and your
operating system supports large files), the CREATE TABLE statement
allows AVG_ROW_LENGTH and MAX_ROWS options. You can also
change these options with ALTER TABLE after the table has been created,
to increase the table’s maximum allowable size.
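For example, assuming a hypothetical MyISAM table named big_log, the current limit can be inspected and then raised:

SHOW TABLE STATUS LIKE 'big_log';
ALTER TABLE big_log MAX_ROWS=1000000000 AVG_ROW_LENGTH=200;

MySQL uses MAX_ROWS and AVG_ROW_LENGTH to size the internal row pointers, which is what raises the default 4 GB ceiling.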

Other ways to work around file-size limits for MyISAM tables are as
follows:
If your large table is read-only, you can use myisampack to compress it. myisampack
usually compresses a table by at least 50%, so you can have, in effect, much bigger tables.
myisampack also can merge multiple tables into a single table.
MySQL includes a MERGE library that allows you to handle a collection of MyISAM tables
that have identical structure as a single MERGE table.
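A minimal sketch of the MERGE approach, assuming two existing MyISAM tables log2004 and log2005 with identical structure:

CREATE TABLE log_all (
id INT NOT NULL,
msg CHAR(100)
) ENGINE=MERGE UNION=(log2004, log2005) INSERT_METHOD=LAST;

Queries against log_all see the rows of both underlying tables; the column definitions must match those of the underlying tables exactly.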

27.2.4 MySQL Development Roadmap


The current production release series of MySQL is MySQL 4.1, which was
declared stable for production use as of Version 4.1.7, released in October
2004. The previous production release series was MySQL 4.0, which was
declared stable for production use as of Version 4.0.12, released in March
2003. Production status means that future 4.1 and 4.0 development is limited
only to bugfixes. For the older MySQL 3.23 series, only critical bugfixes are
made.
Active MySQL development is currently taking place in the MySQL 5.0 release
series, which means that new features are being added there. MySQL 5.0 is
available in alpha status. Table 27.2 summarizes the features that are
planned for various MySQL series.

Table 27.2 Planned features of MySQL series

Feature MySQL Series


Unions 4.0

Subqueries 4.1
R-trees 4.1 (for MyISAM tables)

Stored procedures 5.0


Views 5.0

Cursors 5.0
Foreign keys 5.1 (implemented in 3.23 for InnoDB)
Triggers 5.0 and 5.1

Full outer join 5.1


Constraints 5.1

27.2.5 Features Available in MySQL 4.0


Speed enhancements.

MySQL 4.0 has a query cache that can give a huge speed boost to applications with
repetitive queries (see the sketch at the end of this list).
Version 4.0 further increases the speed of MySQL Server in a number of areas, such
as bulk INSERT statements, searching on packed indexes, full-text searching (using
FULLTEXT indexes) and COUNT(DISTINCT).
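A minimal sketch of switching on the query cache mentioned in the first item above; query_cache_size is the server variable that controls how much memory the cache may use, and the 16 MB figure is arbitrary:

SET GLOBAL query_cache_size = 16777216;
SHOW VARIABLES LIKE 'query_cache%';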

Embedded MySQL Server introduced.

The new Embedded Server library can easily be used to create standalone and
embedded applications. The embedded server provides an alternative to using
MySQL in a client/server environment.
InnoDB storage engine as standard.

The InnoDB storage engine is offered as a standard feature of the MySQL server.
This means full support for ACID transactions, foreign keys with cascading
UPDATE and DELETE and row-level locking are standard features.

New functionality.

The enhanced FULLTEXT search properties of MySQL Server 4.0 enable
FULLTEXT indexing of large text masses with both binary and natural-language
searching logic. You can customize the minimal word length and define your own stop
word lists in any human language, enabling a new set of applications to be built
with MySQL Server.
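A brief, hypothetical illustration of FULLTEXT indexing on a MyISAM table and a natural-language search against it:

CREATE TABLE articles (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
body TEXT,
FULLTEXT (body)
);

SELECT id FROM articles
WHERE MATCH(body) AGAINST('replication');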

Standards compliance, portability and migration.

MySQL Server supports the UNION statement, a standard SQL feature.


MySQL runs natively on Novell NetWare 6.0 and higher.
Features to simplify migration from other database systems to MySQL Server
include TRUNCATE TABLE (as in Oracle).

Internationalisation.

Our German, Austrian and Swiss users should note that MySQL 4.0 supports a new
character set, latin1_de, which ensures that the German sorting order sorts words
with umlauts in the same order as do German telephone books.

Usability enhancements.

Most mysqld parameters (startup options) can be set without taking down the
server. This is a convenient feature for database administrators (DBAs).
Multiple-table DELETE and UPDATE statements have been added.
On Windows, symbolic link handling at the database level is enabled by default. On
Unix, the MyISAM storage engine supports symbolic linking at the table level (and
not just the database level as before).
SQL_CALC_FOUND_ROWS and FOUND_ROWS() are new functions that make
it possible to find out the number of rows a SELECT query that includes a LIMIT
clause would have returned without that clause.
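For example, with a hypothetical orders table:

SELECT SQL_CALC_FOUND_ROWS * FROM orders LIMIT 10;
SELECT FOUND_ROWS();

The first statement returns only ten rows; the second reports how many rows the query would have matched without the LIMIT clause.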

27.2.6 The Embedded MySQL Server


The libmysqld embedded server library makes MySQL Server suitable for a
vastly expanded realm of applications. By using this library, developers can
embed MySQL Server into various applications and electronics devices,
where the end user has no knowledge of there actually being an underlying
database. Embedded MySQL Server is ideal for use behind the scenes in
Internet appliances, public kiosks, turnkey hardware/software combination
units, high performance Internet servers, self-contained databases distributed
on CD-ROM and so on. On Windows there are two different libraries as
shown in Table 27.3.

Table 27.3 MySQL server library on Windows

libmysqld.lib Dynamic library for threaded applications.

mysqldemb.lib Static library for not threaded applications.

27.2.7 Features of MySQL 4.1


MySQL Server 4.0 laid the foundation for new features implemented in
MySQL 4.1, such as subqueries and Unicode support and for the work on
stored procedures being done in version 5.0. These features come at the top
of the wish list of many of our customers. Well-known for its stability, speed
and ease of use, MySQL Server is able to fulfill the requirement checklists
of very demanding buyers. MySQL Server 4.1 is currently in production
status.
Support for sub-queries and derived tables.

A “subquery” is a SELECT statement nested within another statement. A “derived
table” (an unnamed view) is a subquery in the FROM clause of another statement.
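Two hypothetical examples, one of each form:

SELECT name FROM customer
WHERE id IN (SELECT customer_id FROM orders);

SELECT MAX(avg_value)
FROM (SELECT customer_id, AVG(amount) AS avg_value
      FROM orders GROUP BY customer_id) AS per_customer;

The first uses a subquery in the WHERE clause; the second uses a derived table (the parenthesised SELECT in the FROM clause, which must be given an alias).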

Speed enhancements.

Faster binary client/server protocol with support for prepared statements and
parameter binding.
BTREE indexing is supported for HEAP tables, significantly improving response
time for non-exact searches.

New functionality.

CREATE TABLE tbl_name2 LIKE tbl_name1 allows you to create, with a single
statement, a new table with a structure exactly like that of an existing table (see the sketch after this list).
The MyISAM storage engine supports OpenGIS spatial types for storing
geographical data.
Replication can be done over SSL connections.
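For instance, the CREATE TABLE … LIKE statement mentioned above can clone a table definition; the table names here are hypothetical:

CREATE TABLE employee_archive LIKE employee;
INSERT INTO employee_archive SELECT * FROM employee;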

Standards compliance, portability and migration.

The new client/server protocol adds the ability to pass multiple warnings to the
client, rather than only a single result. This makes it much easier to track problems
that occur in operations such as bulk data loading.
SHOW WARNINGS shows warnings for the last command.

Internationalisation and Localisation.

To support applications that require the use of local languages, the MySQL software
offers extensive Unicode support through the utf8 and ucs2 character sets.
Character sets can be defined per column, table and database. This allows for a high
degree of flexibility in application design, particularly for multi-language Web sites.
Per-connection time zones are supported, allowing individual clients to select their
own time zone when necessary.
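For example, character sets can be chosen per column and the time zone per connection; the names and values below are illustrative only:

CREATE TABLE greeting (
id INT NOT NULL PRIMARY KEY,
msg_utf8 VARCHAR(100) CHARACTER SET utf8,
msg_latin1 VARCHAR(100) CHARACTER SET latin1
);

SET time_zone = '+05:30';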

Usability enhancements.

In response to popular demand, we have added a server-based HELP command that
can be used to get help information for SQL statements. The advantage of having
this information on the server side is that the information is always applicable to the
particular server version that you actually are using. Because this information is
available by issuing an SQL statement, any client can be written to access it. For
example, the help command of the mysql command-line client has been modified to
have this capability.
In the new client/server protocol, multiple statements can be issued with a single
call.
The new client/server protocol also supports returning multiple result sets. This
might occur as a result of sending multiple statements, for example.
A new INSERT … ON DUPLICATE KEY UPDATE … syntax has been
implemented. This allows you to UPDATE an existing row if the INSERT would
have caused a duplicate in a PRIMARY or UNIQUE index.
A new aggregate function, GROUP_CONCAT(), adds the extremely useful
capability of concatenating column values from grouped rows into a single result
string.
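A minimal sketch of the INSERT … ON DUPLICATE KEY UPDATE syntax, assuming a hypothetical page_hits table whose page column is declared UNIQUE:

INSERT INTO page_hits (page, hits) VALUES ('/index.html', 1)
ON DUPLICATE KEY UPDATE hits = hits + 1;

If no row exists for '/index.html' a new one is inserted; otherwise the existing row's counter is incremented.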

27.2.8 MySQL 5.0: The Next Development Release


New development for MySQL is focused on the 5.0 release, featuring stored
procedures, views (including updatable views), rudimentary triggers and
other new features.
27.2.9 The MySQL Mailing Lists
Your local site may have many subscribers to a MySQL mailing list. If so,
the site may have a local mailing list, so that messages sent from
lists.mysql.com to your site are propagated to the local list. In such cases,
please contact your system administrator to be added to or dropped from the
local MySQL list.
If you wish to have traffic for a mailing list go to a separate mailbox in
your mail program, set up a filter based on the message headers. You can use
either the List-ID: or Delivered-To: headers to identify list messages.

The MySQL mailing lists are as follows:


announce: This list is for announcements of new versions of MySQL and related programs.
This is a low-volume list to which all MySQL users should subscribe.
mysql: This is the main list for general MySQL discussion. Please note that some topics are
better discussed on the more-specialised lists. If you post to the wrong list, you may not get
an answer.
bugs: This list is for people who want to stay informed about issues reported since the last
release of MySQL or who want to be actively involved in the process of bug hunting and
fixing.
internals: This list is for people who work on the MySQL code. This is also the forum for
discussions on MySQL development and for posting patches.
mysqldoc: This list is for people who work on the MySQL documentation: people from
MySQL AB, translators and other community members.
benchmarks: This list is for anyone interested in performance issues. Discussions
concentrate on database performance (not limited to MySQL), but also include broader
categories such as performance of the kernel, filesystem, disk system and so on.
packagers: This list is for discussions on packaging and distributing MySQL. This is the
forum used by distribution maintainers to exchange ideas on packaging MySQL and on
ensuring that MySQL looks and feels as similar as possible on all supported platforms and
operating systems.
java: This list is for discussions about the MySQL server and Java. It is mostly used to
discuss JDBC drivers, including MySQL Connector/J.
win32: This list is for all topics concerning the MySQL software on Microsoft operating
systems, such as Windows 9x, Me, NT, 2000, XP and 2003.
myodbc: This list is for all topics concerning connecting to the MySQL server with ODBC.
gui-tools: This list is for all topics concerning MySQL GUI tools, including MySQL
Administrator and the MySQL Control Center graphical client.
cluster: This list is for discussion of MySQL Cluster.
dotnet: This list is for discussion of the MySQL server and the .NET platform. Mostly
related to the MySQL Connector/Net provider.
plusplus: This list is for all topics concerning programming with the C++ API for MySQL.
perl: This list is for all topics concerning the Perl support for MySQL with DBD::mysql.

27.2.10 Operating Systems Supported by MySQL


MySQL supports many operating systems, which are as follows:
AIX 4.x, 5.x with native threads.
Amiga.
BSDI 2.x with the MIT-pthreads package.
BSDI 3.0, 3.1 and 4.x with native threads.
Digital Unix 4.x with native threads.
FreeBSD 2.x with the MIT-pthreads package.
FreeBSD 3.x and 4.x with native threads.
FreeBSD 4.x with LinuxThreads.
HP-UX 10.20 with the DCE threads or the MIT-pthreads package.
HP-UX 11.x with the native threads.
Linux 2.0+ with LinuxThreads 0.7.1+ or glibc 2.0.7+ for various CPU architectures.
Mac OS X.
NetBSD 1.3/1.4 Intel and NetBSD 1.3 Alpha (requires GNU make).
Novell NetWare 6.0.
OpenBSD > 2.5 with native threads. OpenBSD < 2.5 with the MIT-pthreads package.
OS/2 Warp 3, FixPack 29 and OS/2 Warp 4, FixPack 4.
SCO OpenServer with a recent port of the FSU Pthreads package.
SCO UnixWare 7.1.x.
SGI Irix 6.x with native threads.
Solaris 2.5 and above with native threads on SPARC and x86.
SunOS 4.x with the MIT-pthreads package.
Tru64 Unix.
Windows 9x, Me, NT, 2000, XP and 2003.

Not all platforms are equally well-suited for running MySQL. How well a
certain platform is suited for a high-load mission-critical MySQL server is
determined by the following factors:
General stability of the thread library. A platform may have an excellent reputation
otherwise, but MySQL is only as stable as the thread library it calls, even if everything else is
perfect.
The capability of the kernel and the thread library to take advantage of symmetric multi-
processor (SMP) systems. In other words, when a process creates a thread, it should be
possible for that thread to run on a different CPU than the original process.
The capability of the kernel and the thread library to run many threads that acquire and
release a mutex over a short critical region frequently without excessive context switches. If
the implementation of pthread_mutex_lock() is too anxious to yield CPU time, this hurts
MySQL tremendously. If this issue is not taken care of, adding extra CPUs actually makes
MySQL slower.
General file system stability and performance.
If your tables are big, the ability of the file system to deal with large files at all and to deal
with them efficiently.

27.3 PHP-AN INTRODUCTION

PHP is short for PHP Hypertext Preprocessor. PHP is an HTML-embedded
scripting language. PHP processes hypertext (that is, HTML web
pages) before they leave the web server. This allows you to add dynamic
content to pages while at the same time making that content available to
users with all types of browsers. PHP is an interpreted programming
language, like Perl.
With PHP you can do almost anything. You can do anything that you would want to
do on the command line, create interactive pages, PDF files and images, and connect
to database, LDAP and email servers.
Much of PHP’s syntax is borrowed from C, Java and Perl with a couple of
unique PHP-specific features thrown in. The goal of the language is to allow
web developers to write dynamically generated pages quickly. PHP will
allow you to:
Reduce the time to create large websites.
Create a customised user experience for visitors based on information that you have gathered
from them.
Open up thousands of possibilities for online tools. Check out PHP - HotScripts for examples
of the great things that are possible with PHP.
Allow creation of shopping carts for e-commerce websites.

To begin working with PHP you must first have access to either of the
following:
A web hosting account that supports the use of PHP web pages and grants you access to
MySQL databases.
Have PHP and MySQL installed on your own computer.

Although MySQL is not absolutely necessary to use PHP, MySQL and
PHP are wonderful complements to one another and some topics covered in
this chapter will require that you have MySQL access.
27.3.1 PHP Language Syntax
To add PHP code to a web page, you need to enclose it in one of the
following special sets of tags:

<? php_code_here ?>

OR

<?php php_code_here ?>

OR

<script language="php">
php_code_here
</script>

So, what kind of code goes where it says php_code_here? Here is a quick
example.

<html>
<head>
<title>My Simple Page </title>
</head>
<body>

<?php echo "Hi There"; ?>

<body>
</html>
If you copy that code to a text editor and then view it from a web site that
has PHP enabled you get a page that says Hi There. The echo command
displays whatever is within quotes to the browser. There is also a print
command which does the same thing. Note the semicolon after the quoted
string. The semicolon tells PHP that the command has finished. It is very
important to watch your semicolons! If you do not, you may spend hours
debugging a page. You have been warned.

A little more information can be gained by using the PHP info command:

<html>
<head>
<title>My Simple Page</title>
</head>
<body>
<?php phpinfo(); ?>
</body>
</html>

This page will display a bunch of information about the current PHP setup
on the server as well as tell you about the many built in variables that are
available.
It is important to note that most server configurations require that your
files be named with a .php3 extension in order for them to be parsed. Name
all of your PHP coded files filename.php3.

27.3.2 PHP Variables


To declare a variable in PHP, just place a $ character before an alpha-
numeric string, type an equals sign after it and then a value.

<?php $greeting = "Hello World"; ?>


The above code sets the variable $greeting to a value of “Hello World”.

We can now use that variable to replace text throughout the page, as in the
example below:

<html>
<head>
<title>My Simple Page</title>
</head>
<body>

<?php $greeting = "Hello World";
echo $greeting; ?>
</body>
</html>

The above code creates a page that prints the words “Hello World”. One
reason to use variables is that you can set up a page that repeats a value
throughout and then only need to change the variable value to make all the
values on the page change.

27.3.3 PHP Operations


Now we will take a look at performing some operations on some variables.
First, we will create the html form for the user to fill in. You can use any
editor to do this. Here is the source:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0
Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>Tip Calculator</title>
</head>
<body>

<form action="tips.php3" method="get">

<p>Meal Cost: $<input type="text" name="sub_total" value="" size="7"></p>

<p>Tip %: <input type="text" name="tip_percent" value="20" size="3">%</p>

<p><input type="submit" value="Calculate!"></p>

</form>

</body>
</html>

Let us look at a few of the highlights in this page. The first is the action of
this page, tips.php3. That means that the web server is going to send the
information contained in this form to a page on the server called tips.php3
which is in the same folder as the form.
The names of the input items are also important. PHP will automatically
create a variable with that name and set its value equal to the value that is
sent.
Now we need to create a PHP page that will handle the data. Of course,
this page needs to be named tips.php3. The source is listed below. One way,
perhaps the best way, to create a PHP page is to create the results page in a
graphical editor, highlighting areas where dynamic content should go. You
can then use a text editor to replace the highlighted area with PHP.
1. <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
2. "http://www.w3.org/TR/REC-html40/loose.dtd">
3. <html>
4. <head>
5. <title>Tip Calculation Complete</title>
6. </head>
7. <body>
8.
9. <?php
10.
11. if ($sub_total == "") { echo "<h4>Error: You need to input a total!</h4>"; }
12.
13. if ($tip_percent == "") { echo "<h4>Error: You need to input a tip percentage!</h4>"; }
14.
15. $tip_decimal = $tip_percent/100;
16.
17. $tip = $tip_decimal * $sub_total;
18.
19. $total = $sub_total + $tip;
20.
21. ?>
22.
23. <form action="tips.php3" method="get">
24.
25. <p>Meal Cost: <strong>$<?php echo $sub_total; ?></strong></p>
26.
27. <p>Tip %: <strong><?php echo $tip_percent; ?>%</strong></p>
28.
29. <p>Tip Amount: <strong>$<?php echo $tip; ?></strong></p>
30.
31. <p>Total: <font size="+1" color="#990000"><strong>$<?php echo $total; ?></strong></font></p>
32.
33. </form>
34.
35. </body>
36. </html>
Note that the line numbers are there for illustrative purposes only. Please
do not include them in your source code.

Lines 11 and 13 check to see if the $sub_total and $tip_percent variables
are empty. If they are, they give an error message.
Line 15 converts the tip percentage into a decimal that we can multiply by.
Line 17 multiplies the tip decimal by the sub total to get the tip.
Line 19 gets the total cost by adding the sub total to the tip.
Lines 25, 27, 29 and 31 display the PHP variables on the results page.

27.3.4 Installing PHP


For experienced users, simply head over to PHP.net - Downloads and
download the most recent version of PHP.
However, other users should follow a guide to installing PHP onto the
computer. These guides are provided by PHP.net based on the operating
system that you are using.
PHP - Windows - Windows Installation Guide.
PHP - Mac - Mac Installation Guide.
PHP - Linux - Linux Installation Guide.

27.4 MYSQL DATABASE

A MySQL database is a way of organising a group of tables. If you were going
to create a bunch of different tables that shared a common theme, then you
would group them into one database to make the management process easier.

27.4.1 Creating Your First Database


Most web hosts do not allow you to create a database directly through a PHP
script. Instead they require that you use the PHP/MySQL administration
tools on the web host control panel to create these databases. For all of our
examples we will be using the following information:
Server - localhost.
Database - test.
Table - example.
Username - admin.
Password - 1admin.

The server is the name of the server we want to connect to. Because all of
our scripts are going to be run locally on your web server, the correct address
is localhost.

27.4.2 MySQL Connect


Before you can do anything with MySQL in PHP you must first establish a
connection to your web host's MySQL database. This is done with the
mysql_connect function.

PHP & MySQL Code:

<?php
mysql_connect("localhost", "admin", "1admin") or
die(mysql_error());
echo "Connected to MySQL<br />";
?>

Display:

Connected to MySQL

If you load the above PHP script to your webserver and everything works
properly, then you should see “Connected to MySQL” displayed when you
view the .php page.
The mysql_connect function takes three arguments: server, username and
password. In our example above these arguments were:
Server - localhost.
Username - admin.
Password - 1admin.

The “or die(mysql…” code displays an error message in your browser if,
you guessed it, there is an error!

27.4.3 Choosing the Working Database


After establishing a MySQL connection with the code above, you then need
to choose which database you will be using with this connection. This is
done with the mysql_select_db function.

PHP & MySQL Code:

<?php
mysql_connect("localhost", "admin", "1admin") or
die(mysql_error());
echo "Connected to MySQL<br />";
mysql_select_db("test") or die(mysql_error());
echo "Connected to Database";
?>

Display:

Connected to MySQL
Connected to Database

27.4.4 MySQL Tables


A MySQL table is quite different from an ordinary document table. In MySQL
and other database systems, the goal is to store information in an orderly
fashion. The table gets this done by organising the data into columns and
rows.
The columns specify what the data is going to be, while the rows contain
the actual data. Table 27.4 shows how you could imagine a MySQL table. (C
= Column, R = Row).

Table 27.4 MySQL table

This table has three categories or “columns”, of data: Age, Height and
Weight. This table has four entries, or in other words, four rows.

27.4.5 Create Table MySQL


Before you can enter data (rows) into a table, you must first define the table
by naming what kind of data it will hold (columns). We are going to do a
MySQL query to create this table.

PHP & MySQL Code:

<?php
// Make a MySQL Connection
mysql_connect("localhost", "admin", "1admin") or
die(mysql_error());
mysql_select_db("test") or die(mysql_error());

// Create a MySQL table in the selected database
mysql_query("CREATE TABLE example(
id INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id),
name VARCHAR(30),
age INT)")
or die(mysql_error());

echo "Table Created!";
?>

Display:

Table Created!

‘mysql_query(“CREATE TABLE example’

The first part of the mysql_query told MySQL that we wanted to create a
new table. We capitalised the two words because they are reserved MySQL
keywords.
The word “example” is the name of our table, as it came directly after
“CREATE TABLE”. It is a good idea to use descriptive names when creating
a table, such as: employee information, contacts or customer orders. Clear
names will ensure that you will know what the table is about when revisiting
it a year after you make it.

‘id INT NOT NULL AUTO_INCREMENT’

Here we create a column “id” that will automatically increment each time a
new entry is added to the table. This will result in the first row in the table
having an id = 1, the second row id = 2, the third row id = 3 and so on.

Reserved MySQL Keywords


INT - This stands for integer. ‘id’ has been defined to be an integer.
NOT NULL - These are actually two keywords, but they combine together to say that this
column cannot be null.
AUTO_INCREMENT - Each time a new entry is added the value will be incremented by 1.

‘PRIMARY KEY (id)’

PRIMARY KEY is used as a unique identifier for the rows. Here, we have
made “id” the PRIMARY KEY for this table. This means that no two ids can
be the same, or else we will run into trouble. This is why we made “id” an
auto incrementing counter in the previous line of code.

‘name VARCHAR(30),’

Here we make a new column with the name “name”! VARCHAR stands for
variable character. We will most likely only be using this column to store
characters (A-Z, a-z). The number inside the parentheses sets the limit on
how many characters can be entered. In this case, the limit is 30.

‘age INT,’

Our third and final column is age, which stores an integer. Notice that there
are no parentheses following “INT”, as SQL already knows what to do with
an integer. The possible integer values that can be stored within an “INT” are
-2,147,483,648 to 2,147,483,647, which is more than enough!

‘or die(mysql_error());’

This will print out an error if there is a problem in the creation process.

27.4.6 Inserting Data into MySQL Table


When data is placed into a MySQL table it is referred to as inserting data.
When inserting data it is important to remember what kind of data is
specified in the columns of the table. Here is the PHP/MySQL code for
inserting data into the “example” table.

PHP & MySQL Code:


<?php
// Make a MySQL Connection
mysql_connect("localhost", "admin", "1admin") or
die(mysql_error());
mysql_select_db("test") or die(mysql_error());

// Insert a row of information into the table "example"
mysql_query("INSERT INTO example
(name, age) VALUES('Kumar Abhishek', '23')")
or die(mysql_error());

mysql_query("INSERT INTO example
(name, age) VALUES('Kumar Avinash', '21')")
or die(mysql_error());

mysql_query("INSERT INTO example
(name, age) VALUES('Alka Singh', '15')")
or die(mysql_error());

echo "Data Inserted!";
?>

Display:

Data Inserted!

‘mysql_query(“INSERT INTO example’

Again we are using the mysql_query function. “INSERT INTO” means that
data is going to be put into a table. The name of the table we specified is
“example”.

‘(name, age) VALUES(‘Kumar Abhishek’, ‘23’)’

“(name, age)” are the two columns we want to add data in. “VALUES”
means that what follows is the data to be put into the columns that we just
specified. Here, we enter the name Kumar Abhishek for “name” and the age
23 for “age”.

27.4.7 MySQL Query


Usually most of the work done with MySQL involves pulling down data
from a MySQL database. In MySQL, pulling down data is done with the
“SELECT” keyword. Think of SELECT as working the same way as it does
on your computer. If you want to copy a selection of words you first select
them then copy and paste.
In this example we will be outputting the first entry of our MySQL
“example” table to the web browser.

PHP & MySQL Code:

<?php
// Make a MySQL Connection
mysql_connect("localhost", "admin", "1admin") or
die(mysql_error());
mysql_select_db("test") or die(mysql_error());

// Retrieve all the data from the "example" table
$result = mysql_query("SELECT * FROM example")
or die(mysql_error());

// Store the first record of the "example" table into $row
$row = mysql_fetch_array( $result );
// Print out the contents of the entry
echo "Name: ".$row['name'];
echo " Age: ".$row['age'];
?>

Display:

Name: Kumar Abhishek Age: 23

‘$result = mysql_query(“SELECT * FROM example”)’

When you perform a SELECT query on the database it will return a MySQL
result. We want to use this result in our PHP code, so we need to store it in a
variable. $result now holds the result from our mysql_query.

“SELECT * FROM example”

This line of code reads “Select everything from the table example”. The
asterisk is the wild card in MySQL which just tells MySQL not to exclude
anything in its selection.

‘$row = mysql_fetch_array($result );’

mysql_fetch_array returns the first associative array of the mysql result that
we pass to it. Here we are passing our MySQL result $result and the function
will return the first row of that result, which includes the data “Kumar
Abhishek” and “23”.
In our MySQL table “example” there are only two fields that we care
about: name and age. These names are the keys to extracting the data from
our associative array. To get the name we use $row[‘name’] and to get the
age we use $row[‘age’]. MySQL is case sensitive, so be sure to use
capitalization in your PHP code that matches the MySQL column names.

27.4.8 Retrieving Information from MySQL


In this example we will select everything in our table “example” and put it
into a nicely formatted HTML table.

PHP & MySQL Code:

<?php
// Make a MySQL Connection
mysql_connect("localhost", "admin", "1admin") or
die(mysql_error());
mysql_select_db("test") or die(mysql_error());

// Get all the data from the "example" table
$result = mysql_query("SELECT * FROM example")
or die(mysql_error());

echo "<table border='1'>";
echo "<tr> <th>Name</th> <th>Age</th> </tr>";
// keeps getting the next row until there are no more to get
while($row = mysql_fetch_array( $result )) {
// Print out the contents of each row into a table
echo "<tr><td>";
echo $row['name'];
echo "</td><td>";
echo $row['age'];
echo "</td></tr>";
}
echo "</table>";
?>

Display:
Name Age
Kumar Abhishek 23

Kumar Avinash 21

Alka Singh 15

We only had three entries in our table, so there are only three rows that
appear above. If you added more entries to your table then you may see
more data than what is above.

‘$result = mysql_query’

When you select items from a database using mysql_query, the data is
returned as a MySQL result. Since we want to use this data in our table we
need to store it in a variable. $result now holds the result from our
mysql_query.

‘(“SELECT * FROM example”’

This line of code reads “Select everything from the table example”. The
asterisk is the wild card in MySQL which just tells MySQL to get
everything.

‘while($row = mysql_fetch_array( $result)’

The mysql_fetch_array function gets the next in line associative array from a
MySQL result. By putting it in a while loop it will continue to fetch the next
array until there is no next array to fetch. At this point the loop check will
fail and the code will continue to execute.
In our MySQL table “example” there are only two fields that we care
about: name and age. These names are the keys to extracting the data from
our associative array. To get the name we use $row[‘name’] and to get the
age we use $row[‘age’].

27.5 INSTALLING MYSQL ON WINDOWS


MySQL for Windows is available in two distribution formats:
The binary distribution contains a setup program that installs everything you need so that you
can start the server immediately.
The source distribution contains all the code and support files for building the executables
using the VC++ 6.0 compiler.

Generally speaking, you should use the binary distribution. It is simpler
and you do not need additional tools to get MySQL up and running.

27.5.1 Windows System Requirements


To run MySQL on Windows, you need the following:
A 32-bit Windows operating system such as 9x, Me, NT, 2000, XP or Windows Server 2003.
A Windows NT based operating system (NT, 2000, XP, 2003) permits you to run the MySQL
server as a service. The use of a Windows NT based operating system is strongly
recommended.
TCP/IP protocol support.
A copy of the MySQL binary distribution for Windows, which can be downloaded from
http://dev.mysql.com/downloads/.
A tool that can read .zip files, to unpack the distribution file.
Enough space on the hard drive to unpack, install, and create the databases in accordance
with your requirements (generally a minimum of 200 megabytes is recommended).

You may also have the following optional requirements:


If you plan to connect to the MySQL server via ODBC, you also need a Connector/ODBC
driver.
If you need tables with a size larger than 4GB, install MySQL on an NTFS or newer file
system. Don’t forget to use MAX_ROWS and AVG_ROW_LENGTH when you create
tables.

27.5.2 Choosing An Installation Package


Starting with MySQL version 4.1.5, there are three install packages to
choose from when installing MySQL on Windows. The packages are as
follows:
The Essentials Package: This package has a filename similar to mysql-essential-4.1.9-
win32.msi and contains the minimum set of files needed to install MySQL on Windows,
including the Configuration Wizard. This package does not include optional components
such as the embedded server and benchmark suite.
The Complete Package: This package has a filename similar to mysql-4.1.9-win32.zip and
contains all files needed for a complete Windows installation, including the Configuration
Wizard. This package includes optional components such as the embedded server and
benchmark suite.
The Noinstall Archive: This package has a filename similar to mysql-noinstall-4.1.9-
win32.zip and contains all the files found in the Complete install package, with the exception
of the Configuration Wizard. This package does not include an automated installer and must
be manually installed and configured.

The Essentials package is recommended for most users.

27.5.3 Installing MySQL with the Automated Installer


Starting with MySQL 4.1.5, users can use the new MySQL Installation
Wizard and MySQL Configuration Wizard to install MySQL on Windows.
The MySQL Installation Wizard and MySQL Configuration Wizard are
designed to install and configure MySQL in such a way that new users can
immediately get started using MySQL.
The MySQL Installation Wizard and MySQL Configuration Wizard are
available in the Essentials and Complete install packages and are
recommended for most standard MySQL installations. Exceptions include
users who need to install multiple instances of MySQL on a single server
and advanced users who want complete control of server configuration.

27.5.4 Using the MySQL Installation Wizard


MySQL Installation Wizard is a new installer for the MySQL server that
uses the latest installer technologies for Microsoft Windows. The MySQL
Installation Wizard, in combination with the MySQL Configuration Wizard,
allows a user to install and configure a MySQL server that is ready for use
immediately after installation.
The MySQL Installation Wizard is the standard installer for all MySQL
server distributions, version 4.1.5 and higher. Users of previous versions of
MySQL need to manually shut down and remove their existing MySQL
installations before installing MySQL with the MySQL Installation Wizard.
Microsoft has included an improved version of their Microsoft Windows
Installer (MSI) in the recent versions of Windows. Using the MSI has
become the de-facto standard for application installations on Windows 2000,
Windows XP and Windows Server 2003. The MySQL Installation Wizard
makes use of this technology to provide a smoother and more flexible
installation process.

27.5.5 Downloading and Starting the MySQL Installation Wizard


The MySQL server install packages can be downloaded from
http://dev.mysql.com/downloads/. If the package you download is contained
within a Zip archive, you need to extract the archive first.
The process for starting the wizard depends on the contents of the install
package you download. If there is a setup.exe file present, double-click it to
start the install process. If there is a .msi file present, double-click it to start
the install process.
There are up to three installation types available: Typical, Complete and
Custom. The Typical installation type installs the MySQL server, the mysql
command-line client, and the command-line utilities. The command-line
clients and utilities include mysqldump, myisamchk and several other tools
to help you manage the MySQL server.
The Complete installation type installs all components included in the
installation package. The full installation package includes components such
as the embedded server library, the benchmark suite, support scripts and
documentation.
The Custom installation type gives you complete control over which
packages you wish to install and the installation path that is used.
If you choose the Typical or Complete installation types and click the
Next button, you advance to the confirmation screen to confirm your choices
and begin the installation. If you choose the Custom installation type and
click the Next button, you advance to the custom install dialog.

27.5.6 MySQL Installation Steps


MySQL is a free database server which is well suited as a backend for small
database-driven Web sites developed in PHP or Perl.
Follow the following steps:

Step 1: Download MySQL 4.1.11 for AIX 4.3.3.0 (PowerPC).


This is the newest version of MySQL that will work on the
UW servers. Use wget or lynx to download and save the file
to your account:
wget:

wget http://www.washington.edu/computing/web/publishing/mysql-standard-4.1.11-ibm-aix4.3.3.0-powerpc.tar.gz

lynx:

lynx -dump http://www.washington.edu/computing/web/publishing/mysql-standard-4.1.11-ibm-aix4.3.3.0-powerpc.tar.gz > mysql-standard-4.1.11-ibm-aix4.3.3.0-powerpc.tar.gz

Step 2: Unzip the file you just downloaded:

gunzip -cd mysql-standard-4.1.11-ibm-aix4.3.3.0-powerpc.tar.gz | tar xvf -

Step 3: Create a symbolic link to the MySQL directory:


ln -s mysql-standard-4.1.11-ibm-aix4.3.3.0-powerpc mysql

Configure MySQL’s basic settings, create the default databases and start the MySQL server.
Step 4: Change directories and run the script that sets up default
permissions for users of your MySQL server:

cd mysql
./scripts/mysql_install_db

Step 5: The script informs you that a root password should be set.
You will do this in a few more steps.
Step 6: If you are upgrading an existing version of MySQL, move
back your .my.cnf file:

mv ~/.my.cnf.temp ~/.my.cnf

This requires that you keep the same port number for your
MySQL server when installing the new software.
Step 7: If you are installing MySQL for the first time, get the path
to your home directory:

echo $HOME

Note this down, as you will need the information in the next
step.
Create a new file called .my.cnf in your home directory.
This file contains account-specific settings for your MySQL
server.
pico ~/.my.cnf
Copy and paste the following lines into the file, making the
substitutions listed below:

[mysqld]
port=XXXXX
socket=/hw13/d06/accountname/mysql.sock
basedir=/hw13/d06/accountname/mysql
datadir=/hw13/d06/accountname/mysql/data
old-passwords
[client]
port=XXXXX
socket=/hw13/d06/accountname/mysql.sock

Replace the two instances of XXXXX with a number
between 1024 and 65000 (use the same number both times).
Write the number down if you plan to install phpMyAdmin.
This is the port that MySQL will use to listen for
connections.

Note: You must use a port number that is not already in use.
You can test a port number by typing telnet localhost
XXXXX(again replacing XXXXX with the port number). If
it says “Connection Refused”, then you have a good
number. If it says something ending in “Connection closed
by foreign host.” then there is already a server running on
that port, so you should choose a different number.

Replace /hw13/d06/accountname with the path to your home directory.

Note: If you are not planning to use the innodb storage
engine, then now is a good time to turn it off. This will save
you some space and memory. You can disable innodb by
including a line that says skip-innodb underneath the
‘old-passwords’ line in your .my.cnf file.

Write the file and exit Pico.


Step 8: If you are following the directions to upgrade an existing
version of MySQL, you should now copy back your
databases into your new MySQL installation:

rm -R ~/mysql/data
cp -R ~/mysql-bak/data ~/mysql/data

Step 9: You are now ready to start your MySQL server.


Make sure you are in the web-development environment
(see steps 1-3), and type:

./bin/mysqld_safe &

Be sure to include the ampersand (&) at the end of the
command; it is an instruction to run the process in the
background. If you forget to type it, you will not be able to
continue your terminal session and you should close your
terminal window and open another.
If everything has gone correctly, a message similar to the
following will appear:

[1] 67786
% Starting mysqld daemon with databases
from
/hw13/d06/accountname/mysql/data

Press [enter] to return to the shell prompt. Your MySQL
server is now running as a background job and it will keep
running even after you log out.

27.5.7 Set up permissions and passwords


Note: If you are upgrading, you can return to the upgrade documentation
now. Otherwise, if this is a new MySQL installation, continue with setting
up the permissions and passwords.

Step 10: At this point your MySQL password is still empty. Use the
following command to set a new root password:

./bin/mysqladmin -u root password mypassword

Replace mypassword with a password of your choice; do not
enclose your password in any quotation marks.
Step 11: You have now created a “root account” and given it a
password. This will enable you to connect to your MySQL
server with the built-in command-line MySQL client using
this account and password. If you are installing MySQL for
the first time, type the following command to connect to the
server:

./bin/mysql -u root -p

You will be prompted for the MySQL root password. Enter
the password you picked in the previous step.

Enter password: mypassword


Welcome to the MySQL monitor. Commands
end with ; or \g.
Your MySQL connection id is 4 to server
version: 4.1.11-standard

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql>

At the mysql> prompt, type the commands that follow,
replacing mypassword with the root password. Press [enter]
after each semicolon.

mysql> use mysql;
mysql> delete from user where Host like "%";
mysql> grant all privileges on *.* to root@"%.washington.edu" identified by 'mypassword' with grant option;
mysql> grant all privileges on *.* to root@localhost identified by 'mypassword' with grant option;
mysql> flush privileges;
mysql> exit;

This step allows you to connect to your MySQL server as
‘root’ from any UW computer.
Step 12: Once back at your shell prompt, you can verify that your
MySQL server is running with the following command:

./bin/mysqladmin -u root -p version

You will be prompted for the root password again.


If MySQL is running, a message similar to the following
will be displayed:

./bin/mysqladmin Ver 8.41 Distrib 4.1.11, for ibm-aix4.3.3.0 on powerpc
Copyright (C) 2000 MySQL AB & MySQL Finland AB & TCX DataKonsult AB
This software comes with ABSOLUTELY NO WARRANTY. This is free
software, and you are welcome to modify and redistribute it under the GPL license

Server version       4.1.11-standard
Protocol version     10
Connection           Localhost via UNIX socket
UNIX socket          /hw13/d06/accountname/mysql.sock
Uptime:              3 min 32 sec

Threads: 1  Questions: 21  Slow queries: 0  Opens: 12
Flush tables: 1  Open tables: 1  Queries per second avg: 0.099

Step 13: You are done! A MySQL server is now running in your
account and is ready to accept connections. At this point you
can learn about MySQL administration to get more familiar
with MySQL, and you can install phpMyAdmin to help you
administer your new database server.

You can delete the file used to install MySQL with the
following command:

rm ~/mysql-standard-4.1.11-ibm-aix4.3.3.0-
powerpc.tar.gz

REVIEW QUESTIONS
1. What is MySQL?
2. What are the features of MySQL?
3. What do you mean by MySQL stability? Explain.
4. Discuss the features available in MySQL 4.0.
5. What do you mean by embedded MySQL Server?
6. What are the features of MySQL Server 4.1?
7. What are MySQL mailing lists? What does MySQL mailing list contain?
8. What are the operating systems supported by MySQL?
9. What is PHP? What is its relevance to MySQL?

STATE TRUE/FALSE

1. MySQL is an Open Source SQL database management system.


2. MySQL is developed, distributed and supported by MySQL AB.
3. MySQL Server was originally developed to handle small database.
4. Open Source means that it is possible for anyone to use and modify the software.
5. MySQL Database Software is a client/server system that consists of a multi-threaded SQL
server that supports different back-ends, several different client programs and libraries,
administrative tools and a wide range of application programming interfaces (APIs).

TICK (✓) THE APPROPRIATE ANSWER

1. MySQL is

a. relational DBMS.
b. Networking DBMS.
c. Open source SQL DBMS.
d. Both (a) and (c).

2. MySQL AB was founded by

a. David Axmark.
b. Allan Larsson.
c. Michael “Monty” Widenius.
d. All of these.

3. MySQL 4.1 has features such as

a. Subqueries.
b. Unicode support.
c. Both (a) and (b).
d. None of these.

4. PHP allows you to

a. reduce the time to create large websites.
b. create a customised user experience for visitors based on information that you have
gathered from them.
c. create shopping carts for e-commerce websites.
d. All of these.

FILL IN THE BLANKS

1. The MySQL Server design is _____ with independent modules.


2. MySQL is an _____ SQL database management system developed by _____.
3. The InnoDB storage engine maintains _____ tables within a _____ that can be created from
several files.
4. PHP is short for _____.
5. PHP is an _____ scripting language.
6. PHP processes hypertext (that is, HTML web pages) before they leave the _____.
7. PHP is an _____ programming language, like _____.
Chapter 28

Teradata RDBMS

28.1 INTRODUCTION

The Teradata relational database management system was developed by
Teradata, a software company founded in 1979 by a group of people,
namely Dr. Jack E. Shemer, Dr. Philip M. Neches, Walter E. Muir, Jerold R.
Modes, William P. Worth and Carroll Reed. Between 1976 and 1979, the
concept of Teradata grew out of research at the California Institute of
Technology (Caltech) and from the discussions of Citibank’s advanced
technology group. The founders worked to design a database management
system for parallel processing with multiple microprocessors, specifically
for decision support. Teradata was incorporated on July 13, 1979, and
started in a garage in Brentwood, California. The name Teradata was chosen
to symbolize the ability to manage terabytes (trillions of bytes) of data.
In 1996, a Teradata database was the world’s largest, with 11 terabytes of
data, and by 1999, the database of one of Teradata’s customers was the
world’s largest database in production with 130 terabytes of user data on
176 nodes.
This chapter gives a brief introduction to Teradata RDBMS and aims at
providing details about Teradata client software, installation and
configuration, developing open database connectivity (ODBC) applications,
and so on.

28.2 TERADATA TECHNOLOGY


Teradata is a massively parallel processing system running a shared nothing
architecture. The Teradata DBMS is linearly and predictably scalable in all
dimensions of a database system workload (data volume, breadth, number
of users, complexity of queries). Due to the scalability features, it is very
popular for enterprise data warehousing applications. Teradata is offered on
Intel servers interconnected by the proprietary BYNET messaging fabric.
Teradata systems are offered with either Teradata-branded LSI or EMC disk
arrays for database storage.
Teradata enterprise data warehouses are often accessed via open database
connectivity (ODBC) or Java database connectivity (JDBC) by applications
running on operating systems such as Microsoft Windows or flavours of
UNIX. The warehouse typically sources data from operational systems via a
combination of batch and trickle loads.
Teradata acts as a single data store that can accept large numbers of
concurrent requests from multiple client applications.

28.3 TERADATA TOOLS AND UTILITIES

The Teradata Tools and Utilities software, together with the Teradata
relational database management system (RDBMS) software, permits
communication between a Teradata client and a Teradata RDBMS.

28.3.1 Operating System Platform


Teradata offers a choice of several operating systems. Before installing the
Teradata Tools and Utilities software, the target computer should run on one
of the following operating systems:
Windows 98
Windows NT
Windows 2000
Windows XP, 32-bit
Windows XP, 64-bit
Microsoft Windows Server 2003
UNIX SVR4.2 MP-RAS, a variant of System V UNIX from AT&T
SUSE Linux Enterprise Server on 64-bit Intel servers.
28.3.2 Hardware Platform
To use the Teradata Tools and Utilities software, one should have an i386-
based or greater computer with the following components:
An appropriate network card
Ethernet or Token Ring LAN cards
283.5 MB of free disk space for a full installation.

28.3.3 Features of Teradata


Significant features of Teradata RDBMS include:
Unconditional parallelism, with load distribution shared among several servers.
Complex ad hoc queries with up to 64 joins.
Parallel efficiency, such that the effort for creating 100 records is the same as that for creating
1,00,000 records.
Scalability, so that increasing the number of processors of an existing system linearly
increases the performance. Performance thus does not deteriorate with an increased number
of users.

28.3.4 Teradata Utilities


Teradata offers the following utilities that assist in data warehousing
management and maintenance along with the Teradata RDBMS:
Basic Teradata Query (BTEQ)
MultiLoad
Teradata FastLoad
FastExport
TPump
Teradata Parallel Transporter (TPT)
SQL Assistant/Queryman
Preprocessor 2/PP2

28.3.5 Teradata Products


Customer relationship management (Teradata relationship manager)
Data warehousing
Demand chain management
Financial management
Industry solutions
Profitability analytics
Supply chain intelligence
Master data management.

28.3.6 Teradata-specific SQL Procedures Pass-through Facilities


The pass-through facility of PROC SQL can be used to build your own Teradata
SQL statements and then pass them to the Teradata server for execution.
The PROC SQL CONNECT statement defines the connection between SAS
and the Teradata DBMS. The following section describes the DBMS-
specific arguments that can be used in the CONNECT statement to establish
a connection with a Teradata database.

28.3.6.1 Arguments to Connect to Teradata


The SAS/ACCESS interface to Teradata can connect to multiple Teradata
servers and to multiple Teradata databases. However, if we use multiple,
simultaneous connections, we must use an alias argument to identify each
connection.
The Teradata DATABASE statement should not be used within the
EXECUTE statement in PROC SQL. The SAS/ACCESS SCHEMA =
option should be used for changing the default Teradata database. The
CONNECT statement uses SAS/ACCESS connection options.

USER and PASSWORD are the only required options.

CONNECT TO TERADATA <AS alias>
   (USER=Teradata-user-name
    PASSWORD=Teradata-password
    <TDPID=dbcname
    SCHEMA=alternate-database
    ACCOUNT=account_ID>);

USER=<‘>Teradata-user-name<’>
specifies a Teradata user name. One must also specify PASSWORD=.

PASSWORD= | PASS= | PW= <‘>Teradata-password<’>
specifies the Teradata password that is associated with the Teradata user
name. One must also specify USER=.
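As an additional illustration (a minimal sketch only), the optional TDPID= and
SCHEMA= arguments can be combined with the required options. Here the TDPID
value teradata and the default database student_info are placeholders borrowed
from elsewhere in this chapter; actual values depend on the installation:

proc sql;
   connect to teradata as tera
      (user=kamdar password=ellis
       tdpid=teradata schema=student_info);
   select * from connection to tera (select * from students);
   disconnect from tera;
quit;

With SCHEMA= set, unqualified table names in the pass-through query are resolved
against student_info rather than against the user's default database.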

28.3.6.2 Pass-through Examples


Example 1

Using the Alias DBCON for the Teradata Connection

proc sql;
connect to teradata as dbcon
(user=kamdar pass=ellis);
quit;

In Example 1, SAS/ACCESS
connects to the Teradata DBMS using the alias dbcon;
performs no other work.

Example 2

Deleting and Recreating a Teradata Table

proc sql;
   connect to teradata as tera (user=kamdar password=ellis);
   execute (drop table salary) by tera;
   execute (create table salary (current_salary float,
            name char(10))) by tera;
   execute (insert into salary values (35335.00, 'Dan J.')) by tera;
   execute (insert into salary values (40300.00, 'Irma L.')) by tera;
   disconnect from tera;
quit;

In Example 2, SAS/ACCESS
connects to the Teradata DBMS using the alias tera;
drops the SALARY table;
recreates the SALARY table;
inserts two rows;
disconnects from the Teradata DBMS.

Example 3

Updating a Teradata Table

proc sql;
   connect to teradata as tera (user=kamdar password=ellis);
   execute (update salary set current_salary=45000
            where (name='Alka Singh')) by tera;
   disconnect from tera;
quit;

In Example 3, SAS/ACCESS
connects to the Teradata DBMS using the alias tera.
updates the row for Alka Singh, changing her current salary to Rs. 45,000.00.
disconnects from the Teradata DBMS.

Example 4

Selecting and Displaying a Teradata Table

proc sql;
   connect to teradata as tera2 (user=kamdar password=ellis);
   select * from connection to tera2 (select * from salary);
   disconnect from tera2;
quit;

In Example 4, SAS/ACCESS
connects to the Teradata database using the alias tera2;
selects all rows in the SALARY table and displays them using PROC SQL;
disconnects from the Teradata database.

28.4 TERADATA RDBMS

Teradata RDBMS is a complete relational database management system.
The system is based on off-the-shelf symmetric multiprocessing (SMP)
technology combined with a communication network connecting the SMP
systems to form a massively parallel processing (MPP) system. BYNET is
a hardware inter-processor network that links SMP nodes. All processors in
the same SMP node are connected by a virtual BYNET. Fig. 28.1 shows how
each component of the DBMS works together.
Fig. 28.1 Teradata DBMS components

28.4.1 Parallel Database Extensions (PDE)


Parallel database extensions (PDE) are an interface layer on top of the
operating system. Their functions include executing vprocs (virtual
processors), providing a parallel environment, scheduling sessions,
debugging, and so on.

28.4.2 Teradata File System


Teradata file system allows Teradata RDBMS to store and retrieve data
regardless of low-level operating system interface.

28.4.3 Parsing Engine (PE)


The parsing engine (PE) performs the following functions:
Communicate with the client
Manage sessions
Parse SQL statements
Communicate with the AMPs
Return results to the client.

28.4.4 Access Module Processor (AMP)


The access module processor (AMP) performs the following functions:
BYNET interface
Manage the database
Interface to the disk sub-system.

28.4.5 Call Level Interface (CLI)


A SQL query is submitted and transferred in CLI packet format.

28.4.6 Teradata Director Program (TDP)


Teradata director program (TDP) routes the packets to the specified
Teradata RDBMS server. Teradata RDBMS has the following components
that support all data communication management:
Call Level Interface (CLI)
WinCLI & ODBC
Teradata director program (TDP for channel attached client)
Micro TDP (TDP for network attached client).

28.5 TERADATA CLIENT SOFTWARE

Teradata client software components include the following:


Basic Teradata Query (BTEQ): Basic Teradata Query is a general-purpose program that is
used to submit data, commands and SQL statements to Teradata RDBMS.
C/COBOL/PL/I preprocessors: These tools are needed to pre-compile the application
program that uses embedded SQL to develop client applications.
Call Level Interface (CLI)
Open database connectivity (ODBC)
TDP/MTDP/MOSI
Archive/restore data to/from tape (ASF2)
Queryman: Queryman is based on ODBC; one can log on through a DSN and enter any
SQL statement to manipulate the data in the database.
FastLoad
MultiLoad
FastExport
Open Teradata backup (OTB)
Tpump
Teradata manager
WinDDI

All client components are based on CLI or ODBC or both. So, once the
client software is installed, these two components should be configured
appropriately before the client utilities are executed.
Teradata RDBMS is able to support JDBC programs in both application and
applet form. The client installation manual mentions that we need to install
the JDBC driver on client computers, and we also need to start a JDBC
Gateway and a Web server on the database server. Teradata supports at
least two types of JDBC drivers. The first type can be loaded locally and the
second should be downloadable. In either case, to support development, we
need a local JDBC driver or a Web server/JDBC Gateway running on the
same node on which Query Manager is running. But in the setup CD we
received, there is neither a JDBC driver nor any Java development tools.
Moreover, the Web server is not started on our system yet.

28.6 INSTALLATION AND CONFIGURATION OF TERADATA

One floppy disk is needed, which contains licenses for all components that can be installed.
Each component has one entry in the license txt file.
If asked to choose between ODBC and Teradata ODBC with the DBQM enhanced version,
the prompt can simply be ignored. In that case, one cannot install DBQM_Admin, DBQM_Client and
DBQM_Server. These three components are used to optimize the processing of SQL
queries. The client software still works smoothly without them.
Because CLI and ODBC are the infrastructure for the other components, neither of them should
be removed from the installation list if any component that depends on it is being installed.
After the ODBC installation, you will be asked to run the ODBC administrator to configure a Data
Source Name (DSN). This may simply be cancelled, because the job can be done later. After the
Teradata Manager installation, you will be asked to run Start RDBMS Setup. This can also be
done later.

Following steps can be used for configuration:

Setting network parameters


For Windows 2000, perform the following step: Start -> Search -> For Files or Folders. The
file: hosts can be found as shown in Fig. 28.2.

Use Notepad to edit the hosts file as shown in Fig. 28.3.

Add one line into the hosts file: “130.108.5.57 teradatacop1”. Here, 130.108.5.57 is the IP
address of the top node of the system on which Query Manager is running. “teradata” will
be the TDPID which is used in many client components we installed. “cop” is a fixed suffix
string and “1” indicates that there is one RDBMS.
Fig. 28.2 Finding hosts file and setting network parameters

Fig. 28.3 Editing hosts file


Setting system environment parameters

For Windows 2000, perform the following step: Start -> Settings -> Control Panel, as shown
in Fig. 28.4.

Fig. 28.4 Setting system environment parameters

Find the icon “System”, double click it, get the following window, then choose “Advanced”
sub-window as shown in Figs. 28.5 and 28.6.
Fig. 28.5 Selecting “System” option

Fig. 28.6 Selecting “Advanced” sub-window

Click the “Environment Variables…” button as shown in Fig. 28.7.


Fig. 28.7 Selecting “Environment Variables” button

Click button “New…” to create some system variables as follows:

COPLIB = the directory which contains the file clispb.dat

This file is copied to the computer when Teradata client software is installed. It contains
some default settings for CLI.

COPANOMLOG = the full path and the log file name

This can be set as we want. If the file does not exist, when an error occurs, the client
software will create the file to record the log information.

TDMSTPORT = 1025

Because our server is listening for connection requests on port 1025, it should be set to 1025.
This system environment variable is added for future use. One should not insert a line

tdmst 1025/TCP

into the file C:\WINNT\System32\drivers\etc\services directly without the correct setting of the
environment variable TDMSTPORT.

Setting CLI system parameter block

We can find the file clispb.dat after we install the client software. In our computer, it is
under the directory C:\Program Files\NCR\Teradata Client. Please use Notepad to open it.

We will see the screen as shown in Fig. 28.8.


Fig. 28.8 Selecting CLI system parameter block

Originally, i_dbcpath was set as dbc. That is not the same as what was set in the file hosts,
so it was changed to teradata. When we use some components based on CLI and do not
specify the TDPID or RDBMS, the components will open this file to find the default
setting. Therefore, it is suggested to set it to the same value as in the hosts file.

For other entries in this file, we can just keep them as original settings.

To use utilities such as Queryman and WinDDI, we still need to configure a Data Source
Name (DSN) ourselves.

28.7 INSTALLATION OF TERADATA TOOLS AND UTILITIES SOFTWARE

There are four methods to install Teradata Tools and Utilities products:
Installing with PUT: The Teradata Parallel Upgrade Tool is an alternative method of
installing some of the Teradata Tools and Utilities products purchased.
Installing with the Client Main Install: All Teradata Tools and Utilities products, except
for the OLE DB Provider for Teradata, can be installed using the Client Main Install. The
Client Main Install is typical of Windows installations, allowing three forms of installation:

Typical installation: A typical installation installs all the products on each CD.
Custom installation: A custom installation installs only those products selected
from the available products.
Network installation: A network installation copies the installation packages for
the selected products to a specified folder. The network installation does not
actually install the products. This must be done by the user.

Installing Teradata JDBC driver by copying files: Starting with Teradata Tools and
Utilities Release 13.00.00, the three Teradata JDBC driver files are included on the utility
pack CDs. To install Teradata JDBC driver, the three files are manually copied from the
\TeraJDBC directory in root on the CD ROM into a directory of choice on the target client.
Installing from the command prompt: Teradata Tools and Utilities packages are
downloaded from a patch server, or copied using Network Setup Type, then installed on the
target machine by providing the package response file name as an input to the setup.exe
command at the command prompt. These packages are installed silently.
Downloading files from the Teradata Download Center: Several Teradata Tools and
Utilities products can be downloaded from the Teradata Download Center located at:

http://www.teradata.com/resources/drivers-udfs-and-toolbox

Products that can be downloaded include:

Teradata Call Level Interface version 2 (CLIv2): This product and its dependent
products namely Teradata Generic Security Services (TDGSS) Client and the
Shared ICU Libraries for Teradata are available for download.
ODBC Driver for Teradata (ODBC): This product and its dependent products
namely Teradata Generic Security Services (TeraGSS) Client and the Shared ICU
Libraries for Teradata are available for download.

Additionally, three other products are available from the Teradata Download Center:

Teradata JDBC driver


OLE DB Provider for Teradata
.NET Data Provider for Teradata

28.7.1 Installing with Microsoft Windows Installer


Microsoft Windows Installer is required for installing Teradata Tools and
Utilities software. Microsoft Windows Installer 3.0 is shipped as part of the
Microsoft Windows XP Service Pack 2 (SP2) and is available as a
redistributable system component for Microsoft Windows 2000 SP3,
Microsoft Windows 2000 SP4, Microsoft Windows XP, Microsoft Windows
XP SP1 and Microsoft Windows Server 2003.

28.7.2 Installing with Parallel Upgrade Tool (PUT)


Some Teradata Tools and Utilities products can be installed with Teradata
Parallel Upgrade Tool (PUT). Currently the following products are the
Teradata Tools and Utilities products that can be installed using the
software Parallel Upgrade Tool (PUT) on Microsoft Windows:
Basic Teradata Query (BTEQ)
Named Pipes Access Module
Shared ICU Libraries for Teradata
Teradata Archive/Recovery Utility (ARC)
Teradata Call Level Interface version 2 (CLIv2)
Teradata Data Connector
Teradata Generic Security Services—Client
FastExport
FastLoad
MultiLoad
MQ Access Module
TPump

28.7.3 Typical Installation


A typical installation includes all the products on each CD. The screens
shown in this section are examples of screens that appear during a typical
installation. Depending on the Teradata Tools and Utilities products used in
the installation, some dialog boxes and screens might vary from those
shown in this guide. This installation installs all the Teradata Tools and
Utilities products.

Step 1: After highlighting Typical and clicking Next in the initial Setup
Type dialog box, the Choose Destination Location dialog box appears. If
the default path shown in the Destination Folder block is acceptable, click
Next. (This is recommended).

If a previous version of the dependent products was not uninstalled, the
install asks to overwrite the software. Click Yes to overwrite the software.

As shown in Fig. 28.9, to use a destination location other than the default,
click Browse, navigate to the location where the files are to be installed,
click OK, then click Next.

One must have write access to the destination folder, the Windows root
folder and the Windows system folder.
Step 2: In the Select Install Method dialog box as shown in Fig. 28.10,
select the products to automatically install (silent install) or clear the
products to interactively install:
a. Highlight the products to be installed silently.

The ODBC Driver for Teradata and Teradata Manager can be installed silently or
interactively; the default is interactive.
Teradata SQL Assistant/Web Edition, Teradata MultiTool and Teradata DQM
Administrator can only be installed interactively.
All other products can be installed silently or interactively; the default is silent.

b. Those not highlighted will be installed interactively, meaning that the product setup
sequence will be activated so you can make adjustments during installation.
c. Click Next.
Fig. 28.9 Choosing destination location

Fig. 28.10 Choosing destination location


Step 3: An Information window shows the path to the folder containing
the product response files for silent installation. To modify a product
response file, use a text editor to do so now. When finished, click OK in the
Information window.
During installation, progress monitors will appear. No action is required.
For silent installations, messages such as “Installing BTEQ. Please wait…” will appear. No
action is required.
The Teradata ODBC driver setup may take several minutes.

Step 4: In the Install Wizard Completion screen, click Finish to
complete the installation.
The ODBC Driver for Teradata setup may take several minutes.
The product setup log is located in the %TEMP%\ClientISS folder.
“Silent Installation Result Codes” lists result codes from silent installations
that are useful for troubleshooting.

Step 5: After the first phase of installation is complete, go to the
installation procedure for the following products as needed:
“Installing the ODBC Driver for Teradata”
“Installing the Teradata SQL Assistant/Web Edition”
“Installing the Teradata MultiTool”
“Installing the Teradata DQM Administrator”
“Installing the Teradata Manager”.

28.7.3.1 Installing the Teradata SQL Assistant/Web Edition


The Teradata SQL Assistant/Web Edition software is only installed from the
Teradata Utility Pak CD. This software is installed in a folder on the drive
where the Microsoft Internet Information Services (IIS) was previously
installed:

<Drive where IIS was installed>:\inetpub\wwwroot

The following steps should be followed to install the Teradata SQL
Assistant/Web Edition software interactively:

Step 1: In the Welcome to the Teradata SQL Assistant/Web Edition
Setup dialog box, click Next.

Step 2: In the License Agreement dialog box, read the agreement, select
I accept the terms in the license agreement, then click Next.

Step 3: In the Select Installation Address dialog box, enter the
appropriate virtual directory and port number, then click Next.

Step 4: In the Confirm Installation dialog box, click Next.

Step 5: The following Information window may appear as shown in Fig.
28.11. If so, click OK.

Fig. 28.11 Information window

Step 6: If the machine.config file could not be modified, the following
warning appears as shown in Fig. 28.12. Click OK to continue the
installation process.

Fig. 28.12 Information window

Step 7: In the Installation Complete dialog box, click Close.

Step 8: If the machine.config file was not modified successfully, refer to
“Changing the machine.config File” for instructions on how to change it
manually.

28.7.3.1 Installing the ODBC Driver for Teradata


If an earlier version of the ODBC Driver for Teradata is installed on a client
system, uninstall it using Add/Remove Programs. You must belong to the Windows
Administrators group of the computer on which you are uninstalling the software.
The installation of an ODBC driver will terminate if an older driver is being
installed on a system that has a newer ODBC driver installed.
When installing the ODBC Driver for Teradata product on a Windows
XP 64-bit system, the installation procedure stops if all of the following
conditions exist:
The ODBC Driver for Teradata product was already on the system.
A custom install was elected.
A silent installation was selected in ODBC_Driver_For_Teradata in the Select Install
Methods dialog.

To prevent the installation procedure from halting, first uninstall all
previous versions of the ODBC Driver for Teradata.
The Installation of the ODBC 13.00.00.00 release on a system that
already has an ODBC driver installed exhibits a different behaviour than
when the ODBC Driver for Teradata is installed on a system that has no
ODBC Driver for Teradata installed.
In this case, only two dialog boxes appear: Resuming the InstallShield
Wizard for ODBC Driver for Teradata, and the Installation Wizard
Complete dialog boxes.
When an installation is executed on a system that already has an ODBC driver installed,
Resuming the InstallShield Wizard for ODBC Driver for Teradata appears. Click Next.
Installation Wizard Complete dialog box appears. Click Finish.

It is to be noted that ODBC 13.00.00.00 is not supported on Microsoft
Windows 95, Windows 98 or Windows NT. If an attempt is made to install
ODBC 13.00.00.00 on one of these operating systems, the InstallShield
program detects the operating system and generates an error message
indicating the incompatibility, then aborts the installation.
Except on 64-bit systems, whenever ODBC is installed, version 2.8 of
the Microsoft Data Access Components (MDAC) should also be installed.
If MDAC is already installed on the computer and an upgrade to version 2.8
is not desired, clear the MDAC check box in the Select Components dialog
box. MDAC version 2.8 must be installed if Teradata SQL Assistant/Web
Edition is used.
MDAC is not installed with 64-bit ODBC, since it is installed as part of the operating
system.
To ensure the highest quality and best performance of the ODBC Driver for Teradata, the
most recent critical post-production updates are downloaded from the Teradata Software
Server at:

http://tssprod.teradata.com:8080/TSFS/home.do

Install the ODBC Driver for Teradata as follows:

Step 1: In the Welcome to the InstallShield Wizard for ODBC Driver
for Teradata dialog box, click Next.
If ODBC driver is being installed for Teradata in interactive mode on a
computer that runs on the Windows 2000 or Windows XP operating system
and do not see the Welcome dialog box, press the Alt-Tab keys and bring
the dialog box to the foreground. This does not apply to silent installation of
this software.

Step 2: In the Choose Destination Location dialog box, if the default
path shown in the Destination Folder block is acceptable, click Next. (This
is recommended).
To use a destination location other than the default, click Browse,
navigate to the location where we want the files installed, click OK, then
click Next.
We must have write access to the destination folder, the Windows root
folder and the Windows system folder.

Step 3: In the Setup Type dialog box, click the name of the desired
installation setup, then click Next:
Custom is for advanced users who want to choose the options to install.
Typical is recommended for most users. All ODBC driver programs will be installed.

If Typical is chosen, the ODBC installation installs version 2.8 of the
Microsoft Data Access Components (MDAC). If MDAC is already
installed on the client computer and an upgrade to version 2.8 is not desired,
select Custom, then clear the MDAC check box in the Select Components
dialog box as shown in Fig. 28.13. MDAC version 2.8 must be installed to
use Teradata SQL Assistant/Web Edition.
Fig. 28.13 Select dialog box

Step 4: In the Select Program Folder dialog box, do one of the
following, then click Next:
Accept the default program folder
Enter a new folder name into the Program Folders text block
Select one of the names in the list of existing folders

Step 5: In the Start Copying Files dialog box, review the information.
When satisfied that it is correct, click Next.
The driver installation begins. During installation, progress monitors
appear. No action is required.

Step 6: Upon completion of the driver installation, the InstallShield
Wizard Complete dialog box appears. Choose to view the “Read Me” file,
run the ODBC Administrator immediately, or do neither, then click Finish.
Use the ODBC Administrator to configure the driver. If the ODBC
Administrator does not run now, it must be run after completing the
Teradata Tools and Utilities installation.
28.7.4 Custom Installation
A custom installation installs only those products selected from the list of
available products. The screens shown in this section are examples of
screens that appear during a Custom Installation. Depending on the
Teradata Tools and Utilities products installed, some dialog boxes and
screens might vary from those shown in this section.

The following steps are performed for a custom installation of Teradata Tools
and Utilities:

Step 1: After highlighting Custom and clicking Next in the initial Setup
Type dialog box, the Select Components dialog box appears as shown in
Fig. 28.14. Do the following:
Select the check boxes for the products to install.
Clear the check boxes for the products not to install.
Click Next.
Fig. 28.14 Select component dialog box

If the product selected is dependent on other products, then those products are also
selected.
If Teradata MultiTool is being installed without the Java 2 Runtime
Environment already present, install the Java 2 Runtime Environment when
prompted to do so.

Step 2: In the Choose Destination Location dialog box as shown in Fig.
28.15, if the default path shown in the Destination Folder block is
acceptable, click Next. (This is recommended).
Fig. 28.15 Choose destination location dialog box

To use a destination location other than the default, click Browse,
navigate to the location where the files are installed, click OK, then click
Next.
Write access to the destination folder, the Windows root folder and the
Windows system folder is required.

Step 3: In the Select Install Method dialog box as shown in Fig. 28.16,
select automatic (silent) or interactive installation for each product.
a. Highlight the products being installed silently.

Shared ICU Libraries for Teradata can only be installed in the silent mode from the
CD media.
The ODBC Driver for Teradata and Teradata Manager can be installed silently or
interactively; the default is interactive.
Teradata SQL Assistant/Web Edition, Teradata MultiTool and Teradata DQM
Administrator can only be installed interactively.
All other products can be installed silently or interactively; the default is silent.
b. The products not highlighted are installed interactively, meaning that the product setup
sequence is activated so that adjustments can be made during installation.
c. Click Next.

Fig. 28.16 Select install method dialog box

Step 4: An Information window shows the path to the folder containing
the product response files for silent installation. To modify a product
response file, use a text editor to do so now.

When finished, click OK in the Information window.


During installation, progress monitors appear. No action is required.
For silent installations, messages such as “Installing BTEQ. Please wait…” appear. No
action is required.
The Teradata ODBC driver setup can take several minutes.
The product setup log is located in the %TEMP%\ClientISS folder.
“Silent Installation Result Codes” lists result codes from silent installations that are useful
for troubleshooting.

Step 5: If other products are selected to install interactively, those setup
programs will execute. Follow the instructions in each dialog box that
appears.
Step 6: After the first phase of the installation is complete, go to the
installation procedure for the following products as needed:
“Installing the ODBC Driver for Teradata”
“Installing the Teradata SQL Assistant/Web Edition”.

Step 7: Which Setup Complete dialog box appears next depends on
whether the client computer should be restarted:
a. If the client computer does not require a restart, choose whether or not to view the Release
Definition, then click Finish.
b. If the client computer does require a restart, there are two options:

Yes, I want to restart my computer now.
No, I will restart my computer later.

It is recommended to select Yes, I want to restart my computer now,
remove the CD from the drive, then click Finish.

28.7.5 Network Installation


A network installation copies the setup files for the selected products to a
specified folder. The network installation does not actually install the
products. This must be done by the user.
The screens shown in this section are examples of screens that appear
during a network installation. Depending on the Teradata Tools and Utilities
products used in the installation, some dialog boxes and screens might vary
from those shown in this guide.
Follow these steps to perform a network installation of Teradata Tools
and Utilities. This installation only copies the setup files of the selected
products to a specified folder.

Step 1: After highlighting Network and clicking Next in the initial
Setup Type dialog box, the Select Components dialog box appears as
shown in Fig. 28.17. Do the following:
Select the boxes for the products whose setup files will be copied.
Clear the boxes for the products not being installed.
Click Next.
Fig. 28.17 Select component dialog box

Step 2: In the Choose Destination Location dialog box as shown in Fig.
28.18, if the default path shown in the Destination Folder block is
acceptable, click Next. (This is recommended).
To use a destination location other than the default, click Browse,
navigate to the location where the files will be installed, click Next.
Write access to the destination folder, the Windows root folder and the
Windows system folder is required.

Step 3: As files are copied, progress monitors appear. No action is
required.

Step 4: After the installation process copies the necessary files to the
specified folder, the Setup Complete dialog box appears. Choose whether
or not to view the Release Definition, then click Finish.
Fig. 28.18 Choose destination location dialog box

28.8 BASIC TERADATA QUERY (BTEQ)

Basic Teradata Query (BTEQ) is like an RDBMS console. This utility
enables us to connect to a Teradata RDBMS server as any valid database user,
set the session environment and execute SQL statements as long as we have
such privileges. BTEQWin is a Windows version of BTEQ. Both of them
work on two components: CLI and TDP/MTDP. BTEQ commands or SQL
statements entered into the BTEQ are packed into CLI packets, then
TDP/MTDP transfers them to RDBMS. BTEQ supports 55 commands
which fall into four groups:
Session control commands
File control commands
Sequence control commands
Format control commands.

Some usage tips and examples for the frequently used commands are given here.

28.8.1 Usage Tips


BTEQ commands consist of a dot character followed by a command keyword, command
options and parameters.
.LOGON teradata/john

A Teradata SQL statement doesn’t begin with a dot character, but it must end with a ‘;’
character.

SELECT * FROM students Where name = 'Jack';

Both BTEQ commands and SQL statements can be entered in any combination of uppercase
and lowercase and mixed-case formats.

.Logoff

If we want to submit a transaction which includes several SQL statements, do as in the
following example:

Select * from students

;insert into students values ('00001', 'Jack', 'M', 25)

;select * from students;

After we enter the last ‘;’ and hit the [enter] key, these SQL requests will
be submitted as a transaction. If any one of these has an error, the whole
transaction will be rolled back.

28.8.2 Frequently Used Commands


LOGON

.logon teradata/thomas

PASSWORD:thomaspass

In the above example, we connect to RDBMS called “teradata”. “teradata” is the TDPID of
the server. “thomas” is the userid and “thomaspass” is the password of the user.

LOGOFF

.logoff

Just logoff from the current user account without exiting from BTEQ.

EXIT or QUIT

.exit
.quit

These two commands are the same. After executing them, it will exit from BTEQ.

SHOW VERSIONS

.show versions

Check the version of the BTEQ currently being used.

SECURITY

.set security passwords

.set security all

.set security none

Specify the security level of messages sent from network-attached systems to the Teradata
RDBMS. With the first command, only messages containing user passwords, such as a CREATE
USER statement, will be encrypted. With the second, all messages will be encrypted; with the
third, no messages will be encrypted.

SESSIONS

.sessions 5

.repeat 3

select * from students;

After executing the above commands one by one, BTEQ will create five sessions running in
parallel. Then it will execute the select request three times. In this situation, three of the five
sessions will each execute the select statement once, in parallel.

QUIET

.set quiet on

.set quiet off

If switched on, the result of the command or SQL statement will not be displayed.

SHOW CONTROLS

.show controls

.show control

Show the current settings for BTEQ software.


RETLIMIT

.set retlimit 4

select * from dbase;

Just display the first 4 rows of the result table and ignore the rest.

.set retlimit 0

select * from dbase;

Display all rows of the result table.

RECORDMODE

.set recordmode on

.set recordmode off

If switched on, all result rows will be displayed in binary mode.

SUPPRESS

.set suppress on 3

select * from students;

If the third column of the students table is Department Name, then the same department
names will be displayed only once on the terminal screen.

SKIPLINE/SKIPDOUBLE

.set skipline on 1

.set skipdouble on 3

select * from students;

During the display of result table, if the value in column 1 changes, skip one blank line to
display the next row. If the value in column 3 changes, skip two blank lines to display the
next row.

FORMAT

.set format on

.set heading "Result:"

.set rtitle "Report Title:"

.set footing "Result Finished"

Add the heading line, report title and footing line to the result displayed on the terminal
screen.

OS

.os command

c:\progra~1\ncr\terada~1\bin> dir

The first command opens a Windows/DOS command prompt, at which OS
commands such as dir, del, copy, etc. can be entered.

.os dir

.os copy sample1.txt sample.old

Another way to execute the OS command is entering the command after the .os keyword.

RUN

We can run a script file which contains several BTEQ commands and SQL requests. Let us
see the following example:

1. Edit a text file, runfile.txt, using Notepad. The file contains:

.set defaults

.set separator "$"

select * from dbase;

.set defaults

2. Run the file:

.run file = runfile.txt

If the working directory of BTEQ is not the same as the directory containing the file,
we must specify the full path.

SYSIN & SYSOUT

SYSIN and SYSOUT are standard input and output streams of BTEQ. They can be
redirected as the following example:
Start -> programs -> accessories -> command prompt

c:\>cd c:\program files\ncr\teradata client\bin


c:\program files\ncr\teradata client\bin> bteq > result.txt
.logon teradata/john
johnpass
select * from students;
.exit

In the above example, all output will be written into the result.txt file and not to the terminal
screen. If the runfile.txt file is placed in the root directory c:\, we can redirect the standard input
stream of BTEQ as in the following example:

c:\>cd c:\program files\ncr\teradata client\bin


c:\program files\ncr\teradata client\bin>bteq < c:\runfile.txt

EXPORT

.export report file = export


select * from students;
.os edit export

The command produces a report as shown in Fig. 28.19.

Fig. 28.19 Report produced by EXPORT command

Command keyword “report” specifies the format of output. “file” specifies the name of
output file. If we want to export the data to a file for backup, use the following command:
.export data file = exdata
select * from students;

we will get all data from the select statement and store them into the file, exdata, in a special
format.

After exporting the required data, we should reset the export options.

.export reset

IMPORT

As mentioned above, we have already stored all data of the students table into the file
exdata. Now, we want to restore it into the database. See the following example:

delete from students;


.import data file = exdata
.repeat 5
using(studentId char(5),name char(20),sex char(1),age integer)
insert into students (studentId, name, sex, age)
values(:studentId, :name, :sex, :age);

The third command requires BTEQ to execute the following command five times.

The last command has three lines. It will insert one row into the students table each time.

MACRO

We can use the SQL statements to create a macro and execute this macro at any time. See
the following example:

create macro MyMacro1 as (
ECHO '.set separator "#"'
; select * from students;
);

This macro executes one BTEQ command and one SQL request.

execute MyMacro1;

This SQL statement executes the macro.


28.9 OPEN DATABASE CONNECTIVITY (ODBC) APPLICATION DEVELOPMENT

The following demo describes each step of using the Teradata
DBMS to develop a database application. In the example, there are two users:
John and Mike. John is the administrator of the application database. Mike
works for John and he is the person who manipulates the table students in
the database everyday.

Step 1: John creates user Mike and the database student_info by
using BTEQ as shown in Fig. 28.20.
Running BTEQ

start -> program -> Teradata Client -> BTEQ

Logon Teradata DBMS server

John was created by the Teradata DBMS administrator and was granted the privileges to
create a USER and a DATABASE, as shown in Fig. 28.21. In Teradata DBMS, the owner
automatically has all privileges on the database he/she creates.

Create a user and a database


Fig. 28.20 Running BTEQ

Fig. 28.21 Privilege granting to create a USER and a DATABASE


Fig. 28.22 Creating a USER

Fig. 28.22 shows how to create a user. In Teradata DBMS, a user is seen as a special database.
The difference between a user and a database is that a user has a password and can log on to the
DBMS, while a database is just a passive object in the DBMS. Fig. 28.23 shows how to create a
database.

Fig. 28.23 Creating a DATABASE

John is the owner of user Mike and database student_info. John has all privileges on this
database, such as creating tables and executing select, insert, update and delete statements. But
we notice that Mike does not have any privileges on this database yet. So John needs to
grant some privileges to Mike for his daily work, as shown in Fig. 28.24 and sketched below.
Fig. 28.24 Granting privilege to Mike
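Since the figures themselves are not reproduced here, the following is only a minimal
sketch of the kind of Teradata SQL statements that Figs. 28.22 through 28.24
illustrate; the PERM sizes are assumed values and the exact options used in the
figures may differ:

CREATE USER mike AS PASSWORD = mikepass, PERM = 1000000;
CREATE DATABASE student_info AS PERM = 1000000;
GRANT SELECT, INSERT, UPDATE, DELETE ON student_info TO mike;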

Create table

After granting appropriate privileges to Mike, John needs to create a table for storing the
information of all students. First, he must specify the database containing the table as shown
in Fig. 28.25.

Fig. 28.25 Specifying a DATABASE

Then he creates the table students as shown in Fig. 28.26.


Fig. 28.26 Creating a table “students”
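As a minimal sketch of what Figs. 28.25 and 28.26 show (the column list is taken
from the BTEQ IMPORT example earlier in this chapter; the actual definition in the
figure may include further options, such as a primary index):

DATABASE student_info;

CREATE TABLE students
   (studentId CHAR(5),
    name      CHAR(20),
    sex       CHAR(1),
    age       INTEGER);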

Using SQL statements such as Select, Insert, Delete and Update

Now, Mike can log on and insert some data into the table students as shown in Figs. 28.27
through 28.29.
Fig. 28.27 Inserting data into table

Fig. 28.28 Inserting data into table


Fig. 28.29 Inserting data into table
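The insert statements in Figs. 28.27 through 28.29 are presumably of the following
form; the first row matches the BTEQ example given earlier in this chapter, while
the second row is a purely hypothetical value added for illustration:

INSERT INTO students VALUES ('00001', 'Jack', 'M', 25);
INSERT INTO students VALUES ('00002', 'Rita', 'F', 23);
SELECT * FROM students;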

In Fig. 28.30, Mike inserts a new row whose first field is “00003”. We notice that
there are two rows whose first fields have the same value. So, Mike decides to delete one of
them as shown in Fig. 28.31.
Fig. 28.30 Inserting a new row in the table

Fig. 28.31 Deleting a row from the table

Logoff and Quit BTEQ

Mike uses the following command

.exit

to logoff and quit BTEQ.

Step 2: Create an ODBC Data Source Name (DSN) to develop an
application.
As shown in Fig. 28.32, the data source modules refer to physical
databases with different DBMSs. A Data Source Name (DSN) is just
like a profile which links the first level, the Application, and the
second level, the ODBC Driver Manager. The DSN profile describes
which ODBC driver will be used in the application, and which account on
which database the application will log on to.

Fig. 28.32 Architecture of DB application based on ODBC

Figs. 28.33 through 28.38 show each step of creating the DSN used in the
user application:
a. start -> settings -> control panel (Fig. 28.33)
Fig. 28.33 Selecting “Control Panel” option for creating DSN

b. click the icon “Administrative Tools” (Fig. 28.34)

Fig. 28.34 Selecting “Administrative Tool” option

c. click the icon “Data Sources (ODBC)” (Fig. 28.35).


Fig. 28.35 Selecting “Data Sources” option

d. The ODBC Data Source Administrator window lists all DSN already created on the
computer as shown in Fig. 28.36. Now, click the button “Add…”

Fig. 28.36 List of DSN created

e. When asked to choose one ODBC driver for the Data Source, choose Teradata. (Fig. 28.37).
Fig. 28.37 Choosing Teradata option

f. As shown in Fig. 28.38, we then need to type in all information about the DSN, such as IP
address of server, username, password and the default database we will use.

Fig. 28.38 Entering DSN information

Step 3: Develop an application by using the ODBC interface.

We can access the table students via the ODBC interface. We need to include
the following header files (for developing the demo for the ODBC interface on Windows
NT/2000 using VC++ 6.0):
#include <sql.h>
#include <sqlext.h>
#include <odbcinst.h>
#include <odbcss.h>
#include <odbcver.h>

We also need to link odbc32.lib. In the VC++ 6.0 development studio, we can
set the link options to complete the appropriate compile and link steps.

The ODBC program scheme is shown below, which is a code segment for
executing a select SQL statement:

SQLAllocEnv(&ODBChenv);
SQLAllocConnect(ODBChenv, &ODBChdbc);
SQLConnect(ODBChdbc, DataSourceName, SQL_NTS,
           DBusername, SQL_NTS, DBuserpassword, SQL_NTS);
SQLAllocStmt(ODBChdbc, &ODBChstmt);
/* construct the SQL command string */
SQLExecDirect(ODBChstmt, (UCHAR *)command, SQL_NTS);
if (/* SQLExecDirect returned SQL_SUCCESS */)
{
    SQLFetch(ODBChstmt);
    while (/* SQLFetch returned SQL_SUCCESS and rows remain */)
    {
        /* process the current row */
        SQLFetch(ODBChstmt);
    }
}
SQLFreeStmt(ODBChstmt, SQL_DROP);
SQLDisconnect(ODBChdbc);
SQLFreeConnect(ODBChdbc);
SQLFreeEnv(ODBChenv);

When the ODBC function SQLConnect() is called in our program, we
need to specify the Data Source Name created in Step 2. Our demo is just a
simple example, though it invokes almost all of the functions often used in
general applications.

Step 4: Running the demo

Copy all files of the project onto the target PC and double click the file
ODBCexample.dsw. The VC++ 6.0 development studio will load the Win32
project automatically as shown in Fig. 28.39. Then, choose the menu item
“build” or “execute ODBCexample”.
Fig. 28.39 Loading Win32 and executing ODBC example

In the next window (Fig. 28.40), we can see all DSN defined on our PC.
We can choose TeradataExample created in Step 2. We do not need to
provide the user name mike and the password mikepass, because they were
already set in DSN. Then, click the button “Connect To Database” to
connect to the Teradata DBMS server.
Fig. 28.40 Choosing Teradata Example from all defined DSN lists

Click the ">>" button to enter the next window sheet. Now, after pressing “Get
Information of All Tables In The Database”, we can see all tables in the
database including students created in Step 1. Then we can choose one table
from the leftmost listbox and enter the next window sheet by clicking ”>>”
as shown in Fig. 28.41.
Fig. 28.41 Listing of all tables

Fig. 28.42 Choosing “Get Scheme of the Table” option


As shown in Figs. 28.42 and 28.43, click the button “Get Scheme of the
Table” to see the definition of the table. If the table we have chosen is
students, we can press “Run SQL on This Table” to execute an SQL
statement on the table.

Fig. 28.43 SQL statement window

After typing the SQL statement in the edit box, you can press the button “Get
Information” to execute it as shown in Fig. 28.44. If we want to add a
student, click “Add Student” as shown in Fig. 28.45.
Fig. 28.44 Choosing “Get Information” option

Fig. 28.45 Adding student information

As shown in Fig. 28.46, an SQL statement can be entered in the edit box to get all
information about student “Jack”.
Fig. 28.46 SQL statement for information of student “Jack”

REVIEW QUESTIONS
1. What is Teradata technology? Who developed Teradata? Explain.
2. Discuss hardware, software and operating system platforms on which Teradata works.
3. Discuss the features of Teradata.
4. List some of the Teradata utilities and Teradata products that are generally used.
5. Briefly discuss the arguments that are used to connect to Teradata using Teradata-specific
SQL procedures. Give examples.
6. What is the purpose of using pass-through facility in Teradata? Discuss with examples how
pass-through facilities are implemented using Teradata-specific SQL procedures.
7. What is Teradata database? With a neat diagram, briefly discuss various components of a
Teradata database.
8. Discuss the functions of parsing engine (PE) and Access Module Processor (AMP).
9. Briefly discuss the various components of Teradata Director Program (TDP) and Teradata
Client software.
STATE TRUE/FALSE

1. Teradata is a massively serial processing system running a shared nothing architecture.


2. The concept of Teradata grew out of research at the California Institute of Technology
(Caltech) and from the discussions of Citibank’s advanced technology group.
3. The name Teradata was chosen to symbolize the ability to manage terabytes (trillions of
bytes) of data.
4. The Teradata DBMS is linearly and predictably scalable in all dimensions of a database
system workload (data volume, breadth, number of users, complexity of queries).
5. Teradata DATABASE statement can be used within the EXECUTE statement in PROC
SQL.
6. Parallel database extensions (PDE) are an interface layer on the top of operating system.
7. A SQL query is submitted and transferred in CLI packet format.

TICK (✓) THE APPROPRIATE ANSWER

1. The Teradata relational database management system, whose parent company was founded in
1979 by Dr. Jack E. Shemer, Dr. Philip M. Neches, Walter E. Muir, Jerold R. Modes,
William P. Worth and Carroll Reed, was developed by

a. Teradata software company


b. IBM
c. Microsoft
d. none of these.

2. Teradata software company was founded in

a. 1980
b. 1990
c. 1979
d. none of these.

3. Teradata company was founded by

a. a group of people, namely Dr. Jack E. Shemer, Dr. Philip M. Neches, Walter E. Muir,
   Jerold R. Modes, William P. Worth and Carroll Reed
b. an individual
c. a corporate house
d. none of these.

4. The concept of Teradata grew out of research and discussions at

a. California Institute of Technology (Caltech)
b. AT&T Lab
c. Citibank’s advanced technology group
d. (a) & (c).

5. Teradata enterprise data warehouses are often accessed via

a. open database connectivity (ODBC)


b. Java database connectivity (JDBC)
c. both (a) & (b)
d. none of these.

6. Teradata RDBMS is a complete relational database management system based on

a. off-the-shelf symmetric multiprocessing (SMP) technology
b. a communication network connecting the SMP systems to form a massively parallel
   processing (MPP) system
c. both (a) & (b)
d. none of these.

7. The functions of parallel database extensions (PDE) are

a. executing vprocs (virtual processors)


b. providing a parallel environment
c. scheduling sessions
d. all of these.

8. BTEQ is a component of

a. CLI
b. TDP
c. Teradata client software
d. none of these.

9. BTEQ is a general purpose program used to submit

a. data
b. command
c. SQL statement
d. all of these.

FILL IN THE BLANKS

1. Teradata is a _____ system running a _____ architecture.


2. Between the years _____ and _____ the concept of Teradata grew out of research at the
California Institute of Technology (Caltech) and from the discussions of Citibank’s
advanced technology group.
3. Due to the _____ features, Teradata is very popular for enterprise data warehousing
applications.
4. The Teradata Tools and Utilities software, together with the Teradata Relational Database
Management System (RDBMS) software, permits communication between a _____ and a
_____.
5. Teradata RDBMS is a complete relational database management system based on _____
technology.
6. _____ is a hardware inter-processor network to link SMP nodes.
7. _____ routes the packets to the specified Teradata RDBMS server.
Answers

CHAPTER 1 INTRODUCTION TO DATABASE SYSTEMS

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS

1. Data
2. Fact, processed/organized/summarized data
3. DBMS
4. (a) Data description language (DDL), (b) data manipulation language (DML)
5. DBMS
6. Database Management System
7. Structured Query Language
8. Fourth Generation Language
9. (a) Operational Data, (b) Reconciled Data, (c) Derived Data
10. Data Definition Language
11. Data Manipulation Language
12. Each of the data mart (a selected, limited, and summarized data warehouse)
13. (a) Entities, (b) Attributes, (c) Relationships, (d) Key
14. (a) Primary key, (b) Secondary key, (c) Super key, (d) Concatenated key
15. (a) Active data dictionary, (b) passive data dictionary
16. Conference of Data Systems Languages
17. List Processing Task Force
18. Data Base Task Force
19. Integrated Data Store (IDS)
20. Bachman
21. Permanent.

CHAPTER 2 DATABASE SYSTEM ARCHITECTURE

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. Table
2. External Level
3. Physical
4. Entity and class
5. Object-oriented
6. E-R diagram
7. DBMS
8.
9.
10. Inherits
11. Physical data independence
12. Logical data independence
13.
14. Conceptual, stored database
15. External, conceptual
16. Upside-down
17. IBM, North American Aviation
18. DBTG/CODASYL, 1960s
19. (a) record type, (b) data items (or fields), (c) links
20. E.F. Codd
21. Client, server.

CHAPTER 3 PHYSICAL DATA ORGANIZATION

STATE TRUE/FALSE
TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. Buffer
2. Secondary storage or auxiliary
3. Buffer
4. External storage
5. RAID
6. Multiple disks
7. (a) magnetic, (b) optical
8. File
9. Redundant arrays of inexpensive disks
10. Indexed Sequential Access Method
11. Virtual Storage Access Method
12. (a) sequential, (b) indexed
13. Magnetic disks
14. Reliability
15. New records
16. Clustering
17. Primary
18. Hard disk
19. Head crash
20. Secondary
21.

a. fixed-length records,
b. variable-length records

22. Search-key
23. Fixed, flexible (removable)
24. Access time
25. Access time
26. Primary key
27. Direct file organization
28. Head activation time
29. Primary (or clustering) index
30. Indexed-sequential file
31. Indexed-sequential file
32. Sequential file
33. Sectors
34. Bytes of storage area
35. Compact disk-recordable
36. WORM
37. Root
38. A data item or record
39. IBM.

CHAPTER 4 THE RELATIONAL MODEL

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. Relation and the set theory
2. Dr. E.F. Codd
3. A Relational Model of Data for Large Shared Data Banks
4. System R
5. Tuple
6. Number of columns
7. Legal or atomic values
8. Field
9. Field
10. Degree
11. Cardinality
12. Primary key
13. Relation
14. Dr. Codd, 1972.

CHAPTER 5 RELATIONAL QUERY LANGUAGES

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. IBM’s Peterlee Centre, 1973
2. Peterlee Relational Test Vehicle (PRTV)
3. LIST
4. SQL
5. SELECT
6. Query Language
7. INGRES
8. Tuple relational calculus language of the relational database system INGRES
9. IBM
10. SQL
11. (a) CREATE, (b) RETRIEVE, (c) DELETE, (d) SORT, (e) PRINT
12. RETRIEVE
13. IBM
14. System R
15. System R
16. Deleting
17. (a) defining relation schemas, (b) deleting relations, (c) modifying relation schemas
18. CREATE, ALTER, DROP
19. INSERT, DELETE, UPDATE
20. ORDER BY
21. GROUP BY
22. GRANT, REVOKE
23. FROM
24. WHERE
25. START AUDIT, STOP AUDIT
26. COMMIT, ROLLBACK
27. (a) audits, (b) analysis
28. (a) AVG, (b) SUM, (c) MIN, (d) MAX, (e) COUNT
29. Very high
30. Domain calculus
31. M.M. Zloof
32. Make-table
33. U command

CHAPTER 6 ENTITY-RELATIONSHIP (ER) MODEL

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. P.P. Chen
2. Object, thing
3. Association, entities
4. Connectivity
5. (a) entities, (b) Attributes, (c) relationships
6. Lines
7. Ternary relationship
8. Recursive relationship
9. Binary relationship
10. Cardinality
11. Entity, relationship
12. Attribute or data items
13. Entity set
14. (a) entity sets, (b) relationship sets, (c) attributes, (d) mapping cardinalities
15. (a) Data, (b) data organisation
16. Strong entity type
17. Simple attribute
18. Primary keys
19. Entity occurrence, entity instance
20. Binary
21. Composite

CHAPTER 7 ENHANCED ENTITY-RELATIONSHIP (EER) MODEL

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS

1. One-to-one (1:1)
2. Subtype, subset, supertype
3. Subtype, supertype
4. Enhanced Entity Relationship (EER) model
5. Redundancy
6. Shared subtype
7. Supertype (or superclass), specialization/generalization
8. Supertype, subclass, specialization/generalization
9. Mandatory
10. Optional
11. One
12. Attribute inheritance
13. ‘d’, circle
14. ‘o’, circle
15. Shared subtype
16. Generalization
17. Generalization
18. Enhanced Entity Relationship.
CHAPTER 8 INTRODUCTION TO DATABASE DESIGN

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS

1. Data fields, applications


2. Bad
3. Reliability, maintainability, software life-cycle costs
4. Information system planning
5. Bottom-up approach, top-down approach, inside-out approach, mixed strategy approach
6. Fundamental level of attributes
7. Development of data models (or schemas) that contains high-level abstractions
8. Identification of set of major entities and then spreading out to consider other entities,
relationships, and attributes associated with those first identified
9. Database requirement analysis
10. Physical database design.

CHAPTER 9 FUNCTIONAL DEPENDENCY AND DECOMPOSITION

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. Functional dependency, attributes
2. (a) determinant, (b) dependent
3. Functionally determines
4. Minimum, determinant
5. Functional dependencies
6. FDs, FDs
7. Breaking down
8. Spurious tuples, natural join
9. Loss of information
10. Non-redundant set and complete sets (or closure) of.

CHAPTER 10 NORMALIZATION

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS

1. Decomposing, redundancy
2. Normalization
3. Normalization
4. E. F. Codd
5. Atomic
6. 1NF, fully functionally dependent
7. Composite, attribute
8. 1NF
9. Primary (or relation) key
10. X → Y
11. 3NF
12. Candidate key
13. 3NF, 2NF
14. Primary, candidate, candidate
15. MVDs
16. Consequence.

CHAPTER 11 QUERY PROCESSING AND OPTIMIZATION

STATE TRUE/FALSE
TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. High-level, execution plan
2. Query compilation
3. Parses, syntax
4. Query processing
5. Syntax analyzer uses the grammar of SQL as input and the parser portion of the query
processor
6. Algebraic expression (relational algebra query)
7. (a) Syntax analyser, (b) Query decomposer, (c) Query Optimizer (d) Query code generator
8. (a) Heuristic query optimisation, (b) Systematic estimation
9. Query optimizer
10. Query decomposer
11. (a) Query analysis, (b) Query normalization, (c) Semantic analysis, (d) Query simplifier, (e)
Query restructuring
12. Query analysis
13. Query normalization
14. Semantic analyser
15. Query restructuring
16. (a) number of I/Os, (b) CPU time
17. Relational algebra
18. Query, query graph
19. Initial (canonical), optimised, efficiently executable
20. Size, type
21. On-the-fly processing
22. Materialization

CHAPTER 12 TRANSACTION PROCESSING AND CONCURRENCY CONTROL

STATE TRUE/FALSE
TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS

1. Logical unit
2. Concurrency control
3. Wait-for-graph.
4. Read, Write
5. Concurrency control
6. All actions associated, none
7. (a) Atomicity, (b) Consistency, (c) Isolation, (d) Durability
8. Isolation
9. Transaction recovery subsystem
10. Record, transactions, database
11. Recovery subsystem
12. A second transaction
13. Data integrity
14. Consistent state
15. Concurrency control
16. Committed
17. Updates
18. Aborted
19. Non-serial schedule
20. Serializability
21. Cascading rollback
22. Granularity
23. Validation or certification method
24. Rollback
25. Transaction
26. Inconsistency
27. Serializability
28. Database record
29. Multiple-mode
30. READ
31. Concurrent processing
32. Unlocking
33. All locks, new.

CHAPTER 13 DATABASE RECOVERY SYSTEMS

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS

1. Database recovery
2. Rollback
3. Global undo
4. Force approach, force writing
5. No force approach
6. A single-user
7. NO-UNDO/NO-REDO
8. Transaction management
9. (a) data inconsistencies, (b) data loss
10. (a) hardware failure, (b) software failure, (c) media failure, (d) network failure
11. Inconsistent state, consistent
12. (a) different building, (b) protected against danger
13. Main memory
14. (a) loss of main memory including the database buffer, (b) the loss of the disk copy
(secondary storage) of the database
15. Head crash (record scratched by a phonograph needle)
16. COMMIT point
17. Without waiting, transaction log
18. (a) a current page table, (b) a shadow page table
19. Force-written
20. Buffer management, buffer manager.
CHAPTER 14 DATABASE SECURITY

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. Protection, threats
2. Database security
3. (a) sabotage of hardware, (b) sabotage of applications
4. Invalid, corrupted
5. Authorization
6. GRANT, REVOKE
7. Authorization
8. Data encryption
9. Authentication
10. Coding or scrambling
11. (a) Simple substitution method, (b) Polyalphabetic substitution method
12. DBA
13. Access rights (also called privileges)
14. Access rights (also called privileges)
15. The Bell-LaPadula model
16. Firewall
17. Statistical database security.

CHAPTER 15 OBJECT-ORIENTED DATABASE

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER


FILL IN THE BLANKS
1. Late 1960s
2. Third
3. Semantic, object-oriented programming
4.
5. Real-world, database objects, integrity, identity
6. State, behaviour
7. Object-oriented programming languages (OOPLs)
8. Structure (attributes), behaviour (methods)
9. Class, objects
10. Data structure, behaviour (methods)
11. Only one
12. Class
13. An object database schema
14. ODMG object
15. Embedded, these programming languages

CHAPTER 16 OBJECT-RELATIONAL DATABASE

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS

1. RDBMS, object-oriented
2. Complex objects type
3. Object-oriented
4. Complex data
5. HP
6. Universal server
7. Client, server
8.

a. Complexity and associated increased costs,


b. Loss of simplicity and purity of the relational model

9. (a) Reduced network traffic, (b) Reuse and sharing


10. ORDBMS.

CHAPTER 17 PARALLEL DATABASE SYSTEM

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. Parallel processing
2. Synchronization
3. Linear speed-up
4. Parallel processing
5. Efficient
6. Low
7. Capacity, throughput
8. Shared-nothing architecture
9. Higher
10. Speed-of-light
11. CPU
12. Speed-up
13. Execution time of a task on the original or smaller machine (or original processing time),
execution time of same task on the parallel or larger machine (or parallel processing time)
14. Original or small processing volume, parallel or large processing volume
15. Degree, parallelism
16. The sizes, response time
17. Concurrent tasks
18. Hash.
CHAPTER 18 DISTRIBUTED DATABASE SYSTEMS

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS

1. Several sites, communication network


2. Geographically distributed
3. Client/Server architectures
4. (a) Provide local autonomy, (b) Should be location independent
5. Distributed database system (DDBS)
6. A multi-database system, a federated database system (FDBS)
7. (a) client, (b) server
8. (a) Clients in form of intelligent workstations as the user’s contact point, (b) DBMS server as
common resources performing specialized tasks for devices requesting their services, (c)
Communication networks connecting the clients and the servers, (d) Software applications
connecting clients, servers and networks to create a single logical architecture
9. (a) sharing of data, (b) increased efficiency, (c) increased local autonomy
10. (a) Recovery of failure is more complex, (b) Increased software development cost, (c) Lack
of standards
11. Data access middleware
12. Software, queries, transactions
13. Tuples (or rows), attributes
14. UNION
15. Processing speed
16. Update transactions
17. Size, communication
18. Local wait-for graph
19. Global wait-for graph
20. (a) read timestamp, (b) the write timestamp
21. (a) voting phase, (b) decision phase
22. Blocking.
CHAPTER 19 DECISION SUPPORT SYSTEM (DSS)

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. Decision-making
2. 1940s, 1950s, research, behavioural, scientific theories, statistical process control
3. Scott-Morton, 1970s
4. RPG or data retrieval products such as Focus, Datatrieve, and NOMAD
5. (a) Data Management (Data extraction and filtering), (b) Data store, (c) End-user tool, (d)
End-user presentation tool
6. (a) business, (b) business model
7. Daily business transactions, strategic business, operational
8. (a) time span, (b) granularity, (c) dimensionality.

CHAPTER 20 DATA WAREHOUSING AND DATA MINING

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. Data mining
2. Large centralized Secondary storage
3. (a) subject-oriented, (b) integrated, (c) time-variant, (d) non-volatile
4. Business-driven information technology
5. (a) Data acquisition, (b) Data storage, (c) Data access
6. Current, integrated
7. Modelled, reconciled, read/write, transient, current
8. Arbor Software Corp., 1993
9. Exploratory data analysis, knowledge discovery, machine learning
10. Knowledge discovery in databases (KDD)
11. (a) data selection, (b) pre-processing (data cleaning and enrichment), (c) data transformation
or encoding, (d) data mining, (e) reporting, (f) display of the discovered information
(knowledge delivery)
12. (a) Prediction, (b) Identification, (c) Classification, (d) Optimization
13. Attributes, data
14. Data patterns.

CHAPTER 21 EMERGING DATABASE TECHNOLOGIES

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS

1. Computer networks, communication, computers


2. Internet, Web servers
3. Web
4. Hypertext Markup Language
5. Web pages
6. Uniform resource locator
7. IP address
8. Domain name servers
9. External applications, information servers, such as HTTP or Web servers
10. HTTP
11. Spatial databases
12. Multimedia databases
13. Spatial data model
14. Layers
15. Spatial Data Option
16. Range query
17. Nearest neighbour query or Adjacency
18. A geometry or geometric object
19. Spatial overlay
20. Spatial joins or overlays
21. Content-based
22. Singular value decomposition
23. Frame segment trees
24. Multidimensional.

CHAPTER 23 IBM DB2 UNIVERSAL DATABASE

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. DB2 Client Application Enabler (CAE)
2. ANSI/ISO SQL-92 Entry Level standard
3. Application Requester (AR) protocol and an Application Server (AS) protocol.
4. DB2 Connect Personal Edition
5. Client/server-supported
6. Parallelism, very large, very high
7. Single-user
8. Enterprise-Extended Edition product, partitioned
9. Embedded SQL, DB2 Call Level Interface (CLI)
10. DRDA Application Server
11. Single
12. Text
13. User-defined functions (UDFs)
14. Unexpected patterns
15. Graphical tools, DB2
16. Tutors, creating objects
17. Command Line Processor
18. SQL statements, DB2 Commands.
CHAPTER 24 ORACLE

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. DEC PDP-11
2. 2000, 2001
3. Wireless
4. Program logic
5. Mobile
6. Interactive user interface
7. System Global Area.

CHAPTER 25 MICROSOFT SQL SERVER

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. Relational database, Sybase, UNIX
2. Queries
3. Transact-SQL statements, manageable blocks
4. Simple Network Management Protocol (SNMP).
5. Microsoft Distributed Transaction Coordinator (MS DTC), Microsoft Transaction Server
(MTS).

CHAPTER 26 MICROSOFT ACCESS

STATE TRUE/FALSE


TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS

1. Windows
2. (a) the Design view, (b) the Datasheet view
3. Dynaset
4. Spreadsheet
5. (a) make-table, (b) delete, (c) append, (d) update.

CHAPTER 27 MYSQL

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER

FILL IN THE BLANKS


1. Multi-layered
2. Open source, MySQL AB
3. InnoDB, tablespace
4. PHP Hypertext Preprocessor
5. HTML-embedded
6. Web server
7. Interpreted, Perl.

CHAPTER 28 TERADATA RDBMS

STATE TRUE/FALSE

TICK (✓) THE APPROPRIATE ANSWER


FILL IN THE BLANKS

1. Massively parallel processing, shared nothing


2. 1976, 1979
3. Scalability
4. Teradata client, Teradata RDBMS
5. Off-the-shelf Symmetric Multiprocessing (SMP)
6. BYNET
7. Teradata director program (TDP)
Bibliography

Al Stevens (1987), Database Development, McGraw-Hill International Editions, Singapore:


Computer Science Series, Management Information Source Inc.
Balter, Alison (1999), Mastering Microsoft Access–2000 Development, New Delhi: Techmedia.
Bayross, Ivan (2004), Professional Oracle Project, New Delhi: BPB Publications.
Berson, Alex, Stephen Smith, and Kurt Thearling (2000), Building Data Mining Applications for
CRM, New Delhi: Tata McGraw-Hill Publishing Co. Ltd.
Bertino, Elisa and Martino Lorenzo (1993), Object Oriented Database System: Concepts and
Architectures, London: Addison-Wesley Publishing Co.
Bobak, Angelo R. (1996), Distributed and Multi-Database Systems, Boston: Artech House.
Bontempo, Charles J. and Cynthia Maro Saraeco (1995), Database Management Principles and
Products, New Jersey: Prentice Hall PTR.
Brown, Lyunwood (1997), Oracle Database Administration on UNIX Systems, New Jersey: Prentice
Hall PTR.
Campbell, Mary (1994), The Microsoft Access Handbook, USA: Osborne McGraw Hill.
Ceri, Stefano and Pelagatti Giuseppe (1986), Distributed Databases Principles and Systems, USA:
McGraw-Hill Book Co.
Chamberlin, Don (1998), A Complete Guide to DB2 Universal Database, San Francisco, California:
Morgan Kaufmann Publishers, Inc.
Collins, William J. (2003), Data Structure and the Standard Template Library, New Delhi: Tata
McGraw-Hill Publishing Co.
Connolly, Thomas M. and Carolyn E. Begg (2003), Database Systems–A Practical Approach to
Design, Implementation, and Management, 3rd Edition, Delhi: Pearson Education.
Conte, Paul (1997), Database Design and Programming for DB2/400, USA: Duke Press Colorado.
Coronel, Peter Rob Carlos (2001), Database Systems, Design, Implementation and Management, 3rd
Edition, New Delhi: Galgotia Publications Pvt. Ltd.
Couchman, Jason S. (1999), Oracle Certified Professional DBA Certification Exam Guide, 2nd
Edition, New Delhi: Tata McGraw-Hill Publishing Co.
Date, C.J. (2000), An Introduction to Database Systems, 7th Edition, USA: Addison-Wesley.
Desai, Bipin C. (2003), An Introduction to Database Systems, New Delhi: Galgotia Publications Pvt.
Ltd.
Devlin, Barry (1997), Data Warehouse from Architecture to Implementation, Massachusetts, USA:
Addison-Wesley.
Drozdek, Adam (2001), Data Structure and Algorithms in C++, 2nd Edition, New Delhi: Vikas Pub.
House.
Drozdek, Adam (2001), Data Structure and Algorithms in Java, New Delhi: Vikas Pub. House.
Easwarakumar, K.S. (2000), Object-Oriented Data Structure Using C++, New Delhi: Vikas Pub.
House.
Elmasri, Ramez, Shamkant B. Navathe (2000), Fundamentals of Database Systems, 3rd Edition,
USA: Addison-Wesley.
Gillenson, Mark L. (1990), Database Step-by-Step, 2nd Edition, USA: John Wiley & Sons.
Harrison, Guy (1997), Oracle SQL High-Performance Tuning, New Jersey: Prentice Hall PTR.
Hawryszkiewycz, I.T. (1991), Database Analysis and Design, 2nd Edition, USA: Macmillan Pub Co.
Hernandez, Michael J. (1999), Database Design for Mere Mortals–A Hands-on Guide to Relational
Database Design, USA: Addison-Wesley Developer Press.
Hoffer, Jeffrey A., Mary B. Prescott, and Fred R. McFadden (2002), Modern Database Management,
6th Edition, Delhi: Pearson Education.
Ishikawa, Hiroshi (1993), Object Oriented Database Systems, Berlin: Springer–Verlag.
Ivan, Bayross (1997), Commercial Application Development Using Oracle Developer 2000, New
Delhi: BPB Publications.
Janacek, Calene and Dwaine Snow (1997), DB2 Universal Database Certification Guide, 2nd
Edition, New Jersey: IBM International Technical Support Organisation and Prentice Hall PTR.
Langsam, Yedidyah, Moshe J. Augenstein, and Aaron M. Tenenbaum (1996), Data Structure Using
C++, 2nd Edition, Delhi: Pearson Education.
Leon, Alexis and Mathews Leon (2002), Database Management System, Chennai: Leon Vikas.
Lipschutz, Seymour (2001), Schaum’s Outline Series on Theory and Problems of Data Structure,
New Delhi: Tata McGraw-Hill Edition.
Lockman, David (1997), Teach Yourself Oracle 8 Database Development in 21 Days, New Delhi:
SAMS Publishing Techmedia.
Martin, James and Joe Leben (1995), Client Server Databases, New Jersey: Prentice Hall PTR.
Mattison, Rob (1996), Data Warehousing Strategies, Technologies, and Techniques, New York:
McGraw-Hill.
Mattison, Rob (1999), Web Warehousing and Knowledge Management, New Delhi: Tata McGraw-
Hill Publishing Co. Ltd.
North, Ken (1999), Database Magic with Ken North, New Jersey: Prentice Hall PTR.
O’Neil, Patrick and Elizabeth O’Neil (2001), Database Principles Programming and Performance,
2nd Edition, Singapore: Harcourt Asia Pte Ltd.
Preiss, Bruno R. (1999), Data Structure and Algorithms with Object-Oriented Design Patterns in
C++, USA: John Wiley & Sons Inc.
Ramakrishnan, Raghu and Johannes Gehrke (2000), Database Management Systems, 2nd Edition,
McGraw-Hill International Editions.
Rao, Bindu R. (1994), Object-Oriented Databases, USA: McGraw-Hill Inc.
Rumbaugh, James, Michael Blaha, William Premerlani, Fredrick Eddy, and William Lorensen
(1991), Object-Oriented Modelling and Design, Delhi: Pearson Education.
Ryan, Nick and Dan Smith (1995), Database System Engineering, International Thomson Computer
Press.
Silberschatz, Abraham, Henry F. Korth, and S. Sudarshan (2002), Database System Concepts, 4th
Edition, Singapore: McGraw-Hill Co. Inc.
Singh, Harry (1998), Data Warehousing Concepts, Technologies, Implementation and Management,
New Jersey: Prentice Hall PTR.
Teorey, Toby J. (1999), Database Modelling and Design, 3rd Edition, Singapore: Harcourt Asia Pte
Ltd.
Tremblay, Jean Paul, and Paul G. Sorenson (1991), An Introduction to Data Structures with
Applications, 2nd Edition, New Delhi: Tata McGraw-Hill Publishing Co.
Turban, Efraim (1990), Decision Support and Expert Systems Management Support Systems, New
York: Macmillan Publishing Co.
Ullman, Jeffrey D. (1999), Principles of Database System, 2nd Edition, New Delhi: Galgotia
Publications (P) Ltd.
Van Amstel, J.J., and Jaan Porters (1989), The Design of Data Structures and Algorithms, Prentice
Hall (UK) Ltd.
Visser, Susan and Bill Wong (2004), SAMS Teach Yourself DB2 Universal Database in 21 Days, 2nd
Edition, Delhi: Pearson Education Inc.
Wiederhold, Gio (1983), Database Design, 2nd Edition, Singapore: McGraw-Hill Book Co.
Database Systems

S. K. SINGH

CHAPTER-1 INTRODUCTION TO DATABASE SYSTEMS

Data: A known fact that can be recorded and that has implicit meaning.

Information: A processed, organised or summarised data.

Data warehouse: A collection of data designed to support management in


the decision-making process.

Metadata: Data about the data.

System Catalog: Repository of information describing the data in the


database, that is the metadata.

Data Item (or Field): The smallest unit of data that has meaning to its user.

Record: A collection of logically related fields or data items.

File: A collection of related sequence of records.

Data Dictionary: A repository of information about a database that


documents data elements of a database.

Entity: A real physical object or an event.

Attribute: A property or characteristic (field) of an entity.


Relationships: Associations or the ways that different entities relate to each
other.

Key: A data item (or field) that a computer uses to identify a record in a
database system.

Database: A collection of logically related data stored together that is


designed to meet the information needs of an organisation.

Database System: A generalized software system for manipulating


databases.

Database Administrator: An individual person or group of persons with


an overview of one or more databases who controls the design and the use
of these databases.

Data Definition Language (DDL): A special language used to specify a


database conceptual schema using set of definitions.

Data Manipulation Language (DML): A mechanism that provides a set


of operations to support the basic data manipulation operations on the data
held in the database.
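
A minimal SQL sketch may make the DDL/DML distinction concrete; the EMPLOYEE table and its
columns below are hypothetical examples, not definitions taken from the text:

    CREATE TABLE EMPLOYEE (            -- DDL: defines the relation schema
        EmpNo   INTEGER PRIMARY KEY,
        EmpName VARCHAR(30),
        DeptNo  INTEGER
    );

    INSERT INTO EMPLOYEE VALUES (101, 'A. Kumar', 10);   -- DML: adds a row
    SELECT EmpName FROM EMPLOYEE WHERE DeptNo = 10;      -- DML: retrieves rows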

Fourth-Generation Language (4GL): A non-procedural programming


language that is used to improve the productivity of the DBMS.

Transaction: All work that logically represents a single unit.

CHAPTER-2 DATABASE SYSTEM ARCHITECTURES

Schema: A framework into which the values of the data items (or fields)
are fitted.

Subschema: An application programmer’s (user’s) view of the data item


types and record types, which he or she uses.
Internal Level: Physical representation of the database on the computer,
found at the lowest level of abstraction of data-base.

Conceptual Level: Complete view of the data requirements of the


organisation that is independent of any storage considerations.

External Schema: Definition of the logical records and the relationships in


the external view.

Physical Data Independence: Immunity of the conceptual (or external)


schemas to changes in the internal schema.

Logical Data Independence: Immunity of the external schemas (or


application programs) to changes in the conceptual schema.
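
A hedged SQL illustration (the STUDENT table and the view are invented for this sketch): an
external view can insulate application programs from some changes to the conceptual schema,
which is what logical data independence describes:

    CREATE TABLE STUDENT (
        RollNo INTEGER PRIMARY KEY,
        Name   VARCHAR(30),
        Marks  INTEGER
    );

    CREATE VIEW STUDENT_LIST AS                    -- external schema used by programs
        SELECT RollNo, Name FROM STUDENT;

    ALTER TABLE STUDENT ADD Address VARCHAR(60);   -- conceptual-schema change; programs
                                                   -- written against STUDENT_LIST still work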

Mappings: Process of transforming requests and results between the three


levels.

Query Processor: Transforms user queries into a series of low-level


instructions directed to the run time database manager.

Run Time Database Manager: The central software component of the


DBMS, which interfaces with user-submitted application programs and
queries.

Model: A representation of the real world objects and events and their
associations.

Relational Data Model: A collection of tables (also called relations).

(E-R) Model: A logical database model, which has a logical representation


of data for an enterprise or business establishment.

Object-Oriented Data Model: A logical data model that captures the


semantics of objects supported in an object-oriented programming.
Client/Server Architecture: A part of the open systems architecture in
which all computing hardware, operating systems, network protocols and
other software are interconnected as a network and work in concert to
achieve user goals.

CHAPTER-3 PHYSICAL DATA ORGANISATION

Primary Storage Devices: Directly accessible by the processor. Primary


storage devices, also called main memory, store active executing programs,
data and portion of the system control program (for example, operating
system, database management system, network control program and so on)
that is being processed. As soon as a program terminates, its memory
becomes available for use by other processes.

Cache Memory: A small storage that provides a buffering capability by


which the relatively slow and increasingly large main memory can
interface to the central processing unit (CPU) at the processor cycle time.

RAID: A disk array arrangement in which a large number of small


independent disks operate in parallel and act as a single higher-performance
logical disk in place of a single very large disk.

Buffer: Part of main memory that is available for storage of contents of


disk blocks.

File Organisation: A technique of physical arrangement of records of a file


on secondary storage device.

Sequential File: A set of contiguously stored records on a physical storage


device.

Index: A table or a data structure, which is maintained to determine the


location of rows (records) in a file that satisfy some condition.
Primary Index: An ordered file whose records are of fixed length with two
fields.
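
In SQL-based systems, the index structures described above are usually requested declaratively;
a hedged sketch (the table and index names are hypothetical):

    CREATE TABLE PART (
        PartNo   INTEGER PRIMARY KEY,    -- usually backed by a primary (clustering) index
        PartName VARCHAR(40)
    );

    CREATE INDEX idx_part_name ON PART (PartName);   -- secondary index on a non-key field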

CHAPTER-4 RELATIONAL ALGEBRA AND CALCULUS

Relation: A fixed number of named columns (or attributes) and a variable


number of rows (or tuples).

Domain: A set of atomic values usually specified by name, data type,


format and constrained range of values.

Relational Algebra: A collection of operations to manipulate or access


relations.
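
For instance, with a hypothetical relation EMPLOYEE(EmpNo, EmpName, DeptNo), two of the basic
operations can be written (in LaTeX-style notation) as:

    \sigma_{DeptNo = 10}(EMPLOYEE)    % selection: only the tuples of department 10
    \pi_{EmpName}(EMPLOYEE)           % projection: only the EmpName attribute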

CHAPTER-5 RELATIONAL QUERY LANGUAGES

Information System Based Language (ISBL): A pure relational algebra


based query language, which was developed in IBM’s Peterlee Centre in
the UK.

Query Language (QUEL): A tuple relational calculus language of a


relational database system INGRES.

Structured Query Language (SQL): A relational query language used to


communicate with the relational database management system (RDBMS).

SQL Data Query Language (DQL): SQL statements that enable the users
to query one or more tables to get the information they want.
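
A small hedged example (the EMPLOYEE table and its columns are invented here) combining the
common query clauses noted in the Chapter 5 answers above:

    SELECT   DeptNo, AVG(Salary) AS AvgSal
    FROM     EMPLOYEE
    WHERE    Salary > 10000
    GROUP BY DeptNo
    ORDER BY AvgSal DESC;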

Query-By-Example (QBE): A two-dimensional domain calculus language.


Originally developed for mainframe database processing.

CHAPTER-6 ENTITY RELATIONSHIP (E-R) MODEL


E-R model: A logical representation of data for an enterprise. It was
developed to facilitate database design by allowing specification of an
enterprise schema.

Entity: An ‘object’ or a ‘thing’ in the real world with an independent


existence and that is distinguishable from other objects.

Relationship: An association among two or more entities that is of interest


to the enterprise.

Attribute: Property of an entity or a relationship type described using a set


of attributes.

Constraints: Restrictions on the relationships as perceived in the ‘real


world’.

CHAPTER-7 ENHANCED ENTITY-RELATIONSHIP (EER) MODEL

Subclasses or Subtypes: Sub-grouping of occurrences of entities in an


entity type that is meaningful to the organisation and that shares common
attributes or relationships distinct from other sub-groupings.

Superclass or Supertype: A generic entity type that has a relationship with


one or more subtypes.

Attribute Inheritance: A property by which subtype entities inherit values


of all attributes of the supertype.

Specialisation: The process of identifying subsets of an entity set (the


superclass or supertype) that share some distinguishing characteristic.

Generalisation: A process of identifying some common characteristics of a


collection of entity sets and creating a new entity set that contains entities
possessing these common characteristics. A process of minimising the
differences between the entities by identifying the common features.
Categorisation: A process of modelling of a single subtype (or subclass)
with a relationship that involves more than one distinct supertype (or
superclass).

CHAPTER-8 INTRODUCTION TO DATABASE DESIGN

Software Development Life Cycle (SDLC): Software engineering


framework that is essential for developing reliable, maintainable and cost-
effective application and other software.

Structured System Analysis and Design (SSAD): A software engineering


approach to the specification, design, construction, testing and maintenance
of software for maximising the reliability and maintainability of the system
as well as for reducing software life-cycle costs.

Structured Design: A specific approach to the design process that results


in small, independent, black-box modules, arranged in a hierarchy in a top-
down fashion.

Cohesion: A measure of how well the components of a module fit together.

Coupling: A measure of the interconnections among modules in software.

Database Design: A process of designing the logical and physical structure


of one or more databases.

CASE Tools: Software that provides automated support for some portion of
the systems development process.

CHAPTER-9 FUNCTIONAL DEPENDENCY AND DECOMPOSITION

Functional Dependency (FD): A constraint between two sets of attributes of a

relation; a property of the meaning of the information represented by the relation.
Functional Decomposition: A process of breaking down the functions of
an organisation into progressively greater (finer and finer) levels of detail.

CHAPTER-10 NORMALIZATION

Normalization: A process of decomposing a set of relations with


anomalies to produce smaller and well-structured relations that contain
minimum or no redundancy.

Normal Form: State of a relation that results from applying simple rules
regarding functional dependencies (FDs) to that relation.
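
A compact worked example, using a hypothetical ORDER_ITEM relation that is not taken from the
text, illustrates the idea for second normal form (2NF):

    ORDER_ITEM (OrderNo, PartNo, Qty, PartDesc)    -- key: (OrderNo, PartNo)
    -- FD: PartNo -> PartDesc, so PartDesc depends on only part of the key.
    -- Decomposing removes the partial dependency and the redundancy it causes:
    ORDER_ITEM (OrderNo, PartNo, Qty)              -- key: (OrderNo, PartNo)
    PART (PartNo, PartDesc)                        -- key: PartNo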

CHAPTER-11 QUERY PROCESSING AND OPTIMIZATION

Query Processing: The procedure of transforming a high-level query (such


as SQL) into a correct and efficient execution plan expressed in low-level
language that performs the required retrievals and manipulations in the
database.

Query Decomposition: The first phase of query processing whose aims are
to transform a high-level query into a relational algebra query and to check
whether that query is syntactically and semantically correct.

CHAPTER-12 TRANSACTION PROCESSING AND CONCURRENCY CONTROL

Transaction: A logical unit of work of database processing that includes


one or more database access operations.
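
As a hedged SQL sketch (the ACCOUNT table and account numbers are hypothetical, and the exact
transaction-control syntax varies between products), a funds transfer is one logical unit of
work: either both updates take effect or neither does.

    UPDATE ACCOUNT SET Balance = Balance - 500 WHERE AccNo = 'A-101';
    UPDATE ACCOUNT SET Balance = Balance + 500 WHERE AccNo = 'A-202';
    COMMIT;    -- make both updates permanent; ROLLBACK would undo both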

Consistent Database: One in which all data integrity constraints are


satisfied.

Schedule: A sequence of actions or operations (for example, reading,


writing, aborting or committing) that is constructed by merging the actions
of a set of transactions, respecting the sequence of actions within each
transaction.
Lock: A variable associated with a data item that describes the status of the
item with respect to possible operations that can be applied to it.

Deadlock: A condition in which two (or more) transactions in a set are


waiting simultaneously for locks held by some other transaction in the set.
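
A schematic interleaving, written in generic lock-request notation (the transactions T1, T2 and
the data items X, Y are invented for this sketch), shows how such a cycle arises:

    T1: write_lock(X)    -- granted
    T2: write_lock(Y)    -- granted
    T1: write_lock(Y)    -- T1 waits for T2
    T2: write_lock(X)    -- T2 waits for T1: a cycle in the wait-for graph, i.e. a deadlock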

Timestamp: A unique identifier created by the DBMS to identify the


relative starting time of a transaction.

CHAPTER-13 DATABASE RECOVERY SYSTEM

Database Recovery: A process of restoring the database to a correct


(consistent) state in the event of a failure.

Forward Recovery: A recovery procedure, which is used in case of a


physical damage, for example crash of disk pack (secondary storage),
failures during writing of data to database buffers, or failure during flushing
(transferring) buffers to secondary storage.

Backward Recovery: A recovery procedure, used in case an error occurs


in the midst of normal operation on the database.

Checkpoint: The point of synchronisation between the database and the


transaction log file.

CHAPTER-14 DATABASE SECURITY

Authorisation: The process of granting a right or privilege to user(s)


to have legitimate access to a system or to objects (such as database tables) of the
system.
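
In SQL, authorisation is typically administered with the GRANT and REVOKE statements noted in
the Chapter 14 answers above; a hedged sketch (the user name and table are hypothetical):

    GRANT SELECT, UPDATE ON EMPLOYEE TO clerk1;
    REVOKE UPDATE ON EMPLOYEE FROM clerk1;    -- withdraw part of the privilege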

Authentication: A mechanism that determines whether a user is who he or


she claims to be.
Audit Trail: A special file or database in which the system automatically
keeps track of all operations performed by users on the regular data.

Firewall: A system designed to prevent unauthorized access to or from a


private network.

Data Encryption: A method of coding or scrambling of data so that


humans cannot read them.

CHAPTER-15 OBJECT-ORIENTED DATABASES

Object-Oriented Data Models (OODMs): Logical data models that


capture the semantics of objects supported in object-oriented programming.

Object-Oriented Database (OODB): A persistent and sharable collection


of objects defined by an OODM.

Object: An abstract representation of a real-world entity that has a unique


identity, embedded properties and the ability to interact with other objects
and itself.

Class: A collection of similar objects with shared structure (attributes) and


behaviour (methods).

Structure: The association of class and its objects.

Inheritance: The ability of an object within the structure (or hierarchy) to


inherit the data structure and behaviour (methods) of the classes above it in the hierarchy.

CHAPTER-17 PARALLEL DATABASE SYSTEMS

Speed-up: A property in which the time taken for performing a task


decreases in proportion to the increase in the number of CPUs and disks in
parallel.
Scale-up: A property in which the performance of the parallel database is
sustained if the number of CPUs and disks is increased in proportion to the
amount of data.
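
In the notation implied by the Chapter 17 fill-in answers above, the two measures can be
written as:

    \text{speed-up} = \frac{\text{execution time of the task on the original (smaller) machine}}
                           {\text{execution time of the same task on the parallel (larger) machine}}

    \text{scale-up} = \frac{\text{elapsed time for the original (small) processing volume on the small system}}
                           {\text{elapsed time for the parallel (large) processing volume on the large system}}

Linear speed-up means the first ratio equals N when the resources are increased N-fold; linear
scale-up means the second ratio stays at 1 as data and resources grow together.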

CHAPTER-18 DISTRIBUTED DATABASE SYSTEMS

Distributed Database System (DDBS): A database physically stored on


several computer systems across several sites connected together via
communication network.

Client/Server: A DBMS-related workload is split into two logical


components namely client and server, each of which typically executes on
different systems.

Middleware: A layer of software, which works as a special server and


coordinates the execution of queries and transactions across one or more
independent database servers.

Data Fragmentation: Technique of breaking up the database into logical


units, which may be assigned for storage at the various sites.
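
A hedged illustration of horizontal fragmentation (the CUSTOMER table, its City column and the
two sites are invented, and CREATE TABLE ... AS syntax varies between products): each fragment
is a selection on the global relation, and the Chapter 18 answers above note that the global
relation is reconstructed with UNION.

    CREATE TABLE CUSTOMER_DELHI AS                 -- fragment stored at the Delhi site
        SELECT * FROM CUSTOMER WHERE City = 'Delhi';

    CREATE TABLE CUSTOMER_CHENNAI AS               -- fragment stored at the Chennai site
        SELECT * FROM CUSTOMER WHERE City = 'Chennai';

    SELECT * FROM CUSTOMER_DELHI                   -- reconstructing the global relation
    UNION
    SELECT * FROM CUSTOMER_CHENNAI;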

Timestamping: A method of identifying messages with their time of


transaction.

CHAPTER-19 DECISION SUPPORT SYSTEMS (DSS)

Decision Support System (DSS): An interactive, flexible and adaptable


computer-based information system (CBIS) that utilises decision rules,
models and model base coupled with a comprehensive database and the
decision maker’s own insights, leading to specific, implementable decisions
in solving problems.

CHAPTER-20 DATA WAREHOUSING AND DATA MINING


Data Warehouse: A subject-oriented, integrated, time-variant, non-volatile
collection of data in support of management’s decisions.

Data Mining: The process of extracting valid, previously unknown,


comprehensible and actionable information from large databases and using
it to make crucial business decisions.

CHAPTER-21 EMERGING DATABASE TECHNOLOGIES

World Wide Web (WWW): A subset of the Internet that uses computers
called Web servers to store multimedia files.

Digital Library: A managed collection of information, with associated


services, where the information is stored in digital formats and accessible
over a network.
Acknowledgements

I am obliged to students, the teaching community and practicing engineers for


their excellent response to the previous edition of this book. I am also
pleased to acknowledge their valuable comments, feedback and
suggestions to improve the quality of the book.
I wish to acknowledge the assistance given by the editorial team at
Pearson Education, especially Thomas Mathew Rajesh and Vipin Kumar,
for their sustained interest in bringing out this new edition.
I am indebted to my colleagues and friends who have helped, inspired,
and given moral support and encouragement, in various ways, in
completing this task.
I am thankful to the senior executives of Tata Steel for their
encouragement without which I would not have been able to complete this
book.
Finally, I give immeasurable thanks to my family—wife Meena and
children Alka, Avinash and Abhishek—for their sacrifices, patience,
understanding and encouragement during the completion of the book. They
endured many evenings and weekends of solitude for the thrill of seeing a
book cover hang on a den wall.

S. K. SINGH
Copyright © 2011 Dorling Kindersley (India) Pvt. Ltd.
Licensees of Pearson Education in South Asia.
No part of this eBook may be used or reproduced in any manner whatsoever without the publisher’s
prior written consent.
This eBook may or may not include all assets that were part of the print version. The publisher
reserves the right to remove any material present in this eBook at any time, as deemed necessary.
ISBN 9788131760925
ePub ISBN 9789332503212
Head Office: A-8(A), Sector 62, Knowledge Boulevard, 7th Floor, NOIDA 201 309, India.
Registered Office: 11 Local Shopping Centre, Panchsheel Park, New Delhi 110 017, India
