MODULE 1
A database is a collection of related data. By data, we mean known facts that can be recorded and
that have implicit meaning. For example, consider the names, telephone numbers, and addresses of
the people you know. You may have recorded this data in an indexed address book or you may have
stored it on a hard drive, using a personal computer and software such as Microsoft Access or Excel.
This collection of related data with an implicit meaning is a database.
A database management system (DBMS) is a collection of programs that enables users to create and
maintain a database. The DBMS is a general-purpose software system that facilitates the processes of
defining, constructing, manipulating, and sharing databases among various users and applications.
Defining a database involves specifying the data types, structures, and constraints of the data to be
stored in the database. The database definition or descriptive information is also stored by the DBMS
in the form of a database catalogue or dictionary; it is called meta-data. Constructing the database is
the process of storing the data on some storage medium that is controlled by the DBMS. Manipulating
a database includes functions such as querying the database to retrieve specific data, updating the database to reflect changes in the miniworld (the part of the real world that the database represents), and generating reports from the data. Sharing a
database allows multiple users and programs to access the database simultaneously.
An application program accesses the database by sending queries or requests for data to the DBMS.
A query typically causes some data to be retrieved.
A transaction may cause some data to be read and some data to be written into the database.
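The following minimal sketch illustrates defining, constructing, and manipulating a database, using Python's built-in sqlite3 module as a stand-in for a full DBMS; the STUDENT table, its columns, and the sample rows are hypothetical, not taken from the text above.

import sqlite3

# A minimal sketch using Python's built-in sqlite3 module; the STUDENT
# table and its contents are hypothetical examples.
conn = sqlite3.connect(":memory:")   # an in-memory database for demonstration

# Defining: specify the data types, structures, and constraints.
conn.execute("""
    CREATE TABLE STUDENT (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        phone   TEXT
    )
""")

# Constructing: store data on a storage medium controlled by the DBMS.
conn.execute("INSERT INTO STUDENT VALUES (1, 'Asha', '555-0101')")
conn.execute("INSERT INTO STUDENT VALUES (2, 'Ravi', '555-0102')")
conn.commit()

# Manipulating: a query that retrieves specific data.
for row in conn.execute("SELECT name, phone FROM STUDENT WHERE roll_no = 1"):
    print(row)   # ('Asha', '555-0101')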
Other important functions provided by the DBMS include protecting the database and maintaining it
over a long period of time. Protection includes system protection against hardware or software
malfunction (or crashes) and security protection against unauthorized or malicious access. A typical
large database may have a life cycle of many years, so the DBMS must be able to maintain the
database system by allowing the system to evolve as requirements change over time.
In the database approach, a single repository maintains data that is defined once and then accessed
by various users. In file systems, each application is free to name data elements independently. In
contrast, in a database, the names or labels of data are defined once, and used repeatedly by queries,
transactions, and applications. The main characteristics of the database approach versus the file-
processing approach are the following:
i. Self-describing nature of a database system
A fundamental characteristic of the database approach is that the database system contains not only
the database itself but also a complete definition or description of the database structure and
constraints. This definition is stored in the DBMS catalogue, which contains information such as the
structure of each file, the type and storage format of each data item, and various constraints on the
data. The information stored in the catalogue is called meta-data, and it describes the structure of the
primary database.
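As an informal illustration, SQLite happens to expose its catalogue as a table named sqlite_master; the sketch below (with a hypothetical STUDENT table) shows the stored definition, that is, the meta-data, being read back like ordinary data.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (roll_no INTEGER PRIMARY KEY, name TEXT)")

# The DBMS stores the definition (meta-data) in its catalogue; in SQLite
# the catalogue itself can be queried like an ordinary table.
for name, sql in conn.execute("SELECT name, sql FROM sqlite_master"):
    print(name)   # STUDENT
    print(sql)    # the stored CREATE TABLE statement (the meta-data)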
ii. Insulation between programs and data, and data abstraction
In traditional file processing, the structure of data files is embedded in the application programs, so
any changes to the structure of a file may require changing all programs that access that file. By
contrast, DBMS access programs do not require such changes in most cases. The structure of data files
is stored in the DBMS catalogue separately from the access programs. We call this property program-
data independence.
The characteristic that allows program-data independence is called data abstraction. A DBMS
provides users with a conceptual representation of data that does not include many of the details of
how the data is stored or how the operations are implemented. Informally, a data model is a type of
data abstraction that is used to provide this conceptual representation. The data model uses logical
concepts, such as objects, their properties, and their interrelationships, that may be easier for most
users to understand than computer storage concepts. Hence, the data model hides storage and
implementation details that are not of interest to most database users.
iii. Support of multiple views of the data
A database typically has many users, each of whom may require a different perspective or view of the
database. A view may be a subset of the database or it may contain virtual data that is derived from
the database files but is not explicitly stored. Some users may not need to be aware of whether the
data they refer to is stored or derived. A multiuser DBMS whose users have a variety of distinct
applications must provide facilities for defining multiple views.
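A minimal sketch of a view, again using Python's sqlite3 module with hypothetical tables: the view honours_student contains virtual data derived from STUDENT but not explicitly stored.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (roll_no INTEGER, name TEXT, gpa REAL)")
conn.execute("INSERT INTO STUDENT VALUES (1, 'Asha', 3.8), (2, 'Ravi', 2.9)")

# A view: virtual data derived from the stored table but not stored itself.
conn.execute("""
    CREATE VIEW honours_student AS
    SELECT roll_no, name FROM STUDENT WHERE gpa >= 3.5
""")

# Users query the view exactly as if it were a stored table.
print(conn.execute("SELECT * FROM honours_student").fetchall())   # [(1, 'Asha')]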
iv. Sharing of data and multiuser transaction processing
A multiuser DBMS must allow multiple users to access the database at the same time. This is essential
if data for multiple applications is to be integrated and maintained in a single database. The DBMS
must include concurrency control software to ensure that several users trying to update the same
data do so in a controlled manner so that the result of the updates is correct. For example, when
several reservation agents try to assign a seat on an airline flight, the DBMS should ensure that each
seat can be accessed by only one agent at a time for assignment to a passenger. These types of
applications are generally called online transaction processing (OLTP) applications. A fundamental
role of multiuser DBMS software is to ensure that concurrent transactions operate correctly and
efficiently. The concept of a transaction has become central to many database applications. A
transaction is an executing program or process that includes one or more database accesses, such as
reading or updating of database records. Each transaction is supposed to execute a logically correct
database access if executed in its entirety without interference from other transactions. The DBMS
must enforce several transaction properties. The isolation property ensures that each transaction
appears to execute in isolation from other transactions, even though hundreds of transactions may
be executing concurrently. The atomicity property ensures that either all the database operations in
a transaction are executed or none are.
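The sketch below illustrates these ideas on the airline example, using Python's sqlite3 module; the seat table and the assignment logic are hypothetical. The with conn: block makes the read and the write a single transaction: it commits if the block completes and rolls back if an exception occurs.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seat (seat_no TEXT PRIMARY KEY, passenger TEXT)")
conn.execute("INSERT INTO seat VALUES ('12A', NULL)")
conn.commit()

def assign_seat(conn, seat_no, passenger):
    # One transaction: the conditional update executes in its entirety or not at all.
    with conn:   # commits on success, rolls back on an exception
        cur = conn.execute(
            "UPDATE seat SET passenger = ? "
            "WHERE seat_no = ? AND passenger IS NULL",
            (passenger, seat_no))
        if cur.rowcount == 0:
            raise ValueError("seat already taken")   # triggers a rollback

assign_seat(conn, "12A", "Asha")        # succeeds and commits
try:
    assign_seat(conn, "12A", "Ravi")    # fails; the database is unchanged
except ValueError as e:
    print(e)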
Database systems arose in response to early methods of computerized management of commercial data. As an
example of such methods, typical of the 1960s, consider part of a university organization that, among other data, keeps
information about all instructors, students, departments, and course offerings. One way to keep the information on a
computer is to store it in operating system files. To allow users to manipulate the information, the system has a number
of application programs that manipulate the files, including programs to:
Add new students, instructors, and courses
Register students for courses and generate class rosters
Assign grades to students, compute grade point averages (GPA), and generate transcripts
System programmers wrote these application programs to meet the needs of the university.
New application programs are added to the system as the need arises. For example, suppose that a university decides to
create a new major (say, computer science). As a result, the university creates a new department and creates new
permanent files (or adds information to existing files) to record information about all the instructors in the
department, students in that major, course offerings, degree requirements, etc. The university may have to write new
application programs to deal with rules specific to the new major. New application programs may also have to be
written to handle new rules in the university. Thus, as time goes by, the system acquires more files and more
application programs.
This typical file-processing system is supported by a conventional operating system. The system stores permanent
records in various files, and it needs different application programs to extract records from, and add records to, the
appropriate files. Before database management systems (DBMSs) were introduced, organizations usually stored
information in such systems. Keeping organizational information in a file-processing system has a number of major
disadvantages:
Data redundancy and inconsistency. Since different programmers create the files and application programs over a long
period, the various files are likely to have different structures and the programs may be written in several
programming languages. Moreover, the same information may be duplicated in several places (files). For example, if a
student has a double major (say, music and mathematics), the address and telephone number of that student may
appear in a file that consists of student records of students in the Music department and in a file that consists of
student records of students in the Mathematics department. This redundancy leads to higher storage and access
cost. In addition, it may lead to data inconsistency; that is, the various copies of the same data may no longer agree.
For example, a changed student address may be reflected in the Music department records but not elsewhere in
the system.
Difficulty in accessing data. Suppose that one of the university clerks needs to find out the names of all students who
live within a particular postal-code area. The clerk asks the data-processing department to generate such a list.
Because the designers of the original system did not anticipate this request, there is no application program on hand
to meet it. There is, however, an application program to generate the list of all students.
The university clerk now has two choices: either obtain the list of all students and extract the needed information
manually or ask a programmer to write the necessary application program. Both alternatives are obviously
unsatisfactory. Suppose that such a program is written, and that, several days later, the same clerk needs to trim that
list to include only those students who have taken at least 60 credit hours. As expected, a program to generate such a
list does not exist. Again, the clerk has the preceding two options, neither of which is satisfactory. The point here is that
conventional file-processing environments do not allow needed data to be retrieved in a convenient and efficient manner.
More responsive data-retrieval systems are required for general use.
Data isolation. Because data are scattered in various files, and files may be in different formats, writing new
application programs to retrieve the appropriate data is difficult.
Integrity problems. The data values stored in the database must satisfy certain types of consistency constraints.
Suppose the university maintains an account for each department, and records the balance amount in each account.
Suppose also that the university requires that the account balance of a department may never fall below zero.
Developers enforce these constraints in the system by adding appropriate code in the various application programs.
However, when new constraints are added, it is difficult to change the programs to enforce them. The problem is
compounded when constraints involve several data items from different files.
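By contrast, a DBMS lets such a constraint be declared once. A minimal sketch using Python's sqlite3 module, assuming a hypothetical account table for the departments described above:

import sqlite3

conn = sqlite3.connect(":memory:")
# The constraint is declared once to the DBMS instead of being re-coded
# in every application program.
conn.execute("""
    CREATE TABLE account (
        dept    TEXT PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    )
""")
conn.execute("INSERT INTO account VALUES ('A', 10000)")

try:
    conn.execute("UPDATE account SET balance = balance - 20000 WHERE dept = 'A'")
except sqlite3.IntegrityError as e:
    print("rejected by the DBMS:", e)   # the CHECK constraint failed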
Atomicity problems. A computer system, like any other device, is subject to failure. In many applications, it is
crucial that, if a failure occurs, the data be restored to the consistent state that existed prior to the failure.
Consider a program to transfer $500 from the account balance of department A to the account balance of
department B. If a system failure occurs during the execution of the program, it is possible that the $500
was removed from the balance of department A but was not credited to the balance of department B, resulting in
an inconsistent database state. Clearly, it is essential to database consistency that either both the credit and debit
occur, or that neither occur. That is, the funds transfer must be atomic—it must happen in its entirety or
not at all. It is difficult to ensure atomicity in a conventional file-processing system.
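A sketch of such an atomic transfer, using Python's sqlite3 module as a stand-in for a DBMS; the account table is hypothetical. The with conn: block ensures that both updates commit together or neither takes effect.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (dept TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A", 10000), ("B", 5000)])
conn.commit()

def transfer(conn, src, dst, amount):
    # Atomicity: both updates happen in their entirety or not at all.
    with conn:   # BEGIN ... COMMIT, or ROLLBACK if an exception occurs
        conn.execute("UPDATE account SET balance = balance - ? WHERE dept = ?",
                     (amount, src))
        # If a failure occurred here, the debit above would be rolled back.
        conn.execute("UPDATE account SET balance = balance + ? WHERE dept = ?",
                     (amount, dst))

transfer(conn, "A", "B", 500)
print(conn.execute("SELECT * FROM account").fetchall())
# [('A', 9500), ('B', 5500)]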
Concurrent-access anomalies. For the sake of overall performance of the system and faster response, many
systems allow multiple users to update the data simultaneously. Indeed, today, the largest Internet
retailers may have millions of accesses per day to their data by shoppers. In such an environment,
interaction of concurrent updates is possible and may result in inconsistent data. Consider department A,
with an account balance of $10,000. If two department clerks debit the account balance (by say $500 and $100,
respectively) of department A at almost exactly the same time, the result of the concurrent executions may
leave the budget in an incorrect (or inconsistent) state. Suppose that the programs executing on behalf of each
withdrawal read the old balance, reduce that value by the amount being withdrawn, and write the result
back. If the two programs run concurrently, they may both read the value $10,000, and write back $9500
and $9900, respectively. Depending on which one writes the value last, the account balance of department A
may contain either $9500 or $9900, rather than the correct value of $9400. To guard against this possibility,
the system must maintain some form of supervision. But supervision is difficult to provide because data
may be accessed by many different application programs that have not been coordinated previously.
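The lost-update anomaly can be sketched directly in Python, with two threads standing in for the two clerks and a lock playing the role of the supervision described above; the numbers follow the example in the text.

import threading

balance = 10000                      # department A's opening balance
lock = threading.Lock()

def debit(amount, supervised=True):
    global balance
    if supervised:
        with lock:                   # the read-modify-write runs as one unit
            balance -= amount
    else:
        old = balance                # both clerks may read 10000 here ...
        balance = old - amount       # ... and the last write wins (lost update)

# Unsupervised concurrent debits of 500 and 100 can leave 9500 or 9900
# instead of the correct 9400; with the lock, the result is always 9400.
threads = [threading.Thread(target=debit, args=(amt,)) for amt in (500, 100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)   # 9400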
Security problems. Not every user of the database system should be able to access all the data. For example, in
a university, payroll personnel need to see only that part of the database that has financial information. They
do not need access to information about academic records. But, since application programs are added to the
file-processing system in an ad hoc manner, enforcing such security constraints is difficult.
These difficulties, among others, prompted the development of database systems. In what follows, we shall see
the concepts and algorithms that enable database systems to solve the problems with file-processing systems.
ADVANTAGES OF DBMS:
Controlling of Redundancy: Data redundancy refers to the duplication of data (i.e., storing the same data multiple
times). In a database system, having a centralized database and centralized control of the data by the DBA avoids
unnecessary duplication of data. This eliminates the extra time needed to process a large volume of data and
saves storage space.
Improved Data Sharing : A DBMS allows the data to be shared by any number of application programs and users.
Data Integrity : Integrity means that the data in the database is accurate. Centralized control of the data permits
the administrator to define integrity constraints on the data in the database. For example, in a
customer database we can enforce a constraint that customers must come only from Noida or
Meerut city.
Security : Having complete authority over the operational data enables the DBA to ensure that the only
means of access to the database is through proper channels. The DBA can define authorization checks to be
carried out whenever access to sensitive data is attempted.
Data Consistency : By eliminating data redundancy, we greatly reduce the opportunities for inconsistency. For
example, if a customer address is stored only once, we cannot have disagreement between stored values. Also,
updating data values is greatly simplified when each value is stored in one place only. Finally, we avoid the
wasted storage that results from redundant data storage.
Efficient Data Access : In a database system, the data is managed by the DBMS and all access to the data is
through the DBMS, which is the key to effective data processing.
Enforcement of Standards : With the centralization of data, the DBA can establish and enforce data standards,
which may include naming conventions, data quality standards, etc.
Data Independence : In a database system, the database management system provides the interface between
the application programs and the data. When changes are made to the data representation, the meta-data
maintained by the DBMS is changed, but the DBMS continues to provide the data to the application programs in the
previously used way. The DBMS handles the task of transforming the data wherever necessary.
DISADVANTAGES OF DBMS
1) It is a bit complex. Since it supports multiple functions to give the user the best service, the underlying
software has become complex. Designers and developers should have thorough knowledge of the
software to get the most out of it.
2) Because of its complexity and functionality, it occupies a large amount of storage space. It also needs a large
amount of memory to run efficiently.
3) A DBMS works as a centralized system, i.e., all the users from all over the world access the same
database. Hence any failure of the DBMS will impact all the users.
4) A DBMS is generalized software, i.e., it is written to work on entire systems rather than on one specific
application. Hence some applications may run slower than they would on a tailored file system.
FUNCTIONS OF DBMS
i. Controlling Redundancy
The redundancy in storing the same data multiple times leads to several problems. First,
there is the need to perform a single logical update multiple times: once for each file where
the data is recorded. This leads to duplication of effort. Second, storage space is wasted
when the same data is stored repeatedly, and this problem may be serious for large
databases. Third, files that represent the same data may become inconsistent. This may
happen because an update is applied to some of the files but not to others. In the database
approach, the views of different user groups are integrated during database design. Ideally,
we should have a database design that stores each logical data item—such as a student’s
name or birth date—in only one place in the database. This is known as data normalization,
and it ensures consistency and saves storage space. However, in practice, it is sometimes
necessary to use controlled redundancy to improve the performance of queries.
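A minimal sketch of such a design, using Python's sqlite3 module with hypothetical student and enrolment tables: the student's name is stored in exactly one place and referenced elsewhere, so a single logical update suffices.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each logical item (the student's name) is stored in exactly one place.
    CREATE TABLE student (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    -- Enrolment rows refer to the student instead of repeating the name.
    CREATE TABLE enrolment (
        roll_no   INTEGER REFERENCES student(roll_no),
        course_no TEXT
    );
    INSERT INTO student VALUES (17, 'Asha');
    INSERT INTO enrolment VALUES (17, 'CS101'), (17, 'MA102');
""")

# One logical update changes the name everywhere it is seen.
conn.execute("UPDATE student SET name = 'Asha K.' WHERE roll_no = 17")
print(conn.execute("""
    SELECT e.course_no, s.name
    FROM enrolment e JOIN student s ON e.roll_no = s.roll_no
""").fetchall())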
Database Users
In large organizations, many people are involved in the design, use, and maintenance of a
large database with hundreds of users. The people whose jobs involve the day-to-day use of
a large database are called the actors on the scene.
i. Database Administrators
A person who has central control over the system is called the
database administrator (DBA). The functions of the DBA are:
1. Creation and modification of the conceptual schema definition
2. Implementation of the storage structure and access methods
3. Schema and physical organization modifications
4. Granting of authorization for data access
5. Integrity constraint specification
6. Execution of immediate recovery procedures in case of failure
7. Ensuring the physical security of the database
ii. Application Programmers
Application programmers are computer professionals who write application programs. Application
programmers can choose from many tools to develop user interfaces. Rapid application development (RAD) tools
are tools that enable an application programmer to construct forms and reports without writing a program. There
are also special types of programming languages that combine imperative control structures (for example,
for loops, while loops and if-then-else statements) with statements of the data manipulation language. These
languages, sometimes called fourth-generation languages, often include special features to facilitate the
generation of forms and the display of data on the screen. Most major commercial database systems include a
fourth-generation language.
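In the same spirit, the sketch below combines an imperative loop and conditional with embedded data-manipulation statements, using Python and its sqlite3 module with a hypothetical student table.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER, name TEXT, gpa REAL)")
conn.execute("INSERT INTO student VALUES (1, 'Asha', 3.8), (2, 'Ravi', 2.9)")

# An imperative for-loop and if-else wrapped around data-manipulation
# statements, in the spirit of the embedded languages described above.
for roll_no, name, gpa in conn.execute("SELECT * FROM student ORDER BY gpa DESC"):
    if gpa >= 3.5:
        print(f"{name}: dean's list")
    else:
        print(f"{name}: gpa {gpa}")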
iii. Sophisticated Users
Sophisticated users interact with the system without writing programs. Instead, they form their requests in
a database query language. They submit each such query to a query processor, whose function is to break
down DML statements into instructions that the storage manager understands. Analysts who submit queries to
explore data in the database fall in this category.
DATA MODELS, SCHEMAS AND INSTANCES
A data model is a collection of concepts that can be used to describe the structure of a
database and that provides the necessary means to achieve data abstraction.
i. Categories of Data Models
Many data models have been proposed, which we can categorize according to the types of
concepts they use to describe the database structure. High-level or conceptual data models
provide concepts that are close to the way many users perceive data, whereas low-level or
physical data models provide concepts that describe the details of how data is stored on the
computer storage media, typically magnetic disks. Conceptual data models use concepts
such as entities, attributes, and relationships. Concepts provided by low-level data models
are generally meant for computer specialists, not for end users. Between these two
extremes is a class of representational (or implementation) data models which provide
concepts that may be easily understood by end users but that are not too far removed from
the way data is organized in computer storage. Representational data models hide many
details of data storage on disk but can be implemented on a computer system directly.
The description of a database is called the database schema, which is specified during
database design and is not expected to change frequently. Most data models have certain
conventions for displaying schemas as diagrams. A displayed schema is called a schema
diagram. A schema diagram displays the structure of each record type but not the actual
instances of records. We call each object in the schema—such as STUDENT or COURSE—a
schema construct.
The data in the database at a particular moment in time is called a database state or
snapshot. It is also called the current set of occurrences or instances in the database. In a
given database state, each schema construct has its own current set of instances; for
example, the STUDENT construct will contain the set of individual student entities (records)
as its instances. Many database states can be constructed to correspond to a particular
database schema. Every time we insert or delete a record or change the value of a data item
in a record, we change one state of the database into another state. We get the initial state of
the database when the database is first populated or loaded with the initial data. From then
on, every time an update operation is applied to the database, we get another database
state. At any point in time, the database has a current state. The DBMS is partly responsible
for ensuring that every state of the database is a valid state—that is, a state that satisfies the
structure and constraints specified in the schema. The DBMS stores the descriptions of the
schema constructs and constraints—also called the meta-data—in the DBMS catalogue so that
DBMS software can refer to the schema whenever it needs to. The schema is sometimes
called the intension, and a database state is called an extension of the schema. Although the
schema is not supposed to change frequently, it is not uncommon that changes occasionally
need to be applied to the schema as the application requirements change. This is known as
schema evolution.
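The distinction can be sketched with Python's sqlite3 module (the STUDENT schema is hypothetical): the schema is specified once, and each update operation moves the database from one state to the next.

import sqlite3

conn = sqlite3.connect(":memory:")

# The schema (the intension): specified at design time, rarely changed.
conn.execute("CREATE TABLE STUDENT (roll_no INTEGER PRIMARY KEY, name TEXT)")

# The empty state, before the database is populated.
print(conn.execute("SELECT * FROM STUDENT").fetchall())   # []

# Each update operation changes one state of the database into another.
conn.execute("INSERT INTO STUDENT VALUES (1, 'Asha')")    # state 1
conn.execute("INSERT INTO STUDENT VALUES (2, 'Ravi')")    # state 2
conn.execute("DELETE FROM STUDENT WHERE roll_no = 2")     # state 3

# The current state (an extension); the schema has not changed.
print(conn.execute("SELECT * FROM STUDENT").fetchall())   # [(1, 'Asha')]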
DATABASE ARCHITECTURE
The architecture of a database system is greatly influenced by the underlying computer system on which
the database system runs. Database systems can be centralized or client-server, where one server machine
executes work on behalf of multiple client machines. Database systems can also be designed to exploit parallel
computer architectures. Distributed databases span multiple geographically separated machines.
Query Processor:
The query processor components include
· DDL interpreter, which interprets DDL statements and records the definitions in the data dictionary.
· DML compiler, which translates DML statements in a query language into an evaluation plan consisting
of low-level instructions that the query evaluation engine understands.
A query can usually be translated into any of a number of alternative evaluation plans that all give the same
result. The DML compiler also performs query optimization, that is, it picks the lowest cost evaluation plan
from among the alternatives.
· Query evaluation engine, which executes low-level instructions generated by the DML compiler.
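As an informal illustration of query optimization, SQLite can report the evaluation plan it picks via EXPLAIN QUERY PLAN; in the sketch below (hypothetical table and index), the optimizer switches from a full table scan to an index search once a cheaper access path exists.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER, name TEXT)")

# Without an index, the only available plan scans the whole table.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM student WHERE roll_no = 1").fetchall())

# After adding an index, the optimizer picks the cheaper index-search plan.
conn.execute("CREATE INDEX idx_roll ON student(roll_no)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM student WHERE roll_no = 1").fetchall())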
Storage Manager:
A storage manager is a program module that provides the interface between the low-level data stored in the
database and the application programs and queries submitted to the system. The storage manager is
responsible for the interaction with the file manager. The raw data are stored on the disk using the file system,
which is usually provided by a conventional operating system. The storage manager translates the various DML
statements into low-level file-system commands. Thus, the storage manager is responsible for storing,
retrieving, and updating data in the database.
The storage manager components include:
· Authorization and integrity manager, which tests for the satisfaction of integrity constraints and checks
the authority of users to access data.
· Transaction manager, which ensures that the database remains in a consistent (correct) state despite
system failures, and that concurrent transaction executions proceed without conflicting.
· File manager, which manages the allocation of space on disk storage and the data structures used to
represent information stored on disk.
· Buffer manager, which is responsible for fetching data from disk storage into main memory, and deciding
what data to cache in main memory. The buffer manager is a critical part of the database system, since it
enables the database to handle data sizes that are much larger than the size of main memory.
Transaction Manager:
A transaction is a collection of operations that performs a single logical function in a database application. Each
transaction is a unit of both atomicity and consistency. Thus, we require that transactions do not violate any
database-consistency constraints. That is, if the database was consistent when a transaction started, the
database must be consistent when the transaction successfully terminates. The transaction manager ensures that
the database remains in a consistent (correct) state despite system failures (e.g., power failures and
operating system crashes) and transaction failures.
A database system is partitioned into modules that deal with each of the responsibilities of the overall
system. The functional components of a database system can be broadly divided into the storage
manager and the query processor components. The storage manager is important because databases typically
require a large amount of storage space. The query processor is important because it helps the database system
simplify and facilitate access to data.
It is the job of the database system to translate updates and queries written in a nonprocedural language, at the
logical level, into an efficient sequence of operations at the physical level.
THREE-SCHEMA ARCHITECTURE AND DATA INDEPENDENCE
A database management system that provides three levels of data abstraction is said to follow
a three-level architecture. The goal of the three-schema architecture is to separate the
user applications from the physical database. In this architecture, schemas can be defined at
the following three levels:
External Level :
The external level is at the highest level of database abstraction. At this level, there will be many
views defined for different users' requirements. A view describes only a subset of the database.
Any number of user views may exist for a given global schema (conceptual schema). For example,
each student has a different view of the timetable: the view of an undergraduate student in
Computer Science is different from the view of a postgraduate student of the same course. Thus
this level of abstraction is concerned with different categories of users. Each external view is
described by means of a schema called a subschema.
Conceptual Level :
At this level of database abstraction all the database entities and the relationships
among them are included. One conceptual view represents the entire database. This
conceptual view is defined by the conceptual schema. The conceptual schema hides
the details of physical storage structures and concentrates on describing entities,
data types, relationships, user operations and constraints. It describes all the records
and relationships included in the conceptual view. There is only one conceptual
schema per database. It includes features that specify checks on data consistency and integrity.
Internal Level :
It is the lowest level of abstraction, closest to the physical storage method used. It indicates how
the data will be stored and describes the data structures and access methods to be used by the
database. The internal view is expressed by the internal schema. The following aspects are
considered at this level:
1. Storage allocation, e.g., B-tree, hashing
2. Access paths, e.g., specification of primary and secondary keys, indexes, etc.
3. Miscellaneous, e.g., data compression and encryption techniques,
optimization of the internal structures.
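The three levels can be sketched side by side with Python's sqlite3 module (all names hypothetical): a base table for the conceptual level, a view as an external subschema, and an index as an internal-level access path.

import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual level: entities, attributes, relationships, and constraints.
conn.execute("""
    CREATE TABLE student (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT,
        dept    TEXT,
        gpa     REAL
    )
""")

# External level: a view (subschema) exposing only one user group's subset.
conn.execute("""
    CREATE VIEW cs_roster AS
    SELECT roll_no, name FROM student WHERE dept = 'CS'
""")

# Internal level: an access path (index), invisible to users of the view.
conn.execute("CREATE INDEX idx_dept ON student(dept)")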
DATA INDEPENDENCE
The three-schema architecture can be used to further explain the concept of data
independence, which can be defined as the capacity to change the schema at one level of a
database system without having to change the schema at the next higher level. We can
define two types of data independence:
1. Logical data independence is the capacity to change the conceptual schema without
having to change external schemas or application programs. We may change the conceptual
schema to expand the database (by adding a record type or data item), to change
constraints, or to reduce the database (by removing a record type or data item). In the last
case, external schemas that refer only to the remaining data should not be affected. Only the
view definition and the mappings need to be changed in a DBMS that supports logical data
independence. After the conceptual schema undergoes a logical reorganization, application
programs that reference the external schema constructs must work as before. Changes to
constraints can be applied to the conceptual schema without affecting the external schemas
or application programs.
2. Physical data independence is the capacity to change the internal schema without having
to change the conceptual schema. Hence, the external schemas need not be changed either.
Changes to the internal schema may be needed because some physical files were
reorganized—for example, by creating additional access structures—to improve the
performance of retrieval or update. If the same data as before remains in the database, we
should not have to change the conceptual schema.
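Both kinds of independence can be sketched with Python's sqlite3 module (hypothetical names throughout): the conceptual schema is expanded and the internal schema reorganized, yet the external view and the queries against it continue to work unchanged.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE VIEW roster AS SELECT roll_no, name FROM student")
conn.execute("INSERT INTO student VALUES (1, 'Asha')")

# Logical data independence: expand the conceptual schema; the external
# view refers only to the remaining data, so it is unaffected.
conn.execute("ALTER TABLE student ADD COLUMN gpa REAL")

# Physical data independence: reorganize the internal schema (a new
# access structure); neither the table nor the view definition changes.
conn.execute("CREATE INDEX idx_name ON student(name)")

print(conn.execute("SELECT * FROM roster").fetchall())   # still works: [(1, 'Asha')]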
Generally, physical data independence exists in most databases and file environments where
physical details such as the exact location of data on disk, and hardware details of storage
encoding, placement, compression, splitting, merging of records, and so on are hidden from
the user. Applications remain unaware of these details. On the other hand, logical data
independence is harder to achieve because it allows structural and constraint changes
without affecting application programs—a much stricter requirement. Whenever we have a
multiple-level DBMS, its catalogue must be expanded to include information on how to map
requests and data among the various levels. The DBMS uses additional software to
accomplish these mappings by referring to the mapping information in the catalogue. Data
independence occurs because when the schema is changed at some level, the schema at the
next higher level remains unchanged; only the mapping between the two levels is changed.
Hence, application programs referring to the higher-level schema need not be changed. The
three-schema architecture can make it easier to achieve true data independence, both
physical and logical. However, the two levels of mappings create an overhead during
compilation or execution of a query or program, leading to inefficiencies in the DBMS.
Because of this, few DBMSs have implemented the full three-schema architecture.