
KIET Group of Institutions

Department of Computer Science & Engineering


B.Tech. Vth SEM,
Topic Beyond Syllabus
Database Management System (KCS-501)

Storage System

Databases are stored in file formats, which contain records. At the physical
level, the actual data is stored in electromagnetic format on some device.
These storage devices can be broadly categorized into three types −

 Primary Storage − The memory storage that is directly accessible to the
CPU comes under this category. The CPU's internal memory (registers), fast
memory (cache), and main memory (RAM) are directly accessible to the
CPU, as they are all placed on the motherboard or CPU chipset. This
storage is typically very small, ultra-fast, and volatile. Primary storage
requires a continuous power supply in order to maintain its state. In case
of a power failure, all its data is lost.
 Secondary Storage − Secondary storage devices are used to store data for
future use or as backup. Secondary storage includes memory devices that
are not a part of the CPU chipset or motherboard, for example, magnetic
disks (hard disks), optical disks (DVD, CD, etc.), flash drives, and
magnetic tapes.
 Tertiary Storage − Tertiary storage is used to store huge volumes of data.
Since such storage devices are external to the computer system, they are
the slowest in speed. These storage devices are mostly used to back up
an entire system. Optical disks and magnetic tapes are widely used as
tertiary storage.

Memory Hierarchy
A computer system has a well-defined hierarchy of memory. A CPU has direct
access to its main memory as well as its inbuilt registers. Main memory is
considerably slower than the CPU; to minimize this speed mismatch, cache
memory is introduced. Cache memory provides the fastest access time and
contains the data most frequently accessed by the CPU.
The memory with the fastest access is the costliest one. Larger storage
devices offer slower speed and are less expensive, however they can store
huge volumes of data as compared to CPU registers or cache memory.

Magnetic Disks
Hard disk drives are the most common secondary storage devices in present
computer systems. These are called magnetic disks because they use the
concept of magnetization to store information. Hard disks consist of metal disks
coated with magnetizable material. These disks are placed vertically on a
spindle. A read/write head moves in between the disks and is used to magnetize
or de-magnetize the spot under it. A magnetized spot can be recognized as 0
(zero) or 1 (one).
Hard disks are formatted in a well-defined order to store data efficiently. A hard
disk plate has many concentric circles on it, called tracks. Every track is further
divided into sectors. A sector on a hard disk typically stores 512 bytes of data.

Redundant Array of Independent Disks

RAID, or Redundant Array of Independent Disks, is a technology to connect
multiple secondary storage devices and use them as a single storage medium.
RAID consists of an array of disks in which multiple disks are connected
together to achieve different goals. RAID levels define the use of disk arrays.
RAID 0
In this level, a striped array of disks is implemented. The data is broken down
into blocks and the blocks are distributed among disks. Each disk receives a
block of data to write/read in parallel. It enhances the speed and performance
of the storage device. There is no parity or backup in Level 0.
RAID 1
RAID 1 uses mirroring techniques. When data is sent to a RAID controller, it
sends a copy of data to all the disks in the array. RAID level 1 is also
called mirroring and provides 100% redundancy in case of a failure.

RAID 2
RAID 2 records Error Correction Codes using Hamming distance for its data,
striped on different disks. Like level 0, each data bit in a word is recorded on a
separate disk, and the ECC codes of the data words are stored on a different set
of disks. Due to its complex structure and high cost, RAID 2 is not commercially
available.

RAID 3
RAID 3 stripes the data onto multiple disks. The parity bit generated for each
data word is stored on a dedicated parity disk. This technique makes it possible
to recover from single-disk failures.

RAID 4
In this level, an entire block of data is written onto data disks and then the
parity is generated and stored on a different disk. Note that level 3 uses byte-
level striping, whereas level 4 uses block-level striping. Both level 3 and level 4
require at least three disks to implement RAID.
RAID 5
RAID 5 writes whole data blocks onto different disks, but the parity bits
generated for each data block stripe are distributed among all the data disks
rather than being stored on a separate dedicated disk.
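To make the parity idea concrete, here is a minimal Python sketch of RAID
3/4/5-style XOR parity. The block contents and the four-byte block size are
invented for the demo; a real controller works on whole disk blocks at the
device level.

# Minimal sketch of RAID-style XOR parity (illustrative only).
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte            # parity is the bitwise XOR of all blocks
    return bytes(out)

d0 = b"\x01\x02\x03\x04"              # data block on disk 0
d1 = b"\x10\x20\x30\x40"              # data block on disk 1
d2 = b"\xaa\xbb\xcc\xdd"              # data block on disk 2
parity = xor_blocks([d0, d1, d2])     # stored on a parity disk (or striped, RAID 5)

# If disk 1 fails, XOR-ing the surviving blocks with the parity block
# reconstructs the lost data, which is why one disk failure is survivable.
assert xor_blocks([d0, d2, parity]) == d1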

RAID 6
RAID 6 is an extension of level 5. In this level, two independent parities are
generated and stored in distributed fashion among multiple disks. Two parities
provide additional fault tolerance. This level requires at least four disk drives to
implement RAID.
------------------------------------------------------------------

File Structure

Related data and information is stored collectively in file formats. A file is a
sequence of records stored in binary format. A disk drive is formatted into
several blocks that can store records. File records are mapped onto those disk
blocks.
File Organization
File Organization defines how file records are mapped onto disk blocks. We have
four types of File Organization to organize file records −

Heap File Organization

When a file is created using Heap File Organization, the Operating System
allocates a memory area to that file without any further accounting details. File
records can be placed anywhere in that memory area. It is the responsibility of
the software to manage the records. Heap File does not support any ordering,
sequencing, or indexing on its own.
Sequential File Organization
Every file record contains a data field (attribute) to uniquely identify that record.
In sequential file organization, records are placed in the file in some sequential
order based on the unique key field or search key. Practically, it is not possible
to store all the records sequentially in physical form.
Hash File Organization
Hash File Organization uses Hash function computation on some fields of the
records. The output of the hash function determines the location of disk block
where the records are to be placed.
Clustered File Organization
Clustered file organization is not considered good for large databases. In this
mechanism, related records from one or more relations are kept in the same
disk block, that is, the ordering of records is not based on primary key or search
key.
File Operations
Operations on database files can be broadly classified into two categories −
 Update Operations
 Retrieval Operations
Update operations change the data values by insertion, deletion, or update.
Retrieval operations, on the other hand, do not alter the data but retrieve them
after optional conditional filtering. In both types of operations, selection plays a
significant role. Apart from the creation and deletion of a file, several other
operations can be performed on files.
 Open − A file can be opened in one of two modes, read mode or write
mode. In read mode, the operating system does not allow anyone to alter
data. In other words, data is read only. Files opened in read mode can be
shared among several entities. Write mode allows data modification. Files
opened in write mode can be read but cannot be shared.
 Locate − Every file has a file pointer, which tells the current position where
the data is to be read or written. This pointer can be adjusted accordingly.
Using the find (seek) operation, it can be moved forward or backward.
 Read − By default, when files are opened in read mode, the file pointer
points to the beginning of the file. There are options where the user can
tell the operating system where to locate the file pointer at the time of
opening a file. The data immediately following the file pointer is read.
 Write − A user can choose to open a file in write mode, which enables them
to edit its contents, whether by deletion, insertion, or modification. The file
pointer can be located at the time of opening or can be dynamically
changed if the operating system allows it.
 Close − This is the most important operation from the operating system’s
point of view. When a request to close a file is generated, the operating
system
o removes all the locks (if in shared mode),
o saves the data (if altered) to the secondary storage media, and
o releases all the buffers and file handlers associated with the file.
The organization of data inside a file plays a major role here. The process of
locating the file pointer at a desired record inside a file varies based on whether
the records are arranged sequentially or clustered.
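The operations above map naturally onto an ordinary file API. The following
Python sketch is only illustrative; the file name records.dat and the fixed
four-byte record size are assumptions made for the demo.

# Illustrative sketch of the file operations described above.
with open("records.dat", "wb") as f:   # Open in write mode
    f.write(b"REC1REC2REC3")           # Write three fixed-length records

with open("records.dat", "rb") as f:   # Open in read mode (data is read-only)
    f.seek(4)                          # Locate: move the file pointer to record 2
    print(f.read(4))                   # Read: returns b'REC2', the data at the pointer
# Leaving each with-block performs Close: buffers are flushed, the handle released.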

Indexing

We know that data is stored in the form of records. Every record has a key field,
which helps it to be recognized uniquely.
Indexing is a data structure technique to efficiently retrieve records from the
database files based on some attributes on which the indexing has been done.
Indexing in database systems is similar to what we see in books.
Indexing is defined based on its indexing attributes. Indexing can be of the
following types −
 Primary Index − Primary index is defined on an ordered data file. The data
file is ordered on a key field. The key field is generally the primary key of
the relation.
 Secondary Index − Secondary index may be generated from a field which
is a candidate key and has a unique value in every record, or a non-key
with duplicate values.
 Clustering Index − Clustering index is defined on an ordered data file. The
data file is ordered on a non-key field.
Ordered Indexing is of two types −
 Dense Index
 Sparse Index

Dense Index
In dense index, there is an index record for every search key value in the
database. This makes searching faster but requires more space to store index
records itself. Index records contain search key value and a pointer to the actual
record on the disk.
Sparse Index
In a sparse index, index records are not created for every search key. An index
record here contains a search key and an actual pointer to the data on the disk.
To search a record, we first follow the index record to reach the actual location
of the data. If the data we are looking for is not where we arrive directly by
following the index, the system performs a sequential search from that point
until the desired data is found.
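As a rough illustration, the sketch below models a sparse index in Python with
one index entry per data block (the block's first key); the keys, records, and
block layout are invented.

# Minimal sketch of a sparse index lookup.
import bisect

data_blocks = [                            # each block holds records sorted by key
    [(5, "A"), (12, "B")],
    [(20, "C"), (27, "D")],
    [(33, "E"), (41, "F")],
]
index = [blk[0][0] for blk in data_blocks] # sparse index entries: [5, 20, 33]

def lookup(key):
    pos = bisect.bisect_right(index, key) - 1   # follow the index entry
    if pos < 0:
        return None
    for k, record in data_blocks[pos]:          # then scan the block sequentially
        if k == key:
            return record
    return None

print(lookup(27))   # 'D': the index points to block 1, a sequential scan finds 27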

Multilevel Index
Index records comprise search-key values and data pointers. Multilevel index is
stored on the disk along with the actual database files. As the size of the
database grows, so does the size of the indices. There is an immense need to
keep the index records in the main memory so as to speed up the search
operations. If a single-level index is used, then a large index cannot be kept in
memory, which leads to multiple disk accesses.
Multi-level Index helps in breaking down the index into several smaller indices
in order to make the outermost level so small that it can be saved in a single
disk block, which can easily be accommodated anywhere in the main memory.
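A minimal sketch of the idea, assuming a two-level index where the small outer
index fits in main memory and each inner index block would cost one disk
access; all keys and pointer names are invented.

# Sketch of a two-level (multilevel) index lookup.
inner_blocks = [                            # inner level: sorted (key, data pointer)
    [(5, "ptr5"), (12, "ptr12"), (20, "ptr20")],
    [(27, "ptr27"), (33, "ptr33"), (41, "ptr41")],
]
outer = [blk[0][0] for blk in inner_blocks] # outer level: [5, 27], one disk block

def find_pointer(key):
    pos = 0
    while pos + 1 < len(outer) and outer[pos + 1] <= key:
        pos += 1                            # in-memory scan of the outer index
    for k, ptr in inner_blocks[pos]:        # one disk access to the inner block
        if k == key:
            return ptr
    return None

print(find_pointer(33))   # 'ptr33' after a single simulated disk access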

B+ Tree
A B+ tree is a balanced search tree (a multiway tree, not a binary one) that
follows a multi-level index format. The leaf nodes of a B+ tree hold actual data
pointers. A B+ tree ensures that all leaf nodes remain at the same depth and is
thus balanced. Additionally, the leaf nodes are linked using a linked list;
therefore, a B+ tree can support random access as well as sequential access.
Structure of B+ Tree
Every leaf node is at equal distance from the root node. A B+ tree is of the
order n where n is fixed for every B+ tree.

Internal nodes −

 Internal (non-leaf) nodes contain at least ⌈n/2⌉ pointers, except the root
node.
 At most, an internal node can contain n pointers.
Leaf nodes −

 Leaf nodes contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
 At most, a leaf node can contain n record pointers and n key values.
 Every leaf node contains one block pointer P to point to next leaf node
and forms a linked list.
B+ Tree Insertion

 B+ trees are filled from the bottom and each entry is made at a leaf node
(a code sketch of the leaf split follows this list).
 If a leaf node overflows −
o Split node into two parts.
o Partition at i = ⌊(m+1)/2⌋.
o First i entries are stored in one node.
o Rest of the entries (i+1 onwards) are moved to a new node.
o ith key is duplicated at the parent of the leaf.
 If a non-leaf node overflows −
o Split node into two parts.
o Partition the node at i = ⌈(m+1)/2⌉.
o Entries up to i are kept in one node.
o Rest of the entries are moved to a new node.
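The following sketch applies the leaf-split rule above for a hypothetical tree of
order m = 4. Record pointers, parent pointers, and the leaf linked list are
omitted, and the key copied up is taken as the first key of the new right node,
a common convention consistent with the partition point above.

# Minimal sketch of the B+ tree leaf-split rule.
def split_leaf(keys, m):
    """Split a leaf that has overflowed to m + 1 keys."""
    i = (m + 1) // 2              # partition point i = floor((m+1)/2)
    left = keys[:i]               # first i entries stay in the old node
    right = keys[i:]              # remaining entries move to a new node
    up = right[0]                 # this key is duplicated up into the parent
    return left, right, up

# Inserting 25 into a full leaf of order m = 4 triggers a split:
keys = sorted([10, 20, 30, 40] + [25])    # overflowing leaf: [10, 20, 25, 30, 40]
left, right, up = split_leaf(keys, 4)
print(left, right, up)                    # [10, 20] [25, 30, 40] 25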
B+ Tree Deletion
 B+ tree entries are deleted at the leaf nodes.
 The target entry is searched and deleted.
o If it is an internal node, delete and replace with the entry from the
left position.
 After deletion, underflow is tested,
o If underflow occurs, distribute the entries from the nodes left to it.
 If distribution is not possible from left, then
o Distribute from the nodes right to it.
 If distribution is not possible from left or from right, then
o Merge the node with left and right to it.

Hashing

For a huge database structure, it can be nearly impossible to search all the
index values through all their levels and then reach the destination data block
to retrieve the desired data. Hashing is an effective technique to calculate the
direct location of a data record on the disk without using an index structure.
Hashing uses hash functions with search keys as parameters to generate the
address of a data record.

Hash Organization
 Bucket − A hash file stores data in bucket format. Bucket is considered a
unit of storage. A bucket typically stores one complete disk block, which
in turn can store one or more records.
 Hash Function − A hash function, h, is a mapping function that maps the
set of all search-keys K to the addresses where the actual records are
placed. It is a function from search keys to bucket addresses.
Static Hashing
In static hashing, when a search-key value is provided, the hash function
always computes the same address. For example, if a mod-4 hash function is
used, then it generates only four values (0 through 3). The output address is
always the same for a given key. The number of buckets provided remains
unchanged at all times.

Operation
Insertion − When a record is required to be entered using static hash, the hash
function h computes the bucket address for search key K, where the record will
be stored.
Bucket address = h(K)
Search − When a record needs to be retrieved, the same hash function can be
used to retrieve the address of the bucket where the data is stored.
Delete − This is simply a search followed by a deletion operation.
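These three operations can be sketched in a few lines of Python. The bucket
count of 4 (the mod-4 function mentioned above), the keys, and the records are
assumptions for the demo; overflow handling is ignored here (see Bucket
Overflow below).

# Minimal sketch of static hashing with a fixed number of buckets.
NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]

def h(K):
    return K % NUM_BUCKETS                # always the same address for a given key

def insert(K, record):
    buckets[h(K)].append((K, record))     # Bucket address = h(K)

def search(K):
    for k, record in buckets[h(K)]:       # the same function locates the bucket
        if k == K:
            return record
    return None

def delete(K):                            # search followed by deletion
    buckets[h(K)] = [(k, r) for (k, r) in buckets[h(K)] if k != K]

insert(10, "Alice"); insert(14, "Bob")    # 10 and 14 both hash to bucket 2
print(search(14))                          # 'Bob'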

Bucket Overflow
The condition of bucket-overflow is known as collision. This is a fatal state for
any static hash function. In this case, overflow chaining can be used.
 Overflow Chaining − When buckets are full, a new bucket is allocated for
the same hash result and is linked after the previous one. This
mechanism is called Closed Hashing.
 Linear Probing − When a hash function generates an address at which data
is already stored, the next free bucket is allocated to it. This mechanism is
called Open Hashing.

Dynamic Hashing
The problem with static hashing is that it does not expand or shrink dynamically
as the size of the database grows or shrinks. Dynamic hashing provides a
mechanism in which data buckets are added and removed dynamically and on
demand. Dynamic hashing is also known as extendible hashing.
Hash function, in dynamic hashing, is made to produce a large number of
values and only a few are used initially.
Organization
The prefix of an entire hash value is taken as a hash index. Only a portion of
the hash value is used for computing bucket addresses. Every hash index has
a depth value to signify how many bits are used for computing bucket
addresses. These bits can address 2^n buckets. When all these bits are
consumed − that is, when all the buckets are full − the depth value is increased
linearly and twice the number of buckets are allocated.
Operation
 Querying − Look at the depth value of the hash index and use those bits
to compute the bucket address.
 Update − Perform a query as above and update the data.
 Deletion − Perform a query to locate the desired data and delete the same.
 Insertion − Compute the address of the bucket.
o If the bucket is already full:
 Add more buckets.
 Add additional bits to the hash value.
 Re-compute the hash function.
o Else:
 Add data to the bucket.
o If all the buckets are full, perform the remedies of static hashing.
A compact code sketch of this directory-doubling scheme follows.
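The sketch below is a toy Python version of the scheme under assumed sizes:
a bucket capacity of two records, the low-order bits of Python's built-in hash
as the hash index, and invented names such as Bucket. It illustrates the
mechanism only and is not production code.

# Compact sketch of dynamic (extendible) hashing: a directory of 2**depth
# entries maps hash-value bits to buckets; the directory doubles when a
# full bucket cannot be split in place.
BUCKET_CAPACITY = 2

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth    # bits that distinguish this bucket
        self.items = {}

depth = 1                                 # global depth: bits of h(K) in use
directory = [Bucket(1), Bucket(1)]        # 2**depth directory entries

def insert(key, record):
    global depth, directory
    bucket = directory[hash(key) % (2 ** depth)]
    if key in bucket.items or len(bucket.items) < BUCKET_CAPACITY:
        bucket.items[key] = record
        return
    if bucket.local_depth == depth:       # all bits consumed: double the directory
        directory = directory + directory
        depth += 1
    bucket.local_depth += 1               # split the overflowing bucket
    new_bucket = Bucket(bucket.local_depth)
    bit = bucket.local_depth - 1          # the newly significant bit
    for i, entry in enumerate(directory): # rewire half the pointers to the new bucket
        if entry is bucket and (i >> bit) & 1:
            directory[i] = new_bucket
    old_items, bucket.items = bucket.items, {}
    for k, r in old_items.items():        # redistribute the displaced entries
        insert(k, r)
    insert(key, record)                   # retry the pending insertion

for k in range(8):
    insert(k, "rec%d" % k)
print("global depth:", depth)             # grew from 1 to 2 as buckets filled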
Hashing is not favorable when the data is organized in some ordering and the
queries require a range of data; when data is discrete and random, hashing
performs best. Hashing algorithms are more complex to implement than
indexing, but all hash operations are done in constant time.

Distributed DBMS

In a distributed database, there are a number of databases that may be
geographically distributed all over the world. A distributed DBMS manages the
distributed database in a manner so that it appears as one single database to
users. In the later part of this section, we go on to study the factors that lead
to distributed databases, and their advantages and disadvantages.
A distributed database is a collection of multiple interconnected databases,
which are spread physically across various locations that communicate via a
computer network.

Features
 Databases in the collection are logically interrelated with each other. Often
they represent a single logical database.
 Data is physically stored across multiple sites. Data in each site can be
managed by a DBMS independent of the other sites.
 The processors in the sites are connected via a network. They do not have
any multiprocessor configuration.
 A distributed database is not a loosely connected file system.
 A distributed database incorporates transaction processing, but it is not
synonymous with a transaction processing system.

Distributed Database Management System

A distributed database management system (DDBMS) is a centralized software
system that manages a distributed database in a manner as if it were all stored
in a single location.
Features
 It is used to create, retrieve, update and delete distributed databases.
 It synchronizes the database periodically and provides access
mechanisms by the virtue of which the distribution becomes transparent
to the users.
 It ensures that the data modified at any site is universally updated.
 It is used in application areas where large volumes of data are processed
and accessed by numerous users simultaneously.
 It is designed for heterogeneous database platforms.
 It maintains confidentiality and data integrity of the databases.

Factors Encouraging DDBMS

The following factors encourage moving over to DDBMS −
 Distributed Nature of Organizational Units − Most organizations in the
current times are subdivided into multiple units that are physically
distributed over the globe. Each unit requires its own set of local data.
Thus, the overall database of the organization becomes distributed.
 Need for Sharing of Data − The multiple organizational units often need to
communicate with each other and share their data and resources. This
demands common databases or replicated databases that should be used
in a synchronized manner.
 Support for Both OLTP and OLAP − Online Transaction Processing (OLTP)
and Online Analytical Processing (OLAP) work upon diversified systems
which may have common data. Distributed database systems aid both kinds
of processing by providing synchronized data.
 Database Recovery − One of the common techniques used in DDBMS is
replication of data across different sites. Replication of data automatically
helps in data recovery if database in any site is damaged. Users can
access data from other sites while the damaged site is being
reconstructed. Thus, database failure may become almost inconspicuous
to users.
 Support for Multiple Application Software − Most organizations use a
variety of application software each with its specific database support.
DDBMS provides a uniform functionality for using the same data among
different platforms.

Advantages of Distributed Databases

Following are the advantages of distributed databases over centralized
databases.
Modular Development − If the system needs to be expanded to new locations or
new units, in centralized database systems, the action requires substantial
efforts and disruption in the existing functioning. However, in distributed
databases, the work simply requires adding new computers and local data to
the new site and finally connecting them to the distributed system, with no
interruption in current functions.
More Reliable − In case of database failures, the total system of centralized
databases comes to a halt. However, in distributed systems, when a component
fails, the functioning of the system continues, possibly at reduced performance.
Hence DDBMS is more reliable.
Better Response − If data is distributed in an efficient manner, then user
requests can be met from local data itself, thus providing faster response. On
the other hand, in centralized systems, all queries have to pass through the
central computer for processing, which increases the response time.
Lower Communication Cost − In distributed database systems, if data is located
locally where it is mostly used, then the communication costs for data
manipulation can be minimized. This is not feasible in centralized systems.
Adversities of Distributed Databases
Following are some of the adversities associated with distributed databases.
 Need for complex and expensive software − DDBMS demands complex and
often expensive software to provide data transparency and co-ordination
across the several sites.
 Processing overhead − Even simple operations may require a large
number of communications and additional calculations to provide
uniformity in data across the sites.
 Data integrity − The need for updating data in multiple sites poses problems
of data integrity.
 Overheads for improper data distribution − Responsiveness of queries is
largely dependent upon proper data distribution. Improper data
distribution often leads to very slow response to user requests.

Types of Distributed Databases

Distributed databases can be broadly classified into homogeneous and
heterogeneous distributed database environments, each with further
sub-divisions.
Homogeneous Distributed Databases
In a homogeneous distributed database, all the sites use identical DBMS and
operating systems. Its properties are −
 The sites use very similar software.
 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to
process user requests.
 The database is accessed through a single interface as if it is a single
database.
Types of Homogeneous Distributed Database
There are two types of homogeneous distributed database −
 Autonomous − Each database is independent and functions on its own.
They are integrated by a controlling application and use message passing
to share data updates.
 Non-autonomous − Data is distributed across the homogeneous nodes and
a central or master DBMS co-ordinates data updates across the sites.
Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites have different operating
systems, DBMS products and data models. Its properties are −
 Different sites use dissimilar schemas and software.
 The system may be composed of a variety of DBMSs like relational,
network, hierarchical or object oriented.
 Query processing is complex due to dissimilar schemas.
 Transaction processing is complex due to dissimilar software.
 A site may not be aware of other sites and so there is limited co-operation
in processing user requests.
Types of Heterogeneous Distributed Databases
 Federated − The heterogeneous database systems are independent in
nature and integrated together so that they function as a single database
system.
 Un-federated − The database systems employ a central coordinating
module through which the databases are accessed.

Distributed DBMS Architectures


DDBMS architectures are generally developed depending on three parameters −
 Distribution − It states the physical distribution of data across the different
sites.
 Autonomy − It indicates the distribution of control of the database system
and the degree to which each constituent DBMS can operate
independently.
 Heterogeneity − It refers to the uniformity or dissimilarity of the data
models, system components and databases.

Architectural Models
Some of the common architectural models are −
 Client - Server Architecture for DDBMS
 Peer - to - Peer Architecture for DDBMS
 Multi - DBMS Architecture
Client - Server Architecture for DDBMS
This is a two-level architecture where the functionality is divided into servers
and clients. The server functions primarily encompass data management, query
processing, optimization and transaction management. Client functions include
mainly the user interface. However, clients also have some functions like
consistency checking and transaction management.
The two different client - server architectures are −
 Single Server Multiple Client
 Multiple Server Multiple Client
Peer-to-Peer Architecture for DDBMS
In these systems, each peer acts both as a client and a server for imparting
database services. The peers share their resource with other peers and co-
ordinate their activities.
This architecture generally has four levels of schemas −
 Global Conceptual Schema − Depicts the global logical view of data.
 Local Conceptual Schema − Depicts logical data organization at each site.
 Local Internal Schema − Depicts physical data organization at each site.
 External Schema − Depicts user view of data.
Multi - DBMS Architectures
This is an integrated database system formed by a collection of two or more
autonomous database systems.
Multi-DBMS can be expressed through six levels of schemas −
 Multi-database View Level − Depicts multiple user views comprising
subsets of the integrated distributed database.
 Multi-database Conceptual Level − Depicts the integrated multi-database
that comprises global logical multi-database structure definitions.
 Multi-database Internal Level − Depicts the data distribution across
different sites and multi-database to local data mapping.
 Local database View Level − Depicts public view of local data.
 Local database Conceptual Level − Depicts local data organization at each site.
 Local database Internal Level − Depicts physical data organization at each site.
There are two design alternatives for multi-DBMS −
 Model with multi-database conceptual level.
 Model without multi-database conceptual level.
Design Alternatives
The distribution design alternatives for the tables in a DDBMS are as follows −
 Non-replicated and non-fragmented
 Fully replicated
 Partially replicated
 Fragmented
 Mixed
Non-replicated & Non-fragmented
In this design alternative, different tables are placed at different sites. Data is placed so
that it is in close proximity to the site where it is used most. It is most suitable for
database systems where the percentage of queries needed to join information in tables
placed at different sites is low. If an appropriate distribution strategy is adopted, then this
design alternative helps to reduce the communication cost during data processing.
Fully Replicated
In this design alternative, one copy of all the database tables is stored at each site.
Since each site has its own copy of the entire database, queries are very fast, requiring
negligible communication cost. On the contrary, the massive redundancy in data requires
huge cost during update operations. Hence, this is suitable for systems where a large
number of queries must be handled while the number of database updates is low.
Partially Replicated
Copies of tables or portions of tables are stored at different sites. The distribution of the
tables is done in accordance with the frequency of access. This takes into consideration
the fact that the frequency of accessing the tables varies considerably from site to site.
The number of copies of the tables (or portions) depends on how frequently the access
queries execute and the sites which generate them.
Fragmented
In this design, a table is divided into two or more pieces referred to as fragments or
partitions, and each fragment can be stored at different sites. This considers the fact that
it seldom happens that all data stored in a table is required at a given site. Moreover,
fragmentation increases parallelism and provides better disaster recovery. Here, there
is only one copy of each fragment in the system, i.e., no redundant data. A small code
sketch of horizontal and vertical fragmentation follows the list below.
The three fragmentation techniques are −
 Vertical fragmentation
 Horizontal fragmentation
 Hybrid fragmentation
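As a rough illustration, the Python sketch below fragments an invented
Employee table horizontally (rows partitioned by a site predicate) and
vertically (columns split, with the key repeated). Table contents, column
names, and site names are all assumptions for the demo.

# Illustrative sketch of fragmentation over an invented table.
employees = [
    {"id": 1, "name": "Asha",  "branch": "Delhi"},
    {"id": 2, "name": "Ravi",  "branch": "Ghaziabad"},
    {"id": 3, "name": "Meena", "branch": "Delhi"},
]

# Horizontal fragmentation: rows are partitioned by a site predicate. The
# fragments are disjoint and their union rebuilds the original table.
horizontal = {}
for row in employees:
    horizontal.setdefault(row["branch"], []).append(row)
print(horizontal["Delhi"])                 # rows stored at the Delhi site

# Vertical fragmentation: columns are split, repeating the primary key in
# every fragment so the table can be rebuilt by joining on "id".
names    = [{"id": r["id"], "name": r["name"]}     for r in employees]
branches = [{"id": r["id"], "branch": r["branch"]} for r in employees]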
Mixed Distribution
This is a combination of fragmentation and partial replication. Here, the tables are
initially fragmented in any form (horizontal or vertical), and then these fragments are
partially replicated across the different sites according to the frequency of accessing
the fragments.
