DBMS Unit-5
If data is spread across multiple disks without the RAID technique, the
loss of a single disk can affect the entire data set.
RAID is transparent to the host system.
To the host, it appears as a single large disk presenting itself as a linear
array of blocks. This allows older technologies to be replaced by RAID
without making too many changes to the existing code.
A RAID system is evaluated on reliability, availability, performance, and capacity.
There are 7 levels of RAID schemes: RAID 0, RAID 1, ..., RAID 6.
RAID-0 (Striping)
RAID level 0 provides data striping, i.e., data can be placed across
multiple disks.
Instead of placing just one block on a disk at a time, we can place two
(or more) blocks on a disk before moving on to the next one.
Evaluation
Reliability: 0
There is no duplication of data. Hence, a block once lost cannot be recovered.
Capacity: N*B
The entire space is being used to store data. Since there is no duplication, N disks
each having B blocks are fully utilized.
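The block-to-disk mapping under striping can be sketched as follows (a minimal illustration assuming round-robin placement of one block per disk; the function name `stripe` is made up for the example):

```python
# Toy sketch of RAID-0 striping, not a real disk driver.
# With N disks, logical block b maps to disk (b % N) at offset (b // N).

def stripe(block, num_disks):
    """Map a logical block number to a (disk, offset) pair."""
    return block % num_disks, block // num_disks

# With 4 disks, consecutive logical blocks land on consecutive disks:
print([stripe(b, 4) for b in range(8)])
# Blocks 0..3 go to disks 0..3 at offset 0; blocks 4..7 at offset 1.
```

Because consecutive blocks sit on different disks, a large sequential read can be served by all N disks in parallel, which is where RAID-0's performance comes from.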
RAID-1 (Mirroring)
More than one copy of each block is stored in a separate disk. Thus, every block
has two (or more) copies, lying on different disks.
Evaluation:
Reliability: 1
Every block has a mirror copy on a different disk, so the failure of any single disk can be tolerated.
Capacity: (N*B)/2
Since every block is duplicated, only half of the total space stores distinct data.
RAID-4 (Block-Level Striping with Dedicated Parity)
In this level, data lost on a failed disk is regenerated using the parity drive.
It provides high data transfer rates.
In this level, data is accessed in parallel.
It requires an additional drive for parity.
It gives slow performance when operating on small files.
Evaluation:
Reliability: 1
RAID-4 allows recovery of at most 1 disk failure (because of the way parity
works). If more than one disk fails, there is no way to recover the data.
Capacity: (N-1)*B
One disk in the system is reserved for storing the parity. Hence, (N-1) disks are
made available for data storage, each disk having B blocks.
RAID-5 (Block-Level Striping with Distributed Parity)
This is a slight modification of the RAID-4 system where the only difference
is that the parity rotates among the drives.
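The parity idea behind RAID-4 and RAID-5 can be sketched with XOR (a toy illustration using small integers as block contents, not a real stripe layout):

```python
# Sketch of RAID-4/5 parity: the parity block is the XOR of the data
# blocks in a stripe, so any single lost block can be rebuilt by
# XOR-ing the surviving blocks with the parity.

from functools import reduce
from operator import xor

def parity(blocks):
    return reduce(xor, blocks)

data = [0b1011, 0b0110, 0b1100]   # three data blocks in one stripe
p = parity(data)                  # dedicated (RAID-4) or rotating (RAID-5) parity

# The disk holding data[1] fails; rebuild it from the survivors plus parity:
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]
```

This is why at most one disk failure is recoverable: with two blocks missing, a single XOR equation cannot determine both unknowns.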
Fixed-Length Records
⚫ All records on the page are guaranteed to be of the same length, so record slots are
uniform and can be arranged consecutively within a page.
⚫ When a record is inserted into the page, we must locate an empty slot and place the
record.
⚫ To insert a record into an empty slot, we must first find an empty slot; this
can be done in two ways.
⚫ The first alternative is to store records in the first N slots (where N is the number
of records on the page); whenever a record is deleted,
we move the last record on the page into the vacated slot.
This format allows us to locate the ith record on a page by a simple offset
calculation, and all empty slots appear together at the end of the page.
This approach does not work if there are external references to the record.
⚫ The second alternative is to handle deletions by using an array of bits, one
per slot, to keep track of free slot information.
⚫ Locating records on the page requires scanning the bit array to find
slots whose bit is on; when a record is deleted, its bit is turned off.
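The second alternative (the bit array of free slots) can be sketched as follows (a toy page model; a real page stores bytes on disk, and the class and method names are made up for the example):

```python
# Sketch of a fixed-length-record page that tracks free slots with a
# bit array (modelled here as a list of booleans, one "bit" per slot).

class FixedPage:
    def __init__(self, num_slots):
        self.used = [False] * num_slots   # the bit array
        self.slots = [None] * num_slots   # the record slots

    def insert(self, record):
        for i, in_use in enumerate(self.used):
            if not in_use:                # scan for a slot whose bit is off
                self.slots[i] = record
                self.used[i] = True
                return i                  # slot number can be part of the rid
        raise RuntimeError("page full")

    def delete(self, slot):
        self.used[slot] = False           # just turn the bit off; no record moves

page = FixedPage(4)
a = page.insert("rec-A")   # slot 0
b = page.insert("rec-B")   # slot 1
page.delete(a)
c = page.insert("rec-C")   # reuses slot 0; rec-B keeps its slot number
```

Unlike the first alternative, deleting a record never moves another record, so external references to surviving records stay valid.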
Variable-Length Records
⚫ With variable-length records we cannot divide the page into a fixed
collection of slots. The problem is that, when a new record is to be inserted,
we have to find an empty slot of just the right length.
⚫ When a record is deleted, we must move records to fill the hole created by
the deletion, to ensure that all the free space on the page is contiguous
⚫ The most flexible organization for variable-length records is to maintain a
directory of slots for each page, with a (record offset (Location), record
length(size)) pair per slot.
⚫ The first component (record offset) is a 'pointer' to the record.
⚫ Deletion is done by setting the record offset to -1.
⚫ Records can be moved around on the page because the rid, which is the page
number and slot number does not change when the record is moved; only
the record offset stored in the slot changes.
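The slot directory can be sketched as follows (a toy model where the record area is a string; the class and method names are made up for the example):

```python
# Sketch of a variable-length-record page with a slot directory of
# (record offset, record length) pairs. Deletion sets the offset to -1;
# records can later be compacted without changing any rid, because the
# rid refers to the slot, not to the offset.

class VarPage:
    def __init__(self):
        self.data = ""        # record area, modelled as a string
        self.directory = []   # slot number -> [offset, length]

    def insert(self, record):
        self.directory.append([len(self.data), len(record)])
        self.data += record
        return len(self.directory) - 1    # slot number (part of the rid)

    def fetch(self, slot):
        offset, length = self.directory[slot]
        if offset == -1:
            return None                   # slot marks a deleted record
        return self.data[offset:offset + length]

    def delete(self, slot):
        self.directory[slot][0] = -1      # offset -1 marks the slot deleted

page = VarPage()
s0 = page.insert("short")
s1 = page.insert("a much longer record")
page.delete(s0)
print(page.fetch(s1))   # the rid (slot 1) still works after the deletion
```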
RECORD FORMATS
⚫ While choosing a way to organize the fields of a record, we must consider
whether the fields of the record are of fixed or variable length and consider the
cost of various operations on the record.
⚫ Information common to all records of a given record type (such as the number
of fields and field types) is stored in the system catalog, which can be thought
of as a description of the contents of a database, maintained by the DBMS.
Fixed-Length Records
⚫ In a fixed-length record, each field has a fixed length, and the number of
fields is also fixed.
⚫ The fields of such a record can be stored consecutively, and given the address of
the record, the address of a particular field can be calculated using information
about the lengths of preceding fields.
Variable-Length Records
⚫ In the relational model, every record in a relation contains the same number of
fields. If the number of fields is fixed, a record is of variable length only
because some of its fields are of variable length.
⚫ One possible organization is to store fields consecutively, separated by
delimiters. This organization requires a scan of the record to locate a desired
field.
⚫ Another possible organization is to reserve some space at the beginning of a
record for use as an array of integer offsets.
⚫ The ith integer in this array is the starting address of the ith field value
relative to the start of the record.
⚫ We also store an offset to the end of the record; this offset is needed to
recognize where the last field ends.
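The offset-array organization can be illustrated as follows (a sketch that only computes the offsets; the field values are made up for the example):

```python
# Sketch of the offset-array layout for a variable-length record: an
# array holds the start offset of each field plus one final offset to
# the end of the record, so any field can be sliced out directly
# without scanning for delimiters.

def field_offsets(fields):
    """Return the start offset of each field, plus the end-of-record
    offset, relative to the start of the field data."""
    offsets, pos = [], 0
    for f in fields:
        offsets.append(pos)
        pos += len(f)
    offsets.append(pos)               # offset to the end of the record
    return offsets

record = ["Smith", "CS", "Hyderabad"]   # example field values
offs = field_offsets(record)
payload = "".join(record)
# The i-th field is payload[offs[i]:offs[i+1]]:
print(offs, payload[offs[1]:offs[2]])   # [0, 5, 7, 16] CS
```

The final offset (16 here) is exactly the "offset to the end of the record" mentioned above: without it, the length of the last field could not be determined.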
File Organization
A database consists of a huge amount of data. The data is grouped within tables in an
RDBMS, and each table has related records. A user sees the data stored
in the form of tables, but in reality this huge amount of data is stored in physical
memory in the form of files.
File: A file is a named collection of related information that is recorded on
secondary storage such as magnetic disks, magnetic tapes, and optical
disks.
File Organization refers to the logical relationships among various records that
constitute the file, particularly with respect to the means of identification and
access to any specific record.
Closed hashing
In the closed hashing method, a new data bucket is allocated with the same address and is
linked after the full data bucket. This method is also known as overflow
chaining.
Quadratic probing
Quadratic probing is an open-addressing scheme where we look for the i²-th slot in
the i-th iteration if the given hash value x collides in the hash table.
How Quadratic Probing is done?
Let hash(x) be the slot index computed using the hash function.
If the slot hash(x) % S is full, then we try (hash(x) + 1*1) % S.
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S.
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S.
This process is repeated for all values of i until an empty slot is found.
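The probe sequence above can be sketched as a minimal insert-only table (using the identity function as hash(x) for illustration; the function name is made up):

```python
# Sketch of quadratic probing insert: on a collision at hash(x) % S we
# try (hash(x) + i*i) % S for i = 1, 2, 3, ... until a slot is free.
# Here hash(x) = x (identity), purely for illustration.

def qp_insert(table, x):
    S = len(table)
    for i in range(S):
        slot = (x + i * i) % S       # i = 0 gives hash(x) % S itself
        if table[slot] is None:
            table[slot] = x
            return slot
    raise RuntimeError("no empty slot found")

table = [None] * 7
for key in (50, 700, 76, 85, 92):
    qp_insert(table, key)
print(table)   # [700, 50, 85, None, None, 92, 76]
```

For example, 92 % 7 = 1 and (92 + 1) % 7 = 2 are both occupied, so 92 lands at (92 + 4) % 7 = 5.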
Double hashing
Double hashing is a collision resolving technique in Open Addressed Hash tables.
Double hashing uses the idea of applying a second hash function to the key when a
collision occurs.
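A minimal sketch of double hashing follows. The probe sequence is (h1(x) + i * h2(x)) % S; the particular second hash h2(x) = R - (x % R), with a prime R smaller than S, is an assumed common choice, not something fixed by the text:

```python
# Sketch of double hashing: on a collision at h1(x), probe
# (h1(x) + i * h2(x)) % S for i = 1, 2, ...
# h2 must never evaluate to 0, which R - (x % R) guarantees.

def dh_insert(table, x, R=5):
    S = len(table)
    h1 = x % S                       # primary hash
    h2 = R - (x % R)                 # step size; always in 1..R
    for i in range(S):
        slot = (h1 + i * h2) % S
        if table[slot] is None:
            table[slot] = x
            return slot
    raise RuntimeError("no empty slot found")

table = [None] * 7
print([dh_insert(table, k) for k in (27, 6, 13)])   # [6, 3, 1]
```

Unlike quadratic probing, the step size depends on the key itself, so two keys that collide at h1 still follow different probe sequences.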
Bucket Splitting:
When the number of elements in a bucket exceeds a particular size, then
the bucket is split into two parts.
Directory Expansion
Directory Expansion Takes place when a bucket overflows. Directory Expansion
is performed when the local depth of the overflowing bucket is equal to the global
depth.
Example of hashing the following elements: 16,4,6,22,24,10,31,7,9,20,26.
Bucket Size: 3 (Assume)
Hash Function: Suppose the global depth is X. Then the Hash Function returns
X LSBs.
16- 10000 , 4- 00100 , 6- 00110 , 22- 10110 , 24- 11000 , 10- 01010 ,
31- 11111, 7- 00111, 9- 01001 , 20- 10100 , 26- 11010
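The directory hash used above can be sketched directly (the helper name `lsb` is made up for the example):

```python
# Sketch of the extendible-hashing directory hash: with global depth X,
# a key's bucket is chosen by its X least-significant bits.

def lsb(key, depth):
    return key & ((1 << depth) - 1)   # keep the X low-order bits

keys = [16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26]
# With global depth 1, keys split on the last bit (0 or 1):
print([lsb(k, 1) for k in keys])
# With global depth 2, each directory entry uses the last two bits:
print([lsb(k, 2) for k in keys])
```

For instance, 26 = 11010 has last two bits 10, so at global depth 2 it hashes to directory entry 2, matching the binary values listed above.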
The drawback of static hashing is that it does not expand or shrink
dynamically as the size of the database grows or shrinks.
Dynamic Hashing
In dynamic hashing, data buckets grow or shrink as the number of records increases or
decreases. Dynamic hashing is also known as extended hashing.
Advantages
Tree traversal is easier and faster.
Searching becomes easy as all records are stored only in leaf nodes and are sorted
in a sequential linked list.
There is no restriction on B+ tree size. It may grow/shrink as the size of data
increases/decreases.
Disadvantages
Inefficient for static tables.
Cluster File Organization
In cluster file organization, two or more related tables/records are stored
within the same file, known as a cluster.
These files will have two or more tables in the same data block and the
key attributes which are used to map these table together are stored only
once.
Thus it lowers the cost of searching and retrieving various records in different
files as they are now combined and kept in a single cluster.
For example, we have two tables or relations, Employee and Department, that
are related to each other.
These tables can be combined using a join operation and can be seen in a
cluster file.
If we must insert, update or delete any record we can directly do so. Data is sorted
based on the primary key or the key with which searching is done. Cluster key is
the key with which joining of the table is performed.
Indexing
Indexing is a way to optimize the performance of a database by minimizing
the number of disk accesses required when a query is processed.
It is a data structure technique which is used to quickly locate and access the
data in a database.
Indexes are created using a few database columns.
The first column is the Search key that contains a copy of the primary key or
candidate key of the table.
The second column is the Data Reference or Pointer which contains a set of
pointers holding the address of the disk block where that particular key value can
be found.
Classification Of Index
Dense Index
For every search key value in the data file, there is an index record.
This record contains the search key and also a reference to the first
data record with that search key value.
Sparse Index
The index record appears only for a few items in the data file. Each
index record points to a block of records.
Types of Indexing
1. Single-level Indexing
Primary indexing
Clustering Indexing
Secondary Indexing
2. Multilevel Indexing
B Trees
B+ Trees
Primary indexing
It is defined mainly on the primary key of the data file, in which the
data file is already ordered based on the primary key.
⚫ Primary Index is an ordered file whose records are of fixed length
with two fields. The first field of the index replicates the primary key
of the data file in an ordered manner, and the second field of the
ordered file contains a pointer that points to the data-block where a
record containing the key is available.
⚫ The first record of each block is called the Anchor record or Block
anchor. There exists a record in the primary index file for every block
of the data-file.
⚫ The average number of block accesses required using the primary index is
log2(B) + 1, where B is the number of index blocks.
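As an illustrative check of this formula (a toy calculation that assumes B is a power of two; the function name is made up):

```python
# Binary search over B sorted index blocks touches about log2(B) + 1
# blocks, which is the average access cost of the primary index.

import math

def index_accesses(num_index_blocks):
    return int(math.log2(num_index_blocks)) + 1

print(index_accesses(1024))   # 11
```

So even an index of 1024 blocks needs only about 11 block accesses, versus scanning hundreds of blocks without the index.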
Clustered index:
A clustered index is created only when both of the following conditions are
satisfied:
1. The data or file that you are moving into secondary memory should
be in sequential or sorted order.
2. The index should be on a non-key column, meaning it can have repeated values.
Properties of B-Tree
A B-Tree is defined by the term degree or Order ‘M’. The value of M depends
upon disk block size.
Every node in a B-Tree except the root node and the leaf nodes contains at least
M/2 children.
The root node must have at least 2 child nodes.
All nodes (including root) may contain at most M – 1 keys.
Every node in a B-Tree contains at most M children.
All leaves are at the same level.
All keys of a node are sorted in increasing order. The child between two keys
k1 and k2 contains all keys in the range from k1 to k2.
A B tree of order 4 is shown below.
While performing some operations on B Tree, any property of B Tree may violate
such as number of minimum children a node can have. To maintain the
properties of B Tree, the tree may split or join.
Operations
Searching
Insertion
Deletion
Searching
Searching in B Trees is like that in Binary search tree.
For example, if we search for the item 49 in the following B Tree, the process will
go something like the following:
Compare item 49 with the root node 78. Since 49 < 78, move to its left subtree.
Since 40 < 49 < 56, traverse the right subtree of 40.
Since 49 > 45, move to the right and compare again.
Match found, return.
Searching in a B tree depends upon the height of the tree. The search algorithm
takes O(log n) time to search any element in a B tree.
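The search walk above can be sketched over hand-built nodes. Keys 78, 40, 56, 45, and 49 come from the example; the shape of the tree and the remaining filler keys (30, 60, 80, 90) are assumptions made so the sketch is self-contained:

```python
# Minimal sketch of B-tree search. Each node is a dict with sorted
# "keys" and, for internal nodes, a "children" list with one more
# entry than there are keys.

from bisect import bisect_left

def btree_search(node, key):
    i = bisect_left(node["keys"], key)         # binary search within the node
    if i < len(node["keys"]) and node["keys"][i] == key:
        return True                            # match found in this node
    children = node.get("children")
    if not children:
        return False                           # reached a leaf without a match
    return btree_search(children[i], key)      # descend into the chosen subtree

# Hand-built tree for the 49 example (filler keys are hypothetical):
leaf = {"keys": [45, 49]}
left = {"keys": [40, 56], "children": [{"keys": [30]}, leaf, {"keys": [60]}]}
root = {"keys": [78], "children": [left, {"keys": [80, 90]}]}

print(btree_search(root, 49), btree_search(root, 50))   # True False
```

The search for 49 visits exactly the nodes described above: root (78), the node holding 40 and 56, then the leaf containing 45 and 49.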
Insertion
Insertions are done at the leaf node level.
The following algorithm needs to be followed in order to insert an item into
B Tree.
Traverse the B Tree in order to find the appropriate leaf node at which the new
element can be inserted.
If the leaf node contains fewer than m-1 keys, then insert the element in
increasing order.
Else, if the leaf node already contains m-1 keys, then follow these steps:
Insert the new element in the increasing order of elements.
Split the node into two nodes at the median.
Push the median element up to its parent node.
If the parent node also contains m-1 keys, then split it too by
following the same steps.
DELETION
Deletion from a B-tree is more complicated than insertion, because we
can delete a key from any node, not just a leaf.
When we delete a key from an internal node, we will have to rearrange
the node's children.
Just as we had to ensure that a node didn't get too big due to insertion, we
must ensure that a node doesn't get too small during deletion.
A simple approach to deletion might have to back up if a node (other than
the root) along the path to where the key is to be deleted has the minimum
number of keys.
Since most of the keys in a B-tree are in the leaves, deletion operations
are most often used to delete keys from leaves.
The recursive delete procedure then acts in one downward pass through
the tree, without having to back up.
When deleting a key in an internal node, however, the procedure makes
a downward pass through the tree but may have to return to the node
from which the key was deleted to replace the key with its predecessor
or successor.
The number of children must be at least the ceiling of m/2.
After deleting 6 the tree looks like:
After deleting 13 the tree looks like:
After deleting 7, two leaf nodes are merged to form the new node.
For a B-Tree of order M:
Each internal node has up to M-1 keys to search.
Each internal node has between M/2 and M children.
The depth of a B-Tree storing N items is O(log_{M/2} N).
The run time is O(log M) to binary search which branch to take at each
node, but M is small compared to N.
The total time to find an item is O(depth * log M) = O(log N).
The drawback of B-tree is that it stores the data pointer (a pointer to the
disk file block containing the key value), corresponding to a particular
key value, along with that key value in the node of a B-tree.
B+ tree
A B+ tree eliminates the above drawback by storing data pointers only at
the leaf nodes of the tree.
The structure of the leaf nodes of a B+ tree is quite different from the
structure of the internal nodes of the B tree.
The leaf nodes are linked to provide ordered access to the records.
The leaf nodes form the first level of index, with the
internal nodes forming the other levels of a
multilevel index.
Some of the key values of the leaf nodes also appear in
the internal nodes, to simply act as a medium to control
the searching of a record.
A B+ tree has two orders, ‘a’ and ‘b’, one for the internal nodes and the
other for the external (or leaf) nodes.
Internal Node structure
A B+ tree with ‘l’ levels can store more entries in its internal nodes compared to
a B-tree having the same ‘l’ levels. This accentuates the significant improvement
made to the search time for any given key.
Operations on B+ Tree
Searching.
Insertion.
Deletion.
Searching
Searching is just like in a binary search tree.
It starts at the root and works down to the leaf level.
It compares the search value with the current
“separation value” and goes left or right.
Since there is no structural change in a B+ tree during a search,
we just compare the key value with the data in the tree and return
the result.
Insertion
A search is first performed, using the value to be added.
After the search is completed, the location for the new value is
known.
If the tree is empty, add to the root.
Once the root is full, split the data into 2 leaves, using the root to
hold the separating key values.
Complexity
The total time complexity of the B+ tree search operation is O(t log_t n),
where O(t) is the time complexity of the linear search within each node.
The time complexity of the insertion algorithm is also O(t log_t n).
Advantages of B+ Trees:
1. Records can be fetched in an equal number of disk accesses.
2. The height of the tree remains balanced and is less compared to a B-tree.
3. We can access the data stored in a B+ tree sequentially as well as
directly.
4. Keys are used for indexing.
5. Faster search queries as the data is stored only on the leaf nodes.
6. Deletion is never a complex process, since elements are always
deleted from the leaf nodes in B+ trees, whereas in a B tree,
deletion of internal node keys is more complicated and time consuming.
Disadvantages of B+ Trees:
⚫ Any search will end at a leaf node only.
⚫ The time complexity for every search is O(h).
⚫ There is extra insertion and deletion overhead, as well as space overhead.
17CI09 Data Base Management Systems
Program & Semester: B.Tech & III SEM
AI&DS
Academic Year: 2023 - 24
Consistency
Consistency means that the nodes will have the same copies of a replicated data
item visible for various transactions. A guarantee that every node in a
distributed cluster returns the same, most recent, successful write.
Availability:
Availability means that each read or write request for a data item will either
be processed successfully or will receive a message that the operation cannot
be completed.
Partition tolerance
Partition tolerance means that the system can continue operating if the network
connecting the nodes has a fault that results in two or more partitions, where
the nodes in each partition can only communicate among each other.
Graph-Based
A graph-type database stores entities as well as the relations among those
entities. An entity is stored as a node, with the relationships as edges. An edge
gives a relationship between nodes. Every node and edge has a unique
identifier.
A graph database is a database that is based on graph theory. It consists of a set
of objects, which can be a node or an edge.
Nodes represent entities or instances such as people, businesses, accounts, or
any other item to be tracked.
Edges, also termed relationships, are the lines that connect nodes
to other nodes, representing the relationship between them. Edges are the
key concept in graph databases.
Properties are information associated with nodes and edges.
Graph-based databases are mostly used for social networks, logistics, and spatial data.
Examples: Neo4J, Infinite Graph, OrientDB.
Column-based
Column-oriented databases work on columns and are based on the BigTable
paper by Google. Every column is treated separately. Values of a single
column are stored contiguously.