DBMS - Unit 4
File Organization:
In a database, a file is organized as a sequence of records, which are mapped onto disk
blocks. Since records vary in size, different techniques are required to manage their
storage effectively. This is important for ensuring efficient file operations like inserting,
deleting, and retrieving records.
1. Fixed-Length Records
In fixed-length record organization, every record has the same size. For example, in a bank
database, each account record (account number, branch name, balance) may be 40
bytes long. Records are stored sequentially in disk blocks, with each block holding
multiple records.
type deposit = record
    account-number char(10);
    branch-name char(22);
    balance numeric(12,2);
end
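As a minimal sketch of how fixed-length records map to byte offsets (a Python model using the field widths of the deposit record above; the helper names are illustrative):

import struct

# Fixed-length layout: account-number char(10), branch-name char(22),
# balance stored here as an 8-byte float -> 40 bytes per record.
RECORD_FORMAT = "=10s22sd"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)       # 40 bytes

def record_offset(i):
    # With fixed-length records, the i-th record always starts
    # at a directly computable offset within the file.
    return i * RECORD_SIZE

def pack_record(account_number, branch_name, balance):
    # struct pads or truncates the char fields to their fixed widths.
    return struct.pack(RECORD_FORMAT, account_number.encode(),
                       branch_name.encode(), balance)

print(RECORD_SIZE, record_offset(3))               # -> 40 120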
Multitable Clustering File Organization:
If you frequently join Employee and Department tables (e.g., to display employee names
along with their department names), multitable clustering might store the records for
Alice (EmployeeID 101, DepartmentID 1) and the HR department (DepartmentID 1,
DepartmentName HR) in the same disk block. When a query requests employees and
their departments, it can retrieve both pieces of information with a single block read.
However, if another query wants only employee data without department information, it
may still need to read the same block, even though the department data is not required,
leading to unnecessary block accesses.
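A toy sketch of the clustering idea, using the hypothetical Employee/Department records above (a real DBMS does this at the disk-block level; here a block is modeled as a plain list):

# Multitable clustering: records of two tables that share a DepartmentID
# are placed in the same "block", so a join reads one block instead of two.
blocks = {}

def place(dept_id, record):
    blocks.setdefault(dept_id, []).append(record)

place(1, ("Department", 1, "HR"))
place(1, ("Employee", 101, "Alice", 1))

# A join on DepartmentID = 1 needs a single block read:
print(blocks[1])
# An employee-only scan still reads the department record in the same block.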
Indexing:
Indexing is a technique used to speed up the retrieval of data in databases by minimizing
the number of disk accesses. It is similar to the index found in a textbook. Indexes are
built using specific database fields and consist of two parts: the Search Key and the Data
Reference (Pointer). The Search Key contains the actual values (e.g., IDs, names), and
the Pointer holds the address where the corresponding data is stored on the disk.
a. Dense Index: An index entry appears for every record (search-key value), so lookups
are fast, but the index itself takes more space.
b. Sparse Index: Only some records have index entries. Searching takes longer, since the
index locates only the nearest indexed record and the rest must be found by scanning,
but it saves space.
1. Clustered or Primary Indexing: The index is built on the ordering field of a sequentially
organized file. Sequential file organization works efficiently for range-based queries,
where records are fetched in a continuous manner. Sequential indexing helps in these
cases by:
Reducing the number of disk accesses.
Speeding up range queries (e.g., retrieving all records from Student ID 1001 to
1050).
Sorting data in sequential order to allow sequential scanning for queries.
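A minimal sketch of a sparse index over a sorted file, assuming one index entry per block (block size and student IDs are made up for illustration):

import bisect

BLOCK_SIZE = 4
records = list(range(1001, 1021))                  # sorted Student IDs
blocks = [records[i:i + BLOCK_SIZE]
          for i in range(0, len(records), BLOCK_SIZE)]

# Sparse index: one entry (first key of the block) per block.
index_keys = [b[0] for b in blocks]

def lookup(key):
    # Find the last block whose first key <= key, then scan that block.
    b = max(bisect.bisect_right(index_keys, key) - 1, 0)
    return key in blocks[b]

def range_query(lo, hi):
    # Jump to the block containing lo, then scan blocks sequentially.
    b = max(bisect.bisect_right(index_keys, lo) - 1, 0)
    out = []
    for block in blocks[b:]:
        for k in block:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
    return out

print(lookup(1007), range_query(1005, 1010))   # -> True [1005, ..., 1010]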
2. Non-clustered or Secondary Indexing: The data is stored in one place, and the index
has pointers to where the actual data is. This is helpful for quick lookups but adds an extra
step to retrieve the data.
3. Multilevel Indexing: When the database grows too large, multilevel indexing breaks the
index into smaller levels, making it easier to store and search.
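A sketch of the two-level idea, where an outer (sparse) index holds one entry per block of the inner index, so a search touches one block at each level (sizes and pointer names are illustrative):

import bisect

BLOCK = 100
# Inner index: sorted (key, record-pointer) pairs; pointers are dummy strings.
inner = [(k, f"ptr{k}") for k in range(0, 100000, 10)]
inner_blocks = [inner[i:i + BLOCK] for i in range(0, len(inner), BLOCK)]
# Outer index: first key of each inner-index block.
outer = [blk[0][0] for blk in inner_blocks]

def search(key):
    b = max(bisect.bisect_right(outer, key) - 1, 0)   # one outer-level probe
    blk = inner_blocks[b]                             # one inner block read
    j = bisect.bisect_right([k for k, _ in blk], key) - 1
    return blk[j] if j >= 0 and blk[j][0] == key else None

print(search(500))   # -> (500, 'ptr500')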
Advantages of Indexing:
Faster Query Performance: Data retrieval is quicker because the index acts like a
shortcut to finding records.
Efficient Data Access: Fewer disk blocks are read, improving overall access time.
Improved Sorting: Sorting is faster as indexed columns are already ordered.
Consistent Performance: The performance remains stable even as the database
grows in size.
Disadvantages of Indexing:
Storage Overhead: Indexes require extra storage space.
Slower Insert/Update Operations: Whenever data is added or modified, the index
must be updated, which can slow down these operations.
Increased Maintenance: Indexes need regular updating and management to
ensure they work efficiently.
B+ Tree Index:
A B+ tree is a balanced, multi-way search tree that follows a multi-level index format. It is
widely used in databases and file systems to support efficient searches, insertions, and
deletions. In a B+ tree, all leaf nodes are at the same level, and they contain the actual
data pointers, while internal nodes act as index points.
Structure of a B+ Tree:
Root: The topmost node that acts as the starting point for any search or operation.
Internal Nodes: Contain keys and child pointers. These guide the search process
by comparing the search key with the stored keys and directing the search to the
appropriate child.
Leaf Nodes: Contain both keys and associated data (or pointers to data records).
They are linked in a sequential manner using a linked list, enabling efficient
sequential access in addition to random access.
Order of the Tree: The order (n) of the B+ tree defines the maximum number of
children a node can have. Every internal node, except the root, must have
at least ⌈n/2⌉ children.
Operations in a B+ Tree:
1. Searching: The search process begins at the root node and traverses down through the
internal nodes by comparing keys, until reaching the leaf nodes. If the search key
matches a key in a leaf node, the corresponding record is found.
Ex: Suppose we have to search for 55 in the B+ tree structure below. First, we examine
the intermediary node, which directs us to the leaf node that can contain a record for 55.
In the intermediary node, we take the branch between keys 50 and 75. This leads us to
the third leaf node, where the DBMS performs a sequential search to find 55.
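A minimal sketch of this search walk (the node layout is a simplified in-memory model, and the tree shape below is an assumed reconstruction of the example's figure):

import bisect

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys             # sorted search keys
        self.children = children     # child nodes (internal node only)
        self.records = records       # data pointers (leaf node only)
        self.is_leaf = children is None

def search(node, key):
    while not node.is_leaf:
        # Follow the child whose key range covers the search key.
        node = node.children[bisect.bisect_right(node.keys, key)]
    # Sequential search within the leaf, as in the example.
    for k, rec in zip(node.keys, node.records):
        if k == key:
            return rec
    return None

leaves = [Node([10, 25], records=["r10", "r25"]),
          Node([30, 40], records=["r30", "r40"]),
          Node([50, 55, 65], records=["r50", "r55", "r65"]),
          Node([75, 80], records=["r75", "r80"])]
root = Node([30, 50, 75], children=leaves)
print(search(root, 55))   # branch between 50 and 75 -> third leaf -> 'r55'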
2. Insertion: When a new key is inserted, it is placed in the correct position in a leaf node.
If the leaf node is full, it is split, and the median key is moved up to the parent node. If the
parent node is full, the splitting process propagates upward, and if the root is split, a new
root is created.
Ex: Suppose we want to insert a record with key 60 into the structure below. It belongs in the
3rd leaf node, after 55. The tree is balanced and that leaf node is already full, so we cannot
insert 60 there directly.
In this case, we have to split the leaf node so that 60 can be inserted into the tree without
affecting the fill factor, balance, and order.
With 60 included, the 3rd leaf node would hold the values (50, 55, 60, 65, 70), and its parent
currently branches at 50. We split the leaf node in the middle so that the balance is not
altered, grouping (50, 55) and (60, 65, 70) into two leaf nodes.
If these two are to be separate leaf nodes, the intermediate node can no longer branch only
at 50: 60 must be added to it, together with a pointer to the new leaf node.
This is how we insert an entry when there is overflow. In a normal scenario, it is very easy
to find the node where the key fits and then place it in that leaf node.
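A sketch of just the leaf-split step from this example, assuming order n = 5 so that a leaf overflows when it reaches 5 keys (parent bookkeeping is omitted):

import bisect

def insert_into_leaf(leaf_keys, key, order=5):
    bisect.insort(leaf_keys, key)          # place the key in sorted position
    if len(leaf_keys) < order:             # still fits: done
        return leaf_keys, None, None
    mid = len(leaf_keys) // 2              # split in the middle
    left, right = leaf_keys[:mid], leaf_keys[mid:]
    # In a B+ tree the separator key is copied up to the parent
    # while also remaining the first key of the right leaf.
    return left, right, right[0]

left, right, sep = insert_into_leaf([50, 55, 65, 70], 60)
print(left, right, sep)                    # -> [50, 55] [60, 65, 70] 60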
3. Deletion: The deletion of a key occurs at the leaf node. If the node still has sufficient
keys, the process ends. If not, the keys may be merged or redistributed with a sibling
node. If the root has only one child after merging, that child becomes the new root.
Ex: Suppose we want to delete 60 from the above example. In this case, we have to remove
60 from the 4th leaf node and from the intermediate node as well. If it is removed only from
the intermediate node, the tree will not satisfy the rules of the B+ tree, so we need to modify
it to keep the tree balanced.
After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as follows:
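A sketch of the underflow handling after a leaf deletion (minimum occupancy conventions vary by textbook; min_keys = 2 and a right sibling are assumed here, and parent separator updates are omitted):

def delete_from_leaf(leaf, key, right_sibling, min_keys=2):
    leaf.remove(key)
    if len(leaf) >= min_keys:
        return leaf, right_sibling             # no underflow: done
    if len(right_sibling) > min_keys:
        leaf.append(right_sibling.pop(0))      # borrow (redistribute) a key
        return leaf, right_sibling
    return leaf + right_sibling, None          # merge; parent drops a separator

print(delete_from_leaf([60, 65], 60, [70, 75, 80]))  # -> ([65, 70], [75, 80])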
Characteristics of B+ Tree:
Height-balanced: All leaf nodes are at the same level, maintaining a balanced
structure for efficient searching.
Efficient Sequential Access: Leaf nodes are linked, allowing fast sequential
traversal, in addition to random access.
Disk-friendly: The structure of the B+ tree minimizes the number of disk accesses
by organizing data into blocks.
Applications:
Database Indexing: B+ trees are widely used in databases for implementing
indexes, where the keys represent indexed columns, and the leaf nodes point to
actual records or disk blocks.
File Systems: B+ trees are used to organize directory structures in many file
systems due to their efficiency in handling large amounts of data.
HASHING:
Hashing in databases is a method used to quickly find or store data. When a database
grows large, it can become slow to find a specific piece of data. Hashing helps speed this
up by using a hash function to determine where data should be placed.
Hash Table: A table where data is stored in "buckets."
Hash Function: A mathematical formula that takes the unique data (like a user ID) and
calculates where it should be stored in the table (the bucket).
Bucket: A storage location where data is stored.
Example of Hashing:
Assume you have a list of employee IDs: [1001, 1020, 2034, 4567], and you want to store
these in a database using hashing.
Choose a hash function: hash(key) = key mod 10
Compute the hash values:
o For 1001: 1001 mod 10 = 1
o For 1020: 1020 mod 10 = 0
o For 2034: 2034 mod 10 = 4
o For 4567: 4567 mod 10 = 7
Store each record in the bucket corresponding to its hash value:
o 1001 in bucket 1
o 1020 in bucket 0
o 2034 in bucket 4
o 4567 in bucket 7
Buckets 2, 3, 5, 6, 8, and 9 store no records.
When searching for a record, the same hash function will be applied to the search key,
allowing the system to quickly locate the bucket and then retrieve the record.
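The example above, written out as a small sketch (ten fixed buckets, each bucket modeled as a list):

NUM_BUCKETS = 10
buckets = [[] for _ in range(NUM_BUCKETS)]

def hash_fn(key):
    return key % NUM_BUCKETS               # hash(key) = key mod 10

def insert(key):
    buckets[hash_fn(key)].append(key)

def search(key):
    return key in buckets[hash_fn(key)]    # only one bucket is probed

for emp_id in [1001, 1020, 2034, 4567]:
    insert(emp_id)
print(search(2034), buckets[4])            # -> True [2034]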
Collision
Sometimes, two different data entries (like two different IDs) might end up in the same
bucket. This is called a collision. To handle collisions, databases use techniques like:
Chaining: Storing multiple entries in the same bucket using a linked list.
Open Addressing: Looking for the next empty bucket to store the data.
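The bucket-as-list sketch above already illustrates chaining; here is a sketch of open addressing with linear probing (table size and keys are illustrative, and the table is assumed never to fill up):

SIZE = 10
table = [None] * SIZE

def insert(key):
    i = key % SIZE
    while table[i] is not None:       # slot taken: try the next one
        i = (i + 1) % SIZE
    table[i] = key

def search(key):
    i = key % SIZE
    while table[i] is not None:       # stop at the first empty slot
        if table[i] == key:
            return i
        i = (i + 1) % SIZE
    return None

insert(13); insert(23)                # both hash to 3; 23 probes on to slot 4
print(search(23))                     # -> 4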
Types of Hashing:
There are two types of hashing methods used:
A. Static Hash Function:
The number of buckets is fixed.
For example, with four buckets and hash(key) = key mod 4, the keys 3 and 7 both map to
Bucket 3, producing a collision. To handle the collision, techniques like chaining or open
addressing are used.
Advantages:
Simplicity: Static hashing is easier to implement and understand due to its fixed
size.
Fast Access: Search and insert operations often take constant time, O(1).
Low Overhead: No need for resizing, reducing processing and memory overhead.
Predictable Performance: Consistent performance for fixed-size data sets.
Concurrency-Friendly: Easier to manage in multi-threaded environments since
no resizing is required.
Disadvantages:
Overflows: If a bucket is full or has too many entries, overflow occurs, and
additional storage mechanisms (like chaining) are needed.
Decreased Performance: As collisions increase, the time required to insert and
search for keys also increases, leading to slower performance.
Fixed Size: If the dataset grows beyond what was initially anticipated, the fixed
number of buckets can't handle the load efficiently.
B. Dynamic Hash Function:
When there are too many collisions (too many records in one bucket), new
buckets are created to store data.
Flexible Hash Function: The hash function adjusts based on the number of
records.
Example of Dynamic Hashing:
Imagine that you have 4 buckets (numbered 00, 01, 10, and 11 in binary). As new data
comes in, the hash function uses 2 bits to decide which bucket to use. If one bucket
overflows, the system can create new buckets and adjust the hash function so that data
can spread out evenly.
Why Use Hashing?
Hashing is faster than traditional searching methods like indexing.
Static hashing is good when the size of the data is predictable.
Dynamic hashing is better when the data size is unpredictable because it can
expand or shrink as needed.
Dynamic hashing of this kind is called Extendible Hashing. It allows
the system to grow the number of buckets dynamically when one of the buckets
overflows. The idea is to "split" the overflowing bucket and adjust the number of bits used
by the hash function, so that data can be redistributed more evenly across the new and
old buckets.
Key Concepts in Extendible Hashing:
1. Global Depth: The number of bits of the hash value used to index into the
directory (which points to buckets); the directory has 2^(global depth) entries. It
starts small and increases as more buckets are added.
2. Local Depth: Each bucket has its own local depth, which is the number of bits of
the hash value that the bucket actually differentiates between. It is always less
than or equal to the global depth.
3. Bucket Splitting: When a bucket overflows, it is split into two new buckets, and
the directory may be updated to reflect the split.
Step-by-Step Example:
Initial Setup:
Global Depth starts at 2 (k = 2), meaning we are using 2 bits from the hash value
to index into the directory.
There are 4 buckets (00, 01, 10, 11) corresponding to the 2-bit combinations.
At this point, the hash function uses the last (least significant) 2 bits of each key's hash to
determine which bucket to put the data into. Taking the hash of a key to be the key itself:
Key 5 is 101 in binary, so its last 2 bits are 01 and it goes into Bucket 01.
Key 8 is 1000 in binary, so its last 2 bits are 00 and it goes into Bucket 00.
Adding Keys
Let’s add some keys to the hash table:
Key 5 hashes to 01 → goes into Bucket 01.
Key 8 hashes to 00 → goes into Bucket 00.
Key 12 hashes to 00 → also goes into Bucket 00.
Key 7 hashes to 11 → goes into Bucket 11.
At this point, the buckets look like this:
Bucket 00: 8, 12
Bucket 01: 5
Bucket 10: (empty)
Bucket 11: 7
Bucket Overflow
Now, we add Key 20, which also hashes to 00. Since Bucket 00 already has Key 8 and
Key 12, it overflows.
When this happens:
1. The system checks the Local Depth of Bucket 00. In this case, its local depth is 2
because it’s using 2 bits to di erentiate between keys.
2. Since the bucket is full, the system performs a bucket split.
3. After the split, the directory is updated to reflect the change, and the Global
Depth might increase if needed.
Splitting the Bucket
In this example, Bucket 00 is split into two new buckets:
Bucket 000 for keys whose last 3 bits are 000.
Bucket 100 for keys whose last 3 bits are 100.
After splitting:
Key 8 (binary 1000, last 3 bits 000) goes into Bucket 000.
Keys 12 (binary 1100) and 20 (binary 10100), whose last 3 bits are 100, go into Bucket 100.
At this point, the Global Depth increases to 3 (k = 3), and the directory grows to
accommodate the new buckets.
The system now looks like this:
After Splitting
Bucket 000 contains Key 8.
Bucket 100 contains Keys 12 and 20.
Bucket 01, Bucket 10, and others remain unchanged.
The Global Depth increased from 2 to 3 to accommodate the additional bucket split. The
system dynamically adjusted the number of buckets and the directory to ensure even
data distribution.
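A compact sketch of this split logic, following the walkthrough: the directory is indexed by the least significant global-depth bits, bucket capacity is 2, and the hash of a key is taken to be the key itself (a real system would hash the key first):

BUCKET_CAP = 2

class Bucket:
    def __init__(self, depth):
        self.depth = depth                     # local depth
        self.keys = []

global_depth = 2
directory = [Bucket(2) for _ in range(4)]      # entries 00, 01, 10, 11

def insert(key):
    global global_depth, directory
    while True:
        b = directory[key & ((1 << global_depth) - 1)]
        if len(b.keys) < BUCKET_CAP:
            b.keys.append(key)
            return
        if b.depth == global_depth:            # directory must double first
            directory = directory + directory
            global_depth += 1
        b.depth += 1                           # one more bit now matters
        mask = 1 << (b.depth - 1)
        new = Bucket(b.depth)
        new.keys = [k for k in b.keys if k & mask]
        b.keys = [k for k in b.keys if not k & mask]
        for i, entry in enumerate(directory):  # repoint half of b's entries
            if entry is b and i & mask:
                directory[i] = new
        # loop and retry: the key's target bucket may still be full

for k in [5, 8, 12, 7, 20]:
    insert(k)
print(global_depth)                            # -> 3, as in the walkthrough
print(directory[0].keys, directory[4].keys)    # -> [8] [12, 20]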
Uses of Extendible Hashing:
1. Scalability: Extendible hashing automatically adjusts the number of buckets
when needed, making it ideal for unpredictable data sizes.
2. Efficiency: By redistributing data and only splitting the affected buckets,
extendible hashing minimizes the overhead of maintaining the hash table.
3. Even Data Distribution: It ensures that data is spread out evenly across the
buckets, reducing the likelihood of too many collisions in one bucket.
This dynamic nature is what makes extendible hashing highly effective for handling data
that grows or changes unpredictably.
Distributed Databases:
A distributed database is a collection of data that is spread across multiple physical
locations, often across different computers, networks, or even geographical areas.
Despite the distribution, the system is managed as a single cohesive unit, allowing users
to access and manage the data as though it were all stored in one place.
Distributed databases can be categorized into Homogeneous and Heterogeneous
databases.
Additionally, cloud-based databases are another form of distributed databases where
data is hosted in the cloud.
1. Homogeneous Distributed Databases
In a homogeneous distributed database, all the individual databases across different
locations are of the same type. This means they use the same software, schema, and
data structures. These databases are easier to manage because they follow the same
rules, processes, and architecture.
Characteristics:
Same DBMS: All sites use the same database management system (DBMS),
making them compatible.
Uniform Schema: The data structure, schema, and data format are consistent
across all locations.
Easy Communication: Since all nodes follow the same architecture,
communication between databases is seamless and efficient.
Centralized Management: All databases are managed centrally, and any change
in one location can easily be replicated to others.
Simplicity: Because of uniformity, homogeneous databases are simple to
implement and maintain.
Example:
A company might have databases in different offices (New York, London, Tokyo),
all using Oracle DBMS with the same schema and structure.
2. Heterogeneous Distributed Databases
In a heterogeneous distributed database, the individual databases may use different
DBMS, schema, or data structures. These systems must use middleware or translation
layers to enable communication between databases that are not directly compatible.
Characteristics:
Different DBMS: The different sites use different DBMS (e.g., one site uses Oracle,
another uses MySQL, etc.).
Different Schema: The schema or data structure may differ between locations.
Middleware Requirement: Since the systems are not inherently compatible, a
middleware layer is required to enable data sharing and communication between
different databases.
Complexity: Heterogeneous systems are more complex to manage due to
differences in data structures, query languages, and DBMS software.
Flexibility: Despite the complexity, these databases provide flexibility by allowing
integration of various systems.
Example:
A multinational corporation might have one database running on MySQL for
internal use and another running on Oracle for customer data. Middleware is
needed to allow these databases to communicate.
3. Cloud-based Databases
Cloud-based databases are databases hosted on cloud computing platforms like
Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP). The data
is stored in data centers managed by cloud providers, and users can access the data over
the internet.
Characteristics:
Scalability: Cloud databases can easily scale up or down based on demand. You
can add or reduce resources as needed, offering flexibility.
Cost-effective: Organizations pay for only the resources they use (pay-as-you-go
model), which can lower costs.
High Availability: Cloud databases often come with built-in redundancy and
failover mechanisms to ensure that the system remains available even in case of
hardware failures.
Security: Cloud providers offer security measures such as encryption, access
control, and firewalls to protect data.
Accessibility: Since data is stored in the cloud, it can be accessed from
anywhere, provided there is an internet connection.
Types of Cloud Databases:
Relational Cloud Databases: Examples include Amazon RDS (Relational
Database Service), Google Cloud SQL, etc.
NoSQL Cloud Databases: Examples include Amazon DynamoDB, Google Cloud
Firestore, etc.
Hybrid Cloud Databases: Databases that combine on-premises and cloud
storage (e.g., Microsoft Azure Hybrid Cloud).
Example:
Amazon RDS allows companies to run a fully-managed relational database in the
cloud without worrying about the underlying infrastructure.
***End***