
UNIT-4

File Organization:
In a database, a file is organized as a sequence of records, which are mapped onto disk
blocks. Since records vary in size, different techniques are required to manage their
storage effectively. This is important for ensuring efficient file operations like inserting,
deleting, and retrieving records.
1. Fixed-Length Records
With fixed-length records, every record has the same size. For example, in a bank
database, each account record (account number, branch name, balance) may be 40
bytes long. Records are stored sequentially in disk blocks, with each block holding
multiple records.
type deposit = record
    account-number char(10);
    branch-name char(22);
    balance numeric(12,2);
end

File containing account records


Challenges of Fixed-Length Records:
1. Deletion Problem: When a record is deleted, it leaves an empty space. To handle
this:
o You can shift all the following records forward, which is inefficient.
Record 2 deleted and all records moved
o Another approach is to replace the deleted record with the last record in
the file, or leave it empty until a new record is inserted.

Record 2 deleted and final record moved


2. Block Boundaries: If the record size doesn’t fit exactly into a block, part of the
record may end up in the next block, requiring extra disk access, which slows
down retrieval.
To handle deletion efficiently, a free list can be used. A free list is a linked list of deleted
records that points to available space. When a new record is inserted, the system checks
the free list for available space to reuse.

File with Free list after deletion of records 1,4 and 6
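The free-list bookkeeping described above can be sketched in Python. This is an illustrative toy, not from the text: deleted slots store the index of the next free slot, and the class and variable names (`FixedLengthFile`, `free_head`) are hypothetical.

```python
class FixedLengthFile:
    """Toy file of fixed-length record slots with a free list of deleted slots."""
    def __init__(self):
        self.slots = []        # each entry: a record, or the index of the next free slot
        self.free_head = None  # index of the first free slot, or None if no free slot

    def insert(self, record):
        if self.free_head is not None:          # reuse a deleted slot first
            i = self.free_head
            self.free_head = self.slots[i]      # next free slot (or None)
            self.slots[i] = record
        else:                                   # no free slot: append at the end
            i = len(self.slots)
            self.slots.append(record)
        return i

    def delete(self, i):
        # the deleted slot points to the previous head of the free list
        self.slots[i] = self.free_head
        self.free_head = i

f = FixedLengthFile()
for acct in ["A-101", "A-102", "A-103", "A-104"]:
    f.insert(acct)
f.delete(1)                 # slot 1 joins the free list
f.delete(3)                 # slot 3 becomes the new head
print(f.insert("A-200"))    # reuses slot 3
print(f.insert("A-201"))    # reuses slot 1
```

After the two deletions, new inserts reuse the freed slots in last-deleted-first order, so no records need to be shifted.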


2. Variable-Length Records
Variable-length records arise when records don’t have a fixed size, such as when fields
allow varying lengths or repeating fields. Managing these records is more complex
because deleted records might leave gaps that are too small for new records.
A common approach for variable-length records is the slotted-page structure:
 Each block has a header with information about the number of records, the end of
free space, and an array of pointers to the records.
 Records are stored starting from the end of the block, and free space is kept at the
beginning. When records are deleted, the free space is compacted, and pointers
are updated accordingly.
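The slotted-page idea can be sketched as follows. This is a minimal toy (byte offsets, no per-record headers); the class name and 64-byte page size are assumptions for illustration:

```python
class SlottedPage:
    """Toy slotted page: records grow from the end of the block; a slot array
    at the header maps slot ids to (offset, length) of each record."""
    def __init__(self, size=64):
        self.data = bytearray(size)
        self.free_end = size   # records occupy data[free_end:]; free space before it
        self.slots = []        # per record: (offset, length), or None if deleted

    def insert(self, record: bytes):
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1

    def get(self, slot_id):
        off, n = self.slots[slot_id]
        return bytes(self.data[off:off + n])

    def delete(self, slot_id):
        off, n = self.slots[slot_id]
        # compact: slide records stored before `off` toward the end by n bytes
        self.data[self.free_end + n : off + n] = self.data[self.free_end : off]
        self.slots[slot_id] = None
        for i, s in enumerate(self.slots):   # update pointers of moved records
            if s is not None and s[0] < off:
                self.slots[i] = (s[0] + n, s[1])
        self.free_end += n

p = SlottedPage()
a = p.insert(b"rec-A")
b = p.insert(b"record-B")
p.delete(a)                  # free space is compacted, slot array updated
print(p.get(b))              # record B is still readable after compaction
```

Note how only the slot array is consulted from outside; records can move within the page without invalidating their slot ids.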

3. Managing Large Objects


For large objects like images or videos, which are too big to fit in a block, databases often
store them separately, using structures like B+-trees to manage and retrieve these large
files efficiently.

Organization of Records in Files


In database systems, records must be organized in files for efficient data retrieval,
insertion, and deletion. The choice of record organization method affects performance,
and several techniques are available:
1. Heap File Organization
In heap file organization, records are stored randomly wherever there is space available.
There is no specific order, making this the simplest method to implement. While insertion
is fast, searching can be slow since records aren't ordered.
2. Sequential File Organization
In sequential file organization, records are stored in a specific, sorted order based on a
search key (e.g., customer ID or branch name). This ordering allows for efficient retrieval
in sorted order, which is useful for queries and certain algorithms. However, maintaining
sequential order during insertions and deletions can be costly.
Challenges:
 Insertions may require shifting records (to keep them in sorted order) or adding them
to an overflow block. When a record cannot be inserted in the right place without
moving many records, it is placed in an overflow area instead. This reduces the cost
of insertions, but it makes future searches more complicated, as the system now
has to check both the main file and the overflow blocks.
 Deletions can leave gaps that require reorganization of the file.
 As the file grows, a reorganization may be needed to restore physical order, which
can be expensive.
3. Hashing File Organization
In hashing file organization, a hash function is applied to an attribute (such as a search
key), and the result determines where the record is stored in the file. This method allows
fast access, but it does not support range queries well (a range query retrieves all records
whose key falls within a specific range of values, for example, customer IDs between
1000 and 2000), nor sequential processing (reading records in a specific order, e.g.,
sorted by customer ID).
4. Multitable Clustering File Organization
In multitable clustering, records from different relations (tables) are stored together in
the same block. This method is particularly useful for queries that frequently perform
joins between tables. By storing related records together, the system can fetch all the
relevant data with a single block read, improving efficiency in specific types of queries,
such as join queries.
However, this approach may slow down other queries. For example, queries that only
need data from one table might require more block accesses because records from
multiple tables are mixed in the same blocks.

If you frequently join Employee and Department tables (e.g., to display employee names
along with their department names), multitable clustering might store the records for
Alice (EmployeeID 101, DepartmentID 1) and the HR department (DepartmentID 1,
DepartmentName HR) in the same disk block. When a query requests employees and
their departments, it can retrieve both pieces of information with a single block read.
However, if another query just wants to retrieve employee data without department
information, it may still need to read the same block, even though the department data is
not required, leading to unnecessary block accesses.

Indexing :
Indexing is a technique used to speed up the retrieval of data in databases by minimizing
the number of disk accesses. It is similar to the index found in a textbook. Indexes are
built using specific database fields and consist of two parts: the Search Key and the Data
Reference (Pointer). The Search Key contains the actual values (e.g., IDs, names), and
the Pointer holds the address where the corresponding data is stored on the disk.

There are primarily three methods of indexing:


1.Clustered or Primary Indexing: Data is sorted based on the search key, and similar
records are grouped together. It's efficient for range searches and joins.
Students studying each semester, for example, are grouped together. First-semester
students, second-semester students, third-semester students, and so on are
categorized.

Index Sequential or Sequential File Organization or Ordered Index File:


Primary indexing is closely associated with sequential file organization, where records
are stored in a sequence determined by the primary key. This method is particularly
effective for read operations, as data can be read in the order it is stored.
a.Dense Index: An index entry is created for every record in the database. This ensures
faster searches but uses more space.

b.Sparse Index: Only some records have index entries. Searching takes longer since not
every record is indexed, but it saves space.

Sequential file organization works efficiently for range-based queries, where records are
fetched in a continuous manner. Sequential indexing helps in these cases by:
 Reducing the number of disk accesses.
 Speeding up range queries (e.g., retrieving all records from Student ID 1001 to
1050).
 Sorting data in sequential order to allow sequential scanning for queries.
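The dense/sparse distinction can be illustrated with a small sketch. This is a toy example, not from the text: the student IDs, the three-record block size, and the helper names are assumptions.

```python
import bisect

# A sorted file of student records, split into blocks of 3 (hypothetical data).
blocks = [[(1001, "A"), (1002, "B"), (1003, "C")],
          [(1010, "D"), (1020, "E"), (1030, "F")],
          [(1050, "G"), (1060, "H"), (1070, "I")]]

# Dense index: one entry per record -> (key, (block, slot)). Larger but direct.
dense = [(k, (b, s)) for b, blk in enumerate(blocks)
                     for s, (k, _) in enumerate(blk)]

# Sparse index: one entry per block (the first key of each block). Smaller,
# but a lookup must finish with a sequential scan inside the block.
sparse = [blk[0][0] for blk in blocks]

def sparse_lookup(key):
    # find the last block whose first key <= key, then scan it sequentially
    b = bisect.bisect_right(sparse, key) - 1
    for k, v in blocks[b]:
        if k == key:
            return v
    return None

print(sparse_lookup(1020))
```

The sparse index here holds 3 entries instead of 9, trading a short in-block scan for much less index space; a range query (e.g., IDs 1001 to 1050) can simply scan blocks sequentially from the first match.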

2.Non-clustered or Secondary Indexing: The data is stored in one place, and the index
has pointers to where the actual data is. This is helpful for quick lookups but adds an extra
step to retrieve the data.
3.Multilevel Indexing: When the database grows too large, multilevel indexing breaks the
index into smaller levels, making it easier to store and search.

Advantages of Indexing:
 Faster Query Performance: Data retrieval is quicker because the index acts like a
shortcut to finding records.
 Efficient Data Access: Less disk space is read, improving overall access time.
 Improved Sorting: Sorting is faster as indexed columns are already ordered.
 Consistent Performance: The performance remains stable even as the database
grows in size.
Disadvantages of Indexing:
 Storage Overhead: Indexes require extra storage space.
 Slower Insert/Update Operations: Whenever data is added or modified, the index
must be updated, which can slow down these operations.
 Increased Maintenance: Indexes need regular updating and management to
ensure they work efficiently.
B+tree index:
A B+ tree is a balanced multiway (n-ary) search tree that serves as a multi-level index. It
is widely used in databases and file systems to support efficient searches, insertions, and
deletions. In a B+ tree, all leaf nodes are at the same level, and they contain the actual
data pointers, while internal nodes act as index points.

(Index entry: search key + pointer to the record)

Structure of a B+ Tree:
 Root: The topmost node that acts as the starting point for any search or operation.
 Internal Nodes: Contain keys and child pointers. These guide the search process
by comparing the search key with the stored keys and directing the search to the
appropriate child.
 Leaf Nodes: Contain both keys and associated data (or pointers to data records).
They are linked in a sequential manner using a linked list, enabling efficient
sequential access in addition to random access.
 Order of the Tree: The order (n) of the B+ tree defines the maximum number of
children a node can have. Every internal node except the root must have at least
⌈n/2⌉ children.
Operations in a B+ Tree:
1.Searching: The search process begins at the root node and traverses down through the
internal nodes by comparing keys, until reaching the leaf nodes. If the search key
matches a key in a leaf node, the corresponding record is found.

Ex: Suppose we have to search for 55 in the B+ tree structure below. First, we look in the
intermediary node, which directs us to the leaf node that can contain a record for 55.

In the intermediary node, we find a branch between the keys 50 and 75. Following it, we
are redirected to the third leaf node, where the DBMS performs a sequential search to
find 55.
2.Insertion: When a new key is inserted, it is placed in the correct position in a leaf node.
If the leaf node is full, it is split, and the median key is moved up to the parent node. If the
parent node is full, the splitting process propagates upward, and if the root is split, a new
root is created.
Ex: Suppose we want to insert a record 60 in the below structure. It will go to the 3rd leaf node after
55. It is a balanced tree, and a leaf node of this tree is already full, so we cannot insert 60 there.

In this case, we have to split the leaf node, so that it can be inserted into tree without a ecting the
fill factor, balance and order.

The 3rd leaf node has the values (50, 55, 60, 65, 70), and the key directing to it in its parent node is 50. We will split the
leaf node of the tree in the middle so that its balance is not altered. So we can group (50, 55) and
(60, 65, 70) into 2 leaf nodes.

If these two are to be leaf nodes, the intermediate node cannot branch only at 50. It should have 60
added to it, and then we can have a pointer to the new leaf node.

This is how we can insert an entry when there is overflow. In a normal scenario, it is very easy to find
the node where it fits and then place it in that leaf node.
3.Deletion: The deletion of a key occurs at the leaf node. If the node still has sufficient
keys, the process ends. If not, the keys may be merged or redistributed with a sibling
node. If the root has only one child after merging, that child becomes the new root.
Ex: Suppose we want to delete 60 from the above example. In this case, we have to remove 60 from
the intermediate node as well as from the 4th leaf node too. If we remove it from the intermediate
node, then the tree will not satisfy the rule of the B+ tree. So we need to modify it to have a balanced
tree.

After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as follows:
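The search procedure described in step 1 can be sketched as follows. The tree's figure is not reproduced here, so the node contents below are an assumed shape consistent with the worked example (a root separating at keys 30, 50, and 75); the dict-based representation is purely illustrative.

```python
# Hand-built toy B+ tree: internal nodes hold keys and child pointers,
# leaves hold the keys (standing in for data pointers).
leaf1 = {"leaf": True, "keys": [5, 15, 25]}
leaf2 = {"leaf": True, "keys": [30, 40, 45]}
leaf3 = {"leaf": True, "keys": [50, 55, 65, 70]}
leaf4 = {"leaf": True, "keys": [75, 80, 90]}
root = {"keys": [30, 50, 75], "children": [leaf1, leaf2, leaf3, leaf4]}

def bplus_search(node, key):
    # descend from the root, comparing against keys at each internal node
    while not node.get("leaf"):
        i = 0
        while i < len(node["keys"]) and key >= node["keys"][i]:
            i += 1                      # take the branch covering `key`
        node = node["children"][i]
    return key in node["keys"]          # sequential scan within the leaf

print(bplus_search(root, 55))   # branch between 50 and 75 -> third leaf
```

For 55 the search takes the branch between keys 50 and 75 into the third leaf and finds it there, mirroring the worked example above.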

Characteristics of B+ Tree:
 Height-balanced: All leaf nodes are at the same level, maintaining a balanced
structure for efficient searching.
 Efficient Sequential Access: Leaf nodes are linked, allowing fast sequential
traversal, in addition to random access.
 Disk-friendly: The structure of the B+ tree minimizes the number of disk accesses
by organizing data into blocks.
Applications:
 Database Indexing: B+ trees are widely used in databases for implementing
indexes, where the keys represent indexed columns, and the leaf nodes point to
actual records or disk blocks.
 File Systems: B+ trees are used to organize directory structures in many file
systems due to their efficiency in handling large amounts of data.

HASHING:
Hashing in databases is a method used to quickly find or store data. When a database
grows large, it can become slow to find a specific piece of data. Hashing helps speed this
up by using a hash function to determine where data should be placed.
Hash Table: A table where data is stored in "buckets."

Hash Function: A mathematical formula that takes the unique data (like a user ID) and
calculates where it should be stored in the table (the bucket).
Bucket: A storage location where data is stored.
Example of Hashing:
Assume you have a list of employee IDs: [1001, 1020, 2034, 4567], and you want to store
these in a database using hashing.
 Choose a hash function: hash(key) = key mod 10
 Compute the hash values:
o For 1001: 1001 mod 10 = 1
o For 1020: 1020 mod 10 = 0
o For 2034: 2034 mod 10 = 4
o For 4567: 4567 mod 10 = 7
 Store each record in the bucket corresponding to its hash value:
o 1001 in bucket 1
o 1020 in bucket 0
o 2034 in bucket 4
o 4567 in bucket 7
Buckets 2, 3, 5, 6, 8, and 9 store no records.
When searching for a record, the same hash function will be applied to the search key,
allowing the system to quickly locate the bucket and then retrieve the record.
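The example above translates directly into code; this short sketch just replays the `key mod 10` computation with the same employee IDs:

```python
def hash_fn(key):
    return key % 10          # hash(key) = key mod 10, as in the example

buckets = {i: [] for i in range(10)}
for emp_id in [1001, 1020, 2034, 4567]:
    buckets[hash_fn(emp_id)].append(emp_id)

# a search applies the same function to jump straight to the right bucket
print(buckets[1], buckets[0], buckets[4], buckets[7])
print(2034 in buckets[hash_fn(2034)])
```

Buckets 2, 3, 5, 6, 8, and 9 remain empty, exactly as in the worked example.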

Collision
Sometimes, two different data entries (like two different IDs) might end up in the same
bucket. This is called a collision. To handle collisions, databases use techniques like:
 Chaining: Storing multiple entries in the same bucket using a linked list.
 Open Addressing: Looking for the next empty bucket to store the data.
Types of Hashing:
There are two types of hashing methods used:
A. Static Hash Function:
 The number of buckets is fixed.

 The same hash function is always used.


 If you know the size of the database in advance, static hashing works well.
Problem: If the data size grows, the fixed number of buckets might not be enough,
causing collisions.
Example:
Assume we have a hash table with 4 buckets and use a hash function that computes the
bucket index based on the key modulo the number of buckets, i.e., hash(key) % 4.
Initial Setup:
Number of Buckets = 4

Hash Function = key % 4


1. Initial Data (Small Dataset):
Let's say we insert keys 3, 7, 12, and 18 into this hash table. Using the hash function (key
% 4):
3 % 4 = 3 → goes to Bucket 3
7 % 4 = 3 → goes to Bucket 3 (collision with key 3)
12 % 4 = 0 → goes to Bucket 0
18 % 4 = 2 → goes to Bucket 2

Here, Bucket 3 already has a collision because both 3 and 7 map to the same bucket. To
handle the collision, techniques like chaining or open addressing are used.
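Chaining resolves the bucket-3 collision above by letting the bucket hold a list of keys. A minimal sketch (the `insert`/`search` helper names are illustrative):

```python
from collections import defaultdict

NUM_BUCKETS = 4
table = defaultdict(list)   # chaining: each bucket holds a list of colliding keys

def insert(key):
    table[key % NUM_BUCKETS].append(key)

def search(key):
    # only the one chain for this bucket is scanned, not the whole table
    return key in table[key % NUM_BUCKETS]

for k in [3, 7, 12, 18]:
    insert(k)

print(table[3])             # both 3 and 7 chained in bucket 3
print(search(7), search(5))
```

As the chains grow, lookups degrade toward a linear scan of the chain, which is the "decreased performance" disadvantage noted below.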
Advantages:

 Simplicity: Static hashing is easier to implement and understand due to its fixed
size.
 Fast Access: Search and insert operations often take constant time, O(1).
 Low Overhead: No need for resizing, reducing processing and memory overhead.
 Predictable Performance: Consistent performance for fixed-size data sets.
 Concurrency-Friendly: Easier to manage in multi-threaded environments since
no resizing is required.
Disadvantages:

 Overflows: If a bucket is full or has too many entries, overflow occurs, and
additional storage mechanisms (like chaining) are needed.
 Decreased Performance: As collisions increase, the time required to insert and
search for keys also increases, leading to slower performance.
 Fixed Size: If the dataset grows beyond what was initially anticipated, the fixed
number of buckets can't handle the load efficiently.

B. Dynamic Hash Function:

 The number of buckets can grow or shrink as needed.

 When there are too many collisions (too many records in one bucket), new
buckets are created to store data.

 Flexible Hash Function: The hash function adjusts based on the number of
records.
Example of Dynamic Hashing:
Imagine that you have 4 buckets (numbered 00, 01, 10, and 11 in binary). As new data
comes in, the hash function uses 2 bits to decide which bucket to use. If one bucket
overflows, the system can create new buckets and adjust the hash function so that data
can spread out evenly.
Why Use Hashing?
 Hashing is faster than traditional searching methods like indexing.
 Static hashing is good when the size of the data is predictable.
 Dynamic hashing is better when the data size is unpredictable because it can
expand or shrink as needed.
Extendible Hashing is a dynamic hashing technique. It allows
the system to grow the number of buckets dynamically when one of the buckets
overflows. The idea is to "split" the overflowing bucket and adjust the number of bits used
by the hash function, so that data can be redistributed more evenly across the new and
old buckets.
Key Concepts in Extendible Hashing:
1. Global Depth: The number of bits of the hash value used to index into the
directory (which points to buckets); the directory has 2^(global depth) entries. It
starts small and increases as more buckets are added.
2. Local Depth: Each bucket has its own local depth, which records how many bits of
the hash value that bucket differentiates between. It is always at most the global
depth.
3. Bucket Splitting: When a bucket overflows, it is split into two new buckets, and
the directory may be updated to reflect the split.
Step-by-Step Example:
Initial Setup:

 Global Depth starts at 2 (k = 2), meaning we are using 2 bits from the hash value
to index into the directory.
 There are 4 buckets (00, 01, 10, 11) corresponding to the 2-bit combinations.

At this point, the hash function uses the last 2 bits of each key's hash to determine which
bucket to put the data into. For example:
 Key 5 might hash to binary 01, so it goes into Bucket 01.
 Key 8 might hash to binary 00, so it goes into Bucket 00.
Adding Keys
Let’s add some keys to the hash table:
 Key 5 hashes to 01 → goes into Bucket 01.
 Key 8 hashes to 00 → goes into Bucket 00.
 Key 12 hashes to 00 → also goes into Bucket 00.
 Key 7 hashes to 11 → goes into Bucket 11.
At this point, the buckets look like this:
Bucket Overflow
Now, we add Key 20, which also hashes to 00. Since Bucket 00 already has Key 8 and
Key 12, it overflows.
When this happens:
1. The system checks the Local Depth of Bucket 00. In this case, its local depth is 2
because it’s using 2 bits to differentiate between keys.
2. Since the bucket is full, the system performs a bucket split.
3. After the split, the directory is updated to reflect the change, and the Global
Depth might increase if needed.
Splitting the Bucket
In this example, Bucket 00 is split into two new buckets:
 Bucket 000 for keys whose hashes end in 000.
 Bucket 100 for keys whose hashes end in 100.
After splitting:
 Key 8 (which hashes to 000) goes into Bucket 000.
 Keys 12 and 20 (which hash to 100) go into Bucket 100.
At this point, the Global Depth increases to 3 (k = 3), and the directory grows to
accommodate the new buckets.
The system now looks like this:
After Splitting
 Bucket 000 contains Key 8.
 Bucket 100 contains Keys 12 and 20.
 Bucket 01, Bucket 10, and others remain unchanged.
The Global Depth increased from 2 to 3 to accommodate the additional bucket split. The
system dynamically adjusted the number of buckets and the directory to ensure even
data distribution.
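The whole step-by-step example can be replayed in code. The sketch below is a toy implementation, not a production design: it indexes buckets by the last `global_depth` bits of the key, uses a bucket capacity of 2 (so Bucket 00 overflows on the third key, as in the example), and the class names are assumptions.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    """Toy extendible hashing on the last `global_depth` bits of the key."""
    BUCKET_CAPACITY = 2

    def __init__(self, global_depth=2):
        self.global_depth = global_depth
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def _index(self, key):
        return key & ((1 << self.global_depth) - 1)   # last global_depth bits

    def insert(self, key):
        b = self.directory[self._index(key)]
        if len(b.keys) < self.BUCKET_CAPACITY:
            b.keys.append(key)
            return
        # overflow: split this bucket
        if b.local_depth == self.global_depth:        # directory must double
            self.global_depth += 1
            self.directory = self.directory * 2
        b.local_depth += 1
        new_b = Bucket(b.local_depth)
        # repoint directory entries whose new distinguishing bit is 1
        for i in range(len(self.directory)):
            if self.directory[i] is b and (i >> (b.local_depth - 1)) & 1:
                self.directory[i] = new_b
        # redistribute the old bucket's keys (plus the new key), retrying inserts
        old_keys, b.keys = b.keys + [key], []
        for k in old_keys:
            self.insert(k)

h = ExtendibleHash(global_depth=2)
for k in [5, 8, 12, 7]:
    h.insert(k)
h.insert(20)            # bucket 00 overflows -> split into 000 and 100
print(h.global_depth)   # grew from 2 to 3
```

After inserting 20, bucket 000 holds key 8 while bucket 100 holds keys 12 and 20, matching the example; only the overflowing bucket was split, and the other buckets were untouched.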
Uses of Extendible Hashing:
1. Scalability: Extendible hashing automatically adjusts the number of buckets
when needed, making it ideal for unpredictable data sizes.
2. Efficiency: By redistributing data and only splitting the affected buckets,
extendible hashing minimizes the overhead of maintaining the hash table.
3. Even Data Distribution: It ensures that data is spread out evenly across the
buckets, reducing the likelihood of too many collisions in one bucket.
This dynamic nature is what makes extendible hashing highly effective for handling data
that grows or changes unpredictably.

COMPARISON OF INDEXING AND HASHING:
 Indexing keeps entries ordered on the search key; hashing places records by a hash
function.
 Indexing supports range queries and sequential (ordered) access; hashing does not.
 Hashing offers near-constant-time point lookups; an index lookup may traverse
several levels.
 Indexing (and dynamic hashing) suits data whose size grows unpredictably; static
hashing works best when the data size is known in advance.


Database System Architecture:
Centralized systems:
Centralized systems are computer systems where all processing is done on a single
machine, and they consist of a few components working together. Here's an explanation
of the architecture and key concepts related to centralized systems:
1.Basic Structure of Centralized Systems:
A modern centralized system typically consists of:
CPUs: One or a few central processing units (CPUs) handle computations.
Device Controllers: These manage specific devices like disk drives or video displays.
Shared Memory: The CPUs and device controllers are connected through a common
bus that allows them to access shared memory.
Cache Memory: CPUs have their own cache memory, which stores frequently accessed
data locally to reduce the need to access the slower shared memory.
2. Single-User vs. Multiuser Systems:
 Single-User Systems:
o These are personal computers (PCs) or workstations, typically used by one
person at a time, with a single CPU and one or two hard disks.
o They lack advanced features like concurrency control (managing multiple
users’ database operations) since only one user updates data.
o They often have simple crash recovery mechanisms, such as manual
backups before updates.
 Multiuser Systems:
o These systems support multiple users connected via terminals and have
more CPUs, disks, and memory.
o They require complex features like concurrency control to handle
simultaneous updates from multiple users.
o Multiuser systems also offer advanced transactional features like crash
recovery, ensuring consistency even in the event of a failure.
3. Coarse-Granularity Parallelism:
 Modern general-purpose systems may have multiple processors (typically 2–4
CPUs), sharing a common memory space.
 In these systems, each query is run on a single processor, but multiple queries
can be run concurrently on different processors. This improves the overall
throughput (more transactions per second), but doesn’t necessarily speed up
individual transactions.
 These systems support multitasking, meaning multiple processes run on the
same processor in a time-shared manner, giving the appearance of parallelism.
4. Fine-Granularity Parallelism:
 In contrast, systems with fine-granularity parallelism have many processors
(possibly dozens or more) working together.
 Database systems running on such machines try to parallelize individual tasks
like queries. Instead of running one query on one processor, they break it into
smaller parts and run those parts simultaneously on different processors.
Client Server Systems
Client-Server Systems represent a distributed computing model where tasks or
services are divided between two types of systems: the client and the server.
Key Concepts:
1. Shift from Centralized Systems:
o Traditionally, centralized systems performed all the tasks on a single
machine. However, with personal computers becoming more powerful, the
centralized architecture was replaced by a client-server model, where
personal computers (clients) handle user interfaces and interact with
central servers for data processing.
2. Division of Functionality:
o In a client-server database system, functionality is divided into two parts:
 Front End (Client): Manages user interactions, such as using tools
like SQL interfaces, report generators, or forms.
 Back End (Server): Handles core tasks like query processing,
access control, concurrency, and recovery.
3. SQL Interface:
o The communication between the front end and the back end typically
happens through SQL or an application program. Standards like ODBC
(Open Database Connectivity) and JDBC (Java Database Connectivity)
allow clients to connect to any database server that supports these
standards.
4. Specialized Applications:
o Applications like spreadsheets and statistical analysis tools use the
client-server architecture to access backend data for specific tasks. These
applications act as custom front ends to retrieve or manipulate data from
the server.
5. Three-Tier Architecture:
o In larger systems, a three-tier architecture is used:
 Tier 1 (Client): The front end, typically a web browser.

 Tier 2 (Application Server): Acts as the intermediary, processing


the business logic and sending requests to the database.
 Tier 3 (Database Server): The back end where data is stored and
managed.
6. Transactional Remote Procedure Calls:
o Some systems use a transactional remote procedure call (RPC), which
lets the client call procedures on the server as if they were local to the
client. These RPCs are enclosed within a single transaction, ensuring that
if the transaction fails, the server can roll back all the changes made by the
individual calls.
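The front-end/back-end split over a standard connectivity API can be sketched with Python's DB-API. Note the assumption: `sqlite3` here is an in-process stand-in for a networked database server, but the pattern (connect, ship SQL, commit the transaction, fetch results) mirrors what an ODBC or JDBC client does against a real server; the table and values are invented for illustration.

```python
import sqlite3

# Front end (client): open a connection and send SQL to the back end.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE account (account_number TEXT, branch_name TEXT, balance NUMERIC)")
cur.execute("INSERT INTO account VALUES (?, ?, ?)", ("A-101", "Downtown", 500))
conn.commit()      # the client controls the transaction boundary

# The back end executes the query; the client fetches and displays the result.
cur.execute("SELECT balance FROM account WHERE account_number = ?", ("A-101",))
row = cur.fetchone()
print(row[0])
conn.close()
```

Parameterized queries (`?` placeholders) let the client pass values without string concatenation, which is the idiomatic way such APIs separate the SQL shipped to the server from user-supplied data.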

Server System Architecture


Server System Architectures are classified into two main types: Transaction-Server
Systems and Data-Server Systems.
1. Transaction-Server Systems (also known as Query-Server Systems):
 Functionality: These servers receive requests from clients to perform actions,
such as executing queries. The server processes the request and sends the results
back to the client.
 Operation: Clients typically send entire transactions to the server, which are
executed there. The results are returned to the client for display.
 Request Specification: Clients can specify requests through SQL or through a
specialized application program interface (API).
 Common Usage: This architecture is widely used because it efficiently handles
transaction requests and provides quick responses to client queries.
2. Data-Server Systems:
 Functionality: These systems allow clients to request to read or update data in
units like files or pages.
 File Servers: Provide a basic file-system interface where clients can perform file
operations like creating, reading, updating, or deleting files.
 Database Data Servers: Offer more advanced functionality, allowing interaction
with smaller data units such as pages, tuples, or objects. They also provide
indexing for efficient data retrieval and transaction support to ensure data
consistency even in case of failures.
Comparison:
 Transaction-Server Architecture is the more widely used approach since it is
efficient in handling queries and transactions between clients and servers.
 Data-Server Systems are often used for lower-level operations, such as file
handling, and provide additional transactional safeguards to maintain data
integrity.

Distributed Databases:
A distributed database is a collection of data that is spread across multiple physical
locations, often across different computers, networks, or even geographical areas.
Despite the distribution, the system is managed as a single cohesive unit, allowing users
to access and manage the data as though it were all stored in one place.
Distributed databases can be categorized into Homogeneous and Heterogeneous
databases.
Additionally, cloud-based databases are another form of distributed databases where
data is hosted in the cloud.
1. Homogeneous Distributed Databases
In a homogeneous distributed database, all the individual databases across different
locations are of the same type. This means they use the same software, schema, and
data structures. These databases are easier to manage because they follow the same
rules, processes, and architecture.
Characteristics:
 Same DBMS: All sites use the same database management system (DBMS),
making them compatible.
 Uniform Schema: The data structure, schema, and data format are consistent
across all locations.
 Easy Communication: Since all nodes follow the same architecture,
communication between databases is seamless and efficient.
 Centralized Management: All databases are managed centrally, and any change
in one location can easily be replicated to others.
 Simplicity: Because of uniformity, homogeneous databases are simple to
implement and maintain.
Example:
 A company might have databases in different offices (New York, London, Tokyo),
all using Oracle DBMS with the same schema and structure.
2. Heterogeneous Distributed Databases
In a heterogeneous distributed database, the individual databases may use different
DBMS, schema, or data structures. These systems must use middleware or translation
layers to enable communication between databases that are not directly compatible.
Characteristics:
 Different DBMS: The different sites use different DBMS (e.g., one site uses Oracle,
another uses MySQL, etc.).
 Different Schema: The schema or data structure may differ between locations.
 Middleware Requirement: Since the systems are not inherently compatible, a
middleware layer is required to enable data sharing and communication between
different databases.
 Complexity: Heterogeneous systems are more complex to manage due to
differences in data structures, query languages, and DBMS software.
 Flexibility: Despite the complexity, these databases provide flexibility by allowing
integration of various systems.
Example:
 A multinational corporation might have one database running on MySQL for
internal use and another running on Oracle for customer data. Middleware is
needed to allow these databases to communicate.
3. Cloud-based Databases
Cloud-based databases are databases hosted on cloud computing platforms like
Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP). The data
is stored in data centers managed by cloud providers, and users can access the data over
the internet.
Characteristics:
 Scalability: Cloud databases can easily scale up or down based on demand. You
can add or reduce resources as needed, offering flexibility.
 Cost-effective: Organizations pay for only the resources they use (pay-as-you-go
model), which can lower costs.
 High Availability: Cloud databases often come with built-in redundancy and
failover mechanisms to ensure that the system remains available even in case of
hardware failures.
 Security: Cloud providers offer security measures such as encryption, access
control, and firewalls to protect data.
 Accessibility: Since data is stored in the cloud, it can be accessed from
anywhere, provided there is an internet connection.
Types of Cloud Databases:
 Relational Cloud Databases: Examples include Amazon RDS (Relational
Database Service), Google Cloud SQL, etc.
 NoSQL Cloud Databases: Examples include Amazon DynamoDB, Google Cloud
Firestore, etc.
 Hybrid Cloud Databases: Databases that combine on-premises and cloud
storage (e.g., Microsoft Azure Hybrid Cloud).
Example:
 Amazon RDS allows companies to run a fully-managed relational database in the
cloud without worrying about the underlying infrastructure.

Concurrency Control in Distributed Databases:


Concurrency control in distributed databases ensures that multiple transactions can
execute simultaneously without conflicting, particularly when dealing with replicated
data.
key concurrency control schemes in distributed databases:
1. Single Lock-Manager Approach:
Concept: A single lock manager at one site handles all lock and unlock requests.
Advantages:
 Simple to implement.
 Deadlock handling is straightforward since all locks are managed at one site.
Disadvantages:
 Bottleneck: All transactions pass through one site, creating potential slowdowns.
 Vulnerability: If the single lock manager fails, the system is unable to function
unless a backup recovery system is in place.
2. Distributed Lock Manager:
Concept: Each site has its own local lock manager, handling lock requests for data
stored at that site.
Advantages:
 Reduces bottlenecks compared to the single lock-manager approach.
Disadvantages:
 Deadlock handling becomes more complicated since locks are distributed across
sites, leading to potential global deadlocks.
3. Primary Copy Protocol:
 Concept: A specific site holds the primary copy of each data item, and locks
must be acquired from this site.
 Advantages:
o Simplifies concurrency control for replicated data.
 Disadvantages:
o If the primary site fails, the data item becomes inaccessible, even if
replicas exist.
4. Majority Protocol:
 Concept: A transaction must obtain locks from more than half of the sites that
store replicas of a data item.
 Advantages:
o Can continue operating even if some sites fail.
 Disadvantages:
o More complex to implement.
o Risk of deadlock even when locking a single data item.
5. Biased Protocol:
 Concept: Read operations are given preferential treatment compared to write
operations. Reads can lock at one site, but writes require locks from all sites.
 Advantages:
o Reduces overhead for read operations, especially when reads are more
frequent than writes.
 Disadvantages:
o Higher overhead for write operations.
o Deadlock handling complexity is similar to the majority protocol.
6. Quorum Consensus Protocol:
 Concept: Assigns a weight to each site storing a data item, with read and write
operations requiring enough locks to meet specific quorums (weights).
 Advantages:
o Flexible configuration to optimize for read or write operations.
o Can continue operation even with site failures.
 Disadvantages:
o Complex implementation and deadlock handling.
7. Timestamping:
 Concept: Each transaction is assigned a unique timestamp to determine the
serialization order.
 Centralized: A single site generates timestamps.
 Distributed: Each site generates its own timestamp, combined with a unique site
identifier.
 Challenge: Synchronizing logical clocks across sites to ensure fairness.
Logical Clocks:
 In a distributed system, logical clocks are used to keep track of the order of
events, rather than relying on the actual physical time.
 They help determine which event happened before another, even though different
sites have their own independent clocks.
Fairness:
 Fairness means ensuring that operations (like transactions or resource access)
occur in the correct and fair order across all sites.
 No site or operation should have an unfair advantage due to clock differences.
8. Replication with Weak Consistency:
Many commercial databases today support replication, which can take one of several
forms.
 Master-Slave Replication: Updates are performed at a primary site, and read-
only transactions can access replicas.
 Multi-Master Replication: Updates can be performed at any site, and propagated
to other replicas later (lazy propagation), though it risks consistency issues.
9. Deadlock Handling:
 Deadlock Prevention: Ensures that deadlocks don’t occur, but may cause
unnecessary waiting or rollback.
 Deadlock Detection: Detects deadlocks after they occur, but requires
maintaining a wait-for graph across sites to identify global deadlocks.
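Among the schemes above, the quorum consensus rule can be checked with a few lines of arithmetic. A minimal sketch, with assumed site weights: correctness requires read quorum + write quorum to exceed the total weight (so every read intersects the latest write) and twice the write quorum to exceed the total (so two writes cannot proceed concurrently).

```python
# Hypothetical sites storing replicas of one data item, with assigned weights.
site_weights = {"S1": 1, "S2": 1, "S3": 2}
total = sum(site_weights.values())           # 4

def valid_quorums(qr, qw, total):
    # reads must overlap writes, and writes must overlap each other
    return qr + qw > total and 2 * qw > total

print(valid_quorums(qr=2, qw=3, total=total))   # valid configuration
print(valid_quorums(qr=1, qw=2, total=total))   # invalid: 2*2 is not > 4

def can_execute(locked_sites, quorum):
    """An operation may proceed once the locked sites' weights reach the quorum."""
    return sum(site_weights[s] for s in locked_sites) >= quorum

print(can_execute({"S1", "S3"}, quorum=3))      # weight 1 + 2 meets a quorum of 3
```

Tuning the quorums trades off read cost against write cost: a small read quorum favors read-heavy workloads (approaching the biased protocol), while balanced quorums tolerate more site failures for writes.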

****End***
