Module 5
sql> SELECT *
     FROM Staff
     WHERE Allowance = 400;
sql> COMMIT;
Contd.
• ROLLBACK in SQL is a transaction control language (TCL) command used to undo transactions that have not yet been saved to the database. The command can only undo changes made since the last COMMIT.
• Example: Consider the following STAFF table with records:
sql> SELECT *
     FROM STAFF
     WHERE ALLOWANCE = 400;
sql> ROLLBACK;
Recoverability in DBMS
• Recoverability is a property of database systems that ensures that, in the event of a failure or error,
the system can recover the database to a consistent state.
• Recoverability guarantees that all committed transactions are durable and that their effects are
permanently stored in the database, while the effects of uncommitted transactions are undone to
maintain data consistency.
• The recoverability property is enforced through the use of transaction logs, which record all
changes made to the database during transaction processing. When a failure occurs, the system
uses the log to recover the database to a consistent state, which involves either undoing the effects
of uncommitted transactions or redoing the effects of committed transactions.
• Schedules in which transactions commit only after all transactions whose changes they read have committed are called recoverable schedules. In other words, if some transaction Tj reads a value updated or written by some other transaction Ti, then the commit of Tj must occur after the commit of Ti.
Contd.
• Consider the following schedule involving two transactions T1 and T2.
T1              T2
R(A)
W(A)
                W(A)
                R(A)
Commit
                Commit
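• To make the notion of recoverability concrete, the following minimal Python sketch (not part of the original material) checks whether a schedule is recoverable: it records which transaction each read "reads from" and then verifies the commit order. The schedule encoding and the function name are assumptions made purely for illustration.

# Recoverability check for a toy schedule representation.
# A schedule is a list of (transaction, action, item) tuples; action is
# "R" (read), "W" (write) or "C" (commit). Hypothetical encoding, not a real DBMS API.
def is_recoverable(schedule):
    last_writer = {}     # data item -> transaction that last wrote it
    reads_from = set()   # (reader, writer) pairs across different transactions
    commit_pos = {}      # transaction -> position of its commit in the schedule

    for pos, (txn, action, item) in enumerate(schedule):
        if action == "W":
            last_writer[item] = txn
        elif action == "R":
            writer = last_writer.get(item)
            if writer is not None and writer != txn:
                reads_from.add((txn, writer))
        elif action == "C":
            commit_pos[txn] = pos

    # Recoverable: every reader commits only after the writer it read from commits.
    for reader, writer in reads_from:
        if writer not in commit_pos:
            return False                       # writer never commits
        if reader in commit_pos and commit_pos[reader] < commit_pos[writer]:
            return False                       # reader committed too early
    return True

# The schedule shown above: T1 reads/writes A, T2 writes then reads A, both commit.
schedule = [
    ("T1", "R", "A"), ("T1", "W", "A"),
    ("T2", "W", "A"), ("T2", "R", "A"),
    ("T1", "C", None), ("T2", "C", None),
]
print(is_recoverable(schedule))                # True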
• Data integrity: the file system can ensure that the data stored in the files is accurate and has not been corrupted.
• File migration: the file system can move files from one location to another without interrupting access to the files.
• Data consistency: changes made to a file by one user are immediately visible to all other users.
• Support for different file types: the file system can support a wide range of file types, including text files, image files, and video files.
File Models
• File models are classified in two ways:
Unstructured and Structured Files
Mutable and Immutable Files
• Immutable Files: In the immutable file model, a file cannot be changed once it has been created; it can only be deleted. To implement file updates, multiple versions of the same file are created: every time the file is updated, a new version is created. Sharing is consistent in this model because only immutable files are shared. Drawbacks of the immutable file model: increased space utilization and increased disk allocation activity.
• Cedar File System (CFS) uses the immutable file model. CFS employs the "keep" parameter to control the number of versions of a file that are retained. When the value of the parameter is 1, creating a new version of the file causes the existing version to be deleted and its disk space to be reused. When the value of the parameter is greater than 1, multiple versions of the file exist. A specific version of a file can be accessed by giving its full name, including the version number. If the version number is not mentioned, CFS uses the lowest version number for operations like "delete" and the highest version number for other operations like "open".
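• The versioning behaviour described above can be illustrated with a minimal Python sketch; the class name, method names, and keep semantics below are simplifying assumptions, not the actual Cedar File System interface.

# Illustrative sketch of the immutable file model with CFS-style versioning.
class ImmutableFileStore:
    def __init__(self, keep=1):
        self.keep = keep                 # how many versions to retain
        self.versions = {}               # file name -> {version number: bytes}

    def write(self, name, data):
        # Updating a file never modifies it in place: a new version is created.
        vers = self.versions.setdefault(name, {})
        new_version = max(vers, default=0) + 1
        vers[new_version] = bytes(data)
        # Enforce the keep parameter: discard the oldest versions.
        while len(vers) > self.keep:
            del vers[min(vers)]
        return new_version

    def open(self, name, version=None):
        # With no explicit version, "open" uses the highest version number.
        vers = self.versions[name]
        return vers[version if version is not None else max(vers)]

    def delete(self, name, version=None):
        # With no explicit version, "delete" removes the lowest version number.
        vers = self.versions[name]
        del vers[version if version is not None else min(vers)]

store = ImmutableFileStore(keep=2)
store.write("report.txt", b"v1 contents")
store.write("report.txt", b"v2 contents")
print(store.open("report.txt"))          # b'v2 contents' (highest version)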
File Accessing Model
• The file-accessing model basically depends on:
The unit of data access/transfer
The method used for accessing remote files
• Based on the unit of data access, the following file access models may be used to access a particular file.
• File-level transfer model: In the file-level transfer model, the whole file is transferred whenever an operation requires the file data to be sent across the distributed computing network between client and server. This model has better scalability and is efficient.
• Block-level transfer model: In the block-level transfer model, file data is transferred between client and server in units of file blocks; thus, the unit of data transfer in the block-level transfer model is the file block. The block-level transfer model may be used in distributed computing environments containing several diskless workstations (a small sketch contrasting the file-level and block-level models follows after this list).
Contd.
• Byte-level transfer model: In the byte-level transfer model, file data is transferred between client and server in units of bytes; thus, the unit of data transfer in the byte-level transfer model is the byte. The main drawback of the byte-level transfer model is the difficulty of cache management due to the variable-length data of different access requests.
• Record-level transfer model: The record-level transfer model may be used with file models in which the file contents are structured as records. In the record-level transfer model, file data is transferred between client and server in units of records; thus, the unit of data transfer in the record-level transfer model is the record.
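• The following minimal Python sketch (illustration only) contrasts the file-level and block-level transfer models; the in-memory "server" dictionary and the 4096-byte block size are assumptions made for the example.

BLOCK_SIZE = 4096                               # assumed block size in bytes

server_files = {"notes.txt": b"x" * 10_000}     # stand-in for remote storage

def fetch_whole_file(name):
    # File-level transfer: the entire file crosses the network in one operation.
    return server_files[name]

def fetch_block(name, block_no):
    # Block-level transfer: only the requested file block crosses the network.
    start = block_no * BLOCK_SIZE
    return server_files[name][start:start + BLOCK_SIZE]

# A client that needs only the start of the file transfers one 4096-byte block
# instead of the whole 10,000-byte file.
print(len(fetch_block("notes.txt", 0)), len(fetch_whole_file("notes.txt")))   # 4096 10000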
Contd.
• A distributed file system may use one of the following models to service a client's file access request when the accessed file is remote:
• Remote service model: Handling of a client's request is performed at the server's node. The client's request for file access is sent across the network as a message to the server, the server machine carries out the access request, and the result is sent back to the client. The number of messages sent and the overhead per message need to be minimized.
o Every remote access is handled across the network, so access is slower.
o Server load and network traffic increase, and performance is degraded.
o Transmitting a series of responses to individual requests leads to higher network overhead.
o To maintain consistency, communication between client and server is needed so that the server's copy stays consistent with the data held by clients.
o The remote service model is better when main memory is small.
o It is simply an extension of the local file system interface across the network.
Contd.
• Data-caching model: This model attempts to reduce the network traffic of the previous model by caching the data obtained from the server node. It exploits the locality found in file accesses. A replacement policy such as LRU is used to keep the cache size bounded (see the sketch after this list).
o Remote accesses can be served locally, so access is faster.
o Network traffic and server load are reduced, which improves scalability.
o Network overhead is lower than in the remote service model when large amounts of data are transferred.
o For maintaining consistency: if writes are infrequent, performance is good; if writes are frequent, performance is poor.
o Caching is better for machines with a disk or a large main memory.
o The lower-level machine interface is different from the upper-level user interface (UI).
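• A minimal Python sketch of such a client-side data cache with LRU replacement is given below; the block-level granularity, the capacity, and the fetch_from_server callback are assumptions made for illustration.

from collections import OrderedDict

class LRUBlockCache:
    def __init__(self, capacity, fetch_from_server):
        self.capacity = capacity                  # max number of cached blocks
        self.fetch_from_server = fetch_from_server
        self.cache = OrderedDict()                # (file, block_no) -> data

    def read_block(self, file_name, block_no):
        key = (file_name, block_no)
        if key in self.cache:                     # cache hit: served locally
            self.cache.move_to_end(key)
            return self.cache[key]
        data = self.fetch_from_server(file_name, block_no)   # cache miss
        self.cache[key] = data
        if len(self.cache) > self.capacity:       # evict the least recently used block
            self.cache.popitem(last=False)
        return data

# Usage with a dummy "server" fetch function.
cache = LRUBlockCache(capacity=2, fetch_from_server=lambda f, b: f"{f}:{b}".encode())
cache.read_block("a.txt", 0)
cache.read_block("a.txt", 1)
cache.read_block("a.txt", 0)      # hit
cache.read_block("a.txt", 2)      # evicts block 1 (least recently used)
print(list(cache.cache.keys()))   # [('a.txt', 0), ('a.txt', 2)]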
Contd.
• Benefit of the data-caching model over the remote service model:
• The data-caching model offers the possibility of increased performance and greater system scalability, since it reduces network traffic, contention for the network, and contention for the file servers. Hence almost all distributed file systems implement some form of caching.
• Example: NFS uses the remote service model but adds caching for better performance.
File caching
• File caching is an important feature of distributed file systems that helps to improve performance
by reducing network traffic and minimizing disk access. In a distributed file system, files are stored
across multiple servers or nodes, and file caching involves temporarily storing frequently accessed
files in memory or on local disks to reduce the need for network access or disk access.
• Client-side caching: In this approach, the client machine stores a local copy of frequently
accessed files. When the file is requested, the client checks if the local copy is up-to-date and, if so,
uses it instead of requesting the file from the server. This reduces network traffic and improves
performance by reducing the need for network access (a minimal sketch of this approach appears after this list).
• Server-side caching: In this approach, the server stores frequently accessed files in memory or on
local disks to reduce the need for disk access. When a file is requested, the server checks if it is in
the cache and, if so, returns it without accessing the disk. This approach can also reduce network
traffic by reducing the need to transfer files over the network.
• Distributed caching: In this approach, the file cache is distributed across multiple servers or
nodes. When a file is requested, the system checks if it is in the cache and, if so, returns it from the
nearest server. This approach reduces network traffic by minimizing the need for data to be transferred across the network.
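• The client-side caching approach above can be sketched as follows in Python; the in-memory "server" dictionary, the use of a modification time as the validity check, and the function name are illustrative assumptions, not any particular DFS client API.

import time

server = {"doc.txt": {"data": b"hello", "mtime": time.time()}}   # stand-in for a remote node

local_cache = {}   # file name -> {"data": ..., "mtime": ...}

def read_file(name):
    entry = local_cache.get(name)
    remote = server[name]
    # Up-to-date local copy: serve it without transferring the file contents again.
    if entry is not None and entry["mtime"] == remote["mtime"]:
        return entry["data"]
    # Stale or missing copy: fetch from the server and refresh the local cache.
    local_cache[name] = {"data": remote["data"], "mtime": remote["mtime"]}
    return remote["data"]

print(read_file("doc.txt"))   # fetched from the server
print(read_file("doc.txt"))   # served from the local cache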
Advantages of file caching
• Advantages of file caching in distributed file systems include:
1. Improved performance: By reducing network traffic and
minimizing disk access, file caching can significantly improve the
performance of distributed file systems.
2. Reduced latency: File caching can reduce latency by allowing files
to be accessed more quickly without the need for network access or
disk access.
3. Better resource utilization: File caching allows frequently
accessed files to be stored in memory or on local disks, reducing the
need for network or disk access and improving resource utilization.
Disadvantages of file caching
1. Increased complexity: File caching can add complexity to distributed
file systems, requiring additional software and hardware to manage
and maintain the cache.
2. Cache consistency issues: Keeping the cache up-to-date can be a
challenge, and inconsistencies between the cache and the actual file
system can occur.
3. Increased memory usage: File caching requires additional memory resources to store frequently accessed files, which can lead to increased memory usage on client machines and servers.
File Replication
• Replication is the practice of keeping several copies of data in different places.
• It is good to have replicas of a node in a network due to the following reasons:
o If a node stops working, the distributed system will still work fine because its replicas are available. Thus it increases the fault tolerance of the system.
o It also helps in load sharing where loads on a server are shared among different
replicas.
o It enhances the availability of the data. If the replicas are created and data is stored
near to the consumers, it would be easier and faster to fetch data.
• Types of Replication
• Active Replication
• Passive Replication
Active Replication:
• The request of the client goes to all the replicas.
• It must be ensured that every replica receives client requests in the same order; otherwise the system becomes inconsistent.
• There is no need for coordination because each copy processes the same request in the
same sequence.
• All replicas respond to the client’s request.
• Advantages:
o It is really simple. The code in active replication is the same at all replicas.
o It is transparent.
o Even if a node fails, it will be easily handled by replicas of that node.
• Disadvantages:
o It increases resource consumption. The greater the number of replicas, the greater the
memory needed.
o It increases the time overhead: any change made on one replica must also be made on all the others.
Passive Replication:
• The client request goes to the primary replica, also called the main replica.
• There are more replicas that act as backup for the primary replica.
• Primary replica informs all other backup replicas about any modification done.
• The response is returned to the client by a primary replica.
• Periodically primary replica sends some signal to backup replicas to let them know that it is
working perfectly fine.
• In case of failure of a primary replica, a backup replica becomes the primary replica.
• Advantages:
o The resource consumption is less as backup servers only come into play when the
primary server fails.
o The time overhead is also lower, as there is no need to update all the replica nodes, unlike in active replication.
• Disadvantages:
o If some failure occurs, the response time is delayed.
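• A minimal Python sketch of passive (primary-backup) replication follows; the class and method names are hypothetical and the failure handling is deliberately simplified.

class Replica:
    def __init__(self, name):
        self.name = name
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value

class PrimaryBackupGroup:
    def __init__(self, replicas):
        self.primary, *self.backups = replicas

    def write(self, key, value):
        # The client request goes only to the primary replica...
        self.primary.apply(key, value)
        # ...which then informs every backup replica of the modification.
        for backup in self.backups:
            backup.apply(key, value)
        return "ok"                              # response returned by the primary

    def fail_over(self):
        # If the primary fails, a backup replica becomes the new primary.
        self.primary, *self.backups = self.backups

group = PrimaryBackupGroup([Replica("r1"), Replica("r2"), Replica("r3")])
group.write("x", 42)
group.fail_over()                                # r2 takes over as primary
print(group.primary.name, group.primary.state)   # r2 {'x': 42}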
Network File System(NFS)
• Network File System (NFS) is defined as a network protocol that is used for
accessing or sharing files over a network. It defines the way in which the files are
stored and retrieved from storage devices across networks.
• Network File System is a distributed file system protocol that helps the user to
access the files over a network.
• Network File System works on a client-server architecture. In this architecture,
the server hosts the file system which will be shared and the client can access those
shared files.
• Network File System makes use of Remote Procedure Calls (RPC) to establish communication between the client and the server. This allows clients to send requests for performing operations on files.
• It uses various security mechanisms to provide controlled access to files. It offers security features like user authentication, file permissions, etc.
• Network File System is designed to be fast and efficient. With the help of various caching mechanisms, it can reduce traffic on the network and improve performance.
Contd.
• Benefits of NFS
❑ The Network File System allows local access to remote files.
❑ It is a very easy-to-use protocol.
❑ It offers great scalability so that a large number of users can connect to a single server.
❑ For new files, there is no need for manual refresh.
❑ It is a reliable protocol which can handle network problems without losing the data.
❑ It offers a variety of security features to protect our network from failure or attacks.
• Limitations of NFS
❑ The setup and configuration of Network File System is complex for users who are not familiar with it.
❑ It depends on Remote Procedure Calls to perform all of its operations.
❑ It is vulnerable to internal threats.
❑ It is difficult to accommodate large numbers of users at one time.
❑ It has no load balancing.
Andrew File System(AFS)
• AFS presents a homogeneous, location-independent file namespace to all client
workstations via a group of trustworthy servers.
• The goal is to facilitate large-scale information exchange by reducing client-server
communication. This is accomplished by moving whole files between server and client
computers and caching them until the servers get a more recent version.
• An AFS uses a local cache to improve speed and minimize effort in dispersed networks.
• Andrew File System Architecture:
• Vice: The Andrew File System provides a homogeneous, location-transparent file
namespace to all client workstations by utilizing a group of trustworthy servers known as
Vice. The Berkeley Software Distribution of the Unix operating system is used on both
clients and servers. Each workstation’s operating system intercepts file system calls and
redirects them to a user-level process on that workstation.
• Venus: This mechanism, known as Venus, caches files from Vice and returns updated
versions of those files to the servers from which they originated. Only when a file is
opened or closed does Venus communicate with Vice; individual bytes of a file are read
and written directly on the cached copy, skipping Venus.
Contd.
• This file system architecture was largely inspired by the need for scalability. To increase the
number of clients a server can service, Venus performs as much work as possible rather than Vice.
Vice only keeps the functionalities that are necessary for the file system’s integrity, availability,
and security. The servers are set up as a loose confederacy with little connectivity between them.
Contd.
• The following are the server and client components used in AFS networks:
o Any computer that creates requests for AFS server files hosted on a network
qualifies as a client.
o The file is saved in the client machine’s local cache and shown to the user once
a server responds and transmits a requested file.
o When a user visits the AFS, the client sends all modifications to the server via a
callback mechanism. The client machine’s local cache stores frequently used
files for rapid access.
o Advantages:
o Shared files that are updated infrequently and local user files remain valid in the cache for long periods.
o It sets up a lot of storage space for caching.
o It offers a big enough working set for all of a user’s files, ensuring that the
file is still in the cache when the user accesses it again.
Hadoop Distributed File System
• HDFS (Hadoop Distributed File System) is the storage system used in a Hadoop cluster.
• It is mainly designed to work on commodity hardware devices (inexpensive devices), following a distributed file system design.
• HDFS is designed in such a way that it prefers storing data in large blocks rather than storing many small data blocks. HDFS in Hadoop provides fault tolerance and high availability to the storage layer and to the other devices present in the Hadoop cluster.
• HDFS is capable of handling large data of high volume, velocity, and variety, which makes Hadoop work more efficiently and reliably, with easy access to all its components.
Contd.
• Some Important Features of HDFS(Hadoop Distributed File System)
❑ It’s easy to access the files stored in HDFS.
❑ HDFS also provides high availability and fault tolerance.
❑ Provides scalability to scale nodes up or down as per our requirement.
❑ Data is stored in a distributed manner, i.e., various DataNodes are responsible for storing the data.
❑ HDFS provides replication, so there is no fear of data loss.
❑ HDFS provides high reliability, as it can store data in the range of petabytes.
❑ HDFS has built-in servers in the NameNode and DataNode that help to easily retrieve cluster information.
❑ Provides high throughput.
Contd.
• Hadoop works on the MapReduce algorithm, which uses a master-slave architecture; HDFS has a NameNode and DataNodes that work in a similar pattern.
1. NameNode (Master)
2. DataNode (Slave)
• NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing metadata, i.e., data about the data. The metadata can include the transaction logs that keep track of the user's activity in the Hadoop cluster.
• DataNode: DataNodes work as slaves. DataNodes are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from one to 500 or even more. The more DataNodes a Hadoop cluster has, the more data can be stored, so it is advised that DataNodes have high storage capacity to hold a large number of file blocks. A DataNode performs operations like creation and deletion of blocks according to the instructions provided by the NameNode.
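• The division of responsibilities between the NameNode (metadata only) and the DataNodes (block storage) can be illustrated with the toy Python sketch below; the round-robin block placement, the tiny block size, and all names are assumptions for illustration, not real HDFS behaviour.

import itertools

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}                       # block id -> bytes

class NameNode:
    def __init__(self, datanodes, replication=3, block_size=4):
        self.datanodes = datanodes
        self.replication = replication
        self.block_size = block_size           # tiny block size for the demo
        self.metadata = {}                     # file name -> [block ids]
        self._ids = itertools.count()
        self._placement = itertools.cycle(datanodes)   # round-robin placement (assumption)

    def write(self, name, data):
        block_ids = []
        for i in range(0, len(data), self.block_size):
            block_id = next(self._ids)
            chunk = data[i:i + self.block_size]
            for _ in range(min(self.replication, len(self.datanodes))):
                next(self._placement).blocks[block_id] = chunk   # replicate the block
            block_ids.append(block_id)
        self.metadata[name] = block_ids        # only metadata stays on the NameNode

dns = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(dns, replication=3)
nn.write("log.txt", b"abcdefgh")
print(nn.metadata["log.txt"])                              # [0, 1]
print(sorted(dn.name for dn in dns if 0 in dn.blocks))     # three DataNodes hold block 0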
MapReduce Architecture
• MapReduce is a programming model used for efficient processing in parallel over large
data-sets in a distributed manner.
• The data is first split and then combined to produce the final result.
• The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, providing less overhead over the cluster network and reducing the processing power required.
• The MapReduce task is mainly divided into two phases Map Phase and Reduce Phase.
• Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the MapReduce for
processing. There can be multiple clients available that continuously send jobs for
processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wants done, which is composed of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
Contd.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after the processing.
Contd.
• The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the map may be a key-value pair, where the key can be the ID of some kind of address and the value is the actual value it holds. The Map() function is executed in its memory repository on each of these input key-value pairs and generates intermediate key-value pairs, which work as input for the Reducer or Reduce() function.
2. Reduce: The intermediate key-value pairs that serve as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key, as per the reducer algorithm written by the developer (a toy word-count sketch follows at the end of this section).
3. Job Tracker: The work of the Job Tracker is to manage all the resources and all the jobs across the cluster, and to schedule each map task on a Task Tracker running on the same data node, since there can be hundreds of data nodes available in the cluster.
4. Task Tracker: The Task Tracker can be considered the actual slave that works on the instructions given by the Job Tracker. A Task Tracker is deployed on each of the nodes available in the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.
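• To tie the Map and Reduce phases together, here is a toy, single-process Python word-count sketch; it only illustrates the Map, shuffle/sort, and Reduce flow and is not the Hadoop API (all function names are hypothetical).

from collections import defaultdict

def map_fn(_, line):
    # Map phase: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce phase: aggregate all values that share the same key.
    return word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase over every input record.
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)   # shuffle: group by key
    # Reduce phase over every intermediate key, in sorted key order.
    return [reduce_fn(k, intermediate[k]) for k in sorted(intermediate)]

records = [(0, "big data big cluster"), (1, "big data")]
print(run_mapreduce(records, map_fn, reduce_fn))
# [('big', 3), ('cluster', 1), ('data', 2)]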