Database

• Database: a logically interrelated collection of shared data, along with a description of that data, physically distributed over a computer network.

Distributed Database
• A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer
network.

A distributed database management system (DDBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.

A DDBMS is mainly classified into two types:

• Homogeneous distributed database management systems
• Heterogeneous distributed database management systems

Characteristics
• All sites are interconnected by a network.
• Fragments can be replicated.
• The data at the sites is a collection of logically related shared data.
• The data at each site is under the control of a DBMS.
• The DBMS at each site participates in at least one global application.

Functionality
• Security
• Keeping track of data
• Replicated data management
• System catalog management
• Distributed transaction management
• Distributed database recovery

Homogeneous DDBMS
• In a homogeneous distributed database, all sites have identical software, are aware of each other, and agree to cooperate in processing user requests.
• A homogeneous system is much easier to design and manage.
• The operating system used at each location must be the same or compatible.
• The database application (or DBMS) used at each location must be the same or compatible.

Heterogeneous DDBMS
• In a heterogeneous distributed database, different sites may use different schemas and software.
• Different nodes may have different hardware and software, and the data structures at the various nodes or locations may be incompatible.
• Different computers, operating systems, database applications, or data models may be used at each of the locations.
• In a heterogeneous system, translations are required to allow communication between the different sites (or DBMSs).
• Full heterogeneity is often not technically or economically feasible; in such a system, a user at one location may be able to read, but not update, the data at another location.

Advantages
• Less danger of a single point of failure: when one computer fails, its workload is picked up by other workstations.
• Data are distributed across multiple sites.
• An end user is able to access any available copy of the data, and the request is processed by a processor at the data's location.
• Improved communications, because local sites are smaller and located closer to customers.
• Reduced operating costs: it is more cost-effective to add workstations to a network than to upgrade a mainframe system.
• Faster data access and faster data processing.
• A distributed database system spreads out the system's workload by processing data at several sites.

Disadvantages
• Complexity of management and control.
• Applications must recognize data locations and must be able to stitch together data from various sites.
• Security: the probability of security lapses increases when data are located at multiple sites.
• Increased storage and infrastructure requirements: multiple copies of the data have to be kept at different sites, so additional disk storage space is required.

What is a parallel database?

• A parallel database system aims to improve performance through the parallelization of various operations, such as loading data, building indexes, and evaluating queries.
• The distribution is done solely on the basis of performance.
• Parallel databases improve processing and input/output speeds by using multiple CPUs and disks in parallel.
• Many operations are performed simultaneously.
• Data may be stored in a distributed fashion.

Data fragmentation
• Fragmentation is the process of dividing or mapping a table, based on its columns and rows, into smaller units of data.
• Data that has been broken down in this way can still be recombined to reconstruct the complete data collection.
• Fragmentation is a database server feature that allows you to control where data is stored at the table level.
• Fragmentation enables you to define groups of rows or index keys within a table, as the sketch below illustrates.
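
As an illustrative sketch only (the table, column names, and predicate below are hypothetical, not from these notes): horizontal fragmentation splits a table's rows by a predicate, while vertical fragmentation splits its columns, keeping the key so the original rows can be reconstructed.

```python
# Hypothetical example of horizontal and vertical fragmentation in Python.

employees = [
    {"id": 1, "name": "Asha",  "dept": "Sales", "salary": 50000},
    {"id": 2, "name": "Ravi",  "dept": "IT",    "salary": 60000},
    {"id": 3, "name": "Meera", "dept": "Sales", "salary": 55000},
]

# Horizontal fragmentation: split rows by a predicate (here, by department).
sales_fragment = [r for r in employees if r["dept"] == "Sales"]
it_fragment    = [r for r in employees if r["dept"] == "IT"]

# Vertical fragmentation: split columns, keeping the key ("id") in every
# fragment so the table can be rebuilt by joining the fragments.
public_fragment  = [{"id": r["id"], "name": r["name"], "dept": r["dept"]}
                    for r in employees]
private_fragment = [{"id": r["id"], "salary": r["salary"]} for r in employees]

# Recombining the fragments reconstructs the complete data collection.
rebuilt = [{**p, **q} for p in public_fragment
           for q in private_fragment if p["id"] == q["id"]]
assert sorted(rebuilt, key=lambda r: r["id"]) == employees
```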

Replication
• Replication means storing several copies of a relation or relation fragment. An entire relation can be replicated at one or more sites.
• Similarly, one or more fragments of a relation can be replicated at other sites.
• For example, if a relation R is fragmented into R1,R2, and R3, there might be just one copy of R1, whereas R2 is
replicated at two other sites and R3 is replicated at all sites.
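
A minimal sketch of the example just given, with hypothetical site names S1..S4: R1 has a single copy, R2 is replicated at two sites, and R3 at all sites.

```python
# Hypothetical placement of fragments R1, R2, R3 across sites S1..S4,
# mirroring the example above.
sites = ["S1", "S2", "S3", "S4"]

placement = {
    "R1": ["S1"],            # a single copy of R1
    "R2": ["S2", "S3"],      # R2 replicated at two other sites
    "R3": list(sites),       # R3 replicated at all sites
}

def sites_holding(fragment):
    """Return the sites at which a copy of the given fragment is stored."""
    return placement.get(fragment, [])

print(sites_holding("R2"))   # ['S2', 'S3']
```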

Twofold Motivation for Replication


The motivation for replication is twofold:

1. Increased Availability of Data: If a site that contains a replica goes down, we can find the same data at other sites.
Similarly, if local copies of remote relations are available, we are less vulnerable to failure of communication links.
2. Faster Query Evaluation: Queries can execute faster by using a local copy of a relation instead of going to a remote
site.

Distributed Transaction
• In a distributed DBMS, a given transaction is submitted at some one site, but it can access data at other sites as
well.
• When a transaction is submitted at some site, the transaction manager at that site breaks it up into a collection of one or more sub-transactions that execute at different sites, submits them to the transaction managers at those sites, and coordinates their activity (a splitting step sketched after this list).
• Distributed Concurrency Control: How can locks for objects stored across several sites be managed?
• Distributed Recovery: Transaction atomicity must be ensured: when a transaction commits, all of its actions, across all the sites at which it executes, must persist. Similarly, when a transaction aborts, none of its actions may be allowed to persist.
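
A minimal sketch of the splitting step, assuming a hypothetical `site_of` catalog lookup that maps each data object to the site holding it:

```python
def split_transaction(operations, site_of):
    """Group a transaction's operations into sub-transactions, one per site,
    using the (hypothetical) catalog lookup site_of(object) -> site."""
    subtxns = {}
    for op in operations:
        subtxns.setdefault(site_of(op["object"]), []).append(op)
    return subtxns   # submitted to the transaction managers at those sites

ops = [{"object": "acct_1", "action": "debit",  "amount": 100},
       {"object": "acct_9", "action": "credit", "amount": 100}]
print(split_transaction(ops, site_of=lambda o: "S1" if o == "acct_1" else "S2"))
# {'S1': [ops touching acct_1], 'S2': [ops touching acct_9]}
```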

Distributed Concurrency Control


• The choice of technique determines which objects are to be locked; when locks are obtained and released is determined by the concurrency control protocol. Here we consider how lock and unlock requests are implemented in a distributed environment. Lock management can be distributed across sites in several ways:
• Centralized: A single site is in charge of handling lock and unlock requests for all objects.
• Primary Copy: One copy of each object is designated the primary copy. All requests to lock or unlock a copy of this object are handled by the lock manager at the site where the primary copy is stored, regardless of where the copy itself is stored (sketched below).
• Fully Distributed: Requests to lock or unlock a copy of an object stored at a site are handled by the lock manager at the site where the copy is stored.
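
A minimal sketch of the primary-copy scheme, with hypothetical object names, site names, and in-memory lock tables (a real lock manager would queue waiters and support shared as well as exclusive locks):

```python
# Sketch of primary-copy lock routing. The site names, object names, and
# in-memory lock tables are hypothetical illustrations, not a real DDBMS API.

primary_site = {"account_42": "S1", "order_7": "S2"}   # object -> primary site
lock_tables  = {"S1": {}, "S2": {}}                    # per-site lock state

def lock(obj, txn):
    """Route the request to the lock manager holding obj's primary copy."""
    table = lock_tables[primary_site[obj]]
    if obj in table and table[obj] != txn:
        return False          # held by another transaction: caller must wait
    table[obj] = txn
    return True

def unlock(obj, txn):
    table = lock_tables[primary_site[obj]]
    if table.get(obj) == txn:
        del table[obj]

assert lock("account_42", "T1")
assert not lock("account_42", "T2")   # T2 must wait until T1 unlocks
unlock("account_42", "T1")
assert lock("account_42", "T2")
```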

DISTRIBUTED RECOVERY
• Recovery in a distributed DBMS is more complicated than in a centralized DBMS for the following reasons:
• New kinds of failure can arise: failure of communication links, and failure of a remote site at which a sub-transaction is executing.
• Either all sub-transactions of a given transaction must commit, or none must commit, and this property must be guaranteed despite any combination of site and link failures. This guarantee is achieved using a commit protocol, classically two-phase commit (sketched below).
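
Below is a minimal two-phase commit coordinator sketch. The participant handles and their prepare/commit/abort methods are hypothetical stand-ins for remote sites; a real coordinator would also force-write log records before each message so the decision survives crashes.

```python
def two_phase_commit(participants):
    """Minimal 2PC coordinator sketch. Each participant is a hypothetical
    handle to a site running a sub-transaction, exposing
    prepare() -> bool, commit(), and abort()."""
    # Phase 1 (voting): ask every site to prepare, i.e. to vote.
    if all(p.prepare() for p in participants):
        # Phase 2 (decision): unanimous yes, so every site must commit.
        for p in participants:
            p.commit()
        return "committed"
    # Any "no" vote forces a global abort so that atomicity holds.
    for p in participants:
        p.abort()
    return "aborted"
```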

Concepts Of Locks
• A lock is used when multiple users need to access a database concurrently. This prevents data from being
corrupted or invalidated when multiple users try to write to the database.
• Any single user can modify only those database records (that is, items in the database) to which they have applied a lock, which gives them exclusive access to the record until the lock is released. Locking not only provides exclusive write access but also prevents (or controls) the reading of unfinished modifications.
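
A minimal single-process analogy in Python: `threading.Lock` gives each writer exclusive access to the record until the lock is released, which is exactly the exclusivity described above (real DBMS lock managers add shared/exclusive modes and deadlock handling).

```python
import threading

# A single record and its (exclusive) lock; real DBMSs manage many such locks.
record = {"balance": 100}
record_lock = threading.Lock()

def withdraw(amount):
    # Only one thread may modify the record at a time; the others block here,
    # so the read-check-write sequence is never interleaved.
    with record_lock:
        if record["balance"] >= amount:
            record["balance"] -= amount

threads = [threading.Thread(target=withdraw, args=(30,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(record["balance"])   # 10: three withdrawals succeed, the fourth is refused
```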

Byzantine General's Problem


The Problem: "Several divisions of the Byzantine army are camped outside an enemy city, each division commanded by
its own general. After observing the enemy, they must decide upon a common plan of action. Some of the generals
may be traitors, trying to prevent the loyal generals from reaching agreement."

Goal:
• All loyal generals decide upon the same plan of action.
• A small number of traitors cannot cause the loyal generals to adopt a bad plan.
• The paper considers a slightly different version of the problem, from the standpoint of one general (i.e., one process) and multiple lieutenants.

Goal:
• All loyal lieutenants obey the same order.
• If the commanding general is loyal, then every loyal lieutenant obeys the order he sends.
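
As a minimal illustration only (not Lamport's full OM(m) algorithm), the sketch below shows the majority-vote step a loyal lieutenant can apply to the orders it has received; this voting is what lets a small number of traitors be outvoted.

```python
from collections import Counter

def decide(orders_received):
    """A loyal lieutenant's decision rule: take the majority of all the
    orders it has seen (the general's plus those relayed by the others)."""
    return Counter(orders_received).most_common(1)[0][0]

# Hypothetical run: two loyal relays say "attack", one traitor says "retreat".
print(decide(["attack", "attack", "retreat"]))   # 'attack'
```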

Hadoop
• Hadoop is an open-source software framework for storing data and running applications on clusters of commodity
hardware.
• It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually
limitless concurrent tasks or jobs.
• The core of Apache Hadoop consists of a storage part (HDFS) and a processing part (MapReduce).
• Developer - Apache Software Foundation
• Written in Java

Benefits
• Computing power - Distributed computing model ideal for big data
• Flexibility - Store any amount of any kind of data.
• Fault Tolerance - If a node goes down, jobs are automatically redirected to other nodes. And it automatically stores
multiple copies/replicas of all data.
• Low Cost - The open-source framework is free and uses commodity hardware to store large quantities of data.
• Scalability - The system can be grown easily by adding more nodes.

HDFS Goals
• Detection of faults and automatic recovery.
• High throughput of data access rather than low latency.
• Provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.
• Write-once-read-many access model for files.
• Applications move themselves closer to where the data is located.
• Easily portable.

Some Nomenclature
• A Rack is a collection of nodes that are physically stored close together and are all on the same network.
• A Cluster is a collection of racks.
• NameNode - Manages the file system namespace and regulates access by clients. There is a single NameNode per cluster.
• DataNode - Serves read, write requests, and performs block creation, deletion, and replication upon instruction
from NameNode.
• A file is split in one or more blocks and a set of blocks are stored in DataNodes.
• A Hadoop block is a file on the underlying file system. The default size is 64 MB (128 MB in newer Hadoop versions). All blocks in a file except the last block are the same size.

Replica Management
• The NameNode keeps track of the rack ID each DataNode belongs to.
• The default replica placement policy is as follows (sketched after this list):
• One third of replicas are on one node
• Two thirds of replicas (including the above) are on one rack
• The other third are evenly distributed across the remaining racks.
• This policy improves write performance without compromising data reliability or read performance.
• HDFS tries to satisfy a read request from a replica that is closest to the reader.
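
A minimal sketch of that default policy for a replication factor of 3, using a hypothetical topology format that maps rack IDs to node lists:

```python
import random

def place_replicas(writer_node, topology, rf=3):
    """Sketch of the default policy: first replica on the writer's node,
    two more on two different nodes of one remote rack, and any further
    replicas spread across the remaining racks. `topology` is a
    hypothetical {rack_id: [node, ...]} map."""
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    remote_racks = [r for r in topology if r != local_rack]
    remote = random.choice([r for r in remote_racks if len(topology[r]) >= 2])
    replicas = [writer_node] + random.sample(topology[remote], 2)
    leftover = [n for r in remote_racks if r != remote for n in topology[r]]
    random.shuffle(leftover)
    while len(replicas) < rf and leftover:
        replicas.append(leftover.pop())
    return replicas[:rf]

topology = {"rackA": ["a1", "a2"], "rackB": ["b1", "b2"], "rackC": ["c1"]}
print(place_replicas("a1", topology))   # e.g. ['a1', 'b2', 'b1']
```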

NameNode
• Stores the HDFS namespace
• Records every change to the file system metadata in a transaction log called the EditLog
• The namespace, including the mapping of blocks to files and file system properties, is stored in a file called FsImage
• Both EditLog and FsImage are stored on the NameNode's local file system
• Keeps an image of the namespace and file blockmap in memory

• On startup:
• Reads the FsImage and EditLog from disk
• Applies all transactions from the EditLog to the in-memory copy of the FsImage
• Flushes the modified FsImage onto disk
• This process is called checkpointing
• Checkpointing currently occurs only when the NameNode starts up, not afterwards
• After checkpointing, the NameNode enters safemode
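
A minimal sketch of that startup sequence, using a toy JSON format for the FsImage and EditLog (the real files are binary and carry many more record types):

```python
import json

def startup_checkpoint(fsimage_path, editlog_path):
    """Toy NameNode checkpoint: read FsImage, replay EditLog, flush image."""
    # 1. Read the last persisted namespace image, e.g. {"/a.txt": ["blk_1"]}.
    with open(fsimage_path) as f:
        namespace = json.load(f)
    # 2. Apply every logged metadata change to the in-memory copy.
    with open(editlog_path) as f:
        for line in f:
            op, path, blocks = json.loads(line)
            if op == "create":
                namespace[path] = blocks
            elif op == "delete":
                namespace.pop(path, None)
    # 3. Flush the modified image back to disk and truncate the edit log.
    with open(fsimage_path, "w") as f:
        json.dump(namespace, f)
    open(editlog_path, "w").close()
    return namespace   # kept in memory; the NameNode now enters safemode
```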

Safemode:
• Replication of data blocks does not occur in safemode
• Receives Heartbeat and Blockreport from DataNodes
• Blockreport contains list of data blocks at a DataNode
• Each block has a specified minimum number of replicas
• A block is considered safely replicated when the minimum number of replicas has checked in with the NameNode.
• After a configurable percentage of safely replicated data blocks has checked in, the NameNode exits safemode.
• It then replicates any blocks that were not safely replicated.
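
A minimal sketch of that exit test, assuming `blocks` maps each block ID to the number of replicas that have checked in via Blockreports; the threshold and minimum are the configurable values mentioned above.

```python
def should_exit_safemode(blocks, min_replicas=1, threshold=0.999):
    """Leave safemode once the configured percentage of blocks has at
    least the minimum number of replicas checked in with the NameNode."""
    if not blocks:
        return True
    safe = sum(1 for n in blocks.values() if n >= min_replicas)
    return safe / len(blocks) >= threshold

print(should_exit_safemode({"blk_1": 3, "blk_2": 1, "blk_3": 0}))   # False
```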

DataNode
• Stores HDFS data in files in its local file system
• Has no knowledge about HDFS files
• Stores each HDFS block in a separate file
• Stores files in subdirectories instead of one single directory
• On startup:
• Scans through its local file system
• Generates a list of all HDFS data blocks
• Sends the report to the NameNode
• This report is called the Blockreport
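
A minimal sketch of generating a Blockreport, assuming (hypothetically) that each block is stored as a local file named blk_<id> somewhere under the DataNode's data directory:

```python
import os

def block_report(data_dir):
    """Scan the DataNode's local file system and list every HDFS block
    stored here; the blk_<id> naming is a simplifying assumption."""
    blocks = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.startswith("blk_"):
                path = os.path.join(root, name)
                blocks.append({"id": name, "length": os.path.getsize(path)})
    return blocks   # sent to the NameNode as the Blockreport
```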

Staging
• A client request to create a file does not reach the NameNode immediately
• Initially, the client caches file data into a temporary local file
• Once the local file has data over one HDFS block size, the NameNode is contacted
• The NameNode inserts the file name into the file system hierarchy and allocates a data block for it
• It replies with the identity of the DataNode and the destination data block
• It also sends a list of the DataNodes replicating the block.
• The client then flushes the block of data to the DataNode.
• When a file is closed, the remaining data is also flushed to the DataNode
• It then tells the NameNode that the file is closed
• The NameNode commits the file creation operation into a persistent store.
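
A minimal client-side sketch of this staging behavior; `allocate_block` is a hypothetical callback standing in for the NameNode RPC that inserts the file name and returns the target DataNodes.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # one HDFS block (64 MB, as in the notes above)

class StagingWriter:
    """Toy client-side staging: buffer writes into a local temporary buffer
    and contact the NameNode (via the hypothetical allocate_block callback)
    only once a full block has accumulated."""

    def __init__(self, allocate_block):
        self.allocate_block = allocate_block   # -> list of target DataNodes
        self.buffer = bytearray()

    def write(self, data):
        self.buffer.extend(data)
        while len(self.buffer) >= BLOCK_SIZE:
            block = bytes(self.buffer[:BLOCK_SIZE])
            del self.buffer[:BLOCK_SIZE]
            self._flush(block, self.allocate_block())

    def close(self):
        if self.buffer:                         # flush the final partial block
            self._flush(bytes(self.buffer), self.allocate_block())
            self.buffer.clear()

    def _flush(self, block, datanodes):
        """Ship the block to the first DataNode in the replication pipeline."""
        pass   # placeholder for the data transfer itself
```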

Replication Pipelining
• The client sends the data block to the DataNode in small portions
• The DataNode writes each portion to its local filesystem
• It then passes on the portion to another DataNode for replication as determined by the NameNode
• Each DataNode, on receiving a portion, writes it to its local file system and passes it on to the next DataNode
• This continues until the portion reaches the last DataNode holding a replica of the data block.
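
A minimal sketch of the pipeline, where each (hypothetical) DataNode handle persists the portion locally and forwards it downstream:

```python
def pipeline_write(portion, datanodes):
    """Write a small portion of a block at each DataNode in the pipeline in
    turn; `datanodes` is the replica list chosen by the NameNode, and
    write_local is a hypothetical stand-in for the node's local write."""
    if not datanodes:
        return
    head, rest = datanodes[0], datanodes[1:]
    head.write_local(portion)        # persist the portion at this replica
    pipeline_write(portion, rest)    # forward it to the next DataNode
```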

HDFS:
• The Hadoop Distributed File System (HDFS) is the file system component of Hadoop. It is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. Both goals are achieved by replicating file content on multiple machines (DataNodes).
• HDFS is a block-structured file system: files are broken into blocks of 128 MB (configurable per file).
• A file can be made up of several blocks, and these blocks are stored across a cluster of one or more machines with data storage capacity.
• Each block of a file is replicated across a number of machines to prevent loss of data.
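
A minimal sketch of block-structured splitting: every block gets the full block size except, possibly, the last one.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # per-file configurable; 128 MB here

def split_into_blocks(file_length, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file: all blocks are
    block_size long except, possibly, the last one."""
    blocks = []
    offset = 0
    while offset < file_length:
        length = min(block_size, file_length - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

print(split_into_blocks(300 * 1024 * 1024))
# [(0, 134217728), (134217728, 134217728), (268435456, 46137344)]
```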

Distributed Hash Table


Definition:
• A distributed hash table (DHT) is a class of decentralized distributed system that provides a lookup service similar to a hash table (key-value pairs).
• Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption.
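
A minimal consistent-hashing sketch, the mechanism many DHTs (e.g. Chord-style systems) use to achieve that minimal disruption: each key is owned by the first node clockwise from its hash on a ring, so adding or removing a node only remaps the keys between that node and its predecessor.

```python
import bisect
import hashlib

def _hash(value):
    """Map a string onto the ring (SHA-1 keyspace)."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy DHT ring: adding or removing a node only remaps the keys that
    fall between that node and its predecessor on the ring."""

    def __init__(self, nodes=()):
        self.ring = sorted((_hash(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self.ring, (_hash(node), node))

    def remove(self, node):
        self.ring.remove((_hash(node), node))

    def lookup(self, key):
        """Return the node responsible for key: the first node clockwise."""
        if not self.ring:
            raise LookupError("no nodes in the ring")
        i = bisect.bisect(self.ring, (_hash(key),)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("user:42"))   # the node currently owning this key
```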
