Unit V
Syllabus
Distributed Databases: Architecture, Data Storage, Transaction Processing, Query processing and
optimization - NOSQL Databases: Introduction - CAP Theorem - Document Based systems - Key value
Stores - Column Based Systems - Graph Databases. Database Security: Security issues - Access control
based on privileges - Role Based access control - SQL Injection - Statistical Database security - Flow control
- Encryption and Public Key infrastructures - Challenges.
Distributed Databases
AU: Dec.-04,07,16,17,19, May-03,06,14,16,17,19, Marks 16
Definition of distributed databases:
• A distributed database system consists of loosely coupled sites (computers) that share no physical
components; each site is associated with a database system.
• The software that maintains and manages the working of distributed databases is called distributed database
management system.
• The database systems that run on the sites are independent of each other.
(1) Homogeneous databases
• In a homogeneous distributed database, all the sites are aware of the other sites present in the system and
they all cooperate in processing users' requests.
• Each site present in the system surrenders part of its autonomy in terms of the right to change schemas or
software.
• The homogeneous database system appears as a single system to the user.
(2) Heterogeneous databases
• Heterogeneous databases are database systems in which different sites have different schemas
or software. Refer Fig. 5.1.3.
• The participating sites may not be aware of the other sites present in the system.
• These sites provide only limited facilities for cooperation in transaction processing.
Architecture
• Following is an architecture of distributed databases. In this architecture a local database is maintained
by each site.
• The sites are interconnected by a communication network.
When a user makes a request for particular data at site Si, the data is first searched for in the local database.
If the data is not present in the local database, then the request is passed to all the other sites via the
communication network. Each site then searches for the data in its local database. When the data is found at
a particular site, say Sj, it is transmitted to site Si via the communication network.
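The lookup procedure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each site's local database is a plain dictionary, the "network" is a direct method call, and the names Site, lookup_local and request are hypothetical.

```python
class Site:
    """One site with its own local database (a plain dict in this sketch)."""

    def __init__(self, name, local_db):
        self.name = name
        self.local_db = local_db
        self.peers = []  # other sites reachable over the communication network

    def lookup_local(self, key):
        # Search only this site's local database.
        return self.local_db.get(key)

    def request(self, key):
        """Return (value, name of the site that supplied it), or (None, None)."""
        value = self.lookup_local(key)
        if value is not None:
            return value, self.name
        # Not found locally: pass the request to the other sites.
        for peer in self.peers:
            value = peer.lookup_local(key)
            if value is not None:
                return value, peer.name
        return None, None


s1 = Site("S1", {"emp:1": "Ankita"})
s2 = Site("S2", {"emp:2": "Kavita"})
s3 = Site("S3", {"emp:3": "Prajkta"})
s1.peers = [s2, s3]

print(s1.request("emp:1"))  # served from the local database of S1
print(s1.request("emp:3"))  # fetched from S3 and returned to S1
```

A real system would send the request over the network and would normally consult a catalog to locate the data instead of asking every site.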
Data Storage
There are two approaches to storing a relation r in a distributed database -
(1) Replication: System maintains multiple copies of data, stored in different sites, for faster retrieval
and fault tolerance.
(2) Fragmentation: Relation is partitioned into several fragments stored in distinct sites.
Data Replication
• Concept: Data replication means storing a copy or replica of a relation or of its fragments at two or more sites.
• There are two methods of data replication: (1) Full replication (2) Partial replication
• Full replication: In this approach the entire relation is stored at all the sites. Fully redundant
databases are those in which every site contains a copy of the entire database.
• Partial replication: In this approach only some fragments of the relation are replicated on the sites.
Advantages:
(1) Availability: Data replication facilitates increased availability of data.
(2) Parallelism: Queries can be processed by several sites in parallel.
(3) Faster accessing: The relation r is locally available at each site, hence data accessing becomes faster.
Disadvantages:
(1) Increased cost of update: The major disadvantage of data replication is the increased cost of updates.
That means every replica of relation r must be updated at all the sites that hold it whenever a user requests
an update to the relation.
(2) Increased complexity in concurrency control: It becomes complex to implement the concurrency
control mechanism for all the sites containing a replica.
Data Fragmentation
• Concept: Data fragmentation is a division of relation r into fragments r1, r2, r3, ..., rn which contain
sufficient information to reconstruct relation r.
• There are two approaches to data fragmentation - (1) Horizontal fragmentation and (2) Vertical
fragmentation.
• Horizontal fragmentation: In this approach, each tuple of r is assigned to one or more fragments. If
relation R is fragmented into fragments r1 and r2, then to bring these fragments back to R we must use the
union operation. That means R = r1 ∪ r2
• Vertical fragmentation: In this approach, the relation r is fragmented based on one or more columns. If
relation R is fragmented into r1 and r2 fragments using vertical fragmentation, then to bring these fragments
back to the original relation R we must use the join operation on a key that both fragments share. That means R = r1 ⋈ r2
• For example - Consider following relation r
Student (RollNo, Marks, City)
The values in this schema are inserted as
Horizontal Fragmentation 1:
SELECT * FROM Student WHERE Marks > 50 AND City = 'Pune';
We will get
Horizontal Fragmentation 2:
SELECT * FROM Student WHERE Marks > 50 AND City = 'Mumbai';
We will get
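Vertical Fragmentation 1:
SELECT RollNo, Marks FROM Student;
Vertical Fragmentation 2:
SELECT RollNo, City FROM Student;
Both vertical fragments keep the key RollNo so that a join on RollNo can reconstruct the relation.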
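The reconstruction rules (union for horizontal fragments, join for vertical fragments) can be checked with a small experiment. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The four sample rows are hypothetical, as are the fragment table names h1, h2, v1 and v2. All sample marks are above 50, so the two horizontal fragments together cover the whole relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (RollNo INTEGER PRIMARY KEY, Marks INTEGER, City TEXT)")
# Hypothetical sample rows; every mark is above 50.
rows = [(1, 55, "Pune"), (2, 60, "Mumbai"), (3, 75, "Pune"), (4, 80, "Mumbai")]
conn.executemany("INSERT INTO Student VALUES (?, ?, ?)", rows)

# Horizontal fragments: each fragment holds a subset of the tuples.
conn.execute("CREATE TABLE h1 AS SELECT * FROM Student WHERE Marks > 50 AND City = 'Pune'")
conn.execute("CREATE TABLE h2 AS SELECT * FROM Student WHERE Marks > 50 AND City = 'Mumbai'")

# Vertical fragments: each fragment holds a subset of the columns plus the key.
conn.execute("CREATE TABLE v1 AS SELECT RollNo, Marks FROM Student")
conn.execute("CREATE TABLE v2 AS SELECT RollNo, City FROM Student")

original = conn.execute("SELECT * FROM Student ORDER BY RollNo").fetchall()

# R = h1 UNION h2
from_union = conn.execute(
    "SELECT * FROM h1 UNION SELECT * FROM h2 ORDER BY RollNo"
).fetchall()

# R = v1 JOIN v2 on the shared key RollNo
from_join = conn.execute(
    "SELECT v1.RollNo, v1.Marks, v2.City "
    "FROM v1 JOIN v2 ON v1.RollNo = v2.RollNo ORDER BY v1.RollNo"
).fetchall()

print(from_union == original, from_join == original)
```

If a vertical fragment did not carry the key column, the join would have nothing to match on and the original tuples could not be recovered.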
Transaction Processing
Basic Concepts
In a distributed system, a transaction initiated at one site can access or update data at other sites. Let us discuss
various basic concepts used during transaction processing in distributed systems -
• Local and global transactions:
A transaction Ti is said to be local if it is initiated at site Si and accesses or updates data at site Si only.
A transaction Ti initiated at site Si is said to be global if it accesses or updates data at several sites Si, Sj, Sk
and so on.
• Coordinating and participating sites:
The site at which the transaction is initiated is called the coordinating site. The participating sites are those sites
at which the sub-transactions are executing. For example - If site S1 initiates the transaction T1 then it is
called the coordinating site. Now assume that transaction T1 (initiated at S1) can access sites S2 and S3. Then
sites S2 and S3 are called participating sites.
To access the data on site S2, the transaction T1 needs another transaction T12 on site S2. Similarly, to access
the data on site S3, the transaction T1 needs some transaction, say T13, on site S3. The transactions T12 and
T13 are called sub-transactions. The above described scenario can be represented by following Fig. 5.1.6.
• Transaction manager:
The transaction manager manages the execution of those transactions (or subtransactions) that access data
stored in a local site.
The tasks of transaction manager are -
(1) Maintaining the log for recovery purpose.
(2) Participating in coordinating the concurrent execution of the transactions executing at that site.
• Transaction coordinator:
The transaction coordinator coordinates the execution of the various transactions (both local and global)
initiated at that site.
The tasks of transaction coordinator are -
(1) Starting the execution of transactions that originate at the site.
(2) Distributing sub-transactions to appropriate sites for execution.
Let TC denote the transaction coordinator and TM denote the transaction manager, then the system
architecture can be represented as,
Failure Modes
There are four types of failure modes,
1. Failure of site
2. Loss of messages
3. Failure of communication link
4. Network partition
The most common type of failure in a distributed system is loss or corruption of messages. The system uses
the Transmission Control Protocol (TCP) to handle such errors. This is a standard connection-oriented protocol
that detects lost or corrupted messages and retransmits them.
• If two nodes are not directly connected, messages from one to another must be routed through a sequence of
communication links. If a communication link fails, the messages are rerouted over alternative links.
• A system is partitioned if it has been split into two or more subsystems, called partitions, that lack any
connection between them. Such a lack of connection between the subsystems also causes failure in a distributed system.
Commit Protocols
Two Phase Commit Protocol
• Atomicity is an important property of transaction processing. This
property means that either the transaction executes completely or it does not execute at all.
• The commit protocol ensures atomicity across the sites in the following ways -
i) A transaction which executes at multiple sites must either be committed at all the sites, or aborted at all
the sites.
ii) It is not acceptable to have a transaction committed at one site and aborted at another.
There are two types of sites involved in this protocol -
• One coordinating site
• One or more participating sites.
Two phase commit protocol
This protocol works in two phases - i) Voting phase and ii) Decision phase.
Phase 1: Obtaining decision or voting phase
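Step 1: Coordinator site Ci asks all participants to prepare to commit transaction T.
• Ci adds the record <prepare T> to the log and writes the log to stable storage.
• It then sends prepare T messages to all participating sites at which T is executed.
Step 2: Upon receiving the message, the transaction manager at each participating site determines whether it
can commit the transaction.
• If not, it adds a record <no T> to the log and sends an abort T message to coordinating site Ci.
• If it can, it adds a record <ready T> to the log, writes all the log records for T to stable storage and sends
a ready T message to Ci.
Phase 2: Recording decision or decision phase
• Ci decides to commit T only if it receives a ready T message from every participating site; otherwise it
decides to abort T.
• Ci adds the record <commit T> or <abort T> to the log, writes it to stable storage and sends the decision to
all participating sites, which record it in their own logs and act on it.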
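The voting phase and the decision phase can be simulated in memory. The sketch below is a simplified illustration, not a full implementation: messages are ordinary function calls, logs are Python lists, there are no failures or timeouts, and the names Participant, on_prepare, on_decision and two_phase_commit are hypothetical.

```python
class Participant:
    """A participating site that votes on whether it can commit."""

    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit
        self.log = []

    def on_prepare(self, t):
        # Phase 1, step 2: decide locally and record the vote in the log.
        if self.can_commit:
            self.log.append(f"<ready {t}>")
            return "ready"
        self.log.append(f"<no {t}>")
        return "abort"

    def on_decision(self, t, decision):
        # Phase 2: record the coordinator's decision.
        self.log.append(f"<{decision} {t}>")


def two_phase_commit(t, participants, coordinator_log):
    # Phase 1 (voting): log <prepare T>, then collect one vote per site.
    coordinator_log.append(f"<prepare {t}>")
    votes = [p.on_prepare(t) for p in participants]

    # Phase 2 (decision): commit only if every site voted ready.
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    coordinator_log.append(f"<{decision} {t}>")
    for p in participants:
        p.on_decision(t, decision)
    return decision


log_a = []
outcome_a = two_phase_commit("T", [Participant("S2", True), Participant("S3", True)], log_a)

log_b = []
outcome_b = two_phase_commit("T", [Participant("S2", True), Participant("S3", False)], log_b)

print(outcome_a, log_a)
print(outcome_b, log_b)
```

A single "abort" vote is enough to abort the transaction everywhere, which is how the protocol keeps the outcome identical at all sites.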
Failure of site
There are various cases in which failure may occur,
(1) Failure of participating sites
• When a failed participating site Si recovers, it examines its own log to decide the fate of each transaction T
that was in progress when the failure occurred.
• If the log contains <commit T> record: participating site executes redo (T)
• If the log contains <abort T> record: participating site executes undo (T)
• If the log contains <ready T> record: participating site must consult the coordinating site to learn the
decision about transaction T.
  - If T committed, redo (T)
  - If T aborted, undo (T)
• If the log of the participating site contains no control record for T, then Si failed before responding to the
prepare T message from the coordinating site. In this case it must abort T.
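The recovery rules for a participating site amount to a small decision function over its log. The following is a hedged sketch: the log is modelled as a list of record strings, and ask_coordinator is a hypothetical callback that returns "commit" or "abort" on behalf of the coordinating site.

```python
def recover_participant(log, t, ask_coordinator):
    """Return the action a recovering site takes for transaction t."""
    if f"<commit {t}>" in log:
        return "redo"
    if f"<abort {t}>" in log:
        return "undo"
    if f"<ready {t}>" in log:
        # The site voted ready but never learned the outcome: ask the coordinator.
        return "redo" if ask_coordinator(t) == "commit" else "undo"
    # No control record: the site failed before answering prepare T, so T is aborted.
    return "undo"


def coordinator_says_commit(t):
    return "commit"


def coordinator_says_abort(t):
    return "abort"


print(recover_participant(["<ready T>", "<commit T>"], "T", coordinator_says_abort))
print(recover_participant(["<ready T>"], "T", coordinator_says_commit))
print(recover_participant([], "T", coordinator_says_commit))
```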
(2) Failure of coordinator
• If the coordinator fails while the commit protocol for T is executing, then the participating sites must decide
the fate of transaction T:
i) If an active participating site contains a <commit T> record in its log, then T must be committed.
ii) If an active participating site contains an <abort T> record in its log, then T must be aborted.
iii) If some active participating site does not contain a <ready T> record in its log, then the failed coordinator
Ci cannot have decided to commit T. The sites can therefore abort T.
iv) If none of the above cases holds, then all active participating sites must have a <ready T> record in their
logs, but no additional control records (such as <abort T> or <commit T>). In this case the active sites must wait
for coordinator site Ci to recover in order to learn the decision.
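The four cases can be written as one function that inspects the logs of the active participating sites. This is a sketch under the assumption that each log is a list of record strings; the function name decide_without_coordinator is hypothetical.

```python
def decide_without_coordinator(active_logs, t):
    """Apply cases i) to iv) to the logs of the active participating sites."""
    if any(f"<commit {t}>" in log for log in active_logs):
        return "commit"   # case i
    if any(f"<abort {t}>" in log for log in active_logs):
        return "abort"    # case ii
    if any(f"<ready {t}>" not in log for log in active_logs):
        return "abort"    # case iii: the coordinator cannot have decided to commit
    return "wait"         # case iv: every active site is ready, so they block


print(decide_without_coordinator([["<ready T>"], ["<ready T>", "<commit T>"]], "T"))
print(decide_without_coordinator([["<ready T>"], []], "T"))
print(decide_without_coordinator([["<ready T>"], ["<ready T>"]], "T"))
```

The "wait" result in the last call is the situation in which the active sites cannot make progress until the coordinator recovers.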
The two phase commit protocol has a blocking problem.
What is the blocking problem?
It is a situation in which active participating sites may have to wait for the failed coordinator site to recover.
One solution to this problem is to use the three phase commit protocol.
Review Questions
1. What are the various features of distributed database versus centralized database system? AU: Dec.-17,
Marks 6, May-17, Marks 8
2. Explain the architecture of a distributed database. AU: Dec.-16, Marks 7
3. Explain about distributed databases and their characteristics, functions and advantages and
disadvantages. AU: Dec.-07, May-14, Marks 8, May-16, Marks 16
4. Explain design of distributed database. AU: Dec.-04, Marks 8
5. Discuss homogeneous and heterogeneous databases with reference to distributed databases. AU: May-03,
Marks 8
6. Discuss in detail about the distributed databases. AU: May-19, Marks 13
7. What are data fragmentations? Explain various approaches for fragmenting a relation with example. AU:
May-06, Marks 6
8. Explain in detail various approaches used for storing a relation in distributed databases. AU: Dec.-04,
Marks 8, Dec.-19, Marks 9
NOSQL Databases
Introduction
• NoSQL stands for "not only SQL".
• It is a nontabular database system that stores data differently from relational tables. There are various types of
NoSQL databases such as document, key-value, wide column and graph.
• Using NoSQL we can maintain flexible schemas, and these databases can be scaled easily to large amounts
of data.
Need
The NoSQL database technology is usually adopted for the following reasons -
1) NoSQL databases are often used for handling big data as a part of the fundamental architecture.
2) NoSQL databases are used for storing and modelling structured, semi-structured and unstructured
data.
3) NoSQL is used for efficient execution of databases with high availability.
4) A NoSQL database is non-relational, so it scales out better than relational databases and is well suited to
web applications.
5) NoSQL is used for easy scalability.
Features
1) NoSQL does not follow the relational model.
2) It is either schema free or has a relaxed schema. That means it does not require a specific definition of
the schema.
3) Multiple NoSQL databases can be executed in a distributed fashion.
4) It can process both unstructured and semi-structured data.
5) NoSQL has higher scalability.
6) It is cost effective.
7) It supports data in the form of key-value pairs, documents, wide columns and graphs.
Comparison between RDBMS and NoSQL
Review Question
1. What is NoSQL? What is the need for it? List the various features of NoSQL.
CAP Theorem
• The CAP theorem is also called Brewer's theorem.
• The CAP theorem comprises three components (hence its name) as they relate to distributed data
stores:
  - Consistency: All reads receive the most recent write or an error.
  - Availability: All reads contain data, but it might not be the most recent.
  - Partition tolerance: The system continues to operate despite network failures (i.e., dropped messages,
slow network connections, or unavailable network connections between nodes).
• The CAP theorem states that it is not possible to guarantee all three of the desirable properties -
consistency, availability and partition tolerance - at the same time in a distributed system with data replication.
When a network partition occurs, the system must give up either consistency or availability.
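The trade-off can be illustrated with a toy model of two replicas that are cut off from each other. This is only a sketch: the Node class, the mode labels "CP" and "AP", and the in_sync flag are assumptions made for the illustration and do not describe any particular product.

```python
class Node:
    """A replica that either refuses stale reads ("CP") or serves them ("AP")."""

    def __init__(self, mode):
        self.mode = mode
        self.value = "old"
        self.in_sync = True  # False while cut off from the node holding the latest write

    def read(self):
        if not self.in_sync and self.mode == "CP":
            # Consistency is preferred: return an error instead of stale data.
            raise RuntimeError("unavailable: cannot confirm the latest write")
        # Availability is preferred: answer, even if the data may be stale.
        return self.value


def write_during_partition(primary, secondary, new_value):
    # The write reaches the primary, but the partition stops replication.
    primary.value = new_value
    secondary.in_sync = False


ap_primary, ap_secondary = Node("AP"), Node("AP")
write_during_partition(ap_primary, ap_secondary, "new")
print(ap_secondary.read())  # answers, but with the stale value

cp_primary, cp_secondary = Node("CP"), Node("CP")
write_during_partition(cp_primary, cp_secondary, "new")
try:
    cp_secondary.read()
except RuntimeError as err:
    print(err)  # refuses to answer until the partition heals
```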
Review Question
1. Write a short note on CAP theorem.
Key-Value Store
• The key-value store is the simplest type of NoSQL database.
• It is designed to handle large amounts of data and heavy load.
• In key-value storage the key is unique and the value can be JSON, a string or a binary object.
• For example -
{"Customer":
[
{"id":1,"name":"Ankita"},
{"id":2,"name":"Kavita"}
]
}
Here id and name are the keys and 1, 2, "Ankita", "Kavita" are the values corresponding to those keys.
Key-value stores help the developer to store schema-less data. They work best for data such as shopping cart contents.
DynamoDB, Riak and Redis are some well-known examples of key-value stores.
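The basic operations of a key-value store (put, get, delete by unique key) can be sketched with a Python dictionary. This is a minimal in-memory illustration, not the API of DynamoDB, Riak or Redis; the class name KeyValueStore and the key "cart:1" are hypothetical.

```python
import json


class KeyValueStore:
    """In-memory store: unique string keys mapped to JSON-encoded values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store does not inspect the value; it only keeps the serialized form.
        self._data[key] = json.dumps(value)

    def get(self, key, default=None):
        raw = self._data.get(key)
        return default if raw is None else json.loads(raw)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("cart:1", {"customer": "Ankita", "items": ["pen", "notebook"]})
cart = store.get("cart:1")
print(cart["items"])

store.delete("cart:1")
print(store.get("cart:1"))
```

All access goes through the key, which is why such stores suit data like a shopping cart that is always fetched as a whole.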
Document Based Systems
• The document store makes use of key-value pairs to store and retrieve data.
• The document is stored in the form of XML or JSON.
• Document stores appear the most natural among NoSQL database types.
• They are the most commonly used because of their flexibility and the ability to query on any field.
• For example -
{
"id": 101,
"Name": "AAA",
"City" : "Pune"
}
MongoDB and CouchDB are two popular document oriented NoSQL databases.
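The ability to query on any field can be sketched with a list of dictionaries standing in for a collection of JSON documents. This is an illustration only and does not use the MongoDB or CouchDB APIs; the helper find and the second and third documents are hypothetical.

```python
def find(collection, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Documents in one collection need not share the same fields.
students = [
    {"id": 101, "Name": "AAA", "City": "Pune"},
    {"id": 102, "Name": "BBB", "City": "Mumbai", "Marks": 72},
    {"id": 103, "Name": "CCC", "City": "Pune"},
]

print([doc["id"] for doc in find(students, City="Pune")])
print([doc["id"] for doc in find(students, Marks=72)])
```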
Column Based Systems
Column store databases are widely used to manage data warehouses and business intelligence applications.
HBase and Cassandra are examples of column based databases.
Graph Databases
The graph database is typically used in applications where the relationships among the data elements are
an important aspect.
The connections between elements are called links or relationships. In a graph database, connections are
first-class elements of the database, stored directly. In relational databases, links are implied, using data to
express the relationships.
The graph database has two components -
1) Node: The entities themselves. For example - people, students.
2) Edge: The relationships among the entities.
For example -
Graph databases are mostly used for social networks, logistics and spatial data. Examples of graph databases are
Neo4j, InfiniteGraph and OrientDB.
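Nodes and edges can be sketched with plain Python data structures to show why stored relationships make traversal simple. The people, the relation names FRIEND_OF and STUDIES_AT, and the helper functions are hypothetical; a real graph database would use its own query language.

```python
# Nodes: the entities themselves.
nodes = {
    "Ankita": {"type": "Person"},
    "Kavita": {"type": "Person"},
    "Prajkta": {"type": "Person"},
    "College": {"type": "Institute"},
}

# Edges: relationships stored directly as (source, relation, target).
edges = [
    ("Ankita", "FRIEND_OF", "Kavita"),
    ("Kavita", "FRIEND_OF", "Prajkta"),
    ("Ankita", "STUDIES_AT", "College"),
]


def neighbours(node, relation):
    """Follow the edges of one relation type that leave the given node."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


def friends_of_friends(person):
    """Two-step traversal: friends of the person's friends."""
    direct = set(neighbours(person, "FRIEND_OF"))
    result = set()
    for friend in direct:
        result.update(neighbours(friend, "FRIEND_OF"))
    return result - direct - {person}


print(neighbours("Ankita", "FRIEND_OF"))
print(friends_of_friends("Ankita"))
```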
Database Security
• Definition: Database security is a technique that deals with the protection of data against unauthorized access
and modification.
• Database security is an important aspect of any database management system as it deals with the sensitivity
of the data and information of an enterprise.
• Database security allows or disallows users from performing actions on the database objects.
Security Issues
Types of Security
Database security addresses following issues -
(1) Legal Issues: There are many legal or ethical issues with respect to right to access information. For
example - If some sensitive information is present in the database, then it must not be accessed by
unauthorized person.
(2) Policy Issues: There are government or organizational policies that specify what kind of
information should be made publicly available.
(3) System Issues: Under this issue, it is decided whether a security function should be handled at the hardware
level, at the operating system level or at the database level.
(4) Data and User Level Issues: In many organizations, multiple security levels are identified, and data and
users are categorized based on these classifications. The security policy of the organization must take these
levels into account when permitting access to different levels of users.
Threats to Database
Threats to database will result in loss or degradation of data. There are three kinds of loss that occur due to
threats to database
(1) Loss of Integrity:
• Database integrity means information must be protected from improper modification.
• Modification to database can be performed by inserting, deleting or modifying the data.
• Integrity is lost if unauthorized changes are made to data intentionally or accidentally.
• If the loss of integrity is not corrected and work is continued, then it results in inaccuracy, fraud, or erroneous
decisions.
(2) Loss of Availability:
• Database availability means making the database objects available to authorized users.
(3) Loss of Confidentiality:
• Confidentiality means protection of data from unauthorized disclosure.
• The loss of confidentiality results in loss of public confidence, embarrassment or some legal action against
the organization.
Control Measures
• There are four major control measures used to provide security of data in a database.
1. Access control
2. Inference control
3. Flow control
4. Data encryption
• Access Control: The most common security problem is unauthorized access to the computer system.
Generally, this access is for obtaining information or for making malicious changes in the database. The
security mechanism of a DBMS must include provisions for restricting access to the database system as a
whole. This function is called access control.
• Inference Control: This measure addresses the security problems of statistical databases.
Statistical databases are used to provide statistical information based on some criteria. These databases may
contain information about a particular age group, income level, education criteria and so on. Access to
sensitive information about individuals must be prevented while using statistical databases. The corresponding
countermeasure, called inference control, prevents the user from completing any inference channel.
• Flow Control: It is a kind of control measure which prevents information from flowing in such a way that
it reaches unauthorized users. Channels that are pathways for information to flow implicitly in ways that
violate the security policy of an organization are called covert channels.
• Data Encryption: The data encryption is a control measure used to secure the sensitive data. In this
technique, the data is encoded using some coding algorithm. An unauthorized user who accesses encoded
data will have difficulty deciphering it, but authorized users are given decoding or decrypting algorithms (or
keys) to decipher the data.
• Discretionary Access Control (DAC) allows each user or subject to control access to their own data.
• In DAC, the owner of a resource restricts access to the resource based on the identity of users.
• DAC is typically the default access control mechanism for most desktop operating systems.
• Each resource object on a DAC based system has an Access Control List (ACL) associated with it.
• An ACL contains a list of users and groups to which the owner has permitted access, together with the level
of access for each user or group.
• For example - The ACL is an object centered description of access rights as follows -
test1.doc: {Prajkta: read}
test2.exe: {Ankita: execute}, {Prajkta: execute}
test3.com: {Ankita: execute, read}, {Prajkta: execute, read, write}
• Object access is determined during Access Control List (ACL) authorization and is based on user
identification and/or group membership.
• Under DAC a user can only set access permissions for resources which they already own.
• Thus a hypothetical user A cannot change the access control for a file that is owned by user B. User A
can, however, set access permissions on a file that he/she owns.
• A user may transfer object ownership to another user(s).
• A user may determine the access type of other users.
• DAC is an access control model that is easy to implement.
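The ACL example and the ownership rule can be sketched as two small functions. The access rights come from the example above, while the owners table and the function names is_allowed and grant are hypothetical additions made for this illustration.

```python
# Object centered ACL: object -> user -> set of rights.
acl = {
    "test1.doc": {"Prajkta": {"read"}},
    "test2.exe": {"Ankita": {"execute"}, "Prajkta": {"execute"}},
    "test3.com": {"Ankita": {"execute", "read"},
                  "Prajkta": {"execute", "read", "write"}},
}

# Hypothetical owners; under DAC only the owner may change an object's ACL.
owners = {"test1.doc": "Prajkta", "test2.exe": "Ankita", "test3.com": "Prajkta"}


def is_allowed(user, obj, right):
    """Check whether the ACL of obj grants the right to the user."""
    return right in acl.get(obj, {}).get(user, set())


def grant(owner, obj, user, right):
    """Let the owner of obj give a right to another user."""
    if owners.get(obj) != owner:
        raise PermissionError(f"{owner} does not own {obj}")
    acl.setdefault(obj, {}).setdefault(user, set()).add(right)


print(is_allowed("Ankita", "test3.com", "write"))   # not in the ACL
grant("Prajkta", "test3.com", "Ankita", "write")    # the owner extends the ACL
print(is_allowed("Ankita", "test3.com", "write"))
```

Because any owner can extend an ACL at will, the sketch also shows why DAC gives no central control over where information ends up.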
Advantages:
(1) It is flexible.
(2) It has simple and efficient access right management.
(3) It is scalable. That means we can add more users without any complexity.
Disadvantages:
(1) It increases the risk that data will be made accessible to users that should not necessarily be given access.
(2) There is no control over information flow as one user can transfer ownership to another user.
SQL Injection
• SQL injection is a type of code injection technique that can damage or destroy a database.
• In this technique, malicious code is placed in an SQL statement via web page input. These statements
then control the database server behind the web application.
• Attackers can use SQL injection vulnerabilities to bypass application security measures. They can go
around authentication and authorization of a web page or web application and retrieve the content of the
entire SQL database. They can also use SQL injection to add, modify and delete records in the database.
• An SQL injection vulnerability may affect any website or web application that uses an SQL database such
as MySQL, Oracle, SQL Server or others.
How SQL Injection Works?
• To make an SQL injection attack, an attacker must first find vulnerable user inputs within the web
page or web application. A web page or web application that has an SQL injection vulnerability uses
such user input directly in an SQL query. The attacker can then craft the input content. Such content is often called
a malicious payload and is the key part of the attack. After the attacker sends this content, malicious SQL
commands are executed in the database.
• SQL is a query language that was designed to manage data stored in relational databases. You can use
it to access, modify and delete data. Many web applications and websites store all their data in SQL databases.
In some cases, you can also use SQL commands to run operating system commands. Therefore, a successful
SQL injection attack can have very serious consequences.
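The mechanism can be demonstrated safely on an in-memory database with Python's built-in sqlite3 module. The table, the user names and the two login functions below are hypothetical. The first function builds the query by string concatenation, as a vulnerable application would. The second uses a parameterized query, a common defence that is shown here only for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("ankita", "s3cret"), ("kavita", "pa55")])


def login_vulnerable(name, password):
    # User input is pasted directly into the SQL text.
    query = ("SELECT name FROM users WHERE name = '" + name +
             "' AND password = '" + password + "'")
    return conn.execute(query).fetchall()


def login_parameterized(name, password):
    # Placeholders keep the input as data; it is never parsed as SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ? AND password = ?",
        (name, password),
    ).fetchall()


payload = "' OR '1'='1"  # malicious payload typed into the password field

# The condition becomes: name = 'anyone' AND password = '' OR '1'='1'
print(login_vulnerable("anyone", payload))     # every row matches
print(login_parameterized("anyone", payload))  # no row matches
```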
Flow Control
• Flow control is a mechanism that regulates the flow of information among accessible objects.
• A flow between two objects obj1 and obj2 occurs when program reads values from obj1 and writes values
to the object obj2.
• The flow control checks that the information contained in one object does not get transferred to a less
protected object.
• The flow policy specifies the channels along which the information is allowed to move.
• The simple flow policy specifies two classes of information - confidential (C) and nonconfidential (N).
According to this policy, all flows are allowed except the flow of information from class C to class N.
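The simple two-class flow policy reduces to a single comparison. In the sketch below the numeric levels assigned to the classes and the function name flow_allowed are assumptions made for illustration.

```python
# Higher number = more protected class.
LEVEL = {"N": 0, "C": 1}


def flow_allowed(source_class, target_class):
    """A flow is allowed unless it moves information to a less protected class."""
    return LEVEL[source_class] <= LEVEL[target_class]


for src in ("N", "C"):
    for dst in ("N", "C"):
        print(src, "->", dst, flow_allowed(src, dst))
```

Only the flow from C to N is rejected; the other three combinations are permitted.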
Covert Channel
• A covert channel is a type of attack that creates a capability to transfer information objects between
processes that are not supposed to be allowed to communicate.
• The covert channel violates the security policy.
• The covert channel allows information to pass from a higher classification level to a lower classification level
through improper means.
• Security experts believe that one way to avoid covert channels is for programmers not to have
access to sensitive data.
Encryption and Public Key Infrastructures
The sender applies the encryption algorithm and the recipient applies the decryption algorithm. Both the sender
and the receiver must agree on this algorithm for any meaningful communication. The algorithm basically
takes one text as input and produces another as the output. Therefore, the algorithm contains the intelligence
for transforming the message.
For example: Suppose we want to send a message by e-mail and we wish that nobody except the intended friend
should be able to understand it. The message can then be encoded using some rule. For example, suppose
the letters A to Z are encoded as follows -
The last three letters are placed in reverse order (Z, Y, X) and then the first three letters are placed in order
(A, B, C), so that A, B, C are encoded as Z, Y, X and D, E, F are encoded as A, B, C. Continuing this logic with
the next three letters from each end, all the letters A to Z are encoded. Now if we write the message
"SEND SOME MONEY"
it will be
QBSA QRTB TRSBN
This coded message is called cipher text.
There are a variety of coding methods that can be used.
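The substitution rule described above can be reproduced in code. The cipher alphabet built here is a reconstruction from the verbal description (three letters from the end in reverse, then three from the start in order, repeated); it is an assumption, although it does reproduce the sample cipher text. The function names are hypothetical.

```python
import string


def build_cipher_alphabet():
    """Alternate 3 letters from the end (reversed) with 3 from the start."""
    letters = string.ascii_uppercase
    front, back = 0, len(letters)
    out = []
    while front < back:
        take = min(3, back - front)
        out.extend(reversed(letters[back - take:back]))  # e.g. Z, Y, X
        back -= take
        take = min(3, back - front)
        out.extend(letters[front:front + take])          # e.g. A, B, C
        front += take
    return "".join(out)


CIPHER = build_cipher_alphabet()
ENCODE = str.maketrans(string.ascii_uppercase, CIPHER)
DECODE = str.maketrans(CIPHER, string.ascii_uppercase)


def encrypt(plaintext):
    # Characters outside A-Z (such as spaces) are left unchanged.
    return plaintext.upper().translate(ENCODE)


def decrypt(ciphertext):
    return ciphertext.translate(DECODE)


print(CIPHER)                      # ZYXABCWVUDEFTSRGHIQPOJKLNM
print(encrypt("SEND SOME MONEY"))  # QBSA QRTB TRSBN
print(decrypt("QBSA QRTB TRSBN"))  # SEND SOME MONEY
```

A fixed letter-for-letter substitution like this is easy to break and serves only to illustrate the terms plaintext and cipher text.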
Types of Cryptography
There are two types of encryption schemes based on the keys used for encryption and decryption.
1. Symmetric key encryption: It is also known as secret key encryption. In this method, only one key is
used. The same key is shared by sender and receiver for encryption and decryption of messages. Hence both
parties must agree upon the key before any transmission begins and nobody else should know about it. At
the sender's end, the key is used to change the original message into an encoded form. At the receiver's end,
using the same key, the encoded message is decrypted and the original message is obtained. The Data Encryption
Standard (DES) uses this approach. The problem with this approach is that of key agreement and distribution.
2. Asymmetric key encryption: It is also known as public key encryption. In this method, different keys
are used. One key is used for encryption and the other key must be used for decryption. No other key can decrypt
the message, not even the original key used for encryption.
One of the two keys is known as the public key and the other is the private key. Suppose there are two users X
and Y, and X wants to send a message to Y.
• Each user makes its public key known to the other, but keeps its private key to itself. The private key of X
is known to X only, and the private key of Y is known to Y only.
• X must know Y's public key in order to encrypt the message, and Y uses its own private key to decrypt it.
Digital Signature
A digital signature is a mathematical scheme for demonstrating the authenticity of a digital message or
document. If the recipient gets a message with digital signature then he believes that the message was created
by a known sender.
Digital signatures are commonly used for software distribution, financial transactions, and in other cases
where it is important to detect forgery or tampering.
When X and Y want to communicate with each other -
1. X encrypts the original plaintext message into ciphertext by using Y's public key.
2. Then X executes an algorithm on the original plaintext to calculate a message digest, also known as a hash.
This algorithm takes the original plaintext in binary format and applies the hashing algorithm. As output, a
small string of binary digits is created. The hashing algorithm is public and anyone can use it. The most
popular message digest algorithms have been MD5 and SHA-1; both are now considered weak against collision
attacks, so newer algorithms such as SHA-256 are preferred. X then encrypts the message digest using its
own private key. The encrypted message digest is X's digital signature.
3. X now combines the ciphertext and its digital signature (i.e. the encrypted message digest) and sends them over
the network to Y.
4. Y receives the ciphertext and X's digital signature. Y has to decrypt both of these. Y first decrypts the
ciphertext back to plaintext. For this, it uses its own private key. Thus, Y gets the message itself in a secure
manner.
5. Now, to ensure that the message has come from the intended sender, Y takes X's digital signature and
decrypts it. This gives Y the message digest as it was generated by X. X had encrypted the message digest
to form the digital signature using its own private key. Therefore, Y uses X's public key for decrypting the
digital signature.
6. The hash algorithm used to generate the message digest is public. Therefore, Y can apply it to the decrypted
plaintext to compute the message digest itself.
7. Now there are two message digests, one created by X and the other by Y. Y simply compares
the two message digests. If the two match, Y can be sure that the message indeed came from X and not from
someone else.
Thus in this scheme confidentiality, authenticity and message integrity are assured.
The other important feature supported by digital signature is non-repudiation. That is, a sender cannot refuse
having sent a message. Since the digital signature requires the private key of the sender, once a message is
digitally signed, it can be legally proven that the sender had indeed sent the message.
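The sign and verify steps can be sketched with textbook RSA on very small numbers. Everything here is an illustrative assumption: the primes 61 and 53, the exponent 17, the use of SHA-256 from Python's hashlib, and the reduction of the digest modulo n. Real systems use keys of 2048 bits or more together with a padding scheme, and this sketch covers only the signature, not the encryption of the message with Y's public key. The call pow(e, -1, phi) needs Python 3.8 or later.

```python
import hashlib

# Toy key pair for sender X (far too small for real use).
p, q = 61, 53
n = p * q                # 3233, part of both keys
phi = (p - 1) * (q - 1)  # 3120
e = 17                   # public exponent: public key is (e, n)
d = pow(e, -1, phi)      # private exponent: 2753


def digest(message: bytes) -> int:
    """Message digest, reduced modulo n so that it fits the toy key."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n


def sign(message: bytes) -> int:
    # X encrypts the digest with its private key.
    return pow(digest(message), d, n)


def verify(message: bytes, signature: int) -> bool:
    # Y decrypts the signature with X's public key and compares digests.
    return pow(signature, e, n) == digest(message)


message = b"SEND SOME MONEY"
signature = sign(message)
print(verify(message, signature))            # True: the digests match
print(verify(message, (signature + 1) % n))  # False: the signature was altered
```

Since only X holds the private exponent d, a signature that verifies under X's public key could only have been produced by X, which is the basis of non-repudiation.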
Review Question
1. Explain in brief the concept of digital signature.
Challenges
Following are the challenges faced by the database security system -
Review Questions
1. Explain various challenges faced by database security system.
Q.2 What are two approaches to store a relation in the distributed database? AU: May-04
Ans.: (1) Replication: System maintains multiple copies of data, stored in different sites, for faster retrieval
and fault tolerance.
(2) Fragmentation: Relation is partitioned into several fragments stored in distinct sites.
Q.3 What are various fragmentations? State various fragmentations with example. AU: Dec.-17
Ans.: There are two types of fragmentations - Horizontal fragmentation and vertical fragmentation.
• Horizontal fragmentation: In this approach, each tuple of r is assigned to one or more fragments. If
relation R is fragmented into fragments r1 and r2, then to bring these fragments back to R we must use the
union operation. That means R = r1 ∪ r2
• Vertical fragmentation: In this approach, the relation r is fragmented based on one or more columns. If
relation R is fragmented into r1 and r2 fragments using vertical fragmentation, then to bring these fragments
back to the original relation R we must use the join operation. That means R = r1 ⋈ r2