
Module 5 Database Management System-BCS403

Chapter 1

Concurrency Control in Database

DBMS Concurrency Control

Concurrency control is the management procedure required for controlling the concurrent
execution of operations on a database. Before discussing concurrency control, we should first
understand concurrent execution.

Concurrent Execution in DBMS


o In a multi-user system, several users can access and work on the same database at the same
time. This is known as concurrent execution of the database: the same database is used
simultaneously by different users on a multi-user system.
o While working on database transactions, multiple users often need to use the database to
perform different operations, and in that case the transactions are executed concurrently.
o This simultaneous execution must be performed in an interleaved manner, and no operation
should affect the other executing operations, so that the consistency of the database is
maintained. Interleaving the operations of concurrent transactions, however, raises several
challenging problems that need to be solved.

Concurrency Control
Concurrency control is the mechanism required for controlling and managing the concurrent
execution of database operations, thereby avoiding inconsistencies in the database.
To maintain the consistency of the database under concurrency, we have the concurrency control protocols.
Concurrency Control Protocols
The concurrency control protocols ensure the atomicity, consistency, isolation,
durability and serializability of the concurrent execution of the database transactions. Therefore,
these protocols are categorized as:
o Lock Based Concurrency Control Protocol

o Time Stamp Concurrency Control Protocol

o Validation Based Concurrency Control Protocol

SUNIL G L, Asst. Professor, Dept of CSE(DS), RNSIT. Page 1



Lock-Based Protocol

In this type of protocol, a transaction cannot read or write a data item until it acquires an
appropriate lock on it. There are two types of locks:

1. Shared lock:

o It is also known as a read-only lock. With a shared lock, the data item can only be read by the
transaction.
o It can be shared between transactions, because while a transaction holds a shared lock, it
cannot update the data item.

2. Exclusive lock:

o With an exclusive lock, the data item can be both read and written by the transaction.
o This lock is exclusive: multiple transactions cannot modify the same data item
simultaneously.

There are four types of lock protocols available:


1. Simplistic lock protocol
It is the simplest way of locking data during a transaction. Simplistic lock-based protocols require
every transaction to obtain a lock on the data before performing an insert, delete or update on it.
The data item is unlocked after the transaction completes.
2. Pre-claiming Lock Protocol
o Pre-claiming lock protocols evaluate the transaction to list all the data items on which it
needs locks.
o Before initiating execution of the transaction, it requests the DBMS for locks on all those
data items.
o If all the locks are granted, the protocol allows the transaction to begin. When the
transaction completes, it releases all the locks.
o If any lock is not granted, the transaction rolls back and waits until all the locks are
granted.
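As a rough sketch in Python (the lock table and transaction names here are illustrative assumptions, not part of any real DBMS API), pre-claiming can be expressed as:

```python
# Hypothetical sketch of the pre-claiming lock protocol: all locks are
# requested before execution begins, and all are released on completion.

lock_table = {}  # data item -> transaction currently holding its lock

def pre_claim(txn, items):
    """Grant the transaction only if every requested lock is free."""
    if any(item in lock_table for item in items):
        return False            # at least one lock unavailable: roll back and wait
    for item in items:
        lock_table[item] = txn  # acquire all locks before execution begins
    return True

def release_all(txn):
    """On completion, the transaction releases every lock it holds."""
    for item in [i for i, t in lock_table.items() if t == txn]:
        del lock_table[item]
```

Because a transaction never holds some locks while waiting for others, this scheme avoids deadlock at the cost of reduced concurrency.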


3. Two-phase locking (2PL)


o The two-phase locking protocol divides the execution of a transaction into three parts.
o In the first part, when the transaction starts executing, it seeks permission for the locks it
requires.
o In the second part, the transaction acquires all the locks. The third part starts as soon as the
transaction releases its first lock.
o In the third part, the transaction cannot demand any new locks; it only releases the locks it has
acquired.

There are two phases of 2PL:


Growing phase: In the growing phase, a new lock on a data item may be acquired by the
transaction, but none can be released.
Shrinking phase: In the shrinking phase, existing locks held by the transaction may be released,
but no new locks can be acquired.
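The growing/shrinking discipline can be sketched as follows; the class and method names are hypothetical, chosen only to illustrate the rule, not taken from a real lock manager:

```python
class TwoPhaseLocking:
    """Minimal sketch of the 2PL rule for one transaction: once it
    releases any lock (entering the shrinking phase), no new lock
    may be acquired."""

    def __init__(self):
        self.held = set()
        self.shrinking = False

    def lock(self, item):
        if self.shrinking:
            return False           # 2PL violation: no new locks after first release
        self.held.add(item)        # growing phase: acquire
        return True

    def unlock(self, item):
        self.shrinking = True      # the first release ends the growing phase
        self.held.discard(item)
```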


If lock conversion is allowed, the following conversions can happen:

1. Upgrading a lock (from S(a) to X(a)) is allowed only in the growing phase.

2. Downgrading a lock (from X(a) to S(a)) must be done only in the shrinking phase.

Example:

The following example shows how locking and unlocking work with 2PL.

Transaction T1:
o Growing phase: from step 1-3

o Shrinking phase: from step 5-7


o Lock point: at 3

Transaction T2:

o Growing phase: from step 2-6


o Shrinking phase: from step 8-9
o Lock point: at 6

4. Strict Two-phase locking (Strict-2PL)


o The first phase of Strict-2PL is similar to 2PL. In the first phase, after acquiring all the locks,
the transaction continues to execute normally.
o The only difference between 2PL and Strict-2PL is that Strict-2PL does not release a lock
immediately after using it.


o Instead, Strict-2PL waits until the whole transaction commits, and then it releases all the locks
at once.
o Thus the Strict-2PL protocol does not have a gradual shrinking phase of lock release.

Timestamp Ordering Protocol


o The timestamp ordering protocol orders transactions based on their timestamps. The order of
the transactions is simply the ascending order of their creation times.
o An older transaction has higher priority, so it executes first. To determine the timestamp of a
transaction, this protocol uses the system time or a logical counter.
o Lock-based protocols manage the order between conflicting pairs of transactions at execution
time, whereas timestamp-based protocols start working as soon as a transaction is created.
o Let's assume there are two transactions T1 and T2. Suppose transaction T1 entered the system
at time 007 and transaction T2 entered at time 009. T1 has the higher priority, so it executes
first, because it entered the system first.
o The timestamp ordering protocol also maintains the timestamps of the last 'read' and 'write'
operation on each data item.

Basic Timestamp ordering protocol works as follows:

1. Whenever a transaction Ti issues a Read(X) operation, check the following conditions:

o If W_TS(X) > TS(Ti), the operation is rejected and Ti is rolled back.

o If W_TS(X) <= TS(Ti), the operation is executed, and R_TS(X) is set to the larger of
R_TS(X) and TS(Ti).


2. Whenever a transaction Ti issues a Write(X) operation, check the following conditions:

o If TS(Ti) < R_TS(X), the operation is rejected and Ti is rolled back.

o If TS(Ti) < W_TS(X), the operation is rejected and Ti is rolled back; otherwise the operation
is executed and W_TS(X) is set to TS(Ti).

Where,

TS(Ti) denotes the timestamp of the transaction Ti.

R_TS(X) denotes the Read time-stamp of data-item X.

W_TS(X) denotes the Write time-stamp of data-item X.
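A minimal sketch of these checks, assuming the read and write timestamps are kept in simple per-item tables (an illustrative layout, not a real DBMS implementation):

```python
# Basic timestamp ordering: each data item carries its largest read
# timestamp and largest write timestamp; a rejected operation means
# the issuing transaction is rolled back.

read_ts = {}   # X -> largest TS of any transaction that read X
write_ts = {}  # X -> largest TS of any transaction that wrote X

def read(ts, x):
    if write_ts.get(x, 0) > ts:
        return False                        # reject: a younger txn already wrote X
    read_ts[x] = max(read_ts.get(x, 0), ts)
    return True

def write(ts, x):
    if read_ts.get(x, 0) > ts or write_ts.get(x, 0) > ts:
        return False                        # reject and roll the transaction back
    write_ts[x] = ts
    return True
```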

Advantages and Disadvantages of TO protocol:


o The TO protocol ensures serializability, since the precedence graph contains edges only from
older to newer transactions and therefore can contain no cycles.
o The TO protocol ensures freedom from deadlock, which means no transaction ever waits.
o But the schedule may not be recoverable and may not even be cascade-free.

Validation Based Protocol

The validation-based protocol is also known as the optimistic concurrency control technique. In
this protocol, a transaction is executed in the following three phases:

1. Read phase: In this phase, transaction T reads the values of various data items and stores
them in temporary local variables. It performs all its write operations on the temporary
variables, without updating the actual database.
2. Validation phase: In this phase, the temporary values are validated against the actual data
to check whether they violate serializability.
3. Write phase: If the transaction passes validation, the temporary results are written to the
database; otherwise the transaction is rolled back.

Here each phase has the following different timestamps:

Start(Ti): It contains the time when Ti started its execution.

Validation (Ti): It contains the time when Ti finishes its read phase and starts its validation phase.

Finish(Ti): It contains the time when Ti finishes its write phase.


o This protocol determines the timestamp used to serialize a transaction from the timestamp of
its validation phase, since that is the phase which actually decides whether the transaction
will commit or roll back.
o Hence TS(T) = Validation(T).
o Serializability is determined during the validation process; it cannot be decided in
advance.
o While executing transactions, this approach allows a greater degree of concurrency with
fewer conflicts.
o Thus fewer transactions need to be rolled back.
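A hedged sketch of the validation test, assuming each transaction is represented by a record of its Start/Validation/Finish times and its read and write sets (hypothetical structures, chosen for illustration):

```python
# Optimistic validation: Ti passes if every committed Tj that overlapped
# it either finished before Ti started, or finished before Ti entered
# validation without writing anything Ti read.

def validate(ti, committed):
    """Return True if transaction record ti may commit, given the
    records of already-committed transactions."""
    for tj in committed:
        if tj["finish"] < ti["start"]:
            continue                                   # no overlap at all
        if tj["finish"] < ti["validation"] and not (tj["writes"] & ti["reads"]):
            continue                                   # overlap is harmless
        return False                                   # conflict: roll Ti back
    return True
```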

Thomas write Rule

The Thomas write rule provides the guarantee of a serializable order for the protocol and
improves on the basic timestamp ordering algorithm.

The basic Thomas write rules are as follows:

o If TS(T) < R_TS(X), then transaction T is aborted and rolled back, and the operation is rejected.
o If TS(T) < W_TS(X), then do not execute the W_item(X) operation of the transaction and
continue processing (the obsolete write is simply ignored).
o If neither condition 1 nor condition 2 occurs, then execute the WRITE operation of
transaction T and set W_TS(X) to TS(T).


o If we use the Thomas write rule, then some serializable schedules can be permitted that are not
conflict serializable, as illustrated by the schedule in the figure below:


Figure: A Serializable Schedule that is not Conflict Serializable

In the above figure, T1's read of the data item precedes T1's write of the same item; this schedule
is not conflict serializable. The Thomas write rule observes that T2's write is never seen by any
transaction. If we delete the write operation in transaction T2, the conflict-serializable schedule
shown in the figure below is obtained.

Figure: A Conflict Serializable Schedule
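The three rules of the Thomas write rule can be sketched as follows; the timestamp tables and the string return values are illustrative assumptions, not part of any real system:

```python
# Thomas-write-rule variant of the write check: an obsolete write is
# silently skipped instead of causing a rollback.

read_ts, write_ts = {}, {}

def thomas_write(ts, x, value, db):
    if read_ts.get(x, 0) > ts:
        return "abort"            # a younger transaction already read X: roll back
    if write_ts.get(x, 0) > ts:
        return "ignore"           # obsolete write: skip it, keep processing
    db[x] = value                 # otherwise perform the write
    write_ts[x] = ts
    return "write"
```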


Granularity of Data Items and Multiple Granularity Locking

All concurrency control techniques assume that the database is formed of a number of named data
items. A database item could be chosen to be one of the following:

o A database record

o A field value of a database record

o A disk block

o A whole file

o The whole database

The granularity can affect the performance of concurrency control and recovery. Below, we first
discuss some of the tradeoffs regarding the choice of granularity level used for locking, and then
a multiple granularity locking scheme, where the granularity level (size of the data item) may be
changed dynamically.


1. Granularity Level Considerations for Locking

The size of data items is often called the data item granularity. Fine granularity refers to small
item sizes, whereas coarse granularity refers to large item sizes. Several tradeoffs must be
considered in choosing the data item size. We will discuss data item size in the context of locking,
although similar arguments can be made for other concurrency control techniques.

First, notice that the larger the data item size is, the lower the degree of concurrency permitted. For
example, if the data item size is a disk block, a transaction T that needs to lock a record B must lock
the whole disk block X that contains B because a lock is associated with the whole data item
(block). Now, if another transaction S wants to lock a different record C that happens to reside in
the same block X in a conflicting lock mode, it is forced to wait. If the data item size were a single
record, transaction S would be able to proceed, because it would be locking a different data item
(record).

On the other hand, the smaller the data item size is, the larger the number of items in the database.
Because every item is associated with a lock, the system will have a larger number of active locks
to be handled by the lock manager. More lock and unlock operations will be performed, causing a
higher overhead. In addition, more storage space will be required for the lock table. For timestamps,
storage is required for the read_TS and write_TS for each data item, and there will be similar
overhead for handling a large number of items.

Given the above tradeoffs, an obvious question can be asked: What is the best item size? The
answer is that it depends on the types of transactions involved. If a typical transaction accesses a
small number of records, it is advantageous to have the data item granularity be one record. On the
other hand, if a transaction typically accesses many records in the same file, it may be better to have
block or file granularity so that the transaction will consider all those records as one (or a few) data
items.

2. Multiple Granularity

Let's start by understanding the meaning of granularity.

Granularity: It is the size of the data item that is allowed to be locked.


Multiple Granularity:
o It can be defined as hierarchically breaking up the database into blocks that can be locked.
o The multiple granularity protocol enhances concurrency and reduces lock overhead.
o It keeps track of what to lock and how to lock.
o It makes it easy to decide whether to lock or unlock a data item. This type of hierarchy can
be represented graphically as a tree.

For example: Consider a tree which has four levels of nodes.

o The first (highest) level represents the entire database.

o The second level represents nodes of type area. The database consists of exactly these
areas.

o Each area has child nodes known as files. No file can be present in more than one area.

o Finally, each file contains child nodes known as records. A file contains exactly those
records that are its child nodes, and no record is present in more than one file.
o Hence, the levels of the tree, starting from the top level, are as follows:

• Database
• Area
• File
• Record


In this example, the highest level represents the entire database, and the levels below it are area,
file, and record.
There are three additional lock modes with multiple granularity:

Intention Lock Modes

Intention-shared (IS): It indicates explicit locking at a lower level of the tree, but only with shared
locks.
Intention-exclusive (IX): It indicates explicit locking at a lower level with exclusive or shared
locks.

Shared & intention-exclusive (SIX): The node is locked in shared mode, and some lower-level
node is locked in exclusive mode by the same transaction.
Compatibility Matrix with Intention Lock Modes: The table below describes the compatibility
matrix for these lock modes (Yes = the two modes may be held on the same node by different
transactions):

        IS    IX    S     SIX   X
IS      Yes   Yes   Yes   Yes   No
IX      Yes   Yes   No    No    No
S       Yes   No    Yes   No    No
SIX     Yes   No    No    No    No
X       No    No    No    No    No

• The protocol uses the intention lock modes to ensure serializability. A transaction T1 that
attempts to lock a node must follow these rules:
• Transaction T1 must observe the lock-compatibility matrix.
• T1 must lock the root of the tree first, and it can lock it in any mode.
• T1 can lock a node in S or IS mode only if it currently holds the parent of that node in IS or
IX mode.
• T1 can lock a node in X, SIX, or IX mode only if it currently holds the parent of that node in
IX or SIX mode.
• T1 can lock a node only if it has not previously unlocked any node (the two-phase rule is
observed).
• T1 can unlock a node only if it currently holds no locks on any of that node's children.
• Observe that in multiple-granularity, the locks are acquired in top-down order, and locks must
be released in bottom-up order.
• If transaction T1 reads record Ra9 in file Fa, then T1 needs to lock the database, area A1 and
file Fa in IS mode. Finally, it needs to lock Ra9 in S mode.

• If transaction T2 modifies record Ra9 in file Fa, then it can do so after locking the database,
area A1 and file Fa in IX mode. Finally, it needs to lock Ra9 in X mode.

• If transaction T3 reads all the records in file Fa, then T3 needs to lock the database and area
A1 in IS mode. At last, it needs to lock Fa in S mode.
• If transaction T4 reads the entire database, then T4 needs to lock the database in S mode.
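A small sketch of the compatibility check behind these examples; the dictionary encoding of the matrix is an assumption for illustration (True means the two modes can be held on the same node by different transactions):

```python
# Compatibility matrix for the five multiple-granularity lock modes.
# Only one ordering of each pair is stored; compatible() checks both.

COMPATIBLE = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,
    ("IS", "SIX"): True, ("IS", "X"): False,
    ("IX", "IX"): True,  ("IX", "S"): False,  ("IX", "SIX"): False,
    ("IX", "X"): False,
    ("S", "S"): True,    ("S", "SIX"): False, ("S", "X"): False,
    ("SIX", "SIX"): False, ("SIX", "X"): False,
    ("X", "X"): False,
}

def compatible(m1, m2):
    """True if lock modes m1 and m2 can coexist on the same node."""
    return COMPATIBLE.get((m1, m2), COMPATIBLE.get((m2, m1)))
```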


Chapter 2

NOSQL Database and Big Data Storage Systems


Introduction to Big Data

Big Data refers to vast and complex datasets that cannot be effectively managed, processed, or
analysed using traditional data processing tools and methods. These datasets typically exhibit three
main characteristics, often referred to as the three Vs:
• Volume: Big Data involves massive amounts of data, often ranging from terabytes to petabytes
or more. This data can come from various sources, including social media, sensors, devices, and
transaction records.
• Velocity: Data is generated at an unprecedented speed. For example, social media platforms
generate millions of posts, comments, and interactions every minute. This real-time data influx
requires rapid processing and analysis.
• Variety: Big Data is heterogeneous and can include structured data (e.g., databases), semi-
structured data (e.g., JSON or XML), and unstructured data (e.g., text, images, videos).
Handling this diverse data is a significant challenge.
Additionally, two more Vs are often considered:
• Veracity: This refers to the trustworthiness or reliability of the data. Big Data may include
noisy, incomplete, or inaccurate information.
• Value: The ultimate goal of handling Big Data is to extract valuable insights, make informed
decisions, and derive business value from the data.

Introduction to NoSQL Databases

NoSQL (which stands for "Not Only SQL") databases are a family of database management
systems designed to handle the unique challenges posed by Big Data. They offer a departure from
traditional relational databases (SQL databases) by providing greater scalability, flexibility, and
performance. Here are some key characteristics and types of NoSQL databases:
1. Schema-less: Unlike SQL databases that require predefined schemas and rigid data structures,
NoSQL databases are typically schema-less. This means you can store data without defining its
structure in advance, making them suitable for handling unstructured or semi-structured data.


2. Scalability: NoSQL databases are often designed to scale horizontally, meaning you can add
more servers or nodes to handle increased data volumes and traffic. This is crucial for
accommodating the high volume and velocity of Big Data.
3. Data Models: There are several types of NoSQL databases, each tailored to specific use cases:
• Document-based: Stores data in flexible, semi-structured documents (e.g., MongoDB,
Couchbase).
• Key-Value: Simplest NoSQL model, where data is stored as key-value pairs (e.g., Redis,
Amazon DynamoDB).

• Column-family: Suitable for wide-column stores, often used for time-series data (e.g., Apache
Cassandra, HBase).
• Graph databases: Optimized for managing relationships and graph-like data structures (e.g.,
Neo4j, Amazon Neptune).

Big Data and NoSQL Integration

The synergy between Big Data and NoSQL databases is evident in various ways:

1. Scalability: NoSQL databases can horizontally scale to accommodate the massive volumes of
data generated in Big Data environments.
2. Schema Flexibility: NoSQL databases are well-suited for storing and managing the diverse
data types found in Big Data, whether structured, semi-structured, or unstructured.
3. Real-time Processing: Big Data platforms like Apache Hadoop and Apache Spark often
integrate with NoSQL databases for real-time data processing, analytics, and machine learning.
4. High Throughput: NoSQL databases are capable of handling the high velocity of data
ingestion and queries, making them ideal for real-time and streaming data applications.
5. Polyglot Persistence: In many Big Data architectures, organizations use a combination of
NoSQL and SQL databases to achieve polyglot persistence, where each database type is used
for its specific strengths.
Challenges and Considerations

While Big Data and NoSQL databases offer significant benefits, they also come with challenges,
including data consistency, security, and the need for specialized skills in managing and querying


these databases. Organizations must carefully evaluate their specific use cases and requirements
before adopting Big Data and NoSQL solutions.
In summary, Big Data and NoSQL databases are integral components of modern data architectures,
enabling organizations to store, process, and analyze vast and diverse datasets with the flexibility,
scalability, and performance required to derive meaningful insights and value from their data.
NoSQL Databases

We know that MongoDB is a NoSQL database, so it is necessary to understand NoSQL databases
in order to understand MongoDB thoroughly.

What is NoSQL Database


Databases can be divided into three types:
1. RDBMS (Relational Database Management System)

2. OLAP (Online Analytical Processing)

3. NoSQL (a more recently developed class of databases)

The term NoSQL database refers to a non-SQL or non-relational database.

It provides a mechanism for storing and retrieving data other than the tabular relations model used
in relational databases. A NoSQL database does not use tables for storing data. It is generally used
to store big data and for real-time web applications.
History behind the creation of NoSQL Databases
In the early 1970s, flat file systems were used. Data were stored in flat files, and the biggest
problem with flat files was that each company implemented its own format; there were no
standards. It was very difficult to store data in files and retrieve data from them because there
was no standard way to store data.
Then the relational database was created by E.F. Codd, and these databases answered the question
of having no standard way to store data. But later, relational databases ran into a problem: they
could not handle big data. This need for a database that could handle every kind of data led to the
development of NoSQL databases.
Advantages of NoSQL
o It supports a query language.


o It provides fast performance.


o It provides horizontal scalability.

The CAP Theorem

The CAP theorem, originally introduced as the CAP principle, can be used to explain some of the
competing requirements in a distributed system with replication. It is a tool used to make system
designers aware of the trade-offs while designing networked shared-data systems.
The three letters in CAP refer to three desirable properties of distributed systems with replicated
data: consistency (among replicated copies), availability (of the system for read and write
operations) and partition tolerance (in the face of the nodes in the system being partitioned by a
network fault). The CAP theorem states that it is not possible to guarantee all three of the desirable
properties – consistency, availability, and partition tolerance at the same time in a distributed
system with data replication.
The theorem states that networked shared-data systems can only strongly support two of the
following three properties:
Consistency –
Consistency means that all nodes have the same copies of a replicated data item visible to the
various transactions: a guarantee that every node in a distributed cluster returns the same, most
recent, successful write.
Consistency refers to every client having the same view of the data. There are various types of
consistency models. Consistency in CAP refers to sequential consistency, a very strong form of
consistency.
Availability –
Availability means that each read or write request for a data item will either be processed
successfully or will receive a message that the operation cannot be completed. Every non-failing
node returns a response for all the read and write requests in a reasonable amount of time.
The key word here is “every”. In simple terms, every node (on either side of a network partition)
must be able to respond in a reasonable amount of time.
Partition Tolerance –
Partition tolerance means that the system can continue operating even if the network connecting the
nodes has a fault that results in two or more partitions, where the nodes in each partition can only


communicate among each other. That means, the system continues to function and upholds its
consistency guarantees in spite of network partitions. Network partitions are a fact of life.
Distributed systems guaranteeing partition tolerance can gracefully recover from partitions once the
partition heals.
The use of the word consistency in CAP and its use in ACID do not refer to the same concept. In
CAP, consistency refers to the consistency of the values in different copies of the same data item
in a replicated distributed system. In ACID, it refers to the fact that a transaction will not violate
the integrity constraints specified on the database schema. The CAP theorem states that distributed
databases can have at most two of the three properties: consistency, availability, and partition
tolerance. As a result, database systems prioritize only two properties at a time.

The following figure represents which database systems prioritize specific properties at a given
time:

CA (Consistency and Availability) -

The system guarantees consistency and availability but cannot tolerate network partitions; in
practice this applies to systems that are not distributed across a partition-prone network.
Example databases: traditional single-node relational systems such as MySQL and PostgreSQL.


AP (Availability and Partition Tolerance) -

The system prioritizes availability over consistency and can respond with possibly stale data. The
system can be distributed across multiple nodes and is designed to operate reliably even in the face
of network partitions. Example databases: Cassandra, CouchDB, Riak, Voldemort, Amazon
DynamoDB.

CP (Consistency and Partition Tolerance) -

The system prioritizes consistency over availability and responds with the latest updated data. The
system can be distributed across multiple nodes and is designed to operate reliably even in the face
of network partitions. Example databases: Apache HBase, MongoDB, Redis, Google Cloud
Spanner.
It’s important to note that these database systems may have different configurations and settings that
can change their behavior with respect to consistency, availability, and partition tolerance.
Therefore, the exact behavior of a database system may depend on its configuration and usage.
For example, consider Neo4j, a graph database; the CAP theorem still applies. Neo4j prioritizes
consistency and partition tolerance over availability, which means that in the event of a network
partition or failure, Neo4j will sacrifice availability to maintain consistency.

Document Data Base


A document-based database, aka a document store, stores information within XML, YAML, JSON
or binary documents such as BSON. To organize these documents into one whole, a specific key
is assigned to each document. This characteristic makes document stores similar to key-value
stores.

Even though document stores do not have a unified schema, they are usually organized in order to
easily use and eventually analyze data. This means they are structured, to an extent. Seeing that
each object is commonly stored in a single document, there is no need for defining relationships
between documents.

These documents are in no way similar to tables of a relational database; they do not have a set
number of fields, slots, etc. and there are no empty spaces -- the missing info is simply omitted
rather than there being an empty slot left for it. Data can be added, edited, removed and queried.

The keys assigned to each document are unique identifiers required to access data within the
database, usually a path, string or Uniform Resource Identifier. IDs tend to be indexed in the
database to speed up data retrieval.


The content of documents within a document store is classified using metadata. Due to this feature,
the database "understands" what class of information it holds -- whether a field contains addresses,
phone numbers or social security numbers, for example. For improved efficiency and user
experience, document stores have a query language, which allows querying documents based on
the metadata or the content of the document. This allows you to retrieve all documents that contain
the same value within a certain field. Amazon has provided the following terminology
comparison between SQL and a document database, MongoDB. The following list helps draw a
parallel between the two types of databases:

• SQL: Table, Row, Column, Primary key, Index, View, Nested table or object, Array
• MongoDB: Collection, Document, Field, ObjectId, Index, View, Embedded document,
Array
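A minimal sketch of this parallel using plain Python dictionaries as "documents" (the collection, field names, and `find` helper are hypothetical, chosen only to illustrate the model):

```python
# Two "documents" in the same hypothetical collection: because the store
# is schemaless, the second record simply omits fields instead of
# storing NULLs for them.

customers = [
    {"_id": 1, "name": "Asha", "phone": "555-0100",
     "orders": [{"item": "book", "qty": 2}]},   # embedded document in an array
    {"_id": 2, "name": "Ravi"},                  # no phone, no orders: fine
]

def find(collection, field, value):
    """Metadata-style query: return documents whose field matches value."""
    return [doc for doc in collection if doc.get(field) == value]
```

Note how `_id` plays the role of the primary key, the embedded `orders` list stands in for a nested table, and a missing field is simply absent rather than NULL.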
Advantages
One of the top priorities in any business is making the most of the time and resources
given, increasing overall efficiency. Selecting the right database based on its purpose and the type
of data collected is a crucial step. The following features of a document store could make it the right
choice for you:
• Flexibility. Document stores have the famous advantage of all NoSQL databases, which
is a flexible structure. As mentioned previously, documents of one database do not
require consistency. They do not have to be of the same type, nor do they have to be
structured the same.

• Easy to update. Any new piece of information, when added to a relational database, has
to be added to all data sets to maintain the unified structure within a table of a relational
database. With document stores, you can add new pieces of information easily without
having to add them to all existing data sets.

• Improved performance. Rather than pulling data from multiple related tables, you can
find everything you need within one document. With everything kept in a single
location, it is much faster to reach and retrieve the data.

Disadvantages
Regardless of the size of a database, by virtue of being only semistructured, NoSQL databases are
simple when compared to relational databases. If you jeopardize the simplicity of a document store,

you will also jeopardize the previously mentioned improved performance. You can create
references between documents of a document store by interlinking them, but doing so can create
complex systems and deprive you of fast performance.
If you have a large-volume database and you would like to create a network of mutually referenced
data, you are better off finding a way of mapping it and fitting it into a relational database.

A document is a record in a document database. A document typically stores information about one
object and any of its related metadata. Documents store data in field-value pairs. The values can be
a variety of types and structures, including strings, numbers, dates, arrays, or objects. Documents
can be stored in formats like JSON, BSON, and XML.

Below is a JSON document that stores information about a user named Tom.

{
  "_id": 1,
  "first_name": "Tom",
  "email": "[email protected]",
  "cell": "765-555-5555",
  "likes": [
    "fashion",
    "spas",
    "shopping"
  ],
  "businesses": [
    {
      "name": "Entertainment 1080",
      "partner": "Jean",
      "status": "Bankrupt",
      "date_founded": {
        "$date": "2012-05-19T04:00:00Z"
      }
    },
    {
      "name": "Swag for Tweens",
      "date_founded": {
        "$date": "2012-11-01T04:00:00Z"
      }
    }
  ]
}

Collections: A collection is a group of documents. Collections typically store documents that have
similar contents. Not all documents in a collection are required to have the same fields, because


document databases have a flexible schema. Note that some document databases provide schema
validation, so the schema can optionally be locked down when needed.

Continuing with the example above, the document with information about Tom could be stored in a
collection named users. More documents could be added to the users collection in order to store
information about other users. For example, the document below that stores information about
Donna could be added to the users collection.

{
  "_id": 2,
  "first_name": "Donna",
  "email": "[email protected]",
  "spouse": "Joe",
  "likes": [
    "spas",
    "shopping",
    "live tweeting"
  ],
  "businesses": [
    {
      "name": "Castle Realty",
      "status": "Thriving",
      "date_founded": {
        "$date": "2013-11-21T04:00:00Z"
      }
    }
  ]
}

Note that the document for Donna does not contain the same fields as the document for Tom.
The users collection is leveraging a flexible schema to store the information that exists for each
user.

CRUD operations:

Document databases typically have an API or query language that allows developers to execute the
CRUD (create, read, update, and delete) operations.
• Create: Documents can be created in the database. Each document has a unique identifier.

• Read: Documents can be read from the database. The API or query language allows
developers to query for documents using their unique identifiers or field values. Indexes can
be added to the database in order to increase read performance.
• Update: Existing documents can be updated — either in whole or in part.

• Delete: Documents can be deleted from the database.
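The four operations above can be sketched with an in-memory stand-in for a document collection — a plain Python dict mapping each document's unique `_id` to the document itself. This is an illustration only, not a real database API; all names here are hypothetical.

```python
# Minimal sketch of CRUD against an in-memory "collection":
# a dict mapping each document's unique _id to the document.
collection = {}

def create(doc):
    # Create: every document is stored under its unique identifier.
    collection[doc["_id"]] = doc

def read(field, value):
    # Read: query documents by a field value (an index would speed this up).
    return [d for d in collection.values() if d.get(field) == value]

def update(doc_id, changes):
    # Update: modify an existing document, in whole or in part.
    collection[doc_id].update(changes)

def delete(doc_id):
    # Delete: remove the document from the collection.
    del collection[doc_id]

create({"_id": 1, "first_name": "Tom", "likes": ["spas"]})
create({"_id": 2, "first_name": "Donna", "likes": ["spas"]})
update(1, {"cell": "765-555-5555"})
delete(2)
print(len(read("first_name", "Tom")))  # 1
```

A real document database adds persistence, indexes, and a richer query language on top of exactly this access pattern.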

What are the key features of document databases?


Document databases have the following key features:
• Document model: Data is stored in documents (unlike other databases that store data in
structures like tables or graphs). Documents map to objects in most popular programming
languages, which allows developers to rapidly develop their applications.
• Flexible schema: Document databases have a flexible schema, meaning that not all documents
in a collection need to have the same fields. Note that some document databases support schema
validation, so the schema can be optionally locked down.
• Distributed and resilient: Document databases are distributed, which allows for horizontal
scaling (typically cheaper than vertical scaling) and data distribution. Document databases
provide resiliency through replication.
• Querying through an API or query language: Document databases have an API or query
language that allows developers to execute the CRUD operations on the database. Developers
have the ability to query for documents based on unique identifiers or field values.

What makes document databases different from relational databases?


Three key factors differentiate document databases from relational databases:


1. The intuitiveness of the data model: Documents map to the objects in code, so they are much more
natural to work with. There is no need to decompose data across tables, run expensive joins, or
integrate a separate Object Relational Mapping (ORM) layer. Data that is accessed together is stored
together, so developers have less code to write and end users get higher performance.

2. The ubiquity of JSON documents: JSON has become an established standard for data interchange
and storage. JSON documents are lightweight, language-independent, and human-readable. Documents
are a superset of all other data models so developers can structure data in the way their applications
need — rich objects, key-value pairs, tables, geospatial and time-series data, or the nodes and edges of
a graph.

3. The flexibility of the schema: A document’s schema is dynamic and self-describing, so developers
don’t need to first pre-define it in the database. Fields can vary from document to document.
Developers can modify the structure at any time, avoiding disruptive schema migrations. Some
document databases offer schema validation so you can optionally enforce rules governing document
structures.

How much easier are documents to work with than tables?


Developers commonly find working with data in documents to be easier and more intuitive than
working with data in tables. Documents map to data structures in most popular programming
languages. Developers don't have to worry about manually splitting related data across multiple tables
when storing it or joining it back together when retrieving it. They also don't need to use an ORM to
handle manipulating the data for them. Instead, they can easily work with the data directly in their
applications.


Let's take another look at a document for a user named Tom.


Users
{
  "_id": 1,
  "first_name": "Tom",
  "email": "[email protected]",
  "cell": "765-555-5555",
  "likes": [
    "fashion",
    "spas",
    "shopping"
  ],
  "businesses": [
    {
      "name": "Entertainment 1080",
      "partner": "Jean",
      "status": "Bankrupt",
      "date_founded": {
        "$date": "2012-05-19T04:00:00Z"
      }
    },
    {
      "name": "Swag for Tweens",
      "date_founded": {
        "$date": "2012-11-01T04:00:00Z"
      }
    }
  ]
}

All of the information about Tom is stored in a single document.



Now let's consider how we can store that same information in a relational database. We'll begin by
creating a table that stores the basic information about the user.

Because a user can have many likes, those values would go into a separate "Likes" table that
references the user's ID. Similarly, a user can run many businesses, so we will create a new table
named "Businesses" to store business information. The Businesses table will have a foreign key that
references the ID column in the Users table.


In this simple example, we saw that data about a user could be stored in a single document in a
document database or three tables in a relational database. When a developer wants to retrieve or
update information about a user in the document database, they can write one query with zero joins.
Interacting with the database is straightforward, and modeling the data in the database is intuitive.
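To make the contrast concrete, here is a minimal sketch (table and field names are hypothetical) of the same user stored once as a single document and once across three relational-style tables that the application must join back together:

```python
# Document model: one lookup returns the whole user.
user_doc = {
    "_id": 1, "first_name": "Tom",
    "likes": ["fashion", "spas", "shopping"],
    "businesses": [{"name": "Entertainment 1080"}],
}

# Relational model: three "tables" related by a user_id foreign key.
users = [(1, "Tom")]
likes = [(1, "fashion"), (1, "spas"), (1, "shopping")]
businesses = [(1, "Entertainment 1080")]

def relational_fetch(user_id):
    # The application must reassemble the user by joining three tables.
    (_, name), = [u for u in users if u[0] == user_id]
    return {
        "_id": user_id, "first_name": name,
        "likes": [l[1] for l in likes if l[0] == user_id],
        "businesses": [{"name": b[1]} for b in businesses if b[0] == user_id],
    }

# Same data either way — but one lookup versus three table scans and a join.
assert relational_fetch(1) == user_doc
```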

What are the relationships between document databases and other databases?

The document model is a superset of other data models, including key-value pairs, relational,
objects, graph, and geospatial.
• Key-value pairs can be modeled with fields and values in a document. Any field in a document
can be indexed, providing developers with additional flexibility in how to query the data.
• Relational data can be modeled differently (and some would argue more intuitively) by keeping
related data together in a single document using embedded documents and arrays. Related data
can also be stored in separate documents, and database references can be used to connect the
related data.
• Documents map to objects in most popular programming languages.

• Graph nodes and/or edges can be modeled as documents. Edges can also be modeled
through database references. Graph queries can be run using operations like $graphLookup.
• Geospatial data can be modeled as arrays in documents.
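As a small illustration of this superset claim, the sketch below models a key-value pair, related (embedded) data, a graph edge, and a geospatial point, each as an ordinary document; all field names are hypothetical:

```python
# Each of the other data models expressed as a plain document.
kv_as_doc = {"_id": "session:42", "value": "logged-in"}  # key-value pair

relational_as_doc = {                                    # related data embedded
    "_id": 1, "first_name": "Tom",
    "businesses": [{"name": "Entertainment 1080"}],      # instead of a joined table
}

edge_as_doc = {"_id": 7, "from": "tom", "rel": "LIKES", "to": "spas"}  # graph edge

geo_as_doc = {                                           # geospatial point
    "_id": 9,
    "location": {"type": "Point", "coordinates": [-86.9, 40.4]},
}

# Every model shares the same document shape keyed by a unique "_id".
print(all("_id" in d for d in (kv_as_doc, relational_as_doc,
                               edge_as_doc, geo_as_doc)))  # True
```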


The document model is a superset of other data models

Due to their rich data modeling capabilities, document databases are general-purpose databases that
can store data for a variety of use cases.

Why not just use JSON in a relational database?

With document databases empowering developers to build faster, most relational databases have
added support for JSON. However, simply adding a JSON data type does not bring the benefits of a
native document database. Why? Because the relational approach detracts from developer
productivity rather than improving it. These are some of the things developers have to deal with.

Proprietary Extensions

Working with documents means using custom, vendor-specific SQL functions which will not be
familiar to most developers, and which don’t work with your favorite SQL tools. Add low-level
JDBC/ODBC drivers and ORMs and you face complex development processes resulting in low
productivity.

Primitive Data Handling

Presenting JSON data as simple strings and numbers rather than the rich data types supported by
native document databases such as MongoDB makes computing, comparing, and sorting data
complex and error prone.

Poor Data Quality & Rigid Tables

Relational databases offer little to validate the schema of documents, so you have no way to apply
quality controls against your JSON data. And you still need to define a schema for your regular
tabular data, with all the overhead that comes when you need to alter your tables as your
application’s features evolve.

Low Performance

Most relational databases do not maintain statistics on JSON data, preventing the query planner
from optimizing queries against documents, and you from tuning your queries.


No native scale-out

Traditional relational databases offer no way for you to partition (“shard”) the database across
multiple instances to scale as workloads grow. Instead you have to implement sharding yourself in
the application layer, or rely on expensive scale-up systems.

What are the strengths and weaknesses of document databases?

Document databases have many strengths:


• The document model is ubiquitous, intuitive, and enables rapid software development.

• The flexible schema allows for the data model to change as an application's requirements
change.
• Document databases have rich APIs and query languages that allow developers to easily
interact with their data.
• Document databases are distributed (allowing for horizontal scaling as well as global data
distribution) and resilient.

These strengths make document databases an excellent choice for a general-purpose database.
A common weakness that people cite about document databases is that many do not support multi-
document ACID transactions. We estimate that 80%-90% of applications that leverage the
document model will not need to use multi-document transactions.

Note that some document databases like MongoDB support multi-document ACID transactions.

Visit What are ACID Transactions? to learn more about how the document model mostly eliminates
the need for multi-document transactions and how MongoDB supports transactions in the rare cases
where they are needed.


MongoDB is an open source, nonrelational database management system (DBMS) that uses
flexible documents instead of tables and rows to process and store various forms of data.
As a NoSQL database solution, MongoDB does not require a relational database management
system (RDBMS), so it provides an elastic data storage model that enables users to store and query
multivariate data types with ease. This not only simplifies database management for developers but
also creates a highly scalable environment for cross-platform applications and services.
MongoDB documents or collections of documents are the basic units of data. Formatted as BSON
(Binary JSON, a binary encoding of JavaScript Object Notation), these documents can store various types of data and be
distributed across multiple systems. Since MongoDB employs a dynamic schema design, users have
unparalleled flexibility when creating data records, querying document collections through
MongoDB aggregation and analyzing large amounts of information.

Comparing MongoDB to other databases

With so many database management solutions currently available, it can be hard to choose the right
solution for your enterprise. Here are some common solution comparisons and best use cases that
can help you decide.

MongoDB vs. MySQL

MySQL uses a structured query language to access stored data. In this format, schemas are used to
create database structures, utilizing tables as a way to standardize data types so that values are
searchable and can be queried properly. A mature solution, MySQL is useful for a variety of
situations including website databases, applications and commercial product management.

Because of its rigid nature, MySQL is preferable to MongoDB when data integrity and isolation are
essential, such as when managing transactional data. But MongoDB’s less-restrictive format and
higher performance make it a better choice, particularly when availability and speed are primary
concerns.


MongoDB vs. Cassandra


While Cassandra and MongoDB are both considered NoSQL databases, they have different
strengths. Cassandra uses a traditional table structure with rows and columns, which enables users
to maintain uniformity and durability when formatting data before it’s compiled.
Cassandra can offer an easier transition for enterprises looking for a NoSQL solution because it has
a syntax similar to SQL; it also reliably handles deployment and replication without a lot of
configuration. However, it can’t match MongoDB’s flexibility for handling structured and
unstructured data sets or its performance and reliability for mission-critical cloud applications.

MongoDB use cases

Mobile applications
MongoDB’s JSON document model lets you store back-end application data wherever you need it,
including in Apple iOS and Android devices as well as cloud-based storage solutions. This
flexibility lets you aggregate data across multiple environments with secondary and geospatial
indexing, giving developers the ability to scale their mobile applications seamlessly.
Real-time analytics
As companies scale their operations, gaining access to key metrics and business insights from large
pools of data is critical. MongoDB handles the conversion of JSON and JSON-like documents, such
as BSON, into Java objects effortlessly, making the reading and writing of data in MongoDB fast
and incredibly efficient when analyzing real-time information across multiple development
environments. This has proved beneficial for several business sectors, including government,
financial services and retail.
Content management systems
Content management systems (CMS) are powerful tools that play an important role in ensuring
positive user experiences when accessing e-commerce sites, online publications, document
management platforms and other applications and services. By using MongoDB, you can easily add
new features and attributes to your online applications and websites using a single database and
with high availability.
Enterprise Data Warehouse
The Apache Hadoop framework is a collection of open source modules, including Hadoop
Distributed File System and Hadoop MapReduce, that work with MongoDB to store, process and


analyze large amounts of data. Organizations can use MongoDB and Hadoop to perform risk
modeling, predictive analytics and real-time data processing.
MongoDB benefits

Over the years, MongoDB has become a trusted solution for many businesses that are looking for a
powerful and highly scalable NoSQL database. But MongoDB is much more than just a traditional
document-based database and it boasts a few great capabilities that make it stand out from other
DBMS.

Load balancing

As enterprises' cloud applications scale and resource demands increase, problems can arise in
securing the availability and reliability of services. MongoDB’s load-balancing sharding process
distributes large data sets across multiple virtual machines at once while still maintaining acceptable
read and write throughputs. This horizontal scaling is called sharding and it helps organizations
avoid the cost of vertical scaling of hardware while still expanding the capacity of cloud-based
deployments.
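A minimal sketch of the sharding idea, assuming a simple hash-modulo placement scheme (real systems, including MongoDB, use more sophisticated shard-key strategies so shards can be added without rebalancing everything):

```python
# Hash-based sharding sketch: each document is routed to one of several
# shards (stand-ins for servers) by hashing its shard key, so reads and
# writes for different keys spread across machines.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key):
    # hash() spreads keys; modulo picks a shard deterministically
    # within a run. Production systems use stabler schemes.
    return shards[hash(key) % NUM_SHARDS]

def put(key, doc):
    put_target = shard_for(key)
    put_target[key] = doc

def get(key):
    return shard_for(key).get(key)

for i in range(1000):
    put(f"user:{i}", {"n": i})

print(get("user:42"))               # {'n': 42}
print(sum(len(s) for s in shards))  # 1000
```

The same routing function is used for both reads and writes, which is what keeps lookups a single-shard operation.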

Ad hoc database queries

One of MongoDB’s biggest advantages over other databases is its ability to handle ad hoc queries
that don’t require predefined schemas. MongoDB databases use a query language that’s similar to
SQL databases and is extremely approachable for beginner and advanced developers alike. This
accessibility makes it easy to push, query, sort, update and export your data with common help
methods and simple shell commands.
Multilanguage support

One of the great things about MongoDB is its multilanguage support. Several versions of
MongoDB have been released and are in continuous development with driver support for popular
programming languages, including Python, PHP, Ruby, Node.js, C++, Scala, JavaScript and many
more.


MongoDB deployment and setup


Deployment involves two primary activities: installing MongoDB and creating a database.
Installing MongoDB
• Windows: To install MongoDB in a Windows environment, run Windows Server 2008 R2,
Windows Vista or later. Once you’ve decided on the type of database architecture you’ll be
using, you can download the latest version of the platform on MongoDB’s download page.
• Mac: When you install MongoDB on macOS, there are two ways you can approach it. As with
the install process for Windows-based environments, MongoDB can be installed directly from
the developer website once you’ve decided on the type of build you’ll be using. However, the
easier and more common method of installing and running MongoDB on a Mac is through the
use of the Terminal app, running Homebrew. Click here for more information on Homebrew
installations of MongoDB.
Creating a database
After installing MongoDB, you’ll need to create a directory where your data will be stored. This can
be done locally or through public or private cloud storage solutions. For more information about
getting started with MongoDB, see the official documentation for comprehensive guides, tutorials and walk-throughs.

NOSQL KEY VALUE STORES


A key-value store, or key-value database, is a type of data storage software program that stores data
as a set of unique identifiers, each of which have an associated value. This data pairing is known as
a “key-value pair.” The unique identifier is the “key” for an item of data, and a value is either the
data being identified or the location of that data.

The key could be anything, depending on restrictions imposed by the database software, but it needs
to be unique in the database so there is no ambiguity when searching for the key and its value. The
value could be anything, including a list or another key-value pair. Some database software allows
you to specify a data type for the value.


In traditional relational database design, data is stored in tables composed of rows and columns. The
database developer specifies many attributes of the data to be stored in the table upfront. This
creates significant opportunities for optimizations such as data compression and performance
around aggregations and data access, but also introduces some inflexibility.

Key-value stores, on the other hand, are typically much more flexible and offer very fast
performance for reads and writes, in part because the database is looking for a single key and is
returning its associated value rather than performing complex aggregations.

What does a key-value pair mean?


A key-value pair is two pieces of data associated with each other. The key is a unique identifier that
points to its associated value, and a value is either the data being identified or a pointer to that data.
A key-value pair is the fundamental data structure of a key-value store or key-value database, but
key-value pairs have existed outside of software for much longer. A telephone directory is a
good example, where the key is the person or business name, and the value is the phone


number. Stock trading data is another example of a key-value pair. In this case, you may have a key
associated with values for the stock ticker, whether the trade was a buy or sell, the number of
shares, or the price of the trade.
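The telephone-directory example maps directly onto code. Below is a minimal sketch of a key-value store using a Python dict, which offers the same get/put-by-unique-key access pattern; the names and numbers are made up:

```python
# Key-value store sketch: the name is the unique key,
# the phone number is the associated value.
directory = {}

def put(key, value):
    directory[key] = value               # write: insert or overwrite by key

def get(key, default=None):
    return directory.get(key, default)   # read: single-key lookup, no joins

put("Castle Realty", "555-0100")
put("Tom", "765-555-5555")
print(get("Tom"))    # 765-555-5555
print(get("Donna"))  # None — a missing key needs no "null" placeholder row
```

Note the last line: unlike a relational row, an absent value simply is not stored, which is one reason key-value stores can have smaller storage requirements.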
Key-value store advantages
There are a few advantages that a key-value store provides over traditional row-column-based
databases. Thanks to the simple data format that gives it its name, a key-value store can be very fast
for read and write operations. And key-value stores are very flexible, a valued asset in modern
programming as we generate more data without traditional structures.
Also, key-value stores do not require placeholders such as “null” for optional values, so they may
have smaller storage requirements, and they often scale almost linearly with the number of nodes.
Key-value database use cases
The advantages listed above naturally lend themselves to several popular use cases for key-value
databases.
• Web applications may store user session details and preference in a key-value store. All the
information is accessible via user key, and key-value stores lend themselves to fast reads and writes.
• Real-time recommendations and advertising are often powered by key-value stores because
the stores can quickly access and present new recommendations or ads as a web visitor moves
throughout a site.
• On the technical side, key-value stores are commonly used for in-memory data caching to
speed up applications by minimizing reads and writes to slower disk-based systems. Hazelcast is an
example of a technology that provides an in-memory key-value store for fast data retrieval.

Distributed key-value store

A distributed key-value store builds on the advantages and use cases described above by providing
them at scale. A distributed key-value store is built to run on multiple computers working together,
and thus allows you to work with larger data sets because more servers with more memory now hold
the data. By distributing the store across multiple servers, you can increase processing performance.
And if you leverage replication in your distributed key-value store, you increase its fault tolerance.
Hazelcast is an example of a technology that provides a distributed key-value store for larger-scale
deployments. The “IMap” data type in Hazelcast, similar to the “Map” type in Java, is a key-value
store stored in memory. Unlike the Java Map type, Hazelcast IMaps are stored in memory in a


distributed manner across the collective RAM in a cluster of computers, allowing you to store much
more data than possible on a single computer. This gives you quick lookups with in-memory speeds
while also retaining other important capabilities such as high availability and security.
Columnar Data Model of NoSQL
The Columnar Data Model is an important NoSQL model. NoSQL databases differ from SQL
databases because they use a data model with a different structure than the row-and-column table
model used with relational database management systems (RDBMS). NoSQL databases use a
flexible schema model that is designed to scale horizontally across many servers, which suits large
volumes of data.
Columnar Data Model of NoSQL :
A relational database stores data in rows and reads the data row by row, whereas a column store is
organized as a set of columns. So if someone wants to run analytics on a small number of columns,
one can read those columns directly without consuming memory on the unwanted data. Values in a
column are of the same type and benefit from more efficient compression, which makes reads
faster. Examples of the Columnar Data Model: Cassandra and Apache HBase.
Working of Columnar Data Model:
The Columnar Data Model organizes information into columns instead of rows, although the data
can still be thought of in terms of tables, much as in relational databases. This type of data model is
much more flexible because it is a type of NoSQL database. The example below helps in
understanding the Columnar Data Model:
Row-Oriented Table:


Columnar Data Model uses the concept of keyspace, which is like a schema in relational models.
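The difference between the two layouts can be sketched as follows, using a hypothetical student table; the analytic query only needs to touch one column in the columnar layout:

```python
# Row-oriented storage: one record per row.
rows = [
    {"id": 1, "name": "Tom",   "enrolled": 2012},
    {"id": 2, "name": "Donna", "enrolled": 2013},
    {"id": 3, "name": "Joe",   "enrolled": 2013},
]

# Column-oriented storage: one array per column. Values of the same
# type sit together, which is also what compresses well.
columns = {
    "id":       [1, 2, 3],
    "name":     ["Tom", "Donna", "Joe"],
    "enrolled": [2012, 2013, 2013],
}

# "How many students enrolled in 2013?" — the columnar version reads
# only the 'enrolled' column, never touching names or ids.
row_answer = sum(1 for r in rows if r["enrolled"] == 2013)
col_answer = sum(1 for v in columns["enrolled"] if v == 2013)
print(row_answer, col_answer)  # 2 2
```

In a real column store the `enrolled` array would live contiguously on disk, so the aggregation scans a fraction of the data the row layout would.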

Advantages of Columnar Databases:

• Well structured: Since these data models compress well, the data is very structured and well
organized in terms of storage.

• Flexibility: There is a large amount of flexibility, as it is not necessary for the columns to look
like each other, which means one can add new and different columns without disrupting the
whole database.

• Aggregation queries are fast: Aggregation queries are quite fast because the majority of the
information for a column is stored together. An example would be adding up the total number
of students enrolled in one year.

• Scalability: The data can be spread across large clusters of machines, even numbering in the
thousands.

• Load time: Since a table can easily be loaded in a few seconds, load times are excellent.
Disadvantages of Columnar Data Model:
• Designing indexing Schema: To design an effective and working schema is too difficult and
very time-consuming.
• Suboptimal data loading: Incremental data loading is suboptimal and must be avoided, but
this might not be an issue for some users.
• Security vulnerabilities: If security is one of the priorities then it must be known that the
Columnar data model lacks inbuilt security features in this case, one must look into relational
databases.
• Online Transaction Processing (OLTP): Online Transaction Processing (OLTP) applications
are also not compatible with columnar data models because of the way data is stored.
Applications of Columnar Databases:
• The Columnar Data Model is widely used in blogging platforms.
• It is used in content management systems like WordPress, Joomla, etc.
• It is used in systems that maintain counters.
• It is used in systems that require heavy write requests.
• It is used in services that have expiring usage.

Graph Databases:
A graph database is a type of NoSQL database that is designed to handle data with complex
relationships and interconnections. In a graph database, data is stored as nodes and edges, where
nodes represent entities and edges represent the relationships between those entities.

• Graph databases are particularly well-suited for applications that require deep and complex
queries, such as social networks, recommendation engines, and fraud detection systems. They
can also be used for other types of applications, such as supply chain management, network and

infrastructure management, and bioinformatics.

• One of the main advantages of graph databases is their ability to handle and represent
relationships between entities. This is because the relationships between entities are as
important as the entities themselves, and often cannot be easily represented in a traditional
relational database.

• Another advantage of graph databases is their flexibility. Graph databases can handle data with
changing structures and can be adapted to new use cases without requiring significant changes
to the database schema. This makes them particularly useful for applications with rapidly
changing data structures or complex data requirements.

• However, graph databases may not be suitable for all applications. For example, they may not
be the best choice for applications that require simple queries or that deal primarily with data
that can be easily represented in a traditional relational database. Additionally, graph databases
may require more specialized knowledge and expertise to use effectively.

Some popular graph databases include Neo4j, OrientDB, and ArangoDB. These databases provide a
range of features, including support for different data models, scalability, and high availability, and
can be used for a wide variety of applications.

As we all know, a graph is a pictorial representation of data in the form of nodes and relationships,
which are represented by edges. A graph database is a type of database used to represent data in the
form of a graph. It has three components: nodes, relationships, and properties, and these components
are used to model the data. The concept of a graph database is based on graph theory and was
introduced around the year 2000. Graph databases are commonly classified as NoSQL databases, as
data is stored using nodes, relationships, and properties instead of traditional tables. A graph
database is very useful for heavily interconnected data. Here relationships between data are given
priority, and therefore the relationships can be easily visualized. Graph databases are flexible, as new
data can be added without disturbing the old. They are useful in fields such as social networking,
fraud detection, and AI knowledge graphs.

SUNIL G L, Asst. Professor, Dept of CSE(DS), RNSIT. Page 26


Module 5 Database Management System-BCS403

The descriptions of the components are as follows:

Nodes: represent the objects or instances. A node is equivalent to a row in a relational database and
acts as a vertex in the graph. Nodes are grouped by applying a label to each member.
Relationships: these are the edges in the graph. They have a specific direction and type, and they
form patterns in the data. They establish relationships between nodes.
Properties: the information associated with the nodes.
Some examples of graph database software are Neo4j, Oracle NoSQL DB, GraphBase, etc., of
which Neo4j is the most popular.
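The three components above can be sketched as a tiny in-memory property graph. This is an illustrative Python sketch, not the storage model of any real graph database; all class and method names here are invented for the example.

```python
# Minimal in-memory property graph: nodes, relationships, properties.
class PropertyGraph:
    def __init__(self):
        self.nodes = {}   # node_id -> {"labels": set, "props": dict}
        self.rels = []    # (src_id, rel_type, dst_id, props)

    def add_node(self, node_id, *labels, **props):
        self.nodes[node_id] = {"labels": set(labels), "props": props}

    def add_rel(self, src, rel_type, dst, **props):
        # Relationships are directed and typed, and carry their own properties.
        self.rels.append((src, rel_type, dst, props))

    def neighbors(self, node_id, rel_type=None):
        return [dst for src, t, dst, _ in self.rels
                if src == node_id and (rel_type is None or t == rel_type)]

g = PropertyGraph()
g.add_node("alice", "Person", name="Alice")
g.add_node("bob", "Person", name="Bob")
g.add_rel("alice", "FRIEND_OF", "bob", since=2020)
print(g.neighbors("alice", "FRIEND_OF"))  # ['bob']
```

Note how the relationship is stored directly as data rather than computed from foreign keys, which is the property-graph idea in miniature.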

In traditional databases, relationships between data are not explicitly established, but in a graph
database the relationships between data are prioritized. Nowadays data is mostly interconnected,
with one piece of data connected directly or indirectly to another. Since the concept of this database
is based on graph theory, it is flexible and works very fast for associative data. Interconnected data
also helps to establish further relationships. Querying is fast as well, because relationships let us
quickly find the desired nodes. Join operations are not required in this database, which reduces cost.
Relationships and properties are stored as first-class entities in a graph database.

Graph databases also allow organizations to connect their data with external sources. Since
organizations handle huge amounts of data, it often becomes cumbersome to store it in the form of
tables. For instance, if an organization wants to find a piece of data that is connected to data in
another table, a join operation is first performed between the tables, and then the search proceeds
row by row. A graph database solves this problem: it stores the relationships and properties along
with the data, so if the organization needs to search for a particular item, the nodes can be found
through relationships and properties without joining and without traversing row by row. Thus the
search for nodes does not depend on the amount of data.
Types of Graph Databases:
Property Graphs: These graphs are used for querying and analyzing data by modelling the
relationships among the data. They comprise vertices that hold information about a particular
subject and edges that denote the relationships. The vertices and edges have additional attributes
called properties.


RDF Graphs: RDF stands for Resource Description Framework. RDF graphs focus more on data
integration and are used to represent complex data with well-defined semantics. Each statement is
represented by three elements: two vertices and an edge, reflecting the subject, predicate, and object
of a sentence. Every vertex and edge is identified by a URI (Uniform Resource Identifier).
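The subject-predicate-object structure can be sketched as a tiny triple store. This is an illustrative Python sketch with invented identifiers, not a real RDF library; real systems would use full URIs and a query language such as SPARQL.

```python
# Minimal RDF-style triple store: each fact is (subject, predicate, object).
triples = {
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob", "ex:knows", "ex:Carol"),
    ("ex:Alice", "ex:worksAt", "ex:Acme"),
}

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return {(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)}

# Who does Alice know?
print(match(s="ex:Alice", p="ex:knows"))
```

Pattern matching with wildcards over triples is the essence of how RDF stores answer queries.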
When to Use a Graph Database?
Graph databases should be used for heavily interconnected data.
They should be used when the amount of data is large and relationships are present.
They can be used to present a cohesive picture of the data.
How Do Graphs and Graph Databases Work?
Graph databases provide graph models and allow users to perform traversal queries, since the data
is connected. Graph algorithms can also be applied to find patterns, paths, and other relationships,
enabling deeper analysis of the data. These algorithms help explore neighboring nodes, cluster
vertices, and analyze relationships and patterns. Countless joins are not required in this kind of
database.
Example of Graph Database:
Recommendation engines in e-commerce use graph databases to provide customers with accurate
recommendations and updates about new products, thus increasing sales and satisfying customers'
desires.
Social media companies use graph databases to find "friends of friends" or products that a user's
friends like, and send suggestions to the user accordingly.

Graph databases also play a major role in fraud detection. Users can create a graph from the
transactions between entities and store other important information. Once created, running a simple
query helps identify the fraud.
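The "friends of friends" query mentioned above can be sketched as a two-hop traversal over an adjacency list. This is an illustrative Python sketch; a real graph database would run an equivalent traversal natively (for example, as a Cypher query).

```python
# Friendship graph as an adjacency list (undirected edges stored both ways).
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "erin"},
    "dave": {"bob"},
    "erin": {"carol"},
}

def friends_of_friends(person):
    """People exactly two hops away: candidates for 'people you may know'."""
    direct = friends.get(person, set())
    two_hop = set()
    for f in direct:
        two_hop |= friends.get(f, set())
    # Exclude direct friends and the person themselves.
    return two_hop - direct - {person}

print(sorted(friends_of_friends("alice")))  # ['dave', 'erin']
```

Because the traversal follows stored relationships directly, no join between tables is needed, which is exactly the advantage claimed for graph databases above.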
Advantages of Graph Database:
A potential advantage of graph databases is establishing relationships with external sources as well.
No joins are required, since relationships are already specified.
Query time depends on the concrete relationships, not on the amount of data.
They are flexible and agile.
It is easy to manage the data in terms of a graph.
Efficient data modeling: Graph databases allow for efficient data modeling by representing data as
nodes and edges. This allows for more flexible and scalable data modeling than traditional
relational databases.
Flexible relationships: Graph databases are designed to handle complex relationships and
interconnections between data elements. This makes them well-suited for applications that require
deep and complex queries, such as social networks, recommendation engines, and fraud detection
systems.
High performance: Graph databases are optimized for handling large and complex datasets, making
them well-suited for applications that require high levels of performance and scalability.
Scalability: Graph databases can be easily scaled horizontally, allowing additional servers to be
added to the cluster to handle increased data volume or traffic.
Easy to use: Graph databases are typically easier to use than traditional relational databases. They
often have a simpler data model and query language, and can be easier to maintain and scale.

Disadvantages of Graph Database:


For complex relationships, searching can often become slower.
The query language is platform dependent.
They are inappropriate for transactional data.
They have a smaller user base.
Limited use cases: Graph databases are not suitable for all applications. They may not be the best
choice for applications that require simple queries or that deal primarily with data that can be easily
represented in a traditional relational database.
Specialized knowledge: Graph databases may require specialized knowledge and expertise to use
effectively, including knowledge of graph theory and algorithms.
Immature technology: The technology for graph databases is relatively new and still evolving,
which means that it may not be as stable or well-supported as traditional relational
databases.
Integration with other tools: Graph databases may not be as well-integrated with other tools and
systems as traditional relational databases, which can make it more difficult to use them in
conjunction with other technologies.
Future of Graph Database:
Graph Database is an excellent tool for storing data, but it cannot completely replace the traditional
database. This kind of database deals with typical sets of interconnected data. Although graph
databases are still in a developmental phase, they are becoming important as businesses and
organizations adopt big data, and graph databases help with complex analysis. Thus these databases
have become a must for today's needs and tomorrow's success.
The Graph-Based Data Model in NoSQL is a type of data model that focuses on building
relationships between data elements. As the name suggests, each element here is stored as a node,
and the associations between these elements are often known as links. Associations are stored
directly, as they are first-class elements of the data model. These data models give us a conceptual
view of the data.
These data models are based on a topological network structure. As in graph theory, we have terms
like nodes, edges, and properties; let's see what they mean in the graph-based data model.
Nodes: These are the instances of data that represent the objects to be tracked.
Edges: As we already know, edges represent relationships between nodes.
Properties: These represent the information associated with nodes.
An accompanying figure (omitted here) shows nodes with properties connected by relationships
represented as edges.


Working of the Graph Data Model:

In these data models, connected nodes are linked physically, and the physical connection between
them is itself treated as a piece of data. Connecting data this way makes it easy to query a
relationship: the model reads the relationship from storage directly instead of calculating it by
querying the connection steps. Like many NoSQL databases, these data models do not require a
fixed schema, which keeps the model flexible and easy to edit.

Examples of Graph Data Models:

JanusGraph: A scalable, open-source graph database system that is very helpful in big data
analytics. JanusGraph has features such as:
Storage: Many options are available for storing graph data, such as Cassandra.
Support for transactions: ACID (Atomicity, Consistency, Isolation, and Durability) transactions
are supported, serving thousands of concurrent users.
Searching options: Complex search options are available, with optional support.

Neo4j: It stands for Network Exploration and Optimization 4 Java. As the name suggests, this
graph database is written in Java, with native graph storage and processing. Neo4j has features
such as:
Scalable: Scalable through data partitioning into pieces known as shards.
Higher availability: Availability is very high due to continuous backups and rolling upgrades.
Query language: Uses the programmer-friendly Cypher graph query language.

DGraph: An open-source distributed graph database system designed for scalability. Its main
features are:
Query language: It uses GraphQL, which is made for APIs.
Open source: support for many open standards.

Advantages of Graph Data Model:

• Structure: The structures are very agile and workable.
• Explicit representation: The portrayal of relationships between entities is explicit.
• Real-time results: Queries give us real-time output results.

Disadvantages of Graph Data Model:


• No standard query language: Since the language depends on the platform being used, there is
no single standard query language.
• Unsuitable for transactions: Graphs are a poor fit for transaction-based systems.
• Small user base: The user base is small, which makes it difficult to get support when running
into issues.
Applications of Graph Data Model:
• Graph data models are widely used in fraud detection, which is itself a very important
application.
• It is used in Digital asset management which provides a scalable database model to keep track
of digital assets.
• It is used in Network management which alerts a network administrator about problems in a
network.
• It is used in Context-aware services by giving traffic updates and many more.
• It is used in Real-Time Recommendation Engines which provide a better user experience.

Column Oriented Data Base (Wide-column Database)

A wide-column database is a NoSQL database that organizes data storage into flexible columns that
can be spread across multiple servers or database nodes, using multi-dimensional mapping to
reference data by column, row, and timestamp.


What is a Wide-column Database?

A wide-column database is a type of NoSQL database in which the names and format of the
columns can vary across rows, even within the same table. Wide-column databases are also known
as column family databases. Because data is stored in columns, queries for a particular value in a
column are very fast, as the entire column can be loaded and searched quickly. Related columns can
be modeled as part of the same column family.
What Are the Advantages of a Wide-column Database?

Benefits of a wide-column NoSQL database include speed of querying, scalability, and a flexible
data model.

How Does a Wide-column Store Database Differ from a Relational Database?

A relational database management system (RDBMS) stores data in a table with rows that all span a
number of columns. If one row needs an additional column, that column must be added to the entire
table, with null or default values provided for all the other rows. If you need to query that RDBMS
table for a value that isn’t indexed, the table scan to locate those values will be very slow.
Wide-column NoSQL databases still have the concept of rows, but reading or writing a row of data
consists of reading or writing the individual columns. A column is only written if there’s a data
element for it. Each data element can be referenced by the row key, but querying for a value is
optimized like querying an index in a RDBMS, rather than a slow table scan.
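The multi-dimensional mapping described above (row key, column family, column, timestamp) can be sketched with nested dictionaries. This is an illustrative Python sketch with invented names, not the storage engine of any real wide-column database such as Cassandra or HBase.

```python
# Wide-column store sketch:
# (row key -> column family -> column) -> list of (timestamp, value) versions.
class WideColumnTable:
    def __init__(self):
        self.rows = {}

    def put(self, row_key, family, column, value, ts):
        cell = (self.rows.setdefault(row_key, {})
                         .setdefault(family, {})
                         .setdefault(column, []))
        cell.append((ts, value))  # every version is kept, tagged by timestamp

    def get(self, row_key, family, column):
        """Return the most recent value for a cell, or None if absent."""
        versions = self.rows.get(row_key, {}).get(family, {}).get(column, [])
        return max(versions, key=lambda tv: tv[0])[1] if versions else None

t = WideColumnTable()
# Rows need not share columns: only written cells occupy space.
t.put("user:1", "profile", "name", "Alice", ts=1)
t.put("user:1", "profile", "name", "Alicia", ts=2)  # newer version wins on read
t.put("user:2", "profile", "email", "bob@example.com", ts=1)
print(t.get("user:1", "profile", "name"))  # Alicia
```

The sketch shows the two points made above: a column is stored only when a row actually has a value for it, and each data element is addressed by row, column family, column, and timestamp.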

Neo4j is a graph database. A graph database, instead of having rows and columns, has nodes, edges,
and properties. For many use cases it is more suitable for big data and analytics applications than
row-and-column databases or free-form JSON document databases.

A graph database is used to represent relationships. The most common example of that is the
Facebook Friend relationship, as well as the Like relationship. You can see some of that in the
graphic below from Neo4j (omitted here).

The circles are nodes. The lines, called edges, indicate relationships. And the comments inside the
circles are properties of that node.


We write about Neo4j here because it has the largest market share, but there are other players in this
market. According to Neo4j, Apache Spark 3.0 will add the Neo4j Cypher query language to allow,
and make easier, "property graphs based on DataFrames to Spark." Spark already supports GraphX,
which is an extension of the RDD to support graphs. We will discuss that in another blog post. In
another post we will also discuss graph algorithms. The most famous of those is the Google
PageRank index. Algorithms are the way to navigate the nodes and edges.
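As a taste of such algorithms, here is a sketch of PageRank in its power-iteration form over a tiny adjacency list. This is an illustrative Python sketch, not Neo4j's or Spark's implementation; the graph and function names are invented for the example.

```python
def pagerank(graph, damping=0.85, iters=50):
    """Power iteration over an adjacency list {node: [out-neighbors]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in graph.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

# 'c' is linked by both 'a' (half its weight) and 'b' (all of it).
g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(g)
print(max(r, key=r.get))  # c
```

Each iteration redistributes rank along the edges, so a node's score reflects how much rank its in-neighbors hold, which is the idea behind ranking nodes by link structure.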

Costs: Is Neo4j free? That's rather complicated. The Community Edition is, and so is the desktop
version, which is suitable for learning. The Enterprise edition is not. That is consistent with other
open-source products. When I asked Neo4j for a license to work with their product for an extended
period of time, they recommended that I use the desktop version. The Enterprise version has a
30-day trial period.

There are other alternatives in the market. The key is to pick one that has enough users that the
vendor does not go out of business. Which one should you use? You will have to do research to
figure that out.

Install Neo4j: You can use the desktop or tar version. Here I am using the tar version on Mac: just
download it and then start up the shell as shown below. You will need a Java JDK.


export JAVA_HOME='/Library/Java/JavaVirtualMachines/jdk1.8.0_201.jdk/Contents/Home'

Set the initial password, start the server, then open cypher-shell. The default URL is a rather
strange-looking bolt://localhost:7687.

cd into the neo4j bin folder, then:

./neo4j-admin set-initial-password xxxxxx
./neo4j start
./cypher-shell -a bolt://localhost:7687 -u neo4j -p xxxxxx
