Module 5
Chapter 1
Concurrency control is the management procedure required for controlling the concurrent execution of operations that take place on a database. Concurrent execution means that several transactions run on the database at the same time, with their operations interleaved.
Concurrency Control
Concurrency control is the mechanism required for controlling and managing the concurrent execution of database operations, thereby avoiding inconsistencies in the database. To maintain consistency under concurrent execution, we use concurrency control protocols.
Concurrency Control Protocols
The concurrency control protocols ensure the atomicity, consistency, isolation, durability, and serializability of the concurrent execution of database transactions. These protocols are categorized as:
o Lock-Based Concurrency Control Protocol
o Timestamp-Based Concurrency Control Protocol
o Validation-Based Concurrency Control Protocol
Lock-Based Protocol
In this type of protocol, a transaction cannot read or write data until it acquires an appropriate lock on it. There are two types of lock:
1. Shared lock:
o It is also known as a Read-only lock. With a shared lock, the data item can only be read by the transaction.
o It can be shared among transactions, because a transaction that holds a shared lock can only read the data item; it cannot update it.
2. Exclusive lock:
o With an exclusive lock, the data item can be both read and written by the transaction.
o This lock is exclusive: it ensures that multiple transactions cannot modify the same data item simultaneously.
Two-Phase Locking (2-PL): in the example below, if lock conversion is allowed, then upgrading a lock (S to X) can take place only in the growing phase, and downgrading (X to S) only in the shrinking phase.
Example:
The following schedule (figure not reproduced here) shows how unlocking and locking work with 2-PL; the growing and shrinking phases of transactions T1 and T2 are marked on its steps (for example, T1's growing phase spans steps 1-3).
Strict Two-Phase Locking (Strict-2PL):
o Strict-2PL waits until the whole transaction commits, and only then releases all of its locks at once.
o The Strict-2PL protocol therefore does not have a shrinking phase of lock release.
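To make the locking idea concrete, here is a minimal Python sketch (not the textbook's example) of a lock manager that grants shared (S) and exclusive (X) locks and, following Strict-2PL, releases everything only when the transaction finishes; the class, item, and transaction names are illustrative assumptions.

# Minimal sketch of S/X locking with Strict-2PL style release-at-commit.
class LockManager:
    def __init__(self):
        self.locks = {}                     # item -> (mode, set of txn ids)

    def acquire(self, txn, item, mode):
        """Return True if txn obtains the lock, False if it must wait."""
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":    # shared locks are compatible
            holders.add(txn)
            return True
        if holders == {txn}:                    # same txn upgrading its own lock
            self.locks[item] = (mode, holders)
            return True
        return False                            # conflicting lock: must wait

    def release_all(self, txn):
        """Strict-2PL: release every lock only at commit or abort."""
        for item in list(self.locks):
            mode, holders = self.locks[item]
            holders.discard(txn)
            if not holders:
                del self.locks[item]

lm = LockManager()
print(lm.acquire("T1", "A", "S"))   # True  - T1 read-locks item A
print(lm.acquire("T2", "A", "S"))   # True  - shared locks can coexist
print(lm.acquire("T2", "A", "X"))   # False - exclusive lock conflicts, T2 waits
lm.release_all("T1")                # at commit, T1 releases all locks at once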
Timestamp Ordering Protocol
1. Check the following conditions whenever a transaction Ti issues a Read(X) operation:
o If TS(Ti) < W_TS(X), then abort and roll back Ti and reject the operation.
o Otherwise, execute the Read(X) operation of Ti and set R_TS(X) to the larger of TS(Ti) and the current R_TS(X).
Where TS(Ti) is the timestamp of transaction Ti, and R_TS(X) and W_TS(X) are the read and write timestamps of data item X.
o The timestamp-ordering (TS) protocol ensures freedom from deadlock, since no transaction ever waits.
o However, the schedule may not be recoverable and may not even be cascade-free.
Validation-Based Protocol
The validation-based protocol is also known as the optimistic concurrency control technique. In this protocol, a transaction is executed in the following three phases:
1. Read phase: In this phase, transaction T reads the values of the various data items and stores them in temporary local variables. It performs all of its write operations on these temporary variables, without updating the actual database.
2. Validation phase: In this phase, the temporary values are validated against the actual data to check whether writing them would violate serializability.
3. Write phase: If the validation succeeds, the temporary results are written to the database; otherwise, the transaction is rolled back.
Each transaction Ti is associated with three timestamps: Start(Ti), the time when Ti started its execution; Validation(Ti), the time when Ti finished its read phase and started its validation phase; and Finish(Ti), the time when Ti finished its write phase.
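As a rough illustration of the validation test, the sketch below (plain Python with illustrative transaction objects, not any DBMS's actual API) checks a transaction Ti against every transaction Tj that completed validation before it, using the Start, Validation, and Finish timestamps just described.

from dataclasses import dataclass, field

@dataclass
class Txn:
    start: int
    validation: int
    finish: int
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)

def validate(ti, earlier):
    """True if Ti passes validation against every Tj validated before it."""
    for tj in earlier:
        if tj.finish < ti.start:
            continue        # Tj finished before Ti even started: no conflict
        if tj.finish < ti.validation and tj.write_set.isdisjoint(ti.read_set):
            continue        # Tj wrote nothing that Ti read: still serializable
        return False        # otherwise Ti must be rolled back
    return True

t1 = Txn(start=1, validation=5, finish=7, read_set={"X"}, write_set={"X"})
t2 = Txn(start=3, validation=8, finish=9, read_set={"X"}, write_set={"Y"})
print(validate(t2, [t1]))   # False: T1 wrote X after T2 started, and T2 read X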
Thomas Write Rule
The Thomas Write Rule provides a guarantee of serializability order for the protocol and improves the Basic Timestamp Ordering algorithm. Whenever transaction T issues a Write(X) operation:
o If TS(T) < R_TS(X), then transaction T is aborted and rolled back, and the operation is rejected.
o If TS(T) < W_TS(X), then do not execute the Write(X) operation, but continue processing; the outdated write is simply ignored.
o If neither condition 1 nor condition 2 holds, then execute the Write(X) operation of T and set W_TS(X) to TS(T).
In the above figure (not reproduced here), T1's read of the data item precedes T1's write of the same item; this schedule is not conflict serializable.
The Thomas Write Rule observes that T2's write is never seen by any other transaction. If we delete that write operation from transaction T2, then a conflict-serializable schedule can be obtained, as shown in the figure below.
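The decision taken for each Write(X) under the Thomas Write Rule can be sketched as below; keeping the per-item timestamps in a plain dictionary is an assumption made only for illustration.

# Sketch of the Thomas Write Rule decision for a write(X) by a txn with TS(T) = ts_t.
def thomas_write(ts_t, item):
    """item holds 'read_ts', 'write_ts' and 'value' for data item X (illustrative)."""
    if ts_t < item["read_ts"]:
        return "abort"              # a younger txn already read X: roll back T
    if ts_t < item["write_ts"]:
        return "ignore"             # obsolete write: skip it, keep processing
    item["write_ts"] = ts_t         # otherwise perform the write
    return "write"

x = {"read_ts": 4, "write_ts": 6, "value": 10}
print(thomas_write(5, x))   # 'ignore' - basic TO would abort; Thomas rule skips
print(thomas_write(3, x))   # 'abort'  - X was already read by a younger txn
print(thomas_write(9, x))   # 'write'  - executed, and W_TS(X) becomes 9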
All concurrency control techniques assume that the database is formed of a number of named data
items. A database item could be chosen to be one of the following:
o A database record
o A field value of a database record
o A disk block
o A whole file
o The whole database
The granularity can affect the performance of concurrency control and recovery. In Section 22.5.1,
we discuss some of the tradeoffs with regard to choosing the granularity level used for locking, and
in Section 22.5.2 we discuss a multiple granularity locking scheme, where the granularity level
(size of the data item) may be changed dynamically.
The size of data items is often called the data item granularity. Fine granularity refers to small
item sizes, whereas coarse granularity refers to large item sizes. Several tradeoffs must be
considered in choosing the data item size. We will discuss data item size in the context of locking,
although similar arguments can be made for other concurrency control techniques.
First, notice that the larger the data item size is, the lower the degree of concurrency permitted. For
example, if the data item size is a disk block, a transaction T that needs to lock a record B must lock
the whole disk block X that contains B because a lock is associated with the whole data item
(block). Now, if another transaction S wants to lock a different record C that happens to reside in
the same block X in a conflicting lock mode, it is forced to wait. If the data item size was a single
record, transaction S would be able to proceed, because it would be locking a different data item
(record).
On the other hand, the smaller the data item size is, the more the number of items in the database.
Because every item is associated with a lock, the system will have a larger number of active locks
to be handled by the lock manager. More lock and unlock operations will be performed, causing a
higher overhead. In addition, more storage space will be required for the lock table. For timestamps,
storage is required for the read_TS and write_TS for each data item, and there will be similar
overhead for handling a large number of items.
Given the above tradeoffs, an obvious question can be asked: What is the best item size? The
answer is that it depends on the types of transactions involved. If a typical transaction accesses a
small number of records, it is advantageous to have the data item granularity be one record. On the
other hand, if a transaction typically accesses many records in the same file, it may be better to have
block or file granularity so that the transaction will consider all those records as one (or a few) data
items.
2. Multiple Granularity
o It can be defined as hierarchically breaking up the database into blocks that can be locked.
o The multiple granularity protocol enhances concurrency and reduces lock overhead.
o It keeps track of what to lock and how to lock.
o It makes it easy to decide whether to lock or unlock a data item. This type of hierarchy can be represented graphically as a tree.
o The database forms the root of the tree; its children nodes are called areas, and no area belongs to more than one database.
o Each area consists of children nodes known as files. No file is present in more than one area.
o Finally, each file contains child nodes known as records. A file has exactly those records that are its child nodes, and no record is present in more than one file.
o Hence, the levels of the tree starting from the top level are as follows:
• Database
• Area
• File
• Record
In the example figure (not reproduced here), the highest level shows the entire database; the levels below it are file, record, and field.
There are three additional lock modes with multiple granularity:
Intention-shared (IS): It contains explicit locking at a lower level of the tree but only with shared
locks.
Intention-Exclusive (IX): It contains explicit locking at a lower level with exclusive or shared
locks.
Shared & Intention-Exclusive (SIX): With this lock, the node is locked in shared mode, and explicit locking is being done at a lower level with exclusive-mode locks by the same transaction.
Compatibility Matrix with Intention Lock Modes: The below table describes the compatibility
matrix for these lock modes:
• It uses the intention lock modes to ensure serializability. It requires that if a transaction attempts to lock a node, then that node must follow these protocols:
• Transaction T1 must observe the lock-compatibility matrix.
• Transaction T1 must lock the root of the tree first, and it can lock it in any mode.
• T1 can lock a node in S or IS mode only if it currently has the parent of that node locked in either IX or IS mode.
• T1 can lock a node in X, SIX, or IX mode only if it currently has the parent of that node locked in either IX or SIX mode.
• T1 can lock a node only if it has not previously unlocked any node (T1 is two-phase).
• T1 can unlock a node only if it currently holds no locks on any of that node's children.
• Observe that in multiple granularity, locks are acquired in top-down (root-to-leaf) order, and they must be released in bottom-up (leaf-to-root) order.
• If transaction T1 reads record Ra9 in file Fa, then T1 needs to lock the database, area A1, and file Fa in IS mode, and finally lock Ra9 in S mode.
• If transaction T2 modifies record Ra9 in file Fa, then it can do so after locking the database, area A1, and file Fa in IX mode, and finally locking Ra9 in X mode.
• If transaction T3 reads all the records in file Fa, then T3 needs to lock the database and area A1 in IS mode, and finally lock Fa in S mode.
• If transaction T4 reads the entire database, then T4 needs to lock the database in S mode.
A small sketch of how the compatibility matrix is consulted follows these examples.
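The sketch below is a plain Python illustration of that compatibility matrix, not part of any real lock manager; the mode names follow the text above.

# True means a request in the column mode can be granted while the row mode is held.
COMPAT = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def can_grant(held_mode, requested_mode):
    return COMPAT[held_mode][requested_mode]

# T3 holds S on file Fa (it reads every record); T2 now wants IX on Fa to update one record.
print(can_grant("S", "IX"))    # False - T2 must wait until T3 releases its S lock
print(can_grant("IS", "IX"))   # True  - intention locks on ancestor nodes can coexist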
Chapter 2 :
Big Data refers to vast and complex datasets that cannot be effectively managed, processed, or
analysed using traditional data processing tools and methods. These datasets typically exhibit three
main characteristics, often referred to as the three Vs:
• Volume: Big Data involves massive amounts of data, often ranging from terabytes to petabytes
or more. This data can come from various sources, including social media, sensors, devices, and
transaction records.
• Velocity: Data is generated at an unprecedented speed. For example, social media platforms
generate millions of posts, comments, and interactions every minute. This real-time data influx
requires rapid processing and analysis.
• Variety: Big Data is heterogeneous and can include structured data (e.g., databases), semi-
structured data (e.g., JSON or XML), and unstructured data (e.g., text, images, videos).
Handling this diverse data is a significant challenge.
Additionally, two more Vs are often considered:
• Veracity: This refers to the trustworthiness or reliability of the data. Big Data may include
noisy, incomplete, or inaccurate information.
• Value: The ultimate goal of handling Big Data is to extract valuable insights, make informed
decisions, and derive business value from the data.
NoSQL (which stands for "Not Only SQL") databases are a family of database management
systems designed to handle the unique challenges posed by Big Data. They offer a departure from
traditional relational databases (SQL databases) by providing greater scalability, flexibility, and
performance. Here are some key characteristics and types of NoSQL databases:
1. Schema-less: Unlike SQL databases that require predefined schemas and rigid data structures,
NoSQL databases are typically schema-less. This means you can store data without defining its
structure in advance, making them suitable for handling unstructured or semi-structured data.
2. Scalability: NoSQL databases are often designed to scale horizontally, meaning you can add
more servers or nodes to handle increased data volumes and traffic. This is crucial for
accommodating the high volume and velocity of Big Data.
3. Data Models: There are several types of NoSQL databases, each tailored to specific use cases:
• Document-based: Stores data in flexible, semi-structured documents (e.g., MongoDB,
Couchbase).
• Key-Value: Simplest NoSQL model, where data is stored as key-value pairs (e.g., Redis,
Amazon DynamoDB).
• Column-family: Suitable for wide-column stores, often used for time-series data (e.g., Apache
Cassandra, HBase).
• Graph databases: Optimized for managing relationships and graph-like data structures (e.g.,
Neo4j, Amazon Neptune).
The synergy between Big Data and NoSQL databases is evident in various ways:
1. Scalability: NoSQL databases can horizontally scale to accommodate the massive volumes of
data generated in Big Data environments.
2. Schema Flexibility: NoSQL databases are well-suited for storing and managing the diverse
data types found in Big Data, whether structured, semi-structured, or unstructured.
3. Real-time Processing: Big Data platforms like Apache Hadoop and Apache Spark often
integrate with NoSQL databases for real-time data processing, analytics, and machine learning.
4. High Throughput: NoSQL databases are capable of handling the high velocity of data
ingestion and queries, making them ideal for real-time and streaming data applications.
5. Polyglot Persistence: In many Big Data architectures, organizations use a combination of
NoSQL and SQL databases to achieve polyglot persistence, where each database type is used
for its specific strengths.
Challenges and Considerations
While Big Data and NoSQL databases offer significant benefits, they also come with challenges,
including data consistency, security, and the need for specialized skills in managing and querying
these databases. Organizations must carefully evaluate their specific use cases and requirements
before adopting Big Data and NoSQL solutions.
In summary, Big Data and NoSQL databases are integral components of modern data architectures,
enabling organizations to store, process, and analyze vast and diverse datasets with the flexibility,
scalability, and performance required to derive meaningful insights and value from their data.
NoSQL Databases
We know that MongoDB is a NoSQL database, so it is necessary to understand NoSQL databases in order to understand MongoDB thoroughly.
CAP Theorem
Partition tolerance means the system keeps working even when network failures split it into groups of nodes that cannot communicate with each other. That is, the system continues to function and upholds its consistency guarantees in spite of network partitions. Network partitions are a fact of life. Distributed systems guaranteeing partition tolerance can gracefully recover from partitions once the partition heals.
The use of the word consistency in CAP and its use in ACID do not refer to the same concept. In CAP, consistency refers to the consistency of the values in different copies of the same data item in a replicated distributed system. In ACID, it refers to the fact that a transaction will not violate the integrity constraints specified on the database schema. The CAP theorem states that a distributed database can have at most two of the three properties: consistency, availability, and partition tolerance. As a result, database systems prioritize only two properties at a time.
The following figure represents which database systems prioritize specific properties at a given
time:
Document Databases
Even though document stores do not have a unified schema, they are usually organized so that the data is easy to use and, eventually, analyze. This means they are structured, to an extent. Since each object is commonly stored in a single document, there is no need to define relationships between documents.
These documents are in no way similar to tables of a relational database; they do not have a set
number of fields, slots, etc. and there are no empty spaces -- the missing info is simply omitted
rather than there being an empty slot left for it. Data can be added, edited, removed and queried.
The keys assigned to each document are unique identifiers required to access data within the
database, usually a path, string or Uniform Resource Identifier. IDs tend to be indexed in the
database to speed up data retrieval.
The content of documents within a document store is classified using metadata. Due to this feature,
the database "understands" what class of information it holds -- whether a field contains addresses,
phone numbers or social security numbers, for example. For improved efficiency and user
experience, document stores have query language, which allows querying documents based on the
metadata or the content of the document. This allows you to retrieve all documents which contain
the same value within a certain field.
Amazon has provided the following terminology comparison between SQL and a document database, MongoDB. The following list draws a parallel between the two types of databases:
• Table (SQL) → Collection (MongoDB)
• Row → Document
• Column → Field
• Primary key → ObjectId
• Index → Index
• View → View
• Nested table or object → Embedded document
• Array → Array
Advantages
One of the top priorities in any business is making the most of the time and resources
given, increasing overall efficiency. Selecting the right database based on its purpose and the type
of data collected is a crucial step. The following features of a document store could make it the right
choice for you:
• Flexibility. Document stores have the famous advantage of all NoSQL databases: a flexible structure. As mentioned previously, the documents in one database do not have to be of the same type, nor do they have to be structured the same way.
• Easy to update. In a relational database, any new piece of information has to be added as a column for every row in order to maintain the unified structure of the table. With document stores, you can add new pieces of information to individual documents without having to add them to all existing documents.
• Improved performance. Rather than pulling data from multiple related tables, you can
find everything you need within one document. With everything kept in a single
location, it is much faster to reach and retrieve the data.
Disadvantages
Regardless of the size of a database, by virtue of being only semistructured, NoSQL databases are
simple when compared to relational databases. If you jeopardize the simplicity of a document store,
you will also jeopardize the previously mentioned improved performance. You can create
references between documents of a document store by interlinking them, but doing so can create
complex systems and deprive you of fast performance.
If you have a large-volume database and you would like to create a network of mutually referenced
data, you are better off finding a way of mapping it and fitting it into a relational database.
A document is a record in a document database. A document typically stores information about one
object and any of its related metadata. Documents store data in field-value pairs. The values can be
a variety of types and structures, including strings, numbers, dates, arrays, or objects. Documents
can be stored in formats like JSON, BSON, and XML.
Below is a JSON document that stores information about a user named Tom.
{
  "_id": 1,
  "first_name": "Tom",
  "email": "[email protected]",
  "cell": "765-555-5555",
  "likes": [
    "fashion",
    "spas",
    "shopping"
  ],
  "businesses": [
    {
      "name": "Entertainment 1080",
      "partner": "Jean",
      "status": "Bankrupt",
      "date_founded": {
        "$date": "2012-05-19T04:00:00Z"
      }
    },
    {
      "name": "Swag for Tweens",
      "date_founded": {
        "$date": "2012-11-01T04:00:00Z"
      }
    }
  ]
}
Collections: A collection is a group of documents. Collections typically store documents that have
similar contents. Not all documents in a collection are required to have the same fields, because
document databases have a flexible schema. Note that some document databases provide schema
validation, so the schema can optionally be locked down when needed.
Continuing with the example above, the document with information about Tom could be stored in a
collection named users. More documents could be added to the users collection in order to store
information about other users. For example, the document below that stores information about
Donna could be added to the users collection.
"_id": 2,
"first_name": "Donna",
"email": "[email protected]",
"spouse": "Joe",
"likes": [
"spas",
"shopping",
"live tweeting"
],
"businesses": [
"status": "Thriving",
"date_founded": {
"$date": "2013-11-21T04:00:00Z"
Note that the document for Donna does not contain the same fields as the document for Tom.
The users collection is leveraging a flexible schema to store the information that exists for each
user.
CRUD operations:
Document databases typically have an API or query language that allows developers to execute the
CRUD (create, read, update, and delete) operations.
• Create: Documents can be created in the database. Each document has a unique identifier.
• Read: Documents can be read from the database. The API or query language allows
developers to query for documents using their unique identifiers or field values. Indexes can
be added to the database in order to increase read performance.
• Update: Existing documents can be updated, either in whole or in part.
• Delete: Documents can be deleted from the database.
A short driver-level sketch of these operations is given below.
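This is a minimal sketch of those four operations using the PyMongo driver against the users collection from the earlier example; the connection URI and database name are assumptions.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["mydb"]["users"]

# Create: insert a document (MongoDB assigns an _id if none is given)
users.insert_one({"first_name": "Tom", "likes": ["fashion", "spas", "shopping"]})

# Read: query by a field value
tom = users.find_one({"first_name": "Tom"})

# Update: modify part of an existing document
users.update_one({"first_name": "Tom"}, {"$push": {"likes": "golf"}})

# Delete: remove the document
users.delete_one({"first_name": "Tom"})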
Document databases are popular with developers for three main reasons:
1. The intuitiveness of the data model: Documents map to the objects in code, so they are much more
natural to work with. There is no need to decompose data across tables, run expensive joins, or
integrate a separate Object Relational Mapping (ORM) layer. Data that is accessed together is stored
together, so developers have less code to write and end users get higher performance.
2. The ubiquity of JSON documents: JSON has become an established standard for data interchange
and storage. JSON documents are lightweight, language-independent, and human-readable. Documents
are a superset of all other data models so developers can structure data in the way their applications
need — rich objects, key-value pairs, tables, geospatial and time-series data, or the nodes and edges of
a graph.
3. The flexibility of the schema: A document’s schema is dynamic and self-describing, so developers
don’t need to first pre-define it in the database. Fields can vary from document to document.
Developers can modify the structure at any time, avoiding disruptive schema migrations. Some document databases offer schema validation, so you can optionally enforce rules governing document structures; a brief sketch of this is given below.
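For example, here is a hedged sketch of optional schema validation using MongoDB's $jsonSchema validator through PyMongo; the specific rules are illustrative, not required ones.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

# Create the collection with a validator; inserts that violate it are rejected.
db.create_collection("users", validator={
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["first_name", "email"],
        "properties": {
            "first_name": {"bsonType": "string"},
            "email": {"bsonType": "string"},
        },
    }
})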
Now let's consider how we can store that same information in a relational database. We'll begin by creating a table that stores the basic information about the user. Since a user can like many things, we would also create a separate table to store those likes, with a foreign key referencing the ID column in the Users table. Similarly, a user can run many businesses, so we will create a new table named "Businesses" to store business information. The Businesses table will likewise have a foreign key that references the ID column in the Users table.
In this simple example, we saw that data about a user could be stored in a single document in a
document database or three tables in a relational database. When a developer wants to retrieve or
update information about a user in the document database, they can write one query with zero joins.
Interacting with the database is straightforward, and modeling the data in the database is intuitive.
What are the relationships between document databases and other databases?
The document model is a superset of other data models, including key-value pairs, relational,
objects, graph, and geospatial.
• Key-value pairs can be modeled with fields and values in a document. Any field in a document
can be indexed, providing developers with additional flexibility in how to query the data.
• Relational data can be modeled differently (and some would argue more intuitively) by keeping
related data together in a single document using embedded documents and arrays. Related data
can also be stored in separate documents, and database references can be used to connect the
related data.
• Documents map to objects in most popular programming languages.
• Graph nodes and/or edges can be modeled as documents. Edges can also be modeled
through database references. Graph queries can be run using operations like $graphLookup.
• Geospatial data can be modeled as arrays in documents.
Due to their rich data modeling capabilities, document databases are general-purpose databases that
can store data for a variety of use cases.
With document databases empowering developers to build faster, most relational databases have
added support for JSON. However, simply adding a JSON data type does not bring the benefits of a
native document database. Why? Because the relational approach detracts from developer productivity rather than improving it. These are some of the things developers have to deal with.
Proprietary Extensions
Working with documents means using custom, vendor-specific SQL functions which will not be
familiar to most developers, and which don’t work with your favorite SQL tools. Add low-level
JDBC/ODBC drivers and ORMs and you face complex development processes resulting in low
productivity.
Presenting JSON data as simple strings and numbers rather than the rich data types supported by
native document databases such as MongoDB makes computing, comparing, and sorting data
complex and error prone.
Relational databases offer little to validate the schema of documents, so you have no way to apply
quality controls against your JSON data. And you still need to define a schema for your regular
tabular data, with all the overhead that comes when you need to alter your tables as your
application’s features evolve.
Low Performance
Most relational databases do not maintain statistics on JSON data, preventing the query planner
from optimizing queries against documents, and you from tuning your queries.
No native scale-out
Traditional relational databases offer no way for you to partition (“shard”) the database across
multiple instances to scale as workloads grow. Instead you have to implement sharding yourself in
the application layer, or rely on expensive scale-up systems.
Strengths of document databases include the following:
• The flexible schema allows the data model to change as an application's requirements change.
• Document databases have rich APIs and query languages that allow developers to easily
interact with their data.
• Document databases are distributed (allowing for horizontal scaling as well as global data
distribution) and resilient.
These strengths make document databases an excellent choice for a general-purpose database.
A common weakness that people cite about document databases is that many do not support multi-
document ACID transactions. We estimate that 80%-90% of applications that leverage the
document model will not need to use multi-document transactions.
Note that some document databases like MongoDB support multi-document ACID transactions.
Visit What are ACID Transactions? to learn more about how the document model mostly eliminates
the need for multi-document transactions and how MongoDB supports transactions in the rare cases
where they are needed.
MongoDB is an open source, nonrelational database management system (DBMS) that uses flexible documents instead of tables and rows to process and store various forms of data.
As a NoSQL database solution, MongoDB does not require a relational database management
system (RDBMS), so it provides an elastic data storage model that enables users to store and query
multivariate data types with ease. This not only simplifies database management for developers but
also creates a highly scalable environment for cross-platform applications and services.
MongoDB documents or collections of documents are the basic units of data. Formatted as BSON (Binary JSON, a binary representation of JavaScript Object Notation), these documents can store various types of data and be
distributed across multiple systems. Since MongoDB employs a dynamic schema design, users have
unparalleled flexibility when creating data records, querying document collections through
MongoDB aggregation and analyzing large amounts of information.
With so many database management solutions currently available, it can be hard to choose the right
solution for your enterprise. Here are some common solution comparisons and best use cases that
can help you decide.
MongoDB vs. MySQL
MySQL uses a structured query language to access stored data. In this format, schemas are used to
create database structures, utilizing tables as a way to standardize data types so that values are
searchable and can be queried properly. A mature solution, MySQL is useful for a variety of
situations including website databases, applications and commercial product management.
Because of its rigid nature, MySQL is preferable to MongoDB when data integrity and isolation are
essential, such as when managing transactional data. But MongoDB’s less-restrictive format and
higher performance make it a better choice, particularly when availability and speed are primary
concerns.
Mobile applications
MongoDB’s JSON document model lets you store back-end application data wherever you need it,
including in Apple iOS and Android devices as well as cloud-based storage solutions. This
flexibility lets you aggregate data across multiple environments with secondary and geospatial
indexing, giving developers the ability to scale their mobile applications seamlessly.
Real-time analytics
As companies scale their operations, gaining access to key metrics and business insights from large
pools of data is critical. MongoDB handles the conversion of JSON and JSON-like documents, such
as BSON, into Java objects effortlessly, making the reading and writing of data in MongoDB fast
and incredibly efficient when analyzing real-time information across multiple development
environments. This has proved beneficial for several business sectors, including government,
financial services and retail.
Content management systems
Content management systems (CMS) are powerful tools that play an important role in ensuring
positive user experiences when accessing e-commerce sites, online publications, document
management platforms and other applications and services. By using MongoDB, you can easily add
new features and attributes to your online applications and websites using a single database and
with high availability.
Enterprise Data Warehouse
The Apache Hadoop framework is a collection of open source modules, including Hadoop
Distributed File System and Hadoop MapReduce, that work with MongoDB to store, process and
analyze large amounts of data. Organizations can use MongoDB and Hadoop to perform risk
modeling, predictive analytics and real-time data processing.
MongoDB benefits
Over the years, MongoDB has become a trusted solution for many businesses that are looking for a
powerful and highly scalable NoSQL database. But MongoDB is much more than just a traditional
document-based database and it boasts a few great capabilities that make it stand out from other
DBMS.
Load balancing
As enterprises' cloud applications scale and resource demands increase, problems can arise in
securing the availability and reliability of services. MongoDB’s load-balancing sharding process distributes large data sets across multiple virtual machines at once while still maintaining acceptable read and write throughputs. This horizontal scaling is called sharding, and it helps organizations
avoid the cost of vertical scaling of hardware while still expanding the capacity of cloud-based
deployments.
Ad hoc queries
One of MongoDB’s biggest advantages over other databases is its ability to handle ad hoc queries
that don’t require predefined schemas. MongoDB databases use a query language that’s similar to
SQL databases and is extremely approachable for beginner and advanced developers alike. This
accessibility makes it easy to push, query, sort, update and export your data with common help
methods and simple shell commands.
Multilanguage support
One of the great things about MongoDB is its multilanguage support. Several versions of
MongoDB have been released and are in continuous development with driver support for popular
programming languages, including Python, PHP, Ruby, Node.js, C++, Scala, JavaScript and many
more.
Key-Value Databases
In a key-value database, data is stored as simple key-value pairs. The key could be anything, depending on restrictions imposed by the database software, but it needs to be unique in the database so there is no ambiguity when searching for the key and its value. The value could be anything, including a list or another key-value pair. Some database software allows you to specify a data type for the value.
In traditional relational database design, data is stored in tables composed of rows and columns. The
database developer specifies many attributes of the data to be stored in the table upfront. This
creates significant opportunities for optimizations such as data compression and performance
around aggregations and data access, but also introduces some inflexibility.
Key-value stores, on the other hand, are typically much more flexible and offer very fast
performance for reads and writes, in part because the database is looking for a single key and is
returning its associated value rather than performing complex aggregations.
A simple example of a key-value pair is a phone directory, where the key is the person's name and the value is their phone number. Stock trading data is another example of a key-value pair. In this case, you may have a key associated with values for the stock ticker, whether the trade was a buy or a sell, the number of shares, or the price of the trade. A brief sketch of both examples with a key-value client is shown below.
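A minimal sketch of both examples with the redis-py client follows; the key names, the JSON-string value, and the local Redis instance are assumptions made for illustration.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Phone directory: the person's name is the key, the number is the value
r.set("phone:tom", "765-555-5555")
print(r.get("phone:tom"))

# Stock trade: key built from ticker and trade id, value bundles the trade details
r.set("trade:ACME:1001", '{"side": "buy", "shares": 100, "price": 17.25}')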
Key-value store advantages
There are a few advantages that a key-value store provides over traditional row-column-based
databases. Thanks to the simple data format that gives it its name, a key-value store can be very fast
for read and write operations. And key-value stores are very flexible, a valued asset in modern
programming as we generate more data without traditional structures.
Also, key-value stores do not require placeholders such as “null” for optional values, so they may
have smaller storage requirements, and they often scale almost linearly with the number of nodes.
Key-value database use cases
The advantages listed above naturally lend themselves to several popular use cases for key-value
databases.
• Web applications may store user session details and preference in a key-value store. All the
information is accessible via user key, and key-value stores lend themselves to fast reads and writes.
• Real-time recommendations and advertising are often powered by key-value stores because
the stores can quickly access and present new recommendations or ads as a web visitor moves
throughout a site.
• On the technical side, key-value stores are commonly used for in-memory data caching to
speed up applications by minimizing reads and writes to slower disk-based systems. Hazelcast is an
example of a technology that provides an in-memory key-value store for fast data retrieval.
A distributed key-value store builds on the advantages and use cases described above by providing
them at scale. A distributed key-value store is built to run on multiple computers working together,
and thus allows you to work with larger data sets because more servers with more memory now hold
the data. By distributing the store across multiple servers, you can increase processing performance.
And if you leverage replication in your distributed key-value store, you increase its fault tolerance.
Hazelcast is an example of a technology that provides a distributed key-value store for larger-scale
deployments. The “IMap” data type in Hazelcast, similar to the “Map” type in Java, is a key-value
store stored in memory. Unlike the Java Map type, Hazelcast IMaps are stored in memory in a
distributed manner across the collective RAM in a cluster of computers, allowing you to store much
more data than possible on a single computer. This gives you quick lookups with in-memory speeds
while also retaining other important capabilities such as high availability and security.
Columnar Data Model of NoSQL
The columnar data model is an important NoSQL data model. NoSQL databases differ from SQL databases because they use a data model with a different structure than the row-and-column table model used by relational database management systems (RDBMS). NoSQL databases use a flexible schema model that is designed to scale horizontally across many servers and to handle large volumes of data.
Whereas a relational database stores data in rows and also reads the data row by row, a column store is organized as a set of columns. So if someone wants to run analytics on a small number of columns, those columns can be read directly without consuming memory on unwanted data. Values within a column are of the same type, which allows more efficient compression and makes reads faster. Examples of the columnar data model: Apache Cassandra and Apache HBase.
Working of Columnar Data Model:
The columnar data model organizes information into columns instead of rows, while still functioning much like tables in a relational database. This type of data model is much more flexible, as is typical of NoSQL databases. The example below helps in understanding the columnar data model (a small sketch follows it).
Row-Oriented Table: (table not reproduced here)
The columnar data model uses the concept of a keyspace, which is like a schema in relational models.
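The toy sketch below (plain Python, no particular column store) contrasts a row-oriented layout with a column-oriented layout of the same data and shows why an aggregation over one column touches less data; all names and values are made up.

# Row-oriented: each record is kept together
row_store = [
    {"id": 1, "name": "Asha",  "year": 2023},
    {"id": 2, "name": "Ravi",  "year": 2023},
    {"id": 3, "name": "Meena", "year": 2024},
]

# Column-oriented: each column is kept together
column_store = {
    "id":   [1, 2, 3],
    "name": ["Asha", "Ravi", "Meena"],
    "year": [2023, 2023, 2024],
}

# Aggregation over one column scans a single list...
enrolled_2023 = sum(1 for y in column_store["year"] if y == 2023)
# ...whereas the row store must touch every whole row
enrolled_2023_rows = sum(1 for r in row_store if r["year"] == 2023)
print(enrolled_2023, enrolled_2023_rows)   # 2 2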
Advantages of Columnar Data Model:
• Well Structured: Since these data models compress well, they are very structured and well organized in terms of storage.
• Flexibility: A large amount of flexibility, as it is not necessary for the columns to look like each other, which means one can add new and different columns without disrupting the whole database.
• Aggregation queries are fast: Aggregation queries are quite fast because most of the information for a column is stored together. An example would be adding up the total number of students enrolled in one year.
• Scalability: It can be spread across large clusters of machines, even numbering in the thousands.
• Load Time: Load times are excellent, since a whole column can easily be loaded in a few seconds.
Disadvantages of Columnar Data Model:
• Designing indexing schema: Designing an effective and working schema is difficult and very time-consuming.
• Suboptimal data loading: Incremental data loading is suboptimal and should be avoided, although this might not be an issue for some users.
• Security vulnerabilities: If security is one of the priorities, note that the columnar data model lacks inbuilt security features; in that case, one must look into relational databases.
• Online Transaction Processing (OLTP): OLTP applications are also not compatible with columnar data models because of the way data is stored.
Applications of Columnar Databases
Columnar Data Model is very much used in various Blogging Platforms.
It is used in Content management systems like WordPress, Joomla, etc.
It is used in Systems that maintain counters.
It is used in Systems that require heavy write requests.
It is used in Services that have expiring usage.
Graph Databases:
A graph database is a type of NoSQL database that is designed to handle data with complex
relationships and interconnections. In a graph database, data is stored as nodes and edges, where
nodes represent entities and edges represent the relationships between those entities.
• Graph databases are particularly well-suited for applications that require deep and complex queries, such as social networks, recommendation engines, and fraud detection systems. They can also be used for other types of applications, such as supply chain management and network and IT operations.
• One of the main advantages of graph databases is their ability to handle and represent
relationships between entities. This is because the relationships between entities are as
important as the entities themselves, and often cannot be easily represented in a traditional
relational database.
• Another advantage of graph databases is their flexibility. Graph databases can handle data with
changing structures and can be adapted to new use cases without requiring significant changes
to the database schema. This makes them particularly useful for applications with rapidly
changing data structures or complex data requirements.
• However, graph databases may not be suitable for all applications. For example, they may not
be the best choice for applications that require simple queries or that deal primarily with data
that can be easily represented in a traditional relational database. Additionally, graph databases
may require more specialized knowledge and expertise to use effectively.
Some popular graph databases include Neo4j, OrientDB, and ArangoDB. These databases provide a
range of features, including support for different data models, scalability, and high availability, and
can be used for a wide variety of applications.
As we all know the graph is a pictorial representation of data in the form of nodes and relationships
which are represented by edges. A graph database is a type of database used to represent the data in
the form of a graph. It has three components: nodes, relationships, and properties. These
components are used to model the data. The concept of a Graph Database is based on the theory of
graphs. It was introduced in the year 2000. Graph databases are commonly referred to as NoSQL databases because data is stored using nodes, relationships, and properties instead of traditional tables. A graph database
is very useful for heavily interconnected data. Here relationships between data are given priority
and therefore the relationships can be easily visualized. They are flexible as new data can be added
without hampering the old ones. They are useful in the fields of social networking, fraud
detection, AI Knowledge graphs etc.
In traditional databases, the relationships between data are not stored explicitly. In a graph database, the relationships between data are prioritized. Nowadays, data is mostly interconnected, with one piece of data connected directly or indirectly to another. Since the concept of this database is based on graph theory, it is flexible and works very fast for associative data. Data items are often interconnected, which also helps to establish further relationships. Querying is fast as well, because with the help of relationships we can quickly find the desired nodes. Join operations are not required in this database, which reduces the cost. The relationships and properties are stored as first-class entities in a graph database.
Graph databases also allow organizations to connect their data with external sources. Since organizations handle huge amounts of data, it often becomes cumbersome to store it in the form of tables. For instance, if an organization wants to find a particular piece of data that is connected with data in another table, a join operation is first performed between the tables and the search is then done row by row. A graph database solves this problem: it stores the relationships and properties along with the data, so if the organization needs to search for a particular piece of data, the nodes can be found with the help of relationships and properties, without joins and without traversing row by row. Thus the search for nodes does not depend on the amount of data.
Types of Graph Databases:
Property Graphs: These graphs are used for querying and analyzing data by modelling the relationships among the data. They comprise vertices that hold information about a particular subject and edges that denote the relationships. The vertices and edges have additional attributes called properties.
RDF Graphs: RDF stands for Resource Description Framework. RDF graphs focus more on data integration and are used to represent complex data with well-defined semantics. Each statement is represented by three elements: two vertices and an edge, which reflect the subject, predicate, and object of the statement. Every vertex and edge is identified by a URI (Uniform Resource Identifier).
When to Use a Graph Database?
Graph databases should be used for heavily interconnected data.
They should be used when the amount of data is large and relationships are present.
They can be used to represent a cohesive picture of the data.
How Graphs and Graph Databases Work?
Graph databases provide graph models. They allow users to perform traversal queries, since the data is connected. Graph algorithms are also applied to find patterns, paths, and other relationships, enabling deeper analysis of the data. The algorithms help to explore neighboring nodes and clusters of vertices and to analyze relationships and patterns. Countless joins are not required in this kind of database.
Examples of Graph Database use:
Recommendation engines in e-commerce use graph databases to provide customers with accurate recommendations and updates about new products, thus increasing sales and satisfying the customers' desires.
Social media companies use graph databases to find the "friends of friends" or products that a user's friends like, and send suggestions accordingly to the user.
Graph databases also play a major role in fraud detection. Users can create a graph from the transactions between entities and store other important information. Once created, running a simple query will help to identify the fraud.
Advantages of Graph Databases:
A potential advantage of a graph database is establishing relationships with external sources as well.
No joins are required, since the relationships are already specified.
Querying depends on the concrete relationships and not on the amount of data.
It is flexible and agile.
It is easy to manage the data in terms of a graph.
Efficient data modeling: Graph databases allow for efficient data modeling by representing data as
nodes and edges. This allows for more flexible and scalable data modeling than traditional
relational databases.
Flexible relationships: Graph databases are designed to handle complex relationships and
interconnections between data elements. This makes them well-suited for applications that require
deep and complex queries, such as social networks, recommendation engines, and fraud detection
systems.
High performance: Graph databases are optimized for handling large and complex datasets, making
them well-suited for applications that require high levels of performance and scalability.
Scalability: Graph databases can be easily scaled horizontally, allowing additional servers to be
added to the cluster to handle increased data volume or traffic.
Easy to use: Graph databases are typically easier to use than traditional relational databases. They
often have a simpler data model and query language, and can be easier to maintain and scale.
In short, a graph database deals with a typically interconnected set of data, unlike a traditional database. Although the graph database is still in a developmental phase, it is becoming an important part of modern systems as businesses and organizations adopt big data, and graph databases help in complex analysis. Thus these databases have become a must for today's needs and tomorrow's success.
The graph-based data model in NoSQL is a type of data model that focuses on the relationships between data elements. As the name suggests, each element is stored as a node, and the associations between elements are known as links (edges). Associations are stored directly, since they are first-class elements of the data model. These data models give us a conceptual view of the data.
These data models are based on a graph (network) structure. In graph theory we have terms like nodes, edges, and properties; let's see what they mean in the graph-based data model.
Nodes: These are instances of data that represent the objects to be tracked.
Edges: Edges represent the relationships between nodes.
Properties: Properties represent information associated with nodes (and edges).
The image below (not reproduced here) represents nodes with properties, connected by relationships represented as edges; a small sketch in code follows.
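As a rough sketch of these three components, the snippet below models nodes, edges, and properties with plain Python dictionaries and answers a "what do my friends like?" question by following edges instead of performing joins; every identifier here is illustrative.

nodes = {
    "u1": {"label": "User",    "name": "Tom"},
    "u2": {"label": "User",    "name": "Donna"},
    "p1": {"label": "Product", "name": "Spa voucher"},
}

edges = [
    {"from": "u1", "to": "u2", "type": "FRIEND_OF", "since": 2019},  # edges carry properties too
    {"from": "u2", "to": "p1", "type": "LIKES"},
]

def liked_by_friends(user):
    friends = [e["to"] for e in edges if e["from"] == user and e["type"] == "FRIEND_OF"]
    return [e["to"] for e in edges if e["from"] in friends and e["type"] == "LIKES"]

print(liked_by_friends("u1"))   # ['p1'] - reached purely by traversing edges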
A wide-column database is a NoSQL database that organizes data storage into flexible columns that
can be spread across multiple servers or database nodes, using multi-dimensional mapping to
reference data by column, row, and timestamp.
A wide-column database is a type of NoSQL database in which the names and format of the
columns can vary across rows, even within the same table. Wide-column databases are also known
as column family databases. Because data is stored in columns, queries for a particular value in a
column are very fast, as the entire column can be loaded and searched quickly. Related columns can
be modeled as part of the same column family.
What Are Advantages of a Wide-column Database?
A relational database management system (RDBMS) stores data in a table with rows that all span a
number of columns. If one row needs an additional column, that column must be added to the entire
table, with null or default values provided for all the other rows. If you need to query that RDBMS
table for a value that isn’t indexed, the table scan to locate those values will be very slow.
Wide-column NoSQL databases still have the concept of rows, but reading or writing a row of data
consists of reading or writing the individual columns. A column is only written if there’s a data
element for it. Each data element can be referenced by the row key, but querying for a value is
optimized like querying an index in a RDBMS, rather than a slow table scan.
Neo4j is a graph database. A graph database, instead of having rows and columns, has nodes, edges, and properties. For many use cases it is more suitable for big data and analytics applications than row-and-column databases or free-form JSON document databases.
A graph database is used to represent relationships. The most common examples are the Facebook Friend relationship and the Like relationship. You can see some of that in the graphic below from Neo4j (not reproduced here).
The circles are nodes. The lines, called edges, indicate relationships. And the annotations inside the circles are properties of that node.
We write about Neo4j here because it has the largest market share. There are other players in this
market. And according to Neo4J, Apache Spark 3.0 will add the Neo4j Cypher Query Language to
allow and make easier “property graphs based on DataFrames to Spark.” Spark already supports
GraphX, which is an extension of the RDD to support graphs. We will discuss that in another blog post. In another post we will also discuss graph algorithms; the most famous of those is Google's PageRank. Algorithms are the way to navigate the nodes and edges.
Costs?: Is Neo4J free? That’s rather complicated. The Community Edition is. So is the desktop
version, suitable for learning. The Enterprise edition is not. That is consistent with other opensource
products. When I asked Neo4J for a license to work with their product for an extended period of
time they recommended that I use the desktop version. The Enterprise version has a 30-day trial
period.
There are other alternatives in the market. The key would be to pick one that has enough users that its vendor does not go out of business. Which one should you use? You will have to do research to figure that out.
Install Neo4j: You can use the Desktop or tar version. Here I am using the tar version, on Mac. Just download it and then start up the shell as shown below. You will need a Java JDK, for example:
export JAVA_HOME='/Library/Java/JavaVirtualMachines/jdk1.8.0_201.jdk/Contents/Home'
Start the server and set the initial password, then open cypher-shell. The default URL is the rather strange-looking bolt://localhost:7687.
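From here, queries are written in Cypher. The sketch below uses the official neo4j Python driver against the local server started above; the password, node labels, and property names are assumptions for illustration.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your-password"))

with driver.session() as session:
    # Create two Person nodes joined by a FRIEND relationship
    session.run(
        "CREATE (a:Person {name: $a})-[:FRIEND]->(b:Person {name: $b})",
        a="Alice", b="Bob",
    )
    # Traverse the relationship back out
    result = session.run("MATCH (a:Person)-[:FRIEND]->(b) RETURN a.name, b.name")
    for record in result:
        print(record["a.name"], "->", record["b.name"])

driver.close()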