Unit-5 Notes

The document discusses NoSQL databases, including key features and types. It describes document databases, key-value stores, wide column stores, and graph databases. It also covers the CAP theorem and how MongoDB and Cassandra relate to consistency and availability.

Uploaded by

Shyam
Copyright
© All Rights Reserved

UNIT-V - NoSQL DATABASE

Introduction to NoSQL - CAP Theorem - Data Models - Key-Value Databases - Document
Databases - Column Family Stores - Graph Databases - Working of NoSQL Using
MONGODB/CASSANDRA.

NoSQL DATABASE:

o The term NoSQL database refers to a non-SQL, or non-relational, database.
o It provides a mechanism for storing and retrieving data that is modelled by means other
than the tabular relations used in relational databases. A NoSQL database does not rely
on fixed relational tables for storing data. It is generally used for big data and
real-time web applications.

Types of NoSQL databases:

Over time, four major types of NoSQL databases emerged: document databases, key-value
databases, wide-column stores, and graph databases.

• Document databases store data in documents similar to JSON (JavaScript Object
Notation) objects. Each document contains pairs of fields and values. The values can
typically be a variety of types including things like strings, numbers, booleans, arrays,
or objects.
• Key-value databases are a simpler type of database where each item contains keys
and values.
• Wide-column stores store data in tables, rows, and dynamic columns.
• Graph databases store data in nodes and edges. Nodes typically store information
about people, places, and things, while edges store information about the relationships
between the nodes.
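
The four types above can be compared side by side with a small sketch. This is only an illustration using plain Python structures, not any database's API; the user names, keys, and the followed_by helper are all hypothetical.

```python
# Hypothetical data expressed in each of the four NoSQL models.

# Document model: one self-contained, JSON-like document per entity.
document = {
    "_id": "u1",
    "name": "Asha",
    "tags": ["admin", "dev"],          # values may be arrays...
    "address": {"city": "Pune"},       # ...or nested objects
}

# Key-value model: a value looked up by exactly one key.
key_value = {"user:u1:name": "Asha", "user:u1:city": "Pune"}

# Wide-column model: row key -> column family -> dynamic columns.
wide_column = {
    "u1": {
        "profile": {"name": "Asha", "city": "Pune"},
        "activity": {"last_login": "2024-01-01"},
    }
}

# Graph model: nodes hold entities, edges hold the relationships.
nodes = {"u1": {"name": "Asha"}, "u2": {"name": "Ravi"}}
edges = [("u1", "FOLLOWS", "u2")]


def followed_by(user_id):
    """Return the names of the users that `user_id` follows."""
    return [nodes[dst]["name"] for src, rel, dst in edges
            if src == user_id and rel == "FOLLOWS"]
```

The same fact (a user's city) is reached by a nested field in the document model, by a composed key in the key-value model, and by row key plus column family in the wide-column model, while only the graph model stores the relationship between two users as data in its own right.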

Advantages of NoSQL:

o Most NoSQL systems provide their own query language or API.
o It provides fast performance.
o It provides horizontal scalability.

CAP THEOREM:

The CAP theorem states that a distributed system can deliver only two of three desired
characteristics: consistency, availability, and partition tolerance (the 'C', 'A' and 'P'
in CAP).

A distributed system is a network that stores data on more than one node (physical or virtual
machines) at the same time. Because all cloud applications are distributed systems, it's
essential to understand the CAP theorem when designing a cloud app so that you can choose
a data management system that delivers the characteristics your application needs most.

The CAP theorem is also called Brewer's Theorem, because it was first advanced by
Professor Eric A. Brewer during a talk he gave on distributed computing in 2000. Two years
later, Seth Gilbert and Nancy Lynch of MIT published a proof of "Brewer's Conjecture."

The three distributed system characteristics to which the CAP theorem refers are:

Consistency:

Consistency means that all clients see the same data at the same time, no matter which node
they connect to. For this to happen, whenever data is written to one node, it must be instantly
forwarded or replicated to all the other nodes in the system before the write is deemed
'successful'.

Availability:

Availability means that any client making a request for data gets a response, even if one or
more nodes are down. Another way to state this—all working nodes in the distributed system
return a valid response for any request, without exception.

Partition tolerance:

A partition is a communications break within a distributed system: a lost or temporarily
delayed connection between two nodes. Partition tolerance means that the cluster must
continue to work despite any number of communication breakdowns between nodes in the
system.

CAP theorem NoSQL database types:

NoSQL databases are ideal for distributed network applications. Unlike their vertically
scalable SQL (relational) counterparts, NoSQL databases are horizontally scalable and
distributed by design: they can rapidly scale across a growing network consisting of multiple
interconnected nodes.

NoSQL databases are classified based on the two CAP characteristics they support:

• CP database: A CP database delivers consistency and partition tolerance at the
expense of availability. When a partition occurs between any two nodes, the system
has to shut down the non-consistent node (i.e., make it unavailable) until the partition
is resolved.

• AP database: An AP database delivers availability and partition tolerance at the
expense of consistency. When a partition occurs, all nodes remain available but those
at the wrong end of a partition might return an older version of data than others.
(When the partition is resolved, the AP databases typically resync the nodes to repair
all inconsistencies in the system.)

• CA database: A CA database delivers consistency and availability across all nodes. It
can't do this if there is a partition between any two nodes in the system, however, and
therefore can't deliver fault tolerance.
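
The difference between the CP and AP behaviours described above can be made concrete with a toy simulation. This is a minimal sketch under simplifying assumptions, not how any real database is implemented: the Replica and Cluster classes and the isolate and heal methods are hypothetical names, each node stores a single value, and a partition always cuts off exactly the nodes marked as isolated.

```python
class Replica:
    """One node holding a single value plus a version counter."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.version = 0
        self.isolated = False  # True while cut off from its peers by a partition


class Cluster:
    def __init__(self, mode, names=("n1", "n2", "n3")):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {name: Replica(name) for name in names}
        self.version = 0

    def isolate(self, name):
        """Simulate a partition that cuts one node off from the others."""
        self.replicas[name].isolated = True

    def write(self, value):
        """Apply a write to every node that the partition has not cut off."""
        self.version += 1
        for replica in self.replicas.values():
            if not replica.isolated:
                replica.value, replica.version = value, self.version

    def read(self, name):
        replica = self.replicas[name]
        if replica.isolated and self.mode == "CP":
            # CP: the possibly stale node is made unavailable until the partition heals.
            raise RuntimeError(f"{name} is unavailable during the partition")
        # AP: the node still answers, possibly with an older value.
        return replica.value

    def heal(self):
        """Resolve the partition and resync every node to the newest version."""
        newest = max(self.replicas.values(), key=lambda r: r.version)
        value, version = newest.value, newest.version
        for replica in self.replicas.values():
            replica.isolated = False
            replica.value, replica.version = value, version
```

With mode "AP", a read from the isolated node succeeds but may return the value from before the partition; with mode "CP", the same read raises an error instead of returning stale data. Calling heal() plays the role of the resync step mentioned for AP databases.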

MongoDB and the CAP theorem

MongoDB is a popular NoSQL database management system that stores data as BSON
(binary JSON) documents. It's frequently used for big data and real-time applications running
at multiple different locations. Relative to the CAP theorem, MongoDB is a CP data store: it
resolves network partitions by maintaining consistency, while compromising on availability.

MongoDB is a single-master system: each replica set can have only one primary node that
receives all the write operations. All other nodes in the same replica set are secondary
nodes that replicate the primary node's operation log and apply it to their own data set.
By default, clients also read from the primary node, but they can specify a read preference
that allows them to read from secondary nodes.

When the primary node becomes unavailable, the secondary node with the most recent
operation log will be elected as the new primary node. Once all the other secondary nodes
catch up with the new primary, the cluster becomes available again. As clients can't make any
write requests during this interval, the data remains consistent across the entire network.
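
The election rule described above (the secondary with the most recent operation log becomes the new primary) can be sketched as follows. This is a deliberately simplified illustration: the Member class, its last_oplog_ts field, and elect_primary are hypothetical names, and the sketch ignores everything else a real election involves, such as voting and member priorities.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    last_oplog_ts: int    # timestamp of the newest operation this member has applied
    healthy: bool = True  # False means the member is currently unreachable


def elect_primary(members):
    """Pick the healthy member whose operation log is most up to date."""
    candidates = [m for m in members if m.healthy]
    if not candidates:
        raise RuntimeError("no healthy member available to become primary")
    return max(candidates, key=lambda m: m.last_oplog_ts)
```

For example, if the old primary is marked unhealthy and the two remaining secondaries have applied operations up to timestamps 8 and 9, elect_primary returns the member at timestamp 9, because it has the least catching up to do.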

Cassandra and the CAP theorem (AP)

Apache Cassandra is an open source NoSQL database maintained by the Apache Software
Foundation. It's a wide-column database that lets you store data on a distributed network.
However, unlike MongoDB, Cassandra has a masterless architecture, and as a result, it has
no single point of failure.

Relative to the CAP theorem, Cassandra is an AP database: it delivers availability and
partition tolerance but can't deliver consistency all the time. Because Cassandra doesn't have
a master node, any node can accept requests. Cassandra provides eventual consistency by
allowing clients to write to any node at any time and reconciling inconsistencies as quickly
as possible.

Data becomes inconsistent only in the case of a network partition, and inconsistencies are
quickly resolved; Cassandra also offers "repair" functionality to help nodes catch up with
their peers. The resulting constant availability yields a highly performant system that
might be worth the trade-off in many cases.
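
Reconciling replicas that diverged during a partition can be sketched with a last-write-wins merge, where each value carries a write timestamp and the newest one is kept. This is a minimal sketch of that general idea only; the reconcile function and the (value, timestamp) layout are assumptions made for illustration and do not reproduce Cassandra's actual repair mechanism.

```python
def reconcile(*replica_states):
    """Merge replica copies key by key, keeping the value with the newest timestamp.

    Each replica state maps key -> (value, write_timestamp).
    """
    merged = {}
    for state in replica_states:
        for key, (value, timestamp) in state.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    return merged
```

If one replica holds {"x": ("old", 1)} and another holds {"x": ("new", 2)}, the merged state keeps ("new", 2) for x, and keys that exist on only one replica are copied over unchanged.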

DATA MODELS IN NoSQL:

NoSQL data models can be divided into four main types:

• Document Stores
• Key-Value Stores
• Graph Databases
• Column Stores

Each type has its own unique strengths and weaknesses and is best suited to certain
types of applications or use cases. Here's a brief overview of each type.

[Figure: the four kinds of NoSQL data model]

DOCUMENT DATA MODEL:

A document data model differs from other data models in that it stores data as JSON,
BSON, or XML documents. In this data model, documents can be nested inside other
documents, and individual elements can be indexed so that queries run faster. Documents
are often stored and retrieved in a form close to the data objects used in applications,
so very little translation is needed to use the data. JSON is the format most often used
to store and query the data.
In the document data model, each document consists of key-value pairs. An example is
shown below.
{
  "Name" : "Yashodhra",
  "Address" : "Near Patel Nagar",
  "Email" : "[email protected]",
  "Contact" : "12345"
}
Working of Document Data Model:
The document data model is semi-structured: a record and the data associated with it are
stored together in a single document, so the data is not completely unstructured. The key
point is that the document is the unit of storage.
Features:

• Document type model: Data is stored in documents rather than tables or graphs, so it
is easy to map to the objects of many programming languages.
• Flexible schema: The schema is very flexible; not all documents in a collection need
to have the same fields.
• Distributed and resilient: Document data models are designed to be distributed, which
is what enables horizontal scaling and distribution of data.
• Manageable query language: The query language lets developers perform CRUD
(Create, Read, Update, Delete) operations on the data model.

Examples of Document Data Models:

• Amazon DocumentDB
• MongoDB
• Cosmos DB
• ArangoDB
• Couchbase Server
• CouchDB

Advantages:
• Schema-less: Document stores are very good at retaining existing data at massive
volumes because they place no restrictions on the format and structure of stored data.
• Faster creation and maintenance of documents: It is very simple to create a document,
and it requires almost no maintenance afterwards.
• Open formats: The build process is simple and uses open formats such as XML and JSON
and their variants.
• Built-in versioning: As documents grow in size they may also grow in complexity;
built-in versioning decreases conflicts.

Disadvantages:
• Weak atomicity: Many document stores offer limited support for multi-document ACID
transactions (MongoDB added them only in version 4.0). Without them, a change
involving two collections requires two separate queries, one for each collection, which
breaks atomicity requirements.
• Consistency check limitations: The database does not enforce references between
collections, so one can, for example, store book documents that are not connected to any
document in an author collection. Searching the collections to find such documents
can hurt database performance.
• Security: Many web applications lack adequate security, which results in the leakage
of sensitive data, so one must pay attention to web application vulnerabilities.

Applications of Document Data Model:
• Content management: These data models are widely used for video streaming platforms,
blogs, and similar services, because each item is stored as a single document and the
database is easier to maintain as the service evolves over time.
• Book database: These data models are useful for book databases because they let us
nest related data inside a single document.
• Catalog: These data models are widely used for storing and reading catalog files
because reads stay fast even when catalogs have thousands of attributes stored.
• Analytics platform: These data models are widely used in analytics platforms.

KEY-VALUE (KV) DATA MODEL:

A key-value data model, or database, is also referred to as a key-value store. It is a
non-relational type of database. It uses an associative array as its basic data structure,
in which each individual key is linked with exactly one value in a collection. Keys are
unique identifiers for the values, and a value can be any kind of entity. A collection of
key-value pairs stored as separate records is called a key-value database, and it has no
predefined structure.

How do key-value databases work?

A key-value database associates a value, which may be anything from a simple string to a
complex entity, with a key that is used to track that entity. Like a map, array, or
dictionary object in many programming languages, a key-value database maps keys to
values, but it is stored persistently and managed by a DBMS.
The key-value store uses an efficient and compact index structure so that it can quickly
and reliably find a value by its key. For example, Redis is a key-value store used to keep
lists, maps, sets, and primitive types (simple data structures) in a persistent database.
By supporting only a predetermined number of value types, Redis can expose a very simple
interface for querying and manipulating them, and when properly configured it is capable
of high throughput.
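
The map-like behaviour described above can be sketched as a tiny in-memory store with the three basic operations. This is an illustration only: the KeyValueStore class and its put, get, and delete methods are hypothetical names, and a real key-value database would add persistence and a DBMS around the same idea.

```python
class KeyValueStore:
    """A minimal in-memory key-value store: each key maps to exactly one value."""

    def __init__(self):
        self._data = {}  # a hash map gives fast lookups by key

    def put(self, key, value):
        """Store `value` under `key`, replacing any previous value."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value stored under `key`, or `default` if it is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove `key`; return True if it existed, False otherwise."""
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("session:42", {"user": "asha", "cart": ["book"]})  # the value can be any entity
store.put("counter", 1)
```

Every operation goes through the key: there is no way to ask the store for "all sessions with a non-empty cart" without already knowing the keys, which is the trade-off behind the simplicity.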

When to use a key-value database:

Here are a few situations in which you can use a key-value database:
• Storing user session attributes in an online application such as finance or gaming,
which requires real-time random data access.
• Caching repeatedly accessed data, or any key-based design.
• Applications whose queries are based on keys.

Features:

• One of the simplest kinds of NoSQL data models.
• Key-value databases use simple functions for storing, getting, and removing data.
• Key-value databases typically have no query language.
• Built-in redundancy makes this database more reliable.

Advantages:

• It is very easy to use. Because the database is so simple, values can be of any kind,
or even of different kinds when required.
• Its response time is fast due to its simplicity, provided that the surrounding
environment is well built and optimized.
• Key-value store databases are scalable vertically as well as horizontally.
• Built-in redundancy makes this database more reliable.

Disadvantages:

• Because key-value databases have no query language, queries cannot be ported from
one database to another.
• The key-value store is not a refined query tool: you cannot query the database without
a key.

Some examples of key-value databases:

Here are some popular key-value databases which are widely used:
• Couchbase: It permits SQL-style querying and text search.
• Amazon DynamoDB: The most widely used key-value database, trusted by a large number
of users. It can easily handle a large number of requests every day and also provides
various security options.
• Riak: A distributed key-value database used for developing applications.
• Aerospike: An open-source, real-time database that handles billions of transactions.
• Berkeley DB: A high-performance, open-source database providing scalability.

GRAPH DATA MODEL:

A graph-based data model in NoSQL focuses on the relationships between data elements.
As the name suggests, each element is stored as a node, and the associations between
elements are known as links (edges). Associations are stored directly because they are
first-class elements of the data model. These data models give us a conceptual view of
the data.
They are based on a topological network structure. Graph theory uses terms such as
nodes, edges, and properties; here is what they mean in the graph-based data model:
• Nodes: Instances of data that represent the objects to be tracked.
• Edges: Relationships between nodes.
• Properties: Information associated with nodes.

[Figure: nodes with properties, connected by edges that represent relationships]

Working of Graph Data Model:

In these data models, nodes that are related are physically connected, and the connection
itself is treated as a piece of data. Connecting data in this way makes relationships easy
to query: the data model reads a relationship directly from storage instead of computing
it through a series of query steps. Like many other NoSQL databases, these data models do
not enforce a fixed schema, which keeps the model flexible and easy to edit.
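
Reading relationships directly, as described above, can be sketched with an adjacency list in which every node keeps its own outgoing edges. This is a minimal illustration, not a graph database API: the node names, the FRIEND relationship type, and the neighbours and within_hops helpers are hypothetical.

```python
# Nodes with properties.
nodes = {
    "alice": {"city": "Pune"},
    "bob": {"city": "Delhi"},
    "carol": {"city": "Chennai"},
}

# Edges are stored with the node they start from, so following a relationship
# is a direct lookup rather than a search over all records.
edges = {
    "alice": [("FRIEND", "bob")],
    "bob": [("FRIEND", "carol")],
    "carol": [],
}


def neighbours(node, relationship):
    """Return the nodes reachable from `node` over one edge of the given type."""
    return [dst for rel, dst in edges.get(node, []) if rel == relationship]


def within_hops(start, relationship, hops):
    """Return every node reachable from `start` in at most `hops` steps."""
    seen = {start}
    frontier = [start]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for other in neighbours(node, relationship):
                if other not in seen:
                    seen.add(other)
                    next_frontier.append(other)
        frontier = next_frontier
    return seen - {start}
```

A "friends of friends" question becomes within_hops("alice", "FRIEND", 2), which walks stored edges outward from one node instead of joining tables.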
Examples of Graph Data Models:

• JanusGraph: An open-source, scalable graph database system that is very helpful in
big data analytics. JanusGraph has features such as:
    • Storage: Many options are available for storing graph data, such as Cassandra.
    • Support for transactions: Transactional support such as ACID (Atomicity,
    Consistency, Isolation, and Durability) is available, and it can serve thousands of
    concurrent users.
    • Searching options: Complex search is available through optional support.
• Neo4j: The name is often expanded as Network Exploration and Optimization 4 Java. As
the name suggests, this graph database is written in Java, with native graph storage and
processing. Neo4j has features such as:
    • Scalable: It scales by partitioning data into pieces known as shards.
    • Higher availability: Availability is high thanks to continuous backups and
    rolling upgrades.
    • Query language: It uses Cypher, a programmer-friendly graph query language.
• DGraph: An open-source distributed graph database system designed for scalability.
DGraph's main features are:
    • Query language: It uses GraphQL, which was designed for APIs.
    • Open-source system: It supports many open standards.

Advantages of Graph Data Model:
• Structure: The structures are agile and easy to work with.
• Explicit representation: Relationships between entities are represented explicitly.
• Real-time output results: Queries return results in real time.

Disadvantages of Graph Data Model:

• No standard query language: The query language depends on the platform in use, so
there is no single standard.
• Poor fit for transactions: Graph databases are poorly suited to transaction-based
systems.
• Small user base: The user base is small, which makes it difficult to get support
when running into a problem.

Applications of Graph Data Model:

• Fraud detection, where analysing relationships is both useful and important.
• Digital asset management, where the graph provides a scalable database model to keep
track of digital assets.
• Network management, where it alerts a network administrator about problems in a
network.
• Context-aware services, such as giving traffic updates.
• Real-time recommendation engines, which provide a better user experience.

COLUMNAR OR COLUMN-FAMILY DATA MODEL:

Whereas a relational database stores data in rows and reads it row by row, a column
store is organized as a set of columns. If someone wants to run analytics on a small
number of columns, those columns can be read directly without loading unwanted data into
memory. The values in a column are usually of the same type, so they compress more
efficiently, which makes reads faster. Examples of the columnar data model: Cassandra
and Apache HBase.

Working of Columnar Data Model:

The columnar data model organizes information into columns instead of rows. The columns
still function in much the same way as tables do in relational databases, but because
this is a type of NoSQL database the model is much more flexible. The example below
helps in understanding the columnar data model:
Row-Oriented Table:

S.No.  Name       Course  Branch       ID
01.    Tanmay     B-Tech  Computer     2
02.    Abhishek   B-Tech  Electronics  5
03.    Samriddha  B-Tech  IT           7
04.    Aditi      B-Tech  E & TC       8

Column-Oriented Table:

S.No.  Name       ID
01.    Tanmay     2
02.    Abhishek   5
03.    Samriddha  7
04.    Aditi      8

S.No.  Course     ID
01.    B-Tech     2
02.    B-Tech     5
03.    B-Tech     7
04.    B-Tech     8

S.No.  Branch       ID
01.    Computer     2
02.    Electronics  5
03.    IT           7
04.    E & TC       8

Columnar Data Model uses the concept of keyspace, which is like a schema in relational
models.
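
The row-oriented and column-oriented layouts shown in the tables above can be reproduced in a few lines. This is a sketch of the storage idea only, using plain Python lists and dictionaries; the variable names are hypothetical and no database is involved.

```python
# Row-oriented layout: one record per student, all fields kept together.
rows = [
    {"Name": "Tanmay", "Course": "B-Tech", "Branch": "Computer", "ID": 2},
    {"Name": "Abhishek", "Course": "B-Tech", "Branch": "Electronics", "ID": 5},
    {"Name": "Samriddha", "Course": "B-Tech", "Branch": "IT", "ID": 7},
    {"Name": "Aditi", "Course": "B-Tech", "Branch": "E & TC", "ID": 8},
]

# Column-oriented layout: one list per column, values kept in row order.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# An aggregate over a single column touches only that column's list.
student_count = len(columns["ID"])
distinct_courses = set(columns["Course"])

# The same question on the row layout has to visit every full record.
distinct_courses_from_rows = {row["Course"] for row in rows}
```

The Course column holds the same value four times, which illustrates why columns of a single type compress well and why aggregation over one column is fast.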

Advantages of Columnar Data Model:
• Well structured: Because these data models compress well, the data is well organized
in terms of storage.
• Flexibility: Columns do not need to look like each other, so new and different
columns can be added without disrupting the whole database.
• Fast aggregation queries: Aggregation queries are quite fast because most of the
information they need is stored in a single column. An example would be adding up the
total number of students enrolled in one year.
• Scalability: It can be spread across large clusters of machines, even numbering in
the thousands.
• Load times: A row table can be loaded in a few seconds, so load times are very good.

Disadvantages of Columnar Data Model:

• Designing an indexing schema: Designing an effective, working schema is difficult and
very time-consuming.
• Suboptimal data loading: Incremental data loading is suboptimal and should be
avoided, although this might not be an issue for some users.
• Security vulnerabilities: If security is a priority, note that the columnar data model
lacks built-in security features; in that case one should consider relational
databases.
• Online Transaction Processing (OLTP): OLTP applications are not a good fit for
columnar data models because of the way data is stored.

Applications of Columnar Data Model:

• Blogging platforms.
• Content management systems such as WordPress and Joomla.
• Systems that maintain counters.
• Systems that handle heavy write loads.
• Services with expiring usage.

MONGODB:

• MongoDB is a NoSQL database. It is an open-source, cross-platform, document-oriented
database written in C++.
• MongoDB is a document database that provides high performance, high availability,
and automatic scaling.
• In simple words, MongoDB is a document-oriented database. It is developed and
supported by the company 10gen, which was renamed MongoDB Inc. in 2013.
• MongoDB was originally available for free under the GNU Affero General Public License
(AGPL); since 2018 the server has been released under the Server Side Public License
(SSPL). It is also available under a commercial license from the vendor.
• 10gen defined MongoDB as: "MongoDB is a scalable, open source, high performance,
document-oriented database." - 10gen
• MongoDB was designed to work with commodity servers. It is now used by companies of
all sizes, across all industries.

Purpose of Building MongoDB:

The primary purposes of building MongoDB are:

o Scalability
o Performance
o High Availability
o Scaling from single-server deployments to large, complex multi-site architectures.

Key points of MongoDB:

o Develop Faster
o Deploy Easier
o Scale Bigger

Features of MongoDB

These are some important features of MongoDB:

1. Support for ad hoc queries

In MongoDB, you can search by field or by range, and regular expression searches are
also supported.

2. Indexing

You can index any field in a document.

3. Replication

MongoDB supports replication through replica sets (older versions called this
master-slave replication).

The primary (master) handles reads and writes; a secondary (slave) copies data from the
primary and can be used only for reads or backups, not writes.

4. Duplication of data

MongoDB can run over multiple servers. The data is duplicated to keep the system up and
running in case of hardware failure.

5. Load balancing

It balances load automatically because data is placed in shards.

6. Supports map reduce and aggregation tools.

7. Uses JavaScript instead of stored procedures.

8. It is a schema-less database written in C++.

9. Provides high performance.

10. Stores files of any size easily without complicating your stack.

11. Easy to administer in the case of failures.

12. It also supports:

o JSON data model with dynamic schemas
o Auto-sharding for horizontal scalability
o Built-in replication for high availability

Nowadays many companies use MongoDB to create new types of applications and to improve
performance and availability.
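
The load balancing and auto-sharding features listed above both rest on splitting data across shards. One common way to do that is to hash a key and use the result to pick a shard, sketched below. This is an illustration of the general idea under stated assumptions, not MongoDB's actual balancer: the shard_for function, the key format, and the shard count of 3 are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(key, shard_count):
    """Map a key to a shard number in the range 0 .. shard_count - 1.

    A cryptographic hash is used only because it is stable across runs and
    spreads similar keys apart; it is not used for security here.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


# Count how many of 300 hypothetical user keys land on each of 3 shards.
placement = Counter(shard_for(f"user{i}", 3) for i in range(300))
```

Because the shard is computed from the key alone, any server can work out where a document lives without consulting a central table, and a good hash spreads keys roughly evenly so that no single shard carries most of the load.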

MongoDB Database commands

MongoDB database commands are used to create, modify, and administer databases.

1. db.adminCommand(cmd)

The adminCommand method is a helper that runs the specified database command against
the admin database.

cmd: The command is specified either in document form or as a string. If the command is
given as a string, it cannot include any arguments.

Example:

Creating a user named JavaTpoint with the dbOwner role on the admin database.

db.adminCommand(
  {
    createUser: "JavaTpoint",
    pwd: passwordPrompt(),  // prompts for the password instead of hard-coding it
    roles: [
      { role: "dbOwner", db: "admin" }
    ]
  }
)

2. db.aggregate()

The aggregate method runs a specified diagnostic or admin pipeline, which does not
require an underlying collection.

Syntax:

db.aggregate( [ <pipeline> ], { <options> } )

The pipeline parameter is an array of stages to be executed. It does not require an
underlying collection and must start with a compatible stage, such as $currentOp or
$listLocalSessions.

Example:

The following example runs a pipeline with two stages. The first is the $currentOp
operation and the second filters its results.

use admin
db.aggregate( [
  // Stage 1: list operations for all users, including idle connections
  { $currentOp : { allUsers: true, idleConnections: true } },
  // Stage 2: keep only the operations on the shard named "shardDemo"
  { $match : { shard: "shardDemo" } }
] )

3. db.cloneDatabase("hostname")

The cloneDatabase method copies a database from a remote host to the current database,
and assumes that the database at the remote location has the same name as the current
database. (This method was deprecated in MongoDB 4.0 and removed in 4.2.)

The hostname parameter contains the hostname of the server that holds the database we
want to copy.

Example (the argument is the hostname of the remote server):

db.cloneDatabase("customers")

4. db.commandHelp(command)

The commandHelp method displays help for the specified database command. The command
parameter contains the name of a database command.

5. db.createCollection(name, options)

This method creates a new collection or view. Because MongoDB creates a collection
implicitly when it is first referenced in a command, createCollection is used primarily
to create collections with specific options.

For example, we will create a student collection with a JSON Schema validator:

db.createCollection( "student", {
  validator: { $jsonSchema: {
    bsonType: "object",
    required: [ "phone" ],
    properties: {
      phone: {
        bsonType: "string",
        description: "must be a string and is required"
      },
      email: {
        bsonType : "string",
        pattern: "@mongodb\.com$",
        description: "must be a string and match the regular expression pattern"
      },
      status: {
        enum: [ "Unknown", "Incomplete" ],
        description: "can only be one of the enum values"
      }
    }
  } }
} )
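
To see what the validator above accepts and rejects, its three rules can be restated as ordinary checks. This is a hand-written Python sketch of the same rules for illustration only; it is not how MongoDB evaluates $jsonSchema, and the validate_student function is a hypothetical name.

```python
import re


def validate_student(doc):
    """Return a list of rule violations for a student document (empty if valid)."""
    errors = []

    # "phone" is listed under `required` and must be a string.
    if not isinstance(doc.get("phone"), str):
        errors.append("phone: must be a string and is required")

    # "email" is optional, but if present it must be a string ending in @mongodb.com.
    if "email" in doc:
        email = doc["email"]
        if not (isinstance(email, str) and re.search(r"@mongodb\.com$", email)):
            errors.append("email: must be a string and match the regular expression pattern")

    # "status" is optional, but if present it must be one of the enum values.
    if "status" in doc and doc["status"] not in ("Unknown", "Incomplete"):
        errors.append("status: can only be one of the enum values")

    return errors
```

A document such as {"phone": "12345"} passes with no errors, while a document without a phone, with an email at another domain, or with a status outside the two listed values is reported with the corresponding description.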

6. db.dropDatabase(<writeConcern>)

The dropDatabase method removes the current database and its associated data files.

For example, we use the use <database> operation to switch the current database to the
temp database, and then call the db.dropDatabase() method to drop it:

use temp
db.dropDatabase()

CASSANDRA:

Apache Cassandra is a highly scalable, high-performance, distributed NoSQL database.
It is designed to handle huge amounts of data across many commodity servers, providing
high availability with no single point of failure.
Cassandra has a distributed architecture: data is placed on different machines with a
replication factor greater than one, so that it remains available when a machine fails.

Important Points of Cassandra:

o Cassandra is a column-oriented database.
o Cassandra is scalable and fault-tolerant, and offers tunable (by default eventual)
consistency.
o Cassandra's distribution design is based on Amazon's Dynamo and its data model on
Google's Bigtable.
o Cassandra was created at Facebook. It is very different from relational database
management systems.
o Cassandra follows a Dynamo-style replication model with no single point of failure,
but adds a more powerful "column family" data model.
o Cassandra is used by some of the biggest companies, such as Facebook, Twitter,
Cisco, Rackspace, eBay, Netflix, and more.

Cassandra vs MongoDB

Cassandra and MongoDB are both NoSQL databases. Cassandra is a distributed database
system designed to handle large amounts of data and known for its high scalability and
high performance, while MongoDB is a document-oriented database which also provides
high scalability, high performance, and automatic scaling.

In terms of simplicity, databases can be compared in two ways:

o Development simplicity
o Operational simplicity

While MongoDB is known for an easy out-of-the-box experience, Cassandra is known for
being easy to manage at scale.

Following is a list of important differences between them:

No.  Cassandra                                   MongoDB
1)   High-performance distributed database       Cross-platform, document-oriented
     system.                                     database system.
2)   Written in Java.                            Written in C++.
3)   Stores data in a tabular form similar to    Stores data in a JSON-like format.
     SQL tables.
4)   Licensed under the Apache License.          Server licensed under the AGPL (SSPL
                                                 since 2018); drivers under the Apache
                                                 License.
5)   Mainly designed to handle large amounts     Designed to work with JSON-like
     of data across many commodity servers.      documents and to make application data
                                                 access easier and faster.
6)   Provides high availability with no          Easy to administer in the case of
     single point of failure.                    failure.

Key Points of Apache Cassandra:

o Cassandra is a highly scalable, high-performance, fault-tolerant database system
with tunable consistency. Cassandra is a column-oriented database.
o Cassandra provides easy data distribution.
o Cassandra does not provide full ACID (Atomicity, Consistency, Isolation, and
Durability) transactions; it offers atomic, isolated, and durable writes at the row
level, with eventual and tunable consistency.
o Cassandra follows the distribution design of Amazon's Dynamo, and its data model
design is based on Google's Bigtable.
o Cassandra was initially created at Facebook for inbox search and is now used by
some of the biggest companies, such as Facebook, Twitter, eBay, Netflix, Cisco, and
Rackspace.

Key Points of MongoDB:

o MongoDB is well suited for big data and for mobile and social infrastructure.
o MongoDB provides replication, high availability, and auto-sharding.
o MongoDB is used by companies such as Foursquare, Intuit, Shutterfly, SourceForge, The
New York Times, LexisNexis, and Orange Digital.
