BDA Module 3

big data analytics notes

Uploaded by

srpardeshi22

NoSQL

NoSQL is a type of database management system (DBMS) that is designed to handle and
store large volumes of unstructured and semi-structured data. Unlike traditional relational
databases that use tables with pre-defined schemas to store data, NoSQL databases use
flexible data models that can adapt to changes in data structures and are capable of scaling
horizontally to handle growing amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the
term has since evolved to mean “not only SQL,” as NoSQL databases have expanded to
include a wide range of different database architectures and data models.
NoSQL, originally referring to "non-SQL" or "non-relational," is a database that provides a
mechanism for the storage and retrieval of data modeled in means other than the tabular
relations used in relational databases. Such databases came into existence in the late
1960s but did not obtain the NoSQL moniker until a surge of popularity in the early
twenty-first century. NoSQL databases are used in real-time web applications and big data,
and their use is increasing over time.
 NoSQL systems are also sometimes called "Not only SQL" to emphasize the fact that they
may support SQL-like query languages. A NoSQL database offers simplicity of
design, simpler horizontal scaling to clusters of machines, and finer control over
availability. The data structures used by NoSQL databases differ from those used
by default in relational databases, which makes some operations faster in NoSQL. The
suitability of a given NoSQL database depends on the problem it should solve.
 NoSQL databases, also known as "not only SQL" databases, are a new type of database
management system that has gained popularity in recent years. Unlike traditional
relational databases, NoSQL databases are designed to handle large amounts of
unstructured or semi-structured data, and they can accommodate dynamic changes to
the data model. This makes NoSQL databases a good fit for modern web applications,
real-time analytics, and big data processing.
 Data structures used by NoSQL databases are sometimes also viewed as more flexible
than relational database tables. Many NoSQL stores compromise consistency in favor of
availability, speed, and partition tolerance. Barriers to the greater adoption of NoSQL
stores include the use of low-level query languages, lack of standardized interfaces, and
huge previous investments in existing relational databases.
 Most NoSQL stores lack true ACID (Atomicity, Consistency, Isolation, Durability)
transactions, but a few databases, such as MarkLogic, Aerospike, FairCom c-treeACE,
Google Spanner (though technically a NewSQL database), Symas LMDB, and
OrientDB have made them central to their designs.
 Most NoSQL databases offer a concept of eventual consistency, in which database
changes are propagated to all nodes over time, so queries for data might not return updated
data immediately or might return data that is not accurate, a problem known as
stale reads. Also, some NoSQL systems may exhibit lost writes and other
forms of data loss. Some NoSQL systems provide mechanisms such as write-ahead logging
to avoid data loss.

Vrushali Thakur
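The write-ahead logging idea mentioned above can be sketched in a few lines. This is an illustrative toy, not any real database's implementation: the change is appended to a log before it is applied to the store, so state lost in a crash can be rebuilt by replaying the log.

```python
# Minimal write-ahead-log sketch (illustrative only): every change is
# appended to a log before it is applied to the in-memory store, so a
# crash between the two steps can be repaired by replaying the log.
class WalStore:
    def __init__(self):
        self.log = []    # stands in for an append-only file on disk
        self.data = {}

    def put(self, key, value):
        self.log.append((key, value))  # 1. durably record the intent
        self.data[key] = value         # 2. apply the change

    def recover(self):
        # Rebuild state by replaying the log from the beginning.
        self.data = {}
        for key, value in self.log:
            self.data[key] = value

store = WalStore()
store.put("user:1", "alice")
store.data.clear()   # simulate losing in-memory state in a crash
store.recover()      # replaying the log restores the lost write
```

Real systems fsync the log to disk before acknowledging the write; that durability step is what the in-memory list stands in for here.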
 One simple example of a NoSQL database is a document database. In a document
database, data is stored in documents rather than tables. Each document can contain a
different set of fields, making it easy to accommodate changing data requirements.
 For example, consider a database that holds data about employees. In a
relational database, this information might be stored in tables, with one table for
employee information and another table for department information. In a document
database, each employee would be stored as a separate document, with all of their
information contained within the document.
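The employee example above can be made concrete with a small sketch. The field names here (name, dept_id, skills, and so on) are hypothetical, chosen only to show the contrast between the two layouts:

```python
# Hypothetical employee record stored two ways. In a relational design the
# data is split across normalized tables; in a document store one JSON-like
# document holds everything about the employee.
relational_rows = {
    "employees":   [{"id": 1, "name": "Asha", "dept_id": 10}],
    "departments": [{"dept_id": 10, "dept_name": "Sales"}],
}

employee_document = {   # one self-contained document, no join needed
    "id": 1,
    "name": "Asha",
    "department": {"dept_id": 10, "dept_name": "Sales"},
    "skills": ["negotiation", "crm"],   # a field other documents may lack
}
```

Reading the employee's department from the document requires no join; in the relational layout it requires matching dept_id across two tables.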
 NoSQL databases are a relatively new type of database management system that
hasa gained popularity in recent years due to their scalability and flexibility. They are
designed to handle large amounts of unstructured or semi-structured data and can
handle dynamic changes to the data model. This makes NoSQL databases a good fit for
modern web applications, real-time analytics, and big data processing.

Key Features of NoSQL:


1. Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate
changing data structures without the need for migrations or schema alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out by adding more
nodes to a database cluster, making them well-suited for handling large amounts of data
and high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a document-based
data model, where data is stored in a schema-less semi-structured format, such as JSON
or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model,
where data is stored as a collection of key-value pairs.
5. Column-based: Some NoSQL databases, such as Cassandra, use a column-based data
model, where data is organized into columns instead of rows.
6. Distributed and high availability: NoSQL databases are often designed to be highly
available and to automatically handle node failures and data replication across multiple
nodes in a database cluster.
7. Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible
and dynamic manner, with support for multiple data types and changing data structures.
8. Performance: NoSQL databases are optimized for high performance and can handle a
high volume of reads and writes, making them suitable for big data and real-time
applications.

Advantages of NoSQL: There are many advantages of working with NoSQL databases
such as MongoDB and Cassandra. The main advantages are high scalability and high
availability.
1. High scalability: NoSQL databases use sharding for horizontal scaling. Sharding is the
partitioning of data and placing it on multiple machines in such a way that the order of
the data is preserved. Vertical scaling means adding more resources to the existing
machine, whereas horizontal scaling means adding more machines to handle the data.
Vertical scaling is not that easy to implement, but horizontal scaling is easy to
implement. Examples of horizontally scaling databases are MongoDB, Cassandra, etc.
NoSQL can handle a huge amount of data because of this scalability; as the data grows,
NoSQL scales itself automatically to handle that data in an efficient manner.
2. Flexibility: NoSQL databases are designed to handle unstructured or semi-structured
data, which means that they can accommodate dynamic changes to the data model. This
makes NoSQL databases a good fit for applications that need to handle changing data
requirements.
3. High availability: The auto-replication feature in NoSQL databases makes them highly
available, because in case of any failure data replicates itself to the previous consistent
state.
4. Scalability: NoSQL databases are highly scalable, which means that they can handle
large amounts of data and traffic with ease. This makes them a good fit for applications
that need to handle large amounts of data or traffic.
5. Performance: NoSQL databases are designed to handle large amounts of data and
traffic, which means that they can offer improved performance compared to traditional
relational databases.
6. Cost-effectiveness: NoSQL databases are often more cost-effective than traditional
relational databases, as they are typically less complex and do not require expensive
hardware or software.
7. Agility: Ideal for agile development.
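The sharding described in point 1 can be sketched as follows. Note this toy uses hash-based sharding, the simplest variant; the range partitioning that preserves key order (as the text describes) works the same way but maps key ranges, rather than hashes, to shards. Shard count and key names are hypothetical:

```python
# Hash-based sharding sketch (illustrative): a stable hash maps each key to
# one of N shards, so data and load spread across machines as shards grow.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands for a machine

def shard_for(key: str) -> int:
    # Decide which shard (machine) owns this key.
    return hash(key) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Ravi"})
```

Production systems typically use consistent hashing instead of a plain modulus, so that adding a shard relocates only a small fraction of the keys.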

Disadvantages of NoSQL: NoSQL has the following disadvantages.


1. Lack of standardization: There are many different types of NoSQL databases, each
with its own unique strengths and weaknesses. This lack of standardization can make it
difficult to choose the right database for a specific application.
2. Lack of ACID compliance: NoSQL databases are not fully ACID-compliant, which
means that they do not guarantee the consistency, integrity, and durability of data. This
can be a drawback for applications that require strong data consistency guarantees.
3. Narrow focus: NoSQL databases have a narrow focus: they are mainly designed for
storage and provide very little functionality beyond it. Relational databases are a better
choice in the field of transaction management than NoSQL.
4. Open-source: NoSQL is an open-source database. There is no reliable
standard for NoSQL yet; in other words, two database systems are likely to be unequal.
5. Lack of support for complex queries: NoSQL databases are not designed to handle
complex queries, which means that they are not a good fit for applications that require
complex data analysis or reporting.
6. Lack of maturity: NoSQL databases are relatively new and lack the maturity of
traditional relational databases. This can make them less reliable and less secure than
traditional databases.
7. Management challenge: The purpose of big data tools is to make the management of a
large amount of data as simple as possible, but it is not so easy. Data management in
NoSQL is much more complex than in a relational database. NoSQL, in particular, has
a reputation for being challenging to install and even more hectic to manage on a daily
basis.
8. GUI is not available: GUI mode tools to access the database are not flexibly available
in the market.
9. Backup: Backup is a great weak point for some NoSQL databases like MongoDB.
MongoDB has no approach for the backup of data in a consistent manner.
10. Large document size: Some database systems like MongoDB and CouchDB store data
in JSON format. This means that documents are quite large (costing network bandwidth
and speed at big data scale), and having descriptive key names actually hurts, since they
increase the document size.

When should NoSQL be used:


1. When a huge amount of data needs to be stored and retrieved.
2. When the relationships between the data you store are not that important.
3. When the data changes over time and is not structured.
4. When support for constraints and joins is not required at the database level.
5. When the data is growing continuously and you need to scale the database regularly to
handle it.

Types of NoSQL Database:

 Document-based databases
 Key-value stores
 Column-oriented databases
 Graph-based databases

Document-Based Database:

The document-based database is a nonrelational database. Instead of storing the data in
rows and columns (tables), it uses the documents to store the data in the database. A
document database stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data objects
used in applications which means less translation is required to use these data in the
applications. In a document database, particular elements can be accessed by using an
index value assigned to them, for faster querying.
Collections are groups of documents that have similar contents. Documents in a collection
are not required to share the same schema, because document databases have a flexible
schema.
Key features of documents database:
 Flexible schema: Documents in the database have a flexible schema, meaning that
documents in the same database need not share the same schema.
 Faster creation and maintenance: The creation of documents is easy, and minimal
maintenance is required once a document is created.
 No foreign keys: There is no dynamic relationship between two documents so
documents can be independent of one another. So, there is no requirement for a foreign
key in a document database.
 Open formats: To build a document we use XML, JSON, and others.
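A toy collection illustrates the features above: documents share a collection but not a schema, and an index on one field gives fast lookups. The field names are hypothetical:

```python
# Toy document collection (illustrative): documents live together without
# sharing a schema, and an index on _id gives O(1) access by id.
collection = [
    {"_id": 1, "name": "Asha", "dept": "Sales"},
    {"_id": 2, "name": "Ravi", "dept": "HR", "phone": "555-0101"},  # extra field
]

# Build an index mapping _id to the document for fast lookups.
index = {doc["_id"]: doc for doc in collection}

def find_by_id(doc_id):
    return index.get(doc_id)
```

The second document carries a phone field the first one lacks: no schema change was needed, which is the flexible-schema property in miniature.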

Key-Value Stores:

A key-value store is a nonrelational database and the simplest form of NoSQL database.
Every data element in the database is stored as a key-value pair. The data can be
retrieved by using the unique key allotted to each element in the database. The values
can be simple data types like strings and numbers or complex objects.
A key-value store is like a relational database with only two columns: the key and
the value.
Key features of the key-value store:
 Simplicity.
 Scalability.
 Speed.
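The "two columns" picture above maps directly onto a dictionary. This is an illustrative sketch, not a real store like Redis, though the set/get/delete operations mirror the basic commands such stores expose:

```python
# A key-value store reduced to its essence (illustrative): the key is the
# only access path, and values can be simple types or whole objects.
class KVStore:
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

kv = KVStore()
kv.set("session:abc", {"user": "asha", "ttl": 3600})  # value is an object
```

Because every operation is a single hash lookup, this model delivers the simplicity, scalability, and speed listed above.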

Column Oriented Databases:

A column-oriented database is a non-relational database that stores data in columns
instead of rows. That means that when you want to run analytics on a small number of
columns, you can read those columns directly without consuming memory on unwanted
data.
Columnar databases are designed to read data more efficiently and retrieve it with
greater speed. A columnar database is used to store large amounts of data. Key features of
column-oriented databases:
 Scalability.
 Compression.
 Very responsive.
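The row-versus-column layouts can be sketched with plain lists; the table contents here are hypothetical. An aggregate over one column touches only that column's array:

```python
# Column-oriented layout sketch (illustrative): the same records stored
# row-wise and column-wise. Analytics on one column reads only its array.
row_oriented = [
    {"id": 1, "name": "Asha", "salary": 50000},
    {"id": 2, "name": "Ravi", "salary": 60000},
]

column_oriented = {            # same data, pivoted by column
    "id":     [1, 2],
    "name":   ["Asha", "Ravi"],
    "salary": [50000, 60000],
}

# Summing salaries never loads ids or names into memory.
total_salary = sum(column_oriented["salary"])
```

Columnar storage also compresses well, since values within one column tend to be similar, which is where the compression feature above comes from.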

Graph-Based databases:

Graph-based databases focus on the relationships between elements. They store data in
the form of nodes, and the connections between the nodes are called links or
relationships.
Key features of graph databases:
 In a graph-based database, it is easy to identify the relationships between data items
by using the links.
 Query output is delivered in real time.
 The speed depends upon the number of relationships among the database elements.
 Updating data is also easy, as adding a new node or edge to a graph database is a
straightforward task that does not require significant schema changes.
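A minimal sketch of the node-and-link model, with hypothetical node names and relationship types; a query follows links directly instead of joining tables:

```python
# Minimal graph store (illustrative): nodes plus typed links. Traversal
# follows edges directly rather than joining tables on foreign keys.
nodes = {"alice": {"kind": "person"},
         "bob":   {"kind": "person"},
         "acme":  {"kind": "company"}}
edges = [("alice", "WORKS_AT", "acme"),
         ("bob",   "WORKS_AT", "acme"),
         ("alice", "KNOWS",    "bob")]

def neighbors(node, relationship):
    # Follow outgoing links of one relationship type from a node.
    return [dst for src, rel, dst in edges
            if src == node and rel == relationship]
```

Adding a new node or edge is just an append, which is why schema changes are cheap in graph databases, as noted above.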

NoSQL business drivers

The scientist-philosopher Thomas Kuhn coined the term paradigm shift to identify a
recurring process he observed in science, where innovative ideas came in bursts and impacted
the world in nonlinear ways. We’ll use Kuhn’s concept of the paradigm shift as a way to
think about and explain the NoSQL movement and the changes in thought patterns,
architectures, and methods emerging today.

Many organizations supporting single-CPU relational systems have come to a crossroads: the
needs of their organizations are changing. Businesses have found value in rapidly capturing
and analyzing large amounts of variable data, and making immediate changes in their
businesses based on the information they receive.

Figure 1.1 shows how the demands of volume, velocity, variability, and agility play a key
role in the emergence of NoSQL solutions. As each of these drivers applies pressure to the
single-processor relational model, its foundation becomes less stable and in time no longer
meets the organization’s needs.

1.2.1. VOLUME

Without a doubt, the key factor pushing organizations to look at alternatives to their current
RDBMSs is a need to query big data using clusters of commodity processors. Until around
2005, performance concerns were resolved by purchasing faster processors. In time, the
ability to increase processing speed was no longer an option. As chip density increased, heat
could no longer dissipate fast enough without chip overheating. This phenomenon, known as
the power wall, forced systems designers to shift their focus from increasing speed on a
single chip to using more processors working together. The need to scale out (also known
as horizontal scaling), rather than scale up (faster processors), moved organizations from
serial to parallel processing where data problems are split into separate paths and sent to
separate processors to divide and conquer the work.

1.2.2. VELOCITY

Though big data problems are a consideration for many organizations moving away from
RDBMSs, the ability of a single processor system to rapidly read and write data is also key.
Many single-processor RDBMSs are unable to keep up with the demands of real-time inserts
and online queries to the database made by public-facing websites. RDBMSs frequently
index many columns of every new row, a process which decreases system performance.
When single-processor RDBMSs are used as a back end to a web store front, the random
bursts in web traffic slow down response for everyone, and tuning these systems can be
costly when both high read and write throughput is desired.

1.2.3. VARIABILITY

Companies that want to capture and report on exception data struggle when attempting to use
rigid database schema structures imposed by RDBMSs. For example, if a business unit wants
to capture a few custom fields for a particular customer, all customer rows within the
database need to store this information even though it doesn’t apply. Adding new columns to
an RDBMS requires the system to be shut down and ALTER TABLE commands to be run.
When a database is large, this process can impact system availability, costing time and
money.
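The contrast with a document model can be shown in a few lines. The customers and the loyalty_tier field are hypothetical; the point is that only the documents that need the custom field carry it, with no ALTER TABLE and no downtime:

```python
# Variability sketch (illustrative): in a rigid schema, a custom field means
# altering the table for every row. In a document model only the documents
# that need the field carry it.
customers = [
    {"id": 1, "name": "Asha"},
    {"id": 2, "name": "Ravi"},
]

# One business unit adds a custom field to a single customer: no schema
# change, no shutdown, and the other documents are untouched.
customers[1]["loyalty_tier"] = "gold"
```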

1.2.4. AGILITY

The most complex part of building applications using RDBMSs is the process of putting data
into and getting data out of the database. If your data has nested and repeated subgroups of
data structures, you need to include an object-relational mapping layer. The responsibility of
this layer is to generate the correct combination of INSERT, UPDATE, DELETE, and
SELECT SQL statements to move object data to and from the RDBMS persistence layer.
This process isn’t simple and is associated with the largest barrier to rapid change when
developing new or modifying existing applications.

Generally, object-relational mapping requires experienced software developers who are
familiar with object-relational frameworks such as Java Hibernate (or NHibernate for .NET
systems). Even with experienced staff, small change requests can cause slowdowns in
development and testing schedules.

You can see how velocity, volume, variability, and agility are the high-level drivers most
frequently associated with the NoSQL movement. Now that you’re familiar with these
drivers, you can look at your organization to see how NoSQL solutions might impact these
drivers in a positive way to help your business meet the changing demands of today’s
competitive marketplace.

Differences between NoSQL and relational databases

NoSQL Database | Relational Database
Supports a very simple query language. | Supports a powerful query language.
Has no fixed schema. | Has a fixed schema.
Is only eventually consistent. | Follows the ACID properties (Atomicity, Consistency, Isolation, Durability).
Does not support transactions (supports only simple transactions). | Supports transactions (including complex transactions with joins).
Is used to handle data coming in at high velocity. | Is used to handle data coming in at low velocity.
Data arrives from many locations. | Data arrives from one or a few locations.
Can manage structured, unstructured, and semi-structured data. | Manages only structured data.
Has no single point of failure. | Has a single point of failure, with failover.
Can handle big data, i.e., data in very high volume. | Is used to handle a moderate volume of data.
Has a decentralized structure. | Has a centralized structure.
Offers both read and write scalability. | Offers read scalability only.
Is deployed in a horizontal fashion. | Is deployed in a vertical fashion.

Comparison of SQL vs NoSQL

With a basic understanding of what SQL and NoSQL are, the comparison table above shows
what sets the two apart.
CAP Theorem and NoSQL Databases

What is the CAP theorem?

The CAP theorem makes system designers aware of the trade-offs involved in designing
networked shared-data systems. It has influenced the design of many distributed
data systems. Understanding the CAP theorem is very important, as it forms the basis for
choosing a NoSQL database to match the requirements.

CAP theorem states that in networked shared-data systems or distributed systems, we can
only achieve at most two out of three guarantees for a database: Consistency, Availability
and Partition Tolerance.

A distributed system is a network that stores data on more than one node (physical or virtual
machines) at the same time.

Let’s first understand C, A, and P in simple words:

Consistency: means that all clients see the same data at the same time, no matter which node
they connect to in a distributed system. To achieve consistency, whenever data is written to
one node, it must be instantly forwarded or replicated to all the other nodes in the system
before the write is deemed successful.

Availability: means that every non-failing node returns a response for all read and write
requests in a reasonable amount of time, even if one or more nodes are down. Another way to
state this — all working nodes in the distributed system return a valid response for any
request, without failing or exception.

Partition Tolerance: means that the system continues to operate despite arbitrary message
loss or failure of part of the system. In other words, even if there is a network outage in the
data center and some of the computers are unreachable, still the system continues to perform.
Distributed systems guaranteeing partition tolerance can gracefully recover from partitions
once the partition heals.

The CAP theorem categorizes systems into three categories:

CP (Consistent and Partition-Tolerant) database: A CP database delivers consistency and
partition tolerance at the expense of availability. When a partition occurs between any two
nodes, the system has to shut down the non-consistent node (i.e., make it unavailable) until
the partition is resolved.

Partition refers to a communication break between nodes within a distributed system.
Meaning, if a node cannot receive any messages from another node in the system, there is a
partition between the two nodes. A partition could be caused by a network failure, a server
crash, or any other reason.

AP (Available and Partition-Tolerant) database: An AP database delivers availability and
partition tolerance at the expense of consistency. When a partition occurs, all nodes remain
available, but those at the wrong end of a partition might return an older version of data than
others. When the partition is resolved, AP databases typically resync the nodes to repair all
inconsistencies in the system.

CA (Consistent and Available) database: A CA database delivers consistency and
availability in the absence of any network partition. Often single-node DB servers are
categorized as CA systems, since they do not need to deal with partition tolerance.

In any networked shared-data or distributed system, partition tolerance is a must.
Network partitions and dropped messages are a fact of life and must be handled
appropriately. Consequently, system designers must choose between consistency and
availability.
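A toy simulation makes the AP side of this trade-off concrete: during a partition the replicas stay available but one serves stale data, and they resync once the partition heals. The two-replica setup and the key name are hypothetical:

```python
# AP behavior during a partition (illustrative toy): replication between two
# replicas is cut, so reads from the far replica return stale data until the
# partition heals and the replicas resync.
replica_a = {"x": 1}
replica_b = {"x": 1}
partitioned = True          # the link between a and b is down

def write_a(key, value):
    replica_a[key] = value
    if not partitioned:     # replication only flows when connected
        replica_b[key] = value

write_a("x", 2)
stale_read = replica_b["x"]      # still the old value: a stale read

partitioned = False              # the partition heals ...
replica_b.update(replica_a)      # ... and the replicas resync
fresh_read = replica_b["x"]
```

A CP system would instead refuse to serve replica_b during the partition, trading availability for consistency.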

The following diagram shows the classification of different databases based on the CAP
theorem.

System designers must take the CAP theorem into consideration while designing or choosing
distributed storage, as one of consistency and availability must be sacrificed for the other.

NoSQL Case Study

NoSQL databases can be used for a variety of applications, but there are a few common use
cases where NoSQL shines:

1. E-commerce applications
NoSQL databases can help e-commerce companies manage large volumes of data, including
product catalogs, customer profiles, and transaction histories. NoSQL databases are also
capable of handling high traffic volumes, making them an excellent choice for e-commerce
applications that experience surges in demand.

E-commerce applications require the ability to manage a large volume of data, including
product catalogs, customer profiles, and transaction histories. This data is often unstructured
or semi-structured, making it challenging to manage with traditional relational databases.
NoSQL databases are designed to handle these types of data and provide the necessary
scalability and performance to support high-traffic e-commerce applications.

In an e-commerce application, a NoSQL database can store product data, including product
descriptions, prices, images, and availability. NoSQL databases can handle large product
catalogs with ease, making it easy for customers to search for and find the products they
need.

Customer data is another critical component of an e-commerce application. NoSQL databases
can store customer profiles, including names, addresses, purchase histories, and preferences.
With NoSQL databases, e-commerce companies can build personalized experiences for their
customers, providing targeted product recommendations and personalized offers based on
their purchase history and preferences.

Finally, NoSQL databases can also store transaction histories, providing a complete record of
all transactions that have occurred in the e-commerce application. This data can be used for
reporting.

2. Social media platforms


Social media platforms generate vast amounts of unstructured data in the form of posts,
comments, likes, shares, and user profiles. This data is highly variable and unpredictable,
making it difficult to manage with traditional relational databases. NoSQL databases, on the
other hand, are designed to handle unstructured data and provide the necessary scalability and
flexibility to support social media applications.

One of the main advantages of NoSQL databases for social media platforms is their ability to
store and process unstructured data at scale. Social media platforms generate an enormous
amount of unstructured data every day, and NoSQL databases can handle this data efficiently
and effectively. With NoSQL databases, social media platforms can store posts, comments,
likes, and shares, and quickly retrieve this data when needed.

3. Internet of Things (IoT)


The Internet of Things (IoT) is a network of connected devices that generate vast amounts of
data from sensors, cameras, and other sources. This data is often unstructured or semi-
structured, making it difficult to manage with traditional relational databases. NoSQL
databases, however, are designed to handle the volume and variety of data generated by IoT
devices and offer the flexibility to accommodate evolving data models.

One of the primary benefits of NoSQL databases for IoT applications is their ability to handle
large volumes of data in real-time. IoT devices generate data continuously, and NoSQL
databases can store and process this data quickly and efficiently. With NoSQL databases, IoT
applications can collect and analyze data from millions of devices in real-time, providing
valuable insights into user behavior, performance, and maintenance needs.

NoSQL databases are also highly scalable, which is essential for IoT applications that may
need to accommodate tens of thousands or even millions of devices. As the number of
devices and the volume of data grows, NoSQL databases can scale horizontally to
accommodate the additional load. This scalability is critical for IoT applications that need to
process large amounts of data quickly and efficiently.

4. Mobile applications
Mobile applications are a ubiquitous part of modern life, with billions of users worldwide
generating vast amounts of data. NoSQL databases are well-suited to handle the data
generated by mobile applications, including user profiles, location data, and app usage
statistics. With NoSQL databases, mobile applications can provide fast, reliable access to
data across a distributed network.

The advantage of NoSQL databases for mobile applications is their ability to handle semi-
structured and unstructured data. Mobile applications generate a variety of data, including
text, images, and video, and NoSQL databases can accommodate this variety of data. This
flexibility is critical for mobile applications that need to store and process data in a variety of
formats.

5. Gaming
Gaming companies generate a massive amount of data, from player data to game states, high
scores, and more. NoSQL databases are an excellent choice for gaming companies as they
can store and manage large volumes of player data with ease. Additionally, NoSQL databases
are well-suited to handling high traffic volumes, making them an excellent choice for gaming
applications that experience surges in demand.

A good reason why NoSQL databases are a fit for gaming companies is their ability to store
and manage large volumes of player data. Gaming companies must maintain a vast amount of
player data, including profiles, preferences, progress, and achievements. NoSQL databases
can handle this data effectively, allowing gaming companies to access and analyze player
data efficiently.


6. Big data analytics


In today's data-driven world, big data analytics has become a crucial tool for companies
seeking to gain insights into their operations, customers, and markets. However, traditional
relational databases are often not well-suited for big data analytics because they struggle to
handle large volumes of unstructured data.

This is where NoSQL databases come in. NoSQL databases are designed to handle large
volumes of unstructured data, making them an excellent choice for big data analytics.
Whether it's text data, sensor data, or multimedia data, NoSQL databases can handle it all,
providing fast and flexible access to data.

One of the primary advantages of NoSQL databases for big data analytics is their ability to
handle large volumes of data. NoSQL databases can scale horizontally, allowing companies
to add more computing power and storage as needed. This scalability means that NoSQL
databases can handle even the largest data sets, providing fast and efficient data access.

NoSQL databases also offer flexible data models, which is essential for big data analytics. In
traditional relational databases, data must be organized into a rigid structure, which can be
limiting when dealing with unstructured data. NoSQL databases, on the other hand, can
handle a variety of data models, including key-value, document, and graph models. This
flexibility means that companies can store and analyze data in a way that best suits their
needs.

Real-World NoSQL Use Cases and Examples


Let's explore some real-world examples of companies that use NoSQL databases and the
applications for which they use them.

Netflix
Netflix uses NoSQL databases to store and manage massive amounts of data, including
customer profiles, viewing histories, and content recommendations. NoSQL databases allow
Netflix to handle large volumes of data and provide fast, reliable access to data across a
distributed network.

Uber
Uber uses NoSQL databases to handle the massive amounts of data generated by its ride-
sharing platform, including driver and rider profiles, trip histories, and real-time location
data. NoSQL databases provide the scalability and flexibility needed to handle high traffic
volumes and changing data models.

Vrushali Thakur
Airbnb
Airbnb uses NoSQL databases to store and manage data for its booking platform, including
property listings, guest profiles, and booking histories. NoSQL databases allow Airbnb to
handle large volumes of unstructured data and provide fast, reliable access to data across a
distributed network.

Benefits of Using NoSQL Databases for Specific Use Cases


NoSQL databases offer several benefits for specific use cases:

Improved scalability
NoSQL databases are designed to scale horizontally, allowing companies to add more
computing power and storage as needed. This makes NoSQL databases an excellent choice
for applications that require the ability to handle large volumes of data and traffic spikes
without sacrificing performance.

Flexible data models


NoSQL databases offer flexible data models that can accommodate evolving data needs. This
means that companies can easily add or modify data fields without having to restructure their
entire database schema.

High availability
NoSQL databases offer high availability and fault tolerance, ensuring that applications
remain up and running even in the event of a hardware failure or network outage. This is
critical for applications that require continuous availability, such as e-commerce and social
media platforms.

Fast read and write speeds


NoSQL databases are optimized for fast read and write speeds, making them an excellent
choice for applications that require real-time data access, such as mobile applications and
gaming platforms.

NoSQL solution for big data

Datasets that are too large or complex to store and analyze with conventional
database tools are referred to as big data. As data volumes grow, the question
arises of how, given current trends in the IT sector, this data can be
processed effectively. Ideas, techniques, tools, and technologies are needed to
handle large amounts of data and transform them into business value and
knowledge. The major features of NoSQL solutions that help us handle large
amounts of data are stated below.
NoSQL databases that are best for big data are:
 MongoDB
 Cassandra

 CouchDB
 Neo4j

Different ways to handle Big Data problems:

1. Move the queries to the data rather than the data to the queries:
When a client needs to send a general query to all nodes holding data, it is
more efficient to send the query to every node than to move a large data set to
a central processor. This simple rule helps explain why NoSQL databases have
dramatic performance advantages over systems that were not designed to
distribute queries to nodes. Because the data is kept on each node in document
form, only the query and its result need to move over the network, which keeps
big data queries fast.
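The scatter-gather pattern behind this rule can be sketched in a few lines of Python; the per-node data and the query predicate are invented for illustration:

```python
# Each inner list stands for the documents held locally by one node.
nodes = [
    [{"user": "a", "spend": 10}, {"user": "b", "spend": 30}],  # node 0
    [{"user": "c", "spend": 25}],                              # node 1
    [{"user": "d", "spend": 5}, {"user": "e", "spend": 40}],   # node 2
]

def run_on_node(node_data, predicate):
    """Executed locally on each node; only matching rows leave the node."""
    return [row for row in node_data if predicate(row)]

def scatter_gather(predicate):
    results = []
    for node_data in nodes:          # the query is shipped to every node
        results.extend(run_on_node(node_data, predicate))
    return results                   # only results cross the "network"

big_spenders = scatter_gather(lambda r: r["spend"] >= 25)
print(big_spenders)
```

Only the query and the matching rows travel; the bulk of each node's data never leaves it.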
2. Use hash rings to distribute data evenly:
Finding a reliable way to assign a record to a processing node is one of the
most difficult problems in distributed databases. Using a randomly generated
40-character key, the hash ring technique distributes a large amount of data
evenly across many servers, which is a good way to spread the network load
uniformly.
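A minimal consistent-hash ring can be sketched as follows; SHA-1 hex digests happen to be exactly 40 characters, matching the 40-character keys mentioned above. The server names and virtual-node count are illustrative:

```python
import bisect
import hashlib
from collections import Counter

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth the distribution
                h = hashlib.sha1(f"{node}#{i}".encode()).hexdigest()
                self.ring.append((h, node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    def node_for(self, key):
        """Walk clockwise from the key's hash to the next node on the ring."""
        h = hashlib.sha1(key.encode()).hexdigest()  # 40-character key
        idx = bisect.bisect(self.keys, h) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c"])
counts = Counter(ring.node_for(f"doc:{i}") for i in range(3000))
print(counts)  # the 3000 keys spread roughly evenly across the three servers
```

Because a key's placement depends only on its hash, any client can compute which server holds a document without consulting a central directory.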
3. Use replication to scale read requests:
Databases use replication to make backup copies of data in real time. Read
requests can then be scaled horizontally by spreading them across the replicas.
This replication strategy works well in most cases.
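A toy sketch of the idea: one primary accepts writes, every write is copied to the replicas, and reads are spread across the replicas round-robin. The class and replica count are invented for illustration, and replication is synchronous only for simplicity:

```python
import itertools

class ReplicatedStore:
    def __init__(self, n_replicas=3):
        self.primary = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self._rr = itertools.cycle(range(n_replicas))  # round-robin picker

    def write(self, key, value):
        self.primary[key] = value
        for replica in self.replicas:   # copy the write to every replica
            replica[key] = value

    def read(self, key):
        replica = self.replicas[next(self._rr)]  # each read hits a different replica
        return replica.get(key)

store = ReplicatedStore()
store.write("user:1", "Asha")
print([store.read("user:1") for _ in range(3)])  # served by three different replicas
```

Adding more replicas raises read throughput without touching the write path, which is what "scaling reads horizontally" means here.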
4. Let the database distribute queries to nodes:
Separating query evaluation from query execution is important for getting
higher performance from queries that span many nodes. The NoSQL database moves
the query to the data instead of moving the data to the query.

Advantages of Using NoSQL Databases for Big Data
Storage

Scalability: ability to handle large amounts of data with ease

One of the most significant advantages of using a NoSQL database for big
data storage is its scalability. Traditional relational databases are designed
to work on single servers, which can lead to performance issues when
handling large amounts of data. In contrast, NoSQL databases are designed
to scale horizontally across multiple servers, meaning that they can handle
vast amounts of data without any issues.

The ability to scale horizontally makes NoSQL databases ideal for
applications that need to store and process large volumes of data, such as
social media platforms and e-commerce sites. By allowing for seamless
scalability, businesses can easily adapt their database infrastructure as their
needs change over time.

Flexibility: ability to store data in various formats without predefined schema

NoSQL databases offer a great deal more flexibility than traditional
relational databases. Unlike relational databases that require users to define
a schema before storing any data, NoSQL databases allow users to store
data in various formats without predefined schema.

This flexibility means that businesses can store virtually any type of data in
the database without worrying about the need for costly and time-
consuming schema changes. For example, document-based NoSQL
databases like MongoDB allow users to store and retrieve JSON
documents seamlessly.

High availability: ability to maintain uptime even during hardware failures

Another significant advantage of using a NoSQL database for big data
storage is its high availability. Traditional relational databases often
experience downtime during hardware failures or maintenance windows
because they run on single servers that cannot tolerate outages or failures.
In contrast, most NoSQL databases are designed with high availability in
mind.

They offer features like automatic failover and replication that help ensure
that, even in the event of a hardware failure or a maintenance window, downtime
is minimized. This high level of availability makes NoSQL
databases perfect for applications that require uninterrupted access to data,
such as online banking platforms or healthcare information systems.

Factors that Support NoSQL for Big Data Applications


The real strength of NoSQL is that it prevents data bottlenecks when an
enterprise application handles petabytes of data. That is why NoSQL databases
such as HBase, Cassandra, and MongoDB have become so popular.

The key features of NoSQL databases that make them useful are:

1. Capacity to store large volumes of unstructured data: A NoSQL database can
store data sets of virtually any type and size, and gives users the flexibility
to change data types on the fly. In a document-based database there is no need
to define data types in advance.

2. Cloud-based storage: Most enterprises today adopt cloud-based storage to
reduce costs. NoSQL databases like Cassandra make it possible to set up
multiple data centers without much hassle.

3. Fast development: A relational database is not an ideal solution when you
are working in an agile environment that needs frequent feedback and fast
iterations. In such cases, a NoSQL database fits the framework well.

Understanding the types of big data problems:

1. Sharing and Accessing Data:
 Perhaps the most frequent challenge in big data efforts is the
inaccessibility of data sets from external sources.
 Sharing data can cause substantial challenges, including the need for
inter- and intra-institutional legal documents.
 Accessing data from public repositories brings its own difficulties.
 Data must be available in an accurate, complete, and timely manner: if the
data in a company's information system is to be used to make accurate and
timely decisions, it has to be available in this form.

2. Privacy and Security:

 This is another major challenge with big data, one with sensitive
conceptual, technical, and legal dimensions.
 Most organizations cannot run regular checks because of the sheer volume
of data they generate. Security checks and monitoring should nonetheless
be performed in real time, where they are most beneficial.
 Information about a person, when combined with large external data sets,
may reveal private facts that the person would not want others to know.
 Some organizations collect personal information to add value to their
business, deriving insights into people's lives that those people are
unaware of.

3. Analytical Challenges:
 Big data raises some hard analytical questions, such as: how do you deal
with a problem when the data volume gets too large?
 How do you identify the important data points?
 How do you use the data to best advantage?
 The large amounts of data on which such analysis is done can be structured
(organized data), semi-structured (partly organized data), or unstructured
(unorganized data). There are two ways to approach decision making:
o Either incorporate massive data volumes in the analysis,
o Or determine upfront which big data is relevant.

4. Technical challenges:
 Quality of data:
o Collecting and storing large amounts of data comes at a cost.
Big companies, business leaders, and IT leaders often push for
large data storage.
o For better results and conclusions, big data efforts should
focus on storing quality data rather than irrelevant data.
o This raises further questions: how can we ensure the data is
relevant, how much data is enough for decision making, and is
the stored data accurate?
 Fault tolerance:
o Fault tolerance is another technical challenge; fault-tolerant
computing is extremely hard and involves intricate algorithms.
o Technologies such as cloud computing and big data aim to ensure
that whenever a failure occurs, the damage stays within an
acceptable threshold, so that the whole task does not have to
start again from scratch.
 Scalability:
o Big data projects can grow and evolve rapidly. The scalability
issues of big data have led toward cloud computing.
o This raises challenges such as how to run and execute various
jobs so that the goal of each workload is achieved
cost-effectively.
o It also requires dealing with system failures efficiently, which
again raises the question of what kinds of storage devices
should be used.

Storage

With vast amounts of data generated daily, the greatest challenge is storage (especially when
the data is in different formats) within legacy systems. Unstructured data cannot be stored in
traditional databases.

Processing

Processing big data refers to the reading, transforming, extraction, and formatting of useful
information from raw information. The input and output of information in unified formats
continue to present difficulties.

Security

Security is a big concern for organizations. Non-encrypted information is at
risk of theft or damage by cyber-criminals. Therefore, data security
professionals must balance access to
data against maintaining strict security protocols.

Finding and Fixing Data Quality Issues

Many of you are probably dealing with challenges related to poor data quality,
but solutions are available. The following approaches help fix data problems:

 Correct information in the original database.

 Repair the original data source to resolve any data inaccuracies.

 Use highly accurate identity-resolution methods to determine who someone is.

Scaling Big Data Systems

Database sharding, memory caching, moving to the cloud, and separating
read-only and write-active databases are all effective scaling methods. Each
approach is effective on its own, but combining them takes a system to the next
level.
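Memory caching, one of the methods listed above, can be sketched with the cache-aside pattern; the dictionary "database" here is just a stand-in for a real, slower backend:

```python
db = {"user:1": "Asha"}   # stand-in for the slow backing database
cache = {}                # fast in-memory cache
db_hits = 0               # counts trips to the backing database

def get(key):
    global db_hits
    if key in cache:      # fast path: served from memory
        return cache[key]
    db_hits += 1          # slow path: one trip to the database
    value = db.get(key)
    cache[key] = value    # populate the cache for next time
    return value

print(get("user:1"), get("user:1"), db_hits)  # second read is a cache hit
```

After the first read, repeated reads of the same key never touch the database, which is exactly how caching relieves read load on a scaled system.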

Evaluating and Selecting Big Data Technologies

Companies are spending millions on new big data technologies, and the market
for such tools is expanding rapidly. In recent years, the IT industry has
caught on to the potential of big data and analytics. Trending technologies
include the following:

 Hadoop Ecosystem

 Apache Spark

 NoSQL Databases

 R Software

 Predictive Analytics

 Prescriptive Analytics

Big Data Environments

In a big data environment, data is constantly being ingested from various
sources, which makes it far more dynamic than a data warehouse. Without careful
cataloging, the people in charge of the environment quickly lose track of where
each data collection came from and what it contains.

Real-Time Insights

The term "real-time analytics" describes the practice of performing analyses on data as a
system is collecting it. Decisions may be made more efficiently and with more accurate
information thanks to real-time analytics tools, which use logic and mathematics to deliver
insights on this data quickly.
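As a toy example of delivering an insight while data is still arriving, a running mean can be updated on every event instead of waiting for a batch job; the latency samples below are invented:

```python
class RunningMean:
    """Maintains a mean incrementally, one event at a time."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # insight available immediately

latency = RunningMean()
for sample in [120, 80, 100]:           # events arriving one by one
    current = latency.update(sample)
print(current)  # 100.0
```

Each `update` is O(1), so the metric stays current no matter how fast events arrive.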

Data Validation

Before using data in a business process, its integrity, accuracy, and structure must be
validated. The output of a data validation procedure can be used for further analysis, BI, or
even to train a machine learning model.
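A validation pass of this kind can be sketched as follows; the required fields and the age-range rule are hypothetical, chosen only to show structure, type, and range checks:

```python
# Expected fields and their types (illustrative schema).
REQUIRED = {"id": int, "email": str, "age": int}

def validate(record):
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for field, ftype in REQUIRED.items():
        if field not in record:                       # structure check
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):    # type check
            errors.append(f"bad type for {field}")
    age = record.get("age")
    if isinstance(age, int) and not (0 <= age <= 130):  # range check
        errors.append("age out of range")
    return errors

good = {"id": 1, "email": "a@example.com", "age": 30}
bad = {"id": "x", "email": "b@example.com", "age": 200}
print(validate(good))  # []
print(validate(bad))   # ['bad type for id', 'age out of range']
```

Records that pass can flow on to analysis or model training; records that fail are quarantined with a specific reason attached.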

Healthcare Challenges

Electronic health records (EHRs), genomic sequencing, medical research, wearables, and
medical imaging are just a few examples of the many sources of health-related big data.

Barriers to Effective Use Of Big Data in Healthcare

 The price of implementation

 Compiling and polishing data

 Security

 Disconnect in communication

Analyzing big data with a shared-nothing architecture

What Is the Shared Nothing Architecture?

The concept of shared nothing architecture isn’t new. It’s been around since the 1980s, when
it was first coined by Michael Stonebraker, a computer scientist at the University of
California, Berkeley. However, with the advent of big data and the demands for more
efficient data processing, this architecture has gained renewed attention.

The shared nothing architecture is all about independence and autonomy. Each node, whether
it’s a computer, a server, or a database, has its own dedicated resources. These resources,
such as memory, storage, and processing power, are not shared with other nodes. As a result,
there are fewer bottlenecks and less contention between nodes, leading to improved
performance and scalability.


Shared Nothing Architecture vs. Other Computing Architectures

Below we explain in more detail how the shared nothing architecture differs
from other common architectures.

Shared Nothing Architecture vs. Shared-Everything Architecture

In a shared everything architecture, all nodes have access to a common pool of resources,
which includes memory and storage. This can lead to contention and bottlenecks, as nodes
compete for access to shared resources.

In contrast, in shared nothing architecture, each node has its own dedicated resources,
eliminating the potential for contention. This results in better performance, scalability, and
fault tolerance. However, this architecture requires careful data partitioning and management
to ensure balanced load across nodes.

Shared Nothing Architecture vs. Shared-Storage Architecture

Shared-storage architecture is another distributed computing model where all nodes access
shared disks for data storage. This leads to a single point of contention, the disk, which can be
a bottleneck in terms of performance and scalability.

On the other hand, shared nothing architecture eliminates this bottleneck by allocating
dedicated storage to each node. This not only improves performance but also enhances fault
tolerance, as a failure in one node doesn’t affect the others.

Shared Nothing Architecture vs. Shared-Memory Architecture

Another variation on the shared storage architecture is the shared memory architecture, a
model in which all nodes share a common memory pool. Because memory has much lower
latency than disk, this architecture can provide better performance than a shared storage
architecture. However, this architecture can still lead to contention as nodes compete for
memory access.

The shared nothing architecture addresses this issue by assigning dedicated memory to each
node. This autonomy reduces contention and improves performance, while also enhancing
fault tolerance and reliability.

Advantages of a Shared Nothing Architecture

Scalability and Performance Efficiency

One of the main advantages of shared nothing architecture is its scalability. As each node
operates independently with its own resources, you can easily add more nodes to the system
to handle increased load. This makes shared nothing architecture an ideal choice for systems
that need to scale out to accommodate growth.

In terms of performance efficiency, shared nothing architecture excels due to the absence of
contention for shared resources. Each node can process its own data without interference
from other nodes, leading to faster processing times and improved overall performance.

Fault Tolerance and Reliability

A shared nothing architecture also offers robust fault tolerance. Since each node operates
independently, the failure of one node doesn’t affect the rest of the system. This means that
even if one node goes down, the system can continue to function, providing a high level of
reliability.

Moreover, as each node has its own copy of the data, it can continue to operate even in the
event of a network failure. This redundancy further enhances the reliability of shared nothing
architecture, ensuring that your system remains up and running, even in the face of adversity.

Cost-Effectiveness and Resource Optimization

In a shared nothing architecture, each node has its own dedicated resources, so there’s no
need for expensive shared infrastructure. This reduces the total cost of ownership, making
shared nothing architecture an economical choice for distributed computing.

Furthermore, shared nothing architecture optimizes resource utilization. As each node is
responsible for its own processing, resources are used efficiently, with no wastage. This
makes shared nothing architecture a resource-efficient choice, ensuring that you get the most
out of your computing resources.

Challenges of a Shared Nothing Architecture

Complexity in Implementation and Maintenance

One of the primary hurdles in the adoption of shared nothing architecture is its inherent
complexity. The setup of each node with its data and resources requires meticulous planning
and execution. The individual nodes, while functioning independently, need to work
coherently to ensure the overall system’s smooth operation.

In addition, each node, with its unique set of data and resources, needs to be maintained
independently. The need for individual care increases the maintenance cost, both in terms of
time and resources. Additionally, any alterations or enhancements in the system architecture
require changes to be made in each node, adding to the complexity.

Data Consistency and Synchronization Issues

Data consistency is another significant challenge in shared nothing architecture. Since each
node manages its data, maintaining synchronization across all nodes can be a complex task. If
updates are made to the data in one node, they need to be replicated across all other nodes to
maintain consistency. This can be time-consuming and, in some cases, may lead to data
inconsistencies if not managed properly.

Moreover, the issue of data partitioning arises. A single piece of data might be divided among
multiple nodes. Consequently, retrieving or updating that data becomes a complex task as it
involves multiple nodes. If not handled correctly, this can lead to data inconsistency and loss
of data integrity.

Network Dependency and Bottlenecks

A shared nothing architecture relies heavily on network communication. Each
node communicates with others through the network to accomplish tasks, share
status, and
maintain consistency. Therefore, a robust and reliable network is critical for the successful
operation of such a system.

However, network dependency can lead to bottlenecks, especially when large volumes of
data are being transferred between nodes. This can hinder system performance and slow
down processes. Moreover, if the network becomes unavailable or experiences issues, it can
cause the entire system to fail or degrade in performance.

Best Practices in Implementing a Shared Nothing Architecture

Effective Data Partitioning

Effective data partitioning involves dividing the data into manageable and logical segments
that can be distributed among the nodes. This allows for efficient data management and
minimizes the risk of data inconsistency.

To implement effective data partitioning, you need to understand your data thoroughly.
Analyzing the data usage patterns and access frequency can provide insights into how the
data should be partitioned. The goal is to minimize data movement across nodes and ensure
that the most frequently accessed data is readily available.
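Two common partitioning schemes can be sketched as follows; the node names and range boundaries are illustrative, and in practice the boundaries would come from the usage analysis described above:

```python
NODES = ["node-0", "node-1", "node-2"]

def hash_partition(key):
    """Even spread across nodes, but related keys may land on different nodes."""
    return NODES[hash(key) % len(NODES)]

def range_partition(key, boundaries=("g", "p")):
    """Keeps lexically adjacent keys together; boundaries chosen from data analysis."""
    for i, upper in enumerate(boundaries):
        if key < upper:
            return NODES[i]
    return NODES[len(boundaries)]

# Range partitioning keeps an alphabetical scan local to one or two nodes.
print(range_partition("apple"), range_partition("mango"), range_partition("zebra"))
```

Hash partitioning balances load well; range partitioning minimizes cross-node movement for scans over adjacent keys. The right choice depends on the access patterns observed in the data.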

Optimizing Network Communication

Given the critical role of network communication in shared nothing architecture, it’s
imperative to optimize it to ensure smooth operation. This involves efficient data transfer
techniques, robust network infrastructure, and effective error handling mechanisms.

By reducing the amount of data transferred between nodes, you can minimize network traffic
and avoid bottlenecks. This can be achieved through efficient data partitioning and local
processing of data. Furthermore, investing in a robust network infrastructure that can handle
high volumes of data can significantly improve system performance.

Node Autonomy and Independence

Node autonomy is a fundamental principle of shared nothing architecture. Each node should
be capable of operating independently, managing its resources, and making decisions without
relying on other nodes. This enhances system reliability and scalability.

To ensure node autonomy, you need to equip each node with the necessary resources and
capabilities. This includes sufficient storage and processing power, as well as the right
software tools. Moreover, you need to implement effective error handling mechanisms that
allow nodes to handle failures independently.

Balancing Load and Managing Resources

In shared nothing architecture, balancing load and managing resources effectively is crucial
to ensure optimal system performance. This involves distributing the workload evenly among
the nodes and ensuring that each node has the necessary resources to handle its tasks.

Load balancing can be achieved through effective data partitioning and task scheduling. By
distributing data and tasks evenly, you can avoid overloading certain nodes and ensure a
smooth operation. Furthermore, regular monitoring and management of resources can help
detect and address any resource shortages or inefficiencies.

Shared Nothing Data Protection with Cloudian HyperStore

Data protection requires powerful storage technology. Cloudian’s storage
appliances, based on a shared nothing architecture, are easy to deploy and use,
and let you store petabyte-scale data and access it instantly. Cloudian
supports high-speed backup and restore with parallel data
transfer (18TB per hour writes with 16 nodes).

Cloudian provides durability and availability for your data. HyperStore can
back up and archive your data, providing you with highly available versions to
restore in times of need.

In HyperStore, storage occurs behind the firewall, you can configure geo boundaries for data
access, and define policies for data sync between user devices. HyperStore gives you the
power of cloud-based file sharing in an on-premise device, and the control to protect your
data in any cloud environment.

