5.1 BDA NoSQL
5.1 BDA NoSQL
Introduction to NoSQL
Pros Cons
• Relational Databases do not scale out horizontally very
• Relational databases work with structured data. well (concurrency and data size), only vertically, (unless
you use sharding).
• They support ACID transactional consistency and
support “joins.” • Data is normalized, meaning lots of joins, which
affects speed.
• They come with built-in data integrity and a large eco-
system. • They have problems working with semi-structured data.
• Relationships in this system have constraints.
• There is limitless indexing. Strong SQL.
Unlike SQL databases, where you must determine and declare a table's schema before inserting data. MongoDB's collections,
by default, do not require their documents to have the same schema
• Horizontal scaling
Horizontal scaling, also known as scale-out, refers to bringing on additional nodes to share the load. This is difficult with
relational databases due to the difficulty in spreading out related data across nodes.
Vertical scaling refers to increasing the processing power of a single server or cluster. Both relational and non-relational
databases can scale up, but eventually, there will be a limit in terms of maximum processing power and throughput. Additionally, there
are increased costs with scaling up to high-performing hardware, as costs do not scale linearly.
Prepared by Mrs K H Vijaya Kumari, Asst. Professor, Dept of IT
NoSQL Database Features
• Language –
SQL databases defines and manipulates data based structured query language (SQL). SQL is
restrictive as it requires you to use predefined schemas to determine the structure of your data before you
work with it. Also all of your data must follow the same structure. This can require significant up-front
preparation which means that a change in the structure would be both difficult and disruptive to your whole
system.
A NoSQL database has dynamic schema for unstructured data. Data is stored in many ways. This flexibility
means that documents can be created without having defined structure first. The syntax varies from
database to database, and you can add fields as you go.
• Scalability –
In almost all situations SQL databases are vertically scalable. This means that you can increase the load on a
single server by increasing things like RAM, CPU or SSD. But on the other hand NoSQL databases are
horizontally scalable. This means that you handle more traffic by sharding, or adding more servers in your
NoSQL database. It is similar to adding more floors to the same building versus adding more buildings to
the neighborhood. Thus NoSQL can ultimately become larger and more powerful, making these databases
the preferred choice for large or ever-changing data sets. Prepared by Mrs K H Vijaya Kumari, Asst. Professor, Dept of IT
Difference between SQL and NoSQL
• Structure –
SQL databases are table-based on the other hand NoSQL databases are either key-value pairs, document-
based, graph databases or wide-column stores. This makes relational SQL databases a better option for
applications that require multi-row transactions such as an accounting system or for legacy systems that
were built for a relational structure.
• Property followed –
SQL databases follow ACID properties (Atomicity, Consistency, Isolation and Durability) whereas the
NoSQL database follows the Brewers CAP theorem (Consistency, Availability and Partition tolerance).
• Support –
Great support is available for all SQL database from their vendors. Also a lot of independent consultations
are there who can help you with SQL database for a very large scale deployments but for some NoSQL
database you still have to rely on community support and only limited outside experts are available for
setting up and deploying your large scale NoSQL deployments.
• The increasing business needs of the hour, require the scaling of existing
databases. The availability of huge amounts of data with its diverse forms
demands the storage and retrieval of the same in a flexible manner.
• As the existing SQL systems are well organized and can handle thousands
of queries within no time. But this is applicable only to a small scale
structured data.
• Although the NoSQL databases exists since 1960,it has been recently found
that their high operations speed and flexibility in storing the data is
suitable for present data era.
1. Multi-Model
Relational databases store data in a fixed and predefined structure. It means when you start development you
will have to define your data schema in terms of tables and columns. You have to change the schema every time the
requirements change. This will lead to creating new columns, defining new relations, reflecting the changes in your
application, discussing with your database administrators, etc.
NoSQL database provides much more flexibility when it comes to handling data. There is no requirement to
specify the schema to start working with the application. Also, the NoSQL database doesn’t put a restriction on the
types of data you can store together. It allows you to add more new types as your needs change. These all are the
reasons NoSQL is best suited for agile development which requires fast implementation. Developers and architects
choose NoSQL to handle data easily for various kinds of agile development application requirements.
Prepared by Mrs K H Vijaya Kumari, Asst. Professor, Dept of IT
Need for NoSQL
2. Easily Scalable
The primary reason to choose a NoSQL database is easy scalability. Well, relational databases can also be
scaled but not easily and at a lower cost. Relational databases are built on the concept of traditional master-slave
architecture. Scaling up means upgrading your servers by adding more processors, RAM, and hard-disks to your
machine to handle more load and increase capacity. You will have to divide the databases into smaller chunks across
multiple hardware servers instead of a single large server. This is called sharding which is very complicated in relational
databases. Replacing and upgrading your database server machines to accommodate more throughput results in
downtime as well. These things become a headache for developers and architects.
NoSQL database built with a masterless, peer-to-peer architecture. Data is partitioned and balanced
across multiple nodes in a cluster, and aggregate queries are distributed by default. This allows easy scaling in no time.
Just executing a few commands will add the new server to the cluster. This scalability also improves performance,
Relational databases use a centralized application that is location-dependent (e.g. single location), especially for write
operations. On the other hand, the NoSQL database is designed to distribute data on a global scale. It uses multiple locations involving
multiple data centers and/or cloud regions for write and read operations. This distributed database has a great advantage with masterclass
architecture. You can maintain continuous availability because data is distributed with multiple copies where it needs to be.
What will happen if the hardware will fail? Well, NoSQL is also designed to handle these kinds of critical situations.
Hardware failure is a serious concern while building an application. Rather than requiring developers, DBAs, and operations staff to build
their redundant solutions, it can be addressed at the architectural level of the database in NoSQL. If we talk about Cassandra then it uses
several heuristics to determine the likelihood of node failure. Riak follows the network partitioning approach (when one or more nodes in a
The masterclass architecture of the NoSQL database allows multiple copies of data to be maintained across different
nodes. If one node goes down then another node will have a copy of the data for easy and fast access. This leads to zero downtime in the
NoSQL database. When one considers the cost of downtime, this is a big deal.
NoSQL can handle the massive amount of data very quickly and that’s the reason it is best suited for the
big data applications. NoSQL databases ensure data doesn’t become the bottleneck when all of the other components of
As we have mentioned about many features and advantages of using the NoSQL database but that doesn’t
mean that it fits in all kinds of applications. There are some downsides of NoSQL as well so choosing the right database
depends on your type of application or nature of data. Some projects are better suited to using an SQL database, while
others work well with NoSQL. Some could use both interchangeably.
If you are storing transactional data then SQL is your friend. Relational databases were created to process
transactions and they are very good with that. If your data is analytical then think about NoSQL, SQL was never
intended for data analytics and with large amounts of data that can become an issue.
In a typical relational database table, each row contains field values for a single record. In row-
wise database storage, data blocks store values sequentially for each consecutive column making up
the entire row. If block size is smaller than the size of a record, storage for an entire record may take
more than one block. If block size is larger than the size of a record, storage for an entire record may
take less than one block, resulting in an inefficient use of disk space. In online transaction processing
(OLTP) applications, most transactions involve frequently reading and writing all of the values for
entire records, typically one record or a small number of records at a time. As a result, row-wise
storage is optimal for OLTP databases.
Prepared by Mrs K H Vijaya Kumari, Asst. Professor, Dept of IT
Columnar Database
Using columnar storage, each data block holds column field values for as many as three times as many
records as row-based storage. This means that reading the same number of column field values for the same number of
records requires a third of the I/O operations compared to row-wise storage. In practice, using tables with very large
numbers of columns and very large row counts, storage efficiency is even greater.
An added advantage is that, since each block holds the same type of data, block data can use a compression
scheme selected specifically for the column data type, further reducing disk space and I/O. For more information
about compression encodings based on data types, see Compression encodings.
The savings in space for storing data on disk also carries over to retrieving and then storing that data in
memory. Since many database operations only need to access or operate on one or a small number of columns at a
time, you can save memory space by only retrieving blocks for columns you actually need for a query. Where OLTP
transactions typically involve most or all of the columns in a row for a small number of records, data warehouse
queries commonly read only a few columns for a very large number of rows. This means that reading the same
number of column field values for the same number of rows requires a fraction of the I/O operations and uses a
fraction of the memory that would be required for processing row-wise blocks. In practice, using tables with very
large numbers of columns and very large row counts, the efficiency gains are proportionally greater. For example,
suppose a table contains 100 columns. A query that uses five columns will only need to read about five percent of the
data contained in the table. This savings is repeated for possibly billions or even trillions of records for large
databases. In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well.
Prepared by Mrs K H Vijaya Kumari, Asst. Professor, Dept of IT
Columnar Database
Row-based block Storage
Consistency
Consistency means that all nodes in the network see
the same data at the same time.
Availability
Availability is a guarantee that every request receives a
response about whether it was successful or failed.
Partition Tolerance
guarantees that the system continues to operate
despite arbitrary message loss or failure of part of the system.
CAP theorem or Eric Brewers theorem states that we can
only achieve at most two out of three guarantees for a
database: Consistency, Availability and Partition Tolerance.
Compromising Consistency
Only one machine can accept modifications while the reads can be done from all machines. In such systems,
the modifications flow from that one machine to the rest. Such systems are highly available as there are
multiple machines to serve. Also, such systems are partition tolerant because if one machine goes down,
there are other machines available to take up that responsibility. Since it takes time for the data to reach other
machines from the node A, the other machine would be serving older data. This causes inconsistency.
Though the data is eventually going to reach all machine and after a while, things are going to okay. There
we call such systems eventually consistent instead of strongly consistent. This kind of architecture is found
in Zookeeper and MongoDB.
Prepared by Mrs K H Vijaya Kumari, Asst. Professor, Dept of IT
Prepared by Mrs K H Vijaya Kumari, Asst. Professor, Dept of IT