DSA 4-Introduction To NoSQL
INTRODUCTION TO NoSQL
Order
SQL v/s NoSQL
Scalability
• SQL databases are vertically scalable
◦ handle more load on a single server by
increasing things like CPU, RAM or SSD
Scale Vertically
• NoSQL databases are horizontally scalable
◦ handle more traffic by sharding, or adding more
servers in your NoSQL database
Scale Horizontally
SQL v/s NoSQL
Schema on Write v/s Schema on Read
• SQL databases require a fixed predefined schema, and all data must
follow a similar structure. Consequently, a lot of preparation
regarding the system is required upfront. Plus, flexibility is
compromised, considering that potential modifications in the
structure can be complex and may disrupt the
system.
• In turn, NoSQL databases follow a dynamic schema for unstructured
data. Since it does not require a predefined structure, modifications
are easier to execute. Thus, NoSQL databases have greater flexibility
◦ You can create documents without having to first define their structure
◦ Each document can have its own unique structure
◦ The syntax can vary from database to database, and
◦ You can add fields as you go.
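The dynamic-schema idea above can be sketched in a few lines of Python. This is only an illustration: the `users` list stands in for a document collection, and the field names and values are assumptions made up for the example.

```python
# Hypothetical in-memory "collection": a plain list of dict documents.
users = []

# No schema is declared up front; each document carries its own fields.
users.append({"name": "Ali", "age": 28})
users.append({"name": "Bilal", "email": "bilal@example.com", "cell": "0333-1234"})

# A field can be added to one document later without touching the others.
users[0]["city"] = "Lahore"

# Schema on read: the reader decides how to handle a missing field.
def ages(docs):
    return [doc.get("age") for doc in docs]

print(ages(users))  # [28, None]
```

In a schema-on-write system the second insert would be rejected or would need a table change first; here the cost moves to the reader, who must cope with missing fields.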
SQL v/s NoSQL
Language and Structure
• SQL databases use structured query language (SQL) for defining and manipulating
data.
• On one hand, this is extremely powerful: SQL is one of the most versatile and widely-
used options available, making it a safe choice and especially great for complex
queries.
• On the other hand, it can be restrictive.
◦ SQL requires that you use predefined schemas to determine the structure of your data
before you work with it. In addition, all of your data must follow the same structure. This
can require significant up-front preparation
• NoSQL databases, on the other hand, have dynamic schemas for unstructured data,
and data is stored in many ways
◦ NoSQL databases are document, key-value, graph or wide-column stores.
NoSQL
Types
• SQL databases are table-based, while NoSQL databases are either document-based,
key-value pairs, graph databases or wide-column stores.
• This makes relational SQL databases a better option for applications that require
multi-row transactions
◦ for example, an accounting system, or legacy systems that were built for a relational structure.
• Some examples of SQL databases include MySQL, Oracle, PostgreSQL, and Microsoft
SQL Server.
• NoSQL database examples include MongoDB, BigTable, Redis, RavenDB, Cassandra,
HBase, Neo4j and CouchDB.
NoSQL
Types: 1. Key-Value
• Data is stored in key/value pairs.
◦ A value, which can be basically any piece of data or information, is stored with a key that
identifies its location.
• The value in a key-value store can be anything: a string, a number, but also an entire
new set of key-value pairs encapsulated in an object.
Key       Value
124587    ABC
134679    DEF
135287    HIJ

Key       Value
124587    { "PrdID" : "001",
            "Name" : "MacBook Pro 13",
            "Color" : "Black" }
134679    DEF
135287    HIJ
NoSQL
Types: 1. Key-Value
data = {
"user:1:name": "Ali",
"user:1:age": 28,
"user:1:profile": {
"email": "[email protected]",
"address": "123 ABC XYZ"
},
"user:2:name": "Bilal",
"user:2:email": "[email protected]",
"user:2:cell": "0333-1234",
"user:3:name": "Hamza",
"user:3:age": 25,
"user:3:profile": {
"email": "[email protected]",
"address": "123 ZZ ABCD"
}
}
NoSQL
Types: 1. Key-Value
[Figure: key-value data distributed across servers (Server 1, Server 2, Server 4)]
NoSQL
Types: 1. Key-Value
• Based on Amazon's Dynamo paper, key-value databases store data as a hash table
where each key is unique, and the value can be JSON, a BLOB (Binary Large Object), a string, etc.
• Used for caching, storing and managing user sessions, ad serving, and recommendations.
• Redis, Dynamo, and Riak are some examples of key-value store databases.
• Features:
◦ very fast
◦ scalable (horizontally distributed to nodes based on key)
◦ simple data model
◦ eventual consistency
◦ fault-tolerance
◦ limitation: values are opaque to the store, so it can't model or query more complex data structures such as related objects
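A minimal sketch of the hash-table model described above, assuming a toy `KeyValueStore` class (a name invented for this example, not a real library API). The store serializes each value and never looks inside it, so data can only be reached through its key.

```python
import json

class KeyValueStore:
    """Toy key-value store: a hash table mapping unique keys to opaque values."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        # The store does not inspect the value; it is serialized to a string.
        self._table[key] = json.dumps(value)

    def get(self, key, default=None):
        raw = self._table.get(key)
        return default if raw is None else json.loads(raw)

    def delete(self, key):
        self._table.pop(key, None)

store = KeyValueStore()
store.put("session:42", {"user": "Ali", "cart": ["001"]})
print(store.get("session:42")["user"])  # Ali
store.delete("session:42")
print(store.get("session:42"))  # None
```

The session-style key mirrors the caching and session use cases listed above; there is deliberately no way to ask "which sessions contain product 001" without reading every value.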
NoSQL
Types: 1. Key-Value
Key-Value Stores for Binary Data
Advantages:
• Simplicity: Key-value stores are easy to use and set up.
• Performance: They can be very fast for storing and retrieving binary data, especially if the
data is small to medium-sized.
Disadvantages:
• Scalability: For large files or large volumes of binary data, key-value stores might not be as
efficient. Some key-value stores have limits on the size of the values they can handle.
• Lack of Specialized Features: Key-value stores do not typically provide specialized features
for handling media files, such as efficient indexing or transformation capabilities.
• Management Overhead: Managing and indexing large binary objects can become complex,
especially as the volume of data grows.
NoSQL
Types: 2. Document Stores
• Document stores are one step up in complexity from key-value stores.
◦ a document store does assume a certain document structure that can be specified with a
schema.
• Data is retrieved as a key-value pair, but the value part is stored as a document.
◦ The document is stored in JSON or XML formats. The value is understood by the DB and
can be queried.
{ "ID" : "001",
  "Name" : "John",
  "Grade" : "Senior" }

{ "ID" : "001",
  "Name" : "John",
  "Grade" : "Senior",
  "Classes" : {
    "Class1" : "English",
    "Class2" : "Geometry",
    "Class3" : "History" } }
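Because the database understands the document, it can filter on fields inside the value. A rough sketch of that idea, assuming a hypothetical `find` helper over a list of dicts (the second student record is invented for the example):

```python
students = [
    {"ID": "001", "Name": "John", "Grade": "Senior"},
    {"ID": "002", "Name": "Sara", "Grade": "Junior",
     "Classes": {"Class1": "English", "Class2": "Geometry"}},
]

def find(collection, criteria):
    """Return documents whose top-level fields match every criterion."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]

seniors = find(students, {"Grade": "Senior"})
print([doc["Name"] for doc in seniors])  # ['John']
```

A pure key-value store could only return a whole record by its ID; the field-level filter is what the document model adds.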
NoSQL
Types: 2. Document Stores
• Document stores appear the most natural among the NoSQL database types because
they’re designed to store everyday documents as they are.
◦ they allow for complex querying and calculations on this often already aggregated form of
data.
• The document type is mostly used for CMS systems, user profiling, blogging platforms,
and news articles
• Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular
document-oriented DBMSs.
NoSQL
Types: 2. Document Stores
• Use-case: an article with author info, comments, and replies
"articles": [
  {
    "title": "title of the article",
    "articleID": 1,
    "body": "body of the article",
    "author": "ABC",
    "comments": [
      {
        "username": "A",
        "commentid": 1,
        "commentbody": "this is a great article",
        "replies": [
          {
            "username": "B",
            "commentid": 2,
            "commentbody": "you & I so did not read the same thing!"
          }
        ]
      },
      {
        "username": "C",
        "commentid": 3,
        "commentbody": "Nice, but disagree with the conclusion"
      }
    ]
  }
]
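Since the article, its comments, and their replies live in one nested document, a single read is enough to walk the whole thread. A small sketch, assuming the document has already been loaded into a Python dict (`count_comments` is a helper name made up for this example):

```python
article = {
    "title": "title of the article",
    "articleID": 1,
    "author": "ABC",
    "comments": [
        {"username": "A", "commentid": 1,
         "commentbody": "this is a great article",
         "replies": [
             {"username": "B", "commentid": 2,
              "commentbody": "you & I so did not read the same thing!"}
         ]},
        {"username": "C", "commentid": 3,
         "commentbody": "Nice, but disagree with the conclusion"},
    ],
}

def count_comments(comments):
    """Count comments plus all nested replies, recursively."""
    return sum(1 + count_comments(c.get("replies", [])) for c in comments)

print(count_comments(article["comments"]))  # 3
```

In a relational design the same count would typically need a join between article and comment tables.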
NoSQL
Types: 3. Wide-Column
• Column-oriented or wide-column databases work on columns and are based on the
BigTable paper by Google.
• Organizes data storage into flexible columns that can be spread across multiple servers or
database nodes, using multi-dimensional mapping to reference data by column, row, and
timestamp. You can group related columns into column families. Individual rows then
constitute a column family.
• All columns are treated separately; the values of a single column are stored contiguously.
• While a relational database stores data in rows and reads data row by row, a column store
is organized as a set of columns. This means that when you want to run analytics on a
small number of columns, you can read those columns directly without consuming
memory with the unwanted data.
• Column-based NoSQL databases are widely used to manage data warehouses, business
intelligence, CRM, and library card catalogs.
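The difference between the row layout and the column layout can be sketched with plain Python containers. The table contents and column names below are assumptions chosen for illustration only.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "name": "Ali", "city": "Lahore", "amount": 120},
    {"id": 2, "name": "Bilal", "city": "Karachi", "amount": 80},
    {"id": 3, "name": "Hamza", "city": "Lahore", "amount": 200},
]

# Column-oriented layout: the values of each column are stored contiguously.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row store: every record is visited even though only "amount" is needed.
total_from_rows = sum(row["amount"] for row in rows)

# Column store: only the "amount" column is read.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)  # 400 400
```

Both layouts give the same answer; the column layout simply touches less data when the analysis needs only a few columns.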
NoSQL
Types: 4. Graph-Based
• A graph type database stores entities as well as the relations amongst those entities.
◦ Inspired by mathematical Graph Theory (G = (V, E))
• The entity is stored as a node with the relationship as edges. An edge gives a
relationship between nodes. Every node and edge has a unique identifier.
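A minimal sketch of the node-and-edge model, assuming made-up identifiers (`n1`, `e1`, ...) and a made-up `FOLLOWS` relationship label. Real graph databases expose their own query languages; this only shows the underlying data shape.

```python
from collections import defaultdict

# Nodes and edges each carry a unique identifier.
nodes = {"n1": "Ali", "n2": "Bilal", "n3": "Hamza"}
edges = {
    "e1": ("n1", "FOLLOWS", "n2"),
    "e2": ("n2", "FOLLOWS", "n3"),
}

# Index outgoing edges by source node for fast traversal.
outgoing = defaultdict(list)
for edge_id, (src, label, dst) in edges.items():
    outgoing[src].append((label, dst))

def neighbours(node_id, label):
    return [dst for lbl, dst in outgoing[node_id] if lbl == label]

def two_hops(node_id, label):
    """Nodes reachable by following `label` edges exactly twice."""
    return [far for near in neighbours(node_id, label)
            for far in neighbours(near, label)]

print([nodes[n] for n in two_hops("n1", "FOLLOWS")])  # ['Hamza']
```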
The CAP Theorem
• Also called Brewer’s Theorem, because it was first advanced by Professor Eric A.
Brewer during a talk he gave on distributed computing in 2000
• Consistency - Every node provides the most recent state or does not provide a state at all.
• Consistency means that all clients see the same data at the same time, no matter which
node they connect to.
• For this to happen, whenever data is written to one node, it must be instantly forwarded
or replicated to all the other nodes in the system before the write is deemed ‘successful.’
The CAP Theorem
• Available + Partition Tolerant
◦ Sacrifices Consistency
The CAP Theorem
• Consistent + Partition Tolerant
◦ Sacrifices Availability
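The trade-off during a network partition can be sketched with a toy two-replica cluster. The `Cluster` class and its `mode` flag are assumptions invented for this example; real systems make this choice through their replication and quorum settings.

```python
class Replica:
    def __init__(self):
        self.value = None

class Cluster:
    """Two replicas; `partitioned` simulates a broken link between them."""

    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP" (labels used in this sketch)
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value):
        if not self.partitioned:
            self.a.value = self.b.value = value   # replicate to every node
            return True
        if self.mode == "CP":
            return False            # refuse the write: availability is sacrificed
        self.a.value = value        # accept locally: consistency is sacrificed
        return True

cp, ap = Cluster("CP"), Cluster("AP")
for cluster in (cp, ap):
    cluster.write("v1")
    cluster.partitioned = True

print(cp.write("v2"), cp.a.value == cp.b.value)  # False True
print(ap.write("v2"), ap.a.value == ap.b.value)  # True False
```

The CP cluster stays consistent but rejects the write; the AP cluster accepts it and its replicas now disagree.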
• Some NoSQL DBMSs, such as Apache’s CouchDB or IBM’s Db2, also possess a certain
degree of ACID compliance.
BASE
• The acronym BASE is slightly more confusing than ACID. However, the words behind it
suggest ways in which the BASE model is different.
• BASE stands for:
◦ Basically Available – Rather than enforcing immediate consistency, BASE-modelled
NoSQL databases will ensure availability of data by spreading and replicating it across
the nodes of the database cluster.
◦ Soft State – Due to the lack of immediate consistency, data values may change over
time. The BASE model breaks with the concept of a database which enforces its
own consistency, delegating that responsibility to developers.
◦ Eventually Consistent – The fact that BASE does not enforce immediate consistency
does not mean that it never achieves it. However, until it does, data reads are still
possible (even though they might not reflect the reality).
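The three BASE properties can be illustrated with a toy primary/replica pair. `EventuallyConsistentStore` and its `sync` step are hypothetical names for this sketch; in a real cluster the catch-up happens in the background rather than through an explicit call.

```python
class EventuallyConsistentStore:
    """Primary accepts writes at once; the replica catches up when synced."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []           # writes not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value   # basically available: accept immediately
        self.pending.append((key, value))

    def read_from_replica(self, key):
        return self.replica.get(key)  # soft state: may be stale

    def sync(self):
        # Eventually consistent: replay queued writes in order.
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("likes", 10)
print(store.read_from_replica("likes"))  # None (stale)
store.sync()
print(store.read_from_replica("likes"))  # 10
```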
BASE
• Just as SQL databases are almost uniformly ACID compliant, NoSQL databases tend to
conform to BASE principles.
• MongoDB, Cassandra and Redis are among the most popular NoSQL solutions,
together with Amazon DynamoDB and Couchbase.
ACID v/s BASE
• Financial institutions will almost exclusively use ACID databases. Money transfers
depend on the atomic nature of ACID.
• An interrupted transaction which is not immediately removed from the database can
cause a lot of issues. Money could be debited from one account and, due to an error,
never credited to another.
• Marketing and customer service companies who deal with sentiment analysis will
prefer the elasticity of BASE when conducting their social network research. Social
network feeds are not well structured but contain huge amounts of data which a
BASE-modeled database can easily store.
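The money-transfer point can be made concrete with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `accounts` table, the account names, and the `transfer` helper are invented for the example, and the credit is applied before the debit only so that the failure happens halfway through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit and credit inside one transaction; roll both back on any error."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

print(transfer(conn, "A", "B", 30), balances(conn))   # True {'A': 70, 'B': 80}
print(transfer(conn, "A", "B", 500), balances(conn))  # False {'A': 70, 'B': 80}
```

In the failed transfer the credit to B had already been applied when the debit violated the balance check; the rollback removes it, so no money appears or disappears.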
ACID v/s BASE
• It is not possible to give a straight answer to the question of which database model is
better. Therefore, a decision must be reached by considering all the aspects of the
project.
• Given their highly structured nature, ACID-compliant databases will be a better fit for
those who require consistency, predictability, and reliability.
• Those who consider growth to be among their priorities will likely want to choose the
BASE model, because it enables easier scaling out and provides more flexibility.
◦ BASE also requires developers who will know how to deal with the limitations of the
model.
Database Partitioning
• The best way to provide ACID and a rich query model is to have the dataset on a single
machine
• Scaling up (vertical scaling: make a “single” machine more powerful)
◦ dataset is just too big!
• Breaking large datasets into smaller ones and distributing datasets and query loads on
those datasets are requisites to high scalability
• Scaling out (horizontal scaling: adding more smaller/cheaper servers)
◦ better choice
• Approaches:
◦ Different approaches for horizontal scaling (multi-node database):
Sharding (partitioning)
Sharding
Considerations:
Storage space
Resources
Bandwidth
Geography
Storage space. A data store for a large-scale application is expected to contain huge
volumes of data that could increase significantly over time.
◦ A server typically provides only a finite amount of disk storage, but you can replace
existing disks with larger ones, or add further disks to a machine as data volumes grow.
◦ However, the system will eventually reach a limit where it isn't possible to easily increase
the storage capacity on a given server.
Sharding (Partitioning)
Computing resources. An application is required to support a large number of
concurrent users, each of whom runs queries that retrieve information from the data
store.
◦ A single server hosting the data store might not be able to provide the necessary
computing power to support this load, resulting in extended response times for users and
frequent failures as applications attempting to store and retrieve data time out.
◦ It might be possible to add memory or upgrade processors, but the system will reach a
limit when it isn't possible to increase the compute resources any further.
Sharding (Partitioning)
Network bandwidth. Ultimately, the performance of a data store running on a single
server is governed by the rate the server can receive requests and send replies.
• It is possible that the volume of network traffic might exceed the capacity of the
network used to connect to the server, resulting in failed requests.
• If data is sharded unevenly, a single node might take all the load while other nodes
sit idle. This defeats the purpose of sharding/partitioning.
Sharding Strategies
Lookup
• In this strategy, the sharding logic implements a map that routes a request for data to
the shard that contains that data using the shard key.
• In a multi-tenant application all the data for a tenant might be stored together in a
shard using the tenant ID as the shard key.
• Multiple tenants might share the same shard, but the data for a single tenant won't
be spread across multiple shards.
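The lookup strategy can be sketched as a dictionary that maps each shard key to a shard. The tenant IDs, shard names, and helper functions below are assumptions for illustration; a real system would keep the map in a shared, durable store.

```python
# Hypothetical lookup map: tenant ID (shard key) -> shard name.
shard_map = {
    "tenant-1": "shard-A",
    "tenant-2": "shard-A",   # several tenants may share a shard
    "tenant-3": "shard-B",
}

# Each shard is modelled as its own dictionary of tenant data.
shards = {"shard-A": {}, "shard-B": {}}

def shard_for(tenant_id):
    """Route a request to the shard that holds this tenant's data."""
    return shard_map[tenant_id]

def save(tenant_id, key, value):
    shards[shard_for(tenant_id)].setdefault(tenant_id, {})[key] = value

save("tenant-1", "plan", "gold")
save("tenant-3", "plan", "free")
print(shard_for("tenant-2"))          # shard-A
print(sorted(shards["shard-A"]))      # ['tenant-1']
```

Moving a tenant to another shard only requires copying its data and changing one entry in the map, which is the extra control this strategy offers.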
Sharding Strategies
key-range (Range)
• Sharding/partitioning by key-range divides partitions based on certain ranges.
Sharding Strategies
key-range (Range)
• Another example can be dividing employee data of an organization based on the first
letter of the employee name.
• So A-C, D-F, G-K, L-P, Q-Z is one of the ways by which whole organization data can be
partitioned in 5 parts.
• Now we know the boundaries of such partition so we can directly query a particular
partition if we know the name of an employee.
• Ranges are not necessarily evenly spaced. As in the above example, partition Q-Z has
10 letters of the alphabet but partition A-C has only three.
• Hashing the key does not remove hot spots: if reads and writes are for the same key, all
requests still end up on the same partition.
• For example, on Instagram, a celebrity can have millions of followers. If this celebrity
posts something on their account, and the post is stored using a hash-key-based
partitioning strategy (keyed on the celebrity's user ID), then there could be millions of writes (view
count updates, comments, etc.) or reads (a read query for each follower) coming from the
millions of followers, all hitting one partition.
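The key-range example (A-C, D-F, G-K, L-P, Q-Z) can be sketched with a sorted list of range boundaries and a binary search. `partition_for` is a name made up for this sketch, and it assumes names start with a Latin letter.

```python
from bisect import bisect_left

# Upper bound (inclusive) of each range: A-C, D-F, G-K, L-P, Q-Z.
upper_bounds = ["C", "F", "K", "P", "Z"]
partitions = ["A-C", "D-F", "G-K", "L-P", "Q-Z"]

def partition_for(name):
    """Pick the partition from the first letter of the employee name."""
    first = name[0].upper()
    if not "A" <= first <= "Z":
        raise ValueError(f"unsupported name: {name!r}")
    return partitions[bisect_left(upper_bounds, first)]

print(partition_for("Ali"))    # A-C
print(partition_for("Hamza"))  # G-K
print(partition_for("Zara"))   # Q-Z
```

Because the boundaries are known, a query for one employee goes straight to one partition, and a query for all names from D to F touches only the D-F partition.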
Sharding Strategies
Sharding
Advantages and Considerations
Lookup:
This offers more control over the way that shards are configured and used.
Looking up shard locations can impose an additional overhead.
Range:
Easy to implement and works well with range queries because they can often fetch
multiple data items from a single shard in a single operation.
This strategy offers easier data management. For example, if users in the same region are
in the same shard, updates can be scheduled in each time zone based on the local load
and demand pattern.
This strategy doesn't provide optimal balancing between shards. Rebalancing shards is
difficult and might not resolve the problem of uneven load if the majority of activity is for
adjacent shard keys.
Sharding
Advantages and Considerations
Hash:
This strategy offers a better chance of more even data and load distribution.
Request routing can be accomplished directly by using the hash function.
There's no need to maintain a map.
Note that computing the hash might impose an additional overhead.
Rebalancing shards is difficult.
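A rough sketch of the hash strategy and its trade-offs, assuming SHA-256 as the hash function and simple modulo placement (both are choices made for this example, not something the slides prescribe).

```python
import hashlib
from collections import Counter

def shard_for(key, num_shards=4):
    """Map a key to a shard with a stable hash; no lookup map is needed."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"user:{i}" for i in range(10_000)]

# Many distinct keys spread across all four shards.
spread = Counter(shard_for(k) for k in keys)
print(sorted(spread))  # [0, 1, 2, 3]

# A hot key always hashes to the same shard, however many requests arrive.
hot = Counter(shard_for("user:celebrity") for _ in range(1_000))
print(len(hot))  # 1

# Going from 4 to 5 shards remaps most keys, so rebalancing moves a lot of data.
moved = sum(shard_for(k, 4) != shard_for(k, 5) for k in keys)
print(moved > len(keys) // 2)  # True
```

The last check illustrates why rebalancing is difficult with plain modulo hashing: changing the shard count changes the target shard for most keys.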