DSA 4-Introduction To NoSQL

Data science and analytics: introduction to NoSQL databases and how to implement them.

Uploaded by

Alishba Aleem

Data Science and Analytics (SW-326)

INTRODUCTION TO NoSQL

Dr. Areej Fatemah Meghji [email protected]


NoSQL
• Stands for Not Only SQL
• The term NoSQL was introduced by Carlo Strozzi in 1998 to name his file-based
relational database
• It was re-introduced by Eric Evans in 2009 when an event was organized to discuss
open-source distributed databases
• He stated that “… the whole point of seeking alternatives is that you need to solve a
problem that relational databases are a bad fit for…”
SQL v/s NoSQL
• SQL databases are built to store relational data as efficiently as possible
◦ Customers place Orders, and Orders contain Products.
• This tight organization is great for managing your data, but it comes at a cost:
◦ RDBs have a hard time scaling, as scaling is an intensive process requiring a lot of
memory and computing power.

[Diagram: Customer —places→ Order —contains→ Product]
SQL v/s NoSQL
Scalability
• SQL databases are vertically scalable
◦ decrease the load on a single server by
increasing things like CPU, RAM or SSD

Scale Vertically
• NoSQL databases are horizontally scalable
◦ handle more traffic by sharding, or adding more
servers in your NoSQL database

Scale Horizontally
SQL v/s NoSQL
Schema on Write v/s Schema on Read
• SQL databases require a fixed, predefined schema, and all data must
follow the same structure. Consequently, a lot of preparation
regarding the system is required upfront. Plus, flexibility is
compromised, since potential modifications to the structure can be
complex and may disrupt the system.
• In turn, NoSQL databases follow a dynamic schema for unstructured
data. Since it does not require a predefined structure, modifications
are easier to execute. Thus, NoSQL databases have greater flexibility
◦ You can create documents without having to first define their structure
◦ Each document can have its own unique structure
◦ The syntax can vary from database to database, and
◦ You can add fields as you go.
SQL v/s NoSQL
Language and Structure
• SQL databases use structured query language (SQL) for defining and manipulating
data.
• On one hand, this is extremely powerful: SQL is one of the most versatile and widely-
used options available, making it a safe choice and especially great for complex
queries.
• On the other hand, it can be restrictive.
◦ SQL requires that you use predefined schemas to determine the structure of your data
before you work with it. In addition, all of your data must follow the same structure. This
can require significant up-front preparation
• NoSQL databases, on the other hand, have dynamic schemas for unstructured data,
and data is stored in many ways
◦ NoSQL databases are document, key-value, graph or wide-column stores.
NoSQL
Types
• SQL databases are table-based, while NoSQL databases are either document-based,
key-value pairs, graph databases or wide-column stores.
• This makes relational SQL databases a better option for applications that require
multi-row transactions
◦ accounting system - or for legacy systems that were built for a relational structure.

• Some examples of SQL databases include MySQL, Oracle, PostgreSQL, and Microsoft
SQL Server.
• NoSQL database examples include MongoDB, BigTable, Redis, RavenDB, Cassandra,
HBase, Neo4j and CouchDB.
NoSQL
Types: 1. Key-Value
• Data is stored in key/value pairs.
◦ A value, which can be basically any piece of data or information, is stored with a key that
identifies its location.
• The value in a key-value store can be anything: a string, a number, but also an entire
new set of key-value pairs encapsulated in an object.

Key       Value
124587    ABC
134679    DEF
135287    HIJ

Key       Value
124587    { "PrdID": "001", "Name": "MacBook Pro 13", "Color": "Black" }
134679    DEF
135287    HIJ
NoSQL
Types: 1. Key-Value
data = {
    "user:1:name": "Ali",
    "user:1:age": 28,
    "user:1:profile": {
        "email": "[email protected]",
        "address": "123 ABC XYZ"
    },
    "user:2:name": "Bilal",
    "user:2:email": "[email protected]",
    "user:2:cell": "0333-1234",
    "user:3:name": "Hamza",
    "user:3:age": 25,
    "user:3:profile": {
        "email": "[email protected]",
        "address": "123 ZZ ABCD"
    }
}
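A minimal sketch of how such namespaced keys are read and written, in plain Python (illustrative only; the class and method names are made up, not the API of a real key-value store such as Redis):

```python
# Minimal in-memory key-value store sketch (illustrative only).
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value can be anything: a string, a number, or a nested object.
        self._data[key] = value

    def get(self, key, default=None):
        # Lookup is by exact key; the store does not interpret the value.
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:1:name", "Ali")
store.put("user:1:profile", {"email": "a@example", "address": "123 ABC XYZ"})

print(store.get("user:1:name"))                 # Ali
print(store.get("user:1:profile")["address"])   # 123 ABC XYZ
```

Note that the store itself knows nothing about the `user:N:field` convention; the structure lives entirely in how the application names its keys.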
NoSQL
Types: 1. Key-Value
With 2 servers, the hash space 01–100 is split at 50 (Server 1: 01–50, Server 2: 50–100):

Key       Hash   Value   Server
124587    04     ABC     Server 1
134679    27     DEF     Server 1
135287    73     HIJ     Server 2

With 4 servers, the space is split at 25, 50 and 75 (Server 1: 01–25, Server 2: 25–50,
Server 3: 50–75, Server 4: 75–100):

Key       Hash   Value   Server
124587    04     ABC     Server 1
134679    27     DEF     Server 2
135287    100    HIJ     Server 4
NoSQL
Types: 1. Key-Value
• Based on Amazon's Dynamo paper, the key-value pair storage databases store data as a hash table
where each key is unique, and the value can be a JSON, BLOB(Binary Large Objects), string, etc.
• used for caching, storing, and managing user sessions, ad servicing, and recommendations.
• Redis, Dynamo, and Riak are some examples of key-value store databases.
• Features:
◦ very fast
◦ scalable (horizontally distributed to nodes based on key)
◦ simple data model
◦ eventual consistency
◦ fault-tolerance
◦ Can’t model more complex data structures such as objects
NoSQL
Types: 1. Key-Value
Key-Value Stores for Binary Data

Advantages:
• Simplicity: Key-value stores are easy to use and set up.
• Performance: They can be very fast for storing and retrieving binary data, especially if the
data is small to medium-sized.

Disadvantages:
• Scalability: For large files or large volumes of binary data, key-value stores might not be as
efficient. Some key-value stores have limits on the size of the values they can handle.
• Lack of Specialized Features: Key-value stores do not typically provide specialized features
for handling media files, such as efficient indexing or transformation capabilities.
• Management Overhead: Managing and indexing large binary objects can become complex,
especially as the volume of data grows.
NoSQL
Types: 2. Document Stores
• Document stores are one step up in complexity from key-value stores.
◦ a document store does assume a certain document structure that can be specified with a
schema.
• Retrieve data as a key value pair but the value part is stored as a document.
◦ The document is stored in JSON or XML formats. The value is understood by the DB and
can be queried.

{ "ID" : "001",
  "Name" : "John",
  "Grade" : "Senior" }

{ "ID" : "001",
  "Name" : "John",
  "Grade" : "Senior",
  "Classes" : {
    "Class1" : "English",
    "Class2" : "Geometry",
    "Class3" : "History" } }
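Because the database understands the value, it can filter on fields inside it. A rough sketch of that idea in plain Python (illustrative only; the `find` helper and sample documents are made up, not a real document-store API):

```python
# Sketch of querying inside stored documents (illustrative only).
students = [
    {"ID": "001", "Name": "John", "Grade": "Senior",
     "Classes": {"Class1": "English", "Class2": "Geometry", "Class3": "History"}},
    {"ID": "002", "Name": "Sara", "Grade": "Junior",
     "Classes": {"Class1": "Algebra"}},
]

def find(docs, **criteria):
    # Return documents whose top-level fields match all criteria --
    # the kind of filter a document store evaluates server-side.
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

seniors = find(students, Grade="Senior")
print([d["Name"] for d in seniors])  # ['John']
```

In a plain key-value store this query would be impossible without fetching and decoding every value in the application.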
NoSQL
Types: 2. Document Stores
• Document stores appear the most natural among the NoSQL database types because
they’re designed to store everyday documents as they are.
◦ they allow for complex querying and calculations on this often already aggregated form of
data.

• The document type is mostly used for CMS systems, user profiling, blogging platforms,
and news articles
• Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular
document-oriented DBMS systems.
NoSQL
Types: 2. Document Stores
• Use-case: a blog article stored as a single document, with the article body, author
info, comments, and replies to comments all aggregated together:

"articles": [
  {
    "title": "title of the article",
    "articleID": 1,
    "body": "body of the article",
    "author": "ABC",
    "comments": [
      {
        "username": "A",
        "commentid": 1,
        "commentbody": "this is a great article",
        "replies": [
          {
            "username": "B",
            "commentid": 2,
            "commentbody": "you & I so did not read the same thing!"
          }
        ]
      },
      {
        "username": "C",
        "commentid": 3,
        "commentbody": "Nice, but disagree with the conclusion",
        …
NoSQL
Types: 3. Wide-Column
• Column-oriented or wide-column databases work on columns and are based on the
BigTable paper by Google.
• Organizes data storage into flexible columns that can be spread across multiple servers or
database nodes, using multi-dimensional mapping to reference data by column, row, and
timestamp. Related columns can be grouped into column families; a row is then made up
of one or more column families.
• Each column is treated separately; the values of a single column are stored contiguously.
• While a relational database stores data in rows and reads data row by row, a column store
is organized as a set of columns. This means that when you want to run analytics on a
small number of columns, you can read those columns directly without consuming
memory with the unwanted data.
• Column-based NoSQL databases are widely used for data warehouses, business
intelligence, CRM, and library card catalogs.
NoSQL
Types: 3. Wide-Column
• While a relational database stores data in rows and
reads data row by row, a column store is organized as
a set of columns.
◦ This means that when you want to run analytics on a
small number of columns, you can read those columns
directly without consuming memory with the
unwanted data.
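The row-store vs. column-store difference above can be sketched in plain Python (illustrative only; the sample records are made up):

```python
# Row store: each record is kept together.
rows = [
    {"id": 1, "name": "Ali",   "age": 28, "city": "Karachi"},
    {"id": 2, "name": "Bilal", "age": 31, "city": "Lahore"},
    {"id": 3, "name": "Hamza", "age": 25, "city": "Karachi"},
]

# Column store: each column's values are stored contiguously.
columns = {
    "id":   [1, 2, 3],
    "name": ["Ali", "Bilal", "Hamza"],
    "age":  [28, 31, 25],
    "city": ["Karachi", "Lahore", "Karachi"],
}

# Analytics over one column touches only that column's data:
avg_age = sum(columns["age"]) / len(columns["age"])
print(avg_age)  # 28.0

# The row store must scan every full record for the same answer:
avg_age_rows = sum(r["age"] for r in rows) / len(rows)
```

On disk the difference matters far more than in memory: the column layout lets the engine read only the `age` bytes instead of every record.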
NoSQL
Types: 3. Wide-Column
NoSQL
Types: 4. Graph-Based
• A graph-type database stores entities as well as the relationships among those entities.
◦ Inspired by mathematical graph theory (G = (V, E))
• Each entity is stored as a node, and each relationship as an edge. An edge gives the
relationship between two nodes. Every node and edge has a unique identifier.

• Traversing relationships is fast
◦ relationships are already captured in the DB, and there is no need to
calculate them at query time.
• Graph-based databases are mostly used for social networks,
logistics, and spatial data
• Neo4j, Infinite Graph, OrientDB, and FlockDB are some
popular graph-based databases.
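A toy sketch of that idea in plain Python (illustrative only; the adjacency-list layout and node names are made up, not a real graph-database API):

```python
# Adjacency-list sketch of a graph store (illustrative only).
# Nodes are ids; each edge carries a relationship label.
edges = {
    "ali":   [("follows", "bilal"), ("follows", "hamza")],
    "bilal": [("follows", "hamza")],
    "hamza": [],
}

def neighbors(node, rel):
    # Traversal is a direct lookup -- the relationship is already stored,
    # so there is no join to compute at query time.
    return [dst for (r, dst) in edges.get(node, []) if r == rel]

print(neighbors("ali", "follows"))  # ['bilal', 'hamza']
```

In a relational schema the same query would need a join over a `follows` table; here the edges are stored with the node, which is what makes traversal cheap.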
NoSQL
Types:
Summary
• NoSQL is a non-relational DBMS that does not require a fixed schema, avoids joins, and
is easy to scale
• The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc., who deal with huge volumes of data
• In 1998, Carlo Strozzi used the term NoSQL for his lightweight, open-source
relational database
• NoSQL databases never follow the relational model; they are either schema-free or
have relaxed schemas
• Four types of NoSQL Database are:
1). Key-value Based 2). Document-oriented
3). Column-oriented 4). Graph based
• NOSQL can handle structured, semi-structured, and unstructured data
The CAP Theorem
• The CAP theorem states that a distributed data store system (distributed system) can
deliver only two of three desired characteristics:
• Consistency
• Availability
• Partition tolerance
• (the ‘C,’ ‘A’ and ‘P’ in CAP)

• Also called Brewer’s Theorem, because it was first advanced by Professor Eric A.
Brewer during a talk he gave on distributed computing in 2000
The CAP Theorem
• Consistency - Every node provides the most recent state or does not provide a state at all.
• Consistency means that all clients see the same data at the same time, no matter which
node they connect to.
• For this to happen, whenever data is written to one node, it must be instantly forwarded
or replicated to all the other nodes in the system before the write is deemed ‘successful.’
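A toy sketch of that rule in plain Python (illustrative only; real systems use replication protocols, not a shared list of dicts), assuming a synchronous write to every replica:

```python
# Toy model of strict consistency: a write succeeds only after every
# replica has applied it, so every node then returns the same value.
class Cluster:
    def __init__(self, n):
        self.replicas = [{} for _ in range(n)]

    def write(self, key, value):
        # Replicate to ALL nodes before deeming the write successful.
        for replica in self.replicas:
            replica[key] = value
        return True

    def read(self, node, key):
        return self.replicas[node].get(key)

cluster = Cluster(3)
cluster.write("balance", 100)
print([cluster.read(i, "balance") for i in range(3)])  # [100, 100, 100]
```

The cost of this guarantee is latency: the write cannot be acknowledged until the slowest replica has applied it, and it cannot complete at all if a replica is unreachable.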

• Availability - Every node has constant read and write access.


• Availability means that any client making a request for data gets a response, even if
one or more nodes are down.
• Another way to state this—all working nodes in the distributed system return a valid
response for any request, without exception.
The CAP Theorem
• Partition Tolerance - The system works despite partitions in the network.

• A partition is a communications break within a distributed system—a lost or temporarily


delayed connection between two nodes.
• Partition tolerance means that the cluster must continue to work despite any number of
communication breakdowns between nodes in the system.
The CAP Theorem
The CAP Theorem
• Available + Partition Tolerant

[Diagram: nodes 1–4 split into two groups by a network partition. Because the system
remains available, it serves the last available update.]

Sacrifices Consistency
The CAP Theorem
• Available + Partition Tolerant

• AP database: An AP database delivers availability and partition tolerance at the


expense of consistency.
• When a partition occurs, all nodes remain available but those at the wrong end of a
partition might return an older version of data than others.
◦ When the partition is resolved, the AP databases typically resync the nodes to repair all
inconsistencies in the system.
The CAP Theorem
• Consistent + Partition Tolerant

[Diagram: nodes 1–4 split into two groups by a network partition. Because the system is
consistent, it returns an error or a message indicating the data is unavailable.]

Sacrifices Availability
The CAP Theorem
• Consistent + Partition Tolerant

• CP database: A CP database delivers consistency and partition tolerance at the


expense of availability.
• When a partition occurs between any two nodes, the system has to shut down the
non-consistent node (i.e., make it unavailable) until the partition is resolved.
The CAP Theorem
• Consistent + Available
[Diagram: nodes 1–4 fully connected. Because the system is consistent and available, it
provides properly updated information. BUT what will happen in case of a partition?]

Sacrifices Availability OR Consistency


The CAP Theorem
• Consistent + Available

• CA database: A CA database delivers consistency and availability across all nodes.


• It can’t do this if there is a partition between any two nodes in the system and
therefore can’t deliver fault tolerance.
• Very large systems will “partition” at some point:
• That leaves either C or A to choose from
• traditional DBMS prefers C over A and P

• Two types of consistency:
1. Strong consistency – ACID
2. Weak consistency – BASE
ACID
• The ACID database transaction model ensures that a performed transaction is always
consistent.
• This makes it a good fit for businesses which deal with online transaction processing
(e.g., finance institutions) or online analytical processing (e.g., data warehousing).
• These organizations need database systems which can handle many small
simultaneous transactions.
◦ There must be zero tolerance for invalid states.
ACID
• ACID stands for:
• Atomic – Each transaction is either properly carried out or the process halts and the
database reverts back to the state before the transaction started. This ensures that all
data in the database is valid.
• Consistent – A processed transaction will never endanger the structural integrity of
the database.
• Isolated – Transactions cannot compromise the integrity of other transactions by
interacting with them while they are still in progress.
• Durable – The data related to the completed transaction will persist even in the cases
of network or power outages. If a transaction fails, it will not impact the manipulated
data.
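A minimal sketch of atomicity using Python's built-in sqlite3 module (illustrative; the `transfer` helper and account data are made up, but any ACID-compliant RDBMS behaves the same way):

```python
import sqlite3

# In-memory database; a money transfer either fully happens or is rolled back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 0)")
conn.commit()

def transfer(conn, amount, fail=False):
    # Debit A, then credit B; a failure in between must undo the debit.
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = 'A'", (amount,))
        if fail:
            raise RuntimeError("simulated power outage mid-transfer")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = 'B'", (amount,))
        conn.commit()
    except RuntimeError:
        conn.rollback()  # atomicity: the half-finished transfer is undone

transfer(conn, 50, fail=True)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'A': 100, 'B': 0}
```

Without the rollback, money would have been debited from A and never credited to B, exactly the invalid state ACID exists to rule out.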
ACID
• One safe way to make sure your database is ACID compliant is to choose a relational
database management system. These include MySQL, PostgreSQL, Oracle, SQLite, and
Microsoft SQL Server.

• Some NoSQL DBMSs, such as Apache’s CouchDB or IBM’s Db2, also possess a certain
degree of ACID compliance.
BASE
• The acronym BASE is slightly more confusing than ACID. However, the words behind it
suggest ways in which the BASE model is different.
• BASE stands for:
• Basically Available – Rather than enforcing immediate consistency, BASE-modelled
NoSQL databases will ensure availability of data by spreading and replicating it across
the nodes of the database cluster.
• Soft State – Due to the lack of immediate consistency, data values may change over
time. The BASE model breaks off with the concept of a database which enforces its
own consistency, delegating that responsibility to developers.
• Eventually Consistent – The fact that BASE does not enforce immediate consistency
does not mean that it never achieves it. However, until it does, data reads are still
possible (even though they might not reflect the reality).
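A toy sketch of eventual consistency in plain Python (illustrative only; real BASE systems use asynchronous replication and anti-entropy protocols, not an explicit `sync` call): a write is acknowledged after reaching one replica, and a later sync brings the others up to date.

```python
# Toy model of eventual consistency: a write is acknowledged after
# reaching one replica; the others catch up later during a sync.
class EventuallyConsistentStore:
    def __init__(self, n):
        self.replicas = [{} for _ in range(n)]

    def write(self, key, value):
        self.replicas[0][key] = value  # acknowledged immediately

    def read(self, node, key):
        return self.replicas[node].get(key)  # may be stale

    def sync(self):
        # Anti-entropy pass: copy the primary's state to the other replicas.
        for replica in self.replicas[1:]:
            replica.update(self.replicas[0])

store = EventuallyConsistentStore(3)
store.write("likes", 42)
print(store.read(2, "likes"))  # None -- stale read before the sync
store.sync()
print(store.read(2, "likes"))  # 42 -- consistent eventually
```

The stale read in the middle is the "soft state" the model accepts in exchange for fast, always-available writes.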
BASE
• Just as SQL databases are almost uniformly ACID compliant, NoSQL databases tend to
conform to BASE principles.
• MongoDB, Cassandra and Redis are among the most popular NoSQL solutions,
together with Amazon DynamoDB and Couchbase.
ACID v/s BASE
• Financial institutions will almost exclusively use ACID databases. Money transfers
depend on the atomic nature of ACID.
• An interrupted transaction which is not immediately removed from the database can
cause a lot of issues. Money could be debited from one account and, due to an error,
never credited to another.

• Marketing and customer service companies who deal with sentiment analysis will
prefer the elasticity of BASE when conducting their social network research. Social
network feeds are not well structured but contain huge amounts of data which a
BASE-modeled database can easily store.
ACID v/s BASE
• It is not possible to give a straight answer to the question of which database model is
better. Therefore, a decision must be reached by considering all the aspects of the
project.
• Given their highly structured nature, ACID-compliant databases will be a better fit for
those who require consistency, predictability, and reliability.
• Those who consider growth to be among their priorities will likely want to choose the
BASE model, because it enables easier scaling up and provides more flexibility.
◦ BASE also requires developers who will know how to deal with the limitations of the
model.
Database Partitioning
• The best way to provide ACID and a rich query model is to have the dataset on a single
machine
• Scaling up (vertical scaling: make a “single” machine more powerful)
◦ dataset is just too big!
• Breaking large datasets into smaller ones and distributing datasets and query loads on
those datasets are requisites to high scalability
• Scaling out (horizontal scaling: adding more smaller/cheaper servers)
◦ better choice
• Approaches to horizontal scaling (multi-node databases):
 Sharding (partitioning)
Sharding

Considerations when sharding a large volume of data:
 Storage space
 Computing resources
 Network bandwidth
 Geography
Sharding (Partitioning)
• If a dataset becomes too large for a single node, or when high throughput is required,
a single node cannot suffice.

 Storage space. A data store for a large-scale application is expected to contain huge
volumes of data that could increase significantly over time.
◦ A server typically provides only a finite amount of disk storage, but you can replace
existing disks with larger ones, or add further disks to a machine as data volumes grow.
◦ However, the system will eventually reach a limit where it isn't possible to easily increase
the storage capacity on a given server.
Sharding (Partitioning)
 Computing resources. An application is required to support a large number of
concurrent users, each of which run queries that retrieve information from the data
store.
◦ A single server hosting the data store might not be able to provide the necessary
computing power to support this load, resulting in extended response times for users and
frequent failures as applications attempting to store and retrieve data time out.
◦ It might be possible to add memory or upgrade processors, but the system will reach a
limit when it isn't possible to increase the compute resources any further.
Sharding (Partitioning)
 Network bandwidth. Ultimately, the performance of a data store running on a single
server is governed by the rate the server can receive requests and send replies.
• It is possible that the volume of network traffic might exceed the capacity of the
network used to connect to the server, resulting in failed requests.

 Geography. It might be necessary to store data generated by specific users in the


same region as those users for legal, compliance, or performance reasons, or to
reduce latency of data access.
• If the users are dispersed across different countries or regions, it might not be
possible to store the entire data for the application in a single data store.
Sharding (Partitioning)
• We need to partition/shard such datasets into smaller chunks and then each partition
(shard) can act as a database on its own.
• Each shard has the same schema, but holds its own distinct subset of the data.
◦ A shard is a data store in its own right (it can contain the data for many entities of different
types), running on a server acting as a storage node.
• Thus, a large dataset can be spread across many smaller partitions/shards and each
can independently execute queries or run some programs.
◦ This way large executions can be parallelized across nodes (Partitions/Shards)
Sharding (Partitioning)
• The purpose behind partitioning is to spread data so that execution can be distributed
across nodes.
• Along with partitioning, if we can ensure that every partition node takes a fair share,
then at least in theory, 5 nodes should be able to handle 5 times as much data and 5
times as much read and write throughput of a single partition.

• If sharding is unfair, then a single node might be taking all the load and other nodes
might sit idle. This defeats the purpose of sharding/partitioning.
Sharding Strategies
Lookup
• In this strategy, the sharding logic implements a map that routes a request for data to
the shard that contains that data using the shard key.
• In a multi-tenant application all the data for a tenant might be stored together in a
shard using the tenant ID as the shard key.
• Multiple tenants might share the same shard, but the data for a single tenant won't
be spread across multiple shards.
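A sketch of the lookup strategy in plain Python (the tenant IDs, shard names, and `write` helper are hypothetical; a real system would keep the map in a configuration service):

```python
# Lookup sharding: an explicit map routes each shard key (here, a
# tenant id) to the shard holding that tenant's data.
shard_map = {
    "tenant-a": "shard-1",
    "tenant-b": "shard-1",   # multiple tenants may share a shard
    "tenant-c": "shard-2",
}

shards = {"shard-1": {}, "shard-2": {}}

def write(tenant_id, key, value):
    # All of one tenant's data lands on a single shard.
    shard = shards[shard_map[tenant_id]]
    shard[(tenant_id, key)] = value

write("tenant-a", "order:1", {"total": 250})
write("tenant-c", "order:7", {"total": 99})
print(shard_map["tenant-a"])  # shard-1
```

The map is what gives this strategy its control, and also its overhead: every request pays for one extra lookup before it can be routed.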
Sharding Strategies
Lookup
Sharding Strategies
key-range (Range)
• Sharding/partitioning by key-range divides partitions based on certain ranges.
Sharding Strategies
key-range (Range)
• Another example: dividing the employee data of an organization based on the first
letter of the employee's name.
• So A-C, D-F, G-K, L-P, Q-Z is one way the whole organization's data can be
partitioned into 5 parts.
• Since we know the boundaries of each partition, we can directly query a particular
partition if we know the name of an employee.
• Ranges are not necessarily evenly spaced. In the above example, partition Q-Z covers
10 letters of the alphabet while partition A-C covers only three.

• Can you figure out the reason behind this division?


Sharding Strategies
key-range (Range)
• The reason behind such division is to allocate the same amount of data to different
ranges.
• Due to the fact that most people have the name starting from letters between A-C, as
compared to Q-Z, so this strategy will result in near equal distribution of data across
the partition (nevertheless this is debatable).
◦ boundaries must be chosen by the administrator to ensure equal distribution.
• HBase, BigTable, RethinkDB, and the earlier version of MongoDB exercise such a
partitioning strategy.

• One of the biggest benefits of such partitioning is the concept of range queries.
Suppose I need to find all people whose names start with letters between R and S;
then I only need to send the query to the single partition covering that range
(Q-Z in the example above).
Sharding Strategies
key-range (Range)
• For example: if an application regularly needs to find all orders placed in a given
month, this data can be retrieved more quickly if all orders for a month are stored in
date and time order in the same shard.
Sharding Strategies
Key Hash
• The purpose of this strategy is to reduce the chance of hotspots (shards that receive a
disproportionate amount of load).
• The hash value of the data’s key is used to find out the partition. It distributes the data
across the shards in a way that achieves a balance between the size of each shard and
the average load that each shard will encounter.
◦ A good hash function can distribute data uniformly across multiple partitions.
• Cassandra, MongoDB, and Voldemort are databases employing a key hash-based
strategy
Sharding Strategies
Key Hash

• Each partition is assigned a range of key hashes (rather than a range of keys), and all
keys whose hash falls within a partition's range are stored on that partition.
• Partition ranges can be chosen to be evenly spaced.
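A sketch of hash-based routing in plain Python (illustrative only; the modulo mapping is a simplification — real systems such as Cassandra use consistent hashing over hash ranges so that adding a node does not remap every key):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key):
    # Hash the key, then map the digest onto one of the shards.
    # A good hash spreads keys uniformly, reducing hotspots.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for k in ["124587", "134679", "135287"]:
    print(k, "-> shard", shard_for(k))

# Note: adjacent keys scatter across shards, which is why hash
# sharding gives up efficient range queries.
```

The same key always hashes to the same shard, so routing needs no lookup table at all — just the hash function and the shard count.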
Sharding Strategies
Key Hash
Sharding Strategies
Key Hash
• Sadly, with the hash of the key we lose a nice property of key-range
partitioning: efficiently querying data over a range.
◦ Keys are now scattered across different partitions instead of being adjacent on a
single partition.

• In cases where reads and writes are for the same key, all requests still end up on the
same partition.
• For example, on Instagram a celebrity can have millions of followers. If this celebrity
posts something on his/her account, and this post is stored using a hash-key-based
partitioning strategy (keyed on the celebrity's user ID), then there could be millions of
writes (view-count updates, comments, etc.) or reads (a read query for each follower)
coming from the millions of followers, all landing on one partition.
Sharding Strategies
Sharding
Advantages and Considerations
 Lookup:
 This offers more control over the way that shards are configured and used.
 Looking up shard locations can impose an additional overhead.
 Range:
 Easy to implement and works well with range queries because they can often fetch
multiple data items from a single shard in a single operation.
 This strategy offers easier data management. For example, if users in the same region are
in the same shard, updates can be scheduled in each time zone based on the local load
and demand pattern.
 This strategy doesn't provide optimal balancing between shards. Rebalancing shards is
difficult and might not resolve the problem of uneven load if the majority of activity is for
adjacent shard keys.
Sharding
Advantages and Considerations
 Hash:
 This strategy offers a better chance of more even data and load distribution.
 Request routing can be accomplished directly by using the hash function.
 There's no need to maintain a map.
 Note that computing the hash might impose an additional overhead.
 Rebalancing shards is difficult.
