
UNIT - I

Value of Relational Databases:

A relational database stores data in tables that are connected by a unique ID, or "key." Using this key, users can look up the data entries related to that key in another table, which helps with tasks such as inventory management and shipping. On a relational database management system (RDBMS), users issue SQL queries to retrieve the data they need.

In a relational database, each row in a table is identified by a key, and the columns hold data attributes. Each record has a value for each attribute, so users can understand the relationships between data entries for functions like product marketing, manufacturing, and more.

As an example, a shoe store processing online orders might keep two tables with related data. In the first table, each record holds the customer's name, shipping address, email, and billing information in its columns, and a key is assigned to each row. In the second table, that key is listed alongside the product ordered, quantity, size, color, and so on. The key relates the two tables to each other. When an order comes in, the key allows the warehouse to pull the correct product from the shelf and ship it to the customer.
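A minimal sketch of the shoe-store example above, using Python's built-in sqlite3 module; the table and column names here are illustrative, not taken from any particular system:

import sqlite3

# In-memory database for illustration; any RDBMS works the same way.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# First table: one row per customer, identified by a key.
cur.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT, shipping_address TEXT, email TEXT)""")

# Second table: each order carries the customer's key as a foreign key.
cur.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    product TEXT, quantity INTEGER, size TEXT, color TEXT)""")

cur.execute("INSERT INTO customers VALUES (1, 'Asha Rao', '12 Main St', 'asha@example.com')")
cur.execute("INSERT INTO orders VALUES (100, 1, 'Trail Runner', 1, '9', 'blue')")

# The shared key lets us combine the two tables to fulfil an order.
cur.execute("""SELECT c.name, c.shipping_address, o.product, o.size
               FROM orders o JOIN customers c ON o.customer_id = c.customer_id""")
print(cur.fetchall())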

Benefits of relational databases

Relational databases provide plenty of benefits for companies. Here are a few primary advantages:

 Simple and centralized database: Relational databases are simple. Moving between related tables provides a wealth of information that can be used for various purposes. ERP systems are built on relational databases, so they help users manage clients, inventory, and much more.
 Easy to use: Many companies use relational databases, and ERP, to organize and manage large amounts of data. Their continued use helps drive improvements to these systems.
 Save time and money: By using relational databases, companies can stay organized and efficient. The unique IDs help eliminate duplicate information.

Features of relational databases

Relational databases tend to be used for processing and managing transactions. They are often used in the retail, banking, and entertainment industries. Transactions in this environment have properties that can be represented by the acronym ACID, which stands for:

 Atomicity: All parts of a transaction are executed completely and successfully, or else the entire transaction fails.
 Consistency: Data remains consistent throughout the relational database. Data integrity, or the accuracy and completeness of the data at hand, is enforced in relational databases with integrity constraints (similar to rule enforcers).
 Isolation: Each transaction is independent of other transactions. Data from one record does not spill onto another, so it is secure.
 Durability: Even if the system fails, data from completed transactions is safely stored.
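A small sketch of atomicity and durability using Python's built-in sqlite3 module; the accounts table and the amounts are invented for illustration. Either both halves of the transfer are committed, or the whole transaction is rolled back:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    # Both updates belong to one transaction.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    conn.commit()          # durability: the completed transfer is persisted
except sqlite3.Error:
    conn.rollback()        # atomicity: a failure undoes every step

print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 70), (2, 80)]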

Impedance Mismatch

An impedance mismatch can occur when accessing a relational database from an object-oriented programming language. Problems arise because object-oriented languages such as C++ or Python take very different approaches to representing and accessing data.

Some of these differences include:

 Type references: Object-oriented languages make heavy use of by-reference attributes, while this is typically prohibited in relational databases. Scalar types also often differ between the database and OO languages.
 In OO languages, objects can be made up of other objects, while this is impossible in relational database languages, for integrity reasons.
 Relational databases have well-defined primitive operations for manipulating and querying data, while OO languages have lower-level operations.
 Relational databases have more robust approaches to transactions to preserve atomicity and consistency. The only way to guarantee this through an OO language is at the level of primitive-typed fields.

Example for representing data in relations

Methods to mitigate impedance mismatch include using NoSQL databases and designing relational databases with object-oriented programming languages in mind, as well as paying attention to differences between OO languages and relational databases when coding a project.
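A minimal illustration of the mismatch; the class and field names are hypothetical. A nested in-memory object graph has to be flattened into several flat rows before a relational store can hold it:

from dataclasses import dataclass, field

@dataclass
class LineItem:
    product: str
    quantity: int

@dataclass
class Order:
    order_id: int
    customer: str
    items: list = field(default_factory=list)

# One rich object graph in the application...
order = Order(1, "Asha Rao", [LineItem("Trail Runner", 1), LineItem("Socks", 3)])

# ...must be decomposed into flat rows across two tables for a relational store.
order_row = (order.order_id, order.customer)
item_rows = [(order.order_id, i.product, i.quantity) for i in order.items]
print(order_row)   # (1, 'Asha Rao')
print(item_rows)   # [(1, 'Trail Runner', 1), (1, 'Socks', 3)]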

Application and Integration Databases

An Application Database is a database that is controlled and accessed by a single application (in contrast to an IntegrationDatabase). Since only a single application accesses the database, the database can be defined specifically to make that one application's needs easy to satisfy. This leads to a more concrete schema that is usually easier to understand and often less complex than that of an IntegrationDatabase.

To share data with other applications, the controlling application may provide services. It may also provide a ReportingDatabase for a wider range of read-only use.

The great advantage of an application database is that it is easier to change, since all its use is encapsulated by a single application. Evolutionary database design and database refactoring can be used to make significant changes to an application database's design even after the database is put into production.

An application database schema is usually best designed and controlled by the application team themselves - often by having an experienced database professional as a member of the application team. This database professional needs to work very closely with the rest of the application developers to keep the database close to the needs of the rest of the application.

An integration database is a database which acts as the data store for multiple applications, and thus integrates data across these applications (in contrast to an ApplicationDatabase).

An integration database needs a schema that takes all its client applications into account. The resulting schema is either more general, more complex, or both - because it has to unify what should be separate BoundedContexts. The database is usually controlled by a separate organization from those that develop the applications, and database changes are more complex because they have to be negotiated between the database group and the various applications.

The benefit of this is that sharing data between applications does not require an extra layer of integration services on the applications. Any change to data made in one application is available to all applications at the time of database commit - thus keeping the applications' data use better synchronized.

On the whole, integration databases lead to serious problems because the database becomes a point of coupling between the applications that access it. This is usually a deep coupling that significantly increases the risk involved in changing those applications and makes it harder to evolve them. Hence, integration databases should generally be avoided.

Attack of the Clusters

At the beginning of the new millennium the technology world was hit
by the busting of the 1990s dot-com bubble. While this saw many
people questioning the economic future of the Internet, the 2000s
did see several large web properties dramatically increase in scale.

This increase in scale was happening along many dimensions. Websites started tracking activity and structure in a very detailed way. Large sets of data appeared: links, social networks, activity in logs, mapping data. With this growth in data came a growth in users - as the biggest websites grew to be vast estates regularly serving huge numbers of visitors.

Coping with the increase in data and traffic required more computing resources. To handle this kind of increase, you have two choices: up or out. Scaling up implies bigger machines, more processors, disk storage, and memory. But bigger machines get more and more expensive, not to mention that there are real limits as your size increases. The alternative is to use lots of small machines in a cluster. A cluster of small machines can use commodity hardware and ends up being cheaper at these kinds of scales. It can also be more resilient - while individual machine failures are common, the overall cluster can be built to keep going despite such failures, providing high reliability.

As large properties moved towards clusters, that revealed a new problem - relational databases are not designed to be run on clusters. Clustered relational databases, such as Oracle RAC or Microsoft SQL Server, work on the concept of a shared disk subsystem. They use a cluster-aware file system that writes to a highly available disk subsystem - but this means the cluster still has the disk subsystem as a single point of failure. Relational databases could also be run as separate servers for different sets of data, effectively sharding the database. While this separates the load, all the sharding has to be controlled by the application, which has to keep track of which database server to talk to for each bit of data. Also, we lose any querying, referential integrity, transactions, or consistency controls that cross shards. A phrase we often hear in this context from people who've done this is "unnatural acts."

These technical issues are exacerbated by licensing costs. Commercial relational databases are usually priced on a single-server assumption, so running on a cluster raised prices and led to frustrating negotiations with purchasing departments.

This mismatch between relational databases and clusters led some organizations to consider an alternative route to data storage. Two companies in particular - Google and Amazon - have been very influential. Both were at the forefront of running large clusters of this kind; furthermore, they were capturing huge amounts of data. These things gave them the motive. Both were successful and growing companies with strong technical components, which gave them the means and opportunity. It was no wonder they had murder in mind for their relational databases. As the 2000s drew on, both companies produced brief but highly influential papers about their efforts: BigTable from Google and Dynamo from Amazon.

It’s often said that Amazon and Google operate at scales far
removed from most organizations, so the solutions they needed may
not be relevant to an average organization. While it’s true that most
software projects don’t need that level of scale, it’s also true that
more and more organizations are beginning to explore what they
can do by capturing and processing more data - and to run into the
same problems. So, as more information leaked out about what
Google and Amazon had done, people began to explore making
databases along similar lines - explicitly designed to live in a world
of clusters. While the earlier menaces to relational dominance
turned out to be phantoms, the threat from clusters was serious.

The Emergence of NoSQL

A NoSQL database is a non-relational data management system that does not require a fixed schema. It avoids joins and is easy to scale. The major purpose of using a NoSQL database is for distributed data stores with humongous data storage needs. NoSQL is used for big data and real-time web apps. For example, companies like Twitter, Facebook, and Google collect terabytes of user data every single day.

NoSQL stands for "Not Only SQL" or "Not SQL." Though a better term would be "NoREL", NoSQL caught on. Carlo Strozzi introduced the NoSQL name in 1998.

Advantages of NoSQL

 Big Data capability
 No single point of failure
 Easy replication
 Provides fast performance and horizontal scalability
 Can handle structured, semi-structured, and unstructured data with equal effect
 Supports object-oriented programming that is easy to use and flexible
 NoSQL databases don't need a dedicated high-performance server
 Simpler to implement than an RDBMS
 Handles big data, managing data velocity, variety, volume, and complexity
 Excels at distributed database and multi-data-center operations
 Eliminates the need for a specific caching layer to store data

Disadvantages of NoSQL

 No standardization rules
 Limited query capabilities
 RDBMS databases and tools are comparatively mature
 Does not offer traditional database capabilities, like consistency when multiple transactions are performed simultaneously
 When the volume of data increases, it becomes difficult to maintain unique keys
 Doesn't work as well with relational data

Relationships

Aggregates are useful in that they put together data that is commonly
accessed together. But there are still lots of cases where data that’s
related is accessed differently. Consider the relationship between a
customer and all of his orders. Some applications will want to access the
order history whenever they access the customer; this fits in well with
combining the customer with his order history into a single aggregate.
Other applications, however, want to process orders individually and thus
model orders as independent aggregates. In this case, you’ll want
separate order and customer aggregates but with some kind of
relationship between them so that any work on an order can look up
customer data. The simplest way to provide such a link is to embed the ID
of the customer within the order’s aggregate data.
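As a minimal sketch of this idea (the collection layout and field names are illustrative, not from any particular product), an order aggregate simply embeds the customer's ID, and the application makes a second lookup when it needs customer details:

# In-memory stand-ins for two aggregate collections, keyed by ID.
customers = {
    "cust-17": {"name": "Asha Rao", "email": "asha@example.com"},
}
orders = {
    "order-100": {
        "customer_id": "cust-17",          # link embedded in the order aggregate
        "items": [{"product": "Trail Runner", "qty": 1}],
    },
}

order = orders["order-100"]                 # first read: the order aggregate
customer = customers[order["customer_id"]]  # second read: follow the embedded ID
print(customer["name"])                     # 'Asha Rao'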

That way, if you need data from the customer record, you read the order,
and make another call to the database to read the customer data. This will
work, and will be just fine in many scenarios—but the database will be
ignorant of the relationship in the data. This can be important because
there are times when it’s useful for the database to know about these
links.

As a result, many databases - even key-value stores - provide ways to make these relationships visible to the database. Document stores make the content of the aggregate available to the database to form indexes and queries. Some key-value stores, such as Riak, allow you to put link information in metadata, supporting partial retrieval and link-walking capability.

An important aspect of relationships between aggregates is how they handle updates. Aggregate-oriented databases treat the aggregate as the unit of data retrieval. Consequently, atomicity is only supported within the contents of a single aggregate. If you update multiple aggregates at once, you have to deal yourself with a failure partway through. Relational databases help you with this by allowing you to modify multiple records in a single transaction, providing ACID guarantees while altering many rows. All of this means that aggregate-oriented databases become more awkward as you need to operate across multiple aggregates.

This may imply that if you have data based on lots of relationships, you
should prefer a relational database over a NoSQL store. While that’s true
for aggregate-oriented databases, it’s worth remembering that relational
databases aren’t all that stellar with complex relationships either. While
you can express queries involving joins in SQL, things quickly get very
hairy—both with SQL writing and with the resulting performance—as the
number of joins mounts up.

Graph Databases

A graph database is a NoSQL-type database system based on a topographical network structure. The idea stems from graph theory in mathematics, where graphs represent data sets using nodes, edges, and properties.

 Nodes or points are instances or entities of data which represent any object to be tracked, such as people, accounts, locations, etc.
 Edges or lines are the critical concept in graph databases, representing relationships between nodes. A connection has a direction that is either unidirectional (one way) or bidirectional (two way).
 Properties represent descriptive information associated with nodes. In some cases, edges have properties as well.
For example, analyze some of the network locations of phoenixNap: nodes with descriptive properties form relationships represented by edges.

Graph databases provide a conceptual view of data more closely related to the real world. Modeling complex connections becomes easier since relationships between data points are given the same importance as the data itself.
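A minimal sketch of the nodes/edges/properties model using plain Python structures; the people, places, and relationship types are invented for illustration:

# Nodes: entities with descriptive properties.
nodes = {
    "p1": {"label": "Person", "name": "Asha"},
    "p2": {"label": "Person", "name": "Ravi"},
    "c1": {"label": "City",   "name": "Hyderabad"},
}

# Edges: directed relationships between nodes; edges can carry properties too.
edges = [
    ("p1", "KNOWS",    "p2", {"since": 2019}),
    ("p1", "LIVES_IN", "c1", {}),
]

# Traversing relationships is simply following edges from a node.
for src, rel, dst, props in edges:
    if src == "p1":
        print(nodes["p1"]["name"], rel, nodes[dst]["name"], props)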

Graph Database vs. Relational Database

Graph databases are not meant to replace relational databases. As of now, relational databases are the industry standard. The most important thing is to know what each database type has to offer.

Relational databases provide a structured approach to data, whereas graph databases are agile and focus on quick insight into data relationships. Both graph and relational databases have their domain. Use cases with complex relationships leverage the power of graph databases, outperforming traditional relational databases. Relational databases such as MySQL or PostgreSQL require careful planning when creating database models, whereas graphs have a much more naturalistic and fluid approach to data.

The following table outlines the critical differences between graph and
relational databases:

Type            | Graph                                       | Relational
Format          | Nodes and edges with properties             | Tables with rows and columns
Relationships   | Represented with edges between nodes        | Created using foreign keys between tables
Flexibility     | Flexible                                    | Rigid
Complex queries | Quick and responsive                        | Requires complex joins
Use-case        | Systems with highly connected relationships | Transaction-focused systems with more straightforward relationships

Functionality of Graph Databases:

Graph databases work by treating data and relationships between data equally. Related nodes are physically connected, and the physical connection is itself treated as a piece of data.

Modeling data in this way allows querying relationships in the same manner as querying the data itself. Instead of calculating and querying the connection steps, graph databases read the relationship from storage directly.

Graph databases are more closely related to other NoSQL data modeling techniques in terms of performance and flexibility. Like other NoSQL databases, graphs do not have schemas, which makes the model flexible and easy to alter along the way.

Graph Database Use Case Examples

There are many notable examples where graph databases outperform other database modeling techniques, some of which include:

 Real-Time Recommendation Engines. Real-time product and ecommerce recommendations provide a better user experience while maximizing profitability. Notable cases include Netflix, eBay, and Walmart.
 Master Data Management. Linking all company data to one
location for a single point of reference provides data consistency
and accuracy. Master data management is crucial for large-scale
global companies.
 GDPR(General Data Protection Regulation) and regulation
compliances. Graphs make tracking of data movement and
security easier to manage. The databases reduce the potential
of data breaches and provide better consistency when removing
data, improving the overall trust with sensitive information.
 Digital asset management. The amount of digital content is
massive and constantly increasing. Graph databases provide a
scalable and straightforward database model to keep track of digital
assets, such as documents, evaluations, contracts, etc.
 Context-aware services. Graphs help provide services related to
actual-world characteristics. Whether it is natural disaster warnings,
traffic updates, or product recommendations for a given location,
graph databases offer a logical solution to real-life circumstances.
 Fraud detection. Finding suspicious patterns and uncovering
fraudulent payment transactions is done in real-time using graph
databases. Targeting and isolating parts of graphs provide quicker
detection of deceptive behavior.
 Semantic search. Natural language processing is ambiguous.
Semantic searches help provide meaning behind keywords for more
relevant results, which is easier to map using graph databases.
 Network management. Networks are linked graphs in their
essence. Graphs reduce the time needed to alert a network
administrator about problems in a network.
 Routing. Information travels through a network by finding optimal paths, which makes graph databases a natural choice for routing.

Graph Databases:

Graph databases became more popular with the rise of big data and social
media analytics. Many multi-model databases support graph modeling.
However, there are numerous graph native databases available as well.

JanusGraph
JanusGraph is a distributed, open-source and scalable graph database
system with a wide range of integration options catered to big data
analytics. Some of the main features of JanusGraph include:

 Support for ACID transactions, with the ability to handle thousands of concurrent users.
 Multiple options for storing the graph data, such as Cassandra or HBase.
 Complex search available by default, as well as optional support for Elasticsearch.
 Full integration with Apache Spark for advanced data analytics.
 JanusGraph uses the graph traversal query language Gremlin, which is Turing complete.

Neo4j

Neo4j (Network Exploration and Optimization 4 Java) is a graph database written in Java with native graph storage and processing. The main features of Neo4j are:

 The database is scalable through data partitioning into pieces known as shards.
 High availability is provided through continuous backups and rolling upgrades.
 Multiple database instances can be separated while remaining on one dedicated server, providing a high level of security.
 Neo4j uses the Cypher graph query language, which is programmer friendly.
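A brief sketch of issuing Cypher queries from Python, assuming the official neo4j driver package is installed and an instance is reachable at the address shown; the connection details, node labels, and names are all placeholders:

from neo4j import GraphDatabase

# Placeholder connection details; adjust for your own instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two nodes and a relationship, then read them back with Cypher.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Asha", b="Ravi",
    )
    result = session.run(
        "MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name AS src, b.name AS dst"
    )
    for record in result:
        print(record["src"], "KNOWS", record["dst"])

driver.close()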

DGraph

DGraph (Distributed Graph) is an open-source distributed graph database system designed with scalability in mind. Some exciting features of DGraph include:
 Horizontal scalability for running in production with ACID
transactions.
 DGraph is an open-source system with support for many open
standards.
 The query language is GraphQL, which is designed for APIs.

DataStax Enterprise Graph

The DataStax Enterprise Graph is a distributed graph database based on Cassandra and optimized for enterprises. Features include:

 DataStax provides continuous availability for enterprise needs.
 The database integrates seamlessly with offline Apache Spark.
 Real-time search and analytics are fully integrated.
 Scalability is available through multiple data centers.
 It supports Gremlin as well as CQL for querying.

Graph Database Advantages and Disadvantages

Every database type comes with strengths and weaknesses. The most
important aspect is to know the differences as well as available options for
specific problems. Graph databases are a growing technology with
different objectives than other database types.

Advantages

Some advantages of graph databases include:

 The structures are flexible.
 The representation of relationships between entities is explicit.
 Queries output real-time results. The speed depends on the number of relationships.

Disadvantages

The general disadvantages of graph databases are:


 There is no standardized query language. The language depends on
the platform used.
 Graphs are inappropriate for transactional-based systems.
 The user-base is small, making it hard to find support when running
into a problem.

Conclusion

Graph databases are an excellent approach for analyzing complex relationships between data entities. The fast query time with real-time results caters to the fast-paced data research of today. Graphs are a developing technology with more improvements to come.

Schemaless Databases

A common theme across all the forms of NoSQL databases is that they are
schemaless. When you want to store data in a relational database, you
first have to define a schema—a defined structure for the database which
says what tables exist, which columns exist, and what data types each
column can hold.

Before you store some data, you have to have the schema defined for it.
With NoSQL databases, storing data is much more casual.

 A key-value store allows you to store any data you like under a key.
 A document database effectively does the same thing, since it
makes no restrictions on the structure of the documents you store.
 Column-family databases allow you to store any data under any
column you like.
 Graph databases allow you to freely add new edges and freely add
properties to nodes and edges as you wish.
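A small sketch of what this casual storage can look like; the records and keys are invented. Two documents stored under different keys carry completely different fields, and the store does not object:

# An in-memory stand-in for a schemaless key-value/document store.
store = {}

# Records under the same "collection" need not share any structure.
store["customer:1"] = {"name": "Asha", "email": "asha@example.com"}
store["customer:2"] = {"name": "Ravi", "loyalty_points": 420, "tags": ["vip"]}

for key, doc in store.items():
    print(key, "->", doc)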

Advocates of schemalessness rejoice in this freedom and flexibility. With a schema, you have to figure out in advance what you need to store, but that can be hard to do. Without a schema binding you, you can easily store whatever you need. This allows you to easily change your data storage as you learn more about your project. You can easily add new things as you discover them.

Furthermore, if you find you don’t need some things anymore, you can
just stop storing them, without worrying about losing old data as you
would if you delete columns in a relational schema. As well as handling
changes, a schemaless store also makes it easier to deal with nonuniform
data: data where each record has a different set of fields.

A schema puts all rows of a table into a straightjacket, which becomes awkward if you have different kinds of data in different rows. You either end up with lots of columns that are usually null (a sparse table), or you end up with meaningless columns like custom column 4. Schemalessness avoids this, allowing each record to contain just what it needs - no more, no less.

Schemalessness is appealing, and it certainly avoids many problems that exist with fixed-schema databases, but it brings some problems of its own. If all you are doing is storing some data and displaying it in a report as a simple list of fieldName: value lines, then a schema is only going to get in the way. But usually we do more with our data than this, and we do it with programs that need to know that the billing address is called billingAddress and not addressForBilling, and that the quantity field is going to be the integer 5 and not the string five. The vital, if sometimes inconvenient, fact is that whenever we write a program that accesses data, that program almost always relies on some form of implicit schema. Unless it just says something like

for record in records:
    for field in record.fields:
        print(field.name, field.value)

it will assume that certain field names are present and carry data with a certain meaning, and assume something about the type of data stored within that field. Programs are not humans; they cannot read "qty" and infer that it must be the same as "quantity" - at least not unless we specifically program them to do so.

So, however schemaless our database is, there is usually an implicit schema present. This implicit schema is a set of assumptions about the
data’s structure in the code that manipulates the data. Having the implicit
schema in the application code results in some problems. It means that in
order to understand what data is present you have to dig into the
application code. If that code is well structured you should be able to find
a clear place from which to deduce the schema. But there are no
guarantees; it all depends on how clear the application code is.
Furthermore, the database remains ignorant of the schema—it can’t use
the schema to help it decide how to store and retrieve data efficiently.

Furthermore, the flexibility that schemalessness gives you only applies within an aggregate - if you need to change your aggregate boundaries, the migration is every bit as complex as it is in the relational case.

Materialized View

A materialized view simplifies complex data by saving query information - you don't have to create a new query every time you need to access the information.

The main thing that sets a materialized view apart is that it is a copy of
query data that does not run in real-time. It takes a little more space, but
it also retrieves data very quickly. You can set materialized views to get
refreshed on a schedule so that the updated information won’t fall
through the cracks.
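PostgreSQL and Oracle support materialized views natively (CREATE MATERIALIZED VIEW / REFRESH MATERIALIZED VIEW). As a self-contained sketch, the snippet below emulates the idea with SQLite, which has no native materialized views: an aggregate query is precomputed into a summary table and rebuilt only when refreshed. Table and column names are made up for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("asha", 120), ("ravi", 80), ("asha", 40)])

def refresh_order_summary():
    # Recompute the stored query result; until the next refresh,
    # readers see this saved copy instead of re-running the aggregate.
    conn.execute("DROP TABLE IF EXISTS order_summary")
    conn.execute("""CREATE TABLE order_summary AS
                    SELECT customer, SUM(total) AS total_spent
                    FROM orders GROUP BY customer""")

refresh_order_summary()
print(conn.execute("SELECT * FROM order_summary").fetchall())
# e.g. [('asha', 160), ('ravi', 80)] - served from the precomputed copy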
Materialized View vs View

Both a view and a materialized view can be very useful for simplifying and
optimizing data. You can join data from multiple tables and compile the
information into one simple table.

To better understand what benefits a materialized view brings, let's compare it to a regular view.

A view is a virtual table that collects data you have previously gathered from other relevant queries. Anytime you access the view, it recompiles the data to provide you with the most up-to-date information according to your query. A regular view is great because it doesn't take much space, but it sacrifices speed and performance.

A materialized view is much more efficient at executing queries. The data is physically saved at a specific point in time. You don't need to re-read all the data associated with a query every single time.

The drawback is that you have to make sure to view the most recent data.
To reduce the risk of viewing obsolete data, you can refresh it manually,
or set it to refresh on schedule or by triggers.

Materialized views are essential in cutting costs for developers. The results obtained in a materialized view are kept in memory. They are only updated when needed, not constantly. So, you improve performance by precomputing some of the most expensive operations. Additionally, the speed can increase greatly when it comes to querying large databases.

Unfortunately, the use of materialized views may not suit every situation.
First, not every database supports materialized views (Jump to What is a
materialized view for information on environments that do support them).

There are other issues too. Materialized views are read-only. This means that you can't update tables from a materialized view like you can with a regular view. Also, even though materialized views are pretty secure, there are still security risks since some security features are missing. For example, you can't create security keys or constraints with a materialized view.

Materialized View: Tips for Using

You should keep in mind some features to ensure getting the most from a
materialized view:
Make sure that you are working with the materialized view that reflects
query patterns against the base table. You don’t want to create a
materialized view for every single iteration of a query. That would defeat
the purpose. Create a materialized view that will focus on a broad set of
queries.

NoSQL Data Modeling Techniques

All NoSQL data modeling techniques are grouped into three major groups:

 Conceptual techniques
 General modeling techniques
 Hierarchy modeling techniques

Below, we will briefly discuss all NoSQL data modeling techniques.

Conceptual Techniques

There are three conceptual techniques for NoSQL data modeling:

 Denormalization. Denormalization is a pretty common technique and entails copying the data into multiple tables or forms in order to simplify access to it. With denormalization, you can easily group all the data that needs to be queried in one place. Of course, this does mean that the same data is copied for different parameters, which increases the data volume considerably.
 Aggregates. This allows users to form nested entities with complex internal structures, as well as vary their particular structure. Ultimately, aggregation reduces joins by minimizing one-to-one relationships. Most NoSQL data models have some form of this soft schema technique. For example, graph and key-value store databases have values that can be of any format, since those data models do not place constraints on value. Similarly, BigTable has aggregation through columns and column families.
 Application-Side Joins. NoSQL doesn't usually support joins, since NoSQL databases are query-oriented, so joins are handled at design time, in application code, compared to relational databases, where joins are performed at query execution time. Of course, this tends to result in a performance penalty and is sometimes unavoidable. A sketch of an application-side join follows this list.
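A minimal sketch of an application-side join; the collection names and fields are illustrative. The application reads from two stores and combines the results itself, rather than asking the database to join:

# Two "collections" that the database will not join for us.
users = {"u1": {"name": "Asha"}, "u2": {"name": "Ravi"}}
orders = [
    {"order_id": "o1", "user_id": "u1", "total": 120},
    {"order_id": "o2", "user_id": "u2", "total": 80},
]

# The join happens in application code at read time.
joined = [
    {"order_id": o["order_id"], "user": users[o["user_id"]]["name"], "total": o["total"]}
    for o in orders
]
print(joined)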

General Modeling Techniques

There are several general techniques for NoSQL data modeling:

 Enumerable Keys. For the most part, unordered key values are very
useful, since entries can be partitioned over several dedicated servers by
just hashing the key. Even so, adding some form of sorting functionality
through ordered keys is useful, even though it may add a bit more
complexity and a performance hit.
 Dimensionality Reduction. Geographic information systems tend to use R-Tree indexes, which need to be updated in place; this can be expensive when dealing with large data volumes. Another traditional approach is to flatten the 2D structure into a plain list, such as what is done with Geohash. With dimensionality reduction, you can map multidimensional data to a simple Key-Value model or to another non-multidimensional model.
 Index Table. With an index table, you take advantage of indexes in stores that don't necessarily support them internally. The aim is to create and then maintain a unique table with keys that follow a specific access pattern, for example, a master table to store user accounts for access by user ID.
 Composite Key Index. While somewhat of a generic technique,
composite keys are incredibly useful when ordered keys are used. If you
take it and combine it with secondary keys, you can create a
multidimensional index that is pretty similar to the above-mentioned
Dimensionality Reduction technique.

Hierarchy Modeling Techniques

There are different hierarchy modeling techniques for NoSQL data:

 Tree Aggregation. Tree aggregation is essentially modeling data as a single document. This can be really efficient when it comes to any record that is always accessed at once, such as a Twitter thread or Reddit post. Of course, the problem then becomes that random access to any individual entry is inefficient.
 Adjacency Lists. This is a straightforward technique where nodes are
modeled as independent records of arrays with direct ancestors. That’s a
complicated way of saying that it allows you to search nodes by their
parents or children. Much like tree aggregation though, it is also quite
inefficient for retrieving an entire subtree for any given node.
 Materialized Paths. This technique is a sort of denormalization and is
used to avoid recursive traversals in tree structures. Mainly, we want to
attribute the parents or children to each node, which helps us determine
any predecessors or descendants of the node without worrying about
traversal. Incidentally, we can store materialized paths as IDs, either as a
set or a single string.
 Nested Sets. A standard technique for tree-like structures in relational
databases, it’s just as applicable to NoSQL and key-value or document
databases. The goal is to store the tree leaves as an array and then map
each non-leaf node to a range of leaves using start/end indexes.
Modeling it in this way is an efficient way to deal with immutable data as it
only requires a small amount of memory, and doesn’t necessarily have to
use traversals. That being said, updates are expensive because they
require updates of indexes.
 Nested Documents Flattening: Numbered Field Names. Most search
engines tend to work with documents that are a flat list of fields and
values, rather than something with a complex internal structure. As such,
this data modeling technique tries to map these complex structures to a
plain document, for example, mapping documents with a hierarchical
structure, a common difficulty you might encounter.
Of course, this type of work is pain-staking and not easily scalable,
especially as the nested structures increase.

Conclusion

NoSQL data modeling techniques are very useful, especially since a lot of
programmers aren’t necessarily familiar with the flexibility of NoSQL. The
specifics vary since NoSQL isn’t so much a singular language like SQL, but
rather a set of philosophies for database management. As such, data
modeling techniques, and how they are applied, vary wildly from database
to database.

Don't let that put you off, though; learning NoSQL data modeling techniques is very helpful, especially when it comes to designing a schema for a DBMS that doesn't actually require one. More importantly, learn to take advantage of NoSQL's flexibility. You don't have to worry as much about the minutiae of schema design as you would with SQL.
UNIT - II

What is database sharding?

Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. This allows larger datasets to be split into smaller chunks and stored in multiple data nodes, increasing the total storage capacity of the system.

Similarly, by distributing the data across multiple machines, a sharded database can handle more requests than a single machine can.

Sharding is a form of scaling known as horizontal scaling or scale-out, as additional nodes are brought on to share the load. Horizontal scaling allows for near-limitless scalability to handle big data and intense workloads. In contrast, vertical scaling refers to increasing the power of a single machine or single server through a more powerful CPU, increased RAM, or increased storage capacity.

Horizontal sharding

Need for database sharding

Database sharding, as with any distributed architecture, does not come for free. There is overhead and complexity in setting up shards, maintaining the data on each shard, and properly routing requests across those shards. Before you begin sharding, consider if one of the following alternative solutions will work for you.

Vertical scaling

By simply upgrading your machine, you can scale vertically without the
complexity of sharding. Adding RAM, upgrading your computer (CPU), or
increasing the storage available to your database are simple solutions
that do not require you to change the design of either your database
architecture or your application.

Adding RAM

Specialized services or databases

Depending on your use case, it may make more sense to simply shift a
subset of the burden onto other providers or even a separate database.
For example, blob or file storage can be moved directly to a cloud
provider such as Amazon S3. Analytics or full-text search can be handled
by specialized services or a data warehouse. Offloading this particular
functionality can make more sense than trying to shard your entire
database.

Replication

If your data workload is primarily read-focused, replication increases availability and read performance while avoiding some of the complexity of database sharding. By simply spinning up additional copies of the database, read performance can be increased either through load balancing or through geo-located query routing. However, replication introduces complexity on write-focused workloads, as each write must be copied to every replicated node.

Replication

On the other hand, if your core application database contains large amounts of data, requires high read and high write volume, and/or you have specific availability requirements, a sharded database may be the way forward. Let's look at the advantages and disadvantages of sharding.

Advantages of sharding

Sharding allows you to scale your database to handle increased load to a nearly unlimited degree by providing increased read/write throughput, storage capacity, and high availability. Let's look at each of those in a little more detail.

 Increased read/write throughput - By distributing the dataset across multiple shards, both read and write operation capacity is increased, as long as read and write operations are confined to a single shard.
 Increased storage capacity - Similarly, by increasing the number of shards, you can also increase overall total storage capacity, allowing near-infinite scalability.
 High availability - Finally, shards provide high availability in two ways. First, since each shard is a replica set, every piece of data is replicated. Second, even if an entire shard becomes unavailable, the database as a whole still remains partially functional because the data is distributed, with parts of the schema on different shards.

Disadvantages of sharding
Sharding does come with several drawbacks, namely overhead in query
result compilation, complexity of administration, and increased
infrastructure costs.

 Query overhead - Each sharded database must have a separate machine or service which understands how to route a querying operation to the appropriate shard. This introduces additional latency on every operation. Furthermore, if the data required for the query is horizontally partitioned across multiple shards, the router must then query each shard and merge the results together. This can make an otherwise simple operation quite expensive and slow down response times.
 Complexity of administration — With a single unsharded database,
only the database server itself requires upkeep and maintenance. With
every sharded database, on top of managing the shards themselves, there
are additional service nodes to maintain. Plus, in cases where replication
is being used, any data updates must be mirrored across each replicated
node. Overall, a sharded database is a more complex system which
requires more administration.
 Increased infrastructure costs — Sharding by its nature requires
additional machines and compute power over a single database server.
While this allows your database to grow beyond the limits of a single
machine, each additional shard comes with higher costs. The cost of a
distributed database system, especially if it is missing the proper
optimization, can be significant.

Having considered the pros and cons, let’s move forward and discuss
implementation.

How does sharding work?

In order to shard a database, we must answer several fundamental questions. The answers will determine your implementation.

First, how will the data be distributed across shards? This is the fundamental question behind any sharded database. The answer to this question will have effects on both performance and maintenance.

Second, what types of queries will be routed across shards? If the workload is primarily read operations, replicating data will be highly effective at increasing performance, and you may not need sharding at all. In contrast, a mixed read-write workload or even a primarily write-based workload will require a different architecture.

Finally, how will these shards be maintained? Once you have sharded a database, over time, data will need to be redistributed among the various shards, and new shards may need to be created. Depending on the distribution of data, this can be an expensive process and should be considered ahead of time.

With these questions in mind, let’s consider some sharding architectures.

Sharding architectures and types

While there are many different sharding methods, we will consider four
main kinds: ranged/dynamic sharding, algorithmic/hashed sharding,
entity/relationship-based sharding, and geography-based sharding.

Ranged/dynamic sharding

Ranged sharding, or dynamic sharding, takes a field on the record as an input and, based on a predefined range, allocates that record to the appropriate shard. Ranged sharding requires there to be a lookup table or service available for all queries or writes. For example, consider a set of data with IDs that range from 0-50. A simple lookup table might look like the following:

Range    | Shard ID
[0, 20)  | A
[20, 40) | B
[40, 50] | C

The field on which the range is based is also known as the shard key.
Naturally, the choice of shard key, as well as the ranges, are critical in
making range-based sharding effective. A poor choice of shard key will
lead to unbalanced shards, which leads to decreased performance. An
effective shard key will allow for queries to be targeted to a minimum
number of shards. In our example above, if we query for all records with
IDs 10-30, then only shards A and B will need to be queried.
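A minimal sketch of this lookup-table routing; the ranges match the example table above, and the function and variable names are made up:

# Ranges are half-open except the last bucket, matching the table above.
SHARD_RANGES = [
    ((0, 20), "A"),
    ((20, 40), "B"),
    ((40, 51), "C"),   # covers [40, 50] inclusive
]

def shard_for(record_id):
    # Route a record to its shard based on the shard key (the ID).
    for (lo, hi), shard in SHARD_RANGES:
        if lo <= record_id < hi:
            return shard
    raise ValueError("no shard covers this key")

print(shard_for(7), shard_for(25), shard_for(50))   # A B C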

Two key attributes of an effective shard key are high cardinality and
well-distributed frequency. Cardinality refers to the number of possible
values of that key. If a shard key only has three possible values, then
there can only be a maximum of three shards. Frequency refers to the
distribution of the data along the possible values. If 95% of records occur
with a single shard key value then, due to this hotspot, 95% of the records
will be allocated to a single shard. Consider both of these attributes when
selecting a shard key.

Range-based sharding is an easy-to-understand method of horizontal partitioning, but its effectiveness will depend heavily on the availability of a suitable shard key and the selection of appropriate ranges. Additionally, the lookup service can become a bottleneck, although the amount of data it holds is small enough that this typically is not an issue.

Algorithmic/hashed sharding

Algorithmic sharding, or hashed sharding, takes a record as an input and applies a hash function or algorithm to it which generates an output or hash value. This output is then used to allocate each record to the appropriate shard.

The function can take any subset of values on the record as inputs. Perhaps the simplest example of a hash function is to use the modulus operator with the number of shards, as follows:

Hash Value=ID % Number of Shards
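A minimal sketch of modulus-based hashed sharding; the shard count and names are illustrative:

NUM_SHARDS = 4

def hashed_shard_for(record_id):
    # The "hash function" here is simply the modulus of the numeric ID.
    return record_id % NUM_SHARDS

# Records spread across shards without any lookup table.
for record_id in [3, 10, 17, 24]:
    print(record_id, "->", "shard", hashed_shard_for(record_id))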

This is similar to range-based sharding - a set of fields determines the allocation of the record to a given shard. Hashing the inputs allows more even distribution across shards even when there is not a suitable shard key, and no lookup table needs to be maintained. However, there are a few drawbacks.

First, query operations for multiple records are more likely to get
distributed across multiple shards. Whereas ranged sharding reflects the
natural structure of the data across shards, hashed sharding typically
disregards the meaning of the data. This is reflected in increased
broadcast operation occurrence.

Second, resharding can be expensive. Any update to the number of shards likely requires rebalancing all shards and moving records around. It will be difficult to do this while avoiding a system outage.

Entity-/relationship-based sharding

Entity-based sharding keeps related data together on a single physical shard. In a relational database (such as PostgreSQL, MySQL, or SQL Server), related data is often spread across several different tables.

For instance, consider the case of a shopping database with users and
payment methods. Each user has a set of payment methods that is tied
tightly with that user. As such, keeping related data together on the same
shard can reduce the need for broadcast operations, increasing
performance.

Geography-based sharding

Geography-based sharding, or geosharding, also keeps related data together on a single shard, but in this case, the data is related by geography. This is essentially ranged sharding where the shard key contains geographic information and the shards themselves are geo-located.

For example, consider a dataset where each record contains a "country" field. In this case, we can both increase overall performance and decrease system latency by creating a shard for each country or region and storing the appropriate data on that shard. This is a simple example, and there are many other ways to allocate your geoshards which are beyond the scope of our context.

What is master-slave Replication?


Master-slave replication is a way to optimize the I/O in your application other than using caching. The master database serves as the keeper of the information: the true data is kept at the master database, so writing only occurs there. Reading, on the other hand, is only done on the slaves. What is this for? This architecture serves the purpose of safeguarding site reliability. If a site receives a lot of traffic and the only available database is one master, it will be overloaded with reading and writing requests, making the entire system slow for everyone on the site.
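A minimal sketch of the idea; the class, method names, and connection handles are invented. The application sends writes to the master connection and spreads reads across slave connections:

import random

class MasterSlaveRouter:
    """Route writes to the master and reads to one of the slaves."""

    def __init__(self, master, slaves):
        self.master = master          # connection/handle for the master database
        self.slaves = slaves          # connections/handles for read-only replicas

    def write(self, statement):
        # All writes go to the single source of truth.
        return self.master.execute(statement)

    def read(self, query):
        # Reads are spread across replicas to take load off the master.
        replica = random.choice(self.slaves)
        return replica.execute(query)

# Usage (placeholders): router = MasterSlaveRouter(master_conn, [replica1, replica2])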

Visualization of an implementation

An example of the master-slave concept using Postgresql and MongoDB

In the example diagram, we used PostgreSQL as the master. Postgres is a relational database. Relational databases are structured and easy to maintain. For the slave, we used MongoDB, because MongoDB is a non-relational database, or NoSQL. Having one type of database for both master and slave can be convenient because, in the end, maintaining the codebase will be a lot easier.

The process for handling the data transfer/synchronization from the master to the slave databases is called replication. To replicate data you can use a serverless function as a pipeline to distribute data to the slaves. Making your own solution to replicate databases can be tedious, so it is recommended to use a database replication tool.

Opinions

In a way, the master-slave concept can be interpreted as a method to cache data from the master to all the slave databases, so adding a simple Redis implementation to your system can technically work as a master-slave system.

Master-slave database architecture

Implementing caching wouldn't really help the database much either. Caching only results in select data being available in your cache. If the requested data is not available there, then the read will still happen on the master. Implementing caching is great for optimizing your site, but the master-slave concept works better if we have a separate data source, identical to the master, that can be used for reads. Here is how caching would look; note that this is not an example of how to do master-slave:

An example of caching, this is not an implementation of master-slave


The example above shows how caching and master-slave differ. To do master-slave WITH caching, I would probably do it like this:

An example of a master-slave database with caching

Master slave implementation


Pros of Using Master-slave NoSQL Data Replication

The Master-slave approach for replicating your NoSQL Databases has the
following advantages:

 The Master-slave approach is extremely fast and it doesn't operate under any performance or storage restrictions. Moreover, since read and update tasks are divided among master and slave copies, you can perform both operations in quick succession without facing any time delay.
 You can use the Master-slave NoSQL Data Replication technique to split the data read and write requests and allocate them to different servers. This will further improve your data processing speed and efficiency.

Cons of Using Master-slave NoSQL Data Replication

The Master-slave NoSQL Data Replication technique has the following limitations:

 This technique lacks reliability, as it operates asynchronously. This implies that, in case the master copy fails, certain committed transactions will go missing and no slave copy will contain that information.
 The Master-slave technique does not support high scaling of write requests. If you wish to scale such requests, you will require additional computational capacity on the master node.
Peer-to-Peer NoSQL Data Replication

Peer-to-Peer NoSQL Data Replication works on the concept that every database copy is responsible for updating its own data. This can only work when every copy contains an identical schema format and stores the same type of data. Furthermore, database restoration is a key requirement of this data replication technique.

Pros of Using Peer-to-Peer NoSQL Data Replication

 Since the catalog queries are stored across multiple nodes, the
performance of Peer-to-Peer NoSQL Data Replication remains
constant even if your data load increases.

 If a node fails, the application layer can redirect that node's read requests to other adjacent nodes and maintain a lossless processing environment and data availability.

 The Peer-to-Peer technique for replication makes node maintenance easy, as it allows you to take individual nodes offline for upgrade or maintenance without hampering the overall system performance.

Cons of Using Peer-to-Peer NoSQL Data Replication

The Peer-to-Peer NoSQL Data Replication technique comes along with the
following drawbacks:

 If you modify a particular row at more than one database node, it can cause data loss by triggering a conflict.

 Replicating changes is costly in terms of latency in Peer-to-Peer replication. Furthermore, if an application requires real-time data relocation, then you need to perform the challenging task of dynamically load balancing across different nodes.

Combining Sharding and Replication

 Replication: The primary server node copies data onto secondary server nodes. This can help increase data availability and act as a backup in case the primary server fails.
 Sharding: Handles horizontal scaling across servers using a shard key. This means that rather than copying data holistically, sharding copies pieces of the data (or "shards") across multiple replica sets. These replica sets work together to utilize all of the data.

Combining sharding and replication:

 Replication and sharding are ideas that can be combined. Using master-slave replication and sharding means that there can be multiple masters, but each data point has only a single master. When you combine peer-to-peer replication with sharding, each shard can have any number of peers, and, in case of failures, the data is rebuilt on other nodes.

Consistency:

One of the biggest changes from a centralized relational database to a cluster-oriented NoSQL database is in how you think about consistency. Relational databases try to exhibit strong consistency by avoiding all the various inconsistencies. Once you start looking at the NoSQL world, phrases such as "CAP theorem" and "eventual consistency" appear.

Conflicts with lack of Consistency

⚫ Read-read (or simply read) conflict: different people see different data at the same time, so some of them see stale (out-of-date) data. Replication is a common source of this read inconsistency.
⚫ Write-write conflict: two people update the same data item at the same time. If the server serializes them, one update is applied and immediately overwritten by the other (a lost update).
⚫ Read-write conflict: a read happens in the middle of two logically-related writes.

Solutions:

⚫ Pessimistic approach: prevent conflicts from occurring, usually implemented with write locks managed by the system.
⚫ Optimistic approach: let conflicts occur, but detect them and take action to sort them out. Approaches for write-write conflicts include conditional updates (test the value just before updating) and saving both updates (record that they are in conflict and then merge them).
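A minimal sketch of the pessimistic approach, assuming a single shared value guarded by a write lock; the balance and amounts are invented for illustration:

import threading

balance = 100
write_lock = threading.Lock()

def withdraw(amount):
    global balance
    with write_lock:              # only one writer may proceed at a time,
        if balance >= amount:     # so a write-write conflict cannot occur
            balance -= amount
            return True
        return False

threads = [threading.Thread(target=withdraw, args=(60,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(balance)                    # 40: only one of the two withdrawals succeeds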

CAP theorem

The CAP theorem is about how distributed database systems behave in the face of network instability.

When working with distributed systems over unreliable networks, we need to consider the properties of consistency and availability in order to make the best decision about what to do when systems fail. The CAP theorem, introduced by Eric Brewer in 2000, states that any distributed database system can have at most two of the following three desirable properties:

 Consistency. Consistency is about having a single, up-to-date, readable version of our data available to all clients. Our data should be consistent - no matter how many clients are reading the same items from replicated and distributed partitions, we should get consistent results. All writes are atomic and all subsequent requests retrieve the new value.
 High availability. This property states that the distributed database will always allow database clients to make operations like select or update on items without delay. Internal communication failures between replicated data shouldn't prevent operations on it. The database will always return a value as long as a single server is running.
 Partition tolerance. This is the ability of the system to keep responding to client requests even if there's a communication failure between database partitions. The system will still function even if network communication between partitions is temporarily lost.

The CAP theorem categorizes systems into three categories:

 CP (Consistent and Partition Tolerant) database: A CP
database delivers consistency and partition tolerance at the
expense of availability. When a partition occurs between any two
nodes, the system has to shut down the non-consistent node (i.e.,
make it unavailable) until the partition is resolved.
A partition refers to a communication break between nodes within a
distributed system: if a node cannot receive any messages from
another node in the system, there is a partition between the two
nodes. A partition can be caused by a network failure, a server
crash, or some other reason.
 AP (Available and Partition Tolerant) database: An AP
database delivers availability and partition tolerance at the expense
of consistency. When a partition occurs, all nodes remain available
but those at the wrong end of a partition might return an older
version of data than others. When the partition is resolved, the AP
databases typically resync the nodes to repair all inconsistencies in
the system.
 CA (Consistent and Available) database: A CA database delivers
consistency and availability in the absence of any network partition.
Single-node database servers are often categorized as CA systems,
since they do not need to deal with partition tolerance.

In any networked shared-data or distributed system, partition
tolerance is a must. Network partitions and dropped messages are a
fact of life and must be handled appropriately. Consequently, system
designers must choose between consistency and availability.

The following diagram shows the classification of different databases
based on the CAP theorem.
CLASSIFICATION ON CAP THEOREM
 System designers must take the CAP theorem into consideration when
designing or choosing distributed storage, since one of consistency
and availability must be sacrificed whenever a partition occurs.

Version Stamps

 Version stamps help you detect concurrency conflicts. When you
read data, then update it, you can check the version stamp to
ensure nobody updated the data between your read and write.
 Version stamps can be implemented using counters, GUIDs, content
hashes, timestamps, or a combination of these.
 With distributed systems, a vector of version stamps allows you to
detect when different nodes have conflicting updates, as in the
sketch below.
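
Below is a minimal sketch of a vector of version stamps (a vector clock). The node names are hypothetical and the comparison logic is simplified, just enough to show how conflicting updates from different nodes can be detected.

def bump(vector, node):
    """Return a new version vector with this node's counter incremented."""
    updated = dict(vector)
    updated[node] = updated.get(node, 0) + 1
    return updated

def compare(a, b):
    """Compare two version vectors: 'before', 'after', 'equal', or 'conflict'."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"      # concurrent updates on different nodes
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"

v1 = bump({}, "node-a")   # {'node-a': 1}
v2 = bump(v1, "node-b")   # node-b updated after seeing node-a's change
v3 = bump(v1, "node-c")   # node-c updated concurrently
print(compare(v2, v3))    # 'conflict' -> the two updates must be merged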

Why MapReduce?

Traditional enterprise systems normally have a centralized server to store
and process data. The following illustration depicts a schematic view of a
traditional enterprise system. This traditional model is not suitable for
processing huge volumes of data, which cannot be accommodated by
standard database servers. Moreover, the centralized system creates too
much of a bottleneck while processing multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce.
MapReduce divides a task into small parts and assigns them to many
computers. Later, the results are collected at one place and integrated to
form the result dataset.

How MapReduce Works?

The MapReduce algorithm contains two important tasks, namely Map and
Reduce.

 The Map task takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key-
value pairs).
 The Reduce task takes the output from the Map as an input and
combines those data tuples (key-value pairs) into a smaller set of
tuples.

The reduce task is always performed after the map job.

Let us now take a close look at each of the phases and try to understand
their significance.

 Input Phase − Here we have a Record Reader that translates each
record in an input file and sends the parsed data to the mapper in
the form of key-value pairs.
 Map − Map is a user-defined function, which takes a series of key-
value pairs and processes each one of them to generate zero or
more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the
mapper are known as intermediate keys.
 Combiner − A combiner is a type of local Reducer that groups
similar data from the map phase into identifiable sets. It takes the
intermediate keys from the mapper as input and applies a user-
defined code to aggregate the values in a small scope of one
mapper. It is not a part of the main MapReduce algorithm; it is
optional.
 Shuffle and Sort − The Reducer task starts with the Shuffle and
Sort step. It downloads the grouped key-value pairs onto the local
machine, where the Reducer is running. The individual key-value
pairs are sorted by key into a larger data list. The data list groups
the equivalent keys together so that their values can be iterated
easily in the Reducer task.
 Reducer − The Reducer takes the grouped key-value paired data
as input and runs a Reducer function on each one of them. Here, the
data can be aggregated, filtered, and combined in a number of
ways, and it may require a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final
step.
 Output Phase − In the output phase, we have an output formatter
that translates the final key-value pairs from the Reducer function
and writes them onto a file using a record writer. A minimal
word-count sketch of these phases follows.
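
The following is a minimal, framework-free sketch of the map, shuffle-and-sort, and reduce phases for a word count. It is illustrative only and does not use Hadoop or any specific MapReduce library.

from collections import defaultdict

def map_phase(record):
    """Emit (word, 1) for every word in one input record (a line of text)."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle_and_sort(mapped_pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(key, values):
    """Combine the values for one key into a single (key, total) pair."""
    return key, sum(values)

records = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for record in records for pair in map_phase(record)]
grouped = shuffle_and_sort(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}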

Let us try to understand the two tasks Map and Reduce with the help of a
small diagram.
MapReduce-Example

Let us take a real-world example to comprehend the power of MapReduce.
Twitter receives around 500 million tweets per day, which is nearly 6,000
tweets per second. The following illustration shows how Twitter manages
its tweets with the help of MapReduce.

As shown in the illustration, the MapReduce algorithm performs the
following actions −
 Tokenize − Tokenizes the tweets into maps of tokens and writes
them as key-value pairs.
 Filter − Filters unwanted words from the maps of tokens and writes
the filtered maps as key-value pairs.
 Count − Generates a token counter per word.
 Aggregate Counters − Prepares an aggregate of similar counter
values into small, manageable units (illustrated in the sketch below).
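
The following small sketch applies those four steps to a handful of tweets, reusing the same map/reduce style as above. The tweet text and the stop-word list are made up for illustration.

from collections import Counter

STOP_WORDS = {"the", "a", "is", "to", "and", "on"}  # hypothetical unwanted words

tweets = [
    "the match is on and the crowd is loud",
    "a great match to watch",
]

# Tokenize: break each tweet into (token, 1) pairs.
pairs = [(word.lower(), 1) for tweet in tweets for word in tweet.split()]

# Filter: drop unwanted words from the token stream.
pairs = [(word, one) for word, one in pairs if word not in STOP_WORDS]

# Count: generate a per-word counter from the remaining pairs.
counts = Counter()
for word, one in pairs:
    counts[word] += one

# Aggregate Counters: group the counts into small, manageable units.
top_words = counts.most_common(3)
print(top_words)  # e.g. [('match', 2), ('crowd', 1), ('loud', 1)]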

What is a key-value store?

A key-value store, or key-value database, is a type of data storage
software program that stores data as a set of unique identifiers, each of
which has an associated value. This data pairing is known as a “key-
value pair.” The unique identifier is the “key” for an item of data, and the
value is either the data being identified or the location of that data.
An example of a key-value store

The key could be anything, depending on restrictions imposed by the
database software, but it needs to be unique in the database so there is
no ambiguity when searching for the key and its value. The value could be
anything, including a list or another key-value pair. Some database
software allows you to specify a data type for the value.

In traditional relational database design, data is stored in tables composed
of rows and columns. The database developer specifies many attributes of
the data to be stored in the table upfront. This creates significant
opportunities for optimizations such as data compression and
performance around aggregations and data access, but also introduces
some inflexibility.

Key-value stores, on the other hand, are typically much more flexible and
offer very fast performance for reads and writes, in part because the
database is looking for a single key and is returning its associated value
rather than performing complex aggregations.
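
As an illustration only, and not the API of any particular product, here is a tiny in-memory key-value store with the basic put/get/delete operations described above.

class KeyValueStore:
    """Minimal in-memory key-value store: each unique key maps to one value."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value          # overwrite if the key already exists

    def get(self, key, default=None):
        return self._items.get(key, default)

    def delete(self, key):
        self._items.pop(key, None)

store = KeyValueStore()
store.put("user:1001", {"name": "Asha", "plan": "premium"})  # value can be any structure
print(store.get("user:1001"))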

What does a key-value pair mean?

A key-value pair is two pieces of data associated with each other. The key
is a unique identifier that points to its associated value, and a value is
either the data being identified or a pointer to that data.

A key-value pair is the fundamental data structure of a key-value store or
key-value database, but key-value pairs have existed outside of software
for much longer. A telephone directory is a good example, where the key
is the person or business name, and the value is the phone number. Stock
trading data is another example of a key-value pair. In this case, you may
have a key associated with values for the stock ticker, whether the trade
was a buy or sell, the number of shares, or the price of the trade.

Key-value store advantages

There are a few advantages that a key-value store provides over
traditional row-column-based databases. Thanks to the simple data format
that gives it its name, a key-value store can be very fast for read and
write operations. And key-value stores are very flexible, a valued asset in
modern programming as we generate more data without traditional
structures.

Also, key-value stores do not require placeholders such as “null” for
optional values, so they may have smaller storage requirements, and they
often scale almost linearly with the number of nodes.

Key-value database use cases

The advantages listed above naturally lend themselves to several popular
use cases for key-value databases.

 Web applications may store user session details and preferences in a
key-value store. All the information is accessible via the user key, and
key-value stores lend themselves to fast reads and writes.
 Real-time recommendations and advertising are often powered by
key-value stores because the stores can quickly access and present
new recommendations or ads as a web visitor moves throughout a
site.
 On the technical side, key-value stores are commonly used for in-
memory data caching to speed up applications by minimizing reads
and writes to slower disk-based systems. Hazelcast is an example of
a technology that provides an in-memory key-value store for fast
data retrieval.

Distributed key-value store

A distributed key-value store builds on the advantages and use cases
described above by providing them at scale. A distributed key-value store
is built to run on multiple computers working together, and thus allows
you to work with larger data sets because more servers with more
memory now hold the data. By distributing the store across multiple
servers, you can increase processing performance. And if you leverage
replication in your distributed key-value store, you increase its fault
tolerance. Hazelcast is an example of a technology that provides a
distributed key-value store for larger-scale deployments. The “IMap” data
type in Hazelcast, similar to the “Map” type in Java, is a key-value store
stored in memory. Unlike the Java Map type, Hazelcast IMaps are stored in
memory in a distributed manner across the collective RAM in a cluster of
computers, allowing you to store much more data than possible on a
single computer. This gives you quick lookups with in-memory speeds
while also retaining other important capabilities such as high availability
and security.

When to use a key-value database

 When your application needs to handle lots of small, continuous
reads and writes that may be volatile. Key-value databases offer
fast in-memory access.
 When storing basic information, such as customer details; storing
webpages with the URL as the key and the webpage as the value; or
storing shopping-cart contents, product categories, and e-commerce
product details.
 For applications that don’t require frequent updates or complex
queries.

Use cases for key-value databases

 Session management on a large scale (see the sketch after this list).


 Using cache to accelerate application responses.
 Storing personal data on specific users.
 Product recommendations, storing personalized lists of items for
individual customers.
 Managing each player’s session in massive multiplayer online
games.
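
To make the session-management and caching use cases concrete, here is a hedged sketch of a session cache with a time-to-live. It uses only the standard library and merely stands in for what a product such as Redis or Hazelcast would provide.

import time

class SessionCache:
    """Key-value session cache where each entry expires after ttl_seconds."""

    def __init__(self, ttl_seconds=1800):
        self._ttl = ttl_seconds
        self._sessions = {}  # session_id -> (expires_at, data)

    def put(self, session_id, data):
        self._sessions[session_id] = (time.time() + self._ttl, data)

    def get(self, session_id):
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        expires_at, data = entry
        if time.time() > expires_at:          # expired: drop it and report a miss
            del self._sessions[session_id]
            return None
        return data

cache = SessionCache(ttl_seconds=60)
cache.put("sess-9f2", {"user": "venkat", "cart": ["shoes"]})
print(cache.get("sess-9f2"))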
Document Databases

Document Database

A document database is a type of NoSQL database that stores data as
JSON documents instead of columns and rows. The JSON format is used
natively to both store and query data. These documents can be grouped
together into collections to form database systems.

Relational Vs Document Database

Relational database management systems (RDBMS) rely on Structured
Query Language (SQL). NoSQL doesn’t.

An RDBMS is focused on creating relationships between tables to store
and read data. Document databases are focused on the data itself, and
relationships are represented with nested data.
Key comparisons between relational and document databases:

RDBMS: Structured around the concept of relationships.
Document database system: Focused on data rather than relationships.

RDBMS: Organizes data into tuples (or rows).
Document database system: Documents have properties without
theoretical definitions, instead of rows.

RDBMS: Defines data (forms relationships) via constraints and foreign
keys (e.g., a child table references the master table via its ID).
Document database system: No DDL language for defining schemas.

RDBMS: Uses DDL (Data Definition Language) to create relationships.
Document database system: Relationships are represented via nested
data, not foreign keys (any document may contain other documents
nested inside it, leading to an N:1 or 1:N relationship between the two
document entities).

RDBMS: Offers strong consistency, critical for some use cases such as
daily banking.
Document database system: Offers eventual consistency, with a period
of inconsistency.

Features of Document Databases

Document databases provide fast queries, a structure well suited for
handling big data, flexible indexing, and a simplified method of
maintaining the database. They are efficient for web apps and have been
widely adopted by large-scale IT companies like Amazon.

Although SQL databases have great stability and scale well vertically, they
struggle with super-sized databases. Use cases that require immediate
access to data, such as healthcare apps, are a better fit for document
databases. Document databases make it easy to query data with the
same document-model used to code the application.

Document Databases Use Cases

General Use Cases

 User profiles
 Book databases
 Catalogs
 Patients' data
 Extracting real-time big data
 Data of varying structures
 Content management

Some of the above-mentioned use cases are described in greater detail in
the following sections.

Book Database

Both relational and NoSQL document systems are used to form a book
database, although in different ways.

The relational approach would represent the relationship between books
and authors via tables with IDs – an Author table and a Books table. It
forces each author to have at least one entry in the Books table by
disallowing null values.

By comparison, the document model lets you nest. It shows relationships
more naturally and simply by ensuring that each author document has a
property called Books, with an array of related book documents in the
property. When you search for an author, the entire book collection
appears.
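
As an illustration of this nested approach, an author document might look like the following; the field names and identifiers are made up for the example, shown here as a Python dictionary.

# Hypothetical author document with its books nested inside it.
author_document = {
    "_id": "author:101",
    "name": "Chinua Achebe",
    "books": [
        {"title": "Things Fall Apart", "year": 1958},
        {"title": "No Longer at Ease", "year": 1960},
    ],
}

# Reading the author gives you the whole book collection in one lookup.
for book in author_document["books"]:
    print(book["title"], book["year"])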

Content Management

Developers use document databases to create video streaming
platforms, blogs, and similar services. Each file is stored as a single
document and the database is easier to maintain as the service evolves
over time. Significant data modifications, such as data model changes,
require no downtime as no schema update is necessary.

Catalogs

Document databases are much more efficient than relational databases
when it comes to storing and reading catalog files. Catalogs may have
thousands of attributes stored and document databases provide fast
reading times. In document databases, attributes related to a single
product are stored in a single document. Modifying one product's
attributes does not affect other documents.

Document Database Advantages and Disadvantages

Below are some key advantages and disadvantages of document
databases:

Document Database Advantages:
 Schema-less
 Faster creation and care
 No foreign keys
 Open formats
 Built-in versioning

Document Database Disadvantages:
 Consistency-Check Limitations
 Atomicity weaknesses
 Security

The advantages and disadvantages are further explained in the sections
below.

Advantages
 Schema-less. There are no restrictions in the format and structure
of data storage. This is good for retaining existing data at massive
volumes and different structural states, especially in a continuously
transforming system.
 Faster creation and care. Minimal maintenance is required once
you create the document, which can be as simple as adding your
complex object once.
 No foreign keys. With the absence of this relationship dynamic,
documents can be independent of one another.
 Open formats. A clean build process that uses XML, JSON and
other derivatives to describe documents.
 Built-in versioning. As your documents grow in size they can also
grow in complexity. Versioning decreases conflicts.

Disadvantages

 Consistency-Check Limitations. In the book database use case
example above, it would be possible to search for books from a non-
existent author. You could search the book collection and find
documents that are not connected to an author collection.
Each listing may also duplicate author information for each book.
These inconsistencies aren’t significant in some contexts, but at
upper-tier standards of RDB consistency audits, they seriously
hamper database performance.
 Atomicity weaknesses. Relational systems also let you modify
data from one place without the need for JOINs. All new reading
queries will inherit changes made to your data via a single
command (such as updating or deleting a row).
For document databases, a change involving two collections will
require you to run two separate queries (one per collection). This
breaks atomicity requirements (see the sketch after this list).
 Security. A significant share of web applications leak sensitive
data. Owners of NoSQL databases, therefore, need to pay careful
attention to web-application vulnerabilities.
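
To make the atomicity point concrete, the hedged sketch below updates two separate collections with two separate calls, using the PyMongo driver; the connection string, collection names, and fields are invented for the example. Without a multi-document transaction, a failure between the two calls leaves the data only partially updated.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["bookstore"]

# Renaming an author touches two collections, so it takes two queries.
db.authors.update_one(
    {"_id": "author:101"},
    {"$set": {"name": "C. Achebe"}},
)
# If the process crashes here, the books collection still holds the old name.
db.books.update_many(
    {"author_id": "author:101"},
    {"$set": {"author_name": "C. Achebe"}},
)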
Best Document Databases

Amazon DocumentDB

Features:

 MongoDB-compatible
 Fully managed
 High performance with low latency querying
 Strong compliance and security
 High availability

Used for:

 Amazon’s entire development team uses Amazon DocumentDB to
increase agility and productivity. They needed nested indexes,
aggregations and ad hoc queries, with a fully managed process.
 The BBC uses it for querying and storing data from multiple data
streams and compiling into single customer feeds. They migrated to
Amazon DocumentDB for the benefits of a fully managed service
with high availability, durability, and default backups.

MongoDB

Features:

 Ad hoc queries (see the query sketch at the end of this section)
 Optimised indexing for querying
 Sharding
 Load-balancing

Used for:
 Forbes decreased build time by 58%, gaining a 28% increase in
subscriptions due to quicker building of new features, simpler
incorporations and better handling of increasingly diverse data
types.
 Toyota found it much simpler for developers to work at high speeds
by using natural JSON documents. More time is spent on building the
business value instead of data modeling.
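
As a hedged sketch of the ad hoc querying and indexing features listed above, the snippet below uses the PyMongo driver against a hypothetical local MongoDB instance; the connection string, database, and field names are assumptions for illustration, and the exact API may vary by driver version.

from pymongo import MongoClient

# Hypothetical local MongoDB instance and bookstore database.
client = MongoClient("mongodb://localhost:27017")
db = client["bookstore"]

# Insert an author document with nested books.
db.authors.insert_one({
    "name": "Chinua Achebe",
    "books": [{"title": "Things Fall Apart", "year": 1958}],
})

# Index a nested field to speed up ad hoc queries on book titles.
db.authors.create_index("books.title")

# Ad hoc query: find authors who wrote a book with this title.
for author in db.authors.find({"books.title": "Things Fall Apart"}):
    print(author["name"])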

Cosmos DB

Features:

 Fast reads at any scale
 99.999% availability
 Fully managed
 NoSQL/Native Core APIs
 Serverless; scales instantly and cost-effectively

Used for:

 Coca-Cola gets insights delivered in minutes, facilitating global
scaling. Before migrating to Cosmos DB, it took hours.
 ASOS needed a distributed database that flexibly and seamlessly
scales to handle over 100 million global retail customers.

ArangoDB

Features:

 Schema validations
 Diverse indexing
 Fast distributed clusters
 Efficient with very large datasets
 Supports multiple NoSQL data models
 Can combine models into single queries
Used for:

 Oxford University reduced hospital attendance and improved test
results by developing a web-based assessment test for
cardiopulmonary disease.
 FlightStats transformed fragmented flight data (flight status,
weather, airport delays, and reference data) into one standard,
enabling accurate, predictive and analytical results.

Couchbase Server

Features:

 Ability to manage global deployments


 Extreme agility and flexibility
 Fast at large scale
 Easy cloud integrations

Used for:

 BT used Couchbase’s flexible data model to accelerate its capacity
to deliver content at high performance while scaling with ease
against demand spikes.
 eBay migrated from Oracle to a more cost-effective, feature-rich
solution for its key-value store/document system. App performance
and availability grew, while developers could use their SQL know-how
and a more flexible schema to speed up their CI/CD pipeline.

How to Choose?

Your app’s critical demands determine how to structure data. A few key
questions:
 Will you be doing more reading or writing? Relational systems are
superior if you are doing more writing, as they avoid duplications
during updates.
 How important is synchronisation? Due to their ACID framework,
relational systems do this better.
 How much will your database schema need to transform in the
future? Document databases are a winning choice if you work with
diverse data at scale and require minimal maintenance.

Neither document nor SQL is strictly better than the other. The right
choice depends on your use case. When making your decision, consider
the types of operations that will be most frequently carried out.

Four Important Databases:

Column databases

Key-value databases

Document databases

Graph databases
