Data Science vs NoSQL Databases
Using this key, users can unlock data entries related to that key on
another table, to help with inventory management, shipping, and
more. On relational database management systems (RDBMS), users
can input SQL queries to retrieve the data needed.
Impedance Mismatch
At the beginning of the new millennium the technology world was hit
by the bursting of the 1990s dot-com bubble. While this saw many
people questioning the economic future of the Internet, the 2000s
did see several large web properties dramatically increase in scale.
It’s often said that Amazon and Google operate at scales far
removed from most organizations, so the solutions they needed may
not be relevant to an average organization. While it’s true that most
software projects don’t need that level of scale, it’s also true that
more and more organizations are beginning to explore what they
can do by capturing and processing more data - and to run into the
same problems. So, as more information leaked out about what
Google and Amazon had done, people began to explore making
databases along similar lines - explicitly designed to live in a world
of clusters. While the earlier menaces to relational dominance
turned out to be phantoms, the threat from clusters was serious.
NoSQL stands for “Not Only SQL” or “Not SQL.” Though a
better term would have been “NoREL”, NoSQL caught on. Carlo Strozzi
introduced the NoSQL name in 1998.
Advantages of NoSQL
Disadvantages of NoSQL
No standardization rules
Limited query capabilities
RDBMS databases and tools are comparatively mature
It does not offer any traditional database capabilities, like
consistency when multiple transactions are performed
simultaneously.
As the volume of data increases, maintaining unique keys across
the dataset becomes difficult
Doesn’t work as well with relational data
Relationships
Aggregates are useful in that they put together data that is commonly
accessed together. But there are still lots of cases where data that’s
related is accessed differently. Consider the relationship between a
customer and all of his orders. Some applications will want to access the
order history whenever they access the customer; this fits in well with
combining the customer with his order history into a single aggregate.
Other applications, however, want to process orders individually and thus
model orders as independent aggregates. In this case, you’ll want
separate order and customer aggregates but with some kind of
relationship between them so that any work on an order can look up
customer data. The simplest way to provide such a link is to embed the ID
of the customer within the order’s aggregate data.
That way, if you need data from the customer record, you read the order,
and make another call to the database to read the customer data. This will
work, and will be just fine in many scenarios—but the database will be
ignorant of the relationship in the data. This can be important because
there are times when it’s useful for the database to know about these
links.
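The embedded-ID link described above can be sketched with two in-memory aggregates. This is a minimal illustration; the collection names and fields are made up, not taken from any particular database.

```python
# Two separate aggregates linked by an embedded customer ID.
# All names here are illustrative.
customers = {
    "cust-1": {"name": "Ada Lovelace", "city": "London"},
}
orders = {
    "order-9": {"customer_id": "cust-1", "total": 42.50,
                "items": [{"sku": "book-17", "qty": 1}]},
}

def customer_for_order(order_id):
    """Follow the embedded link: read the order, then make a second
    lookup to fetch the customer record."""
    order = orders[order_id]
    return customers[order["customer_id"]]

print(customer_for_order("order-9")["name"])  # Ada Lovelace
```

Note that the two lookups are the application's responsibility; the store itself knows nothing about the relationship.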
This may imply that if you have data based on lots of relationships, you
should prefer a relational database over a NoSQL store. While that’s true
for aggregate-oriented databases, it’s worth remembering that relational
databases aren’t all that stellar with complex relationships either. While
you can express queries involving joins in SQL, things quickly get very
hairy—both with SQL writing and with the resulting performance—as the
number of joins mounts up.
Graph Databases
The following table outlines the critical differences between graph and
relational databases:
Graph databases became more popular with the rise of big data and social
media analytics. Many multi-model databases support graph modeling.
However, there are numerous graph native databases available as well.
JanusGraph
JanusGraph is a distributed, open-source and scalable graph database
system with a wide range of integration options catered to big data
analytics. Some of the main features of JanusGraph include:
Neo4j
DGraph
Every database type comes with strengths and weaknesses. The most
important aspect is to know the differences as well as available options for
specific problems. Graph databases are a growing technology with
different objectives than other database types.
Advantages
Disadvantages
Conclusion
Schemaless Databases
A common theme across all the forms of NoSQL databases is that they are
schemaless. When you want to store data in a relational database, you
first have to define a schema—a defined structure for the database which
says what tables exist, which columns exist, and what data types each
column can hold.
Before you store some data, you have to have the schema defined for it.
With NoSQL databases, storing data is much more casual.
A key-value store allows you to store any data you like under a key.
A document database effectively does the same thing, since it
makes no restrictions on the structure of the documents you store.
Column-family databases allow you to store any data under any
column you like.
Graph databases allow you to freely add new edges and freely add
properties to nodes and edges as you wish.
Furthermore, if you find you don’t need some things anymore, you can
just stop storing them, without worrying about losing old data as you
would if you delete columns in a relational schema. As well as handling
changes, a schemaless store also makes it easier to deal with nonuniform
data: data where each record has a different set of fields.
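Nonuniform data can be sketched with a plain dictionary standing in for a key-value store; the record shapes below are invented for illustration.

```python
# A schemaless store happily accepts records with different fields;
# nothing enforces a uniform shape.
store = {}
store["user:1"] = {"name": "Alice", "email": "alice@example.com"}
store["user:2"] = {"name": "Bob", "twitter": "@bob", "loyalty_points": 120}
store["user:3"] = {"name": "Cara"}  # missing fields are simply absent, not NULL

# Readers must cope with missing fields themselves.
for key, record in store.items():
    print(key, record.get("email", "<no email>"))
```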
In practice, however, a schemaless database still has an implicit
schema: whatever program accesses the data will assume that certain
field names are present and carry data with a certain meaning, and will
assume something about the type of data stored within each field.
Programs are not humans; they cannot read “qty” and infer that it must
be the same as “quantity”, at least not unless we specifically program
them to do so.
Materialized View
The main thing that sets a materialized view apart is that it is a copy of
query data that does not run in real-time. It takes a little more space, but
it also retrieves data very quickly. You can set materialized views to get
refreshed on a schedule so that the updated information won’t fall
through the cracks.
Materialized View vs View
Both a view and a materialized view can be very useful for simplifying and
optimizing data. You can join data from multiple tables and compile the
information into one simple table.
A view is a virtual table that collects data from another relevant
query. Anytime you access the view, it recompiles the data to provide
you with the most up-to-date information according to your query.
A regular view is great because it doesn’t take much space. But it
sacrifices speed and performance.
The drawback of a materialized view is that you may be looking at stale
data. To reduce the risk of reading obsolete data, you can refresh the
view manually, or set it to refresh on a schedule or by triggers.
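The difference between the two can be sketched in Python, with a function standing in for a view (recomputed on every access) and a cached value standing in for a materialized view (refreshed explicitly). This is a conceptual sketch, not any database's actual API.

```python
sales = [("widget", 3), ("gadget", 5), ("widget", 2)]

def view_total():
    """A regular view: recomputed from base data on every access,
    always current but paying the query cost each time."""
    return sum(qty for _, qty in sales)

# A materialized view: computed once and stored; refreshed explicitly
# (real systems refresh on a schedule or via triggers).
materialized_total = view_total()

def refresh():
    global materialized_total
    materialized_total = view_total()

sales.append(("widget", 10))
print(view_total())         # 20 - the view sees new data at once
print(materialized_total)   # 10 - stale until refreshed
refresh()
print(materialized_total)   # 20
```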
Unfortunately, the use of materialized views may not suit every situation.
First, not every database supports materialized views (Jump to What is a
materialized view for information on environments that do support them).
There are other issues too. Materialized views are read-only. This means
that you can’t update tables through a materialized view like you can
with a regular view. Also, even though materialized views are fairly
secure, there are still security risks since some security features are
missing. For example, you can’t create security keys or constraints on a
materialized view.
You should keep in mind some features to ensure getting the most from a
materialized view:
Make sure that you are working with the materialized view that reflects
query patterns against the base table. You don’t want to create a
materialized view for every single iteration of a query. That would defeat
the purpose. Create a materialized view that will focus on a broad set of
queries.
All NoSQL data modeling techniques can be divided into three major groups:
Conceptual techniques
General modeling techniques
Hierarchy modeling techniques
Conceptual Techniques
Enumerable Keys. For the most part, unordered key values are very
useful, since entries can be partitioned over several dedicated servers by
just hashing the key. Even so, adding some form of sorting functionality
through ordered keys is useful, even though it may add a bit more
complexity and a performance hit.
Dimensionality Reduction. Geographic information systems tend to use
R-Tree indexes and need to be updated in-place, which can be expensive
if dealing with large data volumes. Another traditional approach is to
flatten the 2D structure into a plain list, such as what is done with
Geohash.
With dimensionality reduction, you can map multidimensional data to a
simple Key-Value model or to another non-multidimensional model.
Index Table. With an index table, take advantage of indexes in stores
that don’t necessarily support them internally. Aim to create and then
maintain a unique table with keys that follow a specific access pattern. For
example, a master table to store user accounts for access by user ID.
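An index table can be sketched as a second, hand-maintained map keyed by the access pattern. All names below are illustrative.

```python
# Primary store: user records keyed by user ID.
users = {
    "u1": {"name": "Alice", "city": "Oslo"},
    "u2": {"name": "Bob", "city": "Oslo"},
    "u3": {"name": "Cara", "city": "Lima"},
}

# Index table: a separate table we create and maintain ourselves,
# keyed by the access pattern (city -> user IDs), because the store
# has no secondary indexes of its own.
city_index = {}
for user_id, record in users.items():
    city_index.setdefault(record["city"], []).append(user_id)

print(sorted(city_index["Oslo"]))  # ['u1', 'u2']
```

The cost of this pattern is that every write to the primary table must also update the index table, or the two drift apart.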
Composite Key Index. While somewhat of a generic technique,
composite keys are incredibly useful when ordered keys are used. If you
take it and combine it with secondary keys, you can create a
multidimensional index that is pretty similar to the above-mentioned
Dimensionality Reduction technique.
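A composite key can be sketched with tuples as ordered keys; the place-name data below is made up for illustration.

```python
# Composite keys: (state, city) ordered together, so a range scan
# over one state returns all of its cities contiguously.
stats = {
    ("CA", "Fresno"): 542000,
    ("CA", "Oakland"): 433000,
    ("NY", "Albany"): 99000,
}

# Ordered iteration groups all "CA" entries together, giving a
# crude two-dimensional index over state, then city.
ca_cities = [city for (state, city) in sorted(stats) if state == "CA"]
print(ca_cities)  # ['Fresno', 'Oakland']
```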
Conclusion
NoSQL data modeling techniques are very useful, especially since a lot of
programmers aren’t necessarily familiar with the flexibility of NoSQL. The
specifics vary since NoSQL isn’t so much a singular language like SQL, but
rather a set of philosophies for database management. As such, data
modeling techniques, and how they are applied, vary wildly from database
to database.
Don’t let that put you off, though; learning NoSQL data modeling
techniques is very helpful, especially when it comes to designing a
schema for a DBMS that doesn’t actually require one. More importantly,
learn to take advantage of NoSQL’s flexibility: you don’t have to worry
as much about the minutiae of schema design as you would with SQL.
UNIT - II
Horizontal sharding
Vertical scaling
By simply upgrading your machine, you can scale vertically without the
complexity of sharding. Adding RAM, upgrading your CPU, or
increasing the storage available to your database are simple solutions
that do not require you to change the design of either your database
architecture or your application.
Adding RAM
Depending on your use case, it may make more sense to simply shift a
subset of the burden onto other providers or even a separate database.
For example, blob or file storage can be moved directly to a cloud
provider such as Amazon S3. Analytics or full-text search can be handled
by specialized services or a data warehouse. Offloading this particular
functionality can make more sense than trying to shard your entire
database.
Replication
Advantages of sharding
Disadvantages of sharding
Sharding does come with several drawbacks, namely overhead in query
result compilation, complexity of administration, and increased
infrastructure costs.
Having considered the pros and cons, let’s move forward and discuss
implementation.
First, how will the data be distributed across shards? This is the
fundamental question behind any sharded database. The answer to this
question will have effects on both performance and maintenance.
While there are many different sharding methods, we will consider four
main kinds: ranged/dynamic sharding, algorithmic/hashed sharding,
entity/relationship-based sharding, and geography-based sharding.
Ranged/dynamic sharding
Range      Shard ID
[0, 20)    A
[20, 40)   B
[40, 50]   C
The field on which the range is based is also known as the shard key.
Naturally, the choice of shard key, as well as the ranges, are critical in
making range-based sharding effective. A poor choice of shard key will
lead to unbalanced shards, which leads to decreased performance. An
effective shard key will allow for queries to be targeted to a minimum
number of shards. In our example above, if we query for all records with
IDs 10-30, then only shards A and B will need to be queried.
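The range-to-shard lookup can be sketched as follows, using the ranges from the table above; the helper names are our own.

```python
# Ranged sharding: each shard owns a half-open ID range
# (the last range is closed at 50).
RANGES = [((0, 20), "A"), ((20, 40), "B"), ((40, 51), "C")]

def shard_for(record_id):
    for (low, high), shard in RANGES:
        if low <= record_id < high:
            return shard
    raise KeyError(record_id)

def shards_for_range(low, high):
    """Which shards must answer a query for IDs in [low, high]?"""
    return sorted({shard_for(i) for i in range(low, high + 1)})

print(shards_for_range(10, 30))  # ['A', 'B'] - shard C is untouched
```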
Two key attributes of an effective shard key are high cardinality and
well-distributed frequency. Cardinality refers to the number of possible
values of that key. If a shard key only has three possible values, then
there can only be a maximum of three shards. Frequency refers to the
distribution of the data along the possible values. If 95% of records occur
with a single shard key value then, due to this hotspot, 95% of the records
will be allocated to a single shard. Consider both of these attributes when
selecting a shard key.
Algorithmic/hashed sharding
The function can take any subset of values on the record as inputs.
Perhaps the simplest example of a hash function is to use the modulus
operator with the number of shards, as follows:
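A minimal modulus-based sketch in Python (the shard count is chosen arbitrarily for illustration):

```python
NUM_SHARDS = 3

def shard_for(record_id):
    """Modulus hash sharding: the shard is simply the remainder of
    the key divided by the shard count."""
    return record_id % NUM_SHARDS

placement = {rid: shard_for(rid) for rid in range(10)}
print(placement)
# Keys spread evenly, but records with adjacent IDs land on
# different shards, so a range query must hit every shard.
```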
First, query operations for multiple records are more likely to get
distributed across multiple shards. Whereas ranged sharding reflects the
natural structure of the data across shards, hashed sharding typically
disregards the meaning of the data. This is reflected in increased
broadcast operation occurrence.
Entity-/relationship-based sharding
For instance, consider the case of a shopping database with users and
payment methods. Each user has a set of payment methods that is tied
tightly with that user. As such, keeping related data together on the same
shard can reduce the need for broadcast operations, increasing
performance.
Geography-based sharding
Visualization of an implementation
Opinions
The Master-slave approach for replicating your NoSQL Databases has the
following advantages:
Peer-to-Peer NoSQL Data Replication works on the principle that every
database copy is responsible for updating its own data. This can only
work when every copy shares an identical schema and stores the same
type of data. Furthermore, database restoration is a key requirement of
this Data Replication technique.
Since the catalog queries are stored across multiple nodes, the
performance of Peer-to-Peer NoSQL Data Replication remains
constant even if your data load increases.
If a node fails, the application layer can route that node’s read
requests to other adjacent nodes and maintain a lossless processing
environment and data availability.
The Peer-to-Peer NoSQL Data Replication technique comes along with the
following drawbacks:
Consistency:
Read-write conflict
⚫ A read in the middle of two logically-related writes
Solutions:
⚫ Pessimistic approach
⚫ Prevent conflicts from occurring
⚫ Usually implemented with write locks managed by the system
⚫ Optimistic approach
⚫ Lets conflicts occur, but detects them and takes action to sort
them out
⚫ Approaches (for write-write conflicts):
⚫ conditional updates: test the value just before updating
⚫ save both updates: record that they are in conflict and
then merge them
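The conditional-update flavor of the optimistic approach can be sketched with a version stamp that is checked just before writing. This is a simple in-memory illustration, not any particular database's API.

```python
# A record carrying a version stamp alongside its value.
record = {"value": "draft", "version": 1}

def conditional_update(expected_version, new_value):
    """Write only if nobody has changed the record since we read it;
    otherwise report a conflict so the caller can re-read and retry."""
    if record["version"] != expected_version:
        return False  # conflict detected
    record["value"] = new_value
    record["version"] += 1
    return True

assert conditional_update(1, "reviewed")       # first writer wins
assert not conditional_update(1, "published")  # stale version: rejected
print(record)  # {'value': 'reviewed', 'version': 2}
```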
CAP theorem
Version Stamps
Why MapReduce?
The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
The Map task takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key-
value pairs).
The Reduce task takes the output from the Map as an input and
combines those data tuples (key-value pairs) into a smaller set of
tuples.
Let us now take a close look at each of the phases and try to understand
their significance.
Let us try to understand the two tasks, Map and Reduce, with the help
of a small diagram −
MapReduce-Example
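A classic illustration of the two phases is a word count, sketched here in plain Python rather than in any particular MapReduce framework.

```python
from collections import defaultdict

def map_phase(document):
    """Map: break the input into (word, 1) key-value tuples."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: combine the tuples into a smaller set, one total per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))  # e.g. 'the' -> 3, 'fox' -> 2
```

In a real cluster the map calls run in parallel across nodes, and the pairs are shuffled by key before reduction; the data flow, however, is exactly this.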
Key-value stores, on the other hand, are typically much more flexible and
offer very fast performance for reads and writes, in part because the
database is looking for a single key and is returning its associated value
rather than performing complex aggregations.
A key-value pair is two pieces of data associated with each other. The key
is a unique identifier that points to its associated value, and a value is
either the data being identified or a pointer to that data.
Document Database
Although SQL databases have great stability and vertical power, they
struggle with super-sized databases. Use cases that require immediate
access to data, such as healthcare apps, are a better fit for document
databases. Document databases make it easy to query data with the
same document-model used to code the application.
Content management
Catalogs
Patients' data
Book Database
Both relational and NoSQL document systems are used to form a book
database, although in different ways.
Content Management
Developers use document databases to create video streaming
platforms, blogs and similar services. Each file is stored as a single
document and the database is easier to maintain as the service evolves
over time. Significant data modifications, such as data model changes,
require no downtime as no schema update is necessary.
Catalogs
Advantages
Schema-less. There are no restrictions in the format and structure
of data storage. This is good for retaining existing data at massive
volumes and different structural states, especially in a continuously
transforming system.
Faster creation and care. Minimal maintenance is required once
you create the document, which can be as simple as adding your
complex object once.
No foreign keys. With the absence of this relationship dynamic,
documents can be independent of one another.
Open formats. A clean build process that uses XML, JSON and
other derivatives to describe documents.
Built-in versioning. As your documents grow in size they can also
grow in complexity. Versioning decreases conflicts.
Disadvantages
Amazon DocumentDB
Features:
MongoDB-compatible
Fully managed
High performance with low latency querying
Strong compliance and security
High availability
Used for:
MongoDB
Features:
Ad hoc queries
Optimised indexing for querying
Sharding
Load-balancing
Used for:
Forbes decreased build time by 58%, gaining a 28% increase in
subscriptions due to quicker building of new features, simpler
incorporations and better handling of increasingly diverse data
types.
Toyota found it much simpler for developers to work at high speeds
by using natural JSON documents. More time is spent on building the
business value instead of data modeling.
Cosmos DB
Features:
Used for:
ArangoDB
Features:
Schema validations
Diverse indexing
Fast distributed clusters
Efficient with large datasets
Supports multiple NoSQL data models
Combine models into single queries
Used for:
Couchbase Server
Features:
Used for:
How to Choose?
Your app’s critical demands determine how to structure data. A few key
questions:
Will you be doing more reading or writing? Relational systems are
superior if you are doing more writing, as they avoid duplications
during updates.
How important is synchronisation? Due to their ACID framework,
relational systems do this better.
How much will your database schema need to transform in the
future? Document databases are a winning choice if you work with
diverse data at scale and require minimal maintenance.
Neither document nor SQL is strictly better than the other. The right
choice depends on your use case. When making your decision, consider
the types of operations that will be most frequently carried out.
Column databases
Key-value databases
Document Databases
Graph Databases