NoSQL Databases
NoSQL stands for Not Only SQL. NoSQL databases are non-relational, and they
can store data in an unstructured format. The following graphic highlights the
five key features of NoSQL databases.
Properties: SQL databases use the ACID (Atomicity, Consistency, Isolation,
Durability) property. NoSQL databases, on the other hand, use the CAP
(Consistency, Availability, Partition Tolerance) property.
1. Document Databases
This type of database is designed to store and query documents in formats such
as JSON, XML, or BSON. Each document is a row or a record in the database and
is made up of key-value pairs. A document stores information about one object
and its related data. For instance, the following database contains three
records, each of which gives information about a student. In the first
document, firstname is a key, and Franck is its value.
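As a rough sketch of the idea in Python, assume a collection is simply a list of dictionaries. Only the first name Franck comes from the example above; the other names and fields, and the `find` helper, are hypothetical and not tied to any particular database.

```python
# A toy "collection": each document is a dict of key-value pairs.
# Only firstname "Franck" comes from the text; everything else is made up.
students = [
    {"id": 1, "firstname": "Franck", "lastname": "Doe", "major": "Physics"},
    {"id": 2, "firstname": "Alice", "lastname": "Martin",
     "courses": ["Statistics", "Databases"]},
    {"id": 3, "firstname": "Omar", "email": "omar@example.com"},
]


def find(collection, **criteria):
    """Return the documents whose fields match every given key-value pair."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


matches = find(students, firstname="Franck")
print(matches[0]["lastname"])  # prints "Doe"
```

Note that the three documents do not share the same set of fields, which is what a schemaless store allows.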
Document Database Advantages
Schemaless: there are no limitations in terms of the format and
structure of the data storage. This is beneficial, especially when there is a
continuous transformation in the database.
Easy to update: a new piece of information can be added or removed
without changing the other fields of that specific document.
Improved performance: all the information about an object can be
found in a single document. There is no need to refer to external
information, which might not be the case for a relational database where
the user might have to query other tables.
Document Database Limitations
Consistency check issues: documents do not necessarily need to have a
relationship with one another, and two documents can have different
fields, which makes consistency harder to check.
Atomicity issues: if we have to change two collections of documents, we
will need to run a separate query for each collection.
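Because the database itself does not enforce a schema, consistency checks usually move into application code. The following is a minimal sketch of such a check, assuming a hypothetical rule that every document must define `firstname` and `lastname`; the documents are invented for illustration.

```python
REQUIRED_FIELDS = {"firstname", "lastname"}  # hypothetical application rule

documents = [
    {"firstname": "Franck", "lastname": "Doe"},
    {"firstname": "Alice"},                # no lastname
    {"lastname": "Martin", "age": 22},     # no firstname
]


def missing_fields(doc, required=REQUIRED_FIELDS):
    """Return the required fields that a document does not define, sorted."""
    return sorted(required - set(doc))


# Map each document's position to the fields it lacks, keeping only problems.
report = {i: missing_fields(doc) for i, doc in enumerate(documents)}
incomplete = {i: fields for i, fields in report.items() if fields}
print(incomplete)  # {1: ['lastname'], 2: ['firstname']}
```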
When to Use Document Databases
Recommended when your data schema is expected to change frequently.
Document Database Applications
Because of their flexibility, document databases can be practical for
online user profiles, where different users can have different types of
information. In this case, each user’s profile is stored only by using
attributes that are specific to them.
They can be used for content management, which requires effective
storage of data from a variety of sources. That information can then be
used to create and incorporate new types of content.
2. Key-value Databases
These are the simplest types of NoSQL databases. Every item is stored in the
database in a key-value pair. We can think of it as a table with exactly two
columns. The first column contains a unique key. The second column is the
value for each key. The values can be in different data types, such as integer,
string, and float, or more complex data types, such as image and document.
The following example illustrates a key-value database containing information
about customers where the key is their phone number, and the value is their
monthly purchase.
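A Python dictionary behaves much like this two-column table, so it can serve as a small illustration. The phone numbers and amounts below are invented, and a real key-value store would expose its own client API rather than dictionary syntax.

```python
# Keys are phone numbers, values are monthly purchase amounts.
# All numbers and amounts below are invented for illustration.
monthly_purchase = {
    "555-0101": 120.50,
    "555-0102": 89.99,
    "555-0103": 240.00,
}

# Lookup by key is the primary access path.
print(monthly_purchase["555-0102"])  # 89.99

# Writing to an existing key replaces its whole value.
monthly_purchase["555-0101"] = 135.00

# Finding customers by amount means scanning every pair, because the
# store only knows how to locate entries by key.
big_spenders = [
    phone for phone, amount in monthly_purchase.items() if amount > 100
]
print(big_spenders)  # ['555-0101', '555-0103']
```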
Key-value Database Advantages
Simplicity: the key-value structure is straightforward. No fixed data
type is imposed on the values, which makes it simple to use.
Speed: the simple data format makes read and write operations faster.
Key-value Database Limitations
No filtering can be performed on the value column, because a lookup
returns all the information stored in the value field.
They are optimized for a single key and a single value. Storing multiple
values under one key would require a parser.
The value is updated only as a whole, which requires getting the
complete data, performing the required processing on that data, and
finally storing back the whole data. This might create a performance
issue when the processing requires a lot of time.
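The read-modify-write cycle described in the last limitation can be sketched as follows. This assumes a hypothetical in-memory store that only holds opaque strings, with a customer record encoded as JSON; the `put` and `get` helpers and the key name are illustrative, not a real client API.

```python
import json

store = {}  # stand-in for a key-value store holding opaque string values


def put(key, obj):
    """Serialize the whole object and store it under the key."""
    store[key] = json.dumps(obj)


def get(key):
    """Fetch and decode the whole value stored under the key."""
    return json.loads(store[key])


put("customer:42", {"name": "Franck", "purchases": [120.5, 89.99]})

# Adding one purchase still requires handling the complete value.
record = get("customer:42")        # 1. read the complete value
record["purchases"].append(35.0)   # 2. modify it in application code
put("customer:42", record)         # 3. write the complete value back

print(get("customer:42")["purchases"])  # [120.5, 89.99, 35.0]
```

The larger the stored value, the more data each small change has to move, which is where the performance issue comes from.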
When to Use Key-value Databases
Suited to applications based on simple key-based queries.
Used for simple applications that need to temporarily store simple
objects, as in a cache.
They can be used as well when there is a need for real-time data access.
Applications
They are best for simple applications that need to temporarily store
simple objects, as in a cache.
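To make the cache use case concrete, here is a minimal sketch of a key-value cache with expiring entries. The `TTLCache` class, its parameters, and the session key are hypothetical; production caches add eviction policies, size limits, and concurrency control that are left out here.

```python
import time


class TTLCache:
    """Tiny in-memory key-value cache whose entries expire after ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock   # injectable so expiry can be tested
        self._items = {}     # key -> (value, expiry time)

    def set(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._items[key]  # drop the stale entry on access
            return default
        return value


cache = TTLCache(ttl=30)
cache.set("session:abc", {"user": "franck"})
print(cache.get("session:abc"))  # {'user': 'franck'}
```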
3. Wide-column Databases
As the name suggests, column-oriented databases store data as a collection of
columns, where each column is treated separately. Their design is based on
Google's Bigtable paper. They are mostly used for analytical workloads such as
business intelligence, data warehouse management, and customer relationship
management.
For instance, we can quickly get the average age of customers and the average
price of products by applying the aggregation function AVG to the
corresponding column.
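The contrast between the two layouts can be sketched in plain Python. The customer and product data below are invented, and this only illustrates the storage idea, not how any specific database lays out data on disk.

```python
from statistics import mean

# Row-oriented layout: one record per customer.
customer_rows = [
    {"name": "Ana", "age": 31, "city": "Lyon"},
    {"name": "Ben", "age": 45, "city": "Oslo"},
    {"name": "Chen", "age": 26, "city": "Lima"},
]

# Column-oriented layout: one list per column.
customer_columns = {
    "name": ["Ana", "Ben", "Chen"],
    "age": [31, 45, 26],
    "city": ["Lyon", "Oslo", "Lima"],
}
product_columns = {
    "name": ["pen", "lamp", "desk"],
    "price": [2.0, 28.0, 120.0],
}

# AVG over a column reads only that column's values.
avg_age = mean(customer_columns["age"])
avg_price = mean(product_columns["price"])

# The row layout gives the same answer but has to visit every record.
avg_age_from_rows = mean(row["age"] for row in customer_rows)
print(avg_age, avg_price)
```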
4. Graph/node Databases
Graph databases are used to store, map, and search relationships between
nodes through edges. A node represents a data element, also called an object
or entity. Each node has incoming or outgoing edges. An edge represents
the relationship between two nodes. Edges can also carry properties that
describe how the nodes they connect are related.
For instance, the following sentences can be represented as a graph:
“Zoumana studies at Texas Tech University. He likes to run at the Park inside
the University”
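One way to picture the example is as a set of nodes plus a list of labeled, directed edges. The sketch below uses plain Python structures; the relationship labels (`STUDIES_AT`, `LIKES_TO_RUN_AT`, `INSIDE`) and the helper functions are hypothetical names chosen for illustration.

```python
# Nodes are data elements; edges are (source, relationship, target) triples.
# The relationship labels are hypothetical names chosen for this sketch.
nodes = {"Zoumana", "Texas Tech University", "Park"}
edges = [
    ("Zoumana", "STUDIES_AT", "Texas Tech University"),
    ("Zoumana", "LIKES_TO_RUN_AT", "Park"),
    ("Park", "INSIDE", "Texas Tech University"),
]


def outgoing(node):
    """Return (relationship, target) pairs for edges leaving a node."""
    return [(rel, dst) for src, rel, dst in edges if src == node]


def incoming(node):
    """Return (source, relationship) pairs for edges arriving at a node."""
    return [(src, rel) for src, rel, dst in edges if dst == node]


print(outgoing("Zoumana"))
# [('STUDIES_AT', 'Texas Tech University'), ('LIKES_TO_RUN_AT', 'Park')]
print(incoming("Texas Tech University"))
# [('Zoumana', 'STUDIES_AT'), ('Park', 'INSIDE')]
```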
Graph/node Database Advantages
They have an agile and flexible structure.
The relationship between nodes in the database is human readable and
explicit, thus easy to understand.
Graph/node Database Limitations
There is no widely adopted standard query language; each platform tends
to use its own.
This makes it more difficult to find support online when facing an issue.
When to Use Graph/node Databases
They can be used when you need to create relationships between data
elements and be able to quickly retrieve those relationships.
Applications
They can be used to perform sophisticated fraud detection in real-time
financial transactions.
They can be used for mining data from social media. For instance,
LinkedIn uses a graph database to identify which users follow each other,
and the relationship between those users and their expertise (for
example, ML Engineer).
Network mapping can be a great fit for representation as a graph since
those networks map relationships between hardware and the services
they support.
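As a small illustration of the social media case, the sketch below finds pairs of users who follow each other from a set of directed "follows" edges. The user names are invented, and this is only a toy version of the idea, not how any particular platform implements it.

```python
# Directed "follows" edges between hypothetical users: (follower, followed).
follows = {
    ("ana", "ben"),
    ("ben", "ana"),
    ("ana", "chen"),
    ("dia", "ana"),
}

# Two users follow each other when both directions are present.
# Sorting each pair makes (ana, ben) and (ben, ana) collapse into one entry.
mutual = sorted(
    {tuple(sorted(pair)) for pair in follows if (pair[1], pair[0]) in follows}
)
print(mutual)  # [('ana', 'ben')]
```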
7 Best NoSQL Databases for Data Science
Now that you have a better knowledge of NoSQL databases, let’s look at a list
of NoSQL databases that are popular for data science projects. This analysis is
only focused on open-source NoSQL databases.
1. MongoDB
MongoDB is an open-source document-oriented database that stores data in
JSON-like documents. It is one of the most commonly used NoSQL databases and
was designed for high availability and scalability, providing auto-sharding
and built-in replication.
Our Introduction to MongoDB course covers the use of MongoDB with Python.
It helps you acquire the skills to manipulate and analyze flexibly structured
data with MongoDB. Uber, LaunchDarkly, Delivery Hero, and 4,300 other
companies reportedly use MongoDB in their tech stack.
2. Cassandra
Cassandra is an open-source wide-column database. It can distribute your
data across multiple machines and automatically repartition it as you add new
machines to your infrastructure. Uber, Facebook, Netflix, and 506 other
companies reportedly use it in their tech stack.
3. Elasticsearch
Like MongoDB, Elasticsearch is an open-source document-oriented database.
It is a widely used search and analytics engine focusing on scalability and
speed. Uber, Shopify, Udemy, and about 3,760 other companies reportedly use
it in their tech stack.
4. Neo4j
Neo4j is an open-source graph-oriented database. It is mainly used to deal with
growing volumes of highly connected data. Around 220 companies reportedly use
it in their tech stack.
5. HBase
HBase is a distributed, column-oriented database. It provides capabilities
similar to Google's Bigtable on top of Apache Hadoop. Reportedly, 81
companies use HBase in their tech stack.
6. CouchDB
CouchDB is also an open-source document-oriented database that collects and
stores data in JSON format. Around 84 companies reportedly use it in their
tech stack.
7. OrientDB
Also an open-source database, OrientDB is a multi-model database supporting
graph, document, key-value, and object models. Only 13 companies reportedly
use it in their tech stack.