Module 1 Nosql
MODULE 1
CHAPTER 1
1.1 The Value of Relational Databases
1.1.1 Persistent Data Storage
Relational databases have become deeply embedded in our computing culture, often taken for
granted. The primary advantage of relational databases is their ability to store large amounts
of persistent data. In computer systems, memory is typically divided into fast but volatile
"main memory" and a slower but larger "backing store" (usually a disk or other persistent
memory). Main memory is limited in space and loses data when the system shuts down.
Relational databases, as part of the backing store, provide a structured way to store and
retrieve data efficiently, allowing applications to access small parts of data quickly and
reliably.
1.1.2 Concurrency
In enterprise applications, multiple users may access the same data simultaneously,
potentially making changes. While users often modify different areas of the data, conflicts
can arise when they try to change the same data. Managing these concurrent interactions is
challenging, and errors like double-booking hotel rooms can occur. Relational databases
handle concurrency by controlling access to data through transactions, which allow safe and
coordinated interactions between users and systems. Though transactions don't eliminate all
errors, they significantly reduce complexity and help manage concurrency effectively.
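As a minimal sketch of how a transaction guards against the double-booking error mentioned above, the following Python example uses the standard-library sqlite3 module. The bookings table, its UNIQUE (room, night) constraint, and the book_rooms helper are illustrative assumptions, not part of these notes.

```python
import sqlite3


def book_rooms(conn, rooms, night, guest):
    """Book every room in `rooms` for one night, or none of them."""
    try:
        # Using the connection as a context manager commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            for room in rooms:
                conn.execute(
                    "INSERT INTO bookings (room, night, guest) VALUES (?, ?, ?)",
                    (room, night, guest),
                )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a room that is already booked.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings ("
    "room INTEGER, night TEXT, guest TEXT, "
    "UNIQUE (room, night))"  # hypothetical schema: one guest per room per night
)

first = book_rooms(conn, [101], "2024-05-01", "alice")
second = book_rooms(conn, [102, 101], "2024-05-01", "bob")
rows = conn.execute("SELECT room, guest FROM bookings ORDER BY room").fetchall()
print(first, second, rows)
```

The second request fails because room 101 is already taken, and the rollback also discards its partial booking of room 102, so the data never shows a half-completed change.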
1.1.3 Integration
Transactions in relational databases also facilitate error handling. When an error occurs
while a change is being processed, the transaction can be rolled back, undoing the change.
Beyond individual applications, enterprise ecosystems often require multiple applications,
developed by different teams, to collaborate and share data. Shared database integration is a
common approach in which all applications store and access data in a single database. This
method simplifies data sharing, and the database's built-in concurrency control manages
multiple applications in the same way it handles multiple users.
1.1.4 A (Mostly) Standard Model
One reason relational databases have remained dominant is the standardization of their core
features. Developers and database administrators can learn the basic principles of the
relational model and apply them across different projects. While there are variations between
vendors (e.g., different SQL dialects), the fundamental mechanisms remain consistent. This
standardization makes relational databases widely accessible and usable across different
environments.
Koustav Biswas, Dept Of CSE , DSATM
NoSQL Database 21CS745
On the other hand, an application database is managed by a single application codebase and
team, allowing for much easier schema evolution and maintenance. This approach shifts
interoperability concerns to the application’s interface, where applications communicate via
web services, especially over HTTP. The shift to web services, commonly using XML or
JSON, allows for richer data structures compared to SQL relations.
Relational databases persisted in popularity despite the rise of application databases. Most
teams stuck with them, recognizing their familiarity and ease of use.
1.4 Attack of the Clusters
The early 2000s saw the rise of massive web properties handling large-scale data from links,
logs, social networks, and mapping data. As data volumes and traffic increased, companies
faced the challenge of scaling their computing resources.
There are two ways to scale:
1. Scaling up: Adding more power (e.g., processors, memory) to a single machine.
2. Scaling out: Using multiple smaller machines in a cluster. This option is cheaper and
more resilient, because the cluster can absorb individual machine failures without
losing overall availability.
However, relational databases struggled with clusters. They were designed for single-server
environments and couldn't handle distributed data management well. Although sharding
(dividing the database across multiple servers) helped distribute the load, it introduced new
problems like losing querying capabilities and referential integrity across shards.
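The sharding trade-off described above can be sketched in Python. This is an assumption-laden toy, not a real database: the shards are plain dictionaries, NUM_SHARDS, shard_for, put, get, and find_by_city are hypothetical names, and the sample customers are invented.

```python
import zlib

NUM_SHARDS = 3  # assumed cluster size, for illustration only
shards = [{} for _ in range(NUM_SHARDS)]  # each dict stands in for one server


def shard_for(key):
    """Pick a shard from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def put(key, record):
    shards[shard_for(key)][key] = record


def get(key):
    # A lookup by key touches exactly one shard.
    return shards[shard_for(key)].get(key)


def find_by_city(city):
    # A query on any other attribute has to visit every shard,
    # because the shard choice depends only on the key.
    matches = []
    for shard in shards:
        for key, record in shard.items():
            if record["city"] == city:
                matches.append(key)
    return sorted(matches)


put("customer:1", {"name": "Martin", "city": "Chicago"})
put("customer:2", {"name": "Pramod", "city": "Chicago"})
put("customer:3", {"name": "Ada", "city": "London"})

print(get("customer:2"))
print(find_by_city("Chicago"))
```

Key lookups scale well because each one goes to a single server, while the city query shows the lost querying capability: the application itself must fan out to all shards and merge the results.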
Large-scale companies like Google and Amazon, which handled vast amounts of data, began
to look for alternatives. Their internal developments—Google's BigTable and Amazon's
Dynamo—provided models for databases that worked efficiently on clusters, setting the
stage for NoSQL databases designed to handle big data in distributed systems.
1.5 The Emergence of NoSQL
The term "NoSQL" first appeared in the late 1990s as the name of an open-source relational database
developed by Carlo Strozzi. This database was unique as it stored tables in ASCII files and
manipulated them using shell scripts instead of SQL, which gave it the "NoSQL" label.
However, this early iteration of NoSQL had no lasting impact on modern databases.
The NoSQL we recognize today traces its roots to a meetup organized by Johan Oskarsson in
San Francisco in 2009, inspired by projects like BigTable and Dynamo that were
experimenting with alternative data storage solutions. Oskarsson, seeking a memorable name
for the event, chose "NoSQL" from a suggestion by Eric Evans, even though the name was
somewhat misleading as these new databases weren't strictly against SQL but rather offered
different ways to handle data.
NoSQL databases quickly gained popularity, especially in the open-source community,
though they were never bound by a precise definition. These databases are often
characterized by not using SQL, being open-source, and being able to run on clusters;
they also tend to relax the ACID transactions that relational databases rely on. The rise of
web-scale applications in the early 21st century fueled the adoption of NoSQL databases,
driven by the need to handle large-scale data and provide flexibility with schema-less
structures, making them suitable for non-uniform data.
Though NoSQL databases are primarily open-source, some closed-source systems are also
labeled as NoSQL. Although these databases do not use SQL, some offer query languages
that resemble it for easier adoption, such as Cassandra’s CQL. These databases are
typically built for horizontal scalability, meaning they excel in distributed environments
where relational databases struggle.
The NoSQL movement also brought forth the concept of polyglot persistence—the use of
multiple types of databases within the same application, tailored to different use cases. This
approach allows organizations to choose the most appropriate data storage solution for each
scenario, moving beyond the default choice of relational databases. As the book suggests,
NoSQL databases are seen more as application databases rather than integration
databases, shifting away from using a single relational database for everything.
1.6 Key Points
Relational databases have been highly successful for over two decades, offering
persistence, concurrency control, and integration mechanisms.
Impedance mismatch between the relational model and in-memory data structures
has frustrated application developers.
There is a growing trend to encapsulate databases within applications and integrate
through services, moving away from using databases as integration points.
The primary driver of change in data storage has been the need to handle large
volumes of data on clusters, a task for which relational databases are not optimized.
NoSQL is an accidental neologism with no prescriptive definition, only observable
common characteristics.
The common characteristics of NoSQL databases include:
o Not using the relational model
o Running efficiently on clusters
o Being open-source
o Being built for 21st-century web estates
o Being schema-less
The most significant outcome of NoSQL's rise is the concept of Polyglot Persistence,
where multiple data storage technologies are used within a single system for different
use cases.
"price": 32.45,
"productName": "NoSQL Distilled"
}
],
"shippingAddress": [{"city": "Chicago"}],
"orderPayment": [
{
"ccinfo": "1000-1000-1000-1000",
"txnId": "abelif879rft",
"billingAddress": {"city": "Chicago"}
}
]
}
Explanation of Aggregates
Aggregate Boundaries:
o The customer and order data form distinct aggregates. Each aggregate
encapsulates related data such as billing addresses, order items, and payment
information.
o The customer aggregate contains details about the customer, including the
billing address.
o The order aggregate contains data related to the order, including items
ordered, shipping address, and payment details.
Data Duplication:
o In this model, certain data (e.g., the billing and shipping address) is copied
into different parts of the JSON rather than linked by foreign keys (as in a
relational database).
o This approach allows for the immutability of certain information. For
example, you don’t want the shipping address or payment details to change
after an order is placed. Instead of linking addresses by an ID and updating
them globally, addresses are copied where needed.
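The copy-instead-of-link idea above can be sketched in Python with plain dictionaries. The place_order helper is hypothetical, and reusing the customer's billing address as the shipping address is an assumption made only to keep the example short.

```python
import copy

customer = {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
}


def place_order(order_id, customer, items):
    """Build an order aggregate that carries its own copy of the address."""
    return {
        "id": order_id,
        "customerId": customer["id"],
        "orderItems": items,
        # Assumption: ship to the billing address. deepcopy ensures the
        # order does not share the nested address objects with the customer.
        "shippingAddress": copy.deepcopy(customer["billingAddress"]),
    }


order = place_order(
    99,
    customer,
    [{"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}],
)

# The customer later moves; the order placed earlier is unaffected.
customer["billingAddress"][0]["city"] = "Boston"
print(order["shippingAddress"])
```

Because the address was copied at order time, the later change to the customer leaves the order showing where the goods were actually shipped.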
Aggregates and Relationships
The link between aggregates (such as between a customer and their orders) is
maintained through fields like customerId in the order. This shows the relationship but
does not imply aggregation; the customer and order aggregates remain distinct.
Similarly, within the order, the productId would normally refer to a separate product
aggregate. However, in this case, we've denormalized the data by including the
productName directly in the order item to reduce the need to access multiple
aggregates during data interactions.
Aggregate Design Considerations
Trade-offs: The main design decision here is whether to bundle related data (such as
a customer and their orders) into a single aggregate or keep them separate. This choice
depends on how data is typically accessed in the application:
o If a system frequently needs to retrieve all orders for a customer, it may make
sense to include orders within the customer aggregate (see Figure 2.4 in the
reference).
o If individual orders are accessed independently of customers, it's better to keep
orders and customers as separate aggregates.
Example of a Combined Customer and Order Aggregate
In some cases, you might embed all the customer's orders within the customer aggregate.
Here’s how that data might look in JSON:
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}
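To illustrate how this choice of boundary shapes access, here is a minimal Python sketch that treats a dictionary as a store of combined customer aggregates. The helper names, order 100, and product 31 are hypothetical additions for the example.

```python
# Combined aggregates keyed by customer ID (trimmed to the relevant fields).
customers = {
    1: {
        "id": 1,
        "name": "Martin",
        "orders": [
            {"id": 99, "orderItems": [{"productId": 27, "price": 32.45}]},
            {"id": 100, "orderItems": [{"productId": 31, "price": 10.00}]},
        ],
    },
}


def orders_for_customer(customer_id):
    # One aggregate lookup returns the customer and every order.
    return customers[customer_id]["orders"]


def find_order(order_id):
    # Orders have no key of their own, so finding one by ID means
    # scanning inside every customer aggregate.
    for customer in customers.values():
        for order in customer["orders"]:
            if order["id"] == order_id:
                return order
    return None


print([order["id"] for order in orders_for_customer(1)])
print(find_order(100))
```

Fetching all orders for a customer is a single lookup here, whereas fetching one order independently is awkward, which is why separate aggregates suit applications that access orders on their own.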
Consequences of Aggregate Orientation
Relational databases manage data elements and relationships, but they lack an inherent
understanding of aggregate entities. In a real-world application such as customer ordering,
an order's items, shipping address, and payment logically form a single aggregate.
Relational databases represent these relationships using foreign keys but draw no
distinction between aggregation and non-aggregation relationships. As a result,
relational databases can't use aggregate structure to optimize data storage or
distribution.
Challenges with Aggregates:
Relational and Aggregate-Ignorant Models: Relational databases are "aggregate-ignorant,"
meaning they don't recognize or optimize for aggregate structures. NoSQL
graph databases share this characteristic. Aggregate ignorance is not a flaw: it
leaves the data open to being viewed from many perspectives, which matters when no
single aggregate boundary suits every use. An aggregate-oriented structure helps some
interactions but hinders others. Grouping data by order, for example, suits order
processing but makes it harder to analyze product sales across orders.
Cluster Considerations: Aggregate orientation becomes crucial when running
databases on a cluster, as is common with NoSQL systems. Defining aggregates helps
determine which pieces of data should be stored together on the same node,
minimizing the number of nodes that need to be queried and thus improving
efficiency.
Transactions: Aggregate-oriented databases generally support ACID (Atomic, Consistent,
Isolated, Durable) transactions only within a single aggregate. This can be seen as a
drawback compared to relational databases, which allow ACID transactions across
multiple rows and tables: an update that spans several aggregates is not atomic, so
keeping it consistent is left to application logic. In contrast, aggregate-ignorant
databases such as graph databases usually support ACID transactions, much as relational
databases do.
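The product-sales case mentioned in the first point above can be sketched in Python. The orders dictionary stands in for a store of order aggregates; order 100, its product, and the sales_by_product helper are hypothetical.

```python
from collections import defaultdict

# Order aggregates keyed by order ID (trimmed to the relevant fields).
orders = {
    99: {
        "customerId": 1,
        "orderItems": [
            {"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"},
        ],
    },
    100: {
        "customerId": 1,
        "orderItems": [
            {"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"},
            {"productId": 31, "price": 10.00, "productName": "Hypothetical Title"},
        ],
    },
}


def sales_by_product(orders):
    """Total revenue per product, found by opening every order aggregate."""
    totals = defaultdict(float)
    for order in orders.values():
        for item in order["orderItems"]:
            totals[item["productId"]] += item["price"]
    return dict(totals)


totals = sales_by_product(orders)
print(totals)
```

Nothing in the store is organized by product, so the analysis has to read every order in full. On a cluster that means touching every node that holds orders, which is the cost of choosing the order as the aggregate boundary.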
3.1. Relationships
Aggregates are beneficial for grouping data frequently accessed together, but different
applications may require various access patterns. For instance, while some applications might
prefer to combine customer information with their order history into a single aggregate,
others may treat orders as independent entities.
In cases where you want separate customer and order aggregates, it’s essential to establish a
relationship between them. A straightforward method is to embed the customer ID within the
order aggregate. This allows you to fetch the customer data by referencing the ID from the
order. However, this approach does not inform the database of the underlying relationship,
which can limit its ability to optimize queries or manage data effectively.
To address this, many databases, even key-value stores, offer ways to make relationships
visible to the database. Document stores expose aggregate contents to the database so that
it can build indexes and answer queries on them. Riak, a key-value store, allows link
information to be placed in metadata, which supports partial retrieval and link-walking.
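The simple ID-based relationship described at the start of this section can be sketched in Python. The dictionary stands in for a key-value store, and the "customer:1" / "order:99" key scheme and the customer_for_order helper are assumptions made for illustration; this does not model Riak's link mechanism.

```python
# A toy key-value store holding separate customer and order aggregates.
store = {
    "customer:1": {"id": 1, "name": "Martin"},
    "order:99": {"id": 99, "customerId": 1, "orderItems": [{"productId": 27}]},
}


def customer_for_order(store, order_key):
    """Follow the customerId embedded in an order to its customer aggregate."""
    order = store[order_key]  # first read: the order aggregate
    customer_key = f"customer:{order['customerId']}"
    return store[customer_key]  # second read: the customer aggregate


print(customer_for_order(store, "order:99"))
```

The store sees only two unrelated reads. The knowledge that customerId points at another aggregate lives entirely in application code, which is why the database cannot use the relationship to optimize anything.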