Module 1 Nosql Notes
MODULE 1
Why NoSQL?
For many years, relational databases have been the go-to solution for serious data
storage, especially in large-scale, enterprise-level applications. When a software
architect begins a new project, the decision typically revolves around which relational
database to use, as they are so well-established and trusted in the industry. In some
cases, the choice might not even be yours to make if your company has already
committed to a particular database vendor.
However, the recent enthusiasm around NoSQL databases has come as a surprise,
since relational databases seemed unshakable. The rise of NoSQL is not just a
temporary trend. NoSQL databases bring a new set of benefits, such as flexibility,
scalability, and the ability to handle large volumes of unstructured data, which appeal
to modern application needs. As a result, they are gaining popularity in ways that
suggest their significance will last.
Relational databases have become such an embedded part of our computing culture
that it’s easy to take them for granted. It’s therefore useful to revisit the benefits they
provide.
One of the primary benefits of using a database is its ability to manage large amounts
of persistent data efficiently. Most computer systems have two types of memory:
1. Main memory (RAM) – This is fast but volatile, meaning it only holds data
temporarily. When the system shuts down or loses power, all data in main
memory is lost.
2. Backing store – This is typically a larger, slower storage medium (often a
disk), where data can be stored permanently, even when the system loses
power.
While main memory is limited and temporary, the backing store (like a hard drive
or persistent memory) provides long-term storage. This ensures that important data is
not lost when the system shuts down unexpectedly.
For most enterprise applications, the backing store is a database rather than a set of
plain files, because databases offer more flexibility and efficiency than the traditional
file system. They are designed to handle vast amounts of data, while still allowing
applications to retrieve specific pieces of information quickly and efficiently. Instead
of reading through an entire file to find what you need, a database enables you to
access just the relevant bits of data with ease, making it an essential tool for handling
large and complex datasets.
1.1.2. Concurrency
Enterprise applications often involve multiple users accessing and modifying the same
data simultaneously. This can lead to complex scenarios where users may be working
on different parts of the data, but occasionally, they might interact with the same data
point. For example, two users might attempt to book the same hotel room at the same
time, leading to a double booking scenario. To avoid such conflicts and ensure data
integrity, it’s essential to coordinate these interactions effectively.
Concurrency Challenges
Atomicity: Transactions ensure that all operations within the transaction are
completed successfully; if any part of the transaction fails, the entire
transaction is rolled back, and the database is returned to its previous state.
This prevents partial updates that could lead to data inconsistencies.
Isolation: Transactions are isolated from each other, meaning that the
operations of one transaction are not visible to other transactions until the first
transaction is completed. This isolation helps prevent conflicts between
concurrent operations. For instance, if one user is booking a room while
another is checking its availability, the second user will not see the changes
made by the first until the booking is finalized.
Error Handling
If an error occurs while a change is being processed, for example when a
user tries to book a room that has just been reserved by someone else, the database
can automatically roll back the transaction. This rollback restores the database to its
previous state, effectively "cleaning up" any changes that were not successfully
completed. This feature provides a safety net, allowing applications to handle errors
gracefully without compromising data integrity.
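The rollback behaviour described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production booking system: the table names (bookings, audit_log) and the book_room helper are assumptions made up for this example.

```python
import sqlite3

def book_room(conn, room_id, guest):
    """Try to book a room; return True on success, False if it is already taken."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "INSERT INTO audit_log (room_id, guest) VALUES (?, ?)",
                (room_id, guest),
            )
            # The PRIMARY KEY on room_id rejects a second booking of the same room.
            conn.execute(
                "INSERT INTO bookings (room_id, guest) VALUES (?, ?)",
                (room_id, guest),
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (room_id INTEGER PRIMARY KEY, guest TEXT)")
conn.execute("CREATE TABLE audit_log (room_id INTEGER, guest TEXT)")
conn.commit()

first = book_room(conn, 101, "alice")   # succeeds
second = book_room(conn, 101, "bob")    # fails, and its audit row is rolled back
log_rows = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
```

The failed second booking leaves no trace in audit_log, because both inserts belong to the same transaction and are undone together.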
1.1.3. Integration
Inter-Application Collaboration
When multiple applications need to access and modify the same data, it can lead to
challenges. These applications may belong to different teams, each with its own
priorities, coding standards, and deployment cycles. As a result, achieving seamless
communication and data consistency between them can be awkward, as it pushes
against the boundaries of human organization and processes.
Ease of Data Access: By using the same database, applications can easily
query and manipulate each other's data without the need for complex
integration layers or middleware. For example, if Application A updates
Relational databases have achieved significant success and widespread adoption for
several key reasons, primarily rooted in their ability to deliver essential benefits
consistently across various implementations. Here’s a breakdown of why relational
databases have thrived in the software development landscape:
Structured Query Language (SQL): SQL is the standard language used for
querying and manipulating data in relational databases. Its standardized nature
allows developers to learn a common language that can be applied across
various database systems. While there are minor dialect differences between
vendors (e.g., MySQL, PostgreSQL, SQL Server, Oracle), the fundamental
SQL syntax and commands (such as SELECT, INSERT, UPDATE, and
DELETE) remain largely consistent.
Established Best Practices: Over decades of usage, best practices and design
patterns have emerged for working with relational databases. This allows
developers to leverage tried-and-true methodologies for tasks like
normalization, indexing, and query optimization.
Relational databases offer numerous advantages; however, they are not without their
shortcomings. From their inception, various frustrations have arisen regarding their
use.
While this relational foundation brings a level of elegance and simplicity, it also
introduces certain limitations. Specifically, the values within a relational tuple must
be simple; they cannot encompass any structured data, such as nested records or lists.
In contrast, in-memory data structures can accommodate far more complex
arrangements. Consequently, when developers need to utilize these richer in-memory
structures, they must convert them into a relational format suitable for storage on disk,
giving rise to the impedance mismatch—essentially two distinct representations
requiring translation (see Figure 1.1).
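A small sketch may make the impedance mismatch concrete. It assumes a hypothetical order structure and two hypothetical tables (orders and order_items); the point is only that one nested in-memory object has to be split into flat tuples for storage and reassembled on the way back.

```python
# In-memory representation: a single nested structure.
order = {
    "id": 99,
    "customer_id": 1,
    "items": [
        {"product_id": 27, "price": 32.45},
        {"product_id": 31, "price": 10.00},
    ],
}

def to_relational_rows(order):
    """Flatten a nested order into simple tuples for two tables."""
    order_row = (order["id"], order["customer_id"])
    item_rows = [
        (order["id"], item["product_id"], item["price"])
        for item in order["items"]
    ]
    return order_row, item_rows

def from_relational_rows(order_row, item_rows):
    """Rebuild the nested structure from the flat tuples."""
    order_id, customer_id = order_row
    return {
        "id": order_id,
        "customer_id": customer_id,
        "items": [
            {"product_id": product_id, "price": price}
            for (row_order_id, product_id, price) in item_rows
            if row_order_id == order_id
        ],
    }

order_row, item_rows = to_relational_rows(order)
rebuilt = from_relational_rows(order_row, item_rows)
```

Every read and write of the order pays for this translation, which is the work that object-relational mapping frameworks try to automate.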
This impedance mismatch has been a significant source of frustration for application
developers. In the 1990s, many anticipated that it would lead to the decline of
relational databases in favor of systems designed to directly replicate in-memory
structures on disk. This period saw the rise of object-oriented programming languages
and, subsequently, object-oriented databases, both vying to establish dominance in the
software development landscape.
The reasons behind the dominance of relational databases over object-oriented (OO)
databases remain a topic of debate among seasoned developers. However, we believe
that a key factor in this success is SQL's role as an integration mechanism between
applications. In this model, the database serves as a central integration point, allowing
multiple applications—typically developed by different teams—to store their data in a
shared database. This approach facilitates improved communication, as all
applications operate on a consistent set of persistent data.
Despite its advantages, shared database integration has notable downsides. A structure
designed to accommodate multiple applications often becomes significantly more
complex than what any single application might require. Additionally, if an
application needs to modify its data storage, it must coordinate with all other
applications utilizing the database. Since different applications have varying structural
and performance needs, an index that benefits one application may negatively impact
the performance of another. Furthermore, because each application is usually
managed by a separate team, the database cannot fully trust these applications to
update data in a manner that maintains database integrity, necessitating that the
database itself enforce these integrity constraints.
One intriguing aspect of this transition to web services is that it fosters greater
flexibility in the data structure exchanged between systems. When using SQL, data
must conform to a relational format. However, web services enable the use of richer
data structures that can include nested records and lists, typically represented as
documents in XML or, more recently, JSON. This capability is particularly
advantageous for remote communication, where reducing the number of interactions
is crucial. By allowing a comprehensive structure of information to be encapsulated in
a single request or response, web services enhance efficiency.
While web services—particularly those using text over HTTP—are the go-to choice
for most integration scenarios, situations that demand high performance may
necessitate the use of binary protocols. However, this should be approached
cautiously; text protocols are generally easier to implement and maintain, as
evidenced by their prevalent use on the Internet.
Once the decision is made to adopt an application database, teams gain greater
freedom in selecting their database technologies. The decoupling of the internal
database from external services means that external stakeholders need not concern
themselves with how data is stored. This flexibility opens the door to considering non-
relational database options. Additionally, many features typical of relational databases,
such as advanced security measures, may be less relevant to an application database
since these functionalities can often be handled by the encompassing application itself.
Despite these advantages, the anticipated rush towards alternative data stores did not
materialize. Most teams that embraced the application database approach continued to
rely on relational databases. This is largely due to the familiarity and reliability of
relational systems; they often perform well, or at least adequately, for most use cases.
It's possible that, given more time, the shift toward application databases could have
begun to undermine the stronghold of relational databases. However, the cracks in this
dominance emerged from other sources, leading to the rise of NoSQL databases and
other alternatives that addressed specific limitations of relational models.
At the beginning of the new millennium, the technology sector experienced the fallout
from the 1990s dot-com bubble burst. This event led many to question the economic
viability of the Internet. However, the 2000s also saw several large web properties
dramatically scale their operations.
This scaling occurred across various dimensions. Websites began to track user activity
and structural data in unprecedented detail, resulting in massive datasets
encompassing links, social networks, log activities, and mapping data. As this data
volume surged, so did the user base, with major websites transforming into vast
digital landscapes that regularly served millions of visitors.
To manage this exponential increase in data and traffic, organizations faced a choice
between two scaling strategies: vertical scaling (scaling up) or horizontal scaling
(scaling out). Vertical scaling involves upgrading to larger machines with more
processors, disk storage, and memory. However, as machine sizes increase, costs rise
significantly, and practical limits soon become apparent. Conversely, horizontal
scaling utilizes clusters of smaller, commodity hardware machines, which tend to be
more cost-effective and resilient. While individual machine failures are common, a
well-structured cluster can continue operating, offering high reliability even in the
face of hardware issues.
Critics often argue that the scales at which Amazon and Google operate are far
removed from those of most organizations, suggesting that their solutions may not be
applicable to the average business. While it is true that many software projects do not
require such immense scalability, an increasing number of organizations are
beginning to explore the possibilities of capturing and processing larger datasets, thus
encountering similar challenges.
The term "NoSQL" is laden with irony; it first emerged in the late 1990s as the name
of an open-source relational database developed by Carlo Strozzi. This early NoSQL
database distinguished itself by not using SQL as its query language. Instead, it
manipulated data through shell scripts, with tables stored as ASCII files, where each
tuple was represented as a line with tab-separated fields. Aside from this semantic
coincidence, Strozzi's NoSQL did not influence the contemporary NoSQL databases
that we discuss today.
The modern usage of "NoSQL" can be traced back to a meetup on June 11, 2009, in
San Francisco, organized by Johan Oskarsson, a software developer from London.
Inspired by Google's BigTable and Amazon's Dynamo, a growing number of projects
In naming the meetup, Johan aimed for a catchy, memorable title that would work
well as a Twitter hashtag. After soliciting suggestions on the #cassandra IRC channel,
he chose "NoSQL," a suggestion from Eric Evans, a developer at Rackspace.
Although the term had a somewhat negative connotation and did not accurately
describe the systems being discussed, it met Johan's criteria for brevity and
uniqueness. Initially, they only intended it for a single event, unaware that it would
evolve into a broader technological movement.
The term "NoSQL" quickly gained popularity, though it has never been firmly
defined. The original call for the meetup sought "open-source, distributed, non-
relational databases," with presentations featuring projects like Voldemort, Cassandra,
Dynomite, HBase, Hypertable, CouchDB, and MongoDB. However, NoSQL is not
confined to this original set, and no consensus exists on its definition. Instead, it's
useful to consider common characteristics of databases commonly referred to as
NoSQL.
Lack of SQL: As the name suggests, NoSQL databases do not primarily use
SQL for querying. While some databases, like Cassandra, offer query
languages that resemble SQL—such as Cassandra Query Language (CQL)—
none fully adhere to standard SQL definitions.
NoSQL databases offer various options for consistency and distribution that
align with a clustered environment.
When people first hear the term "NoSQL," they often wonder what it signifies. Many
proponents of NoSQL argue that it should be interpreted as "Not Only SQL,"
suggesting a broader context for database capabilities. However, this interpretation
has its complications. Most people use "NoSQL" as a single term, while "Not Only
SQL" would logically be abbreviated as "NOSQL." If we adopt the "not only"
definition, then traditional relational databases like Oracle or PostgreSQL could also
fit into that category, leading to confusion.
To avoid these pitfalls, it's advisable to focus on the implications of the term
"NoSQL" rather than fixating on its literal meaning. Thus, "NoSQL" encompasses a
loosely defined set of mostly open-source databases that emerged primarily in the
early 21st century and generally do not use SQL.
The "not only" interpretation has merit, particularly as it reflects the evolving
ecosystem many see as the future of databases. This viewpoint emphasizes that
Lakshmi Durga.N Dept of DS, SVIT 15
MODULE 1 NOSQL DATABASE
Polyglot Persistence
This shift has led to a concept known as polyglot persistence, which advocates for
using different data storage solutions depending on the specific requirements of
various applications and datasets. Instead of defaulting to a relational database simply
because it’s the standard, organizations are encouraged to evaluate the nature of their
data and how they wish to manipulate it. Consequently, most organizations now
employ a mix of data storage technologies tailored to different scenarios.
Scaling Needs: The first reason is to handle data access with sizes and
performance requirements that necessitate a clustered architecture. NoSQL
databases are designed to efficiently manage large volumes of data in
distributed environments.
As you explore the contents of the book, keep these two key reasons in mind. They
provide valuable insights into why NoSQL databases are gaining traction in
contemporary data management discussions.
• Relational databases have been a successful technology for twenty years, providing
persistence, concurrency control, and an integration mechanism.
• Application developers have been frustrated with the impedance mismatch between
the relational model and the in-memory data structures.
• The vital factor for a change in data storage was the need to support large volumes
of data by running on clusters. Relational databases are not designed to run efficiently
on clusters.
• Open-source
• Schemaless
The concept of a data model is central to how we perceive, interact with, and
manipulate data in a database system. It provides a framework for organizing and
structuring data in a way that is meaningful and accessible to users and applications.
This differs from a storage model, which deals with the underlying mechanics of how
data is stored and managed internally by the database system. Ideally, users should
not need to concern themselves with the storage model; however, understanding it can
be crucial for optimizing performance and ensuring efficient data retrieval.
In everyday conversation, the term “data model” often refers to the specific structure
of data within an application. For instance, a developer might showcase an entity-
relationship diagram representing their database's structure, detailing entities such as
customers, orders, and products. However, in this context, the term will primarily
denote the overarching model by which a database organizes its data, also known as a
metamodel.
For several decades, the relational data model has been the dominant paradigm in
database design. It can be visualized as a collection of tables, akin to spreadsheets,
where:
enabling users to perform operations and retrieve data based on a straightforward set
of tuples.
One of the most significant shifts with the emergence of NoSQL databases is the
move away from the rigid structure of the relational model. Each NoSQL solution
employs its own data model, which can be broadly categorized into four types:
key-value, document, column-family, and graph.
Aggregate Orientation
Aggregates
The relational model organizes data into tuples, which are relatively limited data
structures. Tuples cannot easily support complex data types like nested records or lists
of values. In contrast, aggregate orientation recognizes that often, data needs to be
handled in more complex units, referred to as aggregates.
Natural Unit for Clustering: Aggregates provide a natural unit for processes
like replication and sharding in clustered database environments. When data is
grouped into aggregates, it becomes easier to distribute and manage across
multiple nodes.
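As a rough sketch of why an aggregate is a convenient unit for sharding, the following places each whole aggregate on one node by hashing its key. The node names and the node_for and place helpers are assumptions for illustration; real databases use more elaborate schemes that also handle rebalancing and replication.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def node_for(aggregate_key):
    """Pick a node for an aggregate by hashing its key (stable across runs)."""
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

def place(aggregates):
    """Distribute aggregates so that each one lives entirely on a single node."""
    cluster = {node: {} for node in NODES}
    for key, aggregate in aggregates.items():
        cluster[node_for(key)][key] = aggregate
    return cluster

aggregates = {
    "customer:1": {"name": "Martin", "orders": [99]},
    "customer:2": {"name": "Pramod", "orders": [100, 101]},
    "customer:3": {"name": "Anna", "orders": []},
}
placement = place(aggregates)
```

Because the customer and its orders travel together, reading one customer touches exactly one node.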
Figure 2.1. Data model oriented around a relational database (using UML
notation [Fowler UML] )
The relational model organizes data into tables, with each table
consisting of rows (tuples) and columns (attributes).
The data model is normalized, ensuring that there are no duplicate
data entries across tables and maintaining referential integrity.
Example: In the relational model, the data could be structured into separate
tables for customers, orders, products, and addresses. This allows for efficient
storage and retrieval but may lead to complex queries involving multiple
tables.
Pros:
Cons:
json
// Customer
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
// Order
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}
Key Features:
Pros:
Cons:
Aggregate Boundaries
Figure 2.4. Embed all the objects for customer and the customer’s orders
Example of Aggregates:
json
// Customer with embedded orders
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}
Aggregate-Ignorant Databases
Relational databases, along with some NoSQL databases like graph databases, are
termed "aggregate-ignorant" because they do not have a built-in understanding of
For example:
The main advantage of aggregate orientation comes into play when working with
clustered systems—a common scenario in NoSQL environments. By explicitly
defining aggregates, we inform the database which pieces of data are likely to be
manipulated together, enabling better data distribution across nodes. This minimizes
the number of nodes queried for data retrieval, enhancing performance.
In contrast, it’s often asserted that NoSQL databases lack ACID transactions, which
suggests a compromise on consistency. This statement, however, oversimplifies the
reality. While aggregate-oriented databases typically do not support ACID
transactions that span multiple aggregates, they can perform atomic operations on
individual aggregates.
Lastly, it's essential to note that while graph and other aggregate-ignorant databases
may not have a formal aggregate structure, they usually still support ACID
transactions similar to those found in relational databases. The discussion around
consistency in databases is complex and extends beyond just whether a database is
ACID-compliant.
Key-Value and Document Databases are two popular types of NoSQL databases,
and both are designed around the concept of aggregates. An aggregate is a collection
of related data that is treated as a single unit, often identified by a unique key or ID.
Here’s how they compare:
Key-Value Databases
Data Access:
Flexibility:
1. You have a lot of freedom regarding what you store in the database,
making it a good choice for scenarios where data formats might vary
widely or where strict schema enforcement isn't necessary.
Document Databases
Data Access:
1. You can perform more sophisticated queries based on the fields within
the document. This means you can not only retrieve an entire
document using its ID but also search for documents based on specific
field values.
2. Document databases allow partial retrieval, meaning you can fetch
only the parts of the document you need instead of the whole aggregate.
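The difference in data access between the two models can be sketched with two toy in-memory classes. KeyValueStore and DocumentStore here are hypothetical names for this illustration and do not correspond to any real product's API.

```python
import json

class KeyValueStore:
    """The value is an opaque blob: the store can only look it up by key."""
    def __init__(self):
        self._data = {}

    def put(self, key, blob):
        self._data[key] = blob

    def get(self, key):
        return self._data[key]

class DocumentStore:
    """The store can see inside each document, so it can query and project."""
    def __init__(self):
        self._docs = {}

    def put(self, doc_id, doc):
        self._docs[doc_id] = doc

    def get(self, doc_id, fields=None):
        doc = self._docs[doc_id]
        if fields is None:
            return dict(doc)
        # Partial retrieval: return only the requested fields.
        return {name: doc[name] for name in fields if name in doc}

    def find(self, field, value):
        # Query by field value instead of by ID.
        return [doc_id for doc_id, doc in self._docs.items() if doc.get(field) == value]

order = {"customerId": 1, "shippingCity": "Chicago", "total": 32.45}

kv = KeyValueStore()
kv.put("order:99", json.dumps(order).encode("utf-8"))
blob = kv.get("order:99")            # the application must decode the blob itself

docs = DocumentStore()
docs.put(99, order)
partial = docs.get(99, fields=["shippingCity"])
matches = docs.find("customerId", 1)
```

With the key-value store, any filtering on customerId would have to happen in the application after fetching and decoding every blob.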
Indexing:
While the distinction between key-value and document databases is clear in theory, in
practice, it often becomes blurred. Here are a few reasons:
Key-Value Stores with Structure: Some key-value stores (like Riak or Redis)
allow for additional structures or metadata, enabling features such as indexing
or the ability to break down aggregates into smaller components (like lists or
sets). For example, Riak allows you to attach metadata to aggregates for better
indexing.
Key Concepts
Column Families:
1. Skinny Rows: These have a small number of columns, where the same
columns are used across multiple rows. This structure resembles
traditional records, with the column family defining the record type
and each column acting as a field.
2. Wide Rows: These contain a large number of columns, potentially
thousands. Each row can have a different set of columns, which makes
it suitable for modeling lists (e.g., an order with multiple items where
each item is represented as a separate column).
In wide column families, there is often a defined sort order for columns. This
is especially beneficial when accessing data by keys that are concatenated (e.g.,
date and ID). This allows you to efficiently retrieve ranges of data based on
sorted keys, which can enhance query performance for specific access patterns.
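The benefit of sorted columns can be sketched as follows. The WideRow class and the "date:orderId" column-name format are assumptions for this example; the idea is that keeping column names sorted turns a date-range query into a contiguous slice.

```python
from bisect import bisect_left, insort

class WideRow:
    """One row whose columns are kept sorted by column name."""
    def __init__(self):
        self._names = []    # column names in sorted order
        self._values = {}   # column name -> value

    def put(self, name, value):
        if name not in self._values:
            insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        """Return columns with start <= name < end, found by binary search."""
        lo = bisect_left(self._names, start)
        hi = bisect_left(self._names, end)
        return [(name, self._values[name]) for name in self._names[lo:hi]]

# Column names concatenate a date and an order ID, so sorting by name sorts by date.
customer_orders = WideRow()
customer_orders.put("2024-02-02:0897", {"total": 12.00})
customer_orders.put("2024-01-05:0099", {"total": 32.45})
customer_orders.put("2023-12-30:0098", {"total": 5.50})
customer_orders.put("2024-01-20:0545", {"total": 80.10})

january = customer_orders.slice("2024-01-01", "2024-02-01")
```

Here january holds the two January orders in date order, without scanning the other columns.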
To summarize the three styles of aggregate-oriented data models (key-value stores,
document stores, and column-family databases): all three revolve around the concept
of aggregates indexed by keys, but they differ significantly in structure and
functionality, and in particular in how they handle these aggregates:
Key-value stores treat aggregates as opaque blobs with limited access and no
query capabilities.
Document stores provide a more transparent view of aggregates, enabling
queries and partial retrievals, but with less optimization due to their lack of
schema.
Column-family databases offer a structured approach to aggregates, allowing
the database to leverage this structure for improved access and performance.
Understanding these differences is essential for choosing the right data model based
on the specific needs and access patterns of your application.
2.5. Further Reading
For more on the general concept of aggregates, which are often used with relational
databases too, see [Evans]. The Domain-Driven Design community is the best source
for further information about aggregates; recent information usually appears at
http://domaindrivendesign.org.
• Aggregates make it easier for the database to manage data storage over clusters.
• Aggregate-oriented databases work best when most data interaction is done with the
same aggregate; aggregate-ignorant databases are better when interactions use data
organized in many different formations.
So far we’ve covered the key feature in most NoSQL databases: their use of
aggregates and how aggregate-oriented databases model aggregates in different ways.
While aggregates are a central part of the NoSQL story, there is more to the data
modeling side than that, and we’ll explore these further concepts in this chapter.
3.1. Relationships
Aggregates:
Independent Aggregates:
Linking Aggregates:
customer data by first retrieving the order and then looking up the
customer using the embedded ID.
2. However, while this approach is functional, it means the database is
unaware of the relationship between the aggregates, which can limit its
ability to optimize queries or enforce integrity.
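A minimal sketch of linking aggregates by ID, using plain dictionaries as stand-ins for two collections of aggregates. The data and the customer_for_order helper are hypothetical; note that nothing stops an order from pointing at a customer that does not exist, because the store knows nothing about the link.

```python
customers = {
    1: {"name": "Martin", "billingCity": "Chicago"},
}
orders = {
    99: {"customerId": 1, "total": 32.45},
    100: {"customerId": 2, "total": 10.00},   # dangling reference: no customer 2
}

def customer_for_order(order_id):
    """Follow the embedded customer ID by hand: two lookups in application code."""
    order = orders[order_id]
    return customers.get(order["customerId"])  # None if the reference is dangling

found = customer_for_order(99)
missing = customer_for_order(100)
```

In a relational database a foreign-key constraint would have rejected order 100; here the application has to notice and handle the missing customer itself.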
1. Database Awareness:
Graph databases have unique characteristics that distinguish them from other NoSQL
databases. The key points are:
Distinct Motivation:
Complex Interconnections:
Focus on Relationships:
Server Architecture:
In summary, graph databases are a distinct category within the NoSQL landscape, driven by the
need to efficiently manage complex relationships among small records. Their
structure—nodes connected by edges—enables powerful querying capabilities and
fast traversal of data. While they share some similarities with other NoSQL databases,
such as rejecting the relational model, their focus on relationships and different
operational characteristics set them apart, making them particularly useful for
applications involving intricate interconnections.
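The node-and-edge structure and the idea of traversal can be sketched with an adjacency list and a breadth-first search. The names and the "friend" relationship are made up for this example, and a real graph database would offer its own query language rather than hand-written traversal code.

```python
from collections import deque

# Each node maps to the nodes it has an outgoing "friend" edge to.
friends = {
    "Anna": ["Barbara", "Carol"],
    "Barbara": ["Dawn"],
    "Carol": ["Dawn"],
    "Dawn": [],
}

def reachable_within(start, max_hops):
    """Return nodes reachable from start in at most max_hops edges, nearest first."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in friends.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                frontier.append((neighbour, depth + 1))
    return found

direct = reachable_within("Anna", 1)
two_hops = reachable_within("Anna", 2)
```

Expressing the same "friends of friends" question in a relational schema would need one join per hop, which is why deep traversals are the typical motivation for a graph model.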
Definition of Schemalessness:
Unlike relational databases, which
necessitate the definition of tables, columns, and data types before any data can be
stored, NoSQL databases allow for a more flexible approach.
Advantages of Schemalessness:
Implicit Schema:
1. When multiple applications interact with the same database, each may
have different implicit schemas. This inconsistency can lead to
problems if not managed properly.
Schema Flexibility:
more straightforward, but still requires planning to access both old and
new data effectively.
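One way to picture the implicit schema is a reader that must cope with both an old and a new document shape. The field names (name, firstName, lastName) and the display_name helper are assumptions chosen only for illustration.

```python
def display_name(doc):
    """Read a customer's name from either the old or the new document shape."""
    if "firstName" in doc and "lastName" in doc:
        # New shape: the name is split into two fields.
        return doc["firstName"] + " " + doc["lastName"]
    # Old shape: a single "name" field.
    return doc["name"]

old_doc = {"id": 1, "name": "Martin Example"}
new_doc = {"id": 2, "firstName": "Anna", "lastName": "Example"}

names = [display_name(old_doc), display_name(new_doc)]
```

The database accepts both documents without complaint; the knowledge of which shapes exist lives entirely in code such as display_name, which is what makes the schema implicit.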
Aggregate Boundaries:
While the schemaless nature of NoSQL databases offers significant flexibility and the
ability to handle evolving data needs, it introduces challenges, particularly regarding
implicit schemas and data consistency. The advantages of relational databases,
including their ability to enforce schemas and facilitate controlled changes, highlight
the trade-offs involved when choosing between NoSQL and traditional relational
approaches. Understanding these differences is crucial for developers and
organizations when deciding on the best data management strategy for their
applications.
1. Unit of Access:
1. Aggregate-oriented data models group related data into a single unit
(or aggregate). For example, all information related to an order is
stored together, making it efficient to access everything about that
order at once.
2. This design is beneficial for operations that require accessing all data
associated with an aggregate, such as retrieving a full order with all its
details.
1. Relational databases allow for flexible data access due to their lack of
aggregate structure. They support various ways to query data, enabling
users to obtain information from different perspectives without the
need to reformat the underlying data storage.
Views:
Materialized Views:
NoSQL Equivalent:
1. Within Aggregates:
1. Materialized views can also exist within the same aggregate. For
instance, an order document might include a summary of the order,
enabling quick access to summary information without needing to
transfer the entire order document.
2. Column Families:
There are various strategies for modeling data aggregates across different types of
NoSQL databases. The following discussion uses customer and order data to walk
through the key concepts and considerations:
Figure 3.2. Embed all the objects for customer and their orders.
When designing data models, particularly for aggregates like customers and their
orders, it’s crucial to consider:
1. How Data Will Be Read: The access patterns and queries that will be
performed on the data.
2. Side Effects on Data: The implications for related data when modifications
occur, such as adding new orders.
Embedding Data:
Example:
json
{
  "customerId": 1,
  "customer": {
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "payment": [{"type": "debit", "ccinfo": "1000-1000-1000-1000"}],
    "orders": [{"orderId": 99}]
  }
}
Accessing Data:
1. This model allows for easy access to customer information using a key.
However, when querying specific orders or products, the entire
customer object must be read and parsed on the client side, which can
be inefficient.
Using References
json
{
  "customerId": 1,
  "orders": [{"orderId": 99}]
}
Benefits of References:
json
{
  "itemid": 27,
  "orders": [99, 545, 897, 678]
}
Document Stores
Query Flexibility:
1. In document stores, one can query inside documents, which allows for
the removal of references to orders from the customer object. This
means the customer object does not need to be updated with new
orders, simplifying data management.
json
{
  "customerId": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
Attribute-Based Searching:
1. Document stores enable searches like “find all orders that include the
Refactoring Databases product,” emphasizing that the modeling
decision is based on application requirements rather than just database
capabilities.
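An attribute-based search of this kind can be sketched over a list of order documents. The documents and the orders_with_product helper are hypothetical; a real document store would run such a query on the server, usually with the help of an index, rather than filtering in application code.

```python
orders = [
    {"orderId": 99, "customerId": 1,
     "orderItems": [{"productId": 27, "productName": "Refactoring Databases"}]},
    {"orderId": 545, "customerId": 2,
     "orderItems": [{"productId": 31, "productName": "NoSQL Distilled"}]},
    {"orderId": 897, "customerId": 1,
     "orderItems": [{"productId": 27, "productName": "Refactoring Databases"},
                    {"productId": 31, "productName": "NoSQL Distilled"}]},
]

def orders_with_product(order_docs, product_name):
    """Return the IDs of orders containing an item with the given product name."""
    return [
        doc["orderId"]
        for doc in order_docs
        if any(item["productName"] == product_name for item in doc["orderItems"])
    ]

hits = orders_with_product(orders, "Refactoring Databases")
```

Because the query looks inside each order, the customer document never needs a list of order references to answer it.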
Column-Family Stores
1. Column Ordering:
Graph Databases
Modeling Relationships:
Relationship Queries:
In summary, it is important to choose the right data modeling strategy based on how data will
be accessed and the relationships among different entities. Each NoSQL database type
(key-value stores, document stores, column-family stores, and graph databases) offers
unique advantages and trade-offs, influencing how data is structured and queried.
Understanding these models helps in optimizing data access patterns and supporting
efficient data management, especially for use cases like real-time analytics and
complex relationships.