Module 1 Nosql Notes

The document discusses the evolution and significance of NoSQL databases in contrast to traditional relational databases, highlighting their flexibility, scalability, and ability to manage unstructured data. It outlines the core benefits of relational databases, such as data integrity, transaction management, and established best practices, while also addressing challenges like impedance mismatch and concurrency issues. The text emphasizes the shift towards application databases and web services, which allow for more efficient data integration and communication between applications.

Uploaded by

prasadmaruthi272

MODULE 1 NOSQL DATABASE

MODULE 1

Why NoSQL?

For many years, relational databases have been the go-to solution for serious data
storage, especially in large-scale, enterprise-level applications. When a software
architect begins a new project, the decision typically revolves around which relational
database to use, as they are so well-established and trusted in the industry. In some
cases, the choice might not even be yours to make if your company has already
committed to a particular database vendor.

Throughout the years, other database technologies, such as object databases in the
1990s, have emerged, but they never gained enough traction to significantly
challenge the dominance of relational databases. This long-standing dominance has
made relational databases the default standard.

However, the recent enthusiasm around NoSQL databases has come as a surprise,
since relational databases seemed unshakable. The rise of NoSQL is not just a
temporary trend. NoSQL databases bring a new set of benefits, such as flexibility,
scalability, and the ability to handle large volumes of unstructured data, which appeal
to modern application needs. As a result, they are gaining popularity in ways that
suggest their significance will last.

1.1. The Value of Relational Databases

Relational databases have become such an embedded part of our computing culture
that it’s easy to take them for granted. It’s therefore useful to revisit the benefits they
provide.

1.1.1. Getting at Persistent Data

One of the primary benefits of using a database is its ability to manage large amounts
of persistent data efficiently. Most computer systems have two types of memory:

Lakshmi Durga.N Dept of DS, SVIT 1


1. Main memory (RAM) – This is fast but volatile, meaning it only holds data
temporarily. When the system shuts down or loses power, all data in main
memory is lost.
2. Backing store – This is typically a larger, slower storage medium (often a
disk), where data can be stored permanently, even when the system loses
power.

While main memory is limited and temporary, the backing store (like a hard drive
or persistent memory) provides long-term storage. This ensures that important data is
not lost when the system shuts down unexpectedly.

Backing storage can be organized in different ways. For example, productivity
applications (like word processors) store their data as files within the operating
system's file system. However, for enterprise applications, databases are the
preferred solution.

The reason is that databases offer more flexibility and efficiency than the traditional
file system. They are designed to handle vast amounts of data, while still allowing
applications to retrieve specific pieces of information quickly and efficiently. Instead
of reading through an entire file to find what you need, a database enables you to
access just the relevant bits of data with ease, making it an essential tool for handling
large and complex datasets.
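
The contrast between scanning a file and asking a database for one record can be
sketched as follows. This is a minimal illustration, assuming Python's built-in
sqlite3 module as a stand-in for an enterprise database; the customers table, the
file name, and the helper functions are hypothetical names chosen for the example.

```python
import os
import sqlite3
import tempfile


def find_in_file(path, wanted_id):
    """Scan a tab-separated file line by line until the id matches."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            cust_id, name = line.rstrip("\n").split("\t")
            if int(cust_id) == wanted_id:
                return name
    return None


def find_in_db(conn, wanted_id):
    """Ask the database for one row; the primary key lets it go straight there."""
    rows = conn.execute(
        "SELECT name FROM customers WHERE id = ?", (wanted_id,)
    ).fetchall()
    return rows[0][0] if rows else None


# Hypothetical sample data: 1000 customers.
records = [(i, f"customer-{i}") for i in range(1, 1001)]

with tempfile.TemporaryDirectory() as tmp:
    # Flat-file storage: every lookup reads from the top of the file.
    flat_path = os.path.join(tmp, "customers.tsv")
    with open(flat_path, "w", encoding="utf-8") as f:
        for cust_id, name in records:
            f.write(f"{cust_id}\t{name}\n")

    # Database storage: the same data, kept on disk and addressable by key.
    conn = sqlite3.connect(os.path.join(tmp, "customers.db"))
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", records)
    conn.commit()

    file_result = find_in_file(flat_path, 987)
    db_result = find_in_db(conn, 987)
    conn.close()
```

Both lookups return the same name; the difference is that the file version has to
read through earlier records first, while the database query names only the row it
wants.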

1.1.2. Concurrency

Enterprise applications often involve multiple users accessing and modifying the same
data simultaneously. This can lead to complex scenarios where users may be working
on different parts of the data, but occasionally, they might interact with the same data
point. For example, two users might attempt to book the same hotel room at the same
time, leading to a double booking scenario. To avoid such conflicts and ensure data
integrity, it’s essential to coordinate these interactions effectively.

Concurrency Challenges

Concurrency, the ability of multiple users or systems to operate on data at the same
time, presents significant challenges. It is notoriously difficult to get right, and
even skilled programmers introduce errors and inconsistencies. The complexities arise
because when multiple users try to access and modify the same data concurrently, the
system must ensure that these operations do not interfere with each other.

Role of Relational Databases

Relational databases address these concurrency challenges by controlling all access
to data through the concept of transactions. A transaction is a sequence of operations
performed as a single logical unit of work. Here’s how transactions help manage
concurrency:

Atomicity: Transactions ensure that all operations within the transaction are
completed successfully; if any part of the transaction fails, the entire
transaction is rolled back, and the database is returned to its previous state.
This prevents partial updates that could lead to data inconsistencies.

Isolation: Transactions are isolated from each other, meaning that the
operations of one transaction are not visible to other transactions until the first
transaction is completed (how strictly this holds depends on the isolation level
the database is configured to use). This isolation helps prevent conflicts between
concurrent operations. For instance, if one user is booking a room while
another is checking its availability, the second user will not see the changes
made by the first until the booking is finalized.

Consistency: Transactions help maintain the database's integrity by ensuring
that it moves from one valid state to another. Any constraints, such as unique
keys or foreign keys, are enforced during transactions, ensuring that data
remains consistent throughout its lifecycle.

Error Handling

In addition to managing concurrency, transactions play a critical role in error
handling. If an error occurs during the processing of a transaction, such as when a
user tries to book a room that has just been reserved by someone else, the database
can automatically roll back the transaction. This rollback restores the database to its
previous state, effectively "cleaning up" any changes that were not successfully
completed. This feature provides a safety net, allowing applications to handle errors
gracefully without compromising data integrity.
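
The hotel booking scenario can be sketched as a small transaction. This is an
illustrative example only, assuming Python's built-in sqlite3 module; the bookings
and payments tables and the book_room function are hypothetical. A booking writes
two rows, and a uniqueness constraint on (room, night) makes the second attempt
fail, at which point the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (room INTEGER, night TEXT, guest TEXT, "
    "UNIQUE (room, night))"
)
conn.execute("CREATE TABLE payments (guest TEXT, amount INTEGER)")
conn.commit()


def book_room(conn, room, night, guest, amount):
    """Record payment and booking as one unit; undo both if either fails."""
    try:
        # The connection context manager commits on success and rolls back
        # if an exception escapes the block.
        with conn:
            conn.execute("INSERT INTO payments VALUES (?, ?)", (guest, amount))
            conn.execute(
                "INSERT INTO bookings VALUES (?, ?, ?)", (room, night, guest)
            )
        return True
    except sqlite3.IntegrityError:
        # The room was already taken for that night; nothing was kept.
        return False


first = book_room(conn, 101, "2024-05-01", "alice", 120)
second = book_room(conn, 101, "2024-05-01", "bob", 120)

bookings = conn.execute("SELECT guest FROM bookings").fetchall()
payments = conn.execute("SELECT guest FROM payments").fetchall()
```

After the second call fails, bob's payment row is gone as well as his booking. That
is the atomicity property at work: the payment insert succeeded on its own, but it
was undone because the transaction as a whole did not complete.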

1.1.3. Integration

Enterprise applications often operate within a rich ecosystem, where different
applications, developed by various teams, must work together to achieve common
business objectives. This collaborative approach can introduce complexity,
particularly when it comes to data sharing and updates.

Inter-Application Collaboration

When multiple applications need to access and modify the same data, it can lead to
challenges. These applications may belong to different teams, each with its own
priorities, coding standards, and deployment cycles. As a result, achieving seamless
communication and data consistency between them can be awkward, as it pushes
against the boundaries of human organization and processes.

Shared Database Integration

One effective way to facilitate this inter-application collaboration is through shared
database integration. Here’s how it works and its benefits:

Centralized Data Storage: In shared database integration, multiple
applications share a single database for storing their data. This centralized
approach allows all applications to access a common pool of data, making it
easier to share information across different systems.

Ease of Data Access: By using the same database, applications can easily
query and manipulate each other's data without the need for complex
integration layers or middleware. For example, if Application A updates
customer information, Application B can immediately access the latest data
without waiting for synchronization or data replication processes.

Concurrency Control: Relational databases inherently manage concurrency
control, ensuring that multiple applications can operate simultaneously
without conflict. Just as the database handles multiple users accessing the
same data within a single application, it applies the same principles when
multiple applications interact with the same data. This means that the database
ensures that updates are atomic, isolated, and consistent, helping to maintain
data integrity.
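
The points above can be illustrated with two connections to one database file,
each standing in for a separate application. This is a rough sketch assuming
Python's built-in sqlite3 module; "Application A", "Application B", and the
customers table are hypothetical, and a real deployment would use a database
server rather than a local file.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "shared.db")

    # "Application A": creates and updates customer records.
    app_a = sqlite3.connect(db_path)
    app_a.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
    app_a.execute("INSERT INTO customers VALUES (1, 'old@example.com')")
    app_a.commit()

    # "Application B": a separate connection, as a separate program would hold.
    app_b = sqlite3.connect(db_path)
    before = app_b.execute(
        "SELECT email FROM customers WHERE id = 1"
    ).fetchall()[0][0]

    # A commits an update; no message passing or sync job is involved.
    app_a.execute("UPDATE customers SET email = 'new@example.com' WHERE id = 1")
    app_a.commit()

    # B's next query sees the committed value directly from the shared store.
    after = app_b.execute(
        "SELECT email FROM customers WHERE id = 1"
    ).fetchall()[0][0]

    app_a.close()
    app_b.close()
```

Neither application knows about the other; the database is the only thing they
share, which is what makes this style of integration simple to set up.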

Advantages of Shared Database Integration

Simplicity: Using a shared database reduces the complexity of data integration.
Applications do not need to implement their own data synchronization
mechanisms, as the database manages this aspect inherently.

Real-Time Data Visibility: Updates made by one application become
immediately visible to others, enabling real-time data sharing. This is
particularly important in environments where timely access to data is critical
for decision-making.

Reduced Overhead: By centralizing data storage, organizations can minimize
the resources required for maintaining multiple data stores, reducing both
costs and administrative overhead.

1.1.4. A (Mostly) Standard Model

Relational databases have achieved significant success and widespread adoption for
several key reasons, primarily rooted in their ability to deliver essential benefits
consistently across various implementations. Here’s a breakdown of why relational
databases have thrived in the software development landscape:

Core Benefits of Relational Databases

Data Integrity: Relational databases enforce rules and constraints (such as
primary keys, foreign keys, and unique constraints) that help maintain data
integrity. This ensures that the data is accurate, consistent, and reliable across
different operations.

Structured Query Language (SQL): SQL is the standard language used for
querying and manipulating data in relational databases. Its standardized nature
allows developers to learn a common language that can be applied across
various database systems. While there are minor dialect differences between
vendors (e.g., MySQL, PostgreSQL, SQL Server, Oracle), the fundamental
SQL syntax and commands (such as SELECT, INSERT, UPDATE, and
DELETE) remain largely consistent.

Transaction Management: Relational databases support transactions, which
are crucial for ensuring that a series of operations either all succeed or none
take effect. This property, encapsulated in the ACID (Atomicity, Consistency,
Isolation, Durability) principles, provides a reliable mechanism for error
handling and maintaining data consistency, particularly in multi-user
environments.

Mature Ecosystem: The relational database ecosystem is mature and robust,
featuring a wealth of tools, libraries, and frameworks that support
development and integration. This includes database management systems,
ORM (Object-Relational Mapping) tools, and reporting services, which make
it easier for developers to build applications that interact with relational
databases.

Established Best Practices: Over decades of usage, best practices and design
patterns have emerged for working with relational databases. This allows
developers to leverage tried-and-true methodologies for tasks like
normalization, indexing, and query optimization.

Scalability and Performance: While relational databases may face challenges
in scaling horizontally compared to some NoSQL solutions, they can be
optimized for performance through techniques like indexing, partitioning, and
denormalization. Additionally, many modern relational databases have
improved their scalability and performance capabilities.
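
The data integrity and SQL points in the list above can be made concrete with a
short sketch. It assumes Python's built-in sqlite3 module (where foreign key
checking has to be switched on per connection); the customers and orders tables
and the try_sql helper are hypothetical names used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT UNIQUE)"
)
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id))"
)
conn.execute("INSERT INTO customers VALUES (1, 'ann@example.com')")
conn.execute("INSERT INTO orders VALUES (10, 1)")
conn.commit()


def try_sql(conn, sql):
    """Run one statement and report whether a constraint rejected it."""
    try:
        with conn:
            conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


# Violates the UNIQUE constraint on email.
duplicate_email = try_sql(
    conn, "INSERT INTO customers VALUES (2, 'ann@example.com')"
)
# Violates the foreign key: customer 999 does not exist.
orphan_order = try_sql(conn, "INSERT INTO orders VALUES (11, 999)")
# A valid change passes through untouched.
valid_update = try_sql(
    conn, "UPDATE customers SET email = 'ann@new.example' WHERE id = 1"
)
```

The rules live in the schema, so they apply to every program that writes to these
tables, not just to code that remembers to check them.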

1.2. Impedance Mismatch

Relational databases offer numerous advantages; however, they are not without their
shortcomings. From their inception, various frustrations have arisen regarding their
use.

One significant issue for application developers is known as the "impedance
mismatch," which refers to the disparity between the relational model and in-memory
data structures. The relational data model organizes information into tables and
rows (more precisely, relations and tuples). In this context, a tuple is defined as a set
of name-value pairs, while a relation is a collection of tuples. It’s worth noting that
the definition of a tuple in relational databases differs slightly from its usage in
mathematics and in many programming languages, where it typically represents a
sequence of values. All SQL operations process and return relations, leading to the
mathematically elegant framework of relational algebra.

While this relational foundation brings a level of elegance and simplicity, it also
introduces certain limitations. Specifically, the values within a relational tuple must
be simple; they cannot encompass any structured data, such as nested records or lists.
In contrast, in-memory data structures can accommodate far more complex
arrangements. Consequently, when developers need to utilize these richer in-memory
structures, they must convert them into a relational format suitable for storage on disk,
giving rise to the impedance mismatch—essentially two distinct representations
requiring translation (see Figure 1.1).

Figure 1.1 illustrates how an order, which appears as a single aggregate
structure in the user interface, is divided into multiple rows across various tables
in a relational database.
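
The translation that Figure 1.1 describes can be sketched in a few lines. This is
an illustrative example, not taken from the source: the order contents, the three
table names (orders, order_lines, payments), and the to_rows and from_rows helpers
are all hypothetical.

```python
# In memory, an order is one nested structure (an aggregate).
order = {
    "id": 1001,
    "customer": "Ann",
    "line_items": [
        {"product": "Widget", "quantity": 1, "price_cents": 3245},
        {"product": "Gadget", "quantity": 2, "price_cents": 4000},
    ],
    "payment": {"card": "Amex", "txn_id": "abc-123"},
}


def to_rows(order):
    """Flatten one nested order into flat rows for three tables."""
    order_row = (order["id"], order["customer"])
    line_rows = [
        (order["id"], item["product"], item["quantity"], item["price_cents"])
        for item in order["line_items"]
    ]
    payment = order["payment"]
    payment_row = (order["id"], payment["card"], payment["txn_id"])
    return {
        "orders": [order_row],
        "order_lines": line_rows,
        "payments": [payment_row],
    }


def from_rows(rows):
    """Rebuild the nested order, as application code must do on every read."""
    order_id, customer = rows["orders"][0]
    _, card, txn_id = rows["payments"][0]
    return {
        "id": order_id,
        "customer": customer,
        "line_items": [
            {"product": product, "quantity": quantity, "price_cents": price}
            for (_, product, quantity, price) in rows["order_lines"]
        ],
        "payment": {"card": card, "txn_id": txn_id},
    }


rows = to_rows(order)
rebuilt = from_rows(rows)
```

Every value in the rows is simple (a number or a string), which is what the
relational model requires. The two helper functions are the translation layer the
text refers to; ORM frameworks exist largely to generate this kind of code.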

This impedance mismatch has been a significant source of frustration for application
developers. In the 1990s, many anticipated that it would lead to the decline of
relational databases in favor of systems designed to directly replicate in-memory
structures on disk. This period saw the rise of object-oriented programming languages
and, subsequently, object-oriented databases, both vying to establish dominance in the
software development landscape.

Despite the success of object-oriented languages, object-oriented databases eventually
faded into obscurity. Relational databases effectively countered this challenge by
emphasizing their role as integration mechanisms, bolstered by a largely standardized
language for data manipulation (SQL) and a growing professional divide between
application developers and database administrators.

The introduction of object-relational mapping (ORM) frameworks, such as Hibernate
and iBATIS, has eased the impedance mismatch problem. These frameworks
implement established mapping patterns to alleviate much of the manual effort
involved. However, they can present challenges of their own when developers focus
too heavily on abstracting away the database, which can lead to poorly performing
queries.

Although relational databases maintained their dominance in enterprise computing
throughout the 2000s, signs of vulnerability began to emerge during this decade.

1.3. Application and Integration Databases

The reasons behind the dominance of relational databases over object-oriented (OO)
databases remain a topic of debate among seasoned developers. However, we believe
that a key factor in this success is SQL's role as an integration mechanism between
applications. In this model, the database serves as a central integration point, allowing
multiple applications—typically developed by different teams—to store their data in a
shared database. This approach facilitates improved communication, as all
applications operate on a consistent set of persistent data.

Despite its advantages, shared database integration has notable downsides. A structure
designed to accommodate multiple applications often becomes significantly more
complex than what any single application might require. Additionally, if an
application needs to modify its data storage, it must coordinate with all other
applications utilizing the database. Since different applications have varying structural
and performance needs, an index that benefits one application may negatively impact
the performance of another. Furthermore, because each application is usually
managed by a separate team, the database cannot fully trust these applications to
update data in a manner that maintains database integrity, necessitating that the
database itself enforce these integrity constraints.

Alternatively, treating the database as an application database, one that is directly
accessed only by a single application maintained by a dedicated team, can streamline
operations. With an application database, only the team responsible for the application
needs to understand the database structure, simplifying schema maintenance and
evolution. Since this team oversees both the application code and the database, they
can ensure database integrity through the application logic.

This shift in focus allows interoperability concerns to center on the application
interfaces, promoting better interaction protocols and providing support for changes
as necessary. During the 2000s, we observed a significant move towards web services,
enabling applications to communicate over HTTP. This approach introduced a new,
widely adopted communication mechanism that challenged the traditional reliance on
SQL with shared databases. Much of this evolution occurred under the umbrella of
"Service-Oriented Architecture" (SOA), a term that often lacks a precise definition.

One intriguing aspect of this transition to web services is that it fosters greater
flexibility in the data structure exchanged between systems. When using SQL, data
must conform to a relational format. However, web services enable the use of richer
data structures that can include nested records and lists, typically represented as
documents in XML or, more recently, JSON. This capability is particularly
advantageous for remote communication, where reducing the number of interactions
is crucial. By allowing a comprehensive structure of information to be encapsulated in
a single request or response, web services enhance efficiency.
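
As a rough illustration of this point, the sketch below builds one JSON response
body that carries a whole order, including a nested record and a list. It uses
Python's built-in json module; the order_response function and the order data are
hypothetical, and no real web framework or endpoint is implied.

```python
import json


def order_response(order_id):
    """Build one response body holding the whole order (hypothetical data)."""
    order = {
        "id": order_id,
        "customer": {"name": "Ann", "city": "Chicago"},  # nested record
        "line_items": [  # list of nested records
            {"product": "Widget", "quantity": 1},
            {"product": "Gadget", "quantity": 2},
        ],
    }
    return json.dumps(order)


# The service sends text; the client decodes it back into nested structures.
body = order_response(1001)
decoded = json.loads(body)
```

A client gets the customer and every line item in one round trip. Fetching the
same information through separate flat, row-shaped calls would take several
requests, which matters when each one crosses a network.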

While web services—particularly those using text over HTTP—are the go-to choice
for most integration scenarios, situations that demand high performance may
necessitate the use of binary protocols. However, this should be approached
cautiously; text protocols are generally easier to implement and maintain, as
evidenced by their prevalent use on the Internet.

Once the decision is made to adopt an application database, teams gain greater
freedom in selecting their database technologies. The decoupling of the internal
database from external services means that external stakeholders need not concern
themselves with how data is stored. This flexibility opens the door to considering non-
relational database options. Additionally, many features typical of relational databases,
such as advanced security measures, may be less relevant to an application database
since these functionalities can often be handled by the encompassing application itself.

Despite these advantages, the anticipated rush towards alternative data stores did not
materialize. Most teams that embraced the application database approach continued to
rely on relational databases. This is largely due to the familiarity and reliability of
relational systems; they often perform well, or at least adequately, for most use cases.

It's possible that, given more time, the shift toward application databases could have
begun to undermine the stronghold of relational databases. However, the cracks in this
dominance emerged from other sources, leading to the rise of NoSQL databases and
other alternatives that addressed specific limitations of relational models.

This transition reflects a broader trend in software development: a growing
recognition of the need for more flexible data storage solutions capable of handling
complex and varied application requirements. As the landscape evolves, the ability to
adapt to these changes will be critical for organizations looking to leverage their data
effectively.

1.4. Attack of the Clusters

At the beginning of the new millennium, the technology sector experienced the fallout
from the 1990s dot-com bubble burst. This event led many to question the economic
viability of the Internet. However, the 2000s also saw several large web properties
dramatically scale their operations.

This scaling occurred across various dimensions. Websites began to track user activity
and structural data in unprecedented detail, resulting in massive datasets
encompassing links, social networks, log activities, and mapping data. As this data
volume surged, so did the user base, with major websites transforming into vast
digital landscapes that regularly served millions of visitors.

To manage this exponential increase in data and traffic, organizations faced a choice
between two scaling strategies: vertical scaling (scaling up) or horizontal scaling
(scaling out). Vertical scaling involves upgrading to larger machines with more
processors, disk storage, and memory. However, as machine sizes increase, costs rise
significantly, and practical limits soon become apparent. Conversely, horizontal
scaling utilizes clusters of smaller, commodity hardware machines, which tend to be
more cost-effective and resilient. While individual machine failures are common, a
well-structured cluster can continue operating, offering high reliability even in the
face of hardware issues.

As large web properties transitioned to clustered architectures, a new challenge
emerged: relational databases are not inherently designed to function effectively
within a clustered environment. Clustered relational databases, such as Oracle RAC
and Microsoft SQL Server, typically operate on the principle of a shared disk
subsystem. They rely on a cluster-aware file system that writes to a highly available
disk subsystem, but this arrangement creates a single point of failure in the disk
subsystem.

Alternatively, organizations could configure relational databases as separate servers
handling different data sets through a process known as sharding. While sharding can
help distribute the load, it introduces complexity for application developers, who must
manage which database server to access for each piece of data. This approach also
compromises capabilities like querying, referential integrity, and transactional
consistency across shards, leading to what some developers describe as “unnatural
acts.”
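
The routing burden that sharding places on the application can be sketched as
follows. This is a minimal illustration of one common approach (hashing a key to
pick a server), not a description of any particular product; the server names and
both functions are hypothetical.

```python
import hashlib

# Hypothetical database servers, each holding a different slice of the data.
SHARDS = ["db-server-0", "db-server-1", "db-server-2"]


def shard_for(customer_id: str) -> str:
    """Pick the server for a key; the application, not the database, decides."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]


def shards_for_query(customer_ids):
    """A query spanning several customers may have to visit several servers."""
    return sorted({shard_for(c) for c in customer_ids})


home = shard_for("alice")
touched = shards_for_query(["alice", "bob", "carol", "dave", "erin"])
```

The same key always maps to the same server, so single-customer reads and writes
stay simple. Anything that spans customers, such as a join, a foreign key check,
or a transaction, may involve every server in touched, and the database no longer
coordinates that work on the application's behalf.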

In addition to technical challenges, the licensing costs associated with commercial
relational databases present further obstacles. These databases are often priced based
on a single-server model, resulting in inflated costs when deployed on clusters and
leading to frustrating negotiations with purchasing departments.

This discordance between relational databases and clustered environments prompted
organizations to seek alternative data storage solutions. Two companies in
particular, Google and Amazon, emerged as influential pioneers in this realm. Both
organizations excelled in running large clusters and were simultaneously capturing
vast amounts of data. This intersection of capability and ambition motivated them to
rethink their reliance on traditional relational databases. As the 2000s progressed,
Google and Amazon published concise yet highly impactful papers outlining their
innovative approaches: Google's BigTable and Amazon's Dynamo.

Critics often argue that the scales at which Amazon and Google operate are far
removed from those of most organizations, suggesting that their solutions may not be
applicable to the average business. While it is true that many software projects do not
require such immense scalability, an increasing number of organizations are
beginning to explore the possibilities of capturing and processing larger datasets, thus
encountering similar challenges.

As more information emerged about the methodologies employed by Google and
Amazon, the tech community began to investigate database designs that are explicitly
tailored for clustered environments. Unlike earlier alternatives that had failed to
dethrone relational databases, the threat posed by clustered architectures was real and
significant. This shift marked the beginning of a new era in database design, as
organizations sought systems that could not only accommodate vast quantities of data
but also support the dynamic and distributed nature of modern applications.

In this context, new database paradigms, often referred to as NoSQL databases,
began to gain traction. These systems, designed to operate across clusters, prioritize
scalability, flexibility, and performance, offering solutions that relational databases
struggled to provide in a distributed environment. The rise of NoSQL databases and
other alternatives signals a paradigm shift in how organizations approach data storage
and management, paving the way for innovative strategies to harness the power of
data in an increasingly complex technological landscape. As more businesses
recognize the potential benefits of these new systems, the era of relational databases
as the dominant solution appears to be giving way to a more diverse ecosystem of
data management technologies.

1.5. The Emergence of NoSQL

The term "NoSQL" is laden with irony; it first emerged in the late 1990s as the name
of an open-source relational database developed by Carlo Strozzi. This early NoSQL
database distinguished itself by not using SQL as its query language. Instead, it
manipulated data through shell scripts, with tables stored as ASCII files, where each
tuple was represented as a line with tab-separated fields. Aside from this semantic
coincidence, Strozzi's NoSQL did not influence the contemporary NoSQL databases
that we discuss today.

The modern usage of "NoSQL" can be traced back to a meetup on June 11, 2009, in
San Francisco, organized by Johan Oskarsson, a software developer from London.
Inspired by Google's BigTable and Amazon's Dynamo, a growing number of projects
began exploring alternative data storage solutions. As these discussions gained
traction in various software conferences, Johan, attending a Hadoop summit, decided
to host a meetup to gather those working on these new databases in one place.

In naming the meetup, Johan aimed for a catchy, memorable title that would work
well as a Twitter hashtag. After soliciting suggestions on the #cassandra IRC channel,
he chose "NoSQL," a suggestion from Eric Evans, a developer at Rackspace.
Although the term had a somewhat negative connotation and did not accurately
describe the systems being discussed, it met Johan's criteria for brevity and
uniqueness. Initially, they only intended it for a single event, unaware that it would
evolve into a broader technological movement.

The term "NoSQL" quickly gained popularity, though it has never been firmly
defined. The original call for the meetup sought "open-source, distributed, non-
relational databases," with presentations featuring projects like Voldemort, Cassandra,
Dynomite, HBase, Hypertable, CouchDB, and MongoDB. However, NoSQL is not
confined to this original set, and no consensus exists on its definition. Instead, it's
useful to consider common characteristics of databases commonly referred to as
NoSQL.

Common Characteristics of NoSQL Databases

Lack of SQL: As the name suggests, NoSQL databases do not primarily use
SQL for querying. While some databases, like Cassandra, offer query
languages that resemble SQL—such as Cassandra Query Language (CQL)—
none fully adhere to standard SQL definitions.

Open Source: Most NoSQL databases are open-source projects. Although
some closed-source systems are labeled as NoSQL, the movement is generally
seen as an open-source phenomenon.

Cluster-Friendly: Many NoSQL databases are designed to run on clusters, a
feature that influences their data models and consistency approaches. Unlike
relational databases, which utilize ACID (Atomicity, Consistency, Isolation,
Durability) transactions to maintain consistency across the entire database,
NoSQL databases offer various options for consistency and distribution that
align with a clustered environment.

Schema-Less Design: NoSQL databases typically operate without a
predefined schema, allowing users to add fields to database records freely.
This flexibility is particularly valuable when dealing with nonuniform data and
custom fields, which in a relational database tend to force awkward workarounds
such as generically named columns (for example, customField6) or separate
custom-field tables.

Emergence in the Early 21st Century: Generally, NoSQL refers to systems
developed in the early 21st century, effectively excluding many databases
created before this period.
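
The schema-less characteristic in the list above can be illustrated with plain
Python dictionaries standing in for stored records. This is only a sketch of the
idea, not the API of any particular NoSQL database; the customer records and both
helper functions are hypothetical.

```python
# Three records in the same collection, each with a different set of fields.
customers = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob", "loyalty_tier": "gold"},
    {
        "id": 3,
        "name": "Cho",
        "preferred_contact": {"channel": "sms", "number": "555-0100"},
    },
]


def field_names(records):
    """List every field that appears in at least one record."""
    return sorted({key for record in records for key in record})


def with_field(records, field):
    """Return the ids of records that carry a given field."""
    return [record["id"] for record in records if field in record]


all_fields = field_names(customers)
gold_candidates = with_field(customers, "loyalty_tier")
```

No table definition had to change to introduce loyalty_tier or preferred_contact.
The trade-off is that code reading the records must cope with fields being absent,
since nothing outside the application guarantees their presence.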

While these characteristics offer a framework for understanding NoSQL databases,
none are strictly definitional. It’s likely that a coherent, universally accepted
definition of "NoSQL" will never emerge. Instead, the term should be viewed as a
collection of databases that emerged to meet specific needs in a rapidly evolving
technological landscape.

The Meaning Behind "NoSQL"

When people first hear the term "NoSQL," they often wonder what it signifies. Many
proponents of NoSQL argue that it should be interpreted as "Not Only SQL,"
suggesting a broader context for database capabilities. However, this interpretation
has its complications. Most people write "NoSQL," whereas "Not Only
SQL" would logically be written "NOSQL." Moreover, if we adopt the "not only"
definition, then traditional relational databases like Oracle or PostgreSQL could also
fit into that category, leading to confusion.

To avoid these pitfalls, it's advisable to focus on the implications of the term
"NoSQL" rather than fixating on its literal meaning. Thus, "NoSQL" encompasses a
loosely defined set of mostly open-source databases that emerged primarily in the
early 21st century and generally do not use SQL.

The "not only" interpretation has merit, particularly as it reflects the evolving
ecosystem many see as the future of databases. This viewpoint emphasizes that
NoSQL represents a movement rather than a singular technology. While relational
databases remain dominant and will continue to be widely used, the rise of NoSQL
signifies an important shift in the data storage landscape.

Polyglot Persistence

This shift has led to a concept known as polyglot persistence, which advocates for
using different data storage solutions depending on the specific requirements of
various applications and datasets. Instead of defaulting to a relational database simply
because it’s the standard, organizations are encouraged to evaluate the nature of their
data and how they wish to manipulate it. Consequently, most organizations now
employ a mix of data storage technologies tailored to different scenarios.

To facilitate this polyglot approach, organizations need to transition from integration
databases to application databases. The authors of the source book argue that
NoSQL databases are best utilized as application databases rather than integration
databases. They assert that even if an organization does not adopt NoSQL solutions,
the shift towards encapsulating data within services is a beneficial direction.

Reasons for Considering NoSQL

In the context of NoSQL development, two primary motivations drive organizations
to consider these databases:

Scaling Needs: The first reason is to handle data access with sizes and
performance requirements that necessitate a clustered architecture. NoSQL
databases are designed to efficiently manage large volumes of data in
distributed environments.

Development Productivity: The second reason is to enhance application
development productivity by offering a more straightforward and flexible data
interaction model. Many development teams find that using a NoSQL
database simplifies database access, even if they don’t require the scalability
offered by a clustered environment.


As you explore the contents of the book, keep these two key reasons in mind. They
provide valuable insights into why NoSQL databases are gaining traction in
contemporary data management discussions.

1.6. Key Points

• Relational databases have been a successful technology for twenty years, providing
persistence, concurrency control, and an integration mechanism.

• Application developers have been frustrated with the impedance mismatch between
the relational model and the in-memory data structures.

• There is a movement away from using databases as integration points towards
encapsulating databases within applications and integrating through services.

• The vital factor for a change in data storage was the need to support large volumes
of data by running on clusters. Relational databases are not designed to run efficiently
on clusters.

• NoSQL is an accidental neologism. There is no prescriptive definition; all you can
make is an observation of common characteristics.

• The common characteristics of NoSQL databases are

• Not using the relational model

• Running well on clusters

• Open-source

• Built for the 21st century web estates

• Schemaless

• The most important result of the rise of NoSQL is Polyglot Persistence.


AGGREGATE DATA MODELS

The concept of a data model is central to how we perceive, interact with, and
manipulate data in a database system. It provides a framework for organizing and
structuring data in a way that is meaningful and accessible to users and applications.
This differs from a storage model, which deals with the underlying mechanics of how
data is stored and managed internally by the database system. Ideally, users should
not need to concern themselves with the storage model; however, understanding it can
be crucial for optimizing performance and ensuring efficient data retrieval.

Understanding Data Models

In everyday conversation, the term “data model” often refers to the specific structure
of data within an application. For instance, a developer might showcase an entity-
relationship diagram representing their database's structure, detailing entities such as
customers, orders, and products. However, in this context, the term will primarily
denote the overarching model by which a database organizes its data, also known as a
metamodel.

The Relational Data Model

For several decades, the relational data model has been the dominant paradigm in
database design. It can be visualized as a collection of tables, akin to spreadsheets,
where:

• Tables consist of rows and columns.
• Each row (or tuple) represents a specific entity, such as a customer or an
order.
• Each column represents a particular attribute of that entity, holding a single
value.

This model allows relationships to be established between entities through foreign
keys, linking data across tables. The relational model's strength lies in its simplicity:
every operation works on, and returns, a straightforward set of tuples.

The Shift to NoSQL

One of the most significant shifts with the emergence of NoSQL databases is the
move away from the rigid structure of the relational model. Each NoSQL solution
employs its unique data model, which can be broadly categorized into four types:

1. Key-Value Stores: Data is stored as a collection of key-value pairs.
2. Document Stores: Data is organized in documents, typically formatted as
JSON or XML, allowing for nested structures.
3. Column-Family Stores: Data is stored in rows whose columns are grouped
into column families, so that columns usually accessed together are kept
together, which suits queries over large datasets.
4. Graph Databases: Data is represented as nodes and edges, capturing
relationships in a way that is more natural for certain types of queries.

Aggregate Orientation

A common characteristic shared by key-value, document, and column-family
databases is what we refer to as aggregate orientation. This concept represents a
fundamental shift in how data is modeled and manipulated.

Aggregates

The relational model organizes data into tuples, which are relatively limited data
structures. Tuples cannot easily support complex data types like nested records or lists
of values. In contrast, aggregate orientation recognizes that often, data needs to be
handled in more complex units, referred to as aggregates.

• Aggregates: An aggregate is a collection of related objects that are treated as a
single unit for data manipulation and consistency management. This term
originates from Domain-Driven Design (DDD), which emphasizes the
importance of modeling software based on the domain it serves.

Benefits of Aggregate Orientation

Complex Structures: Aggregates allow for more intricate structures than
simple rows and columns, facilitating the nesting of lists and other records
within them. This flexibility makes it easier to model real-world data
relationships.

Atomic Operations: Operations on aggregates can be performed atomically,
meaning that changes to the aggregate can be made as a single unit. This is
particularly important for maintaining data integrity and consistency.

Natural Unit for Clustering: Aggregates provide a natural unit for processes
like replication and sharding in clustered database environments. When data is
grouped into aggregates, it becomes easier to distribute and manage across
multiple nodes.

Developer Convenience: Application developers often find aggregates easier
to work with since they align more closely with how data is used in the
application logic. Rather than dealing with multiple tuples spread across
various tables, developers can interact with a single aggregate structure,
simplifying data access and manipulation.

2.1.1. Example of Relations and Aggregates

This section walks through modeling data for an e-commerce website, demonstrating
how the data can be structured using both a relational database model and a NoSQL
aggregate-oriented model. This comparison highlights the differences in approach,
advantages, and trade-offs between the two models.

Figure 2.1. Data model oriented around a relational database (using UML
notation [Fowler UML] )

Figure 2.2 presents some sample data for this model.


E-Commerce Data Modeling Overview

In developing an e-commerce website, we need to store and manage various types of
data, including:

• User Information: Details about customers.
• Product Catalog: Information about the items for sale.
• Orders: Data related to customer orders.
• Shipping and Billing Addresses: Where products are shipped and how
customers are billed.
• Payment Data: Information regarding transaction processing.

Relational Data Model

Relational Data Model Overview:

• The relational model organizes data into tables, with each table
consisting of rows (tuples) and columns (attributes).
• The data model is normalized, ensuring that there are no duplicate
data entries across tables and maintaining referential integrity.

Example: In the relational model, the data could be structured into separate
tables for customers, orders, products, and addresses. This allows for efficient
storage and retrieval but may lead to complex queries involving multiple
tables.
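
As a minimal sketch of this layout, the following uses SQLite (through Python's
standard sqlite3 module) as a stand-in relational database. The table and
column names are assumptions made for illustration; only the sample values
(Martin, order 99, "NoSQL Distilled", 32.45) come from the sample data in these
notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customer(id));
CREATE TABLE order_item (order_id INTEGER REFERENCES orders(id),
                         product_name TEXT, price REAL);
INSERT INTO customer VALUES (1, 'Martin');
INSERT INTO orders VALUES (99, 1);
INSERT INTO order_item VALUES (99, 'NoSQL Distilled', 32.45);
""")

# Reassembling a single order means joining three tables.
rows = conn.execute("""
    SELECT c.name, o.id, i.product_name, i.price
    FROM customer c
    JOIN orders o ON o.customer_id = c.id
    JOIN order_item i ON i.order_id = o.id
    WHERE o.id = ?
""", (99,)).fetchall()
print(rows)
```

Each fact is stored once, and the join is the price paid for putting the
pieces back together at read time.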

Pros:

• Strong data integrity and consistency.
• Well-understood and established practices.
• Powerful query capabilities through SQL.

Cons:

• Complexity in managing relationships among different tables.
• Overhead of joins in queries can lead to performance issues.
• Less flexible in handling changing data structures or requirements.

NoSQL Aggregate Model

Aggregate-Oriented Data Model:

• In NoSQL databases, data is often modeled using aggregates, which
are collections of related data that can contain nested structures, lists,
and complex types.
• Aggregates provide a more natural representation of data as it often
exists in the real world, allowing for better alignment with application
logic.

Figure 2.3. An aggregate data model

Example: In this model, data is structured using JSON-like documents that
include all relevant information in a single record.

JSON Sample Data:

json
// Customer
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
// Order
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}

Key Features:

• Aggregates can contain complex structures, like lists of items or nested
addresses.
• Rather than using IDs to reference shared data (e.g., addresses), the
same address structure can be copied into multiple aggregates.
• Relationships between aggregates (e.g., customers and orders) exist but
are not confined within a single aggregate.
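
These features can be sketched in Python with the sample aggregates held as
plain dictionaries. This is an illustration only: the move to "Boston" is a
hypothetical change, and no particular database API is implied.

```python
# The two aggregates from the sample data, as plain Python dicts.
customer = {"id": 1, "name": "Martin", "billingAddress": [{"city": "Chicago"}]}
order = {
    "id": 99,
    "customerId": 1,  # link to the customer aggregate by ID
    "orderItems": [{"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}],
    "shippingAddress": [{"city": "Chicago"}],  # copied into the order, not referenced by ID
}

# Everything needed to total the order lives inside the one aggregate.
order_total = sum(item["price"] for item in order["orderItems"])

# The address is duplicated: changing the customer's copy leaves the order's copy alone.
customer["billingAddress"][0]["city"] = "Boston"
print(order_total, order["shippingAddress"][0]["city"])
```

Copying the address is often the desired behaviour for an order, since a
shipped order should keep the address it was actually sent to.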

Pros:

• More flexible in handling varying data types and structures.
• Simplifies data retrieval by reducing the need for complex joins.
• Facilitates efficient read operations, especially in clustered
environments.

Cons:

• Potential data redundancy, as the same information may be copied
across different aggregates.
• Possible challenges in maintaining consistency when data is duplicated.
• Lack of a formalized schema may lead to less predictability in data
structure.

Aggregate Boundaries

Determining Aggregate Boundaries:

• When modeling aggregates, developers must consider how the
application will access data. The way data is structured should reflect
usage patterns:

• If customer and order data are frequently accessed together, it
may be beneficial to combine them into a single aggregate.
• If orders are often accessed independently, it may be more
efficient to keep them as separate aggregates.

Figure 2.4. Embed all the objects for customer and the customer’s orders

Example of Aggregates:

• An aggregate for a customer could include their information along
with a list of their orders, as shown in the following JSON structure:

json
// Customer with embedded orders
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}
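
How the boundary choice affects access can be sketched as follows. Both
designs are modeled with in-memory dictionaries keyed by aggregate ID; the
function names are hypothetical and only illustrate the two access patterns
discussed above.

```python
# Design 1: customer and order are separate aggregates, each with its own key.
customers = {1: {"id": 1, "name": "Martin"}}
orders = {99: {"id": 99, "customerId": 1, "orderItems": [{"productId": 27, "price": 32.45}]}}

# Design 2: orders are embedded inside the customer aggregate.
customers_embedded = {
    1: {"id": 1, "name": "Martin",
        "orders": [{"id": 99, "orderItems": [{"productId": 27, "price": 32.45}]}]}
}

def order_by_id_separate(order_id):
    # One key lookup reaches the order directly.
    return orders[order_id]

def order_by_id_embedded(order_id):
    # Without the customer key, every customer aggregate has to be scanned.
    for cust in customers_embedded.values():
        for o in cust["orders"]:
            if o["id"] == order_id:
                return o
    return None

def customer_with_orders_embedded(customer_id):
    # One key lookup returns the customer and all of their orders together.
    return customers_embedded[customer_id]
```

Neither design is better in general: the embedded form favours "show this
customer with everything they ordered", while the separate form favours "fetch
order 99".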

In summary, the choice between a relational model and an aggregate-oriented NoSQL
model hinges on the specific requirements of the application and how data is accessed.
Relational databases offer strong consistency and integrity, making them suitable for
structured data with clear relationships. In contrast, NoSQL aggregate models provide
flexibility, making them better suited for evolving data structures and high-
performance applications, particularly in contexts where data access patterns favor
aggregates. Ultimately, understanding how to draw aggregate boundaries and how
that aligns with data manipulation requirements is crucial for effective data modeling.

2.1.2. Consequences of Aggregate Orientation

Understanding Aggregates in Data Modeling

When we create a relational model to represent various data elements (like orders,
order items, shipping addresses, and payments), we do a good job of capturing their
relationships through foreign keys. However, this approach lacks the concept of
aggregates, which are logical groupings of data that reflect how data is used in
practice.

The Role of Aggregates

In our domain, we think of an order as an aggregate that consists of order items, a
shipping address, and payment information. However, relational databases don’t
distinguish between different types of relationships, which means they can't leverage
an understanding of aggregate structures to optimize data storage and distribution.

Although some data modeling techniques attempt to mark aggregate relationships,
they often fail to provide clear semantics on what differentiates an aggregate from
other relationships. This lack of clarity can lead to inconsistencies, as modelers may
have varying interpretations of what constitutes an aggregate.

Aggregate-Ignorant Databases

Relational databases, along with some NoSQL databases like graph databases, are
termed "aggregate-ignorant" because they do not have a built-in understanding of
aggregates. This isn’t necessarily a disadvantage; it can be challenging to define
aggregate boundaries clearly, especially when the same data is utilized in multiple
contexts.

For example:

• An order might be an ideal aggregate when a customer is placing or
reviewing orders.
• However, if a retailer wants to analyze product sales data, using order
aggregates can complicate the process since you must access every aggregate
to retrieve sales history.

Aggregate-ignorant models allow for flexibility in querying data from different
perspectives, making them advantageous when there isn't a clear primary structure for
data manipulation.
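
The product-sales case above can be made concrete with a short Python sketch.
The order aggregates below are hypothetical sample data held in a dictionary;
the point is only that the analysis has to open every aggregate.

```python
from collections import defaultdict

# Hypothetical order aggregates keyed by order ID.
orders = {
    99: {"orderItems": [{"productId": 27, "price": 32.45}]},
    100: {"orderItems": [{"productId": 27, "price": 32.45},
                         {"productId": 31, "price": 10.00}]},
}

# Sales per product: every order aggregate must be read and unpacked.
units_sold = defaultdict(int)
for order in orders.values():
    for item in order["orderItems"]:
        units_sold[item["productId"]] += 1
print(dict(units_sold))
```

The aggregate structure that suits order placement works against this query,
which cuts across all orders rather than staying inside one.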

Importance of Aggregates in Distributed Systems

The main advantage of aggregate orientation comes into play when working with
clustered systems—a common scenario in NoSQL environments. By explicitly
defining aggregates, we inform the database which pieces of data are likely to be
manipulated together, enabling better data distribution across nodes. This minimizes
the number of nodes queried for data retrieval, enhancing performance.
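
A minimal sketch of this idea, assuming simple hash-based placement: the
aggregate's key decides which node stores it, so the whole aggregate lands on
one node. The node names and the modulo scheme are illustrative assumptions,
not the placement algorithm of any particular database.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def node_for(aggregate_key: str) -> str:
    # Stable hash of the key, so the same aggregate always maps to the same node.
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

# The whole order (items, address, payment) travels with its key to one node,
# so reading it touches a single node.
placement = {key: node_for(key) for key in ["order:99", "order:100", "customer:1"]}
print(placement)
```

Real systems typically use more elaborate schemes so that adding a node does
not reshuffle most keys, but the unit being placed is still the aggregate.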

Transactions and ACID Properties

Transactions are critical for ensuring data integrity, particularly in relational
databases, which allow for the manipulation of multiple rows across various tables in
a single transaction. These transactions adhere to the ACID properties (Atomicity,
Consistency, Isolation, and Durability), ensuring that all parts of a transaction are
completed successfully or none at all, and that concurrent operations remain isolated
from each other.

In contrast, it’s often asserted that NoSQL databases lack ACID transactions, which
suggests a compromise on consistency. This statement, however, oversimplifies the
reality. While aggregate-oriented databases typically do not support ACID
transactions that span multiple aggregates, they can perform atomic operations on
individual aggregates.

This means that if we need to manipulate multiple aggregates atomically, we must
handle that within the application code. Fortunately, in many use cases, the need for
atomicity is confined to single aggregates. Thus, the division of data into aggregates is
often informed by how we anticipate needing to manipulate that data.
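
Single-aggregate atomicity can be sketched with a toy in-memory store. The
class below is a hypothetical stand-in written for these notes, not the API of
a real database; it applies a change to a private copy and swaps the copy in
only if the change succeeds.

```python
import copy
import threading

class ToyAggregateStore:
    """In-memory stand-in for an aggregate store (not a real database API)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, aggregate):
        with self._lock:
            self._data[key] = copy.deepcopy(aggregate)

    def get(self, key):
        with self._lock:
            return copy.deepcopy(self._data[key])

    def update(self, key, change):
        # Apply `change` to a private copy; swap it in only if it succeeds,
        # so readers see either the old aggregate or the new one, never a mix.
        with self._lock:
            draft = copy.deepcopy(self._data[key])
            change(draft)
            self._data[key] = draft

store = ToyAggregateStore()
store.put("order:99", {"orderItems": [], "total": 0.0})

def add_item(order):
    order["orderItems"].append({"productId": 27, "price": 32.45})
    order["total"] += 32.45

store.update("order:99", add_item)  # items and total change together

def failing_change(order):
    order["orderItems"].clear()
    raise ValueError("simulated failure halfway through")

try:
    store.update("order:99", failing_change)
except ValueError:
    pass  # the stored aggregate is untouched
```

The guarantee stops at the key: nothing in this store coordinates a change to
"order:99" with a change to any other aggregate.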

Lastly, it's essential to note that while graph and other aggregate-ignorant databases
may not have a formal aggregate structure, they usually still support ACID
transactions similar to those found in relational databases. The discussion around
consistency in databases is complex and extends beyond just whether a database is
ACID-compliant.

2.2. Key-Value and Document Data Models

Key-Value and Document Databases: Understanding the Differences

Key-Value and Document Databases are two popular types of NoSQL databases,
and both are designed around the concept of aggregates. An aggregate is a collection
of related data that is treated as a single unit, often identified by a unique key or ID.
Here’s how they compare:

Key-Value Databases

Structure and Opacity:

1. In a key-value database, each aggregate is essentially a "blob" of data
associated with a unique key. The database doesn't interpret or
structure this data; it simply stores it as a sequence of bytes.
2. This "opacity" means that you can store any kind of data without
having to conform to a specific structure. The only requirement is that
each piece of data (or aggregate) must have a unique key for retrieval.

Data Access:

1. Accessing data is straightforward: you retrieve an aggregate by looking
it up using its key.
2. The downside is that you can't perform complex queries or retrieve
parts of an aggregate easily since the database doesn't understand the
data's internal structure.

Flexibility:

1. You have a lot of freedom regarding what you store in the database,
making it a good choice for scenarios where data formats might vary
widely or where strict schema enforcement isn't necessary.
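
A minimal Python sketch of the key-value model, with a dictionary of bytes
standing in for the database. The put and get helpers and the "order:99" key
format are assumptions made for illustration.

```python
import json

kv_store = {}  # key -> bytes, standing in for a key-value database

def put(key: str, value: bytes) -> None:
    kv_store[key] = value

def get(key: str) -> bytes:
    return kv_store[key]

# The application chooses the encoding; the store only ever sees bytes.
put("order:99", json.dumps({"id": 99, "customerId": 1}).encode("utf-8"))

# To read one field, fetch the whole blob and decode it on the client side.
order = json.loads(get("order:99").decode("utf-8"))
print(order["customerId"])
```

Because the store never looks inside the value, swapping JSON for any other
encoding would require no change on the database side.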

Document Databases

Structure and Transparency:

1. Document databases, on the other hand, understand the structure of the
data stored within aggregates. They typically store data in formats like
JSON or XML, which inherently have defined structures.
2. This structure allows the database to interpret and manage the data
more effectively compared to key-value stores.

Data Access:

1. You can perform more sophisticated queries based on the fields within
the document. This means you can not only retrieve an entire
document using its ID but also search for documents based on specific
field values.
2. Document databases allow partial retrieval, meaning you can fetch
only the parts of the document you need instead of the whole aggregate.

Indexing:

1. Document databases can create indexes on various fields within the
documents, which enhances the speed and efficiency of queries. This
capability allows for quicker access to specific pieces of data without
having to scan entire aggregates.
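
Field queries, partial retrieval, and indexing can be sketched with a toy
in-memory document collection. The find function, its projection parameter,
and the sample documents are hypothetical; real document databases expose
their own query languages.

```python
from collections import defaultdict

documents = {
    99: {"id": 99, "customerId": 1, "shippingAddress": {"city": "Chicago"}},
    100: {"id": 100, "customerId": 2, "shippingAddress": {"city": "Boston"}},
    101: {"id": 101, "customerId": 1, "shippingAddress": {"city": "Chicago"}},
}

def find(field, value, projection=None):
    # Query by field value; optionally return only the requested fields.
    hits = [d for d in documents.values() if d.get(field) == value]
    if projection is None:
        return hits
    return [{k: d[k] for k in projection if k in d} for d in hits]

# A secondary index on customerId avoids scanning every document.
by_customer = defaultdict(list)
for doc_id, doc in documents.items():
    by_customer[doc["customerId"]].append(doc_id)

print(find("customerId", 1, projection=["id"]))
print(by_customer[1])
```

All three capabilities depend on the store being able to see the fields, which
is exactly what an opaque key-value blob does not allow.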

Blurred Lines Between Key-Value and Document Databases

While the distinction between key-value and document databases is clear in theory, in
practice, it often becomes blurred. Here are a few reasons:

Key-Value Stores with Structure: Some key-value stores (like Riak or Redis)
allow for additional structures or metadata, enabling features such as indexing
or the ability to break down aggregates into smaller components (like lists or
sets). For example, Riak allows you to attach metadata to aggregates for better
indexing.

Document Stores with Key Lookups: Document databases often include an
ID field that enables key-value style lookups, which can make them function
similarly to key-value databases in certain scenarios.

2.3. Column-Family Stores

Understanding Column-Family Databases

Column-family databases are a type of NoSQL database designed to efficiently
handle large amounts of data across distributed systems. They are influenced by the
structure of Google's BigTable and are particularly useful for certain types of data
access patterns.

Figure 2.5. Representing customer information in a column-family structure

Key Concepts

Two-Level Aggregate Structure:

1. In a column-family database, the data structure can be thought of as a
two-level map. The first level consists of row identifiers (or keys),
which point to aggregates of related data.
2. Each row consists of a map of columns that hold values related to that
row. For instance, you can retrieve a specific column's value (like a
customer’s name) by using a command like get('1234', 'name').

Column Families:

1. Data is organized into column families. Each column belongs to a
specific column family, which acts as a logical grouping of related data.
2. The assumption is that columns within the same family will often be
accessed together, allowing the database to optimize storage and access
patterns.
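
The row, column family, and column layering can be sketched with nested Python
dictionaries. The family names (profile, orders), the column names, and the
optional family argument on get are illustrative assumptions; only the
get('1234', 'name') call shape comes from the text above.

```python
# row key -> column family -> column -> value (all names illustrative)
rows = {
    "1234": {
        "profile": {"name": "Martin", "billingAddress": "Chicago"},
        "orders": {"OR1001": "order data", "OR1002": "order data"},
    }
}

def get(row_key, column, family="profile"):
    # First find the row aggregate, then a single column inside it.
    return rows[row_key][family][column]

def get_family(row_key, family):
    # Columns of one family are expected to be read together.
    return rows[row_key][family]

print(get("1234", "name"))
```

Grouping columns by family gives the store a hint about which values to keep
physically close, without fixing the set of columns in advance.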

Row-Oriented vs. Column-Oriented Views:

1. Row-Oriented: Here, each row represents an aggregate (e.g., a
customer with a unique ID). Column families then represent different
categories of data related to that aggregate (e.g., profile details or order
history).
2. Column-Oriented: In this perspective, each column family can be
seen as defining a record type (e.g., customer profiles), with each row
representing an instance of that record. This allows the database to
efficiently manage and retrieve records based on these predefined
structures.

Wide vs. Skinny Rows:

1. Skinny Rows: These have a small number of columns, where the same
columns are used across multiple rows. This structure resembles
traditional records, with the column family defining the record type
and each column acting as a field.
2. Wide Rows: These contain a large number of columns, potentially
thousands. Each row can have a different set of columns, which makes
it suitable for modeling lists (e.g., an order with multiple items where
each item is represented as a separate column).

Flexibility in Column Addition

• One of the defining features of column-family databases is their flexibility
regarding column addition. You can freely add new columns to existing rows,
allowing for dynamic data structures. However, adding new column families is
a less frequent operation and may require downtime for the database.

Sorting and Access Patterns

• In wide column families, there is often a defined sort order for columns. This
is especially beneficial when accessing data by keys that are concatenated (e.g.,
date and ID). This allows you to efficiently retrieve ranges of data based on
sorted keys, which can enhance query performance for specific access patterns.
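
A range read over sorted, concatenated column names can be sketched with
Python's bisect module. The "date:orderId" naming, the dates, and the values
are hypothetical sample data for one wide row.

```python
from bisect import bisect_left

# One wide row: column names are "<date>:<order id>", kept in sorted order.
columns = sorted([
    ("2024-01-03:OR1001", 32.45),
    ("2024-01-17:OR1002", 10.00),
    ("2024-02-02:OR1003", 54.10),
    ("2024-02-20:OR1004", 7.25),
])
names = [name for name, _ in columns]

def column_range(start, end):
    # Binary search finds both boundaries; the range is [start, end).
    lo = bisect_left(names, start)
    hi = bisect_left(names, end)
    return columns[lo:hi]

# All columns for February 2024.
february = column_range("2024-02-01", "2024-03-01")
print([name for name, _ in february])
```

Putting the date first in the column name is what makes the date range
contiguous; with the ID first, the same query would need a full scan of the row.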

Distinction from Relational Databases

• Unlike traditional relational databases, where data is strictly structured into
tables with defined schemas, column-family databases allow for more fluidity.
You can have rows in the same column family that contain different columns,
which provides greater flexibility but can be challenging to conceptualize for
those familiar with relational models.
• The ability to model lists within columns (where each item in a list is
represented as a separate column) diverges significantly from relational tables,
where each row typically represents a complete record.

Column-family databases like BigTable, HBase, and Cassandra offer a powerful
alternative to traditional relational databases, particularly for applications that require
flexible data structures, high write throughput, and efficient access patterns across
distributed systems. Their design emphasizes the need for efficient storage and
retrieval of large volumes of data while accommodating various use cases through
their unique aggregate structure.

2.4. Summarizing Aggregate-Oriented Databases

This section summarizes the key concepts behind the three styles of
aggregate-oriented data models: key-value stores, document stores, and
column-family databases. All three share the foundational idea of aggregates
indexed by keys, but they differ significantly in their structure and
functionality.

Common Features of Aggregate-Oriented Data Models

• Aggregate: In all three models, an aggregate represents a collection of related
data that is treated as a single unit. Each aggregate is identified by a unique
key, which allows for efficient lookup and retrieval.
• Cluster Efficiency: Aggregates are designed to be stored together on a single
node in a distributed database system. This helps in minimizing the number of
nodes accessed during data operations, which is crucial for performance in
clustered environments.
• Atomic Updates: The aggregate serves as the atomic unit for updates,
meaning that changes to the data within the aggregate are treated as a single
transaction. This provides a limited form of transactional control, ensuring that
updates to the aggregate are either fully completed or not applied at all.

Differences Among the Models

Key-Value Data Model:

• Opaque Aggregates: In a key-value store, the aggregate is treated as a
complete, opaque entity. This means that you can only retrieve the
entire aggregate using its key.
• Limited Interaction: Since you cannot query the contents of the
aggregate or access specific parts of it, interactions are limited to full
lookups. This model is straightforward but lacks flexibility for
complex queries or partial data retrieval.

Document Data Model:

• Transparent Aggregates: Document databases make the aggregate
more transparent, allowing the database to understand and manipulate
its internal structure. Each aggregate is typically represented as a
document (often in JSON or XML format).
• Querying and Partial Retrieval: You can run queries against the
fields within the document and retrieve specific parts of the aggregate.
However, since document databases often have no predefined schema,
they may struggle to optimize storage and retrieval based on the
document structure.

Column-Family Data Model:

• Structured Aggregates: Column-family databases impose a certain
structure on aggregates by organizing data into column families. Each
row (aggregate) is divided into columns that belong to these families,
allowing for a more organized representation of data.
• Optimized Accessibility: Because of this structure, the database can
optimize the storage and retrieval of data. Operations can target
specific column families, improving efficiency and making it easier to
access related data within the aggregate.

In summary, while all three aggregate-oriented data models revolve around the
concept of aggregates indexed by keys, they differ in how they handle these aggregates:

• Key-value stores treat aggregates as opaque blobs with limited access and no
query capabilities.
• Document stores provide a more transparent view of aggregates, enabling
queries and partial retrievals, but with less optimization due to their lack of
schema.
• Column-family databases offer a structured approach to aggregates, allowing
the database to leverage this structure for improved access and performance.

Understanding these differences is essential for choosing the right data model based
on the specific needs and access patterns of your application.

2.5. Further Reading

For more on the general concept of aggregates, which are often used with relational
databases too, see [Evans]. The Domain-Driven Design community is the best source
for further information about aggregates; recent information usually appears at
http://domaindrivendesign.org.

2.6. Key Points

• An aggregate is a collection of data that we interact with as a unit. Aggregates form
the boundaries for ACID operations with the database.

• Key-value, document, and column-family databases can all be seen as forms of
aggregate-oriented database.

• Aggregates make it easier for the database to manage data storage over clusters.

• Aggregate-oriented databases work best when most data interaction is done with the
same aggregate; aggregate-ignorant databases are better when interactions use data
organized in many different formations.


More Details on Data Models

So far we’ve covered the key feature in most NoSQL databases: their use of
aggregates and how aggregate-oriented databases model aggregates in different ways.
While aggregates are a central part of the NoSQL story, there is more to the data
modeling side than that, and we’ll explore these further concepts in this chapter.

3.1. Relationships

This section examines the complexity and utility of aggregates in NoSQL databases,
particularly how they handle relationships between different types of data, such as
customers and their orders.

Understanding Aggregates and Their Relationships

Aggregates:

1. Aggregates group related data that is frequently accessed together,
optimizing data retrieval.
2. For example, combining a customer's details with their order history
into a single aggregate may benefit some applications that require
frequent access to both sets of information.

Independent Aggregates:

1. In contrast, other applications may require access to orders individually,
leading to a preference for treating orders and customers as separate
aggregates.
2. This necessitates a way to relate these aggregates without merging
them, allowing flexibility in data access.

Linking Aggregates:

1. One simple method for linking aggregates is to embed the customer ID
within the order aggregate. This allows the application to reference the
customer data by first retrieving the order and then looking up the
customer using the embedded ID.
2. However, while this approach is functional, it means the database is
unaware of the relationship between the aggregates, which can limit its
ability to optimize queries or enforce integrity.
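
The ID-based link can be sketched with two dictionaries standing in for the
customer and order stores. The function name and the dangling customer ID 42
are hypothetical, chosen to show what the database cannot check for us.

```python
customers = {1: {"id": 1, "name": "Martin"}}
orders = {99: {"id": 99, "customerId": 1}}

def customer_for_order(order_id):
    order = orders[order_id]                   # first lookup: the order aggregate
    return customers.get(order["customerId"])  # second lookup: follow the embedded ID

# Nothing stops an order from pointing at a customer that does not exist.
orders[100] = {"id": 100, "customerId": 42}
print(customer_for_order(99)["name"], customer_for_order(100))
```

The relationship lives only in application code, so checks that a foreign key
would provide in a relational database have to be written by hand.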

Visibility of Relationships in Databases

Database Awareness:

1. Many NoSQL databases, even key-value stores, provide mechanisms
to make relationships between aggregates more visible.
2. Document stores enable the content of aggregates to be indexed and
queried, allowing for more sophisticated data access.
3. For instance, Riak, a key-value store, allows metadata to store linking
information, facilitating partial retrieval and link-walking (the ability
to follow relationships between data points).

Handling Updates Across Aggregates

Atomicity and Updates:

1. In aggregate-oriented databases, atomicity is limited to operations
within a single aggregate. If updates span multiple aggregates,
applications must handle potential failures during the update process
themselves.
2. This contrasts with relational databases, which support ACID
(Atomicity, Consistency, Isolation, Durability) transactions, allowing
multiple records across different tables to be updated as part of a single
transaction. This feature provides stronger guarantees when managing
related data.


Challenges with Complex Relationships:

1. While aggregate-oriented databases are less suited for managing
relationships across multiple aggregates, relational databases also
struggle with complex relationships.
2. SQL queries that involve numerous joins can become complicated and
degrade performance, making it challenging to work with highly
interrelated datasets.

Introduction to Another Category of Databases

1. Transition to New Databases:

1. The limitations discussed above motivate another category of
databases that handles complex relationships more effectively.
2. That category is graph databases, which are designed specifically
for managing and querying complex relationships between data points.
Graph databases use nodes (representing entities) and edges
(representing relationships) to efficiently handle interconnected data,
making them a suitable alternative when dealing with highly
interconnected data structures.

In summary, this section covers the strengths and weaknesses of aggregate-oriented
NoSQL databases in handling
relationships and updates across different data entities. It emphasizes the importance
of considering the specific needs of applications when choosing a database model,
particularly when relationships among data are a crucial concern. Additionally, it sets
the stage for discussing other database types that may provide better solutions for
complex relationship management.

3.2. Graph Databases

This section introduces graph databases, highlighting their unique characteristics and how they differ from
other NoSQL databases. Here's a detailed explanation of the key points:


Overview of Graph Databases

Distinct Motivation:

1. Graph databases emerged from frustrations with relational databases,
particularly their limitations in handling complex relationships
efficiently.
2. In contrast to many NoSQL databases, which focus on large aggregates
of data with simpler relationships, graph databases are built around
small records (nodes) that are intricately connected by relationships
(edges).

Graph Data Structure:

Figure 3.1. An example graph structure

1. In graph databases, data is represented as a graph, which consists of:

1. Nodes: Represent entities (e.g., people, books, products).
2. Edges: Represent relationships between those entities (e.g.,
friendships, authorship).

2. This structure allows for sophisticated queries that explore complex
interconnections. For example, one could ask for all books in a specific
category written by authors liked by a particular friend.
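
To make the node-and-edge idea concrete, here is a minimal Python sketch of the example query. It is not how any particular graph database stores or queries data; the edge list, the relationship names (FRIEND, LIKES, AUTHOR_OF, CATEGORY), and the sample people and titles are invented for illustration.

```python
# Each edge is (source node, relationship type, target node). Sample data only.
edges = [
    ("Anna", "FRIEND", "Barbara"),
    ("Barbara", "LIKES", "Martin"),
    ("Martin", "AUTHOR_OF", "Refactoring Databases"),
    ("Martin", "AUTHOR_OF", "Another Title"),
    ("Refactoring Databases", "CATEGORY", "Databases"),
    ("Another Title", "CATEGORY", "Cooking"),
]


def neighbors(node, rel):
    """Return the targets of all edges of type rel that leave node."""
    return [t for s, r, t in edges if s == node and r == rel]


def books_via_friends(person, category):
    """Books in a category written by authors that the person's friends like."""
    found = set()
    for friend in neighbors(person, "FRIEND"):
        for author in neighbors(friend, "LIKES"):
            for book in neighbors(author, "AUTHOR_OF"):
                if category in neighbors(book, "CATEGORY"):
                    found.add(book)
    return sorted(found)


print(books_via_friends("Anna", "Databases"))  # ['Refactoring Databases']
```

Each nested loop is one "hop" along an edge type. A real graph database would express the same path in its query language and follow stored edges instead of scanning a list.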

Characteristics of Graph Databases

Complex Interconnections:

1. Graph databases are particularly well-suited for scenarios where
relationships are complex and numerous, such as social networks,
recommendation systems, or rule-based eligibility scenarios.
2. The structure allows for efficient modeling of these relationships,
enabling rapid traversal of connected data.

Simple Fundamental Model:

1. The basic model of a graph database is straightforward: nodes are
connected by edges. However, the way data can be stored in these
nodes and edges varies among different graph databases.
2. For instance:

1. FlockDB has no additional attributes on nodes and edges
beyond the basic structure.
2. Neo4j allows nodes and edges to have properties (like Java
objects) in a schemaless manner, enabling flexibility.
3. Infinite Graph stores Java objects that are subclasses of its
built-in types.

Querying and Performance:

1. Graph databases support specialized query operations tailored for
navigating through the graph structure.
2. While relational databases can implement relationships using foreign
keys, the performance of queries involving multiple joins can suffer
significantly as the complexity and number of connections increase.
3. In contrast, graph databases optimize for relationship traversal, making
such operations faster and more efficient.

Insert vs. Query Performance:

1. A key advantage of graph databases is that they shift the majority of
the workload related to navigating relationships from query time (when
data is accessed) to insert time (when data is stored).
2. This design choice benefits applications where querying performance
is critical, even if it means that insert operations might be slightly
slower.

Indexing and Traversal:

1. Graph databases typically start queries by looking up nodes using
attributes (e.g., ID). Once the starting nodes are identified, the database
efficiently navigates through the edges to find related data.
2. Most queries revolve around exploring relationships, such as finding
common interests between multiple nodes.
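
The two-step pattern above (index lookup first, edge traversal second) might look like the following Python sketch. The node_by_name index, the likes adjacency sets, and the sample data are assumptions for illustration. The adjacency sets are filled when data is inserted, which is the sense in which work moves from query time to insert time.

```python
# Index used only to find the starting nodes: attribute value -> node id.
node_by_name = {"Anna": 1, "Barbara": 2}

# Adjacency sets maintained at insert time: node id -> topics the node LIKES.
likes = {
    1: {"databases", "chess", "hiking"},
    2: {"databases", "hiking", "jazz"},
}


def common_interests(name_a, name_b):
    """Look up both start nodes by attribute, then compare their LIKES edges."""
    a = node_by_name[name_a]   # index lookup
    b = node_by_name[name_b]   # index lookup
    return sorted(likes[a] & likes[b])   # traversal over stored edges only


print(common_interests("Anna", "Barbara"))  # ['databases', 'hiking']
```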

Differences from Aggregate-Oriented Databases

Focus on Relationships:

1. Graph databases emphasize relationships and connections,
distinguishing them from aggregate-oriented databases, which focus on
grouping data into aggregates.
2. This relationship-centric approach influences their design, operation,
and performance characteristics.

Server Architecture:

1. Graph databases tend to run on a single server rather than being
distributed across clusters, which is more common in aggregate-oriented
databases.
2. The ACID (Atomicity, Consistency, Isolation, Durability) transactions
in graph databases need to cover multiple nodes and edges to maintain
data consistency, contrasting with aggregate-oriented databases that
primarily focus on atomic operations within single aggregates.

Rejection of the Relational Model:

1. Like other NoSQL databases, graph databases reject the traditional
relational model, opting instead for a more flexible and relationship-focused
approach.
2. Their development coincided with the broader NoSQL movement,
reflecting a shift in database design priorities away from conventional
relational schemas.

In summary, graph databases are a distinct category within the NoSQL landscape, driven by the
need to efficiently manage complex relationships among small records. Their
structure—nodes connected by edges—enables powerful querying capabilities and
fast traversal of data. While they share some similarities with other NoSQL databases,
such as rejecting the relational model, their focus on relationships and different
operational characteristics set them apart, making them particularly useful for
applications involving intricate interconnections.

3.3. Schemaless Databases

This section covers the concept of schemalessness in NoSQL databases, contrasting it with the
fixed-schema approach of relational databases. Here’s a breakdown of the key points:

Schemalessness in NoSQL Databases

Definition of Schemalessness:

NoSQL databases are typically described as "schemaless," meaning they do not
require a predefined structure for data storage. Unlike relational databases, which
necessitate the definition of tables, columns, and data types before any data can be
stored, NoSQL databases allow for a more flexible approach.

Examples of schemaless databases include:

1. Key-Value Stores: Store data as a simple key-value pair.
2. Document Databases: Allow storage of documents with no
fixed structure.
3. Column-Family Databases: Permit arbitrary data under any
column.
4. Graph Databases: Enable free addition of nodes and edges,
with associated properties.

Advantages of Schemalessness:

1. Flexibility: Developers can store data without having to determine the
structure in advance. This is particularly beneficial in early stages of
development or when requirements are evolving.
2. Handling Changes: As project needs evolve, it’s easy to add or
remove data types without worrying about altering a fixed schema.
3. Dealing with Nonuniform Data: Schemaless databases can
accommodate records that have different sets of fields, avoiding the
issues of sparse tables or meaningless columns.

Implicit Schemas and Their Challenges

Implicit Schema:

1. Although NoSQL databases are schemaless, programs that access this
data often rely on an implicit schema. This means that while there may
not be a formal schema defined in the database, the application code
will still make assumptions about the structure and meaning of the data.
2. For example, an application might expect a field named
"billingAddress" to contain a specific format or data type, which is not
enforced by the database itself.
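
One way to see the implicit schema is to write down the checks that application code would otherwise assume silently. The Python below is a sketch under assumptions: the function name read_billing_city and the expected shape (a non-empty list of objects with a string "city") are chosen for illustration, based on the customer example used in these notes.

```python
def read_billing_city(customer):
    """Return the first billing city, checking the structure the code assumes."""
    addresses = customer.get("billingAddress")
    if not isinstance(addresses, list) or not addresses:
        raise ValueError("expected billingAddress to be a non-empty list")
    first = addresses[0]
    city = first.get("city") if isinstance(first, dict) else None
    if not isinstance(city, str):
        raise ValueError("expected billingAddress[0].city to be a string")
    return city


print(read_billing_city({"billingAddress": [{"city": "Chicago"}]}))  # Chicago
```

The database would accept a record whose billingAddress is a plain string or is missing entirely. Only this code, or whatever other program reads the record, notices the mismatch.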

Problems with Implicit Schemas:

1. Lack of Clarity: Understanding the data structure requires digging
into the application code, which can vary in quality and clarity. This
can make it difficult for developers to grasp what data is available and
how it should be interpreted.
2. Database Ignorance: The database cannot optimize storage and
retrieval based on the implicit schema, nor can it enforce consistent
data handling across different applications. This could lead to
inconsistent data manipulation.

Issues with Multiple Applications:

1. When multiple applications interact with the same database, each may
have different implicit schemas. This inconsistency can lead to
problems if not managed properly.

Comparison to Relational Databases

Schema Flexibility:

1. While advocates of NoSQL argue that relational schemas are inflexible,
relational databases can indeed be modified. SQL commands allow for
the addition of new columns and other schema alterations at any time.
2. However, most developers may not frequently utilize this flexibility,
often opting for a more structured approach.
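
The point that a relational schema can be altered after data exists can be shown with a small sketch. The example uses SQLite through Python's standard sqlite3 module; the table name customer and the column preferred_shipping are made up for illustration.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Martin')")

# Add a column after data already exists; existing rows get NULL for it.
conn.execute("ALTER TABLE customer ADD COLUMN preferred_shipping TEXT")

row = conn.execute(
    "SELECT id, name, preferred_shipping FROM customer"
).fetchone()
print(row)  # (1, 'Martin', None)
```

Existing rows remain valid and report NULL (None in Python) for the new column, which is comparable to old documents in a schemaless store that simply lack a newly introduced field.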

Controlled Schema Changes:

1. Changes to a relational database schema can be done in a controlled
manner, although it may require careful planning to ensure data
integrity. In contrast, making changes in a schemaless database can be
more straightforward, but still requires planning to access both old and
new data effectively.

Aggregate Boundaries:

1. While schemalessness allows flexibility within individual aggregates
(like documents or records), changing the boundaries of these
aggregates (i.e., how data is grouped) can be as complex as altering
schemas in a relational database.

In summary, while the schemaless nature of NoSQL databases offers significant flexibility and the
ability to handle evolving data needs, it introduces challenges, particularly regarding
implicit schemas and data consistency. The advantages of relational databases,
including their ability to enforce schemas and facilitate controlled changes, highlight
the trade-offs involved when choosing between NoSQL and traditional relational
approaches. Understanding these differences is crucial for developers and
organizations when deciding on the best data management strategy for their
applications.

3.4. Materialized Views

This section examines the aggregate-oriented data models of NoSQL databases, focusing on the
advantages and disadvantages of this approach, particularly in relation to querying
data. Here’s a detailed breakdown of the key points discussed:

Advantages of Aggregate-Oriented Data Models

1. Unit of Access:
1. Aggregate-oriented data models group related data into a single unit
(or aggregate). For example, all information related to an order is
stored together, making it efficient to access everything about that
order at once.
2. This design is beneficial for operations that require accessing all data
associated with an aggregate, such as retrieving a full order with all its
details.


Disadvantages of Aggregate Orientation

Challenges with Queries:

1. One significant drawback of aggregate-oriented models arises when
querying for data that isn’t stored within a single aggregate. For
example, if a product manager wants to know the sales of a specific
item over the past few weeks, the database might require scanning
through all order aggregates to gather this information.
2. This can be inefficient, especially in large datasets, as it forces the
system to read potentially every order in the database.
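
The cost of such a query can be illustrated with a short Python sketch. The orders dictionary stands in for a store of order aggregates, and the field names (items, productId, qty) and sample values are assumptions for this example.

```python
# Hypothetical order aggregates, keyed by order ID.
orders = {
    99: {"customerId": 1, "items": [{"productId": 27, "qty": 2},
                                    {"productId": 5, "qty": 1}]},
    545: {"customerId": 2, "items": [{"productId": 27, "qty": 1}]},
    897: {"customerId": 3, "items": [{"productId": 8, "qty": 4}]},
}


def units_sold(product_id):
    """Total quantity sold of one product; reads every order aggregate."""
    total = 0
    for order in orders.values():        # full scan: no aggregate can be skipped
        for item in order["items"]:
            if item["productId"] == product_id:
                total += item["qty"]
    return total


print(units_sold(27))  # 3
```

The work grows with the total number of orders, not with the number of orders that actually contain the product.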

Need for Indexing:

1. While building an index on the product can mitigate this issue
somewhat, it still operates within the constraints of the aggregate
structure, making it challenging to perform certain queries efficiently.

Advantages of Relational Databases

Flexibility in Data Access:

1. Relational databases allow for flexible data access due to their lack of
aggregate structure. They support various ways to query data, enabling
users to obtain information from different perspectives without the
need to reformat the underlying data storage.

Views:

1. Relational databases offer a feature called "views," which are virtual
tables defined by computations over the base tables. When users access
a view, the database dynamically computes the data, allowing for
customized representations of the underlying data.
2. Views abstract away whether the data is derived or base data, but they
can be expensive to compute, especially for complex queries.

Materialized Views:

1. To address performance issues with views, materialized views were
introduced. These are precomputed views stored on disk, providing
quick access to data that is read frequently but can tolerate some
degree of staleness. This approach is useful for optimizing read
performance at the cost of some freshness in data.

Materialized Views in NoSQL Databases

NoSQL Equivalent:

1. While NoSQL databases lack the traditional concept of views, they
often employ similar mechanisms like precomputed and cached queries,
referred to as "materialized views." In NoSQL, these views are crucial
for handling queries that do not align well with the aggregate structure.

Building Materialized Views:

1. There are generally two strategies for building materialized views:

1. Eager Approach: In this method, the materialized view is
updated simultaneously with the base data. For instance, when
a new order is added, the purchase history for each product is
updated concurrently. This is effective when the materialized
views are read frequently and need to remain fresh.
2. Batch Updates: Alternatively, materialized views can be
updated in batches at regular intervals, which can reduce the
overhead during data updates but may result in views that are
stale, depending on the business requirements.
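
The two strategies can be contrasted in a small Python sketch. Everything here is an assumption made for illustration: the orders dictionary plays the role of the base data, orders_by_product is the materialized view, and the function names are invented.

```python
from collections import defaultdict

orders = {}                            # base data: order ID -> order aggregate
orders_by_product = defaultdict(set)   # materialized view: product -> order IDs


def add_order_eager(order_id, product_ids):
    """Eager approach: update the view in the same step as the base data."""
    orders[order_id] = {"products": list(product_ids)}
    for pid in product_ids:
        orders_by_product[pid].add(order_id)


def rebuild_view_batch():
    """Batch approach: recompute the whole view from the base data."""
    view = defaultdict(set)
    for order_id, order in orders.items():
        for pid in order["products"]:
            view[pid].add(order_id)
    return view


add_order_eager(99, [27])
add_order_eager(545, [27, 5])
print(sorted(orders_by_product[27]))  # [99, 545]
```

With the eager function the view is never stale, but every write does extra work. With the batch function writes stay cheap, and the view is only as fresh as the last rebuild.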

Creating Materialized Views:

1. Materialized views can be constructed outside of the database by
reading data, performing computations, and storing the results back
into the database. However, many databases provide built-in support
for creating materialized views, allowing users to define the necessary
computations that the database will execute based on configured
parameters.

Usage of Materialized Views

1. Within Aggregates:

1. Materialized views can also exist within the same aggregate. For
instance, an order document might include a summary of the order,
enabling quick access to summary information without needing to
transfer the entire order document.

2. Column Families:

1. In column-family databases, using separate column families for
materialized views is common. This allows updates to the materialized
view to occur as part of the same atomic operation, ensuring data
consistency.

In summary, while aggregate-oriented data models provide significant benefits in terms of data grouping
and access efficiency, they also pose challenges for querying related data across
multiple aggregates. Relational databases offer advantages in flexibility and data
access through features like views and materialized views. NoSQL databases, while
lacking traditional views, adopt similar strategies for precomputing and caching data
to address the inherent limitations of their aggregate-oriented structures.
Understanding these trade-offs is essential for choosing the right data management
approach based on the specific needs of an application.

3.5. Modeling for Data Access

This section covers various strategies for modeling data aggregates across different types of NoSQL
databases, focusing on customer and order data. Here’s a detailed breakdown of the
key concepts and considerations:


Figure 3.2. Embed all the objects for customer and their orders.

Modeling Data Aggregates

When designing data models, particularly for aggregates like customers and their
orders, it’s crucial to consider:

1. How Data Will Be Read: The access patterns and queries that will be
performed on the data.
2. Side Effects on Data: The implications for related data when modifications
occur, such as adding new orders.

Key-Value Store Model

Embedding Data:

1. In a key-value store, all data related to a customer (including orders)
can be embedded within a single key. For example, a customer object
might contain nested structures for billing addresses, payment methods,
and orders.

Example:

json
{
"customerId": 1,
"customer": {
"name": "Martin",
"billingAddress": [{"city": "Chicago"}],
"payment": [{"type": "debit","ccinfo": "1000-1000-1000-1000"}],
"orders":[{"orderId":99}]
}}

Figure 3.3. Customer is stored separately from Order.

Accessing Data:

1. This model allows for easy access to customer information using a key.
However, when querying specific orders or products, the entire
customer object must be read and parsed on the client side, which can
be inefficient.

Using References

Separating Customer and Order Data:

1. To improve access efficiency, one can switch to a model where
customer and order data are stored separately but linked by references.
2. For example, the customer object would include an orderId reference
that links to related orders, allowing for independent access to order
data.

json
{
"customerId": 1,
"orders": [{"orderId": 99}]}

Benefits of References:

1. This separation facilitates independent querying of orders while
maintaining a link back to the customer, but it introduces complexity:
every time a new order is created, the orderId reference must also be
updated in the customer object.

Aggregate Updates and Analytics

1. Real-Time Business Intelligence:

1. Aggregates can be used for analytics, such as tracking which orders
contain specific products. This denormalization allows for faster access
to critical data, supporting real-time analytics rather than relying on
end-of-day batch processes.

Example of an aggregate update:

json
{
"itemid": 27,
"orders": [99, 545, 897, 678]}

Document Stores

Query Flexibility:

1. In document stores, one can query inside documents, which allows for
the removal of references to orders from the customer object. This
means the customer object does not need to be updated with new
orders, simplifying data management.

json
{
"customerId": 1,
"name": "Martin",
"billingAddress": [{"city": "Chicago"}]}

Attribute-Based Searching:

1. Document stores enable searches like “find all orders that include the
Refactoring Databases product,” emphasizing that the modeling
decision is based on application requirements rather than just database
capabilities.
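
As a rough illustration of querying inside documents, the Python sketch below filters order documents by an attribute of their line items. The list order_documents, the field names orderItems and productName, and the function orders_containing are assumptions for this example. A real document store would run an equivalent query on the server, usually with the help of an index.

```python
# Hypothetical order documents; customers hold no references to them.
order_documents = [
    {"orderId": 99, "customerId": 1,
     "orderItems": [{"productName": "Refactoring Databases"}]},
    {"orderId": 545, "customerId": 2,
     "orderItems": [{"productName": "Another Title"}]},
]


def orders_containing(product_name):
    """Return the IDs of all orders with a line item for the given product."""
    return [
        doc["orderId"]
        for doc in order_documents
        if any(item["productName"] == product_name for item in doc["orderItems"])
    ]


print(orders_containing("Refactoring Databases"))  # [99]
```

Because the order itself carries customerId, the customer document never has to be touched when a new order is stored.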

Column-Family Stores

Figure 3.4. Conceptual view into a column data store


1. Column Ordering:

1. Column-family stores allow for ordered columns, making it important
to design the data model based on query requirements, rather than
write optimization.
2. One common strategy is to store customer and order data in separate
column families, ensuring that references to orders are easily
accessible from the customer column family.
3. Example:

1. Customer column family stores customer information.


2. Order column family stores order details.

Graph Databases

Modeling Relationships:

1. In graph databases, both customers and orders are represented as nodes,
while relationships (e.g., PURCHASED, PAID_WITH) are modeled as
edges connecting these nodes.
2. This structure allows for efficient traversal of relationships, making it
easy to query connections, such as finding all customers who
purchased a specific product.

Figure 3.5. Graph model of e-commerce data



Relationship Queries:

1. For instance, to find customers who purchased a product called
"Refactoring Databases," you would query for that product node and
retrieve all customers connected through a PURCHASED relationship.
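
The relationship query just described can be sketched in Python as follows. This is not a graph database API: the edge list, the incoming index, and the function customers_who_purchased are illustrative assumptions, and "Customer B" and "Customer C" are made-up sample nodes.

```python
from collections import defaultdict

# Edges are (source node, relationship type, target node). Sample data only.
edges = [
    ("Martin", "PURCHASED", "Refactoring Databases"),
    ("Customer B", "PURCHASED", "Refactoring Databases"),
    ("Martin", "PAID_WITH", "debit card"),
    ("Customer C", "PURCHASED", "Another Title"),
]

# Index incoming edges by (target, relationship) once, when edges are inserted.
incoming = defaultdict(list)
for source, rel, target in edges:
    incoming[(target, rel)].append(source)


def customers_who_purchased(product_name):
    """Start at the product node and follow PURCHASED edges backwards."""
    return sorted(incoming.get((product_name, "PURCHASED"), []))


print(customers_who_purchased("Refactoring Databases"))
# ['Customer B', 'Martin']
```

The query touches only the edges attached to the product node, so its cost depends on how many customers bought that product and not on the total number of customers or orders.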

This section highlights the importance of choosing the right data modeling strategy based on how data will
be accessed and the relationships among different entities. Each NoSQL database type
(key-value stores, document stores, column-family stores, and graph databases) offers
unique advantages and trade-offs, influencing how data is structured and queried.
Understanding these models helps in optimizing data access patterns and supporting
efficient data management, especially for use cases like real-time analytics and
complex relationships.

3.6. Key Points


• Aggregate-oriented databases make inter-aggregate relationships more difficult to
handle than intra-aggregate relationships.
• Graph databases organize data into node and edge graphs; they work best for data
that has complex relationship structures.
• Schemaless databases allow you to freely add fields to records, but there is usually
an implicit schema expected by users of the data.
• Aggregate-oriented databases often compute materialized views to provide data
organized differently from their primary aggregates. This is often done with
map-reduce computations.
