
Nosql Mod1

Relational databases are essential for storing persistent data, managing concurrency, and enabling integration among multiple applications. They face challenges such as impedance mismatch with in-memory data structures and complexities in shared database integration. The emergence of NoSQL databases reflects a shift towards more flexible data storage solutions, particularly for large-scale applications and clusters.

Uploaded by

Prerana S A

MODULE 01

1.1 THE VALUE OF RELATIONAL DATABASES

Relational databases are an integral part of computing, offering numerous benefits that are
crucial to revisit for a better understanding of their value.

1.1.1 Getting at Persistent Data

• The primary value of databases lies in their ability to store large amounts of
persistent data, which is essential for long-term data retention.

• Computer architectures generally feature two types of memory:

1. Main Memory: Fast but volatile, meaning data is lost during power outages
or system failures.

2. Backing Store: Larger but slower, used to retain data even in adverse
conditions.

• Backing stores are typically implemented using disks, although modern
implementations may use persistent memory.

• While some applications (e.g., word processors) use file systems to store data,
enterprise applications depend on databases for their advanced capabilities.

• Databases offer greater flexibility than file systems by allowing quick and easy access
to small pieces of data from large datasets.

1.1.2 Concurrency

• Enterprise applications often involve multiple users accessing the same body of data
simultaneously.

• While users usually work on different areas of the data, conflicts can arise when they
attempt to modify the same data.

o Example: Double booking of hotel rooms due to uncoordinated data access.

• Managing concurrency is highly complex, often leading to errors even with careful
programming.

• Relational databases mitigate concurrency issues by controlling all data access
through transactions.

o Transactions simplify coordination by ensuring that only one process can
modify a specific piece of data at a time.

o If a conflict occurs, the database ensures consistency by handling errors and
enabling rollback of incomplete changes.

• Transactions also play a vital role in error handling, allowing changes to be reversed
in case of processing errors, ensuring clean data management.
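The rollback behaviour described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a full concurrency demo: it uses one connection and two sequential booking attempts, and the table names (bookings, audit), the room number, and the guest names are assumptions made up for the example. Each attempt writes an audit row and a booking row in one transaction; when the second booking violates the primary key, the whole transaction, including its audit row, is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (room_id INTEGER PRIMARY KEY, guest TEXT NOT NULL)")
conn.execute("CREATE TABLE audit (note TEXT NOT NULL)")
conn.commit()


def book_room(conn, room_id, guest):
    """Book a room and log the request in one transaction.

    Returns True on success, False if the room is already taken.
    """
    try:
        with conn:  # commit if the block succeeds, roll back if it raises
            conn.execute(
                "INSERT INTO audit (note) VALUES (?)",
                (f"{guest} requested room {room_id}",),
            )
            # The primary key on room_id rejects a second booking of the same room.
            conn.execute(
                "INSERT INTO bookings (room_id, guest) VALUES (?, ?)",
                (room_id, guest),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = book_room(conn, 101, "Asha")
second = book_room(conn, 101, "Ravi")  # same room: the whole transaction is undone

bookings = conn.execute("SELECT room_id, guest FROM bookings").fetchall()
audit_count = conn.execute("SELECT COUNT(*) FROM audit").fetchone()[0]
print(first, second, bookings, audit_count)
```

Only the first attempt leaves any trace in either table, which is the "clean data management" the notes refer to: a failed unit of work does not leave half of its changes behind.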

1.1.3 Integration

• Enterprise ecosystems often require multiple applications, developed by different
teams, to collaborate and share data.

• This collaboration can be challenging as it involves crossing organizational and
technical boundaries.

• Shared database integration is a common solution where multiple applications store
their data in a single database.

o This allows applications to access each other's data seamlessly.

o The database’s concurrency control ensures that data integrity is maintained
even when multiple applications access it simultaneously.

• This approach simplifies inter-application data sharing and reduces complexity in
collaborative environments.

1.1.4 A (Mostly) Standard Model

• The success of relational databases is largely due to their standardized approach to
providing core benefits.

• Developers and database professionals can learn the relational model and apply it
across multiple projects, thanks to this standardization.

• Although there are differences between relational database vendors, the
fundamental mechanisms remain consistent:

o SQL Dialects: Variations exist, but they share a common foundation.

o Transaction Operations: These operate in a similar way across different
databases.

• This consistency allows developers to easily adapt their skills to different relational
database systems, enhancing productivity and reducing learning curves.

1.2 IMPEDANCE MISMATCH

Relational databases, while advantageous, have limitations that have frustrated application
developers since their inception. The concept of impedance mismatch highlights a
fundamental issue in aligning relational databases with in-memory data structures.

Relational Model vs. In-Memory Structures

• Relational Data Model:

o Organizes data into tables (relations) and rows (tuples).

o A tuple is defined as a set of name-value pairs, distinct from its definition in
mathematics and programming languages where it is a sequence of values.

o SQL operations are based on relations, adhering to relational algebra, which is
mathematically elegant.

• Limitation of Relational Values:

o Relational tuples require simple values and cannot handle complex structures
like nested records or lists.

o In contrast, in-memory data structures allow richer, more complex
representations.

• Translation Requirement:

o Storing complex in-memory data structures in a relational database requires
translating them into a relational format; this gap between the two
representations is the impedance mismatch.

o This mismatch necessitates ongoing conversions between two distinct
representations, adding complexity and inefficiency.
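The translation step can be made concrete with a small sketch. The order structure, the two hypothetical tables it is split into (orders and order_items), and the function names are all assumptions for illustration; no database is involved, only the shape conversion that application code or a mapping layer has to perform in both directions.

```python
# A nested in-memory structure: one order holding a list of line items.
order = {
    "id": 99,
    "customer_id": 1,
    "items": [
        {"product_id": 27, "price": 32.45},
        {"product_id": 31, "price": 12.00},
    ],
}


def to_rows(order):
    """Flatten one nested order into rows of simple values for two tables."""
    order_row = (order["id"], order["customer_id"])
    item_rows = [
        (order["id"], item["product_id"], item["price"]) for item in order["items"]
    ]
    return order_row, item_rows


def from_rows(order_row, item_rows):
    """Rebuild the nested structure from the flat rows (the reverse mapping)."""
    order_id, customer_id = order_row
    return {
        "id": order_id,
        "customer_id": customer_id,
        "items": [
            {"product_id": product_id, "price": price}
            for (row_order_id, product_id, price) in item_rows
            if row_order_id == order_id
        ],
    }


order_row, item_rows = to_rows(order)
rebuilt = from_rows(order_row, item_rows)
print(order_row, item_rows, rebuilt == order)
```

Every read and write of the order goes through one of these two conversions, which is the ongoing cost the bullets above describe.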

Historical Context and Evolution

• Rise of Object-Oriented Programming (1990s):

o Object-oriented programming languages gained prominence, offering rich
in-memory data structures.

o Object-oriented databases emerged as an alternative to relational databases,
aiming to replicate in-memory structures directly to disk.

• Decline of Object-Oriented Databases:

o Despite the success of object-oriented programming, object-oriented
databases failed to replace relational databases.

o Relational databases retained dominance due to:

▪ Their role as an integration mechanism.

▪ A mostly standardized data manipulation language (SQL).

▪ A professional divide between application developers and database
administrators.

Addressing Impedance Mismatch

• Object-Relational Mapping (ORM) Frameworks:

o Tools like Hibernate and iBATIS emerged to simplify the translation process
between in-memory structures and relational representations.

o These frameworks implement well-known mapping patterns, reducing
manual work.

• Challenges with ORMs:

o While ORMs reduce the burden of mapping, they introduce new problems:

▪ Ignoring database-specific optimizations can degrade query
performance.

▪ Over-reliance on ORMs may lead to inefficiencies in complex
applications.

Shift in Relational Database Dominance

• Despite the widespread use of ORMs, the mapping problem persists, highlighting the
inherent mismatch between relational databases and in-memory structures.

• Relational databases dominated enterprise computing through the 2000s, but their
supremacy began to face challenges during that decade.

1.3 APPLICATION AND INTEGRATION DATABASES

• Reasons for Relational Database Triumph:

o The primary factor was SQL as an integration mechanism between
applications.

o The database acts as an integration database, with multiple applications,
usually developed by separate teams, storing data in a common database.

o This improves communication as all applications operate on a consistent set
of persistent data.

• Downsides to Shared Database Integration:

o A structure designed to integrate many applications ends up being more
complex than any single application needs.

o If an application wants to make changes to its data storage, it needs to
coordinate with all other applications using the database.

o Different applications have different structural and performance needs, so an
index required by one application may cause a problematic hit on inserts for
another.

o Since each application is usually developed by a separate team, the database
cannot trust applications to update the data in a way that preserves database
integrity, so the database needs to take responsibility for that.

• Application Database Approach:

o Treating the database as an application database means it's only directly
accessed by a single application codebase, which is managed by a single
team.

o Only the team using the application needs to know about the database
structure, making it easier to maintain and evolve the schema.

o Since the application team controls both the database and the application
code, the responsibility for database integrity can be placed in the application
code.

o Interoperability concerns can shift to the interfaces of the application,
allowing for better interaction protocols and providing support for changing
them.

• Shift to Web Services (2000s):

o Applications began using web services over HTTP for integration, enabling a
new form of communication mechanism.

o Web services allowed the use of richer data structures with nested records
and lists, usually represented as XML or JSON.

o Round trips were reduced by putting a rich structure of information
into a single request or response.

• Text vs. Binary Protocols:

o Text-based protocols (like HTTP) are the standard for most use cases, as they
are easier to work with.

o Binary protocols should only be used for highly performance-sensitive
interactions.

• Decoupling Through Application Databases:

o Application databases allow more freedom in choosing a database, as
external systems don’t need to care about how data is stored internally.

o Features like security, typically handled by relational databases, can now be
managed by the application itself.

• Adoption Trends:

o Despite the flexibility of application databases, most teams continued using
relational databases due to familiarity and reliability.

o Application databases didn't lead to a major shift away from relational
databases at that time.

o The cracks in relational database dominance came from other sources.

1.4 ATTACK OF THE CLUSTERS

• Dot-com Bubble and the 2000s Growth:

o The 2000s saw large web properties dramatically increase in scale, despite
the bursting of the 1990s dot-com bubble.

o Websites began tracking activity and structure in great detail, generating large
sets of data (links, social networks, activity in logs, mapping data).

o With the growth of data came the growth in users, and the largest websites
started serving vast numbers of visitors.

• Scaling to Handle Increased Data and Traffic:

o To handle the increase in data and traffic, websites had two choices: scale up
or scale out.

o Scaling up involves using bigger machines with more processors, disk storage,
and memory, but becomes expensive and limited as size increases.

o Scaling out involves using many small machines in a cluster, which is cheaper
and more resilient: individual machines may fail, but the cluster as a whole can
keep running, which provides high reliability.

• Challenges of Using Relational Databases on Clusters:

o Relational databases are not designed to be run on clusters.

o Clustered relational databases, such as Oracle RAC or Microsoft SQL Server, use
a shared disk subsystem: a cluster-aware file system writes to a highly available
disk subsystem, but that disk subsystem remains a single point of failure.

o Sharding relational databases (splitting data across multiple servers) solves
some scaling issues but introduces new problems:

▪ The application must track which server to talk to for each piece of
data.

▪ Querying, referential integrity, transactions, and consistency controls
across shards are lost.

▪ People who have done this often describe it as “unnatural acts.”

• Licensing Costs:

o Commercial relational databases are usually priced on a single-server basis,
leading to high costs when running on a cluster.

o This mismatch led to frustrating negotiations with purchasing departments.

• Influence of Google and Amazon:

o The challenges of relational databases on clusters led organizations to
consider alternatives.

o Google and Amazon, at the forefront of running large clusters and capturing
huge amounts of data, were influential in pushing the idea of databases
designed specifically for clusters.

o Both companies published influential papers: BigTable (Google) and Dynamo
(Amazon).

• Relevance of Google and Amazon’s Solutions:

o Although the scale of Amazon and Google may seem too large for most
organizations, many are beginning to face similar challenges with growing
data and traffic.

o As more information about Google and Amazon’s solutions leaked out, other
organizations began exploring databases explicitly designed for clusters.

• The Threat from Clusters:

o While earlier challenges to relational databases were less significant, the
threat posed by clusters was real and serious.

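The sharding drawbacks listed in this section (the application must know which server holds each piece of data, and queries across shards lose database support) can be sketched as follows. This is a toy model under stated assumptions: the shard names are hypothetical, each "server" is just an in-memory dictionary, and hash-modulo routing is one simple placement rule among several.

```python
import hashlib

SHARDS = ["db-0", "db-1", "db-2"]          # hypothetical server names
stores = {name: {} for name in SHARDS}     # each dict stands in for one server


def shard_for(key: str) -> str:
    """Pick a shard from a stable hash of the key (the routing the app must own)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


def put(key, value):
    stores[shard_for(key)][key] = value


def get(key):
    return stores[shard_for(key)].get(key)


def find_by_name(name):
    """A query that is not by key must visit every shard and merge the results."""
    return [
        value
        for store in stores.values()
        for value in store.values()
        if value["name"] == name
    ]


put("customer:1", {"name": "Martin"})
put("customer:2", {"name": "Ana"})
print(shard_for("customer:1"), get("customer:1"), find_by_name("Ana"))
```

Nothing in this sketch enforces referential integrity or a transaction spanning two shards; that work falls to the application, which is why sharding a relational database is described above as awkward.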
1.5 THE EMERGENCE OF NOSQL

• Origin of the Term "NoSQL":

o The term "NoSQL" first appeared in the late 90s as the name of an open-
source relational database by Carlo Strozzi.

o This early "NoSQL" database did not use SQL and was manipulated through
UNIX shell scripts, with data stored as ASCII files.

o Despite the name, Strozzi’s NoSQL had no influence on the modern databases
referred to as NoSQL.

• The Modern Use of "NoSQL":

o The term "NoSQL" gained prominence after a 2009 meetup in San Francisco
organized by Johan Oskarsson.

o The meetup was inspired by the examples of BigTable and Dynamo, with
discussions about alternative data storage solutions.

o Johan Oskarsson chose "NoSQL" as the name for the meetup, which became
widely used to describe this technology trend.

• Characteristics of NoSQL Databases:

o NoSQL databases don’t use SQL, although some have query languages similar
to SQL (e.g., Cassandra’s CQL).

o They are generally open-source projects, although some closed-source
systems are also referred to as NoSQL.

o Most NoSQL databases are designed to run on clusters, and their data models
and consistency approaches are suited to this environment.

o Not all NoSQL databases are cluster-oriented; for example, graph databases
use a distribution model similar to relational databases but with a different
data model.

o NoSQL databases are typically developed for the needs of large-scale web
estates, often created in the early 21st century.

o They operate without a schema, allowing for flexible data storage and the
addition of fields without predefined structure, making them ideal for
nonuniform and custom data.

• Definition and Interpretation of "NoSQL":


o "NoSQL" does not have a strict definition, but it is commonly understood as
referring to open-source, distributed, non-relational databases developed
mainly in the early 21st century.

o The term "Not Only SQL" is often used, though it has issues, such as not
differentiating from relational databases that can also use non-SQL elements.

o It is better to view NoSQL as a movement rather than a technology.

• NoSQL as a Movement:

o While relational databases will continue to be widely used, NoSQL represents
a shift in how data storage is approached.

o The concept of polyglot persistence has emerged, where different data
storage technologies are used based on specific circumstances.

o Organizations will likely use a mix of data stores for different purposes,
depending on the nature of the data and how it needs to be manipulated.

• Application vs. Integration Databases:

o NoSQL databases are better suited as application databases rather than
integration databases.

o The shift from integration databases to application databases, where data is
encapsulated in services, is seen as a positive trend.

• Two Primary Reasons for Considering NoSQL:

o Handling Big Data on Clusters: NoSQL is useful for managing large-scale data
access that requires a cluster for performance and scalability.

o Improving Application Development Productivity: NoSQL can simplify
database access and improve development productivity, even without the
need for scaling beyond a single machine.

AGGREGATE DATA MODELS

• Data Model: A data model describes how we interact with data in a database,
distinct from a storage model, which details how data is stored and manipulated
internally.

• Ideal vs. Practical: In an ideal world, users would be unaware of the storage model,
but in practice, some understanding of it is needed for good performance.

• Data Model in Context: In this book, "data model" refers to the way a database
organizes data (metamodel), which is different from the specific data in an
application (like an entity-relationship diagram).
• Relational Data Model: The relational model, dominant for the past decades, is
visualized as a set of tables (like a spreadsheet) where rows represent entities and
columns contain single values. Relationships are formed when a column refers to
another row in the same or a different table.

• Shift in NoSQL: NoSQL introduces a shift from the relational model. NoSQL solutions
use different data models, categorized into four types: key-value, document, column-
family, and graph.

• Aggregate Orientation: The key-value, document, and column-family models share a
characteristic known as "aggregate orientation," which is the focus of this chapter.

2.1 AGGREGATES

• Relational Model vs. Aggregate Orientation:

o The relational model divides data into tuples (rows), where each tuple is a
limited data structure that captures a set of values. It is not possible to nest
tuples within another tuple or store lists of values or tuples within a tuple.

o Aggregate orientation differs in that it recognizes that complex data
structures, like lists or nested records, are often needed for operations. This
allows thinking of data as complex records and enables nesting of lists and
records within each other.

o Aggregate databases (e.g., key-value, document, and column-family
databases) use a more complex record structure.

o The term "aggregate" comes from Domain-Driven Design, where an
aggregate is a collection of related objects treated as a unit for data
manipulation and consistency.

o The aggregate approach is beneficial for operations in clusters because
aggregates naturally support replication and sharding.

o Aggregates simplify application development by enabling easier data
manipulation via aggregate structures.

2.1.1 Example of Relations and Aggregates

• E-commerce Scenario:

o Imagine an e-commerce website with users, product catalog, orders, shipping
addresses, billing addresses, and payment data.

o Relational Model Example:

▪ The data model involves tables for customers, orders, products,
addresses, and payments. Each table enforces proper normalization
and referential integrity, preventing data duplication and ensuring
relationships via foreign keys.

o Aggregate-Oriented Model Example:

▪ In NoSQL, data can be modeled in JSON format where aggregates like
"customer" and "order" are treated as collections of related data.

▪ The customer aggregate contains a list of billing addresses, while the
order aggregate contains order items, shipping addresses, and
payments.

// in customers
{
"id": 1,
"name": "Martin",
"billingAddress": [{"city": "Chicago"}]
}

// in orders
{
"id": 99,
"customerId": 1,
"orderItems": [{"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}],
"shippingAddress": [{"city": "Chicago"}],
"orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft",
"billingAddress": {"city": "Chicago"}}]
}

▪ The key idea is that the customer and order aggregates are treated as
independent units.

▪ The payment information, including billing address, is embedded
within the order.

▪ This approach differs from the relational model, where the address would be
stored once as its own row and referenced by ID; here the same address is
copied as a value into each aggregate that uses it.

▪ The relationship between the customer and order is managed across
aggregates, with customer and order linked through the customerId in
the order aggregate.

▪ Alternative Model:

▪ All orders could be included within the customer aggregate.

▪ Example:

{
"customer": {
"id": 1,
"name": "Martin",
"billingAddress": [{"city": "Chicago"}],
"orders": [{
"id": 99,
"customerId": 1,
"orderItems": [{"productId": 27, "price": 32.45,
"productName": "NoSQL Distilled"}],
"shippingAddress": [{"city": "Chicago"}],
"orderPayment": [{"ccinfo": "1000-1000-1000-1000",
"txnId": "abelif879rft", "billingAddress": {"city": "Chicago"}}]
}]
}
}

▪ Aggregate boundaries depend on how the data is accessed. If
accessing customer data with all orders is common, the customer
aggregate might include orders; otherwise, they might be kept
separate.

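The two layouts in this example hold the same information grouped differently. The sketch below, which assumes the data is already loaded as Python dictionaries and adds a hypothetical second order (id 100) so the grouping is visible, converts the separate customer and order aggregates into the alternative layout where orders are embedded in the customer.

```python
customers = [
    {"id": 1, "name": "Martin", "billingAddress": [{"city": "Chicago"}]},
]
orders = [
    {"id": 99, "customerId": 1,
     "orderItems": [{"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}]},
    # Hypothetical second order, added only to make the grouping visible.
    {"id": 100, "customerId": 1,
     "orderItems": [{"productId": 31, "price": 12.00, "productName": "Example Book"}]},
]


def embed_orders(customers, orders):
    """Return new customer aggregates, each carrying its own list of orders."""
    orders_by_customer = {}
    for order in orders:
        orders_by_customer.setdefault(order["customerId"], []).append(order)
    return [
        {**customer, "orders": orders_by_customer.get(customer["id"], [])}
        for customer in customers
    ]


embedded = embed_orders(customers, orders)
print([order["id"] for order in embedded[0]["orders"]])
```

With the embedded layout, one lookup of the customer returns every order; with the separate layout, an order can be read on its own without loading the customer. Which is better depends on the access pattern, as the last bullet notes.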
2.1.2 Consequences of Aggregate Orientation

• Relational Model Limitations:

o Relational databases capture data elements and relationships but do not
inherently consider aggregates.

o Aggregates, in aggregate-oriented models, allow us to focus on how data is
used, making it easier to manipulate and distribute data in clusters.

o Relational models do not differentiate between aggregation relationships and
regular relationships, making it harder to optimize data storage and
distribution.

o Semantics of Aggregates:

▪ The semantics of what makes an aggregate relationship distinct from
other relationships are not well-defined in relational modeling, and
where they exist, they vary.

▪ Aggregate-oriented databases have clearer semantics since they focus
on units of interaction with the data storage, making data
management more efficient.

• Challenges in Aggregate-Oriented Databases:

o Relational databases are considered "aggregate-ignorant" because they do
not have an aggregate concept in their data model.

o Aggregate-ignorant databases (like relational and graph databases) are not
inherently bad; they are often preferred when data is accessed in many
different ways or when aggregate boundaries are difficult to define.

o Advantages of Aggregate-Oriented Databases:

▪ Aggregate models are ideal when the same data needs to be
manipulated together, especially when running on clusters.

▪ Aggregates help minimize the number of nodes queried, making them
useful in distributed systems.

• Impact on Transactions:

o Relational databases support ACID (Atomic, Consistent, Isolated, Durable)
transactions across multiple tables, which allow manipulating any
combination of rows in one transaction.

o Aggregate-oriented databases typically support atomic operations within a
single aggregate, not spanning multiple aggregates.

o ACID in NoSQL:

▪ While it’s true that NoSQL databases may not support ACID
transactions across multiple aggregates, they often support atomic
operations within a single aggregate.

▪ Application code often manages atomicity across multiple aggregates
if necessary.

▪ Some aggregate-ignorant databases (like graph databases) do support
ACID transactions similar to relational databases.

▪ The focus on consistency goes beyond whether a database is ACID-
compliant, as the implementation and management of consistency in
databases are more complex.
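The difference between atomicity inside one aggregate and coordination across aggregates can be sketched with a toy in-memory store. Everything here is an assumption for illustration: the AggregateStore class, the key names, and the loyalty-points scenario are hypothetical and do not model any particular NoSQL product. An update to a single aggregate is applied to a copy and swapped in only if it succeeds; a change touching two aggregates is two separate updates, so the application undoes the first one itself when the second fails.

```python
import copy
import threading


class AggregateStore:
    """Toy store in which each update to ONE aggregate is all-or-nothing."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, aggregate):
        with self._lock:
            self._data[key] = copy.deepcopy(aggregate)

    def get(self, key):
        with self._lock:
            return copy.deepcopy(self._data[key])

    def update(self, key, change):
        with self._lock:
            draft = copy.deepcopy(self._data[key])
            change(draft)              # may raise; the stored value is untouched
            self._data[key] = draft    # swap in the new version in one step


def add_points(customer, points):
    customer["points"] += points       # modify the draft first...
    if customer["points"] < 0:
        raise ValueError("points cannot go negative")  # ...then fail


def pay_order(store, order_key, customer_key, points):
    """Cross-aggregate change: two atomic updates plus manual compensation."""
    store.update(order_key, lambda order: order.update(status="paid"))
    try:
        store.update(customer_key, lambda customer: add_points(customer, points))
    except ValueError:
        # No multi-aggregate transaction exists, so the application reverts step one.
        store.update(order_key, lambda order: order.update(status="new"))
        raise


store = AggregateStore()
store.put("customer:1", {"name": "Martin", "points": 0})
store.put("order:99", {"customerId": 1, "status": "new"})
store.put("order:100", {"customerId": 1, "status": "new"})

pay_order(store, "order:99", "customer:1", 10)
try:
    pay_order(store, "order:100", "customer:1", -50)
except ValueError:
    pass
print(store.get("order:99"), store.get("order:100"), store.get("customer:1"))
```

Between the two updates in pay_order, another reader could observe the order as paid while the customer is unchanged. That window is exactly what a relational ACID transaction would close and what the application has to tolerate or manage here.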

2.2 KEY-VALUE AND DOCUMENT DATA MODELS

• Key-Value and Document Databases as Aggregate-Oriented:

o Both key-value and document databases are primarily constructed through
aggregates.

o Each aggregate in these databases has a key or ID used to access the data.

• Key-Value Databases:

o In key-value databases, the aggregate is opaque to the database; it is just a
large blob of mostly meaningless bits.

o The advantage of opacity is that we can store any kind of data in the
aggregate, with the database only imposing general size limits.

o Key-value databases typically support lookup operations based on the key
alone.

• Document Databases:

o In contrast, document databases are able to recognize a structure within the
aggregate.

o Document databases impose limits on what can be stored, defining allowable
structures and types.

o The trade-off is that document databases offer more flexibility in data access:

▪ Queries can be made based on fields in the aggregate.

▪ It is possible to retrieve parts of the aggregate instead of the entire
structure.

▪ Document databases can create indexes based on the contents of the
aggregate.

• Blurry Line Between Key-Value and Document Models:

o The distinction between key-value and document databases is not always
clear.

o People often include an ID field in a document database to enable key-value
style lookups.

o Some key-value databases allow additional data structures beyond opaque
aggregates:

▪ Riak: Allows metadata addition for indexing and inter-aggregate links.

▪ Redis: Allows breaking down the aggregate into lists or sets.

▪ Solr Integration: Key-value databases may integrate search tools (e.g.,
Solr) to enable querying by adding search functionality on aggregates
stored in JSON or XML structures.

• General Distinction:

o The primary distinction still holds:

▪ Key-Value Databases: Expect to look up aggregates using a key.

▪ Document Databases: Expect to query based on the internal structure
of the document, which may be a key or another field.
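The distinction drawn in this section can be sketched with plain Python containers. This is an illustration only: the dictionary standing in for a key-value store, the list standing in for a document collection, and the find and project helpers are assumptions, not the API of any real database.

```python
import json

# Key-value style: the store holds opaque bytes and supports lookup by key only.
kv_store = {}
kv_store["order:99"] = json.dumps(
    {"id": 99, "customerId": 1, "total": 32.45}
).encode("utf-8")

blob = kv_store["order:99"]         # the store cannot look inside this value
order_from_kv = json.loads(blob)    # the application gives the bytes meaning

# Document style: the store understands the structure of each aggregate.
documents = [
    {"id": 99, "customerId": 1, "total": 32.45},
    {"id": 100, "customerId": 2, "total": 10.00},
    {"id": 101, "customerId": 1, "total": 5.50},
]


def find(docs, **criteria):
    """Query by fields inside the aggregate."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


def project(docs, *fields):
    """Retrieve only part of each aggregate."""
    return [{field: d[field] for field in fields} for d in docs]


customer_1_orders = find(documents, customerId=1)
ids_only = project(customer_1_orders, "id")
print(type(blob).__name__, order_from_kv["id"], ids_only)
```

A real document database would typically back a query like find with an index on the field rather than a scan, which is the indexing capability mentioned in the bullets above.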

2.3 COLUMN-FAMILY STORES

• BigTable and Column-Family Databases:

o Google's BigTable was an early influential NoSQL database, and it influenced
later databases like HBase and Cassandra.

o BigTable’s structure is often mistaken for a table, but it is actually a two-level
map.

o Bigtable-style databases are often referred to as column stores, although the
term "column store" has been used for other types of databases (like C-Store)
in the past.

• Column Stores (Pre-NoSQL):

o Pre-NoSQL column stores used the relational model and SQL but focused on
storing data physically by columns, rather than rows.

o The primary benefit of column storage is for scenarios where writes are rare,
but there is a need to read a few columns across many rows.

o Column stores keep groups of columns for all rows together, whereas row-based
storage keeps each row together, which favors write performance.

• Bigtable and Column-Family Databases:

o Bigtable and its descendants (HBase, Cassandra) store groups of columns
(called column families) together, diverging from C-Store by abandoning the
relational model and SQL.

o These databases are called column-family databases.

• Two-Level Aggregate Structure:

o Column-family databases use a two-level structure:

1. The first level is a row identifier (the key), representing the entire
aggregate (e.g., a customer).

2. The second level consists of columns, where specific values can be
accessed.

o Operations can target the entire row or specific columns within the row (e.g.,
get('1234', 'name')).

• Column Families:

o Columns are organized into column families.

o Each column must belong to a single column family. A column is the unit of
access, and the data in a column family is assumed to be accessed together.

• Data Structure Representation:

o Row-oriented: Each row is an aggregate (e.g., customer ID 1234), and column
families represent different chunks of data (e.g., profile, order history).

o Column-oriented: Each column family defines a record type (e.g., customer
profiles), and rows represent records in all column families.

• Two-Dimensional Quality of Column-Family Databases:

o Column-family databases have a two-dimensional structure due to the
combination of rows and column families, which influences storage and
access behavior.

• Cassandra’s Variation:

o Cassandra has a unique approach where a row only exists in one column
family, but the column family may contain supercolumns (nested columns),
which are similar to classic Bigtable column families.

o In Cassandra, you can freely add new columns to rows, but adding new
column families is less frequent and may require stopping the database.

• Wide vs. Skinny Rows:

o Skinny rows have fewer columns, and each column is used across many rows.

o Wide rows have many columns (potentially thousands), and each row may
have very different columns.

o Wide column families are used to model lists, with each column representing
one element of the list.

• Sort Order in Column Families:

o Column families can define a sort order for their columns.

o For example, orders could be keyed by a concatenation of date and ID (e.g.,
20111027-1001), allowing for range queries based on the order key.

o While wide column families may define a sort order, there is no technical
restriction on combining field-like and list-like columns in the same column
family, though doing so could complicate the sorting.
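The two-level map and the sorted order keys described in this section can be sketched with nested dictionaries. This is a toy model under assumptions: the family names (profile, orders), the stored values, and the three-argument get are made up for the example (the notes' own get('1234', 'name') takes only a row key and a column), and a real column-family store would keep columns physically sorted rather than sorting on each read.

```python
# Row key -> column family -> column key -> value.
rows = {
    "1234": {
        "profile": {"name": "Martin", "billingAddress": "Chicago"},
        # A wide column family used as a list: one column per order,
        # keyed by date plus order ID so that keys sort chronologically.
        "orders": {
            "20111215-1003": "order 1003",
            "20111027-1001": "order 1001",
            "20111101-1002": "order 1002",
        },
    }
}


def get(row_key, family, column=None):
    """Fetch a whole column family of a row, or one column inside it."""
    columns = rows[row_key][family]
    return columns if column is None else columns[column]


def column_range(row_key, family, start, end):
    """Return (column key, value) pairs with start <= key <= end, in key order."""
    columns = rows[row_key][family]
    return [(key, columns[key]) for key in sorted(columns) if start <= key <= end]


name = get("1234", "profile", "name")
late_2011 = column_range("1234", "orders", "20111101", "20111231")
print(name, late_2011)
```

Because the date is the leading part of each order key, a plain string comparison is enough to select a date range, which is the point of the concatenated key in the bullets above.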

2.4 SUMMARIZING AGGREGATE-ORIENTED DATABASES

• Overview of Aggregate-Oriented Data Models:

o The three types of aggregate-oriented data models share the concept of an
aggregate indexed by a key for lookup.

o This aggregate is essential for running on a cluster, as the database ensures
that all data for an aggregate is stored together on a single node.

o The aggregate serves as the atomic unit for updates, providing basic
transactional control, though this control is limited.

• Differences Between Models:

1. Key-Value Data Model:

▪ The aggregate is treated as an opaque whole, meaning you can only
perform key lookups for the entire aggregate.

▪ Queries and partial retrievals are not possible.

2. Document Model:

▪ The aggregate is transparent to the database, enabling queries and
partial retrievals.

▪ However, because the document lacks a schema, the database cannot
optimize the storage and retrieval of parts of the aggregate based on
its structure.

3. Column-Family Model:

▪ The aggregate is divided into column families, treating these column
families as distinct units within the row aggregate.

▪ This imposes some structure on the aggregate, but it allows the
database to take advantage of the structure for improved
accessibility.

MORE DETAILS ON DATA MODELS

3.1 RELATIONSHIPS

• Aggregates and Their Usage:

o Aggregates are useful for grouping data that is commonly accessed together.

o In some cases, data related to an entity (e.g., a customer and their orders)
may be accessed differently by various applications.

o Some applications prefer to combine the customer and order history into a
single aggregate when accessing the customer, while others prefer to treat
orders as independent aggregates.

o In such cases, separate customer and order aggregates are needed, but a
relationship between them is essential.

• Linking Aggregates:

o The simplest way to link aggregates (like customer and order) is by


embedding the customer ID within the order’s aggregate data.

o This allows for the retrieval of customer data by reading the order, extracting
the customer ID, and then querying the customer record separately.

o While this works, the database remains unaware of the relationship between
the aggregates, which can be important in some scenarios.

o Many databases, including key-value stores, allow you to make relationships


visible. For example, Riak (a key-value store) allows link information to be
stored in metadata, enabling partial retrieval and link-walking.

• Handling Updates:

o Aggregate-oriented databases treat the aggregate as the unit of data
retrieval. This means atomicity is only supported within a single aggregate.

o When updating multiple aggregates, failure management is needed, as there
are no guarantees for consistency across aggregates during updates.

o Relational databases, on the other hand, provide ACID guarantees for
modifying multiple records within a single transaction, which helps maintain
consistency across different rows.
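
One way an application might manage failure across two aggregates is a compensating action, sketched below under simplifying assumptions: a dictionary stands in for the store, each single-key write is treated as atomic, and the `fail_second_write` flag is a hypothetical switch used only to simulate a failure. This is not a substitute for a real transaction; it only shows where the responsibility moves.

```python
import copy

store = {"customer:1": {"name": "Martin", "orderIds": []}}


def place_order(store, customer_key, order_key, order, fail_second_write=False):
    """Write two aggregates; undo the first write if the second one fails."""
    store[order_key] = order                       # write 1: atomic on its own
    try:
        if fail_second_write:                      # simulated failure
            raise RuntimeError("customer write failed")
        customer = copy.deepcopy(store[customer_key])
        customer["orderIds"].append(order["orderId"])
        store[customer_key] = customer             # write 2: atomic on its own
    except RuntimeError:
        del store[order_key]                       # compensating action
        return False
    return True


ok = place_order(store, "customer:1", "order:99", {"orderId": 99})
failed = place_order(store, "customer:1", "order:100", {"orderId": 100},
                     fail_second_write=True)
print(ok, failed)
```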

• Challenges with Relationships:

o Aggregate-oriented databases become awkward when operations involve
multiple aggregates. The lack of built-in support for managing relationships
across aggregates makes them less suited for such tasks.

o For data involving complex relationships, relational databases might be a
better choice than NoSQL stores, especially since relational databases are
optimized for managing complex relationships with joins.

o However, relational databases also face performance issues when dealing
with multiple joins and complex queries.

• Introducing a New Category:

o These limitations motivate another category of databases, often
classified under NoSQL, that is designed for complex relationships
between data: graph databases.

3.2 GRAPH DATABASES

• Introduction to Graph Databases:

o Graph databases are unique within the NoSQL landscape. Most NoSQL
databases focus on large records with simple connections, driven by the need
to run on clusters. In contrast, graph databases are motivated by frustrations
with relational databases and feature a different model—small records with
complex interconnections.

o In graph databases, the term "graph" refers to a data structure made up of
nodes connected by edges, not a bar chart or histogram.

• Graph Structure:

o The data structure in a graph database consists of small nodes (often just a
name) connected by rich, complex interconnections (edges). This structure
allows for advanced queries such as “find books in the Databases category
written by someone whom a friend of mine likes.”

o Graph databases are ideal for handling data with complex relationships, such
as social networks, product preferences, or eligibility rules.

• Graph Database Data Model:

o The fundamental model is simple: nodes connected by edges (also called
arcs).

o Different graph databases have variations in how data is stored within nodes
and edges:

▪ FlockDB: Simple nodes and edges without attributes.

▪ Neo4J: Allows attaching Java objects as properties to nodes and
edges, in a schemaless fashion.

▪ Infinite Graph: Stores Java objects (subclasses of built-in types) as
nodes and edges.

• Querying Graph Databases:

o Once a graph is created, you can query it using graph-specific operations.

o Unlike relational databases that use foreign keys and joins to navigate
relationships (which can be expensive), graph databases make relationship
traversal cheap and efficient.

o A major difference is that graph databases shift much of the work of
navigating relationships to insert time, improving performance for queries
but making insert operations slower.

• Graph Database Queries:

o Queries in graph databases often involve navigating relationships. For
example, “tell me all the things that both Anna and Barbara like.”

o While ID-based lookups can be used to start queries, the emphasis is on
using the edges (relationships) to find relevant data.
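
The "both Anna and Barbara like" query can be sketched with adjacency sets in plain Python. The liked items and the `liked_by_both` helper are made up for illustration; a real graph database would express this in its own query language, but the idea of starting from two nodes and following their outgoing edges is the same.

```python
# Directed "likes" edges stored as adjacency sets (hypothetical data).
likes = {
    "Anna": {"Databases", "Refactoring", "Chess"},
    "Barbara": {"Databases", "Chess", "Hiking"},
}


def liked_by_both(graph, first, second):
    """Follow each person's outgoing edges and keep the shared targets."""
    return graph.get(first, set()) & graph.get(second, set())


print(sorted(liked_by_both(likes, "Anna", "Barbara")))
```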

• Differences from Aggregate-Oriented Databases:

o Graph databases are designed for handling relationships, making them
fundamentally different from aggregate-oriented databases.

o They are more likely to run on single servers rather than being distributed
across clusters.

o ACID transactions in graph databases cover multiple nodes and edges to
ensure consistency, unlike aggregate-oriented databases, where atomicity is
limited to individual aggregates.

o Graph databases reject the relational model and share a rise in popularity
alongside the broader NoSQL movement.

3.3 SCHEMALESS DATABASES

• Schemalessness is a common theme across all forms of NoSQL databases.

• In relational databases, you must define a schema before storing data. A schema
includes:

o Defined structure of tables.

o Columns that exist.

o Data types each column can hold.

• In NoSQL databases, storing data is more casual and flexible:

o Key-value store: You can store any data under a key.

o Document database: No restrictions on the structure of documents.

o Column-family databases: You can store any data under any column.

o Graph databases: You can freely add new edges and properties to nodes and
edges.

• Benefits of Schemalessness:

o Offers freedom and flexibility by removing the need to figure out in advance
what data is necessary.

o You can easily change data storage as you learn more about your project.

o New things can be added easily as they are discovered.

o If you no longer need certain data, you can stop storing it without worrying
about losing old data (unlike relational databases when deleting columns).

o Nonuniform data (where each record has a different set of fields) is handled
easily:

▪ Relational databases force all rows into a fixed schema, leading to
issues like sparse tables or meaningless columns (e.g., custom column
4).

▪ Schemalessness avoids these issues, allowing each record to contain
only the necessary data.

• Drawbacks of Schemalessness:

o When programs access data, they rely on an implicit schema, which is:

▪ Assumptions about data structure within the program’s code.

▪ For example, code that reads a field named “qty” assumes it holds a
quantity; nothing in the stored data says so unless the program is
written to treat it that way.

▪ The database does not validate data or ensure consistency across
different applications.

o This leads to some issues:

▪ To understand the data structure, you must dive into the application
code to deduce the schema.

▪ If the application code is poorly structured, this could be difficult.

▪ The database remains unaware of the schema, making it difficult for
the database to optimize storage and retrieval.
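
To make the idea of an implicit schema concrete, the sketch below writes the assumptions down in application code, since a schemaless database will not check them. The field names, the expected types in `ORDER_ITEM_SCHEMA`, and the `check_order_item` helper are all hypothetical; this is one possible way to make the assumptions visible, not a feature of any particular store.

```python
# Hypothetical implicit schema: field name -> expected Python type.
ORDER_ITEM_SCHEMA = {"productId": int, "qty": int, "price": float}


def check_order_item(item):
    """Return a list of problems; an empty list means the item fits."""
    problems = []
    for field, expected in ORDER_ITEM_SCHEMA.items():
        if field not in item:
            problems.append(f"missing field: {field}")
        elif not isinstance(item[field], expected):
            problems.append(f"wrong type for {field}")
    return problems


print(check_order_item({"productId": 27, "qty": 2, "price": 32.45}))
print(check_order_item({"productId": 27, "quantity": 2, "price": 32.45}))
```

The second call shows the typical failure: another application stored `quantity` instead of `qty`, and only the reading program can notice.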

• Relational Databases vs. Schemaless Databases:

o Relational databases use a fixed schema to maintain structure and
consistency across applications.

o NoSQL databases move the schema to the application code accessing the
data, which can cause issues when multiple applications interact with the
same database.

• Solutions to Problems:

o Encapsulation: Encapsulate all database interactions within a single
application and integrate it with others via web services.

o Aggregate Partitioning: Clearly delineate different sections of the database
(e.g., different sections in a document database or column families in a
column-family database) for different applications.

• Changing the Schema:

o Relational databases can change schemas dynamically using standard SQL
commands.

o Nonuniform data can be handled in relational databases by adding new
columns as needed.

o Changing the schema in a relational database can be done in a controlled
way.

• Impact of Schemalessness on Database Changes:

o Schemalessness offers flexibility within an aggregate (collection of related
data), but changing aggregate boundaries requires a migration that is just as
complex as altering a relational schema.

o Managing changes in a schemaless database requires careful planning so
that both old and new data remain easy to access.

o Despite its flexibility, the impact of schema changes remains significant, and
migrations must be handled properly.
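
As a small example of keeping both old and new data readable, the sketch below assumes a hypothetical change in which a field once stored as `qty` is now stored as `quantity`. The `read_quantity` helper is an illustration of the planning involved, not a prescribed migration technique.

```python
def read_quantity(item):
    """Accept both the newer 'quantity' field and the older 'qty' field."""
    if "quantity" in item:
        return item["quantity"]
    return item["qty"]


old_item = {"productId": 27, "qty": 2}        # written before the change
new_item = {"productId": 27, "quantity": 3}   # written after the change
print(read_quantity(old_item), read_quantity(new_item))
```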

3.4 MATERIALIZED VIEWS

• Aggregate-Oriented Data Models:

o They are useful for accessing data like orders as a single unit, but they have
limitations when you need to access data differently, like querying product
sales over time.

o This leads to the problem of potentially needing to read every order to
answer a specific query.

• Relational Databases:

o Lack an aggregate structure, which allows data to be accessed in different
ways.

o Provide views, which are virtual tables defined by computations over base
tables.

o Views compute data dynamically, but some can be expensive to compute.

• Materialized Views:

o Views computed in advance and cached on disk to reduce computation time
for frequently read data.

o Useful for data that is read heavily but can tolerate being somewhat stale.

• Materialized Views in NoSQL Databases:

o NoSQL databases do not have views, but they may have precomputed and
cached queries.

o The term "materialized view" is reused to describe precomputed and cached
data.

o Materialized views are crucial in aggregate-oriented databases as many
queries may not fit well with the aggregate structure.

o NoSQL databases often use map-reduce computations to create materialized
views (to be discussed in Chapter 7).

• Building Materialized Views:

o Two strategies:

1. Eager Approach:

▪ Materialized view is updated at the same time as the base
data.

▪ For example, adding an order would update the product's
purchase history aggregate.

▪ This approach is suitable when materialized views are read
frequently, and you want them to be as fresh as possible.

▪ The application database approach helps ensure updates to
base data also update materialized views.

2. Batch Update Approach:

▪ Materialized views are updated at regular intervals through
batch jobs.

▪ The freshness of materialized views depends on the business
requirements and how stale they can be.

▪ You can compute and save the materialized view manually or
use the database to build the view itself, with the computation
executed as needed based on configured parameters.

▪ This approach works well for incremental map-reduce
computations.
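
The two strategies can be contrasted in a short sketch. It assumes an in-memory `orders` dictionary as the base data and a per-product revenue total as the materialized view; the function names and sample prices are hypothetical. The eager function touches the view on every write, while the batch function recomputes it from scratch whenever a job runs.

```python
from collections import defaultdict

orders = {}                       # base data: orderId -> order aggregate
sales_view = defaultdict(float)   # materialized view: productId -> revenue


def add_order_eager(order):
    """Eager approach: update the view in the same step as the base data."""
    orders[order["orderId"]] = order
    for item in order["orderItems"]:
        sales_view[item["productId"]] += item["price"]


def rebuild_view_batch():
    """Batch approach: recompute the whole view from the base data."""
    view = defaultdict(float)
    for order in orders.values():
        for item in order["orderItems"]:
            view[item["productId"]] += item["price"]
    return view


add_order_eager({"orderId": 99, "orderItems": [{"productId": 27, "price": 32.45}]})
add_order_eager({"orderId": 100, "orderItems": [{"productId": 27, "price": 10.0}]})
print(dict(rebuild_view_batch()) == dict(sales_view))
```

With the eager approach the view is never stale but every write does extra work; with the batch approach writes stay cheap and the view is only as fresh as the last run.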

• Use of Materialized Views Within the Same Aggregate:

o Example: An order document might include an order summary to avoid
transferring the entire order document when querying for an order summary.

o Common in column-family databases: Materialized views are stored in
separate column families.

o Advantage: Materialized views can be updated in the same atomic operation
as the base data.

3.5 MODELING FOR DATA ACCESS

• Key-Value Store:

o In a key-value store, all customer data can be embedded in a single value, so
the application can read the customer's information and related data using the
key.

o To query orders or the products sold in each order, the entire object must be
read and parsed on the client side to build the results.

o When references are needed, you can switch to document stores or split the
value object into Customer and Order objects, maintaining references
between them.

o With references, orders can be found independently of the Customer, and all
orders for a Customer can be retrieved by following the orderId references in
the Customer object.

• Customer Object Example (Key-Value Store):

# Customer Object
{
  "customerId": 1,
  "customer": {
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "payment": [{"type": "debit", "ccinfo": "1000-1000-1000-1000"}],
    "orders": [{"orderId": 99}]
  }
}

# Order Object
{
  "customerId": 1,
  "orderId": 99,
  "order": {
    "orderDate": "Nov-20-2011",
    "orderItems": [{"productId": 27, "price": 32.45}],
    "orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft"}],
    "shippingAddress": {"city": "Chicago"}
  }
}
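
The client-side work implied by these objects can be sketched as follows. The dictionary `kv` stands in for a key-value store whose values are opaque JSON strings, and the key format and the `products_for_customer` helper are assumptions for illustration. The store returns whole values only, so the application parses each one and follows the orderId references itself.

```python
import json

# Values are opaque strings to the store; the client must parse them.
kv = {
    "customer:1": json.dumps({"name": "Martin", "orders": [{"orderId": 99}]}),
    "order:99": json.dumps({"orderItems": [{"productId": 27, "price": 32.45}]}),
}


def products_for_customer(customer_id):
    """Read the customer, follow each orderId, and collect product IDs."""
    customer = json.loads(kv[f"customer:{customer_id}"])
    product_ids = []
    for ref in customer["orders"]:
        order = json.loads(kv[f"order:{ref['orderId']}"])
        product_ids.extend(item["productId"] for item in order["orderItems"])
    return product_ids


print(products_for_customer(1))
```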

• Using Aggregates for Analytics:

o Aggregates can be used for Real-Time BI or Real-Time Analytics, allowing
enterprises to access data in real time without relying on batch runs.

o For example, an aggregate update could fill in which orders contain a given
product.

o Example denormalized data:

{
  "itemid": 27,
  "orders": [99, 545, 897, 678]
}

{
  "itemid": 29,
  "orders": [199, 545, 704, 819]
}

• Document Stores:

o In document stores, references to orders can be removed from the Customer
object, and updates to the Customer object are no longer necessary when
new orders are placed.

o This is possible because document stores allow querying within documents,
enabling searches like "find all orders that include a specific product."

o Example Customer Object (without orders):

{
  "customerId": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}],
  "payment": [{"type": "debit", "ccinfo": "1000-1000-1000-1000"}]
}

o Example Order Object (separate):

{
  "orderId": 99,
  "customerId": 1,
  "orderDate": "Nov-20-2011",
  "orderItems": [{"productId": 27, "price": 32.45}],
  "orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft"}],
  "shippingAddress": {"city": "Chicago"}
}

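A query inside documents, such as "find all orders that include a specific product," can be approximated in plain Python by filtering a list of order documents. The sample documents and the `orders_with_product` helper are hypothetical; a real document store would run an equivalent query on the server, usually with the help of an index.

```python
orders = [
    {"orderId": 99, "customerId": 1,
     "orderItems": [{"productId": 27, "price": 32.45}]},
    {"orderId": 100, "customerId": 2,
     "orderItems": [{"productId": 29, "price": 5.0}]},
]


def orders_with_product(documents, product_id):
    """Return IDs of orders whose orderItems mention the given product."""
    return [
        doc["orderId"]
        for doc in documents
        if any(item["productId"] == product_id for item in doc["orderItems"])
    ]


print(orders_with_product(orders, 27))
```
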
• Column-Family Stores:

o In column-family stores, columns are ordered, so frequently used columns
can be named so that they are fetched first.

o It’s important to model data based on query requirements, not write
operations, ensuring the data is optimized for reads.

o Example: Storing Customer and Order in different column families,
with references to orders in the Customer column family, improves query
performance.

• Graph Databases:

o Graph databases model all objects as nodes and relationships as edges.

o Relationships between nodes have types and directional significance, making
traversal easy.

o For example, to find all customers who purchased a specific product, query
the product node and look for customers with an incoming PURCHASED
relationship.

o Graph databases are particularly useful for product recommendations and
analyzing user patterns.
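
The PURCHASED example can be sketched with typed, directed edges stored as triples. The node names and the `customers_who_purchased` helper are made up for illustration; the sketch starts at the product node and collects the sources of its incoming PURCHASED edges, which is the traversal described above.

```python
# Directed, typed edges as (source, relationship, target) triples.
edges = [
    ("Martin", "PURCHASED", "product:27"),
    ("Anna", "PURCHASED", "product:27"),
    ("Martin", "PURCHASED", "product:29"),
    ("Anna", "VIEWED", "product:29"),
]


def customers_who_purchased(edge_list, product):
    """Collect the sources of incoming PURCHASED edges on a product node."""
    return sorted(
        source
        for source, relationship, target in edge_list
        if relationship == "PURCHASED" and target == product
    )


print(customers_who_purchased(edges, "product:27"))
```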
