NoSQL Module 1
Relational databases are an integral part of computing, and it is worth revisiting the
benefits they provide to understand their value.
1.1.1 Getting at Persistent Data
• The primary value of databases lies in their ability to store large amounts of
persistent data, which is essential for long-term data retention. Most computing
architectures distinguish two kinds of memory:
1. Main Memory: Fast but volatile, meaning data is lost during power outages
or system failures.
2. Backing Store: Larger but slower, used to retain data even in adverse
conditions.
• While some applications (e.g., word processors) use file systems to store data,
enterprise applications depend on databases for their advanced capabilities.
• Databases offer greater flexibility than file systems by allowing quick and easy access
to small pieces of data from large datasets.
1.1.2 Concurrency
• Enterprise applications often involve multiple users accessing the same body of data
simultaneously.
• While users usually work on different areas of the data, conflicts can arise when they
attempt to modify the same data.
• Managing concurrency is highly complex, often leading to errors even with careful
programming.
1.1.3 Integration
• Enterprise applications often need to collaborate, and a common approach is shared
database integration, where multiple applications store their data in a single
database so that they all work with the same, consistent data.
1.1.4 A (Mostly) Standard Model
• Relational databases work in a (mostly) standard way: the relational model and SQL
are largely the same across vendors.
• Developers and database professionals can learn the relational model once and apply
it across multiple projects, thanks to this standardization.
• This consistency allows developers to easily adapt their skills to different relational
database systems, enhancing productivity and reducing learning curves.
Relational databases, while advantageous, have limitations that have frustrated application
developers since their inception. The concept of impedance mismatch highlights a
fundamental issue in aligning relational databases with in-memory data structures.
o Relational tuples require simple values and cannot handle complex structures
like nested records or lists.
• Translation Requirement:
o A rich in-memory data structure has to be translated into a relational
representation every time it is written to the database, and translated back
when it is read; object-relational mapping (ORM) frameworks exist to automate
this work.
o While ORMs reduce the burden of this mapping, they introduce new problems of
their own: developers who try to ignore the database behind the mapping often
end up with poorly performing queries (a sketch of the mismatch follows this
list).
• Despite the widespread use of ORMs, the mapping problem persists, highlighting the
inherent mismatch between relational databases and in-memory structures.
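A minimal sketch of the mismatch in plain Python (the field names are illustrative): a nested in-memory order has to be flattened into simple-valued rows before it fits relational tables, and reassembled again on the way back.

# A hypothetical in-memory order aggregate: nested records and lists.
order = {
    "id": 99,
    "customer": {"id": 1, "name": "Martin"},
    "line_items": [
        {"product": "NoSQL Distilled", "qty": 1},
        {"product": "Refactoring", "qty": 2},
    ],
}

# To store this relationally, the structure must be flattened into rows of
# simple values spread over several tables (orders, customers, line_items).
order_row = {"id": order["id"], "customer_id": order["customer"]["id"]}
line_item_rows = [
    {"order_id": order["id"], "product": item["product"], "qty": item["qty"]}
    for item in order["line_items"]
]

# Reading the order back requires joining these rows and reassembling the
# nested structure, which is the translation work ORMs try to automate.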
• Relational databases dominated enterprise computing through the 2000s, but their
supremacy began to face challenges during that decade.
• Application Databases:
o With the shift from integration databases to application databases, each
database is accessed by only a single application.
o Only the team building that application needs to know about the database
structure, making it easier to maintain and evolve the schema.
o Since the application team controls both the database and the application
code, the responsibility for database integrity can be placed in the application
code.
• Integration via Web Services:
o Applications began using web services over HTTP for integration, enabling a
new form of communication mechanism.
o Web services allowed the use of richer data structures with nested records
and lists, usually represented as XML or JSON.
• Adoption Trends:
o The 2000s saw large web properties dramatically increase in scale, despite
the bursting of the 1990s dot-com bubble.
o Websites began tracking activity and structure in great detail, generating large
sets of data (links, social networks, activity in logs, mapping data).
o With the growth of data came the growth in users, and the largest websites
started serving vast numbers of visitors.
o To handle the increase in data and traffic, websites had two choices: scale up
or scale out.
o Scaling up involves using bigger machines with more processors, disk storage,
and memory, but becomes expensive and limited as size increases.
o Scaling out involves using many small machines in a cluster, which is cheaper,
more resilient, and can keep running through individual machine failures.
o Relational databases, however, were not designed to run on clusters; data can
be sharded across separate servers, but then:
▪ The application must track which server to talk to for each piece of
data (a routing sketch follows this list).
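A minimal sketch of why sharding pushes routing logic into the application; the hash-based placement rule and server names below are hypothetical:

import hashlib

# Hypothetical cluster of small machines, each holding one shard of the data.
SERVERS = ["db-node-0", "db-node-1", "db-node-2"]

def server_for(key: str) -> str:
    """Pick the shard that stores this key (simple hash-based placement)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

# The application must apply this rule on every read and write, and must
# cope with it changing whenever servers are added or removed.
print(server_for("customer:1"))
print(server_for("customer:42"))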
• Licensing Costs:
o Commercial relational databases are typically priced per server, so running
them across a large cluster of machines quickly becomes very expensive.
• Influence of Google and Amazon:
o Google and Amazon, at the forefront of running large clusters and capturing
huge amounts of data, were influential in pushing the idea of databases
designed specifically for clusters.
o Although the scale of Amazon and Google may seem too large for most
organizations, many are beginning to face similar challenges with growing
data and traffic.
o As more information about Google and Amazon’s solutions leaked out, other
organizations began exploring databases explicitly designed for clusters.
o The term "NoSQL" first appeared in the late 90s as the name of an open-
source relational database by Carlo Strozzi.
o This early "NoSQL" database did not use SQL and was manipulated through
UNIX shell scripts, with data stored as ASCII files.
o Despite the name, Strozzi’s NoSQL had no influence on the modern databases
referred to as NoSQL.
o The term "NoSQL" gained prominence after a 2009 meetup in San Francisco
organized by Johan Oskarsson.
o The meetup was inspired by the examples of BigTable and Dynamo, with
discussions about alternative data storage solutions.
o Johan Oskarsson chose "NoSQL" as the name for the meetup, which became
widely used to describe this technology trend.
o NoSQL databases don’t use SQL, although some have query languages similar
to SQL (e.g., Cassandra’s CQL).
o Most NoSQL databases are designed to run on clusters, and their data models
and consistency approaches are suited to this environment.
o Not all NoSQL databases are cluster-oriented; for example, graph databases
use a distribution model similar to relational databases but with a different
data model.
o NoSQL databases are generally projects of the early 21st century, developed to
meet the needs of running large-scale web estates.
o They operate without a schema, allowing for flexible data storage and the
addition of fields without predefined structure, making them ideal for
nonuniform and custom data.
o The term "Not Only SQL" is often used, though it has issues, such as not
differentiating from relational databases that can also use non-SQL elements.
• NoSQL as a Movement:
o The most important result of the rise of NoSQL is polyglot persistence:
organizations will likely use a mix of data stores for different purposes,
depending on the nature of the data and how it needs to be manipulated.
o Handling Big Data on Clusters: NoSQL is useful for managing large-scale data
access that requires a cluster for performance and scalability.
• Data Model: A data model describes how we interact with data in a database,
distinct from a storage model, which details how data is stored and manipulated
internally.
• Ideal vs. Practical: In an ideal world, users would be unaware of the storage model,
but in practice, some understanding of it is needed for good performance.
• Data Model in Context: In this book, "data model" refers to the way a database
organizes data (metamodel), which is different from the specific data in an
application (like an entity-relationship diagram).
• Relational Data Model: The relational model, dominant for the last couple of
decades, is visualized as a set of tables (like a spreadsheet) where rows represent
entities and columns contain single values. Relationships are formed when a column
refers to another row in the same or a different table.
• Shift in NoSQL: NoSQL introduces a shift from the relational model. NoSQL solutions
use different data models, categorized into four types: key-value, document, column-
family, and graph.
2.1 AGGREGATES
o The relational model divides data into tuples (rows), where each tuple is a
limited data structure that captures a set of values. It is not possible to nest
tuples within another tuple or store lists of values or tuples within a tuple.
• E-commerce Scenario: customers and orders can be stored as two separate
aggregates, for example:
// in Customers
{ "id": 1, "name": "Martin" }
// in Orders
{ "id": 99, "customerId": 1 }
▪ The key idea is that the customer and order aggregates are treated as
independent units.
▪ The payment information, including billing address, is embedded
within the order.
▪ This approach differs from relational models where a new row would
be created for each instance of the relationship.
▪ Alternative Model: all of a customer's orders can instead be nested inside
the customer record, forming one larger aggregate.
▪ Example:
// in Customers (orders nested inside the customer aggregate)
{
"customer": {
"id": 1,
"name": "Martin",
"orders": [
{ "id": 99, "customerId": 1 }
]
}
}
o Semantics of Aggregates:
▪ The database has no way of knowing which data belongs together; the
aggregate boundary is chosen by the application based on how the data
will be accessed.
• Impact on Transactions:
o With aggregate-oriented databases, the aggregate itself becomes the natural
unit for atomic updates, whereas relational databases allow ACID transactions
that span many rows and tables.
o ACID in NoSQL:
▪ While it’s true that NoSQL databases may not support ACID
transactions across multiple aggregates, they often support atomic
operations within a single aggregate.
2.2 KEY-VALUE AND DOCUMENT DATA MODELS
o Key-value and document databases are both strongly aggregate-oriented; each
aggregate has a key or ID used to access the data.
• Key-Value Databases:
o The advantage of opacity is that we can store any kind of data in the
aggregate, with the database only imposing general size limits.
• Document Databases:
o The trade-off is that document databases offer more flexibility in data access:
the database can see the structure of the aggregate, so you can submit queries
based on its fields and retrieve parts of it rather than the whole thing, while
giving up some freedom about what can be stored.
• General Distinction:
o A key-value store treats the aggregate as an opaque blob that can only be
looked up by its key, while a document store sees inside the aggregate and
allows queries and partial retrieval (see the sketch below).
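A small sketch of this distinction, using plain Python dicts as stand-ins for the two kinds of store (no real client library is assumed):

import json

# Key-value store: the aggregate is an opaque blob, retrievable only by key.
kv_store = {
    "customer:1": json.dumps({"id": 1, "name": "Martin", "orders": [99]}),
}
blob = kv_store["customer:1"]      # the whole aggregate or nothing
customer = json.loads(blob)        # only the application interprets the bytes

# Document store: the database sees the structure, so it can answer queries
# about fields inside the aggregates.
doc_store = [
    {"id": 1, "name": "Martin", "orders": [99]},
    {"id": 2, "name": "Pramod", "orders": []},
]

def find(collection, **criteria):
    """Toy query by field value, the kind of access a document store offers."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(doc_store, name="Martin"))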
2.3 COLUMN-FAMILY STORES
o Pre-NoSQL column stores used the relational model and SQL but focused on
storing data physically by columns rather than by rows.
o The primary benefit of column storage is for scenarios where writes are rare,
but there is a need to read a few columns across many rows.
o Column-family stores keep groups of columns (column families) for all rows
together, unlike row-oriented storage, which keeps whole rows together and
therefore favors write performance.
o Column-family databases organize data as a two-level aggregate structure:
1. The first level is a row identifier (the key), representing the entire
aggregate (e.g., a customer).
2. The second level is a map of more detailed values, the columns, each
accessed by its column name.
o Operations can target the entire row or specific columns within the row (e.g.,
get('1234', 'name')); a sketch of this structure appears at the end of this
subsection.
• Column Families:
o Each column must belong to a column family, and columns are accessed as
units, with data in a column family typically being accessed together.
• Cassandra’s Variation:
o Cassandra has a unique approach where a row only exists in one column
family, but the column family may contain supercolumns (nested columns),
which are similar to classic Bigtable column families.
o In Cassandra, you can freely add new columns to rows, but adding new
column families is less frequent and may require stopping the database.
• Wide vs. Skinny Rows:
o Skinny rows have fewer columns, and each column is used across many rows.
o Wide rows have many columns (potentially thousands), and each row may
have very different columns.
o Wide column families are used to model lists, with each column representing
one element of the list.
o While wide column families may define a sort order, there is no technical
restriction on combining field-like and list-like columns in the same column
family, though doing so could complicate the sorting.
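The two-level structure can be sketched with nested Python dicts (an illustration only, not any particular product's API): the outer key is the row identifier, and each column family maps column names to values.

# Outer map: row key -> column families -> columns -> values.
rows = {
    "1234": {
        "profile": {"name": "Martin", "billingAddress": "Chicago"},
        # A "wide", list-like column family: one column per order.
        "orders": {"order-99": "shipped", "order-145": "pending"},
    }
}

def get(row_key, column, family="profile"):
    """Mimics operations such as get('1234', 'name')."""
    return rows[row_key][family][column]

print(get("1234", "name"))        # a single column from the profile family
print(rows["1234"]["orders"])     # a whole column family read together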
2.4 SUMMARIZING AGGREGATE ORIENTATION
o All three of these data models share aggregate orientation: the aggregate
serves as the atomic unit for updates, providing basic transactional control,
though this control is limited.
1. Key-Value Model: the aggregate is opaque to the database, just a blob of bits
retrieved by its key.
2. Document Model: the database can see the structure of the aggregate, enabling
queries and partial retrieval, at the cost of restrictions on what can be
stored.
3. Column-Family Model: the aggregate is split into a two-level map of column
families and columns.
3.1 RELATIONSHIPS
o Aggregates are useful for grouping data that is commonly accessed together.
o In some cases, data related to an entity (e.g., a customer and their orders)
may be accessed differently by various applications.
o Some applications prefer to combine the customer and order history into a
single aggregate when accessing the customer, while others prefer to treat
orders as independent aggregates.
o In such cases, separate customer and order aggregates are needed, but a
relationship between them is essential.
• Linking Aggregates:
o One aggregate can hold the ID of another as a reference; for example, an
order aggregate can store the ID of its customer.
o This allows for the retrieval of customer data by reading the order, extracting
the customer ID, and then querying the customer record separately.
o While this works, the database remains unaware of the relationship between
the aggregates, which can be important in some scenarios.
• Handling Updates:
o Aggregate-oriented databases treat the aggregate as the unit of atomicity, so
updates that span multiple aggregates are not protected by ACID transactions
and their atomicity must be managed in application code.
3.2 GRAPH DATABASES
o Graph databases are unique within the NoSQL landscape. Most NoSQL
databases focus on large records with simple connections, driven by the need
to run on clusters. In contrast, graph databases are motivated by frustrations
with relational databases and feature a different model—small records with
complex interconnections.
• Graph Structure:
o The data structure in a graph database consists of small nodes (often just a
name) connected by rich, complex interconnections (edges). This structure
allows for advanced queries such as “find books in the Databases category
written by someone whom a friend of mine likes.”
o Graph databases are ideal for handling data with complex relationships, such
as social networks, product preferences, or eligibility rules.
• Graph Database Data Model:
o The basic model is simple: nodes connected by edges. Beyond that, different
graph databases vary in what they can store; for example, some allow arbitrary
properties to be attached to nodes and edges, while others store only the bare
nodes and connections.
o Unlike relational databases that use foreign keys and joins to navigate
relationships (which can be expensive), graph databases make relationship
traversal cheap and efficient.
o They are more likely to run on single servers rather than being distributed
across clusters.
o Graph databases reject the relational model and share a rise in popularity
alongside the broader NoSQL movement.
3.3 SCHEMALESS DATABASES
• In relational databases, you must define a schema before storing data. A schema
defines which tables exist, which columns each table contains, and the data type of
each column.
• NoSQL databases are generally schemaless:
o Key-value stores: You can store any data you like under a key.
o Document databases: There are no restrictions on the structure of the
documents you store.
o Column-family databases: You can store any data under any column.
o Graph databases: You can freely add new edges and properties to nodes and
edges.
• Benefits of Schemalessness:
o Offers freedom and flexibility by removing the need to figure out in advance
what data is necessary.
o You can easily change data storage as you learn more about your project.
o New things can be added easily as they are discovered.
o If you no longer need certain data, you can stop storing it without worrying
about losing old data (unlike relational databases when deleting columns).
o Nonuniform data (where each record has a different set of fields) is handled
easily: each record simply stores whatever fields it needs, avoiding sparse
tables full of null columns.
• Drawbacks of Schemalessness:
o When programs access data, they rely on an implicit schema: a set of
assumptions about the data's structure embedded in the code that manipulates it
(see the sketch after this list).
▪ For example, a program assumes the field “qty” means quantity, but it
cannot infer this unless programmed.
▪ To understand the data structure, you must dive into the application
code to deduce the schema.
o NoSQL databases move the schema to the application code accessing the
data, which can cause issues when multiple applications interact with the
same database.
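A small illustration of an implicit schema in plain Python (the field names are hypothetical): the meaning and presence of "qty" exist only as assumptions in the code, not anywhere in the store.

records = [
    {"item": "widget", "qty": 3, "price": 9.5},
    {"item": "gadget", "price": 12.0},   # nonuniform record: no "qty" field
]

def order_total(record):
    # The implicit schema: this code assumes a numeric "qty" and "price".
    return record.get("qty", 1) * record["price"]

for r in records:
    print(r["item"], order_total(r))

# Another application reading the same data must rediscover these assumptions
# by digging through code like this, since no schema documents them.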
• Solutions to Problems:
o The problems of an implicit schema can be reduced by having a single
application own all interaction with the database, or by clearly delineating
which aggregates each application is allowed to touch.
o Despite its flexibility, the impact of schema changes remains significant, and
migrations must be handled properly.
3.4 MATERIALIZED VIEWS
o Aggregates are useful for accessing related data as a single unit (such as an
order), but they have limitations when you need to access the data differently,
like querying product sales over time.
• Relational Databases:
o Provide views, which are virtual tables defined by computations over base
tables.
o Views compute data dynamically, but some can be expensive to compute.
• Materialized Views:
o Useful for data that is read heavily but can tolerate being somewhat stale.
o NoSQL databases do not have views, but they may have precomputed and
cached queries.
o Two strategies for keeping materialized views up to date (a sketch follows
this list):
1. Eager Approach: update the materialized view at the same time as the
base data, so reads of the view are always fresh.
2. Batch Approach: recompute the materialized views at regular intervals
through batch jobs, accepting that readers may see somewhat stale data.
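A minimal sketch of the eager approach, using plain Python structures rather than any specific database's API: every order write also updates a precomputed "sales per product" view, so the analytic read never scans the base orders.

from collections import defaultdict

orders = {}                           # base data: orderId -> order aggregate
sales_by_product = defaultdict(int)   # materialized view, kept fresh eagerly

def write_order(order):
    """Eager approach: update the view in the same step as the base data."""
    orders[order["id"]] = order
    for item in order["items"]:
        sales_by_product[item["productId"]] += item["qty"]

write_order({"id": 99, "items": [{"productId": 27, "qty": 1}]})
write_order({"id": 100, "items": [{"productId": 27, "qty": 2}]})

# Reading the view is cheap; a batch approach would instead rebuild
# sales_by_product from `orders` on a schedule, tolerating staleness.
print(sales_by_product[27])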
3.5 MODELING FOR DATA ACCESS
• Key-Value Store:
o All customer data can be embedded using a key-value store, where the
application can read the customer's information and related data using the
key.
o If querying orders or products sold in each order, the entire object must be
read and parsed on the client side to build the results.
o When references are needed, you can switch to document stores or split the
value object into Customer and Order objects, maintaining references
between them.
o With references, orders can be found independently of the Customer, and all
orders for a Customer can be retrieved by using the orderId reference in the
Customer object.
"customerId": 1,
"customer": {
"name": "Martin",
#Order Object
"customerId": 1,
"orderId": 99,
"order": {
"orderDate":"Nov-20-2011",
"orderPayment":[{"ccinfo":"1000-1000-1000-1000", "txnId":"abelif879rft"}],
"shippingAddress":{"city":"Chicago"}
}
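A short sketch of how an application uses these objects in a key-value style, with a plain Python dict standing in for the store (the key format and the order-reference field are illustrative):

import json

# The store only understands get/put by key; values are opaque blobs.
store = {
    "customer:1": json.dumps({"customerId": 1,
                              "customer": {"name": "Martin"},
                              "orders": [99]}),
    "order:99": json.dumps({"customerId": 1, "orderId": 99,
                            "order": {"orderDate": "Nov-20-2011",
                                      "shippingAddress": {"city": "Chicago"}}}),
}

order = json.loads(store["order:99"])              # read the whole order aggregate
customer_key = f"customer:{order['customerId']}"   # extract the reference...
customer = json.loads(store[customer_key])         # ...and issue a second lookup

# Any filtering (e.g. "orders shipped to Chicago") must happen client-side,
# after parsing the blobs, because the store cannot see inside the values.
print(customer["customer"]["name"], order["order"]["shippingAddress"]["city"])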
• Using Aggregates for Analytics:
o Aggregates can also be precomputed for analytic questions; for example, an
aggregate update could fill in which orders contain a given product, with one
entry per product (e.g., "itemid": 27, "itemid": 29), each listing the orders
that include that item.
• Document Stores:
o The same Customer and Order aggregates can be stored as documents; the
difference is that the database can now see their fields, so queries such as
"all orders for customerId 1" run inside the store (a query sketch follows the
documents below).
#Customer Object
{
"customerId": 1,
"name": "Martin"
}
#Order Object
{
"orderId": 99,
"customerId": 1,
"orderDate":"Nov-20-2011",
"orderPayment":[{"ccinfo":"1000-1000-1000-1000", "txnId":"abelif879rft"}],
"shippingAddress":{"city":"Chicago"}
}
• Column-Family Stores:
o Customer and order data can be stored as rows whose columns are grouped into
column families (e.g., a profile column family and an order-list column
family), so that the columns needed by a given access path are read together.
• Graph Stores:
o Customers, orders, and products become nodes connected by relationships; for
example, to find all customers who purchased a specific product, query the
product node and look for customers with an incoming PURCHASED relationship
(see the sketch below).
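As a toy illustration (plain Python, not any particular graph database's API), the nodes and PURCHASED edges can be represented directly and the query answered by walking incoming edges:

# Nodes keyed by id, edges as (from_node, relationship, to_node) triples.
nodes = {"martin": {"type": "customer"}, "pramod": {"type": "customer"},
         "nosql-distilled": {"type": "product"}}
edges = [("martin", "PURCHASED", "nosql-distilled"),
         ("pramod", "PURCHASED", "nosql-distilled")]

def incoming(node_id, relationship):
    """All nodes with an edge of this relationship pointing at node_id."""
    return [src for (src, rel, dst) in edges
            if dst == node_id and rel == relationship]

# "Which customers purchased this product?" is a cheap edge traversal,
# not a join over foreign keys.
print(incoming("nosql-distilled", "PURCHASED"))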