NoSQL Module 1
Relational databases are an integral part of computing, and it is worth revisiting the
benefits they provide to understand their value.
1.1.1 Getting at Persistent Data
• The primary value of databases lies in their ability to store large amounts of
persistent data, which is essential for long-term data retention. Most computing
architectures distinguish two kinds of memory:
1. Main Memory: Fast but volatile, meaning data is lost during power outages
or system failures.
2. Backing Store: Larger but slower, used to retain data even in adverse
conditions.
• While some applications (e.g., word processors) use file systems to store data,
enterprise applications depend on databases for their advanced capabilities.
• Databases offer greater flexibility than file systems by allowing quick and easy access
to small pieces of data from large datasets.
1.1.2 Concurrency
• Enterprise applications often involve multiple users accessing the same body of data
simultaneously.
• While users usually work on different areas of the data, conflicts can arise when they
attempt to modify the same data.
• Managing concurrency is highly complex, often leading to errors even with careful
programming.
1.1.3 Integration
• Enterprise applications often need to collaborate, and a common approach is shared
database integration, where multiple applications store their data in a single
database so that they all work with the same, consistent data.
1.1.4 A (Mostly) Standard Model
• Relational databases work in a (mostly) standard way: the relational model and SQL
are largely the same across vendors.
• Developers and database professionals can learn the relational model once and apply
it across multiple projects, thanks to this standardization.
• This consistency allows developers to easily adapt their skills to different relational
database systems, enhancing productivity and reducing learning curves.
Relational databases, while advantageous, have limitations that have frustrated application
developers since their inception. The concept of impedance mismatch highlights a
fundamental issue in aligning relational databases with in-memory data structures.
o Relational tuples require simple values and cannot handle complex structures
like nested records or lists.
• Translation Requirement:
o A rich in-memory data structure has to be translated into a relational
representation every time it is written to the database, and translated back
when it is read; object-relational mapping (ORM) frameworks exist to automate
this work.
o While ORMs reduce the burden of this mapping, they introduce new problems of
their own: developers who try to ignore the database behind the mapping often
end up with poorly performing queries (a sketch of the mismatch follows this
list).
• Despite the widespread use of ORMs, the mapping problem persists, highlighting the
inherent mismatch between relational databases and in-memory structures.
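A minimal sketch of the mismatch in plain Python (the field names are illustrative): a nested in-memory order has to be flattened into simple-valued rows before it fits relational tables, and reassembled again on the way back.

# A hypothetical in-memory order aggregate: nested records and lists.
order = {
    "id": 99,
    "customer": {"id": 1, "name": "Martin"},
    "line_items": [
        {"product": "NoSQL Distilled", "qty": 1},
        {"product": "Refactoring", "qty": 2},
    ],
}

# To store this relationally, the structure must be flattened into rows of
# simple values spread over several tables (orders, customers, line_items).
order_row = {"id": order["id"], "customer_id": order["customer"]["id"]}
line_item_rows = [
    {"order_id": order["id"], "product": item["product"], "qty": item["qty"]}
    for item in order["line_items"]
]

# Reading the order back requires joining these rows and reassembling the
# nested structure, which is the translation work ORMs try to automate.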
• Relational databases dominated enterprise computing through the 2000s, but their
supremacy began to face challenges during that decade.
• Application Databases:
o With the shift from integration databases to application databases, each
database is accessed by only a single application.
o Only the team building that application needs to know about the database
structure, making it easier to maintain and evolve the schema.
o Since the application team controls both the database and the application
code, the responsibility for database integrity can be placed in the application
code.
• Integration via Web Services:
o Applications began using web services over HTTP for integration, enabling a
new form of communication mechanism.
o Web services allowed the use of richer data structures with nested records
and lists, usually represented as XML or JSON.
• Adoption Trends:
o The 2000s saw large web properties dramatically increase in scale, despite
the bursting of the 1990s dot-com bubble.
o Websites began tracking activity and structure in great detail, generating large
sets of data (links, social networks, activity in logs, mapping data).
o With the growth of data came the growth in users, and the largest websites
started serving vast numbers of visitors.
o To handle the increase in data and traffic, websites had two choices: scale up
or scale out.
o Scaling up involves using bigger machines with more processors, disk storage,
and memory, but becomes expensive and limited as size increases.
o Scaling out involves using many small machines in a cluster, which is cheaper,
more resilient, and can keep running through individual machine failures.
o Relational databases, however, were not designed to run on clusters; data can
be sharded across separate servers, but then:
▪ The application must track which server to talk to for each piece of
data (a routing sketch follows this list).
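A minimal sketch of why sharding pushes routing logic into the application; the hash-based placement rule and server names below are hypothetical:

import hashlib

# Hypothetical cluster of small machines, each holding one shard of the data.
SERVERS = ["db-node-0", "db-node-1", "db-node-2"]

def server_for(key: str) -> str:
    """Pick the shard that stores this key (simple hash-based placement)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

# The application must apply this rule on every read and write, and must
# cope with it changing whenever servers are added or removed.
print(server_for("customer:1"))
print(server_for("customer:42"))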
• Licensing Costs:
o Commercial relational databases are typically priced per server, so running
them across a large cluster of machines quickly becomes very expensive.
• Influence of Google and Amazon:
o Google and Amazon, at the forefront of running large clusters and capturing
huge amounts of data, were influential in pushing the idea of databases
designed specifically for clusters.
o Although the scale of Amazon and Google may seem too large for most
organizations, many are beginning to face similar challenges with growing
data and traffic.
o As more information about Google and Amazon’s solutions leaked out, other
organizations began exploring databases explicitly designed for clusters.
o The term "NoSQL" first appeared in the late 90s as the name of an open-
source relational database by Carlo Strozzi.
o This early "NoSQL" database did not use SQL and was manipulated through
UNIX shell scripts, with data stored as ASCII files.
o Despite the name, Strozzi’s NoSQL had no influence on the modern databases
referred to as NoSQL.
o The term "NoSQL" gained prominence after a 2009 meetup in San Francisco
organized by Johan Oskarsson.
o The meetup was inspired by the examples of BigTable and Dynamo, with
discussions about alternative data storage solutions.
o Johan Oskarsson chose "NoSQL" as the name for the meetup, which became
widely used to describe this technology trend.
o NoSQL databases don’t use SQL, although some have query languages similar
to SQL (e.g., Cassandra’s CQL).
o Most NoSQL databases are designed to run on clusters, and their data models
and consistency approaches are suited to this environment.
o Not all NoSQL databases are cluster-oriented; for example, graph databases
use a distribution model similar to relational databases but with a different
data model.
o NoSQL databases are generally projects of the early 21st century, developed to
meet the needs of running large-scale web estates.
o They operate without a schema, allowing for flexible data storage and the
addition of fields without predefined structure, making them ideal for
nonuniform and custom data.
o The term "Not Only SQL" is often used, though it has issues, such as not
differentiating from relational databases that can also use non-SQL elements.
• NoSQL as a Movement:
o The most important result of the rise of NoSQL is polyglot persistence:
organizations will likely use a mix of data stores for different purposes,
depending on the nature of the data and how it needs to be manipulated.
o Handling Big Data on Clusters: NoSQL is useful for managing large-scale data
access that requires a cluster for performance and scalability.
• Data Model: A data model describes how we interact with data in a database,
distinct from a storage model, which details how data is stored and manipulated
internally.
• Ideal vs. Practical: In an ideal world, users would be unaware of the storage model,
but in practice, some understanding of it is needed for good performance.
• Data Model in Context: In this book, "data model" refers to the way a database
organizes data (metamodel), which is different from the specific data in an
application (like an entity-relationship diagram).
• Relational Data Model: The relational model, dominant for the last couple of
decades, is visualized as a set of tables (like a spreadsheet) where rows represent
entities and columns contain single values. Relationships are formed when a column
refers to another row in the same or a different table.
• Shift in NoSQL: NoSQL introduces a shift from the relational model. NoSQL solutions
use different data models, categorized into four types: key-value, document, column-
family, and graph.
2.1 AGGREGATES
o The relational model divides data into tuples (rows), where each tuple is a
limited data structure that captures a set of values. It is not possible to nest
tuples within another tuple or store lists of values or tuples within a tuple.
• E-commerce Scenario: customers and orders can be stored as two separate
aggregates, for example:
// in Customers
{ "id": 1, "name": "Martin" }
// in Orders
{ "id": 99, "customerId": 1 }
▪ The key idea is that the customer and order aggregates are treated as
independent units.
▪ The payment information, including billing address, is embedded
within the order.
▪ This approach differs from relational models where a new row would
be created for each instance of the relationship.
▪ Alternative Model: all of a customer's orders can instead be nested inside
the customer record, forming one larger aggregate.
▪ Example:
// in Customers (orders nested inside the customer aggregate)
{
"customer": {
"id": 1,
"name": "Martin",
"orders": [
{ "id": 99, "customerId": 1 }
]
}
}
o Semantics of Aggregates:
▪ The database has no way of knowing which data belongs together; the
aggregate boundary is chosen by the application based on how the data
will be accessed.
• Impact on Transactions:
o With aggregate-oriented databases, the aggregate itself becomes the natural
unit for atomic updates, whereas relational databases allow ACID transactions
that span many rows and tables.
o ACID in NoSQL:
▪ While it’s true that NoSQL databases may not support ACID
transactions across multiple aggregates, they often support atomic
operations within a single aggregate.
2.2 KEY-VALUE AND DOCUMENT DATA MODELS
o Key-value and document databases are both strongly aggregate-oriented; each
aggregate has a key or ID used to access the data.
• Key-Value Databases:
o The advantage of opacity is that we can store any kind of data in the
aggregate, with the database only imposing general size limits.
• Document Databases:
o The trade-off is that document databases offer more flexibility in data access:
the database can see the structure of the aggregate, so you can submit queries
based on its fields and retrieve parts of it rather than the whole thing, while
giving up some freedom about what can be stored.
• General Distinction:
o A key-value store treats the aggregate as an opaque blob that can only be
looked up by its key, while a document store sees inside the aggregate and
allows queries and partial retrieval (see the sketch below).
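A small sketch of this distinction, using plain Python dicts as stand-ins for the two kinds of store (no real client library is assumed):

import json

# Key-value store: the aggregate is an opaque blob, retrievable only by key.
kv_store = {
    "customer:1": json.dumps({"id": 1, "name": "Martin", "orders": [99]}),
}
blob = kv_store["customer:1"]      # the whole aggregate or nothing
customer = json.loads(blob)        # only the application interprets the bytes

# Document store: the database sees the structure, so it can answer queries
# about fields inside the aggregates.
doc_store = [
    {"id": 1, "name": "Martin", "orders": [99]},
    {"id": 2, "name": "Pramod", "orders": []},
]

def find(collection, **criteria):
    """Toy query by field value, the kind of access a document store offers."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(doc_store, name="Martin"))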
2.3 COLUMN-FAMILY STORES
o Pre-NoSQL column stores used the relational model and SQL but focused on
storing data physically by columns rather than by rows.
o The primary benefit of column storage is for scenarios where writes are rare,
but there is a need to read a few columns across many rows.
o Column-family stores keep groups of columns (column families) for all rows
together, unlike row-oriented storage, which keeps whole rows together and
therefore favors write performance.
o Column-family databases organize data as a two-level aggregate structure:
1. The first level is a row identifier (the key), representing the entire
aggregate (e.g., a customer).
2. The second level is a map of more detailed values, the columns, each
accessed by its column name.
o Operations can target the entire row or specific columns within the row (e.g.,
get('1234', 'name')); a sketch of this structure appears at the end of this
subsection.
• Column Families:
o Each column must belong to a column family, and columns are accessed as
units, with data in a column family typically being accessed together.
• Cassandra’s Variation:
o Cassandra has a unique approach where a row only exists in one column
family, but the column family may contain supercolumns (nested columns),
which are similar to classic Bigtable column families.
o In Cassandra, you can freely add new columns to rows, but adding new
column families is less frequent and may require stopping the database.
• Wide vs. Skinny Rows:
o Skinny rows have fewer columns, and each column is used across many rows.
o Wide rows have many columns (potentially thousands), and each row may
have very different columns.
o Wide column families are used to model lists, with each column representing
one element of the list.
o While wide column families may define a sort order, there is no technical
restriction on combining field-like and list-like columns in the same column
family, though doing so could complicate the sorting.
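The two-level structure can be sketched with nested Python dicts (an illustration only, not any particular product's API): the outer key is the row identifier, and each column family maps column names to values.

# Outer map: row key -> column families -> columns -> values.
rows = {
    "1234": {
        "profile": {"name": "Martin", "billingAddress": "Chicago"},
        # A "wide", list-like column family: one column per order.
        "orders": {"order-99": "shipped", "order-145": "pending"},
    }
}

def get(row_key, column, family="profile"):
    """Mimics operations such as get('1234', 'name')."""
    return rows[row_key][family][column]

print(get("1234", "name"))        # a single column from the profile family
print(rows["1234"]["orders"])     # a whole column family read together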
2.4 SUMMARIZING AGGREGATE ORIENTATION
o All three of these data models share aggregate orientation: the aggregate
serves as the atomic unit for updates, providing basic transactional control,
though this control is limited.
1. Key-Value Model: the aggregate is opaque to the database, just a blob of bits
retrieved by its key.
2. Document Model: the database can see the structure of the aggregate, enabling
queries and partial retrieval, at the cost of restrictions on what can be
stored.
3. Column-Family Model: the aggregate is split into a two-level map of column
families and columns.
3.1 RELATIONSHIPS
o Aggregates are useful for grouping data that is commonly accessed together.
o In some cases, data related to an entity (e.g., a customer and their orders)
may be accessed differently by various applications.
o Some applications prefer to combine the customer and order history into a
single aggregate when accessing the customer, while others prefer to treat
orders as independent aggregates.
o In such cases, separate customer and order aggregates are needed, but a
relationship between them is essential.
• Linking Aggregates:
o One aggregate can hold the ID of another as a reference; for example, an
order aggregate can store the ID of its customer.
o This allows for the retrieval of customer data by reading the order, extracting
the customer ID, and then querying the customer record separately.
o While this works, the database remains unaware of the relationship between
the aggregates, which can be important in some scenarios.
• Handling Updates:
o Aggregate-oriented databases treat the aggregate as the unit of atomicity, so
updates that span multiple aggregates are not protected by ACID transactions
and their atomicity must be managed in application code.
3.2 GRAPH DATABASES
o Graph databases are unique within the NoSQL landscape. Most NoSQL
databases focus on large records with simple connections, driven by the need
to run on clusters. In contrast, graph databases are motivated by frustrations
with relational databases and feature a different model—small records with
complex interconnections.
• Graph Structure:
o The data structure in a graph database consists of small nodes (often just a
name) connected by rich, complex interconnections (edges). This structure
allows for advanced queries such as “find books in the Databases category
written by someone whom a friend of mine likes.”
o Graph databases are ideal for handling data with complex relationships, such
as social networks, product preferences, or eligibility rules.
• Graph Database Data Model:
o The basic model is simple: nodes connected by edges. Beyond that, different
graph databases vary in what they can store; for example, some allow arbitrary
properties to be attached to nodes and edges, while others store only the bare
nodes and connections.
o Unlike relational databases that use foreign keys and joins to navigate
relationships (which can be expensive), graph databases make relationship
traversal cheap and efficient.
o They are more likely to run on single servers rather than being distributed
across clusters.
o Graph databases reject the relational model and share a rise in popularity
alongside the broader NoSQL movement.
3.3 SCHEMALESS DATABASES
• In relational databases, you must define a schema before storing data. A schema
defines which tables exist, which columns each table contains, and the data type of
each column.
• NoSQL databases are generally schemaless:
o Key-value stores: You can store any data you like under a key.
o Document databases: There are no restrictions on the structure of the
documents you store.
o Column-family databases: You can store any data under any column.
o Graph databases: You can freely add new edges and properties to nodes and
edges.
• Benefits of Schemalessness:
o Offers freedom and flexibility by removing the need to figure out in advance
what data is necessary.
o You can easily change data storage as you learn more about your project.
o New things can be added easily as they are discovered.
o If you no longer need certain data, you can stop storing it without worrying
about losing old data (unlike relational databases when deleting columns).
o Nonuniform data (where each record has a different set of fields) is handled
easily: each record simply stores whatever fields it needs, avoiding sparse
tables full of null columns.
• Drawbacks of Schemalessness:
o When programs access data, they rely on an implicit schema: a set of
assumptions about the data's structure embedded in the code that manipulates it
(see the sketch after this list).
▪ For example, a program assumes the field “qty” means quantity, but it
cannot infer this unless programmed.
▪ To understand the data structure, you must dive into the application
code to deduce the schema.
o NoSQL databases move the schema to the application code accessing the
data, which can cause issues when multiple applications interact with the
same database.
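A small illustration of an implicit schema in plain Python (the field names are hypothetical): the meaning and presence of "qty" exist only as assumptions in the code, not anywhere in the store.

records = [
    {"item": "widget", "qty": 3, "price": 9.5},
    {"item": "gadget", "price": 12.0},   # nonuniform record: no "qty" field
]

def order_total(record):
    # The implicit schema: this code assumes a numeric "qty" and "price".
    return record.get("qty", 1) * record["price"]

for r in records:
    print(r["item"], order_total(r))

# Another application reading the same data must rediscover these assumptions
# by digging through code like this, since no schema documents them.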
• Solutions to Problems:
o The problems of an implicit schema can be reduced by having a single
application own all interaction with the database, or by clearly delineating
which aggregates each application is allowed to touch.
o Despite its flexibility, the impact of schema changes remains significant, and
migrations must be handled properly.
3.4 MATERIALIZED VIEWS
o Aggregates are useful for accessing related data as a single unit (such as an
order), but they have limitations when you need to access the data differently,
like querying product sales over time.
• Relational Databases:
o Provide views, which are virtual tables defined by computations over base
tables.
o Views compute data dynamically, but some can be expensive to compute.
• Materialized Views:
o Useful for data that is read heavily but can tolerate being somewhat stale.
o NoSQL databases do not have views, but they may have precomputed and
cached queries.
o Two strategies for keeping materialized views up to date (a sketch follows
this list):
1. Eager Approach: update the materialized view at the same time as the
base data, so reads of the view are always fresh.
2. Batch Approach: recompute the materialized views at regular intervals
through batch jobs, accepting that readers may see somewhat stale data.
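A minimal sketch of the eager approach, using plain Python structures rather than any specific database's API: every order write also updates a precomputed "sales per product" view, so the analytic read never scans the base orders.

from collections import defaultdict

orders = {}                           # base data: orderId -> order aggregate
sales_by_product = defaultdict(int)   # materialized view, kept fresh eagerly

def write_order(order):
    """Eager approach: update the view in the same step as the base data."""
    orders[order["id"]] = order
    for item in order["items"]:
        sales_by_product[item["productId"]] += item["qty"]

write_order({"id": 99, "items": [{"productId": 27, "qty": 1}]})
write_order({"id": 100, "items": [{"productId": 27, "qty": 2}]})

# Reading the view is cheap; a batch approach would instead rebuild
# sales_by_product from `orders` on a schedule, tolerating staleness.
print(sales_by_product[27])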
3.5 MODELING FOR DATA ACCESS
• Key-Value Store:
o All customer data can be embedded using a key-value store, where the
application can read the customer's information and related data using the
key.
o If querying orders or products sold in each order, the entire object must be
read and parsed on the client side to build the results.
o When references are needed, you can switch to document stores or split the
value object into Customer and Order objects, maintaining references
between them.
o With references, orders can be found independently of the Customer, and all
orders for a Customer can be retrieved by using the orderId reference in the
Customer object.
"customerId": 1,
"customer": {
"name": "Martin",
#Order Object
"customerId": 1,
"orderId": 99,
"order": {
"orderDate":"Nov-20-2011",
"orderPayment":[{"ccinfo":"1000-1000-1000-1000", "txnId":"abelif879rft"}],
"shippingAddress":{"city":"Chicago"}
}
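A short sketch of how an application uses these objects in a key-value style, with a plain Python dict standing in for the store (the key format and the order-reference field are illustrative):

import json

# The store only understands get/put by key; values are opaque blobs.
store = {
    "customer:1": json.dumps({"customerId": 1,
                              "customer": {"name": "Martin"},
                              "orders": [99]}),
    "order:99": json.dumps({"customerId": 1, "orderId": 99,
                            "order": {"orderDate": "Nov-20-2011",
                                      "shippingAddress": {"city": "Chicago"}}}),
}

order = json.loads(store["order:99"])              # read the whole order aggregate
customer_key = f"customer:{order['customerId']}"   # extract the reference...
customer = json.loads(store[customer_key])         # ...and issue a second lookup

# Any filtering (e.g. "orders shipped to Chicago") must happen client-side,
# after parsing the blobs, because the store cannot see inside the values.
print(customer["customer"]["name"], order["order"]["shippingAddress"]["city"])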
• Using Aggregates for Analytics:
o Aggregates can also be precomputed for analytic questions; for example, an
aggregate update could fill in which orders contain a given product, with one
entry per product (e.g., "itemid": 27, "itemid": 29), each listing the orders
that include that item.
• Document Stores:
o The same Customer and Order aggregates can be stored as documents; the
difference is that the database can now see their fields, so queries such as
"all orders for customerId 1" run inside the store (a query sketch follows the
documents below).
#Customer Object
{
"customerId": 1,
"name": "Martin"
}
#Order Object
{
"orderId": 99,
"customerId": 1,
"orderDate":"Nov-20-2011",
"orderPayment":[{"ccinfo":"1000-1000-1000-1000", "txnId":"abelif879rft"}],
"shippingAddress":{"city":"Chicago"}
}
• Column-Family Stores:
o Customer and order data can be stored as rows whose columns are grouped into
column families (e.g., a profile column family and an order-list column
family), so that the columns needed by a given access path are read together.
• Graph Stores:
o Customers, orders, and products become nodes connected by relationships; for
example, to find all customers who purchased a specific product, query the
product node and look for customers with an incoming PURCHASED relationship
(see the sketch below).
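As a toy illustration (plain Python, not any particular graph database's API), the nodes and PURCHASED edges can be represented directly and the query answered by walking incoming edges:

# Nodes keyed by id, edges as (from_node, relationship, to_node) triples.
nodes = {"martin": {"type": "customer"}, "pramod": {"type": "customer"},
         "nosql-distilled": {"type": "product"}}
edges = [("martin", "PURCHASED", "nosql-distilled"),
         ("pramod", "PURCHASED", "nosql-distilled")]

def incoming(node_id, relationship):
    """All nodes with an edge of this relationship pointing at node_id."""
    return [src for (src, rel, dst) in edges
            if dst == node_id and rel == relationship]

# "Which customers purchased this product?" is a cheap edge traversal,
# not a join over foreign keys.
print(incoming("nosql-distilled", "PURCHASED"))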