NOSQL

Uploaded by

acharyaramya412
NOSQL

Introduction
• Relational databases have long dominated the software industry,
particularly for enterprise applications. This introduction contrasts
that dominance with the recent surge of interest in NoSQL databases.
Relational Databases Dominance:
For decades, relational databases were the default choice
for serious data storage, especially in enterprise contexts.
Most software architects only had to decide which relational
database to use.
Challenges from Other Technologies:
Throughout history, other database technologies, like object
databases in the 1990s, tried to challenge the relational
model but failed to gain significant traction.

NoSQL's Emergence:
The recent rise of NoSQL databases has caught many by
surprise, challenging the previously unshaken dominance of
relational databases.
The Value of Relational Databases
1. Getting at Persistent Data:
• Memory Hierarchy: In computing, there are two primary types of
memory.
• Main Memory (RAM): Fast but volatile, meaning data is lost when
power is cut or the system fails.
• Backing Store (Persistent Storage): Slower but non-volatile,
traditionally a disk, although modern systems may use persistent memory
(like SSDs or flash storage).
• The Role of Backing Store: Persistent storage ensures that data
remains available even after power loss or system failures.
File System vs. Database
• For some applications (like word processors), data is simply
stored as files in the file system.
• For enterprise applications, databases are preferred because
they offer more flexibility and efficiency in managing large
amounts of data.
• They allow applications to quickly retrieve small pieces of data,
providing more sophisticated ways to organize and query
information compared to file systems.
Concurrency

• In enterprise applications, concurrency management ensures
multiple users can access and modify the same data without
causing errors like double booking.
• Transactions help handle this by isolating concurrent operations
(often by locking data), so changes are applied in a controlled manner.
They provide the ACID properties (Atomicity, Consistency, Isolation,
Durability); atomicity in particular means that either all operations
in a transaction succeed or none do.
• Transactions also aid error handling by allowing rollback in case of
issues, restoring the system to a safe state.
• However, developers still need to handle conflicts and
transactional errors properly for smooth operation.
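The all-or-nothing behaviour and rollback described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration rather than a real booking system: the seats and bookings tables and the book_seat helper are hypothetical names introduced only for this example.

```python
import sqlite3

# Hypothetical schema: one row per seat, plus a log of accepted bookings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (seat_id TEXT PRIMARY KEY, booked_by TEXT)")
conn.execute("CREATE TABLE bookings (seat_id TEXT, customer TEXT)")
conn.execute("INSERT INTO seats VALUES ('12A', NULL)")
conn.commit()


def book_seat(conn, seat_id, customer):
    """Record a booking and claim the seat as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO bookings (seat_id, customer) VALUES (?, ?)",
                (seat_id, customer),
            )
            cur = conn.execute(
                "UPDATE seats SET booked_by = ? "
                "WHERE seat_id = ? AND booked_by IS NULL",
                (customer, seat_id),
            )
            if cur.rowcount == 0:
                # Seat already taken: abort, which also undoes the INSERT above.
                raise ValueError("seat already booked")
        return True
    except ValueError:
        return False


first = book_seat(conn, "12A", "alice")
second = book_seat(conn, "12A", "bob")
holder = conn.execute(
    "SELECT booked_by FROM seats WHERE seat_id = '12A'"
).fetchone()[0]
booking_count = conn.execute("SELECT COUNT(*) FROM bookings").fetchone()[0]
```

The second attempt fails, and its booking row is rolled back together with the failed seat update, so the two tables never disagree about who holds the seat.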
Integration

• In enterprise applications, inter-application collaboration is often
necessary, but challenging due to different teams and systems
needing to work together.
• A common solution is shared database integration, where multiple
applications share a single database. This allows them to easily access
and update the same data.
• The database's concurrency control manages data consistency across
all applications, just as it does for multiple users within a single
application, ensuring smooth collaboration and data visibility.
A (Mostly) Standard Model

• Relational databases have thrived in part because their core model is
largely standardized, allowing developers and database professionals to
apply a consistent relational model across various projects.
• Despite differences between vendors, key mechanisms like SQL
dialects and transaction handling remain largely similar, making it
easier to work across different systems.
• This uniformity has contributed to their widespread success and
adoption.
Impedance Mismatch
• Relational databases offer many advantages but come with certain
frustrations, notably the impedance mismatch between the relational
model and in-memory data structures.
• The relational model organizes data into tables (relations) and rows
(tuples), where each tuple is a set of simple name-value pairs.
• However, in-memory data structures can be more complex, supporting
rich hierarchies like nested records or lists, which the relational model
doesn't natively handle.
• This mismatch forces developers to translate complex in-memory data
into simpler relational forms for storage, leading to inefficiencies and
additional work.
• An example of impedance mismatch is when an application uses a
complex in-memory data structure, such as an object with nested
properties or a list of objects.
class Item:
    def __init__(self, name, price):
        self.name = name
        self.price = price


class Order:
    def __init__(self, order_id, customer, items):
        self.order_id = order_id
        self.customer = customer
        self.items = items  # This is a list of Item objects


# Creating Item objects
item1 = Item("Laptop", 1000)
item2 = Item("Mouse", 50)

# Creating an Order object, which contains a list of Item objects
order = Order(1, "John Doe", [item1, item2])

• To store this order in a relational database, it has to be split
across tables:
• An Orders table for the order details.
• An Items table for the individual items.
• This requires transforming the rich object structure into a
format that fits the relational table schema.
• To save an Order with its Items, you would first store the
Order in one table, then store each Item in another, and then
link them.
• This translation between object-oriented and relational models is
the impedance mismatch, causing extra complexity in application
development.
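The translation work can be made concrete with a small sketch using Python's built-in sqlite3 module. The table layout (orders and items) and the save_order and load_order helpers are assumptions made for illustration; the point is only that one nested object becomes rows in two tables on the way in and has to be reassembled on the way out.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    price: int


@dataclass
class Order:
    order_id: int
    customer: str
    items: list


def save_order(conn, order):
    """Flatten one nested Order object into rows in two tables."""
    with conn:
        conn.execute(
            "INSERT INTO orders (order_id, customer) VALUES (?, ?)",
            (order.order_id, order.customer),
        )
        conn.executemany(
            "INSERT INTO items (order_id, name, price) VALUES (?, ?, ?)",
            [(order.order_id, item.name, item.price) for item in order.items],
        )


def load_order(conn, order_id):
    """Reassemble the nested Order object from the two tables."""
    oid, customer = conn.execute(
        "SELECT order_id, customer FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT name, price FROM items WHERE order_id = ? ORDER BY rowid",
        (order_id,),
    ).fetchall()
    return Order(oid, customer, [Item(name, price) for name, price in rows])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute(
    "CREATE TABLE items ("
    "order_id INTEGER REFERENCES orders(order_id), name TEXT, price INTEGER)"
)

order = Order(1, "John Doe", [Item("Laptop", 1000), Item("Mouse", 50)])
save_order(conn, order)
restored = load_order(conn, 1)
```

Neither helper does anything interesting from a business point of view; both exist purely to bridge the two representations.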
Application and Integration Databases
• This section looks at why relational databases became dominant, focusing
on their integration advantages, particularly through SQL, and how the
rise of web services introduced alternatives.
Relational Databases as Integration Databases
• SQL played a crucial role in the success of relational databases
because it allowed multiple applications to interact with a shared
database, integrating data between different teams.
• This ensured consistency across applications, but came with
challenges:
• Increased complexity in database design to accommodate different
application needs.
• Changes to the database structure required coordination between
teams, making it harder to evolve or optimize for specific
applications.
• Relational databases had to maintain data integrity, as individual
applications couldn't be trusted to do so consistently.
Shift to Application Databases:
• In contrast, an application database is dedicated to a single application
managed by a single team.
• This simplifies database maintenance and evolution, and allows integrity
checks to move to the application layer.
• Application teams have more freedom to choose the database
technology, which could open the door for non-relational (NoSQL)
databases.
Rise of Web Services:
• The transition to web services (HTTP-based communication) in the 2000s
marked a shift away from SQL as the dominant communication
mechanism.
• This approach allowed more flexibility with data structures, often using
XML or JSON to represent nested records and lists.
Service-Oriented Architecture (SOA):

• With the rise of Service-Oriented Architecture (SOA) and web services,
communication between applications no longer depended on shared
databases but on well-defined interfaces (APIs).
• This gave teams more flexibility to innovate with their internal
databases, whether relational or non-relational.
Reluctance to Adopt NoSQL:

• Even with the flexibility afforded by application databases, most teams
still stuck with relational databases.
• Relational systems were familiar, mature, and offered many benefits (like
security features).
• However, cracks in the dominance of relational databases eventually
appeared due to factors beyond the application database model.
The Emergence of NoSQL
The Origins of "NoSQL"
• The term "NoSQL" initially appeared in the late 1990s with Strozzi
NoSQL, an open-source relational database that did not use SQL as a
query language.
• However, this project had no lasting influence on the NoSQL movement
that followed.
• The modern use of the term "NoSQL" dates to a meetup in San
Francisco in June 2009, organized by Johan Oskarsson, where developers
discussed emerging distributed, non-relational databases like Cassandra,
MongoDB, and CouchDB.
• The term "NoSQL" was suggested by Eric Evans as a catchy,
hashtag-friendly name.
Characteristics of NoSQL Databases
• Not Using SQL:
As the name suggests, NoSQL databases generally do not use SQL as
their primary query language.
Some databases have similar query languages, such as Cassandra’s CQL,
but these still differ significantly from standard SQL.
• Cluster-Oriented:
NoSQL databases are designed to run on clusters, making them well-suited
for distributing data across many servers.
This shift comes with trade-offs in consistency, with many NoSQL
databases offering eventual consistency models rather than the strong ACID
transactions common in relational databases.
• Schema-Free:
One of the key advantages of NoSQL databases is their schema-less nature,
allowing flexible data storage, where records can have varying fields without
the need for predefined structures.
This is particularly helpful for dealing with unstructured or semi-structured
data.
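As a rough illustration of the schema-free idea, the sketch below keeps two records with different fields side by side in plain Python dictionaries. The users list, its field names, and the emails helper are hypothetical; no particular NoSQL product is being modelled.

```python
# Two hypothetical user records stored together; they do not share a schema.
users = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "phones": ["555-0100", "555-0101"], "age": 30},
]


def emails(records):
    """Collect email addresses, skipping records that lack the field."""
    return [record["email"] for record in records if "email" in record]


found = emails(users)

# The set of fields is only discoverable by inspecting the data itself.
all_fields = sorted({field for record in users for field in record})
```

Note that the reading code still has to know which fields may be missing, so the knowledge a schema would hold moves into the application.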
The Polyglot Persistence Movement
• Polyglot persistence is the idea that modern systems should use
multiple types of databases depending on the specific needs of the data.
Rather than relying solely on relational databases for all purposes, different
types of data stores can be used in different circumstances.
For example, a relational database might be used for financial transactions,
while a NoSQL database like MongoDB could store user activity logs.
Why Consider NoSQL?
• Big Data and Clusters:
NoSQL databases are often used for handling large-scale data
distributed across clusters, where relational databases may struggle
with scalability and performance.
• Productivity:
Some teams choose NoSQL to avoid the impedance mismatch
between object-oriented programming and relational databases.
NoSQL databases can simplify development by offering data models
(like document or key-value) that align more naturally with modern
programming paradigms.
Chapter 2: AGGREGATE DATA MODEL
• Overview of data models and how they relate to database design and
use.
• Data Model vs. Storage Model:
• A data model describes how we interact with data in a database. It is
how users and developers perceive and manipulate the data.
• A storage model, on the other hand, describes how the data is
physically stored and manipulated internally by the database.
• Ideally, users should not need to understand the storage model, but in
practice, it can help optimize performance.
Data Model in Applications:
• In conversations, developers often refer to their specific application’s
data model, such as an entity-relationship diagram containing entities
like customers, orders, and products.
• However, in a more formal sense, a data model refers to how the
database organizes data, which can also be called a metamodel.
Relational Data Model:
• The relational data model has been dominant in recent decades. It can
be visualized as a set of tables (or relations), where each row (or tuple)
represents an entity, and columns represent attributes of that entity.
• Relationships between entities are expressed through columns that
reference other rows.
NoSQL Data Models:
• With NoSQL databases, there is a move away from the relational model,
and different models are used.
• The four common categories of NoSQL data models are key-value,
document, column-family, and graph.
• The first three share a characteristic known as aggregate orientation,
which refers to how these models organize and group data. This concept
will be explored further in the chapter.

This discussion highlights the distinction between how data is
represented for users (data model) versus how it is stored internally
(storage model), and introduces the differences between relational and
NoSQL databases.
Aggregates
Relational Model's Simplicity
• In the relational model, data is stored in tuples (rows), each containing a set
of values.
• Tuples are a simple data structure, meaning they cannot contain nested
records, lists, or other tuples within them.
• This simplicity is a key feature of the relational model, as it allows for a
uniform way of thinking about operations on the data.
Aggregate Orientation:
• Unlike the relational model, aggregate orientation recognizes the need to
work with more complex data structures.
• This approach allows for records that can have nested structures, such as lists
or other records, which are more flexible for many real-world applications.
In short, while the relational model relies on simple, flat tuples,
aggregate orientation enables the use of more complex, nested data
structures. This makes aggregates particularly useful in NoSQL
databases and for operations in distributed environments.
Example of Relations and Aggregates
• At this point, an example may help explain what we’re talking about.
• Let’s assume we have to build an e-commerce website; we are going to
be selling items directly to customers over the web, and we will have to
store information about users, our product catalog, orders, shipping
addresses, billing addresses, and payment data
• Now let’s see how this model might look when we think in more
aggregate-oriented terms.
• Again, we have some sample data, which we’ll show in JSON
format as that’s a common representation for aggregate data.
• In this version, the customer and orders are separated into two
independent JSON objects.
• The customer object holds only customer-related information (ID,
name, billing address), while the order object holds order-specific
data.
Consequences of Aggregate Orientation
• NoSQL databases, especially aggregate-oriented ones like document
databases, focus on these aggregates, treating them as a unit for data
manipulation.
• Aggregates are useful in distributed systems (clusters) because
keeping related data together on the same node minimizes the
number of nodes to query, improving performance.
• However, aggregate structures may become problematic for tasks like
analyzing sales history, where data needs to be retrieved across
multiple aggregates.
• Relational databases allow complex ACID transactions that span many
tables, ensuring data consistency.
• In contrast, aggregate-oriented NoSQL databases often support ACID
transactions only within a single aggregate, requiring developers to
manage transactions across multiple aggregates manually in the
application code.
Key-Value and Document Data Models
Key-Value Databases
• Aggregate-Oriented: Constructed primarily around
aggregates, each identified by a unique key.
• Opaque Structure: The database treats the aggregate as a
blob of data with no inherent structure visible to the
database.
• Access Method: Access is limited to key-based lookups; the
content of the aggregate is not indexed or queried directly.
• Flexibility: Users can store virtually any data without
structural constraints, aside from size limitations.
Document Databases
• Structured Aggregates: Each aggregate (document) is seen as
having a defined structure (e.g., JSON, XML).
• Query Flexibility: Users can query based on fields within the
document, retrieve partial data, and create indexes for faster
access.
• Imposed Structure: While there is flexibility, document databases
define allowable structures and types, providing some level of
constraint.
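The difference between the two models can be sketched in plain Python. Everything here is a toy stand-in: kv_store, DocumentStore, and the second customer record are hypothetical and do not reflect any real product's API. The sketch only shows where the knowledge of the aggregate's structure lives.

```python
import json

# Key-value style: each value is an opaque blob (here, a JSON string).
kv_store = {
    "customer:1": json.dumps({"name": "Martin", "city": "Chicago"}),
    "customer:2": json.dumps({"name": "Pramod", "city": "Boston"}),
}


def kv_find_by_city(store, city):
    """The store cannot see inside values, so the client parses every blob."""
    return sorted(
        key for key, blob in store.items() if json.loads(blob).get("city") == city
    )


class DocumentStore:
    """Document style: the store understands fields and can index one."""

    def __init__(self):
        self.docs = {}
        self.city_index = {}  # city -> set of document ids

    def put(self, doc_id, doc):
        old = self.docs.get(doc_id)
        if old is not None:
            # Drop the stale index entry before overwriting the document.
            self.city_index[old.get("city")].discard(doc_id)
        self.docs[doc_id] = doc
        self.city_index.setdefault(doc.get("city"), set()).add(doc_id)

    def find_by_city(self, city):
        return sorted(self.city_index.get(city, set()))


doc_store = DocumentStore()
for key, blob in kv_store.items():
    doc_store.put(key, json.loads(blob))

kv_hits = kv_find_by_city(kv_store, "Chicago")
doc_hits = doc_store.find_by_city("Chicago")
```

Both lookups return the same keys, but the key-value version scans and parses every value on the client, while the document version answers from an index the store maintains itself.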
Blurring Distinctions

• Hybrid Features: Many modern databases blur the lines between
these categories:
• Key-Value Stores with Metadata: Systems like Riak allow for metadata that
can be used for indexing.
• Structured Elements in Key-Value Stores: Redis can handle lists and sets,
providing more structured access than typical key-value databases.
Column-Family Stores
• The emergence of Google's BigTable marked a pivotal point in the
development of NoSQL databases, influencing subsequent systems like
HBase and Cassandra.
• Below is a comprehensive overview of its structure, characteristics, and
implications for data storage:
Overview of BigTable
• Data Model: BigTable is often conceptualized as a two-level map rather
than a traditional table. It uses a schema-less design that supports sparse
data, where columns can be added freely without predefined constraints.
• Column Families: The data is organized into column families, which group
related columns together. Each column belongs to a single column family,
and operations typically access data at the column family level.
Column-Family Structure
• Row-Oriented Perspective:
1. Each row is seen as an aggregate of related data (e.g., customer ID
1234), with column families categorizing useful chunks (e.g.,
profile, order history).
2. This approach allows you to retrieve all data for a specific
aggregate with a single query.
• Column-Oriented Perspective:
1. Each column family defines a record type (e.g., customer profiles),
and each row represents an instance of that type.
2. This allows for thinking of a row as a composite of records across
different column families.
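The two-level map view can be mimicked with nested Python dictionaries. This is only a conceptual sketch: the customers table, its column names, and the get_family and get_column helpers are illustrative assumptions and say nothing about how BigTable, HBase, or Cassandra store data internally.

```python
# Outer map: row key -> column families.
# Inner map: column name -> value (columns may differ from row to row).
customers = {
    "1234": {
        "profile": {"name": "Martin", "billingAddress": "Chicago"},
        "orders": {"order_99": "2011-11-20", "order_105": "2011-12-02"},
    },
    "5678": {
        "profile": {"name": "Alice"},  # sparse: no billingAddress column
        "orders": {},
    },
}


def get_family(table, row_key, family):
    """Row-oriented access: fetch one column family of one aggregate."""
    return table[row_key][family]


def get_column(table, row_key, family, column, default=None):
    """Fetch a single column; missing columns are normal in a sparse table."""
    return table[row_key][family].get(column, default)


profile = get_family(customers, "1234", "profile")
missing = get_column(customers, "5678", "profile", "billingAddress")
```

The row key selects the aggregate and the column family selects a chunk of it, which is why a single lookup can return everything known about one customer.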
Storage and Access Characteristics
Dynamic Schema: Unlike traditional relational databases, column-family
databases allow adding new columns to existing rows without altering the
overall schema. This flexibility is beneficial for dealing with unstructured or
evolving data.
In Cassandra, a row belongs to only one column family, but column
families may contain supercolumns that can hold nested columns. This
concept allows for hierarchical data representation, providing greater
flexibility in modeling complex relationships.
Here’s how we might structure the UserProfile column family:
• Column Family: UserProfile
• Row Key: User ID (e.g., user_123)
• Supercolumns:
  • Profile (supercolumn)
    • name: "Alice"
    • age: 30
    • location: "New York"
  • Interests (supercolumn)
    • hobbies: "Photography"
    • sports: "Tennis"
    • music: "Jazz"
Advantages of Column-Family Databases
• Performance Optimization: The ability to store related columns
together optimizes read performance, especially for use cases
where reading multiple columns across many rows is common.
• Flexibility: The schema-less nature allows organizations to
adapt their data structures to evolving business needs without
significant overhead.
Practical Implications
• Column-family databases are suitable for various applications, such
as:
• Time-series data storage.
• Real-time analytics.
• Applications requiring high write and read throughput with varying data
types.
More Details on Data Models
Aggregates:
• An aggregate is a cluster of related data that can be treated as a single
unit for data manipulation. For instance, a customer and their orders can
be viewed as a single aggregate when you want to access their order
history together.
Separate Aggregates:
• In scenarios where processing individual orders is preferred, separating
the customer and order aggregates is beneficial. This allows each to be
managed independently, improving flexibility.
Linking Aggregates:
• To maintain relationships between aggregates, a common approach is to
store a reference (like a customer ID) within the order aggregate. While
this is straightforward, it can lead to additional queries to fetch related
data, which may not be efficient.
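A small sketch of linking by reference, using in-memory dictionaries as stand-ins for two collections of aggregates. The customers and orders stores and both helper functions are hypothetical; the data follows the customer and order examples used elsewhere in these slides.

```python
# Separate aggregates: the order carries only the customer's ID.
customers = {1: {"customerId": 1, "name": "Martin"}}
orders = {
    99: {
        "orderId": 99,
        "customerId": 1,
        "orderItems": [{"productId": 27, "price": 32.45}],
    },
}


def order_with_customer(order_id):
    """Following the reference costs a second lookup in application code."""
    order = orders[order_id]                   # first lookup
    customer = customers[order["customerId"]]  # second lookup
    return {"order": order, "customer": customer}


def orders_for_customer(customer_id):
    """Going the other way means searching the order aggregates."""
    return [
        order["orderId"]
        for order in orders.values()
        if order["customerId"] == customer_id
    ]


result = order_with_customer(99)
customer_orders = orders_for_customer(1)
```

The store itself never checks that customerId 1 exists; keeping the reference valid is left to the application.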
Database Awareness of Relationships:
• Relational databases inherently understand relationships between tables
through foreign keys, allowing for robust querying and integrity
constraints. In contrast, many NoSQL databases lack this inherent
awareness, which can complicate data management across multiple
aggregates.
Atomicity and Transactions:
• Aggregate-oriented databases often guarantee atomicity within a
single aggregate but struggle with operations involving multiple
aggregates. This can complicate error handling in scenarios where
updates to multiple records are necessary.
• Relational databases, on the other hand, support ACID (Atomicity,
Consistency, Isolation, Durability) transactions, allowing for
modifications across multiple records in a single transaction.
Complex Queries and Performance:
• While relational databases excel at handling complex relationships
through SQL joins, performance can degrade significantly as the
number of joins increases. Writing and optimizing such queries can
become complex and challenging.
When to Choose Which Database
• Aggregate-Oriented Databases (NoSQL):
• Best suited for scenarios where you primarily work with individual
aggregates or need high scalability.
• Useful for applications that require flexible schemas and can
tolerate eventual consistency.
• Relational Databases:
• Ideal for applications with complex relationships and the need for
strong consistency and integrity.
• Suitable for scenarios requiring complex queries, reporting, and
transactions across multiple entities.
Graph Database
• Graph databases stand out in the NoSQL landscape because they focus
on small records with complex relationships rather than on large
aggregate records.
• The key points and characteristics are:
Data Structure:
• Nodes and Edges: The fundamental structure consists of nodes
(entities) and edges (relationships). This model is inherently designed
to represent complex interconnections.
• Small Records: Each node can hold minimal information (e.g., just a
name), while the relationships between them are emphasized.
Complex Relationships:
• Graph databases excel in scenarios where relationships are central, such
as social networks, product preferences, or interconnected data in
applications.
• They allow for complex queries, such as finding items based on
relationships (e.g., books liked by friends).
Query Performance:
• Unlike relational databases, which may require expensive joins to
navigate relationships, graph databases optimize traversal of
relationships. This allows for efficient queries over highly interconnected
data.
Deployment:
• Graph databases are more likely to run on a single server rather than
being distributed across clusters. This is due to their focus on
managing relationships rather than large aggregates of data.
Use Cases

1. Social Networks: Analyzing friendships, mutual connections, and
user interactions.
2. Recommendation Systems: Finding items based on user preferences
and social influence.
3. Fraud Detection: Identifying suspicious patterns through relationship
analysis.
Materialized Views
• Materialized views are precomputed views stored on disk, designed to
enhance query performance, especially for frequently accessed data
that can tolerate some staleness.
Eager vs. Lazy Updates:
• Eager Approach: Updates to the materialized view happen
simultaneously with changes to the base data. This ensures freshness
but can slow down write operations.
• Lazy Approach: Updates are done at scheduled intervals or through
batch processes, allowing for lower overhead during data writes. This
method is suitable when the data can be slightly outdated.
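The trade-off between the two approaches can be sketched with a toy in-memory store. SalesStore, its totals_view attribute, and the sample amounts are hypothetical names and data for illustration; a real database would maintain the view internally or through a batch job.

```python
class SalesStore:
    """Base data plus a precomputed 'total sales per product' view."""

    def __init__(self, eager):
        self.eager = eager
        self.orders = []       # base data: (product, amount) pairs
        self.totals_view = {}  # materialized view: product -> total amount

    def add_order(self, product, amount):
        self.orders.append((product, amount))
        if self.eager:
            # Eager: the view is updated as part of every write.
            self.totals_view[product] = self.totals_view.get(product, 0) + amount

    def refresh_view(self):
        """Lazy: rebuild the whole view in one batch, e.g. on a schedule."""
        totals = {}
        for product, amount in self.orders:
            totals[product] = totals.get(product, 0) + amount
        self.totals_view = totals


eager_store = SalesStore(eager=True)
eager_store.add_order("Laptop", 1000)
eager_store.add_order("Laptop", 1000)
eager_totals = dict(eager_store.totals_view)

lazy_store = SalesStore(eager=False)
lazy_store.add_order("Laptop", 1000)
stale_totals = dict(lazy_store.totals_view)  # view not refreshed yet
lazy_store.refresh_view()
fresh_totals = dict(lazy_store.totals_view)
```

The eager store pays a little on each write and is always current; the lazy store writes faster but serves an outdated view until the next refresh.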
Building Materialized Views:
• Materialized views can be built directly within the database, where the
database engine computes the view based on defined parameters.
• They can also be computed externally, with results saved back into the
database, but this may complicate synchronization and data integrity.
Performance Considerations:
• Materialized views can significantly improve read performance,
especially in systems where certain queries are executed frequently.
• The choice between eager and lazy updates depends on the application’s
read-write patterns and the acceptable level of data staleness.
Data Consistency:
• Maintaining consistency between the base data and materialized views is
crucial. Eager updates reduce the risk of staleness but add complexity to
write operations.
Modelling for Data Access
• This section illustrates different approaches to data modeling in various
types of NoSQL databases, emphasizing how these choices affect data
retrieval, aggregation, and the overall design of applications.
• Here’s a summary and breakdown of the key concepts:
Data Modeling in NoSQL
1. Key-Value Stores:
• Embedded Data: All customer information can be stored as a single object,
enabling fast reads using a unique key.
• Drawbacks: To access related data (like orders), the entire object must be
read and parsed client-side, which can be inefficient.
2. Document Stores:
• When references are needed, we could switch to document
stores and then query inside the documents, or even change the
data for the key-value store to split the value object into
Customer and Order objects and then maintain these objects’
references to each other.
• No Order References in the Customer: Because a document store can
query inside documents (for example, orders by customerId), the
customer document does not need to hold references to its orders, so
it does not have to be updated when new orders are placed.
{
  "customerId": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}],
  "payment": [{"type": "debit", "ccinfo": "1000-1000-1000-1000"}]
}
{
  "orderId": 99,
  "customerId": 1,
  "orderDate": "Nov-20-2011",
  "orderItems": [{"productId": 27, "price": 32.45}],
  "orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft"}],
  "shippingAddress": {"city": "Chicago"}
}
3. Column-Family Stores:
• Query-Optimized Structure: Columns are ordered, allowing frequent
queries to be optimized by placing those columns first.
• Denormalization During Writes: Data should be structured for easy
querying rather than for optimal writing performance, leading to
better read times.
• As you can imagine, there are multiple ways to model the data; one
way is to store the Customer and Order in different column
families.
• Here, it is important to note that the references to all the orders
placed by the customer are stored in the Customer column family.
• Other, similar denormalizations are generally done to improve query
(read) performance.
4. Graph Databases:
• Nodes and Relationships: Every entity (like a customer or product) is
modeled as a node, and relationships (like PURCHASED) define how
nodes interact.
• Traversing the Graph: Queries can easily traverse relationships to find
connections, such as identifying all customers who purchased a
specific product.
• Let’s say you want to find all the Customers who PURCHASED a
product with the name Refactoring Databases.
• All we need to do is query for the product node Refactoring
Databases and look for all the Customers with the incoming
PURCHASED relationship.
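The traversal can be sketched with a tiny in-memory graph. The customer names, the second product, and the customers_who_purchased helper are hypothetical, and a real graph database would expose this through its own query language rather than Python dictionaries.

```python
from collections import defaultdict

# Edges as (source node, relationship, target node) triples.
edges = [
    ("Martin", "PURCHASED", "Refactoring Databases"),
    ("Pramod", "PURCHASED", "Refactoring Databases"),
    ("Pramod", "PURCHASED", "NoSQL Distilled"),
    ("Martin", "FRIEND", "Pramod"),
]

# Index incoming edges so a traversal starts at the target node
# instead of scanning every edge in the graph.
incoming = defaultdict(set)
for source, relationship, target in edges:
    incoming[(target, relationship)].add(source)


def customers_who_purchased(product_name):
    """Follow incoming PURCHASED relationships from the product node."""
    return sorted(incoming.get((product_name, "PURCHASED"), set()))


buyers = customers_who_purchased("Refactoring Databases")
```

The query starts at the product node and follows its incoming PURCHASED edges, so its cost depends on how many customers bought that product rather than on the size of the whole graph.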
