
MODULE 1

1) What is NoSQL? Discuss and differentiate between the relational model and the
aggregate model.
What is NoSQL?
NoSQL stands for "Not Only SQL." It is a type of database designed to handle a wide range
of data models, offer scalability, and ensure high performance. Unlike traditional relational
databases, NoSQL databases are schema-less (no fixed structure) and designed for specific
data types like documents, key-value pairs, column families, or graphs.
Features of NoSQL Databases:
1. Flexible Schema: No fixed structure for storing data, making it suitable for dynamic
and changing data.
2. Horizontal Scalability: Can expand across multiple servers, handling more data and
users.
3. High Performance: Built to process large amounts of data quickly.
4. Varied Data Models: Includes different models like key-value, document, column-
family, and graph.
5. Eventual Consistency: Balances consistency, availability, and partition tolerance as
per the CAP theorem.
Use Cases of NoSQL:
1. Applications with changing data structures.
2. High-speed data scenarios, such as IoT or social media.
3. Distributed databases that require horizontal scaling.

Relational Model vs. Aggregate Model

| Feature | Relational Model | Aggregate Model |
| --- | --- | --- |
| Definition | Data is stored in tables (rows and columns) with strict relationships. | Data is stored as aggregates like documents, key-value pairs, or collections. |
| Schema | Uses a strict, fixed structure for storing data. | Flexible structure that supports dynamic and evolving data. |
| Data Representation | Data is divided across multiple tables (normalized). | Data is often combined into a single unit, like a document. |
| Relationships | Relationships are defined using keys (primary and foreign). | Relationships are stored within the same document or managed by the application. |
| Query Language | Uses SQL for querying structured data. | No standard query language; each NoSQL database has its own. |
| Performance | Slower for handling complex joins and high-volume operations. | Faster for read and write tasks because of denormalized data. |
| Scalability | Grows by adding more hardware to one server (vertical scaling). | Grows by adding more servers (horizontal scaling). |
| Transactions | Supports ACID transactions for strong consistency. | Uses BASE transactions, focusing on eventual consistency. |
| Use Cases | Suitable for tasks needing strict consistency, like banking. | Ideal for scalable, high-performance needs like social media or IoT. |
| Examples | MySQL, PostgreSQL, Oracle Database. | MongoDB, Cassandra, Couchbase, DynamoDB. |

2) Which data model does not support aggregate orientation? Explain the model with a
suitable diagram.
The Relational Database Model does not support aggregate orientation.
• Relational Database Model: In relational databases, data is stored in separate tables
(e.g., users, products, orders), and relationships between tables are maintained using
foreign keys.
• NoSQL Approach: Instead of splitting data into many tables, NoSQL databases use
aggregates, where related data is grouped together in a single unit like a document.

E-commerce Example
• For an e-commerce website, data about users, product catalogs, orders, shipping, and
payment needs to be stored.
o Relational Model: Data is normalized into different tables (e.g., users, orders,
products) with relationships defined using keys.
o NoSQL Aggregate Model: Related data, like orders with shipping and
payment details, is grouped together in one document for faster access.
• Why Aggregates are Useful:
o Aggregates improve performance in distributed systems by enabling sharding
(dividing data across servers) and replication (creating copies for backup).

JSON Representation of Aggregates


Here’s how data might look in JSON format in NoSQL:
Customer Aggregate
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
Order Aggregate
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}

Explanation of Aggregates
Aggregate Boundaries
• Customer Aggregate: Contains customer details, such as name and billing address.
• Order Aggregate: Contains order details, such as items, shipping address, and
payment information.
Data Duplication
• Addresses (e.g., billing and shipping) are copied into different JSON sections instead
of being linked by foreign keys.
• This ensures data like shipping addresses doesn’t change after an order is placed,
maintaining the immutability of important data.

Aggregates and Relationships


• Relationships between aggregates are shown through fields like customerId in the
order.
• Denormalization: Some fields, like productName, are included directly in the order
to avoid accessing multiple aggregates.

Aggregate Design Considerations


Trade-offs
• Combined Aggregates: If the application frequently retrieves all orders for a
customer, it may make sense to store orders within the customer aggregate.
• Separate Aggregates: If orders are accessed independently, it’s better to keep them
separate.
Example of Combined Customer and Order Aggregate
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}

3) Define key-value stores and explain the differences between key-value and document
data models.
Key-Value Stores
Key-value stores are a type of NoSQL database where data is saved in key-value pairs. Each
key is unique and points to a specific value, which can be any kind of data (e.g., a number,
text, or even JSON).
Characteristics
1. Simple Data Model: Data is organized as key-value pairs.
2. Efficient Lookups: Keys are used to quickly find their corresponding values.
3. Flexible Storage: Values can be simple (like text) or complex (like JSON).
4. Scalability: Easily handles large amounts of data across many servers.
5. Use Cases: Commonly used for caching, session storage, shopping cart data, and app
configuration.
Examples
• Key: userID:12345
• Value: { "name": "John Doe", "age": 30, "email": "[email protected]" }
Popular Key-Value Stores: Redis, Amazon DynamoDB, and Memcached.
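As a quick illustration, here is a minimal sketch using the redis-py client; the host, port, and key are assumptions for the example.

```python
# Minimal key-value usage with redis-py (assumes a Redis server on localhost).
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# The value is stored as an opaque blob; here, a JSON string.
r.set("userID:12345", json.dumps({"name": "John Doe", "age": 30}))

# Retrieval is by key only -- there is no "find all users aged 30" query.
user = json.loads(r.get("userID:12345"))
print(user["name"])  # -> John Doe
```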

Key-Value vs. Document Data Models

| Feature | Key-Value Data Model | Document Data Model |
| --- | --- | --- |
| Data Structure | Simple key-value pairs. | Hierarchical, semi-structured documents (e.g., JSON, BSON). |
| Key | Unique identifier for each value. | Unique identifier, often called a "document ID." |
| Value | Stored as a single blob (any data type). | Stored as a document with fields and nested structures. |
| Querying | Only retrieves values by keys. | Can query based on fields within documents. |
| Schema | Schema-less; no structure for values. | Semi-structured; fields can vary within documents. |
| Complexity | Simple and fast for key-based lookups. | Allows more complex queries within the document. |
| Use Cases | Ideal for caching, session storage, and simple lookups. | Great for content management, catalogs, and hierarchical data. |
| Performance | Very fast for key-based retrieval. | Slightly slower due to more complex queries. |
| Scalability | Scales well horizontally in distributed systems. | Also scales well but may need indexing for complex queries. |
| Examples | Redis, DynamoDB, Memcached. | MongoDB, CouchDB. |

In Simple Terms
• Key-Value Stores are like a dictionary: you search using a key and get back the
value.
• Document Data Models are like a folder: they store detailed and organized data that
you can search within.
4) Describe with an example how column family stores data in the aggregate structure.

Column-Family Databases
Column-family databases (like Google’s Bigtable, HBase, and Cassandra) organize data into
two levels: rows and columns. They are designed for handling large datasets, especially in
distributed systems.

Row-Oriented vs. Column-Oriented


• In column-family databases, each row is like a group (e.g., a customer).
• Column families group related data within a row (e.g., customer profile or order
history).
• Unlike relational databases, the columns in each row can be different, making it
flexible for handling complex and varied data.

Wide and Skinny Rows


• Wide Rows:
o Contain many columns, often used for lists.
o Example: A list of a customer’s purchases, where each column represents a
purchase.
• Skinny Rows:
o Contain fewer columns and resemble relational database rows.
o Example: Each row might represent a single user with columns for name, age,
and address.
Sorting and Ordering
• Columns within a column family can be sorted.
• This sorting enables range queries, such as finding orders based on date or ID.
• It’s useful for time-series data or applications needing ordered data access.
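To make the two-level structure concrete before the summary below, here is a toy sketch using nested Python dictionaries; the row key, family names, and columns are invented for illustration and do not follow any particular database's API.

```python
# Two-level structure of a column-family store:
# row key -> column family -> columns. Rows may have different columns.
rows = {
    "customer:1234": {                       # row key (one customer)
        "profile": {                         # "skinny" column family
            "name": "Martin",
            "city": "Chicago",
        },
        "orders": {                          # "wide" column family: one column per order
            "2011-01-15:order99": "NoSQL Distilled",
            "2011-02-02:order105": "Refactoring",
        },
    },
}

# Because the "orders" columns are keyed by date, an ordered range query
# such as "orders placed in January 2011" becomes a simple prefix filter.
january = {col: val for col, val in rows["customer:1234"]["orders"].items()
           if col.startswith("2011-01")}
print(january)
```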

In Simple Terms
• Rows represent a group (like a customer).
• Column families organize related data within a row (like profile details and order
history).
• It’s flexible because each row can have a different number of columns.
• Sorting and ordering make it efficient for finding data in a sequence, like searching
orders by date or ID.
5) Explain briefly how impedance mismatch occurs in the relational model, and what
are some common solutions to address it?

Impedance Mismatch in the Relational Model


Impedance mismatch happens when there is a difference between the way data is stored in
relational databases (tables and rows) and how data is used in programming languages (rich
in-memory structures like objects, lists, or nested records).
In relational databases:
• Data is stored as tables (relations) and rows (tuples).
• Values in a row must be simple, without any complex structure like nested lists or
records.
In programming languages:
• Data can have complex structures, such as objects with nested properties and lists.
• To store these in a relational database, the rich data must be converted into tables
and rows, which requires extra effort for translation.
This mismatch between the two formats causes inefficiencies and frustration for developers,
as they need to manage this conversion process.

Example of Impedance Mismatch


An order might appear as a single unit in an application, with customer details, payment
information, and items. However, in a relational database:
• The customer details, payment details, and order items are split into different tables
and linked by IDs.
• This requires multiple queries and joins to piece together the data into the original
structure.
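A minimal sketch of this translation work, using Python's built-in sqlite3 module; the table and field names are invented for the example.

```python
# One in-memory order object must be split across two tables on save
# and reassembled with a join on load -- the impedance mismatch.
import sqlite3

order = {"id": 99, "customer": "Martin",
         "items": [{"product": "NoSQL Distilled", "price": 32.45}]}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_items (order_id INTEGER, product TEXT, price REAL)")

# Saving: the nested item list cannot live in a single row, so it is flattened.
conn.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], order["customer"]))
for item in order["items"]:
    conn.execute("INSERT INTO order_items VALUES (?, ?, ?)",
                 (order["id"], item["product"], item["price"]))

# Loading: a join is needed to piece the original structure back together.
rows = conn.execute(
    "SELECT o.customer, i.product, i.price "
    "FROM orders o JOIN order_items i ON i.order_id = o.id "
    "WHERE o.id = ?", (order["id"],)).fetchall()
print(rows)  # [('Martin', 'NoSQL Distilled', 32.45)]
```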

Solutions to Address Impedance Mismatch


1. Object-Relational Mapping (ORM) Frameworks:
Tools like Hibernate and iBATIS automatically map the objects in code to relational
database tables.
o They reduce the manual effort of writing conversion code.
o Example: A "Customer" object in Java can be linked directly to a "customers"
table in the database.
2. Mapping Patterns:
Frameworks use mapping techniques to translate rich in-memory structures into
relational tables. These patterns follow standard practices to improve efficiency.
3. Standard SQL:
Relational databases emphasize SQL as a common language for data manipulation,
making it easier to work with data across various systems.
4. Integration of Developers and DBAs:
Collaboration between developers and database administrators helps balance
application needs with database design.
While ORM frameworks reduce the workload, they can cause performance issues if developers ignore database optimization entirely.

6) What are materialized views, and how do they differ from relational views in terms of
data access? What strategies are used to build materialized views?
What are Materialized Views?
Materialized views are like stored copies of the results of a query. Instead of calculating the
data every time, they save the precomputed data on disk. This makes it faster to retrieve the
data, especially for queries that are used often.
Relational views, on the other hand, are not stored. They calculate the data whenever you
query them, which can take more time. Materialized views are faster to access but may not
always show the latest updates (they can be a bit outdated).
Differences in Terms of Data Access
1. Relational Views:
o Data is computed when you access it.
o Flexible but slower for large or frequent queries.
2. Materialized Views:
o Data is precomputed and stored for quick access.
o Good for heavy reads but may not always have the most recent data.
Strategies to Build Materialized Views
1. Eager Approach (see the sketch after this list):
o Updates the materialized view as soon as the base data changes.
o Keeps the view fresh but can slow down writes to the database.
o Best when you need fast reads and don’t update the data too often.
2. Batch Updates:
o Updates the materialized view at set times (e.g., every few hours).
o Works well if small delays in updates are acceptable.
3. External Computation:
o The view is calculated outside the database, then stored back in it.
o Useful when you need specific, custom calculations.
4. Database-Supported Computation:
o Many databases let you define how to compute the view.
o The database handles the updates based on the rules you set, like for
incremental updates.
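As a rough sketch of the eager approach (strategy 1 above), the toy Python below refreshes a precomputed total inside every write; the data structures are invented for illustration.

```python
# Toy eager materialized view: the per-product total is updated on every
# write, so reads are instant but each write pays the maintenance cost.
orders = []                # base data
sales_by_product = {}      # the "materialized view"

def record_order(product, amount):
    orders.append((product, amount))
    # Eager strategy: refresh the view as part of the same write operation.
    sales_by_product[product] = sales_by_product.get(product, 0) + amount

record_order("Puerh", 192)
record_order("Puerh", 75)
print(sales_by_product["Puerh"])  # -> 267, read without scanning all orders
```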
Examples in Use:
• In NoSQL systems, materialized views are often created using map-reduce
techniques.
• In column-family databases (like Cassandra), materialized views can be updated in
the same operation as the base data for efficiency.
Materialized views help speed up data access and are particularly helpful when you need
quick answers to repeated queries.
MODULE 2
1) Define Master-Slave replication. With a neat diagram, explain the advantages and
disadvantages of master-slave replication.
Master-Slave Replication
Definition:
Master-slave replication is a method of replicating data across multiple nodes in a database
system. One node acts as the master (primary), which is responsible for all updates or writes.
The other nodes, called slaves (secondaries), synchronize their data with the master. All
updates happen on the master, and these updates are then propagated to the slaves.
Slaves can handle read requests, which makes this method useful for read-intensive
datasets. The replication process ensures the data remains consistent across all nodes.

Advantages of Master-Slave Replication:


1. Scalability for Reads:
o You can add more slaves to handle more read requests, which makes the
system scalable for read-heavy workloads.
2. Read Resilience:
o If the master fails, the slaves can still handle read operations.
o A slave can be quickly promoted to master, reducing downtime.
3. Hot Backup:
o Even in a single-server setup, a slave can act as a hot backup, providing better
resilience and faster recovery in case of failures.
4. Automatic Master Appointment:
o Some systems support automatic master election, allowing the system to
recover quickly by appointing a new master when the original master fails.
Disadvantages of Master-Slave Replication:
1. Write Limitations:
o The master handles all writes, so its capacity becomes a bottleneck for write-
intensive applications.
2. Data Inconsistency:
o There can be delays in propagating updates from the master to slaves.
o Clients reading from different slaves may see outdated or inconsistent data.
3. Lost Updates During Failures:
o If the master fails before updates are propagated to slaves, those updates may
be lost.
4. Complex Read and Write Paths:
o Separate paths for reads and writes are needed to ensure read resilience, which
may not always be supported by database libraries.

In this setup, the master handles all writes, and the slaves synchronize with the master to serve read operations.
Master-slave replication is useful for scaling read-heavy workloads but comes with
challenges like write limitations and potential data inconsistency.

2) In a distributed inventory system, the product “Laptop" has the following details:
Price: ₹60,000, Stock: 10, Version Stamp: v1. Example: User A updates the Price to
₹50000, and User B updates it to ₹45000 at the same time. For this example, how can
different version stamping methods be applied to track these updates, and what are the
advantages and disadvantages of each method?
Version Stamping in a Distributed Inventory System
When multiple users update the same product at the same time, conflicts arise. Version
stamping helps track and resolve these conflicts in distributed systems. Below is an
explanation of how version stamping methods can be applied to the example scenario.
Scenario Recap
• Initial State: Product: Laptop, Price: ₹60,000, Stock: 10, Version Stamp: v1.
• User A's Update: Changes price to ₹50,000.
• User B's Update: Changes price to ₹45,000.
• Simultaneous Updates: Both updates occur at the same time, leading to a conflict.

Version Stamping Methods


1. Timestamp-Based Version Stamping
• Mechanism: Each update gets a timestamp (e.g., time of update in seconds). The
system picks the latest timestamp as the valid update.
• Example:
o User A’s update: Price: ₹50,000, Timestamp: 1677201000.
o User B’s update: Price: ₹45,000, Timestamp: 1677201010.
o The system selects User B’s update since it has the latest timestamp.
Advantages:
1. Simple and easy to implement.
2. Resolves conflicts deterministically by choosing the latest update.
Disadvantages:
1. Requires synchronized clocks across systems; clock mismatches can cause errors.
2. Updates with closely timed but valid timestamps may be overwritten.

2. Logical Clock-Based Version Stamping (Lamport Timestamps)


• Mechanism: Each update is assigned a logical clock value, which is incremented
sequentially. Updates are applied in order of their logical clock values.
• Example:
o Initial version: v1.
o User A’s update: Price: ₹50,000, Logical Clock: 2.
o User B’s update: Price: ₹45,000, Logical Clock: 3.
o The system selects User B’s update since it has a higher logical clock value.
Advantages:
1. No need for physical clocks.
2. Resolves conflicts systematically in distributed systems.
Disadvantages:
1. Requires centralized or synchronized logic to increment clocks.
2. Additional logic may be needed to handle simultaneous updates.

3. Vector Clock-Based Version Stamping


• Mechanism: Maintains a vector of version numbers for each node, incrementing the
vector independently. Conflicts are detected if vectors cannot be directly compared.
• Example:
o Initial vector: [0, 0] (Node 1, Node 2).
o User A (Node 1) updates price to ₹50,000: Vector [1, 0].
o User B (Node 2) updates price to ₹45,000: Vector [0, 1].
o Conflict detected as [1, 0] and [0, 1] are incomparable.
• Resolution:
o Merge updates (e.g., pick a median price).
o Flag for manual resolution.
Advantages:
1. Tracks causal relationships between updates.
2. Useful in distributed systems with multiple nodes.
Disadvantages:
1. Vector size increases with more nodes.
2. Conflict resolution requires additional, complex logic.
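A minimal sketch of the vector-clock comparison for this two-node example, in Python; the node names are illustrative.

```python
# Vector clocks as dicts of node -> counter.
def happened_before(a, b):
    """True if clock a causally precedes clock b."""
    nodes = set(a) | set(b)
    return a != b and all(a.get(n, 0) <= b.get(n, 0) for n in nodes)

def in_conflict(a, b):
    # Neither update precedes the other: concurrent writes, must be reconciled.
    return a != b and not happened_before(a, b) and not happened_before(b, a)

v_a = {"node1": 1, "node2": 0}   # User A's write applied on Node 1
v_b = {"node1": 0, "node2": 1}   # User B's write applied on Node 2

print(in_conflict(v_a, v_b))     # -> True: [1, 0] and [0, 1] are incomparable
```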

4. Optimistic Concurrency Control (OCC) with Version Numbers


• Mechanism: Each record has a version number. Updates are allowed only if the
version matches the latest version in the database.
• Example:
o Initial version: v1.
o User A reads v1, updates price to ₹50,000, and changes version to v2.
o User B reads v1 but cannot update to v2 (₹45,000) because the version has
already changed.
Advantages:
1. Prevents updates from being applied to outdated data.
2. Simple and effective for avoiding stale updates.
Disadvantages:
1. Requires retries for failed updates.
2. High contention can lead to frequent failures.
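A toy sketch of the version check in Python; in a real store the compare-and-set happens atomically on the server.

```python
# Optimistic concurrency control with a version number (names invented).
record = {"price": 60000, "version": 1}

def update_price(read_version, new_price):
    if record["version"] != read_version:
        return False                 # stale read: caller must re-read and retry
    record["price"] = new_price
    record["version"] += 1
    return True

print(update_price(1, 50000))  # User A: True, version moves to 2
print(update_price(1, 45000))  # User B read v1, so the update is rejected: False
```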

Comparison of Methods

| Method | Conflict Resolution | Complexity | Advantages | Disadvantages |
| --- | --- | --- | --- | --- |
| Timestamp-Based | Latest timestamp wins | Low | Simple, easy to implement | Requires synchronized clocks; overwrites valid updates. |
| Logical Clock-Based | Highest logical clock value wins | Moderate | No physical clock dependency | Requires additional logic for conflicts. |
| Vector Clock-Based | Detects causal conflicts | High | Tracks causality; ideal for many nodes | Complex; metadata size increases. |
| OCC with Version Nos. | Prevents stale updates | Low | Simple; avoids stale updates | High contention can lead to frequent retries. |

By choosing the appropriate version stamping method, conflicts in distributed systems can be
resolved effectively based on the specific requirements of the system.
3) What is the CAP theorem? Explain the trade-offs between its three properties in
detail.
The CAP Theorem, proposed by Eric Brewer, states that in a distributed system, it is
impossible to guarantee all three properties—Consistency (C), Availability (A), and
Partition Tolerance (P)—at the same time. Systems must choose two of the three properties
based on their specific needs.

CAP Properties
1. Consistency (C):
• All nodes in the system show the same data at any given time.
• Any change to the data is instantly updated across all nodes.
• Example: If a sales figure is updated on one node, all other nodes should immediately
reflect the new value.
2. Availability (A):
• The system ensures that every request gets a response, even if some nodes are down.
• Achieved by replicating data across multiple nodes.
• Example: A user query for sales data gets a response, regardless of failures in part of
the system.
3. Partition Tolerance (P):
• The system continues to work even if there are network partitions (communication
failures between nodes).
• Ensures fault tolerance and resilience during such failures.
• Example: Even if part of the system cannot communicate with another, operations on
one side still succeed.
CAP Combinations
Since no distributed system can satisfy all three properties, they must select two out of the
three:
1. Consistency + Availability (CA):
• Guarantees data is the same across all nodes (Consistency).
• Ensures all requests get a response (Availability).
• Trade-off: Does not work if there is a network partition (requires perfect
communication).
• Example: Relational databases in centralized systems.
2. Availability + Partition Tolerance (AP):
• Keeps the system responsive during network failures (Partition Tolerance).
• May allow some nodes to show outdated or inconsistent data (sacrifices
Consistency).
• Example: DynamoDB, where availability is prioritized to maintain responsiveness.
3. Consistency + Partition Tolerance (CP):
• Ensures all nodes have consistent data (Consistency).
• Works during network partitions but sacrifices availability (some requests may be
rejected).
• Example: MongoDB, used in systems that need accurate and consistent data.
Network Partition and Trade-offs
When a network partition occurs, systems behave based on their chosen CAP combination:
1. AP (Availability + Partition Tolerance):
• Prioritizes availability, ensuring the system remains responsive.
• Suitable for applications like social media or e-commerce, where the user experience
is critical.
2. CP (Consistency + Partition Tolerance):
• Prioritizes consistency by ensuring accurate data, even if responses are delayed.
• Some requests may fail until the latest data is available.
• Suitable for applications like banking or financial transactions, where data accuracy is
essential.

Understanding CAP Theorem helps in designing distributed systems that align with
application requirements and trade-offs effectively.

4) Identify the type of conflict in the given scenario. How can it be solved? Alice and
Bob both try to book the last available room at the same time. Alice starts filling in her
details, but Bob completes his booking first. When Alice submits her booking, it
overwrites Bob's reservation, and the room is booked for Alice instead.
Identify the Type of Conflict in the Given Scenario. How Can It Be Solved?
Type of Conflict:
The scenario represents a write-write conflict, where both Alice and Bob are trying to
update the same data (the last available room) at the same time. Since there is no proper
concurrency control, Alice’s update overwrites Bob’s reservation, leading to inconsistency.
Solution:
To solve this conflict, concurrency control mechanisms can be applied, using either a
pessimistic approach or an optimistic approach:
1. Pessimistic Approach (Write Locks):
o Implement a write lock for the room booking.
o When Alice or Bob tries to book, they must acquire a lock on the data.
o Only one of them can succeed (e.g., Bob acquires the lock first). The other
(Alice) will have to wait or receive a notification that the room is no longer
available.
2. Optimistic Approach (Conditional Updates):
o Use a conditional update system where Alice and Bob’s updates are verified
against the current state of the data.
o In this case, Bob’s booking is processed first, and when Alice submits her
booking, it checks whether the room is still available.
o Since Bob has already booked the room, Alice’s update fails, and she is
informed that the room is no longer available.
3. Conflict Recording and Resolution:
o Save both attempts and flag the conflict.
o Notify Alice and Bob about the conflict and let them resolve it manually (e.g.,
contacting customer support).
o Alternatively, an automated system can prioritize the first completed booking
(Bob’s) and reject the second one (Alice’s).
Key Takeaway:
Using write locks (pessimistic) ensures conflicts are avoided, but it may slow down the
system. Conditional updates (optimistic) allow faster responses but require proper handling of
failed updates. The choice depends on the system’s requirements for safety and
responsiveness.
5) What is Sharding? With a neat diagram, explain the concept of sharding with an
example.

What is Sharding?
Sharding is a technique used in distributed systems to horizontally scale a data store by
splitting the dataset into smaller, manageable parts called shards. Each shard is stored on a
separate server and handles its own reads and writes. This reduces the load on any single
server, improving performance and scalability.
Explanation of Sharding
In sharding, data is divided across multiple servers. Ideally, users accessing different parts of
the data communicate with different servers, ensuring rapid responses and balanced server
loads.
For example:
• If there are 10 servers, each handles about 10% of the data.
• Each user communicates with only one server, reducing latency.
Key Aspects of Sharding:
1. Data Clumping:
o Data that is often accessed together is stored on the same shard.
o For example, customer orders for a user in Boston can be stored in a data
center in the eastern US.
2. Load Balancing:
o Aggregates (related data items) are distributed evenly across shards to ensure
that no single server is overloaded.
3. Performance:
o Sharding improves read and write performance, especially for write-heavy
applications.
o Unlike replication, which is mainly useful for improving read performance,
sharding scales writes horizontally.
4. Challenges:
o Sharding complicates application logic, especially if implemented manually.
o Auto-sharding, provided by many NoSQL databases, automates the process of
allocating data to shards and ensures correct data access.
o Sharding alone does not improve resilience. If a shard fails, its data becomes
unavailable.

Example of Sharding
Let’s assume an e-commerce application:
• Data is divided based on customer names:
o Customers with surnames A–D are stored on Shard 1.
o Customers with surnames E–G are stored on Shard 2, and so on.
• If a customer with the surname "Anderson" queries the database, the application
directs the query to Shard 1.
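A minimal sketch of this range-based routing in Python; the shard names and surname ranges are assumptions for the example.

```python
# Route a query to the shard whose surname range covers the customer.
SHARDS = [("A", "D", "shard1"), ("E", "G", "shard2"), ("H", "Z", "shard3")]

def shard_for(surname):
    initial = surname[0].upper()
    for low, high, shard in SHARDS:
        if low <= initial <= high:
            return shard
    raise ValueError("no shard covers " + surname)

print(shard_for("Anderson"))  # -> shard1
```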
Conclusion
Sharding is an effective way to scale a database horizontally by distributing data across
multiple servers. While it significantly improves performance, especially for write-heavy
workloads, it must be implemented with care to avoid operational challenges.
6) Define Quorum. Explain how to read and write a quorum with examples.
Define Quorum
Quorum is the minimum number of nodes in a distributed system that must participate in a
read or write operation to ensure strong consistency. By involving a majority of nodes, we
can avoid conflicts and ensure the data is up-to-date and consistent.

How to Read and Write a Quorum


Write Quorum
• A write quorum ensures strong consistency by requiring a majority of nodes to
acknowledge a write.
• Formula: W > N/2
o W: Number of nodes involved in the write operation.
o N: Replication factor (total number of replicas).
Example:
• Consider data replicated across 3 nodes (N = 3).
• For strong consistency, at least 2 nodes (W > 3/2) must confirm a write.
• If two nodes acknowledge, conflicting writes are avoided because only one write can
get a majority.

Read Quorum
• A read quorum ensures the latest data is read by involving enough nodes to confirm
the most recent write.
• Formula: R + W > N
o R: Number of nodes involved in the read operation.
o W: Number of nodes required to confirm a write.
Example:
• If W = 2 (2 nodes confirm writes), then at least R = 2 nodes must be contacted for a
read to ensure the latest data.
• This ensures R + W = 4 > N = 3, giving a strongly consistent read.
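Both rules can be checked mechanically; a tiny sketch with the values above:

```python
# Quorum checks for a replication factor of 3.
N = 3   # total replicas
W = 2   # nodes that must acknowledge a write
R = 2   # nodes contacted for a read

print(W > N / 2)   # True: only one of two conflicting writes can win a majority
print(R + W > N)   # True: every read set overlaps the latest write set
```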
Key Points to Remember
• A replication factor of 3 (N = 3) is common and allows resilience even if one node
fails.
• Strong consistency in reads and writes depends on ensuring R + W > N.
• Trade-offs can be made based on the application's need for consistency, speed, and
availability.
MODULE 3
1) Apply the Map-reduce process to compare the sales of products for each month in
2011 to the prior year. Illustrate the process with suitable diagrams.
A Two-Stage Map-Reduce Example
As map-reduce tasks get complex, it is useful to break them into stages using a pipes-and-
filters approach. Here, the output of one stage serves as the input for the next, similar to
UNIX pipelines.
Problem Statement
We want to compare the sales of products for each month in 2011 with the same month in the
prior year (2010). To solve this, we divide the task into two stages:

First Stage: Creating Records for Monthly Sales of a Product


The first stage reads the original order records and produces key-value pairs for the sales of
each product per month.
• Key: A composite key with product and month (e.g., (Product A, January)).
• Value: Total sales for that product in the given month.
This stage groups records using multiple fields (product and month), making use of a
composite key.
Second Stage: Creating Base Records for Year-on-Year Comparison
The second-stage mapper processes the output of the first stage based on the year:
• 2011 Records: Populates current year quantities.
• 2010 Records: Populates prior year quantities.
• Older Records (e.g., 2009): No mapping output is produced.
The reducer merges records by summing values from both years, creating a single value.
Additional calculations (e.g., percentage change) can also be performed during this step.
Final Step: Reduction as a Merge of Records
The reduction step merges incomplete records from the two years into a single record by
summing the quantities. The result contains:
• Product Name.
• Month.
• Current Year Sales (2011).
• Prior Year Sales (2010).
• Any additional computed values.
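A compact sketch of the two stages in plain Python; the order records and field names are invented for illustration.

```python
# Stage 1 groups sales by (product, month, year); stage 2 merges the
# 2011 and 2010 halves into one comparison record per (product, month).
from collections import defaultdict

orders = [  # (product, year, month, quantity)
    ("NoSQL Distilled", 2011, 1, 5), ("NoSQL Distilled", 2011, 1, 2),
    ("NoSQL Distilled", 2010, 1, 3), ("Refactoring", 2009, 1, 4),
]

# Stage 1: map to a composite key, reduce by summing quantities.
stage1 = defaultdict(int)
for product, year, month, qty in orders:
    stage1[(product, month, year)] += qty

# Stage 2: 2011 records fill "current", 2010 records fill "prior",
# and older records (e.g., 2009) produce no mapping output.
comparison = defaultdict(lambda: {"current": 0, "prior": 0})
for (product, month, year), total in stage1.items():
    if year == 2011:
        comparison[(product, month)]["current"] += total
    elif year == 2010:
        comparison[(product, month)]["prior"] += total

for key, rec in sorted(comparison.items()):
    print(key, rec)  # ('NoSQL Distilled', 1) {'current': 7, 'prior': 3}
```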

Advantages of Breaking Down into Two Stages


1. Simplicity: Smaller, independent steps make the process easier to implement and
debug.
2. Reusability: Intermediate output can be reused for other calculations or outputs,
saving time and effort.
3. Efficiency: Early map-reduce stages access the most data, so storing these
intermediate outputs (materialized views) reduces workload in downstream tasks.

Tools for Writing Map-Reduce Programs


1. Apache Pig: A language designed for writing map-reduce programs easily.
2. Hive: Provides SQL-like syntax for map-reduce tasks.
Cluster-Oriented Computation with Map-Reduce
• Map-reduce is designed for handling large data volumes on distributed systems (e.g.,
Hadoop).
• It ensures computations are suitable for cluster environments.
• As data grows, more organizations will adopt the map-reduce pattern for efficient
processing.
By breaking complex tasks into smaller, reusable steps, the map-reduce pattern enables
efficient computation for large-scale data processing.

2) What are key-value stores and popular key-value databases? Discuss with an
example how data is organized within a single bucket and mention ways to handle key
conflicts.
Key-Value Stores and Popular Key-Value Databases
Key-Value Stores: Key-value stores are NoSQL databases designed to store data as a pair of
keys and values. A key is a unique identifier, and the value is a blob that can store any type of
data. These stores are accessed using primary keys, offering high performance and easy
scalability.
Popular key-value databases include:
• Riak
• Redis (often called a Data Structure server)
• Memcached DB and its variants
• Berkeley DB
• HamsterDB (for embedded use)
• Amazon DynamoDB (not open-source)
• Project Voldemort (an open-source implementation of Amazon's Dynamo design)

Example: Organizing Data Within a Single Bucket
In key-value stores like Riak, data is organized into buckets. Buckets act as flat namespaces for keys. For example, to store data
like user session data, shopping cart information, and user preferences, all of these can be
placed in a single bucket as a single object.
• Key: 12345
• Value: A single object containing user session, shopping cart, and user preferences
data.
This approach is simple but increases the chance of key conflicts.
Handling Key Conflicts:
1. Change the Key Design:
Append object names to keys to make them unique.
o Example:
▪ Key: 288790b8a421_userProfile
▪ Key: 288790b8a421_shoppingCart
2. Use Separate Buckets for Different Data:
Create separate buckets for different types of data, like UserProfile and ShoppingCart.
This prevents conflicts by segmenting data.
o Bucket: UserProfile
▪ Key: 12345
▪ Value: User profile data
o Bucket: ShoppingCart
▪ Key: 67890
▪ Value: Shopping cart data
By organizing data this way, you avoid key conflicts and ensure easy access to specific
objects.
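A toy sketch of both options, with plain Python dicts standing in for buckets; the keys and bucket names are illustrative.

```python
# Option 1: one flat bucket, object name appended to the key.
bucket = {
    "288790b8a421_userProfile": {"name": "Martin"},
    "288790b8a421_shoppingCart": {"items": ["NoSQL Distilled"]},
}

# Option 2: separate buckets per object type, plain keys inside each.
user_profile_bucket = {"288790b8a421": {"name": "Martin"}}
shopping_cart_bucket = {"288790b8a421": {"items": ["NoSQL Distilled"]}}

# Either way, each object now has an unambiguous home, avoiding key conflicts.
print(bucket["288790b8a421_userProfile"])
print(shopping_cart_bucket["288790b8a421"])
```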

3) Total Sales Calculation Using MapReduce


The MapReduce process is used to calculate the total sales for each product based on the given sales
data. It consists of three main steps: Map, Shuffle and Sort, and Reduce. Let's explain this step by
step with examples.

1. Map Function
The Map function processes each record in the input data and generates intermediate key-value pairs.
• Input Data (Product, Quantity, Price):
1. (Puerh, 8, 24)
2. (Dragonwell, 12, 24)
3. (Genmaicha, 20, 80)
4. (Puerh, 5, 15)
5. (Dragonwell, 16, 48)
• Key: Product name
• Value: Total sales for the record (Quantity × Price)
Output of Map Function:
For each record, the map function generates:
• (Puerh, 8 × 24 = 192)
• (Dragonwell, 12 × 24 = 288)
• (Genmaicha, 20 × 80 = 1600)
• (Puerh, 5 × 15 = 75)
• (Dragonwell, 16 × 48 = 768)
Intermediate Key-Value Pairs:
• (Puerh, 192)
• (Dragonwell, 288)
• (Genmaicha, 1600)
• (Puerh, 75)
• (Dragonwell, 768)

2. Shuffle and Sort


The Shuffle and Sort phase automatically groups the intermediate key-value pairs by their key
(product name).
Grouped Key-Value Pairs:
• (Puerh, [192, 75])
• (Dragonwell, [288, 768])
• (Genmaicha, [1600])

3. Reduce Function
The Reduce function aggregates the values for each key. Here, it sums up the total sales for each
product.
Calculations:
• For Puerh: 192 + 75 = 267
• For Dragonwell: 288 + 768 = 1056
• For Genmaicha: Total Sales = 1600
Final Output:
• (Puerh, 267)
• (Dragonwell, 1056)
• (Genmaicha, 1600)

Final Explanation:
1. The Map function transforms input data into key-value pairs where the key is the product
name and the value is the total sales (Quantity × Price).
2. The Shuffle and Sort phase groups all values for the same product together.
3. The Reduce function sums up the grouped values to calculate the total sales for each product.
Final Output:
• Puerh: 267
• Dragonwell: 1056
• Genmaicha: 1600
This approach simplifies processing large datasets by breaking it into smaller tasks.
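The same three phases can be written in a few lines of Python, using the data from the example above:

```python
# Map, shuffle/sort (group by key), and reduce for the total-sales example.
from collections import defaultdict

records = [("Puerh", 8, 24), ("Dragonwell", 12, 24), ("Genmaicha", 20, 80),
           ("Puerh", 5, 15), ("Dragonwell", 16, 48)]

# Map: emit (product, quantity * price).
mapped = [(product, qty * price) for product, qty, price in records]

# Shuffle and sort: group the emitted values by key.
grouped = defaultdict(list)
for product, sales in mapped:
    grouped[product].append(sales)

# Reduce: sum each group.
totals = {product: sum(values) for product, values in grouped.items()}
print(totals)  # {'Puerh': 267, 'Dragonwell': 1056, 'Genmaicha': 1600}
```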

4) Explain the features of key-value stores.


Features of Key-Value Stores
Key-value stores provide a straightforward and efficient way to manage data. Here are their
main features explained simply:

1. Consistency
• Key-value stores prioritize performance and often use an eventually consistent
model. This means changes made to data might take some time to update across all
servers.
• For example, in Riak, if two people make changes at the same time, users can either
choose "last write wins" or handle multiple conflicting values on their own.
• Settings like the replication factor (n_val) and write quorum (w) allow users to balance consistency and performance.
2. Transactions
• Key-value stores don’t usually support multi-key or multi-document transactions
like relational databases do.
• Instead, they use a quorum model to ensure reliability. This model includes:
o N: Number of copies (replicas) of data.
o W: Number of successful writes required.
o R: Number of successful reads required.
• These configurations ensure reliable data availability.

3. Query Features
• Key-value stores allow simple lookups using a key but don’t support complex queries
like SQL databases.
• Applications need to design meaningful keys for efficient retrieval.
• Advanced features like Riak Search can add more flexibility, such as querying with
Lucene (a search engine library).
• They work best in predictable query situations, such as session storage or shopping
carts.

4. Structure of Data
• Data is stored as a blob (a single block of data), which can be any format like JSON,
XML, or plain text.
• The database doesn’t understand the data’s structure; it’s the application’s job to
process it.
• In Riak, users can specify the format of the data using a Content-Type header,
which helps during retrieval but doesn’t change how the data is stored.

5. Scaling
• Key-value stores scale horizontally by sharding, dividing data across multiple servers
based on the keys.
• Adding more servers increases capacity. However, if a server fails, data on it becomes
temporarily unavailable unless replication is used.
• Tools like replication and configurations (e.g., N, R, and W values in Riak) help
balance consistency, availability, and partition tolerance (CAP theorem).
Summary
Key-value stores are simple, scalable, and perform well for specific use cases like session
data or shopping carts. However, they require careful key design and don’t support complex
transactions or queries.
MODULE 4
1) What is a document database? Explain with an example how data is stored in it and
how it differs from an RDBMS.
What is a Document Database?
A document database is a type of NoSQL database that stores, retrieves, and manages data
as documents. These documents are typically in JSON-like format, making them flexible
and easy to use. Unlike relational databases (RDBMS), where data follows a fixed structure
(schema), document databases allow each document to have its own unique structure. This
makes them suitable for applications where the schema can change over time.

Example of Data Storage in a Document Database


Here are two sample documents stored in a document database:
1. First Document:
{
  "firstname": "Martin",
  "likes": ["Biking", "Photography"],
  "lastcity": "Boston"
}
• This document has fields like firstname, likes (a list), and lastcity.
2. Second Document:
{
  "firstname": "Pramod",
  "addresses": [
    { "state": "AK", "city": "DILLINGHAM", "type": "R" },
    { "state": "MH", "city": "PUNE", "type": "R" }
  ],
  "citiesvisited": ["Chicago", "London", "Pune", "Bangalore"],
  "lastcity": "Chicago"
}
• This document includes fields like firstname, addresses (a list of objects), citiesvisited
(a list), and lastcity.
These documents can have different fields, and the structure is flexible.
Comparison with Relational Database
In an RDBMS, the same data would be stored in tables with predefined columns. Below is
the equivalent representation in RDBMS:
Create Tables:
CREATE TABLE Users (
  user_id INT AUTO_INCREMENT PRIMARY KEY,
  firstname VARCHAR(100),
  lastcity VARCHAR(100)
);

CREATE TABLE Likes (
  like_id INT AUTO_INCREMENT PRIMARY KEY,
  user_id INT,
  like_item VARCHAR(100), -- "like" is a reserved word in SQL, so the column is renamed
  FOREIGN KEY (user_id) REFERENCES Users(user_id) ON DELETE CASCADE
);

CREATE TABLE CitiesVisited (
  city_id INT AUTO_INCREMENT PRIMARY KEY,
  user_id INT,
  city VARCHAR(100),
  FOREIGN KEY (user_id) REFERENCES Users(user_id) ON DELETE CASCADE
);

CREATE TABLE Addresses (
  address_id INT AUTO_INCREMENT PRIMARY KEY,
  user_id INT,
  state VARCHAR(2),
  city VARCHAR(100),
  type VARCHAR(1),
  FOREIGN KEY (user_id) REFERENCES Users(user_id) ON DELETE CASCADE
);
Insert Data:
INSERT INTO Users (firstname, lastcity)
VALUES
  ('Martin', 'Boston'),
  ('Pramod', 'Chicago');

INSERT INTO Likes (user_id, like_item)
VALUES
  (1, 'Biking'),
  (1, 'Photography');

INSERT INTO CitiesVisited (user_id, city)
VALUES
  (2, 'Chicago'),
  (2, 'London'),
  (2, 'Pune'),
  (2, 'Bangalore');

INSERT INTO Addresses (user_id, state, city, type)
VALUES
  (2, 'AK', 'DILLINGHAM', 'R'),
  (2, 'MH', 'PUNE', 'R');
In RDBMS, the data is stored in separate tables, and relationships are maintained using keys.
Differences Between Document Database and RDBMS

| Feature | Document Database | RDBMS |
| --- | --- | --- |
| Data Structure | Flexible (JSON-like documents). | Fixed schema (tables). |
| Relationships | Embedded or referenced directly. | Managed through foreign keys. |
| Queries | Simple key-based or nested field queries. | Complex SQL queries with joins. |
| Scalability | Horizontal scaling using sharding. | Vertical scaling (adding resources to one server). |
| Use Cases | Applications with changing schema or diverse data types. | Applications needing structured, predictable schema. |

This shows how document databases are more flexible, while RDBMS requires strict
structure.

2) List and explain the suitable use cases for document databases.
Suitable Use Cases
1. Event Logging
o Many applications generate different types of event logs.
o Document databases are a great fit to store all these logs in one central place,
especially when the structure of the logs changes frequently.
o Events can be organized based on their type (e.g., order_processed,
customer_logged) or the application name where the event occurred.
2. Content Management Systems and Blogging Platforms
o Document databases handle content like web pages, user comments, or
profiles effectively because they have no fixed structure (schema).
o These databases work well with JSON, making them ideal for websites, user
registrations, and publishing platforms.
3. Web Analytics or Real-Time Analytics
o Document databases are suitable for storing data like page views or unique
visitors in real-time.
o New metrics can be added easily without requiring changes to the database
structure.
4. E-Commerce Applications
o E-commerce platforms need flexible data models for products, orders, and
customer details.
o Document databases allow evolving these models without the need for
expensive changes to the database structure.

When Not to Use


1. Complex Transactions Spanning Different Operations
o Document databases are not ideal if you need to perform operations involving
multiple documents at the same time with strict accuracy (atomicity).
o However, some advanced document databases, like RavenDB, do support such
features.
2. Queries Against Varying Aggregate Structure
o Document databases save data in flexible formats without enforcing a fixed
structure.
o If your queries frequently change or involve combining (joining) different
parts of data, they may not perform efficiently.
o Relational databases, which organize data in structured tables, are better in
such cases.
This simple explanation highlights when document databases may not be the best choice.

3) Explain the differences in query handling between MongoDB and RDBMS with
examples.
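The notes do not include an answer for this question. As a minimal, hedged illustration of the difference, the same lookup is sketched below with the pymongo driver and as SQL; the collection, table, and field names follow the example from Question 1, and the connection details are assumptions.

```python
# MongoDB: the query is itself a document, and no join is needed because
# the user's addresses are embedded in the same document.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # assumes a local MongoDB instance
users = client["ecommerce"]["users"]
for doc in users.find({"firstname": "Pramod"}, {"addresses": 1}):
    print(doc)

# RDBMS equivalent: the same data is spread over tables, so the query joins:
#   SELECT u.firstname, a.city
#   FROM Users u JOIN Addresses a ON a.user_id = u.user_id
#   WHERE u.firstname = 'Pramod';
```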
4) What is a replica set? How does replication work in MongoDB? What are the
alternatives to MongoDB?
What is a Replica Set?
A replica set in MongoDB is a group of servers that keep copies of the same data.
• One server is the primary node, which handles all the write operations.
• Other servers are secondary nodes, which copy the data from the primary node.
These secondary nodes can handle read requests if configured. This setup helps to
manage a large number of read requests and provides backup in case the primary node
fails.

How Does Replication Work in MongoDB?


1. The primary node stores and updates the data.
2. The secondary nodes replicate the data from the primary node.
3. If the primary node goes down, one of the secondary nodes is automatically promoted
to the new primary node.
Adding more nodes to a replica set increases the database's ability to handle read requests
without downtime. For example:
• A new node (like Mongo D) can be added using the command:
rs.add("mongod:27017")
• This node automatically syncs with the primary and starts serving read requests.
Alternatives to MongoDB
If MongoDB does not meet your needs, here are some alternatives:
1. MySQL: A popular relational database for structured data with well-defined schemas.
2. PostgreSQL: A powerful, open-source relational database with advanced features
like JSON support.
3. Cassandra: A distributed database designed for scalability and high availability.
4. Neo4j: A graph database suited for applications requiring complex relationship
handling.
5. CouchDB: A NoSQL database with a focus on synchronization and offline access.
These alternatives are chosen based on factors like the type of data, scalability needs, and use
cases.
5) Briefly explain the scaling feature in document databases with a neat diagram.

Scaling Feature in Document Databases


Scaling in document databases, like MongoDB, is the process of improving performance and
handling more data by adding nodes or changing the data storage strategy, without
upgrading the hardware of a single server. It ensures that the database can handle increasing
loads for both read and write operations.
1. Scaling for Read Loads
• For applications with heavy-read requirements, read slaves are added to the cluster.
• All read operations are distributed among the slaves, reducing the load on the primary
node.
• This is called horizontal scaling for reads.
• Example: Adding a new node (e.g., mongo D) to a 3-node replica set improves read
capacity without downtime.
Command to Add a Node:
rs.add("mongod:27017")
2. Scaling for Write Loads
• When write operations increase, sharding is used.
• Sharding splits data into parts based on a specific field (e.g., firstname) and stores
them across multiple nodes, called shards.
• Each shard can be a replica set, allowing efficient writes and reads within the shard.
Sharding Command:
db.runCommand({ shardcollection: "ecommerce.customer", key: { firstname: 1 } })
• Data is dynamically moved between shards to balance the load as new nodes are
added.
3. Shard Key
The shard key decides how data is distributed across shards. For example, if the shard key is
based on location, data for East Coast users is stored in East Coast shards, ensuring faster
access for users in that region.

1. Replica Set for Read Scaling: a primary node with multiple secondary nodes (slaves) handles read requests.
2. Sharding for Write Scaling: multiple shards with data distributed across nodes; each shard is a replica set for better performance.
Key Points:
• Scaling ensures no downtime for the application.
• Horizontal scaling improves both read and write performance by adding more
nodes to the system.
6) Explain a few applications where document databases should not be used.
Applications Where Document Databases Should Not Be Used
1. Complex Transactions Spanning Different Operations
o If you need to perform transactions that affect multiple documents and all of
them must succeed or fail together (like transferring money between
accounts), document databases may not work well.
o Some document databases like RavenDB can handle such tasks, but many
others cannot.
2. Queries Against Varying Aggregate Structures
o Document databases allow you to store data in a flexible way, but this can be a
problem if you frequently change the way your data is structured or if your
queries need to constantly adjust to different data formats.
o If your data needs to be in a very organized and normalized format, document
databases may not be the best choice.
In short, document databases are not ideal if your application needs complicated transactions
or frequent changes in data structure. In those cases, relational databases might be a better
option.
MODULE 5
1) What is a graph database? Explain how relationships and properties are represented
in a graph, with a neat diagram.
What is a Graph Database?
A graph database is a type of NoSQL database that uses a graph to organize and store data. It
is different from traditional databases, which store data in tables. In a graph database,
everything is connected, and it’s designed to handle data that is highly related, like in social
networks, recommendation systems, or fraud detection.
Components of a Graph Database
1. Nodes: These are the main objects in the graph. Nodes represent entities such as
people, products, or places. For example, in a social network, a person can be
represented as a node.
2. Edges: These represent the relationships between nodes. An edge connects two nodes
and shows how they are related. For example, "is friends with," "purchased," or
"works at" could be different types of relationships between nodes.
3. Properties: These are extra details that can be added to both nodes and edges. A node
representing a person might have properties like "name" and "age." An edge
connecting two people might have a property like "date" to show when they became
friends.
Representation of Relationships and Properties in a Graph
• Nodes represent real-world entities like a person, product, or event.
• Edges show how nodes are connected and represent the relationships between them.
These edges have a direction, indicating who is connected to whom (e.g., Alice is
friends with Bob, but not vice versa).
• Properties add extra details to both nodes and edges. For example, a node might store
a person's "name" and "age," and an edge might store the "date" when two people
became friends.
In simple terms, a graph database helps you see how things are connected and makes it easy
to find relationships between different pieces of data.
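A toy sketch of a property graph using plain Python dicts, showing nodes, directed edges, and properties on both; the structure is illustrative, not a real database's storage format.

```python
# Nodes carry properties; edges are directed and carry a type plus properties.
nodes = {
    "alice": {"name": "Alice", "age": 30},
    "bob":   {"name": "Bob", "age": 32},
}
edges = [
    # (from, to, relationship type, edge properties)
    ("alice", "bob", "FRIENDS_WITH", {"since": "2015-06-01"}),
]

# "Who is Alice friends with, and since when?"
for src, dst, rel, props in edges:
    if src == "alice" and rel == "FRIENDS_WITH":
        print(nodes[dst]["name"], props["since"])  # -> Bob 2015-06-01
```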
2) Explain transaction, consistency and availability with respect to graph databases.
Consistency
In graph databases, consistency refers to ensuring that data is reliable and accurate, especially
when handling complex relationships between nodes. Most graph databases, like Neo4J,
ensure data consistency within a single server by being ACID-compliant. This means that any
changes (like adding or deleting nodes or relationships) are reliable and won't result in errors
or incomplete data.
However, when a graph database is spread across multiple servers (a cluster), not all nodes
can be stored on separate servers because of the connected nature of the data. Some
databases, like Infinite Graph, support distributing nodes across a cluster, but in Neo4J, data
consistency is ensured through master-slave replication. When changes are made to the
master server, they are eventually synchronized to the slave servers, but slaves can always be
read from. This ensures that data remains consistent, and dangling relationships (where nodes
are not connected correctly) are prevented.
Transactions
In graph databases, a transaction is a way to group operations together to ensure that they are
performed correctly. Neo4J, for example, is ACID-compliant, meaning it handles transactions
in a reliable way to maintain consistency and integrity.
Before making changes to nodes or relationships in the database, a transaction must be
started. If you don’t wrap operations in a transaction, the system will throw an error. For
example, in the code below, a transaction is initiated, a node is created, and properties are set
before marking the transaction as successful and finally finishing it:
Code:
Transaction transaction = database.beginTx();
try {
  Node node = database.createNode();
  node.setProperty("name", "NoSQL Distilled");
  node.setProperty("published", "2012");
  transaction.success();
} finally {
  transaction.finish();
}
The key point is that if you don’t mark the transaction as successful and finish it, the changes
won’t be saved to the database. This is different from how transactions work in traditional
relational databases (RDBMS).
Availability
Availability refers to ensuring that a database is always available for reading and writing
operations, even when some parts of the system fail. Neo4J achieves high availability by
replicating data on slave servers. These slaves can also handle write operations and
synchronize them to the master server. The data is first written to the master, then to the
slave. Other slaves will eventually receive the update.
In summary:
• Consistency ensures that the data is always correct and synchronized across servers.
• Transactions group changes together to keep data reliable, ensuring no data is lost or
corrupted.
• Availability guarantees that the system can always be accessed, even if some parts
fail, by replicating data across multiple servers.
3) Describe the query features of graph databases in detail with examples.
1. Query Languages
• Cypher: Cypher is the query language used by Neo4J. It is declarative, meaning you
tell the system what information you want, and it figures out how to retrieve it. It's
similar to SQL but designed for querying graph data, where you work with nodes and
relationships rather than tables.
• Gremlin: Gremlin is another query language used for graph databases. It’s part of the
TinkerPop framework and works with various graph databases. Unlike Cypher, which
is declarative, Gremlin is procedural, meaning you specify the steps to navigate
through the graph.
2. Indexing and Searching
• Indexing: In graph databases, indexing is used to speed up searches. By creating
indexes on certain properties of nodes or relationships (like names or dates), the
database can quickly find specific data without having to search the entire graph.
• Searching: Searching in a graph database means finding specific nodes or
relationships that match certain criteria. With indexing, you can search for specific
values (e.g., find a node with a specific name or a relationship of a certain type). For
example, if you want to find a person named "Barbara" in a social network, you can
search through an index to quickly locate her node.
3. Traversals
• Traversals: Traversing a graph means exploring the relationships between nodes.
You can decide how deep to explore (how many relationships to follow) and the
direction of exploration (whether you're following outgoing, incoming, or both
directions of relationships).
• There are two common strategies for traversals:
o BREADTH_FIRST: Explore all neighbors before moving deeper.
o DEPTH_FIRST: Go deeper into one branch before exploring others.
• The direction of traversal can be adjusted, for example, following outgoing
relationships (moving away from the start node) or incoming relationships (moving
toward the start node).
4. Pathfinding
• Pathfinding: Graph databases allow you to find paths between nodes. For example,
you could find the shortest path between two people using algorithms like Dijkstra.
Pathfinding can be useful for applications like social networks or recommendation
systems where you need to identify connections or suggest new relationships.
5. Cypher Query Language Syntax
• In Cypher, you define the query in a few simple steps:
o START: Choose the starting node for your query.
o MATCH: Define the relationships between nodes that you want to explore.
o WHERE: Set conditions to filter the results.
o RETURN: Specify what information to return from the query.
o ORDER BY: Sort the results.
o SKIP: Skip a certain number of results (useful for pagination).
o LIMIT: Limit the number of results returned.
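Putting these clauses together, here is a hedged sketch that runs one such query through the official neo4j Python driver; the URI, credentials, labels, and property names are assumptions for illustration.

```python
# Find friends-of-friends of a person, filtered, sorted, and limited --
# combining MATCH, WHERE, RETURN, ORDER BY, and LIMIT in one Cypher query.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

query = """
MATCH (me:Person {name: $name})-[:FRIEND]->()-[:FRIEND]->(fof:Person)
WHERE fof.age > 25
RETURN fof.name
ORDER BY fof.name
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(query, name="Barbara"):
        print(record["fof.name"])
driver.close()
```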
Graph Databases Use Cases
Graph databases are great for handling complex relationships. They are used in:
• Social Networks: To find connections between users, such as friends and friends-of-
friends (see the sketch below).
• Recommendation Systems: To suggest items based on user preferences and
relationships.
• Path Analysis: To find the best path between nodes, useful for routing in
transportation or network analysis.
These features make graph databases perfect for scenarios where relationships between data
are central, such as social networks, recommendations, and more.
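For the social-network case above, a friends-of-friends lookup is a single pattern. A hedged
sketch:

// People reachable through Barbara's friends who are not already her friends
MATCH (u:Person {name: 'Barbara'})-[:FRIEND]->()-[:FRIEND]->(fof)
WHERE fof <> u AND NOT (u)-[:FRIEND]->(fof)
RETURN DISTINCT fof.name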
4) Discuss the three scaling methods in graph databases with a clear diagram.
Scaling Methods in Graph Databases
In graph databases, scaling is the process of managing and improving the performance of the
database as the amount of data and the number of users grow. Here are three main methods to
scale graph databases:
1. Adding More RAM to the Server
• Description: One way to scale graph databases is by increasing the amount of RAM
on a server. The more RAM a server has, the more data it can store and process in-
memory (i.e., without needing to read from disk). This works well if the size of the
data fits within the server’s RAM.
• How it works: When all the nodes and relationships are stored in memory, the graph
can be processed much faster. This method is effective when you have a dataset that is
not too large, so it fits in the available memory.
• Limitation: This approach works only when the data can fit entirely in memory. If
the dataset is too large for the available RAM, this method won't be effective.
2. Using Read Replicas (Slave Nodes)
• Description: To scale the read operations, you can add slave nodes. These are copies
of the data that only allow reading (not writing). The main server (master) handles all
the write operations, and the slave nodes handle the read requests.
• How it works:
o The master node accepts write operations (inserts, updates, and deletes).
o Slaves are copies of the master that handle only read operations. This setup
improves performance for read-heavy applications, as multiple read requests
can be handled in parallel by the slave nodes.
• Limitation: The slaves can't handle write operations, so the system is still dependent
on the master for any updates or changes. However, this setup is very useful when
there are more read operations than write operations.
3. Application-Level Sharding
• Description: Sharding involves splitting the data into smaller parts, and each part is
stored on a different server. This technique can be done manually at the application
level, where the application decides which node goes to which server based on
domain-specific knowledge (e.g., region, category, etc.).
• How it works:
o For example, you might have one server store data related to North America
and another server store data related to Asia.
o The application needs to know which server holds the data for a particular
region and can route the queries accordingly.
• Limitation: This method requires careful management of how data is distributed, and
it can become complex. It is useful when the dataset is too large to fit on a single
machine or when replication across many servers is not feasible.
Summary
• Adding More RAM: Useful for smaller datasets that fit in memory, providing faster
graph processing.
• Read Replicas: Helps scale read operations by allowing multiple slave nodes to
handle read queries, while the master node handles writes.
• Application-Level Sharding: Splits the data manually across servers based on
business logic (e.g., geographic regions), helping scale when replication is
impractical.
These scaling methods help graph databases manage larger datasets and handle more users
efficiently, even when data grows significantly.
5) List and explain applications where graph databases are suitable and not suitable.
Applications Where Graph Databases Are Suitable and Not Suitable
5.3.1 Connected Data
Graph databases are great for managing connected data, which involves multiple entities
that are interconnected.
• Example: In social networks, graph databases are perfect for representing users and
their relationships (such as friends, family, coworkers, etc.).
• They can also represent complex relationships in other areas like employees and
projects they have worked on, knowledge connections, and more.
• Why Suitable: Graph databases excel in these situations because they can quickly
traverse relationships and handle complex interconnected data across multiple
domains, like social, spatial, and commerce.
5.3.2 Routing, Dispatch, and Location-Based Services
Graph databases are useful for routing and location-based services:
• Example: When planning delivery routes, each location or address can be
represented as a node. The relationships between these nodes can store information
such as distance, helping to find the most efficient route.
• They are also useful for applications that provide recommendations based on
location, like suggesting nearby restaurants or entertainment options when a user
is in a specific area.
• Why Suitable: Graph databases can easily handle the connections between various
locations and help find optimal paths or provide location-based recommendations.
5.3.3 Recommendation Engines
Graph databases are ideal for recommendation engines, where recommendations are made
based on the relationships between users, products, or items:
• Example: A recommendation system might suggest, “your friends also bought this
product,” or “people who visited this place also visited that place” (see the sketch below).
• Why Suitable: As the data grows, graph databases efficiently handle the increasing
relationships between nodes, making it easier to suggest relevant products, services,
or locations based on existing patterns.
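A hedged Cypher sketch of the “your friends also bought this product” suggestion, assuming
hypothetical FRIEND and BOUGHT relationship types:

// Products Barbara's friends bought that she has not bought herself,
// ranked by how many friends bought them
MATCH (me:Person {name: 'Barbara'})-[:FRIEND]->(friend)-[:BOUGHT]->(product)
WHERE NOT (me)-[:BOUGHT]->(product)
RETURN product.name, count(DISTINCT friend) AS friendsWhoBought
ORDER BY friendsWhoBought DESC
LIMIT 5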
5.4 When Not to Use
Graph databases are not suitable in some cases:
• Bulk Updates: When you need to update all entities, or a large subset of them, at once
(for example, changing a property on many nodes), graph databases are not ideal: the
operation touches much of the graph, which is complex and inefficient.
• Global Graph Operations: Analyses that must process the entire graph at once, rather
than traversing outward from a few starting nodes, can be slow on some graph databases
at large scale.
• Why Not Suitable: Graph databases are designed for handling relationships, but they
may not be efficient for bulk updates or large-scale computations involving the entire
dataset.
Summary
• Suitable: Graph databases are perfect for managing connected data (like social
networks), location-based services (like route planning), and recommendation engines
(like suggesting products).
• Not Suitable: They are not the best choice for situations that require bulk updates or
large-scale computations on the entire dataset, such as in analytics applications.
6) With an example graph structure, discuss how relationships are handled in a graph
database compared to an RDBMS.
Graph Database vs. Relational Database: Handling Relationships
Graph Database Representation:
In a Graph Database, relationships are directly represented as edges between nodes.
• Nodes: Represent entities such as people or objects (e.g., Person A, Person B, House
1).
• Edges: Represent the relationships between nodes (e.g., Person A loves Person B,
Person A lives in House 1).
Example:
• Person A LOVES Person B
• Person A LIVES_IN House 1
• Person B LIVES_IN House 1
In a graph database, you can easily follow these relationships by traversing the edges, making
it straightforward to find connections, like discovering all the people that live in the same
house or love the same person.
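A hedged Cypher sketch of this example (the node labels and relationship types are
illustrative):

// Build the example graph
CREATE (a:Person {name: 'Person A'})
CREATE (b:Person {name: 'Person B'})
CREATE (h:House {name: 'House 1'})
CREATE (a)-[:LOVES]->(b)
CREATE (a)-[:LIVES_IN]->(h)
CREATE (b)-[:LIVES_IN]->(h)

// Who shares a house with Person A? One pattern, no joins
MATCH (a:Person {name: 'Person A'})-[:LIVES_IN]->(h)<-[:LIVES_IN]-(other)
RETURN other.name

The same lookup in a relational database would join the Person table, the Lives In
relationship table, and the House table through their foreign keys.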
Relational Database Representation:
In a Relational Database, relationships are represented using tables and foreign keys.
• Tables: Each table holds data about a specific entity, like people or houses.
• Foreign Keys: These are used to link related data across tables.
For example:
• Person Table stores people’s information.
• House Table stores information about houses.
• Love Relationship Table stores who loves whom (using foreign keys to connect
people).
• Lives In Relationship Table stores where each person lives (using foreign keys to
connect people with houses).
In relational databases, to understand the relationships, you need to join tables. For example,
to know who loves whom, you have to look at the Love Relationship Table and connect it
with the Person Table.
Key Differences:
• Graph Database: The relationships are stored as direct connections (edges) between
nodes, making it easier and faster to explore relationships, especially for complex,
interconnected data.
o You can easily find connected nodes by traversing the graph, making queries
more intuitive when working with interconnected data.
• Relational Database: Relationships are stored using foreign keys and represented
across multiple tables, requiring complex queries to join tables and fetch related data.
o As the data grows, these JOIN queries can become slower and more complex,
especially for multiple relationships.
Conclusion:
• Graph databases make it easy to model and query relationships, as everything is
directly linked through edges.
• Relational databases rely on foreign keys and JOIN operations to represent
relationships, which can become complicated and less efficient for deeply connected
data.