NOSQL
1) What is NoSQL? Discuss and differentiate between the relational model and the
aggregate model.
What is NoSQL?
NoSQL stands for "Not Only SQL." It is a type of database designed to handle a wide range
of data models, offer scalability, and ensure high performance. Unlike traditional relational
databases, NoSQL databases are schema-less (no fixed structure) and designed for specific
data types like documents, key-value pairs, column families, or graphs.
Features of NoSQL Databases:
1. Flexible Schema: No fixed structure for storing data, making it suitable for dynamic
and changing data.
2. Horizontal Scalability: Can expand across multiple servers, handling more data and
users.
3. High Performance: Built to process large amounts of data quickly.
4. Varied Data Models: Includes different models like key-value, document, column-
family, and graph.
5. Eventual Consistency: Balances consistency, availability, and partition tolerance as
per the CAP theorem.
Use Cases of NoSQL:
1. Applications with changing data structures.
2. High-speed data scenarios, such as IoT or social media.
3. Distributed databases that require horizontal scaling.
| Feature | Relational Model | Aggregate Model |
|---|---|---|
| Data Representation | Data is divided across multiple tables (normalized). | Data is often combined into a single unit, like a document. |
2) Which data model does not support aggregate orientation? Explain the model with a
suitable diagram.
Relational Database Model does not support aggregate orientation
• Relational Database Model: In relational databases, data is stored in separate tables
(e.g., users, products, orders), and relationships between tables are maintained using
foreign keys.
• NoSQL Approach: Instead of splitting data into many tables, NoSQL databases use
aggregates, where related data is grouped together in a single unit like a document.
E-commerce Example
• For an e-commerce website, data about users, product catalogs, orders, shipping, and
payment needs to be stored.
o Relational Model: Data is normalized into different tables (e.g., users, orders,
products) with relationships defined using keys.
o NoSQL Aggregate Model: Related data, like orders with shipping and
payment details, is grouped together in one document for faster access.
• Why Aggregates are Useful:
o Aggregates improve performance in distributed systems by enabling sharding
(dividing data across servers) and replication (creating copies for backup).
Explanation of Aggregates
Aggregate Boundaries
• Customer Aggregate: Contains customer details, such as name and billing address.
• Order Aggregate: Contains order details, such as items, shipping address, and
payment information.
Data Duplication
• Addresses (e.g., billing and shipping) are copied into different JSON sections instead
of being linked by foreign keys.
• This ensures that data such as the shipping address doesn't change after an order is placed, maintaining the immutability of important data.
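The aggregate boundaries and data duplication described above can be sketched with plain Python dicts standing in for JSON documents (field names are illustrative, not from any specific database):

```python
# Customer aggregate: contains customer details such as name and billing address.
customer = {
    "id": 1,
    "name": "Martin",
    "billing_address": {"city": "Chicago"},
}

# Order aggregate: items, shipping address, and payment information grouped
# together in one unit for fast, single-read access.
order = {
    "id": 99,
    "customer_id": 1,
    "items": [{"product": "NoSQL Distilled", "price": 32.45}],
    # The address is COPIED into the order rather than linked by a foreign key,
    # so it stays immutable even if the customer later moves.
    "shipping_address": {"city": "Chicago"},
    "payment": {"type": "debit", "billing_address": {"city": "Chicago"}},
}

def order_total(order):
    """Everything needed to price the order lives inside one aggregate."""
    return sum(item["price"] for item in order["items"])
```

Because the order aggregate is self-contained, it can be stored on one shard and read in a single operation, which is exactly why aggregates suit distributed systems.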
3) Define key-value stores and explain the differences between key-value and document
data models.
Key-Value Stores
Key-value stores are a type of NoSQL database where data is saved in key-value pairs. Each
key is unique and points to a specific value, which can be any kind of data (e.g., a number,
text, or even JSON).
Characteristics
1. Simple Data Model: Data is organized as key-value pairs.
2. Efficient Lookups: Keys are used to quickly find their corresponding values.
3. Flexible Storage: Values can be simple (like text) or complex (like JSON).
4. Scalability: Easily handles large amounts of data across many servers.
5. Use Cases: Commonly used for caching, session storage, shopping cart data, and app
configuration.
Examples
• Key: userID:12345
• Value: { "name": "John Doe", "age": 30, "email": "[email protected]" }
Popular Key-Value Stores: Redis, Amazon DynamoDB, and Memcached.
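A minimal in-memory sketch of the key-value idea: values are opaque blobs addressed only by unique keys. The class and method names here are illustrative, not a real client API such as Redis or Riak:

```python
class KeyValueStore:
    """Toy key-value store: a thin wrapper over a dictionary."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value      # value is an opaque blob to the store

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("userID:12345", {"name": "John Doe", "age": 30})
```

All access goes through the key; the store itself never inspects the value, which is what makes lookups so fast and queries so limited.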
Differences Between Key-Value and Document Data Models:

| Feature | Key-Value Store | Document Store |
|---|---|---|
| Value | Stored as a single blob (any data type). | Stored as a document with fields and nested structures. |
| Complexity | Simple and fast for key-based lookups. | Allows more complex queries within the document. |
| Use Cases | Ideal for caching, session storage, and simple lookups. | Great for content management, catalogs, and hierarchical data. |
| Scalability | Scales well horizontally in distributed systems. | Also scales well but may need indexing for complex queries. |
In Simple Terms
• Key-Value Stores are like a dictionary: you search using a key and get back the
value.
• Document Data Models are like a folder: they store detailed and organized data that
you can search within.
4) Describe with an example how column family stores data in the aggregate structure.
Column-Family Databases
Column-family databases (like Google’s Bigtable, HBase, and Cassandra) organize data into
two levels: rows and columns. They are designed for handling large datasets, especially in
distributed systems.
In Simple Terms
• Rows represent a group (like a customer).
• Column families organize related data within a row (like profile details and order
history).
• It’s flexible because each row can have a different number of columns.
• Sorting and ordering make it efficient for finding data in a sequence, like searching
orders by date or ID.
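The two-level row/column-family layout can be sketched as nested dictionaries (row keys, family names, and columns below are invented for illustration):

```python
# row key -> column family -> columns
rows = {
    "customer:1234": {
        "profile": {"name": "Alice", "city": "Boston"},
        "orders": {
            "order:2024-01-15": "laptop",
            "order:2024-03-02": "mouse",
        },
    },
    # A different row may carry a different set of columns -- the flexibility
    # noted above.
    "customer:5678": {
        "profile": {"name": "Bob"},
        "orders": {},
    },
}

def columns(row_key, family):
    """Fetch one column family for a row, loosely like a Cassandra slice."""
    return rows[row_key][family]
```

Because order columns are named with sortable keys (date-based here), scanning a customer's orders in sequence is just an ordered walk over one column family.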
5) Explain briefly how impedance mismatch occurs in the relational model, and what
are some common solutions to address it?
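Impedance mismatch is the gap between rich, nested in-memory objects and the flat rows of the relational model: a single object must be split across several normalized tables on write and reassembled with joins on read. Common remedies are object-relational mapping layers (such as Hibernate) and aggregate-oriented NoSQL stores that keep the object whole. A minimal sketch of the mismatch, with hypothetical table shapes:

```python
# One in-memory order object...
order = {
    "id": 99,
    "customer": "Alice",
    "lines": [
        {"product": "laptop", "qty": 1},
        {"product": "mouse", "qty": 2},
    ],
}

def to_rows(order):
    """Flatten the nested object into rows for two normalized tables
    (orders, order_lines) -- the translation an ORM automates."""
    order_row = (order["id"], order["customer"])
    line_rows = [(order["id"], line["product"], line["qty"])
                 for line in order["lines"]]
    return order_row, line_rows
```

Reading the object back requires a join-like reassembly of those rows, and that constant translation back and forth is the mismatch.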
6) What are materialized views, and how do they differ from relational views in terms of
data access? What strategies are used to build materialized views?
3.4 Materialized Views
What are Materialized Views?
Materialized views are like stored copies of the results of a query. Instead of calculating the
data every time, they save the precomputed data on disk. This makes it faster to retrieve the
data, especially for queries that are used often.
Relational views, on the other hand, are not stored. They calculate the data whenever you
query them, which can take more time. Materialized views are faster to access but may not
always show the latest updates (they can be a bit outdated).
Differences in Terms of Data Access
1. Relational Views:
o Data is computed when you access it.
o Flexible but slower for large or frequent queries.
2. Materialized Views:
o Data is precomputed and stored for quick access.
o Good for heavy reads but may not always have the most recent data.
Strategies to Build Materialized Views
1. Eager Approach:
o Updates the materialized view as soon as the base data changes.
o Keeps the view fresh but can slow down writes to the database.
o Best when you need fast reads and don’t update the data too often.
2. Batch Updates:
o Updates the materialized view at set times (e.g., every few hours).
o Works well if small delays in updates are acceptable.
3. External Computation:
o The view is calculated outside the database, then stored back in it.
o Useful when you need specific, custom calculations.
4. Database-Supported Computation:
o Many databases let you define how to compute the view.
o The database handles the updates based on the rules you set, like for
incremental updates.
Examples in Use:
• In NoSQL systems, materialized views are often created using map-reduce
techniques.
• In column-family databases (like Cassandra), materialized views can be updated in
the same operation as the base data for efficiency.
Materialized views help speed up data access and are particularly helpful when you need
quick answers to repeated queries.
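The eager approach above can be sketched in a few lines: the precomputed view is refreshed as part of every write to the base data, so reads never recompute anything. The variable names are invented for illustration:

```python
base_orders = []                 # base data
sales_by_product = {}            # materialized view: precomputed totals

def insert_order(product, amount):
    base_orders.append((product, amount))
    # Eager maintenance: the view is updated in the same operation as the
    # base write, keeping it fresh at the cost of slower writes.
    sales_by_product[product] = sales_by_product.get(product, 0) + amount

insert_order("laptop", 60000)
insert_order("laptop", 45000)
insert_order("mouse", 500)
# A read now hits sales_by_product directly -- no scan of base_orders.
```

A batch-update strategy would instead rebuild `sales_by_product` from `base_orders` on a schedule, trading freshness for cheaper writes.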
MODULE 2
1) Define Master-Slave replication. With a neat diagram, explain the advantages and
disadvantages of master-slave replication.
Master-Slave Replication
Definition:
Master-slave replication is a method of replicating data across multiple nodes in a database
system. One node acts as the master (primary), which is responsible for all updates or writes.
The other nodes, called slaves (secondaries), synchronize their data with the master. All
updates happen on the master, and these updates are then propagated to the slaves.
Slaves can handle read requests, which makes this method useful for read-intensive
datasets. The replication process ensures the data remains consistent across all nodes.
This shows that the master handles all writes, and the slaves synchronize with the master to
handle read operations.
Master-slave replication is useful for scaling read-heavy workloads but comes with
challenges like write limitations and potential data inconsistency.
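A rough simulation of the flow described above: writes go only to the master, slaves sync from it and serve reads, and a read before propagation shows the stale-data risk. The node structure is invented for illustration:

```python
class Node:
    def __init__(self):
        self.data = {}

master = Node()
slaves = [Node(), Node()]

def write(key, value):
    master.data[key] = value          # all writes hit the master

def replicate():
    for slave in slaves:              # propagation step (here: full copy)
        slave.data = dict(master.data)

def read(key):
    # Reads are spread across slaves; until replicate() runs, a slave may
    # return stale data -- the inconsistency window noted above.
    return slaves[0].data.get(key)

write("stock:laptop", 10)
replicate()
```

Real systems propagate changes continuously rather than copying everything, but the asymmetry (one writer, many readers) is the same.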
2) In a distributed inventory system, the product “Laptop" has the following details:
Price: ₹60,000, Stock: 10, Version Stamp: v1. Example: User A updates the Price to
₹50000, and User B updates it to ₹45000 at the same time. For this example, how can
different version stamping methods be applied to track these updates, and what are the
advantages and disadvantages of each method?
Version Stamping in a Distributed Inventory System
When multiple users update the same product at the same time, conflicts arise. Version
stamping helps track and resolve these conflicts in distributed systems. Below is an
explanation of how version stamping methods can be applied to the example scenario.
Scenario Recap
• Initial State: Product: Laptop, Price: ₹60,000, Stock: 10, Version Stamp: v1.
• User A's Update: Changes price to ₹50,000.
• User B's Update: Changes price to ₹45,000.
• Simultaneous Updates: Both updates occur at the same time, leading to a conflict.
Comparison of Methods
| Method | Conflict Resolution | Complexity | Advantages | Disadvantages |
|---|---|---|---|---|
| Timestamp-Based | Latest timestamp wins | Low | Simple, easy to implement | Requires synchronized clocks; overwrites valid updates. |
| Vector Clock-Based | Detects causal conflicts | High | Tracks causality; ideal for many nodes | Complex; metadata size increases. |
| OCC with Version Nos. | Prevents stale updates | Low | Simple; avoids stale updates | High contention can lead to frequent retries. |
By choosing the appropriate version stamping method, conflicts in distributed systems can be
resolved effectively based on the specific requirements of the system.
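The vector-clock method can be sketched briefly for the Laptop scenario: instead of silently letting one price overwrite the other, comparing the clocks shows that the two updates are concurrent and must be reconciled. Node names and clock values are invented for illustration:

```python
def happens_before(a, b):
    """True if clock a causally precedes clock b (every entry <=, not equal)."""
    return a != b and all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b))

def concurrent(a, b):
    """Neither clock precedes the other: a genuine write-write conflict."""
    return a != b and not happens_before(a, b) and not happens_before(b, a)

base = {"nodeA": 1}                  # version stamp v1: price 60000
update_a = {"nodeA": 2}              # User A via nodeA: price -> 50000
update_b = {"nodeA": 1, "nodeB": 1}  # User B via nodeB: price -> 45000
```

Both updates descend from `base`, but neither descends from the other, so a store like Riak would keep both siblings and ask the application (or user) to merge them.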
3) What is the CAP theorem? Explain the trade-offs between its three properties in
detail.
The CAP Theorem, proposed by Eric Brewer, states that in a distributed system, it is
impossible to guarantee all three properties—Consistency (C), Availability (A), and
Partition Tolerance (P)—at the same time. Systems must choose two of the three properties
based on their specific needs.
CAP Properties
1. Consistency (C):
• All nodes in the system show the same data at any given time.
• Any change to the data is instantly updated across all nodes.
• Example: If a sales figure is updated on one node, all other nodes should immediately
reflect the new value.
2. Availability (A):
• The system ensures that every request gets a response, even if some nodes are down.
• Achieved by replicating data across multiple nodes.
• Example: A user query for sales data gets a response, regardless of failures in part of
the system.
3. Partition Tolerance (P):
• The system continues to work even if there are network partitions (communication
failures between nodes).
• Ensures fault tolerance and resilience during such failures.
• Example: Even if part of the system cannot communicate with another, operations on
one side still succeed.
CAP Combinations
Since no distributed system can satisfy all three properties, they must select two out of the
three:
1. Consistency + Availability (CA):
• Guarantees data is the same across all nodes (Consistency).
• Ensures all requests get a response (Availability).
• Trade-off: Does not work if there is a network partition (requires perfect
communication).
• Example: Relational databases in centralized systems.
2. Availability + Partition Tolerance (AP):
• Keeps the system responsive during network failures (Partition Tolerance).
• May allow some nodes to show outdated or inconsistent data (sacrifices
Consistency).
• Example: DynamoDB, where availability is prioritized to maintain responsiveness.
3. Consistency + Partition Tolerance (CP):
• Ensures all nodes have consistent data (Consistency).
• Works during network partitions but sacrifices availability (some requests may be
rejected).
• Example: MongoDB, used in systems that need accurate and consistent data.
Network Partition and Trade-offs
When a network partition occurs, systems behave based on their chosen CAP combination:
1. AP (Availability + Partition Tolerance):
• Prioritizes availability, ensuring the system remains responsive.
• Suitable for applications like social media or e-commerce, where the user experience
is critical.
2. CP (Consistency + Partition Tolerance):
• Prioritizes consistency by ensuring accurate data, even if responses are delayed.
• Some requests may fail until the latest data is available.
• Suitable for applications like banking or financial transactions, where data accuracy is
essential.
Understanding CAP Theorem helps in designing distributed systems that align with
application requirements and trade-offs effectively.
4) Identify the type of conflict in the given scenario. How can it be solved? Alice and
Bob both try to book the last available room at the same time. Alice starts filling in her
details, but Bob completes his booking first. When Alice submits her booking, it
overwrites Bob's reservation, and the room is booked for Alice instead.
Identify the Type of Conflict in the Given Scenario. How Can It Be Solved?
Type of Conflict:
The scenario represents a write-write conflict, where both Alice and Bob are trying to
update the same data (the last available room) at the same time. Since there is no proper
concurrency control, Alice’s update overwrites Bob’s reservation, leading to inconsistency.
Solution:
To solve this conflict, concurrency control mechanisms can be applied, using either a
pessimistic approach or an optimistic approach:
1. Pessimistic Approach (Write Locks):
o Implement a write lock for the room booking.
o When Alice or Bob tries to book, they must acquire a lock on the data.
o Only one of them can succeed (e.g., Bob acquires the lock first). The other
(Alice) will have to wait or receive a notification that the room is no longer
available.
2. Optimistic Approach (Conditional Updates):
o Use a conditional update system where Alice and Bob’s updates are verified
against the current state of the data.
o In this case, Bob’s booking is processed first, and when Alice submits her
booking, it checks whether the room is still available.
o Since Bob has already booked the room, Alice’s update fails, and she is
informed that the room is no longer available.
3. Conflict Recording and Resolution:
o Save both attempts and flag the conflict.
o Notify Alice and Bob about the conflict and let them resolve it manually (e.g.,
contacting customer support).
o Alternatively, an automated system can prioritize the first completed booking
(Bob’s) and reject the second one (Alice’s).
Key Takeaway:
Using write locks (pessimistic) ensures conflicts are avoided, but it may slow down the
system. Conditional updates (optimistic) allow faster responses but require proper handling of
failed updates. The choice depends on the system’s requirements for safety and
responsiveness.
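The optimistic approach above amounts to a compare-and-set: a booking succeeds only if the room is still in the state the user last saw. A minimal sketch with invented field names:

```python
room = {"available": True, "version": 1, "booked_by": None}

def book(user, expected_version):
    """Conditional update: reject stale writes instead of overwriting."""
    if room["version"] != expected_version or not room["available"]:
        return False                  # stale view -> booking rejected
    room["available"] = False
    room["booked_by"] = user
    room["version"] += 1
    return True

# Both Alice and Bob read the room at version 1. Bob commits first;
# Alice's write then fails instead of overwriting his reservation.
booked_bob = book("Bob", expected_version=1)
booked_alice = book("Alice", expected_version=1)
```

With this check in place, Alice is informed that the room is gone rather than silently stealing Bob's booking.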
5) What is Sharding? With a neat diagram, explain the concept of sharding with an
example.
What is Sharding?
Sharding is a technique used in distributed systems to horizontally scale a data store by
splitting the dataset into smaller, manageable parts called shards. Each shard is stored on a
separate server and handles its own reads and writes. This reduces the load on any single
server, improving performance and scalability.
Explanation of Sharding
In sharding, data is divided across multiple servers. Ideally, users accessing different parts of
the data communicate with different servers, ensuring rapid responses and balanced server
loads.
For example:
• If there are 10 servers, each handles about 10% of the data.
• Each user communicates with only one server, reducing latency.
Key Aspects of Sharding:
1. Data Clumping:
o Data that is often accessed together is stored on the same shard.
o For example, customer orders for a user in Boston can be stored in a data
center in the eastern US.
2. Load Balancing:
o Aggregates (related data items) are distributed evenly across shards to ensure
that no single server is overloaded.
3. Performance:
o Sharding improves read and write performance, especially for write-heavy
applications.
o Unlike replication, which is mainly useful for improving read performance,
sharding scales writes horizontally.
4. Challenges:
o Sharding complicates application logic, especially if implemented manually.
o Auto-sharding, provided by many NoSQL databases, automates the process of
allocating data to shards and ensures correct data access.
o Sharding alone does not improve resilience. If a shard fails, its data becomes
unavailable.
Example of Sharding
Let’s assume an e-commerce application:
• Data is divided based on customer names:
o Customers with surnames A–D are stored on Shard 1.
o Customers with surnames E–G are stored on Shard 2, and so on.
• If a customer with the surname "Anderson" queries the database, the application
directs the query to Shard 1.
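The surname-based routing above can be sketched as a range map (the exact ranges are illustrative):

```python
# Each shard owns a contiguous range of surname initials.
SHARDS = {
    "shard1": ("A", "D"),
    "shard2": ("E", "G"),
    "shard3": ("H", "Z"),
}

def shard_for(surname):
    """Route a query to the shard owning this surname's range."""
    first = surname[0].upper()
    for shard, (lo, hi) in SHARDS.items():
        if lo <= first <= hi:
            return shard
    raise KeyError(surname)
```

Auto-sharding systems maintain a map like this internally (often by hashing keys rather than by ranges) and rebalance it as servers are added or removed.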
Conclusion
Sharding is an effective way to scale a database horizontally by distributing data across
multiple servers. While it significantly improves performance, especially for write-heavy
workloads, it must be implemented with care to avoid operational challenges.
6) Define Quorum. Explain how to read and write a quorum with examples.
Define Quorum
Quorum is the minimum number of nodes in a distributed system that must participate in a
read or write operation to ensure strong consistency. By involving a majority of nodes, we
can avoid conflicts and ensure the data is up-to-date and consistent.
Read Quorum
• A read quorum ensures the latest data is read by involving enough nodes to confirm
the most recent write.
• Formula: R + W > N
o R: Number of nodes involved in the read operation.
o W: Number of nodes required to confirm a write.
o N: Number of replicas of the data (the replication factor).
Write Quorum
• A write quorum requires a majority of the replicas to acknowledge each write: W > N/2.
• With N = 3 replicas, W = 2 gives a majority, so no two conflicting writes can both
reach a quorum.
Example:
• If N = 3 and W = 2 (2 nodes confirm writes), then at least R = 2 nodes must be
contacted for a read to ensure the latest data.
• This ensures R + W = 4 > N = 3, giving a strongly consistent read.
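The quorum arithmetic is simple enough to state as a one-line check: when R + W > N, every read quorum overlaps every write quorum in at least one node, so some contacted node always holds the latest write.

```python
def strongly_consistent(n, r, w):
    """Read and write quorums overlap in at least one node when R + W > N."""
    return r + w > n

# N = 3, W = 2, R = 2: the example above -- consistent reads.
assert strongly_consistent(3, 2, 2)
# N = 3, W = 1, R = 1: fast, but a read can miss the latest write.
assert not strongly_consistent(3, 1, 1)
```

Tuning R and W per operation is how stores like Riak and Cassandra trade consistency against latency.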
Key Points to Remember
• A replication factor of 3 (N = 3) is common and allows resilience even if one node
fails.
• Strong consistency in reads and writes depends on ensuring R + W > N.
• Trade-offs can be made based on the application's need for consistency, speed, and
availability.
MODULE 3
1) Apply the Map-reduce process to compare the sales of products for each month in
2011 to the prior year. Illustrate the process with suitable diagrams.
1.4 A Two-Stage Map-Reduce Example
As map-reduce tasks get complex, it is useful to break them into stages using a pipes-and-
filters approach. Here, the output of one stage serves as the input for the next, similar to
UNIX pipelines.
Problem Statement
We want to compare the sales of products for each month in 2011 with the same month in the
prior year (2010). To solve this, we divide the task into two stages:
• Stage 1: Read the raw order records and produce the total sales of each product per month.
• Stage 2: Take the month-by-month totals and match each 2011 month with the same month in 2010 to produce the comparison records.
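A compact sketch of the two-stage pipeline, with invented sample data: stage one reduces orders to totals per (product, month); stage two re-keys those totals by (product, month-number) so the 2010 and 2011 figures for the same month land in one group and can be compared.

```python
from collections import defaultdict

orders = [
    ("Puerh", "2010-05", 100),
    ("Puerh", "2011-05", 150),
    ("Dragonwell", "2010-05", 200),
    ("Dragonwell", "2011-05", 180),
]

# Stage 1: map each order to ((product, month), amount); reduce by summing.
stage1 = defaultdict(int)
for product, month, amount in orders:
    stage1[(product, month)] += amount

# Stage 2: re-map stage-1 output keyed by (product, month-number) so the
# 2010 and 2011 figures for the same month meet in one reduce group.
stage2 = defaultdict(dict)
for (product, month), total in stage1.items():
    year, mm = month.split("-")
    stage2[(product, mm)][year] = total

comparison = {
    key: totals["2011"] - totals["2010"]
    for key, totals in stage2.items()
    if "2010" in totals and "2011" in totals
}
```

The output of stage one is ordinary key-value data, so it pipes straight into stage two, just like a UNIX pipeline.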
2) What are key-value stores and popular key-value databases? Discuss with an
example how data is organized within a single bucket and mention ways to handle key
conflicts.
Key-Value Stores and Popular Key-Value Databases
Key-Value Stores: Key-value stores are NoSQL databases designed to store data as a pair of
keys and values. A key is a unique identifier, and the value is a blob that can store any type of
data. These stores are accessed using primary keys, offering high performance and easy
scalability.
Popular key-value databases include:
• Riak
• Redis (often called a Data Structure server)
• Memcached DB and its variants
• Berkeley DB
• HamsterDB (for embedded use)
• Amazon DynamoDB (not open-source)
• Project Voldemort (an open-source implementation of Amazon's Dynamo)
Example: Organizing Data Within a Single Bucket In key-value stores like Riak, data is
organized into buckets. Buckets act as flat namespaces for keys. For example, to store data
like user session data, shopping cart information, and user preferences, all of these can be
placed in a single bucket as a single object.
• Key: 12345
• Value: A single object containing user session, shopping cart, and user preferences
data.
This approach is simple but increases the chance of key conflicts.
Handling Key Conflicts:
1. Change the Key Design:
Append object names to keys to make them unique.
o Example:
▪ Key: 288790b8a421_userProfile
▪ Key: 288790b8a421_shoppingCart
2. Use Separate Buckets for Different Data:
Create separate buckets for different types of data, like UserProfile and ShoppingCart.
This prevents conflicts by segmenting data.
o Bucket: UserProfile
▪ Key: 12345
▪ Value: User profile data
o Bucket: ShoppingCart
▪ Key: 67890
▪ Value: Shopping cart data
By organizing data this way, you avoid key conflicts and ensure easy access to specific
objects.
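The key-design fix can be sketched in a few lines: prefixing the object name onto the session key gives each object a unique key inside one flat bucket (the helper function is illustrative, not a Riak API):

```python
bucket = {}   # one flat namespace of keys

def make_key(session_id, object_name):
    """Compose a unique key from the session id and the object's name."""
    return f"{session_id}_{object_name}"

bucket[make_key("288790b8a421", "userProfile")] = {"name": "John"}
bucket[make_key("288790b8a421", "shoppingCart")] = {"items": ["laptop"]}
```

The alternative, separate buckets per object type, amounts to using one such dictionary per type instead of encoding the type into the key.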
3)
1. Map Function
The Map function processes each record in the input data and generates intermediate key-value pairs.
• Input Data (Product, Quantity, Price):
1. (Puerh, 8, 24)
2. (Dragonwell, 12, 24)
3. (Genmaicha, 20, 80)
4. (Puerh, 5, 15)
5. (Dragonwell, 16, 48)
• Key: Product name
• Value: Total sales for the record (Quantity × Price)
Output of Map Function:
For each record, the map function generates:
• (Puerh, 8 × 24 = 192)
• (Dragonwell, 12 × 24 = 288)
• (Genmaicha, 20 × 80 = 1600)
• (Puerh, 5 × 15 = 75)
• (Dragonwell, 16 × 48 = 768)
Intermediate Key-Value Pairs:
• (Puerh, 192)
• (Dragonwell, 288)
• (Genmaicha, 1600)
• (Puerh, 75)
• (Dragonwell, 768)
2. Shuffle and Sort
The intermediate pairs are grouped by key (product name):
• (Puerh, [192, 75])
• (Dragonwell, [288, 768])
• (Genmaicha, [1600])
3. Reduce Function
The Reduce function aggregates the values for each key. Here, it sums up the total sales for each
product.
Calculations:
• For Puerh: 192 + 75 = 267
• For Dragonwell: 288 + 768 = 1056
• For Genmaicha: Total Sales = 1600
Final Output:
• (Puerh, 267)
• (Dragonwell, 1056)
• (Genmaicha, 1600)
Final Explanation:
1. The Map function transforms input data into key-value pairs where the key is the product
name and the value is the total sales (Quantity × Price).
2. The Shuffle and Sort phase groups all values for the same product together.
3. The Reduce function sums up the grouped values to calculate the total sales for each product.
Final Output:
• Puerh: 267
• Dragonwell: 1056
• Genmaicha: 1600
This approach simplifies processing large datasets by breaking it into smaller tasks.
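The walk-through above can be executed directly as a map, shuffle, and reduce over the five input records:

```python
from collections import defaultdict

records = [
    ("Puerh", 8, 24),
    ("Dragonwell", 12, 24),
    ("Genmaicha", 20, 80),
    ("Puerh", 5, 15),
    ("Dragonwell", 16, 48),
]

# Map: emit (product, quantity * price) for each record.
mapped = [(product, qty * price) for product, qty, price in records]

# Shuffle and sort: group the emitted values by key.
grouped = defaultdict(list)
for product, sales in mapped:
    grouped[product].append(sales)

# Reduce: sum the grouped values per product.
totals = {product: sum(values) for product, values in grouped.items()}
```

Running this reproduces the final output: Puerh 267, Dragonwell 1056, Genmaicha 1600.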
Key-Value Store Features
1. Consistency
• Key-value stores prioritize performance and often use an eventually consistent
model. This means changes made to data might take some time to update across all
servers.
• For example, in Riak, if two people make changes at the same time, users can either
choose "last write wins" or handle multiple conflicting values on their own.
• Settings like the replication factor (n_val) and write quorum (w) allow users to balance
consistency and performance.
2. Transactions
• Key-value stores don’t usually support multi-key or multi-document transactions
like relational databases do.
• Instead, they use a quorum model to ensure reliability. This model includes:
o N: Number of copies (replicas) of data.
o W: Number of successful writes required.
o R: Number of successful reads required.
• These configurations ensure reliable data availability.
3. Query Features
• Key-value stores allow simple lookups using a key but don’t support complex queries
like SQL databases.
• Applications need to design meaningful keys for efficient retrieval.
• Advanced features like Riak Search can add more flexibility, such as querying with
Lucene (a search engine library).
• They work best in predictable query situations, such as session storage or shopping
carts.
4. Structure of Data
• Data is stored as a blob (a single block of data), which can be any format like JSON,
XML, or plain text.
• The database doesn’t understand the data’s structure; it’s the application’s job to
process it.
• In Riak, users can specify the format of the data using a Content-Type header,
which helps during retrieval but doesn’t change how the data is stored.
5. Scaling
• Key-value stores scale horizontally by sharding, dividing data across multiple servers
based on the keys.
• Adding more servers increases capacity. However, if a server fails, data on it becomes
temporarily unavailable unless replication is used.
• Tools like replication and configurations (e.g., N, R, and W values in Riak) help
balance consistency, availability, and partition tolerance (CAP theorem).
Summary
Key-value stores are simple, scalable, and perform well for specific use cases like session
data or shopping carts. However, they require careful key design and don’t support complex
transactions or queries.
MODULE 4
1) What is a document database? Explain with an example how data is stored in it and
how it differs from an RDBMS.
What is a Document Database?
A document database is a type of NoSQL database that stores, retrieves, and manages data
as documents. These documents are typically in JSON-like format, making them flexible
and easy to use. Unlike relational databases (RDBMS), where data follows a fixed structure
(schema), document databases allow each document to have its own unique structure. This
makes them suitable for applications where the schema can change over time.
| Feature | Document Database | RDBMS |
|---|---|---|
| Data Structure | Flexible (JSON-like documents). | Fixed schema (tables). |
This shows how document databases are more flexible, while RDBMS requires strict
structure.
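The schema flexibility can be shown with two documents of different shapes in the same collection, something a fixed-schema table would reject (field names are illustrative):

```python
# Two documents, same "collection", different fields.
users = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob", "social": {"twitter": "@bob"}, "age": 30},
]

def find(collection, **criteria):
    """Match documents that contain all the given field values; missing
    fields simply fail to match rather than raising an error."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]
```

An RDBMS would need an `age` column (NULL for Alice) and a separate table or column for the social handle; the document model just stores whatever each document has.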
2) List and explain the suitable use cases for document databases.
Suitable Use Cases
1. Event Logging
o Many applications generate different types of event logs.
o Document databases are a great fit to store all these logs in one central place,
especially when the structure of the logs changes frequently.
o Events can be organized based on their type (e.g., order_processed,
customer_logged) or the application name where the event occurred.
2. Content Management Systems and Blogging Platforms
o Document databases handle content like web pages, user comments, or
profiles effectively because they have no fixed structure (schema).
o These databases work well with JSON, making them ideal for websites, user
registrations, and publishing platforms.
3. Web Analytics or Real-Time Analytics
o Document databases are suitable for storing data like page views or unique
visitors in real-time.
o New metrics can be added easily without requiring changes to the database
structure.
4. E-Commerce Applications
o E-commerce platforms need flexible data models for products, orders, and
customer details.
o Document databases allow evolving these models without the need for
expensive changes to the database structure.
3) Explain the differences in query handling between MongoDB and RDBMS with
examples.
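The core difference is that an RDBMS takes a declarative SQL string while MongoDB takes a query document of field-operator pairs. A sketch of the same query in both styles, with the MongoDB-style filter evaluated by a small hypothetical interpreter for illustration:

```python
#   RDBMS:   SELECT * FROM orders WHERE status = 'shipped' AND total > 100
#   MongoDB: db.orders.find({"status": "shipped", "total": {"$gt": 100}})

def matches(doc, query):
    """Evaluate a (tiny subset of a) MongoDB-style query document."""
    for field, cond in query.items():
        if isinstance(cond, dict):                # operator form: {"$gt": 100}
            if "$gt" in cond and not doc.get(field, 0) > cond["$gt"]:
                return False
        elif doc.get(field) != cond:              # plain equality form
            return False
    return True

orders = [
    {"status": "shipped", "total": 150},
    {"status": "shipped", "total": 80},
    {"status": "pending", "total": 300},
]
query = {"status": "shipped", "total": {"$gt": 100}}
results = [o for o in orders if matches(o, query)]
```

Because the query is itself a document, it composes naturally with the nested JSON-like data MongoDB stores, whereas SQL queries target flat columns and express nesting through joins.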
4) What is a replica set? How does replication work in MongoDB? What are the
alternatives to MongoDB?
What is a Replica Set?
A replica set in MongoDB is a group of servers that keep copies of the same data.
• One server is the primary node, which handles all the write operations.
• Other servers are secondary nodes, which copy the data from the primary node.
These secondary nodes can handle read requests if configured. This setup helps to
manage a large number of read requests and provides backup in case the primary node
fails.
5) List and explain applications where graph databases are suitable and not suitable.
Applications Where Graph Databases Are Suitable and Not Suitable
5.3.1 Connected Data
Graph databases are great for managing connected data, which involves multiple entities
that are interconnected.
• Example: In social networks, graph databases are perfect for representing users and
their relationships (such as friends, family, coworkers, etc.).
• They can also represent complex relationships in other areas like employees and
projects they have worked on, knowledge connections, and more.
• Why Suitable: Graph databases excel in these situations because they can quickly
traverse relationships and handle complex interconnected data across multiple
domains, like social, spatial, and commerce.
5.3.2 Routing, Dispatch, and Location-Based Services
Graph databases are useful for routing and location-based services:
• Example: When planning delivery routes, each location or address can be
represented as a node. The relationships between these nodes can store information
such as distance, helping to find the most efficient route.
• They are also useful for applications that provide recommendations based on
location, like suggesting nearby restaurants or entertainment options when a user
is in a specific area.
• Why Suitable: Graph databases can easily handle the connections between various
locations and help find optimal paths or provide location-based recommendations.
5.3.3 Recommendation Engines
Graph databases are ideal for recommendation engines, where recommendations are made
based on the relationships between users, products, or items:
• Example: A recommendation system might suggest, “your friends also bought this
product,” or “people who visited this place also visited that place.”
• Why Suitable: As the data grows, graph databases efficiently handle the increasing
relationships between nodes, making it easier to suggest relevant products, services,
or locations based on existing patterns.
5.4 When Not to Use
Graph databases are not suitable in some cases:
• Bulk Updates: When you need to update all or a subset of entities at once (like
changing properties across many nodes), graph databases are not ideal because
updating many nodes at once can be complex and inefficient.
• Global graph operations: For global graph operations that involve processing the
entire graph, some graph databases may struggle with large-scale data.
• Why Not Suitable: Graph databases are designed for handling relationships, but they
may not be efficient for bulk updates or large-scale computations involving the entire
dataset.
Summary
• Suitable: Graph databases are perfect for managing connected data (like social
networks), location-based services (like route planning), and recommendation engines
(like suggesting products).
• Not Suitable: They are not the best choice for situations that require bulk updates or
large-scale computations on the entire dataset, such as in analytics applications.
6) With an example graph structure, discuss how relationships are handled in a graph
database compared to an RDBMS
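A minimal sketch of the difference: in a graph database, relationships are first-class links traversed hop by hop, whereas an RDBMS would store the same edges in a `friendships` table and need a self-join per hop. The data below is invented for illustration:

```python
# Adjacency-list view of a small social graph: node -> directly linked nodes.
graph = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": [],
    "Dave": [],
}

def friends_of_friends(person):
    """Two-hop traversal -- the kind of query graph databases make cheap.
    In SQL this is a self-join of the friendships table with itself; each
    additional hop adds another join."""
    direct = set(graph.get(person, []))
    result = set()
    for friend in direct:
        result.update(graph.get(friend, []))
    return result - direct - {person}
```

For deep or variable-length traversals (friends of friends of friends, shortest paths), the graph model keeps each hop a constant-time pointer follow, while the relational version grows a join per hop.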