BDA Module-3
1. CAP THEOREM
The CAP Theorem states that it is impossible for a distributed data system to simultaneously provide
Consistency (C), Availability (A), and Partition Tolerance (P). A distributed system can only guarantee at most
two out of these three properties. This theorem is critical in understanding trade-offs in distributed database design.
In practice, a given application, service, or process is designed around two of C, A, and P.
• Consistency means all copies hold the same value, as in a traditional database.
• Availability means at least one copy remains available when a partition becomes inactive or fails.
• Partition means parts of the system that are active but may not cooperate (share data), as in distributed databases.
1. Consistency in a distributed database:
a. All nodes observe the same data at the same time.
b. Operations in one partition of the database should reflect in other related partitions in a distributed
database.
c. Example: Operations that change the sales data from a specific showroom in a table should also
reflect in related tables using that sales data.
2. Availability:
a. During transactions, field values must be available in other partitions of the database.
b. Every request receives a response indicating success or failure.
c. Replication ensures availability.
3. Partition:
a. Division of a large database into different databases without affecting the operations on them by
adopting specified procedures.
b. Partition Tolerance: Refers to the continuation of operations as a whole even in case of message
loss, node failure, or a node not being reachable.
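The trade-off described above can be sketched in a few lines. This is a toy model, not how any particular database implements it: the names Replica, write_cp, and write_ap are hypothetical, and a "partition" is simulated by a simple reachable flag.

```python
class Replica:
    """One copy of a single value, held on one node (illustrative only)."""
    def __init__(self, value):
        self.value = value
        self.reachable = True  # False simulates a network partition


class PartitionError(Exception):
    """Raised when a CP-style write cannot reach every replica."""


def write_cp(replicas, value):
    # CP choice: refuse the write unless every replica can be updated,
    # so no two copies ever disagree (consistency over availability).
    if not all(r.reachable for r in replicas):
        raise PartitionError("write rejected: a replica is unreachable")
    for r in replicas:
        r.value = value


def write_ap(replicas, value):
    # AP choice: accept the write on the reachable replicas only;
    # unreachable copies stay stale (availability over consistency).
    for r in replicas:
        if r.reachable:
            r.value = value


a, b = Replica(10), Replica(10)
b.reachable = False              # partition: node b is cut off

write_ap([a, b], 20)
ap_values = (a.value, b.value)   # the two copies now differ

try:
    write_cp([a, b], 30)
    cp_accepted = True
except PartitionError:
    cp_accepted = False          # the request fails, but no copy diverges further
```

Under the partition, the AP-style write succeeds but leaves the copies inconsistent, while the CP-style write stays consistent by rejecting the request.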
o A document store keeps data as key-value pairs, but the values are called Documents.
o Documents are complex data structures (e.g., JSON, XML, text, arrays).
o Nested documents are commonly used.
• Applications: Ideal for managing semi-structured data such as JSON files.
• Advantages:
o Suitable for handling unstructured and semi-structured data.
o Easy storage, retrieval, and management of documents.
• Examples: MongoDB, CouchDB
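As a rough illustration of the document model, the sketch below stores one nested JSON document under a key and reads it back. It uses only Python's standard json module and a plain dict as the "store"; the showroom fields and the sale-001 key are made up for the example and do not reflect the API of MongoDB or CouchDB.

```python
import json

# Hypothetical showroom sales document; every field name is illustrative.
sale = {
    "_id": "sale-001",
    "showroom": {"city": "Pune", "code": "SR-12"},   # nested document
    "items": [                                       # array of documents
        {"model": "A1", "qty": 2},
        {"model": "B7", "qty": 1},
    ],
}

# Key-value view: the key is the id, the value is the whole document.
store = {sale["_id"]: json.dumps(sale)}

# Retrieval parses the document back and navigates the nested fields.
doc = json.loads(store["sale-001"])
city = doc["showroom"]["city"]
total_qty = sum(item["qty"] for item in doc["items"])
```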
Features/ Characteristics:
1. Schema Flexibility: NoSQL databases do not require a fixed schema, allowing for dynamic addition and
modification of fields. This makes them suitable for handling unstructured and semi-structured data.
2. Scalability: Supports horizontal scaling, meaning data can be distributed across multiple servers (nodes)
to handle large-scale operations efficiently.
3. Auto Sharding: Automatically partitions large datasets across multiple servers, ensuring efficient load
distribution and improved query performance.
4. Replication: Provides data redundancy by replicating data across multiple nodes, ensuring high
availability and fault tolerance.
5. Integrated Caching: Built-in caching mechanisms reduce latency and improve data retrieval speed.
6. High Performance: Optimized for fast read and write operations, making them ideal for real-time
analytics and applications.
7. Distributed Architecture: Operates on distributed systems, making it possible to store and process
massive amounts of data across geographically dispersed nodes.
8. Handles Big Data: Designed to store and manage large volumes of data, including unstructured and
semi-structured formats like JSON, XML, and binary data (BLOBs).
9. Semi-Structured Data Support: Can handle irregular or flexible data formats, making them versatile for
modern applications that deal with dynamic data.
10. CAP Theorem Compliance: NoSQL databases prioritize two of the three properties (Consistency,
Availability, Partition Tolerance), depending on the use case, as per Brewer's CAP theorem.
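Sharding and replication (features 3 and 4 above) can be sketched together. The example assumes a simple hash-modulo placement with the replica on the next node in ring order; real systems often use other schemes such as consistent hashing or range-based sharding. The node names and the helpers shard_for, nodes_for, put, and get are hypothetical.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]   # hypothetical server names
REPLICAS = 2                             # copies kept of every key


def shard_for(key):
    # Stable hash so the same key always maps to the same primary node.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % len(NODES)


def nodes_for(key):
    # The primary shard plus the next node in ring order hold the copies.
    first = shard_for(key)
    return [NODES[(first + i) % len(NODES)] for i in range(REPLICAS)]


cluster = {name: {} for name in NODES}


def put(key, value):
    # Write the value to every node responsible for this key.
    for name in nodes_for(key):
        cluster[name][key] = value


def get(key, down=()):
    # Read from the first responsible node that is not marked as down.
    for name in nodes_for(key):
        if name not in down:
            return cluster[name][key]
    raise KeyError(key)


put("user:42", {"name": "Asha"})
primary = nodes_for("user:42")[0]
value_when_primary_down = get("user:42", down={primary})
```

Because each key lives on two nodes, the read still succeeds when the primary is unavailable, which is the availability benefit that replication is meant to provide.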
1. Relaxation of ACID Properties: NoSQL databases often relax one or more ACID properties (Atomicity,
Consistency, Isolation, Durability) to enhance scalability and performance.
2. CAP Theorem: NoSQL databases are characterized by two out of three CAP properties: Consistency,
Availability, and Partition Tolerance, depending on the use case.
3. BASE Model: Transactions in NoSQL follow the BASE properties: Basically Available, Soft state, and
Eventual consistency, emphasizing availability and scalability over strict consistency.
4. Atomicity in Operations: While multi-document or multi-collection transactions are limited, atomicity
is often maintained within a single document or key-value pair.
5. Consistency: Transactions ensure eventual consistency, meaning all data replicas will synchronize over
time, but immediate consistency across all nodes is not guaranteed.
6. Isolation: Transactions are isolated from one another, ensuring that incomplete operations do not interfere
with others in the system.
7. Durability: Changes made during transactions are durable and persist even in the event of a system failure,
though the mechanism might vary from SQL databases.
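The "soft state" and "eventual consistency" parts of BASE can be illustrated with two in-memory replicas. This is a minimal sketch under stated assumptions: conflicts are resolved by a version number (last-write-wins), and the sync function stands in for whatever background replication a real system runs. The Replica class and sync helper are hypothetical names.

```python
class Replica:
    def __init__(self):
        self.data = {}      # key -> (version, value)

    def write(self, key, value, version):
        # Keep only the newest version (last-write-wins, an assumption).
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(source, target):
    # Replication step: push every entry the target lacks or holds stale.
    for key, (version, value) in source.data.items():
        target.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("stock", 5, version=1)     # the write is acknowledged by r1 only

stale_read = r2.read("stock")       # soft state: r2 has not seen it yet
sync(r1, r2)                        # background replication catches up
fresh_read = r2.read("stock")
```

A read from r2 before the sync returns nothing, and the same read afterwards returns the written value: the replicas converge over time rather than immediately.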
1. High and Easy Scalability: Horizontal scalability allows adding new nodes to expand capacity, making
it suitable for terabytes and petabytes of data.
2. Replication: Data is replicated across multiple nodes, ensuring high availability, fault tolerance, and
reliability.
3. Distributed Shards: Data is partitioned into shards and distributed across clusters, improving
performance and throughput.
4. Cost-Effectiveness: Many NoSQL databases are open source and run on commodity hardware,
reducing implementation and operational costs.
5. Schema-Less Data Model: No predefined schema is required, allowing flexibility in storing and
managing unstructured or semi-structured data.
6. Integrated Caching: Built-in in-memory caching improves performance and reduces the need for the
separate caching infrastructure often deployed alongside traditional SQL systems.
7. Flexibility: Unlike rigid SQL databases, NoSQL solutions are highly flexible, supporting various data
formats and structures without stringent constraints.
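The schema-less model mentioned above can be pictured with a plain Python list standing in for a collection. This is only a sketch: the customers collection and its field names are invented for illustration, and no real database API is used.

```python
# One collection, no declared schema: each record carries its own fields.
customers = []

customers.append({"id": 1, "name": "Ravi"})
customers.append({"id": 2, "name": "Meera", "email": "meera@example.com"})
customers.append({"id": 3, "name": "Jon", "tags": ["vip"], "address": {"city": "Goa"}})

# Queries must tolerate missing fields, for example with dict.get().
with_email = [c["name"] for c in customers if c.get("email")]

# Adding a field later touches only the records that need it.
customers[0]["email"] = "ravi@example.com"
all_fields = sorted({field for c in customers for field in c})
```

The flexibility has a cost that is worth noting: because no schema enforces which fields exist, the application code has to handle records where a field is absent.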