
BDA Module-3

The document discusses the CAP Theorem, which states that a distributed data system cannot simultaneously provide Consistency, Availability, and Partition Tolerance, emphasizing the trade-offs in database design. It also covers NoSQL database architecture, highlighting its flexibility, scalability, and various data models, including key-value, document, column, and graph databases. Additionally, it contrasts NoSQL with traditional SQL databases and outlines strategies for handling big data challenges, such as even distribution, replication, and efficient query execution.

Uploaded by

laxmishetti1
Copyright
© All Rights Reserved

BDA MODULE 3

1. CAP THEOREM

The CAP Theorem states that a distributed data system cannot simultaneously provide Consistency (C),
Availability (A), and Partition Tolerance (P). A distributed system can guarantee at most two of these three
properties, which makes the theorem central to understanding trade-offs in distributed database design.
In practice, a given application, service, or process is designed around two of C, A, and P.
• Consistency means all copies hold the same value, as in a traditional DB.
• Availability means at least one copy can still answer a request when a partition occurs or a node fails.
• Partition means the system is split into parts that are still running but may not be able to cooperate
(share data) with each other, as can happen in distributed DBs.
1. Consistency in distributed database:
a. All nodes observe the same data at the same time.
b. Operations in one partition of the database should reflect in other related partitions in a distributed
database.
c. Example: Operations that change the sales data from a specific showroom in a table should also
reflect in related tables using that sales data.
2. Availability:
a. During transactions, field values must be available in other partitions of the database.
b. Each request receives a response on success or failure.
c. Replication ensures availability.
3. Partition:
a. Division of a large database into different databases without affecting the operations on them by
adopting specified procedures.
b. Partition Tolerance: Refers to the continuation of operations as a whole even in case of message
loss, node failure, or a node not being reachable.

Brewer's CAP Theorem


The CAP theorem demonstrates that no distributed system can guarantee Consistency (C), Availability (A),
and Partition Tolerance (P) all together.
1. Consistency: All nodes observe the same data at the same time.
2. Availability: Each request receives a response on success or failure.
3. Partition Tolerance: The system continues to operate as a whole even in case of message loss, node
failure, or a node not being reachable.
In case of a network failure, the choice is between:
• The database must answer, even though the answer may be old or wrong data (AP).
• The database should not answer unless it has received the latest copy of the data (CP).
The CAP theorem implies that, for a system experiencing a network partition, consistency and availability
are mutually exclusive choices. The three possible pairings are:
• CA: Consistency and Availability.
• AP: Availability and Partition Tolerance.
• CP: Consistency and Partition Tolerance.
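
The AP versus CP choice described above can be illustrated with a minimal Python sketch. The `Replica` class, its `mode` flag, and the values "v1" and "v2" are hypothetical names chosen for this example only; a real database makes this decision through its replication protocol, not a single flag.

```python
class Replica:
    """One copy of a single value, with a flag marking a network partition."""

    def __init__(self, mode):
        self.mode = mode          # "AP" or "CP" (illustrative labels)
        self.value = "v1"         # last value this replica received
        self.partitioned = False  # True when cut off from the other replicas

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data.
            raise RuntimeError("unavailable: cannot confirm latest value")
        # AP (or no partition): always answer with the local copy.
        return self.value


ap, cp = Replica("AP"), Replica("CP")
# Assume the network splits while a newer value "v2" is written elsewhere.
ap.partitioned = cp.partitioned = True

print(ap.read())  # the AP replica answers, possibly with old data
try:
    cp.read()
except RuntimeError as err:
    print(err)    # the CP replica gives up availability instead
```

The AP replica stays available at the cost of possibly serving old data, while the CP replica protects consistency by not answering.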

2. NOSQL DATA ARCHITECTURE PATTERNS, CHARACTERISTICS, TRANSACTIONS AND SOLUTIONS

Definition: NoSQL (Not Only SQL) databases are non-relational databases designed to handle large-scale data
with flexibility, scalability, and high performance. They support diverse data models like key-value, document,
column-family, and graph, making them ideal for semi-structured or unstructured data in modern applications.
NoSQL databases are used to manage and store big data efficiently and flexibly. They organize data into logical
patterns that help in storing, retrieving, and managing data effectively. The following are the primary data
architecture patterns in NoSQL:

i. Key-Value Store Database


This is one of the simplest models in NoSQL databases. Data is stored as Key-Value Pairs, where:
• Key: A unique identifier (strings, integers, or characters).
• Value: Linked to the key and can be of any data type (e.g., JSON, BLOB, strings).
Typically uses hash tables to store key-value pairs.

• Applications: Commonly used in shopping websites or e-commerce platforms.


• Advantages:
o Can handle large data volumes and heavy loads.
o Fast and easy retrieval of data using keys.
• Examples: DynamoDB, Berkeley DB
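
As a rough illustration of the key-value model, the sketch below wraps a Python dict (itself a hash table) in a small class. `KeyValueStore` and the keys "cart:1001" and "session:42" are assumptions made for this example and do not correspond to the API of DynamoDB or Berkeley DB.

```python
import json


class KeyValueStore:
    """Tiny in-memory key-value store backed by a hash table (a dict)."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        # The store does not interpret the value; it can be any type.
        self._table[key] = value

    def get(self, key, default=None):
        return self._table.get(key, default)

    def delete(self, key):
        self._table.pop(key, None)


store = KeyValueStore()
# A shopping cart stored as a JSON string, and a session stored as raw bytes.
store.put("cart:1001", json.dumps({"items": ["pen", "notebook"], "total": 120}))
store.put("session:42", b"\x00\x01binary-blob")

cart = json.loads(store.get("cart:1001"))
print(cart["items"])
```

Retrieval needs only the key, which is why lookups stay fast as the number of pairs grows.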

ii. Column Store Database


o Stores data in individual cells grouped into columns.
o Unlike relational databases, data is stored column-wise rather than row-wise.
o Columns can differ in format and titles across rows.
• Applications: Suitable for analytical operations like SUM, AVERAGE, and COUNT.
• Advantages:
o Data is readily available for column-specific queries.
o Optimized for queries on large datasets.
• Examples: HBase, Bigtable by Google, Cassandra
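
The benefit of column-wise storage for aggregates can be seen in a small Python sketch. The sales figures and field names below are made up for illustration; real column stores such as HBase or Cassandra use their own on-disk formats.

```python
# Row-wise layout: each record keeps all of its fields together.
rows = [
    {"showroom": "A", "units": 3, "revenue": 300},
    {"showroom": "B", "units": 5, "revenue": 450},
    {"showroom": "A", "units": 2, "revenue": 250},
]

# Column-wise layout: one list per column, values kept together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate touches only the one column it needs.
total = sum(columns["revenue"])
count = len(columns["revenue"])
average = total / count

print(total, count, round(average, 2))
```

With the column layout, SUM, AVERAGE, and COUNT over "revenue" never read the "showroom" or "units" values.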

iii. Document Database

o Stores data as key-value pairs, but the values are called Documents.
o Documents are complex data structures (e.g., JSON, XML, text, arrays).
o Nested documents are commonly used.
• Applications: Ideal for managing semi-structured data such as JSON files.
• Advantages:
o Suitable for handling unstructured and semi-structured data.
o Easy storage, retrieval, and management of documents.
• Examples: MongoDB, CouchDB
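
A minimal sketch of the document model follows, using plain Python dicts as documents. The helper names `get_path` and `find` and the sample users are hypothetical; they only imitate the idea of querying nested fields and are not the query API of MongoDB or CouchDB.

```python
def get_path(doc, path):
    """Follow a dotted path such as 'address.city' into nested dicts."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, path, expected):
    """Return every document whose value at `path` equals `expected`."""
    return [doc for doc in collection.values() if get_path(doc, path) == expected]


# Key -> document. The two documents do not share the same fields.
collection = {
    "u1": {"name": "Asha", "address": {"city": "Pune"}, "tags": ["prime"]},
    "u2": {"name": "Ravi", "address": {"city": "Mysuru"}},  # no "tags" field
}

matches = find(collection, "address.city", "Pune")
print([doc["name"] for doc in matches])
```

Because each document carries its own structure, the second user can omit "tags" without any schema change.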

iv. Graph Database


o Stores data in the form of graphs, where:
▪ Nodes: Represent entities or objects.
▪ Edges: Represent relationships between nodes and are uniquely identified.
• Applications: Useful for handling data with complex relationships, such as social networks.
• Advantages:
o Fast data traversal due to the connected structure.
o Suitable for managing spatial data.
• Examples: Neo4j, FlockDB (used by Twitter)
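
The node and edge structure can be sketched in Python with an adjacency list and a breadth-first traversal. The edge identifiers, the "FOLLOWS" relationship, and the `reachable` function are assumptions for this example, not the data model of any particular graph database.

```python
from collections import deque

# Each edge has its own identifier: edge id -> (source, relationship, target).
edges = {
    "e1": ("alice", "FOLLOWS", "bob"),
    "e2": ("bob", "FOLLOWS", "carol"),
    "e3": ("alice", "FOLLOWS", "dave"),
}

# Build an adjacency list so traversal follows connections directly.
adjacency = {}
for source, _relationship, target in edges.values():
    adjacency.setdefault(source, []).append(target)


def reachable(start, max_hops):
    """Breadth-first search: nodes reachable from `start` within `max_hops` edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    seen.discard(start)
    return seen


print(sorted(reachable("alice", 1)))
print(sorted(reachable("alice", 2)))
```

A question such as "who is within two hops of alice" is answered by walking edges, with no join across tables.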

Features/ Characteristics:

1. Schema Flexibility: NoSQL databases do not require a fixed schema, allowing for dynamic addition and
modification of fields. This makes them suitable for handling unstructured and semi-structured data.
2. Scalability: Supports horizontal scaling, meaning data can be distributed across multiple servers (nodes)
to handle large-scale operations efficiently.
3. Auto Sharding: Automatically partitions large datasets across multiple servers, ensuring efficient load
distribution and improved query performance.
4. Replication: Provides data redundancy by replicating data across multiple nodes, ensuring high
availability and fault tolerance.
5. Integrated Caching: Built-in caching mechanisms reduce latency and improve data retrieval speed.
6. High Performance: Optimized for fast read and write operations, making them ideal for real-time
analytics and applications.
7. Distributed Architecture: Operates on distributed systems, making it possible to store and process
massive amounts of data across geographically dispersed nodes.
8. Handles Big Data: Designed to store and manage large volumes of data, including unstructured and semi-
structured formats like JSON, XML, and binary data (BLOBs).
9. Semi-Structured Data Support: Can handle irregular or flexible data formats, making them versatile for
modern applications that deal with dynamic data.
10. CAP Theorem Trade-offs: NoSQL databases prioritize two of Consistency, Availability, and Partition
Tolerance, depending on the use case, as per Brewer's CAP theorem.

Big Data NoSQL Transactions


NoSQL transactions differ significantly from traditional SQL-based systems, as they prioritize scalability and
flexibility over strict ACID compliance. Key features include:

1. Relaxation of ACID Properties: NoSQL databases often relax one or more ACID properties (Atomicity,
Consistency, Isolation, Durability) to enhance scalability and performance.
2. CAP Theorem: NoSQL databases are characterized by two out of three CAP properties: Consistency,
Availability, and Partition Tolerance, depending on the use case.
3. BASE Model: Transactions in NoSQL follow the BASE properties: Basically Available, Soft state, and
Eventual consistency, emphasizing availability and scalability over strict consistency.
4. Atomicity in Operations: While multi-document or multi-collection transactions are limited, atomicity
is often maintained within a single document or key-value pair.
5. Consistency: Transactions ensure eventual consistency, meaning all data replicas will synchronize over
time, but immediate consistency across all nodes is not guaranteed.
6. Isolation: Transactions are isolated from one another, ensuring that incomplete operations do not interfere
with others in the system.
7. Durability: Changes made during transactions are durable and persist even in the event of a system failure,
though the mechanism might vary from SQL databases.
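
Eventual consistency, as described in the BASE model above, can be sketched as follows. `EventuallyConsistentStore`, its `sync` method, and the key "stock:pen" are hypothetical names for this illustration; real systems propagate updates in the background rather than through an explicit call.

```python
class EventuallyConsistentStore:
    """Replicas that receive a write at different times."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # updates not yet applied to the other replicas

    def write(self, key, value):
        # Replica 0 accepts the write immediately; the rest catch up later.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica):
        return self.replicas[replica].get(key)

    def sync(self):
        # Stand-in for background replication: apply queued updates in order.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore(replica_count=3)
store.write("stock:pen", 10)

before = store.read("stock:pen", replica=2)  # replica 2 has not seen the write yet
store.sync()
after = store.read("stock:pen", replica=2)   # all replicas now agree

print(before, after)
```

Between the write and the sync, a reader can observe an old state (soft state); after synchronization every replica returns the same value.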

Big Data NoSQL Solutions


NoSQL databases offer scalable and cost-effective solutions to handle the challenges of big data. Key
characteristics include:

1. High and Easy Scalability: Horizontal scalability allows adding new nodes to expand capacity, making
it suitable for terabytes and petabytes of data.
2. Replication: Data is replicated across multiple nodes, ensuring high availability, fault tolerance, and
reliability.
3. Distributed Shards: Data is partitioned into shards and distributed across clusters, improving
performance and throughput.
4. Cost-Effectiveness: NoSQL databases use inexpensive, open-source tools and commodity hardware,
reducing implementation and operational costs.
5. Schema-Less Data Model: No predefined schema is required, allowing flexibility in storing and
managing unstructured or semi-structured data.
6. Integrated Caching: Built-in in-memory caching improves performance, reducing the need for the
separate caching infrastructure often used with traditional SQL systems.
7. Flexibility: Unlike rigid SQL databases, NoSQL solutions are highly flexible, supporting various data
formats and structures without stringent constraints.

3. SHARED NOTHING ARCHITECTURE FOR BIG DATA TASKS


The Shared Nothing Architecture (SN) is a scalable and distributed design model used in NoSQL databases and
big data systems. It ensures no single point of contention by decentralizing the system components. Each node
operates independently, with its own memory and disk storage, making it highly suitable for big data tasks.
Here are the key distribution models under SN architecture:

i. Single Server Model


• Simplest distribution option for NoSQL data stores.
• Entire application runs sequentially on a single server.
• Suitable for graph databases, which process relationships between nodes on one server.
• Efficient for small-scale applications but lacks scalability for larger datasets.

ii. Sharding Very Large Databases

• Sharding divides the database into smaller, more manageable pieces (shards), which are distributed
across multiple nodes.
• Provides horizontal scalability, as shards allow the addition of nodes to the cluster without
reconfiguring the application.
• Applications can process shards in parallel, improving performance.
• If a node or shard fails, the system can migrate the affected shard to another node, ensuring continuity.
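
One simple way to route records to shards is to hash the key and take the result modulo the number of shards, as in the Python sketch below. The function `shard_for`, the four in-memory shards, and the "order:N" keys are assumptions for illustration. Plain modulo placement moves most keys when the shard count changes, which is the problem that consistent hashing (section 5) is meant to reduce.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Four shards, each standing in for a separate node with its own storage.
shards = [{} for _ in range(4)]


def put(key, value):
    shards[shard_for(key, len(shards))][key] = value


def get(key):
    # The same hash sends a read to the shard that received the write.
    return shards[shard_for(key, len(shards))].get(key)


for i in range(1000):
    put(f"order:{i}", i)

sizes = [len(shard) for shard in shards]
print(sizes)
```

Each shard holds only part of the data, so the shards can be processed in parallel.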

iii. Master-Slave Distribution Model


• In this model, one node serves as the master (primary) while others act as slaves (secondary).
• The master node directs operations and updates slave nodes.
• Slave nodes handle read operations, while the master handles write operations and updates.
• Advantages:
o Improved read performance, because read requests on large datasets are spread across the slave nodes.
o Data redundancy ensures fault tolerance.
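
The read/write split of this model can be sketched in Python. `MasterSlaveCluster` is a hypothetical class; it copies every write to the slaves synchronously and serves reads in round-robin order, both of which are simplifying assumptions for this example.

```python
import itertools


class MasterSlaveCluster:
    """Writes go to the master; reads are served by the slaves in turn."""

    def __init__(self, slave_count):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        self._next_slave = itertools.cycle(range(slave_count))

    def write(self, key, value):
        self.master[key] = value
        # Simplification: the master updates every slave immediately.
        for slave in self.slaves:
            slave[key] = value

    def read(self, key):
        # Round-robin spreads the read load across the slaves.
        index = next(self._next_slave)
        return index, self.slaves[index].get(key)


cluster = MasterSlaveCluster(slave_count=2)
cluster.write("price:pen", 12)

served_by = [cluster.read("price:pen") for _ in range(4)]
print(served_by)
```

Every slave holds a full copy, so losing one slave does not lose data, and reads alternate between the remaining copies.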

iv. Peer-to-Peer Distribution Model


• All nodes (peers) function equally, removing the concept of master-slave hierarchy.
• Characteristics of the model:
o All nodes handle read requests and provide responses.
o Replication ensures all nodes have updated data, enhancing consistency.
o Node failures do not disrupt write capabilities.
• Widely used by databases like Cassandra, where data is distributed across the cluster.
• Adding nodes increases system performance and scalability.

4. NOSQL V/S SQL (RDBMS)

| Feature | RDBMS | NoSQL |
|---|---|---|
| Data Model | Tabular (rows and columns) | Document, Key-value, Graph, Wide-column |
| Schema | Structured (predefined schema) | Flexible (schema-less or dynamic) |
| ACID Properties | Strong support for ACID (Atomicity, Consistency, Isolation, Durability) | May or may not support ACID |
| Scalability | Vertical scaling (adding more powerful hardware) | Horizontal scaling (adding more nodes) |
| Data Types | Primarily structured data (numbers, text, dates) | Supports various data types, including unstructured and semi-structured data |
| Use Cases | Complex transactions, data warehousing, OLTP systems | Big data, high-velocity data, content management, mobile applications |
| Examples | MySQL, PostgreSQL, Oracle Database, SQL Server | MongoDB, Cassandra, Redis, Neo4j |

5. HANDLING BIG DATA PROBLEMS.


Big Data systems deal with vast volumes of structured, semi-structured, and unstructured data. Efficient
handling of these problems requires distributed and scalable solutions. Here are four key ways to address Big
Data challenges:

i. Even Distribution Using Hash Rings (Consistent Hashing)


• Consistent Hashing: A technique to distribute data evenly across a cluster using a hashing algorithm.
• How It Works: A hashing algorithm computes a hash value from a dataset's identifier (Collection_ID).
The hash acts as a pointer to a position on the ring, so client nodes can locate data within the cluster
using only the Collection_ID hash.
• Hash Ring: A circular map of hash values used to assign datasets consistently to specific processors or
nodes.
• Benefits: Ensures balanced data distribution and prevents overloading any single node, improving
efficiency and scalability.
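
A hash ring can be sketched in Python with a sorted list of hash points and a binary search. The `HashRing` class, the node names, the use of SHA-256, and the choice of 100 virtual points per node are all assumptions for this example rather than the scheme of any specific database.

```python
import bisect
import hashlib


def ring_hash(text):
    """Stable hash that places a string at a position on the ring."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node owns several points on the ring to even out the load.
        self._ring = sorted(
            (ring_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _node in self._ring]

    def node_for(self, key):
        # Walk clockwise to the first point after the key's hash, wrapping around.
        index = bisect.bisect_right(self._points, ring_hash(key)) % len(self._ring)
        return self._ring[index][1]


keys = [f"collection:{i}" for i in range(2000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Count the keys whose owner changed after node-d joined the cluster.
moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
print(moved, "of", len(keys), "keys moved")
```

When a node joins, only the keys that now fall on the new node's points move; the other keys stay on the node they already had.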

ii. Replication for Horizontal Scaling and Fault Tolerance


• Replication: Creating real-time backup copies of data across multiple nodes.
• Enables horizontal scaling by distributing client read-requests to multiple nodes.
• Advantages:
o Fault-tolerant data retrieval in distributed environments.
o Improved performance as client requests are handled by replicated nodes.

iii. Moving Queries to Data Instead of Data to Queries


• In distributed environments, moving data-intensive queries to the nodes where data resides reduces
network overhead.
• This approach is efficient, especially when using cloud services or large-scale databases.
• Advantages:
o Faster query execution.
o Reduced data transfer costs.
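
The difference between the two approaches can be made concrete with a small Python sketch that counts how many values cross the network. The node names, the sample values, and both function names are hypothetical and chosen for this example.

```python
# Values stored on each node (made-up data).
node_data = {
    "node-1": [120, 80, 300],
    "node-2": [50, 75],
    "node-3": [400, 10, 20, 30],
}


def ship_data_to_query(nodes):
    """Pull every value to the client, then aggregate there."""
    transferred = [value for values in nodes.values() for value in values]
    return sum(transferred), len(transferred)


def ship_query_to_data(nodes):
    """Run the aggregate on each node and send back one partial sum per node."""
    partials = [sum(values) for values in nodes.values()]
    return sum(partials), len(partials)


total_a, sent_a = ship_data_to_query(node_data)
total_b, sent_b = ship_query_to_data(node_data)

print(total_a, sent_a)  # same total, every value transferred
print(total_b, sent_b)  # same total, one value per node transferred
```

Both approaches give the same answer, but moving the query sends one number per node instead of every stored value.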

iv. Query Distribution Across Multiple Nodes


• Query Analyzers: Analyze client queries and distribute them evenly across data nodes or replica nodes.
• Parallel Query Execution: Queries are executed simultaneously across multiple nodes, enhancing
performance.
• Benefits:
o Efficient resource utilization.
o Reduced query response time.
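
Parallel query execution can be sketched in Python with a thread pool in which each worker stands in for one data node. The shard contents, `run_on_shard`, and `distributed_query` are assumptions for this illustration; in a real cluster the work runs on separate machines rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Three shards of a sales table (made-up rows).
shards = [
    [{"region": "south", "amount": 40}, {"region": "north", "amount": 10}],
    [{"region": "south", "amount": 25}],
    [{"region": "north", "amount": 5}, {"region": "south", "amount": 35}],
]


def run_on_shard(shard, region):
    """The part of the query that one node runs on its own data."""
    return sum(row["amount"] for row in shard if row["region"] == region)


def distributed_query(region):
    """Send the same query to every shard at once and combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: run_on_shard(shard, region), shards))
    return sum(partials)


south_total = distributed_query("south")
print(south_total)
```

Each shard is scanned independently, so the response time depends on the slowest shard rather than on the size of the whole dataset.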
