
Data Engineering Unit 3

HBase Distributed Storage Architecture

 Master-Worker Pattern: HBase follows a master-worker architecture. The Master node coordinates
cluster-wide tasks such as region assignment, while Region Servers each manage a specific subset of the
data (regions).

 Regions and Row Keys: Each region in HBase stores an ordered set of rows, identified by unique
row keys. When the data in a region grows beyond a configured threshold, the region is split into two
daughter regions, each covering a contiguous half of the key range.
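The splitting behavior above can be sketched in a few lines. This is a toy illustration of the idea (sorted row keys, split at the midpoint once a threshold is exceeded), not actual HBase internals; the threshold and function names are invented for the example.

```python
# Toy sketch (not HBase code): a "region" holds a sorted range of row keys
# and is split in two once it exceeds a configured size threshold.
SPLIT_THRESHOLD = 4  # rows per region, for illustration only

def split_if_needed(region):
    """Return one region if under the threshold, else two daughter regions."""
    rows = sorted(region)
    if len(rows) <= SPLIT_THRESHOLD:
        return [rows]
    mid = len(rows) // 2
    # Two daughter regions: keys stay ordered, ranges do not overlap.
    return [rows[:mid], rows[mid:]]

region = ["row1", "row2", "row3", "row4", "row5", "row6"]
daughters = split_if_needed(region)
```

Each daughter region covers a contiguous, non-overlapping slice of the original key range, which is what lets HBase route a row key to exactly one region.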

 Column-Family and Store Mapping: HBase stores data in columns, grouped into column families. Each
region maintains a separate store for each column family, with these stores mapping to physical files in the
underlying distributed file system.

 Write-Ahead Log (WAL): HBase uses a write-ahead log to ensure data durability. Each write is
appended to the WAL before being applied to the in-memory store (MemStore); when the MemStore
fills, its contents are flushed to disk.
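The write path can be sketched as below. This is an illustrative model of the log-first ordering, not the HBase API; the memstore limit and variable names are invented for the example.

```python
# Toy write path (illustrative, not HBase code): every put is appended to a
# write-ahead log first, then applied to the in-memory store; when the
# memstore reaches its limit, it is flushed to an on-disk store file.
MEMSTORE_LIMIT = 3  # entries, for illustration only

wal, memstore, disk_files = [], {}, []

def put(key, value):
    global memstore
    wal.append((key, value))               # 1. durability: log before memory
    memstore[key] = value                  # 2. apply to the in-memory store
    if len(memstore) >= MEMSTORE_LIMIT:
        disk_files.append(dict(memstore))  # 3. flush the memstore to disk
        memstore = {}

for i in range(4):
    put(f"row{i}", i)
```

Because every write reaches the log before the in-memory store, a crash after step 1 can always be recovered by replaying the WAL.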

 Distributed File System: HBase typically uses the Hadoop Distributed File System (HDFS) for storage. The
HDFS follows a master-worker pattern similar to HBase, with a NameNode and DataNodes. HBase interacts
with the file system through a filesystem API, allowing compatibility with other systems like CloudStore
(formerly Kosmos FileSystem).

 ZooKeeper for Configuration and Coordination: HBase relies on ZooKeeper for configuration management
and coordination. ZooKeeper points clients at the catalog information (-ROOT- and .META.) needed for
locating specific rows in HBase tables.

 Data Access Flow: When accessing data, the client first consults the -ROOT- and .META. catalogs via
ZooKeeper to locate the relevant region. This process is cached, so subsequent requests to the same data
can bypass the lookup steps.
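The cached-lookup flow can be modeled as follows. This is a simplified sketch, not the real catalog protocol: the catalog is flattened to one key-range-to-server map, and all names are invented for the example.

```python
# Toy sketch of the lookup flow: the first access consults the catalog to
# find which region server holds a row key; the result is cached, so later
# requests for the same key skip the catalog entirely.
catalog = {("a", "m"): "regionserver-1", ("n", "z"): "regionserver-2"}
location_cache = {}
catalog_lookups = 0

def locate(row_key):
    global catalog_lookups
    if row_key in location_cache:        # cached: bypass the catalog
        return location_cache[row_key]
    catalog_lookups += 1                 # uncached: consult the catalog
    for (lo, hi), server in catalog.items():
        if lo <= row_key[0] <= hi:
            location_cache[row_key] = server
            return server

locate("apple")   # catalog lookup
locate("apple")   # served from cache
locate("pear")    # catalog lookup
```

Only the first access to each key pays the catalog-lookup cost; repeated accesses are answered from the client-side cache.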
Introduction to NoSQL Databases

NoSQL databases are designed to store and manage large volumes of unstructured, semi-structured, or structured
data. Unlike traditional relational databases, NoSQL databases do not require a fixed schema, and they scale
horizontally, making them ideal for handling big data and real-time web applications.

Types of NoSQL Databases with Examples

1. Document-Oriented Databases

o Description: These databases store data in document formats like JSON, BSON, or XML, where each
document can have a different structure.

o Example: MongoDB

 Use Case: Ideal for content management systems, e-commerce platforms, and applications
that require flexible schemas.
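The "flexible schema" point can be made concrete with plain dicts standing in for JSON/BSON documents (with MongoDB itself, these would be inserted through a driver such as pymongo; the collection contents here are invented for the example).

```python
# Two "documents" in the same collection with different structures: the
# second has no "specs" field, and no schema change is needed to allow that.
products = [
    {"_id": 1, "name": "Laptop", "price": 999, "specs": {"ram_gb": 16}},
    {"_id": 2, "name": "Gift Card", "price": 25},   # no "specs" field
]

# Query by a field that only some documents have, tolerating its absence:
with_specs = [d for d in products if "specs" in d]
```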

2. Key-Value Stores

o Description: These databases store data as a collection of key-value pairs. The key is a unique
identifier, and the value can be any type of data.

o Example: Redis

 Use Case: Suitable for caching, session management, and real-time analytics.
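The caching use case can be sketched with a Redis-like get/set interface plus a simple time-to-live, using only a dict (no Redis server involved; the function names are invented for the example).

```python
import time

# A key-value store maps unique keys to opaque values.
store = {}  # key -> (value, expiry_timestamp)

def set_key(key, value, ttl_seconds=60):
    """Store a value under a key with an expiry time (like Redis SET + EX)."""
    store[key] = (value, time.time() + ttl_seconds)

def get_key(key):
    """Return the value for a key, or None if it is missing or expired."""
    entry = store.get(key)
    if entry is None or time.time() > entry[1]:
        return None
    return entry[0]

set_key("session:42", {"user": "alice"}, ttl_seconds=60)
```

The value attached to a key is opaque to the store: here it is a dict, but it could equally be a string, a counter, or serialized binary data.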

3. Column-Family Stores

o Description: These databases store data in columns rather than rows, allowing for efficient querying
and storage of large datasets.

o Example: Apache Cassandra

 Use Case: Best for time-series data, logging, and real-time analytics applications.

4. Graph Databases

o Description: These databases store data in nodes and edges, representing entities and relationships
between them, respectively.

o Example: Neo4j

 Use Case: Ideal for social networks, recommendation engines, and fraud detection systems.
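The social-network use case boils down to traversals over nodes and edges. This sketch uses a plain adjacency list and a two-hop "friend of a friend" query (the kind of traversal graph databases like Neo4j optimize); the data and function name are invented for the example.

```python
# Nodes are people; edges are friendships, stored as an adjacency list.
friends = {
    "alice": ["bob"],
    "bob":   ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave":  ["carol"],
}

def friends_of_friends(person):
    """Two-hop traversal: people reachable via a friend, excluding
    the person themselves and their direct friends."""
    direct = set(friends.get(person, []))
    fof = set()
    for f in direct:
        fof.update(friends.get(f, []))
    return fof - direct - {person}

suggestions = friends_of_friends("alice")
```

In a relational database this query needs a self-join per hop; a graph database walks the edges directly, which is why multi-hop queries are its sweet spot.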

5. Wide-Column Stores

o Description: A hybrid between key-value and column-family stores, wide-column stores allow
storing a large amount of data in a columnar format.

o Example: HBase

 Use Case: Used for handling sparse data, such as in big data applications and Hadoop
ecosystems.

6. Object-Oriented Databases

o Description: These databases store data as objects, similar to how they are handled in object-
oriented programming languages.

o Example: db4o

 Use Case: Suitable for applications where data is naturally represented as objects, such as in
complex simulations.
7. Time-Series Databases

o Description: These databases are optimized for storing and querying time-stamped or time-series
data.

o Example: InfluxDB

 Use Case: Used in IoT, monitoring systems, and real-time analytics.
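The characteristic access pattern (time-stamped points, aggregated over a window) can be sketched as below; the data and function name are invented for the example, and a real time-series database like InfluxDB would answer this with a query language rather than Python.

```python
# Time-series data: (timestamp_seconds, value) points in arrival order.
points = [(0, 10.0), (60, 12.0), (120, 11.0), (180, 15.0)]

def window_mean(points, start, end):
    """Mean of all values whose timestamp falls in [start, end)."""
    vals = [v for t, v in points if start <= t < end]
    return sum(vals) / len(vals) if vals else None

m = window_mean(points, 0, 120)   # mean over the first two minutes
```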

8. Multi-Model Databases

o Description: These databases support multiple data models, such as key-value, document, and graph
within the same database.

o Example: ArangoDB

 Use Case: Useful for applications requiring flexibility in handling different data types and
relationships.

9. Search Engines

o Description: These are specialized databases designed for searching and indexing large volumes of
text data.

o Example: Elasticsearch

 Use Case: Used in full-text search applications, logging, and analytics.
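The core structure behind text search engines is the inverted index: a map from each term to the documents containing it. A minimal sketch (the documents and function name are invented for the example; Elasticsearch adds analysis, scoring, and distribution on top of this idea):

```python
# Build an inverted index: term -> set of document ids containing it.
docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def search(term):
    """Return the sorted ids of documents containing the term."""
    return sorted(index.get(term, set()))

hits = search("quick")
```

Lookup cost depends on the number of matching documents, not the total volume of text, which is what makes full-text search over large corpora feasible.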

10. Geospatial Databases

o Description: These databases are optimized for storing and querying geospatial data, such as
coordinates and polygons.

o Example: PostGIS (an extension of PostgreSQL)

 Use Case: Used in geographic information systems (GIS), location-based services, and
mapping applications
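The simplest geospatial predicate is a point-in-bounding-box test, sketched below; databases like PostGIS evaluate such predicates with spatial indexes rather than a linear check, and the coordinates here are rough illustrative values.

```python
def in_bbox(point, bbox):
    """True if (lon, lat) point lies inside (min_lon, min_lat, max_lon, max_lat)."""
    (x, y), (min_x, min_y, max_x, max_y) = point, bbox
    return min_x <= x <= max_x and min_y <= y <= max_y

# Rough bounding box around London (lon, lat), for illustration only.
city_bbox = (-0.5, 51.3, 0.3, 51.7)

inside = in_bbox((-0.1, 51.5), city_bbox)   # a point in central London
```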

CAP Theorem

Reference: https://www.geeksforgeeks.org/the-cap-theorem-in-dbms/

The CAP theorem is a fundamental principle in distributed systems that helps explain the trade-offs that
must be made when designing databases that are spread across multiple networked nodes. The theorem, originally
proposed by Eric Brewer in 2000, outlines three essential properties that distributed databases aim to achieve:

1. Consistency: Every read operation reflects the most recent write. This means that all clients accessing the
system will have the same view of the data at the same time. For example, if a transaction updates a piece of
data, all subsequent reads should return the updated data.

2. Availability: Every request (whether read or write) receives a response, even if it might not reflect the latest
data. This means that the system remains operational, and clients can always access data, but the data might
be stale or inconsistent during certain conditions.

3. Partition Tolerance: The system continues to function even if there is a network partition that prevents
some parts of the system from communicating with others. This means the system is resilient to network
failures, ensuring that it remains operational even if some nodes are isolated due to network issues.

Key Insight of the CAP Theorem

The CAP theorem states that it is impossible for a distributed system to fully achieve all three of these properties
simultaneously; it can guarantee at most two of the three. Since network partitions cannot be ruled out in
practice, the real choice is usually which of consistency or availability to sacrifice while a partition lasts:
 CA (Consistency and Availability): A system that ensures both consistency and availability will not be able to
handle network partitions effectively. Such a system will function smoothly as long as there are no network
partitions, but if a partition occurs, the system might fail to maintain either consistency or availability.

 CP (Consistency and Partition Tolerance): A system that ensures consistency and can handle network
partitions might have to sacrifice availability during a partition. For example, in the case of a network
partition, some parts of the system might become unavailable to ensure that the data remains consistent
across all nodes.

 AP (Availability and Partition Tolerance): A system that ensures availability and can handle partitions might
sacrifice consistency. This means that during a network partition, the system might continue to operate and
respond to requests, but different parts of the system might return different, potentially inconsistent, data.
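The CP-versus-AP choice can be modeled with a single replica that loses contact with its peers. This is a toy model, not any real database's behavior: the class and mode names are invented, and real systems decide this per operation with quorums and timeouts.

```python
# Toy model: during a partition, a CP system refuses reads it cannot prove
# are current, while an AP system answers anyway, possibly with stale data.
class Replica:
    def __init__(self, value):
        self.value = value
        self.connected = True   # can this replica reach its peers?

    def read(self, mode):
        if self.connected:
            return self.value
        if mode == "CP":
            # Consistency over availability: fail rather than risk staleness.
            raise RuntimeError("unavailable: cannot guarantee consistency")
        # Availability over consistency: serve possibly-stale data.
        return self.value

r = Replica("v1")
r.connected = False     # simulate a network partition
stale = r.read("AP")    # AP: still answers, value may be stale
```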

Eventual Consistency and CAP Theorem

The concept of eventual consistency arises within the context of the CAP theorem, particularly in systems that
prioritize availability and partition tolerance (AP). Eventual consistency is a weaker form of consistency that allows
the system to provide immediate availability and partition tolerance, with the understanding that the data will
eventually become consistent once the system has had enough time to propagate all updates across all nodes.

In other words, under eventual consistency:

 In the absence of updates, the system will eventually reach a state where all nodes have the same data.

 With continuous updates, temporary inconsistencies can occur, but eventually all reachable replicas in the
system converge to a consistent state (a replica that cannot be synchronized may be dropped from the replica set).
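Convergence can be sketched with a last-write-wins anti-entropy pass, where each replica tags its value with a version number. This is one common reconciliation policy, not the only one (vector clocks and CRDTs are alternatives), and all names here are invented for the example.

```python
# Three replicas; only the first has seen the latest write (version 2).
replicas = [
    {"version": 2, "value": "new"},
    {"version": 1, "value": "old"},
    {"version": 1, "value": "old"},
]

def anti_entropy(replicas):
    """One reconciliation pass: copy the highest-versioned value everywhere
    (last-write-wins)."""
    latest = max(replicas, key=lambda r: r["version"])
    for r in replicas:
        r.update(latest)

anti_entropy(replicas)
converged = all(r["value"] == "new" for r in replicas)
```

In the absence of further updates, repeated passes like this are what drives an AP system to the "eventually consistent" state described above.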

BASE vs. ACID

Eventual consistency is part of the BASE (Basically Available, Soft state, Eventual consistency) model, which is often
contrasted with the ACID (Atomicity, Consistency, Isolation, Durability) model used in traditional relational
databases:

 ACID ensures strict consistency and reliability of transactions, making it suitable for systems where data
accuracy and integrity are critical, such as financial applications.

 BASE allows for more flexibility and scalability in distributed systems, where availability and partition
tolerance are prioritized, and temporary inconsistencies are acceptable as long as the system eventually
becomes consistent.
