Unit 3: HBase, MongoDB and CouchDB
HBase
Row-oriented with column families: HBase stores data in tables, where each row has a
unique identifier called a row key. Each table is divided into column families, which
group similar columns together to improve performance for read and write operations.
Unlike relational databases that store data in rows and columns, HBase optimizes for
large-scale and fast reads/writes across rows using column families.
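The row-key / column-family layout above can be sketched with plain Python dicts. This is only an illustrative model, not the HBase client API; the table, family, and qualifier names are made up:

```python
# Model an HBase table as {row_key: {column_family: {qualifier: value}}}.
# Each cell is addressed by (row key, family, qualifier), mirroring how
# HBase groups related columns into families under a unique row key.
table = {}

def put(row_key, family, qualifier, value):
    """Insert or overwrite a single cell."""
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

def get(row_key, family, qualifier):
    """Read a single cell; returns None if absent."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)

# Two column families grouping related columns: "info" and "stats".
put("user#001", "info", "name", "Alice")
put("user#001", "stats", "logins", 42)
```

Because a family maps to its own nested dict, reading all of one family for a row never touches the other families, which is the performance point column families exist for.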
Data is stored in HDFS: HBase stores its data on the Hadoop Distributed File System
(HDFS), which provides fault tolerance and scalability. Data is distributed across
multiple servers and can be replicated for high availability.
Regions: Tables in HBase are split into regions (essentially ranges of rows). These
regions are distributed across HBase servers called RegionServers, allowing the database
to scale horizontally as the number of regions and RegionServers increases. When a table
grows beyond the size of a single region, HBase splits it into new regions to manage
more data.
Master Server: The HBase Master is responsible for managing the RegionServers, load
balancing, and managing metadata about regions. It coordinates the overall operation of
HBase but does not store data directly.
Consistency Model: HBase provides strong consistency at the row level. This means
that if a read operation is done after a write, the read will return the value of the write, but
only for the specific row. It does not guarantee consistency across rows or column
families. This makes HBase suitable for applications where read consistency is required
at the row level but not across large datasets.
Write and Read Operations:
o Write Operations: Data is written into MemStore (in-memory), then
periodically flushed to disk into HFiles. HBase also uses Write-Ahead Logs
(WAL) to ensure durability during crashes.
o Read Operations: Data is retrieved from HFiles or MemStore and can be cached
for faster subsequent reads. The system is optimized for random reads and writes,
particularly useful for use cases like time-series data, logs, and real-time analytics.
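The write path described above (WAL first, then MemStore, then a flush to immutable HFiles) can be sketched as a small simulation. The class name and flush threshold are invented for illustration; real HBase flushes on memory size, not row count:

```python
# Sketch of HBase's write path: append to a write-ahead log first (for
# durability), buffer in an in-memory MemStore, and flush sorted batches
# to immutable "HFiles" on disk.
class RegionServerSketch:
    def __init__(self, flush_threshold=3):
        self.wal = []          # write-ahead log: replayed after a crash
        self.memstore = {}     # in-memory buffer of recent writes
        self.hfiles = []       # each flush produces one immutable sorted file
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))   # 1. log for durability
        self.memstore[row_key] = value      # 2. buffer in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. write the MemStore out as a sorted, immutable HFile
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row_key):
        # Reads check the MemStore first, then HFiles, newest first.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row_key:
                    return value
        return None
```

Note how a read after a write always sees the write for that row, via either the MemStore or an HFile: this is the row-level consistency described above.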
Performance:
HBase performs best in environments where the data can be stored across multiple
servers in a distributed fashion. Its architecture is highly optimized for random reads
and writes of large datasets.
However, it is less efficient for complex queries such as joins, since it has no
relational query engine, making it less ideal for use cases that need complex,
relational-style querying.
Use Cases:
Time-series data, application logs, and real-time analytics, where fast random
reads and writes over very large datasets matter more than relational queries.
1. HBase Architecture
HBase is designed to handle very large datasets and store them in a distributed manner. It is a
column-family store and follows the model introduced by Google Bigtable. HBase is tightly
integrated with the Hadoop ecosystem and stores data on HDFS (Hadoop Distributed File
System).
Key Components:
1. Master Server:
o The HBase Master is responsible for managing the entire cluster. It monitors
RegionServers, balances loads, and manages metadata.
o It does not store actual data but coordinates the RegionServer nodes to ensure
proper distribution and manage the metadata of the tables (e.g., which regions
belong to which RegionServers).
2. Region Servers:
o RegionServers are the heart of HBase. They serve client requests for data and
manage regions (a subset of the table’s data). Each RegionServer can host
multiple regions, and each region is responsible for a range of rows.
o A Region is the basic unit of distribution and contains rows of data. Once a region
grows too large, it splits into multiple regions to balance load.
3. Regions:
o A Region is a contiguous range of rows in a table. HBase tables are split into
multiple regions, and each region is managed by a RegionServer.
o When a region reaches a certain size, it is split into two regions, and HBase
redistributes the data.
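The split behavior described here can be sketched as follows. The row-count threshold is an invented stand-in (real HBase splits on region size in bytes), and the routing is simplified:

```python
# Sketch of region splitting: a region is a contiguous, sorted range of
# row keys; once it exceeds a threshold, it splits at its middle key
# into two regions.
MAX_ROWS = 4  # illustrative threshold; HBase actually splits on byte size

def split_region(region):
    """Split a sorted list of (row_key, value) pairs into two halves."""
    mid = len(region) // 2
    return region[:mid], region[mid:]

def insert(regions, row_key, value):
    # Route the row to the region whose key range contains it: the last
    # region whose first key is <= row_key.
    idx = 0
    for i, region in enumerate(regions):
        if region and region[0][0] <= row_key:
            idx = i
    regions[idx].append((row_key, value))
    regions[idx].sort()
    if len(regions[idx]) > MAX_ROWS:
        left, right = split_region(regions[idx])
        regions[idx:idx + 1] = [left, right]
```

After a split, the two halves can be served by different RegionServers, which is how HBase spreads a growing table across the cluster.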
4. HFile:
o Data is stored in HFiles on HDFS. An HFile is an actual disk file storing the data
in sorted order, and it is immutable.
o The MemStore (in-memory store) temporarily holds writes before they are
flushed to HFiles on disk.
5. Write-Ahead Log (WAL):
o For durability, writes are first recorded in the WAL on the RegionServer before
being written to MemStore and eventually flushed to HFiles.
6. Zookeeper:
o HBase relies on Zookeeper for coordination. Zookeeper helps in tracking cluster
state, managing leader election for RegionServers, and maintaining metadata
consistency across all nodes.
Data Flow:
Write Operations:
o Data is first written to the Write-Ahead Log (WAL) for durability. Then, it is
stored temporarily in MemStore and eventually written to disk as HFiles.
Read Operations:
o Data is retrieved from MemStore or HFiles. If a region is requested that isn’t
available on the local server, a request is forwarded to the RegionServer that holds
that data.
Scaling:
HBase scales horizontally by adding more RegionServers as the data grows. Regions are
dynamically distributed and balanced across RegionServers to manage load and optimize
performance.
It uses HDFS for distributed storage, ensuring fault tolerance by replicating data across
multiple nodes.
2. MongoDB
Performance:
MongoDB performs very well for write-heavy workloads and read-heavy workloads
where complex relational queries are not necessary. The database’s flexibility allows for
the rapid evolution of application data models, particularly for fast-paced development
cycles (like web and mobile applications).
However, reads served from secondary replicas can lag behind the primary, so
MongoDB can return stale data across replicas, especially in highly distributed
systems. This eventual consistency is acceptable for many real-time applications
but not for those that require strongly consistent reads everywhere.
Use Cases:
Web and mobile applications with fast-paced development cycles and rapidly
evolving data models, where flexible schemas matter more than relational joins.
MongoDB Architecture
1. Database:
o MongoDB stores data in databases, and each database contains a set of
collections. A collection is a group of documents and can be thought of as a table
in a relational database. Collections are schema-less, meaning that each document
within a collection can have different fields.
2. Documents:
o Data is stored as documents in BSON format, which supports complex data types
like arrays and embedded documents.
o Documents are analogous to rows in relational databases, but they can be much
more complex, with nested structures and flexible schemas.
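The schema-less document model can be shown with plain Python dicts standing in for BSON documents. The collection and field names are made up, and the toy `find` is a simplification of MongoDB's query language:

```python
# Documents in one collection need not share a schema: each is a free-form
# (BSON-like) structure that may nest arrays and embedded documents.
# All names below are illustrative.
users = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob",
     "address": {"city": "Pune", "zip": "411001"},   # embedded document
     "hobbies": ["chess", "cycling"]},               # array field
]

def find(collection, predicate):
    """A toy query: return the documents matching a predicate function."""
    return [doc for doc in collection if predicate(doc)]
```

Note that the first document has an `email` field the second lacks, and the second nests a sub-document and an array: exactly the flexibility a relational row cannot offer.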
3. Replica Sets:
o A replica set is a group of MongoDB servers that maintain the same data set. A
primary node handles all write operations, while secondary nodes replicate the
data for redundancy and high availability.
o If the primary node fails, one of the secondaries can automatically be promoted to
primary (using automatic failover), ensuring continued service.
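The failover behavior can be sketched as a small simulation. This is a simplification: real replica sets hold an election among surviving members rather than simply promoting the first secondary, and replication is asynchronous:

```python
# Sketch of replica-set failover: one primary takes all writes and the
# data is replicated to secondaries; if the primary fails, a secondary
# is promoted so the set keeps serving traffic.
class ReplicaSetSketch:
    def __init__(self, members):
        self.members = list(members)   # first member starts as primary
        self.primary = self.members[0]

    def write(self, data, store):
        # Writes go to the primary and are replicated to every member.
        for member in self.members:
            store.setdefault(member, []).append(data)

    def fail(self, member):
        self.members.remove(member)
        if member == self.primary:
            # Automatic failover: promote a surviving secondary.
            self.primary = self.members[0]
```

Because secondaries already hold copies of the data, the promoted node can serve reads and writes immediately, which is what "high availability" means here.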
4. Shards and Sharding:
o MongoDB can distribute data across multiple machines using sharding. A shard
is a subset of the data, and MongoDB distributes data across shards using a shard
key, which determines how documents are divided across different servers.
o Sharding helps MongoDB scale horizontally by adding more nodes to handle
larger datasets and higher throughput.
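Shard-key routing can be sketched as below. This shows only hashed sharding (MongoDB also supports ranged sharding), and the three-shard setup is illustrative:

```python
import hashlib

# Sketch of hashed sharding: the shard key value is hashed, and the hash
# deterministically picks a shard, spreading documents across servers.
def route(shard_key_value, num_shards):
    digest = hashlib.md5(str(shard_key_value).encode()).hexdigest()
    return int(digest, 16) % num_shards

# Distribute ten documents across three shards by their user_id shard key.
shards = [[] for _ in range(3)]
for user_id in range(10):
    shards[route(user_id, 3)].append({"user_id": user_id})
```

The key property is determinism: any router hashing the same shard key reaches the same shard, so a document can always be found again without a full scan.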
5. Journaling:
o MongoDB uses journaling to ensure data durability. Changes to data are first
written to a journal (log) before being committed to the database, ensuring that in
case of a crash, MongoDB can recover data.
6. Indexing:
o MongoDB supports several types of indexes to improve the performance of
queries. Indexes can be created on single fields, compound fields, and even
geospatial data. The aggregation framework allows complex queries with
groupings, transformations, and filters.
7. Querying:
o MongoDB offers a rich set of queries to filter, sort, and aggregate data. It also
supports joins via its aggregation framework, although it doesn’t use traditional
relational joins. Instead, it performs lookup operations to link documents.
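The lookup-style join mentioned above can be sketched in pure Python. This mimics the shape of an aggregation `$lookup` (a left outer join that embeds matches as an array field); the collection and field names are made up:

```python
# Sketch of a $lookup-style join: for each document in the left collection,
# embed the matching documents from the right collection as an array field.
def lookup(left, right, local_field, foreign_field, as_field):
    joined = []
    for doc in left:
        matches = [r for r in right
                   if r.get(foreign_field) == doc.get(local_field)]
        joined.append({**doc, as_field: matches})
    return joined

orders = [{"_id": 1, "item": "pen", "cust_id": 10},
          {"_id": 2, "item": "ink", "cust_id": 99}]
customers = [{"cust_id": 10, "name": "Alice"}]
```

Unlike a relational inner join, an unmatched document is kept with an empty array rather than dropped, which is the left-outer behavior `$lookup` provides.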
Data Flow:
Write Operations:
o When a document is inserted or updated, the data is written to the primary node
in the replica set and propagated to the secondary nodes.
Read Operations:
o Reads can be served by the primary or secondary nodes, depending on the
configuration (primary preferred, or nearest node).
Scaling:
MongoDB scales horizontally through sharding, adding nodes to absorb larger
datasets and higher throughput, while replica sets provide redundancy and high
availability within each shard.
3. CouchDB
Performance:
Use Cases:
Offline-first mobile apps where data synchronization is required when the device
reconnects.
Distributed systems that require frequent synchronization of data between nodes.
Data synchronization between web clients and servers in a highly available, fault-
tolerant way.
Web applications that need easy replication and synchronization.
3. CouchDB Architecture
Type: Document-Oriented NoSQL Database
Primary Use Case: Distributed, fault-tolerant storage for web apps, with offline capabilities and
data synchronization across multiple nodes.
CouchDB is a document-oriented database that stores data in JSON format. One of its primary
strengths is its built-in replication and offline-first capabilities, which allow data to be
synchronized between distributed systems.
Key Components:
1. Document Storage:
o CouchDB stores data in documents that are identified by a unique ID. These
documents are stored in JSON format, and CouchDB supports complex, nested
data structures.
o Every document in CouchDB carries a revision identifier, and updates never
modify a document in place: each change creates a new revision, and earlier
revisions remain available until the database is compacted.
2. Views:
o CouchDB uses MapReduce to create views. A view is an index that organizes
and processes data. You define a Map function to process documents and emit
key-value pairs, and a Reduce function to aggregate or process those key-value
pairs.
o Views allow CouchDB to perform complex queries, like filtering and grouping,
without the overhead of relational joins.
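The map/reduce view mechanism can be sketched in Python (real CouchDB views are written in JavaScript and indexed incrementally; the `type` field and document contents here are illustrative):

```python
# Sketch of a CouchDB view: a map function emits (key, value) pairs per
# document, and a reduce function aggregates the values for each key.
# This view counts documents per "type" field.
def map_fn(doc):
    if "type" in doc:
        yield (doc["type"], 1)

def reduce_fn(values):
    return sum(values)

def build_view(docs):
    index = {}
    for doc in docs:
        for key, value in map_fn(doc):
            index.setdefault(key, []).append(value)
    return {key: reduce_fn(values) for key, values in index.items()}
```

Querying the view is then just a key lookup in the prebuilt index, which is why views avoid the cost of scanning or joining at query time.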
3. Replication:
o One of CouchDB’s strongest features is its replication mechanism. Data can be
replicated between nodes in different locations, supporting distributed systems
that are disconnected at times (i.e., offline-first).
o Replication can be continuous, and CouchDB automatically handles conflict
resolution during synchronization.
4. Clustered Servers:
o CouchDB supports master-master (multi-master) replication, where every node
in the cluster can accept both reads and writes; since version 2.0, CouchDB
also supports clustering, which shards databases across nodes. This allows
distributed data storage across multiple nodes, ensuring high availability and
fault tolerance.
5. HTTP API:
o CouchDB exposes a RESTful API for interacting with the database. All database
operations (CRUD) are done through HTTP requests (GET, POST, PUT,
DELETE). This makes it easy to integrate CouchDB with web and mobile
applications.
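The CRUD-to-HTTP mapping can be sketched as a pure function that only builds (method, path) pairs; no server is contacted, and the database and document names are made up for illustration:

```python
# Sketch of how CouchDB's REST API maps CRUD operations onto HTTP verbs.
# Updates and deletes must carry the document's current revision (rev),
# which is how CouchDB detects conflicting writes.
def couch_request(operation, db, doc_id, rev=None):
    if operation == "create":
        return ("PUT", f"/{db}/{doc_id}")
    if operation == "read":
        return ("GET", f"/{db}/{doc_id}")
    if operation == "update":
        return ("PUT", f"/{db}/{doc_id}?rev={rev}")
    if operation == "delete":
        return ("DELETE", f"/{db}/{doc_id}?rev={rev}")
    raise ValueError(f"unknown operation: {operation}")
```

Because every operation is plain HTTP, any client that can issue GET/PUT/DELETE requests (a browser, curl, a mobile app) can talk to CouchDB directly.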
6. Conflict Resolution:
o CouchDB handles conflicts through its MVCC (Multi-Version Concurrency
Control) system, which ensures that conflicting writes to the same document are
stored as separate versions. Users or applications can resolve conflicts manually,
but CouchDB’s conflict resolution system tracks changes effectively.
7. Futon:
o Futon is the classic web-based administrative interface for CouchDB (replaced
by Fauxton in CouchDB 2.0) that allows users to interact with the database,
manage documents, and configure replication.
Data Flow:
Write Operations:
o When a document is written, it is assigned a revision number and stored in the
database. If the document already exists, CouchDB creates a new version of it.
Conflicts may arise during replication, and CouchDB handles them using its
MVCC system.
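The revision check described above can be sketched as follows. This simplification uses integer revision numbers (real CouchDB revisions look like `1-abc...` and conflicting branches are stored rather than rejected outright):

```python
# Sketch of CouchDB-style revision checking: every update must state the
# revision it was based on; a stale revision signals a conflict instead of
# silently overwriting newer data.
class RevisionStoreSketch:
    def __init__(self):
        self.docs = {}   # doc_id -> (revision_number, body)

    def put(self, doc_id, body, based_on_rev=None):
        current = self.docs.get(doc_id)
        if current is None:
            self.docs[doc_id] = (1, body)   # first write gets revision 1
            return 1
        rev, _ = current
        if based_on_rev != rev:
            raise ValueError("conflict: document was updated concurrently")
        self.docs[doc_id] = (rev + 1, body)
        return rev + 1
```

During replication, CouchDB cannot reject the losing write the way this sketch does, so it instead keeps both versions and lets the application resolve the conflict later.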
Read Operations:
o Documents are retrieved through HTTP requests. When a query is made,
CouchDB retrieves the document or the result from a view. Views are indexed
using MapReduce functions and can be queried for more efficient data retrieval.
Scaling: