HBase (Big Table): Row-Oriented vs. Column-Oriented Data Stores
The two layouts are compared in the lists below and contrasted in the sketch that
follows them.
Row-oriented data stores –
Data is stored and retrieved one row at a time, and hence unnecessary data may be
read if only some of the data in a row is required.
Easy to read and write records
Well suited for OLTP systems
Not efficient in performing operations applicable to the entire dataset and
hence aggregation is an expensive operation
Typical compression mechanisms provide less effective results than those
on column-oriented data stores
Column-oriented data stores –
Data is stored and retrieved in columns and hence can read only relevant
data if only some data is required
Read and Write are typically slower operations
Well suited for OLAP systems
Can efficiently perform operations applicable to the entire dataset and
hence enables aggregation over many rows and columns
Permits high compression rates due to few distinct values in columns
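To make the contrast above concrete, here is a minimal plain-Java sketch (no HBase API involved); the record fields and values are invented purely for illustration. Summing a single field in the row layout still walks over whole rows, while the column layout touches only the one array it needs.

```java
// Row layout vs. column layout of the same three records (illustrative only).
public class RowVsColumnLayout {

    public static void main(String[] args) {
        // Row-oriented: each record is stored and read as a whole.
        int[][] rows = {
            // {userId, age, loginCount}
            {1, 34, 120},
            {2, 28, 45},
            {3, 52, 300},
        };

        // Aggregating one field still pulls in every full row.
        long totalLoginsRowLayout = 0;
        for (int[] row : rows) {
            totalLoginsRowLayout += row[2]; // userId and age were read along with it
        }

        // Column-oriented: each column is stored contiguously on its own.
        int[] userIds     = {1, 2, 3};      // never touched by the aggregation below
        int[] ages        = {34, 28, 52};   // never touched by the aggregation below
        int[] loginCounts = {120, 45, 300};

        // The same aggregation scans only the column it needs; long runs of
        // similar values within one column are also what compress so well.
        long totalLoginsColumnLayout = 0;
        for (int count : loginCounts) {
            totalLoginsColumnLayout += count;
        }

        System.out.println(totalLoginsRowLayout + " == " + totalLoginsColumnLayout);
    }
}
```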
Relational Databases vs. HBase
When talking of data stores, we first think of Relational Databases with
structured data storage and a sophisticated query engine. However, a Relational
Database pays an increasingly heavy penalty to maintain performance as the data
size grows.
HBase, on the other hand, is designed from the ground up to provide scalability
and partitioning to enable efficient data structure serialization, storage and
retrieval. Broadly, the differences between a Relational Database and HBase
are:
Relational Database –
Has a fixed schema
Is a Row-oriented datastore
Is designed to store Normalized Data
Contains thin tables
Has no built-in support for partitioning
HBase –
Is Schema-less
Is a Column-oriented datastore
Is designed to store Denormalized Data
Contains wide and sparsely populated tables
Supports Automatic Partitioning
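As an illustration of the "schema-less" and "wide, sparsely populated" points in the list above, the sketch below writes two rows with different column sets using the HBase Java client (HBase 2.x API assumed). The table name "customers", the column family "profile", and all qualifiers and values are hypothetical; only the column family has to exist up front, and individual columns come into existence simply by being written.

```java
import java.util.Arrays;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SchemalessWriteSketch {
    public static void main(String[] args) throws Exception {
        // Assumes a table "customers" with column family "profile" already exists.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("customers"))) {

            byte[] profile = Bytes.toBytes("profile");

            // Row 1 stores one set of columns...
            Put p1 = new Put(Bytes.toBytes("customer-001"));
            p1.addColumn(profile, Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            p1.addColumn(profile, Bytes.toBytes("city"), Bytes.toBytes("London"));

            // ...while row 2 stores a different set. No schema change is needed:
            // a column exists only in the rows that actually hold a value for it,
            // which is what makes HBase tables wide and sparsely populated.
            Put p2 = new Put(Bytes.toBytes("customer-002"));
            p2.addColumn(profile, Bytes.toBytes("name"), Bytes.toBytes("Grace"));
            p2.addColumn(profile, Bytes.toBytes("loyalty_tier"), Bytes.toBytes("gold"));
            p2.addColumn(profile, Bytes.toBytes("newsletter"), Bytes.toBytes("yes"));

            table.put(Arrays.asList(p1, p2));
        }
    }
}
```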
HDFS vs. HBase
HDFS is a distributed file system that is well suited for storing large files. It’s
designed to support batch processing of data but doesn’t provide fast individual
record lookups. HBase is built on top of HDFS and is designed to provide fast,
random access to single rows of data in large tables. Overall, the differences
between HDFS and HBase are:
HDFS –
Is suited for high-latency batch processing operations
Data is primarily accessed through MapReduce
Is designed for batch processing and hence doesn’t have a concept of
random reads/writes
HBase –
Is built for low-latency operations
Provides fast, random read/write access to single rows among billions of records
Data is accessed through shell commands, client APIs in Java, REST or Thrift (a
single-row lookup is sketched below)
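As a sketch of what such a low-latency random read looks like, the snippet below fetches a single row by its rowkey with the HBase Java client (HBase 2.x API assumed); the table "customers" and column family "profile" carry over from the hypothetical example above.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomReadSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("customers"))) {

            // A Get fetches exactly one row by its rowkey -- no scan over the
            // whole dataset and no MapReduce job, unlike a typical HDFS workflow.
            Get get = new Get(Bytes.toBytes("customer-002"));
            get.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"));

            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
            System.out.println("name = " + (name == null ? "<absent>" : Bytes.toString(name)));
        }
    }
}
```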
HBase Architecture
Just like in a Relational Database, data in HBase is stored in Tables and these
Tables are stored in Regions. When a Table becomes too big, the Table is
partitioned into multiple Regions. These Regions are assigned to Region
Servers across the cluster. Each Region Server hosts roughly the same number
of Regions.
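The mapping of Regions to Region Servers can be observed from the client side. Here is a small sketch (HBase 2.x client API assumed, hypothetical "customers" table again) that prints each Region of a table and the Region Server currently hosting it.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;

public class RegionAssignmentSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("customers"))) {

            // Each Region covers a contiguous range of rowkeys and is hosted by
            // exactly one Region Server at any given time.
            for (HRegionLocation location : locator.getAllRegionLocations()) {
                System.out.println(location.getRegion().getRegionNameAsString()
                        + " -> " + location.getServerName());
            }
        }
    }
}
```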
The HMaster in HBase is responsible for
Performing Administration
Managing and Monitoring the Cluster
Assigning Regions to the Region Servers
Controlling the Load Balancing and Failover
On the other hand, the HRegionServers perform the following work
Hosting and managing the Regions assigned to them
Splitting the Regions automatically as they grow
Handling read and write requests for the Regions they serve
Communicating with the clients directly
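One way to see this division of labour is to ask the cluster for its status. The sketch below (HBase 2.x Admin API assumed) prints the active HMaster and each live Region Server together with the number of Regions it currently hosts.

```java
import java.util.Map;

import org.apache.hadoop.hbase.ClusterMetrics;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerMetrics;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ClusterStatusSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            ClusterMetrics metrics = admin.getClusterMetrics();

            // The single active HMaster coordinates the cluster...
            System.out.println("Active master: " + metrics.getMasterName());

            // ...while each Region Server hosts its share of the Regions.
            for (Map.Entry<ServerName, ServerMetrics> entry
                    : metrics.getLiveServerMetrics().entrySet()) {
                System.out.println(entry.getKey() + " hosts "
                        + entry.getValue().getRegionMetrics().size() + " regions");
            }
        }
    }
}
```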
HBase Data Model
Tables – An HBase Table is a logical collection of rows stored in separate
partitions called Regions. Every Region is served by exactly one Region Server.
Rows – A row is one instance of data in a table and is identified by a rowkey.
Rowkeys are unique in a Table and are always treated as a byte[].
Version – The data stored in a cell is versioned, and versions are identified by
their timestamp. The number of versions retained per column family is
configurable; it defaults to 1 in current HBase releases (3 in older releases).
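As a sketch of how versioning surfaces in the client API (HBase 2.x assumed; the table "sensor_readings", family "d", column "temp" and rowkey are hypothetical, and the row is assumed to have been written several times beforehand), the snippet below creates a column family that retains three versions and then reads back every retained, timestamped version of one cell.

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionsSketch {
    public static void main(String[] args) throws Exception {
        TableName tableName = TableName.valueOf("sensor_readings"); // hypothetical
        byte[] family = Bytes.toBytes("d");

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            // Version retention is configured per column family.
            admin.createTable(TableDescriptorBuilder.newBuilder(tableName)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(family)
                            .setMaxVersions(3)   // keep up to three timestamped versions per cell
                            .build())
                    .build());

            try (Table table = conn.getTable(tableName)) {
                // Ask for every retained version of one cell; each version is
                // identified by its timestamp. The row is assumed to have been
                // written multiple times before this read.
                Get get = new Get(Bytes.toBytes("sensor-42"));
                get.addColumn(family, Bytes.toBytes("temp"));
                get.readVersions(3);

                Result result = table.get(get);
                for (Cell cell : result.getColumnCells(family, Bytes.toBytes("temp"))) {
                    System.out.println(cell.getTimestamp() + " -> "
                            + Bytes.toString(CellUtil.cloneValue(cell)));
                }
            }
        }
    }
}
```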