HBase
• Limitations of Hadoop
• Hadoop can perform only batch processing, and data is accessed only in a sequential manner. That means the entire dataset must be scanned even for the simplest of jobs.
• A huge dataset, when processed, produces another huge dataset, which must also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access).
• Hadoop Random Access Databases
• HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and access the data in a random manner.
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It
is an open-source project and is horizontally scalable.
HBase's data model is similar to Google's Bigtable, designed to provide quick
random access to huge amounts of structured data. It leverages the fault tolerance
provided by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop File System.
One can store data in HDFS either directly or through HBase. Data consumers read/access
the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and
provides read and write access.
HBase and HDFS
• HDFS is a distributed file system suitable for storing large files, whereas HBase is a database built on top of HDFS.
• HDFS does not support fast individual record lookups, whereas HBase provides fast lookups for larger tables.
• HDFS provides high-latency batch processing, whereas HBase provides low-latency access to single rows from billions of records (random access).
• HDFS provides only sequential access to data, whereas HBase internally uses hash tables, provides random access, and stores the data in indexed HDFS files for faster lookups.
Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it are sorted by row.
The table schema defines only column families, which are the key-value pairs. A table can have multiple column families, and each column family can have any number of columns; each cell value is addressed by its row key, column family, column qualifier, and timestamp.
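To make this layout concrete, here is a brief, hypothetical HBase shell sketch (the emp table and personal column family are example names only, reused later in these notes); a cell is written and read back by row key and family:qualifier:
create 'emp', 'personal'                          # table with one column family
put 'emp', 'row1', 'personal:name', 'Alice'       # cell = (row key, family:qualifier) -> value
put 'emp', 'row1', 'personal:city', 'Hyderabad'   # another column in the same family
get 'emp', 'row1'                                 # returns the cells of the row with their timestamps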
Column Oriented and Row Oriented
Column-oriented databases are those that store data tables as sections of columns of data, rather
than as rows of data. In short, they have column families.
HBase and RDBMS
• HBase is schema-less; it does not have the concept of a fixed-column schema and defines only column families. An RDBMS, in contrast, is governed by its schema, which describes the whole structure of its tables.
• HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
Master Server (HMaster)
• Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
ZooKeeper
• In addition to availability, the ZooKeeper nodes are also used to track server failures or network partitions.
• Clients communicate with region servers via ZooKeeper.
• In pseudo-distributed and standalone modes, HBase itself takes care of ZooKeeper.
HBase Shell: HBase contains an interactive shell through which you can communicate with it.
General Commands:
• status - Provides the status of HBase, for example, the number of servers.
• version - Provides the version of HBase being used.
• table_help - Provides help for table-reference commands.
• whoami - Provides information about the user.
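For example, after starting the shell with the hbase shell command, these can be typed directly at the prompt (a quick sketch; the exact output depends on the HBase version and cluster):
status         # number of live and dead region servers and the average load
version        # the running HBase version string
whoami         # the current user and its authentication method
table_help     # help for table-reference commands (see the DDL and DML lists below)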
Data Definition Language:
These are the commands that operate on the tables in HBase.
• create - Creates a table.
• list - Lists all the tables in HBase.
• disable - Disables a table.
• is_disabled - Verifies whether a table is disabled.
• enable - Enables a table.
• is_enabled - Verifies whether a table is enabled.
• describe - Provides the description of a table.
• drop_all - Drops the tables matching the ‘regex’ given in the command.
• Java Admin API - In addition to the above shell commands, Java provides an Admin API to achieve DDL functionality through programming. HBaseAdmin (in the org.apache.hadoop.hbase.client package) and HTableDescriptor (in org.apache.hadoop.hbase) are two important classes that provide DDL functionality.
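A rough sketch of these commands in sequence from the shell (the emp table and its column families are example names only):
create 'emp', 'personal', 'professional'   # a table with two column families
list                                       # 'emp' now appears in the table listing
describe 'emp'                             # shows the column-family settings
disable 'emp'                              # a table must be disabled before it is dropped or altered
is_disabled 'emp'                          # => true
enable 'emp'
is_enabled 'emp'                           # => true
disable 'emp'
drop_all 'emp.*'                           # drops all disabled tables matching the regex (asks for confirmation)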
Data Manipulation Language:
• put - Puts a cell value at a specified column in a specified row in a particular table.
• get - Fetches the contents of a row or a cell.
• delete - Deletes a cell value in a table.
• deleteall - Deletes all the cells in a given row.
• scan - Scans and returns the table data.
• count - Counts and returns the number of rows in a table.
• truncate - Disables, drops, and recreates a specified table.
• Java client API - In addition to the above shell commands, Java provides a client API to achieve DML functionality, CRUD (Create, Retrieve, Update, Delete) operations, and more through programming, under the org.apache.hadoop.hbase.client package. HTable, Put, and Get are the important classes in this package.
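A minimal sketch of this client API, assuming an existing emp table with a personal column family (the same example names used elsewhere in these notes). It uses the older HTable-style API that matches the create-table example below; newer HBase releases use Connection, Table, and Put.addColumn instead:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
public class PutGetExample {
   public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      // Open a handle to the existing 'emp' table
      HTable table = new HTable(conf, "emp");
      // put: write one cell (row key, column family, qualifier, value)
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);
      // get: read the same cell back
      Get get = new Get(Bytes.toBytes("row1"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
      System.out.println("name = " + Bytes.toString(value));
      table.close();
   }
}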
Table Creation in HBase
Creating a Table using HBase Shell
The create command takes the table name followed by one or more column-family names:
create 'emp', 'personal', 'professional'
Creating a Table Using the Java API
The same table can be created programmatically: a table descriptor is built with HTableDescriptor and HColumnDescriptor, and admin.createTable(tableDescriptor); executes the creation, as shown below.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
public class CreateTable {
   public static void main(String[] args) throws IOException {
      // Instantiating the configuration class
      Configuration con = HBaseConfiguration.create();
      // Instantiating the HBaseAdmin class
      HBaseAdmin admin = new HBaseAdmin(con);
      // Instantiating the table descriptor class
      HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("emp"));
      // Adding column families to the table descriptor
      tableDescriptor.addFamily(new HColumnDescriptor("personal"));
      tableDescriptor.addFamily(new HColumnDescriptor("professional"));
      // Creating the table through admin
      admin.createTable(tableDescriptor);
      System.out.println("Table created");
      admin.close();
   }
}
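After the program runs, the new table can be checked from the HBase shell: the list command should show emp, and describe 'emp' should report the personal and professional column families.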