10 HBase
Limitations of Hadoop
What is HBase
HBase vs HDFS
Storage Mechanism in HBase
Disclaimer: Content present in this PPT is © Copyright IBM Corp.
Limitations of Hadoop
• Hadoop can perform batch processing, and data will be accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs.
• A huge dataset, when processed, results in another huge dataset, which should also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access).
What is HBase
• Since the 1970s, the RDBMS has been the solution for data storage and maintenance related problems. After the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop.
• Hadoop uses a distributed file system for storing big data, and MapReduce to process it. Hadoop excels in storing and processing huge volumes of data in various formats, such as semi-structured or even unstructured data.
• HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
• HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
• It is a part of the Hadoop ecosystem that provides random, real-time read/write access to data in the Hadoop file system.
• One can store data in HDFS either directly or through HBase. A data consumer reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop file system and provides read and write access.
HBase vs HDFS
HDFS:
• HDFS is a distributed file system suitable for storing large files.
• HDFS does not support fast individual record lookups.
• It provides high-latency batch processing.
• It provides only sequential access to data.
HBase:
• HBase is a database built on top of the Hadoop file system.
• HBase provides fast lookups for larger tables.
• There is no concept of batch processing; it provides low-latency access to single rows from billions of records (random access).
• HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
(A shell sketch contrasting these access patterns follows this comparison.)
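To make the random-vs-sequential distinction concrete, here is a minimal HBase shell sketch, assuming the 'emp' table that is created later in this deck: a get fetches one row directly by its row key, while a scan walks the table row by row.
hbase(main):001:0> get 'emp', '1'    # random access: fetch one row by row key
hbase(main):002:0> scan 'emp'        # sequential access: iterate over the whole table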
Storage Mechanism in HBase
• HBase is a column-oriented database, and the tables in it are sorted by row. The table schema defines only column families, which are the key-value pairs. A table can have multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on the disk. Each cell value of the table has a timestamp. In short, in an HBase table (see the shell sketch after this list):
• A table is a collection of rows.
• A row is a collection of column families.
• A column family is a collection of columns.
• A column is a collection of key-value pairs.
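For example, a cell is addressed by table, row key, 'column family:column qualifier', and timestamp. A minimal HBase shell sketch, reusing the 'emp' table and the 'personal data' and 'professional data' column families from the command slide later in this deck, with illustrative values:
hbase(main):001:0> create 'emp', 'personal data', 'professional data'
hbase(main):002:0> put 'emp', '1', 'personal data:name', 'Ravi'
hbase(main):003:0> put 'emp', '1', 'professional data:designation', 'Manager'
hbase(main):004:0> get 'emp', '1', {COLUMN => 'personal data:name'}
Here 'emp' is the table, '1' is the row key, 'personal data' and 'professional data' are column families, 'name' and 'designation' are columns (qualifiers), and HBase stamps each stored value with a timestamp automatically.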
Column Oriented and Row Oriented
• Column-oriented databases are those that store data tables as sections of columns of data, rather than as rows of data. In short, they will have column families.
• A row-oriented database is suitable for Online Transaction Processing (OLTP). Such databases are designed for a small number of rows and columns.
• Column-oriented databases are designed for huge tables.
Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It provides data replication across clusters.
HBase vs RDBMS
HBase:
• HBase is schema-less; it doesn't have the concept of a fixed-column schema and defines only column families (see the shell sketch after this comparison).
• It is built for wide tables. HBase is horizontally scalable.
• No transactions are there in HBase.
• It has de-normalized data.
• It is good for semi-structured as well as structured data.
RDBMS:
• An RDBMS is governed by its schema, which describes the whole structure of the tables.
• It is thin and built for small tables. Hard to scale.
• An RDBMS is transactional.
• It will have normalized data.
• It is good for structured data.
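Because only column families are declared at table-creation time, a new column can be introduced simply by writing to a new qualifier; there is no ALTER TABLE-style schema change. A sketch in the HBase shell, reusing the 'emp' table with illustrative values:
hbase(main):001:0> put 'emp', '1', 'personal data:city', 'Pune'
hbase(main):002:0> put 'emp', '2', 'personal data:name', 'Meena'
Only the column families ('personal data', 'professional data') were defined when the table was created; the 'city' column comes into existence the moment a value is written to it.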
HBase Architecture and its Important Components
HBase has three major components: the client library, a master server, and region servers. Region servers can be added or removed as per requirement.
Master Server
The master server:
• Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
• Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations such as creation of tables and column families.
(A shell sketch of master-related commands follows this list.)
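As a sketch, the cluster state that the master manages can be inspected, and the region balancer triggered, from the standard HBase shell (output omitted):
hbase(main):001:0> status 'detailed'   # per-server and per-region cluster state
hbase(main):002:0> balancer            # ask the master to rebalance regions across region servers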
Regions
• Regions are nothing but tables that are split up and spread across the region servers.
• The region servers have regions that:
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under them.
• Decide the size of the regions by following the region size thresholds.
• The store contains the memstore and HFiles. The memstore is just like a cache memory: anything that is entered into HBase is stored here initially. Later, the data is transferred and saved in HFiles as blocks, and the memstore is flushed. (A shell sketch of flushing and splitting follows this list.)
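A memstore flush and a region split can also be requested manually from the standard HBase shell; a sketch, reusing the 'emp' table:
hbase(main):001:0> flush 'emp'   # write the table's memstore contents out to HFiles
hbase(main):002:0> split 'emp'   # ask the region server to split the table's region(s)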
ZooKeeper
• ZooKeeper is an open-source project that provides services like maintaining configuration information, naming, providing distributed synchronization, etc.
• ZooKeeper has ephemeral nodes representing the different region servers. Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures or network partitions.
• Clients communicate with region servers via ZooKeeper.
• In pseudo-distributed and standalone modes, HBase itself will take care of ZooKeeper. (A shell sketch for inspecting ZooKeeper follows this list.)
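The ZooKeeper state that HBase maintains (active master, registered region servers, and so on) can be inspected from the standard HBase shell; a sketch, output omitted:
hbase(main):001:0> zk_dump   # dump HBase's view of the ZooKeeper quorum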
Commands
Command to create a table in HBase
hbase(main):001:0> create 'emp', 'personal data', 'professional data'
General Commands
• status - Provides the status of HBase, for example, the number of servers.
• version - Provides the version of HBase being used.
• table_help - Provides help for table-reference commands.
• whoami - Provides information about the user.
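For instance, these general commands take no arguments in the shell (a sketch; output omitted):
hbase(main):002:0> status
hbase(main):003:0> version
hbase(main):004:0> whoami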
Data Definition Language Commands
These are the commands that operate on the tables in HBase.
• create - Creates a table.
• list - Lists all the tables in HBase.
• disable - Disables a table.
• is_disabled - Verifies whether a table is disabled.
• enable - Enables a table.
• is_enabled - Verifies whether a table is enabled.
• describe - Provides the description of a table.
• alter - Alters a table.
• exists - Verifies whether a table exists.
• drop - Drops a table from HBase.
• drop_all - Drops the tables matching the 'regex' given in the command.
Data Manipulation Language Commands
• put - Puts a cell value at a specified column in a specified row in a particular table.
• get - Fetches the contents of a row or a cell.
• delete - Deletes a cell value in a table.
• deleteall - Deletes all the cells in a given row.
• scan - Scans and returns the table data.
• count - Counts and returns the number of rows in a table.
• truncate - Disables, drops, and recreates a specified table.
(A shell sketch walking through these commands follows this list.)
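A short end-to-end sketch of these commands in the standard HBase shell, reusing the 'emp' table from the earlier slide with illustrative values:
hbase(main):001:0> put 'emp', '1', 'personal data:name', 'Ravi'   # write a cell
hbase(main):002:0> get 'emp', '1'                                  # read one row by row key
hbase(main):003:0> scan 'emp'                                      # read all rows
hbase(main):004:0> count 'emp'                                     # number of rows in the table
hbase(main):005:0> delete 'emp', '1', 'personal data:name'         # delete a cell
hbase(main):006:0> disable 'emp'                                   # a table must be disabled before dropping
hbase(main):007:0> drop 'emp'                                      # drop the table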
THANK YOU