Hadoop, HBase, and Hive

This document discusses Hive/HBase integration, which allows Hive to access data stored in HBase tables. It describes three main use cases: 1) using HBase as an ETL target from Hive queries, 2) querying HBase tables from Hive, and 3) using HBase for low-latency queries on a data warehouse. Key aspects covered include the storage handler, loading data via Hive INSERT statements, query processing, and bulk loading into HBase. The document concludes with questions about the Hive/HBase integration.


Hive/HBase Integration

or, MaybeSQL?
April 2010
John Sichi
Facebook

Agenda
Use Cases
Architecture
Storage Handler
Load via INSERT
Query Processing
Bulk Load
Q & A

Motivations
Data, data, and more data
200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
About 8x increase per year
Queries, queries, and more queries
More than 200 unique users querying per day
7500+ queries on production cluster per day; a mixture of ad-hoc queries and ETL/reporting queries
They want it all and they want it now
Users expect faster response time on fresher data
Sampled subsets aren't always good enough

How Can HBase Help?
Replicate dimension tables from transactional databases with low latency and without sharding
(Fact data can stay in Hive since it is append-only)
Only move changed rows
Full scrape is too slow and doesn't scale as data keeps growing
Hive by itself is not good at row-level operations
Integrate into Hive's map/reduce query execution plans for full parallel distributed processing
Multiversioning for snapshot consistency?

Use Case 1: HBase As ETL Data Target

[Diagram: Hive INSERT ... SELECT moves data from source Files/Tables into HBase]

Use Case 2: HBase As Data Source

[Diagram: Hive SELECT with JOIN / GROUP BY reads from HBase plus other Files/Tables and produces a Query Result]

Use Case 3: Low Latency Warehouse

[Diagram: Continuous Update flows into HBase, Periodic Load into other Files/Tables; Hive Queries read across both]

HBase Architecture
[Diagram from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]

Hive Architecture

All Together Now!

Hive CLI With HBase
Minimum configuration needed:

hive \
--auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar \
-hiveconf hbase.zookeeper.quorum=zk1,zk2

hive> create table ...

Storage Handler

CREATE TABLE users(
userid int, name string, email string, notes string)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
"small:name,small:email,large:notes")
TBLPROPERTIES (
"hbase.table.name" = "user_list"
);

Column Mapping
First column in table is always the row key
Other columns can be mapped to either:
An HBase column (any Hive type)
An HBase column family (must be MAP type in Hive; see the sketch below)
Multiple Hive columns can map to the same HBase column or family
Limitations
Currently no control over type mapping (always string in HBase)
Currently no way to map HBase timestamp attribute
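A family-level mapping looks like this; a minimal sketch, assuming a hypothetical user_notes table whose entire large family becomes a Hive MAP (the first column is still the implicit row key):

CREATE TABLE user_notes(
userid int, notes map<string, string>)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "large:")
TBLPROPERTIES (
"hbase.table.name" = "user_notes"
);

Each qualifier in the large family then shows up as a key of the notes map.
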
Load Via INSERT
INSERT OVERWRITE TABLE users
SELECT ... FROM ...;
Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
Multiple rows with same key -> only one row written
Limitations
No write atomicity yet
No way to delete rows
Write parallelism is query-dependent (map vs reduce)
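As a concrete sketch (staging_users is a hypothetical source table with matching columns):

INSERT OVERWRITE TABLE users
SELECT userid, name, email, notes
FROM staging_users;

Each selected row becomes a BatchUpdate keyed on userid, so duplicate keys collapse into a single HBase row.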

Map-Reduce Job for INSERT

[Diagram from http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png: map and reduce stages writing into HBase]

Map-Only Job for INSERT

[Diagram: map tasks write directly into HBase, with no reduce stage]

Query Processing
SELECT name, notes FROM users WHERE userid=xyz;
Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
HBase determines the splits (one per table region)
HBaseSerDe produces lazy rows/maps for RowResults
Column selection is pushed down
Any SQL can be used (join, aggregation, union); see the sketch below
Limitations
Currently no filter pushdown
How do we achieve locality?
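Since the HBase-backed table acts like any other Hive table, joins and aggregations work unchanged; a sketch joining users against a hypothetical pageviews Hive table:

SELECT u.name, count(1) AS views
FROM users u
JOIN pageviews p ON (u.userid = p.userid)
GROUP BY u.name;

Only the scan of users goes through TableInputFormatBase; the join and aggregation run as ordinary map/reduce stages.
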
Metastore Integration
DDL can be used to create metadata in Hive and HBase simultaneously and consistently
CREATE EXTERNAL TABLE: register existing HBase table (see the sketch below)
DROP TABLE: will drop HBase table too unless it was created as EXTERNAL
Limitations
No two-phase-commit for DDL operations
ALTER TABLE is not yet implemented
Partitioning is not yet defined
No secondary indexing
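Registering a table that already exists in HBase reuses the storage handler DDL with EXTERNAL added; a sketch assuming the user_list table from earlier (users_ext is an illustrative name):

CREATE EXTERNAL TABLE users_ext(
userid int, name string, email string, notes string)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES (
"hbase.table.name" = "user_list"
);

DROP TABLE users_ext then removes only the Hive metadata; the underlying HBase table is left alone.
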
Bulk Load
Ideally:
SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE users SELECT ...;
But for now, you have to do some work and issue multiple Hive commands:
1. Sample source data for range partitioning
2. Save sampling results to a file (see the sketch below)
3. Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files)
4. Import HFiles into HBase
5. HBase can merge files if necessary
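For step 2, TotalOrderPartitioner expects the boundary keys in a sequence file; a sketch along the lines of the HBaseBulkLoad wiki linked at the end (table name and paths are illustrative, chosen to match the total.order.partitioner.path setting two slides down):

create external table hb_range_keys(user_id_range_start string)
row format serde
'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
stored as
inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat
'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
location '/tmp/hb_range_keys';

insert overwrite table hb_range_keys
select user_id from ...;  -- the boundary keys from the sampling query

dfs -cp /tmp/hb_range_keys/* /tmp/hb_range_key_list;
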
Range Partitioning During Sort

[Diagram: TotalOrderPartitioner splits the sorted data into ranges A-G, H-Q, and R-Z at boundary keys (H) and (R); the resulting HFiles are loaded into HBase via loadtable.rb]

Sampling Query For Range Partitioning
Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.

select user_id from
(select user_id
from hive_user_table
tablesample(bucket 1 out of 1000 on user_id) s
order by user_id) sorted_user_5k_sample
where (row_sequence() % 501) = 0;
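Note that row_sequence() is not a Hive built-in; it ships in hive-contrib and has to be registered first (the jar path is illustrative):

add jar /path/to/hive-contrib.jar;
create temporary function row_sequence as
'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
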
Sorting Query For Bulk Load
set mapred.reduce.tasks=12;
set hive.mapred.partitioner=
org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/hb_range_key_list;
set hfile.compression=gz;

create table hbsort(user_id string, user_type string, ...)
stored as
inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');

insert overwrite table hbsort
select user_id, user_type, createtime, ...
from hive_user_table
cluster by user_id;
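Step 4 then hands the generated HFiles to HBase; a sketch using the loadtable.rb script that shipped with HBase 0.20 (arguments follow the table name and hfile.family.path above):

hbase org.jruby.Main loadtable.rb user_list /tmp/hbsort/cf
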
Deployment
Latest Hive trunk (will be in Hive 0.6.0)
Requires Hadoop 0.20+
Tested with HBase 0.20.3 and ZooKeeper 3.2.2
20-node hbtest cluster at Facebook
No performance numbers yet
Currently setting up tests with about 6 TB (gz compressed)

Questions?
[email protected]
[email protected]
http://wiki.apache.org/hadoop/Hive/HBaseIntegration
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

Special thanks to Samuel Guo for the early versions of the integration code

Hey, What About HBQL?
HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations
HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs
