Hadoop, HBase, and Hive

This document discusses Hive/HBase integration, which allows Hive to access data stored in HBase tables. It describes three main use cases: 1) using HBase as an ETL target from Hive queries, 2) querying HBase tables from Hive, and 3) using HBase for low-latency queries on a data warehouse. Key aspects covered include the storage handler, loading data via Hive INSERT statements, query processing, and bulk loading into HBase. The document concludes with questions about the Hive/HBase integration.


Hive/HBase Integration

or, MaybeSQL?
April 2010
John Sichi
Facebook

Agenda
Use Cases
Architecture
Storage Handler
Load via INSERT
Query Processing
Bulk Load
Q & A

Motivations
Data, data, and more data
200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
About 8x increase per year
Queries, queries, and more queries
More than 200 unique users querying per day
7500+ queries on production cluster per day; a mixture of ad-hoc queries and ETL/reporting queries
They want it all and they want it now
Users expect faster response time on fresher data
Sampled subsets aren't always good enough

How Can HBase Help?
Replicate dimension tables from transactional databases with low latency and without sharding
(Fact data can stay in Hive since it is append-only)
Only move changed rows
Full scrape is too slow and doesn't scale as data keeps growing
Hive by itself is not good at row-level operations
Integrate into Hive's map/reduce query execution plans for full parallel distributed processing
Multiversioning for snapshot consistency?

Use Case 1: HBase As ETL Data Target

[Diagram: Hive INSERT ... SELECT moves data from source Files/Tables into HBase]

Use Case 2: HBase As Data Source

[Diagram: Hive SELECT with JOIN / GROUP BY reads from HBase plus other Files/Tables and produces a Query Result]

Use Case 3: Low Latency Warehouse

[Diagram: Continuous Update flows into HBase, Periodic Load into other Files/Tables; Hive Queries read across both]

HBase Architecture
[Diagram from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]

Hive Architecture

All Together Now!

Hive CLI With HBase
Minimum configuration needed:

hive \
--auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar \
-hiveconf hbase.zookeeper.quorum=zk1,zk2

hive> create table ...

Storage Handler

CREATE TABLE users(
userid int, name string, email string, notes string)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
"small:name,small:email,large:notes")
TBLPROPERTIES (
"hbase.table.name" = "user_list"
);

Column Mapping
First column in table is always the row key
Other columns can be mapped to either:
An HBase column (any Hive type)
An HBase column family (must be MAP type in Hive; see the sketch below)
Multiple Hive columns can map to the same HBase column or family
Limitations
Currently no control over type mapping (always string in HBase)
Currently no way to map HBase timestamp attribute
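A family-level mapping looks like this; a minimal sketch, assuming a hypothetical user_notes table whose entire large family becomes a Hive MAP (the first column is still the implicit row key):

CREATE TABLE user_notes(
userid int, notes map<string, string>)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "large:")
TBLPROPERTIES (
"hbase.table.name" = "user_notes"
);

Each qualifier in the large family then shows up as a key of the notes map.
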
Load Via INSERT
INSERT OVERWRITE TABLE users
SELECT ... FROM ...;
Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
Multiple rows with same key -> only one row written
Limitations
No write atomicity yet
No way to delete rows
Write parallelism is query-dependent (map vs reduce)
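As a concrete sketch (staging_users is a hypothetical source table with matching columns):

INSERT OVERWRITE TABLE users
SELECT userid, name, email, notes
FROM staging_users;

Each selected row becomes a BatchUpdate keyed on userid, so duplicate keys collapse into a single HBase row.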

Map-Reduce Job for INSERT

[Diagram from http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png: map and reduce stages writing into HBase]

Map-Only Job for INSERT

[Diagram: map tasks write directly into HBase, with no reduce stage]

Query Processing
SELECT name, notes FROM users WHERE userid=xyz;
Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
HBase determines the splits (one per table region)
HBaseSerDe produces lazy rows/maps for RowResults
Column selection is pushed down
Any SQL can be used (join, aggregation, union); see the sketch below
Limitations
Currently no filter pushdown
How do we achieve locality?
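Since the HBase-backed table acts like any other Hive table, joins and aggregations work unchanged; a sketch joining users against a hypothetical pageviews Hive table:

SELECT u.name, count(1) AS views
FROM users u
JOIN pageviews p ON (u.userid = p.userid)
GROUP BY u.name;

Only the scan of users goes through TableInputFormatBase; the join and aggregation run as ordinary map/reduce stages.
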
Metastore Integration
DDL can be used to create metadata in Hive and HBase simultaneously and consistently
CREATE EXTERNAL TABLE: register existing HBase table (see the sketch below)
DROP TABLE: will drop HBase table too unless it was created as EXTERNAL
Limitations
No two-phase-commit for DDL operations
ALTER TABLE is not yet implemented
Partitioning is not yet defined
No secondary indexing
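Registering a table that already exists in HBase reuses the storage handler DDL with EXTERNAL added; a sketch assuming the user_list table from earlier (users_ext is an illustrative name):

CREATE EXTERNAL TABLE users_ext(
userid int, name string, email string, notes string)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES (
"hbase.table.name" = "user_list"
);

DROP TABLE users_ext then removes only the Hive metadata; the underlying HBase table is left alone.
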
Bulk Load
Ideally:
SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE users SELECT ...;
But for now, you have to do some work and issue multiple Hive commands:
1. Sample source data for range partitioning
2. Save sampling results to a file (see the sketch below)
3. Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files)
4. Import HFiles into HBase
5. HBase can merge files if necessary
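For step 2, TotalOrderPartitioner expects the boundary keys in a sequence file; a sketch along the lines of the HBaseBulkLoad wiki linked at the end (table name and paths are illustrative, chosen to match the total.order.partitioner.path setting two slides down):

create external table hb_range_keys(user_id_range_start string)
row format serde
'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
stored as
inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat
'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
location '/tmp/hb_range_keys';

insert overwrite table hb_range_keys
select user_id from ...;  -- the boundary keys from the sampling query

dfs -cp /tmp/hb_range_keys/* /tmp/hb_range_key_list;
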
Range Partitioning During Sort

[Diagram: TotalOrderPartitioner splits the sorted data into ranges A-G, H-Q, and R-Z at boundary keys (H) and (R); the resulting HFiles are loaded into HBase via loadtable.rb]

Sampling Query For Range Partitioning
Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.

select user_id from
(select user_id
from hive_user_table
tablesample(bucket 1 out of 1000 on user_id) s
order by user_id) sorted_user_5k_sample
where (row_sequence() % 501) = 0;
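Note that row_sequence() is not a Hive built-in; it ships in hive-contrib and has to be registered first (the jar path is illustrative):

add jar /path/to/hive-contrib.jar;
create temporary function row_sequence as
'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
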
Sorting Query For Bulk Load
set mapred.reduce.tasks=12;
set hive.mapred.partitioner=
org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/hb_range_key_list;
set hfile.compression=gz;

create table hbsort(user_id string, user_type string, ...)
stored as
inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');

insert overwrite table hbsort
select user_id, user_type, createtime, ...
from hive_user_table
cluster by user_id;
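Step 4 then hands the generated HFiles to HBase; a sketch using the loadtable.rb script that shipped with HBase 0.20 (arguments follow the table name and hfile.family.path above):

hbase org.jruby.Main loadtable.rb user_list /tmp/hbsort/cf
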
Deployment
Latest Hive trunk (will be in Hive 0.6.0)
Requires Hadoop 0.20+
Tested with HBase 0.20.3 and ZooKeeper 3.2.2
20-node hbtest cluster at Facebook
No performance numbers yet
Currently setting up tests with about 6 TB (gz compressed)

Questions?
[email protected]
[email protected]
http://wiki.apache.org/hadoop/Hive/HBaseIntegration
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

Special thanks to Samuel Guo for the early versions of the integration code

Hey, What About HBQL?
HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations
HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs
