Hive Performance - Practical Guide
November 2013
2013, HCL Technologies. Reproduction Prohibited. This document is protected under Copyright by the Author, all rights reserved.
TABLE OF CONTENTS
Abstract
Challenge for developers
Test Environment
Prerequisite Knowledge
Performance Parameters
Query Optimization
Data Optimization
References
Author Info
Abstract
Hive is a data warehouse system and query language for Hadoop. It is
an essential tool in the Hadoop ecosystem, providing a SQL dialect
for querying data stored in the Hadoop Distributed File System
(HDFS), and it is well suited to batch processing.
Most data warehouse applications are implemented on relational
databases that use SQL as the query language. Hive lowers the
barrier to moving these applications to Hadoop.
Test Environment
This section describes the environments in which the Hive performance
tuning parameters and query optimization techniques were tested. We
were fortunate to be able to test on two different platforms.
Test Platform 1
Microsoft HDInsight 1.6 (beta) on a 40-node cluster, where each
node has 2 cores and 4 GB RAM. HDInsight 1.6 uses the Hortonworks
distribution.
Test Platform 2
Amazon EC2: 10-node cluster (Ubuntu 12.04 LTS 64-bit Server,
m1.large: 2 cores, 7.5 GB RAM)
Prerequisite Knowledge
1. Good understanding of the MapReduce framework (what the output of
a map task is, how map output is transferred to the reduce tasks,
and the role of the Partitioner).
2. Understanding of the Hadoop distributed cache.
3. Understanding of Hadoop performance tuning.
4. Moderate understanding of compression codecs such as Snappy and
LZO, and of the SequenceFile format.
Performance Parameters
I have grouped these parameters into two categories:
GB RAM can be used to determine ideal slot for that node, i.e.,
4 task slots (2 core * 2 processes = 4, 4*512 MB < 3 GB).
Never change the block size to tweak the number of map tasks; adjust
the input split size instead, using the mapred.max.split.size
property.
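As a minimal sketch of the split-size advice above, the statements
below could be issued in a Hive session before running a query. The
64 MB figure is an assumed value for illustration, not a tested
recommendation, and the effect depends on the input format in use.

```sql
-- Illustrative value only: cap each input split at 64 MB (in bytes).
-- A smaller cap yields more splits and therefore more map tasks.
SET mapred.max.split.size=67108864;

-- Print the value currently in effect for this session.
SET mapred.max.split.size;
```

Because the setting is applied per session, it can be tuned for an
individual query without touching the HDFS block size of the stored
data.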
Query Optimization
In this section we will see how to optimize Hive queries for the
best performance.
1. Partitioning
Partition directories.
Partitioning is not a new concept, and its benefits are well known. I
strongly recommend analysing your input data to identify any
opportunity to partition it, for the following benefit:
Hive puts partitioned data into separate directories. When you use a
partition filter in your WHERE clause, only the matching directories
are read instead of the whole data set. This can significantly
improve query performance, because less data to process means less
query processing time.
Partition on the columns whose filters select a top-level subset of
the data.
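The directory pruning described above can be sketched as follows. The
table name web_logs and its columns are hypothetical, chosen only for
illustration.

```sql
-- Hypothetical table, partitioned by date and country.
-- Each (dt, country) pair is stored in its own directory,
-- e.g. .../web_logs/dt=2013-11-01/country=US/
CREATE TABLE web_logs (
  user_id  BIGINT,
  page_url STRING
)
PARTITIONED BY (dt STRING, country STRING);

-- Both partition columns are filtered, so Hive only needs to read
-- the dt=2013-11-01/country=US directory rather than the whole table.
SELECT page_url, COUNT(*) AS views
FROM web_logs
WHERE dt = '2013-11-01' AND country = 'US'
GROUP BY page_url;
```

Filtering on a non-partition column alone (for example page_url)
would not prune any directories, which is why the choice of partition
columns should follow the filters your queries use most.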
2. Bucketing
Bucket files.
Partitions offer a convenient way to segregate data and to optimize
queries. However, not all data sets lead to sensible partitioning,
especially given the concerns raised earlier about appropriate sizing.
Bucketing is another technique for decomposing data sets into more
manageable parts. Queries that take advantage of bucketing can run
significantly faster. (See Programming Hive, page 125, and
http://archive.cloudera.com/cdh/3/hive/language_manual/working_with_bucketed_tables.html
for usage information.)
Understand your hash function and work out the specific bucket
range required for your lookup query.
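A minimal sketch of a bucketed table and a query that reads a single
bucket is shown below. The table names, column names and the bucket
count of 32 are assumptions for illustration; hive.enforce.bucketing
applies to Hive releases of this era.

```sql
-- Hypothetical table: rows are hashed on user_id into 32 bucket files.
CREATE TABLE user_events_bucketed (
  user_id    BIGINT,
  event_name STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Ask Hive to honour the declared bucket count when writing.
SET hive.enforce.bucketing=true;

-- user_events is an assumed, already populated source table.
INSERT OVERWRITE TABLE user_events_bucketed
SELECT user_id, event_name FROM user_events;

-- Read only the first of the 32 buckets instead of the whole table.
SELECT *
FROM user_events_bucketed
TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);
```

The sample clause uses the same column and bucket count as the table
definition, which is what allows Hive to pick a bucket file directly
rather than scanning every row.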
3. Indexing
The purpose of indexing in Hive is to speed up lookups on particular
columns of a table, which is the same goal as partitioning and
bucketing. Without an index, a query with a predicate such as
WHERE col1 = 10 loads the entire table or partition and processes
every row. If an index exists on col1, only a portion of the file
needs to be loaded and processed.
(See
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing,
https://cwiki.apache.org/confluence/display/Hive/IndexDev
and Programming Hive, page 117, for implementation details.)
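As a hedged sketch of the col1 lookup described above, an index could
be created and used roughly as follows on the Hive releases of this
era. The table name my_table and index name idx_col1 are hypothetical,
and whether the optimizer actually uses the index depends on the Hive
version and its settings.

```sql
-- Hypothetical table and index names.
CREATE INDEX idx_col1
ON TABLE my_table (col1)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- The index was created with a deferred rebuild, so build it now.
ALTER INDEX idx_col1 ON my_table REBUILD;

-- Allow the optimizer to use indexes automatically for filters.
SET hive.optimize.index.filter=true;

SELECT * FROM my_table WHERE col1 = 10;
```

The index has to be rebuilt after the underlying data changes, so it
suits tables that are loaded in batches better than tables that are
written continuously.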
5. Sorting
Cluster By is a shortcut for Distribute By and Sort By.
Sorting can be done with the Sort By clause, but Sort By alone will
not give a totally ordered result, because it sorts only within each
reducer (see
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
for details).
I am not going to cover how to use sorting in HiveQL, but I would
like to highlight one performance point. When the columns you
distribute on and the columns you sort on are not exactly the same,
prefer Distribute By together with Sort By over Cluster By. Cluster
By applies Distribute By and then Sort By to all the columns listed
with it, whereas you may well want to distribute and sort on
different sets of columns.
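The difference can be sketched with a hypothetical table
page_views(user_id, view_time, page_url); the names are assumptions
for illustration only.

```sql
-- All rows for a user go to the same reducer, and within each reducer
-- rows are ordered by user and then by most recent view first.
SELECT user_id, view_time, page_url
FROM page_views
DISTRIBUTE BY user_id
SORT BY user_id ASC, view_time DESC;

-- CLUSTER BY user_id is shorthand for
-- DISTRIBUTE BY user_id SORT BY user_id,
-- so it cannot express the secondary view_time ordering above.
SELECT user_id, view_time, page_url
FROM page_views
CLUSTER BY user_id;
```

In the first query the distribution key stays narrow (user_id only)
while the sort key carries the extra column, which is the case where
the explicit form is the better fit.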
Data Optimization
Compression
Using Snappy as the default compression for my Hive data has always
produced good results for me. I also recommend using SequenceFile as
the storage format for your Hive tables.
Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
Drawback: One of Hive's distinctive features is that it does not
force data to be converted to a specific format. Applying
compression to all of your data restricts interoperability with
other tools that read the same files.
Tip: Use compression across the board on Hive internal (managed)
tables, and be careful when applying compression to external tables.
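As a minimal sketch, Snappy compression could be switched on for a
Hive session with settings like the ones below before loading data
into a SequenceFile table. The property names are the classic
(Hadoop 1.x) ones, and the sketch assumes the Snappy codec is
installed on every node of the cluster.

```sql
-- Compress the final output of queries that write data.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- For SequenceFile output, compress blocks of records
-- rather than individual records.
SET mapred.output.compression.type=BLOCK;

-- Optionally compress intermediate data between MapReduce stages too.
SET hive.exec.compress.intermediate=true;
```

These settings affect only data written after they are applied;
existing files keep whatever compression they were written with.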
References
Join Optimization
https://cwiki.apache.org/Hive/joinoptimization.html
https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011join.pdf?version=1&modificationDate=1309986642000
Snappy Compression
http://code.google.com/p/hadoop-snappy/
http://www.cloudera.com/content/cloudera-content/clouderadocs/CDH4/4.3.0/CDH4-Installation-Guide/cdh4ig_topic_23_5.html
Runaway Query
http://stackoverflow.com/questions/587965/what-is-runaway-query
Hive Partitioning
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
http://www.brentozar.com/archive/2013/03/introduction-to-hivepartitioning/
Hive Bucketing
http://archive.cloudera.com/cdh/3/hive/language_manual/working_with_bucketed_tables.html
Hive Indexing
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing
https://cwiki.apache.org/confluence/display/Hive/IndexDev
Hive Sorting
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
Programming Hive
By Edward Capriolo, Dean Wampler, and Jason Rutherglen
Chapter 10 Tuning, Over Partitioning - Page 122
Bucketing - Page 125, Indexing - Page 117
Author Info
Sabir Hussain
Sr. Technical Architect TFG-Analytics
Has 10+ years of experience in the architectural design,
development and implementation of multi-tier, web-based
enterprise applications using Java/J2EE.
Has spent the past 2+ years working on Big Data analytics
using Hadoop, Hive, HBase, Storm, Talend, Actuate BIRT,
Tableau, Pentaho, MS-HDInsight and others.
About HCL
About HCL Technologies
HCL Technologies is a leading global IT services company, working
with clients in the areas that impact and redefine the core of their
businesses. Since its inception into the global landscape after its IPO in
1999, HCL has focused on transformational outsourcing, underlined by
innovation and value creation, and offers an integrated portfolio of
services including software-led IT solutions, remote infrastructure
management, engineering and R&D services and BPO. HCL leverages its extensive
global offshore infrastructure and network of offices in 26 countries to
provide holistic, multi-service delivery in key industry verticals including
Financial Services, Manufacturing, Consumer Services, Public Services
and Healthcare. HCL takes pride in its philosophy of 'Employees First,
Customers Second', which empowers our 85,335 transformers to create
real value for customers. HCL Technologies, along with its
subsidiaries, has reported consolidated revenues of US$ 4.3 billion (Rs.
22417 crores), as on TTM ended Sep 30 '12.
For more information, please visit www.hcltech.com
About HCL Enterprise
HCL is a $6.2 billion leading global technology and IT enterprise
comprising two companies listed in India - HCL Technologies and HCL
Infosystems. Founded in 1976, HCL is one of India's original IT garage
start-ups. A pioneer of modern computing, HCL is a global
transformational enterprise today. Its range of offerings includes
product engineering, custom & package applications, BPO, IT
infrastructure services, IT hardware, systems integration, and
distribution of information and communications technology (ICT)
products across a wide range of focused industry verticals. The HCL
team consists of over 90,000 professionals of diverse nationalities, who
operate from 31 countries including over 500 points of presence in
India. HCL has partnerships with several leading global 1000 firms,
including leading IT and technology firms.
For more information, please visit www.hcl.com