Hive Performance Practical Guide

November 2013


TABLE OF CONTENTS

Abstract
Challenge for developers
Test Environment
Prerequisite Knowledge
Performance Parameters
Query Optimization
Data Optimization
References
Author Info


Abstract
Hive is a data warehouse system and query language for Hadoop. It is
an essential tool in the Hadoop ecosystem that provides a SQL dialect
for querying data stored in the Hadoop Distributed File System
(HDFS), and it is best suited to batch processing.
Most data warehouse applications are implemented using relational
databases that use SQL as the query language. Hive lowers the
barrier for moving these applications to Hadoop.

Challenge for developers

Although Hive provides a SQL dialect for Hadoop, some aspects differ
from other SQL-based environments, and documentation for Hive users
and Hadoop developers has been sparse.
While working on a project for a client that provides media analysis,
where Hive was used to process complex media logs, I read blogs,
tutorials and web search results on optimizing Hive queries for
maximum performance. I found many options and recommendations
scattered all over, with many ifs and buts; some were useful, some
confusing and some baseless.
Here I try to put in one place the information and knowledge gained
from practical experience of improving Hive query performance. My
intent is also to highlight the good practices and parameter tuning
that should always be followed, and to explain clearly the points to
consider for the more complex tuning parameters. This document should
serve as a future reference for me, and for developers like me, when
working with Hive.

Test Environment
This section describes the environment in which the Hive tuning
parameters and query optimization techniques were tested. We were
lucky enough to test our solution on two different platforms.
Test Platform 1
Microsoft HDInsight 1.6 (beta) with a 40-node cluster, where each
node has 2 cores and 4 GB RAM. HDInsight 1.6 uses the Hortonworks
distribution.
Test Platform 2
Amazon EC2: 10-node cluster (Ubuntu 12.04 LTS 64-bit Server,
m1.large - 2 cores, 7.5 GB RAM), running Cloudera CDH4 (4.1.2) for
Hadoop and Hive, set up using Cloudera Manager.

Prerequisite Knowledge
1. Good understanding of the map-reduce framework (what the output
of a map task is, how map output is transferred to the reduce
task, and the significance of the Partitioner).
2. Understanding of the Hadoop distributed cache.
3. Understanding of Hadoop performance tuning.
4. Moderate understanding of compression codecs such as Snappy and
LZO, and of container formats such as sequence files.

Performance Parameters
I have grouped the parameters into two categories:

- DEFAULT PARAM: should be used as a matter of good practice,
  without any special consideration.
- TRICKY PARAM: should be used on a case-by-case basis, after
  analysing your requirements, data size, data categorization, data
  manipulation, data transfer behavior, query complexity and
  cluster size.

1. Map Join (DEFAULT PARAM)

"set hive.auto.convert.join = true"
Enabling map join can significantly decrease your Hive query
processing time; I have seen one query that took 25 minutes brought
down to 5 minutes. To enable map join, use the setting
"set hive.auto.convert.join = true" in your Hive shell or Hive
script. There is one catch with the map join: if every table in the
join exceeds the small-table size limit, the regular reduce-side join
is used; i.e., currently, the total size of one table participating
in the join should be less than or equal to 25 MB. 25 MB is a very
conservative number, and you can change it with
"set hive.smalltable.filesize = 33554432" (the property is named
hive.mapjoin.smalltable.filesize from Hive 0.8.1 onwards; please see
https://cwiki.apache.org/Hive/joinoptimization.html). You should
always use this setting if your query uses a join operation. It has
no negative impact, because Hive uses a map join only when the
condition is met and otherwise falls back to the regular reduce-side
join.
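
As a minimal sketch of the settings discussed above, a session that joins a large table to a small lookup table might look like the following. The tables media_log and state_dim and their columns are hypothetical and only illustrate the shape of the query.

```sql
-- Let Hive convert eligible joins to map joins automatically.
set hive.auto.convert.join=true;
-- Raise the small-table threshold to 32 MB (32 * 1024 * 1024 bytes).
-- Property name as used in this guide; later Hive releases call it
-- hive.mapjoin.smalltable.filesize.
set hive.smalltable.filesize=33554432;

-- Hypothetical tables: media_log is large, state_dim is a small lookup.
SELECT l.user_id, d.state_name
FROM media_log l
JOIN state_dim d ON (l.state_code = d.state_code);
```

If state_dim is larger than the threshold, the same query still runs, only as a regular reduce-side join.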


2. Bucketed Map Join (DEFAULT PARAM)

set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
This optimization is disabled by default, and I suggest enabling it
in your hive-site.xml like the other default optimization parameters.
It improves join performance by joining individual buckets between
tables in the map phase, because Hive does not need to fetch the
entire contents of one table to match against each bucket in the
other table. So, if the tables participating in the join are
bucketed on the join fields, it will definitely improve performance.
You will understand this better once you have read the Bucketing
section below. I also suggest reading the Map Join topic on page 284
of Hadoop: The Definitive Guide, 3rd Edition.
The hive.optimize.bucketmapjoin.sortedmerge setting takes advantage
of bucketed/clustered fields that are also sorted. (Sorting using
Cluster By, and the combination of Distribute By and Sort By, is
explained in the Sorting section below.)
(Please see also https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011join.pdf?version=1&modificationDate=1309986642000)
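
A hedged sketch of a bucketed map join follows. The tables media_log and user_profile are hypothetical; both are assumed to be declared CLUSTERED BY (user_id) SORTED BY (user_id) with the same number of buckets. The hive.input.format line is an assumption about what the sort-merge variant usually needs, so verify it against your Hive version.

```sql
-- Enable bucketed map join and its sort-merge variant.
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
-- Assumption: sort-merge bucket joins are usually run with this input format.
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Hypothetical tables, both bucketed and sorted on user_id.
-- Older Hive releases expect the MAPJOIN hint for bucketed map joins.
SELECT /*+ MAPJOIN(p) */ l.user_id, l.media_url, p.user_type
FROM media_log l
JOIN user_profile p ON (l.user_id = p.user_id);
```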

3. Intermediate Compression (DEFAULT PARAM)

set hive.exec.compress.intermediate = true
Intermediate compression shrinks the data shuffled between the map
and reduce tasks of a job, but you must select a codec that favours
low CPU cost over a higher compression ratio. Most Hive jobs are I/O
bound, and even for those that are CPU bound the right choice of
codec will usually still be beneficial. I highly recommend enabling
this property; the only exception is a development environment where
you generally test with small data sets, in which case you can skip
it.
Note: For a heavily CPU-bound job, disable this setting in that
job's script only if you measure better performance with it turned
off.
Tip: Use the Snappy compression codec. Some people prefer LZO because
it is splittable, but for intermediate compression we do not have to
worry about whether the codec supports splitting.
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
set hive.exec.compress.intermediate = true
For setting up Snappy - http://code.google.com/p/hadoop-snappy/
For using it with Hive - http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.3.0/CDH4-Installation-Guide/cdh4ig_topic_23_5.html

4. Local Mode (DEFAULT PARAM)

set hive.exec.mode.local.auto=true
This parameter should also be used in every Hive deployment. For a
very small data set, the overhead of launching MapReduce tasks
across the cluster takes up a significant share of the overall job
execution time. Hive can automatically use the lighter-weight local
mode to perform all the tasks for the job on a single machine, and
sometimes in the same process.
To achieve this, set the property below in your hive-site.xml:
<property>
<name>hive.exec.mode.local.auto</name>
<value>true</value>
<description>
Let hive determine whether to run in
local mode automatically
</description>
</property>
(Please refer to Programming Hive, page 135, for details.)

5. Strict Mode (TRICKY PARAM)

"set hive.mapred.mode = strict"
I have categorised this parameter as TRICKY for only one reason: it
should not be used in production, but only during the development
and testing phases. It helps you write Hive queries in an optimized
way that gives the best performance and prevents unintended and
undesirable results.
Using this property restricts three types of queries:

- Queries on partitioned tables are not permitted unless they
  include a partition filter in the WHERE clause. This is very
  useful, because if you do not use a partition filter on a
  partitioned table you forgo a huge performance boost. If you think
  there could be a use case where you do not want a partition
  filter, I suggest you rethink your partitioning plan.
- Queries that use an ORDER BY clause without a LIMIT clause are not
  permitted. ORDER BY itself should generally be avoided as far as
  possible, because it sends all results to a single reducer to
  perform the ordering. LIMIT prevents the reducer from running for
  an extended period of time.
  Note: This restriction is mainly useful in the development
  environment, and it is recommended that you turn this property off
  and remove the LIMIT from your query before you promote it to
  testing or production.
- Cartesian products are not permitted. This is very useful, as
  people coming from the RDBMS world may expect that a query which
  performs a JOIN with a WHERE clause instead of an ON clause will
  be optimized by the query planner, effectively converting the
  WHERE clause into an ON clause. Unfortunately, Hive does not
  perform this optimization, so a runaway query will occur if the
  tables are large.
  (Please read http://stackoverflow.com/questions/587965/what-is-runaway-query to understand runaway queries.)
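
The three restrictions can be illustrated with queries that strict mode accepts. This is only a sketch: it uses the page_view table defined in the Data Optimization section (partitioned by dt and country), while the user_info table, its columns and the literal values are hypothetical.

```sql
set hive.mapred.mode=strict;

-- 1. Partitioned table: the WHERE clause filters on partition columns.
SELECT page_url
FROM page_view
WHERE dt = '2013-11-01' AND country = 'US';

-- 2. ORDER BY is accompanied by LIMIT.
SELECT userid, viewTime
FROM page_view
WHERE dt = '2013-11-01'
ORDER BY viewTime DESC
LIMIT 100;

-- 3. The join condition is in an ON clause, not only in WHERE,
--    so the query is not treated as a Cartesian product.
SELECT pv.page_url, u.user_type
FROM page_view pv
JOIN user_info u ON (pv.userid = u.userid)
WHERE pv.dt = '2013-11-01';
```

Dropping the partition filter, the LIMIT, or the ON clause from the respective query should make strict mode reject it before any job is launched.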

6. Parallel Execution (TRICKY PARAM)

set hive.exec.parallel = true
Hive converts a query into one or more stages. A stage could be a
Map-Reduce stage, sampling stage, merge stage, limit stage, or any
other task Hive needs to do. By default, Hive executes these stages
one at a time. However, a particular job may consist of some stages
that do not depend on each other and could be executed in parallel,
possibly allowing the overall job to complete more quickly. (See
Programming Hive, page 136, for the hive-site.xml property settings.)
The reason for categorizing this as TRICKY is that it requires
observation in a shared cluster, because running more stages in
parallel increases cluster utilization. However, if you are sure
that you want to utilize the full bandwidth of the cluster, I
recommend that you enable parallel execution.
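
As a sketch of a query that can benefit, the two aggregations below do not depend on each other, so Hive may run their stages concurrently when parallel execution is on. The query uses the page_view table from the Data Optimization section; the thread-number value shown is Hive's default and is included only to show where the degree of parallelism is controlled.

```sql
set hive.exec.parallel=true;
-- Maximum number of stages to run at the same time (Hive default: 8).
set hive.exec.parallel.thread.number=8;

-- Each branch of the UNION ALL is an independent aggregation.
SELECT u.metric, u.cnt
FROM (
  SELECT 'distinct_urls' AS metric, count(DISTINCT page_url) AS cnt
  FROM page_view
  WHERE dt = '2013-11-01'
  UNION ALL
  SELECT 'distinct_users' AS metric, count(DISTINCT userid) AS cnt
  FROM page_view
  WHERE dt = '2013-11-01'
) u;
```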

7. JVM Reuse (TRICKY PARAM)

This is quite a tricky parameter that requires careful,
case-by-case observation of the following items.

- Map/reduce slots available on the tasktracker.
  There is no point in considering this parameter if you cannot
  have more than 2 map/reduce slots per node. Slots are generally
  decided based on the cores (or processors) and RAM available on
  the tasktracker node.
  Note: YARN (Hadoop 2) does not support JVM reuse.
- Execution time of the map/reduce tasks.
  It is useful to set this parameter, so that more than one mapper
  or reducer task can use the same JVM, if the per-task execution
  time is less than 40 seconds.
- Careful monitoring of the total job execution time before and
  after setting this parameter (keep a log of each execution to
  compare), and of the JVM heap.

Although this is a Hadoop tuning parameter, it is very relevant to
Hive performance, especially where small files are hard to avoid and
in scenarios with lots of tasks, most of which have short execution
times as defined above.
As this is a Hadoop tuning parameter I am not going to cover it in
detail, but as Hive depends on Hadoop, make sure your Hadoop cluster
is fine-tuned first.
(Please see Hadoop: The Definitive Guide, 3rd Edition, page 219, for
a better understanding of JVM reuse.)
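
The text above does not name the property. On MapReduce 1 clusters, such as the ones used for these tests, JVM reuse is controlled by mapred.job.reuse.jvm.num.tasks, which can be set from a Hive session. The value 10 below is only an illustrative starting point, not a recommendation drawn from the tests.

```sql
-- MapReduce 1 property. Hadoop's default is 1 (no reuse);
-- -1 lets a JVM be reused for an unlimited number of tasks of the same job.
set mapred.job.reuse.jvm.num.tasks=10;
```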

8. Mapper and Reducer Number (TRICKY PARAM)

Tuning the number of map-reduce tasks launched for your job can
play a significant role in performance improvement. This tricky
setting requires careful observation of the following items.

- Overall cluster map/reduce slot information.
  This is your cluster bandwidth; make sure to utilize it
  completely, and always try to avoid resource underutilization.
  Cluster Map Slots = max map tasks set per node * number of
  tasktracker nodes in the cluster.
  Cluster Reduce Slots = max reduce tasks set per node * number of
  tasktracker nodes in the cluster.
  The mapred-site.xml properties
  mapreduce.tasktracker.map.tasks.maximum and
  mapreduce.tasktracker.reduce.tasks.maximum can be used to set the
  slots available per node.
  Example: Suppose you have 4 cores and 6 GB RAM on a node, one core
  can handle 2 processes, and 1 map-reduce task needs 512 MB RAM.
  Reserve 1 core and 2 GB RAM for the datanode and tasktracker
  daemons (1 GB is set for each daemon by default) and 1 core and
  1 GB RAM for other system processes. The remaining 2 cores and
  3 GB RAM determine the ideal slot count for that node, i.e.,
  4 task slots (2 cores * 2 processes = 4; 4 * 512 MB = 2 GB, which
  fits within 3 GB).

- Average number of queries that can run in parallel in your
  cluster.
  This should be considered only for a shared cluster.
  If you assume that, in a shared cluster, the average number of
  queries launched simultaneously is 4, then the formula below can
  be used to determine the ideal maximum number of reduce tasks
  allowed for one job, so as to avoid an underutilized cluster and
  to give each job a fair execution time.
  Total cluster reduce slots * 1.5 / average number of queries
  running
  For example, with 12 cluster reduce slots: 12 * 1.5 / 4 = 4.5,
  rounded up to 5.

- Number of map-reduce tasks launched for each map-reduce stage.
  You can tweak the number of map-reduce tasks launched in order to
  achieve the best performance by utilizing maximum cluster
  resources.

Here I explain how to manipulate the number of map-reduce tasks for
a job while considering the above items.
Map Tasks
set mapred.max.split.size=<number>
The number of map tasks that can be launched simultaneously depends
on the tasktracker map slots available within the cluster. The
number of map tasks for one Hive execution stage is determined by
the input data, the input split size and the block size. For
example, suppose the data identified for a stage 1 query is 10 GB
and no input split size is explicitly defined; the block size (say
256 MB) then determines the total number of map tasks as 40 (i.e.,
10240/256 = 40). If a max input split size (64 MB) is also defined,
the minimum of the input split size and the block size is used, and
in that case there would be 160 map tasks (i.e., 10240/64 = 160).
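
The arithmetic above translates into a session setting as follows. This is a sketch under the same assumptions as the example (10 GB of input, 256 MB blocks); the actual task count also depends on the input format and file layout.

```sql
-- 10 GB of input with a 256 MB block size and no split limit:
--   10240 MB / 256 MB = 40 map tasks
-- Capping the split size at 64 MB (64 * 1024 * 1024 = 67108864 bytes):
--   10240 MB / 64 MB = 160 map tasks
set mapred.max.split.size=67108864;
```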
Tips:
- Never modify the block size; use the input split size to tweak
  the number of map tasks instead. Use the mapred.max.split.size
  property to achieve this.
- Always try to avoid the situation where, after most mappers and
  reducers have run, one or two tasks remain running all alone;
  increase or decrease the number of map-reduce tasks to prevent it.
- Try increasing the number of map tasks if slots are underutilized
  and monitor the result; if performance improves, add the
  parameter setting to your Hive job script.

I have seen a reduction of around 10 minutes in query execution
time when we increased the number of map tasks to utilize the full
cluster slots.
Reduce Tasks
set hive.exec.reducers.bytes.per.reducer=<number> - to change the
average load per reducer.
set hive.exec.reducers.max=<number> - to limit the maximum number
of reducers.
The number of reduce tasks that can be launched simultaneously
again depends on the reduce slots available in the cluster.
While monitoring your job, if you see that cluster slots are
underutilized, or that at the end very few reducers are running
alone, use the properties above to control the number of reducers
determined for your job. Also remember that in a shared cluster you
can set the maximum number of reducers (hive.exec.reducers.max) to
achieve optimal cluster bandwidth utilization.
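
A hedged sketch of how the two properties interact: Hive estimates the reducer count from the input size divided by the bytes-per-reducer value and then caps it at the maximum. The 10 GB input size and the cap of 5 reuse the figures from the examples above; both values are illustrative.

```sql
-- Target about 256 MB of input per reducer (256 * 1024 * 1024 bytes):
--   10240 MB / 256 MB = 40 reducers estimated for 10 GB of input
set hive.exec.reducers.bytes.per.reducer=268435456;
-- Cap each job at 5 reducers, the shared-cluster figure from
-- 12 * 1.5 / 4 = 4.5 rounded up, so this job runs min(40, 5) = 5.
set hive.exec.reducers.max=5;
```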
Tips:
- Always try to balance the number of map-reduce tasks against the
  total cluster slots; do not make your numbers too high or too low.
- In the urge to increase the number of map-reduce tasks, do not
  make your splits very small.
- Prepare a sheet and record the performance improvement per job
  when you tweak the number of map-reduce tasks, and put those
  settings in the job query itself.

Query Optimization
In this section we will see how to optimize Hive queries to get the
best performance.
1. Partitioning
Partitioning is not a new term, and it has quite obvious benefits. I
highly recommend that you analyse your input data to identify any
opportunity to partition it, in order to gain the following
benefits:

- Process only the relevant data, not the whole data set.
  Hive puts partitioned data into separate directories. When you use
  a partition filter in your WHERE clause, input data is picked from
  the specific directories instead of scanning the whole data set.
  This can significantly improve query performance, as less data to
  process means reduced query processing time.
- Local mode setting can shine.
  If your partitioned data is so small that it is completely
  available on one datanode, your local mode setting will shine.
- Map join can shine.
  If your partitioned data is small enough to fit in the size
  defined by hive.smalltable.filesize, your join operation will be
  optimized to use a map join.

(Please see page 58 of Programming Hive for more detail. You can
also refer to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
and http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/ )
Drawback: the main drawback of partitioning is fundamental: HDFS
was designed for many millions of large files, not for billions of
small files, so choose your partition columns carefully.
(Please see the Over Partitioning section of Programming Hive, page
122.)
Tips: In our use case, analysts wanted media logs to be analysed by
state, year, month, day, user type, etc. There were very few user
types (only 2), so partitioning on user type is of no use, and
partitioning on day would create a lot of small files, so we created
partitions on state, year and month. As user type was also a very
important filter for our use case, we used it for bucketing
(explained later).
- An ideal partition scheme should not result in too many
  partitions and directories, and the files in each directory
  should be large, some multiple of the filesystem block size.
- A good strategy for time-range partitioning, for example, is to
  determine the approximate size of your data accumulation over
  different granularities of time, and start with the granularity
  that results in modest growth in the number of partitions over
  time.
- Partition on the filters that define the top-level subsets of
  your data.
- It is a good idea to partition on columns that divide the overall
  data into sets that are analysed separately most of the time.
- Consider these columns for partitioning: region, country, state,
  IP address geolocation, department, etc.
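
The partitioning scheme from the use case above can be sketched as follows. The table name, column names and literal values are hypothetical; only the choice of partition columns (state, year, month) comes from the text.

```sql
-- Hypothetical media log table partitioned by state, year and month.
CREATE TABLE media_log (
  user_id   BIGINT,
  user_type STRING,
  media_url STRING,
  view_time INT
)
PARTITIONED BY (state STRING, log_year INT, log_month INT)
STORED AS SEQUENCEFILE;

-- The partition filter restricts the scan to a single directory,
-- e.g. .../media_log/state=CA/log_year=2013/log_month=11
SELECT user_type, count(*)
FROM media_log
WHERE state = 'CA' AND log_year = 2013 AND log_month = 11
GROUP BY user_type;
```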


2. Bucketing
Partitions offer a convenient way to segregate data and to optimize
queries. However, not all data sets lead to sensible partitioning,
especially given the concerns raised earlier about appropriate
sizing. Bucketing is another technique for decomposing data sets
into more manageable parts. Querying bucketed data can significantly
improve query performance. (Please see Programming Hive, page 125,
and
http://archive.cloudera.com/cdh/3/hive/language_manual/working_with_bucketed_tables.html
for usage information.)

- Processing relevant data.
  Hive puts bucketed data into separate files determined by a hash
  function. When you query bucketed data, input is picked from
  specific buckets rather than by scanning the entire data set.
- As described in the Partitioning section, the local mode and map
  join settings will also shine with bucketed queries.

Drawback: Bucketing has the same potential problem described in the
Partitioning section: it multiplies the number of files managed by
the namenode.
Tips: In the use case described in the Partitioning section, we
created buckets on user type (individual and household), because
most analysis is done for these two user types and partitioning was
not suitable for the user type column (as explained in the
Partitioning section).
- Remember that data is divided based on a hash function, unlike
  the simple value match used in partitioning.
- Know your hash function and figure out the specific bucket range
  required for your lookup query.
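
A minimal sketch of bucketing on user type follows. The tables media_log_bucketed and media_log and their columns are hypothetical. Which user type lands in which bucket depends on the hash of the value, so both values could in principle fall into the same bucket; check the bucket files after loading.

```sql
-- Needed on older Hive releases so that inserts honour the bucket definition.
set hive.enforce.bucketing=true;

-- Hypothetical table bucketed on user_type into 2 buckets.
CREATE TABLE media_log_bucketed (
  user_id   BIGINT,
  user_type STRING,
  media_url STRING
)
CLUSTERED BY (user_type) INTO 2 BUCKETS
STORED AS SEQUENCEFILE;

-- media_log is a hypothetical, unbucketed source table.
INSERT OVERWRITE TABLE media_log_bucketed
SELECT user_id, user_type, media_url
FROM media_log;

-- Read a single bucket instead of the whole table.
SELECT count(*)
FROM media_log_bucketed TABLESAMPLE(BUCKET 1 OUT OF 2 ON user_type);
```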

3. Indexing
The purpose of indexing in Hive is to improve the speed of query
lookups on certain columns of a table, a goal it shares with
partitioning and bucketing. Without an index, queries with a WHERE
clause such as WHERE col1=10 load the entire table or partition and
process all the rows. But if an index exists for col1, only a
portion of the file needs to be loaded and processed.
(Please see
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing,
https://cwiki.apache.org/confluence/display/Hive/IndexDev
and Programming Hive, page 117, for implementation details.)
Drawback: Hive has limited indexing capabilities, but you can
provide a custom implementation. There are no keys in the usual
RDBMS sense. The improvement in query speed that an index can
provide comes at the cost of additional processing to create the
index and disk space to store it.
Tip: Create indexes on column(s) with fewer distinct values.
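
As a hedged sketch, a compact index on a low-cardinality column can be created and built as shown below. The table media_log, its user_type column and the literal value are hypothetical. The statements apply to the Hive releases of this period; indexing was removed in Hive 3.0.

```sql
-- Compact index on a column with few distinct values; built separately.
CREATE INDEX media_log_user_type_idx
ON TABLE media_log (user_type)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Build (or later refresh) the index data.
ALTER INDEX media_log_user_type_idx ON media_log REBUILD;

-- Allow the optimizer to use the index when evaluating filters.
set hive.optimize.index.filter=true;

SELECT count(*)
FROM media_log
WHERE user_type = 'household';
```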

4. Distribute By & Cluster By

The Distribute By and Cluster By clauses distribute your map output
to reducers based on the columns defined with them. Think of them
as the MapReduce partitioner (hash partitioner). Used carefully,
these clauses can decrease your query processing time and give the
following benefits:

- They are very effective when you use the clustered fields in your
  Group By clause.
- As clustered fields ensure that the data blocks are organized
  based on the hash values, a hash join becomes far more efficient
  in disk I/O and network bandwidth, because it can operate on
  large co-located blocks of data, hence improving JOIN operations.
- Consider a case where your outer query depends on some inner
  query and you want to filter it further based on some column(s)
  from the inner query output; if those column(s) are clustered, it
  will optimize your query.

5. Sorting
Cluster By is a shortcut for Distribute By and Sort By on the same
column(s).
Sorting can be done with the Sort By clause, but Sort By alone
orders rows only within each reducer, so on its own it will not give
you the desired overall ordering (please see
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
for details).
I am not going to cover how sorting can be used in your HiveQL, but
I would like to highlight one performance improvement point. When
the column(s) for distribution and the column(s) for sorting in your
query are not exactly the same, prefer Distribute By together with
Sort By rather than Cluster By. Cluster By applies Distribute By and
then Sort By on all the column(s) defined with it, and it is quite
possible that you want to distribute and sort on different sets of
column(s).
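
The difference can be sketched with the page_view table from the Data Optimization section; the partition value used in the filter is illustrative.

```sql
-- CLUSTER BY userid is shorthand for DISTRIBUTE BY userid SORT BY userid.
SELECT userid, viewTime, page_url
FROM page_view
WHERE dt = '2013-11-01'
CLUSTER BY userid;

-- Different column sets: send each user's rows to one reducer,
-- then order them by time within that reducer.
SELECT userid, viewTime, page_url
FROM page_view
WHERE dt = '2013-11-01'
DISTRIBUTE BY userid
SORT BY userid, viewTime DESC;
```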


Data Optimization
Compression
Using Snappy as the default compression on my Hive data has always
produced good results for me. I also recommend that you use the
sequence file as your Hive table storage format.
Example:
CREATE TABLE page_view(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
Drawback: One of Hive's unique features is that it does not force
data to be converted to a specific format, and applying compression
to all of your data will restrict interoperability.
Tip: Use overall compression with Hive internal tables, and be
careful when applying compression to external tables.
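
A minimal sketch of writing Snappy-compressed data into the page_view table above. The staging table page_view_raw and the partition values are hypothetical; the property names are the MapReduce 1 names and may differ on newer Hadoop versions.

```sql
-- Compress final query output with Snappy, block-compressed,
-- which suits sequence files.
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapred.output.compression.type=BLOCK;

-- page_view_raw is a hypothetical uncompressed staging table.
INSERT OVERWRITE TABLE page_view PARTITION (dt='2013-11-01', country='US')
SELECT viewTime, userid, page_url, referrer_url, ip
FROM page_view_raw
WHERE dt = '2013-11-01' AND country = 'US';
```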


References
Join Optimization
https://cwiki.apache.org/Hive/joinoptimization.html
https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011join.pdf?version=1&modificationDate=1309986642000
Snappy Compression
http://code.google.com/p/hadoop-snappy/
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.3.0/CDH4-Installation-Guide/cdh4ig_topic_23_5.html
Runaway Query
http://stackoverflow.com/questions/587965/what-is-runaway-query
Hive Partitioning
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/
Hive Bucketing
http://archive.cloudera.com/cdh/3/hive/language_manual/working_with_bucketed_tables.html
Hive Indexing
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing
https://cwiki.apache.org/confluence/display/Hive/IndexDev
Hive Sorting
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

Programming Hive
by Edward Capriolo, Dean Wampler, and Jason Rutherglen
Chapter 10 Tuning; Over Partitioning - Page 122
Bucketing - Page 125, Indexing - Page 117

Hadoop: The Definitive Guide, Third Edition
by Tom White


Author Info
Sabir Hussain
Sr. Technical Architect, TFG-Analytics
Has 10+ years of experience in the architectural design,
development and implementation of multi-tier, web-based enterprise
applications using Java/J2EE. Has been involved for the past 2+
years in Big Data analytics using Hadoop, Hive, HBase, Storm,
Talend, Actuate BIRT, Tableau, Pentaho, MS-HDInsight and others.



