…execution time, number of output rows, etc.

• The Compiler is invoked by the driver upon receiving a HiveQL statement. The compiler translates this statement into a plan which consists of a DAG of map-reduce jobs. For more details see Section 3.2.

• The driver submits the individual map-reduce jobs from the DAG to the Execution Engine in a topological order. Hive currently uses Hadoop as its execution engine.

We next describe the metastore and compiler in detail.

3.1 Metastore
The metastore is the system catalog which contains metadata about the tables stored in Hive. This metadata is specified during table creation and reused every time the table is referenced in HiveQL. The metastore distinguishes Hive as a traditional warehousing solution (a la Oracle or DB2) when compared with similar data processing systems built on top of map-reduce-like architectures, such as Pig [7] and Scope [2]. The metastore contains the following objects:

3.2 Compiler
The driver invokes the compiler with the HiveQL string, which can be one of DDL, DML or query statements. The compiler converts the string to a plan. The plan consists only of metadata operations in the case of DDL statements, and of HDFS operations in the case of LOAD statements. For insert statements and queries, the plan consists of a directed acyclic graph (DAG) of map-reduce jobs.

• The Parser transforms a query string to a parse tree representation.

• The Semantic Analyzer transforms the parse tree to a block-based internal query representation. It retrieves schema information of the input tables from the metastore. Using this information it verifies column names, expands select * and does type-checking, including the addition of implicit type conversions.

• The Logical Plan Generator converts the internal query representation to a logical plan, which consists of a tree of logical operators.

• The Optimizer performs multiple passes over the logical plan and rewrites it in several ways:
  • Combines multiple joins which share the join key into a single multi-way join, and hence a single map-reduce job.
  • Adds repartition operators (also known as ReduceSinkOperator) for join, group-by and custom map-reduce operators. These repartition operators mark the boundary between the map phase and the reduce phase during physical plan generation.
  • Prunes columns early and pushes predicates closer to the table scan operators in order to minimize the amount of data transferred between operators.
  • In the case of partitioned tables, prunes partitions that are not needed by the query.
  • In the case of sampling queries, prunes buckets that are not needed.

  Users can also provide hints to the optimizer to
  • add partial aggregation operators to handle large-cardinality grouped aggregations,
  • add repartition operators to handle skew in grouped aggregations,
  • perform joins in the map phase instead of the reduce phase.

• The Physical Plan Generator converts the logical plan into a physical plan, consisting of a DAG of map-reduce jobs. It creates a new map-reduce job for each of the marker operators – repartition and union all – in the logical plan. It then assigns the portions of the logical plan enclosed between the markers to the mappers and reducers of the map-reduce jobs.

Figure 2: Query plan with 3 map-reduce jobs for a multi-table insert query

In Figure 2, we illustrate the plan of the multi-table insert query in Section 2.3. The nodes in the plan are physical operators and the edges represent the flow of data between operators. The last line in each node represents the output schema of that operator. For lack of space, we do not describe the parameters specified within each operator node. The plan has three map-reduce jobs. Within the same map-reduce job, the portion of the operator tree below the repartition operator (ReduceSinkOperator) is executed by the mapper and the portion above by the reducer. The repartitioning itself is performed by the execution engine.

Notice that the first map-reduce job writes two temporary files to HDFS, tmp1 and tmp2, which are consumed by the second and third map-reduce jobs respectively. Thus, the second and third map-reduce jobs wait for the first map-reduce job to finish.

4. DEMONSTRATION DESCRIPTION
The demonstration consists of the following:

• Functionality — We demonstrate HiveQL constructs via the StatusMeme application described in Section 2.3. We expand the application to include queries which use more HiveQL constructs and showcase the rule-based optimizer.

• Tuning — We also demonstrate our query plan viewer, which shows how HiveQL queries are translated into physical plans of map-reduce jobs. We show how hints can be used to modify the plans generated by the optimizer.

• User Interface — We show our graphical user interface, which allows users to explore a Hive database, author HiveQL queries, and monitor query execution.

• Scalability — We illustrate the scalability of the system by increasing the sizes of the input data and the complexity of the queries.

5. FUTURE WORK
Hive is a first step in building an open-source warehouse over a web-scale map-reduce data processing system (Hadoop). The distinct characteristics of the underlying storage and execution engines have forced us to revisit techniques for query processing. We have discovered that we have to either modify or rewrite several query processing algorithms to perform efficiently in our setting.

Hive is an Apache sub-project, with an active user and developer community both within and outside Facebook. The Hive warehouse instance at Facebook contains over 700 terabytes of usable data and supports over 5000 queries on a daily basis. This demonstration showcases the current capabilities of Hive. There are many important avenues of future work:

• HiveQL currently accepts only a subset of SQL as valid queries. We are working towards making HiveQL subsume SQL syntax.

• Hive currently has a naïve rule-based optimizer with a small number of simple rules. We plan to build a cost-based optimizer and adaptive optimization techniques to come up with more efficient plans.

• We are exploring columnar storage and more intelligent data placement to improve scan performance.

• We are running performance benchmarks based on [1] to measure our progress as well as to compare against other systems [4]. In our preliminary experiments, we have been able to improve the performance of Hadoop itself by 20% compared to [1]. The improvements involved using faster Hadoop data structures to process the data, for example, using Text instead of String. The same queries expressed easily in HiveQL had 20% overhead compared to our optimized Hadoop implementation, i.e., Hive's performance is on par with the Hadoop code from [1]. Based on these experiments, we have identified several areas for performance improvement and have begun working on them. More details are available in [4].

• We are enhancing the JDBC and ODBC drivers for Hive for integration with commercial BI tools, which only work with traditional relational warehouses.

• We are exploring multi-query optimization techniques and ways of performing generic n-way joins in a single map-reduce job.

We would like to thank our user and developer community for their contributions, with special thanks to Yuntao Jia, Yongqiang He, Dhruba Borthakur and Jeff Hammerbacher.

6. REFERENCES
[1] A. Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. Proc. ACM SIGMOD, 2009.
[2] R. Chaiken et al. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. Proc. VLDB Endow., 1(2):1265–1276, 2008.
[3] Apache Hadoop. Available at http://wiki.apache.org/hadoop.
[4] Hive Performance Benchmark. Available at https://issues.apache.org/jira/browse/HIVE-396.
[5] Hive Language Manual. Available at http://wiki.apache.org/hadoop/Hive/LanguageManual.
[6] Facebook Lexicon. Available at http://www.facebook.com/lexicon.
[7] Apache Pig. Available at http://wiki.apache.org/pig.
[8] Apache Thrift. Available at http://incubator.apache.org/thrift.