Hive
What is Hive
Hive is a data warehouse infrastructure tool for processing data in Hadoop. It resides on top of
Hadoop to summarize Big Data and makes querying and analyzing it easy.
Hive was initially developed by Facebook; later, the Apache Software Foundation took it up
and developed it further as an open-source project under the name Apache Hive. It is used by
many different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Features of Hive
• It stores the schema in a database (the metastore) and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-like query language called HiveQL or HQL (see the short example after this list).
• It is familiar, fast, scalable, and extensible.
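As a quick, minimal sketch of the HiveQL mentioned above, the statements below create a simple table and query it. The table and column names are hypothetical and chosen only for illustration.

    -- Define a table; the schema goes to the metastore,
    -- while the data files live in HDFS.
    CREATE TABLE employees (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',';

    -- Query it with familiar SQL-style syntax.
    SELECT name, salary
    FROM employees
    WHERE salary > 50000;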
Architecture of Hive
The architecture of Hive is made up of the following main units:
• User Interface: The interfaces through which users submit queries, such as the Hive command line, the Hive Web UI, and JDBC/ODBC clients.
• Metastore: A database that stores the schema or metadata of tables, databases, columns and their data types, and the HDFS mapping of the data.
• HiveQL Process Engine (Compiler): Parses HiveQL queries, checks them against the metadata, and produces a query plan.
• Execution Engine: Runs the query plan, traditionally as MapReduce jobs on the Hadoop cluster.
• HDFS or HBase: The underlying storage where the actual data resides.
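As a small illustration of the metastore's role, the following HiveQL statements (using the hypothetical table from the earlier example) inspect schema information; Hive answers them from the metastore rather than by scanning the data in HDFS.

    -- List the tables whose definitions the metastore holds.
    SHOW TABLES;

    -- Show the columns, data types, HDFS location, and storage format
    -- recorded in the metastore for a hypothetical table.
    DESCRIBE FORMATTED employees;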
Working of Hive
The following steps describe the workflow between Hive and the Hadoop framework; a short example of inspecting the resulting query plan follows the steps.
1. Execute Query: The Hive interface, such as the command line or Web UI, sends the query to the driver (using a database driver such as JDBC or ODBC) for execution.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check its syntax and build the query plan.
3. Get Metadata: The compiler sends a metadata request to the metastore (any database).
4. Send Metadata: The metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to this point, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execution plan to the execution engine.
7. Execute Job: Internally, the execution of the plan is a MapReduce job. The execution engine sends the job to the JobTracker, which resides on the Name node, and the JobTracker assigns the job to the TaskTracker, which resides on a Data node. Here, the query runs as a MapReduce job.
7.1 Metadata Ops: Meanwhile, during execution, the execution engine can perform metadata operations with the metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those results to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.
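To make steps 2 to 5 concrete, HiveQL's EXPLAIN statement can be used to look at the plan the compiler produces without actually running the job. The table and column names below are hypothetical.

    -- Ask the compiler for the query plan instead of executing the query.
    EXPLAIN
    SELECT page_url, COUNT(*) AS views
    FROM page_view_logs
    WHERE view_date = '2023-01-15'
    GROUP BY page_url;

The output lists the stages that the execution engine would then run in steps 6 and 7.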
Hive in big data is a milestone innovation that has eventually led to data analysis on a large
scale. Big organizations need big data to record the information that is collected over time.
To produce data-driven analysis, organizations gather data and use software applications to
analyze it. With Apache Hive, this data can be read, written, and managed. Ever since data
analytics came into being, the amount of data that has to be stored has kept growing.
Even though small-scale organizations were able to manage medium-sized data and analyze it
with traditional data analytics tools, big data could not be managed with such applications.
As data collection became a daily task and organizations expanded in all aspects, the volume
of collected data grew exponentially. Furthermore, data began to be dealt with in petabytes,
which traditional tools could not handle.
For this, organizations needed hefty equipment, and perhaps that is the reason why the release
of a software tool like Apache Hive was necessary. Thus, Apache Hive was released with the
goal of making large-scale data on Hadoop easy to query and analyze.
Here are two case studies, of Airbnb and The Guardian, that can help you understand the use of
Hive in big data:
"Airbnb connects people with places to stay and things to do around the world with 2.9
million hosts listed, supporting 800k nightly stays. Airbnb uses Amazon EMR to run Apache
Hive on a S3 data lake. Running Hive on the EMR clusters enables Airbnb analysts to
perform ad hoc SQL queries on data stored in the S3 data lake. By migrating to a S3 data
lake, Airbnb reduced expenses, can now do cost attribution, and increased the speed of
"Guardian gives 27 million members the security they deserve through insurance and wealth
management products and services. Guardian uses Amazon EMR to run Apache Hive on a S3
data lake. Apache Hive is used for batch processing. The S3 data lake fuels Guardian Direct,
a digital platform that allows consumers to research and purchase both Guardian products and
third party products in the insurance sector."
Benefits of Hive
Hive in Big Data is extremely beneficial. While it has its own cons, the pros of Hive make it
well worth using for large-scale data analysis.
The USP of Apache Hive can be summed up in its benefits, which have been highly helpful in
big data analysis over time. Here are a few benefits that will help you understand the
concept better.
1. Easy-to-use
Hive in Big Data is an easy-to-use software application that lets one analyze large-scale
data through the batch-processing technique. It uses HiveQL, a language that is very similar
to SQL (Structured Query Language), which makes it a very accessible and easy-to-use
application for converting petabytes of data into useful insights.
This is one of the biggest benefits of Apache Hive and has made it a popular choice for
big data analysis.
2. Fast
The technique of batch processing refers to analyzing data in bits and parts that are later
clubbed together, with the analyzed data then stored through Apache Hadoop.
Batch processing makes Apache Hive a fast tool that conducts the analysis of data in a rapid
manner. In addition, Apache Hive is an advanced data warehouse system built on top of Hadoop.
Thus, this particular software can handle big loads of data in one go, as opposed to
traditional software that could only filter moderate-sized data in one go.
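As a minimal sketch of what such a batch job can look like in HiveQL, the statement below processes one whole day of raw data in a single run and writes the aggregated result into a partition of a summary table. The table names, columns, and partition values are hypothetical.

    -- Aggregate one day of raw clicks in a single batch
    -- and overwrite the matching partition of the summary table.
    INSERT OVERWRITE TABLE daily_click_summary
    PARTITION (dt = '2023-01-15')
    SELECT user_id,
           COUNT(*) AS clicks
    FROM raw_clicks
    WHERE dt = '2023-01-15'
    GROUP BY user_id;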
3. Fault-tolerant Software
In most of the software used to handle Big Data today, fault tolerance is a rare feature.
However, Apache Hive and the HDFS file system together work in a fault-tolerant manner.
This means that as soon as data is written for analysis in Hive, HDFS replicates it to
other machines. This is done in order to prevent the loss of data or schemas in case a
node fails.
Fault tolerance in Hadoop (Hive) is one of the biggest benefits of Hive, as it beats other
competitors like Impala and makes Hive unique in its own way.
4. Cheaper Option
Compared with many of its alternatives, Apache Hive is a cheaper option. For large
organizations, profit is key, yet with technologically advanced tools and software that are
expensive to operate, profit margins can shrink. Therefore, it is necessary for organizations
to look out for cheaper options that can help them achieve the same goals with cost-effective
measures. When it comes to big data and data analysis, Apache Hive is one of the best tools
to use and operate. Fast and familiar, it is highly efficient and also relies on fault
tolerance to produce better results.
5. Productive Software
Apache Hive is a productive software. Why? Well, the answer lies in its other benefits.
Apache Hive not only analyzes data, but also enables its users to read and write it with ease.
What's more, this software defines specific schemas related to data analysis and stores them
in the Hadoop Distributed File System (HDFS), which helps in future analysis.
Hence, Hive in Big Data is quite productive and enables large organizations to make the best
use of the data collected and generated over a long period of time to drive better decisions.
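As a hedged sketch of how a schema can be kept alongside data that already sits in HDFS, the hypothetical example below defines an external table over an existing directory; the path, table, and column names are assumptions made for illustration.

    -- The schema is recorded in the metastore; the data stays where it is
    -- in HDFS and can be re-analyzed later without reloading it.
    CREATE EXTERNAL TABLE sales_archive (
      order_id   BIGINT,
      amount     DOUBLE,
      order_date STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION '/data/warehouse/sales_archive';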