
Hive

Hive is a platform for developing SQL-like scripts that are executed as MapReduce operations.

What is Hive
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes querying and analysis easy.
Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by many companies. For example, Amazon uses it in Amazon Elastic MapReduce (EMR).
Hive is not:
• A relational database
• A design for online transaction processing (OLTP)
• A language for real-time queries and row-level updates
Features of Hive
• It stores the schema in a database (the metastore) and the processed data in HDFS.
• It is designed for OLAP (online analytical processing).
• It provides an SQL-like query language called HiveQL (HQL).
• It is familiar, fast, scalable, and extensible.
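As an illustration of HiveQL's SQL-like flavor, a table definition and an aggregate query might look like the sketch below (the table and column names are hypothetical):

```sql
-- Hypothetical example: define a table and run an aggregate query.
-- Hive compiles the SELECT into a MapReduce job behind the scenes.
CREATE TABLE page_views (
    user_id  STRING,
    url      STRING,
    view_ts  TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Count views per URL, exactly as one would in SQL.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC;
```

Anyone familiar with SQL can read and write queries like this without writing any Java MapReduce code.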

Architecture of Hive
The architecture of Hive consists of the following units:

User Interface: Hive is data warehouse infrastructure software that enables interaction between the user and HDFS. The user interfaces Hive supports are the Hive Web UI, the Hive command line (CLI), and Hive HDInsight (on Windows Server).

Metastore: Hive uses a database server to store the schema (metadata) of tables and databases, the columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is an SQL-like language for querying the schema information in the metastore. It is a replacement for the traditional MapReduce approach: instead of writing a MapReduce program in Java, we write a query, and Hive processes it as a MapReduce job.

Execution Engine: The conjunction of the HiveQL process engine and MapReduce is the Hive execution engine. It processes the query and generates the same results a MapReduce job would, using MapReduce underneath.

HDFS or HBase: The Hadoop Distributed File System (or HBase) is the storage layer where the data itself is kept.
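The split between the metastore (schema) and HDFS (data) can be made visible with an external table; the path and names below are hypothetical:

```sql
-- Hypothetical sketch: only the schema below lives in the metastore;
-- the rows themselves stay in the HDFS directory given by LOCATION.
CREATE EXTERNAL TABLE sales (
    order_id  BIGINT,
    amount    DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/warehouse/sales';

-- DESCRIBE reads metadata from the metastore, not from HDFS.
DESCRIBE FORMATTED sales;
```

Dropping an external table removes only the metastore entry; the files in HDFS remain untouched.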

Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with Hadoop framework:

1. Execute Query: The Hive interface, such as the command line or Web UI, sends the query to the driver (via a database driver such as JDBC or ODBC) for execution.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan.
3. Get Metadata: The compiler sends a metadata request to the metastore (any database).
4. Send Metadata: The metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirements and resends the plan to the driver. Up to this point, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execution plan to the execution engine.
7. Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker (on the name node), which assigns it to TaskTrackers (on the data nodes). Here, the query runs as a MapReduce job.
7.1. Metadata Ops: Meanwhile, during execution, the execution engine can perform metadata operations with the metastore.
8. Fetch Result: The execution engine receives the results from the data nodes.
9. Send Results: The execution engine sends those result values to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.
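The parsing, metadata lookup, and planning stages of this workflow can be observed with Hive's EXPLAIN statement, which prints the compiled plan without launching the job; the table name below is hypothetical:

```sql
-- EXPLAIN shows the plan the compiler hands back to the driver
-- (stage dependencies plus the map and reduce operator trees)
-- without actually running the MapReduce job.
EXPLAIN
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```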

Why do we need it?

Hive in big data is a milestone innovation that has enabled data analysis on a large scale. Big organizations need big data systems to record the information they collect over time.
To produce data-driven analysis, organizations gather data and use software applications such as Apache Hive to read, write, and manage that data. Ever since data analytics came into being, the storage of data has been a pressing topic.

Even though small-scale organizations could manage medium-sized data and analyze it with traditional data analytics tools, big data could not be handled by such applications, so there was a dire need for more advanced software.
As data collection became a daily task and organizations expanded in all aspects, data volumes grew exponentially, and data began to be measured in petabytes. Handling it required hefty infrastructure, and that is why the release of software like Apache Hive was necessary. Thus, Apache Hive was released with the purpose of analyzing big data and producing data-driven insights.

Here are two case studies, from Airbnb and Guardian, that illustrate the use of Hive in big data.

"Airbnb connects people with places to stay and things to do around the world with 2.9 million hosts listed, supporting 800k nightly stays. Airbnb uses Amazon EMR to run Apache Hive on a S3 data lake. Running Hive on the EMR clusters enables Airbnb analysts to perform ad hoc SQL queries on data stored in the S3 data lake. By migrating to a S3 data lake, Airbnb reduced expenses, can now do cost attribution, and increased the speed of Apache Spark jobs by three times their original speed."

"Guardian gives 27 million members the security they deserve through insurance and wealth
management products and services. Guardian uses Amazon EMR to run Apache Hive on a S3
data lake. Apache Hive is used for batch processing. The S3 data lake fuels Guardian Direct,
a digital platform that allows consumers to research and purchase both Guardian products and
third party products in the insurance sector."

Benefits of Hive

Hive in big data is extremely beneficial. While it has its cons, the pros of Hive make it a strong option for data processing and analysis.
The strengths of Apache Hive can be summed up in the benefits that have made it so helpful in big data analysis over time. Here are a few of those benefits.

1. Easy-to-use

Hive in big data is an easy-to-use software application that lets one analyze large-scale data through batch processing. It uses HiveQL, a language very similar to SQL, the structured query language used for interacting with databases.
Such software can be operated by both programmers and non-programmers alike, making it a very accessible application for converting petabytes of data into useful insights.
This is one of the biggest benefits of Apache Hive and has made it a popular choice for data analytics among large organizations with vast data.


2. Fast Experience

Batch processing refers to analyzing data in parts that are later combined. The analyzed data is stored in Apache Hadoop, while the schemas (metadata) remain with Apache Hive.
Batch processing makes Apache Hive fast at analyzing large volumes of data, and it sets Hive apart from traditional data analysis tools.
Thus, Hive can handle big loads of data in one go, whereas traditional software could only filter moderate-sized data in one go.
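A typical batch-processing pattern in Hive is to load one chunk of data at a time into a partitioned table; everything below (names, paths, dates) is hypothetical:

```sql
-- Hypothetical nightly batch: each day's data lands in its own partition.
CREATE TABLE daily_logs (
    user_id STRING,
    action  STRING
)
PARTITIONED BY (log_date STRING);

-- Process one batch (one day) in a single MapReduce run.
INSERT OVERWRITE TABLE daily_logs PARTITION (log_date = '2023-01-15')
SELECT user_id, action
FROM raw_logs
WHERE to_date(event_ts) = '2023-01-15';
```

Later queries that filter on log_date read only the relevant partition, which keeps each batch run fast.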

3. Fault-tolerant Software

In much of the software used to handle big data today, fault tolerance is a rare feature. Apache Hive and the HDFS file system, however, together work in a fault-tolerant manner that operates on the basis of replica creation.
This means that data stored through Hive is replicated across machines by HDFS, in order to prevent the loss of data or schemas if a particular machine fails or stops operating.
Fault tolerance in Hadoop (and therefore Hive) is one of Hive's biggest benefits, as it sets Hive apart from competitors such as Impala.

4. Cheaper Option

Another reason Apache Hive is beneficial is that it is a comparatively cheap option. For large organizations, profit is key, yet technologically advanced tools that are expensive to operate can drive profit margins down.
Organizations therefore need to look for cost-effective options that help them achieve the same goals. When it comes to big data and data analysis, Apache Hive is one of the most economical tools to use and operate: fast, familiar, highly efficient, and fault tolerant.

5. Productive Software

Apache Hive is a productive piece of software. Why? The answer lies in its other benefits. Apache Hive not only analyzes data but also lets its users read and write data in an organized manner.
What's more, it defines schemas for the data it analyzes and stores the data in the Hadoop Distributed File System (HDFS), which helps in future analysis.
Hence, Hive in big data is quite productive and enables large organizations to make the best use of the data they collect and generate over long periods, converting it into meaningful insights.
