Hive


What is Hive in Hadoop?

"The Apache Hive™ data warehouse software facilitates reading, writing, and managing
large datasets residing in distributed storage using SQL. The structure can be projected
onto data already in storage."

In other words, Hive is an open-source system that processes structured data in Hadoop,
residing on top of the latter for summarizing Big Data, as well as facilitating analysis and
queries.

Now that we have investigated what Hive in Hadoop is, let's look at its architecture and
features.

Architecture of Hive

Hive chiefly consists of three core parts:


• Hive Clients: Hive offers a variety of drivers designed for communication with
different applications. For example, Hive provides Thrift clients for Thrift-based
applications. These clients and drivers then communicate with the Hive server,
which falls under Hive services.

• Hive Services: Hive services perform client interactions with Hive. For example,
if a client wants to perform a query, it must talk with Hive services.

• Hive Storage and Computing: Hive services such as the file system, job client, and
metastore then communicate with Hive storage, which holds things like
metadata table information and query results.

Hive's Features

These are Hive's chief characteristics:

• Hive is designed for querying and managing only structured data stored in
tables

• Hive is scalable, fast, and uses familiar concepts

• Schema gets stored in a database, while processed data goes into a Hadoop
Distributed File System (HDFS)

• Tables and databases get created first; then data gets loaded into the proper
tables

• Hive supports four file formats: ORC, SEQUENCEFILE, RCFILE (Record
Columnar File), and TEXTFILE

• Hive uses an SQL-inspired language, sparing the user from dealing with the
complexity of MapReduce programming. It makes learning more accessible by
utilizing familiar concepts found in relational databases, such as columns,
tables, rows, and schemas
• The most significant difference between the Hive Query Language (HQL) and
SQL is that Hive executes queries on Hadoop's infrastructure instead of on a
traditional database

• Since Hadoop's programming works on flat files, Hive uses directory structures
to "partition" data, improving performance on specific queries

• Hive supports partition and buckets for fast and simple data retrieval

• Hive supports custom user-defined functions (UDF) for tasks like data cleansing
and filtering. Hive UDFs can be defined according to programmers'
requirements
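Because HQL borrows familiar relational concepts, a basic Hive workflow looks much like working with an ordinary SQL database. A minimal sketch (the table name, columns, and file path are hypothetical):

```sql
-- Create a managed table; the schema goes into the metastore,
-- while the data files live in HDFS.
CREATE TABLE employees (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load data into the table, then query it with familiar SQL syntax.
LOAD DATA INPATH '/user/hive/input/employees.csv' INTO TABLE employees;

SELECT name, salary FROM employees WHERE salary > 50000;
```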

Limitations of Hive

Of course, no resource is perfect, and Hive has some limitations. They are:

• Hive doesn’t support OLTP. Hive supports Online Analytical Processing
(OLAP), but not Online Transaction Processing (OLTP).

• It has only limited support for subqueries.

• It has a high latency.

• Hive tables don’t support delete or update operations unless they are
transactional (ACID) tables.
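The last limitation is relaxed on transactional tables: tables declared as ACID do accept UPDATE and DELETE. A hedged sketch, assuming a Hive release with ACID support enabled (table and column names are hypothetical):

```sql
-- Transactional tables must be bucketed and stored as ORC.
CREATE TABLE accounts (
  id      INT,
  balance DOUBLE
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- UPDATE and DELETE work only on such ACID tables.
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
DELETE FROM accounts WHERE id = 2;
```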

How Does Data Flow in Hive?

1. The data analyst executes a query with the User Interface (UI).

2. The driver interacts with the query compiler to retrieve the plan, which consists
of the query execution process and metadata information. The driver also
parses the query to check syntax and requirements.
3. The compiler creates the job plan (metadata) to be executed and sends a
metadata request to the metastore.

4. The metastore sends the metadata information back to the compiler.

5. The compiler relays the proposed query execution plan to the driver.

6. The driver sends the execution plans to the execution engine.

7. The execution engine (EE) processes the query by acting as a bridge between
Hive and Hadoop. The job executes as a MapReduce job: the execution
engine sends it to the JobTracker, found in the NameNode, which assigns
it to TaskTrackers in the DataNodes. While this is happening, the execution
engine performs metadata operations with the metastore.

8. The results are retrieved from the data nodes.

9. The results are sent to the execution engine, which, in turn, sends the results
back to the driver and the front end (UI).
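The plan the compiler produces in steps 2 through 5 can be inspected directly with EXPLAIN, which prints the stages the execution engine will run without executing the query (the table and column names are hypothetical):

```sql
-- EXPLAIN shows the job plan (stages, operators) without running the job.
EXPLAIN
SELECT dept, COUNT(*)
FROM employees
GROUP BY dept;
```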

Since we have gone on at length about what Hive is, we should also touch on what Hive
is not:

• Hive isn't a language for row-level updates and real-time queries

• Hive isn't a relational database

• Hive isn't designed for Online Transaction Processing

Hive Modes

Depending on the size of Hadoop data nodes, Hive can operate in two different modes:

• Local mode

• MapReduce mode

Use Local mode when:

• Hadoop is installed under the pseudo mode, possessing only one data node

• The data size is smaller and limited to a single local machine

• Users expect faster processing because the local machine contains smaller
datasets.

Use MapReduce mode when:

• Hadoop has multiple data nodes, and the data is distributed across these
different nodes

• Users must deal with more massive data sets

MapReduce is Hive's default mode.
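Mode selection is controlled through configuration properties set in the Hive session. A sketch, assuming Hive running on classic MapReduce:

```sql
-- Force local mode: the job runs on the local machine only.
SET mapreduce.framework.name = local;

-- Or let Hive decide automatically based on the input size.
SET hive.exec.mode.local.auto = true;

-- Revert to the default cluster (MapReduce on YARN) mode.
SET mapreduce.framework.name = yarn;
```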

Hive and Hadoop on AWS

Amazon Elastic MapReduce (EMR) is a managed service that lets you use big data
processing frameworks such as Spark, Presto, HBase, and, yes, Hadoop to analyze and
process large data sets. Hive, in turn, runs on top of Hadoop clusters, and can be used
to query data residing in Amazon EMR clusters, employing an SQL-like language.

Hive and IBM Db2 Big SQL

Data analysts can query Hive transactional (ACID) tables straight from Db2 Big SQL,
although Db2 Big SQL can only see compacted data in the transactional table. Data
modification statement results won’t be seen by any queries generated in Db2 Big SQL
until you perform a compaction operation, which places data in a base directory.
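Compaction can be triggered from Hive itself. A sketch for a hypothetical transactional table named sales:

```sql
-- A major compaction rewrites delta files into a new base directory,
-- making the data visible to engines such as Db2 Big SQL.
ALTER TABLE sales COMPACT 'major';

-- Check the compaction queue and history.
SHOW COMPACTIONS;
```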
Hive vs. Relational Databases

A relational database, or RDBMS, stores data in a structured format of rows and
columns, a structure called “tables.” Hive, on the other hand, is a
data warehousing system that offers data analysis and queries.

Here’s a handy chart that illustrates the differences at a glance:

Relational Database                        Hive

Maintains a database                       Maintains a data warehouse

Fixed schema                               Varied schema

Sparse tables                              Dense tables

Doesn’t support partitioning               Supports automatic partitioning

Stores normalized data                     Stores both normalized and denormalized data

Uses SQL (Structured Query Language)       Uses HQL (Hive Query Language)

In order to continue our understanding of what Hive is, let us next look at the difference
between Pig and Hive.

Pig vs. Hive

Both Hive and Pig are sub-projects, or tools, used to manage data in Hadoop. While Hive
is a platform used to create SQL-type scripts for MapReduce functions, Pig is a
procedural language platform that accomplishes the same thing. Here's how their
differences break down:

Users

• Data analysts favor Apache Hive

• Programmers and researchers prefer Apache Pig

Language Used

• Hive uses a declarative language variant of SQL called HQL

• Pig uses a unique procedural language called Pig Latin

Data Handling

• Hive works with structured data

• Pig works with both structured and semi-structured data


Cluster Operation

• Hive operates on the cluster's server-side

• Pig operates on the cluster's client-side

Partitioning

• Hive supports partitioning

• Pig doesn't support partitioning

Load Speed

• Hive doesn't load quickly, but it executes faster

• Pig loads quickly

So, if you're a data analyst accustomed to working with SQL and want to perform
analytical queries of historical data, then Hive is your best bet. But if you're a programmer
and are very familiar with scripting languages and you don't want to be bothered by
creating the schema, then use Pig.

In order to strengthen our understanding of what Hive is, let us next look at the difference
between Hive and HBase.

Apache Hive vs. Apache HBase

We've spotlighted the differences between Hive and Pig. Now, it's time for a brief
comparison between Hive and HBase.

• HBase is an open-source, column-oriented database management system that
runs on top of the Hadoop Distributed File System (HDFS)

• Hive is a query engine, while HBase is a data storage system geared towards
unstructured data. Hive is used mostly for batch processing; HBase is used
extensively for transactional processing

• HBase processes in real-time and features real-time querying; Hive doesn't and
is used only for analytical queries

• Hive runs on top of Hadoop, while HBase runs on top of HDFS

• Hive isn't a database, while HBase is a NoSQL database

• Hive has a schema model; HBase doesn't

• And finally, Hive is ideal for high-latency operations, while HBase is made
primarily for low-latency ones

Hive Optimization Techniques

Data analysts who want to optimize their Hive queries and make them run faster in their
clusters should consider the following hacks:

• Partition your data to reduce read time within your directory, or else all the data
will get read

• Use appropriate file formats such as the Optimized Row Columnar (ORC) to
increase query performance. ORC reduces the original data size by up to 75
percent

• Divide table sets into more manageable parts by employing bucketing

• Improve aggregations, filters, scans, and joins by vectorizing your queries.
Perform these functions in batches of 1024 rows at once, rather than one at a
time

• Create a separate index table that functions as a quick reference for the original
table.
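Several of these hacks can be combined in plain HQL. A sketch with hypothetical table and column names:

```sql
-- Store the table as ORC (smaller, faster scans), partition it by date,
-- and bucket it into more manageable parts.
CREATE TABLE logs_orc (
  user_id INT,
  action  STRING
)
PARTITIONED BY (log_date STRING)
CLUSTERED BY (user_id) INTO 16 BUCKETS
STORED AS ORC;

-- Enable vectorized execution (processes batches of 1024 rows).
SET hive.vectorized.execution.enabled = true;
```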
Hive Data Models

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-
hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
Hive structures data into well-understood database concepts such as tables, rows,
columns, and partitions. It supports primitive types like integers, floats, doubles, and
strings, as well as complex types such as associative arrays (maps), lists, and structs. A
Serializer/Deserializer (SerDe) API is used to move data in and out of tables.

Let’s look at the Hive data models in detail.


The Hive data models contain the following components:

• Databases
• Tables
• Partitions
• Buckets or clusters

Partitions:
Partitioning means dividing a table into coarse-grained parts based on the value of a
partition column, such as ‘date’. This makes it faster to run queries on slices of the data.
So, what is the function of a Partition? The partition keys determine how data is stored:
each unique value of the partition key defines a partition of the table. Partitions
are often named after dates for convenience. It is similar to ‘Block Splitting’ in HDFS.
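A partitioned table keeps each partition-key value in its own HDFS directory, so a query that filters on the key reads only the matching slice. A sketch with hypothetical names:

```sql
-- One directory per date, e.g. .../page_views/view_date=2023-01-15/
CREATE TABLE page_views (
  user_id INT,
  url     STRING
)
PARTITIONED BY (view_date STRING);

-- This query scans only the single partition directory for that date.
SELECT url FROM page_views WHERE view_date = '2023-01-15';
```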

Buckets:
Buckets give extra structure to the data that may be used for efficient queries. A join of
two tables that are bucketed on the same columns, including the join column, can be
implemented as a map-side join. Bucketing by user ID means we can quickly evaluate a
user-based query by running it on a randomized sample of the total set of users.
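The bucketed-sampling idea can be sketched as follows (table and column names are hypothetical):

```sql
-- Bucket users by id; rows with the same hash land in the same bucket.
CREATE TABLE users (
  id   INT,
  name STRING
)
CLUSTERED BY (id) INTO 32 BUCKETS;

-- Evaluate a query on 1 of the 32 buckets, i.e. a sample of the users.
SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 32 ON id);
```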

What is a metastore in Hive?


Basically, the metastore is where Hive stores its metadata. It is typically
implemented using an RDBMS together with an open-source ORM (Object Relational
Model) layer called DataNucleus, which converts the object representation into a
relational schema and vice versa.
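Metadata held in the metastore can be inspected directly from HQL (the table name is hypothetical):

```sql
-- DESCRIBE FORMATTED reads column, location, and SerDe information
-- from the metastore rather than from the data files themselves.
DESCRIBE FORMATTED employees;

-- List the databases and tables recorded in the metastore.
SHOW DATABASES;
SHOW TABLES;
```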
