Hive is a data warehousing tool built on top of Hadoop, used primarily for analytics on large datasets stored in a distributed environment. It provides a SQL-like interface (HiveQL) that enables users to query, analyze, and manage large datasets in the Hadoop Distributed File System (HDFS) without deep knowledge of complex MapReduce programming. This makes Hive a popular choice in data analytics for processing, summarizing, and querying structured data.
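To see what this buys in practice, the classic word-count job, often dozens of lines of MapReduce Java code, collapses into a single HiveQL query. The following is only a sketch: docs is a hypothetical table with a single STRING column named line.

    -- Word count over a hypothetical table docs(line STRING)
    SELECT word, COUNT(*) AS freq
    FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) words
    GROUP BY word
    ORDER BY freq DESC;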
Here’s a breakdown of the key Hive components and how they work together:
1. Hive Shell
The Hive Shell, also called the Hive CLI, is the command-line interface through which users interact with Hive. Users can execute HiveQL (similar to SQL) commands in the shell to manage data, create tables, and run queries. The shell serves as the starting point for most interactions with Hive, allowing analysts and data engineers to submit commands directly.
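A typical interactive session might look like the sketch below; sales_db and orders are hypothetical names used only for illustration.

    hive> SHOW DATABASES;                 -- list available databases
    hive> USE sales_db;                   -- switch to a database
    hive> SHOW TABLES;                    -- list its tables
    hive> SELECT COUNT(*) FROM orders;    -- run a query directly from the shell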
2. Hive Services
Hive Services are components that handle different parts of the data processing lifecycle in Hive. Key
services include:
o Hive Thrift Server: Allows external applications to connect to Hive using JDBC, ODBC, and other interfaces, supporting a wide range of programming languages.
o Hive Driver: Manages and processes HiveQL queries, communicating with the execution engine and returning results.
o CLI Service: Facilitates command-line interface communication, primarily through the Hive Shell.
o Compiler: Translates the high-level HiveQL into low-level MapReduce tasks for Hadoop to execute.
These services allow Hive to handle multiple types of requests and interact with various clients and
applications.
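The compiler's output can be inspected with HiveQL's EXPLAIN command, which prints the plan it produces (the stages and their dependencies) instead of executing the query. A minimal sketch, assuming a hypothetical orders table:

    -- Show the compiled plan rather than running the query
    EXPLAIN
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id;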
3. Hive Metastore
The Hive Metastore is a crucial component that acts as a central repository for Hive metadata. It stores
information about the structure and properties of Hive databases, tables, partitions, columns, and data
types. The Metastore enables Hive to efficiently manage schemas and query data, storing metadata in a
relational database like MySQL or Derby. When queries are run, Hive uses this metadata to locate data
files in HDFS and interpret the data structure.
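The commands below are answered from the Metastore rather than by scanning data files in HDFS; orders and sales_db are hypothetical names.

    DESCRIBE FORMATTED orders;   -- columns, types, HDFS location, SerDe, owner
    SHOW PARTITIONS orders;      -- partitions registered in the Metastore
    SHOW TABLES IN sales_db;     -- tables the Metastore knows for one database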
4. HiveQL
HiveQL is the query language used in Hive, based on SQL. It includes most of SQL’s features and adds
specific capabilities for working with Hadoop, such as partitioning and bucketing data. Through HiveQL,
users can execute queries to manipulate and retrieve data stored in Hadoop, without needing to write
complex MapReduce code. HiveQL is powerful for running analytical queries on massive datasets and
includes commands for:
o Data Definition Language (DDL): Used for creating and modifying tables, databases, etc.
o Data Manipulation Language (DML): Used for querying, inserting, updating, and deleting data.
o User-Defined Functions (UDFs): Custom functions that can be added to extend Hive’s capabilities.
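The sketch below touches all three families; every name in it (orders, staging_orders, the JAR path, and the to_quarter function) is hypothetical and stands in for whatever exists in a given deployment.

    -- DDL: a table partitioned by date and bucketed by customer
    CREATE TABLE orders (
      order_id    BIGINT,
      customer_id BIGINT,
      amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS ORC;

    -- DML: populate one partition, then query it
    INSERT INTO TABLE orders PARTITION (order_date = '2024-01-01')
    SELECT order_id, customer_id, amount FROM staging_orders;

    SELECT customer_id, SUM(amount) AS total
    FROM orders
    WHERE order_date = '2024-01-01'
    GROUP BY customer_id;

    -- UDF: register a custom Java function shipped in a JAR
    ADD JAR /tmp/my_udfs.jar;
    CREATE TEMPORARY FUNCTION to_quarter AS 'com.example.hive.ToQuarter';
    SELECT to_quarter(order_date) AS quarter, SUM(amount) AS total
    FROM orders
    GROUP BY to_quarter(order_date);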
5. Working of Hive
The workflow in Hive involves several steps:
1. Query Submission: A user submits a HiveQL query via the Hive Shell, Thrift Server, or other
interface.
2. Compilation: The query is parsed and translated into a directed acyclic graph (DAG) of
MapReduce tasks by the compiler.
3. Optimization: The compiler optimizes the query by removing unnecessary steps and reusing intermediate results, so the plan executes faster.
4. Execution: The Hive execution engine submits the optimized tasks to Hadoop's MapReduce or
Tez framework, where they are executed in parallel across Hadoop clusters.
5. Result Retrieval: Once execution completes, results are gathered and sent back to the Hive
interface from which the query was initiated.
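Which framework runs the tasks in step 4 is a session-level setting rather than a code change. A one-line sketch of switching engines (tez is available only when that engine is installed on the cluster):

    SET hive.execution.engine=mr;    -- classic MapReduce
    SET hive.execution.engine=tez;   -- Tez, typically faster for DAG-shaped plans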
Hive Architecture
The following architecture shows how a submitted query flows through Hive.
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports several types of clients:
o Thrift Server - A cross-language service platform that serves requests from any programming language that supports Thrift.
o JDBC Driver - Used to establish a connection between Hive and Java applications. The JDBC driver is provided by the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - Allows applications that support the ODBC protocol to connect to Hive.
Hive Services
The following are the services provided by Hive:
o Hive CLI - The Hive CLI (Command Line Interface) is a shell in which users can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - A central repository that stores the structure information of the tables and partitions in the warehouse, including metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - Also referred to as the Apache Thrift Server, it accepts requests from different clients and forwards them to the Hive Driver.
o Hive Driver - Receives queries from sources such as the Web UI, CLI, Thrift Server, and JDBC/ODBC drivers, and passes them to the compiler.
o Hive Compiler - Parses the query and performs semantic analysis on the different query blocks and expressions, converting HiveQL statements into MapReduce jobs.
o Hive Execution Engine - The optimizer produces the logical plan as a DAG of MapReduce and HDFS tasks; the execution engine then runs these tasks in the order of their dependencies.