Introduction To Hive
Hive is data warehouse software with an SQL-like query language that facilitates
querying and managing large datasets in distributed storage.
It is built on top of Hadoop and provides a high-level interface for working with data
stored in the Hadoop Distributed File System (HDFS).
Characteristics of Hive
1. Capability to translate queries into MapReduce jobs, which makes Hive scalable.
2. Handles data warehouse applications and is therefore suitable for the analysis of
static data of extremely large size, where fast response time is not a criterion.
3. Supports a web interface as well, meaning both application APIs and web-browser
clients can access the Hive DB server.
4. Provides an SQL dialect, the Hive Query Language (abbreviated HiveQL or HQL);
see the example after this list.
5. The results of HiveQL queries and the data loaded into tables are stored on the
Hadoop cluster.
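As a minimal sketch of points 1 and 4, the following HiveQL creates a table, loads data,
and runs an aggregate query that Hive compiles into MapReduce jobs; the table name
web_logs, its columns, and the file path are hypothetical.

-- Hypothetical log table; names and path are illustrative
CREATE TABLE web_logs (
  ip STRING,
  request_time STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/tmp/web_logs.tsv' INTO TABLE web_logs;

-- This aggregate query is translated into one or more MapReduce jobs
SELECT status, COUNT(*) AS hits
FROM web_logs
GROUP BY status;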
Hive Services
Thrift Server
• An optional service.
• A remote client submits requests to Hive and retrieves the results.
• The Thrift Server exposes a simple client API for executing HiveQL statements.
Command-Line Interface (CLI)
• A popular interface for interacting with Hive.
• When Hive runs in local mode, the CLI uses local storage rather than HDFS on a
Hadoop cluster.
Metastore
• Stores the metadata (schemas and HDFS locations) for Hive databases, tables, and
columns; the compiler consults it during query planning.
Hive Driver
• Receives HiveQL statements from the Hive interfaces and coordinates their parsing,
compilation, and execution.
Name Description
Database • In Hive, a database is a logical container for tables.
• It helps organize and manage tables.
• Users can switch between databases to isolate their tables and queries.
Table • Tables in Hive are similar to tables in a relational database.
• They consist of rows and columns, and each column has a specified
data type.
• Hive supports both managed tables (where Hive manages the data)
and external tables (where data is stored outside Hive, and Hive
simply provides a schema).
Partition • Partitions are a way to organize data in Hive tables based on specific
columns.
• Partitioning is beneficial for improving query performance, as it
allows for the elimination of irrelevant data during query processing.
• For example, a table of log data might be partitioned by date.
Bucketing • Bucketing is a technique in Hive to distribute data within partitions
further.
• It involves dividing data into a fixed number of buckets based on the
hash of a column.
• Bucketing can be useful for optimizing certain types of queries, such
as join operations (see the sketch after this table).
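A sketch tying the four concepts above together; the database, table, and column names
and the HDFS location (sales_db, transactions, raw_events, /data/raw_events) are
hypothetical.

-- Database: a logical container for tables
CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;

-- Managed table, partitioned by date and bucketed on customer_id
CREATE TABLE transactions (
  txn_id      BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
PARTITIONED BY (txn_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;

-- External table: Hive only provides the schema; the data stays at the given HDFS location
CREATE EXTERNAL TABLE raw_events (
  event_id STRING,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw_events';

Dropping the managed table removes its data as well, whereas dropping the external table
leaves the files at /data/raw_events in place.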
Hive Integration and Workflow Steps
Execute Query: The Hive interface (CLI or Web Interface) sends the query to the
database driver (such as JDBC or ODBC) to execute it.
Get Plan: The driver sends the query to the query compiler, which parses it to check
the syntax and to build the query plan (see the EXPLAIN example after this table).
Get Metadata: The compiler sends a metadata request to the Metastore (backed by any
database, such as MySQL).
Send Metadata: The Metastore sends the metadata to the compiler as a response.
Send Plan: The compiler checks the requirements and resends the plan to the driver.
Parsing and compiling of the query are complete at this point.
Execute Plan: The driver sends the execution plan to the execution engine.
Execute Job:
• Internally, the execution of the job is a MapReduce job.
• The execution engine sends the job to the JobTracker, which resides on the
NameNode, and the JobTracker assigns it to TaskTrackers, which reside on the
DataNodes. The query then runs as that job.
Metadata Operations: Meanwhile, the execution engine can perform metadata
operations with the Metastore.
Fetch Result: The execution engine receives the results from the DataNodes.
Send Results: The execution engine sends the results to the driver.
Send Results: The driver sends the results to the Hive interfaces.
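The parse-and-compile steps above can be inspected from the client side with Hive's
EXPLAIN statement, which prints the plan produced by the compiler without running the
job; the query below uses the hypothetical transactions table from the earlier sketch.

-- Show the compiled query plan without executing the job
EXPLAIN
SELECT txn_date, SUM(amount) AS total
FROM transactions
GROUP BY txn_date;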
Hive Built-in Functions
Hive's built-in functions return values of standard Hive data types such as BIGINT,
DOUBLE, and STRING. In the context of handling projects in domains like banking or
insurance using Oracle SQL or HiveQL, these return data types are also used to define
the structure of database tables, where they correspond to various attributes or
properties of entities in the system (see the examples below).
BIGINT:
Commonly used for representing large integer values such as account numbers, policy
numbers, or unique identifiers.
DOUBLE:
Used for floating-point numbers, suitable for storing financial values that require
decimal precision, such as amounts or percentages.
STRING:
Typically used for storing textual data like names, addresses, or descriptions.
In the context of banking, it could be used for customer names or branch locations.
For example (the table names accounts and policies are illustrative):

CREATE TABLE accounts (
  account_number BIGINT,
  balance DOUBLE,
  customer_name STRING,
  age INT
);
CREATE TABLE policies (
  policy_number BIGINT,
  coverage_amount DOUBLE,
  policyholder_name STRING,
  duration INT
);
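A short sketch of built-in functions whose return types match the ones described above,
run against the hypothetical accounts table: floor() returns BIGINT, rand() returns
DOUBLE, and concat() and upper() return STRING.

SELECT
  floor(balance) AS balance_floor,   -- BIGINT
  rand()         AS sample_weight,   -- DOUBLE
  concat(upper(customer_name), ' #', CAST(account_number AS STRING)) AS label   -- STRING
FROM accounts;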