Hive
"The Apache Hive™ data warehouse software facilitates reading, writing, and managing
large datasets residing in distributed storage using SQL. The structure can be projected
onto data already in storage."
In other words, Hive is an open-source system that processes structured data in Hadoop,
residing on top of the latter for summarizing Big Data, as well as facilitating analysis and
queries.
Now that we have investigated what Hive is in Hadoop, let’s look at its features and
characteristics.
Architecture of Hive
• Hive Services: Hive services handle client interactions with Hive. For example,
if a client wants to perform a query, it must talk to Hive services.
• Hive Storage and Computing: Hive services such as the file system, job client, and
metastore then communicate with Hive storage, persisting things like
metadata table information and query results.
Hive's Features
• Hive is designed for querying and managing only structured data stored in
tables
• Schema gets stored in a database, while processed data goes into a Hadoop
Distributed File System (HDFS)
• Tables and databases get created first; then data gets loaded into the proper
tables
• Hive uses an SQL-inspired language, sparing the user from dealing with the
complexity of MapReduce programming. It makes learning more accessible by
utilizing familiar concepts found in relational databases, such as columns,
tables, rows, and schemas
• The most significant difference between the Hive Query Language (HQL) and
SQL is that Hive executes queries on Hadoop's infrastructure instead of on a
traditional database
• Since Hadoop's programming works on flat files, Hive uses directory structures
to "partition" data, improving performance on specific queries
• Hive supports partitions and buckets for fast and simple data retrieval
• Hive supports custom user-defined functions (UDF) for tasks like data cleansing
and filtering. Hive UDFs can be defined according to programmers'
requirements
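To make the SQL-inspired flavor concrete, here is a minimal, illustrative HiveQL session (the table and column names are assumptions, not from the original article):

```sql
-- Create a table using familiar relational concepts: columns, rows, a schema
-- (table and column names are illustrative)
CREATE TABLE employees (
  id     INT,
  name   STRING,
  salary DOUBLE
);

-- Query it with ordinary SQL-style syntax; Hive translates this into
-- MapReduce work behind the scenes
SELECT name, salary FROM employees WHERE salary > 50000;
```

Anyone comfortable with SQL can read and write this immediately, which is precisely the point of HQL.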
How Hive Works
The following steps describe how a query flows through Hive:
1. The data analyst executes a query with the User Interface (UI).
2. The driver interacts with the query compiler to retrieve the plan, which consists
of the query execution process and metadata information. The driver also
parses the query to check syntax and requirements.
3. The compiler creates the job plan (metadata) to be executed and communicates
with the metastore to retrieve the necessary metadata.
4. The compiler relays the proposed query execution plan to the driver.
5. The execution engine (EE) processes the query by acting as a bridge between
Hive and Hadoop. The job process executes in MapReduce. The execution
engine sends the job to the JobTracker, found in the Name node, and assigns
it to the TaskTracker, found in the Data node. While this is happening, the
execution engine executes metadata operations with the metastore.
6. The results are sent to the execution engine, which, in turn, sends the results
back to the driver and the front end (UI).
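You can inspect the plan the compiler produces for any statement with EXPLAIN (the query and table name here are illustrative):

```sql
-- Ask Hive to print the query execution plan instead of running the job;
-- the output shows the stages the execution engine will hand to Hadoop
EXPLAIN SELECT COUNT(*) FROM employees;
```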
Since we have gone on at length about what Hive is, we should also touch on what Hive
is not:
• Hive is not a relational database and does not support low-latency, row-level
transaction processing (OLTP)
• Hive is not designed for real-time queries; it is built for batch-oriented
analytical workloads
Hive Modes
Depending on the size of Hadoop data nodes, Hive can operate in two different modes:
• Local mode
• MapReduce mode
Use Local mode when:
• Hadoop is installed under pseudo mode, possessing only one data node
• The local machine contains smaller datasets, so users expect faster processing
Use MapReduce mode when:
• Hadoop has multiple data nodes, and the data is distributed across these
different nodes
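As a sketch, the mode can be controlled from the Hive shell (the properties below exist in Hive/Hadoop, but check your distribution's documentation for the exact defaults):

```sql
-- Let Hive automatically run sufficiently small jobs in local mode
SET hive.exec.mode.local.auto=true;

-- Or force jobs in this session to run locally rather than on the cluster
SET mapreduce.framework.name=local;
```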
Amazon Elastic MapReduce (EMR) is a managed service that lets you use big data
processing frameworks such as Spark, Presto, HBase, and, yes, Hadoop to analyze and
process large datasets. Hive, in turn, runs on top of Hadoop clusters and can be used
to query data residing in Amazon EMR clusters, employing a SQL-like language.
Data analysts can query Hive transactional (ACID) tables straight from Db2 Big SQL,
although Db2 Big SQL can only see compacted data in the transactional table. Data
modification statement results won’t be seen by any queries generated in Db2 Big SQL
until you perform a compaction operation, which places data in a base directory.
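For example, a major compaction can be triggered manually on a transactional table with Hive's ALTER TABLE ... COMPACT statement (the table and partition names here are illustrative):

```sql
-- Trigger a major compaction so that all delta files are rewritten into a
-- base directory, making the data visible to Db2 Big SQL queries
-- (table and partition names are illustrative)
ALTER TABLE sales_txn PARTITION (dt='2024-01-01') COMPACT 'major';
```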
Hive vs. Pig
In order to continue our understanding of what Hive is, let us next look at the difference
between Pig and Hive.
Both Hive and Pig are sub-projects, or tools, used to manage data in Hadoop. While Hive
is a platform used to create SQL-type scripts for MapReduce functions, Pig is a
procedural language platform that accomplishes the same thing. Here's how their
differences break down:
• Users: Hive is used mainly by data analysts; Pig is used mainly by programmers
and researchers
• Language Used: Hive uses a declarative, SQL-like language (HQL); Pig uses the
procedural language Pig Latin
• Data Handling: Hive works with structured data; Pig handles both structured
and semi-structured data
• Partitioning: Hive supports partitioning; Pig does not
• Load Speed: Pig loads data quickly; Hive takes more time up front but pays off
at query time
So, if you're a data analyst accustomed to working with SQL and want to perform
analytical queries of historical data, then Hive is your best bet. But if you're a programmer
and are very familiar with scripting languages and you don't want to be bothered by
creating the schema, then use Pig.
Hive vs. HBase
We've spotlighted the differences between Hive and Pig. Now, it's time for a brief
comparison between Hive and HBase.
• HBase processes in real-time and features real-time querying; Hive doesn't and
is used only for analytical queries
• Hive runs on top of Hadoop, while HBase runs on top of HDFS
• And finally, Hive is suited to high-latency (batch) operations, while HBase is
built primarily for low-latency ones
Data analysts who want to optimize their Hive queries and make them run faster in their
clusters should consider the following hacks:
• Partition your data to reduce read time within your directory, or else all the data
will get read
• Use appropriate file formats, such as Optimized Row Columnar (ORC), to
increase query performance. ORC can reduce the original data size by up to 75
percent
• Create a separate index table that functions as a quick reference for the original
table.
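A minimal sketch of the first two hacks, with illustrative table and column names (note that Hive's built-in CREATE INDEX support was removed in Hive 3.0, so the separate-index-table hack applies only to older versions):

```sql
-- Store data partitioned by date and in the ORC format
-- (table and column names are illustrative)
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING)
STORED AS ORC;

-- A query that filters on the partition column reads only the matching
-- directory instead of scanning the whole table
SELECT url FROM page_views WHERE view_date = '2024-01-01';
```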
Hive Data Models
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-
hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
Hive structures data into well-understood database concepts such as tables, rows,
columns, and partitions. It supports primitive types like integers, floats, doubles, and
strings. Hive also supports associative arrays (maps), lists, and structs, and a
Serializer/Deserializer (SerDe) API is used to move data in and out of tables. Data in
Hive is organized into:
• Databases
• Tables
• Partitions
• Buckets or clusters
Partitions:
Partition means dividing a table into coarse-grained parts based on the value of a
partition column, such as a date. This makes it faster to run queries on slices of the data.
So, what is the function of a Partition? The Partition keys determine how the data is
stored. Each unique value of the Partition key defines a Partition of the table; for
convenience, Partitions are often named after dates. It is similar to ‘Block Splitting’
in HDFS.
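As an illustration (table and column names assumed), a date-partitioned table stores each date's rows in its own directory, and partitions can be added and listed explicitly:

```sql
-- Each distinct value of the 'dt' partition column becomes its own
-- directory in HDFS (table and column names are illustrative)
CREATE TABLE logs (msg STRING) PARTITIONED BY (dt STRING);

-- Register a partition explicitly, then list all partitions of the table
ALTER TABLE logs ADD PARTITION (dt='2024-01-01');
SHOW PARTITIONS logs;
```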
Buckets:
Buckets give extra structure to the data that may be used for efficient queries. A join of
two tables that are bucketed on the same columns, including the join column, can be
implemented as a Map-Side Join. Bucketing by user ID means we can quickly evaluate a
user-based query by running it on a sample of the total set of users.
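The bucketing idea above can be sketched as follows, with illustrative names (the bucket count of 32 is arbitrary):

```sql
-- Hash user_id into 32 buckets; rows with the same user_id always land in
-- the same bucket (table, column names, and bucket count are illustrative)
CREATE TABLE users (
  user_id BIGINT,
  name    STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Evaluate a query on roughly 1/32 of the users by sampling a single bucket
SELECT COUNT(*) FROM users TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id);
```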