Apache Hive
Apache Hive
net/publication/376086576
CITATIONS READS
0 212
1 author:
Nilu Singh
Koneru Lakshmaiah Education Foundation
120 PUBLICATIONS 356 CITATIONS
SEE PROFILE
All content following this page was uploaded by Nilu Singh on 01 December 2023.
• Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics
at a massive scale.
• Hive Metastore (HMS) provides a central repository of metadata that can easily be
analyzed to make informed, data driven decisions, and therefore it is a critical
component of many data lake architectures.
• Hive is built on top of Apache Hadoop and supports storage on S3, ADLS, GS etc.
though HDFS. Hive allows users to read, write, and manage petabytes of data using
SQL.
Apache HIVE Architecture
• Apache Hive is a very effective tool when it comes to big data (descriptive data to be
analyzed).
• Archive data software that supports the process of data analysis of big data on a regular
basis, the concept of big data nest is very popular in the technology area.
• As data is stored in the Apache Hadoop Distributed File System (HDFS) where data is
processed and processed, Apache Hive assists in processing and analyzing, and
producing data-driven patterns and trends.
Cont.
➢ HiveQL is a SQL-like
language that interacts
with the Hive website
in various organizations
and analyzes the
required data in a
structured format.
Cont.
External tables: Hive supports external tables, which allow users to access data stored in
other storage systems such as HBase, Cassandra, and Amazon S3.
Partitioning: Hive offers partitioning, which allows users to separate huge datasets based
on parameters such as date, location, or user ID. Restricting the quantity of data that must
be scanned improves query performance.
Advantages of Hive
Fast : Quickly process enormous amounts of data.
Familiar : Hive is its familiar SQL-like interface.
Scalable : Hive in Big Data can handle massive amounts of data stored in HDFS and
other compatible data stores.
Hive Optimization Techniques
•Partition your data to reduce read time within your directory, or else all the data will get
read
•Use appropriate file formats such as the Optimized Row Columnar (ORC) to increase
query performance. ORC reduces the original data size by up to 75 percent
•Divide table sets into more manageable parts by employing bucketing
•Improve aggregations, filters, scans, and joins by vectorizing your queries. Perform these
functions in batches of 1024 rows at once, rather than one at a time
•Create a separate index table that functions as a quick reference for the original table.
Components of Hive
• Shell
• Driver
• Compiler
• Metastore
• Execution Engine
Applications of Hive
•Data Mining
•Log Processing
•Document Indexing
•Customer Facing Business
Intelligence
•Predictive Modelling
•Hypothesis Testing
EXAMPLES
➢ “Airbnb connects people with accommodation and activities worldwide by 2.9 million
registered tourists, who support 800k overnight stays. Airbnb uses Amazon EMR to run
Apache Hive in the S3 data pool. Running Hive in EMR collections enables Airbnb analysts to
create temporary SQL queries in data stored in the S3 data pool. Spark at three times its
original speed”.
➢ “Guardian provides 27 million members with the protection they deserve through insurance
and asset management products and services. Guardian uses Amazon EMR to deploy Apache
Hive in the S3 data pool. Apache Hive is used to process clusters. data once influenced
Guardian Direct, a digital platform that allows consumers to research and purchase both
Guardian products and third-party products in the insurance industry”.
Important Points