Hive


What is Hive in Hadoop?

"The Apache Hive™ data warehouse software facilitates reading, writing, and managing
large datasets residing in distributed storage using SQL. The structure can be projected
onto data already in storage."

In other words, Hive is an open-source system that processes structured data in Hadoop,
residing on top of the latter for summarizing Big Data, as well as facilitating analysis and
queries.

Now that we have investigated what Hive in Hadoop is, let's look at its architecture and
features.

Architecture of Hive

Hive chiefly consists of three core parts:


• Hive Clients: Hive offers a variety of drivers designed for communication with
different applications. For example, Hive provides Thrift clients for Thrift-based
applications. These clients and drivers then communicate with the Hive server,
which falls under Hive services.

• Hive Services: Hive services perform client interactions with Hive. For example,
if a client wants to perform a query, it must talk with Hive services.

• Hive Storage and Computing: Hive services such as the file system, job client, and
metastore then communicate with Hive storage, which holds things like
metadata table information and query results.

Hive's Features

These are Hive's chief characteristics:

• Hive is designed for querying and managing only structured data stored in
tables

• Hive is scalable, fast, and uses familiar concepts

• Schema gets stored in a database, while processed data goes into a Hadoop
Distributed File System (HDFS)

• Tables and databases get created first; then data gets loaded into the proper
tables

• Hive supports four file formats: ORC, SEQUENCEFILE, RCFILE (Record
Columnar File), and TEXTFILE

• Hive uses an SQL-inspired language, sparing the user from dealing with the
complexity of MapReduce programming. It makes learning more accessible by
utilizing familiar concepts found in relational databases, such as columns,
tables, rows, and schemas
• The most significant difference between the Hive Query Language (HQL) and
SQL is that Hive executes queries on Hadoop's infrastructure instead of on a
traditional database

• Since Hadoop's programming works on flat files, Hive uses directory structures
to "partition" data, improving performance on specific queries

• Hive supports partition and buckets for fast and simple data retrieval

• Hive supports custom user-defined functions (UDF) for tasks like data cleansing
and filtering. Hive UDFs can be defined according to programmers'
requirements
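Because HQL borrows familiar relational concepts, a basic Hive workflow looks much like working with an ordinary SQL database. A minimal sketch (the table name, columns, and file path are hypothetical):

```sql
-- Create a managed table; the schema goes into the metastore,
-- while the data files live in HDFS.
CREATE TABLE employees (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load data into the table, then query it with familiar SQL syntax.
LOAD DATA INPATH '/user/hive/input/employees.csv' INTO TABLE employees;

SELECT name, salary FROM employees WHERE salary > 50000;
```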

Limitations of Hive

Of course, no resource is perfect, and Hive has some limitations. They are:

• Hive doesn’t support OLTP. Hive supports Online Analytical Processing
(OLAP), but not Online Transaction Processing (OLTP).

• It has only limited support for subqueries.

• It has a high latency.

• Hive tables don’t support delete or update operations unless they are
transactional (ACID) tables.
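The last limitation is relaxed on transactional tables: tables declared as ACID do accept UPDATE and DELETE. A hedged sketch, assuming a Hive release with ACID support enabled (table and column names are hypothetical):

```sql
-- Transactional tables must be bucketed and stored as ORC.
CREATE TABLE accounts (
  id      INT,
  balance DOUBLE
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- UPDATE and DELETE work only on such ACID tables.
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
DELETE FROM accounts WHERE id = 2;
```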

How Does Data Flow in Hive?

1. The data analyst executes a query with the User Interface (UI).

2. The driver interacts with the query compiler to retrieve the plan, which consists
of the query execution process and metadata information. The driver also
parses the query to check syntax and requirements.
3. The compiler creates the job plan (metadata) to be executed and sends a
metadata request to the metastore.

4. The metastore sends the metadata information back to the compiler.

5. The compiler relays the proposed query execution plan to the driver.

6. The driver sends the execution plans to the execution engine.

7. The execution engine (EE) processes the query by acting as a bridge between
Hive and Hadoop. The job executes as a MapReduce job: the execution
engine sends it to the JobTracker, found in the NameNode, which assigns
it to TaskTrackers in the DataNodes. While this is happening, the execution
engine performs metadata operations with the metastore.

8. The results are retrieved from the data nodes.

9. The results are sent to the execution engine, which, in turn, sends the results
back to the driver and the front end (UI).
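The plan the compiler produces in steps 2 through 5 can be inspected directly with EXPLAIN, which prints the stages the execution engine will run without executing the query (the table and column names are hypothetical):

```sql
-- EXPLAIN shows the job plan (stages, operators) without running the job.
EXPLAIN
SELECT dept, COUNT(*)
FROM employees
GROUP BY dept;
```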

Since we have gone on at length about what Hive is, we should also touch on what Hive
is not:

• Hive isn't a language for row-level updates and real-time queries

• Hive isn't a relational database

• Hive isn't designed for Online Transaction Processing

Hive Modes

Depending on the size of Hadoop data nodes, Hive can operate in two different modes:

• Local mode

• MapReduce mode

Use Local mode when:

• Hadoop is installed under the pseudo mode, possessing only one data node

• The data size is smaller and limited to a single local machine

• Users expect faster processing because the local machine contains smaller
datasets.

Use MapReduce mode when:

• Hadoop has multiple data nodes, and the data is distributed across these
different nodes

• Users must deal with more massive data sets

MapReduce is Hive's default mode.
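Mode selection is controlled through configuration properties set in the Hive session. A sketch, assuming Hive running on classic MapReduce:

```sql
-- Force local mode: the job runs on the local machine only.
SET mapreduce.framework.name = local;

-- Or let Hive decide automatically based on the input size.
SET hive.exec.mode.local.auto = true;

-- Revert to the default cluster (MapReduce on YARN) mode.
SET mapreduce.framework.name = yarn;
```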

Hive and Hadoop on AWS

Amazon Elastic MapReduce (EMR) is a managed service that lets you use big data
processing frameworks such as Spark, Presto, HBase, and, yes, Hadoop to analyze and
process large data sets. Hive, in turn, runs on top of Hadoop clusters, and can be used
to query data residing in Amazon EMR clusters, employing an SQL-like language.

Hive and IBM Db2 Big SQL

Data analysts can query Hive transactional (ACID) tables straight from Db2 Big SQL,
although Db2 Big SQL can only see compacted data in the transactional table. Data
modification statement results won’t be seen by any queries generated in Db2 Big SQL
until you perform a compaction operation, which places data in a base directory.
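Compaction can be triggered from Hive itself. A sketch for a hypothetical transactional table named sales:

```sql
-- A major compaction rewrites delta files into a new base directory,
-- making the data visible to engines such as Db2 Big SQL.
ALTER TABLE sales COMPACT 'major';

-- Check the compaction queue and history.
SHOW COMPACTIONS;
```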
Hive vs. Relational Databases

A relational database, or RDBMS, stores data in a structured format of rows and
columns, a structure called “tables.” Hive, on the other hand, is a
data warehousing system that offers data analysis and queries.

Here’s a handy chart that illustrates the differences at a glance:

Relational Database                        Hive

Maintains a database                       Maintains a data warehouse

Fixed schema                               Varied schema

Sparse tables                              Dense tables

Doesn’t support partitioning               Supports automatic partitioning

Stores normalized data                     Stores both normalized and denormalized data

Uses SQL (Structured Query Language)       Uses HQL (Hive Query Language)

In order to continue our understanding of what Hive is, let us next look at the difference
between Pig and Hive.

Pig vs. Hive

Both Hive and Pig are sub-projects, or tools, used to manage data in Hadoop. While Hive
is a platform used to create SQL-type scripts for MapReduce functions, Pig is a
procedural language platform that accomplishes the same thing. Here's how their
differences break down:

Users

• Data analysts favor Apache Hive

• Programmers and researchers prefer Apache Pig

Language Used

• Hive uses a declarative language variant of SQL called HQL

• Pig uses a unique procedural language called Pig Latin

Data Handling

• Hive works with structured data

• Pig works with both structured and semi-structured data


Cluster Operation

• Hive operates on the cluster's server-side

• Pig operates on the cluster's client-side

Partitioning

• Hive supports partitioning

• Pig doesn't support partitioning

Load Speed

• Hive doesn't load quickly, but it executes faster

• Pig loads quickly

So, if you're a data analyst accustomed to working with SQL and want to perform
analytical queries of historical data, then Hive is your best bet. But if you're a programmer
and are very familiar with scripting languages and you don't want to be bothered by
creating the schema, then use Pig.

In order to strengthen our understanding of what Hive is, let us next look at the difference
between Hive and HBase.

Apache Hive vs. Apache HBase

We've spotlighted the differences between Hive and Pig. Now, it's time for a brief
comparison between Hive and HBase.

• HBase is an open-source, column-oriented database management system that
runs on top of the Hadoop Distributed File System (HDFS)

• Hive is a query engine, while HBase is a data storage system geared towards
unstructured data. Hive is used mostly for batch processing; HBase is used
extensively for transactional processing

• HBase processes in real-time and features real-time querying; Hive doesn't and
is used only for analytical queries

• Hive runs on top of Hadoop, while HBase runs on top of HDFS

• Hive isn't a database, while HBase is a NoSQL database

• Hive has a schema model; HBase doesn't

• And finally, Hive is ideal for high-latency operations, while HBase is made
primarily for low-latency ones

Hive Optimization Techniques

Data analysts who want to optimize their Hive queries and make them run faster in their
clusters should consider the following hacks:

• Partition your data to reduce read time within your directory, or else all the data
will get read

• Use appropriate file formats such as the Optimized Row Columnar (ORC) to
increase query performance. ORC reduces the original data size by up to 75
percent

• Divide table sets into more manageable parts by employing bucketing

• Improve aggregations, filters, scans, and joins by vectorizing your queries.
Perform these functions in batches of 1024 rows at once, rather than one at a
time

• Create a separate index table that functions as a quick reference for the original
table.
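Several of these hacks can be combined in plain HQL. A sketch with hypothetical table and column names:

```sql
-- Store the table as ORC (smaller, faster scans), partition it by date,
-- and bucket it into more manageable parts.
CREATE TABLE logs_orc (
  user_id INT,
  action  STRING
)
PARTITIONED BY (log_date STRING)
CLUSTERED BY (user_id) INTO 16 BUCKETS
STORED AS ORC;

-- Enable vectorized execution (processes batches of 1024 rows).
SET hive.vectorized.execution.enabled = true;
```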
Hive Data Models

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-
hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
Hive structures data into well-understood database concepts such as tables, rows,
columns, and partitions. It supports primitive types like integers, floats, doubles, and
strings, as well as complex types such as associative arrays (maps), lists, and structs. A
Serializer/Deserializer (SerDe) API is used to move data in and out of tables.

Let’s look at the Hive data models in detail.


The Hive data models contain the following components:

• Databases
• Tables
• Partitions
• Buckets or clusters

Partitions:
Partitioning means dividing a table into coarse-grained parts based on the value of a
partition column, such as ‘date’. This makes it faster to run queries on slices of the data.
So, what is the function of a Partition? The partition keys determine how data is stored:
each unique value of the partition key defines a partition of the table. Partitions
are often named after dates for convenience. It is similar to ‘Block Splitting’ in HDFS.
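A partitioned table keeps each partition-key value in its own HDFS directory, so a query that filters on the key reads only the matching slice. A sketch with hypothetical names:

```sql
-- One directory per date, e.g. .../page_views/view_date=2023-01-15/
CREATE TABLE page_views (
  user_id INT,
  url     STRING
)
PARTITIONED BY (view_date STRING);

-- This query scans only the single partition directory for that date.
SELECT url FROM page_views WHERE view_date = '2023-01-15';
```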

Buckets:
Buckets give extra structure to the data that may be used for efficient queries. A join of
two tables that are bucketed on the same columns, including the join column, can be
implemented as a map-side join. Bucketing by user ID means we can quickly evaluate a
user-based query by running it on a randomized sample of the total set of users.
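The bucketed-sampling idea can be sketched as follows (table and column names are hypothetical):

```sql
-- Bucket users by id; rows with the same hash land in the same bucket.
CREATE TABLE users (
  id   INT,
  name STRING
)
CLUSTERED BY (id) INTO 32 BUCKETS;

-- Evaluate a query on 1 of the 32 buckets, i.e. a sample of the users.
SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 32 ON id);
```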

What is a metastore in Hive?


Basically, the metastore is where Hive stores its metadata. It is typically
implemented using an RDBMS together with an open-source ORM (Object Relational
Model) layer called DataNucleus, which converts the object representation into a
relational schema and vice versa.
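Metadata held in the metastore can be inspected directly from HQL (the table name is hypothetical):

```sql
-- DESCRIBE FORMATTED reads column, location, and SerDe information
-- from the metastore rather than from the data files themselves.
DESCRIBE FORMATTED employees;

-- List the databases and tables recorded in the metastore.
SHOW DATABASES;
SHOW TABLES;
```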
