Using Hive for Data Warehousing: Introduction to Hive

Slide 1
Hi. Welcome to Accessing Hadoop Data Using Hive. This lesson will provide you with
an Introduction to Hive.

Slide 2
After completing this lesson, you should be able to:
Describe what Hive is, what it's used for, and how it compares to similar technologies.
Describe the Hive architecture.
Describe the main components of Hive.
And list interesting ways others are using Hive.

Slide 3
Hive was initially developed by Facebook in 2007 to help the company handle massive
amounts of new data. At the time Hive was created, Facebook had a 15TB dataset they
needed to work with. A few short years later, that data had grown to 700TB.
Their RDBMS data warehouse was taking too long to process daily jobs, so the company decided to move its data into the scalable, open-source Hadoop platform.
The company found that writing MapReduce programs was not easy and was time-consuming for many users.
When they created Hive, their vision was to bring familiar database concepts to Hadoop,
making it easier for all users to work with.
In 2008 Hive was open sourced. Facebook has since used Hive for reporting dashboards
and ad-hoc analysis.

Slide 4
So what exactly is Hive? Hive is a data warehouse system built on top of Hadoop. Hive
facilitates easy data summarization, ad-hoc queries, and the analysis of very large
datasets that are stored in Hadoop.
Hive provides a SQL interface, better known as HiveQL or HQL for short, which allows for easy querying of data in Hadoop. HQL has its own Data Definition and Data Manipulation Languages, which are very similar to the DDL and DML many of us already have experience with.
In Hive, HQL queries are implicitly translated into one or more MapReduce jobs, shielding the user from far more advanced and time-consuming programming.
Hive provides a mechanism to project structure (like tables and partitions) onto the data
in Hadoop and uses a metastore to map file structure to tabular form.
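
To make this more concrete, here is a minimal HiveQL sketch of projecting structure onto files already sitting in Hadoop and then querying them. The table name, columns, delimiter, and HDFS path are all hypothetical, chosen purely for illustration:

    -- Project a table definition onto tab-delimited files in HDFS.
    CREATE EXTERNAL TABLE page_views (
      user_id   STRING,
      page_url  STRING,
      view_time STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/page_views';

    -- An ad-hoc aggregation; Hive translates this into one or more MapReduce jobs.
    SELECT page_url, COUNT(*) AS views
    FROM page_views
    GROUP BY page_url;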

Slide 5
Hive is not a full database. However, Hive can fit right alongside your RDBMS. There are a variety of things that Hive lacks when compared to an RDBMS.
Hive is not a real-time processing system and is best suited for batch jobs and huge datasets. Think heavy analytics and large aggregations. Latencies are often much higher than in a traditional database system. Hive is schema-on-read, which provides fast loads and flexibility at the expense of query-time performance.
Hive lacks full SQL support and does not provide row-level inserts, updates, or deletes. Hive does not support transactions and has limited subquery support. Query optimization is still a work in progress too.
If you are interested in a distributed and scalable data store that supports row-level updates, rapid queries, and row-level transactions, then HBase is also worth investigating.

Slide 6
Let’s compare Hive to a couple of common alternatives. An example often used is that of
the Word Count program. The Word Count program reads documents stored in Hadoop and returns a listing of every word encountered along with the number of times it occurs. Writing a custom MapReduce program to do this takes 63 lines of code. Having Hive perform the same task takes only 7 easy lines of code! (A sketch of what the Hive version might look like appears at the end of this section.)
Another Hive alternative is Apache Pig. Pig is a high-level programming language, best described as a “data flow language” rather than a query language. Because Pig is a custom language, SQL programmers face a steeper learning curve in becoming comfortable with it. Pig has powerful data transformation capabilities and is great for ETL, but it is not as well suited to ad-hoc querying. Pig is a nice complement to Hive, and the two are often used in tandem in a Hadoop environment.
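
To give a feel for the size difference, a Hive word count might look something like the following sketch. The table and column names are hypothetical, and the documents are assumed to have already been loaded into a table with one line of text per row:

    CREATE TABLE docs (line STRING);
    -- ... load the documents into 'docs', one line per row ...
    CREATE TABLE word_counts AS
    SELECT word, COUNT(1) AS count
    FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
    GROUP BY word;

The equivalent hand-written MapReduce job has to spell out the mapper, reducer, and driver classes, which is where the extra lines of code come from.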

Slide 7
Now let's take a look at Hive's architecture. There are a variety of ways that you can interface with Hive.

You can use a web browser to access Hive via the Hive Web Interface.

You could also access Hive from an application over JDBC, ODBC, or the Thrift API, each made possible by Hive's Thrift server, referred to as HiveServer. HiveServer2 was released in Hive 0.11 and serves as a replacement for HiveServer1, though you still have the choice of which HiveServer to run, or can even run them concurrently. HiveServer2 brings many enhancements, including the ability to handle concurrent clients.

Hive also comes with some powerful command-line interfaces (often referred to as the "CLI"). The introduction of HiveServer2 brings with it a new Hive CLI called Beeline, which can run in embedded mode or thin-client mode. In thin-client mode, the Beeline CLI connects to Hive via JDBC and HiveServer2. The original CLI is also included with Hive and can run in embedded mode or as a client to HiveServer1.
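
As an illustration, connecting the Beeline thin client to a HiveServer2 instance over JDBC might look like the lines below. The hostname, database, and user are placeholders; 10000 is HiveServer2's default port:

    # Thin-client mode: connect to a running HiveServer2 over JDBC.
    $ beeline -u jdbc:hive2://hs2-host.example.com:10000/default -n some_user

    # Embedded mode: run Hive in-process, with no separate server.
    $ beeline -u jdbc:hive2://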

Hive comes with a catalog known as the Metastore. The Metastore stores the system
catalog and metadata about tables, columns, partitions and so on. The metastore makes
mapping file structure to a tabular form possible in Hive.
A newer component of Hive is called HCatalog. HCatalog is built on top of the Hive
metastore and incorporates Hive's DDL. HCatalog provides read and write interfaces for
Pig and MapReduce and uses Hive's command line interface for issuing data definition
and metadata exploration commands. Essentially, HCatalog makes it easier for users of
Pig, MapReduce, and Hive, to read and write data on the grid.

The Hive Driver, Compiler, Optimizer, and Executor work together to turn a query into a
set of Hadoop jobs.
The Driver piece manages the lifecycle of a HiveQL statement as it moves through Hive.
It maintains a session handle and any session statistics.
The Query Compiler compiles HiveQL queries into a DAG of MapReduce tasks.
The Execution Engine executes the tasks produced by the compiler in proper dependency
order. The Execution Engine interacts with the underlying Hadoop instance, working with the NameNode, JobTracker, and so on.
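
If you want to see what the compiler produces for a given query, HiveQL's EXPLAIN statement prints the plan, including the stages and their dependencies. The query below simply reuses the hypothetical page_views table sketched earlier:

    -- Print the stage plan (the DAG of MapReduce stages) for a query.
    EXPLAIN
    SELECT page_url, COUNT(*) AS views
    FROM page_views
    GROUP BY page_url;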

Slide 8
A typical Hive installation has the following directory structure. First, there is a "lib" folder in the Hive installation. The lib folder contains a variety of JAR files. These JAR files contain the Java code that collectively makes up the functionality of Hive.
Then there is the "bin" directory. This is the location of a variety of Hive scripts that launch various Hive services.
Finally, there is the "conf" directory. This directory contains Hive's configuration files.
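
As a rough sketch, the layout looks something like this (an actual installation contains additional files and directories, and paths vary by distribution):

    $HIVE_HOME/
      bin/   - scripts that launch Hive services (for example, hive and beeline)
      conf/  - configuration files such as hive-site.xml
      lib/   - JAR files containing the Java code that implements Hive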

Slide 9
Now let’s take a slightly deeper look at the Hive CLI. The CLI or “Command Line
Interface” is the most common way to interact with the Hive system. From the CLI shell
you can perform queries, DML, and DDL. You can view and manipulate table metadata,
retrieve query explain plans, and more. Hive currently comes with two command-line interfaces – the original CLI and the newer Beeline CLI. These two CLIs are located in Hive's bin directory. There are differences between the original CLI and Beeline architectures; however, running commands in the two is a very similar procedure.
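
For example, a one-off query can be run from either shell in much the same way. The connection URL here is a placeholder:

    # Original CLI: run a query directly.
    $ hive -e 'SHOW TABLES;'

    # Beeline: connect to HiveServer2 and run the same query.
    $ beeline -u jdbc:hive2://localhost:10000 -e 'SHOW TABLES;'

Both shells also support an interactive mode where you can issue DDL, DML, queries, and metadata commands such as DESCRIBE at a prompt.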

Slide 10
The metastore stores the Hive metadata. It consists of two pieces – the service and the datastore. There are three configurations you can choose for your metastore. The first is embedded, which runs the metastore code in the same process as your Hive program, with the database that backs the metastore in that same process as well. The embedded metastore is likely to be used only in a test environment.
The second configuration option is to run the metastore as local, which keeps the
metastore code running in process, but moves the database into a separate process that the
metastore code communicates with.
The last option is to set up a remote metastore. This option moves the metastore itself out
of the process as well. The remote metastore can be useful if you wish to share the
metastore with other users. The remote metastore is the configuration you are most likely
to use in a production environment, as it provides some additional security benefits on
top of what’s possible with a local metastore.
A minimum Hive configuration identifies where the metastore is located. If there are no
configuration details provided by the user then an embedded Derby database is used. A
Derby metastore only allows one user at a time, so it may be advantageous to set up Hive
to use a more robust database option, such as DB2, MySQL or another JDBC-compliant
database.
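
As a sketch of what that looks like, the properties below in hive-site.xml point Hive at a MySQL-backed metastore database; the hostnames, database name, and credentials are placeholders. For a remote metastore, clients are additionally pointed at the metastore service through hive.metastore.uris (9083 is the default metastore port):

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://db-host.example.com/metastore_db</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>
    <!-- Remote metastore only: where clients find the metastore service. -->
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastore-host.example.com:9083</value>
    </property>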

Slide 11
A good introduction to Hive has to include some cool real-world examples of how companies are using Hive. Here's a very small sampling of Hive real-world use cases.
As you can see, Hive is used for data mining, data analysis, analytics, customer-facing business intelligence, and a multitude of other uses.

Slide 12
You have now completed this topic. Thank you for watching.
