
Apache Hive

Based on Slides by Adam Shook


What Is Hive?
• Developed by Facebook and a top-level Apache project
• A data warehousing infrastructure based on Hadoop
• Immediately makes data on a cluster available to non-Java programmers via SQL-like queries
• Built on HiveQL (HQL), a SQL-like query language
• Interprets HiveQL and generates MapReduce jobs that run on the cluster
• Enables easy data summarization, ad-hoc reporting and querying, and analysis of large volumes of data
What Hive Is Not
• Hive, like Hadoop, is designed for batch processing of large datasets
• Not an OLTP or real-time system
• Latency and throughput are both high compared to a traditional RDBMS
– Even when dealing with relatively small data (< 100 MB)
Data Hierarchy
• Hive is organised hierarchically into:
– Databases: namespaces that separate tables and other objects
– Tables: homogeneous units of data with the same schema
• Analogous to tables in an RDBMS
– Partitions: determine how the data is stored
• Allow efficient access to subsets of the data
– Buckets/clusters
• For sub-sampling within a partition
• Join optimization
HiveQL
• HiveQL / HQL provides the basic SQL-like operations (combined in the example below):
– Select columns using SELECT
– Filter rows using WHERE
– JOIN between tables
– Evaluate aggregates using GROUP BY
– Store query results into another table
– Download results to a local directory (i.e., export from HDFS)
– Manage tables and queries with CREATE, DROP, and ALTER
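
A minimal sketch combining several of these operations in one query, assuming a hypothetical sales table with region and amount columns:

-- Total sales per region, restricted to positive amounts
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE amount > 0
GROUP BY region;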
Primitive Data Types
Type                              Comments
TINYINT, SMALLINT, INT, BIGINT    1, 2, 4 and 8-byte integers
BOOLEAN                           TRUE/FALSE
FLOAT, DOUBLE                     Single and double precision real numbers
STRING                            Character string
TIMESTAMP                         Unix-epoch offset or datetime string
DECIMAL                           Arbitrary-precision decimal
BINARY                            Opaque; ignore these bytes
Complex Data Types
Type     Comments
STRUCT   A collection of elements
         If S is of type STRUCT {a INT, b INT}: S.a returns element a
MAP      Key-value tuple
         If M is a map from 'group' to GID: M['group'] returns the value of GID
ARRAY    Indexed list
         If A is an array of elements ['a','b','c']: A[0] returns 'a'
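
A short illustrative query against the employees table defined later in these slides ('Federal Taxes' is a hypothetical map key):

SELECT name,
       subordinates[0],              -- first element of an ARRAY
       deductions['Federal Taxes'],  -- MAP lookup by key
       address.city                  -- STRUCT field access
FROM employees;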
Bucketing is a very similar concept, with some important differences. Here, we split the data into a fixed number of "buckets", according to a hash function over some set of columns. (When using both partitioning and bucketing, each partition will be split into an equal number of buckets.) Hive will guarantee that all rows which have the same hash will end up in the same bucket, but a single bucket may contain multiple such groups.
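
A sketch of a table definition that combines partitioning and bucketing as described above; the table and column names are illustrative:

CREATE TABLE page_views_bucketed
 (userid BIGINT,
  page_url STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;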
HiveQL Limitations
• HQL only supports equi-joins, outer joins, left semi-joins
• Because it is only a shell for Map-Reduce, complex queries can be hard to optimise
• Missing large parts of the full SQL specification:
– HAVING clause in SELECT
– Correlated sub-queries
– Sub-queries outside FROM clauses
– Updatable or materialized views
– Stored procedures
Hive Metastore
• Stores Hive metadata
• Default metastore database uses Apache Derby
• Various configurations:
– Embedded (in-process metastore, in-process database)
• Mainly for unit tests
– Local (in-process metastore, out-of-process database)
• Each Hive client connects to the metastore directly
– Remote (out-of-process metastore, out-of-process database)
• Each Hive client connects to a metastore server, which connects to the metadata database itself
Hive Warehouse
• Hive tables are stored in the Hive “warehouse”
– Default HDFS location: /user/hive/warehouse
• Tables are stored as sub-directories in the warehouse directory
• Partitions are subdirectories of tables
• External tables are supported in Hive
• The actual data is stored in flat files
Hive Schemas
• Hive is schema-on-read
– Schema is only enforced when the data is read (at query time)
– Allows greater flexibility: the same data can be read using multiple schemas (see the sketch below)
• Contrast with an RDBMS, which is schema-on-write
– Schema is enforced when the data is loaded
– Speeds up queries at the expense of load times
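
A brief sketch of reading the same files with two different schemas, assuming a hypothetical HDFS directory /data/logs of tab-delimited files:

-- One external table treats each line as a single string
CREATE EXTERNAL TABLE logs_raw (line STRING)
STORED AS TEXTFILE
LOCATION '/data/logs';

-- Another external table over the same files parses three columns
CREATE EXTERNAL TABLE logs_parsed
 (ts STRING,
  userid BIGINT,
  url STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs';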
Create Table Syntax
CREATE TABLE table_name
 (col1 data_type,
  col2 data_type,
  col3 data_type,
  col4 data_type)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS format_type;
Simple Table
CREATE TABLE page_view
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User' )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
More Complex Table
CREATE TABLE employees
 (name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING,
                 city:STRING,
                 state:STRING,
                 zip:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
External Table
CREATE EXTERNAL TABLE page_view_stg
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
More About Tables
• CREATE TABLE
– LOAD: file moved into Hive’s data warehouse directory
– DROP: both metadata and data deleted
• CREATE EXTERNAL TABLE
– LOAD: no files moved
– DROP: only metadata deleted (see the contrast sketched below)
– Use this when sharing with other Hadoop applications, or when you want to use multiple schemas on the same data
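
A small sketch of the difference when dropping the two kinds of table, reusing the page_view and page_view_stg examples from earlier slides:

DROP TABLE page_view;      -- managed table: metadata and warehouse files are both deleted
DROP TABLE page_view_stg;  -- external table: only metadata is deleted; the files in /user/staging/page_view remain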
Partitioning
• Can make some queries faster
• Divide data based on partition column
• Use PARTITIONED BY clause when creating a table (see the sketch below)
• Use PARTITION clause when loading data
• SHOW PARTITIONS will show a table’s partitions
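
A minimal partitioning sketch, following the page_view pattern from the earlier slides (table, column, and partition names are illustrative):

CREATE TABLE page_view_part
 (viewTime INT,
  userid BIGINT,
  page_url STRING)
PARTITIONED BY (dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

SHOW PARTITIONS page_view_part;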
Bucketing
• Can speed up queries that involve sampling the data
– Sampling works without bucketing, but Hive has to scan the entire dataset
• Use CLUSTERED BY when creating a table
– For sorted buckets, add SORTED BY
• To query a sample of your data, use TABLESAMPLE (see the sketch below)
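
A hedged sampling sketch against the illustrative bucketed table defined earlier (page_views_bucketed, 32 buckets hashed on userid):

-- Read only the first of the 32 buckets instead of scanning the whole table
SELECT *
FROM page_views_bucketed TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid);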
Browsing Tables And Partitions
Command                                          Comments
SHOW TABLES;                                     Show all the tables in the database
SHOW TABLES 'page.*';                            Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view;                       Show the partitions of the page_view table
DESCRIBE page_view;                              List columns of the table
DESCRIBE EXTENDED page_view;                     More information on columns (useful only for debugging)
DESCRIBE page_view PARTITION (ds='2008-10-31');  List information about a partition
Loading Data
• Use LOAD DATA to load data from a file or directory
– Will read from HDFS unless LOCAL keyword is specified
– Will append data unless OVERWRITE specified
– PARTITION required if destination table is partitioned

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-8_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (date='2008-06-08', country='US');
Inserting Data
• Use INSERT to load data from a Hive query
– Will append data unless OVERWRITE specified
– PARTITION required if destination table is partitioned

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid,
       pvs.page_url, pvs.referrer_url
WHERE pvs.country = 'US';
Loading And Inserting Data: Summary

Use this                   For this purpose
LOAD                       Load data from a file or directory
INSERT                     Load data from a query
                           • One partition at a time
                           • Use multiple INSERTs to insert into multiple partitions in the one query
CREATE TABLE AS (CTAS)     Insert data while creating a table (see the sketch below)
Add/modify external file   Load new data into an external table
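
A brief CTAS sketch, deriving an illustrative summary table from the staging table, assuming it also carries a country column as in the INSERT example above:

CREATE TABLE page_view_us AS
SELECT viewTime, userid, page_url
FROM page_view_stg
WHERE country = 'US';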
Sample Select Clauses
• Select from a single table
SELECT *
FROM sales
WHERE amount > 10 AND
region = "US";
• Select from a partitioned table
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND
page_views.date <= '2008-03-31'
Relational Operators
• ALL and DISTINCT
– Specify whether duplicate rows should be returned
– ALL is the default (all matching rows are returned)
– DISTINCT removes duplicate rows from the result set
• WHERE
– Filters by expression
– Does not support IN, EXISTS or sub-queries in the WHERE clause
• LIMIT
– Indicates the number of rows to be returned (see the example below)
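
A small illustrative query combining DISTINCT and LIMIT on the hypothetical sales table used earlier:

SELECT DISTINCT region
FROM sales
LIMIT 10;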
Relational Operators
• GROUP BY
– Group data by column values
– Select statement can only include columns included in the GROUP BY clause
• ORDER BY / SORT BY (contrasted in the sketch below)
– ORDER BY performs total ordering
• Slow, poor performance
– SORT BY performs partial ordering
• Sorts output from each reducer
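
A hedged sketch contrasting the two orderings on the illustrative sales table:

-- Total ordering: a single reducer sorts all output (can be slow on large data)
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY total DESC;

-- Partial ordering: each reducer sorts only its own share of the output
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
SORT BY total DESC;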
Advanced Hive Operations
• JOIN
– If only one column in each table is used in the join, then only one MapReduce job will run
• This results in 1 MapReduce job:
SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key = c.key
• This results in 2 MapReduce jobs:
SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key2 = c.key
– If multiple tables are joined, put the biggest table last and the reducer will stream the last table, buffer the others
– Use left semi-joins to take the place of IN/EXISTS
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON a.key = b.key;
Advanced Hive Operations
• JOIN
– Do not specify join conditions in the WHERE clause
• Hive does not know how to optimise such queries
• Will compute a full Cartesian product before filtering it
• Join Example

SELECT
a.ymd, a.price_close, b.price_close
FROM stocks a
JOIN stocks b ON a.ymd = b.ymd
WHERE a.symbol = 'AAPL' AND
b.symbol = 'IBM' AND
a.ymd > '2010-01-01';
Hive Stinger
• MPP-style execution of Hive queries
• Available since Hive 0.13
• No MapReduce
• We will talk about this more when we get to
SQL on Hadoop
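
As a brief illustration, later Hive versions let a session choose its execution engine; hive.execution.engine is the standard property, though which values are available depends on the installation:

SET hive.execution.engine=tez;   -- run subsequent queries with Tez instead of MapReduce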
References
• http://hive.apache.org
