Hive Introduction and Tables

Apache Hive is a data warehousing infrastructure built on Hadoop that allows non-Java programmers to query large datasets using SQL-like queries through HiveQL. It is designed for batch processing and is not suitable for real-time operations, with a hierarchical data organization including databases, tables, partitions, and buckets. Hive supports various data types and operations, but has limitations in query complexity and SQL specifications, and utilizes a metastore for metadata management.

Uploaded by

YASWANTH P 717822I163

Apache Hive

CMSC 491
Hadoop-Based Distributed Computing
Spring 2016
Adam Shook
What Is Hive?
• Developed by Facebook and now a top-level Apache project
• A data warehousing infrastructure based on Hadoop
• Immediately makes data on a cluster available to non-Java programmers via SQL-like queries
• Built on HiveQL (HQL), a SQL-like query language
• Interprets HiveQL and generates MapReduce jobs that run on the cluster
• Enables easy data summarization, ad-hoc reporting and querying, and analysis of large volumes of data
What Hive Is Not
• Hive, like Hadoop, is designed for batch processing of large datasets
• Not an OLTP or real-time system
• Latency and throughput are both high compared to a traditional RDBMS
  – Even when dealing with relatively small data (<100 MB)
Data Hierarchy
• Hive is organised hierarchically into:
  – Databases: namespaces that separate tables and other objects
  – Tables: homogeneous units of data with the same schema
    • Analogous to tables in an RDBMS
  – Partitions: determine how the data is stored
    • Allow efficient access to subsets of the data
  – Buckets/clusters
    • For subsampling within a partition
    • Join optimization
HiveQL
• HiveQL / HQL provides the basic SQL-like operations:
  – Select columns using SELECT
  – Filter rows using WHERE
  – JOIN between tables
  – Evaluate aggregates using GROUP BY
  – Store query results into another table
  – Download results to a local directory (i.e., export from HDFS)
  – Manage tables and queries with CREATE, DROP, and ALTER
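As a rough sketch of the last few operations listed above, the statements below store an aggregate in a second table and then export it to the local filesystem. The page_view table is the one defined later in this deck; page_view_counts and the /tmp path are hypothetical names chosen only for illustration.

```sql
-- Hypothetical target table for the aggregated results.
CREATE TABLE page_view_counts (page_url STRING, views BIGINT);

-- Store query results into another table.
INSERT OVERWRITE TABLE page_view_counts
SELECT page_url, COUNT(*)
FROM page_view
GROUP BY page_url;

-- Export query results from HDFS to a directory on the local machine.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/page_view_counts'
SELECT page_url, views
FROM page_view_counts;
```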
Primitive Data Types
Type                              Comments
TINYINT, SMALLINT, INT, BIGINT    1-, 2-, 4- and 8-byte integers
BOOLEAN                           TRUE/FALSE
FLOAT, DOUBLE                     Single- and double-precision real numbers
STRING                            Character string
TIMESTAMP                         Unix-epoch offset or datetime string
DECIMAL                           Arbitrary-precision decimal
BINARY                            Opaque; ignore these bytes
Complex Data Types
Type      Comments
STRUCT    A collection of elements
          If S is of type STRUCT {a INT, b INT}:
          S.a returns element a
MAP       Key-value tuple
          If M is a map from 'group' to GID:
          M['group'] returns value of GID
ARRAY     Indexed list
          If A is an array of elements ['a','b','c']:
          A[0] returns 'a'
HiveQL Limitations
• HQL only supports equi-joins, outer joins, and left semi-joins
• Because it is only a shell for MapReduce, complex queries can be hard to optimise
• Early releases were missing large parts of the full SQL specification (later releases added some of these, e.g. HAVING in Hive 0.7 and IN/EXISTS sub-queries in WHERE in Hive 0.13):
  – HAVING clause in SELECT
  – Correlated sub-queries
  – Sub-queries outside FROM clauses
  – Updatable or materialized views
  – Stored procedures
Hive Metastore
• Stores Hive metadata
• Default metastore database uses Apache Derby
• Various configurations:
  – Embedded (in-process metastore, in-process database)
    • Mainly for unit tests
  – Local (in-process metastore, out-of-process database)
    • Each Hive client connects to the metadata database directly
  – Remote (out-of-process metastore, out-of-process database)
    • Each Hive client connects to a metastore server, which connects to the metadata database itself
Hive Warehouse
• Hive tables are stored in the Hive “warehouse”
  – Default HDFS location: /user/hive/warehouse
• Tables are stored as subdirectories of the warehouse directory
• Partitions are subdirectories of tables
• External tables are supported in Hive
• The actual data is stored in flat files
Hive Schemas
• Hive is schema-on-read
  – Schema is only enforced when the data is read (at query time)
  – Allows greater flexibility: same data can be read using multiple schemas
• Contrast with an RDBMS, which is schema-on-write
  – Schema is enforced when the data is loaded
  – Speeds up queries at the expense of load times
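A minimal sketch of what schema-on-read looks like in practice, assuming a comma-delimited text file in which some rows carry a non-numeric first field. The table name schema_demo and the file path are hypothetical. Loading succeeds regardless of the file's contents; the mismatch only shows up when a query reads the data.

```sql
-- Hypothetical table; the declared types are not checked at load time.
CREATE TABLE schema_demo (id INT, label STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD DATA only places the file under the table's directory; it does not parse it.
LOAD DATA LOCAL INPATH '/tmp/schema_demo.csv' INTO TABLE schema_demo;

-- At query time, a field that cannot be parsed as INT is returned as NULL.
SELECT id, label
FROM schema_demo
WHERE id IS NULL;
```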
Create Table Syntax
CREATE TABLE table_name
(col1 data_type,
col2 data_type,
col3 data_type,
col4 data_type )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS format_type;
Simple Table
CREATE TABLE page_view
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User' )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
More Complex Table
CREATE TABLE employees
(name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING,
city:STRING,
state:STRING,
zip:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
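To illustrate how the complex columns of the employees table might be read, here is a short query sketch. The map key 'Federal Taxes' is a hypothetical value; the real keys depend on whatever data was loaded.

```sql
SELECT name,
       subordinates[0]             AS first_report,   -- ARRAY: zero-based index
       deductions['Federal Taxes'] AS federal_tax,    -- MAP: lookup by key (hypothetical key)
       address.city                AS city            -- STRUCT: dot notation
FROM employees
WHERE size(subordinates) > 0;                         -- only employees with at least one report
```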
External Table
CREATE EXTERNAL TABLE page_view_stg
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
More About Tables
• CREATE TABLE
  – LOAD: file moved into Hive’s data warehouse directory
  – DROP: both metadata and data deleted
• CREATE EXTERNAL TABLE
  – LOAD: no files moved
  – DROP: only metadata deleted
  – Use this when sharing with other Hadoop applications, or when you want to use multiple schemas on the same data
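The "multiple schemas on the same data" point can be sketched as follows. The statements define a second, narrower external table over the same directory used by page_view_stg; the name page_view_users is hypothetical. For delimited text files, fields beyond the declared columns are simply not read.

```sql
-- Hypothetical second schema over the same files as page_view_stg.
CREATE EXTERNAL TABLE page_view_users
(viewTime INT,
userid BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';

SELECT COUNT(DISTINCT userid) FROM page_view_users;

-- Dropping an external table removes only its metastore entry;
-- the files under /user/staging/page_view are left in place.
DROP TABLE page_view_users;
```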
Partitioning
• Can make some queries faster
• Divide data based on partition column
• Use PARTITIONED BY clause when creating table
• Use PARTITION clause when loading data
• SHOW PARTITIONS will show a table’s partitions
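The clauses above fit together roughly as in this sketch. The table name page_view_part and the input file path are hypothetical; the columns mirror the page_view examples used elsewhere in the deck.

```sql
-- Partition columns are declared separately from the data columns.
CREATE TABLE page_view_part
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING)
PARTITIONED BY (dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- The PARTITION clause names the partition that receives the file.
LOAD DATA LOCAL INPATH '/tmp/pv_us.txt'
INTO TABLE page_view_part
PARTITION (dt='2008-06-08', country='US');

SHOW PARTITIONS page_view_part;

-- Filtering on partition columns lets Hive read only the matching subdirectories.
SELECT COUNT(*)
FROM page_view_part
WHERE dt = '2008-06-08' AND country = 'US';
```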
Bucketing
• Can speed up queries that involve sampling the data
  – Sampling works without bucketing, but Hive has to scan the entire dataset
• Use CLUSTERED BY when creating table
  – For sorted buckets, add SORTED BY
• To query a sample of your data, use TABLESAMPLE
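A sketch of a bucketed table and a sampled query, under the assumption that the page_view table from the earlier slides exists. The name page_view_bucketed and the bucket count of 32 are arbitrary choices for illustration.

```sql
-- Rows are hashed on userid into 32 buckets, each sorted by viewTime.
CREATE TABLE page_view_bucketed
(viewTime INT,
userid BIGINT,
page_url STRING)
CLUSTERED BY (userid) SORTED BY (viewTime) INTO 32 BUCKETS
STORED AS TEXTFILE;

-- Note: on Hive releases before 2.0, hive.enforce.bucketing must be set to true
-- before this insert so that the declared bucket count is honoured.
INSERT OVERWRITE TABLE page_view_bucketed
SELECT viewTime, userid, page_url
FROM page_view;

-- Read roughly 1/32 of the data by scanning a single bucket.
SELECT *
FROM page_view_bucketed TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid);
```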
Browsing Tables And Partitions
Command                                           Comments
SHOW TABLES;                                      Show all the tables in the database
SHOW TABLES 'page.*';                             Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view;                        Show the partitions of the page_view table
DESCRIBE page_view;                               List columns of the table
DESCRIBE EXTENDED page_view;                      More information on columns (useful only for debugging)
DESCRIBE page_view PARTITION (ds='2008-10-31');   List information about a partition
Loading Data
• Use LOAD DATA to load data from a file or directory
  – Will read from HDFS unless LOCAL keyword is specified
  – Will append data unless OVERWRITE specified
  – PARTITION required if destination table is partitioned
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-8_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (dt='2008-06-08', country='US');
Inserting Data
• Use INSERT to load data from a Hive query
  – Will append data unless OVERWRITE specified
  – PARTITION required if destination table is partitioned
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid,
pvs.page_url, pvs.referrer_url
WHERE pvs.country = 'US';
Inserting Data
• Normally only one partition can be inserted into with a single INSERT
• A multi-insert lets you insert into multiple partitions
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='CA')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='UK')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
WHERE pvs.country = 'UK';
Inserting Data During Table Creation
• Use AS SELECT in the CREATE TABLE statement to populate a table as it is created
CREATE TABLE page_view AS
SELECT pvs.viewTime, pvs.userid, pvs.page_url,
pvs.referrer_url
FROM page_view_stg pvs
WHERE pvs.country = 'US';
Loading And Inserting Data: Summary
Use this                    For this purpose
LOAD                        Load data from a file or directory
INSERT                      Load data from a query
                            • One partition at a time
                            • Use multiple INSERTs to insert into multiple partitions in the one query
CREATE TABLE AS (CTAS)      Insert data while creating a table
Add/modify external file    Load new data into external table
Sample Select Clauses
• Select from a single table
SELECT *
FROM sales
WHERE amount > 10 AND
region = "US";
• Select from a partitioned table
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND
page_views.date <= '2008-03-31';
Relational Operators
• ALL and DISTINCT
  – Specify whether duplicate rows should be returned
  – ALL is the default (all matching rows are returned)
  – DISTINCT removes duplicate rows from the result set
• WHERE
  – Filters by expression
  – Early Hive releases do not support IN, EXISTS or sub-queries in the WHERE clause (support for these sub-queries was added in Hive 0.13)
• LIMIT
  – Indicates the number of rows to be returned
Relational Operators
• GROUP BY
  – Group data by column values
  – Select statement can only include columns included in the GROUP BY clause (plus aggregate expressions)
• ORDER BY / SORT BY
  – ORDER BY performs total ordering
    • Slow, poor performance
  – SORT BY performs partial ordering
    • Sorts output from each reducer
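The operators above can be sketched against the page_view table from the earlier slides. The column aliases are hypothetical, and DISTRIBUTE BY is added here as an assumption to control which reducer each row is sent to before SORT BY is applied.

```sql
-- GROUP BY: every selected column is either a grouping column or an aggregate.
SELECT page_url,
       COUNT(*)               AS views,
       COUNT(DISTINCT userid) AS visitors
FROM page_view
GROUP BY page_url;

-- ORDER BY: one total ordering over the whole result.
SELECT userid, viewTime
FROM page_view
ORDER BY viewTime DESC
LIMIT 10;

-- SORT BY: each reducer sorts only its own output, so the overall order is partial.
SELECT userid, viewTime
FROM page_view
DISTRIBUTE BY userid
SORT BY userid, viewTime;
```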
Advanced Hive Operations
• JOIN
  – If the same column of each table is used in the join clauses, then only one MapReduce job will run
    • This results in 1 MapReduce job:
SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key = c.key;
    • This results in 2 MapReduce jobs:
SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key2 = c.key;
  – If multiple tables are joined, put the biggest table last and the reducer will stream the last table, buffer the others
  – Use left semi-joins to take the place of IN/EXISTS
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON a.key = b.key;
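As a side note to the table-ordering advice, Hive also accepts a STREAMTABLE hint that names the table to stream, which may help when reordering the join is inconvenient. This hint is not mentioned in the slides; the sketch below reuses the placeholder tables a, b and c and assumes a is the largest.

```sql
-- Stream table a through the reducers and buffer b and c,
-- regardless of where a appears in the join order.
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val
FROM a
JOIN b ON a.key = b.key
JOIN c ON b.key = c.key;
```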
Advanced Hive Operations
• JOIN
  – Do not specify join conditions in the WHERE clause
    • Hive does not know how to optimise such queries
    • Will compute a full Cartesian product before filtering it
• Join Example
SELECT
a.ymd, a.price_close, b.price_close
FROM stocks a
JOIN stocks b ON a.ymd = b.ymd
WHERE a.symbol = 'AAPL' AND
b.symbol = 'IBM' AND
a.ymd > '2010-01-01';
Hive Stinger
• MPP-style execution of Hive queries
• Available since Hive 0.13
• No MapReduce
• We will talk about this more when we get to SQL on Hadoop
References
• http://hive.apache.org
