The document provides an introduction to Hive, including its motivation, overview, data model, architecture, performance considerations, pros and cons, applications, and related systems. Hive is a data warehousing system that allows users to query large datasets using SQL, and automatically handles the execution of queries as MapReduce jobs. It aims to make Hadoop data analysis easier for analysts through an SQL interface while leveraging Hadoop's distributed processing capabilities.


Introduction to Hive

Liyin Tang
liyintan@usc.edu

Outline

Motivation
Overview
Data Model / Metadata
Architecture
Performance
Pros and Cons
Application
Related Work

03/02/17 Introduction to Hive 2

Motivation

[Figure: data pipeline diagram with the components Web Servers, Scribe MidTier, Scribe Writers, Realtime Hadoop Cluster, Hadoop Hive Warehouse, Oracle RAC, and MySQL]
http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html


Motivation

Limitations of MapReduce
Everything has to be expressed in the M/R model
Code is not reusable
Error-prone
For complex jobs:
Multiple stages of Map/Reduce functions
It is like asking a developer to write the physical execution plan of a database query by hand


Overview

Intuitive
Make unstructured data look like tables, regardless of how it is actually laid out
SQL-based queries can be run directly against these tables
Hive generates a specific execution plan for each query
What is Hive?
A data warehousing system for storing structured data on the Hadoop file system
Provides an easy way to query this data by executing Hadoop MapReduce plans


Data Model
Tables
Basic column types (int, float, boolean)
Complex types: List / Map (associative array)
Partitions
Buckets

CREATE TABLE sales (id INT, items ARRAY<STRUCT<id:INT, name:STRING>>)
PARTITIONED BY (ds STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;

SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32);
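
The idea behind buckets and bucket sampling can be sketched in a few lines of Python. This is only an illustration, not Hive code. It assumes a row is placed in a bucket by taking the clustering column modulo the bucket count (treating an integer id as its own hash), and the helper names `bucket_for` and `sample_bucket` are hypothetical.

```python
NUM_BUCKETS = 32  # mirrors CLUSTERED BY (id) INTO 32 BUCKETS


def bucket_for(row_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return a 1-based bucket number (assumption: id modulo bucket count)."""
    return (row_id % num_buckets) + 1


def sample_bucket(rows, bucket: int, num_buckets: int = NUM_BUCKETS):
    """Keep only the rows of one bucket, in the spirit of TABLESAMPLE (BUCKET x OUT OF y)."""
    return [row for row in rows if bucket_for(row["id"], num_buckets) == bucket]


rows = [{"id": i} for i in range(100)]
sampled_ids = [row["id"] for row in sample_bucket(rows, bucket=1)]
print(sampled_ids)  # ids that land in bucket 1
```

Because each bucket holds roughly 1/32 of the ids, reading a single bucket gives a cheap sample without scanning the whole table.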


Metadata

Database namespace
Table definitions
Schema info, physical location in HDFS

Partition data

ORM framework
By default, all metadata is stored in Derby
Any database with a JDBC driver can be configured instead


Architecture
[Figure: Hive architecture diagram. Interfaces: Web UI, Hive CLI, JDBC/ODBC (browse, query, DDL). Hive QL: Parser, Planner, Optimizer, Execution. Underlying systems: Map Reduce, HDFS. Extension points: user-defined map-reduce scripts; UDF/UDAF (substr, sum, average); FileFormats (TextFile, SequenceFile, RCFile); SerDe (CSV, Thrift, Regex)]
http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

Performance

GROUP BY operation
Efficient execution plans based on:
Data skew:
How evenly the data is distributed across the physical nodes
Bottleneck vs. load balance
Partial aggregation:
Group rows with the same GROUP BY value as early as possible
In-memory hash table in the mapper
Happens earlier than the combiner
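
Partial aggregation in the mapper can be sketched as follows. This is a minimal Python illustration for a COUNT grouped by key, not Hive's implementation. The function names and the sample splits are made up for the example.

```python
from collections import defaultdict


def map_with_partial_aggregation(records):
    """Mapper: pre-aggregate counts in an in-memory hash table before emitting."""
    partial = defaultdict(int)
    for key in records:
        partial[key] += 1
    # One (key, count) pair per distinct key instead of one pair per record.
    return list(partial.items())


def reduce_counts(partials):
    """Reducer: merge the partial counts produced by all mappers."""
    totals = defaultdict(int)
    for key, count in partials:
        totals[key] += count
    return dict(totals)


split_a = ["us", "us", "uk", "us"]  # input split of mapper A (sample data)
split_b = ["uk", "de", "us"]        # input split of mapper B (sample data)
emitted = map_with_partial_aggregation(split_a) + map_with_partial_aggregation(split_b)
result = reduce_counts(emitted)
print(len(emitted), result)
```

The two mappers emit 5 pairs for 7 input records, so less data crosses the network to the reducers. The saving grows with the number of repeated keys per split.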


Performance

JOIN operation
Traditional Map-Reduce join
Early map-side join
Very efficient for joining a small table with a large table
Keep the smaller table's data in memory first
Join it with one chunk of the larger table's data at a time
Trades space complexity for time complexity
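
The map-side join described above can be sketched in Python as a hash join: the small table is loaded into an in-memory hash table once, and the large table is streamed past it chunk by chunk. The table contents and the names `build_hash_table` and `map_side_join` are illustrative assumptions, not Hive internals.

```python
from collections import defaultdict


def build_hash_table(small_table, key):
    """Load the small table into memory, indexed by the join key."""
    table = defaultdict(list)
    for row in small_table:
        table[row[key]].append(row)
    return table


def map_side_join(small_table, large_chunks, key):
    """Stream the large table chunk by chunk and probe the in-memory table."""
    lookup = build_hash_table(small_table, key)
    for chunk in large_chunks:
        for row in chunk:
            for match in lookup.get(row[key], []):
                yield {**match, **row}


users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]  # small table
click_chunks = [                                                # large table, in chunks
    [{"uid": 1, "url": "/a"}, {"uid": 3, "url": "/b"}],
    [{"uid": 2, "url": "/c"}, {"uid": 1, "url": "/d"}],
]
joined = list(map_side_join(users, click_chunks, "uid"))
print(joined)
```

No shuffle or reduce phase is needed here. The cost is that the whole small table must fit in each mapper's memory.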

7/20/2010 Introduction to Hive 10

Performance

SerDe
Describes how to load data from a file into a representation that makes it look like a table
Lazy loading
Create a field object only when it is needed
Reduces the overhead of creating unnecessary objects in Hive
Creating objects is expensive in Java
Increases performance
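
Lazy loading of fields can be illustrated with a small Python sketch. It is not how Hive's SerDe classes are written. The `LazyRow` class, its column names, and the tab delimiter are assumptions made for the example. The row keeps the raw line and only splits and converts a field the first time a query asks for it.

```python
class LazyRow:
    """Holds the raw line and builds a field object only on first access."""

    def __init__(self, raw_line, column_names, delimiter="\t"):
        self._raw = raw_line
        self._columns = {name: i for i, name in enumerate(column_names)}
        self._delimiter = delimiter
        self._parts = None        # raw split, deferred until a field is requested
        self._cache = {}          # converted field objects created so far
        self.objects_created = 0  # counter for demonstration purposes

    def get(self, name, convert=str):
        if name not in self._cache:
            if self._parts is None:
                self._parts = self._raw.split(self._delimiter)
            self._cache[name] = convert(self._parts[self._columns[name]])
            self.objects_created += 1
        return self._cache[name]


row = LazyRow("42\thamlet\t3.5", ["freq", "word", "score"])
freq = row.get("freq", int)        # converts only the "freq" field
freq_again = row.get("freq", int)  # served from the cache, no new object
print(freq, row.objects_created)
```

A query such as SELECT count(1) never touches any field, so under this scheme it would create no field objects at all.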


Hive Performance

Date       SVN Revision  Major Changes                Query A  Query B  Query C
2/22/2009  746906        Before Lazy Deserialization  83 sec   98 sec   183 sec
2/23/2009  747293        Lazy Deserialization         40 sec   66 sec   185 sec
3/6/2009   751166        Map-side Aggregation         22 sec   67 sec   182 sec
4/29/2009  770074        Object Reuse                 21 sec   49 sec   130 sec
6/3/2009   781633        Map-side Join *              21 sec   48 sec   132 sec
8/5/2009   801497        Lazy Binary Format *         21 sec   48 sec   132 sec

Query A: SELECT count(1) FROM t;
Query B: SELECT concat(concat(concat(a,b),c),d) FROM t;
Query C: SELECT * FROM t;
Map-side time only (incl. GzipCodec for compression/decompression)
* These two features need to be tested with other queries.

http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

Pros

An easy way to process large-scale data
Supports SQL-based queries
Provides more user-defined interfaces for extension
Programmability
Efficient execution plans for performance
Interoperability with other database tools


Cons

No easy way to append data
Files in HDFS are immutable
Future work
Views / Variables
More operators
IN/EXISTS semantics
More future work is listed on the mailing list


Application

Log processing
Daily Report
User Activity Measurement
Data/Text mining
Machine learning (Training Data)
Business intelligence
Advertising Delivery
Spam Detection


Related Work

Parallel databases: Gamma, Bubba, Volcano

Google: Sawzall
Yahoo: Pig
IBM: JAQL
Microsoft: DryadLINQ, SCOPE


Reference

[1] A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of VLDB '09, 2009.
[2] Hadoop 2009: http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
[3] Cloudera: http://www.cloudera.com/videos/introduction_to_hive
[4] Facebook Data Team: http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation


Q&A
Thank you
Back up

Hive Components

Shell Interface: Like the MySQL shell

Driver:
Session handles, fetch, execution
Compiler:
Parse, plan, optimize
Execution Engine:
DAG of stages
Runs map or reduce tasks

Motivation

MapReduce Motivation
Data processing: > 1 TB
Massively parallel
Locality
Fault Tolerant


Hive Usage

hive> show tables;

hive> create table shakespeare (freq INT, word STRING)
      row format delimited fields terminated by '\t' stored as textfile;

hive> load data inpath 'shakespeare_freq' into table shakespeare;

Hive Usage

hive> load data inpath 'shakespeare_freq' into table shakespeare;

hive> select * from shakespeare where freq > 100 sort by freq asc limit 10;

Hive Usage @ Facebook
Statistics per day:
4 TB of compressed new data added per day
135 TB of compressed data scanned per day
7500+ Hive jobs run per day
Hive simplifies Hadoop:
~200 people/month run jobs on Hadoop/Hive
Analysts (non-engineers) use Hadoop through Hive
95% of jobs are Hive jobs
http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
