
Emailing Hive PDF

Uploaded by

babel 8
Copyright © All Rights Reserved

CHAPTER 9

Introduction to Hive

BRIEF CONTENTS

"Information is the oil of the 21st century, and analytics is the combustion engine."
- Peter Sondergaard, Gartner Research
WHAT'S IN STORE?
We assume that you are already familiar with commercial database systems. In this chapter, we will use that knowledge as our base to build a structure on Hadoop for effective analysis. We will discuss the importance of Hive with the help of use cases. We will also enrich your knowledge by working with it.
We suggest you refer to some of the learning resources suggested at the end of this chapter and also complete the "Test Me" exercises.

About the Company


TENTOTEN is a retail store which has a chain of hypermarkets in India. They have 250+ stores across 95 cities and towns. About 45,000+ people are working in TENTOTEN. TENTOTEN deals in a wide range of products including fashion apparels, food products, books, furniture, etc. Around 100+ customers visit and/or purchase products every day from each of these stores.

Problem Scenario
The approximate size of TENTOTEN log datasets is 12 TB. Information about the various stores is stored in the form of semi-structured data. Traditional Business Intelligence (BI) tools are good when data is present in a pre-defined schema and datasets are just several hundreds of gigabytes. But the TENTOTEN dataset is mostly log dataset, which does not conform to any particular schema. Querying such a large dataset is difficult and immensely time consuming.
The challenges are:

1. Moving the log dataset to HDFS (Hadoop Distributed File System).


2. Performing analysis on HDFS data.

Hadoop MapReduce can be used to resolve these issues. However, we will still have to deal with the below constraints:

1. Writing complex MapReduce jobs in Java can be tedious and error prone.
2. Joining across large datasets is quite tricky.

Enter Hive to counter the above challenges.

9.1 WHAT IS HIVE?

Hive is a Data Warehousing tool. Refer Figure 9.1. Hive is used to query structured data built on top of Hadoop. Facebook created Hive component to manage their ever-growing volumes of log data. Hive makes use of the following:

1. HDFS for Storage.
2. MapReduce for execution.
3. Stores metadata in an RDBMS.

Figure 9.1 Hive - a data warehousing tool.

Figure 9.2 History of Hive. (2007: Hive was born at Facebook to analyze their incoming log data.)

Figure 9.3 Recent releases of Hive.

Hive provides HQL (Hive Query Language) which is similar to SQL. Hive compiles SQL queries into MapReduce jobs and then runs the job in the Hadoop Cluster. Hive provides extensive data type functions and formats for data summarization and analysis.
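
To make the idea of compiling a query into a MapReduce job more concrete, the following minimal Python sketch mimics how a grouped count (something like a GROUP BY with count) could be split into map, shuffle, and reduce phases. This is only an illustration of the general MapReduce pattern, not Hive's actual planner; the function names and the sample log rows are hypothetical.

```python
from collections import defaultdict


def map_phase(rows, key_index):
    """Emit a (key, 1) pair for every input row, as a mapper would."""
    for row in rows:
        yield row[key_index], 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key, giving one output row per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by(rows, key_index):
    """Run the three phases in order for a grouped count."""
    return reduce_phase(shuffle_phase(map_phase(rows, key_index)))


# Hypothetical log rows: (store_id, event)
log_rows = [("S1", "visit"), ("S2", "purchase"), ("S1", "purchase"), ("S1", "visit")]
events_per_type = count_by(log_rows, 1)
events_per_store = count_by(log_rows, 0)
print(events_per_type)
print(events_per_store)
```

With HQL, the user writes only the query; the phases sketched above are generated and run on the cluster on the user's behalf.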

9.1.1 History of Hive and Recent Releases of Hive


The history of Hive and recent releases of Hive are illustrated pictorially in Figures 9.2 and 9.3, respectively.

9.1.2 Hive Features


1. It is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as structs, lists, and maps.
4. Hive supports SQL filters, group-by and order-by clauses.
5. Custom Types, Custom Functions can be defined.

9.1.3 Hive Integration and Work Flow


Figure 9.4 depicts the flow of log file analysis.
Hourly Log Data can be stored directly into HDFS and then data cleansing is performed on the log file. Finally, Hive table(s) can be created to query the log file.

9.1.4 Hive Data Units


1. Databases: The namespace for tables.
2. Tables: Set of records that have similar schema.
3. Partitions: Logical separations of data based on classification of given information as per specific attributes. Once Hive has partitioned the data based on a specified key, it starts to assemble the records into specific folders as and when the records are inserted.
4. Buckets (or Clusters): Similar to partitions but uses hash function to segregate data and determines the cluster or bucket into which the record should be placed.

Figure 9.5 shows how these data units are arranged in a Hive Cluster.
Figure 9.6 describes the semblance of Hive structure with database.
A database contains several tables. Each table is constituted of rows and columns. In Hive, tables are stored as a folder and partition tables are stored as a sub-directory. Bucketed tables are stored as a file.

Figure 9.4 Flow of log analysis file.

Figure 9.5 Data units as arranged in a Hive. (Database, Tables, Partitions, Columns)

Figure 9.6 Semblance of Hive structure with database. (Database, Directory, Partitions, Rows, Files)

Figure 9.7 Hive architecture.

9.2 Hive Architecture


Hive Architecture is depicted in Figure 9.7. The various parts are as follows:

1. Hive Command-Line Interface (Hive CLI): The most commonly used interface to interact with Hive.
2. Hive Web Interface: It is a simple Graphic User Interface to interact with Hive and to execute query.
3. Hive Server: This is an optional server. This can be used to submit Hive Jobs from a remote client.
4. JDBC/ODBC: Jobs can be submitted from a JDBC Client. One can write a Java code to connect to Hive and submit jobs on it.
5. Driver: Hive queries are sent to the driver for compilation, optimization and execution.
6. Metastore: Hive table definitions and mappings to the data are stored in a Metastore. A Metastore consists of the following:
   • Metastore service: Offers interface to the Hive.
   • Database: Stores data definitions, mappings to the data and others.

The metadata which is stored in the metastore includes IDs of Database, IDs of Tables, IDs of Indexes, etc., the time of creation of a Table, the Input Format used for a Table, the Output Format used for a Table, etc. The metastore is updated whenever a table is created or deleted from Hive. There are three kinds of metastore.

1. Embedded Metastore: This metastore is mainly used for unit tests. Here, only one process is allowed to connect to the metastore at a time. This is the default metastore for Hive. It is Apache Derby Database. In this metastore, both the database and the metastore service run embedded in the main Hive Server process. Figure 9.8 shows an Embedded Metastore.
2. Local Metastore: Metadata can be stored in any RDBMS component like MySQL. Local metastore allows multiple connections at a time. In this mode, the Hive metastore service runs in the main Hive Server process, but the metastore database runs in a separate process, and can be on a separate host. Figure 9.9 shows a Local Metastore.
3. Remote Metastore: In this, the Hive driver and the metastore interface run on different JVMs (which can run on different machines as well) as in Figure 9.10. This way the database can be fire-walled from the Hive user and also database credentials are completely isolated from the users of Hive.
Figure 9.8 Embedded Metastore.

Figure 9.9 Local Metastore.

Figure 9.10 Remote Metastore.

9.3 DATA TYPES

9.3.1 Primitive Data Types

Numeric Data Type
TINYINT     1-byte signed integer
SMALLINT    2-byte signed integer
INT         4-byte signed integer
BIGINT      8-byte signed integer
FLOAT       4-byte single-precision floating-point
DOUBLE      8-byte double-precision floating-point number

String Types
STRING
VARCHAR     Only available starting with Hive 0.12.0
CHAR        Only available starting with Hive 0.13.0
Strings can be expressed in either single quotes (') or double quotes (")

Miscellaneous Types
BOOLEAN
BINARY      Only available starting with Hive

9.3.2 Collection Data Types


Collection Data Types
STRUCT    Similar to 'C' struct. Fields are accessed using dot notation. E.g.: struct('John', 'Doe')
MAP       A collection of key-value pairs. Fields are accessed using [] notation. E.g.: map('first', 'John', 'last', 'Doe')
ARRAY     Ordered sequence of same types. Fields are accessed using array index. E.g.: array('John', 'Doe')

9.4 HIVE FILE FORMAT


The file formats in Hive specify how records are encoded in a file.

9.4.1 Text File


The default file format is text file. In this format, each record is a line in the file. In text file, different control characters are used as delimiters. The delimiters are ^A (octal 001, separates all fields), ^B (octal 002, separates the elements in the array or struct), ^C (octal 003, separates key-value pair), and \n. The term field is used when overriding the default delimiter. The supported text files are CSV and TSV. JSON or XML documents too can be specified as text file.

9.4.2 Sequential File

Sequential files are flat files that store binary key-value pairs. It includes compression support which reduces the CPU, I/O requirement.

9.4.3 RCFile (Record Columnar File)


RCFile stores the data in Column Oriented Manner which ensures that Aggregation operation is not an expensive operation. For example, consider a table which contains four columns as shown in Table 9.1.

Instead of only partitioning the table horizontally like the row-oriented DBMS (row-store), RCFile partitions this table first horizontally and then vertically to serialize the data. Based on the user-specified value, first the table is partitioned into multiple row groups horizontally. Depicted in Table 9.2, the table shown in Table 9.1 is partitioned into two row groups by considering three rows as the size of each row group.
Next, in every row group RCFile partitions the data vertically like column-store. So the table will be serialized as shown in Table 9.3.

Table 9.1 A table with four columns

C1  C2  C3  C4
11  12  13  14
21  22  23  24
31  32  33  34
41  42  43  44
51  52  53  54

Table 9.2 Table with two row groups

Row Group 1          Row Group 2
C1  C2  C3  C4       C1  C2  C3  C4
11  12  13  14       41  42  43  44
21  22  23  24       51  52  53  54
31  32  33  34

Table 9.3 Table in RCFile Format

Row Group 1          Row Group 2
11, 21, 31;          41, 51;
12, 22, 32;          42, 52;
13, 23, 33;          43, 53;
14, 24, 34;          44, 54;
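
The two-step layout of Tables 9.2 and 9.3 can be reproduced with a few lines of Python. The sketch below is only a text-level illustration of "split into row groups, then store each group column by column"; the real RCFile is a binary format with its own headers and compression, and the function names used here are hypothetical.

```python
def to_row_groups(rows, group_size):
    """Horizontal partitioning: cut the table into groups of group_size rows."""
    return [rows[i:i + group_size] for i in range(0, len(rows), group_size)]


def serialize_row_group(group):
    """Vertical partitioning: write the values of each column together."""
    columns = zip(*group)  # transpose rows into columns
    return [", ".join(str(value) for value in column) + ";" for column in columns]


# The table of Table 9.1
table = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
    [51, 52, 53, 54],
]

# Three rows per row group, as in Table 9.2
layout = [serialize_row_group(group) for group in to_row_groups(table, 3)]
for number, group in enumerate(layout, start=1):
    print("Row Group", number)
    for line in group:
        print(line)
```

Because all values of a column sit next to each other inside a row group, an aggregation over one column needs to read only that column's run of values.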

9.5 HIVE QUERY LANGUAGE (HQL)

Hive query language provides basic SQL like operations. Here are few of the tasks which HQL can do easily.

1. Create and manage tables and partitions.
2. Support various Relational, Arithmetic, and Logical Operators.
3. Evaluate functions.
4. Download the contents of a table to a local directory or result of queries to HDFS directory.

9.5.1 DDL (Data Definition Language) Statements

These statements are used to build and modify the tables and other objects in the database. The DDL commands are as follows:

1. Create/Drop/Alter Database
2. Create/Drop/Truncate Table
3. Alter Table/Partition/Column
4. Create/Drop/Alter View
5. Create/Drop/Alter Index
6. Show
7. Describe

9.5.2 DML (Data Manipulation Language) Statements

These statements are used to retrieve, store, modify, delete, and update data in database. The DML
commands are as follows:

1. Loading files into table.


2. Inserting data into Hive Tables from queries.
Note: Hive 0.14 supports update, delete, and transaction operations.

9.5.3 Starting Hive Shell


To start Hive, go to the installation path of Hive and type as below:
[root@volgalnx005 ~]# hive

Logging initialized using configuration in jar:file:/root/Desktop/VMDATA/Hive/hive/lib/hive-common-0.14.0.jar!/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/Desktop/VMDATA/Hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/Desktop/VMDATA/Hive/hive/lib/hive-jdbc-0.14.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hive>

The sections have been designed as follows:
Objective: What is it that we are trying to achieve here?
Input (optional): What is the input that has been given to us to act upon?
Act: The actual statement/ command to accomplish the task at hand.
Outcome: The result/output as a consequence of executing the statement.

9.5.5 Tables

Hive provides two kinds of table: Managed and External Table.


9.5.5.1 Managed Table

1. Hive stores the Managed tables under the warehouse folder under Hive.
2. The complete life cycle of table and data is managed by Hive.
3. When the internal table is dropped, it drops the data as well as the metadata.

• Objective: To create managed table named 'STUDENT'.

Act:
CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Outcome:
hive> CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.355 seconds
hive>

I
Objective: To describe the "STUDENT" table.

Act:
DESCRIBE STUDENT;

Outcome:
hive> DESCRIBE STUDENT;
OK
rollno    int
name      string
gpa       float
Time taken: 0.163 seconds, Fetched: 3 row(s)
hive>

Note: Hive creates managed table in the Warehouse directory of Hive as shown below.

[Screenshot: contents of the Hive warehouse directory showing students.db]

9.5.5.2 External or Self-Managed Table

1. When the table is dropped, it retains the data in the underlying location.
2. External keyword is used to create an external table.
3. Location needs to be specified to store the dataset in that particular location.

• Objective: To create external table named 'EXT_STUDENT'.

Act:
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/STUDENT_INFO';

Outcome:
hive> CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/STUDENT_INFO';
OK
Time taken: 0.123 seconds
hive>

Note: Hive creates the external table in the specified location.



9.5.5.3 Loading Data into Table from File

Objective: To load data into the table from file named student.tsv.

Act:
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT;

Note: Local keyword is used to load the data from the local file system. To load the data from HDFS, remove local keyword from the statement.

Outcome:
hive> LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT;
Loading data to table students.ext_student
Table students.ext_student stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0]
OK
Time taken: 5.034 seconds
hive>

Hive loads the file in the specified location as shown below:

[Screenshot: contents of directory /STUDENT_INFO]


9.5.5.4 Collection Data Types

• Objective: To work with collection data types.


Input:
1001,John,Smith:Jones,Mark1!45:Mark2!46:Mark3!43
1002,Jack,Smith:Jones,Mark1!46:Mark2!47:Mark3!42

Act:
CREATE TABLE STUDENT_INFO (rollno INT,name String, sub ARRAY<STRING>,marks MAP<STRING,INT>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':'
MAP KEYS TERMINATED BY '!';
LOAD DATA LOCAL INPATH '/root/hivedemos/studentinfo.csv' INTO TABLE STUDENT_INFO;

Outcome:
hive> CREATE TABLE STUDENT_INFO (rollno INT,name String, sub ARRAY<STRING>,marks MAP<STRING,FLOAT>)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY ':'
    > MAP KEYS TERMINATED BY '!';
OK
Time taken: 0.112 seconds
hive> LOAD DATA LOCAL INPATH '/root/hivedemos/studentinfo.csv' INTO TABLE STUDENT_INFO;
Loading data to table students.student_info
Table students.student_info stats: [numFiles=1, totalSize=109]
OK
Time taken: 0.397 seconds
hive>
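
To see what the three delimiters of the table definition do to one input line, here is a small Python sketch that splits a record the same way: ',' between fields, ':' between collection items, and '!' between a map key and its value. It only mirrors the delimiter logic for this particular four-column layout; it is not how Hive itself reads files, and the function name is hypothetical.

```python
def parse_student_info(line, field_sep=",", item_sep=":", key_sep="!"):
    """Split one text record into rollno, name, an array and a map."""
    rollno, name, sub, marks = line.split(field_sep)
    subjects = sub.split(item_sep)  # ARRAY<STRING>
    mark_map = {}  # MAP<STRING,INT>
    for entry in marks.split(item_sep):
        key, value = entry.split(key_sep)
        mark_map[key] = int(value)
    return {"rollno": int(rollno), "name": name, "sub": subjects, "marks": mark_map}


record = parse_student_info("1001,John,Smith:Jones,Mark1!45:Mark2!46:Mark3!43")
print(record["sub"])             # the array column
print(record["marks"]["Mark1"])  # map lookup by key
print(record["sub"][0])          # array lookup by index
```

The last two lookups correspond to the [] notation for maps and the index notation for arrays described in Section 9.3.2.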

9.5.5.5 Querying Table

• Objective: To retrieve the student details from "EXT_STUDENT" table.


Act:

SELECT * from EXT_STUDENT;

Objective: Querying Collection Data Types.

Act:
SELECT * from STUDENT_INFO;
SELECT NAME,SUB FROM STUDENT_INFO;
// To retrieve value of Mark1
SELECT NAME, MARKS['Mark1'] from STUDENT_INFO;
// To retrieve subordinate (array) value
SELECT NAME,SUB[0] FROM STUDENT_INFO;

Outcome:
hive> SELECT * from STUDENT_INFO;
OK
1001    John    ["Smith","Jones"]    {"Mark1":45,"Mark2":46,"Mark3":43}
1002    Jack    ["Smith","Jones"]    {"Mark1":46,"Mark2":47,"Mark3":42}
Time taken: 0.04 seconds, Fetched: 2 row(s)
hive> SELECT NAME,SUB FROM STUDENT_INFO;
OK
John    ["Smith","Jones"]
Jack    ["Smith","Jones"]
Time taken: 0.061 seconds, Fetched: 2 row(s)
hive> SELECT NAME, MARKS['Mark1'] from STUDENT_INFO;
OK
John    45
Jack    46
Time taken: 0.06 seconds, Fetched: 2 row(s)
hive> SELECT NAME,SUB[0] FROM STUDENT_INFO;
OK
John    Smith
Jack    Smith
Time taken: 0.071 seconds, Fetched: 2 row(s)
hive>

9.5.6 Partitions
In Hive, the query reads the entire dataset even though a where clause filter is specified on a particular column. This becomes a bottleneck in most of the MapReduce jobs as it involves huge degree of I/O. So it is necessary to reduce I/O required by the MapReduce job to improve the performance of the query. A very common method to reduce I/O is data partitioning.
Partitions split the larger dataset into more meaningful chunks.
Hive provides two kinds of partitions: Static Partition and Dynamic Partition.
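
The saving in I/O can be pictured with a short Python sketch. It models a partitioned table as a set of "directories" named key=value and shows that a filter on the partition column touches only one of them. This is a simplified model under stated assumptions, not Hive's storage code; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def write_partitioned(rows, partition_key):
    """Place each row in a 'directory' named <key>=<value>."""
    directories = defaultdict(list)
    for row in rows:
        directories[f"{partition_key}={row[partition_key]}"].append(row)
    return dict(directories)


def read_partition(directories, partition_key, value):
    """Read only the directory that matches the filter; report rows scanned."""
    rows = directories.get(f"{partition_key}={value}", [])
    return rows, len(rows)


# Hypothetical student rows
students = [
    {"rollno": 1, "name": "A", "gpa": 4.0},
    {"rollno": 2, "name": "B", "gpa": 3.5},
    {"rollno": 3, "name": "C", "gpa": 4.0},
    {"rollno": 4, "name": "D", "gpa": 4.2},
]

directories = write_partitioned(students, "gpa")
matching, scanned = read_partition(directories, "gpa", 4.0)
print(sorted(directories))  # one directory per distinct gpa value
print(scanned, "of", len(students), "rows scanned")
```

Without partitions, the same filter would have to scan all four rows; with them, only the rows under gpa=4.0 are read.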

9.5.6.1 Static Partition

Static partitions comprise columns whose values are known at compile time.

Objective: To create static partition based on "gpa" column.

Act:
CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT (rollno INT, name STRING) PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Outcome:
hive> CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT(rollno INT,name STRING) PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.105 seconds
hive>

• Objective: Load data into partition table from table.

Act:
INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa =4.0) SELECT rollno, name from EXT_STUDENT where gpa=4.0;

Outcome:
hive> INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa =4.0) SELECT rollno,name from EXT_STUDENT where gpa=4.0;
Query ID = root_20150224230404_4500d58a-cb21-4912-ba40-788e5cf8f9da
Total jobs = 3

Hive creates the folder for the value specified in the partition.

[Screenshot: contents of the warehouse directory for the static_part_student table in students.db]


Objective: To add one more static partition based on "gpa" column using the "alter" statement.

Act:
ALTER TABLE STATIC_PART_STUDENT ADD PARTITION (gpa=3.5);
INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa =4.0) SELECT rollno,name from EXT_STUDENT where gpa=4.0;

Outcome:
hive> ALTER TABLE STATIC_PART_STUDENT ADD PARTITION (gpa=3.5);
OK
Time taken: 0.166 seconds
hive>

[Screenshot: contents of the warehouse directory students.db/static_part_student]


9.5.6.2 Dynamic Partition

Dynamic partitions have columns whose values are known only at Execution Time.

Objective: To create dynamic partition on column gpa.

Act:
CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT(rollno INT,name STRING) PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Outcome:
hive> CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT(rollno INT,name STRING) PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.166 seconds
hive>


Objective: To load data into a dynamic partition table from table.

Act:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Note: The dynamic partition strict mode requires at least one static partition column. To turn this off, set hive.exec.dynamic.partition.mode=nonstrict
INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa) SELECT rollno,name,gpa from EXT_STUDENT;

Note: Create partition for all values.

9.5.7 Bucketing

Bucketing is similar to partition. However, there is a subtle difference between partition and bucketing. In a partition, you need to create partition for each unique value of the column. This may lead to situations where you may end up with thousands of partitions. This can be avoided by using Bucketing in which you can limit the number of buckets to create. A bucket is a file whereas a partition is a directory.
can limit the number of buc ets to create.

Objective: To learn about bucket in Hive.

Act:
CREATE TABLE IF NOT EXISTS STUDENT (rollno INT,name STRING,grade FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' INTO TABLE STUDENT;
Set below property to enable bucketing.
set hive.enforce.bucketing=true;
// To create a bucketed table having 3 buckets
CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT,name STRING,grade FLOAT)
CLUSTERED BY (grade) into 3 buckets;
// Load data to bucketed table
FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno,name,grade;
// To display content of first bucket
SELECT DISTINCT GRADE FROM STUDENT_BUCKET
TABLESAMPLE(BUCKET 1 OUT OF 3 ON GRADE);

Outcome:
hive> CREATE TABLE IF NOT EXISTS STUDENT (rollno INT,name STRING,grade FLOAT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.101 seconds
hive> LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' INTO TABLE STUDENT;
Loading data to table book.student
Table book.student stats: [numFiles=1, totalSize=145]
OK
Time taken: 0.536 seconds
hive> set hive.enforce.bucketing=true;
hive> CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT,name STRING,grade FLOAT)
    > CLUSTERED BY (grade) into 3 buckets;
OK
Time taken: 0.101 seconds
hive> FROM STUDENT
    > INSERT OVERWRITE TABLE STUDENT_BUCKET
    > SELECT rollno,name,grade;

3 buckets have been created as shown below:


[Screenshot: contents of the warehouse directory for student_bucket]

hive> SELECT DISTINCT GRADE FROM STUDENT_BUCKET
    > TABLESAMPLE (BUCKET 1 OUT OF 3 ON GRADE);
OK
4.0
4.2
Time taken: 21.117 seconds, Fetched: 2 row(s)
hive>

9.5.8 Views
In Hive, view support is available only in version starting from 0.6. Views are purely logical object.

Objective: To create a view table named "STUDENT_VIEW".

Act:
CREATE VIEW STUDENT_VIEW AS SELECT rollno, name FROM EXT_STUDENT;

Outcome:
hive> CREATE VIEW STUDENT_VIEW AS SELECT rollno,name FROM EXT_STUDENT;
OK
Time taken: 0.606 seconds
hive>

Objective: Querying the view "STUDENT_VIEW".

Act:
SELECT * FROM STUDENT_VIEW LIMIT 4;

Outcome:
hive> SELECT * FROM STUDENT_VIEW LIMIT 4;
OK
1001    John
1002    Jack
1003    Smith
1004    Scott
Time taken: 0.279 seconds, Fetched: 4 row(s)
hive>

Objective: To drop the view "STUDENT_VIEW".

Act:
DROP VIEW STUDENT_VIEW;

Outcome:
hive> DROP VIEW STUDENT_VIEW;
OK
Time taken: 0.452 seconds
hive>

9.5. 9 Sub-Query
In Hive, sub-queries are supported only in the FROM clause (Hive 0.12). You need to specify name for sub-
query because every table in a FROM clause has a name. The columns in the sub-query select list should have
unique names. The columns in the subquery select list are available to the outer query just like columns of a table.

Objective: Write a sub-query to count occurrence of similar words in the file.

Act:
CREATE TABLE docs (line STRING);
LOAD DATA LOCAL INPATH '/root/hivedemos/lines.txt' OVERWRITE INTO TABLE docs;
CREATE TABLE word_count AS
SELECT word, count(1) AS count FROM
(SELECT explode (split (line, ' ')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
SELECT * FROM word_count;

Outcome:
hive> CREATE TABLE docs (line STRING);
OK
Time taken: 0.118 seconds
hive> LOAD DATA LOCAL INPATH '/root/hivedemos/lines.txt' OVERWRITE INTO TABLE docs;
Loading data to table students.docs
Table students.docs stats: [numFiles=1, numRows=0, totalSize=91, rawDataSize=0]
OK
Time taken: 2.697 seconds
hive> CREATE TABLE word_count AS
    > SELECT word, count(1) AS count FROM
    > (SELECT explode (split (line, ' ')) AS word FROM docs) w
    > GROUP BY word
    > ORDER BY word;
hive> SELECT * FROM word_count;
OK
Hadoop          2
Hive            2
Introducing     1
Introduction    1
Pig             1
Session         3
Welcome         1
to              2
Time taken: 0.062 seconds, Fetched: 8 row(s)
hive>
Note: The explode() function takes an array as input and outputs the elements of the array as
separate rows.
In Hive 0.13, sub-queries are supported in the where clause as well.
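
The role of explode() inside the sub-query can be traced in plain Python. In the sketch below, the inner generator plays the part of the sub-query (split each line, then emit one word per row) and the outer function groups, counts, and orders the words. The two input lines are hypothetical and are not the contents of lines.txt; the function names are illustrative only.

```python
from collections import Counter


def explode(arrays):
    """Emit every element of every array as a separate row."""
    for array in arrays:
        for element in array:
            yield element


def word_count(lines):
    """Inner step: split and explode. Outer step: group, count, order by word."""
    words = explode(line.split(" ") for line in lines)
    counts = Counter(words)
    return sorted(counts.items())


# Hypothetical input lines
lines = ["Welcome to Hive", "Introduction to Hive"]
result = word_count(lines)
for word, count in result:
    print(word, count)
```

As in the Hive output, capitalized words sort ahead of lowercase ones because the ordering is by character code.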

9.5.10 Joins
Joins in Hive are similar to SQL joins.
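
As a reminder of what an inner join on a key returns, here is a minimal Python sketch. It pairs each row of one table with the rows of the other table that carry the same key and drops rows without a match. It shows the semantics only, not how Hive executes a join on the cluster; the function name and the sample rows are hypothetical.

```python
def inner_join(left_rows, right_rows, key):
    """Return (left, right) pairs whose key values are equal."""
    lookup = {}
    for right in right_rows:
        lookup.setdefault(right[key], []).append(right)
    pairs = []
    for left in left_rows:
        for right in lookup.get(left[key], []):
            pairs.append((left, right))
    return pairs


# Hypothetical rows; rollno 1009 has no department and is dropped by the join
students = [
    {"rollno": 1001, "name": "John", "gpa": 3.0},
    {"rollno": 1002, "name": "Jack", "gpa": 4.0},
    {"rollno": 1009, "name": "Nina", "gpa": 3.7},
]
departments = [
    {"rollno": 1001, "deptno": 101},
    {"rollno": 1002, "deptno": 102},
]

result = [
    (s["rollno"], s["name"], s["gpa"], d["deptno"])
    for s, d in inner_join(students, departments, "rollno")
]
print(result)
```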

Objective: To create JOIN between Student and Department tables where we use RollNo from both
the tables as the join key.

Act:
CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE STUDENT;
CREATE TABLE IF NOT EXISTS DEPARTMENT(rollno INT,deptno int,name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/root/hivedemos/department.tsv' OVERWRITE INTO TABLE DEPARTMENT;
SELECT a.rollno, a.name, a.gpa, b.deptno FROM STUDENT a JOIN DEPARTMENT b ON a.rollno = b.rollno;

Outcome:
hive> CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.115 seconds
hive> LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE STUDENT;
Loading data to table students.student
Table students.student stats: [numFiles=1, numRows=0, totalSize=145, rawDataSize=0]
OK
Time taken: 0.723 seconds
hive> CREATE TABLE IF NOT EXISTS DEPARTMENT(rollno INT,deptno int,name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.099 seconds
hive> LOAD DATA LOCAL INPATH '/root/hivedemos/department.tsv' OVERWRITE INTO TABLE DEPARTMENT;
Loading data to table students.department
Table students.department stats: [numFiles=1, numRows=0, totalSize=120, rawDataSize=0]
OK
Time taken: 0.442 seconds
hive> SELECT a.rollno, a.name, a.gpa, b.deptno FROM STUDENT a JOIN DEPARTMENT b ON a.rollno = b.rollno;
OK
1001    John     3.0    101
1002    Jack     4.0    102
1003    Smith    4.5    103
1004    Scott    4.2    104
1005    Joshi    3.5    105
1006    Alex     4.5    101
1007    David    4.2    104
1008    James    4.0    102
Time taken: 15.282 seconds, Fetched: 8 row(s)
hive>

<
9.5.11 Aggregation
Hive supports aggregation functions like avg, count, etc.

Objective: To write the average and count aggregation function.

Act:
SELECT avg(gpa) FROM STUDENT;
SELECT count(*) FROM STUDENT;

Outcome:
hive> SELECT avg(gpa) FROM STUDENT;
OK
3.839999961853027
Time taken: ... seconds, Fetched: 1 row(s)
hive> SELECT count(*) FROM STUDENT;
OK
10
Time taken: 26.218 seconds, Fetched: 1 row(s)
hive>


Group By and Having
9,5
\
/ a column or columns can be grouped on the basis of values contained therein by using "Group By"·
10
Data. ,, clause is used to filter out groups NOT meeting the specified condition
"I-iaVIOg '
\I
~ g r o u p by and having function.

Act: .
SELECT rollno, name,gp·a FROM STUDENT GROIJP BY rollno,name,gpa HAVING gpa >
4.0; .

Outcome:
1003 smith 4. 5
1004 Scott 4 . 2
1006 Alex 4. 5
1007 David 4. 2
rime taken: 78. 972 seconds , Fetched: 4 row(s)

9.6 RCFILE IMPLEMENTATION

RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on computer clusters.

Objective: To work with RCFILE Format.
Act:
CREATE TABLE STUDENT_RC(rollno int, name string, gpa float) STORED AS RCFILE;
INSERT OVERWRITE TABLE STUDENT_RC SELECT * FROM STUDENT;
SELECT SUM(gpa) FROM STUDENT_RC;

Outcome:
hive> CREATE TABLE STUDENT_RC(rollno int, name string, gpa float) STORED AS RCFILE;
OK
Time taken: 0.093 seconds
hive> INSERT OVERWRITE table STUDENT_RC SELECT * from STUDENT;
hive> SELECT SUM(gpa) from STUDENT_RC;
OK
38.39999961853027
hive>

Note: Stores the data in column oriented manner.
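
To make the column-oriented idea concrete, here is a small Java sketch under stated assumptions. It is not the RCFile format itself: a real RCFile first splits a table into row groups and then stores each row group column by column, with compression and metadata that are left out here. The class ColumnarLayoutSketch, its methods, and the three sample rows are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnarLayoutSketch {
    /**
     * Turns row-oriented records into a column-oriented layout:
     * result.get(c) holds the values of column c for every row.
     */
    static List<List<String>> toColumns(List<String[]> rows, int columnCount) {
        List<List<String>> columns = new ArrayList<List<String>>();
        for (int c = 0; c < columnCount; c++) {
            columns.add(new ArrayList<String>());
        }
        for (String[] row : rows) {
            for (int c = 0; c < columnCount; c++) {
                columns.get(c).add(row[c]);
            }
        }
        return columns;
    }

    /** A SUM over one column reads only that column's values, not whole rows. */
    static double sumColumn(List<String> column) {
        double total = 0.0;
        for (String value : column) {
            total += Double.parseDouble(value);
        }
        return total;
    }

    /** Illustrative rows in (rollno, name, gpa) order. */
    static List<String[]> sampleRows() {
        return Arrays.asList(
            new String[] {"1001", "John", "3.0"},
            new String[] {"1002", "Jack", "4.0"},
            new String[] {"1003", "Smith", "4.5"});
    }

    public static void main(String[] args) {
        List<List<String>> columns = toColumns(sampleRows(), 3);
        System.out.println(columns.get(1));            // all names stored side by side
        System.out.println(sumColumn(columns.get(2))); // only the gpa column is scanned
    }
}
```

This is why a query such as SELECT SUM(gpa) suits a columnar layout: the rollno and name values never need to be read.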

9.7 SERDE

SerDe stands for Serializer/Deserializer.
1. Contains the logic to convert unstructured data into records.
2. Implemented using Java.
3. Serializers are used at the time of writing.
4. Deserializers are used at query time (SELECT Statement).
The Deserializer interface takes a binary representation or string of a record and converts it into a Java object that Hive can then manipulate. The Serializer takes a Java object that Hive has been working with and translates it into something that Hive can write to HDFS.
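
The two directions described above can be sketched in plain Java. This is only an illustration of the idea and does not implement Hive's own SerDe interfaces: the class TabSerDeSketch, its method names, and the three-field (rollno, name, gpa) record are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

class TabSerDeSketch {
    /** Deserialize: stored text becomes Java values the engine can work with (read path). */
    static List<Object> deserialize(String line) {
        String[] fields = line.split("\t", -1);
        return Arrays.<Object>asList(
            Integer.valueOf(fields[0]),  // rollno
            fields[1],                   // name
            Double.valueOf(fields[2]));  // gpa
    }

    /** Serialize: Java values become text that can be written back to storage (write path). */
    static String serialize(List<Object> record) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < record.size(); i++) {
            if (i > 0) {
                out.append('\t');
            }
            out.append(record.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Object> record = deserialize("1003\tSmith\t4.5");
        System.out.println(record);            // typed values, as used at query time
        System.out.println(serialize(record)); // tab-delimited text, as used at write time
    }
}
```

A round trip through deserialize and serialize returns the original line, which is the property a matched serializer and deserializer pair is expected to have.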

Objective: To manipulate the XML data.

Input:
<employee> <empid>1001</empid> <name>John</name> <designation>Team Lead</designation> </employee>
<employee> <empid>1002</empid> <name>Smith</name> <designation>Analyst</designation> </employee>

Act:
CREATE TABLE XMLSAMPLE(xmldata string);
LOAD DATA LOCAL INPATH '/root/hivedemos/input.xml' INTO TABLE XMLSAMPLE;
CREATE TABLE xpath_table AS
SELECT xpath_int(xmldata,'employee/empid'),
xpath_string(xmldata,'employee/name'),
xpath_string(xmldata,'employee/designation')
FROM xmlsample;
SELECT * FROM xpath_table;

Outcome:
hive> CREATE TABLE XMLSAMPLE(xmldata string);
OK
Time taken: 0.244 seconds
hive> LOAD DATA LOCAL INPATH '/root/hivedemos/input.xml' INTO TABLE XMLSAMPLE;
Loading data to table students.xmlsample
Table students.xmlsample stats: [numFiles=1, totalSize=194]
OK
Time taken: 0.889 seconds
hive> CREATE TABLE xpath_table AS
    > SELECT xpath_int(xmldata,'employee/empid'),
    > xpath_string(xmldata,'employee/name'),
    > xpath_string(xmldata,'employee/designation')
    > FROM xmlsample;
hive> SELECT * FROM xpath_table;
1001    John     Team Lead
1002    Smith    Analyst
Time taken: 0.064 seconds, Fetched: 2 row(s)
hive>
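
To see what the two XPath functions do to a single row of xmldata, the same paths can be evaluated with the JDK's javax.xml.xpath package, which the Hive documentation describes as the library behind its xpath UDFs. The sketch below is a rough analogue and not Hive's implementation: the class XPathSketch, its two helper methods, and the SAMPLE constant are hypothetical names, and edge cases such as malformed XML may be handled differently by Hive.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathSketch {
    /** One record, laid out like a line of the input file used in this exercise. */
    static final String SAMPLE =
        "<employee> <empid>1001</empid> <name>John</name> "
        + "<designation>Team Lead</designation> </employee>";

    /** Rough analogue of xpath_string(xml, path): text of the first matching node. */
    static String xpathString(String xml, String path) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(path, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate path: " + path, e);
        }
    }

    /** Rough analogue of xpath_int(xml, path): the match read as a number, narrowed to int. */
    static int xpathInt(String xml, String path) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Double value = (Double) xpath.evaluate(
                path, new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
            return value.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate path: " + path, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(xpathInt(SAMPLE, "employee/empid"));
        System.out.println(xpathString(SAMPLE, "employee/name"));
        System.out.println(xpathString(SAMPLE, "employee/designation"));
    }
}
```

Each call parses one XML string and extracts one value, which mirrors how the CREATE TABLE ... AS SELECT statement turns every row of xmlsample into three typed columns.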

9.8 USER-DEFINED FUNCTION (UDF)


In Hive, you can use custom functions by defining the User-Defined Function (UDF).
Objective: Write a Hive function to convert the values of a field to uppercase.
Act:

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

@Description(
name="SimpleUDFExample")
public final class MyUpperCase extends UDF {
    // Hive calls evaluate() once per row; a NULL column value arrives as null.
    public String evaluate(final String word) {
        return word == null ? null : word.toUpperCase();
    }
}

Note: Convert this Java Program into Jar.

ADD JAR /root/hivedemos/UpperCase.jar;
CREATE TEMPORARY FUNCTION touppercase AS 'com.example.hive.udf.MyUpperCase';
SELECT TOUPPERCASE(name) FROM STUDENT;

Outcome:
hive> ADD JAR /root/hivedemos/UpperCase.jar;
Added [/root/hivedemos/UpperCase.jar] to class path
Added resources: [/root/hivedemos/UpperCase.jar]
hive> CREATE TEMPORARY FUNCTION touppercase AS 'com.example.hive.udf.MyUpperCase';
OK
Time taken: 0.014 seconds
hive> select touppercase(name) from STUDENT;
OK
JOHN
JACK
SMITH
SCOTT
JOSHI
ALEX
DAVID
JAMES
JOHN
JOSHI
Time taken: 0.061 seconds, Fetched: 10 row(s)
hive>

REMIND ME
• Hive is a Data Warehousing tool.
• Hive is used to query structured data built on top of Hadoop.
• Hive provides HQL (Hive Query Language) which is similar to SQL.
• A Hive database contains several tables. Each table is constituted of rows and columns. In Hive, tables are stored as a folder and partition tables are stored as a sub-directory.
• Bucketed tables are stored as a file.

POINT ME (BOOKS)

• Programming Hive, Jason Rutherglen, O'Reilly Publication.

CONNECT ME (INTERNET RESOURCES)
• http://en.wikipedia.org/wiki/RCFile
• https://hive.apache.org/
• https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
• Certified BigData Developer.
TEST ME

A. Match Me
Column A                Column B
HQL                     Web Logs
Database                struct, map
Complex Data Types      Set of records
Hive Application        Hive Query Language
Table                   Namespace

Answers:
Column A                Column B
HQL                     Hive Query Language
Database                Namespace
Complex Data Types      struct, map
Hive Application        Web Logs
Table                   Set of records

B. Fill Me
1. The metastore consists of ____ and a ____.
2. The most commonly used interface to interact with Hive is ____.
3. The default metastore for Hive is ____.
4. Metastore contains ____ of Hive tables.
5. ____ is responsible for compilation, optimization, and execution of Hive queries.

Answers:
1. Metaservices, database
2. Command Line Interface
3. Derby
4. System Catalog
5. Driver
