05b Hive

HIVE is a data warehousing system designed to manage and query unstructured data as if it were structured, utilizing Hadoop's file system for storage and Map-Reduce for execution. It was developed at Facebook to handle the exponential growth of data, providing a familiar SQL interface and extensibility through user-defined functions and types. Key components include a shell for interactive queries, a driver for session management, a compiler for query optimization, and a metastore for schema management.


HIVE

 
Why Another Data Warehousing System?
— Problem: data, data, and more data
— Several TBs of new data every day

— The Hadoop experiment:
— Uses the Hadoop File System (HDFS)
— Scalable/available

— Problem
— Lacked expressiveness
— Map-Reduce is hard to program

— Solution: HIVE
Copyright Ellis Horowitz, 2011 - 2012
What is HIVE?
— A system for managing and querying unstructured data as if it were structured
— Uses Map-Reduce for execution
— HDFS for storage
— Key building principles
— SQL as a familiar data warehousing tool
— Extensibility (pluggable map/reduce scripts in the language of your choice; rich and user-defined data types; user-defined functions)
— Interoperability (extensible framework to support different file and data formats)
— Performance



Hive: Background  
— Started at Facebook
— Data was collected by nightly cron jobs into an Oracle DB
— "ETL" via hand-coded Python
— Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that



Source: cc-licensed slide by Cloudera
Hive Components  
— Shell: allows interactive queries
— Driver: session handles, fetch, execute
— Compiler: parse, plan, optimize
— Execution engine: DAG of stages (MR, HDFS,
metadata)
— Metastore: schema, location in HDFS, etc



Source: cc-licensed slide by Cloudera
Data Model  
— Tables
— Typed columns (int, float, string, boolean)
— Also, list: map (for JSON-like data)
— Partitions
— For example, range-partition tables by date
— Buckets
— Hash partitions within ranges (useful for sampling, join
optimization)



Source: cc-licensed slide by Cloudera
Type System
— Primitive types
– Integers: TINYINT, SMALLINT, INT, BIGINT
– Boolean: BOOLEAN
– Floating point numbers: FLOAT, DOUBLE
– String: STRING
— Complex types
– Structs: {a INT; b INT}
– Maps: M['group']
– Arrays: ['a', 'b', 'c']; A[1] returns 'b'
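As a sketch (the table and column names below are made up for illustration), complex types can be declared and queried like this:

```sql
-- Hypothetical table mixing primitive and complex types
CREATE TABLE user_events (
  userid    BIGINT,
  props     MAP<STRING, STRING>,           -- accessed as props['group']
  tags      ARRAY<STRING>,                 -- accessed as tags[1]
  location  STRUCT<city:STRING, zip:INT>   -- accessed as location.city
);

-- Accessing complex-typed columns in a query
SELECT userid, props['group'], tags[1], location.city
FROM user_events;
```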



Data Model - Tables
— Tables
— Analogous to tables in relational DBs
— Each table has a corresponding directory in HDFS
— Example
— Page view table name – pvs
— HDFS directory
— /wh/pvs

— Example:
 CREATE TABLE t1(ds string, ctry float, li list<map<string, struct<p1:int, p2:int>>>);


Data Model - Partitions
— Partitions
— Analogous to dense indexes on partition columns
— Nested sub-directories in HDFS for each combination of partition column values
— Allows users to efficiently retrieve rows
— Example
— Partition columns: ds, ctry
— HDFS for ds=20120410, ctry=US
— /wh/pvs/ds=20120410/ctry=US

— HDFS for ds=20120410, ctry=IN
— /wh/pvs/ds=20120410/ctry=IN



Hive Query Language – Contd.
— Partitioning – creating partitions (note that partition columns are declared separately from the data columns, not repeated among them)

CREATE TABLE test_part(c1 string, c2 int)
PARTITIONED BY (ds string, hr int);

— INSERT OVERWRITE TABLE
test_part PARTITION(ds='2009-01-01', hr=12)
SELECT * FROM t;

— ALTER TABLE test_part
ADD PARTITION(ds='2009-02-02', hr=11);



Partitioning - Contd.
SELECT * FROM test_part WHERE ds='2009-01-01';

— will only scan the files within the
/user/hive/warehouse/test_part/ds=2009-01-01 directory

SELECT * FROM test_part
WHERE ds='2009-02-02' AND hr=11;

— will only scan the files within the
/user/hive/warehouse/test_part/ds=2009-02-02/hr=11 directory



Data Model
— Buckets
— Split data based on the hash of a column – mainly for parallelism
— Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table
— Example
— Bucket column: user, into 32 buckets
— HDFS file for user hash bucket 0
— /wh/pvs/ds=20120410/cntr=US/part-00000

— HDFS file for user hash bucket 20
— /wh/pvs/ds=20120410/cntr=US/part-00020
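A bucketed table is declared with CLUSTERED BY; the sketch below assumes illustrative column names for the page-view table:

```sql
-- Hash-partition rows on userid into 32 buckets within each partition
CREATE TABLE pvs (userid BIGINT, page STRING)
PARTITIONED BY (ds STRING, ctry STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;
```

Each bucket then maps to one part-file per partition directory, which is what enables efficient sampling and join optimization.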



Data Model
— External Tables
— Point to existing data directories in HDFS
— Can create tables and partitions
— Data is assumed to be in a Hive-compatible format
— Dropping an external table drops only the metadata
— Example: create external table
 CREATE EXTERNAL TABLE test_extern(c1 string, c2 int)
 LOCATION '/user/mytables/mydata';
 
 
 



Serialization/Deserialization
— Generic (de)serialization interface: SerDe
— Uses LazySerDe
— Flexible interface to translate unstructured data into structured data
— Designed to read data separated by different delimiter characters
— Additional SerDes are located in 'hive_contrib.jar'
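As a sketch of using a contrib SerDe (the jar path, table, and regex here are illustrative and would need adapting to your data):

```sql
-- Load the contrib jar so its SerDe classes are on the classpath
ADD JAR hive_contrib.jar;

-- Parse patterned text lines into columns with RegexSerDe
CREATE TABLE apache_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) \"([^\"]*)\""
)
STORED AS TEXTFILE;
```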



Hive Tables
— Two types of tables
— External table
— Table created on top of existing data
— delete the table ⇒ data still persists
— Normal (managed) table
— Table's location is in Hive's default location

— delete the table ⇒ data is gone



Create Table
— Example row: Employee1 | Name1 | Address1 | Phone1
— CREATE EXTERNAL TABLE employee(Key1 String, Name String, Address String, Phone String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/….';



Hive File Formats
— Hive lets users store data in different file formats
— Helps with performance improvements
— SQL example:
CREATE TABLE dest1(key INT, value STRING)
STORED AS
INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat';



System  Architecture  and  Components  



System Architecture and Components

[Diagram: CLI, Web Interface, and JDBC/ODBC clients (via the Thrift Server)
connect to the Driver (Compiler, Optimizer, Executor), which consults the
Metastore]

• Metastore
The component that stores the system catalog and metadata about tables, columns, partitions, etc.
Stored in a traditional RDBMS.
• Driver
The component that manages the lifecycle of a HiveQL statement as it moves through Hive. The driver also maintains a session handle and any session statistics.
• Query Compiler
The component that compiles HiveQL into a directed acyclic graph of map/reduce tasks.
• Optimizer
Consists of a chain of transformations such that the operator DAG resulting from one transformation is passed as input to the next. Performs tasks like column pruning, partition pruning, and repartitioning of data.
 
 
• Execution Engine
The component that executes the tasks produced by the compiler in proper dependency order. The execution engine interacts with the underlying Hadoop instance.
• HiveServer
The component that provides a Thrift interface and a JDBC/ODBC server, and provides a way of integrating Hive with other applications.
• Client Components
Client components include the Command Line Interface (CLI), the web UI, and the JDBC/ODBC driver.
Hive Query Language
— Basic SQL
— FROM clause sub-queries
— ANSI JOIN (equi-join only)
— Multi-table insert
— Multi group-by
— Sampling
— Object traversal
— Extensibility
— Pluggable map-reduce scripts using TRANSFORM
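Sampling, for instance, can be expressed with TABLESAMPLE over a bucketed table (the table and column names below are illustrative):

```sql
-- Read only bucket 1 of 32, hashing rows on the userid column;
-- on a table bucketed by userid this touches a single part-file
SELECT *
FROM pvs TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);
```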



Hive Query Language
— JOIN

SELECT t1.a1 AS c1, t2.b1 AS c2
FROM t1 JOIN t2 ON (t1.a2 = t2.b2);

— INSERTION

INSERT OVERWRITE TABLE t1
SELECT * FROM t2;



Hive Query Language – Contd.
— Insertion

INSERT OVERWRITE TABLE sample1 SELECT * FROM sample
WHERE ds='2012-02-24';

INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample
WHERE ds='2012-02-24';

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT *
FROM sample;



Hive Query Language – Contd.
— Map Reduce

FROM (MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
FROM docs
CLUSTER BY word
)
REDUCE word, cnt USING 'python wc_reduce.py';

— FROM (FROM session_table
SELECT sessionid, tstamp, data
DISTRIBUTE BY sessionid SORT BY tstamp
)
REDUCE sessionid, tstamp, data USING 'session_reducer.sh';



Hive Query Language
— Example of a multi-table insert query and its optimization
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND a.ds='2009-03-20')) subq1

INSERT OVERWRITE TABLE gender_summary
      PARTITION(ds='2009-03-20')
SELECT subq1.gender, COUNT(1)
GROUP BY subq1.gender

INSERT OVERWRITE TABLE school_summary
      PARTITION(ds='2009-03-20')
SELECT subq1.school, COUNT(1)
GROUP BY subq1.school



Hive:  Example  
— Hive looks similar to an SQL database
— Relational join on two tables:
— Table of word counts from Shakespeare collection
— Table of word counts from Homer
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN homer k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;

word   s.freq  k.freq
the    25848   62394
I      23031    8854
and    19671   38985
to     18038   13526
of     16700   34654
a      14170    8057
you    12702    2720
my     11297    4135
in     10797   12445
is      8882    6884
Source: Material drawn from Cloudera training VM
Hive:  Behind  the  Scenes  
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN homer k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;

(Abstract Syntax Tree)


(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF shakespeare s) (TOK_TABREF homer k) (= (. (TOK_TABLE_OR_COL
s) word) (. (TOK_TABLE_OR_COL k) word)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR (. (TOK_TABLE_OR_COL s) word)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) freq)) (TOK_SELEXPR (.
(TOK_TABLE_OR_COL k) freq))) (TOK_WHERE (AND (>= (. (TOK_TABLE_OR_COL s) freq) 1) (>= (. (TOK_TABLE_OR_COL k)
freq) 1))) (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC (. (TOK_TABLE_OR_COL s) freq))) (TOK_LIMIT 10)))

(compiles to one or more MapReduce jobs)


Metastore  
— Database: namespace containing a set of tables
— Holds table definitions (column types, physical
layout)
— Holds partitioning information
— Can be stored in Derby, MySQL, and many other
relational databases
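As a sketch, pointing the metastore at an external RDBMS is done in hive-site.xml; the host, database name, and driver below are placeholders for your own deployment:

```xml
<!-- Illustrative hive-site.xml fragment: back the metastore with MySQL -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```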

Source: cc-licensed slide by Cloudera


Physical  Layout  
— Warehouse directory in HDFS
— E.g., /user/hive/warehouse
— Tables stored in subdirectories of warehouse
— Partitions form subdirectories of tables
— Actual data stored in flat files
— Control char-delimited text, or SequenceFiles
— With custom SerDe, can use arbitrary format
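The delimited-text layout above can be made explicit in the DDL; this sketch uses made-up table and column names, with Hive's default Ctrl-A field delimiter spelled out:

```sql
-- Plain text storage with an explicit field delimiter
-- ('\001' / Ctrl-A is Hive's default for delimited text)
CREATE TABLE clicks (userid BIGINT, url STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;
```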

Source: cc-licensed slide by Cloudera


Hive Usage @ Facebook
¬ Statistics per day:
¬ 4 TB of compressed new data added per day
¬ 135TB of compressed data scanned per day
¬ 7500+ Hive jobs on per day
¬ Hive simplifies Hadoop:
¬ ~200 people/month run jobs on Hadoop/Hive
¬ Analysts (non-engineers) use Hadoop through
Hive
¬ 95% of jobs are Hive Jobs

http://www.slideshare.net/cloudera/hw09-hadoop-
7/20/2010
development-at-facebook-hive-and-hdfs
Introduction to Hive 36
Conclusion
— Pros
— Good explanation of Hive and HiveQL with proper examples
— Architecture is well explained
— Usage of Hive is properly covered
— Cons
— Accepts only a subset of SQL queries
— Performance comparisons with other systems would have been welcome

