
Modeling and Managing Data with Impala and Hive
Chapter 6

201509
Course Chapters

1  Introduction                                      (Course Introduction)
2  Introduction to Hadoop and the Hadoop Ecosystem   (Introduction to Hadoop)
3  Hadoop Architecture and HDFS
4  Importing Relational Data with Apache Sqoop
5  Introduction to Impala and Hive                   (Importing and Modeling Structured Data)
6  Modeling and Managing Data with Impala and Hive
7  Data Formats
8  Data File Partitioning
9  Capturing Data with Apache Flume                  (Ingesting Streaming Data)
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications          (Distributed Data Processing with Spark)
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames
18 Conclusion                                        (Course Conclusion)
Modeling  and  Managing  Data  in  Impala  and  Hive  

In  this  chapter  you  will  learn  


§ How  Impala  and  Hive  use  the  Metastore  
§ How  to  use  Impala  SQL  and  HiveQL  DDL  to  create  tables  
§ How  to  create  and  manage  tables  using  Hue  or  HCatalog  
§ How  to  load  data  into  tables  using  Impala,  Hive,  or  Sqoop  

Chapter  Topics  

Modeling and Managing Data with Impala and Hive (Importing and Modeling Structured Data)

§ Data  Storage  Overview  


§ Creating Databases and Tables
§ Loading  Data  into  Tables  
§ HCatalog  
§ Impala  Metadata  Caching  
§ Conclusion  
§ Homework:  Create  and  Populate  Tables  in  Impala  or  Hive    

How  Hive  and  Impala  Load  and  Store  Data  (1)  

§ Queries  operate  on  tables,  just  like  in  an  RDBMS  


– A  table  is  simply  an  HDFS  directory  containing  one  or  more  files  
– Default  path:  /user/hive/warehouse/<table_name>      
– Supports  many  formats  for  data  storage  and  retrieval  
§ What is the structure and location of tables?
– These  are  specified  when  tables  are  created  
– This  metadata  is  stored  in  the  Metastore  
– Contained  in  an  RDBMS  such  as  MySQL  
§ Hive  and  Impala  work  with  the  same  data  
– Tables  in  HDFS,  metadata  in  the  Metastore  

How  Hive  and  Impala  Load  and  Store  Data  (2)  

§ Hive and Impala use the Metastore to determine data format and location
– The  query  itself  operates  on  data  stored  in  HDFS  

[Diagram: a query submitted to the Impala or Hive server uses metadata from the Metastore
(stored in an RDBMS) and operates on table data stored as files in HDFS.]

Data  and  Metadata  

§ Data refers to the information you store and process


– Billing  records,  sensor  readings,  and  server  logs  are  examples  of  data  
§ Metadata  describes  important  aspects  of  that  data  
– Field  name  and  order  are  examples  of  metadata  

Metadata (field names and order):   cust_id | name    | country
Data (the stored records):          001     | Alice   | us
                                    002     | Bob     | ca
                                    003     | Carlos  | mx
                                    ...     | ...     | ...
                                    392     | Maria   | it
                                    393     | Nigel   | uk
                                    394     | Ophelia | dk
                                    ...     | ...     | ...

The  Data  Warehouse  Directory  

§ By  default,  data  is  stored  in  the  HDFS  directory    


/user/hive/warehouse
§ Each  table  is  a  subdirectory  containing  any  number  of  files  
[Example: the customers table (columns cust_id, name, country) maps to the HDFS directory
/user/hive/warehouse/customers. That directory holds two files: file1 contains rows such as
Alice, Bob, Carlos, and Dieter, and file2 contains rows such as Maria, Nigel, Ophelia, and Peter.]

Chapter  Topics  

Modeling and Managing Data with Impala and Hive (Importing and Modeling Structured Data)

§ Data  Storage  Overview  


§ Creating Databases and Tables
§ Loading  Data  into  Tables  
§ HCatalog  
§ Impala  Metadata  Caching  
§ Conclusion  
§ Homework:  Create  and  Populate  Tables  in  Impala  or  Hive    
 

Defining  Databases  and  Tables  

§ Databases and tables are created and managed using the DDL (Data Definition Language) of HiveQL or Impala SQL
– Very  similar  to  standard  SQL  DDL  
– Some  minor  differences  between  Hive  and  Impala  DDL  will  be  noted  

Creating a Database

§ Hive  and  Impala  databases  are  simply  namespaces  


– Helps  to  organize  your  tables  
§ To  create  a  new  database  

CREATE DATABASE loudacre;

1. Adds the database definition to the Metastore
2. Creates a storage directory in HDFS, e.g. /user/hive/warehouse/loudacre.db

§ To conditionally create a new database
  – Avoids an error in case the database already exists (useful for scripting)

CREATE DATABASE IF NOT EXISTS loudacre;

Removing  a  Database  

§ Removing a database is similar to creating it


– Just  replace  CREATE  with  DROP

DROP DATABASE loudacre;

DROP DATABASE IF EXISTS loudacre;

§ These  commands  will  fail  if  the  database  contains  tables  


– In  Hive:  Add  the  CASCADE  keyword  to  force  removal  
– Caution: this command might remove data in HDFS!

DROP DATABASE loudacre CASCADE;

Data  Types  

§ Each  column  is  assigned  a  specific  data  type  


– These  are  specified  when  the  table  is  created  
– NULL  values  are  returned  for  non-­‐conforming  data  in  HDFS  
§ Here  are  some  common  data  types    
Name       Description                      Example Value
STRING     Character data (of any length)   Alice
BOOLEAN    TRUE or FALSE                    TRUE
TIMESTAMP  Instant in time                  2014-03-14 17:01:29
INT        Range: same as Java int          84127213
BIGINT     Range: same as Java long         7613292936514215317
FLOAT      Range: same as Java float        3.14159
DOUBLE     Range: same as Java double       3.1415926535897932385

Hive  (not  Impala)  also  supports  a  few  complex  types  such  


as  maps  and  arrays  
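To make that note concrete, here is a minimal Hive-only sketch; the table name, columns, and delimiters below are illustrative assumptions, not examples from the course:

-- pages:      an ordered list of page names (ARRAY)
-- attributes: arbitrary key/value pairs (MAP)
CREATE TABLE web_sessions (
  session_id STRING,
  pages      ARRAY<STRING>,
  attributes MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':';

As noted above, Impala does not support these complex types, so a table like this would be queried from Hive.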
Creating a Table (1)

§ Basic syntax for creating a table:

CREATE TABLE tablename (colname DATATYPE, ...)


ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|…}

§ Creates  a  subdirectory  in  the  database’s  warehouse  directory  in  HDFS  


– Default  database:    
 /user/hive/warehouse/tablename    
– Named  database:      
/user/hive/warehouse/dbname.db/tablename  

Creating a Table (2)

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE|PARQUET}

Specify a name for the table, and list the column names and datatypes (see later).

Creating a Table (3)

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE|PARQUET}

This line states that fields in each file in the table’s directory are delimited by some
character. The default delimiter is Control-A, but you may specify an alternate delimiter...

Creating a Table (4)

CREATE TABLE tablename (colname DATATYPE, ...)


ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|…}

…for example, tab-delimited data would require that


you specify FIELDS TERMINATED BY '\t'

Creating a Table (5)

CREATE TABLE tablename (colname DATATYPE, ...)


ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|…}

Finally, you may declare the file format. STORED AS


TEXTFILE is the default and does not need to be
specified.
Other formats will be discussed later in the course.

Example Table Definition

§ The  following  example  creates  a  new  table  named  jobs


– Data stored as text with four comma-separated fields per line

CREATE TABLE jobs (


id INT,
title STRING,
salary INT,
posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

– Example  of  corresponding  record  for  the  table  above  

1,Data Analyst,100000,2013-06-21 15:52:03

Creating Tables Based on Existing Schema

§ Use LIKE to create a new table based on an existing table definition

CREATE TABLE jobs_archived LIKE jobs;

§ Column definitions and names are derived from the existing table
  – New table will contain no data

Creating Tables Based on Existing Data

§ Create  a  table  based  on  a  SELECT  statement  


– Often known as ‘Create Table As Select’ (CTAS)
CREATE TABLE ny_customers AS
SELECT cust_id, fname, lname
  FROM customers
WHERE state = 'NY';

§ Column definitions are derived from the existing table
§ Column names are inherited from the existing names
– Use  aliases  in  the  SELECT  statement  to  specify  new  names  
§ New  table  will  contain  the  selected  data  

Controlling Table Data Location

§ By  default,  table  data  is  stored  in  the  warehouse  directory  


§ This  is  not  always  ideal  
– Data  might  be  shared  by  several  users    
§ Use  LOCATION  to  specify  the  directory  where  table  data  resides  

CREATE TABLE jobs (


id INT,
title STRING,
salary INT,
posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/loudacre/jobs';

Externally  Managed  Tables  

§ CAUTION:  Dropping  a  table  removes  its  data  in  HDFS  


– Tables  are  “managed”  or  “internal”  by  default  
§ Using EXTERNAL when creating the table avoids this behavior
– Dropping  an  external  table  removes  only  its  metadata  

CREATE EXTERNAL TABLE adclicks


( campaign_id STRING,
click_time TIMESTAMP,
keyword STRING,
site STRING,
placement STRING,
was_clicked BOOLEAN,
cost SMALLINT)
LOCATION '/loudacre/ad_data';
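To illustrate the difference, a quick sketch using the adclicks table defined above:

-- Removes only the table definition from the Metastore;
-- the data files under /loudacre/ad_data remain in HDFS
DROP TABLE adclicks;

Had adclicks been created as a managed (internal) table instead, the same statement would also delete its directory and data.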

Exploring  Tables  (1)  

§ The  SHOW TABLES  command  lists  all  tables  in  the  current  database  
SHOW TABLES;
+---------------+
| tab_name |
+---------------+
| accounts |
| employees |
| job |
| vendors |
+---------------+

§ The  DESCRIBE  command  lists  the  fields  in  the  specified  table  
  DESCRIBE jobs;
+--------+-----------+---------+
| name | type | comment |
+--------+-----------+---------+
| id | int | |
| title | string | |
| salary | int | |
| posted | timestamp | |
+--------+-----------+---------+

Exploring  Tables  (2)  

§ DESCRIBE FORMATTED also shows table properties


DESCRIBE FORMATTED jobs;
+------------------+-------------------------------------+--------+
| name | type | comment|
+------------------+-------------------------------------+--------+
| # col_name | data_type | comment|
| id | int | NULL |
| title | string | NULL |
| salary | int | NULL |
| posted | timestamp | NULL |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | training | NULL |
| CreateTime: | Wed Jun 17 09:41:23 PDT 2015 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://localhost:8020/loudacre/jobs | NULL |
| Table Type: | MANAGED_TABLE | NULL |

Exploring  Tables  (3)  

§ SHOW CREATE TABLE displays  the  SQL  command  to  create  the  table  

SHOW CREATE TABLE jobs;


+-----------------------------------------------------+
| CREATE TABLE default.jobs ( |
| id INT, |
| title STRING, |
| salary INT, |
| posted TIMESTAMP |
| ) |
| ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' |

Using  the  Hue  Metastore  Manager  

§ The  Hue  Metastore  Manager    


– An alternative to using SQL commands to manage metadata
– Allows  you  to  create,  load,  preview,  and  delete  databases  and  tables  
– Not  all  features  are  supported  yet  

Chapter  Topics  

Modeling and Managing Data with Impala and Hive (Importing and Modeling Structured Data)

§ Data  Storage  Overview  


§ Creating Databases and Tables
§ Loading  Data  into  Tables  
§ HCatalog  
§ Impala  Metadata  Caching  
§ Conclusion  
§ Homework:  Create  and  Populate  Tables  in  Impala  or  Hive    

Data Validation

§ Impala  and  Hive  are  ‘schema  on  read’  


– Unlike  an  RDBMS,  they  do  not  validate  data  on  insert  
– Files  are  simply  moved  into  place  
– Loading  data  into  tables  is  therefore  very  fast  
– Errors  in  file  format  will  be  discovered  when  queries  are  performed  
§ Missing  or  invalid  data  will  be  represented  as  NULL
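A small sketch of this behavior using the jobs table defined earlier; the malformed record below is hypothetical:

-- Suppose a data file in the jobs table's directory contains the line:
--   8,Data Engineer,unknown,2013-07-01 09:15:00
-- The file is loaded without complaint, but 'unknown' does not conform to the
-- INT salary column, so queries return NULL for that field:
SELECT id, title, salary FROM jobs WHERE salary IS NULL;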

Loading  Data  From  HDFS  Files  

§ To  load  data,  simply  add  files  to  the  table’s  directory  in  HDFS  
– Can  be  done  directly  using  the  hdfs dfs  commands  
– This  example  loads  data  from  HDFS  into  the  sales  table  

$ hdfs dfs -mv \


/tmp/sales.txt /user/hive/warehouse/sales/
 
§ Alternatively, use the LOAD DATA INPATH command
– Done  from  within  Hive  or  Impala  
– This  moves  data  within  HDFS,  just  like  the  command  above  
– Source  can  be  either  a  file  or  directory  

LOAD DATA INPATH '/tmp/sales.txt'


INTO TABLE sales;

Overwriting Data from Files

§ Add  the  OVERWRITE  keyword  to  delete  all  records  before  import  
– Removes  all  files  within  the  table’s  directory    
– Then  moves  the  new  files  into  that  directory  

LOAD DATA INPATH '/tmp/sales.txt'


OVERWRITE INTO TABLE sales;

Appending  Selected  Records  to  a  Table  

§ Another  way  to  populate  a  table  is  through  a  query  


– Use INSERT INTO to add results to an existing Hive table

INSERT INTO TABLE accounts_copy


SELECT * FROM accounts;

– Specify  a  WHERE  clause  to  control  which  records  are  appended  

INSERT INTO TABLE loyal_customers


SELECT * FROM accounts
WHERE YEAR(acct_create_dt) = 2008
AND acct_close_dt IS NULL;

Loading  Data  Using  the  Metastore  Manager  

§ The  Metastore  Manager  provides  two  ways  to  load  data  into  a  table  

  – Table creation wizard
  – Import data wizard

Loading Data from a Relational Database

§ Sqoop has built-in support for importing data into Hive and Impala
§ Add the --hive-import option to your Sqoop command
– Creates  the  table  in  the  Hive  metastore  
– Imports  data  from  the  RDBMS  to  the  table’s  directory  in  HDFS  

$ sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--fields-terminated-by '\t' \
--table employees \
--hive-import

– Note  that  --hive-import  creates  a  table  accessible  in  both  Hive  and  
Impala  
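Because --hive-import registers the new table through the Hive Metastore, an Impala session that is already running may need its metadata cache updated before it can see the table (metadata caching is covered later in this chapter). A sketch, using the employees table from the example above:

-- In impala-shell: make Impala aware of the newly imported table, then query it
INVALIDATE METADATA;
SELECT COUNT(*) FROM employees;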

Chapter  Topics  

Modeling and Managing Data with Impala and Hive (Importing and Modeling Structured Data)

§ Data  Storage  Overview  


§ Creating Databases and Tables
§ Loading  Data  into  Tables  
§ HCatalog  
§ Impala  Metadata  Caching  
§ Conclusion  
§ Homework:  Create  and  Populate  Tables  in  Impala  or  Hive    

Impala  in  the  Cluster  

§ Each slave node in the cluster runs an Impala daemon
  – Co-located with the HDFS slave daemon (DataNode)
§ Two other daemons running on master nodes support query execution
  – The State Store daemon
    – Provides lookup service for Impala daemons
    – Periodically checks status of Impala daemons
  – The Catalog daemon
    – Relays metadata changes to all the Impala daemons in a cluster

[Diagram: a master node runs the Catalog Server, State Store, and NameNode; each slave node
runs an Impala daemon co-located with an HDFS DataNode.]

How  Impala  Executes  a  Query  

§ Impala daemon plans the query
  – Client (impala-shell or Hue) connects to a local Impala daemon
    – This is the coordinator
  – Coordinator requests a list of other Impala daemons in the cluster from the State Store
  – Coordinator distributes the query across other Impala daemons
  – Streams results to client

[Diagram: the coordinator distributes the query to the Impala daemons on the other slave
nodes, which read the table data from HDFS.]

Metadata  Caching  (1)  

§ Impala daemons cache metadata
  – The tables’ schema definitions
  – The locations of tables’ HDFS blocks
§ Metadata is cached from the Metastore at startup

[Diagram: each Impala daemon keeps a local metadata cache, loaded from the Metastore
(metadata in an RDBMS).]

Metadata  Caching  (2)  

§ When one Impala daemon changes the Metastore, it notifies the catalog service
§ The catalog service notifies all Impala daemons to update their cache

[Diagram: a CREATE TABLE suppliers (…) statement issued through one Impala daemon updates
the Metastore; the Catalog Server then notifies the other Impala daemons to update their
metadata caches.]

External  Changes  and  Metadata  Caching  

§ Metadata updates made from outside of Impala are not known to Impala, e.g.
  – Changes made via Hive, HCatalog, or the Hue Metastore Manager
  – Data added directly to a directory in HDFS
§ Therefore the Impala metadata caches will be invalid
§ You must manually refresh or invalidate Impala’s metadata cache

[Diagram: external changes update the Metastore and HDFS directly, so the local metadata
caches of the Impala daemons become stale.]

Updating the Impala Metadata Cache

External Metadata Change              Required Action               Effect on Local Caches
New table added                       INVALIDATE METADATA           Marks the entire cache as stale;
                                      (with no table name)          metadata cache is reloaded as needed.

Table schema modified, or new data    REFRESH <table>               Reloads the metadata for one table
added to a table                                                    immediately. Reloads HDFS block
                                                                    locations for new data files only.

Data in a table extensively altered,  INVALIDATE METADATA <table>   Marks the metadata for a single table
such as by HDFS balancing                                           as stale. When the metadata is needed,
                                                                    all HDFS block locations are retrieved.
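Putting the table above into practice, the corresponding statements as they might be issued from impala-shell; the sales table name is just an example:

-- A table was created outside of Impala (e.g. in Hive): mark the whole cache stale
INVALIDATE METADATA;

-- New data files were added to an existing table's directory in HDFS
REFRESH sales;

-- The sales table was extensively altered outside Impala (e.g. by HDFS balancing)
INVALIDATE METADATA sales;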

Chapter  Topics  

Modeling and Managing Data with Impala and Hive (Importing and Modeling Structured Data)

§ Data  Storage  Overview  


§ Creating Databases and Tables
§ Loading  Data  into  Tables  
§ HCatalog  
§ Impala  Metadata  Caching  
§ Conclusion  
§ Homework:  Create  and  Populate  Tables  in  Impala  or  Hive    

Essential Points

§ Each  table  maps  to  a  directory  in  HDFS


– Table  data  is  stored  as  one  or  more  files
– Default  format:  plain  text  with  delimited  fields
§ The  Metastore  stores  data  about  the  data  in  an  RDBMS
  – E.g. location, column names and types
§ Tables are created and managed using the Impala SQL or HiveQL Data Definition Language
§ Impala  caches  metadata  from  the  Metastore
– Invalidate  or  refresh  the  cache  if  tables  are  modified  outside  Impala

Bibliography  

The following offer more information on topics discussed in this chapter
§ Impala  Concepts  and  Architecture
– http://tiny.cloudera.com/adcc12a
§ Impala  SQL  Language  Reference  
– http://tiny.cloudera.com/impalasql
§ Impala-related Articles on Cloudera’s Blog
– http://tiny.cloudera.com/adcc12e
§ Apache  Hive  Web  Site
– http://hive.apache.org/
§ HiveQL  Language  Manual  
– http://tiny.cloudera.com/adcc10b

Chapter  Topics  

Modeling and Managing Data with Impala and Hive (Importing and Modeling Structured Data)

§ Data  Storage  Overview  


§ Creating Databases and Tables
§ Loading  Data  into  Tables  
§ HCatalog  
§ Impala  Metadata  Caching  
§ Conclusion  
§ Homework:  Create  and  Populate  Tables  in  Impala  or  Hive    

Homework:  Create  and  Populate  Tables  in  Impala    

§ In  this  homework  assignment  you  will    


  – Create a table in Impala to model and view existing data
  – Use Sqoop to create a new table automatically from data imported from MySQL
§ Please refer to the Homework description

