
Modeling and Managing Data with Impala and Hive
Chapter 6

201509
Course Chapters

1  Introduction                                      (Course Introduction)
2  Introduction to Hadoop and the Hadoop Ecosystem   (Introduction to Hadoop)
3  Hadoop Architecture and HDFS
4  Importing Relational Data with Apache Sqoop
5  Introduction to Impala and Hive                   (Importing and Modeling Structured Data)
6  Modeling and Managing Data with Impala and Hive
7  Data Formats
8  Data File Partitioning
9  Capturing Data with Apache Flume                  (Ingesting Streaming Data)
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications          (Distributed Data Processing with Spark)
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames
18 Conclusion                                        (Course Conclusion)
Modeling  and  Managing  Data  in  Impala  and  Hive  

In  this  chapter  you  will  learn  


§ How  Impala  and  Hive  use  the  Metastore  
§ How  to  use  Impala  SQL  and  HiveQL  DDL  to  create  tables  
§ How  to  create  and  manage  tables  using  Hue  or  HCatalog  
§ How  to  load  data  into  tables  using  Impala,  Hive,  or  Sqoop  

Chapter  Topics  

Modeling and Managing Data with Impala and Hive (Importing and Modeling Structured Data)

§ Data  Storage  Overview  


§ Creating Databases and Tables
§ Loading  Data  into  Tables  
§ HCatalog  
§ Impala  Metadata  Caching  
§ Conclusion  
§ Homework:  Create  and  Populate  Tables  in  Impala  or  Hive    

How  Hive  and  Impala  Load  and  Store  Data  (1)  

§ Queries  operate  on  tables,  just  like  in  an  RDBMS  


– A  table  is  simply  an  HDFS  directory  containing  one  or  more  files  
– Default  path:  /user/hive/warehouse/<table_name>      
– Supports  many  formats  for  data  storage  and  retrieval  
§ What is the structure and location of tables?
– These  are  specified  when  tables  are  created  
– This  metadata  is  stored  in  the  Metastore  
– Contained  in  an  RDBMS  such  as  MySQL  
§ Hive  and  Impala  work  with  the  same  data  
– Tables  in  HDFS,  metadata  in  the  Metastore  

How  Hive  and  Impala  Load  and  Store  Data  (2)  

§ Hive and Impala use the Metastore to determine data format and location
– The  query  itself  operates  on  data  stored  in  HDFS  

[Diagram: a query submitted to the Impala or Hive server uses metadata from the Metastore
(stored in an RDBMS) and operates on table data stored as files in HDFS.]

Data  and  Metadata  

§ Data refers to the information you store and process


– Billing  records,  sensor  readings,  and  server  logs  are  examples  of  data  
§ Metadata  describes  important  aspects  of  that  data  
– Field  name  and  order  are  examples  of  metadata  

Metadata (field names and order):   cust_id | name    | country
Data (the stored records):          001     | Alice   | us
                                    002     | Bob     | ca
                                    003     | Carlos  | mx
                                    ...     | ...     | ...
                                    392     | Maria   | it
                                    393     | Nigel   | uk
                                    394     | Ophelia | dk
                                    ...     | ...     | ...

The  Data  Warehouse  Directory  

§ By  default,  data  is  stored  in  the  HDFS  directory    


/user/hive/warehouse
§ Each  table  is  a  subdirectory  containing  any  number  of  files  
[Example: the customers table (columns cust_id, name, country) maps to the HDFS directory
/user/hive/warehouse/customers. That directory holds two files: file1 contains rows such as
Alice, Bob, Carlos, and Dieter, and file2 contains rows such as Maria, Nigel, Ophelia, and Peter.]

Chapter  Topics  

Modeling and Managing Data with Impala and Hive (Importing and Modeling Structured Data)

§ Data  Storage  Overview  


§ Creating Databases and Tables
§ Loading  Data  into  Tables  
§ HCatalog  
§ Impala  Metadata  Caching  
§ Conclusion  
§ Homework:  Create  and  Populate  Tables  in  Impala  or  Hive    
 

Defining  Databases  and  Tables  

§ Databases and tables are created and managed using the DDL (Data Definition Language) of HiveQL or Impala SQL
– Very  similar  to  standard  SQL  DDL  
– Some  minor  differences  between  Hive  and  Impala  DDL  will  be  noted  

Creating a Database

§ Hive  and  Impala  databases  are  simply  namespaces  


– Helps  to  organize  your  tables  
§ To  create  a  new  database  

CREATE DATABASE loudacre;

1. Adds the database definition to the Metastore
2. Creates a storage directory in HDFS, e.g. /user/hive/warehouse/loudacre.db

§ To conditionally create a new database
  – Avoids an error in case the database already exists (useful for scripting)

CREATE DATABASE IF NOT EXISTS loudacre;

Removing  a  Database  

§ Removing a database is similar to creating it


– Just  replace  CREATE  with  DROP

DROP DATABASE loudacre;

DROP DATABASE IF EXISTS loudacre;

§ These  commands  will  fail  if  the  database  contains  tables  


– In  Hive:  Add  the  CASCADE  keyword  to  force  removal  
– Caution: this command might remove data in HDFS!

DROP DATABASE loudacre CASCADE;

Data  Types  

§ Each  column  is  assigned  a  specific  data  type  


– These  are  specified  when  the  table  is  created  
– NULL  values  are  returned  for  non-­‐conforming  data  in  HDFS  
§ Here  are  some  common  data  types    
Name       Description                      Example Value
STRING     Character data (of any length)   Alice
BOOLEAN    TRUE or FALSE                    TRUE
TIMESTAMP  Instant in time                  2014-03-14 17:01:29
INT        Range: same as Java int          84127213
BIGINT     Range: same as Java long         7613292936514215317
FLOAT      Range: same as Java float        3.14159
DOUBLE     Range: same as Java double       3.1415926535897932385

Hive  (not  Impala)  also  supports  a  few  complex  types  such  


as  maps  and  arrays  
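To make that note concrete, here is a minimal Hive-only sketch; the table name, columns, and delimiters below are illustrative assumptions, not examples from the course:

-- pages:      an ordered list of page names (ARRAY)
-- attributes: arbitrary key/value pairs (MAP)
CREATE TABLE web_sessions (
  session_id STRING,
  pages      ARRAY<STRING>,
  attributes MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':';

As noted above, Impala does not support these complex types, so a table like this would be queried from Hive.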
Creating a Table (1)

§ Basic syntax for creating a table:

CREATE TABLE tablename (colname DATATYPE, ...)


ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|…}

§ Creates  a  subdirectory  in  the  database’s  warehouse  directory  in  HDFS  


– Default  database:    
 /user/hive/warehouse/tablename    
– Named  database:      
/user/hive/warehouse/dbname.db/tablename  

Creating a Table (2)

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE|PARQUET}

Specify a name for the table, and list the column names and datatypes (see later).

Creating a Table (3)

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE|PARQUET}

This line states that fields in each file in the table’s directory are delimited by some
character. The default delimiter is Control-A, but you may specify an alternate delimiter...

Creating a Table (4)

CREATE TABLE tablename (colname DATATYPE, ...)


ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|…}

…for example, tab-delimited data would require that


you specify FIELDS TERMINATED BY '\t'

Creating a Table (5)

CREATE TABLE tablename (colname DATATYPE, ...)


ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|…}

Finally, you may declare the file format. STORED AS


TEXTFILE is the default and does not need to be
specified.
Other formats will be discussed later in the course.

Example Table Definition

§ The  following  example  creates  a  new  table  named  jobs


– Data stored as text with four comma-separated fields per line

CREATE TABLE jobs (


id INT,
title STRING,
salary INT,
posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

– Example  of  corresponding  record  for  the  table  above  

1,Data Analyst,100000,2013-06-21 15:52:03

Creating Tables Based on Existing Schema

§ Use LIKE to create a new table based on an existing table definition

CREATE TABLE jobs_archived LIKE jobs;

§ Column definitions and names are derived from the existing table
  – New table will contain no data

Creating Tables Based on Existing Data

§ Create  a  table  based  on  a  SELECT  statement  


– Often known as ‘Create Table As Select’ (CTAS)
CREATE TABLE ny_customers AS
SELECT cust_id, fname, lname
  FROM customers
WHERE state = 'NY';

§ Column definitions are derived from the existing table
§ Column names are inherited from the existing names
– Use  aliases  in  the  SELECT  statement  to  specify  new  names  
§ New  table  will  contain  the  selected  data  

Controlling Table Data Location

§ By  default,  table  data  is  stored  in  the  warehouse  directory  


§ This  is  not  always  ideal  
– Data  might  be  shared  by  several  users    
§ Use  LOCATION  to  specify  the  directory  where  table  data  resides  

CREATE TABLE jobs (


id INT,
title STRING,
salary INT,
posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/loudacre/jobs';

Externally  Managed  Tables  

§ CAUTION:  Dropping  a  table  removes  its  data  in  HDFS  


– Tables  are  “managed”  or  “internal”  by  default  
§ Using EXTERNAL when creating the table avoids this behavior
– Dropping  an  external  table  removes  only  its  metadata  

CREATE EXTERNAL TABLE adclicks


( campaign_id STRING,
click_time TIMESTAMP,
keyword STRING,
site STRING,
placement STRING,
was_clicked BOOLEAN,
cost SMALLINT)
LOCATION '/loudacre/ad_data';
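To illustrate the difference, a quick sketch using the adclicks table defined above:

-- Removes only the table definition from the Metastore;
-- the data files under /loudacre/ad_data remain in HDFS
DROP TABLE adclicks;

Had adclicks been created as a managed (internal) table instead, the same statement would also delete its directory and data.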

Exploring  Tables  (1)  

§ The  SHOW TABLES  command  lists  all  tables  in  the  current  database  
SHOW TABLES;
+---------------+
| tab_name |
+---------------+
| accounts |
| employees |
| job |
| vendors |
+---------------+

§ The  DESCRIBE  command  lists  the  fields  in  the  specified  table  
  DESCRIBE jobs;
+--------+-----------+---------+
| name | type | comment |
+--------+-----------+---------+
| id | int | |
| title | string | |
| salary | int | |
| posted | timestamp | |
+--------+-----------+---------+

Exploring  Tables  (2)  

§ DESCRIBE FORMATTED also shows table properties


DESCRIBE FORMATTED jobs;
+------------------+-------------------------------------+--------+
| name | type | comment|
+------------------+-------------------------------------+--------+
| # col_name | data_type | comment|
| id | int | NULL |
| title | string | NULL |
| salary | int | NULL |
| posted | timestamp | NULL |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | training | NULL |
| CreateTime: | Wed Jun 17 09:41:23 PDT 2015 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://localhost:8020/loudacre/jobs | NULL |
| Table Type: | MANAGED_TABLE | NULL |

Exploring  Tables  (3)  

§ SHOW CREATE TABLE displays  the  SQL  command  to  create  the  table  

SHOW CREATE TABLE jobs;


+-----------------------------------------------------+
| CREATE TABLE default.jobs ( |
| id INT, |
| title STRING, |
| salary INT, |
| posted TIMESTAMP |
| ) |
| ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' |

Using  the  Hue  Metastore  Manager  

§ The  Hue  Metastore  Manager    


– An alternative to using SQL commands to manage metadata
– Allows  you  to  create,  load,  preview,  and  delete  databases  and  tables  
– Not  all  features  are  supported  yet  

Chapter  Topics  

Modeling and Managing Data with Impala and Hive (Importing and Modeling Structured Data)

§ Data  Storage  Overview  


§ Creating Databases and Tables
§ Loading  Data  into  Tables  
§ HCatalog  
§ Impala  Metadata  Caching  
§ Conclusion  
§ Homework:  Create  and  Populate  Tables  in  Impala  or  Hive    

Data Validation

§ Impala  and  Hive  are  ‘schema  on  read’  


– Unlike  an  RDBMS,  they  do  not  validate  data  on  insert  
– Files  are  simply  moved  into  place  
– Loading  data  into  tables  is  therefore  very  fast  
– Errors  in  file  format  will  be  discovered  when  queries  are  performed  
§ Missing  or  invalid  data  will  be  represented  as  NULL
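A small sketch of this behavior using the jobs table defined earlier; the malformed record below is hypothetical:

-- Suppose a data file in the jobs table's directory contains the line:
--   8,Data Engineer,unknown,2013-07-01 09:15:00
-- The file is loaded without complaint, but 'unknown' does not conform to the
-- INT salary column, so queries return NULL for that field:
SELECT id, title, salary FROM jobs WHERE salary IS NULL;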

Loading  Data  From  HDFS  Files  

§ To  load  data,  simply  add  files  to  the  table’s  directory  in  HDFS  
– Can  be  done  directly  using  the  hdfs dfs  commands  
– This  example  loads  data  from  HDFS  into  the  sales  table  

$ hdfs dfs -mv \


/tmp/sales.txt /user/hive/warehouse/sales/
 
§ Alternatively, use the LOAD DATA INPATH command
– Done  from  within  Hive  or  Impala  
– This  moves  data  within  HDFS,  just  like  the  command  above  
– Source  can  be  either  a  file  or  directory  

LOAD DATA INPATH '/tmp/sales.txt'


INTO TABLE sales;

Overwriting Data from Files

§ Add  the  OVERWRITE  keyword  to  delete  all  records  before  import  
– Removes  all  files  within  the  table’s  directory    
– Then  moves  the  new  files  into  that  directory  

LOAD DATA INPATH '/tmp/sales.txt'


OVERWRITE INTO TABLE sales;

Appending  Selected  Records  to  a  Table  

§ Another  way  to  populate  a  table  is  through  a  query  


– Use INSERT INTO to add results to an existing Hive table

INSERT INTO TABLE accounts_copy


SELECT * FROM accounts;

– Specify  a  WHERE  clause  to  control  which  records  are  appended  

INSERT INTO TABLE loyal_customers


SELECT * FROM accounts
WHERE YEAR(acct_create_dt) = 2008
AND acct_close_dt IS NULL;

Loading  Data  Using  the  Metastore  Manager  

§ The  Metastore  Manager  provides  two  ways  to  load  data  into  a  table  

  – Table creation wizard
  – Import data wizard

Loading Data from a Relational Database

§ Sqoop has built-in support for importing data into Hive and Impala
§ Add the --hive-import option to your Sqoop command
– Creates  the  table  in  the  Hive  metastore  
– Imports  data  from  the  RDBMS  to  the  table’s  directory  in  HDFS  

$ sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--fields-terminated-by '\t' \
--table employees \
--hive-import

– Note  that  --hive-import  creates  a  table  accessible  in  both  Hive  and  
Impala  
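Because --hive-import registers the new table through the Hive Metastore, an Impala session that is already running may need its metadata cache updated before it can see the table (metadata caching is covered later in this chapter). A sketch, using the employees table from the example above:

-- In impala-shell: make Impala aware of the newly imported table, then query it
INVALIDATE METADATA;
SELECT COUNT(*) FROM employees;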

Chapter  Topics  

Modeling and Managing Data with Impala and Hive (Importing and Modeling Structured Data)

§ Data  Storage  Overview  


§ Creating Databases and Tables
§ Loading  Data  into  Tables  
§ HCatalog  
§ Impala  Metadata  Caching  
§ Conclusion  
§ Homework:  Create  and  Populate  Tables  in  Impala  or  Hive    

Impala  in  the  Cluster  

§ Each slave node in the cluster runs an Impala daemon
  – Co-located with the HDFS slave daemon (DataNode)
§ Two other daemons running on master nodes support query execution
  – The State Store daemon
    – Provides lookup service for Impala daemons
    – Periodically checks status of Impala daemons
  – The Catalog daemon
    – Relays metadata changes to all the Impala daemons in a cluster

[Diagram: a master node runs the Catalog Server, State Store, and NameNode; each slave node
runs an Impala daemon co-located with an HDFS DataNode.]

How  Impala  Executes  a  Query  

§ Impala daemon plans the query
  – Client (impala-shell or Hue) connects to a local Impala daemon
    – This is the coordinator
  – Coordinator requests a list of other Impala daemons in the cluster from the State Store
  – Coordinator distributes the query across other Impala daemons
  – Streams results to client

[Diagram: the coordinator distributes the query to the Impala daemons on the other slave
nodes, which read the table data from HDFS.]

Metadata  Caching  (1)  

§ Impala daemons cache metadata
  – The tables’ schema definitions
  – The locations of tables’ HDFS blocks
§ Metadata is cached from the Metastore at startup

[Diagram: each Impala daemon keeps a local metadata cache, loaded from the Metastore
(metadata in an RDBMS).]

Metadata  Caching  (2)  

§ When one Impala daemon changes the Metastore, it notifies the catalog service
§ The catalog service notifies all Impala daemons to update their cache

[Diagram: a CREATE TABLE suppliers (…) statement issued through one Impala daemon updates
the Metastore; the Catalog Server then notifies the other Impala daemons to update their
metadata caches.]

External  Changes  and  Metadata  Caching  

§ Metadata updates made from outside of Impala are not known to Impala, e.g.
  – Changes made via Hive, HCatalog, or the Hue Metastore Manager
  – Data added directly to a directory in HDFS
§ Therefore the Impala metadata caches will be invalid
§ You must manually refresh or invalidate Impala’s metadata cache

[Diagram: external changes update the Metastore and HDFS directly, so the local metadata
caches of the Impala daemons become stale.]

Updating the Impala Metadata Cache

External Metadata Change              Required Action               Effect on Local Caches
New table added                       INVALIDATE METADATA           Marks the entire cache as stale;
                                      (with no table name)          metadata cache is reloaded as needed.

Table schema modified, or new data    REFRESH <table>               Reloads the metadata for one table
added to a table                                                    immediately. Reloads HDFS block
                                                                    locations for new data files only.

Data in a table extensively altered,  INVALIDATE METADATA <table>   Marks the metadata for a single table
such as by HDFS balancing                                           as stale. When the metadata is needed,
                                                                    all HDFS block locations are retrieved.
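Putting the table above into practice, the corresponding statements as they might be issued from impala-shell; the sales table name is just an example:

-- A table was created outside of Impala (e.g. in Hive): mark the whole cache stale
INVALIDATE METADATA;

-- New data files were added to an existing table's directory in HDFS
REFRESH sales;

-- The sales table was extensively altered outside Impala (e.g. by HDFS balancing)
INVALIDATE METADATA sales;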

Chapter  Topics  

Modeling and Managing Data with Impala and Hive (Importing and Modeling Structured Data)

§ Data  Storage  Overview  


§ Creating Databases and Tables
§ Loading  Data  into  Tables  
§ HCatalog  
§ Impala  Metadata  Caching  
§ Conclusion  
§ Homework:  Create  and  Populate  Tables  in  Impala  or  Hive    

Essential Points

§ Each  table  maps  to  a  directory  in  HDFS


– Table  data  is  stored  as  one  or  more  files
– Default  format:  plain  text  with  delimited  fields
§ The  Metastore  stores  data  about  the  data  in  an  RDBMS
  – E.g. location, column names and types
§ Tables are created and managed using the Impala SQL or HiveQL Data Definition Language
§ Impala  caches  metadata  from  the  Metastore
– Invalidate  or  refresh  the  cache  if  tables  are  modified  outside  Impala

Bibliography  

The following offer more information on topics discussed in this chapter
§ Impala  Concepts  and  Architecture
– http://tiny.cloudera.com/adcc12a
§ Impala  SQL  Language  Reference  
– http://tiny.cloudera.com/impalasql
§ Impala-related Articles on Cloudera’s Blog
– http://tiny.cloudera.com/adcc12e
§ Apache  Hive  Web  Site
– http://hive.apache.org/
§ HiveQL  Language  Manual  
– http://tiny.cloudera.com/adcc10b

Chapter  Topics  

Modeling and Managing Data with Impala and Hive (Importing and Modeling Structured Data)

§ Data  Storage  Overview  


§ Creating Databases and Tables
§ Loading  Data  into  Tables  
§ HCatalog  
§ Impala  Metadata  Caching  
§ Conclusion  
§ Homework:  Create  and  Populate  Tables  in  Impala  or  Hive    

Homework:  Create  and  Populate  Tables  in  Impala    

§ In  this  homework  assignment  you  will    


  – Create a table in Impala to model and view existing data
  – Use Sqoop to create a new table automatically from data imported from MySQL
§ Please refer to the Homework description

