2 Partitioning+QC+Done


CREATING TABLES WITH PARTITIONS


HIVE ORGANIZES TABLES INTO PARTITIONS

PARTITIONS ARE JUST SPLITS OF ALL THE DATA IN A TABLE
EACH PARTITION IS IN A SEPARATE DIRECTORY
A DIRECTORY CAN HAVE MULTIPLE FILES WITH THE DATA IN THAT PARTITION
ALL PARTITIONS TOGETHER MAKE UP THE ENTIRE TABLE
THE PARTITION CAN BE BASED ON THE VALUES OF ONE OR MORE COLUMNS (CALLED PARTITION COLUMNS)
YOU CAN USE ANY COLUMN TO CREATE PARTITIONS

CONSIDER THE SALES TABLE

StoreLocation   Product         Date               Revenue
Bellandur       Bananas         January 18, 2016   8,236.33
Bellandur       Nutella         January 18, 2016   7,455.67
Bellandur       Peanut Butter   January 18, 2016   5,316.89
Bellandur       Milk            January 18, 2016   2,433.76
Koramangala     Bananas         January 18, 2016   9,456.01
Koramangala     Nutella         January 18, 2016   3,644.33
Koramangala     Peanut Butter   January 18, 2016   8,988.64
Koramangala     Milk            January 18, 2016   1,621.58




LET’S PARTITION IT ON THE PRODUCT COLUMN


THERE ARE 4 DISTINCT VALUES IN THE PRODUCT COLUMN

Product
Bananas
Nutella
Peanut Butter
Milk

THAT GIVES 4 PARTITIONS OF THIS TABLE




PRODUCT = BANANAS
StoreLocation   Date               Revenue
Bellandur       January 18, 2016   8,236.33
Koramangala     January 18, 2016   9,456.01

PRODUCT = PEANUT BUTTER
StoreLocation   Date               Revenue
Bellandur       January 18, 2016   5,316.89
Koramangala     January 18, 2016   8,988.64

PRODUCT = NUTELLA
StoreLocation   Date               Revenue
Bellandur       January 18, 2016   7,455.67
Koramangala     January 18, 2016   3,644.33

PRODUCT = MILK
StoreLocation   Date               Revenue
Bellandur       January 18, 2016   2,433.76
Koramangala     January 18, 2016   1,621.58

THESE BECOME 4 SUB-DIRECTORIES IN THE TABLE’S DIRECTORY
/USER/HIVE/WAREHOUSE
    /SALES-TABLE
        /PRODUCT=BANANAS
            /FILE-01
        /PRODUCT=PEANUT BUTTER
            /FILE-01
        /PRODUCT=NUTELLA
            /FILE-01
        /PRODUCT=MILK
            /FILE-01

DATA IS STORED IN THESE FILES

THERE IS JUST ONE FILE INSIDE EACH DIRECTORY, BUT THAT NEED NOT BE THE CASE: YOU CAN HAVE MULTIPLE FILES IN A DIRECTORY AS WELL
WHY SHOULD WE PARTITION TABLES?
IMPROVED PERFORMANCE
LOGICAL ORGANIZATION OF DATA

PARTITIONING DETERMINES DATA STORAGE STRUCTURES AND SUBDIRECTORIES

PARTITIONING IMPROVES QUERY PERFORMANCE
StoreLocation   Product         Date               Revenue
Bellandur       Bananas         January 18, 2016   8,236.33
Bellandur       Nutella         January 18, 2016   7,455.67
Bellandur       Peanut Butter   January 18, 2016   5,316.89
Bellandur       Milk            January 18, 2016   2,433.76
Koramangala     Bananas         January 18, 2016   9,456.01
Koramangala     Nutella         January 18, 2016   3,644.33
Koramangala     Peanut Butter   January 18, 2016   8,988.64
Koramangala     Milk            January 17, 2016   1,621.58

CALCULATE TOTAL REVENUE FROM SELLING MILK ON JANUARY 17

IN AN UNPARTITIONED TABLE, WE HAVE TO SCAN THE ENTIRE TABLE
THIS IS HOW OUR PARTITIONED TABLE LOOKS

PRODUCT = BANANAS
StoreLocation   Date               Revenue
Bellandur       January 18, 2016   8,236.33
Koramangala     January 18, 2016   9,456.01

PRODUCT = PEANUT BUTTER
StoreLocation   Date               Revenue
Bellandur       January 18, 2016   5,316.89
Koramangala     January 18, 2016   8,988.64

PRODUCT = NUTELLA
StoreLocation   Date               Revenue
Bellandur       January 18, 2016   7,455.67
Koramangala     January 18, 2016   3,644.33

PRODUCT = MILK
StoreLocation   Date               Revenue
Bellandur       January 18, 2016   2,433.76
Koramangala     January 17, 2016   1,621.58
IN A PARTITIONED TABLE, WE JUST SCAN THE PARTITION CORRESPONDING TO PRODUCT = MILK

PRODUCT = MILK
StoreLocation   Date               Revenue
Bellandur       January 18, 2016   2,433.76
Koramangala     January 17, 2016   1,621.58

THE DATA TO PROCESS IS ROUGHLY 1/4TH OF THE TOTAL DATA PRESENT IN THIS TABLE
A POTENTIALLY HUGE SAVING!
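
The Milk example above can be sketched as a HiveQL query. This is a minimal illustration, assuming the product-partitioned table Sales_Data_Product_Partition that this deck defines later (columns StoreLocation, OrderDate, Revenue; partition column product), and assuming a Hive version that accepts DATE literals.

```sql
-- Sketch only: table and column names are taken from the CREATE TABLE
-- statement shown later in this deck.
SELECT SUM(Revenue) AS total_milk_revenue
FROM Sales_Data_Product_Partition
WHERE product = 'Milk'                  -- partition column: lets Hive read only the product=Milk directory
  AND OrderDate = DATE '2016-01-17';    -- regular column: rows are filtered while scanning that partition
```

The filter on product is what allows the other three partition directories to be skipped; the filter on OrderDate still has to be checked row by row inside the Milk partition.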
PARTITIONING IMPROVES QUERY PERFORMANCE
THE PERFORMANCE IMPROVEMENT CAN BE DRAMATIC, BUT ONLY IF THE PARTITIONING SCHEME REFLECTS THE FILTERS USED IN COMMON QUERIES
IN THE PREVIOUS EXAMPLE, IF WE HAD PARTITIONS BASED ON STORE LOCATION, THERE WOULD HAVE BEEN NO ADVANTAGE OF USING PARTITIONS

ALL PARTITIONS WOULD HAVE TO BE SCANNED TO SEE WHERE MILK WAS SOLD ON JANUARY 17
WHAT ARE THE MOST COMMON QUERIES YOU PLAN TO RUN? PLAN YOUR PARTITIONS TO MAKE THEM FASTER
PARTITIONS ARE A TRADE-OFF

TOO MANY PARTITIONS MAY OPTIMIZE SOME QUERIES, BUT BE DETRIMENTAL FOR OTHER IMPORTANT QUERIES
CONSIDER AN ORDER TABLE FOR AN E-COMMERCE SITE
DO NOT CREATE PARTITIONS BASED ON CUSTOMER ID
WITH ONE PARTITION PER CUSTOMER, THERE WILL BE MILLIONS OF PARTITIONS
LARGE NUMBER OF PARTITIONS = LARGE NUMBER OF HADOOP DIRECTORIES

THIS IS A HUGE OVERHEAD FOR THE NAME NODE, WHICH MAINTAINS FILE METADATA FOR HADOOP
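
One way to sanity-check a candidate partition column before creating the table is to count its distinct values, since each distinct value becomes one directory. The sketch below uses a hypothetical Orders table with CustomerID and OrderDate columns; these names are assumptions for illustration and do not appear elsewhere in this deck.

```sql
-- Hypothetical table and columns (Orders, CustomerID, OrderDate), for illustration only.
-- Each distinct value of a partition column would become one partition directory.
SELECT COUNT(DISTINCT CustomerID) AS partitions_if_by_customer,
       COUNT(DISTINCT OrderDate)  AS partitions_if_by_date
FROM Orders;
```

A column whose count runs into the millions is a poor partition column; one with a modest number of values that also appears in common WHERE clauses is a better candidate.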
HOW TO PARTITION TABLES?
BY ADDING
partitioned by(Partition_Column_Name column_data_type)
TO THE CREATE TABLE COMMAND

CREATE TABLE Sales_Data
(
StoreLocation VARCHAR(30),
OrderDate DATE,
Revenue DECIMAL(10,2)
)

THE PARTITION COLUMN IS NOT MENTIONED HERE
WE WANT TO PARTITION IT BY PRODUCT
CREATE TABLE Sales_Data_Product_Partition
(
StoreLocation VARCHAR(30),
OrderDate DATE,
Revenue DECIMAL(10,2)
)
partitioned by(product varchar(30));
NOTE THAT THE PARTITION COLUMN IS NOT SPECIFIED IN THE COLUMN LIST OF THE CREATE TABLE COMMAND
INSTEAD IT’S ONLY PRESENT IN THE SEPARATE PARTITIONED BY CLAUSE
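
Even though product appears only in the partitioned by clause, it is still part of the table's schema. A quick way to confirm this, sketched here for the Sales_Data_Product_Partition table created above, is to describe the table; the exact layout of the output varies between Hive versions.

```sql
-- Lists the regular columns and the partition column of the table.
DESCRIBE Sales_Data_Product_Partition;

-- A more detailed view, including storage location and partition information.
DESCRIBE FORMATTED Sales_Data_Product_Partition;
```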
IF WE WANT TO PARTITION THE TABLE BY THE DATE COLUMN INSTEAD

CREATE TABLE Sales_Data_Date_Partition
(
StoreLocation VARCHAR(30),
product VARCHAR(30),
Revenue DECIMAL(10,2)
)
partitioned by(OrderDate DATE);
/USER/HIVE/WAREHOUSE
    /SALES_DATA_DATE_PARTITION
        /ORDERDATE=2015-01-21
            /FILE-01
        /ORDERDATE=2015-01-22
            /FILE-01
        /ORDERDATE=2015-01-23
            /FILE-01
        /ORDERDATE=2015-01-24
            /FILE-01

DATA IS STORED IN THESE FILES
WE CAN PARTITION TABLES BY TWO COLUMNS

partitioned by
(
column_name_1 data_type_1,
column_name_2 data_type_2
)

THE PARTITIONED BY KEYWORD IS THE SAME
EACH COLUMN NAME AND DATA TYPE IS SPECIFIED, SEPARATED BY A COMMA
THIS CLAUSE IS ADDED TO THE CREATE TABLE COMMAND AS WE DID IN THE LAST SECTION

CREATE TABLE Sales_Data_Date_Product_Partition
(
StoreLocation VARCHAR(30),
Revenue DECIMAL(10,2)
)
partitioned by
(
OrderDate DATE,
product VarChar(30)
);
THE COLUMN DEFINITIONS INSIDE PARTITIONED BY ARE WRITTEN THE SAME WAY AS IN THE CREATE COMMAND
LET’S LOOK AT THE DIRECTORY STRUCTURE FOR SUCH A TABLE

/USER/HIVE/WAREHOUSE
    /SALES_DATA_DATE_PRODUCT_PARTITION
        /ORDERDATE=2015-01-17
            /PRODUCT=BANANAS
                /FILE-01
            /PRODUCT=PEANUT BUTTER
        /ORDERDATE=2015-01-18
            /PRODUCT=BANANAS
            /PRODUCT=PEANUT BUTTER

DATA IS STORED IN THESE FILES
HOW DO WE GET DATA FROM TABLES WITH PARTITIONS?

WHEN YOU QUERY Sales_Data_Date_Product_Partition, JUST TREAT IT AS IF IT HAS 4 COLUMNS: StoreLocation, Revenue, OrderDate AND product
YOU CAN WRITE SELECT STATEMENTS WHICH TREAT THE PARTITION COLUMNS AS YOU WOULD ANY REGULAR COLUMN
QUERIES WITH CONDITIONS ON THE PARTITION COLUMNS WILL RUN FASTER
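
As a minimal sketch of such queries against Sales_Data_Date_Product_Partition (the date and product values are illustrative, and DATE literals are assumed to be supported by the Hive version in use):

```sql
-- Partition columns can be selected and filtered like any other column.
SELECT StoreLocation, Revenue, OrderDate, product
FROM Sales_Data_Date_Product_Partition
WHERE OrderDate = DATE '2016-01-17'   -- condition on a partition column
  AND product = 'Milk';               -- condition on the second partition column

-- They can also be used for grouping.
SELECT product, SUM(Revenue) AS total_revenue
FROM Sales_Data_Date_Product_Partition
WHERE OrderDate = DATE '2016-01-18'
GROUP BY product;
```

Both queries restrict OrderDate, so Hive only needs to read the directories for that date rather than the whole table.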
HOW DO WE PUT STUFF INTO TABLES WITH PARTITIONS?
WE CREATED A PARTITIONED TABLE USING THE FOLLOWING COMMAND

CREATE TABLE Sales_Data_Date_Partition
(
StoreLocation VARCHAR(30),
product VARCHAR(30),
Revenue DECIMAL(10,2)
)
partitioned by(OrderDate DATE);

TO INSERT DATA WE WILL USE THE FOLLOWING COMMAND

Insert into Sales_Data_Date_Partition
partition (OrderDate ='2016-01-16')
Values
('Bellandur','Nutella',7455.67),
('Bellandur','Peanut Butter',5316.89),
('Bellandur','Milk',2433.76),
('Koramangala','Bananas',9456.01);

THIS IS SIMILAR TO NORMAL INSERT COMMANDS
SPECIFY THE ACTUAL VALUE OF THE PARTITION WHILE LOADING INTO THE TABLES

IF WE HAVE LOTS OF PARTITIONS, THIS CAN BECOME A MAJOR HEADACHE
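
To make the headache concrete, here is a sketch of static loading from an unpartitioned source. It assumes the source table Sales_Data_Without_Partition (columns StoreLocation, Product, OrderDate, Revenue) that this deck introduces further on; one statement is needed for every partition value.

```sql
-- One INSERT per partition value: the partition value is hard-coded each time.
INSERT INTO TABLE Sales_Data_Date_Partition
PARTITION (OrderDate = '2016-01-17')
SELECT StoreLocation, Product, Revenue
FROM Sales_Data_Without_Partition
WHERE OrderDate = DATE '2016-01-17';

INSERT INTO TABLE Sales_Data_Date_Partition
PARTITION (OrderDate = '2016-01-18')
SELECT StoreLocation, Product, Revenue
FROM Sales_Data_Without_Partition
WHERE OrderDate = DATE '2016-01-18';

-- ...and so on, once for every date present in the source table.
```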

SOLUTION: DYNAMIC PARTITIONING


DYNAMIC PARTITIONING CREATES PARTITIONS IN A PARTITIONED TABLE AUTOMATICALLY
WE CAN LOAD THE ENTIRE DATA IN ONE GO AND LET HIVE CREATE PARTITIONS
WE NEED NOT CREATE PARTITIONS ONE AT A TIME


TO CONFIGURE HIVE TO SUPPORT DYNAMIC PARTITION CREATION, ENTER THE FOLLOWING SET COMMANDS

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

THESE COMMANDS CHANGE THE SETTINGS ONLY FOR THE CURRENT SESSION


TO CHANGE THE SETTINGS PERMANENTLY, EDIT THESE PROPERTIES IN THE HIVE-SITE.XML FILE
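
Within a session, the current values can be inspected by issuing SET with only the property name. The sketch below also shows two related limits that Hive applies to dynamic partition inserts; the numbers are illustrative assumptions, not recommendations.

```sql
-- Print the current value of each property.
SET hive.exec.dynamic.partition;
SET hive.exec.dynamic.partition.mode;

-- Related limits on how many partitions one statement may create
-- (values here are examples only).
SET hive.exec.max.dynamic.partitions = 2000;
SET hive.exec.max.dynamic.partitions.pernode = 500;
```

Raising the limits may be necessary when a single load creates many partitions, but it also makes it easier to create the very large partition counts warned about earlier.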
WHILE ENABLING DYNAMIC PARTITIONING, WE SHOULD ALSO SET THE PARTITION MODE TO NONSTRICT

THERE ARE TWO MODES

STRICT MODE: THERE SHOULD BE AT LEAST ONE NON-DYNAMIC (STATIC) PARTITION
NONSTRICT MODE: ALL PARTITIONS ARE ALLOWED TO BE DYNAMIC

partition (OrderDate ='2016-01-16', product)

OrderDate ='2016-01-16' IS A NON-DYNAMIC PARTITION (ITS VALUE IS GIVEN)
product IS A DYNAMIC PARTITION (ONLY ITS NAME IS GIVEN)
NON-DYNAMIC PARTITION COLUMNS ARE LISTED BEFORE DYNAMIC ONES
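
A sketch of a complete insert that mixes one static and one dynamic partition, which is the form strict mode permits. It assumes the Sales_Data_Date_Product_Partition table (partitioned by OrderDate, then product) and the source table Sales_Data_Without_Partition used in this deck. The static column is listed first, since Hive is understood not to accept a dynamic partition ahead of a static one.

```sql
-- OrderDate is static (value given); product is dynamic (taken from the query).
INSERT INTO TABLE Sales_Data_Date_Product_Partition
PARTITION (OrderDate = '2016-01-18', product)
SELECT StoreLocation, Revenue, Product      -- the dynamic partition column comes last
FROM Sales_Data_Without_Partition
WHERE OrderDate = DATE '2016-01-18';
```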
DYNAMIC PARTITIONING

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

ONCE BOTH OF THESE SETTINGS ARE IN PLACE, IT’S EASY TO CHANGE OUR QUERY TO DYNAMICALLY LOAD PARTITIONS.

HOW DO WE PUT STUFF INTO TABLES WITH PARTITIONS USING DYNAMIC PARTITIONING?

LET US IMPORT DATA INTO A PARTITIONED TABLE FROM ANOTHER TABLE

SOURCE TABLE = SALES_DATA_WITHOUT_PARTITION

StoreLocation   Product         Date               Revenue
Bellandur       Bananas         January 18, 2016   8,236.33
Bellandur       Nutella         January 18, 2016   7,455.67
Bellandur       Peanut Butter   January 18, 2016   5,316.89
Bellandur       Milk            January 18, 2016   2,433.76
Koramangala     Bananas         January 18, 2016   9,456.01
Koramangala     Nutella         January 18, 2016   3,644.33
Koramangala     Peanut Butter   January 18, 2016   8,988.64
Koramangala     Milk            January 18, 2016   1,621.58
Bellandur       Bananas         January 17, 2016   2,342.33
Bellandur       Nutella         January 17, 2016   6,345.10
Bellandur       Peanut Butter   January 17, 2016   5,673.01
Bellandur       Milk            January 17, 2016   4,543.98
Koramangala     Bananas         January 17, 2016   8,902.65
Koramangala     Nutella         January 17, 2016   9,114.67
Koramangala     Peanut Butter   January 17, 2016   5,102.05
Koramangala     Milk            January 17, 2016   1,299.45

THE SOURCE TABLE WAS CREATED WITH THE FOLLOWING COMMAND


CREATE TABLE Sales_Data_Without_Partition
(
StoreLocation VARCHAR(30),
Product VARCHAR(30),
OrderDate DATE,
Revenue DECIMAL(10,2)
);

DESTINATION TABLE = SALES_DATA_DATE_PRODUCT_PARTITION
THE DESTINATION TABLE IS PARTITIONED ON DATE AND PRODUCT
SALES_DATA_DATE_PRODUCT_PARTITION IS CREATED LIKE THIS

CREATE TABLE Sales_Data_Date_Product_Partition
(
StoreLocation VARCHAR(30),
Revenue DECIMAL(10,2)
)
partitioned by
(
OrderDate DATE,
product VarChar(30)
);

PLEASE KEEP THIS IN MIND!
HOW DO WE PUT STUFF INTO TABLES WITH PARTITIONS USING DYNAMIC PARTITIONING?

Insert into Sales_Data_Date_Product_Partition
partition(OrderDate,Product)
select StoreLocation,Revenue,OrderDate,Product
from Sales_Data_Without_Partition;

THIS IS SIMILAR TO NORMAL INSERT COMMANDS
WHEN WE LOAD INTO PARTITIONED TABLES DYNAMICALLY, WE SPECIFY ONLY THE NAMES OF THE PARTITION COLUMNS, NOT THEIR VALUES
THE PARTITION COLUMNS ARE LISTED IN THE SAME ORDER AS IN THE TABLE DEFINITION (OrderDate, THEN Product)
THE SELECT STATEMENT IS THE QUERY FOR DATA FROM THE SOURCE TABLE
IT’S NECESSARY TO INCLUDE THE PARTITION COLUMNS AS THE LAST COLUMNS IN THE QUERY, IN THE SAME ORDER AS IN THE PARTITION CLAUSE
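
After a dynamic load it can be worth checking that every source row landed in a partition. A minimal sketch of such a check, using the two tables from this section:

```sql
-- Row count and revenue per (OrderDate, product) partition in the destination.
SELECT OrderDate, product, COUNT(*) AS row_count, SUM(Revenue) AS total_revenue
FROM Sales_Data_Date_Product_Partition
GROUP BY OrderDate, product;

-- Total rows in the source, for comparison with the sum of row_count above.
SELECT COUNT(*) AS source_rows
FROM Sales_Data_Without_Partition;
```

If the partition columns were selected in the wrong order, the mismatch usually shows up here as unexpected partition values, because Hive maps the trailing SELECT columns to partition columns by position rather than by name.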

/USER/HIVE/WAREHOUSE
    /SALES_DATA_DATE_PRODUCT_PARTITION
        /ORDERDATE=2015-01-17
            /PRODUCT=BANANAS
                /FILE-01
            /PRODUCT=PEANUT BUTTER
        /ORDERDATE=2015-01-18
            /PRODUCT=NUTELLA
            /PRODUCT=MILK
        /ORDERDATE=2015-01-19
            /PRODUCT=MILK
        /ORDERDATE=2015-01-20
            /PRODUCT=BANANAS
            /PRODUCT=PEANUT BUTTER

DATA IS STORED IN THESE FILES
HOW TO CHECK WHAT PARTITIONS EXIST IN A TABLE?

BY USING THE COMMAND


show partitions table_name;

TO SEE PARTITIONS OF SALES_DATA_PRODUCT_PARTITION
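
As a sketch, the command for the tables in this deck would look like the following. The second statement narrows the listing to one date and assumes that the value 2016-01-17 exists as a partition.

```sql
-- List every partition of the product-partitioned table.
SHOW PARTITIONS Sales_Data_Product_Partition;

-- For a table with two partition columns, list only the partitions under one date.
SHOW PARTITIONS Sales_Data_Date_Product_Partition PARTITION (OrderDate = '2016-01-17');
```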


YOU CAN ALSO CHECK THESE PARTITIONS IN THE HIVE WAREHOUSE DIRECTORY

hadoop fs -ls /user/hive/warehouse/sales_data_product_partition


TO SEE PARTITIONS OF SALES_DATA_DATE_PRODUCT_PARTITION

hadoop fs -ls /user/hive/warehouse/sales_data_date_product_partition

FIRST YOU GET THE DIRECTORIES FOR DATE PARTITIONS

hadoop fs -ls /user/hive/warehouse/sales_data_date_product_partition/orderdate=2016-01-16

WITHIN EACH DATE PARTITION, THERE IS A PARTITION FOR EACH PRODUCT
