Hive Partitions
Tables, partitions, and buckets are the building blocks of Hive data modeling.
What are Partitions?
Hive partitioning is a way of organizing a table by dividing it into parts based on the values of
one or more partition keys.
Partitioning is helpful when the table has one or more columns that can serve as partition keys.
Partition keys determine how the data is stored in the table.
For example:
A client has e-commerce data for its India operations, in which the operations of every state (38
states) are recorded together as a whole. If we take the state column as the partition key and
partition the India data on it, we get 38 partitions, which is equal to the number of states (38)
in the data. The data for each state can then be viewed separately in the partitioned table.
set hive.exec.dynamic.partition.mode=nonstrict
The following screenshots show the execution of the code mentioned above.
The code does the following:
1. Create the table all states with three columns: state, district, and enrollment.
2. Load data into the table all states.
3. Create the partitioned table with state as the partition key.
4. Set the partition mode to non-strict. (This mode allows every partition to be created
dynamically.)
5. Load data into the partitioned table state_part.
6. The actual processing takes place: the partitions are formed with state as the partition key.
7. 38 partition outputs are created in HDFS storage, each named after its state. In this step we
check that the 38 partition outputs are present in HDFS.
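The seven steps can be sketched in HiveQL roughly as follows. This is a minimal sketch and not the original code: the table name all_states, the STRING column types, the comma delimiter, and the input path are assumptions made for illustration. Only state_part, the three column names, and the nonstrict setting come from the text.

```sql
-- 1. Source table with the three columns named in the text
--    (table name, STRING types and comma delimiter are assumptions).
CREATE TABLE all_states (
  state      STRING,
  district   STRING,
  enrollment STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 2. Load the raw data (hypothetical local path).
LOAD DATA LOCAL INPATH '/tmp/all_states.csv' INTO TABLE all_states;

-- 3. Partitioned table: the partition key is declared in PARTITIONED BY,
--    not in the regular column list.
CREATE TABLE state_part (
  district   STRING,
  enrollment STRING
)
PARTITIONED BY (state STRING);

-- 4. Enable dynamic partitioning and allow every partition value
--    to be determined at insert time.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- 5-6. Populate the partitions; the partition column must come last
--      in the SELECT list.
INSERT OVERWRITE TABLE state_part PARTITION (state)
SELECT district, enrollment, state
FROM all_states;

-- 7. List the partitions that were created, one per distinct state value.
SHOW PARTITIONS state_part;
```

Each partition is stored as its own subdirectory of the table's location in HDFS, named in the form state=<value>, so listing the table directory is another way to confirm the result.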
What are Buckets?
Buckets in Hive segregate the data of a Hive table into multiple files or directories, which makes
querying more efficient.
The data present in a partition can be divided further into buckets.
The division is based on the hash of the columns selected for bucketing in the table.
Hive applies a hashing algorithm behind the scenes to read each record and place it into a
bucket.
In Hive, bucketing has to be enabled with set hive.enforce.bucketing=true;
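To make the hashing idea concrete, the query below computes a bucket number for a few sample values. It is a conceptual sketch only: the country values are made up, 4 is simply an illustrative bucket count, and the built-in hash() and pmod() functions stand in for whatever hash Hive applies internally when it writes bucket files. That internal hash can differ between Hive versions, so the numbers shown by this query need not match the actual bucket files.

```sql
-- Conceptual illustration of hash-based bucket assignment.
-- The inline rows are made-up sample data, not part of the original example.
SELECT country,
       pmod(hash(country), 4) AS bucket_id   -- always a value from 0 to 3
FROM (
  SELECT 'India' AS country
  UNION ALL
  SELECT 'USA' AS country
  UNION ALL
  SELECT 'Japan' AS country
) sample_rows;
```

Rows with the same value in the bucketing column always receive the same bucket number, which is why lookups and joins on that column can skip the other buckets.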
We create the table sample_bucket with the columns first_name, job_id, department, salary, and
country.
We create 4 buckets here.
Once the data is loaded, it is automatically placed into the 4 buckets.
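A table definition along these lines could look as follows. This is a hedged sketch: the text does not say which column is used for bucketing or what the column types are, so clustering on country and the STRING types are assumptions.

```sql
-- Required on older Hive releases; from Hive 2.0 onward bucketing is
-- always enforced and this property is no longer needed.
SET hive.enforce.bucketing=true;

-- Bucketed table with the columns named in the text.
-- The bucketing column (country) and all column types are assumptions.
CREATE TABLE sample_bucket (
  first_name STRING,
  job_id     STRING,
  department STRING,
  salary     STRING,
  country    STRING
)
CLUSTERED BY (country) INTO 4 BUCKETS;
```

Unlike a partition key, the bucketing column is an ordinary column of the table, so it appears in the column list as well as in CLUSTERED BY.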
We assume that the employees table has already been created in Hive. In this step, we load data
from the employees table into the table sample_bucket.
Before moving the employees data into buckets, make sure that the table has the columns
first_name, job_id, department, salary, and country.
Here we load data into sample_bucket from the employees table.
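The load step could be written as shown below. This sketch assumes that sample_bucket was declared with 4 buckets clustered on country (the bucketing column is an assumption, not stated in the text) and that employees has exactly the five columns listed above.

```sql
-- Copy the rows from employees; Hive hashes the bucketing column
-- and writes each row to one of the 4 bucket files.
INSERT OVERWRITE TABLE sample_bucket
SELECT first_name, job_id, department, salary, country
FROM employees;

-- Optional check: read only the first of the 4 buckets.
SELECT *
FROM sample_bucket TABLESAMPLE (BUCKET 1 OUT OF 4 ON country) s;
```

The data has to be inserted with a query like this rather than copied in with LOAD DATA, because LOAD DATA moves files as they are and does not distribute rows across buckets.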