Module-4
Hive and its query language (HQL) offer the following features:
1. It is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as structs, lists and maps.
4. Hive supports SQL filters, group-by and order-by clauses.
5. Custom Types, Custom Functions can be defined.
Hourly Log Data can be stored directly into HDFS and then data cleansing is
performed on the log file. Finally, Hive table(s) can be created to query the log
file.
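As a minimal sketch of this workflow (the path, table name, and column layout here are assumptions, not part of the original example), the cleansed log files sitting in HDFS can be exposed to Hive with an external table and then queried directly:
CREATE EXTERNAL TABLE hourly_logs(log_time STRING, level STRING, message STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/cleansed';
SELECT level, COUNT(*) FROM hourly_logs GROUP BY level;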
4.1.4 Hive Data Units
1. Databases: The namespace for tables.
2. Tables: Set of records that have similar schema.
3. Partitions: Logical separations of data based on the classification of the given
information as per specific attributes. Once Hive has partitioned the data based
on a specified key, it places the records into specific folders as and when they
are inserted.
4. Buckets (or Clusters): Similar to partitions, but a hash function is used to
segregate the data and determine the bucket (cluster) into which each record
should be placed.
• Partitioning tables changes how Hive structures the data storage.
• Hive will create subdirectories reflecting the partitioning structure.
• Although partitioning helps enhance performance and is
recommended, having too many partitions may prove detrimental for some
queries.
• Bucketing is another technique for managing large datasets. If we
partition the dataset based on customer_id, we would end up with far
too many partitions. Instead, if we bucket the customer table and use
customer_id as the bucketing column, the value of this column is
hashed into a user-defined number of buckets.
When to Use Partitioning/Bucketing?
Bucketing works well when the field has high cardinality (cardinality is the
number of values a column or field can have) and data is evenly distributed
among buckets. Partitioning works best when the cardinality of the partitioning
field is not too high. Partitioning can be done on multiple fields with an order
(Year/Month/Day), whereas bucketing can be done on only one field.
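As an illustration of combining the two techniques (the table and column names here are hypothetical), a table can be partitioned on low-cardinality date fields and bucketed on the high-cardinality customer_id:
CREATE TABLE customer_orders(customer_id INT, amount FLOAT)
PARTITIONED BY (order_year INT, order_month INT)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';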
The figure below shows how these data units are arranged in a Hive cluster.
2. Local Metastore: In this, the Hive metastore service runs in the main Hive
Server process, but the metastore database runs in a separate process, and can
be on a separate host. The figure below shows a Local Metastore.
3. Remote Metastore: In this, the Hive driver and the metastore interface run
in different JVMs (which can run on different machines as well), as in the figure
below. This way the database can be firewalled from the Hive user, and the
database credentials are completely isolated from the users of Hive.
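As a configuration sketch (the host names and JDBC URL are assumptions; 9083 is the conventional metastore port), the two set-ups differ mainly in these hive-site.xml properties:
javax.jdo.option.ConnectionURL = jdbc:mysql://dbhost/metastore_db   (local metastore: Hive connects to the database directly)
hive.metastore.uris = thrift://metastorehost:9083                   (remote metastore: Hive connects to a separate metastore service)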
Instead of only partitioning the table horizontally like a row-oriented DBMS
(row-store), RCFile partitions the table first horizontally and then vertically to
serialize the data. Based on a user-specified value, the table is first partitioned
horizontally into multiple row groups. As depicted in Table-2, Table-1 is
partitioned into two row groups by taking three rows as the size of each row
group.
Next, within every row group, RCFile partitions the data vertically like a
column-store, so the table is serialized as shown in Table-3.
Hive query language provides basic SQL-like operations. Below are a few of the
tasks which HQL can do easily.
1. Create and manage tables and partitions.
2. Support various Relational, Arithmetic, and Logical Operators.
3. Evaluate functions.
SHOW DATABASES LIKE 'Stud???'; Lists all databases whose names start
with "Stud" followed by exactly 3 characters.
Objective:
To view details about an existing database in Hive.
Command (Act):
DESCRIBE DATABASE STUDENTS;
This command displays the database name, comment (description), and the
directory location in HDFS where the database is stored.
Outcome (Sample Output):
students  STUDENT Details  hdfs://volgalnx010.ad.infosys.com:9000/user/hive/warehouse/students.db  root  USER
Time taken: 0.03 seconds, Fetched: 1 row(s)
students: The name of the database.
STUDENT Details: The comment provided during creation.
hdfs://.../students.db: The HDFS directory path where the database is
located.
root USER: Owner information.
This command is useful for confirming database metadata and verifying
storage paths.
Objective:
To describe the extended properties of a database.
Command (Act):
DESCRIBE DATABASE EXTENDED STUDENTS;
This command displays additional information such as DB properties (defined
using DBPROPERTIES during creation), along with standard database
metadata.
Outcome (Sample Output):
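The extended output additionally lists the DBPROPERTIES supplied at creation time. Such properties can later be changed; a minimal sketch of how the properties shown in the verification step below might have been set or updated is:
ALTER DATABASE STUDENTS SET DBPROPERTIES ('creator'='JOHN', 'edited-by'='JAMES');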
Verification:
You can verify the updated properties using the command:
DESCRIBE DATABASE EXTENDED STUDENTS;
This will now include:
{creator=JOHN, edited-by=JAMES}
Objective:
To make a database the current working database.
Command (Act):
USE STUDENTS;
Outcome (Example):
hive> USE STUDENTS;
OK
Time taken: 0.02 seconds
There is no direct command to display the current active database in Hive.
However, to make the command prompt display the current database name as
a suffix, use the following setting:
set hive.cli.print.current.db=true;
This will help you keep track of which database you're working in during a Hive
session.
Objective:
To drop a database.
Command (Act):
DROP DATABASE STUDENTS;
Hive stores databases in the warehouse directory (e.g., /user/hive/warehouse).
Deleting a database only works if no tables exist inside it — this is the default
RESTRICT mode.
To drop a database along with all its tables:
DROP DATABASE STUDENTS CASCADE;
The CASCADE option deletes the database along with all contained tables.
RESTRICT (default) prevents deletion if any tables are present.
Complete Syntax:
DROP DATABASE [IF EXISTS] database_name [RESTRICT | CASCADE];
4.5.5 Tables
Hive provides two kinds of tables (see the sketch below):
1. Internal or Managed Table
2. External Table
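Dropping a managed table removes both its metadata and its data from the warehouse directory, whereas dropping an external table removes only the metadata and leaves the data files untouched. A brief sketch of the two forms (the exact EXT_STUDENT definition and LOCATION path are assumptions; the table name itself is reused in later examples):
CREATE TABLE STUDENT_MANAGED(rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/ext_student';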
Input:
1001,John,Smith:Jones,Mark1!45:Mark2!46:Mark3!43
1002,Jack,Smith:Jones,Mark1!46:Mark2!47:Mark3!42
Act:
CREATE TABLE STUDENT_INFO(rollno INT,name String, sub
ARRAY<STRING>,marks MAP<STRING,INT>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':'
MAP KEYS TERMINATED BY '!';
LOAD DATA LOCAL INPATH '/root/hivedemos/studentinfo.csv' INTO TABLE
STUDENT_INFO;
Outcome:
Act:
SELECT * from STUDENT_INFO;
SELECT NAME,SUB FROM STUDENT_INFO;
-- To retrieve the value of Mark1 from the MARKS map
SELECT NAME, MARKS['Mark1'] FROM STUDENT_INFO;
-- To retrieve an element of the SUB array
SELECT NAME, SUB[0] FROM STUDENT_INFO;
Outcome:
4.5.6 Partitions
In Hive, without partitioning, queries scan the entire dataset, leading to high
I/O and slow performance. Partitioning divides data into subdirectories (like by
year or region), so Hive reads only relevant parts. This reduces I/O, speeds up
queries, and improves MapReduce job efficiency.
Partition is of two types:
1. STATIC PARTITION: It is up to the user to specify the partition (the
segregation unit) into which the data from the file is to be loaded.
Hive creates a folder for the value specified in the partition, as sketched below.
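For context, a minimal sketch of how the statically partitioned table used in the following statements might have been created and loaded (the file path is an assumption):
CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT(rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv'
INTO TABLE STATIC_PART_STUDENT PARTITION (gpa=4.0);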
Objective: To add one more static partition based on "gpa" column using
the "alter" statement.
Act:
ALTER TABLE STATIC_PART_STUDENT ADD PARTITION (gpa=3.5);
INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa=4.0)
SELECT rollno, name FROM EXT_STUDENT WHERE gpa=4.0;
Outcome:
2. DYNAMIC PARTITION: Here, Hive itself determines the partition into which
each record should go, based on the value of the partition column present in
the data.
Act:
CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT(rollno
INT,name STRING) PARTITIONED BY (gpa FLOAT) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t';
Outcome:
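To load data into the dynamically partitioned table, Hive's dynamic-partition settings must be enabled first; a minimal sketch (reusing the EXT_STUDENT table from the earlier examples) is:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa)
SELECT rollno, name, gpa FROM EXT_STUDENT;
Here Hive derives the gpa partition value for each row from the data itself, rather than from a value fixed by the user.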
4.5.7 Bucketing
Bucketing is similar to partitioning. However, there is a subtle difference between
partitioning and bucketing. With partitioning, you need to create a partition for each
unique value of the column. This may lead to situations where you end up with far
too many small partitions; bucketing instead distributes the data into a fixed,
user-specified number of buckets.
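A minimal sketch of a bucketed student table (the table name STUDENT_BUCKET and the bucket count are assumptions):
set hive.enforce.bucketing=true;
CREATE TABLE STUDENT_BUCKET(rollno INT, name STRING, gpa FLOAT)
CLUSTERED BY (rollno) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
INSERT OVERWRITE TABLE STUDENT_BUCKET SELECT rollno, name, gpa FROM EXT_STUDENT;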
4.5.8 Views
In Hive, view support is available only in versions starting from 0.6. Views are
purely logical objects.
Objective: To create a view table named “STUDENT_VIEW”.
Act:
CREATE VIEW STUDENT_VIEW AS SELECT rollno, name FROM
EXT_STUDENT;
Outcome:
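Once created, the view can be queried like any other table, for example:
SELECT * FROM STUDENT_VIEW;
Being a purely logical object, the view stores no data of its own; the underlying query runs against EXT_STUDENT.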
4.5.9 Sub-Query
In Hive, sub-queries are supported only in the FROM clause (as of Hive 0.12). You
need to specify a name for the sub-query because every table in a FROM clause
must have a name. The columns in the sub-query's select list should have unique
names. The columns in the sub-query's select list are available to the outer query
just like the columns of a table.
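A minimal sketch of a FROM-clause sub-query (reusing EXT_STUDENT from the earlier examples; the alias t is arbitrary):
SELECT t.gpa, COUNT(*) FROM
(SELECT rollno, name, gpa FROM EXT_STUDENT) t
GROUP BY t.gpa;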
4.5.10 Joins
Joins in Hive are similar to SQL joins.
Objective: To create JOIN between Student and Department tables where
we use RollNo from both the tables as the join key.
Act:
CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa
FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE
INTO TABLE STUDENT;
CREATE TABLE IF NOT EXISTS DEPARTMENT(rollno INT,deptno int,name
STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/root/hivedemos/department.tsv'
OVERWRITE INTO TABLE DEPARTMENT;
SELECT a.rollno, a.name, a.gpa, b.deptno FROM STUDENT a JOIN
DEPARTMENT b ON a.rollno = b.rollno;
4.5.11 Aggregation
Act:
CREATE TABLE STUDENT_RC(rollno INT, name STRING, gpa FLOAT) STORED AS RCFILE;
INSERT OVERWRITE TABLE STUDENT_RC SELECT * FROM STUDENT;
SELECT SUM(gpa) FROM STUDENT_RC;
Outcome:
return word.toUpperCase(); // closing lines of the MyUpperCase UDF's evaluate() method (earlier lines not shown)
}
}
Note: Convert this Java program into a JAR.
ADD JAR /root/hivedemos/UpperCase.jar;
CREATE TEMPORARY FUNCTION touppercase AS
'com.example.hive.udf.MyUpperCase';
SELECT TOUPPERCASE(name) FROM STUDENT;
INTRODUCTION TO PIG
4.9 What is PIG?
Pig runs on Hadoop. Pig uses both Hadoop Distributed File System and
MapReduce Programming. By default, Pig reads input files from HDFS. Pig
stores the intermediate data (data produced by MapReduce jobs) and the
output in HDFS. However, Pig can also read input from and write output to
other sources. Pig supports the following:
1. HDFS commands.
2. UNIX shell commands.
3. Relational operators.
4. Positional parameters.
5. Common mathematical functions.
6. Custom functions.
7. Complex data structures.
Pig is widely used for "ETL" (Extract, Transform, and Load). Pig can extract
data from different sources such as ERP systems, accounting systems, flat files,
etc. Pig then makes use of various operators to perform transformations on the
data and subsequently loads it into the data warehouse. Refer to the figure below.
2. It should begin with a letter and should be followed only by letters, numbers,
and underscores.
The table below shows the list of valid and invalid identifiers.
The table below describes the simple data types supported in Pig. In Pig, fields of
unspecified type are treated as an array of bytes, which is known as bytearray.
Null: In Pig Latin, NULL denotes a value that is unknown or is non-existent.
1. Interactive Mode
2. Batch Mode
1. Interactive Mode:
Command to start Pig shell:
pig
• Pig runs in Grunt shell, an interactive environment for executing Pig
Latin scripts.
• Warnings show deprecated configuration parameters (e.g.,
mapred.job.tracker, fs.default.name) and suggest alternatives like
mapreduce.jobtracker.address, fs.defaultFS.
2. Batch Mode:
We need to create a "Pig Script" to run Pig in batch mode. Write Pig Latin
statements in a file and save it with a .pig extension.
Pig (in either interactive or batch mode) can execute in one of two run modes:
1. Local Mode.
2. MapReduce Mode.
Local Mode
To run Pig in local mode, you need to have your files in the local file system.
Syntax:
pig -x local filename
MapReduce Mode
Syntax:
pig filename
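As an illustration (the script name and its contents are assumptions), the same script can be run in either mode:
-- student.pig
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);
B = FILTER A BY gpa > 4.0;
DUMP B;
pig -x local student.pig      (local mode: paths refer to the local file system)
pig student.pig               (MapReduce mode: paths refer to HDFS)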
Objective: Learn how to perform HDFS operations directly from the Pig Grunt
shell.
Command Example:
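For instance (the paths shown are assumptions), HDFS commands can be issued from the Grunt shell by prefixing them with fs:
grunt> fs -mkdir /pigdemo
grunt> fs -copyFromLocal /root/pigdemos/student.tsv /pigdemo
grunt> fs -ls /pigdemo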
Explanation:
Purpose:
The FILTER operator is used to select tuples (records) from a relation based on
specified conditions.
Objective:
Select students whose GPA is greater than 4.0.
Input:
Relation Student with schema:
(rollno:int, name:chararray, gpa:float)
Commands (Act):
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray,
gpa:float);
B = FILTER A BY gpa > 4.0;
DUMP B;
Output:
(1003, Smith, 4.5)
(1004, Scott, 4.2)
FILTER is used to apply conditions and reduce the dataset.
It helps in extracting meaningful subsets of data for further processing.
2. FOREACH Operator in Pig
Purpose:
The FOREACH operator is used for data transformation based on columns of a
relation.
Objective:
Display the names of all students in uppercase.
Input:
Relation Student with schema:
(rollno:int, name:chararray, gpa:float)
Commands (Act):
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray,
gpa:float);
B = FOREACH A GENERATE UPPER(name);
DUMP B;
Output:
(JOHN)
(JACK)
(SMITH)
(SCOTT)
(JOSHI)
FOREACH ... GENERATE is used to apply transformations to each row.
Functions like UPPER() can be used to modify specific fields.
3. GROUP Operator in Pig
Purpose:
The GROUP operator is used to group data based on a column.
Objective:
Group tuples of students based on their GPA.
Input:
Relation Student with schema:
(rollno:int, name:chararray, gpa:float)
Commands (Act):
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray,gpa:float);
B = GROUP A BY gpa;
DUMP B;
Output:
(3.0,{(1001,John,3.0),(1001,John,3.0)})
(3.5,{(1005,Joshi,3.5),(1005,Joshi,3.5)})
(4.0,{(1008,James,4.0),(1002,Jack,4.0)})
(4.2,{(1007,David,4.2),(1004,Scott,4.2)})
(4.5,{(1006,Alex,4.5),(1003,Smith,4.5)})
GROUP A BY <column> creates groups where all tuples with the same value in
the specified column are grouped together.
Each group is represented as (group_key, bag_of_tuples).
9. SPLIT
It is used to partition a relation into two or more relations.
Objective: To partition a relation based on the GPAs acquired by the
students.
If GPA = 4.0, place the tuple into relation X.
If GPA < 4.0, place it into relation Y.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
SPLIT A INTO X IF gpa==4.0, Y IF gpa<4.0;
DUMP X;
10. SAMPLE
It is used to select a random sample of data based on the specified sample size
(a fraction between 0 and 1; for example, 0.01 retains roughly 1% of the tuples).
Objective: To depict the use of SAMPLE.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = SAMPLE A 0.01;
DUMP B;
4.20.2 MAX
MAX is used to compute the maximum of numeric values in a single column
bag.
Objective: To calculate the maximum marks for each student.
Input: Student (studname:chararray,marks:int)
Act:
A = load '/pigdemo/student.csv' USING PigStorage(',') as
(studname:chararray, marks:int);
B = GROUP A BY studname;
C = FOREACH B GENERATE group, MAX(A.marks);
DUMP C;
4.20.3 COUNT
COUNT is used to count the number of elements in a bag.
Objective: To count the number of tuples in a bag.
Input:
Student (studname:chararray,marks:int)
Act:
A = load '/pigdemo/student.csv' USING PigStorage(',') as
(studname:chararray, marks:int);
B = GROUP A BY studname;
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;
DUMP B
Note: Convert the above Java class into a JAR to include this function in
your code.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
register /root/pigdemos/myudfs.jar;
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
*****END*****