Unit 4 Hadoop Ecosystem - HIVE and PIG
• HIVE query language, loading data into tables, HIVE built-in functions, joins in HIVE, Partitioning
• HiveQL: querying data, sorting and aggregation
• PIG built-in functions, filtering, grouping, sorting data; installation of PIG and PIG Latin commands
• Self-Learning Topics: Cloudera IMPALA
Introduction to Hive
• Hive is a data warehouse infrastructure tool built on top of Hadoop for querying and analyzing large data sets that are principally stored in HDFS.
• It provides HQL (Hive Query Language), which gets internally converted to MapReduce jobs.
Hive vs Pig
• Hive is commonly used by data analysts; Pig is commonly used by programmers.
• Hive follows SQL-like queries (HQL); Pig follows the data-flow language Pig Latin.
• Hive can handle structured data; Pig can handle semi-structured data.
• Hive works on the server side of the HDFS cluster; Pig works on the client side of the HDFS cluster.
• Hive is slower than Pig; Pig is comparatively faster than Hive.
Hive Architecture
Q. What is HIVE? Explain HIVE architecture in detail.
Hive Architecture: Hive Client
• Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:
• Thrift Server - It is a cross-language service provider platform that serves requests from all programming languages that support Thrift.
– Apache Thrift is basically a protocol that defines how connections are made between clients and servers. Apache Hive uses Thrift to allow remote users to connect to HiveServer2 (the Thrift server) and submit queries. Thrift bindings are available in many different languages, such as C++, Java, and Python, so users can query the same source from different languages.
• JDBC Driver - It is used to establish a connection between Hive and Java applications.
• The compiler generates the execution plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes these tasks in the order of their dependencies.
Hive Services
2. The driver interacts with the compiler to get the plan. (Here, plan refers to the query execution process and its related metadata information gathering.)
5. The compiler communicates the proposed plan back to the driver to execute the query.
7. Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query. For DFS operations:
– EE first contacts the Name Node and then the Data Nodes to get the values stored in tables.
– EE fetches the desired records from the Data Nodes. The actual data of tables resides in the data nodes only, while from the Name Node it fetches only the metadata information for the query.
– It collects the actual data from the data nodes related to the mentioned query.
– Execution Engine (EE) communicates bi-directionally with the Metastore present in Hive to perform DDL (Data Definition Language) operations. DDL operations like CREATE, DROP and ALTER on tables and databases are done here. The Metastore stores information about database names, table names and column names only, and it fetches the metadata related to the mentioned query.
– Execution Engine (EE) in turn communicates with Hadoop daemons such as the Name Node, Data Nodes, and Job Tracker to execute the query on top of the Hadoop file system.
8. Fetching results from the driver.
9. Sending results to the Execution Engine. Once the results are fetched from the data nodes to the EE, it sends the results back to the driver and to the UI (front end).
• Hive stays in continuous contact with the Hadoop file system and its daemons via the Execution Engine. The dotted arrows in the job-flow diagram show the Execution Engine's communication with the Hadoop daemons.
Different modes of Hive
• Integer Types
Type       Size                    Range
TINYINT    1-byte signed integer   -128 to 127
SMALLINT   2-byte signed integer   -32,768 to 32,767
INT        4-byte signed integer   -2,147,483,648 to 2,147,483,647
BIGINT     8-byte signed integer   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
HIVE Data Types
• Decimal Type
Date/Time Types:
• TIMESTAMP
– It supports traditional UNIX timestamp with optional nanosecond precision.
– As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
– As Floating point numeric type, it is interpreted as UNIX timestamp in
seconds with decimal precision.
– As string, it follows java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal place precision)
• Date: The Date value is used to specify a particular year, month and day, in the form YYYY-MM-DD.
HIVE Data Types
String Types
• STRING: The string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
• VARCHAR: The varchar is a variable-length type whose length lies between 1 and 65535, which specifies the maximum number of characters allowed in the character string.
• CHAR: The char is a fixed-length type whose maximum length is fixed at 255.
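• A minimal sketch showing how several of the above types appear in a CREATE TABLE statement (the table and column names here are assumed for illustration, not from the tutorial):
hive> create table emp_types (id int, name string, dept char(10), email varchar(50), salary float, joined timestamp, dob date);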
HIVE Data Types
• Each database must have a unique name. If we create two databases with the same name, Hive gives an error message.
• If we want to suppress the error generated by Hive on creating a database with the same name, use the IF NOT EXISTS clause, as shown in the sketch after the next command.
• Hive also allows assigning properties to the database in the form of key-value pairs.
hive> create database demo
    > WITH DBPROPERTIES ('creator' = 'Gaurav Chawla', 'date' = '2019-06-03');
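• A minimal sketch of the IF NOT EXISTS form mentioned above for suppressing the "database already exists" error (reusing the demo database name):
hive> create database if not exists demo;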
– The database demo is not present in the list. Hence, the database is dropped
successfully.
– If we try to drop a database that doesn't exist, the following error is generated:
– It is not allowed to drop a database that contains tables directly. In such a case, we can drop the database either by dropping its tables first or by using the CASCADE keyword with the command.
– The CASCADE keyword automatically drops the tables present in the database first.
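• A minimal sketch of the CASCADE form (reusing the demo database from the earlier example):
hive> drop database if exists demo cascade;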
• Internal tables are also called managed tables, as the lifecycle of their data is controlled by Hive.
• By default, these tables are stored in a subdirectory under the directory defined by hive.metastore.warehouse.dir (i.e. /user/hive/warehouse).
• If we try to drop an internal table, Hive deletes both the table schema and the data.
• Command:
hive> create table demo.employee (Id int, Name string, Salary float)
    > row format delimited
    > fields terminated by ',';
• When we try to create a table that already exists, an exception occurs. If we want to ignore this type of exception, we can use the IF NOT EXISTS clause while creating the table.
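• A minimal sketch of the IF NOT EXISTS form (reusing the employee table definition above):
hive> create table if not exists demo.employee (Id int, Name string, Salary float)
    > row format delimited
    > fields terminated by ',';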
• While creating a table, we can add the comments to the columns and can also
define the table properties.
hive> create table demo.new_employee (Id int comment 'Employee Id', Name string comment 'Employee Name', Salary float comment 'Employee Salary')
    > comment 'Table Description'
    > TBLPROPERTIES ('creator'='Gaurav Chawla', 'created_at' = '2019-06-06 11:00:00');
Hive - Load Data
• Once the internal table has been created, the next step is to load the data into it.
• To load the data of a file into the table, use the following command:
• If we want to add more data into the current table, execute the same query again, just updating the new file name.
hive> load data local inpath '/home/codegyani/hive/emp_details1' into table demo.employee;
• If we try to load unmatched data (i.e., one or more column values don't match the data type of the specified table columns), it will not throw any exception. However, it stores NULL values in the positions of the unmatched columns.
Hive - Drop Table
• First, select the database from which we want to delete the table by using the following command:
hive> use demo;
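• The drop command itself is not shown above; a minimal sketch (using the new_employee table referenced below) would be:
hive> drop table if exists new_employee;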
– The table new_employee is no longer present in the list of tables. Hence, the table is dropped successfully.
Hive - Alter Table
• ALTER TABLE is used to perform modifications in an existing table, like changing the table name, column names, comments, and table properties.
• Rename a Table: to change the name of an existing table.
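• A minimal sketch (the old table name employee is assumed here; the renamed table employee_data is the one used in the following examples):
hive> alter table employee rename to employee_data;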
• hive> alter table employee_data add columns (age int);
• Since we didn't add any data to the new column, Hive considers NULL as its value.
Change Column
• To delete one or more columns, replace the existing columns with the new set of columns.
• To drop a column from the table:
hive> alter table employee_data replace columns (id string, first_name string, age int);
Partitioning in Hive
• Dividing the table into parts based on the values of a particular column like date, course, city or country.
• Advantage: as the data is stored in slices, the query response time becomes faster.
• We perform partitioning in Hive to divide the data among different datasets based on particular columns.
• Types:
– Static partitioning
– Dynamic partitioning
Static Partitioning
• Load the data into the table and pass the values of the partition columns with it by using the following command:
hive> load data local inpath '/home/codegyani/hive/student_details1' into table student
    > partition(course = "java");
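• The partitioned student table would typically be created beforehand; a minimal sketch, assuming it has the same columns as the stud_demo table shown later (minus the partition column):
hive> create table student (id int, name string, age int, institute string)
    > partitioned by (course string)
    > row format delimited
    > fields terminated by ',';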
Dynamic Partitioning
• In dynamic partitioning, the values of the partitioned columns exist within the table, so it is not required to pass the values of the partitioned columns manually.
• First, select the database in which we want to create the table.
– hive> use show;
• Enable dynamic partitioning by using the following commands:
– hive> set hive.exec.dynamic.partition=true;
– hive> set hive.exec.dynamic.partition.mode=nonstrict;
• Create a dummy table to store the data.
– hive> create table stud_demo(id int, name string, age int, institute string, course string)
• Let's retrieve the entire data of the table by using the following command:
– hive> select * from student_part;
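• The steps between creating the dummy table and querying student_part are not shown above; a minimal sketch of what they would typically look like (the file name and the student_part columns are assumed to match the earlier pattern):
hive> load data local inpath '/home/codegyani/hive/student_details' into table stud_demo;
hive> create table student_part (id int, name string, age int, institute string)
    > partitioned by (course string);
hive> insert into table student_part partition(course)
    > select id, name, age, institute, course from stud_demo;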
Static vs Dynamic Partitioning
1. Static partitions are preferred when loading big files into Hive tables. Dynamic partitioning is suitable when you have large data already stored in a table, or when you want to partition on columns whose values are not known in advance.
2. Static: we "statically" add a partition to the table and move the file into that partition. Dynamic: the values of the partitioned columns exist within the table, so it is not required to pass them manually.
3. Static: we can alter a partition. Dynamic: we can't perform alter on a dynamic partition.
4. Static partitioning works in strict mode; dynamic partitioning works in non-strict mode.
5. Static: you should use a where clause to limit the data. Dynamic: no where clause is required to limit it.
Bucketing in Hive
• According to the hash function (with 3 buckets):
  7 % 3 = 1
  4 % 3 = 1
  1 % 3 = 1
  So, these records are stored in bucket 1.
• Let's retrieve the data of bucket 1.
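• The bucketed table behind this example is not shown above; a minimal sketch (the table name emp_bucket and the bucketing column id are assumed for illustration):
hive> set hive.enforce.bucketing = true;
hive> create table emp_bucket (id int, name string, salary float)
    > clustered by (id) into 3 buckets
    > row format delimited
    > fields terminated by ',';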
iii. Here also, bucketed tables offer faster query responses than non-bucketed tables.
iv. This concept offers the flexibility to keep the records in each bucket sorted by one or more columns.
• The HQL GROUP BY clause is used to group the data from multiple records based on one or more columns.
• It is generally used in conjunction with aggregate functions (like SUM, COUNT, MIN, MAX and AVG) to perform an aggregation over each group.
• For example, the sum of employee salaries department-wise can be obtained with the command shown below.
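• A minimal sketch of that GROUP BY command (assuming the same emp table used in the HAVING example that follows):
hive> select department, sum(salary) from emp group by department;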
• The HQL HAVING clause is used with the GROUP BY clause. Its purpose is to apply constraints on the groups of data produced by GROUP BY, so it only returns groups where the condition is TRUE.
• The sum of employees' salaries by department, keeping only departments having sum >= 35000, is obtained by using the following command:
– hive> select department, sum(salary) from emp group by department having sum(salary) >= 35000;
HiveQL - ORDER BY
• An example to arrange the data in sorted order using the ORDER BY clause.
• Fetch the data in descending order of salary by using the following command:
– hive> select * from emp order by salary desc;
HiveQL - SORT BY Clause
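• SORT BY sorts the data within each reducer rather than producing a single totally ordered output like ORDER BY. A minimal sketch on the same emp table:
hive> select * from emp sort by salary desc;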
PIG
• The language used to analyze data in Hadoop using Pig is known as Pig Latin.
• It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.
• To perform a task, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, script, or embedded mode).
• Internally, Apache Pig converts these scripts into a series of MapReduce jobs.
The architecture of Apache Pig is shown below.
Parser:
– Initially the Pig scripts are handled by the Parser.
– It checks the syntax of the script, does type checking, and other miscellaneous checks.
– The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
– In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.
Optimizer:
– The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.
Compiler:
– The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine:
– Finally, the MapReduce jobs are submitted to Hadoop in a sorted order and executed to produce the desired results.
• The data model of Pig Latin is fully nested and it allows complex non-atomic
datatypes such as map and tuple.
• Given below is the diagrammatical representation of Pig Latin’s data model.
XP
Atom:
• Any single value in Pig Latin, irrespective of its data type, is known as an Atom.
• We can use it as a string or a number; it is stored as a string.
• Atomic values of Pig are int, long, float, double, chararray, and bytearray.
• A field is a piece of data or a simple atomic value in Pig.
• For example − ‘Shubham’ or ‘30’
Tuple:
• A tuple is a record that is formed by an ordered set of fields.
• The fields can be of any type.
• A tuple is similar to a row in a table of an RDBMS.
For example − (Shubham, 30)
Bag:
• A bag is an unordered set of tuples. It is represented by ‘{}’.
Map:
• A map is a set of key-value pairs. It is represented by ‘[]’.
Relation:
• A relation is a bag of tuples.
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
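• A small illustrative sketch of how these look written out in Pig Latin (the values are assumed for the example):
  field (atom): ‘Shubham’
  tuple: (Shubham, 30)
  bag: {(Shubham, 30), (Ankit, 32)}
  map: ['name'#'Shubham', 'age'#30]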
Statements in Pig Latin
• Chararray: represents a character array (string) in Unicode UTF-8 format, e.g. ‘Class MCA’
• Tuple: an ordered set of fields, for example (Ankit, 32)
Null Values:
– Values for all the above data types can be NULL.
Pig Latin Arithmetic Operators
• * Multiplication − multiplies the values on either side of the operator. E.g. if a = 40 and b = 30, a * b gives 1200.
• / Division − divides the left-hand operand by the right-hand operand. E.g. if a = 40, b = 20, a / b results in 2.
• % Modulus − divides the left-hand operand by the right-hand operand and gives the remainder as the result. E.g. if a = 40, b = 30, a % b results in 10.
• ?: Bincond − evaluates a Boolean expression and has three operands: variable x = (expression) ? value1 if true : value2 if false. E.g. b = (a == 1) ? 40 : 20; if a = 1 the value of b is 40, if a != 1 the value is 20.
• CASE WHEN THEN ELSE END − the Case operator is equivalent to nested bincond operators. E.g. CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END
Pig Latin Comparison Operators XP
!= Not Equal − Checks the values of two operands are equal or not. If the If a=10, b=20, then
values are equal, then condition becomes false else true. (a != b) is true
Greater than − It checks whether the right operand value is greater If a=10, b=20,
> than that of the right operand. If yes, then the condition becomes true. then(a > b) is not
true.
Less than − This operator checks the value of the left operand is less (a < b) is true, if
< than the right operand. If condition fulfills, then it returns true. a=10, b=20.
Greater than or equal to − It checks the value of the left operand with If a=20, b=50,
>= right hand. It checks whether it is greater or equal to the right operand. true(a >= b) is not
If yes, then it returns true. true.
<= Less than or equal to − The value of the left operand is less than or If a=20, b=20, (a <=
equal to that of the right operand. Then the condition still returns true. b) is true.
matches Pattern matching − This checks the string in the left-hand matches with f1 matches ‘.*df.*’
the constant in the RHS.
Type Construction Operators: () constructs a tuple, {} constructs a bag, and [] constructs a map.
Pig Latin Relational Operations
Loading and Storing:
• LOAD − loads data from a file system into a relation.
• STORE − stores a relation to the file system (local/HDFS).
Filtering:
• FILTER − removes unwanted rows from a relation.
• DISTINCT − removes duplicate rows from a relation.
• FOREACH, GENERATE − transforms the data based on the columns of data.
• STREAM − transforms a relation using an external program.
Grouping and Joining:
• JOIN − joins two or more relations.
• COGROUP − groups the data in two or more relations.
• GROUP − groups the data in a single relation.
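• To see several of these operators together, here is a minimal sketch of a Pig Latin script (the file student_data.txt and its fields are assumed for illustration):
grunt> students = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
grunt> grouped = GROUP students BY city;
grunt> counts = FOREACH grouped GENERATE group, COUNT(students);
grunt> DUMP counts;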
Speaking Pig Latin
• FILTER
– Select a subset of the tuples in a bag:
  newBag = FILTER bagName BY expression;
– The expression uses simple comparison operators (==, !=, <, >, …) and logical connectors (AND, NOT, OR):
  some_apples = FILTER apples BY colour != ‘red’;
– Can use UDFs:
  some_apples = FILTER apples BY NOT isRed(colour);
Pig Latin Relational Operations
Sorting:
• ORDER − arranges a relation in an order based on one or more fields.
• LIMIT − gets a particular number of tuples from a relation.
Combining and Splitting:
• UNION − combines two or more relations into one relation.
• SPLIT − splits a single relation into two or more relations.
Diagnostic Operators:
• DUMP − prints the content of a relation on the console.
• DESCRIBE − describes the schema of a relation.
• EXPLAIN − views the logical and physical execution plans to evaluate a relation.
Ways to run Pig
• The mode depends on where the Pig script is going to run and where the data resides. The data can be stored on a single machine, i.e. the local file system, or in a distributed environment like a typical Hadoop cluster.
• The first way is the non-interactive shell, also known as script mode. In this we have to create a file, load the code into the file and execute the script.
• The second is the Grunt shell, an interactive shell for running Apache Pig commands.
• The third is embedded mode; in this we can run Pig programs from Java, much like using JDBC to run SQL programs from Java.
Execution modes
• Local mode: In this mode, Pig runs in a single JVM and makes use of the local file system. This mode is suitable only for the analysis of small datasets.
• MapReduce mode: In this mode, we need a proper Hadoop cluster setup with Hadoop installed on it. By default, Pig runs in MapReduce mode. Pig translates the submitted queries into MapReduce jobs and runs them on top of the Hadoop cluster. We can describe this as MapReduce mode on a fully distributed cluster.
• Pig Latin statements like LOAD and STORE are used to read data from the HDFS file system and to generate output. These statements are used to process data.
• Storing and viewing results:
• By using STORE we can save the final results to the file system, and by using DUMP we can get the final results displayed on the output console.
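• A minimal sketch putting LOAD, STORE and DUMP together (the HDFS paths and file names are assumed for illustration):
grunt> emp = LOAD 'hdfs://localhost:9000/pig_data/emp.txt' USING PigStorage(',') AS (id:int, name:chararray, salary:float);
grunt> STORE emp INTO 'hdfs://localhost:9000/pig_output/emp_out' USING PigStorage(',');
grunt> DUMP emp;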
Eval Functions:
i. AVG(): to compute the average of the numerical values within a bag.
   Syntax: AVG(expression)
ii. BagToString(): It is used to concatenate the elements of a bag into a string. While concatenating, we can place a delimiter between these values.
iv. COUNT(): to count the number of elements (tuples) in a bag.
   Syntax: COUNT(expression)
v. COUNT_STAR(): similar to COUNT(), but it includes null values while counting.
   Syntax: COUNT_STAR(expression)
vii. IsEmpty(): to check whether a bag or map is empty.
   Syntax: IsEmpty(expression)
viii. MAX(): to calculate the highest value for a column (numeric values or chararrays) in a single-column bag.
   Syntax: MAX(expression)
ix. MIN(): to get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag.
   Syntax: MIN(expression)
x. PluckTuple(): We can define a string prefix and filter the columns in a relation that begin with the given prefix.
xi. SIZE(): to compute the number of elements based on any Pig data type.
   Syntax: SIZE(expression)
Eval Functions:
xii. SUBTRACT(): to subtract two bags. It takes two bags as inputs and returns a bag which contains the tuples of the first bag that are not in the second bag.
   Syntax: SUBTRACT(bag1, bag2)
xiii. SUM(): to get the total of the numeric values of a column in a single-column bag.
   Syntax: SUM(expression)
xiv. TOKENIZE(): for splitting a string (which contains a group of words) in a single tuple; it returns a bag which contains the output of the split operation.
   Syntax: TOKENIZE(expression)
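• A minimal sketch of TOKENIZE in a word-count style script (the file lines.txt is assumed for illustration):
grunt> lines = LOAD 'lines.txt' AS (line:chararray);
grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> grouped = GROUP words BY word;
grunt> counts = FOREACH grouped GENERATE group, COUNT(words);
grunt> DUMP counts;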
Load and Store Functions
iii. BinStorage(): for loading and storing data into Pig using a machine-readable format.
   Syntax: BinStorage()
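• The most commonly used load/store function is PigStorage, which reads and writes delimited text; a minimal sketch (file name and schema assumed):
grunt> student = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> STORE student INTO 'student_out' USING PigStorage('|');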
String Functions
i. ENDSWITH(string, testAgainst): this Pig function verifies whether the first string ends with the second string.
LOWER(expression): converts all the characters in a string to lowercase.
UPPER(expression): converts all the characters in a string to uppercase.
xi. STRSPLIT(string, regex, limit): to split a string around matches of a given regular expression.
xii. STRSPLITTOBAG(string, regex, limit): it splits the string by the given delimiter and returns the result in a bag.
xiii. TRIM(expression): it is used to return a copy of a string with leading and trailing whitespace removed.
xiv. LTRIM(expression): returns a copy of a string with leading whitespace removed.
xv. RTRIM(expression): returns a copy of a string with trailing whitespace removed.
Date and Time Functions
i. ToDate(milliseconds): returns a date-time object built from the given parameters. There are more alternatives for this function, such as ToDate(isostring) and ToDate(userstring, format).
iii. GetDay(datetime): to get the day of a month as a return from the date-time object, we use it.
iv. GetHour(datetime): GetHour returns the hour of a day from the date-time object.
v. GetMilliSecond(datetime): returns the millisecond of a second from the date-time object.
vi. GetMinute(datetime): to get the minute of an hour in return from the date-time object, we use it.
Date and Time Functions
vii. GetMonth(datetime): GetMonth returns the month of a year from the date-time object.
viii. GetSecond(datetime): returns the second of a minute from the date-time object.
ix. GetWeek(datetime): to get the week of a year as a return from the date-time object, we use it.
x. GetWeekYear(datetime): returns the week year from the date-time object.
xi. GetYear(datetime): returns the year from the date-time object.
xii. AddDuration(datetime, duration): to get the result of a date-time object along with the duration object, we use it.
xiii. SubtractDuration(datetime, duration): SubtractDuration subtracts the duration object from the date-time object and returns the result.
Date and Time Functions
• DaysBetween(datetime1, datetime2): returns the number of days between the two date-time objects.
• MilliSecondsBetween(datetime1, datetime2): to get the number of milliseconds between two date-time objects, we use it.
• MonthsBetween(datetime1, datetime2): to get the number of months between two date-time objects, we use it.
• YearsBetween(datetime1, datetime2): to get the number of years between two date-time objects, we use it.
Math Functions
i. ABS: to get the absolute value of an expression. Syntax: ABS(expression)
ii. ACOS: to get the arc cosine of an expression. Syntax: ACOS(expression)
iii. ASIN: to get the arc sine of an expression. Syntax: ASIN(expression)
iv. ATAN: to get the arc tangent of an expression. Syntax: ATAN(expression)
v. CBRT: to get the cube root of an expression. Syntax: CBRT(expression)
vi. CEIL: to get the value of an expression rounded up to the nearest integer. Syntax: CEIL(expression)
vii. COS: to get the trigonometric cosine of an expression. Syntax: COS(expression)
viii. COSH: to get the hyperbolic cosine of an expression. Syntax: COSH(expression)
xiv. ROUND: gives the value of an expression rounded to an integer (if the result type is float) or rounded to a long (if the result type is double). Syntax: ROUND(expression)
xv. SIN: to get the sine of an expression. Syntax: SIN(expression)
xvi. SINH: to get the hyperbolic sine of an expression. Syntax: SINH(expression)
xvii. SQRT: to get the positive square root of an expression. Syntax: SQRT(expression)
xviii. TAN: to get the trigonometric tangent of an expression. Syntax: TAN(expression)
xix. TANH: to get the hyperbolic tangent of an expression. Syntax: TANH(expression)
Filtering data
• The FILTER operator is used to select the required tuples from a relation based on a condition.
• Syntax:
  grunt> Relation2_name = FILTER Relation1_name BY (condition);
• The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
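• A minimal sketch of GROUP (reusing the assumed students relation from the earlier example):
grunt> group_by_city = GROUP students BY city;
grunt> DUMP group_by_city;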
• sh Command: By using the sh command, we can invoke shell commands from the Grunt shell.
  e.g. grunt> sh ls
• fs Command: By using the fs command, we can invoke HDFS commands such as ls from the Grunt shell. Here, it lists the files in the HDFS root directory.
  e.g. grunt> fs -ls
• debug: By passing on/off to this key, we can turn the debugging feature in Pig on or off.
• job.name: By passing a string value to this key, we can set the job name for the required job.
• job.priority: By passing one of the following values to this key, we can set the job priority of a job − very_low, low, normal, high, very_high.
• stream.skippath: By passing the desired path in the form of a string to this key, we can set the path from where the data is not to be transferred, for streaming.
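• A minimal sketch of how these keys are set from the Grunt shell (the values shown are illustrative):
grunt> set debug on
grunt> set job.name 'unit4-demo-job'
grunt> set job.priority high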