BDP U4


Introduction to Apache Pig

Pig represents Big Data as data flows. Pig is a high-level platform, or tool, used to process large datasets. It provides a high level of abstraction over MapReduce and offers a high-level scripting language, known as Pig Latin, which is used to develop data-analysis code. To process data stored in HDFS, programmers write scripts in the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts these scripts into map and reduce tasks, but this is not visible to programmers, which provides the high level of abstraction. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.

Note: The Pig Engine has two types of execution environments: a local execution environment in a single JVM (used when the dataset is small) and a distributed execution environment on a Hadoop cluster.

Need for Pig: One limitation of MapReduce is that the development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming process. Apache Pig reduces development time by using a multi-query approach. Pig is also beneficial for programmers who do not come from a Java background: 200 lines of Java code can often be written in only 10 lines of Pig Latin. Programmers with SQL knowledge need less effort to learn Pig Latin.

 It uses a multi-query approach, which reduces the length of the code.

 Pig Latin is an SQL-like language.

 It provides many built-in operators.

 It provides nested data types (tuples, bags, and maps).

Evolution of Pig: Apache Pig was developed by Yahoo's researchers in 2006. At that time, the main idea behind developing Pig was to execute MapReduce jobs on extremely large datasets. In 2007, it moved to the Apache Software Foundation (ASF), which made it an open-source project. The first version (0.1) of Pig came out in 2008, and the latest version, 0.17, was released in 2017.

Features of Apache Pig: 

 For performing several operations, Apache Pig provides a rich set of operators such as filter, join, and sort.

 Easy to learn, read, and write. Apache Pig is a boon especially for SQL programmers.

 Apache Pig is extensible, so you can write your own user-defined functions and processing logic.

 The join operation is easy in Apache Pig.

 Fewer lines of code.

 Apache Pig allows splits in the pipeline.

 The data structure is multivalued, nested, and richer.

 Pig can handle the analysis of both structured and unstructured data.
 
Difference between Pig and MapReduce

 Apache Pig is a scripting language, whereas MapReduce is a compiled programming language.

 Apache Pig provides a higher level of abstraction, whereas MapReduce works at a lower level.

 Apache Pig needs fewer lines of code as compared to MapReduce.

 Less development effort is needed for Apache Pig; more development effort is required for MapReduce.

 Code efficiency of Pig is less compared to MapReduce; MapReduce code efficiency is higher.

 Pig provides built-in functions for ordering, sorting, and union, whereas in MapReduce it is hard to perform such data operations.

 Pig allows nested data types like map, tuple, and bag; MapReduce does not allow nested data types.

Applications of Apache Pig:  

 Pig scripting is used for exploring large datasets.

 It provides support for ad-hoc queries across large datasets.

 It is used for prototyping algorithms that process large datasets.

 It is used to process time-sensitive data loads.

 It is used for processing large volumes of data such as search logs and web crawls.

 It is used where analytical insights are needed through sampling.

Types of Data Models in Apache Pig: It consists of four types of data models, as follows:

 Atom: An atomic data value, which is stored as a string. It can be used both as a number and as a string.

 Tuple: An ordered set of fields.

 Bag: A collection of tuples.

 Map: A set of key/value pairs.
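A minimal Pig Latin sketch of how these models fit together, assuming a hypothetical input file data.txt in which each record holds a name (atom), a bag of (subject, marks) tuples, and a map of extra info:

grunt> A = LOAD 'data.txt' AS (name:chararray, scores: bag {t: tuple(subject:chararray, marks:int)}, info: map[]);

grunt> DUMP A;

A record in the output would then look like: (john,{(maths,90),(physics,85)},[city#delhi])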

Apache Pig - Execution

Apache Pig Execution Modes

You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.

Local Mode

In this mode, all the files are installed and run from your local host and local file system. There is no
need of Hadoop or HDFS. This mode is generally used for testing purpose.

MapReduce Mode

MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS)
using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a
MapReduce job is invoked in the back-end to perform a particular operation on the data that exists
in the HDFS.

Apache Pig Execution Mechanisms

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.

 Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt
shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump
operator).

 Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in
a single file with .pig extension.

 Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions
(User Defined Functions) in programming languages such as Java, and using them in our
script.

Invoking the Grunt Shell

You can invoke the Grunt shell in a desired mode (local/MapReduce) using the -x option as shown below.

Local mode command:

$ ./pig -x local

MapReduce mode command:

$ ./pig -x mapreduce
Either of these commands gives you the Grunt shell prompt as shown below.

grunt>

You can exit the Grunt shell using ‘ctrl + d’.

After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin
statements in it.

grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

Executing Apache Pig in Batch Mode

You can write an entire Pig Latin script in a file and execute it from the command line, selecting the execution mode with the -x option. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.

Sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING

PigStorage(',') as (id:int,name:chararray,city:chararray);

Dump student;

Now, you can execute the script in the above file as shown below.

Local mode:

$ pig -x local sample_script.pig

MapReduce mode:

$ pig -x mapreduce sample_script.pig

Hive vs Pig vs SQL

With the range of technologies in the big data world, there is often confusion about which one to choose. Big data requires handling huge databases efficiently, along with good options for managing and querying the data. When it comes to managing databases, SQL (Structured Query Language) is the old friend, well tried and tested by everyone for data analysis. But the complicated world of Hadoop needs higher-level data analysis tools.

Though old SQL still is the favorite of many and is popularly used in numerous organizations, Apache
Hive and Pig have become the buzz terms in the big data world today. These tools provide easy
alternatives to carry out the complex programming of MapReduce helping data developers and
analysts.
The organizations looking for open source querying and programming to tame Big data have
adopted Hive and Pig widely. At the same time, it is vital to pick and choose the right platform and
tool for managing your data well. Hence it is essential to understand the differences between Hive vs
Pig vs SQL and choose the best suitable option for the project.

Technical Differences Between Hive vs Pig vs SQL


Apache Hive
Apache Hive is an excellent big data tool that helps in writing, reading, and managing huge datasets present in distributed storage. It is an open source project built on Hadoop to analyse, summarise, and query datasets. There is a special language similar to SQL, known as HiveQL, that converts queries into MapReduce programmes that can be executed on datasets in HDFS (Hadoop Distributed File System).

Hive is seen as a Data Warehouse infrastructure and is used as an ETL (Extraction-Transformation-Loading) tool. It improves the flexibility in schema design with data serialization and deserialization.
It is an excellent tool for querying historical data.

Apache Pig

Apache Pig is another platform with high-level language to express the analysis programmes to
analyse huge datasets. It is an open source project that provides a simple language Pig Latin that
manipulates and queries the data.

It is quite easy to learn and use Pig if you are aware of SQL. It provides the use of nested data types-
Tuples, Maps, Bags, etc. and supports data operations like Joins, Filters, and Ordering. Tech giants
like Google, Yahoo, and Microsoft use Pig for the analysis of enormous datasets arising out of search
logs, web crawls, and click streams.

SQL

Structured Query Language is the traditional database management tool used by programmers for
decades. It is a declarative language that manages the data stored in relational database systems.
SQL is a much better option than Excel, as it is a fast tool for data processing and analysis.

Hive vs Pig vs SQL – When to Use What?

All three technologies, Hive, Pig, and SQL, are quite popular in the industry for data analysis and management, but the bigger question is knowing the appropriate use of these tools. There is a need to understand which platform suits your needs better and when to use what. Let us look at the scenarios in which we can use these three tools appropriately, in the context of Hive vs Pig vs SQL.

When to Use Hive

Facebook widely uses Apache Hive for the analytical purposes. Furthermore, they usually promote
Hive language due to its extensive feature list and similarities with SQL. Here are some of the
scenarios when Apache Hive is ideal to use:

 To query large datasets: Apache Hive is specially used for analytics purposes on huge datasets. It is an easy way to approach and quickly carry out complex querying on datasets and inspect the datasets stored in the Hadoop ecosystem.

 For extensibility: Apache Hive contains a range of user APIs that help in building the custom
behavior for the query engine.

 For someone familiar with SQL concepts: If you are familiar with SQL, Hive will be very easy
to use as you will see many similarities between the two. Hive uses the clauses like select,
where, order by, group by, etc. similar to SQL.

 To work on Structured Data: In case of structured data, Hive is widely adopted everywhere.
 To analyze historical data: Apache Hive is a great tool for analysis and querying of the data
which is historical and collected over a period.


When to Use Pig

Apache Pig, developed by Yahoo Research in 2006, is famous for its extensibility and optimization scope. The language uses a multi-query approach that reduces the time spent on data scanning. It usually runs on the client side of Hadoop clusters. It is also quite easy to use if you are familiar with the SQL ecosystem. You can use Apache Pig in the following scenarios:

 To use as an ETL tool: Apache Pig is an excellent ETL (Extract-Transform-Load) tool for big data. It is a data flow system that uses Pig Latin, a simple language for data queries and manipulation.

 As a programmer with scripting knowledge: Programmers with scripting knowledge can learn how to use Apache Pig very easily and efficiently.

 For fast processing: Apache Pig is faster than Hive because it uses a multi-query approach.
Apache Pig is famous worldwide for its speed.

 When you don’t want to work with Schema: In case of Apache Pig, there is no need for
creating a schema for the data loading related work.

 For SQL like functions: It has many functions related to SQL along with the cogroup function.


When to Use SQL

SQL is a general-purpose database management language used around the globe. It has been updated to match user expectations for decades. It is declarative and hence focuses explicitly on 'what' is needed. It is popularly used for transactional as well as analytical queries. When the requirements are not too demanding, SQL works as an excellent tool. Here are a few scenarios:

 For better performance: SQL is famous for its ability to pull data quickly and frequently. It
supports OLAP (Online Analytical Processing) applications and performs better for these
applications. Hive is slow in case of online transactional needs.

 When the datasets are small: SQL works well with small datasets and performs much better
for smaller amounts of data. It also has many ways for the optimisation of data.

 For frequent data manipulation: If your requirement needs frequent modification in records
or you need to update a large number of records frequently, SQL can perform these
activities well. SQL also provides an entirely interactive experience to the user.


Does the comparison of Hive vs Pig vs SQL decide a winner?
We have seen that there are significant differences among the three: Hive vs Pig vs SQL. All of them perform specific functions and meet unique business requirements. Also, all three require proper infrastructure and skills for their efficient use while working on datasets.

Hive vs Pig vs SQL

Nature of Language:
Pig - uses a procedural language called Pig Latin.
Hive - uses a declarative language called HiveQL.
SQL - is itself a declarative language.

Definition:
Pig - an open-source, high-level data flow language with a multi-query approach.
Hive - an open-source tool built with an analytical focus, used for analytical queries.
SQL - a general-purpose database language for analytical and transactional queries.

Suitable for:
Pig - suitable for complex as well as nested data structures.
Hive - ideal for batch processing, i.e. OLAP (Online Analytical Processing).
SQL - ideal for more straightforward business demands for fast data analysis.

Operational for:
Pig - semi-structured and structured data.
Hive - used only for structured data.
SQL - a domain-specific language for a relational database management system.

Compatibility:
Pig - works on top of MapReduce.
Hive - works on top of MapReduce.
SQL - not compatible with MapReduce programming.

Use of Schema:
Pig - no concept of a schema to store data.
Hive - supports schema for data insertion in tables.
SQL - strict use of schemas for storing data.

On one side, Apache Pig relies on scripts and requires special knowledge, while Apache Hive is the answer for developers working natively on databases. Furthermore, Apache Hive has better access options and features than Apache Pig. However, Apache Pig works faster than Apache Hive. On the other hand, SQL, being an old tool with powerful abilities, is still the answer to many of our needs. Looking at the differences, we can see that they meet the specific needs of our projects differently.

Both Apache Hive and Apache Pig are popularly used in the management and analysis of big data, while SQL serves as the traditional database management option for smaller datasets. Though SQL is old, advanced tools still cannot replace it. There is a slight tendency among big businesses looking for object-oriented programming to adopt Apache Hive and Apache Pig over SQL. However, smaller projects will still need SQL.

Bottom Line

Despite their extensively advanced features, Pig and Hive are still growing and developing to meet challenging requirements. Hence, when we compare Hive vs Pig vs SQL, there is a distinct sense that Hive and Pig are winning the Big Data game, but SQL is still here to stay.

Apache Pig - Grunt Shell


After invoking the Grunt shell, you can run your Pig scripts in the shell. In addition to that, there are
certain useful shell and utility commands provided by the Grunt shell. This chapter explains the shell
and utility commands provided by the Grunt shell.

Note − In some portions of this chapter, commands like Load and Store are used. Refer to the respective chapters for detailed information on them.

Shell Commands

The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. Prior to that, we can invoke
any shell commands using sh and fs.

sh Command

Using the sh command, we can invoke shell commands from the Grunt shell. However, using sh from the Grunt shell, we cannot execute commands that are part of the shell environment (e.g. cd).

Syntax

Given below is the syntax of sh command.

grunt> sh shell command parameters

Example

We can invoke the ls command of Linux shell from the Grunt shell using the sh option as shown
below. In this example, it lists out the files in the /pig/bin/ directory.

grunt> sh ls

pig

pig_1444799121955.log

pig.cmd

pig.py

fs Command

Using the fs command, we can invoke any FsShell commands from the Grunt shell.

Syntax

Given below is the syntax of the fs command.

grunt> fs File_system_command parameters

Example

We can invoke the ls command of HDFS from the Grunt shell using fs command. In the following
example, it lists the files in the HDFS root directory.

grunt> fs -ls
Found 3 items

drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase

drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen_data

drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter_data

In the same way, we can invoke all the other file system shell commands from the Grunt shell using
the fs command.

Utility Commands

The Grunt shell provides a set of utility commands. These include utility commands such as clear,
help, history, quit, and set; and commands such as exec, kill, and run to control Pig from the Grunt
shell. Given below is the description of the utility commands provided by the Grunt shell.

clear Command

The clear command is used to clear the screen of the Grunt shell.

Syntax

You can clear the screen of the grunt shell using the clear command as shown below.

grunt> clear

help Command

The help command gives you a list of Pig commands or Pig properties.

Usage

You can get a list of Pig commands using the help command as shown below.

grunt> help

Commands: <pig latin statement>; - See the PigLatin manual for details:

http://hadoop.apache.org/pig

File system commands:fs <fs arguments> - Equivalent to Hadoop dfs command:

http://hadoop.apache.org/common/docs/current/hdfs_shell.html

Diagnostic Commands:describe <alias>[::<alias>] - Show the schema for the alias.

Inner aliases can be described as A::B.

explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml]

[-param <param_name>=<param_value>]

[-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for the entire script.

-script - Explain the entire script.

-out - Store the output into directory rather than print to stdout.

-brief - Don't expand nested plans (presenting a smaller graph for overview).

-dot - Generate the output in .dot format. Default is text format.

-xml - Generate the output in .xml format. Default is text format.

-param <param_name - See parameter substitution for details.

-param_file <file_name> - See parameter substitution for details.

alias - Alias to explain.

dump <alias> - Compute the alias and writes the results to stdout.

Utility Commands: exec [-param <param_name>=param_value] [-param_file <file_name>] <script> -

Execute the script with access to grunt environment including aliases.

-param <param_name - See parameter substitution for details.

-param_file <file_name> - See parameter substitution for details.

script - Script to be executed.

run [-param <param_name>=param_value] [-param_file <file_name>] <script> -

Execute the script with access to grunt environment.

-param <param_name - See parameter substitution for details.

-param_file <file_name> - See parameter substitution for details.

script - Script to be executed.

sh <shell command> - Invoke a shell command.

kill <job_id> - Kill the hadoop job specified by the hadoop job id.

set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.

The following keys are supported:

default_parallel - Script-level reduce parallelism. Basic input size heuristics used

by default.

debug - Set debug on or off. Default is off.

job.name - Single-quoted name for jobs. Default is PigLatin:<script name>

job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal.

stream.skippath - String that contains the path. This is used by streaming.

any hadoop property.

help - Display this message.

history [-n] - Display the list statements in cache.

-n Hide line numbers.

quit - Quit the grunt shell.

history Command

This command displays a list of the statements executed/used so far since the Grunt shell was invoked.

Usage

Assume we have executed three statements since opening the Grunt shell.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');

Then, using the history command will produce the following output.

grunt> history

customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');

orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');

set Command

The set command is used to show/assign values to keys used in Pig.

Usage

Using this command, you can set values to the following keys.

Key - Description and values

default_parallel - You can set the number of reducers for a map job by passing any whole number as a value to this key.

debug - You can turn the debugging feature on or off by passing on/off to this key.

job.name - You can set a job name for the required job by passing a string value to this key.

job.priority - You can set the job priority for a job by passing one of the following values to this key: very_low, low, normal, high, very_high.

stream.skippath - For streaming, you can set the path from which the data is not to be transferred, by passing the desired path as a string to this key.
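For example, these keys can be set from the Grunt shell as shown below (the values are purely illustrative):

grunt> set default_parallel 10;

grunt> set debug on;

grunt> set job.name 'my-pig-job';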

quit Command

You can quit from the Grunt shell using this command.

Usage

Quit from the Grunt shell as shown below.

grunt> quit

Let us now take a look at the commands using which you can control Apache Pig from the Grunt
shell.

exec Command

Using the exec command, we can execute Pig scripts from the Grunt shell.

Syntax

Given below is the syntax of the utility command exec.

grunt> exec [–param param_name = param_value] [–param_file file_name] [script]

Example

Let us assume there is a file named student.txt in the /pig_data/ directory of HDFS with the following content.

Student.txt

001,Rajiv,Hyderabad

002,siddarth,Kolkata
003,Rajesh,Delhi

And, assume we have a script file named sample_script.pig in the /pig_data/ directory of HDFS with the following content.

Sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',')

as (id:int,name:chararray,city:chararray);

Dump student;

Now, let us execute the above script from the Grunt shell using the exec command as shown below.

grunt> exec /sample_script.pig

Output

The exec command executes the script in sample_script.pig. As directed in the script, it loads the student.txt file into Pig and gives you the result of the Dump operator, displaying the following content.

(1,Rajiv,Hyderabad)

(2,siddarth,Kolkata)

(3,Rajesh,Delhi)

kill Command

You can kill a job from the Grunt shell using this command.

Syntax

Given below is the syntax of the kill command.

grunt> kill JobId

Example

Suppose there is a running Pig job having id Id_0055, you can kill it from the Grunt shell using
the kill command, as shown below.

grunt> kill Id_0055

run Command

You can run a Pig script from the Grunt shell using the run command.

Syntax

Given below is the syntax of the run command.

grunt> run [–param param_name = param_value] [–param_file file_name] script

Example
Let us assume there is a file named student.txt in the /pig_data/ directory of HDFS with the
following content.

Student.txt

001,Rajiv,Hyderabad

002,siddarth,Kolkata

003,Rajesh,Delhi

And, assume we have a script file named sample_script.pig in the local filesystem with the following
content.

Sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING

PigStorage(',') as (id:int,name:chararray,city:chararray);

Now, let us run the above script from the Grunt shell using the run command as shown below.

grunt> run /sample_script.pig

You can see the output of the script using the Dump operator as shown below.

grunt> Dump;

(1,Rajiv,Hyderabad)

(2,siddarth,Kolkata)

(3,Rajesh,Delhi)

Pig Latin

Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.

Pig Latin Statements

Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as input and generates another relation as output.

o It can span multiple lines.

o Each statement must end with a semi-colon.

o It may include expression and schemas.

o By default, these statements are processed using multi-query execution.
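A minimal sketch of Pig Latin statements in the Grunt shell, illustrating a statement spanning multiple lines and ending with a semicolon (the file name and schema are hypothetical):

grunt> student = LOAD 'student.txt'
>>     USING PigStorage(',')
>>     AS (id:int, name:chararray, city:chararray);

grunt> filtered = FILTER student BY id > 1;

grunt> DUMP filtered;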

Pig Latin Conventions

Convention - Description

() - Parentheses can enclose one or more items. They can also be used to indicate the tuple data type.
Example - (10, xyz, (3,6,9))

[] - Square brackets can enclose one or more items. They can also be used to indicate the map data type.
Example - [INNER | OUTER]

{} - Curly brackets enclose two or more items. They can also be used to indicate the bag data type.
Example - { block | nested_block }

... - The horizontal ellipsis points indicate that you can repeat a portion of the code.
Example - cat path [path ...]

Pig Latin Data Types

Simple Data Types

Type - Description

int - Defines a signed 32-bit integer.
Example - 2

long - Defines a signed 64-bit integer.
Example - 2L or 2l

float - Defines a 32-bit floating point number.
Example - 2.5F or 2.5f or 2.5e2f

double - Defines a 64-bit floating point number.
Example - 2.5 or 2.5e2

chararray - Defines a character array in Unicode UTF-8 format.
Example - javatpoint

bytearray - Defines a byte array.

boolean - Defines boolean values.
Example - true/false

datetime - Defines values in datetime format.
Example - 1970-01-01T00:00:00.000+00:00

biginteger - Defines Java BigInteger values.
Example - 5000000000000

bigdecimal - Defines Java BigDecimal values.
Example - 52.232344535345

Complex Types

Type - Description

tuple - Defines an ordered set of fields.
Example - (15,12)

bag - Defines a collection of tuples.
Example - {(15,12), (12,15)}

map - Defines a set of key-value pairs.
Example - [open#apache]

Apache Pig - User Defined Functions

In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDF's). Using these UDF's, we can define our own functions and use them. UDF support is provided in six programming languages, namely Java, Jython, Python, JavaScript, Ruby, and Groovy.

For writing UDF’s, complete support is provided in Java and limited support is provided in all the
remaining languages. Using Java, you can write UDF’s involving all parts of the processing like data
load/store, column transformation, and aggregation. Since Apache Pig has been written in Java, the
UDF’s written using Java language work efficiently compared to other languages.

In Apache Pig, we also have a Java repository for UDF’s named Piggybank. Using Piggybank, we can
access Java UDF’s written by other users, and contribute our own UDF’s.

Types of UDF’s in Java

While writing UDF’s using Java, we can create and use the following three types of functions −

 Filter Functions − The filter functions are used as conditions in filter statements. These
functions accept a Pig value as input and return a Boolean value.

 Eval Functions − The Eval functions are used in FOREACH-GENERATE statements. These
functions accept a Pig value as input and return a Pig result.
 Algebraic Functions − The Algebraic functions act on inner bags in a FOREACH-GENERATE statement. These functions are used to perform full MapReduce operations on an inner bag.

Writing UDF’s using Java

To write a UDF using Java, we have to integrate the jar file Pig-0.15.0.jar. In this section, we discuss
how to write a sample UDF using Eclipse. Before proceeding further, make sure you have installed
Eclipse and Maven in your system.

Follow the steps given below to write a UDF function −

 Open Eclipse and create a new project (say myproject).

 Convert the newly created project into a Maven project.

 Copy the following content in the pom.xml. This file contains the Maven dependencies for
Apache Pig and Hadoop-core jar files.

<project xmlns = "http://maven.apache.org/POM/4.0.0"

xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation = "http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

<modelVersion>4.0.0</modelVersion>

<groupId>Pig_Udf</groupId>

<artifactId>Pig_Udf</artifactId>

<version>0.0.1-SNAPSHOT</version>

<build>

<sourceDirectory>src</sourceDirectory>

<plugins>

<plugin>

<artifactId>maven-compiler-plugin</artifactId>

<version>3.3</version>

<configuration>

<source>1.7</source>

<target>1.7</target>

</configuration>

</plugin>

</plugins>
</build>

<dependencies>

<dependency>

<groupId>org.apache.pig</groupId>

<artifactId>pig</artifactId>

<version>0.15.0</version>

</dependency>

<dependency>

<groupId>org.apache.hadoop</groupId>

<artifactId>hadoop-core</artifactId>

<version>0.20.2</version>

</dependency>

</dependencies>

</project>

 Save the file and refresh it. In the Maven Dependencies section, you can find the
downloaded jar files.

 Create a new class file with name Sample_Eval and copy the following content in it.

import java.io.IOException;

import org.apache.pig.EvalFunc;

import org.apache.pig.data.Tuple;


public class Sample_Eval extends EvalFunc<String>{


public String exec(Tuple input) throws IOException {

if (input == null || input.size() == 0)

return null;

String str = (String)input.get(0);

return str.toUpperCase();

   }
}

While writing UDF's, it is mandatory to inherit the EvalFunc class and provide an implementation of the exec() function. The code required for the UDF is written within this function. In the above example, we have written code to convert the contents of the given column to uppercase.

 After compiling the class without errors, right-click on the Sample_Eval.java file. It gives you a menu. Select Export.

 On clicking Export, you will get a window. Click on JAR file.

 Proceed further by clicking the Next> button. You will get another window where you need to enter the path in the local file system where you want to store the jar file.

 Finally, click the Finish button. In the specified folder, a Jar file sample_udf.jar is created. This jar file contains the UDF written in Java.
Using the UDF

After writing the UDF and generating the Jar file, follow the steps given below −

Step 1: Registering the Jar file

After writing the UDF (in Java), we have to register the Jar file that contains the UDF using the Register operator. By registering the Jar file, users can indicate the location of the UDF to Apache Pig.

Syntax

Given below is the syntax of the Register operator.

REGISTER path;

Example

As an example let us register the sample_udf.jar created earlier in this chapter.

Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.

$cd PIG_HOME/bin

$./pig -x local

REGISTER '/$PIG_HOME/sample_udf.jar'

Note − we assume the Jar file is in the path /$PIG_HOME/sample_udf.jar.

Step 2: Defining Alias

After registering the UDF we can define an alias to it using the Define operator.

Syntax

Given below is the syntax of the Define operator.

DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };

Example

Define the alias for sample_eval as shown below.

DEFINE sample_eval sample_eval();

Step 3: Using the UDF

After defining the alias you can use the UDF same as the built-in functions. Suppose there is a file
named emp_data in the HDFS /Pig_Data/ directory with the following content.

001,Robin,22,newyork

002,BOB,23,Kolkata

003,Maya,23,Tokyo

004,Sara,25,London

005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai

007,Robert,22,newyork

008,Syam,23,Kolkata

009,Mary,25,Tokyo

010,Saran,25,London

011,Stacy,25,Bhuwaneshwar

012,Kelly,22,Chennai

And assume we have loaded this file into Pig as shown below.

grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')

as (id:int, name:chararray, age:int, city:chararray);

Let us now convert the names of the employees in to upper case using the UDF sample_eval.

grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);

Verify the contents of the relation Upper_case as shown below.

grunt> Dump Upper_case;

(ROBIN)

(BOB)

(MAYA)

(SARA)

(DAVID)

(MAGGY)

(ROBERT)

(SYAM)

(MARY)

(SARAN)

(STACY)

(KELLY)

Data Processing Operators :

The Apache Pig operators form a high-level procedural language for querying large data sets using Hadoop and the MapReduce platform.

A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.

These operators are the main tools Pig Latin provides to operate on the data.

They allow you to transform data by sorting, grouping, joining, projecting, and filtering.

The Apache Pig operators can be classified as :

Relational Operators :

Relational operators are the main tools Pig Latin provides to operate on the data.

Some of the Relational Operators are :

LOAD: The LOAD operator is used to load data from the file system or HDFS storage into a Pig relation.

FOREACH: This operator generates data transformations based on columns of data. It is used to add or remove fields from a relation.

FILTER: This operator selects tuples from a relation based on a condition.

JOIN: The JOIN operator is used to perform an inner equijoin of two or more relations based on common field values.

ORDER BY: Order By is used to sort a relation based on one or more fields, in either ascending or descending order, using the ASC and DESC keywords.

GROUP: The GROUP operator groups together the tuples with the same group key (key field).

COGROUP: COGROUP is the same as the GROUP operator. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved.
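A minimal sketch that chains several of these relational operators together (the file names, schemas, and salary threshold are hypothetical):

grunt> emp = LOAD 'emp.txt' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:int);

grunt> depts = LOAD 'dept.txt' USING PigStorage(',') AS (dept:chararray, location:chararray);

grunt> high_paid = FILTER emp BY salary > 50000;

grunt> joined = JOIN high_paid BY dept, depts BY dept;

grunt> grouped = GROUP high_paid BY dept;

grunt> projected = FOREACH high_paid GENERATE name, salary;

grunt> sorted = ORDER projected BY salary DESC;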

Diagnostic Operator :

The load statement will simply load the data into the specified relation in Apache Pig.

To verify the execution of the Load statement, you have to use the Diagnostic Operators.

Some Diagnostic Operators are :

DUMP: The DUMP operator is used to run Pig Latin statements and display the results on the screen.

DESCRIBE: Use the DESCRIBE operator to review the schema of a particular relation. The DESCRIBE
operator is best used for debugging a script.

ILLUSTRATE: ILLUSTRATE operator is used to review how data is transformed through a sequence of
Pig Latin statements. ILLUSTRATE command is your best friend when it comes to debugging a script.

EXPLAIN: The EXPLAIN operator is used to display the logical, physical, and MapReduce execution
plans of a relation.
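Continuing the hypothetical emp relation from the sketch above, the diagnostic operators are invoked like this:

grunt> DESCRIBE emp;

grunt> ILLUSTRATE emp;

grunt> EXPLAIN emp;

grunt> DUMP emp;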

What is HIVE

Hive is a data warehouse system which is used to analyze structured data. It is built on the top of
Hadoop. It was developed by Facebook.

Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
Using Hive, we can skip the requirement of the traditional approach of writing complex MapReduce
programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and
User Defined Functions (UDF).

Features of Hive

These are the following features of Hive:

o Hive is fast and scalable.

o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark
jobs.

o It is capable of analyzing large datasets stored in HDFS.

o It allows different storage types such as plain text, RCFile, and HBase.

o It uses indexing to accelerate queries.

o It can operate on compressed data stored in the Hadoop ecosystem.

o It supports user-defined functions (UDFs) where user can provide its functionality.

Limitations of Hive

o Hive is not capable of handling real-time data.

o It is not designed for online transaction processing.

o Hive queries contain high latency.

Differences between Hive and Pig

 Hive is commonly used by data analysts, whereas Pig is commonly used by programmers.

 Hive follows SQL-like queries, whereas Pig follows a data-flow language.

 Hive can handle structured data, whereas Pig can handle semi-structured data.

 Hive works on the server side of an HDFS cluster, whereas Pig works on the client side.

 Hive is slower than Pig; Pig is comparatively faster than Hive.

HIVE Shell

$HIVE_HOME/bin/hive is a shell utility which can be used to run Hive queries in either interactive or
batch mode. HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline, which is a JDBC
client based on SQLLine.
Hive Command Line Options

To get help, run "hive -H" or "hive --help". Usage (as of Hive 0.9.0):

usage: hive

Option - Explanation

-d,--define <key=value> - Variable substitution to apply to Hive commands, e.g. -d A=B or --define A=B

-e <quoted-query-string> - SQL from the command line

-f <filename> - SQL from files

-H,--help - Print help information

-h <hostname> - Connect to the Hive Server on a remote host

--hiveconf <property=value> - Use the given value for the given property

--hivevar <key=value> - Variable substitution to apply to Hive commands, e.g. --hivevar A=B

-i <filename> - Initialization SQL file

-p <port> - Connect to the Hive Server on the given port number

-S,--silent - Silent mode in interactive shell

-v,--verbose - Verbose mode (echo executed SQL to the console)

Examples

 Example of running a query from the command line

$HIVE_HOME/bin/hive -e 'select a.foo from pokes a'

 Example of setting Hive configuration variables

$HIVE_HOME/bin/hive -e 'select a.foo from pokes a' --hiveconf hive.exec.scratchdir=/opt/my/hive_scratch --hiveconf mapred.reduce.tasks=1

 Example of dumping data out from a query into a file using silent mode

$HIVE_HOME/bin/hive -S -e 'select a.foo from pokes a' > a.txt

 Example of running a script non-interactively from local disk

$HIVE_HOME/bin/hive -f /home/my/hive-script.sql

 Example of running a script non-interactively from a Hadoop supported filesystem (starting in Hive 0.14)

$HIVE_HOME/bin/hive -f hdfs://<namenode>:<port>/hive-script.sql

Hive CLI is a legacy tool which had two main use cases. The first is that it served as a thick client for
SQL on Hadoop and the second is that it served as a command line tool for Hive Server (the original
Hive server, now often referred to as “HiveServer1”). Hive Server has been deprecated and removed
from the Hive code base as of Hive 1.0.0 and replaced with HiveServer2, so the second use case no
longer applies. For the first use case, Beeline provides or is supposed to provide equal functionality,
yet is implemented differently from Hive CLI.

Command - Description

quit, exit - Use quit or exit to leave the interactive shell.

reset - Resets the configuration to the default values (as of Hive 0.10).

set <key>=<value> - Sets the value of a particular configuration variable (key). If you misspell the variable name, the CLI will not show an error.

set - Prints a list of configuration variables that are overridden by the user or Hive.

set -v - Prints all Hadoop and Hive configuration variables.

add FILE[S] <filepath> <filepath>*, add JAR[S] <filepath> <filepath>*, add ARCHIVE[S] <filepath> <filepath>* - Adds one or more files, jars, or archives to the list of resources in the distributed cache.

add FILE[S] <ivyurl> <ivyurl>*, add JAR[S] <ivyurl> <ivyurl>*, add ARCHIVE[S] <ivyurl> <ivyurl>* - As of Hive 1.2.0, adds one or more files, jars, or archives to the list of resources in the distributed cache using an Ivy URL of the form ivy://group:module:version?query_string.

list FILE[S], list JAR[S], list ARCHIVE[S] - Lists the resources already added to the distributed cache.

list FILE[S] <filepath>*, list JAR[S] <filepath>*, list ARCHIVE[S] <filepath>* - Checks whether the given resources are already added to the distributed cache or not.

delete FILE[S] <filepath>*, delete JAR[S] <filepath>*, delete ARCHIVE[S] <filepath>* - Removes the resource(s) from the distributed cache.

delete FILE[S] <ivyurl> <ivyurl>*, delete JAR[S] <ivyurl> <ivyurl>*, delete ARCHIVE[S] <ivyurl> <ivyurl>* - As of Hive 1.2.0, removes the resource(s) which were added using the <ivyurl> from the distributed cache.

! <command> - Executes a shell command from the Hive shell.

dfs <dfs command> - Executes a dfs command from the Hive shell.

<query string> - Executes a Hive query and prints results to standard output.

source <filepath> - Executes a script file inside the CLI.

Ideally, Hive CLI should be deprecated as the Hive community has long recommended using the
Beeline plus HiveServer2 configuration; however, because of the wide use of Hive CLI, we instead
are replacing Hive CLI’s implementation with a new Hive CLI on top of Beeline plus embedded
HiveServer2 so that the Hive community only needs to maintain a single code path. In this way, the
new Hive CLI is just an alias to Beeline at both the shell script level and the high code level. The goal
is that no or minimal changes are required from existing user scripts using Hive CLI.

The hiverc File

The CLI when invoked without the -i option will attempt to load $HIVE_HOME/bin/.hiverc and
$HOME/.hiverc as initialization files.

Hive Batch Mode Commands

When $HIVE_HOME/bin/hive is run with the -e or -f option, it executes SQL commands in batch
mode.

hive -e ‘<query-string>’ executes the query string.

hive -f <filepath> executes one or more SQL queries from a file.

Hive Interactive Shell Commands

When $HIVE_HOME/bin/hive is run without either the -e or -f option, it enters interactive shell
mode. Use ";" (semicolon) to terminate commands. Comments in scripts can be specified using the "--" prefix.

Example

hive> set mapred.reduce.tasks=32;

hive> set;

hive> select a.* from tab1;

hive> !ls;

hive> dfs -ls;

Beeline – New Command Line Shell

HiveServer2 supports a new command shell Beeline that works with HiveServer2. It’s a JDBC client
that is based on the SQLLine CLI. The Beeline shell works in both embedded mode as well as remote
mode. In the embedded mode, it runs an embedded Hive (similar to Hive CLI) whereas remote mode
is for connecting to a separate HiveServer2 process over Thrift. Starting in Hive 0.14, when Beeline is
used with HiveServer2, it also prints the log messages from HiveServer2 for queries it executes to
STDERR.

Beeline Command Options

Option - Description

-u <database URL> - The JDBC URL to connect to. Usage: beeline -u db_URL

-n <username> - The username to connect as. Usage: beeline -n valid_user

-p <password> - The password to connect with. Usage: beeline -p valid_password

-d <driver class> - The driver class to use. Usage: beeline -d driver_class

-e <query> - Query that should be executed. Double or single quotes enclose the query string. This option can be specified multiple times. Usage: beeline -e "query_string"

-f <file> - Script file that should be executed. Usage: beeline -f filepath

--hiveconf property=value - Use the given value for the given configuration property. Properties that are listed in hive.conf.restricted.list cannot be reset with hiveconf. Usage: beeline --hiveconf prop1=value1

--hivevar name=value - Hive variable name and value. This is a Hive-specific setting in which variables can be set at the session level and referenced in Hive commands or queries. Usage: beeline --hivevar var1=value1
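For instance, a typical remote connection looks roughly like the following (the host, port, database, and credentials are hypothetical):

$ beeline -u jdbc:hive2://localhost:10000/default -n hiveuser -p hivepass

0: jdbc:hive2://localhost:10000/default> show tables;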

Hive Services

The following are the services provided by Hive:-

o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive
queries and commands.

o Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a
web-based GUI for executing Hive queries and commands.

o Hive MetaStore - It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes column and type metadata, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.

o Hive Server - It is referred to as Apache Thrift Server. It accepts the request from different
clients and provides it to Hive Driver.

o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.

o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.

Introduction to Hive metastore


Hive metastore (HMS) is a service that stores metadata related to Apache Hive and
other services, in a backend RDBMS, such as MySQL or PostgreSQL. Impala, Spark,
Hive, and other services share the metastore. The connections to and from HMS
include HiveServer, Ranger, and the NameNode that represents HDFS.

Beeline, Hue, JDBC, and Impala shell clients make requests through thrift or JDBC to
HiveServer. The HiveServer instance reads/writes data to HMS. By default,
redundant HMS operate in active/active mode. The physical data resides in a
backend RDBMS, one for HMS. You must configure all HMS instances to use the
same backend database. A separate RDBMS supports the security service, Ranger
for example. All connections are routed to a single RDBMS service at any given time.
HMS talks to the NameNode over thrift and functions as a client to HDFS.

HMS connects directly to Ranger and the NameNode (HDFS), and so does HiveServer. One or more HMS instances on the backend can talk to other services, such as Ranger.

Comparison with Hive and Traditional database


Hive:

 Schema on READ – it does not verify the schema while the data is loaded.

 Very easily scalable at low cost.

 Based on the Hadoop notion of write once, read many times.

 Record-level updates are not possible in Hive.

 OLTP (On-line Transaction Processing) is not yet supported in Hive, but OLAP (On-line Analytical Processing) is supported.

Traditional database:

 Schema on WRITE – the table schema is enforced at data load time; if the data being loaded does not conform to the schema, it is rejected.

 Not as scalable; scaling up is costly.

 We can read and write many times.

 Record-level updates, insertions, deletes, transactions, and indexes are possible.

 Both OLTP (On-line Transaction Processing) and OLAP (On-line Analytical Processing) are supported in an RDBMS.

What is Hive Query Language (HiveQL)?

Hive Query Language (HiveQL) is a query language in Apache Hive for processing and analyzing
structured data. It separates users from the complexity of Map Reduce programming. It reuses
common concepts from relational databases, such as tables, rows, columns, and schema, to ease
learning. Hive provides a CLI for Hive query writing using Hive Query Language (HiveQL).

Most interactions tend to take place over a command line interface (CLI). Generally, HiveQL syntax is
similar to the SQL syntax that most data analysts are familiar with. Hive supports four file formats
which are: TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).

Hive uses the Derby database for single-user metadata storage; for multi-user or shared metadata, Hive uses MySQL.
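As an illustration of these file formats, the storage format can be chosen when a table is created; a short sketch with hypothetical table names:

CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;

CREATE TABLE logs_orc (line STRING) STORED AS ORC;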

HiveQL Built-in Operators

Hive provides Built-in operators for Data operations to be implemented on the tables present inside
Hive warehouse.

These operators are used for mathematical operations on operands, and it will return specific value
as per the logic applied.

Below are the main types of Built-in Operators in HiveQL:

 Relational Operators

 Arithmetic Operators

 Logical Operators

 Operators on Complex types

 Complex type Constructors


Relational Operators in Hive SQL

We use Relational operators for relationship comparisons between two operands.

 Operators such as equals, Not equals, less than, greater than …etc

 The operand types are all number types in these Operators.

The following table gives details of the relational operators and their usage in HiveQL:

X = Y - TRUE if expression X is equivalent to expression Y, otherwise FALSE. (Takes all primitive types)

X != Y - TRUE if expression X is not equivalent to expression Y, otherwise FALSE. (Takes all primitive types)

X < Y - TRUE if expression X is less than expression Y, otherwise FALSE. (Takes all primitive types)

X <= Y - TRUE if expression X is less than or equal to expression Y, otherwise FALSE. (Takes all primitive types)

X > Y - TRUE if expression X is greater than expression Y, otherwise FALSE. (Takes all primitive types)

X >= Y - TRUE if expression X is greater than or equal to expression Y, otherwise FALSE. (Takes all primitive types)

X IS NULL - TRUE if expression X evaluates to NULL, otherwise FALSE. (Takes all types)

X IS NOT NULL - FALSE if expression X evaluates to NULL, otherwise TRUE. (Takes all types)

X LIKE Y - TRUE if string pattern X matches Y, otherwise FALSE. (Takes only strings)

X RLIKE Y - NULL if X or Y is NULL; TRUE if any substring of X matches the Java regular expression Y, otherwise FALSE. (Takes only strings)

X REGEXP Y - Same as RLIKE. (Takes only strings)
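A short sketch of these operators in a WHERE clause (the employees table and its columns are hypothetical):

SELECT name, salary
FROM employees
WHERE salary >= 50000 AND department IS NOT NULL AND name LIKE 'A%';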

HiveQL Arithmetic Operators

We use Arithmetic operators for performing arithmetic operations on operands

 We use these operators for arithmetic operations such as addition, subtraction, multiplication, and division between operands.

 The operand types are all number types for these operators.

Sample Example:

2 + 3 gives the result 5.

In this example, '+' is the operator, and 2 and 3 are the operands. The return value is 5.

The following Table will give us details about Arithmetic operators in Hive Query Language:

Built-in Operator - Description (Operands)

X + Y - Returns the result of adding X and Y. (Takes all number types)

X - Y - Returns the result of subtracting Y from X. (Takes all number types)

X * Y - Returns the result of multiplying X and Y. (Takes all number types)

X / Y - Returns the result of dividing X by Y. (Takes all number types)

X % Y - Returns the remainder resulting from dividing X by Y. (Takes all number types)

X & Y - Returns the result of bitwise AND of X and Y. (Takes all number types)

X | Y - Returns the result of bitwise OR of X and Y. (Takes all number types)

X ^ Y - Returns the result of bitwise XOR of X and Y. (Takes all number types)

~X - Returns the result of bitwise NOT of X. (Takes all number types)
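A short sketch of arithmetic operators in a SELECT (the table and columns are hypothetical):

SELECT salary + bonus AS total_pay,
       salary * 0.10 AS tax,
       salary % 1000 AS remainder
FROM employees;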

Hive QL Logical Operators

We use Logical operators for performing Logical operations on operands

 Logical operations such as AND, OR, NOT between operands we use these Operators.

 The operand types all are BOOLEAN type in these Operators

The following Table will give us details about Logical operators in HiveSQL:

Operators - Description (Operands)

X AND Y - TRUE if both X and Y are TRUE, otherwise FALSE. (Boolean types only)

X && Y - Same as X AND Y, but using the && symbol. (Boolean types only)

X OR Y - TRUE if either X or Y or both are TRUE, otherwise FALSE. (Boolean types only)

X || Y - Same as X OR Y, but using the || symbol. (Boolean types only)

NOT X - TRUE if X is FALSE, otherwise FALSE. (Boolean types only)

!X - Same as NOT X, but using the ! symbol. (Boolean types only)

Operators on Complex Types

The following Table will give us details about Complex Type Operators . These are operators which
will provide a different mechanism to access elements in complex types.

Operator - Operands - Description

A[n] - A is an array and n is an int - Returns the nth element of array A. The first element has index 0.

M[key] - M is a Map<K, V> and key has type K - Returns the value belonging to the key in the map.
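A small sketch of accessing complex-typed columns (the students table is hypothetical; subjects is an array column and grades is a map column):

SELECT subjects[0], grades['maths']
FROM students;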

Complex Type Constructors

The following Table will give us details about Complex type Constructors. It will construct instances
on complex data types. These are of complex data types such as Array, Map and Struct types in Hive.

In this section, we are going to see the operations performed on Complex type Constructors.

Operators - Operands - Description

array - (val1, val2, ...) - Creates an array with the given elements val1, val2, ...

create_union - (tag, val1, val2, ...) - Creates a union type, with the value indicated by the tag parameter.

map - (key1, value1, key2, value2, ...) - Creates a map with the key/value pairs given in the operands.

named_struct - (name1, val1, name2, val2, ...) - Creates a struct with the field names and values given in the operands.

struct - (val1, val2, val3, ...) - Creates a struct with the given field values. Struct field names will be col1, col2, ...
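A small sketch of these constructors in a query, assuming a Hive version (0.13 or later) that allows SELECT without a FROM clause; all values are illustrative:

SELECT array(1, 2, 3),
       map('a', 10, 'b', 20),
       named_struct('name', 'asha', 'age', 25),
       struct(1, 'x');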

Summary

 Hive Query Language (HiveQL) is a query language in Apache Hive for processing and
analyzing structured data.

 Hive provides Built-in operators for Data operations to be implemented on the tables
present inside Hive warehouse.

 Types of Built-in Operators in HiveQL are:


 Relational Operators

 Arithmetic Operators

 Logical Operators

 Operators on Complex types

 Complex type Constructors

Types of Tables in Apache Hive

Here are the types of tables in Apache Hive:

Managed Tables

In a managed table, both the table data and the table schema are managed by Hive. The data will be
located in a folder named after the table within the Hive data warehouse, which is essentially just a
file location in HDFS.

The location is user-configurable when Hive is installed. By managed or controlled we mean that if you drop (delete) a managed table, then Hive will delete both the schema (the description of the table) and the data files associated with the table. (The default location is /user/hive/warehouse.)

Syntax to Create Managed Table

CREATE TABLE IF NOT EXISTS stocks (exchange STRING,

symbol STRING,

price_open FLOAT,

price_high FLOAT,

price_low FLOAT,

price_adj_close FLOAT)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

As for managed tables, you can also copy the schema (but not the data) of an existing table:

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3

LIKE mydb.employees

LOCATION '/path/to/data';

External Tables
An external table is one where only the table schema is controlled by Hive. In most cases, the user
will set up the folder location within HDFS and copy the data file(s) there. This location is included as
part of the table definition statement. When an external table is deleted, Hive will only delete the
schema associated with the table. The data files are not affected.

 
Syntax to Create External Table

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (exchange STRING,

symbol STRING,

price_open FLOAT,

price_high FLOAT,

price_low FLOAT,

price_adj_close FLOAT)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

LOCATION '/data/stocks';

Managed vs. External Table – What’s the Difference?

Managed Table: Hive assumes that it owns the data for managed tables.
External Table: For external tables, Hive assumes that it does not manage the data.

Managed Table: If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted.
External Table: Dropping the table does not delete the data, although the metadata for the table is deleted.

Managed Table: Hive stores the data in its warehouse directory.
External Table: Hive stores the data in the LOCATION specified during creation of the table (generally not in the warehouse directory).

Managed Table: Provides ACID/transactional support.
External Table: Does not provide ACID/transactional support.

Managed Table: Statements such as ARCHIVE, UNARCHIVE, TRUNCATE, MERGE and CONCATENATE are supported.
External Table: These statements are not supported.

Managed Table: Query result caching is supported (the results of an executed Hive query are saved for reuse).
External Table: Query result caching is not supported.
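To make the difference concrete, the following sketch (reusing the stocks tables defined above) shows what dropping each kind of table does; the comments describe the expected effect rather than actual Hive output:

-- If stocks was created as a managed table:
DROP TABLE stocks;   -- removes the schema AND the data files under /user/hive/warehouse/stocks

-- If stocks was created as an EXTERNAL table with LOCATION '/data/stocks':
DROP TABLE stocks;   -- removes only the schema; the files in /data/stocks remain in HDFS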

Identify the Type of Apache Hive Table

You can tell whether a table is managed or external from the output of DESCRIBE EXTENDED
tablename.

Near the end of the Detailed Table Information output, you will see the following
for managed tables:

... tableType: MANAGED_TABLE)


For external tables, you will see the following:

... tableType: EXTERNAL_TABLE)
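For example, for the stocks tables defined earlier, either of the following can be run; DESCRIBE FORMATTED prints the same details in a more readable layout, including a Table Type field:

DESCRIBE EXTENDED stocks;
DESCRIBE FORMATTED stocks;   -- look for Table Type: MANAGED_TABLE or EXTERNAL_TABLE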

Note: If you omit the EXTERNAL keyword and the original table is external, the new table will also be
external. If you omit EXTERNAL and the original table is managed, the new table will also be
managed. However, if you include the EXTERNAL keyword and the original table is managed, the
new table will be external. Even in this scenario, the LOCATION clause will still be optional.

Hive Queries: Order By, Group By, Distribute By, Cluster By Examples

Hive provides SQL type querying language for the ETL purpose on top of Hadoop file system.

Hive Query language (HiveQL) provides SQL type environment in Hive to work with tables,
databases, queries.

We can use different types of clauses with Hive to perform different kinds of data manipulation and
querying. For better connectivity with applications outside the environment, Hive also provides
JDBC connectivity.

Hive queries provides the following features:

 Data modeling such as Creation of databases, tables, etc.

 ETL functionalities such as Extraction, Transformation, and Loading data into tables

 Joins to merge different data tables

 User specific custom scripts for ease of code

 Faster querying tool on top of Hadoop

Creating Table in Hive

Before moving to the main topic of this tutorial, we will first create a table to use as a reference in
the examples that follow.

Here we are going to create a table "employees_guru" with 6 columns.

The steps are as follows:

1. We create the table "employees_guru" with 6 columns: Id, Name, Age, Address, Salary and
Department, which belong to the employees of the organization "guru."

2. We then load data into the employees_guru table. The data to be loaded is placed in the
Employees.txt file.
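Since the original screenshots are not reproduced here, the statements would look roughly like the following sketch (the column types are assumptions; the tutorial only names the six columns):

CREATE TABLE IF NOT EXISTS employees_guru
  (Id INT, Name STRING, Age INT, Address STRING, Salary FLOAT, Department STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH 'Employees.txt' INTO TABLE employees_guru;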

Order by query:

The ORDER BY syntax in HiveQL is similar to the ORDER BY syntax in the SQL language.

ORDER BY is the clause we use with the SELECT statement in Hive queries to sort data. The clause
sorts the result set on the column values named after ORDER BY, displaying the results in ascending
or descending order of the specified column.

If the ORDER BY field is a string, the results are displayed in lexicographical order. At the back end,
all the data has to be passed through a single reducer, which is what guarantees the total ordering.

In the query below:

1. The query is performed on the "employees_guru" table with the ORDER BY clause, using
Department as the ORDER BY column. "Department" is a string, so the results are displayed in
lexicographical order.

2. The output therefore shows the rows ordered by the Department column values, such as ADMIN,
Finance and so on.

Query :

SELECT * FROM employees_guru ORDER BY Department;

Group by query:

The GROUP BY clause groups the result set on the column values named after GROUP BY. For
whichever column we specify in the GROUP BY clause, the query selects and displays results
grouped by that column's values.

For example, the query below displays the total count of employees present in each department,
using "Department" as the GROUP BY column.

From the query and its output we can observe the following:

1. The query is performed on the "employees_guru" table with the GROUP BY clause, using
Department as the GROUP BY column.
2. The output shows the department name and the employee count per department: all employees
belonging to a specific department are grouped together, so the result is each department name
with the total number of employees present in that department.

Query:

SELECT Department, count(*) FROM employees_guru GROUP BY Department;

Sort by:

The SORT BY clause sorts the query output on the named columns of the Hive table. We can mention
DESC to sort in descending order and ASC to sort in ascending order.

SORT BY sorts the rows before feeding them to a reducer, so the data is sorted per reducer rather
than globally (unlike ORDER BY). The sort order always depends on the column type: if the column
type is numeric it sorts in numeric order, and if the column type is string it sorts in lexicographical
order.

In the query below:

1. The query is performed on the "employees_guru" table with the SORT BY clause, using "Id" as the
SORT BY column and the keyword DESC.

2. The output is therefore displayed in descending order of "Id".

Query:

SELECT * from employees_guru SORT BY Id DESC;

Cluster By:

CLUSTER BY is used as an alternative to specifying both DISTRIBUTE BY and SORT BY on the same
columns in HiveQL.

The CLUSTER BY clause is used on tables present in Hive. Hive uses the columns in CLUSTER BY to
distribute the rows among reducers, and the values are then sorted within each of those reducers.

For example, when the CLUSTER BY clause is applied to the Id column of the employees_guru table,
the query results are produced by multiple reducers at the back end; at the front end it is simply an
alternative to writing both SORT BY and DISTRIBUTE BY.

This is what actually happens at the back end, in terms of the MapReduce framework, when we
perform a query with SORT BY, DISTRIBUTE BY or CLUSTER BY. So when we want the results to be
distributed across multiple reducers and sorted within each, we go with CLUSTER BY.

In the query below:

1. The query performs the CLUSTER BY clause on the Id field, so the rows are sorted on Id within
each reducer.

2. It displays the Id and Name columns of employees_guru, sort ordered by Id.

Query:

SELECT Id, Name from employees_guru CLUSTER BY Id;

Distribute By:

The DISTRIBUTE BY clause is used on tables present in Hive. Hive uses the columns in DISTRIBUTE BY
to distribute the rows among reducers. All rows with the same DISTRIBUTE BY column values go to
the same reducer.

 It ensures that each of the N reducers gets non-overlapping sets of the column's values.

 It doesn't sort the output of each reducer.

In the query below:

1. The DISTRIBUTE BY clause is performed on the Id column of the "employees_guru" table.

2. The output shows Id and Name; at the back end, rows with the same Id go to the same reducer.

Query:

SELECT Id, Name from employees_guru DISTRIBUTE BY Id;
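For reference, CLUSTER BY on a column behaves like DISTRIBUTE BY followed by SORT BY on the same column, so the Cluster By query shown earlier could also be written as:

SELECT Id, Name FROM employees_guru DISTRIBUTE BY Id SORT BY Id;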

UDFs (User Defined Functions):

In Hive, users can define their own functions to meet certain client requirements. These are known as
UDFs in Hive. User Defined Functions are written in Java for specific modules.

Some UDFs are specifically designed for the reusability of code in application frameworks. The
developer writes these functions in Java and integrates the UDFs with Hive.

During the Query execution, the developer can directly use the code, and UDFs will return outputs
according to the user-defined tasks. It will provide high performance in terms of coding and
execution.

For example, for string stemming we don’t have any predefined function in Hive. For this, we can
write stem UDF in Java. Wherever we require Stem functionality, we can directly call this Stem UDF
in Hive.

Here stem functionality means deriving words from its root words. It is like stemming algorithm
reduces the words “wishing”, “wished”, and “wishes” to the root word “wish.” For performing this
type of functionality, we can write UDF in Java and integrate it with Hive.

Depending on the use cases, the UDFs can be written. It will accept and produce different numbers
of input and output values.

The general type of UDF will accept a single input value and produce a single output value. If the UDF
is used in the query, then UDF will be called once for each row in the result data set.

In the other way, it can accept a group of values as input and return a single output value as well.
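As a sketch of how such a UDF is wired into a session (the jar path, function name and class name below are hypothetical), the compiled Java class is registered and then called like any built-in function:

ADD JAR /path/to/stem-udf.jar;                                  -- jar containing the compiled Java UDF (hypothetical path)
CREATE TEMPORARY FUNCTION stem AS 'com.example.hive.udf.Stem';  -- hypothetical fully qualified class name
SELECT stem(word) FROM words;                                   -- invoked once per row: one input value, one output value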

HBase - Overview

Since 1970, the RDBMS has been the solution for data storage and maintenance related problems.
After the advent of big data, companies realized the benefit of processing big data and started
opting for solutions like Hadoop.

Hadoop uses a distributed file system for storing big data, and MapReduce to process it. Hadoop
excels at storing and processing huge volumes of data in various formats, whether arbitrary, semi-
structured, or unstructured.

Limitations of Hadoop

Hadoop can perform only batch processing, and data will be accessed only in a sequential manner.
That means one has to search the entire dataset even for the simplest of jobs.

A huge dataset when processed results in another huge data set, which should also be processed
sequentially. At this point, a new solution is needed to access any point of data in a single unit of
time (random access).

Hadoop Random Access Databases


Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases
that store huge amounts of data and access the data in a random manner.

What is HBase?

HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.

HBase is a data model similar to Google's Bigtable, designed to provide quick random access to
huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop
Distributed File System (HDFS).

It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the
Hadoop File System.

One can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the
data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read
and write access.

HBase and HDFS

HDFS: HDFS is a distributed file system suitable for storing large files.
HBase: HBase is a database built on top of HDFS.

HDFS: HDFS does not support fast individual record lookups.
HBase: HBase provides fast lookups for larger tables.

HDFS: It provides high-latency batch processing; there is no concept of random reads and writes.
HBase: It provides low-latency access to single rows from billions of records (random access).

HDFS: It provides only sequential access to data.
HBase: HBase internally uses hash tables and provides random access; it stores the data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase

HBase is a column-oriented database, and the tables in it are sorted by row. The table schema
defines only column families, which are the key-value pairs. A table can have multiple column
families, and each column family can have any number of columns. Subsequent column values are
stored contiguously on disk. Each cell value of the table has a timestamp. In short, in an HBase:

 Table is a collection of rows.

 Row is a collection of column families.

 Column family is a collection of columns.

 Column is a collection of key value pairs.

Given below is an example schema of table in HBase.

Rowid Column Family Column Family Column Family Column Family

col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3

Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as sections of columns of data, rather
than as rows of data. In short, they have column families.

Row-Oriented Database: It is suitable for Online Transaction Processing (OLTP).
Column-Oriented Database: It is suitable for Online Analytical Processing (OLAP).

Row-Oriented Database: Such databases are designed for a small number of rows and columns.
Column-Oriented Database: Column-oriented databases are designed for huge tables.


HBase and RDBMS

HBase: HBase is schema-less; it does not have the concept of a fixed-column schema and defines only column families.
RDBMS: An RDBMS is governed by its schema, which describes the whole structure of the tables.

HBase: It is built for wide tables and is horizontally scalable.
RDBMS: It is thin and built for small tables. It is hard to scale.

HBase: There are no transactions in HBase.
RDBMS: An RDBMS is transactional.

HBase: It has de-normalized data.
RDBMS: It has normalized data.

HBase: It is good for semi-structured as well as structured data.
RDBMS: It is good for structured data.

Features of HBase

 HBase is linearly scalable.

 It has automatic failure support.

 It provides consistent reads and writes.

 It integrates with Hadoop, both as a source and a destination.

 It has an easy Java API for clients.

 It provides data replication across clusters.

Where to Use HBase

 Apache HBase is used to have random, real-time read/write access to Big Data.

 It hosts very large tables on top of clusters of commodity hardware.

 Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable acts
upon the Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.

Applications of HBase

 It is used whenever there is a need for write-heavy applications.

 HBase is used whenever we need to provide fast random access to available data.

 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase History

Year Event

Nov 2006 Google released the paper on BigTable.

Feb 2007 Initial HBase prototype was created as a Hadoop contribution.

Oct 2007 The first usable HBase along with Hadoop 0.15.0 was released.

Jan 2008 HBase became the sub project of Hadoop.

Oct 2008 HBase 0.18.1 was released.


Jan 2009 HBase 0.19.0 was released.

Sept 2009 HBase 0.20.0 was released.

May 2010 HBase became Apache top-level project.

HBase - Client API

This chapter describes the java client API for HBase that is used to perform CRUD operations on
HBase tables. HBase is written in Java and has a Java Native API. Therefore it provides programmatic
access to Data Manipulation Language (DML).

Class HBaseConfiguration

Adds HBase configuration files to a Configuration. This class belongs to
the org.apache.hadoop.hbase package.

Methods and description

S.No. Methods and Description

1 static org.apache.hadoop.conf.Configuration create()

This method creates a Configuration with HBase resources.

Class HTable

HTable is an HBase internal class that represents an HBase table. It is an implementation of table
that is used to communicate with a single HBase table. This class belongs to
the org.apache.hadoop.hbase.client package.

Constructors

S.No. Constructors and Description

1 HTable()

2 HTable(TableName tableName, ClusterConnection connection, ExecutorService pool)

Using this constructor, you can create an object to access an HBase table.

Methods and description

S.No. Methods and Description

1 void close()

Releases all the resources of the HTable.

2 void delete(Delete delete)

Deletes the specified cells/row.


3 boolean exists(Get get)

Using this method, you can test the existence of columns in the table, as specified by Get.

4 Result get(Get get)

Retrieves certain cells from a given row.

5 org.apache.hadoop.conf.Configuration getConfiguration()

Returns the Configuration object used by this instance.

6 TableName getName()

Returns the table name instance of this table.

7 HTableDescriptor getTableDescriptor()

Returns the table descriptor for this table.

8 byte[] getTableName()

Returns the name of this table.

9 void put(Put put)

Using this method, you can insert data into the table.

Class Put

This class is used to perform Put operations for a single row. It belongs to
the org.apache.hadoop.hbase.client package.

Constructors

S.No. Constructors and Description

1 Put(byte[] row)

Using this constructor, you can create a Put operation for the specified row.

2 Put(byte[] rowArray, int rowOffset, int rowLength)

Using this constructor, you can make a copy of the passed-in row key to keep local.

3 Put(byte[] rowArray, int rowOffset, int rowLength, long ts)

Using this constructor, you can make a copy of the passed-in row key to keep local.

4 Put(byte[] row, long ts)

Using this constructor, we can create a Put operation for the specified row, using a given
timestamp.

Methods

S.No. Methods and Description

1 Put add(byte[] family, byte[] qualifier, byte[] value)

Adds the specified column and value to this Put operation.

2 Put add(byte[] family, byte[] qualifier, long ts, byte[] value)

Adds the specified column and value, with the specified timestamp as its version to this Put
operation.

3 Put add(byte[] family, ByteBuffer qualifier, long ts, ByteBuffer value)

Adds the specified column and value, with the specified timestamp as its version to this Put
operation.


Class Get

This class is used to perform Get operations on a single row. This class belongs to
the org.apache.hadoop.hbase.client package.

Constructor

S.No. Constructor and Description

1 Get(byte[] row)

Using this constructor, you can create a Get operation for the specified row.

2 Get(Get get)

Methods

S.No. Methods and Description

1 Get addColumn(byte[] family, byte[] qualifier)

Retrieves the column from the specific family with the specified qualifier.

2 Get addFamily(byte[] family)

Retrieves all columns from the specified family.

Class Delete

This class is used to perform Delete operations on a single row. To delete an entire row, instantiate a
Delete object with the row to delete. This class belongs to
the org.apache.hadoop.hbase.client package.

Constructor

S.No. Constructor and Description

1 Delete(byte[] row)
Creates a Delete operation for the specified row.

2 Delete(byte[] rowArray, int rowOffset, int rowLength)

Creates a Delete operation for the specified row and timestamp.

3 Delete(byte[] rowArray, int rowOffset, int rowLength, long ts)

Creates a Delete operation for the specified row and timestamp.

4 Delete(byte[] row, long timestamp)

Creates a Delete operation for the specified row and timestamp.

Methods

S.No. Methods and Description

1 Delete addColumn(byte[] family, byte[] qualifier)

Deletes the latest version of the specified column.

2 Delete addColumns(byte[] family, byte[] qualifier, long timestamp)

Deletes all versions of the specified column with a timestamp less than or equal to the specified
timestamp.

3 Delete addFamily(byte[] family)

Deletes all versions of all columns of the specified family.

4 Delete addFamily(byte[] family, long timestamp)

Deletes all columns of the specified family with a timestamp less than or equal to the specified
timestamp.

Class Result

This class is used to get a single row result of a Get or a Scan query.

Constructors

S.No. Constructors

1 Result()

Using this constructor, you can create an empty Result with no KeyValue payload; it returns null
if you call rawCells().

Methods

S.No. Methods and Description

1 byte[] getValue(byte[] family, byte[] qualifier)

This method is used to get the latest version of the specified column.

2 byte[] getRow()
This method is used to retrieve the row key that corresponds to the row from which this
Result was created.

HBase Example

Let's see a HBase example to import data of a file in HBase table.

Use Case

We have to import data present in the file into an HBase table by creating it through Java API.

Data_file.txt contains the below data

1. 1,India,Bihar,Champaran,2009,April,P1,1,5  

2. 2,India, Bihar,Patna,2009,May,P1,2,10  

3. 3,India, Bihar,Bhagalpur,2010,June,P2,3,15  

4. 4,United States,California,Fresno,2009,April,P2,2,5  

5. 5,United States,California,Long Beach,2010,July,P2,4,10  

6. 6,United States,California,San Francisco,2011,August,P1,6,20  

This data has to be inserted into a new HBase table created through the Java API. The following
column families have to be created:

sample, region, time, product, sale, profit

Column family region has three column qualifiers: country, state, city.

Column family time has two column qualifiers: year, month.

The Java code is shown below.

Jar Files

Make sure that the following jars are present while writing the code, as they are required by HBase:

a. commons-logging-1.0.4

b. commons-logging-api-1.0.4

c. hadoop-core-0.20.2-cdh3u2

d. hbase-0.90.4-cdh3u2

e. log4j-1.2.15

f. zookeeper-3.3.3-cdh3u0

Program Code

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class readFromFile {
    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            // Create an HBase configuration and an admin client
            Configuration conf = HBaseConfiguration.create(new Configuration());
            HBaseAdmin hba = new HBaseAdmin(conf);
            if (!hba.tableExists(args[0])) {
                // Define the table with its six column families and create it
                HTableDescriptor ht = new HTableDescriptor(args[0]);
                ht.addFamily(new HColumnDescriptor("sample"));
                ht.addFamily(new HColumnDescriptor("region"));
                ht.addFamily(new HColumnDescriptor("time"));
                ht.addFamily(new HColumnDescriptor("product"));
                ht.addFamily(new HColumnDescriptor("sale"));
                ht.addFamily(new HColumnDescriptor("profit"));
                hba.createTable(ht);
                System.out.println("New Table Created");

                HTable table = new HTable(conf, args[0]);

                // Read the comma-separated input file line by line
                File f = new File("/home/training/Desktop/data");
                BufferedReader br = new BufferedReader(new FileReader(f));
                String line = br.readLine();
                int i = 1;
                String rowname = "row";
                while (line != null && line.length() != 0) {
                    System.out.println("Ok till here");
                    StringTokenizer tokens = new StringTokenizer(line, ",");
                    rowname = "row" + i;
                    // Build a Put for this row and add one cell per field
                    Put p = new Put(Bytes.toBytes(rowname));
                    p.add(Bytes.toBytes("sample"), Bytes.toBytes("sampleNo."),
                          Bytes.toBytes(Integer.parseInt(tokens.nextToken())));
                    p.add(Bytes.toBytes("region"), Bytes.toBytes("country"), Bytes.toBytes(tokens.nextToken()));
                    p.add(Bytes.toBytes("region"), Bytes.toBytes("state"), Bytes.toBytes(tokens.nextToken()));
                    p.add(Bytes.toBytes("region"), Bytes.toBytes("city"), Bytes.toBytes(tokens.nextToken()));
                    p.add(Bytes.toBytes("time"), Bytes.toBytes("year"),
                          Bytes.toBytes(Integer.parseInt(tokens.nextToken())));
                    p.add(Bytes.toBytes("time"), Bytes.toBytes("month"), Bytes.toBytes(tokens.nextToken()));
                    p.add(Bytes.toBytes("product"), Bytes.toBytes("productNo."), Bytes.toBytes(tokens.nextToken()));
                    p.add(Bytes.toBytes("sale"), Bytes.toBytes("quantity"),
                          Bytes.toBytes(Integer.parseInt(tokens.nextToken())));
                    p.add(Bytes.toBytes("profit"), Bytes.toBytes("earnings"), Bytes.toBytes(tokens.nextToken()));
                    i++;
                    table.put(p);
                    line = br.readLine();
                }
                br.close();
                table.close();
            } else {
                System.out.println("Table Already exists. Please enter another table name");
            }
        } else {
            System.out.println("Please Enter the table name through command line");
        }
    }
}

Difference between RDBMS and HBase

Relational Database Management System (RDBMS): RDBMS is the basis for SQL and for all modern
database systems like MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access. A relational
database management system (RDBMS) is a database management system (DBMS) that is based on
the relational model as introduced by E. F. Codd. An RDBMS is a type of DBMS with a row-based
table structure that connects related data elements and includes functions that maintain the
security, accuracy, integrity, and consistency of the data. The most basic RDBMS functions are the
create, read, update and delete operations. An RDBMS follows the ACID properties.

Applications:

 Tracking and managing day-to-day activities and transactions such as production, stocking,
income and expenses, and purchases.

 Management of normal activities in hospitals, banks, railways, schools, and institutions.

HBase: HBase is a column-oriented database management system that runs on top of the Hadoop
Distributed File System (HDFS). It is well suited for sparse data sets, which are common in many big
data use cases. It is an open-source, distributed database developed by the Apache Software
Foundation. It is modeled after Google's Bigtable and is primarily written in Java. It can store massive
amounts of data, from terabytes to petabytes. It is built for low-latency operations and is used
extensively for read and write operations. It stores large amounts of data in the form of tables.

Application:

 For creating large applications.

 Random and fast accessing of data is provided using HBase. 

 HBase is used internally by companies including Facebook, Twitter, Yahoo, and Adobe.

Difference between RDBMS and HBase:


1. SQL: RDBMS requires SQL (Structured Query Language). HBase does not require SQL.

2. Schema: RDBMS has a fixed schema. HBase does not have a fixed schema and allows columns to be added on the fly.

3. Database type: RDBMS is a row-oriented database. HBase is a column-oriented database.

4. Scalability: RDBMS allows scaling up; rather than adding new servers, the current server is upgraded to a more capable one whenever more memory, processing power, or disc space is required. HBase allows scaling out; when extra memory and disc space are required, new servers are added to the cluster rather than upgrading the existing ones.

5. Nature: RDBMS is static in nature. HBase is dynamic in nature.

6. Data retrieval: In RDBMS, data retrieval is slower. In HBase, data retrieval is faster.

7. Rule: RDBMS follows the ACID (Atomicity, Consistency, Isolation, and Durability) properties. HBase follows the CAP (Consistency, Availability, Partition-tolerance) theorem.

8. Type of data: RDBMS can handle structured data. HBase can handle structured, unstructured as well as semi-structured data.

9. Sparse data: RDBMS cannot handle sparse data. HBase can handle sparse data.

10. Volume of data: In RDBMS, the amount of data is determined by the server's configuration. In HBase, the amount of data depends on the number of machines deployed rather than on a single machine.

11. Transaction integrity: In RDBMS, there is mostly a guarantee associated with transaction integrity. In HBase, there is no such guarantee associated with transaction integrity.

12. Referential integrity: Referential integrity is supported by RDBMS. HBase has no built-in support for referential integrity.

13. Normalization: In RDBMS, you can normalize the data. The data in HBase is not normalized; there is no logical relationship or connection between distinct tables of data.

14. Table size: RDBMS is designed to accommodate small tables and is difficult to scale. HBase is designed to accommodate large tables and can scale horizontally.
