Apache Pig Handy Notes Lab

Introduction to Apache Pig

 Apache Pig is a platform used to analyze large data sets by representing them as data flows.
 It is designed to provide an abstraction over MapReduce, reducing the complexity of writing a MapReduce program.
 Using Apache Pig, we can perform data manipulation operations in Hadoop very easily.

 Pig Latin is a data flow language. It allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel. These data flows can be simple linear flows, or complex workflows that include points where multiple inputs are joined and where data is split into multiple streams to be processed by different operators.
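As a hedged illustration of such a flow (the file names, field names, and tab delimiter below are assumptions, not part of these notes), a small Pig Latin script might read one input, split it into two streams, and write two outputs in parallel:

events = LOAD 'events.txt' USING PigStorage('\t') AS (user:chararray, action:chararray, amount:int);
-- split one input into two parallel streams
SPLIT events INTO purchases IF action == 'buy', views IF action == 'view';
big_purchases = FILTER purchases BY amount > 100;
STORE big_purchases INTO 'big_purchases_out';
STORE views INTO 'page_views_out';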

The features of Apache Pig are:

 Pig enables programmers to write complex data transformations without knowing Java.
 Apache Pig has two main components – the Pig Latin language and the Pig run-time environment, in which Pig Latin programs are executed.
 For Big Data analytics, Pig provides a simple data flow language known as Pig Latin, which has functionality similar to SQL, such as join, filter, limit, etc.
 Developers who already work with scripting languages and SQL can leverage Pig Latin easily. This gives developers ease of programming with Apache Pig. Pig Latin provides various built-in operators like join, sort, filter, etc. to read, write, and process large data sets. Thus it is evident that Pig has a rich set of operators.
 Programmers write scripts using Pig Latin to analyze data, and these scripts are internally converted to Map and Reduce tasks by the Pig MapReduce engine. Before Pig, writing MapReduce tasks was the only way to process the data stored in HDFS.
 If a programmer wants to write custom functions that are unavailable in Pig, Pig allows them to write User Defined Functions (UDFs) in a language of their choice, such as Java, Python, Ruby, Jython, JRuby, etc., and embed them in a Pig script (see the sketch after this list). This provides extensibility to Apache Pig.
 Pig can process any kind of data, i.e. structured, semi-structured, or unstructured data, coming from various sources. Apache Pig handles all kinds of data.
 Approximately 10 lines of Pig code are equivalent to about 200 lines of MapReduce code.
 It can handle inconsistent schemas (in the case of unstructured data).
 Apache Pig extracts the data, performs operations on that data, and dumps the data in the required format in HDFS, i.e. ETL (Extract Transform Load).
 Apache Pig automatically optimizes the tasks before execution, i.e. automatic optimization.
 It allows programmers and developers to concentrate on the whole operation without having to create mapper and reducer functions separately.
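As mentioned above, custom logic can be packaged as a UDF and used from a script. A minimal sketch (the jar name, package name, and function name below are hypothetical, as is the input file):

REGISTER 'myudfs.jar';  -- register the jar containing the hypothetical UDF
A = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray);
B = FOREACH A GENERATE myudfs.UPPER(name);  -- call the UDF by its fully qualified name
DUMP B;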
Use Cases of Apache Pig:
Processing web logs, data processing for searching, ad hoc queries, and quick prototyping of algorithms for processing large datasets.

The architecture of Apache Pig is shown in the image below.

Parser
As the image shows, after passing through the Grunt shell or Pig Server, Pig scripts are passed to the Parser. The Parser checks the syntax of the script and does type checking. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators: the logical operators are represented as the nodes and the data flows as the edges.

Optimizer
Then the DAG is submitted to the optimizer. The Optimizer performs optimization activities such as splitting, merging, transforming, and reordering operators. This optimizer provides the automatic optimization feature to Apache Pig. The optimizer basically aims to reduce the amount of data in the pipeline at any point in time while processing the extracted data, and for that it applies rules such as:
 PushUpFilter: If there are multiple conditions in the filter and the filter can be split, Pig splits the conditions and pushes up each condition separately. Selecting these conditions earlier helps in reducing the number of records remaining in the pipeline.
 PushDownForEachFlatten: Applying flatten, which produces a cross product between a
complex type such as a tuple or a bag and the other fields in the record, as late as possible
in the plan. This keeps the number of records low in the pipeline.
 ColumnPruner: Omitting columns that are never used or no longer needed, reducing the
size of the record. This can be applied after each operator, so that fields can be pruned as
aggressively as possible.
 MapKeyPruner: Omitting map keys that are never used, reducing the size of the record.
 LimitOptimizer: If the limit operator is immediately applied after a load or sort operator, Pig converts the load or sort operator into a limit-sensitive implementation, which does not require processing the whole data set. Applying the limit earlier reduces the number of records.
This is just a flavor of the optimization process; beyond these rules, the optimizer also works on Join, Order By, and Group By operations. The sketch below shows a script where several of these rules can apply.
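As a hedged sketch (the file name, fields, and tab delimiter are assumptions), the following script contains patterns that the PushUpFilter and LimitOptimizer rules described above can improve:

logs = LOAD 'access_log.txt' USING PigStorage('\t') AS (ip:chararray, url:chararray, status:int, bytes:int);
-- PushUpFilter can split this compound condition and push each part closer to the load
ok = FILTER logs BY status == 200 AND bytes > 0;
sorted = ORDER ok BY bytes DESC;
-- LimitOptimizer can make the sort limit-sensitive, since the limit follows it immediately
top10 = LIMIT sorted 10;
DUMP top10;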

Compiler
After the optimization process, the compiler compiles the optimized code into a series of MapReduce jobs. The compiler is responsible for converting Pig jobs into MapReduce jobs automatically.

Execution engine
Finally, as shown in the figure, these MapReduce jobs are submitted for execution to the execution engine. The MapReduce jobs are then executed and give the required result. The result can be displayed on the screen using the “DUMP” statement and can be stored in HDFS using the “STORE” statement.
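To inspect the plans this compilation step produces for a relation, the EXPLAIN operator can be used from the Grunt shell; a minimal sketch, assuming a file named ‘data’ with two integer fields:

grunt> A = LOAD 'data' AS (f1:int, f2:int);
grunt> B = GROUP A BY f1;
grunt> EXPLAIN B;  -- prints the logical, physical, and MapReduce plans for B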

Pig Latin Scripts


Initially, as illustrated in the above image, we submit Pig scripts, written in Pig Latin using built-in operators, to the Apache Pig execution environment.
There are three ways to execute a Pig script:
 Grunt Shell: This is Pig’s interactive shell, provided to execute all Pig scripts.
 Script File: Write all the Pig commands in a script file and execute the Pig script file. This is executed by the Pig Server.
 Embedded Script: If some functionality is unavailable in the built-in operators, we can programmatically create User Defined Functions in other languages like Java, Python, Ruby, etc. to bring in that functionality, embed them in the Pig Latin script file, and then execute that script file.
Pig’s data types
Pig’s data types make up the data model for how Pig thinks of the structure of the data it is
processing. With Pig, the data model gets defined when the data is loaded.
Any data you load into Pig from disk is going to have a particular schema and structure.
Pig needs to understand that structure, so when you do the loading, the data automatically goes
through a mapping.
Luckily for you, the Pig data model is rich enough to handle almost anything thrown its way, including table-like structures and nested hierarchical data structures.
In general terms, though, Pig data types can be broken into two categories:
Scalar types and complex types.
 Scalar types contain a single value,
 Complex types contain other types, such as the Tuple, Bag and Map types listed below.
Pig Latin has these four types in its data model:
 Atom: An atom is any single value, such as a string or a number — ‘Diego’, for example.
Pig’s atomic values are scalar types that appear in most programming languages — int,
long, float, double, chararray and bytearray.
 Tuple: A tuple is a record that consists of a sequence of fields. Each field can be of any type (‘Diego’, ‘Gomez’, or 6, for example). Think of a tuple as a row in a table.
 Bag: A bag is a collection of non-unique tuples. The schema of the bag is flexible —
each tuple in the collection can contain an arbitrary number of fields, and each field can
be of any type.
 Map: A map is a collection of key-value pairs. The key of a map must be a chararray and must be unique; the value can be of any type.
The figure offers some fine examples of Tuple, Bag, and Map data types, as well.
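A hedged sketch of declaring these types in a LOAD schema (the file name, field names, and tab delimiter are assumptions used only for illustration):

students = LOAD 'students.txt' USING PigStorage('\t') AS (
    name:chararray,                                    -- atom (scalar value)
    address:tuple(street:chararray, city:chararray),   -- tuple: ordered set of fields
    courses:bag{t:(course:chararray)},                 -- bag: collection of tuples
    scores:map[int]                                    -- map: chararray keys to int values
);
DESCRIBE students;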

Simple and Complex Types

Simple Types   Description                                  Example
int            Signed 32-bit integer                        10
long           Signed 64-bit integer                        Data: 10L or 10l; Display: 10L
float          32-bit floating point                        Data: 10.5F, 10.5f, 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
double         64-bit floating point                        Data: 10.5, 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
chararray      Character array (string) in Unicode UTF-8    hello world
bytearray      Byte array (blob)
boolean        Boolean                                      true/false (case insensitive)
datetime       Datetime                                     1970-01-01T00:00:00.000+00:00
biginteger     Java BigInteger                              200000000000
bigdecimal     Java BigDecimal                              33.456783321323441233442

Complex Types
tuple          An ordered set of fields.                    (19,2)
bag            A collection of tuples.                      {(19,2), (18,1)}
map            A set of key-value pairs.                    [open#apache]

The value of all these types can also be null. The semantics for null are similar to those used in
SQL. The concept of null in Pig means that the value is unknown. Nulls can show up in the data
in cases where values are unreadable or unrecognizable — for example, if you were to use a
wrong data type in the LOAD statement.

Null could be used as a placeholder until data is added or as a value for a field that is optional.
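For instance, a value that cannot be read as the declared type becomes null and can be filtered out explicitly; a minimal sketch, assuming a comma-separated file named ‘students.txt’:

students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, gpa:float);
-- rows whose gpa field could not be read as a float now hold null in that position
clean = FILTER students BY gpa IS NOT NULL;
DUMP clean;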
Pig Latin has a simple syntax with powerful semantics you’ll use to carry out two primary
operations: access and transform data.

In a Hadoop context, accessing data means allowing developers to load, store, and stream
data, whereas transforming data means taking advantage of Pig’s ability to group, join,
combine, split, filter, and sort data. The table gives an overview of the operators associated
with each operation.

Pig Latin Operators

Operation        Operator          Explanation
Data Access      LOAD              Read data from the file system; the LOAD operator can specify a schema.
                 DUMP              Write output to standard output (stdout).
                 STREAM            Send all records through an external binary.
Transformations  FOREACH           Apply an expression to each record and output one or more records.
                 FILTER            Apply a predicate and remove records that do not meet the condition.
                 GROUP/COGROUP     Aggregate records with the same key from one or more inputs.
                 JOIN              Join two or more records based on a condition.
                 CROSS             Cartesian product of two or more inputs.
                 ORDER             Sort records based on a key.
                 DISTINCT          Remove duplicate records.
                 UNION             Merge two data sets.
                 SPLIT             Divide data into two or more bags based on a predicate.
                 LIMIT             Subset the number of records.

Operators for Debugging and Troubleshooting

Operation   Operator    Description
Debug       DESCRIBE    Return the schema of a relation.
            DUMP        Dump the contents of a relation to the screen.
            EXPLAIN     Display the MapReduce execution plans.
Apache Pig script Execution Modes
Local Mode: In ‘local mode’, you can execute the Pig script against the local file system. In this case, you don’t need to store the data in the Hadoop HDFS file system; instead, you can work with the data stored in the local file system itself.
Command: pig -x local
MapReduce Mode: In ‘MapReduce mode’, the data needs to be stored in the HDFS file system, and you process the data with the help of a Pig script.
Command: pig
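A hedged example of running a script file in each mode (the script name below is hypothetical):

$ pig -x local myscript.pig        # run the script against the local file system
$ pig myscript.pig                 # run the script in MapReduce mode (the default)
$ pig -x mapreduce myscript.pig    # the same, with the mode stated explicitly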

To check where pig is installed


echo $PIG_HOME
Apache Pig Script in MapReduce Mode
Example 1: To read data from a data file and to display the required contents on the
terminal as output.
The sample data file contains the following data:

1. Open gedit and create the file


2. Save the text file with a name on the Desktop.
3. Now, copy the file into HDFS so that it can be used by Pig.
$ hdfs dfs -put Desktop/filename /newfile
4. Now log in to the Grunt shell by typing pig.
5. Follow these steps to load the file with the schema (FName, LName, MobileNo, City, Profession), project the required columns, and display them:
A = LOAD '/edureka/information.txt' USING PigStorage('') AS (FName:chararray, LName:chararray, MobileNo:chararray, City:chararray, Profession:chararray);
B = FOREACH A GENERATE FName, MobileNo, Profession;
DUMP B;
6. Save and close the file.


 The first command loads the file ‘information.txt’ into relation A with the schema (FName, LName, MobileNo, City, Profession).
 The second command loads the required data from relation A into relation B. The FOREACH operator is used to generate specified data transformations based on the column data.
 The third command displays the content of relation B on the terminal/console.

WORKING DIRECTLY ON GRUNT SHELL


Copy file to HDFS
Check file copied successfully

Start PIG and create schema
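A hedged sketch of these steps (the local file name, HDFS path, and comma delimiter are assumptions; the schema matches the DESCRIBE output shown below):

$ hdfs dfs -put Desktop/student.txt /pig_data/student.txt
$ hdfs dfs -ls /pig_data
$ pig
grunt> A = LOAD '/pig_data/student.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);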


To view schema

grunt>DESCRIBE A;

A: {name: chararray,age: int,gpa: float}

quit Command
You can quit from the Grunt shell using this command.
Usage
Quit from the Grunt shell as shown below.
grunt> quit

Case Sensitivity


The names (aliases) of relations and fields are case sensitive. The names of Pig Latin functions
are case sensitive. The names of parameters (see Parameter Substitution) and all other Pig Latin
keywords are case insensitive.
In the example below, note the following:
1. The names (aliases) of relations A, B, and C are case sensitive.
2. The names (aliases) of fields f1, f2, and f3 are case sensitive.
3. Function names PigStorage and COUNT are case sensitive.
4. Keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE, and DUMP
are case insensitive. They can also be written as load, using, as, group, by, etc.
5. In the FOREACH statement, the field in relation B is referred to by positional
notation ($0).

grunt> A = LOAD 'datafile' USING PigStorage() AS (f1:int, f2:int, f3:int);


grunt> B = GROUP A BY f1;
grunt> C = FOREACH B GENERATE COUNT ($0);
grunt> DUMP C;
Bag
A bag is a collection of tuples.
Syntax: Inner bag
{ tuple [, tuple …] }
Terms
{ } An inner bag is enclosed in curly brackets { }.
tuple A tuple.
Usage
Note the following about bags:
 A bag can have duplicate tuples.
 A bag can have tuples with differing numbers of fields. However, if Pig tries to access a field
that does not exist, a null value is substituted.
 A bag can have tuples with fields that have different data types. However, for Pig to effectively process bags, the schemas of the tuples within those bags should be the same. For example, if half of the tuples include chararray fields while the other half include float fields, only half of the tuples will participate in any kind of computation, because the chararray fields will be converted to null.
Bags have two forms: outer bag (or relation) and inner bag.
Example: Outer Bag
In this example A is a relation or bag of tuples. You can think of this bag as an outer bag.
A = LOAD 'data' as (f1:int, f2:int, f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
Example: Inner Bag
Now, suppose we group relation A by the first field to form relation X.
In this example X is a relation or bag of tuples. The tuples in relation X have two fields. The first
field is type int. The second field is type bag; you can think of this bag as an inner bag.
X = GROUP A BY f1;
DUMP X;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})

What are the relational operators in Pig?


The relational operators in Pig are as follows; a combined sketch using several of them appears after this list:

COGROUP
It joins two or more relations and then performs a GROUP operation on the joined result.

CROSS
This is used to compute the cross product (cartesian product) of two or more relations.

FOREACH
This will iterate through the tuples of a relation, generating a data transformation.

JOIN
This is used to join two or more relations.

LIMIT
This will limit the number of output tuples.

SPLIT
This will split the relation into two or more relations.

UNION
It will merge the contents of two relations.

ORDER
This is used to sort a relation based on one or more fields.
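A hedged sketch showing several of these operators together (the files, fields, and comma delimiter are assumptions):

emps  = LOAD 'employees.txt'   USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:int);
depts = LOAD 'departments.txt' USING PigStorage(',') AS (dept:chararray, location:chararray);
high    = FILTER emps BY salary > 50000;                    -- FILTER: keep only matching tuples
joined  = JOIN high BY dept, depts BY dept;                 -- JOIN: combine two relations on a key
grouped = GROUP joined BY depts::location;                  -- GROUP: collect tuples sharing a key
counts  = FOREACH grouped GENERATE group, COUNT(joined);    -- FOREACH: transform each group
ordered = ORDER counts BY $1 DESC;                          -- ORDER: sort by the count
top5    = LIMIT ordered 5;                                  -- LIMIT: keep the first 5 records
DUMP top5;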
Storing data in PIG
You can store the loaded data in the file system using the store operator.
Syntax
STORE Relation_name INTO ' required_directory_path ' [USING function];
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
       USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
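To verify what was stored, you can list and read the part files Pig wrote into that directory; a hedged sketch (the exact part-file names depend on whether the job ran map-only or with reducers, so a glob is used):

$ hdfs dfs -ls hdfs://localhost:9000/pig_Output/
$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-*'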
