Apache Pig

The document provides information about Apache Pig, including: 1) Apache Pig is a tool for analyzing large datasets that represents data as data flows and allows data manipulation operations using the Pig Latin language which is compiled into MapReduce jobs. 2) Pig Latin scripts are converted by the Pig Engine into MapReduce jobs which perform the data analysis on datasets stored in HDFS. 3) Pig provides features like rich operators, ease of programming through a SQL-like language, and ability to create UDFs in other languages like Java.

Uploaded by

Mukul Verma

WWW.PAVANONLINETRAININGS.COM | WWW.PAVANTESTINGTOOLS.COM
What is Pig?
• Apache Pig is an abstraction over MapReduce.
• It is a tool/platform used to analyze large data sets by
representing them as data flows.
• Pig is generally used with Hadoop; we can perform all the data
manipulation operations in Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language
known as Pig Latin.
• This language provides various operators, and programmers
can also develop their own functions for reading, writing, and
processing data.
Pig Architecture & Components
• To analyze data using Apache Pig, programmers need to write
scripts using Pig Latin language.
• All these scripts are internally converted to Map and Reduce
tasks.
• Apache Pig has a component known as Pig Engine that accepts
the Pig Latin scripts as input and converts those scripts into
MapReduce jobs.
Features of Pig

• Rich set of operators: It provides many operators to perform operations like join,
sort, filter, etc.

• Ease of programming: Pig Latin is similar to SQL, so it is easy to write a Pig script
if you are good at SQL.

• UDFs: Pig provides the facility to create User-Defined Functions in other
programming languages such as Java and invoke or embed them in Pig scripts.

• Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured
and unstructured. It stores the results in HDFS.
Apache Pig Vs Hive
• Both Apache Pig and Hive are used to create MapReduce jobs, and in some cases
Hive operates on HDFS in a similar way to Apache Pig.
Pig Latin – Data Model
Pig Execution Modes
• You can run Apache Pig in two modes.
• Local Mode
– In this mode, all the files are installed and run from your local host and
local file system. There is no need for Hadoop or HDFS. This mode is
generally used for testing purposes.
• MapReduce Mode
– MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we
execute the Pig Latin statements to process the data, a MapReduce job is
invoked in the back-end to perform a particular operation on the data that
exists in the HDFS.
Invoking the Grunt Shell
• Local Mode
• $ pig -x local
• MapReduce mode
• $ pig -x mapreduce (or) pig
Execution Mechanisms
• Interactive Mode (Grunt shell) – You can run Apache Pig in interactive
mode using the Grunt shell. In this shell, you can enter the Pig Latin
statements and get the output (using Dump operator).

• Batch Mode (Script) – You can run Apache Pig in Batch mode by writing the
Pig Latin script in a single file with .pig extension.

• Embedded Mode (UDF) – Apache Pig provides the provision of defining our
own functions (User Defined Functions) in programming languages such as
Java, and using them in our script.
• Interactive Mode:
grunt> customers= LOAD '/home/cloudera/customers.txt' USING
PigStorage(',');
grunt> dump customers;

• Batch Mode (Local):

[cloudera@quickstart ~]$ cat pig_samplescript_local.pig

customers= LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
dump customers;

[cloudera@quickstart ~]$ pig -x local pig_samplescript_local.pig

• Batch Mode (HDFS):
[cloudera@quickstart ~]$ cat pig_samplescript_global.pig
customers= LOAD '/training/customers.txt' USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
dump customers;

[cloudera@quickstart ~]$ pig -x mapreduce pig_samplescript_global.pig

Pig Latin Basics
Diagnostic Operators
• The load statement will simply load the data into the specified
relation in Apache Pig. To verify the execution of
the Load statement, you have to use the Diagnostic Operators.
• Pig Latin provides four different types of diagnostic operators:
– Dump operator
– Describe operator
– Explain operator
– Illustrate operator
• Dump operator
• The Dump operator is used to run the Pig Latin statements and
display the results on the screen. It is generally used for
debugging purposes.

grunt> customers= LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> dump customers;
• Describe operator
• The describe operator is used to view the schema of a
relation/bag.

grunt> customers= LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> describe customers;
customers: {id: int,name: chararray,age: int,address:
chararray,salary: int}
• Explain operator
• The explain operator is used to display the logical, physical,
and MapReduce execution plans of a relation/bag.

grunt> customers= LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> explain customers;
• Illustrate operator
• The illustrate operator is used to display the step-by-step
execution of a sequence of statements on a small sample of the data.

grunt> customers= LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> illustrate customers;
Grouping & Joining
Group Operator
• The GROUP operator is used to group the data in one or more
relations. It collects the data having the same key.

• grunt> student_details = LOAD '/home/cloudera/students.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,
age:int, phone:chararray, city:chararray);
• grunt> student_groupdata = GROUP student_details by age;
• grunt> dump student_groupdata;
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})

• grunt> describe student_groupdata;

student_groupdata: {group: int,student_details: {(id: int,firstname:
chararray,lastname: chararray,age: int,phone: chararray,city:
chararray)}}

• grunt> Illustrate student_groupdata;
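• The shape of the GROUP result can be sketched in plain Python (not Pig code). The helper name group_by and the four sample rows are assumptions for illustration; each output pair plays the role of the (group, bag) tuple shown in the dump above.

```python
from collections import defaultdict

# Sample rows shaped like student_details: (id, firstname, lastname, age, phone, city).
student_details = [
    (1, "Rajiv", "Reddy", 21, "9848022337", "Hyderabad"),
    (2, "siddarth", "Battacharya", 22, "9848022338", "Kolkata"),
    (3, "Rajesh", "Khanna", 22, "9848022339", "Delhi"),
    (4, "Preethi", "Agarwal", 21, "9848022330", "Pune"),
]

def group_by(rows, key):
    """Return (group, bag) pairs, one pair per distinct key value."""
    bags = defaultdict(list)
    for row in rows:
        bags[key(row)].append(row)
    return sorted(bags.items())

# Hypothetical counterpart of: GROUP student_details by age
student_groupdata = group_by(student_details, key=lambda r: r[3])
for group, bag in student_groupdata:
    print(group, bag)
```

Every input tuple lands in exactly one bag, so grouping changes the nesting of the data, not the number of student tuples.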

Grouping by Multiple Columns
• grunt> student_details = LOAD '/home/cloudera/students.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray, city:chararray);

• grunt> student_multiplegroup = GROUP student_details by
(age, city);
• grunt> dump student_multiplegroup;
((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})
Join Operator
• The JOIN operator is used to combine records from two or
more relations.
• Types of Joins:
– Self-join
– Inner-join
– Outer join : left join, right join, full join
Self Join
• customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,
name:chararray, age:int, address:chararray, salary:int);
• orders = LOAD '/home/local/orders.txt' USING PigStorage(',') as (oid:int, date:chararray,
customer_id:int, amount:int);

• grunt> customers1 = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,
name:chararray, age:int, address:chararray, salary:int);
• grunt> customers2 = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,
name:chararray, age:int, address:chararray, salary:int);

• grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

• grunt> Dump customers3;

Inner Join (equijoin)
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',')
as (oid:int, date:chararray, customer_id:int, amount:int);
• grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
• grunt> dump customer_orders;
Left Outer Join
• The left outer join operation returns all rows from the left relation, even if
there are no matches in the right relation.

• grunt> customers = LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as
(oid:int, date:chararray, customer_id:int, amount:int);
• grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY
customer_id;
• grunt> Dump outer_left;
Right Outer Join
• The right outer join operation returns all rows from the right table, even if
there are no matches in the left table.
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as
(oid:int, date:chararray, customer_id:int, amount:int);

• grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

• grunt> Dump outer_right;

Full Outer Join
• The full outer join operation returns all rows from both relations. Rows are
combined where the join keys match, and the missing side is filled with nulls where they do not.
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as
(oid:int, date:chararray, customer_id:int, amount:int);

• grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

• grunt> Dump outer_full;
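• How the four join types differ can be sketched in plain Python (not Pig code). The join helper and the sample customers and orders rows are assumptions for illustration; None stands in for Pig's null, and null join keys are not modelled.

```python
def join(left, right, left_key, right_key, how="inner"):
    """Equi-join two lists of tuples; unmatched sides are padded with None."""
    left_width = len(left[0])
    right_width = len(right[0])
    out = []
    matched_right = set()
    for lrow in left:
        hits = [i for i, rrow in enumerate(right) if rrow[right_key] == lrow[left_key]]
        matched_right.update(hits)
        for i in hits:
            out.append(lrow + right[i])
        # LEFT and FULL keep left rows that found no partner.
        if not hits and how in ("left", "full"):
            out.append(lrow + (None,) * right_width)
    # RIGHT and FULL keep right rows that found no partner.
    if how in ("right", "full"):
        for i, rrow in enumerate(right):
            if i not in matched_right:
                out.append((None,) * left_width + rrow)
    return out

# Hypothetical sample data: customers (id, name) and orders (oid, customer_id, amount).
customers = [(1, "Ramesh"), (2, "Khilan"), (3, "Kaushik")]
orders = [(101, 2, 1560), (102, 3, 3000), (103, 9, 500)]

inner = join(customers, orders, 0, 1)
left = join(customers, orders, 0, 1, how="left")
right = join(customers, orders, 0, 1, how="right")
full = join(customers, orders, 0, 1, how="full")
```

With this data the inner join has two rows, the left and right joins have three each, and the full join has four.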

Cross Operator
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING
PigStorage(',') as (oid:int, date:chararray, customer_id:int,
amount:int);

• grunt> cross_data = CROSS customers, orders;

• grunt> Dump cross_data;
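• CROSS computes the cross product of the relations. A plain-Python sketch (not Pig code) with small, made-up sample rows shows how quickly the result grows.

```python
from itertools import product

# Hypothetical sample data.
customers = [(1, "Ramesh"), (2, "Khilan")]
orders = [(101, 2), (102, 3), (103, 9)]

# Every customer tuple is paired with every order tuple.
cross_data = [c + o for c, o in product(customers, orders)]
print(len(cross_data))  # number of customers times number of orders
```

Because the output size is the product of the input sizes, CROSS is best kept to small relations.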

Combining & Splitting
Union Operator
• The UNION operator of Pig Latin is used to merge the content
of two relations. To perform UNION operation on two
relations, their columns and domains must be identical.
• grunt> student1 = LOAD '/home/cloudera/student_data1.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, phone:chararray, city:chararray);
• grunt> student2 = LOAD '/home/cloudera/student_data2.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, phone:chararray, city:chararray);
• grunt> student = UNION student1, student2;
• grunt> dump student;
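• A plain-Python sketch (not Pig code) of what UNION produces, using made-up rows: the tuples of both relations are simply merged, and duplicates are kept. Pig does not guarantee the order of the merged tuples; the list order below is only an artifact of the sketch.

```python
# Hypothetical sample data with one tuple present in both relations.
student1 = [(1, "Rajiv"), (2, "siddarth")]
student2 = [(2, "siddarth"), (7, "Komal")]

# UNION student1, student2: all tuples from both inputs, duplicates included.
student = student1 + student2
print(student)
```

To drop the repeated tuple, the result would need a DISTINCT step afterwards.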
Split Operator
• The SPLIT operator is used to split a relation into two or more
relations.
• grunt> student_details = LOAD '/home/cloudera/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);

• Let us now split the relation into two, one listing the students whose age is less than
23, and the other listing the students whose age is strictly between 23 and 25 (that is, 24).

• grunt> SPLIT student_details into student_details1 if age<23, student_details2 if
(age>23 and age<25);

• grunt> Dump student_details1;

• grunt> Dump student_details2;
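• The two SPLIT conditions above can be checked in plain Python (not Pig code). Only the age field is modelled and the sample ages are made up. Each tuple is tested against every condition, so a tuple that satisfies none of them appears in neither output relation.

```python
ages = [21, 22, 23, 24, 25]

# Each tuple is tested against every condition independently.
student_details1 = [a for a in ages if a < 23]
student_details2 = [a for a in ages if a > 23 and a < 25]

# Ages that satisfy neither condition end up in no output relation.
dropped = [a for a in ages if a not in student_details1 + student_details2]
print(student_details1, student_details2, dropped)
```

With these conditions, students aged exactly 23 or 25 are silently left out; using <= or >= in one of the conditions would include them.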

Filtering
Filter Operator
• The FILTER operator is used to select the required tuples from
a relation based on a condition.
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
• grunt> filter_data = FILTER student_details BY city == 'Chennai';
• grunt> dump filter_data;
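• A plain-Python sketch (not Pig code) of the FILTER step, with made-up rows reduced to (id, firstname, city):

```python
# Hypothetical sample rows: (id, firstname, city).
student_details = [
    (1, "Rajiv", "Hyderabad"),
    (6, "Archana", "Chennai"),
    (8, "Bharathi", "Chennai"),
]

# FILTER ... BY city == 'Chennai': keep only the tuples for which the condition holds.
filter_data = [row for row in student_details if row[2] == "Chennai"]
print(filter_data)
```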
Distinct Operator
• The DISTINCT operator is used to remove redundant
(duplicate) tuples from a relation.
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
• grunt> distinct_data = DISTINCT student_details;
• grunt> dump distinct_data;
Foreach Operator
• The FOREACH operator is used to generate specified data
transformations based on the column data.
• grunt> student_details = LOAD '/home/cloudera/student_details.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray,age:int, phone:chararray, city:chararray);
• Get the id, age, and city values of each student from the
relation student_details and store them in another relation
named foreach_data using the FOREACH operator.
• grunt> foreach_data = FOREACH student_details GENERATE id,age,city;
• grunt> Dump foreach_data;
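• FOREACH ... GENERATE builds one output tuple per input tuple. A plain-Python sketch (not Pig code) of the projection, with two made-up rows:

```python
# Hypothetical sample rows: (id, firstname, lastname, age, phone, city).
student_details = [
    (1, "Rajiv", "Reddy", 21, "9848022337", "Hyderabad"),
    (2, "siddarth", "Battacharya", 22, "9848022338", "Kolkata"),
]

# FOREACH student_details GENERATE id, age, city
foreach_data = [
    (sid, age, city)
    for sid, _first, _last, age, _phone, city in student_details
]
print(foreach_data)
```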
Sorting
Order By
• The ORDER BY operator is used to display the contents of a
relation in a sorted order based on one or more fields.
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
• grunt> order_by_data = ORDER student_details BY age DESC;
Limit Operator
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
• grunt> limit_data = LIMIT student_details 4;
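• ORDER BY and LIMIT are often combined to get a "top N" result. A plain-Python sketch (not Pig code) with made-up (name, age) rows follows. In Pig, LIMIT on an unsorted relation gives no guarantee about which tuples are returned, so the sketch limits the sorted relation.

```python
# Hypothetical sample rows: (firstname, age).
students = [("Rajiv", 21), ("Rajesh", 22), ("Archana", 23), ("Komal", 24), ("Preethi", 21)]

# ORDER ... BY age DESC: sort on the age field, largest first.
order_by_data = sorted(students, key=lambda s: s[1], reverse=True)

# LIMIT ... 4: keep at most four tuples.
limit_data = order_by_data[:4]
print(limit_data)
```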
Pig Latin Built-In Functions
• Eval Functions
• String Functions
• Date-time Functions
• Math Functions
Eval Functions
• AVG()
• CONCAT()
• COUNT()
• COUNT_STAR()
• DIFF()
• MAX()
• MIN()
• SIZE()
• SUBTRACT()
• SUM()
AVG()
• Computes the average of the numeric values in a single-column bag.
• grunt> A = LOAD '/home/cloudera/student.txt' USING PigStorage(',') as (name:chararray,
term:chararray, gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> C = FOREACH B GENERATE A.name, AVG(A.gpa);
• grunt> DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
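• The average for John prints as 3.850000023841858 rather than 3.85 because gpa is declared as a 32-bit float. The plain-Python sketch below (not Pig code) reproduces the value, assuming the float values are widened to doubles before they are averaged, which is consistent with the output shown. The helper to_float32 is an illustrative name.

```python
import struct

def to_float32(x):
    """Round a Python float to the nearest 32-bit float, returned as a double."""
    return struct.unpack("f", struct.pack("f", x))[0]

# John's four GPA values as they would be stored in a float field.
john_gpas = [to_float32(g) for g in (3.9, 3.7, 4.0, 3.8)]
avg = sum(john_gpas) / len(john_gpas)
print(avg)
```

Declaring the field as double in the LOAD schema is one way to avoid this kind of rounding noise.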
CONCAT()
• Concatenates two expressions of identical type.
• grunt>A = LOAD '/home/cloudera/data.txt' as (f1:chararray,
f2:chararray, f3:chararray);
• grunt>DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
• grunt>X = FOREACH A GENERATE CONCAT(f2,f3);
• grunt>DUMP X;
(opensource)
(mapreduce)
(piglatin)
COUNT
• Computes the number of elements in a bag.
• Note: You cannot use the tuple designator (*) with COUNT;
that is, COUNT(*) will not work.
• grunt>A = LOAD '/home/cloudera/c.txt' USING PigStorage(',') as (f1:int, f2:int, f3:int);
• grunt>DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
• grunt>B = GROUP A BY f1;
• grunt>DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
• grunt>X = FOREACH B GENERATE COUNT(A);
• grunt>DUMP X;
(1L)
(2L)
(1L)
(2L)
COUNT_STAR
• Computes the number of elements in a bag.
• COUNT_STAR includes NULL values in the count computation
(unlike COUNT, which ignores NULL values).
• Example
• In this example COUNT_STAR is used to count the tuples in a
bag.
• grunt>X = FOREACH B GENERATE COUNT_STAR(A);
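• The difference between COUNT and COUNT_STAR can be sketched in plain Python (not Pig code). The bag below is made up, and None stands in for Pig's null. COUNT skips a tuple when its first field is null, while COUNT_STAR counts every tuple.

```python
# Hypothetical bag; the second tuple has a null first field.
bag = [(1, 2, 3), (None, 2, 1), (4, 3, None)]

# COUNT skips tuples whose first field is null.
count = sum(1 for t in bag if t[0] is not None)

# COUNT_STAR counts every tuple in the bag.
count_star = len(bag)
print(count, count_star)
```

A null in a later field, as in the third tuple, does not affect either count.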
DIFF
• Compares two fields in a tuple.
• grunt> A = LOAD '/home/cloudera/data.txt' AS
(B1:bag{T1:tuple(t1:int,t2:int)},B2:bag{T2:tuple(f1:int,f2:int)});
• grunt> DUMP A;
({(8,9),(0,1)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(6,7),(3,7)},{(2,2),(3,7)})
• grunt> DESCRIBE A;
A: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}}
• grunt> X = FOREACH A GENERATE DIFF(B1,B2);
• grunt> dump X;
({(0,1),(1,1)})
({})
({(6,7),(2,2)})
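• The DIFF output above can be reproduced with a plain-Python sketch (not Pig code). It models each bag as a set, so repeated tuples inside a bag are not handled; that simplification is an assumption of the sketch.

```python
def diff(bag1, bag2):
    """Tuples that appear in exactly one of the two bags."""
    return set(bag1) ^ set(bag2)

# The three rows of relation A from the example: (B1, B2).
rows = [
    ([(8, 9), (0, 1)], [(8, 9), (1, 1)]),
    ([(2, 3), (4, 5)], [(2, 3), (4, 5)]),
    ([(6, 7), (3, 7)], [(2, 2), (3, 7)]),
]

X = [diff(b1, b2) for b1, b2 in rows]
print(X)
```

An empty result, as in the second row, means the two bags hold the same tuples.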
MAX
• Computes the maximum of the numeric values or chararrays in
a single-column bag. MAX requires a preceding GROUP ALL
statement for global maximums and a GROUP BY statement for
group maximums.
• Example
– In this example the maximum GPA for all terms is computed for each
student (see the GROUP operator for information about the field
names in relation B).
• grunt> A = LOAD '/home/cloudera/student.txt' AS (name:chararray, session:chararray, gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> X = FOREACH B GENERATE group, MAX(A.gpa);
• grunt> DUMP X;
(John,4.0F)
(Mary,4.0F)
MIN
• Computes the minimum of the numeric values or chararrays in
a single-column bag. MIN requires a preceding GROUP ALL
statement for global minimums and a GROUP BY statement
for group minimums.
• Example
– In this example the minimum GPA for all terms is computed for each
student (see the GROUP operator for information about the field
names in relation B).
• grunt> A = LOAD '/home/cloudera/student.txt' AS (name:chararray, session:chararray, gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> X = FOREACH B GENERATE group, MIN(A.gpa);
• grunt> DUMP X;
(John,3.7F)
(Mary,3.8F)
SIZE
• Computes the number of elements based on any Pig data type.

• Example
• In this example the number of characters in the first field is computed.

• grunt> A = LOAD 'data' as (f1:chararray, f2:chararray, f3:chararray);
• grunt> DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
• grunt> X = FOREACH A GENERATE SIZE(f1);
• grunt> DUMP X;
(6L)
(6L)
(3L)
SUM
• Computes the sum of the numeric values in a single-column
bag. SUM requires a preceding GROUP ALL statement for
global sums and a GROUP BY statement for group sums.
• Example
• In this example the number of pets is computed.
• grunt> A = LOAD '/home/cloudera/data' AS (owner:chararray, pet_type:chararray,
pet_num:int);
• grunt> DUMP A;
(Alice,turtle,1)
(Alice,goldfish,5)
(Alice,cat,2)
(Bob,dog,2)
(Bob,cat,2)
• grunt> B = GROUP A BY owner;
• grunt> DUMP B;
(Alice,{(Alice,turtle,1),(Alice,goldfish,5),(Alice,cat,2)})
(Bob,{(Bob,dog,2),(Bob,cat,2)})
• grunt> X = FOREACH B GENERATE group, SUM(A.pet_num);
• grunt> DUMP X;
(Alice,8L)
(Bob,4L)
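• The difference between a group sum (GROUP BY) and a global sum (GROUP ALL) can be sketched in plain Python (not Pig code), using the pet data from the example. The same pattern applies to MAX, MIN and AVG.

```python
from collections import defaultdict

# Relation A from the example: (owner, pet_type, pet_num).
A = [("Alice", "turtle", 1), ("Alice", "goldfish", 5), ("Alice", "cat", 2),
     ("Bob", "dog", 2), ("Bob", "cat", 2)]

# GROUP A BY owner, then SUM(A.pet_num) per group.
per_owner = defaultdict(int)
for owner, _pet_type, pet_num in A:
    per_owner[owner] += pet_num

# GROUP A ALL puts every tuple into a single bag, giving one global sum.
total = sum(pet_num for _owner, _pet_type, pet_num in A)
print(dict(per_owner), total)
```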
String Functions
• ENDSWITH
• STARTSWITH
• SUBSTRING
• EqualsIgnoreCase
• UPPER
• LOWER
• REPLACE
• TRIM, RTRIM, LTRIM
ENDSWITH , STARTSWITH
• ENDSWITH - This function accepts two string parameters. It
verifies whether the first string ends with the second.
• STARTSWITH - This function accepts two string parameters. It
verifies whether the first string starts with the second.
• emp.txt
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
• grunt> emp_endswith = FOREACH emp_data GENERATE
(id,name),ENDSWITH ( name, 'n' );
• grunt> Dump emp_endswith;
• grunt> startswith_data = FOREACH emp_data GENERATE (id,name),
STARTSWITH (name,'Ro');
• grunt> Dump startswith_data;
SUBSTRING()
• This function returns the substring of the given string that runs from startIndex up to, but not including, stopIndex.

• EMP.TXT
001,Robin,22,newyork
002,Stacy,25,Bhuwaneshwar
003,Kelly,22,Chennai
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);

• grunt> substring_data = FOREACH emp_data GENERATE (id,name),
SUBSTRING (name, 0, 3);

• grunt> Dump substring_data;

((1,Robin),Rob)
((2,Stacy),Sta)
((3,Kelly),Kel)
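• SUBSTRING takes a start index and a stop index, and the character at the stop index is not included. Python slices follow the same convention, so a plain-Python sketch (not Pig code) shows which indices give the three-letter output above.

```python
names = ["Robin", "Stacy", "Kelly"]

# Stop index 3 is exclusive: characters 0, 1 and 2 are returned.
three_chars = [n[0:3] for n in names]

# Stop index 2 would return only the first two characters.
two_chars = [n[0:2] for n in names]
print(three_chars, two_chars)
```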
EqualsIgnoreCase()
• The EqualsIgnoreCase() function is used to compare two
strings, ignoring case, and verify whether they are equal. If they are
equal, this function returns the Boolean value true; otherwise it returns
false.
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING PigStorage(',') as (id:int,
name:chararray, age:int, city:chararray);
• grunt> equals_data = FOREACH emp_data GENERATE (id,name), EqualsIgnoreCase(name,
'Robin');
• grunt> Dump equals_data;
((1,Robin),true)
((2,BOB),false)
((3,Maya),false)
((4,Sara),false)
((5,David),false)
((6,Maggy),false)
((7,Robert),false)
((8,Syam),false)
((9,Mary),false)
((10,Saran),false)
((11,Stacy),false)
((12,Kelly),false)
UPPER(), LOWER()
• UPPER- This function is used to convert all the characters in a
string to uppercase.
• LOWER- This function is used to convert all the characters in a
string to lowercase.
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);

• grunt> upper_data = FOREACH emp_data GENERATE (id,name),

UPPER(name);
• grunt> Dump upper_data;

• grunt> lower_data = FOREACH emp_data GENERATE (id,name),

LOWER(name);
• grunt> Dump lower_data;
REPLACE()
• This function is used to replace every occurrence of a given
substring (matched as a regular expression) with a new string.
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING PigStorage(',') as (id:int,
name:chararray, age:int, city:chararray);

• grunt> replace_data = FOREACH emp_data GENERATE
(id,city),REPLACE(city,'Bhuwaneshwar','Bhuw');

• grunt> Dump replace_data;

((1,newyork),newyork)
((2,Kolkata),Kolkata)
((3,Tokyo),Tokyo)
((4,London),London)
((5,Bhuwaneshwar),Bhuw)
((6,Chennai),Chennai)
((7,newyork),newyork)
((8,Kolkata),Kolkata)
((9,Tokyo),Tokyo)
((10,London),London)
((11,Bhuwaneshwar),Bhuw)
((12,Chennai),Chennai)
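• Pig's REPLACE treats its second argument as a regular expression. A plain-Python sketch (not Pig code) with re.sub mirrors the call above on three of the city values; for a plain literal like this one, Python's and Java's regular expressions behave the same way.

```python
import re

cities = ["newyork", "Kolkata", "Bhuwaneshwar"]

# Replace every match of the pattern; strings without a match pass through unchanged.
replaced = [re.sub("Bhuwaneshwar", "Bhuw", c) for c in cities]
print(replaced)
```

Because the pattern is a regular expression, characters such as '.' or '(' would need escaping to be matched literally.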
TRIM(), RTRIM(), LTRIM()
• The TRIM() function accepts a string and returns a copy of it with the
leading and trailing spaces removed.
• The LTRIM() function works like TRIM(), but removes spaces only
from the left side of the given string (leading spaces).
• The RTRIM() function works like TRIM(), but removes spaces only
from the right side of the given string (trailing spaces).
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING PigStorage(',') as
(id:int, name:chararray, age:int, city:chararray);

• grunt> trim_data = FOREACH emp_data GENERATE (id,name), TRIM(name);

• grunt> ltrim_data = FOREACH emp_data GENERATE (id,name), LTRIM(name);

• grunt> rtrim_data = FOREACH emp_data GENERATE (id,name), RTRIM(name);

• grunt> Dump trim_data;

• grunt> Dump ltrim_data;

• grunt> Dump rtrim_data;
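• The three trim functions differ only in which side they clean. A plain-Python sketch (not Pig code) with a made-up padded name; Python's strip, lstrip and rstrip are used here as stand-ins for TRIM, LTRIM and RTRIM.

```python
name = "  Robin  "

trim = name.strip()    # both sides, like TRIM
ltrim = name.lstrip()  # left side only, like LTRIM
rtrim = name.rstrip()  # right side only, like RTRIM
print(repr(trim), repr(ltrim), repr(rtrim))
```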

Date-time Functions
• ToDate()
• GetDay()
• GetMonth()
• GetYear()
ToDate()
• This function is used to generate a DateTime object according
to the given parameters.

• date.txt
001,1989/09/26 09:00:00
002,1980/06/20 10:22:00
003,1990/12/19 03:11:44
• grunt> date_data = LOAD '/home/cloudera/date.txt' USING
PigStorage(',') as (id:int,date:chararray);

• grunt> todate_data = foreach date_data generate
ToDate(date,'yyyy/MM/dd HH:mm:ss') as (date_time:DateTime);

• grunt> Dump todate_data;

(1989-09-26T09:00:00.000+05:30)
(1980-06-20T10:22:00.000+05:30)
(1990-12-19T03:11:44.000+05:30)
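• The format string passed to ToDate describes how the text is laid out. A plain-Python sketch (not Pig code) parses the same lines of date.txt with the strptime counterpart of 'yyyy/MM/dd HH:mm:ss'. The +05:30 offset in the output above presumably comes from the default time zone of the machine; the sketch does not model time zones.

```python
from datetime import datetime

# Lines of date.txt from the example.
lines = [
    "001,1989/09/26 09:00:00",
    "002,1980/06/20 10:22:00",
    "003,1990/12/19 03:11:44",
]

# strptime counterpart of the pattern 'yyyy/MM/dd HH:mm:ss'.
FORMAT = "%Y/%m/%d %H:%M:%S"

todate_data = [datetime.strptime(line.split(",", 1)[1], FORMAT) for line in lines]
for dt in todate_data:
    print(dt.isoformat(), dt.day, dt.month, dt.year)
```

The day, month and year attributes printed at the end correspond to what GetDay(), GetMonth() and GetYear() extract from a date-time value.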
GetDay()
• This function accepts a date-time object as a parameter and
returns the day of the month of the given date-time object.

• date.txt
001,1989/09/26 09:00:00
002,1980/06/20 10:22:00
003,1990/12/19 03:11:44
UDFs
User Defined Functions
• Apache Pig provides extensive support
for User Defined Functions (UDFs).
• Using these UDFs, we can define our own functions and use
them.
• UDF support is provided in six programming languages:
Java, Jython, Python, JavaScript, Ruby and Groovy.
Creating UDFs
• Open Eclipse and create a new project.
• Convert the newly created project into a Maven project.
• Copy the pom.xml. This file contains the Maven dependencies
for Apache Pig and Hadoop-core jar files.
Java code
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that converts its first argument to upper case.
public class Sample_Eval extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Pig passes the call's arguments as a tuple; return null for empty or null input.
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        String str = (String) input.get(0);
        return str.toUpperCase();
    }
}
Registering the Jar file
• grunt> REGISTER '/home/cloudera/sample_udf.jar';
• grunt> DEFINE sample_eval Sample_Eval();
• grunt> emp_data = LOAD '/home/cloudera/pigdata.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int,
city:chararray);
• grunt> Upper_case = FOREACH emp_data GENERATE
sample_eval(name);
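• What the UDF does to each row can be sketched in plain Python (not Pig code, and not a Pig Python UDF). The function mirrors the logic of the Java exec() method: the call's arguments arrive as a tuple, and the first field is upper-cased. The sample names are made up.

```python
def sample_eval(input_tuple):
    """Mirror of the exec() logic: upper-case the first field, or return None."""
    if input_tuple is None or len(input_tuple) == 0:
        return None
    return input_tuple[0].upper()

# One call per row, as in FOREACH emp_data GENERATE sample_eval(name).
names = ["Robin", "Stacy", "Kelly"]
upper_case = [sample_eval((name,)) for name in names]
print(upper_case)
```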
