Apache PIG

Apache Pig is a high-level data flow language that simplifies the process of analyzing large datasets in Hadoop using a SQL-like language called Pig Latin. It offers built-in operators for data operations, supports both structured and unstructured data, and converts scripts into MapReduce jobs for execution. Pig is advantageous for ETL operations due to its ease of programming, code reusability, and optimization capabilities, although it is not suited for real-time processing or pinpointing individual records in large datasets.


Apache Pig

•An abstraction over MapReduce.
•A platform used to analyze large sets of data.
•Pig is used with Hadoop.
•The language for Pig is Pig Latin.
•Pig scripts get internally converted to MapReduce jobs
and get executed on data stored in HDFS.
•Every task that can be achieved using Pig can also be
achieved by writing MapReduce code in Java.
Why Do We Need Apache Pig?

•Using Pig Latin, programmers can perform MapReduce
tasks easily without having to type complex code in Java.
•Pig Latin - SQL-like language.
•Apache Pig provides many built-in operators to support data
operations like joins, filters, ordering, etc.
•It also provides nested data types like tuples, bags, and maps that are
missing from MapReduce.
Features of Pig
•Rich set of operators − join, sort, filter, etc.
•Ease of programming − Pig Latin is similar to SQL.
•Optimization opportunities − The tasks in Apache Pig optimize
their execution automatically.
•Extensibility − Using the existing operators, users can develop
their own functions to read, process, and write data.
•Handles all kinds of data − both structured as well as
unstructured.
•It stores the results in HDFS.
•UDFs − Pig provides the facility to create User-defined
Functions in other programming languages as well.
Apache Pig Vs MapReduce

•Apache Pig is a data flow language.


•MapReduce is a data processing paradigm.

•Pig is a high level language.


•MapReduce is low level and rigid.

•Performing a Join operation in Apache Pig is pretty simple.
•In MapReduce, it is quite difficult to perform a Join operation
between datasets.
Apache Pig Vs MapReduce

•Apache Pig uses a multi-query approach, thereby reducing the
length of the code to a great extent.
•MapReduce will require almost 20 times the number
of lines to perform the same task.

•There is no need for compilation. On execution, every


Apache Pig operator is converted internally into a
MapReduce job.
•MapReduce jobs have a long compilation process.
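The multi-query point above can be sketched in Pig Latin: a single script can load data once and write several outputs, and Pig plans the whole script together into as few MapReduce jobs as possible. A minimal sketch, reusing the data-bag.txt file from the later examples (output paths are hypothetical):

```
-- One script, several outputs; Pig plans them together.
data   = LOAD 'data/data-bag.txt' USING PigStorage(',')
         AS (f1:int, f2:int, f3:int);
byf1   = GROUP data BY f1;
counts = FOREACH byf1 GENERATE group, COUNT(data);
small  = FILTER data BY f2 < 10;
STORE counts INTO 'data/output/counts' USING PigStorage(',');
STORE small  INTO 'data/output/small'  USING PigStorage(',');
```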
Apache Pig Vs Hive

•Pig Latin is a data flow language.


•HiveQL is a query processing language.

•Pig Latin is a procedural language and it fits in pipeline


paradigm.
•HiveQL is a declarative language.

•Apache Pig can handle structured, unstructured, and


semi-structured data.
•Hive is mostly for structured data.
Advantages of Pig

•Code reusability.
•Faster development
•Less number of lines of code
•Ideal for ETL operations.
• It allows a detailed step by step procedure by which the
data has to be transformed.
• Schema and type checking. It can handle inconsistent
schema data.
Pig Latin, Pig Engine, Pig script
Pig Latin:
•provides various operators with which programmers can
develop their own functions for reading, writing, and
processing data.

Pig Engine:
•Pig Engine component of Pig accepts the Pig Latin scripts as
input and converts those scripts into MapReduce jobs.

Pig scripts:
•To analyze data using Apache Pig, programmers need to
write scripts using Pig Latin language.
Pig has two execution modes

Local Mode:
-Pig runs in a single JVM and makes use of the local file system.
-This mode is suitable only for analysis of small data sets
using Pig.
-This mode is generally used for testing purposes.

HDFS Mode:
-In this mode, queries written in Pig Latin are translated into
MapReduce jobs and are run on a Hadoop cluster.
-MapReduce mode with a fully distributed cluster is useful for
running Pig on large data sets.
Apache Pig Components
•Parser
-checks the syntax of the script, does type checking, and other
miscellaneous checks. The output of the parser will be a DAG
•Optimizer
-carries out the logical optimizations
•Compiler
-compiles the optimized logical plan into a series of
MapReduce jobs.
•Execution engine
- MapReduce jobs are executed on Hadoop producing the
desired results
Apache Pig Execution Modes

• Interactive Mode (Grunt shell)

$ ./pig -x local
$ ./pig -x mapreduce

• Batch Mode (Script)

$ pig -x local Sample_script.pig
$ pig -x mapreduce Sample_script.pig

• Embedded Mode (UDF)
Why UDF?

•Do operations on more than one field


•Do more than grouping and filtering
•Programmer is comfortable
•Want to reuse existing logic

Traditionally UDF can be written only in Java. Now other


languages like Python are also supported.
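As a sketch of the UDF route, here is a minimal Python UDF of the kind Pig can call through Jython. The function name, schema string, and registration alias below are hypothetical; `pig_util.outputSchema` is only available when Pig runs the script, so a no-op fallback is included so the file can also be tested locally:

```python
# A minimal Python UDF for Pig (hypothetical example).
try:
    from pig_util import outputSchema  # provided by Pig's Jython runtime
except ImportError:
    # No-op fallback so this file can be imported outside of Pig.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

@outputSchema("upper:chararray")
def to_upper(s):
    """Return the input chararray in upper case (None stays None)."""
    return None if s is None else s.upper()
```

In a Pig script this would be registered and called roughly as `REGISTER 'myudfs.py' USING jython AS myudfs;` followed by `x = FOREACH data GENERATE myudfs.to_upper(f1);`.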
Apache Pig - Architecture

•Programmers write scripts in the Pig Latin language and
execute them using any of the execution mechanisms.

•After execution, these scripts will go through a series of


transformations applied by the Pig Framework, to produce
the desired output.

•Internally, Apache Pig converts these scripts into a series of


MapReduce jobs, and thus, it makes the programmer’s job
easy.
Pig Architecture (architecture diagram)
Shell Command in Pig

Syntax
grunt> sh shell command parameters
grunt> sh ls
PigStorage

•A built-in function of Pig


• PigStorage is used to load and store data in pig scripts.
• PigStorage can be used to parse text data with an arbitrary
delimiter or output data in a delimited format.
Viewing Data

DUMP input;

Very useful for debugging, but not so useful for huge
datasets.
Load and Store example

data = LOAD 'data/data-bag.txt'


USING PigStorage(',');

STORE data INTO 'data/output/load-store'


USING PigStorage('|');
Loading Data into Pig

file = LOAD '/data/dropbox-policy.txt' AS
(line);

data = LOAD '/data/tweets.csv' USING
PigStorage(',');

data = LOAD '/data/tweets.csv'
USING PigStorage(',')
AS (list, of, fields);
Storing Data from Pig

STORE data INTO 'output_location';

STORE data INTO 'output_location'


USING PigStorage();

STORE data INTO 'output_location'
USING PigStorage(',');

•Similar to `LOAD`, a lot of options are available.
•Can store locally or in HDFS.
Data Types used in Pig Latin

•Scalar Types
•Complex Types
Scalar Types

•int, long – (32, 64 bit) integer


•float, double – (32, 64 bit) floating point
•boolean (true/false)
•chararray (String in UTF-8)
•bytearray (blob) (DataByteArray in Java)
Complex Types

•tuple – ordered set of fields


•(data) bag – collection of tuples (NESTED)
•map – set of key value pairs
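As a small sketch of these types, Pig's built-in constructors TOTUPLE, TOBAG, and TOMAP can build complex values from scalar fields (reusing the data-bag.txt file from the later examples):

```
a = LOAD 'data/data-bag.txt' USING PigStorage(',')
    AS (f1:int, f2:int, f3:int);
b = FOREACH a GENERATE
        TOTUPLE(f1, f2)   AS t,   -- tuple: (f1, f2)
        TOBAG(f1, f2, f3) AS bg,  -- bag:   {(f1),(f2),(f3)}
        TOMAP('f1', f1)   AS m;   -- map:   [f1#value]
DUMP b;
```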
Schemas in Load statement

We can specify a schema to `LOAD` statements

data = LOAD '/data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
Pig Latin – Relational Operations
Loading and Storing
•LOAD - To Load the data from the file system (local/HDFS)
into a relation.
•STORE - To save a relation to the file system (local/HDFS).

Filtering
•FILTER - To remove unwanted rows from a relation.
•DISTINCT - To remove duplicate rows from a relation.
•FOREACH, GENERATE - To generate data transformations
based on columns of data.
Grouping and Joining
•JOIN - To join two or more relations.
•COGROUP - To group the data in two or more relations.
•GROUP - To group the data in a single relation.
•CROSS - To create the cross product of two or more
relations.

Sorting
•ORDER - To arrange a relation in a sorted order based on one or
more fields (ascending or descending).
•LIMIT - To get a limited number of tuples from a relation.
Combining and Splitting
•UNION - To combine two or more relations into a single
relation.
•SPLIT - To split a single relation into two or more
relations.
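SPLIT, which gets no worked example in the later slides, can be sketched as follows (the conditions and relation names are hypothetical):

```
data = LOAD 'data/data-bag.txt' USING PigStorage(',')
       AS (f1:int, f2:int, f3:int);
SPLIT data INTO small IF f1 < 5, large IF f1 >= 5;
DUMP small;
DUMP large;
```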

Diagnostic Operators
•DUMP To print the contents of a relation on the console.
•DESCRIBE To describe the schema of a relation.
•EXPLAIN To view the logical, physical, or MapReduce
execution plans to compute a relation.
•ILLUSTRATE To view the step-by-step execution of a series
of statements.
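A quick sketch of the diagnostic operators on one relation:

```
data = LOAD 'data/data-bag.txt' USING PigStorage(',')
       AS (f1:int, f2:int, f3:int);
DESCRIBE data;    -- prints the schema of the relation
EXPLAIN data;     -- shows the logical, physical, and MapReduce plans
ILLUSTRATE data;  -- walks sample rows through each statement
```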
FOREACH

Generates data transformations based on columns of data

x = FOREACH data GENERATE *;


x = FOREACH data GENERATE $0, $1;
x = FOREACH data GENERATE $0 AS first, $1
AS second;
GROUP
• Groups data in one or more relations
• Groups tuples that have the same group key
• Similar to SQL group by operator

outerbag = LOAD '/data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

DUMP outerbag;

innerbag = GROUP outerbag BY f1;

DUMP innerbag;
FILTER
Selects tuples from a relation based on some condition

data = LOAD 'data/data-bag.txt'


USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

DUMP data;

filtered = FILTER data BY f1 == 1;


DUMP filtered;
COUNT
Counts the number of tuples in a relation

data = LOAD 'data/data-bag.txt'


USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

grouped = GROUP data BY f2;

counted = FOREACH grouped GENERATE group,
COUNT(data);
DUMP counted;
ORDER BY
Sorts a relation based on one or more fields. Similar to SQL ORDER BY.

data = LOAD 'data/nested-sample.txt'


USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

DUMP data;

ordera = ORDER data BY f1 ASC;


DUMP ordera;

orderd = ORDER data BY f1 DESC;


DUMP orderd;
DISTINCT

Removes duplicates from a relation

data = LOAD 'data/data-bag.txt'


USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

DUMP data;

unique = DISTINCT data;


DUMP unique;
LIMIT

Limits the number of tuples in the output.

data = LOAD 'data/data-bag.txt'


USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

DUMP data;

limited = LIMIT data 3;


DUMP limited;
JOIN

Joins relations based on a field. Both inner and outer joins are
supported.
a = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

DUMP a;

b = LOAD 'data/simple-tuples.txt'
USING PigStorage(',') AS (t1:int, t2:int);
DUMP b;

joined = JOIN a by f1, b by t1;


DUMP joined;
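The outer-join variants can be sketched on the same relations a and b (relation names are hypothetical):

```
left_joined  = JOIN a BY f1 LEFT OUTER,  b BY t1;
right_joined = JOIN a BY f1 RIGHT OUTER, b BY t1;
full_joined  = JOIN a BY f1 FULL OUTER,  b BY t1;
```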
Pig Commands
(Using Pig's Grunt Shell Interface.)
• grunt> movies = LOAD 'Movies.txt' USING PigStorage(',') as (id:int, name:chararray, year:int,
rating:float, duration:int);
• grunt> dump movies;
• B = group movies all;
• C = FOREACH B GENERATE group, COUNT(movies);
• DUMP C;
• STORE C INTO '/OUTPUT_PIG' USING PigStorage(','); (The output directory should not already
exist in HDFS.)
• $ hadoop fs -ls /OUTPUT_PIG
• Found 2 items
• -rw-rw-rw- 1 bedrock supergroup 0 2015-07-31 10:30 /OUTPUT_PIG/_SUCCESS
• -rw-rw-rw- 1 bedrock supergroup 7 2015-07-31 10:30 /OUTPUT_PIG/part-r-00000
• [bedrock@cdh-5-2 ~]$ hadoop fs -cat /OUTPUT_PIG/part-r-00000
• all,10

Note: The text file should already exist on HDFS


Using Pig to get the difference between two
text files
• file1_set = LOAD '/home/bedrock/TEST_DATA/file1.txt' USING PigStorage(',') as (id:int, source_address:chararray, source_city:chararray, source_name:chararray, dest_address:chararray, dest_city:chararray, dest_name:chararray, label:float);
• file2_set = LOAD '/home/bedrock/TEST_DATA/file2.txt' USING PigStorage(',') as (id:int, source_address:chararray, source_city:chararray, source_name:chararray, dest_address:chararray, dest_city:chararray, dest_name:chararray, label:float);
• cogroup_set = COGROUP file1_set BY id, file2_set BY id;
• DUMP cogroup_set;
• diff_data = FOREACH cogroup_set GENERATE DIFF(file1_set, file2_set);
• DUMP diff_data;
Optimizing Pig Scripts

•Project early and often


•Filter early and often
•Drop nulls before a join
•Prefer DISTINCT over GROUP BY
•Use the right data structure
What are the limitations of the Pig?

•As the Pig platform is designed for ETL-type use cases, it is
not a good choice for real-time scenarios.
•Apache Pig is not a good choice for pinpointing a single
record in huge data sets.
•Apache Pig is built on top of MapReduce, which is batch
processing oriented.
Is Pig script case sensitive?

•Pig script is partly case sensitive and partly case insensitive.
•User-defined functions, field names, and relation names are
case sensitive: M = LOAD 'data' is not the same as M = LOAD
'Data'.
•Whereas Pig script keywords are case insensitive, i.e. LOAD is
the same as load.
