Apache PIG

Apache Pig is a high-level data flow language that simplifies the process of analyzing large datasets in Hadoop using a SQL-like language called Pig Latin. It offers built-in operators for data operations, supports both structured and unstructured data, and converts scripts into MapReduce jobs for execution. Pig is advantageous for ETL operations due to its ease of programming, code reusability, and optimization capabilities, although it is not suited for real-time processing or pinpointing individual records in large datasets.


Apache Pig

•An abstraction over MapReduce.
•A platform used to analyze large sets of data.
•Pig is used with Hadoop.
•The language for Pig is Pig Latin.
•Pig scripts get internally converted to MapReduce jobs
and get executed on data stored in HDFS.
•Every task that can be achieved using Pig can also be
achieved by writing MapReduce code in Java.
Why Do We Need Apache Pig?

•Using Pig Latin, programmers can perform MapReduce
tasks easily without having to type complex code in Java.
•Pig Latin - SQL-like language.
•Apache Pig provides many built-in operators to support data
operations like joins, filters, ordering, etc.
•It also provides nested data types like tuples, bags, and maps that are
missing from MapReduce.
Features of Pig
•Rich set of operators − join, sort, filter, etc.
•Ease of programming − Pig Latin is similar to SQL.
•Optimization opportunities − The tasks in Apache Pig optimize
their execution automatically.
•Extensibility − Using the existing operators, users can develop
their own functions to read, process, and write data.
•Handles all kinds of data − both structured as well as
unstructured.
•It stores the results in HDFS.
•UDFs − Pig provides the facility to create User-defined
Functions in other programming languages as well.
Apache Pig Vs MapReduce

•Apache Pig is a data flow language.


•MapReduce is a data processing paradigm.

•Pig is a high level language.


•MapReduce is low level and rigid.

•Performing a Join operation in Apache Pig is pretty simple.
•In MapReduce, it is quite difficult to perform a Join operation
between datasets.
Apache Pig Vs MapReduce

•Apache Pig uses a multi-query approach, thereby reducing the
length of the code to a great extent.
•MapReduce will require almost 20 times the number
of lines to perform the same task.

•There is no need for compilation. On execution, every


Apache Pig operator is converted internally into a
MapReduce job.
•MapReduce jobs have a long compilation process.
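The multi-query point above can be sketched in Pig Latin: a single script can load data once and write several outputs, and Pig plans the whole script together into as few MapReduce jobs as possible. A minimal sketch, reusing the data-bag.txt file from the later examples (output paths are hypothetical):

```
-- One script, several outputs; Pig plans them together.
data   = LOAD 'data/data-bag.txt' USING PigStorage(',')
         AS (f1:int, f2:int, f3:int);
byf1   = GROUP data BY f1;
counts = FOREACH byf1 GENERATE group, COUNT(data);
small  = FILTER data BY f2 < 10;
STORE counts INTO 'data/output/counts' USING PigStorage(',');
STORE small  INTO 'data/output/small'  USING PigStorage(',');
```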
Apache Pig Vs Hive

•Pig Latin is a data flow language.


•HiveQL is a query processing language.

•Pig Latin is a procedural language and it fits in pipeline


paradigm.
•HiveQL is a declarative language.

•Apache Pig can handle structured, unstructured, and


semi-structured data.
•Hive is mostly for structured data.
Advantages of Pig

•Code reusability.
•Faster development
•Less number of lines of code
•Ideal for ETL operations.
• It allows a detailed step by step procedure by which the
data has to be transformed.
• Schema and type checking. It can handle inconsistent
schema data.
Pig Latin, Pig Engine, Pig script
Pig Latin:
•provides various operators with which programmers can
develop their own functions for reading, writing, and
processing data.

Pig Engine:
•Pig Engine component of Pig accepts the Pig Latin scripts as
input and converts those scripts into MapReduce jobs.

Pig scripts:
•To analyze data using Apache Pig, programmers need to
write scripts using Pig Latin language.
Pig has two execution modes

Local Mode:
-Pig runs in a single JVM and makes use of the local file system.
-This mode is suitable only for analysis of small data sets
using Pig.
-This mode is generally used for testing purposes.

HDFS Mode:
-In this mode, queries written in Pig Latin are translated into
MapReduce jobs and are run on a Hadoop cluster.
-MapReduce mode with a fully distributed cluster is useful for
running Pig on large data sets.
Apache Pig Components
•Parser
-checks the syntax of the script, does type checking, and other
miscellaneous checks. The output of the parser will be a DAG
•Optimizer
-carries out the logical optimizations
•Compiler
-compiles the optimized logical plan into a series of
MapReduce jobs.
•Execution engine
- MapReduce jobs are executed on Hadoop producing the
desired results
Apache Pig Execution Modes

• Interactive Mode (Grunt shell)

$ ./pig -x local
$ ./pig -x mapreduce

• Batch Mode (Script)

$ pig -x local Sample_script.pig
$ pig -x mapreduce Sample_script.pig

• Embedded Mode (UDF)
Why UDF?

•Do operations on more than one field


•Do more than grouping and filtering
•Programmer is comfortable
•Want to reuse existing logic

Traditionally UDF can be written only in Java. Now other


languages like Python are also supported.
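As a sketch of the UDF route, here is a minimal Python UDF of the kind Pig can call through Jython. The function name, schema string, and registration alias below are hypothetical; `pig_util.outputSchema` is only available when Pig runs the script, so a no-op fallback is included so the file can also be tested locally:

```python
# A minimal Python UDF for Pig (hypothetical example).
try:
    from pig_util import outputSchema  # provided by Pig's Jython runtime
except ImportError:
    # No-op fallback so this file can be imported outside of Pig.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

@outputSchema("upper:chararray")
def to_upper(s):
    """Return the input chararray in upper case (None stays None)."""
    return None if s is None else s.upper()
```

In a Pig script this would be registered and called roughly as `REGISTER 'myudfs.py' USING jython AS myudfs;` followed by `x = FOREACH data GENERATE myudfs.to_upper(f1);`.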
Apache Pig - Architecture

•Programmers write scripts in the Pig Latin language and
execute them using any of the execution mechanisms.

•After execution, these scripts will go through a series of


transformations applied by the Pig Framework, to produce
the desired output.

•Internally, Apache Pig converts these scripts into a series of


MapReduce jobs, and thus, it makes the programmer’s job
easy.
Pig Architecture (architecture diagram)
Shell Command in Pig

Syntax
grunt> sh shell command parameters
grunt> sh ls
PigStorage

•A built-in function of Pig


• PigStorage is used to load and store data in pig scripts.
• PigStorage can be used to parse text data with an arbitrary
delimiter or output data in a delimited format.
Viewing Data

DUMP input;

Very useful for debugging, but not so useful for huge
datasets.
Load and Store example

data = LOAD 'data/data-bag.txt'


USING PigStorage(',');

STORE data INTO 'data/output/load-store'


USING PigStorage('|');
Loading Data into Pig

file = LOAD '/data/dropbox-policy.txt' AS
(line);

data = LOAD '/data/tweets.csv' USING
PigStorage(',');

data = LOAD '/data/tweets.csv'
USING PigStorage(',')
AS (list, of, fields);
Storing Data from Pig

STORE data INTO 'output_location';

STORE data INTO 'output_location'


USING PigStorage();

STORE data INTO 'output_location'
USING PigStorage(',');

•Similar to `LOAD`, a lot of options are available.
•Can store locally or in HDFS.
Data Types used in Pig Latin

•Scalar Types
•Complex Types
Scalar Types

•int, long – (32, 64 bit) integer


•float, double – (32, 64 bit) floating point
•boolean (true/false)
•chararray (String in UTF-8)
•bytearray (blob) (DataByteArray in Java)
Complex Types

•tuple – ordered set of fields


•(data) bag – collection of tuples (NESTED)
•map – set of key value pairs
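As a small sketch of these types, Pig's built-in constructors TOTUPLE, TOBAG, and TOMAP can build complex values from scalar fields (reusing the data-bag.txt file from the later examples):

```
a = LOAD 'data/data-bag.txt' USING PigStorage(',')
    AS (f1:int, f2:int, f3:int);
b = FOREACH a GENERATE
        TOTUPLE(f1, f2)   AS t,   -- tuple: (f1, f2)
        TOBAG(f1, f2, f3) AS bg,  -- bag:   {(f1),(f2),(f3)}
        TOMAP('f1', f1)   AS m;   -- map:   [f1#value]
DUMP b;
```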
Schemas in Load statement

We can specify a schema to `LOAD` statements

data = LOAD '/data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
Pig Latin – Relational Operations
Loading and Storing
•LOAD - To Load the data from the file system (local/HDFS)
into a relation.
•STORE - To save a relation to the file system (local/HDFS).

Filtering
•FILTER - To remove unwanted rows from a relation.
•DISTINCT - To remove duplicate rows from a relation.
•FOREACH, GENERATE - To generate data transformations
based on columns of data.
Grouping and Joining
•JOIN - To join two or more relations.
•COGROUP - To group the data in two or more relations.
•GROUP - To group the data in a single relation.
•CROSS - To create the cross product of two or more
relations.

Sorting
•ORDER - To arrange a relation in a sorted order based on one or
more fields (ascending or descending).
•LIMIT - To get a limited number of tuples from a relation.
Combining and Splitting
•UNION - To combine two or more relations into a single
relation.
•SPLIT - To split a single relation into two or more
relations.
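SPLIT, which gets no worked example in the later slides, can be sketched as follows (the conditions and relation names are hypothetical):

```
data = LOAD 'data/data-bag.txt' USING PigStorage(',')
       AS (f1:int, f2:int, f3:int);
SPLIT data INTO small IF f1 < 5, large IF f1 >= 5;
DUMP small;
DUMP large;
```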

Diagnostic Operators
•DUMP To print the contents of a relation on the console.
•DESCRIBE To describe the schema of a relation.
•EXPLAIN To view the logical, physical, or MapReduce
execution plans to compute a relation.
•ILLUSTRATE To view the step-by-step execution of a series
of statements.
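A quick sketch of the diagnostic operators on one relation:

```
data = LOAD 'data/data-bag.txt' USING PigStorage(',')
       AS (f1:int, f2:int, f3:int);
DESCRIBE data;    -- prints the schema of the relation
EXPLAIN data;     -- shows the logical, physical, and MapReduce plans
ILLUSTRATE data;  -- walks sample rows through each statement
```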
FOREACH

Generates data transformations based on columns of data

x = FOREACH data GENERATE *;


x = FOREACH data GENERATE $0, $1;
x = FOREACH data GENERATE $0 AS first, $1
AS second;
GROUP
• Groups data in one or more relations
• Groups tuples that have the same group key
• Similar to SQL group by operator

outerbag = LOAD '/data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

DUMP outerbag;

innerbag = GROUP outerbag BY f1;

DUMP innerbag;
FILTER
Selects tuples from a relation based on some condition

data = LOAD 'data/data-bag.txt'


USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

DUMP data;

filtered = FILTER data BY f1 == 1;


DUMP filtered;
COUNT
Counts the number of tuples in a relation

data = LOAD 'data/data-bag.txt'


USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

grouped = GROUP data BY f2;

counted = FOREACH grouped GENERATE group,
COUNT(data);
DUMP counted;
ORDER BY
Sorts a relation based on one or more fields. Similar to SQL ORDER BY.

data = LOAD 'data/nested-sample.txt'


USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

DUMP data;

ordera = ORDER data BY f1 ASC;


DUMP ordera;

orderd = ORDER data BY f1 DESC;


DUMP orderd;
DISTINCT

Removes duplicates from a relation

data = LOAD 'data/data-bag.txt'


USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

DUMP data;

unique = DISTINCT data;


DUMP unique;
LIMIT

Limits the number of tuples in the output.

data = LOAD 'data/data-bag.txt'


USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

DUMP data;

limited = LIMIT data 3;


DUMP limited;
JOIN

Joins relations based on a field. Both inner and outer joins are
supported.
a = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);

DUMP a;

b = LOAD 'data/simple-tuples.txt'
USING PigStorage(',') AS (t1:int, t2:int);
DUMP b;

joined = JOIN a by f1, b by t1;


DUMP joined;
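The outer-join variants can be sketched on the same relations a and b (relation names are hypothetical):

```
left_joined  = JOIN a BY f1 LEFT OUTER,  b BY t1;
right_joined = JOIN a BY f1 RIGHT OUTER, b BY t1;
full_joined  = JOIN a BY f1 FULL OUTER,  b BY t1;
```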
Pig Commands
(Using Pig's Grunt Shell Interface.)
• grunt> movies = LOAD 'Movies.txt' USING PigStorage(',') as (id:int, name:chararray, year:int,
rating:float, duration:int);
• grunt> dump movies;
• B = group movies all;
• C = FOREACH B GENERATE group, COUNT(movies);
• DUMP C;
• STORE C INTO '/OUTPUT_PIG' USING PigStorage(','); (The output directory should not already
exist in HDFS.)
• $ hadoop fs -ls /OUTPUT_PIG
• Found 2 items
• -rw-rw-rw- 1 bedrock supergroup 0 2015-07-31 10:30 /OUTPUT_PIG/_SUCCESS
• -rw-rw-rw- 1 bedrock supergroup 7 2015-07-31 10:30 /OUTPUT_PIG/part-r-00000
• [bedrock@cdh-5-2 ~]$ hadoop fs -cat /OUTPUT_PIG/part-r-00000
• all,10

Note: The text file should already exist on HDFS


Using Pig to get the difference between two
text files
• file1_set = LOAD '/home/bedrock/TEST_DATA/file1.txt' USING PigStorage(',') as (id:int, source_address:chararray, source_city:chararray, source_name:chararray, dest_address:chararray, dest_city:chararray, dest_name:chararray, label:float);
• file2_set = LOAD '/home/bedrock/TEST_DATA/file2.txt' USING PigStorage(',') as (id:int, source_address:chararray, source_city:chararray, source_name:chararray, dest_address:chararray, dest_city:chararray, dest_name:chararray, label:float);
• cogroup_set = COGROUP file1_set BY id, file2_set BY id;
• DUMP cogroup_set;
• diff_data = FOREACH cogroup_set GENERATE DIFF(file1_set, file2_set);
• DUMP diff_data;
Optimizing Pig Scripts

•Project early and often


•Filter early and often
•Drop nulls before a join
•Prefer DISTINCT over GROUP BY
•Use the right data structure
What are the limitations of the Pig?

•As the Pig platform is designed for ETL-type use cases, it is
not a good choice for real-time scenarios.
•Apache Pig is not a good choice for pinpointing a single
record in huge data sets.
•Apache Pig is built on top of MapReduce, which is batch
processing oriented.
Is Pig script case sensitive?

•Pig script is partly case sensitive and partly case insensitive.
•User-defined functions, field names, and relation names are
case sensitive: M = LOAD 'data' is not the same as M = LOAD
'Data'.
•Whereas Pig script keywords are case insensitive, i.e. LOAD is
the same as load.
