Apache PIG
•Code reusability.
•Faster development
•Fewer lines of code
•Ideal for ETL operations.
• It allows a detailed, step-by-step specification of how the
data has to be transformed.
• Schema and type checking; it can handle data with
inconsistent schemas.
Pig Latin, Pig Engine, Pig script
Pig Latin:
•provides various operators that programmers can use to
develop their own functions for reading, writing, and
processing data.
Pig Engine:
•Pig Engine component of Pig accepts the Pig Latin scripts as
input and converts those scripts into MapReduce jobs.
Pig scripts:
•To analyze data using Apache Pig, programmers need to
write scripts using the Pig Latin language.
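For instance, a small word-count script might look like this (the input path and field names are assumed for illustration):
lines = LOAD 'data/input.txt' AS (line:chararray);
-- split each line into words; TOKENIZE returns a bag of words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
-- count the tuples in each group
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;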
Pig has two execution modes
Local Mode:
Pig runs in a single JVM and makes use of local file system.
This mode is suitable only for analysis of small data sets
using Pig
This mode is generally used for testing purposes.
HDFS Mode:
-In this mode, queries written in Pig Latin are translated into
MapReduce jobs and are run on a Hadoop cluster.
-MapReduce mode with a fully distributed cluster is useful for
running Pig on large data sets.
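How Pig is launched in each mode (a sketch; the script name is illustrative):
$ pig -x local                  # Grunt shell in local mode
$ pig -x mapreduce              # Grunt shell in MapReduce mode
$ pig -x local wordcount.pig    # run a script file in local mode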
Apache Pig Components
•Parser
-checks the syntax of the script, does type checking, and other
miscellaneous checks. The output of the parser is a DAG
(directed acyclic graph) of the Pig Latin statements and
logical operators.
•Optimizer
-carries out the logical optimizations, such as projection and
pushdown
•Compiler
-compiles the optimized logical plan into a series of
MapReduce jobs.
•Execution engine
- MapReduce jobs are executed on Hadoop producing the
desired results
Apache Pig Execution Modes
•Pig Latin statements can be executed using any of several
execution mechanisms, e.g. the Grunt interactive shell or a
Pig script file.
•The sh command invokes shell commands from the Grunt
shell.
Syntax
grunt> sh shell command parameters
Example
grunt> sh ls
PigStorage
DUMP input;
Very useful for debugging, but not so useful for huge
datasets.
Load and Store example
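A minimal load-and-store sketch (file names, delimiter, and schema are assumed for illustration):
-- load comma-delimited records using PigStorage
records = LOAD 'data/input.txt' USING PigStorage(',')
AS (id:int, name:chararray, score:int);
-- print the relation to the console for debugging
DUMP records;
-- write the relation back out, tab-delimited
STORE records INTO 'output/records' USING PigStorage('\t');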
•Scalar Types: int, long, float, double, chararray, bytearray,
boolean, datetime
•Complex Types: tuple, bag, map
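A sketch of a schema mixing scalar and complex types (file and fields assumed):
-- each record: an int id, a chararray name, and a map of properties
people = LOAD 'data/people.txt' USING PigStorage('\t')
AS (id:int, name:chararray, props:map[chararray]);
-- GROUP produces a complex field: a bag of tuples per group
by_name = GROUP people BY name;
DESCRIBE by_name;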
Filtering
•FILTER - To remove unwanted rows from a relation.
•DISTINCT - To remove duplicate rows from a relation.
•FOREACH, GENERATE - To generate data transformations
based on columns of data.
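A sketch combining these operators (file and schema assumed):
sales = LOAD 'data/sales.txt' USING PigStorage(',')
AS (id:int, city:chararray, amount:double);
big = FILTER sales BY amount > 100.0;   -- keep only large sales
cities = FOREACH big GENERATE city;     -- project the city column
uniq = DISTINCT cities;                 -- remove duplicate rows
DUMP uniq;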
Grouping and Joining
•JOIN - To join two or more relations.
•COGROUP - To group the data in two or more relations.
•GROUP - To group the data in a single relation.
•CROSS - To create the cross product of two or more
relations.
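A brief GROUP sketch (file and schema assumed; JOIN is shown in a later example):
orders = LOAD 'data/orders.txt' USING PigStorage(',')
AS (order_id:int, cust_id:int, total:double);
by_cust = GROUP orders BY cust_id;   -- one bag of orders per customer
totals = FOREACH by_cust GENERATE group AS cust_id, SUM(orders.total) AS spent;
DUMP totals;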
Sorting
•ORDER - To arrange a relation in a sorted order based on one
or more fields (ascending or descending).
•LIMIT - To get a limited number of tuples from a relation.
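A sketch of ORDER and LIMIT (file and schema assumed):
sales = LOAD 'data/sales.txt' USING PigStorage(',')
AS (id:int, city:chararray, amount:double);
sorted = ORDER sales BY amount DESC;   -- highest amounts first
top3 = LIMIT sorted 3;                 -- keep only the first three tuples
DUMP top3;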
Combining and Splitting
•UNION - To combine two or more relations into a single
relation.
•SPLIT - To split a single relation into two or more
relations.
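A sketch of UNION and SPLIT (files and schemas assumed):
a = LOAD 'data/part1.txt' USING PigStorage(',') AS (id:int, val:int);
b = LOAD 'data/part2.txt' USING PigStorage(',') AS (id:int, val:int);
combined = UNION a, b;   -- concatenate the two relations
SPLIT combined INTO small IF val < 10, large IF val >= 10;
DUMP small;
DUMP large;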
Diagnostic Operators
•DUMP - To print the contents of a relation on the console.
•DESCRIBE - To describe the schema of a relation.
•EXPLAIN - To view the logical, physical, or MapReduce
execution plans used to compute a relation.
•ILLUSTRATE - To view the step-by-step execution of a series
of statements.
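A sketch applying the diagnostic operators to one relation (file and schema assumed):
records = LOAD 'data/input.txt' USING PigStorage(',')
AS (id:int, name:chararray, score:int);
DESCRIBE records;     -- print the schema of the relation
EXPLAIN records;      -- show the logical, physical, and MapReduce plans
ILLUSTRATE records;   -- show a sample step-by-step execution
DUMP records;         -- print the contents to the console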
FOREACH
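The statements defining outerbag and innerbag are not shown here; a sketch of how they might be built (file and schema assumed):
data = LOAD 'data/data-bag.txt' USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
outerbag = GROUP data BY f1;                   -- a bag of tuples per key
innerbag = FOREACH outerbag GENERATE data.f3;  -- project an inner bag of f3 values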
DUMP outerbag;
DUMP innerbag;
FILTER
Selects tuples from a relation based on some condition
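A complete FILTER sketch (file, schema, and condition are assumed for illustration):
data = LOAD 'data/data-bag.txt' USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
filtered = FILTER data BY f1 > 1;   -- keep tuples whose first field exceeds 1
DUMP filtered;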
JOIN
Joins relations based on a field. Both inner and outer joins are
supported.
a = LOAD 'data/data-bag.txt'
USING PigStorage(',')
AS (f1:int, f2:int, f3:int);
DUMP a;
b = LOAD 'data/simple-tuples.txt'
USING PigStorage(',') AS (t1:int, t2:int);
DUMP b;
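The JOIN statement itself is not shown above; a sketch of how a and b might be joined (the join keys are assumed):
joined = JOIN a BY f1, b BY t1;                  -- inner join on f1 = t1
DUMP joined;
left_joined = JOIN a BY f1 LEFT OUTER, b BY t1;  -- keep unmatched tuples from a
DUMP left_joined;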
•As the Pig platform is designed for ETL-type use cases, it is
not a good choice for real-time scenarios.
•Apache Pig is not a good choice for pinpointing a single
record in huge data sets.
•Apache Pig is built on top of MapReduce, which is batch
processing oriented.
Is Pig script case sensitive?