Pig Basics
Table of contents
1. Conventions
2. Reserved Keywords
3. Case Sensitivity
6. Relational Operators
UDF Statements
1. Conventions
Conventions for the syntax and code examples in the Pig Latin Reference Manual are
described here.
Convention — Description — Example
( ) — Parentheses enclose one or more items. Parentheses are also used to indicate the tuple data type. — Multiple items: (1, abc, (2,4,6) )
[ ] — Straight brackets enclose one or more optional items. Straight brackets are also used to indicate the map data type. — Optional items: [INNER | OUTER]
UPPERCASE / lowercase — In general, uppercase type indicates elements the system supplies; lowercase type indicates elements that you supply.
2. Reserved Keywords
Pig reserved keywords are listed here.
-- A
-- B
-- C
-- D
-- E
-- F
-- G
generate, group
-- H
help
-- I
-- J
join
-- K
kill
-- L
-- M
-- N
not, null
-- O
-- P
-- Q
quit
-- R
-- S
-- T
-- U
union, using
-- V, W, X, Y, Z
3. Case Sensitivity
The names (aliases) of relations and fields are case sensitive. The names of Pig Latin
functions are case sensitive. The names of parameters (see Parameter Substitution) and all
other Pig Latin keywords (see Reserved Keywords) are case insensitive.
In the example below, note the following:
The names (aliases) of relations A, B, and C are case sensitive.
The names (aliases) of fields f1, f2, and f3 are case sensitive.
Function names PigStorage and COUNT are case sensitive.
Keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE, and DUMP are
case insensitive. They can also be written as load, using, as, group, by, etc.
In the FOREACH statement, the field in relation B is referred to by positional notation
($0).
grunt> A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
grunt> B = GROUP A BY f1;
grunt> C = FOREACH B GENERATE COUNT ($0);
grunt> DUMP C;
Identifiers include the names of relations (aliases), fields, and variables. In Pig, identifiers start with a letter and can be followed by any number of letters, digits, or underscores.
Valid identifiers:
A
A123
abc_123_BeX_
Invalid identifiers:
_A123
abc_$
A!B
You can assign an alias to another alias. The new alias can be used in place of the original alias to refer to the original relation.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int,
gpa:float);
B = A;
DUMP B;
— First Field | Second Field | Third Field
Data type: chararray | int | float
Positional notation (generated by system): $0 | $1 | $2
Possible name: name | age | gpa
Field value: John | 18 | 4.0
As shown in this example when you assign names to fields (using the AS schema clause) you
can still refer to the fields using positional notation. However, for debugging purposes and
ease of comprehension, it is better to use field names.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int,
gpa:float);
X = FOREACH A GENERATE name,$2;
DUMP X;
(John,4.0F)
(Mary,3.8F)
(Bill,3.9F)
(Joe,3.8F)
In this example an error is generated because the requested column ($3) is outside of the
declared schema (positional notation begins with $0). Note that the error is caught before the
statements are executed.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
B = FOREACH A GENERATE $3;
DUMP B;
2009-01-21 23:03:46,715 [main] ERROR org.apache.pig.tools.grunt.GruntParser
- java.io.IOException:
Out of bound access. Trying to access non-existent : 3. Schema {f1:
bytearray,f2: bytearray,f3: bytearray} has 3 column(s).
etc ...
Simple Types — Description — Example
int — Signed 32-bit integer — 10
long — Signed 64-bit integer — Data: 10L or 10l; Display: 10L
float — 32-bit floating point — Data: 10.5F or 10.5f; Display: 10.5F
double — 64-bit floating point — Data: 10.5 or 10.5e2; Display: 10.5
chararray — Character array (string) in Unicode UTF-8 format — hello world
bytearray — Byte array (blob) —
boolean — boolean — true/false

Complex Types — Description — Example
tuple — An ordered set of fields. — (19,2)
bag — A collection of tuples. — {(19,2), (18,1)}
map — A set of key/value pairs. — [open#apache]
If you don't assign types, fields default to type bytearray and implicit conversions are applied depending on the context in which that data is used. For example, in relation B, f1 is converted to integer because 5 is an integer. In relation C, f1 and f2 are converted to double because we don't know the type of either f1 or f2.
A = LOAD 'data' AS (f1,f2,f3);
B = FOREACH A GENERATE f1 + 5;
C = FOREACH A generate f1 + f2;
If a schema is defined as part of a load statement, the load function will attempt to
enforce the schema. If the data does not conform to the schema, the loader will generate a
null value or an error.
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
If an explicit cast is not supported, an error will occur. For example, you cannot cast a
chararray to int.
If Pig cannot resolve incompatible types through implicit casts, an error will occur. For
example, you cannot add chararray and float (see the Types Table for addition and
subtraction).
4.3.2.2. Terms
( ) — A tuple is enclosed in parentheses.
field — A piece of data. A field can be any data type.
4.3.2.3. Usage
You can think of a tuple as a row with one or more fields, where each field can be any data
type and any field may or may not have data. If a field has no data, then the following
happens:
In a load statement, the loader will inject null into the tuple. The actual value that is
substituted for null is loader specific; for example, PigStorage substitutes an empty field
for null.
In a non-load statement, if a requested field is missing from a tuple, Pig will inject null.
Also see tuple schemas.
4.3.2.4. Example
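A minimal sketch of a tuple with three fields, consistent with the usage notes above (the relation and field names are illustrative):

```pig
-- Load a relation whose single column is a tuple of three fields
A = LOAD 'student' AS (t:tuple(name:chararray, age:int, gpa:float));
-- Any field of the tuple can be projected by name
X = FOREACH A GENERATE t.name;
```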
4.3.3. Bag
A bag is a collection of tuples.
4.3.3.1. Syntax: Inner bag
{ tuple [, tuple ] }
4.3.3.2. Terms
{ }
tuple
A tuple.
4.3.3.3. Usage
A bag can have tuples with differing numbers of fields. However, if Pig tries to access a
field that does not exist, a null value is substituted.
A bag can have tuples with fields that have different data types. However, for Pig to
effectively process bags, the schemas of the tuples within those bags should be the same.
For example, if half of the tuples include chararray fields while the other half include
float fields, only half of the tuples will participate in any kind of computation because the
chararray fields will be converted to null.
Bags have two forms: outer bag (or relation) and inner bag.
In this example A is a relation or bag of tuples. You can think of this bag as an outer bag.
A = LOAD 'data' as (f1:int, f2:int, f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
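Grouping the outer bag above produces inner bags; continuing from relation A:

```pig
B = GROUP A BY f1;  -- each tuple of B contains an inner bag of A's tuples
DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})
```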
4.3.4. Map
A map is a set of key/value pairs.
4.3.4.1. Syntax (<> denotes optional)
[ key#value <, key#value > ]
4.3.4.2. Terms
[ ] — A map is enclosed in straight brackets.
key — Must be a chararray data type. Must be a unique value within the map.
value — Any data type (defaults to bytearray).
4.3.4.3. Usage
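A minimal sketch of working with a map (the key name is illustrative):

```pig
-- Map keys are chararrays and must be unique within a map
A = LOAD 'data' AS (M:map[]);
X = FOREACH A GENERATE M#'open';  -- look up the value stored under key 'open'
```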
4.4. Nulls
In Pig Latin, nulls are implemented using the SQL definition of null as unknown or non-existent. Pig Latin operators and functions interact with nulls as shown here.

Operator — Interaction
Comparison operators: ==, !=, >, <, >=, <= — If either subexpression is null, the result is null.
Comparison operator: matches — If either the string being matched against or the string defining the match is null, the result is null.
Arithmetic operators: +, -, *, /, % modulo, ? : bincond — If either subexpression is null, the resulting expression is null.
Null operator: is null — If the tested value is null, returns true; otherwise, returns false.
Null operator: is not null — If the tested value is not null, returns true; otherwise, returns false.
Dereference operators (tuple or map dereference) — If the dereferenced tuple or map is null, returns null.
COUNT_STAR — This operator does not ignore nulls.
Cast operator — Casting a null from one type to another type results in a null.
Function: SIZE — If the tested object is null, returns null.
For Boolean subexpressions, note the results when nulls are used with these operators:
Page 13
FILTER operator — If a filter expression results in a null value, the filter does not pass it
through (if X is null, !X is also null, and the filter will reject both).
Bincond operator — If a Boolean subexpression results in a null value, the resulting
expression is null (see the interactions above for arithmetic operators).
In this example of an outer join, if the join key is missing from a table it is replaced by null.
A = LOAD 'student' AS (name: chararray, age: int, gpa: float);
B = LOAD 'votertab10k' AS (name: chararray, age: int, registration:
chararray, donation: float);
C = COGROUP A BY name, B BY name;
D = FOREACH C GENERATE FLATTEN((IsEmpty(A) ? null : A)),
FLATTEN((IsEmpty(B) ? null : B));
Like any other expression, null constants can be implicitly or explicitly cast.
In this example both a and null will be implicitly cast to double.
A = LOAD 'data' AS (a, b, c);
B = FOREACH A GENERATE a + null;
In this example both a and null will be cast to int, a implicitly, and null explicitly.
A = LOAD 'data' AS (a, b, c);
B = FOREACH A GENERATE a + (int)null;
When using the GROUP (COGROUP) operator with multiple relations, records with a null
group key are considered different and are grouped separately. In the example below note
that there are two tuples in the output corresponding to the null group key: one that contains
tuples from relation A (but not relation B) and one that contains tuples from relation B (but
not relation A).
A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'student' as (name:chararray, age:int, gpa:float);
dump B;
(joe,18,2.5)
(sam,,3.0)
(bob,,3.5)
X = cogroup A by age, B by age;
dump X;
(18,{(joe,18,2.5)},{(joe,18,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})
4.5. Constants
Pig provides constant representations for all data types except bytearrays.
Type — Constant Example — Notes
int — 19
long — 19L
float — 19.2F or 1.92e2f
double — 19.2 or 1.92e2
chararray — 'hello world'
bytearray — Not applicable.
boolean — true/false — Case insensitive.
tuple — (19, 2, 1)
bag — { (19, 2), (1, 2) }
map — [ 'name' # 'John', 'ext' # 5555 ]
The data type definitions for tuples, bags, and maps apply to constants:
A tuple can contain fields of any data type
A bag is a collection of tuples
A map key must be a scalar; a map value can be any data type
Complex constants (either with or without values) can be used in the same places scalar
constants can be used; that is, in FILTER and GENERATE statements.
A = LOAD 'data' USING MyStorage() AS (T: tuple(name:chararray, age: int));
B = FILTER A BY T == ('john', 25);
D = FOREACH B GENERATE T.name, [25#5.6], {(1, 5, 18)};
4.6. Expressions
In Pig Latin, expressions are language constructs used with the FILTER, FOREACH,
GROUP, and SPLIT operators as well as the eval functions.
Expressions are written in conventional mathematical infix notation and are adapted to the
UTF-8 character set. Depending on the context, expressions can include:
Any Pig data type (simple data types, complex data types)
Any Pig operator (arithmetic, comparison, null, boolean, dereference, sign, and cast)
Any Pig built-in function.
Any user-defined function (UDF) written in Java.
In Pig Latin,
An arithmetic expression could look like this:
X = GROUP A BY f2*f3;
A string expression could look like this, where a and b are both chararrays:
X = FOREACH A GENERATE CONCAT(a,b);
A common error when using the star expression is shown below. In this example, the
programmer really wants to count the number of elements in the bag in the second field:
COUNT($1).
G = GROUP A BY $0;
C = FOREACH G GENERATE COUNT(*)
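As the note above says, the intended statement projects the bag in the second field instead of using the star expression:

```pig
G = GROUP A BY $0;
C = FOREACH G GENERATE COUNT($1);  -- count the tuples in each group's bag
```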
There are some restrictions on use of the star expression when the input schema is unknown
(null):
For GROUP/COGROUP, you can't include a star expression in a GROUP BY column.
For ORDER BY, if you have project-star as an ORDER BY column, you can't have any
other ORDER BY column in that statement.
4.6.3. Project-Range Expressions
Project-range ( .. ) expressions can be used to project a range of columns from input. For
example:
.. $x : projects columns $0 through $x, inclusive
$x .. : projects columns $x through the end, inclusive
$x .. $y : projects columns $x through $y, inclusive
If the input relation has a schema, you can refer to columns by alias rather than by column position. You can also combine aliases and column positions in an expression; for example, "col1 .. $5" is valid.
Project-range can be used in all cases where the star expression ( * ) is allowed.
Project-range can be used in the following statements: FOREACH, JOIN, GROUP,
COGROUP, and ORDER BY (also when ORDER BY is used within a nested FOREACH
block).
A few examples are shown here:
.....
grunt> F = foreach IN generate (int)col0, col1 .. col3;
grunt> describe F;
F: {col0: int,col1: bytearray,col2: bytearray,col3: bytearray}
.....
.....
grunt> SORT = order IN by col2 .. col3, col0, col4 ..;
.....
.....
J = join IN1 by $0 .. $3, IN2 by $0 .. $3;
.....
.....
g = group l1 by b .. c;
.....
There are some restrictions on the use of the project-to-end form of project-range (e.g., "$x ..")
when the input schema is unknown (null):
For GROUP/COGROUP, the project-to-end form of project-range is not allowed.
For ORDER BY, the project-to-end form of project-range is supported only as the last
sort column.
.....
grunt> describe IN;
Schema for IN unknown.
/* This statement is supported */
SORT = order IN by $2 .. $3, $6 ..;
/* This statement is NOT supported */
SORT = order IN by $6 .., $2 .. $3;
.....
4.7. Schemas
Schemas enable you to assign names to fields and declare types for fields. Schemas are
optional but we encourage you to use them whenever possible; type declarations result in
better parse-time error checking and more efficient code execution.
Schemas for simple types and complex types can be used anywhere a schema definition is
appropriate.
Schemas are defined with the LOAD, STREAM, and FOREACH operators using the AS
clause. If you define a schema using the LOAD operator, then it is the load function that
enforces the schema (see LOAD and User Defined Functions for more information).
Known Schema Handling
Note the following:
You can define a schema that includes both the field name and field type.
You can define a schema that includes the field name only; in this case, the field type
defaults to bytearray.
You can choose not to define a schema; in this case, the field is un-named and the field
type defaults to bytearray.
If you assign a name to a field, you can refer to that field using the name or by positional
notation. If you don't assign a name to a field (the field is un-named) you can only refer to
the field using positional notation.
If you assign a type to a field, you can subsequently change the type using the cast operators.
If you don't assign a type to a field, the field defaults to bytearray; you can change the default
type using the cast operators.
If you do DESCRIBE on B, you will see a single column of type double. This is because Pig
makes the safest choice and uses the largest numeric type when the schema is not known. In
practice, the input data could contain integer values; however, Pig will cast the data to double
and make sure that a double result is returned.
If the schema of a relation can't be inferred, Pig will just use the runtime data as is and
propagate it through the pipeline.
4.7.1. Schemas with LOAD and STREAM
With LOAD and STREAM operators, the schema following the AS keyword must be
enclosed in parentheses.
In this example the LOAD statement includes a schema definition for simple data types.
A = LOAD 'data' AS (f1:int, f2:int);
4.7.2. Schemas with FOREACH
With FOREACH operators, the schema following the AS keyword must be enclosed in parentheses when the FLATTEN operator is used; otherwise, the schema should not be enclosed in parentheses.
In this example the FOREACH statement includes a schema for a simple expression.
X = FOREACH A GENERATE f1+f2 AS x1:int;
In this example the FOREACH statement includes schemas for multiple fields.
X = FOREACH A GENERATE f1 as user, f2 as age, f3 as gpa;
4.7.3.2. Terms
alias — The name assigned to the field.
type — (Optional) The simple data type assigned to the field. If no type is assigned, the field defaults to bytearray.
( , ) — Multiple fields are enclosed in parentheses and separated by commas.
4.7.3.3. Examples
In this example field "gpa" will default to bytearray because no type is declared.
cat student;
John 18 4.0
Mary 19 3.8
Bill 20 3.9
Joe 18 3.8
A = LOAD 'student' AS (name:chararray, age:int, gpa);
DESCRIBE A;
A: {name: chararray,age: int,gpa: bytearray}
DUMP A;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
4.7.5.2. Terms
alias — The name assigned to the tuple.
:tuple — (Optional) The data type, tuple (case insensitive).
( ) — The designation for a tuple, a set of parentheses.
alias[:type] — The constituent fields of the tuple, where the schema definition rules for the corresponding type apply.
4.7.5.3. Examples
In this example the schema defines one tuple. The load statements are equivalent.
cat data;
(3,8,9)
(1,4,7)
(2,5,8)
A = LOAD 'data' AS (T: tuple (f1:int, f2:int, f3:int));
A = LOAD 'data' AS (T: (f1:int, f2:int, f3:int));
DESCRIBE A;
A: {T: (f1: int,f2: int,f3: int)}
DUMP A;
((3,8,9))
((1,4,7))
((2,5,8))
4.7.6.2. Terms
alias — The name assigned to the bag.
:bag — (Optional) The data type, bag (case insensitive).
{ } — The designation for a bag, a set of curly brackets.
tuple — A tuple (see Tuple above).
4.7.6.3. Examples
In this example the schema defines a bag. The two load statements are equivalent.
cat data;
{(3,8,9)}
{(1,4,7)}
{(2,5,8)}
A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
A = LOAD 'data' AS (B: {T: (t1:int, t2:int, t3:int)});
DESCRIBE A;
A: {B: {T: (t1: int,t2: int,t3: int)}}
DUMP A;
({(3,8,9)})
({(1,4,7)})
({(2,5,8)})
4.7.7.2. Terms
alias — The name assigned to the map.
:map — (Optional) The data type, map (case insensitive).
[ ] — The designation for a map, a set of straight brackets.
type — (Optional) The data type of the map values (map keys are always type chararray). If no type is assigned, the value defaults to bytearray.
4.7.7.3. Examples
In this example the schema defines an untyped map (the map values default to bytearray).
The load statements are equivalent.
cat data;
[open#apache]
[apache#hadoop]
A = LOAD 'data' AS (M:map []);
A = LOAD 'data' AS (M:[]);
DESCRIBE A;
A: {M: map[]}
DUMP A;
([open#apache])
([apache#hadoop])
5.1. Arithmetic Operators
5.1.1. Description
Operator — Symbol — Notes
addition — +
subtraction — -
multiplication — *
division — /
modulo — % — Returns the remainder of a divided by b (a % b). Works with integral numbers (int, long).
bincond — ? : — (condition ? value_if_true : value_if_false) The bincond should be enclosed in parentheses. The schemas for the two conditional outputs of the bincond should match. Use expressions only (relational operators are not allowed).
5.1.1.1. Examples
In this example the modulo operator is used with fields f1 and f2.
X = FOREACH A GENERATE f1, f2, f1%f2;
DUMP X;
(10,1,0)
(10,3,1)
(10,6,4)
In this example the bincond operator is used with fields f2 and B. The condition is "f2 equals
1"; if the condition is true, return 1; if the condition is false, return the count of the number of
tuples in B.
X = FOREACH A GENERATE f2, (f2==1?1:COUNT(B));
DUMP X;
(1,1L)
(3,2L)
(6,3L)
5.1.2. Types Table: addition (+) and subtraction (-) operators
In summary: two numeric operands yield the larger of the two types (for example, int and long yield long; float and double yield double); a bytearray operand is cast as the other operand's numeric type, and two bytearrays are cast as double; bag, tuple, map, and chararray operands are errors.
5.1.3. Types Table: multiplication (*) and division (/) operators
In summary: as with addition and subtraction, two numeric operands yield the larger of the two types; a bytearray operand is cast as the other operand's type (two bytearrays are cast as double); bag, tuple, map, and chararray operands are errors (multiplying or dividing a bag, tuple, or map by a numeric type is not yet supported).

5.1.4. Types Table: modulo (%) operator
int % int yields int; long % long (and int % long) yields long; a bytearray operand is cast as int or long to match the other operand; all other type combinations are errors.
5.2. Boolean Operators
5.2.1. Description
Operator — Symbol
AND — and
OR — or
NOT — not
The result of a boolean expression (an expression that includes boolean and comparison
operators) is always of type boolean (true or false).
5.2.1.1. Example
X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1));
5.3. Cast Operators
5.3.1. Description
Pig Latin supports casts as summarized here: a bytearray can be cast to any data type; no data type can be cast to bytearray; numeric types can be cast among themselves and to chararray; casting a chararray or boolean to a numeric type is an error; and bag, tuple, and map values cannot be cast to other types.
5.3.1.1. Syntax
{(data_type) | (tuple(data_type)) | (bag{tuple(data_type)}) | (map[]) } field
5.3.1.2. Terms
(data_type)
field
5.3.1.3. Usage
Cast operators enable you to cast or convert data from one type to another, as long as
conversion is supported (see the table above). For example, suppose you have an integer
field, myint, which you want to convert to a string. You can cast this field from int to
chararray using (chararray)myint.
Please note the following:
A field can be explicitly cast. Once cast, the field remains that type (it is not
automatically cast back). In this example $0 is explicitly cast to int.
B = FOREACH A GENERATE (int)$0 + 1;
Where possible, Pig performs implicit casts. In this example $0 is cast to int (regardless
of underlying data) and $1 is cast to double.
When two bytearrays are used in arithmetic expressions or with built-in aggregate
functions (such as SUM) they are implicitly cast to double. If the underlying data is really
int or long, you'll get better performance by declaring the type or explicitly casting the
data.
Downcasts may cause loss of data. For example, casting from long to int may drop bits.
5.3.2. Examples
In this example an int is cast to type chararray (see relation X).
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = GROUP A BY f1;
DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
DESCRIBE B;
B: {group: int,A: {f1: int,f2: int,f3: int}}
X = FOREACH B GENERATE group, (chararray)COUNT(A) AS total;
DUMP X;
(1,1)
(4,2)
(7,1)
(8,2)
DESCRIBE X;
X: {group: int,total: chararray}
In this example a multi-field tuple is used. For the FILTER statement, Pig performs an
implicit cast. For the FOREACH statement, an explicit cast is used.
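A sketch consistent with that description (the relation, field names, and values are illustrative):

```pig
A = LOAD 'data' AS (t:tuple(a:bytearray, b:bytearray));
B = FILTER A BY t == ('john', 25);               -- implicit cast of the tuple's fields
C = FOREACH A GENERATE (tuple(chararray, int))t; -- explicit cast of the whole tuple
```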
5.4. Comparison Operators
5.4.1. Description
Operator — Symbol — Notes
equal — ==
not equal — !=
less than — <
greater than — >
less than or equal to — <=
greater than or equal to — >=
pattern matching — matches — Regular expression matching. Use the Java format for regular expressions.
String Example
Matches Example
X = FILTER A BY (f1 matches '.*apache.*');
5.4.2. Types Table: equal (==) operator
In summary: comparing two values of the same type (numeric, chararray, boolean, tuple, or map) yields boolean; a bytearray compared with a numeric, chararray, or boolean operand is cast as that operand's type; comparing bags, or operands of incompatible types, is an error.
Note 1: boolean. (Tuple A is equal to tuple B if they have the same size s, and for all 0 <= i < s, A[i] == B[i].)
Note 2: boolean. (Map A is equal to map B if A and B have the same number of entries, and for every key k1 in A with a value of v1, there is a key k2 in B with a value of v2, such that k1 == k2 and v1 == v2.)
5.4.4. Types Table: not equal (!=) operator
In summary: the not equal operator is defined for the same scalar type combinations as the equal operator; matching scalar types yield boolean, a bytearray operand is cast as the other operand's type, and bags, tuples, maps, and incompatible types are errors.

Types Table: matches operator
matches is defined for chararray and bytearray operands and yields boolean; other types are errors.
5.5. Type Construction Operators
5.5.1. Description
Operator — Symbol
tuple constructor — ( )
bag constructor — { }
map constructor — [ ]
Given this {($1), $2} Pig creates this {($1), ($2)} a bag with two tuples
... since ($1) is treated as $1 (one cannot create a single element tuple using this
syntax), {($1), $2} becomes {$1, $2} and Pig creates a tuple around each item
Given this {($1, $2)} Pig creates this {($1, $2)} a bag with a single tuple
... Pig creates a tuple ($1, $2) and then puts this tuple into the bag
5.5.2. Examples
Tuple Construction
A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate (name, age);
store B into 'results';
Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results):
(joe smith,20)
(amy chen,22)
(leo allen,18)
Bag Construction
A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate {(name, age)}, {name, age};
store B into 'results';
Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results):
{(joe smith,20)}    {(joe smith),(20)}
{(amy chen,22)}     {(amy chen),(22)}
{(leo allen,18)}    {(leo allen),(18)}
Map Construction
A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate [name, gpa];
store B into 'results';
Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results):
[joe smith#3.5]
[amy chen#3.2]
[leo allen#2.1]
5.6. Dereference Operators
5.6.1. Description
Operator — Symbol — Notes
tuple dereference — tuple.id or tuple.(id, …) — Dereferencing can be done by field name or by position (for example, mytuple.$0).
bag dereference — bag.id or bag.(id, …) — Dereferencing can be done by field name or by position (for example, mybag.$0).
map dereference — map#'key' — Dereferencing must be done by key.
5.6.2. Examples
Tuple Example
Suppose we have relation A.
A = LOAD 'data' AS (f1:int, f2:tuple(t1:int,t2:int,t3:int));
DUMP A;
(1,(1,2,3))
(2,(4,5,6))
(3,(7,8,9))
(4,(1,4,7))
(5,(2,5,8))
In this example dereferencing is used to retrieve two fields from tuple f2.
X = FOREACH A GENERATE f2.t1,f2.t3;
DUMP X;
(1,3)
(4,6)
(7,9)
(1,7)
(2,8)
Bag Example
Suppose we have relation B, formed by grouping relation A (see the GROUP operator for
information about the field names in relation B).
A = LOAD 'data' AS (f1:int, f2:int,f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = GROUP A BY f1;
DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
ILLUSTRATE B;
etc
----------------------------------------------------------
| B     | group: int | A: bag({f1: int,f2: int,f3: int}) |
----------------------------------------------------------
In this example dereferencing is used with relation X to project the first field (f1) of each
tuple in the bag (A).
X = FOREACH B GENERATE A.f1;
DUMP X;
({(1)})
({(4),(4)})
({(7)})
({(8),(8)})
Tuple/Bag Example
Suppose we have relation B, formed by grouping relation A (see the GROUP operator for
information about the field names in relation B).
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = GROUP A BY (f1,f2);
DUMP B;
((1,2),{(1,2,3)})
((4,2),{(4,2,1)})
((4,3),{(4,3,3)})
((7,2),{(7,2,5)})
((8,3),{(8,3,4)})
((8,4),{(8,4,3)})
ILLUSTRATE B;
etc
-------------------------------------------------------------------------------
| B     | group: tuple({f1: int,f2: int}) | A: bag({f1: int,f2: int,f3: int}) |
-------------------------------------------------------------------------------
|       | (8, 3)                          | {(8, 3, 4), (8, 3, 4)}            |
-------------------------------------------------------------------------------
In this example dereferencing is used to project a field (f1) from a tuple (group) and a field
(f1) from a bag (a).
X = FOREACH B GENERATE group.f1, a.f1;
DUMP X;
(1,{(1)})
(4,{(4)})
(4,{(4)})
(7,{(7)})
(8,{(8)})
(8,{(8)})
Map Example
Suppose we have relation A.
A = LOAD 'data' AS (f1:int, f2:map[]);
DUMP A;
(1,[open#apache])
(2,[apache#hadoop])
(3,[hadoop#pig])
(4,[pig#grunt])
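A sketch of map dereferencing, consistent with the data above (only the first map has the key 'open'; missing keys yield null, which DUMP prints as an empty field):

```pig
X = FOREACH A GENERATE f2#'open';  -- look up key 'open' in each map
DUMP X;
(apache)
()
()
()
```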
The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that
changes the structure of tuples and bags in a way that a UDF cannot. Flatten un-nests tuples
as well as bags. The idea is the same, but the operation and result is different for each type of
structure.
For tuples, flatten substitutes the fields of a tuple in place of the tuple. For example, consider
a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, flatten($1),
will cause that tuple to become (a, b, c).
For bags, the situation becomes more complicated. When we un-nest a bag, we create new
tuples. If we have a relation that is made up of tuples of the form ({(b,c),(d,e)}) and we apply
GENERATE flatten($0), we end up with two tuples (b,c) and (d,e). When we remove a level
of nesting in a bag, sometimes we cause a cross product to happen. For example, consider a
relation that has a tuple of the form (a, {(b,c), (d,e)}), commonly produced by the GROUP
operator. If we apply the expression GENERATE $0, flatten($1) to this tuple, we will create
new tuples: (a, b, c) and (a, d, e).
Also note that the flatten of empty bag will result in that row being discarded; no output is
generated. (See also Drop Nulls Before a Join.)
grunt> cat empty.bag
{}	1
grunt> A = LOAD 'empty.bag' AS (b : bag{}, i : int);
grunt> B = FOREACH A GENERATE flatten(b), i;
grunt> DUMP B;
grunt>
5.9. Null Operators
5.9.1. Description
Operator — Symbol
is null — is null
is not null — is not null
5.10. Sign Operators
5.10.1. Description
Operator — Symbol — Notes
positive — + — Has no effect.
negative (negation) — - — Changes the sign of a positive or negative number.
5.10.2. Examples
In this example, the negation operator is applied to the "x" values.
A = LOAD 'data' as (x, y, z);
B = FOREACH A GENERATE -x, y;
5.10.3. Types Table: negative (-) operator
bag — error
tuple — error
map — error
int — int
long — long
float — float
double — double
chararray — error
bytearray — double (cast as double)
6. Relational Operators
6.1. COGROUP
See the GROUP operator.
6.2. CROSS
Computes the cross product of two or more relations.
6.2.1. Syntax
alias = CROSS alias, alias [, alias ] [PARTITION BY partitioner] [PARALLEL n];
6.2.2. Terms
alias — The name of a relation.
PARTITION BY partitioner — Use this feature to specify the Hadoop Partitioner, which controls the partitioning of the keys of the intermediate map-outputs.
PARALLEL n — Increase the parallelism of a job by specifying the number of reduce tasks, n.
6.2.3. Usage
Use the CROSS operator to compute the cross product (Cartesian product) of two or more relations.
CROSS is an expensive operation and should be used sparingly.
6.2.4. Example
Suppose we have relations A and B.
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
B = LOAD 'data2' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
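Continuing the example, a sketch of the cross product of A and B (row order may vary):

```pig
X = CROSS A, B;
DUMP X;
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3)
```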
6.3. DEFINE
See:
DEFINE (UDFs, streaming)
DEFINE (macros)
6.4. DISTINCT
Removes duplicate tuples in a relation.
6.4.1. Syntax
alias = DISTINCT alias [PARTITION BY partitioner] [PARALLEL n];
6.4.2. Terms
alias — The name of a relation.
PARTITION BY partitioner — Use this feature to specify the Hadoop Partitioner.
PARALLEL n — Increase the parallelism of a job by specifying the number of reduce tasks, n.
6.4.3. Usage
Use the DISTINCT operator to remove duplicate tuples in a relation. DISTINCT does not
preserve the original order of the contents (to eliminate duplicates, Pig must first sort the
data). You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and a
nested block to first select the fields and then apply DISTINCT (see Example: Nested
Block).
6.4.4. Example
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
(8,3,4)
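Applying DISTINCT to relation A removes the duplicate tuples (DISTINCT sorts the data, so the output order differs from the input):

```pig
X = DISTINCT A;
DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)
```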
6.5. FILTER
Selects tuples from a relation based on some condition.
6.5.1. Syntax
alias = FILTER alias BY expression;
6.5.2. Terms
alias
BY
Required keyword.
expression
A boolean expression.
6.5.3. Usage
Use the FILTER operator to work with tuples or rows of data (if you want to work with
columns of data, use the FOREACH...GENERATE operation).
FILTER is commonly used to select the data that you want; or, conversely, to filter out
(remove) the data you don't want.
6.5.4. Examples
Suppose we have relation A.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example the condition states that if the third field equals 3, then include the tuple in
relation X.
X = FILTER A BY f3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
In this example the condition states that if the first field equals 8, or if the sum of fields f2 and
f3 is not greater than the first field, then include the tuple in relation X.
X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
DUMP X;
(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)
6.6. FOREACH
Generates data transformations based on columns of data.
6.6.1. Syntax
alias = FOREACH { block | nested_block };
6.6.2. Terms
alias — The name of a relation.
block — FOREACH…GENERATE block used with a relation (outer bag).
nested_block — Nested FOREACH…GENERATE block used with an inner bag. The block takes this general form:
alias = FOREACH nested_alias {
   alias = nested_op | nested_exp; [alias = nested_op | nested_exp; …]
   GENERATE expression [AS schema] [expression [AS schema]…];
};
Where:
The nested block is enclosed in opening and closing brackets { }.
The GENERATE keyword must be the last statement within the nested block.
See Schemas.
Macros are NOT allowed inside a nested block.
expression — An expression.
nested_alias — The name of the inner bag.
nested_op — An allowed nested operation (for example, DISTINCT, FILTER, LIMIT, ORDER BY, or FOREACH).
nested_exp — Any arbitrary, supported expression.
AS — Keyword.
schema — A schema using the AS keyword (see Schemas).
6.6.3. Usage
Use the FOREACH…GENERATE operation to work with columns of data (if you want to
work with tuples or rows of data, use the FILTER operation).
In this example two fields from relation A are projected to form relation X.
X = FOREACH A GENERATE a1, a2;
DUMP X;
(1,2)
(4,2)
(8,3)
(4,3)
(7,2)
(8,4)
(1,{(3)})
(4,{(6),(9)})
(8,{(9)})
Another FLATTEN example. Note that for the group '4' in C, there are two tuples in each
bag. Thus, when both bags are flattened, the cross product of these tuples is returned; that is,
tuples (4, 2, 6), (4, 3, 6), (4, 2, 9), and (4, 3, 9).
X = FOREACH C GENERATE FLATTEN(A.(a1, a2)), FLATTEN(B.$1);
DUMP X;
(1,2,3)
(4,2,6)
(4,2,9)
(4,3,6)
(4,3,9)
(8,3,9)
(8,4,9)
Another FLATTEN example. Here, relations A and B both have a column x. When forming
relation E, you need to use the :: operator to identify which column x to use - either relation
A column x (A::x) or relation B column x (B::x). This example uses relation A column x
(A::x).
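A sketch of that pattern (file names, field names, and values are illustrative):

```pig
A = LOAD 'data1' AS (x:int, y:int);
B = LOAD 'data2' AS (x:int, z:int);
C = COGROUP A BY x, B BY x;
D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
-- Both flattened bags contribute a column named x; qualify with :: to disambiguate
E = GROUP D BY A::x;
```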
This example shows a CROSS and FOREACH nested to the second level.
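A sketch of a CROSS nested inside a FOREACH block (relation, bag, and field names are illustrative; nested CROSS requires Pig 0.10 or later):

```pig
A = LOAD 'data' AS (id:int, b1:bag{t1:(x:int)}, b2:bag{t2:(y:int)});
D = FOREACH A {
   crossed = CROSS b1, b2;  -- nested cross of the two inner bags
   GENERATE id, crossed;
};
DUMP D;
```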
Suppose we have relations A and B. Note that relation B contains an inner bag.
A = LOAD 'data' AS (url:chararray,outlink:chararray);
DUMP A;
(www.ccc.com,www.hjk.com)
(www.ddd.com,www.xyz.org)
(www.aaa.com,www.cvn.org)
(www.www.com,www.kpt.net)
(www.www.com,www.xyz.org)
(www.ddd.com,www.xyz.org)
B = GROUP A BY url;
DUMP B;
(www.aaa.com,{(www.aaa.com,www.cvn.org)})
(www.ccc.com,{(www.ccc.com,www.hjk.com)})
(www.ddd.com,{(www.ddd.com,www.xyz.org),(www.ddd.com,www.xyz.org)})
(www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)})
In this example we perform two of the operations allowed in a nested block, FILTER and
DISTINCT. Note that the last statement in the nested block must be GENERATE. Also, note
the use of projection (PA = FA.outlink;) to retrieve a field. DISTINCT can be applied to a
subset of fields (as opposed to a relation) only within a nested block.
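A sketch of such a nested block, consistent with relations A and B above (the filter value is illustrative):

```pig
X = FOREACH B {
   FA = FILTER A BY outlink == 'www.xyz.org';  -- nested FILTER
   PA = FA.outlink;                            -- projection of a field
   DA = DISTINCT PA;                           -- nested DISTINCT
   GENERATE group, COUNT(DA);
};
DUMP X;
```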
6.7. GROUP
Groups the data in one or more relations.
Note: The GROUP and COGROUP operators are identical. Both operators work with one or
more relations. For readability GROUP is used in statements involving one relation and
COGROUP is used in statements involving two or more relations. You can COGROUP up to,
but no more than, 127 relations at a time.
6.7.1. Syntax
alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression ] [USING 'collected' | 'merge']
[PARTITION BY partitioner] [PARALLEL n];
6.7.2. Terms
alias
ALL
BY
expression
USING
Keyword
'collected'
'merge'
PARALLEL n
6.7.3. Usage
The GROUP operator groups together tuples that have the same group key (key field). The
key field will be a tuple if the group key has more than one field, otherwise it will be the
same type as that of the group key. The result of a GROUP operation is a relation that
includes one tuple per group. This tuple contains two fields:
The first field is named "group" (do not confuse this with the GROUP operator) and is
the same type as the group key.
The second field takes the name of the original relation and is type bag.
The names of both fields are generated by the system as shown in the example below.
Note the following about the GROUP/COGROUP and JOIN operators:
The GROUP and JOIN operators perform similar functions. GROUP creates a nested set
of output tuples while JOIN creates a flat set of output tuples
The GROUP/COGROUP and JOIN operators handle null values differently (see Nulls
and GROUP/COGROUP Operators).
6.7.4. Example
Suppose we have relation A.
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
Now, suppose we group relation A on field "age" to form relation B. We can use the
DESCRIBE and ILLUSTRATE operators to examine the structure of relation B. Relation B
has two fields. The first field is named "group" and is type int, the same as field "age" in
relation A. The second field is named "A" after relation A and is type bag.
B = GROUP A BY age;
DESCRIBE B;
B: {group: int, A: {name: chararray,age: int,gpa: float}}
ILLUSTRATE B;
etc ...
----------------------------------------------------------------------
| B     | group: int | A: bag({name: chararray,age: int,gpa: float}) |
----------------------------------------------------------------------
|       | 18         | {(John, 18, 4.0), (Joe, 18, 3.8)}             |
|       | 20         | {(Bill, 20, 3.9)}                             |
----------------------------------------------------------------------
DUMP B;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
Continuing on, as shown in these FOREACH statements, we can refer to the fields in relation
B by names "group" and "A" or by positional notation.
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;
(18,2L)
(19,1L)
(20,1L)
C = FOREACH B GENERATE $0, $1.name;
DUMP C;
(18,{(John),(Joe)})
(19,{(Mary)})
(20,{(Bill)})
6.7.5. Example
Suppose we have relation A.
A = LOAD 'data' as (f1:chararray, f2:int, f3:int);
DUMP A;
(r1,1,2)
(r2,2,1)
(r3,2,8)
(r4,4,4)
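With this data, a relation can also be grouped on an expression rather than a single field; a sketch (grouping on f2*f3 is an illustrative choice, not necessarily the original example):

```pig
-- the group key is the value of the expression f2*f3 for each tuple
X = GROUP A BY f2*f3;
DUMP X;
(2,{(r1,1,2),(r2,2,1)})
(16,{(r3,2,8),(r4,4,4)})
```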
6.7.6. Example
Suppose we have two relations, A and B.
A = LOAD 'data1' AS (owner:chararray,pet:chararray);
DUMP A;
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
(Bob,dog)
(Bob,cat)
B = LOAD 'data2' AS (friend1:chararray,friend2:chararray);
DUMP B;
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)
In this example tuples are co-grouped using field owner from relation A and field friend2
from relation B as the key fields. The DESCRIBE operator shows the schema for relation X,
which has two fields, "group" and "A" (see the GROUP operator for information about the
field names).
X = COGROUP A BY owner, B BY friend2;
DESCRIBE X;
X: {group: chararray,A: {owner: chararray,pet: chararray},B: {friend1:
chararray,friend2: chararray}}
Relation X looks like this. A tuple is created for each unique key field. The tuple includes the
key field and two bags. The first bag is the tuples from the first relation with the matching
key field. The second bag is the tuples from the second relation with the matching key field.
If no tuples match the key field, the bag is empty.
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
In this example tuples are co-grouped and the INNER keyword is used asymmetrically on
only one of the relations.
X = COGROUP A BY owner, B BY friend2 INNER;
DUMP X;
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
6.7.7. Example
This example shows how to group using multiple keys.
A = LOAD 'allresults' USING PigStorage() AS (tcid:int, tpid:int,
date:chararray, result:chararray, tsid:int, tag:chararray);
B = GROUP A BY (tcid, tpid);
6.8. IMPORT
See IMPORT (macros)
6.9. JOIN (inner)
Performs an inner join of two or more relations based on common field values.
6.9.2. Terms
alias
BY
Keyword
expression
A field expression.
Example: X = JOIN A BY fieldA, B BY fieldB, C
BY fieldC;
USING
Keyword
'replicated'
'skewed'
'merge'
'merge-sparse'
PARTITION BY partitioner
PARALLEL n
6.9.3. Usage
Use the JOIN operator to perform an inner, equijoin of two or more relations based on
common field values. The JOIN operator always performs an inner join. Inner joins ignore
null keys, so it makes sense to filter them out before the join.
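A sketch of that pre-filtering (the relation and field names here are illustrative, not taken from the original examples):

```pig
A = LOAD 'data1' AS (id:int, v:chararray);
B = LOAD 'data2' AS (id:int, w:chararray);
-- drop records whose join key is null before joining; an inner join
-- would discard them anyway, this just avoids the extra work
A2 = FILTER A BY id IS NOT NULL;
B2 = FILTER B BY id IS NOT NULL;
C = JOIN A2 BY id, B2 BY id;
```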
Note the following about the GROUP/COGROUP and JOIN operators:
The GROUP and JOIN operators perform similar functions. GROUP creates a nested set
of output tuples while JOIN creates a flat set of output tuples.
The GROUP/COGROUP and JOIN operators handle null values differently (see Nulls
and JOIN Operator).
Self Joins
To perform self joins in Pig load the same data multiple times, under different aliases, to
avoid naming conflicts.
In this example the same data is loaded twice using aliases A and B.
grunt> A = load 'mydata';
grunt> B = load 'mydata';
grunt> C = join A by $0, B by $0;
grunt> explain C;
6.9.4. Example
Suppose we have relations A and B.
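A minimal inner-join sketch in the spirit of this section (the data, file names, and aliases are assumptions):

```pig
A = LOAD 'data1' AS (id:int, name:chararray);
B = LOAD 'data2' AS (id:int, qty:int);
-- inner equijoin on the shared id field; each output tuple is the
-- concatenation of a matching tuple from A and a matching tuple from B
C = JOIN A BY id, B BY id;
DUMP C;
```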
6.10. JOIN (outer)
Performs an outer join of two relations based on common field values.
6.10.2. Terms
alias
alias-column
Keyword
LEFT
RIGHT
FULL
OUTER
(Optional) Keyword
USING
Keyword
'replicated'
'skewed'
'merge'
PARTITION BY partitioner
PARALLEL n
6.10.3. Usage
Use the JOIN operator with the corresponding keywords to perform left, right, or full outer
joins. The keyword OUTER is optional for outer joins; the keywords LEFT, RIGHT and
FULL will imply left outer, right outer and full outer joins respectively when OUTER is
omitted. The Pig Latin syntax closely adheres to the SQL standard.
Please note the following:
Outer joins will only work provided the relations which need to produce nulls (in the case
of non-matching keys) have schemas.
Outer joins will only work for two-way joins; to perform a multi-way outer join, you will
need to perform multiple two-way outer join statements.
6.10.4. Examples
This example shows a left outer join.
A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A by $0 LEFT OUTER, B BY $0;
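Right and full outer joins follow the same pattern; a sketch using the same assumed inputs a.txt and b.txt:

```pig
A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
-- keep non-matching rows from the right side, or from both sides;
-- the OUTER keyword could be omitted in either statement
D = JOIN A BY $0 RIGHT OUTER, B BY $0;
E = JOIN A BY $0 FULL OUTER, B BY $0;
```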
6.11. LIMIT
Limits the number of output tuples.
6.11.1. Syntax
alias = LIMIT alias n;
6.11.2. Terms
alias
6.11.3. Usage
Use the LIMIT operator to limit the number of output tuples.
If the specified number of output tuples is equal to or exceeds the number of tuples in the
relation, all tuples in the relation are returned.
If the specified number of output tuples is less than the number of tuples in the relation, then
n tuples are returned. There is no guarantee which n tuples will be returned, and the tuples
that are returned can change from one run to the next. A particular set of tuples can be
requested using the ORDER operator followed by LIMIT.
Note: The LIMIT operator allows Pig to avoid processing all tuples in a relation. In most
cases a query that uses LIMIT will run more efficiently than an identical query that does not
use LIMIT. It is always a good idea to use limit if you can.
6.11.4. Examples
In this example the limit is expressed as a scalar.
a = load 'a.txt';
b = group a all;
c = foreach b generate COUNT(a) as sum;
d = order a by $0;
e = limit d c.sum/100;
In this example output is limited to 3 tuples. Note that there is no guarantee which three
tuples will be output.
X = LIMIT A 3;
DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
In this example the ORDER operator is used to order the tuples and the LIMIT operator is
used to output the first three tuples.
B = ORDER A BY f1 DESC, f2 ASC;
DUMP B;
(8,3,4)
(8,4,3)
(7,2,5)
(4,2,1)
(4,3,3)
(1,2,3)
X = LIMIT B 3;
DUMP X;
(8,3,4)
(8,4,3)
(7,2,5)
6.12. LOAD
Loads data from the file system.
6.12.1. Syntax
LOAD 'data' [USING function] [AS schema];
6.12.2. Terms
'data'
USING
Keyword.
If the USING clause is omitted, the default load
function PigStorage is used.
function
AS
Keyword.
schema
6.12.3. Usage
Use the LOAD operator to load data from the file system.
6.12.4. Examples
Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are
newline-separated.
1 2 3
4 2 1
8 3 4
In this example the default load function, PigStorage, loads data from myfile.txt to form
relation A. The two LOAD statements are equivalent. Note that, because no schema is
specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
In this example a schema is specified using the AS keyword. The two LOAD statements are
equivalent. You can use the DESCRIBE and ILLUSTRATE operators to view the schema.
A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);
A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
DESCRIBE A;
a: {f1: int,f2: int,f3: int}
ILLUSTRATE A;
---------------------------------------------------------
| a     | f1: bytearray | f2: bytearray | f3: bytearray |
---------------------------------------------------------
|       | 4             | 2             | 1             |
---------------------------------------------------------
---------------------------------------
| a     | f1: int | f2: int | f3: int |
---------------------------------------
|       | 4       | 2       | 1       |
---------------------------------------
For examples of how to specify more complex schemas for use with the LOAD operator, see
Schemas for Complex Data Types and Schemas for Multiple Types.
6.13. MAPREDUCE
Executes native MapReduce jobs inside a Pig script.
6.13.1. Syntax
alias1 = MAPREDUCE 'mr.jar' STORE alias2 INTO 'inputLocation' USING storeFunc LOAD
'outputLocation' USING loadFunc AS schema [`params, ... `];
6.13.2. Terms
alias1, alias2
mr.jar
See STORE
Store alias2 into the inputLocation using storeFunc,
which is then used by the MapReduce job to read its
data.
See LOAD
After running mr.jar's MapReduce job, load back
the data from outputLocation into alias1 using
loadFunc as schema.
`params, ...`
6.13.3. Usage
Use the MAPREDUCE operator to run native MapReduce jobs from inside a Pig script.
The input and output locations for the MapReduce program are conveyed to Pig using the
STORE/LOAD clauses. Pig, however, does not pass this information (nor require that this
information be passed) to the MapReduce program. If you want to pass the input and output
locations to the MapReduce program you can use the params clause or you can hardcode the
locations in the MapReduce program.
6.13.4. Example
This example demonstrates how to run the wordcount MapReduce program from Pig. Note
that the files specified as input and output locations in the MAPREDUCE statement will
NOT be deleted by Pig automatically. You will need to delete them manually.
A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir'
AS (word:chararray, count: int) `org.myorg.WordCount inputDir
outputDir`;
6.14. ORDER BY
Sorts a relation based on one or more fields.
6.14.1. Syntax
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] ] }
[PARALLEL n];
6.14.2. Terms
alias
field_alias
ASC
DESC
PARALLEL n
6.14.3. Usage
Note: ORDER BY is NOT stable; if multiple records have the same ORDER BY key, the
order in which these records are returned is not defined and is not guaranteed to be the same
from one run to the next.
In Pig, relations are unordered (see Relations, Bags, Tuples, Fields):
If you order relation A to produce relation X (X = ORDER A BY * DESC;) relations A
and X still contain the same data.
If you retrieve relation X (DUMP X;) the data is guaranteed to be in the order you
specified (descending).
However, if you further process relation X (Y = FILTER X BY $0 > 1;) there is no
guarantee that the data will be processed in the order you originally specified
(descending).
Pig currently supports ordering on fields with simple types or by tuple designator (*). You
cannot order on fields with complex types or by expressions.
A = LOAD 'mydata' AS (x: int, y: map[]);
B = ORDER A BY x; -- this is allowed because x is a simple type
B = ORDER A BY y; -- this is not allowed because y is a complex type
B = ORDER A BY y#'id'; -- this is not allowed because y#'id' is an
expression
6.14.4. Examples
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example relation A is sorted by the third field, a3, in descending order. Note that the
order of the three tuples ending in 3 can vary.
X = ORDER A BY a3 DESC;
DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
6.15. SAMPLE
Selects a random sample of data based on the specified sample size.
6.15.1. Syntax
SAMPLE alias size;
6.15.2. Terms
alias
size
6.15.3. Usage
Use the SAMPLE operator to select a random data sample with the stated sample size.
SAMPLE is a probabilistic operator; there is no guarantee that the exact same number of
tuples will be returned for a particular sample size each time the operator is used.
6.15.4. Example
In this example relation X will contain 1% of the data in relation A.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
X = SAMPLE A 0.01;
In this example, a scalar expression is used (it will sample approximately 1000 records from
the input).
a = load 'a.txt';
b = group a all;
c = foreach b generate COUNT(a) as num_rows;
e = sample a 1000/num_rows;
6.16. SPLIT
Partitions a relation into two or more relations.
6.16.1. Syntax
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression ] [, alias OTHERWISE];
6.16.2. Terms
alias
INTO
Required keyword.
IF
Required keyword.
expression
An expression.
OTHERWISE
6.16.3. Usage
Use the SPLIT operator to partition the contents of a relation into two or more relations based
on some expression. Depending on the conditions stated in the expression:
A tuple may be assigned to more than one relation.
A tuple may not be assigned to any relation.
6.16.4. Example
In this example relation A is split into three relations, X, Y, and Z.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3)
(7,8,9)
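The OTHERWISE clause shown in the syntax above catches everything the preceding conditions miss; a sketch using the same relation A (the aliases V and W are assumptions):

```pig
-- tuples failing f1<7 land in W via OTHERWISE
SPLIT A INTO V IF f1<7, W OTHERWISE;
DUMP V;
(1,2,3)
(4,5,6)
DUMP W;
(7,8,9)
```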
6.16.5. Example
In this example, the SPLIT and FILTER statements are essentially equivalent. However,
because SPLIT is implemented as "split the data stream and then apply filters" the SPLIT
statement is more expensive than the FILTER statement because Pig needs to filter and store
two data streams.
SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF
(field1 is null);
-- where ignored_var is not used elsewhere
output_var = FILTER input_var BY (field1 is not null);
6.17. STORE
Stores or saves results to the file system.
6.17.1. Syntax
STORE alias INTO 'directory' [USING function];
6.17.2. Terms
alias
INTO
Required keyword.
'directory'
USING
function
6.17.3. Usage
Use the STORE operator to run (execute) Pig Latin statements and save (persist) results to
the file system. Use STORE for production scripts and batch mode processing.
Note: To debug scripts during development, you can use DUMP to check intermediate
results.
6.17.4. Examples
In this example data is stored using PigStorage and the asterisk character (*) as the field
delimiter.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
STORE A INTO 'myoutput' USING PigStorage ('*');
CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
In this example, the CONCAT function is used to format the data before it is stored.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = FOREACH A GENERATE CONCAT('a:',(chararray)a1),
CONCAT('b:',(chararray)a2), CONCAT('c:',(chararray)a3);
DUMP B;
(a:1,b:2,c:3)
(a:4,b:2,c:1)
(a:8,b:3,c:4)
(a:4,b:3,c:3)
(a:7,b:2,c:5)
(a:8,b:4,c:3)
STORE B INTO 'myoutput' using PigStorage(',');
CAT myoutput;
a:1,b:2,c:3
a:4,b:2,c:1
a:8,b:3,c:4
a:4,b:3,c:3
a:7,b:2,c:5
a:8,b:4,c:3
6.18. STREAM
Sends data to an external script or program.
6.18.1. Syntax
alias = STREAM alias [, alias ] THROUGH {`command` | cmd_alias } [AS schema] ;
6.18.2. Terms
alias
THROUGH
Keyword.
`command`
cmd_alias
AS
Keyword.
schema
6.18.3. Usage
Use the STREAM operator to send data through an external script or program. Multiple
stream operators can appear in the same Pig script. The stream operators can be adjacent to
each other or have other operations in between.
When used with a command, a stream statement could look like this:
A = LOAD 'data';
B = STREAM A THROUGH `stream.pl -n 5`;
When used with a cmd_alias, a stream statement could look like this, where mycmd is the
defined alias.
DEFINE mycmd `stream.pl -n 5`;
A = LOAD 'data';
B = STREAM A THROUGH mycmd;
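The optional AS clause attaches a schema to the streamed output; a sketch (the script name and field names are assumptions):

```pig
A = LOAD 'data';
-- the external script's tab-separated output is given named, typed fields
B = STREAM A THROUGH `stream.pl` AS (f1:int, f2:chararray);
```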
C = FOREACH B {
D = ORDER A BY ($3, $4);
GENERATE D;
}
E = STREAM C THROUGH `stream.pl`;
6.19. UNION
Computes the union of two or more relations.
6.19.1. Syntax
alias = UNION [ONSCHEMA] alias, alias [, alias ];
6.19.2. Terms
alias
ONSCHEMA
6.19.3. Usage
Use the UNION operator to merge the contents of two or more relations. The UNION
operator:
Does not preserve the order of tuples. Both the input and output relations are interpreted
as unordered bags of tuples.
Does not ensure (as databases do) that all tuples adhere to the same schema or that they
have the same number of fields. In a typical scenario, however, this should be the case;
therefore, it is the user's responsibility to either (1) ensure that the tuples in the input
relations have the same schema or (2) be able to process varying tuples in the output
relation.
Schema Behavior
The behavior of schemas for UNION (positional notation / data types) and UNION
ONSCHEMA (named fields / data types) is the same, except where noted.
Union on relations with two different sizes results in a null schema (union only):
A: (a1:long, a2:long)
B: (b1:long, b2:long, b3:long)
A union B: null
Union columns of compatible type will produce an "escalate" type. The priority is:
double > float > long > int > bytearray
tuple|bag|map|chararray > bytearray
A: (a1:int, a2:bytearray, a3:int)
B: (b1:float, b2:chararray, b3:bytearray)
A union B: (a1:float, a2:chararray, a3:int)
The alias of the first relation is always taken as the alias of the unioned relation field.
6.19.4. Example
In this example the union of relation A and B is computed.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
B = LOAD 'data' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
X = UNION A, B;
DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
(1,3)
6.19.5. Example
This example shows the use of ONSCHEMA.
L1 = LOAD 'f1' AS (a : int, b : float);
DUMP L1;
(11,12.0)
(21,22.0)
L2 = LOAD 'f2' AS (a : long, c : chararray);
DUMP L2;
(11,a)
(12,b)
(13,c)
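The union itself can then be taken with ONSCHEMA, which matches fields by name across the inputs; a sketch (the alias U is an assumption):

```pig
U = UNION ONSCHEMA L1, L2;
-- fields are merged by name; each tuple keeps nulls for fields its
-- source relation did not define
DUMP U;
```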
7. UDF Statements
7.1. DEFINE (UDFs, streaming)
Assigns an alias to a UDF or streaming command.
7.1.1. Syntax: UDF and streaming
DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };
7.1.2. Terms
alias
function
`command`
input
output
ship
USING
Keyword.
deserializer
PigStreaming is the default deserializer.
cache
stderr
7.1.3. Usage
Use the DEFINE statement to assign a name (alias) to a UDF function or to a streaming
command.
Use DEFINE to specify a UDF function when:
The function has a long package name that you don't want to include in a script,
especially if you call the function several times in that script.
The constructor for the function takes string parameters. If you need to use different
constructor parameters for different calls to the function you will need to create multiple
defines, one for each parameter set.
Serialization is needed to convert data from tuples to a format that can be processed by the
streaming application. Deserialization is needed to convert the output from the streaming
application back into tuples. PigStreaming is the default serialization/deserialization function.
Streaming uses the same default format as PigStorage to serialize/deserialize the data. If you
want to explicitly specify a format, you can do it as shown below (see more examples in the
Examples: Input/Output section).
DEFINE CMD `perl PigStreaming.pl - nameMap` input(stdin using
PigStreaming(',')) output(stdout using PigStreaming(','));
A = LOAD 'file';
B = STREAM A THROUGH CMD;
If you need an alternative format, you will need to create a custom serializer/deserializer by
implementing the following interfaces.
interface PigToStream {

    /**
     * Given a tuple, produce an array of bytes to be passed to the
     * streaming executable.
     */
    public byte[] serialize(Tuple t) throws IOException;
}

interface StreamToPig {

    /**
     * Given a byte array from a streaming executable, produce a tuple.
     */
    public Tuple deserialize(byte[] bytes) throws IOException;

    /**
     * This will be called on the front end during planning and not on
     * the back end during execution.
     *
     * @return the {@link LoadCaster} associated with this object.
     * @throws IOException if there is an exception during LoadCaster
     */
    public LoadCaster getLoadCaster() throws IOException;
}
Use the ship option to send streaming binary and supporting files, if any, from the client node
to the compute nodes. Pig does not automatically ship dependencies; it is your responsibility
to explicitly specify all the dependencies and to make sure that the software the processing
relies on (for instance, perl or python) is installed on the cluster. Supporting files are shipped
to the task's current working directory and only relative paths should be specified. Any
pre-installed binaries should be specified in the PATH.
Only files, not directories, can be specified with the ship option. One way to work around
this limitation is to tar all the dependencies into a tar file that accurately reflects the structure
needed on the compute nodes, then have a wrapper for your script that un-tars the
dependencies prior to execution.
Note that the ship option has two components: the source specification, provided in the ship(
) clause, is the view of your machine; the command specification is the view of the actual
cluster. The only guarantee is that the shipped files are available in the current working
directory of the launched job and that your current working directory is also on the PATH
environment variable.
Shipping files to relative paths or absolute paths is not supported since you might not have
permission to read/write/execute from arbitrary paths on the clusters.
Note the following:
It is safe only to ship files to be executed from the current working directory on the task
on the cluster.
OP = stream IP through `script`;
or
DEFINE CMD `script` ship('/a/b/script');
OP = stream IP through CMD;
Shipping files to relative paths or absolute paths is undefined and mostly will fail since
you may not have permissions to read/write/execute from arbitrary paths on the actual
clusters.
The ship option works with binaries, jars, and small datasets. However, loading larger
datasets at run time for every execution can severely impact performance. Instead, use the
cache option to access large files already moved to and available on the compute nodes. Only
files, not directories, can be specified with the cache option.
7.1.3.4. About Auto-Ship
If the ship and cache options are not specified, Pig will attempt to auto-ship the binary in the
following way:
If the first word of the streaming command is perl or python, Pig assumes that the binary
is the first non-quoted string it encounters that does not start with a dash.
Otherwise, Pig will attempt to ship the first string from the command line as long as it
does not come from /bin, /usr/bin, /usr/local/bin. Pig will determine this
by scanning the path if an absolute path is provided or by executing which. The paths
can be made configurable using the set stream.skippath option (you can use multiple set
commands to specify more than one path to skip).
If you don't supply a DEFINE for a given streaming command, then auto-shipping is turned
off.
Note the following:
If Pig determines that it needs to auto-ship an absolute path it will not ship it at all since
there is no way to ship files to the necessary location (lack of permissions and so on).
OP = stream IP through `/a/b/c/script`;
or
OP = stream IP through `perl /a/b/c/script.pl`;
Pig will not auto-ship files in the following system directories (this is determined by
executing 'which <file>' command).
/bin /usr/bin /usr/local/bin /sbin /usr/sbin /usr/local/sbin
To auto-ship, the file in question should be present in the PATH. So if the file is in the
current working directory then the current working directory should be in the PATH.
In this example user defined serialization/deserialization functions are used with the script.
DEFINE Y `stream.pl` INPUT(stdin USING MySerializer) OUTPUT (stdout USING
MyDeserializer);
X = STREAM A THROUGH Y;
In this example cache is used to specify a file located on the cluster compute nodes.
DEFINE Y `stream.pl data.gz` SHIP('/work/stream.pl')
CACHE('/input/data.gz#data.gz');
X = STREAM A THROUGH Y;
In this example a function is defined for use with the FOREACH GENERATE operator.
REGISTER /src/myfunc.jar
DEFINE myFunc myfunc.MyEvalfunc('foo');
A = LOAD 'students';
B = FOREACH A GENERATE myFunc($0);
7.2. REGISTER
Registers a JAR file so that the UDFs in the file can be used.
7.2.1. Syntax
REGISTER path;
7.2.2. Terms
path
7.2.3. Usage
Pig Scripts
Use the REGISTER statement inside a Pig script to specify a JAR file or a Python/JavaScript
module. Pig supports JAR files and modules stored in local file systems as well as remote,
distributed file systems such as HDFS and Amazon S3 (see Pig Scripts).
Additionally, JAR files stored in local file systems can be specified as a glob pattern using
*. Pig will search for matching jars in the local file system, either the relative path (relative
to your working directory) or the absolute path. Pig will pick up all JARs that match the glob.
Command Line
You can register additional files (to use with your Pig script) via the command line using the
-Dpig.additional.jars option. For more information see User Defined Functions.
7.2.4. Examples
In this example REGISTER states that the JavaScript module, myfunc.js, is located in the
/src directory.
/src $ java -jar pig.jar
REGISTER /src/myfunc.js;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
In this example additional JAR files are registered via the command line.
pig -Dpig.additional.jars=my.jar:your.jar script.pig
This example shows how to specify a glob pattern using either a relative path or an absolute
path.
register /homes/user/pig/myfunc*.jar
register count*.jar
register jars/*.jar