Pig
Pig vs. MapReduce
• Performing a Join operation in Apache Pig is pretty simple (see the sketch below), whereas it is quite difficult to perform a Join operation between datasets in MapReduce.
• Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent; MapReduce requires almost 20 times more lines of code to perform the same task.
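For instance, a join of two datasets is a single Pig Latin statement. A minimal sketch, assuming two hypothetical tab-delimited input files of customers and orders:
customers = LOAD 'input/customers.txt' AS (id:int, name:chararray);
orders    = LOAD 'input/orders.txt'    AS (order_id:int, customer_id:int, amount:double);
-- equi-join the two relations on the customer id
joined    = JOIN customers BY id, orders BY customer_id;
DUMP joined;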
Pig vs. SQL
• The data model in Apache Pig is nested relational, whereas the data model used in SQL is flat relational.
• Tuple: A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS. Example − (Raja, 30)
• Bag: A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type. Example − {(Raja, 30), (Mohammad, 45)}
• A bag can be a field in a relation; in that context, it is known as an inner bag. Example − {Raja, 30, {9848022338, [email protected]}}
• Map: A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value can be of any type. A map is represented by ‘[]’. Example − [name#Raja, age#30]
• Relation: A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order). A sketch combining these types in a schema follows this list.
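A sketch of how these types can appear together in a Pig Latin schema, assuming a hypothetical tab-delimited input file:
-- each field of the relation uses one of the types above: atoms, a bag, and a map
people = LOAD 'input/people.txt'
         AS (name:chararray,
             age:int,
             contacts:bag{t:(phone:chararray, email:chararray)},
             details:map[]);
DUMP people;   -- each line of output is a tuple; the relation itself is a bag of tuples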
COMPARISON WITH DATABASES
There are several differences between Pig and relational database management systems (RDBMSs).
● Pig Latin is a data flow programming language, whereas SQL is a declarative programming
language.
● A Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a
single transformation. By contrast, SQL statements are a set of constraints that, taken together,
define the output.
● RDBMSs store data in tables, with tightly predefined schemas. Pig is more relaxed about the data that it processes: you can define a schema at runtime, but it’s optional. It will operate on any source of tuples, such as a text file with tab-separated fields.
● In Pig, there is no data import process as there is with an RDBMS: the data is loaded from the filesystem (usually HDFS) as the first step in the processing.
● Pig’s support for complex, nested data structures further differentiates it from SQL, which operates on flatter data structures. Also, Pig’s ability to use UDFs and streaming operators that are tightly integrated with the language and Pig’s nested data structures makes Pig Latin more customizable than most SQL dialects.
● RDBMSs have several features to support online, low-latency queries, such as transactions and indexes, that are absent in Pig. Pig does not support random reads or writes.
Pig Latin
A Pig Latin program consists of a collection of statements. A statement can be thought of
as an operation or a command.
For example, a GROUP operation is a type of statement:
grouped_records = GROUP records BY year;
The command to list the files in a
Hadoop filesystem is another example of a statement:
ls /
❖ Statements that have to be terminated with a semicolon can be split across multiple lines for readability.
❖ Double hyphens (--) introduce single-line comments; everything from the first hyphen to the end of the line is ignored:
DUMP A; -- What's in A?
❖ C-style comments are more flexible since they delimit the beginning and end of the comment block with /* and */ markers. They can span lines or be embedded in a single line:
/*
 * A comment spanning
 * multiple lines.
 */
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, B BY $1;
DUMP C;
Pig Latin has a list of keywords that have a special meaning in the language and
cannot be used as identifiers. These include the operators (LOAD, ILLUSTRATE),
commands (cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX).
Pig Latin has mixed rules on case sensitivity. Operators and commands are not case
sensitive (to make interactive use more forgiving); however, aliases and function
names are case sensitive, as the short sketch below illustrates.
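A small sketch of these rules (hypothetical file path):
-- LOAD and load are the same operator, so these two statements differ only in their aliases...
A = LOAD 'input/data.txt';
a = load 'input/data.txt';
-- ...but A and a are two distinct aliases, and the built-in function MAX cannot be written as max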
STATEMENTS
As a Pig Latin program is executed, each statement is parsed in turn. If there are
syntax errors or other (semantic) problems, such as undefined aliases, the interpreter
will halt and display an error message.
The interpreter builds a logical plan for every relational operation, which forms the
core of a Pig Latin program. The logical plan for the statement is added to the logical
plan for the program so far, and then the interpreter moves on to the next statement.
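For example, each relational statement below adds one operation to the program's logical plan; the input path and schema are assumed to match the weather sample used later in these notes:
records  = LOAD 'input/ncdc/micro-tab/sample.txt'
           AS (year:chararray, temperature:int, quality:int);
filtered = FILTER records BY temperature != 9999;    -- parsed and added to the logical plan
grouped  = GROUP filtered BY year;                   -- parsed and added to the logical plan
DUMP grouped;                                        -- the accumulated plan is compiled and run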
• In Pig Latin, macros are used to define reusable code snippets
that can be called within Pig scripts. They provide a way to
abstract and organize complex logic, making it easier to
manage and maintain Pig scripts.
Eg:
-- a Pig Latin macro operates on relations; parameters are referenced with $ inside the body
DEFINE double_field(A, f) RETURNS B {
    $B = FOREACH $A GENERATE $f * 2 AS doubled_num;
};
Using the macro:
data = LOAD 'input.txt' USING PigStorage(',') AS (num:int);
processed_data = double_field(data, num);
STORE processed_data INTO 'output.txt' USING PigStorage(',');
Expressions
An expression is something that is evaluated to yield a value. Expressions can be used in Pig as a
part of a statement containing a relational operator. Pig has a rich variety of expressions.
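A sketch of a few common expression forms inside FILTER and FOREACH...GENERATE statements, assuming a hypothetical weather file:
records   = LOAD 'input/ncdc/micro-tab/sample.txt'
            AS (year:chararray, temperature:int, quality:int);
good      = FILTER records BY temperature != 9999 AND quality == 1;   -- comparison and boolean expressions
projected = FOREACH good GENERATE
    year,                                     -- field expression (by name)
    $1,                                       -- field expression (by position)
    temperature + 10,                         -- arithmetic expression
    (temperature > 25 ? 'warm' : 'cold'),     -- conditional (bincond) expression
    (double)temperature / 10.0;               -- cast expression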
Types
Schemas
• A relation in Pig may have an associated schema, which gives the fields in the relation names and types (see the sketch below).
• The schema is entirely optional and can be omitted by not specifying an AS clause; in that case the fields have no names and are referred to by position ($0, $1, ...).
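A sketch in the Grunt shell, assuming a tab-delimited weather file like the one used in the corrupt-record example below (output shown approximately):
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>     AS (year:int, temperature:int, quality:int);   -- schema given with an AS clause
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}
grunt> no_schema = LOAD 'input/ncdc/micro-tab/sample.txt';   -- no AS clause, so no declared schema
grunt> DESCRIBE no_schema;
Schema for no_schema unknown.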
• A SQL database will enforce the constraints in a table’s schema at load time; for example, trying to load a string into a column that is declared to be a
numeric type will fail. In Pig, if the value cannot be cast to the type declared in the schema, it will substitute a null value.
• For example, consider the following input for the weather data, which has an “e” character in place of an integer:
1950 0 1
1950 22 1
1950 e 1
1949 111 1
1949 78 1
• Pig handles the corrupt line by producing a null for the offending value, which is displayed as the absence of a value when dumped to
screen (and also when saved using STORE):
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>> AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,,1)
(1949,111,1)
(1949,78,1)
Pig produces a warning for the invalid field (not shown here) but
does not halt its processing.
Functions
Functions in Pig come in four types (a combined sketch follows the list):
Eval function
• A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag. Some eval functions are aggregate functions, which means they operate on a bag of data to produce a scalar value; MAX is an example of an aggregate function.
Filter function
• A special type of eval function that returns a logical Boolean result. As the name suggests, filter functions are used in
the FILTER operator to remove unwanted rows.
• An example of a built-in filter function is IsEmpty, which tests whether a bag or a map
contains any items.
Load function
• A function that specifies how to load data into a relation from external storage.
Store function
• A function that specifies how to save the contents of a relation to external storage.
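A sketch that touches all four kinds of functions, reusing the weather sample (paths hypothetical):
-- load function: PigStorage parses tab-separated text into tuples
records  = LOAD 'input/ncdc/micro-tab/sample.txt' USING PigStorage()
           AS (year:chararray, temperature:int, quality:int);
grouped  = GROUP records BY year;
-- filter function: IsEmpty tests whether a bag contains any items
nonempty = FILTER grouped BY NOT IsEmpty(records);
-- eval (aggregate) function: MAX reduces the bag of temperatures to a single value
max_temp = FOREACH nonempty GENERATE group, MAX(records.temperature);
-- store function: PigStorage writes the result back out as delimited text
STORE max_temp INTO 'output/max_temp' USING PigStorage(',');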
Other libraries
• If the function you need is not available, you can write your own user-defined function (or UDF). Piggy Bank is a library of Pig functions shared by the Pig community and distributed as part of Pig. Apache DataFu is another rich library of Pig UDFs. In addition to general utility functions, it includes functions for computing basic statistics, performing sampling and estimation, hashing, and working with web data.
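For example, a sketch of registering Piggy Bank and calling one of the string eval functions it has historically shipped (the jar location is installation-specific, and the exact set of functions varies by release):
REGISTER piggybank.jar;   -- path to the Piggy Bank jar on your installation
DEFINE Upper org.apache.pig.piggybank.evaluation.string.UPPER();
names       = LOAD 'input/names.txt' AS (name:chararray);
upper_names = FOREACH names GENERATE Upper(name);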
User-Defined Functions
Pig makes it easy to define and use user-defined functions. UDFs can be written in Java, Python, JavaScript, Ruby, or Groovy, all of which except Java run using the Java Scripting API.
A Filter UDF
Let’s demonstrate by writing a filter function for filtering out weather records that do not have a temperature quality reading of satisfactory (or better). The idea is to replace the inline quality test in the FILTER statement with a call to the UDF, as sketched below.
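A minimal sketch of the change, assuming the records relation from the weather examples; the quality codes shown and the UDF alias isGood (backed by a hypothetical Java filter UDF class) are illustrative. The idea is to go from an inline test such as:
filtered_records = FILTER records BY temperature != 9999 AND
    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
to a call to the filter UDF:
DEFINE isGood com.example.pig.IsGoodQuality();   -- hypothetical UDF class name
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);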