
Apache Pig Vs MapReduce

• Apache Pig is a data flow language; MapReduce is a data processing paradigm.
• Apache Pig is a high-level language; MapReduce is low level and rigid.
• Performing a Join operation between datasets is pretty simple in Apache Pig; it is quite difficult in MapReduce.
• Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig; exposure to Java is a must to work with MapReduce.
• Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent; MapReduce needs almost 20 times more lines of code to perform the same task.
• There is no need for compilation in Apache Pig: on execution, every Apache Pig operator is converted internally into a MapReduce job. MapReduce jobs, by contrast, have a long compilation process.
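The join claim above can be illustrated with a short sketch (the input paths and field names are hypothetical; any two tab-separated files with a shared key would do). The same join that takes many lines of MapReduce code is a single Pig Latin statement:

```pig
-- Load two hypothetical datasets and join them on their id fields.
A = LOAD 'input/pig/join/A' AS (id:int, name:chararray);
B = LOAD 'input/pig/join/B' AS (id:int, city:chararray);
C = JOIN A BY id, B BY id;   -- the whole join, in one line
DUMP C;
```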
Apache Pig Vs SQL

• Pig Latin is a procedural language; SQL is a declarative language.
• In Apache Pig, a schema is optional: we can store data without designing a schema (fields are then referenced positionally as $0, $1, etc.). A schema is mandatory in SQL.
• The data model in Apache Pig is nested relational; the data model used in SQL is flat relational.
• Apache Pig provides limited opportunity for query optimization; there is more opportunity for query optimization in SQL.
Pig Latin Data Model
• Atom: Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field. Example − ‘raja’ or ‘30’

• Tuple:A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any type. A tuple is similar to
a row in a table of RDBMS. Example − (Raja, 30)

• Bag:A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can
have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unlike a table
in RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column)
have the same type. Example − {(Raja, 30), (Mohammad, 45)}

• A bag can be a field in a relation; in that context, it is known as an inner bag. Example − (Raja, 30, {(9848022338, [email protected])})

• Map: A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. A map is represented by ‘[]’. Example − [name#Raja, age#30]

• Relation:A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in
any particular order).
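The pieces of the data model fit together in a small sketch (the file name and schema here are hypothetical): a relation is a bag of tuples, and a tuple's fields can themselves be bags or maps:

```pig
-- Each tuple holds two atoms, an inner bag of phone numbers, and a map.
people = LOAD 'input/people.txt'
    AS (name:chararray, age:int,
        phones:bag{t:(num:chararray)}, props:map[]);
DESCRIBE people;
```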
• COMPARISON WITH DATABASES
• There are several differences between Pig and relational database management systems
(RDBMSs).

● Pig Latin is a data flow programming language, whereas SQL is a declarative programming
language.

● A Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a
single transformation. By contrast, SQL statements are a set of constraints that, taken together,
define the output.

● RDBMSs store data in tables, with tightly predefined schemas. Pig is more relaxed about the data that it processes: you can define a schema at runtime, but it’s optional. Pig will operate on any source of tuples, such as a text file with tab-separated fields.

● Unlike an RDBMS, Pig has no separate data import process: the data is loaded from the filesystem (usually HDFS) as the first step in the processing.
● Pig’s support for complex and nested data structures further differentiates it from SQL, which operates on flatter data structures. Also, Pig’s ability to use UDFs and streaming operators that are tightly integrated with the language and Pig’s nested data structures makes Pig Latin more customizable than most SQL dialects.
● RDBMSs have several features to support online, low-latency queries, such as transactions and indexes, that are absent in Pig. Pig does not support random reads or writes.
Pig Latin
A Pig Latin program consists of a collection of statements. A statement can be thought of
as an operation or a command.
For example, a GROUP operation is a type of statement:
grouped_records = GROUP records BY year;
The command to list the files in a
Hadoop filesystem is another example of a statement:

ls /

❖ Statements are usually terminated with a semicolon. Statements or commands for interactive use in Grunt do not need the terminating semicolon, though it is never an error to add one.

❖ Statements that have to be terminated with a semicolon can be split across multiple
lines for readability:

records = LOAD 'input/ncdc/micro-tab/sample.txt'

AS (year:chararray, temperature:int, quality:int);


❖ Pig Latin has two forms of comments. Double hyphens are used for single-line comments. Everything from the first hyphen to the end of the line is ignored by the Pig Latin interpreter:

-- My program
DUMP A; -- What's in A?
❖ C-style comments are more flexible, since they delimit the beginning and end of the comment block with /* and */ markers. They can span lines or be embedded in a single line:

/*

* Description of my program spanning

* multiple lines.

*/

A = LOAD 'input/pig/join/A';

B = LOAD 'input/pig/join/B';

C = JOIN A BY $0, /* ignored */ B BY $1;

DUMP C;
❖ Pig Latin has a list of keywords that have a special meaning in the language and cannot be used as identifiers. These include the operators (LOAD, ILLUSTRATE), commands (cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX).
❖ Pig Latin has mixed rules on case sensitivity. Operators and commands are not case sensitive (to make interactive use more forgiving); however, aliases and function names are case sensitive.
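A brief sketch of the case-sensitivity rules (the path is hypothetical): the operator keyword can be written in any case, but differently cased aliases are distinct:

```pig
-- LOAD and load are the same operator...
A = LOAD 'input/pig/join/A';
a = load 'input/pig/join/A';
-- ...but A and a are two different aliases. Likewise, max() would NOT
-- resolve to the built-in MAX, since function names are case sensitive.
DUMP a;
```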
STATEMENTS
❖ As a Pig Latin program is executed, each statement is parsed in turn. If there are syntax errors or other (semantic) problems, such as undefined aliases, the interpreter will halt and display an error message.
❖ The interpreter builds a logical plan for every relational operation, which forms the core of a Pig Latin program. The logical plan for the statement is added to the logical plan for the program so far, and then the interpreter moves on to the next statement.
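The behavior described above means nothing runs while statements are being entered. As a sketch (reusing the sample file from earlier), each statement only extends the logical plan until an output statement triggers execution:

```pig
-- Each statement below is parsed and added to the logical plan only.
records  = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered = FILTER records BY temperature != 9999;
grouped  = GROUP filtered BY year;
-- Only DUMP (or STORE) compiles the plan and runs it as MapReduce jobs.
DUMP grouped;
```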
• In Pig Latin, macros are used to define reusable code snippets
that can be called within Pig scripts. They provide a way to
abstract and organize complex logic, making it easier to
manage and maintain Pig scripts.
Eg (macro parameters are referenced with a $ prefix inside the body, and a macro expands to relational operations rather than being called like a function inside FOREACH):
DEFINE double_num(A, f) RETURNS B {
  $B = FOREACH $A GENERATE $f * 2 AS doubled_num;
};
Using the macro:
data = LOAD 'input.txt' USING PigStorage(',') AS (num:int);
processed_data = double_num(data, num);
STORE processed_data INTO 'output.txt' USING PigStorage(',');
Expressions
An expression is something that is evaluated to yield a value. Expressions can be used in Pig as a
part of a statement containing a relational operator. Pig has a rich variety of expressions.
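As an illustrative sketch (assuming a records relation with year, temperature, and quality fields, as loaded elsewhere in these notes), several kinds of expression can appear inside a single relational operator:

```pig
projected = FOREACH records GENERATE
    year,                                 -- field reference by name
    $1 + 10,                              -- positional reference plus arithmetic
    (quality == 0 ? 'good' : 'other');    -- conditional (bincond) expression
DUMP projected;
```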
Types
Pig’s simple (atomic) types include int, long, float, double, chararray, and bytearray; its complex types are tuple, bag, and map.
Schemas

1. A relation in Pig may have an associated schema, which gives the fields in the relation names and
types.

For example:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'

>> AS (year:int, temperature:int, quality:int);

grunt> DESCRIBE records;

records: {year: int,temperature: int,quality: int}

2. It’s possible to omit type declarations completely, too:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'

>> AS (year, temperature, quality);

grunt> DESCRIBE records;

records: {year: bytearray,temperature: bytearray,quality: bytearray}


3. You don’t need to specify types for every field; you can leave some to default to bytearray, as we
have done for year in this declaration:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'

>> AS (year, temperature:int, quality:int);

grunt> DESCRIBE records;

records: {year: bytearray,temperature: int,quality: int}

4. The schema is entirely optional and can be omitted by not specifying an AS clause:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt';

grunt> DESCRIBE records;

Schema for records unknown.


Validation and nulls

• A SQL database will enforce the constraints in a table’s schema at load time; for example, trying to load a string into a column that is declared to be a
numeric type will fail. In Pig, if the value cannot be cast to the type declared in the schema, it will substitute a null value.

• For example, consider the following input for the weather data, which has an “e” character in place of an integer:

• 1950 0 1

• 1950 22 1

• 1950 e 1

• 1949 111 1

• 1949 78 1

• Pig handles the corrupt line by producing a null for the offending value, which is displayed as the absence of a value when dumped to
screen (and also when saved using STORE):
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>> AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
• (1950,0,1)
• (1950,22,1)
• (1950,,1)
• (1949,111,1)
• (1949,78,1)
Pig produces a warning for the invalid field (not shown here) but
does not halt its processing.
Functions
Functions in Pig come in four types:

Eval function

• A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag. Some eval functions are aggregate functions, which means they operate on a bag of data to produce a scalar value; MAX is an example of an aggregate function.

Filter function

• A special type of eval function that returns a logical Boolean result. As the name suggests, filter functions are used in
the FILTER operator to remove unwanted rows.

• An example of a built-in filter function is IsEmpty, which tests whether a bag or a map
contains any items.
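A hedged sketch of IsEmpty in use (relations A and B are hypothetical): after a COGROUP, it can drop groups where one side matched nothing:

```pig
-- Keep only keys for which B contributed at least one tuple.
grouped   = COGROUP A BY $0, B BY $0;
non_empty = FILTER grouped BY NOT IsEmpty(B);
DUMP non_empty;
```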

Load function

A function that specifies how to load data into a relation from external storage.

Store function

• A function that specifies how to save the contents of a relation to external storage.
Other libraries

• If the function you need is not available, you can write your own user-defined function (UDF). Piggy Bank is a library of Pig functions shared by the Pig community and distributed as part of Pig. Apache DataFu is another rich library of Pig UDFs. In addition to general utility functions, it includes functions for computing basic statistics, performing sampling and estimation, hashing, and working with web data.

User-Defined Functions
Pig makes it easy to define and use user-defined functions. UDFs can be written in Java, Python, JavaScript, Ruby, or Groovy (the non-Java languages run using the Java Scripting API).

A Filter UDF

Let’s demonstrate by writing a filter function for filtering out weather records that do not have a
temperature quality reading of satisfactory (or better). The idea is to change this line:

filtered_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);

to

filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
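For the second line to work, the UDF has to be registered first. The jar name and class name below are hypothetical; the general pattern for hooking up a Java filter UDF is:

```pig
-- Register the jar containing the compiled filter UDF, and give the
-- class a short alias so it can be called as isGood(quality).
REGISTER pig-examples.jar;
DEFINE isGood com.example.pig.IsGoodQuality();
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
```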

