Pig
Pig vs. MapReduce
• Performing a Join operation in Apache Pig is pretty simple (see the sketch below), whereas it is quite difficult to perform a Join operation between datasets in MapReduce.
• Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent; MapReduce requires almost 20 times more lines of code to perform the same task.
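For instance, a join of two datasets is a single Pig Latin statement. A minimal sketch, assuming two hypothetical tab-delimited input files of customers and orders:
customers = LOAD 'input/customers.txt' AS (id:int, name:chararray);
orders    = LOAD 'input/orders.txt'    AS (order_id:int, customer_id:int, amount:double);
-- equi-join the two relations on the customer id
joined    = JOIN customers BY id, orders BY customer_id;
DUMP joined;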
Pig vs. SQL
• The data model in Apache Pig is nested relational, whereas the data model used in SQL is flat relational.
• Tuple: A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS. Example − (Raja, 30)
• Bag: A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type. Example − {(Raja, 30), (Mohammad, 45)}
• A bag can be a field in a relation; in that context, it is known as an inner bag. Example − {Raja, 30, {9848022338, [email protected]}}
• Map: A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value can be of any type. A map is represented by ‘[]’. Example − [name#Raja, age#30]
• Relation: A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order). A sketch combining these types in a schema follows this list.
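A sketch of how these types can appear together in a Pig Latin schema, assuming a hypothetical tab-delimited input file:
-- each field of the relation uses one of the types above: atoms, a bag, and a map
people = LOAD 'input/people.txt'
         AS (name:chararray,
             age:int,
             contacts:bag{t:(phone:chararray, email:chararray)},
             details:map[]);
DUMP people;   -- each line of output is a tuple; the relation itself is a bag of tuples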
COMPARISON WITH DATABASES
There are several differences between Pig and relational database management systems (RDBMSs).
● Pig Latin is a data flow programming language, whereas SQL is a declarative programming
language.
● A Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a
single transformation. By contrast, SQL statements are a set of constraints that, taken together,
define the output.
● RDBMSs store data in tables, with tightly predefined schemas. Pig is more relaxed about the data that it processes: you can define a schema at runtime, but it’s optional. It will operate on any source of tuples, such as a text file with tab-separated fields.
● In Pig, there is no data import process as there is with an RDBMS: the data is loaded from the filesystem (usually HDFS) as the first step in the processing.
● Pig’s support for complex, nested data structures further differentiates it from SQL, which operates on flatter data structures. Also, Pig’s ability to use UDFs and streaming operators that are tightly integrated with the language and Pig’s nested data structures makes Pig Latin more customizable than most SQL dialects.
● RDBMSs have several features to support online, low-latency queries, such as transactions and indexes, that are absent in Pig. Pig does not support random reads or writes.
Pig Latin
A Pig Latin program consists of a collection of statements. A statement can be thought of
as an operation or a command.
For example, a GROUP operation is a type of statement:
grouped_records = GROUP records BY year;
The command to list the files in a
Hadoop filesystem is another example of a statement:
ls /
❖ Statements that have to be terminated with a semicolon can be split across multiple lines for readability.
❖ Double hyphens (--) introduce single-line comments; everything from the first hyphen to the end of the line is ignored:
DUMP A; -- What's in A?
❖ C-style comments are more flexible since they delimit the beginning and end of the comment block with /* and */ markers. They can span lines or be embedded in a single line:
/*
 * A comment spanning
 * multiple lines.
 */
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, B BY $1;
DUMP C;
Pig Latin has a list of keywords that have a special meaning in the language and
cannot be used as identifiers. These include the operators (LOAD, ILLUSTRATE),
commands (cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX).
Pig Latin has mixed rules on case sensitivity. Operators and commands are not case
sensitive (to make interactive use more forgiving); however, aliases and function
names are case sensitive, as the short sketch below illustrates.
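A small sketch of these rules (hypothetical file path):
-- LOAD and load are the same operator, so these two statements differ only in their aliases...
A = LOAD 'input/data.txt';
a = load 'input/data.txt';
-- ...but A and a are two distinct aliases, and the built-in function MAX cannot be written as max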
STATEMENTS
As a Pig Latin program is executed, each statement is parsed in turn. If there are
syntax errors or other (semantic) problems, such as undefined aliases, the interpreter
will halt and display an error message.
The interpreter builds a logical plan for every relational operation, which forms the
core of a Pig Latin program. The logical plan for the statement is added to the logical
plan for the program so far, and then the interpreter moves on to the next statement.
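For example, each relational statement below adds one operation to the program's logical plan; the input path and schema are assumed to match the weather sample used later in these notes:
records  = LOAD 'input/ncdc/micro-tab/sample.txt'
           AS (year:chararray, temperature:int, quality:int);
filtered = FILTER records BY temperature != 9999;    -- parsed and added to the logical plan
grouped  = GROUP filtered BY year;                   -- parsed and added to the logical plan
DUMP grouped;                                        -- the accumulated plan is compiled and run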
• In Pig Latin, macros are used to define reusable code snippets
that can be called within Pig scripts. They provide a way to
abstract and organize complex logic, making it easier to
manage and maintain Pig scripts.
Eg:
-- a Pig Latin macro operates on relations; parameters are referenced with $ inside the body
DEFINE double_field(A, f) RETURNS B {
    $B = FOREACH $A GENERATE $f * 2 AS doubled_num;
};
Using the macro:
data = LOAD 'input.txt' USING PigStorage(',') AS (num:int);
processed_data = double_field(data, num);
STORE processed_data INTO 'output.txt' USING PigStorage(',');
Expressions
An expression is something that is evaluated to yield a value. Expressions can be used in Pig as a
part of a statement containing a relational operator. Pig has a rich variety of expressions.
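A sketch of a few common expression forms inside FILTER and FOREACH...GENERATE statements, assuming a hypothetical weather file:
records   = LOAD 'input/ncdc/micro-tab/sample.txt'
            AS (year:chararray, temperature:int, quality:int);
good      = FILTER records BY temperature != 9999 AND quality == 1;   -- comparison and boolean expressions
projected = FOREACH good GENERATE
    year,                                     -- field expression (by name)
    $1,                                       -- field expression (by position)
    temperature + 10,                         -- arithmetic expression
    (temperature > 25 ? 'warm' : 'cold'),     -- conditional (bincond) expression
    (double)temperature / 10.0;               -- cast expression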
Types
Schemas
• A relation in Pig may have an associated schema, which gives the fields in the relation names and types (see the sketch below).
• The schema is entirely optional and can be omitted by not specifying an AS clause; in that case the fields have no names and are referred to by position ($0, $1, ...).
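A sketch in the Grunt shell, assuming a tab-delimited weather file like the one used in the corrupt-record example below (output shown approximately):
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>     AS (year:int, temperature:int, quality:int);   -- schema given with an AS clause
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}
grunt> no_schema = LOAD 'input/ncdc/micro-tab/sample.txt';   -- no AS clause, so no declared schema
grunt> DESCRIBE no_schema;
Schema for no_schema unknown.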
• A SQL database will enforce the constraints in a table’s schema at load time; for example, trying to load a string into a column that is declared to be a
numeric type will fail. In Pig, if the value cannot be cast to the type declared in the schema, it will substitute a null value.
• For example, consider the following input for the weather data, which has an “e” character in place of an integer:
1950 0 1
1950 22 1
1950 e 1
1949 111 1
1949 78 1
• Pig handles the corrupt line by producing a null for the offending value, which is displayed as the absence of a value when dumped to
screen (and also when saved using STORE):
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>> AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,,1)
(1949,111,1)
(1949,78,1)
Pig produces a warning for the invalid field (not shown here) but
does not halt its processing.
Functions
Functions in Pig come in four types (a combined sketch follows the list):
Eval function
• A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag. Some eval functions are aggregate functions, which means they operate on a bag of data to produce a scalar value; MAX is an example of an aggregate function.
Filter function
• A special type of eval function that returns a logical Boolean result. As the name suggests, filter functions are used in
the FILTER operator to remove unwanted rows.
• An example of a built-in filter function is IsEmpty, which tests whether a bag or a map
contains any items.
Load function
• A function that specifies how to load data into a relation from external storage.
Store function
• A function that specifies how to save the contents of a relation to external storage.
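A sketch that touches all four kinds of functions, reusing the weather sample (paths hypothetical):
-- load function: PigStorage parses tab-separated text into tuples
records  = LOAD 'input/ncdc/micro-tab/sample.txt' USING PigStorage()
           AS (year:chararray, temperature:int, quality:int);
grouped  = GROUP records BY year;
-- filter function: IsEmpty tests whether a bag contains any items
nonempty = FILTER grouped BY NOT IsEmpty(records);
-- eval (aggregate) function: MAX reduces the bag of temperatures to a single value
max_temp = FOREACH nonempty GENERATE group, MAX(records.temperature);
-- store function: PigStorage writes the result back out as delimited text
STORE max_temp INTO 'output/max_temp' USING PigStorage(',');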
Other libraries
• If the function you need is not available, you can write your own user-defined function (or UDF). Piggy Bank is a library of Pig functions shared by the Pig community and distributed as part of Pig. Apache DataFu is another rich library of Pig UDFs. In addition to general utility functions, it includes functions for computing basic statistics, performing sampling and estimation, hashing, and working with web data.
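For example, a sketch of registering Piggy Bank and calling one of the string eval functions it has historically shipped (the jar location is installation-specific, and the exact set of functions varies by release):
REGISTER piggybank.jar;   -- path to the Piggy Bank jar on your installation
DEFINE Upper org.apache.pig.piggybank.evaluation.string.UPPER();
names       = LOAD 'input/names.txt' AS (name:chararray);
upper_names = FOREACH names GENERATE Upper(name);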
User-Defined Functions
Pig makes it easy to define and use user-defined functions. UDFs can be written in Java, Python, JavaScript, Ruby, or Groovy, all of which except Java run using the Java Scripting API.
A Filter UDF
Let’s demonstrate by writing a filter function for filtering out weather records that do not have a temperature quality reading of satisfactory (or better). The idea is to replace the inline quality test in the FILTER statement with a call to the UDF, as sketched below.
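A minimal sketch of the change, assuming the records relation from the weather examples; the quality codes shown and the UDF alias isGood (backed by a hypothetical Java filter UDF class) are illustrative. The idea is to go from an inline test such as:
filtered_records = FILTER records BY temperature != 9999 AND
    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
to a call to the filter UDF:
DEFINE isGood com.example.pig.IsGoodQuality();   -- hypothetical UDF class name
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);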