Apache Pig is a scripting language for exploring large datasets that allows users to express data flows in Pig Latin scripts. Pig Latin scripts describe multi-step transformations that Pig executes by translating into MapReduce jobs. This allows users to focus on the logic of their data analysis without needing to write MapReduce programs directly.
Apache Pig raises the level of abstraction for processing large datasets.
Pig is a scripting language for exploring large datasets.

Why Pig? To overcome the long development cycle of MapReduce jobs:
1. Writing the mappers and reducers
2. Compiling and packaging the code
3. Submitting the job(s) and retrieving the results

Pig Components
Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs, which is either:
  1. local execution in a single JVM, or
  2. distributed execution on a Hadoop cluster.

A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output.
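As a minimal sketch of such a data flow (the input path and field names here are invented for illustration):

-- Load, transform, and write out a dataset in three steps
logs = LOAD 'input/logs.txt' AS (user:chararray, bytes:int);
big_transfers = FILTER logs BY bytes > 1024;
STORE big_transfers INTO 'output/big_transfers';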
The operations describe a data flow, which the Pig
execution environment translates into an executable representation and then runs.
Pig turns the transformations into a series of MapReduce jobs, without the programmer being aware of them.

Installing and Running Pig
Pig runs as a client-side application, either locally or on a Hadoop cluster. Pig launches jobs and interacts with HDFS (or other Hadoop filesystems) from your workstation.
Download a release from http://pig.apache.org/releases.html, and unpack the tarball in a suitable directory on your workstation:

% tar xzf pig-x.y.z.tar.gz

Pig Execution Types
Local mode:

% pig -x local
grunt>

MapReduce mode, which can run on Hadoop MapReduce, Apache Tez, or Apache Spark. To use MapReduce mode, you first need to check that the version of Pig you downloaded is compatible with the version of Hadoop you are using. To connect to a Hadoop cluster, set the cluster properties in pig.properties in Pig's conf directory:

fs.defaultFS=hdfs://localhost/
mapreduce.framework.name=yarn
yarn.resourcemanager.address=localhost:8032

Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:
Script -- A script file containing Pig commands (the -e option lets you specify a command as a string).
Grunt -- An interactive shell for running Pig commands.
Embedded -- You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java.

Pig example script:

-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;

The same program can be entered statement by statement in Grunt:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);

For simplicity, the program assumes that the input is tab-delimited text, with each line having just year, temperature, and quality fields.

grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> quality IN (0, 1, 4, 5, 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,78,1),(1949,111,1)})
(1950,{(1950,-11,1),(1950,22,1),(1950,0,1)})
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray, filtered_records: {year: chararray, temperature: int, quality: int}}
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);

FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to define the fields in each derived row.

grunt> DUMP max_temp;
(1949,111)
(1950,22)

Pig vs Databases
Pig Latin is a data flow programming language, whereas SQL is a declarative programming language.
A Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a single transformation.
SQL statements are a set of constraints that, taken together, define the output. Programming in Pig Latin is like working at the level of an RDBMS query planner, which figures out how to turn a declarative statement into a system of steps.
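To make the contrast concrete, here is a sketch (table and field names invented for illustration) showing the same aggregation expressed declaratively in SQL, as a comment, and as explicit Pig Latin steps:

-- SQL (declarative, one statement):
--   SELECT year, MAX(temperature) FROM records GROUP BY year;
-- Pig Latin (data flow, one transformation per step):
records = LOAD 'input/records' AS (year:chararray, temperature:int);
grouped = GROUP records BY year;
maxima = FOREACH grouped GENERATE group, MAX(records.temperature);
DUMP maxima;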
RDBMSs store data in tables, with tightly predefined
schemas. Pig is more relaxed about the data that it processes: you can define a schema at runtime, but it’s optional.
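For example, here is a sketch (the projection is invented for illustration) of loading the sample dataset with no schema at all; the fields then have no names and are referenced by position:

records = LOAD 'input/ncdc/micro-tab/sample.txt';  -- no AS clause, no schema
projected = FOREACH records GENERATE $0, $2;       -- first and third fields
DUMP projected;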
Pig's nested data structures make Pig Latin more customizable than most SQL dialects. Pig does not support random reads, nor does it support random writes to update small portions of data; in Pig, all writes are bulk streaming writes, just like with MapReduce.
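As an illustration of such a bulk write (the output path is invented, and max_temp is the relation from the earlier example), STORE writes an entire relation in one streaming pass:

STORE max_temp INTO 'output/max_temp' USING PigStorage(',');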
Pig is able to work with Hive tables using HCatalog.
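As a sketch (assuming Hive and HCatalog are installed, and that a Hive table named sample_table exists):

% pig -useHCatalog
grunt> hive_rows = LOAD 'sample_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> DESCRIBE hive_rows;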
Pig Latin
A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command. For example, a GROUP operation is a type of statement:

grouped_records = GROUP records BY year;

Statements that have to be terminated with a semicolon can be split across multiple lines for readability:

records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);

Pig Latin has two forms of comments. Double hyphens are used for single-line comments; everything from the first hyphen to the end of the line is ignored by the Pig Latin interpreter:

-- My program
DUMP A; -- What's in A?

C-style comments are more flexible, since they delimit the beginning and end of the comment block with /* and */ markers. They can span lines or be embedded in a single line:

/*
 * Description of my program spanning
 * multiple lines.
 */
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
DUMP C;

Pig Latin has a list of keywords that have a special meaning in the language and cannot be used as identifiers. These include the operators (LOAD, ILLUSTRATE), commands (cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX).

Pig Latin has mixed rules on case sensitivity. Operators and commands are not case sensitive (to make interactive use more forgiving); however, aliases and function names are case sensitive.

When a Pig Latin program is executed, each statement is parsed in turn. If there are syntax errors or other (semantic) problems, such as undefined aliases, the interpreter will halt and display an error message. The interpreter builds a logical plan for every relational operation, which forms the core of a Pig Latin program. The logical plan for the statement is added to the logical plan for the program so far, and then the interpreter moves on to the next statement. No data processing takes place while the logical plan of the program is being constructed. Consider the max_temp.pig script again:

-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;

Pig validates the GROUP and FOREACH...GENERATE statements, and adds them to the logical plan without executing them.
The trigger for Pig to start execution is the DUMP statement. At that point, the logical plan is compiled into a physical plan and executed. The physical plan that Pig prepares is a series of MapReduce jobs, which Pig runs in the local JVM in local mode, or on a Hadoop cluster in MapReduce mode.

The STORE statement should be used when the size of the output is more than a few lines, as it writes to a file rather than to the console.

Diagnostic operators (DESCRIBE, EXPLAIN, and ILLUSTRATE) are provided to allow the user to interact with the logical plan for debugging purposes:

Operator (Shortcut)   Description
DESCRIBE (\de)        Prints a relation's schema
EXPLAIN (\e)          Prints the logical and physical plans
ILLUSTRATE (\i)       Shows a sample execution of the logical plan, using a generated subset of the input

Pig Latin also provides three statements, REGISTER, DEFINE, and IMPORT, that make it possible to incorporate macros and user-defined functions into Pig scripts:

Statement   Description
REGISTER    Registers a JAR file with the Pig runtime
DEFINE      Creates an alias for a macro, UDF, streaming script, or command specification
IMPORT      Imports macros defined in a separate file into a script

Pig provides commands to interact with Hadoop filesystems and MapReduce, as well as a few utility commands.

Table 16-4. Pig Latin commands

Category            Command         Description
Hadoop filesystem   cat             Prints the contents of one or more files
                    cd              Changes the current directory
                    copyFromLocal   Copies a local file or directory to a Hadoop filesystem
                    copyToLocal     Copies a file or directory on a Hadoop filesystem to the local filesystem
                    cp              Copies a file or directory to another directory
                    fs              Accesses Hadoop's filesystem shell
                    ls              Lists files
                    mkdir           Creates a new directory
                    mv              Moves a file or directory to another directory
                    pwd             Prints the path of the current working directory
                    rm              Deletes a file or directory
                    rmf             Forcibly deletes a file or directory (does not fail if the file or directory does not exist)
Hadoop MapReduce    kill            Kills a MapReduce job
Utilities           clear           Clears the screen in Grunt
                    exec            Runs a script in a new Grunt shell in batch mode
                    help            Shows the available commands and options
                    history         Prints the query statements run in the current Grunt session
                    quit (\q)       Exits the interpreter
                    run             Runs a script within the existing Grunt shell
                    set             Sets Pig options and MapReduce job properties
                    sh              Runs a shell command from within Grunt

A relation in Pig may have an associated schema, which gives the fields in the relation names and types.
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int, temperature: int, quality: int}

Functions in Pig come in four types:

Eval function -- A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag.
Filter function -- A special type of eval function that returns a logical Boolean result. As the name suggests, filter functions are used in the FILTER operator to remove unwanted rows.
Load function -- A function that specifies how to load data into a relation from external storage.
Store function -- A function that specifies how to save the contents of a relation to external storage. Often, load and store functions are implemented by the same type.

Macros
Macros provide a way to package reusable pieces of Pig Latin code from within Pig Latin itself. For example, we can extract the part of our Pig Latin program that performs grouping on a relation and then finds the maximum value in each group by defining a macro as follows:

DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
    A = GROUP $X by $group_key;
    $Y = FOREACH A GENERATE group, MAX($X.$max_field);
};

records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    quality IN (0, 1, 4, 5, 9);
max_temp = max_by_group(filtered_records, year, temperature);
DUMP max_temp;

Macros can be defined in separate files from Pig scripts, in which case they need to be imported into any script that uses them. An import statement looks like this:

IMPORT './ch16-pig/src/main/pig/max_temp.macro';

User-Defined Functions
Pig's designers realized that the ability to plug in custom code is crucial for all but the most trivial data processing jobs.
For this reason, they made it easy to define and use
user-defined functions.
You can write UDFs in Java, Python, JavaScript, Ruby, or Groovy; the non-Java languages are run using the Java Scripting API.

A FilterFunc UDF to remove records with unsatisfactory temperature quality readings:

package com.hadoopbook.pig;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;

public class IsGoodQuality extends FilterFunc {

    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        // A filter function returns true for each tuple that should be kept
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            // Keep readings whose quality code is one of the "good" values
            return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
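A sketch of registering and using the UDF from Grunt (the JAR name pig-examples.jar is an assumption for illustration):

grunt> REGISTER pig-examples.jar;
grunt> DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
grunt> filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);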