Apache Pig
70% of production jobs at Yahoo! (tens of thousands per day)
Also used at Twitter, LinkedIn, eBay, AOL, …
Used to
• Process web logs
• Build user behavior models
• Process images
• Build maps of the web
• Do research on large data sets
Requirements
Mandatory
Unix and Windows users need the following:
Hadoop 2.X - http://hadoop.apache.org/common/releases.html
(You can run Pig with different versions of Hadoop by setting HADOOP_HOME to point to the directory where you have installed Hadoop. If you do not set HADOOP_HOME, by default Pig will run with the embedded version, currently Hadoop 2.7.3.)
Java 1.7 - http://java.sun.com/javase/downloads/index.jsp
(Set JAVA_HOME to the root of your Java installation.)
Optional
Python 2.7 - https://www.python.org (when using
Streaming Python UDFs)
Two Main Components
• High-level language (Pig Latin): a set of commands
• An execution engine that compiles Pig Latin scripts into MapReduce jobs
Samples:
• Tuple: a row in a database
  (0002576169, Tom, 20, 4.0)
• Bag: a table or view in a database
  {(0002576169, Tom, 20, 4.0),
   (0002576170, Mike, 20, 3.6),
   (0002576171, Lucy, 19, 4.0),
   …}
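A minimal sketch of how such a bag arises in practice (the file name 'students.txt' and the field types are assumptions for illustration):
students = LOAD 'students.txt' AS (id:chararray, name:chararray, age:int, gpa:double);
DUMP students;
-- prints the bag of tuples, one per input row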
● Common design patterns as keywords (joins, distinct, counts)
● Data flow analysis
● A script can map to multiple map-reduce jobs
● Avoids Java-level errors (not everyone can write Java code)
● Can be run in interactive mode
● Issue commands and get results, as in the sketch below
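A minimal interactive (Grunt shell) session might look like this (file name, schema, and threshold are illustrative assumptions):
grunt> emps = LOAD 'employees.txt' AS (id, name, salary);
grunt> highpaid = FILTER emps BY salary > 50000;
grunt> DUMP highpaid;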
LOAD
• Loads data from an HDFS file
var = LOAD 'employees.txt';
var = LOAD 'employees.txt' AS (id, name, salary);
var = LOAD 'employees.txt' USING PigStorage() AS (id, name, salary);
Each LOAD statement defines a new bag
• Each tuple in the bag can have multiple elements (atoms)
• Each element can be referenced by name or by position ($n)
A bag is immutable
A bag can be aliased and referenced later
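For example, a sketch of referencing elements by name and by position (assuming the schema-carrying LOAD above):
-- 'name' is $1 and 'salary' is $2; both styles can be mixed
B = FOREACH var GENERATE $0, name, salary;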
STORE
• Writes output to an HDFS file in a specified directory
grunt> STORE processed INTO 'processed_txt';
Fails if directory exists
Writes output files, part-[m|r]-xxxxx, to the directory
• PigStorage can be used to specify a field
delimiter
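For instance (the output directory name is illustrative):
grunt> STORE processed INTO 'processed_csv' USING PigStorage(',');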
DUMP
• Writes output to the screen
grunt> DUMP processed;
FOREACH
• Applies expressions to every record in a bag
FILTER
• Filters by expression
GROUP
• Collect records with the same key
ORDER BY
• Sorting
DISTINCT
• Removes duplicates
Use the FOREACH … GENERATE operator to work with rows of data, call functions, etc.
Basic syntax:
alias2 = FOREACH alias1 GENERATE expression;
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FOREACH alias1 GENERATE col1, col2;
DUMP alias2;
(1,2) (4,2) (8,3) (4,3) (7,2) (8,4)
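Expressions can also compute derived fields; a sketch on the same input:
alias3 = FOREACH alias1 GENERATE col1 + col2 AS total;
DUMP alias3;
(3) (6) (11) (7) (9) (12)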
Use the FILTER operator to restrict tuples or
rows of data
Basic syntax:
alias2 = FILTER alias1 BY expression;
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FILTER alias1 BY (col1 == 8) OR (NOT
(col2+col3 > col1));
DUMP alias2;
(4,2,1) (8,3,4) (7,2,5) (8,4,3)
Use the GROUP operator to group data
• Use GROUP when only one relation is involved
• Use COGROUP when multiple relations are involved
Basic syntax:
alias2 = GROUP alias1 BY field_alias;
alias2 = GROUP alias1 ALL;
Example:
DUMP alias1;
(John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F)
(Joe,18,3.8F)
alias2 = GROUP alias1 BY col2;
DUMP alias2;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
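GROUP … ALL, by contrast, collects every record into a single group whose key is the literal 'all'; a sketch on the same data (ordering inside the bag is illustrative):
alias3 = GROUP alias1 ALL;
DUMP alias3;
(all,{(John,18,4.0F),(Mary,19,3.8F),(Bill,20,3.9F),(Joe,18,3.8F)})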
Use the ORDER … BY operator to sort a relation based on one or more fields
Basic syntax:
alias = ORDER alias BY field_alias [ASC|DESC];
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)
Use the DISTINCT operator to remove
duplicate tuples in a relation.
Basic syntax:
alias2 = DISTINCT alias1;
Example:
DUMP alias1;
(8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3)
alias2 = DISTINCT alias1;
DUMP alias2;
(8,3,4) (1,2,3) (4,3,3)
FLATTEN
• Used to un-nest tuples as well as bags
INNER JOIN
• Used to perform an inner join of two or more relations
based on common field values
OUTER JOIN
• Used to perform left, right or full outer joins
SPLIT
• Used to partition the contents of a relation into two or
more relations
SAMPLE
• Used to select a random data sample with the stated
sample size
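Minimal sketches of SPLIT and SAMPLE, reusing the relation and column names from the earlier examples:
-- partition alias1 on a predicate
SPLIT alias1 INTO small IF col1 < 5, large IF col1 >= 5;
-- keep roughly 10% of the tuples, chosen at random
sampled = SAMPLE alias1 0.1;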
Use the JOIN operator to perform an inner equi-join of two or more relations based on common field values
The JOIN operator always performs an inner join
Inner joins ignore null keys
• Filter null keys before the join
JOIN and COGROUP operators perform
similar functions
• JOIN creates a flat set of output records
• COGROUP creates a nested set of output records
DUMP Alias1;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
DUMP Alias2;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
Join Alias1 by Col1 to Alias2 by Col1:
Alias3 = JOIN Alias1 BY Col1, Alias2 BY Col1;
DUMP Alias3;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
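For the COGROUP contrast (and the null-key filtering idiom mentioned above), a sketch on the same relations:
-- inner joins silently drop null keys; filter them explicitly if desired
Clean1 = FILTER Alias1 BY Col1 IS NOT NULL;
-- COGROUP keeps one nested bag per input relation instead of flattening
Alias4 = COGROUP Alias1 BY Col1, Alias2 BY Col1;
-- e.g. key 4 yields (4,{(4,2,1),(4,3,3)},{(4,6),(4,9)})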
Use the OUTER JOIN operator to perform left,
right, or full outer joins
• Pig Latin syntax closely adheres to the SQL standard
The keyword OUTER is optional
• The keywords LEFT, RIGHT, and FULL imply left outer, right outer, and full outer joins, respectively
Outer joins will only work when the relations that need to produce nulls (in the case of non-matching keys) have schemas
Outer joins will only work for two-way joins
• To perform a multi-way outer join, chain multiple two-way outer join statements (see the sketch after the examples below)
Left Outer Join
• A = LOAD 'a.txt' AS (n:chararray, a:int);
• B = LOAD 'b.txt' AS (n:chararray, m:chararray);
• C = JOIN A by $0 LEFT OUTER, B BY $0;
Full Outer Join
• A = LOAD 'a.txt' AS (n:chararray, a:int);
• B = LOAD 'b.txt' AS (n:chararray, m:chararray);
• C = JOIN A BY $0 FULL OUTER, B BY $0;
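Multi-Way Outer Join (sketch; the third relation D and file 'c.txt' are hypothetical)
• D = LOAD 'c.txt' AS (n:chararray, p:int);
• AB = JOIN A BY $0 LEFT OUTER, B BY $0;
• ABD = JOIN AB BY $0 LEFT OUTER, D BY $0;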
UDFs are natively written in Java and packaged as a jar file
• Other supported languages include Jython, JavaScript, Ruby, Groovy, and Python
Register the jar with the REGISTER
statement
Optionally, alias it with the DEFINE
statement
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
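Aliasing the UDF with DEFINE, a sketch using the same hypothetical jar and class:
REGISTER /src/myfunc.jar;
DEFINE MyEval myfunc.MyEvalFunc();
A = LOAD 'students';
B = FOREACH A GENERATE MyEval($0);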
DEFINE can be used to work with UDFs and
also streaming commands
• Useful when dealing with complex input/output
formats
/* read and write comma-delimited data */
DEFINE Y `stream.pl` INPUT(stdin USING PigStreaming(','))
OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;
Batch mode
pig -x local wordcount.pig
Interactive mode
grunt> Lines = LOAD 'input.txt' AS (line: chararray);
grunt> Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> Groups = GROUP Words BY word;
grunt> counts = FOREACH Groups GENERATE group, COUNT(Words);
grunt> DUMP counts;
TOKENIZE returns a new bag for each input; FLATTEN eliminates bag nesting
A: {line1, line2, line3, …}
After TOKENIZE: {{line1word1, line1word2, …}, {line2word1, line2word2, …}}
After FLATTEN: {line1word1, line1word2, line2word1, …}
K-means is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
Assignment step: assign each observation to the cluster with the closest mean
Update step: recompute each cluster mean from the observations assigned to it (the embedded Pig script below computes this with AVG)
Reference: http://en.wikipedia.org/wiki/K-means_clustering
PC = Pig.compile("""register udf.jar
DEFINE find_centroid FindCentroid('$centroids');
students = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach students generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';
""")
from math import fabs  # needed for the distance computation below

while iter_num < MAX_ITERATION:
    PCB = PC.bind({'centroids': initial_centroids})
    results = PCB.runSingle()
    iter = results.result("result").iterator()
    centroids = [None] * v
    distance_move = 0.0
    # get the new centroids of this iteration and the moving distance from the last iteration
    for i in range(v):
        tuple = iter.next()
        centroids[i] = float(str(tuple.get(1)))
        distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
    distance_move = distance_move / v
    if distance_move < tolerance:
        converged = True
        break
    ……
What is a UDF?
• A way to do an operation on a field or fields
• Called from within a Pig script
• Currently all done in Java
Why use a UDF?
• You need to do more than grouping or filtering
• Actually, filtering is itself a UDF
• You may be more comfortable in Java land than in SQL/Pig Latin
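For instance, a hypothetical FilterFunc invoked from a script (jar, class, and relation names are assumptions):
REGISTER myudfs.jar;                  -- assumption: jar containing the UDF
DEFINE IsAdult myudfs.IsAdult();      -- hypothetical FilterFunc
adults = FILTER students BY IsAdult(age);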
Pig does not support flow-control statements: if/else, while loop, for loop, etc.
The Pig embedding API can leverage all language features provided by Python, including control flow:
• Loops and exit criteria
• Similar to the database embedding API
• Easier parameter passing
JavaScript is available as well
The framework is extensible. Any JVM
implementation of a language could be
integrated
1. Get and set up the hands-on VM from:
http://salsahpc.indiana.edu/ScienceCloud/virtualbox_appliance_guide.html
2. cd pigtutorial/pig-hands-on/
3. tar -xf pig-kmeans.tar
4. cd pig-kmeans
5. export PIG_CLASSPATH=/opt/pig/lib/jython-2.5.0.jar
6. hadoop dfs -copyFromLocal input.txt ./input.txt
7. pig -x mapreduce kmeans.py
8. pig -x local kmeans.py
2012-07-14 14:51:24,636 [main] INFO org.apache.pig.scripting.BoundScript - Query to
run:
register udf.jar
DEFINE find_centroid FindCentroid('0.0:1.0:2.0:3.0');
students = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach students generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';
Tera = 10^12
Giga = 10^9
Mega = 10^6
1. Search engine system for the summer school
2. Gives an example of how to use MapReduce technologies to solve a big data challenge
3. Uses Hadoop/HDFS/HBase/Pig
4. Indexed 656K web pages (540 MB in size) selected from the ClueWeb09 data set
5. Calculates ranking values for 2 million web sites
[Architecture diagram]
• Inverted Indexing System: Apache Lucene
• Web UI: PHP script
• HBase tables: 1. inverted index table, 2. page rank table (accessed via Hive/Pig scripts)
• Ranking System: Pig script
• Hadoop Cluster on FutureGrid
Read a file from HDFS in the given input format (delimited text) and define the run-time schema:
A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, c2: int);