PIG: A Big Data Processor: Tushar B. Kute
Tushar B. Kute,
http://tusharkute.com
What is Pig?
• Atom
– Any single value in Pig Latin, irrespective of its
data type, is known as an Atom.
– It is stored as a string and can be used as a string
or a number. int, long, float, double, chararray,
and bytearray are the atomic values of Pig.
– A piece of data or a simple atomic value is known
as a field.
– Example: ‘raja’ or ‘30’
Apache Pig – Elements
• Tuple
– A record that is formed by an ordered set of
fields is known as a tuple; the fields can be of any
type. A tuple is similar to a row in a table of
RDBMS.
– Example: (Raja, 30)
Apache Pig – Elements
• Bag
– A bag is an unordered set of tuples. In other words, a
collection of tuples (non-unique) is known as a bag. Each
tuple can have any number of fields (flexible schema). A
bag is represented by ‘{}’. It is similar to a table in RDBMS,
but unlike a table in RDBMS, it is not necessary that every
tuple contain the same number of fields or that the fields
in the same position (column) have the same type.
– Example: {(Raja, 30), (Mohammad, 45)}
– A bag can be a field in a relation; in that context, it is
known as inner bag.
– Example: {Raja, 30, {9848022338, [email protected],}}
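As a rough analogy only (plain Python, not Pig itself), a tuple behaves like an ordered Python tuple and a bag like a list of tuples whose lengths may differ. The variable names and the exact nesting of the inner bag are assumptions made for illustration.

```python
# Plain-Python analogy for Pig's tuple and bag; this is not Pig code.
person = ("Raja", 30)                      # tuple: ordered fields of any type
people = [("Raja", 30), ("Mohammad", 45)]  # bag: collection of tuples

# A bag allows duplicate tuples and tuples with different field counts.
mixed_bag = [("Raja", 30), ("Mohammad",), ("Raja", 30)]

# Inner bag: a bag stored as one field of an enclosing tuple.
# The phone number comes from the slide; the nesting shape is an assumption.
raja = ("Raja", 30, [("9848022338",)])

field_counts = [len(t) for t in mixed_bag]
print(field_counts)   # [2, 1, 2]
print(raja[2][0][0])  # 9848022338
```

A Python list keeps insertion order, whereas a Pig bag gives no ordering guarantee, so the analogy should not be pushed further than the shape of the data.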
Apache Pig – Elements
• Relation
– A relation is a bag of tuples. The relations in Pig
Latin are unordered (there is no guarantee that
tuples are processed in any particular order).
• Map
– A map (or data map) is a set of key-value pairs.
The key needs to be of type chararray and should
be unique. The value might be of any type. It is
represented by ‘[]’
– Example: [name#Raja, age#30]
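A Pig map can be pictured as a dictionary with string keys. The sketch below is plain Python rather than Pig, and the helper name lookup is hypothetical; it mirrors the behaviour, as far as I know, that looking up a missing key in a Pig map yields null instead of an error.

```python
# Plain-Python analogy for a Pig map; this is not Pig code.
# Pig writes the map as [name#Raja, age#30]; keys are chararray (strings).
info = {"name": "Raja", "age": 30}

def lookup(m, key):
    """Return the value stored under key, or None when the key is absent."""
    return m.get(key)

print(lookup(info, "name"))  # Raja
print(lookup(info, "city"))  # None
```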
Installation of PIG
Download
export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin
source ~/.bashrc
Start the Pig
pig -x local
pig -x mapreduce
Grunt shell
Data Processing with PIG
Example: movies_data.csv
1,Dhadakebaz,1986,3.2,7560
2,Dhumdhadaka,1985,3.8,6300
3,Ashi hi banva banvi,1988,4.1,7802
4,Zapatlela,1993,3.7,6022
5,Ayatya Gharat Gharoba,1991,3.4,5420
6,Navra Maza Navsacha,2004,3.9,4904
7,De danadan,1987,3.4,5623
8,Gammat Jammat,1987,3.4,7563
9,Eka peksha ek,1990,3.2,6244
10,Pachhadlela,2004,3.1,6956
Load data
• $ pig -x local
• grunt> movies = LOAD
'movies_data.csv' USING
PigStorage(',') as
(id,name,year,rating,duration);
• grunt> movies_greater_than_35 =
FILTER movies BY (float)rating > 3.5;
cat my_movies/part-m-00000
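The cast in the FILTER statement matters because fields loaded without declared types are untyped. A minimal plain-Python sketch of the same idea follows, assuming the first four rows of movies_data.csv are inlined as a string so that it runs without the file; it is an analogy, not Pig code.

```python
import csv
import io

# First four rows of movies_data.csv from the slide, inlined for the sketch.
RAW = """1,Dhadakebaz,1986,3.2,7560
2,Dhumdhadaka,1985,3.8,6300
3,Ashi hi banva banvi,1988,4.1,7802
4,Zapatlela,1993,3.7,6022
"""

# Without a declared schema every field arrives untyped (here: as str).
movies = [tuple(row) for row in csv.reader(io.StringIO(RAW))]

# Counterpart of: FILTER movies BY (float)rating > 3.5
# The rating is cast to a number first, then compared.
movies_greater_than_35 = [m for m in movies if float(m[3]) > 3.5]

for m in movies_greater_than_35:
    print(m)
```

Rows 2, 3 and 4 pass the filter; row 1 (rating 3.2) is dropped.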
Load command
• grunt> movies = LOAD
'movies_data.csv' USING
PigStorage(',') as (id:int,
name:chararray, year:int,
rating:double, duration:int);
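Declaring types in the LOAD schema means each field is converted once, at load time. The plain-Python sketch below illustrates that idea; the names Movie and parse_line are hypothetical, and Python's float stands in for Pig's double.

```python
from typing import NamedTuple

class Movie(NamedTuple):
    id: int
    name: str        # chararray in Pig
    year: int
    rating: float    # double in Pig
    duration: int

def parse_line(line: str) -> Movie:
    """Split one comma-delimited line and cast each field to its declared type."""
    id_, name, year, rating, duration = line.rstrip("\n").split(",")
    return Movie(int(id_), name, int(year), float(rating), int(duration))

movie = parse_line("3,Ashi hi banva banvi,1988,4.1,7802")
print(movie)
print(movie.rating > 3.5)  # no cast needed once the field is typed
```

With the types in place, later statements can compare rating or duration directly instead of casting in every expression.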
Check the filters
Movies greater than 2 hours
Describe
grunt> DESCRIBE movies;
movies: {id: int, name: chararray,
year: int, rating: double, duration:
int}
Foreach
grunt> grouped_by_year = group movies
by year;
grunt> count_by_year = FOREACH
grouped_by_year GENERATE group,
COUNT(movies);
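What GROUP followed by FOREACH ... GENERATE group, COUNT(...) computes can be sketched in plain Python with a counter keyed on year. This is an analogy, not Pig code; only the id, name and year columns of the ten sample rows are reproduced.

```python
from collections import Counter

# (id, name, year) for the ten rows of movies_data.csv shown earlier.
movies = [
    (1, "Dhadakebaz", 1986),
    (2, "Dhumdhadaka", 1985),
    (3, "Ashi hi banva banvi", 1988),
    (4, "Zapatlela", 1993),
    (5, "Ayatya Gharat Gharoba", 1991),
    (6, "Navra Maza Navsacha", 2004),
    (7, "De danadan", 1987),
    (8, "Gammat Jammat", 1987),
    (9, "Eka peksha ek", 1990),
    (10, "Pachhadlela", 2004),
]

# Group by year and count the members of each group in one step.
count_by_year = Counter(year for _, _, year in movies)

for year, n in sorted(count_by_year.items()):
    print(year, n)
```

For this sample, 1987 and 2004 each have two movies and every other year has one.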
Output
Order by
From 1985 to 2004
Limit
grunt> top_5_movies = LIMIT movies 5;
grunt> DUMP top_5_movies;
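Ordering and limiting can be pictured in plain Python as a sort followed by a slice. The sketch assumes ascending order by year and reuses the sample rows; it is an analogy, not Pig code. As far as I know, LIMIT on a relation that has not been ordered may return any five tuples, so sorting first is what makes the result predictable.

```python
# (id, name, year) for the ten rows of movies_data.csv shown earlier.
movies = [
    (1, "Dhadakebaz", 1986),
    (2, "Dhumdhadaka", 1985),
    (3, "Ashi hi banva banvi", 1988),
    (4, "Zapatlela", 1993),
    (5, "Ayatya Gharat Gharoba", 1991),
    (6, "Navra Maza Navsacha", 2004),
    (7, "De danadan", 1987),
    (8, "Gammat Jammat", 1987),
    (9, "Eka peksha ek", 1990),
    (10, "Pachhadlela", 2004),
]

# Counterpart of ordering by year, ascending (1985 first, 2004 last).
ordered = sorted(movies, key=lambda m: m[2])

# Counterpart of LIMIT ... 5: keep the first five tuples.
first_five = ordered[:5]

for m in first_five:
    print(m)
```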
Pig: Modes of Execution
$ pig -x local scriptfile.pig
Grunt mode
• You can also run pig scripts from grunt using run and exec
commands.
grunt> run scriptfile.pig
grunt> exec scriptfile.pig
Embedded mode
lines = LOAD 'shivneri.txt' AS
(line:chararray);
words = FOREACH lines GENERATE
FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
w_count = FOREACH grouped GENERATE group,
COUNT(words);
DUMP w_count;
forts.pig
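The word-count script above can be mirrored step by step in plain Python. This is a sketch, not Pig code: the function name word_count and the sample lines are made up (the contents of shivneri.txt are not shown), and splitting on whitespace only is a simplification of what TOKENIZE does.

```python
from collections import Counter

def word_count(lines):
    """Count words across all lines, mirroring the Pig script's steps."""
    # TOKENIZE splits each line into words; FLATTEN merges them into one stream.
    words = [w for line in lines for w in line.split()]
    # GROUP words BY word, then COUNT each group.
    return Counter(words)

# Made-up stand-in for the lines of shivneri.txt.
sample = ["shivneri fort", "fort of shivneri", "hill fort"]
w_count = word_count(sample)

for word, n in sorted(w_count.items()):
    print(word, n)
```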
Output snapshot
Blogs
Web Resources
http://digitallocha.blogspot.in
http://tusharkute.com
http://kyamputar.blogspot.in