Hive and Pig!

Juliana Freire
New York University


Some slides from J. Lin
Need for High-Level Languages!
•  Hadoop is great for large-data processing!
•  But writing Java programs for everything is verbose and slow
•  Not everyone wants to (or can) write Java code
•  Solution: develop higher-level data processing languages
•  Hive: HQL is like SQL
•  Pig: Pig Latin is a bit like Perl
Hive and Pig!
•  Hive: data warehousing application in Hadoop
•  Query language is HQL, variant of SQL
•  Tables stored on HDFS as flat files
•  Developed by Facebook, now open source
•  Pig: large-scale data processing system
•  Scripts are written in Pig Latin, a dataflow language
•  Developed by Yahoo!, now open source
•  Roughly 1/3 of all Yahoo! internal jobs
•  Common idea:
•  Provide higher-level language to facilitate large-data processing
•  Higher-level language “compiles down” to Hadoop jobs
Hive: Background!
•  Started at Facebook
•  Data was collected by nightly cron jobs into Oracle DB
•  "ETL" via hand-coded Python
•  Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that

[Diagram: data moves from OLTP systems through an ETL (Extract, Transform, and Load) process built on Hadoop into OLAP systems.]
Source: cc-licensed slide by Cloudera



Hive Components!
•  Shell: allows interactive queries
•  Driver: session handles, fetch, execute
•  Compiler: parse, plan, optimize
•  Execution engine: DAG of stages (MR, HDFS, metadata)
•  Metastore: schema, location in HDFS, SerDe

[Figure 1: Hive Architecture, from Thusoo et al.]

Source: cc-licensed slide by Cloudera
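As a concrete illustration of the compiler and execution engine (a hedged sketch, not from the original slides; the status_updates table is the running example introduced later), Hive's EXPLAIN shows the optimized plan as a DAG of map-reduce and HDFS stages:

EXPLAIN
SELECT ds, COUNT(1)
FROM status_updates
GROUP BY ds;
-- the output lists STAGE DEPENDENCIES and STAGE PLANS: the DAG of stages Hive will run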
Data Model!
•  Tables: analogous to tables in RDBMS
•  Typed columns (int, float, string, boolean)
•  Structs: {a INT; b INT}
•  Also: lists, arrays, maps (for JSON-like data)
•  Partitions
•  For example, range-partition tables by date
•  Buckets
•  Hash partitions within ranges (useful for sampling, join optimization)

From the Hive paper:
• Tables - Each table has a corresponding HDFS directory. The data in a table is serialized and stored in files within that directory. Users can associate tables with serialization formats. Hive provides builtin serialization formats which exploit compression and lazy de-serialization. Users can also add support for new data formats by defining custom serialize and de-serialize methods (called SerDe's) written in Java. The serialization format of each table is stored in the system catalog and is automatically used by Hive during query compilation and execution. Hive also supports external tables on data stored in HDFS, NFS or local directories.
• Partitions - Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory. Suppose data for table T is in the directory /wh/T. If T is partitioned on columns ds and ctry, then data with a particular ds value 20090101 and ctry value US will be stored in files within the directory /wh/T/ds=20090101/ctry=US.
• Buckets - Data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory.

[Thusoo et al., VLDB 2009]
Source: cc-licensed slide by Cloudera
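Putting these pieces together (a minimal, hypothetical sketch; the column names and bucket count are illustrative, not from the slides), a table with typed and nested columns, a date partition, and hash buckets can be declared in HiveQL as:

CREATE TABLE status_updates_ext (
  userid INT,
  status STRING,
  tags   ARRAY<STRING>,
  props  MAP<STRING, STRING>,
  loc    STRUCT<a: INT, b: INT>
)
PARTITIONED BY (ds STRING)                  -- partition by date
CLUSTERED BY (userid) INTO 32 BUCKETS       -- hash buckets within each partition
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;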
Metastore!
•  Database: namespace containing a set of tables
•  Holds table definitions (column types, physical layout)
•  Holds partitioning information
•  Can be stored in Derby, MySQL, and many other relational databases

Source: cc-licensed slide by Cloudera
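For example (a hedged sketch; status_updates is the running example table used elsewhere in these slides), the information tracked by the metastore can be inspected from the Hive shell:

SHOW DATABASES;                        -- namespaces
SHOW TABLES;                           -- tables in the current database
DESCRIBE EXTENDED status_updates;      -- column types, SerDe, HDFS location
SHOW PARTITIONS status_updates;        -- partitions registered for the table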



Physical Layout!
•  Warehouse directory in HDFS
•  E.g., /user/hive/warehouse
•  Tables stored in subdirectories of warehouse
•  Partitions form subdirectories of tables
•  Each table has a corresponding HDFS directory
•  Actual data stored in flat files
•  Users can associate a table with a serialization format
•  Control char-delimited text, or SequenceFiles
•  With custom SerDe, can use arbitrary format

Source: cc-licensed slide by Cloudera
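For example (a hypothetical layout, assuming the default warehouse directory and the partitioned status_updates table used as the running example), the on-disk structure would look roughly like:

/user/hive/warehouse/status_updates/                         <- table directory
/user/hive/warehouse/status_updates/ds=2009-03-20/           <- one partition
/user/hive/warehouse/status_updates/ds=2009-03-21/           <- another partition
/user/hive/warehouse/status_updates/ds=2009-03-20/000000_0   <- flat data file(s)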



Hive: Example!
•  Hive looks similar to an SQL database
•  Relational join on two tables:
•  Table of word counts from Shakespeare collection
•  Table of word counts from the bible
SELECT s.word, s.freq, k.freq FROM shakespeare s
  JOIN bible k ON (s.word = k.word)
  WHERE s.freq >= 1 AND k.freq >= 1
  ORDER BY s.freq DESC LIMIT 10;



word    Shakespeare freq    Bible freq
the     25848               62394
I       23031               8854
and     19671               38985
to      18038               13526
of      16700               34654
a       14170               8057
you     12702               2720
my      11297               4135
in      10797               12445
is      8882                6884

Source: Material drawn from Cloudera training VM
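The two word-count tables are ordinary Hive tables; a minimal sketch of how they might be created and loaded (the delimiter and input paths are assumptions, not from the slides):

CREATE TABLE shakespeare (word STRING, freq INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
CREATE TABLE bible (word STRING, freq INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

LOAD DATA INPATH '/tmp/shakespeare_freq' INTO TABLE shakespeare;   -- hypothetical path
LOAD DATA INPATH '/tmp/bible_freq' INTO TABLE bible;               -- hypothetical path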

Hive: Another Example!
From the Hive paper's running example (StatusMeme):

When Facebook users update their status, the updates are logged into flat files in an NFS directory /logs/status_updates, which are rotated every day. We load this data into Hive on a daily basis into a table

status_updates(userid int, status string, ds string)

using a load statement like the one below.

LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates PARTITION (ds='2009-03-20')

Each status update record contains the user identifier (userid), the actual status string (status), and the date (ds) when the status update occurred. This table is partitioned on the ds column. Detailed user profile information, like the gender of the user and the school the user is attending, is available in the profiles(userid int, school string, gender int) table.

We first want to compute daily statistics on the frequency of status updates based on gender and school which the user attends. The following multi-table insert statement generates the daily counts of status updates by school (into school_summary(school string, cnt int, ds string)) and gender (into gender_summary(gender int, cnt int, ds string)) using a single scan of the join of the status_updates and profiles tables. Note that the output tables are also partitioned on the ds column, and HiveQL allows users to insert query results into a specific partition of the output table.

FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
           ON (a.userid = b.userid AND a.ds='2009-03-20')
     ) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school

Next, we want to display the ten most popular memes per school, as determined by status updates by users who attend that school. We now show how this computation can be done using HiveQL's map-reduce constructs.
Hive: Another Example!
•  HiveQL provides MapReduce constructs
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
      FROM (MAP b.school, a.status
            USING 'meme-extractor.py' AS (school, meme)
            FROM status_updates a JOIN profiles b
            ON (a.userid = b.userid)
           ) subq1
      GROUP BY subq1.school, subq1.meme
      DISTRIBUTE BY school, meme
      SORT BY school, meme, cnt DESC
     ) subq2;

Source: Material drawn from Cloudera training VM



Example Data Analysis Task!

Find users who tend to visit “good” pages.


Visits(user, url, time):
  Amy    www.cnn.com         8:00
  Amy    www.crap.com        8:05
  Amy    www.myblog.com      10:00
  Amy    www.flickr.com      10:05
  Fred   cnn.com/index.htm   12:00
  . . .

Pages(url, pagerank):
  www.cnn.com      0.9
  www.flickr.com   0.9
  www.myblog.com   0.7
  www.crap.com     0.2
  . . .

Pig Slides adapted from Olston et al.



Conceptual Dataflow!
Load Visits(user, url, time); canonicalize URLs
Load Pages(url, pagerank)
Join on url = url
Group by user
Compute average pagerank
Filter avgPR > 0.5

Pig Slides adapted from Olston et al.



System-Level Dataflow!
The same dataflow at the system level: Visits and Pages are read by many parallel instances of each operator (load, canonicalize, join by url, group by user, compute average pagerank, filter), producing the answer.

Pig Slides adapted from Olston et al.

MapReduce Code!
[The slide reproduces, in three columns, the full Java MapReduce program from Olston et al.'s Pig slides for the task "Find top 100 sites for users 18 to 25": mapper/reducer classes (LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks, LimitClicks) plus a main() that configures five jobs (Load Pages, Load and Filter Users, Join Users and Pages, Group URLs, Top 100 sites) and chains them with JobControl. The column-interleaved code is not reproduced here; it runs to several pages, compared with the short Pig Latin script on the next slide.]

Pig Slides adapted from Olston et al.



Pig Latin Script!

Visits = load '/data/visits' as (user, url, time);

Visits = foreach Visits generate user, Canonicalize(url), time;

Pages = load '/data/pages' as (url, pagerank);

VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > '0.5';

store GoodUsers into '/data/good_users';

Pig Slides adapted from Olston et al.
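Canonicalize is a user-defined function; the slides do not show where it comes from. A minimal, hypothetical sketch of wiring in such a UDF (the jar name and class name are illustrative, not from the slides):

REGISTER myudfs.jar;                                       -- jar containing the UDF (hypothetical)
DEFINE Canonicalize com.example.pig.CanonicalizeUDF();     -- bind a short alias to the class

The script can then be run with pig script.pig (MapReduce mode) or pig -x local script.pig for local testing.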



Java vs. Pig Latin!

1/20 the lines of code; 1/16 the development time

[Bar charts: lines of code, Hadoop vs. Pig, and development time in minutes, Hadoop vs. Pig; exact bar values are not recoverable from the text extraction.]

Performance on par with raw Hadoop!

Pig Slides adapted from Olston et al.



Pig takes care of…!
•  Schema and type checking
•  Translating into an efficient physical dataflow
•  (i.e., a sequence of one or more MapReduce jobs; see the explain sketch below)
•  Exploiting data reduction opportunities
•  (e.g., early partial aggregation via a combiner)
•  Executing the system-level dataflow
•  (i.e., running the MapReduce jobs)
•  Tracking progress, errors, etc.
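To see the physical dataflow Pig produces (a minimal sketch; GoodUsers is the alias from the earlier script), the explain command in the Grunt shell prints the logical, physical, and MapReduce plans:

grunt> explain GoodUsers;    -- shows how the script compiles into MapReduce job(s)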
References!
•  Getting started with Pig: http://pig.apache.org/docs/r0.11.1/start.html
•  Pig Tutorial: http://pig.apache.org/docs/r0.7.0/tutorial.html
•  Hive Tutorial: https://cwiki.apache.org/confluence/display/Hive/Tutorial
Questions!
