Hive and Pig!

Juliana Freire
New York University


Some slides from J. Lin
Need for High-Level Languages!
•  Hadoop is great for large-data processing!
•  But writing Java programs for everything is verbose and slow
•  Not everyone wants to (or can) write Java code
•  Solution: develop higher-level data processing languages
•  Hive: HQL is like SQL
•  Pig: Pig Latin is a bit like Perl
Hive and Pig!
•  Hive: data warehousing application in Hadoop
•  Query language is HQL, variant of SQL
•  Tables stored on HDFS as flat files
•  Developed by Facebook, now open source
•  Pig: large-scale data processing system
•  Scripts are written in Pig Latin, a dataflow language
•  Developed by Yahoo!, now open source
•  Roughly 1/3 of all Yahoo! internal jobs
•  Common idea:
•  Provide higher-level language to facilitate large-data processing
•  Higher-level language “compiles down” to Hadoop jobs
Hive: Background!
•  Started at Facebook
•  Data was collected by nightly cron jobs into Oracle DB
•  "ETL" via hand-coded Python
•  Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that

[Diagram: data moves from OLTP systems through an ETL (Extract, Transform, and Load) process built on Hadoop into OLAP systems.]
Source: cc-licensed slide by Cloudera



Hive Components!
•  Shell: allows interactive queries
•  Driver: session handles, fetch, execute
•  Compiler: parse, plan, optimize
•  Execution engine: DAG of stages (MR, HDFS, metadata)
•  Metastore: schema, location in HDFS, SerDe

[Figure 1: Hive Architecture, from Thusoo et al.]

Source: cc-licensed slide by Cloudera
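As a concrete illustration of the compiler and execution engine (a hedged sketch, not from the original slides; the status_updates table is the running example introduced later), Hive's EXPLAIN shows the optimized plan as a DAG of map-reduce and HDFS stages:

EXPLAIN
SELECT ds, COUNT(1)
FROM status_updates
GROUP BY ds;
-- the output lists STAGE DEPENDENCIES and STAGE PLANS: the DAG of stages Hive will run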
Data Model!
•  Tables: analogous to tables in RDBMS
•  Typed columns (int, float, string, boolean)
•  Structs: {a INT; b INT}
•  Also: lists, arrays, maps (for JSON-like data)
•  Partitions
•  For example, range-partition tables by date
•  Buckets
•  Hash partitions within ranges (useful for sampling, join optimization)

From the Hive paper:
• Tables - Each table has a corresponding HDFS directory. The data in a table is serialized and stored in files within that directory. Users can associate tables with serialization formats. Hive provides builtin serialization formats which exploit compression and lazy de-serialization. Users can also add support for new data formats by defining custom serialize and de-serialize methods (called SerDe's) written in Java. The serialization format of each table is stored in the system catalog and is automatically used by Hive during query compilation and execution. Hive also supports external tables on data stored in HDFS, NFS or local directories.
• Partitions - Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory. Suppose data for table T is in the directory /wh/T. If T is partitioned on columns ds and ctry, then data with a particular ds value 20090101 and ctry value US will be stored in files within the directory /wh/T/ds=20090101/ctry=US.
• Buckets - Data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory.

[Thusoo et al., VLDB 2009]
Source: cc-licensed slide by Cloudera
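Putting these pieces together (a minimal, hypothetical sketch; the column names and bucket count are illustrative, not from the slides), a table with typed and nested columns, a date partition, and hash buckets can be declared in HiveQL as:

CREATE TABLE status_updates_ext (
  userid INT,
  status STRING,
  tags   ARRAY<STRING>,
  props  MAP<STRING, STRING>,
  loc    STRUCT<a: INT, b: INT>
)
PARTITIONED BY (ds STRING)                  -- partition by date
CLUSTERED BY (userid) INTO 32 BUCKETS       -- hash buckets within each partition
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;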
Metastore!
•  Database: namespace containing a set of tables
•  Holds table definitions (column types, physical layout)
•  Holds partitioning information
•  Can be stored in Derby, MySQL, and many other relational databases

Source: cc-licensed slide by Cloudera
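For example (a hedged sketch; status_updates is the running example table used elsewhere in these slides), the information tracked by the metastore can be inspected from the Hive shell:

SHOW DATABASES;                        -- namespaces
SHOW TABLES;                           -- tables in the current database
DESCRIBE EXTENDED status_updates;      -- column types, SerDe, HDFS location
SHOW PARTITIONS status_updates;        -- partitions registered for the table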



Physical Layout!
•  Warehouse directory in HDFS
•  E.g., /user/hive/warehouse
•  Tables stored in subdirectories of warehouse
•  Partitions form subdirectories of tables
•  Each table has a corresponding HDFS directory
•  Actual data stored in flat files
•  Users can associate a table with a serialization format
•  Control char-delimited text, or SequenceFiles
•  With custom SerDe, can use arbitrary format

Source: cc-licensed slide by Cloudera
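For example (a hypothetical layout, assuming the default warehouse directory and the partitioned status_updates table used as the running example), the on-disk structure would look roughly like:

/user/hive/warehouse/status_updates/                         <- table directory
/user/hive/warehouse/status_updates/ds=2009-03-20/           <- one partition
/user/hive/warehouse/status_updates/ds=2009-03-21/           <- another partition
/user/hive/warehouse/status_updates/ds=2009-03-20/000000_0   <- flat data file(s)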



Hive: Example!
•  Hive looks similar to an SQL database
•  Relational join on two tables:
•  Table of word counts from Shakespeare collection
•  Table of word counts from the bible
SELECT s.word, s.freq, k.freq FROM shakespeare s
  JOIN bible k ON (s.word = k.word)
  WHERE s.freq >= 1 AND k.freq >= 1
  ORDER BY s.freq DESC LIMIT 10;



word    Shakespeare freq    Bible freq
the     25848               62394
I       23031               8854
and     19671               38985
to      18038               13526
of      16700               34654
a       14170               8057
you     12702               2720
my      11297               4135
in      10797               12445
is      8882                6884

Source: Material drawn from Cloudera training VM
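The two word-count tables are ordinary Hive tables; a minimal sketch of how they might be created and loaded (the delimiter and input paths are assumptions, not from the slides):

CREATE TABLE shakespeare (word STRING, freq INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
CREATE TABLE bible (word STRING, freq INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

LOAD DATA INPATH '/tmp/shakespeare_freq' INTO TABLE shakespeare;   -- hypothetical path
LOAD DATA INPATH '/tmp/bible_freq' INTO TABLE bible;               -- hypothetical path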

Hive: Another Example!
From the Hive paper's running example (StatusMeme):

When Facebook users update their status, the updates are logged into flat files in an NFS directory /logs/status_updates, which are rotated every day. We load this data into Hive on a daily basis into a table

status_updates(userid int, status string, ds string)

using a load statement like the one below.

LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates PARTITION (ds='2009-03-20')

Each status update record contains the user identifier (userid), the actual status string (status), and the date (ds) when the status update occurred. This table is partitioned on the ds column. Detailed user profile information, like the gender of the user and the school the user is attending, is available in the profiles(userid int, school string, gender int) table.

We first want to compute daily statistics on the frequency of status updates based on gender and school which the user attends. The following multi-table insert statement generates the daily counts of status updates by school (into school_summary(school string, cnt int, ds string)) and gender (into gender_summary(gender int, cnt int, ds string)) using a single scan of the join of the status_updates and profiles tables. Note that the output tables are also partitioned on the ds column, and HiveQL allows users to insert query results into a specific partition of the output table.

FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
           ON (a.userid = b.userid AND a.ds='2009-03-20')
     ) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school

Next, we want to display the ten most popular memes per school, as determined by status updates by users who attend that school. We now show how this computation can be done using HiveQL's map-reduce constructs.
Hive: Another Example!
•  HiveQL provides MapReduce constructs
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
      FROM (MAP b.school, a.status
            USING 'meme-extractor.py' AS (school, meme)
            FROM status_updates a JOIN profiles b
            ON (a.userid = b.userid)
           ) subq1
      GROUP BY subq1.school, subq1.meme
      DISTRIBUTE BY school, meme
      SORT BY school, meme, cnt DESC
     ) subq2;

Source: Material drawn from Cloudera training VM



Example Data Analysis Task!

Find users who tend to visit “good” pages.


Visits(user, url, time):
  Amy    www.cnn.com         8:00
  Amy    www.crap.com        8:05
  Amy    www.myblog.com      10:00
  Amy    www.flickr.com      10:05
  Fred   cnn.com/index.htm   12:00
  . . .

Pages(url, pagerank):
  www.cnn.com      0.9
  www.flickr.com   0.9
  www.myblog.com   0.7
  www.crap.com     0.2
  . . .

Pig Slides adapted from Olston et al.



Conceptual Dataflow!
Load Visits(user, url, time); canonicalize URLs
Load Pages(url, pagerank)
Join on url = url
Group by user
Compute average pagerank
Filter avgPR > 0.5

Pig Slides adapted from Olston et al.



System-Level Dataflow!
The same dataflow at the system level: Visits and Pages are read by many parallel instances of each operator (load, canonicalize, join by url, group by user, compute average pagerank, filter), producing the answer.

Pig Slides adapted from Olston et al.

MapReduce Code!
[The slide reproduces, in three columns, the full Java MapReduce program from Olston et al.'s Pig slides for the task "Find top 100 sites for users 18 to 25": mapper/reducer classes (LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks, LimitClicks) plus a main() that configures five jobs (Load Pages, Load and Filter Users, Join Users and Pages, Group URLs, Top 100 sites) and chains them with JobControl. The column-interleaved code is not reproduced here; it runs to several pages, compared with the short Pig Latin script on the next slide.]

Pig Slides adapted from Olston et al.



Pig Latin Script!

Visits = load '/data/visits' as (user, url, time);

Visits = foreach Visits generate user, Canonicalize(url), time;

Pages = load '/data/pages' as (url, pagerank);

VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > '0.5';

store GoodUsers into '/data/good_users';

Pig Slides adapted from Olston et al.
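Canonicalize is a user-defined function; the slides do not show where it comes from. A minimal, hypothetical sketch of wiring in such a UDF (the jar name and class name are illustrative, not from the slides):

REGISTER myudfs.jar;                                       -- jar containing the UDF (hypothetical)
DEFINE Canonicalize com.example.pig.CanonicalizeUDF();     -- bind a short alias to the class

The script can then be run with pig script.pig (MapReduce mode) or pig -x local script.pig for local testing.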



Java vs. Pig Latin!

1/20 the lines of code; 1/16 the development time

[Bar charts: lines of code, Hadoop vs. Pig, and development time in minutes, Hadoop vs. Pig; exact bar values are not recoverable from the text extraction.]

Performance on par with raw Hadoop!

Pig Slides adapted from Olston et al.



Pig takes care of…!
•  Schema and type checking
•  Translating into an efficient physical dataflow
•  (i.e., a sequence of one or more MapReduce jobs; see the explain sketch below)
•  Exploiting data reduction opportunities
•  (e.g., early partial aggregation via a combiner)
•  Executing the system-level dataflow
•  (i.e., running the MapReduce jobs)
•  Tracking progress, errors, etc.
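To see the physical dataflow Pig produces (a minimal sketch; GoodUsers is the alias from the earlier script), the explain command in the Grunt shell prints the logical, physical, and MapReduce plans:

grunt> explain GoodUsers;    -- shows how the script compiles into MapReduce job(s)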
References!
•  Getting started with Pig: http://pig.apache.org/docs/r0.11.1/start.html
•  Pig Tutorial: http://pig.apache.org/docs/r0.7.0/tutorial.html
•  Hive Tutorial: https://cwiki.apache.org/confluence/display/Hive/Tutorial
Questions!
