
EMR Workshop Lab 2 – Hive, Pig & EMR Steps

(Updated 14-Nov-18)

This lab demonstrates submitting Hive/Pig work to an Amazon EMR cluster.

You can submit Hive work to your cluster interactively, or you can submit work as a cluster step
using the console, CLI, or API. You can submit steps when the cluster is launched, or you can
submit steps to a running cluster.
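For reference, submitting a step to a running cluster from the AWS CLI looks roughly like the sketch below. The cluster ID, step name, and script path are placeholders, and the exact arguments for each step are covered in Exercise 2; this lab itself uses the console.

# sketch only: placeholder cluster ID and script path
aws emr add-steps --cluster-id j-XXXXXXXXXXXX \
  --steps Type=HIVE,Name=MyHiveStep,ActionOnFailure=CONTINUE,Args=[-f,s3://<YOUR-BUCKET>/files/my-script.hql]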

Exercise 1: Process data interactively

• Create an S3 bucket with folders:
  o files
  o logs
  o input
  o output
• Get the sample data from here (1.8 MB file):
https://s3.amazonaws.com/aws-data-analytics-blog/emrimmersionday/tripdata.csv
• Upload the file to the "input" folder in your S3 bucket (a CLI sketch of this setup appears at the end of this exercise).
• SSH to the master node of your previously created cluster.
• Run "hive" and create an external table by following these steps:

[hadoop@ip-10-0-0-135 ~]$ hive

• Copy and paste the following script, making sure that you don't pick up any invisible characters.
Use vi on Mac/Linux or Notepad on Windows. Alternatively, you can download this
script from here and edit it:

hive>
CREATE EXTERNAL TABLE ny_taxi_test (
vendor_id int,
lpep_pickup_datetime string,
lpep_dropoff_datetime string,
store_and_fwd_flag string,
rate_code_id smallint,
pu_location_id int,
do_location_id int,
passenger_count int,
trip_distance double,
fare_amount double,
mta_tax double,
tip_amount double,
tolls_amount double,
ehail_fee double,
improvement_surcharge double,
total_amount double,
payment_type smallint,
trip_type smallint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "s3://<YOUR-BUCKET>/input/";

• Run a test query. This query scans the NY taxi data and shows the distinct rate code IDs (5 different values).

hive> select distinct rate_code_id from ny_taxi_test;
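For reference, the bucket creation and data upload at the start of this exercise can also be done from the command line. This is only a sketch, assuming the AWS CLI is installed and configured; <YOUR-BUCKET> is a placeholder, and the remaining folders can still be created from the S3 console as described above.

aws s3 mb s3://<YOUR-BUCKET>                                                       # create the bucket
curl -O https://s3.amazonaws.com/aws-data-analytics-blog/emrimmersionday/tripdata.csv
aws s3 cp tripdata.csv s3://<YOUR-BUCKET>/input/                                   # upload the sample data to the "input" folder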

Exercise 2: Processing data with EMR steps

After you’ve created the Hive table and queried your data, you can practice scheduling the job
on the cluster using EMR steps.

Hive Step

• You will have to create a ny-taxi.hql text file and upload it to your "files" folder.
• Copy and paste the following script into ny-taxi.hql, making sure that you don't pick up any
invisible characters. Use vi on Mac/Linux or Notepad on Windows. Alternatively, you can
download this script from here and edit it:

CREATE EXTERNAL TABLE ny_taxi (
vendor_id int,
lpep_pickup_datetime string,
lpep_dropoff_datetime string,
store_and_fwd_flag string,
rate_code_id smallint,
pu_location_id int,
do_location_id int,
passenger_count int,
trip_distance double,
fare_amount double,
mta_tax double,
tip_amount double,
tolls_amount double,
ehail_fee double,
improvement_surcharge double,
total_amount double,
payment_type smallint,
trip_type smallint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "${INPUT}";

INSERT OVERWRITE DIRECTORY "${OUTPUT}"
SELECT * FROM ny_taxi WHERE rate_code_id = 1;

This script queries the ny_taxi table and extracts the trips where the standard rate (rate_code_id = 1) was used.
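The ${INPUT} and ${OUTPUT} references are Hive variables; when the step runs, EMR fills them in from the Input and Output S3 locations you enter in the console below. If you want to test the script by hand on the master node first, an equivalent invocation would look roughly like this (a sketch; the bucket name is a placeholder):

aws s3 cp s3://<YOUR-BUCKET>/files/ny-taxi.hql .
hive -f ny-taxi.hql -d INPUT=s3://<YOUR-BUCKET>/input/ -d OUTPUT=s3://<YOUR-BUCKET>/output/hive/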

• Go to the EMR console, select your cluster, and scroll down to the "Steps" section.
• Click "Add step" and choose "Hive program" as the "Step type".
• You need to add 3 locations to this step.
1. Script S3 location: The location of the script you just uploaded to S3. The
format is: s3://<YOUR-BUCKET>/files/ny-taxi.hql
2. Input S3 location: The location of your data source.
(Note that you don't want to specify the file. Hive reads in folders, not files.) The
input location is: s3://<YOUR-BUCKET>/input/
3. Output S3 location: Where to store your processed data. The output location is:
s3://<YOUR-BUCKET>/output/hive/
• After you've added the necessary information, click "Add".
• Check the "output/hive" folder after about 3 minutes.
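If you prefer the command line to the S3 console, you can verify the step's output with a quick listing (a sketch, assuming the AWS CLI is configured):

aws s3 ls s3://<YOUR-BUCKET>/output/hive/ --recursive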

Pig Step

• Run a Pig script to parse the CSV data and transform it into TSV format.
• Create a ny-taxi.pig text file and upload it to the "files" folder.
• Copy and paste the following script into ny-taxi.pig, making sure that you don't pick up any
invisible characters. Use vi on Mac/Linux or Notepad on Windows. Alternatively, you can
download this script from here and edit it:

DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();

NY_TAXI = LOAD '$INPUT' USING CSVLoader(',') AS
(vendor_id:int,
lpep_pickup_datetime:chararray,
lpep_dropoff_datetime:chararray,
store_and_fwd_flag:chararray,
rate_code_id:int,
pu_location_id:int,
do_location_id:int,
passenger_count:int,
trip_distance:double,
fare_amount:double,
mta_tax:double,
tip_amount:double,
tolls_amount:double,
ehail_fee:double,
improvement_surcharge:double,
total_amount:double,
payment_type:int,
trip_type:int);

STORE NY_TAXI into '$OUTPUT' USING PigStorage('\t');

This script parses the data stored as a CSV file on S3 and writes it out in tab-delimited format.
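Inside the script, $INPUT and $OUTPUT are Pig parameters that the step supplies from the Input and Output S3 locations you enter in the console below. To try the script by hand on the master node, an invocation would look roughly like this (a sketch; the bucket name is a placeholder):

aws s3 cp s3://<YOUR-BUCKET>/files/ny-taxi.pig .
pig -p INPUT=s3://<YOUR-BUCKET>/input/tripdata.csv -p OUTPUT=s3://<YOUR-BUCKET>/output/pig/ -f ny-taxi.pig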

• Go to the EMR console, select your cluster, and scroll down to the "Steps" section.
• Click "Add step" and choose "Pig program" as the "Step type".
• You need to add 3 locations to this step.
1. Script S3 location: The first is the location of the script you just uploaded. The format
is: s3://<YOUR-BUCKET>/files/ny-taxi.pig
2. Input S3 location: Your data source (unlike Hive, Pig needs the path to the file itself,
not just the folder). The input location is: s3://<YOUR-BUCKET>/input/tripdata.csv
3. Output S3 location: Where to store your processed data. The output location is:
s3://<YOUR-BUCKET>/output/pig/
• After you've added the necessary information, click "Add".
• Check the "output/pig" folder after about 2 minutes.
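As with the Hive step, the same Pig step could also be submitted from the AWS CLI instead of the console. This is only a sketch with placeholder values for the cluster ID and bucket; the argument layout mirrors the three locations entered above:

aws emr add-steps --cluster-id j-XXXXXXXXXXXX \
  --steps Type=PIG,Name=NYTaxiPigStep,ActionOnFailure=CONTINUE,Args=[-f,s3://<YOUR-BUCKET>/files/ny-taxi.pig,-p,INPUT=s3://<YOUR-BUCKET>/input/tripdata.csv,-p,OUTPUT=s3://<YOUR-BUCKET>/output/pig/]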
