EMR Workshop - Lab 2
(Updated 14-Nov-18)
You can submit Hive work to your cluster interactively, or you can submit work as a cluster step
using the console, CLI, or API. You can submit steps when the cluster is launched, or you can
submit steps to a running cluster.
• Copy and paste the following script, making sure that it contains no invisible characters.
Use vi on Mac/Linux or Notepad on Windows. Alternatively, you can download this
script from here and edit it. Replace <YOUR-BUCKET> with the name of your own bucket:
hive>
CREATE EXTERNAL TABLE ny_taxi_test (
  vendor_id int,
  lpep_pickup_datetime string,
  lpep_dropoff_datetime string,
  store_and_fwd_flag string,
  rate_code_id smallint,
  pu_location_id int,
  do_location_id int,
  passenger_count int,
  trip_distance double,
  fare_amount double,
  mta_tax double,
  tip_amount double,
  tolls_amount double,
  ehail_fee double,
  improvement_surcharge double,
  total_amount double,
  payment_type smallint,
  trip_type smallint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "s3://<YOUR-BUCKET>/input/";
• Run a test query that reads the NY taxi data and returns 5 distinct rate code IDs.
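The test query itself is not reproduced in this handout. A minimal sketch of what it might look like, assuming the ny_taxi_test table defined above and reading "different" as distinct values (this is an illustration, not the workshop's original script):

```sql
-- Hypothetical test query; not the workshop's original script.
-- Returns up to five distinct rate code IDs from the external table
-- created above.
SELECT DISTINCT rate_code_id
FROM ny_taxi_test
LIMIT 5;
```

If the input files carry a header row, that row's rate_code_id cannot be parsed as a smallint, so NULL may show up among the returned values.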
After you’ve created the Hive table and queried your data, you can practice scheduling the job
on the cluster using EMR steps.
Hive Step
• Create a ny-taxi.hql text file and upload it to your "files" folder.
• Copy and paste the following script into ny-taxi.hql, making sure that it contains no
invisible characters. Use vi on Mac/Linux or Notepad on Windows. Alternatively, you can
download this script from here and edit it:
This script queries the ny_taxi table and extracts the trips that were charged the standard rate.
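The contents of ny-taxi.hql are not included in this handout. The following is a rough sketch of a script matching that description, under several assumptions: that ny_taxi does not exist yet and can be cloned from the ny_taxi_test definition created earlier, that rate_code_id = 1 denotes the standard rate (as in the NYC TLC data dictionary), and that results go to a hypothetical output/hive/ prefix in your bucket.

```sql
-- Sketch of a possible ny-taxi.hql; not the workshop's original script.
-- Assumption: ny_taxi is not defined yet, so it is cloned from the
-- ny_taxi_test definition created earlier in this lab and pointed at
-- the same input data.
CREATE EXTERNAL TABLE IF NOT EXISTS ny_taxi
LIKE ny_taxi_test
LOCATION 's3://<YOUR-BUCKET>/input/';

-- Assumption: rate_code_id = 1 marks the standard rate.
-- The output prefix is hypothetical; INSERT OVERWRITE replaces whatever
-- is already stored under it, so point it at an empty prefix.
INSERT OVERWRITE DIRECTORY 's3://<YOUR-BUCKET>/output/hive/'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT *
FROM ny_taxi
WHERE rate_code_id = 1;
```

Writing the result to S3 rather than printing it suits a cluster step, which runs unattended and has no interactive session in which to display rows.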
Pig Step
• Run a Pig script to parse data in CSV format and transform it into TSV format.
• Create a ny-taxi.pig text file and upload it to the "files" folder.
• Copy and paste the following script into ny-taxi.pig, making sure that it contains no
invisible characters. Use vi on Mac/Linux or Notepad on Windows. Alternatively, you can
download this script from here and edit it:
This script parses the data stored as CSV files on S3 and outputs it in tab-delimited format.