Processing Logs Using Amazon EMR
Scenario
Another university department has asked your group to look for trends in
the online advertising data that is contained in a set of weblog files. The
logs exist in multiple files because of the log rotation process that is used to
collect and store them. Two types of logs have been provided:
Impression logs: Each entry indicates an advertisement being
displayed to a user.
Click logs: Each entry indicates someone clicking on an
advertisement.
The following numbered steps describe how the pieces of the lab fit together:
1. After you create the EMR cluster, you will connect to an AWS Cloud9 integrated development environment (IDE). You will use a bash terminal to establish an SSH connection to the main node and launch the Hive command line interface (CLI). You will then run Hive commands, and many of those commands will invoke jobs to run on the cluster.
2. Some of the cluster jobs will read data from the SAMPLE (or source) data.
3. The requested data will be read into the cluster.
4. The jobs will process the data.
5. Hive table metadata will be stored in HDFS on the cluster. Some of the commands that you will run will also store temporary table data in HDFS.
6. Toward the end of the lab, you will join two tables and store the resulting joined data in a hive-output S3 bucket. Then, when you run SQL-like SELECT queries in Hive, the queries will read data from this hive-output bucket and return the results to the Hive client.
Task 1: Launching an Amazon EMR cluster
In this task, you will launch an EMR cluster with Hive installed.
3. To access the EMR console, in the search box to the right of Services,
search for and choose EMR.
Analysis: The Amazon EMR release that you choose determines the version
of Hadoop that will be installed on the cluster. You can also install many
other Hadoop-related projects, such as Hadoop User Experience (Hue), Pig,
and Spark. However, in this lab, you will only need Hadoop and Hive.
Hadoop will install and configure the cluster's internal HDFS as well as the
YARN resource scheduler and coordinator to process jobs on the cluster. Hive
is an open-source framework that you can use to store and query large
datasets. Hive is an abstraction layer that translates SQL queries into YARN
jobs that run and extract results from the data stored in HDFS. Hue and Pig
are additional components of the Hadoop framework that you won't need to
use in this lab.
Note: In the console, you might see the main node referred to as
the master node. These instructions will use the term main node.
o For the main/primary node type, choose the pencil icon next to the instance type, choose m4.large from the list, and then choose Save.
o Repeat the same process for the Core node type.
o Remove the Task 1 node group configuration.
o Verify that the instance counts are as follows for each node type: 1 for main, 2 for core, and 0 for task.
8. Configure the options for Step 4: Security configuration and EC2 key
pair
o For EC2 key pair, choose the vockey key pair, which already
exists in this account.
o If the vockey key pair is not listed, choose Create key pair, enter EMR-key as the name, keep the other default settings, and then choose Create key pair.
Note: You can see the EC2 security groups in the Summary section after you create the cluster.
Your cluster will now be provisioned. Don't wait for the provisioning
process to complete before continuing to the next step.
Tip: You can ignore the warning that says Auto-termination is not
available for this account when using this release of EMR.
10. Configure the security group for the main node to allow SSH
connections from the AWS Cloud9 instance.
A new tab opens to the Security Groups page in the Amazon EC2
console.
o In the bottom pane, choose the Inbound rules tab, and then
choose Edit inbound rules.
o At the bottom of the page, choose Add rule, and then configure
SSH access for the AWS Cloud9 instance that has been created
for you:
Type: Choose SSH.
Source: Choose Anywhere-IPv4.
Note: In the next task, you will use a bash terminal that is available in an AWS Cloud9 instance to connect to the EMR cluster through SSH. That is why you opened the main node to inbound SSH connections: so that the AWS Cloud9 instance can reach it.
11. Confirm that the cluster is now available.
o Return to the Amazon EMR console, which should still be open in
another browser tab.
o Refresh the page.
Important: Don't proceed to the next task until the status of the
cluster shows Waiting.
In this task, you provisioned an EMR cluster with Hive running on it. You also
configured the security group for the main node in the cluster so that it will
accept SSH connections.
12. On the cluster Summary tab, in the Summary section, find the
public DNS
address for the main node. Copy the value to your clipboard.
If an AWS Cloud9 environment does not already exist, follow the steps below to create one.
Navigate to AWS Cloud9 and choose Create environment.
Enter Cloud9 as the name, and keep the other default settings.
14. Copy the SSH key to the AWS Cloud9 instance and configure it.
o Above these lab instructions, choose AWS Details.
o In the panel that opens, choose Download PEM and save the
file to the directory of your choice.
o Return to the AWS Cloud9 IDE.
o Choose File > Upload Local Files....
o Upload the labsuser.pem file into the AWS Cloud9 IDE.
Note: After you upload the file, it appears under the Cloud9
Instance folder in the file list of the Environment window, on
the left side of the IDE.
Note the use of the hadoop user to log in. By logging in as this
user, you will be able to access the Hive client.
Note: The Hive client expects this directory to exist when it is run. You
will use the Hive client in a later step.
To find the full name of the hive-output bucket, list the S3 buckets in the account:
aws s3 ls
Run the following command. Replace <HIVE-BUCKET-NAME> with the full name of the hive-output bucket.
hive -d SAMPLE=s3://aws-tc-largeobjects/CUR-TF-200-ACDENG-1/emr-lab -d DAY=2009-04-13 -d HOUR=08 -d NEXT_DAY=2009-04-13 -d NEXT_HOUR=09 -d OUTPUT=s3://<HIVE-BUCKET-NAME>/output/
The command takes a few seconds to run and then displays output indicating that you are now connected to the interactive Hive client.
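Note: The CREATE EXTERNAL TABLE statement that the following analysis refers to is not reproduced in this copy of the lab. The listing below is a sketch of what it likely looks like, assuming the standard schema of the EMR advertising sample data; the exact statement in your lab instructions may differ slightly:

-- Sketch: external table over the JSON impression logs in the SAMPLE S3 location,
-- partitioned by dt and parsed with the JSON SerDe
CREATE EXTERNAL TABLE impressions (
    requestBeginTime string, adId string, impressionId string,
    referrer string, userAgent string, userCookie string, ip string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
    'paths' = 'requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip'
)
LOCATION '${SAMPLE}/tables/impressions';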
As you can see in the LOCATION part of the command, Hive is told that
the data is located in the S3 bucket specified in the SAMPLE key-value
pair that you used when you launched Hive in the previous task. Within
that bucket is a tables folder with a subfolder named impressions. The
impressions folder contains partitioned data.
If you are familiar with relational databases, where you first define a
table structure and then insert data into the table, what you did here
might seem backward and confusing. What you have done is point
Hive to existing data. This existing data has structure, and in the
create table command, you informed Hive what that structure is.
The following is a single line of the data that you pointed to in Amazon
S3. The data contains thousands of lines similar to this. Here, the line
is displayed with line breaks to make it easier for you to read, but in
the source data all of this information would be on a single line. This
source data is in JSON format and has 16 first-level name-value pairs.
However, your table definition referenced only seven of these.
Therefore, your table is only referencing some of the source data.
{
"number": "348309",
"referrer": "example.com",
"processId": "1505",
"adId": "2h8VvmXWGFrm4tAGoKPQ3UfWj3g4Jt",
"browserCookie": "bxsbcgatmx",
"userCookie": "CbHCJK7atU0RrnX4GAlh0w1kcvbuIJ",
"requestEndTime": "1239581899000",
"impressionId": "OtGijGGr0OjjUXnj181fXXsql7G3f6",
"userAgent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB6; .NET CLR
1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; InfoPa",
"timers": {
"modelLookup": "0.2444",
"requestTime": "0.7636"
},
"threadId": "98",
"ip": "22.68.182.56",
"modelId": "bxxiuxduab",
"hostname": "ec2-64-28-73-15.amazon.com",
"sessionId": "APrPRwrrXhwPUwsIKuOCCHpHSDTfxW",
"requestBeginTime": "1239581898000"
}
If you review the statement that you used to create the table, notice
how, when defining the row format, you referenced a JSON
serializer/de-serializer (called SerDe) to parse the source data.
19. Update the Hive metadata for the impressions table to include all
partitions.
o To discover the partitions that exist in the source data and add them to the table metadata, run the following commands:
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE impressions;
The output should now indicate that 241 partitions exist in the
table.
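If you want to confirm the partitions yourself, an optional check (not one of the lab steps) is to list them:

-- Lists every partition that the table metadata now knows about
SHOW PARTITIONS impressions;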
20. Create another external table and again discover its partitions.
This table will be named clicks, and it references click-stream logs.
{
"number": "8673",
"processId": "1010",
"adId": "S5KtvIerGRLpNxjnkn4MdRH2GqOq5A",
"requestEndTime": "1239612672000",
"impressionId": "PEWl8ecT2D77hdVce8oXdgtPe1m7kr",
"timers": {
"requestTime": "0.9578"
},
"threadId": "100",
"hostname": "ec2-0-51-75-39.amazon.com",
"requestBeginTime": "1239612671000"
}
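Note: The create statement for the clicks table is also not reproduced in this copy of the lab. The following is a sketch of what it likely looks like, assuming it mirrors the impressions definition but keeps only the two fields that the later steps use:

-- Sketch: external table over the JSON click logs, partitioned by dt
CREATE EXTERNAL TABLE clicks (
    impressionId string, adId string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ('paths' = 'impressionId, adId')
LOCATION '${SAMPLE}/tables/clicks';

-- Same partition discovery that you ran for the impressions table
MSCK REPAIR TABLE clicks;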
o To confirm that both external tables now exist, run the following command:
show tables;
The output returns the names of both external tables that are
stored in Amazon S3: clicks and impressions.
Perfect! In this task, you successfully created two external Hive tables that
can now be used to query data.
Task 5: Joining tables by using Hive
In this section of the lab, you will join the impressions table with the clicks
table. By combining information on the ads that were presented to a user
(the impressions data) with information on which ads resulted in a user
choosing the ad (the clicks data), you can gain insight into user behavior and
target and monetize advertising services.
The following diagram illustrates what you will accomplish in this task.
Both the clicks and impressions tables are partitioned tables. When the two are joined, the CREATE TABLE command specifies that the new joined table will also be partitioned.
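The create statement for the joined table is not reproduced here either. The following is a sketch of what it likely looks like, given that the data is stored in the hive-output bucket (the ${OUTPUT} variable) and partitioned by day and hour; the column list, the clicked flag, and the SEQUENCEFILE format are assumptions inferred from the rest of the lab:

-- Sketch: external table for the joined data, written to the hive-output S3 bucket
CREATE EXTERNAL TABLE joined_impressions (
    requestBeginTime string, adId string, impressionId string,
    referrer string, userAgent string, userCookie string, ip string,
    clicked boolean
)
PARTITIONED BY (day string, hour string)
STORED AS SEQUENCEFILE
LOCATION '${OUTPUT}/tables/joined_impressions';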
So far, the table does not have any data. When data is added, it will be
stored in the hive-output S3 bucket.
Analysis: The tmp_impressions table will store temporary data in the local HDFS partition. You know that it is stored in HDFS (not in Amazon S3) because the command did not specify EXTERNAL as the table type. The table has the same column names as the impressions table.
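That create command is not shown above. A sketch of what it likely looks like:

-- Sketch: temporary staging table; no EXTERNAL keyword and no LOCATION, so its
-- data is written to the cluster's local HDFS rather than to Amazon S3
CREATE TABLE tmp_impressions (
    requestBeginTime string, adId string, impressionId string,
    referrer string, userAgent string, userCookie string, ip string
)
STORED AS SEQUENCEFILE;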
24. To insert impression log data for the time duration referenced,
run the following command:
INSERT OVERWRITE TABLE tmp_impressions
SELECT
from_unixtime(cast((cast(i.requestBeginTime as bigint) / 1000) as int))
requestBeginTime,
i.adId,
i.impressionId,
i.referrer,
i.userAgent,
i.userCookie,
i.ip
FROM impressions i
WHERE i.dt >= '${DAY}-${HOUR}-00'
AND i.dt < '${NEXT_DAY}-${NEXT_HOUR}-00';
The SELECT command selects data from the impressions table, but it selects only the records that correspond to a specific period of time. Because the impressions table is partitioned by the dt column, only the partitions that are relevant to the specified time period are read, which improves performance.
The start of the time period is DAY-HOUR, and the end of the period is
NEXT_DAY-NEXT_HOUR. NEXT_DAY is the day of the next time period.
It differs from ${DAY} only when you are processing the last hour of a
day. In this case, the time period ends on the next day.
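With the values that you passed when launching Hive (DAY=2009-04-13, HOUR=08, NEXT_DAY=2009-04-13, NEXT_HOUR=09), the time filter above resolves to a literal range. For example, the following query (an optional check, not a lab step) counts the impression records that fall inside that one-hour window:

-- dt values are strings of the form YYYY-MM-DD-HH-MM
SELECT COUNT(*) AS impressions_in_window
FROM impressions i
WHERE i.dt >= '2009-04-13-08-00'
  AND i.dt < '2009-04-13-09-00';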
25. Create a temporary table named tmp_clicks and insert data into
the table.
o To create the table, run the following command:
CREATE TABLE tmp_clicks ( impressionId string, adId string )
STORED AS SEQUENCEFILE;
Analysis: The left outer join returns all records from tmp_impressions,
even if the join condition does not find and return matching records
from the tmp_clicks table. If matching records are not found in
tmp_clicks, null values are returned for each column.
The left outer join excludes any data from tmp_clicks that did not
originate from an impression in the selected time period.
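The two statements that this step and analysis describe are not reproduced above. A sketch of what they likely look like: first stage the click records for the same time window into tmp_clicks, and then left outer join tmp_impressions to tmp_clicks and write the result into the partitioned joined_impressions table:

-- Sketch: stage only the clicks from the selected one-hour window
INSERT OVERWRITE TABLE tmp_clicks
SELECT impressionId, adId
FROM clicks
WHERE dt >= '${DAY}-${HOUR}-00'
  AND dt < '${NEXT_DAY}-${NEXT_HOUR}-00';

-- Sketch: left outer join that keeps every impression and flags whether it was clicked
INSERT OVERWRITE TABLE joined_impressions
PARTITION (day = '${DAY}', hour = '${HOUR}')
SELECT i.requestBeginTime, i.adId, i.impressionId, i.referrer,
       i.userAgent, i.userCookie, i.ip,
       (c.impressionId IS NOT NULL) AS clicked
FROM tmp_impressions i
LEFT OUTER JOIN tmp_clicks c
  ON i.impressionId = c.impressionId;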
Congratulations! In this task, you successfully combined the data from the
two source tables into a single queryable table. The following diagram
summarizes your accomplishment and also shows where the data of each
table resides.
Because you maintain the nontemporary data in Amazon S3, you can access it at any time. For example, you might not want to keep this cluster running
for long because running servers is relatively expensive compared to only
storing data. By storing the data in Amazon S3, you could delete this cluster
at any time and then later, when you want to run more queries, create a new
cluster to connect to the same data.
CHALLENGE TASK:
Write an SQL Query for the following requirements:
Create two temporary tables to analyze ad performance for the
date 2009-04-12.
Note: Temporary tables exist only for that session, making them
useful for short-term data storage during complex queries.
1. The first temporary table - temp_impressions should select
the top 10 records from the impressions table where the
date part of the dt column is 2009-04-12.
2. The second temporary table - temp_clicks should select the
top 10 records from the clicks table where the date part of
the dt column is 2009-04-12.
After creating the temporary tables, generate an ad performance
report by joining them. Your report SQL query should include the
following metrics:
adId
Total impressions
Total clicks
Click-through rate (CTR) = (Total Clicks / Total Impressions) *
100
Note: It’s fine if the query doesn't return valid results due
to the data. Just write the query and attach the results.
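One possible shape of the solution is sketched below. The table names temp_impressions and temp_clicks come from the prompt; everything else is one reasonable interpretation of the requirements, so treat it as a starting point rather than the official answer:

-- Temporary tables exist only for the current Hive session
CREATE TEMPORARY TABLE temp_impressions AS
SELECT *
FROM impressions
WHERE dt LIKE '2009-04-12%'
LIMIT 10;

CREATE TEMPORARY TABLE temp_clicks AS
SELECT *
FROM clicks
WHERE dt LIKE '2009-04-12%'
LIMIT 10;

-- Ad performance report: total impressions, total clicks, and CTR per ad
SELECT i.adId,
       COUNT(i.impressionId) AS total_impressions,
       COUNT(c.impressionId) AS total_clicks,
       (COUNT(c.impressionId) / COUNT(i.impressionId)) * 100 AS ctr_percent
FROM temp_impressions i
LEFT OUTER JOIN temp_clicks c
  ON i.impressionId = c.impressionId
GROUP BY i.adId;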
exit;
This query returns the adId values for the 20 most clicked ads
during the hour of weblogs that you have been analyzing. It
seems that the ad with adId
70n4fCvgAV0wfojgw5xw3QAEiaULMg was the most clicked
during that time. That's interesting. But it might be more
interesting to figure out which referrers to the website provided
the users who clicked the highest number of ads. Referrers are
the websites that sent visitors to the website that you are
analyzing.
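The query that produced these results is not reproduced in this copy of the lab. A sketch of what it likely looks like, counting clicks per ad over the joined data (it assumes the clicked column sketched earlier):

-- The 20 ads that received the most clicks in the analyzed hour
SELECT adId, COUNT(*) AS click_count
FROM joined_impressions
WHERE clicked = true
GROUP BY adId
ORDER BY click_count DESC
LIMIT 20;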
You will do this next, but you will also learn how to run a Hive
query from outside the Hive shell.
The query will return a list of the ten most effective website referrers, that is, the referrers whose visits resulted in the most ad clicks. The result is written to a file.
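The referrer query itself is also not reproduced here. A sketch of what it likely looks like; in the lab, a query of this shape is passed to the hive command from the bash shell, with its output redirected to /home/hadoop/result.txt:

-- The ten referrers whose visitors clicked the most ads
-- (assumes the clicked column sketched earlier)
SELECT referrer, COUNT(*) AS click_count
FROM joined_impressions
WHERE clicked = true
GROUP BY referrer
ORDER BY click_count DESC
LIMIT 10;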
cat /home/hadoop/result.txt
As you can see, now that you have loaded the table data into
Amazon EMR and learned how to use Hive to make SQL-like
queries, you have the tools at your disposal to conduct trend
analysis.
exit;
29. Analyze one of the log files that make up the output stored in Amazon S3.
o To rediscover the name of the bucket that contains the
joined_impressions table data, run the following command:
aws s3 ls
Next, to download the file that contains the data that you have
been querying, run the following command (be sure to replace
<bucket-name> with the actual hive-output bucket name):
aws s3 cp s3://<bucket-name>/output/tables/joined_impressions/day=2009-04-13/hour=08/000000_0 example-log.txt
Congratulations! In this lab, you created an EMR cluster and then used Hive
to create a data warehouse based on two sets of log files that you merged
into a joined table. You were then able to run queries to uncover some trends
in the data, such as which referrers to the website resulted in the most
frequent occurrences of users clicking ads.
Your POC demonstrating how to process this large dataset was successful. With more analysis, you could reveal even more actionable information from this dataset.
Lab complete