Lab AWS 14-10
Overview
In recent years, UnicornNation has been collecting data through a number of disparate systems
and wants to consolidate this data in a modern data architecture.
A workshop was held with the key stakeholders at UnicornNation, who identified
three key data sources they would like to consolidate and provided the funding
and resources to build a Data Lake on AWS.
During the course of this bootcamp, you will be building a Data Lake on AWS to meet
their requirements and gain experience with a number of core AWS services, including
S3, Glue, Athena, Redshift, Redshift Spectrum and QuickSight.
With the following labs you will get hands-on experience with some of the key AWS
services that underpin a Data Lake implementation. The labs are provided with step-
by-step instructions that will help you use each service to build the basic building blocks
of a data lake.
For the labs, you will be using your own computer and logging in to an AWS console
through your web browser.
The IT team at UnicornNation has exported Merchandise Sales data from their finance
system and transferred this data to a folder in a single S3 bucket. There are
multiple .CSV files in the bucket, each representing a month’s worth of data.
In this lab, we are going to create a Glue Data Crawler to crawl across this bucket.
Once we have created the crawler, we are going to run it to determine the tables that
are located in the bucket and add their definitions to the Glue Data Catalog.
1. Login to the AWS Console using the login URL and credentials provided. Once you are logged in, check in the upper right-hand corner that you are using the Oregon region. If not, use the drop-down list to change to the Oregon region.
1. Before you start working with AWS Glue, you need to configure a default query result bucket. Navigate to the S3 service.
2. Copy and paste the full bucket name into a notepad; you will need it in a following step.
1. Enter the S3 bucket name you recorded earlier, following this pattern: s3://[my-query-result-bucket]/ (include the slash '/' at the end).
1. Click Save
2. From the console, navigate to the AWS Glue service. Services > AWS Glue
3. From the AWS Glue Data Catalog, navigate to Databases and click Add
Database
1. Now that your database has been created, click Tables in the left-hand menu
2. Select the drop down to Add Tables > Add tables using a crawler
2. For Crawler source type select Data stores, then click Next
3. For your data store, select S3 and below Include path, click on the little folder
icon
1. This will open a new window. Look for the Merchandise Sales bucket; its name will follow this pattern: merchandise-sales-xxxxxxx. Click Select and then click Next.
2. For Add another data store, take the default of No and click Next
3. Next, when setting up an IAM role, click on Choose an existing IAM Role, then
under IAM role select role name GlueServiceRole. Then click Next.
1. For the Frequency, leave the default of Run on demand and click Next
3. Finally, review the settings for your crawler and click Finish. This should take you
to a list of crawlers that have been created
4. Select the crawler you just created and click on Run crawler
Watch the Crawler console for your job to finish successfully (it should take ~2 minutes). The status must change from Starting to Ready.
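If you prefer to script this step, the crawler you just built in the console maps roughly onto a few Glue API calls. Below is a minimal boto3 sketch, assuming the bucket, role, and database names used in this lab; the crawler name is hypothetical and your bucket suffix will differ.

```python
# Minimal boto3 sketch of the crawler created above; names are placeholders from this lab.
import time
import boto3

glue = boto3.client("glue", region_name="us-west-2")  # Oregon region used in the lab

glue.create_crawler(
    Name="merchandise-sales-crawler",                 # hypothetical crawler name
    Role="GlueServiceRole",                           # the existing role chosen in the wizard
    DatabaseName="merchandise-sales",                 # placeholder for the catalog database you created
    Targets={"S3Targets": [{"Path": "s3://merchandise-sales-xxxxxxx/"}]},
)
glue.start_crawler(Name="merchandise-sales-crawler")

# The console shows the status going from Starting to Ready; poll for the same thing here.
while glue.get_crawler(Name="merchandise-sales-crawler")["Crawler"]["State"] != "READY":
    time.sleep(15)
```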
2. You should see that a new table has been created; copy and paste the table name into a notepad.
3. Now click to select the table, then select Action > View Data. A dialog window will open; select Preview data.
1. You will see the Athena console. You will run the query below, but first you need to update it with the table name you recorded earlier. Click Run query.
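The Preview data action simply runs a small SELECT on your behalf. If you want to issue the same preview outside the console, a hedged boto3 sketch follows; the database name is a placeholder, and the table and result bucket are the ones you recorded earlier in this lab.

```python
# Hedged sketch of the Athena preview query; replace the placeholder names with the
# database, table, and query result bucket you recorded earlier.
import boto3

athena = boto3.client("athena", region_name="us-west-2")

response = athena.start_query_execution(
    QueryString='SELECT * FROM "merchandise_sales_xxxxxxx" LIMIT 10;',
    QueryExecutionContext={"Database": "merchandise-sales"},
    ResultConfiguration={"OutputLocation": "s3://[my-query-result-bucket]/"},
)
print(response["QueryExecutionId"])  # use this ID to fetch results or query statistics later
```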
The IT team at UnicornNation has extracted historical data from their ticketing system, named "Tickit", which processes the majority of transactions for the company. This data source is known as the Tickit History.
They have stored this data in an S3 bucket and have created folder/prefixes for each
table of data they have exported.
In this lab, we are going to create a Glue Data Crawler to crawl across this bucket.
Once we have created the crawler, we are going to run it to determine the tables that
are located in the bucket and add their definitions to the Glue Data Catalog.
In the first lab, the files we crawled were comma-delimited and there was a header row
in each that provided the field names for the different columns.
In this example, we are going to create a Glue crawler on a more complex dataset,
where the files are pipe-delimited and don’t have a header row with the column names.
This dataset is stored in an S3 Bucket with a folder/key for each table name. These
tables make up the “Tickit” sample data set, which consists of seven tables: two fact
tables and five dimension tables as shown below:
This data set is the HISTORICAL data for the Tickit database. This data will not be used that often, so S3 is a great place to store and access it.
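If you are curious what these files look like before crawling them, the short Python sketch below reads a few rows from one of the pipe-delimited files so you can see there is no header row. The bucket suffix and object key are placeholders for illustration only; adjust them to match your account.

```python
# Peek at a pipe-delimited Tickit file to confirm it has no header row.
# The bucket suffix and object key below are placeholders for illustration only.
import csv
import io
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
obj = s3.get_object(Bucket="tickit-history-xxxxxxx", Key="category/category_pipe.txt")
rows = csv.reader(io.StringIO(obj["Body"].read().decode("utf-8")), delimiter="|")
for row in list(rows)[:3]:
    print(row)  # values only -- no column names anywhere in the file
```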
1. Login to the AWS Console using the login URL and credentials provided. Once you are logged in, check in the upper right-hand corner that you are using the Oregon region. If not, use the drop-down list to change to the Oregon region.
3. From the AWS Glue Data Catalog, navigate to Databases and click Add database
1. Now that your database has been created, click Tables in the left-hand menu
2. Select the drop down to Add Tables > Add tables using a crawler
5. For your data store, select S3 and below Include path click on the little folder
icon
6. This will open a new window. Look for the Tickit History bucket; its name will follow this pattern: tickit-history-xxxxxxx. Click Select and then click Next.
7. For Add another data store, take the default of No and click Next
8. Next, when setting up an IAM role, click on Choose an existing IAM Role, then
under IAM role select role name GlueServiceRoleTickitHistory. Then click Next.
1. For the Frequency, leave the default of Run on demand and click Next
3. Finally, review the settings for your crawler and click Finish. This should take you
to a list of crawlers that have been created
Watch the Crawler console for your job to finish successfully (it should take ~2 minutes). The status must change from Starting to Ready.
1. From within the AWS Glue console, select Databases from the left panel and
navigate to your Tickit history database (i.e. teamawesome-tickit-history)
1. Click on the teamawesome-tickit-history link. This will display a list of all of the
tables that have been generated by the Glue Crawler.
You will notice that the column names are listed as col0, col1, col2; that is because the data files do not have a header row. In the next steps you will update the column names.
1. Use the information below as a guide to edit ONLY the column names for the CATEGORY schema (do NOT change the types, just the column names):
CATEGORY Table

| Column | New name |
|---|---|
| col0 | CATID |
| col1 | CATGROUP |
| col2 | CATNAME |
| col3 | CATDESC |
1. When you are finished editing the CATEGORY schema, your schema must look like this:
1. Using the same steps, repeat this process to update the following tables in your
teamawesome-tickit-history database. Remember, you are just updating the
COLUMN NAMES, not changing the data types.
• DATE
• EVENT
• LISTING
• SALES
• USERS
• VENUE
DATE Table

| Column | New name |
|---|---|
| col0 | DATEID |
| col1 | CALDATE |
| col2 | DAY |
| col3 | WEEK |
| col4 | MONTH |
| col5 | QTR |
| col6 | YEAR |
| col7 | HOLIDAY |
EVENT Table

| Column | New name |
|---|---|
| col0 | EVENTID |
| col1 | VENUEID |
| col2 | CATID |
| col3 | DATEID |
| col4 | EVENTNAME |
| col5 | STARTTIME |
LISTING Table

| Column | New name |
|---|---|
| col0 | LISTID |
| col1 | SELLERID |
| col2 | EVENTID |
| col3 | DATEID |
| col4 | NUMTICKETS |
| col5 | PRICEPERTICKET |
| col6 | TOTALPRICE |
| col7 | LISTTIME |
SALES Table

| Column | New name |
|---|---|
| col0 | SALESID |
| col1 | LISTID |
| col2 | SELLERID |
| col3 | BUYERID |
| col4 | EVENTID |
| col5 | DATEID |
| col6 | QTYSOLD |
| col7 | PRICEPAID |
| col8 | COMMISSION |
| col9 | SALETIME |
USERS Table

| Column | New name |
|---|---|
| col0 | USERID |
| col1 | USERNAME |
| col2 | FIRSTNAME |
| col3 | LASTNAME |
| col4 | CITY |
| col5 | STATE |
| col6 | EMAIL |
| col7 | PHONE |
| col8 | LIKESPORTS |
| col9 | LIKETHEATRE |
| col10 | LIKECONCERTS |
| col11 | LIKEJAZZ |
| col12 | LIKECLASSICAL |
| col13 | LIKEOPERA |
| col14 | LIKEROCK |
| col15 | LIKEVEGAS |
| col16 | LIKEBROADWAY |
| col17 | LIKEMUSICALS |
VENUE Table

| Column | New name |
|---|---|
| col0 | VENUEID |
| col1 | VENUENAME |
| col2 | VENUECITY |
| col3 | VENUESTATE |
| col4 | VENUESEATS |
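Editing each schema in the console works fine for seven tables, but the same renames can also be scripted against the Data Catalog. Here is a hedged boto3 sketch for the CATEGORY table, assuming the teamawesome-tickit-history database name used in this lab; the other tables follow the same pattern with the mappings above.

```python
# Hedged sketch: rename the columns of one crawled table via the Glue API.
# Database and table names are placeholders; repeat with the other mappings above.
import boto3

glue = boto3.client("glue", region_name="us-west-2")
database = "teamawesome-tickit-history"
table_name = "category"
new_names = ["catid", "catgroup", "catname", "catdesc"]  # from the CATEGORY mapping

table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
for column, name in zip(table["StorageDescriptor"]["Columns"], new_names):
    column["Name"] = name  # only the name changes; the type stays as the crawler set it

# get_table returns read-only fields that update_table will not accept,
# so copy across only the fields that belong in TableInput.
allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters"}
glue.update_table(DatabaseName=database,
                  TableInput={k: v for k, v in table.items() if k in allowed})
```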
The ticketing team at UnicornNation has heard about the work you did in setting up the data catalog. They have some data that they need urgently and need your help in setting up and running some queries.
In this lab, you are going to use Amazon Athena to create some queries, which you will
then save to make it easy for users to run and consume.
As part of the lab, you will also be running your own queries to help users answer some
basic questions around ticket sales, customers and more.
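Saving a query in Athena creates a named query that other users can open from the Saved queries tab. If you later want to script that step, a minimal boto3 sketch is shown below; the query name, description, and query text are placeholders, so substitute whichever SQL from this lab you want to share.

```python
# Minimal sketch of saving a reusable query for other users; the name, description,
# and query text below are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-west-2")

athena.create_named_query(
    Name="Top ticket sellers",                 # hypothetical saved-query name
    Database="teamawesome-tickit-history",
    QueryString="SELECT 1;",                   # replace with a query from this lab
    Description="Saved for the UnicornNation ticketing team",
)
```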
1. Login to the AWS Console using the login URL and credentials provided. Once you are logged in, check in the upper right-hand corner that you are using the Oregon region. If not, use the drop-down list to change to the Oregon region.
1. Using the Database drop-down list, change the database to point to your Tickit
database teamawesome-tickit-history
1. To run a query, paste the SQL script below into the new query text box and click the Run Query button
2. Using the query text below, run each query to answer these questions:
Question 1: Using the following query, what were the Top 5 ticket sellers for events in
San Diego in 2008?
select sellerid, username, city, firstname ||' '|| lastname as fullname, sum(qtysold) as qtysold
from sales, date, users
where sales.sellerid = users.userid and sales.dateid = date.dateid and year = 2008 and city = 'San Diego'
group by 1, 2, 3, 4
order by 5 desc
limit 5;
Question 2: Using the following query, who were the buyers AND sellers for ticket
transactions that cost $10,000 or more?
select listid, lastname, firstname, username, pricepaid as price, 'S' as buyorsell
from sales, users
where sales.sellerid=users.userid and pricepaid >= 10000
union
select listid, lastname, firstname, username, pricepaid as price, 'B' as buyorsell
from sales, users
where sales.buyerid=users.userid and pricepaid >= 10000
order by 1, 2, 3, 4, 5;
The IT team at UnicornNation is looking for a way to reduce their AWS spend for this
project. After reviewing the file formats they are using, they have decided that for the
Merchandise Sales data, they are going to change the file format from CSV to Parquet.
Parquet is a columnar, compressed format and will help them reduce the amount of
data that is scanned when using Amazon Athena.
In this lab, you are going to create a Glue Job to convert the existing CSV data files to
Parquet.
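When you reach the Save job and edit script step later in this lab, Glue will have generated a PySpark script for you. The exact script depends on your source table and mappings, but a trimmed-down sketch of what such a script typically looks like is shown below; the database, table, and bucket names are placeholders, and your generated script will differ.

```python
# Trimmed sketch of the kind of PySpark script the Glue job wizard generates.
# Your generated script will differ; names below are placeholders from this lab.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CSV-backed table the crawler registered in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="merchandise-sales",
    table_name="merchandise_sales_xxxxxxx",
)

# Write the same records back out to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://teamawesome-merch-sales-parquet-xxxxxxx/"},
    format="parquet",
)

job.commit()
```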
1. Login to the AWS Console using the login URL and credentials provided. Once you are logged in, check in the upper right-hand corner that you are using the Oregon region. If not, use the drop-down list to change to the Oregon region.
3. Under the ETL menu, select Jobs and then click the button for Add Job
5. Under IAM role, use the drop-down list to select the GlueServiceRole IAM role
1. Click Next
2. For your data source, select the database you created for the Merchandise Sales data; look for a data source with a name similar to merchandise_sales_xxxxxxx and then click the Next button
3. In the Choose a transformation type page, select Change schema, click the Next
button
4. In the Choose a data target page, select Create tables in your data target, choose Amazon S3 as the data store, and choose Parquet as the format
7. For the Target path, click on the small folder icon and select the S3 bucket
teamawesome-merch-sales-parquet-xxxxxxx , click Select, then click Next
8. In the Map the source columns to target columns page, leave defaults and click
Save job and edit script
10. From the toolbar, click the Save button, and then click the X icon (far right) to
return to the list of jobs
11. Locate the job you created and click to select the job
1. Click the Action button and select Run Job; this will start your Glue ETL job
2. Click again on the checkbox next to your ETL job to see the status panel.
3. You will notice the job status is Running; wait a couple of minutes until it changes to Succeeded
4. Navigate to the S3 console (Services > S3) and search for the bucket teamawesome-merch-sales-parquet-xxxxxxxx . Verify that the Parquet files were successfully created in the bucket
With the Parquet files created, you could then crawl those files and start using the data
with Athena. Because Parquet is a columnar, compressed format, there will be less data
scanned, reducing the cost of your Athena queries.
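One way to see the savings for yourself is to run the same query against the CSV table and the Parquet table, then compare how much data Athena scanned for each run. A small hedged boto3 sketch follows; the execution ID is whatever your own query returned.

```python
# Compare the bytes Athena scanned for two query executions (CSV vs Parquet).
# The execution ID below is a placeholder; use the IDs returned by your own queries.
import boto3

athena = boto3.client("athena", region_name="us-west-2")
execution = athena.get_query_execution(QueryExecutionId="replace-with-your-query-execution-id")
print(execution["QueryExecution"]["Statistics"]["DataScannedInBytes"])  # lower means a cheaper query
```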
LAB END