

AWS Professional Services: Big Data and Analytics

Data Lake on AWS - Lab guide

Overview

UnicornNation is a global entertainment company that provides ticketing, merchandising and promotion of large concerts and events.

In recent years, they have been collecting data through a number of disparate systems
and want to consolidate this data in a modern data architecture.

A workshop was held with the key stakeholders in UnicornNation and they identified
three key data sources they would like to consolidate and have provided the funding
and resources to build a Data Lake on AWS.

During the course of this bootcamp, you will be building a Data Lake on AWS to meet
their requirements and gain experience with a number of core AWS services, including
S3, Glue, Athena, Redshift, Redshift Spectrum and QuickSight.

UnicornNation Target Architecture


### About the Labs

With the following labs you will get hands-on experience with some of the key AWS services that underpin a Data Lake implementation. The labs provide step-by-step instructions that will help you use each service to build the basic building blocks of a data lake.

For the labs, you will be using your own computer and logging in to an AWS console
through your web browser.

## Lab 1: Creating a Glue Data Crawler

The IT team at UnicornNation has exported Merchandise Sales data from their finance
system and transferred this data to a folder in a single S3 bucket. There are
multiple .CSV files in the bucket, each representing a month’s worth of data.

In this lab, we are going to create a Glue Data Crawler to crawl across this bucket.
Once we have created the crawler, we are going to run it to determine the tables that
are located in the bucket and add their definitions to the Glue Data Catalog.
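
Note: if you would rather script this step than use the console, the same crawler can be created with the AWS SDK. The sketch below is illustrative only, assuming the database, role, and bucket names used later in this lab (the xxxxxxx suffix is a placeholder for your account's actual bucket):

```python
import boto3

# Oregon (us-west-2) is the region used throughout this lab.
glue = boto3.client("glue", region_name="us-west-2")

# Create the catalog database the crawler will write to.
glue.create_database(DatabaseInput={"Name": "teamawesome-merch-sales"})

# Create a crawler over the merchandise sales bucket; names mirror the
# console steps below, and the bucket suffix is a placeholder.
glue.create_crawler(
    Name="teamawesome-merch-sales-crawler",
    Role="GlueServiceRole",
    DatabaseName="teamawesome-merch-sales",
    Targets={"S3Targets": [{"Path": "s3://merchandise-sales-xxxxxxx/"}]},
)

# Run the crawler on demand; it adds the discovered tables to the Glue Data Catalog.
glue.start_crawler(Name="teamawesome-merch-sales-crawler")
```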

To create your Glue data crawler, follow these steps:


1. Log in to the AWS Console using the login URL and credentials provided. Once you are logged in, check in the upper right-hand corner that you are using the Oregon region. If not, use the drop-down list to change to the Oregon region.

2. Before you start working with AWS Glue, you need to configure a default query result bucket. Navigate to the S3 service.

3. Locate the S3 bucket with this name pattern query-results-bucket-xxxxxxxxx

4. Copy and paste the full bucket name into a notepad; you will need it in a later step

5. Now navigate to the Athena service. Services > Athena

6. Click on Get Started


7. Click on the link to set up a query result location

8. Enter the S3 bucket name you recorded in step 4, following this pattern: s3://[my-query-result-bucket]/ (include the slash ‘/’ at the end).


9. Click Save

10. From the console, navigate to the AWS Glue service. Services > AWS Glue

11. From the AWS Glue Data Catalog, navigate to Databases and click Add Database

12. Enter a database name, type teamawesome-merch-sales and click Create

13. Now that your database has been created, click Tables in the left-hand menu

14. Select the drop down to Add Tables > Add tables using a crawler


15. Enter a name for your crawler, type teamawesome-merch-sales-crawler and click Next

16. For Crawler source type select Data stores, then click Next

17. For your data store, select S3 and, under Include path, click on the little folder icon

18. This will open a new window. Look for the Merchandise Sales bucket; its name will follow this pattern: merchandise-sales-xxxxxxx . Click Select and then click Next.


19. For Add another data store, take the default of No and click Next

20. Next, when setting up an IAM role, click on Choose an existing IAM Role, then under IAM role select role name GlueServiceRole. Then click Next.

21. For the Frequency, leave the default of Run on demand and click Next

22. In Configure the crawler’s output, select the Database teamawesome-merch-sales and click Next

23. Finally, review the settings for your crawler and click Finish. This should take you to a list of crawlers that have been created

24. Select the crawler you just created and click on Run crawler

Watch the Crawler console for your job to finish successfully (it should take ~2 minutes). The status must change from Starting to Ready.

25. Using the navigation menu on the left, navigate to Tables


26. You should see that a new table has been created; copy and paste the table name into a notepad.

27. Now click to select the table, then select Action > View Data. A dialog window will open; select Preview data

28. This will open the Athena console; click Get started

29. You will see the Athena console. Run the query below, but first update it with the table name you recorded in step 26. Then click Run query

SELECT * FROM "teamawesome-merch-sales"."[GLUE_TABLE_NAME]" limit 10;


30. You should see similar query output
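
Note: the same preview can also be scripted against the Athena API. A minimal boto3 sketch, assuming the query results bucket you recorded earlier and with [GLUE_TABLE_NAME] left as a placeholder for the table name from step 26:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-west-2")

# Start the preview query; the output location is the
# query-results-bucket-xxxxxxxxx bucket recorded earlier.
resp = athena.start_query_execution(
    QueryString='SELECT * FROM "teamawesome-merch-sales"."[GLUE_TABLE_NAME]" limit 10;',
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket-xxxxxxxxx/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then print the returned rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```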

## Lab 2: Modifying Table Schemas

The IT team at UnicornNation has extracted historical data from their ticketing system, named “Tickit”, which processes the majority of transactions for the company. This data source is known as the Tickit History.

They have stored this data in an S3 bucket and have created folder/prefixes for each
table of data they have exported.

In this lab, we are going to create a Glue Data Crawler to crawl across this bucket.
Once we have created the crawler, we are going to run it to determine the tables that
are located in the bucket and add their definitions to the Glue Data Catalog.

In the first lab, the files we crawled were comma-delimited and there was a header row
in each that provided the field names for the different columns.


In this example, we are going to create a Glue crawler on a more complex dataset,
where the files are pipe-delimited and don’t have a header row with the column names.

This dataset is stored in an S3 Bucket with a folder/key for each table name. These
tables make up the “Tickit” sample data set, which consists of seven tables: two fact
tables and five dimension tables as shown below:

This data set is the HISTORICAL data for the Tickit database—this data will not be used
that often, so S3 is a great place to store and access this data.

To create these tables in the Glue catalog, follow these steps:

Task 1: Create your Glue crawler

1. Log in to the AWS Console using the login URL and credentials provided. Once you are logged in, check in the upper right-hand corner that you are using the Oregon region. If not, use the drop-down list to change to the Oregon region.

2. From the console, navigate to the AWS Glue service


3. From the AWS Glue Data Catalog, navigate to Databases and click Add database

4. Enter a database name, type teamawesome-tickit-history and click Create

5. Now that your database has been created, click Tables in the left-hand menu

6. Select the drop down to Add Tables > Add tables using a crawler

7. Enter a name for your crawler, type teamawesome-tickit-crawler and click Next

8. For Crawler source type select Data stores

9. For your data store, select S3 and, under Include path, click on the little folder icon

10. This will open a new window. Look for the Tickit History bucket; its name will follow this pattern: tickit-history-xxxxxxx . Click Select and then click Next.

11. For Add another data store, take the default of No and click Next

12. Next, when setting up an IAM role, click on Choose an existing IAM Role, then under IAM role select role name GlueServiceRoleTickitHistory. Then click Next.


13. For the Frequency, select the default of Run on demand and click Next

14. In Configure the crawler’s output, select the Database teamawesome-tickit-history and click Next

15. Finally, review the settings for your crawler and click Finish. This should take you to a list of crawlers that have been created

16. Select the crawler you just created, teamawesome-tickit-crawler, and click on Run crawler

Watch the Crawler console for your job to finish successfully (it should take ~2 minutes). The status must change from Starting to Ready.

17. Using the navigation menu on the left, navigate to Tables

18. You should see that several tables have been created


Task 2: Modifying the Table Schemas

1. From within the AWS Glue console, select Databases from the left panel and
navigate to your Tickit history database (i.e. teamawesome-tickit-history)

2. Click on the teamawesome-tickit-history link. This will display a list of all of the tables that have been generated by the Glue Crawler.

3. Click on the category table.


You will notice that the column names are listed as col0, col1, col2, and so on; that is because the data files do not have a header row. In the next steps you will update the column names.

4. Click on the Edit schema button (far upper right corner)

5. Use the information below as a guide to edit ONLY the column names for the category schema (do NOT change the types, just the column names):


CATEGORY table

| Column Name (old) | Column Name (new) |
|---|---|
| col0 | CATID |
| col1 | CATGROUP |
| col2 | CATNAME |
| col3 | CATDESC |

6. When you are finished editing the Category schema, it should look like this:

7. Click the Save button.

8. To return to your tables list click on the Tables link


9. Using the same steps, repeat this process to update the following tables in your teamawesome-tickit-history database. Remember, you are just updating the COLUMN NAMES, not changing the data types. (A scripted alternative using the Glue API is sketched after the VENUE table below.)

• DATE

• EVENT

• LISTING

• SALES

• USERS

• VENUE

DATE Table

| Column Name (old) | Column Name (new) |
|---|---|
| col0 | DATEID |
| col1 | CALDATE |
| col2 | DAY |
| col3 | WEEK |
| col4 | MONTH |
| col5 | QTR |
| col6 | YEAR |
| col7 | HOLIDAY |

EVENT Table

| Column Name (old) | Column Name (new) |
|---|---|
| col0 | EVENTID |
| col1 | VENUEID |
| col2 | CATID |
| col3 | DATEID |
| col4 | EVENTNAME |
| col5 | STARTTIME |

LISTING Table

| Column Name (old) | Column Name (new) |
|---|---|
| col0 | LISTID |
| col1 | SELLERID |
| col2 | EVENTID |
| col3 | DATEID |
| col4 | NUMTICKETS |
| col5 | PRICEPERTICKET |
| col6 | TOTALPRICE |
| col7 | LISTTIME |

SALES Table

| Column Name (old) | Column Name (new) |
|---|---|
| col0 | SALESID |
| col1 | LISTID |
| col2 | SELLERID |
| col3 | BUYERID |
| col4 | EVENTID |
| col5 | DATEID |
| col6 | QTYSOLD |
| col7 | PRICEPAID |
| col8 | COMMISSION |
| col9 | SALETIME |

USERS Table

| Column Name (old) | Column Name (new) |
|---|---|
| col0 | USERID |
| col1 | USERNAME |
| col2 | FIRSTNAME |
| col3 | LASTNAME |
| col4 | CITY |
| col5 | STATE |
| col6 | EMAIL |
| col7 | PHONE |
| col8 | LIKESPORTS |
| col9 | LIKETHEATRE |
| col10 | LIKECONCERTS |
| col11 | LIKEJAZZ |
| col12 | LIKECLASSICAL |
| col13 | LIKEOPERA |
| col14 | LIKEROCK |
| col15 | LIKEVEGAS |
| col16 | LIKEBROADWAY |
| col17 | LIKEMUSICALS |

VENUE Table


| Column Name (old) | Column Name (new) |
|---|---|
| col0 | VENUEID |
| col1 | VENUENAME |
| col2 | VENUECITY |
| col3 | VENUESTATE |
| col4 | VENUESEATS |
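
Note: editing seven schemas by hand is tedious. If you prefer, the same renames can be scripted through the Glue API. A minimal boto3 sketch, assuming the teamawesome-tickit-history database and showing only the category mapping (extend RENAMES with the other tables above):

```python
import boto3

glue = boto3.client("glue", region_name="us-west-2")

DATABASE = "teamawesome-tickit-history"

# Old-to-new column names per table; only category is shown here.
# Add entries for date, event, listing, sales, users and venue.
RENAMES = {
    "category": {"col0": "CATID", "col1": "CATGROUP", "col2": "CATNAME", "col3": "CATDESC"},
}

for table_name, mapping in RENAMES.items():
    table = glue.get_table(DatabaseName=DATABASE, Name=table_name)["Table"]

    # Rename the columns in place; the data types are left untouched.
    for column in table["StorageDescriptor"]["Columns"]:
        column["Name"] = mapping.get(column["Name"], column["Name"])

    # update_table only accepts writable fields, so copy the ones we need.
    table_input = {
        "Name": table["Name"],
        "StorageDescriptor": table["StorageDescriptor"],
        "PartitionKeys": table.get("PartitionKeys", []),
        "TableType": table.get("TableType", "EXTERNAL_TABLE"),
        "Parameters": table.get("Parameters", {}),
    }
    glue.update_table(DatabaseName=DATABASE, TableInput=table_input)
```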

## Lab 3: Querying your Data Lake with Amazon Athena

The ticketing team at UnicornNation has heard about the work you did in setting up the data catalog. They need some data urgently and need your help in setting up and running queries to retrieve it.

In this lab, you are going to use Amazon Athena to create some queries, which you will
then save to make it easy for users to run and consume.

As part of the lab, you will also be running your own queries to help users answer some
basic questions around ticket sales, customers and more.

To run a query using Amazon Athena, follow these steps:

1. Log in to the AWS Console using the login URL and credentials provided. Once you are logged in, check in the upper right-hand corner that you are using the Oregon region. If not, use the drop-down list to change to the Oregon region.

2. From the console, navigate to the Athena service


3. Using the Database drop-down list, change the database to point to your Tickit database teamawesome-tickit-history

4. You can open a new query text box


5. To run a query, paste the SQL script below into the new text box and click the Run Query button

6. Using the query text below, run each query to answer these questions:

Question 1: Using the following query, what were the Top 5 ticket sellers for events in
San Diego in 2008?

select sellerid, username, city, firstname ||' '|| lastname as fullname, sum(qtysold) a
from sales, date, users
where sales.sellerid = users.userid
and sales.dateid = date.dateid
and year = 2008
and city = 'San Diego'
group by sellerid, username, city, firstname ||' '|| lastname
order by 5 desc
limit 5;


Question 2: Using the following query, who were the buyers AND sellers for ticket
transactions that cost $10,000 or more?

select listid, lastname, firstname, username,
pricepaid as price, 'S' as buyorsell
from sales, users
where sales.sellerid=users.userid
and pricepaid >=10000
union
select listid, lastname, firstname, username, pricepaid,
'B' as buyorsell
from sales, users
where sales.buyerid=users.userid
and pricepaid >=10000
order by 1, 2, 3, 4, 5;
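
Note: the lab brief mentions saving queries so users can re-run them easily. One way to do that programmatically is with Athena named queries; a minimal boto3 sketch using Question 1's query, with an illustrative query name:

```python
import boto3

athena = boto3.client("athena", region_name="us-west-2")

# Question 1 query, saved so the ticketing team can re-run it later.
top_sellers_sql = """
select sellerid, username, city, firstname ||' '|| lastname as fullname, sum(qtysold) a
from sales, date, users
where sales.sellerid = users.userid
and sales.dateid = date.dateid
and year = 2008
and city = 'San Diego'
group by sellerid, username, city, firstname ||' '|| lastname
order by 5 desc
limit 5;
"""

athena.create_named_query(
    Name="top-5-san-diego-sellers-2008",  # illustrative name
    Database="teamawesome-tickit-history",
    QueryString=top_sellers_sql,
    Description="Top 5 ticket sellers for events in San Diego in 2008",
)
```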

## Lab 4: Transforming Data with AWS Glue

The IT team at UnicornNation is looking for a way to reduce their AWS spend for this
project. After reviewing the file formats they are using, they have decided that for the
Merchandise Sales data, they are going to change the file format from CSV to Parquet.

Parquet is a columnar, compressed format and will help them reduce the amount of
data that is scanned when using Amazon Athena.


In this lab, you are going to create a Glue Job to convert the existing CSV data files to
Parquet.
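
Note: when you reach the script editor later in this lab, the script Glue generates for this kind of CSV-to-Parquet job roughly follows the pattern sketched below. This is a simplified illustration, not the exact generated code, and the database, table, and bucket names are the placeholders used in this lab:

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job setup: resolve the job name and create the contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CSV-backed table that the Lab 1 crawler added to the catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="teamawesome-merch-sales",
    table_name="merchandise_sales_xxxxxxx",  # placeholder; use your table name from Lab 1
)

# Write the same records back out to S3 in Parquet format.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://teamawesome-merch-sales-parquet-xxxxxxx/"},
    format="parquet",
)

job.commit()
```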

1. Log in to the AWS Console using the login URL and credentials provided. Once you are logged in, check in the upper right-hand corner that you are using the Oregon region. If not, use the drop-down list to change to the Oregon region.

2. From the console, navigate to the AWS Glue service

3. Under the ETL menu, select Jobs and then click the button for Add Job

4. For the Name of the Job, type teamawesome-convert-to-parquet

5. Under IAM role, use the drop-down list to select the GlueServiceRole IAM role

Leave the rest of the parameters at their defaults

6. Click Next

7. For your data source, select the database you created for the Merchandise Sales data; look for a data source with a name similar to merchandise_sales_xxxxxxx and then click the Next button

8. On the Choose a transformation type page, select Change schema, then click the Next button

9. On the Choose a data target page, select Create tables in your data target

10. For the Data store, select Amazon S3

11. For the Format, select Parquet

12. For the Target path, click on the small folder icon and select the S3 bucket teamawesome-merch-sales-parquet-xxxxxxx , click Select, then click Next

13. On the Map the source columns to target columns page, leave the defaults and click Save job and edit script


14. Close the Script editor tips window

15. From the toolbar, click the Save button, and then click the X icon (far right) to return to the list of jobs

16. Locate the job you created and click to select the job

17. Click the Action button and select Run Job; this will start your Glue ETL job

18. Click again on the checkbox next to your ETL job to see the status panel.

19. You will notice the job status is Running; wait a couple of minutes until it changes to Succeeded

20. Navigate to the S3 console (Services > S3) and search for the bucket teamawesome-merch-sales-parquet-xxxxxxxx . Verify that the Parquet files were successfully created in the bucket

With the Parquet files created, you could then crawl those files and start using the data
with Athena. Because Parquet is a columnar, compressed format, there will be less data
scanned, reducing the cost of your Athena queries.

LAB END
