

AWS Professional Services: Big Data and Analytics

Data Lake on AWS - Lab guide

Overview

UnicornNation is a global entertainment company that provides ticketing, merchandising and promotion of large concerts and events.

In recent years, they have been collecting data through a number of disparate systems
and want to consolidate this data in a modern data architecture.

A workshop was held with the key stakeholders in UnicornNation and they identified
three key data sources they would like to consolidate and have provided the funding
and resources to build a Data Lake on AWS.

During the course of this bootcamp, you will be building a Data Lake on AWS to meet
their requirements and gain experience with a number of core AWS services, including
S3, Glue, Athena, Redshift, Redshift Spectrum and QuickSight.

UnicornNation Target Architecture


### About the Labs

With the following labs you will get hands-on experience with some of the key AWS services that underpin a Data Lake implementation. The labs provide step-by-step instructions that will help you use each service to build the basic building blocks of a data lake.

For the labs, you will be using your own computer and logging in to an AWS console
through your web browser.

## Lab 1: Creating a Glue Data Crawler

The IT team at UnicornNation has exported Merchandise Sales data from their finance
system and transferred this data to a folder in a single S3 bucket. There are
multiple .CSV files in the bucket, each representing a month’s worth of data.

In this lab, we are going to create a Glue Data Crawler to crawl across this bucket.
Once we have created the crawler, we are going to run it to determine the tables that
are located in the bucket and add their definitions to the Glue Data Catalog.
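
Note: if you would rather script this step than use the console, the same crawler can be created with the AWS SDK. The sketch below is illustrative only, assuming the database, role, and bucket names used later in this lab (the xxxxxxx suffix is a placeholder for your account's actual bucket):

```python
import boto3

# Oregon (us-west-2) is the region used throughout this lab.
glue = boto3.client("glue", region_name="us-west-2")

# Create the catalog database the crawler will write to.
glue.create_database(DatabaseInput={"Name": "teamawesome-merch-sales"})

# Create a crawler over the merchandise sales bucket; names mirror the
# console steps below, and the bucket suffix is a placeholder.
glue.create_crawler(
    Name="teamawesome-merch-sales-crawler",
    Role="GlueServiceRole",
    DatabaseName="teamawesome-merch-sales",
    Targets={"S3Targets": [{"Path": "s3://merchandise-sales-xxxxxxx/"}]},
)

# Run the crawler on demand; it adds the discovered tables to the Glue Data Catalog.
glue.start_crawler(Name="teamawesome-merch-sales-crawler")
```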

To create your Glue data crawler, follow these steps:


1. Log in to the AWS Console using the login URL and credentials provided. Once you are logged in, check in the upper right-hand corner that you are using the Oregon region. If not, use the drop-down list to change to the Oregon region.

2. Before you start working with AWS Glue, you need to configure a default query result bucket. Navigate to the S3 service.

3. Locate the S3 bucket with this name pattern query-results-bucket-xxxxxxxxx

4. Copy and paste the full bucket name into a notepad; you will need it in a later step

5. Now navigate to the Athena service. Services > Athena

6. Click on Get Started


7. Click on the link to set up a query result location

8. Enter the S3 bucket name you recorded in step 4, following this pattern: s3://[my-query-result-bucket]/ (include the slash ‘/’ at the end).


9. Click Save

10. From the console, navigate to the AWS Glue service. Services > AWS Glue

11. From the AWS Glue Data Catalog, navigate to Databases and click Add Database

12. Enter a database name, type teamawesome-merch-sales and click Create

13. Now that your database has been created, click Tables in the left-hand menu

14. Select the drop down to Add Tables > Add tables using a crawler


15. Enter a name for your crawler, type teamawesome-merch-sales-crawler and click Next

16. For Crawler source type select Data stores, then click Next

17. For your data store, select S3 and, under Include path, click on the little folder icon

18. This will open a new window. Look for the Merchandise Sales bucket; its name will follow this pattern: merchandise-sales-xxxxxxx . Click Select and then click Next.


19. For Add another data store, take the default of No and click Next

20. Next, when setting up an IAM role, click on Choose an existing IAM Role, then under IAM role select role name GlueServiceRole. Then click Next.

21. For the Frequency, leave the default of Run on demand and click Next

22. In Configure the crawler’s output, select the Database teamawesome-merch-sales and click Next

23. Finally, review the settings for your crawler and click Finish. This should take you to a list of crawlers that have been created

24. Select the crawler you just created and click on Run crawler

Watch the Crawler console for your job to finish successfully (it should take ~2 minutes). The status must change from Starting to Ready.

25. Using the navigation menu on the left, navigate to Tables


26. You should see that a new table has been created; copy and paste the table name into a notepad.

27. Now click to select the table, then select Action > View Data. A dialog window will open; select Preview data

28. This will open the Athena console; click Get started

29. You will see the Athena console. Run the query below, but first update it with the table name you recorded in step 26. Then click Run query

SELECT * FROM "teamawesome-merch-sales"."[GLUE_TABLE_NAME]" limit 10;


30. You should see similar query output
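
Note: the same preview can also be scripted against the Athena API. A minimal boto3 sketch, assuming the query results bucket you recorded earlier and with [GLUE_TABLE_NAME] left as a placeholder for the table name from step 26:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-west-2")

# Start the preview query; the output location is the
# query-results-bucket-xxxxxxxxx bucket recorded earlier.
resp = athena.start_query_execution(
    QueryString='SELECT * FROM "teamawesome-merch-sales"."[GLUE_TABLE_NAME]" limit 10;',
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket-xxxxxxxxx/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then print the returned rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```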

## Lab 2: Modifying Table Schemas

The IT team at UnicornNation has extracted historical data from their ticketing system, named “Tickit”, which processes the majority of transactions for the company. This data source is known as the Tickit History.

They have stored this data in an S3 bucket and have created folder/prefixes for each
table of data they have exported.

In this lab, we are going to create a Glue Data Crawler to crawl across this bucket.
Once we have created the crawler, we are going to run it to determine the tables that
are located in the bucket and add their definitions to the Glue Data Catalog.

In the first lab, the files we crawled were comma-delimited and there was a header row
in each that provided the field names for the different columns.


In this example, we are going to create a Glue crawler on a more complex dataset,
where the files are pipe-delimited and don’t have a header row with the column names.

This dataset is stored in an S3 Bucket with a folder/key for each table name. These
tables make up the “Tickit” sample data set, which consists of seven tables: two fact
tables and five dimension tables as shown below:

This data set is the HISTORICAL data for the Tickit database—this data will not be used
that often, so S3 is a great place to store and access this data.

To create these tables in the Glue catalog, follow these steps:

Task 1: Create your Glue crawler

1. Log in to the AWS Console using the login URL and credentials provided. Once you are logged in, check in the upper right-hand corner that you are using the Oregon region. If not, use the drop-down list to change to the Oregon region.

2. From the console, navigate to the AWS Glue service


3. From the AWS Glue Data Catalog, navigate to Databases and click Add database

4. Enter a database name, type teamawesome-tickit-history and click Create

5. Now that your database has been created, click Tables in the left-hand menu

6. Select the drop down to Add Tables > Add tables using a crawler

7. Enter a name for your crawler, type teamawesome-tickit-crawler and click Next

8. For Crawler source type select Data stores

9. For your data store, select S3 and, under Include path, click on the little folder icon

10. This will open a new window. Look for the Tickit History bucket; its name will follow this pattern: tickit-history-xxxxxxx . Click Select and then click Next.

11. For Add another data store, take the default of No and click Next

12. Next, when setting up an IAM role, click on Choose an existing IAM Role, then under IAM role select role name GlueServiceRoleTickitHistory. Then click Next.


13. For the Frequency, select the default of Run on demand and click Next

14. In Configure the crawler’s output, select the Database teamawesome-tickit-history and click Next

15. Finally, review the settings for your crawler and click Finish. This should take you to a list of crawlers that have been created

16. Select the crawler you just created, teamawesome-tickit-crawler, and click on Run crawler

Watch the Crawler console for your job to finish successfully (it should take ~2 minutes). The status must change from Starting to Ready.

17. Using the navigation menu on the left, navigate to Tables

18. You should see that several tables have been created


Task 2: Modifying the Table Schemas

1. From within the AWS Glue console, select Databases from the left panel and
navigate to your Tickit history database (i.e. teamawesome-tickit-history)

2. Click on the teamawesome-tickit-history link. This will display a list of all of the tables that have been generated by the Glue Crawler.

3. Click on the category table.


You will notice that the column names are listed as col0, col1, col2, and so on; that is because the data files do not have a header row. In the next steps you will update the column names.

4. Click on the Edit schema button (far upper right corner)

5. Use the information below as a guide to edit ONLY the column names for the category schema (do NOT change the types, just the column names):


CATEGORY table

| Column Name (old) | Column Name (new) |
|---|---|
| col0 | CATID |
| col1 | CATGROUP |
| col2 | CATNAME |
| col3 | CATDESC |

6. When you are finished editing the Category schema, it should look like this:

7. Click the Save button.

8. To return to your tables list click on the Tables link


9. Using the same steps, repeat this process to update the following tables in your teamawesome-tickit-history database. Remember, you are just updating the COLUMN NAMES, not changing the data types. (A scripted alternative using the Glue API is sketched after the VENUE table below.)

• DATE

• EVENT

• LISTING

• SALES

• USERS

• VENUE

DATE Table

| Column Name (old) | Column Name (new) |
|---|---|
| col0 | DATEID |
| col1 | CALDATE |
| col2 | DAY |
| col3 | WEEK |
| col4 | MONTH |
| col5 | QTR |
| col6 | YEAR |
| col7 | HOLIDAY |

EVENT Table

| Column Name (old) | Column Name (new) |
|---|---|
| col0 | EVENTID |
| col1 | VENUEID |
| col2 | CATID |
| col3 | DATEID |
| col4 | EVENTNAME |
| col5 | STARTTIME |

LISTING Table

| Column Name (old) | Column Name (new) |
|---|---|
| col0 | LISTID |
| col1 | SELLERID |
| col2 | EVENTID |
| col3 | DATEID |
| col4 | NUMTICKETS |
| col5 | PRICEPERTICKET |
| col6 | TOTALPRICE |
| col7 | LISTTIME |

SALES Table

| Column Name (old) | Column Name (new) |
|---|---|
| col0 | SALESID |
| col1 | LISTID |
| col2 | SELLERID |
| col3 | BUYERID |
| col4 | EVENTID |
| col5 | DATEID |
| col6 | QTYSOLD |
| col7 | PRICEPAID |
| col8 | COMMISSION |
| col9 | SALETIME |

USERS Table

| Column Name (old) | Column Name (new) |
|---|---|
| col0 | USERID |
| col1 | USERNAME |
| col2 | FIRSTNAME |
| col3 | LASTNAME |
| col4 | CITY |
| col5 | STATE |
| col6 | EMAIL |
| col7 | PHONE |
| col8 | LIKESPORTS |
| col9 | LIKETHEATRE |
| col10 | LIKECONCERTS |
| col11 | LIKEJAZZ |
| col12 | LIKECLASSICAL |
| col13 | LIKEOPERA |
| col14 | LIKEROCK |
| col15 | LIKEVEGAS |
| col16 | LIKEBROADWAY |
| col17 | LIKEMUSICALS |

VENUE Table


| Column Name (old) | Column Name (new) |
|---|---|
| col0 | VENUEID |
| col1 | VENUENAME |
| col2 | VENUECITY |
| col3 | VENUESTATE |
| col4 | VENUESEATS |
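
Note: editing seven schemas by hand is tedious. If you prefer, the same renames can be scripted through the Glue API. A minimal boto3 sketch, assuming the teamawesome-tickit-history database and showing only the category mapping (extend RENAMES with the other tables above):

```python
import boto3

glue = boto3.client("glue", region_name="us-west-2")

DATABASE = "teamawesome-tickit-history"

# Old-to-new column names per table; only category is shown here.
# Add entries for date, event, listing, sales, users and venue.
RENAMES = {
    "category": {"col0": "CATID", "col1": "CATGROUP", "col2": "CATNAME", "col3": "CATDESC"},
}

for table_name, mapping in RENAMES.items():
    table = glue.get_table(DatabaseName=DATABASE, Name=table_name)["Table"]

    # Rename the columns in place; the data types are left untouched.
    for column in table["StorageDescriptor"]["Columns"]:
        column["Name"] = mapping.get(column["Name"], column["Name"])

    # update_table only accepts writable fields, so copy the ones we need.
    table_input = {
        "Name": table["Name"],
        "StorageDescriptor": table["StorageDescriptor"],
        "PartitionKeys": table.get("PartitionKeys", []),
        "TableType": table.get("TableType", "EXTERNAL_TABLE"),
        "Parameters": table.get("Parameters", {}),
    }
    glue.update_table(DatabaseName=DATABASE, TableInput=table_input)
```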

## Lab 3: Querying your Data Lake with Amazon Athena

The ticketing team at UnicornNation has heard about the work you did in setting up the data catalog. They need some data urgently and need your help in setting up and running queries to retrieve it.

In this lab, you are going to use Amazon Athena to create some queries, which you will
then save to make it easy for users to run and consume.

As part of the lab, you will also be running your own queries to help users answer some
basic questions around ticket sales, customers and more.

To run a query using Amazon Athena, follow these steps:

1. Log in to the AWS Console using the login URL and credentials provided. Once you are logged in, check in the upper right-hand corner that you are using the Oregon region. If not, use the drop-down list to change to the Oregon region.

2. From the console, navigate to the Athena service


3. Using the Database drop-down list, change the database to point to your Tickit database teamawesome-tickit-history

4. You can open a new query text box


5. To run a query, paste the SQL script below into the new text box and click the Run Query button

6. Using the query text below, run each query to answer these questions:

Question 1: Using the following query, what were the Top 5 ticket sellers for events in
San Diego in 2008?

select sellerid, username, city, firstname ||' '|| lastname as fullname, sum(qtysold) a
from sales, date, users
where sales.sellerid = users.userid
and sales.dateid = date.dateid
and year = 2008
and city = 'San Diego'
group by sellerid, username, city, firstname ||' '|| lastname
order by 5 desc
limit 5;


Question 2: Using the following query, who were the buyers AND sellers for ticket
transactions that cost $10,000 or more?

select listid, lastname, firstname, username,
pricepaid as price, 'S' as buyorsell
from sales, users
where sales.sellerid=users.userid
and pricepaid >=10000
union
select listid, lastname, firstname, username, pricepaid,
'B' as buyorsell
from sales, users
where sales.buyerid=users.userid
and pricepaid >=10000
order by 1, 2, 3, 4, 5;
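
Note: the lab brief mentions saving queries so users can re-run them easily. One way to do that programmatically is with Athena named queries; a minimal boto3 sketch using Question 1's query, with an illustrative query name:

```python
import boto3

athena = boto3.client("athena", region_name="us-west-2")

# Question 1 query, saved so the ticketing team can re-run it later.
top_sellers_sql = """
select sellerid, username, city, firstname ||' '|| lastname as fullname, sum(qtysold) a
from sales, date, users
where sales.sellerid = users.userid
and sales.dateid = date.dateid
and year = 2008
and city = 'San Diego'
group by sellerid, username, city, firstname ||' '|| lastname
order by 5 desc
limit 5;
"""

athena.create_named_query(
    Name="top-5-san-diego-sellers-2008",  # illustrative name
    Database="teamawesome-tickit-history",
    QueryString=top_sellers_sql,
    Description="Top 5 ticket sellers for events in San Diego in 2008",
)
```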

## Lab 4: Transforming Data with AWS Glue

The IT team at UnicornNation is looking for a way to reduce their AWS spend for this
project. After reviewing the file formats they are using, they have decided that for the
Merchandise Sales data, they are going to change the file format from CSV to Parquet.

Parquet is a columnar, compressed format and will help them reduce the amount of
data that is scanned when using Amazon Athena.


In this lab, you are going to create a Glue Job to convert the existing CSV data files to
Parquet.
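
Note: when you reach the script editor later in this lab, the script Glue generates for this kind of CSV-to-Parquet job roughly follows the pattern sketched below. This is a simplified illustration, not the exact generated code, and the database, table, and bucket names are the placeholders used in this lab:

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job setup: resolve the job name and create the contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CSV-backed table that the Lab 1 crawler added to the catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="teamawesome-merch-sales",
    table_name="merchandise_sales_xxxxxxx",  # placeholder; use your table name from Lab 1
)

# Write the same records back out to S3 in Parquet format.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://teamawesome-merch-sales-parquet-xxxxxxx/"},
    format="parquet",
)

job.commit()
```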

1. Log in to the AWS Console using the login URL and credentials provided. Once you are logged in, check in the upper right-hand corner that you are using the Oregon region. If not, use the drop-down list to change to the Oregon region.

2. From the console, navigate to the AWS Glue service

3. Under the ETL menu, select Jobs and then click the button for Add Job

4. For the Name of the Job, type teamawesome-convert-to-parquet

5. Under IAM role, use the drop-down list to select the GlueServiceRole IAM role

Leave the rest of the parameters at their defaults

6. Click Next

7. For your data source, select the database you created for the Merchandise Sales data; look for a data source with a name similar to merchandise_sales_xxxxxxx and then click the Next button

8. On the Choose a transformation type page, select Change schema, then click the Next button

9. On the Choose a data target page, select Create tables in your data target

10. For the Data store, select Amazon S3

11. For the Format, select Parquet

12. For the Target path, click on the small folder icon and select the S3 bucket teamawesome-merch-sales-parquet-xxxxxxx , click Select, then click Next

13. On the Map the source columns to target columns page, leave the defaults and click Save job and edit script


14. Close the Script editor tips window

15. From the toolbar, click the Save button, and then click the X icon (far right) to return to the list of jobs

16. Locate the job you created and click to select the job

17. Click the Action button and select Run Job; this will start your Glue ETL job

18. Click again on the checkbox next to your ETL job to see the status panel.

19. You will notice the job status is Running; wait a couple of minutes until it changes to Succeeded

20. Navigate to the S3 console (Services > S3) and search for the bucket teamawesome-merch-sales-parquet-xxxxxxxx . Verify that the Parquet files were successfully created in the bucket

With the Parquet files created, you could then crawl those files and start using the data
with Athena. Because Parquet is a columnar, compressed format, there will be less data
scanned, reducing the cost of your Athena queries.

LAB END
