Lab - Building and Orchestrating ETL Pipelines by Using Athena and Step Functions
Duration
This lab will require approximately 120 minutes to complete.
Scenario
Previously, you created a proof of concept (POC) to demonstrate how to use AWS Glue to infer a
data schema and manually adjust column names. Then, you used Athena to query the data.
Although Mary likes this approach, each time she starts a new project, she must complete many
manual steps. She has asked you to create a reusable data pipeline that will help her quickly
start new data processing projects.
One of Mary's projects is to study New York City taxi data. She knows the column names for the
table data and has already created views and ingestion SQL commands for you. She wants to
study taxi usage patterns in New York City in the early part of 2020.
Mary has requested that you store the table data partitioned by month in Parquet format with
Snappy compression. This will improve query efficiency and reduce cost. Because this is a POC,
Mary is OK with you using hard-coded values for column names, partitions, views, and S3 bucket
information.
Mary has provided the following:
Links to access the taxi data
The partitions that she would like to create (pickup_year and pickup_month)
SQL ingestion scripts
A script that will create a view in SQL that she wants to use for this particular project
When you start the lab, the environment will contain the resources that are shown in the following
diagram.
By the end of the lab, you will have created the architecture that is shown in the following diagram.
After doing some research, you decided to take advantage of the flexibility of Step Functions to
create the ETL pipeline logic. With Step Functions, you can handle initial runs where the table data
and SQL view don't exist, in addition to subsequent runs where the tables and view do exist.
OK, let's get started!
2. To connect to the AWS Management Console, choose the AWS link in the upper-left corner.
A new browser tab opens and connects you to the console.
Tip: If a new browser tab does not open, a banner or icon is usually at the top of your
browser with the message that your browser is preventing the site from opening pop-up
windows. Choose the banner or icon, and then choose Allow pop-ups.
3. Open all the AWS service consoles that you will use during this lab.
Tip: Since you will use the consoles for many AWS services throughout this lab, it will be
easier to have each console open in a separate browser tab.
In the search box to the right of Services, search for Step Functions
Open the context menu (right-click) on the Step Functions entry that appears in the
search results, and choose the option to open the link in a new tab.
Repeat this same process to open the AWS service consoles for each of these additional
services:
S3
AWS Glue
Athena
Cloud9
IAM
Confirm that you now have each of the six AWS service consoles open in a different
browser tab.
mybucket="<FMI_1>"
echo $mybucket
Tip: You might be prompted about safely pasting multiline text. To disable this prompt for
the future, clear Ask before pasting multiline code. Choose Paste.
Analysis: With these commands, you assigned your bucket name to a shell variable. You
then echoed the value of that variable to the terminal. Saving the bucket name as a
variable will be useful when you run the next few commands.
Copy the yellow taxi data for January into a prefix (folder) in your bucket called
nyctaxidata/data.
Note: The command takes about 20 seconds to complete. The file that you are copying
is approximately 500 MB in size. Wait for the terminal prompt to display again before
continuing.
Copy the yellow taxi data for February into a prefix in your bucket called nyctaxidata/data.
Tip: Much more taxi data is available, and in a production solution, you would likely want
to include many years of data. However, for POC purposes, using 2 months of data will
suffice.
Copy the location information (lookup table) into a prefix in your bucket called
nyctaxidata/lookup.
Important: The space in the taxi _zone_lookup.csv file name is intentional.
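Taken together, the copy commands might look like the following sketch. The source bucket and key names here are assumptions (based on the public NYC taxi dataset and the file names mentioned later in this lab); use the links that Mary provided:

# Hypothetical source location; substitute the links that Mary provided.
src="s3://nyc-tlc"

# Yellow taxi data for January and February into nyctaxidata/data
aws s3 cp "$src/trip data/yellow_tripdata_2020-01.csv" "s3://$mybucket/nyctaxidata/data/"
aws s3 cp "$src/trip data/yellow_tripdata_2020-02.csv" "s3://$mybucket/nyctaxidata/data/"

# Lookup table into nyctaxidata/lookup (the space in the file name is intentional)
aws s3 cp "$src/misc/taxi _zone_lookup.csv" "s3://$mybucket/nyctaxidata/lookup/"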
"LocationID","Borough","Zone","service_zone"
1,"EWR","Newark Airport","EWR"
2,"Queens","Jamaica Bay","Boro Zone"
3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
4,"Manhattan","Alphabet City","Yellow Zone"
5,"Staten Island","Arden Heights","Boro Zone"
6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
7,"Queens","Astoria","Boro Zone"
8,"Queens","Astoria Park","Boro Zone"
...truncated
Analysis: The structure is defined by listing the column names on the first line. Mary is familiar
with these column names; therefore, the SQL commands that she provided will work without
modification later in the lab.
The yellow taxi data file structure for January and February is similar to the following:
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
1,2020-01-01 00:28:15,2020-01-01 00:33:03,1,1.20,1,N,238,239,1,6,3,0.5,1.47,0,0.3,11.27,2.5
1,2020-01-01 00:35:39,2020-01-01 00:43:04,1,1.20,1,N,239,238,1,7,3,0.5,1.5,0,0.3,12.3,2.5
1,2020-01-01 00:47:41,2020-01-01 00:53:52,1,.60,1,N,238,238,1,6,3,0.5,1,0,0.3,10.8,2.5
...truncated
As with the lookup table file, the first line in each file defines the column names.
Congratulations! In this task, you successfully loaded the source data. Now, you can start
building.
Analysis: Athena uses the AWS Glue Data Catalog to store and retrieve table metadata
for data that is stored in Amazon S3. The table metadata indicates to the Athena query
engine how to find, read, and process the data that you want to query.
In the Inspector panel on the right:
Change State name to Create Glue DB
Keep the Integration type as Optimized.
For API Parameters, replace the default JSON code with the following. Replace
<FMI_1> with your actual bucket name (the one with gluelab in the name).
{
"QueryString": "CREATE DATABASE if not exists nyctaxidb",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://<FMI_1>/athena/"
}
}
When the Create Glue DB step turns green, as shown in the following image, the step
succeeded.
In this task, you successfully created an AWS Glue database by using a Step Functions workflow.
With the new StartQueryExecution task selected, in the Inspector panel, change State
name to Run Table Lookup
After you rename the state, the workflow displays as shown in the following image.
For API Parameters, replace the default JSON code with the following. Replace <FMI_1>
with your actual bucket name (the one with gluelab in the name).
{
"QueryString": "show tables in nyctaxidb",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://<FMI_1>/athena/"
}
}
{
"Comment": "A description of my state machine",
"StartAt": "Create Glue DB",
"States": {
"Create Glue DB": {
"Type": "Task",
"Resource": "arn:aws:states:::athena:startQueryExecution.sync",
"Parameters": {
"QueryString": "CREATE DATABASE if not exists nyctaxidb",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://<your-gluelab-bucket-name>/athena/"
}
},
"Next": "Run Table Lookup"
},
"Run Table Lookup": {
"Type": "Task",
"Resource": "arn:aws:states:::athena:startQueryExecution.sync",
"Parameters": {
"QueryString": "show tables in nyctaxidb",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://<your-gluelab-bucket-name>/athena/"
}
},
"End": true
}
}
}
When prompted about how the IAM role might need new permissions, choose Save
anyway.
Note: Recall that you previously reviewed the permissions that are granted to this IAM
role. The permissions are sufficient to complete all the tasks in this lab.
In the Events history section, notice that the status of each task is provided in addition to
the time that each took to run.
The workflow takes about 1 minute to run, and it will not find any tables.
After the workflow completes, in the Graph view section, choose the Run Table Lookup
task.
In the Details panel to the right, choose the Output tab.
Notice that the task generated a QueryExecutionId. You will use this in the next task.
In the Amazon S3 console, choose the link for the gluelab bucket, and then choose the
athena link.
Notice that the folder (prefix) has more files now.
Tip: You may need to refresh the browser tab to see them.
The .txt files are blank, but a metadata file now exists and contains some data. AWS Glue
will use the metadata file internally.
Congratulations! In this task, you updated the workflow by adding a task that checks whether
tables exist in the AWS Glue database.
{
"QueryExecutionId.$": "$.QueryExecution.QueryExecutionId"
}
Analysis: This task will use the query execution ID that the prior task made available as
an output value. By passing the value along, the next task (which you haven't added yet)
can use the value to evaluate whether tables were found.
Note: You don't need to internally poll for this task to complete, so you don't need to
select Wait for task to complete.
Choose {} Code.
Confirm the definition. It should look similar to the following JSON code where the
<FMI_1> placeholders contain your actual gluelab bucket name.
{
"Comment": "A description of my state machine",
"StartAt": "Create Glue DB",
"States": {
"Create Glue DB": {
"Type": "Task",
"Resource": "arn:aws:states:::athena:startQueryExecution.sync",
"Parameters": {
"QueryString": "CREATE DATABASE if not exists nyctaxidb",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://<FMI_1>/athena/"
}
},
"Next": "Run Table Lookup"
},
"Run Table Lookup": {
"Type": "Task",
"Resource": "arn:aws:states:::athena:startQueryExecution.sync",
"Parameters": {
"QueryString": "show tables in nyctaxidb",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://<FMI_1>/athena/"
}
},
"Next": "Get lookup query results"
},
"Get lookup query results": {
"Type": "Task",
"Resource": "arn:aws:states:::athena:getQueryResults",
"Parameters": {
"QueryExecutionId.$": "$.QueryExecution.QueryExecutionId"
},
"End": true
}
}
}
Choose Save.
$.ResultSet.Rows[0].Data[0].VarCharValue
Choose Save.
Analysis: When you run the workflow and the Get lookup query results task is complete,
the choice state will evaluate the results of the last query.
If tables aren't found (the choice rule evaluates this by checking
$.ResultSet.Rows[0].Data[0].VarCharValue), the workflow will take the REPLACE ME TRUE
STATE route. In the next task, you will replace this state with a process to create tables.
Otherwise, if tables are found, the workflow will take the Default route (the REPLACE ME
FALSE STATE route). Later in this lab, you will replace this state with a process to check
for any new data (for example, February taxi data) and then insert it into an existing table.
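For reference, a minimal sketch of how this choice state might look in the state machine definition. The exact comparison is an assumption based on the rule you just configured: if no tables exist, the query result has no rows, so the value is not present.

"ChoiceStateFirstRun": {
"Type": "Choice",
"Choices": [
{
"Not": {
"Variable": "$.ResultSet.Rows[0].Data[0].VarCharValue",
"IsPresent": true
},
"Next": "REPLACE ME TRUE STATE"
}
],
"Default": "REPLACE ME FALSE STATE"
}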
Congratulations! In this task, you successfully added a choice state to support evaluating the
results of the Get lookup query results task.
21. Add another Athena StartQueryExecution task to the workflow and configure it to create a
table.
Choose the Actions tab, and search for athena
Drag a StartQueryExecution task to the canvas between the ChoiceStateFirstRun state
and the REPLACE ME TRUE STATE state.
With the StartQueryExecution task selected, change State name to Run Create data Table
Query
For Integration type, keep Optimized selected.
For API Parameters, replace the default JSON code with the following. Replace <FMI_1>
and <FMI_2> with your actual bucket name (the one with gluelab in the name).
{
"QueryString": "CREATE EXTERNAL TABLE nyctaxidb.yellowtaxi_data_csv( vendorid
bigint, tpep_pickup_datetime string, tpep_dropoff_datetime string, passenger_count bigint,
trip_distance double, ratecodeid bigint, store_and_fwd_flag string, pulocationid bigint,
dolocationid bigint, payment_type bigint, fare_amount double, extra double, mta_tax
double, tip_amount double, tolls_amount double, improvement_surcharge double,
total_amount double, congestion_surcharge double) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION
's3://<FMI_1>/nyctaxidata/data/' TBLPROPERTIES ( 'skip.header.line.count'='1')",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://<FMI_2>/athena/"
}
}
Analysis: Recall that you reviewed the structure of the source data files that you copied
into your gluelab bucket. The yellow_tripdata_2020-01.csv and yellow_tripdata_2020-
02.csv source files are in comma-separated value (CSV) format.
The first line in each file defines the columns of data that are contained in the file. The
columns include vendorid, tpep_pickup_datetime, and the other columns that are
defined in the CREATE EXTERNAL TABLE SQL statement that you just entered for the
task.
The CSV file doesn't define data types for each column of data, but your AWS Glue table
does define them (for example, as bigint and string). Note that, by defining the table as
EXTERNAL, you indicate that the table data will remain in Amazon S3, in the location
defined by the LOCATION part of the command (s3://<gluelab-bucket>/nyctaxidata/data/).
The QueryString that you are sending to Athena in this task is a standard CREATE
EXTERNAL TABLE statement. Later in this lab, you will use the Create Table as Select
(CTAS) feature of Athena, which uses standard SELECT queries to create new tables. By
using CTAS, you can extract, transform, and load data into Amazon S3 for processing. For
more information, see "Using CTAS and INSERT INTO for ETL and Data Analysis" at
https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into-etl.html.
Select Wait for task to complete - optional.
Note: It is important for the table to be fully created before the workflow continues.
Choose Save.
23. Verify that the updated workflow created a table in the AWS Glue database the first time that
you ran it.
In the Amazon S3 console, navigate to the contents of the athena folder in your gluelab
bucket.
Notice that the folder contains another metadata file and additional empty text files.
Note: The empty text files are basic output files from Step Functions tasks. You can
ignore them.
In the AWS Glue console, in the navigation pane, choose Tables.
Notice that a yellowtaxi_data_csv table now exists. This is the AWS Glue table that Athena
created when your Step Functions workflow invoked the Run Create data Table Query task.
To view the schema details, choose the link for the yellowtaxi_data_csv table.
The schema looks like the following image.
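To preview the data itself (optional), you could run a query like the following from the Athena query editor. This is a sketch; the database and table names match those created by the workflow:

SELECT tpep_pickup_datetime, trip_distance, total_amount
FROM nyctaxidb.yellowtaxi_data_csv
LIMIT 10;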
24. Run the workflow again to test the other choice route.
In the Step Functions console, choose the link for the WorkflowPOC state machine.
Choose Start execution.
For Name, enter NewTest and then choose Start execution again.
Wait for the workflow to complete successfully.
Analysis: You want to ensure that, if the workflow finds the new table (as it should this
time), the workflow will take the other choice route and invoke the REPLACE ME FALSE
STATE state.
The following image shows the completed workflow.
This run didn't re-create the database or try to overwrite the table that was created during
the previous run. Step Functions did generate some output files in Amazon S3 with
updated AWS Glue metadata.
In this task, you successfully created an AWS Glue table that points to the yellow taxi data.
"LocationID","Borough","Zone","service_zone"
1,"EWR","Newark Airport","EWR"
The query will again use a CREATE EXTERNAL TABLE statement so that Athena creates a table over the lookup data.
Drag a StartQueryExecution task between the Run Create data Table Query task and the
End task.
With the StartQueryExecution task selected, change State name to Run Create lookup
Table Query
For API Parameters, replace the default JSON code with the following. Replace <FMI_1>
and <FMI_2> with your actual bucket name.
{
"QueryString": "CREATE EXTERNAL TABLE nyctaxidb.nyctaxi_lookup_csv( locationid bigint,
borough string, zone string, service_zone string, latitude double, longitude double) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION
's3://<FMI_1>/nyctaxidata/lookup/' TBLPROPERTIES ( 'skip.header.line.count'='1')",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://<FMI_2>/athena/"
}
}
Congratulations! In this task, you successfully used a Step Functions workflow to create both
tables in the AWS Glue database.
29. Update the WorkflowPOC state machine to create a Parquet table with Snappy compression.
In the Step Functions console, use the method that you used in previous steps to open
the WorkflowPOC state machine in Workflow Studio.
In the Actions panel, search for athena
Drag a StartQueryExecution task to the canvas between the Run Create lookup Table
Query task and the End task.
With the StartQueryExecution task selected, change State name to Run Create Parquet
lookup Table Query
For API Parameters, replace the default JSON code with the following. Replace <FMI_1>
and <FMI_2> with your actual bucket name.
{
"QueryString": "CREATE table if not exists nyctaxidb.nyctaxi_lookup_parquet WITH
(format='PARQUET',parquet_compression='SNAPPY', external_location =
's3://<FMI_1>/nyctaxidata/optimized-data-lookup/') AS SELECT locationid, borough, zone ,
service_zone , latitude ,longitude FROM nyctaxidb.nyctaxi_lookup_csv",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://<FMI_2>/athena/"
}
}
Choose Save.
30. Test by removing the existing tables from AWS Glue, running the workflow, and verifying the
results.
In the AWS Glue console, in the navigation pane, choose Tables.
Delete both of the tables in the AWS Glue database.
This will ensure that the correct path is taken in the workflow when you run it next.
In the Step Functions console, use the method that you used in previous steps to run the
WorkflowPOC state machine. Name the test TaskSevenTest
Wait for the workflow to complete successfully.
The following image shows the completed workflow.
Check the AWS Glue console to see that a new nyctaxi_lookup_parquet table was
created.
The following image shows the schema for the table.
Wonderful! In this task, you successfully created the lookup table in Parquet format with Snappy
compression. You will create another Parquet table, and then you will be able to combine the
information from both tables.
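As an optional sanity check (a sketch; the table name matches the one that the workflow just created), you could summarize the lookup data from the Athena query editor:

SELECT borough, COUNT(*) AS zone_count
FROM nyctaxidb.nyctaxi_lookup_parquet
GROUP BY borough
ORDER BY zone_count DESC;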
31. Update the workflow with a new step that creates the yellowtaxi_data_parquet table.
In the Step Functions console, use the method that you used in previous steps to open
the WorkflowPOC state machine in Workflow Studio.
{
"QueryString": "CREATE table if not exists nyctaxidb.yellowtaxi_data_parquet WITH
(format='PARQUET', parquet_compression='SNAPPY',
partitioned_by=array['pickup_year','pickup_month'], external_location =
's3://<FMI_1>/nyctaxidata/optimized-data/') AS SELECT vendorid, tpep_pickup_datetime,
tpep_dropoff_datetime, passenger_count, trip_distance, ratecodeid, store_and_fwd_flag,
pulocationid, dolocationid, fare_amount, extra, mta_tax, tip_amount, tolls_amount,
improvement_surcharge, total_amount, congestion_surcharge, payment_type,
substr(\"tpep_pickup_datetime\",1,4) pickup_year, substr(\"tpep_pickup_datetime\",6,2) AS
pickup_month FROM nyctaxidb.yellowtaxi_data_csv where
substr(\"tpep_pickup_datetime\",1,4) = '2020' and substr(\"tpep_pickup_datetime\",6,2) = '01'",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://<FMI_2>/athena/"
}
}
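Because the table is partitioned by pickup_year and pickup_month, queries that filter on those columns read only the matching partitions. A hedged example of such a query from the Athena query editor:

SELECT COUNT(*) AS january_trips
FROM nyctaxidb.yellowtaxi_data_parquet
WHERE pickup_year = '2020' AND pickup_month = '01';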
32. Test by removing the existing tables in AWS Glue, running the workflow, and then verifying the
results.
In the AWS Glue console, delete all three of the existing tables.
This will ensure that the workflow takes the correct path when you run it next.
In the S3 console, navigate to the contents of the nyctaxidata folder in your gluelab
bucket.
Select the optimized-data-lookup prefix, and choose Delete.
On the Delete objects page, enter permanently delete in the field at the bottom of the
page, and then choose Delete objects.
Choose Close.
Explanation: The permissions that were granted to Athena don't allow Athena to delete
table data that is stored in Amazon S3. Therefore, you need to manually remove the
optimized-data-lookup prefix in your S3 bucket before running the workflow. If you
don't, the workflow will fail during the Run Create Parquet lookup Table Query task. This
wasn't a problem with the other tables because they were defined as external tables;
however, the Parquet tables are defined as internal tables.
In the Step Functions console, use the method that you used in previous steps to run the
WorkflowPOC state machine. Name the test TaskEightTest
The following image shows the completed workflow.
34. Create a new view of the data by using the Athena query editor.
In the Data panel, in the Tables and views section, choose Create > CREATE VIEW.
The following SQL commands are populated in a query tab to help you get started.
-- View Example
CREATE OR REPLACE VIEW view_name AS
SELECT column1, column2
FROM table_name
WHERE condition;
This view looks like it could be quite useful for the team. You are getting close to
completing the POC!
Great progress! In this task, you successfully created a view of the data that was obtained by
querying both of the Parquet tables.
36. Add a step to the Step Functions workflow to create the view.
In the Step Functions console, use the method that you used in previous steps to open
the WorkflowPOC state machine in Workflow Studio.
In the Actions panel, search for athena
Drag a StartQueryExecution task between the Run Create Parquet data Table Query task
and the End task.
With the StartQueryExecution task selected, change State name to Run Create View
For API Parameters, replace the default JSON code with the following. Replace <FMI_1>
with your actual bucket name.
{
"QueryString": "create or replace view nyctaxidb.yellowtaxi_data_vw as select a.*,lkup.*
from (select datatab.pulocationid pickup_location ,pickup_month, pickup_year,
sum(cast(datatab.total_amount AS decimal(10, 2))) AS sum_fare , sum(cast(datatab.trip_distance
AS decimal(10, 2))) AS sum_trip_distance , count(*) AS countrec FROM
nyctaxidb.yellowtaxi_data_parquet datatab WHERE datatab.pulocationid is NOT null GROUP
BY datatab.pulocationid, pickup_month, pickup_year) a , nyctaxidb.nyctaxi_lookup_parquet
lkup where lkup.locationid = a.pickup_location",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://<FMI_1>/athena/"
}
}
Notice that the QueryString contains the same query that you ran to create the view by
using the Athena query editor in the previous task.
Select Wait for task to complete - optional.
Choose Save.
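After the view exists, you could explore it from the Athena query editor with a query such as the following (a sketch; the column names come from the view definition above):

SELECT pickup_month, borough, zone, sum_fare, sum_trip_distance, countrec
FROM nyctaxidb.yellowtaxi_data_vw
ORDER BY sum_fare DESC
LIMIT 10;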
Excellent! You now have an entire ETL pipeline as a POC. The POC is constructed to be easy to
reuse for a new project. To adapt the workflow, someone would only need to swap in new bucket
locations and update the queries to match the format of the new data and how they want to
partition and condition it.
The one remaining case that this implementation doesn't handle is processing new
time-sequence data (for example, February data). That is what you will work on next.
You could run this command for each file; however, that would result in two CSV files and two
Parquet files. You want to add data only to the Parquet table for the yellow taxi data.
You decide that an effective approach is to use a Step Functions map flow step, because this type
of step can iterate over all five of the AWS Glue tables and pass over (skip) the tables that you
don't want to update. In this way, you can run the statement on only the yellow taxi data Parquet
table, as sketched after this paragraph.
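As a rough sketch, the map portion of the state machine definition might look like the following. The ItemsPath and item structure are assumptions based on the output of the earlier show tables query, and the Insert New Parquet Data state is shown here as a Pass placeholder for the StartQueryExecution task that you configure in the steps that follow:

"Map": {
"Type": "Map",
"ItemsPath": "$.ResultSet.Rows",
"Iterator": {
"StartAt": "CheckTable",
"States": {
"CheckTable": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.Data[0].VarCharValue",
"StringEquals": "yellowtaxi_data_parquet",
"Next": "Insert New Parquet Data"
}
],
"Default": "Ignore File"
},
"Ignore File": {
"Type": "Pass",
"End": true
},
"Insert New Parquet Data": {
"Type": "Pass",
"End": true
}
}
},
"End": true
}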
40. Add a pass state to the workflow and configure the default choice logic.
From the Flow panel, drag a Pass state to the canvas after the CheckTable state, on the
right side under the arrow that is labeled Default.
With the Pass state selected, change State name to Ignore File
On the canvas, choose the Default label on the arrow under the CheckTable state.
In the Inspector panel, in the Choice Rules section, open the Default rule details.
For Default state, verify that Ignore File is selected.
Important: Don't apply the changes yet. You will add more to the workflow in the next
step.
Analysis: The default rule will be invoked for any AWS Glue table that isn't the one that
you want to modify (which is all of the tables except the yellowtaxi_data_parquet table).
41. Add a StartQueryExecution task to the workflow to update the taxi Parquet table.
In the Flow panel, choose the Actions tab, and search for athena
Drag a StartQueryExecution task to the canvas after the CheckTable state, on the left
side.
With the StartQueryExecution task selected, change State name to Insert New Parquet
Data
For API Parameters, replace the default JSON code with the following. Replace <FMI_1>
with your actual bucket name.
Note: Mary provided this SQL statement to insert the February data. Notice the 02 at the
end of the QueryString.
{
"QueryString": "INSERT INTO nyctaxidb.yellowtaxi_data_parquet select vendorid,
tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, ratecodeid,
store_and_fwd_flag, pulocationid, dolocationid, fare_amount, extra, mta_tax, tip_amount,
tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, payment_type,
substr(\"tpep_pickup_datetime\",1,4) pickup_year, substr(\"tpep_pickup_datetime\",6,2) AS
pickup_month FROM nyctaxidb.yellowtaxi_data_csv where
substr(\"tpep_pickup_datetime\",1,4) = '2020' and substr(\"tpep_pickup_datetime\",6,2) = '02'",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://<FMI_1>/athena/"
}
}
Choose Save.
Excellent! You successfully added the final bit of logic to the workflow. This logic will process
the additional month of taxi data.
42. Choose Execute. For Name, enter TaskTwelveTest and then choose Start execution.
On the Step input tab, change the Index value until the VarCharValue in the JSON code
equals yellowtaxi_data_parquet.
In the graph, choose the Insert New Parquet Data step.
Choose the Step output tab.
Here you see the query that inserted the new data, in addition to other step run details.
Excellent! The ETL pipeline can now also handle the additional data and integrate it into
the view so that the data can be queried from Athena.
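As a final sanity check (a sketch, run from the Athena query editor), you could confirm that both months are now present in the Parquet table:

SELECT pickup_month, COUNT(*) AS trips
FROM nyctaxidb.yellowtaxi_data_parquet
GROUP BY pickup_month
ORDER BY pickup_month;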
47. To find detailed feedback about your work, choose Submission Report.
Lab complete
Congratulations! You have completed the lab.
48. At the top of this page, choose End Lab, and then choose Yes to confirm that you want to
end the lab.
A message panel indicates that the lab is terminating.
© 2022, Amazon Web Services, Inc. and its affiliates. All rights reserved. This work may not be
reproduced or redistributed, in whole or in part, without prior written permission from Amazon
Web Services, Inc. Commercial copying, lending, or selling is prohibited.