

Build an ETL service pipeline to load data incrementally from Amazon S3 to Amazon Redshift using AWS Glue

Created by Rohan Jamadagni (AWS) and Arunabha Datta (AWS)

Environment: Production
Technologies: Analytics; Data lakes; Storage & backup
AWS services: Amazon Redshift; Amazon S3; AWS Glue; AWS Lambda

Summary of building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue

This pattern provides guidance on how to configure Amazon Simple Storage
Service (Amazon S3) for optimal data lake performance, and then load
incremental data changes from Amazon S3 into Amazon Redshift by using
AWS Glue, performing extract, transform, and load (ETL) operations.

The source files in Amazon S3 can have different formats, including comma-
separated values (CSV), XML, and JSON files. This pattern describes how you
can use AWS Glue to convert the source files into a cost-optimized and
performance-optimized format like Apache Parquet. You can query Parquet
files directly from Amazon Athena and Amazon Redshift Spectrum. You can
also load Parquet files into Amazon Redshift, aggregate them, and share the
aggregated data with consumers, or visualize the data by using Amazon
QuickSight.
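
As a rough illustration of the conversion step, the following is a minimal AWS Glue PySpark sketch that reads CSV files from a hypothetical source bucket and writes them to a processed bucket as Parquet. The bucket names and the date/hour prefix are placeholders, not part of this pattern.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Minimal sketch: convert raw CSV files to Parquet with AWS Glue (PySpark).
# Bucket names and the date/hour prefix are illustrative placeholders.
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the raw CSV files for one ingestion window from the source bucket
raw_df = (
    spark.read
    .option("header", "true")
    .csv("s3://source-system-name/2023/01/15/09/")
)

# Write the same data to the processed bucket in Parquet format so that
# Athena, Redshift Spectrum, or a Redshift COPY can consume it efficiently
(
    raw_df.write
    .mode("append")
    .parquet("s3://source-processed-bucket/2023/01/15/09/")
)
```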

Prerequisites and limitations for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue

Prerequisites

An active AWS account.


An S3 source bucket that has the right privileges and contains CSV, XML,
or JSON files.

Assumptions

The CSV, XML, or JSON source files are already loaded into Amazon S3
and are accessible from the account where AWS Glue and Amazon
Redshift are configured.
Best practices for loading the files, splitting the files, compression, and
using a manifest are followed, as discussed in the Amazon Redshift
documentation (https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-
data-from-S3.html) .

The source file structure is unaltered.


The source system is able to ingest data into Amazon S3 by following the
folder structure defined in Amazon S3.
The Amazon Redshift cluster spans a single Availability Zone. (This
architecture is appropriate because AWS Lambda, AWS Glue, and
Amazon Athena are serverless.) For high availability, cluster snapshots
are taken at a regular frequency.

Limitations

The file formats are limited to those that are currently supported by AWS
Glue (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-
format.html) .

Real-time downstream reporting isn't supported.


Architecture for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue

Source technology stack

S3 bucket with CSV, XML, or JSON files

Target technology stack

S3 data lake (with partitioned Parquet file storage)


Amazon Redshift

Target architecture

Data flow

Tools for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue

Amazon S3 (http://aws.amazon.com/s3/) – Amazon Simple Storage
Service (Amazon S3) is a highly scalable object storage service. Amazon
S3 can be used for a wide range of storage solutions, including websites,
mobile applications, backups, and data lakes.
AWS Lambda (http://aws.amazon.com/lambda/) – AWS Lambda lets you
run code without provisioning or managing servers. AWS Lambda is an
event-driven service; you can set up your code to automatically initiate
from other AWS services.
Amazon Redshift (http://aws.amazon.com/redshift/) – Amazon Redshift is
a fully managed, petabyte-scale data warehouse service. With Amazon
Redshift, you can query petabytes of structured and semi-structured
data across your data warehouse and your data lake using standard SQL.
AWS Glue (http://aws.amazon.com/glue/) – AWS Glue is a fully managed
ETL service that makes it easier to prepare and load data for analytics.
AWS Glue discovers your data and stores the associated metadata (for
example, table definitions and schema) in the AWS Glue Data Catalog.
Your cataloged data is immediately searchable, can be queried, and is
available for ETL.
AWS Secrets Manager (http://aws.amazon.com/secrets-manager/) – AWS
Secrets Manager facilitates protection and central management of
secrets needed for application or service access. The service stores
database credentials, API keys, and other secrets, and eliminates the
need to hardcode sensitive information in plaintext format. Secrets
Manager also offers key rotation to meet security and compliance needs.
It has built-in integration for Amazon Redshift, Amazon Relational
Database Service (Amazon RDS), and Amazon DocumentDB. You can
store and centrally manage secrets by using the Secrets Manager
console, the AWS Command Line Interface (AWS CLI), or the Secrets Manager API and SDKs.
Amazon Athena (http://aws.amazon.com/athena/) – Amazon Athena is an
interactive query service that makes it easy to analyze data that's stored
in Amazon S3. Athena is serverless and integrated with AWS Glue, so it
can directly query the data that's cataloged using AWS Glue. Athena is
elastically scaled to deliver interactive query performance.

Epics for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue

Create the S3 buckets and folder structure

Task: Analyze source systems for data structure and attributes.
Description: Perform this task for each data source that contributes to the Amazon S3 data lake.
Skills required: Data engineer

Task: Define the partition and access strategy.
Description: This strategy should be based on the frequency of data captures, delta processing, and consumption needs. Make sure that S3 buckets are not open to the public and that access is controlled by specific service role-based policies only. For more information, see the Amazon S3 documentation (https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html).
Skills required: Data engineer

Task: Create separate S3 buckets for each data source type and a separate S3 bucket per source for the processed (Parquet) data.
Description: Create a separate bucket for each source, and then create a folder structure that's based on the source system's data ingestion frequency; for example, s3://source-system-name/date/hour. For the processed (converted to Parquet format) files, create a similar structure; for example, s3://source-processed-bucket/date/hour. For more information about creating S3 buckets, see the Amazon S3 documentation (https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html).
Skills required: Data engineer
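
The following boto3 sketch shows one way to create the raw and processed buckets described above. The bucket names are the placeholder names used in this pattern, and bucket names must be globally unique in practice.

```python
import boto3

# Sketch: create one bucket for the raw source files and one for the
# processed (Parquet) output. Names are placeholders from this pattern.
s3 = boto3.client("s3", region_name="us-east-1")

for bucket_name in ["source-system-name", "source-processed-bucket"]:
    # Outside us-east-1, add CreateBucketConfiguration={"LocationConstraint": region}
    s3.create_bucket(Bucket=bucket_name)

# Folder-style prefixes such as date/hour are created implicitly when objects
# are written with keys like:
#   s3://source-system-name/2023/01/15/09/orders.csv
#   s3://source-processed-bucket/2023/01/15/09/part-0000.parquet
```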

Create a data warehouse in Amazon Redshift


Task: Launch the Amazon Redshift cluster with the appropriate parameter groups and maintenance and backup strategy.
Description: Use the Secrets Manager database secret for admin user credentials while creating the Amazon Redshift cluster. For information about creating and sizing an Amazon Redshift cluster, see the Amazon Redshift documentation (https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-cluster.html) and the Sizing Cloud Data Warehouses whitepaper (https://d1.awsstatic.com/whitepapers/Size-Cloud-Data-Warehouse-on-AWS.pdf).
Skills required: Data engineer

Task: Create and attach the IAM service role to the Amazon Redshift cluster.
Description: The AWS Identity and Access Management (IAM) service role ensures access to Secrets Manager and the source S3 buckets. For more information, see the AWS documentation on authorization (https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html) and adding a role (https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-add-role.html).
Skills required: Data engineer

Task: Create the database schema.
Description: Follow Amazon Redshift best practices for table design. Based on the use case, choose the appropriate sort and distribution keys, and the best possible compression encoding. For best practices, see the AWS documentation (https://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html). A hedged DDL sketch follows this table.
Skills required: Data engineer

Task: Configure workload management.
Description: Configure workload management (WLM) queues, short query acceleration (SQA), or concurrency scaling, depending on your requirements. For more information, see Implementing workload management (https://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html) in the Amazon Redshift documentation.
Skills required: Data engineer
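
As a hedged illustration of the table-design guidance, the sketch below creates a hypothetical fact table with an explicit distribution key, sort key, and compression encodings, and runs the DDL through the Amazon Redshift Data API. The table definition, cluster identifier, database name, and secret ARN are assumptions for illustration only.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Illustrative DDL only: pick the distribution key to match frequent join
# columns and the sort key to match frequent filter/range columns.
ddl = """
CREATE TABLE IF NOT EXISTS sales_fact (
    sale_id       BIGINT         ENCODE az64,
    customer_id   BIGINT         ENCODE az64,
    sale_date     DATE           ENCODE az64,
    amount        DECIMAL(12,2)  ENCODE az64,
    source_file   VARCHAR(256)   ENCODE lzo
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",  # assumed cluster identifier
    Database="dev",                         # assumed database name
    SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-admin",  # placeholder ARN
    Sql=ddl,
)
```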

Create a secret in Secrets Manager

Task: Create a new secret to store the Amazon Redshift sign-in credentials in Secrets Manager.
Description: This secret stores the credentials for the admin user as well as individual database service users. For instructions, see the Secrets Manager documentation (https://docs.aws.amazon.com/secretsmanager/latest/userguide/manage_create-basic-secret.html). Choose Amazon Redshift Cluster as the secret type. Additionally, on the Secret rotation page, turn on rotation. This will create the appropriate user in the Amazon Redshift cluster and will rotate the key secrets at defined intervals. A sketch that retrieves this secret from code follows this table.
Skills required: Data engineer

Task: Create an IAM policy to restrict Secrets Manager access.
Description: Restrict Secrets Manager access to only Amazon Redshift administrators and AWS Glue.
Skills required: Data engineer
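
For illustration, this sketch retrieves the Redshift credentials from the secret at runtime (for example, inside an AWS Glue Python shell job) instead of hardcoding them. The secret name is a placeholder; the JSON keys shown are the ones Secrets Manager uses for Amazon Redshift cluster secrets.

```python
import json
import boto3

# Sketch: fetch the Amazon Redshift credentials at runtime.
# "redshift-admin" is a placeholder secret name.
secrets = boto3.client("secretsmanager")
response = secrets.get_secret_value(SecretId="redshift-admin")
credentials = json.loads(response["SecretString"])

# Amazon Redshift cluster secrets store these keys, among others
host = credentials["host"]
port = credentials["port"]
user = credentials["username"]
password = credentials["password"]
```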

Configure AWS Glue


Task: In the AWS Glue Data Catalog, add a connection for Amazon Redshift.
Description: For instructions, see the AWS Glue documentation (https://docs.aws.amazon.com/glue/latest/dg/console-connections.html).
Skills required: Data engineer

Task: Create and attach an IAM service role for AWS Glue to access Secrets Manager, Amazon Redshift, and S3 buckets.
Description: For more information, see the AWS Glue documentation (https://docs.aws.amazon.com/glue/latest/dg/create-service-policy.html).
Skills required: Data engineer

Task: Define the AWS Glue Data Catalog for the source.
Description: This step involves creating a database and the required tables in the AWS Glue Data Catalog. You can either use a crawler to catalog the tables in the AWS Glue database, or define them as Amazon Athena external tables. You can also access the external tables defined in Athena through the AWS Glue Data Catalog. See the AWS documentation for more information about defining the Data Catalog (https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html) and creating an external table (https://docs.aws.amazon.com/athena/latest/ug/creating-tables.html) in Athena.
Skills required: Data engineer

Task: Create an AWS Glue job to process source data.
Description: The AWS Glue job can be a Python shell or PySpark job that standardizes, deduplicates, and cleanses the source data files. To optimize performance and avoid querying the entire S3 source bucket, partition the S3 bucket by date, broken down by year, month, day, and hour, and pass the partition to the AWS Glue job as a pushdown predicate. For more information, see the AWS Glue documentation (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html). Load the processed and transformed data into the processed S3 bucket partitions in Parquet format, where it can be queried from Athena. A PySpark sketch of this job follows this table.
Skills required: Data engineer

Task: Create an AWS Glue job to load data into Amazon Redshift.
Description: The AWS Glue job can be a Python shell or PySpark job that loads the data by upserting it, followed by a complete refresh. For details, see the AWS Glue documentation (https://docs.aws.amazon.com/glue/latest/dg/author-job.html) and the Additional information section.
Skills required: Data engineer

Task: (Optional) Schedule AWS Glue jobs by using triggers as necessary.
Description: The incremental data load is primarily driven by an Amazon S3 event that causes an AWS Lambda function to call the AWS Glue job. Use AWS Glue trigger-based scheduling for any data loads that demand time-based instead of event-based scheduling.
Skills required: Data engineer
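
To make the processing job concrete, here is a minimal PySpark sketch that reads only the partition passed in by the Lambda function, applies it as a pushdown predicate, deduplicates the rows, and writes Parquet to the processed bucket. The "s3_prefix" job argument, the Data Catalog names, and the bucket name are assumptions for illustration.

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Sketch of the processing job. "s3_prefix" is a hypothetical job argument
# passed by the Lambda function, for example "source-bucket/2023/01/15/09".
args = getResolvedOptions(sys.argv, ["JOB_NAME", "s3_prefix"])
year, month, day, hour = args["s3_prefix"].rstrip("/").split("/")[-4:]

glue_context = GlueContext(SparkContext.getOrCreate())

# Read only the partition for this load by using a pushdown predicate,
# so the job never scans the entire source bucket.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="source_db",        # assumed Data Catalog database
    table_name="source_table",   # assumed Data Catalog table
    push_down_predicate=(
        f"year='{year}' and month='{month}' and day='{day}' and hour='{hour}'"
    ),
)

# Standardize and deduplicate with ordinary Spark operations
clean_df = dyf.toDF().dropDuplicates()

# Write the cleansed data to the processed bucket as Parquet, keyed by partition
(
    clean_df.write
    .mode("append")
    .parquet(f"s3://source-processed-bucket/{year}/{month}/{day}/{hour}/")
)
```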

Create a Lambda function


Task: Create and attach an IAM service-linked role for AWS Lambda to access S3 buckets and the AWS Glue job.
Description: Create an IAM service-linked role for AWS Lambda with a policy to read Amazon S3 objects and buckets, and a policy to access the AWS Glue API to start an AWS Glue job. For more information, see the AWS Knowledge Center (http://aws.amazon.com/premiumsupport/knowledge-center/lambda-execution-role-s3-bucket/).
Skills required: Data engineer

Task: Create a Lambda function to run the AWS Glue job based on the defined Amazon S3 event.
Description: The Lambda function should be initiated by the creation of the Amazon S3 manifest file. The function should pass the Amazon S3 folder location (for example, source_bucket/year/month/date/hour) to the AWS Glue job as a parameter. The AWS Glue job will use this parameter as a pushdown predicate to optimize file access and job processing performance. For more information, see the AWS Glue documentation (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html). A minimal handler sketch follows this table.
Skills required: Data engineer

Task: Create an Amazon S3 PUT object event to detect object creation, and call the respective Lambda function.
Description: The Amazon S3 PUT object event should be initiated only by the creation of the manifest file. The manifest file controls the Lambda function and the AWS Glue job concurrency, and processes the load as a batch instead of processing individual files that arrive in a specific partition of the S3 source bucket. For more information, see the Lambda documentation (https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html).
Skills required: Data engineer
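
A minimal handler sketch for the Lambda function described above might look like the following. The Glue job name and the "--s3_prefix" argument are assumptions carried over from the earlier sketches, not names defined by this pattern.

```python
import json
import urllib.parse

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # The S3 PUT event fires when the manifest file is created
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Derive the partition prefix (year/month/date/hour) from the manifest key
    prefix = "/".join(key.split("/")[:-1])

    # Start the AWS Glue job, passing the prefix so the job can build its
    # pushdown predicate ("process-source-data" is an assumed job name)
    run = glue.start_job_run(
        JobName="process-source-data",
        Arguments={"--s3_prefix": f"{bucket}/{prefix}"},
    )
    return {"statusCode": 200, "body": json.dumps(run["JobRunId"])}
```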

Related resources for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue

Amazon S3 documentation
(https://docs.aws.amazon.com/AmazonS3/latest/gsg/GetStartedWithS3.html)

AWS Glue documentation


(https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html)

Amazon Redshift documentation


(https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html)

AWS Lambda (http://aws.amazon.com/lambda/)

Amazon Athena (http://aws.amazon.com/athena/)

AWS Secrets Manager (http://aws.amazon.com/secrets-manager/)


Additional information for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue
Detailed approach for upsert and complete refresh

Upsert: This is for datasets that require historical aggregation, depending on the business use case. Follow one of the approaches described in Updating and inserting new data (https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html) (Amazon Redshift documentation) based on your business needs.

Complete refresh: This is for small datasets that don't need historical
aggregations. Follow one of these approaches:

1. Truncate the Amazon Redshift table.


2. Load the current partition from the staging area.

or:

1. Create a temporary table with current partition data.


2. Drop the target Amazon Redshift table.
3. Rename the temporary table to the target table.
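
To make both approaches concrete, here is a hedged sketch that runs the corresponding SQL through the Amazon Redshift Data API. The target table name, join key, S3 partition path, IAM role, cluster identifier, and secret ARN are all placeholders; the upsert follows the staging-table merge pattern from the Amazon Redshift documentation linked above.

```python
import boto3

redshift_data = boto3.client("redshift-data")

COPY_OPTIONS = (
    "IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role' "  # placeholder role
    "FORMAT AS PARQUET"
)
PARTITION = "s3://source-processed-bucket/2023/01/15/09/"  # placeholder partition path

# Upsert (merge) via a staging table: delete matching rows, then insert.
upsert_sqls = [
    "CREATE TEMP TABLE stage (LIKE target_table);",
    f"COPY stage FROM '{PARTITION}' {COPY_OPTIONS};",
    "DELETE FROM target_table USING stage WHERE target_table.id = stage.id;",
    "INSERT INTO target_table SELECT * FROM stage;",
    "DROP TABLE stage;",
]

# Complete refresh: truncate the table and reload the current partition.
refresh_sqls = [
    "TRUNCATE target_table;",
    f"COPY target_table FROM '{PARTITION}' {COPY_OPTIONS};",
]

# batch_execute_statement runs the statements in order as one transaction
redshift_data.batch_execute_statement(
    ClusterIdentifier="analytics-cluster",  # assumed cluster identifier
    Database="dev",                         # assumed database name
    SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-admin",  # placeholder
    Sqls=upsert_sqls,                       # or refresh_sqls for the complete refresh
)
```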

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
