

Build an ETL service pipeline to load data incrementally from Amazon S3 to Amazon Redshift using AWS Glue

Created by Rohan Jamadagni (AWS) and Arunabha Datta (AWS)

Environment: Production
Technologies: Analytics; Data lakes; Storage & backup
AWS services: Amazon Redshift; Amazon S3; AWS Glue; AWS Lambda

Summary of building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue

This pattern provides guidance on how to configure Amazon Simple Storage
Service (Amazon S3) for optimal data lake performance, and then load
incremental data changes from Amazon S3 into Amazon Redshift by using
AWS Glue, performing extract, transform, and load (ETL) operations.

The source files in Amazon S3 can have different formats, including comma-
separated values (CSV), XML, and JSON files. This pattern describes how you
can use AWS Glue to convert the source files into a cost-optimized and
performance-optimized format like Apache Parquet. You can query Parquet
files directly from Amazon Athena and Amazon Redshift Spectrum. You can
also load Parquet files into Amazon Redshift, aggregate them, and share the
aggregated data with consumers, or visualize the data by using Amazon
QuickSight.
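
As a rough illustration of the conversion step, the following is a minimal AWS Glue PySpark sketch that reads CSV files from a hypothetical source bucket and writes them to a processed bucket as Parquet. The bucket names and the date/hour prefix are placeholders, not part of this pattern.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Minimal sketch: convert raw CSV files to Parquet with AWS Glue (PySpark).
# Bucket names and the date/hour prefix are illustrative placeholders.
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the raw CSV files for one ingestion window from the source bucket
raw_df = (
    spark.read
    .option("header", "true")
    .csv("s3://source-system-name/2023/01/15/09/")
)

# Write the same data to the processed bucket in Parquet format so that
# Athena, Redshift Spectrum, or a Redshift COPY can consume it efficiently
(
    raw_df.write
    .mode("append")
    .parquet("s3://source-processed-bucket/2023/01/15/09/")
)
```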

Prerequisites and limitations for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue

Prerequisites

An active AWS account.


An S3 source bucket that has the right privileges and contains CSV, XML,
or JSON files.

Assumptions

The CSV, XML, or JSON source files are already loaded into Amazon S3
and are accessible from the account where AWS Glue and Amazon
Redshift are configured.
Best practices for loading the files, splitting the files, compression, and
using a manifest are followed, as discussed in the Amazon Redshift
documentation (https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-
data-from-S3.html) .

The source file structure is unaltered.


The source system is able to ingest data into Amazon S3 by following the
folder structure defined in Amazon S3.
The Amazon Redshift cluster spans a single Availability Zone. (This
architecture is appropriate because AWS Lambda, AWS Glue, and
Amazon Athena are serverless.) For high availability, cluster snapshots
are taken at a regular frequency.

Limitations

The file formats are limited to those that are currently supported by AWS
Glue (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-
format.html) .

Real-time downstream reporting isn't supported.


Architecture for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue

Source technology stack

S3 bucket with CSV, XML, or JSON files

Target technology stack

S3 data lake (with partitioned Parquet file storage)


Amazon Redshift

Target architecture

Data flow

Tools for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue

Amazon S3 (http://aws.amazon.com/s3/) – Amazon Simple Storage
Service (Amazon S3) is a highly scalable object storage service. Amazon
S3 can be used for a wide range of storage solutions, including websites,
mobile applications, backups, and data lakes.
AWS Lambda (http://aws.amazon.com/lambda/) – AWS Lambda lets you
run code without provisioning or managing servers. AWS Lambda is an
event-driven service; you can set up your code to automatically initiate
from other AWS services.
Amazon Redshift (http://aws.amazon.com/redshift/) – Amazon Redshift is
a fully managed, petabyte-scale data warehouse service. With Amazon
Redshift, you can query petabytes of structured and semi-structured
data across your data warehouse and your data lake using standard SQL.
AWS Glue (http://aws.amazon.com/glue/) – AWS Glue is a fully managed
ETL service that makes it easier to prepare and load data for analytics.
AWS Glue discovers your data and stores the associated metadata (for
example, table definitions and schema) in the AWS Glue Data Catalog.
Your cataloged data is immediately searchable, can be queried, and is
available for ETL.
AWS Secrets Manager (http://aws.amazon.com/secrets-manager/) – AWS
Secrets Manager facilitates protection and central management of
secrets needed for application or service access. The service stores
database credentials, API keys, and other secrets, and eliminates the
need to hardcode sensitive information in plaintext format. Secrets
Manager also offers key rotation to meet security and compliance needs.
It has built-in integration for Amazon Redshift, Amazon Relational
Database Service (Amazon RDS), and Amazon DocumentDB. You can
store and centrally manage secrets by using the Secrets Manager
console, the AWS Command Line Interface (AWS CLI), or the Secrets Manager API and SDKs.
Amazon Athena (http://aws.amazon.com/athena/) – Amazon Athena is an
interactive query service that makes it easy to analyze data that's stored
in Amazon S3. Athena is serverless and integrated with AWS Glue, so it
can directly query the data that's cataloged using AWS Glue. Athena is
elastically scaled to deliver interactive query performance.

Epics for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue

Create the S3 buckets and folder structure

Task: Analyze source systems for data structure and attributes.
Description: Perform this task for each data source that contributes to the Amazon S3 data lake.
Skills required: Data engineer

Task: Define the partition and access strategy.
Description: This strategy should be based on the frequency of data captures, delta processing, and consumption needs. Make sure that S3 buckets are not open to the public and that access is controlled by specific service role-based policies only. For more information, see the Amazon S3 documentation (https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html).
Skills required: Data engineer

Task: Create separate S3 buckets for each data source type and a separate S3 bucket per source for the processed (Parquet) data.
Description: Create a separate bucket for each source, and then create a folder structure that's based on the source system's data ingestion frequency; for example, s3://source-system-name/date/hour. For the processed (converted to Parquet format) files, create a similar structure; for example, s3://source-processed-bucket/date/hour. For more information about creating S3 buckets, see the Amazon S3 documentation (https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html).
Skills required: Data engineer
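
The following boto3 sketch shows one way to create the raw and processed buckets described above. The bucket names are the placeholder names used in this pattern, and bucket names must be globally unique in practice.

```python
import boto3

# Sketch: create one bucket for the raw source files and one for the
# processed (Parquet) output. Names are placeholders from this pattern.
s3 = boto3.client("s3", region_name="us-east-1")

for bucket_name in ["source-system-name", "source-processed-bucket"]:
    # Outside us-east-1, add CreateBucketConfiguration={"LocationConstraint": region}
    s3.create_bucket(Bucket=bucket_name)

# Folder-style prefixes such as date/hour are created implicitly when objects
# are written with keys like:
#   s3://source-system-name/2023/01/15/09/orders.csv
#   s3://source-processed-bucket/2023/01/15/09/part-0000.parquet
```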

Create a data warehouse in Amazon Redshift


Task: Launch the Amazon Redshift cluster with the appropriate parameter groups and maintenance and backup strategy.
Description: Use the Secrets Manager database secret for admin user credentials while creating the Amazon Redshift cluster. For information about creating and sizing an Amazon Redshift cluster, see the Amazon Redshift documentation (https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-cluster.html) and the Sizing Cloud Data Warehouses whitepaper (https://d1.awsstatic.com/whitepapers/Size-Cloud-Data-Warehouse-on-AWS.pdf).
Skills required: Data engineer

Task: Create and attach the IAM service role to the Amazon Redshift cluster.
Description: The AWS Identity and Access Management (IAM) service role ensures access to Secrets Manager and the source S3 buckets. For more information, see the AWS documentation on authorization (https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html) and adding a role (https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-add-role.html).
Skills required: Data engineer

Task: Create the database schema.
Description: Follow Amazon Redshift best practices for table design. Based on the use case, choose the appropriate sort and distribution keys, and the best possible compression encoding. For best practices, see the AWS documentation (https://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html). A hedged DDL sketch follows this table.
Skills required: Data engineer

Task: Configure workload management.
Description: Configure workload management (WLM) queues, short query acceleration (SQA), or concurrency scaling, depending on your requirements. For more information, see Implementing workload management (https://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html) in the Amazon Redshift documentation.
Skills required: Data engineer
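
As a hedged illustration of the table-design guidance, the sketch below creates a hypothetical fact table with an explicit distribution key, sort key, and compression encodings, and runs the DDL through the Amazon Redshift Data API. The table definition, cluster identifier, database name, and secret ARN are assumptions for illustration only.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Illustrative DDL only: pick the distribution key to match frequent join
# columns and the sort key to match frequent filter/range columns.
ddl = """
CREATE TABLE IF NOT EXISTS sales_fact (
    sale_id       BIGINT         ENCODE az64,
    customer_id   BIGINT         ENCODE az64,
    sale_date     DATE           ENCODE az64,
    amount        DECIMAL(12,2)  ENCODE az64,
    source_file   VARCHAR(256)   ENCODE lzo
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",  # assumed cluster identifier
    Database="dev",                         # assumed database name
    SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-admin",  # placeholder ARN
    Sql=ddl,
)
```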

Create a secret in Secrets Manager

Task: Create a new secret to store the Amazon Redshift sign-in credentials in Secrets Manager.
Description: This secret stores the credentials for the admin user as well as individual database service users. For instructions, see the Secrets Manager documentation (https://docs.aws.amazon.com/secretsmanager/latest/userguide/manage_create-basic-secret.html). Choose Amazon Redshift Cluster as the secret type. Additionally, on the Secret rotation page, turn on rotation. This will create the appropriate user in the Amazon Redshift cluster and will rotate the key secrets at defined intervals. A sketch that retrieves this secret from code follows this table.
Skills required: Data engineer

Task: Create an IAM policy to restrict Secrets Manager access.
Description: Restrict Secrets Manager access to only Amazon Redshift administrators and AWS Glue.
Skills required: Data engineer
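
For illustration, this sketch retrieves the Redshift credentials from the secret at runtime (for example, inside an AWS Glue Python shell job) instead of hardcoding them. The secret name is a placeholder; the JSON keys shown are the ones Secrets Manager uses for Amazon Redshift cluster secrets.

```python
import json
import boto3

# Sketch: fetch the Amazon Redshift credentials at runtime.
# "redshift-admin" is a placeholder secret name.
secrets = boto3.client("secretsmanager")
response = secrets.get_secret_value(SecretId="redshift-admin")
credentials = json.loads(response["SecretString"])

# Amazon Redshift cluster secrets store these keys, among others
host = credentials["host"]
port = credentials["port"]
user = credentials["username"]
password = credentials["password"]
```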

Configure AWS Glue


Task: In the AWS Glue Data Catalog, add a connection for Amazon Redshift.
Description: For instructions, see the AWS Glue documentation (https://docs.aws.amazon.com/glue/latest/dg/console-connections.html).
Skills required: Data engineer

Task: Create and attach an IAM service role for AWS Glue to access Secrets Manager, Amazon Redshift, and S3 buckets.
Description: For more information, see the AWS Glue documentation (https://docs.aws.amazon.com/glue/latest/dg/create-service-policy.html).
Skills required: Data engineer

Task: Define the AWS Glue Data Catalog for the source.
Description: This step involves creating a database and the required tables in the AWS Glue Data Catalog. You can either use a crawler to catalog the tables in the AWS Glue database, or define them as Amazon Athena external tables. You can also access the external tables defined in Athena through the AWS Glue Data Catalog. See the AWS documentation for more information about defining the Data Catalog (https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html) and creating an external table (https://docs.aws.amazon.com/athena/latest/ug/creating-tables.html) in Athena.
Skills required: Data engineer

Task: Create an AWS Glue job to process source data.
Description: The AWS Glue job can be a Python shell or PySpark job that standardizes, deduplicates, and cleanses the source data files. To optimize performance and avoid querying the entire S3 source bucket, partition the S3 bucket by date, broken down by year, month, day, and hour, and pass the partition to the AWS Glue job as a pushdown predicate. For more information, see the AWS Glue documentation (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html). Load the processed and transformed data into the processed S3 bucket partitions in Parquet format, where it can be queried from Athena. A PySpark sketch of this job follows this table.
Skills required: Data engineer

Task: Create an AWS Glue job to load data into Amazon Redshift.
Description: The AWS Glue job can be a Python shell or PySpark job that loads the data by upserting it, followed by a complete refresh. For details, see the AWS Glue documentation (https://docs.aws.amazon.com/glue/latest/dg/author-job.html) and the Additional information section.
Skills required: Data engineer

Task: (Optional) Schedule AWS Glue jobs by using triggers as necessary.
Description: The incremental data load is primarily driven by an Amazon S3 event that causes an AWS Lambda function to call the AWS Glue job. Use AWS Glue trigger-based scheduling for any data loads that demand time-based instead of event-based scheduling.
Skills required: Data engineer
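
To make the processing job concrete, here is a minimal PySpark sketch that reads only the partition passed in by the Lambda function, applies it as a pushdown predicate, deduplicates the rows, and writes Parquet to the processed bucket. The "s3_prefix" job argument, the Data Catalog names, and the bucket name are assumptions for illustration.

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Sketch of the processing job. "s3_prefix" is a hypothetical job argument
# passed by the Lambda function, for example "source-bucket/2023/01/15/09".
args = getResolvedOptions(sys.argv, ["JOB_NAME", "s3_prefix"])
year, month, day, hour = args["s3_prefix"].rstrip("/").split("/")[-4:]

glue_context = GlueContext(SparkContext.getOrCreate())

# Read only the partition for this load by using a pushdown predicate,
# so the job never scans the entire source bucket.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="source_db",        # assumed Data Catalog database
    table_name="source_table",   # assumed Data Catalog table
    push_down_predicate=(
        f"year='{year}' and month='{month}' and day='{day}' and hour='{hour}'"
    ),
)

# Standardize and deduplicate with ordinary Spark operations
clean_df = dyf.toDF().dropDuplicates()

# Write the cleansed data to the processed bucket as Parquet, keyed by partition
(
    clean_df.write
    .mode("append")
    .parquet(f"s3://source-processed-bucket/{year}/{month}/{day}/{hour}/")
)
```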

Create a Lambda function


Task: Create and attach an IAM service-linked role for AWS Lambda to access S3 buckets and the AWS Glue job.
Description: Create an IAM service-linked role for AWS Lambda with a policy to read Amazon S3 objects and buckets, and a policy to access the AWS Glue API to start an AWS Glue job. For more information, see the AWS Knowledge Center (http://aws.amazon.com/premiumsupport/knowledge-center/lambda-execution-role-s3-bucket/).
Skills required: Data engineer

Task: Create a Lambda function to run the AWS Glue job based on the defined Amazon S3 event.
Description: The Lambda function should be initiated by the creation of the Amazon S3 manifest file. The function should pass the Amazon S3 folder location (for example, source_bucket/year/month/date/hour) to the AWS Glue job as a parameter. The AWS Glue job will use this parameter as a pushdown predicate to optimize file access and job processing performance. For more information, see the AWS Glue documentation (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html). A minimal handler sketch follows this table.
Skills required: Data engineer

Task: Create an Amazon S3 PUT object event to detect object creation, and call the respective Lambda function.
Description: The Amazon S3 PUT object event should be initiated only by the creation of the manifest file. The manifest file controls the Lambda function and the AWS Glue job concurrency, and processes the load as a batch instead of processing individual files that arrive in a specific partition of the S3 source bucket. For more information, see the Lambda documentation (https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html).
Skills required: Data engineer
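
A minimal handler sketch for the Lambda function described above might look like the following. The Glue job name and the "--s3_prefix" argument are assumptions carried over from the earlier sketches, not names defined by this pattern.

```python
import json
import urllib.parse

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # The S3 PUT event fires when the manifest file is created
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Derive the partition prefix (year/month/date/hour) from the manifest key
    prefix = "/".join(key.split("/")[:-1])

    # Start the AWS Glue job, passing the prefix so the job can build its
    # pushdown predicate ("process-source-data" is an assumed job name)
    run = glue.start_job_run(
        JobName="process-source-data",
        Arguments={"--s3_prefix": f"{bucket}/{prefix}"},
    )
    return {"statusCode": 200, "body": json.dumps(run["JobRunId"])}
```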

Related resources for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue

Amazon S3 documentation
(https://docs.aws.amazon.com/AmazonS3/latest/gsg/GetStartedWithS3.html)

AWS Glue documentation


(https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html)

Amazon Redshift documentation


(https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html)

AWS Lambda (http://aws.amazon.com/lambda/)

Amazon Athena (http://aws.amazon.com/athena/)

AWS Secrets Manager (http://aws.amazon.com/secrets-manager/)


Additional information for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue
Detailed approach for upsert and complete refresh

Upsert: This is for datasets that require historical aggregation, depending on the business use case. Follow one of the approaches described in Updating and inserting new data (https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html) (Amazon Redshift documentation) based on your business needs.

Complete refresh: This is for small datasets that don't need historical
aggregations. Follow one of these approaches:

1. Truncate the Amazon Redshift table.


2. Load the current partition from the staging area.

or:

1. Create a temporary table with current partition data.


2. Drop the target Amazon Redshift table.
3. Rename the temporary table to the target table.
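
To make both approaches concrete, here is a hedged sketch that runs the corresponding SQL through the Amazon Redshift Data API. The target table name, join key, S3 partition path, IAM role, cluster identifier, and secret ARN are all placeholders; the upsert follows the staging-table merge pattern from the Amazon Redshift documentation linked above.

```python
import boto3

redshift_data = boto3.client("redshift-data")

COPY_OPTIONS = (
    "IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role' "  # placeholder role
    "FORMAT AS PARQUET"
)
PARTITION = "s3://source-processed-bucket/2023/01/15/09/"  # placeholder partition path

# Upsert (merge) via a staging table: delete matching rows, then insert.
upsert_sqls = [
    "CREATE TEMP TABLE stage (LIKE target_table);",
    f"COPY stage FROM '{PARTITION}' {COPY_OPTIONS};",
    "DELETE FROM target_table USING stage WHERE target_table.id = stage.id;",
    "INSERT INTO target_table SELECT * FROM stage;",
    "DROP TABLE stage;",
]

# Complete refresh: truncate the table and reload the current partition.
refresh_sqls = [
    "TRUNCATE target_table;",
    f"COPY target_table FROM '{PARTITION}' {COPY_OPTIONS};",
]

# batch_execute_statement runs the statements in order as one transaction
redshift_data.batch_execute_statement(
    ClusterIdentifier="analytics-cluster",  # assumed cluster identifier
    Database="dev",                         # assumed database name
    SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-admin",  # placeholder
    Sqls=upsert_sqls,                       # or refresh_sqls for the complete refresh
)
```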

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
