Build An ETL Service Pipeline To Load Data Incrementally From Amazon S3 To Amazon Redshift Using AWS Glue - AWS Prescriptive Guidance
The source files in Amazon S3 can have different formats, including comma-
separated values (CSV), XML, and JSON files. This pattern describes how you
can use AWS Glue to convert the source files into a cost-optimized and
performance-optimized format like Apache Parquet. You can query Parquet
files directly from Amazon Athena and Amazon Redshift Spectrum. You can
also load Parquet files into Amazon Redshift, aggregate them, and share the
aggregated data with consumers, or visualize the data by using Amazon
QuickSight.
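The conversion step described above could look roughly like the following AWS Glue (PySpark) sketch. It is an illustration under stated assumptions, not the pattern's actual job script: the function names, the paths you pass in, and the XML row tag `record` are hypothetical, and the `awsglue` and `pyspark` modules are available only inside the AWS Glue runtime, so they are imported lazily.

```python
def reader_settings(source_format: str) -> dict:
    """Map a source file format to the Glue reader format and its options."""
    settings = {
        "csv": {"format": "csv", "format_options": {"withHeader": True}},
        "json": {"format": "json", "format_options": {}},
        # "record" is a hypothetical row tag; use the element that wraps one row.
        "xml": {"format": "xml", "format_options": {"rowTag": "record"}},
    }
    try:
        return settings[source_format.lower()]
    except KeyError:
        raise ValueError(f"unsupported source format: {source_format}") from None


def convert_to_parquet(source_path: str, target_path: str, source_format: str) -> None:
    """Read source files from Amazon S3 and write them back as Parquet."""
    # Imported here because these modules exist only in the AWS Glue runtime.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    settings = reader_settings(source_format)

    # Load the raw files into a DynamicFrame.
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [source_path]},
        format=settings["format"],
        format_options=settings["format_options"],
    )

    # Write the same records to the target prefix in Parquet format.
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": target_path},
        format="parquet",
    )
```

Inside a Glue job, a call such as `convert_to_parquet("s3://source-bucket/raw/", "s3://target-bucket/parquet/", "csv")` would perform the conversion; both bucket names are placeholders.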
Assumptions
The CSV, XML, or JSON source files are already loaded into Amazon S3
and are accessible from the account where AWS Glue and Amazon
Redshift are configured.
Best practices for loading the files, splitting the files, compression, and
using a manifest are followed, as discussed in the Amazon Redshift
documentation (https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html).
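As an illustration of the manifest mentioned above, the following minimal sketch builds a manifest document for the Amazon Redshift COPY command from a list of object keys. The function name, bucket, and keys are hypothetical; only the `entries`, `url`, and `mandatory` fields are taken from the manifest format.

```python
import json


def build_manifest(bucket: str, keys: list[str], mandatory: bool = True) -> str:
    """Return a COPY manifest (as JSON text) listing the given S3 objects."""
    entries = [
        # With "mandatory" set to true, COPY fails if the file is missing.
        {"url": f"s3://{bucket}/{key}", "mandatory": mandatory}
        for key in keys
    ]
    return json.dumps({"entries": entries}, indent=2)


if __name__ == "__main__":
    # Placeholder bucket and keys, for demonstration only.
    print(build_manifest("example-bucket", ["load/part-0000.gz", "load/part-0001.gz"]))
```

The resulting text would be uploaded to Amazon S3 and referenced by a COPY statement that uses the MANIFEST option, so that exactly the listed files are loaded.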
Limitations
The file formats are limited to those that are currently supported by AWS
Glue (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html).
Target architecture
Data flow
Tools for building an ETL pipeline from Amazon S3 to Amazon Redshift using AWS Glue
Amazon S3 (http://aws.amazon.com/s3/) – Amazon Simple Storage
Service (Amazon S3) is a highly scalable object storage service. Amazon
S3 can be used for a wide range of storage solutions, including websites,
mobile applications, backups, and data lakes.
AWS Lambda (http://aws.amazon.com/lambda/) – AWS Lambda lets you
run code without provisioning or managing servers. AWS Lambda is an
event-driven service; you can set up your code to be invoked
automatically by other AWS services.
Amazon Redshift (http://aws.amazon.com/redshift/) – Amazon Redshift is
a fully managed, petabyte-scale data warehouse service. With Amazon
Redshift, you can query petabytes of structured and semi-structured
data across your data warehouse and your data lake using standard SQL.
AWS Glue (http://aws.amazon.com/glue/) – AWS Glue is a fully managed
ETL service that makes it easier to prepare and load data for analytics.
AWS Glue discovers your data and stores the associated metadata (for
example, table definitions and schema) in the AWS Glue Data Catalog.
Your cataloged data is immediately searchable, can be queried, and is
available for ETL.
AWS Secrets Manager (http://aws.amazon.com/secrets-manager/) – AWS
Secrets Manager facilitates protection and central management of
secrets needed for application or service access. The service stores
database credentials, API keys, and other secrets, and eliminates the
need to hardcode sensitive information in plaintext format. Secrets
Manager also offers secret rotation to meet security and compliance needs.
It has built-in integration for Amazon Redshift, Amazon Relational
Database Service (Amazon RDS), and Amazon DocumentDB. You can
store and centrally manage secrets by using the Secrets Manager
console, the command-line interface (CLI), or Secrets Manager API and
SDKs.
Amazon Athena (http://aws.amazon.com/athena/) – Amazon Athena is an
interactive query service that makes it easy to analyze data that's stored
in Amazon S3. Athena is serverless and integrated with AWS Glue, so it
can directly query the data that's cataloged using AWS Glue. Athena is
elastically scaled to deliver interactive query performance.
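To show how some of these tools might fit together, here is a minimal sketch of an AWS Lambda handler that reacts to an Amazon S3 event notification by starting an AWS Glue job, plus a helper that reads Amazon Redshift credentials from AWS Secrets Manager. The job name, the `--source_path` job argument, and the assumption that the secret is stored as a JSON string are all hypothetical. The clients are passed in as optional parameters so the logic can be exercised without AWS access.

```python
import json
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "s3-to-redshift-incremental"  # hypothetical job name


def new_objects(event: dict) -> list[str]:
    """Return s3:// URIs for the objects named in an S3 event notification."""
    uris = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        uris.append(f"s3://{bucket}/{key}")
    return uris


def handler(event, context, glue_client=None):
    """Lambda entry point: start one Glue job run per newly arrived object."""
    if glue_client is None:
        import boto3  # available in the Lambda Python runtime

        glue_client = boto3.client("glue")

    run_ids = []
    for uri in new_objects(event):
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--source_path": uri},  # hypothetical job argument
        )
        run_ids.append(response["JobRunId"])
    return {"started": run_ids}


def redshift_credentials(secret_id: str, secrets_client=None) -> dict:
    """Fetch a secret and parse it, assuming it is stored as a JSON string."""
    if secrets_client is None:
        import boto3

        secrets_client = boto3.client("secretsmanager")

    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])
```

Starting one job run per object keeps the sketch short; a real pipeline might batch several objects into a single run instead.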
Task  Description  Skills required
and attributes.
Complete refresh: This is for small datasets that don't need historical
aggregations. Follow one of these approaches:
or:
© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.