AWS Cloud Data Ingestion Patterns and Practices
© 2021 Amazon Web Services, Inc. or its affiliates. All rights reserved.
Contents
Introduction
Data ingestion patterns
Homogeneous data ingestion patterns
Homogeneous relational data ingestion
Homogeneous data files ingestion
Heterogeneous data ingestion patterns
Heterogeneous data files ingestion
Streaming data ingestion
Relational data ingestion
Conclusion
Contributors
Further reading
Document history
Abstract
Today, many organizations want to gain further insight using the vast amount of data
they generate or have access to. They may want to perform reporting, analytics and/or
machine learning on that data and further integrate the results with other applications
across the organization. More and more organizations have found it challenging to meet
their needs with traditional on-premises data analytics solutions, and are looking at
modernizing their data and analytics infrastructure by moving to the cloud. However, before
they can start analyzing the data, they need to ingest it into the cloud and use the
right tool for the right job. This paper outlines the patterns, practices, and tools used by
AWS customers when ingesting data into the AWS Cloud using AWS services.
Introduction
As companies deal with a massive surge in data being generated, collected, and
stored to support their business needs, users expect faster access to data so they can
make better decisions quickly as changes occur. Such agility requires that they
integrate terabytes to petabytes and sometimes exabytes of data, along with the data
that was previously siloed, in order to get a complete view of their customers and
business operations.
To analyze these vast amounts of data and to cater to their end user’s needs, many
companies create solutions like data lakes and data warehouses. They also have
purpose-built data stores to cater to specific business applications and use cases—for
example, relational database systems to cater to transactional systems for structured
data or technologies like Elasticsearch to perform log analytics and search operations.
As customers use these data lakes and purpose-built stores, they often need to also
move data between these systems. For example, moving data from the lake to purpose-
built stores, from those stores to the lake, and between purpose-built stores.
At re:Invent 2020, we walked through a new, modern approach called the Lake House
architecture.
As data in these data lakes and purpose-built stores continues to grow, it becomes
harder to move all this data around. We call this data gravity.
Data movement is a critical part of any system. To design a data ingestion pipeline, it is
important to understand the requirements of data ingestion and choose the appropriate
approach which meets performance, latency, scale, security, and governance needs.
This whitepaper provides the patterns, practices and tools to consider in order to arrive
at the most appropriate approach for data ingestion needs, with a focus on ingesting
data from outside AWS to the AWS Cloud. This whitepaper is not a programming guide
to handle data ingestion but is rather intended to be a guide for architects to understand
the options available and provide guidance on what to consider when designing a data
ingestion platform. It also guides you through the various tools that are available to meet
your data ingestion needs.
The whitepaper is organized into several high-level sections which highlight the
common patterns in the industry. For example, homogeneous data ingestion is a pattern
where data is moved between similar data storage systems, like Microsoft SQL Server
to Microsoft SQL Server or similar formats like Parquet to Parquet. This paper further
breaks down different use cases for each pattern—for example, migration from an on-
premises system to the AWS Cloud, scaling in the cloud for read-only workloads and
reporting, or performing change data capture for continuous data ingestion into the
analytics workflow.
Depending upon the current architecture and target Lake House architecture, there are
certain common ingestion patterns that can be observed.
Homogeneous data ingestion patterns: These are patterns where the primary
objective is to move the data into the destination in the same format or same storage
engine as it is in the source. In these patterns, your primary objectives may be speed of
data transfer, data protection (encryption in transit and at rest), preserving data
integrity, and automating ingestion where continuous ingestion is required. These patterns
usually fall under the extract and load (EL) portion of extract, transform, load (ETL), and
can be an intermediary step before transformations are applied after ingestion.
This paper covers the following use cases for this pattern:
• Relational data ingestion between same data engines (for example, Microsoft
SQL Server to Amazon RDS for SQL Server or SQL Server on Amazon EC2, or
Oracle to Amazon RDS for Oracle). This use case can apply to migrating your
peripheral workload into the AWS Cloud or for scaling your workload to expand
on new requirements like reporting.
• Data files ingestion from on-premises storage to an AWS Cloud data lake (for
example, ingesting Parquet files from Apache Hadoop to Amazon Simple Storage
Service (Amazon S3), or ingesting CSV files from a file share to Amazon S3).
This use case may be a one-time migration of your big data solutions, or may
apply to building a new data lake capability in the cloud.
• Large objects (BLOB, photos, videos) ingestion into Amazon S3 object storage.
Heterogeneous data ingestion patterns: These are patterns where data must be
transformed as it is ingested into the destination data storage system. These
transformations can be simple, like changing the data type or format of the data to meet the
destination's requirements, or as complex as running machine learning to derive
new attributes in the data. This pattern is usually where data engineers and ETL
developers spend most of their time to cleanse, standardize, format, and shape the data
based on the business and technology requirements. As such, it follows a traditional
ETL model. In this pattern, you may be integrating data from multiple sources and may
have a complex step of applying transformations. The primary objectives here are the same
as in homogeneous data ingestion, with the added objective of meeting the business
and technology requirements to transform the data.
This paper covers the following use cases for this pattern:
• Relational data ingestion between different data engines (for example, Microsoft
SQL Server to Amazon Aurora relational database or Oracle to Amazon RDS for
MySQL).
• Streaming data ingestion from data sources like Internet of Things (IoT) devices
or log files to a central data lake or peripheral data storage.
• Relational data source to non-relational data destination and vice versa (for
example, Amazon DocumentDB solution to Amazon Redshift or MySQL to
Amazon DynamoDB).
• File format transformations while ingesting data files (for example, changing CSV
format files on file share to Parquet on Amazon S3).
The tools that can be used in each of the preceding patterns depend upon your use
case. In many cases, the same tool can be used to meet multiple use cases. Ultimately,
the decision on using the right tool for the right job will depend upon your overall
requirements for data ingestion in the Lake House architecture. An important aspect of
your tooling will also be workflow scheduling and automation.
Homogeneous data ingestion patterns
Homogeneous relational data ingestion
A common use case for this pattern is continuous ingestion of relational data to keep a
copy of the data in the cloud for reporting, analytics, or integration purposes.
Some of the most common challenges when migrating relational data from on-premises
environments are the size of the data, the available network bandwidth between on-
premises and the cloud, and downtime requirements. With the broad range of tools and
options available to solve these challenges, it can be difficult for a migration team to sift
through these options. This section addresses ways to overcome those challenges.
In many cases, you can ingest data into a database in the AWS Cloud much as you would
in-house, and use the same tools that were used while moving data from on-premises to
on-premises. The exception to this scenario is network connectivity between on-premises
and the AWS Cloud. Because you cannot access the operating system (OS) layer of the
server hosting the Amazon RDS instance, not all data ingestion tools will work as is.
Therefore, the ingestion toolset must be selected carefully. The following section
addresses these ingestion tools.
• Oracle Export and Import utilities help you migrate databases that are smaller
than 10 GB and don’t include binary float and double data types. The import
process creates the schema objects, so you don’t have to run a script to create
them beforehand. This makes the process well-suited for databases that have a
large number of small tables. You can use this tool for both Amazon RDS for
Oracle and Oracle databases on Amazon EC2.
• Oracle GoldenGate is a tool for replicating data between a source database and
one or more destination databases with minimal downtime. You can use it to
build high availability architectures, and to perform real-time data integration,
transactional change data capture (CDC), replication in heterogeneous
environments, and continuous data replication. There is no limit to the source
database size.
• Database Mirroring - You can use database mirroring to set up a hybrid cloud
environment for your SQL Server databases. This option requires SQL Server
Enterprise edition. In this scenario, your principal SQL Server database runs on-
premises, and you create a warm standby solution in the cloud. You replicate
your data asynchronously, and perform a manual failover when you’re ready for
cutover.
• AWS Snowball Edge - You can use AWS Snowball Edge to migrate very large
databases (up to 80 TB in size). Any type of source relational database can use
this method when the database size is very large. Snowball has a 10 Gb
Ethernet port that you plug into your on-premises server and place all database
backups or data on the Snowball device. After the data is copied to Snowball,
you send the appliance to AWS for placement in your designated S3 bucket.
Data copied to Snowball Edge is automatically encrypted. You can then
download the backups from Amazon S3 and restore them on SQL Server on an
Amazon EC2 instance, or run the rds_restore_database stored procedure to
restore the database to Amazon RDS. You can also use AWS Snowcone for
databases up to 8 TB in size.
• AWS DMS – Apart from the preceding tools, AWS Database Migration Service (AWS
DMS) can be used for the initial bulk load and for ongoing change data capture as well.
For more prescriptive guidance, see Homogeneous database migration for SQL Server.
The native tools allow you to migrate your system with minimal downtime.
The pg_dump utility uses the COPY command to create a schema and data
dump of a PostgreSQL database. The dump script generated by pg_dump loads
data into a database with the same name and recreates the tables, indexes, and
foreign keys. You can use the pg_restore command and the -d parameter to
restore the data to a database with a different name.
For more prescriptive guidance, see Homogeneous database migration for PostgreSQL
Server.
• AWS DMS – You can also use AWS DMS for the initial ingestion.
• For Oracle, AWS DMS uses either the Oracle LogMiner API or binary reader API
(bfile API) to read ongoing changes. AWS DMS reads ongoing changes from
the online or archive redo logs based on the system change number (SCN).
• For Microsoft SQL Server, AWS DMS uses MS-Replication or MS-CDC to write
information to the SQL Server transaction log. It then uses
the fn_dblog() or fn_dump_dblog() function in SQL Server to read the
changes in the transaction log based on the log sequence number (LSN).
• For MySQL, AWS DMS reads changes from the row-based binary logs (binlogs)
and migrates those changes to the target.
• For PostgreSQL, AWS DMS sets up logical replication slots and uses the
test_decoding plugin to read changes from the source and migrate them to the
target.
Amazon RDS Read Replicas provide enhanced performance and durability for Amazon
RDS database (DB) instances. They make it easy to elastically scale out beyond the
capacity constraints of a single DB instance for read-heavy database workloads. You
can create one or more replicas of a given source DB Instance and serve high-volume
application read traffic from multiple copies of your data, thereby increasing aggregate
read throughput. Read replicas can also be promoted when needed to become
standalone DB instances. Read replicas are available in Amazon RDS for MySQL,
MariaDB, PostgreSQL, Oracle, and MS SQL Server as well as for Amazon Aurora.
For the MySQL, MariaDB, PostgreSQL, Oracle, and MS SQL Server database engines,
Amazon RDS creates a second DB instance using a snapshot of the source DB
instance. It then uses the engines' native asynchronous replication to update the read
replica whenever there is a change to the source DB instance. The read replica
operates as a DB instance that allows only read-only connections; applications can
connect to a read replica just as they would to any DB instance. Amazon RDS
replicates all databases in the source DB instance. Check vendor-specific licensing
before using the read replica feature in Amazon RDS.
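As an illustration, the following minimal boto3 (AWS SDK for Python) sketch creates a read replica of an existing source DB instance; the instance identifiers and instance class are hypothetical.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Create a read replica of an existing source DB instance.
# "orders-db" and "orders-db-replica-1" are hypothetical identifiers.
response = rds.create_db_instance_read_replica(
    DBInstanceIdentifier="orders-db-replica-1",
    SourceDBInstanceIdentifier="orders-db",
    DBInstanceClass="db.r5.large",      # a replica can use a different instance class
    PubliclyAccessible=False,
)
print(response["DBInstance"]["DBInstanceStatus"])
```

Once the replica is available, read-heavy reporting or analytics traffic can be pointed at the replica endpoint while the source instance continues to serve writes.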
Homogeneous data files ingestion
This section covers the pattern where the format of the files is not changed from its
source, nor is any transformation applied when the files are being ingested. This
scenario is a fairly common outside-in data movement pattern. This pattern is used for
populating a so-called landing area in the data lake where all of the original copies of
the ingested data are kept.
If the ingestion must be done only one time, which could be the use case for migrating an
on-premises big data or analytics system to AWS or for a one-off bulk ingestion job, then
depending upon the size of the data, the data can be ingested over the wire or by using
the Snow Family of devices.
These options are used when no transformations are required while ingesting the data.
If you determine that the bandwidth you have over the wire is sufficient and the data size
is manageable enough to meet your SLAs, then the following set of tools can be considered
for one-time or continuous ingestion of files. Note that you need to have the right
connectivity set up between your data center and AWS using options like AWS Direct Connect. For
all available connectivity options, see the Amazon Virtual Private Cloud Connectivity
Options whitepaper.
The following tools can cater to both one-time and continuous file ingestion needs, with
some tools catering to other use cases around backup or disaster recovery (DR) as
well.
File Gateway can be deployed either as a virtual or hardware appliance. It enables you to
present a file share in your on-premises environment, using protocols such as Network File
System (NFS) or Server Message Block (SMB), that interfaces with Amazon S3. File Gateway
preserves file metadata, such as permissions and timestamps, for the objects it stores in
Amazon S3.
For details on how AWS DataSync can be used to transfer files from on-premises to
AWS securely, see AWS re:Invent recap: Quick and secure data migrations using AWS
DataSync. This blog post also references the re:Invent session, which details aspects
around end-to-end validation, in-flight encryption, scheduling, filtering, and more.
DataSync provides built-in security capabilities such as encryption of data in-transit, and
data integrity verification in-transit and at-rest. It optimizes use of network bandwidth,
and automatically recovers from network connectivity failures. In addition, DataSync
provides control and monitoring capabilities such as data transfer scheduling and
granular visibility into the transfer process through Amazon CloudWatch metrics, logs,
and events.
DataSync can copy data between Network File System (NFS) shares, Server Message
Block (SMB) shares, self-managed object storage and Amazon Simple Storage Service
(Amazon S3) buckets.
It’s simple to use. You deploy the DataSync agent as a virtual appliance in a network
that has access to AWS, where you define your source, target, and transfer options per
transfer task. It allows for simplified data transfers from your SMB and NFS file shares,
and self-managed object storage, directly to any of the Amazon S3 storage classes. It
also supports Amazon Elastic File System (Amazon EFS) and Amazon FSx for
Windows File Server for data movement of file data, where it can preserve the file and
folder attributes.
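For example, the following boto3 sketch defines an NFS source location (reached through a previously deployed DataSync agent), an S3 destination location, and a transfer task, and then starts a task execution. The hostnames, ARNs, and bucket names are hypothetical placeholders.

```python
import boto3

datasync = boto3.client("datasync", region_name="us-east-1")

# Source: an on-premises NFS share reached through a deployed DataSync agent.
nfs = datasync.create_location_nfs(
    ServerHostname="files.corp.example.com",        # hypothetical on-premises host
    Subdirectory="/exports/landing",
    OnPremConfig={"AgentArns": ["arn:aws:datasync:us-east-1:111122223333:agent/agent-0123456789abcdef0"]},
)

# Destination: an S3 bucket serving as the data lake landing area.
s3 = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-data-lake-landing",
    Subdirectory="/raw",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)

# Create a transfer task with integrity verification enabled, then run it.
task = datasync.create_task(
    SourceLocationArn=nfs["LocationArn"],
    DestinationLocationArn=s3["LocationArn"],
    Name="nfs-to-s3-landing",
    Options={"VerifyMode": "ONLY_FILES_TRANSFERRED"},
)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```

For continuous ingestion, the same task can be run on a schedule instead of being started manually.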
The AWS Transfer Family supports common user authentication systems, including
Microsoft Active Directory and Lightweight Directory Access Protocol (LDAP).
Alternatively, you can also choose to store and manage users’ credentials directly within
the service. By connecting your existing identity provider to the AWS Transfer Family
service, you ensure that your external users continue to have the correct, secure level of
access to your data resources without disruption.
The AWS Transfer Family enables seamless migration by allowing you to import host
keys, use static IP addresses, and use existing hostnames for your servers. With these
features, user scripts and applications that use your existing file transfer systems
continue working without changes.
AWS Transfer Family is simple to use. You can deploy an AWS Transfer for SFTP
endpoint in minutes with a few clicks. Using the service, you simply consume a
managed file transfer endpoint from the AWS Transfer Family, configure your users and
then configure an Amazon S3 bucket to use as your storage location.
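As a rough sketch of those steps with boto3, the following calls create a service-managed SFTP endpoint and map a user to an S3 home directory; the role ARN, bucket path, user name, and key material are hypothetical.

```python
import boto3

transfer = boto3.client("transfer", region_name="us-east-1")

# Create a service-managed SFTP endpoint.
server = transfer.create_server(
    Protocols=["SFTP"],
    IdentityProviderType="SERVICE_MANAGED",
    EndpointType="PUBLIC",
)

# Map a user to an S3 home directory; names and ARNs are placeholders.
transfer.create_user(
    ServerId=server["ServerId"],
    UserName="partner-upload",
    Role="arn:aws:iam::111122223333:role/TransferS3AccessRole",
    HomeDirectory="/example-data-lake-landing/partner-upload",
    SshPublicKeyBody="ssh-rsa AAAA...",   # the partner's SSH public key
)
```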
For continuous file ingestion, you can use a scheduler, such as cron jobs.
The AWS Snow Family, comprising AWS Snowcone, AWS Snowball, and AWS
Snowmobile, offers a number of physical devices and capacity points. For a feature
comparison of each device, see AWS Snow Family. When choosing a particular
member of the AWS Snow Family, it is important to consider not only the data size but
also the supported network interfaces and speeds, device size, portability, job lifetime,
and supported APIs. Snow Family devices are owned and managed by AWS, and integrate
with AWS security, monitoring, storage management, and computing capabilities.
Heterogeneous data ingestion patterns
Heterogeneous data files ingestion
You can build event-driven pipelines for ETL with AWS Glue ETL. See the following
example.
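For instance, a small AWS Lambda function subscribed to S3 object-created events can start an AWS Glue job for each new file, as in the illustrative sketch below; the job name and argument key are hypothetical.

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Start a Glue job for each object that lands in the ingestion bucket."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="ingest-and-convert",                       # hypothetical Glue job name
            Arguments={"--source_path": f"s3://{bucket}/{key}"},  # passed to the job as a parameter
        )
    return {"status": "started"}
```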
You can use AWS Glue as a managed ETL tool to connect to your data centers for
ingesting data from files while transforming data and then load the data into your data
storage of choice in AWS (for example, Amazon S3 data lake storage or Amazon
Redshift). For details on how to set up AWS Glue in a hybrid environment when you are
ingesting data from on-premises data centers, see How to access and analyze on-
premises data stores using AWS Glue.
AWS Glue supports various format options for files both as input and as output. These
formats include avro, csv, ion, orc, and more. For a complete list of supported formats,
see AWS Glue programming ETL format.
AWS Glue provides various connectors to connect to different source and destination
targets. For a reference of all connectors and their usage as source or sink, see
Connection Types and Options for ETL in AWS Glue.
AWS Glue supports Python and Scala for programming your ETL. As part of the
transformation, AWS Glue provides various transform classes for programming with
both PySpark and Scala.
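The following illustrative PySpark script for an AWS Glue job reads CSV files from a source path, applies the ApplyMapping transform, and writes Parquet to a target path; the column names and job parameters are hypothetical.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_path"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read CSV files from the source path.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [args["source_path"]]},
    format="csv",
    format_options={"withHeader": True},
)

# Rename and cast columns with the ApplyMapping transform (columns are hypothetical).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "long"),
              ("order_ts", "string", "order_ts", "timestamp")],
)

# Write the result to the data lake as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": args["target_path"]},
    format="parquet",
)
job.commit()
```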
You can use AWS Glue to meet the most complex data ingestion needs for your Lake
House architecture. Most of these ingestion workloads must be automated in
enterprises and can follow a complex workflow. You can use Workflows in AWS Glue to
achieve orchestration of AWS Glue workloads. For more complex workflow
orchestration and automation, use AWS Data Pipeline.
Streaming data ingestion
Information derived from such analysis gives companies visibility into many aspects of
their business and customer activity—such as service usage (for metering/billing),
server activity, website clicks, and geo-location of devices, people, and physical
goods—and enables them to respond promptly to emerging situations. For example,
businesses can track changes in public sentiment on their brands and products by
continuously analyzing social media streams and respond in a timely fashion as the
necessity arises.
AWS provides several options to work with streaming data. You can take advantage of
the managed streaming data services offered by Amazon Kinesis or deploy and
manage your own streaming data solution in the cloud on Amazon EC2.
AWS offers streaming and analytics managed services such as Amazon Kinesis Data
Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Data Analytics, and Amazon
Managed Streaming for Apache Kafka (Amazon MSK).
In addition, you can run other streaming data platforms, such as Apache Flume, Apache
Spark Streaming, and Apache Storm, on Amazon EC2 and Amazon EMR.
Amazon Kinesis Data Firehose is a fully managed service for delivering real-time
streaming data directly to Amazon S3. Kinesis Data Firehose automatically scales to
match the volume and throughput of streaming data, and requires no ongoing
administration. Kinesis Data Firehose can also be configured to transform streaming
data before it’s stored in Amazon S3. Its transformation capabilities include
compression, encryption, data batching, and AWS Lambda functions.
Kinesis Data Firehose can compress data before it’s stored in Amazon S3. It currently
supports GZIP, ZIP, and SNAPPY compression formats. GZIP is the preferred format
because it can be used by Amazon Athena, Amazon EMR, and Amazon Redshift.
Kinesis Data Firehose encryption supports Amazon S3 server-side encryption with AWS
Key Management Service (AWS KMS) for encrypting delivered data in Amazon S3. You
can choose not to encrypt the data or to encrypt with a key from the list of AWS KMS
keys that you own (see Encryption with AWS KMS). Kinesis Data Firehose can
concatenate multiple incoming records, and then deliver them to Amazon S3 as a single
S3 object. This is an important capability because it reduces Amazon S3 transaction
costs and transactions per second load.
Finally, Kinesis Data Firehose can invoke AWS Lambda functions to transform incoming
source data and deliver it to Amazon S3. Common transformation functions include
transforming Apache Log and Syslog formats to standardized JSON and/or CSV
formats. The JSON and CSV formats can then be directly queried using Amazon
Athena. If using a Lambda data transformation, you can optionally back up raw source
data to another S3 bucket, as shown in the following figure.
Figure 7: Delivering real-time streaming data with Amazon Kinesis Data Firehose to Amazon S3
with optional backup
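A transformation function of this kind follows the Kinesis Data Firehose record-processing contract: it receives base64-encoded records and returns them with a result status. The sketch below assumes a hypothetical space-delimited log layout and simply re-emits each record as JSON.

```python
import base64
import json

def lambda_handler(event, context):
    """Convert space-delimited log lines into JSON before Firehose delivers them."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8")
        # Hypothetical log layout: "<timestamp> <status> <path>"
        parts = raw.strip().split(" ", 2)
        transformed = json.dumps(
            {"timestamp": parts[0], "status": parts[1], "path": parts[2]}
        ) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```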
When calling the Kinesis Data Firehose API (the PutRecord and PutRecordBatch operations),
you must specify the name of the delivery stream and the data record, or array of data
records. Each data record consists of a data BLOB that can be up to 1,000 KB in size and
can contain any kind of data.
For detailed information and sample code for the Kinesis Data Firehose API operations,
see Writing to a Firehose Delivery Stream Using the AWS SDK.
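For example, using boto3, a producer can call PutRecord for individual records or PutRecordBatch for batches; the delivery stream name and payload fields below are hypothetical.

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Send a single record; "clickstream-delivery" is a hypothetical delivery stream name.
firehose.put_record(
    DeliveryStreamName="clickstream-delivery",
    Record={"Data": (json.dumps({"user": "u-123", "page": "/checkout"}) + "\n").encode("utf-8")},
)

# Batch multiple records per call (up to 500) to reduce API overhead.
firehose.put_record_batch(
    DeliveryStreamName="clickstream-delivery",
    Records=[
        {"Data": (json.dumps({"user": f"u-{i}", "page": "/home"}) + "\n").encode("utf-8")}
        for i in range(10)
    ],
)
```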
You can install the Kinesis Agent on Linux-based server environments such as web servers, log
servers, and database servers. After installing the agent, configure it by specifying the
files to monitor and the destination stream for the data. After the agent is configured, it
durably collects data from the files and reliably sends it to the delivery stream.
The agent can monitor multiple file directories and write to multiple streams. It can also
be configured to pre-process data records before they’re sent to your stream or delivery
stream.
If you’re considering a migration from a traditional batch file system to streaming data,
it’s possible that your applications are already logging events to files on the file systems
of your application servers. Or, if your application uses a popular logging library (such
as Log4j), it is typically a straightforward task to configure it to write to local files.
Regardless of how the data is written to a log file, you should consider using the agent
in this scenario. It provides a simple solution that requires little or no change to your
existing system. In many cases, it can be used concurrently with your existing batch
solution. In this scenario, it provides a stream of data to Kinesis Data Streams, using the
log files as a source of data for the stream.
In our example scenario, we chose to use the agent to send streaming data to the
delivery stream. The source is on-premises log files, so forwarding the log entries to
Kinesis Data Firehose was a simple installation and configuration of the agent. No
additional code was needed to start streaming the data.
Data transformation
In some scenarios, you may want to transform or enhance your streaming data before it
is delivered to its destination. For example, data producers might send unstructured text
in each data record, and you may need to transform it to JSON before delivering it to
Amazon Elasticsearch Service.
Figure 9: Kinesis Agent monitoring multiple file directories and writing to Kinesis Data Firehose
Kinesis Data Streams provides many more controls in terms of how you want to scale the
service to meet high demand use cases, such as real-time analytics, gaming data
feeds, mobile data captures, log and event data collection, and so on. You can then
build applications that consume the data from Amazon Kinesis Data Streams to power
real-time dashboards, generate alerts, implement dynamic pricing and advertising, and
more. Amazon Kinesis Data Streams supports your choice of stream processing
framework including Kinesis Client Library (KCL), Apache Storm, and Apache Spark
Streaming.
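A producer can also write to a stream directly through the Kinesis Data Streams API, as in this small boto3 sketch; the stream name, payload, and partition key are hypothetical.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# "game-events" is a hypothetical stream; the partition key controls shard distribution.
kinesis.put_record(
    StreamName="game-events",
    Data=json.dumps({"player": "p-42", "event": "level_up", "level": 7}).encode("utf-8"),
    PartitionKey="p-42",
)
```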
Among its capabilities, the Amazon Kinesis Producer Library (KPL):
• Writes to one or more Kinesis streams with an automatic and configurable retry
mechanism
• Integrates seamlessly with the Amazon Kinesis Client Library (KCL) to de-
aggregate batched records on the consumer
The KPL can be used in either synchronous or asynchronous use cases. We suggest
using the higher-performance asynchronous interface unless there is a specific
reason to use synchronous behavior. For more information about these two use cases
and example code, see Writing to your Kinesis Data Stream Using the KPL.
The KCL helps you consume and process data from a Kinesis stream. This type of
application is also referred to as a consumer. The KCL takes care of many of the
complex tasks associated with distributed computing, such as load balancing across
multiple instances, responding to instance failures, checkpointing processed records,
and reacting to resharding. The KCL enables you to focus on writing record-processing
logic.
The KCL is a Java library; support for languages other than Java is provided using a
multi-language interface. At run time, a KCL application instantiates a worker with
configuration information, and then uses a record processor to process the data
received from a Kinesis stream. You can run a KCL application on any number of
instances. Multiple instances of the same application coordinate on failures and load-
balance dynamically. You can also have multiple KCL applications working on the same
stream, subject to throughput limits. The KCL acts as an intermediary between your
record processing logic and Kinesis Streams.
For detailed information on how to build your own KCL application, see Developing
Amazon Kinesis Streams Consumers Using the Amazon Kinesis Client Library.
Figure 10: Managed Kafka for storing streaming data in an Amazon S3 data lake
Relational data ingestion
Customers migrating to the Amazon RDS and Amazon Aurora managed database
services gain the benefits of operating and scaling a database engine without extensive
administration and licensing requirements. Customers also gain access to features such
as backtracking, where a database cluster can be rewound to a specific time without
restoring data from a backup, and point-in-time restore of a database cluster, while
avoiding database licensing costs.
Customers with data in on-premises warehouse databases gain benefits by moving the
data to Amazon Redshift – a cloud data warehouse database that simplifies
administration and scalability requirements.
Network connectivity in the form of AWS VPN or AWS Direct Connect between on-
premises data centers and the AWS Cloud must be established and sized accordingly
to ensure secure and reliable data transfer for both the initial and ongoing replication.
When loading large databases, especially in cases when there is a low bandwidth
connectivity between the on-premises data center and AWS Cloud, it’s recommended to
use AWS Snowball or similar data storage devices for shipping data to AWS. Such
physical devices are used for securely copying and shipping the data to AWS Cloud.
Once devices are received by AWS, the data is securely loaded into Amazon Simple
Storage Service (Amazon S3), and then ingested into an Amazon Aurora database
engine. Network connectivity must be sized accordingly so that data can be initially
loaded in a timely manner, and ongoing CDC does not incur a latency lag.
When moving data from on-premises databases or storing data in the cloud, security
and access control of the data is an important aspect that must be accounted for in any
architecture. AWS services use Transport Layer Security (TLS) for securing data in
transit. For securing data at rest, AWS offers a large number of encryption options for
encrypting data automatically using AWS-provided keys, customer-provided keys, and
even a hardware security module (HSM). Once data is loaded into AWS and securely
stored, the pattern must account for providing controlled and auditable access to the
data at the right level of granularity. In AWS, a combination of AWS Identity and Access
Management (IAM) and AWS Lake Formation can be used to achieve this requirement.
Figure 14: Migrating data from a relational data store to a NoSQL data store
You can use AWS Database Migration Service (AWS DMS) to migrate your data to and
from the most widely used commercial and open-source databases. It supports
homogeneous and heterogeneous migrations between different database platforms.
AWS DMS supports migration to a DynamoDB table as a target. You use object
mapping to migrate your data from a source database to a target DynamoDB table.
Object mapping enables you to determine where the source data is located in the
target. You can also create a DMS task that captures the ongoing changes from the
source database and applies them to DynamoDB as a target. This task can be full load
plus change data capture (CDC) or CDC only.
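As an illustrative sketch of this approach, the following boto3 call creates a full-load-and-CDC replication task whose table mappings include a selection rule and an object-mapping rule for a hypothetical dbo.Customer table, keyed on CustomerId in the target DynamoDB table. The endpoint and replication instance ARNs are placeholders, and the exact rule options should be checked against the AWS DMS object-mapping documentation.

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Object-mapping rules: map rows of dbo.Customer to items in a DynamoDB table,
# using CustomerId as the partition key. Table and column names are hypothetical;
# columns that are not excluded map to DynamoDB attributes by default.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-customer",
            "object-locator": {"schema-name": "dbo", "table-name": "Customer"},
            "rule-action": "include",
        },
        {
            "rule-type": "object-mapping",
            "rule-id": "2",
            "rule-name": "map-customer",
            "rule-action": "map-record-to-record",
            "object-locator": {"schema-name": "dbo", "table-name": "Customer"},
            "target-table-name": "Customer",
            "mapping-parameters": {
                "partition-key-name": "CustomerId",
                "exclude-columns": [],
            },
        },
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="sql-to-dynamodb",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SRC",   # placeholder ARNs
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TGT",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
```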
One of the key challenges when refactoring to Amazon DynamoDB is identifying the
access patterns and building the data model. There are many best practices for
designing and architecting with Amazon DynamoDB. AWS provides NoSQL Workbench
for Amazon DynamoDB. NoSQL Workbench is a cross-platform client-side GUI
application for modern database development and operations and is available for
Windows, macOS, and Linux. NoSQL Workbench is a unified visual IDE tool that
provides data modeling, data visualization, and query development features to help you
design, create, query, and manage DynamoDB tables.
In this scenario, converting the relational structures to documents can be complex and
may require building complex data pipelines for transformations. AWS Database
Migration Service (AWS DMS) can simplify the process of the migration and replicate
ongoing changes.
AWS DMS maps database objects to Amazon DocumentDB in the following ways:
AWS DMS reads records from the source endpoint, and constructs JSON documents
based on the data it reads. For each JSON document, AWS DMS determines
an _id field to act as a unique identifier. It then writes the JSON document to an
Amazon DocumentDB collection, using the _id field as a primary key.
In document mode, the JSON documents from DocumentDB are migrated as is. So,
when you use a relational database as a target, the data is stored in a single column
named _doc in the target table. You can optionally set the extra connection
attribute extractDocID to true to create a second column named "_id" that acts as the
primary key. If you use change data capture (CDC), set this parameter to true except
when using Amazon DocumentDB as the target.
In table mode, AWS DMS transforms each top-level field in a DocumentDB document
into a column in the target table. If a field is nested, AWS DMS flattens the nested
values into a single column. AWS DMS then adds a key field and data types to the
target table's column set.
Change streams in Amazon DocumentDB provide a time-ordered sequence of change events
that occur within your cluster's collections. You can read events from a change stream
using AWS DMS to implement many different use cases, including the following:
• Change notification
After change streams are enabled, you can create a migration task in AWS DMS that
migrates existing data and at the same time replicates ongoing changes. AWS DMS
continues to capture and apply changes even after the bulk data is loaded. Eventually,
the source and target databases synchronize, minimizing downtime for a migration.
During a database migration when Amazon Redshift is the target for data warehousing
use cases, AWS DMS first moves data to an Amazon Simple Storage Service (Amazon
S3) bucket. When the files reside in an Amazon S3 bucket, AWS DMS then transfers
them to the proper tables in the Amazon Redshift data warehouse.
AWS Database Migration Service supports both full load and change processing
operations. AWS DMS reads the data from the source database and creates a series of
comma-separated value (.csv) files. For full-load operations, AWS DMS creates files for
each table. AWS DMS then copies the table files for each table to a separate folder in
Amazon S3. When the files are uploaded to Amazon S3, AWS DMS sends a COPY
command, and the data in the files is copied into Amazon Redshift. For change-
processing operations, AWS DMS copies the net changes to the .csv files. AWS DMS
then uploads the net change files to Amazon S3 and copies the data to Amazon
Redshift.
In Amazon ES, you work with indexes and documents. An index is a collection of
documents, and a document is a JSON object containing scalar values, arrays, and
other objects. Elasticsearch Service provides a JSON-based query language, so that
you can query data in an index and retrieve the corresponding documents. When AWS
DMS creates indexes for a target endpoint for Amazon ES, it creates one index for each
table from the source endpoint.
AWS DMS supports multithreaded full load to increase the speed of the transfer, and
multithreaded CDC load to improve the performance of CDC. For the task settings and
prerequisites that are required to be configured for these modes, see Using an Amazon
Elasticsearch Service cluster as a target for AWS Database Migration Service.
Figure 16: Migrating data from an Amazon DocumentDB store to Amazon Elasticsearch Service
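As a sketch of how such task settings might be applied (the setting names are DMS TargetMetadata task settings, but the thread and buffer values here are illustrative rather than tuned recommendations), you could modify a replication task's settings with boto3:

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Task settings that enable multithreaded full load and multithreaded CDC apply.
task_settings = {
    "TargetMetadata": {
        "ParallelLoadThreads": 8,
        "ParallelLoadBufferSize": 500,
        "ParallelApplyThreads": 8,
        "ParallelApplyBufferSize": 500,
        "ParallelApplyQueuesPerThread": 4,
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:111122223333:task:ES-TASK",  # placeholder ARN
    ReplicationTaskSettings=json.dumps(task_settings),
)
```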
Conclusion
With the massive amount of data growth, organizations are focusing on driving greater
efficiency in their operations by making data-driven decisions. As data comes from
various sources in different forms, organizations are tasked with integrating
terabytes to petabytes, and sometimes exabytes, of data that were previously siloed in
order to get a complete view of their customers and business operations. Traditional on-
premises data analytics solutions can’t handle this approach because they don’t scale
well enough and are too expensive. As a result, customers are looking to modernize
their data and analytics infrastructure by moving to the cloud.
To analyze these vast amounts of data, many companies are moving all their data from
various silos into a Lake House architecture. A cloud-based Lake House architecture on
AWS allows customers to take advantage of scale, innovation, elasticity, and
agility to meet their data analytics and machine learning needs. As such, ingesting data
into the cloud becomes an important step of the overall architecture. This whitepaper
addressed the various patterns, scenarios, use cases and the right tools for the right job
that an organization should consider while ingesting data into AWS.
Contributors
Contributors to this document include:
Further reading
• Database Migrations Case Studies
• Streaming CDC into Amazon S3 Data Lake in Parquet Format with AWS DMS
• Loading Data Lake Changes with AWS DMS and AWS Glue
Document history
Date            Description
July 23, 2021   First publication