The Future of Data Integration
Easily connect and act on your data from every source
Table of contents
Introduction
Chapter 1: The challenge of manual data integration
Chapter 2: Breaking data integration barriers
Chapter 3: Data integration made easier with AWS
Conclusion: Unlock the value of your data with data integration on AWS
INTRODUCTION
Let’s say you’re running the marketing function for a chain of hotels.
You’re looking to create targeted offers that help improve the
experience of your high-value customers. You have customer purchase
history in a relational database, clickstream data from the hotel website
in an analytics system, and customer chat transcripts in a support
system. You want to take these datasets and use them to build a
machine learning (ML) model that predicts when a customer has a high
probability of booking rooms with a rival hotel company—so you can
offer them the right incentive at the right time.
You can see from this example that you need to integrate all three
datasets so your teams can access a complete customer profile and make
timely predictions. Data integration is key to providing a holistic view
and helping you turn your disparate data into real business value.
However, combining data from different sources in different types of
tools is hard, and it’s even harder when your organization is dealing
with silos that impede data access, create distance across systems, and
prevent users at all levels from accessing data.
Data integration has long been a heavy lift and one that’s prone to
productivity losses, rising costs, and continuous errors. For many data
teams, integrating data across different data silos requires them to
build complex extract, transform, and load (ETL) pipelines that take
hours, if not days, to complete. And that’s just the beginning. Once they
build the ETL pipeline, they have to spend even more time and effort to
maintain it. They must continually manage the pipeline to ensure data is
current and relevant. They also have to operationalize the pipeline and
make a concerted effort to avoid downtime. This often materializes
as a never-ending loop of scheduling, monitoring, and troubleshooting.
ETL may be the status quo, but it simply isn't fast enough to keep up with
the speed of decision-making. ETL needs to be simpler and, in many
cases, eliminated.
At Amazon Web Services (AWS), our goal is to make it easier for you
to connect to and act on all of your data, no matter where it lives, and
to do it with the speed and confidence you need to make data-driven
decisions. We’re focused on four areas of effective data integration.
First, we’re doing direct integrations between AWS services to reduce
and eliminate ETL for common use cases, so your teams can move
faster. This includes our investment in a zero-ETL future where you can
perform analytics, ML, and business intelligence (BI) without building or
managing data pipelines that move, load, or preprocess the data.
Second, when ETL is necessary for use cases where you're combining multiple types of datasets or adding value through transformations or similar scenarios, we're making ETL easy with AWS Glue.

Third, to ensure you can act on all, and not just some, of your data, we're providing AWS services that connect and federate to an expanding list of hundreds of data sources, including third-party software as a service (SaaS), on premises, and other clouds, as well as seamless integration with third-party data.

Fourth, we're helping you share data securely and easily with your partners.

4 ways AWS is making data integration faster and easier:
1. Providing direct integrations between AWS services to reduce and eliminate ETL for common use cases
2. Making ETL easy with AWS Glue when transformations add value
3. Connecting and federating to hundreds of data sources
4. Sharing data securely and easily
CHAPTER 1
The challenge of manual data integration
The traditional ETL process can best be described as an obstacle course.
Take, for example, a global manufacturing company with dozens of
factories in multiple countries. They use a cluster of databases to store
order and inventory data in each of those countries. To get a real-time
view of their orders and inventory, they have to build individual data
pipelines between each of these database clusters to a central data
warehouse to query across the combined dataset. To meet this need,
the data integration team has to write code to connect to 12 different
clusters and manage and test 12 production pipelines. Once deployed,
the team has to constantly monitor and scale the pipelines to optimize
performance. When anything changes, they have to make updates
across 12 different places.
To accomplish the above, you need a team of engineers
with specialized skills to build and maintain ETL pipelines.
You need data engineers to create custom code to build the
pipeline and DevOps engineers to deploy and manage the
infrastructure so the pipeline scales. It takes this team hours,
if not days, to complete the build. And they must repeat the
entire process whenever the data source changes.
CHAPTER 2
Breaking data integration barriers
Your data sources are like puzzle pieces. Data integration takes these fragmented pieces and seamlessly puts them together to present a single, unified view of your data. This view gives your organization a deeper understanding of your customers and business. However, the traditional ETL process makes it difficult to uncover this picture with any degree of speed or confidence.

At AWS, we're working to automate the undifferentiated parts of building and managing ETL pipelines, so you can integrate and act on all of your data at a faster pace. Our data integration technology reduces the time and resources you spend to build data pipelines and empowers your teams to access data more quickly. Our services work to simplify your data architecture and reduce data engineering efforts. Instead of bogging your teams down with persistent costs and repetitive effort, you enable greater productivity and free them to focus on high-value, creative work.

How AWS data integration technology increases the pace of innovation:
• Reduces time and resources spent on building data pipelines
• Empowers teams to access data more quickly

AWS zero-ETL integrations, for instance, are cloud-native and scalable, allowing your organization to optimize costs based on actual usage and data processing needs. You reduce infrastructure costs, development efforts, and maintenance overhead. Zero-ETL also eliminates recurring work by allowing you to include new data sources without reprocessing large amounts of data.

Zero-ETL also automates moving data from source to target with zero effort. Your teams gain near real-time data access, ensuring they have the latest data for analytics, artificial intelligence (AI), ML, and reporting. They discover business insights faster and make decisions in the moments they matter. This immediacy has implications for use cases like near real-time dashboards, data quality monitoring, and customer behavior analysis.

It's important to note that data integration is not just about technical and operational gains, although those are vital to innovation. Data integration also has cultural implications. For most data leaders, establishing a data-driven culture is a paramount goal. When teams across your organization trust data and use it in real time to transform user experiences, you naturally begin to build or reinforce such a culture.
CUSTOMER SUCCESS WITH ZERO-ETL INTEGRATIONS
— Hitoshi Kageyama, Executive Vice President, KINTO Technologies Corporation
CHAPTER 3
Data integration made easier with AWS

AWS is investing in a future where you can quickly and easily integrate and act on all your data, no matter where it lives. As outlined in the beginning of this eBook, our data integration approach encompasses four pillars that make it easier for your organization to:
1. Integrate services and enable a zero-ETL future
2. Perform easier value-add transformations and data pipelines with AWS Glue
3. Connect to hundreds of data sources
4. Share data securely and easily
1. Integrate services and enable a zero-ETL future
A zero-ETL future means you can perform analytics, ML, and BI without the need to manually build or manage data pipelines that move, load, or preprocess the data. AWS is bringing this future to light with numerous use cases that eliminate the need for manual data pipelines.

Figure: Eliminating the need for manual pipelines. Federated query, ML models applied directly in data stores, real-time streaming ingestion, and zero-ETL integrations connect data from devices and people without manual pipelines.
Federated query
With federated querying on Amazon Redshift and
Amazon Athena, you can run predictive analytics across
data stored in operational databases, data warehouses,
and data lakes—without any data movement.
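To make this concrete, here is a minimal sketch of issuing a federated query through the Amazon Redshift Data API with Python (boto3). The cluster identifier, database, user, and schema and table names are hypothetical assumptions, and the external schema is assumed to have been created beforehand with CREATE EXTERNAL SCHEMA.

```python
# A minimal sketch of a federated query run via the Redshift Data API (boto3).
# All names below are hypothetical placeholders, not values from this eBook.
import boto3

client = boto3.client("redshift-data")

# Join live operational data (federated schema) with a warehouse table,
# without moving the operational data into the warehouse first.
sql = """
SELECT o.customer_id, SUM(o.amount) AS lifetime_spend, p.segment
FROM postgres_fed.orders AS o             -- lives in an operational database
JOIN analytics.customer_profiles AS p     -- lives in Amazon Redshift
  ON o.customer_id = p.customer_id
GROUP BY o.customer_id, p.segment;
"""

response = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # hypothetical cluster name
    Database="dev",
    DbUser="analytics_user",
    Sql=sql,
)
print("Statement ID:", response["Id"])  # poll describe_statement / get_statement_result for rows
```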
Real-time streaming ingestion
With direct integrations for AWS streaming services, you analyze data as soon as it's produced and gather timely insights to capitalize on opportunities. For example, with Amazon Redshift Streaming Ingestion, you configure Amazon Redshift to directly ingest streaming data into your data warehouse in real time, right from the Amazon Redshift console. With this integration, you ingest hundreds of megabytes of data per second and query it in near real time. You can also connect to multiple Amazon Kinesis data streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) data streams and pull data directly into Amazon Redshift without staging data in Amazon S3.
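As an illustration, the setup reduces to a couple of SQL statements. Below is a minimal sketch issued through the Redshift Data API; the IAM role ARN, stream name, and cluster details are hypothetical placeholders.

```python
# A minimal sketch of configuring Amazon Redshift Streaming Ingestion from a
# Kinesis data stream via the Redshift Data API. All names are hypothetical.
import boto3

client = boto3.client("redshift-data")

statements = [
    # Map a Kinesis data stream into Redshift as an external schema.
    """CREATE EXTERNAL SCHEMA clickstream
       FROM KINESIS
       IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftStreamingRole';""",
    # Materialize the stream; AUTO REFRESH keeps it current without a pipeline.
    """CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
       SELECT approximate_arrival_timestamp,
              JSON_PARSE(kinesis_data) AS event
       FROM clickstream."hotel-website-clicks"
       WHERE CAN_JSON_PARSE(kinesis_data);""",
]

for sql in statements:
    client.execute_statement(
        ClusterIdentifier="my-redshift-cluster",  # hypothetical
        Database="dev",
        DbUser="analytics_user",
        Sql=sql,
    )
```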
Zero-ETL integrations
We have zero-ETL integrations for common ETL jobs across our most popular data stores, including four integrations with Amazon Redshift and two with Amazon OpenSearch Service. With these zero-ETL integrations, your data is automatically connected from the source to the destination, so you can quickly analyze your transactional data. And because no pipeline development is needed, you don't have to wait on one to be built to get the insights you need. You eliminate months of work for data engineering teams, allowing them to focus on higher value-add activities. At the same time, you can make quicker and more accurate data-driven predictions for the purposes of content targeting, fraud detection, customer behavior analysis, and more.

These integrations are easy to use and simple to set up: you simply select the source and select the target (see the sketch below). They also enable you to consolidate data from multiple sources seamlessly, so you can run unified analytics or search across multiple applications and data sources.

Benefits of AWS zero-ETL integrations:
• Provides faster access to insights
• Eliminates months of work for data engineering teams
• Easy to use
• Integrates data from multiple sources
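For instance, creating an integration programmatically comes down to naming a source and a target. A minimal sketch with boto3 follows; both ARNs are hypothetical, and the source database and target warehouse are assumed to already exist and be configured for zero-ETL.

```python
# A minimal sketch of creating a zero-ETL integration from an RDS for MySQL
# database to Amazon Redshift with boto3. Both ARNs are hypothetical
# placeholders; prerequisites (parameter groups, resource policies) are
# assumed to be in place.
import boto3

rds = boto3.client("rds")

response = rds.create_integration(
    IntegrationName="orders-to-warehouse",
    SourceArn="arn:aws:rds:us-east-1:123456789012:db:orders-mysql",  # hypothetical
    TargetArn="arn:aws:redshift-serverless:us-east-1:123456789012:namespace/analytics-ns",  # hypothetical
)
print(response["Status"])  # e.g., "creating"; data then replicates continuously
```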
Here’s a look at each of the zero-ETL integrations with Amazon Redshift and Amazon OpenSearch
Service and how you can use them.
ZERO-ETL INTEGRATIONS WITH AMAZON REDSHIFT

Amazon RDS for MySQL
Description: The Amazon Relational Database Service (Amazon RDS) for MySQL integration with Amazon Redshift empowers you to easily perform analytics on your RDS for MySQL data.
Highlights:
• Seamlessly replicates RDS for MySQL data into Amazon Redshift, automatically handling initial data loads, ongoing change synchronization, and schema replication
• Enables workload isolation for optimal performance
• Consolidates data from multiple sources into Amazon Redshift, such as Aurora MySQL-Compatible Edition and Aurora PostgreSQL-Compatible Edition
Use cases:
• Optimized gaming experience
• Data quality monitoring
• Fraud detection

Amazon DynamoDB
Description: The Amazon DynamoDB zero-ETL integration with Amazon Redshift provides a fully managed solution for making data from DynamoDB available for analytics in Amazon Redshift.
Highlights:
• Replicates DynamoDB data into Amazon Redshift for analytics without consuming DynamoDB Read Capacity Units (RCUs)
• Enables holistic insights across applications without impacting production workloads
• Unlocks powerful Amazon Redshift capabilities on DynamoDB data, such as high-speed SQL queries, ML integrations, materialized views for fast aggregations, and secure data sharing
Use cases:
• Customer behavior analysis
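Once an integration like the DynamoDB one above is active, the replicated data is queried in Amazon Redshift like any other table. Here is a minimal, hypothetical sketch using the Data API; the cluster, database, and table names are placeholders standing in for what the integration creates.

```python
# A minimal sketch of querying DynamoDB data replicated into Amazon Redshift
# by a zero-ETL integration; all names are hypothetical placeholders.
import boto3

client = boto3.client("redshift-data")

client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # hypothetical
    Database="orders_from_dynamodb",          # assumed name of the integration's database
    DbUser="analytics_user",
    Sql="SELECT order_status, COUNT(*) AS orders FROM orders GROUP BY order_status;",
)
```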
ZERO-ETL INTEGRATIONS WITH AMAZON OPENSEARCH SERVICE
(Table: each OpenSearch Service integration's description, highlights, and use cases.)
CUSTOMER SUCCESS WITH ZERO-ETL INTEGRATIONS
— Katsutoshi Murakami, Director and CPO, Money Forward i
2. Perform easier value-add transformations and data pipelines with AWS Glue
Building ETL pipelines will still be necessary for certain use cases. Data engineers likely need to perform data transformations, such as data cleansing and deduplication, and combine multiple datasets across custom applications for performing data analysis and creating ML models. AWS is making transformations easy for these use cases with AWS Glue—a serverless, scalable data movement and transformation service.

AWS Glue is a fully managed data integration service that connects, transforms, and manages data and data pipelines. Each month, hundreds of thousands of customers use AWS Glue, and hundreds of millions of data integration jobs are run on the service. By simplifying the data integration process, AWS Glue ensures that data is readily available and formatted correctly for various analytical applications (see the sketch below).

Discover, prepare, and integrate all your data at scale:
• All-in-one data integration service
• Tailored tools to support all data users
• Integrate data faster with generative AI features
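To ground this, here is a minimal sketch of an AWS Glue (PySpark) job that performs the kind of cleansing and deduplication described above. The Data Catalog database and table and the S3 output path are hypothetical placeholders.

```python
# A minimal sketch of an AWS Glue (PySpark) job that cleanses and deduplicates
# records before writing them out for analysis. All names are hypothetical.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw records registered in the AWS Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="hotel_raw",    # hypothetical database
    table_name="bookings",   # hypothetical table
)

# Cleanse: drop rows missing a customer ID, then remove exact duplicates.
cleaned = raw.toDF().dropna(subset=["customer_id"]).dropDuplicates()

# Write the curated dataset to S3 as Parquet for downstream analytics and ML.
cleaned.write.mode("overwrite").parquet("s3://example-curated-bucket/bookings/")

job.commit()
```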
3. Connect to hundreds of data sources
To ensure your organization can act on all, and not just some of your data,
AWS services connect to an expanding list of hundreds of data sources
including third-party SaaS, on premises, and other clouds, as well as seamless
integration with third-party data. With AWS, you can connect to data sources
that run the gamut in your enterprise, going from ERP applications such as
SAP, to CRM applications such as Salesforce, to analytics offerings such as
Adobe Analytics, and more.
Here are a few examples of the AWS services that enable these connections:
• Amazon Kinesis Data Firehose: Stream data in real time from over 30 AWS and third-party sources (see the sketch below)
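As a simple illustration, here is a minimal, hypothetical sketch of putting a record onto an existing Firehose delivery stream with boto3; the stream name and event fields are placeholders.

```python
# A minimal sketch of sending one record to an existing Kinesis Data Firehose
# delivery stream with boto3; the stream name is a hypothetical placeholder.
import json

import boto3

firehose = boto3.client("firehose")

event = {"customer_id": "c-123", "page": "/rooms/deluxe", "ts": "2024-05-01T12:00:00Z"}

firehose.put_record(
    DeliveryStreamName="clickstream-delivery",  # hypothetical stream
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```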
For third-party data, we offer AWS Data Exchange, which enables you to access third-party data through files, tables, and APIs from over 300 data providers and over 3,500 data products, all from one place. You can easily discover and subscribe to ready-to-use data in the cloud that can be quickly integrated with AWS data, analytics, and ML services.

Quickly and easily use third-party data in your applications, analytics, and machine learning models.
4. Share data securely and easily
You need a secure and effective way to share your
data with partners. AWS Clean Rooms helps you and
your partners easily and securely collaborate, analyze,
and build ML models using your collective datasets—
without sharing or copying one another’s underlying
data or revealing sensitive information to each other.
You can create a secure data room in minutes and
collaborate with any other company on the AWS
Cloud to generate unique insights about advertising
campaigns, investment decisions, and research
and development.
CONCLUSION
Unlock the value of your data with data integration on AWS
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.