
ELTP: Extending ELT for Modern AI and Analytics | Airbyte
Excerpt
Whether you are building the next great AI app, an analytics suite, or a new data
warehouse, the ELTP model is a simple and extensible architecture for building robust data
pipelines.

Whether building data pipelines for AI, analytics, or data engineering, there are two very
common design mistakes that almost everyone makes at least once.

The first common mistake comes from designing everything as a single “ETL” operation, running
Extract, Transform, and Load as a single combined step. The second common mistake is to
assume that the best place for processing data is also the best place to host it, omitting the
important ‘publish’ step.

To address these challenges, I introduce the “ELTP” architecture: Extract, Load, Transform, and
Publish. This simple and scalable model is crafted to prevent these common mistakes. Below,
we'll delve into the nuances of ELTP and explore its broad applicability for AI, analytics, and big
data.

From ETL to ELT to ELTP: A Brief Introduction


In order to understand the benefits of the ELTP architecture, it’s important to first understand
some basic context around ETL and ELT.

The original “ETL” approach to data movement combined business logic and data replication
into a single operation. This was inherently fragile: ETL processes would break whenever
business logic changed, and any breakage would prevent source data from loading at all. Since
many of the transforms required aggregating and joining across datasets, memory errors were
frequent, and new failure modes could arise at any time as data volumes grew.

To address these issues and others, modern data engineering best practices now advocate for an
“ELT” or “EL+T” approach, where the replication (Extract and Load) is performed first, as a
standalone process. Then, business logic and transformations are applied only after the raw data
is safely landed in a database or data lake that we control. In this model, the Extract-Load pair,
or "EL pair", is executed as a single step in the process, with composable Extract and Load
connectors working together to replicate data efficiently from source to destination.

While the ELT architecture is more stable and more scalable than the ETL model, ELT still falls
short in that it doesn't specify a method for delivering data to downstream users and systems -
what is sometimes called "Reverse ETL". To address this gap, we introduce the 'Publish' step,
leading to the "EL+T+P" or "ELTP" model. In full, the ELTP model encompasses the following steps:

Extract data from raw data sources.
Load data into durable storage, such as a data warehouse or data lake.
Transform raw data into new datasets.
Publish data to downstream users and business applications.

This design preserves the benefits of ELT while also giving a framework for ensuring that data is
efficiently published to downstream consumers and applications.
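
To make these four steps concrete, here is a minimal Python sketch of an ELTP pipeline skeleton. It is illustrative only: the function names and the source, warehouse, and destination interfaces are placeholders rather than any specific library's API.

```python
# Minimal ELTP skeleton (illustrative; `source`, `warehouse`, and
# `destination` are placeholder interfaces, not a specific library's API).

def extract(source) -> list[dict]:
    """Extract raw records from a source system (API, database, files)."""
    return list(source.read_records())

def load(records: list[dict], warehouse) -> None:
    """Load raw records, unmodified, into durable storage."""
    warehouse.append("raw_events", records)

def transform(warehouse) -> None:
    """Apply business logic inside the warehouse to build derived datasets."""
    warehouse.run_sql("""
        CREATE OR REPLACE TABLE analytics.daily_signups AS
        SELECT CAST(created_at AS DATE) AS signup_date, COUNT(*) AS signups
        FROM raw_events
        WHERE event_type = 'signup'
        GROUP BY 1
    """)

def publish(warehouse, destination) -> None:
    """Deliver a transformed dataset to a downstream system (SaaS, S3, SFTP, ...)."""
    rows = warehouse.read_table("analytics.daily_signups")
    destination.write_records(rows)

def run_eltp(source, warehouse, destination) -> None:
    records = extract(source)        # E
    load(records, warehouse)         # L
    transform(warehouse)             # T
    publish(warehouse, destination)  # P
```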

Meet the “Publish” Step


The Publish framing is a simple way to describe how transformed data can be delivered to a
variety of downstream destinations, including: external SaaS applications, file stores, CRM
systems, AI vector stores, and even other databases. Publish steps implement the same
protocols and paradigms as the Extract-Load pairs used for raw data collection, and therefore
benefit from the same advantages discussed above for EL pairs.

The biggest difference between the initial Extract-Load and the final Publish is that in the latter
case, the EL process is extracting from our data platform and loading to an external system or
publish location.

Note that in this two-stage EL architecture, we can add any number of custom transformations
between the initial EL step that loads the raw data and the second EL step that publishes
transformed data to downstream applications.

Why Publish?
There are several reasons for adding a Publish step to one’s data architecture, namely: sending
data files to external systems, decentralizing analytic queries, and publishing to downstream
applications and indexes.

Publish Data Files to External Systems

The oldest example of a "publish" operation is sending data files to partner companies, vendors,
or regulatory agencies. For example, a marketing agency might request a regular upload of your
customer data file, which they in turn process and return as a new file with added market
segmentation data. These publish operations often have externally defined file formats and
upload location requirements that are controlled by the third party.

Here is a simple example, where we publish one file to a partner via S3 and another file to a
vendor using SFTP:
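
As a stand-in for the original illustration, here is a hedged Python sketch using boto3 for S3 and paramiko for SFTP. Bucket names, hostnames, paths, and credentials are illustrative placeholders.

```python
import boto3     # AWS SDK for Python
import paramiko  # SSH/SFTP client

# Publish a customer extract to a partner's S3 bucket.
# (Bucket name, key, and local path are illustrative placeholders.)
s3 = boto3.client("s3")
s3.upload_file(
    Filename="exports/customers_2024-01-31.csv",
    Bucket="partner-inbound-bucket",
    Key="acme/customers_2024-01-31.csv",
)

# Publish a second extract to a vendor's SFTP drop location.
# (Host, credentials, and remote path are illustrative placeholders.)
transport = paramiko.Transport(("sftp.vendor.example.com", 22))
transport.connect(username="acme_upload", password="<password>")
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put("exports/segments_2024-01-31.csv", "/inbound/segments_2024-01-31.csv")
sftp.close()
transport.close()
```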

Publish to Empower Analytics

While I was at Amazon Video, I built and maintained systems that processed terabytes of data
daily. To achieve that kind of scale, my team developed ELT pipelines on AWS Redshift, carefully
right-sizing our clusters to efficiently execute all our transformations.

But when it came time to consume the data, we ran into problems. Our system couldn’t handle
heavy query traffic, and one runaway query could cause significant performance issues for every
other user. Our clusters, optimized for heavy-duty processing tasks, weren't necessarily equipped
for ad-hoc analysis or high query concurrency.

The solution was to add a final "publish" operation, where we could deliver data to our users
exactly how and where they needed it. We didn't bar users from querying the main cluster, but
downstream consumers were incentivized to use Tableau and S3 datasets because those
provided better performance and let consumers right-size their own analytics infrastructure to
their specific requirements. Meanwhile, the central database was better insulated from runaway
queries that might otherwise lead to costly outages.

Publish to SaaS Applications

A popular class of Publish operations is the so-called "Reverse ETL" use case, where
transformed data is Published to one or more business applications or SaaS providers. Popular
examples include Publishing data to Salesforce for business operations, to HubSpot to keep
customer CRM data up to date, and to Active Directory to keep employee and organizational data
current.
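
As a rough illustration of this "publish to SaaS" pattern, here is a hedged Python sketch that pushes transformed warehouse rows to Salesforce's REST API using the requests library. The instance URL, API version, access token, field names, and the fetch_rows helper are assumptions for illustration only, not the method prescribed by any particular Reverse ETL tool.

```python
import requests

# Illustrative only: instance URL, API version, token, object, and field
# names are placeholders; fetch_rows() stands in for a warehouse query.
INSTANCE = "https://example.my.salesforce.com"
API = f"{INSTANCE}/services/data/v58.0"
HEADERS = {
    "Authorization": "Bearer <access_token>",
    "Content-Type": "application/json",
}

def fetch_rows() -> list[dict]:
    """Hypothetical helper returning transformed rows from the warehouse."""
    return [{"salesforce_id": "003XXXXXXXXXXXXXXX", "lifetime_value": 1234.56}]

for row in fetch_rows():
    # PATCH an existing Contact record with a custom field computed upstream.
    resp = requests.patch(
        f"{API}/sobjects/Contact/{row['salesforce_id']}",
        headers=HEADERS,
        json={"Lifetime_Value__c": row["lifetime_value"]},
    )
    resp.raise_for_status()
```
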
Publish to Vector Stores as AI Long-Term Memory (RAG)

A common method of building AI chatbots is to Publish text data to a specialized indexing
system called a "vector store", which can in turn be used by the chatbot in a process called
"RAG" (or "retrieval-augmented generation"). With RAG, the chatbot dynamically retrieves
relevant information from a vector store as if retrieving from its long-term "memory". It uses this
information to generate responses that are informed by the data it has retrieved, combining
pre-existing knowledge with the context provided by the user's query.

To run this process using ELTP, we follow the same basic steps as before (a minimal sketch follows the list):

1. Extract text from our raw document sources.
2. Load the raw text data to long-term storage.
3. Transform the data by splitting large text into smaller chunks and calculating a numeric representation of each chunk (often called a "vector" or "embedding").
4. Publish the text data to our vector store indexes, where the chatbot can then make use of this data when answering questions it wouldn't otherwise be able to answer.
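
Here is a minimal sketch of steps 3 and 4, assuming the raw text has already been extracted and loaded. It uses the sentence-transformers library for embeddings; the chunking logic, the document input shape, and the vector_store.upsert interface are illustrative assumptions rather than any specific vector store's API.

```python
from sentence_transformers import SentenceTransformer

# Transform: split long documents into chunks and embed each chunk.
# (Chunk size and model choice are illustrative; tune them for your use case.)
model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def transform(documents: list[dict]) -> list[dict]:
    records = []
    for doc in documents:  # each doc: {"id": ..., "text": ...}
        for i, chunk in enumerate(chunk_text(doc["text"])):
            records.append({
                "id": f"{doc['id']}-{i}",
                "text": chunk,
                "embedding": model.encode(chunk).tolist(),
            })
    return records

def publish(records: list[dict], vector_store) -> None:
    # `vector_store.upsert` is a placeholder; Pinecone, Milvus, Chroma,
    # and Weaviate each have their own client APIs.
    vector_store.upsert(records)
```
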
ELTP in Practice
The ELTP framing offers a practical and holistic model for what otherwise might be expressed as
“ELT + Reverse ETL” - while addressing additional use cases that traditionally would not be in
scope for Reverse ETL. Here we’ll discuss a few advanced use cases and real world design and
deployment considerations.

Publish Destinations are more than Reverse ETL


All Reverse ETL destinations are Publish-type destinations, but not all Publish destinations are
Reverse ETL. The term "Reverse ETL" can still be useful when communicating the unique
challenges of publishing to a SaaS system like Salesforce or HubSpot, but the term falls short
when describing a publish flow to a partner's SFTP site or to an external database. Whether
publishing to Salesforce, SFTP, S3, or SharePoint, the Publish framing offers a more robust and
holistic architectural model that is not tied to the "pickiness" of API-type destinations.

Plan for ELTP, but Don’t Fear ETL


Sometimes all that’s needed is a good proof of concept. For example, in AI and LLM
applications, tutorials will often walk users through this now-classic data processing flow for
RAG applications: Extract text from documents like PDFs or web sites, Transform into embedded
document chunks, and directly Load to the vector store. Tutorials covering this approach can be
found here and here, for instance.

As a tutorial or a proof-of-concept (POC) data flow, there's nothing wrong with this ETL-like
approach. The decision to skip the database Load is very convenient and improves accessibility:
the author has less surface area to maintain and doesn't need to worry about introducing durable
long-term storage into the architecture. This reduces the likelihood that someone following along
will be blocked by not having access to the same or similar options for the long-term storage layer.

However, if and when our POC is successful, we will very soon want to migrate to the ELTP
architecture, loading our raw data into long-term storage and then publishing to the vector
store as a separate step.

Adding a database for long-term storage decouples the Extract/Load step from the Publish
operation, and gives us the ability to add any number of transformations in between the initial
EL operation and the Publish operation - for instance dbt models or custom Python scripts to
clean the data and to apply business logic.
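
For instance, one such transformation might be a small text-cleaning step like the hedged sketch below (the table names and the warehouse interface are illustrative placeholders; in practice this could just as easily be a dbt model).

```python
import re

def clean_document_text(raw_text: str) -> str:
    """Strip boilerplate and normalize whitespace before chunking/embedding."""
    text = re.sub(r"Page \d+ of \d+", "", raw_text)  # drop page-footer noise
    text = re.sub(r"\s+", " ", text)                 # collapse runs of whitespace
    return text.strip()

def run_transform(warehouse) -> None:
    """Read raw documents, apply business logic, write a derived table."""
    rows = warehouse.read_table("raw_documents")       # placeholder interface
    cleaned = [
        {"id": row["id"], "text": clean_document_text(row["text"])}
        for row in rows
        if row["text"]                                 # drop empty documents
    ]
    warehouse.write_table("clean_documents", cleaned)  # placeholder interface
```
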
Our speed of iteration is also greatly improved with this design. When we are ready to
experiment with different embedding models and tune the chunking parameters, we already
have the raw text extracted and ready to go. We can even run side-by-side experiments
comparing different transformation and publication options, without re-extracting the raw data
and without making any changes to the “production” data flow.

The ELTP model allows us to scale more effectively as well: from tens of documents to thousands
or even millions of documents. Each step in the process can run on hardware specialized for that
step, and all of the replication steps (EL and Publish) can operate incrementally: extracting only
new or updated documents when sending data downstream. As business logic evolves and as
bugs are identified, we can rerun or restate our transformations without needing to re-extract
the text from the upstream data source.
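
Incremental replication can be as simple as tracking a cursor between runs, as in this hedged sketch (the source and state_store interfaces are illustrative placeholders):

```python
def incremental_extract(source, state_store) -> list[dict]:
    """Extract only records created or updated since the last successful sync."""
    last_cursor = state_store.get("last_synced_at")           # None on the first run
    records = source.read_records(updated_after=last_cursor)  # placeholder interface
    if records:
        # Advance the cursor to the newest record we actually saw.
        state_store.set("last_synced_at", max(r["updated_at"] for r in records))
    return records
```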

The upshot is that ELTP allows more opportunities for iteration, while also providing scaling and
performance benefits. While perhaps counter-intuitive, adding the durable storage layer actually
reduces our total cost of ownership (TCO) versus running the ETL or "POC" model, which performs
Extract-Transform-Load all in one step.

ELTP for AI Use Cases


With the rapid growth in AI applications, many people now find themselves entering the world
of data engineering for the first time, specifically for the purpose of building an AI application.
The ELTP model provides these new data practitioners an opportunity to learn from data
engineering mistakes of the past, while not having to reinvent the wheel.

Vector Store Publish Destinations

Airbyte is an open source data movement tool which can be used for both “EL” and “Publish”
steps in the ELTP architecture. Airbyte currently supports several vector store destinations,
including Pinecone, Milvus, Chroma, and Weaviate. These can be used to directly load data using
an ETL or ELTP architecture - according to your own preferences and requirements. All of these
vector store destinations provide access to inline text splitting and vector embeddings using a
declarative, configuration-based approach.

Text Document Source Connectors (New!)

Whereas traditional data pipelines prioritized numeric and tabular data, AI and LLM data flows
greatly benefit from text data in “unstructured” document formats like PDF, Word, Markdown,
and Google Docs.

This week, Airbyte is also launching source connectors to extract “text-as-data” from those
unstructured documents and others. Text documents are a powerful new class of data source for
data pipelines, and the new text document connectors enable the same ELTP, ELT, and ETL
architectures across all of your data, whether for AI or for analytics.

For more information, please see the companion blog post announcement.

Simplifying ELTP Best Practices with Open Source


Modern data movement tools such as Airbyte allow you to implement robust ELTP data
pipelines easily and quickly. Airbyte offers a low-code UI, an API and a Terraform provider for
those who prefer to build pipelines programmatically. Because Airbyte is open source, it carries
the additional benefit of letting users build and modify connectors if the source or destination
they want to reach is not yet officially supported.

With the ability to create your own open source connectors when needed, the same tooling and
design techniques can be applied to almost any imaginable data pipeline.

Airbyte aims to provide data teams with an open-source standard in data movement. We’re
currently investing heavily in Publish connectors for AI and LLM use cases. In 2024, we will
expand to cover additional Reverse ETL and ELTP workflows and use cases, to simplify and
commoditize data movement in any direction.

In Closing
While the ELT model is a great improvement over ETL, it doesn’t solve for the last mile of data
delivery. ELTP is a model that handles the heavy lifting while also providing a real solution for
product delivery.

Whether you are building the next great AI application or a more traditional data analytics stack,
you can set yourself up for success by understanding the ELTP model and designing your
architecture with these best practices in mind.

Interested in more data content? Join us for the upcoming ELTP webinar for AI or check out our
blog, where we publish articles in-house and from the community on data engineering,
databases, AI infrastructure and more.
