ELTP: Extending ELT for Modern AI and Analytics
Whether you are building the next great AI app, an analytics suite, or a new data
warehouse, the ELTP model is a simple and extensible architecture for building robust data
pipelines.
Whether you are building data pipelines for AI, analytics, or general data engineering, there are two very common design mistakes that almost everyone makes at least once.
The first mistake is designing everything as a single “ETL” operation, running Extract, Transform, and Load as one combined step. The second is assuming that the best place to process data is also the best place to host it, omitting the important “Publish” step.
To address these challenges, I introduce the “ELTP” architecture: Extract, Load, Transform, and
Publish. This simple and scalable model is crafted to prevent these common mistakes. Below,
we'll delve into the nuances of ELTP and explore its broad applicability for AI, analytics, and big
data.
The original “ETL” approach to data movement combined business logic and data replication
into a single operation. This was inherently fragile, because ETL processes would break whenever
business logic changed, and any breakage would prevent source data from loading at all. Since
many of the transforms required aggregating and joining across datasets, memory errors were frequent, and new failure types could arise as data volumes grew.
To address these issues and others, modern data engineering best practices now advocate for an
“ELT” or “EL+T” approach, where the replication (Extract and Load) is performed first, as a
standalone process. Then, business logic and transformations are applied only after the raw data
is safely landed in a database or data lake that we control. In this model, the Extract-Load Pair, or “EL Pair”, is executed as a single step in the process, with composable Extract and Load connectors working together to replicate data efficiently from source to destination.
While the ELT architecture is more stable and more scalable than the ETL model, ELT is still lacking in that it doesn’t specify a method for delivering data to downstream users and systems - what is sometimes called “Reverse ETL”. To address this gap, we introduce the “Publish” step, leading to the “EL+T+P” or “ELTP” model. In full, the ELTP model encompasses the following steps:

1. Extract data from the source system.
2. Load the raw data into a database or data lake that we control.
3. Transform the data by applying business logic and any needed cleanup.
4. Publish the transformed data to downstream systems and consumers.
This design preserves the benefits of ELT while also giving a framework for ensuring that data is
efficiently published to downstream consumers and applications.
The biggest difference between the initial Extract-Load and the final Publish is that in the latter case, the EL process extracts from our data platform and loads to an external system or publish location.
Note that in this two-stage EL architecture, we can add any number of custom transformations
between the initial EL step that loads the raw data and the second EL step that publishes
transformed data to downstream applications.
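As a minimal sketch of this pattern, the two-stage flow might be orchestrated like the example below. The three stage functions are hypothetical stubs standing in for whatever EL tool, SQL engine, or scripts you actually use:

```python
# Minimal ELTP orchestration sketch. Each stage function is a hypothetical stub.

def extract_load(source: str, destination: str) -> None:
    """Stage 1: an EL pair replicates raw data from the source into storage we control."""
    print(f"Replicating {source} -> {destination}")

def transform(models: list[str]) -> None:
    """Apply business logic to the raw data, e.g. dbt models or custom Python scripts."""
    print(f"Running transformations: {', '.join(models)}")

def publish(source: str, destination: str) -> None:
    """Stage 2: a second EL pair extracts from our platform and publishes externally."""
    print(f"Publishing {source} -> {destination}")

if __name__ == "__main__":
    extract_load(source="app_database", destination="warehouse.raw")
    transform(models=["clean_customers", "orders_enriched"])
    publish(source="warehouse.marts", destination="partner_s3_bucket")
```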
Why Publish?
There are several reasons for adding a Publish step to one’s data architecture, namely: sending
data files to external systems, decentralizing analytic queries, and publishing to downstream
applications and indexes.
Sending Data Files to External Systems

The oldest example of a “publish” operation is sending data files to partner companies, vendors, or regulatory agencies. For example, a marketing agency might request a regular upload of your customer data file, which they in turn process and return with added market segmentation data. These publish operations often have externally defined file formats and upload locations controlled by the third party.
Here is a simple example, where we publish one file to a partner via S3 and another file to a
vendor using SFTP:
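A minimal sketch of those two publish operations in Python - the bucket name, host, paths, and credentials below are illustrative placeholders:

```python
# Hypothetical publish step: upload one extract to a partner's S3 bucket
# and another to a vendor's SFTP server. All names and credentials are placeholders.
import boto3
import paramiko

# Publish one file to the partner via S3.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="exports/customer_file.csv",
    Bucket="partner-data-exchange",          # placeholder bucket name
    Key="inbound/customer_file.csv",
)

# Publish another file to the vendor via SFTP.
transport = paramiko.Transport(("sftp.vendor.example.com", 22))  # placeholder host
transport.connect(username="acme_upload", password="********")
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put("exports/vendor_feed.csv", "/uploads/vendor_feed.csv")
sftp.close()
transport.close()
```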
Decentralizing Analytic Queries

While I was at Amazon Video, I built and maintained systems that processed terabytes of data
daily. To achieve that kind of scale, my team developed ELT pipelines on AWS Redshift, carefully
right-sizing our clusters to efficiently execute all our transformations.
But when it came time to consume the data, we ran into problems. Our system couldn’t handle
heavy query traffic, and one runaway query could cause significant performance issues for every
other user. Our clusters optimized for heavy-duty processing tasks weren’t necessarily equipped
for ad-hoc analysis or concurrency.
The solution was to add a final “publish” operation, where we could deliver data to our users exactly how and where they needed it. We didn’t bar users from querying the main cluster, but downstream consumers were incentivized to use Tableau and S3 datasets, which offered better performance and let consumers right-size their own analytics infrastructure to their specific requirements. Meanwhile, the central database was better insulated from runaway queries that might otherwise lead to costly outages.
Publishing to Downstream Applications (Reverse ETL)

A popular class of Publish operations is the so-called “Reverse ETL” use case, where transformed data is published to one or more business applications or SaaS providers. Popular examples include publishing data to Salesforce for business operations, to HubSpot to keep customer CRM data up-to-date, and to Active Directory to keep employee and organizational data current.
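As a rough sketch of one such publish, using the simple_salesforce client - the credentials, record IDs, and the Customer_Tier__c field below are illustrative placeholders:

```python
# Hypothetical Reverse ETL publish: push enriched customer attributes from the
# warehouse back into Salesforce. Credentials, IDs, and fields are placeholders.
from simple_salesforce import Salesforce

sf = Salesforce(
    username="data-platform@example.com",
    password="********",
    security_token="********",
)

# Rows produced by an upstream Transform step (normally read from the warehouse).
enriched_contacts = [
    {"Id": "0035e00000AbCdEfG", "Customer_Tier__c": "Gold"},
    {"Id": "0035e00000HiJkLmN", "Customer_Tier__c": "Silver"},
]

for row in enriched_contacts:
    record_id = row.pop("Id")
    sf.Contact.update(record_id, row)  # write the remaining fields back to the CRM
```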
Publish to Vector Stores as AI Long-Term Memory (RAG)
In a retrieval-augmented generation (RAG) setup, a chatbot retrieves relevant text from a vector store before answering, giving the model a form of long-term memory. To run this process using ELTP, we follow the same basic steps as before:

1. Extract the text documents from the source systems.
2. Load the raw documents into a database or data lake that we control.
3. Transform the data by splitting large text into smaller chunks and calculating a numeric representation of each chunk (often called a “vector” or “embedding”).
4. Publish the text data and embeddings to our vector store indexes, where the chatbot can then make use of this data when answering questions it wouldn’t otherwise be able to answer.
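A minimal sketch of the Transform and Publish steps is shown below. The embed() function is a toy placeholder for a real embedding model, and Chroma stands in for whichever vector store you use:

```python
# Transform + Publish sketch for RAG. embed() is a toy placeholder; swap in a real
# embedding model in practice. Chroma is used here as an example vector store.
import chromadb

def chunk(text: str, size: int = 500) -> list[str]:
    """Transform: split a large document into smaller, fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunks: list[str]) -> list[list[float]]:
    """Placeholder embedding: returns toy 2-dimensional vectors."""
    return [[float(len(c)), float(sum(map(ord, c)) % 997)] for c in chunks]

# Raw text that an earlier Extract-Load step would have landed in our storage.
document = "Our return policy allows refunds within 30 days of purchase..."
chunks = chunk(document)
vectors = embed(chunks)

# Publish: upsert the chunks and their embeddings into a vector store index.
client = chromadb.Client()
collection = client.get_or_create_collection("support_docs")
collection.add(
    ids=[f"doc-1-chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=vectors,
)
```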
ELTP in Practice
The ELTP framing offers a practical and holistic model for what might otherwise be expressed as “ELT + Reverse ETL”, while addressing additional use cases that traditionally would not be in scope for Reverse ETL. Here we’ll discuss a few advanced use cases and real-world design and deployment considerations.
As a tutorial or a proof of concept (POC) data flow, there’s nothing wrong with this ETL-like flow:
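As a rough stand-in for a flow diagram, the single-pass version looks something like this (all helper functions are hypothetical stubs):

```python
# POC / ETL-like flow: Extract, Transform, and Load to the vector store in one pass.
# Nothing durable sits between the source system and the vector index.

def extract_docs() -> list[str]:
    """Stub: pull raw documents from the source system."""
    return ["raw document text ..."]

def chunk_and_embed(doc: str) -> list[tuple[str, list[float]]]:
    """Stub: split the text into chunks and compute an embedding for each."""
    return [("chunk text ...", [0.1, 0.2, 0.3])]

def write_to_vector_store(records: list[tuple[str, list[float]]]) -> None:
    """Stub: upsert chunks and vectors straight into the vector store index."""
    print(f"Upserting {len(records)} records")

for doc in extract_docs():
    write_to_vector_store(chunk_and_embed(doc))
```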
Skipping the database Load step is very convenient and keeps the example approachable. The author has less surface area to maintain and doesn’t need to introduce durable long-term storage into the architecture. This reduces the likelihood that someone following along will be blocked by not having access to the same or similar options for the long-term storage layer.
However, if and when our POC is successful, we will very soon want to migrate to the ELTP architecture, loading our raw data into long-term storage and then publishing to the vector store as a separate step. Here is the same solution restructured as ELTP:
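Again as a rough stand-in for a diagram, and under the same assumptions as the stubs above, the decoupled flow looks like this:

```python
# ELTP flow: the initial EL step lands raw documents in durable storage; transformation
# and publishing run later, reading from that storage rather than from the source.

def extract_load_raw() -> None:
    """Stub: an EL pair replicates raw documents into the warehouse or data lake."""
    print("Replicating raw documents into durable storage")

def transform_chunks_and_embeddings() -> None:
    """Stub: chunk text and compute embeddings from the stored raw documents."""
    print("Chunking and embedding stored documents")

def publish_to_vector_store() -> None:
    """Stub: a second EL pair publishes transformed chunks and vectors to the index."""
    print("Upserting chunks and vectors into the vector store")

extract_load_raw()
transform_chunks_and_embeddings()
publish_to_vector_store()
```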
Adding a database for long-term storage decouples the Extract/Load step from the Publish
operation, and gives us the ability to add any number of transformations in between the initial
EL operation and the Publish operation - for instance dbt models or custom Python scripts to
clean the data and to apply business logic.
Our speed of iteration is also greatly improved with this design. When we are ready to
experiment with different embedding models and tune the chunking parameters, we already
have the raw text extracted and ready to go. We can even run side-by-side experiments
comparing different transformation and publication options, without re-extracting the raw data
and without making any changes to the “production” data flow.
The ELTP model allows us to scale more effectively as well: from tens of documents to thousands
or even millions of documents. Each step in the process can run on hardware specialized for that
step, and all of the replication steps (EL and Publish) can operate incrementally: extracting only
new or updated documents when sending data downstream. As business logic evolves and as
bugs are identified, we can rerun or restate our transformations without needing to re-extract
the text from the upstream data source.
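As a sketch of what incremental extraction can look like, assuming the source exposes an updated_at timestamp - the state file, query, and helper below are illustrative placeholders:

```python
# Incremental extract sketch: only pull documents created or updated since the last
# successful run, tracked via a simple cursor file. Paths and queries are placeholders.
import json
from datetime import datetime, timezone
from pathlib import Path

STATE_FILE = Path("state/docs_cursor.json")

def load_cursor() -> str:
    """Return the last saved high-water mark, or a very old timestamp on the first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_updated_at"]
    return "1970-01-01T00:00:00Z"

def save_cursor(value: str) -> None:
    """Persist the new high-water mark after a successful run."""
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps({"last_updated_at": value}))

def fetch_updated_docs(since: str) -> list[dict]:
    """Placeholder: query the source for documents with updated_at > since."""
    return []  # e.g. SELECT * FROM documents WHERE updated_at > :since

cursor = load_cursor()
for doc in fetch_updated_docs(since=cursor):
    pass  # hand each new or updated document to the Load (or Publish) step

save_cursor(datetime.now(timezone.utc).isoformat())
```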
The upshot of all these benefits is that ELTP allows more opportunities for iteration, while also providing scaling and performance advantages. While perhaps counter-intuitive, adding the durable storage layer actually reduces our total cost of ownership (TCO) versus running the ETL or “POC” model, which performs Extract-Transform-Load all in one step.
Airbyte is an open source data movement tool which can be used for both “EL” and “Publish”
steps in the ELTP architecture. Airbyte currently supports several vector store destinations,
including Pinecone, Milvus, Chroma, and Weaviate. These can be used to directly load data using
an ETL or ELTP architecture - according to your own preferences and requirements. All of these
vector store destinations provide access to inline text splitting and vector embeddings using a
declarative, configuration-based approach.
Whereas traditional data pipelines prioritized numeric and tabular data, AI and LLM data flows
greatly benefit from text data in “unstructured” document formats like PDF, Word, Markdown,
and Google Docs.
This week, Airbyte is also launching source connectors to extract “text-as-data” from those
unstructured documents and others. Text documents are a powerful new class of data source for
data pipelines, and the new text document connectors enable the same ELTP, ELT, and ETL
architectures across all of your data, whether for AI or for analytics.
For more information, please see the companion blog post announcement.
With the ability to create your own open source connectors when needed, the same tooling and design techniques can be applied to almost any imaginable data pipeline.
Airbyte aims to provide data teams with an open-source standard in data movement. We’re
currently investing heavily in Publish connectors for AI and LLM use cases. In 2024, we will
expand to cover additional Reverse ETL and ELTP workflows and use cases, to simplify and
commoditize data movement in any direction.
In Closing
While the ELT model is a great improvement over ETL, it doesn’t solve for the last mile of data delivery. ELTP is a model that handles the heavy lifting while also providing a real solution for delivering data products.
Whether you are building the next great AI application or a more traditional data analytics stack,
you can set yourself up for success by understanding the ELTP model and designing your
architecture with these best practices in mind.
Interested in more data content? Join us for the upcoming ELTP webinar for AI or check out our
blog, where we publish articles in-house and from the community on data engineering,
databases, AI infrastructure and more.