What Is Lambda Architecture?
The Lambda Architecture is a deployment model for data processing that organizations use
to combine a traditional batch pipeline with a fast real-time stream pipeline for data access. It
is a common architecture model in IT and development organizations’ toolkits as businesses
strive to become more data-driven and event-driven in the face of massive volumes of
rapidly generated data, often referred to as “big data.”
The Lambda Architecture contains both a traditional batch data pipeline and a fast streaming
pipeline for real-time data, as well as a serving layer for responding to queries.
The main components of the Lambda Architecture are as follows:
Data Sources. Data can be obtained from a variety of sources and then fed into the Lambda Architecture for analysis. This component is often a streaming source such as Apache Kafka, which is not the original data source per se, but an intermediary store that can hold data in order to serve both the batch layer and the speed layer of the Lambda Architecture. The data is delivered simultaneously to both layers so that each can index it in parallel.
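As a sketch of this fan-out, the snippet below uses the kafka-python client to subscribe two independent consumers, one per layer, to the same topic. The broker address, topic name, and group ids are illustrative assumptions, not anything fixed by the architecture.

```python
# Minimal sketch of delivering one Kafka topic to both layers, assuming a
# broker at localhost:9092, a topic named "events", and the kafka-python
# client (pip install kafka-python). All names here are illustrative.
from kafka import KafkaConsumer

# Distinct consumer group ids mean Kafka delivers every record on the topic
# to BOTH consumers independently, so the layers can index in parallel.
batch_consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="batch-layer",
    auto_offset_reset="earliest",  # the batch layer replays the full history
)

speed_consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="speed-layer",
    auto_offset_reset="latest",    # the speed layer only needs new arrivals
)
```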
Batch Layer. This component saves all data coming into the system as batch views in preparation for indexing. The input data is saved in a model that resembles a series of changes/updates made to a system of record, similar to the output of a change data capture (CDC) system. Often this is simply a file in comma-separated values (CSV) format. The data is treated as immutable and append-only to ensure a trusted historical record of all incoming data. A technology like Apache Hadoop is often used to ingest the data as well as to store it cost-effectively.
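To illustrate the append-only, CDC-style record described above, here is a minimal sketch that appends change records to daily CSV files. The field names and one-file-per-day layout are assumptions for the example; a production batch layer would typically write to HDFS or object storage rather than local disk.

```python
# Minimal sketch of the immutable, append-only master data set. Field names
# and the one-file-per-day layout are illustrative assumptions.
import csv
from datetime import datetime, timezone

def append_event(table: str, key: str, operation: str, payload: str) -> None:
    """Record one change. Rows are only ever appended, never updated."""
    now = datetime.now(timezone.utc)
    path = f"master-{now:%Y-%m-%d}.csv"   # daily files simplify batch jobs
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([now.isoformat(), table, key, operation, payload])

append_event("accounts", "42", "update", '{"balance": 100}')
```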
Serving Layer. This layer incrementally indexes the latest batch views to make them queryable by end users. It can also reindex all data to fix a coding bug or to create different indexes for different use cases. The key requirement in the serving layer is that the processing is done in an extremely parallelized way to minimize the time to index the data set. While an indexing job is running, newly arriving data is queued up for indexing in the next indexing job.
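To make the indexing step concrete, the sketch below rebuilds a query view from the daily files of the previous snippet. The in-memory dict stands in for a real parallelized, distributed index, and the record format is the same illustrative assumption as above.

```python
# Minimal sketch of a batch indexing job over the append-only master files.
# A dict stands in for a real, parallelized serving-layer index.
import csv
import glob

def build_batch_view() -> dict:
    """Replay the full history in order; the last change per key wins."""
    view = {}
    for path in sorted(glob.glob("master-*.csv")):
        with open(path, newline="") as f:
            for ts, table, key, operation, payload in csv.reader(f):
                if operation == "delete":
                    view.pop((table, key), None)
                else:
                    view[(table, key)] = payload
    return view

batch_view = build_batch_view()  # rerun on a schedule; data arriving during
                                 # the run waits for the next cycle
```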
Speed Layer. This layer complements the serving layer by indexing the most recently added
data not yet fully indexed by the serving layer. This includes the data that the serving layer is
currently indexing as well as new data that arrived after the current indexing job started.
Since there is an expected lag between the time the latest data was added to the system and
the time the latest data is available for querying (due to the time it takes to perform the batch
indexing work), it is up to the speed layer to index the latest data to narrow this gap.
This layer typically leverages stream processing software to index the incoming data in near real-time, minimizing the delay before the data is available for querying. When the Lambda Architecture was first introduced, Apache Storm was a leading stream processing engine used in deployments, but other technologies have since gained more popularity as candidates for this component (such as Hazelcast Jet, Apache Flink, and Apache Spark Streaming).
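A minimal speed-layer sketch, reusing the kafka-python assumptions from the Data Sources snippet: instead of rebuilding a view from scratch, it updates an in-memory index record by record as data arrives. The event shape is an illustrative assumption.

```python
# Minimal sketch of the speed layer: incremental, per-record indexing.
import json
from kafka import KafkaConsumer

realtime_view = {}

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="speed-layer",
    auto_offset_reset="latest",   # only index what the batch job hasn't seen
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    event = record.value  # assumed shape: {"table", "key", "op", "payload"}
    k = (event["table"], event["key"])
    if event["op"] == "delete":
        realtime_view.pop(k, None)
    else:
        realtime_view[k] = event["payload"]
    # Once a batch cycle has indexed past this record, its entry can be
    # evicted from realtime_view to keep the speed layer small.
```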
Query. This component is responsible for submitting end-user queries to both the serving layer and the speed layer and consolidating the results. This gives end users a complete view of all data, including the most recently added, to provide a near real-time analytics system.
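Continuing the sketch, consolidation can be as simple as preferring the speed layer's newer entries and falling back to the batch view. Real systems must also merge partial aggregates across the two layers, which this illustration skips.

```python
# Minimal sketch of the query component, merging the two views built above.
def query(table: str, key: str):
    k = (table, key)
    if k in realtime_view:       # newest data, indexed in near real-time
        return realtime_view[k]
    return batch_view.get(k)     # everything the last batch run indexed
```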
Latency. Raw data is indexed in the serving layer so that end users can query and analyze all historical data. Since batch indexing takes time, there tends to be a relatively large window of recent data that is temporarily unavailable to end users for analysis. For example, if a batch indexing job runs hourly and takes 30 minutes to complete, a newly arrived record can wait up to 90 minutes before appearing in the serving layer. The speed layer uses stream processing technologies to immediately index recent data that is not yet queryable in the batch/serving layers, thus narrowing this window of unanalyzable data. This helps reduce the latency (i.e., the wait time before data is available for analysis) that is inherent in the batch/serving layers.
Data consistency. One key idea behind the Lambda Architecture is that it avoids the risk of data inconsistency that is often seen in distributed systems. In a distributed database where data might not be delivered to all replicas due to node or network failures, there is a chance of inconsistent data: one copy might reflect the up-to-date value while another copy still holds a previous value. In the Lambda Architecture, since the data is processed sequentially (and not in parallel with overlap, which may be the case for operations on a distributed database), the indexing process can ensure the data reflects the latest state in both the batch and speed layers.
Scalability. The Lambda Architecture does not prescribe exact technologies, but it is based on distributed, scale-out technologies that can be expanded by simply adding more nodes. This can be done at the data source, in the batch layer, in the serving layer, and in the speed layer, so you can use the Lambda Architecture no matter how much data you need to process.
Fault tolerance. As noted above, the Lambda Architecture is based on distributed systems that support fault tolerance, so should a hardware failure occur, other nodes are available to continue the workload. In addition, since all data is stored in the batch layer, any failure during indexing in either the serving layer or the speed layer can be overcome by simply rerunning the indexing job at the batch/serving layers and letting the speed layer continue indexing the most recent data.
Human fault tolerance. Since raw data is saved for indexing, it acts as a system of record
for your analyzable data, and all indexes can be recreated from this data set. This means that
if there are any bugs in the indexing code or any omissions, the code can be updated and then
rerun to reindex all data.
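As a sketch of this recovery path, reindexing is just a replay of the immutable history through corrected logic. The transform function below is a hypothetical stand-in for whatever indexing code contained the bug.

```python
# Minimal sketch of recovering from an indexing bug: fix the transform,
# then replay the unchanged raw history to rebuild the index from scratch.
def rebuild_index(records, transform):
    index = {}
    for record in records:
        key, value = transform(record)   # the corrected indexing logic
        index[key] = value
    return index

# e.g. rebuild_index(all_raw_records, fixed_transform): both names are
# placeholders for the stored history and the repaired code path.
```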
The Lambda Architecture has sometimes been criticized as being overly complex. Each
pipeline requires its own code base, and the code bases must be kept in sync to ensure
consistent, accurate results when queries touch both pipelines.
An alternative that responds to this criticism is the Kappa Architecture, which drops the batch pipeline and handles all processing, both real-time and historical, through a single streaming pipeline. This is a simplified approach in that it requires only one code base, but organizations with historical data in traditional batch systems must decide whether the transition to a streaming-only environment is worth the overhead of the initial platform change. Also, message buses are less cost-effective at storing extremely large time windows of data than data platforms designed for large data sets, which means you cannot always store your entire data history in a Kappa Architecture. However, technological innovations are breaking down this limitation so that much larger data sets can be stored in message buses as on-demand streams, enabling the Kappa Architecture to be more universally adopted. In addition, some stream processing technologies (like Hazelcast Jet) support batch processing paradigms as well, so you can use large-scale data repositories as a source alongside a streaming repository. This lets you process extremely large data sets in a cost-effective way while also gaining the simplicity of using only one processing engine.
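As a rough sketch of that single-engine idea (not any particular product's API), the same pipeline function below is applied to a historical file replay and to a live stream, so there is only one code base to maintain. The enrich function, file path, and sources are illustrative placeholders.

```python
# Minimal sketch of one code base serving both historical and live data.
# enrich(), the file path, and the sources are illustrative placeholders.
def enrich(record: str) -> str:
    return record.strip().upper()        # stand-in for real business logic

def process(records):
    """The single pipeline definition shared by both modes."""
    for record in records:
        yield enrich(record)

def replay_history(path):                # batch-style source: the event log
    with open(path) as f:
        yield from f

def tail_live_stream():                  # streaming source, e.g. a Kafka loop
    yield from ()                        # placeholder; yields live records

historical_results = process(replay_history("events.log"))   # lazy generators
live_results = process(tail_live_stream())
```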