Lambda Architecture

Yuvraj Kumar
Abstract - Data has evolved immensely in recent years, in type, volume and velocity. There are several frameworks to handle big data applications. The project focuses on the Lambda Architecture proposed by Marz and its application to obtain real-time data processing. The architecture is a solution that unites the benefits of the batch and stream processing techniques. Data can be historically processed with high precision and involved algorithms without loss of short-term information, alerts and insights. The Lambda Architecture is able to serve a wide range of use cases and workloads, and withstands hardware and human mistakes. The layered architecture enhances loose coupling and flexibility in the system. This is a huge benefit that allows understanding the trade-offs and the application of various tools and technologies across the layers. There has been an advancement in the approach to building the LA due to improvements in the underlying tools. The project demonstrates a simplified architecture for the LA that is maintainable.

Index terms – Lambda Architecture (LA), Batch Processing, Stream Processing, Real-time Data Processing

I. INTRODUCTION

With the tremendous rate of growth in the amount of data, there have been innovations in both the storage and the processing of big data. According to Doug Laney, big data can be thought of as an increase in the three V's, i.e. Volume, Variety and Velocity. Due to sources such as IoT sensors, Twitter feeds, application logs and database state changes, there has been an inflow of streams of data to store and process. These streams are a flow of continuous and unbounded data that demand near real-time processing. The field of data analytics is growing immensely to draw valuable insights from big chunks of raw data. In order to compute information in a data system, processing frameworks and processing engines are essential. The traditional relational database shows limitations when exposed to colossal chunks of unstructured data, and there is a need to decouple compute from storage. The processing frameworks can be categorized into three kinds: batch processing, stream processing and hybrid processing frameworks. Traditional batch processing of data gives good results but with a high latency. Hadoop is a scalable and fault-tolerant framework that includes MapReduce for computational processing [1][2]. MapReduce jobs are run in batches to give results which are accurate and highly available. The downside of MapReduce is its high latency, which does not make it a good choice for real-time data processing. In order to achieve results in real time with low latency, a good solution is to use Apache Kafka coupled with Apache Spark. This streaming model does wonders in high availability and low latency but might suffer in terms of accuracy. In most scenarios, use cases demand both fast results and deep processing of data [3]. This project is focused on the Lambda Architecture, which unifies batch and stream processing. Many tech companies such as Twitter, LinkedIn, Netflix and Amazon use this architecture to solve multiple business requirements. The LA aims to meet the needs of a robust system that is scalable and fault-tolerant against hardware failures and human mistakes. On the other hand, the LA creates an overhead of maintaining a lot of moving parts in the architecture and duplicate code frameworks [4]. The project is structured into different sections. The Lambda Architecture emerges as a solution that consists of three
different layers. The LA is an amalgamation of numerous tools and technologies, fits many use cases and has applications across various domains.

II. BACKGROUND

Traditional batch processing saw a shift in 2004, when Google came up with MapReduce for big data processing [1]. MapReduce is a scalable and efficient model that processes large amounts of data in batches. The idea behind the MapReduce framework is that the collected data is stored over a certain period before it is processed. The execution time of a batched MapReduce job depends on the computational power of the system and the overall size of the data being processed. That is why large-scale processing is performed on an hourly, daily or weekly basis in the industry, as per the use case. MapReduce is widely used for data analytics with its batch data processing approach but tends to fall short when immediate results are required. In recent times, there has been a need to process and analyze data at speed. There is a necessity to gain insights quickly after an event has happened, as the value diminishes with time. Online retail, social media, stock markets and intrusion detection systems rely heavily on instantaneous analysis within milliseconds or seconds. According to [5], real-time data processing combines data capturing, data processing and data exploration in a very fast and prompt manner. However, MapReduce was never built for this purpose, thereby leading to the innovation of stream processing systems. Unlike batch processing, the data fed to a stream processing system is unbounded and in motion. This can be time-series data generated from users' web activity, application logs, or IoT sensors, and must be pipelined into a stream processing engine such as Apache Spark [9] or Apache Storm [11]. These engines have the capability to compute analyses that can be displayed as real-time results on a dashboard. Stream processing is an excellent solution for real-time data processing, but the results can sometimes be approximate. As data arrives in the form of streams, the processing engine only knows about the current data and is unaware of the full dataset. Batch processing frameworks operate over the entire dataset in a parallel and exhaustive approach to ensure the correctness of the result; stream processing fails to achieve the same accuracy as batch processing systems. Batch and stream processing were considered diametrical paradigms of big data architecture until 2013, when Nathan Marz proposed the Lambda Architecture (LA). [3] describes how the LA addresses the need to unify the benefits of the batch and stream processing models. According to Marz, the following are the essential requirements of the LA:
• Fault-tolerant and robust enough to withstand code failures and human errors
• Horizontally scalable layers
• Low latency in order to achieve real-time results
• Easy to maintain and debug

III. LAMBDA ARCHITECTURE

The proposed architecture is demonstrated in Fig. 1. The incoming data is duplicated and fed to both the batch and the speed layer for computation.

Fig. 1. Lambda Architecture [2]

[3] discusses in detail the three layers in the LA. Each layer in the LA satisfies a subset of the properties necessary for large-scale distributed big data architectures. A highly reliable, fault-tolerant and low-latency architecture is developed using multiple big data frameworks and technologies that scale out in conjunction across the layers.
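The contrast between the two paradigms can be sketched in a few lines of plain Python (a toy illustration with made-up clickstream data, not Hadoop or Spark code): the batch job scans the entire dataset exhaustively, while the streaming counter sees one event at a time and keeps only running state.

```python
from collections import Counter

# Toy clickstream: (user, page) events. In a real system these would be
# unbounded; here a small list stands in for the master dataset.
events = [("u1", "home"), ("u2", "home"), ("u1", "cart"),
          ("u2", "cart"), ("u1", "home"), ("u3", "home")]

def batch_page_counts(dataset):
    """Batch style: process the *entire* dataset in one exhaustive pass."""
    counts = Counter()
    for _, page in dataset:
        counts[page] += 1
    return counts

class StreamCounter:
    """Stream style: sees one event at a time, keeps only running state."""
    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        _, page = event
        self.counts[page] += 1   # incremental update; no access to history

batch_view = batch_page_counts(events)

stream = StreamCounter()
for e in events:
    stream.on_event(e)

# For a simple counter both agree, but the streaming job can answer after
# every single event, while the batch job must rerun from end to end.
print(batch_view["home"])     # 4: exact count over the whole dataset
print(stream.counts["home"])  # 4: same value, available with low latency
```

For a plain count the two converge; the accuracy gap described above appears with stateful computations (joins, sessionization, late-arriving data), where a one-pass streaming job cannot revisit history the way a batch rerun can.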
A) BATCH LAYER

The crux of the LA is the master dataset, which constantly receives new data in an append-only fashion. This approach is highly desirable to maintain the immutability of the data. In the book [3], Marz stresses the importance of immutable datasets. The overall purpose is to prepare for human or system errors and to allow reprocessing. While values are overridden in a mutable data model, the immutability principle prevents loss of data. Secondly, the immutable data model supports simplification due to the absence of indexing of data. The master dataset in the batch layer is ever growing and is the most detailed source of data in the architecture. It permits random reads on the historical data. The batch layer prefers re-computation algorithms over incremental algorithms; the problem with incremental algorithms is their failure to address the challenges posed by human mistakes. The re-computational nature of the batch layer creates simple batch views, as the complexity is addressed during precomputation. Additionally, the responsibility of the batch layer is to historically process the data with high accuracy. Machine learning algorithms take time to train a model and give better results over time; such naturally exhaustive and time-consuming tasks are processed inside the batch layer. In the Hadoop framework, the master dataset is persisted in the Hadoop Distributed File System (HDFS) [6]. HDFS is distributed and fault-tolerant and follows an append-only approach that fulfills the needs of the batch layer of the LA. Batch processing is performed with MapReduce jobs that run at constant intervals and calculate batch views over the entire data spread out in HDFS.
The problem with the batch layer is high latency. The batch jobs must be run over the entire master dataset and are time consuming. For example, there might be some MapReduce jobs that are run every two hours. These jobs can process data that is relatively old, as they cannot keep up with the inflow of stream data. This is a serious limitation for real-time data processing. To overcome this limitation, the speed layer is very significant.

B) SPEED LAYER

[3] and [5] state that real-time data processing is realized because of the presence of the speed layer. The data streams are processed in real time without consideration of completeness or fix-ups. The speed layer achieves up-to-date query results and compensates for the high latency of the batch layer; its purpose is to fill in the gap caused by the time-consuming batch layer. In order to create real-time views of the most recent data, this layer sacrifices throughput and decreases latency substantially. The real-time views are generated immediately after the data is received but are not as complete or precise as those of the batch layer. The idea behind this design is that the accurate results of the batch layer override the real-time views once they arrive. The separation of roles across the different layers accounts for the beauty of the LA. As mentioned earlier, the batch layer participates in a resource-intensive operation by running over the entire master dataset; therefore, the speed layer must incorporate a different approach to meet the low-latency requirements. In contrast to the re-computation approach of the batch layer, the speed layer adopts incremental computation. Incremental computation is more complex, but the data handled in the speed layer is vastly smaller and the views are transient. A random-read/random-write methodology is used to re-use and update the previous views. The incremental computation strategy is demonstrated in Fig. 2.

Fig. 2. Incremental Computation Strategy [3]
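The division of labor described above — an immutable, append-only master dataset recomputed from scratch by the batch layer, and a small, transient real-time view updated incrementally and overridden once batch results arrive — can be sketched as a toy Python model. All names, the in-memory data structures, and the policy of clearing the real-time view after a batch run are illustrative assumptions, not the paper's implementation:

```python
# Toy model of the batch/speed split; dicts stand in for HDFS and the
# real-time store.

master_dataset = []   # immutable, append-only event log (batch layer input)
batch_view = {}       # produced by full recomputation
realtime_view = {}    # produced by incremental updates (speed layer)

def ingest(event):
    """New data is duplicated to both layers (cf. Fig. 1)."""
    master_dataset.append(event)              # batch layer: append-only
    page = event["page"]                      # speed layer: incremental
    realtime_view[page] = realtime_view.get(page, 0) + 1

def run_batch_job():
    """Recompute the batch view from scratch over the whole master
    dataset, then discard the now-absorbed real-time state."""
    batch_view.clear()
    for event in master_dataset:
        page = event["page"]
        batch_view[page] = batch_view.get(page, 0) + 1
    realtime_view.clear()   # accurate batch results override the speed layer

def query(page):
    """Merge of batch and real-time views (the serving layer's job)."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

ingest({"page": "home"})
ingest({"page": "home"})
print(query("home"))    # 2, served from the real-time view alone

run_batch_job()         # slow and exhaustive, but authoritative
ingest({"page": "home"})
print(query("home"))    # 3 = 2 (batch view) + 1 (real-time view)
```

Note how a bug in the incremental update would poison only the short-lived real-time view: the next recomputation from the untouched master dataset repairs the result, which is exactly the human-fault tolerance argument made above.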
C) SERVING LAYER

The serving layer is responsible for storing the outputs from the batch and the speed layer. An arrangement of flat records with pre-computed views is obtained as the result of the batch layer [3]. These pre-computed batch views are indexed in this layer for faster retrieval. This layer provides random reads and batch updates of the static batch views. According to [3], whenever the LA is queried, the serving layer merges the batch and real-time views and outputs a result. The merged views can be displayed on a dashboard or used to create reports. Therefore, the LA combines the results from a data-intensive yet accurate batch layer and a prompt speed layer, as per the required use case.
The LA is a general-purpose architecture that allows a choice between multiple technologies. Each layer has a unique responsibility in the LA and requires a specific technology. Here is a list of a few technologies used across this architecture:

A) APACHE KAFKA

According to [7], Apache Kafka is a distributed publish-subscribe messaging queue used to build real-time streaming data pipelines. A topic is used to store the stream of records inside Kafka. A publisher pushes messages into the topics and a consumer subscribes to a topic. Because Kafka is a multi-consumer queue, the messages can be rewound and replayed in case of a point of failure. There is a configurable retention period to persist all published records irrespective of their consumption. The data generated from user website activity, application logs and IoT sensors can be ingested into Apache Kafka. [10] shows how it is the responsibility of Apache Kafka to duplicate the data and send one copy each to the batch and the speed layer.

B) APACHE HADOOP

[2] defines Apache Hadoop as a distributed software platform for managing big data across clusters. The idea behind Hadoop was, instead of bringing data towards compute, to bring compute to the data. The Hadoop framework can be categorized into storage and compute models, known as the Hadoop Distributed File System and MapReduce respectively.

C) HDFS

HDFS is a scalable, reliable and fault-tolerant file system that stores huge quantities of data. HDFS is the most used technology for storage in the batch layer. The immutable, append-only master dataset is dumped inside a resilient HDFS [5].

D) MAPREDUCE

According to [1] and [2], MapReduce is a programming paradigm that manipulates key-value pairs to write computations as map and reduce functions. In the map phase, the individual chunks of data generated through splits of the input data are processed in parallel. The output from the map phase is then sorted and forwarded to the reduce stage of the framework. The MapReduce framework runs over the master dataset and performs the precomputation required in the batch layer. Hadoop also includes Hive and Pig, which are high-level abstractions that translate to MapReduce jobs.

E) APACHE SPARK

As discussed in [9] and [10], the process of reading and writing to disk in MapReduce is slow. Apache Spark is an alternative to MapReduce and runs up to 100 times faster due to in-memory processing. Spark works on the entire data as opposed to MapReduce, which runs in stages. The Resilient Distributed Dataset (RDD) is the primary data structure of Spark: a sharable object across jobs
representing the state of memory. Spark is polyglot and can run stand-alone or on Apache Mesos, Kubernetes, or in the cloud. Spark supports multiple features such as batch processing, stream processing, graph processing, interactive analysis and machine learning.

F) SPARK STREAMING

Spark Streaming is an extension of the core Spark API that enables processing of live data streams. The input data streams are divided into micro-batches by Spark Streaming, and the final streams of results in batches are processed by the Spark engine. Fig. 3 illustrates the same.

Fig. 3. Spark Streaming [10]

The live stream data can be ingested from a source such as Apache Kafka and analyzed using the various functions provided by Spark. The processed data can be dumped out to databases, file systems and live dashboards, or be used to apply machine learning. A stream of continuous data is represented with DStreams (discretized streams), which are basically sequences of RDDs. The speed layer can be implemented using Spark Streaming for low-latency requirements.

G) APACHE STORM

Apache Storm is the basis of the speed layer in the LA suggested by Nathan Marz [3]. Instead of micro-batching the streams, Storm relies on a one-at-a-time stream processing approach. Due to this, the results have lower latency compared to Spark Streaming. [11] Storm depends on a concept called topologies, which is equivalent to MapReduce jobs in Hadoop. A topology is a network of spouts and bolts: a spout is considered the source of a stream, whereas a bolt performs some action on the stream.

H) APACHE HBASE

[13] Apache HBase is a non-relational distributed database that is a great option for the serving layer. It is a core component of the Hadoop framework that handles large-scale data. Due to extremely good read and write performance, the batch and real-time views can be stored and read in real time. It fulfills the responsibility of exposing the views created by both layers in order to service incoming queries.

I) APACHE CASSANDRA

Apache Cassandra is heavily used in the LA to store the real-time views of the speed layer. It is the technology preferred by [3] to store the output from stream processing technologies. Along with that, it supports the high read performance needed to output reports. Cassandra is a distributed database with the ability to perform, scale and provide high availability with no single point of failure. [12] The architecture includes a masterless ring that is simple and easy to set up and maintain. Instead of maintaining a separate database in the serving layer, like HBase or ElephantDB, Cassandra can be used to fulfill the same purpose, which reduces a lot of overhead and complexity.

V. APPLICATIONS OF LAMBDA ARCHITECTURE

The Lambda Architecture can be considered an almost real-time data processing architecture. It can withstand faults as well as allow scalability. It uses the functions of the batch layer and the stream layer, and keeps adding new data to the main storage while ensuring that the existing data remains intact. In order to meet quality-of-service standards, companies like Twitter, Netflix and LinkedIn are using this architecture. Online advertisers use data enrichment to combine historical customer data with live customer behavior data and deliver more personalized and targeted ads in real time, in context with what customers are doing. The LA is also applied to detect and recognize unusual
behaviors that could indicate a potentially serious problem within the system. Financial institutions rely on trigger events to detect fraudulent transactions and stop fraud in its tracks. Hospitals also use triggers to detect potentially dangerous health changes while monitoring patient vital signs, sending automatic alerts to the right caregivers, who can then take immediate and appropriate action. Twitter APIs can be called to process large feeds through the Lambda Architecture pipeline and perform sentiment analysis. Additionally, complex networks can be protected with intrusion detection by tracking the failure of a node and avoiding future issues.
As discussed in [14], MapR uses the Lambda Architecture for online alerting in order to minimize idle transports. The architecture used by MapR is demonstrated in Fig. 4.

Fig. 5. Lambda Architecture at Talentica [15]

The LA turned out to be a cost-effective and scalable solution to meet the demanding requirements of the mobile ad campaign. [16] provides detailed information about how an LA was built on AWS, by a software engineer who works at SmartNews, to analyze customer behavior and recommend content.

Fig. 9. DataSet

Kafka is a distributed streaming platform with the following capabilities:
• Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
• Store streams of records in a fault-tolerant, durable way.
• Process streams of records as they occur.
Kafka is used to build ingestion pipelines for scalable data processing systems and fits our real-time data processing use case very well. A brief introduction to the architecture and terminology used in Kafka follows.

Fig. 11. Anatomy of a Topic [7]

The Kafka cluster stores streams of records in categories called topics. Each record consists of a key, a value, and a timestamp [12]. A topic is a category or feed name to which records are published. A topic is broken down into partitions that can be scaled horizontally. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it. In our architecture, the Kafka producer produces continuous streams of data in the form of clickstream events and pushes them to the Kafka broker. The streams are then forwarded to HDFS and Spark's stream processing engine.
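The topic, partition, and offset mechanics described above can be mimicked with a small in-memory model. This is a pure-Python toy of the concepts, not the real Kafka client API; the `Topic` and `Consumer` names and the hash-based partitioner are illustrative assumptions.

```python
# Toy model of Kafka's topic/partition/offset concepts.

class Topic:
    def __init__(self, name, num_partitions=2):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Records with the same key land in the same partition."""
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))   # append-only log
        return p

class Consumer:
    """Each consumer tracks its own offsets, so many consumers can read
    the same topic independently, and any of them can rewind and replay."""
    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        records = []
        for p, log in enumerate(self.topic.partitions):
            records.extend(log[self.offsets[p]:])  # read from saved offset
            self.offsets[p] = len(log)             # advance this consumer only
        return records

    def rewind(self):
        self.offsets = [0] * len(self.topic.partitions)

clicks = Topic("clickstream-events")
clicks.produce(key="u1", value="home")
clicks.produce(key="u1", value="cart")
clicks.produce(key="u2", value="home")

batch_consumer = Consumer(clicks)   # e.g. the pipeline feeding HDFS
speed_consumer = Consumer(clicks)   # e.g. the pipeline feeding Spark Streaming

print(len(batch_consumer.poll()))   # 3 -- both consumers see every record
print(len(speed_consumer.poll()))   # 3
speed_consumer.rewind()             # replay after a failure
print(len(speed_consumer.poll()))   # 3 again
```

The two independent consumers correspond to the duplication of the stream to the batch and speed layers, and the rewind illustrates why a retention log lets a crashed speed layer recover by replaying from its last committed offset.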
MAPREDUCE VS SPARK
REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. 6th Symp. Oper. Syst. Des. Implement., pp. 137–149, 2004.
[2] T. Chen, D. Dai, Y. Huang, and X. Zhou, "MapReduce On Stream Processing," pp. 337–341.
[3] N. Marz and J. Warren, Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications Co., 2015.
[5] I. Ganelin, E. Orhian, K. Sasaki, and B. York, Spark: Big Data Cluster Computing in Production. 2016.
[6] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," in Proc. 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10), IEEE Computer Society, Washington, DC, USA, 2010, pp. 1–10. DOI: http://dx.doi.org/10.1109/MSST.2010.5496972
[7] J. Kreps, N. Narkhede, and J. Rao, "Kafka: A Distributed Messaging System for Log Processing," SIGMOD Workshop on Networking Meets Databases, 2011.
[8] W. Yang et al., "Big Data Real-Time Processing Based on Storm," 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 2013, pp. 1784–1787.
[9] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica, "Apache Spark: A Unified Engine for Big Data Processing," Commun. ACM, vol. 59, no. 11, pp. 56–65, Oct. 2016. DOI: https://doi.org/10.1145/2934664
[12] A. Lakshman and P. Malik, "Cassandra: A Decentralized Structured Storage System," SIGOPS Oper. Syst. Rev., vol. 44, no. 2, pp. 35–40, Apr. 2010. DOI: http://dx.doi.org/10.1145/1773912.1773922
[15] https://www.talentica.com/assets/white-papers/big-data-using-lambda-architecture
[16] https://aws.amazon.com/blogs/big-data/how-smartnews-built-a-lambda-architecture-on-aws-to-analyze-customer-behavior-and-recommend-content/
[17] S. Kasireddy, "How we built a data pipeline with Lambda Architecture using Spark/Spark Streaming," https://medium.com/walmartlabs/how-we-built-a-data-pipeline-with-lambda-architecture-using-spark-spark-streaming-9d3b4b4555d3
AUTHORS:
*First Author – Dr. Yuvraj Kumar, Ph.D. in Artificial Intelligence, Faculty of Quantum Computing and
Artificial Intelligence, Zukovsky State University, Russia