


Lambda Architecture – Realtime Data Processing
*Dr. Yuvraj Kumar – Ph.D. in Artificial Intelligence
*The faculty of Quantum Computing and Artificial Intelligence
*Zukovsky State University, Russia

Abstract - Data has evolved immensely in recent years in type, volume and velocity. There are several frameworks to handle big data applications. The project focuses on the Lambda Architecture proposed by Marz and its application to obtain real-time data processing. The architecture is a solution that unites the benefits of batch and stream processing techniques. Data can be historically processed with high precision and involved algorithms without loss of short-term information, alerts and insights. The Lambda Architecture can serve a wide range of use cases and workloads and withstands hardware and human mistakes. The layered architecture enhances loose coupling and flexibility in the system. This is a huge benefit that allows understanding the trade-offs and the application of various tools and technologies across the layers. There has been an advancement in the approach of building the LA due to improvements in the underlying tools. The project demonstrates a simplified architecture for the LA that is maintainable.

Index terms – Lambda Architecture (LA), Batch Processing, Stream Processing, Real-time Data Processing

I. INTRODUCTION

With the tremendous rate of growth in the amounts of data, there have been innovations both in the storage and the processing of big data. According to Doug Laney, big data can be thought of as an increase in the 3 V's, i.e. Volume, Variety and Velocity. Due to sources such as IoT sensors, Twitter feeds, application logs and database state changes, there is an inflow of streams of data to store and process. These streams are a flow of continuous and unbounded data that demand near real-time processing. The field of data analytics is growing immensely to draw valuable insights from big chunks of raw data. In order to compute information in a data system, processing frameworks and processing engines are essential. The traditional relational database shows limitations when exposed to colossal chunks of unstructured data, and there is a need to decouple compute from storage. Processing frameworks can be categorized into three groups: batch processing, stream processing and hybrid processing frameworks. Traditional batch processing of data gives good results but with a high latency. Hadoop is a scalable and fault-tolerant framework that includes MapReduce for computational processing [1][2]. MapReduce jobs are run in batches to give results which are accurate and highly available. The downside of MapReduce is its high latency, which does not make it a good choice for real-time data processing. In order to achieve results in real time with low latency, a good solution is to use Apache Kafka coupled with Apache Spark. This streaming model does wonders for high availability and low latency but might suffer in terms of accuracy. In most scenarios, use cases demand both fast results and deep processing of data [3]. This project is focused on the Lambda Architecture, which unifies batch and stream processing. Many tech companies such as Twitter, LinkedIn, Netflix and Amazon use this architecture to solve multiple business requirements. The LA aims to meet the needs of a robust system that is scalable and fault-tolerant against hardware failures and human mistakes. On the other hand, the LA creates an overhead of maintaining a lot of moving parts in the architecture and duplicate code frameworks [4]. The project is structured into different sections. The Lambda Architecture emerges as a solution that consists of three different layers. The LA is an amalgamation of numerous tools and technologies, fits many use cases and has applications across various domains.
II. BACKGROUND

Traditional batch processing saw a shift in 2004, when Google introduced the MapReduce model, later implemented in Apache Hadoop, for big data processing [1]. MapReduce is a scalable and efficient model that processes large amounts of data in batches. The idea behind the MapReduce framework is that the collected data is stored over a certain period before it is processed. The execution time of a batched MapReduce job depends on the computational power of the system and the overall size of the data being processed. That is why large-scale processing is performed on an hourly, daily or weekly basis in the industry, as per the use case. MapReduce is widely used for data analytics with its batch data processing approach but tends to fall short when immediate results are required. In recent times, there has been a need to process and analyze data at speed. There is a necessity to gain insights quickly after an event has happened, as the value diminishes with time. Online retail, social media, stock markets and intrusion detection systems rely heavily on instantaneous analysis within milliseconds or seconds. According to [5], real-time data processing combines data capturing, data processing and data exploration in a very fast and prompt manner. However, MapReduce was never built for this purpose, thereby leading to the innovation of stream processing systems. Unlike batch processing, the data fed to a stream processing system is unbounded and in motion. This can be time-series data or data generated from users' web activity, application logs or IoT sensors, and it must be pipelined into a stream processing engine such as Apache Spark [9] or Apache Storm [11]. These engines can compute analyses that are displayed as real-time results on a dashboard. Stream processing is an excellent solution for real-time data processing, but the results can sometimes be approximate. As data arrives in the form of streams, the processing engine only knows about the current data and is unaware of the full dataset. Batch processing frameworks operate over the entire dataset in a parallel and exhaustive approach to ensure the correctness of the result. Stream processing fails to achieve the same accuracy as batch processing systems. Batch and stream processing were considered diametrical paradigms of big data architecture until 2013, when Nathan Marz proposed the Lambda Architecture (LA). [3] describes how the LA addresses the need to unify the benefits of the batch and stream processing models. According to Marz, the following are the essential requirements of the LA:
• Fault-tolerant and robust enough to withstand code failures and human errors
• Horizontally scalable layers
• Low latency in order to achieve real-time results
• Easy to maintain and debug

III. LAMBDA ARCHITECTURE

The proposed architecture is demonstrated in Fig. 1. The incoming data is duplicated and fed to the batch and speed layers for computation.

Fig. 1. Lambda Architecture [2]

[3] discusses the three layers of the LA in detail. Each layer in the LA satisfies a subset of the properties necessary for large-scale distributed big data architectures. A highly reliable, fault-tolerant and low-latency architecture is developed using multiple big data frameworks and technologies that scale out in conjunction across the layers.
A) BATCH LAYER

The crux of the LA is the master dataset. The master dataset constantly receives new data in an append-only fashion. This approach is highly desirable to maintain the immutability of the data. In the book [3], Marz stresses the importance of immutable datasets. The overall purpose is to prepare for human or system errors and allow reprocessing. Whereas values are overridden in a mutable data model, the immutability principle prevents loss of data. Secondly, the immutable data model supports simplification due to the absence of indexing of data. The master dataset in the batch layer is ever growing and is the most detailed source of data in the architecture. The master dataset permits random reads on the historical data. The batch layer prefers re-computation algorithms over incremental algorithms. The problem with incremental algorithms is their failure to address the challenges posed by human mistakes. The re-computational nature of the batch layer creates simple batch views, as the complexity is addressed during precomputation. Additionally, the responsibility of the batch layer is to historically process the data with high accuracy. Machine learning algorithms take time to train the model and give better results over time. Such naturally exhaustive and time-consuming tasks are processed inside the batch layer. In the Hadoop framework, the master dataset is persisted in the Hadoop Distributed File System (HDFS) [6]. HDFS is distributed and fault-tolerant and follows an append-only approach that fulfills the needs of the batch layer of the LA. Batch processing is performed with MapReduce jobs that run at constant intervals and calculate batch views over the entire data spread out in HDFS.

The problem with the batch layer is high latency. The batch jobs must be run over the entire master dataset and are time consuming. For example, there might be MapReduce jobs that are run every two hours. These jobs can process data that is relatively old, as they cannot keep up with the inflow of stream data. This is a serious limitation for real-time data processing. To overcome this limitation, the speed layer is very significant.

B) SPEED LAYER

[3] and [5] state that real-time data processing is realized because of the presence of the speed layer. The data streams are processed in real time without consideration of completeness or fix-ups. The speed layer achieves up-to-date query results and compensates for the high latency of the batch layer. The purpose of this layer is to fill in the gap caused by the time-consuming batch layer. In order to create real-time views of the most recent data, this layer sacrifices throughput and decreases latency substantially. The real-time views are generated immediately after the data is received but are not as complete or precise as the batch views. The idea behind this design is that the accurate results of the batch layer override the real-time views once they arrive. The separation of roles in the different layers accounts for the beauty of the LA. As mentioned earlier, the batch layer performs a resource-intensive operation by running over the entire master dataset. Therefore, the speed layer must incorporate a different approach to meet the low-latency requirements. In contrast to the re-computation approach of the batch layer, the speed layer adopts incremental computation. The incremental computation is more complex, but the data handled in the speed layer is vastly smaller and the views are transient. A random-read/random-write methodology is used to re-use and update the previous views. The incremental computation strategy is demonstrated in Fig. 2.

Fig. 2. Incremental Computation Strategy [3]
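To make the contrast concrete, the following is a minimal sketch in Scala of the two strategies, using a hypothetical clickstream event type that is not part of the original text: the batch layer recomputes a view over the whole master dataset, while the speed layer folds each new event into an existing real-time view.

```scala
// Hypothetical clickstream event; both views map a page to its view count.
case class Event(page: String)

// Batch-layer style: recompute the whole view from the entire master dataset.
def recompute(masterDataset: Seq[Event]): Map[String, Long] =
  masterDataset.groupBy(_.page).map { case (page, events) => page -> events.size.toLong }

// Speed-layer style: fold one new event into the existing real-time view,
// touching only the affected key instead of reprocessing history.
def incrementalUpdate(view: Map[String, Long], event: Event): Map[String, Long] =
  view.updated(event.page, view.getOrElse(event.page, 0L) + 1L)
```

The recomputation is simple but expensive and therefore scheduled periodically, whereas the incremental update is cheap enough to apply per event or per micro-batch.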
C) SERVING LAYER

The serving layer is responsible for storing the outputs from the batch and the speed layers [3]. An arrangement of flat records with pre-computed views is obtained as the result from the batch layer. These pre-computed batch views are indexed in this layer for faster retrieval. This layer provides random reads and batch updates due to the static batch views. According to [3], whenever the LA is queried, the serving layer merges the batch and real-time views and outputs a result. The merged views can be displayed on a dashboard or used to create reports. Therefore, the LA combines the results from a data-intensive yet accurate batch layer and a prompt speed layer, as per the required use case.
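As a rough illustration of this query-time merge, the sketch below (Scala, reusing the hypothetical count views from the previous sketch) lets the real-time view fill in whatever the latest batch view has not yet absorbed; the exact merge logic is use-case dependent and is an assumption here.

```scala
// Query-time merge for an additive metric: the batch view covers everything up to
// the last batch run, the real-time view covers whatever arrived after it.
def serveQuery(batchView: Map[String, Long],
               realtimeView: Map[String, Long],
               page: String): Long =
  batchView.getOrElse(page, 0L) + realtimeView.getOrElse(page, 0L)
```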

IV. TOOLS AND TECHNOLOGIES

The LA is a general-purpose architecture that allows a choice between multiple technologies. Each layer has a unique responsibility in the LA and requires a specific technology. Here is a list of a few technologies used across this architecture:

A) APACHE KAFKA

According to [7], Apache Kafka is a distributed pub-sub/messaging queue used to build real-time streaming data pipelines. A topic is used to store the stream of records inside Kafka. A publisher pushes messages into the topics and a consumer subscribes to a topic. Because Kafka is a multi-consumer queue, the messages can be rewound and replayed in case of a failure. There is a configurable retention period to persist all published records irrespective of their consumption. The data generated from user website activity, application logs and IoT sensors can be ingested into Apache Kafka. [10] shows how it is the responsibility of Apache Kafka to duplicate the data and send one copy each to the batch and the speed layer.

B) APACHE HADOOP

[2] defines Apache Hadoop as a distributed software platform for managing big data across clusters. The idea behind Hadoop was, instead of bringing data towards compute, to bring compute to the data. The Hadoop framework can be categorized into storage and compute models, known as the Hadoop Distributed File System and MapReduce respectively.

C) HDFS

HDFS is a scalable, reliable and fault-tolerant file system that stores huge quantities of data. HDFS is the most used technology for storage in the batch layer [5]. The immutable, append-only master dataset is dumped inside a resilient HDFS.

D) MAPREDUCE

According to [1] and [2], MapReduce is a programming paradigm that manipulates key-value pairs to express computations as map and reduce functions. In the map phase, the individual chunks of data generated through splits of the input data are processed in parallel. The output from the map phase is then sorted and forwarded to the reduce stage of the framework. The MapReduce framework runs over the master dataset and performs the precomputation required in the batch layer. Hadoop also includes Hive and Pig, which are high-level abstractions that later translate to MapReduce jobs.
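As a rough sketch of the two phases (plain Scala used to mimic the model, not Hadoop API code), a word count can be expressed as a map function emitting (word, 1) pairs and a reduce function summing the values grouped by key:

```scala
// Conceptual MapReduce word count, mimicked with Scala collections.
def mapPhase(line: String): Seq[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)

val lines = Seq("the quick brown fox", "the lazy dog")
val wordCounts = lines
  .flatMap(mapPhase)                                       // map: emit (word, 1) per word
  .groupBy(_._1)                                           // shuffle/sort: group by key
  .map { case (w, kvs) => reducePhase(w, kvs.map(_._2)) }  // reduce: sum per key
```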
E) APACHE SPARK

As discussed in [9] and [10], the process of reading from and writing to disk in MapReduce is slow. Apache Spark is an alternative to MapReduce and runs up to 100 times faster due to in-memory processing. Spark works on the entire data as opposed to MapReduce, which runs in stages. The Resilient Distributed Dataset (RDD) is the primary data structure of Spark; it is an object shareable across jobs that represents the state of memory. Spark is polyglot and can run stand-alone or on Apache Mesos, Kubernetes, or in the cloud. Spark supports multiple features such as batch processing, stream processing, graph processing, interactive analysis and machine learning.

F) SPARK STREAMING

Spark Streaming is an extension of the core Spark API that enables the processing of live data streams. The input data streams are divided into micro-batches by Spark Streaming, and the final streams of results are produced in batches by the Spark engine. Fig. 3 illustrates this.

Fig. 3. Spark Streaming [10]

The live stream data can be ingested from a source such as Apache Kafka and can be analyzed using the various functions provided by Spark. The processed data can be pushed out to databases, file systems and live dashboards, or be used to apply machine learning. A stream of continuous data is represented by a DStream (discretized stream), which is basically a sequence of RDDs. The speed layer can be implemented using Spark Streaming for low-latency requirements.

G) APACHE STORM

Apache Storm is the basis of the speed layer in the LA suggested by Nathan Marz [3]. Instead of micro-batching the streams, Storm relies on a one-at-a-time stream processing approach. Due to this, the results have lower latency compared to Spark Streaming [11]. Storm depends on a concept called topologies, which is equivalent to MapReduce jobs in Hadoop. A topology is a network of spouts and bolts. A spout is considered the source of a stream, whereas a bolt performs some action on the stream.

H) APACHE HBASE

Apache HBase is a non-relational distributed database that is a great option for the serving layer [13]. It is a core component of the Hadoop framework that handles large-scale data. Due to its extremely good read and write performance, the batch and real-time views can be stored and read in real time. It fulfills the responsibility of exposing the views created by both layers in order to service incoming queries.

I) APACHE CASSANDRA

Apache Cassandra is heavily used in the LA to store the real-time views of the speed layer. It is the technology proposed by [3] to store the output from stream processing technologies. Along with that, it supports high read performance for generating reports. Cassandra is a distributed database that has the ability to perform, scale and provide high availability with no single point of failure [12]. The architecture includes a masterless ring that is simple and easy to set up and maintain. Instead of maintaining a separate database in the serving layer, such as HBase or ElephantDB, Cassandra can be used to fulfill the same purpose. This reduces a lot of overhead and complexity.
V. APPLICATIONS OF LAMBDA ARCHITECTURE

The Lambda Architecture can be considered a near real-time data processing architecture. It can withstand faults as well as allow scalability. It uses the functions of the batch layer and the stream layer and keeps adding new data to the main storage while ensuring that the existing data remains intact. In order to meet quality-of-service standards, companies like Twitter, Netflix and LinkedIn use this architecture. Online advertisers use data enrichment to combine historical customer data with live customer behavior data and deliver more personalized and targeted ads in real time, in context with what customers are doing. The LA is also applied to detect and recognize unusual behaviors that could indicate a potentially serious problem within a system. Financial institutions rely on trigger events to detect fraudulent transactions and stop fraud in its tracks. Hospitals also use triggers to detect potentially dangerous health changes while monitoring patient vital signs, sending automatic alerts to the right caregivers, who can then take immediate and appropriate action. Twitter APIs can be called to process large feeds through the Lambda Architecture pipeline and perform sentiment analysis. Additionally, complex networks can be protected with intrusion detection by tracking the failure of a node and avoiding future issues.

As discussed in [14], MapR uses the Lambda Architecture for online alerting in order to minimize idle transports. The architecture used by MapR is demonstrated in Fig. 4.

Fig. 4. Lambda Architecture at MapR [14]

In [15], a software development company named Talentica uses the LA for mobile ad campaign management. In mobile ad campaign management, a processing challenge is faced due to the high volume and velocity of data. If real-time views are not generated, the campaign might fail to deliver and incur heavy losses in terms of revenue and missed opportunities. The company faced a situation where over 500 GB of data and 200k messages/sec were generated. The company needed a unified approach including both batch and stream processing, and hence the architecture demonstrated in Fig. 5 was established.

Fig. 5. Lambda Architecture at Talentica [15]

The LA turned out to be a cost-effective and scalable solution to meet the demanding requirements of the mobile ad campaign.

[16] provides detailed information, from a software engineer who works at SmartNews, about how a LA was built on AWS to analyze customer behavior and recommend content. SmartNews aggregates outputs from machine learning algorithms with data streams to gather user feedback in near real time. The LA helps them serve the best stories on the web with low latency and high performance. The architecture followed at SmartNews can be seen in Fig. 6.

Fig. 6. Lambda Architecture at SmartNews [16]

Another use case of the LA can be seen at the tech giant Walmart Labs [17]. Walmart is a data-driven company and makes business and product decisions based on the analysis of data. The use case focuses on clickstream analysis in order to find unique visitors, orders, units, revenues and site error rates.
During peak hours, the pipeline holds as many as 250k events per second. The software engineers at Walmart devised a LA solution, illustrated in Fig. 7, for better productivity and results.

Fig. 7. Lambda Architecture at Walmart Labs [17]

This framework is a simplified version of the LA because of the Spark implementation in both the batch and speed layers. As discussed earlier, Spark provides APIs for both batch processing and Spark Streaming. Due to this, there is a considerable decrease in the complexity of the code base. There is no need to maintain separate code frameworks, as Spark enables code reuse.

VI. LIMITATIONS OF THE TRADITIONAL LAMBDA ARCHITECTURE

Initially, when the LA was proposed, the batch layer was meant to be a combination of the Hadoop File System and MapReduce, and the stream layer was realized using Apache Storm. Also, the serving layer was a combination of two independent databases, i.e. ElephantDB and Cassandra. Inherently, this model had a lot of implementation and maintenance issues [4]. Developers were required to possess a good understanding of two different systems, and there was a steep learning curve. Additionally, creating a unified solution was possible but resulted in a lot of merge issues, debugging problems and operational complexity. The incoming data needs to be fed to both the batch and the speed layer of the LA, and it is very important to preserve the ordering of the input events to obtain thorough results. The process of duplicating streams of data and passing them to two separate consumers can be troublesome and creates operational overhead. The LA does not always live up to expectations, and many industries use a pure batch processing system or a pure stream processing system to meet their use case.

VII. A PROPOSED SOLUTION

As understood, the LA can lead to code complexity, debugging and maintenance problems if not implemented in the right manner. A proper analysis and understanding of the existing tools helped me realize that the LA can be implemented using a technology stack comprising Apache Kafka, Apache Spark and Apache Cassandra. There are multiple ways of building a LA; in this approach, I will try one of them. I will use Apache Zeppelin to display the results observed in the serving layer.

Fig. 8. Proposed Lambda Architecture

A) DATASET

As discussed earlier, the LA handles fast incoming data from sources such as IoT sensors, web application clickstreams, application logs etc. In our case, a log producing a continuous flow of clickstream events is chosen as the dataset. The dataset resembles an online retail web application and its clickstream events. Basically, this is a simulation of streams in the form of logs. The Kafka producer will publish this data to the broker, and it will be used by the downstream systems, i.e. the batch and speed layers. The log will be demonstrated in the following format:
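The exact fields are shown in Fig. 9. Purely as an assumption for the sketches that follow, one clickstream record could be modeled along these lines; the field names here are illustrative and are not taken from the original dataset.

```scala
// Illustrative clickstream event; the field names are assumptions,
// not the actual schema shown in Fig. 9.
case class ClickEvent(
  timestamp: Long,   // event time in epoch milliseconds
  userId: String,    // anonymous visitor id
  page: String,      // page or product URL that was viewed
  action: String,    // e.g. "view", "add_to_cart", "purchase"
  referrer: String   // referring page, if any
)
```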
Fig. 9. Dataset

B) APACHE KAFKA – DATA INGESTION

Kafka is a distributed streaming platform with the following capabilities: publish and subscribe to streams of records, similar to a message queue or enterprise messaging system; store streams of records in a fault-tolerant, durable way; and process streams of records as they occur. Kafka is used to build ingestion pipelines for scalable data processing systems and fits our real-time data processing use case very well. A brief introduction to the architecture and terminology used in Kafka follows.

Fig. 10. Apache Kafka [7]

Kafka acts as a publisher-subscriber model and hence benefits from the loose coupling between the producers and the consumers. A Kafka cluster contains brokers that store the messages published by a producer. The consumers subscribe to the broker and receive the messages in order. The asynchronous nature of Kafka really helps in building scalable architectures.

Fig. 11. Anatomy of a Topic [7]

The Kafka cluster stores streams of records in categories called topics. Each record consists of a key, a value and a timestamp [12]. A topic is a category or feed name to which records are published. A topic is broken down into partitions that can be scaled horizontally. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it. In our architecture, the Kafka producer produces continuous streams of data in the form of clickstream events and pushes them to the Kafka broker. The streams are then forwarded to HDFS and to Spark's stream processing engine.
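A minimal sketch of the producer side (Scala, using the standard Kafka client API) might look like the following; the broker address, the topic name "clickstream" and the serialized event string are assumptions, not values from the original setup.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// Publish one clickstream event to the assumed "clickstream" topic,
// keyed by user id so all events of a user land in the same partition.
val event = """{"timestamp":1578000000000,"userId":"u42","page":"/product/123","action":"view"}"""
producer.send(new ProducerRecord[String, String]("clickstream", "u42", event))
producer.close()
```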
C) APACHE SPARK – BATCH + SPEED

In the previous architecture, there was a mixture of two processing systems for the batch and speed layers respectively. The batch layer consisted of the Hadoop stack: HDFS for storage and MapReduce for compute. The speed layer consisted of Apache Storm, the real-time counterpart of MapReduce. In our architecture, we will rely only on Apache Spark. Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel [8]. An RDD is an immutable data structure. There are two types of operations on an RDD, i.e. transformations and actions. A transformation creates a new dataset from an existing dataset. An action returns a value to the driver program after running a computation on the dataset. Spark uses a DAG approach to optimize the operations. In Spark, the transformations are lazy and are not computed until an action occurs. Whenever an action takes place, all the transformations are executed: a directed acyclic graph is formed over the RDDs and executed in order. If a dataset is created through a map transformation and then reduced, only the result of the reduce is returned to the driver instead of the larger mapped dataset. By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.

We make use of the Kafka consumer APIs to integrate Kafka with Spark. There are two approaches to achieve this, i.e. the receiver-based approach and the direct approach. We will use the direct approach because of the advantages it provides: 1) efficiency, 2) simplified parallelism and 3) exactly-once semantics. Also, direct Kafka streams make sure that there is a 1-1 mapping between the Kafka partitions and the Spark partitions. We can make use of Spark's DataFrame API to read data from HDFS in a structured format using Spark SQL.
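As a sketch of the batch-layer side (Scala; the HDFS path, the JSON format and the view definition are assumptions), the master dataset dumped to HDFS can be read back with the DataFrame API and aggregated into a batch view with Spark SQL:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("batch-layer").getOrCreate()

// Assumed location and format of the master dataset written from Kafka.
val events = spark.read.json("hdfs:///lambda/master/clickstream/")
events.createOrReplaceTempView("events")

// Recompute the batch view over the full history with Spark SQL.
val batchView = spark.sql(
  "SELECT page, COUNT(*) AS views FROM events GROUP BY page")
batchView.show()
```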
Building the speed layer with Spark: basically, we fork the data coming from Kafka into HDFS and into Spark's stream processing engine. We make use of Spark's DStream API to deal with the streaming data. A discretized stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark's abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval. Any operation applied on a DStream translates to operations on the underlying RDDs. The DStream operations hide most of these details and provide the developer with a higher-level API for convenience.
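A minimal sketch of the direct approach (Scala, assuming the spark-streaming-kafka-0-10 integration; the topic name, consumer group and batch interval are assumptions) turns the clickstream topic into a DStream and incrementally counts events per page in each micro-batch:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer

val conf = new SparkConf().setAppName("speed-layer")
val ssc  = new StreamingContext(conf, Seconds(5)) // assumed 5 s micro-batches

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "speed-layer" // assumed consumer group
)

// Direct stream: one Spark partition per Kafka partition of "clickstream".
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("clickstream"), kafkaParams)
)

// Real-time view: count events per page within each micro-batch.
val pageCounts = stream
  .map(record => (record.value, 1L)) // assume the record value identifies the page
  .reduceByKey(_ + _)
pageCounts.print()

ssc.start()
ssc.awaitTermination()
```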
D) APACHE CASSANDRA – SERVING LAYER

Apache Cassandra is a NoSQL distributed database management system designed for large amounts of data that provides high availability with no single point of failure. It serves our use case, as it can take heavy write loads and can gracefully be used for analytics. We use the Spark-Cassandra connector library, which provides excellent integration between the two systems. As per the CAP theorem, under a network partition it is only possible to have either consistency or availability in a distributed system. The consistency parameter in Cassandra is tunable and configurable, with a trade-off against availability. In our use case, there is a need for high availability, and hence consistency can be compromised. Furthermore, data modeling in Cassandra revolves around the queries. It is possible to build tailored tables and materialized views with a read-write separation for better performance. A Cassandra cluster is configurable, and for our use case I went with a simple strategy with a default replication factor of 3. Additionally, Cassandra's Query Language (CQL) integrates well with Apache Zeppelin for querying and analytics.
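A minimal sketch of pushing a computed view into Cassandra through the Spark-Cassandra connector is shown below; the keyspace, table and column names are assumptions, and the table is assumed to exist with a matching schema.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds saveToCassandra to RDDs

// Assumes a reachable Cassandra node and a pre-created table:
//   CREATE TABLE views.page_counts (page text PRIMARY KEY, count bigint);
val conf = new SparkConf()
  .setAppName("serving-layer-writer")
  .set("spark.cassandra.connection.host", "localhost") // assumed host
val sc = new SparkContext(conf)

val pageCounts = sc.parallelize(Seq(("/product/123", 42L), ("/cart", 7L)))
// Each tuple becomes one row; columns are mapped in the order given.
pageCounts.saveToCassandra("views", "page_counts", SomeColumns("page", "count"))
```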

E) APACHE ZEPPELIN

Apache Zeppelin is a web-based notebook that brings data exploration, visualization, sharing and collaboration features to Spark. The reason to choose Zeppelin is its support for the Spark APIs and CQL.

VIII. COMPARATIVE ANALYSIS

When the Lambda Architecture was initially proposed, it consisted of Apache Storm as the engine for the speed layer. Apache Storm follows a native streaming approach for the creation of real-time views. Spark has grown exponentially during these years and has become a top-level Apache project. Spark has been a preferred solution due to its rich APIs for SQL (Spark SQL) and machine learning (MLlib). Spark Streaming is more of a micro-batch processing engine. Fig. 14 demonstrates the meaning of micro-batching: it is an overlap of batch and stream and can be thought of as fast batch processing.

Fig. 12. Comparison Between Architectures

MAPREDUCE VS SPARK

As demonstrated in Fig. 13, Spark gives better performance compared to MapReduce.

Fig. 13. MapReduce vs Apache Spark

SPARK VS STORM

Fig. 14. Batch vs Streaming vs Micro-Batch

The choice between Storm and Spark Streaming depends on a lot of factors. Both technologies have built-in support for fault tolerance and high availability. The term real-time data processing is relative and depends on the use case. For our use case, Storm initially performs slightly better than Spark Streaming on the basis of low latency. As time progresses, Spark Streaming matches the performance of Storm. There is a trade-off between latency and ease of development that impacts the design decision in the proposed architecture. Unlike Spark, Storm does not support using the same application code for both batch and stream processing.
The traditional architecture has separate processing engines across the parallel layers. There is a considerable overhead in software installation and maintenance for MapReduce and Storm. Even though the application logic remains almost the same, writing code for two different systems was difficult for my use case. The groups of APIs supported by Storm and MapReduce are very different from each other, and this issue can easily scale up at an industry level. This is where Spark's unified API for batch and stream processing helped my case: Spark has a unified API that can be reused in both layers. Spark Streaming is inherently micro-batch based, and operating on each RDD streamlined the application logic. The proposed architecture performs computation and manipulation in SQL with the help of Spark SQL. This helps the vast community of developers and analysts who prefer SQL for data processing. Additionally, the interoperability between Spark and Cassandra is smooth due to the Spark-Cassandra connector. This helped us push the processed records into Cassandra tables with ease. We carried out an analysis for different data sizes with Apache Storm and Spark Streaming. The demonstration can be seen below in the table followed by a figure (Fig. 15).

Fig. 15. Storm vs Spark Streaming

KAFKA VS OTHER QUEUEING SYSTEMS

Fig. 16. Performance Comparison Between Queueing Systems [21]

Fig. 16 demonstrates a comparison between different pub-sub/queueing systems. The blue value represents the throughput of sending messages, i.e. the sender; the red value represents the throughput of receiving messages, i.e. the receiver. The LA is meant to handle data at a huge scale. Especially for big data with very high velocity, a resilient queueing system such as Kafka is necessary. Even though other pub/sub systems such as ActiveMQ and RabbitMQ are widely used, they do not match the results of Kafka. The major reason to go with Kafka for our use case is its ability to rewind records with the help of an offset handled by the consumer. This offloads a lot of burden from the queueing system and results in loose coupling. Additionally, Kafka provides a very good persistence mechanism and a configurable retention period that fits our use case very well. I have set the retention period to seven days, after which I push the data to HDFS. If re-computation is required, we can always look up the data from HDFS.

SERVING LAYER SIMPLIFIED

In the traditional LA, the serving layer comprised both ElephantDB and Apache Cassandra. The views computed in the batch layer are sent to ElephantDB, and the views calculated in the speed layer are sent to Apache Cassandra.
The proposed architecture comprises a single database to hold the views from both the batch and the speed layers. Due to the growth and acceptance of Apache Cassandra, building software systems with Cassandra as a data store has become very easy. Unlike the traditional architecture, the proposed architecture has a Spark processing engine on top of the serving-layer datastore. Hence, a distributed, heavy-write database like Cassandra integrates smoothly with our use case. This alleviates the burden of maintaining separate databases and therefore reduces storage cost.

SUMMARIZED COMPARISON BETWEEN TRADITIONAL & PROPOSED ARCHITECTURE

IX. KEY INSIGHTS

1) The trouble and overhead of duplicating incoming data can be eliminated with the help of Kafka. Kafka can store messages over a period of time, and they can be consumed by multiple consumers. Each consumer maintains an offset to denote its position in the commit log. Thus, the batch and speed layers of the LA can act as consumers of the records published on the topics in Kafka. This reduces the duplication complexity and simplifies the architecture.
2) Spark's rich APIs for batch and streaming make it a tailor-made solution for the LA. The underlying element of the Spark Streaming API (DStreams) is a collection of RDDs. Hence, there is huge scope for code reuse. Also, the business logic is simpler to maintain and debug. Additionally, for the batch layer, Spark is a better option due to the speed it achieves through in-memory processing.
3) Kafka can store messages for as long as the use case demands (7 days by default, but that is configurable). We have used HDFS for reliability in case of human or machine faults. In fact, we could eliminate HDFS from the batch layer entirely and store the records as they are in the Kafka topics. Whenever re-computation is required, the commit log inside Kafka can be replayed.
4) Kafka's commit log is very useful for event sourcing as well. As the commit log is an immutable, append-only data store, a history of user events can be tracked for analytics. This finds application in online retail and recommendation engines.
5) Cassandra is a write-heavy database in which tables can be built as per new use cases. The Kafka commit log can be replayed and new views can be materialized in Cassandra.
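As a sketch of the replay idea in insights 1, 3 and 5 (Scala 2.13, standard Kafka consumer API; the group id and topic name are assumptions), a fresh consumer group with auto.offset.reset set to earliest re-reads the retained commit log from the beginning:

```scala
import java.util.Properties
import java.time.Duration
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "batch-recompute")   // assumed, new group so no stored offsets
props.put("auto.offset.reset", "earliest") // start from the oldest retained record
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Seq("clickstream").asJava)

// Each poll returns the next slice of the retained log, oldest first, so a
// recomputation job can rebuild its views from history.
val records = consumer.poll(Duration.ofSeconds(1)).asScala
records.foreach(r => println(s"${r.offset}: ${r.value}"))
consumer.close()
```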

X. CONCLUSION & FUTURE WORK

Due to the tremendous evolution of big data, there is a big challenge in processing and analyzing large amounts of data. Traditional systems such as relational databases and batch processing systems are not able to keep up with the big data trends. Even though Hadoop frameworks hold great promise and reduce complexities in distributed systems, they fail to satisfy real-time processing requirements. The primary goal of the project is to demonstrate and synthesize the advancements in the field of real-time data processing with the Lambda Architecture. The architecture is a robust combination of batch and stream processing techniques.
Each layer in the architecture has a defined role and is decoupled from the other layers. The LA is overall a highly distributed, scalable, fault-tolerant, low-latency platform. All the data entering the system is sent to both the batch and the speed layer for further processing. The purpose of the batch layer is the management of the master dataset and the pre-computation of the batch views. The high-latency disadvantage of the batch layer is compensated by the speed layer. The serving layer accumulates the batch and real-time views. The beauty of the architecture is the ability to apply different technologies across the three layers. As the big data tools improve day by day, the LA provides new opportunities and possibilities. It is the responsibility of the LA to translate the incoming raw data into something meaningful. There is a massive demand for real-time data processing in the industry, which is addressed by the LA. The system is a good balance of speed and reliability. Although the LA has many pros, there are a few observed cons as well. There are a lot of moving parts, resulting in coding overhead due to the involvement of comprehensive processing in the disparate batch and speed layers. In certain scenarios, it is not beneficial to perform re-processing for every batch cycle. Additionally, it can be difficult to merge the views in the serving layer. Therefore, we have introduced a LA using Apache Spark as a common framework for both the batch and speed layers. This simplifies the architecture to a great extent and keeps it maintainable as well. Furthermore, a detailed comparison has been made between the two architectures depending on the design decisions and trade-offs. Future work will be to study the different technologies that can be used as alternatives within the proposed architecture. Apache Samza is a stream processing engine that integrates very well with Kafka; it was built by the same engineers who were responsible for the development of Kafka. It would be interesting to try different use cases with the architecture and analyze how they perform.
ANNEXURE: FIGURES

Fig. 1. Lambda Architecture [2]

Fig. 2. Incremental Computation Strategy [3]

Fig. 3. Spark Streaming [10]

Fig. 4. Lambda Architecture at MapR [14]

Fig. 5. Lambda Architecture at Talentica [15]

Fig. 6. Lambda Architecture at SmartNews [16]

Fig. 7. Lambda Architecture at Walmart Labs [17]

Fig. 8. Proposed Lambda Architecture [18]
REFERENCES

[1] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Proc. 6th
Symp. Oper. Syst. Des. Implement., pp. 137–149,2004.

[2] T. Chen, D. Dai, Y. Huang, and X. Zhou, “MapReduce On Stream Processing,” pp. 337–341.

[3] N. Marz and J. Warren. Big Data: Principles and best practices of scalable real time data systems.
Manning Publications Co.,2015.

[4] J. Kreps, “Questioning the Lambda Architecture”, https://www.oreilly.com/ideas/questioning-the-lambda-architecture

[5] I. Ganelin, E. Orhian, K. Sasaki, and B. York, Spark: Big Data Cluster Computing in Production.2016.

[6] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop
Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and
Technologies (MSST)(MSST '10). IEEE Computer Society, Washington, DC, USA, 1-10.
DOI=http://dx.doi.org/10.1109/MSST.2010.5496972

[7] Jay Kreps, Neha Narkhede, and Jun Rao. Kafka: a distributed messaging system for log processing.
SIGMOD Workshop on Networking Meets Databases, 2011.

[8] W. Yang et al. “Big Data Real-Time Processing Based on Storm.” 2013 12th IEEE International
Conference on Trust, Security and Privacy in Computing and Communications (2013): 1784-1787.

[9] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave,
Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez,
Scott Shenker, and Ion Stoica. 2016. Apache Spark: a unified engine for big data processing. Commun.
ACM 59, 11 (October 2016), 56-65. DOI: https://doi.org/10.1145/2934664

[10] Apache Spark, spark.apache.org

[11] Apache Storm, www.storm.apache.org

[12] Avinash Lakshman and Prashant Malik. 2010. Cassandra: a decentralized structured storage system.
SIGOPS Oper. Syst. Rev. 44, 2 (April 2010), 35-40. DOI=http://dx.doi.org/10.1145/1773912.1773922

[13] Apache HBase, https://hbase.apache.org

[14] “Stream processing with MapR”, https://mapr.com/resources/stream-processing-mapr/

[15] https://www.talentica.com/assets/white-papers/big-data-using-lambda-architecture

[16] https://aws.amazon.com/blogs/big-data/how-smartnews-built-a-lambda-architecture-on-aws-to-
analyze-customer-behavior-and-recommend-content/
[17] S. Kasireddy, “How we built a data pipeline with Lambda Architecture using Spark/Spark Streaming”, https://medium.com/walmartlabs/how-we-built-a-data-pipeline-with-lambda-architecture-using-spark-spark-streaming-9d3b4b4555d3

[18] “Lambda Architecture for Batch and Stream Processing”, https://d0.awsstatic.com/whitepapers/lambda-architecure-on-for-batch-aws.pdf

[19] “A prototype of Lambda architecture for financial risk”, https://www.linkedin.com/pulse/prototype-lambda-architecture-build-financial-risk-mich/

[20] “Applying the Lambda Architecture with Spark, Kafka, and Cassandra”, https://github.com/aalkilani/spark-kafka-cassandra-applying-lambda-architecture

[21] Adam Warski, “Evaluating persistent, replicated message queues”, http://www.warski.org/blog/2014/07/evaluating-persistent-replicated-message-queues/

AUTHORS:

*First Author – Dr. Yuvraj Kumar, Ph.D. in Artificial Intelligence, Faculty of Quantum Computing and
Artificial Intelligence, Zukovsky State University, Russia

