Lec 19
Preface
Content of this lecture: In this lecture we will discuss real-time big data processing with Spark Streaming and sliding-window analytics. We will also discuss a case study based on Twitter sentiment analysis using Spark Streaming.
Now, the question is: how do we process big streaming data? In this part of the lecture we are going to deal with the data which is called 'Streaming Data'. If the stream of data is very fast and also continuous, then it is categorized under the big data scenario, where the particular characteristic called 'Velocity' requires a big data infrastructure to handle it. That means we require hundreds to thousands of nodes, that is, we have to scale out to deal with this fast data, which is also called 'Streaming Data'. So, for scalability we require hundreds to thousands of nodes to process this big streaming data. Another aspect is about achieving low latency. Obviously this requires insight into the technology, and we will see how low latency can be achieved in this streaming data processing framework.

Now, another requirement is how to deal with failures, and not only how to deal with failures but how to recover from them efficiently, so that failures can be tolerated by the applications which are monitoring events in real time. So, failure recovery has to be done in such a way that it can cater to real-time applications which are based on streaming data.

Now, another thing is that various kinds of interactive processing happen on streaming data, and this streaming data is also called 'Fast Data'. Besides this fast data, sometimes we also require integration with another mode of data, namely batch data. So, there are two different modes of data, the batch data and the streaming data, and there are some applications which require the integration of both; these types of data are different and traditionally have different stacks on which they are processed. In the previous technologies, with their separate stacks, this integration is time-consuming and may not be useful for real-time applications. So, here we are going to see the new technology which is called 'Spark Streaming', which will combine, or integrate, this requirement of processing the batch and the interactive data simultaneously.
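To make this concrete, here is a minimal Spark Streaming sketch in Scala, assuming text arriving on a local socket (for example, one opened with "nc -lk 9999"); the host, port, and the one-second batch interval are illustrative choices, not from the lecture. It shows that the continuous stream is processed with the same RDD-style operations used for batch data.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    // The continuous stream is discretized into small batches, one RDD per second.
    val ssc = new StreamingContext(conf, Seconds(1))

    // DStream of text lines arriving on the socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same transformations used on batch RDDs apply to the stream.
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()            // start receiving and processing data
    ssc.awaitTermination() // block until the computation is stopped
  }
}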
Now, let us see what people have been doing so far, that is, what the other systems before Spark Streaming were and how this was being done. For batch processing and for stream processing there was a requirement of different stacks, and these stacks are optimized for different types of data, so integration is not very common in the previous generation of systems. For example, for batch processing we have the Spark system and the MapReduce framework, and for streaming the earlier systems were such as Storm. Integration therefore requires two different stacks to be combined together; hence it has a lot of latency involved in it and may not be sufficient for different real-time applications. That is why the framework which we are going to discuss in today's lecture, called 'Spark Streaming', integrates both batch and streaming applications using the same stack. Therefore, it is the most efficient way of dealing with multiple types of data, that is, the batch data and the stream data, when they are required to be processed at the same time. The existing frameworks such as Storm and MapReduce cannot do both kinds of processing, that is, for the batch and the streaming data, at the same point of time. So, either the stream processing of hundreds of megabytes with low latency or the batch processing of terabytes of data with high latency had to be dealt with separately, and a combined or integrated viewpoint was not available before Spark Streaming. So, Spark Streaming is the new technology which we are going to discuss, and we will also be seeing different use cases where this kind of batch processing and stream processing together are required in many applications.
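As a sketch of what "the same stack" buys us, the snippet below writes one piece of analysis logic once, against RDDs, and applies it unchanged both to a static file and to a live stream via transform; the file name, socket address, and hashtag-counting logic are hypothetical examples, not the lecture's code.

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnifiedStack {
  // One piece of analysis logic, written once against RDDs.
  def hashtagCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" "))
         .filter(_.startsWith("#"))
         .map(tag => (tag, 1))
         .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("UnifiedStack")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Batch use: apply the function to a static dataset (hypothetical file).
    val batchResult = hashtagCounts(ssc.sparkContext.textFile("tweets.txt"))
    batchResult.take(10).foreach(println)

    // Streaming use: apply the very same function to every micro-batch.
    val streamResult = ssc.socketTextStream("localhost", 9999)
                          .transform(rdd => hashtagCounts(rdd))
    streamResult.print()

    ssc.start()
    ssc.awaitTermination()
  }
}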
So, therefore, if integration is not supported, it is required to maintain two different stacks, which may require different programming models, duplicated development effort, and additional operational cost. So, that is not feasible.
Let us see the different features which are able to cater to all these requirements. The Spark Streaming features: first is scaling, that is, Spark Streaming can easily scale to hundreds and thousands of nodes; second is speed, that is, it achieves low latency; fault tolerance is achieved here, to recover efficiently from failures; and it is also integrated with batch and real-time processing, and business analytics is also supported.
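The sliding-window analytics promised in the preface can be sketched in the same API; the 60-second window and 10-second slide below are assumed parameters for illustration, and both must be multiples of the batch interval.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedCounts")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Word counts over the last 60 seconds, recomputed every 10 seconds.
    val windowed = words.map((_, 1))
                        .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
    windowed.print()

    ssc.start()
    ssc.awaitTermination()
  }
}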
So, let us see the approaches of modern data applications for gaining more insights, and the kind of analysis that traditional analytics basically requires.
Refer slide time: (27:32)
Now, the existing system for streaming data analysis is called 'Storm'. It replays a record if it was not processed by a node, and therefore it provides at-least-once semantics. That means that if a node fails in the middle of applying an update to the mutable state, the update may be applied twice; that becomes the problem with at-least-once semantics, and the mutable state can also be lost due to failures. So, this at-least-once semantics creates problems in those cases where an update ends up being done twice, and existing systems like Storm achieve only up to that state of the art, which is called at least once. Exactly-once is the semantics which is required, and it is supported in the Spark Streaming system. There are other streaming systems, such as Trident, which use transactions to update the state, and these also have exactly-once semantics, but there each per-state transaction to an external database is slow.
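A minimal sketch of how Spark Streaming keeps mutable state recoverable is given below: the state lives in the DStream itself and is rebuilt from a checkpoint after a failure rather than being lost, which is part of how state updates count exactly once; the checkpoint path and batch interval are illustrative assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StatefulCounts")
    val ssc  = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/streaming-checkpoint") // required for stateful operations

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Running count per word: the state is updated once per batch and is
    // restored from the checkpoint after a failure instead of being lost.
    val running = words.map((_, 1)).updateStateByKey[Int] {
      (newCounts: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + newCounts.sum)
    }
    running.print()

    ssc.start()
    ssc.awaitTermination()
  }
}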