Lec 19
Preface
Content of this lecture: In this lecture we will discuss real-time big data processing with Spark Streaming
and sliding window analytics. We will also discuss a case study based on Twitter sentiment analysis using
Spark Streaming.
Therefore, today we are going to discuss Spark Streaming: how it is all integrated and how it is
useful for various applications. Now, before going into more detail about Spark Streaming, let us see
how fault tolerance is achieved in stream processing systems.
So, traditional processing models use a pipeline of different nodes; each node maintains mutable
state, each input record updates that state, and new records are sent out. Now, the problem with the
previous systems was that the mutable state was lost if a node failed. And node failure is the norm
rather than the exception on commodity hardware. So, therefore, when a node fails, the mutable state
it was maintaining is lost; there is no fault tolerance for the mutable state.
So, some things are simply lost in the traditional, previous systems. Therefore, making stateful
stream processing fault tolerant is also very much needed, and we will see how, in the Spark
Streaming system, this stateful stream processing is done in a fault-tolerant manner.
Now, let us see what streaming is. So, Data Streaming is a technique for transferring data so that it
can be processed as a steady and continuous stream as it arrives at the system. So, you can visualize
it as if the data is flowing continuously through pipes, and as it passes through these pipes it is
required to be processed in real time. And, if that is the case, then it is called 'Data Streaming'
or 'Streaming Data'. Now, the sources which generate streaming data are many: for example, internet
traffic, if it is observed as it flows, is network streaming data; similarly, the Twitter stream can
also be taken up in some applications. Similarly, Netflix, with its real-time online movie watching,
also generates streaming data, and YouTube data is also a kind of streaming data; and there are many,
many other ways streaming data can be generated. Streaming data can also be generated
from a database, where data is read and transmitted in the form of a stream, that is, ETL.
So, companies normally do this for analysis. So, streaming data technologies are becoming
increasingly important with the growth of the Internet and of internet-enabled services, which
are available in the form of Netflix, Facebook, Twitter, YouTube, Pandora, iTunes and so on. There are
tons of such services available through the internet nowadays.
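The "data flowing through pipes" idea above can be sketched in a few lines of plain Python. This is not Spark code; the source records and the counting logic are illustrative assumptions, standing in for a live feed such as network traffic or a Twitter stream:

```python
# A minimal pure-Python sketch of data streaming: records arrive one by
# one and are processed as they pass through, rather than being
# collected into a file first. The record list is a stand-in for a
# hypothetical live source (Twitter, Netflix events, network traffic).

def stream_source():
    """Yield records one at a time, as a live source would."""
    for record in ["click", "view", "click", "purchase", "click"]:
        yield record

def process_stream(source):
    """Consume the stream record by record, updating running counts."""
    counts = {}
    for record in source:
        counts[record] = counts.get(record, 0) + 1  # processed as it arrives
    return counts

print(process_stream(stream_source()))  # → {'click': 3, 'view': 1, 'purchase': 1}
```

The point of the sketch is that `process_stream` never sees the whole dataset at once; it only ever holds the current record and its running state, which is exactly the constraint a streaming framework must work under.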
So, therefore, there is a need for a framework for big data stream processing that scales to hundreds
or thousands of nodes, achieves second-scale latency, is able to recover efficiently from failures,
and integrates with batch and interactive processing.
Refer slide time: (26:06)
Let us see what the different features are which are able to cater to all these requirements.
So, the Spark Streaming features: first is scaling, as Spark Streaming can easily scale to hundreds or
thousands of nodes; next is speed, as it achieves low latency; fault tolerance is achieved here, to
recover from failures; and it is also integrated with batch and real-time processing, and business
analytics is also supported.
So, let us see another part: besides the integration of batch processing and real-time streaming
data processing, we will see another requirement, which is about stateful stream processing. In the
traditional model, as we have seen, processing is organized as a pipeline, and if the mutable state
is lost then that has to be handled.
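The stateful stream processing just described can be sketched as follows, in the spirit of Spark Streaming's `updateStateByKey` operation. This is a plain-Python illustration, not the Spark API; the batches and the word-count state are assumed examples:

```python
# A minimal sketch of stateful stream processing: per-key mutable state
# (a running word count) persists across incoming batches. Losing this
# dict when a node fails is exactly the fault-tolerance problem the
# lecture describes; Spark Streaming solves it by storing such state in
# fault-tolerant RDDs instead of a bare in-memory structure.

def update_state(state, new_values):
    """Fold one batch's new values into the existing state for a key."""
    return (state or 0) + sum(new_values)

def run_batches(batches):
    state = {}  # mutable state maintained across all batches
    for batch in batches:
        # group this batch's (word, count) pairs by key
        grouped = {}
        for word, n in batch:
            grouped.setdefault(word, []).append(n)
        # apply the update function per key, carrying state forward
        for word, values in grouped.items():
            state[word] = update_state(state.get(word), values)
    return state

batches = [[("spark", 1), ("storm", 1)], [("spark", 1)]]
print(run_batches(batches))  # → {'spark': 2, 'storm': 1}
```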
So, let us see how modern data applications approach the data for more insights, compared with the
kind of analysis that traditional analytics requires.
Refer slide time: (27:32)
Now, an existing system for streaming data analysis is called 'Storm'. And Storm replays a record if
it was not processed by a node, and therefore it provides at-least-once semantics. So, that means,
when nodes fail, some of the updates to the mutable state may be applied again when the record is
replayed. So, an update may happen twice, and that becomes the problem with at-least-once semantics;
and the mutable state can still be lost due to failures. So, this at-least-once semantics creates
problems in those cases where updates end up being done twice, and an existing system like Storm
achieves only this state of the art, which is called at least once. So, exactly-once is the semantics
which is required, and it is supported in the Spark Streaming system. There are other streaming
systems, such as Trident, which use transactions to update the state and thereby also achieve
exactly-once semantics, but a per-state transaction to an external database is slow.
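The difference between at-least-once and exactly-once semantics can be made concrete with a small sketch. The record stream below is hypothetical; the duplicated record simulates a replay after a node failure, as in Storm:

```python
# Why at-least-once replay can corrupt mutable state, and how tracking
# processed record ids restores an exactly-once effect. Record id 2 is
# replayed (appears twice), simulating recovery after a node failure.

records = [(1, 10), (2, 5), (2, 5), (3, 7)]  # (record_id, amount)

def at_least_once(stream):
    total = 0
    for _rid, amount in stream:
        total += amount          # the replayed record is counted twice
    return total

def exactly_once(stream):
    total, seen = 0, set()
    for rid, amount in stream:
        if rid in seen:          # skip duplicates caused by replay
            continue
        seen.add(rid)
        total += amount
    return total

print(at_least_once(records))  # → 27: the replay inflated the total
print(exactly_once(records))   # → 22: each record applied exactly once
```

Deduplicating by record id is only one way to get an exactly-once effect; Trident's transactional state updates and Spark Streaming's lineage-based recovery are the mechanisms the lecture actually refers to.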
So, let us review the entire scenario of the Spark Streaming architecture. So, it is based on a
micro-batch architecture that operates on intervals of time: new batches are created at a regular
time interval, and each received batch is divided into blocks for parallelism. And each batch is a
graph of operations that translates into multiple jobs. And it has the ability to create larger-sized
batch windows as it processes over time.
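The micro-batch idea can be sketched as follows: records received over time are grouped into fixed intervals, and each interval's batch is handed to the engine as one small job. The timestamps and the 2-second interval here are illustrative assumptions, not Spark defaults:

```python
# A minimal sketch of micro-batching: incoming (timestamp, value)
# records are grouped by which fixed time interval they fall into, and
# each interval's batch is then processed as a separate small job.

def to_batches(records, interval):
    """Group (timestamp, value) records into per-interval batches."""
    batches = {}
    for ts, value in records:
        batch_id = ts // interval            # which interval this record falls in
        batches.setdefault(batch_id, []).append(value)
    return [batches[k] for k in sorted(batches)]

records = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (5, "e")]
jobs = to_batches(records, interval=2)
print(jobs)  # → [['a', 'b'], ['c', 'd'], ['e']] — each inner list is one job
```

In Spark Streaming the same grouping happens inside the receiver: the batch interval is fixed when the streaming context is created, and each batch additionally gets split into blocks so that the job over it can run in parallel.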