UNIT V: Window Operations

REAL-TIME PROCESSING USING SPARK STREAMING

Structured Streaming, Basic Concepts, Handling Event-time and Late Data, Fault-tolerant Semantics, Exactly-once Semantics, Creating Streaming Datasets, Schema Inference, Partitioning of Streaming Datasets, Operations on Streaming Data, Selection, Aggregation, Projection, Watermarking, Window Operations, Types of Time Windows, Join Operations, Deduplication
Window Operations
Spark Streaming Window Operations

Spark Streaming takes advantage of windowed computations in Apache Spark: it lets you apply transformations over a sliding window of data.

As the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. In the standard example, the operation is applied over the last 3 time units of data and slides by 2 time units.

Any Spark window operation requires specifying two parameters.
Window length – the duration of the window (3 time units in the example above).
Sliding interval – the interval at which the window operation is performed (2 time units in the example above).

Both parameters must be multiples of the batch interval of the source DStream. For example, with a 10-second batch interval, a 30-second window sliding every 20 seconds is valid, but a 25-second window is not.
Common Spark Window Operations
1. window(windowLength, slideInterval)
Returns a new DStream, computed from windowed batches of the source DStream.
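
As a minimal sketch in Scala using the DStream API (the application name, host, port, and durations here are illustrative choices, not part of the source material):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("WindowExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))   // 10-second batch interval
    val lines = ssc.socketTextStream("localhost", 9999) // illustrative text source

    // Window length = 30s, sliding interval = 20s (both multiples of the batch interval)
    val windowedLines = lines.window(Seconds(30), Seconds(20))
    windowedLines.print()

    ssc.start()
    ssc.awaitTermination()

The sketches for the remaining operations reuse the ssc and lines definitions above.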
2. countByWindow(windowLength, slideInterval)
Returns a sliding window count of the elements in the stream.
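
A short sketch; countByWindow maintains its count incrementally, so a checkpoint directory must be set (the path is illustrative):

    ssc.checkpoint("/tmp/spark-checkpoint") // required for the incremental count

    // Count of all elements received in the last 30 seconds, updated every 10 seconds
    val elementCounts = lines.countByWindow(Seconds(30), Seconds(10))
    elementCounts.print()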
3. reduceByWindow(func, windowLength, slideInterval)
Returns a new single-element stream, created by aggregating the elements in the stream over a sliding interval using func. The function must be commutative and associative so that it can be computed correctly in parallel.
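
For instance, a hypothetical stream of numeric lines could be summed over a window; addition is commutative and associative, so it parallelizes correctly:

    // Sum of the numbers received in the last 30 seconds, sliding every 10 seconds
    val numbers = lines.map(_.trim.toLong)
    val windowedSum = numbers.reduceByWindow(_ + _, Seconds(30), Seconds(10))
    windowedSum.print()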
4. reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window.
For grouping, it uses Spark's default number of parallel tasks: 2 in local mode, and in cluster mode the number given by the spark.default.parallelism config property. Pass the optional numTasks argument to set a different number of tasks.
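
A sketch of the classic windowed word count (durations illustrative, as before):

    // Count each word over the last 30 seconds of data, every 10 seconds
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
    val windowedWordCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedWordCounts.print()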
5. reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
A more efficient version of the above reduceByKeyAndWindow(), in which the reduced value of each window is calculated incrementally from the reduced value of the previous window: the new data entering the sliding window is reduced in with func, and the old data leaving the window is "inverse reduced" out with invFunc.
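
The same word count written incrementally, reusing the pairs DStream from the sketch above. Checkpointing must be enabled for this inverse-reduce variant:

    ssc.checkpoint("/tmp/spark-checkpoint") // illustrative path

    val incrementalWordCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, // reduce the new data entering the window
      (a: Int, b: Int) => a - b, // "inverse reduce" the old data leaving the window
      Seconds(30),
      Seconds(10)
    )
    incrementalWordCounts.print()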
6. countByValueAndWindow(windowLength, slideInterval, [numTasks])
When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window.
As in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
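
A sketch reusing lines from the first example; since the per-value counts are maintained incrementally, checkpointing must be enabled here as well:

    ssc.checkpoint("/tmp/spark-checkpoint")

    // Frequency of each distinct word within the last 30 seconds, sliding every 10 seconds
    val words = lines.flatMap(_.split(" "))
    val wordFrequencies = words.countByValueAndWindow(Seconds(30), Seconds(10))
    wordFrequencies.print()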
