BDA Lec10
[Course map: MapReduce · Hadoop File System · Spark · Large-scale Data Mining/ML · Streaming]
What will we learn in this lecture?
01. What is Streaming?
Broadly:
Data can be continuously analyzed and transformed in memory before it is stored to disk.
Stream Data?
● Sequence of items
● Structured (e.g., tuples)
● Ordered (implicitly or by timestamp); items must be processed in that exact order
● Arriving continuously at high volumes
● Not possible to store entirely
● Sometimes not possible to even examine all items
● The streaming engine receives data and creates micro-batches based on processing time
● Spark Streaming Programming Model
○ Discretized Stream (DStream)
■ – High-level abstraction representing a continuous stream of data
■ – Implemented as a sequence of RDDs, one per micro-batch
○ DStream API is very similar to the RDD API
■ – Functional APIs in Scala and Java for operations such as map, reduce, filter, etc.
■ – Create input DStreams from different sources (Kafka, Flume, HDFS, etc.)
■ – Apply parallel operations
● The size of each RDD depends on the length of the batch interval (the size of the time window) and the number of messages delivered during that interval (see the sketch below)
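As a sketch of the model above in Scala (the socket host, port, and 5-second batch interval are illustrative placeholders, not values from the lecture), a DStream word count could look like this:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamSketch")
    // 5-second batch interval: every 5 seconds one micro-batch (one RDD) is formed,
    // so each RDD's size depends on how many messages arrived during that interval.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Input DStream from a socket source (Kafka, Flume, or HDFS work similarly).
    val lines = ssc.socketTextStream("localhost", 9999)

    // RDD-like functional operations: flatMap, map, reduceByKey, ...
    val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

    // foreachRDD exposes the underlying RDD of each micro-batch directly.
    counts.foreachRDD(rdd => println(s"micro-batch with ${rdd.count()} distinct words"))

    ssc.start()
    ssc.awaitTermination()
  }
}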
How does Spark Streaming work?
● StreamingContext (SSC) is the entry point to a Spark Streaming application
○ – 1 SSC runs in 1 JVM (several JVMs can run on a cluster for distributed processing)
○ – Configure the batch interval (e.g., 5 seconds, 10 seconds)
● After creating the SSC:
○ – Define the source of data: an input DStream (Kafka, file system, socket)
○ – Define the pipeline (computation)
○ – Start accepting and processing data from the input stream
○ – Wait for processing to terminate (manually or on error)
● Once processing has started, the pipeline becomes immutable
○ – It is not possible to pause or modify the processing pipeline
○ – To make changes, you must stop the application and restart it with the updated logic
1. Spark Streaming Example App.
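A minimal example app following the lifecycle steps above, hedged as a sketch: the watched directory path is a placeholder, and a Kafka or socket source would be defined the same way.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // 1. Create the StreamingContext (SSC) and configure the batch interval.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    // 2. Define the data source: an input DStream watching a directory
    //    (placeholder path) for newly arriving text files.
    val lines = ssc.textFileStream("/tmp/streaming-input")

    // 3. Define the pipeline; once started, it cannot be modified.
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    // 4. Start accepting and processing data.
    ssc.start()
    // 5. Wait for termination (manual stop or error).
    ssc.awaitTermination()
  }
}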
2. Spark Structured Streaming?
● Stream processing engine built on the Spark SQL engine
● "Streaming computation expressed the same way as batch computation"
● DataFrame API available in Scala, Java, Python, and R
● Supports advanced operations
○ – Aggregations, event-time windows, stream joins, …
● Two processing models (see the sketch below)
○ – Micro-batch: data is divided into small micro-batches and processed at regular intervals
■ – Latency ~100 ms
■ – Exactly-once semantics possible (each event is processed exactly once)
○ – Continuous processing (since Spark 2.3)
■ – Each message is processed immediately after delivery
■ – Latency ~1 ms
■ – At-least-once semantics (events might be processed more than once in certain failure scenarios)
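As a sketch, the processing model is selected per query via its trigger. The rate source and the interval values below are illustrative placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object TriggerModes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("TriggerModes").getOrCreate()

    // Unbounded input: the built-in rate source emits synthetic rows continuously.
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    // Micro-batch model: plan and run one batch every 5 seconds.
    val query = stream.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      // Continuous model (Spark 2.3+) would instead use:
      //   .trigger(Trigger.Continuous("1 second"))  // 1 s checkpoint interval
      .start()

    query.awaitTermination()
  }
}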
Streaming DataFrame
● Basic data structure of Structured Streaming
● Represents an unbounded table
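A minimal Structured Streaming sketch of this unbounded-table view (the socket host and port are placeholders):

import org.apache.spark.sql.SparkSession

object UnboundedTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("UnboundedTable").getOrCreate()
    import spark.implicits._

    // readStream yields a streaming DataFrame: an unbounded table of lines,
    // with new rows appended as data arrives on the socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Expressed the same way as a batch computation: groupBy + count.
    val wordCounts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    // "complete" mode re-emits the whole updated result table on each trigger.
    wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}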