Big Data Concepts - Spark & Streaming
Source: https://towardsdatascience.com/apache-spark-101-3f961c89b8c7
- Spark has two types of RDD operations: transformations and actions (see the sketch below the source link)
- Transformations are functions that take an RDD as input and produce one or more RDDs as output
- Actions are operations that access the actual data available in an RDD and return non-RDD values
- A transformation's output is an input to actions
Source: https://commandstech.com/spark-lazy-evaluation-with-example/
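A minimal sketch of the transformation/action distinction in local PySpark (the data and names are made up for illustration): the transformations only build a lineage lazily, and nothing executes until an action runs:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "transformations-vs-actions")

    nums = sc.parallelize([1, 2, 3, 4, 5])        # source RDD
    squares = nums.map(lambda x: x * x)           # transformation: lazy, returns a new RDD
    evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still nothing has run

    # Actions evaluate the lineage and return non-RDD values:
    print(evens.collect())   # [4, 16]
    print(squares.count())   # 5

    sc.stop()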
Big Data on the Cloud: Spark & Streaming
Batch processing:
- Requires a set of data that is collected over time
- Handles a large batch of data
- Processes over all or most of the data
- Latency of the batch processing model will be in minutes to hours
- Lengthy process, meant for large quantities of information that aren't time-sensitive

Stream processing:
- Requires data to be fed into an analytics tool, often in micro-batches, and in real time
- Handles individual records or micro-batches of a few records
- Processes data over a rolling window or the most recent record
- Latency will be in seconds or milliseconds
- Processing is fast, meant for information that is needed immediately
Source: https://k21academy.com/microsoft-azure/data-engineer/batch-processing-vs-stream-processing/
Streaming - Architecture
Source: https://www.spec-india.com/blog/kinesis-vs-kafka
Kafka
- Apache Kafka (originally developed at LinkedIn) is a distributed event store and stream-processing platform
- Its core architectural concept is an immutable log of messages that can be organized into topics for consumption by multiple users or applications (a minimal producer sketch follows the source link)
Source: https://soshace.com/the-ultimate-introduction-to-kafka-with-javascript/
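A minimal producer sketch, assuming the third-party kafka-python client and a broker at localhost:9092 (both placeholders); each send appends a record to the end of the topic's log, and existing records are never modified:

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for i in range(3):
        # Each record is appended to the "page-views" log (hypothetical topic).
        producer.send("page-views", f"event-{i}".encode("utf-8"))
    producer.flush()  # block until the appends are acknowledged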
Kafka
- Kafka stores messages in topics that are partitioned and replicated across multiple brokers in a cluster
- Producers send messages to topics, from which consumers read
- Within a topic, partitions can be added to improve scalability
- Consumers read messages from partitions within a topic
Kafka – Consumer and Consumer Groups
- Kafka Consumers are part of a Consumer group (a consumer-group sketch follows the source link)
- Question: What is a "Consumer" an example of?
- With a single consumer, Consumer 1 will get all messages from all four T1 partitions; what happens if Consumer 1 cannot keep up with all the partitions?
- Adding another Consumer to the group allows the two Consumers to consume the messages from the partitions more efficiently
- Two Consumer groups can independently process messages from the T1 partitions
Source: https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/ch04.html
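A consumer-group sketch under the same assumptions (kafka-python, placeholder broker and group name): running this script twice with the same group_id splits the T1 partitions between the two processes, while a different group_id receives its own full copy of the messages:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "T1",                                # the topic from the slide's example
        group_id="group-1",                  # hypothetical consumer group
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",        # start from the oldest retained message
    )
    for record in consumer:
        print(record.partition, record.offset, record.value)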
Kafka Microservices Architecture - Example
- A system centers on an Orders Service, which exposes a REST interface to POST and GET orders
- Posting an order creates an event in Kafka that is recorded in the topic orders
- This is picked up by different validation engines (Fraud Service, Inventory Service, and Order Details Service), which validate the order in parallel, each emitting a PASS or FAIL based on whether its validation succeeds (the first step is sketched below the source link)
Source: https://docs.confluent.io/platform/current/tutorials/examples/microservices-orders/docs/index.html
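A sketch of the first step only, posting an order as an event on the orders topic (kafka-python again; the payload fields are invented):

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    # A validation service (Fraud, Inventory, Order Details) would consume this
    # event from the "orders" topic and emit PASS or FAIL to a results topic.
    order = {"id": "o-123", "customer": "c-42", "amount": 99.95}  # hypothetical payload
    producer.send("orders", order)
    producer.flush()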
Kafka – Benefits
- Scalable
  - Kafka's partitioned log model allows data to be distributed across multiple servers, making it scalable beyond what would normally fit on a single server
- Fast
  - Kafka decouples data streams, so there is very low latency, making it extremely fast
- Durable
  - It helps protect against server failure, making the data very fault-tolerant and durable
- Performance
  - It is stable, provides reliable durability, has a flexible publish-subscribe/queue that scales well, has robust replication, provides producers with tunable consistency guarantees, and preserves order at the topic-partition level
- Reacting in real time
  - Kafka is a big data technology that enables you to process data in motion and quickly determine what is working and what is not
Source: https://scalac.io/blog/what-is-apache-kafka-and-what-are-kafka-use-cases/
Kinesis
- Kinesis makes it easy to collect, process, and analyze real-time streaming data so you can get timely insights and react quickly to new information
- It offers key capabilities to cost-effectively process streaming data at any scale, along with the flexibility to choose the tools that best suit the requirements of your application
- You can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning, analytics, and other applications
- It enables you to process and analyze data as it arrives and respond instantly, instead of having to wait until all your data is collected before processing can begin (a minimal ingestion sketch follows)
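A minimal ingestion sketch with boto3 (stream name, region, and payload are placeholders; the stream must already exist):

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    kinesis.put_record(
        StreamName="clickstream",                  # hypothetical stream
        Data=b'{"page": "/home", "user": "u-7"}',  # one record, up to 1 MB
        PartitionKey="u-7",  # hashed to choose which shard receives the record
    )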
Kinesis Capabilities
- Real-time
  - Enables ingesting data in real time, so you can run analytics queries instantly without waiting for your data to accumulate for later analysis
- Fully managed
  - You do not have to maintain servers or run any software upgrades
- Scalable
  - Automatically scales up and down and can handle any amount of incoming data
Kinesis Data Streams
- Data producers enter records into Kinesis Data Streams (KDS)
- AWS offers the Kinesis Producer Library to simplify producer application development
Kinesis Data Streams - Producers
- Producers enter records into Kinesis streams, which consist of shards
- Each shard provides 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second, and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second
- The data capacity of your stream is a function of the number of shards that you specify for the data stream
- The total capacity of the Kinesis stream is the sum of the capacities of all its shards (a worked example follows)
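A quick worked example of the capacity math, using the per-shard limits above: a stream provisioned with 4 shards can absorb up to 4 × 1 MB/s = 4 MB/s and 4 × 1,000 = 4,000 records/s of writes, and serve up to 4 × 2 MB/s = 8 MB/s of reads.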
Kinesis Data Streams - Consumers
- Consumers are applications that process all data from a Kinesis data stream
- Each consumer gets its own 2 MB/sec allotment of read throughput, allowing multiple consumers to read data from the same stream in parallel without contending for read throughput with other consumers (a minimal read loop is sketched below)
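A minimal read loop with boto3 under the same placeholder names (this is the basic shard-iterator path, not the enhanced fan-out API):

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    shard_iterator = kinesis.get_shard_iterator(
        StreamName="clickstream",
        ShardId="shardId-000000000000",    # first shard of the stream
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest available record
    )["ShardIterator"]

    while shard_iterator:
        resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
        for record in resp["Records"]:
            print(record["PartitionKey"], record["Data"])
        # A real consumer would sleep briefly between empty reads.
        shard_iterator = resp.get("NextShardIterator")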
Kinesis or Kafka?
- Apache Kafka
  - Gives full flexibility and all the advantages of the latest Kafka versions, but requires more effort to manage
- Amazon Kinesis
  - Simplifies the introduction of streaming technology for those who don't have their own resources to manage Kafka; however, Kinesis has many more limitations than Apache Kafka
Streaming – Real-time Analytics
Spark Streaming & Data Sources
- Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis
- This processed data can be pushed out to file systems, databases, and live dashboards
Source: https://www.databricks.com/glossary/what-is-spark-streaming
Spark Streaming – How does it work?
- Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches
- It provides a high-level abstraction called a DStream (discretized stream), which represents a continuous stream of data
- Internally, a DStream is represented as a sequence of RDDs
Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html#overview
Spark Streaming – How does it work?
- The data stream is divided into batches based on a time interval (the batch interval), and each batch produces an RDD
- Each RDD contains the records received during that batch interval
Source: https://www.javacodegeeks.com/2016/04/fast-scalable-streaming-applications-mapr-streams-spark-streaming-mapr-db.html
Spark Streaming – Simple example
- Wordcount.py, streaming style; the script starts with these imports (the full script is sketched after the source link):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

- In a separate window, run Netcat (a small utility found on most Linux/Unix systems) to write to port 9999:

    nc -lk 9999
Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html#overview
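The rest of the script, following the linked programming guide (1-second batch interval; each batch of lines from the socket becomes an RDD):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "NetworkWordCount")  # at least 2 threads: receiver + worker
    ssc = StreamingContext(sc, 1)                      # 1-second batch interval

    lines = ssc.socketTextStream("localhost", 9999)    # what you type into nc arrives here
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda x, y: x + y))
    counts.pprint()          # print each batch's counts to the console

    ssc.start()              # start receiving and processing
    ssc.awaitTermination()   # run until stopped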
Spark Streaming – Simple example
- The socket source in the script above can be replaced by a streaming technology like Kafka or Kinesis
Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html#overview
Kinesis Data Streams
- With Spark on EMR, you can create a DStream that connects to a Kinesis Data Stream and process the data as RDDs (sketched below)
[Diagram: data producers feed a Kinesis Data Stream, which Spark Streaming processes into reports, visualizations, etc.]
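A sketch of that DStream with the legacy pyspark.streaming.kinesis connector (Spark 2.x; it requires the spark-streaming-kinesis-asl package on the classpath, and all names here are placeholders):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

    sc = SparkContext(appName="KinesisDemo")
    ssc = StreamingContext(sc, 10)  # 10-second batches

    # Each batch of records pulled from the shards becomes one RDD in this DStream.
    records = KinesisUtils.createStream(
        ssc,
        kinesisAppName="kinesis-demo",  # also names the DynamoDB checkpoint table
        streamName="clickstream",       # hypothetical stream
        endpointUrl="https://kinesis.us-east-1.amazonaws.com",
        regionName="us-east-1",
        initialPositionInStream=InitialPositionInStream.LATEST,
        checkpointInterval=10,
    )
    records.pprint()

    ssc.start()
    ssc.awaitTermination()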
Kafka + Spark Streaming
- Combining this all together: Kafka handles ingestion, and Spark Streaming consumes and processes the stream (one possible wiring is sketched below)
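One way to wire the two together, using the Spark 2.x direct Kafka DStream API (removed in Spark 3, where the Structured Streaming Kafka source replaces it; broker and topic are placeholders):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaOrdersCount")
    ssc = StreamingContext(sc, 5)  # 5-second batches

    # Direct stream: Spark reads the topic's partitions itself, one RDD per batch.
    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["orders"],
        kafkaParams={"metadata.broker.list": "localhost:9092"},
    )
    # Each element is a (key, value) pair; count the orders seen in each batch.
    stream.map(lambda kv: kv[1]).count().pprint()

    ssc.start()
    ssc.awaitTermination()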
- There are three deliverables plus an opportunity for a bonus point in this activity