Big Data Concepts - Spark & Streaming

1

u Due: Assignments > Activity #17 - Wordcount with Spark


u Grad Project & Term Project Check-in #2 this week
Project - Presentation (Due 11/17) 2

u A 10-15 minute presentation delivered to the class. The presentation should cover:
u Overview of the project, AWS technologies used
u A polished, working demo running on AWS
u Estimated costs or other noted advantages
u Concluding sales pitch summarizing your recommendation/findings
u Presentation (minimally 8 slides) should be in PPT or PDF format
u Everyone presents
u Presentations will be on 11/17 and 11/22
u 11/17 – Teams 1, 2 and 4 (need 1 more)
Project - Whitepaper (Due 12/2) 3
u A white paper covering the details of the project, including:
u Overview and goals of the project
u Architecture diagram and detailed description of AWS technologies used
u Detailed breakdown of estimated costs or other noted benefits of selected
technologies
u Lessons learned along the way. What would you do differently if you had
more time or had to do this over (e.g. different technology choices)?
u White paper should be minimally 5 pages, double-spaced, 12pt font. One
diagram of a half page can count towards the 5 pages minimum
Project - Submission Deliverables 4

u For each team, the presentation and white paper must be uploaded to
Assignments > Term Project > Project Submission
u All code developed for this project, sample data (if any) and other
relevant files must be uploaded to Github Classroom
u You can (re)use open-source code as part of your application, but you
need to give proper credit to the original authors
u For the presentations (proposal and final) and whitepaper, we will be
using Google Docs so team progress and participation can be tracked
u Grading Rubric:
u Contains details for each deliverable of the project
u See Content > Project > Term Project Grade Rubric
Setup Spark Cluster 5

u First - Do Assignments > Big Data > Setup EMR


u Stop after slide #3 (on which you click “Create Cluster”); you are done for
now (no screenshots needed)
u This will take several minutes to run
u More later
6
u Spark is an open-source, distributed processing system used for big data workloads
u It is an enhancement to Hadoop’s MapReduce
u The primary difference between Spark and MapReduce is that Spark processes and
retains data in memory for subsequent steps, whereas MapReduce processes data
on disk

Source: https://towardsdatascience.com/apache-spark-101-3f961c89b8c
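
As a rough illustration of this in-memory difference, here is a minimal PySpark sketch (the log file path is hypothetical): caching an RDD lets later steps reuse it from memory, whereas each MapReduce step would write to and read back from disk.

from pyspark import SparkContext

sc = SparkContext("local[2]", "CacheExample")

# Load a (hypothetical) log file and filter it lazily
errors = sc.textFile("logs.txt").filter(lambda line: "ERROR" in line)
errors.cache()  # keep the filtered RDD in memory after it is first computed

# Both actions below reuse the in-memory data from the first computation
print(errors.count())
print(errors.take(5))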
7
u Spark has 2 types of RDD operations: Transformations and Actions
u Transformations are functions that take an RDD as the input and
produce one or many RDDs as the output
u Actions are operations that access the actual data available in an RDD
and return non-RDD values
u A Transformation’s output is an input of Actions

Source: https://commandstech.com/spark-lazy-evaluation-with-example/
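
To make the distinction concrete, a minimal sketch: transformations are lazy and only build up a lineage of RDDs; nothing executes until an action is called.

from pyspark import SparkContext

sc = SparkContext("local[2]", "TransformVsAction")
nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformations: return new RDDs and are evaluated lazily
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger the actual computation and return non-RDD values
print(evens.collect())  # [4, 16]
print(evens.count())    # 2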
Big Data on the Cloud: Spark & Streaming

SWEN 514/614: Engineering Cloud Software Systems

Department of Software Engineering


Rochester Institute of Technology
Big Data Streaming 11

u Big Data Streaming is a process in which large streams of real-time
data are processed with the aim of extracting insights and useful
trends from them
u A continuous stream of unstructured data is sent into memory for
analysis before being stored to disk
u This happens across a cluster of servers, and speed matters the most
in big data streaming
u The value of data, if not processed
quickly, decreases with time
Streaming vs. Batch 12

u Batch processing:
u Requires a set of data that is collected over time
u Handles a large batch of data
u Processes over all or most of the data
u Latency will be in minutes to hours
u Lengthy process, meant for large quantities of information that aren’t time-sensitive
u Stream processing:
u Requires data to be fed into an analytics tool, often in micro-batches, and in real-time
u Handles individual records or micro-batches of few records
u Processes over data in a rolling window or the most recent record
u Latency will be in seconds or milliseconds
u Processing is fast, meant for information that is needed immediately
Source: https://k21academy.com/microsoft-azure/data-engineer/batch-processing-vs-stream-processing/
Streaming - Architecture 13

u Some examples of streaming data include IoT sensors, server and
security logs, real-time advertising, and click-stream data from apps and
websites
u Data streams from one or more message brokers need to be
aggregated, transformed, and structured before the data can be analyzed
Source: https://www.upsolver.com/blog/streaming-data-architecture-key-components
Examples of Streaming Data 14
u A financial institution tracks changes in the stock market in real time, computes value-at-risk, and
automatically rebalances portfolios based on stock price movements
u Sensors in transportation vehicles, industrial equipment, and farm machinery send data to a
streaming application, which monitors performance, detects any potential defects in advance,
and places a spare part order automatically preventing equipment down time
u A real-estate website tracks a subset of data from consumers’ mobile devices and makes real-
time recommendations of properties to visit based on their geo-location
u A solar power company has to maintain power throughput for
its customers, or pay penalties. It implemented a streaming
data application that monitors all of the panels in the field, and
schedules service in real time, thereby minimizing the periods of
low throughput from each panel and the associated penalty
payouts.
u An online gaming company collects streaming data about
player-game interactions, and feeds the data into its gaming
platform. It then analyzes the data in real-time, offers incentives
and dynamic experiences to engage its players
Source: https://aws.amazon.com/streaming-data/
Streaming Technologies 15
u Apache Kafka and Amazon Kinesis are data ingest frameworks/platforms that
are meant to help with ingesting data durably, reliably, and with scalability in
mind
u Both offerings share common core concepts, including replication,
sharding/partitioning, and application components (consumers and producers)
u Both handle real-time data feeds and are capable of ingesting
thousands of data feeds simultaneously to support high-speed data processing

Source: https://www.spec-india.com/blog/kinesis-vs-kafka
Kafka 16
u Apache Kafka (originally developed at LinkedIn) is a distributed event store
and stream-processing platform
u Its core architectural concept is an immutable log of messages that can be
organized into topics for consumption by multiple users or applications

u It can be deployed on bare-metal hardware, virtual machines, and
containers (on-premises and in the cloud)
Kafka 17
u Kafka is used by thousands of companies including over 80% of the
Fortune 100

Source: https://soshace.com/the-ultimate-introduction-to-kafka-with-javascript/
Kafka 18
u Kafka stores messages in topics that are partitioned and replicated across multiple
brokers in a cluster
u Producers send messages to topics from which consumers read
u Within a topic, partitions can be added to improve scalability
u Consumers read messages from partitions within a topic
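
As a hedged sketch of these roles (assuming the kafka-python client and a broker at localhost:9092; the topic name "orders" is illustrative), a producer and a consumer look roughly like this:

from kafka import KafkaProducer, KafkaConsumer

# Producer: send messages to a topic; the key determines the target partition
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", key=b"order-123", value=b'{"item": "widget", "qty": 2}')
producer.flush()

# Consumer: read messages from the topic's partitions
consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)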
Kafka – Consumer and Consumer Groups 19
u Kafka Consumers are part of a Consumer group
u Question: What is a “Consumer” an example of?

u Consumer 1 alone will get all messages from all four T1 partitions. What happens if Consumer 1 cannot keep up with all the partitions?
u Adding another Consumer to the group allows the 2 Consumers to consume the messages from the partitions more efficiently
u Two Consumer groups can independently process messages from the T1 partitions
Source: https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/ch04.html
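
A hedged sketch with kafka-python (the topic "T1" and the group name are illustrative): running this script twice starts two consumers in the same group, and Kafka rebalances the T1 partitions between them; a consumer with a different group_id would independently receive every message.

from kafka import KafkaConsumer

# Consumers sharing a group_id split the topic's partitions among themselves
consumer = KafkaConsumer("T1",
                         bootstrap_servers="localhost:9092",
                         group_id="billing-service")
for msg in consumer:
    print("partition", msg.partition, ":", msg.value)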
Kafka Microservices Architecture - Example 20
u A system centers on an Orders Service which exposes a REST interface to POST and GET Orders
u Posting an Order creates an event in Kafka that is recorded in the topic orders
u This is picked up by different validation engines (Fraud Service, Inventory Service and Order Details
Service), which validate the order in parallel, emitting a PASS or FAIL based on whether each
validation succeeds

Source: https://docs.confluent.io/platform/current/tutorials/examples/microservices-orders/docs/index.html
Kafka – Benefits 21
u Scalable
u Kafka’s partitioned log model allows data to be distributed across multiple servers, making it
scalable beyond what would normally fit on a single server
u Fast
u Kafka decouples data streams so there is very low latency, making it extremely fast
u Durable
u It helps protect against server failure, making the data very fault-tolerant and durable
u Performance
u It is stable, provides reliable durability, has a flexible publish-subscribe/queue
that scales well, has robust replication, provides producers with tunable
consistency guarantees, and provides a preserved order at the topic-partition level
u Reacting in real-time
u Kafka is a big data technology that enables you to process data in
motion and quickly determine what is working and what is not
Source: https://scalac.io/blog/what-is-apache-kafka-and-what-are-kafka-use-cases/
Kinesis 22
u Kinesis makes it easy to collect, process, and analyze real-time, streaming
data so you can get timely insights and react quickly to new information
u It offers key capabilities to cost-effectively process streaming data at any
scale, along with the flexibility to choose the tools that best suit the
requirements of your application
u You can ingest real-time data such as video, audio, application logs,
website clickstreams, and IoT telemetry data for machine learning,
analytics, and other applications
u It enables you to process and analyze data as it
arrives and respond instantly instead of having to
wait until all your data is collected before the
processing can begin
Kinesis Capabilities 23

u Real-time
u Enables ingesting data in real-time, so you can run analytics queries instantly without
waiting for your data to accumulate before analyzing it
u Fully managed
u You do not have to maintain servers or run any software upgrades
u Scalable
u Automatically scales up and down and can handle any amount of incoming data
Kinesis Data Streams 24
u Data Producers enter the records into Kinesis Data Streams (KDS)
u AWS offers the Kinesis Producer Library for simplifying producer application
development
Kinesis Data Streams - Producers 25
u Producers enter the records into Kinesis
Data Streams, which consist of shards
u Each shard supports up to 5 read transactions
per second, up to a maximum total data read
rate of 2 MB per second, and up to 1,000
records per second for writes, up to a
maximum total data write rate of 1 MB
per second
u The data capacity of your stream is a
function of the number of shards that you
specify for the data stream
u The total capacity of the Kinesis stream is
the sum of the capacities of all shards
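
A hedged producer sketch using boto3 (the stream name and region are hypothetical); the PartitionKey is hashed to pick the shard that receives the record:

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Each record is routed to a shard by hashing its PartitionKey
kinesis.put_record(
    StreamName="sensor-stream",
    Data=json.dumps({"sensor_id": "s1", "temp": 72.5}).encode("utf-8"),
    PartitionKey="s1")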
Kinesis Data Streams - Consumers 26
u Consumers are applications that
process all data from a Kinesis data
stream
u Each consumer can get its own 2 MB/sec
allotment of read throughput (a feature AWS
calls enhanced fan-out), allowing multiple
consumers to read data from the same
stream in parallel, without contending
for read throughput with other
consumers
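
A hedged polling sketch with boto3 (names hypothetical); production consumers more typically use the Kinesis Client Library or enhanced fan-out:

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Get an iterator for one shard, starting from the oldest available record
it = kinesis.get_shard_iterator(
    StreamName="sensor-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON")["ShardIterator"]

# Poll the shard; the response also carries NextShardIterator for the next call
resp = kinesis.get_records(ShardIterator=it, Limit=100)
for record in resp["Records"]:
    print(record["PartitionKey"], record["Data"])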
Kinesis or Kafka? 27
u Apache Kafka
u Gives full flexibility and all
advantages of latest Kafka
versions, but requires more effort in
its management
u Amazon Kinesis
u Simplifies the process of
introducing streaming technology
for those who don’t have their own
resources to manage Kafka;
however, Kinesis has many more
limitations than Apache Kafka

u Amazon MSK (Amazon Managed Streaming for Apache Kafka)
u An intermediate solution that allows using Kafka as an AWS service, which simplifies
the setup process and offloads DevOps management, but still doesn’t have full
compatibility with the latest Apache Kafka versions
Source: https://rockset.com/blog/kafka-vs-kinesis-choosing-the-best-data-streaming-solution/
Spark Streaming 28

u Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant stream processing of live
data streams

[Diagram: streaming data sources → Spark Streaming → real-time analytics]
Spark Streaming & Data Sources 29
u Spark Streaming is an extension of the core Spark API that allows data engineers and
data scientists to process real-time data from various sources including (but not limited
to) Kafka, Flume, and Amazon Kinesis
u This processed data can be pushed out to file systems, databases, and live dashboards

Source: https://www.databricks.com/glossary/what-is-spark-streaming
Spark Streaming – How does it work? 30
u Spark Streaming receives live input data streams and divides the data into
batches, which are then processed by the Spark engine to generate the final
stream of results in batches
u It provides a high-level abstraction called DStream (discretized stream), which
represents a continuous stream of data
u Internally, a DStream is represented as a sequence of RDDs

Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html#overview
Spark Streaming – How does it work? 31
u The data stream is divided into batches based on a time interval (the
batch interval), each of which results in an RDD
u Each RDD contains the records received during that batch interval

Source: https://www.javacodegeeks.com/2016/04/fast-scalable-streaming-applications-mapr-streams-spark-streaming-mapr-db.html
Spark Streaming – Simple example 32
u Wordcount.py - Streaming style

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# This is the same code from our previous wordcount example using an RDD
wordCounts = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y)

# Print the first 10 elements of each batch's RDD
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Loop, waiting for the computation to terminate

u In a separate window, run Netcat (a small utility found on most Linux/Unix
systems); it listens on port 9999 and sends whatever you type to the
connected Spark application:

nc -lk 9999

Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html#overview
Spark Streaming – Simple example 33

Window #1 running Netcat:

# TERMINAL 1:
# Running Netcat
$ nc -lk 9999
hello world

Window #2 running WordCount.py:

$ spark-submit Wordcount.py localhost 9999
...
-------------------------------------------
Time: 2014-10-14 15:25:21
-------------------------------------------
(hello,1)
(world,1)

u Netcat can be replaced by a streaming technology like Kafka or Kinesis
Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html#overview
Kinesis Data Streams 34
u With Spark on EMR, you can create a DStream that connects to a Kinesis Data
Stream, which can process the data as RDDs

from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

# Creates a DStream from a Kinesis Data Stream
kinesisStream = KinesisUtils.createStream(
    streamingContext, [Kinesis app name], [Kinesis stream name],
    [endpoint URL], [region name], [initial position],
    [checkpoint interval], StorageLevel.MEMORY_AND_DISK_2)
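
For illustration, a hedged version with the placeholders filled in (all values hypothetical; this API targets Spark 2.x with the spark-streaming-kinesis-asl package on the classpath):

from pyspark import StorageLevel
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

kinesisStream = KinesisUtils.createStream(
    ssc, "MyKinesisApp", "sensor-stream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST,
    10,  # checkpoint interval in seconds
    StorageLevel.MEMORY_AND_DISK_2)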
Kinesis + Spark Streaming 35
u Combining this all together

[Diagram: data producers → Kinesis → Spark Streaming → reports, visualizations, etc.]
Kafka + Spark Streaming 36
u Combining this all together

[Diagram: data producers → Kafka (ingestion) → Spark Streaming (in-memory first-pass aggregation) → durable aggregation → reports, visualizations, etc.]

u ZooKeeper is primarily used to track the status of nodes in the Kafka
cluster and maintain a list of Kafka topics and messages
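
A hedged sketch of the Spark side of this pipeline (Spark 2.x API, where pyspark.streaming.kafka ships with the spark-streaming-kafka package and was removed in Spark 3; the broker address and "orders" topic are hypothetical):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", "KafkaAggregation")
ssc = StreamingContext(sc, 10)  # 10-second batch interval

# Create a DStream that reads directly from the Kafka topic's partitions
stream = KafkaUtils.createDirectStream(
    ssc, ["orders"], {"metadata.broker.list": "localhost:9092"})

# Each record is a (key, value) pair; do a first-pass aggregation per batch
counts = stream.map(lambda kv: (kv[1], 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()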
Weather modelling - Exercise 37
u This exercise goes beyond Wordcount and provides a more realistic use case using
Spark
u We will use data from NOAA (National Oceanic and Atmospheric Administration)
u This data is from Western-European weather stations from 1980 to 2014
u Follow the instructions in Assignments > Activity #18 - Evaluating Weather Data with Spark

u There are 3 deliverables plus an opportunity for a bonus point in this activity
