Big Data Concepts - Spark & Streaming

1

u Due: Assignments > Activity #17 - Wordcount with Spark


u Grad Project & Term Project Check-in #2 this week
Project - Presentation (Due 11/17) 2

u A 10-15 minute presentation delivered to the class. The presentation should cover:
u Overview of the project, AWS technologies used
u A polished, working demo running on AWS
u Estimated costs or other noted advantages
u Concluding sales pitch summarizing your recommendation/findings
u Presentation (minimally 8 slides) should be in PPT or PDF format
u Everyone presents
u Presentations will be on 11/17 and 11/22
u 11/17 – Teams 1, 2 and 4 (need 1 more)
Project - Whitepaper (Due 12/2) 3
u A white paper covering the details of the project, including:
u Overview and goals of the project
u Architecture diagram and detailed description of AWS technologies used
u Detailed breakdown of estimated costs or other noted benefits of selected
technologies
u Lessons learned along the way. What would you do differently if you had
more time or had to do this over (e.g. different technology choices)?
u White paper should be minimally 5 pages, double-spaced, 12pt font. One
diagram of a half page can count towards the 5 pages minimum
Project - Submission Deliverables 4

u For each team, the presentation and white paper must be uploaded to
Assignments > Term Project > Project Submission
u All code developed for this project, sample data (if any) and other
relevant files must be uploaded to Github Classroom
u You can (re)use open-source code as part of your application, but you
need to give proper credit to the original authors
u For the presentations (proposal and final) and whitepaper, we will be
using Google Docs so team progress and participation can be tracked
u Grading Rubric:
u Contains details for each deliverable of the project
u See Content > Project > Term Project Grade Rubric
Setup Spark Cluster 5

u First - Do Assignments > Big Data > Setup EMR


u Stop after slide #3 (on which you click “Create Cluster”); you are done for
now (no screenshots needed)
u This will take several minutes to run
u More later
6
u Spark is an open-source, distributed processing system used for big data workloads
u It is an enhancement to Hadoop’s MapReduce
u The primary difference between Spark and MapReduce is that Spark processes and
retains data in memory for subsequent steps, whereas MapReduce processes data
on disk

Source: https://towardsdatascience.com/apache-spark-101-3f961c89b8c
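
As a rough illustration of this in-memory difference, here is a minimal PySpark sketch (the log file path is hypothetical): caching an RDD lets later steps reuse it from memory, whereas each MapReduce step would write to and read back from disk.

from pyspark import SparkContext

sc = SparkContext("local[2]", "CacheExample")

# Load a (hypothetical) log file and filter it lazily
errors = sc.textFile("logs.txt").filter(lambda line: "ERROR" in line)
errors.cache()  # keep the filtered RDD in memory after it is first computed

# Both actions below reuse the in-memory data from the first computation
print(errors.count())
print(errors.take(5))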
7
u Spark has 2 types of RDD operations: Transformations and Actions
u Transformations are functions that take an RDD as the input and
produce one or many RDDs as the output
u Actions are operations that access the actual data available in an RDD
and return non-RDD values
u A Transformation’s output is an input of Actions

Source: https://commandstech.com/spark-lazy-evaluation-with-example/
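
To make the distinction concrete, a minimal sketch: transformations are lazy and only build up a lineage of RDDs; nothing executes until an action is called.

from pyspark import SparkContext

sc = SparkContext("local[2]", "TransformVsAction")
nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformations: return new RDDs and are evaluated lazily
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger the actual computation and return non-RDD values
print(evens.collect())  # [4, 16]
print(evens.count())    # 2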
Big Data on the Cloud: Spark & Streaming

SWEN 514/614: Engineering Cloud Software Systems

Department of Software Engineering


Rochester Institute of Technology
Big Data Streaming 11

u Big Data Streaming is a process in which large streams of real-time
data are processed with the aim of extracting insights and useful
trends from them
u A continuous stream of unstructured data is sent into memory for
analysis before being stored to disk
u This happens across a cluster of servers, and speed matters the most
in big data streaming
u The value of data, if not processed
quickly, decreases with time
Streaming vs. Batch 12

u Batch processing:
u Requires a set of data that is collected over time
u Handles a large batch of data
u Processes over all or most of the data
u Latency will be in minutes to hours
u Lengthy process, meant for large quantities of information that aren’t time-sensitive
u Stream processing:
u Requires data to be fed into an analytics tool, often in micro-batches, and in real-time
u Handles individual records or micro-batches of few records
u Processes over data in a rolling window or the most recent record
u Latency will be in seconds or milliseconds
u Processing is fast, meant for information that is needed immediately
Source: https://k21academy.com/microsoft-azure/data-engineer/batch-processing-vs-stream-processing/
Streaming - Architecture 13

u Some examples of streaming data include IoT sensors, server and
security logs, real-time advertising, and click-stream data from apps and
websites
u Data streams from one or more message brokers need to be
aggregated, transformed, and structured before the data can be analyzed
Source: https://www.upsolver.com/blog/streaming-data-architecture-key-components
Examples of Streaming Data 14
u A financial institution tracks changes in the stock market in real time, computes value-at-risk, and
automatically rebalances portfolios based on stock price movements
u Sensors in transportation vehicles, industrial equipment, and farm machinery send data to a
streaming application, which monitors performance, detects any potential defects in advance,
and places a spare part order automatically preventing equipment down time
u A real-estate website tracks a subset of data from consumers’ mobile devices and makes real-
time recommendations of properties to visit based on their geo-location
u A solar power company has to maintain power throughput for
its customers, or pay penalties. It implemented a streaming
data application that monitors all of the panels in the field, and
schedules service in real time, thereby minimizing the periods of
low throughput from each panel and the associated penalty
payouts.
u An online gaming company collects streaming data about
player-game interactions, and feeds the data into its gaming
platform. It then analyzes the data in real-time, offers incentives
and dynamic experiences to engage its players
Source: https://aws.amazon.com/streaming-data/
Streaming Technologies 15
u Apache Kafka and Amazon Kinesis are data ingest frameworks/platforms that
are meant to help with ingesting data durably, reliably, and with scalability in
mind
u Both offerings share common core concepts, including replication,
sharding/partitioning, and application components (consumers and producers)
u Both handle real-time data feeds and are capable of ingesting
thousands of data feeds simultaneously to support high-speed data processing

Source: https://www.spec-india.com/blog/kinesis-vs-kafka
Kafka 16
u Apache Kafka (originally developed at LinkedIn) is a distributed event store
and stream-processing platform
u Its core architectural concept is an immutable log of messages that can be
organized into topics for consumption by multiple users or applications

u It can be deployed on bare-metal hardware, virtual machines, and
containers (on-premises and in the cloud)
Kafka 17
u Kafka is used by thousands of companies including over 80% of the
Fortune 100

Source: https://soshace.com/the-ultimate-introduction-to-kafka-with-javascript/
Kafka 18
u Kafka stores messages in topics that are partitioned and replicated across multiple
brokers in a cluster
u Producers send messages to topics from which consumers read
u Within a topic, partitions can be added to improve scalability
u Consumers read messages from partitions within a topic
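
As a hedged sketch of these roles (assuming the kafka-python client and a broker at localhost:9092; the topic name "orders" is illustrative), a producer and a consumer look roughly like this:

from kafka import KafkaProducer, KafkaConsumer

# Producer: send messages to a topic; the key determines the target partition
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", key=b"order-123", value=b'{"item": "widget", "qty": 2}')
producer.flush()

# Consumer: read messages from the topic's partitions
consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)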
Kafka – Consumer and Consumer Groups 19
u Kafka Consumers are part of a Consumer group
u Question: What is a “Consumer” an example of?

u Consumer 1 alone will get all messages from all four T1 partitions. What happens if Consumer 1 cannot keep up with all the partitions?
u Adding another Consumer to the group allows the 2 Consumers to consume the messages from the partitions more efficiently
u Two Consumer groups can independently process messages from the T1 partitions
Source: https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/ch04.html
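
A hedged sketch with kafka-python (the topic "T1" and the group name are illustrative): running this script twice starts two consumers in the same group, and Kafka rebalances the T1 partitions between them; a consumer with a different group_id would independently receive every message.

from kafka import KafkaConsumer

# Consumers sharing a group_id split the topic's partitions among themselves
consumer = KafkaConsumer("T1",
                         bootstrap_servers="localhost:9092",
                         group_id="billing-service")
for msg in consumer:
    print("partition", msg.partition, ":", msg.value)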
Kafka Microservices Architecture - Example 20
u A system centers on an Orders Service which exposes a REST interface to POST and GET Orders
u Posting an Order creates an event in Kafka that is recorded in the topic orders
u This is picked up by different validation engines (Fraud Service, Inventory Service and Order Details
Service), which validate the order in parallel, emitting a PASS or FAIL based on whether each
validation succeeds

Source: https://docs.confluent.io/platform/current/tutorials/examples/microservices-orders/docs/index.html
Kafka – Benefits 21
u Scalable
u Kafka’s partitioned log model allows data to be distributed across multiple servers, making it
scalable beyond what would normally fit on a single server
u Fast
u Kafka decouples data streams so there is very low latency, making it extremely fast
u Durable
u It helps protect against server failure, making the data very fault-tolerant and durable
u Performance
u It is stable, provides reliable durability, has a flexible publish-subscribe/queue
that scales well, has robust replication, provides producers with tunable
consistency guarantees, and provides a preserved order at the topic-partition level
u Reacting in real-time
u Kafka is a big data technology that enables you to process data in
motion and quickly determine what is working and what is not
Source: https://scalac.io/blog/what-is-apache-kafka-and-what-are-kafka-use-cases/
Kinesis 22
u Kinesis makes it easy to collect, process, and analyze real-time, streaming
data so you can get timely insights and react quickly to new information
u It offers key capabilities to cost-effectively process streaming data at any
scale, along with the flexibility to choose the tools that best suit the
requirements of your application
u You can ingest real-time data such as video, audio, application logs,
website clickstreams, and IoT telemetry data for machine learning,
analytics, and other applications
u It enables you to process and analyze data as it
arrives and respond instantly instead of having to
wait until all your data is collected before the
processing can begin
Kinesis Capabilities 23

u Real-time
u Enables ingesting data in real-time, so you can run analytics queries instantly without
waiting for your data to accumulate before analyzing it
u Fully managed
u You do not have to maintain servers or run any software upgrades
u Scalable
u Automatically scales up and down and can handle any amount of incoming data
Kinesis Data Streams 24
u Data Producers enter the records into Kinesis Data Streams (KDS)
u AWS offers the Kinesis Producer Library for simplifying producer application
development
Kinesis Data Streams - Producers 25
u Producers enter the records into Kinesis
Data Streams, which consist of shards
u Each shard supports up to 5 read transactions
per second, up to a maximum total data read
rate of 2 MB per second, and up to 1,000
records per second for writes, up to a
maximum total data write rate of 1 MB
per second
u The data capacity of your stream is a
function of the number of shards that you
specify for the data stream
u The total capacity of the Kinesis stream is
the sum of the capacities of all shards
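
A hedged producer sketch using boto3 (the stream name and region are hypothetical); the PartitionKey is hashed to pick the shard that receives the record:

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Each record is routed to a shard by hashing its PartitionKey
kinesis.put_record(
    StreamName="sensor-stream",
    Data=json.dumps({"sensor_id": "s1", "temp": 72.5}).encode("utf-8"),
    PartitionKey="s1")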
Kinesis Data Streams - Consumers 26
u Consumers are applications that
process all data from a Kinesis data
stream
u Each consumer can get its own 2 MB/sec
allotment of read throughput (a feature AWS
calls enhanced fan-out), allowing multiple
consumers to read data from the same
stream in parallel, without contending
for read throughput with other
consumers
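
A hedged polling sketch with boto3 (names hypothetical); production consumers more typically use the Kinesis Client Library or enhanced fan-out:

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Get an iterator for one shard, starting from the oldest available record
it = kinesis.get_shard_iterator(
    StreamName="sensor-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON")["ShardIterator"]

# Poll the shard; the response also carries NextShardIterator for the next call
resp = kinesis.get_records(ShardIterator=it, Limit=100)
for record in resp["Records"]:
    print(record["PartitionKey"], record["Data"])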
Kinesis or Kafka? 27
u Apache Kafka
u Gives full flexibility and all
advantages of latest Kafka
versions, but requires more effort in
its management
u Amazon Kinesis
u Simplifies the process of
introducing streaming technology
for those who don’t have their own
resources to manage Kafka;
however, Kinesis has many more
limitations than Apache Kafka

u Amazon MSK (Amazon Managed Streaming for Apache Kafka)
u An intermediate solution that allows using Kafka as an AWS service, which simplifies
the setup process and offloads DevOps management, but still doesn’t have full
compatibility with the latest Apache Kafka versions
Source: https://rockset.com/blog/kafka-vs-kinesis-choosing-the-best-data-streaming-solution/
Spark Streaming 28

u Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant stream processing of live
data streams

[Diagram: streaming data sources → Spark Streaming → real-time analytics]
Spark Streaming & Data Sources 29
u Spark Streaming is an extension of the core Spark API that allows data engineers and
data scientists to process real-time data from various sources including (but not limited
to) Kafka, Flume, and Amazon Kinesis
u This processed data can be pushed out to file systems, databases, and live dashboards

Source: https://www.databricks.com/glossary/what-is-spark-streaming
Spark Streaming – How does it work? 30
u Spark Streaming receives live input data streams and divides the data into
batches, which are then processed by the Spark engine to generate the final
stream of results in batches
u It provides a high-level abstraction called DStream (discretized stream), which
represents a continuous stream of data
u Internally, a DStream is represented as a sequence of RDDs

Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html#overview
Spark Streaming – How does it work? 31
u The data stream is divided into batches based on a time interval (the
batch interval), each of which results in an RDD
u Each RDD contains the records received during that batch interval

Source: https://www.javacodegeeks.com/2016/04/fast-scalable-streaming-applications-mapr-streams-spark-streaming-mapr-db.html
Spark Streaming – Simple example 32
u Wordcount.py - Streaming style

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# This is the same code from our previous wordcount example using an RDD
wordCounts = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y)

# Print the first 10 elements of each batch's RDD
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Loop, waiting for the computation to terminate

u In a separate window, run Netcat (a small utility found on most Linux/Unix
systems); it listens on port 9999 and sends whatever you type to the
connected Spark application:

nc -lk 9999

Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html#overview
Spark Streaming – Simple example 33

Window #1 running Netcat:

# TERMINAL 1:
# Running Netcat
$ nc -lk 9999
hello world

Window #2 running WordCount.py:

$ spark-submit Wordcount.py localhost 9999
...
-------------------------------------------
Time: 2014-10-14 15:25:21
-------------------------------------------
(hello,1)
(world,1)

u Netcat can be replaced by a streaming technology like Kafka or Kinesis
Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html#overview
Kinesis Data Streams 34
u With Spark on EMR, you can create a DStream that connects to a Kinesis Data
Stream, which can process the data as RDDs

from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

# Creates a DStream from a Kinesis Data Stream
kinesisStream = KinesisUtils.createStream(
    streamingContext, [Kinesis app name], [Kinesis stream name],
    [endpoint URL], [region name], [initial position],
    [checkpoint interval], StorageLevel.MEMORY_AND_DISK_2)
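
For illustration, a hedged version with the placeholders filled in (all values hypothetical; this API targets Spark 2.x with the spark-streaming-kinesis-asl package on the classpath):

from pyspark import StorageLevel
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

kinesisStream = KinesisUtils.createStream(
    ssc, "MyKinesisApp", "sensor-stream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST,
    10,  # checkpoint interval in seconds
    StorageLevel.MEMORY_AND_DISK_2)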
Kinesis + Spark Streaming 35
u Combining this all together

[Diagram: data producers → Kinesis → Spark Streaming → reports, visualizations, etc.]
Kafka + Spark Streaming 36
u Combining this all together

[Diagram: data producers → Kafka (ingestion) → Spark Streaming (in-memory first-pass aggregation) → durable aggregation → reports, visualizations, etc.]

u ZooKeeper is primarily used to track the status of nodes in the Kafka
cluster and maintain a list of Kafka topics and messages
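
A hedged sketch of the Spark side of this pipeline (Spark 2.x API, where pyspark.streaming.kafka ships with the spark-streaming-kafka package and was removed in Spark 3; the broker address and "orders" topic are hypothetical):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", "KafkaAggregation")
ssc = StreamingContext(sc, 10)  # 10-second batch interval

# Create a DStream that reads directly from the Kafka topic's partitions
stream = KafkaUtils.createDirectStream(
    ssc, ["orders"], {"metadata.broker.list": "localhost:9092"})

# Each record is a (key, value) pair; do a first-pass aggregation per batch
counts = stream.map(lambda kv: (kv[1], 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()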
Weather modelling - Exercise 37
u This exercise goes beyond Wordcount and provides a more realistic use case using
Spark
u We will use data from NOAA (National Oceanic and Atmospheric Administration)
u This data is from Western-European weather stations from 1980 to 2014
u Follow the instructions in Assignments > Activity #18 - Evaluating Weather Data with Spark

u There are 3 deliverables plus an opportunity for a bonus point in this activity
