CS 498 Week 12 Slides
• Apache Storm
• Twitter Heron
• Apache Flink
▪ As Hadoop ramped up to offer batch data availability, a growing need arose to provide data in real time for analytics and instant-feedback use cases
▪ Topologies
● graph of spouts and bolts that are connected with stream groupings
● runs indefinitely (no time/batch boundaries)
▪ Streams
● unbounded sequence of tuples that is processed and created in parallel in a
distributed fashion
▪ Spouts
● input source of streams in a topology
▪ Bolts
● processing unit that can perform transformations, filters, aggregations, joins, etc.
● sinks: a special type of bolt that has an output interface
How Did We Get Here?
▪ Hardware costs finally came in line with doing in-memory streaming for billions of events per day
[Diagram: a Hadoop batch pipeline (HDFS, transforms, joins, validation, aggregations, Druid) next to the equivalent Storm topology (spout → bolts → sink)]
[Diagram: a Storm topology running transforms, joins, and validation between a spout and a sink]
[Diagram: Storm cluster layout: a master node (Nimbus), worker processes, and ZooKeeper for coordination]
Storm Concepts
• Streams
• Unbounded sequences of tuples
• Spout
• Source of Streams
• E.g., Read from Twitter streaming API
• Bolts
• Processes input streams and produces new streams
• E.g., Functions, Filters, Aggregation, Joins
• Topologies
• Network of spouts and bolts
Storm Tasks
Writing the Storm Word Count Example
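Below is a minimal sketch of how such a word-count topology is wired together with Storm's Java API, in the style of the storm-starter WordCountTopology. RandomSentenceSpout, SplitSentence, and WordCount stand in for user-defined spout and bolt classes (a counting bolt is sketched later in this deck); the backtype.storm package names match the 0.x releases and were renamed to org.apache.storm in later versions.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout: the input source of the stream (here, random sentences).
        builder.setSpout("spout", new RandomSentenceSpout(), 5);

        // Bolt 1: split each sentence into words; shuffle grouping spreads
        // sentences across the split tasks at random.
        builder.setBolt("split", new SplitSentence(), 8)
               .shuffleGrouping("spout");

        // Bolt 2: count words; fields grouping on "word" guarantees that the
        // same word always goes to the same counting task.
        builder.setBolt("count", new WordCount(), 12)
               .fieldsGrouping("split", new Fields("word"));

        StormSubmitter.submitTopology("word-count", new Config(), builder.createTopology());
    }
}

The fields grouping is what makes the counts correct: every tuple for a given word lands on the same count task, so each task only needs a local in-memory map.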
Open Source Big Data
@Yahoo
Bobby (Robert) Evans
[email protected]
@bobbydata
Architect @ Yahoo
Provide a Hosted Platform for Yahoo
What We Do
• Yahoo Scale
• Make it Secure
• Make it Easy
Yahoo Scale
[Chart: largest cluster size (nodes) and total nodes across Yahoo's Storm deployment]
Yahoo Scale (Solving Hard Problems)
Network Topology Aware Scheduling
https://en.wikipedia.org/wiki/Network_topology
https://en.wikipedia.org/wiki/Knapsack_problem
Understanding Software and Hardware
State Storage (ZooKeeper):
Limited to disk write speed (80 MB/s typically)
Scheduling
• O(num_execs * resched_rate)
Supervisor
• O(num_supervisors * hb_rate)
Topology Metrics (worst case)
• O(num_execs * num_comps * num_streams * hb_rate)
Theoretical Limit:
(80 MB/s / 16 MB/s) * 240 nodes = 1,200 nodes
Apply it to Work Around Bottlenecks
Fix: Secure In-Memory Store for Worker Heartbeats (PaceMaker)
Removes Disk Limitation
Writes Scale Linearly
(but Nimbus still needs to read it all, ideally in 10 seconds or less)
A 240-node cluster's complete heartbeat state is 48 MB; gigabit Ethernet is about 125 MB/s
(10 s / (48 MB / 125 MB/s)) * 240 nodes ≈ 6,250 nodes
Theoretical Maximum Cluster Size
[Chart: ZooKeeper-based heartbeats top out around 1,200 nodes vs. about 6,250 nodes with PaceMaker over gigabit Ethernet]
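As a sanity check on the two limits above, here is the same arithmetic in code, using only the numbers quoted on these slides (the 16 MB/s value is read off the slide as the heartbeat write load of the 240-node baseline cluster):

public class ClusterSizeEstimate {
    public static void main(String[] args) {
        double baselineNodes = 240.0;

        // ZooKeeper-backed heartbeats: bounded by disk write speed.
        double diskWriteMBps = 80.0;     // typical ZooKeeper disk write speed
        double hbWriteLoadMBps = 16.0;   // heartbeat write load of the 240-node cluster
        double zkLimit = diskWriteMBps / hbWriteLoadMBps * baselineNodes;
        System.out.printf("ZooKeeper limit: %.0f nodes%n", zkLimit);        // ~1,200

        // PaceMaker: bounded by Nimbus reading all heartbeat state over
        // gigabit Ethernet within roughly 10 seconds.
        double hbStateMB = 48.0;         // complete heartbeat state of 240 nodes
        double gigabitMBps = 125.0;      // ~1 Gbit/s expressed in MB/s
        double readBudgetSec = 10.0;
        double pacemakerLimit = readBudgetSec / (hbStateMB / gigabitMBps) * baselineNodes;
        System.out.printf("PaceMaker limit: %.0f nodes%n", pacemakerLimit); // ~6,250
    }
}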
Make it Secure
Make it Easy
• Simple API
• Easy to Debug
• Easy to Setup
• Easy to Upgrade (no downtime ideally)
Example
[Diagram: word-count topology with an acker. The split bolt emits ["the"], ["cow"], ["over"], ["the"], ["moon"]; the count bolt's running state becomes ["the", 2], ["cow", 1], ["over", 1], ["moon", 1]; the acker tracks the tuple tree rooted at each spout tuple.]
Example (continued)
[Diagram: the same topology after the spout replays a tuple tree that was not fully acked. The count bolt's state is updated a second time: ["the"] goes from 2 to 4, and ["cow"], ["over"], and ["moon"] each go from 1 to 2, the double counting that at-least-once processing can cause.]
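To make the acker's role concrete, here is an illustrative sketch (not the deck's own code) of a counting bolt written against Storm's 0.x Java API. Each output tuple is anchored to the input tuple and the input is acked; if an ack does not reach the acker before the tuple timeout, the spout replays the tuple tree and the counts above are incremented a second time, which is exactly the at-least-once behavior the later slides warn about.

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class WordCountBolt extends BaseRichBolt {
    private OutputCollector collector;
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String word = input.getStringByField("word");
        int count = counts.merge(word, 1, Integer::sum);   // mutable, task-local state
        collector.emit(input, new Values(word, count));    // anchor the output to the input
        collector.ack(input);                              // tell the acker this tuple is done
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}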
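For comparison, the Trident version of the word count follows. Trident is Storm's higher-level API and updates the count state transactionally, so each tuple is counted exactly once; the snippet assumes a spout that emits a "sentence" field and uses Trident's Split and Count functions with an in-memory state backend.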
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
        .parallelismHint(6);
Spark Streaming
Stateful Stream Processing
• Traditional streaming systems have a record-at-a-time processing model
• Each node has mutable state
• For each record, update state and send new records
• State is lost if the node dies!
• Lambda Architecture
• Making stateful stream processing fault-tolerant is challenging
[Diagram: input records flowing through nodes 1, 2, and 3, each node holding mutable state]
Existing Streaming Systems
• Storm
• Replays a record if it is not processed by a node
• Processes each record at least once
• May update mutable state twice!
• Mutable state can be lost due to failure!
• Trident – use transactions to update state
• Processes each record exactly once
• Per-state transactions to an external database are slow
Spark
• Spark was a project out of Berkeley from 2010
• Has become very popular
• The most actively contributed open source project in the big-data domain
• RDD: Resilient Distributed Dataset
Discretized Stream Processing
Spark Streaming example
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that connects to localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

# Start the computation and wait for it to terminate
ssc.start()
ssc.awaitTermination()
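One way to try this (assuming a local Spark installation, as in the Spark Streaming quick example): run nc -lk 9999 in one terminal to feed text into port 9999, submit the script with spark-submit in another, and each one-second batch prints its word counts.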
DStream Input Sources
• Out of the box
• Kafka
• HDFS
• Flume
• Akka Actors
• Raw TCP sockets
Arbitrary Stateful Computations
• updateStateByKey
• Maintain arbitrary state while continuously updating it with new information
• How to use (a sketch follows below)
• Define the state - the state can be an arbitrary data type
• Define the state update function - specify with a function how to update the state using the previous state and the new values from an input stream
• The state update function is applied in every batch for all existing keys
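A minimal sketch of updateStateByKey with Spark Streaming's Java API, keeping a running count per word. It assumes a JavaPairDStream<String, Integer> of (word, 1) pairs like the word-count example above and a checkpoint directory set on the StreamingContext (required for stateful transformations); Spark 1.x passes the previous state as a Guava Optional, while newer releases use org.apache.spark.api.java.Optional.

import com.google.common.base.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

import java.util.List;

public class RunningWordCount {
    public static JavaPairDStream<String, Integer> runningCounts(
            JavaPairDStream<String, Integer> pairs) {
        // State update function: (new values for a key in this batch, previous state) -> new state.
        Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
            (newValues, runningCount) -> {
                int sum = runningCount.or(0);     // previous count, or 0 for a brand-new key
                for (Integer v : newValues) {
                    sum += v;
                }
                return Optional.of(sum);          // becomes the state carried into the next batch
            };
        // Applied in every batch for all keys with new data or existing state.
        return pairs.updateStateByKey(updateFunction);
    }
}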
Components of a streaming ecosystem
• Gather the data
• Funnel
• Distributed Queue
• Real-Time Processing
• Semi-Real-Time Processing
• Real-time OLAP
Step 1: Gather the Data
• Apache NiFi is a good distributed funnel
• Originally developed at the NSA
• Over 8 years of development
• Open sourced in 2014 and picked up by Hortonworks
• Great visual UI for designing a data flow
• Ships with many processor types out of the box
• But not very good for heavyweight distributed processing
• The same graph is executed on all the nodes
NiFi Components
• FlowFile
• Unit of data moving through the system
• Content + Attributes (key/value pairs)
• Processor
• Performs the work, can access FlowFiles
• Connection
• Links between processors
• Queues that can be dynamically prioritized
• Process Group
• Set of processors and their connections
• Receive data via input ports, send data via output ports
NiFi GUI
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors & connections
NiFi Site-to-Site
• Site-to-Site makes it very easy to push data from one data center to another
• This makes NiFi a great choice for a distributed funnel
Step 2: Distributed Queue
• Pub-sub model
• Producers publish(topic, msg) to the publish-subscribe system; consumers subscribe to topics and receive messages
• Kafka is a very popular example (a minimal sketch follows below)
[Diagram: producers and consumers around a publish-subscribe system holding Topic 1, Topic 2, and Topic 3]
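A minimal sketch of the pub-sub pattern with Kafka's Java client: one process publishes records to a topic, another subscribes and reads them. The broker address, the topic name ("page-views"), and the group id are made-up placeholders; only the standard producer and consumer API calls are used.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Collections;
import java.util.Properties;

public class PubSubSketch {
    public static void publish() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // publish(topic, msg)
            producer.send(new ProducerRecord<>("page-views", "user-42", "clicked /home"));
        }
    }

    public static void consume() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "example-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));  // subscribe(topic)
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}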
Kafka Architecture
• Distributed, high-throughput,
pub-sub messaging system
• Fast, scalable, durable
[Diagram: multiple producers publishing into the Kafka cluster]