HD Mod012 Storm
Module 12 – Storm
Storm Page 1
Table Of Contents
Lab01: Start Storm
What is Apache Storm?
Why Apache Storm?
Storm cluster architecture
Storm vs Spark
Storm Terminology
Storm visualized (1 of 2)
Storm visualized (2 of 2)
Java code
Java code (con't)
Java code (con't)
Stream Grouping
Stream Grouping types
Tuple processing workflow
Topology Design
1. Define the Problem
2. Map the Solution
3. Implement the Solution
3. Implement the Solution – Geocode bolt
3. Implement the Solution – Heatmap bolt
3. Implement the Solution – Tick tuples
3. Implement the Solution – Persistor bolt
3. Implement the Solution – Wire together and start
4. Scaling the Topology – Executors and Tasks
4. Scaling the Topology
4. Scaling the Topology (con't)
4. Scaling the Topology (con't)
5. Tune it again
Before we begin: Code description
Lab03: Create Topology
Lab04: Add Kafka Spout
Lab03: Storm UI: <IP>:8744
Lab03: Storm UI: <IP>:8744 (con't)
Lab05: Confirm Spout sending tuples to Bolt
Lab05: Storm UI: <IP>:8744 (con't)
Lab07: WordCount lab
Lab08: Cleanup
In Review - Storm
Lab01: Start Storm
From Ambari, start Storm.
What is Apache Storm?
Storm is a distributed real-time computation system for processing large volumes of
high-velocity data. Storm is extremely fast, with the ability to process over a million
records per second per node on a cluster of modest size. Enterprises harness this
speed and combine it with other data access applications in Hadoop to prevent
undesirable events or to optimize positive outcomes.
Storm does not run on Hadoop clusters; it uses ZooKeeper and its own minion worker
processes to manage its workload. Storm can, however, read and write files in HDFS.
• Apache Storm is an open-source engine that can process data in real time
using its distributed architecture. It is a distributed real-time
computation system: a task-parallel, continuous computational engine. It
defines its workflows in Directed Acyclic Graphs (DAGs) called
'topologies'. These topologies run until shut down by the user or until
they encounter an unrecoverable failure.
• Storm does not natively run on top of typical Hadoop clusters; it uses
Apache ZooKeeper and its own master/minion worker processes to
coordinate topologies, master and worker state, and the message
guarantee semantics. That said, both Yahoo! and Hortonworks are
working on providing libraries for running Storm topologies on top of
Hadoop 2.x YARN clusters.
http://www.zdatainc.com/2014/09/apache-storm-apache-spark/
Why Apache Storm?
Storm cluster architecture
The cluster has a master node running Nimbus, a Zookeeper cluster, and worker
nodes, each running a Supervisor daemon and one or more worker processes.

Nimbus (master node) - management server:
• Similar to the job tracker
• Distributes code around the cluster
• Assigns tasks
• Handles failures

Supervisor (worker nodes):
• Similar to the task tracker
• A task is an instance of a Bolt or Spout

Zookeeper cluster:
• Coordinates communication between Nimbus and the Supervisors
• Cluster co-ordination and Nimbus HA
• Stores cluster metrics
• Consumption-related metadata for Trident topologies
Storm vs Spark
If your requirements are primarily focused on stream processing and CEP-style
processing and you are starting a greenfield project with a purpose-built cluster for the
project, I would probably favor Storm -- especially when existing Storm spouts that
match your integration requirements are available. This is by no means a hard and fast
rule, but such factors would at least suggest beginning with Storm.
On the other hand, if you're leveraging an existing Hadoop or Mesos cluster and/or if
your processing needs involve substantial requirements for graph processing, SQL
access, or batch processing, you might want to look at Spark first.
Another factor to consider is the multi-language support of the two systems. For
example, if you need to leverage code written in R or any other language not natively
supported by Spark, then Storm has the advantage of broader language support. By the
same token, if you must have an interactive shell for data exploration using API calls,
then Spark offers you a feature that Storm doesn’t.
In the end, you’ll probably want to perform a detailed analysis of both platforms before
making a final decision. I recommend using both platforms to build a small proof of
concept -- then run your own benchmarks with a workload that mirrors your anticipated
workloads as closely as possible before fully committing to either.
Of course, you don't need to make an either/or decision. Depending on your workloads,
infrastructure, and requirements, you may find that the ideal solution is a mixture of
Storm and Spark -- along with other tools like Kafka, Hadoop, Flume, and so on.
Therein lies the beauty of open source.
Storm Terminology
Here are a few terms and concepts you should become familiar with before we go
hands-on:
Storm visualized (1 of 2)
Storm visualized (2 of 2)
A Topology is a network of Spouts and Bolts wired together into a workflow. In
this example, one node is a Spout and two nodes are Bolts:
• Spout: reads commits from the feed and emits tuples such as
[commit="23bc [email protected]"] on Stream 1.
• Bolt 1: extracts the email address from the commit and emits tuples such as
[email="[email protected]"] on Stream 2.
• Bolt 2: updates an in-memory counter of commits per email address.
Bolts are nodes that accept tuples from an input stream, perform computation
(filtering, aggregation, join), and then optionally emit new tuple(s) to an
output stream. Notice that Bolt 2 does not emit a new tuple; it only updates an
in-memory map.
Java code
CommitFeedListener.java
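The code screenshot for CommitFeedListener.java is not reproduced here. In the real topology this logic lives in a Storm spout (extending Storm's BaseRichSpout and emitting through a SpoutOutputCollector); the sketch below is a plain-Java approximation of its core behavior, with the feed contents and method shapes assumed for illustration: buffer commit lines from the feed, then hand out one line per nextTuple() call.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// Plain-Java sketch of the spout's core behavior: buffer commit lines
// from a feed, then emit at most one line per nextTuple() call.
class CommitFeedListener {
    private final Queue<String> commits = new ArrayDeque<>();
    private final Consumer<String> collector; // stands in for SpoutOutputCollector

    CommitFeedListener(Consumer<String> collector) {
        this.collector = collector;
    }

    // In the real spout this would poll the commit feed (e.g. over HTTP).
    void open(List<String> feed) {
        commits.addAll(feed);
    }

    // Called repeatedly by the Storm engine; emits one tuple if one is ready.
    void nextTuple() {
        String commit = commits.poll();
        if (commit != null) {
            collector.accept(commit);
        }
    }
}
```

When the feed is exhausted, nextTuple() simply does nothing, which matches Storm's contract that the method must not block.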
Java code (con’t)
EmailExtractor.java
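The EmailExtractor.java screenshot is likewise not reproduced. Its job, per the diagram, is to pull the email address out of each commit line. A minimal plain-Java sketch of that logic (the regex and method name are assumptions; the real class is a Storm bolt that emits the address on its output stream):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java sketch of the bolt's core logic: extract the email address
// from a commit line such as "23bc [email protected]".
class EmailExtractor {
    // Simplified address pattern; real-world email matching is messier.
    private static final Pattern EMAIL = Pattern.compile("\\S+@\\S+\\.\\S+");

    static Optional<String> extract(String commitLine) {
        Matcher m = EMAIL.matcher(commitLine);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }
}
```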
Java code (con’t)
EmailCounter.java
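EmailCounter.java is the terminal bolt from the earlier diagram: it emits nothing and only maintains an in-memory map of commit counts per email address. A plain-Java sketch of that behavior (method names assumed; the real class is a Storm bolt whose execute() receives a Tuple):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the terminal bolt: no downstream emits, just an
// in-memory map of email address -> number of commits seen.
class EmailCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    void execute(String email) {
        counts.merge(email, 1, Integer::sum); // increment, starting at 1
    }

    int countFor(String email) {
        return counts.getOrDefault(email, 0);
    }
}
```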
Stream Grouping
A Stream Grouping tells a topology how to send tuples between two components.
Remember, spouts and bolts execute in parallel as many tasks across the cluster. If you
look at how a topology is executing at the task level, it looks something like this:
When a task for Bolt A emits a tuple to Bolt B, which task should it send the tuple to? A
"Stream Grouping" answers this question by telling Storm how to send tuples between
sets of tasks.
A Stream Grouping defines how that stream should be partitioned among the bolt's
tasks.
Stream Grouping: Defines how Tuples are sent between Spout and Bolt or
between Bolts (Spouts and Bolts run in parallel so there are multiple instances)
• Spout: reads commits from the feed
• Bolt 1: extracts email from the commits
Stream Grouping types
There are eight built-in stream groupings in Storm, and you can implement a custom
stream grouping by implementing the CustomStreamGrouping interface:
1. Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a
way such that each bolt is guaranteed to get an equal number of tuples.
2. Fields grouping: The stream is partitioned by the fields specified in the
grouping. For example, if the stream is grouped by the "user-id" field, tuples with
the same "user-id" will always go to the same task, but tuples with different "user-
id"'s may go to different tasks.
3. Partial Key grouping: The stream is partitioned by the fields specified in the
grouping, like the Fields grouping, but are load balanced between two
downstream bolts, which provides better utilization of resources when the
incoming data is skewed. This paper provides a good explanation of how it works
and the advantages it provides.
4. All grouping: The stream is replicated across all the bolt's tasks. Use this
grouping with care.
5. Global grouping: The entire stream goes to a single one of the bolt's tasks.
Specifically, it goes to the task with the lowest id.
6. None grouping: This grouping specifies that you don't care how the stream is
grouped. Currently, none groupings are equivalent to shuffle groupings.
Eventually though, Storm will push down bolts with none groupings to execute in
the same thread as the bolt or spout they subscribe from (when possible).
7. Direct grouping: This is a special kind of grouping. A stream grouped this way
means that the producer of the tuple decides which task of the consumer will
receive this tuple. Direct groupings can only be declared on streams that have
been declared as direct streams. Tuples emitted to a direct stream must be
emitted using one of the emitDirect methods. A bolt can get the task ids of
its consumers either by using the provided TopologyContext or by keeping
track of the output of the emit method in OutputCollector (which returns
the task ids that the tuple was sent to).
8. Local or shuffle grouping: If the target bolt has one or more tasks in the same
worker process, tuples will be shuffled to just those in-process tasks. Otherwise,
this acts like a normal shuffle grouping.
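The fields grouping's guarantee above (same field value, always the same task) comes from hashing the grouping fields modulo the number of target tasks. The plain-Java sketch below illustrates that routing rule only; it is not Storm's actual implementation, which hashes the full list of grouping-field values:

```java
// Illustrative routing rule behind a fields grouping: tuples that share
// the grouping field's value always map to the same consumer task index.
class FieldsGroupingDemo {
    static int taskFor(String fieldValue, int numTasks) {
        // floorMod keeps the index non-negative even for negative hash codes
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

Because the routing is a pure function of the field value, two tuples with the same "user-id" can never land on different tasks, which is what makes per-key aggregation in a bolt safe.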
Shuffle grouping: sends tuples to bolt tasks in random fashion (each task gets
roughly the same number). Typical use: atomic operations, e.g. math operations.
Fields grouping: sends tuples to a specific bolt task based on a field's value
in the tuple. Typical use: segmentation of the incoming stream; aggregating
tuples of a certain type.
All grouping: sends a single copy of each tuple to all instances of the
receiving bolt. Typical use: sending a signal to all bolts, like clear cache or
refresh state; sending a tick tuple to signal bolts to save state.
Custom grouping: implement your own grouping so tuples are routed based on
custom logic. Typical use: maximum flexibility to change the processing
sequence or logic based on factors like data type, load, and seasonality.
Direct grouping: the source decides which bolt task will receive the tuple.
Typical use: depends on the case.
Global grouping: sends the tuples generated by all instances of the source to a
single target instance (specifically, the task with the lowest ID). Typical
use: global counts.
Tuple processing workflow
1. Get data from the message source: the Storm engine calls the nextTuple()
method on the Spout task.
2. Inject data into the topology: the Spout task emits the tuple to one of its
output streams with a unique messageID.
3. Figure out tuple-to-bolt routing: the right Bolt gets the data based on the
field grouping used by the receiving bolt.
4. Processing status is tracked by Storm: the Storm engine tracks the tuple
tree for anchored tuples.
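The tracking in the last step can be sketched as bookkeeping on the tuple tree: a tuple emitted with a messageID stays pending until every anchored descendant has been acked, and only then is the spout's ack called. This plain-Java sketch shows the idea only; Storm's real acker is far more compact, tracking each tree as a single XOR of tuple ids:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of tuple-tree tracking: count down outstanding anchored tuples
// per root messageID; the tree completes when the count reaches zero.
class TupleTracker {
    private final Map<String, Integer> pending = new HashMap<>();

    // Spout emitted a root tuple that fans out into anchoredCount tuples.
    void emitted(String messageId, int anchoredCount) {
        pending.put(messageId, anchoredCount);
    }

    // A bolt acked one tuple in the tree; returns true when the whole
    // tree is complete (the engine would then ack the spout's messageID).
    boolean ack(String messageId) {
        int left = pending.merge(messageId, -1, Integer::sum);
        if (left == 0) {
            pending.remove(messageId);
            return true;
        }
        return false;
    }
}
```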
Topology Design
1. Define the Problem
I want to develop a heat map that displays activity in the bars every 15
seconds. This will tell me which bar I want to visit. I also wish to save the
data to a NoSQL database.
2. Map the Solution
(Diagram: the check-in feed flows into the topology.)
3. Implement the Solution
3. Implement the Solution – Checkin Spout
3. Implement the Solution – Geocode bolt
3. Implement the Solution – Heatmap bolt
3. Implement the Solution – Tick tuples
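The tick-tuple code screenshot is not reproduced here. Conceptually, Storm delivers a special tick tuple to a bolt at a configured interval (in the real API the bolt requests this via Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS), and the heatmap bolt uses it to flush its buffered check-ins every 15 seconds. A plain-Java sketch of that buffer-and-flush-on-tick pattern, with all names illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Buffer incoming check-ins; when a tick tuple arrives, hand the whole
// 15-second batch downstream and start a fresh interval.
class HeatmapBolt {
    private List<String> interval = new ArrayList<>();

    // Regular data tuple: just accumulate it.
    void execute(String checkin) {
        interval.add(checkin);
    }

    // Tick tuple: flush the current interval and reset the buffer.
    List<String> executeTick() {
        List<String> batch = interval;
        interval = new ArrayList<>();
        return batch;
    }
}
```

In the real bolt, execute(Tuple) would test whether the tuple is a tick (by its component/stream id) and branch between these two paths.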
3. Implement the Solution – Persistor bolt
3. Implement the Solution – Wire together and start
Wire everything together
4. Scaling the Topology – Executors and Tasks
To scale, we will define Executors (threads) and Tasks (instances of
spouts/bolts running within a thread).
The single-instance wiring shown earlier won't scale, so set the Executors to 4
and 8 respectively:
builder.setSpout("checkins", new Checkins(), 4)
builder.setBolt("geocode-lookup", new GeocodeLookup(), 8)
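When a component also sets a fixed number of tasks (setNumTasks in the real API), Storm spreads those task instances evenly across the component's executors, and each executor runs its tasks serially on one thread. The plain-Java sketch below illustrates that even split; the round-robin assignment is an assumption for illustration, not Storm's scheduler:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative split of N tasks across E executors (threads): e.g. 8 tasks
// on 4 executors means 2 task instances running serially per thread.
class ExecutorTaskSplit {
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        List<List<Integer>> executors = new ArrayList<>();
        for (int e = 0; e < numExecutors; e++) {
            executors.add(new ArrayList<>());
        }
        for (int t = 0; t < numTasks; t++) {
            executors.get(t % numExecutors).add(t); // round-robin placement
        }
        return executors;
    }
}
```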
4. Scaling the Topology
(Screenshot: the heatmap updating at 15-second intervals.)
4. Scaling the Topology (con’t)
4. Scaling the Topology (con’t)
5. Tune it again
Before we begin: Code description
Here is the custom code we will be using. It processes trucking data.
Lab03: Create Topology
Running a topology is straightforward. First, you package all your code and
dependencies into a single jar. Then, you run a command like the one below,
which starts a new Storm Topology for TruckEvents:
cd /opt/TruckEvents/Tutorials-master/
storm jar target/Tutorial-1.0-SNAPSHOT.jar com.hortonworks.tutorials.tutorial2.TruckEventProcessingTopology
The main function of the class defines the topology and submits it to Nimbus.
The storm jar part takes care of connecting to Nimbus and uploading the jar.
Lab04: Add Kafka Spout
Lab03: Storm UI: <IP>:8744
Let’s spend a few minutes going over the Storm UI.
Topologies deployed
Lab03: Storm UI: <IP>:8744 (con't)
• Deactivate: deactivates the Spout.
• Rebalance: deactivates, redistributes the Workers evenly, then returns the
topology to its previous state.
• Kill: deactivates the Topology, then shuts down its Workers and cleans up
state.
Lab05: Confirm Spout sending tuples to Bolt
Here we see that the Spout is sending tuples to the Bolt.
Lab05: Storm UI: <IP>:8744 (con't)
Note that your Bolt is the endpoint of the Topology, so it does not emit any
tuples. The screenshot below is of a Bolt that does emit to another Bolt.
(Columns: Bolt id, # of threads, # of tasks, # of tuples emitted, # of tuples
sent to other tasks.)
Lab07: WordCount lab
Doesn't work
Lab08: Cleanup
Since these are streaming jobs, the log files will continue to grow until you
kill the process. So go to the web browser URL <IP>:8744 and, under Topology
Actions, kill both:
• Truck-event-processer
• Wordcount
In Review - Storm