
Module 12

Module 12 – Storm

After completing this module, the student should be able to describe:


• Streaming vs. Batch
• Storm Terminology
• Storm Architecture
• Topologies
• Metrics and Monitoring

Trident is an alternative high-level interface to Storm. It lets you express a
Topology in terms of 'what' as opposed to 'how'. To do this it provides
operations like joins, aggregations, groupings, and functions. It is
similar to high-level batch-processing tools like Pig

Storm Page 1
Page 2 Storm
Table Of Contents
Lab01: Start Storm .......................................................................................................................... 4
What is Apache Storm?................................................................................................................... 6
Why Apache Storm ......................................................................................................................... 8
Storm cluster architecture ............................................................................................................. 10
Storm vs Spark .............................................................................................................................. 12
Storm Terminology ....................................................................................................................... 14
Storm visualized (1 of 2) ............................................................................................................... 16
Storm visualized (2 of 2) ............................................................................................................... 18
Java code ....................................................................................................................................... 20
Java code (con’t) ........................................................................................................................... 22
Java code (con’t) ........................................................................................................................... 24
Stream Grouping ........................................................................................................................... 26
Stream Grouping types .................................................................................................................. 28
Tuple processing workflow ........................................................................................................... 30
Topology Design ........................................................................................................................... 32
1. Define the Problem ................................................................................................................... 34
2. Map the Solution ....................................................................................................................... 36
3. Implement the Solution ............................................................................................................. 38
3. Implement the Solution – Geocode bolt.................................................................................... 40
3. Implement the Solution – Heatmap bolt ................................................................................... 42
3. Implement the Solution – Tick tuples ....................................................................................... 44
3. Implement the Solution – Persistor bolt .................................................................................... 46
3. Implement the Solution – Wire together and start .................................................................... 48
4. Scaling the Topology – Executors and Tasks ........................................................................... 50
4. Scaling the Topology ................................................................................................................ 52
4. Scaling the Topology (con’t) ................................................................................................... 54
4. Scaling the Topology (con’t) ................................................................................................... 56
5. Tune it again.............................................................................................................................. 58
Before we begin: Code description ............................................................................................... 60
Lab03: Create Topology ............................................................................................................... 62
Lab04: Add Kafka Spout .............................................................................................................. 64
Lab03: Storm UI: <IP>:8744 ........................................................................................................ 66
Lab03: Storm: UI: <IP>:8744 (con’t) ........................................................................................... 68
Lab05: Confirm Spout sending tuples to Bolt .............................................................................. 70
Lab05: Storm: UI: <IP>:8744 (con’t) ........................................................................................... 72
Lab07: WordCount lab.................................................................................................................. 74
Lab08: Cleanup ............................................................................................................................. 76
In Review - Storm ......................................................................................................................... 78

Storm Page 3
Lab01: Start Storm
From Ambari, start Storm.

In addition, confirm Zookeeper is started.

Page 4 Storm
Lab01: Start Storm

Log into Ambari (http://192.168.100.140:8080) using admin / admin and
ensure Storm and ZooKeeper are both started. If either is stopped, start it.

Start ZooKeeper as well if it is not started

Storm Page 5
What is Apache Storm?
Storm is a distributed real-time computation system for processing large volumes of
high-velocity data. Storm is extremely fast, with the ability to process over a million
records per second per node on a cluster of modest size. Enterprises harness this
speed and combine it with other data access applications in Hadoop to prevent
undesirable events or to optimize positive outcomes.

Storm does not run on Hadoop clusters; it uses ZooKeeper and its own minion worker
processes to manage its processing. Storm can read and write files to HDFS.

Page 6 Storm
What is Apache Storm?

• Apache Storm is an open source engine which can process data in real-
time using its distributed architecture. It is a distributed real-time
computation system. Apache Storm is a task parallel continuous
computational engine. It defines its workflows in Directed Acyclic Graphs
(DAGs) called 'Topologies'. These topologies run until shut down by the
user or until an unrecoverable failure occurs

• Storm does not natively run on top of typical Hadoop clusters; it uses
Apache ZooKeeper and its own master/minion worker processes to
coordinate topologies, master and worker state, and the message
guarantee semantics. That said, both Yahoo! and Hortonworks are
working on providing libraries for running Storm topologies on top of
Hadoop 2.x YARN clusters

http://www.zdatainc.com/2014/09/apache-storm-apache-spark/

Storm Page 7
Why Apache Storm

Page 8 Storm
Why Apache Storm?

An open-source, real-time event stream processing platform that provides
fast, continuous, and low-latency processing for very high-frequency
streaming data

Storm Page 9
Storm cluster architecture

Let’s look at the various components of a Storm Cluster:

1. Nimbus node. The master node (Similar to JobTracker)


2. Supervisor nodes. Start/stop workers & communicate with Nimbus through
ZooKeeper
3. ZooKeeper nodes. Coordinates the Storm cluster

Page 10 Storm
Storm cluster architecture

Nimbus (Master node) – management server:
• Similar to the JobTracker
• Distributes code around the cluster
• Assigns tasks
• Handles failures

Supervisor (Worker nodes):
• Similar to the TaskTracker
• Runs Worker processes; a task is an instance of a Bolt or Spout

ZooKeeper:
• Cluster co-ordination (coordinates communication between Nimbus and the Supervisors)
• Nimbus HA
• Stores cluster metrics
• Consumption-related metadata for Trident topologies

Storm Page 11
Storm vs Spark
If your requirements are primarily focused on stream processing and CEP-style
processing and you are starting a greenfield project with a purpose-built cluster for the
project, I would probably favor Storm -- especially when existing Storm spouts that
match your integration requirements are available. This is by no means a hard and fast
rule, but such factors would at least suggest beginning with Storm.

On the other hand, if you're leveraging an existing Hadoop or Mesos cluster and/or if
your processing needs involve substantial requirements for graph processing, SQL
access, or batch processing, you might want to look at Spark first.

Another factor to consider is the multi-language support of the two systems. For
example, if you need to leverage code written in R or any other language not natively
supported by Spark, then Storm has the advantage of broader language support. By the
same token, if you must have an interactive shell for data exploration using API calls,
then Spark offers you a feature that Storm doesn’t.

In the end, you’ll probably want to perform a detailed analysis of both platforms before
making a final decision. I recommend using both platforms to build a small proof of
concept -- then run your own benchmarks with a workload that mirrors your anticipated
workloads as closely as possible before fully committing to either.

Of course, you don't need to make an either/or decision. Depending on your workloads,
infrastructure, and requirements, you may find that the ideal solution is a mixture of
Storm and Spark -- along with other tools like Kafka, Hadoop, Flume, and so on.
Therein lies the beauty of open source.

Page 12 Storm
Storm vs Spark

Storm does stream processing; Spark does micro-batching

Storm Page 13
Storm Terminology

Here are a few terminologies and concepts you should get familiar with before we go
hands-on:

• Tuples. An ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)


• Streams. An unbounded sequence of tuples.
• Spouts. Sources of streams in a computation (e.g. a Twitter API)
• Bolts. Process input streams and produce output streams. They can:
o Run functions;
o Filter, aggregate, or join data;
o Talk to databases.
• Topologies. The overall calculation, represented visually as a network of spouts
and bolts

Page 14 Storm
Storm Terminology

• Topology: A graph with nodes and edges. Nodes do a computation; edges
represent data being passed between nodes. It is essentially a group of
spouts and bolts wired together into a workflow
• Tuple: The most fundamental data structure; a named list of values
that can be of any datatype
• Streams: Groups of tuples
• Spouts: Generate streams
• Bolts: Contain data processing, persistence and alerting logic. Can also
emit tuples for downstream bolts

A Storm application is designed as a Topology in the shape of a directed


acyclic graph (DAG) with Spouts and Bolts acting as the graph vertices
(nodes) . Edges on the graph are named Streams and direct data from one
node to another. Together, the topology acts as a data transformation
pipeline. At a superficial level the general topology structure is similar to a
MapReduce job, with the main difference being that data is processed in
real-time as opposed to in individual batches. Additionally, Storm topologies
run indefinitely until killed, while a MapReduce job DAG must eventually end.

Storm Page 15
Storm visualized (1 of 2)

Page 16 Storm
Storm visualized (1 of 2)

Topology: A graph with Nodes and Edges. Nodes perform computations on
Tuples; Edges represent Tuples being passed between Nodes. It is essentially
a group of Spouts and Bolts wired together into a workflow.

The data feed is a live feed of commits (e.g. "23bc [email protected]").

• Tuples are ordered lists of values, e.g. [commit = '23bc [email protected]']
• Nodes perform computations: Read commits from feed → Extract email → Update email counter
• Edges pass Tuples between Nodes, e.g. [email = "[email protected]"]

In this Topology, we have 3 Nodes and 2 Edges.
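The three-node workflow above can be simulated in plain Java (no Storm dependency; the commit strings and email addresses below are made-up sample data, not from the slide):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CommitEmailCount {
    // Node 1: read commits from the feed (here, a fixed sample list)
    public static List<String> readCommits() {
        return List.of("23bc bob@example.com", "2345 bob@example.com");
    }

    // Node 2: extract the email field from a commit tuple
    public static String extractEmail(String commit) {
        return commit.split(" ")[1];
    }

    // Node 3: update a per-email counter
    public static Map<String, Integer> countEmails(List<String> commits) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : commits)
            counts.merge(extractEmail(c), 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countEmails(readCommits())); // {bob@example.com=2}
    }
}
```

In a real topology each "node" runs as a parallel Storm component, but the data transformation per tuple is exactly this shape.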

Storm Page 17
Storm visualized (2 of 2)

Page 18 Storm
Storm visualized (2 of 2)

Stream: An unbounded sequence of Tuples between 2 nodes in a Topology.
Below we have 2 Streams.

Spout: A Node that serves as a source of a Stream in a Topology. It listens
to a data feed, which can be a message queue (e.g. Kafka) or a database.

Bolt: A Node that accepts tuples from an input stream, performs a
computation (filtering, aggregation, join), and then optionally emits new
tuple(s) to its output Stream. Notice that Bolt 2 does not emit a new tuple
but rather updates an in-memory map.

In the example, the flow is: Data feed → Read commits from feed (Spout) →
Stream 1 → Extract email from feed (Bolt 1) → Stream 2 → Update email
counter (Bolt 2). Nodes can be either a Spout or a Bolt: 1 Node is a Spout
and 2 Nodes are Bolts, so our Topology is a network of spouts and bolts
wired together into a workflow.
Storm Page 19
Java code

Page 20 Storm
Java code
CommitFeedListener.java

Storm Page 21
Java code (con’t)

Page 22 Storm
Java code (con't)
EmailExtractor.java

Storm Page 23
Java code (con’t)

Page 24 Storm
Java code (con't)
EmailCounter.java

Storm Page 25
Stream Grouping

A Stream Grouping tells a topology how to send tuples between two components.
Remember, spouts and bolts execute in parallel as many tasks across the cluster. If you
look at how a topology is executing at the task level, it looks something like this:

When a task for Bolt A emits a tuple to Bolt B, which task should it send the tuple to? A
"Stream Grouping" answers this question by telling Storm how to send tuples between
sets of tasks.

A Stream Grouping defines how that stream should be partitioned among the bolt's
tasks.

Page 26 Storm
Stream Grouping

Stream Grouping: Defines how Tuples are sent between a Spout and a Bolt, or
between Bolts. (Spouts and Bolts run in parallel, so there are multiple
instances of each.)

In the commit-feed example:
• Stream 1 (Spout "Read commits from feed" → Bolt 1 "Extract email from commits"): use a SHUFFLE GROUPING to distribute tuples randomly across the Bolt instances
• Stream 2 (Bolt 1 → Bolt 2 "Update email count"): use a FIELDS GROUPING so tuples with the same value (in our case, '[email protected]') go to the same bolt instance so a count can occur
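The routing decision behind these two groupings can be sketched in plain Java (no Storm dependency; the task count is made up for illustration):

```java
import java.util.Random;

public class GroupingSketch {
    static final Random RAND = new Random();

    // Shuffle grouping: pick a target task at random (roughly even spread)
    public static int shuffleGrouping(int numTasks) {
        return RAND.nextInt(numTasks);
    }

    // Fields grouping: hash the grouping field so that equal values
    // always land on the same target task
    public static int fieldsGrouping(Object fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same email always routes to the same task index,
        // which is what makes per-email counting possible
        System.out.println(fieldsGrouping("bob@example.com", tasks)
                == fieldsGrouping("bob@example.com", tasks)); // true
    }
}
```

This is why a count can only be kept under a fields grouping: a shuffle grouping would scatter occurrences of the same email across different bolt instances.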

Storm Page 27
Stream Grouping types

There are eight built-in stream groupings in Storm, and you can implement a custom
stream grouping by implementing the CustomStreamGrouping interface:

1. Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a
way such that each bolt is guaranteed to get an equal number of tuples.
2. Fields grouping: The stream is partitioned by the fields specified in the
grouping. For example, if the stream is grouped by the "user-id" field, tuples with
the same "user-id" will always go to the same task, but tuples with different "user-
id"'s may go to different tasks.
3. Partial Key grouping: The stream is partitioned by the fields specified in the
grouping, like the Fields grouping, but are load balanced between two
downstream bolts, which provides better utilization of resources when the
incoming data is skewed. This paper provides a good explanation of how it works
and the advantages it provides.
4. All grouping: The stream is replicated across all the bolt's tasks. Use this
grouping with care.
5. Global grouping: The entire stream goes to a single one of the bolt's tasks.
Specifically, it goes to the task with the lowest id.
6. None grouping: This grouping specifies that you don't care how the stream is
grouped. Currently, none groupings are equivalent to shuffle groupings.
Eventually though, Storm will push down bolts with none groupings to execute in
the same thread as the bolt or spout they subscribe from (when possible).
7. Direct grouping: This is a special kind of grouping. A stream grouped this way
means that the producer of the tuple decides which task of the consumer will
receive this tuple. Direct groupings can only be declared on streams that have
been declared as direct streams. Tuples emitted to a direct stream must be
emitted using one of the emitDirect methods. A bolt can get the task ids of
its consumers by either using the provided TopologyContext or by keeping
track of the output of the emit method in OutputCollector (which returns the
task ids that the tuple was sent to).
8. Local or shuffle grouping: If the target bolt has one or more tasks in the same
worker process, tuples will be shuffled to just those in-process tasks. Otherwise,
this acts like a normal shuffle grouping.

Page 28 Storm
Stream Grouping types

Provides various ways to control tuple routing to bolts. Many grouping types
exist, including shuffle, fields, and global:

Shuffle Grouping
• What it does: sends tuples to bolts in random fashion (each bolt gets roughly the same number)
• When to use: atomic operations, e.g. math operations

Fields Grouping
• What it does: sends tuples to a specific bolt based on a field's value in the tuple
• When to use: segmentation of the incoming stream; aggregating tuples of a certain type

All Grouping
• What it does: sends a single copy of each tuple to all instances of a receiving bolt
• When to use: sending a signal to all bolts, such as clearing a cache or refreshing state; sending a tick tuple to signal bolts to save state

Custom Grouping
• What it does: implement your own grouping so tuples are routed based on custom logic
• When to use: maximum flexibility to change the processing sequence or logic based on factors like data type, load, or seasonality

Direct Grouping
• What it does: the source decides which bolt instance will receive the tuple
• When to use: depends on the use case

Global Grouping
• What it does: sends tuples generated by all instances of the source to a single target instance (specifically, the task with the lowest id)
• When to use: global counts

Storm Page 29
Tuple processing workflow

Page 30 Storm
Tuple processing workflow

1. Get data from message source: the Storm engine calls the nextTuple() method on the Spout task
2. Inject data into topology: the Spout task emits the tuple to one of its output Streams with a unique messageID
3. Figure out tuple-to-bolt routing: the right Bolt gets the data based on the grouping used by the receiving bolt
4. Bolts process & acknowledge: Bolts process and emit unanchored or anchored tuples; Bolts must ACK or fail each tuple after processing is done
5. Processing status tracked by Storm: the Storm engine tracks the Tuple tree for anchored tuples
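The tuple-tree tracking in step 5 uses Storm's documented XOR scheme: every tuple id is XORed into a single 64-bit value once when it is anchored and once when it is acked, so the tree is complete exactly when the value returns to zero. A minimal plain-Java sketch (no Storm dependency):

```java
import java.util.Random;

public class AckerSketch {
    // XOR every tuple id once when it enters the tree (anchored) and once
    // when it is acked; the tree is fully processed when the value is 0.
    public static boolean treeComplete(long[] tupleIds) {
        long ackVal = 0;
        for (long id : tupleIds) ackVal ^= id; // tuples emitted/anchored
        for (long id : tupleIds) ackVal ^= id; // tuples acked by bolts
        return ackVal == 0;
    }

    public static void main(String[] args) {
        Random rand = new Random();
        long[] ids = {rand.nextLong(), rand.nextLong(), rand.nextLong()};
        System.out.println(treeComplete(ids)); // true: whole tree acked
    }
}
```

The design choice here is memory: Storm can track an arbitrarily large tuple tree with a constant 64 bits per spout tuple, rather than storing every outstanding tuple id.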

Storm Page 31
Topology Design

Page 32 Storm
Topology Design

1. Define the problem – Document requirements to be placed on any


potential solution. Goal is to model a solution
2. Map the solution to Storm – Map out Topology (via Spout and Bolts)
3. Implement the Solution – Here's where the heavy lifting starts: writing
the Java code for everything
4. Scaling the Topology – Tune it to run at scale
5. Tune it again – Based on observations, fine tune again if needed

Storm Page 33
1. Define the Problem

Page 34 Storm
1. Define the Problem

We want to develop a heat map that displays activity in the bars
every 15 seconds. This provides information about which bar
to visit. We also wish to save the data to a NoSQL database

Storm Page 35
2. Map the Solution

Page 36 Storm
2. Map the Solution

Here we decide on the nodes (Spouts and Bolts), along with the tuples and
which Grouping to use. (The diagram shows the data feed entering the
topology.)

Storm Page 37
3. Implement the Solution

Page 38 Storm
3. Implement the Solution – Checkin Spout

Storm Page 39
3. Implement the Solution – Geocode bolt

Page 40 Storm
3. Implement the Solution – Geocode bolt

Storm Page 41
3. Implement the Solution – Heatmap bolt

Page 42 Storm
3. Implement the Solution – Heatmap bolt

Storm Page 43
3. Implement the Solution – Tick tuples

Page 44 Storm
3. Implement the Solution – Tick Tuples

On the HeatMap bolt, trigger an action periodically
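Storm delivers tick tuples to a bolt that requests them in its component configuration (conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 15) asks for one every 15 seconds); the bolt then distinguishes them in execute(). The check itself can be sketched in plain Java, assuming Storm's documented system component id ("__system") and tick stream id ("__tick"):

```java
public class TickTupleSketch {
    // Tick tuples arrive from Storm's internal "__system" component on the
    // "__tick" stream (Constants.SYSTEM_COMPONENT_ID / SYSTEM_TICK_STREAM_ID)
    public static boolean isTickTuple(String sourceComponent, String streamId) {
        return "__system".equals(sourceComponent)
                && "__tick".equals(streamId);
    }

    public static void main(String[] args) {
        System.out.println(isTickTuple("__system", "__tick"));   // true
        System.out.println(isTickTuple("checkins", "default"));  // false
    }
}
```

In the HeatMap bolt, a tick tuple triggers the periodic emit of the current heat map, while all other tuples just accumulate check-in data.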

Storm Page 45
3. Implement the Solution – Persistor bolt

Page 46 Storm
3. Implement the Solution – Persistor bolt

Writing to the NoSQL


database

Storm Page 47
3. Implement the Solution – Wire together and start

Page 48 Storm
3. Implement the Solution –
Wire together and Start Topology
Wire everything together

Start the Topology

Storm Page 49
4. Scaling the Topology – Executors and Tasks

Page 50 Storm
4. Scaling the Topology –
Executors and Tasks
To scale we will define Executors (threads) and Tasks (instances of
spouts/bolts running within a thread).

This won't scale as-is. Set Executors to 4 and 8 respectively:
builder.setSpout("checkins", new Checkins(), 4);
builder.setBolt("geocode-lookup", new GeocodeLookup(), 8);

Set Executors = 8 and Tasks = 64:
builder.setBolt("geocode-lookup", new GeocodeLookup(), 8).setNumTasks(64);
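The executor/task arithmetic above can be checked directly (plain Java; the numbers come from the setBolt/setNumTasks calls on this slide):

```java
public class ParallelismSketch {
    // Storm spreads a component's tasks evenly across its executors
    public static int tasksPerExecutor(int numTasks, int numExecutors) {
        return numTasks / numExecutors;
    }

    public static void main(String[] args) {
        // geocode-lookup: 8 executors (threads), 64 tasks via setNumTasks(64)
        System.out.println(tasksPerExecutor(64, 8) + " tasks per executor"); // 8 tasks per executor
    }
}
```

Setting more tasks than executors leaves headroom: you can later rebalance the topology to more executors (up to 64 here) without redeploying it.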

Storm Page 51
4. Scaling the Topology

Page 52 Storm
4. Scaling the Topology

Right now we can't parallelize the HeatMapBuilder bolt, because all tuples go
to the same instance. That is so tuples can be grouped into the same
15-second time interval. But what if we break up the two actions
HeatMapBuilder is doing:
• Determine the time interval a tuple falls into
• Group tuples by time interval

…and create a separate bolt? So we create a new bolt, TimeIntervalExtractor,
whose job will be to determine the time interval that a tuple falls into
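TimeIntervalExtractor's core job, flooring a timestamp to the start of its 15-second interval so all tuples in a window share the same key, can be sketched in plain Java (method name assumed for illustration):

```java
public class TimeIntervalSketch {
    static final long INTERVAL_MS = 15_000;

    // Floor a millisecond timestamp to the start of its 15-second interval,
    // so every tuple in the same window carries the same interval key
    public static long intervalStart(long timestampMs) {
        return (timestampMs / INTERVAL_MS) * INTERVAL_MS;
    }

    public static void main(String[] args) {
        System.out.println(intervalStart(31_000)); // 30000
        System.out.println(intervalStart(30_000) == intervalStart(44_999)); // true
    }
}
```

Once each tuple carries its interval key as a field, downstream HeatMapBuilder instances can receive it via a fields grouping on that key.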

Storm Page 53
4. Scaling the Topology (con’t)

Page 54 Storm
4. Scaling the Topology (con't)

Adding a new Bolt TimeIntervalExtractor so we can implement multiple


instances of HeatMapBuilder bolt

Change HeatMapBuilder code to accept Time Interval

Storm Page 55
4. Scaling the Topology (con’t)

Page 56 Storm
4. Scaling the Topology (con't)

Since I now have multiple instances of HeatMapBuilder, I can use a
Fields Grouping instead of a Global Grouping

Storm Page 57
5. Tune it again

Page 58 Storm
5. Tune it again

For a given 15-second interval, all tuples must flow through one instance of
the HeatMapBuilder bolt.

If this were to become the bottleneck, we can parallelize further by adding
another grouping field (City) alongside the time interval. Now we can have
multiple data flows for a given time interval/city, and they may flow
through different instances of HeatMapBuilder

Storm Page 59
Before we begin: Code description
Here is the custom code we will be using. It is data on trucking.

Page 60 Storm
Before we begin: Code description

Here's the code we will be executing:

1. BaseTruckEventTopology.java – topology configuration is initialized here
2. TruckEventProcessingTopology.java – Spout and Bolts initialized
3. LogTruckEventsBolt – prints messages from the Kafka spout
4. TruckScheme.java – deserializes the Kafka byte message stream to value objects

Storm Page 61
Lab03: Create Topology

Running a topology is straightforward. First, you package all your code and
dependencies into a single jar. Then you run a command like the one below,
which starts a new Storm Topology for TruckEvents.

cd /opt/TruckEvents/Tutorials-master/
storm jar target/Tutorial-1.0-SNAPSHOT.jar com.hortonworks.tutorials.tutorial2.TruckEventProcessingTopology

The main function of the class defines the topology and submits it to Nimbus. The storm
jar part takes care of connecting to Nimbus and uploading the jar.

Page 62 Storm
Lab03: Create Topology

Open a new Hadoop PuTTY command prompt and login to Hadoop.


Then follow commands below. The command below will start a new Storm
Topology for TruckEvents
1. Navigate to cd /opt/TruckEvents/Tutorials-master and paste:

storm jar target/Tutorial-1.0-SNAPSHOT.jar com.hortonworks.tutorials.tutorial2.TruckEventProcessingTopology
Eventually you will see output like the screenshot, which shows what our
Topology looks like

Storm Page 63
Lab04: Add Kafka Spout

Page 64 Storm
Lab04: Add Kafka spout

We should NOT have to do the lab below, since it should still be running
from the previous Kafka module. Confirm you see messages scrolling in one
of your command prompt windows.

1. Open a new Hadoop PuTTY prompt


2. Navigate to cd /opt/TruckEvents/Tutorials-master and type the command
below to start the Kafka Producer sending messages to the Broker:
java -cp target/Tutorial-1.0-SNAPSHOT.jar
com.hortonworks.tutorials.tutorial1.TruckEventsProducer
sandbox:6667 sandbox:2181 &

3. You should see messages populating the Stream. If so go to next page

Storm Page 65
Lab03: Storm UI: <IP>:8744
Let’s spend a few minutes going over the Storm UI.

Page 66 Storm
Lab03: Storm UI: <IP>:8744

Storm cluster overview

Topologies deployed

Click here to see Spout and Bolt. Confirm no


Zookeeper errors under 'Topology Configuration'
All Supervisors in cluster

Configuration values for cluster

Storm Page 67
Lab03: Storm: UI: <IP>:8744 (con’t)

Page 68 Storm
Lab03: Storm UI: <IP>:8744 (con't)
Topology action buttons in the Storm UI:
• Deactivate: deactivates the Spout
• Rebalance: deactivates, redistributes Workers evenly, then returns to the previous state
• Kill: deactivates the Topology, then shuts down Workers and cleans up state

Click on the 'kafkaSpout' hotlink, then go to the next page
Storm Page 69
Lab05: Confirm Spout sending tuples to Bolt

Page 70 Storm
Lab05: Confirm Spout sending tuples to Bolt

Here we see that the Spout is sending tuples (to the Bolt)

Press F5 every 10 seconds to confirm the 'Emitted' and 'Transferred'
numbers are increasing. This tells you the Spout is sending Tuples to the Bolt

Storm Page 71
Lab05: Storm: UI: <IP>:8744 (con’t)

Page 72 Storm
Lab05: Storm UI: <IP>:8744 (con't)

From Firefox, click the Back button, then click on the logTruckEventBolt
hotlink under Bolts (All time).

Note: your Bolt is the endpoint of the Topology, so it does not emit any
tuples. The screenshot below is of a Bolt that does emit to another Bolt.
Columns: Bolt id, number of executors (threads), number of tasks, number of
tuples emitted, and number of tuples transferred to other tasks

Storm Page 73
Lab07: WordCount lab

Here’s a bonus lab doing a WordCount.

Page 74 Storm
Lab07: WordCount lab

From the /usr/hdp/2.2.0.0-2041/storm/bin folder, type:

storm jar storm-starter-0.0.1-storm-0.9.0.1.jar storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com

Doesn't work

Storm Page 75
Lab08: Cleanup

Page 76 Storm
Lab08: Cleanup

Since these are streaming jobs, the log files will continue to grow until you
kill the process. So go to the Web browser URL <IP>:8744 and, under Topology
Actions, Kill both:
• Truck-event-processer
• Wordcount

Storm Page 77
In Review - Storm

Page 78 Storm
In Review – Storm

After completing this module, the student should be able to describe:


• Streaming vs. Batch
• Storm Terminology
• Storm Architecture
• Topologies
• Metrics and Monitoring

Storm Page 79
