Apache Storm Tutorial
Audience
This tutorial has been prepared for professionals aspiring to make a career in Big Data Analytics using the Apache Storm framework. It will give you enough understanding on creating and deploying a Storm cluster in a distributed environment.
Prerequisites
Before proceeding with this tutorial, you must have a good understanding of Core Java and any
of the Linux flavors.
Apache Storm vs Hadoop
Storm is used for real-time stream processing and is stateless, whereas Hadoop is used for batch processing and is stateful.
Apache Storm offers the following benefits:
Storm is open source, robust, and user friendly. It can be utilized in small companies as well as large corporations.
Storm is fault tolerant, flexible, reliable, and supports any programming language.
Storm is unbelievably fast because of its enormous data-processing power.
Storm can keep up its performance even under increasing load by adding resources linearly; it is highly scalable.
Storm performs data refresh and end-to-end delivery in seconds or minutes, depending upon the problem. It has very low latency.
Storm provides guaranteed data processing even if any of the connected nodes in the cluster die or messages are lost.
Core Concepts
Apache Storm reads a raw stream of real-time data at one end, passes it through a sequence of small processing units, and outputs the processed/useful information at the other end.
The following diagram depicts the core concept of Apache Storm.
The following terms describe these processing units:

Tuple
Tuple is the main data structure in Storm. It is a list of ordered elements. By default, a tuple supports all data types.

Stream
A stream is an unordered sequence of tuples.
Spouts
Source of a stream. Generally, Storm accepts input data from raw data sources like the Twitter Streaming API, an Apache Kafka queue, a Kestrel queue, etc. Otherwise, you can write spouts to read data from data sources. "ISpout" is the core interface for implementing spouts. Some of the specific interfaces are IRichSpout, BaseRichSpout, KafkaSpout, etc.
Bolts
Bolts are logical processing units. Spouts pass data to bolts and bolts
process and produce a new output stream. Bolts can perform the
operations of filtering, aggregation, joining, interacting with data sources
and databases. A bolt receives data and emits it to one or more other bolts. IBolt
is the core interface for implementing bolts. Some of the common
interfaces are IRichBolt, IBasicBolt, etc.
Let's take a real-time example of Twitter analysis and see how it can be modelled in Apache Storm. The following diagram depicts the structure.
The input for the Twitter analysis comes from the Twitter Streaming API. The spout will read the tweets of the users using the Twitter Streaming API and output them as a stream of tuples. A single tuple from the spout will have a Twitter username and a single tweet as comma-separated values. Then, this stream of tuples will be forwarded to the bolt, and the bolt will split the tweet into individual words, calculate the word count, and persist the information to a configured data source. Now, we can easily get the result by querying the data source.
Topology
Spouts and bolts are connected together and they form a topology. Real-time application logic is specified inside the Storm topology. In simple words, a topology is a directed graph where vertices are computations and edges are streams of data.
A simple topology starts with spouts. A spout emits the data to one or more bolts. A bolt represents a node in the topology having the smallest processing logic, and the output of a bolt can be emitted into another bolt as input.
Storm keeps the topology always running until you kill it. Apache Storm's main job is to run the topology, and it can run any number of topologies at a given time.
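As an illustration, here is a minimal sketch of such a graph built with the TopologyBuilder class (the spout and bolt classes and component names are hypothetical placeholders):

TopologyBuilder builder = new TopologyBuilder();

// The spout is the source vertex of the graph
builder.setSpout("sentence-spout", new SentenceSpout());

// The bolt is a processing vertex; shuffleGrouping defines the edge that carries
// the spout's stream into the bolt
builder.setBolt("word-split-bolt", new WordSplitBolt())
   .shuffleGrouping("sentence-spout");

StormTopology topology = builder.createTopology();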
Tasks
Now you have a basic idea on spouts and bolts. They are the smallest logical units of the topology, and a topology is built using a single spout and an array of bolts. They should be executed properly in a particular order for the topology to run successfully. The execution of each and every spout and bolt by Storm is called a task. In simple words, a task is either the execution of a spout or a bolt. At a given time, each spout and bolt can have multiple instances running in multiple separate threads.
Workers
A topology runs in a distributed manner, on multiple worker nodes. Storm spreads the tasks evenly across all the worker nodes. The worker node's role is to listen for jobs and start or stop processes whenever a new job arrives.
Stream Grouping
Streams of data flow from spouts to bolts or from one bolt to another bolt. Stream grouping controls how the tuples are routed in the topology and helps us understand the tuples' flow in the topology. There are four in-built groupings, as explained below.
Shuffle Grouping
In shuffle grouping, an equal number of tuples is distributed randomly across all of the workers
executing the bolts. The following diagram depicts the structure.
Field Grouping
In field grouping, the fields with the same values in tuples are grouped together and the remaining tuples are kept outside. Then, the tuples with the same field values are sent to the same worker executing the bolts. For example, if the stream is grouped by the field "word", then the tuples with the same string, "Hello", will move to the same worker. The following diagram shows how field grouping works.
Global Grouping
All the streams can be grouped and forwarded to one bolt. This grouping sends tuples generated by all instances of the source to a single target instance (specifically, the task with the lowest ID).
All Grouping
All Grouping sends a single copy of each tuple to all instances of the receiving bolt. This kind of
grouping is used to send signals to bolts. All grouping is useful for join operations.
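To make the four groupings concrete, here is a small sketch of how each one might be declared while wiring bolts with TopologyBuilder (the spout, bolts, and parallelism values are illustrative assumptions):

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word-spout", new WordSpout());

// Shuffle grouping: tuples are distributed randomly and evenly across the bolt's tasks
builder.setBolt("shuffle-bolt", new CountBolt(), 4).shuffleGrouping("word-spout");

// Fields grouping: tuples with the same value of "word" always reach the same task
builder.setBolt("fields-bolt", new CountBolt(), 4)
   .fieldsGrouping("word-spout", new Fields("word"));

// Global grouping: the whole stream goes to the single task with the lowest id
builder.setBolt("global-bolt", new CountBolt()).globalGrouping("word-spout");

// All grouping: every tuple is replicated to every task of the bolt
builder.setBolt("signal-bolt", new SignalBolt(), 4).allGrouping("word-spout");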
Cluster Architecture
One of the main highlights of Apache Storm is that it is a fault-tolerant and fast distributed application with no Single Point of Failure (SPOF). We can install Apache Storm on as many systems as needed to increase the capacity of the application.
Let's have a look at how the Apache Storm cluster is designed and its internal architecture. The following diagram depicts the cluster design.
Apache Storm has two types of nodes, Nimbus (master node) and Supervisor (worker node). Nimbus is the central component of Apache Storm. The main job of Nimbus is to run the Storm topology. Nimbus analyzes the topology and gathers the tasks to be executed. Then, it distributes the tasks to the available supervisors.
A supervisor will have one or more worker processes. A supervisor will delegate the tasks to worker processes. A worker process will spawn as many executors as needed and run the tasks. Apache Storm uses an internal distributed messaging system for the communication between the nimbus and the supervisors.
Components and Description

Nimbus
Nimbus is the master node of the Storm cluster. All the other nodes in the cluster are called worker nodes. The master node is responsible for distributing data among the worker nodes, assigning tasks to worker nodes, and monitoring failures.

Supervisor
The nodes that follow the instructions given by the nimbus are called supervisors. A supervisor has multiple worker processes and it governs the worker processes to complete the tasks assigned by the nimbus.

Worker process
A worker process executes tasks related to a specific topology. A worker process does not run a task by itself; instead it creates executors and asks them to perform a particular task. A worker process can have multiple executors.

Executor
An executor is nothing but a single thread spawned by a worker process. An executor runs one or more tasks, but only for a specific spout or bolt.

Task
A task performs the actual data processing. It is either a spout or a bolt.

ZooKeeper framework
Apache ZooKeeper is a service used by a cluster (a group of nodes) to coordinate among themselves and maintain shared data with robust synchronization techniques. Nimbus is stateless, so it depends on ZooKeeper to monitor the working node status. ZooKeeper helps the supervisors to interact with the nimbus and is responsible for maintaining the state of the nimbus and the supervisors.
Storm is stateless in nature. Even though stateless nature has its own disadvantages, it actually
helps Storm to process real-time data in the best possible and quickest way.
Storm is not entirely stateless though. It stores its state in Apache ZooKeeper. Since the state is available in Apache ZooKeeper, a failed nimbus can be restarted and made to work from where it left off. Usually, service monitoring tools like monit will monitor the nimbus and restart it if there is any failure.
Apache Storm also has an advanced topology called Trident topology with state maintenance, and it also provides a high-level API like Pig. We will discuss all these features in the coming chapters.
Workflow
A working Storm cluster should have one nimbus and one or more supervisors. Another
important node is Apache ZooKeeper, which will be used for the coordination between the nimbus
and the supervisors.
Let us now take a close look at the workflow of Apache Storm:
Initially, the nimbus will wait for the Storm topology to be submitted to it.
Once a topology is submitted, it will process the topology and gather all the tasks that
are to be carried out and the order in which the task is to be executed.
Then, the nimbus will evenly distribute the tasks to all the available supervisors.
At a particular time interval, all supervisors will send heartbeats to the nimbus to inform
that they are still alive.
When a supervisor dies and doesn't send a heartbeat to the nimbus, the nimbus assigns its tasks to another supervisor.
When the nimbus itself dies, the supervisors will continue working on the already assigned tasks without any issue.
Once all the tasks are completed, the supervisor will wait for a new task to come in.
In the meantime, the dead nimbus will be restarted automatically by service monitoring
tools.
The restarted nimbus will continue from where it stopped. Similarly, a dead supervisor can also be restarted automatically. Since both the nimbus and the supervisors can be restarted automatically and both will continue as before, Storm is guaranteed to process all the tasks at least once.
Once all the topologies are processed, the nimbus waits for a new topology to arrive and
similarly the supervisor waits for new tasks.
A topology can be run in two modes:
Local mode: This mode is used for development, testing, and debugging because it is
the easiest way to see all the topology components working together. In this mode, we
can adjust parameters that enable us to see how our topology runs in different Storm
configuration environments. In Local mode, storm topologies run on the local machine in
a single JVM.
Production mode: In this mode, we submit our topology to the working Storm cluster, which is composed of many processes, usually running on different machines. As discussed in the workflow of Storm, a working cluster will run indefinitely until it is shut down.
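In production mode the topology is typically handed over to the cluster with the StormSubmitter class rather than LocalCluster. A minimal sketch, assuming the topology name and the builder are defined elsewhere (submitTopology also declares checked exceptions that a real main method would handle):

Config config = new Config();
config.setNumWorkers(4);   // number of worker processes requested for this topology

// Hands the topology to the nimbus of the running cluster; it keeps running
// until it is explicitly killed (for example with: storm kill my-topology)
StormSubmitter.submitTopology("my-topology", config, builder.createTopology());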
Distributed Messaging System
Apache Storm processes real-time data, and the input normally comes from a message queuing system. An external distributed messaging system will provide the input necessary for the real-time computation. Spouts will read the data from the messaging system, convert it into tuples, and input it into Apache Storm. The interesting fact is that Apache Storm uses its own distributed messaging system internally for the communication between its nimbus and supervisors.
The following table describes some of the popular high throughput messaging systems:
Apache Kafka – A distributed publish-subscribe messaging system designed for high throughput and persistent storage of message streams.

RabbitMQ – An open source, robust, and easy-to-use general-purpose message broker.

JMS (Java Message Service) – A Java API that supports creating, reading, and sending messages from one application to another.

ActiveMQ – An open source message broker that implements the JMS API.

ZeroMQ – A broker-less, lightweight peer-to-peer message-processing library.

Kestrel – A fast, reliable, and simple distributed message queue.
Thrift Protocol
Thrift was built at Facebook for cross-language services development and remote procedure calls (RPC). Later, it became an open source Apache project. Apache Thrift is an Interface Definition Language that lets you define new data types and implement services on top of those data types in an easy manner.
Apache Thrift is also a communication framework that supports embedded systems, mobile applications, web applications, and many programming languages. Some of the key features associated with Apache Thrift are its modularity, flexibility, and high performance. In addition, it can perform streaming, messaging, and RPC in distributed applications.
Storm extensively uses the Thrift protocol for its internal communication and data definition. A Storm topology is simply a Thrift struct, and Storm Nimbus, which runs the topology, is a Thrift service.
Installation
Let us now see how to install the Apache Storm framework on your machine. There are three major steps here: install Java on your system if you don't have it already, install the Apache ZooKeeper framework, and install Apache Storm itself.
Download and extract the JDK archive:
$ tar -zxf jdk-8u60-linux-x64.gz
Step 1.6
Now verify the Java installation using the verification command (java -version) explained in
Step 1.
Download and extract the ZooKeeper archive:
$ tar -zxf zookeeper-3.4.6.tar.gz
$ cd zookeeper-3.4.6
$ mkdir data
$ vi conf/zoo.cfg
tickTime=2000
dataDir=/path/to/zookeeper/data
clientPort=2181
initLimit=5
syncLimit=2
Once the configuration file has been saved successfully, you can start the ZooKeeper server.
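For example, with the standard ZooKeeper distribution the server can be started from the installation directory as follows (and stopped later with bin/zkServer.sh stop):

$ bin/zkServer.sh start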
Download and extract the Apache Storm archive:
$ tar -zxf apache-storm-0.9.5.tar.gz
$ cd apache-storm-0.9.5
$ mkdir data
$ vi conf/storm.yaml
storm.zookeeper.servers:
- "localhost"
nimbus.host: "localhost"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
After applying all the changes, save and return to terminal.
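With the configuration in place, the Storm daemons are typically started from the installation directory, each in its own terminal:

$ bin/storm nimbus
$ bin/storm supervisor
$ bin/storm ui

The Storm UI is then reachable in a browser, by default at http://localhost:8080.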
Working Example
We have gone through the core technical details of Apache Storm, and now it is time to code some simple scenarios.
Spout Creation
Spout is a component which is used for data generation. Basically, a spout will implement an
IRichSpout interface. The IRichSpout interface has the following important methods, each of which is described below:
open – Provides the spout with an environment to execute. The executors will run this method to initialize the spout.
nextTuple – Emits the generated data through the collector.
close – Called when a spout is going to shut down.
declareOutputFields – Declares the output schema of the tuple.
ack – Acknowledges that a specific tuple has been processed.
fail – Specifies that a specific tuple was not processed and is to be reprocessed.
Open
The signature of the open method is as follows:
open(Map conf, TopologyContext context, SpoutOutputCollector collector)
conf – Provides Storm configuration for this spout.
context – Provides complete information about the spout's place within the topology, its task id, and input and output information.
collector – Enables us to emit the tuple that will be processed by the bolts.
nextTuple
The signature of the nextTuple method is as follows:
nextTuple()
nextTuple() is called periodically from the same loop as the ack() and fail() methods. It must
release control of the thread when there is no work to do, so that the other methods have a
chance to be called. So the first line of nextTuple checks to see if processing has finished. If so,
it should sleep for at least one millisecond to reduce load on the processor before returning.
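The pattern described above might look like the following sketch inside a spout; the completed flag and the readNextLine helper are hypothetical and stand for whatever end-of-data check and data source a real spout would use:

@Override
public void nextTuple() {
   // If all work is done, yield the thread briefly instead of spinning
   if (completed) {
      Utils.sleep(1);   // sleep for at least one millisecond
      return;
   }
   String line = readNextLine();   // hypothetical data source
   if (line == null) {
      completed = true;
   } else {
      collector.emit(new Values(line));
   }
}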
close
The signature of the close method is as follows:
close()
declareOutputFields
The signature of the declareOutputFields method is as follows:
declareOutputFields(OutputFieldsDeclarer declarer)
declarer It is used to declare output stream ids, output fields, etc.
This method is used to specify the output schema of the tuple.
ack
The signature of the ack method is as follows:
ack(Object msgId)
This method acknowledges that a specific tuple has been processed.
fail
The signature of the fail method is as follows:
fail(Object msgId)
This method informs that a specific tuple has not been fully processed. Storm will reprocess the
specific tuple.
FakeCallLogReaderSpout
In our scenario, we need to collect the call log details. The information of a call log contains:
caller number
receiver number
duration
Since we don't have real-time information of call logs, we will generate fake call logs. The fake information will be created using the Random class. The complete program code is given below.
Coding: FakeCallLogReaderSpout.java
import java.util.*;
//import storm tuple packages
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
//import spout interface packages
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;

//Create the FakeCallLogReaderSpout class, which implements the IRichSpout interface
public class FakeCallLogReaderSpout implements IRichSpout {
//Collector used to pass tuples to the bolts
private SpoutOutputCollector collector;
//Topology context of this spout
private TopologyContext context;
//Random generator and counter used to create the fake call logs
private Random randomGenerator = new Random();
private Integer idx = 0;

@Override
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector)
{
this.context = context;
this.collector = collector;
}
@Override
public void nextTuple() {
if(this.idx <= 1000) {
List<String> mobileNumbers = new ArrayList<String>();
mobileNumbers.add("1234123401");
mobileNumbers.add("1234123402");
mobileNumbers.add("1234123403");
mobileNumbers.add("1234123404");
Integer localIdx = 0;
while(localIdx++ < 100 && this.idx++ < 1000) {
String fromMobileNumber = mobileNumbers.get(randomGenerator.nextInt(4));
String toMobileNumber = mobileNumbers.get(randomGenerator.nextInt(4));
while(fromMobileNumber.equals(toMobileNumber)) {
toMobileNumber = mobileNumbers.get(randomGenerator.nextInt(4));
}
Integer duration = randomGenerator.nextInt(60);

//Emit the generated call log record as a tuple: (from, to, duration)
this.collector.emit(new Values(fromMobileNumber, toMobileNumber, duration));
}
}
}

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("from", "to", "duration"));
}
@Override
public void activate() {
}
@Override
public void deactivate() {
}

@Override
public void ack(Object msgId) {
}
@Override
public void fail(Object msgId) {
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
Bolt Creation
Bolt is a component that takes tuples as input, processes the tuple, and produces new tuples as
output. Bolts will implement IRichBolt interface. In this program, two bolt classes
CallLogCreatorBolt and CallLogCounterBolt are used to perform the operations.
The IRichBolt interface has the following important methods, each of which is described below:
prepare – Provides the bolt with an environment to execute. The executors will run this method to initialize the bolt.
execute – Processes a single tuple of input.
cleanup – Called when a bolt is going to shut down.
declareOutputFields – Declares the output schema of the tuple.
Prepare
The signature of the prepare method is as follows:
prepare(Map conf, TopologyContext context, OutputCollector collector)
conf – Provides Storm configuration for this bolt.
context – Provides complete information about the bolt's place within the topology, its task id, input and output information, etc.
collector – Enables us to emit the processed tuple.
execute
The signature of the execute method is as follows:
execute(Tuple tuple)
Here tuple is the input tuple to be processed.
The execute method processes a single tuple at a time. The tuple data can be accessed by the getValue method of the Tuple class. It is not necessary to process the input tuple immediately. Multiple tuples can be processed and output as a single output tuple. The processed tuple can be emitted by using the OutputCollector class.
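For instance, a bolt that buffers several input tuples and emits one combined output tuple could be sketched as follows; the batch size of 10 and the single string field are illustrative assumptions:

private final List<String> buffer = new ArrayList<String>();

@Override
public void execute(Tuple tuple) {
   // Access the tuple data by position
   buffer.add(tuple.getValue(0).toString());

   // Emit one combined tuple for every 10 inputs
   if (buffer.size() == 10) {
      collector.emit(new Values(String.join(",", buffer)));
      buffer.clear();
   }
   collector.ack(tuple);
}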
cleanup
The signature of the cleanup method is as follows:
cleanup()
declareOutputFields
The signature of the declareOutputFields method is as follows:
declareOutputFields(OutputFieldsDeclarer declarer)
Here the parameter declarer is used to declare output stream ids, output fields, etc.
This method is used to specify the output schema of the tuple.
Coding: CallLogCreatorBolt.java
//import util packages
import java.util.HashMap;
import java.util.Map;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Tuple;
//Create the CallLogCreatorBolt class, which implements the IRichBolt interface
public class CallLogCreatorBolt implements IRichBolt {
//Create instance for OutputCollector which collects and emits tuples to produce output
private OutputCollector collector;
@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.collector = collector;
}
@Override
public void execute(Tuple tuple) {
String from = tuple.getString(0);
String to = tuple.getString(1);
Integer duration = tuple.getInteger(2);

//Combine the caller and receiver numbers into a single "call" value and emit it with the duration
collector.emit(new Values(from + " - " + to, duration));
}

@Override
public void cleanup() {
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("call", "duration"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
Coding: CallLogCounterBolt.java
import java.util.HashMap;
import java.util.Map;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Tuple;

//Create the CallLogCounterBolt class, which implements the IRichBolt interface
public class CallLogCounterBolt implements IRichBolt {
Map<String, Integer> counterMap;
private OutputCollector collector;

@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.counterMap = new HashMap<String, Integer>();
this.collector = collector;
}
@Override
public void execute(Tuple tuple) {
String call = tuple.getString(0);
Integer duration = tuple.getInteger(1);
if(!counterMap.containsKey(call)){
counterMap.put(call, 1);
}else{
Integer c = counterMap.get(call) + 1;
counterMap.put(call, c);
}
collector.ack(tuple);
}
@Override
public void cleanup() {
for(Map.Entry<String, Integer> entry:counterMap.entrySet()){
System.out.println(entry.getKey()+" : " + entry.getValue());
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("call"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
Creating Topology
The Storm topology is basically a Thrift structure. TopologyBuilder class provides simple and
easy methods to create complex topologies. The TopologyBuilder class has methods to set spout
(setSpout) and to set bolt (setBolt). Finally, TopologyBuilder has createTopology to create
topology. Use the following code snippet to create a topology:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("call-log-reader-spout", new FakeCallLogReaderSpout());

builder.setBolt("call-log-creator-bolt", new CallLogCreatorBolt())
   .shuffleGrouping("call-log-reader-spout");

builder.setBolt("call-log-counter-bolt", new CallLogCounterBolt())
   .fieldsGrouping("call-log-creator-bolt", new Fields("call"));
Local Cluster
For development purposes, we can create a local cluster using the "LocalCluster" object and then submit the topology using the "submitTopology" method of the "LocalCluster" class. One of the arguments for "submitTopology" is an instance of the "Config" class. The "Config" class is used to set configuration options before submitting the topology. This configuration option will be merged with the cluster configuration at run time and sent to all tasks (spouts and bolts) via the prepare method. Once the topology is submitted to the cluster, we will wait 10 seconds for the cluster to compute the submitted topology and then shut down the cluster using the shutdown method of "LocalCluster". The complete program code is as follows:
Coding: LogAnalyserStorm.java
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
//
TopologyBuilder builder = new TopologyBuilder();
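The listing above is abridged. The rest of the main method would add the spout and bolts to the builder (as shown in the previous section) and then run the topology on a local cluster; a minimal sketch consistent with the classes defined earlier:

// ... setSpout/setBolt calls as shown in the Creating Topology section ...

Config config = new Config();
LocalCluster cluster = new LocalCluster();

// Submit, let the topology run for about 10 seconds, then shut the cluster down
cluster.submitTopology("LogAnalyserStorm", config, builder.createTopology());
Thread.sleep(10000);
cluster.shutdown();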
Building and Running the Application
The complete application has four Java source files:
FakeCallLogReaderSpout.java
CallLogCreatorBolt.java
CallLogCounterBolt.java
LogAnalyserStorm.java
Output
Once the application is started, it will output the complete details about the cluster startup
process, spout and bolt processing, and finally, the cluster shutdown process. In
"CallLogCounterBolt", we have printed the call and its count details. This information will be
displayed on the console as follows:
1234123402 - 1234123401 : 78
1234123402 - 1234123404 : 88
1234123402 - 1234123403 : 105
1234123401 - 1234123404 : 74
1234123401 - 1234123403 : 81
1234123401 - 1234123402 : 81
1234123403 - 1234123404 : 86
1234123404 - 1234123401 : 63
1234123404 - 1234123402 : 82
1234123403 - 1234123402 : 83
1234123404 - 1234123403 : 86
1234123403 - 1234123401 : 93
Non-JVM languages
Storm topologies are implemented by Thrift interfaces, which makes it easy to submit topologies in any language. Storm supports Ruby, Python, and many other languages. Let's take a look at the Python binding.
Python Binding
Python is a general-purpose interpreted, interactive, object-oriented, and high-level
programming language. Storm supports Python to implement its topology. Python supports
emitting, anchoring, acking, and logging operations.
As you know, bolts can be defined in any language. Bolts written in another language are executed as sub-processes, and Storm communicates with those sub-processes using JSON messages over stdin/stdout. First take a sample bolt, WordSplit, that supports the Python binding.

public static class WordSplit extends ShellBolt implements IRichBolt {
   public WordSplit() {
      super("python", "splitword.py");
   }
Coding: splitword.py
import storm

class WordCountBolt(storm.BasicBolt):
    def process(self, tup):
        words = tup.values[0].split(" ")
        for word in words:
            storm.emit([word])

WordCountBolt().run()
This is a sample Python implementation that splits the words in a given sentence and emits each word. Similarly, you can bind with other supported languages as well.
Trident
Trident is an extension of Storm. Like Storm, Trident was also developed by Twitter. The main
reason behind developing Trident is to provide a high-level abstraction on top of Storm along
with stateful stream processing and low latency distributed querying.
Trident uses spout and bolt, but these low-level components are auto-generated by Trident
before execution. Trident has functions, filters, joins, grouping, and aggregation.
Trident processes streams as a series of batches, which are referred to as transactions. Generally, the size of those small batches will be on the order of thousands or millions of tuples, depending on the input stream. In this way, Trident is different from Storm, which performs tuple-by-tuple processing.
The batch processing concept is very similar to database transactions. Every transaction is assigned a transaction ID. The transaction is considered successful once all of its processing completes.
However, a failure in processing one of the transaction's tuples will cause the entire transaction
to be retransmitted. For each batch, Trident will call beginCommit at the beginning of the
transaction, and commit at the end of it.
Trident Topology
The Trident API exposes an easy option to create Trident topologies using the TridentTopology class. Basically, a Trident topology receives an input stream from a spout and performs an ordered sequence of operations (filter, aggregation, grouping, etc.) on the stream. Storm tuples are replaced by Trident tuples and bolts are replaced by operations. A simple Trident topology can be created as follows:
TridentTopology topology = new TridentTopology();
Trident Tuples
Trident tuple is a named list of values. The TridentTuple interface is the data model of a Trident
topology. The TridentTuple interface is the basic unit of data that can be processed by a Trident
topology.
Trident Spout
Trident spout is similar to Storm spout, with additional options to use the features of Trident.
Actually, we can still use the IRichSpout, which we used in the Storm topology, but it will be non-transactional in nature and we won't be able to use the advantages provided by Trident.
The basic spout that provides all the functionality to use the features of Trident is "ITridentSpout". It supports both transactional and opaque transactional semantics. The other spouts are IBatchSpout, IPartitionedTridentSpout, and IOpaquePartitionedTridentSpout.
In addition to these generic spouts, Trident has many sample implementations of Trident spouts. One of them is the FeederBatchSpout, which we can use to send a named list of trident tuples easily without worrying about batch processing, parallelism, etc.
A FeederBatchSpout can be created and attached to the topology as follows:

FeederBatchSpout testSpout = new FeederBatchSpout(
   ImmutableList.of("fromMobileNumber", "toMobileNumber", "duration"));

topology.newStream("fixed-batch-spout", testSpout)
Trident Operations
Trident relies on the Trident Operation to process the input stream of trident tuples. Trident
API has a number of in-built operations to handle simple-to-complex stream processing. These
operations range from simple validation to complex grouping and aggregation of trident tuples.
Let us go through the most important and frequently used operations.
Filter
Filter is an object used to perform the task of input validation. A Trident filter gets a subset of
trident tuple fields as input and returns either true or false depending on whether certain
conditions are satisfied or not. If true is returned, then the tuple is kept in the output stream;
otherwise, the tuple is removed from the stream. Filter will basically inherit from the BaseFilter
class and implement the isKeep method. Here is a sample implementation of filter operation:
public class MyFilter extends BaseFilter {
public boolean isKeep(TridentTuple tuple) {
return tuple.getInteger(1) % 2 == 0;
}
}
input:  [1, 2], [1, 3], [1, 4]
output: [1, 2], [1, 4]
The filter function can be called in the topology using the each method. The Fields class can be used to specify the input (a subset of the trident tuple). The sample code is as follows:
TridentTopology topology = new TridentTopology();
topology.newStream("spout", spout)
.each(new Fields("a", "b"), new MyFilter())
Function
Function is an object used to perform a simple operation on a single trident tuple. It takes a
subset of trident tuple fields and emits zero or more new trident tuple fields.
Function basically inherits from the BaseFunction class and implements the execute method.
A sample implementation is given below:
public class MyFunction extends BaseFunction {
public void execute(TridentTuple tuple, TridentCollector collector) {
int a = tuple.getInteger(0);
int b = tuple.getInteger(1);

// Emit the sum of the two input fields as the new output field
collector.emit(new Values(a + b));
}
}
input:  [1, 2], [1, 3], [1, 4]
output: [1, 2, 3], [1, 3, 4], [1, 4, 5]
Just like the Filter operation, the Function operation can be called in a topology using the each method. The sample code is as follows:
TridentTopology topology = new TridentTopology();
topology.newStream("spout", spout)
   .each(new Fields("a", "b"), new MyFunction(), new Fields("d"));
Aggregation
Aggregation is an object used to perform aggregation operations on an input batch, partition, or stream. Trident has three types of aggregation. They are as follows:
aggregate – Aggregates each batch of trident tuples in isolation. During the aggregate process, the tuples are initially repartitioned using the global grouping to combine all partitions of the same batch into a single partition.
partitionAggregate – Aggregates each partition of a batch in isolation, instead of the whole batch, and emits a new tuple for each partition.
persistentAggregate – Aggregates over all trident tuples across all batches and stores the result in either memory or a database.
// aggregate operation
topology.newStream("spout", spout)
   .each(new Fields("a", "b"), new MyFunction(), new Fields("d"))
   .aggregate(new Count(), new Fields("count"))

// partitionAggregate operation
topology.newStream("spout", spout)
   .each(new Fields("a", "b"), new MyFunction(), new Fields("d"))
   .partitionAggregate(new Count(), new Fields("count"))
The Count aggregation used above is a built-in CombinerAggregator. A CombinerAggregator implements the init, combine, and zero methods; the Count implementation looks like this:

public class Count implements CombinerAggregator<Long> {
@Override
public Long init(TridentTuple tuple) {
return 1L;
}
@Override
public Long combine(Long val1, Long val2) {
return val1 + val2;
}
@Override
public Long zero() {
return 0L;
}
}
Grouping
Grouping operation is an inbuilt operation and can be called by the groupBy method. The
groupBy method repartitions the stream by doing a partitionBy on the specified fields, and then
within each partition, it groups tuples together whose group fields are equal. Normally, we use
groupBy along with persistentAggregate to get the grouped aggregation. The sample code is
as follows:
TridentTopology topology = new TridentTopology();

// the spout is assumed to emit a "call" field
TridentState callCounts = topology.newStream("spout", spout)
   .groupBy(new Fields("call"))
   .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
State Maintenance
Trident provides a mechanism for state maintenance. State information can be stored in the
topology itself, or you can store it in a separate database as well. The reason to maintain the state is that if any tuple fails during processing, the failed tuple is retried. This
creates a problem while updating the state because you are not sure whether the state of this
tuple has been updated previously or not. If the tuple has failed before updating the state, then
retrying the tuple will make the state stable. However, if the tuple has failed after updating the
state, then retrying the same tuple will again increase the count in the database and make the
state unstable. One needs to perform the following steps to ensure a message is processed only
once:
Assign a unique ID to each batch. If the batch is retried, it is given the same unique ID.
The state updates are ordered among batches. For example, the state update of the
second batch will not be possible until the state update for the first batch has completed.
Distributed RPC
Distributed RPC is used to query and retrieve the result from the Trident topology. Storm has an
inbuilt distributed RPC server. The distributed RPC server receives the RPC request from the
client and passes it to the topology. The topology processes the request and sends the result to the distributed RPC server, which then redirects it to the client.
Trident's distributed RPC query executes like a normal RPC query, except for the fact that these
queries are run in parallel.
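For a local test, a query can be issued directly against a LocalDRPC instance (against a real cluster, the DRPCClient class is used instead). A minimal sketch; the function name and argument follow the example later in this chapter:

LocalDRPC drpc = new LocalDRPC();
// ... the topology's newDRPCStream("call_count", drpc) is wired and submitted ...

// Executes the "call_count" DRPC function with the given argument and prints the result
System.out.println(drpc.execute("call_count", "1234123401 - 1234123402"));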
FormatCall
The purpose of the FormatCall class is to format the call information comprising the caller number and the receiver number into a single "call" field. The complete code is as follows:
Coding: FormatCall.java
import backtype.storm.tuple.Values;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.tuple.TridentTuple;
public class FormatCall extends BaseFunction {
@Override
public void execute(TridentTuple tuple, TridentCollector collector) {
String fromMobileNumber = tuple.getString(0);
String toMobileNumber = tuple.getString(1);

// Emit the formatted call string, e.g. "1234123401 - 1234123402"
collector.emit(new Values(fromMobileNumber + " - " + toMobileNumber));
}
}
CSVSplit
The purpose of the CSVSplit class is to split the input string based on comma (,) and emit
every word in the string. This function is used to parse the input argument of distributed
querying. The complete code is as follows:
Coding: CSVSplit.java
import backtype.storm.tuple.Values;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.tuple.TridentTuple;
public class CSVSplit extends BaseFunction {
@Override
public void execute(TridentTuple tuple, TridentCollector collector) {
for(String word: tuple.getString(0).split(",")) {
if(word.length() > 0) {
collector.emit(new Values(word));
}
}
}
}
Log Analyzer
This is the main application. Initially, the application will initialize the TridentTopology and feed caller information using FeederBatchSpout. A Trident topology stream can be created using the newStream method of the TridentTopology class. Similarly, a Trident topology DRPC stream can be created using the newDRPCStream method of the TridentTopology class. A simple DRPC server can be created using the LocalDRPC class. LocalDRPC has an execute method to search for some keyword. The complete code is given below.
Coding: LogAnalyserTrident.java
import java.util.*;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.utils.DRPCClient;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.tuple.TridentTuple;
import storm.trident.operation.builtin.FilterNull;
import storm.trident.operation.builtin.Count;
import storm.trident.operation.builtin.Sum;
import storm.trident.operation.builtin.MapGet;
import storm.trident.operation.builtin.Debug;
import storm.trident.operation.BaseFilter;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.FeederBatchSpout;
import storm.trident.testing.Split;
import storm.trident.testing.MemoryMapState;
import com.google.common.collect.ImmutableList;
TridentState callCounts =
topology
.newStream("fixed-batch-spout", testSpout)
.each(new Fields("fromMobileNumber", "toMobileNumber"), new FormatCall(), new
Fields("call"))
.groupBy(new Fields("call"))
.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
topology.newDRPCStream("call_count", drpc)
.stateQuery(callCounts, new Fields("args"), new MapGet(), new Fields("count"));
topology.newDRPCStream("multiple_call_count", drpc)
.each(new Fields("args"), new CSVSplit(), new Fields("call"))
.groupBy(new Fields("call"))
.stateQuery(callCounts, new Fields("call"), new MapGet(), new Fields("count"))
.each(new Fields("call", "count"), new Debug())
.each(new Fields("count"), new FilterNull())
.aggregate(new Fields("count"), new Sum(), new Fields("sum"));
testSpout.feed(
ImmutableList.of(new Values("1234123401", "1234123403",
randomGenerator.nextInt(60))));
testSpout.feed(
ImmutableList.of(new Values("1234123401", "1234123404",
randomGenerator.nextInt(60))));
testSpout.feed(
ImmutableList.of(new Values("1234123402", "1234123403",
randomGenerator.nextInt(60))));
idx = idx + 1;
}
cluster.shutdown();
drpc.shutdown();
Building and Running the Application
The complete application has three Java source files:
FormatCall.java
CSVSplit.java
LogAnalyserTrident.java
Output
Once the application is started, the application will output the complete details about the cluster
startup process, operations processing, DRPC Server and client information, and finally, the
cluster shutdown process. This output will be displayed on the console as shown below.
DRPC : Query starts
[["1234123401 - 1234123402",10]]
DEBUG: [1234123401 - 1234123402, 10]
DEBUG: [1234123401 - 1234123403, 10]
[[20]]
DRPC : Query ends
Apache Storm in Twitter
Here in this chapter, we will discuss a real-time application of Apache Storm. We will see how
Storm is used in Twitter.
Twitter
Twitter is an online social networking service that provides a platform to send and receive user
tweets. Registered users can read and post tweets, but unregistered users can only read tweets.
Hashtags are used to categorize tweets by keyword by appending # before the relevant keyword. Now let us take a real-time scenario of finding the most used hashtags per topic.
Spout Creation
The purpose of spout is to get the tweets submitted by people as soon as possible. Twitter
provides Twitter Streaming API, a web service based tool to retrieve the tweets submitted by
people in real time. Twitter Streaming API can be accessed in any programming language.
twitter4j is an open source, unofficial Java library, which provides a Java based module to easily
access the Twitter Streaming API. twitter4j provides a listener-based framework to access the
tweets. To access the Twitter Streaming API, we need to sign up for a Twitter developer account and obtain the following OAuth authentication details:
Consumer key
Consumer secret
Access token
Access token secret
Storm provides a twitter spout, TwitterSampleSpout, in its starter kit. We will be using it to
retrieve the tweets. The spout needs OAuth authentication details and at least a keyword. The
spout will emit real-time tweets based on keywords. The complete program code is given below.
Coding: TwitterSampleSpout.java
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import twitter4j.FilterQuery;
import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.auth.AccessToken;
import twitter4j.conf.ConfigurationBuilder;
import backtype.storm.Config;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;
@SuppressWarnings("serial")
public class TwitterSampleSpout extends BaseRichSpout {
SpoutOutputCollector _collector;
LinkedBlockingQueue<Status> queue = null;
TwitterStream _twitterStream;
String consumerKey;
String consumerSecret;
String accessToken;
String accessTokenSecret;
String[] keyWords;
public TwitterSampleSpout() {
// TODO Auto-generated constructor stub
}
@Override
public void open(Map conf, TopologyContext context,
SpoutOutputCollector collector) {
queue = new LinkedBlockingQueue<Status>(1000);
_collector = collector;

// Create a twitter4j status listener that pushes every incoming tweet into the queue
StatusListener listener = new StatusListener() {

@Override
public void onStatus(Status status) {
queue.offer(status);
}
@Override
public void onDeletionNotice(StatusDeletionNotice sdn) {
}
@Override
public void onTrackLimitationNotice(int i) {
}
@Override
public void onScrubGeo(long l, long l1) {
}
@Override
public void onException(Exception ex) {
}
@Override
public void onStallWarning(StallWarning arg0) {
// TODO Auto-generated method stub
}
};
// Build the twitter4j OAuth configuration
ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setDebugEnabled(true)
.setOAuthConsumerKey(consumerKey)
.setOAuthConsumerSecret(consumerSecret)
.setOAuthAccessToken(accessToken)
.setOAuthAccessTokenSecret(accessTokenSecret);

_twitterStream = new TwitterStreamFactory(cb.build()).getInstance();
_twitterStream.addListener(listener);
if (keyWords.length == 0) {
_twitterStream.sample();
}
else {
// Track only the configured keywords
FilterQuery query = new FilterQuery().track(keyWords);
_twitterStream.filter(query);
}
}

@Override
public void nextTuple() {
Status ret = queue.poll();
if (ret == null) {
Utils.sleep(50);
} else {
_collector.emit(new Values(ret));
}
}
@Override
public void close() {
_twitterStream.shutdown();
}
@Override
public Map<String, Object> getComponentConfiguration() {
Config ret = new Config();
ret.setMaxTaskParallelism(1);
return ret;
}
@Override
public void ack(Object id) {
}
@Override
public void fail(Object id) {
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("tweet"));
}
}
Coding: HashtagReaderBolt.java
import java.util.HashMap;
import java.util.Map;
import twitter4j.*;
import twitter4j.conf.*;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Tuple;
public class HashtagReaderBolt implements IRichBolt {
private OutputCollector collector;

@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.collector = collector;
}
@Override
public void execute(Tuple tuple) {
Status tweet = (Status) tuple.getValueByField("tweet");

// Emit every hashtag contained in the tweet (HashtagEntity comes from twitter4j)
for(HashtagEntity hashtag : tweet.getHashtagEntities()) {
this.collector.emit(new Values(hashtag.getText()));
}
}

@Override
public void cleanup() {
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("hashtag"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
Coding: HashtagCounterBolt.java
import java.util.HashMap;
import java.util.Map;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Tuple;
public class HashtagCounterBolt implements IRichBolt {
Map<String, Integer> counterMap;
private OutputCollector collector;

@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.counterMap = new HashMap<String, Integer>();
this.collector = collector;
}
@Override
public void execute(Tuple tuple) {
String key = tuple.getString(0);
if(!counterMap.containsKey(key)){
counterMap.put(key, 1);
}else{
Integer c = counterMap.get(key) + 1;
counterMap.put(key, c);
}
collector.ack(tuple);
}
@Override
public void cleanup() {
for(Map.Entry<String, Integer> entry:counterMap.entrySet()){
System.out.println("Result: " + entry.getKey()+" : " + entry.getValue());
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("hashtag"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
Submitting a Topology
Submitting a topology is done from the main application. The Twitter topology consists of TwitterSampleSpout, HashtagReaderBolt, and HashtagCounterBolt. The following program code shows how to submit a topology.
Coding: TwitterHashtagStorm.java
import java.util.*;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
cluster.shutdown();
}
}
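The listing above is abridged. The body of the main method would read the OAuth credentials and keywords from the command line, wire the three components together, and run them on a local cluster. A minimal sketch (it assumes a TwitterSampleSpout constructor that accepts the credentials and keywords):

String consumerKey = args[0];
String consumerSecret = args[1];
String accessToken = args[2];
String accessTokenSecret = args[3];
String[] keyWords = Arrays.copyOfRange(args, 4, args.length);

Config config = new Config();
config.setDebug(true);

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("twitter-spout", new TwitterSampleSpout(consumerKey,
   consumerSecret, accessToken, accessTokenSecret, keyWords));
builder.setBolt("twitter-hashtag-reader-bolt", new HashtagReaderBolt())
   .shuffleGrouping("twitter-spout");
builder.setBolt("twitter-hashtag-counter-bolt", new HashtagCounterBolt())
   .fieldsGrouping("twitter-hashtag-reader-bolt", new Fields("hashtag"));

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("TwitterHashtagStorm", config, builder.createTopology());
Thread.sleep(10000);
cluster.shutdown();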
Building and Running the Application
The complete application has four Java source files:
TwitterSampleSpout.java
HashtagReaderBolt.java
HashtagCounterBolt.java
TwitterHashtagStorm.java
Output
The application will print the currently available hashtags and their counts. The output should be similar to the following:
Result: jazztastic : 1
Result: foodie : 1
Result: Redskins : 1
Result: Recipe : 1
Result: cook : 1
Result: android : 1
Result: food : 2
Result: NoToxicHorseMeat : 1
Result: Purrs4Peace : 1
Result: livemusic : 1
Result: VIPremium : 1
Result: Frome : 1
Result: SundayRoast : 1
Result: Millennials : 1
Result: HealthWithKier : 1
Result: LPs30DaysofGratitude : 1
Result: cooking : 1
Result: gameinsight : 1
Result: Countryfile : 1
Result: androidgames : 1
Apache Storm in Yahoo! Finance
Yahoo! Finance is the Internet's leading business news and financial data website. It is a part of
Yahoo! and gives information about financial news, market statistics, international market data
and other information about financial resources that anyone can access.
If you are a registered Yahoo! user, then you can customize Yahoo! Finance to take advantage
of its certain offerings. Yahoo! Finance API is used to query financial data from Yahoo!
This API displays data that is delayed by 15 minutes from real time and updates its database every minute to provide current stock-related information. Now let us take a real-time scenario of a company and see how to raise an alert when its stock value goes below 100.
Spout Creation
The purpose of spout is to get the details of the company and emit the prices to bolts. You can
use the following program code to create a spout.
Coding: YahooFinanceSpout.java
import java.util.*;
import java.io.*;
import java.math.BigDecimal;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;

// Third-party library for Yahoo! Finance quotes (assumed: the "YahooFinanceAPI" library)
import yahoofinance.YahooFinance;
import yahoofinance.Stock;

public class YahooFinanceSpout implements IRichSpout {
private SpoutOutputCollector collector;
private TopologyContext context;

@Override
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector)
{
this.context = context;
this.collector = collector;
}
@Override
public void nextTuple() {
try {
// Fetch the latest quote for each company and emit a (company, price) tuple
Stock stock = YahooFinance.get("INTC");
BigDecimal price = stock.getQuote().getPrice();
this.collector.emit(new Values("INTC", price.doubleValue()));

stock = YahooFinance.get("GOOGL");
price = stock.getQuote().getPrice();
this.collector.emit(new Values("GOOGL", price.doubleValue()));

stock = YahooFinance.get("AAPL");
price = stock.getQuote().getPrice();
this.collector.emit(new Values("AAPL", price.doubleValue()));
} catch(Exception e) {
// Ignore failed lookups; the next call to nextTuple() will retry
}
}

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("company", "price"));
}
@Override
public void close() {
}
@Override
public void activate() {
}
@Override
public void deactivate() {
}
@Override
public void ack(Object msgId) {
}
@Override
public void fail(Object msgId) {
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
Bolt Creation
Here the purpose of the bolt is to process the given company's prices and flag when a price falls below 100. It uses a Java Map object to set the cutoff price alert to true when the stock price falls below 100; otherwise false. The complete program code is as follows:
Coding: PriceCutOffBolt.java
import java.util.HashMap;
import java.util.Map;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Tuple;
public class PriceCutOffBolt implements IRichBolt {
Map<String, Integer> cutOffMap;
Map<String, Boolean> resultMap;
private OutputCollector collector;

@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.cutOffMap = new HashMap<String, Integer>();
this.resultMap = new HashMap<String, Boolean>();
this.cutOffMap.put("INTC", 100);
this.cutOffMap.put("AAPL", 100);
this.cutOffMap.put("GOOGL", 100);
this.collector = collector;
}
@Override
public void execute(Tuple tuple) {
String company = tuple.getString(0);
Double price = tuple.getDouble(1);
if(this.cutOffMap.containsKey(company)){
Integer cutOffPrice = this.cutOffMap.get(company);
// Record an alert when the price falls below the cut-off value
if(price < cutOffPrice) {
this.resultMap.put(company, true);
} else {
this.resultMap.put(company, false);
}
}

collector.ack(tuple);
}
@Override
public void cleanup() {
for(Map.Entry<String, Boolean> entry:resultMap.entrySet()){
System.out.println(entry.getKey()+" : " + entry.getValue());
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("cut_off_price"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
Submitting a Topology
This is the main application where YahooFinanceSpout.java and PriceCutOffBolt.java are
connected together and produce a topology. The following program code shows how you can
submit a topology.
Coding: YahooFinanceStorm.java
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
config.setDebug(true);
cluster.shutdown();
}
}
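The abridged main method above would wire the spout and bolt together and run them on a local cluster; a minimal sketch consistent with the classes defined in this chapter:

Config config = new Config();
config.setDebug(true);

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("yahoo-finance-spout", new YahooFinanceSpout());
builder.setBolt("price-cutoff-bolt", new PriceCutOffBolt())
   .fieldsGrouping("yahoo-finance-spout", new Fields("company"));

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("YahooFinanceStorm", config, builder.createTopology());
Thread.sleep(10000);
cluster.shutdown();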
Building and Running the Application
The complete application has three Java source files:
YahooFinanceSpout.java
PriceCutOffBolt.java
YahooFinanceStorm.java
Output
The output will be similar to the following:
GOOGL : false
AAPL : false
INTC : true
Applications
The Apache Storm framework supports many of today's best industrial applications. We will provide a very brief overview of some of the most notable applications of Storm in this chapter.
Klout
Klout is an application that uses social media analytics to rank its users based on online social influence through the Klout Score, which is a numerical value between 1 and 100. Klout uses Apache Storm's inbuilt Trident abstraction to create complex topologies that stream data.
Telecom Industry
Telecommunication providers process millions of phone calls per second. They perform forensics
on dropped calls and poor sound quality. Call detail records flow in at a rate of millions per
second and Apache Storm processes those in real-time and identifies any troubling patterns.
Storm analysis can be used to continuously improve call quality.