Building Python Real-Time Applications With Storm - Sample Chapter
Kartik Bhatnagar is a technical architect in the big data analytics unit of Infosys.
He is passionate about new technologies and is leading the development work of
Apache Storm and MarkLogic NoSQL for a leading bank. Kartik has a total of 10 years
of experience in software development for Fortune 500 companies in many countries.
His expertise also includes the full Amazon Web Services (AWS) stack and modern
open source libraries. He is active on the Stack Overflow platform and is always
eager to help young developers with new technologies. Kartik has also worked as a
reviewer of the book Elasticsearch Blueprints, published by Packt Publishing. In the
future, he wants to work on predictive analytics.
Barry Hart began using Storm in 2012 at AirSage. He quickly saw the potential
of Storm while suffering from the limitations of the basic storm.py that it provides.
In response, he developed Petrel, the first open source library for developing
Storm applications in pure Python. He also contributed some bug fixes to the core
Storm project.
When it comes to development, Barry has worked on a little of everything: Windows
printer drivers, logistics planning frameworks, OLAP engines for the retail industry,
database engines, and big data workflows.
Barry is currently an architect and senior Python/C++ developer at Pindrop Security,
helping fight phone fraud in banking, insurance, investment, and other industries.
Preface
Apache Storm is a powerful framework for creating complex workflows that ingest
and process huge amounts of data. With its generic concepts of spouts and bolts,
along with simple deployment and monitoring tools, it allows developers to focus
on the specifics of their workflow without reinventing the wheel.
However, Storm is written in Java. While it supports other programming languages
besides Java, the tools are incomplete and there is little documentation and
few examples.
One of the authors of this book created Petrel, the first framework that supports the
creation of Storm topologies in 100 percent Python. He has firsthand experience with
the struggles of building a Python Storm topology on the Java tool set. This book
closes this gap, providing a resource to help Python developers of all experience
levels in building their own applications using Storm.
Getting Acquainted
with Storm
In this chapter, you will get acquainted with the following topics:
An overview of Storm
Storm installation
Over the course of this chapter, you will learn why Storm is creating a buzz in the
industry, why it is relevant in present-day scenarios, and what real-time computation
is. We will also explain Storm's different cluster modes, its installation, and the
approach to configuration.
Overview of Storm
Storm is a distributed, fault-tolerant, and highly scalable platform for processing
streaming data in a real-time manner. It became an Apache top-level project
in September 2014, and was previously an Apache Incubator project since
September 2013.
Trending topic analysis: Twitter uses Storm for use cases such as finding the
trending topics within a given time frame or at the present moment. There are
numerous such use cases, and finding the top trends in a real-time manner is
essential. Storm fits well into such use cases. You can also perform a running
aggregation of values with the help of any database.
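As an illustration only (this is plain Python, not Storm code), the kind of rolling
count that a trending-topics component performs per incoming event can be sketched
as follows; the window size and topic names are made up for the example:

```python
from collections import Counter, deque

class RollingTopicCounter:
    """Counts topic mentions over a sliding window of the last N events,
    mimicking the running aggregation a trending-topics bolt would keep."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # oldest events fall off
        self.counts = Counter()

    def add(self, topic):
        if len(self.window) == self.window.maxlen:
            # Evict the oldest event's count before it leaves the window.
            self.counts[self.window[0]] -= 1
        self.window.append(topic)
        self.counts[topic] += 1

    def top(self, n):
        # Only topics still present in the window are reported.
        return [t for t, _ in self.counts.most_common(n) if self.counts[t] > 0]

counter = RollingTopicCounter(window_size=3)
for topic in ["storm", "python", "storm", "kafka"]:
    counter.add(topic)
# The window now holds only the last 3 events: python, storm, kafka
```

In a real topology, this state would live inside a bolt and be updated for every
tuple; the sliding window keeps memory bounded no matter how fast events arrive.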
Storm ideally fits any use case where data must be processed quickly and reliably
as soon as it arrives, at a rate of more than 10,000 messages per second. Actually,
10,000+ is a small number; Twitter is able to process millions of tweets per second
on a large cluster. Throughput depends on how well the Storm topology is written,
how well it is tuned, and the cluster size.
Storm programs (a.k.a. topologies) are designed to run 24x7 and will not stop until
someone stops them explicitly.
Storm is written in both Clojure and Java. Clojure is a Lisp-family functional
programming language that runs on the JVM and is well suited to concurrent and
parallel programming. Storm also leverages mature Java libraries built over the
last 10 years. All of these can be found inside the storm/lib folder.
Simple to program: It's easy to learn the Storm framework. You can write
code in the programming language of your choice and can also use the
existing libraries of that programming language. There is no compromise.
Free, open source, and lots of open source community support: Being an Apache
project, Storm can be freely distributed and modified without any worry about
the legal aspects. Storm gets a lot of attention from the open source community
and attracts a large number of good developers to contribute to the code.
Developer mode
A developer can download Storm from the distribution site, unzip it somewhere in
$HOME, and simply run the Storm topology in local mode. Once the topology is
successfully tested locally, it can be submitted to run on the cluster.
So, each physical machine (3, 4, and 5) runs one supervisor daemon, and each
machine's storm.yaml points to the IP address of the machine where Nimbus is
running (this can be 1 or 2). All Supervisor machines must add the Zookeeper
IP addresses (1 and 2) to storm.yaml. The Storm UI daemon should run on the
Nimbus machine (this can be 1 or 2).
A Linux machine (Storm version 0.9 and later can also run on
Windows machines)
We will be making lots of changes in the Storm configuration file (that is,
storm.yaml), which is present under $STORM_HOME/conf. First, we start the
Zookeeper process, which carries out coordination between Nimbus and the
Supervisors. Then, we start the Nimbus master daemon, which distributes code
in the Storm cluster. Next, the Supervisor daemon listens for work assigned
(by Nimbus) to the node it runs on and starts and stops the worker processes
as necessary.
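The startup order just described (Zookeeper first, then Nimbus, then the
Supervisors, with the optional UI after Nimbus) can be sketched as a small
dependency table in plain Python; this is purely illustrative, not part of Storm:

```python
# Each daemon lists the daemons that must already be running before it starts.
DEPENDS_ON = {
    "zookeeper": [],
    "nimbus": ["zookeeper"],
    "supervisor": ["zookeeper", "nimbus"],
    "ui": ["nimbus"],  # the optional Storm UI reads cluster state via Nimbus
}

def startup_order(deps):
    """Return a start order in which every daemon follows its dependencies."""
    order = []
    def visit(name):
        if name in order:
            return
        for dep in deps[name]:
            visit(dep)  # start dependencies first
        order.append(name)
    for name in deps:
        visit(name)
    return order

order = startup_order(DEPENDS_ON)
```

Following this order matters in practice: Nimbus and the Supervisors both register
themselves in Zookeeper, so starting them before Zookeeper is up will fail.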
ZeroMQ/JZMQ and Netty are inter-JVM communication libraries that permit two
machines or two JVMs to send and receive process data (tuples) between each other.
JZMQ is a Java binding of ZeroMQ. The latest versions of Storm (0.9+) have moved
to Netty. If you download an old version of Storm, installing ZeroMQ and JZMQ is
required. In this book, we consider only the latest versions of Storm, so you
don't really need ZeroMQ/JZMQ.
Zookeeper installation
Zookeeper is a coordinator for the Storm cluster. The interaction between Nimbus
and worker nodes is done through Zookeeper. The installation of Zookeeper is well
explained on the official website at http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html#sc_InstallingSingleMode.
Alternatively, use jps to find <pid> and then use kill -9 <pid> to kill
the processes.
Storm installation
Storm can be installed in either of these two ways:
1. Fetch a Storm release from this location using Git:
https://github.com/nathanmarz/storm.git
with starting Nimbus and Supervisor. When running a topology in a local IDE on a
Windows machine, C:\Users\<User-Name>\AppData\Local\Temp should be cleaned.
Netty configuration
You don't really need to install anything extra for Netty. This is because it's a pure
Java-based communication library. All new versions of Storm support Netty.
Add the following lines to your storm.yaml file. Configure and adjust the values to
best suit your use case:
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100
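These entries are simple flat key: value pairs, so you can sanity-check them in
plain Python without any YAML library. The snippet below re-parses the lines shown
above and verifies that the retry backoff bounds are ordered; it is purely
illustrative and not part of Storm:

```python
netty_settings = """
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100
"""

def parse_settings(text):
    """Parse flat 'key: value' lines into a dict, converting integer values."""
    settings = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(": ")
        value = value.strip().strip('"')
        settings[key] = int(value) if value.isdigit() else value
    return settings

conf = parse_settings(netty_settings)
# A retry backoff only makes sense if min_wait_ms <= max_wait_ms.
assert conf["storm.messaging.netty.min_wait_ms"] <= conf["storm.messaging.netty.max_wait_ms"]
```

A check like this can catch a typo in storm.yaml before you restart the daemons,
which is cheaper than discovering it from failed worker communication at runtime.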
Starting daemons
Storm daemons are the processes that need to be running before you submit your
program to the cluster. When you run a topology program in a local IDE, these
daemons start automatically on predefined ports, but on a cluster, they must
run at all times:
1. Start the master daemon, nimbus. Go to the bin directory of the Storm
installation and execute the following command (assuming that zookeeper
is running):
./storm nimbus
Alternatively, to run it in the background, use the same command
with nohup, like this:
nohup ./storm nimbus &
2. Now we have to start the supervisor daemon. Go to the bin directory of the
Storm installation and execute this command:
./storm supervisor
Alternatively, to run it in the background:
nohup ./storm supervisor &
3. Let's start the Storm UI. The Storm UI is an optional process that helps us
see the statistics of a running topology. You can see how many executors
and workers are assigned to a particular topology. The command needed
to run the Storm UI is as follows:
./storm ui
Daemon          Configuration prefix
Nimbus          nimbus.*
UI              ui.*
Log viewer      logviewer.*
DRPC            drpc.*
Supervisor      supervisor.*
Topology        topology.*
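The prefix convention above means any storm.yaml key can be mapped back to the
daemon it configures. A small illustrative helper (the example keys are real Storm
settings, but treat them here only as prefix examples):

```python
# Maps a storm.yaml key prefix to the daemon that the key configures.
DAEMON_PREFIXES = {
    "nimbus.": "Nimbus",
    "ui.": "UI",
    "logviewer.": "Log viewer",
    "drpc.": "DRPC",
    "supervisor.": "Supervisor",
    "topology.": "Topology",
}

def daemon_for(key):
    """Return the daemon a configuration key belongs to, by prefix match."""
    for prefix, daemon in DAEMON_PREFIXES.items():
        if key.startswith(prefix):
            return daemon
    return "Unknown"  # e.g. cluster-wide keys such as storm.zookeeper.servers
```

This grouping is useful when reviewing a large storm.yaml: keys with a daemon's
prefix only take effect on machines running that daemon.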
Topology configuration can be supplied through the topology builder in code or
through a custom yaml file:

Changing storm.yaml
Supplying topology.yaml as a command-line option: create topology.yaml with
entries similar to those in storm.yaml, and supply it when running the
topology. Python:
petrel submit --config topology.yaml
Any configuration change in storm.yaml will affect all running topologies, but
when using the conf.setXXX option in code, each topology can override that
option with whatever best suits it.
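This override behavior can be sketched as a simple two-layer merge in plain
Python (illustrative only; the key names are real Storm settings, but Storm
performs the merge internally):

```python
# Cluster-wide defaults, as if read from storm.yaml.
cluster_conf = {
    "topology.workers": 2,
    "topology.max.spout.pending": 1000,
}

# Per-topology overrides, as set in code via conf.setXXX-style calls.
topology_conf = {
    "topology.workers": 4,  # this topology wants more workers
}

# A topology's effective configuration is the cluster configuration with its
# own overrides applied on top; other topologies keep the cluster defaults.
effective = {**cluster_conf, **topology_conf}
```

The key point is the precedence: per-topology settings win over storm.yaml, so
editing storm.yaml changes every topology while conf.setXXX changes only one.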
Summary
This concludes the first chapter. It gave an overview of how applications were
developed before Storm came into existence, what real-time computations are, and
how Storm, as a programming framework, has become so popular. This chapter taught
you how to perform Storm configuration. It also gave you details about Storm's
daemons, Storm clusters, and their setup.
In the next chapter, we will explore the details of Storm's anatomy.