Real-Time Data Processing & Analytics - Distributed Computing & Event Processing Using Spark, Flink, Storm, Kafka
Distributed Computing and Event Processing using Apache Spark, Flink, Storm, and
Kafka
Shilpi Saxena
Saurabh Gupta
BIRMINGHAM - MUMBAI
Practical Real-Time Data Processing
and Analytics
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of
the information presented. However, the information contained in this book is sold
without warranty, either express or implied. Neither the authors, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78728-120-2
www.packtpub.com
Credits

Authors: Shilpi Saxena, Saurabh Gupta

Copy Editor: Safis Editing
Shilpi has more than 12 years (3 years in the big data space) of experience in the
development and execution of various facets of enterprise solutions both in the
products and services dimensions of the software industry. An engineer by degree
and profession, she has worn varied hats, such as developer, technical leader, product
owner, tech manager, and so on, and she has seen all the flavors that the industry has
to offer. She has architected and worked on some of the pioneering production implementations of big data on Storm and Impala, with auto-scaling in AWS.
Shilpi also authored Real-time Analytics with Storm and Cassandra with Packt
Publishing.
Saurabh has a total of 10 years of rich experience in the IT industry (3+ years in big data). He has exposure to various IoT use cases, including telecom, healthcare, smart cities, smart cars, and so on.
About the Reviewers
Ruben Oliva Ramos is a computer systems engineer from Tecnologico de Leon
Institute, with a master's degree in computer and electronic systems engineering,
teleinformatics, and networking specialization from the University of Salle Bajio in
Leon, Guanajuato, Mexico. He has more than 5 years of experience in developing
web applications to control and monitor devices connected with Arduino and
Raspberry Pi, using web frameworks and cloud services to build Internet of Things applications.
He has authored the book Internet of Things Programming with JavaScript by Packt
Publishing. He is also involved in monitoring, controlling, and the acquisition of data
with Arduino and Visual Basic .NET for Alfaomega.
I would like to thank my savior and lord, Jesus Christ, for giving me the strength and
courage to pursue this project; my dearest wife, Mayte; our two lovely sons, Ruben
and Dario; my dear father, Ruben; my dearest mom, Rosalia; my brother, Juan
Tomas; and my sister, Rosalia, whom I love, for all their support while reviewing
this book, for allowing me to pursue my dream, and tolerating not being with them
after my busy day job.
Juan is an Alfaomega reviewer and has worked on the book Wearable designs for Smart watches, Smart TVs, and Android mobile devices.
I want to thank God for giving me the wisdom and humility to review this book. I
thank Packt for giving me the opportunity to review this amazing book and to
collaborate with a group of committed people. I want to thank my beautiful wife,
Brenda; our two magic princesses, Regina and Renata; and our next member, Angel
Tadeo; all of you give me the strength, happiness, and joy to start a new day. Thanks for being my family.
Did you know that Packt offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version
at www.PacktPub.com and as a print book customer, you are entitled to a discount on
the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up
for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all
Packt books and video courses, as well as industry-leading tools to help you plan
your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our
editorial process. To help us improve, please leave us an honest review on this
book's Amazon page at https://www.amazon.com/dp/1787281205.
If you'd like to join our team of regular reviewers, you can e-mail us at
[email protected]. We award our regular reviewers with free eBooks and
videos in exchange for their valuable feedback. Help us be relentless in improving
our products!
Table of Contents
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introducing Real-Time Analytics
What is big data?
Big data infrastructure
Real–time analytics – the myth and the reality
Near real–time solution – an architecture that works
NRT – The Storm solution
NRT – The Spark solution
Lambda architecture – analytics possibilities
IOT – thoughts and possibilities
Edge analytics
Cloud – considerations for NRT and IOT
Summary
2. Real Time Applications – The Basic Ingredients
The NRT system and its building blocks
Data collection
Stream processing
Analytical layer – serve it to the end user
NRT – high-level system view
NRT – technology view
Event producer
Collection
Broker
Transformation and processing
Storage
Summary
3. Understanding and Tailing Data Streams
Understanding data streams
Setting up infrastructure for data ingestion
Apache Kafka
Apache NiFi
Logstash
Fluentd
Flume
Taping data from source to the processor - expectations and caveats
Comparing and choosing what works best for your use case
Do it yourself
Setting up Elasticsearch
Summary
4. Setting up the Infrastructure for Storm
Overview of Storm
Storm architecture and its components
Characteristics
Components
Stream grouping
Setting up and configuring Storm
Setting up Zookeeper
Installing
Configuring
Standalone
Cluster
Running
Setting up Apache Storm
Installing
Configuring
Running
Real-time processing job on Storm
Running job
Local
Cluster
Summary
5. Configuring Apache Spark and Flink
Setting up and a quick execution of Spark
Building from source
Downloading Spark
Running an example
Setting up and a quick execution of Flink
Build Flink source
Download Flink
Running example
Setting up and a quick execution of Apache Beam
Beam model
Running example
MinimalWordCount example walk through
Balancing in Apache Beam
Summary
6. Integrating Storm with a Data Source
RabbitMQ – messaging that works
RabbitMQ exchanges
Direct exchanges
Fanout exchanges
Topic exchanges
Headers exchanges
RabbitMQ setup
RabbitMQ — publish and subscribe
RabbitMQ – integration with Storm
AMQPSpout
PubNub data stream publisher
String together Storm-RMQ-PubNub sensor data topology
Summary
7. From Storm to Sink
Setting up and configuring Cassandra
Setting up Cassandra
Configuring Cassandra
Storm and Cassandra topology
Storm and IMDB integration for dimensional data
Integrating the presentation layer with Storm
Setting up Grafana with the Elasticsearch plugin
Downloading Grafana
Configuring Grafana
Installing the Elasticsearch plugin in Grafana
Running Grafana
Adding the Elasticsearch datasource in Grafana
Writing code
Executing code
Visualizing the output on Grafana
Do It Yourself
Summary
8. Storm Trident
State retention and the need for Trident
Transactional spout
Opaque transactional Spout
Basic Storm Trident topology
Trident internals
Trident operations
Functions
map and flatMap
peek
Filters
Windowing
Tumbling window
Sliding window
Aggregation
Aggregate
Partition aggregate
Persistence aggregate
Combiner aggregator
Reducer aggregator
Aggregator
Grouping
Merge and joins
DRPC
Do It Yourself
Summary
9. Working with Spark
Spark overview
Spark framework and schedulers
Distinct advantages of Spark
When to avoid using Spark
Spark – use cases
Spark architecture - working inside the engine
Spark pragmatic concepts
RDD – the name says it all
Spark 2.x – advent of data frames and datasets
Summary
10. Working with Spark Operations
Spark – packaging and API
RDD pragmatic exploration
Transformations
Actions
Shared variables – broadcast variables and accumulators
Broadcast variables
Accumulators
Summary
11. Spark Streaming
Spark Streaming concepts
Spark Streaming - introduction and architecture
Packaging structure of Spark Streaming
Spark Streaming APIs
Spark Streaming operations
Connecting Kafka to Spark Streaming
Summary
12. Working with Apache Flink
Flink architecture and execution engine
Flink basic components and processes
Integration of source stream to Flink
Integration with Apache Kafka
Example
Integration with RabbitMQ
Running example
Flink processing and computation
DataStream API
DataSet API
Flink persistence
Integration with Cassandra
Running example
FlinkCEP
Pattern API
Detecting pattern
Selecting from patterns
Example
Gelly
Gelly API
Graph representation
Graph creation
Graph transformations
DIY
Summary
13. Case Study
Introduction
Data modeling
Tools and frameworks
Setting up the infrastructure
Implementing the case study
Building the data simulator
Hazelcast loader
Building Storm topology
Parser bolt
Check distance and alert bolt
Generate alert Bolt
Elasticsearch Bolt
Complete Topology
Running the case study
Load Hazelcast
Generate Vehicle static value
Deploy topology
Start simulator
Visualization using Kibana
Summary
Preface
This book has basic to advanced recipes on real-time computing. We will cover technologies such as Flink, Spark, and Storm. The book includes practical recipes to help you process unbounded streams of data, thus doing for real-time processing what Hadoop did for batch processing. You will begin with setting up the development environment and proceed to implement stream processing. This will be followed by recipes on real-time problems using RabbitMQ, Kafka, and NiFi along with Storm, Spark, Flink, Beam, and more. By the end of this book, you will have
gained a thorough understanding of the fundamentals of NRT and its applications,
and be able to identify and apply those fundamentals to any suitable problem.
This book is written in a cookbook style, with plenty of practical recipes, well-
explained code examples, and relevant screenshots and diagrams.
What this book covers
Section – A: Introduction – Getting Familiar
This section gives the readers basic familiarity with the real-time analytics spectra
and domains. We talk about the basic components and their building blocks. This
section consists of the following chapters:
This section predominantly focuses on exploring Storm, its compute capabilities, and
its various features. This section consists of the following chapters:
This section predominantly focuses on exploring Spark, its compute capabilities, and
its various features. This section consists of the following chapters:
New terms and important words are shown in bold. Words that you see on the
screen, for example, in menus or dialog boxes, appear in the text like this: "In order
to download new modules, we will go to Files | Settings | Project Name | Project
Interpreter."
1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder
using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Real-time-Processing-and-Analytics. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/.
Check them out!
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can
save other readers from frustration and help us improve subsequent versions of this
book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and
entering the details of your errata. Once your errata are verified, your submission
will be accepted and the errata will be uploaded to our website or added to any list of
existing errata under the Errata section of that title. To view the previously submitted
errata, go to https://www.packtpub.com/books/content/support and enter the name of the
book in the search field. The required information will appear under the Errata
section.
Piracy
Piracy of copyrighted material on the internet is an ongoing problem across all
media. At Packt, we take the protection of our copyright and licenses very seriously.
If you come across any illegal copies of our works in any form on the internet, please
provide us with the location address or website name immediately so that we can
pursue a remedy. Please contact us at [email protected] with a link to the
suspected pirated material. We appreciate your help in protecting our authors and our
ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at
[email protected], and we will do our best to address the problem.
Introducing Real-Time Analytics
This chapter sets the context for the reader by providing an overview of the big data
technology landscape in general and real–time analytics in particular. This provides
an outline for the book conceptually, with an attempt to ignite the spark for
inquisitiveness that will encourage readers to undertake the rest of the journey
through the book.
Volume: This dimension refers to the amount of data; look around you, huge amounts of data are being generated every second – it may be the email you send, Twitter, Facebook, or other social media, or it can just be all the videos, pictures, SMS messages, call records, and data from varied devices and sensors. We have scaled up the data-measuring metrics to terabytes, zettabytes, and yottabytes – they are all humongous figures. Look at Facebook alone; it sees around 10 billion messages a day, consolidated across all users, around 5 billion likes a day, and around 400 million photographs uploaded each day. Data statistics in terms of volume are startling; all of the data generated from the beginning of time up to 2008 is roughly equivalent to what we generate in a day today, and I am sure soon it will be an hour. This volume aspect alone dwarfs the ability of traditional databases to store and process this amount of data in reasonable and useful time frames, whereas a big data stack can be employed to store, process, and compute on amazingly large data sets in a cost-effective, distributed, and reliably efficient manner.
Velocity: This refers to the data generation speed, or the rate at which data is
being generated. In today's world, where we mentioned that the volume of data
has undergone a tremendous surge, this aspect is not lagging behind. We have
loads of data because we are able to generate it so fast. Look at social media;
things are circulated in seconds and they become viral, and the insight from
social media is analysed in milliseconds by stock traders, and that can trigger
lots of activity in terms of buying or selling. At a target point-of-sale counter, it takes a few seconds for a credit card swipe, and within that time fraudulent transaction checks, payment processing, bookkeeping, and acknowledgement are all done. Big data gives us the power to analyse the data at tremendous speed.
Variety: This dimension tackles the fact that the data can be unstructured. In
the traditional database world, and even before that, we were used to having a
very structured form of data that fitted neatly into tables. Today, more than 80%
of data is unstructured – quotable examples are photos, video clips, social media
updates, data from a variety of sensors, voice recordings, and chat conversations.
Big data lets you store and process this unstructured data in a very structured
manner; in fact, it effaces the variety.
Veracity: It's all about the validity and correctness of data. How accurate and usable is the data? Not everything out of millions and zillions of data records is correct, accurate, and referable. That's what veracity actually is: how trustworthy the data is and what the quality of the data is. Examples of data with veracity issues include Facebook and Twitter posts with nonstandard acronyms or typos. Big data has brought the ability to run analytics on this kind of data to the table. The sheer volume of data is one of the strong reasons veracity becomes a challenge.
Value: This is what the name suggests: the value that the data actually holds. It
is unarguably the most important V or dimension of big data. The only
motivation for going towards big data for processing super large data sets is to
derive some valuable insight from it. In the end, it's all about cost and benefits.
Big data is a much talked about technology across businesses and the technical world
today. There are myriad domains and industries that are convinced of its usefulness,
but the implementation focus is primarily application-oriented, rather than
infrastructure-oriented. The next section predominantly walks you through the same.
Big data infrastructure
Before delving further into big data infrastructure, let's have a look at the big data
high–level landscape.
The following figure captures high–level segments that demarcate the big data space:
It clearly depicts the various segments and verticals within the big data technology
canvas (bottom up).
The key is the bottom layer that holds the data in scalable and distributed mode:
Now that you are acquainted with the basics of big data and the key segments of the
big data technology landscape, let's take a deeper look at the big data concept with
the Hadoop framework as an example. Then, we will move on to take a look at the
architecture and methods of implementing a Hadoop cluster; this will be a close
analogy to high–level infrastructure and the typical storage requirements for a big
data cluster. One of the key and critical aspects that we will delve into is information security in the context of big data.
A couple of key aspects that drive and dictate the move to big data infraspace are
highlighted in the following figure:
Cluster design: This is the most significant and deciding aspect for
infrastructural planning. The cluster design strategy of the infrastructure is
basically the backbone of the solution; the key deciding elements for the same
are the application use cases and requirements, workload, resource computation
(depending upon memory intensive and compute intensive computations), and
security considerations.
As if the previous expectations from real-time solutions were not sufficient, one of the basic expectations for rolling them out to production, in today's data-generating and zero-downtime era, is that the system should be self-managed (or managed with minimal effort) and inherently built in a fault-tolerant and auto-recovering manner, to handle most if not all scenarios. It should also be able to provide a familiar, basic SQL-like interface in a similar or close format.

However outrageous the previous expectations may sound, they are perfectly normal and minimal expectations of any big data solution today.
Nevertheless, coming back to our topic of real-time analytics: now that we have touched briefly upon the system-level expectations in terms of data, processing, and output, systems are being devised and designed to process zillions of transactions and apply complex data science and machine learning algorithms on the fly, to compute the results as close to real time as possible. The new term being used is
close to real–time/near real–time or human real–time. Let's dedicate a moment to
having a look at the following figure that captures the concept of computation time
and the context and significance of the final insight:
Ad-hoc queries over zettabytes of data take computation time in the order of hour(s) and are thus typically described as batch. A noteworthy aspect of the previous figure is that the size of each circle is an analogy for the size of the data being processed.
Ad impressions/Hashtag trends/deterministic workflows/tweets: These use
cases are predominantly termed as online and the compute time is generally in
the order of 500ms/1 second. Though the compute time is considerably reduced
as compared to previous use cases, the data volume being processed is also
significantly reduced. It would be a very rapidly arriving data stream of a few GB in magnitude.
Financial tracking/mission critical applications: Here, the data volume is low, the data arrival rate is extremely high, the processing demand is extremely high, and low-latency compute results are yielded in time windows of a few milliseconds.
Apart from the computation time, there are other significant differences between
batch and real–time processing and solution designing:
Towards the end of this section, all I would like to emphasize is that a near real-time (NRT) solution is as close to true real-time as it is practically possible to attain. So, as said, RT is actually a myth (or hypothetical) while NRT is a reality. We deal with and see NRT applications on a daily basis in terms of connected vehicles, prediction and recommendation engines, health care, and wearable appliances.

There are some critical aspects that actually introduce latency to the total turnaround time, or TAT as we call it. It's actually the time lapse between the occurrence of an event and the time actionable insight is generated out of it.
The data/events generally travel from diverse geographical locations over the
wire (internet/telecom channels) to the processing hub. Some time is lost in this activity.
Processing:
Data landing: Due to security aspects, data generally lands on an edge
node and is then ingested into the cluster
Data cleansing: The data veracity aspect needs to be catered for, to
eliminate bad/incorrect data before processing
Data massaging and enriching: Binding and enriching transactional data with dimensional data
Actual processing
Storing the results
All previous aspects of processing incur:
CPU cycles
Disk I/O
Network I/O
Marshalling and unmarshalling of data (serialization aspects)
So, now that we understand the reality of real–time analytics, let's look a little deeper
into the architectural segments of such solutions.
Near real–time solution – an
architecture that works
In this section, we will learn which architectural patterns can be used to build a scalable, sustainable, and robust real-time solution.

A high-level NRT solution recipe looks fairly straightforward and simple, with a data collection funnel, a distributed processing engine, and a few other ingredients like in-memory cache, stable storage, and dashboard plugins.

At a high level, the basic analytics process can be segmented into three parts, which are depicted well in the previous figure:
If we delve a level deeper, there are two contending proven streaming computation
technologies on the market, which are Storm and Spark. In the coming section we
will take a deeper look at a high–level NRT solution that's derived from these stacks.
NRT – The Storm solution
This solution captures the high–level streaming data in real–time and routes it
through some Queue/broker: Kafka or RabbitMQ. Then, the distributed processing
part is handled through Storm topology, and once the insights are computed, they
can be written to a fast write data store like Cassandra or some other queue like
Kafka for further real–time downstream processing:
As per the figure, we collect real–time streaming data from diverse data sources,
through push/pull collection agents like Flume, Logstash, FluentD, or Kafka
adapters. Then, the data is written to Kafka partitions; Storm topologies pull/read the streaming data from Kafka, process this flowing data in the topology, and write the insights/results to Cassandra or some other real-time dashboard.
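Purely as an illustrative sketch (Storm and its Kafka/Cassandra integration are covered hands-on in later chapters), the skeleton of such a wiring in Java could look like the following. It assumes Storm 1.x with the storm-kafka module, a local ZooKeeper at localhost:2181, and a Kafka topic named sensor-events; the processing bolt here simply prints each event, where a real pipeline would enrich it and hand it to a Cassandra writer bolt:

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class NrtStormPipeline {

    // Placeholder processing bolt; a real topology would parse/enrich the event
    // and emit it towards a Cassandra or dashboard writer bolt instead of printing.
    public static class ProcessBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("event: " + tuple.getString(0));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt in this sketch, so no output fields are declared
        }
    }

    public static void main(String[] args) throws Exception {
        // Kafka spout configuration; topic name and ZooKeeper address are assumptions
        SpoutConfig spoutConfig = new SpoutConfig(
                new ZkHosts("localhost:2181"), "sensor-events", "/kafka-spout", "nrt-demo");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
        builder.setBolt("process-bolt", new ProcessBolt(), 2).shuffleGrouping("kafka-spout");

        // Local mode for a quick test; StormSubmitter would be used for a real cluster
        new LocalCluster().submitTopology("nrt-storm-pipeline", new Config(), builder.createTopology());
    }
}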
NRT – The Spark solution
At a very high level, the data flow pipeline with Spark is very similar to the Storm architecture depicted in the previous figure, but one of the most critical aspects of this flow is that Spark leverages HDFS as a distributed storage layer. Here, have a look before we get into further dissection of the overall flow and its nitty-gritty:

As with a typical real-time analytic pipeline, we ingest the data using one of the streaming data grabbing agents like Flume or Logstash. We introduce Kafka into the system to ensure decoupling from the source agents. Then, we have the Spark Streaming component that provides a distributed computing platform for processing the data, before we dump the results to some stable storage unit, dashboard, or Kafka queue.
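Again purely as a sketch (connecting Kafka to Spark Streaming is detailed in Chapter 11), a minimal Java driver using the spark-streaming-kafka-0-10 connector could look like the following; the broker address, consumer group, and sensor-events topic name are assumptions for illustration, and each micro-batch is simply printed where a real pipeline would aggregate and persist the results:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class NrtSparkPipeline {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("nrt-spark-pipeline").setMaster("local[2]");
        // The micro-batch interval is the knob that brings Spark close to real time
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "nrt-spark-demo");

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Arrays.asList("sensor-events"), kafkaParams));

        // Each micro-batch of events is processed here; a real pipeline would compute
        // insights and write them to stable storage, a dashboard, or another Kafka topic
        stream.map(record -> record.value()).print();

        jssc.start();
        jssc.awaitTermination();
    }
}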
One essential difference between the previous two architectural paradigms is that, while Storm is essentially a real-time transactional processing engine that is, by default, good at processing the incoming data event by event, Spark works on the concept of micro-batching. It's essentially a pseudo real-time compute engine, where close to real-time compute expectations can be met by reducing the micro-batch size. Storm is essentially designed for lightning-fast processing, thus all transformations are in memory, because any disk operation generates latency; this feature is both a boon and a bane for Storm (because memory is volatile: if things break, everything has to be reworked and intermediate results are lost). On the other hand, Spark is essentially backed up by HDFS and is robust and more fault tolerant, as intermediaries are backed up in HDFS.
Over the last couple of years, big data applications have seen a brilliant shift as per
the following sequence:
Now, the question is: why did the above evolution take place? Well, the answer is
that, when folks were first acquainted with the power of Hadoop, they really liked
building the applications which could process virtually any amount of data and could
scale up to any level in a seamless, fault tolerant, non–disruptive way. Then we
moved to an era where people realized the power of now and ambitious processing
became the need, with the advent of scalable, distributed processing engines like
Storm. Storm was scalable and came with lightning-fast processing power and
guaranteed processing. But then, something changed; we realized the limitations and
strengths of both Hadoop batch systems and Storm real–time systems: the former
were catering to my appetite for volume and the latter was excellent at velocity. My
real–time applications were perfect, but they were performing over a small window
of the entire data set and did not have any mechanism for correction of data/results at
some later time. Conversely, while my Hadoop implementations were accurate and
robust, they took a long time to arrive at any conclusive insight. We reached a point
where we replicated complete/part solutions to arrive at a solution involving the
combination of both batch and real-time implementations. One of the very recent NRT architectural patterns is Lambda architecture, a much sought-after solution that combines the best of both batch and real-time implementations, without having any need to replicate and maintain two separate solutions. It gives me volume and velocity, which is an edge over earlier architectures, and it can cater to a wider set of use cases.
Lambda architecture – analytics
possibilities
Now that we have introduced this wonderful architectural pattern, let's take a closer
look at it before delving into the possible analytic use cases that can be implemented
with this new pattern.
We all know that, at the base level, Hadoop gives me vast storage through HDFS, and a very robust processing engine in the form of MapReduce, which can handle a humongous amount of data and can perform myriad computations. However, it has a
long turnaround time (TAT) and it's a batch system that helps us cope with the
volume aspect of big data. If we need speed and velocity for processing and are
looking for a low–latency solution, we have to resort to a real–time processing
engine that could quickly process the latest or the recent data and derive quick
insights that are actionable in the current time frame. But along with velocity and
quick TAT, we also need newer data to be progressively integrated into the batch
system for deep batch analytics that are required to execute on entire data sets. So, essentially, we land in a situation where we need both batch and real-time systems; the optimal architectural combination of this pattern is called Lambda architecture (λ).

The input data is fed to both the batch and speed layers, where the batch layer works at creating the precomputed views of the entire immutable master data. This layer is predominantly an immutable data store with write-once and many bulk reads.
The speed layer handles the recent data and maintains only incremental views over
the recent set of the data. This layer has both random reads and writes in terms of
data accessibility.
The crux of the puzzle lies in the intelligence of the serving layer, where the data from both the batch and speed layers is merged and the queries are catered for, so we get the best of both worlds seamlessly. Close to real-time requests are served from the incremental views (which have a low retention policy) of the speed layer, while queries referencing older data are catered to by the master data views generated in the batch layer. This layer caters only to random reads and no random writes, though it does handle batch computations in the form of queries and joins, and bulk writes.
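To make the merge idea concrete, the following is a deliberately simplified, conceptual Java sketch (not a real implementation) of how a serving-layer query could combine a precomputed batch view with an incremental speed view; all class and method names here are illustrative:

import java.util.HashMap;
import java.util.Map;

public class ServingLayerSketch {
    private final Map<String, Long> batchView = new HashMap<>();  // recomputed periodically by the batch layer
    private final Map<String, Long> speedView = new HashMap<>();  // updated incrementally by the speed layer

    // Called whenever the batch layer finishes recomputing its views over the master data
    public void absorbBatchView(Map<String, Long> recomputed) {
        batchView.clear();
        batchView.putAll(recomputed);
        speedView.clear();  // recent increments are now covered by the fresh batch view
    }

    // Called by the speed layer for every incoming event
    public void increment(String key, long delta) {
        speedView.merge(key, delta, Long::sum);
    }

    // Query path: complete-but-stale batch result plus fresh-but-partial speed result
    public long query(String key) {
        return batchView.getOrDefault(key, 0L) + speedView.getOrDefault(key, 0L);
    }
}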
However, Lambda architecture is not a one-stop solution for all hybrid use cases;
there are some key aspects that need to be taken care of:
Now that we have acquainted ourselves well with the prevalent architectural patterns
in real–time analytics, let us talk about the use cases that are possible in this
segment:
The preceding figure highlights the high–level domains and various key use cases
that may be executed.
IOT – thoughts and possibilities
The Internet of Things: this term, coined in 1999 by Kevin Ashton, has become one of the most promising door openers of the decade. Although we had an IoT precursor in the form of M2M and instrumentation control for industrial automation, the way IoT and the era of connected smart devices have arrived is something that has never happened before. The following figure will give you a bird's-eye view of the vastness and variety of the reach of IoT applications:
We are all surrounded by devices that are smart and connected; they have the ability
to sense, process, transmit, and even act, based on their processing. The age of machines that was science fiction a few years ago has become reality. I have connected vehicles that sense the keys and get unlocked/locked as I walk towards or away from them. I have proximity-sensing beacons in my supermarkets which sense my proximity to a shelf and flash offers to my cell phone. I have smart
ACs that regulate the temperature based on the number of people in the room. My
smart offices save electricity by switching the lights and ACs off in empty
conference rooms. The list seems to be endless and growing every second.
At the heart of it, IoT is nothing but an ecosystem of connected devices which have the ability to communicate over the internet. Here, devices/things could be anything,
like a sensor device, a person with a wearable, a place, a plant, an animal, a machine
– well, virtually any physical item you could think of on this planet can be connected
today. There are predominantly seven layers to any IoT platform; these are depicted
and described in the following figure:
Now that we understand the high-level architecture and layers of a standard IoT application, the next step is to understand the key aspects where an IoT solution is
constrained and what the implications are on overall solution design:
Security: This is a key concern area for the entire data-driven solution segment,
but the concept of big data and devices connected to the internet makes the
entire system more susceptible to hacking and vulnerable in terms of security,
thus making it a strategic concern area to be addressed while designing the
solution at all layers for data at rest and in motion.
Power consumption/battery life: We are devising solutions for devices and
not human beings; thus, the solutions we design for them should be of very low
power consumption overall without taxing or draining battery life.
Connectivity and communication: The devices, unlike humans, are always
connected and can be very chatty. Here again, we need a lightweight protocol
for overall communication aspects for low latency data transfer.
Recovery from failures: These solutions are designed to process billions of data points in a self-sustaining, 24/7 mode. The solution should be built with the capability to diagnose failures, apply back pressure, and then self-recover from the situation with minimal data loss. Today, IoT solutions are being designed to handle sudden spikes of data by detecting a latency/bottleneck and having the ability to auto-scale up and down elastically.
Scalability: The solutions need to be designed so that they are linearly scalable without the need to re-architect the base framework or design, the reason being that this domain is exploding with an unprecedented and unpredictable number of devices being connected, with a whole plethora of future use cases just waiting to happen.
Next are the implications of the previous constraints of the IoT application
framework, which surface in the form of communication channels, communication
protocols, and processing adapters.
Direct Ethernet/WiFi/3G
LoRa
Bluetooth Low Energy (BLE)
RFID/Near Field communication (NFC)
Medium range radio mesh networks like Zigbee
For communication protocols, the de facto standard on the board as of now is MQTT, a lightweight publish/subscribe messaging protocol designed for constrained devices and low-bandwidth, high-latency networks, and the reasons for its wide usage are evident.
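As a quick illustration of how lightweight device-side publishing is with MQTT, the following Java sketch uses the Eclipse Paho client; the broker URL, client ID, topic name, and payload are assumptions made purely for illustration:

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttMessage;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class SensorPublisher {
    public static void main(String[] args) throws Exception {
        // Connect to a local broker; in a real deployment this would be the IoT gateway/broker
        MqttClient client = new MqttClient("tcp://localhost:1883", "sensor-42", new MemoryPersistence());
        MqttConnectOptions options = new MqttConnectOptions();
        options.setCleanSession(true);
        client.connect(options);

        // QoS 1 (at-least-once) is a typical trade-off for low-power telemetry
        MqttMessage message = new MqttMessage("{\"temp\": 21.5}".getBytes());
        message.setQos(1);
        client.publish("devices/sensor-42/telemetry", message);

        client.disconnect();
    }
}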
These considerations led to the development of a new kind of solution and, in turn, a new arena of IoT computation: the term is edge analytics and, as the name suggests, it's all about pushing the processing to the edge, so that the data is processed at its source.
Edge analytics
As depicted in the previous figure, the IOT computation is now divided into
segments, as follows:
Sensor–level edge analytics: Wherein data is processed and some insights are
derived at the device level itself
Edge analytics: These are the analytics wherein the data is processed and
insights are derived at the gateway level
Core analytics: This flavour of analytics requires all data to arrive at a common
compute engine (distributed storage and distributed computation) and then the
high–complexity processing is done to derive actionable insights for people and
processes
Now that we have been introduced to and acquainted with the cloud, the next obvious point to understand is what this buzz is all about and why it is that the advent of the cloud is closing the curtains on the era of traditional data centers. Let's understand some of the key benefits of cloud computing that have actually made this platform such a hot-selling item for NRT and IoT applications.
In the next chapter, we will take the reader a little deeper into real-time analytical applications, architecture, and concepts. We will touch upon the basic building blocks of an NRT application, the technology stack required, and the challenges encountered while developing it.
Real Time Applications – The Basic
Ingredients
This chapter gets you acquainted with the basic building blocks of near real-time
(NRT) systems. We introduce you to a high-level logical, physical, and technical
view of such applications, and will touch upon various technology choices for each
building block of the system:
It's very important to understand the key aspects where traditional monolithic application systems fall short of serving the need of the hour:

The answer to the previous issues is an architecture that supports streaming, and thus provides its end users access to actionable insights in real time over ever-flowing streams of real-time fact data. A couple of challenges to think through in the design of a stream processing system are captured in the following points:
Local state and consistency of the system for large scale high velocity systems
Data doesn't arrive at intervals, it keeps flowing in, and it's streaming in all the
time
No single source of truth in the form of a backend database; instead, the applications subscribe to or tap into the stream of fact data
Now that we understand the time dimensions and SLAs with respect to NRT, real-time, and batch systems, let's walk through the next step, which talks about understanding the building blocks of NRT systems.
The key aspects that come under the criteria of data collection tools, in the general
context of big data and real-time specifically, are as follows:
Apart from this, any data collection tool should be able to cater for data from a
variety of sources such as:
The Broker: that collects and holds the events or data streams from the data
collection agents
The Processing Engine: that actually transforms, correlates, aggregates the
data, and performs other necessary operations
The Distributed Cache: that actually serves as a mechanism for maintaining
common datasets across all distributed components of the Processing Engine
The same aspects of the stream processing component are zoomed out and depicted
in the diagram that follows:
There are a few key attributes that should be catered for by the stream processing
component:
These aspects are basically considered while identifying and selecting the stream
processing application/framework for a real-time use case implementation.
Analytical layer – serve it to the end
user
The analytical layer is the most creative and interesting of all the components of an
NRT application. So far, all we have talked about is backend processing, but this is
the layer where we actually present the output/insights to the end user graphically,
visually in the form of an actionable item.
The most crucial aspect of the data visualization technique that needs to be chosen as
part of the solution is actually presenting the information to the intended audience in
a format that they can comprehend and act upon. The crux here is that just smart
processing of the data and arriving at an actionable insight is not sufficient; it has to
reach the actors, be they humans or processes.
Before delving further into the nitty-gritties of business intelligence and visualization
components, first let's understand the challenges big data and high velocity NRT/RT
applications have brought to the mix of the problem statement.
Need for speed: The world is evolving and realizing the power of now:
more and more companies are leaning towards real-time insights to gain an
edge over their competition. So, the visualization tools should complement the
speed of the application they are part of, so that they are quick to depict and
present the key actors with accurate facts in meaningful formats so that
informed decisions can be taken.
Understanding the data and presenting it in the right context: At times, the
same result needs to be modulated and presented differently, depending upon
the audience being catered for. So, the tool should provide for this flexibility
and for the capability to design a visual solution around the actionable insight in
the most meaningful format. For instance, if you are charting vehicle location in
a city, then you may want to use a heat map, where variance in color shows the
concentration/number of vehicles, rather than plotting every vehicle on the
graph. While you are presenting an itinerary of a particular flight/ship, you may
not want to merge any data point and would plot each of them on the graph.
Dealing with outliers: Graphical data representations easily denote the trends
and outliers, which further enable the end users to spot the issues or pick up the
points that need attention. Generally, outliers are 1-5% of the data, but when
dealing with big data and handling a massive volume and velocity of data, even
5% of the total data is huge and may cause plotting issues.
The following figure depicts the overall application flow and some popular
visualizations, including the Twitter heat map:
The figure depicts the flow of information from event producers to collection agents,
followed by the brokers and processing engine (transformation, aggregation, and so
on) and then long term storage. From the storage unit, the visualization tools reap the
insights and present them in the form of graphs, alerts, charts, Excel sheets, dashboards,
or maps, to the business owners, who can assimilate the information and take some
action based upon it.
NRT – high-level system view
The previous section of this chapter is dedicated to providing you with an
understanding of the basic building blocks of an NRT application and its logical
overview. The next step is to understand the functional and systems view of the NRT
architectural framework. The following figure clearly outlines the various
architectural blocks and cross cutting concerns:
So, if I were to describe the system as a horizontal scale from left to right, the process starts with data ingestion and transformation in near real-time using low-latency components. The transformed data is passed on to the next logical unit that actually performs highly optimized and parallel operations on the data; this unit is actually the near real-time processing engine. Once the data has been aggregated and correlated and actionable insights have been derived, it is passed on to the presentation layer, which, along with real-time dashboarding and visualization, may have a persistence component that retains the data for long-term deep analytics.
The cross cutting concerns that exist across all the components of the NRT
framework as depicted in the previous figure are:
Security
System management
Data integrity and management
Next, we are going to get you acquainted with four basic streaming patterns, so you
are acquainted with the common flavors that streaming use cases pose and their
optimal solutions (in later sections):
Stream ingestion: Here, all we are expected to do is to persist the events to the
stable storage, such as HDFS, HBase, Solr, and so on. So all we need are low-
latency stream collection, transformation, and persistence components.
Near real-time (NRT) processing: This application design allows for an
external context and addresses complex use cases such as anomaly or fraud
detection. It requires filtering, alerting, de-duplication, and transformation of
events based on specific sophisticated business logic. All these operations are
required to be performed at extremely low latency.
NRT event partitioned processing: This is very close to NRT processing, but with a variation that helps it derive benefits from partitioning the data; to quote an instance, it is like storing more relevant external information in memory. This pattern also operates at extremely low latencies.
NRT and complex models/machine learning: This one mostly requires us to execute very complex models/operations over a sliding window of time over the set of events in the stream. These are highly complex operations, requiring micro-batching of data, and they operate at very low latencies.
NRT – technology view
In this section, we introduce you to various technological choices for NRT
components and their pros and cons in certain situations. As the book progresses, we
will revisit this section in more detail to help you understand why certain tools and
stacks are better suited to solving certain use cases.
Before moving on, it's very important to understand the key aspects against which all
the tools and technologies are generally evaluated. The aspects mentioned here are
generic to software; we move on to the specifics of NRT tools later:
Now that we have rules of thumb set, let's look at the technology view of the NRT
application framework to understand what all technological choices it presents to us
—please refer to the following diagram:
This figure captures various key technology players which are contenders as parts of
the solution design for an NRT framework. Let's have a closer look at each segment
and its contenders.
Event producer
This is the source where the events are happening. These individual events or tuples, strung together in a continuous, never-ending flow, actually form the streams of data which are the input source for any NRT application system. These events could be any of the following or more, which are tapped into in real time:
FluentD: FluentD is an open source data collector which lets you unify data
collection and consumption for a better use and understanding of data. (Source:
http://www.fluentd.org/architecture). The salient features of FluentD are:
Reliability: This component comes with both memory and file-based
channel configurations which can be configured based on reliability needs
for the use case in consideration
Low infrastructure footprint: The component is written in Ruby and C and has a very low memory and CPU footprint
Pluggable architecture: The component's pluggable architecture leads to an ever-growing community contribution towards its growth
Uses JSON: It unifies the data into JSON as much as possible thus making
unification, transformation, and filtering easier
Logstash: Logstash is an open source, server-side data processing pipeline that
ingests data from a multitude of sources simultaneously, transforms it, and then
sends it to your favorite stash (ours is Elasticsearch, naturally). (Source: https://www.elastic.co/products/logstash). The salient features of Logstash are:
Variety: It supports a wide variety of input sources, varying from metrics
to application logs, real-time sensor data, social media, and so on, in
streams.
Filtering the incoming data: Logstash provides the ability to parse, filter,
and transform data using very low latency operations, on the fly. There
could be situations where we want the data arriving from a variety of sources to be filtered and parsed as per a predefined, common format before landing in the broker or stash. This makes the overall development approach decoupled and easy to work with, due to convergence to the common format. It has the ability to format and parse highly complex data, and the overall processing time is independent
of source, format, complexity, or schema.
It can route the transformed output to a variety of storage, processing, or downstream application systems such as Spark, Storm, HDFS, ES, and so on.
It's robust, scalable, and extensible: developers have the choice of using a wide variety of available plugins or writing their own custom plugins. The plugins can be developed using the Logstash plugin generator tool.
Monitoring API: It enables the developers to tap into the Logstash
clusters and monitor the overall health of the data pipeline.
Security: It provides the ability to encrypt data in motion to ensure that the
data is secure.
In terms of code footprint, Fluentd is roughly 3,000 lines of Ruby, Flume is roughly 50,000 lines of Java, and Logstash is roughly 5,000 lines of JRuby.
Cloud API for data collection: This is yet another method of data collection, where most cloud platforms offer a variety of data collection APIs, such as:
AWS Amazon Kinesis Firehose
Google Stackdriver Monitoring API
Data Collector API
IBM Bluemix Data Connect API
Broker
One of the fundamental architectural principles is the decoupling of various
components. Broker is precisely the component in NRT architecture that not only
decouples the data collection component and processing unit but also provides
elasticity to hold data in a queue when there are sudden surges in traffic.
Amongst the vast variety of tools and technologies available under this segment, the
key ones we would like to touch on are:
Apache Kafka: Kafka is used for building real-time data pipelines and
streaming apps. It is horizontally scalable, fault-tolerant, wickedly fast, and runs
in production in thousands of companies. (Source: https://kafka.apache.org/). The
salient features of this broker component are:
It's highly scalable
It's fail safe; it provides for fault tolerance and high availability
It is open source and extensible
It's disk-based and can store vast amounts of data (this is a USP of Kafka
that enables it to virtually cater for any amount of data)
It allows replication and partitioning of data
A quick comparison of the two popular broker choices, RabbitMQ and Apache Kafka:

                      RabbitMQ     Apache Kafka
High Availability     Yes          Yes
Federated Queues      Yes          No
Complex Routing       Yes          No
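To give a feel for how a collection agent or any upstream component hands events to the broker, here is a minimal Java producer sketch using the Kafka client API; the broker address, topic name, key, and payload are assumptions for illustration:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                 // broker address (assumption)
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record lands in a partition of the topic and is retained on disk,
            // which is what lets Kafka buffer sudden surges for the processing engine
            producer.send(new ProducerRecord<>("sensor-events", "device-7", "{\"temp\": 21.5}"));
        }
    }
}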
The key technology options that are available here are Apache Storm, Apache Flink,
and Apache Spark. We will be delving into each in detail in the coming chapters, but
the following is a quick comparison of each of them.
The following figure depicts key aspects of all of these scalable, distributed, and
highly available frameworks for easy comparison and assimilation:
Storage
This is the stable storage to which intermediate or end results and alerts are written. It's a very crucial component in the NRT context because we need to store the
end results in a persistent store. Secondly, it serves as an integration point for further
downstream applications which draw data from these low latency stores and evolve
further insights or deep learning around them.
The following table clearly captures the various data stores and their alignment to the
time SLA of NRT applications:
Source: http://image.slidesharecdn.com/cassandrameetuppresentation-160519132626/95/5-ways-to-use-spark-to-enrich-your-cassandra-environment-4-638.jpg?cb=1463664473
I would like to add a note here that we are skipping the plethora of options available
for storage and visualization for now, but will touch upon these specifically in later
sections of the book.
Summary
In this chapter, we got you acquainted with and introduced to the various components of the NRT architectural framework and the technology choices for each. You gained an understanding of the challenges of real time, the key aspects to be considered, and the USPs of each technology available in the stack. The intent here was to get you
familiarized with the available choices in terms of tools and tech stack conceptually,
so that you can pick and choose what works best for your use-case solutions,
depending upon your functional and non-functional requirements.
Understanding and Tailing Data
Streams
This chapter does a deep dive on the most critical aspect of the real-time application,
which is about getting the streaming data from the source to the compute component.
We will discuss the expectations and choices which are available. We will also walk
the reader through which ones are more appropriate for certain use cases and
scenarios. We will give high-level setup and some basic use cases for each of them.
In this chapter, we will also introduce technologies related to data ingestion for the
use cases.
There are two different kinds of streaming data: bounded and unbounded streams, as
shown in the following images. Bounded streams have a defined start and a defined
end of the data stream. Data processing stops once the end of the stream is reached.
Generally, this is called batch processing. An unbounded stream does not have an
end and data processing starts from the beginning. This is called real-time
processing, which keeps the states of events in memory for processing. It is very
difficult to manage and implement use cases for unbounded data streams, but tools
are available which give you the chance to play with them, including Apache
Storm, Apache Flink, Amazon Kinesis, Samza, and so on.
We will discuss data processing in the following chapters. Here, you will read about
data ingestion tools which feed into data processing engines. Data ingestion can be from live running systems generating log files, or it can come directly from terminals or ports.
Setting up infrastructure for data
ingestion
There are multiple tools and frameworks available on the market for data ingestion.
We will discuss the following in the scope of this book:
Apache Kafka
Apache NiFi
Logstash
Fluentd
Apache Flume
Apache Kafka
Kafka is a message broker which can be connected to any real-time framework available on the market. In this book, we will use Kafka often, for all types of examples. We will use Kafka as a data source which keeps data from files in queues for further processing. Download Kafka from https://www.apache.org/dyn/closer.cgi?path=/kafka/0.10.1.1/kafka_2.11-0.10.1.1.tgz to your local machine. Once the kafka_2.11-0.10.1.1.tgz file is downloaded, extract the files using the following commands:
cp kafka_2.11-0.10.1.1.tgz /home/ubuntu/demo/kafka
cd /home/ubuntu/demo/kafka
tar -xvf kafka_2.11-0.10.1.1.tgz
The following files and folders are extracted as seen in the following screenshot:
Start ZooKeeper and then the Kafka server using the scripts in Kafka's bin directory; Kafka will then be running on your local machine. Topics will be created later on as per the need. Let's move on to the NiFi setup and example.
Apache NiFi
Apache NiFi is the tool to read from the source and distribute data across different
types of sinks. There are multiple types of source and sink connectors available.
Download NiFi Version 1.1.1 from https://archive.apache.org/dist/nifi/1.1.1/nifi-1.1.1-bin.tar.gz to your local machine. Once the nifi-1.1.1-bin.tar.gz file is downloaded, extract
the files:
cp nifi-1.1.1-bin.tar.gz /home/ubuntu/demo
cd /home/ubuntu/demo
tar -xvf nifi-1.1.1-bin.tar.gz
The following files and folder are extracted, as shown in the following screenshot:
When NiFi is started, you can access the NiFi UI by accessing the following URL:
http://localhost:8080/nifi. The following screenshot shows the UI interface for NiFi:
Now, let's create a flow in NiFi which will read the file and push each line as an event to the Kafka topic named nifi-example.
You should have the right entry of the IP address of your system in
/etc/hosts; otherwise, you will face problems while creating topics in
Kafka.
Now, go to the NiFi UI. Select Processor and drag it into the window. It will show all
available processors in NiFi. Search for GetFile and select it. It will display in your
workspace area as in the following screenshot:
To configure the processor, right-click on the GetFile processor and select Configure, as shown in the following screenshot:
It will give you the flexibility to change all possible configurations related to the
processor type. As per the scope of this chapter, let's go to properties directly.
To read the file, we used the GetFile processor; now, to push each line to the Kafka topic, we use the PutKafka processor. Again, click on Processor and drag it into the workspace area.

After the mouse drop, it will ask for the type of processor. Search for the PutKafka processor and select it, as shown in the following screenshot:
Now, right-click on PutKafka and select Configure. Set the configurations as shown in the following screenshot:
Some of the important configurations are Known Brokers, Topic Name, Partition,
and Client Name.
You can specify the broker hostname along with the port number in Known Brokers. Multiple brokers are separated by commas. Specify the topic name which is created on the Kafka broker. Partition is used when a topic is partitioned. Client Name should be any relevant name for the client to make a connection with Kafka.
Now, make a connection between the GetFile processor and the PutKafka processor. Drag
the arrow from the GetFile processor and drop it on the PutKafka processor. This will create a
connection between them.
Create a test file in /home/ubuntu/demo/files and add some words or statements to it, as follows:
hello
this
is
nifi
kafka
integration
example
Before running the NiFi pipeline, start a consumer process from the console to read from the
Kafka topic nifi-example:
/bin/kafka-console-consumer.sh --topic nifi-example --bootstrap-server localhost:9092 --from-beginning
Let's start the NiFi pipeline, which reads from the test file and puts it into Kafka. Go
to the NiFi workspace, select all (Shift + A) and press the Play button in the
Operate window.
Logstash
After downloading and extracting Logstash, the following folders and files will be present, as seen in the following screenshot:
We will repeat the same example explained for NiFi, that is, reading from a file and
pushing each event to a Kafka topic. Let's create a topic on Kafka for Logstash:
./kafka-topics.sh --create --topic logstash-example --zookeeper localhost:2181 --partitions 1 --replication-factor 1
To run the example in Logstash, we have to create a configuration file, which will
define the input, filter, and output. Here, we will not apply any filters, so the
configuration file will contain two components: input as reading from file and output
as writing into Kafka topic. The following is the configuration file required to
execute the example:
input {
file {
path => "/home/ubuntu/demo/files/test"
start_position => "beginning"
}
}
output {
stdout { codec => plain }
kafka {
codec => plain {format => "%{message}"}
topic_id => "logstash-example"
client_id => "logstash-client"
bootstrap_servers => "localhost:9092"
}
}
Create a file named logstash-kafka-example.conf and paste the previous
configuration into it. Create an input file named test in /home/ubuntu/demo/files and
add the following content:
hello
this
is
logstash
kafka
integration
Before running the Logstash pipeline, start a consumer process from the console to read from the
Kafka topic logstash-example, using the following command:
/bin/kafka-console-consumer.sh --topic logstash-example --bootstrap-server localhost:9092 --from-beginning
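Then start Logstash with the configuration file created above; a sketch of the command, assuming it is run from the Logstash installation directory:
bin/logstash -f logstash-kafka-example.conf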
Fluentd
To validate the Fluentd (td-agent) installation and its configuration, run td-agent in dry-run mode. The following output will be generated and certifies that everything is good:
2017-02-25 16:19:49 +0530 [info]: reading config file path="/etc/td-agent/td-agent.conf"
2017-02-25 16:19:49 +0530 [info]: starting fluentd-0.12.31 as dry run mode
2017-02-25 16:19:49 +0530 [info]: gem 'fluent-mixin-config-placeholders' version '0.4.0'
2017-02-25 16:19:49 +0530 [info]: gem 'fluent-mixin-plaintextformatter' version '0.2.6'
2017-02-25 16:19:49 +0530 [info]: gem 'fluent-plugin-kafka' version '0.5.3'
2017-02-25 16:19:49 +0530 [info]: gem 'fluent-plugin-kafka' version '0.4.1'
2017-02-25 16:19:49 +0530 [info]: gem 'fluent-plugin-mongo' version '0.7.16'
2017-02-25 16:19:49 +0530 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '1.5.5'
2017-02-25 16:19:49 +0530 [info]: gem 'fluent-plugin-s3' version '0.8.0'
2017-02-25 16:19:49 +0530 [info]: gem 'fluent-plugin-scribe' version '0.10.14'
2017-02-25 16:19:49 +0530 [info]: gem 'fluent-plugin-td' version '0.10.29'
2017-02-25 16:19:49 +0530 [info]: gem 'fluent-plugin-td-monitoring' version '0.2.2'
2017-02-25 16:19:49 +0530 [info]: gem 'fluent-plugin-webhdfs' version '0.4.2'
2017-02-25 16:19:49 +0530 [info]: gem 'fluentd' version '0.12.31'
2017-02-25 16:19:49 +0530 [info]: adding match pattern="td.*.*" type="tdlog"
2017-02-25 16:19:49 +0530 [info]: adding match pattern="debug.**" type="stdout"
2017-02-25 16:19:49 +0530 [info]: adding source type="forward"
2017-02-25 16:19:49 +0530 [info]: adding source type="http"
2017-02-25 16:19:49 +0530 [info]: adding source type="debug_agent"
2017-02-25 16:19:49 +0530 [info]: using configuration file: <ROOT>
<match td.*.*>
@type tdlog
apikey xxxxxx
auto_create_table
buffer_type file
buffer_path /var/log/td-agent/buffer/td
buffer_chunk_limit 33554432
<secondary>
@type file
path /var/log/td-agent/failed_records
buffer_path /var/log/td-agent/failed_records.*
</secondary>
</match>
<match debug.**>
@type stdout
</match>
<source>
@type forward
</source>
<source>
@type http
port 8888
</source>
<source>
@type debug_agent
bind 127.0.0.1
port 24230
</source>
</ROOT>
In the configuration file, two of the six available directives are used. The source
directive is where all data comes from. @type tells us which type of input plugin is
being used. Here, we are using tail, which will tail the log file. This is good for the
use case where the input is a running log file to which events/logs are continuously
appended at the end. It is the same as the tail -f operation in Linux. The tail input
plugin has multiple parameters. path is the absolute path of the log file. pos_file is
the file that keeps track of the last read position of the input file. tag is the tag of the
event. If you want to define the input format, such as CSV, or apply a regex, then use
format. As this parameter is mandatory, we used none, which uses the input text as is.
The match directive tells Fluentd what to do with the input. The ** pattern tells Fluentd
that whatever comes in through the log files should be pushed to the Kafka topic.
If you want to use different topics for error and information logs, then define the
pattern as error or info and tag the input accordingly. brokers is the host and port
where the Kafka broker is running. default_topic is the topic name to which
you want to push the events. If you want to retry after a message failure, then set
max_send_retries to one or more.
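Putting these parameters together, a minimal td-agent configuration for this example might look like the following sketch. The plugin type (kafka_buffered from fluent-plugin-kafka), the topic name fluentd-example, the tag, and the file paths are assumptions for illustration:
<source>
  @type tail
  path /home/ubuntu/demo/files/test
  pos_file /var/log/td-agent/test.log.pos
  tag fluentd.example
  format none
</source>
<match fluentd.**>
  @type kafka_buffered
  brokers localhost:9092
  default_topic fluentd-example
  max_send_retries 1
</match>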
Once the process has started without any exceptions, start adding statements to
/home/ubuntu/demo/files/test, as shown in the following screenshot:
Apache Flume
Download apache-flume-1.7.0-bin.tar.gz to your local machine, then copy and extract it:
cp apache-flume-1.7.0-bin.tar.gz ~/demo/
tar -xvf ~/demo/apache-flume-1.7.0-bin.tar.gz
The extracted folders and files will be as per the following screenshot:
We will demonstrate the same example that we executed for the previous tools,
that is, reading from a file and pushing the data to a Kafka topic. First, let's create the
Flume configuration file:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /home/ubuntu/demo/flume/tail_dir.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /home/ubuntu/demo/files/test
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = flume-example
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 6
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Flume has three components that define a flow. The first is sources, from which the
logs or events come. There are multiple sources available in Flume to define the
flow; a few are kafka, TAILDIR, and HTTP, and you can also define your own custom
source. The second component is sinks, which are the destinations where events
will be consumed. The third is channels, which define the medium between source
and sink. The most commonly used channels are Memory, File, and Kafka, but there are
many more. Here, we use TAILDIR as the source, Kafka as the sink, and Memory as the
channel. In the preceding configuration, a1 is the agent name, r1 is the source, k1 is
the sink, and c1 is the channel.
Let's start with the source configuration. First of all, you have to define the type of
source using <agent-name>.<sources/sinks/channels>.<alias name>.type. The next
parameter is positionFile, which is required to keep track of the tailed file. filegroups
indicates a set of files to be tailed, and filegroups.<filegroup-name> is the absolute path of
the file or directory. The sink configuration is simple and straightforward: Kafka requires
the bootstrap servers and the topic name. The channel configuration is long, but here we used
only the most important parameters: capacity is the maximum number of events stored in
the channel, and transactionCapacity is the maximum number of events the channel
will take from a source or give to a sink per transaction.
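With the configuration saved to a file (flume-kafka.conf is an assumed name), the agent can be started with the standard flume-ng command, passing the agent name a1 used in the configuration:
cd ~/demo/apache-flume-1.7.0-bin
bin/flume-ng agent --conf conf --conf-file flume-kafka.conf --name a1 -Dflume.root.logger=INFO,console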
Performance covers I/O, CPU, and RAM usage and their impact. By definition,
scalability is the capability of a system, network, or process to handle a growing
amount of work, or its potential to be enlarged in order to accommodate that growth.
So, we will identify whether these tools are scalable enough to handle increased loads.
Scalability can be achieved horizontally and vertically. Horizontally means adding
more computing machines and distributing the work, while vertically means
increasing the capacity of a single machine in terms of CPU, RAM, or IOPS.
Let's start with NiFi. It is a guaranteed-delivery processing engine (exactly once) by
default, which maintains write-ahead logs and a content repository to achieve this.
Performance depends on the reliability we choose. With NiFi's guaranteed message
delivery, all messages are written to disk and then read from there. It will be slower,
but that is the price you pay in terms of performance if you don't want to lose even a
single message. We can create a cluster of NiFi nodes controlled by the NiFi cluster
manager. Internally, it is managed by ZooKeeper to keep all of the nodes in sync. The
model is master and slave, but if the master dies, all nodes continue to operate; the
restriction is that no new nodes can join the cluster and you can't change the NiFi
flow. So, NiFi is scalable enough to handle a cluster.
Fluentd provides At most once and At least once delivery semantics. Reliability and
performance is achieved by using the Buffer plugin. Memory Buffer structure
contains a queue of chunks. When the top chunk exceeds the specified size or time
limit, a new empty chunk is pushed to the top of the queue. The bottom chunk is
written out immediately when a new chunk is pushed. File Buffer provides a
persistent buffer implementation. It uses files to store buffer chunks on a disk. As per
its documentation, Fluentd is a well-scalable product: the M*N coupling problem is
resolved into M+N, where M is the number of input plugins and N is the number of
output plugins. By configuring multiple log forwarders and log aggregators, we can
achieve scalability.
Flume gets its reliability from channels and the coordination between sources, channels,
and sinks. The sink removes an event from the channel only after the event is stored in the
channel of the next agent or in the terminal repository. This is single-hop
message delivery semantics. Flume uses a transactional approach to guarantee the
reliable delivery of events. To read data over a network, Flume integrates with
Avro and Thrift.
Comparing and choosing what works
best for your use case
The following table shows comparisons between Logstash, Fluentd, Apache Flume,
Apache NiFi, and Apache Kafka:
Plugins:
Logstash: 90+ plugins.
Fluentd: 125+ plugins.
Apache Flume: 50+ plugins and custom components.
Apache NiFi: An ample number of processors are available; you can also write your own easily.
Apache Kafka: No plugins; you can write your own code.
Reliability:
Logstash: At least once, using Filebeat.
Fluentd: At most once or at least once.
Apache Flume: At least once, using transactions.
Apache NiFi: Exactly once by default.
Apache Kafka: Supports at least once, at most once, and exactly once, based on configuration.
Do it yourself
In this section, we provide a problem for readers so that they can create
their own application after reading the previous content.
Here, we will extend the example given previously regarding the setup and
configuration of NiFi. The problem statement is to read from a real-time log file and
put the data into Cassandra. The pseudo code is as follows:
You have to install Cassandra and configure it so that NiFi will be able to connect to it.
Logstash is made to process logs and pass them to other tools for storage or
visualization. The best fit here is Elasticsearch, Logstash, and Kibana (ELK). Within
the scope of this chapter, we will build the integration between Elasticsearch and
Logstash and, in the next chapters, we will integrate Elasticsearch with Kibana for the
complete workflow. So all you need to do to build ELK is:
Create a program to read from PubNub for real-time sensor data. The same
program will publish events to the Kafka topic
Install Elasticsearch on the local machine and start it
Now, write a Logstash configuration which reads from the Kafka topic, parses and
formats the events, and pushes them into the Elasticsearch engine
Setting up Elasticsearch
Execute the following steps to set up Elasticsearch:
cd elasticsearch-5.2.2/
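Assuming the Elasticsearch 5.2.2 archive has already been downloaded and extracted into this directory, a minimal sketch of the remaining steps is to start the server and verify that it responds:
./bin/elasticsearch
curl http://localhost:9200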
Now that we are aware of the different types of data streaming tools, in the next chapter
we will focus on setting up Storm. Storm is an open source, distributed, resilient,
real-time processing engine. The setup includes downloading, installing,
configuring, and running an example to test whether the setup is working.
Setting up the Infrastructure for Storm
This chapter will guide users through setting up and configuring Storm in single and
distributed mode. It will also help them to write and execute their first real-time
processing job on Storm.
Overview of Storm
Storm architecture and its components
Setting up and configuring Storm
Real-time processing job on Storm
Overview of Storm
Storm is an open source, distributed, resilient, real-time processing engine. It was
started by Nathan Marz in late 2010. He was working at BackType. On his blog, he
mentioned the challenges he faced while building Storm. It is a must read: http://natha
nmarz.com/blog/history-of-apache-storm-and-lessons-learned.html.
Here is the crux of the whole blog: initially, real-time processing was implemented
like pushing messages into a queue and then reading the messages from it using
Python or any other language and processing them one by one. The challenges with
this approach are:
In case of failure of the processing of any message, it has to be put back into the
queue for reprocessing
Keeping queues and the worker (processing unit) up and running all the time
What follow are two key ideas from Nathan that make Storm a highly reliable,
real-time engine:
In May 2011, BackType was acquired by Twitter. After becoming popular in public
forums, Storm started to be called the "real-time Hadoop". In September 2011, Nathan
officially released Storm. In September 2013, he officially proposed Storm to the
Apache Incubator, and in September 2014, Storm became a top-level Apache project.
Storm architecture and its components
Let's discuss Storm architecture and how it works. The following figure depicts the
Storm cluster:
The Nimbus node acts as the master node in a Storm cluster. It is responsible
for analyzing the topology and distributing tasks to the different supervisors as per their
availability. It also monitors failures; if one of the supervisors dies, it
redistributes the tasks among the available supervisors. The Nimbus node uses
ZooKeeper to keep track of tasks and to maintain its state. If the Nimbus node fails,
it can be restarted so that it reads the state from ZooKeeper and starts from the same
point where it failed earlier.
Supervisors act as slave nodes in the Storm cluster. One or more workers, that
is, JVM processes, can run in each supervisor node. A supervisor coordinates
with workers to complete the tasks assigned by Nimbus node. In the case of
worker process failure, the supervisor finds available workers to complete the
tasks.
A worker process is a JVM running in a supervisor node. It has executors.
There can be one or more executors in a worker process. A worker coordinates
with an executor to finish up the task.
An executor is a single thread process spawned by a worker. Each executor is
responsible for running one or more tasks.
A task is a single unit of work. It performs actual processing on data. It can be
either a spout or a bolt.
Apart from the previous processes, there are two more important parts of a Storm cluster:
logging and the Storm UI. The logviewer service is used to debug the worker and
supervisor logs in the Storm UI.
Characteristics
The following are important characteristics of Storm:
Tuple: This is the basic data structure of Storm. It can hold multiple values and
the data types of each value can be different. Storm serializes the primitive
types of values by default but if you have any custom class then you must
provide serializer and register it in Storm. A tuple provides very useful methods
such as getInteger, getString and getLong so that the user does not need to cast the
value in a tuple.
Topology: As mentioned earlier, a topology is the highest level of abstraction. It
contains the flow of processing, including spouts and bolts. It is a kind of graph
computation: each flow is represented in the form of a graph, where the nodes are
spouts or bolts and the edges are stream groupings that connect them. The
following figure shows a simple example of a topology:
Ack: This method is called when a tuple is successfully processed in the topology. The
user should mark the tuple as processed or completed.
Fail: This method is called when a tuple is not processed successfully. The user
must implement this method in such a way that the tuple is sent for
processing again in nextTuple.
nextTuple: This method is called to get the tuple from the input source. The logic
to read from the input source should be written in this method, and the data
emitted as a tuple for further processing.
Open: This method is called only once, when the spout is initialized. Making the
connection with the input source or the output sink, or configuring the memory
cache, here ensures that it is not repeated in the nextTuple method.
IRichSpout is the interface available in Storm to implement a custom spout. All of the
previous methods need to be implemented.
IRichBolt and IBasicBolt are available in Storm to implement the processing units of
Storm. The difference between the two is that IBasicBolt automatically acks each tuple and
provides basic filtering and simple functions.
Stream grouping
The following are different types of grouping available with Storm:
Shuffle grouping: Shuffle grouping distributes tuples equally across the tasks.
An equal number of tuples is received by each task.
Field grouping: In this grouping, tuples are sent to the same bolt based on one
or more fields; for example, in Twitter, if we want to send all tweets from the
same user to the same bolt, then we can use this grouping.
All grouping: All tuples are sent to all bolts. Filtering is one operation where
we need all grouping.
Global grouping: All tuples are sent to a single bolt. Reduce is one operation where
we need global grouping.
Direct grouping: The producer of the tuple decides which of the consumer's
task will receive the tuple. This is possible for only streams that are declared as
direct streams.
Local or shuffle grouping: If the source and target bolt are running in the same
worker process then it is local grouping, as no network hops are required to
send the data across the network. If this is not the case, then it is the same as
shuffle grouping.
Custom grouping: You can define your own custom grouping.
Setting up and configuring Storm
Before setting up Storm, we need to set up Zookeeper, which is required by Storm:
Setting up Zookeeper
What follows are instructions on how to install, configure and run Zookeeper in
standalone and cluster mode.
Installing
Download Zookeeper from http://www-eu.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeep
er-3.4.6.tar.gz. After the download, extract zookeeper-3.4.6.tar.gz as follows:
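A sketch of the standard steps: extract the archive, copy the sample configuration to zoo.cfg (the defaults are fine for standalone mode), and start the server:
tar -xvf zookeeper-3.4.6.tar.gz
cd zookeeper-3.4.6
cp conf/zoo_sample.cfg conf/zoo.cfg
bin/zkServer.sh start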
The console will return after the following message, and the process will run in
the background:
Starting zookeeper ... STARTED
The following command can be used to check the status of the Zookeeper process:
/bin/zkServer.sh status
The following are the files and folders that will be extracted:
Configuring
As shown in previous screenshot, go to the conf folder and add/edit
the following properties in storm.yaml:
Set the Nimbus node host name so that the Storm supervisor can communicate
with it:
nimbus.host: "nimbus"
Set the Storm local data directory to keep small information such as conf, JARs,
and so on:
storm.local.dir: "/usr/local/storm/tmp"
Set the number of workers that will run on the current supervisor node. It is best
practice to use the same number of workers as the number of cores in the machine:
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
- 6704
- 6705
logviewer: The logviewer service helps you see the worker logs in the Storm UI.
Execute the following command to start it:
/bin/storm logviewer
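Similarly, the other Storm daemons are started with the same storm script; run each of the following in its own terminal (or as a background service) on the appropriate node, assuming ZooKeeper is already running:
/bin/storm nimbus
/bin/storm supervisor
/bin/storm ui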
Real-time processing job on Storm
After discussing setup and configuration, let's look at an example of a real-time
processing job. Here, we will discuss a very basic example of Storm, that is, word
count. To implement word count in Storm, we need one spout that should emit
sentences at regular intervals, one bolt to split the sentence into words based on
space, one bolt that collects all the words and finds the count, and finally, we need
one bolt to display the output on the console.
private String[] sentences = { "This is example of chapter 4", "This is word count
private int index = 0;
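The snippet above shows the sentence array and index fields of the spout. A minimal sketch of the complete spout might look as follows; the class name SentenceSpout and the one-second emit interval are assumptions, and the imports assume a Storm 1.x package layout (org.apache.storm):
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class SentenceSpout extends BaseRichSpout {
    private static final long serialVersionUID = 1L;
    private SpoutOutputCollector collector;
    private String[] sentences = { "This is example of chapter 4" };
    private int index = 0;

    public void open(@SuppressWarnings("rawtypes") Map conf, TopologyContext context,
            SpoutOutputCollector collector) {
        // Called once when the spout is initialized; keep the collector for nextTuple
        this.collector = collector;
    }

    public void nextTuple() {
        // Emit one sentence at a time, cycling through the array at regular intervals
        this.collector.emit(new Values(sentences[index]));
        index = (index + 1) % sentences.length;
        Utils.sleep(1000);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // The field ID "sentence" is the one read by the splitter bolt
        declarer.declare(new Fields("sentence"));
    }
}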
Splitter bolt: First, extend the BaseRichBolt class and implement the required
methods. In the execute method, we read the tuple field sentence, which we
defined in the spout. Then, we split each sentence based on spaces and emit each
word as a tuple to the next bolt. declareOutputFields is the method used to define the
ID of the stream for each bolt:
public class SplitSentenceBolt extends BaseRichBolt {
private static final long serialVersionUID = 1L;
private OutputCollector collector;
Word count bolt: In this bolt, we keep a map with the word as the key and its count as
the value. The execute method is called once for each tuple, and in it we update the
count for the word. In the declareOutputFields method, we declare a tuple of two
values to be sent to the next bolt. In Storm, you can send more than one value to the
next bolt/spout, as shown in the following example:
public void execute(Tuple tuple) {
String word = tuple.getStringByField("word");
Long count = this.counts.get(word);
if (count == null) {
count = 0L;
}
count++;
this.counts.put(word, count);
this.collector.emit(new Values(word, count));
}
Display bolt: This bolt is the last bolt in the topology, so there is nothing to define
in the declareOutputFields method, and nothing is emitted in the execute
method. Here, we collect all the tuples and put them into a map. In
the cleanup method, which is called when the topology is killed, we display the values
present in the map:
public class DisplayBolt extends BaseRichBolt {
private static final long serialVersionUID = 1L;
private HashMap<String, Long> counts = null;
Creating topology and submitting: After defining all the spouts and bolts, let's
bind them into one program, that is, the topology. Here, two things are very
important: the sequence of bolts along with their IDs, and the grouping of streams, as shown in the wiring sketch below, followed by the submission code.
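The following fragment is a minimal wiring sketch that precedes the submission code shown next; the component IDs and the SentenceSpout and WordCountBolt class names are assumptions, while SplitSentenceBolt and DisplayBolt come from the snippets above:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentence-spout", new SentenceSpout());
builder.setBolt("splitter-bolt", new SplitSentenceBolt()).shuffleGrouping("sentence-spout");
builder.setBolt("count-bolt", new WordCountBolt()).fieldsGrouping("splitter-bolt", new Fields("word"));
builder.setBolt("display-bolt", new DisplayBolt()).globalGrouping("count-bolt");
Config config = new Config();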
if (mode.equals("cluster")) {
StormSubmitter.submitTopology("word-count-topology", config, builder.createTopology())
} else {
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count-topology", config,
builder.createTopology()); # line 7
Thread.sleep(20000);
cluster.killTopology("word-count-topology");
cluster.shutdown();
}
Running job
To run the previous example, there are two ways: local mode and cluster mode.
Local
Local mode means running your topology on a local cluster. You can run it in
Eclipse without needing to set up and configure Storm. To run it on the local cluster,
right-click on BasicStormWordCountExample and select Run As | Java Application.
Logs will start printing on the console. Before the topology shuts down after 20 seconds,
the final output will be displayed on the console, as shown in the following figure:
Cluster
To run the example in cluster mode, execute the following steps:
1. Go to the project directory where pom.xml is placed and build the project using the
following command:
mvn clean install
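2. Submit the topology JAR to the cluster using the storm jar command. The JAR name and package below are placeholders for your build output; the cluster argument is an assumption that matches the mode check in the topology code:
/bin/storm jar target/<your-artifact>.jar <your-package>.BasicStormWordCountExample cluster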
3. You can check the status of the topology on the UI, as shown in the following
screenshot:
4. When you click on the topology, you will find details of the spouts and bolts as
shown in the following screenshot:
5. Kill the topology from the UI, as shown in the previous screenshot.
6. Check the worker logs to get the final output. Go to STORM_HOME/logs/workers-
artifacts/word-count-topology-1-<Topology ID>/<worker port>/worker.log.
Summary
In this chapter, we acquainted the reader with the basics of Storm. We started with
the history of Storm, where we discussed how Nathan Marz got the idea for Storm
and what types of challenges he faced while releasing Storm as open source software
and then in Apache. We discussed the architecture of Storm and its components.
Nimbus, supervisors, workers, executors, and tasks are all part of Storm's architecture.
Its components are tuples, streams, topologies, spouts, and bolts. We discussed how to
set up Storm and configure it to run in the cluster. Zookeeper is required to be set up
first, as Storm requires it.
At the end of the chapter, we discussed a word count example implemented in Storm
using a spout and multiple bolts. We showed how to run an example locally, as well
as on the cluster.
Configuring Apache Spark and Flink
This chapter helps readers with the basic setup of the various computation components
that will be required throughout the book. We will do the setup and run a basic set of
examples to validate it. Apache Spark, Apache Flink, and Apache Beam
are the computation engines we will discuss in this chapter. There are more
computational engines available on the market.
Building from source
You will require Maven 3.3.6 and Java 7+ to compile Spark 2.1.0. Also, you need to
update MAVEN_OPTS, as the default settings will not be able to compile the code:
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
Use the following command to trigger the build. It will compile Spark 2.1.0 with
Hadoop Version 2.4.0:
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Downloading Spark
Download the latest version (2.1.0) using the same link (http://spark.apache.org/downloads.html)
as given in the Building from source section. Select the Spark release version if
you want to install anything other than the latest version, and select the pre-built
package for your version of Hadoop. You can also download the source from the same page and build
it. Spark 2.1.0 requires:
Java 7+
Python 3.4+
Scala 2.11
mkdir demo
mv /home/ubuntu/downloads/spark-2.1.0-bin-hadoop2.7.tar ~/demo
cd demo
tar -xvf spark-2.1.0-bin-hadoop2.7.tar
The following list of files and folders will be extracted, as seen in the following
screenshot:
Other ways to download or install Spark are to use a virtual image of Spark provided
by Cloudera Distribution Hadoop (CDH), Hortonworks Data Platform (HDP),
or MapR. If you want commercial support for Spark, then use Databricks, which runs
in the cloud.
Running an example
The Spark package comes with examples. So, to test whether all required
dependencies are working fine, execute the following commands:
cd spark-2.1.0-bin-hadoop2.7
./bin/run-example SparkPi 10
Copy the following configuration into the .bashrc file in the user's home directory:
export HADOOP_PREFIX=/home/impadmin/tools/hadoop-2.7.4
export HADOOP_HOME=/home/impadmin/tools/hadoop-2.7.4
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native Path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Let's create a word count example using Scala in Spark. Spark provides a shell where
we can write code and execute it interactively:
cd spark-2.1.0-bin-hadoop2.7
./bin/spark-shell
The following are the steps to create the word count example:
1. Define the file name. Take the file name in a variable named fileName:
val fileName = "/tmp/input.txt"
2. Get the file handle. Construct the file object using the Spark context sc, which holds
the file content:
val file = sc.textFile(fileName)
3. Perform the word count. Read each line and split it using space (" ") as the delimiter:
val wordCount = file.flatMap(line => line.split(" "))
4. The map transformation creates a key-value pair of each word and a count of 1. Before
going to the next step, the key-value pairs will contain each word with a count of
1:
.map(word => (word, 1))
5. Similar keys are added up to give the final count of each word:
.reduceByKey(_ + _)
6. Collect the output. The collect action gathers the key-value pairs of words and their
counts from all the workers:
wordCount.collect()
7. Print each key-value pair on the console:
.foreach(println)
Maven 3.0.3 and Java 8 are required to build Flink. Use the following command to
build Flink using Maven:
mvn clean install -DskipTests
If you want to build Flink with a different version of Hadoop, then use:
mvn clean install -DskipTests -Dhadoop.version=2.6.1
Download Flink
Download the latest version of Flink (1.1.4) from
http://Flink.apache.org/downloads.html, as shown in the following screenshot:
A list of files and folders will be extracted as shown in the following screenshot:
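To start a local Flink instance, run the start script from the extracted directory (a sketch assuming the extracted directory is named flink-1.1.4):
cd ~/demo/flink-1.1.4
./bin/start-local.sh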
Apache Flink provides a dashboard that shows all the jobs that are running or
completed, and lets you submit a new job. The dashboard is accessible through
http://localhost:8081. It also shows that everything is up and running.
Running example
Flink provides a streaming API called the Flink DataStream API to process continuous,
unbounded streams of data in real time.
To start using the DataStream API, you should add the following dependency to the
project. Here, we are using sbt for build management:
"org.apache.flink" %% "flink-scala" % "1.0.0"
In the next few steps, we will create a word count program which reads from a
socket and displays the word count in realtime.
1. Get the Streaming environment: First of all we have to create the streaming
environment on which the program runs. We will discuss deployment modes
later in this chapter:
val environment = StreamExecutionEnvironment.getExecutionEnvironment
The keyBy function is the same as the groupBy function and the sum function
is the same as the reduce function. The 0 and 1 in keyBy and sum respectively
indicate the index of the column in the tuple.
You can build the above program using sbt and create a .jar file. The word count
program also comes pre-built with the Flink installation package. You can find the JAR
file at:
~/demo/flink-1.1.4/examples/streaming/SocketWindowWordCount.jar
The following command will submit the job to Flink (replace your JAR file name):
~/demo/flink-1.1.4/bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000
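The job reads text from a socket on port 9000, so a text source should be listening on that port before the job is submitted; netcat is one option (an assumption, any socket text source works):
nc -l 9000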
The Flink dashboard starts showing the running job and all its relevant details as
shown in the following screenshot:
The following screenshot shows the job details with each subtask:
Now, enter the words or statement on the console as shown in the following
screenshot:
The output is as shown in the following screenshot:
When multiple streaming options are available on the market, which technology to
pick depends on factors such as performance, features provided, reliability, and how
well it fits your use case. The following table shows a small comparison between
Storm, Spark, and Flink:
State:
Storm: Storm 1.0 introduces stateful processing, Redis backed. State is maintained at a bolt instance level; not a distributed store.
Spark: Yes - Update State By Key.
Flink: Partitioned state store (backed by a filesystem or RocksDB).
Window:
Storm: Sliding and tumbling windows based on time duration and/or event count.
Spark: Sliding window.
Flink: Sliding window, tumbling window, custom window, event-count-based window.
Resource manager:
Storm: Nimbus.
Spark: Standalone, YARN, Mesos.
Flink: Standalone, YARN.
SQL compatibility:
Storm: No.
Spark: Yes.
Flink: Yes - Table API and SQL; SQL is not matured enough.
Language:
Storm: Java, JavaScript, Python, Ruby.
Spark: Java, Scala, Python.
Flink: Java, Scala.
Setting up and a quick execution of
Apache Beam
What is Apache Beam? According to the definition from beam.apache.org, Apache
Beam is a unified programming model, allowing us to implement batch and
streaming data processing jobs that can run on any execution engine.
UNIFIED: Use a single programming model for both batch and streaming use
cases.
PORTABLE: The runtime environment is decoupled from the code. Execute
pipelines in multiple execution environments, including Apache Apex, Apache
Flink, Apache Spark, and Google Cloud Dataflow.
EXTENSIBLE: Write and share new SDKs, IO connectors, and transformation
libraries. You can create your own runner to support a new runtime.
Beam model
Any transformation or aggregation performed in Beam is called a PTransform, and the
data that flows between these transforms is called a PCollection.
The following Maven command generates a Maven project that contains Apache
Beam's WordCount example:
$ mvn archetype:generate \
-DarchetypeRepository=https://repository.apache.org/content
/groups/snapshots \
-DarchetypeGroupId=org.apache.beam \
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
-DarchetypeVersion=LATEST \
-DgroupId=org.example \
-DartifactId=word-count-beam \
-Dversion="0.1" \
-Dpackage=org.apache.beam.examples \
-DinteractiveMode=false
This will create a folder named word-count-beam which contains the code:
$ cd word-count-beam/
$ ls
pom.xml src
$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java WindowedWordCount.java common
MinimalWordCount.java WordCount.java
To run the WordCount example, execute the command as per the runner:
Direct Runner: There is no need to specify the runner, as it is the default. The Direct
Runner runs on the local machine, and no specific setup is required:
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args=
Flink Runner:
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args=
Spark Runner:
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args=
DataFlow Runner:
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args=
After running the previous command, files with names starting with wordcounts are created
in the same folder. When we execute a command to check the entries in the files, the
output will be as shown in the given screenshot:
Creating Pipeline:
PipelineOptions options = PipelineOptionsFactory.create();
gs:// is Google Cloud Storage. You can specify a local file with its
complete path, but keep in mind that if you are running the code on a Spark
cluster or a Flink cluster, then it is possible that the file is not present
in the directory mentioned in the Read.from function.
Apply the ParDo transformation which calls DoFn. DoFn splits each element
in PCollection from TextIO and generates a new PCollection with each
individual word as an element:
.apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
for (String word : c.element().split("[^a-zA-Z']+")) {
if (!word.isEmpty()) {
c.output(word);
}
}
}
}))
To save the output to a file, apply the Write transformation, which takes a
PCollection and produces a PDone:
.apply(TextIO.Write.to("wordcounts"));
Running the Pipeline: Run the Pipeline using the following statement in code:
p.run().waitUntilFinish();
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))
.apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
for (String word : c.element().split("[^a-zA-Z']+")) {
if (!word.isEmpty()) {
c.output(word);
}
}
}
}))
.apply(Count.<String>perElement())
.apply("FormatResults", MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
@Override
public String apply(KV<String, Long> input) {
return input.getKey() + ": " + input.getValue();
}
}))
.apply(TextIO.Write.to("wordcounts"));
p.run().waitUntilFinish();
}
}
Balancing in Apache Beam
Apache Beam provides a way to keep a balance between completeness, latency, and
cost. Completeness here refers to whether all events are processed, latency is the time
taken to process an event, and cost is the computing power required to finish the job.
The following are the right questions to ask when building a pipeline in
Apache Beam that maintains the balance between these three parameters:
We saw the transformation example above, which computes the word count for a
batch of files. Now, we will continue with windows, watermarks, triggers, and
accumulators. We will discuss the same example given in the Apache Beam
examples that come with the package.
This example illustrates different trigger scenarios where results are generated
partially, including late-arriving data, by recalculating the results. The data is real-time
traffic data from San Diego. It contains readings from sensor stations set up along each
freeway. Each sensor reading includes a calculation of the total_flow across all lanes
in that freeway direction. The input is a text file and the output is
written to BigQuery.
Each row of the sample data below shows the freeway, the total_flow, the event timestamp, and the processing (arrival) timestamp:
5 30 10:01:00 10:01:03
5 30 10:02:00 11:07:00
5 20 10:04:10 10:05:15
5 60 10:05:00 11:03:00
5 20 10:05:01 11:07:30
5 60 10:15:00 10:27:15
5 40 10:26:40 10:26:43
5 60 10:27:20 10:27:25
5 60 10:29:00 11:11:00
The trigger emits the output when the system's watermark passes the end of the
window:
.triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
Data that arrives after the watermark has passed the event timestamp of the
arriving element is considered late data, and in this example we drop the event
if it arrives late:
.withAllowedLateness(Duration.ZERO)
Discard the elements when the window is finished. It will not be carried forward to
the next window:
.discardingFiredPanes())
Each pane produced by the default trigger with no allowed lateness will be the first
and last pane in the window and will be ON_TIME. At 11:03:00 (processing time), the
system watermark may have advanced to 10:54:00. As a result, when the data record
with event time 10:05:00 arrives at 11:03:00, it is considered late and dropped.
We can change the duration for allowed late arriving data in the above example as
follows:
.withAllowedLateness(Duration.standardDays(1)))
This leads to each window staying open for ONE_DAY after the watermark has passed
the end of the window. If we want to accumulate the value across the panes and also
want to emit the results irrespective of the watermark, we can implement the code in
the following example:
The trigger emits output in processing time, whenever an element is
received in the pane:
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
This accumulates the elements, so that each approximation includes all of the
previous data in addition to the newly arrived data:
accumulatingFiredPanes()
Since we don't have any triggers that depend on the watermark, we don't get an
ON_TIME firing. Instead, all panes are either EARLY or LATE.
The complete program is available in the Apache beam GitHub location: https://github.
com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/cookbook/T
riggerExample.java.
Summary
In this chapter, we got readers acquainted with the setup and
quick execution of Spark, Flink, and Beam. They can easily run the examples by running
the code bundled in the JAR, in standalone as well as cluster mode.
Storm is also a computation engine; we will discuss Storm in the next few chapters. In the
next chapter, we will discuss the integration of Storm with different data sources.
Integrating Storm with a Data Source
This is the chapter where we integrate a source of data with the Storm distributed
compute engine. It involves stringing together a source of streaming data with a
broker service like RabbitMQ, and then wiring the streaming pipeline to Storm. We
have a very interesting sensor data recipe here, which streams live data from a free
real-time sensor data channel, pushes it into RabbitMQ, and then on to a Storm
topology for business analysis.
A few terms that will be used very often in the context of RabbitMQ in particular, or
any other queuing system, are described as follows:
In the case of RabbitMQ, the producer/publisher never publishes any messages to the
queue, but it actually writes the messages to an exchange which in turn further
pushes the messages into the queue, based on the exchange type and routing key.
RabbitMQ exchanges
RabbitMQ is versatile and provides for a variety of exchanges which are at the
disposal of its developers to cater to a myriad of problems that come across for
implementation.
Direct exchanges
In this type of exchange, we have a routing key bound to the queue, which serves as
a pass key to direct messages to the queue. Every message published to
the exchange has a routing key associated with it, which decides the destination
queue that the exchange writes it to. For example, in the preceding figure, the message is
written to the green queue because the message routing key "green" binds to the
green queue:
Fanout exchanges
They can also be called broadcast exchanges, because when a message is published to
a fanout exchange, it is written/sent to all the queues bound to the exchange. The
preceding figure demonstrates how it works. Here, the message published by the
producer is sent to all three queues: green, red, and orange. So, in a nutshell, each
queue bound to the exchange receives a copy of the message. It is analogous to the
pub-sub broker pattern:
Topic exchanges
When a message is published to a topic exchange, it is sent to all the queues whose
binding key matches all, or a portion of, the routing key of the message published;
for example, if we publish a message to a topic exchange with the key green.red,
then the message is published to the green queue and the red queue. To
understand it better, here is the figurative explanation for the same:
The message routing key is "first.green.fast"; with a topic exchange it is published
to the "green" queue because the word green occurs in it, to the "red.fast"
queue because the word fast occurs in it, and to the "*.fast" queue because the
word "fast" occurs in it.
Headers exchanges
This exchange publishes messages to specific queues by matching the
message headers with the binding queue headers. These exchanges are very
similar to topic-based exchanges, but they differ in that they can
have more than one matching criterion, with complex ordered conditions:
The message published to the preceding exchange has a key ("key1") associated with
it which maps to the value ("value1"); the exchange matches it against all the
bindings in the queue headers, and the criteria actually match the first header
value in the first queue, which maps to "key1", "value1", so the message gets
published only to the first queue; also note the match criterion "any". In the last
queue, the values are the same as in the first queue, but the match criterion is "all", which
means both mappings should match; thus, the message isn't published to the bottom
queue.
RabbitMQ setup
Now let's actually start the setup and see some action with RabbitMQ. The latest
version of RabbitMQ can be downloaded as a DEB file from
rabbitmq.com or GitHub and can be installed using the Ubuntu Software Center:
Another mechanism is to use the command-line interface, with the following steps:
2. Next, we need to add the public key to your trusted key configurations:
wget -O- https://www.rabbitmq.com/rabbitmq-release-signing-key.asc |
sudo apt-key add -
3. Once the previous steps are successfully executed we need to update the
package list:
sudo apt-get update
4. We are all set to install the RabbitMQ server:
sudo apt-get install rabbitmq-server
6. The following command can be used to check the status of the server:
sudo service rabbitmq-server status
On successful completion of the previous steps, you should have the following
screenshot:
RabbitMQ — publish and subscribe
Once the RabbitMQ service is installed and it is up and running, as can be verified
from the RMQ management console, the next obvious step is to write a quick
publisher and consumer application. The following is the code snippet for the producer,
which publishes a message to an exchange called MYExchange on a queue called
MYQueue.
package com.book.rmq;
...
public class RMQProducer {
private static String myRecord;
private static final String EXCHANGE_NAME = "MYExchange";
private final static String QUEUE_NAME = "MYQueue";
private final static String ROUTING_KEY = "MYQueue";
public static void main(String[] argv) throws Exception {
ConnectionFactory factory = new ConnectionFactory();
Address[] addressArr = { new Address("localhost", 5672) };
Connection connection = factory.newConnection(addressArr);
Channel channel = connection.createChannel();
channel.exchangeDeclare(EXCHANGE_NAME, "direct");
channel.queueDeclare(QUEUE_NAME, true, false, false, null);
channel.queueBind(QUEUE_NAME, EXCHANGE_NAME, ROUTING_KEY);
int i = 0;
while (i < 1) {
try {
myRecord = "My Sample record";
channel.basicPublish(EXCHANGE_NAME, ROUTING_KEY,
MessageProperties.PERSISTENT_TEXT_PLAIN,
myRecord.getBytes());
System.out.println(" [x] Sent '" + myRecord + "'
sent at " + new Date());
i++;
Thread.sleep(2);
} catch (Exception e) {
e.printStackTrace();
}
}
channel.close();
connection.close();
}
}
The following screenshot shows the command-line and UI output for
the execution of the program:
On the RabbitMQ console, you can see the event published to MYQueue under the
Queues tab:
Next, let's put together a quick consumer Java application to read this message from
the queue:
package com.book.rmq;
..
public class RMQConsumer {
private static final String EXCHANGE_NAME = "MYExchange";
private final static String QUEUE_NAME = "MYQueue";
private final static String ROUTING_KEY = "MYQueue";
Here is the console output for the same that reads the sample:
The next progressive item would be to read the messages from RabbitMQ using a
Storm topology.
RabbitMQ – integration with Storm
Now that we have accomplished the basic setup, publish, and subscribe, let's next
move on to the integration of RabbitMQ with Storm. We'll execute this as an end-to-end
example.
AMQPSpout
Storm integrates with RabbitMQ using an AMQPSpout, which reads the messages from
RabbitMQ and pushes them to Storm topology for further processing. The following
code snippet captures the key aspects of encoding the AMQPSpout:
..
public class AMQPSpout implements IRichSpout {
private static final long serialVersionUID = 1L;
/**
* Logger instance
*/
private static final Logger log =
LoggerFactory.getLogger(AMQPSpout.class);
private static final long CONFIG_PREFETCH_COUNT = 0;
private static final long DEFAULT_PREFETCH_COUNT = 0;
private static final long WAIT_AFTER_SHUTDOWN_SIGNAL = 0;
private static final long WAIT_FOR_NEXT_MESSAGE = 1L;
/*
* Open method of the spout , here we initialize the prefetch count,
this
* parameter specified how many messages would be prefetched from the
queue
* by the spout - to increase the efficiency of the solution
*/
public void open(@SuppressWarnings("rawtypes") Map conf,
TopologyContext context, SpoutOutputCollector collector) {
Long prefetchCount = (Long) conf.get(CONFIG_PREFETCH_COUNT);
if (prefetchCount == null) {
log.info("Using default prefetch-count");
prefetchCount = DEFAULT_PREFETCH_COUNT;
} else if (prefetchCount < 1) {
throw new IllegalArgumentException(CONFIG_PREFETCH_COUNT
+ " must be at least 1");
}
this.prefetchCount = prefetchCount.intValue();
try {
this.collector = collector;
setupAMQP();
} catch (IOException e) {
log.error("AMQP setup failed", e);
log.warn("AMQP setup failed, will attempt to
reconnect...");
Utils.sleep(WAIT_AFTER_SHUTDOWN_SIGNAL);
try {
reconnect();
} catch (TimeoutException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
} catch (TimeoutException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
/**
* Reconnect to an AMQP broker.in case the connection breaks at some
point
*
* @throws TimeoutException
*/
private void reconnect() throws TimeoutException {
log.info("Reconnecting to AMQP broker...");
try {
setupAMQP();
} catch (IOException e) {
log.warn("Failed to reconnect to AMQP broker", e);
}
}
/**
* Setup a connection with an AMQP broker.
*
* @throws IOException
* This is the method where we actually connect to the queue
* using AMQP client api's
* @throws TimeoutException
*/
private void setupAMQP() throws IOException, TimeoutException {
final int prefetchCount = this.prefetchCount;
final ConnectionFactory connectionFactory = new ConnectionFactory() {
public void configureSocket(Socket socket) throws IOException {
socket.setTcpNoDelay(false);
socket.setReceiveBufferSize(20 * 1024);
socket.setSendBufferSize(20 * 1024);
}
};
connectionFactory.setHost(amqpHost);
connectionFactory.setPort(amqpPort);
connectionFactory.setUsername(amqpUsername);
connectionFactory.setPassword(amqpPasswd);
connectionFactory.setVirtualHost(amqpVhost);
this.amqpConnection = connectionFactory.newConnection();
this.amqpChannel = amqpConnection.createChannel();
log.info("Setting basic.qos prefetch-count to " +
prefetchCount);
amqpChannel.basicQos(prefetchCount);
amqpChannel.exchangeDeclare(EXCHANGE_NAME, "direct");
amqpChannel.queueDeclare(QUEUE_NAME, true, false, false, null);
amqpChannel.queueBind(QUEUE_NAME, EXCHANGE_NAME, "");
this.amqpConsumer = new QueueingConsumer(amqpChannel);
assert this.amqpConsumer != null;
this.amqpConsumerTag = amqpChannel.basicConsume(QUEUE_NAME,
this.autoAck, amqpConsumer);
System.out.println("***************");
}
/*
* Cancels the queue subscription, and disconnects from the AMQP
broker. */
public void close() {
try {
if (amqpChannel != null) {
if (amqpConsumerTag != null) {
amqpChannel.basicCancel(amqpConsumerTag);
}
amqpChannel.close();
}
} catch (IOException e) {
log.warn("Error closing AMQP channel", e);
} catch (TimeoutException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
try {
if (amqpConnection != null) {
amqpConnection.close();
}
} catch (IOException e) {
log.warn("Error closing AMQP connection", e);
}
}
/*
* Emit message received from queue into collector
*/
public void nextTuple() {
// if (spoutActive && amqpConsumer != null) {
try {
final QueueingConsumer.Delivery delivery = amqpConsumer
.nextDelivery(WAIT_FOR_NEXT_MESSAGE);
if (delivery == null)
return;
final long deliveryTag =
delivery.getEnvelope().getDeliveryTag();
String message = new String(delivery.getBody());
if (message != null && message.length() > 0) {
collector.emit(new Values(message), deliveryTag);
} else {
log.debug("Malformed deserialized message, null or
zero-length. "
+ deliveryTag);
if (!this.autoAck) {
ack(deliveryTag);
}
}
} catch (ShutdownSignalException e) {
log.warn("AMQP connection dropped, will attempt to
reconnect...");
Utils.sleep(WAIT_AFTER_SHUTDOWN_SIGNAL);
try {
reconnect();
} catch (TimeoutException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
} catch (ConsumerCancelledException e) {
log.warn("AMQP consumer cancelled, will attempt to
reconnect...");
Utils.sleep(WAIT_AFTER_SHUTDOWN_SIGNAL);
try {
reconnect();
} catch (TimeoutException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
} catch (InterruptedException e) {
log.error("Interrupted while reading a message, with
Exception : " + e);
}
// }
}
/*
* ack method to acknowledge the message that is successfully processed
*/
public void ack(Object msgId) {
if (msgId instanceof Long) {
final long deliveryTag = (Long) msgId;
if (amqpChannel != null) {
try {
amqpChannel.basicAck(deliveryTag, false);
} catch (IOException e) {
log.warn("Failed to ack delivery-tag " +
deliveryTag, e);
} catch (ShutdownSignalException e) {
log.warn(
"AMQP connection failed. Failed to
ack delivery-tag "
+ deliveryTag, e);
}
}
} else {
log.warn(String.format("don't know how to ack(%s: %s)",
msgId.getClass().getName(), msgId));
}
}
public void fail(Object msgId) {
if (msgId instanceof Long) {
final long deliveryTag = (Long) msgId;
if (amqpChannel != null) {
try {
if (amqpChannel.isOpen()) {
if (!this.autoAck) {
amqpChannel.basicReject(deliveryTag, requeueOnFail);
}
} else {
reconnect();
}
} catch (IOException e) {
log.warn("Failed to reject delivery-tag " +
deliveryTag, e);
} catch (TimeoutException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
} else {
log.warn(String.format("don't know how to reject(%s:
%s)", msgId.getClass().getName(), msgId));
}
}
We'll quickly browse through the key methods of the previous code snippet and their
internal workings:
public AMQPSpout(..): This is the constructor, where key variables are initialized
with details such as host IP, port, username, and password for RabbitMQ. We
also set up the Requeue flag in case the message fails to be processed by the
topology for some reason.
public void open(..): This is the basic method of the IRichSpout; the prefetch
count here tells us how many records should be read and kept ready in the spout
buffer for the topology to consume.
private void setupAMQP() ..: This is the key method that does its namesake and
sets up the spout and RabbitMQ connection by declaration of a connection
factory, exchange, and queue and binds them together to the channel.
public void nextTuple(): This is the method that receives the message from the
RabbitMQ channel and emits the same into the collector for the topology to
consume.
The following code snippet retrieves the message and its body and emits the
same into the topology:
..
final long deliveryTag = delivery.getEnvelope().getDeliveryTag();
String message = new String(delivery.getBody());
..
collector.emit(new Values(message), deliveryTag);
Next let's capture the topology builder for holding an AMQPSpout component
together with other bolts:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new AMQPSpout("localhost", 5672, "guest", "guest", "/"
Next, we are going to feed the queue we declared in RabbitMQ with a
continuous stream of sensor data. I suggest you connect to any free streaming
data source, such as Facebook or Twitter. For this book, I have resorted to
PubNub: (https://www.pubnub.com/developers/realtime-data-streams/)
The following are some pom.xml entries that are required as dependencies for this
entire program to execute correctly out of your Eclipse setup. They are the Maven
dependencies for Storm, RabbitMQ, Jackson, and PubNub:
...
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<storm.version>0.9.3</storm.version>
</properties>
<dependencies>
<dependency>
<groupId>com.rabbitmq</groupId>
<artifactId>amqp-client</artifactId>
<version>3.6.2</version>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>0.9.3</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.pubnub</groupId>
<artifactId>pubnub-gson</artifactId>
<version>4.4.4</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.6.3</version>
</dependency>
..
PubNub data stream publisher
Let's put together a quick publisher that reads the live stream of sensor data
messages from PubNub and pushes them to RabbitMQ. The snippet below shows the
class skeleton; the tail of the publishing helper (RMQPublisher, called from the
listener below) closes the channel and connection:
..
public class TestStream {
..
channel.close();
connection.close();
}
public static void main ( String args[])
{
PNConfiguration pnConfiguration = new PNConfiguration();
pnConfiguration.setSubscribeKey("sub-c-5f1b7c8e-fbee-11e3-aa40-02ee2ddab7fe");
PubNub pubnub = new PubNub(pnConfiguration);
pubnub.addListener(new SubscribeCallback() {
@Override
public void status(PubNub pubnub, PNStatus status) {
if (status.getCategory() == PNStatusCategory.PNUnexpectedDisconnectCategory) {
// This event happens when radio / connectivity is lost
}
if (status.getCategory() == PNStatusCategory.PNConnectedCategory){
System.out.println("status.getCategory()="+status.getCategory());
}
}
else if (status.getCategory() == PNStatusCategory.PNReconnectedCategory) {
// Happens as part of our regular operation. This event happens when
// radio / connectivity is lost, then regained.
}
else if (status.getCategory() == PNStatusCategory.PNDecryptionErrorCategory) {
// Handle messsage decryption error. Probably client configured to
// encrypt messages and on live data feed it received plain text.
}
}
@Override
public void message(PubNub pubnub, PNMessageResult message) {
// Handle new message stored in message.message
String strMessage = message.getMessage().toString();
System.out.println("******"+strMessage);
try {
RMQPublisher(strMessage);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (TimeoutException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
@Override
public void presence(PubNub pubnub, PNPresenceEventResult presence) {
}
});
pubnub.subscribe().channels(Arrays.asList("pubnub-sensor-network")).execute();
}
}
STREAM DETAILS
Channel: pubnub-sensor-network
Subscribe key: sub-c-5f1b7c8e-fbee-11e3-aa40-02ee2ddab7fe
The PubNub listener above binds to the subscription channel identified by the subscribe key and emits the messages into RabbitMQ (MyExchange, MYQueue).
The following is a screenshot of the program in execution and a screenshot of RabbitMQ with messages in the queue, along with a sample message for reference:
The following screenshot shows the messages showing up in RabbitMQ, where you can see the messages from the PubNub sensor stream lined up in MyQueue:
..
class JsonConverter {
..
mysensorObj = mapper.readValue(jsonInString,
MySensorData.class);
// Pretty print
String prettyStaff1 =
mapper.writerWithDefaultPrettyPrinter()
.writeValueAsString(mysensorObj);
System.out.println(prettyStaff1);
} catch (JsonGenerationException e) {
..
return mysensorObj;
}
}
The JSONBolt basically accepts the message from the AMQPSpout of the topology and converts the JSON string into a JSON object that can be processed further based on business logic; in our case, we chained in the SensorProcessorBolt:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new AMQPSpout("localhost", 5672, "guest", "guest", "/", true, false));
builder.setBolt("split", new JsonBolt(), 8).shuffleGrouping("spout");
builder.setBolt("count", new SensorProcessorBolt(), 2).fieldsGrouping("split", new Fields("word"));
The SensorProcessorBolt checks for the radiationLevel in the sensor emitted data and
filters and emits only the events that have a radiation level of more than 197:
..
collector.emit(new Values(mysensordata.toString()));
}
..
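As a point of reference, below is a minimal, hedged sketch of what such a filtering bolt could look like; the MySensorData POJO and its getRadiationLevel() accessor are assumptions based on the JsonConverter shown earlier, and the actual implementation ships with the code bundle:
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
public class SensorProcessorBolt extends BaseBasicBolt {
@Override
public void execute(Tuple input, BasicOutputCollector collector) {
// The upstream JsonBolt is assumed to emit the parsed MySensorData object as the first field
MySensorData mysensordata = (MySensorData) input.getValue(0);
// Forward only the high-radiation events downstream
if (mysensordata.getRadiationLevel() > 197) {
collector.emit(new Values(mysensordata.toString()));
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("sensordata"));
}
}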
Eventual consistency means that when you insert a row into the database, a concurrent reader of the same table may or may not see the newly added row immediately. A keyspace in Cassandra is the equivalent of a database in an RDBMS; the rest of the terminology is much the same as in an RDBMS. Cassandra is an open source tool, but if you need to manage clusters with a purpose-built UI, then consider DataStax, which provides paid premium services including full-time support.
Now, let's see how to set up Cassandra.
Setting up Cassandra
Download the latest version, 3.10, from http://cassandra.apache.org/download/; older versions can be downloaded from http://archive.apache.org/dist/cassandra/. apache-cassandra-3.10-bin.tar.gz will be downloaded. Execute the following commands to move and extract it:
mv apache-cassandra-3.10-bin.tar.gz ~/demo
tar -xvf ~/demo/apache-cassandra-3.10-bin.tar.gz
Cassandra starts as a background process; press Enter to return to the shell prompt from the startup logs. To verify whether Cassandra is working or not, execute the following command:
/bin/nodetool status
There is a concept of Virtual Nodes (vnodes) in Cassandra. vnodes are created automatically by Cassandra on each node; by default, 256 vnodes are created when you start Cassandra. To check the vnodes on the current node, use the following command:
/bin/nodetool ring
It shows a long list of token IDs along with the load, as seen in the following
screenshot:
Configuring Cassandra
The conf directory under the installation directory contains all the configuration files that you can tune. Let's divide the configuration into sections to understand it better:
Two or three nodes are sufficient to perform this role. Also, use the same seed nodes across your whole cluster.
2. Create a class:
public class CassandraBolt extends BaseBasicBolt
Override the two methods, prepare and cleanup. Line #1 creates a Cassandra Cluster object, for which you have to provide the IPs of all the Cassandra nodes. If you have a port other than the default 9042 configured, then provide the IPs with the port number separated by :. Line #2 creates a session from the cluster, which opens a connection to one of the nodes in the Cassandra node list; you also need to provide a keyspace name while creating the session. Line #3 closes the cluster after the job is completed. In Storm, the prepare method is called only once, when the topology is deployed on the cluster, and the cleanup method is called only once, when the topology is killed.
4. Define the execute method. The execute method is invoked for each tuple that Storm processes:
session.execute("INSERT INTO users (lastname, age, city, email, firstname) VALUES ('Jo
The preceding statement inserts a row into the Cassandra table users in the demo keyspace. The field values are available in the tuple parameter, so you can read them and change the previous statement as follows:
String userDetail = (String) input.getValueByField("event");
String[] userDetailFields = userDetail.split(":");
session.execute("INSERT INTO users (lastname, age, city, email, firstname) VALUES ('us
You can execute any valid CQL statement in the session.execute method. The session also provides PreparedStatement, which can be used for bulk insert/update/delete. A consolidated sketch of the bolt follows.
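Putting the pieces together, the following is a hedged sketch of such a CassandraBolt; the contact point, the demo keyspace, and the field order of the incoming event (lastname:age:city:email:firstname) are assumptions to be adjusted to your setup:
import java.util.Map;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
public class CassandraBolt extends BaseBasicBolt {
private transient Cluster cluster;
private transient Session session;
@Override
public void prepare(Map stormConf, TopologyContext context) {
cluster = Cluster.builder().addContactPoint("localhost").build(); // Line #1
session = cluster.connect("demo"); // Line #2
}
@Override
public void execute(Tuple input, BasicOutputCollector collector) {
// Read the raw event from the tuple and split it into its fields
String userDetail = (String) input.getValueByField("event");
String[] f = userDetail.split(":");
session.execute("INSERT INTO users (lastname, age, city, email, firstname) VALUES ('"
+ f[0] + "'," + Integer.parseInt(f[1]) + ",'" + f[2] + "','" + f[3] + "','" + f[4] + "')");
}
@Override
public void cleanup() {
session.close();
cluster.close(); // Line #3
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
}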
Storm and IMDB integration for
dimensional data
IMDB stands for In-Memory Database. An IMDB is needed to keep intermediate results while processing streaming events, or to hold static information related to the events that is not carried in the events themselves. For example, employee details can be stored in an IMDB keyed by employee ID, while the events record employees entering and leaving an office. In this case, an event does not contain complete information about the employee, to save network cost and for better performance. Therefore, when Storm processes the event, it takes the static information regarding the employee from the IMDB and persists it along with the event details in Cassandra, or any other database, for further analytics. There are numerous open source IMDB tools available on the market; some famous ones are Hazelcast, Memcached, and Redis.
Let's see how to integrate Storm and Hazelcast. No special setup is required for Hazelcast. The core of the integration is a bolt that connects to Hazelcast in its prepare method and looks up the dimensional data in its execute method.
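The step-by-step code is in the code bundle; as a hedged sketch (assuming a Hazelcast member running locally on 127.0.0.1:5701 and a map named employee-details, both of which are illustrative), such a lookup bolt could look like this:
import java.util.Map;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
public class HazelcastLookupBolt extends BaseBasicBolt {
private transient HazelcastInstance client;
private transient IMap<String, String> employeeMap;
@Override
public void prepare(Map stormConf, TopologyContext context) {
ClientConfig clientConfig = new ClientConfig();
clientConfig.getNetworkConfig().addAddress("127.0.0.1:5701"); // assumed Hazelcast member
client = HazelcastClient.newHazelcastClient(clientConfig);
employeeMap = client.getMap("employee-details"); // static dimensional data keyed by employee ID
}
@Override
public void execute(Tuple input, BasicOutputCollector collector) {
// Enrich the incoming event with the static employee record from the IMDB
String employeeId = input.getString(0);
String employeeDetails = employeeMap.get(employeeId);
collector.emit(new Values(employeeId, employeeDetails));
}
@Override
public void cleanup() {
client.shutdown();
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("employeeId", "employeeDetails"));
}
}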
Now we know how to persist data from Storm into Cassandra and how to use Hazelcast to get static information about the events. Let's move on to integrating Storm with the presentation layer.
Integrating the presentation layer with
Storm
Visualization adds power to your data by letting you understand it in the best way possible and take key decisions based on it. There are numerous visualization tools available on the market, and every visualization tool needs a database to store and process the data. Some common combinations are Grafana over Elasticsearch, Kibana over Elasticsearch, and Grafana over InfluxDB. In this chapter, we will discuss the fusion of Grafana, Elasticsearch, and Storm.
In this example, we will use the data stream from PubNub, which provides real-time sensor data. PubNub provides all types of APIs to read data from a channel. Here, a program is required to get the values from the subscribed PubNub channel and push them into a Kafka topic. You will find the program in the code bundle.
Setting up Grafana with the
Elasticsearch plugin
Grafana is an analytics platform that understands your data and visualizes it on a dashboard.
Downloading Grafana
Download Grafana from https://grafana.com/grafana/download; the page gives you all the possible download options for the supported platforms/OS. Here we are installing the standalone binaries:
wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.2.0.linux-x64.tar.gz
cp grafana-4.2.0.linux-x64.tar.gz ~/demo/.
tar -zxvf grafana-4.2.0.linux-x64.tar.gz
data: The path where Grafana stores the sqlite3 database (if used), file-based sessions (if used), and other data.
logs: The path where Grafana will store its logs.
Installing the Elasticsearch plugin in
Grafana
Use the following command to install the Elasticsearch plugin in Grafana:
/bin/grafana-cli plugins install stagemonitor-elasticsearch-app
Running Grafana
Use the following command to run Grafana:
/bin/grafana-server
At first it will take some time to start as it needs to set up its own database. Once it
has started successfully, the UI can be accessed using http://localhost:3000. Enter
admin as the username and admin as the password. The dashboard will be displayed as
shown in the following screenshot:
Adding the Elasticsearch datasource in
Grafana
Now, click on the Add data source icon on the dashboard. Add/update the value as
shown in the following screenshot:
We learned how to write topology code in the previous sections. Now, we will discuss the Elasticsearch bolt that reads data from other bolts/spouts and writes it into Elasticsearch. Perform the following steps:
Line #1 creates the ObjectMapper object for JSON parsing; any other library could be used for parsing. Line #2 creates the Settings object required by the Elasticsearch client; the settings require the cluster name as mandatory. Line #3 creates the Elasticsearch client object, which takes the settings as a parameter. In line #4, we need to provide the Elasticsearch node details, which include the hostname and port number. If there is a cluster of Elasticsearch nodes, then add each node's InetAddress.
Line #1 gets the value from the tuple at position zero, which contains the event. Line #2 creates an index, if one does not exist, and adds a document to it. client.prepareIndex creates an index request, which takes the index name as the first parameter and the type as the second parameter. setSource is the method that adds a document to the index. The get method returns an IndexResponse object, which tells you whether the request to create the index and add the document completed successfully or not.
convertStringtoMap converts a string into a map while changing the datatypes of the fields; this is required to make the data presentable on the Grafana dashboard. If an event is already in the desired format, then we do not need to convert the types.
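For reference, here is a hedged sketch of convertStringtoMap, consistent with the fragment shown later in this section; it relies on the ObjectMapper field and the Jackson/Java imports from the full listing, and the field names follow the PubNub sensor stream:
private Map<String, Object> convertStringtoMap(String event) throws IOException {
// Parse the raw JSON event into a generic map
Map<String, Object> readValue = mapper.readValue(event, new TypeReference<Map<String, Object>>() {});
Map<String, Object> convertedValue = new HashMap<String, Object>();
// Re-type the numeric fields so Grafana can aggregate and plot them
convertedValue.put("ambient_temperature", Double.parseDouble(String.valueOf(readValue.get("ambient_temperature"))));
convertedValue.put("photosensor", Double.parseDouble(String.valueOf(readValue.get("photosensor"))));
convertedValue.put("humidity", Double.parseDouble(String.valueOf(readValue.get("humidity"))));
convertedValue.put("radiation_level", Integer.parseInt(String.valueOf(readValue.get("radiation_level"))));
convertedValue.put("sensor_uuid", readValue.get("sensor_uuid"));
convertedValue.put("timestamp", new Date()); // used as the time field on the dashboard
return convertedValue;
}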
Now, we have to integrate the Elasticsearch bolt with the Kafka spout to
read the data. Perform the following steps:
6. Write the topology that binds the Kafka spout and the Elasticsearch bolt:
TopologyBuilder topologyBuilder = new TopologyBuilder();//Line #1
BrokerHosts hosts = new ZkHosts("localhost:2181");//Line #2
SpoutConfig spoutConfig = new SpoutConfig(hosts, "sensor-data", "/" + topicName, UUID.randomUUID().toString());//Line #3
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
//Line #4
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig); //Line #5
topologyBuilder.setSpout("spout", kafkaSpout, 1);//Line #6
topologyBuilder.setBolt("es-bolt", new ElasticSearchBolt(), 1).shuffleGrouping("spout");//Line #7
Line #1 creates the TopologyBuilder object, which holds the information for the spout and all the bolts. In this example, we are using the predefined Storm-Kafka integration spout provided by Storm. Create the Zookeeper host details as BrokerHosts in line #2. In line #3, we create the spoutConfig required for the Kafka spout, which contains information about the Zookeeper hosts, the topic name, the Zookeeper root directory, and the client ID used to communicate with the Kafka broker. Line #4 sets the scheme as string; otherwise, by default, it is bytes. Create the KafkaSpout object with spoutConfig as the parameter. Now, first set the spout into topologyBuilder with the ID, kafkaSpout, and a parallelism hint as parameters in line #6. In the same way, set the Elasticsearch bolt into topologyBuilder as described in line #7.
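Finally, the topology built above can be submitted to a local cluster for testing; this is a minimal sketch, with the topology name and worker count being illustrative:
Config config = new Config();
config.setNumWorkers(1);
LocalCluster cluster = new LocalCluster();
// Submit the Kafka spout + Elasticsearch bolt topology defined above
cluster.submitTopology("sensor-es-topology", config, topologyBuilder.createTopology());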
1. Run PubNubDataStream.java.
2. Run SenorTopology.java.
Visualizing the output on Grafana
You can configure the dashboard, as shown in the following screenshot:
Code:
package com.book.chapter7.visualization.example;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import com.pubnub.api.PNConfiguration;
import com.pubnub.api.PubNub;
import com.pubnub.api.callbacks.SubscribeCallback;
import com.pubnub.api.enums.PNStatusCategory;
import com.pubnub.api.models.consumer.PNStatus;
import com.pubnub.api.models.consumer.pubsub.PNMessageResult;
import com.pubnub.api.models.consumer.pubsub.PNPresenceEventResult;
public class PubNubDataStream {
private KafkaProducer<Integer, String> producer; // Kafka producer shared by the listener callbacks
public PubNubDataStream() {
Properties properties = new Properties();
properties.put("bootstrap.servers", "localhost:9092");
properties.put("key.serializer",
"org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer",
"org.apache.kafka.common.serialization.StringSerializer");
properties.put("acks", "1");
producer = new KafkaProducer<Integer, String>(properties);
}
private void getMessageFromPubNub() {
// Create the PubNub client with the same subscribe key used earlier in the chapter
PNConfiguration pnConfiguration = new PNConfiguration();
pnConfiguration.setSubscribeKey("sub-c-5f1b7c8e-fbee-11e3-aa40-02ee2ddab7fe");
PubNub pubnub = new PubNub(pnConfiguration);
pubnub.addListener(new SubscribeCallback() {
@Override
public void status(PubNub pubnub, PNStatus status) {
System.out.println(pubnub.toString() + "::" + status.toString());
if (status.getCategory() ==
PNStatusCategory.PNUnexpectedDisconnectCategory) {
// This event happens when radio / connectivity is lost
}
else if (status.getCategory() ==
PNStatusCategory.PNConnectedCategory) {
// Connect event. You can do stuff like publish, and know
// you'll get it.
// Or just use the connected event to confirm you are
// subscribed for
// UI / internal notifications, etc
if (status.getCategory() ==
PNStatusCategory.PNConnectedCategory) {
System.out.println("status.getCategory()="+
status.getCategory());
}
} else if (status.getCategory() ==
PNStatusCategory.PNReconnectedCategory) {
// Happens as part of our regular operation. This event
// happens when
// radio / connectivity is lost, then regained.
} else if (status.getCategory() ==
PNStatusCategory.PNDecryptionErrorCategory) {
// Handle messsage decryption error. Probably client
// configured to
// encrypt messages and on live data feed it received plain
// text.
}
}
@Override
public void message(PubNub pubnub, PNMessageResult message) {
// Handle new message stored in message.message
String strMessage = message.getMessage().toString();
System.out.println("******" + strMessage);
pubishMessageToKafka(strMessage);
/*
* log the following items with your favorite logger -
* message.getMessage() - message.getSubscription() -
* message.getTimetoken()
*/
}
@Override
public void presence(PubNub pubnub, PNPresenceEventResult presence) {
}
});
pubnub.subscribe().channels(Arrays.asList("pubnub-sensor-
network")).execute();
}
public static void main(String[] args) {
new PubNubDataStream().getMessageFromPubNub();
}
}
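The pubishMessageToKafka(...) helper called from the message listener is not shown above; the real implementation ships with the code bundle, but a minimal, hedged sketch of it (the sensor-data topic name matches the Kafka spout configuration shown earlier) could be:
private void pubishMessageToKafka(String message) {
// Push the raw PubNub sensor event to the Kafka topic the Storm topology consumes from
ProducerRecord<Integer, String> record = new ProducerRecord<Integer, String>("sensor-data", message);
producer.send(record);
}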
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;
import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonMappingException;
import com.fasterxml.jackson.databind.ObjectMapper;
@Override
public void prepare(@SuppressWarnings("rawtypes") Map stormConf,
TopologyContext context) {
// instance a json mapper
mapper = new ObjectMapper(); // create once, reuse
Settings settings = Settings.builder()
.put("cluster.name", "my-application").build();
preBuiltTransportClient = new PreBuiltTransportClient(settings);
// Assuming a local Elasticsearch node listening on the default transport port 9300
client = preBuiltTransportClient.addTransportAddress(new InetSocketTransportAddress(new InetSocketAddress("localhost", 9300)));
}
@Override
public void cleanup() {
preBuiltTransportClient.close();
client.close();
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
@Override
public void execute(Tuple input, BasicOutputCollector collector) {
String valueByField = input.getString(0);
System.out.println(valueByField);
try {
IndexResponse response = client.prepareIndex("pub-nub", "sensor-
data").setSource(convertStringtoMap(valueByField)).get();
System.out.println(response.status());
} catch (IOException e) {
e.printStackTrace();
}
}
convertedValue.put("ambient_temperature", Double.parseDouble(String.valueOf(readValue.get
convertedValue.put("photosensor", Double.parseDouble(String.valueOf(readValue.get("photosenso
convertedValue.put("humidity", Double.parseDouble(String.valueOf(readValue.get("humidity"))))
convertedValue.put("radiation_level", Integer.parseInt(String.valueOf(readValue.get("radiatio
convertedValue.put("sensor_uuid", readValue.get("sensor_uuid"));
convertedValue.put("timestamp", new Date());
import java.util.UUID;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;
Pseudo code:
import java.util.Date;
import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
properties.put("bootstrap.servers", "localhost:9092");
properties.put("key.serializer",
"org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer",
"org.apache.kafka.common.serialization.StringSerializer");
properties.put("acks", "1");
stream.append(phoneNumber);
stream.append(",");
stream.append(bin);
stream.append(",");
stream.append(bout);
stream.append(",");
stream.append(new Date(ThreadLocalRandom.current().nextLong()));
System.out.println(stream.toString());
ProducerRecord<Integer, String> data = new ProducerRecord<Integer,
String>(
"storm-diy", stream.toString());
producer.send(data);
counter++;
}
producer.close();
}
}
import com.hazelcast.core.Hazelcast;
package com.book.chapter7.diy;
import java.io.Serializable;
import java.util.Map;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
@Override
public void prepare(Map stormConf, TopologyContext context) {
ClientConfig clientConfig = new ClientConfig();
clientConfig.getNetworkConfig().addAddress("127.0.0.1:5701");
client = HazelcastClient.newHazelcastClient(clientConfig);
usageMap = client.getMap("usage");
}
@Override
public void execute(Tuple input, BasicOutputCollector collector) {
PacketDetailDTO packetDetailDTO = new PacketDetailDTO();
String valueByField = input.getString(0);
String[] split = valueByField.split(",");
long phoneNumber = Long.parseLong(split[0]);
PacketDetailDTO packetDetailDTOFromMap = usageMap.get(phoneNumber);
if (null == packetDetailDTOFromMap) {
packetDetailDTOFromMap = new PacketDetailDTO();
}
packetDetailDTO.setPhoneNumber(phoneNumber);
int bin = Integer.parseInt(split[1]);
packetDetailDTO.setBin((packetDetailDTOFromMap.getBin() + bin));
int bout = Integer.parseInt(split[2]);
packetDetailDTO.setBout(packetDetailDTOFromMap.getBout() + bout);
packetDetailDTO.setTotalBytes(packetDetailDTOFromMap.getTotalBytes()
+ bin + bout);
usageMap.put(phoneNumber, packetDetailDTO); // key by phone number, consistent with the get above
// emits to the "usagestream" and "tdrstream" streams are elided here
}
@Override
public void cleanup() {
client.shutdown();
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declareStream("usagestream", new Fields("usagestream"));
declarer.declareStream("tdrstream", new Fields("tdrstream"));
}
}
Cassandra persistence bolt:
package com.book.chapter7.diy;
import java.util.Map;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
@Override
public void prepare(Map stormConf, TopologyContext context) {
cluster = Cluster.builder().addContactPoint(hostname).build();
session = cluster.connect(keyspace);
}
@Override
public void cleanup() {
session.close();
cluster.close();
}
}
import java.util.Map;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
@Override
public void prepare(Map stormConf, TopologyContext context) {
cluster = Cluster.builder().addContactPoint(hostname).build();
session = cluster.connect(keyspace);
}
@Override
public void cleanup() {
session.close();
cluster.close();
}
}
-----------------------------------------------------
import java.util.UUID;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;
Let's take an example that explains how to achieve exactly-once semantics. Suppose
that you're doing a count of how many people visited your blog and also storing the
running count in a database. Now suppose you store a single value representing the
count in the database, and every time you process a new tuple you increment the
count.
Now, if failures happen, tuples will be replayed by the Storm topology. The problem is knowing whether the tuple has already been processed and the count already updated in the database: if so, you should not update it again; if the tuple was not processed successfully, you have to update the count in the database; and if the tuple was processed but the update to the database failed, you should still update the database.
To achieve exactly-once semantics, which ensure that each tuple is processed only once in the system, the spout has to provide additional information to the bolts. There are three types of spouts available with respect to fault tolerance: transactional, non-transactional, and opaque transactional. Now, let's have a look at each type of spout.
Transactional spout
Let's have a look at how trident spout processes tuples and what the characteristics
are:
Using the following statement in code, we can define a transactional spout with Kafka:
TransactionalTridentKafkaSpout tr = new TransactionalTridentKafkaSpout(new TridentKafkaConfig(new ZkHosts("localhost:2181"), "test"));
Using the following statement in code, we can define an opaque transactional spout with Kafka:
OpaqueTridentKafkaSpout otks = new OpaqueTridentKafkaSpout(new TridentKafkaConfig(new ZkHosts("localhost:2181"), "test"));
Basic Storm Trident topology
Here, in basic Storm Trident topology we will go through a word count example.
More examples will be explained later in the chapter. This is the code for the
example:
FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
new Values("this is simple example of trident topology"),
new Values("this example count same words"));
spout.setCycle(true); // Line 1
TridentTopology topology = new TridentTopology(); // Line 2
MemoryMapState.Factory stateFactory = new MemoryMapState.Factory(); // Line 3
topology.newStream("spout1", spout) // Line 4
.each(new Fields("sentence"), new Split(), new Fields("word")) // Line 5
.groupBy(new Fields("word")) // Line 6
.persistentAggregate(stateFactory, new Count(), new Fields("count")).newValuesStream() // Lin
.filter(new DisplayOutputFilter()) // Line 8
.parallelismHint(6); // Line 9
Config config = new Config(); // Line 10
config.setNumWorkers(3); // Line 11
LocalCluster cluster = new LocalCluster(); // Line 12
cluster.submitTopology("storm-trident-example", config, topology.build()); // Line 13
Start the program by defining the spout; FixedBatchSpout is available in the code bundle for testing purposes. You can give it a set of values that will be repeated when setCycle is set to true (Line 1). Alternatively, you can define a TransactionalTridentKafkaSpout, which requires the Zookeeper details and a topic name to connect; another constructor takes the same parameters along with a client ID:
TransactionalTridentKafkaSpout spout = new TransactionalTridentKafkaSpout(new TridentKafkaConfig(new ZkHosts("localhost:2181"), "test"));
Now, perform an operation on each tuple in Line 5; here, the Split function is executed for each tuple. The each method of the topology takes three arguments: the input field name from the spout, the function to be executed for each tuple, and the output field name. The implementation of the Split function is as follows; sentences are split on the basis of spaces:
class Split extends BaseFunction {
public void execute(TridentTuple tuple, TridentCollector collector) {
String sentence = tuple.getString(0);
for (String word : sentence.split(" ")) {
collector.emit(new Values(word));
}
}
}
Line 6 performs a group by operation on the tuples by the word field. It groups together all the tuples with the same word and creates batches; combined with the persistent aggregate in Line 7, it produces tuples with word and count.
To display output, implement a custom filter to print the tuple values on the console.
Line 8 is applying the filter. Implementation of the custom filter DisplayOutputFilter
is:
public class DisplayOutputFilter implements Filter {
@Override
public void prepare(Map conf, TridentOperationContext context) {
}
@Override
public void cleanup() {
}
@Override
public boolean isKeep(TridentTuple tuple) {
System.out.println(tuple.get(0)+":"+tuple.get(1));
return true;
}
}
Line 9 sets the parallelism hint, and lines 10 to 13 create a config object and submit the previously created topology to a local cluster.
Trident turns your topology into a directed acyclic dataflow graph that it uses to assign operations to bolts, and then to assign those bolts to workers. It's smart enough to optimize that assignment: it combines operations into bolts so that, as much as possible, tuples are handed off with simple method calls, and it arranges bolts among workers so that, as much as possible, tuples are handed off to local executors.
The actual spout of a Trident topology is called the Master Batch Coordinator (MBC). All it does is emit a tuple describing itself as batch 1, then a tuple describing itself as batch 2, and so forth. Deciding when to emit those batches, when to retry them, and so on, is the interesting part, but Storm itself doesn't know anything about all that. Those batch tuples go to the topology's spout coordinator. The spout coordinator understands the location and arrangement of records in the external source and ensures that each source record belongs uniquely to a successful Trident batch.
Trident operations
As we discussed earlier, Trident operations are implemented as Storm bolts. We have a vast range of operations available in Trident; they can perform complex operations and aggregations, caching intermediate results in memory. The following are the operations available with Trident.
Functions
The following are characteristics of functions:
A function takes a set of input fields from each tuple and emits zero or more tuples.
The output fields of a function are appended to the original tuple in the stream.
If a function emits no tuples for an input tuple, that input tuple is filtered out (see the sketch after the sample input/output below).
Input:
[1,2]
[3,4]
[7,3]
Output:
[1,2,1]
[3,4,1]
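One possible function that produces the output above is sketched below; the assumption (not stated in the original) is that it emits a constant marker field only when the first value is smaller than the second, so [7,3] yields no output and that tuple is dropped. It would be applied as stream.each(new Fields("a", "b"), new LessThanMarker(), new Fields("d")), with the field names being illustrative:
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;
public class LessThanMarker extends BaseFunction {
@Override
public void execute(TridentTuple tuple, TridentCollector collector) {
// Emitting nothing here filters the input tuple out of the stream
if (tuple.getInteger(0) < tuple.getInteger(1)) {
collector.emit(new Values(1));
}
}
}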
map and flatMap
The following are characteristics of the map function: it applies a one-to-one transformation, returning exactly one output tuple for each input tuple, as in the upper-casing sample below (sketches of the map and flatMap functions used later appear after these samples).
Input:
[this is a simple example of trident topology]
Output:
[THIS IS A SIMPLE EXAMPLE OF TRIDENT TOPOLOGY]
The flatMap function, by contrast, applies a one-to-many transformation and flattens the results into a single stream of tuples, for example splitting sentences into words.
Input:
[this is s simple example of trident topology]
Output:
[this]
[is]
[simple]
[example]
[of]
[trident]
[topology]
[this]
[example]
[count]
[same]
[words]
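The UpperCase and SplitMapFunction classes used in the peek example below could be sketched as follows, assuming the Storm 1.x Trident MapFunction and FlatMapFunction interfaces:
import java.util.ArrayList;
import java.util.List;
import org.apache.storm.trident.operation.FlatMapFunction;
import org.apache.storm.trident.operation.MapFunction;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;
// map: one tuple in, one tuple out - upper-case the sentence
class UpperCase implements MapFunction {
@Override
public Values execute(TridentTuple input) {
return new Values(input.getString(0).toUpperCase());
}
}
// flatMap: one tuple in, zero or more tuples out - split the sentence into words
class SplitMapFunction implements FlatMapFunction {
@Override
public Iterable<Values> execute(TridentTuple input) {
List<Values> words = new ArrayList<Values>();
for (String word : input.getString(0).split(" ")) {
words.add(new Values(word));
}
return words;
}
}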
peek
This is used to debug the tuples flowing between the operations. The following is the
example using the previous functions:
topology.newStream("spout1", spout).flatMap(new SplitMapFunction())
.map(new UpperCase()).peek(
new Consumer() {
@Override
public void accept(TridentTuple tuple) {
System.out.print("[");
for (int index = 0; index < tuple.size(); index++) {
System.out.print(tuple.get(index));
if (index < (tuple.size() - 1))
System.out.print(",");
}
System.out.println("]");
}
});
Filters
The following are characteristics of filters:
Filters take in a tuple as input and decide whether or not to keep it; a tuple for which isKeep returns false is dropped from the stream (a sketch of a filter consistent with the sample below follows the output).
Input:
[1,2]
[3,4]
[7,3]
Output:
[1,2]
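A filter consistent with the sample above (keeping only tuples whose first field equals 1, which is an assumption inferred from the input/output) could be sketched as follows; it would be applied as stream.filter(new KeepOnesFilter()):
import org.apache.storm.trident.operation.BaseFilter;
import org.apache.storm.trident.tuple.TridentTuple;
public class KeepOnesFilter extends BaseFilter {
@Override
public boolean isKeep(TridentTuple tuple) {
// Returning false drops the tuple from the stream
return tuple.getInteger(0) == 1;
}
}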
Windowing
In windowing, Trident processes the tuples that fall within the same window together and emits the results to the next operation. There are two types of window operations:
Tumbling window
A window with a fixed interval or count is processed at one time, and each tuple is processed in only a single window. This is explained in the following diagram:
Here e1, e2, e3, and so on are events; 0, 5, 10, and 15 mark five-second boundaries; and w1, w2, and w3 are the windows. Every event is part of only one window.
Sliding window
The window has a duration and, after processing, slides forward on the basis of the sliding interval; a tuple can therefore be processed in more than one window. This is explained in the following diagram:
Here, windows are overlapping and one event can be part of more than one window.
In the example of the window function, we will integrate the feed from Kafka and
check the output on the console.
The following is the code snippet for understanding linking Storm Trident with
Kafka:
TridentKafkaConfig config = new TridentKafkaConfig(new ZkHosts("localhost:2181"), "test"); //
config.scheme = new SchemeAsMultiScheme(new StringScheme()); // line 2
config.startOffsetTime = kafka.api.OffsetRequest.LatestTime();// line 3
TransactionalTridentKafkaSpout spout = new TransactionalTridentKafkaSpout(config); // line 4
First, create the TridentKafkaConfig object, which takes the ZooKeeper hostname and port along with the topic name, in line 1. Set the scheme to String for the input in line 2. Also set startOffsetTime so that only the latest events in the topic are considered, instead of all events from the beginning, in line 3. Create the Trident Kafka spout using the configuration defined previously in line 4.
The following is the code snippet for understanding the use of the window API available in Storm Trident:
topology.newStream("spout1", spout) // line 1
.each(new Fields("str"),new Split(), new Fields("word")) // line 2
.window(windowConfig, windowStore, new Fields("word"), new CountAsAggregator(), new Fields("c
First, we will define the spout in line 1. We are using the Kafka spout that we
created in the previous step. In line 2, split the input string separated by space and
define the output as word for each event in spout. The following common window
API is used for any supported windowing function:
public Stream window(WindowConfig windowConfig, WindowsStoreFactory windowStoreFactory, Fields inputFields, Aggregator aggregator, Fields functionFields)
The windowStore can be any of the following. It is required to process the tuples and
the aggregate of values:
HBaseWindowStoreFactory
InMemoryWindowsStoreFactory
In line 3, apart from windowConfig and windowStore, the input field is used as word,
output as count, and aggregation function as CountAsAggregator which calculates the
count of tuples received in the window.
Sliding count window: It performs the operation after every 10 tuples (the sliding count), over a window of the last 100 tuples:
SlidingCountWindow.of(100, 10)
Tumbling count window: It performs the operation after every window of 100 tuples:
TumblingCountWindow.of(100)
Sliding duration window: The window duration is six seconds and the sliding duration is three seconds:
SlidingDurationWindow.of(new BaseWindowedBolt.Duration(6, TimeUnit.SECONDS), new BaseWindowedBolt.Duration(3, TimeUnit.SECONDS))
Partition aggregate: partitionAggregate runs an aggregation function over each partition of a batch in isolation, replacing the input tuples with the emitted result. For example, summing the second field within each partition:
Input:
Partition 1:
["a", 3]
["c", 8]
Partition 2:
["e", 1]
["d", 9]
["d", 10]
Output:
Partition 1:
[11]
Partition 2:
[20]
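These numbers are what a partition aggregate summing the second field within each partition would produce; a sketch using the built-in Sum aggregator (the stream and field names are assumptions) is:
import org.apache.storm.trident.Stream;
import org.apache.storm.trident.operation.builtin.Sum;
import org.apache.storm.tuple.Fields;
// Sums field "b" independently inside each partition; the input tuples are replaced by the sums
Stream sums = stream.partitionAggregate(new Fields("b"), new Sum(), new Fields("sum"));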
Persistence aggregate
Persistent aggregate aggregates all tuples across all batches in the stream and stores
the result in either memory or database. An example is shown in Basic Trident
topology where we used in-memory storage to perform the count.
Example:
public class Count implements CombinerAggregator<Long> {
public Long init(TridentTuple tuple) {
return 1L;
}
public Long combine(Long val1, Long val2) {
return val1 + val2;
}
public Long zero() {
return 0L;
}
}
Reducer aggregator
The following are the characteristics of the reducer aggregator: it produces an initial value with init and then iterates over every tuple in the input, folding each one into a single running value, which is emitted as the result.
Example:
public class Count implements ReducerAggregator<Long> {
public Long init() {
return 0L;
}
public Long reduce(Long curr, TridentTuple tuple) {
return curr + 1;
}
}
Aggregator
The following are the characteristics of the aggregator: it is the most general interface; init is called with the batch ID before processing, aggregate is called for every tuple in the batch, it can emit any number of tuples at any point via the collector, and complete is called when all tuples for the batch have been processed.
Example:
public class CountAgg extends BaseAggregator<CountState> {
static class CountState {
long count = 0;
}
public CountState init(Object batchId, TridentCollector collector) {
return new CountState();
}
public void aggregate(CountState state, TridentTuple tuple, TridentCollector collector) {
state.count+=1;
}
public void complete(CountState state, TridentCollector collector) {
collector.emit(new Values(state.count));
}
}
Grouping
The grouping operation is a built-in operation of Storm Trident. It is performed by the groupBy function, which repartitions the tuples using partitionBy and then, within each partition, groups together all the tuples that have the same grouping fields. Code example:
topology.newStream("spout", spout)
.each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word"))
.persistentAggregate(stateFactory, new Count(), new Fields("count"));
As per Storm Trident documentation, the group by function is explained using the
following diagram:
Merge and joins
The merge operation is used to merge more than one stream together. Code example:
topology.merge(stream1, stream2, stream3);
Another way to combine streams is using join. Let's take an example: stream1 has the fields (key, val1, val2) and stream2 has (x, val1). Now perform a join of stream1 and stream2 as follows:
topology.join(stream1, new Fields("key"), stream2, new Fields("x"), new Fields("key", "a", "b", "c"));
stream1 and stream2 are joined on the basis of key and x respectively. The output fields are defined as key from stream1, stream1's val1 and val2 renamed as a and b, and stream2's val1 renamed as c.
Input:
Stream 1:
[1, 2, 3]
Stream 2:
[1, 4]
Output:
[1, 2, 3, 4]
DRPC
DRPC stands for Distributed Remote Procedure Call. It computes very intense functions on the fly using Storm: you give it a function name and the corresponding arguments as input, and the output is the result of each of those function calls.
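As a hedged sketch (the names are illustrative and wordCounts is assumed to be the TridentState returned by persistentAggregate in the word-count example), a DRPC stream that answers such calls might look like this; a client would then invoke drpc.execute("words", "cat dog the dog"):
import org.apache.storm.LocalDRPC;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.MapGet;
import org.apache.storm.tuple.Fields;
LocalDRPC drpc = new LocalDRPC();
TridentTopology topology = new TridentTopology();
// Split the DRPC arguments into words and look each one up in the persisted word-count state
topology.newDRPCStream("words", drpc)
.each(new Fields("args"), new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"));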
Write a data generator that will publish an event with fields such as phone
number, bytes in and bytes out
The data generator will publish events in Kafka
Write a topology program:
To get the events from Kafka
Apply a filter to exclude phone numbers that should not take part in the top 10
Split the event on the basis of commas
Perform a group by operation to bring the same phone numbers together
Perform an aggregation and sum the bytes in and bytes out together
Now, apply the assembly with the FirstN function, which requires the field name and the number of elements to be calculated
And finally display it on the console
You will find the code in the code bundle for reference.
Program:
package com.book.chapter8.diy;
@Override
public void execute(TridentTuple tuple, TridentCollector collector) {
String event = tuple.getString(0);
System.out.println(event);
String[] splittedEvent = event.split(",");
if(splittedEvent.length>1){
long phoneNumber = Long.parseLong(splittedEvent[0]);
int bin = Integer.parseInt(splittedEvent[1]);
int bout = Integer.parseInt(splittedEvent[2]);
int totalBytesTransferred = bin + bout;
System.out.println(phoneNumber+":"+bin+":"+bout);
collector.emit(new Values(phoneNumber, totalBytesTransferred));
}
}
}
package com.book.chapter8.diy;
Import files:
import java.util.Date;
import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
System.out.println(stream.toString());
ProducerRecord<Integer, String> data = new ProducerRecord<Integer, String>(
"storm-trident-diy", stream.toString());
producer.send(data);
counter++;
}
producer.close();
}
}
Summary
In this chapter, we explained what Trident state is and how it is maintained. After that, we built a basic Trident topology to acquaint the reader with writing one. Trident offers multiple operations, so we explained all the available operations with examples, and we also explained how Trident works internally. We covered the DRPC way of calling into Trident and its processing. In the end, we gave the reader a problem to solve using operations such as filter, group by, and top N.
Working with Spark
This is the chapter where we introduce our readers to the Spark engine. We will introduce the fundamentals of the Spark architecture and explain the need for and utility of using Spark as an option for practical use cases. We will cover:
Spark overview
Distinct advantages of Spark
Spark – use cases
Spark architecture - working inside the engine
Spark pragmatic concepts
Spark 2.x - advent of data frames and datasets
Spark overview
Apache Spark is a highly distributed compute engine that comes with promises of speed and reliability for computations. As a framework it builds on Hadoop, but it is further enhanced to perform in-memory computations to cater to interactive queries and near real-time stream processing. Parallel processing across the cluster and in-memory processing give Spark an edge in terms of performance and reliability.
Today Apache Spark is known for its proven salient features:
Speed and efficiency: While it can run over traditional disk-based HDFS, it can be up to 100x faster because of in-memory computations and savings on disk I/O. It keeps intermediate results in memory, thus saving on the overall execution time.
Extensibility and compatibility: It has a variety of interaction APIs for
developers to choose from. It comes out of the box with Java, Scala, and Python
APIs.
Analytics and ML: It provides robust support for a wide range of machine learning and graph algorithms. In fact, it is now becoming a top choice among developers for big data implementations of complex data science and artificial intelligence models.
Spark framework and schedulers
The following diagram captures the various components of the Spark framework and
the variety of scheduling modes in which it could be deployed:
The preceding diagram shows all the basic components of the Spark ecosystem, though over a period of time some have evolved or been deprecated, which we will get the users acquainted with in due course. These are the basic components of the Spark framework:
Spark core: As the name suggests, this is the core control unit of the Spark framework. It predominantly handles the scheduling and management of tasks, using a Spark abstraction called the resilient distributed dataset (RDD). The RDD is the basic data abstraction unit in Spark and will be discussed in detail in the following sections.
Spark SQL: This module of Spark is typically designed for data engineers who are familiar with using SQL on structured datasets in Oracle, SQL Server, MySQL, and so on. It supports Hive, and one can easily query data using HiveQL (Hive Query Language), which is very similar to SQL. Spark SQL also interfaces with popular connectors such as JDBC and ODBC, which let developers integrate it with popular databases, data marts, and BI and visualization tools.
Spark Streaming: This is the Spark module that supports processing of real-time/near real-time streaming data. It integrates seamlessly with ingestion pipelines such as Kafka, RabbitMQ, Flume, and so on. The module is built to be scalable, fault tolerant, and high speed.
Spark MLlib: This Spark module is predominantly an implementation of commonly used data science, statistical, and machine learning algorithms. It's a highly scalable, distributed, and fault tolerant module that provides out-of-the-box implementations of classification, component analysis, clustering, regression, and so on.
GraphX: GraphX started as an independent project at the Berkeley research center, but was later donated to Spark and thus became part of the Apache Spark framework. It is the module that supports graph computation and analysis over large volumes of data. It supports the Pregel API and a wide variety of graph algorithms.
Spark R: This is one of the later additions to Spark, and it was predominantly designed as a tool for data scientists. Data analysts and scientists have widely used R Studio as a model-designing tool, but it is a limited-capability, single-node tool that can only cater to a subset of a data sample. Such models later require a lot of rework and redesign in terms of logic and optimization to execute over a wider set of data. Spark R is an attempt to bridge this gap; it's a lightweight framework that leverages Spark's capabilities and lets data scientists execute R models in distributed mode over a wider set of data on the Spark engine.
The following diagram quickly captures a synopsis of the primary functions of the various Spark components:
Now that we understand the basic modules and components of the Spark framework, let's have a closer look at the orchestration mechanism and options for Spark. There are three ways in which Spark can be deployed and orchestrated; this is captured in the following diagram and described in the following section:
It is described as follows:
High performance: This is the key feature responsible for the success of Spark: high performance in data processing over HDFS. As we have seen in the previous section, Spark layers its framework over HDFS and the Yarn ecosystem, but offers up to 10x faster performance, which makes it a better choice than MapReduce. Spark achieves this performance enhancement by limiting the use of latency-intensive disk I/O and leveraging its in-memory compute capability instead.
Robust and dynamic: Apache Spark is robust in its out-of-the-box implementation and comes with over 80 operations. It's built in Scala and has interfacing APIs in Java, Python, and so on. The entire combination of base and peripheral technologies makes it highly extensible for any kind of custom implementation.
In-memory computation: The in-memory compute capability is the crux of the speed and efficiency of the Spark engine. It reduces overall processing time by avoiding writes to disk: it uses the RDD programming abstraction and keeps most of the intermediate computations in memory. It uses a Directed Acyclic Graph (DAG) engine for in-memory computation and execution-flow orchestration.
Reusable: The RDD abstraction helps programmers develop Spark code in a manner that is reusable for batch, hybrid, and real-time stream processing with a few tweaks.
Fault tolerant: The Spark RDD's are the basic programming abstraction of
Spark framework and they are resilient not only in name, but in nature as well.
They have been designed to handle the failures of nodes within the cluster
during the computation without any data loss. While RDD's are designed to
handle the failures, another notable aspect is that Spark leverages HDFS/Yarn
for its basic framework, thus the Hadoop resilience in terms of stable storage is
inherent to it.
Near real-time stream processing: The Spark Streaming module is designed to handle super-fast computation and analysis on top of streaming data, to deliver actionable insights in real time. This is definitely an edge over Hadoop's MapReduce, where we had to wait for long batch cycles to get the results.
Lazy evaluation: The execution model in Spark is lazy by nature; transformations applied to an RDD don't yield immediate results, but instead form another RDD. The actual execution happens only when an action is issued. This model helps a lot in making the total execution efficient in terms of time (see the short sketch after this list).
Active and expanding community: The Spark project was initiated in 2009 as part of Berkeley's Data Analytics Stack (BDAS). Developers from more than 50 companies joined hands in its making. The community is ever-expanding and plays an essential part in Spark's adoption by the industry.
Complex analytics: It's a system made and designed to handle complex
analytics jobs on top of both historic batch and streaming real-time data.
Integration with Hadoop: This is another cost-efficient and distinct advantage
of Spark. Its integration with Hadoop makes its adoption so easy by the
industry, as it can leverage and sit upon the existing Hadoop cluster and provide
for lightning fast computations in a fail-safe, scale mode.
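To make the lazy evaluation point concrete, here is a minimal Java sketch (the file path is hypothetical): the two transformations only build the RDD lineage, and nothing executes until the count action is called.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
public class LazyEvalDemo {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("lazy-eval-demo").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("data/input.txt"); // transformation - nothing runs yet
JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR")); // still lazy - only extends the lineage
long count = errors.count(); // action - triggers the actual distributed execution
System.out.println("ERROR lines: " + count);
sc.close();
}
}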
Now that we clearly understand all the salient features of Spark, including the ones that are not so sparkling, let's move to the next section, which talks about some of the use cases where Spark is the best choice of analytical framework.
Spark – use cases
This section is dedicated to walking the users through distinct real-life use cases where Spark is the best and obvious choice for analytical processing in the solution:
Financial domain:
Fraud detection: A very important use case for all of us as credit card users: here, real-time streaming data is mapped to your persona and historical usage records through a series of complex data science prediction algorithms, to distinguish a genuinely fraudulent card transaction from a legitimate-looking one. Accordingly, further actions such as allowing the payment, requesting mobile verification, or blocking the transaction are taken.
Customer 360 churn and recommendation (cross-sell/up-sell): All financial institutes have hordes of data, but they struggle with its maintenance. Today the need of the hour is unified customer personification and the correlation of all of a customer's actions in real time to further enrich the data. This unified personification is being done very effectively in large institutions using Spark, and it further helps in behavioral data science modeling to predict churn and provide recommendations, using cross-sell and up-sell based on customer profiling and personification.
Real-time monitoring (better client service): Real-time monitoring of customer activities across all channels helps in personification and recommendation, but it also helps in monitoring client activities and identifying issues and breaches, if any. For instance, the same client can't physically use ATMs 100 miles apart within a span of 30 minutes.
E-commerce:
Partnership and prediction: A lot of companies use Spark-based analytics for market prediction and trend analysis, and build their partner base accordingly
Alibaba: It runs some of the largest analytical Spark jobs, over hundreds of petabytes of data, to extract text from images, perform merchant data ETL, and run machine learning models on top of the same to analyze, predict, plot, and recommend
Graphing, analytics, ETL, integration: eBay uses Spark for all of these
Health care:
Wearable devices: The industry is better equipped to provide clinical
diagnosis based on Spark recommendations using real-time streaming data
and past medical history of the patients. It also takes other dimensions such
as nationality, regional eating habits, any epidemic outbreak in the region,
weather, temperature, and so on into consideration.
Travel domain:
Personalized recommendations (TripAdvisor)
NLP and Spark for recommendations (OpenTable)
Spark architecture - working inside the
engine
We have looked at the components of the Spark framework, its advantages/disadvantages, and the scenarios where it best fits into solution design. In the following section, we will delve deeper into the internals of Spark, its architectural abstractions, and its workings. Spark works in a master-slave model, and the following diagram shows its layered architecture:
The physical machines, or nodes, are abstracted by a data storage layer (which could be HDFS, another distributed filesystem, or AWS S3). This data storage layer provides the APIs for the storage and retrieval of the final/intermediate datasets generated during execution.
The resource manager layer on top of the data storage obfuscates the underlying storage and resource orchestration from the Spark setup and execution model, thus providing users with a Spark setup that can leverage any of the available resource managers, such as Yarn, Mesos, or Spark standalone/local.
On top of this layer we have the Spark core and Spark extensions, each of
which we already discussed and touched upon in the previous section.
Now that we have understood the layered abstraction of the Spark framework over the raw physical hardware, the next step is to look at Spark from a different perspective, that is, its execution model. The following diagram captures the execution components on the various nodes of a Spark cluster:
As evident from the preceding diagram, the key physical components of a Spark cluster are the driver node, the cluster manager, and the worker nodes.
All the preceding three are vital ingredients essential to the execution of any Spark
application job. All components work as their name suggests: the driver node hosts
the spark context where the main driver program runs in the spark cluster, while the
cluster manager is basically the resource managing component of the spark cluster
and it could be either the Spark standalone resource manager, a Yarn-based resource manager, or a Mesos-based resource manager. It predominantly handles the
orchestration and management of underlying resources of the cluster for application
execution in a manner that's agnostic to the overall implementation. Spark worker
nodes are actually the nodes where the executors and tasks are spawned and the
spark job actually executes.
This is the main process that executes under the master/driver node
It is the entry point for the Spark Shell
This is where the Spark context is created
The RDD is translated to execution DAG on the driver Spark context
All tasks are scheduled and controlled by the driver during their execution
All the metadata for RDD and their lineage is managed by the driver
It brings up the Spark web UI
This is the process within the slave/worker nodes where the spark job tasks are
created and executed
It reads/writes data from external sources
All data processing and logic execution is performed here
It's a single instance of Spark context that has data and computation logic. It can
schedule a series of parallel/sequential jobs for execution.
We have talked enough about the latency issue the big data world was struggling
with before Spark came and took the performance to the next level. Let's have a
closer look to understand this latency problem a little better. The following diagram
captures the execution of typical Hadoop processes and its intermediate steps:
Job #1: This reads the data for processing from HDFS and writes its results to
HDFS
Job #2: This reads the interim processing results of job 1 from HDFS,
processes, and writes the outcome to HDFS
While HDFS is a fault tolerant and persistent store, any disk-based read/write operation is very expensive in terms of overall latency. Remember also that serialization and deserialization, and the distributed nature of HDFS, add a network latency aspect on top of the disk read/write latency.
So while the solution is robust, all these latent delays add to the total turnaround time of the job and delay the final processing outcome.
So, what's the magic that Spark does? Well, it promises to provide faster results even while it utilizes HDFS. The magic is in-memory computation, and the abstraction is the Resilient Distributed Dataset (RDD).
The RDD is the fundamental distinguishing feature and the core of all Spark computations. As a concept, it was independent research work at Berkeley, which was first implemented and adopted in Spark. If I had to define an RDD in a nutshell, I would say it is an abstraction that exhibits all the features of Hadoop, but lives in memory rather than on disk.
You may want to refer to the original research paper by Matei Zaharia about the
RDD concept at https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf.
RDD – the name says it all
Resilient means that RDDs are fault tolerant. Now the question is: if they are in-memory, how can they recover from failures unless they are persisted, which brings back the disk latency? The answer is RDD lineage, which helps in the re-computation of missing or damaged partition data in the event of any failure.
Distributed means that an RDD's data resides in memory across the different worker nodes in the cluster.
Data frames: These are distributed, resilient, fault tolerant, in-memory data structures that can handle only structured data, which means they are designed to manage data that can be segregated into columns of fixed types. Though this may sound like a limitation with respect to the RDD, which can handle any type of unstructured data, in practical terms this structured abstraction over the data makes it very easy to manipulate and work with large volumes of structured data, the way we used to with an RDBMS.
Datasets: A dataset is an extension of the Spark data frame; it is a type-safe, object-oriented interface. For the sake of simplicity, one could say that data frames are actually untyped datasets. This newest API of the Spark programming abstraction leverages Tungsten's in-memory encoding and the Catalyst optimizer (see the sketch after this list).
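As a quick, hedged illustration of the data frame/dataset abstraction in Java (the input file and column names are hypothetical):
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class DataFrameDemo {
public static void main(String[] args) {
SparkSession spark = SparkSession.builder().appName("dataframe-demo").master("local[*]").getOrCreate();
Dataset<Row> people = spark.read().json("data/people.json"); // structured data with typed columns
people.filter(col("age").gt(21)) // column-wise manipulation, much like an RDBMS query
.groupBy("city")
.count()
.show();
spark.stop();
}
}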
Summary
In this chapter, we introduced the readers to the Apache Spark compute engine, and we talked about the various components of the Spark framework and its schedulers. We touched upon the advantages that give Spark an edge as a scalable, high-performing compute engine, and also acquainted the users with the not-so-sparkling aspects of Spark, that is, when it's not the best fit. We walked the users through some practical, real-time, industry-wide use cases. Next, we got into the layered architecture of Spark and its internal workings across the cluster. In the end, we touched upon the programming concepts of Spark: RDDs, data frames, and datasets. The next chapter will cover the Spark APIs and the execution of all the components through working code blocks.
Working with Spark Operations
This is the chapter where we introduce our readers to Spark actions, transformations,
and shared variables. We will introduce the fundamentals of Spark Operations and
explain the need for and utility of using Spark as an option for practical use cases.
We'll look at:
Spark core
Spark extensions
As depicted in the preceding diagram, at a high level, the Spark codebase is divided into two packages:
Spark core: All the core APIs (such as SparkContext, RDD, broadcast, scheduler, and storage) are packaged under org.apache.spark.*; these are described in detail below.
Spark extensions: All the APIs for a particular extension are packaged in their own package structure. For example, all the APIs for Spark Streaming are packaged in the org.apache.spark.streaming.* package, and the same packaging structure goes for the other extensions: Spark MLlib (org.apache.spark.mllib.*), Spark SQL (org.apache.spark.sql.*), and Spark GraphX (org.apache.spark.graphx.*).
As a developer you will get to work a lot with Spark core, so let's take a look at the details of the Scala APIs exposed under the Spark core package of the Spark framework.
org.apache.spark: This is the basic package of the Spark core API, and it
generally encapsulates the classes, providing functionality for:
Creation of Spark jobs on clusters
Distribution and orchestration of Spark jobs on clusters
Submitting the Spark jobs on clusters
org.apache.spark.SparkContext: This is the first item you will see in any Spark
job/program. It is the entry point of a Spark application and the context that
provides a pathway for developers to access and use all the other features
provided by the Spark framework to develop and encode their business-logic-
driven application. It provides the execution handle for the Spark job and
even the references to the Spark extensions. A noteworthy aspect is that it's
immutable and only one SparkContext can be instantiated per JVM (its creation is shown in the short sketch after this package list).
org.apache.spark.rdd.RDD.scala: This package holds the APIs pertaining to the
operations that can be executed in a parallel and distributed manner over RDDs,
Spark's programmatic data unit, on the distributed Spark compute engine.
SparkContext provides access to various methods that can be used to load data
from HDFS, the local filesystem, or a Scala collection to create an RDD. This package
holds various operations on RDDs, such as map, join, filter, and even
persist. It also has some specialized classes that come in very handy in certain
specific scenarios, such as:
PairRDDFunctions: Useful when working with key-value data
SequenceFileRDDFunctions: A great aide for handling Hadoop sequence
files
DoubleRDDFunctions: Functions for working with RDDs of double data
A short sketch showing the first and last of these in action follows this package list.
org.apache.spark.broadcast: Once you start programming in Spark, this will be
one of the most frequently used packages of the framework, second only to
RDD and SparkContext. It encapsulates the APIs for sharing variables across
the Spark tasks in a cluster. Since Spark is by nature used to process humongous
amounts of data, the variables being broadcast can themselves be sizeable,
so the mechanism of exchange and broadcast needs to be smart
and efficient, so that the information is passed on without jeopardizing the
performance or the overall job execution. There are two broadcast implementations
in Spark:
HttpBroadcast: As the name suggests, this implementation relies on an
HTTP server mechanism to fetch and retrieve the data, where the server itself
runs at the Spark driver; the fetched data is then stored in the executor's
BlockManager.
TorrentBroadcast: This is the default broadcast implementation of Spark.
Here each executor fetches the data in chunks from the driver or from other
executors and maintains it in its own BlockManager. In principle, it
uses the same mechanism as BitTorrent, to ensure the driver isn't
bottlenecking the entire broadcast pipeline.
org.apache.spark.io: This provides implementations of various compression
codecs used at the block storage level. The whole package is marked as
DeveloperApi, so it can be extended and custom implementations can be provided
by developers. By default, it provides three implementations: LZ4, LZF, and
Snappy.
org.apache.spark.scheduler: This provides the various scheduler libraries that help in
job scheduling, tracking, and monitoring. It defines the directed acyclic graph
(DAG) scheduler (http://en.wikipedia.org/wiki/Directed_acyclic_graph). The Spark DAG
scheduler implements stage-oriented scheduling: it keeps track of the
completion of each RDD and the output of each stage, computes the
DAG of stages, and submits them to the underlying
org.apache.spark.scheduler.TaskScheduler, which executes them on the cluster.
org.apache.spark.storage: Provides APIs for structuring, managing, and finally
persisting the data stored in RDD within blocks. It also keeps track of data and
ensures it is either stored in memory or, if the memory is full, it is flushed to an
underlying persistent storage area.
org.apache.spark.util: Utility classes for performing common functions across
the Spark APIs. For example, it defines MutablePair, which can be used as an
alternative to Scala's Tuple2 with the difference that MutablePair is updatable
while Scala's Tuple2 is not. It helps in optimizing memory and minimizing
object allocations.
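To tie the preceding packages together, here is a minimal, illustrative sketch (a local-mode example with made-up in-memory data, not taken from the book's code bundle) showing SparkContext creation and the implicit PairRDDFunctions/DoubleRDDFunctions helpers in use:

import org.apache.spark.{SparkConf, SparkContext}

object CorePackagesSketch {
  def main(args: Array[String]): Unit = {
    // Only one SparkContext can be instantiated per JVM
    val sc = new SparkContext(new SparkConf().setAppName("corePackagesSketch").setMaster("local"))

    // PairRDDFunctions: key-value operations become available on an RDD of pairs
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    pairs.reduceByKey(_ + _).foreach(println)

    // DoubleRDDFunctions: numeric helpers become available on an RDD of Double
    val doubles = sc.parallelize(Seq(1.0, 2.0, 3.0))
    println("mean = " + doubles.mean() + ", stdev = " + doubles.stdev())

    sc.stop()
  }
}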
RDD pragmatic exploration
We have read and understood well that RDDs are an immutable, distributed
collection of object values used as the unit of abstraction in the Spark framework. There
are two ways RDDs can be created: by parallelizing an existing collection in the driver
program, or by loading data from an external source such as the local filesystem or HDFS.
Now let's create some simple programs to create and use RDDs:
The preceding screenshot captures quick steps to create an RDD on Spark Shell.
Here are the specific commands and further transformational outputs for this:
scala> val inputfile = sc.textFile("input.txt")
The preceding command reads the file called input.txt and creates a new RDD under the
name inputfile. Since we have not specified the entire path in this snippet, the framework
assumes that the file exists under the current working directory.
Once the RDD is created and the data from the said input file is loaded into it, let's
put it to use to count the number of words in the file. To achieve this, we can execute
the following steps:
1. Let's split the file into words, in the form of a flatMap. We split using the space
" " character.
2. Create a key-value pair for each word. The value will be 1 for all
words.
3. Run a reduce cycle and add up the values of identical keys.
Well, the beauty of Scala is that it compresses all the preceding steps into a single
line of execution, as follows:
scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
Deed done? Well, yes and no. We said in the previous section that RDDs are lazy in
execution, and now you'll actually experience this. We have defined the word
count transformations, but the output is yet to be generated, and you will
see none until there is a Spark action.
So now let's apply an action and persist the output to disk. Here, the following
command persists the new RDD counts to disk under folder output:
scala> counts.saveAsTextFile("output")
You need to look into the output folder under the current path to see the
persisted/saved data of the counts RDD, as shown in the following screenshot:
Before getting into Scala programming, let's have a quick look at the Spark
transformation and action reference at https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations.
Transformations
Spark transformations are basically the operations that take an RDD as input and
produce one or more RDDs as output. All transformations are lazy in nature: the
logical execution plan is built up in the form of a directed acyclic graph (DAG), and the
actual execution happens only when an action is called.
The following section captures the code snippets highlighting the usage of various
transformation operations:
Map: The following code reads a file called my_spark_test.txt and maps each line
of the file to its length and prints it:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object mapOperationTest {
  def main(args: Array[String]) = {
    val sparkSession = SparkSession.builder.appName("myMapExample").master("local").getOrCreate()
    val fileTobeMapped = sparkSession.read.textFile("my_spark_test.txt").rdd
    val fileMap = fileTobeMapped.map(line => (line, line.length))
    fileMap.foreach(println)
  }
}
flatMap() operation: The map and flatMap operations are similar in that they
both apply a function to every element of the input RDD, but they differ in that
map returns exactly one element per input element, while flatMap can return
zero, one, or many elements, as the sketch below shows.
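For example, the following minimal sketch (assuming a spark-shell session, like the neighbouring snippets, and the same my_spark_test.txt file) contrasts the two:

val fileTobeMapped = spark.read.textFile("my_spark_test.txt").rdd
val lineLengths = fileTobeMapped.map(line => line.length)          // exactly one value per line
val words = fileTobeMapped.flatMap(line => line.split(" "))        // zero or more values per line
words.foreach(println)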
Filter operation: It returns only the elements from the parent RDD that satisfy
the criteria specified in the filter. The following snippet splits the file into words
and counts how many of them contain the word spark:
val fileTobeMapped = spark.read.textFile("my_spark_test.txt").rdd
val flatmapFile = fileTobeMapped.flatMap(lines => lines.split(" ")).filter(value => value.contains("spark"))
println(flatmapFile.count())
union: This transformation accepts two or more RDDs of the same type and
outputs an RDD containing the elements of all of them.
In the following snippet, the three RDDs are merged and the resulting
output RDD is printed:
val rddA = spark.sparkContext.parallelize(Seq((2,"JAN",2017),(7,"NOV",2015),(16,"FEB",2014)))
val rddB = spark.sparkContext.parallelize(Seq((6,"DEC",2015),(18,"SEP",2016)))
val rddC = spark.sparkContext.parallelize(Seq((7,"DEC",2012),(17,"MAY",2016)))
val rddD = rddA.union(rddB).union(rddC)
rddD.foreach(println)
intersection: An operation that runs between two or more RDDs of the same
type and returns the elements that are common to all the participating
RDDs:
val rddA = spark.sparkContext.parallelize(Seq((2,"JAN",2017),(4,"NOV",2015),(17,"FEB",2015)))
val rddB = spark.sparkContext.parallelize(Seq((5,"DEC",2015),(2,"JAN",2017)))
val common = rddA.intersection(rddB)
common.foreach(println)
groupByKey(): This transformation, when applied over a key-value dataset, leads
to the shuffling of data according to the key. This is a very network-intensive
operation, and caching and disk persistence should be taken care of to make it
effective (read about persistence at https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). A small example follows.
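The following is a minimal sketch with made-up key-value data, again assuming a spark-shell session:

val sales = spark.sparkContext.parallelize(Seq(("fruit", 10), ("veg", 5), ("fruit", 7)))
val grouped = sales.groupByKey()                    // all values for a key are shuffled to one partition
grouped.foreach { case (k, values) => println(k + " -> " + values.toList) }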
reduceByKey(): This operation works on a key-value based dataset, where the
values for the same key are combined locally on each machine, not across
the network, before the data is shuffled:
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
val wordCounts = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_ + _)
wordCounts.foreach(println)
coalesce(): This is a very useful operation; it is generally used to reduce the
shuffling of data across the nodes in a Spark cluster by controlling the number of
partitions defined for the dataset. The data is distributed across the number of
partitions specified in the coalesce operation (and hence across at most that many nodes).
In the following code snippet, the data from rddA would be placed in
only two partitions, even if the Spark cluster has six
nodes in all:
val rddA = spark.sparkContext.parallelize(Array("jan","feb","mar","april","may","jun"))
val myResult = rddA.coalesce(2)
myResult.foreach(println)
Actions
In a nutshell, it can be said that actions actually execute the transformations on the real
data to generate the output. Actions send the data from the executors back to the driver.
The following is a capture of the actions that return values (from: https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions):
Let us look at some code snippets to see these actions in action:
count(): As the name suggests, this operation counts and gives you the number
of elements in an RDD. We used the count action in the preceding filter
example, where we counted the number of matching words in the file being
read.
collect(): The collect operation does precisely what its name denotes: it returns
all the data from the RDD to the driver program. Due to the nature of this
operation, one is advised to use it carefully; because it copies the data to the
driver, all of the data must fit into the node where the driver is executing.
In the following snippet, the two RDDs are joined based on the alphabet
keys and then the resulting RDD is returned to the driver program, where
the joined values are held as an array against the keys:
val rddA = spark.sparkContext.parallelize(Array(('A',1),('b',2),('c',3)))
val rddB = spark.sparkContext.parallelize(Array(('A',4),('A',6),('b',7),('c',3),('c',8)))
val resultRDD = rddA.join(rddB)
println(resultRDD.collect().mkString(","))
top(): This operation returns the top n elements from the RDD, according to
their ordering. It's good for sampling—the following
snippet returns the top three records from the file based on its default ordering:
val fileTobeMapped = spark.read.textFile("my_spark_test.txt").rdd
val mapFile = fileTobeMapped.map(line => (line,line.length))
val result = mapFile.top(3)
result.foreach(println)
countByValue():
This operation returns the count of the number of times an
element occurs in the input RDD. Its output is in the form of a key-value pair
where the key is the element and the value represents its count:
val fileTobeMapped = spark.read.textFile("my_spark_test.txt").rdd
val result= fileTobeMapped.map(line => (line,line.length)).countByValue()
result.foreach(println)
reduce(): This operation combines two elements of the input RDD at a time; the type of
the output remains the same as that of the input RDD, and the result
could be a count, a sum, and so on, depending upon the function that is passed. The
function used must be associative and commutative.
In the following snippet, rddA is grouped by its key (the alphabet
character), the result is brought back to the driver using the collect operation,
and println is applied over each element of myGroupRdd to print the
elements on the console:
val rddA = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8)))
val myGroupRdd = rddA.groupByKey().collect()
myGroupRdd.foreach(println)
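Since the preceding snippet actually demonstrates groupByKey followed by collect, here is a minimal sketch of reduce() itself, with made-up numeric data and again assuming a spark-shell session:

val numbers = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
val total = numbers.reduce((a, b) => a + b)   // the reduce function must be associative and commutative
println(total)                                // prints 15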
Shared variables – broadcast variables
and accumulators
While working with distributed compute programs and modules, where the code
executes on different nodes and/or different workers, a need often arises to
share data across the execution units of the distributed setup. Thus Spark
has the concept of shared variables. Shared variables are used to share
information between the parallel tasks executing across various workers, or between the tasks
and the driver. Spark supports two types of shared variable:
Broadcast variables
Accumulators
In the following sections, we will look at these two types of Spark variables, both
conceptually and pragmatically.
Broadcast variables
These are the variables that the programmer intends to share with all execution units
throughout the cluster. Though they sound very simple to work with, there are a few
aspects the programmer needs to be cognizant of: broadcast variables need to fit in the
memory of each node in the cluster—they act like a local, read-only dictionary/index
on each node, so they can't be huge in size—and since all nodes share the same
values, they are read-only by design. Say, for instance, we have a dictionary for
spell checking; we would want each node to have the same copy.
So to summarize, here are the major caveats/features of the design and usage of
broadcast variables:
They are a great fit for static or lookup tables and metadata, where each node keeps its own
shared copy and the data doesn't need to be shipped to each node with every task—thus
saving a lot of network I/O.
The following screenshot captures the on-prompt declaration and usage of a broadcast
variable containing the values 1, 2, and 3:
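The equivalent spark-shell commands look roughly as follows (a broadcast variable wrapping the values 1, 2, and 3, read back through value):

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
scala> broadcastVar.value    // Array(1, 2, 3), available locally on every executor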
The next section captures a small code example to demonstrate the usage of the same
in Scala code. The following is a CSV capturing the names of some states from
India:
Uttar Pradesh, Delhi, Haryana, Punjab, Rajasthan, Himachal Pradesh, Maharashtra, Tamilnadu, T
Next, let's have this loaded into the map from disk and then convert it to a broadcast
variable:
def loadCSVFile(filename: String): Option[Map[String, String]] = {
  val states = Map[String, String]()
  Try {
    val bufferedSource = Source.fromFile(filename)
In the next step, we will convert this map into a broadcast variable.
In the preceding snippet, we load our state-name file and convert it into a map that is
broadcast as statesCache. Then we create statesRDD from the keys of the states map,
and we have a method called searchStateDetails that searches for the states starting with a
particular letter specified by the user and returns their details, such as the capital and so
on.
In this mechanism, we don't need to send over the state CSV to each node and
executor every time the search operation is performed.
In the following snippet, one can see the entire source code for the example quoted
previously:
import org.apache.spark.{ SparkContext, SparkConf }
import org.apache.spark.rdd.RDD
import org.apache.spark.broadcast.Broadcast
import scala.io.Source
import scala.util.{ Try, Success, Failure }
import scala.collection.mutable.Map
object TestBroadcastVariables {
  def main(args: Array[String]): Unit = {
    loadCSVFile("/myData/states.csv") match {
      case Some(states) => {
        val sc = new SparkContext(new SparkConf()
          .setAppName("MyBroadcastVariablesJob"))
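        // (elided in this excerpt, as described above) broadcast the states map and build an RDD
        // of state names from its keys; for example:
        //   val statesCache = sc.broadcast(states)
        //   val statesRDD = sc.parallelize(states.keys.toList)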
        // happy case...
        val happyCaseRDD = searchStateDetails(statesRDD, statesCache, "P")
        println(">>>> Search results of states starting with 'P': " + happyCaseRDD.count())
        happyCaseRDD.foreach(entry => println("State:" + entry._1 + ", Capital:" + entry._2))

        // non-happy case...
        val nonHappyCaseRDD = searchStateDetails(statesRDD, statesCache, "Yz")
        println(">>>> Search results of states starting with 'Yz': " + nonHappyCaseRDD.count())
        nonHappyCaseRDD.foreach(entry => println("State:" + entry._1 + ", Capital:" + entry._2))
      }
      case None => println("Error loading file...")
    }
  }

  // The definitions of loadCSVFile (which wraps the file read in Try { ... }.toOption, as in the
  // earlier fragment) and searchStateDetails are not reproduced in this excerpt.
}
Accumulators
This is the second method of sharing values/data across different nodes and/or
the driver in a Spark job. As is evident from the name, accumulators are
used for counting or accumulating values. They are Spark's answer to MapReduce
counters, and they differ from broadcast variables in that they are mutable:
the tasks running on the cluster can add to an accumulator's value, but only the driver
program can read it. They work as a great
aide for data aggregation and counting across the distributed workers of Spark.
Let's assume there is a purchase log for a Walmart store and we need to write a
Spark job to count each type of bad record in the log. The
following snippet will help in attaining this:
def main(args: Array[String]): Unit = {
  ctx.textFile("file:/mydata/sales.log", 4)
    .foreach { line =>
      if (line.length() == 0) blankLines += 1
      else if (line.contains("Bad Transaction")) badtnxts += 1
      else {
        val fields = line.split("\t")
        if (fields.length != 4) missingFieldstnxt += 1
        else if (fields(3).toFloat == 0) zeroValuetnxt += 1
      }
    }
  println("Sales Log Analysis Counters:")
  println(s"\tBad Transactions=${badtnxts.value}")
  println(s"\tZero Value Sales=${zeroValuetnxt.value}")
  println(s"\tMissing Fields Transactions=${missingFieldstnxt.value}")
  println(s"\tBlank Lines Transactions=${blankLines.value}")
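The preceding listing references ctx and four accumulators without showing how they are created. A minimal, illustrative sketch of that setup (the accumulator names are taken from the snippet above; the SparkConf settings are assumed), which would sit at the top of the program before the textFile call, could look like this:

import org.apache.spark.{SparkConf, SparkContext}

val ctx = new SparkContext(new SparkConf().setAppName("SalesLogAnalysis").setMaster("local"))

// Accumulators used in the preceding snippet; tasks add to them, only the driver reads .value
val blankLines        = ctx.accumulator(0, "Blank Lines")
val badtnxts          = ctx.accumulator(0, "Bad Transactions")
val missingFieldstnxt = ctx.accumulator(0, "Missing Fields Transactions")
val zeroValuetnxt     = ctx.accumulator(0, "Zero Value Sales")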
Summary
In this chapter, we introduced the readers to the Apache Spark APIs and their organization.
We discussed the concepts of transformations and actions, both in theory and with examples.
We took the users through the arena of shared variables: broadcast variables and
accumulators. The next chapter is dedicated to Spark Streaming.
Spark Streaming
This is the chapter where we introduce our readers to Spark Streaming, the
architecture and the concept of microbatching. We will look at the various
components of a streaming application and the internals of a streaming application
integrated with myriad input sources. We will also do some practical hands-on
exercises to illustrate the execution of streaming applications in action. We will
explore and learn about Spark Streaming under the following topics:
Spark provides this streaming API as an extension to its core API; it is a
scalable, low-latency, high-throughput, and fault-tolerant framework for processing live
incoming streaming data using the microbatching principle.
Some use cases where solutions based on the Spark framework for real-time
processing will come in handy:
The following screenshot captures some of the statistics in terms of the rate at which
live data is generated at every moment around the world. All the scenarios depicted
in the following screenshot are apt to be captured as a Spark Stream processing use
case:
Spark Streaming - introduction and
architecture
Spark Streaming is a very useful extension to the Spark core API that is widely
used to process incoming streaming data in real time or close to real time, that is, in near
real-time (NRT). This API extension retains all the core Spark characteristics:
highly distributed, scalable, and fault-tolerant, with high-throughput, low-latency
processing.
The following diagram captures how Spark Streaming works in close conjunction
with the Spark execution engine to process real-time data streams:
Under the Spark framework, the size of each microbatch is determined by the
batch duration defined by the user. In order to understand it better, let us take the
example of an application receiving live/streaming data at 20 events per second, with
a batch duration of 2 seconds provided by the user. Spark Streaming will
continuously consume the data as it arrives, but it will cut a microbatch out of the data
received at the end of every 2 seconds (each batch will consist of 40 events) and
submit it to the user-defined jobs for further processing.
One important aspect that developers need to be cognizant of is that the batch
size/duration drives the processing latency of the overall execution cycle, from the time of
occurrence of an event to the arrival of its results: the larger the batch duration, the longer
an event may wait before its results are produced. The size of the batches is usually
defined based on the following two criteria:
Here we will get our readers acquainted with the key aspects where the Spark
Streaming architecture differs from the traditional stream processing architecture and
the benefits of this microbatching based processing abstraction over traditional
event-based processing systems. This walk through of differences will in turn get the
readers acquainted with the internal nitty-gritty of Spark Streaming architecture and
design.
Traditional streaming systems are generally event-based and the data is processed by
various operators in the system as it arrives. Each execution unit in the distributed
setup operates by processing one record at a time, as shown in the following
diagram:
The preceding diagram captures the typical model where one record at a time is
picked up from the source and propagated to the worker nodes, where it's operated
on by processing workers record by record and the sink dumps it to the stable storage
or downstream for further processing.
Before we move on, it's very important to touch base upon the key differences
between Storm as a stream processing engine versus Spark as a streaming platform.
The following table clearly captures and articulates the various features of both of
these top streaming platforms:
The fundamental difference between these two frameworks is that Storm performs
task parallel computation, while the Spark platform performs data parallel
computing, as follows:
Both Storm Trident and Spark offer streaming and microbatch functionality
constrained by time-based windows. Though similar in functional characteristics,
they differ in the implementation semantics and which of the frameworks should be
chosen for an application/problem statement depends upon the requirements. The
following table articulates some ground rules:
Based on previous points, here are some final considerations:
Latency: While Storm can offer sub-second level latency with ease, that's not
an easy point of contention for Spark Streaming, which essentially is not a
streaming platform but a microbatching one.
Total cost of ownership (TCO): This is a very important consideration for any
application. If an application needs a similar solution for batch and real-time,
then Spark has an advantage, as the same application code base can be utilized,
thus saving the development cost. The same is not the case in Storm though as
its dramatically different in implementation from MapReduce/batch
applications.
Message delivery: The exactly once semantic is the default expectation in
Spark, while Storm offers at least once and exactly once. Achieving the latter is
actually a little tricky in Storm.
The following diagram depicts various architecture components, such as Input Data
Streams, Output Data Streams/Stores, and others. These components have a
pivotal role and their own life cycle during the execution of a Spark Streaming
program:
Let's take a look at the following components depicted in the preceding diagram:
Input Data Streams: This defines the input data sources, essentially the
sources that are emitting the live/streaming data at a very high frequency
(seconds, milliseconds). These sources can be a raw socket, filesystem, or even
highly scalable queuing products, such as Kafka. A Spark Streaming job
connects to the input data sources using various available connectors. These
connectors may be available with the Spark distribution itself, or we may have
to download them separately and configure them in our Spark Streaming job. These
input streams are also referred to as Input DStreams. Based on the availability
of the connectors, input data sources are divided into the following categories:
Basic Data Sources: The connectors and all of their dependencies for the
basic data sources are packaged and shipped with the standard distribution
of Spark. We do not have to download any other package to make them
work.
Advanced Data Sources: The connectors and the dependencies required
by these connectors are not available with the standard distribution of
Spark. This is simply to avoid the complexity and the version conflicts. We
need to download and configure the dependencies for these connectors
separately or provide them as dependencies in Maven scripts, as instructed
in the integration guidelines for each of the following data sources:
Kafka: http://tinyurl.com/oew96sg
Flume: http://tinyurl.com/o4ntmdz
Kinesis: http://tinyurl.com/pdhtgu3
Refer to http://tinyurl.com/psmtpco for the list of available advanced data
sources.
Spark Streaming Job: This is the custom job developed by the user for
consumption and processing of the data feeds in NRT. It has the following
components:
Data Receiver: This is a receiver that is dedicated to receiving/consuming
the data produced by the data source. Every data source has its own
receiver and cannot be generalized or be common across the varied kinds
of data sources.
Batches: Batches are the collection of messages that are received over the
period of time by the receiver. Every batch has a specific number of
messages or data collected in a specific time interval (batch window)
provided by the user. These microbatches are just a series of RDDs known
as DStreams - Discretized Streams.
Discretized Streams: Also called DStreams, this is a new stream
processing model in which computations are structured as a series of
stateless, deterministic batch computations at small-time intervals. This
new stream processing model enables powerful recovery mechanisms
(similar to those in batch systems) and out-performs replication and
upstream backup. It extends and leverages the concepts of resilient
distributed datasets and creates a series of RDDs (of the same type) in one
single DStream, which is processed and computed at a user-defined time
interval (batch duration). DStreams can be created from input data streams
that are connected to varied data sources, such as sockets, filesystems, and
many more, or it can be created by applying high-level operations on other
DStreams (similar to the RDDs). There can be multiple DStreams per
Spark Streaming context and each DStream contains a series of RDDs.
Each RDD is the snapshot of the data received at a particular point in time
from the receiver. Refer to http://tinyurl.com/nnw4xvk for more information
on DStreams.
Streaming context: Spark Streaming extends the Spark context and
provides a new context, the Streaming context, for accessing all the functionality
and features of Spark Streaming. It is the main entry point and provides
methods for initializing DStreams from various input data sources. Refer
to http://tinyurl.com/p5z68gn for more information on the Streaming context.
Spark Core Engine: The core engine receives the input in the form of
RDDs and further processes as per the user-defined business logic and
finally sends it to the associated Output Data Streams/Stores.
Output Data Streams/Stores: The final output of each processed batch is
sent to the output streams for further processing. These data output streams
can be of varied types, ranging from a raw filesystem, WebSockets,
NoSQL, and many more.
Packaging structure of Spark
Streaming
In this section, we will discuss the various APIs and operations exposed by Spark
Streaming.
Spark Streaming APIs
All Spark Streaming classes are packaged in the org.apache.spark.streaming.*
package. Spark Streaming defines two core classes that provide access to all
Spark Streaming functionality: StreamingContext.scala and DStream.scala.
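Before we examine their individual roles, the following is a minimal, illustrative sketch (the local master and the socket source on localhost:9999 are assumptions made for the illustration, not the book's example) of how StreamingContext and DStream fit together:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingContextSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))          // 2-second batch duration
    val lines = ssc.socketTextStream("localhost", 9999)       // a DStream[String]
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}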
Let's examine the following functions and roles performed by these classes:
Apart from the preceding defined classes, Spark Streaming also defines various sub-
packages for exposing functionality for various types of input receivers, as follows:
In this section, we have discussed the high level architecture, components, and
packaging structure of Spark Streaming. We also talked about the various
transformation and output operations as provided by the DStreams API. Let us move
forward and code our first Spark Streaming job.
Connecting Kafka to Spark Streaming
The following section walks you through a program that reads the streaming data off
the Kafka topic and counts the words. The aspects that will be captured in the
following code are as follows:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.codehaus.jackson.map.DeserializationConfig.Feature;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.type.TypeReference;
import scala.Tuple2;
private JavaKafkaWordCount() {
}
@SuppressWarnings("serial")
public static void main(String[] args) throws InterruptedException {
// if (args.length < 4) {
// System.err.println("Usage: JavaKafkaWordCount <zkQuorum><group><topics><numThreads>")
// System.exit(1);
// }
Defining the argument array (in place of the command-line arguments checked above):
args = new String[4];
args[0]="localhost:2181";
args[1]= "1";
args[2]= "test";
args[3]= "1";
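        // (elided in this excerpt) the JavaStreamingContext `jssc` is assumed to be created here with a
        // batch Duration, and the `lines` DStream obtained from the Kafka topic via KafkaUtils.createStream(...)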
lines.print();
    new Thread() {
      public void run() {
        while (true) {
          try {
            Thread.sleep(1000);
          } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
          }
          System.out.println("#############################################################");
        }
      }
    }.start();
jssc.start();
jssc.awaitTermination();
}
}
Summary
In this chapter, we introduced the readers to Apache Spark Streaming, and to the
concept of streams and their realization under Spark. We looked at the semantics and
differences between Spark and other streaming platforms. We also got the users
acquainted with situations where Spark is a better choice than Storm. We helped our
users to understand the language semantics by getting to know the API and
operations followed by a word count example using an integration with Kafka.
Working with Apache Flink
In this chapter, we get the readers acquainted with Apache Flink as a candidate for
real-time processing. While most of the appendages in terms of data source tailing
and sink remain the same, the compute methodology is changed to Flink and the
integrations and topology wiring are very different for this technology. Here the
reader will understand and implement end-to-end Flink processes to parse,
transform, and converge compute on real-time streaming data.
Taking a snapshot of a distributed streaming data flow is based on the Chandy-Lamport
algorithm. Checkpointing in Flink is based on two concepts: barriers and state.
Barriers are a core element of Flink and are injected into the distributed streaming data
flow along with the events. Barriers separate the records of the stream into the set that
goes into the current snapshot and the set that goes into the next one. Each
barrier has its own unique ID.
The point where the barriers for snapshot n are injected (let's call it Sn) is the
position in the source stream up to which the snapshot covers the data. This position
Sn is reported to the checkpoint coordinator (Flink's JobManager).
The barriers flow downstream with the stream. When an intermediate operator has received a
barrier for snapshot n from all of its input streams, it emits a barrier for snapshot n
into all of its outgoing streams. Once a sink operator has received the barrier n from
all of its input streams, it acknowledges that snapshot n to the checkpoint
coordinator. After all sinks have acknowledged a snapshot, it is considered
completed.
Once snapshot n has been completed, the job will never again ask the source for
records from before Sn, since at that point these records will have passed through the
entire data flow topology.
When operators contain any form of state, this state must be part of the snapshots as
well. Operator state comes in different forms:
User-defined state: This is the state that is created and modified directly by the
transformation functions (such as map() or filter()).
System state: This state refers to data buffers that are part of the operator's
computation. A typical example for this state are the window buffers, inside
which the system collects (and aggregates) records for windows until the
window is evaluated and evicted.
Operators snapshot their state at the point in time when they have received all
snapshot barriers from their input streams, and before emitting the barriers to their
output streams. At that point, all updates to the state from records before the barriers
will have been made, and no updates that depend on records from after the barriers
have been applied. Because the state of a snapshot may be large, it is stored in a
configurable state backend. By default, this is the JobManager memory, but for
production use, a distributed reliable storage should be configured (such as HDFS).
After the state has been stored, the operator acknowledges the checkpoint, emits the
snapshot barrier into its output streams, and proceeds.
Flink basic components and processes
As per the official documentation of Flink (https://ci.apache.org/projects/flink/flink-docs-release-1.3/concepts/runtime.html and https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/components.html), the following are some of the processes and basic components of
Flink.
Processes: There are two types of processes in Flink, as shown in the following
figure:
The client is not part of the program execution; it only submits the job, and it may
stay connected to receive regular status updates about the job.
Components: Flink is a layered system with different layers of stack, as shown in
the following figure:
Flink provides source connectors for the following systems:
Apache Kafka
Amazon Kinesis Streams
RabbitMQ
Apache NiFi
Twitter Streaming API
We will now see a demonstration of integration with Apache Kafka and RabbitMQ.
Integration with Apache Kafka
We have discussed Apache Kafka setup in previous chapters, so we will focus on
Java code to integrate Flink and Kafka.
The preceding dependency is required for all types of integration. The
following dependencies are specific to the Flink and Kafka integration:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.8_2.11</artifactId>
<version>1.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.8.2.2</version>
</dependency>
When the Kafka source is added to the Flink environment, it returns
a DataStream object whose element type is determined by the deserialization schema.
You have to execute the environment; otherwise, the program will not run on
Flink. It would be like creating the DAG but never submitting it to the JobManager.
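The book's examples use Java; purely as an illustration, the following minimal sketch in Flink's Scala API shows the same idea (the topic name device-data and the connection settings are assumptions), including the DataStream returned by the source and the mandatory call to execute the environment:

import java.util.Properties

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object KafkaToFlinkSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("zookeeper.connect", "localhost:2181")   // required by the 0.8 consumer
    props.setProperty("group.id", "flink-demo")

    // Adding the Kafka source returns a DataStream[String] because of SimpleStringSchema
    val stream: DataStream[String] =
      env.addSource(new FlinkKafkaConsumer08[String]("device-data", new SimpleStringSchema(), props))

    stream.print()

    // Nothing runs until the environment is executed; this builds the DAG and submits it
    env.execute("Kafka to Flink sketch")
  }
}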
Example
The following input is required for Kafka and the output will be displayed on the
console.
Input:
Open the console and add the messages on the Kafka topic using the command line
producer available. Refer to the following screenshot:
Output:
Create the streaming environment and enable checkpointing with a time interval and the
EXACTLY_ONCE policy. This is mandatory if you want exactly-once semantics
with RMQ and Flink; otherwise you get at-least-once semantics.
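As an illustrative sketch (again in Flink's Scala API, with an assumed 5-second checkpoint interval), enabling checkpointing with the EXACTLY_ONCE mode looks like this:

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object CheckpointedJobSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Inject checkpoint barriers every 5 seconds, with exactly-once guarantees
    env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE)
    // ... add the RMQ source, transformations, and sinks here, then:
    // env.execute("checkpointed RMQ job")
  }
}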
Input:
Run the RMQ publisher program in your Eclipse. It will ask, on the console, for the
message to be keyed in and published to RMQ.
RMQ UI shows eight messages in the queue, as shown in the following screenshot:
Output:
The following are the different transformations available with DataStream API:
FlatMap: FlatMap returns its values through a collector; it may return one, many,
or no values per input element. For example:
dataStream.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out)
throws Exception {
for(String word: value.split(" ")){
out.collect(word);
}
}
});
Filter: Filter transformation is used to filter events in the data stream. For
example:
dataStream.filter(new FilterFunction<Integer>() {
@Override
public boolean filter(Integer value) throws Exception {
return value != 0;
}
});
KeyBy: It partitions the stream so that each partition contains events with the same key,
and it returns a KeyedStream. The key can't be an array, nor a class object that does not
override the hashCode method. For example:
dataStream.keyBy(0)
reduce: It applies the specified function to the current value and the last reduced
value and emits a new value. The difference between KeyBy and reduce is that
KeyBy only partitions the stream, so there is less shuffling, whereas reduce applies a
function to each and every event in the stream, which requires shuffling. For
example:
keyedStream.reduce(new ReduceFunction<Integer>() {
@Override
public Integer reduce(Integer value1, Integer value2)
throws Exception {
return value1 * value2;
}
});
Fold: Fold is the same as reduce, with the only difference being that Fold specifies
a seed value before executing the specified function. For example:
DataStream<String> result =
keyedStream.fold(1, new FoldFunction<Integer, Integer>() {
@Override
public Integer fold(Integer current, Integer value) {
return current * value;
}
});
Window: Windows can be defined on already-partitioned KeyedStreams. A window
groups the data of each key according to a time duration specified by the user. For
example:
dataStream.keyBy(0).timeWindow(Time.seconds(5))
DataSet API
DataSet API is used for batch processing. It has almost the same type of
transformations as DataStream API provides. The following code snippet is a small
example of word count using DataSet API:
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> text = env.fromElements(
"Who's there?",
"I think I hear them. Stand, ho! Who's there?");
DataSet<Tuple2<String, Integer>> wordCounts = text
.flatMap(new LineSplitter())
.groupBy(0)
.sum(1);
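// LineSplitter (not shown here) is assumed to be a user-defined FlatMapFunction<String, Tuple2<String, Integer>>
// that splits each line into words and emits a (word, 1) tuple per word, so groupBy(0).sum(1) yields word counts.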
The following transformations are available with the DataSet API and differ from
those of the DataStream API:
join(): join() transformation joins two datasets based on the key. For example:
result = input1.join(input2)
.where(0) // key of the first input (tuple field 0)
.equalTo(1); // key of the second input (tuple field 1)
First-n: Returns the first n elements from a data set. For example:
DataSet<Tuple2<String,Integer>> result1 = in.first(3);
We have discussed sources, processing and computation, now let's discuss the sinks
that are supported by Flink.
Flink persistence
Flink provides connectors for the following sinks (persistence stores):
Apache Kafka
Elasticsearch
Hadoop Filesystem
RabbitMQ
Amazon Kinesis Streams
Apache NiFi
Apache Cassandra
In this book, we will discuss the Flink and Cassandra connection as it is the most
popular.
Integration with Cassandra
We have discussed and explained the setup of Cassandra in previous chapters so we
will directly go to the program required to make a connection between Flink and
Cassandra:
Input:
Run the program in Eclipse and it will show output like the following screenshot on the
console. The data will differ for each run, as it is randomly generated.
Output:
After running FlinkCassandraConnector, nothing will be printed on the console; instead you
have to check whether the data is present in Cassandra. So run the
query shown in the following screenshot:
FlinkCEP
CEP stands for Complex Event Processing. Flink provides APIs for implementing
CEP on a data stream with high throughput and low latency. CEP is a way of
processing a data stream in which rules or conditions are applied; any event that satisfies
a condition can be saved in the database and/or trigger a notification to the user, as
shown in the following figure. Flink matches a complex pattern against each event in
the stream. This process filters out the events that are useful and discards the
irrelevant ones, which gives us the opportunity to quickly get hold of what's really
important in the data. Let's take an example. Say we have smart gensets that
send the status of the electricity produced and the temperature of the system. If the
temperature of a genset goes above 40 degrees, then the user should get a
notification to shut it down for a period of time, or to take immediate action to avoid an
accident.
Pattern API
Apache Flink provides a pattern API to apply complex event processing on the data
stream. Some important methods are:
begin: This defines the pattern starting state and can be written as follows:
Pattern<Event, ?> start = Pattern.<Event>begin("start");
followedBy:
This appends a new pattern state but here other events can occur
between two matching events as follows:
Pattern<Event, ?> followedBy = start.followedBy("next");
where: This defines a filter condition for the current pattern state; if an event
passes the filter, it can match the state, as follows:
patternState.where(new FilterFunction <Event>() {
@Override
public boolean filter(Event value) throws Exception {
return ... // some condition
}
});
within: This defines the maximum time interval within which an event sequence must
match the pattern; sequences that exceed it are discarded. It's written as follows:
patternState.within(Time.seconds(10));
subtype(subClass):
This defines a subtype condition for the current pattern. An
event can only match the pattern if it is of this subtype:
pattern.subtype(SubEvent.class);
Detecting pattern
Once you have created a pattern then you have to apply the pattern on the data
stream to match the pattern. The following is the way to do this by code:
DataStream<Event> input = ...
Pattern<Event, ?> pattern = ...
PatternStream<Event> patternStream = CEP.pattern(input, pattern);
Here you have to create the PatternStream using the input data stream and pattern that
you have created. Now the PatternStream will have the events that will match the
pattern defined.
Selecting from patterns
We have created the pattern, and applied it on the data stream - now how do we get
the matched events? This can be done by using the following code snippet:
class MyPatternSelectFunction<IN, OUT> implements PatternSelectFunction<IN, OUT> {
@Override
public OUT select(Map<String, List<IN>> pattern) {
IN startEvent = pattern.get("start").get(0);
IN endEvent = pattern.get("end").get(0);
return new OUT(startEvent, endEvent);
}
}
IN is the type of the events on which the pattern is applied. OUT is the type of the
output produced as an action on the matched events.
Example
Let's take one more example with the following code, which explains the available
pattern APIs and their use. In this example we generate an alert in case a
mobile/IoT device generates more than 15 MB of data within 10 seconds:
Pattern<DeviceEvent, ?> alertPattern = Pattern.<DeviceEvent>begin("first").subtype(DeviceEven
In the preceding snippet, we defined the pattern to be matched with each event of the
data stream:
PatternStream<DeviceEvent> tempPatternStream = CEP.pattern(messageStream.rebalance().keyBy("p
In the preceding snippet, we applied the pattern on the data stream and got the
pattern stream which contains matched events only.
DataStream<DeviceAlert> alert = tempPatternStream.select(new PatternSelectFunction<DeviceEven
private static final long serialVersionUID = 1L;
@Override
public DeviceAlert select(Map<String, DeviceEvent> pattern) {
DeviceEvent first = (DeviceEvent) pattern.get("first");
DeviceEvent second = (DeviceEvent) pattern.get("second");
allTxn.clear();
allTxn.add(first.getPhoneNumber() + " used " + ((first.getBin()
+ first.getBout())/1024/1024) +" MB at " + new
Date(first.getTimestamp()));
allTxn.add(second.getPhoneNumber() + " used " +
((second.getBin() + second.getBout())/1024/1024) +" MB at " +
new Date(second.getTimestamp()));
return new DeviceAlert(first.getPhoneNumber(), allTxn);
}
});
In the preceding snippet, we applied the select function to take action on the matched
events. The full code is available in the code bundle. To run the example, first start
DataGenerator, which pushes data to Kafka, and then start DeviceUsageMonitoring.
Gelly
Gelly is the graph API of Flink. In Gelly, graphs can be created, transformed, and
modified. The Gelly API provides all the basic and advanced functions of graph
analytics, and you can also choose from different graph algorithms.
Gelly API
Gelly provides an API for taking actions on graphs. We will discuss these
APIs in the following sections.
Graph representation
A graph is represented by a DataSet of vertices and a DataSet of edges. Graph nodes are
represented by the Vertex type. A vertex is defined by a unique ID and a value; a NullValue
can be used for a vertex with no value. The following are the ways to create a
vertex for a graph:
Vertex<String, Long> v = new Vertex<String, Long>("vertex 1", 8L);
Vertex<String, NullValue> v = new Vertex<String, NullValue>("vertex 1", NullValue.getInstance
Graph edges are represented by the Edge type. An edge is defined by a source ID (the ID of
the source vertex), a target ID (the ID of the target vertex), and an optional value. The source
and target IDs should be of the same type as the vertex IDs. The following is the way to
create an Edge in a graph:
Edge<String, Double> e = new Edge<String, Double>("vertex 1", "vertex 2", 0.5);
Graph creation
A graph can be created in an ExecutionEnvironment from a DataSet of vertices and a DataSet of edges (for example, with Graph.fromDataSet(vertices, edges, env)) or from collections of vertices and edges (Graph.fromCollection). The following transformations can then be applied to a graph:
map: The map transformation can be applied to the vertices or the edges. The IDs of the
vertices and edges remain unchanged, while the values are changed as per the
user-defined function:
Graph<Long, Long, Long> updatedGraph = graph.mapVertices(new MapFunction<Vertex<Long, Long>, Long>() {
public Long map(Vertex<Long, Long> value) {
return value.getValue() + 1;
}
});
filter: By using filter, you can filter the vertices or edges of a graph. If the
filter is applied to the edges, only the edges that satisfy it are kept, and no vertices
are removed. If the filter is applied to the vertices, only the vertices that satisfy it are
kept (edges whose endpoints have been removed are dropped as well). The subgraph
method, shown next, applies a vertex filter and an edge filter at the same time:
graph.subgraph(new FilterFunction<Vertex<Long, Long>>() {
public boolean filter(Vertex<Long, Long> vertex) {
return (vertex.getValue() > 0);
}
},
new FilterFunction<Edge<Long, Long>>() {
public boolean filter(Edge<Long, Long> edge) {
return (edge.getValue() < 0);
}
})
reverse:The reverse() method returns a new Graph where the direction of all the
edges has been reversed.
Union: In union operation, duplicate vertices are removed, but duplicate edges
are preserved.
This is it for the Gelly graph library as per the scope of this book. If you want to
explore more on this, then go to https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/libs/gelly/index.html.
DIY
A lot of examples have been shown in the previous sections. Now, let's get ready to do some
hands-on work. The code is available in the code bundle for reference; read README.MD in
the code bundle for instructions on executing the programs.
@Override
public void process(DeviceEvent element, RuntimeContext ctx, RequestIndexer indexer) {
Map<String, Object> json = new HashMap<>();
json.put("phoneNumber", element.getPhoneNumber());
json.put("bin", element.getBin());
json.put("bout", element.getBout());
json.put("timestamp", element.getTimestamp());
System.out.println(json);
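    // (elided in this excerpt) `source` is assumed to be an Elasticsearch IndexRequest built from the
    // json map above, which is then handed to the RequestIndexer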
indexer.add(source);
}
}
In the next code snippet, we have the public methods of the DeviceEvent class that are called:
package com.book.flinkcep.example;
if (tokens.length != 4) {
throw new RuntimeException("Invalid record: " + line);
}
deviceEvent.phoneNumber = Long.parseLong(tokens[0]);
deviceEvent.bin = Integer.parseInt(tokens[1]);
deviceEvent.bout = Integer.parseInt(tokens[2]);
deviceEvent.timestamp = Long.parseLong(tokens[3]);
return deviceEvent;
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + bin;
result = prime * result + bout;
result = prime * result + (int) (phoneNumber ^ (phoneNumber >>> 32));
result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
DeviceEvent other = (DeviceEvent) obj;
if (bin != other.bin)
return false;
if (bout != other.bout)
return false;
if (phoneNumber != other.phoneNumber)
return false;
if (timestamp != other.timestamp)
return false;
return true;
}
@Override
public String toString() {
return "DeviceEvent [phoneNumber=" + phoneNumber + ", bin=" + bin + ", bout=" + b
+ timestamp + "]";
}
}
import java.io.IOException;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.TypeExtractor;
import org.apache.flink.streaming.util.serialization.DeserializationSchema;
import org.apache.flink.streaming.util.serialization.SerializationSchema;
@Override
public TypeInformation<DeviceEvent> getProducedType() {
return TypeExtractor.getForClass(DeviceEvent.class);
}
@Override
public byte[] serialize(DeviceEvent element) {
return element.toString().getBytes();
}
@Override
public DeviceEvent deserialize(byte[] message) throws IOException {
return DeviceEvent.fromString(new String(message));
}
@Override
public boolean isEndOfStream(DeviceEvent nextElement) {
return false;
}
Finally, we have the data generator code, which builds random device records and
publishes them to Kafka:
package com.book.flinkcep.example;
import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
properties.put("bootstrap.servers", "localhost:9092");
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSer
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringS
properties.put("acks", "1");
stream.append(phoneNumber);
stream.append(",");
stream.append(bin);
stream.append(",");
stream.append(bout);
stream.append(",");
stream.append(System.currentTimeMillis());
System.out.println(stream.toString());
ProducerRecord<Integer, String> data = new ProducerRecord<Integer, String>(
"device-data", stream.toString());
producer.send(data);
counter++;
}
producer.close();
}
}
Executing a Storm topology on Flink:
A Storm topology can run in a Flink environment with little or no modification. Just keep the
following points in mind while making the changes.
The following lines show the code for doing this:
import org.apache.flink.storm.api.FlinkTopology;
import backtype.storm.topology.TopologyBuilder;
FlinkTopology.createTopology(topologyBuilder).execute();
}
}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
@Override
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
//fileName = (String) conf.get("file");
this.collector = collector;
try {
reader = new BufferedReader(new FileReader(fileName));
} catch (Exception e) {
throw new RuntimeException(e);
}
}
@Override
public void nextTuple() {
try {
String line = reader.readLine();
if (line != null) {
collector.emit(new Values(line));
}
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public void close() {
try {
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
Fields schema = new Fields("line");
declarer.declare(schema);
}
}
import java.util.Map;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
public class TDRCassandraBolt extends BaseBasicBolt {
private static final long serialVersionUID = 1L;
private Cluster cluster;
private Session session;
private String hostname;
private String keyspace;
@Override
public void prepare(Map stormConf, TopologyContext context) {
cluster = Cluster.builder().addContactPoint(hostname).build();
session = cluster.connect(keyspace);
}
@Override
public void cleanup() {
session.close();
cluster.close();
}
}
import java.util.Map;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
@Override
public void execute(Tuple input, BasicOutputCollector collector) {
String valueByField = input.getString(0);
System.out.println("field value "+ valueByField);
String[] split = valueByField.split(",");
PacketDetailDTO tdrPacketDetailDTO = new PacketDetailDTO();
tdrPacketDetailDTO.setPhoneNumber(Long.parseLong(split[0]));
tdrPacketDetailDTO.setBin(Integer.parseInt(split[1]));
tdrPacketDetailDTO.setBout(Integer.parseInt(split[2]));
tdrPacketDetailDTO.setTimestamp(Long.parseLong(split[3]));
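        // (elided in this excerpt) the populated tdrPacketDetailDTO is assumed to be emitted here,
        // for example on the "tdrstream" stream declared in declareOutputFields below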
@Override
public void cleanup() {
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declareStream("tdrstream", new Fields("tdrstream"));
}
import java.io.Serializable;
In the next chapter, we will see how to develop a real example by applying what has
been covered across this book's scenarios.
Case Study
After reading all the previous chapters, you have been acquainted with the different
frameworks available for real-time as well as batch processing. In this chapter, we will
discuss a case study that uses the frameworks we discussed in the previous chapters.
Suppose your children took your car to buy goods from the market and you
don't want them to go beyond 5 km/miles from your home, or you want to
see your kids' or other family members' location on a map to track them
When you have put your car in for servicing and you want it not to go beyond 1
KM/Miles from the car service center
When you are approaching home, the garage gate opens automatically without
having to get out of the car and manually open it by pushing/pulling it
When you are leaving home, wondering whether you have locked the door or
not
Suppose you have a cattle farm and you want your cattle not to stray beyond a
certain distance
A beacons-based use case can be implemented in stores or malls to find out
about the customer and recommend what he/she probably likes
As well as the previously mentioned use cases, there are many more. We have picked
the first use case mentioned: vehicle geofencing. The user will get an alert if the
vehicle goes beyond the virtual boundary defined by the user. An example is
illustrated in the following image:
In this use case, we push real-time vehicle location data into Kafka. We keep the
static data of the vehicle, which contains the starting point of the vehicle and the
threshold distance, in Kafka and push it into Hazelcast. The real-time vehicle data is
read by Storm, which processes every vehicle event: it checks the distance between
the current location and the location defined by the user, and if the vehicle goes beyond
the threshold distance, an alert is generated.
Data modeling
The real-time vehicle sensor data model is:
The Vehicle Id is the unique identifier of any vehicle. Generally, it is the chassis
number shown in different locations for different types of vehicle. Latitude and
Longitude are detected by GPS which tells us the current location. Speed is the speed of
the vehicle. Timestamp is the time when this event was generated.
This static data is provided by the owner of the vehicle while setting up the alert for
his/her vehicle. The Vehicle Id is the unique identifier of any vehicle. Latitude and
Longitude are the starting location of the vehicle or the location from where distance
is calculated to check for the alert. Distance is the threshold distance in meters. The Phone
number is used to send a notification in case an alert is generated.
The output is pushed into Elasticsearch. Two types of data model are used: one for saving the real-time vehicle sensor data and the other for the alert information generated by the system.
Coords contains the latitude and longitude in JSON format, which Elasticsearch maps to the geo_point type. Speed is the speed of the vehicle. Timestamp is the time the event occurred. Vehicle_id is the unique identifier of the vehicle.
Actual_coords contains the vehicle's real-time current latitude and longitude as a location. Expected_distance is the threshold distance configured by the user for their vehicle. Actual_distance is the current distance between actual_coords and expected_coords. Expected_coords is the starting point, that is, the location configured by the user while setting up the vehicle alert. Timestamp is the time the alert was generated by the system. Vehicle_id is the unique identifier of the vehicle.
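To make the alert index concrete, the following is a sketch of how such an alert document could be assembled with Jackson (which the simulator already uses) before being indexed into Elasticsearch. The field names follow the description above; the class name and values are purely illustrative:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    // Sketch: build an alert document matching the fields described above (values are illustrative).
    public class AlertDocumentSketch {
        public static String buildAlertJson() {
            ObjectMapper mapper = new ObjectMapper();
            ObjectNode alert = mapper.createObjectNode();
            alert.putObject("actual_coords").put("lat", 28.6139).put("lon", 77.2090);
            alert.putObject("expected_coords").put("lat", 28.7041).put("lon", 77.1025);
            alert.put("expected_distance", 5000);   // threshold configured by the user, in meters
            alert.put("actual_distance", 7250);     // current distance from the start point, in meters
            alert.put("timestamp", System.currentTimeMillis());
            alert.put("vehicle_id", "VH-1001");
            return alert.toString();                // JSON body to index into the alert index
        }
    }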
The tools and frameworks used in this case study, and their versions, are listed below:
Name Version
Java 1.8
Zookeeper 3.4.6
Kafka 2.11-0.8.2.2
Hazelcast 3.8
Storm 1.1.1
Elasticsearch 5.2.2
Kibana 5.2.2
Setting up the infrastructure
To implement the use case, the following tools must be set up:
This will download version 3.8 of Hazelcast. Extract it and you will see the folders and files shown in the following screenshot:
This will start Hazelcast on the localhost, bound to port 5701. If we want to create a Hazelcast cluster, copy the Hazelcast setup directory to a different location and execute the start.sh script again. This starts a second Hazelcast instance in cluster mode on the localhost and binds it to port 5702.
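Once a member is running, applications connect to it with the Hazelcast Java client. The following is a minimal sketch, assuming the member is on localhost:5701; the map name and the entry used here are assumptions for illustration, not the book's exact code:

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.client.config.ClientConfig;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    // Minimal Hazelcast 3.x client sketch; the map name is an assumption.
    public class HazelcastClientSketch {
        public static void main(String[] args) {
            ClientConfig config = new ClientConfig();
            config.getNetworkConfig().addAddress("127.0.0.1:5701");
            HazelcastInstance client = HazelcastClient.newHazelcastClient(config);

            // Distributed map holding the per-vehicle static configuration.
            IMap<String, String> staticData = client.getMap("vehicle-static-data");
            staticData.put("VH-1001", "{\"latitude\":28.61,\"longitude\":77.20,\"distance\":5000}");
            System.out.println("Entries in map: " + staticData.size());

            client.shutdown();
        }
    }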
This will start the Management Center (mancenter) UI on the localhost on port 8080. The URL is http://localhost:8080/mancenter/. It automatically creates a work directory inside the mancenter directory. If you want to start the UI on a different port and with a different work directory location, execute the following command:
/mancenter/startMancenter.sh <PORT> <PATH>
After executing the previous steps, you will receive a six-digit code on your phone number and the following response on the console:
Use the following command to register your number with the verification code:
yowsup-cli registration -d -R 705-933 -p 917988141683 -C 91 -E android
5. After registration, you can test it by sending a message using the following command:
yowsup-cli demos --config whatsapp_config.txt --send <Receiver Phone number with country code> <Message>
Check your mobile; you should have received the message.
The Storm UI runs on port 8080 by default, which is already used by the Hazelcast Management Center here, so set a different port in storm.yaml:
ui.port: 8081
Nimbus: First of all, we need to start the Nimbus service in Storm. Execute the following command to start it:
/bin/storm nimbus
Supervisor: Next, we need to start the supervisor nodes so that they connect to the Nimbus node. Execute the following command to start one:
/bin/storm supervisor
Logviewer: The logviewer service lets you view the worker logs from the Storm UI. Execute the following command to start it:
/bin/storm logviewer
Implementing the case study
To implement the geofencing use case, we build and develop the following components.
Building the data simulator
Before generating the real-time vehicle data, we need the starting point of each vehicle. The following is the code snippet that generates this data:
package com.book.simulator;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Random;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import com.book.domain.Location;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
/**
* This class is used to generate vehicle start point for number of vehicle
* specified by user.
*
* @author SGupta
*
*/
public class VehicleStartPointGenerator {
static private ObjectMapper objectMapper = new ObjectMapper();
static private Random r = new Random();
static private String BROKER_1_CONNECTION_STRING = "localhost:9092";
static private String KAFKA_TOPIC_STATIC_DATA = "vehicle-static-data";
public static void main(String[] args) {
if (args.length < 1) {
System.out.println("Provide number of vehicle");
System.exit(1);
}
// Number of vehicles for which data needs to be generated.
int numberOfvehicle = Integer.parseInt(args[0]);
// Get producer to push data into Kafka
KafkaProducer<Integer, String> producer = configureKafka();
// Get vehicle start point.
Map<String, Location> vehicleStartPoint = getVehicleStartPoints(numberOfvehicle);
// Push data into Kafka
pushVehicleStartPointToKafka(vehicleStartPoint, producer);
producer.close();
}
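The configureKafka(), getVehicleStartPoints(), and pushVehicleStartPointToKafka() helper methods are not reproduced in this excerpt. As a hedged sketch, configureKafka() could look like the following, reusing the producer properties that appear later in the chapter:

    // Sketch of the elided configureKafka() helper; the properties mirror those
    // shown later in this chapter. Note that the producer is typed <Integer, String>
    // while a StringSerializer is configured for the key, which works as long as
    // records are sent without keys.
    private static KafkaProducer<Integer, String> configureKafka() {
        Properties properties = new Properties();
        properties.put("bootstrap.servers", BROKER_1_CONNECTION_STRING);
        properties.put("key.serializer", StringSerializer.class.getName());
        properties.put("value.serializer", StringSerializer.class.getName());
        properties.put("acks", "1");
        return new KafkaProducer<Integer, String>(properties);
    }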
Next, we need to build a real-time data simulator that generates vehicle data within a radius of the user-specified value:
package com.book.simulator;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Random;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import com.book.domain.Location;
import com.book.domain.VehicleSensor;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
/**
* This class is used to generate real-time vehicle data with updated location
* within distance in radius of user specified value. Messages are pushed into Kafka topic.
*
* @author SGupta
*
*/
public class VehicleDataGeneration {

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Provide total number of records and range of distance from start location");
            System.exit(1);
        }
        // ... the code that configures Kafka, reads the start points, and generates events is elided in this excerpt ...
        producer.close();
    }
properties.put("bootstrap.servers", BROKER_1_CONNECTION_STRING);
properties.put("key.serializer", StringSerializer.class.getName());
properties.put("value.serializer", StringSerializer.class.getName());
properties.put("acks", "1");
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
producer.send(data);
}
}
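        // The consumer setup that produces the 'it' iterator used below is elided in
        // this excerpt. The following is a hedged sketch using the old high-level
        // consumer API (whose classes are imported above); the group id is an
        // illustrative assumption, and 'largest' makes the consumer read only the
        // newest entries on the topic.
        Properties consumerProps = new Properties();
        consumerProps.put("zookeeper.connect", "localhost:2181");
        consumerProps.put("group.id", "vehicle-data-generator");
        consumerProps.put("auto.offset.reset", "largest");
        ConsumerConnector consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(consumerProps));
        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put("vehicle-static-data", 1);
        ConsumerIterator<byte[], byte[]> it = consumer.createMessageStreams(topicCountMap)
                .get("vehicle-static-data").get(0).iterator();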
        // Read the vehicle start points previously published to the static-data topic.
        while (it.hasNext()) {
            String message = new String(it.next().message());
            try {
                vehicleStartPoint = objectMapper.readValue(message,
                        new TypeReference<Map<String, Location>>() {
                        });
            } catch (IOException e) {
                e.printStackTrace();
            }
            break;
        }
        consumer.shutdown();
        return vehicleStartPoint;
    }
    public static double getDistanceFromLatLonInKm(double lat1, double lon1, double lat2, double lon2) {
        int R = 6371; // Radius of the earth in km
        double dLat = deg2rad(lat2 - lat1); // deg2rad below
        double dLon = deg2rad(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(deg2rad(lat1)) * Math.cos(deg2rad(lat2)) * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        double d = R * c; // Distance in km
        return d;
    }
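The deg2rad helper referenced in the method above is not shown in the excerpt; it is simply a degree-to-radian conversion:

    // Converts degrees to radians (equivalent to Math.toRadians).
    public static double deg2rad(double deg) {
        return deg * (Math.PI / 180);
    }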
        // Fragment of the random-location generation: (x0, y0) is the start point and
        // (new_x, y) is a random offset within the configured radius, giving the new coordinates.
        double foundLatitude;
        double foundLongitude;
        foundLatitude = y0 + y;
        foundLongitude = x0 + new_x;
The output will be as shown in the following screenshot. Wait for the next step to be executed, as the program reads only the latest entries from the Kafka topic:
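The code that wires and submits the Storm topology is not reproduced in this excerpt. The following is a rough sketch of how such a topology might be assembled with the storm-kafka spout; the bolt classes and all component, topic, and topology names here are illustrative assumptions, not the book's exact implementation:

    import java.util.UUID;

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.kafka.KafkaSpout;
    import org.apache.storm.kafka.SpoutConfig;
    import org.apache.storm.kafka.StringScheme;
    import org.apache.storm.kafka.ZkHosts;
    import org.apache.storm.spout.SchemeAsMultiScheme;
    import org.apache.storm.topology.TopologyBuilder;

    // Illustrative topology wiring; the bolt classes are hypothetical placeholders.
    public class GeofenceTopologySketch {
        public static void main(String[] args) throws Exception {
            // Read real-time vehicle events from the vehicle-data Kafka topic.
            SpoutConfig spoutConfig = new SpoutConfig(new ZkHosts("localhost:2181"),
                    "vehicle-data", "/vehicle-data", UUID.randomUUID().toString());
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("vehicle-data-spout", new KafkaSpout(spoutConfig), 1);
            builder.setBolt("parser-bolt", new VehicleParserBolt(), 1)
                   .shuffleGrouping("vehicle-data-spout");
            builder.setBolt("geofence-bolt", new GeofenceCheckBolt(), 1)
                   .shuffleGrouping("parser-bolt");

            Config config = new Config();
            config.setDebug(false);
            StormSubmitter.submitTopology("geofence-topology", config, builder.createTopology());
        }
    }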
Once the topology is deployed, the Storm UI starts showing the topology summary, as displayed in the following screenshot:
When you click on the topology name, the details of the spout and bolts are shown, along with other details, as shown in the following screenshot:
You can visualize the complete topology as a DAG by clicking on the visualization button, as shown in the following screenshot:
Start simulator
Finally, we are done with all the setup and have started the services. Now, by executing the following command, messages will start being pushed into the Kafka topic vehicle-data:
java -cp chapter12-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.book.simulator.VehicleDataGeneration <number of records> <distance range>
When you select any type of visualization, Kibana first asks you to select the index name, as shown in the following screenshot:
Now, add visualizations as per your needs. Here are a few examples:
Tile Map: By selecting the following options for the vehicle-alert index, geo points start showing on the map, as shown in the following screenshot (as the data points are random, they can appear anywhere on the map):
Vertical Bar Chart: This chart shows vertical bars of any selected aggregation versus the selected x-axis value, as shown in the following screenshot:
Data Table: This gets the actual values and performs an aggregation operation on them. Configure your visualizations as seen in the preceding figures. Figure 18 shows the total number of events for each vehicle, and the other visualization shows the top 10 vehicles by number of alerts.
Now, build your dashboard by adding all the previously configured visualizations using the Add option. After adding all the visualizations, the dashboard looks like the following screenshots:
The following is a screenshot of the messages received on WhatsApp on a mobile device:
Summary
In this chapter, we selected geofencing as a case study and explained the different use cases around it. We described the data models used in the implementation: the input data set, the output data set, and the Elasticsearch indexes. We listed the tools used to implement the use case, discussed in detail how to set up the complete environment with those tools, and walked through the implementation along with the code.
In the end, we looked at how to run the use case and visualize the results.