
Apache Storm

Apache Storm is an open-source distributed real-time computation system. It provides features like real-time processing, scalability, fault tolerance and supports multiple programming languages. Storm has a master-slave architecture with Nimbus as master and supervisor nodes running workers. Topologies define the flow of data from spouts to bolts. Common use cases include fraud detection, social media analytics, IoT and recommendation engines.

Uploaded by

Nipuni

APACHE STORM

SENG 41303- Big Data Infrastructure

Assignment 02 - Group 02
Team details

SE/2018/012 Nethmini Devyanjalee

SE/2018/019 Isuru Malkishara

SE/2018/024 Nirmal Kapilarathna

SE/2018/025 Imasha Weerakoon

SE/2018/031 Nipuni Perera

SE/2018/038 Isuruni Rathnayaka

SE/2018/041 Sanjikan Pathmanathan

SE/2018/042 Sachin Tharaka

SE/2018/045 Tharushi Chamalsha

Table of Contents

Team details
Table of Contents
1. Description of Apache Storm, addressing its features and use cases.
1.1 Features of Apache Storm
1.2 Apache Storm Architecture
1.3 Use Cases of Apache Storm
2. Advantages of using Apache Storm for stream data processing.
3. Disadvantages of using Apache Storm for stream data processing.
4. Comparison between Apache Storm and Apache Spark, highlighting advantages
and disadvantages.
4.1 Apache Storm
4.1.1 Advantages
4.1.2 Disadvantages
4.2 Apache Spark
4.2.1 Advantages
4.2.2 Disadvantages
5. Comparison between Apache Storm and Apache Kafka, emphasizing advantages
and disadvantages.
6. Q & A
6.1 The comparison between Apache Storm and Apache Spark
6.2 Apache Storm's features and use cases
6.3 The advantages and disadvantages of Apache Storm
6.3.1 Advantages of Apache Storm
6.3.2 Disadvantages of Apache Storm

1. Description of Apache Storm, addressing its features and
use cases.

Apache Storm is a distributed, open-source framework for the real-time processing of
data. It was originally developed at BackType, open-sourced by Twitter in 2011, and
later became an Apache project. Storm does for real-time data what Hadoop does for
batch data. It runs on its own cluster of Nimbus and supervisor nodes, and it can
exchange data with systems in the Hadoop ecosystem. Storm's core API processes each
tuple (event) individually as it arrives instead of collecting events into small
batches, although its higher-level Trident API adds micro-batching where that is
needed. Hence, it is best suited for workloads in which every event must be handled
with minimal delay. Storm is a robust and scalable framework that has gained
popularity for its ability to handle complex event processing and real-time
analytics.

1.1 Features of Apache Storm

1. Real-time Data Processing

Apache Storm is specifically designed for real-time data processing. It can ingest,
process, and analyze data as it arrives, making it an excellent choice for
applications that require low-latency responses.

2. Scalability

Storm's architecture is highly scalable, allowing it to handle both small and
large-scale data processing tasks. Machines can be added to a cluster, and a
running topology can be rebalanced to change its parallelism, making Storm
suitable for applications whose workloads change over time.

3. Fault Tolerance

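Storm provides built-in fault tolerance mechanisms. When a worker process dies,
its supervisor restarts it, and when an entire node fails, Nimbus reassigns that
node's tasks to other machines, so data processing continues despite hardware
failures or software errors. This reliability is crucial for mission-critical
applications.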

4. Extensibility

Storm's extensible architecture allows users to integrate it with various data
sources, processing libraries, and output sinks. This flexibility enables
customizations to fit specific application requirements.

5. Stream Processing Topologies

Storm applications are structured as directed acyclic graphs (DAGs) known as
topologies. Topologies define the flow of data from spouts (data sources) through
a series of bolts (data processing components). This topology-based approach
makes it easy to model complex data processing workflows.

6. Multiple Programming Languages

Storm supports multiple programming languages, including Java, Python, and
Clojure. Components written in non-JVM languages such as Python communicate
with Storm through its multi-language protocol, so developers can use their
preferred programming language for building Storm applications.

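7. Processing Guarantees

Storm's core API guarantees "at-least-once" processing: every tuple emitted by a
spout is tracked, and a tuple that is not fully processed within a timeout is
replayed, so no data is lost but duplicates are possible after a failure.
"Exactly-once" semantics are available through the higher-level Trident API.
These guarantees are critical for maintaining data integrity.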

8. Integration with Other Technologies

Storm can seamlessly integrate with various data storage and messaging
systems, such as Apache Kafka, Apache Hadoop, and Apache Cassandra. This
integration makes it a versatile tool in the big data ecosystem.

9. Monitoring and Management

Storm provides tools such as the Storm UI web interface and a command-line
client for monitoring and managing running topologies, making it easier to
debug and optimize real-time data processing applications.
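
The at-least-once behaviour described under the processing guarantees can be illustrated with a small model. The sketch below is plain Python and does not use the Storm API; the class `ReplayingSpout` and the function `run` are hypothetical names chosen for illustration. It assumes a source that re-queues any message reported as failed, which is roughly what a reliable spout does when a tuple is not acknowledged.

```python
from collections import deque


class ReplayingSpout:
    """Toy source that re-queues any message that was reported as failed."""

    def __init__(self, messages):
        self.pending = deque(enumerate(messages))  # (message id, payload)
        self.in_flight = {}

    def next_tuple(self):
        if not self.pending:
            return None
        msg_id, payload = self.pending.popleft()
        self.in_flight[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Fully processed: the message can be forgotten.
        del self.in_flight[msg_id]

    def fail(self, msg_id):
        # Not fully processed: schedule the message for replay.
        self.pending.append((msg_id, self.in_flight.pop(msg_id)))


def run(messages, fail_once):
    """Process all messages; ids in fail_once fail on their first attempt."""
    spout = ReplayingSpout(messages)
    attempts = {}
    processed = []
    while (item := spout.next_tuple()) is not None:
        msg_id, payload = item
        attempts[msg_id] = attempts.get(msg_id, 0) + 1
        # The side effect happens before the simulated failure is reported.
        processed.append(payload)
        if msg_id in fail_once and attempts[msg_id] == 1:
            spout.fail(msg_id)
        else:
            spout.ack(msg_id)
    return processed


result = run(["a", "b", "c"], fail_once={1})
print(result)
```

In this model the message "b" is processed twice, once before the simulated failure and once on replay. Nothing is lost, but a downstream system sees a duplicate, which is the difference between at-least-once and exactly-once processing.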

1.2 Apache Storm Architecture

To understand how Apache Storm achieves its real-time processing capabilities, let's
delve into its architecture.

1. Nimbus

Nimbus is the master node in the Storm cluster. It is responsible for distributing
code, assigning tasks to worker nodes, and monitoring the overall health of the
cluster. Nimbus ensures that the topologies are executed correctly and efficiently.

2. ZooKeeper

Storm uses Apache ZooKeeper for distributed coordination and configuration
management. ZooKeeper helps in maintaining cluster state, leader election, and
keeping track of worker nodes. It plays a crucial role in ensuring fault tolerance
and reliability in Storm clusters.

3. Supervisor Nodes

Supervisor nodes run on worker machines in the Storm cluster. They are
responsible for launching and managing worker processes. Each supervisor
node can manage multiple worker processes, allowing Storm to distribute tasks
efficiently.

4. Workers

Workers are individual processes responsible for executing spouts and bolts
within a topology. They run on supervisor nodes and perform the actual data
processing tasks. Storm dynamically assigns tasks to workers based on the
topology's configuration and the available resources in the cluster.

5. Topologies

Topologies are the core units of computation in Storm. They are directed acyclic
graphs (DAGs) consisting of spouts and bolts. Spouts are responsible for
ingesting data, while bolts perform data processing and transformation.
Topologies define how data flows through the system and can be customized to
suit various processing requirements.

6. Stream Groupings

Stream groupings define how tuples (data elements) emitted by spouts and bolts
are distributed among the tasks of the bolts that consume them. Storm supports
various stream groupings, including shuffle, fields, all, global, direct, and
custom groupings. These groupings allow developers to control the data
distribution and processing logic within a topology.

7. Message Broker Integration

Storm can be integrated with message brokers like Apache Kafka or message
queues to ingest real-time data streams. This integration ensures that Storm can
consume data from various sources seamlessly.
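
The way a topology connects a spout to bolts through stream groupings can be sketched as a small word-count pipeline. The code below is a plain Python model and not the Storm API; all function names are hypothetical. It assumes that shuffle grouping can be approximated by round-robin distribution (Storm distributes randomly but evenly) and that fields grouping can be approximated by a stable hash of the grouping field.

```python
import itertools
import zlib
from collections import Counter

SENTENCES = ["storm processes streams", "storm scales", "streams never end"]


def sentence_spout():
    """Spout: emits sentences one at a time (stand-in for a live source)."""
    yield from SENTENCES


def split_bolt(sentence):
    """Bolt: turns one sentence tuple into one tuple per word."""
    return sentence.split()


def fields_grouping(word, num_tasks):
    """Fields grouping: a stable hash sends equal words to the same task."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks


def run_topology(num_split_tasks=2, num_count_tasks=3):
    # Shuffle grouping is modelled here as round-robin over the split tasks.
    next_split_task = itertools.cycle(range(num_split_tasks))
    split_load = Counter()
    count_tasks = [Counter() for _ in range(num_count_tasks)]
    for sentence in sentence_spout():
        split_load[next(next_split_task)] += 1
        for word in split_bolt(sentence):
            count_tasks[fields_grouping(word, num_count_tasks)][word] += 1
    return split_load, count_tasks


split_load, count_tasks = run_topology()
totals = sum(count_tasks, Counter())
print(dict(totals))
```

Because equal words always hash to the same counting task, each task holds the complete count for the words it owns, and the per-task counters can be combined without double counting. This is the reason fields grouping is the usual choice in front of a stateful aggregation.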

1.3 Use Cases of Apache Storm

Apache Storm is a versatile stream processing framework that finds applications in a
wide range of industries and domains. Here are some common use cases where Storm
shows its capabilities:

1. Fraud Detection

Real-time fraud detection systems need to analyze financial transactions as they
occur. Storm can process transaction data in real time, flagging suspicious
activities and preventing fraudulent transactions from going through.

2. Social Media Analytics

Companies and organizations use Storm to analyze social media data streams.
This allows them to monitor brand mentions, sentiment, and trending topics in
real time, enabling quick responses to online trends and events.

3. Internet of Things (IoT) Data Processing

In IoT applications, devices generate continuous streams of data. Storm can
process this data in real time, making it suitable for applications like smart cities,
predictive maintenance, and asset tracking.

4. Recommendation Engines

Online services that provide real-time recommendations, such as e-commerce
platforms and video streaming services, use Storm to analyze user behavior and
deliver personalized recommendations.

5. Ad Campaign Optimization

Advertising platforms use Storm to analyze user engagement and click-through
rates in real time. This information is used to optimize ad campaigns on the fly,
ensuring maximum ROI for advertisers.

6. Network Traffic Analysis

Telecom and network service providers use Storm to analyze network traffic
patterns in real time. This helps in optimizing network performance, identifying
anomalies, and ensuring Quality of Service (QoS).

7. Real-time Dashboarding and Monitoring

Storm can power real-time dashboards that display key performance indicators
(KPIs) and metrics from various data sources. This is crucial for decision-makers
to monitor business operations in real time.

8. Weather Forecasting

Meteorological agencies can use Storm to process large volumes of real-time
weather data from sensors and satellites. This helps in generating timely
weather forecasts and warnings.
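
To make the fraud detection use case more concrete, the sketch below shows the kind of per-event logic a bolt might run. It is plain Python and does not use the Storm API; the class `VelocityRule`, its parameters, and the sample account names are hypothetical. It assumes a simple rule: an account is suspicious if it makes more than a given number of transactions within a sliding time window.

```python
from collections import defaultdict, deque


class VelocityRule:
    """Flags an account that makes too many transactions in a short window."""

    def __init__(self, max_txns=3, window_seconds=60):
        self.max_txns = max_txns
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # account -> timestamps in the window

    def process(self, account, timestamp):
        """Return True if this transaction should be flagged as suspicious."""
        times = self.recent[account]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while times and times[0] <= timestamp - self.window_seconds:
            times.popleft()
        return len(times) > self.max_txns


rule = VelocityRule(max_txns=3, window_seconds=60)
events = [
    ("acct-1", 0), ("acct-1", 10), ("acct-2", 15),
    ("acct-1", 20), ("acct-1", 30), ("acct-1", 200),
]
flags = [rule.process(account, ts) for account, ts in events]
print(flags)
```

Only the fourth transaction of acct-1 inside the same 60-second window is flagged. The decision is made as each event arrives, without waiting for a batch to complete, which is what makes a stream processor a natural fit for this kind of rule.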

Here are some specific use cases of Storm:

● Spotify uses Storm for various real-time features, such as monitoring, analytics,
recommendation systems, and targeting. With other technologies, such as Kafka
and Cassandra, Storm enables a fault-tolerant, low-latency distributed system.
● Twitter uses Storm for both production and in-development applications. Some
applications include real-time analytics, revenue optimization, discovery, and
personalization.
● WebMD applies Storm in a mobile environment for NLP (natural language
processing) tasks and real-time updates. Internal applications include ETL and
marketing pipelines.

2. Advantages of using Apache Storm for stream data
processing.

Apache Storm is a real-time stream data processing framework that offers several
advantages for handling and analyzing data streams in real time. Here are some of the
key advantages of using Apache Storm:

- Real-time data processing: Apache Storm is designed for real-time stream
processing, making it ideal for applications that require low-latency data
processing and near-instantaneous decision-making based on incoming
data.
- Fault tolerance: Storm provides built-in fault tolerance mechanisms,
ensuring that data processing continues even in the presence of failures,
such as node crashes or network issues. Spouts (data sources) and bolts
(data processors) run as parallel tasks distributed across multiple nodes,
so the work of a failed worker can be restarted or reassigned elsewhere.
- Scalability: Apache Storm is highly scalable and can handle large volumes
of data by distributing the processing across a cluster of machines. This
makes it suitable for applications with varying workloads, allowing you to
add or remove resources as needed.
- Extensibility: Storm's modular and extensible architecture allows you to
easily integrate it with other tools and technologies. You can create
custom spouts and bolts to process data from various sources and
perform specific operations.
- Support for multiple programming languages: While Storm itself runs on the
JVM, it supports multiple programming languages through its
"multi-language" feature, allowing developers to build components in
languages like Python and Clojure.
- Integration with various data sources: Storm can ingest data from a wide
range of sources, including Apache Kafka, Apache Flume, Twitter, and
more. This versatility makes it suitable for diverse use cases.
- Wide ecosystem and community: Storm benefits from an open-source
community, and it integrates well with other big data and real-time
processing technologies like Apache Hadoop, Apache Cassandra, and
Apache HBase.
- Processing guarantees: Storm's core API provides at-least-once
processing, replaying any tuple that is not acknowledged in time.
Exactly-once semantics are available through the higher-level Trident
API. This is crucial for maintaining data integrity.
- Low-latency processing: Storm is designed to minimize end-to-end
latency, making it suitable for applications where timely processing of data
is critical, such as fraud detection, real-time analytics, and
recommendation systems.
- Monitoring and management tools: There are several tools available for
monitoring and managing Storm clusters, making it easier to ensure the
health and performance of your real-time data processing infrastructure.
- Comprehensive documentation and community support: Apache Storm
has extensive documentation and a community that can assist, making it
easier to get started and troubleshoot issues.
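
The fault tolerance and scalability points above both rest on the idea that tasks are spread across workers and can be moved when a worker is lost. The sketch below is a simplified Python model of that idea, not Storm's actual scheduler; the functions `assign` and `handle_failure` and the task and worker names are hypothetical. It assumes round-robin placement and a "least-loaded survivor" rule for reassignment.

```python
def assign(tasks, workers):
    """Spread tasks over workers round-robin; returns {worker: [tasks]}."""
    assignment = {worker: [] for worker in workers}
    for index, task in enumerate(tasks):
        assignment[workers[index % len(workers)]].append(task)
    return assignment


def handle_failure(assignment, failed_worker):
    """Move the failed worker's tasks onto the least-loaded survivors."""
    orphaned = assignment.pop(failed_worker)
    for task in orphaned:
        target = min(assignment, key=lambda worker: len(assignment[worker]))
        assignment[target].append(task)
    return assignment


tasks = ["spout-0", "split-0", "split-1", "count-0", "count-1", "count-2"]
before = assign(tasks, ["worker-a", "worker-b", "worker-c"])
# Copy the lists so that "before" is left unchanged for comparison.
after = handle_failure({w: list(t) for w, t in before.items()}, "worker-b")
print(before)
print(after)
```

After the simulated loss of worker-b, every task is still assigned to some worker, so the topology keeps running on the remaining machines. Adding a worker would work the same way in reverse, which is the sense in which the processing scales horizontally.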

Overall, Apache Storm is a robust choice for organizations looking to process
and analyze streaming data in real time, and it is well-suited for use cases that
require low latency, fault tolerance, and scalability.

3. Disadvantages of using Apache Storm for stream data
processing.

Although Apache Storm has several advantages for processing stream data, it's crucial
to be aware of any potential drawbacks and restrictions:

● Complexity of Setup and Configuration: Setting up and configuring a Storm
cluster can be non-trivial, especially for users who are new to distributed
computing concepts. It requires knowledge of system administration and may
involve configuring various components like ZooKeeper for coordination.

● Steep Learning Curve: Developing applications for Storm may have a steeper
learning curve compared to simpler stream processing frameworks. Developers
need to understand concepts like spouts, bolts, and topologies.

● Resource Intensive: Storm can be resource-intensive, especially for complex
processing tasks or when dealing with high data volumes. This can lead to higher
hardware and operational costs.

● Limited Built-in State Management: Unlike some other stream processing
frameworks, core Storm bolts do not manage distributed state for you. Stateful
bolts with checkpointing were added in Storm 1.0, but developers often still rely
on an external store or implement their own mechanisms for managing state, which
can be challenging for certain use cases.

● Limited Disk-based Processing: Storm primarily operates in memory, which
means that it may not be well-suited for use cases that require extensive
disk-based processing. This can be a limitation for certain types of workloads.

● Debugging and Testing Complexity: Debugging and testing distributed systems,
including Storm applications, can be more complex than traditional single-node
applications. Ensuring correctness and reliability in a distributed environment can
be challenging.

● Lack of High-Level Abstractions: Compared to some other stream processing
frameworks, Storm may require more low-level coding. This can result in more
development effort, particularly for complex applications.

● Lack of Rich Built-in Libraries: While Storm has a mature ecosystem, it may not
have as many pre-built libraries and connectors for specific use cases as some
other stream processing frameworks.

● Limited Windowing and Event Time Handling: Storm's windowing capabilities are
more basic than those of some other stream processing systems, such as Apache
Flink, which offers more advanced event time handling and windowing
semantics.

● Less Integrated Batch Processing Support: Because Storm is primarily designed
for real-time processing, it may not be as well-integrated with batch processing
workflows as some other frameworks that combine both batch and stream
processing, like Apache Beam.

● Community and Maintenance Concerns: The community and support for Apache
Storm, while active, may not be as large or as well-funded as some other
projects. This could potentially lead to slower updates or less extensive
documentation.
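
The state management drawback listed above can be illustrated with a short model of a bolt that checkpoints its own state. The sketch is plain Python and does not use the Storm API; the class `CountingBolt`, the `checkpoint_every` parameter, and the dictionary standing in for an external database are all hypothetical.

```python
class CountingBolt:
    """Toy bolt that keeps running counts and checkpoints them to a store."""

    def __init__(self, store, checkpoint_every=2):
        self.store = store                 # stand-in for an external database
        self.counts = dict(store)          # restore the last checkpoint, if any
        self.checkpoint_every = checkpoint_every
        self.since_checkpoint = 0

    def execute(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.since_checkpoint += 1
        if self.since_checkpoint >= self.checkpoint_every:
            # Persist a full snapshot of the in-memory state.
            self.store.clear()
            self.store.update(self.counts)
            self.since_checkpoint = 0


external_store = {}
bolt = CountingBolt(external_store)
for key in ["x", "y", "x"]:
    bolt.execute(key)

# Simulated crash: the in-memory object is lost, only the store survives.
restarted = CountingBolt(external_store)
print(bolt.counts, restarted.counts)
```

The restarted bolt recovers only what was checkpointed, so the third update is missing until the source replays it. Deciding how often to checkpoint, and how to avoid counting a replayed tuple twice, is the extra work that falls on the developer when the framework does not manage state.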

In the end, the particular requirements and limitations of the application should be taken
into consideration while selecting a stream processing framework. Despite Apache
Storm's advantages, it's crucial to take into account any potential drawbacks and
determine whether they are compatible with your use case.

4. Comparison between Apache Storm and Apache Spark,
highlighting advantages and disadvantages.

Both frameworks are used for distributed data processing. Apache Storm is ideal for
real-time stream processing where low latency is critical. Apache Spark, on the
other hand, is better suited for batch processing, iterative machine learning, and
scenarios where a broader ecosystem is needed. The choice between the two should
depend on the specific requirements of the use case. Some organizations also use
Storm and Spark together in a hybrid approach that covers both real-time and
batch-processing needs. The key advantages and disadvantages of each framework
are listed below.

4.1 Apache Storm

4.1.1 Advantages

1. Low Latency: Storm can process events with very low latency, making it ideal for
applications that require near real-time responses.

2. Guaranteed Message Processing: Storm provides at-least-once processing by
default and exactly-once semantics through its Trident API, helping to ensure
data integrity.

3. Scalability: Storm is designed to be highly scalable, and you can add or remove
nodes as needed to handle increased workloads.

4.1.2 Disadvantages

1. Complexity: Developing and managing Storm topologies can be more complex
and requires expertise in distributed systems.

2. State Management: Managing state in Storm can be challenging, especially for
applications that require stateful processing.

3. Limited Batch Processing: While Storm can handle real-time data well, it's not as
efficient for batch processing tasks compared to Spark.

4.2 Apache Spark

4.2.1 Advantages

1. Ease of Use: Spark's high-level APIs (like DataFrame and Dataset APIs) make it
easier for developers to work with and require less low-level code than Storm.

2. In-Memory Processing: Spark keeps data in memory between stages, which can
significantly speed up processing for iterative algorithms.

3. Broad Ecosystem: Spark has a rich ecosystem of libraries and connectors for
various data sources and analytics, making it a versatile choice for big data
processing.

4.2.2 Disadvantages

1. Latency: Spark Streaming processes data in micro-batches, so its latency is
typically higher than Storm's, which makes it less suitable for applications that
require immediate responses to data.

2. Overhead: Spark has some overhead associated with in-memory processing,
which might not be necessary for all use cases and could result in increased
resource requirements.

3. Resource Intensive: Spark can be resource-intensive, and the cluster setup may
be more challenging and expensive for smaller workloads.
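
The latency difference between the two processing models in this section can be shown with simple arithmetic. The sketch below is a simplified Python model rather than a benchmark of either framework; the function names and the numbers are hypothetical. It assumes that a tuple-at-a-time system starts work on an event as soon as it arrives, while a micro-batch system waits until the end of the current batch interval, and it ignores queueing and scheduling overhead. All times are in milliseconds.

```python
def per_event_latency(arrival_times, processing_time):
    """Tuple-at-a-time model: each event is handled as soon as it arrives."""
    return [processing_time for _ in arrival_times]


def micro_batch_latency(arrival_times, batch_interval, processing_time):
    """Micro-batch model: each event waits for the end of its batch interval."""
    latencies = []
    for t in arrival_times:
        batch_end = (t // batch_interval + 1) * batch_interval
        latencies.append(batch_end - t + processing_time)
    return latencies


arrivals = [100, 400, 900, 1500]  # arrival times in milliseconds
streaming = per_event_latency(arrivals, processing_time=10)
batched = micro_batch_latency(arrivals, batch_interval=1000, processing_time=10)
print(streaming)
print(batched)
```

With a one-second batch interval, an event that arrives early in the interval waits most of a second before it is processed, while the tuple-at-a-time model pays only the processing cost. Shrinking the batch interval narrows the gap at the price of more scheduling overhead per batch.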

5. Comparison between Apache Storm and Apache Kafka,
emphasising advantages and disadvantages.

Aspect | Apache Storm | Apache Kafka
------ | ------------ | ------------
Real-Time Stream Processing | Suitable for real-time processing. | Not designed for real-time processing; primarily a data transportation and storage system.
Complex Event Processing (CEP) | Supports CEP for custom data analysis. | Does not provide native CEP capabilities. Requires external processing frameworks for this.
Scalability | Horizontally scalable for high throughput. | Horizontally scalable to handle large data volumes.
Fault Tolerance | Offers built-in fault tolerance features. | Reliable and fault-tolerant with built-in data replication.
Broad Language Support | Supports multiple programming languages. | Primarily Java-based, with some community-contributed clients for other languages.
Complex Setup | Setting up and configuring can be complex. | Setting up a Kafka cluster can be complex for beginners.
Operational Overhead | Requires ongoing maintenance and monitoring. | Involves ongoing management tasks.
State Management | It supports stateful processing but can be complex. | Not designed for state management; often requires external components for the state.
Data Durability | It focuses on processing; data durability depends on storage. | Ensures data durability by persisting data to disk.
Integration with Processing Frameworks | Integrates with various processing frameworks. | Integrates well with processing frameworks like Apache Storm.
Learning Curve | Steep learning curve for beginners. | Requires an understanding of Kafka's concepts and terminology.
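
The division of labour in the table, with Kafka storing and transporting data and Storm processing it, can be sketched with a toy log and consumer. The code below is plain Python and uses neither the Kafka nor the Storm API; the classes `Log` and `Consumer` are hypothetical. It assumes a single append-only log (similar in spirit to one Kafka partition) and a processor that remembers the next offset it needs to read.

```python
class Log:
    """Toy append-only log, standing in for a single Kafka partition."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read_from(self, offset):
        return self.records[offset:]


class Consumer:
    """Toy processor that remembers the next offset it needs to read."""

    def __init__(self, log):
        self.log = log
        self.committed = 0

    def poll_and_process(self, handler):
        for record in self.log.read_from(self.committed):
            handler(record)
            self.committed += 1


log = Log()
for value in [3, 5, 8]:
    log.append(value)

totals = []
consumer = Consumer(log)
consumer.poll_and_process(totals.append)
log.append(13)
consumer.poll_and_process(totals.append)

# A second consumer starting at offset 0 can replay the whole history.
replayed = []
Consumer(log).poll_and_process(replayed.append)
print(totals, replayed)
```

Because the log keeps its records after they are read, a new or restarted processor can replay the history from any offset. This is why a durable log is a common source in front of a stream processor that replays unacknowledged tuples.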

6. Q & A

6.1 The comparison between Apache Storm and Apache Spark

1. Which of the following is better suited for real-time stream processing?

A) Apache Storm
B) Apache Spark
C) Flink
D) Both

Answer: A) Apache Storm

2. Which processing model does Apache Storm primarily use?

A) Micro-batch
B) Continuous
C) Batch
D) Hybrid

Answer: B) Continuous

3. Which of the following provides better fault tolerance out of the box?

A) Apache Storm
B) Apache Spark
C) Both
D) None

Answer: A) Apache Storm

4. Which platform is known for its in-memory processing capabilities?

A) Apache Storm
B) Apache Spark
C) Flink
D) Both

Answer: B) Apache Spark

5. Which one is more suitable for processing large volumes of data?

A) Apache Storm
B) Apache Spark
C) Flink
D) Both

Answer: B) Apache Spark

6. Which framework is commonly used for complex event processing?

A) Apache Storm
B) Apache Spark
C) Flink
D) Both

Answer: A) Apache Storm

7. Which one has better support for machine learning and graph processing?

A) Apache Storm
B) Apache Spark
C) Flink
D) Both

Answer: B) Apache Spark

8. Which of the following provides a more user-friendly programming model?

A) Apache Storm
B) Apache Spark
C) Flink
D) Both

Answer: B) Apache Spark

9. Which one integrates better with Hadoop and other big data ecosystems?

A) Apache Storm
B) Apache Spark
C) Flink
D) Both

Answer: B) Apache Spark

10. Which framework has a more robust and mature ecosystem in terms of
third-party integrations and libraries?

A) Apache Storm
B) Apache Spark
C) Flink
D) Both

Answer: B) Apache Spark

6.2 Apache Storm's features and use cases

1. What is Apache Storm primarily used for?

A) Web development
B) Real-time stream processing
C) Batch processing
D) Data warehousing

Answer: B) Real-time stream processing

2. Which of the following components processes and transforms tuples within an
Apache Storm topology?

A) Spout
B) Bolt
C) Supervisor
D) Nimbus

Answer: B) Bolt

3. What is a Spout in Apache Storm?

A) A component that processes data in real time
B) A component that generates data streams
C) A component for storing data
D) A component for managing cluster resources

Answer: B) A component that generates data streams

4. Which of the following is NOT a guarantee provided by core Apache Storm's
message processing semantics?

A) At-least-once processing
B) Exactly-once processing
C) At-most-once processing
D) None of the above

Answer: B) Exactly-once processing

5. In Apache Storm, what does a Nimbus node do?

A) Processes data in parallel
B) Manages the cluster's resources
C) Acts as the master node for the cluster
D) Distributes data to bolts

Answer: C) Acts as the master node for the cluster

6. Which of the following is a use case for Apache Storm?

A) Static data analysis
B) Real-time fraud detection
C) Monthly report generation
D) Data archiving

Answer: B) Real-time fraud detection

7. Which programming languages can be used to develop Apache Storm
topologies?

A) Java and Python
B) Ruby and PHP
C) C++ and JavaScript
D) Scala and Swift

Answer: A) Java and Python

8. What is a tuple in Apache Storm's context?

A) A data structure used for storing static data
B) A unit of data in a Storm topology
C) A data source in Storm
D) A Storm-specific database

Answer: B) A unit of data in a Storm topology

9. Which of the following is a benefit of using Apache Storm for real-time stream
processing?

A) Low-latency processing
B) Support for batch processing only
C) Limited scalability
D) Lack of fault tolerance

Answer: A) Low-latency processing

10. What does the term "acknowledgment" refer to in Apache Storm's processing
model?

A) A message to confirm the receipt and successful processing of a tuple
B) A type of data source
C) The master node in the Storm cluster
D) A component that generates data streams

Answer: A) A message to confirm the receipt and successful processing of a tuple

6.3 The advantages and disadvantages of Apache Storm

6.3.1 Advantages of Apache Storm

1. What is one of the key advantages of Apache Storm for real-time data
processing?

A) Low-latency processing
B) Batch processing only
C) Complex setup
D) Limited scalability

Answer: A) Low-latency processing

2. Which of the following best describes Apache Storm's fault tolerance
mechanism?

A) It lacks fault tolerance.
B) It relies on external systems for fault tolerance.
C) It provides built-in fault tolerance.
D) It requires manual intervention for fault tolerance.

Answer: C) It provides built-in fault tolerance.

3. What does Apache Storm provide for stream processing that is advantageous
for real-time analytics?

A) Support for only static data sources
B) Scalability for batch processing
C) Support for dynamic data streams
D) Low fault tolerance

Answer: C) Support for dynamic data streams

4. What role does Apache Storm play in handling large volumes of data?

A) Data storage
B) Data transformation
C) Data retrieval
D) Data processing

Answer: D) Data processing

5. How does Apache Storm handle data processing tasks in a distributed
manner?

A) It relies on a single node for all processing.
B) It divides tasks across a cluster of nodes.
C) It uses a centralised database for processing.
D) It only supports single-node processing.

Answer: B) It divides tasks across a cluster of nodes.

6.3.2 Disadvantages of Apache Storm

6. What is one of the limitations of Apache Storm concerning state management?

A) It excels in state management.
B) State management can be complex and requires external databases.
C) It offers no state management capabilities.
D) State management is fully automatic.

Answer: B) State management can be complex and requires external databases.

7. Which of the following is a potential issue when working with Apache Storm in
terms of ease of use?

A) It has a highly intuitive user interface.
B) It requires a steep learning curve for new users.
C) It doesn't require any configuration.
D) It only supports simple use cases.

Answer: B) It requires a steep learning curve for new users.

8. Which of the following is a limitation of Apache Storm's fault tolerance?

A) It guarantees zero data loss.
B) It provides limited fault tolerance.
C) It doesn't offer any fault tolerance features.
D) It relies on external tools, such as ZooKeeper, for cluster coordination.

Answer: D) It relies on external tools, such as ZooKeeper, for cluster coordination.

9. What aspect of Apache Storm might make it less cost-effective compared to
batch processing systems for certain workloads?

A) Low-latency processing
B) High scalability
C) Complex setup and maintenance
D) Ease of use

Answer: C) Complex setup and maintenance

10. Which of the following is a potential drawback of Apache Storm when dealing
with irregular data arrival rates?

A) It can adapt seamlessly to any data arrival rate.
B) It may lead to inefficiencies and overhead.
C) It only works well with regular data arrival rates.
D) It doesn't support dynamic data rates.

Answer: B) It may lead to inefficiencies and overhead.

-The End-
