The document discusses big data streaming platforms and real-time processing, emphasizing the importance of event-driven architectures for handling continuous data streams. It outlines various technologies such as Apache Spark, Storm, Samza, and Kafka, highlighting their capabilities in processing and analyzing data in real-time. Additionally, it covers the architecture of big data pipelines, which includes message ingestion, stream processing, and analytical data storage for generating insights and reports.
Unit 3
Big Data Streaming Platforms
Real-Time Processing for Big Data
• Big data concepts and techniques are used for architecting and designing real-time systems and for building analytics applications on top of them.
• Many recent modeling and implementation approaches for developing real-time applications are based on the concept of an event.
• The term event refers to an individual data point in the system, and stream refers to the ongoing delivery of those events.
• A series of events can also be referred to as streaming data or a data stream.

Real-Time Processing for Big Data
• Actions taken on those events include:
• Aggregations (e.g., calculations such as sum, mean, standard deviation),
• Analytics (e.g., predicting a future event based on patterns in the data),
• Transformations (e.g., changing a number into a date format),
• Enrichment (e.g., combining the data point with other data sources to create more context and meaning),
• Ingestion (e.g., inserting the data into a database).
• A sketch of a simple per-event aggregation is shown below.
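As an illustration of aggregating events one at a time, the following minimal Java sketch keeps a running count, sum, and mean over a stream of numeric sensor readings. The `SensorEvent` record, the sensor id, and the hard-coded values are hypothetical stand-ins for any real event source.

```java
import java.util.List;

// Minimal sketch: aggregate events one data point at a time (hypothetical event type).
public class RunningAggregation {

    // A single data point in the stream.
    record SensorEvent(String sensorId, long timestampMillis, double value) {}

    private long count = 0;
    private double sum = 0.0;

    // Consume one event and update the running aggregates.
    void accept(SensorEvent event) {
        count++;
        sum += event.value();
    }

    double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        RunningAggregation agg = new RunningAggregation();
        // In a real system these events would arrive continuously from a source.
        List<SensorEvent> events = List.of(
                new SensorEvent("s1", 1_000, 21.5),
                new SensorEvent("s1", 2_000, 22.0),
                new SensorEvent("s1", 3_000, 23.5));
        events.forEach(agg::accept);
        System.out.printf("count=%d sum=%.1f mean=%.2f%n", agg.count, agg.sum, agg.mean());
    }
}
```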
Data Stream Processing
• Many real-time applications involve a continuous flow of information that is transferred from one location to another over a specific period of time.
• This type of interaction is called a stream, and it implies the transfer of a sequence of related information.
• Examples include video and audio data.
• The term is also frequently used to describe a sequence of data associated with real-world events, e.g. data emitted by sensors, devices, or other applications.

Data Stream Processing
• Processing such events is called event stream processing or data stream processing.
• Stream processing is a technology that allows users to query a continuous data stream and detect conditions within a short time of receiving the data.
• Event stream processing handles a data set one data point at a time, instead of viewing the data as a whole set.
• Data stream applications focus on filtering and aggregating stream data using SQL-like queries.
• For example, a user can query a data stream coming from a temperature sensor and receive an alert when the temperature exceeds a maximum threshold value (see the sketch below).
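A minimal sketch of that temperature-alert scenario in plain Java is shown below. It assumes a hypothetical in-memory queue as the event source and a fixed threshold of 30.0; a real deployment would read from a streaming platform rather than a local queue, but the per-event consumption loop is the same idea.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: consume temperature readings one at a time and alert on a threshold.
public class TemperatureAlert {

    private static final double MAX_THRESHOLD = 30.0;  // assumed threshold value

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Double> readings = new LinkedBlockingQueue<>();

        // Stand-in producer: in practice these readings would arrive from a sensor feed.
        new Thread(() -> {
            for (double t : new double[] {24.5, 28.1, 31.7, 29.9}) {
                readings.offer(t);
            }
            readings.offer(Double.NaN);  // sentinel to end the demo
        }).start();

        // Consumer loop: each reading is inspected as soon as it arrives.
        while (true) {
            double temperature = readings.take();
            if (Double.isNaN(temperature)) break;
            if (temperature > MAX_THRESHOLD) {
                System.out.println("ALERT: temperature " + temperature + " exceeds " + MAX_THRESHOLD);
            }
        }
    }
}
```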
Need of Data Stream Processing
• Big data establishes the core values needed by each application by processing the data.
• These values are never equal: some values are highest shortly after the underlying event has happened, and that value diminishes very quickly with time.
• Stream processing targets exactly such scenarios.
• The key strength of stream processing is that it can provide updated values for data items quickly, within milliseconds to seconds.

Need of Data Stream Processing
• Data stream processing handles never-ending data streams gracefully and naturally.
• Users can detect patterns, inspect results, and focus on multiple streams simultaneously.
• Stream processing naturally fits time-series data and detecting patterns over time.
• Batch processing lets data build up and tries to process it all at once, while stream processing processes data as it arrives and hence spreads the processing over time.
• Stream processing therefore typically requires less hardware than batch processing.
• When the volume of data is so large that it is not even possible to store it, stream processing can still handle it.

Data Stream Processing Platforms
• Many data streaming platforms are open source solutions.
• These platforms facilitate the construction of real-time applications, mainly message-oriented or event-driven applications.
• Such applications support access to messages or events at a very high rate, transfer to subsequent processing, and generation of alerts.
• These platforms focus on supporting event-driven data flow through nodes in a distributed system or within a cloud infrastructure platform.
• They provide a basis for building an analytics layer at the top of the big data stack.

Data Stream Processing Platforms
• The Hadoop ecosystem supports distributed computing and a large-scale data processing infrastructure.
• It was mainly developed to support processing large sets of structured, unstructured, and semi-structured data, but it was designed as a batch processing system.
• The main drawback of the Hadoop ecosystem is that it does not meet fast data analytics performance requirements.
• To support real-time processing, it can be linked with other components.
Data Stream Processing Platforms
• In an event stream processing environment, there are two main classes of technologies:
• 1) The system that stores the events: it consists of components that provide data storage and store the data based on a timestamp.
• 2) The technology that helps developers write applications that take action on the events, also known as stream processors or stream processing engines.

Data Stream Processing Platforms
• The components needed for a data streaming application in Hadoop are as follows:
• MapReduce, a distributed data processing model and execution environment that runs on large clusters of commodity machines.
• Hadoop Distributed File System (HDFS), a distributed file system that runs on large clusters of commodity machines.
• ZooKeeper, a distributed, highly available coordination service providing primitives that can be used for building distributed applications.
• Pig, a dataflow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
• Hive, a distributed data warehouse.

Spark
• Apache Spark is a more recent framework that combines an engine for distributing programs across clusters of machines.
• It provides a Read-Evaluate-Print Loop (REPL) approach for working with data interactively.
• Spark integrates with a variety of tools in the Hadoop ecosystem.
• It can read and write data in all of the data formats supported by MapReduce.
• It can read from and write to NoSQL databases like HBase and Cassandra.

Spark
• Spark uses a stream processing library called Spark Streaming, which is an extension of the Spark core framework.
• It is well suited for real-time processing and analysis, supporting scalable, high-throughput, and fault-tolerant processing of live data streams.
• Spark Streaming represents a continuous stream of data as a Discretized Stream (DStream).
• Internally, a DStream is represented as a sequence of resilient distributed datasets (RDDs).
• RDDs are distributed collections that can be operated on in parallel by arbitrary functions and by transformations over the data (e.g. sliding-window computations).

Spark
• DStreams can be created either directly from input stream sources, such as Apache Kafka, Flume, HDFS, or databases, or by transforming the RDDs of other DStreams.
• Spark Streaming receives live input data streams through a receiver and divides the data into micro-batches.
• These batches are then processed by the Spark engine to generate the final stream of results, also in batches.
• The processing components used are referred to as window transformation operators.

Spark
• Spark Streaming uses a small, deterministic batch interval (in seconds) to divide the stream into processable units.
• The size of the interval determines the throughput and latency: the larger the interval, the higher the throughput and the latency.
• Data can be taken from sources like Kafka, Kinesis, or TCP sockets, and processed using complex algorithms expressed through high-level functions like map, reduce, join, and window.
• Processed data can be pushed out to filesystems, databases, and live dashboards.
• Spark Streaming supports both batch and streaming applications.
• A minimal windowed word-count sketch using Spark Streaming is shown below.
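The sketch below, a minimal example using the Spark Streaming Java API (Spark 2.x/3.x), reads lines from a TCP socket in 1-second micro-batches and maintains word counts over a sliding window. The host and port (localhost:9999), the window and slide durations, the checkpoint path, and the local[2] master are assumptions chosen for illustration, not values from the slides.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

import java.util.Arrays;

// Sketch: windowed word count over a socket stream with 1-second micro-batches.
public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        // local[2]: one thread for the receiver, one for processing.
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        jssc.checkpoint("/tmp/spark-checkpoint");  // checkpoint directory for windowed operations

        // Each micro-batch of lines becomes an RDD inside this DStream.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaPairDStream<String, Integer> windowedCounts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                // 30-second window, recomputed every 10 seconds (sliding-window computation).
                .reduceByKeyAndWindow(Integer::sum, Durations.seconds(30), Durations.seconds(10));

        windowedCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```

Feeding the socket (for example with `nc -lk 9999`) prints updated word counts for each window as the batches are processed.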
Storm
• Storm is a distributed real-time computation system used for processing large volumes of high-velocity data.
• It makes it easy to process unbounded streams of data and has a relatively simple processing model.
• It is designed to process large amounts of data in a fault-tolerant and horizontally scalable manner.
• Storm is easy to set up and operate, and it guarantees that every message will be processed through the topology at least once.
• Storm is stateless; it manages the distributed environment and cluster state via ZooKeeper.
• Storm reads a raw stream of real-time data from one end and passes it through a sequence of small processing units.

Storm
• The internal components of Apache Storm are:
• 1. Spout: a source of a stream. Generally, Storm accepts input data from raw data sources like the Twitter Streaming API, an Apache Kafka queue, a Kestrel queue, etc.
• Users can also write their own spouts to read data from other data sources.
• The core interface for implementing spouts is ISpout.
• Some of the more specific interfaces are IRichSpout, BaseRichSpout, KafkaSpout, etc.

Storm
• 2. Bolts: these are the logical processing units.
• Spouts pass data to bolts; bolts process it and produce a new output stream.
• A bolt can process any number of input streams and produce any number of new output streams.
• Bolts can perform operations such as filtering, aggregation, joining, and interacting with data sources and databases.
• The core interface for implementing bolts is IBolt.
• Some of the common interfaces are IRichBolt, IBasicBolt, etc.

Storm
• 3. Topology: spouts and bolts connected together form a topology.
• A topology is a directed acyclic graph (DAG) where the vertices are computations and the edges are streams of data.
• A simple topology starts with spouts.
• A spout emits its data to one or more bolts.
• A bolt represents a node in the topology, and the output of a bolt can be fed into another bolt as input.
• Storm's main job is to run topologies, and it can run any number of topologies at a given time.
• A small topology sketch built from these components is shown below.
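The following sketch shows how a spout, a bolt, and a topology fit together using the Storm 2.x Java API. The random-sentence spout and the sentence-splitting bolt are simplified placeholders, and the component names, parallelism hint, and local-cluster run are assumptions made for illustration only.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import java.util.Map;
import java.util.Random;

public class SketchTopology {

    // Spout: emits a random sentence once per second (placeholder data source).
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] sentences = {"storm processes streams", "spouts feed bolts"};
        private final Random random = new Random();

        @Override
        public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values(sentences[random.nextInt(sentences.length)]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: splits each sentence into words and emits one tuple per word.
    public static class SplitBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        // Topology: the spout feeds the bolt; together they form a DAG.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout());
        builder.setBolt("splitter", new SplitBolt(), 2).shuffleGrouping("sentences");

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("sketch-topology", new Config(), builder.createTopology());
            Utils.sleep(10_000);  // let the topology run briefly before shutting down
        }
    }
}
```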
Samza
• Apache Samza is a distributed stream processing framework.
• Its current stable version: 0.10.0.
• It is very similar to Storm in that it is a stream processor with a one-at-a-time processing model and at-least-once processing semantics.
• It uses Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
• It was initially created at LinkedIn, submitted to the Apache Incubator in July 2013, and was granted top-level status in 2015.

Samza
• Samza was co-developed with the queueing system Kafka and has the same messaging semantics, i.e. streams are partitioned and messages (data items) inside the same partition are ordered.
• By default, Samza employs a key-value format for storing data.
• Samza supports JVM languages, particularly Java.
• Scalability is achieved by running a Samza job as several parallel tasks, each of which consumes a separate partition.

Samza
• Samza processes messages in order and stores processing results durably after each step; it prevents data loss by periodically checkpointing current progress and reprocessing all data from that point onwards in case of failure.
• Samza does not support a guarantee weaker than at-least-once processing.

FLUME
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
• It has a simple and flexible architecture based on streaming data flows.
• It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
• It uses a simple, extensible data model that allows for online analytic applications.
• While Flume and Kafka can both act as the event backbone for real-time event processing, they have different characteristics.

FLUME
• Kafka is well suited for high-throughput publish-subscribe messaging applications that require scalability and availability.
• Flume is better suited to cases where one needs to support data ingestion and simple event processing, but it is not suitable for complex event processing (CEP) applications.
• One of the benefits of Flume is that it supports many sources and sinks out of the box.

Amazon Kinesis
• Amazon Kinesis is a cloud-based service for real-time data processing over large, distributed data streams.
• Amazon Kinesis can continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events.

Amazon Kinesis
• Kinesis allows integration with Storm: it provides a Kinesis Storm Spout that fetches data from a Kinesis stream and emits it as tuples.
• Including this Kinesis component in a Storm topology provides a reliable and scalable stream capture, storage, and replay service.

Big Data Pipelines for Real-Time Processing
• Real-time processing deals with streams of data that are captured in real time and processed with minimal latency to generate real-time reports or automated responses.
• It is defined as the processing of an unbounded stream of input data.
• The incoming data can be in unstructured or semi-structured format.
• It has the same processing requirements as batch processing, but with shorter turnaround times to support real-time consumption.

Big Data Pipelines for Real-Time Processing
• Processed data is often written to an analytical data store, which is optimized for analytics and visualization.
• The processed data can also be ingested directly into the analytics and reporting layer for analysis, business intelligence, and real-time dashboard visualization.

Big Data Pipelines for Real-Time Processing
• A real-time processing architecture has the following logical components.
• Real-time message ingestion: the architecture must provide a way to capture and store real-time messages to be consumed by a stream processing consumer.
• This service could be implemented as a simple data store in which new messages are placed in a folder.
• It also requires a message broker, such as Azure Event Hubs, that acts as a buffer for the messages.
• The message broker should support scale-out processing and reliable delivery.

Big Data Pipelines for Real-Time Processing
• Stream processing: after real-time messages are captured, they are processed by filtering, aggregating, and otherwise preparing the data for analysis.
• Analytical data store: many big data solutions are designed to prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools.
• Analysis and reporting: the goal of most big data solutions is to provide insights into the data through analysis and reporting.

Streaming Ecosystem
• The different tools used in real-time streaming have their strengths and weaknesses; users need to select the correct tool and use it properly in the real-time data processing pipeline.
• A streaming framework is used to provide an optimized approach to real-time stream processing.
• In a typical streaming application, the basic job is to gather data from a large, geographically distributed set of users.
• To gather the data properly, a good distributed channel mechanism is required.
• The gathered data must be put into the channel and provided for processing to a set of servers on a commodity hardware platform.

Kafka
• Apache Kafka is developed by the Apache Software Foundation and written in Scala and Java.
• It is an open source stream processing platform that aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
• Kafka connects with external systems to import or export data through Kafka Connect, and it also provides Kafka Streams, a Java stream processing library.
• It is one of the preferred solutions for integrating real-time data from multiple stream-producing sources and making that data available to multiple stream-consuming systems concurrently, such as HDFS or HBase.
• A Kafka–Hadoop data pipeline supports real-time big data analytics.

Kafka
• Kafka is essentially a store for messages that come from processes (one or many) called producers.
• The data or messages are partitioned into different partitions within various topics.
• Within a topic's partition, the messages are indexed and stored together with a timestamp.
• On the other end, processes called consumers can query messages from these partitions.
• Kafka works between these producers and consumers and runs on a cluster of one or more servers.
• The partitions can be distributed across cluster nodes.
• Apache Kafka efficiently processes real-time, streaming data when implemented together with Apache Storm, Apache HBase, and Apache Spark.
• A minimal producer sketch is shown below.
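The sketch below shows the producer side using Kafka's Java client: a few records are published to a topic, with the record key determining the partition. The broker address, topic name, key, and record values are assumptions chosen for illustration only.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

// Sketch: publish a few records to a Kafka topic using the Producer API.
public class SensorProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                // The key determines the partition; records with the same key stay ordered.
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("sensor-readings", "sensor-1", "temperature=" + (20 + i));
                producer.send(record);
            }
            producer.flush();
        }
    }
}
```

A consumer process would subscribe to the same topic through the Consumer API and read the records from the partitions, as described in the API overview that follows.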
Kafka
• Kafka is deployed as a cluster on multiple servers, and it handles its entire publish-subscribe messaging system with the help of four APIs: the Producer API, Consumer API, Streams API, and Connector API.
• Due to its ability to deliver massive streams of messages in a fault-tolerant manner, it is used as an alternative to conventional messaging systems like JMS, AMQP, etc.
• The major terms of Kafka's architecture are topics, records, and brokers.
• Topics consist of streams of records holding different information.
• Brokers are responsible for replicating the messages.

Kafka
• Producer API: permits applications to publish streams of records.
• Consumer API: permits applications to subscribe to topics and process the stream of records.
• Streams API: converts input streams to output streams and produces the result.
• Connector API: provides reusable producers and consumers that can link topics to existing applications.

Kafka
• Kafka real-time applications:
• Messaging: used as a substitute message broker for a variety of applications.
• Website activity tracking: able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds.
• Metrics: used for operational monitoring of data.
• Log aggregation: combines physical log files from servers and places them in a central processing location.
• Stream processing.

Real Time Analytics Stack