Introduction to Streams Concepts

Explanation: Streams are sequences of data elements made available over time.
They can be unbounded, meaning data keeps arriving continuously, and are
processed in real-time or near real-time. This contrasts with batch processing,
where data is collected, stored, and processed in chunks. Stream processing is
crucial in applications that require real-time analytics and instant responses, such
as monitoring financial transactions, social media feeds, or sensor data.

Key Characteristics of Streams:

 Continuous Data Flow: Data continuously arrives and needs to be processed immediately.
 Low Latency: The time between receiving data and processing it is minimal.
 Scalability: Systems can handle increasing amounts of data by scaling horizontally.

Real-World Examples of Stream Processing:

 Financial Services: Fraud detection by monitoring transactions in real-time.
 Social Media: Analyzing tweets or posts as they are published.
 IoT Devices: Processing data from sensors to detect anomalies or trigger alerts.

Basic Stream Processing Operations:

1. Transformation: Modifying the data (e.g., mapping, filtering).
2. Aggregation: Summarizing data over a window (e.g., counting, averaging).
3. Joining: Combining data from different streams.

Stream Processing Systems: Several frameworks and tools facilitate stream
processing, including Apache Kafka, Apache Flink, and Apache Spark Streaming.

Figure: Stream Processing Pipeline

Source (Data Generation) ---> Stream Processor ---> Output (Data Sink)

Figure Explanation:

1. Source: This is where the data originates. It could be sensors, logs, user
interactions, etc.
2. Stream Processor: This component processes the data in real-time. It
applies various operations like filtering, transformation, and aggregation.
3. Output: The processed data is sent to a destination such as databases,
dashboards, or other applications.
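To make the pipeline in the figure concrete, here is a minimal Python sketch. The sensor values, temperature threshold, and field names are illustrative assumptions rather than any specific framework's API: the source emits events over time, the processor filters and transforms them, and the sink simply prints the results.

import random
import time

def source(n=10):
    # Simulate a data source that emits sensor readings over time
    for _ in range(n):
        yield {"sensor_id": random.randint(1, 3), "temp": random.uniform(15.0, 35.0)}
        time.sleep(0.1)  # data arrives continuously, not all at once

def process(stream):
    # Stream processor: filter out cold readings and transform units
    for event in stream:
        if event["temp"] >= 20.0:                         # filtering
            event["temp_f"] = event["temp"] * 9 / 5 + 32  # transformation
            yield event

def sink(stream):
    # Data sink: here we just print, but this could be a database or dashboard
    for event in stream:
        print(event)

sink(process(source()))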

Example: Consider a real-time monitoring system for a fleet of vehicles. Each
vehicle continuously sends its location and status data. A stream processing
system can analyze this data to provide real-time updates on vehicle positions,
detect anomalies (like a vehicle going off-route), and trigger alerts if any issues
are detected.

By understanding the basics of stream concepts, you can appreciate how
continuous, real-time data processing is essential in many modern applications,
enabling immediate insights and actions.

Stream Data Model and Architecture


Explanation: The stream data model involves the continuous flow of data from
sources to processing units, and finally to destinations (sinks). This model is
fundamentally different from the batch processing model where data is collected
and processed in fixed-size chunks.

Components of Stream Data Model:

1. Data Sources: These are the origins of the data streams. They could be
sensors, logs, social media feeds, etc.
2. Stream Processing Engine: This component processes the data as it arrives.
It can perform operations such as filtering, transformation, and aggregation.
3. Data Sinks: These are the destinations where the processed data is sent.
They could be databases, dashboards, alert systems, etc.

Stream Processing Architecture: The architecture of a stream processing system
typically includes several key components and concepts:

1. Data Ingestion:
o Data is ingested continuously from various sources such as sensors,
application logs, and social media feeds.
o Tools like Apache Kafka, Apache Pulsar, and Amazon Kinesis are
commonly used for data ingestion.
2. Data Stream:
o The ingested data is represented as a stream, which is an unbounded
sequence of events ordered by time.
o Each event or data point in the stream is processed individually or in
small batches.
3. Stream Processor:
o The stream processor is the core component that performs operations
on the data stream.
o Operations can include filtering, mapping, windowing, joining, and
aggregating the data.
o Frameworks like Apache Flink, Apache Spark Streaming, and Apache
Storm are widely used for stream processing.
4. Data Sink:
o The processed data is sent to data sinks for storage, analysis, or further
processing.
o Common data sinks include databases (e.g., Apache Cassandra,
MongoDB), data warehouses (e.g., Amazon Redshift, Google
BigQuery), and real-time dashboards.

Figure: Stream Processing Architecture

Data Sources ---> Data Ingestion ---> Stream Processor ---> Data Sinks

Detailed Components:

1. Data Sources:
o Example: IoT sensors, web logs, financial transaction systems.
o Function: Continuously generate data that needs to be processed in
real-time.
2. Data Ingestion:
o Example: Apache Kafka, Amazon Kinesis.
o Function: Captures and transports the data from sources to the
processing engine.
3. Stream Processor:
o Example: Apache Flink, Apache Spark Streaming.
o Function: Performs real-time processing on the incoming data stream.
This includes operations like filtering, mapping, and aggregating.
o Key Concepts:
 Stateless Processing: Each event is processed independently.
 Stateful Processing: The processor maintains state information
across events, enabling operations like windowing and
aggregations over time.
4. Data Sink:
o Example: Databases (MongoDB, Cassandra), real-time dashboards.
o Function: Stores or further processes the output data from the stream
processor.

Example Workflow:

1. Data Source: An IoT sensor generates temperature data every second.
2. Data Ingestion: The data is sent to an Apache Kafka topic.
3. Stream Processor: Apache Flink reads the data from Kafka, filters out invalid
readings, and computes the average temperature over a 1-minute window.
4. Data Sink: The processed data (average temperature) is sent to a real-time
monitoring dashboard.
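The workflow above names Apache Flink as the stream processor. As a hedged illustration using the Spark stack covered later in this material, the following sketch performs the same steps with Spark Structured Streaming: it reads from a Kafka topic, discards invalid readings, and averages the temperature over a 1-minute window. The broker address, topic name, and plain-numeric message format are assumptions, and running it requires the spark-sql-kafka connector package.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, avg

spark = SparkSession.builder.appName("TempMonitor").getOrCreate()

# Read the raw stream from a Kafka topic (broker address and topic name assumed)
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "temperature")
       .load())

# Kafka delivers the value as bytes; assume each message is a plain numeric string
parsed = (raw.selectExpr("CAST(value AS STRING) AS reading", "timestamp")
          .select(col("reading").cast("double").alias("temp"), col("timestamp")))

# Filter invalid readings and average over a 1-minute window
averaged = (parsed
            .filter(col("temp").isNotNull())
            .groupBy(window(col("timestamp"), "1 minute"))
            .agg(avg("temp").alias("avg_temp")))

# Write results to the console as a stand-in for a real-time dashboard sink
query = (averaged.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()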

Stream Computing
Explanation: Stream computing refers to the real-time processing of data as it
flows through a system. Unlike traditional batch processing, which handles data
in discrete chunks, stream computing processes data continuously, providing
immediate insights and actions. This is particularly useful for applications
requiring low latency, such as real-time analytics, monitoring, and event
detection.

Key Operations in Stream Computing:

1. Transformation: Modifying each data element in the stream.
2. Aggregation: Summarizing data over a window of time.
3. Filtering: Selecting data elements that meet specific criteria.
4. Joining: Combining data from multiple streams.
5. Windowing: Grouping data into windows based on time or event count.

Figure: Stream Computing Pipeline

Data Source ---> Stream Processor (Transform, Aggregate, Filter) ---> Data Sink

Example Workflow:

 A sensor network sends temperature data continuously.
 The stream processor calculates the average temperature every minute.
 If the average temperature exceeds a threshold, an alert is generated.
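The plain-Python sketch below illustrates the windowing, aggregation, and alerting steps of this workflow on a small batch of hypothetical (timestamp, temperature) readings. The 60-second tumbling window and the 30-degree alert threshold are assumed values, and a real stream processor would compute the averages incrementally as events arrive rather than over a finished list.

from collections import defaultdict

THRESHOLD = 30.0      # assumed alert threshold in degrees Celsius
WINDOW_SECONDS = 60   # 1-minute tumbling windows

def windowed_average(events):
    # Group (timestamp, temperature) events into tumbling windows and
    # emit the average temperature for each window
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, temp in events:
        window_start = int(ts) // WINDOW_SECONDS * WINDOW_SECONDS
        sums[window_start] += temp
        counts[window_start] += 1
    for window_start in sorted(sums):
        yield window_start, sums[window_start] / counts[window_start]

# Hypothetical sensor readings: (unix timestamp in seconds, temperature)
events = [(0, 21.0), (30, 25.0), (65, 31.0), (90, 33.0)]

for window_start, avg_temp in windowed_average(events):
    if avg_temp > THRESHOLD:
        print(f"ALERT: window starting at {window_start}s averaged {avg_temp:.1f}")
    else:
        print(f"OK: window starting at {window_start}s averaged {avg_temp:.1f}")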

Sampling Data in a Stream


Explanation: Sampling in data streams refers to the process of selecting a subset
of data points from a continuous stream of data. This technique is crucial in
scenarios where it is infeasible to process the entire data stream due to
constraints like memory, processing power, or real-time requirements. Sampling
helps in approximating the properties of the whole data stream by analyzing a
manageable subset.

Why Sampling in Data Streams?

1. Efficiency: Reduces the volume of data that needs to be processed, stored,
and analyzed.
2. Real-time Processing: Enables real-time analytics by focusing on a
representative subset rather than the entire stream.
3. Resource Management: Conserves computational and memory resources,
making it possible to handle high-velocity data streams.

Types of Sampling Techniques:

1. Random Sampling:
o Simple Random Sampling: Each data point in the stream has an equal
probability of being included in the sample.
o Reservoir Sampling: A fixed-size sample is maintained as the data
stream progresses. When a new data point arrives, it may replace an
existing point in the reservoir based on a specific probability.
2. Systematic Sampling:
o Selects data points at regular intervals from the data stream. For
example, every k-th data point is included in the sample.
3. Stratified Sampling:
o Divides the data stream into distinct strata or groups based on specific
characteristics and then applies random sampling within each group.
This ensures that each stratum is represented in the sample.
4. Adaptive Sampling:
o Adjusts the sampling rate based on the observed properties of the
data stream. For instance, if the data stream exhibits sudden changes
or spikes, the sampling rate may be increased to capture these
anomalies.
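Of these techniques, reservoir sampling is the one most specific to streams, because it maintains a uniform random sample of fixed size k without knowing the stream length in advance. A minimal sketch of the classic Algorithm R follows; the stream of integers used to drive it is purely illustrative.

import random

def reservoir_sample(stream, k):
    # Keep a uniform random sample of k items from a stream of unknown length.
    # After n items, every item seen so far is in the reservoir with probability k/n.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))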
Filtering Streams
Explanation: Filtering streams involves removing unwanted data points from a
stream based on specific criteria, allowing only relevant data to pass through. This
is a fundamental operation in stream processing, essential for focusing on
significant events and reducing noise.

Filtering Criteria Examples:

1. Value-Based Filtering: Only data points with values above or below a
threshold.
2. Pattern-Based Filtering: Data points that match a specific pattern (e.g.,
regex).
3. Attribute-Based Filtering: Data points with specific attributes (e.g.,
location, type).

Figure: Filtering Streams

Incoming Stream ---> Filter (Condition) ---> Filtered Stream

Example Workflow:

 A social media platform receives a continuous stream of posts.
 The filter removes posts that do not contain certain keywords.
 The filtered stream, containing only relevant posts, is used for sentiment
analysis.
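A minimal sketch of this keyword-based filter in Python follows; the keyword set and example posts are assumptions chosen only to illustrate the idea.

KEYWORDS = {"outage", "refund", "error"}  # assumed keywords of interest

def keyword_filter(posts):
    # Pass through only the posts that mention at least one keyword
    for post in posts:
        if any(word in post.lower() for word in KEYWORDS):
            yield post

incoming = [
    "Loving the new update!",
    "Is anyone else seeing an error on checkout?",
    "Requesting a refund for my last order.",
]
for post in keyword_filter(incoming):
    print(post)  # downstream step: sentiment analysis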

Counting Distinct Elements in a Stream


Explanation: Counting distinct elements in a data stream is a common task in
stream processing, used in various applications such as network monitoring,
fraud detection, and real-time analytics. The challenge is to perform this task
efficiently, given the potentially unbounded nature of the data stream and the
need for low-latency processing.

Traditional methods of counting distinct elements involve maintaining a set of all
observed elements, which can be memory-intensive and infeasible for large data
streams. To address this, specialized algorithms and data structures are used.

Key Techniques for Counting Distinct Elements:


1. Hash Set:
o Method: Maintain a hash set of all unique elements.
o Pros: Simple and accurate.
o Cons: Requires significant memory for large data streams.
2. Approximate Counting Algorithms:
o HyperLogLog (HLL):
 Method: Uses probabilistic counting to estimate the number of
unique elements.
 Pros: Highly memory-efficient, suitable for very large data
streams.
 Cons: Provides an approximate count, with some error margin.
o Flajolet-Martin Algorithm:
 Method: Uses a bit vector to estimate the number of unique
elements based on hash values.
 Pros: Memory-efficient and relatively simple to implement.
 Cons: Less accurate than HyperLogLog.

Figure: Counting Distinct Elements in a Stream

Incoming Stream ---> Distinct Counting Algorithm (e.g., HyperLogLog) ---> Estimated Unique Count

Detailed Example: HyperLogLog

HyperLogLog (HLL) Algorithm:

1. Initialization: Create an array of registers initialized to zero. The size of this
array is determined by a precision parameter.
2. Hashing: For each element in the stream, compute its hash value.
3. Bit Pattern: Use the hash value to determine which register to update based
on the position of the leftmost 1-bit in the binary representation of the
hash.
4. Updating Registers: Update the register to the maximum observed value of
the leftmost 1-bit position.
5. Estimation: Combine the values in all registers to compute an estimate of
the number of unique elements.

Mathematical Basis:

 The algorithm relies on the fact that the position of the leftmost 1-bit in the
binary hash value follows a geometric distribution.
 By aggregating information from multiple hash values, HyperLogLog can
provide a good estimate of the cardinality (number of unique elements)
with high accuracy and low memory usage.
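As an illustration of this idea, here is a simplified, single-hash Flajolet-Martin style estimator in Python. It tracks the largest run of trailing zero bits in the hash values (the mirror image of the leftmost-1-bit formulation described above) and scales 2^R by a correction constant; production systems such as HyperLogLog average many such registers to reduce the high variance of this single estimate. The choice of hash function and the example stream of user IDs are assumptions.

import hashlib

def trailing_zeros(x):
    # Number of trailing zero bits in x (position of the rightmost 1-bit)
    if x == 0:
        return 0
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def flajolet_martin_estimate(stream):
    # Track the largest trailing-zero run R seen in any hash value;
    # the cardinality is roughly 2**R / 0.77351 for this single-register sketch
    max_zeros = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        max_zeros = max(max_zeros, trailing_zeros(h))
    return (2 ** max_zeros) / 0.77351

# Hypothetical stream of user IDs with many repeats (500 distinct users)
stream = [f"user{i % 500}" for i in range(100_000)]
print(f"Estimated distinct elements: {flajolet_martin_estimate(stream):.0f} (true: 500)")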

Example Workflow:

1. Data Stream: Assume a continuous stream of user IDs visiting a website.
2. Hashing: Each user ID is hashed.
3. Register Update: The hash determines which register to update based on
the leftmost 1-bit.
4. Estimate: After processing the stream, the registers are used to estimate
the total number of unique visitors.

Use Case: Real-Time Analytics

 Scenario: A website needs to track the number of unique visitors in real-time.
 Implementation: Use HyperLogLog to process the stream of user IDs,
providing an estimate of unique visitors at any given moment.
 Benefits: This method provides a quick, memory-efficient way to get
approximate counts without storing all user IDs.

Introduction to Spark Concept


Explanation: Apache Spark is an open-source, distributed computing system
designed for large-scale data processing. It provides an interface for
programming entire clusters with implicit data parallelism and fault tolerance.
Spark extends the MapReduce model to efficiently support more types of
computations, including interactive queries and stream processing.

Key Features of Spark:

1. Speed: Spark processes data in-memory, making it up to 100 times faster
than Hadoop MapReduce for certain applications.
2. Ease of Use: Spark provides high-level APIs in Java, Scala, Python, and R, and
an optimized engine that supports general execution graphs.
3. Advanced Analytics: Spark supports various advanced analytics functions,
including SQL queries, streaming data, machine learning, and graph
processing.
4. Unified Engine: Spark integrates all Big Data processing capabilities under
one unified framework.
Core Components:

1. Spark Core: The foundation of the Spark platform, providing essential
functionalities like task scheduling, memory management, fault recovery,
and storage system interaction.
2. Spark SQL: A module for structured data processing, allowing users to run
SQL queries alongside complex analytics algorithms.
3. Spark Streaming: Enables scalable and fault-tolerant stream processing of
live data streams.
4. MLlib: Spark's scalable machine learning library.
5. GraphX: For graph processing and computation.

Figure: Apache Spark Ecosystem

Spark Architecture and Components


Explanation: Spark's architecture is designed for scalability and efficiency,
enabling it to handle large-scale data processing tasks. It consists of a master
node and multiple worker nodes, managed by a cluster manager.

 Spark's architecture includes a driver program, cluster manager, worker
nodes, executors, and tasks.
 The system is designed for efficient, scalable, and fault-tolerant data
processing, supporting both batch and stream processing workloads.

By understanding Spark's concept and architecture, you can leverage its powerful
capabilities for large-scale data processing and real-time analytics.

Detailed Components:
1. Driver Program:
o Function: Controls the application, creating the SparkContext and
defining the operations on the data.
o Interaction: Sends tasks to the cluster manager and collects results
from executors.
o Example: A Python script that processes large datasets using Spark's
API.
2. Cluster Manager:
o Function: Allocates resources to Spark applications across the cluster.
o Types: YARN, Mesos, and Standalone.
o Example: YARN manages resource allocation in a Hadoop cluster,
scheduling tasks and balancing load.
3. Worker Nodes:
o Function: Execute tasks assigned by the driver and store data.
o Components: Run Spark executors and perform data processing.
o Example: Multiple servers running Spark executors to process
different parts of the dataset.
4. Executors:
o Function: Execute tasks and cache data for in-memory processing.
o Responsibility: Each executor runs multiple tasks and returns results
to the driver.
o Example: An executor might read a partition of data from HDFS,
process it, and keep the results in memory for fast access.
5. Tasks:
o Function: The smallest unit of work in Spark.
o Execution: Each task processes a partition of the data, performing
operations like map, filter, and reduce.
o Example: A map task might transform data elements, while a reduce
task aggregates results.

Workflow Example:

1. Submit Spark Application: The user submits a Spark application to the
cluster manager.
2. SparkContext Creation: The driver program initializes the SparkContext.
3. Resource Allocation: The cluster manager allocates resources across
worker nodes.
4. Task Execution: The driver sends tasks to executors on worker nodes.
5. Processing and Storage: Executors process the data and store intermediate
results in memory.
6. Result Collection: Executors return the final results to the driver program.
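The classic word-count job is a compact way to see this workflow end to end: the driver builds the lineage through transformations, the action ships tasks to the executors, and the results come back to the driver. This is a hedged local-mode sketch; the input path is a placeholder.

from pyspark import SparkContext

# Driver program: creates the SparkContext, which talks to the cluster manager
sc = SparkContext("local[*]", "WordCountWorkflow")

# Transformations only build a lineage; nothing runs until an action is called
lines = sc.textFile("path/to/input.txt")          # placeholder input path
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Action: tasks are shipped to executors, results come back to the driver
for word, count in counts.take(10):
    print(word, count)

sc.stop()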

Spark Installation
Explanation: Installing Apache Spark involves several steps, including setting up
the necessary environment, downloading Spark, and configuring it for your
specific use case. Below is a detailed guide on how to install Spark on a local
machine and in a cluster environment.

Prerequisites:

1. Java Development Kit (JDK): Spark requires Java 8 or higher.
2. Python (Optional): For using PySpark (Python API for Spark).
3. Scala (Optional): For using the Scala API.

Local Installation (Standalone Mode)

Step 1: Install Java

Ensure Java is installed and set up on your system.

Windows:

1. Download JDK from the Oracle website or OpenJDK.
2. Install the JDK and set the JAVA_HOME environment variable:
o Right-click on 'This PC' > 'Properties' > 'Advanced system settings' >
'Environment Variables'.
o Add a new system variable JAVA_HOME with the path to the JDK (e.g.,
C:\Program Files\Java\jdk-11).
o Add %JAVA_HOME%\bin to the Path variable.

Step 2: Download and Install Apache Spark

1. Go to the Apache Spark download page.
2. Choose a Spark release and a package type (Pre-built for Hadoop).
3. Download the package and extract it to a directory of your choice.

Example for Spark 3.3.0:

Windows:
 Download the .tgz file and extract it using a tool like 7-Zip or WinRAR.
 Set the SPARK_HOME environment variable to the extracted directory.
o Add %SPARK_HOME%\bin to the Path variable.

Step 3: Install Hadoop (Optional)

Spark can run without Hadoop, but if you need Hadoop for your application:

1. Download Hadoop from the Apache Hadoop website.
2. Extract the package and set the HADOOP_HOME environment variable.

Step 4: Verify Spark Installation

Run the Spark shell to verify your installation:

spark-shell

You should see the Spark shell prompt, indicating Spark is installed correctly.
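If you installed PySpark as a Python package (pip install pyspark), you can run an equivalent check from Python instead of the Scala shell. This is a minimal sketch assuming a local-mode setup.

# Quick installation check from Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("InstallCheck").getOrCreate()
print("Spark version:", spark.version)

# A trivial job to confirm that executors can run tasks
print(spark.range(1_000_000).count())  # should print 1000000

spark.stop()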

Spark RDD (Resilient Distributed Dataset)


Explanation: Resilient Distributed Datasets (RDDs) are the fundamental data
structure of Apache Spark. RDDs are immutable, distributed collections of objects
that can be processed in parallel across a cluster. They provide fault tolerance
and efficient data processing capabilities, making them the core abstraction for
parallel processing in Spark.

Key Features of RDDs:

1. Immutability: Once created, an RDD cannot be changed. Instead,
transformations produce new RDDs.
2. Partitioning: Data in RDDs is divided into partitions, which are processed in
parallel.
3. Fault Tolerance: RDDs can recover lost data through lineage information,
which tracks the transformations applied to create the RDD.
4. Lazy Evaluation: Transformations on RDDs are not immediately executed.
Instead, they are recorded as a lineage graph, and execution is triggered
only when an action is performed.
5. In-Memory Computation: RDDs can be cached in memory for faster access
during repeated operations.
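A short PySpark sketch illustrating features 4 and 5 from the list above: the transformations only record lineage, the first action triggers execution and fills the cache, and the second action reuses the cached partitions. The numbers used here are illustrative.

from pyspark import SparkContext

sc = SparkContext("local", "LazyEvalExample")

rdd = sc.parallelize(range(1, 1_000_001))

# Transformations only record lineage; no computation happens yet
squared = rdd.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# Mark the RDD for in-memory caching before reusing it
evens.cache()

# The first action executes the whole lineage and populates the cache
print(evens.count())

# The second action reuses the cached partitions instead of recomputing
print(evens.take(5))

sc.stop()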

Creation of RDDs:
1. Parallelizing Collections: Creating an RDD from an existing collection in the
driver program.
2. Reading External Datasets: Creating an RDD by reading data from external
storage systems like HDFS, S3, or local file systems.

Example:

from pyspark import SparkContext

# Initialize SparkContext
sc = SparkContext("local", "RDD Example")

# Create an RDD from a collection
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Create an RDD by reading a text file
file_rdd = sc.textFile("path/to/file.txt")

Spark RDD Operations


RDD operations can be categorized into two types: Transformations and Actions.

Transformations

Transformations create a new RDD from an existing one by defining a lineage.
They are lazily evaluated, meaning the transformations are not executed until an
action is called.

Common Transformations:

1. map(func):
o Applies a function to each element of the RDD, creating a new RDD.
o Example:

rdd2 = rdd.map(lambda x: x * 2) # [2, 4, 6, 8, 10]

2. filter(func):
o Returns a new RDD containing only the elements that satisfy a
predicate function.
o Example:
rdd2 = rdd.filter(lambda x: x % 2 == 0) # [2, 4]

3. flatMap(func):
o Similar to map, but each input element can be mapped to multiple
output elements (flattening the result).
o Example:

rdd2 = rdd.flatMap(lambda x: [x, x * 2]) # [1, 2, 2, 4, 3, 6, 4, 8, 5, 10]

4. distinct():
o Returns a new RDD containing the distinct elements of the original
RDD.
o Example:

rdd2 = rdd.distinct()

5. union(otherRDD):
o Returns a new RDD containing all elements from both RDDs.
o Example:

rdd2 = rdd.union(other_rdd)

Actions

Actions trigger the execution of the transformations and return results. They
bring data back to the driver program or write it to an external storage system.

Common Actions:

1. collect():
o Returns all elements of the RDD as an array to the driver program.
o Example:

result = rdd.collect() # [1, 2, 3, 4, 5]

2. count():
o Returns the number of elements in the RDD.
o Example:

result = rdd.count() # 5

3. first():
o Returns the first element of the RDD.
o Example:

result = rdd.first() # 1

4. take(n):
o Returns the first n elements of the RDD.
o Example:

result = rdd.take(3) # [1, 2, 3]

5. reduce(func):
o Aggregates the elements of the RDD using the specified function.
o Example:

result = rdd.reduce(lambda a, b: a + b) # 15
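As a hedged wrap-up, the snippet below chains several of the transformations and actions shown above on the same rdd created earlier (sc.parallelize([1, 2, 3, 4, 5])), to emphasize that nothing executes until an action such as collect() or reduce() is called.

# Chain transformations (lazy) and then trigger them with actions
doubled_large = (rdd
                 .map(lambda x: x * 2)       # [2, 4, 6, 8, 10]
                 .filter(lambda x: x > 4))   # [6, 8, 10]

print(doubled_large.collect())                   # [6, 8, 10]
print(doubled_large.reduce(lambda a, b: a + b))  # 24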
