Spark Kafka

This document outlines the implementation of a Spark Structured Streaming application for real-time data processing, focusing on a rate-based streaming application and Kafka integration. The project aims to achieve consistent processing intervals of three minutes while addressing compatibility issues between Java and Spark versions. The successful execution of the application is confirmed through continuous micro-batch processing and proper integration with Kafka, demonstrating the core concepts of structured streaming.

Uploaded by

aya boumelha
CSC3331

Assignment #6 – Basic Near-Real-Time Processing
Using the Spark Structured Streaming API

By: Hajar Makhlouf
Aya Boumelha
Table of Contents
1. Introduction
2. Project Overview
2.1 Objectives
2.2 Technical Requirements
Table of Contribution
Question 1: Execution of the rate application
Question 2: Kafka Stream Source Implementation
Streaming Application Implementation
Execution and Verification
Resources
1. Introduction:
Apache Spark Structured Streaming represents a significant advancement in real-time
data processing frameworks (Apache Software Foundation, 2024). This technology enables
developers to process data streams using the same DataFrame API used for batch processing,
making it more intuitive and efficient (Expedia Group Tech, 2021). In this assignment, we
explore the implementation of a Spark Structured Streaming application, focusing on processing
continuous data streams and performing real-time computations.

2. Project Overview
2.1 Objectives

• Implementation of a rate-based streaming application using Spark Structured Streaming
• Integration with Kafka stream source for real-time data processing
• Achievement of consistent 3-minute processing intervals

2.2 Technical Requirements

• Scala version 2.11.12
• Java 8 compatibility
• Spark 2.4.8
• Kafka integration capabilities

Table of Contribution:
Contributors      Sections
Hajar Makhlouf    All
Aya Boumelha      All

Question 1: Execution of the rate application


For the first part of this assignment, we successfully implemented a rate-based streaming
application using Spark Structured Streaming. The implementation began with setting up the
development environment, which included configuring Scala version 2.11.12 and ensuring
compatibility with Java 8. We created a build.sbt file to manage our project dependencies,
specifically incorporating Spark SQL, Spark Streaming, and Spark Core libraries version 2.4.8.
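
A build.sbt along the following lines would declare the dependencies described above. This is a
minimal sketch, not the authors' actual file: the project name and version are hypothetical
placeholders, while the Scala and Spark versions are taken from the Technical Requirements.

```scala
// build.sbt (sketch; `name` and `version` are hypothetical placeholders)
name := "spark-structured-streaming-assignment"
version := "0.1.0"

// Scala version stated in the Technical Requirements
scalaVersion := "2.11.12"

val sparkVersion = "2.4.8"

// %% appends the Scala binary version (2.11) to each artifact name
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion,
  "org.apache.spark" %% "spark-sql"       % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)
```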
A crucial aspect of our implementation involved resolving an incompatibility between the
installed Java version and Apache Spark. Initially, our system was running Java 21.0.4, on
which Spark 2.4.8 does not run. The issue occurs because Spark 2.4.8 targets Java 8 and does
not support later Java releases (Oracle Corporation, 2024).

To resolve this, we implemented a systematic approach to change our Java environment.
First, we installed OpenJDK 8 on our Ubuntu system using the package manager. Then, using
the update-alternatives system, we configured our environment to use Java 8 instead of Java 21.
This change was verified by checking the Java version, which confirmed we were running
OpenJDK version 1.8.0_432.
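
The same verification can be made from Scala before Spark starts. The following is a sketch under
assumptions: the names JavaVersionCheck and isJava8 are hypothetical and not part of the original
project. It relies only on the standard java.version system property, which Java 8 reports in the
legacy 1.8.x form (for example 1.8.0_432), whereas Java 21 reports a string such as 21.0.4.

```scala
// Hypothetical helper (not from the original project): fail fast when the
// JVM is not Java 8, the version the report runs Spark 2.4.8 on.
object JavaVersionCheck {

  /** Returns true for Java 8 version strings such as "1.8.0_432". */
  def isJava8(version: String): Boolean =
    version == "1.8" || version.startsWith("1.8.")

  def main(args: Array[String]): Unit = {
    // Standard system property holding the running JVM's version string
    val version = System.getProperty("java.version")
    require(isJava8(version), s"Expected Java 8 but found Java $version")
    println(s"Running on Java $version")
  }
}
```

Calling such a check at the top of the application's main method would turn a version mismatch into
an immediate, readable error instead of a failure deep inside Spark's startup.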
The core streaming application was implemented in Scala, where we established a
SparkSession with streaming capabilities. Our application utilizes the rate source, which
generates data at a specified rate, providing a reliable stream of test data. The streaming query
processes this data in micro-batches, adding a computed result column to demonstrate real-time
data transformation.
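
The application described above could look roughly like the sketch below. It is not the authors'
code: the object name RateSourceSketch, the helper withResult, the rate of one row per second, and
the choice of value * 2 for the computed column are all assumptions made for illustration. The rate
source itself is built into Spark and emits rows with a timestamp column and a value column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical sketch of a rate-source streaming application (Spark 2.4.x API).
object RateSourceSketch {

  /** Adds a computed "result" column; value * 2 is an assumed transformation. */
  def withResult(df: DataFrame): DataFrame =
    df.withColumn("result", col("value") * 2)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RateSourceSketch")
      .master("local[*]") // run locally, using all available cores
      .getOrCreate()

    // The built-in rate source generates (timestamp, value) rows at a fixed rate
    val rateStream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "1") // assumed rate
      .load()

    // Print each micro-batch to the console as it is processed
    val query = withResult(rateStream).writeStream
      .format("console")
      .outputMode("append")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}
```

Keeping the transformation in its own function means it can be exercised on an ordinary batch
DataFrame as well, since Structured Streaming uses the same DataFrame API for both.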
The successful execution of our streaming application is evidenced by the continuous
processing of micro-batches, with each batch containing timestamp and value information. The
application maintains a steady state of processing, demonstrating the fundamental concepts of
structured streaming including continuous data ingestion, transformation, and output generation.
Question 2: Kafka Stream Source Implementation
The implementation began with configuring the build.sbt file to include necessary Kafka
dependencies:
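
The dependency listing did not survive in this copy of the report. For Spark 2.4.8, the Kafka source
for Structured Streaming is provided by the spark-sql-kafka-0-10 module, so the addition would
presumably resemble the following sketch (an assumption, not the authors' exact line):

```scala
// build.sbt addition (sketch): Kafka source for Structured Streaming.
// The version matches Spark 2.4.8; %% appends the Scala binary version (2.11).
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.8"
```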

Kafka Broker Setup


For the Kafka broker configuration, we utilized the conduktor/kafka-stack-docker-compose
repository (Conduktor, 2024). Following the Apache Kafka documentation (Apache Software
Foundation, 2024), the setup process involved cloning the repository and launching a
single-node Kafka cluster:
git clone https://github.com/conduktor/kafka-stack-docker-compose.git
cd kafka-stack-docker-compose
docker-compose -f zk-single-kafka-single.yml up -d

Streaming Application Implementation


The streamRateSource.scala file was modified to incorporate Kafka streaming capabilities:
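
The modified file is not reproduced in this copy of the report, so the following is only a sketch of
what a Kafka-backed version could look like. The broker address localhost:9092, the topic test-topic,
the 180-second interval, and the object name streamRateSource come from the report. The decode helper,
the TriggerInterval constant, the startingOffsets setting, and the console sink are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

// Sketch of a Kafka-source streaming application (Spark 2.4.x API).
// The object name matches the `sbt "runMain streamRateSource"` command.
object streamRateSource {

  // 3-minute micro-batch interval stated in the report
  val TriggerInterval = "180 seconds"

  /** Kafka delivers key and value as binary; cast both to strings (assumed step). */
  def decode(df: DataFrame): DataFrame =
    df.selectExpr("CAST(key AS STRING) AS key",
                  "CAST(value AS STRING) AS value",
                  "timestamp")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streamRateSource")
      .master("local[*]")
      .getOrCreate()

    // Subscribe to the topic on the local single-node broker
    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test-topic")
      .option("startingOffsets", "latest") // assumed; "latest" is also the streaming default
      .load()

    // Start one micro-batch every 180 seconds and print it to the console
    val query = decode(kafkaStream).writeStream
      .format("console")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime(TriggerInterval))
      .start()

    query.awaitTermination()
  }
}
```

With a processing-time trigger, Spark starts a new micro-batch at each interval boundary, which is
what produces the three-minute spacing between batches in the console output.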

Execution and Verification


The application was executed using sbt:
sbt "runMain streamRateSource"
Successful execution was confirmed by the console output (Apache Software Foundation, 2024).
The application established a connection to the Kafka broker at localhost:9092 and subscribed
to test-topic. It maintained the required processing interval of 180 seconds (3 minutes), and
the console output showed batches being set up and data flowing continuously through the
system. Multiple screenshots demonstrate the sustained operation of these components, and
their timestamps confirm the consistent three-minute processing intervals.
Resources:
Apache Software Foundation. (2024). Apache Spark Structured Streaming Programming Guide.
Apache Spark.
https://spark.apache.org/docs/2.4.8/structured-streaming-programming-guide.html

Apache Software Foundation. (2024). Apache Kafka Documentation. Apache Kafka.
https://kafka.apache.org/documentation/

Conduktor. (2024). Kafka Stack Docker Compose. GitHub Repository.
https://github.com/conduktor/kafka-stack-docker-compose

Expedia Group Tech. (2021). Apache Spark Structured Streaming First Streaming Example.
Medium.
https://medium.com/expedia-group-tech/apache-spark-structured-streaming-first-streaming-example-1-of-6-e8f3219748ef

Oracle Corporation. (2024). OpenJDK 8 Documentation. OpenJDK.
https://openjdk.org/projects/jdk8/

EPFL and Lightbend, Inc. (2024). Scala 2.11.12 API Documentation. Scala-Lang.
https://www.scala-lang.org/api/2.11.12/