
Spark Streaming Assignment

The assignment involves processing real-time advertisement data from a Kafka topic named ads_data using Spark Streaming. The objective is to perform window-based aggregation on the data, calculating total clicks, views, and average cost per view per ad_id, and then store the results in a Cassandra table. The submission requires the Spark Streaming application code and a report detailing the results and challenges encountered.

Spark Streaming Assignment: Real-Time Advertisement Data Aggregation

Objective: Process real-time advertisement data using Spark Streaming to gain
business insights and store the aggregated data into Cassandra.

Background:
You have been provided with a Kafka topic named ads_data that contains
advertisement data in the following format:

{
  "ad_id": "12345",
  "timestamp": "2023-08-23T12:01:05Z",
  "clicks": 5,
  "views": 10,
  "cost": 50.75
}
The goal is to process this real-time data, compute business insights using
window-based aggregation, and write the aggregated results into a Cassandra
table. The aggregation key is ad_id, and aggregated values should update
previous values in the Cassandra table.

Tasks:

● Kafka setup and mock data producer:
○ Set up Confluent Kafka on the cloud or locally.
○ Create a topic named ads_data.
○ Write a Python script that continuously publishes random mock data, in
the format shown above, to the Kafka topic in Avro-serialized form.
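A minimal sketch of the mock-data producer. It assumes the confluent-kafka package and a broker at localhost:9092 (both assumptions, not given in the assignment), and for brevity it sends JSON rather than Avro; a full solution would serialize with confluent_kafka's schema-registry Avro serializer instead.

```python
# Mock producer sketch for the ads_data topic.
# Assumption: a local Kafka broker at localhost:9092 and the
# confluent-kafka package; JSON is used here in place of Avro for brevity.
import json
import random
import time
from datetime import datetime, timezone

def make_mock_ad() -> dict:
    """Build one random record matching the assignment's data format."""
    views = random.randint(1, 100)
    return {
        "ad_id": str(random.randint(10000, 99999)),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "clicks": random.randint(0, views),  # clicks should not exceed views
        "views": views,
        "cost": round(random.uniform(1.0, 100.0), 2),
    }

if __name__ == "__main__":
    from confluent_kafka import Producer  # assumes confluent-kafka is installed
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    while True:
        record = make_mock_ad()
        producer.produce("ads_data", key=record["ad_id"],
                         value=json.dumps(record).encode("utf-8"))
        producer.poll(0)  # serve delivery callbacks
        time.sleep(1)     # one record per second
```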
● Reading Data from Kafka:
○ Set up a Spark Streaming application.
○ Use the Kafka connector to read data from the ads_data topic.
○ Parse and deserialize the incoming data into the appropriate
structure.
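The reading step can be sketched as below with Spark Structured Streaming. It assumes the spark-sql-kafka-0-10 package is on the classpath and that messages are JSON-encoded (matching the producer sketch above); an Avro producer would instead be decoded with pyspark.sql.avro.functions.from_avro.

```python
# Sketch: reading the ads_data topic with Spark Structured Streaming.
# Assumption: spark-sql-kafka-0-10 on the classpath, JSON-encoded values.

# Field names and Spark SQL types of one ad record, kept as plain data
# so the schema can be inspected without a running Spark session.
AD_FIELDS = [
    ("ad_id", "string"),
    ("timestamp", "timestamp"),
    ("clicks", "int"),
    ("views", "int"),
    ("cost", "double"),
]

def read_ads_stream(spark, bootstrap="localhost:9092"):
    """Return a streaming DataFrame of parsed ad events."""
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType

    schema = StructType()
    for name, dtype in AD_FIELDS:
        schema = schema.add(name, dtype)

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", bootstrap)
           .option("subscribe", "ads_data")
           .load())
    # Kafka exposes key/value as binary; decode the value and parse the JSON.
    return (raw.selectExpr("CAST(value AS STRING) AS json")
               .select(from_json(col("json"), schema).alias("ad"))
               .select("ad.*"))
```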

● Window-Based Aggregation:
○ Perform a window-based aggregation over a window duration
(e.g., 1 minute) and sliding interval (e.g., 30 seconds).
○ Aggregate the following:
■ Total clicks per ad_id.
■ Total views per ad_id.
■ Average cost per view for each ad_id.
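The aggregation step could look like the following sketch, assuming `ads` is the parsed streaming DataFrame from the previous step; the 2-minute watermark is an illustrative choice, not part of the assignment.

```python
# Sketch of the 1-minute window / 30-second slide aggregation.

def avg_cost_per_view(total_cost: float, total_views: int) -> float:
    """Total cost divided by total views; 0.0 when there are no views."""
    return total_cost / total_views if total_views else 0.0

def aggregate_ads(ads):
    """Windowed totals of clicks, views, and cost per ad_id."""
    from pyspark.sql.functions import col, expr, window
    from pyspark.sql.functions import sum as sum_

    return (ads
            .withWatermark("timestamp", "2 minutes")  # tolerate late events
            .groupBy(window(col("timestamp"), "1 minute", "30 seconds"),
                     col("ad_id"))
            .agg(sum_("clicks").alias("total_clicks"),
                 sum_("views").alias("total_views"),
                 sum_("cost").alias("total_cost"))
            # average cost per view = total cost / total views for the window
            .withColumn("avg_cost_per_view",
                        expr("total_cost / total_views")))
```

For the sample record above (cost 50.75, views 10), the average cost per view of a window containing only that record would be 5.075.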

● Write Aggregated Data to Cassandra:
○ For each ad_id, check if an entry already exists in the Cassandra
table.
○ If an entry exists, update the values:
■ Add new clicks/views to the existing counts.
■ Update the average cost per view.
○ If an entry doesn't exist, create a new row with the aggregated
values.
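The read-merge-write upsert can be sketched with a foreachBatch sink. The keyspace/table names (ads.ad_stats) and the use of the DataStax cassandra-driver are assumptions for illustration, not part of the assignment text.

```python
# Sketch of the upsert into Cassandra via foreachBatch.
# Assumption: keyspace "ads", table "ad_stats", DataStax cassandra-driver.

def merge_stats(old, new):
    """Combine an existing row (or None) with newly aggregated values.

    Both arguments are dicts with total_clicks, total_views, total_cost;
    avg_cost_per_view is recomputed from the merged totals.
    """
    if old is None:
        old = {"total_clicks": 0, "total_views": 0, "total_cost": 0.0}
    clicks = old["total_clicks"] + new["total_clicks"]
    views = old["total_views"] + new["total_views"]
    cost = old["total_cost"] + new["total_cost"]
    return {"total_clicks": clicks, "total_views": views, "total_cost": cost,
            "avg_cost_per_view": cost / views if views else 0.0}

def write_batch(batch_df, batch_id):
    """foreachBatch sink: upsert each aggregated row into Cassandra."""
    from cassandra.cluster import Cluster  # assumes a local Cassandra node
    session = Cluster(["127.0.0.1"]).connect("ads")
    for row in batch_df.collect():
        existing = session.execute(
            "SELECT total_clicks, total_views, total_cost "
            "FROM ad_stats WHERE ad_id = %s", (row.ad_id,)).one()
        merged = merge_stats(existing._asdict() if existing else None,
                             {"total_clicks": row.total_clicks,
                              "total_views": row.total_views,
                              "total_cost": row.total_cost})
        # Cassandra INSERT is an upsert on the primary key (ad_id).
        session.execute(
            "INSERT INTO ad_stats (ad_id, total_clicks, total_views, "
            "total_cost, avg_cost_per_view) VALUES (%s, %s, %s, %s, %s)",
            (row.ad_id, merged["total_clicks"], merged["total_views"],
             merged["total_cost"], merged["avg_cost_per_view"]))
```

The streaming query would then be started with something like `aggregated.writeStream.foreachBatch(write_batch).start()`.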

Submission:

Submit your Spark Streaming application code, along with a brief report
detailing the results and any challenges faced during the assignment.
