Day 72
Interview Question
Ganesh. R
Problem Statement: Message Count Analysis
Objective: To analyze and identify the top two senders with the highest number of messages
sent during August 2022 from a dataset of messages.
Background: You have a dataset containing message records. Each record includes information
such as the sender_id, message_id, and the date the message was sent (sent_date). The goal is
to extract insights regarding message activity during a specific month and year.
Dataset Description:
Table Name: messages
Columns:
sender_id (Integer): Unique identifier for each sender.
message_id (Integer): Unique identifier for each message.
sent_date (DateTime): Timestamp indicating when the message was sent.
Expected Output:
sender_id    count_messages
3601         4
2520         3
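The DataFrame creation that follows expects two variables, data and schema, which the post does not show. A minimal sketch of what they might look like is given here. The rows are hypothetical sample values chosen for illustration, not the original dataset, and the schema is written as a DDL-style string, which createDataFrame accepts in place of a StructType.

```python
from datetime import datetime

# Hypothetical sample rows (assumption, not the original data):
# each tuple is (sender_id, message_id, sent_date).
data = [
    (3601, 901, datetime(2022, 8, 3, 9, 15)),
    (3601, 922, datetime(2022, 8, 10, 14, 2)),
    (3601, 935, datetime(2022, 8, 17, 8, 40)),
    (3601, 948, datetime(2022, 8, 29, 19, 5)),
    (2520, 905, datetime(2022, 8, 4, 11, 30)),
    (2520, 931, datetime(2022, 8, 15, 16, 45)),
    (2520, 960, datetime(2022, 8, 31, 23, 59)),
    (4500, 902, datetime(2022, 8, 3, 10, 0)),
    (4500, 966, datetime(2022, 9, 1, 0, 0)),    # September 2022: outside the target month
    (3601, 743, datetime(2022, 6, 14, 12, 0)),  # June 2022: outside the target month
    (2520, 512, datetime(2021, 8, 9, 7, 20)),   # August 2021: right month, wrong year
]

# DDL-style schema string matching the column description above.
schema = "sender_id INT, message_id INT, sent_date TIMESTAMP"
```

The rows outside August 2022 are included on purpose, so that both the month and the year conditions of the filter have something to exclude.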
# Create DataFrame
df = spark.createDataFrame(data, schema=schema)
df.createOrReplaceTempView("messages")
%sql
WITH message_counts AS (
    SELECT
        sender_id,
        COUNT(message_id) AS count_messages
    FROM
        messages
    WHERE
        MONTH(sent_date) = 8
        AND YEAR(sent_date) = 2022
    GROUP BY
        sender_id
)
SELECT
    *
FROM
    message_counts
ORDER BY
    count_messages DESC
LIMIT
    2;
Explanation:
Filter: The WHERE clause keeps only the records sent in August 2022 (the filter method in the
DataFrame API).
GroupBy and Aggregation: GROUP BY groups the remaining records by sender_id, and
COUNT(message_id) counts the messages for each sender (groupBy and agg in the DataFrame API).
Order and Limit: Finally, ORDER BY sorts the results by count_messages in descending order,
and LIMIT 2 keeps only the top two senders.
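The filter, groupBy, agg, order, and limit steps named above can also be written with the PySpark DataFrame API. What follows is a minimal sketch under stated assumptions, not necessarily the author's original code: the function name top_senders and its parameters are hypothetical, and df is assumed to have the sender_id, message_id, and sent_date columns described earlier. PySpark is imported inside the function so the definition loads even where PySpark is not installed.

```python
def top_senders(df, year=2022, month=8, n=2):
    """Return the n senders with the most messages in the given month.

    Hypothetical helper: `df` is assumed to be a PySpark DataFrame with
    sender_id, message_id, and sent_date columns.
    """
    # Deferred import: PySpark is only needed when the function is called.
    from pyspark.sql import functions as F

    return (
        df
        # Filter: keep only rows from the requested year and month.
        .filter((F.year("sent_date") == year) & (F.month("sent_date") == month))
        # GroupBy and aggregation: one row per sender with a message count.
        .groupBy("sender_id")
        .agg(F.count("message_id").alias("count_messages"))
        # Order and limit: highest counts first, keep the top n.
        .orderBy(F.col("count_messages").desc())
        .limit(n)
    )
```

Calling top_senders(df).show() should list the same two senders as the SQL query. If two senders have the same count, their relative order is not guaranteed unless a second sort key such as sender_id is added.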
Make sure to replace df with the name of the DataFrame in your own PySpark environment.