Day 72

The document outlines a problem statement requiring a SQL query to analyze Facebook posts from users who posted at least twice in 2021, focusing on the days between their first and last posts. It also describes a message count analysis for August 2022 to identify the top two senders based on message activity. The document includes example data, schema definitions, and SQL/PySpark code snippets for implementation.


Scenario Based Interview Question

Ganesh. R
Problem Statement

Given a table of Facebook posts, for each user who posted at least twice in 2021, write a query to find the number of days between that user's first post of the year and last post of the year. Output the user and the number of days between each user's first and last post.
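
A minimal Spark SQL sketch for this problem, assuming a posts table with user_id, post_id, and post_date columns (the table and column names are assumptions; they are not defined in this document):

-- Assumed schema: posts(user_id, post_id, post_date); not given in this document
SELECT
    user_id,
    DATEDIFF(MAX(post_date), MIN(post_date)) AS days_between
FROM posts
WHERE YEAR(post_date) = 2021          -- keep only 2021 posts
GROUP BY user_id
HAVING COUNT(post_id) >= 2;           -- users who posted at least twice

Here DATEDIFF(end, start) returns the whole number of days between a user's first and last 2021 post, and the HAVING clause enforces the "posted at least twice" condition.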
Input Table Data

message_id  sender_id  receiver_id  content                                         sent_date
901         3601       4500         You up?                                         08/03/2022 16:43:00
743         3601       8752         Let's take this offline                         06/14/2022 14:30:00
888         3601       7855         DataLemur has awesome user base!                08/12/2022 08:45:00
100         2520       6987         Send this out now!                              08/16/2021 00:35:00
898         2520       9630         Are you ready for your upcoming presentation?   08/13/2022 14:35:00
990         2520       8520         Maybe it was done by the automation process.    08/19/2022 06:30:00
819         2310       4500         What's the status on this?                      07/10/2022 15:55:00
922         3601       4500         Get on the call                                 08/10/2022 17:03:00
942         2520       3561         How much do you know about Data Science?       08/17/2022 13:44:00
966         3601       7852         Meet me in five!                                08/17/2022 02:20:00
902         4500       3601         Only if you're buying                           08/03/2022 06:50:00

Output Table

sender_id count_messages
3601 4
2520 3

Problem Statement: Message Count Analysis

Objective: To analyze and identify the top two senders with the highest number of messages
sent during August 2022 from a dataset of messages.

Background: You have a dataset containing message records. Each record includes information
such as the sender_id, message_id, and the date the message was sent (sent_date). The goal is
to extract insights regarding message activity during a specific month and year.

Dataset Description:

Table Name: messages
Columns:
- message_id (Integer): Unique identifier for each message.
- sender_id (Integer): Unique identifier for each sender.
- receiver_id (Integer): Unique identifier for the receiver.
- content (String): Text of the message.
- sent_date (DateTime): Timestamp indicating when the message was sent.
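
For reference, an equivalent Spark SQL table definition (a sketch only; the document actually builds this data as a DataFrame and registers it as a temp view, so no CREATE TABLE is needed):

-- Sketch of the messages table implied by the DataFrame schema below
CREATE TABLE messages (
    message_id  INT,
    sender_id   INT,
    receiver_id INT,
    content     STRING,
    sent_date   TIMESTAMP
);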

from pyspark.sql.types import (
    StructType,
    StructField,
    IntegerType,
    StringType,
    TimestampType,
)
from datetime import datetime

# Define schema for the DataFrame
schema = StructType(
    [
        StructField("message_id", IntegerType(), True),
        StructField("sender_id", IntegerType(), True),
        StructField("receiver_id", IntegerType(), True),
        StructField("content", StringType(), True),
        StructField("sent_date", TimestampType(), True),
    ]
)

# Data as a list of tuples
data = [
    (901, 3601, 4500, "You up?", datetime.strptime("08/03/2022 16:43:00", "%m/%d/%Y %H:%M:%S")),
    (743, 3601, 8752, "Let's take this offline", datetime.strptime("06/14/2022 14:30:00", "%m/%d/%Y %H:%M:%S")),
    (888, 3601, 7855, "DataLemur has awesome user base!", datetime.strptime("08/12/2022 08:45:00", "%m/%d/%Y %H:%M:%S")),
    (100, 2520, 6987, "Send this out now!", datetime.strptime("08/16/2021 00:35:00", "%m/%d/%Y %H:%M:%S")),
    (898, 2520, 9630, "Are you ready for your upcoming presentation?", datetime.strptime("08/13/2022 14:35:00", "%m/%d/%Y %H:%M:%S")),
    (990, 2520, 8520, "Maybe it was done by the automation process.", datetime.strptime("08/19/2022 06:30:00", "%m/%d/%Y %H:%M:%S")),
    (819, 2310, 4500, "What's the status on this?", datetime.strptime("07/10/2022 15:55:00", "%m/%d/%Y %H:%M:%S")),
    (922, 3601, 4500, "Get on the call", datetime.strptime("08/10/2022 17:03:00", "%m/%d/%Y %H:%M:%S")),
    (942, 2520, 3561, "How much do you know about Data Science?", datetime.strptime("08/17/2022 13:44:00", "%m/%d/%Y %H:%M:%S")),
    (966, 3601, 7852, "Meet me in five!", datetime.strptime("08/17/2022 02:20:00", "%m/%d/%Y %H:%M:%S")),
    (902, 4500, 3601, "Only if you're buying", datetime.strptime("08/03/2022 06:50:00", "%m/%d/%Y %H:%M:%S")),
]

# Create DataFrame
df = spark.createDataFrame(data, schema=schema)

# Display the DataFrame
df.display()

# Register a temp view so the data can be queried with SQL
df.createOrReplaceTempView("messages")

%sql
WITH message_counts AS (
SELECT
sender_id,
COUNT(message_id) AS count_messages
FROM
messages
WHERE
MONTH(sent_date) = 8
AND YEAR(sent_date) = 2022
GROUP BY
sender_id
)
SELECT
*
FROM
message_counts
ORDER BY
count_messages DESC
LIMIT
2;

from pyspark.sql.functions import col, count, month, year

# Load your DataFrame (assuming it's already available as `df`)
# messages_df = spark.read.csv("path_to_your_data.csv", header=True, inferSchema=True)

# Filter and group the DataFrame
message_counts_df = (
    df.filter((month(col("sent_date")) == 8) & (year(col("sent_date")) == 2022))
    .groupBy("sender_id")
    .agg(count("message_id").alias("count_messages"))
)

# Order by count_messages and limit the results
top_senders_df = message_counts_df.orderBy(col("count_messages").desc()).limit(2)

# Show the results
top_senders_df.display()

Explanation:

Filter: The filter method keeps only records sent in August 2022.

GroupBy and Aggregation: The groupBy method groups rows by sender_id, and agg counts the messages per sender.

Order and Limit: Finally, the results are sorted in descending order of count_messages and limited to the top two senders.

Make sure to replace df with the actual DataFrame variable name you have in your PySpark
environment.
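
Note that LIMIT 2 (and limit(2) in the PySpark version) breaks ties arbitrarily: if two senders shared the second-highest count, only one of them would be returned. One variant (a sketch, not part of the original solution) preserves ties by ranking with a window function instead:

-- Tie-aware variant: DENSE_RANK keeps every sender tied at the top two counts
WITH message_counts AS (
    SELECT sender_id, COUNT(message_id) AS count_messages
    FROM messages
    WHERE MONTH(sent_date) = 8 AND YEAR(sent_date) = 2022
    GROUP BY sender_id
)
SELECT sender_id, count_messages
FROM (
    SELECT *, DENSE_RANK() OVER (ORDER BY count_messages DESC) AS rnk
    FROM message_counts
) ranked
WHERE rnk <= 2;

This may return more than two rows when counts are tied, which is often what interviewers expect you to call out.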
IF YOU FOUND THIS POST USEFUL, PLEASE SAVE IT.

Ganesh. R

THANK YOU
For Your Support

I appreciate your support on my account, and I will never stop sharing knowledge.

rganesh203 (Ganesh R)