Day 72

The document outlines a problem statement requiring a SQL query to analyze Facebook posts from users who posted at least twice in 2021, focusing on the days between their first and last posts. It also describes a message count analysis for August 2022 to identify the top two senders based on message activity. The document includes example data, schema definitions, and SQL/PySpark code snippets for implementation.


Scenario Based Interview Question

Ganesh. R
Problem Statement

Given a table of Facebook posts, for each user who posted at least twice in 2021, write a query to find the number of days between that user's first post of the year and last post of the year. Output the user and the number of days between each user's first and last post.
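
A minimal Spark SQL sketch for this problem, assuming a posts table with user_id, post_id, and post_date columns (the table and column names are assumptions; they are not defined in this document):

-- Assumed schema: posts(user_id, post_id, post_date); not given in this document
SELECT
    user_id,
    DATEDIFF(MAX(post_date), MIN(post_date)) AS days_between
FROM posts
WHERE YEAR(post_date) = 2021          -- keep only 2021 posts
GROUP BY user_id
HAVING COUNT(post_id) >= 2;           -- users who posted at least twice

Here DATEDIFF(end, start) returns the whole number of days between a user's first and last 2021 post, and the HAVING clause enforces the "posted at least twice" condition.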
Input Table Data

message_id  sender_id  receiver_id  content                                         sent_date
901         3601       4500         You up?                                         08/03/2022 16:43:00
743         3601       8752         Let's take this offline                         06/14/2022 14:30:00
888         3601       7855         DataLemur has awesome user base!                08/12/2022 08:45:00
100         2520       6987         Send this out now!                              08/16/2021 00:35:00
898         2520       9630         Are you ready for your upcoming presentation?   08/13/2022 14:35:00
990         2520       8520         Maybe it was done by the automation process.    08/19/2022 06:30:00
819         2310       4500         What's the status on this?                      07/10/2022 15:55:00
922         3601       4500         Get on the call                                 08/10/2022 17:03:00
942         2520       3561         How much do you know about Data Science?       08/17/2022 13:44:00
966         3601       7852         Meet me in five!                                08/17/2022 02:20:00
902         4500       3601         Only if you're buying                           08/03/2022 06:50:00

Output Table

sender_id count_messages
3601 4
2520 3

Problem Statement: Message Count Analysis

Objective: To analyze and identify the top two senders with the highest number of messages
sent during August 2022 from a dataset of messages.

Background: You have a dataset containing message records. Each record includes information
such as the sender_id, message_id, and the date the message was sent (sent_date). The goal is
to extract insights regarding message activity during a specific month and year.

Dataset Description:

Table Name: messages
Columns:
- message_id (Integer): Unique identifier for each message.
- sender_id (Integer): Unique identifier for each sender.
- receiver_id (Integer): Unique identifier for the receiver.
- content (String): Text of the message.
- sent_date (DateTime): Timestamp indicating when the message was sent.
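
For reference, an equivalent Spark SQL table definition (a sketch only; the document actually builds this data as a DataFrame and registers it as a temp view, so no CREATE TABLE is needed):

-- Sketch of the messages table implied by the DataFrame schema below
CREATE TABLE messages (
    message_id  INT,
    sender_id   INT,
    receiver_id INT,
    content     STRING,
    sent_date   TIMESTAMP
);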

from pyspark.sql.types import (
    StructType,
    StructField,
    IntegerType,
    StringType,
    TimestampType,
)
from datetime import datetime

# Define schema for the DataFrame
schema = StructType(
    [
        StructField("message_id", IntegerType(), True),
        StructField("sender_id", IntegerType(), True),
        StructField("receiver_id", IntegerType(), True),
        StructField("content", StringType(), True),
        StructField("sent_date", TimestampType(), True),
    ]
)

# Data as a list of tuples
data = [
    (901, 3601, 4500, "You up?", datetime.strptime("08/03/2022 16:43:00", "%m/%d/%Y %H:%M:%S")),
    (743, 3601, 8752, "Let's take this offline", datetime.strptime("06/14/2022 14:30:00", "%m/%d/%Y %H:%M:%S")),
    (888, 3601, 7855, "DataLemur has awesome user base!", datetime.strptime("08/12/2022 08:45:00", "%m/%d/%Y %H:%M:%S")),
    (100, 2520, 6987, "Send this out now!", datetime.strptime("08/16/2021 00:35:00", "%m/%d/%Y %H:%M:%S")),
    (898, 2520, 9630, "Are you ready for your upcoming presentation?", datetime.strptime("08/13/2022 14:35:00", "%m/%d/%Y %H:%M:%S")),
    (990, 2520, 8520, "Maybe it was done by the automation process.", datetime.strptime("08/19/2022 06:30:00", "%m/%d/%Y %H:%M:%S")),
    (819, 2310, 4500, "What's the status on this?", datetime.strptime("07/10/2022 15:55:00", "%m/%d/%Y %H:%M:%S")),
    (922, 3601, 4500, "Get on the call", datetime.strptime("08/10/2022 17:03:00", "%m/%d/%Y %H:%M:%S")),
    (942, 2520, 3561, "How much do you know about Data Science?", datetime.strptime("08/17/2022 13:44:00", "%m/%d/%Y %H:%M:%S")),
    (966, 3601, 7852, "Meet me in five!", datetime.strptime("08/17/2022 02:20:00", "%m/%d/%Y %H:%M:%S")),
    (902, 4500, 3601, "Only if you're buying", datetime.strptime("08/03/2022 06:50:00", "%m/%d/%Y %H:%M:%S")),
]

# Create DataFrame
df = spark.createDataFrame(data, schema=schema)

# Display the DataFrame
df.display()

# Register a temp view so the data can be queried with SQL
df.createOrReplaceTempView("messages")

%sql
WITH message_counts AS (
SELECT
sender_id,
COUNT(message_id) AS count_messages
FROM
messages
WHERE
MONTH(sent_date) = 8
AND YEAR(sent_date) = 2022
GROUP BY
sender_id
)
SELECT
*
FROM
message_counts
ORDER BY
count_messages DESC
LIMIT
2;

from pyspark.sql.functions import col, count, month, year

# Load your DataFrame (assuming it's already available as `df`)
# messages_df = spark.read.csv("path_to_your_data.csv", header=True, inferSchema=True)

# Filter and group the DataFrame
message_counts_df = (
    df.filter((month(col("sent_date")) == 8) & (year(col("sent_date")) == 2022))
    .groupBy("sender_id")
    .agg(count("message_id").alias("count_messages"))
)

# Order by count_messages and limit the results
top_senders_df = message_counts_df.orderBy(col("count_messages").desc()).limit(2)

# Show the results
top_senders_df.display()

Explanation:

Filter: The filter method keeps only records sent in August 2022.

GroupBy and Aggregation: The groupBy method groups rows by sender_id, and agg counts the messages per sender.

Order and Limit: Finally, the results are sorted in descending order of count_messages and limited to the top two senders.

Make sure to replace df with the actual DataFrame variable name you have in your PySpark
environment.
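
Note that LIMIT 2 (and limit(2) in the PySpark version) breaks ties arbitrarily: if two senders shared the second-highest count, only one of them would be returned. One variant (a sketch, not part of the original solution) preserves ties by ranking with a window function instead:

-- Tie-aware variant: DENSE_RANK keeps every sender tied at the top two counts
WITH message_counts AS (
    SELECT sender_id, COUNT(message_id) AS count_messages
    FROM messages
    WHERE MONTH(sent_date) = 8 AND YEAR(sent_date) = 2022
    GROUP BY sender_id
)
SELECT sender_id, count_messages
FROM (
    SELECT *, DENSE_RANK() OVER (ORDER BY count_messages DESC) AS rnk
    FROM message_counts
) ranked
WHERE rnk <= 2;

This may return more than two rows when counts are tied, which is often what interviewers expect you to call out.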
IF YOU FOUND THIS POST USEFUL, PLEASE SAVE IT.

Ganesh. R

THANK YOU
For Your Support

I appreciate your support on my account, and I will never stop sharing knowledge.

rganesh203 (Ganesh R)