
SEEKHO BIGDATA INSTITUTE

1. Email Domain Extraction and Analysis

Problem: You have a DataFrame user_data with columns: user_id, email, signup_date, and
last_login.

 Extract the domain from the email column using regexp_extract.

 Convert the domain to uppercase.

 Filter out users whose email domain ends with ".org".

 Group by the extracted domain and calculate:

o The number of users per domain.

o The average number of days between signup_date and last_login.

 Drop duplicate user_ids before the group by.

Sample Data:

user_id email signup_date last_login

U001 [email protected] 2024-01-01 2024-03-01

U002 [email protected] 2024-02-15 2024-03-10

U003 [email protected] 2024-03-01 2024-03-20
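
A minimal PySpark sketch of one possible solution (the DataFrame and column names come from the problem statement; the regex, aliases, and date arithmetic are illustrative assumptions):

```python
from pyspark.sql import functions as F

result = (
    user_data
    .dropDuplicates(["user_id"])                                     # dedupe before grouping
    .withColumn("domain", F.regexp_extract("email", r"@(\S+)$", 1))  # text after '@'
    .withColumn("domain", F.upper("domain"))
    .filter(~F.col("domain").endswith(".ORG"))                       # domain is uppercase here
    .groupBy("domain")
    .agg(
        F.count("user_id").alias("users_per_domain"),
        F.avg(F.datediff("last_login", "signup_date")).alias("avg_days_active"),
    )
)
```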

2. Phone Number Parsing and Region Analysis

Problem: You have a DataFrame phone_data with columns: customer_id, phone_number, signup_date, and region.

 Use regexp_extract to extract the country code, area code, and local number from the phone_number column.

 Convert the region column to lowercase.

 Filter customers whose country code is not +1 (USA).

 Group by region and calculate:

o The count of customers per region.

o The most frequent area code within each region.

 Drop rows where the phone number is duplicated.


Sample Data:

customer_id phone_number signup_date region

C001 +1-415-5551234 2024-01-10 WEST

C002 +44-20-79461234 2024-02-20 EAST

C003 +91-22-23451234 2024-03-15 NORTH
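
One way to sketch this in PySpark (the regex assumes the "+CC-AAA-NNNN" shape of the sample rows; the window step is a common idiom for a per-group mode):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

parsed = (
    phone_data
    .dropDuplicates(["phone_number"])
    .withColumn("country_code", F.regexp_extract("phone_number", r"^(\+\d+)-", 1))
    .withColumn("area_code", F.regexp_extract("phone_number", r"^\+\d+-(\d+)-", 1))
    .withColumn("local_number", F.regexp_extract("phone_number", r"-(\d+)$", 1))
    .withColumn("region", F.lower("region"))
    .filter(F.col("country_code") != "+1")
)

customers_per_region = parsed.groupBy("region").agg(F.count("*").alias("customers"))

# Most frequent area code per region: count pairs, keep rank 1 within each region.
pair_counts = parsed.groupBy("region", "area_code").agg(F.count("*").alias("n"))
w = Window.partitionBy("region").orderBy(F.desc("n"))
top_area_code = pair_counts.withColumn("rk", F.row_number().over(w)).filter(F.col("rk") == 1)
```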

3. Product Code Analysis and Transformation

Problem: You have a DataFrame product_data with columns: product_id, product_code, release_date, and category.

 Split the product_code into three separate columns: brand_code, category_code, and
serial_number.

 Convert the category column to title case using initcap.

 Filter out products where brand_code starts with "X" and serial_number ends with "99".

 Group by category_code and calculate:

o The sum of serial numbers per category.

o The count of distinct brand_codes per category.

 Drop duplicates based on product_code.

Sample Data:

product_id product_code release_date category

P001 A123-456-789 2024-04-01 electronics

P002 B234-567-891 2024-05-10 home_appliance

P003 X345-678-999 2024-06-15 furniture
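
A possible sketch (splitting on "-" per the sample codes; casting serial_number to int for the sum is an assumption):

```python
from pyspark.sql import functions as F

parts = F.split("product_code", "-")   # "A123-456-789" -> ["A123", "456", "789"]
result = (
    product_data
    .dropDuplicates(["product_code"])
    .withColumn("brand_code", parts.getItem(0))
    .withColumn("category_code", parts.getItem(1))
    .withColumn("serial_number", parts.getItem(2))
    .withColumn("category", F.initcap("category"))
    .filter(~(F.col("brand_code").startswith("X") & F.col("serial_number").endswith("99")))
    .groupBy("category_code")
    .agg(
        F.sum(F.col("serial_number").cast("int")).alias("serial_sum"),
        F.countDistinct("brand_code").alias("distinct_brands"),
    )
)
```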

4. URL Path Analysis and Extraction

Problem: You have a DataFrame web_logs with columns: session_id, url, timestamp, and user_agent.

 Use split to extract the protocol, domain, and path from the url column.


 Convert the user_agent to lowercase.

 Filter out records where the path starts with "/admin".

 Group by domain and calculate:

o The average length of paths.

o The count of sessions per domain.

 Drop duplicates based on session_id.

Sample Data:

session_id url timestamp user_agent

S001 https://example.com/home 2024-07-01 10:00:00 Chrome/90.0

S002 http://sample.org/contact 2024-07-02 11:30:00 Firefox/85.0

S003 https://example.com/admin 2024-07-03 12:45:00 Safari/14.1
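
A sketch using split only (it assumes single-segment paths like the sample URLs; deeper paths would need a different reassembly):

```python
from pyspark.sql import functions as F

result = (
    web_logs
    .dropDuplicates(["session_id"])
    .withColumn("protocol", F.split("url", "://").getItem(0))
    .withColumn("rest", F.split("url", "://").getItem(1))          # e.g. "example.com/home"
    .withColumn("domain", F.split("rest", "/").getItem(0))
    .withColumn("path", F.concat(F.lit("/"), F.split("rest", "/").getItem(1)))
    .withColumn("user_agent", F.lower("user_agent"))
    .filter(~F.col("path").startswith("/admin"))
    .groupBy("domain")
    .agg(
        F.avg(F.length("path")).alias("avg_path_length"),
        F.count("session_id").alias("sessions"),
    )
)
```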

5. Address Parsing and Region Analysis

Problem: You have a DataFrame address_data with columns: address_id, full_address, city, state, and
zipcode.

 Use regexp_extract to extract the street number, street name, and apartment number from full_address.

 Convert the state column to uppercase.

 Filter out addresses where city starts with "New" and zipcode ends with "00".

 Group by state and calculate:

o The total count of distinct cities in each state.

o The minimum and maximum street numbers for each state.

 Drop duplicate addresses based on full_address.

Sample Data:

address_id full_address city state zipcode

A001 123 Main St Apt 4B New York NY 10001


A002 456 Elm St San Francisco CA 94102

A003 789 Oak St Apt 12C Chicago IL 60603
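
A sketch whose regexes assume the "123 Main St Apt 4B" shape from the sample (street numbers are cast to int for min/max):

```python
from pyspark.sql import functions as F

result = (
    address_data
    .dropDuplicates(["full_address"])
    .withColumn("street_number", F.regexp_extract("full_address", r"^(\d+)\s", 1))
    .withColumn("street_name", F.regexp_extract("full_address", r"^\d+\s+(.*?)(?:\s+Apt|$)", 1))
    .withColumn("apartment_number", F.regexp_extract("full_address", r"Apt\s+(\w+)", 1))
    .withColumn("state", F.upper("state"))
    .filter(~(F.col("city").startswith("New") & F.col("zipcode").endswith("00")))
    .groupBy("state")
    .agg(
        F.countDistinct("city").alias("distinct_cities"),
        F.min(F.col("street_number").cast("int")).alias("min_street_number"),
        F.max(F.col("street_number").cast("int")).alias("max_street_number"),
    )
)
```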

6. Invoice Description Parsing and Analysis

Problem: You have a DataFrame invoice_data with columns: invoice_id, description, quantity,
unit_price, and total_amount.

 Split the description into product_name, color, and size using split.

 Convert the product_name to uppercase.

 Filter out invoices where the quantity is less than 5 and the total_amount is greater than
1000.

 Group by color and size and calculate:

o The total quantity sold.

o The average unit_price for each color-size combination.

 Drop duplicate invoices based on description.

Sample Data:

invoice_id description quantity unit_price total_amount

I001 T-shirt Red Large 10 20 200

I002 Jeans Blue Medium 3 50 150

I003 Jacket Black Small 5 100 500
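
A sketch assuming the three-token "name color size" shape of the sample descriptions:

```python
from pyspark.sql import functions as F

parts = F.split("description", " ")   # "T-shirt Red Large" -> ["T-shirt", "Red", "Large"]
result = (
    invoice_data
    .dropDuplicates(["description"])
    .withColumn("product_name", F.upper(parts.getItem(0)))
    .withColumn("color", parts.getItem(1))
    .withColumn("size", parts.getItem(2))
    .filter(~((F.col("quantity") < 5) & (F.col("total_amount") > 1000)))
    .groupBy("color", "size")
    .agg(
        F.sum("quantity").alias("total_quantity"),
        F.avg("unit_price").alias("avg_unit_price"),
    )
)
```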

7. Order ID and Product Code Parsing


Problem: You have a DataFrame order_data with columns: order_id, product_code, order_date, and
delivery_status.

 Use substr to extract the first 3 characters of order_id and the last 4 characters of
product_code.

 Convert the delivery_status to lowercase.


 Filter out orders where order_id starts with "ORD" and delivery_status ends with
"delivered".

 Group by delivery_status and calculate:

o The sum of the numeric part of the order_id.

o The count of distinct product_codes for each status.

 Drop duplicate orders based on order_id.

Sample Data:

order_id product_code order_date delivery_status

ORD001 P123-4567 2024-08-01 Delivered

ORD002 P234-5678 2024-08-05 Pending

ORD003 P345-6789 2024-08-10 Delivered
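
A sketch; substring with a negative start counts from the end, and the numeric part of order_id is pulled with a regex before summing:

```python
from pyspark.sql import functions as F

result = (
    order_data
    .dropDuplicates(["order_id"])
    .withColumn("order_prefix", F.substring("order_id", 1, 3))        # first 3 chars
    .withColumn("code_suffix", F.substring("product_code", -4, 4))    # last 4 chars
    .withColumn("delivery_status", F.lower("delivery_status"))
    .filter(~(F.col("order_id").startswith("ORD")
              & F.col("delivery_status").endswith("delivered")))
    .groupBy("delivery_status")
    .agg(
        F.sum(F.regexp_extract("order_id", r"(\d+)$", 1).cast("int")).alias("order_id_sum"),
        F.countDistinct("product_code").alias("distinct_product_codes"),
    )
)
```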

8. Flight Number Parsing and Status Analysis

Problem: You have a DataFrame flight_data with columns: flight_id, flight_number, departure_date,
and status.

 Use instr to find the position of the airline code in the flight_number.

 Extract the airline code using substr based on the instr result.

 Convert the status to uppercase.

 Filter out flights where the status is not "ON TIME" and the airline code is "AA".

 Group by airline code and calculate:

o The count of flights for each airline.

o The minimum and maximum flight numbers for each airline.

 Drop duplicates based on flight_number.

Sample Data:

flight_id flight_number departure_date status

F001 AA123 2024-09-01 ON TIME

F002 DL456 2024-09-05 DELAYED

F003 UA789 2024-09-10 CANCELED
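
A sketch; instr locates a literal substring (shown here for the known code "AA"), while the generic extraction treats the leading two letters as the airline code, which matches the sample flight numbers:

```python
from pyspark.sql import functions as F

result = (
    flight_data
    .dropDuplicates(["flight_number"])
    .withColumn("aa_pos", F.instr("flight_number", "AA"))            # 1 if "AA" leads, 0 if absent
    .withColumn("airline_code", F.substring("flight_number", 1, 2))  # leading 2-letter code
    .withColumn("status", F.upper("status"))
    .filter(~((F.col("status") != "ON TIME") & (F.col("airline_code") == "AA")))
    .groupBy("airline_code")
    .agg(
        F.count("*").alias("flights"),
        F.min("flight_number").alias("min_flight_number"),
        F.max("flight_number").alias("max_flight_number"),
    )
)
```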

9. Stock Ticker Parsing and Performance Analysis

Problem: You have a DataFrame stock_data with columns: stock_id, ticker_symbol, trade_date,
closing_price, and volume.

 Use regexp_extract to parse out the company code and market code from the ticker_symbol.

 Convert the ticker_symbol to uppercase.

 Filter out records where the company code starts with "Z" and the closing_price is less than
50.

 Group by market_code and calculate:

o The total volume traded.

o The average closing_price for each market.

 Drop duplicate records based on ticker_symbol.

Sample Data:

stock_id ticker_symbol trade_date closing_price volume

S001 AAPL.NASDAQ 2024-10-01 150 10000

S002 MSFT.NASDAQ 2024-10-05 200 15000

S003 TSLA.NASDAQ 2024-10-10 250 20000
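
A sketch; the ticker splits at the dot ("AAPL.NASDAQ"), and uppercasing happens before extraction so both derived columns are already uppercase:

```python
from pyspark.sql import functions as F

result = (
    stock_data
    .dropDuplicates(["ticker_symbol"])
    .withColumn("ticker_symbol", F.upper("ticker_symbol"))
    .withColumn("company_code", F.regexp_extract("ticker_symbol", r"^([^.]+)\.", 1))
    .withColumn("market_code", F.regexp_extract("ticker_symbol", r"\.(\w+)$", 1))
    .filter(~(F.col("company_code").startswith("Z") & (F.col("closing_price") < 50)))
    .groupBy("market_code")
    .agg(
        F.sum("volume").alias("total_volume"),
        F.avg("closing_price").alias("avg_closing_price"),
    )
)
```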

10. Document Parsing and Word Count

Problem: You have a DataFrame document_data with columns: doc_id, content, author, and
publish_date.

 Split the content column into individual words using split.

 Convert the author column to title case using initcap.

 Filter out documents where the first word starts with "A" or "An".

 Group by author and calculate:

o The total number of words in each document.

o The average word count per author.

 Drop duplicate records based on content.

Sample Data:

doc_id content author publish_date

D001 A quick brown fox jumps over the lazy dog John Doe 2024-11-01

D002 The quick brown fox Jane Smith 2024-11-05

D003 An apple a day keeps the doctor away Alice Johnson 2024-11-10
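
A sketch; word counts come from size() over the split array, and the first-word test follows the sample values ("A", "An"):

```python
from pyspark.sql import functions as F

docs = (
    document_data
    .dropDuplicates(["content"])
    .withColumn("words", F.split("content", r"\s+"))
    .withColumn("author", F.initcap("author"))
    .filter(~F.col("words").getItem(0).isin("A", "An"))
    .withColumn("word_count", F.size("words"))
)

per_author = docs.groupBy("author").agg(
    F.sum("word_count").alias("total_words"),
    F.avg("word_count").alias("avg_word_count"),
)
```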

11. User Agent Parsing and Device Analysis

Problem: You have a DataFrame web_sessions with columns: session_id, user_agent, login_time, and
location.

 Use regexp_extract to parse the browser name and version from the user_agent column.

 Convert the location column to lowercase.

 Filter out sessions where the browser is not "Chrome" and the version is less than "90".

 Group by location and calculate:

o The total number of sessions.

o The count of distinct browser versions used in each location.

 Drop duplicate records based on session_id.

Sample Data:

session_id user_agent login_time location

S001 Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0.4472.124 2024-12-01 08:00:00 USA

S002 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Firefox/89.0 2024-12-02 09:15:00 UK

S003 Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/89.0.4389.114 2024-12-03 10:30:00 CANADA
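
A sketch; the browser list and major-version regex are assumptions based on the sample user_agent strings:

```python
from pyspark.sql import functions as F

result = (
    web_sessions
    .dropDuplicates(["session_id"])
    .withColumn("browser", F.regexp_extract("user_agent", r"(Chrome|Firefox|Safari)/", 1))
    .withColumn("major_version",
                F.regexp_extract("user_agent", r"(?:Chrome|Firefox|Safari)/(\d+)", 1).cast("int"))
    .withColumn("location", F.lower("location"))
    .filter(~((F.col("browser") != "Chrome") & (F.col("major_version") < 90)))
    .groupBy("location")
    .agg(
        F.count("*").alias("sessions"),
        F.countDistinct("major_version").alias("distinct_versions"),
    )
)
```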

12. URL Query Parameter Extraction and Traffic Analysis

Problem: You have a DataFrame click_data with columns: click_id, url, click_time, and referrer.

 Use regexp_extract to extract the query parameters (e.g., ?id=123&source=google) from the url column.

 Convert the referrer column to uppercase.

 Filter out clicks where the query parameter source is not "google".

 Group by referrer and calculate:

o The total number of clicks.

o The count of distinct query parameters in each referrer.

 Drop duplicate records based on click_id.

Sample Data:

click_id url click_time referrer

C001 https://example.com/page?id=123&source=google 2024-12-05 11:45:00 GOOGLE

C002 https://sample.org/info?id=456&source=bing 2024-12-06 12:00:00 BING

C003 https://example.com/search?id=789&source=google 2024-12-07 13:30:00 YAHOO
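
A sketch; the whole query string is captured after "?", and the source parameter is pulled separately, so "filter out clicks where source is not google" becomes keeping source == "google":

```python
from pyspark.sql import functions as F

result = (
    click_data
    .dropDuplicates(["click_id"])
    .withColumn("query_params", F.regexp_extract("url", r"\?(.*)$", 1))
    .withColumn("source", F.regexp_extract("url", r"[?&]source=([^&]+)", 1))
    .withColumn("referrer", F.upper("referrer"))
    .filter(F.col("source") == "google")
    .groupBy("referrer")
    .agg(
        F.count("*").alias("clicks"),
        F.countDistinct("query_params").alias("distinct_query_strings"),
    )
)
```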

13. File Path Analysis and Extension Extraction

Problem: You have a DataFrame file_data with columns: file_id, file_path, upload_date, and
size_in_mb.

 Use split to extract the file name and extension from the file_path.

 Convert the file extension to lowercase.

 Filter out files where the extension is not "pdf" and the size_in_mb is greater than 100.

 Group by extension and calculate:

o The average file size for each extension.

o The count of distinct file names for each extension.

 Drop duplicate records based on file_path.

Sample Data:

file_id file_path upload_date size_in_mb

F001 /docs/report1.pdf 2024-12-10 50

F002 /images/picture1.jpg 2024-12-11 150

F003 /docs/manual.docx 2024-12-12 75
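
A sketch; element_at(..., -1) takes the last array element, so it works for any path depth:

```python
from pyspark.sql import functions as F

result = (
    file_data
    .dropDuplicates(["file_path"])
    .withColumn("file_name", F.element_at(F.split("file_path", "/"), -1))
    .withColumn("extension", F.lower(F.element_at(F.split("file_name", r"\."), -1)))
    .filter(~((F.col("extension") != "pdf") & (F.col("size_in_mb") > 100)))
    .groupBy("extension")
    .agg(
        F.avg("size_in_mb").alias("avg_size_mb"),
        F.countDistinct("file_name").alias("distinct_file_names"),
    )
)
```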

14. Product Review Analysis and Sentiment Extraction


Problem: You have a DataFrame review_data with columns: review_id, product_id, review_text,
rating, and review_date.

 Use regexp_extract to identify and extract any sentiment keywords (e.g., "excellent", "poor") from the review_text.

 Convert the extracted sentiment keywords to title case using initcap.

 Filter out reviews where the rating is less than 3 and the sentiment contains "poor".

 Group by sentiment and calculate:

o The average rating for each sentiment.

o The count of distinct product_ids associated with each sentiment.

 Drop duplicate reviews based on review_text.

Sample Data:

review_id product_id review_text rating review_date

R001 P001 This product is excellent 5 2024-12-15

R002 P002 The quality is poor 2 2024-12-16

R003 P003 Excellent build quality 4 2024-12-17

15. Hashtag Extraction and Post Analysis

Problem: You have a DataFrame social_posts with columns: post_id, content, likes, shares, and
post_date.

 Use regexp_extract to extract hashtags from the content column.

 Convert the hashtags to lowercase.

 Filter out posts that contain hashtags starting with "ad" and have more than 1000 likes.

 Group by hashtags and calculate:

o The total number of likes and shares for each hashtag.

o The count of distinct post_ids for each hashtag.

 Drop duplicate posts based on content.

Sample Data:

post_id content likes shares post_date

P001 Loving this new product! #NewProduct #Excited 500 100 2024-12-20

P002 Amazing experience #Travel #Adventures 1200 250 2024-12-21

P003 Can't wait for the launch! #Upcoming 300 75 2024-12-22
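
A sketch; plain regexp_extract returns only the first hashtag, so this version uses the regexp_extract_all SQL function (Spark 3.1+) plus explode to get one row per hashtag:

```python
from pyspark.sql import functions as F

tags = (
    social_posts
    .dropDuplicates(["content"])
    .withColumn("hashtag", F.explode(F.expr(r"regexp_extract_all(content, '#(\\w+)', 1)")))
    .withColumn("hashtag", F.lower("hashtag"))
    .filter(~(F.col("hashtag").startswith("ad") & (F.col("likes") > 1000)))
    .groupBy("hashtag")
    .agg(
        F.sum("likes").alias("total_likes"),
        F.sum("shares").alias("total_shares"),
        F.countDistinct("post_id").alias("distinct_posts"),
    )
)
```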

16. Product Code Validation and Cleaning


Problem: You have a DataFrame inventory_data with columns: inventory_id, product_code, stock,
location, and last_updated.

 Use regexp_extract to validate that the product_code follows a specific pattern (e.g., "ABC-1234-X").

 Convert valid product_codes to uppercase.

 Filter out products where the stock is less than 10 and the product_code does not follow the
valid pattern.

 Group by location and calculate:

o The total stock available in each location.

o The count of distinct valid product_codes in each location.

 Drop duplicate records based on product_code.

Sample Data:

inventory_id product_code stock location last_updated

I001 abc-1234-x 50 Warehouse 2024-12-25

I002 xyz-5678-y 5 Store 2024-12-26

I003 ABC-1234-X 100 Store 2024-12-27
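
A sketch; the pattern mirrors the "ABC-1234-X" example, and only rows passing validation contribute to the distinct-valid-code count:

```python
from pyspark.sql import functions as F

pattern = r"^[A-Za-z]{3}-\d{4}-[A-Za-z]$"   # mirrors "ABC-1234-X"
result = (
    inventory_data
    .dropDuplicates(["product_code"])
    .withColumn("is_valid", F.col("product_code").rlike(pattern))
    .withColumn("product_code",
                F.when(F.col("is_valid"), F.upper("product_code"))
                 .otherwise(F.col("product_code")))
    .filter(~((F.col("stock") < 10) & ~F.col("is_valid")))
    .groupBy("location")
    .agg(
        F.sum("stock").alias("total_stock"),
        F.countDistinct(F.when(F.col("is_valid"), F.col("product_code"))).alias("valid_codes"),
    )
)
```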

17. Customer Address Normalization and Validation

Problem: You have a DataFrame customer_addresses with columns: customer_id, address, city, state,
and postal_code.

 Use regexp_replace to standardize address abbreviations (e.g., "St." to "Street", "Ave." to "Avenue").

 Convert the city and state columns to title case using initcap.

 Filter out addresses where postal_code is null or less than 5 digits.

 Group by state and calculate:

o The count of distinct cities per state.

o The total number of customers in each state.

 Drop duplicate addresses based on address.

Sample Data:

customer_id address city state postal_code

C001 123 Main St. new york ny 10001

C002 456 Elm Ave. san francisco ca 94102

C003 789 Oak St. chicago il 60603
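
A sketch; two chained regexp_replace calls cover the listed abbreviations, and postal_code is treated as a string when checking its length:

```python
from pyspark.sql import functions as F

result = (
    customer_addresses
    .dropDuplicates(["address"])
    .withColumn("address", F.regexp_replace("address", r"\bSt\.", "Street"))
    .withColumn("address", F.regexp_replace("address", r"\bAve\.", "Avenue"))
    .withColumn("city", F.initcap("city"))
    .withColumn("state", F.initcap("state"))
    .filter(F.col("postal_code").isNotNull() & (F.length("postal_code") >= 5))
    .groupBy("state")
    .agg(
        F.countDistinct("city").alias("distinct_cities"),
        F.count("customer_id").alias("customers"),
    )
)
```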

18. Order Data Cleaning and Aggregation

Problem: You have a DataFrame order_data with columns: order_id, order_number, order_date,
quantity, price_per_unit, and total_price.

 Use regexp_replace to remove any special characters from order_number.

 Convert the order_number to uppercase.

 Filter out orders where total_price is not equal to quantity * price_per_unit.

 Group by order_date and calculate:

o The total quantity sold on each date.

o The total revenue generated each day.

 Drop duplicate orders based on order_number.

Sample Data:

order_id order_number order_date quantity price_per_unit total_price

O001 ORD-001 2024-12-30 10 15 150

O002 ORD@002 2024-12-31 5 20 100

O003 ORD-003 2025-01-01 8 25 200

19. Employee Role Parsing and Analysis

Problem: You have a DataFrame employee_data with columns: employee_id, name, role,
department, and salary.

 Use split to separate the role into title and level (e.g., "Manager - Senior").

 Convert the name column to title case using initcap.

 Filter out employees where the salary is less than 50000 and the role level is "Junior".

 Group by department and calculate:

o The average salary for each department.

o The count of distinct roles in each department.

 Drop duplicate records based on name.

Sample Data:

employee_id name role department salary

E001 JOHN DOE Manager - Senior Sales 75000

E002 JANE SMITH Analyst - Junior IT 45000


E003 ALICE JONES Developer - Senior Engineering 90000

20. Sensor Data Processing and Analysis

Problem: You have a DataFrame sensor_data with columns: sensor_id, reading_value, timestamp,
and unit.

 Use regexp_replace to standardize units (e.g., "C" to "Celsius", "F" to "Fahrenheit").

 Convert the unit column to lowercase.

 Filter out readings where the value is below a specific threshold (e.g., 10) and the unit is
"celsius".

 Group by unit and calculate:

o The average reading_value for each unit.

o The count of distinct sensor_ids for each unit.

 Drop duplicate readings based on sensor_id.

Sample Data:

sensor_id reading_value timestamp unit

S001 15 2024-12-28 08:00:00 C

S002 65 2024-12-28 09:00:00 F

S003 8 2024-12-28 10:00:00 celsius

21. Transaction Code Analysis and Time-Based Filtering

Problem: You have a DataFrame transactions with columns: transaction_id, transaction_code, transaction_time, amount, and currency.

 Use regexp_extract to parse out the transaction type and code from the transaction_code.

 Convert the currency to uppercase.

 Filter transactions where the transaction type starts with "TR" and the amount is greater
than 500.

 Group by currency and transaction_code to calculate:

o The total amount for each currency-code combination.

o The count of distinct transaction_ids for each combination.

 Drop duplicate transactions based on transaction_code.

Sample Data:

transaction_id transaction_code transaction_time amount currency

T001 TRX-12345 2024-12-28 11:00:00 600 usd

T002 PAY-98765 2024-12-28 12:30:00 200 eur

T003 TRD-54321 2024-12-28 13:45:00 800 usd

22. Email Domain Parsing and Categorization

Problem: You have a DataFrame email_data with columns: email_id, email_address, sign_up_date,
and status.

 Use regexp_extract to extract the domain from the email_address.

 Convert the status to title case using initcap.

 Filter out emails where the domain is not "gmail.com" and the status is "Inactive".

 Group by domain and calculate:

o The count of distinct email_ids for each domain.

o The earliest sign_up_date for each domain.

 Drop duplicate records based on email_address.

Sample Data:

email_id email_address sign_up_date status

E001 [email protected] 2024-12-30 Active

E002 [email protected] 2024-12-31 Inactive

E003 [email protected] 2025-01-01 Active

23. Product Code Splitting and Inventory Validation

Problem: You have a DataFrame inventory with columns: item_id, product_code, quantity, category,
and last_checked.

 Use split to separate the product_code into prefix, code number, and suffix.

 Convert the category to lowercase.

 Filter out items where the quantity is less than 50 and the product_code suffix is "X".

 Group by category and calculate:

o The total quantity in each category.

o The count of distinct product_codes for each category.

 Drop duplicate records based on product_code.

Sample Data:

item_id product_code quantity category last_checked

I001 ABC-1234-X 60 electronics 2024-12-28

I002 DEF-5678-Y 40 clothing 2024-12-29

I003 GHI-9012-Z 100 gadgets 2024-12-30

24. Phone Number Extraction and User Segmentation

Problem: You have a DataFrame user_data with columns: user_id, phone_number, sign_up_date,
and subscription_type.

 Use regexp_extract to extract the country code from the phone_number.

 Convert the subscription_type to uppercase.

 Filter out users where the country code is not "+1" and the subscription_type is "BASIC".

 Group by country_code and calculate:

o The total number of users per country code.

o The count of distinct subscription_types for each country code.

 Drop duplicate records based on phone_number.

Sample Data:

user_id phone_number sign_up_date subscription_type

U001 +1-555-1234567 2024-12-28 PREMIUM

U002 +44-20-1234567 2024-12-29 BASIC

U003 +91-9876543210 2024-12-30 PREMIUM

25. IP Address Segmentation and Traffic Analysis


Problem: You have a DataFrame network_data with columns: log_id, ip_address, access_time, and
data_transferred_mb.

 Use split to extract the network and host portions of the ip_address.

 Convert the network portion to title case using initcap.

 Filter out logs where the data_transferred_mb is less than 100 and the network starts with
"192".

 Group by network and calculate:

o The total data_transferred_mb for each network.

o The count of distinct log_ids for each network.

 Drop duplicate logs based on ip_address.

Sample Data:

log_id ip_address access_time data_transferred_mb

L001 192.168.1.1 2024-12-28 08:00:00 150

L002 10.0.0.1 2024-12-28 09:30:00 50

L003 172.16.0.1 2024-12-28 10:45:00 200

26. Customer Name Parsing and Demographic Analysis

Problem: You have a DataFrame customer_data with columns: customer_id, full_name, age, city, and
state.

 Use split to extract the first name and last name from the full_name.

 Convert the city to lowercase.

 Filter out customers where the age is less than 30 and the city name contains "new".

 Group by state and calculate:

o The average age of customers in each state.

o The count of distinct city names in each state.

 Drop duplicate records based on full_name.

Sample Data:

customer_id full_name age city state

C001 John Doe 35 new york ny

C002 Jane Smith 25 los angeles ca

C003 Alice Johnson 45 chicago il

27. Order Number Splitting and Revenue Analysis

Problem: You have a DataFrame sales_data with columns: order_id, order_number, order_date,
item_quantity, item_price, and total_value.

 Use split to separate the order_number into order prefix and order number.

 Convert the order_number prefix to uppercase.

 Filter out orders where the total_value is not equal to item_quantity * item_price.

 Group by order_date and calculate:

o The total item_quantity sold each day.

o The total revenue generated each day.

 Drop duplicate orders based on order_number.

Sample Data:

order_id order_number order_date item_quantity item_price total_value

O001 ORD-001 2024-12-30 10 15 150

O002 ORD-002 2024-12-31 5 20 100

O003 ORD-003 2025-01-01 8 25 200

29. Product Name Parsing and Category Analysis

Problem: You have a DataFrame product_data with columns: product_id, product_name, price,
category, and release_date.

 Use split to extract the main product name and the variant from product_name.

 Convert the product_name to uppercase.

 Filter products where the price is less than 50 and the product name contains "SPECIAL".

 Group by category and calculate:

o The average price in each category.

o The count of distinct product_names in each category.

 Drop duplicate products based on product_id.

Sample Data:

product_id product_name price category release_date

P001 Special Widget A 45 Gadgets 2024-11-15

P002 Standard Widget B 55 Gadgets 2024-12-01


P003 Special Widget C 25 Toys 2024-12-05

30. Customer Review Parsing and Rating Analysis

Problem: You have a DataFrame review_data with columns: review_id, customer_name, review_text,
rating, and review_date.

 Use regexp_extract to find keywords like "excellent", "good", "poor" in the review_text.

 Convert review_text to lowercase.

 Filter reviews where the rating is less than 3 and the review text contains "poor".

 Group by rating and calculate:

o The total count of reviews for each rating.

o The earliest review_date for each rating.

 Drop duplicate reviews based on review_id.

Sample Data:

review_id customer_name review_text rating review_date

R001 Alice Excellent product, good 5 2024-12-01

R002 Bob Poor quality, not good 2 2024-12-05

R003 Charlie Good product, worth it 4 2024-12-10

31. Event Log Parsing and Time-Based Aggregation

Problem: You have a DataFrame event_logs with columns: event_id, event_description, event_type,
event_time, and user_id.

 Use regexp_extract to parse out the event category from event_description.

 Convert event_type to uppercase.

 Filter events where the event_type starts with "ERROR" and the event_time is after "2024-12-01".

 Group by event_category (extracted from event_description) and calculate:

o The total number of events for each category.

o The average time between events in each category.

 Drop duplicate events based on event_id.

Sample Data:

event_id event_description event_type event_time user_id

E001 Login failed - Invalid ERROR 2024-12-05 08:00:00 U001

E002 File upload successful INFO 2024-12-06 09:30:00 U002

E003 Database connection lost ERROR 2024-12-07 10:45:00 U003

32. Transaction Date Parsing and Weekly Analysis

Problem: You have a DataFrame transactions with columns: transaction_id, transaction_amount, transaction_date, and store_location.

 Use date_format to extract the week number and year from transaction_date.

 Convert store_location to uppercase.

 Filter transactions where transaction_amount is greater than 100 and the store_location
ends with "STORE".

 Group by week_number and year and calculate:

o The total transaction_amount for each week.

o The average transaction_amount for each week.

 Drop duplicate transactions based on transaction_id.

Sample Data:

transaction_id transaction_amount transaction_date store_location

T001 150 2024-12-03 Main Store

T002 90 2024-12-05 Secondary Store

T003 200 2024-12-07 Main Store
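
A sketch; date_format's week pattern is unreliable across Spark versions (Spark 3 rejects 'w'), so weekofyear() and year() stand in for it:

```python
from pyspark.sql import functions as F

weekly = (
    transactions
    .dropDuplicates(["transaction_id"])
    .withColumn("week_number", F.weekofyear("transaction_date"))
    .withColumn("year", F.year("transaction_date"))
    .withColumn("store_location", F.upper("store_location"))
    .filter((F.col("transaction_amount") > 100) & F.col("store_location").endswith("STORE"))
    .groupBy("year", "week_number")
    .agg(
        F.sum("transaction_amount").alias("total_amount"),
        F.avg("transaction_amount").alias("avg_amount"),
    )
)
```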

33. User Activity Log Analysis

Problem: You have a DataFrame activity_logs with columns: log_id, user_id, activity_description,
activity_date, and duration_seconds.

 Use split to parse out the activity type and details from activity_description.

 Convert activity_description to lowercase.

 Filter logs where the duration_seconds is less than 30 and the activity_type starts with
"LOGIN".

 Group by user_id and calculate:

o The total duration_seconds spent per user.

o The count of distinct activity_types per user.

 Drop duplicate logs based on log_id.

Sample Data:

log_id user_id activity_description activity_date duration_seconds

L001 U001 Login - Success 2024-12-05 20

L002 U002 Logout - Failure 2024-12-06 40

L003 U001 Login - Success 2024-12-07 25

34. Product Price Changes and Vendor Analysis

Problem: You have a DataFrame price_changes with columns: change_id, product_id, old_price,
new_price, change_date, and vendor.

 Use regexp_replace to standardize vendor names (e.g., "Vendor A" to "VENDOR_A").

 Convert the change_date to month and year.

 Filter out price changes where the price_difference (new_price minus old_price) is less than 10 and the vendor does not contain "VENDOR".

 Group by vendor and calculate:

o The average price_difference for each vendor.

o The total number of price changes for each vendor.

 Drop duplicate changes based on change_id.

Sample Data:

change_id product_id old_price new_price change_date vendor

C001 P001 50 55 2024-12-01 Vendor A

C002 P002 70 65 2024-12-02 Vendor B

C003 P003 90 85 2024-12-03 Vendor C

35. Website Traffic Analysis

Problem: You have a DataFrame website_traffic with columns: session_id, user_id, page_url,
visit_duration, and visit_date.

 Use regexp_extract to parse out the page category from page_url (e.g., "home", "product").

 Convert the visit_date to the day of the week.

 Filter out sessions where the visit_duration is less than 60 seconds and the page category is
"product".

 Group by day_of_week and calculate:

o The average visit_duration for each day.

o The count of distinct user_ids visiting each day.

 Drop duplicate sessions based on session_id.

Sample Data:

session_id user_id page_url visit_duration visit_date

S001 U001 /home 120 2024-12-05

S002 U002 /product/12345 30 2024-12-05

S003 U003 /product/67890 90 2024-12-06

37. Session Activity and Time Analysis

Problem: You have a DataFrame session_logs with columns: session_id, user_id, page_url,
session_start, session_end, and activity_duration.

 Use date_format to extract the day of the week from session_start.

 Extract the base URL from page_url using split.

 Filter sessions where the activity_duration is greater than 300 seconds and the base URL is
"home".

 Group by user_id and calculate:

o The total activity_duration per user.

o The maximum session duration per user.

 Drop duplicate sessions based on session_id.

Sample Data:

session_id user_id page_url session_start session_end activity_duration

S001 U001 http://example.com/home 2024-12-01 08:00:00 2024-12-01 09:00:00 360

S002 U002 http://example.com/product 2024-12-01 09:30:00 2024-12-01 10:00:00 200

S003 U001 http://example.com/home 2024-12-01 10:30:00 2024-12-01 11:30:00 450

38. Order Processing and Delivery Time Analysis

Problem: You have a DataFrame orders with columns: order_id, order_date, delivery_date,
product_id, quantity, and price_per_unit.

 Use date_format to extract the month and year from order_date and delivery_date.

 Convert product_id to uppercase.

 Filter orders where the delivery time (difference between delivery_date and order_date) is
more than 5 days.

 Group by month_year (month and year from order_date) and calculate:

o The total quantity ordered in each month.

o The total revenue (quantity * price_per_unit) in each month.

 Drop duplicate orders based on order_id.

Sample Data:

order_id order_date delivery_date product_id quantity price_per_unit

O001 2024-12-01 2024-12-10 P001 10 15

O002 2024-12-02 2024-12-08 P002 5 20

O003 2024-12-05 2024-12-15 P003 8 25
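
A sketch; delivery time is datediff(delivery_date, order_date), and revenue is computed inline as quantity * price_per_unit:

```python
from pyspark.sql import functions as F

monthly = (
    orders
    .dropDuplicates(["order_id"])
    .withColumn("month_year", F.date_format("order_date", "yyyy-MM"))
    .withColumn("product_id", F.upper("product_id"))
    .filter(F.datediff("delivery_date", "order_date") > 5)
    .groupBy("month_year")
    .agg(
        F.sum("quantity").alias("total_quantity"),
        F.sum(F.col("quantity") * F.col("price_per_unit")).alias("total_revenue"),
    )
)
```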

39. Customer Feedback and Sentiment Analysis

Problem: You have a DataFrame feedback with columns: feedback_id, customer_name, feedback_text, feedback_date, and sentiment_score.

 Use regexp_extract to identify keywords like "excellent", "average", "poor" in the feedback_text.

 Convert feedback_text to lowercase.

 Filter feedback where the sentiment_score is below 3 and the feedback text contains "poor".

 Group by customer_name and calculate:

o The average sentiment_score per customer.

o The count of feedback entries per customer.

 Drop duplicate feedback based on feedback_id.

Sample Data:

feedback_id customer_name feedback_text feedback_date sentiment_score

F001 Alice Excellent service 2024-12-05 5

F002 Bob Poor quality service 2024-12-06 2

F003 Charlie Average experience 2024-12-07 3

40. Product Availability and Sales Analysis

Problem: You have a DataFrame product_availability with columns: product_id, availability_date, stock_count, price, and vendor.

 Use date_format to extract the quarter from availability_date.

 Extract the product category from product_id using regexp_extract.

 Filter out records where the stock_count is less than 20 and the product_id does not start
with "PRD".

 Group by vendor and quarter and calculate:

o The total stock_count per vendor and quarter.

o The average price per vendor and quarter.

 Drop duplicate records based on product_id.

Sample Data:

product_id availability_date stock_count price vendor

P001 2024-12-01 30 15 Vendor A

P002 2024-12-05 10 20 Vendor B

P003 2024-12-10 50 25 Vendor C

41. Employee Performance and Feedback Analysis

Problem: You have a DataFrame performance_reviews with columns: review_id, employee_id, review_date, performance_score, feedback_text, and department.

 Use regexp_extract to extract key performance metrics from feedback_text.

 Convert feedback_text to lowercase.

 Filter out reviews where the performance_score is below 3 and the feedback_text contains
"poor".

 Group by department and calculate:

o The average performance_score per department.

o The total count of reviews per department.

 Drop duplicate reviews based on review_id.

Sample Data:

review_id employee_id review_date performance_score feedback_text department

R001 E001 2024-12-01 4 Excellent work Sales

R002 E002 2024-12-05 2 Poor performance HR

R003 E003 2024-12-10 3 Average performance IT

42. Customer Purchase and Frequency Analysis

Problem: You have a DataFrame customer_purchases with columns: purchase_id, customer_id, purchase_date, amount_spent, and product_category.

 Use date_format to extract the month from purchase_date.

 Extract the product type from product_category using split.

 Filter out purchases where the amount_spent is less than 50 and the product type is
"electronics".

 Group by customer_id and month and calculate:

o The total amount_spent per customer per month.

o The count of distinct purchase_ids per customer per month.

 Drop duplicate purchases based on purchase_id.

Sample Data:

purchase_id customer_id purchase_date amount_spent product_category

P001 C001 2024-12-01 100 electronics

P002 C002 2024-12-02 40 home appliances

P003 C001 2024-12-10 75 electronics

43. Employee Training and Attendance Analysis

Problem: You have a DataFrame training_sessions with columns: session_id, employee_id,
session_date, session_duration, session_topic, and attendance_status.

 Use date_format to extract the week number from session_date.

 Convert session_topic to lowercase.

 Filter sessions where the session_duration is less than 1 hour and the attendance_status is
"Absent".

 Group by employee_id and week_number and calculate:

o The total session_duration per employee per week.

o The count of attended sessions per employee per week.

 Drop duplicate sessions based on session_id.

Sample Data:

session_id employee_id session_date session_duration session_topic attendance_status

T001 E001 2024-12-01 1.5 Leadership Present

T002 E002 2024-12-02 0.5 Technical Absent

T003 E001 2024-12-05 1.0 Soft Skills Present

44. Sales Data and Regional Analysis

Problem: You have a DataFrame sales_data with columns: sales_id, sales_date, region,
sales_amount, and product_id.

 Use date_format to extract the year and quarter from sales_date.

 Extract the region code from region using regexp_extract.

 Filter out sales where the sales_amount is below 100 and the region_code does not start
with "N".

 Group by region and year_quarter and calculate:

o The total sales_amount per region per quarter.

o The average sales_amount per product in each region.

 Drop duplicate sales records based on sales_id.

Sample Data:

sales_id sales_date region sales_amount product_id

S001 2024-12-01 North 150 P001


S002 2024-12-02 South 90 P002

S003 2024-12-03 North 200 P003

45. Invoice and Payment Analysis

Problem: You have a DataFrame invoices with columns: invoice_id, customer_id, invoice_date,
amount_due, payment_status, and payment_date.

 Use date_format to extract the day and month from invoice_date.

 Convert payment_status to title case using initcap.

 Filter out invoices where the amount_due is greater than 500 and the payment_status is
"Unpaid".

 Group by payment_status and month and calculate:

o The total amount_due per status and month.

o The count of invoices per status and month.

 Drop duplicate invoices based on invoice_id.

Sample Data:

invoice_id customer_id invoice_date amount_due payment_status payment_date

I001 C001 2024-12-01 600 Unpaid

I002 C002 2024-12-05 300 Paid 2024-12-06

I003 C003 2024-12-10 750 Unpaid

46. User Login Frequency and Activity Analysis


Problem: You have a DataFrame user_logins with columns: login_id, user_id, login_timestamp,
activity_type, and session_duration.

 Use date_format to extract the month and day of the week from login_timestamp.

 Extract the hour of the day from login_timestamp using hour.

 Filter logins where activity_type contains "purchase" and session_duration exceeds 1 hour.

 Group by user_id and day_of_week and calculate:

o The total session_duration per user per day of the week.

o The count of logins per user per day of the week.

 Drop duplicate logins based on login_id.

Sample Data:

login_id user_id login_timestamp activity_type session_duration

L001 U001 2024-12-01 08:30:00 purchase 75

L002 U002 2024-12-01 09:15:00 browse 45

L003 U001 2024-12-01 10:00:00 purchase 120
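
A sketch; the sample durations look like minutes, so "exceeds 1 hour" is read as > 60 (an assumption worth confirming):

```python
from pyspark.sql import functions as F

logins = (
    user_logins
    .dropDuplicates(["login_id"])
    .withColumn("month", F.date_format("login_timestamp", "MM"))
    .withColumn("day_of_week", F.date_format("login_timestamp", "EEEE"))
    .withColumn("hour_of_day", F.hour("login_timestamp"))
    .filter(F.col("activity_type").contains("purchase") & (F.col("session_duration") > 60))
    .groupBy("user_id", "day_of_week")
    .agg(
        F.sum("session_duration").alias("total_duration"),
        F.count("login_id").alias("logins"),
    )
)
```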

47. Employee Work Hours and Department Analysis

Problem: You have a DataFrame work_hours with columns: record_id, employee_id, work_date,
hours_worked, and department.

 Use date_format to extract the week of the year from work_date.

 Convert hours_worked to a categorical value based on ranges (e.g., "Low", "Medium", "High") using when and otherwise.

 Filter out records where department is "HR" and hours_worked is categorized as "High".

 Group by department and week_of_year and calculate:

o The average hours_worked per department per week.

o The total count of "High" work hours records per department.

 Drop duplicate records based on record_id.

Sample Data:

record_id employee_id work_date hours_worked department

W001 E001 2024-12-01 9 IT

W002 E002 2024-12-02 6 HR

W003 E001 2024-12-05 12 IT

48. Product Return and Refund Analysis

Problem: You have a DataFrame returns with columns: return_id, order_id, return_date,
refund_amount, product_id, and return_reason.

 Use date_format to extract the year and month from return_date.

 Extract the product category from product_id using regexp_extract.

 Filter out returns where refund_amount is greater than 100 and return_reason contains
"defective".

 Group by year_month (year and month) and calculate:

o The total refund_amount for each month.

o The count of distinct return_ids per month.

 Drop duplicate returns based on return_id.

Sample Data:

return_id order_id return_date refund_amount product_id return_reason

R001 O001 2024-12-01 150 P001 defective

R002 O002 2024-12-02 80 P002 wrong size

R003 O003 2024-12-05 200 P003 defective

49. Customer Support and Ticket Analysis

Problem: You have a DataFrame support_tickets with columns: ticket_id, customer_id, creation_date, issue_severity, resolution_time, and agent.

 Use date_format to extract the quarter from creation_date.

 Convert issue_severity to lowercase.

 Filter out tickets where resolution_time exceeds 72 hours and issue_severity is "high".

 Group by agent and quarter and calculate:

o The average resolution_time per agent per quarter.

o The total count of tickets resolved by each agent per quarter.

 Drop duplicate tickets based on ticket_id.

Sample Data:

ticket_id customer_id creation_date issue_severity resolution_time agent

T001 C001 2024-12-01 High 80 Agent A

T002 C002 2024-12-03 Medium 60 Agent B

T003 C003 2024-12-07 High 90 Agent A
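
A sketch; quarter() is used for the quarter (date_format with the 'Q' pattern would be an alternative):

```python
from pyspark.sql import functions as F

result = (
    support_tickets
    .dropDuplicates(["ticket_id"])
    .withColumn("quarter", F.quarter("creation_date"))
    .withColumn("issue_severity", F.lower("issue_severity"))
    .filter(~((F.col("resolution_time") > 72) & (F.col("issue_severity") == "high")))
    .groupBy("agent", "quarter")
    .agg(
        F.avg("resolution_time").alias("avg_resolution_time"),
        F.count("ticket_id").alias("tickets"),
    )
)
```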

50. Inventory Management and Product Analysis

Problem: You have a DataFrame inventory with columns: inventory_id, product_id, update_date,
stock_quantity, restock_quantity, and supplier.

 Use date_format to extract the month and year from update_date.

 Extract the supplier code from supplier using regexp_extract.

 Filter out inventory updates where stock_quantity is below 50 and the supplier code starts
with "S".

 Group by supplier and month_year and calculate:

o The total stock_quantity per supplier per month.

o The average restock_quantity per supplier per month.

 Drop duplicate inventory updates based on inventory_id.

Sample Data:

inventory_id product_id update_date stock_quantity restock_quantity supplier

I001 P001 2024-12-01 30 20 S001

I002 P002 2024-12-05 70 15 S002

I003 P003 2024-12-10 40 25 S001

51. Customer Subscription and Engagement Analysis

Problem: You have a DataFrame subscriptions with columns: subscription_id, customer_id, start_date, end_date, subscription_type, and engagement_score.

 Use date_format to extract the year from start_date.

 Extract the subscription type from subscription_type using regexp_extract.

 Filter out subscriptions where engagement_score is below 50 and subscription_type is "premium".

 Group by year and subscription_type and calculate:

o The total count of subscriptions per type per year.

o The average engagement_score per type per year.

 Drop duplicate subscriptions based on subscription_id.

Sample Data:

subscription_id customer_id start_date end_date subscription_type engagement_score

S001 C001 2024-12-01 2025-12-01 Premium 45


S002 C002 2024-12-03 2025-12-03 Basic 60

S003 C003 2024-12-05 2025-12-05 Premium 70

52. Transaction and Customer Loyalty Analysis

Problem: You have a DataFrame transactions with columns: transaction_id, customer_id, transaction_date, amount_spent, transaction_type, and category.

 Use date_format to extract the week number from transaction_date.

 Extract the transaction category from category using split.

 Filter out transactions where amount_spent exceeds 200 and transaction_type is "refund".

 Group by customer_id and week_number and calculate:

o The total amount_spent per customer per week.

o The count of distinct transaction_ids per customer per week.

 Drop duplicate transactions based on transaction_id.

Sample Data:

transaction_id customer_id transaction_date amount_spent transaction_type category

T001 C001 2024-12-01 250 refund electronics

T002 C002 2024-12-02 100 purchase clothing

T003 C001 2024-12-05 300 refund electronics

53. Expense Tracking and Budget Analysis

Problem: You have a DataFrame expenses with columns: expense_id, user_id, expense_date,
amount, category, and description.

 Use date_format to extract the year and month from expense_date.

 Extract the category prefix from category using regexp_extract.

 Filter out expenses where amount is greater than 100 and category_prefix is "Travel".

 Group by year_month and category_prefix and calculate:

o The total amount per category per month.

o The count of expenses per category per month.

 Drop duplicate expenses based on expense_id.

Sample Data:

expense_id user_id expense_date amount category description

E001 U001 2024-12-01 150 Travel-Flight Flight

E002 U002 2024-12-02 80 Food-Dining Dinner

E003 U001 2024-12-10 200 Travel-Hotel Hotel

54. Order Fulfillment and Delivery Analysis

Problem: You have a DataFrame orders with columns: order_id, order_date, delivery_date,
product_id, order_amount, and status.

 Use date_format to extract the year and day of the week from order_date.

 Extract the order status category from status using regexp_extract (e.g., "Shipped", "Pending").

 Filter out orders where order_amount is greater than 200 and status ends with "Shipped".

 Group by year and status and calculate:

o The total order_amount per status per year.

o The average order_amount per status per year.

 Drop duplicate orders based on order_id.

Sample Data:

order_id order_date delivery_date product_id order_amount status

O001 2024-12-01 2024-12-03 P001 250 Shipped

O002 2024-12-02 2024-12-04 P002 150 Pending

O003 2024-12-05 2024-12-08 P003 300 Shipped

55. Employee Performance and Evaluation Analysis

Problem: You have a DataFrame evaluations with columns: evaluation_id, employee_id, evaluation_date, score, reviewer_id, and department.

 Use date_format to extract the month and year from evaluation_date.

 Extract the department code from department using regexp_extract.

 Filter out evaluations where score is below 70 and department_code starts with "D".

 Group by department_code and month_year and calculate:

o The average score per department per month.

o The total count of evaluations per department per month.

 Drop duplicate evaluations based on evaluation_id.

Sample Data:

evaluation_id employee_id evaluation_date score reviewer_id department

E001 EMP001 2024-12-01 65 REV001 D001

E002 EMP002 2024-12-03 80 REV002 D002

E003 EMP003 2024-12-10 75 REV003 D001

56. Customer Purchase History Analysis

Problem: You have a DataFrame purchases with columns: purchase_id, customer_id, purchase_date,
purchase_amount, category, and product_name.

 Use date_format to extract the week number and year from purchase_date.

 Extract the category type from category using split.


 Filter out purchases where purchase_amount is greater than 100 and product_name ends
with "Pro".

 Group by category_type and year_week and calculate:

o The total purchase_amount per category per week.

o The count of purchases per category per week.

 Drop duplicate purchases based on purchase_id.

Sample Data:

purchase_id customer_id purchase_date purchase_amount category product_name

P001 C001 2024-12-01 150 Electronics Phone Pro

P002 C002 2024-12-05 80 Clothing Jacket

P003 C003 2024-12-07 200 Electronics Laptop Pro

57. Supplier Order and Quality Analysis

Problem: You have a DataFrame supplier_orders with columns: order_id, supplier_id, order_date,
quantity, price, quality_score, and product_category.

 Use date_format to extract the quarter and year from order_date.

 Extract the product category code from product_category using regexp_extract.

 Filter out orders where quantity is above 500 and quality_score is less than 70.

 Group by product_category_code and quarter_year and calculate:

o The total quantity per product category per quarter.

o The average price per product category per quarter.

 Drop duplicate orders based on order_id.

Sample Data:

order_id supplier_id order_date quantity price quality_score product_category

O001 S001 2024-12-01 600 50 65 Category-A

O002 S002 2024-12-05 300 30 80 Category-B

O003 S001 2024-12-07 700 55 60 Category-A

58. Product Sales and Stock Level Analysis

Problem: You have a DataFrame product_sales with columns: sale_id, product_id, sale_date,
quantity_sold, sale_price, and stock_level.

 Use date_format to extract the month and year from sale_date.

 Extract the product ID prefix from product_id using substr.

 Filter out sales where quantity_sold is above 100 and stock_level is below 50.

 Group by product_id_prefix and month_year and calculate:

o The total quantity_sold per product prefix per month.

o The average sale_price per product prefix per month.

 Drop duplicate sales based on sale_id.

Sample Data:

sale_id product_id sale_date quantity_sold sale_price stock_level

S001 P001-A 2024-12-01 150 20 40

S002 P002-B 2024-12-05 80 30 60

S003 P001-A 2024-12-07 120 25 45

59. Event Attendance and Engagement Analysis

Problem: You have a DataFrame event_attendance with columns: attendance_id, event_id,
attendee_id, attendance_date, engagement_score, and feedback.

 Use date_format to extract the week of the year from attendance_date.

 Extract the feedback sentiment from feedback using regexp_extract (e.g., "positive", "neutral", "negative").

 Filter out attendance records where engagement_score is below 50 and feedback_sentiment is "negative".

 Group by event_id and week_of_year and calculate:

o The average engagement_score per event per week.

o The total count of feedback per event per week.

 Drop duplicate attendance records based on attendance_id.

Sample Data:

attendance_id event_id attendee_id attendance_date engagement_score feedback

A001 E001 A001 2024-12-01 45 negative

A002 E001 A002 2024-12-03 55 positive

A003 E002 A003 2024-12-07 65 neutral

60. Marketing Campaign and Conversion Analysis

Problem: You have a DataFrame campaigns with columns: campaign_id, customer_id, campaign_start, campaign_end, spend_amount, conversion_rate, and campaign_type.

 Use date_format to extract the year and month from campaign_start.

 Extract the campaign type code from campaign_type using regexp_extract.

 Filter out campaigns where spend_amount exceeds 1000 and conversion_rate is below 0.2.

 Group by campaign_type_code and year_month and calculate:

o The total spend_amount per campaign type per month.

o The average conversion_rate per campaign type per month.

 Drop duplicate campaigns based on campaign_id.

Sample Data:

campaign_id customer_id campaign_start campaign_end spend_amount conversion_rate campaign_type

C001 CU001 2024-12-01 2024-12-10 1200 0.15 Type-A

C002 CU002 2024-12-05 2024-12-15 900 0.25 Type-B

C003 CU003 2024-12-10 2024-12-20 1500 0.18 Type-A
