1st Question
Problem: You have a DataFrame user_data with columns: user_id, email, signup_date, and
last_login.
Sample Data:
Use regexp_extract to extract the country code, area code, and local number from the
phone_number column.
Sample Data:
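A minimal plain-Python sketch of the three capture groups, not DataFrame code. The phone format "+<country>-<area>-<local>" and the value used are assumptions made for illustration, not the exercise's sample data. The same pattern string should carry over to a regex-extraction function called once per group index (1, 2, 3).

```python
import re

# Assumed (hypothetical) format: "+<country>-<area>-<local>", e.g. "+1-415-5551234".
PHONE_RE = re.compile(r"^(\+\d{1,3})-(\d{3})-(\d+)$")


def split_phone(phone_number):
    """Return (country_code, area_code, local_number); empty strings if no match."""
    match = PHONE_RE.match(phone_number)
    if match is None:
        return ("", "", "")
    return match.groups()


print(split_phone("+1-415-5551234"))
```

Returning empty strings for a non-matching value is a design choice of this sketch, so that malformed numbers are easy to spot in a later filter.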
Split the product_code into three separate columns: brand_code, category_code, and
serial_number.
Filter out products where brand_code starts with "X" and serial_number ends with "99".
Sample Data:
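A rough plain-Python sketch of the split-then-filter logic. The "-" delimiter and the three codes below are hypothetical, chosen only to show the two conditions; "filter out" is read here as dropping rows where both conditions hold.

```python
def parse_product_code(product_code):
    """Split a code such as "XA-EL-1099" (hypothetical format) on "-"."""
    brand_code, category_code, serial_number = product_code.split("-", 2)
    return {
        "brand_code": brand_code,
        "category_code": category_code,
        "serial_number": serial_number,
    }


def keep_product(parsed):
    """False for rows to drop: brand starts with "X" AND serial ends with "99"."""
    return not (
        parsed["brand_code"].startswith("X")
        and parsed["serial_number"].endswith("99")
    )


codes = ["XA-EL-1099", "NK-SH-1234", "XB-TO-1000"]  # made-up values
kept = [c for c in codes if keep_product(parse_product_code(c))]
print(kept)
```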
Problem: You have a DataFrame web_logs with columns: session_id, url, timestamp, and user_agent.
Use split to extract the protocol, domain, and path from the url column.
Sample Data:
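One way to picture the split, sketched in plain Python with a made-up URL. It assumes every url contains "://" and treats everything after the domain (query string included) as the path.

```python
def split_url(url):
    """Return (protocol, domain, path) using two splits, each limited to one cut."""
    protocol, rest = url.split("://", 1)   # "https" | "example.com/products/42"
    parts = rest.split("/", 1)             # "example.com" | "products/42"
    domain = parts[0]
    path = "/" + parts[1] if len(parts) > 1 else "/"
    return protocol, domain, path


print(split_url("https://example.com/products/42"))
```

Limiting each split to a single cut keeps slashes inside the path from producing extra pieces.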
Problem: You have a DataFrame address_data with columns: address_id, full_address, city, state, and
zipcode.
Use regexp_extract to extract the street number, street name, and apartment number from
full_address.
Filter out addresses where city starts with "New" and zipcode ends with "00".
Sample Data:
Problem: You have a DataFrame invoice_data with columns: invoice_id, description, quantity,
unit_price, and total_amount.
Split the description into product_name, color, and size using split.
Filter out invoices where the quantity is less than 5 and the total_amount is greater than
1000.
Sample Data:
Use substr to extract the first 3 characters of order_id and the last 4 characters of
product_code.
Sample Data:
Problem: You have a DataFrame flight_data with columns: flight_id, flight_number, departure_date,
and status.
Use instr to find the position of the airline code in the flight_number.
Extract the airline code using substr based on the instr result.
Filter out flights where the status is not "ON TIME" and the airline code is "AA".
Sample Data:
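A plain-Python sketch of the instr/substr combination. The helper functions imitate the usual SQL-style behaviour (1-based positions, 0 when the substring is absent). The "AA-1234" flight number format, where the airline code precedes a "-", is an assumption, as are the rows below.

```python
def instr(text, sub):
    """1-based position of sub in text, or 0 if it does not occur."""
    return text.find(sub) + 1


def substr(text, pos, length):
    """Substring of the given length starting at 1-based position pos."""
    return text[pos - 1 : pos - 1 + length]


def airline_code(flight_number):
    """Everything before the first "-" (hypothetical flight number format)."""
    dash = instr(flight_number, "-")
    return substr(flight_number, 1, dash - 1) if dash > 0 else ""


flights = [  # made-up rows
    {"flight_number": "AA-1234", "status": "DELAYED"},
    {"flight_number": "AA-5678", "status": "ON TIME"},
    {"flight_number": "DL-0042", "status": "DELAYED"},
]

# Drop rows where the status is not "ON TIME" AND the airline code is "AA".
kept = [
    f["flight_number"]
    for f in flights
    if not (f["status"] != "ON TIME" and airline_code(f["flight_number"]) == "AA")
]
print(kept)
```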
Problem: You have a DataFrame stock_data with columns: stock_id, ticker_symbol, trade_date,
closing_price, and volume.
Use regexp_extract to parse out the company code and market code from the ticker_symbol.
Sample Data:
Problem: You have a DataFrame document_data with columns: doc_id, content, author, and
publish_date.
Filter out documents where the first word starts with "A" or "An".
Sample Data:
D001 A quick brown fox jumps over the lazy dog John Doe 2024-11-01
D003 An apple a day keeps the doctor away Alice Johnson 2024-11-10
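A small plain-Python sketch of the filter, using the two sample rows above plus one hypothetical row (D999). It reads the condition as "the first word is exactly A or An", which is an interpretation rather than something the exercise states.

```python
docs = [
    ("D001", "A quick brown fox jumps over the lazy dog"),
    ("D003", "An apple a day keeps the doctor away"),
    ("D999", "The rain stays mainly in the plain"),  # hypothetical extra row
]


def first_word(content):
    """First whitespace-delimited word, or "" for empty content."""
    words = content.split()
    return words[0] if words else ""


def keep_doc(content):
    """False when the first word is exactly "A" or "An" (assumed reading)."""
    return first_word(content) not in {"A", "An"}


kept_ids = [doc_id for doc_id, content in docs if keep_doc(content)]
print(kept_ids)
```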
Use regexp_extract to parse the browser name and version from the user_agent column.
Filter out sessions where the browser is not "Chrome" and the version is less than "90".
Sample Data:
Problem: You have a DataFrame click_data with columns: click_id, url, click_time, and referrer.
Use regexp_extract to extract the query parameters (e.g., ?id=123&source=google) from the
url column.
Filter out clicks where the query parameter source is not "google".
Sample Data:
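A plain-Python sketch of the two patterns involved: one for the whole query string and one for the source parameter. The URLs are made up for illustration. Rows whose source is not "google" are dropped, so only "google" clicks remain.

```python
import re

QUERY_RE = re.compile(r"\?(.*)$")                # everything after the first "?"
SOURCE_RE = re.compile(r"[?&]source=([^&#]*)")   # value of the "source" parameter


def query_string(url):
    """Query string without the leading "?", or "" if the URL has none."""
    match = QUERY_RE.search(url)
    return match.group(1) if match else ""


def source_param(url):
    """Value of the source parameter, or "" if it is missing."""
    match = SOURCE_RE.search(url)
    return match.group(1) if match else ""


urls = [  # hypothetical values
    "http://example.com/page?id=123&source=google",
    "http://example.com/page?id=456&source=bing",
    "http://example.com/page",
]
google_clicks = [u for u in urls if source_param(u) == "google"]
print(google_clicks)
```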
Problem: You have a DataFrame file_data with columns: file_id, file_path, upload_date, and
size_in_mb.
Use split to extract the file name and extension from the file_path.
Filter out files where the extension is not "pdf" and the size_in_mb is greater than 100.
Sample Data:
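A plain-Python sketch of the two-step split (first on "/", then on the last "."). The paths and sizes are hypothetical. "Filter out" is read as dropping rows where the extension is not "pdf" and the size exceeds 100.

```python
def split_file_path(file_path):
    """Return (file_name, extension) for a "/"-separated path."""
    base = file_path.split("/")[-1]    # last path component
    parts = base.rsplit(".", 1)        # cut at the last dot only
    name = parts[0]
    extension = parts[1] if len(parts) > 1 else ""
    return name, extension


files = [  # made-up rows: (file_path, size_in_mb)
    ("/data/video.mp4", 250),
    ("/data/report.pdf", 150),
    ("/data/notes.txt", 2),
]

kept = [
    path
    for path, size_in_mb in files
    if not (split_file_path(path)[1] != "pdf" and size_in_mb > 100)
]
print(kept)
```

Cutting at the last dot keeps names such as "q3.summary.pdf" intact as ("q3.summary", "pdf").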
Use regexp_extract to identify and extract any sentiment keywords (e.g., "excellent", "poor")
from the review_text.
Filter out reviews where the rating is less than 3 and the sentiment contains "poor".
Sample Data:
Problem: You have a DataFrame social_posts with columns: post_id, content, likes, shares, and
post_date.
Filter out posts that contain hashtags starting with "ad" and have more than 1000 likes.
Sample Data:
P001 Loving this new product! #NewProduct #Excited 500 100 2024-12-20
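A plain-Python sketch of the hashtag check, using the P001 row above and one hypothetical row (P999). Matching "ad" case-insensitively is an assumption; the exercise does not say whether case matters.

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")  # captures the text after each "#"


def has_ad_hashtag(content):
    """True if any hashtag starts with "ad" (case-insensitive, an assumption)."""
    return any(tag.lower().startswith("ad") for tag in HASHTAG_RE.findall(content))


def keep_post(content, likes):
    """False for posts to drop: an "ad" hashtag AND more than 1000 likes."""
    return not (has_ad_hashtag(content) and likes > 1000)


posts = [
    ("P001", "Loving this new product! #NewProduct #Excited", 500),
    ("P999", "Big sale today #ad #Promo", 1500),  # hypothetical row
]
kept_ids = [post_id for post_id, content, likes in posts if keep_post(content, likes)]
print(kept_ids)
```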
Use regexp_extract to validate that the product_code follows a specific pattern (e.g.,
"ABC-1234-X").
Sample Data:
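A plain-Python sketch of pattern validation. The exact rule behind "ABC-1234-X" is not spelled out, so this assumes three uppercase letters, four digits, and one uppercase letter, separated by hyphens. The codes tested are made up.

```python
import re

# Assumed rule: 3 uppercase letters, "-", 4 digits, "-", 1 uppercase letter.
CODE_RE = re.compile(r"[A-Z]{3}-\d{4}-[A-Z]")


def is_valid_code(product_code):
    """True only if the whole string matches the assumed pattern."""
    return CODE_RE.fullmatch(product_code) is not None


for code in ["ABC-1234-X", "AB-1234-X", "ABC-12345-X", "abc-1234-x"]:
    print(code, is_valid_code(code))
```

With an extraction function, a comparable check is to anchor the pattern with ^ and $ and treat an empty extraction result as invalid.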
Problem: You have a DataFrame customer_addresses with columns: customer_id, address, city, state,
and postal_code.
Convert the city and state columns to title case using initcap.
Sample Data:
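A plain-Python sketch of what title-casing does to each word: the first letter is uppercased and the rest lowercased. The function below only imitates that behaviour for space-separated words; edge cases (tabs, hyphens, apostrophes) may differ from a DataFrame library's initcap. The city and state values are hypothetical.

```python
def initcap(text):
    """Capitalize the first letter of each space-separated word, lowercase the rest."""
    return " ".join(word.capitalize() for word in text.split(" "))


print(initcap("new york"))
print(initcap("SAN FRANCISCO"))
```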
Problem: You have a DataFrame employee_data with columns: employee_id, name, role,
department, and salary.
Use split to separate the role into title and level (e.g., "Manager - Senior").
Filter out employees where the salary is less than 50000 and the role level is "Junior".
Sample Data:
Problem: You have a DataFrame sensor_data with columns: sensor_id, reading_value, timestamp,
and unit.
Filter out readings where the value is below a specific threshold (e.g., 10) and the unit is
"celsius".
Sample Data:
Use regexp_extract to parse out the transaction type and code from the transaction_code.
Filter transactions where the transaction type starts with "TR" and the amount is greater
than 500.
Problem: You have a DataFrame email_data with columns: email_id, email_address, sign_up_date,
and status.
Filter out emails where the domain is not "gmail.com" and the status is "Inactive".
Sample Data:
Problem: You have a DataFrame inventory with columns: item_id, product_code, quantity, category,
and last_checked.
Use split to separate the product_code into prefix, code number, and suffix.
Filter out items where the quantity is less than 50 and the product_code suffix is "X".
Sample Data:
Problem: You have a DataFrame user_data with columns: user_id, phone_number, sign_up_date,
and subscription_type.
Filter out users where the country code is not "+1" and the subscription_type is "BASIC".
Sample Data:
Use split to extract the network and host portions of the ip_address.
Sample Data:
Problem: You have a DataFrame customer_data with columns: customer_id, full_name, age, city, and
state.
Use split to extract the first name and last name from the full_name.
Filter out customers where the age is less than 30 and the city name contains "new".
Sample Data:
Use split to separate the order_number into order prefix and order number.
Filter out orders where the total_value is not equal to item_quantity * item_price.
Sample Data:
Problem: You have a DataFrame product_data with columns: product_id, product_name, price,
category, and release_date.
Use split to extract the main product name and the variant from product_name.
Filter products where the price is less than 50 and the product name contains "SPECIAL".
Sample Data:
Problem: You have a DataFrame review_data with columns: review_id, customer_name, review_text,
rating, and review_date.
Use regexp_extract to find keywords like "excellent", "good", "poor" in the review_text.
Filter reviews where the rating is less than 3 and the review text contains "poor".
Sample Data:
Problem: You have a DataFrame event_logs with columns: event_id, event_description, event_type,
event_time, and user_id.
Filter events where the event_type starts with "ERROR" and the event_time is after
"2024-12-01".
Use date_format to extract the week number and year from transaction_date.
Filter transactions where transaction_amount is greater than 100 and the store_location
ends with "STORE".
Sample Data:
Problem: You have a DataFrame activity_logs with columns: log_id, user_id, activity_description,
activity_date, and duration_seconds.
Use split to parse out the activity type and details from activity_description.
Filter logs where the duration_seconds is less than 30 and the activity_type starts with
"LOGIN".
Sample Data:
Problem: You have a DataFrame price_changes with columns: change_id, product_id, old_price,
new_price, change_date, and vendor.
Filter out price changes where the price_difference is less than 10 and the vendor does not
contain "VENDOR".
Sample Data:
Problem: You have a DataFrame website_traffic with columns: session_id, user_id, page_url,
visit_duration, and visit_date.
Filter out sessions where the visit_duration is less than 60 seconds and the page category is
"product".
Sample Data:
Problem: You have a DataFrame session_logs with columns: session_id, user_id, page_url,
session_start, session_end, and activity_duration.
Filter sessions where the activity_duration is greater than 300 seconds and the base URL is
"home".
Sample Data:
S001 U001 http://example.com/home 2024-12-01 08:00:00 2024-12-01 09:00:00 360
S002 U002 http://example.com/product 2024-12-01 09:30:00 2024-12-01 10:00:00 200
S003 U001 http://example.com/home 2024-12-01 10:30:00 2024-12-01 11:30:00 450
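A plain-Python sketch of the filter applied to the three sample sessions, keeping only session_id, user_id, page_url, and activity_duration. It reads "base URL" as the last path segment of page_url, which is an assumption; "filter sessions where" is read as keeping the matching rows.

```python
sessions = [
    ("S001", "U001", "http://example.com/home", 360),
    ("S002", "U002", "http://example.com/product", 200),
    ("S003", "U001", "http://example.com/home", 450),
]


def base_segment(page_url):
    """Last path segment of the URL, e.g. "home" (assumed meaning of base URL)."""
    return page_url.rstrip("/").split("/")[-1]


# Keep sessions longer than 300 seconds whose last path segment is "home".
selected = [
    session_id
    for session_id, _user_id, page_url, activity_duration in sessions
    if activity_duration > 300 and base_segment(page_url) == "home"
]
print(selected)
```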
Problem: You have a DataFrame orders with columns: order_id, order_date, delivery_date,
product_id, quantity, and price_per_unit.
Use date_format to extract the month and year from order_date and delivery_date.
Filter orders where the delivery time (difference between delivery_date and order_date) is
more than 5 days.
Sample Data:
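A plain-Python sketch of the date logic with `datetime`. The "MM-yyyy" output shape and the two orders are hypothetical. In DataFrame code the day difference would normally come from a date-difference function rather than from subtracting Python dates.

```python
from datetime import date


def month_year(d):
    """Format a date as "MM-yyyy" (assumed output shape)."""
    return d.strftime("%m-%Y")


def delivery_days(order_date, delivery_date):
    """Whole days between order and delivery."""
    return (delivery_date - order_date).days


orders = [  # made-up rows: (order_id, order_date, delivery_date)
    ("O001", date(2024, 12, 1), date(2024, 12, 4)),
    ("O002", date(2024, 12, 20), date(2025, 1, 2)),
]

# Keep orders whose delivery took more than 5 days.
late = [oid for oid, ordered, delivered in orders if delivery_days(ordered, delivered) > 5]
print(late)
```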
Filter feedback where the sentiment_score is below 3 and the feedback text contains "poor".
Sample Data:
Filter out records where the stock_count is less than 20 and the product_id does not start
with "PRD".
Sample Data:
Filter out purchases where the amount_spent is less than 50 and the product type is
"electronics".
Sample Data:
Filter sessions where the session_duration is less than 1 hour and the attendance_status is
"Absent".
Sample Data:
Problem: You have a DataFrame sales_data with columns: sales_id, sales_date, region,
sales_amount, and product_id.
Filter out sales where the sales_amount is below 100 and the region_code does not start
with "N".
Sample Data:
Problem: You have a DataFrame invoices with columns: invoice_id, customer_id, invoice_date,
amount_due, payment_status, and payment_date.
Filter out invoices where the amount_due is greater than 500 and the payment_status is
"Unpaid".
Sample Data:
Use date_format to extract the month and day of the week from login_timestamp.
Filter logins where activity_type contains "purchase" and session_duration exceeds 1 hour.
Sample Data:
Problem: You have a DataFrame work_hours with columns: record_id, employee_id, work_date,
hours_worked, and department.
Filter out records where department is "HR" and hours_worked is categorized as "High".
Sample Data:
Problem: You have a DataFrame returns with columns: return_id, order_id, return_date,
refund_amount, product_id, and return_reason.
Sample Data:
Filter out tickets where resolution_time exceeds 72 hours and issue_severity is "high".
Sample Data:
Filter out inventory updates where stock_quantity is below 50 and the supplier code starts
with "S".
Sample Data:
Filter out transactions where amount_spent exceeds 200 and transaction_type is "refund".
Sample Data:
Problem: You have a DataFrame expenses with columns: expense_id, user_id, expense_date,
amount, category, and description.
Filter out expenses where amount is greater than 100 and category_prefix is "Travel".
Problem: You have a DataFrame orders with columns: order_id, order_date, delivery_date,
product_id, order_amount, and status.
Use date_format to extract the year and day of the week from order_date.
Extract the order status category from status using regexp_extract (e.g., "Shipped",
"Pending").
Filter out orders where order_amount is greater than 200 and status ends with "Shipped".
Sample Data:
Filter out evaluations where score is below 70 and department_code starts with "D".
Sample Data:
Problem: You have a DataFrame purchases with columns: purchase_id, customer_id, purchase_date,
purchase_amount, category, and product_name.
Use date_format to extract the week number and year from purchase_date.
Sample Data:
Problem: You have a DataFrame supplier_orders with columns: order_id, supplier_id, order_date,
quantity, price, quality_score, and product_category.
Filter out orders where quantity is above 500 and quality_score is less than 70.
Sample Data:
Problem: You have a DataFrame product_sales with columns: sale_id, product_id, sale_date,
quantity_sold, sale_price, and stock_level.
Filter out sales where quantity_sold is above 100 and stock_level is below 50.
Sample Data:
Extract the feedback sentiment from feedback using regexp_extract (e.g., "positive",
"neutral", "negative").
Sample Data:
Filter out campaigns where spend_amount exceeds 1000 and conversion_rate is below 0.2.
Sample Data: