Big Data Project - Questions
Instructions:
Overview:
Data of a business organization, confined to the ‘sales and delivery’ domain, is given
for the period of the last decade. The data has been stored in multiple data frames.
From the given data, retrieve the solutions for the given scenarios to help the
organization understand the key factors that they should look into.
Scenarios to be solved:
1. Set the schema for all the data sets and load them from their different locations
using file-based Structured Streaming. (5 Marks)
2. Join all the data frames (cust_dimen, market_fact, orders_dimen, prod_dimen,
shipping_dimen) and create a new data frame called Full_DataFrame in such a way
that the new data frame does not contain duplicate columns. (5 Marks)
3. Convert the Order_Date and Ship_Date columns to Date type. Then print the
schema and show the top 5 records for the Order_Date and Ship_Date columns.
(5 Marks)
4. Find the top 3 customers who have the maximum number of orders. (5 Marks)
5. Create a new column DaysTakenForDelivery that contains the date difference
between Order_Date and Ship_Date. (5 Marks)
6. Find the customer whose order took the maximum time to get delivered. (5
Marks)
7. Using a window function, retrieve the total sales made by each product from the
data. (5 Marks)
8. Using a window function, retrieve the total profit made from each product
from the data. Also produce the same result without a window function, using the
PySpark data frame API. (5 Marks)
9. Count the total number of unique customers in January 2011 and how many of
them came back in every month of 2011. (5 Marks)
10. For each customer, calculate the total quantity purchased, the total discount
received, the total sales, and the total profit earned. Order the data frame on
Total_profit in descending order. (5 Marks)
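
Scenario 9 can be read in more than one way. The sketch below shows one possible reading in plain Python, independent of Spark: count the distinct customers who ordered in January of the year, then count those who ordered in all twelve months of that year. The function names, the (customer_id, order_date) layout, and the sample values are all hypothetical. They are not taken from the project data.

```python
from datetime import date


def monthly_customers(orders, year):
    """Map each month (1-12) of `year` to the set of customers who ordered in it."""
    by_month = {month: set() for month in range(1, 13)}
    for customer_id, order_date in orders:
        if order_date.year == year:
            by_month[order_date.month].add(customer_id)
    return by_month


def january_retention(orders, year=2011):
    """Return (unique January customers, customers who ordered in every month)."""
    by_month = monthly_customers(orders, year)
    january = by_month[1]
    # A customer "came back every month" only if present in all 12 monthly sets.
    # January is one of those sets, so the result is a subset of `january`.
    every_month = set.intersection(*by_month.values())
    return len(january), len(every_month)


# Hypothetical sample data (assumption: one tuple per order).
orders = [("C1", date(2011, m, 10)) for m in range(1, 13)]   # orders every month
orders += [("C1", date(2011, 1, 20))]                        # repeat order in January
orders += [("C2", date(2011, 1, 3)), ("C2", date(2011, 3, 8))]  # January and March only
orders += [("C3", date(2011, m, 15)) for m in range(2, 13)]  # never orders in January
orders += [("C4", date(2010, 1, 2))]                         # outside 2011, ignored

january_count, every_month_count = january_retention(orders)
print(january_count, every_month_count)
```

In PySpark the same idea would typically be expressed by filtering the joined data to 2011, grouping by the customer identifier, and keeping customers whose distinct count of month(Order_Date) equals 12. The exact column names depend on the schemas set in scenario 1.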