
Big Data Project - Questions

The project involves analyzing sales and delivery data from a business organization over the last decade using Jupyter Hub on the nuvepro cluster. Key tasks include setting up data schemas, joining data frames, converting date types, identifying top customers, calculating delivery times, and analyzing sales and profits using various methods. The final deliverable is a Jupyter file containing the completed analysis and solutions to specified scenarios.

Uploaded by

Abhinav Kundlas

Project

Instructions:

1. Kindly use Jupyter Hub on the nuvepro cluster.
2. Keep the datasets that are required for the project.
Note: The datasets have already been shared with you.
3. When reading the data, give the path where you stored the datasets on the
cluster.
4. Do not limit yourself to one or two methods to reach a solution. Explore more
methods.
5. Kindly submit the Jupyter file.

Overview:

Data from a business organization, confined to the ‘sales and delivery’ domain, is
given for the last decade. The data is stored in multiple data frames. From the
given data, work out solutions to the scenarios below to help the organization
understand the key factors it should look into.

Scenarios to be solved:

1. Set the schema for all the datasets and load them from their different locations
using file-based Structured Streaming. (5 Marks)

2. Join all the Data frames and create a new Data frame called Full_DataFrame in
such a way that the new data frame does not contain duplicate columns.
(cust_dimen, market_fact, orders_dimen, prod_dimen, shipping_dimen)
(5 Marks)

3. Convert the Order_Date and Ship_Date columns to Date type. Then print the
schema and show the top 5 records for the Order_Date and Ship_Date columns. (5
Marks)

4. Find the top 3 customers who have the maximum number of orders. (5 Marks)

5. Create a new column DaysTakenForDelivery that contains the date difference
between Order_Date and Ship_Date. (5 Marks)

6. Find the customer whose order took the maximum time to get delivered. (5
Marks)

7. Using a window function, retrieve the total sales made by each product from the
data. (5 Marks)

8. Using a window function, retrieve the total profit made from each product
from the data, and also do the same without a window function using a PySpark
data frame. (5 Marks)

9. Count the total number of unique customers in January 2011 and find how many of
them came back in every month of 2011. (5 Marks)

10. For each customer, calculate the total quantity purchased, the discount
received, the total sales, and the profit earned. Order the data frame by
Total_profit in descending order. (5 Marks)
