Big Data Project - Questions

The project involves analyzing sales and delivery data from a business organization over the last decade using Jupyter Hub on the nuvepro cluster. Key tasks include setting up data schemas, joining data frames, converting date types, identifying top customers, calculating delivery times, and analyzing sales and profits using various methods. The final deliverable is a Jupyter file containing the completed analysis and solutions to specified scenarios.

Uploaded by

Abhinav Kundlas

Project

Instructions:

1. Use Jupyter Hub on the nuvepro cluster.

2. Keep the datasets required for the project available.
Note: The datasets have already been shared with you.

3. When reading the data, give the path where you stored the datasets on the
cluster.

4. Do not limit yourself to one or two methods to get the solution. Explore more
methods.

5. Submit the Jupyter notebook file.

Overview:

Data of a business organization, confined to the 'sales and delivery' domain, is
given for the period of the last decade. The data is stored in multiple data frames.
From the given data, retrieve the solutions for the given scenarios to help the
organization understand the key factors it should look into.

Scenarios to be solved:

1. Set the schema for all the datasets and load them from their different locations
using file-based structured streaming. (5 Marks)

2. Join all the data frames (cust_dimen, market_fact, orders_dimen, prod_dimen,
shipping_dimen) into a new data frame called Full_DataFrame that does not contain
duplicate columns. (5 Marks)

3. Convert the Order_Date and Ship_Date columns to Date type. Then print the
schema and show the top 5 records for the Order_Date and Ship_Date columns.
(5 Marks)

4. Find the top 3 customers who have the maximum number of orders. (5 Marks)

5. Create a new column DaysTakenForDelivery that contains the date difference
between Order_Date and Ship_Date. (5 Marks)

6. Find the customer whose order took the maximum time to get delivered. (5
Marks)

7. Using a window function, retrieve the total sales made by each product from the
data. (5 Marks)

8. Using a window function, retrieve the total profit made from each product from
the data, and also do the same without a window function using the PySpark
DataFrame API. (5 Marks)

9. Count the total number of unique customers in January, and find how many of
them came back every month over the entire year 2011. (5 Marks)

10. For each customer, calculate the total quantity purchased, the total discount
received, the total sales, and the total profit earned. Order the data frame by
Total_profit in descending order. (5 Marks)
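
Scenarios 5, 6 and 9 can be sanity-checked on a handful of rows before running them on the cluster. The following is a minimal plain-Python sketch of that logic, not the expected PySpark solution. The sample rows, the customer IDs and the function names are hypothetical. Scenario 9 is read here as "customers who ordered in January of the given year and also ordered in each of its other eleven months", which is only one possible interpretation of the question. In PySpark, the date difference would typically come from pyspark.sql.functions.datediff(end, start) once both columns have been converted with to_date.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (customer_id, order_date, ship_date).
# The real column names and values come from the shared datasets.
ORDERS = [("C1", date(2011, m, 5), date(2011, m, 8)) for m in range(1, 13)]
ORDERS += [
    ("C2", date(2011, 1, 20), date(2011, 2, 4)),  # ordered in January only
    ("C3", date(2011, 3, 2), date(2011, 3, 3)),   # never ordered in January
    ("C2", date(2010, 6, 1), date(2010, 6, 2)),   # different year, ignored for 2011
]


def days_taken_for_delivery(order_date, ship_date):
    """Scenario 5: whole days from Order_Date to Ship_Date."""
    return (ship_date - order_date).days


def slowest_delivery(rows):
    """Scenario 6: the row whose delivery took the most days."""
    return max(rows, key=lambda r: days_taken_for_delivery(r[1], r[2]))


def january_and_every_month_customers(rows, year):
    """Scenario 9 (one reading): customers seen in January of `year`,
    and the subset of them seen in all twelve months of `year`."""
    months_by_customer = defaultdict(set)
    for customer, order_date, _ship_date in rows:
        if order_date.year == year:
            months_by_customer[customer].add(order_date.month)

    all_months = set(range(1, 13))
    january = {c for c, months in months_by_customer.items() if 1 in months}
    every_month = {c for c in january if months_by_customer[c] == all_months}
    return january, every_month
```

Comparing these small-sample results with the output of the PySpark version on the same rows is one way to catch a reversed date difference or an off-by-one month filter early.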
