BD Project Document
Project Description:
You are required to find an innovative idea and a dataset that supports it, to which you
will apply the analytical techniques we use throughout the course.
Your idea must address a business problem, bring a business solution, or provide
value, insights, or recommendations for business users, that is, those who will benefit
from the results of your work.
You are required to use one of the big data processing frameworks (e.g. Hadoop, Spark,
Flink, etc.) integrated with your preferred language. The system should be implemented
in pseudo-distributed mode (single-machine multiple processes). The implementation in
fully distributed mode (multiple-machines multiple-processes) will be a bonus.
Using cloud computing providers (Azure or AWS) in the project will be considered as a
bonus.
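To make "single-machine multiple processes" concrete, here is a minimal sketch of a map/reduce word count that spreads the map step over several OS processes on one machine. It uses only Python's standard `multiprocessing` module as a stand-in and is not Hadoop, Spark, or Flink; the function names and the toy input are assumptions made for illustration.

```python
from collections import Counter
from multiprocessing import Pool


def map_chunk(lines):
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_words_parallel(chunks, workers=2):
    """Run the map step in separate processes on a single machine."""
    with Pool(processes=workers) as pool:
        partials = pool.map(map_chunk, chunks)
    return reduce_counts(partials)


# Hypothetical toy input, pre-split into chunks the way a framework
# would split a large file into blocks.
CHUNKS = [
    ["big data big ideas", "data pipelines"],
    ["big clusters", "data data"],
]
```

When run as a script, `count_words_parallel(CHUNKS)` should be called under an `if __name__ == "__main__":` guard so that worker processes do not re-run the call. A real framework adds what this sketch lacks, such as a distributed file system, scheduling, and fault tolerance.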
FAQ:
1. What do you mean by an innovative idea?
An innovative idea is a new business problem you are trying to solve. For example:
- A business problem for a bank may be “Should we give this customer a loan or not?”
or “Based on what factors/conditions should we give our customers a loan?”
- A business question for a retail store concerning shelf management may be “What are
the items commonly bought together by a sufficiently large number of customers?”
This is commonly referred to as market basket analysis.
- A business problem for a magazine/newspaper may be “What determines a
customer’s decision to subscribe or not?”
Your idea is not limited to business; it can also address governmental bodies,
environmental institutions, and societal organizations. For example:
- “Which people are more likely to vote for/against a law?”
- “How will the climate on Earth change over the next ten years?”
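As an illustration of the market basket question mentioned above, the following minimal sketch counts how often each pair of items appears in the same transaction and keeps the pairs that reach a minimum support. The transactions, the `frequent_pairs` helper, and the threshold are hypothetical; a real project would run an equivalent computation inside the chosen big data framework.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs bought together in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort and deduplicate so that each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical transactions, invented for illustration only.
BASKETS = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "milk", "butter"],
]

PAIRS = frequent_pairs(BASKETS, min_support=2)
print(PAIRS)
```

Counting every pair directly is only practical for small item catalogs; algorithms such as Apriori or FP-Growth prune the candidate pairs and are the usual choice at scale.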
4. What analytical techniques can I use?
You can use, but are not limited to, any of the analytical techniques studied in this course.
Examples:
- For a classification problem, you can build a classifier using logistic regression or
K-NN, train a neural network, build an SVM, among many other techniques.
- For a prediction problem, you can go with MLR (Multiple Linear Regression) or PCR
(Principal Component Regression).
- For a segmentation problem, you can try different clustering techniques
(K-means, hierarchical clustering, etc.)
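To show what one of the classification techniques listed above looks like in practice, here is a minimal K-NN sketch in plain Python: a query point is labeled by majority vote among its k nearest training points. The loan-applicant features, labels, and the `knn_predict` helper are hypothetical and only echo the bank example from the FAQ; features are left unscaled for brevity.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical applicants: (income in thousands, existing debt in thousands).
POINTS = [(20, 15), (25, 18), (30, 20), (80, 5), (90, 10), (85, 8)]
LABELS = ["reject", "reject", "reject", "approve", "approve", "approve"]

print(knn_predict(POINTS, LABELS, (82, 7)))
print(knn_predict(POINTS, LABELS, (22, 16)))
```

On real data the features would normally be standardized first, since Euclidean distance is dominated by whichever feature has the largest range.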
5. We are seniors and about to graduate. Will the project consume too much time?
No.
Deliverables:
1. Project Proposal:
The proposal should not exceed one page and should clearly specify the following points:
a. Idea:
The problem statement should be described clearly.
b. Dataset(s):
Links to the dataset(s) that will be used.
c. Planned approach or proposed solution:
A very brief plan of your proposed approach, including the algorithms or technologies you
might use (you can change it later during the implementation phase).
2. Final Delivery:
a. Final Document containing:
i. Brief problem description.
ii. Project pipeline.
iii. Analysis and solution of the problem:
▪ Data preprocessing.
▪ Data visualization.
▪ Model/Classifier training.
iv. Results and Evaluation.
▪ Model accuracy on training, validation, and test data.
v. Unsuccessful trials that were not included in the final solution.
vi. Any enhancements and future work.
b. Code
c. Presentation:
i. Business part.
ii. Technical part.
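For the "Results and Evaluation" item of the final document, accuracy has to be reported separately on training, validation, and test data. The sketch below shows one way to produce those three numbers in plain Python. The 60/20/20 split, the synthetic rows, and the `threshold_model` stand-in for a trained classifier are all assumptions made for illustration.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into train, validation, and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def accuracy(model, rows):
    """Fraction of (features, label) rows that the model labels correctly."""
    if not rows:
        return 0.0
    correct = sum(1 for features, label in rows if model(features) == label)
    return correct / len(rows)


# Hypothetical data: the label is 1 when the single feature exceeds 50.
ROWS = [(x, int(x > 50)) for x in range(100)]
TRAIN, VALIDATION, TEST = split_dataset(ROWS)


def threshold_model(x):
    """Stand-in for a trained classifier (slightly wrong at x == 50)."""
    return int(x >= 50)


for name, part in [("train", TRAIN), ("validation", VALIDATION), ("test", TEST)]:
    print(f"{name}: {len(part)} rows, accuracy {accuracy(threshold_model, part):.2f}")
```

The test part should be touched only once, after all tuning on the validation part is finished, so that the reported test accuracy is not optimistic.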
Project Schedule:
Phase                               Week     Due date
Team formation & Project Proposal   Week 7   Monday 27th March, 11:59 PM
Notes:
- There is a penalty for late submissions in any of the three mentioned phases.
- Any sign of cheating or plagiarism will not be tolerated and will result in a grade of
ZERO for the project.
Suggested Ideas:
1. Kaggle competitions, especially active ones:
https://www.kaggle.com/competitions
2. Kaggle datasets:
https://www.kaggle.com/datasets
3. Analytics Vidhya has some datasets and sometimes competitions:
https://datahack.analyticsvidhya.com/contest/all/
4. 17 places to find datasets for data science projects:
https://www.dataquest.io/blog/free-datasets-for-projects/
5. Data Science for Social Good (it has ideas but no data):
http://www.dssgfellowship.org/projects/