Capstones AIML and DS Capstone Projects


Data Science (All categories)

Problem statement<Objective<Tools used<Learning objective

Domain: Real Estate


Problem statement: A banking institution requires actionable insights into mortgage-backed
securities, geographic business investment, and real estate analysis. The mortgage bank would
like to identify potential monthly mortgage expenses for each region based on monthly family
income and rental of the real estate.

You also need to create a dashboard that demonstrates relationships and trends for the key
metrics as follows: number of loans, average rental income, monthly mortgage and owner’s
cost, and a family income vs mortgage cost comparison across different regions. The metrics
described here do not limit the dashboard to these few.

Objective: A statistical model needs to be created to predict the potential
demand for the amount of loan in dollars for each of the regions in the USA. Also, there is a
need to create a dashboard which would refresh periodically, post data retrieval from the
agencies.

Tool used: Jupyter notebook and Tableau

Learning Objective: In this capstone, you will perform exploratory data analysis (EDA), data
preprocessing prior to model building, and then build linear regression models that predict total
monthly expenditure for home mortgage loans. You will also create a dashboard in Tableau.

Domain: Healthcare

Problem statement: NIDDK (National Institute of Diabetes and Digestive and Kidney Diseases)
research creates knowledge about and treatments for the most chronic, costly, and
consequential diseases. The datasets consist of several medical predictor variables and one
target variable (Outcome). Predictor variables include the number of pregnancies the patient
has had, their BMI, insulin level, age, and more.

Objective: In this capstone you will have to predict whether or not a patient has diabetes, based
on certain diagnostic measurements included in the dataset. Build a model to accurately predict
whether the patients in the dataset have diabetes or not.

Tool used: Jupyter notebook and Tableau

Learning objective: You will have to perform descriptive analysis to explore the dataset
variables using histograms. You will also create scatter charts between the pair of variables to
understand the relationships. Report the data by creating a dashboard in Tableau.

Domain: Retail

Problem statement: Customer segmentation is the practice of segregating the customer base
into groups of individuals based on some common characteristics such as age, gender,
interests, and spending habits. It is a critical requirement for business to understand the value
derived from a customer. RFM is a method used for analyzing customer value.

RFM: ‘Recency’ measures when a customer placed their last order and is an
important customer attribute to consider for segmentation. It is the number of days since a
customer made the last purchase.

How often a customer purchases from the store should also be taken into account for the
customer segmentation exercise. This is termed ‘Frequency’: the
number of purchases in a given period. A higher frequency value indicates a more
engaged customer.

Can we draw conclusions about customer value from Recency and Frequency alone? Maybe not,
because we must also account for the amount a customer paid for the purchases, which is the
monetary value. ‘Monetary’ is the total amount of money a customer spent in the given time
period.
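
As a minimal sketch of the three metrics described above, the following computes Recency, Frequency, and Monetary per customer using only the Python standard library. The transaction records, the `compute_rfm` helper, and the snapshot date are hypothetical and serve only as an illustration; the capstone dataset will have its own columns.

```python
from datetime import date

# Hypothetical transactions: (customer_id, order_date, amount).
transactions = [
    ("C1", date(2024, 1, 5), 120.0),
    ("C1", date(2024, 3, 20), 80.0),
    ("C2", date(2024, 2, 10), 300.0),
    ("C3", date(2023, 11, 1), 40.0),
    ("C3", date(2024, 1, 15), 60.0),
    ("C3", date(2024, 3, 28), 55.0),
]

def compute_rfm(rows, snapshot_date):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_order, frequency, monetary = {}, {}, {}
    for customer, order_date, amount in rows:
        # Track the most recent order date seen for each customer.
        if customer not in last_order or order_date > last_order[customer]:
            last_order[customer] = order_date
        frequency[customer] = frequency.get(customer, 0) + 1
        monetary[customer] = monetary.get(customer, 0.0) + amount
    return {
        c: ((snapshot_date - last_order[c]).days, frequency[c], monetary[c])
        for c in last_order
    }

# The snapshot date (assumed here) is the day the analysis is run.
rfm = compute_rfm(transactions, snapshot_date=date(2024, 4, 1))
for customer, (r, f, m) in sorted(rfm.items()):
    print(customer, r, f, m)
```

Note that a smaller Recency value (fewer days since the last purchase) indicates a more recently active customer, whereas larger Frequency and Monetary values indicate a more engaged one.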

Objective: Perform customer segmentation using RFM analysis. The resulting segments
can be ordered from most valuable (most recent purchases, highest frequency and monetary
value) to least valuable (least recent purchases, lowest frequency and monetary value).

Tool used: Jupyter notebook and Tableau

Learning Objective: You must conduct a preliminary data inspection and data cleaning.
After performing cohort analysis (a cohort is a group of subjects who share a defining
characteristic), you will be asked to build an RFM (Recency Frequency Monetary) model
and calculate RFM metrics. At the end you will create a dashboard in Tableau by choosing
appropriate chart types and metrics useful for the business.

AIML (Masters)

Problem statement<Objective<Tools used<Learning objective

Domain: E-commerce

Problem statement: Amazon is an online shopping website that now caters to millions of
people everywhere. Over 34,000 consumer reviews for Amazon brand products like Kindle,
Fire TV Stick and more are provided. The dataset has attributes like brand, categories,
primary categories, reviews.title, reviews.text, and the sentiment. Sentiment is a categorical
variable with three levels "Positive", "Negative", and "Neutral". For unseen data, the
sentiment needs to be predicted.

Objective: You are required to predict Sentiment or Satisfaction of a purchase based on
multiple features and review text.

Tools used: Jupyter notebook, Amazon SageMaker lab

Learning Objective: Perform an EDA (Exploratory Data Analysis) on the dataset to tackle
the class imbalance problem in the dataset. You will also have to run multinomial Naive
Bayes, SVM, and Random Forest classifiers. You will also get an introduction to deep
learning by training an LSTM (long short-term memory) network. At the end you will be
asked to compare the accuracy of neural nets with traditional ML-based algorithms.

Domain: Finance

Problem statement: The finance industry is the biggest consumer of AIML engineers. It faces
constant attacks by fraudsters, who try to trick the system. Correctly identifying fraudulent
transactions is often compared with finding a needle in a haystack because of the low
event rate. It is important that credit card companies are able to recognize fraudulent credit
card transactions so that the customers are not charged for items that they did not
purchase.

Objective: You are required to try various techniques such as supervised models with
oversampling, unsupervised anomaly detection, and heuristics to get good accuracy at
fraud detection.

Tools used: Jupyter notebook, Amazon SageMaker lab

Learning Objective: You will perform an EDA on the dataset. You will be required to
create models such as Naive Bayes, Logistic Regression, and SVM. Determine which one
performs the best. You will also be asked to predict store sales using ANN (Artificial Neural
Network). Aside from that, you will be required to implement anomaly detection algorithms.
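
The oversampling mentioned in the objective can be sketched as follows. This is a minimal illustration of plain random oversampling for a binary label using only the standard library; the `random_oversample` helper and the toy transactions are hypothetical, and other oversampling schemes may suit the capstone data better.

```python
import random

def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate minority rows at random until both classes are equal in size.

    Assumes binary labels and at least one minority row.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority_count = sum(1 for y in labels if y != minority_label)
    extra = max(0, majority_count - len(minority))
    # Append randomly chosen copies of minority rows; originals are untouched.
    new_rows = list(rows) + [rng.choice(minority) for _ in range(extra)]
    new_labels = list(labels) + [minority_label] * extra
    return new_rows, new_labels

# Hypothetical toy data: (amount, hour_of_day); 1 = fraud, 0 = genuine.
X = [(12.5, 9), (40.0, 13), (7.9, 18), (99.0, 2), (23.1, 11), (15.0, 20)]
y = [0, 0, 0, 1, 0, 0]

X_bal, y_bal = random_oversample(X, y, minority_label=1)
print(sum(y_bal), len(y_bal) - sum(y_bal))  # fraud count, genuine count
```

Oversampling is normally applied to the training split only, so that duplicated fraud rows do not leak into the data used for evaluation. Because the event rate is low, it is also worth reporting precision and recall alongside accuracy.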

Domain: Retail

Problem Statement: Demand forecasting is one of the key tasks in supply chain and the retail
domain in general. It is key to the effective operation and optimization of the retail supply chain.
Solving this problem effectively requires knowledge of a wide range of machine learning
techniques and a good understanding of ensemble techniques.

Objective: You are required to predict sales at the Store-Day level for one month. All the
features will be provided, and the actual sales that happened during that month will also be
provided for model evaluation.

Tool used: Jupyter notebook, Amazon SageMaker lab

Learning Objective: You will be transforming the dataset variables using data
manipulation techniques such as One-Hot Encoding and conducting an EDA (Exploratory
Data Analysis) to determine the impact of variables on Sales. You will be applying Linear
Regression to predict the store sales. You will investigate Non-Linear Regressors such as
Random Forest or other Tree-based Regressors and compare the performance of Linear
and Non-Linear Regressors based on previous observations. To understand the
significance of deep neural network algorithms you will be using ANN (Artificial Neural
Network) to predict Store Sales.
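
The One-Hot Encoding step mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python; the `one_hot_encode` helper and the `store_type` column are hypothetical, and in the notebook you would more likely use a library routine for the same purpose.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row is a 0/1 indicator list."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # mark the position of this value's category
        rows.append(row)
    return categories, rows

store_type = ["a", "c", "b", "a"]  # hypothetical categorical column
categories, encoded = one_hot_encode(store_type)
print(categories)
for row in encoded:
    print(row)
```

For Linear Regression with an intercept, one of the indicator columns is usually dropped, because the full set of indicators always sums to one and is therefore collinear with the intercept.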

AIML (Bootcamp and PG)

Problem statement<Objective<Tools used<Learning objective

Domain: EdTech

Problem statement and Objective: Simplilearn would like to assess the quality of
E-Learning videos freely available on YouTube. This would give them ideas for preparing
their own video content so that it is more engaging for students. As a pilot study, they have
handpicked playlists corresponding to various Computer Science subjects from an NPTEL
channel. Videos will be assessed on various fronts like instructor presence
in the video, body language, use of blackboard, use of slides, etc.

Tools used: Jupyter notebook, Amazon SageMaker lab

Learning Objective: To proceed with the analysis, you need to employ uniform time
sampling to segment the MP4 videos into keyframes, and then perform clustering. As all
videos on YouTube are freely accessible, you can extract the comments and replies related
to them using YouTube API v3.
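
Uniform time sampling, as described above, amounts to picking one frame at a fixed time interval. The sketch below only computes which frame indices to keep; the `uniform_frame_indices` helper and the example numbers are hypothetical, and reading the frames themselves from the MP4 file would be done with a video library, which is not shown here.

```python
def uniform_frame_indices(total_frames, fps, interval_seconds):
    """Frame indices taken every `interval_seconds` from the start of a video."""
    if total_frames <= 0 or fps <= 0 or interval_seconds <= 0:
        raise ValueError("all arguments must be positive")
    # Number of frames between two consecutive samples (at least one).
    step = max(1, round(fps * interval_seconds))
    return list(range(0, total_frames, step))

# Hypothetical 10-second clip at 30 frames per second, sampled every 2 seconds.
indices = uniform_frame_indices(total_frames=300, fps=30, interval_seconds=2)
print(indices)  # [0, 60, 120, 180, 240]
```

The frames at these indices would then be the keyframes passed on to feature extraction and clustering.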

Domain: Healthcare

Problem Statement: ICMR wants to analyze different types of cancers, such as breast
cancer, renal cancer, colon cancer, lung cancer, and prostate cancer, which have become a
cause of worry in recent years. The input dataset contains 802 samples from 802
people who have been diagnosed with different types of cancer. Each sample contains
expression values of more than 20K genes.
Each sample has one of the following tumor types: BRCA, KIRC, COAD, LUAD, or PRAD.

Objective: Determine the most likely cause of these cancers in terms of the genes
responsible for each type of cancer. This would lead to earlier detection of each type of
cancer, lowering the mortality rate.

Tool used: Jupyter notebook

Learning Objective: Your task is to conduct an Exploratory Data Analysis (EDA) on the
dataset. Afterward, you need to use feature selection algorithms like forward selection and
backward elimination to narrow down the selected attributes. You will then perform
dimensionality reduction using techniques such as PCA, LDA, and t-SNE. Your goal is to
identify groups of genes and sample distributions that exhibit similar behavior. To achieve
this, you will apply clustering techniques such as k-means, hierarchical, and mean shift
clustering on genes and samples. Ultimately, your objective is to develop a strong
classification model that can accurately identify different types of cancer.

Domain: Cyber Security

Problem Statement: Book-My-Show will enable ads on their website, but they are also
very cautious about their user privacy and information about who visits their website. Some
ad URLs could contain a malicious link that can trick a recipient and lead to a malware
installation, freezing the system as part of a ransomware attack, or revealing
sensitive information.

Objective: Book-My-Show now wants to analyze whether a particular URL is prone to
phishing (malicious) or not. The input dataset contains 11k samples corresponding to
11k URLs. Each sample contains 32 features that describe different aspects of
the URL, each taking the value -1, 0, or 1:
-1: Phishing
0: Suspicious
1: Legitimate

Tool used: Jupyter notebook

Learning Objective: Your task is to conduct an Exploratory Data Analysis (EDA) on the
dataset. Identify the correlated features present in the data and remove features whose
correlation with another feature exceeds a chosen threshold. Finally, you will be asked to build a robust
classification system that classifies whether the URL sample is a phishing site or not.
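
The correlation-based feature removal described above can be sketched as follows. This is a minimal illustration using Pearson correlation and only the standard library; the `pearson` and `drop_correlated` helpers, the feature names, the toy values, and the 0.9 threshold are all hypothetical choices, not part of the capstone dataset.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant column; treat as 0
    return cov / sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.9):
    """Keep a column only if it is not too correlated with any column already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical feature columns with values in {-1, 0, 1}.
features = {
    "having_ip": [1, -1, 1, 1, -1, -1, 1, -1],
    "ip_copy": [1, -1, 1, 1, -1, -1, 1, -1],  # exact duplicate of having_ip
    "url_length": [1, 0, -1, 1, 0, -1, 1, 1],
    "ssl_state": [-1, 1, -1, 1, 1, -1, 1, -1],
}
print(drop_correlated(features))
```

This greedy pass keeps the first feature of each highly correlated group, so the result depends on column order; the threshold is a tuning choice to revisit during EDA.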
