Capstones AIML and DS Capstone Projects
Also you need to create a dashboard that demonstrates relationships and trends for the key
metrics as follows: number of loans, average rental income, monthly mortgage and owner’s
cost, family income vs mortgage cost comparison across different regions. The metrics
described here do no
Objective: A statistical model needs to be created to predict the potential
demand for the amount of loan in dollars for each of the regions in the USA. A dashboard
is also needed that refreshes periodically after data is retrieved from the agencies.
Learning Objective: In this capstone, you will perform exploratory data analysis (EDA), data
preprocessing prior to model building, and then build linear regression models that predict total
monthly expenditure for home mortgage loans. You will also create a dashboard in Tableau.
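As a minimal sketch of the linear regression step, assuming NumPy is available in the notebook, an ordinary least squares fit could look like the following. The feature columns (family income, rental income), the toy numbers, and the function names are hypothetical and only illustrate the technique; they are not taken from the capstone dataset.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Ordinary least squares with an intercept term, solved via np.linalg.lstsq."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # [intercept, weight_1, weight_2, ...]

def predict(coef, X):
    """Apply fitted coefficients to new rows."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]

# Hypothetical toy data: family income and rental income (both in $1000s).
X = np.array([[50, 1.0], [60, 1.2], [80, 1.5], [100, 2.0], [120, 2.1]])
# Synthetic target standing in for monthly mortgage expenditure.
y = 0.5 + 0.02 * X[:, 0] + 0.3 * X[:, 1]

coef = fit_linear_regression(X, y)
```

Because the toy target is an exact linear function of the features, the fit recovers the generating coefficients; real data will leave residuals that you should inspect during model evaluation.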
Domain: Healthcare
Problem statement: NIDDK (National Institute of Diabetes and Digestive and Kidney Diseases)
research creates knowledge about and treatments for the most chronic, costly, and
consequential diseases. The datasets consist of several medical predictor variables and one
target variable (Outcome). Predictor variables include the number of pregnancies the patient
has had, their BMI, insulin level, age, and more.
Objective: In this capstone, you will build a model that accurately predicts whether or not
a patient in the dataset has diabetes, based on certain diagnostic measurements included in
the dataset.
Tools used: Jupyter Notebook and Tableau
Learning objective: You will have to perform descriptive analysis to explore the dataset
variables using histograms. You will also create scatter charts between pairs of variables to
understand the relationships. Report the data by creating a dashboard in Tableau.
Domain: Retail
Problem statement: Customer segmentation is the practice of segregating the customer base
into groups of individuals based on some common characteristics such as age, gender,
interests, and spending habits. It is a critical requirement for a business to understand the
value derived from a customer. RFM is a method used for analyzing customer value.
RFM: ‘Recency’ measures when a customer last placed an order and is an important
customer attribute to consider for segmentation. It is the number of days since a
customer made the last purchase.
How often a customer purchases from the store should also be taken into account for the
customer segmentation exercise. This is termed ‘Frequency’: the number of purchases in a
given period. A higher frequency indicates a more engaged customer.
Can we determine customer value from Recency and Frequency alone? Maybe not, because
we must also incorporate the amount a customer paid for the purchases, which is the
monetary value. ‘Monetary’ is the total amount of money a customer spent in the given time
period.
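The three definitions above can be sketched in pandas as follows. This is only an illustration under stated assumptions: the column names (customer_id, order_id, order_date, amount), the snapshot date, and the sample orders are hypothetical and do not come from the capstone dataset.

```python
import pandas as pd

def compute_rfm(orders, snapshot_date):
    """Return one row per customer with recency (days), frequency, and monetary."""
    grouped = orders.groupby("customer_id")
    return pd.DataFrame({
        # Days between the snapshot date and the customer's last purchase.
        "recency": (snapshot_date - grouped["order_date"].max()).dt.days,
        # Number of distinct orders in the period.
        "frequency": grouped["order_id"].nunique(),
        # Total amount spent in the period.
        "monetary": grouped["amount"].sum(),
    })

# Hypothetical sample orders.
orders = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "C"],
    "order_id": [1, 2, 3, 4, 5, 6],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-02-10",
        "2024-01-20", "2024-02-15", "2024-03-10",
    ]),
    "amount": [20.0, 35.0, 50.0, 10.0, 15.0, 25.0],
})

rfm = compute_rfm(orders, pd.Timestamp("2024-03-31"))
```

Here customer C has the lowest recency and highest frequency, while A and C differ only slightly in monetary value, which is why all three metrics are considered together.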
Learning Objective: You must conduct a preliminary data inspection and data cleaning.
After performing cohort analysis (a cohort is a group of subjects who share a defining
characteristic), you will be asked to build an RFM (Recency Frequency Monetary) model
and calculate RFM metrics. At the end, you will create a dashboard in Tableau by choosing
appropriate chart types and metrics useful for the business.
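One possible way to approach the cohort analysis is to group customers by the month of their first purchase and count how many are active in each later month. The sketch below assumes a hypothetical orders table with customer_id and order_date columns; the function name and sample data are illustrative only.

```python
import pandas as pd

def cohort_counts(orders):
    """Count active customers per (first-purchase month, months since first purchase)."""
    df = orders.copy()
    # Integer month number (year * 12 + month) keeps the month arithmetic simple.
    df["month_no"] = df["order_date"].dt.year * 12 + df["order_date"].dt.month
    first_month_no = df.groupby("customer_id")["month_no"].transform("min")
    df["cohort_index"] = df["month_no"] - first_month_no
    # Label each cohort by the month of the customer's first purchase.
    first_date = df.groupby("customer_id")["order_date"].transform("min")
    df["cohort"] = first_date.dt.strftime("%Y-%m")
    return (
        df.groupby(["cohort", "cohort_index"])["customer_id"]
        .nunique()
        .unstack(fill_value=0)
    )

# Hypothetical sample orders.
orders = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "C"],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-02-10",
        "2024-01-20", "2024-02-15", "2024-03-10",
    ]),
})

table = cohort_counts(orders)
```

Each row of the resulting table is a cohort and each column is the number of months since that cohort's first purchase, which maps naturally onto a retention heatmap in the dashboard.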
AIML(Masters)
Domain: E-commerce
Problem statement: Amazon is an online shopping website that now caters to millions of
people everywhere. Over 34,000 consumer reviews for Amazon brand products like Kindle,
Fire TV Stick and more are provided. The dataset has attributes like brand, categories,
primary categories, reviews.title, reviews.text, and the sentiment. Sentiment is a categorical
variable with three levels: "Positive", "Negative", and "Neutral". For unseen data, the
sentiment needs to be predicted.
Domain: Finance
Problem statement: The finance industry is the biggest consumer of AIML engineers. It faces
constant attacks by fraudsters, who try to trick the system. Correctly identifying fraudulent
transactions is often compared with finding a needle in a haystack because of the low
event rate. It is important that credit card companies are able to recognize fraudulent credit
card transactions so that the customers are not charged for items that they did not
purchase.
Objective: You are required to try various techniques such as supervised models with
oversampling, unsupervised anomaly detection, and heuristics to get good accuracy at
fraud detection.
Tools used: Jupyter Notebook, Amazon SageMaker lab
Learning Objective: You will perform an EDA on the dataset. You will be required to
create models such as Naive Bayes, Logistic Regression, and SVM. Determine which one
performs the best. You will also be asked to predict store sales using ANN (Artificial Neural
Network). Aside from that, you will be required to implement anomaly detection algorithms.
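To illustrate the oversampling idea mentioned in the objective, here is a minimal sketch of plain random oversampling with NumPy. It is only one of several possible oversampling techniques, and the function name, seed, and toy data are hypothetical.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest class."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    indices = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Sample with replacement to fill the gap to the majority class size.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        indices.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(indices)
    return X[idx], y[idx]

# Hypothetical toy data: 8 legitimate (0) and 2 fraudulent (1) transactions.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_res, y_res = random_oversample(X, y)
```

Oversampling should be applied to the training split only, so that duplicated fraud rows do not leak into the data used for evaluation.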
Domain: Retail
Problem Statement: Demand forecasting is one of the key tasks in supply chain and the retail
domain in general. It is key to the effective operation and optimization of the retail supply
chain. Solving this problem effectively requires knowledge of a wide range of machine
learning techniques and a good understanding of ensemble techniques.
Objective: You are required to predict sales at the Store-Day level for one month. All the
features will be provided, and the actual sales for that month will also be
provided for model evaluation.
Learning Objective: You will be transforming the dataset variables using data
manipulation techniques such as One-Hot Encoding and conducting an EDA (Exploratory
Data Analysis) to determine the impact of variables on Sales. You will be applying Linear
Regression to predict the store sales. You will investigate Non-Linear Regressors such as
Random Forest or other Tree-based Regressors and compare the performance of Linear
and Non-Linear Regressors based on previous observations. To understand the
significance of deep neural network algorithms you will be using ANN (Artificial Neural
Network) to predict Store Sales.
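As a small illustration of One-Hot Encoding, the sketch below uses pandas.get_dummies on a hypothetical store table. The column names (store_type, promo, sales) and the values are assumptions made for the example and are not taken from the capstone dataset.

```python
import pandas as pd

# Hypothetical store-day records with one categorical column.
stores = pd.DataFrame({
    "store_type": ["a", "b", "a", "c"],
    "promo": [1, 0, 1, 0],
    "sales": [5200, 4100, 6100, 3900],
})

# Replace the categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(stores, columns=["store_type"], dtype=int)
```

The encoded frame keeps promo and sales and adds store_type_a, store_type_b, and store_type_c, so that both linear and tree-based regressors can consume the categorical information numerically.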
Domain: EdTech
Problem statement and Objective: Simplilearn would like to assess the quality of
E-Learning videos freely available on YouTube. This would give them ideas on preparing
video content that is more engaging for students. They have chosen
handpicked playlists corresponding to various Computer Science Subjects from an NPTEL
channel as a pilot study. Videos will be assessed on various fronts like instructor presence
in the video, body language, use of blackboard, use of slides, etc.
Tools used: Jupyter Notebook, Amazon SageMaker lab
Learning Objective: To proceed with the analysis, you need to employ uniform time
sampling to segment the MP4 videos into keyframes, and then perform clustering. As all
videos on YouTube are freely accessible, you can extract the comments and replies related
to them using YouTube API v3.
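Uniform time sampling can be sketched as picking timestamps at a fixed interval and mapping them to frame numbers. The example below is a pure-Python illustration under assumed values (a 10-second clip, a 2.5-second interval, 30 frames per second); the function names are hypothetical, and reading the actual frames from an MP4 file would require a video library, which is not shown here.

```python
def uniform_sample_times(duration_s, interval_s):
    """Return timestamps in seconds, spaced interval_s apart, from 0 up to (not including) duration_s."""
    if duration_s <= 0 or interval_s <= 0:
        raise ValueError("duration_s and interval_s must be positive")
    times = []
    k = 0
    while k * interval_s < duration_s:
        times.append(k * interval_s)
        k += 1
    return times

def frame_indices(times, fps):
    """Map timestamps to the nearest frame index for a given frame rate."""
    return [int(round(t * fps)) for t in times]

# Assumed values for illustration only.
times = uniform_sample_times(duration_s=10, interval_s=2.5)
frames = frame_indices(times, fps=30)
```

The frames selected this way can then be turned into feature vectors and passed to a clustering algorithm.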
Domain: Healthcare
Problem Statement: ICMR wants to analyze different types of cancers, such as breast
cancer, renal cancer, colon cancer, lung cancer, and prostate cancer, which have become a
cause of
worry in recent years. The input dataset contains 802 samples for the corresponding 802
people who have been detected with different types of cancer. Each sample contains
expression values of more than 20K genes.
Each sample has one of the following tumor types: BRCA, KIRC, COAD, LUAD, or PRAD.
Objective: Determine the most likely cause of these cancers in terms of the genes
responsible for each type of cancer. This would lead to earlier detection of each type of
cancer, lowering the mortality rate.
Problem Statement: Book-My-Show will enable ads on its website, but it is also
very cautious about user privacy and information about who visits the website. Some
ad URLs could contain a malicious link that can trick a recipient and lead to malware
installation, freezing the system as part of a ransomware attack, or revealing
sensitive information.
Objective: Book-My-Show now wants to analyze whether a particular URL is prone to
phishing (malicious) or not. The input dataset contains 11k samples corresponding to
11k URLs. Each sample contains 32 features that give a different and unique description of
the URL, each taking one of the values -1, 0, or 1.
-1: Phishing
0: Suspicious
1: Legitimate
Learning Objective: Your task is to conduct an Exploratory Data Analysis (EDA) on the
dataset. Identify the correlated features present in the data and remove features whose
correlation exceeds a chosen threshold. Finally, you will be asked to build a robust
classification system that classifies whether the URL sample is a phishing site or not.
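A common way to remove correlated features is to scan the upper triangle of the absolute correlation matrix and drop any column that exceeds the threshold against an earlier column. The sketch below assumes pandas and NumPy; the feature names f1 to f3, the toy values, and the 0.9 threshold are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop each column whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical URL features taking the values -1, 0, or 1.
features = pd.DataFrame({
    "f1": [1, -1, 1, 0, -1, 1],
    "f2": [1, -1, 1, 0, -1, 1],   # duplicate of f1, so perfectly correlated
    "f3": [1, 1, -1, 0, 1, -1],
})

reduced, dropped = drop_correlated(features, threshold=0.9)
```

In this toy example only f2 is removed, because it duplicates f1, while f3 stays since its correlation with f1 is below the threshold.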