AI200 Capstone Project Instructions
1. Why Kaggle?
For the AI200 Capstone, students will work on a Machine Learning prediction project on the
Kaggle platform. The primary reasons for this are two-fold:
1. Students are exposed to the end-to-end process of working on a Kaggle dataset for
Machine Learning. This is crucial so that students are equipped to work on other Kaggle
projects independently after AI200 to continue improving their skills and portfolio.
2. Because Kaggle is widely known among data scientists worldwide, including Kaggle
projects in a resume or portfolio showcases students' technical skills and can give them
a significant edge in career outcomes.
2. Business Scenario
You work for LendingClub, a company which specialises in lending various types of loans to
urban customers. When the company receives a loan application, it has to decide whether to
approve the loan based on the applicant’s profile. Two types of risks are associated
with the company’s decision:
• If the applicant is likely to repay the loan, then not approving the loan results in a loss of
business to the company
• If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving
the loan may lead to a financial loss for the company
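The two risks above correspond to the two kinds of mistakes a default-prediction model can make. A minimal sketch, assuming a simple policy of rejecting any applicant whose predicted default probability reaches a threshold (the function name, the threshold and the sample values are illustrative, not part of the project materials):

```python
def decision_errors(default_probs, defaulted, threshold=0.5):
    """Count both costly mistakes for a 'reject if P(default) >= threshold' policy."""
    lost_business = 0  # applicant would have repaid, but the loan was rejected
    credit_loss = 0    # applicant defaulted, but the loan was approved
    for prob, did_default in zip(default_probs, defaulted):
        rejected = prob >= threshold
        if rejected and not did_default:
            lost_business += 1
        elif not rejected and did_default:
            credit_loss += 1
    return lost_business, credit_loss


# Illustrative values only: predicted default probabilities and true outcomes.
probs = [0.10, 0.40, 0.55, 0.80, 0.30]
defaulted = [False, True, False, True, False]

for threshold in (0.35, 0.5, 0.6):
    print(threshold, decision_errors(probs, defaulted, threshold))
```

Lowering the threshold rejects more applicants, which trades credit loss for lost business; raising it does the opposite.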
The data given contains the information about past loan applicants and whether they ‘defaulted’
or not. The aim is to identify patterns which indicate if a person is likely to default, which may
be used for taking actions such as denying the loan, reducing the amount of loan, lending (to
risky applicants) at a higher interest rate, etc.
When a person applies for a loan, there are two types of decisions that could be taken by the
company:
1. Loan accepted: If the company approves the loan, there are 3 possible scenarios
described below:
o Fully paid: Applicant has fully paid the loan (the principal and the interest)
o Current: Applicant is in the process of paying the instalments, i.e. the tenure of
the loan is not yet completed. These applicants are excluded from the dataset.
o Charged-off: Applicant has not paid the instalments in due time for a long period
of time, i.e. he/she has defaulted on the loan
2. Loan rejected: The company rejected the loan (because the applicant did not
meet its requirements, etc.). Since the loan was rejected, there is no transactional
history of those applicants with the company, so this data is available neither within the
company nor in this dataset.
4. Business Objectives
LendingClub is the largest online loan marketplace, facilitating personal loans, business loans,
and financing of medical procedures. Borrowers can easily access lower interest rate loans
through a fast online interface. As with most other lending companies, lending to ‘risky’
applicants is the largest source of financial loss (called credit loss). The credit loss is the
amount of money lost by the lender when the borrower refuses to pay or runs away with the
money owed. In other words, borrowers who default cause the largest amount of loss to the
lenders. In this case, the customers labelled as 'charged-off' are the 'defaulters'.
If these risky loan applicants can be identified, then such loans can be reduced, thereby
cutting down the amount of credit loss. Identification of such applicants using EDA and
machine learning is the aim of this case study. In other words, the company wants to
understand the driving factors (or driver variables) behind loan default, i.e. the variables which
are strong indicators of default. The company can utilise this knowledge for its portfolio and
risk assessment.
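One simple way to look for driver variables during EDA is to compare the default rate across the values of a feature. A minimal sketch, assuming the outcome column loan_status holds the strings "Fully Paid" and "Charged Off" and using a hypothetical feature named grade (check the actual column names and values in lc_trainingset.csv before relying on either):

```python
from collections import defaultdict


def default_rate_by(rows, feature, target="loan_status", positive="Charged Off"):
    """Return {feature value: share of rows whose target equals `positive`}."""
    counts = defaultdict(lambda: [0, 0])  # value -> [defaults, total]
    for row in rows:
        bucket = counts[row[feature]]
        bucket[0] += row[target] == positive
        bucket[1] += 1
    return {value: defaults / total for value, (defaults, total) in counts.items()}


# Made-up rows for illustration; "grade" is a hypothetical feature name.
rows = [
    {"grade": "A", "loan_status": "Fully Paid"},
    {"grade": "A", "loan_status": "Fully Paid"},
    {"grade": "A", "loan_status": "Charged Off"},
    {"grade": "A", "loan_status": "Fully Paid"},
    {"grade": "C", "loan_status": "Charged Off"},
    {"grade": "C", "loan_status": "Fully Paid"},
]

rates = default_rate_by(rows, "grade")
print(rates)
```

A feature whose values show clearly different default rates is a candidate driver variable; a feature with near-identical rates across values is less likely to help.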
To develop your understanding of the domain, you are advised to independently research a
little about risk analytics (understanding the types of variables and their significance should be
enough).
5. Project Description
This In-Class Prediction Challenge is modelled after the LendingClub Issued Loans dataset.
LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California.
It was the first peer-to-peer lender to register its offerings as securities with the Securities and
Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is
the world's largest peer-to-peer lending platform.
Solving this case study will give us an idea about how real business problems are solved using
EDA and Machine Learning. In this case study, we will also develop an understanding of risk
analytics in banking and financial services and understand how data is used to minimise the
risk of losing money while lending to customers.
In this competition, you will parse through LendingClub’s complete loan dataset and build a
machine learning model to predict which loans are likely to default. Loan default
is an expensive problem that any financial institution engaged in lending inevitably
faces. Each year, the financial industry loses billions of dollars due to loan defaults.
Please note:
▪ You must compete in the same teams as previously registered at the start of the course.
▪ Ensure you have (1) formed a full team and (2) properly renamed your team before
making your first sample submission.
▪ Individual submissions will be invalidated.
The maximum attainable score for the technical part of the project is 40. You need a score of at
least 20 to pass this course (an AUC score higher than 0.5).
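To build intuition for why 0.5 is the baseline, AUC can be read as the probability that a randomly chosen defaulted loan receives a higher predicted score than a randomly chosen repaid loan, with ties counted as half. The sketch below computes it directly from that pairwise definition. It is for illustration only: Kaggle computes the official score, and the labels and scores here are made up.

```python
def auc_score(labels, scores):
    """ROC AUC via the pairwise definition: P(score of a positive > score of a negative)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Each (positive, negative) pair scores 1 if ranked correctly, 0.5 if tied, 0 otherwise.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]  # 1 = defaulted, 0 = repaid (illustrative)
print(auc_score(labels, [0.1, 0.4, 0.35, 0.8]))  # one of four pairs is mis-ranked
print(auc_score(labels, [0.5, 0.5, 0.5, 0.5]))   # uninformative constant prediction
print(auc_score(labels, [0.1, 0.2, 0.8, 0.9]))   # every defaulter ranked above every repayer
```

A model that assigns the same score to every loan scores exactly 0.5, so any score above 0.5 means the model ranks defaulters above non-defaulters more often than not.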
You will receive 2 data files for this competition:
• lc_trainingset.csv. This dataset contains 26 columns of features and 1 outcome column
(loan_status) for loans made by LendingClub. You are to build a predictive classification
model using this dataset to predict the outcome in the loan_status column. Essentially,
think of this as your training dataset.
• lc_testset.csv. This dataset contains 26 columns of features for loans made by
LendingClub with no outcomes/labels. You are to use the model you trained on the
lc_trainingset.csv dataset to predict the loan_status for this dataset and
submit your predictions in .csv format to the Kaggle platform. The
platform will then measure the AUC score of your predictions.
Deliverable: With the help of the provided dataset, utilize machine learning models to generate
prediction probabilities for the loan_status column and submit your predictions to Kaggle.
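Because the deliverable is a set of prediction probabilities rather than hard 0/1 labels, the submission file should hold one probability per test row. A minimal sketch of writing such a file with Python's csv module, assuming a two-column layout with a hypothetical id column next to loan_status (the required header and identifier column should be taken from the sample submission on the competition page):

```python
import csv
import io


def write_submission(stream, ids, default_probs):
    """Write one 'id,loan_status' row per test loan, where loan_status is a probability."""
    if len(ids) != len(default_probs):
        raise ValueError("ids and default_probs must have the same length")
    writer = csv.writer(stream)
    writer.writerow(["id", "loan_status"])  # "id" is an assumed column name
    for row_id, prob in zip(ids, default_probs):
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for id {row_id}: {prob}")
        writer.writerow([row_id, f"{prob:.6f}"])


# Illustrative ids and probabilities; a real run would use the model's predictions.
buffer = io.StringIO()
write_submission(buffer, [101, 102], [0.12, 0.87])
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open("submission.csv", "w", newline="") instead of the in-memory buffer.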
Here is the marking rubric for the technical component of the Capstone project:
Additionally, 5 bonus marks are available. All students are to individually
submit a brief report on:
• Your key takeaways and insights from working on this project (2 marks)
• What are some areas in your current workplace where you think the introduction of a big
data application would be beneficial? (2 marks)
• Walk us through how you would go about implementing one big data application in your
workplace (1 mark)
7. Pre-Project Task: Make Your First Kaggle Submission (Due On Lesson 6)
1. Go to the Kaggle Capstone Project page provided in your class Telegram chat.
3. Read the competition rules and click "I Understand and Accept".
4. Go to the Team tab. One representative from each team will collect the "Team Name"
of each teammate and key them into the “Merge Teams” section to send
merge requests. Make sure all your teammates have approved the request
before moving on to the next step.
5. Only after all team members have been added and your team is complete, change your
team’s name to the one you submitted in the team registration form.
6. Once the full team has been formed, make your first sample submission
(only one team member needs to do this).
7. After completing the 6 steps, your team name should appear in the Public Leaderboard.
Please PM your instructor if you encounter any issues or need further assistance.
8. Competition Rules
Submission Limit: Each team may submit a maximum of 6 entries per day. You may select up to 2
final submissions for judging. The better of the two will be counted towards your final AUC score.
Eligibility: The Competition is open to all AI200 students registered in the current cohort.
Submissions must only be made by the same team as previously registered via
Google Forms. All submissions by individuals will be invalidated.
Use of External Data: Unless otherwise expressly stated on the Competition Website, Participants
must not use external data other than the provided dataset to develop and test
their models and submissions. Heicoders reserves the right in its sole
discretion to disqualify any Participant who is discovered to have undertaken or
attempted to undertake the use of external data during the Competition.
No Sharing of Codes / Data: Sharing code or data outside of teams is not permitted. If any code is
made available to other teams, it must be done so publicly to all participating teams
via the Competition Website discussion forums.
One Account per Participant: As Kaggle strictly prohibits signing up from multiple accounts, no
participant may submit from multiple accounts. If discovered by Kaggle, this may lead to
permanent deactivation and suspension of affected Kaggle accounts.
Determining Winners: This Competition is a challenge of skill, and the results are determined solely by
leaderboard ranking on the private leaderboard at the end of the competition
(subject to compliance with Competition Rules). Participants' scores and ranks
on the public leaderboard are based on the AUC metric and determined by
applying the predictions in the Submission to the ground truth of a 30% subset
of the hidden test.csv outcomes used to generate the private leaderboard.
Prize awards are subject to verification of eligibility and compliance with these
Competition Rules. All decisions of the Competition Sponsor and judges will be
final and binding on all matters relating to this Competition. Competition
Sponsor reserves the right to examine the Submission and any associated code
or documentation for compliance with these Competition Rules. If the
Submission demonstrates a breach of these Competition Rules, Competition
Sponsor may disqualify the Submission(s) at its discretion.
Resolving Ties: A tie between two or more valid and identically ranked submissions will be
resolved in favour of the tied submission submitted first.