AI200 Capstone Project Instructions

The document provides information about a capstone project for an AI course where students will use machine learning to predict loan defaults on Kaggle. The objectives are to identify patterns indicating if a person will default to help the lending company, LendingClub, reduce losses from loans that are not repaid. Students will be provided loan data to build and evaluate models to predict loan status. Their performance will be assessed based on metrics like data cleaning, feature engineering, model implementation and the model's AUC score.

Uploaded by

WANG ERJIA
Copyright
© All Rights Reserved

ARTIFICIAL INTELLIGENCE TRACK


AI200: APPLIED MACHINE LEARNING
CAPSTONE PROJECT

1. Why Kaggle?

For the AI200 Capstone, students will work on a Machine Learning prediction project on the
Kaggle platform. The primary reasons for this are two-fold:

1. Students are exposed to the end-to-end process of working on a Kaggle dataset for
Machine Learning. This is crucial so that students are equipped to work on other Kaggle
projects independently after AI200 to continue improving their skills and portfolio.

2. Because Kaggle is widely known among data scientists worldwide, including Kaggle
projects that showcase their technical skills in a resume or portfolio gives students
a significant edge in career outcomes.

2. Business Scenario

You work for LendingClub, a company that specialises in providing various types of loans to
urban customers. When the company receives a loan application, it has to decide whether to
approve the loan based on the applicant’s profile. Two types of risk are associated with the
company’s decision:

• If the applicant is likely to repay the loan, then not approving the loan results in a loss of
business to the company
• If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving
the loan may lead to a financial loss for the company

The data given contains the information about past loan applicants and whether they ‘defaulted’
or not. The aim is to identify patterns which indicate if a person is likely to default, which may
be used for taking actions such as denying the loan, reducing the amount of loan, lending (to
risky applicants) at a higher interest rate, etc.

When a person applies for a loan, there are two types of decisions that could be taken by the
company:

1. Loan accepted: If the company approves the loan, there are 3 possible scenarios
described below:
o Fully paid: Applicant has fully repaid the loan (the principal and the interest)
o Current: Applicant is in the process of paying the instalments, i.e. the tenure of
the loan is not yet completed. These applicants are excluded from the dataset.
o Charged-off: Applicant has not paid the instalments in due time for a long period
of time, i.e. he/she has defaulted on the loan
2. Loan rejected: The company rejected the loan (because the applicant did not
meet its requirements, etc.). Since the loan was rejected, there is no transactional
history of those applicants with the company, so this data is available neither within
the company nor in this dataset.
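
The outcome categories above can be turned into a binary target for modelling. The following is a minimal sketch, assuming the loan_status column holds strings such as "Fully Paid", "Charged Off" and "Current"; these spellings and the helper name encode_loan_status are assumptions for illustration and should be checked against the actual data.

```python
# Hypothetical label spellings; verify them against the real loan_status values.
DEFAULT_LABELS = {"charged off", "charged-off"}
REPAID_LABELS = {"fully paid"}


def encode_loan_status(status):
    """Return 1 for a defaulted loan, 0 for a repaid loan, None otherwise."""
    key = status.strip().lower()
    if key in DEFAULT_LABELS:
        return 1
    if key in REPAID_LABELS:
        return 0
    # Any other status (e.g. "Current") has no final outcome yet.
    return None


statuses = ["Fully Paid", "Charged Off", "Current", "fully paid "]
targets = [encode_loan_status(s) for s in statuses]
```

Rows that map to None would be dropped before training, which mirrors how loans that are still current are excluded from the dataset.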

4. Business Objectives

LendingClub is the largest online loan marketplace, facilitating personal loans, business loans,
and financing of medical procedures. Borrowers can easily access lower interest rate loans
through a fast online interface. As with most other lending companies, lending to ‘risky’
applicants is the largest source of financial loss (called credit loss). The credit loss is the
amount of money lost by the lender when the borrower refuses to pay or runs away with the
money owed. In other words, borrowers who default cause the largest amount of loss to the
lenders. In this case, the customers labelled as 'charged-off' are the 'defaulters'.

If one is able to identify these risky loan applicants, then such loans can be reduced, thereby
cutting down the amount of credit loss. Identification of such applicants using EDA and
machine learning is the aim of this case study. In other words, the company wants to
understand the driving factors (or driver variables) behind loan default, i.e. the variables which
are strong indicators of default. The company can utilise this knowledge for its portfolio and
risk assessment.

To develop your understanding of the domain, you are advised to independently research a
little about risk analytics (understanding the types of variables and their significance should be
enough).

5. Project Description

This In-Class Prediction Challenge is modelled after the LendingClub Issued Loans dataset.
LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California.
It was the first peer-to-peer lender to register its offerings as securities with the Securities and
Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is
the world's largest peer-to-peer lending platform.

Solving this case study will give us an idea about how real business problems are solved using
EDA and Machine Learning. In this case study, we will also develop an understanding of risk
analytics in banking and financial services and understand how data is used to minimise the
risk of losing money while lending to customers.

In this competition, you will parse LendingClub’s complete loan dataset and build a
machine learning model to predict which loans are likely to default. Loan default
is an expensive problem that any financial institution engaged in lending inevitably
faces. Each year, the financial industry loses billions of dollars due to loan defaults.

Please note:

▪ You must compete in the same teams as previously registered at the start of the course.
▪ Ensure you have (1) formed a full team and (2) properly renamed your team before
making your first sample submission.
▪ Individual submissions will be invalidated.

6. Data Description, Project Deliverable and Marking Rubrics (45 marks)

The maximum attainable score for the technical part of the project is 40. You will need to
attain a score of at least 20 to pass this course (an AUC score higher than 0.5).
You will receive 2 data files for this competition:
• lc_trainingset.csv. This dataset contains 26 feature columns and 1 outcome column
(loan_status) for loans made by LendingClub. You are to build a predictive classification
model using this dataset to predict the outcome in the loan_status column. Essentially,
think of this as your training dataset.

• lc_testset.csv. This dataset contains 26 feature columns for loans made by LendingClub,
with no outcomes/labels. You are to use the model you trained on the
lc_trainingset.csv dataset to predict the loan_status for this dataset and
submit your predictions in .csv format to the Kaggle platform. The
platform will then measure the AUC score of your predictions.

Deliverable: With the help of the provided dataset, utilize machine learning models to generate
prediction probabilities for the loan_status column and submit your predictions to Kaggle.
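
Because the deliverable asks for prediction probabilities rather than hard labels, a submission file holds one probability per test loan. The sketch below shows one way to write such a file with the standard library. The column names "id" and "loan_status", the helper name write_submission and the example values are assumptions; the real header should be copied from the sample-submission.csv file on the competition page.

```python
import csv
import io


def write_submission(ids, probabilities, handle,
                     id_col="id", target_col="loan_status"):
    """Write one row per test loan: its identifier and predicted default probability.

    The column names are placeholders, not the confirmed competition format.
    """
    if len(ids) != len(probabilities):
        raise ValueError("ids and probabilities must have the same length")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow([id_col, target_col])
    for loan_id, prob in zip(ids, probabilities):
        # A probability must lie in [0, 1]; hard 0/1 labels would discard
        # the ranking information that the AUC metric rewards.
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for id {loan_id}: {prob}")
        writer.writerow([loan_id, f"{prob:.6f}"])


# Illustrative values only; a real run would use the model's predicted probabilities.
buffer = io.StringIO()
write_submission([101, 102, 103], [0.12, 0.87, 0.5], buffer)
submission_text = buffer.getvalue()
```

To produce an actual file, the same function can be called with a handle opened via open("submission.csv", "w", newline="").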

Here are the marking rubrics for the technical component of the Capstone project:

Each metric is scored at one of five levels: Fail (0 marks), Poor (2.5 marks),
Satisfactory (5 marks), Good (7.5 marks) or Exceptional (10 marks).

• Data Cleaning
o Fail: No data cleaning was performed at all
o Poor: Trainees performed data cleaning to some extent, however the techniques used
were mostly inappropriate
o Satisfactory: Trainees performed data cleaning, but several techniques were unsuitable
for the problem
o Good: Trainees performed data cleaning with fewer than 2 mistakes and considered most
of the corner cases
o Exceptional: Trainees were flawless in applying the correct data techniques and process,
and managed to consider all corner cases

• Feature Engineering
o Fail: Trainees did not perform any feature engineering
o Poor: Trainees used feature engineering to some extent. However, the techniques used
were mostly inappropriate
o Satisfactory: Trainees used feature engineering but several of the techniques were
wrongly implemented
o Good: Trainees used feature engineering with fewer than 2 implementation mistakes
o Exceptional: Trainees were flawless in the implementation of feature engineering and
managed to consider all corner cases

• Model Implementation
o Fail: Trainees did not implement any machine learning model
o Poor: Trainees implemented basic machine learning models but the code cannot be
executed
o Satisfactory: Trainees managed to implement a basic machine learning model with minor
implementation mistakes
o Good: Trainees implemented advanced machine learning techniques (i.e. ensemble
learning) with no implementation mistakes
o Exceptional: Trainees implemented advanced machine learning techniques outside the
scope of the class with no implementation mistakes

• AUC Score
o Fail: AUC < 0.5
o Poor: AUC 0.5 - 0.65
o Satisfactory: AUC 0.65 - 0.8
o Good: AUC 0.8 - 0.85
o Exceptional: AUC > 0.85
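
To make the AUC bands in the rubric concrete, the sketch below computes AUC directly from its definition (the probability that a randomly chosen defaulted loan receives a higher score than a randomly chosen repaid loan, with ties counted as half) and maps the result to the rubric marks. The function names are hypothetical. The rubric does not say which band the shared boundaries 0.65 and 0.8 belong to, so assigning them to the higher band is an assumption.

```python
def auc_score(y_true, y_score):
    """AUC by pairwise comparison of positive (1) and negative (0) examples."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


def auc_marks(auc):
    """Map an AUC value to rubric marks (boundary handling is an assumption)."""
    if auc < 0.5:
        return 0.0
    if auc < 0.65:
        return 2.5
    if auc < 0.8:
        return 5.0
    if auc <= 0.85:
        return 7.5
    return 10.0


# Toy labels and scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
example_auc = auc_score(y_true, y_score)   # 3 of 4 pairs ranked correctly
example_marks = auc_marks(example_auc)
```

The pairwise loop is quadratic and is meant only to show what the metric measures; for a real validation set, a library routine such as scikit-learn's roc_auc_score is the usual choice.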

In addition, 5 bonus marks are available. All students are to individually
submit a brief report on:

• Your key takeaways and insights from working on this project (2 marks)
• What are some areas in your current workplace where you think the introduction of big
data applications will be beneficial? (2 marks)
• Walk us through how you will go about implementing one big data application in your
workplace (1 mark)

7. Pre-Project Task: Make Your First Kaggle Submission (Due On Lesson 6)

Please complete the following steps before Lesson 6:

1. Go to the Kaggle Capstone Project page provided in your class Telegram chat.

2. Click 'Join Competition' (see screenshot)

3. Read the competition rules and click "I Understand and Accept"

4. Go to the Team tab. One representative from each team should collect the "Team Name"
of each teammate and key each of them into the “Merge Teams” section to send a
request to merge teams. Make sure all your teammates have approved your request
before moving on to the next step.

5. Only after all team members have been added and your team is complete, change your
team’s name to the one you submitted in the team registration form.

6. As a team, only after you have formed the full team, make your first sample submission.
(Only one team member needs to do this):

a. Under the Data tab, download sample-submission.csv, then click Submit Predictions.


b. The description field is only visible to the instructor team and your teammates –
this field is for you to keep track of which submission is for which model
i. Example: “xgboost with 20 features and max_depth=__”
ii. For this sample submission, you can simply fill in "sample submission".

7. After completing the 6 steps, your team name should appear in the Public Leaderboard.
Please PM your instructor if you encounter any issues or need further assistance.

8. Competition Rules

Submission Limit: Each team may submit a maximum of 6 entries per day.
You may select up to 2 final submissions for judging. The better of the two will
be counted towards your final AUC score.

Eligibility: The Competition is open to all AI200 students registered in the current cohort.
Submissions must only be made by the same team as previously registered via
Google Forms. All submissions by individuals will be invalidated.

Use of External Data: Unless otherwise expressly stated on the Competition Website,
Participants must not use external data other than the provided dataset to develop and test
their models and submissions. Heicoders reserves the right in its sole
discretion to disqualify any Participant who is discovered to have undertaken or
attempted to undertake the use of external data during the Competition.

No Sharing of Code / Data: Sharing code or data outside of teams is not permitted. If any
code is made available to other teams, it must be shared publicly with all participating
teams via the Competition Website discussion forums.

One Account per Participant: As Kaggle strictly prohibits signing up from multiple accounts,
no participant may submit from multiple accounts. If discovered by Kaggle, this may lead to
permanent deactivation and suspension of affected Kaggle accounts.

Winner’s Obligation: As a condition of receipt of the Prize, winning teams must:

● Deliver the final model’s software code to Heicoders Academy by the
day before Lesson 8 in the form of a Jupyter Notebook. The delivered
software code must be capable of generating the winning submission
and include a description of resources required to run the executable
code successfully. This notebook is to be accompanied by associated
documentation (consistent with the winning model documentation
template available on the Kaggle wiki) to be eligible for the prize.

● Present their winning submission Notebook to the class on Lesson 8
(within a duration of 5 minutes)

Determining Winners: This Competition is a challenge of skill, and the results are determined
solely by leaderboard ranking on the private leaderboard at the end of the competition
(subject to compliance with Competition Rules). Participants' scores and ranks
on the public leaderboard are based on the AUC metric and determined by
applying the predictions in the Submission to the ground truth of a 30% subset
of the hidden test.csv outcomes used to generate the private leaderboard.

Prize awards are subject to verification of eligibility and compliance with these
Competition Rules. All decisions of the Competition Sponsor and judges will be
final and binding on all matters relating to this Competition. Competition
Sponsor reserves the right to examine the Submission and any associated code
or documentation for compliance with these Competition Rules. If the
Submission demonstrates a breach of these Competition Rules, Competition
Sponsor may disqualify the Submission(s) at its discretion.

Resolving Ties: A tie between two or more valid and identically ranked submissions will be
resolved in favour of the tied submission submitted first.

Declining Prizes: A Participant may decline to be nominated as a Winner by notifying Heicoders
directly within 1 day following the Competition deadline, in which case the
declining Participant forgoes any prize or other features associated with
winning the Competition.
Kaggle reserves the right to disqualify a Participant who so declines at Kaggle's
sole discretion if Kaggle deems disqualification appropriate.
