43% found this document useful (14 votes)

3K views6 pages

Part 1 Building Your Own Binary Classification Model

The bank conducted an experiment issuing credit cards to 600 applicants without screening. 150 (25%) defaulted. The bank now wants to create a binary classification model to predict defaults using data from the experiment. The author analyzed the training set to create a model combining two inputs. The model had an AUC of 0.7 on the training set and 0.3 on the test set, indicating overfitting. Using the costs of $5000 per false negative and $2500 per false positive, the optimal threshold on the training set was 3.5, with a minimum cost per event of 600. Applying this threshold and model to the test set, the cost per event was 700.

Uploaded by

Wathek Al Zuaiby

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as TXT, PDF, TXT or read online on Scribd

43% found this document useful (14 votes)

3K views6 pages

Part 1 Building Your Own Binary Classification Model

Uploaded by

Wathek Al Zuaiby

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as TXT, PDF, TXT or read online on Scribd

You are on page 1/ 6

First Binary Classification Model

Data_Final Project.xlsx
You work for a bank as a business data analyst in the credit card risk-modeling
department. Your bank conducted a bold experiment three years ago: for a single day
it quietly issued credit cards to everyone who applied, regardless of their credit
risk, until the bank had issued 600 cards without screening applicants.

After three years, 150, or 25%, of those card recipients defaulted: they failed to
pay back at least some of the money they owed. However, the bank collected very
valuable proprietary data that it can now use to optimize its future card-issuing
process.

The bank initially collected six pieces of data about each person:

Age

Years at current employer

Years at current address

Income over the past year

Current credit card debt, and

Current automobile debt

In addition, the bank now has a binary outcome: default = 1, and no default = 0.

Your first assignment is to analyze the data and create a binary classification
model to forecast future defaults.

You will combine data from the above six inputs to output a single score. Use the
Soldier Performance spreadsheet for a simple example of combining multiple inputs.

Forecasting Soldier Performance.xlsx

The relative rank-ordering of scores will determine the models effectiveness. For
convenience-- in particular, so that you can use the AUC Calculator Spreadsheet--
you are asked to use a scale for your score that has a maximum < 3.5 and a minimum
> -3.5.

At first you are not told what your banks own best estimate for its cost per False
Negative (accepted applicant who becomes a defaulting customer) and False Positive
(rejected customer who would not have defaulted) classification.

Therefore, the best you can do is to design your model to maximize the Area Under
the ROC Curve, or AUC.

You are told that if your model is effective (high enough AUC, not defined
further) and robust (again not defined, but in general this means relatively
little decrease in AUC across multiple sets of new data) then it may be adopted by
the bank as its predictive model for default, to determine which future applicants
will be issued credit cards.

You are first given a Training Set of 200 out of the 600 people in the
experiment. The Data_For_Final_Project (below) has both the training set and test
set you will need.

Design your model using the Training Set. Standardized versions of the input data
also provided for your convenience. You may combine the six inputs by adding them
to, or subtracting them from, each other, taking simple ratios, etc. Exclude inputs
that are not helpful and then experiment with how to combine the most informative
inputs.

Note that will need some of your quiz answers again later, so please write them
down and keep track of them as you go along.

Question: What is your model? Give it as a function of the two or more of the six
inputs. For example: (Age + Years at Current Address)/Income [not a great model!].

Your model should have at least two inputs.

1 r

What is your models AUC on the Training Set? Use two digits to the right of the
decimal place.
12 x
6 x
.7 r
((((Less than .5 is not correct - you need to make the highest value the lowest by
dividing by -1.

.5 has no predictive value.

.9 or higher is too good to be true!))))

Initial Assessment for Over-fitting (testing your model on new data)

Next test your model, without changing any parameters, on the Test Set of 200
additional applicants. See the Test Set spreadsheet. It is part of the
Data_For_Final_Project (below) and has both the training and test set.

Data_Final Project.xlsx
Hint: Make and use a second copy of the AUC Calculator Spreadsheet so that you can
compare Test Set and Training Set results easily.

AUC_Calculator and Review of AUC Curve.xlsx

What is your models new AUC on the Test Set? Give two digits to the right of the
decimal place.
12 x
36 x
.3 x
.8 r
((((<.5 is not valid - multiply by -1

.5 means no predictive value

> .90 is too good to be true!)))))

Finding the Cost-Minimizing Threshold for your Model

Now that you have, hopefully, developed your model to the point where it is
relatively robust across the training set and test set, your boss at the bank
finally gives you its current rough estimate of the banks average costs for each
type of classification error.

[Note that all bank models here include only profits and losses within three years
of when a card is issued, so the impact of out-years (years beyond 3) can be
ignored.]

Cost Per False Negative: $5000

Cost Per False Positive: $2500

For the 600 individuals that were automatically given cards without being
classified, the total cost of the experiment turned out to be 25%*($5000)*600 or
$750,000. This is $1,250 per event.

Only models with lower cost per event than $1,250 should have any value.

Question: What is the threshold score on the Training Set data for your model that
minimizes Cost per Event? You will need this number to answer later questions.

Hint: Using the AUC Calculator Spreadsheet, identify which Column displays the same
cost-per-event (row 17) as the overall minimum cost-per-event shown in Cell J2. The
threshold is shown in row 10 of that Column. What the threshold means is that at
and above this number everything is classified as a "default."

20 x
1000 x
3.5 r
((((Thresholds greater than 2.5 may not be utilizing the full range for analysis

Thresholds less than -2.5 0 may not be utilizing the full range for analysis)))))))

Finding the Minimum Cost Per Event

Question: Again referring only to the Training Set data, what is the overall
minimum cost-per-event?

Hint: You will need this number to answer later questions. If you used the AUC
Calculator, the overall minimum cost per event will be displayed in Cell J2.

Note: for Coursera to interpret your answer correctly you must give your answer as
an integer - no decimals or dollar sign.

For Example - enter $800.00 as "800"

600 r

Comparing the New Minimum Cost Per Event on Test Set Data

When you compared AUC for the Training and Test Sets, all that is necessary is to
look up the two different values in Cell G8. But to get an accurate measure of the
cost-savings using the original model on new data, you can not automatically use
the new threshold that results in the overall lowest cost-per-event on the Test
Set.
Remember that your model is being tested for its ability to forecast - but the new
optimal threshold will be known only after the outcomes for the entire Test Set are
known.

All you can use is the model you developed on the Training Set data and the
threshold from the Training Set that you should have recorded when answering
Question 4.

Question: At that same threshold score (NOT the threshold score that would minimize
costs for the new Test Set, but the old threshold score that minimized costs on
the Training Set) what is the cost per event on the test set?

Hint: Using the AUC Calculator Spreadsheet previously provided, locate the column
on the Training Set data that has the lowest-cost-per event. That same column and
threshold in the Test Set copy of the AUC Calculator will have a new cost-per-
event, displayed in row 17. This is almost always higher than the minimum cost-per-
event on the Training Set, and also higher than what the minimal cost-per-event
would be on the Test Set, if one could know the new optimal threshold in advance.
This number is the actual cost per event when applying the model-and-threshold
developed with the Training Set to the new, Test Set data.

Note: for Coursera to interpret your answer correctly you must give your answer as
an integer - no decimals or dollar sign.

For Example - enter $800.00 as "800"

200 x
1 x
150 x
700.00 r
((((((If you find that your costs per event on the test set are much higher than
your costs per event on the training set, consider making your model simpler
probably using fewer input variables as it is probably still over-fitting the
training set data. Problems with over-fitting that are were not obvious at the ROC-
curve stage may emerge when minimizing costs.)))))))))

Putting a Dollar Value on Your Model Plus the Data

Assume your Test Set cost-per-event results from Question 6 are sustainable long
term.

Question: How much money does the bank save, per event, using your model and its
data-inputs, instead of issuing credit cards to everyone who asks?

Hint: the cost of issuing credit cards to everyone (no model, no forecast) has been
determined to be 25%*$5000 = $1,250 per event. Dollar value of the model-plus-data
is the difference between $1,250 and your number.

Note: for Coursera to interpret your answer correctly you must give your answer as
an integer - no decimals or dollar sign.

For Example - enter $800.00 as "800"

100 x

200 r
(((((((((<=$150 savings is a weak model

<$150 to <= $250 savings is an ok model

< $250 to <= $450 savings is a very good model

>$450 savings is an excellent model))))))))

Payback Period for Your Model

Question: Given that it apparently cost the bank $750,000 to conduct the three-year
experiment, if the bank processes 1000 credit card applicants per day on average,
how many days will it take to ensure future savings will pay back the bank's
initial investment?

Give number rounded to the nearest day (integer value).

Hint: multiply your answer to Question 7 - the cost savings per applicant - by 1000
to get the savings per day.

700000 x

3 r
((((((More than a week poor

4-7 days very good

2-3 days excellent

1 day too good to be true!)))))))))

Any model that is reducing uncertainty will have a True Positive Rate...

...Equal to the Test Incidence (% of outcomes classified as "default") x

...Less than the Test Incidence (% of outcomes classified as "default") x
...Greater than the Test Incidence (% of outcomes classified as "default")

Given that the base rate of default in the population is 25%, any test that is
reducing uncertainty will have a Positive Predictive Value (PPV)...

...Equal to .25 x
...Less than .25 x
...Greater than .25

Given that the base rate of default in the population is 25%, any test that is
reducing uncertainty will have a Negative Predictive Value (NPV)...

Equal to .75 x
...Less than .75 x
...Greater than .75

Confusion Matrix Metrics. To determine all performance metrics for a binary

classification, it is sufficient to have three values

The Condition Incidence (here the default rate of 25%)

The probability of True Positives (the True Positive rate multiplied by the
Condition Incidence)
The Test Incidence (also called classification incidence - the sum of the
probability of True Positives and False Positives)
These three values can all be obtained from the AUC Calculator Spreadsheet and and
then used as inputs to the Information Gain Calculator Spreadsheet to determine all
other performance metrics.

AUC_Calculator and Review of AUC Curve.xlsx

Information Gain Calculator.xlsx
Question: What is your models True Positive Rate?

Save this answer as it will be needed again for Part 3 (Quiz 3)

1 x
30 x
.30 r
(((((((<= .25 is incorrect))))))))

Question: What is your models test incidence?

Save this answer as it will be needed again for Part 3 (Quiz 3)

0 x
1 x
1000 x
200.00 x
Test Incidences cannot be so small that they force a high false negative rate nor
large that they force a high false positive rate. A perfect test will of course
have a Test Incidence equal to the Condition Incidence but most classification
systems are focused on avoiding false negatives and have a higher Test Incidence
than Condition Incidence.

Part 4 Modeling Profitability Instead of Default
100% (7)
Part 4 Modeling Profitability Instead of Default
5 pages
Answers For Mastering Data Analysis in Excel
67% (6)
Answers For Mastering Data Analysis in Excel
3 pages
ML Week 3 Logistic Regression
60% (10)
ML Week 3 Logistic Regression
6 pages
Quiz 1
50% (2)
Quiz 1
5 pages
Excel Essentials
43% (7)
Excel Essentials
3 pages
Part 3 Comparing The Information Gain of Alternative Data and Models
60% (5)
Part 3 Comparing The Information Gain of Alternative Data and Models
3 pages
Practice Exam III
100% (2)
Practice Exam III
8 pages
Statistics 578 Assignment 5 Homework
100% (6)
Statistics 578 Assignment 5 Homework
13 pages
3.5 Session 14 - Naive Bayes Classifier
67% (3)
3.5 Session 14 - Naive Bayes Classifier
47 pages
Peer-Graded Assignment Part 5 Modeling Credit Card Default Risk and Customer Profitability 2
56% (18)
Peer-Graded Assignment Part 5 Modeling Credit Card Default Risk and Customer Profitability 2
2 pages
Quiz Week 7 - Support Vector Machines
100% (1)
Quiz Week 7 - Support Vector Machines
3 pages
Peer-Graded Assignment Part 5 Modeling Credit Card Default Risk and Customer Profitability
78% (9)
Peer-Graded Assignment Part 5 Modeling Credit Card Default Risk and Customer Profitability
2 pages
Simba S7 D - Techspecific
No ratings yet
Simba S7 D - Techspecific
4 pages
Fundamental of Time-Frequency Analyses
100% (1)
Fundamental of Time-Frequency Analyses
160 pages
Assignment (Key) 1
100% (1)
Assignment (Key) 1
16 pages
Office Automation
No ratings yet
Office Automation
14 pages
Linear Regression: Major: All Engineering Majors Authors: Autar Kaw, Luke Snyder
100% (1)
Linear Regression: Major: All Engineering Majors Authors: Autar Kaw, Luke Snyder
25 pages
Parametric Models For Regression (Graded)
100% (2)
Parametric Models For Regression (Graded)
6 pages
Assignment
50% (2)
Assignment
10 pages
MIKE11 UserManual
No ratings yet
MIKE11 UserManual
542 pages
Wind Turbine Blade Design On SolidWorks
No ratings yet
Wind Turbine Blade Design On SolidWorks
6 pages
Parametric Models For Regression Graded PDF
No ratings yet
Parametric Models For Regression Graded PDF
6 pages
Part 2 Should The Bank Buy Third-Party Credit Information
60% (10)
Part 2 Should The Bank Buy Third-Party Credit Information
3 pages
Logistic+Regression - Done
100% (1)
Logistic+Regression - Done
41 pages
1 Plant Nutrition
No ratings yet
1 Plant Nutrition
35 pages
Assignment 1
100% (1)
Assignment 1
15 pages
Analizador de Carbono Orgánico Total C391E058L TOC V
100% (1)
Analizador de Carbono Orgánico Total C391E058L TOC V
20 pages
Lecture 9 Moments
No ratings yet
Lecture 9 Moments
29 pages
Prof. U.J.Dixit
No ratings yet
Prof. U.J.Dixit
11 pages
1516-Advanced Paper-2 Set-A PDF
No ratings yet
1516-Advanced Paper-2 Set-A PDF
21 pages
Assignment-Based Subjective Questions/Answers
No ratings yet
Assignment-Based Subjective Questions/Answers
3 pages
Survival Competing Risk
No ratings yet
Survival Competing Risk
29 pages
Casting Technology 04
No ratings yet
Casting Technology 04
11 pages
Data Analysis Formula Sheet Tables (DADM)
No ratings yet
Data Analysis Formula Sheet Tables (DADM)
8 pages
Exploratory Data Analysis On Haberman Dataset PDF
No ratings yet
Exploratory Data Analysis On Haberman Dataset PDF
11 pages
Activity
No ratings yet
Activity
11 pages
Time Series Questions
100% (1)
Time Series Questions
9 pages
Binary Logistic Regression Lecture 9
No ratings yet
Binary Logistic Regression Lecture 9
33 pages
Problem Set 5
No ratings yet
Problem Set 5
5 pages
Probability, AUC, and Excel Linest Function - Coursera
100% (1)
Probability, AUC, and Excel Linest Function - Coursera
7 pages
System Architecture Evolution (SAE) in 3GPP
No ratings yet
System Architecture Evolution (SAE) in 3GPP
24 pages
3.exponential Family & Point Estimation - 552
0% (1)
3.exponential Family & Point Estimation - 552
33 pages
Earth Science Reviewer
No ratings yet
Earth Science Reviewer
13 pages
Introduction To Differential Calculus PDF
No ratings yet
Introduction To Differential Calculus PDF
45 pages
Advanced Statistics - Graded Quiz 1 - Solution
No ratings yet
Advanced Statistics - Graded Quiz 1 - Solution
4 pages
Hypothesis Testing Assignment
No ratings yet
Hypothesis Testing Assignment
4 pages
Lead Scoring Subjective Questions
No ratings yet
Lead Scoring Subjective Questions
3 pages
Assignment #3 Hypothesis Testing
No ratings yet
Assignment #3 Hypothesis Testing
10 pages
Untitled
No ratings yet
Untitled
29 pages
Module 2 Previous Year Questions
No ratings yet
Module 2 Previous Year Questions
9 pages
Simple Regression Quiz
No ratings yet
Simple Regression Quiz
6 pages
Assignment 1 Ans (Reference)
No ratings yet
Assignment 1 Ans (Reference)
18 pages
Decision Theory Quiz and Answer
No ratings yet
Decision Theory Quiz and Answer
4 pages
Probability, AUC, and Excel Linest Function
0% (1)
Probability, AUC, and Excel Linest Function
2 pages
Assignmeant-1 Sharan S
No ratings yet
Assignmeant-1 Sharan S
20 pages
Assignment
No ratings yet
Assignment
12 pages
Personal Loan Risk Model: Answer Key Template
No ratings yet
Personal Loan Risk Model: Answer Key Template
18 pages
IS4242 W6 Model Evaluation and Selection
No ratings yet
IS4242 W6 Model Evaluation and Selection
86 pages
Compressor Data Sheet: Chicago Pneumatic
No ratings yet
Compressor Data Sheet: Chicago Pneumatic
1 page
Roc Curve
No ratings yet
Roc Curve
43 pages
Quizz 4
No ratings yet
Quizz 4
8 pages
Big Data Management and Architecture Assignment
No ratings yet
Big Data Management and Architecture Assignment
9 pages
Part 1 - Building Your Own Binary Classification Model - Coursera
0% (10)
Part 1 - Building Your Own Binary Classification Model - Coursera
3 pages
1-Tac-12csu Tbfi1 Test Report
No ratings yet
1-Tac-12csu Tbfi1 Test Report
15 pages
Assignment
No ratings yet
Assignment
11 pages
Linear Regression. Examples
No ratings yet
Linear Regression. Examples
6 pages
Dynamo Player: Using Revit To Run A Dynamo Script
No ratings yet
Dynamo Player: Using Revit To Run A Dynamo Script
3 pages
Part 2 Should The Bank Buy Third Party Credit Information Coursera 1
No ratings yet
Part 2 Should The Bank Buy Third Party Credit Information Coursera 1
5 pages
Multiquark Hadrons 1st Edition Ahmed Ali Download
No ratings yet
Multiquark Hadrons 1st Edition Ahmed Ali Download
61 pages
Assignment 1 1
No ratings yet
Assignment 1 1
13 pages
Should The Bank Buy Third Party Model Summary Quiz 2
No ratings yet
Should The Bank Buy Third Party Model Summary Quiz 2
2 pages
Part 1: Building Your Own Binary Classification Model: Data - Final Project
No ratings yet
Part 1: Building Your Own Binary Classification Model: Data - Final Project
9 pages
Microsoft-Word Summary-Quiz-1
No ratings yet
Microsoft-Word Summary-Quiz-1
2 pages
Hypothesis Testing Assignment
No ratings yet
Hypothesis Testing Assignment
6 pages
Analytics in Practice: Model Evaluation
No ratings yet
Analytics in Practice: Model Evaluation
40 pages
Final Analytical Cement Mortar Grouting Exceed BOQ Report 11.03.2024 From TL
No ratings yet
Final Analytical Cement Mortar Grouting Exceed BOQ Report 11.03.2024 From TL
55 pages
Set+1 Descriptive+statistics+Probability+
No ratings yet
Set+1 Descriptive+statistics+Probability+
5 pages
Set+1 Descriptive+Statistics+Probability+
No ratings yet
Set+1 Descriptive+Statistics+Probability+
5 pages
Unit 9 Vocabulary
No ratings yet
Unit 9 Vocabulary
34 pages
Chapter 4 Transformations and Weighting To Correct Model Inadequacies 13 March
No ratings yet
Chapter 4 Transformations and Weighting To Correct Model Inadequacies 13 March
27 pages
1 - Assignment - PH 401 (EE) - MODULE - 6 (Statistical Mechanics)
No ratings yet
1 - Assignment - PH 401 (EE) - MODULE - 6 (Statistical Mechanics)
2 pages
HTML Tags
No ratings yet
HTML Tags
14 pages
AEF3e Level 1 TG PCM Grammar 2A
No ratings yet
AEF3e Level 1 TG PCM Grammar 2A
1 page
Flutter Analysis of The Aircraft Wing: Paramasivam Suresh (Ur13Ae044)
No ratings yet
Flutter Analysis of The Aircraft Wing: Paramasivam Suresh (Ur13Ae044)
9 pages
The Automatic Pilot
No ratings yet
The Automatic Pilot
10 pages
Week006-Where-LabExer003 Rivera Dennis
No ratings yet
Week006-Where-LabExer003 Rivera Dennis
6 pages
Summary of Lectures 02 Vector Spaces
No ratings yet
Summary of Lectures 02 Vector Spaces
3 pages
Improvements in The Mechanical Properties of The 18R-6R High-Hysteresis Martensitic Transformation by Nanoprecipitates in CuZnAl Alloys
No ratings yet
Improvements in The Mechanical Properties of The 18R-6R High-Hysteresis Martensitic Transformation by Nanoprecipitates in CuZnAl Alloys
8 pages
CLP 02.2 Course Title: Microprocessors & Microcontrollers Lab
No ratings yet
CLP 02.2 Course Title: Microprocessors & Microcontrollers Lab
6 pages
Iub Port Available Bandwidth Utilizing Ratio PDF
No ratings yet
Iub Port Available Bandwidth Utilizing Ratio PDF
2 pages