Credit Risk Modeling in Python Chapter2

The document discusses using logistic regression for predicting the probability of default on loans. It covers training a logistic regression model, interpreting the model coefficients, preprocessing categorical variables, evaluating model performance using metrics like accuracy and the ROC curve, and selecting classification thresholds.

Logistic regression for probability of default
CREDIT RISK MODELING IN PYTHON

Michael Crabtree
Data Scientist, Ford Motor Company
Probability of default
The likelihood that someone will default on a loan is the probability of default

A probability value between 0 and 1, such as 0.86

A loan_status of 1 indicates a default; 0 indicates a non-default

Probability of default   Interpretation             Predicted loan status
0.40                     Unlikely to default        0
0.90                     Very likely to default     1
0.10                     Very unlikely to default   0

Predicting probabilities
Probabilities of default are the output of a machine learning model
Models learn from data in columns (features)

Classification models predict one of two classes (default, non-default)

Two most common models:

Logistic regression

Decision tree

Logistic regression
Similar to linear regression, but it produces only values between 0 and 1
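
The bounded output comes from passing a linear score through the logistic (sigmoid) function. The sketch below is only an illustration of that idea; the function name sigmoid and the sample scores are assumptions made for this example, not part of the course material.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical linear scores, from strongly negative to strongly positive
linear_scores = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probabilities = sigmoid(linear_scores)
print(probabilities)
```

A score of 0 maps to a probability of exactly 0.5, and larger scores move the probability toward 1 without ever reaching it.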

Training a logistic regression
Logistic regression is available in the scikit-learn package

import numpy as np
from sklearn.linear_model import LogisticRegression

Instantiated like a function call, with or without parameters

clf_logistic = LogisticRegression(solver='lbfgs')

Use the .fit() method to train

clf_logistic.fit(training_columns, np.ravel(training_labels))

Training columns: all of the columns in our data except loan_status

Labels: loan_status (0 or 1)
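
To try the training steps end to end, here is a minimal sketch on synthetic data. The data-generating rule, the column values, and the max_iter setting are assumptions chosen so the example runs on its own; they do not come from the course data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the loan data (assumption, not the real cr_loan)
rng = np.random.default_rng(123)
n_rows = 200
training_columns = pd.DataFrame({
    "loan_int_rate": rng.uniform(5, 20, n_rows),
    "person_emp_length": rng.integers(0, 30, n_rows),
})
# Hypothetical rule: noisy interest rates above 13 tend to default
noisy_rate = training_columns["loan_int_rate"] + rng.normal(0, 3, n_rows)
training_labels = pd.DataFrame({"loan_status": (noisy_rate > 13).astype(int)})

# max_iter is raised only to avoid convergence warnings on unscaled data
clf_logistic = LogisticRegression(solver="lbfgs", max_iter=1000)
clf_logistic.fit(training_columns, np.ravel(training_labels))

print(clf_logistic.intercept_)
print(clf_logistic.coef_)
```

np.ravel() flattens the single-column labels DataFrame into the one-dimensional array that .fit() expects.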

Training and testing
The entire data set is usually split into two parts

Data subset   Usage                                          Portion
Train         Learn from the data to generate predictions   60%
Test          Test learning on new unseen data               40%

Creating the training and test sets
Separate the data into training columns and labels

X = cr_loan.drop('loan_status', axis=1)
y = cr_loan[['loan_status']]

Use the train_test_split() function from scikit-learn

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=123)

test_size: proportion of the data held out for the test set (0.4 = 40%)

random_state: a random seed value for reproducibility
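
As a quick check of the 60/40 split, the sketch below applies the same call to a ten-row frame. The frame contents are made up for illustration; only the column name loan_status and the split parameters follow the slide.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny hypothetical stand-in for cr_loan
cr_loan = pd.DataFrame({
    "loan_int_rate": [7.5, 12.1, 9.9, 15.3, 11.0, 6.2, 18.4, 10.5, 8.8, 14.7],
    "person_income": [52000, 31000, 64000, 28000, 45000, 90000, 22000, 58000, 71000, 33000],
    "loan_status": [0, 1, 0, 1, 0, 0, 1, 0, 0, 1],
})

X = cr_loan.drop("loan_status", axis=1)
y = cr_loan[["loan_status"]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=123
)
print(X_train.shape, X_test.shape)
```

With ten rows and test_size=0.4, six rows go to training and four to testing; rerunning with the same random_state reproduces the same rows in each subset.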

Let's practice!
Predicting the probability of default
Logistic regression coefficients
# Model intercept
array([-3.30582292e-10])
# Coefficients for ['loan_int_rate', 'person_emp_length', 'person_income']
array([[ 1.28517496e-09, -2.27622202e-09, -2.17211991e-05]])

# Calculating the probability of default
int_coef_sum = (-3.3e-10
                + (1.29e-09 * loan_int_rate)
                + (-2.28e-09 * person_emp_length)
                + (-2.17e-05 * person_income))
prob_default = 1 / (1 + np.exp(-int_coef_sum))
prob_nondefault = 1 - prob_default
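
To see the calculation produce a number, the sketch below plugs one applicant into the rounded coefficients from the slide. The applicant's values (a 10% rate, 5 years of employment, an income of 50,000) are hypothetical and chosen only for illustration.

```python
import numpy as np

# Rounded intercept and coefficients as shown on the slide
intercept = -3.3e-10
coefs = np.array([1.29e-09, -2.28e-09, -2.17e-05])

# Hypothetical applicant: loan_int_rate, person_emp_length, person_income
applicant = np.array([10.0, 5.0, 50_000.0])

# Linear score (log-odds), then the logistic transform
int_coef_sum = intercept + np.dot(coefs, applicant)
prob_default = 1 / (1 + np.exp(-int_coef_sum))
prob_nondefault = 1 - prob_default

print(round(float(prob_default), 3), round(float(prob_nondefault), 3))
```

Because the income coefficient is several orders of magnitude larger than the other two, income dominates the score in this particular model.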

Interpreting coefficients
# Intercept
intercept = -1.02
# Coefficient for employment length
person_emp_length_coef = -0.056

For every one-year increase in person_emp_length, the log-odds of default fall by 0.056, so the person is less likely to default

intercept   person_emp_length   value * coef    probability of default
-1.02       10                  (10 * -0.06)    0.17
-1.02       11                  (11 * -0.06)    0.16
-1.02       12                  (12 * -0.06)    0.15
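
The table rows can be reproduced with a few lines of arithmetic. This sketch assumes the coefficient rounded to -0.06, as the table itself does, and a model with only the intercept and the employment-length term.

```python
import math

intercept = -1.02
person_emp_length_coef = -0.06  # rounded value used in the table

probabilities = {}
for emp_length in (10, 11, 12):
    # Log-odds from intercept plus the single coefficient term
    log_odds = intercept + person_emp_length_coef * emp_length
    # Logistic transform converts log-odds to a probability
    probabilities[emp_length] = round(1 / (1 + math.exp(-log_odds)), 2)

print(probabilities)
```

Each extra year lowers the probability by roughly one percentage point in this range, although the step size is not constant because the logistic function is nonlinear.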

Using non-numeric columns
Numeric: loan_int_rate, person_emp_length, person_income

Non-numeric:

cr_loan_clean['loan_intent']

EDUCATION
MEDICAL
VENTURE
PERSONAL
DEBTCONSOLIDATION
HOMEIMPROVEMENT

Non-numeric columns cause errors in most machine learning models in Python unless they are first converted to numbers

One-hot encoding
Represent a string with a number

Each distinct value gets its own new column named column_VALUE, which holds 0 or 1

Get dummies
Use the get_dummies() function from pandas

import pandas as pd

# Separate the numeric columns
cred_num = cr_loan.select_dtypes(exclude=['object'])
# Separate the non-numeric columns
cred_cat = cr_loan.select_dtypes(include=['object'])
# One-hot encode the non-numeric columns only
cred_cat_onehot = pd.get_dummies(cred_cat)
# Union the numeric columns with the one-hot encoded columns
cr_loan = pd.concat([cred_num, cred_cat_onehot], axis=1)
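
To make the resulting column layout concrete, here is a small sketch on a three-row frame. The rows, the explicit object dtype, the dtype=int argument, and the name cr_loan_prep are assumptions added for this example so that it prints 0/1 values consistently across pandas versions.

```python
import pandas as pd

# Tiny hypothetical stand-in for cr_loan
cr_loan = pd.DataFrame({
    "person_income": [50000, 62000, 38000],
    # Forced to object dtype so select_dtypes behaves the same everywhere
    "loan_intent": pd.Series(["EDUCATION", "MEDICAL", "EDUCATION"], dtype="object"),
    "loan_status": [0, 1, 0],
})

cred_num = cr_loan.select_dtypes(exclude=["object"])
cred_cat = cr_loan.select_dtypes(include=["object"])

# dtype=int yields 0/1 integers instead of booleans
cred_cat_onehot = pd.get_dummies(cred_cat, dtype=int)
cr_loan_prep = pd.concat([cred_num, cred_cat_onehot], axis=1)

print(cr_loan_prep)
```

The single loan_intent column becomes loan_intent_EDUCATION and loan_intent_MEDICAL, one per distinct value, while the numeric columns pass through unchanged.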

Predicting the future, probably
Use the .predict_proba() method of a fitted scikit-learn model

# Train the model
clf_logistic.fit(X_train, np.ravel(y_train))
# Predict using the model
clf_logistic.predict_proba(X_test)

Returns an array with one row per loan and two columns: the probability of non-default and the probability of default

# Probabilities: [[non-default, default]]
array([[0.55, 0.45]])
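
The shape of the returned array is easy to confirm on synthetic data. In the sketch below, the features, labels, and the variable names probs and prob_default are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with a noisy linear rule for the label
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

clf = LogisticRegression(solver="lbfgs").fit(X, y)

# One row per observation; columns follow clf.classes_, here [0, 1]
probs = clf.predict_proba(X[:5])
prob_default = probs[:, 1]  # second column: probability of class 1

print(clf.classes_)
print(probs)
```

Each row sums to 1, so the probability of default is always one minus the probability of non-default.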

Let's practice!
Credit model performance
Model accuracy scoring
Calculate accuracy

Use the .score() method of a fitted scikit-learn model

# Check the accuracy against the test data
clf_logistic.score(X_test, y_test)

0.81

81% of values for loan_status were predicted correctly
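
For a classifier, .score() returns the mean accuracy, which is the share of predictions that match the true labels. The sketch below checks that equivalence on synthetic data; the data and variable names are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a noisy linear rule for the label
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.8, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=123
)
clf = LogisticRegression(solver="lbfgs").fit(X_train, y_train)

# Accuracy from .score() and from a manual comparison of labels
accuracy = clf.score(X_test, y_test)
manual_accuracy = np.mean(clf.predict(X_test) == y_test)

print(accuracy, manual_accuracy)
```

Accuracy treats both classes equally, so on data where defaults are rare it can look high even when few defaults are caught.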

ROC curve charts
Receiver Operating Characteristic (ROC) curve
Plots the true positive rate (sensitivity) against the false positive rate (fall-out) at every threshold

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
fallout, sensitivity, thresholds = roc_curve(y_test, prob_default)
plt.plot(fallout, sensitivity, color='darkorange')
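
To inspect what roc_curve() returns without plotting, the sketch below uses four hand-picked labels and scores. The values are illustrative assumptions, not model output from the course.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Four hypothetical loans: true labels and predicted default probabilities
y_test = np.array([0, 0, 1, 1])
prob_default = np.array([0.1, 0.4, 0.35, 0.8])

fallout, sensitivity, thresholds = roc_curve(y_test, prob_default)

# Each pair is one point on the ROC curve
for fpr, tpr in zip(fallout, sensitivity):
    print(f"fall-out={fpr:.2f}  sensitivity={tpr:.2f}")
```

The curve always starts at (0, 0) and ends at (1, 1); the points in between show how many extra false positives each gain in sensitivity costs.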

Analyzing ROC charts
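Area Under Curve (AUC): the area under the ROC curve; a random prediction scores 0.5 and a perfect model scores 1.0, so the lift over the random diagonal measures model quality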
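
AUC can be computed directly with roc_auc_score() from scikit-learn. The four labels and scores below are illustrative assumptions used to keep the arithmetic checkable by hand.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Four hypothetical loans: true labels and predicted default probabilities
y_test = np.array([0, 0, 1, 1])
prob_default = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_test, prob_default)
print(auc)  # 0.75
```

Of the four possible (default, non-default) pairs, the default has the higher score in three, which is where the value 0.75 comes from.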

Default thresholds
Threshold: the probability cutoff above which a loan is classified as a default

Setting the threshold
Relabel loans based on our threshold of 0.5

preds = clf_logistic.predict_proba(X_test)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_default'])
preds_df['loan_status'] = preds_df['prob_default'].apply(lambda x: 1 if x > 0.5 else 0)
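
The effect of moving the threshold shows up clearly on a handful of rows. In this sketch the preds array is a hypothetical stand-in for predict_proba() output, and the column names status_at_50 and status_at_40 are made up for the comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical predict_proba output: [non-default, default] per loan
preds = np.array([
    [0.55, 0.45],
    [0.10, 0.90],
    [0.70, 0.30],
    [0.35, 0.65],
])
preds_df = pd.DataFrame(preds[:, 1], columns=["prob_default"])

# Vectorized relabeling at two different thresholds
preds_df["status_at_50"] = (preds_df["prob_default"] > 0.5).astype(int)
preds_df["status_at_40"] = (preds_df["prob_default"] > 0.4).astype(int)

print(preds_df)
```

Lowering the threshold from 0.5 to 0.4 flips the first loan (probability 0.45) from non-default to default, so more loans are flagged.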

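Credit classification reports
Use classification_report() from scikit-learn

from sklearn.metrics import classification_report
# Example display names for loan_status 0 and 1, in that order
target_names = ['Non-Default', 'Default']
print(classification_report(y_test, preds_df['loan_status'], target_names=target_names))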
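
Here is a small runnable sketch of the report on eight hand-labeled loans. The labels, predictions, and target names are assumptions for illustration; output_dict=True is used so individual numbers can be read back programmatically.

```python
from sklearn.metrics import classification_report

# Eight hypothetical loans: actual and predicted loan_status
y_test = [0, 0, 0, 1, 1, 1, 1, 0]
predicted_status = [0, 0, 1, 1, 0, 0, 1, 0]
target_names = ["Non-Default", "Default"]  # assumed names for 0 and 1

# Text version for reading
print(classification_report(y_test, predicted_status, target_names=target_names))

# Dictionary version for picking out single values
report = classification_report(
    y_test, predicted_status, target_names=target_names, output_dict=True
)
default_recall = report["Default"]["recall"]
print(default_recall)
```

In this toy data the model finds two of the four actual defaults, so the default recall is 0.5.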

Selecting classification metrics
Select and store specific components from the classification_report()

Use the precision_recall_fscore_support() function from scikit-learn

from sklearn.metrics import precision_recall_fscore_support

# The first [1] selects recall; the second [1] selects the default class
precision_recall_fscore_support(y_test, preds_df['loan_status'])[1][1]
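
Unpacking the returned tuple by name can be easier to read than double indexing. The sketch below uses eight hypothetical loans; the labels and predictions are assumptions for illustration.

```python
from sklearn.metrics import precision_recall_fscore_support

# Eight hypothetical loans: actual and predicted loan_status
y_test = [0, 0, 0, 1, 1, 1, 1, 0]
predicted_status = [0, 0, 1, 1, 0, 0, 1, 0]

# Four arrays, each with one entry per class (index 0, then index 1)
precision, recall, fscore, support = precision_recall_fscore_support(
    y_test, predicted_status
)

default_precision = precision[1]
default_recall = recall[1]
print(default_precision, default_recall, support[1])
```

Here recall[1] is the same value that the [1][1] indexing on the slide extracts.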

Let's practice!
Model discrimination and impact
Confusion matrices
Shows the number of correct and incorrect predictions for each loan_status
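
A confusion matrix can be built with confusion_matrix() from scikit-learn. The eight labels and predictions below are hypothetical values chosen so the counts are easy to verify by hand.

```python
from sklearn.metrics import confusion_matrix

# Eight hypothetical loans: actual and predicted loan_status
y_test = [0, 0, 0, 1, 1, 1, 1, 0]
predicted_status = [0, 0, 1, 1, 0, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
matrix = confusion_matrix(y_test, predicted_status)
true_neg, false_pos, false_neg, true_pos = matrix.ravel()

print(matrix)
print(true_neg, false_pos, false_neg, true_pos)
```

The two missed defaults (false negatives) are the costly cell for a lender, because those loans are treated as safe.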

Default recall for loan status
Default recall (or sensitivity) is the proportion of actual defaults that the model correctly predicts as defaults

Recall portfolio impact
Classification report - Underperforming Logistic Regression model

Number of true defaults: 50,000

Loan amount   Defaults predicted / not predicted   Estimated loss on defaults
$50           0.04 / 0.96                          (50,000 x 0.96) x $50 = $2,400,000
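
The loss estimate above is simple enough to wrap in a helper. The function name estimated_loss and its parameter names are hypothetical; the inputs are the figures from the table.

```python
def estimated_loss(num_true_defaults, default_recall, loan_amount):
    """Loss from defaults the model failed to predict."""
    missed_defaults = num_true_defaults * (1 - default_recall)
    return missed_defaults * loan_amount


# Figures from the table: 50,000 defaults, recall of 0.04, $50 per loan
loss = estimated_loss(50_000, 0.04, 50)
print(f"${loss:,.0f}")  # $2,400,000
```

Because the loss scales with (1 - recall), every improvement in default recall reduces the estimated loss proportionally.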

Recall, precision, and accuracy
Difficult to maximize all of them at once because there is a trade-off
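
One way to see the trade-off is to sweep the threshold and watch precision and recall move in opposite directions. The ten labels and probabilities below are hypothetical, picked so the pattern is visible in a small sample.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Ten hypothetical loans: actual status and predicted default probability
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
prob_default = np.array([0.05, 0.2, 0.3, 0.45, 0.4, 0.6, 0.55, 0.7, 0.35, 0.9])

recalls, precisions = [], []
for threshold in (0.3, 0.5, 0.7):
    # Relabel at this threshold, then score the default class
    y_pred = (prob_default > threshold).astype(int)
    recalls.append(recall_score(y_true, y_pred))
    precisions.append(precision_score(y_true, y_pred))
    print(threshold, round(recalls[-1], 2), round(precisions[-1], 2))
```

In this sample, raising the threshold lowers default recall at each step while precision rises, so the choice of threshold depends on which error is more costly.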

Let's practice!