1 - Credit Risk Log File


--------------------------------------------------------------------------------
name: <unnamed>
log: \Credit Risk Log File_Last Version.log
log type: text

. Data : "2_Credit Risk data.dta"

. * Data description: the dependent variable is creditrisk (binary: 1 if the
. * customer is high risk, 0 if the customer is low risk). The independent
. * continuous variables are checking, savings, monthscustomer, monthsemployed and
. * age; the independent categorical variables are loanpurpose, gender,
. * maritalstatus, housing and job.
********************************************************************************

. * Descriptive statistics: the Stata command "sum" for the continuous variables
. * and the Stata command "tab1" to display all the frequency tables of the
. * categorical variables at once.

.
.
. sum checking savings monthscustomer monthsemployed age

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    checking |        425    1048.014    3147.183          0      19812
     savings |        425    1812.562    3597.285          0      19811
monthscust~r |        425    22.89647     12.2676          5         73
monthsempl~d |        425    31.89647    32.25932          0        119
         age |        425    34.39765    11.04513         18         73

. tab1 loanpurpose gender maritalstatus housing job

-> tabulation of loanpurpose

loanpurpose | Freq. Percent Cum.


------------+-----------------------------------
1 | 44 10.35 10.35
2 | 23 5.41 15.76
3 | 85 20.00 35.76
4 | 4 0.94 36.71
5 | 104 24.47 61.18
6 | 12 2.82 64.00
7 | 2 0.47 64.47
8 | 105 24.71 89.18
9 | 40 9.41 98.59
10 | 6 1.41 100.00
------------+-----------------------------------
Total | 425 100.00

-> tabulation of gender

gender | Freq. Percent Cum.


------------+-----------------------------------
0 | 135 31.76 31.76
1 | 290 68.24 100.00
------------+-----------------------------------
Total | 425 100.00

-> tabulation of maritalstatus

maritalstat |
us | Freq. Percent Cum.
------------+-----------------------------------
1 | 156 36.71 36.71
2 | 36 8.47 45.18
3 | 233 54.82 100.00
------------+-----------------------------------
Total | 425 100.00

-> tabulation of housing

housing | Freq. Percent Cum.


------------+-----------------------------------
1 | 292 68.71 68.71
2 | 81 19.06 87.76
3 | 52 12.24 100.00
------------+-----------------------------------
Total | 425 100.00

-> tabulation of job

job | Freq. Percent Cum.


------------+-----------------------------------
1 | 54 12.71 12.71
2 | 271 63.76 76.47
3 | 89 20.94 97.41
4 | 11 2.59 100.00
------------+-----------------------------------
Total | 425 100.00

. * frequency table of the dependent variable : creditrisk

. tab creditrisk

creditrisk | Freq. Percent Cum.


------------+-----------------------------------
0 | 214 50.35 50.35
1 | 211 49.65 100.00
------------+-----------------------------------
Total | 425 100.00

. * Comment: 50.35% of borrowers are ranked low risk and 49.65% are ranked high
. * risk by the bank.

. * Joint distribution of, and association between, creditrisk and loanpurpose
. * (the Stata command is: tab2 creditrisk loanpurpose, chi2)

.
.
. tab2 creditrisk loanpurpose, chi2

-> tabulation of creditrisk by loanpurpose


           |                              loanpurpose
creditrisk |      1      2      3      4      5      6      7      8      9     10 |  Total
-----------+-----------------------------------------------------------------------+-------
         0 |     21      9     42      1     39      8      1     63     28      2 |    214
         1 |     23     14     43      3     65      4      1     42     12      4 |    211
-----------+-----------------------------------------------------------------------+-------
     Total |     44     23     85      4    104     12      2    105     40      6 |    425

Pearson chi2(9) = 21.2695 Pr = 0.012

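. * There is a statistically significant association between creditrisk and
. * loanpurpose at the 5% level (p-value = 1.2% < 5%). The chi-squared test shows
. * that an association exists, not how strong it is.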
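
. * The chi-squared p-value says nothing about the size of the association, and
. * the approximation behind it can be poor when cells are sparse (loanpurpose
. * categories 4, 7 and 10 contain only 4, 2 and 6 customers). The commands below
. * are a hypothetical follow-up, not part of the original session, using standard
. * options of "tabulate": expected cell counts, Cramér's V and Fisher's exact test.

```stata
* Hypothetical follow-up; these commands were not run in the original session.
* expected: show the expected count under independence in each cell
* V:        report Cramér's V, a 0-to-1 measure of the strength of association
tabulate creditrisk loanpurpose, chi2 expected V

* exact: Fisher's exact test, which does not rely on large expected counts
* (it can be slow or memory-hungry for a 2 x 10 table)
tabulate creditrisk loanpurpose, exact
```

. * If the exact test and the chi-squared test lead to the same conclusion, the
. * sparse cells are probably not driving the result.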

.
.
. * Joint distribution of, and association between, creditrisk and gender (the
. * Stata command is: tab2 creditrisk gender, chi2)

.
.
. tab2 creditrisk gender, chi2

-> tabulation of creditrisk by gender

| gender
creditrisk | 0 1 | Total
-----------+----------------------+----------
0 | 57 157 | 214
1 | 78 133 | 211
-----------+----------------------+----------
Total | 135 290 | 425

Pearson chi2(1) = 5.2320 Pr = 0.022

.
.
. * There is a statistically significant association between creditrisk and gender
. * at the 5% level (p-value = 2.2% < 5%).

. * Joint distribution of, and association between, creditrisk and housing (the
. * Stata command is: tab2 creditrisk housing, chi2)

. tab2 creditrisk housing , chi2

-> tabulation of creditrisk by housing

| housing
creditrisk | 1 2 3 | Total
-----------+---------------------------------+----------
0 | 161 32 21 | 214
1 | 131 49 31 | 211
-----------+---------------------------------+----------
Total | 292 81 52 | 425

Pearson chi2(2) = 8.5524 Pr = 0.014

. * There is a statistically significant association between creditrisk and
. * housing at the 5% level (p-value = 1.4% < 5%).
********************************************************************************

. **** LOGISTIC REGRESSION****

. * Logistic regression is used when the dependent variable is binary, with the
. * usual coding: 0 for a negative outcome (the event did not occur) and 1 for a
. * positive outcome (the event occurred). We use a logit model when we want to
. * see how the independent variables affect the probability that the event occurs.

. * Logit model: y* = c + bx + e, where y* is the latent variable behind the
. * observed binary outcome y (creditrisk; y = 1 when y* > 0), x is a set of
. * independent continuous and categorical variables (checking savings
. * monthscustomer monthsemployed age loanpurpose gender maritalstatus housing
. * job), and c (the constant, rarely of substantive interest in a logistic
. * regression) and b are parameters to be estimated. The error term e follows a
. * standard logistic distribution, with mean 0 and variance π^2/3 (pi squared
. * over three). Then Pr(y=1|x) = exp(c+bx) / (1 + exp(c+bx)). So a positive
. * coefficient b indicates that higher levels of x are associated with an
. * increase in Pr(y=1|x), and a negative coefficient indicates that higher
. * levels of x are associated with a decrease in Pr(y=1|x).
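
. * The probability formula can be inverted to show why b is read on the log-odds
. * scale. The derivation below is the standard one, written with the same symbols
. * c, b and x; it is an illustration and does not come from the log itself.

```latex
\begin{align}
\Pr(y=1\mid x) &= \frac{\exp(c+bx)}{1+\exp(c+bx)} \\
\frac{\Pr(y=1\mid x)}{1-\Pr(y=1\mid x)} &= \exp(c+bx) \\
\ln\frac{\Pr(y=1\mid x)}{\Pr(y=0\mid x)} &= c+bx
\end{align}
```

. * A one-unit increase in x therefore adds b to the log-odds, which is the same as
. * multiplying the odds by exp(b).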

.
.
. * The Stata command to estimate a logit model is: "logit depvar indvars"

. logit creditrisk checking savings monthscustomer monthsemployed age i.loanpurpose
> i.gender i.maritalstatus i.housing i.job

Iteration 0: log likelihood = -294.57696


Iteration 1: log likelihood = -257.67227
Iteration 2: log likelihood = -257.54937
Iteration 3: log likelihood = -257.54926
Iteration 4: log likelihood = -257.54926

Logistic regression Number of obs = 425


LR chi2(22) = 74.06
Prob > chi2 = 0.0000
Log likelihood = -257.54926 Pseudo R2 = 0.1257

--------------------------------------------------------------------------------
creditrisk | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
checking | -.0000476 .0000348 -1.37 0.171 -.0001158 .0000206
savings | -.0000496 .0000316 -1.57 0.117 -.0001116 .0000124
monthscustomer | .0502559 .0105246 4.78 0.000 .029628 .0708837
monthsemployed | -.0039044 .0037417 -1.04 0.297 -.011238 .0034291
age | -.0116182 .0112195 -1.04 0.300 -.0336081 .0103717
|
loanpurpose |
2 | .1805578 .5798451 0.31 0.756 -.9559176 1.317033
3 | .0208036 .4146947 0.05 0.960 -.791983 .8335902
4 | 1.511012 1.296294 1.17 0.244 -1.029677 4.051702
5 | .7164681 .4034342 1.78 0.076 -.0742485 1.507185
6 | -.7741574 .7415011 -1.04 0.296 -2.227473 .6791581
7 | .5033596 1.546789 0.33 0.745 -2.528292 3.535011
8 | -.2122486 .404214 -0.53 0.600 -1.004494 .5799963
9 | -1.345953 .5263792 -2.56 0.011 -2.377637 -.3142685
10 | .0141124 1.029348 0.01 0.989 -2.003373 2.031598
|
1.gender | .129157 .5282329 0.24 0.807 -.9061604 1.164474
|
maritalstatus |
2 | -.4163424 .6175879 -0.67 0.500 -1.626792 .7941076
3 | -.6619016 .5082489 -1.30 0.193 -1.658051 .334248
|
housing |
2 | .5931464 .2893651 2.05 0.040 .0260012 1.160292
3 | .5939747 .3756847 1.58 0.114 -.1423538 1.330303
|
job |
2 | -.2862753 .3604276 -0.79 0.427 -.9927004 .4201498
3 | -.1103014 .412744 -0.27 0.789 -.9192648 .6986621
4 | -.5714215 .7860727 -0.73 0.467 -2.112096 .9692526
|
_cons | -.1554973 .6956042 -0.22 0.823 -1.518857 1.207862
--------------------------------------------------------------------------------

. * Model 2: logit creditrisk savings monthscustomer age i.loanpurpose i.gender
. * i.maritalstatus i.housing (checking, monthsemployed and job, none of which is
. * statistically significant in Model 1, are dropped)

.
.
. logit creditrisk savings monthscustomer age i.loanpurpose i.gender
i.maritalstatus i.housing

Iteration 0: log likelihood = -294.57696


Iteration 1: log likelihood = -259.73212
Iteration 2: log likelihood = -259.65054
Iteration 3: log likelihood = -259.65049
Iteration 4: log likelihood = -259.65049

Logistic regression Number of obs = 425


LR chi2(17) = 69.85
Prob > chi2 = 0.0000
Log likelihood = -259.65049 Pseudo R2 = 0.1186

--------------------------------------------------------------------------------
creditrisk | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
savings | -.0000523 .000031 -1.69 0.091 -.000113 8.43e-06
monthscustomer | .0491996 .0102011 4.82 0.000 .0292058 .0691934
age | -.0135457 .0106361 -1.27 0.203 -.0343921 .0073007
|
loanpurpose |
2 | .2306476 .5752784 0.40 0.688 -.8968774 1.358172
3 | .0283228 .4071357 0.07 0.945 -.7696485 .8262941
4 | 1.619859 1.280687 1.26 0.206 -.8902408 4.129959
5 | .7146552 .3995288 1.79 0.074 -.0684068 1.497717
6 | -.7673558 .7213625 -1.06 0.287 -2.1812 .6464887
7 | .3543893 1.514912 0.23 0.815 -2.614784 3.323563
8 | -.2122968 .3982476 -0.53 0.594 -.9928478 .5682542
9 | -1.228455 .5094138 -2.41 0.016 -2.226887 -.2300221
10 | .2539983 .9985979 0.25 0.799 -1.703218 2.211214
|
1.gender | .1892455 .5261844 0.36 0.719 -.842057 1.220548
|
maritalstatus |
2 | -.5019915 .6130893 -0.82 0.413 -1.703625 .6996415
3 | -.7653826 .502026 -1.52 0.127 -1.749336 .2185704
|
housing |
2 | .5846078 .286382 2.04 0.041 .0233094 1.145906
3 | .5378179 .3707556 1.45 0.147 -.1888497 1.264486
|
_cons | -.4400372 .5845148 -0.75 0.452 -1.585665 .7055908
--------------------------------------------------------------------------------
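
. * Model 2 is nested in Model 1 (it drops checking, monthsemployed and the three
. * job indicators, i.e. 22 - 17 = 5 parameters), so the two can be compared with
. * a likelihood-ratio test. The sketch below is hypothetical and was not run in
. * the original session; the names m1 and m2 are arbitrary labels. From the two
. * reported log likelihoods the statistic is 2*(259.65049 - 257.54926), about
. * 4.20 on 5 degrees of freedom, below the 5% critical value of 11.07, so the
. * dropped variables do not jointly improve the fit significantly.

```stata
* Hypothetical sketch; run from a do-file (/// continues a command line).
* Refit both models quietly and store their results as m1 and m2.
quietly logit creditrisk checking savings monthscustomer monthsemployed age ///
    i.loanpurpose i.gender i.maritalstatus i.housing i.job
estimates store m1

quietly logit creditrisk savings monthscustomer age ///
    i.loanpurpose i.gender i.maritalstatus i.housing
estimates store m2

* Likelihood-ratio test of the restricted model (m2) against the full model (m1)
lrtest m1 m2

* The same statistic and its p-value from the log likelihoods reported above
display 2*(-257.54926 - (-259.65049))
display chi2tail(5, 2*(-257.54926 - (-259.65049)))
```

. * Because m2 is fitted last, it remains the active estimation for any
. * postestimation command that follows.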

. * To check the predictive power of the estimated model, the Stata
. * postestimation command is: "estat classification"

. estat classification

Logistic model for creditrisk

-------- True --------


Classified | D ~D | Total
-----------+--------------------------+-----------
+ | 135 65 | 200
- | 76 149 | 225
-----------+--------------------------+-----------
Total | 211 214 | 425

Classified + if predicted Pr(D) >= .5


True D defined as creditrisk != 0
--------------------------------------------------
Sensitivity Pr( +| D) 63.98%
Specificity Pr( -|~D) 69.63%
Positive predictive value Pr( D| +) 67.50%
Negative predictive value Pr(~D| -) 66.22%
--------------------------------------------------
False + rate for true ~D Pr( +|~D) 30.37%
False - rate for true D Pr( -| D) 36.02%
False + rate for classified + Pr(~D| +) 32.50%
False - rate for classified - Pr( D| -) 33.78%
--------------------------------------------------
Correctly classified 66.82%
--------------------------------------------------

. * With the estimated logit model, about 67% of customers are correctly
. * classified as high or low risk (correctly classified: 66.82%). Always
. * predicting the larger class (low risk) would be right 50.35% of the time, so
. * the model has a reasonable, though far from perfect, predictive power.
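
. * The summary rates in the classification table follow directly from its four
. * cells (135, 65, 76, 149). The lines below are a hypothetical check, not part
. * of the original session; the cutoff value 0.4 is an arbitrary illustration.

```stata
* Hypothetical check; not part of the original session.
* Recompute the main rates from the cells of the classification table.
display "Sensitivity          = " %5.2f 100*135/211 "%"
display "Specificity          = " %5.2f 100*149/214 "%"
display "Correctly classified = " %5.2f 100*(135+149)/425 "%"

* With Model 2 as the active estimation, the cutoff can be changed from the
* default 0.5, and the ROC curve summarizes performance over all cutoffs.
estat classification, cutoff(0.4)
lroc
```

. * Lowering the cutoff classifies more customers as high risk, which raises
. * sensitivity at the cost of specificity.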

. * To report the odds ratios (exp(b)) for each independent variable, the stata
command is "logit depvar indvars, or". Standard errors and confidence intervals
are also transformed.
.
.
. logit creditrisk savings monthscustomer age i.loanpurpose i.gender
i.maritalstatus i.housing, or

Iteration 0: log likelihood = -294.57696


Iteration 1: log likelihood = -259.73212
Iteration 2: log likelihood = -259.65054
Iteration 3: log likelihood = -259.65049
Iteration 4: log likelihood = -259.65049

Logistic regression Number of obs = 425


LR chi2(17) = 69.85
Prob > chi2 = 0.0000
Log likelihood = -259.65049 Pseudo R2 = 0.1186

--------------------------------------------------------------------------------
creditrisk | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
savings | .9999477 .000031 -1.69 0.091 .9998871 1.000008
monthscustomer | 1.05043 .0107155 4.82 0.000 1.029636 1.071643
age | .9865457 .010493 -1.27 0.203 .9661926 1.007327
|
loanpurpose |
2 | 1.259415 .7245144 0.40 0.688 .4078412 3.889079
3 | 1.028728 .4188318 0.07 0.945 .4631758 2.284836
4 | 5.052379 6.470514 1.26 0.206 .4105569 62.17537
5 | 2.043482 .8164298 1.79 0.074 .9338805 4.47147
6 | .464239 .3348846 -1.06 0.287 .1129059 1.908826
7 | 1.42531 2.15922 0.23 0.815 .0731836 27.75908
8 | .8087246 .3220727 -0.53 0.594 .37052 1.765183
9 | .2927446 .1491281 -2.41 0.016 .1078636 .7945161
10 | 1.28917 1.287362 0.25 0.799 .1820967 9.126792
|
1.gender | 1.208338 .6358084 0.36 0.719 .4308234 3.389044
|
maritalstatus |
2 | .605324 .3711177 -0.82 0.413 .1820226 2.013031
3 | .4651559 .2335204 -1.52 0.127 .1738895 1.244297
|
housing |
2 | 1.794287 .5138515 2.04 0.041 1.023583 3.14529
3 | 1.712266 .6348324 1.45 0.147 .8279109 3.54127
|
_cons | .6440125 .3764348 -0.75 0.452 .2048115 2.025043
--------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

.
.
. * Notes: 1) for positive b, "the odds are exp(b) times larger" or "the odds
. * increase by a factor of exp(b)". 2) For negative b, exp(b) is below 1, so the
. * odds are multiplied by exp(b), that is, they decrease by 100*(1 - exp(b))
. * percent. 3) Odds ratios close to 1 indicate a small change (multiplying by
. * 1.01 or 0.99 does not change the odds much). 4) The odds of y=1 (high risk)
. * are multiplied by exp(b) for a one-unit increase in x, holding all other
. * variables constant.
. * Comments: the table above shows that 1) the odds of being ranked high risk
. * (creditrisk=1) increase by a factor of 1.05 for each additional month in
. * monthscustomer (a continuous variable); 2) the odds of being ranked high risk
. * increase by a factor of 1.8 when the customer rents a house (housing=2)
. * compared with when he owns it (housing=1).
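
. * The odds-ratio table is a transformation of the Model 2 coefficient table. The
. * lines below are a hypothetical check, not part of the original session, that
. * reproduces a few entries with "display".

```stata
* Hypothetical check; run from a do-file (// starts a trailing comment).
* Odds ratio = exp(coefficient)
display exp(.0491996)             // monthscustomer: about 1.0504
display exp(.5846078)             // 2.housing:      about 1.7943
display exp(-1.228455)            // 9.loanpurpose:  about 0.2927

* Standard error of the odds ratio (delta method) = odds ratio * se(coefficient)
display exp(.0491996)*.0102011    // about 0.0107

* Confidence limits = exp() of the coefficient's confidence limits
display exp(.0292058)             // lower limit, about 1.0296
display exp(.0691934)             // upper limit, about 1.0716
```

. * The z statistics and p-values are unchanged because the test is still carried
. * out on the coefficient scale.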

. * For coefficients that are easier to interpret, the postestimation command
. * "listcoef, percent" (from the user-written SPost commands) gives the percent
. * change in the odds for a unit increase in X and for a standard deviation
. * increase in X.

. listcoef, percent

logit (N=425): Percentage Change in Odds

Odds of: 1 vs 0

----------------------------------------------------------------------
creditrisk | b z P>|z| % %StdX SDofX
-------------+--------------------------------------------------------
savings | -0.00005 -1.688 0.091 -0.0 -17.1 3597.2850
monthscust~r | 0.04920 4.823 0.000 5.0 82.9 12.2676
age | -0.01355 -1.274 0.203 -1.3 -13.9 11.0451
2.loanpurp~e | 0.23065 0.401 0.688 25.9 5.4 0.2265
3.loanpurp~e | 0.02832 0.070 0.945 2.9 1.1 0.4005
4.loanpurp~e | 1.61986 1.265 0.206 405.2 17.0 0.0967
5.loanpurp~e | 0.71466 1.789 0.074 104.3 36.0 0.4304
6.loanpurp~e | -0.76736 -1.064 0.287 -53.6 -11.9 0.1658
7.loanpurp~e | 0.35439 0.234 0.815 42.5 2.5 0.0685
8.loanpurp~e | -0.21230 -0.533 0.594 -19.1 -8.8 0.4318
9.loanpurp~e | -1.22845 -2.412 0.016 -70.7 -30.2 0.2923
10.loanpur~e | 0.25400 0.254 0.799 28.9 3.0 0.1181
1.gender | 0.18925 0.360 0.719 20.8 9.2 0.4661
2.maritals~s | -0.50199 -0.819 0.413 -39.5 -13.1 0.2788
3.maritals~s | -0.76538 -1.525 0.127 -53.5 -31.7 0.4983
2.housing | 0.58461 2.041 0.041 79.4 25.8 0.3932
3.housing | 0.53782 1.451 0.147 71.2 19.3 0.3281
----------------------------------------------------------------------

. * Table description: 1) b = raw coefficient; 2) z = z-score for the test of b=0;
. * 3) P>|z| = p-value for the z-test; 4) % = percent change in odds for a unit
. * increase in X; 5) %StdX = percent change in odds for an SD increase in X;
. * 6) SDofX = standard deviation of X.

. * Results: 1) the odds of being ranked high risk increase by 5% for each
. * additional month in monthscustomer, holding other variables constant. 2) The
. * odds of being ranked high risk are 70.7% lower when the loan purpose is buying
. * a used car (loanpurpose=9) than when the loan is for business (loanpurpose=1,
. * the reference), holding other variables constant. 3) The odds of being ranked
. * high risk are 79.4% higher for a customer renting a house (housing=2) than for
. * a customer owning a house (housing=1, the reference), holding other variables
. * constant.
********************************************************************************
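
. * The two percentage columns of the listcoef table can be recovered from b and
. * SDofX: % = 100*(exp(b) - 1) and %StdX = 100*(exp(b*SD) - 1). The lines below
. * are a hypothetical check, not part of the original session.

```stata
* Hypothetical check; run from a do-file (// starts a trailing comment).
* Percent change in odds for a one-unit increase in X
display 100*(exp(.0491996) - 1)            // monthscustomer: about 5.0
display 100*(exp(-1.228455) - 1)           // 9.loanpurpose:  about -70.7
display 100*(exp(.5846078) - 1)            // 2.housing:      about 79.4

* Percent change in odds for a one-standard-deviation increase in X
display 100*(exp(.0491996*12.2676) - 1)    // monthscustomer: about 82.9
```

. * For a 0/1 indicator such as 2.housing the unit change (the % column) is the
. * natural reading; a standard-deviation change in an indicator is harder to
. * interpret.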
. *** Probability Prediction***

. * To predict the probability that a customer is ranked high risk, the Stata
. * command is "prvalue" (also from the user-written SPost commands). Example: for
. * a customer with the characteristics x(savings=5000 monthscustomer=28 age=30
. * loanpurpose=2 gender=1 maritalstatus=2 housing=2), the first step is to
. * estimate the model with "quietly logit creditrisk savings monthscustomer age
. * loanpurpose gender maritalstatus housing". Preceding any Stata command with
. * "quietly" suppresses its output (we do not need it here). Note that this call
. * enters the categorical variables without the i. prefix, so they are treated as
. * continuous, unlike in Model 2. The second step is to run "prvalue,
. * x(savings=5000 monthscustomer=28 age=30 loanpurpose=2 gender=1
. * maritalstatus=2 housing=2)".

. quietly logit creditrisk savings monthscustomer age loanpurpose gender
> maritalstatus housing

. * Stata does not display results...

. prvalue, x( savings=5000 monthscustomer=28 age=30 loanpurpose=2 gender=1
> maritalstatus=2 housing=2)

logit: Predictions for creditrisk

Confidence intervals by delta method

                          95% Conf. Interval
  Pr(y=1|x):     0.6512   [ 0.5224,  0.7800]
  Pr(y=0|x):     0.3488   [ 0.2200,  0.4776]

         savings monthscust~r          age  loanpurpose       gender maritalsta~s      housing
x=          5000           28           30            2            1            2            2

. * The predicted probability of being ranked high risk for a customer with these
. * characteristics is 0.6512, with 95% CI [0.5224, 0.7800].
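
. * The model fitted just before prvalue enters loanpurpose, gender, maritalstatus
. * and housing without the i. prefix, so it is not the same specification as
. * Model 2. The sketch below is a hypothetical alternative, not part of the
. * original session: it refits Model 2 with factor variables and uses Stata's
. * built-in "margins" for the same customer profile. Plugging the Model 2
. * coefficients reported earlier into the logit formula gives roughly 0.68, so
. * the result should differ somewhat from 0.6512.

```stata
* Hypothetical sketch; run from a do-file (/// continues a command line).
* Refit Model 2 with factor-variable notation.
quietly logit creditrisk savings monthscustomer age ///
    i.loanpurpose i.gender i.maritalstatus i.housing

* Predicted Pr(creditrisk=1) at the chosen profile, with a delta-method CI
margins, at(savings=5000 monthscustomer=28 age=30 loanpurpose=2 ///
    gender=1 maritalstatus=2 housing=2)

* The same prediction by hand from the Model 2 coefficients reported above
display invlogit(-.4400372 - .0000523*5000 + .0491996*28 - .0135457*30 ///
    + .2306476 + .1892455 - .5019915 + .5846078)
```

. * invlogit(z) returns exp(z)/(1 + exp(z)), the logit probability formula.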

. * To predict the probability of being ranked high risk for a customer at the mean
. * of the independent variables, the Stata command is "prvalue, rest(mean)"

. prvalue, rest(mean)

logit: Predictions for creditrisk

Confidence intervals by delta method

                          95% Conf. Interval
  Pr(y=1|x):     0.4976   [ 0.4472,  0.5479]
  Pr(y=0|x):     0.5024   [ 0.4521,  0.5528]

         savings monthscust~r          age  loanpurpose       gender maritalsta~s      housing
x=     1812.5624    22.896471    34.397647         5.24    .68235294    2.1811765    1.4352941
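. * The predicted probability of being ranked high risk is 0.4976, with 95% CI
. * [0.4472, 0.5479]. The means of the category codes (for example loanpurpose =
. * 5.24) do not correspond to any actual category, so this prediction describes a
. * purely notional customer.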
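
. * With factor-variable notation, "margins" offers two common summaries that
. * avoid averaging category codes. The sketch below is hypothetical, not part of
. * the original session, and assumes Model 2 (fitted with the i. prefixes) is the
. * active estimation.

```stata
* Hypothetical sketch; assumes Model 2 with i. prefixes is the active estimation.
* Prediction with every covariate at its sample mean; for factor variables,
* margins uses the sample proportion of each category rather than the mean code.
margins, atmeans

* Average of the predicted probabilities over the customers in the sample
margins
```

. * The first answers "what is predicted for an average customer", the second
. * "what is the average prediction across customers"; in a nonlinear model such
. * as the logit the two generally differ.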

.
********************************************************************************

. log close

--------------------------------------------------------------------------------
