RCode Group 4
German Credit
Group 4
Major: Logistics and Supply Chain Management
Course: Applied Statistics
Members:
Trần Minh Thư 20233212
Nguyễn Trần Vân Anh 20233119
Nguyễn Quý Trang 20233221
Hoàng Mạnh Dũng 20233143
Nguyễn Trọng Thành Khôi 20233173
Applied Statistics Project: German Credit
Group project
2024-1-10
1. Introduction
Understanding the factors influencing the credit amount offered to borrowers is crucial for lending institutions to make
sound financial decisions. The approved credit amount is shaped by various factors, including a borrower’s financial
stability, employment status, and demographic profile. A comprehensive analysis of these variables can help lenders
refine risk assessment models and tailor credit products more effectively.
This study examines the primary factors that affect the credit amount provided to borrowers. By examining these
variables, the research aims to show how lending decisions are formed and how they can be optimized.
Objective
The main objective of this project is to analyze and predict the credit amount allocated to borrowers, which serves as
the dependent variable in this study. The research focuses on identifying and quantifying the relationships between
the credit amount and several independent variables within the dataset.
2. Data description
Independent Variables The independent variables that may influence the credit amount include:
1. Duration of Credit (month): (quantitative data) The length of the credit repayment period.
2. Purpose: (qualitative data) The reason for obtaining the credit, with eleven levels (0-10).
3. Instalment per cent: (qualitative data) Coded in four levels (1-4).
4. Guarantors: (qualitative data) The presence or absence of guarantors for the credit (1 - No; 2 - Assistant; 3 - Guarantor).
5. Length of current employment: (qualitative data) An ordinal variable with 5 levels (1-5).
6. Sex & Marital Status: (qualitative data) The gender and marital status of the borrower.
7. Age (years): (quantitative data) The borrower's age, which may indicate earning potential or financial stability.
8. Occupation: (qualitative data) The type of work the borrower engages in, which reflects income level and stability.
9. Number of dependents: (quantitative data) The number of people financially dependent on the borrower.
10. Creditability: (qualitative data) Whether the borrower is considered creditworthy (0 - Non_Creditability; 1 - Creditability).
Variable selection Before describing the data, we remove Purpose, Guarantors, Length of current employment, Sex
& Marital Status, and Number of dependents because they are very general and are not expected to have enough impact
on Credit Amount to be worth analysing. The following variables are kept for the analysis:
1. Creditability (C)
2. Duration of Credit (DOC)
3. Instalment per cent (IPC)
4. Age (years) (A)
5. Occupation (O)
6. Credit Amount (CA)
After removing the variables listed above, the new dataset looks as follows.
## # A tibble: 6 x 6
##       C   DOC   IPC     A     O    CA
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1    18     4    21     3  1049
## 2     1     9     2    36     3  2799
## 3     1    12     2    23     2   841
## 4     1    12     3    39     2  2122
## 5     1    12     4    38     2  2171
## 6     1    10     1    48     2  2241
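The selection and renaming step that leads to this table can be sketched in base R. This is only an illustration under stated assumptions: the data frame `raw` is rebuilt from the six rows printed above rather than read from the original file, and the long column names `Age..years.` and `Occupation` are assumed by analogy with the names quoted in the conclusion.

```r
# Six rows as printed in the report, stored under the long column names.
# Assumption: "Age..years." and "Occupation" follow the naming of the other columns.
raw <- data.frame(
  Creditability              = c(1, 1, 1, 1, 1, 1),
  Duration.of.Credit..month. = c(18, 9, 12, 12, 12, 10),
  Instalment.per.cent        = c(4, 2, 2, 3, 4, 1),
  Age..years.                = c(21, 36, 23, 39, 38, 48),
  Occupation                 = c(3, 3, 2, 2, 2, 2),
  Credit.Amount              = c(1049, 2799, 841, 2122, 2171, 2241)
)

# Map each short code used in the report to its original column name
keep <- c(C = "Creditability", DOC = "Duration.of.Credit..month.",
          IPC = "Instalment.per.cent", A = "Age..years.",
          O = "Occupation", CA = "Credit.Amount")

# Keep only the selected columns and rename them to the short codes
credit <- raw[, unname(keep)]
names(credit) <- names(keep)
head(credit)
```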
Data description
##        C            DOC            IPC              A               O
##  Min.   :0.0   Min.   : 4.0   Min.   :1.000   Min.   :19.00   Min.   :1.000
##  1st Qu.:0.0   1st Qu.:12.0   1st Qu.:2.000   1st Qu.:27.00   1st Qu.:3.000
##  Median :1.0   Median :18.0   Median :3.000   Median :33.00   Median :3.000
##  Mean   :0.7   Mean   :20.9   Mean   :2.973   Mean   :35.54   Mean   :2.904
##  3rd Qu.:1.0   3rd Qu.:24.0   3rd Qu.:4.000   3rd Qu.:42.00   3rd Qu.:3.000
##  Max.   :1.0   Max.   :72.0   Max.   :4.000   Max.   :75.00   Max.   :4.000
##        CA
##  Min.   :  250
##  1st Qu.: 1366
##  Median : 2320
##  Mean   : 3271
##  3rd Qu.: 3972
##  Max.   :18424
Data description
For each variable, we use the functions summary(), hist() and boxplot() to analyse and visualize the data.
1. Creditability This is a binary variable, so measures of central tendency and dispersion are not meaningful. For
that reason the authors do not use summary() to analyse it; the functions table() and hist() are used
instead.
## C
## 0 1
## 300 700
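For a 0/1 variable the counts are easier to read as shares. A minimal sketch in R, with the counts copied from the table() output above (the names `c_counts` and `c_shares` are illustrative):

```r
# Counts of Creditability copied from the table() output
c_counts <- c("0" = 300, "1" = 700)

# Share of each level; for a 0/1 variable the share of 1 equals the mean
c_shares <- c_counts / sum(c_counts)
c_shares
```

The share of creditable borrowers is 0.7, which agrees with the mean of C shown in the summary() output.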
[Figure: Histogram of C (y-axis: Frequency)]
2. Duration_of_Credit (month) This is a continuous numeric variable, so the function summary() is
used.
[Figure: Histogram of DOC (x-axis: DOC, y-axis: Frequency)]
3. Instalment per cent This variable takes only four integer values (1-4), so the function table() is used to count the observations at each level.
## IPC
## 1 2 3 4
## 136 231 157 476
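A frequency table holds enough information to recover the mean. As a small consistency check in R, the counts above can be compared with the mean of IPC reported by summary() (the names `ipc_counts`, `ipc_levels` and `ipc_mean` are illustrative):

```r
# Counts per level of IPC copied from the table() output
ipc_counts <- c("1" = 136, "2" = 231, "3" = 157, "4" = 476)
ipc_levels <- as.numeric(names(ipc_counts))

# Weighted mean of the levels, using the counts as weights
ipc_mean <- sum(ipc_levels * ipc_counts) / sum(ipc_counts)
ipc_mean
```

The result, 2.973, matches the mean in the summary() output, and the counts add up to the 1000 observations of the dataset.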
[Figure: Histogram of IPC (x-axis: IPC, y-axis: Frequency)]
4. Age(years)
Most borrowers are young, and none is younger than 19, presumably because the law only allows people who meet the
age requirement to take out a loan.
[Figure: Histogram of A (x-axis: A, y-axis: Frequency)]
5. Occupation
This is qualitative data with 4 levels, where the majority of observations are concentrated at level 3.
[Figure: Histogram of O (x-axis: O, y-axis: Frequency)]
6. Credit_Amount This variable is continuous, so we summarize it with the function summary().
The authors use the function hist() to visualize the data.
[Figure: Histogram of CA (x-axis: CA, y-axis: Frequency)]
3. Simple regression
In this part, we will consider the relationship between the variables pairwise and investigate some simple
regressions when it makes sense.
We will use pairs(), cor() and summary(lm()) to have a general view about the relationship between variables.
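The cor() and summary(lm()) steps can be sketched on the six rows printed in the data description. This is only an illustration: with six observations the numbers will not match the full 1000-row results reported in this section, C is left out because it is constant in these rows, and the data frame name `credit_head` is an assumption.

```r
# Six rows from the dataset preview (C omitted: it is constant here,
# so its correlation with the other variables would be undefined)
credit_head <- data.frame(
  DOC = c(18, 9, 12, 12, 12, 10),
  IPC = c(4, 2, 2, 3, 4, 1),
  A   = c(21, 36, 23, 39, 38, 48),
  O   = c(3, 3, 2, 2, 2, 2),
  CA  = c(1049, 2799, 841, 2122, 2171, 2241)
)

# Pairwise Pearson correlations between all columns
cor_matrix <- cor(credit_head)
round(cor_matrix, 3)

# Simple linear regression of Credit Amount on Duration of Credit
fit <- lm(CA ~ DOC, data = credit_head)
summary(fit)
```

Calling pairs(credit_head) on the same data frame draws the corresponding scatterplot matrix.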
[Figure: Scatterplot matrix of C, DOC, IPC, A, O and CA]
##               C         DOC         IPC           A           O          CA
## C    1.00000000 -0.21492667 -0.07240394  0.09127195 -0.03273500 -0.15474015
## DOC -0.21492667  1.00000000  0.07474882 -0.03754986  0.21090973  0.62498846
## IPC -0.07240394  0.07474882  1.00000000  0.05727075  0.09775539 -0.27132228
## A    0.09127195 -0.03754986  0.05727075  1.00000000  0.01538303  0.03227268
## O   -0.03273500  0.21090973  0.09775539  0.01538303  1.00000000  0.28539307
## CA  -0.15474015  0.62498846 -0.27132228  0.03227268  0.28539307  1.00000000
The charts and correlations indicate several potential simple linear relationships. Specifically, there are noticeable
correlations between Creditability and Credit Amount (-0.155), Duration of Credit and Credit Amount (0.625),
Instalment Percent and Credit Amount (-0.271), and Occupation and Credit Amount (0.285).
Factors such as experience and the number of years working in the current job are excluded due to their lack of
analytical value. Additionally, the correlation between Age and Credit Amount is minimal (0.0323), so Age is not
examined in a simple regression.
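For a regression with a single predictor, the Multiple R-squared equals the squared correlation between predictor and response, so the CA column of the correlation matrix already indicates how much variation each predictor can explain on its own. A minimal sketch in R, with the correlations copied from the matrix (the vector name `r_with_ca` is illustrative):

```r
# Correlations with CA, copied from the cor() output
r_with_ca <- c(C = -0.15474015, DOC = 0.62498846,
               IPC = -0.27132228, A = 0.03227268, O = 0.28539307)

# In a one-predictor regression, R-squared is the squared correlation
r_squared <- r_with_ca^2
signif(r_squared, 4)
```

DOC reaches about 0.39, while every other predictor stays below 0.09 on its own.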
$$\widehat{\text{Credit Amount}} = \beta_0 + \beta_1 \times \text{Duration of Credit}$$
[Figure: plot of CA against DOC]
## [1] 0.6249885
##
## Call:
## lm(formula = CA ~ DOC)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5151.7 -1260.0  -432.9   653.2 13805.0
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  213.169    139.569   1.527    0.127
## DOC          146.299      5.784  25.292   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2205 on 998 degrees of freedom
## Multiple R-squared:  0.3906, Adjusted R-squared:  0.39
## F-statistic: 639.7 on 1 and 998 DF,  p-value: < 2.2e-16
$$\widehat{\text{Credit Amount}} = 213.169 + 146.299 \times \text{Duration of Credit}$$
The Multiple R-squared equals 0.3906, i.e. 39.06% of the variation in Credit Amount can be explained by
the variability in Duration of Credit.
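To make the fitted line concrete, it can be evaluated for a few credit durations. The sketch below is only an illustration in R; the function name `predict_ca_from_doc` is hypothetical and the coefficients are the rounded estimates from the output above.

```r
# Fitted simple regression: CA-hat = 213.169 + 146.299 * DOC
predict_ca_from_doc <- function(doc_months) {
  213.169 + 146.299 * doc_months
}

# Predicted credit amounts for 12-month and 24-month credits
predict_ca_from_doc(c(12, 24))
```

A 12-month credit is predicted at about 1969 and a 24-month credit at about 3724; each additional month adds the slope of 146.299 to the predicted amount.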
[Figure: plot of CA against DOC]
$$\widehat{\text{Credit Amount}} = \beta_0 + \beta_1 \times \text{Creditability}$$
[Figure: Credit Amount Distribution by Creditability (x-axis: Creditability, No/Yes; y-axis: Credit Amount)]
## [1] -0.1547401
##
## Call:
## lm(formula = CA ~ C)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3505.1 -1765.6  -858.4   771.8 14485.9
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3938.1      161.1  24.447  < 2e-16 ***
## C             -952.7      192.5  -4.948  8.8e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2790 on 998 degrees of freedom
## Multiple R-squared:  0.02394, Adjusted R-squared:  0.02297
## F-statistic: 24.48 on 1 and 998 DF,  p-value: 8.795e-07
The Multiple R-squared equals 0.02394, i.e. only 2.394% of the variation in Credit Amount can be explained by
the variability in Creditability. -> Although the slope is statistically significant, Creditability alone has very little explanatory power.
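With a binary predictor, the regression reduces to a comparison of group means: the intercept is the mean Credit Amount for C = 0, and the intercept plus the slope is the mean for C = 1. The sketch below, in R, uses the rounded coefficients from the output and the group sizes from table(C); all variable names are illustrative.

```r
# Rounded coefficients of lm(CA ~ C)
intercept <- 3938.1
slope_c   <- -952.7

# Group means implied by the model
mean_non_creditable <- intercept            # C = 0
mean_creditable     <- intercept + slope_c  # C = 1

# Group sizes from table(C)
n_c <- c("0" = 300, "1" = 700)

# Weighted average of the two group means
overall_mean <- (n_c[["0"]] * mean_non_creditable +
                 n_c[["1"]] * mean_creditable) / sum(n_c)
c(mean_non_creditable, mean_creditable, overall_mean)
```

The weighted average is about 3271, in line with the mean of CA in the summary() output, so non-creditable borrowers hold credits that are on average about 953 larger.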
[Figure: plot of CA]
$$\widehat{\text{Credit Amount}} = \beta_0 + \beta_1 \times \text{Instalment per cent}$$
[Figure: plot of CA against IPC]
## [1] -0.2713223
##
## Call:
## lm(formula = CA ~ IPC)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4021.0 -1659.6  -854.5   788.9 13802.0
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5306.57     244.18  21.732   <2e-16 ***
## IPC          -684.60      76.87  -8.905   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2718 on 998 degrees of freedom
## Multiple R-squared:  0.07362, Adjusted R-squared:  0.07269
## F-statistic: 79.31 on 1 and 998 DF,  p-value: < 2.2e-16
The model given is:
$$\widehat{\text{Credit Amount}} = 5306.57 - 684.60 \times \text{Instalment per cent}$$
The Multiple R-squared equals 0.07362, i.e. only 7.362% of the variation in Credit Amount can be explained by
the variability in Instalment per cent. -> Although the slope is statistically significant, Instalment per cent alone has little explanatory power.
[Figure: plot of CA against IPC]
$$\widehat{\text{Credit Amount}} = \beta_0 + \beta_1 \times \text{Occupation}$$
[Figure: plot of CA]
## [1] 0.2853931
##
## Call:
## lm(formula = CA ~ O)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3993.1 -1851.8  -777.6   776.4 13801.9
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -308        390  -0.790     0.43
## O               1232        131   9.407   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2707 on 998 degrees of freedom
## Multiple R-squared:  0.08145, Adjusted R-squared:  0.08053
## F-statistic: 88.49 on 1 and 998 DF,  p-value: < 2.2e-16
The Multiple R-squared equals 0.08145, i.e. only 8.145% of the variation in Credit Amount can be explained by
the variability in Occupation. -> Although the slope is statistically significant, Occupation alone has little explanatory power.
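In each coefficient table, the t value is the estimate divided by its standard error. The short R sketch below recomputes the slope t values of the four simple regressions from the printed estimates; because those estimates are rounded, the recomputed values agree with the reported ones only approximately. The data frame `slopes` is just an illustrative container.

```r
# Slope estimates, standard errors and reported t values
# copied from the four summary(lm()) outputs
slopes <- data.frame(
  predictor  = c("DOC", "C", "IPC", "O"),
  estimate   = c(146.299, -952.7, -684.60, 1232),
  std_error  = c(5.784, 192.5, 76.87, 131),
  t_reported = c(25.292, -4.948, -8.905, 9.407)
)

# t value = estimate / standard error
slopes$t_recomputed <- slopes$estimate / slopes$std_error
slopes
```

All four slopes lie far more than two standard errors from zero, which is why each is flagged as significant even though three of the models explain less than 10% of the variation.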
[Figure: plot of CA]
4. Multiple regression
We fit a linear model that explains Credit Amount, the response, with the predictors C, DOC, IPC, A and O:
$$\widehat{\text{Credit Amount}} = \beta_0 + \beta_1 \times \text{C} + \beta_2 \times \text{DOC} + \beta_3 \times \text{IPC} + \beta_4 \times \text{A} + \beta_5 \times \text{O}$$
##
## Call:
## lm(formula = CA ~ C + DOC + IPC + A + O)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5805.4 -1096.3  -230.2   665.3 13252.0
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   69.478    377.685   0.184 0.854084
## C           -312.785    137.292  -2.278 0.022923 *
## DOC          141.084      5.313  26.553  < 2e-16 ***
## IPC         -865.190     55.192 -15.676  < 2e-16 ***
## A             18.965      5.420   3.499 0.000487 ***
## O            816.053     96.034   8.498  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1932 on 994 degrees of freedom
## Multiple R-squared:  0.534, Adjusted R-squared:  0.5316
## F-statistic: 227.8 on 5 and 994 DF,  p-value: < 2.2e-16
$$\widehat{\text{Credit Amount}} = 69.478 - 312.785 \times \text{C} + 141.084 \times \text{DOC} - 865.190 \times \text{IPC} + 18.965 \times \text{A} + 816.053 \times \text{O}$$
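As a worked example, the fitted equation can be applied to the first borrower in the dataset preview (C = 1, DOC = 18, IPC = 4, A = 21, O = 3, observed CA = 1049). This is a minimal sketch in R; the names `coefs`, `borrower` and `predicted_ca` are illustrative and not part of the original code.

```r
# Rounded coefficients of lm(CA ~ C + DOC + IPC + A + O)
coefs <- c(Intercept = 69.478, C = -312.785, DOC = 141.084,
           IPC = -865.190, A = 18.965, O = 816.053)

# First row of the dataset preview; the leading 1 multiplies the intercept
borrower <- c(Intercept = 1, C = 1, DOC = 18, IPC = 4, A = 21, O = 3)

# Prediction = sum of coefficient * value over all terms
predicted_ca <- sum(coefs * borrower)

# Residual = observed minus predicted credit amount
residual_ca <- 1049 - predicted_ca
c(predicted = predicted_ca, residual = residual_ca)
```

The prediction is about 1682, so the model overestimates this borrower's credit amount by roughly 633.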
All five independent variables (C, DOC, IPC, A, O) are statistically significant, with p-values below 0.05; only the intercept is not (p = 0.854).
The R^2 = 0.534 value suggests a moderately good model fit: about 53.4% of the variance in CA is explained by the
independent variables in the model.
-> The model captures important determinants of Credit Amount with a moderate level of explanatory power.
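The summary figures printed by summary(lm()) are linked by standard formulas, which allows a quick consistency check. The R sketch below recomputes the adjusted R-squared and the F-statistic from the rounded R-squared of 0.534, with n = 1000 observations and p = 5 predictors; because the input is rounded, the results agree with the printed output only approximately.

```r
r2 <- 0.534   # Multiple R-squared (rounded, from the output)
n  <- 1000    # number of observations (994 residual df + 6 coefficients)
p  <- 5       # number of predictors

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# F-statistic for the overall significance of the regression
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

round(c(adj_r2 = adj_r2, f_stat = f_stat), 4)
```

Both values come out close to the reported 0.5316 and 227.8.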
5. Conclusion
1. Correlation Analysis:
The correlation matrix revealed one strong pairwise relationship among the numeric variables:
Duration.of.Credit..month. and Credit.Amount showed a positive correlation (0.625), indicating that longer credit durations are
often associated with higher credit amounts.
The visualization helped identify weaker dependencies among variables, such as the correlation of -0.271 between Instalment.per.cent and Credit.Amount.
2. Visualization Insights:
The histograms revealed the distribution of key variables like Creditability and Occupation. Most credits were considered
reliable (700 of 1000 observations have Creditability = 1), and Occupation level 3 dominated the dataset.
Strengths:
Exploratory Analysis: Thorough statistical and visual exploration of variables provided valuable insights into distributions
and relationships.
Regression Model: The use of a multiple regression model allowed for understanding the contribution of key predictors to
credit amount, supported by statistically significant coefficients.
Correlation Visualization: The correlation plot provided a comprehensive overview of variable relationships.
Limitations:
Data Structure and Preprocessing:
• The dataset includes categorical variables like Purpose and Sex...Marital.Status, which were not fully utilized in
regression modeling.
• Potential outliers in variables like Credit.Amount were not addressed, possibly affecting model accuracy.
Model Performance:
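• Although the model explains 53.4% of the variance, the unexplained remainder suggests that other predictors not included in the analysis may
play a significant role in determining credit amount.
• The residual summaries are strongly right-skewed (in the multiple regression the largest residual is 13252.0 against a smallest of -5805.4), which points to potential non-linearity or heteroscedasticity not accounted for.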
Feature Selection: The regression only used five predictors. A broader feature set, including categorical variables
converted to dummy variables, could enhance the model's explanatory power.
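The dummy-variable conversion mentioned here can be sketched in R by wrapping a coded variable in factor(). The example is an illustration only: it uses the Occupation and Credit Amount values of the six rows from the dataset preview, so only levels 2 and 3 occur, and the object names are assumptions.

```r
# Occupation and Credit Amount from the six preview rows
preview <- data.frame(
  O  = c(3, 3, 2, 2, 2, 2),
  CA = c(1049, 2799, 841, 2122, 2171, 2241)
)

# Treat the occupation code as a category instead of a number
preview$O_factor <- factor(preview$O)

# model.matrix() shows the dummy columns that lm() would create:
# the first level (2) is the baseline, level 3 gets its own 0/1 column
dummies <- model.matrix(~ O_factor, data = preview)
colnames(dummies)

# The same regression with Occupation entered as dummy variables
fit_dummy <- lm(CA ~ O_factor, data = preview)
coef(fit_dummy)
```

With the full dataset, each non-baseline level of a categorical variable such as Occupation would receive its own coefficient instead of one common slope.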