
Banking Project

(Capstone Project – Final Report)


DSBA

By:
E. AuroRajashri
List of Contents
1) Introduction of the business problem
1.1 Defining problem statement
1.2 Need of the study/project
1.3 Understanding business/social opportunity

2) Data Report
2.1 Understanding how data was collected in terms of time, frequency and methodology
2.2 Visual inspection of data (rows, columns, descriptive details)
2.3 Understanding of attributes (variable info, renaming if required)

3) Exploratory data analysis
3.1 Univariate analysis (distribution and spread for every continuous attribute, distribution of data in categories for categorical ones)
3.2 Bivariate analysis (relationship between different variables, correlations)
3.3 Removal of unwanted variables (if applicable)
3.4 Missing value treatment (if applicable)
3.5 Outlier treatment (if required)
3.6 Variable transformation (if applicable)
3.7 Addition of new variables (if required)

4) Business insights from EDA
4.1 Is the data unbalanced? If so, what can be done? Please explain in the context of the business
4.2 Any business insights using clustering (if applicable)
4.3 Any other business insights

5) Model building and interpretation
5.1 Build various models
5.2 Test your predictive model against the test set using various performance metrics
5.3 Interpretation of the model(s)

6) Model Tuning
6.1 Ensemble modelling, wherever applicable
6.2 Any other model tuning measures (if applicable)
6.3 Interpretation of the most optimum model and its implication on the business

List of Tables
2.2 Descriptive Statistics

2.3 Data Info

3.3.1 Removed Userid variable from the data frame

3.3.2 Removed Name in email variable from the data frame

3.4.1 Percentage of missing values per column

3.4.2 Post dropping columns at the 25% threshold

3.4.3 Post Imputation - Missing values

3.6.1 One-hot encoding

4.2.1 Post scaling treatment

4.2.2 Inertia of various n_clusters

4.2.4 Final dataset post clustering

List of Figures
3.1.1 Histogram of age

3.1.2 Histogram of Time_hours

3.1.3 Number of Defaulters and Non-defaulters

3.1.4 Top 10 Merchant Categories

3.1.5 Top 10 Merchant groups

3.1.6 Histogram of all numerical variables

3.2.1 Average Account Amount Added (12-24m) by Default status

3.2.2 Distribution of Max paid invoice(0-12m) by Default status

3.2.3 Violin plot: Age distribution by default status

3.2.4 Heat Map -Correlation

3.5.1 Outliers using box plot


3.5.2 Post Outliers Treatment

4.2.3 Elbow graph

5.1 Train and test data

5.2.1 Accuracy score – Random Forest

5.2.2 Confusion Matrix – Random Forest

5.2.3 Classification report – Random Forest

5.2.4 ROC Curve – Random Forest

5.2.5 Accuracy score – DTC

5.2.6 Confusion Matrix – DTC

5.2.7 Classification report – DTC

5.2.8 ROC curve – DTC

5.2.9 Accuracy score – NBC

5.2.10 Confusion Matrix – NBC

5.2.11 Classification report – NBC

5.2.12 ROC Curve – NBC

5.2.13 Accuracy score – SVM

5.2.14 Confusion matrix – SVM

5.2.15 Classification report – SVM

5.2.16 ROC Curve– SVM

6.1.1 Accuracy score – Bagging

6.1.2 Confusion matrix – Bagging

6.1.3 Classification report – Bagging

6.1.4 ROC Curve – Bagging

6.1.5 Accuracy score – Ada boosting

6.1.6 Confusion matrix – Ada boosting

6.1.7 Classification report – Ada boosting

6.1.8 ROC Curve – Ada boosting

6.1.9 Accuracy score – Gradient boosting

6.1.10 Confusion matrix – Gradient boosting

6.1.11 Classification report – Gradient boosting

6.1.12 ROC Curve – Gradient boosting

6.1.13 Performance metrics of models

6.2.1 Accuracy score – Randomized search cv using RFC

6.2.2 Confusion matrix – Randomized search cv using RFC

6.2.3 Classification report – Randomized search cv using RFC

6.2.4 ROC Curve – Randomized search cv using RFC


6.2.5 Accuracy score – Randomized search cv using DTC

6.2.6 Confusion matrix – Randomized search cv using DTC

6.2.7 Classification report – Randomized search cv using DTC

6.2.8 ROC Curve – Randomized search cv using DTC

6.2.9 Accuracy score – Randomized search cv using NB

6.2.10 Confusion matrix – Randomized search cv using NB

6.2.11 Classification report– Randomized search cv using NB

6.2.12 ROC Curve – Randomized search cv using NB

6.2.13 Accuracy score – Grid search cv using DTC

6.2.14 Confusion matrix – Grid search cv using DTC

6.2.15 Classification report – Grid search cv using DTC

6.2.16 ROC curve – Grid search cv using DTC

6.2.17 Accuracy score – Grid search cv using NB

6.2.18 Confusion matrix– Grid search cv using NB

6.2.19 Classification report – Grid search cv using NB

6.2.20 ROC Curve – Grid search cv using NB

6.2.21 Performance metrics of all models

6.3.1 Top 10 feature importances


1. Introduction of the business problem
1.1 Defining problem statement
Problem Statement: This business problem is a supervised
learning example for a credit card company. The objective is to
predict the probability of default (whether the customer will pay
the credit card bill or not) based on the variables provided. There
are multiple variables on the credit card account, purchase and
delinquency information which can be used in the modelling.
PD modelling problems are meant for understanding the riskiness
of customers and how much credit is at stake in case a
customer defaults. This is an extremely critical function in any
organization that lends money (both secured and unsecured
loans).
 The objective of this project is to develop a predictive
model that estimates the probability of default for credit
card customers. This involves using the provided dataset,
which contains various variables related to credit card
accounts, purchases, and delinquency information, to
understand the riskiness of customers.
 By accurately predicting the likelihood of default, the credit
card company can better assess the credit risk associated
with each customer and make informed decisions regarding
credit limits, interest rates, and other lending terms. This is
crucial for minimizing potential losses and managing the
overall credit risk portfolio of the organization.

1.2 Need of the study/project


The need for this study or project arises from the critical role that
predicting the probability of default (PD) plays in the financial
industry, particularly for credit card companies and other lending
institutions. Here are some key reasons why this study is essential:
1. Risk Management: Understanding the riskiness of customers is
crucial for managing the overall risk portfolio of a lending
institution. By predicting the likelihood of default, companies can
make informed decisions about whom to lend to and under what
terms.
2. Credit Allocation: Accurate PD models help in determining the
appropriate amount of credit to extend to each customer. This
ensures that credit is allocated efficiently, maximizing returns while
minimizing risk.
3. Loss Mitigation: By identifying high-risk customers, companies can
take proactive measures to mitigate potential losses. This might
include adjusting credit limits, changing interest rates, or
implementing stricter repayment terms.
4. Regulatory Compliance: Financial institutions are often required to
maintain certain levels of capital reserves based on the riskiness of
their loan portfolios. Accurate PD models help in meeting these
regulatory requirements by providing a clear picture of potential
defaults.
5. Profitability: By minimizing defaults and optimizing credit
allocation, companies can improve their profitability. This is
achieved by reducing bad debt expenses and increasing the overall
efficiency of the lending process.
6. Customer Relationship Management: Understanding customer
behavior and risk profiles allows companies to tailor their products
and services to meet the needs of different customer segments,
enhancing customer satisfaction and loyalty.
7. Strategic Planning: PD models provide valuable insights that can
inform strategic decisions, such as entering new markets,
developing new products, or adjusting business models to better
align with customer risk profiles.
Overall, this study is essential for enhancing the financial stability and
operational efficiency of lending institutions, ultimately contributing to
their long-term success.

1.3 Understanding business/social opportunity


 The business and social opportunities of this project are
significant, as it not only enhances the financial stability and
profitability of lending institutions but also promotes financial
inclusion and economic growth.
 By accurately predicting the probability of default, credit card
companies can extend credit more responsibly, reaching a
broader range of customers, including those who may have been
previously underserved.
 This can lead to reduced financial risks, lower costs of credit,
and improved customer satisfaction, ultimately contributing to
economic stability and growth.
 Additionally, by understanding customer risk profiles, companies
can tailor their products and services to better meet the needs
of different customer segments, enhancing customer
relationships and loyalty.
 Recent example: Yes Bank, once one of India's fastest-growing
private sector banks, faced a severe crisis in 2020 due to its
inability to manage credit risk effectively.
 Recent example: In 2019, SBI implemented an AI-powered
credit scoring system to assess loan applications and predict the
probability of default. This system has helped SBI better manage
its non-performing assets (NPAs) by more accurately predicting
which borrowers are likely to default

2. Data Report
2.1 Understanding how data was collected in terms of
time, frequency and methodology
 The data was provided by a credit card company and describes its
customers' credit activity and default information.
 There are 99,979 customers, each described by 36 variables.

2.2 Visual inspection of data (rows, columns,


descriptive details)
 The data has 99,979 rows and 36 columns, as shown in the result
below.
 Of the 36 variables provided, 3 are discrete and 33 are continuous.
 There are 700,141 missing cells.
 Default is the dependent variable, which captures whether the
customer has defaulted.
2.2 Descriptive Statistics

 Most users do not have severe debt problems, as


indicated by the low average values for
acct_days_in_dc_12_24m, acct_days_in_rem_12_24m, and
other delinquency-related metrics.
 The higher values of the worst status in the past
(compared to recent months) suggest that either users are
improving their financial behaviour, or perhaps recent
data is still too new to reflect long-term issues.
 acct_days_in_dc_12_24m and
acct_days_in_rem_12_24m: These features have low
mean values (3.75 and 1.58, respectively), which suggests
that most users are not spending much time in debt
collection or remediation
 sum_capital_paid_acct_0_12m (mean of 351): This
measure of capital paid over the past 12 months is much
lower than the sum of invoices, indicating that many
payments may be focused on interest or smaller amounts,
with fewer users paying off large amounts of principal.
 The average age of users is around 42 years, with most
users between 34 (25th percentile) and 50 (75th
percentile). This is a relatively mature population, likely
implying they have had time to accumulate financial
responsibilities (e.g., mortgages, loans). The low minimum
of 18 and the high maximum of 75 indicate a broad age
range, which could suggest different behavioural patterns
based on life stage (e.g., younger users might be less
financially stable).
2.3 Understanding of attributes (variable info,
renaming if required)
 It has 33 numerical variables and 3 categorical variables

2.3 Data Info

 The types of variables present in the data are:

1. Demographic variables: userid, age, name_in_email
2. Loan variables: acct_amt_added_12_24m, acct_days_in_dc_12_24m, acct_days_in_rem_12_24m, acct_days_in_term_12_24m, acct_incoming_debt_vs_paid_0_24m, acct_status, has_paid, max_paid_inv_0_12m, max_paid_inv_0_24m, num_active_inv, recovery_debt, sum_capital_paid_acct_0_12m, sum_capital_paid_acct_12_24m, sum_paid_inv_0_12m
3. Credit variables: default, acct_worst_status_0_3m, acct_worst_status_12_24m, acct_worst_status_3_6m, acct_worst_status_6_12m, avg_payment_span_0_12m, avg_payment_span_0_3m, merchant_category, merchant_group, num_active_div_by_paid_inv_0_12m, num_arch_dc_0_12m, num_arch_dc_12_24m, num_arch_ok_0_12m, num_arch_ok_12_24m, num_arch_rem_0_12m, status_max_archived_0_6_months, status_max_archived_0_12_months, status_max_archived_0_24_month.

3. Exploratory data analysis


3.1 Univariate analysis
3.1.1 Histogram of age

This histogram shows the distribution of age in the dataset. We can


observe that:
 The age distribution is right-skewed, with most customers falling in
the range of 25-45 years old.
 The peak of the distribution is around 30-35 years old.
 There are fewer customers in the older age ranges (above 60).

3.1.2 Histogram of Time_hours


 The data shows a higher frequency of occurrences around the 15
to 21-hour mark. This suggests that most of the recorded times fall
within this range.
 The distribution appears to be left-skewed, with more values
concentrated in the later hours (past 10 hours) and fewer in the
earlier hours.

3.1.3 Number of Defaulters and Non-defaulters

The dataset is highly imbalanced in terms of default status:


 88,688 customers did not default (98.57%)

 1,288 customers defaulted (1.43%)


3.1.4 Top 10 Merchant Categories

Key insights:
 Concentration of Transactions: The Direct selling establishments
category dominates with the highest count, nearly 40,000, far
exceeding the other categories. This indicates a large number of
transactions or significant activity in this category.
 Moderate Activity: Categories like Books & Magazines and Youthful
Shoes & Clothing have moderate counts (around 10,000–15,000),
showing significant but not overwhelming activity compared to the
leader.
 Low Activity Categories: Categories like Dietary Supplements,
Prints & Photos, and Diversified electronics have much lower
counts (under 10,000). These are niche categories with fewer
transactions.
 Category Variety: The top 10 categories represent a broad range of
industries, including electronics, apparel, outdoor gear, books, and
general merchandise. This indicates diverse customer interests.
3.1.5 Top 10 Merchant groups

The bar chart shows the top 10 merchant groups and the count of
transactions or occurrences associated with each group. Here's a
breakdown of the insights:
 Entertainment is by far the dominant category, with significantly
more counts (around 50,000) than the other categories. This
suggests that consumers engage with or spend more in this group.
 Clothing & Shoes follows as the second-highest group, though it's
much lower than Entertainment.
 The groups with the lowest counts are Jewelry & Accessories,
Home & Garden, Intangible Products, and Automotive Products.
 The distribution shows that spending or transaction volume is
concentrated heavily in Entertainment, with other categories
having relatively smaller but still notable volumes.
3.1.6 Histogram of all numerical variables

 Many variables, such as acct_worst_status_0_24m,


acct_worst_status_1_24m, and num_active_rev_tl, show high
frequencies at zero or low values with a steep decline as the values
increase. This suggests that most data points fall in the lower
range, with fewer high values.
 Variables like sum_capital_paid_account_0_12m and num_active_tl
also show extreme right-skewness, where the majority of data
points are concentrated at lower values.
 Some histograms, like time_hours, show a bimodal distribution with
significant peaks around certain values, possibly indicating two
common time ranges in the data.
 Many variables, like num_tl_90g_dpd_24m, num_actv_bc_tl, and
max_bal_bc, have a significant concentration of values near zero,
indicating that for these variables, the majority of the data points
reflect minimal activity or involvement (e.g., low number of
transactions or minimal balance).
 In many histograms (e.g., recovery_label,
sum_capital_paid_account_0_12m), there are long tails indicating
the presence of outliers or extreme values. This implies that there
are a few cases where the values are much higher than the rest of
the data.

3.2 Bivariate analysis

3.2.1 Average Account Amount Added (12-24m) by Default status

This barplot compares the average account amount added in the last 12-
24 months for customers who defaulted (1) versus those who didn't (0).
We can see that:
 Customers who defaulted (1) tend to have a higher average
account amount added compared to those who didn't default (0).
 This could suggest that customers who add larger amounts to their
accounts might be at a higher risk of default, possibly due to
overextending their financial capabilities.

3.2.2 Distribution of Max paid invoice(0-12m) by Default status

This strip plot shows the distribution of the maximum paid invoice in the last 12 months for
defaulted and non-defaulted customers. Observations:
 The distribution for non-defaulted customers (0) appears to be more concentrated in
the lower range, with some high-value outliers.

 Non-defaulted accounts (status 0) show a wider and higher distribution of max paid
invoices, while defaulted accounts (status 1) have smaller invoice amounts. This
pattern could be used for risk assessment or to better understand customer payment
behaviour

3.2.3 Violin plot: Age distribution by default status


This violin plot displays the age distribution for defaulted and non-
defaulted customers.
 The age distributions are fairly similar for both groups.
 Both distributions are slightly right-skewed, with most customers
between 25-45 years old.
 There's a slight indication that defaulted customers might be
younger on average, but the difference doesn't appear to be
substantial.

3.2.4 Heat Map -Correlation

Key Insights:
1. Highly Correlated Features:
 Features with a correlation coefficient close to 1 or -1
have a very strong linear relationship, either positively
or negatively correlated.
 For example, if max_paid_inv_0_12m and
num_active_inv_0_12m show high positive correlation,
it implies that as the number of active invoices
increases, the maximum paid invoice also tends to
increase.
 Similarly, features like acct_worst_status_12_24m
might be strongly correlated with
acct_worst_status_6_12m, indicating a consistency in
worst account status over different periods.
2. Clusters of Features:
 Features that are highly correlated with each other may
form "clusters." For instance, all account status
variables or payment-related features might be grouped
together, showing that they are related aspects of
customer behavior.
 Clustering often reveals related features that can be
treated similarly in model building or analysis, as they
provide overlapping information.
3. Negative Correlations:
 Strong negative correlations (close to -1) indicate an
inverse relationship. For example, if default_status has
a negative correlation with max_paid_inv_0_12m, it
means that customers with higher max paid invoices
are less likely to default.
 Similarly, a negative correlation between
acct_incoming_debt_vs_paid_0_24m and
acct_days_in_rem_12_24m might show that the more
days a person remains in arrears, the less they manage
to reduce their outstanding debt.
4. Redundancy:
 Features that are almost perfectly correlated (near 1)
may represent redundant information. For example, if
acct_worst_status_6_12m and acct_worst_status_3_6m
are highly correlated, it may be redundant to include
both in certain analyses. One of these features can
potentially be dropped in a model without losing
valuable information.
5. Outliers in Correlation:
 If there are features that stand out with unexpectedly
high or low correlations compared to others, they may
warrant deeper investigation. These outliers could
represent key insights into behavior or relationships
between variables that are not immediately obvious.
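The redundancy point above can be made concrete with a small sketch. The column names are taken from the report, but the data is invented; the idea is simply to compute the correlation matrix and flag near-duplicate pairs as drop candidates.

```python
import numpy as np
import pandas as pd

# Two near-duplicate status columns plus an unrelated one
# (column names from the report, values invented for illustration).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "acct_worst_status_3_6m": base,
    "acct_worst_status_6_12m": base + rng.normal(scale=0.1, size=200),
    "max_paid_inv_0_12m": rng.normal(size=200),
})

corr = df.corr()
# Off-diagonal pairs with |r| > 0.9 are candidates for dropping
# one of the two columns as redundant.
redundant = (corr.abs() > 0.9) & (corr.abs() < 1.0)
```

Here only the two `acct_worst_status_*` columns flag each other, so one of them could be dropped without losing much information.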

3.3 Removal of unwanted variables


 Removed userid variable

3.3.1 Removed Userid from the data frame

 Removed name in email variable


3.3.2 Removed Name in email variable from the data frame
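The removal step amounts to dropping the two identifier columns, which carry no predictive signal. A minimal sketch on an invented stand-in frame:

```python
import pandas as pd

# Tiny stand-in for the project data (values invented); userid and
# name_in_email are identifiers with no predictive value, so drop both.
df = pd.DataFrame({
    "userid": [101, 102, 103],
    "name_in_email": ["a", "b", "c"],
    "age": [34, 50, 42],
    "default": [0, 0, 1],
})
df = df.drop(columns=["userid", "name_in_email"])
```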

3.4 Missing Value treatment


 There are 615,512 missing values. The percentage of missing values
in each variable was calculated; the result is below:

3.4.1 Percentage of missing value per column

 Columns with more than 25% missing values were dropped; below are
the missing values in the remaining columns.
3.4.2 Post dropping off columns with 25% threshold

 Using SimpleImputer, the remaining missing values were imputed with
the column median. Below is the result post imputation.

3.4.3 Post Imputation - Missing values
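The two-step treatment (drop above the 25% threshold, then median-impute the rest) can be sketched as follows. The data and column names are invented; only the mechanics mirror the report.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing values (column names illustrative).
df = pd.DataFrame({
    "num_active_inv": [1.0, np.nan, 3.0, 2.0],            # 25% missing
    "avg_payment_span_0_3m": [np.nan, np.nan, np.nan, 12.0],  # 75% missing
    "age": [25.0, 40.0, np.nan, 55.0],                    # 25% missing
})

# Drop columns whose missing share exceeds the 25% threshold.
keep = df.columns[df.isna().mean() <= 0.25]
df = df[keep]

# Impute the remaining gaps with the column median.
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

After this, `avg_payment_span_0_3m` is gone and no missing cells remain.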

3.5 Outlier treatment


 For outlier treatment, the data was separated into object and
non-object columns to visualise the outliers.

3.5.1 Outliers using box plot

 Post outlier treatment, the box plots are shown below:

3.5.2 Post Outliers Treatment
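The report does not spell out the capping rule used; a common choice that matches the box-plot view is IQR winsorisation, sketched here on invented data as an assumption, not the report's confirmed method.

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series) -> pd.Series:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

s = pd.Series([1, 2, 2, 3, 3, 4, 100])  # 100 is an obvious outlier
capped = cap_outliers_iqr(s)
```

The extreme value 100 is pulled down to the upper whisker, while in-range values are untouched.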

3.6 Variable transformation


 One-hot encoding was done for Merchant Category and Merchant group
(the categorical columns).
 The shape of the dataset after encoding is shown below:
3.6.1 One-hot encoding
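The encoding step can be sketched with pandas' `get_dummies` (the report does not name the exact function used, and the values below are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "merchant_group": ["Entertainment", "Clothing & Shoes", "Entertainment"],
    "age": [30, 45, 38],
})

# One-hot encode the categorical column; numeric columns pass through.
df_encoded = pd.get_dummies(df, columns=["merchant_group"])
```

Each category becomes its own indicator column, which is why the dataset's column count grows after this step.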

3.7 Addition of new variables


 The provided variables were sufficient for processing and
modelling, so there was no need to add new variables for this
dataset.

4. Business insights from EDA


4.1 Is the data unbalanced? If so, what can be done?
Please explain in the context of the business
 Imbalanced data refers to datasets where the target classes are
unevenly distributed, i.e., one class label has a very high number
of observations and the other a very low number.
 The dataset is highly imbalanced in terms of default status:
86,035 customers did not default (98.5%) and 1,280 customers
defaulted (1.5%)
 This imbalance is important to consider when interpreting the
results of predictive modelling. We might need to use techniques
like oversampling, undersampling, or SMOTE (Synthetic Minority
Oversampling Technique) to balance the classes for better model
performance.
 The general approach to applying SMOTE on this dataset:

1. Separate features and target: Use default as the target


variable and the remaining columns as features.

2. Apply SMOTE: Use SMOTE to oversample the minority class; new
instances are synthesized from the existing minority examples.
3. Train a model: You can then use the balanced data to train
your machine learning model.
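The steps above can be sketched on invented data. As a minimal stand-in, this uses random oversampling via `sklearn.utils.resample`; SMOTE itself lives in the separate imbalanced-learn package (`imblearn.over_sampling.SMOTE`) and additionally interpolates between minority-class neighbours instead of duplicating rows.

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data: 95 non-defaulters (0), 5 defaulters (1).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

# Oversample the minority class up to the majority count.
X_min_up, y_min_up = resample(X[y == 1], y[y == 1], replace=True,
                              n_samples=95, random_state=0)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
```

The balanced `X_bal`/`y_bal` (190 rows, 50/50 classes) would then be used to train the model.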

4.2 Any business insights using clustering (if


applicable)
 For clustering, target labels like default are not needed, as
clustering is unsupervised.
 K-means clustering was performed on the dataset as shown below.
 Scaling is important before performing k-means clustering. Below
is the result of the scaled data.

4.2.1 Post scaling treatment

 K-means was run for different numbers of clusters and the inertia
was calculated.

4.2.2 Inertia of various n_clusters

 This is visualised in the elbow graph shown below.


4.2.3 Elbow graph

 The silhouette score is highest at 5 clusters. The score is shown
below.

 The silhouette width and clus_kmeans cluster labels are added to the dataset.

4.2.4 Final dataset post clustering
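The scale-elbow-silhouette pipeline can be sketched end to end. Synthetic blobs stand in for the real scaled feature matrix, and k = 5 mirrors the report's choice; everything else is an assumption for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (5 true clusters).
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Inertia for a range of k values (the elbow-graph data).
inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_

# Final model at the elbow; labels could be appended as clus_kmeans.
km5 = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_scaled)
sil = silhouette_score(X_scaled, km5.labels_)
```

Inertia falls monotonically with k, so the elbow and the silhouette score together pick the working k rather than inertia alone.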

4.3 Any other business insights


1. Imbalanced Dataset: The data is highly imbalanced with 98.5%
non-defaulters and only 1.5% defaulters. This can affect the
performance of predictive models. Techniques like oversampling
(SMOTE) should be applied to handle the imbalance and improve
prediction accuracy.
2. Customer Risk Segmentation: Customers who defaulted tend to
have higher average account amounts added in the last 12-24
months, which may indicate that higher financial activity could be
associated with default risk. Non-defaulting customers generally
have higher maximum paid invoices compared to defaulters,
implying better financial behaviour in terms of invoice payments.
3. Age and Default Probability: While age distributions of defaulters
and non-defaulters are quite similar, younger customers might
have a slightly higher default tendency, though this difference isn't
large.
4. Merchant Category Insights: The "Direct Selling Establishments"
category has the highest transaction count, indicating significant
customer spending in this area. Spending is heavily concentrated in
the entertainment sector, followed by clothing and shoes,
highlighting key areas of customer expenditure.
5. Clustering Insights: The optimal number of customer clusters was
identified as five using K-means clustering, suggesting distinct
customer segments based on financial behaviour. This can help in
targeted marketing and risk assessment strategies.
6. Correlation Insights: Features related to account status over
different periods (e.g., 0-3 months vs. 6-12 months) are highly
correlated, indicating consistency in customer payment behaviour.
Strong negative correlations between default status and variables
like the maximum paid invoice suggest that higher payments are
linked to lower default risk.
These insights can guide credit risk management, customer
segmentation, and business strategies for the credit card company.
5. Model building and interpretation
5.1 Build various models
After completing EDA, building models is the next step.
 Choice of Algorithms:
1. Given the labelled target variable, supervised learning is the
natural choice.
2. Among the supervised learning, Classification models are
typically applied in scenarios where the target variable is
categorical (e.g., default/no default).
3. We have applied several models like Decision tree
classifier, Random Forest classifier, Support vector
machine, Naïve Bayes classifier. These models are
evaluated using metrics like accuracy, confusion matrix,
and ROC-AUC scores.
 The dataset was split into training and testing sets before
building models, as shown below:

5.1 Train and test data

Random forest classifier:
 Imported RandomForestClassifier from sklearn.ensemble
 It was fitted to the training data set.
Decision Tree classifier:
 Imported DecisionTreeClassifier from sklearn.tree
 It was fitted to the training data set.
Naïve Bayes classifier:
 Imported GaussianNB from sklearn.naive_bayes
 It was fitted to the training data set.
Support Vector Machine:
 Imported SVC from sklearn.svm
 It was fitted to the training data set.
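The split-and-fit step for all four classifiers can be sketched as follows. The synthetic data, split parameters, and model settings are assumptions; the excerpt does not show the actual values used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the banking data (~5% defaulters).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

models = {
    "RandomForest": RandomForestClassifier(random_state=42),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "NaiveBayes": GaussianNB(),
    "SVM": SVC(probability=True, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
```

Stratifying the split preserves the rare defaulter class in both partitions, which matters with this degree of imbalance.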
5.2 Test your predictive model against the test set
using various appropriate performance metrics
 Imported metrics such as confusion_matrix, precision_score,
recall_score, ConfusionMatrixDisplay, classification_report, and
accuracy_score.
Random forest classifier:
 The model makes predictions on the test set; the accuracy
achieved is 98.43%.

5.2.1 Accuracy score – Random Forest

 A confusion matrix is plotted using seaborn's heatmap function.


This matrix visualizes the performance of the classification
model:
1. The top-left cell (25,312) represents true negatives (correctly predicted
Class 0)
2. The bottom-right cell (22) represents true positives (correctly
predicted Class 1)
3. The top-right (39) and bottom-left (372) cells represent false positives
and false negatives respectively
4. The high number of correct predictions in the diagonal cells and low
numbers in the off-diagonal cells indicate that the model performs
very well, which is consistent with the high accuracy score.

5.2.2 Confusion Matrix – Random Forest

 A few points observed from the classification report:


1. The model performs very well in identifying non-defaulters (high
precision, recall, and F1-score for class 0)
2. However, it struggles with identifying defaulters (low precision,
very low recall, and low F1-score for class 1)
3. The high overall accuracy (98.42%) is misleading due to the class
imbalance
4. The large difference between macro and weighted averages further
highlights the impact of class imbalance

5.2.3 Classification report – Random Forest
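The accuracy / confusion-matrix / classification-report trio can be reproduced on toy predictions. The counts below are invented, not the report's, but they show the same effect the report warns about: 95% accuracy despite only 1-in-5 recall for the defaulter class.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Invented predictions on an imbalanced test set:
# 95 true non-defaulters (one misclassified), 5 true defaulters
# (only one caught).
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 94 + [1] + [0] * 4 + [1])

acc = accuracy_score(y_true, y_pred)         # 0.95
cm = confusion_matrix(y_true, y_pred)        # [[94, 1], [4, 1]]
report = classification_report(y_true, y_pred)
```

Here class-1 recall is 1/5 = 0.2, so the headline 95% accuracy is driven almost entirely by the majority class.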

 Key points from ROC curve:


1. The AUC is 0.80, which suggests that the classifier has
good performance
2. The curve is above the diagonal line indicating that the
classifier is better than random guessing.
3. Overall, the Random Forest classifier is performing well,
with a good balance between sensitivity and specificity. An
AUC of 0.80 suggests that the model is effective at
distinguishing between the two classes.

5.2.4 ROC Curve – Random Forest
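How the ROC curve and AUC are computed can be sketched with sklearn on invented scores (the real model's predicted probabilities are not shown in this excerpt):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Invented scores: defaulters (1) tend to receive higher probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.35, 0.6, 0.5, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
```

AUC equals the probability that a randomly chosen defaulter outscores a randomly chosen non-defaulter; here 7 of the 8 positive/negative pairs are ranked correctly, giving 0.875.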


Decision Tree classifier:
 The model makes predictions on the test set; the accuracy
achieved is 97.21%.

5.2.5 Accuracy score – DTC

 A confusion matrix is plotted using seaborn's heatmap function.


This confusion matrix suggests that while the Decision Tree
Classifier performs well for the majority class, it may need
improvement in correctly identifying the minority class

5.2.6 Confusion Matrix – DTC

 A few points observed from the classification report:


1. The model performs very well in identifying non-defaulters
(Class 0) with high precision, recall, and F1-score (all above
0.98).
2. However, it struggles significantly with identifying defaulters
(Class 1), with low precision, recall, and F1-score.
3. The overall accuracy is high (0.972111), but this is misleading
due to class imbalance. There are far more non-defaulters than
defaulters in the dataset
4. The macro average, which gives equal weight to both classes,
shows much lower overall performance (around 0.55 for all
metrics) due to the poor performance on the minority class.
5. While this classifier is very good at identifying non-defaulters, it
performs poorly in detecting defaulters, which is likely the more
important class in many real-world scenarios.

5.2.7 Classification report – DTC

 Key points from ROC curve:


1. The ROC curve is close to the diagonal line, which
represents random performance. This further confirms
that the classifier's performance is not strong.
2. AUC of 0.57 suggests that the classifier has slightly
better performance than random guessing but is not
very effective.
3. Overall, the Decision Tree Classifier in this case has
limited discriminative ability, as indicated by the low
AUC score and the shape of the ROC curve.
Improvements might be needed, such as tuning the
model parameters or using a different classification
algorithm.

5.2.8 ROC curve – DTC


Naïve Bayes classifier:
 The model makes predictions on the test set; the accuracy
achieved is 95.98%.

5.2.9 Accuracy score – NBC

 While the Naive Bayes classifier does reasonably well at
identifying Class 0 (with high true negatives), it performs poorly
in identifying Class 1 (defaulters), as seen by the low number of
true positives and the high number of false negatives.

5.2.10 Confusion Matrix – NBC

 Based on the classification report, the Naive Bayes classifier is
heavily biased towards predicting "non-defaulters", which leads to
very low precision, recall, and F1-score for "Defaulters".
1.2.11 Classification report – NBC

 Key points of ROC curve:


1. The curve is above the random line: This confirms that the
classifier is better than random guessing.
2. Moderate AUC (0.80): The classifier performs well overall
but still has room for improvement, especially when
considering that the classification report showed poor
results for the minority class (Defaulters).
3. A score of 0.80 means that there's an 80% chance that the
classifier will correctly distinguish between a randomly chosen
"Defaulter" and a "Non-Defaulter".

1.2.12 ROC Curve – NBC

Support Vector Machine:


 The classifier makes predictions on the test set, and the accuracy
score is calculated. The accuracy achieved is 98.47%.

1.2.13 Accuracy score – SVM


 Key points from confusion matrix:
1. The SVM classifier has predicted all instances as "Non-
Defaulters" (Class 0). This is why there are no
predictions for Class 1 (Defaulters).
2. The confusion matrix indicates that the classifier is
highly biased towards the majority class (Class 0), and it
is not able to identify any instances of the minority class
(Class 1). This is often the result of severe class
imbalance, where the classifier is dominated by the
large number of "non-defaulters" and ignores the small
number of "Defaulters."
3. Since all the actual "Defaulters" are misclassified as
"non-defaulters," the model has 0 recall for Class 1,
which means it’s not useful for identifying defaulters at
all.

1.2.14 Confusion matrix – SVM

 From the classification report, the model performs very well in
predicting non-defaulters but completely fails to detect
Defaulters. This could be due to class imbalance.

1.2.15 Classification report – SVM


 An AUC of 0.5 indicates that the model performs no better than
random guessing, meaning it has no discriminative power to
distinguish between the classes.

1.2.16 ROC Curve– SVM

5.3 Interpretation of the model(s)

 The Random Forest has a good accuracy (98.43%) and a


relatively high AUC (0.80), which indicates it performs well in
distinguishing classes. However, its precision (0.40) and recall
(0.08) for the minority class (likely Defaulters) are quite low,
showing that it struggles with class imbalance.
 The Decision Tree model has a lower AUC (0.58), and precision,
recall, and F1-scores are also quite low. It struggles more
compared to Random Forest in separating the classes, and
overall performance indicates that it might need tuning.
 Naive Bayes has a lower accuracy (95.98%), and while its
precision is low (0.08), it has a relatively higher recall (0.16). The
AUC score is similar to Random Forest (0.80), but its low
precision indicates that it struggles with false positives.
 The SVM model has a very high precision (1.00) but a recall of 0,
meaning it does not detect any Defaulters at all. This results in
an F1-score of 0 and a low AUC (0.50), indicating it performs no
better than random guessing.
 For models like Decision Tree, using boosting techniques (e.g.,
Gradient Boosting, XGBoost) could improve performance by
focusing on the misclassified instances.
 So, ensembling and model tuning are needed to obtain more
effective models.
6 Model Tuning
6.1 Ensemble modelling
 We have ensemble techniques like bagging, boosting, and
stacking.
 We can apply ensemble techniques to all the models, but they are
not very effective for some models, as explained below:
1. Ensemble methods like bagging and boosting are designed
to correct for high-variance models. Naive Bayes, however, is a
low-variance model because it does not overfit easily due to its
strong assumptions. Hence, ensembles often don’t provide much
gain since they address variance issues that Naive Bayes doesn't
struggle with.
2. Naive Bayes and SVM are typically strong models on
their own and don't require ensembling for variance reduction
or performance improvement as much as high-variance models
like decision trees do.
3. Instead of ensembling these models, hyperparameter
tuning and addressing class imbalance (especially in SVM) is
often more effective.
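One way to address the class imbalance mentioned above, short of resampling, is the `class_weight="balanced"` option in SVC, which penalizes errors on the rare class more heavily. A sketch under synthetic data (not the report's actual code):

```python
# Sketch: countering class imbalance in an SVM with class_weight='balanced',
# which reweights errors on the rare class instead of resampling the data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

plain = SVC().fit(X_train, y_train)
balanced = SVC(class_weight="balanced").fit(X_train, y_train)

# Recall on the minority class is where the difference tends to show
recall_plain = recall_score(y_test, plain.predict(X_test))
recall_balanced = recall_score(y_test, balanced.predict(X_test))
```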
Bagging Classifier using Decision Tree:
 Imported BaggingClassifier from sklearn.ensemble
 The classifier makes predictions on the test set, and the accuracy
score is calculated. The accuracy achieved is 98.41%.

2.1.1 Accuracy score – Bagging
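A sketch of the bagging step described above; BaggingClassifier's default base estimator is already a decision tree, and the synthetic imbalanced data stands in for the real dataset:

```python
# Sketch: BaggingClassifier (default base estimator is a decision tree),
# trained on bootstrap samples and combined by majority vote.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.98], random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=7)

bag = BaggingClassifier(n_estimators=50, random_state=7)
bag.fit(X_train, y_train)
bag_accuracy = accuracy_score(y_test, bag.predict(X_test))
```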

 Key points on confusion matrix:


1. Class 0 (non-defaulters) is being predicted quite
accurately, with 25,308 correct predictions and only 43
false positives. This suggests that the Bagging Classifier
performs well on the majority class.
2. Class 1 (Defaulters) is where the model struggles. Out of
394 true instances of Defaulters (from the earlier report), it
correctly identified only 27. The remaining 367 Defaulters
were misclassified as non-defaulters, leading to a high false
negative rate.
2.1.2 Confusion matrix – Bagging

 Key points on classification report:


1. The classifier is doing well on the majority class (Non-
Defaulters), but it performs poorly on the minority class
(Defaulters). This can be seen in the low precision, recall,
and F1-score for Defaulters.
2. The overall accuracy (98.4%) is high because of the class
imbalance. The model is heavily skewed toward predicting
non-defaulters correctly but is failing to capture the
Defaulters, which is crucial in many real-world
applications.
3. The low recall (8.6%) for Defaulters means the model is
missing most of the Defaulters. This can be dangerous in
scenarios where detecting Defaulters is important.

2.1.3 Classification report – Bagging

 Key points on ROC curve:


1. The AUC score is 0.80, which indicates a good model. A
perfect model would have an AUC of 1, while a random model
would have an AUC of 0.5. AUC = 0.80 means that 80% of
the time, the model will correctly distinguish between a
Defaulter and a Non-Defaulter.
2. Good Performance: An AUC of 0.80 is a strong indicator that
the Bagging Classifier has a good balance between correctly
identifying Defaulters while minimizing the number of false
positives.
3. Although the AUC score is 0.80, which indicates a good
model, it’s essential to balance the trade-off between recall
and precision, especially in contexts where false positives or
false negatives can have significant costs.

2.1.4 ROC Curve – Bagging

AdaBoost Classifier using Decision Tree:


 The classifier makes predictions on the test set, and the accuracy
score is calculated. The accuracy achieved is 98.45%.

2.1.5 Accuracy score – Ada boosting
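A sketch of the AdaBoost step; the estimator count is illustrative, and AdaBoost's default base learner is a shallow decision tree (a stump):

```python
# Sketch: AdaBoost, which fits decision stumps sequentially and
# reweights misclassified rows at each boosting round.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.98], random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=3)

ada = AdaBoostClassifier(n_estimators=100, random_state=3)
ada.fit(X_train, y_train)
ada_accuracy = accuracy_score(y_test, ada.predict(X_test))
```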

 Based on the confusion matrix, the model performs poorly at
identifying Class 1 (Defaulters), with only 5 true positives and
389 false negatives. This means the model frequently
misclassifies Class 1 as Class 0.
2.1.6 Confusion matrix – Ada boosting

 From the classification report, the model is very good at
identifying Non-Defaulters (Class 0) but performs poorly for
Defaulters (Class 1).

2.1.7 Classification report – Ada boosting

 The classifier does a good job overall, with a relatively high AUC
score.
 Although the classifier performs well in general, it may still fail
to correctly identify the minority class (Class 1) as shown by its
low recall and F1-score for that class.

2.1.8 ROC Curve – Ada boosting

Gradient Boosting classifier:


 Gradient Boosting primarily uses decision trees as the base
model, and through an iterative process of reducing prediction
errors, it builds a strong overall model from these weaker
individual trees.
 With this, it achieves a fairly good accuracy score of 98.45%.

2.1.9 Accuracy score – gradient boosting
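The iterative error-reduction process described above can be sketched as follows, again on synthetic data with default parameters rather than the report's actual configuration:

```python
# Sketch: Gradient Boosting, which builds trees sequentially,
# each one fitted to the residual errors of the ensemble so far.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.98], random_state=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=5)

gb = GradientBoostingClassifier(random_state=5)  # 100 shallow trees by default
gb.fit(X_train, y_train)
gb_accuracy = accuracy_score(y_test, gb.predict(X_test))
```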

 The confusion matrix suggests the model is skewed towards
predicting Class 0 more often and may not perform well on the
minority Class 1.

2.1.10 confusion matrix – gradient boosting

 Since accuracy is misleading with imbalanced data, metrics
like the F1-score, the precision-recall curve, or ROC-AUC may
provide better insight into model performance.

2.1.11 classification report – gradient boosting


2.1.12 ROC Curve – gradient boosting

 After applying ensembling, below are the results and
performance of all the models.

2.1.13 Performance metrics of models

6.2 Any other model tuning measures


 Hyperparameter tuning techniques like grid search and
randomized search were performed on the models.
Randomised Search CV using Random Forest Classifier:
 Performed hyperparameter tuning for a Random Forest
Classifier using RandomizedSearchCV from the sklearn library.
 The best parameters for the Random Forest model are
displayed, along with a best accuracy of 0.99, meaning the
model performed very well during cross-validation.

 The score() method is used to evaluate the model (which was
trained earlier using RandomizedSearchCV) on the test set
X_test and y_test.
 It returns the accuracy of the model on the test set, which is
stored in the variable accuracy.
2.2.1 Accuracy score – Randomized search cv using RFC
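A sketch of this tuning step; the parameter distributions below are illustrative, not the report's actual search space, and the synthetic data stands in for the real dataset:

```python
# Sketch: RandomizedSearchCV samples a fixed number of parameter
# combinations (n_iter) and cross-validates each one.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95], random_state=11)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=11)

param_dist = {"n_estimators": [25, 50, 100],
              "max_depth": [None, 5, 10],
              "min_samples_split": [2, 5, 10]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=11),
                            param_dist, n_iter=5, cv=3, random_state=11)
search.fit(X_train, y_train)

# best_params_/best_score_ come from cross-validation;
# score() evaluates the refitted best model on the held-out test set
accuracy = search.score(X_test, y_test)
```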

 The model is highly accurate for Class 0 but has difficulty


distinguishing Class 1, possibly due to class imbalance (many
more instances of class 0 compared to class 1). This type of
issue is common when one class dominates the dataset.

2.2.2 Confusion matrix – Randomized search cv using RFC

2.2.3 Classification report – Randomized search cv using RFC

2.2.4 ROC Curve – Randomized search cv using RFC


Randomised Search CV using Decision Tree Classifier
 Performed hyperparameter tuning for a Decision tree
Classifier using RandomizedSearchCV from the sklearn
library.
 The best parameters for the Decision Tree model are
displayed, along with a best accuracy of 0.99, meaning the
model performed very well during cross-validation.

 It returns the accuracy of the model on the test set, which is
stored in the variable accuracy; the value is 0.98.

2.2.5 Accuracy score – Randomized search cv using DTC

 The confusion matrix further reinforces the issue identified in


the classification report. The model is highly biased towards
the majority class (Non-Defaulters) and completely ignores
the minority class (Defaulters).

2.2.6 Confusion matrix – Randomized search cv using DTC

2.2.7 Classification report – Randomized search cv using DTC


2.2.8 ROC Curve – Randomized search cv using DTC

Randomised Search CV for Naive Bayes (Bernoulli)

2.2.9 Accuracy score – Randomized search cv using NB

2.2.10 Confusion matrix – Randomized search cv using NB


2.2.11 Classification report– Randomized search cv using NB

2.2.12 ROC Curve – Randomized search cv using NB

Grid search in decision tree classifier


 The model now identifies some Defaulters (37), but the
number of false negatives (357) is still significant.

2.2.13 Accuracy score – Grid search cv using DTC
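Unlike randomized search, grid search tries every combination in the grid. A sketch with an illustrative grid (not the report's actual one), again on synthetic data:

```python
# Sketch: GridSearchCV exhaustively cross-validates every
# combination in the parameter grid for a Decision Tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95], random_state=13)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=13)

grid = {"max_depth": [3, 5, 10, None],
        "min_samples_leaf": [1, 5, 20],
        "criterion": ["gini", "entropy"]}

search = GridSearchCV(DecisionTreeClassifier(random_state=13), grid, cv=3)
search.fit(X_train, y_train)
accuracy = search.score(X_test, y_test)
```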


2.2.14 Confusion matrix – Grid search cv using DTC

2.2.15 Classification report – Grid search cv using DTC

2.2.16 ROC curve – Grid search cv using DTC

Grid search in Bernoulli NB classifier


2.2.17 Accuracy score – Grid search cv using NB

2.2.18 Confusion matrix– Grid search cv using NB

2.2.19 Classification report – Grid search cv using NB

2.2.20 ROC Curve – Grid search cv using NB


2.2.21 Performance metrics of all models

6.3 Interpretation of the most optimum model and its


implication on the business
 RandomisedSearchCV for RandomForestClassifier has the highest
accuracy (98.50%) and a good balance of precision (0.77) and AUC score
(0.88), making it a strong candidate for predicting default probability.
 The reasons to choose this model are as follows:
 Highest Accuracy: The model achieves the highest accuracy of 98.50%
among all the models presented. This means it correctly predicts the
outcome (default or non-default) for 98.50% of the cases in the dataset.
High accuracy is crucial in banking risk assessment to minimize errors in
predicting defaults.
 Strong Precision: With a precision of 0.77, this model has the highest
precision among all models (tied with RandomisedSearchCV for
Decision Tree Classifier). Precision measures the proportion of true
positive predictions (correctly predicted defaults) out of all positive
predictions. A high precision means that when the model predicts a
default, it's more likely to be correct, reducing false alarms.
 High AUC Score: The Area Under the Curve (AUC) score of 0.88 is one
of the highest among all models. AUC represents the model's ability to
distinguish between classes (default and non-default). A score of 0.88
indicates that the model has a strong ability to separate the two classes,
which is crucial for a binary classification problem like predicting loan
defaults.
 Balanced Performance: This model provides a good balance between
different metrics. While some models might excel in one area but
perform poorly in others, this model maintains high scores across
accuracy, precision, and AUC.
 Advantages of Random Forest: The base algorithm (Random Forest) is
known for its robustness and ability to handle complex relationships in data.
It's an ensemble method that combines multiple decision trees, which helps
in reducing overfitting and improving generalization.
 Hyperparameter Optimization: The use of RandomisedSearchCV
indicates that the model's hyperparameters have been optimized. This
process helps in finding the best configuration of the Random Forest
algorithm for this specific dataset, potentially improving its performance
over a standard Random Forest.
 Feature Importance Visualization:

2.3.1 Top 10 feature importances

 Financial behaviour: The majority of the important features seem to focus


on financial behaviour, particularly in terms of payments, investments,
and capital added within certain time frames (12 and 24 months).
 Age and time-related metrics: Age and duration within the system
("time_hours") are also influential, likely capturing aspects of experience,
reliability, or maturity.
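The feature-importance ranking discussed above can be sketched as follows; the feature names here are invented for illustration, not the project's actual column names:

```python
# Sketch: ranking feature_importances_ from a fitted Random Forest.
# Feature names are placeholders, not the report's actual columns.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [f"feature_{i}" for i in range(10)]
X, y = make_classification(n_samples=2000, n_features=10, random_state=21)

rf = RandomForestClassifier(n_estimators=100, random_state=21).fit(X, y)

# Pair each importance with its name and sort descending to get the top 10
ranked = sorted(zip(feature_names, rf.feature_importances_),
                key=lambda t: t[1], reverse=True)
top_10 = ranked[:10]
```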
