
Banking Project

(Capstone Project Notes - 2)


DSBA

By:
E. AuroRajashri

List of Contents
1) Model building and interpretation
1.1 Build various models
1.2 Test your predictive model against the test set using various performance metrics
1.3 Interpretation of the model(s)

2) Model Tuning
2.1 Ensemble modelling, wherever applicable
2.2 Any other model tuning measures (if applicable)
2.3 Interpretation of the most optimum model and its implication on the business

List of Figures
1.1.1 Train and test data

1.2.1 Accuracy score – Random Forest

1.2.2 Confusion Matrix – Random Forest

1.2.3 Classification report – Random Forest

1.2.4 ROC Curve – Random Forest

1.2.5 Accuracy score – DTC

1.2.6 Confusion Matrix – DTC

1.2.7 Classification report – DTC

1.2.8 ROC curve – DTC

1.2.9 Accuracy score – NBC

1.2.10 Confusion Matrix – NBC

1.2.11 Classification report – NBC

1.2.12 ROC Curve – NBC

1.2.13 Accuracy score – SVM

1.2.14 Confusion matrix – SVM

1.2.15 Classification report – SVM

1.2.16 ROC Curve– SVM

2.1.1 Accuracy score – Bagging

2.1.2 Confusion matrix – Bagging

2.1.3 Classification report – Bagging

2.1.4 ROC Curve – Bagging

2.1.5 Accuracy score – Ada boosting

2.1.6 Confusion matrix – Ada boosting

2.1.7 Classification report – Ada boosting

2.1.8 ROC Curve – Ada boosting

2.1.9 Accuracy score – Gradient boosting

2.1.10 Confusion matrix – Gradient boosting

2.1.11 Classification report – Gradient boosting

2.1.12 ROC Curve – Gradient boosting

2.2.1 Accuracy score – Randomized search cv using RFC

2.2.2 Confusion matrix – Randomized search cv using RFC

2.2.3 Classification report – Randomized search cv using RFC

2.2.4 ROC Curve – Randomized search cv using RFC

2.2.5 Accuracy score – Randomized search cv using DTC

2.2.6 Confusion matrix – Randomized search cv using DTC

2.2.7 Classification report – Randomized search cv using DTC

2.2.8 ROC Curve – Randomized search cv using DTC

2.2.9 Accuracy score – Randomized search cv using NB

2.2.10 Confusion matrix – Randomized search cv using NB

2.2.11 Classification report– Randomized search cv using NB

2.2.12 ROC Curve – Randomized search cv using NB

2.2.13 Accuracy score – Grid search cv using DTC

2.2.14 Confusion matrix – Grid search cv using DTC

2.2.15 Classification report – Grid search cv using DTC

2.2.16 ROC curve – Grid search cv using DTC

2.2.17 Accuracy score – Grid search cv using NB

2.2.18 Confusion matrix– Grid search cv using NB

2.2.19 Classification report – Grid search cv using NB

2.2.20 ROC Curve – Grid search cv using NB

2.2.21 Performance metrics of all models

2.3.1 Top 10 feature importances

1. Model building and interpretation
1.1 Build various models
After completing EDA, the next step is to build models.
 Choice of algorithms:
1. Since the dataset has a labelled target, supervised learning is the
appropriate choice.
2. Within supervised learning, classification models apply because the
target variable is categorical (e.g., default/no default).
3. Several models were applied: Decision Tree classifier, Random Forest
classifier, Support Vector Machine, and Naïve Bayes classifier. These
models are evaluated using metrics like accuracy, confusion matrix, and
ROC-AUC scores.
 The dataset was split into training and testing sets before building the
models, as shown below:

1.1.1 Train and test data
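A minimal sketch of such a split, using a synthetic imbalanced dataset in place of the project's actual banking data (the 70/30 ratio, the class weights, and the random state are assumptions, not taken from the notes):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the banking dataset; an imbalanced binary
# target (~5% positives) mimics the default/no-default split.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)

# Hold out 30% of the rows for testing; stratify keeps the class
# ratio the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
```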

Random Forest classifier:
 Imported RandomForestClassifier from sklearn.ensemble
 It was fitted to the training data set.
Decision Tree classifier:
 Imported DecisionTreeClassifier from sklearn.tree
 It was fitted to the training data set.
Naïve Bayes classifier:
 Imported GaussianNB from sklearn.naive_bayes
 It was fitted to the training data set.
Support Vector Machine:
 Imported SVC from sklearn.svm
 It was fitted to the training data set.
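The four classifiers can be set up and fitted along these lines; the synthetic data stands in for the project's training set, and the hyperparameters shown are library defaults, not the ones used in the notes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the banking data.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The four classifiers named above, each fitted on the training data.
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(probability=True, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "fitted")
```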

1.2 Test your predictive model against the test set using
various appropriate performance metrics
 Imported evaluation utilities such as confusion_matrix, precision_score,
recall_score, ConfusionMatrixDisplay, classification_report, and
accuracy_score.
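A sketch of how these utilities fit together; the synthetic data and the Random Forest below are stand-ins for the project's dataset and models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the project's banking data.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy alone misleads on imbalanced data, so report several metrics.
cm = confusion_matrix(y_test, y_pred)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall   :", recall_score(y_test, y_pred, zero_division=0))
print(cm)
print(classification_report(y_test, y_pred, zero_division=0))
```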
Random forest classifier:

 It makes predictions on the test set and calculates the accuracy score.
The accuracy achieved is 98.43%.

1.2.1 Accuracy score – Random Forest

 A confusion matrix is plotted using seaborn's heatmap function. This matrix
visualizes the performance of the classification model:
1. The top-left cell (25312) represents true negatives (correctly predicted Class 0)
2. The bottom-right cell (22) represents true positives (correctly predicted Class 1)
3. The top-right (39) and bottom-left (372) cells represent false positives and false
negatives respectively
4. The high number of correct predictions in the diagonal cells and low numbers in
the off-diagonal cells indicate that the model performs very well, which is
consistent with the high accuracy score.

1.2.2 Confusion Matrix – Random Forest
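The heatmap itself can be drawn with a sketch like the following; the labels here are illustrative, not the project's actual predictions:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical true/predicted labels standing in for the model's output.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)

# annot=True writes the count into each cell; fmt="d" keeps integers.
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.savefig("confusion_matrix.png")
```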

 A few points observed from the classification report:
1. The model performs very well in identifying non-defaulters (high precision,
recall, and F1-score for class 0)

2. However, it struggles with identifying defaulters (low precision, very low
recall, and low F1-score for class 1)
3. The high overall accuracy (98.42%) is misleading due to the class imbalance
4. The large difference between macro and weighted averages further highlights
the impact of class imbalance

1.2.3 Classification report – Random Forest

 Key points from ROC curve:


1. The AUC is 0.80, which suggests that the classifier has good
performance
2. The curve is above the diagonal line indicating that the classifier is
better than random guessing.
3. Overall, the Random Forest classifier is performing well, with a good
balance between sensitivity and specificity. An AUC of 0.80 suggests
that the model is effective at distinguishing between the two classes.

1.2.4 ROC Curve – Random Forest
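A sketch of how such a ROC curve and AUC can be produced; synthetic data replaces the project's test set, and ROC/AUC is computed from class-1 probabilities rather than hard labels:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the banking data.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=2)

model = RandomForestClassifier(random_state=2).fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1

auc = roc_auc_score(y_test, y_prob)
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # random-guess diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc_curve.png")
print(f"AUC = {auc:.2f}")
```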

Decision Tree classifier:
 It makes predictions on the test set and calculates the accuracy score.
The accuracy achieved is 97.21%.

1.2.5 Accuracy score – DTC

 A confusion matrix is plotted using seaborn's heatmap function. It
suggests that while the Decision Tree Classifier performs well for the
majority class, it needs improvement in correctly identifying the
minority class.

1.2.6 Confusion Matrix – DTC

 A few points observed from the classification report:
1. The model performs very well in identifying non-defaulters (Class 0) with
high precision, recall, and F1-score (all above 0.98).
2. However, it struggles significantly with identifying defaulters (Class 1),
with low precision, recall, and F1-score.
3. The overall accuracy is high (0.972111), but this is misleading due to class
imbalance. There are far more non-defaulters than defaulters in the
dataset
4. The macro average, which gives equal weight to both classes, shows much
lower overall performance (around 0.55 for all metrics) due to the poor
performance on the minority class.
5. While this classifier is very good at identifying non-defaulters, it
performs poorly at detecting defaulters, which is likely the more
important class in many real-world scenarios.

1.2.7 Classification report – DTC

 Key points from ROC curve:


1. The ROC curve is close to the diagonal line, which represents
random performance. This further confirms that the classifier's
performance is not strong.
2. AUC of 0.57 suggests that the classifier has slightly better
performance than random guessing but is not very effective.
3. Overall, the Decision Tree Classifier in this case has limited
discriminative ability, as indicated by the low AUC score and the
shape of the ROC curve. Improvements might be needed, such as
tuning the model parameters or using a different classification
algorithm.

1.2.8 ROC curve – DTC

Naïve Bayes classifier:

 It makes predictions on the test set and calculates the accuracy score.
The accuracy achieved is 95.98%.

1.2.9 Accuracy score – NBC

 While the Naive Bayes Classifier does reasonably well at identifying Class 0
(with high true negatives), it performs poorly in identifying Class 1
(Defaulters), as seen by the low number of true positives and high false
negatives

1.2.10 Confusion Matrix – NBC

 Based on the classification report, the Naive Bayes classifier is heavily
biased towards predicting "non-defaulters", which leads to very low
precision, recall, and F1-score for "Defaulters".

1.2.11 Classification report – NBC

 Key points of ROC curve:
1. The curve is above the random line: This confirms that the classifier
is better than random guessing.
2. Moderate AUC (0.80): The classifier performs well overall but still
has room for improvement, especially when considering that the
classification report showed poor results for the minority class
(Defaulters).
3. A score of 0.80 means that there is an 80% chance that the classifier
will correctly distinguish between a randomly chosen "Defaulter" and a
randomly chosen "Non-Defaulter".

1.2.12 ROC Curve – NBC

Support Vector Machine:


 It makes predictions on the test set and calculates the accuracy score.
The accuracy achieved is 98.47%.

1.2.13 Accuracy score – SVM

 Key points from the confusion matrix:
1. The SVM classifier has predicted all instances as "Non-Defaulters"
(Class 0), so there are no predictions for Class 1 (Defaulters).
2. The confusion matrix indicates that the classifier is highly biased
towards the majority class (Class 0) and is unable to identify any
instances of the minority class (Class 1). This is often the result of
severe class imbalance, where the classifier is dominated by the large
number of "non-defaulters" and ignores the small number of "Defaulters".
3. Since all the actual "Defaulters" are misclassified as
"non-defaulters", the model has 0 recall for Class 1, which means it is
not useful for identifying defaulters at all.

1.2.14 Confusion matrix – SVM

 From the classification report, the model performs very well in
predicting non-defaulters but completely fails to detect Defaulters. This
could be due to class imbalance.

1.2.15 Classification report – SVM

 An AUC of 0.5 indicates that the model performs no better than random
guessing, meaning it has no discriminative power to distinguish between
the classes.

1.2.16 ROC Curve– SVM

1.3 Interpretation of the model(s)

 The Random Forest has a good accuracy (98.43%) and a relatively high
AUC (0.80), which indicates it performs well in distinguishing classes.
However, its precision (0.40) and recall (0.08) for the minority class (likely
Defaulters) are quite low, showing that it struggles with class imbalance.
 The Decision Tree model has a lower AUC (0.58), and precision, recall, and
F1-scores are also quite low. It struggles more compared to Random Forest
in separating the classes, and overall performance indicates that it might
need tuning.
 Naive Bayes has a lower accuracy (95.98%), and while its precision is low
(0.08), it has a relatively higher recall (0.16). The AUC score is similar to
Random Forest (0.80), but its low precision indicates that it struggles with
false positives.
 The SVM model has a very high precision (1.00) but a recall of 0, meaning
it does not detect any Defaulters at all. This results in an F1-score of 0 and a
low AUC (0.50), indicating it performs no better than random guessing.
 For models like Decision Tree, using boosting techniques (e.g., Gradient
Boosting, XGBoost) could improve performance by focusing on the
misclassified instances.
 So, ensembling and model tuning are needed for the more effective models.

2. Model Tuning
2.1 Ensemble modelling
 Ensemble techniques include bagging, boosting, and stacking.
 Ensemble techniques can be applied to all the models, but they are not
equally effective for every model, as explained below:
1. Ensemble methods like bagging and boosting are designed to correct
for high-variance models. Naive Bayes, however, is a low-variance model
because it does not overfit easily due to its strong assumptions. Hence,
ensembles often don’t provide much gain since they address variance issues
that Naive Bayes doesn't struggle with.
2. Naive Bayes and SVM are typically strong models on their own
and don't require ensembling for variance reduction or performance
improvement as much as high-variance models like decision trees do.
3. Instead of ensembling these models, hyperparameter tuning and
addressing class imbalance (especially for SVM) are often more effective.
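As an illustration of the class-imbalance point, scikit-learn's class_weight option re-weights errors on the rare class; the data below is a synthetic stand-in, not the project's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the project's dataset.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=3)

# class_weight="balanced" scales each class's errors inversely to its
# frequency, pushing the SVM to pay attention to the rare class.
plain = SVC(random_state=3).fit(X_train, y_train)
weighted = SVC(class_weight="balanced", random_state=3).fit(X_train, y_train)

r_plain = recall_score(y_test, plain.predict(X_test))
r_weighted = recall_score(y_test, weighted.predict(X_test))
print("minority recall (plain)   :", r_plain)
print("minority recall (balanced):", r_weighted)
```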
Bagging Classifier using Decision Tree:
 Imported BaggingClassifier from sklearn.ensemble
 It makes predictions on the test set and calculates the accuracy score.
The accuracy achieved is 98.41%.

2.1.1 Accuracy score – Bagging
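A minimal sketch of such a bagging setup; BaggingClassifier's default base estimator is a decision tree, and the data and n_estimators value below are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the banking data.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=4)

# Each of the 100 trees is fitted on a bootstrap sample of the training
# rows; predictions are combined by majority vote.
bag = BaggingClassifier(n_estimators=100, random_state=4).fit(X_train, y_train)
acc = accuracy_score(y_test, bag.predict(X_test))
print(f"bagging accuracy: {acc:.4f}")
```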

 Key points on confusion matrix:


1. Class 0 (non-defaulters) is being predicted quite accurately, with
25,308 correct predictions and only 43 false positives. This
suggests that the Bagging Classifier performs well on the majority class.
2. Class 1 (Defaulters) is where the model struggles. Out of 394 true
instances of Defaulters (from the earlier report), it correctly identified
only 27. The remaining 367 Defaulters were misclassified as non-
defaulters, leading to a high false negative rate.

2.1.2 Confusion matrix – Bagging

 Key points on classification report:


1. The classifier is doing well on the majority class (Non-Defaulters),
but it performs poorly on the minority class (Defaulters). This can
be seen in the low precision, recall, and F1-score for Defaulters.
2. The overall accuracy (98.4%) is high because of the class imbalance.
The model is heavily skewed toward predicting non-defaulters
correctly but is failing to capture the Defaulters, which is crucial in
many real-world applications
3. The low recall (8.6%) for Defaulters means the model is missing
most of the Defaulters. This can be dangerous in scenarios where
detecting Defaulters is important.

2.1.3 Classification report – Bagging

 Key points on ROC curve:


1. The AUC score is 0.80, which indicates a good model. A perfect model
would have an AUC of 1, while a random model would have an AUC of 0.5. An
AUC of 0.80 means that a randomly chosen Defaulter will be ranked above a
randomly chosen Non-Defaulter 80% of the time.

2. Good Performance: An AUC of 0.80 is a strong indicator that the
Bagging Classifier has a good balance between correctly identifying
Defaulters while minimizing the number of false positives.
3. Although the AUC score is 0.80, which indicates a good model, it’s
essential to balance the trade-off between recall and precision,
especially in contexts where false positives or false negatives can have
significant costs.

2.1.4 ROC Curve – Bagging

Ada Boosting Classifier using Decision Tree:


 It makes predictions on the test set and calculates the accuracy score.
The accuracy achieved is 98.45%.

2.1.5 Accuracy score – Ada boosting
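A sketch of the AdaBoost setup, again on synthetic stand-in data with an assumed n_estimators value:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the banking data.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=5)

# AdaBoost re-weights the training rows each round so that later learners
# concentrate on the examples the earlier ones misclassified.
ada = AdaBoostClassifier(n_estimators=100, random_state=5).fit(X_train, y_train)
acc = accuracy_score(y_test, ada.predict(X_test))
print(f"AdaBoost accuracy: {acc:.4f}")
```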

 Based on the confusion matrix, the model performs poorly at identifying
Class 1 (Defaulters), with only 5 true positives and 389 false negatives.
This means the model frequently misclassifies Class 1 as Class 0.

2.1.6 Confusion matrix – Ada boosting

 From the classification report, the model is very good at identifying
Non-Defaulters (Class 0) but performs poorly for Defaulters (Class 1).

2.1.7 Classification report – Ada boosting

 The classifier does a good job overall, with a relatively high AUC score.
 Although the classifier performs well in general, it may still fail to correctly
identify the minority class (Class 1) as shown by its low recall and F1-score
for that class.

2.1.8 ROC Curve – Ada boosting

Gradient Boosting classifier:


 Gradient Boosting primarily uses decision trees as the base model, and
through an iterative process of reducing prediction errors, it builds a
strong overall model from these weaker individual trees.
 With this, it achieves a fairly good accuracy score of 98.45%.

2.1.9 Accuracy score – Gradient boosting
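The gradient-boosting setup described above can be sketched as follows, on synthetic data with library-default hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the banking data.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=6)

# Each new tree is fitted to the residual errors of the ensemble so far,
# gradually turning weak trees into a strong combined model.
gb = GradientBoostingClassifier(random_state=6).fit(X_train, y_train)
acc = accuracy_score(y_test, gb.predict(X_test))
print(f"gradient boosting accuracy: {acc:.4f}")
```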

 The confusion matrix suggests the model is skewed towards predicting
Class 0 and may not perform well on the minority Class 1.

2.1.10 Confusion matrix – Gradient boosting

 Since accuracy is misleading with imbalanced data, using metrics like F1-
score, precision-recall curve, or ROC-AUC may provide better insight into
model performance.

2.1.11 Classification report – Gradient boosting

2.1.12 ROC Curve – Gradient boosting

 After applying the ensembles, the results and performance of all the
models are summarised below.

2.1.13 Performance metrics of models

2.2 Any other model tuning measures


 Hyperparameter tuning, such as grid search and randomized search, was
performed on the models.
Randomised Search CV using Random Forest Classifier:
 Performed hyperparameter tuning for a Random Forest Classifier using
RandomizedSearchCV from the sklearn library.
 The best parameters for the Random Forest model are displayed, along
with a best cross-validation accuracy of 0.99, meaning the model
performed very well during cross-validation.

 The score() method is used to evaluate the model (which was trained
earlier using RandomizedSearchCV) on the test set X_test and y_test.
 It returns the accuracy of the model on the test set, which is stored in
the variable accuracy.
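A sketch of this tuning step; the parameter distributions below are illustrative, since the notes do not list the exact search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced stand-in for the banking data.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=7)

# Illustrative search space (an assumption, not the notes' actual grid).
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 5],
}
# RandomizedSearchCV samples n_iter combinations and cross-validates each.
search = RandomizedSearchCV(RandomForestClassifier(random_state=7),
                            param_dist, n_iter=5, cv=3, random_state=7)
search.fit(X_train, y_train)

print("best params :", search.best_params_)
print("cv accuracy :", search.best_score_)
accuracy = search.score(X_test, y_test)  # test-set accuracy, as in the notes
print("test accuracy:", accuracy)
```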

2.2.1 Accuracy score – Randomized search cv using RFC

 The model is highly accurate for Class 0 but has difficulty
distinguishing Class 1, possibly due to class imbalance (many more
instances of Class 0 than Class 1). This issue is common when one class
dominates the dataset.

2.2.2 Confusion matrix – Randomized search cv using RFC

2.2.3 Classification report – Randomized search cv using RFC

2.2.4 ROC Curve – Randomized search cv using RFC

Randomised Search CV using Decision Tree Classifier

 Performed hyperparameter tuning for a Decision Tree Classifier using
RandomizedSearchCV from the sklearn library.
 The best parameters for the Decision Tree model are displayed, along
with a best cross-validation accuracy of 0.99, meaning the model
performed very well during cross-validation.

 It returns the accuracy of the model on the test set, stored in the
variable accuracy; here it is 0.98.

2.2.5 Accuracy score – Randomized search cv using DTC

 The confusion matrix further reinforces the issue identified in the
classification report: the model is highly biased towards the majority
class (Non-Defaulters) and completely ignores the minority class
(Defaulters).

2.2.6 Confusion matrix – Randomized search cv using DTC

2.2.7 Classification report – Randomized search cv using DTC

2.2.8 ROC Curve – Randomized search cv using DTC

Randomised Search CV for Naive Bayes (Bernoulli)

2.2.9 Accuracy score – Randomized search cv using NB

2.2.10 Confusion matrix – Randomized search cv using NB

2.2.11 Classification report– Randomized search cv using NB

2.2.12 ROC Curve – Randomized search cv using NB

Grid search CV using Decision Tree classifier


 The model now identifies some Defaulters (37), but the number of false
negatives (357) is still significant.

2.2.13 Accuracy score – Grid search cv using DTC
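A sketch of the grid search over a Decision Tree; the grid itself is an assumption, as the notes do not list the parameters searched:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the banking data.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=8)

# Unlike randomized search, grid search tries every combination.
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=8), param_grid, cv=3)
grid.fit(X_train, y_train)

print("best params  :", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```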

2.2.14 Confusion matrix – Grid search cv using DTC

2.2.15 Classification report – Grid search cv using DTC

2.2.16 ROC curve – Grid search cv using DTC

Grid search CV using Bernoulli NB classifier

2.2.17 Accuracy score – Grid search cv using NB

2.2.18 Confusion matrix– Grid search cv using NB

2.2.19 Classification report – Grid search cv using NB

2.2.20 ROC Curve – Grid search cv using NB

2.2.21 Performance metrics of all models

2.3 Interpretation of the most optimum model and its
implication on the business
 RandomisedSearchCV for RandomForestClassifier has the highest
accuracy (98.50%) and a good balance of precision (0.77) and AUC score
(0.88), making it a strong candidate for predicting default probability.
 The reasons for choosing this model are as follows:
 Highest Accuracy: The model achieves the highest accuracy of 98.50%
among all the models presented. This means it correctly predicts the
outcome (default or non-default) for 98.50% of the cases in the dataset.
High accuracy is crucial in banking risk assessment to minimize errors in
predicting defaults.
 Strong Precision: With a precision of 0.77, this model has the highest
precision among all models (tied with RandomisedSearchCV for
Decision Tree Classifier). Precision measures the proportion of true
positive predictions (correctly predicted defaults) out of all positive
predictions. A high precision means that when the model predicts a
default, it's more likely to be correct, reducing false alarms.
 High AUC Score: The Area Under the Curve (AUC) score of 0.88 is one
of the highest among all models. AUC represents the model's ability to
distinguish between classes (default and non-default). A score of 0.88
indicates that the model has a strong ability to separate the two classes,
which is crucial for a binary classification problem like predicting loan
defaults.
 Balanced Performance: This model provides a good balance between
different metrics. While some models might excel in one area but
perform poorly in others, this model maintains high scores across
accuracy, precision, and AUC.
 Advantages of Random Forest: The base algorithm (Random Forest) is
known for its robustness and ability to handle complex relationships in data.
It's an ensemble method that combines multiple decision trees, which helps
in reducing overfitting and improving generalization.

 Hyperparameter Optimization: The use of RandomisedSearchCV
indicates that the model's hyperparameters have been optimized. This
process helps in finding the best configuration of the Random Forest
algorithm for this specific dataset, potentially improving its performance
over a standard Random Forest.
 Feature Importance Visualization:

2.3.1 Top 10 feature importances

 Financial behaviour: The majority of the important features focus on
financial behaviour, particularly payments, investments, and capital
added within certain time frames (12 and 24 months).
 Age and time-related metrics: Age and duration within the system
("time_hours") are also influential, likely capturing aspects of experience,
reliability, or maturity.
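A sketch of how such a top-10 importance chart can be produced; the feature names here are placeholders, not the dataset's real columns:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; the real feature names come from the banking data.
X, y = make_classification(n_samples=400, n_features=15, random_state=9)
cols = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(random_state=9).fit(X, y)

# Random Forest importances sum to 1; keep the ten largest.
top10 = (pd.Series(rf.feature_importances_, index=cols)
         .sort_values(ascending=False)
         .head(10))
top10.plot(kind="barh")
plt.title("Top 10 feature importances")
plt.tight_layout()
plt.savefig("feature_importances.png")
print(top10)
```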

