Modelling-project notes-2
By:
E. AuroRajashri
List of Contents
1) Model building and interpretation
1.1 Build various models
1.2 Test your predictive model against the test set using various performance metrics
1.3 Interpretation of the model(s)
2) Model Tuning
2.1 Ensemble modelling, wherever applicable
2.2 Any other model tuning measures (if applicable)
2.3 Interpretation of the most optimum model and its implication on the business
List of Figures
1.1 Train and test data
2.1.5 Accuracy score – Ada boosting
1. Model building and interpretation
1.1 Build various models
After completing EDA, building models is the next step.
Choice of Algorithms:
1. Looking at the data, supervised learning is the natural choice.
2. Within supervised learning, classification models are the appropriate family, since the target variable is categorical (default/no default).
3. We applied several models: Decision Tree classifier, Random Forest classifier, Support Vector Machine (SVM), and Naïve Bayes classifier. These models are evaluated using metrics such as accuracy, the confusion matrix, and ROC-AUC scores.
The dataset was split into training and testing sets before building the models, as shown below:
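As a minimal sketch of this step, assuming a 70/30 split (the project's actual proportion may differ) and using a synthetic imbalanced dataset as a stand-in for the credit data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's credit-default data (assumption):
# roughly 5% of samples belong to the minority "Defaulter" class.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)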
1.2 Test your predictive model against the test set using
various appropriate performance metrics
We imported evaluation utilities from sklearn.metrics, including confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay, classification_report, and accuracy_score.
Random Forest classifier:
The fitted model makes predictions on the test set, and the accuracy score is calculated. The accuracy achieved is 98.43%.
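A hedged sketch of this step, reusing the X_train/X_test split above; default hyperparameters are assumed, so the exact score will vary:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(random_state=42)   # assumed defaults
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
# The project reports 98.43% on its own data.
print("Accuracy:", accuracy_score(y_test, y_pred_rf))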
1. The model performs very well for the majority class of non-defaulters (class 0).
2. However, it struggles to identify defaulters (low precision, very low recall, and low F1-score for class 1).
3. The high overall accuracy (98.43%) is misleading due to the class imbalance.
4. The large difference between the macro and weighted averages further highlights the impact of class imbalance.
Decision Tree classifier:
The fitted model makes predictions on the test set, and the accuracy score is calculated. The accuracy achieved is 97.21%.
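The same pattern applies to the Decision Tree, here with the per-class report that the next figure summarizes (a sketch under the same assumptions as above):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred_dt))   # 97.21% in the project
print(classification_report(y_test, y_pred_dt))         # per-class precision/recall/F1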
1.2.7 Classification report – DTC
Naïve Bayes classifier:
The fitted model makes predictions on the test set, and the accuracy score is calculated. The accuracy achieved is 95.98%.
While the Naïve Bayes classifier does reasonably well at identifying Class 0 (with a high number of true negatives), it performs poorly at identifying Class 1 (Defaulters), as seen in the low number of true positives and the high number of false negatives.
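A sketch of the Naïve Bayes step together with the confusion matrix that backs this observation (the display labels are assumptions):

import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

nb = GaussianNB()
nb.fit(X_train, y_train)

y_pred_nb = nb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred_nb))   # 95.98% in the project

# True negatives (top left) dominate; true positives (bottom right) stay low.
cm = confusion_matrix(y_test, y_pred_nb)
ConfusionMatrixDisplay(cm, display_labels=["Non-Defaulter", "Defaulter"]).plot()
plt.show()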
Key points of the ROC curve:
1. The curve is above the random line: this confirms that the classifier is better than random guessing.
2. Moderate AUC (0.80): the classifier performs well overall but still has room for improvement, especially considering that the classification report showed poor results for the minority class (Defaulters).
3. A score of 0.80 means there is an 80% chance that the classifier will correctly distinguish between a randomly chosen "Defaulter" and a randomly chosen "Non-Defaulter".
Support Vector Machine:
2. The confusion matrix shows that the model does not predict any instances of the minority class (Class 1). This is often the result of severe class imbalance, where the classifier is dominated by the large number of "non-defaulters" and ignores the small number of "Defaulters".
3. Since all the actual "Defaulters" are misclassified as "non-defaulters", the model has 0 recall for Class 1, which means it is not useful for identifying defaulters at all.
From the classification report, the model performs very well in predicting non-defaulters but completely fails to detect Defaulters. This could be due to class imbalance.
An AUC of 0.5 indicates that the model performs no better than random
guessing, meaning it has no discriminative power to distinguish between
the classes.
1.2.16 ROC Curve – SVM
The Random Forest has high accuracy (98.43%) and a relatively high AUC (0.80), which indicates it performs well in distinguishing the classes. However, its precision (0.40) and recall (0.08) for the minority class (Defaulters) are quite low, showing that it struggles with class imbalance.
The Decision Tree model has a lower AUC (0.58), and precision, recall, and
F1-scores are also quite low. It struggles more compared to Random Forest
in separating the classes, and overall performance indicates that it might
need tuning.
Naive Bayes has a lower accuracy (95.98%), and while its precision is low
(0.08), it has a relatively higher recall (0.16). The AUC score is similar to
Random Forest (0.80), but its low precision indicates that it struggles with
false positives.
The SVM model has a very high precision (1.00) but a recall of 0, meaning
it does not detect any Defaulters at all. This results in an F1-score of 0 and a
low AUC (0.50), indicating it performs no better than random guessing.
For models like Decision Tree, using boosting techniques (e.g., Gradient
Boosting, XGBoost) could improve performance by focusing on the
misclassified instances.
So, ensembling and model tuning are needed for the more effective models.
2. Model Tuning
2.1 Ensemble modelling
We have ensemble techniques such as bagging, boosting, and stacking.
We can apply ensemble techniques to all the models, but they are not equally effective for every model, as explained below:
1. Ensemble methods like bagging and boosting are designed to correct for high-variance models. Naïve Bayes, however, is a low-variance model, because its strong independence assumptions keep it from overfitting easily. Hence, ensembles often provide little gain, since they address variance issues that Naïve Bayes does not struggle with.
2. Naïve Bayes and SVM are typically strong models on their own and do not require ensembling for variance reduction or performance improvement as much as high-variance models like decision trees do.
3. Instead of ensembling these models, hyperparameter tuning and addressing class imbalance (especially in SVM) are often more effective, as sketched after this list.
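As a sketch of what addressing the imbalance could look like for the SVM, class_weight="balanced" is one option; the project may equally well use resampling instead:

from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Re-weight classes inversely to their frequency so the rare
# "Defaulter" class is no longer ignored during training.
svm_balanced = SVC(kernel="rbf", class_weight="balanced", random_state=42)
svm_balanced.fit(X_train, y_train)
print(classification_report(y_test, svm_balanced.predict(X_test)))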
Bagging Classifier using Decision Tree:
We imported BaggingClassifier from sklearn.ensemble. The fitted model makes predictions on the test set, and the accuracy score is calculated. The accuracy achieved is 98.41%.
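A minimal sketch of the bagging step; n_estimators is an assumption, and older scikit-learn versions call the estimator argument base_estimator:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),  # base learner
    n_estimators=100,                                   # assumed value
    random_state=42,
)
bag.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, bag.predict(X_test)))  # 98.41% in the project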
2.1.2 Confusion matrix – Bagging
2. Good performance: an AUC of 0.80 is a strong indicator that the Bagging Classifier strikes a good balance between correctly identifying Defaulters and minimizing the number of false positives.
3. Although the AUC score of 0.80 indicates a good model, it is essential to balance the trade-off between recall and precision, especially in contexts where false positives or false negatives carry significant costs.
From the classification report, the model is very good at identifying Non-Defaulters (Class 0) but performs poorly for Defaulters (Class 1).
The classifier does a good job overall, with a relatively high AUC score.
Although the classifier performs well in general, it may still fail to correctly
identify the minority class (Class 1) as shown by its low recall and F1-score
for that class.
2.1.9 Accuracy score – gradient boosting
Since accuracy is misleading with imbalanced data, metrics like the F1-score, the precision-recall curve, or ROC-AUC provide better insight into model performance.
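A sketch combining the gradient boosting model with these imbalance-aware metrics (hyperparameters are assumed defaults):

import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, PrecisionRecallDisplay

gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)

# F1 on the minority "Defaulter" class, not overall accuracy.
print("F1 (Defaulters):", f1_score(y_test, gb.predict(X_test), pos_label=1))

# The precision-recall curve shows the trade-off across thresholds.
PrecisionRecallDisplay.from_estimator(gb, X_test, y_test)
plt.show()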
2.1.12 ROC Curve – gradient boosting
After applying ensembling, below are the results and performance of all the models.
The score() method is used to evaluate the model (which was trained earlier using RandomizedSearchCV) on the test set X_test and y_test. It returns the accuracy of the model on the test set, which is stored in the variable accuracy.
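A sketch of this tuning step; the search space below is illustrative, not the project's actual grid:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {                      # assumed, illustrative search space
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10, cv=5, random_state=42,
)
search.fit(X_train, y_train)

accuracy = search.score(X_test, y_test)   # mean accuracy on the held-out test set
print("Best params:", search.best_params_)
print("Test accuracy:", accuracy)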
2.2.1 Accuracy score – Randomized search cv using RFC
Randomized Search CV using Decision Tree Classifier:
It returns the accuracy of the model on the test set, which is stored in the variable accuracy; the value is 0.98.
2.2.8 ROC Curve – Randomized search cv using DTC
2.2.11 Classification report– Randomized search cv using NB
2.2.14 Confusion matrix – Grid search cv using DTC
2.2.17 Accuracy score – Grid search cv using NB
2.2.21 Performance metrics of all models
Hyperparameter Optimization: the use of RandomizedSearchCV indicates that the model's hyperparameters have been optimized. This process helps find the best configuration of the Random Forest algorithm for this specific dataset, potentially improving its performance over a standard Random Forest.
Feature Importance Visualization:
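A sketch of how such a plot could be produced from the tuned model above; feature names are taken from the DataFrame columns when available, otherwise generic labels are used:

import matplotlib.pyplot as plt
import pandas as pd

best_rf = search.best_estimator_    # tuned Random Forest from the search above
names = getattr(X_train, "columns",
                [f"feature_{i}" for i in range(X_train.shape[1])])

importances = pd.Series(best_rf.feature_importances_, index=list(names))
importances.sort_values().plot(kind="barh", figsize=(8, 6))
plt.title("Random Forest feature importances")
plt.tight_layout()
plt.show()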