Data Science Lab Assignment: Credit Risk Prediction Using Random Forest Name: Vemula Yaminee Jyothsna Roll No: 20BM6JP44
Introduction:
The given dataset is a mixture of categorical and numeric data types. The raw dataset has 20
columns, of which 7 are numerical and 13 are categorical. After converting the categorical
columns to numerical ones, the dataset has 24 columns. The output consists of two classes,
indicating whether a customer is good or bad. We train a RandomForestClassifier on the
training data and predict on the test data. The dataset is split in an 80:20 ratio.
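The preprocessing described above might look like the following sketch. The column names and values are illustrative stand-ins, not the actual dataset:

```python
# Sketch of the preprocessing: one-hot encode categoricals, then 80:20 split.
# The toy DataFrame below stands in for the real credit dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "duration": [6, 48, 12, 42, 24, 36],            # numeric column (illustrative)
    "credit_amount": [1169, 5951, 2096, 7882, 4870, 9055],
    "housing": ["own", "own", "rent", "free", "own", "rent"],   # categorical
    "purpose": ["radio", "car", "education", "car", "radio", "education"],
    "label": ["good", "bad", "good", "good", "bad", "good"],
})

# One-hot encode the categorical columns so every feature is numeric.
X = pd.get_dummies(df.drop(columns="label"))
y = (df["label"] == "good").astype(int)

# 80:20 train/test split, as used in the assignment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

On the real data this conversion is what grows the 20 raw columns to 24 numeric ones.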
Algorithm Description:
RandomForestClassifier:
This is a supervised machine learning algorithm used for both classification and regression.
It builds multiple decision trees based on the given parameters and produces the final output
through voting: the class predicted by the majority of the trees is returned as the
forest's prediction.
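A minimal sketch of this voting behaviour, using synthetic data in place of the credit dataset:

```python
# Fit a forest with default parameters, then compare the forest's prediction
# with a hard majority vote over its individual trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 24 features, like the encoded credit data.
X, y = make_classification(n_samples=200, n_features=24, random_state=0)

clf = RandomForestClassifier(random_state=0)   # n_estimators=100 by default
clf.fit(X, y)

# Each tree predicts a class; the majority class is (usually) the forest output.
votes = np.stack([tree.predict(X[:1]) for tree in clf.estimators_])
majority = int(votes.mean() >= 0.5)
print(clf.predict(X[:1])[0], majority)
```

(Strictly, scikit-learn averages the trees' class probabilities rather than counting hard votes, but for most inputs the two agree.)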
We use scikit-learn's RandomForestClassifier(), which exposes many tunable parameters.
Setup-2
For each parameter, we keep the value that yields the highest accuracy score.
We tune one parameter at a time, leaving the other parameters at their default values.
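This one-parameter-at-a-time tuning can be sketched as a loop over candidate values (here for max_depth; the candidate values and the synthetic data are illustrative):

```python
# Sketch of Setup-2: vary one parameter, keep the rest at their defaults,
# and retain the value with the best test accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=24, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

best_depth, best_acc = None, 0.0
for depth in [3, 5, 8, 10, None]:              # candidate values (illustrative)
    clf = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    acc = clf.score(X_te, y_te)                # accuracy on the held-out 20%
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(best_depth, best_acc)
```

The same loop is repeated for each parameter of interest (n_estimators, max_features, min_samples_leaf, ...).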
Setup-3
From Setup-2 we take the parameter values that gave the highest accuracy scores and train
the model with all of them together.
Setup-4
We use random parameter values with 10-fold cross-validation, where 9 parts are used as the
train set and 1 part as the test set; this is repeated 10 times.
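A sketch of this cross-validation using scikit-learn's cross_val_score (the parameter values 233 and 5 are taken from the results table; the data is synthetic):

```python
# Sketch of Setup-4: 10-fold cross-validation. Each fold trains on 9/10 of
# the data and tests on the remaining 1/10, repeated 10 times.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=24, random_state=0)

clf = RandomForestClassifier(n_estimators=233, max_depth=5, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)     # one accuracy score per fold
print(len(scores), scores.mean())
```

The mean of the 10 fold scores gives a less optimistic accuracy estimate than a single train/test split.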
Setup-5
Here we take the best parameters from Setup-4 and train the model without 10-fold
cross-validation.
Setup-6
We increase the number of estimators and the depth of the trees, keeping the other parameters at their best values from the previous setups.
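This depth sweep can be sketched by comparing train and test accuracy at each depth to spot overfitting (synthetic data again stands in for the credit dataset):

```python
# Sketch of Setup-6: grow deeper trees and watch train vs test accuracy
# diverge (overfitting) until both curves flatten out.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=24, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for depth in [8, 10, 12, 14, 16]:              # depths used in the results table
    clf = RandomForestClassifier(n_estimators=400, max_depth=depth,
                                 random_state=0).fit(X_tr, y_tr)
    print(depth, clf.score(X_tr, y_tr), clf.score(X_te, y_te))
```

Once the train accuracy reaches 1, further increases in depth change neither curve, which is the saturation noted in the observations.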
Evaluation measures:
We considered two evaluation measures:
• Accuracy
• Cost
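The cost measure is not defined in this excerpt; a common convention for credit-scoring data is to weight misclassifying a bad customer as good more heavily than the reverse. The weights 5 and 1 below are assumptions for illustration only:

```python
# Illustrative cost computation from a confusion matrix.
# Assumed weights: a bad customer predicted good costs 5, a good customer
# predicted bad costs 1. These weights are illustrative, not from the assignment.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 1]   # 1 = good customer, 0 = bad
y_pred = [1, 1, 1, 0, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
cost = 5 * fp + 1 * fn              # fp: bad predicted good; fn: good predicted bad
print(accuracy_score(y_true, y_pred), cost)
```

A lower cost is better; unlike accuracy, it penalizes the two error types asymmetrically.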
Results:
Experiment          N_estimators  Max_depth  Max_features  Min_sample_leaf  Test Accuracy  Train Accuracy  Cost
Default parameters  100           None       auto          1                0.755          1               213
Setup-3             325           5          8             3                0.73           0.80            258
Setup-4             233           5          24            1                0.735          0.83            242
Setup-5             233           5          24            1                0.73           0.83            242
Setup-6             400           8          24            1                0.745          0.94            223
Setup-6             400           10         24            1                0.77           0.985           194
Setup-6             400           12         24            1                0.775          0.99            189
Setup-6             400           14         24            1                0.765          1               191
Setup-6             400           16         24            1                0.765          1               191
Observations:
• From the obtained results we see that the change in accuracy is very small, and the
change in cost is also small in all but one scenario.
• The accuracy change between Setup-3 (without cross-validation) and Setup-4 (with
cross-validation) is only 0.005, but there is a significant reduction in cost of 16 units.
• There is not much difference in accuracy and cost in Setup-5, where we used all the best
parameters from Setup-4 and trained the model without cross-validation.
• Parameter tuning changes the accuracy very little because there is a class imbalance in
the dataset and the dataset is also very small.
• We observe that the accuracy is higher and the cost lower for the default parameters,
but this is a case of pure overfitting, as the train accuracy is 1.
• From Setup-6 we observe that as the depth of the trees increases, the model overfits,
and beyond a certain depth the model reaches a saturation point where the accuracy and
cost no longer change on either the train or the test dataset.