Data Science Lab Assignment: Credit Risk Prediction Using Random Forest Name: Vemula Yaminee Jyothsna Roll No: 20BM6JP44
Introduction:
The given dataset is a mixture of categorical and numeric data types. The raw dataset has 20
columns, of which 7 are numerical and 13 are categorical. After converting the categorical
columns to numerical ones, the dataset has 24 columns. The output consists of two classes,
indicating whether a customer is good or bad. We train a RandomForestClassifier on the
training data and predict on the test data. The dataset is split in an 80:20 ratio.
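The preprocessing described above might look like the following sketch. The column names and values are illustrative stand-ins, not the actual dataset:

```python
# Sketch of the preprocessing: one-hot encode categoricals, then 80:20 split.
# The toy DataFrame below stands in for the real credit dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "duration": [6, 48, 12, 42, 24, 36],            # numeric column (illustrative)
    "credit_amount": [1169, 5951, 2096, 7882, 4870, 9055],
    "housing": ["own", "own", "rent", "free", "own", "rent"],   # categorical
    "purpose": ["radio", "car", "education", "car", "radio", "education"],
    "label": ["good", "bad", "good", "good", "bad", "good"],
})

# One-hot encode the categorical columns so every feature is numeric.
X = pd.get_dummies(df.drop(columns="label"))
y = (df["label"] == "good").astype(int)

# 80:20 train/test split, as used in the assignment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

On the real data this conversion is what grows the 20 raw columns to 24 numeric ones.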
Algorithm Description:
RandomForestClassifier:
This is a supervised machine learning algorithm used for both classification and regression.
It builds multiple decision trees based on the given parameters and produces the final output
through voting: the class predicted by the majority of the trees is returned as the
forest's prediction.
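A minimal sketch of this voting behaviour, using synthetic data in place of the credit dataset:

```python
# Fit a forest with default parameters, then compare the forest's prediction
# with a hard majority vote over its individual trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 24 features, like the encoded credit data.
X, y = make_classification(n_samples=200, n_features=24, random_state=0)

clf = RandomForestClassifier(random_state=0)   # n_estimators=100 by default
clf.fit(X, y)

# Each tree predicts a class; the majority class is (usually) the forest output.
votes = np.stack([tree.predict(X[:1]) for tree in clf.estimators_])
majority = int(votes.mean() >= 0.5)
print(clf.predict(X[:1])[0], majority)
```

(Strictly, scikit-learn averages the trees' class probabilities rather than counting hard votes, but for most inputs the two agree.)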
We use scikit-learn's RandomForestClassifier(), which exposes many tunable parameters.
Setup-2
For each parameter, we keep the value that yields the highest accuracy score.
We tune one parameter at a time, leaving the other parameters at their default values.
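This one-parameter-at-a-time tuning can be sketched as a loop over candidate values (here for max_depth; the candidate values and the synthetic data are illustrative):

```python
# Sketch of Setup-2: vary one parameter, keep the rest at their defaults,
# and retain the value with the best test accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=24, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

best_depth, best_acc = None, 0.0
for depth in [3, 5, 8, 10, None]:              # candidate values (illustrative)
    clf = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    acc = clf.score(X_te, y_te)                # accuracy on the held-out 20%
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(best_depth, best_acc)
```

The same loop is repeated for each parameter of interest (n_estimators, max_features, min_samples_leaf, ...).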
Setup-3
From Setup-2 we take the parameter values that gave the highest accuracy scores and train
the model with all of them together.
Setup-4
We use random parameter values with 10-fold cross-validation, where 9 parts are used as the
train set and 1 part as the test set; this is repeated 10 times.
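A sketch of this cross-validation using scikit-learn's cross_val_score (the parameter values 233 and 5 are taken from the results table; the data is synthetic):

```python
# Sketch of Setup-4: 10-fold cross-validation. Each fold trains on 9/10 of
# the data and tests on the remaining 1/10, repeated 10 times.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=24, random_state=0)

clf = RandomForestClassifier(n_estimators=233, max_depth=5, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)     # one accuracy score per fold
print(len(scores), scores.mean())
```

The mean of the 10 fold scores gives a less optimistic accuracy estimate than a single train/test split.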
Setup-5
Here we take the best parameters from Setup-4 and train the model without 10-fold
cross-validation.
Setup-6
We increase the number of estimators and the depth of the trees, keeping the other parameters at their best values from the previous setups.
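This depth sweep can be sketched by comparing train and test accuracy at each depth to spot overfitting (synthetic data again stands in for the credit dataset):

```python
# Sketch of Setup-6: grow deeper trees and watch train vs test accuracy
# diverge (overfitting) until both curves flatten out.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=24, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for depth in [8, 10, 12, 14, 16]:              # depths used in the results table
    clf = RandomForestClassifier(n_estimators=400, max_depth=depth,
                                 random_state=0).fit(X_tr, y_tr)
    print(depth, clf.score(X_tr, y_tr), clf.score(X_te, y_te))
```

Once the train accuracy reaches 1, further increases in depth change neither curve, which is the saturation noted in the observations.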
Evaluation measures:
We considered two evaluation measures:
• Accuracy
• Cost
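The cost measure is not defined in this excerpt; a common convention for credit-scoring data is to weight misclassifying a bad customer as good more heavily than the reverse. The weights 5 and 1 below are assumptions for illustration only:

```python
# Illustrative cost computation from a confusion matrix.
# Assumed weights: a bad customer predicted good costs 5, a good customer
# predicted bad costs 1. These weights are illustrative, not from the assignment.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 1]   # 1 = good customer, 0 = bad
y_pred = [1, 1, 1, 0, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
cost = 5 * fp + 1 * fn              # fp: bad predicted good; fn: good predicted bad
print(accuracy_score(y_true, y_pred), cost)
```

A lower cost is better; unlike accuracy, it penalizes the two error types asymmetrically.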
Results:
Experiment          N_estimators  Max_depth  Max_features  Min_sample_leaf  Test Accuracy  Train Accuracy  Cost
Default parameters  100           None       auto          1                0.755          1               213
Setup-3             325           5          8             3                0.73           0.80            258
Setup-4             233           5          24            1                0.735          0.83            242
Setup-5             233           5          24            1                0.73           0.83            242
Setup-6             400           8          24            1                0.745          0.94            223
Setup-6             400           10         24            1                0.77           0.985           194
Setup-6             400           12         24            1                0.775          0.99            189
Setup-6             400           14         24            1                0.765          1               191
Setup-6             400           16         24            1                0.765          1               191
Observations:
• From the obtained results we see that the change in accuracy is very small, and the
change in cost is also small in all but one scenario.
• The accuracy change between Setup-3 (without cross-validation) and Setup-4 (with
cross-validation) is only 0.005, but there is a significant reduction in cost of 16 units.
• There is not much difference in accuracy and cost in Setup-5, where we used all the best
parameters from Setup-4 and trained the model without cross-validation.
• Parameter tuning changes the accuracy very little because there is a class imbalance in
the dataset and the dataset is also very small.
• We observe that the accuracy is higher and the cost lower for the default parameters,
but this is a case of pure overfitting, as the train accuracy is 1.
• From Setup-6 we observe that as the depth of the trees increases, the model overfits,
and beyond a certain depth the model reaches a saturation point where the accuracy and
cost no longer change on either the train or the test dataset.