Decision Tree Using R

The document discusses building and evaluating a decision tree model to predict loan defaults. It covers preparing training and test data by random sampling, using the C5.0 algorithm to train a decision tree model on the training data, and then evaluating the model's performance on the test data by calculating accuracy and error rates.


Decision Tree

exploring and preparing the data


Data preparation – creating random training and test
datasets
• Usually, when data has been stored in a random order, we can simply divide the
  dataset into two portions by taking the first 90 percent of records for training and
  the remaining 10 percent for testing.
• In contrast, the credit dataset is not randomly ordered, making the prior approach
unwise.
• Suppose that the bank had sorted the data by the loan amount, with the largest
loans at the end of the file.
• If we used the first 90 percent for training and the remaining 10 percent for testing,
we would be training a model on only the small loans and testing the model on the
big loans. Obviously, this could be problematic.
• We'll solve this problem by using a random sample of the credit data for training.
• A random sample is simply a process that selects a subset of records at random.
• In R, the sample() function is used to perform random sampling.
• However, before putting it into action, it is common practice to set a seed value,
  which causes the randomization process to follow a sequence that can be replicated
  later if desired.
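The steps above can be sketched in R as follows; this is a minimal example assuming the full dataset is a data frame named credit with 1,000 rows (the seed value 123 is arbitrary):

```r
# Set a seed so the random sample can be reproduced later
set.seed(123)

# Draw 900 row indices at random from the 1,000 records
train_sample <- sample(1000, 900)

# Split the data: 90 percent for training, the rest for testing
credit_train <- credit[train_sample, ]
credit_test  <- credit[-train_sample, ]
```

Because the indices are drawn at random, both portions should contain a representative mix of small and large loans.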
training a model on the data
• We will use the C5.0 algorithm in the
C50 package to train our decision tree
model.
• For the first iteration of our credit
approval model, we'll use the default
C5.0 configuration, as shown in the
following code.
• The 17th column in credit_train is the
default class variable, so we need to
exclude it from the training data frame,
but supply it as the target factor vector
for classification.
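Under these assumptions (the class variable is named default and sits in column 17 of credit_train), the default-configuration training call might look like:

```r
# install.packages("C50")  # if the package is not yet installed
library(C50)

# Train with the default C5.0 settings: drop column 17 (the class
# variable) from the predictors and pass it separately as the target
credit_model <- C5.0(credit_train[-17], credit_train$default)

# Print basic facts about the tree; summary(credit_model) shows the
# full set of decisions
credit_model
```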
1. If the checking account balance is unknown or greater than 200 DM, then classify as
   "not likely to default."
2. Otherwise, if the checking account balance is less than zero DM or between one and
   200 DM,
3. and the credit history is perfect or very good, then classify as "likely to default."
evaluating model performance
• credit_pred <- predict(credit_model, credit_test)
• This creates a vector of predicted class values, which we can compare
to the actual class values using the CrossTable() function in the
gmodels package.
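A sketch of that comparison, assuming the gmodels package is installed and the actual class labels live in credit_test$default; the extra arguments simply suppress proportions that are not needed here:

```r
library(gmodels)

# Cross-tabulate actual vs. predicted classes in a confusion matrix
CrossTable(credit_test$default, credit_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c("actual default", "predicted default"))
```

The diagonal cells of the resulting table count the correct predictions, which is where the accuracy and error rates below come from.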
Results
• Out of the 100 test loan application records, our model correctly predicted that
  59 did not default and 14 did default, resulting in an accuracy of 73 percent and
  an error rate of 27 percent.

• Also note that the model only correctly predicted 14 of the 33 actual loan defaults
  in the test data, or 42 percent.
