The document discusses building and evaluating a decision tree model to predict loan defaults. It covers preparing training and test data by random sampling, using the C5.0 algorithm to train a decision tree model on the training data, and then evaluating the model's performance on the test data by calculating accuracy and error rates.
Decision Tree Using R
Decision Tree
Exploring and preparing the data
Data preparation – creating random training and test datasets

• If the data had been sorted in a random order, we could simply divide the dataset into two portions, taking the first 90 percent of records for training and the remaining 10 percent for testing.
• In contrast, the credit dataset is not randomly ordered, making the prior approach unwise.
• Suppose that the bank had sorted the data by the loan amount, with the largest loans at the end of the file.
• If we used the first 90 percent for training and the remaining 10 percent for testing, we would be training a model on only the small loans and testing it on the big loans. Obviously, this could be problematic.
• We'll solve this problem by using a random sample of the credit data for training.
• A random sample is simply a subset of records selected at random.
• In R, the sample() function is used to perform random sampling.
• However, before putting it into action, a common practice is to set a seed value, which causes the randomization process to follow a sequence that can be replicated later on if desired.

Training a model on the data

• We will use the C5.0 algorithm in the C50 package to train our decision tree model.
• For the first iteration of our credit approval model, we'll use the default C5.0 configuration.
• The 17th column in credit_train is the default class variable, so we need to exclude it from the training data frame, but supply it as the target factor vector for classification.

The first branches of the resulting tree can be read in plain language:

1. If the checking account balance is unknown or greater than 200 DM, then classify as "not likely to default."
2. Otherwise, if the checking account balance is less than zero DM or between one and 200 DM,
3. and the credit history is perfect or very good, then classify as "likely to default."
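The split-and-train steps above can be sketched in R as follows. This is a minimal illustration, not the document's exact code: the object names credit, credit_train, credit_test, and credit_model, the seed value, and a 1,000-record dataset split 900/100 are assumptions inferred from the surrounding slides.

```r
# Assumes 'credit' is a data frame of 1,000 loan records whose 17th
# column, 'default', is the class factor (assumption from the text).
library(C50)

set.seed(123)                      # make the random split reproducible
train_sample <- sample(1000, 900)  # 900 row indices chosen at random

credit_train <- credit[train_sample, ]   # 90 percent for training
credit_test  <- credit[-train_sample, ]  # remaining 10 percent for testing

# Exclude column 17 (the class variable) from the predictors and
# supply it separately as the target factor vector.
credit_model <- C5.0(credit_train[-17], credit_train$default)

summary(credit_model)  # displays the tree's decision rules
```

Setting the seed before calling sample() is what makes the "random" split repeatable across sessions, which matters when comparing model iterations.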
Evaluating model performance

• To apply the decision tree to the test dataset, we use the predict() function:
  credit_pred <- predict(credit_model, credit_test)
• This creates a vector of predicted class values, which we can compare to the actual class values using the CrossTable() function in the gmodels package.

Results

• Out of the 100 test loan application records, our model correctly predicted that 59 did not default and 14 did default, resulting in an accuracy of 73 percent and an error rate of 27 percent.
• Also note that the model correctly predicted only 14 of the 33 actual loan defaults in the test data, or 42 percent.
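The evaluation step described above can be sketched as follows, assuming the credit_model and credit_test objects from the training slides and a class column named default (names are assumptions, not confirmed by the source):

```r
library(gmodels)

# Generate predicted classes for the held-out test records.
credit_pred <- predict(credit_model, credit_test)

# Cross-tabulate actual vs. predicted classes; the diagonal cells
# count correct predictions, the off-diagonal cells count errors.
# The prop.* = FALSE options suppress the chi-square contribution
# and the column/row proportions to keep the table readable.
CrossTable(credit_test$default, credit_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c("actual default", "predicted default"))
```

Accuracy is then the sum of the diagonal cells divided by the total number of test records; with 59 + 14 correct out of 100, that gives the 73 percent figure quoted above.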