Paper 5
Submitted by
SOURAV KUMAR
(1613101750)
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
ABSTRACT
1. INTRODUCTION
1.1 GENERAL
1.2 MACHINE LEARNING
1.2.1 General
1.2.2 Machine Learning Types
1.2.2.1 General
1.2.2.2 Supervised
1.2.2.3 Semi-Supervised
1.2.2.4 Unsupervised
1.2.2.5 Reinforcement
1.3 DECISION TREE
2. LITERATURE REVIEW
2.1 GENERAL
3. IMPLEMENTATION OF MODEL
3.1 Existing System
3.2 Proposed System
3.2.1 Datasets
3.3 Implementation
3.3.1 Data Exploratory Analysis
3.3.2 Decision Tree Algorithm
3.3.3 Processes for Prediction
3.3.4 Decision Tree for Model
3.3.5 Source Code
4. RESULTS
5. CONCLUSION
5.1 Future Scope
5.2 Conclusion
6. REFERENCES
ABSTRACT
When a financial institution lends money to a person, there is always a high risk involved. Today, data in banks is growing at a rapid pace, so bankers need to evaluate a person's data before granting a loan, and evaluating that data manually can be a big headache. This problem is solved by analyzing and training the data using one of the machine learning algorithms. For this, we have generated a model that predicts whether a person will get the loan or not. The primary objective of this paper is to check whether a person can get a loan or not by evaluating the data with the help of a decision tree classifier, which gives an accurate result for the prediction.
1. INTRODUCTION
1.1 General
1.2 Machine Learning
1.2.1 General
Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies that one would have ever come across. As is evident from the name, it gives the computer something that makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect.
The term Machine Learning was coined in 1959 by Arthur Samuel, an American pioneer in the field of computer gaming and artificial intelligence, who stated that it “gives computers the ability to learn without being explicitly programmed”.
In 1997, Tom Mitchell gave a “well-posed” mathematical and relational definition: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”
Machine learning involves training a computer using a given data set and using this training to predict the properties of new data. For example, we can train a computer by feeding it 1000 images of cats and 1000 more images which are not of a cat, telling the computer each time whether the picture is a cat or not. If we then show the computer a new image, from the above training the computer should be able to tell whether this new image is a cat or not.
Let’s try to understand Machine Learning in layman’s terms. Consider that you are trying to throw a ball at a target. After the first attempt, you realize that you have put too much force into it. After the second attempt, you realize you are closer to the target but you need to increase your throw angle. What is happening here is that after every throw we learn something and improve the end result. We are programmed to learn from our experience.
This implies that defining machine learning in terms of the tasks it is concerned with offers a fundamentally operational definition rather than defining the field in cognitive terms.
This follows Alan Turing’s proposal in his paper “Computing Machinery and
Intelligence”, in which the question “Can machines think?” is replaced with the question
“Can machines do what we (as thinking entities) can do?”
Within the field of data analytics, machine learning is used to devise complex models
and algorithms that lend themselves to prediction; in commercial use, this is known as
predictive analytics. These analytical models allow researchers, data scientists,
engineers, and analysts to “produce reliable, repeatable decisions and results” and
uncover “hidden insights” through learning from historical relationships and trends in
the data set (input).
Suppose that you decide to check out that offer for a vacation. You browse through the
travel agency website and search for a hotel. When you look at a specific hotel, just
below the hotel description there is a section titled “You might also like these hotels”.
This is a common use case of Machine Learning called “Recommendation Engine”.
Again, many data points were used to train a model in order to predict what will be the
best hotels to show you under that section, based on a lot of information they already
know about you.
So if you want your program to predict, for example, traffic patterns at a busy
intersection (task T), you can run it through a machine learning algorithm with data
about past traffic patterns (experience E) and, if it has successfully “learned”, it will then
do better at predicting future traffic patterns (performance measure P).
The highly complex nature of many real-world problems, though, often means that
inventing specialized algorithms that will solve them perfectly every time is impractical,
if not impossible. Examples of machine learning problems include “Is this cancer?”, “Which of these people are good friends with each other?”, and “Will this person like this movie?”. Such problems are excellent targets for machine learning, and in fact machine learning has been applied to such problems with great success.
There are some variations of how to define the types of Machine Learning algorithms, but they can commonly be divided into categories according to their purpose, and the main categories are the following:
• Supervised learning
• Unsupervised Learning
• Semi-supervised Learning
• Reinforcement Learning
1.2.2.2 Supervised Learning
• Here the human expert acts as the teacher: we feed the computer training data containing the inputs/predictors, we show it the correct answers (outputs), and from this data the computer should be able to learn the patterns (a short example is given after the list of algorithms below).
• Predictive Model
• The main types of supervised learning problems include regression and classification problems.
• Nearest Neighbour
• Naive Bayes
• Decision Trees
• Linear Regression
• Neural Networks
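As a minimal illustration (not taken from this report), the sketch below shows supervised classification with a decision tree from scikit-learn on a tiny made-up labelled dataset; the feature encoding and variable names are assumptions for this example only.
from sklearn.tree import DecisionTreeClassifier
# Toy labelled data: each row is [colour_code, diameter]; the labels are the "correct answers".
X = [[0, 3], [1, 3], [2, 1], [2, 1], [1, 3]]        # inputs/predictors
y = ['Apple', 'Apple', 'Grape', 'Grape', 'Lemon']   # outputs shown by the "teacher"
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)                    # learn patterns from the labelled examples
print(model.predict([[0, 3]]))     # predict the label of a new, unseen example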
1.2.2.4 Unsupervised Learning
• Here there is no teacher at all; in fact, the computer might be able to teach you new things after it learns patterns in the data. These algorithms are particularly useful in cases where the human expert does not know what to look for in the data.
• Unsupervised learning algorithms are the family of machine learning algorithms mainly used in pattern detection and descriptive modelling. There are no output categories or labels here based on which the algorithm could model relationships. Instead, these algorithms try to use techniques on the input data to mine for rules, detect patterns, and summarize and group the data points, which helps in deriving meaningful insights and describing the data better to the users (an illustrative sketch is given after this list).
Fig 4: Unsupervised Learning
• Descriptive Model
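As an illustration (again, not from the report), descriptive modelling can be sketched with k-means clustering from scikit-learn; the data points and parameter values below are made up for the example.
import numpy as np
from sklearn.cluster import KMeans
# Unlabelled data: no correct answers are provided, only the inputs.
X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.2, 8.8], [0.9, 2.1]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # groups discovered purely from the structure of the data
print(kmeans.cluster_centers_)  # a compact description of each group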
1.2.2.5 Reinforcement Learning
Reinforcement learning aims at using observations gathered from the interaction with the environment to take actions that would maximize the reward or minimize the risk. The reinforcement learning algorithm (called the agent) continuously learns from the environment in an iterative fashion. In the process, the agent learns from its experiences of the environment until it has explored the full range of possible states.
In order to produce intelligent programs (also called agents), reinforcement learning goes
through the following steps:
3. After the action is performed, the agent receives reward or reinforcement from the
environment.
• Q-Learning
Use cases:
Some applications of the reinforcement learning algorithms are computer played board
games (Chess, Go), robotic hands, and self-driving cars.
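Q-Learning, listed above, learns an action-value function from exactly this kind of reward feedback. Its standard update rule, with learning rate \(\alpha\), discount factor \(\gamma\), reward \(r\), and states \(s, s'\) before and after an action \(a\), is:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$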
1.3 DECISION TREE
Decision tree is one of the predictive modelling approaches used in statistics, data
mining and machine learning.
Decision trees are constructed via an algorithmic approach that identifies ways to split a
data set based on different conditions. It is one of the most widely used and practical
methods for supervised learning. Decision Trees are a non-parametric supervised
learning method used for both classification and regression tasks.
Tree models where the target variable can take a discrete set of values are called classification trees. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Classification and Regression Tree (CART) is a general term for both.
Data Format
The dependent variable, Y, is the target variable that we are trying to understand, classify
or generalize. The vector x is composed of the features, x1, x2, x3 etc., that are used for
that task.
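In this notation, each training record can be written as the pair
$$(\mathbf{x}, Y) = (x_1, x_2, \ldots, x_k, Y),$$
where the features \(x_1, \ldots, x_k\) are the inputs and \(Y\) is the value to be predicted.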
Example
training_data = [
['Green', 3, 'Apple'],
['Yellow', 3, 'Apple'],
['Red', 1, 'Grape'],
['Red', 1, 'Grape'],
['Yellow', 3, 'Lemon'],
]
# Header = ["Color", "diameter", "Label"]
# The last column is the label.
# The first two columns are features.
In a decision tree the major challenge is the identification of the attribute for the root node at each level. This process is known as attribute selection. We have two popular attribute selection measures:
1. Information Gain
2. Gini Index
1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller
subsets the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, S_v is the subset of S with A = v, and Values(A) is the set of all possible values of A; then
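The formula implied by this definition (the standard information-gain expression) is:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$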
Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content.
Definition: Suppose S is a set of instances, A is an attribute, S_v is the subset of S with A = v, and Values(A) is the set of all possible values of A; then
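Writing \(p_i\) for the proportion of examples in S that belong to class i, the standard entropy expression is:
$$\mathrm{Entropy}(S) = \sum_{i} -\,p_i \log_2 p_i$$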
2. Gini Index
• Gini Index is a metric to measure how often a randomly chosen element would be
incorrectly identified.
• It means an attribute with a lower Gini index should be preferred.
• Sklearn supports the “gini” criterion for the Gini Index and, by default, it takes the “gini” value.
• The formula for the calculation of the Gini Index is given below.
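With \(p_i\) denoting the probability of an element being classified into class i, the standard Gini index expression is:
$$\mathrm{Gini} = 1 - \sum_{i} p_i^{2}$$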
• Get the list of rows (dataset) which are taken into consideration for making the decision tree (recursively at each node).
• Calculate the uncertainty of our dataset, i.e. its Gini impurity, or how mixed up our data is.
• Partition the rows into True rows and False rows based on each question asked.
• Calculate the information gain based on the Gini impurity and the partition of data from the previous step.
• Divide the node on the best question and repeat from step 1 until we get pure nodes (leaf nodes). A minimal sketch of these helpers is given below.
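A minimal sketch of the Gini impurity and information gain helpers used in these steps (the helper names are illustrative; the build_tree and print_tree functions used below are assumed to be built on the same idea):
def gini(rows):
    """Gini impurity of a list of rows; the label is in the last column."""
    counts = {}
    for row in rows:
        label = row[-1]
        counts[label] = counts.get(label, 0) + 1
    impurity = 1.0
    for label in counts:
        prob = counts[label] / float(len(rows))
        impurity -= prob ** 2
    return impurity

def info_gain(true_rows, false_rows, current_uncertainty):
    """Uncertainty of the parent node minus the weighted impurity of the two children."""
    p = float(len(true_rows)) / (len(true_rows) + len(false_rows))
    return current_uncertainty - p * gini(true_rows) - (1 - p) * gini(false_rows)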
my_tree = build_tree(training_data)
print_tree(my_tree)
Output
Is diameter >= 3?
--> True:
Is color == Yellow?
--> True:
Predict {'Lemon': 1, 'Apple': 1}
--> False:
Predict {'Apple': 1}
--> False:
Predict {'Grape': 2}
From the above output we can see that at each step the data is divided into True and False rows. This process keeps repeating until we reach a leaf node, where the information gain is 0 and a further split of the data is not possible because the nodes are pure.
Pruning should reduce the size of a learning tree without reducing predictive accuracy as
measured by cross-validation set. There are 2 major Pruning techniques.
• Minimum Error: The tree is pruned back to the point where the cross-validated error
is a minimum.
• Smallest Tree: The tree is pruned back slightly further than the minimum error.
Technically the pruning creates a decision tree with cross-validation error within 1
standard error of the minimum error.
An alternative method to prevent overfitting is to try and stop the tree-building process
early, before it produces leaves with very small samples. This heuristic is known as early
stopping but is also sometimes known as pre-pruning decision trees.
At each stage of splitting the tree, we check the cross-validation error. If the error does not decrease significantly enough, then we stop. Early stopping may underfit by stopping too early: the current split may be of little benefit, but having made it, subsequent splits may reduce the error more significantly.
Early stopping and pruning can be used together, separately, or not at all. Post pruning
decision trees is more mathematically rigorous, finding a tree at least as good as early
stopping. Early stopping is a quick fix heuristic. If used together with pruning, early
stopping may save time. After all, why build a tree only to prune it back again?
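As an illustration (not from the report), both ideas are available in scikit-learn's DecisionTreeClassifier: early stopping corresponds to growth limits such as max_depth or min_samples_leaf, while post-pruning is exposed through cost-complexity pruning (ccp_alpha). The iris dataset is used here purely as a placeholder.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Early stopping (pre-pruning): limit depth / leaf size while the tree is grown.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)
# Post-pruning: grow the full tree, then prune it back with cost-complexity pruning.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[len(path.ccp_alphas) // 2],
                                     random_state=0).fit(X, y)
print(full_tree.get_n_leaves(), pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())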
Important Terminology related to Decision Trees
1. Root Node: It represents the entire population or sample and this further gets divided
into two or more homogeneous sets.
2. Splitting: It is a process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, then it is called the
decision node.
4. Leaf / Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.
5. Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say it is the opposite process of splitting.
6. Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.
7. Parent and Child Node: A node, which is divided into sub-nodes is called a parent
node of sub-nodes whereas sub-nodes are the child of a parent node.
A notable disadvantage of decision trees is that they are prone to overfitting.
3. IMPLEMENTATION OF MODEL
3.1 Existing System
Banks need to analyze whether a person who applies for a loan will repay the loan or not. Sometimes a customer provides only partial data to the bank; in this case the person may get the loan without proper verification and the bank may end up with a loss. Bankers cannot analyze huge amounts of data manually, and it can become a big headache to check whether a person will repay their loan or not. It is very necessary to know whether the money being lent is going into safe hands. So, it is important to have an automated model which predicts whether the customer getting the loan will repay it or not.
3.2 Proposed System
I have developed a prediction model for loan sanctioning which predicts whether the person applying for a loan will get the loan or not. The major objective of this project is to derive patterns from the datasets used for the loan sanctioning process and to create a model based on the patterns derived in the previous step. This model is developed using one of the machine learning algorithms.
4) In the last step, pruning is applied to avoid overfitting by removing those sections of the tree which have little classification power, and to determine the optimum size of the tree.
3.3.3 Processes for Loan Prediction:
Y = balance_data.values[:, 0]
X_train, X_test, Y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
The Gini Index is the splitting criterion for the tree, and it selects the address attribute as the root node.
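A sketch of how this step could look in code, assuming balance_data holds the loan dataset with the target in the first column and the features in the remaining columns (everything not shown in the fragment above is an assumption):
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

X = balance_data.values[:, 1:]   # assumed: the remaining columns are the features
Y = balance_data.values[:, 0]
X_train, X_test, Y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100)   # Gini as splitting criterion
clf_gini.fit(X_train, Y_train)
print(metrics.accuracy_score(y_test, clf_gini.predict(X_test)))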
3.3.4 Generated Decision Tree for Model
3.3.5 Source Code
Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
plt.style.use("seaborn")
#for machine learning
import statsmodels.formula.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
Load DataSets
bankloans = pd.read_csv("Data/bankloans.csv")
bankloans.head()
bankloans.columns
#number of observations and features
bankloans.shape
#data types in the dataframe
bankloans.info()
#check for any column has missing values
bankloans.isnull().any()
#check for number of missing values
bankloans.isnull().sum()
#Segregating the numeric and categorical variable names
numeric_var_names = [key for key in dict(bankloans.dtypes) if dict(bankloans.dtypes)[key] in ['float64', 'int64', 'float32', 'int32']]
catgorical_var_names = [key for key in dict(bankloans.dtypes) if dict(bankloans.dtypes)[key] in ['object']]
numeric_var_names
#splitting the data set into two sets - existing customers and new customers
bankloans_existing = bankloans.loc[bankloans.default.isnull() == 0]
bankloans_new = bankloans.loc[bankloans.default.isnull() == 1]
bankloans_existing.describe(percentiles=[.25,0.5,0.75,0.90,0.95])
sns.boxplot(y = "age",data=bankloans_existing)
plt.title("Box-Plot of age")
plt.show()
sns.boxplot(y = "employ",data=bankloans_existing)
plt.title("Box-Plot of employee tenure")
plt.show()
sns.boxplot(y = "income",data=bankloans_existing)
plt.title("Box-Plot of employee income")
plt.show()
sns.boxplot(y = "debtinc",data=bankloans_existing)
plt.title("Box-Plot of employee debt to income ratio")
plt.show()
sns.boxplot(y = "creddebt",data=bankloans_existing)
plt.title("Box-Plot of Credit to debit ratio")
plt.show()
income_minlimit = bankloans_existing["income"].quantile(0.75) + 1.5 * (bankloans_existing["income"].quantile(0.75) - bankloans_existing["income"].quantile(0.25))
income_minlimit
def outlier_capping(x):
    """A function to remove and replace the outliers for numerical columns"""
    x = x.clip(upper=x.quantile(0.95))   # cap values at the 95th percentile
    return x
#outlier treatment
bankloans_existing = bankloans_existing.apply(lambda x: outlier_capping(x))
##Correlation Matrix
bankloans_existing.corr()
#Visualize the correlation using seaborn heatmap
sns.heatmap(bankloans_existing.corr(),annot=True,fmt="0.2f",cmap="coolwarm")
plt.show()
bankloans_existing.shape
bankloans_new.shape
#Indicator variable unique types
bankloans_existing['default'].value_counts()
bankloans_existing['default'].value_counts().plot.bar()
plt.xlabel("default")
plt.ylabel("count")
plt.title("Distribution of default")
plt.show()
#percentage of unique types in indicator variable
round(bankloans_existing['default'].value_counts()/bankloans_existing.shape[0] * 100,3)
## performing the independent t test on numerical variables
tstats_df = pd.DataFrame()
for eachvariable in numeric_var_names:
    tstats = stats.ttest_ind(bankloans_existing.loc[bankloans_existing["default"] == 1, eachvariable],
                             bankloans_existing.loc[bankloans_existing["default"] == 0, eachvariable],
                             equal_var=False)
    temp = pd.DataFrame([eachvariable, tstats[0], tstats[1]]).T
    temp.columns = ['Variable Name', 'T-Statistic', 'P-Value']
    tstats_df = pd.concat([tstats_df, temp], axis=0, ignore_index=True)
tstats_df = tstats_df.sort_values(by="P-Value").reset_index(drop=True)
tstats_df
def BivariateAnalysisPlot(segment_by):
    """A function to analyze the impact of features on the target variable"""
    fig, ax = plt.subplots(ncols=1, figsize=(10, 8))
    #boxplot
    sns.boxplot(x='default', y=segment_by, data=bankloans_existing)
    plt.title("Box plot of " + segment_by)
    plt.show()
BivariateAnalysisPlot("age")
BivariateAnalysisPlot("ed")
BivariateAnalysisPlot("employ")
BivariateAnalysisPlot("address")
BivariateAnalysisPlot("income")
BivariateAnalysisPlot("debtinc")
BivariateAnalysisPlot("creddebt")
BivariateAnalysisPlot("othdebt")
#Decision tree Classifier
#make a pipeline for decision tree model
pipelines = {
"dtclass": make_pipeline(DecisionTreeClassifier(random_state=100))
}
#To check the accuracy of the pipeline
scores = cross_validate(pipelines['dtclass'],train_X,train_y,return_train_score=True)
scores['test_score'].mean()
#list of tunable hyper parameters for decision tree classifier pipeline
pipelines['dtclass'].get_params().keys()
decisiontree_hyperparameters = {
'decisiontreeclassifier__max_depth' : np.arange(3, 10),
'decisiontreeclassifier__max_features' : np.arange(3, 8),
'decisiontreeclassifier__min_samples_split' : np.arange(2, 15),
"decisiontreeclassifier__min_samples_leaf" : np.arange(1,3)
}
#Create a cross validation object from decision tree classifier and it's hyperparameters
dtclass_model = GridSearchCV(pipelines['dtclass'], decisiontree_hyperparameters, cv=5, n_jobs=-1)
#fit the model
dtclass_model.fit(train_X, train_y)
#display the best parameters for decision tree model
dtclass_model.best_params_
#best score for the model
dtclass_model.best_score_
#In Pipeline we can use the string names to get the decisiontreeclassifer
dtclass_best_model = dtclass_model.best_estimator_.named_steps['decisiontreeclassifier']
dtclass_best_model
#Predicting the test cases
bankloans_test_pred_dtclass = pd.DataFrame({'actual': test_y,
                                            'predicted': dtclass_best_model.predict(test_X)})
bankloans_test_pred_dtclass = bankloans_test_pred_dtclass.reset_index()
bankloans_test_pred_dtclass.head()
#creating a confusion matrix
cm_dtclass = metrics.confusion_matrix(bankloans_test_pred_dtclass.actual,
                                      bankloans_test_pred_dtclass.predicted, labels=[1, 0])
cm_dtclass
sns.heatmap(cm_dtclass, annot=True, fmt=".2f", cmap="Greens", linewidths=.5, linecolor="red",
            xticklabels=["Default", "Not Default"], yticklabels=["Default", "Not Default"])
plt.title("Confusion Matrix for Test data")
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
plt.show()
#probabilty of prediction
predict_prob_df = pd.DataFrame(dtclass_best_model.predict_proba(test_X))
predict_prob_df.head()
bankloans_test_pred_dtclass = pd.concat([bankloans_test_pred_dtclass, predict_prob_df], axis=1)
bankloans_test_pred_dtclass.columns = ['index', 'actual', 'predicted', 'default_0', 'default_1']
bankloans_test_pred_dtclass.head()
#find the auc score
auc_score = metrics.roc_auc_score(bankloans_test_pred_dtclass.actual,
                                  bankloans_test_pred_dtclass.default_1)
round(auc_score,4)
#plotting the roc curve
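The ROC plotting code itself is not included in the listing; a sketch consistent with the variables defined above could look like this:
fpr, tpr, thresholds = metrics.roc_curve(bankloans_test_pred_dtclass.actual,
                                         bankloans_test_pred_dtclass.default_1)
plt.plot(fpr, tpr, label="Decision Tree (AUC = %0.4f)" % auc_score)
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Test data")
plt.legend()
plt.show()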
4. RESULTS
The Confusion Matrix (CM) is used to analyze and determine the performance of the proposed loan prediction model.
• True Positive (TP): both the actual and the predicted values are positive (1).
• True Negative (TN): both the actual and the predicted values are negative (0).
• False Positive (FP): the actual value is negative (0) and the predicted value is positive (1).
• False Negative (FN): the actual value is positive (1) and the predicted value is negative (0).
The confusion matrix is given for the test data after the model has predicted the loan outcomes. It shows how the model performs on the test dataset once the data has been passed through the model. The precision score on the test data is 0.591.
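A short sketch of how the reported precision (and the corresponding recall) could be computed from the predictions above:
precision = metrics.precision_score(bankloans_test_pred_dtclass.actual,
                                    bankloans_test_pred_dtclass.predicted)
recall = metrics.recall_score(bankloans_test_pred_dtclass.actual,
                              bankloans_test_pred_dtclass.predicted)
print(round(precision, 3), round(recall, 3))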
ROC curves summarize the trade-off between the true positive rate and the false positive rate for a predictive model as its decision threshold is varied.
5. CONCLUSION
5.1 Future Scope
In the future, this model can be used to compare the prediction models generated by various machine learning algorithms, and the model which gives the highest accuracy will be chosen as the prediction model.
5.2 Conclusion
After this work, we are able to conclude that the decision tree model is extremely efficient and gives a good result. We have developed a model which can easily predict whether a person will repay their loan or not. We can see that our model has reduced the efforts of bankers, and machine learning has helped a lot in automating this process.