Data Mining Approach
Research Article
Research on bank credit default prediction based on data mining algorithm
Li Ying
School of Business Administration, China University of Petroleum-Beijing, Beijing, 102249, China
ABSTRACT: Identifying potential risk among a bank's loan customers is of great importance. Classifying loan
customers with data mining classification algorithms is an effective way to do so. In this paper, we use Random
Forest, Logistic Regression, SVM and other suitable classification algorithms, implemented in Python, to study
and analyze a bank credit data set, and compare the models on five evaluation statistics: Accuracy,
Recall, Precision, F1-score and ROC area. The classification algorithms are used to identify risky
customers among a large number of customers, providing an effective basis for the bank's loan approval.
Key words: Bank credit, Risk prediction, Data mining, Classification algorithm, Python
INTRODUCTION
In today's information and digital age, bank credit defaults are still frequent. Establishing an effective model that predicts whether a bank customer will default on a loan, so that risk can be recognized among a mass of loan applicants, is therefore of great significance. At present, many scholars at home and abroad have studied probability prediction models of bank credit default and put forward forecasting methods, most of which have their own restrictions and defects. Beaver proposed a single-factor method based on financial ratios to analyze the technical credit default prediction of enterprises[1]. Pompe et al. constructed a default prediction model using multivariate discriminant analysis (MDA)[2]. Yang et al. used Logistic Regression to establish a probability prediction model of listed companies' credit default, and identified the most influential corporate financial indicators[3]. Zhang et al. proposed an SVM model, which constructed the technical credit default prediction model of small and medium-sized enterprises by constructing the

Data sample collection and preprocessing
This paper uses the bank credit data set loan_model_sample from Kaggle as the target data set for the study[6]. It contains 11017 samples and 199 attribute features. After data collection, the data are inspected and pre-processed. The overall framework of data preprocessing is shown in Figure 1.

[Figure 1: flowchart of data preprocessing. Recoverable steps: Loan Sample Data; Split train and test; Missing value processing; decision "Missing value scale < 20%?" (Yes / No) leading to "Mean value interpolation" or "Delete the features".]
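
The missing-value step of Figure 1 can be illustrated with a minimal Python sketch. It is not the paper's code. The extracted figure does not preserve which branch belongs to "Yes", so the sketch assumes that features with a missing rate below 20% are filled with the mean and all others are deleted. It also assumes that the statistics are computed on the training split only, and the function names, `test_ratio` and `seed` are hypothetical. Only the standard library is used so that the sketch runs as is.

```python
import random
from statistics import mean

MISSING_THRESHOLD = 0.20  # from Figure 1: "Missing value scale < 20%?"


def split_train_test(rows, test_ratio=0.3, seed=0):
    """Shuffle a copy of the rows and split it (ratio and seed are assumed values)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def fit_missing_value_plan(train_rows, features):
    """Decide per feature, from the training rows only, whether to fill or drop it."""
    fill_values, dropped = {}, []
    for name in features:
        observed = [row[name] for row in train_rows if row[name] is not None]
        missing_rate = (len(train_rows) - len(observed)) / len(train_rows)
        if missing_rate < MISSING_THRESHOLD:
            # Assumed "Yes" branch: keep the feature, fill gaps with the mean.
            fill_values[name] = mean(observed)
        else:
            # Assumed "No" branch: delete the feature.
            dropped.append(name)
    return fill_values, dropped


def apply_missing_value_plan(rows, fill_values, dropped):
    """Return new rows with dropped features removed and gaps mean-filled."""
    cleaned = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in dropped}
        for name, fill in fill_values.items():
            if new_row[name] is None:
                new_row[name] = fill
        cleaned.append(new_row)
    return cleaned
```

In use, the plan would be fitted once on the training rows and then applied unchanged to both the training and the test rows, so that no information from the test split enters the fill values.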
4822 The International Journal of Social Sciences and Humanities Invention, vol. 5, Issue 06, Jun, 2018
[2] Pompe P P M, Bilderbeek J. The prediction of bankruptcy of small- and medium-sized industrial firms[J]. Journal of Business Venturing, 2005, 20.
[3] Yang Pengbo, Zhang Chenghu, Zhang Xiang. Prediction model of credit default probability of listed companies based on Logistic regression analysis[J]. Economic Latitude and Longitude, 2009(02): 144-148.
[4] Zhang Jie, Zhao Feng. SME credit default prediction based on support vector machine[J]. Statistics and Decision Making, 2013(20): 66-69.
[5] Mei Mei. Application of data mining classification algorithm in credit card risk management[J]. Modern Computer, 2013(19): 13-16.
[6] https://www.kaggle.com/datasets
[7] Liaw A, Wiener M. Classification and regression by randomForest[J]. R News, 2002, 2(3): 18-22.
[8] Hosmer Jr D W, Lemeshow S, Sturdivant R X. Applied logistic regression[M]. John Wiley & Sons, 2013.
[9] Joachims T. Making large-scale SVM learning practical[R]. Technical report, SFB 475: Komplexitätsreduktion in Multivariaten Datenstrukturen, Universität Dortmund, 1998.
[10] Deng X, Liu Q, Deng Y, et al. An improved method to construct basic probability assignment based on the confusion matrix for classification problem[J]. Information Sciences, 2016, 340: 250-261.
[11] Ohsaki M, Wang P, Matsuda K, et al. Confusion-matrix-based kernel logistic regression for imbalanced data classification[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(9): 1806-1819.
[12] Branco P, Torgo L, Ribeiro R P. A survey of predictive modeling on imbalanced domains[J]. ACM Computing Surveys (CSUR), 2016, 49(2): 31.
[13] Fanshawe T R, Power M, Graziadio S, et al. Interactive visualisation for interpreting diagnostic test accuracy study results[J]. BMJ Evidence-Based Medicine, 2018, 23(1): 13-16.
[14] Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves[C]//Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006: 233-240.
[15] Fawcett T. An introduction to ROC analysis[J]. Pattern Recognition Letters, 2006, 27(8): 861-874.
[16] Bradley A P. The use of the area under the ROC curve in the evaluation of machine learning algorithms[J]. Pattern Recognition, 1997, 30(7): 1145-1159.
[17] Breheny P. Classification and regression trees[J]. 1984.
[18] Hilbe J M. Logistic regression models[M]. CRC Press, 2009.