Data Mining Report
BHASKARACHARYA COLLEGE OF
APPLIED SCIENCES
SEMESTER – 6
Submitted By
2. Preprocessing of the data: I created a clean_data(data) method to process the data. It drops the
non-value-adding columns and performs one-hot encoding via a modular function ONE_HOT_ENCODING(),
which calls pandas.get_dummies() to convert each categorical variable into binary indicator columns and then
removes the original column. I also removed the Gender column and added isFemale (1 if the person is female,
otherwise 0). I also checked the correlations across the dataset and did not find any strong relationships.
A minimal sketch of this preprocessing step is shown below.
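The following is a hedged reconstruction of what this step could look like; the helper names clean_data and ONE_HOT_ENCODING come from the report, the UID and Gender columns from the preprocessing notes below, and everything else is illustrative:

    import pandas as pd

    def ONE_HOT_ENCODING(data, column):
        # Expand one categorical column into binary indicator columns,
        # then drop the original column.
        dummies = pd.get_dummies(data[column], prefix=column)
        return pd.concat([data.drop(columns=[column]), dummies], axis=1)

    def clean_data(data):
        # Drop the non-value-adding identifier column.
        data = data.drop(columns=["UID"], errors="ignore")
        # Replace Gender with a binary isFemale indicator.
        data["isFemale"] = (data["Gender"] == "Female").astype(int)
        data = data.drop(columns=["Gender"])
        # One-hot encode the remaining categorical (object-typed) columns.
        for col in data.select_dtypes(include="object").columns:
            data = ONE_HOT_ENCODING(data, col)
        return data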
4. Prediction: I created a common function evaluate_training_algorithm(), which accepts the data and n_folds
and generates the folds for cross-validation using a function called cross_validation(), which splits the data
with StratifiedKFold and returns the folds. For each fold it calls Classification_Algo_Training(), which
accepts the algorithm, the algorithm name, and the data set. The training data is passed through SMOTENC, i.e.,
Synthetic Minority Over-sampling Technique for Nominal and Continuous, from the imblearn.over_sampling
module; it is applied to the mixed numerical and categorical features using the categorical-column indices
stored during preprocessing. The prediction step returns the predictions and the F1 score. The classifiers used
are LogisticRegression, GaussianNB, LinearSVC, RandomForestClassifier, KNeighborsClassifier,
AdaBoostClassifier, and GradientBoostingClassifier.
I experimented with various hyperparameters such as max_depth, max_features, n_neighbors, random_state,
and bootstrap in the above algorithms, applied separately according to each classifier's own parameters rather
than all together. A sketch of the evaluation loop follows.
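This is my reconstruction of the evaluation loop, assuming the feature matrix X and labels y are NumPy arrays and cat_idx holds the categorical-column indices saved during preprocessing; the function names are the report's, the bodies are assumptions:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import f1_score
    from imblearn.over_sampling import SMOTENC

    def Classification_Algo_Training(algorithm, name, X_tr, y_tr, X_te, y_te, cat_idx):
        # Oversample the minority class on the training fold only.
        smote = SMOTENC(categorical_features=cat_idx, random_state=42)
        X_res, y_res = smote.fit_resample(X_tr, y_tr)
        model = algorithm.fit(X_res, y_res)
        preds = model.predict(X_te)
        return preds, f1_score(y_te, preds)

    def evaluate_training_algorithm(algorithm, name, X, y, cat_idx, n_folds=5):
        # Stratified folds keep the label proportions equal in every fold.
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
        scores = []
        for train_idx, test_idx in skf.split(X, y):
            _, score = Classification_Algo_Training(
                algorithm, name, X[train_idx], y[train_idx],
                X[test_idx], y[test_idx], cat_idx)
            scores.append(score)
        return np.mean(scores)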
5. Finding accuracy: The performance of each prediction is measured with the F1 score, i.e., the harmonic mean
of precision and recall, F1 = 2 * (Precision * Recall) / (Precision + Recall), computed with f1_score from the
sklearn.metrics module.
6. Real Test and Prediction: Read the test data file and apply Steps 1, 2, and 4. Then use saveOutput() to
store the output in a txt file and upload it on the Miner portal to get the results. A sketch of this final step
is shown below.
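A short sketch of this final step, reusing the hypothetical clean_data() from above; saveOutput() is the report's helper, but its body, the file names, and best_model (the classifier trained as in Step 4) are my assumptions:

    def saveOutput(predictions, path="predictions.txt"):
        # Write one prediction per line for upload to the Miner portal.
        with open(path, "w") as f:
            f.write("\n".join(str(p) for p in predictions))

    test = clean_data(pd.read_csv("test.csv"))  # Steps 1 and 2 on the test file
    preds = best_model.predict(test.values)     # best_model trained as in Step 4
    saveOutput(preds)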
*The library modules that were used were wrapped for modularity and reusability.
b. Preprocessing of Data: Removed the empty columns and rows from the training data. Also dropped UID,
as it carried no helpful information. I was not able to find any strong correlations between the features.
c. One Hot Encoding: Using pandas.get_dummies(), generated Boolean indicator columns and removed
the original categorical columns. I also removed the Gender column and replaced it with IsFemale, i.e.,
1 if the person is female, otherwise 0.
d. Normalization of the Continuous data: I used the RobustScaler algorithm, as it scales the data in a way
that is robust to outliers: it removes the median and scales the data according to the interquartile range.
I tried other normalization techniques, but they did not improve the F1 score. A minimal sketch follows
this item.
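A minimal sketch of this scaling step, assuming cont_cols is a hypothetical list of the continuous column names:

    from sklearn.preprocessing import RobustScaler

    # RobustScaler removes the median and scales by the interquartile range,
    # so extreme outliers have little influence on the scaling.
    scaler = RobustScaler()
    data[cont_cols] = scaler.fit_transform(data[cont_cols])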
2. Did you exclude any specific features?
Yes, I dropped UID, as it was unique for every row and therefore carried no predictive information.
3. Was there a certain way you dealt with imbalance in the class distributions?
a. Cross Validation: I decided to use StratifiedKFold to split the data into 5 folds for better validation,
creating the train/test sets with a modular function. StratifiedKFold ensures that each fold has the same
proportion of observations for each label, keeping the folds balanced.
b. Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTENC): from the
imbalanced-learn library; it creates synthetic minority-class samples for data sets that contain both
categorical and quantitative features. SMOTENC is based on nearest neighbors in feature space, and it
changes how a new sample is generated for the categorical features: the continuous features of a
synthetic sample are interpolated between a minority sample and one of its nearest neighbors, while each
categorical feature is set to the most frequent category among those neighbors. On adding it, the score
increased from 0.66 to 0.68. A toy illustration follows.
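A toy illustration of this behavior (the values are made up; column 1 is treated as categorical):

    import numpy as np
    from imblearn.over_sampling import SMOTENC

    X = np.array([[1.2, 0], [0.8, 1], [1.1, 0],   # majority class
                  [3.5, 1], [3.7, 1]])            # minority class
    y = np.array([0, 0, 0, 1, 1])
    sm = SMOTENC(categorical_features=[1], k_neighbors=1, random_state=42)
    X_res, y_res = sm.fit_resample(X, y)
    # Column 0 of a synthetic sample is interpolated between neighbors;
    # column 1 is copied from the most frequent neighboring category.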
4. How did you perform model selection and which classifier stood out? Any theoretical reasoning
why?
I performed model selection based on the average F1 score of each classifier across the experiments performed
with the different classification algorithms. The bar graph below shows the average F1 score of each classifier
I tried (a sketch of the comparison loop follows the list). The data for them are as follows:
i. LogisticRegression: 0.6566836305022689
ii. GaussianNB: 0.6432987595066393
iii. LinearSVC: 0.6427111921121552
iv. RandomForestClassifier: 0.6459786591009992
v. KNeighborsClassifier: 0.6713221272228602
vi. AdaBoostClassifier: 0.6826596941113323
vii. GradientBoostingClassifier: 0.6944315486926854 (best, and it stayed consistently high across all
cross-validation iterations)
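A sketch of the comparison loop that could have produced these numbers, reusing the hypothetical evaluate_training_algorithm() from above; the hyperparameters shown are illustrative, not the report's tuned values:

    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                                  GradientBoostingClassifier)
    from sklearn.neighbors import KNeighborsClassifier

    classifiers = {
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "GaussianNB": GaussianNB(),
        "LinearSVC": LinearSVC(),
        "RandomForestClassifier": RandomForestClassifier(random_state=42),
        "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
        "AdaBoostClassifier": AdaBoostClassifier(random_state=42),
        "GradientBoostingClassifier": GradientBoostingClassifier(random_state=42),
    }
    for name, clf in classifiers.items():
        print(name, evaluate_training_algorithm(clf, name, X, y, cat_idx))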
F1 Score: As the class distribution is imbalanced, F1 is typically more useful than accuracy, so I used the
F1 score to compare the classifiers, i.e., the harmonic mean of precision and recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall). As a result, this score accounts for both false positives
and false negatives, although it is not as intuitive as accuracy. Below is the plot of the F1 score for the
various folds; the F1 score remained highest for the Gradient Boosting Classifier.
Gradient boosting is a greedy algorithm and is one of the arcing algorithms. Boosting refers to the general
problem of producing a very accurate prediction rule by combining rough and moderately inaccurate
rules of thumb. Arcing is an acronym for Adaptive Reweighting and Combining: each step in an arcing
algorithm consists of a weighted minimization followed by a recomputation of the classifiers and the
weighted input. The statistical framework casts boosting as a numerical optimization problem where the
objective is to minimize the loss of the model by adding weak learners using a gradient-descent-like
procedure. This class of algorithms was described as a stage-wise additive model, because one new weak
learner is added at a time while the existing weak learners in the model are frozen and left unchanged.
We generally use the gradient boosting algorithm when we want to decrease the bias error. Gradient
boosting is a powerful algorithm that can be used for predicting a categorical target variable (as a
classifier), in which case the cost function is log loss. I tried other classifier experiments, but none of them
came close to the AdaBoost or Gradient Boosting results. Logistic Regression, K-Neighbors (tried for
various values of K), and RandomForestClassifier followed closely but did not reach a 0.70 F1 score.
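A minimal sketch of how the winning classifier might be instantiated; the hyperparameters shown are scikit-learn's defaults, not the report's tuned values:

    from sklearn.ensemble import GradientBoostingClassifier

    # Stage-wise additive model: each new tree fits the gradient of the
    # log-loss, while the previously fitted trees stay frozen.
    gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     max_depth=3, random_state=42)
    gbc.fit(X_train, y_train)
    preds = gbc.predict(X_test)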
CONCLUSION
Looking at the results of the experiments across the various classifiers, it was decided to go with Gradient
Boosting, as it gave the best F1 score and stayed consistently high during cross-validation. Without SMOTENC
the score was 0.66; after adding SMOTENC, the F1 score increased to 0.70 on the training set and reached
0.68 on the Miner portal.
REFERENCES:
Documentation: https://scikit-learn.org/, https://www.nltk.org/, https://www.scipy.org,
https://matplotlib.org/, and https://imbalanced-learn.org/stable/over_sampling.html
Blogs: https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
and https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/