Data Mining Project

Sanatkumar
Charlottesville, USA
[email protected]

Goutham
Charlottesville, USA
[email protected]
PCA

PCA, short for Principal Component Analysis, is a statistical procedure used for dimensionality reduction, i.e., reducing the number of features. It transforms the input features so that the less important transformed features can be dropped while the valuable parts of the original features are retained.
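As a rough illustration of this transform-then-drop idea, the sketch below assumes scikit-learn is available; X is a stand-in feature matrix, and the sizes are arbitrary rather than taken from our data.

# A minimal PCA sketch, assuming scikit-learn; X is illustrative only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))        # 500 rows, 10 original features

pca = PCA(n_components=3)             # keep only the 3 strongest components
X_reduced = pca.fit_transform(X)      # project rows onto those components

print(X_reduced.shape)                # (500, 3): fewer features per row
print(pca.explained_variance_ratio_)  # share of variance each component keeps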
EXPERIMENTS

1. Preliminary Results

After data pre-processing, we performed classification using the algorithms K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naive Bayes, Decision Trees, Random Forest, XGBoost, and Logistic Regression. The table below shows the initial results of running these classification techniques on our data set.
Algorithm             Accuracy   Precision   Recall
Naive Bayes           77.47      36.5        4.65
Logistic Regression   78.17      40.48       0.49
KNN                   74.55      29.074      11.72
SVM                   78.0       28.9        12.3
XGBoost               78.22      53.1        7.3
Random Forest         78.22      50.13       1.12
Decision Tree         77.14      31.5        4.25
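The sketch below shows how such a comparison can be run, assuming scikit-learn; X_train, X_test, y_train, and y_test are placeholders for the pre-processed splits, and only a subset of the models is shown.

# Sketch of the model comparison, assuming scikit-learn; the split
# variables are placeholders produced by the pre-processing step.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    print(f"{name}: acc={accuracy_score(y_test, pred):.2%} "
          f"prec={precision_score(y_test, pred):.2%} "
          f"rec={recall_score(y_test, pred):.2%}")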
2. Improvements

From the preliminary results, we can see that the precision is very low, because one class of records (the minority class) is not being classified properly.

SMOTE

We have fixed the class imbalance on the training data only. This ensures that the generated synthetic data does not bleed into the testing data, so the results we obtain can be generalised.
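A minimal sketch of this train-only resampling, assuming imbalanced-learn and scikit-learn; X and y are placeholders for the pre-processed features and labels.

# Sketch of SMOTE applied to the training split only (imbalanced-learn).
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Synthetic minority samples are generated from X_train alone,
# so nothing synthetic can leak into the held-out test set.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)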
PCA

We have applied PCA on our data set before performing classification with SVM. This is because SVM takes a long time to train given the size of our dataset and its roughly 40 features, and PCA is very useful in such a case. Plotting the explained variance against the number of components shows that beyond 5 components there is little change in the variance percentage. We used the top 7 significant components returned by PCA to perform prediction, which decreases the time taken to train the model.
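A sketch of this variance check and of training the SVM on the reduced features, assuming scikit-learn; the split variables are again placeholders.

# Sketch of the explained-variance check and a PCA + SVM pipeline.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Cumulative explained variance; it flattens after roughly 5 components.
variance = PCA().fit(X_train).explained_variance_ratio_
print(np.cumsum(variance)[:10])

# Train the SVM on the top 7 components only, which shrinks training time.
clf = make_pipeline(StandardScaler(), PCA(n_components=7), SVC())
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))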
3. Current Results

The following are the results obtained after applying SMOTE [5] and PCA [6] and then performing prediction. First, we display the confusion matrix for each classification technique, and then present the accuracy, precision, and recall in tabular form.
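Each matrix below can be produced along these lines, assuming scikit-learn; 'model' stands for any classifier fitted on the resampled training data.

# Sketch of producing a confusion matrix for a fitted classifier.
from sklearn.metrics import confusion_matrix

pred = model.predict(X_test)
# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
print(confusion_matrix(y_test, pred))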
Naive Bayes
             Predicted 0   Predicted 1
Actual 0           17124          9162
Actual 1            4248          8405

Logistic Regression
             Predicted 0   Predicted 1
Actual 0           20238          6048
Actual 1            5162          7491

Random Forest
             Predicted 0   Predicted 1
Actual 0           25418           868
Actual 1            2779          9874

KNN
             Predicted 0   Predicted 1
Actual 0           19802          6484
Actual 1            4383          8270

Decision Tree
             Predicted 0   Predicted 1
Actual 0           25301           985
Actual 1             911         11742

SVM
             Predicted 0   Predicted 1
Actual 0           23098          3188
Actual 1            4966          7687

Gradient Boosting
             Predicted 0   Predicted 1
Actual 0           24653          1633
Actual 1            1688         10965

Neural Networks
             Predicted 0   Predicted 1
Actual 0           25628           658
Actual 1            1972         10681

Final Results Summarized

Summarizing the final results of all the algorithms run so far: Neural Networks gives the best prediction, with a high precision of 94.2, and the Decision Tree is comparable, with a high accuracy of 95.13.

CONCLUSION

We have significantly improved the accuracy and precision of predicting a loan defaulter. We found that Neural Networks give the best prediction, with a high precision of 94.2, and that the Decision Tree is comparable, with a high accuracy of 95.13. We have some ideas on how this can be further improved, which could be future work for this project: the accuracy can be further improved by creating an ensemble, additional data and feature engineering techniques can be applied, and the Neural Networks can be fine-tuned further while more models are explored.

References

[1] LT Vehicle Loan Default Prediction, 2019. https://www.kaggle.com/mamtadhaker/lt-vehicle-loan-default-prediction (data_dictionary.csv)
[2] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1967.
[3] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998.
[4] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.
[5] N. V. Chawla and K. W. Bowyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16(1), January 2002.
[6] H. Abdi and L. J. Williams. Principal component analysis. WIREs Computational Statistics, 2010.
[7] Aadhaar, https://uidai.gov.in/
[8] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794.
[9] L. Breiman. Random forests. Machine Learning, 45(1):5-32, October 2001.